Effective dataset curation requires combining multiple search and filter tools in a specific sequence. These recipes provide step-by-step workflows for common curation goals, helping you achieve high-quality results efficiently.

Recipe 1: Finding Diverse Examples of a Specific Pattern

Objective: Isolate a diverse set of examples for a specific visual pattern, ensuring you capture rare variations without filling the dataset with repetitive images. Example scenarios:

Manufacturing

Surface damage patterns for defect detection training

Medical Imaging

Specific pathology presentations across patient populations

Retail/Insurance

Product damage types for claims processing

Defense & Intelligence

Specific threat or anomaly patterns in surveillance data
Step 1: Broad Search

Start with Semantic Search.
  • Query: Describe the pattern in natural language (e.g., “surface damage,” “cracked glass,” “skin lesion”)
  • Result: This returns a broad set of candidates, likely including some irrelevant images (false positives).
Step 2: Visual Refinement

Switch to Visual Search.
  • Find a clear, high-quality example of the specific pattern you want in the search results.
  • Crop the image to isolate just the pattern (excluding background or irrelevant context).
  • Run the visual search to find visually similar patterns.
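Conceptually, a visual search ranks images by how close their embeddings are to the embedding of the cropped query. The sketch below is not the platform's implementation. It assumes each image has already been mapped to a numeric embedding vector, and the names `cosine_similarity`, `visual_search`, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def visual_search(query, catalog, top_k=3):
    """Return the top_k (image_id, score) pairs most similar to the query."""
    scored = [(image_id, cosine_similarity(query, vector))
              for image_id, vector in catalog.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical 3-dimensional embeddings; real ones are much larger.
catalog = {
    "scratch_01": [0.9, 0.1, 0.0],
    "scratch_02": [0.8, 0.2, 0.1],
    "dent_01": [0.1, 0.9, 0.2],
    "clean_01": [0.0, 0.1, 0.9],
}
crop_embedding = [1.0, 0.1, 0.0]  # embedding of the cropped pattern
results = visual_search(crop_embedding, catalog, top_k=2)
```

Cropping matters in this picture because the query embedding then describes only the pattern, not the background around it.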
Step 3: Ensure Diversity

Apply the Uniques Filter.
  • Set the Uniqueness Threshold to High.
  • Why: This hides repetitive examples of the same common pattern, surfacing visually distinct variations and edge cases.
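One way to picture a uniqueness filter is as a greedy selection over embeddings: keep an image only if it is far enough from everything already kept. This is a minimal sketch under that assumption, not the platform's algorithm. The function `select_uniques`, the `min_distance` parameter, and the mapping of a "High" threshold to a larger distance are all hypothetical.

```python
import math


def select_uniques(embeddings, min_distance):
    """Greedily keep items at least min_distance from every item already kept."""
    kept = []
    for image_id, vector in embeddings.items():
        if all(math.dist(vector, embeddings[k]) >= min_distance for k in kept):
            kept.append(image_id)
    return kept


# Hypothetical 2-dimensional embeddings of scratch images.
embeddings = {
    "scratch_long": [1.0, 0.0],
    "scratch_long_copy": [0.99, 0.05],  # nearly identical to scratch_long
    "scratch_diagonal": [0.7, 0.7],
    "scratch_cross": [0.0, 1.0],
}

# A stricter (higher) threshold keeps fewer, more distinct examples.
strict = select_uniques(embeddings, min_distance=1.0)
loose = select_uniques(embeddings, min_distance=0.5)
```

Raising the threshold trades volume for variety, which is why a high setting suits the search for rare variations.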
Step 4: Final Polish

Clean up the selection.
  • Use Duplicate Detection to remove near-identical frames.
  • Save as View with a descriptive name (e.g., “Distinctive Scratch Patterns,” “Rare Lesion Variants”) for your labeling team.

Recipe 2: Cleaning Raw Data for Labeling

Objective: Rapidly prepare a messy, raw dataset for labeling by removing low-quality data that would waste annotator time and budget. Example scenarios:

Manufacturing

Raw production line footage with lighting issues and redundant frames

Medical Imaging

Scans from multiple sources with varying quality standards

Retail

User-generated product photos with technical failures

Defense & Intelligence

Surveillance footage with motion blur and poor lighting

Research

Web-scraped images with inconsistent quality
Step 1: Remove Technical Failures

Filter by Quality Issues.
  • Set Blurry to IS NOT.
  • Set Dark to IS NOT.
  • Result: Removes unreadable or low-information images immediately.
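To make "dark" and "blurry" concrete, the sketch below uses two common heuristics: mean brightness for darkness and the variance of a Laplacian response for blur. The platform's actual detectors are not documented here, so treat the function names and the thresholds `dark_below` and `blur_below` as illustrative assumptions.

```python
from statistics import mean, pvariance


def mean_brightness(gray):
    """Average pixel value of a grayscale image given as a list of rows (0-255)."""
    return mean(pixel for row in gray for pixel in row)


def laplacian_variance(gray):
    """Variance of a 4-neighbor Laplacian; low values suggest little sharp detail.

    Expects an image of at least 3x3 pixels.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)


def quality_issues(gray, dark_below=40, blur_below=100):
    """Return the set of issues detected under the assumed thresholds."""
    issues = set()
    if mean_brightness(gray) < dark_below:
        issues.add("dark")
    if laplacian_variance(gray) < blur_below:
        issues.add("blurry")
    return issues


# Tiny synthetic images: a sharp checkerboard and two featureless frames.
sharp = [[255 if (x + y) % 2 == 0 else 0 for x in range(5)] for y in range(5)]
flat_dark = [[10] * 5 for _ in range(5)]
flat_gray = [[128] * 5 for _ in range(5)]
```

Setting Blurry and Dark to IS NOT then corresponds to keeping only the images for which `quality_issues` returns an empty set.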
Step 2: Remove Annotation Errors

Filter by Mislabels.
  • Set Mislabels to IS NOT.
  • Result: Excludes images where existing metadata likely conflicts with visual content, preventing bad ground truth from entering the pipeline.
Step 3: Reduce Redundancy

Apply Select Uniques.
  • Set threshold to Medium.
  • Result: If the ingest contains burst-mode photos or video sequences, this keeps only representative frames, significantly reducing the total count sent to labeling.
Step 4: Export for Labeling

  • Select all remaining items.
  • Export the cleaned list to JSON/CSV to hand off to your annotation workforce.
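If you post-process an export yourself, the IS NOT filters and the hand-off files can be sketched with the Python standard library. The record fields `blurry`, `dark`, and `mislabel` are hypothetical stand-ins for whatever your export contains, and the helper `to_csv` is illustrative only.

```python
import csv
import io
import json


def to_csv(records, fieldnames):
    """Serialize records to CSV text, ignoring fields not listed in fieldnames."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical export rows with one flag per issue type.
records = [
    {"image_id": "img_001", "blurry": False, "dark": False, "mislabel": False},
    {"image_id": "img_002", "blurry": True, "dark": False, "mislabel": False},
    {"image_id": "img_003", "blurry": False, "dark": True, "mislabel": False},
    {"image_id": "img_004", "blurry": False, "dark": False, "mislabel": True},
    {"image_id": "img_005", "blurry": False, "dark": False, "mislabel": False},
]

# Keep only rows where every issue "IS NOT" present.
cleaned = [r for r in records if not (r["blurry"] or r["dark"] or r["mislabel"])]

csv_text = to_csv(cleaned, ["image_id"])
json_text = json.dumps([r["image_id"] for r in cleaned], indent=2)
```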

Recipe 3: Balancing Common Scenarios with Rare Edge Cases

Objective: Curate a dataset that captures both typical scenarios and rare edge cases while managing storage volumes efficiently. Example scenarios:

Autonomous Vehicles

Common driving conditions vs. rare weather, road signs, or obstacles

Manufacturing

Standard production vs. unusual failure modes or material variations

Medical Imaging

Common presentations vs. rare complications or co-morbidities

Defense & Intelligence

Normal activity patterns vs. anomalous events requiring investigation

Retail

Standard product views vs. unusual angles or lighting conditions
Step 1: Reduce Storage Costs

Apply Duplicate Detection.
  • Action: Review duplicate clusters from video sequences or burst captures.
  • Select: Keep one representative frame per scenario.
  • Result: Often reduces dataset size by 30-40% without losing scenario coverage.
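Duplicate clusters can be pictured as groups of frames whose perceptual hashes differ in only a few bits. The following sketch groups such frames with a small union-find and keeps the first frame of each group. It assumes 8-bit hashes for readability, and `hamming`, `duplicate_clusters`, and `max_distance` are hypothetical names rather than platform features.

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def duplicate_clusters(hashes, max_distance=2):
    """Group ids whose hashes are within max_distance bits, transitively."""
    parent = {image_id: image_id for image_id in hashes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    ids = list(hashes)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if hamming(hashes[a], hashes[b]) <= max_distance:
                parent[find(b)] = find(a)

    clusters = {}
    for image_id in ids:
        clusters.setdefault(find(image_id), []).append(image_id)
    return list(clusters.values())


# Hypothetical hashes: three consecutive video frames and one unrelated image.
hashes = {
    "frame_001": 0b10110010,
    "frame_002": 0b10110011,
    "frame_003": 0b10110111,
    "night_001": 0b01001100,
}
clusters = duplicate_clusters(hashes)
representatives = [cluster[0] for cluster in clusters]
```

Keeping one representative per cluster is what preserves scenario coverage while the frame count drops.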
Step 2: Surface Rare Cases

Apply the Outliers Filter.
  • Action: Sort so the highest-confidence outliers appear first.
  • Result: Surfaces rare variations that are critical for model robustness but easy to miss in manual review.
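A simple way to think about outlier confidence is distance from the rest of the data: an image whose nearest neighbors are far away is unusual. The sketch below scores each image by its mean distance to its `k` nearest neighbors. This is one common heuristic, not necessarily how the Outliers Filter works, and `outlier_scores` and the toy embeddings are hypothetical.

```python
import math


def outlier_scores(embeddings, k=2):
    """Score each item by its mean distance to its k nearest neighbors."""
    if len(embeddings) < 2:
        raise ValueError("need at least two items to score outliers")
    scores = {}
    for image_id, vector in embeddings.items():
        distances = sorted(
            math.dist(vector, other)
            for other_id, other in embeddings.items()
            if other_id != image_id
        )
        nearest = distances[:k]
        scores[image_id] = sum(nearest) / len(nearest)
    return scores


# Hypothetical embeddings: four similar daytime scenes and one snow scene.
embeddings = {
    "day_01": [0.0, 0.0],
    "day_02": [0.1, 0.0],
    "day_03": [0.0, 0.1],
    "day_04": [0.1, 0.1],
    "snow_01": [3.0, 3.0],
}
scores = outlier_scores(embeddings)
ranked = sorted(scores, key=scores.get, reverse=True)  # most unusual first
```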
Step 3: Categorize Challenging Conditions

Filter by Quality Issues.
  • Filter: Dark OR Bright.
  • Action: Instead of deleting these, tag them with a descriptive name (e.g., “Challenging Lighting,” “Low Visibility”).
  • Result: Creates specific subsets for testing model performance in adverse conditions.
Step 4: Validate Coverage

Use Cluster View to verify distribution.
  • Review cluster sizes to ensure no single scenario dominates the dataset.
  • Use Select Uniques within overrepresented clusters to balance the distribution.
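If you export cluster assignments, the dominance check can be made explicit by computing each cluster's share of the dataset. The sketch below assumes a mapping from image id to cluster name. The `max_share` cutoff of 50% is an arbitrary illustration, not a recommended value.

```python
from collections import Counter


def cluster_shares(assignments):
    """Fraction of the dataset that falls into each cluster."""
    counts = Counter(assignments.values())
    total = sum(counts.values())
    return {cluster: count / total for cluster, count in counts.items()}


def overrepresented(assignments, max_share=0.5):
    """Clusters whose share of the dataset exceeds max_share."""
    shares = cluster_shares(assignments)
    return sorted(cluster for cluster, share in shares.items() if share > max_share)


# Hypothetical assignments for ten driving images.
assignments = {f"img_{i:03d}": "highway" for i in range(6)}
assignments.update({"img_006": "rain", "img_007": "rain",
                    "img_008": "snow", "img_009": "construction"})

shares = cluster_shares(assignments)
dominant = overrepresented(assignments)
```

Clusters returned by `overrepresented` are the ones to thin out with a uniqueness filter.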

Recipe 4: Managing Large Visual Catalogs

Objective: Consolidate duplicate assets, enforce quality standards, and organize large collections of visual content. Example scenarios:

E-commerce

Multi-vendor product catalogs with duplicate stock photos

Real Estate

Property listings with redundant images from different agents

Manufacturing

Parts catalogs with multiple photos of the same component

Media/Creative

Stock photo libraries with similar compositions

Digital Asset Management

Corporate image libraries across departments
Step 1: Consolidate Duplicate Assets

Apply Duplicate Detection.
  • Scenario: Multiple sources upload the same or nearly identical images.
  • Action: Identify duplicate groups and link them to a single master asset.
  • Result: Prevents search results from being flooded with identical or near-identical images.
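Linking duplicates to a master asset amounts to choosing one member of each group and pointing the others at it. The sketch below picks the highest-resolution image as the master, which is an assumption for illustration since the choice of master is up to you. The `assets` fields and the `link_to_master` helper are hypothetical.

```python
def link_to_master(groups, assets):
    """Map every asset in a duplicate group to that group's master asset."""
    links = {}
    for group in groups:
        # Assumed rule: the master is the asset with the most pixels.
        master = max(
            group,
            key=lambda asset_id: assets[asset_id]["width"] * assets[asset_id]["height"],
        )
        for asset_id in group:
            links[asset_id] = master
    return links


# Hypothetical catalog: three vendors uploaded the same shoe photo.
assets = {
    "vendor_a/shoe.jpg": {"width": 800, "height": 600},
    "vendor_b/shoe.jpg": {"width": 1600, "height": 1200},
    "vendor_c/shoe.jpg": {"width": 640, "height": 480},
    "vendor_a/lamp.jpg": {"width": 1024, "height": 768},
}
groups = [
    ["vendor_a/shoe.jpg", "vendor_b/shoe.jpg", "vendor_c/shoe.jpg"],
    ["vendor_a/lamp.jpg"],
]
links = link_to_master(groups, assets)
masters = sorted(set(links.values()))
```

Search results can then be limited to the ids in `masters`, while `links` preserves the trail back to every vendor's copy.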
Step 2: Enforce Quality Standards

Filter by Quality Issues.
  • Filter: Blurry OR Dark OR Bright.
  • Action: Flag these images for review, replacement, or auto-rejection.
  • Result: Ensures only professional-quality images remain in the catalog.
Step 3: Organize Unlabeled Content

Filter by Labels.
  • Filter: Labels IS Unlabeled.
  • Action: Isolate unlabeled content and use Semantic Search to bulk-select and categorize items (e.g., “red sneakers,” “two-bedroom apartments,” “hydraulic fittings”).
Step 4: Create Curated Collections

  • Use Save as View to create themed collections (e.g., “Hero Images,” “Seasonal Products,” “Premium Listings”).
  • Share views with relevant teams to ensure everyone works from the same quality-controlled subset.

Recipe 5: Identifying Annotation Inconsistencies

Objective: Find and fix labeling errors or inconsistencies across your dataset to improve model training quality. Example scenarios:

Manufacturing

Mixed defect categories or mislabeled quality grades

Medical Imaging

Inconsistent diagnostic labels across radiologists

Retail

Product category errors or attribute mismatches

Defense & Intelligence

Misclassified threat levels or event types

Autonomous Vehicles

Inconsistent object classifications across annotators
Step 1: Find Visual-Label Mismatches

Apply the Mislabels Filter.
  • Action: Sort so the highest-confidence mislabels appear first.
  • Result: Surfaces images where the visual content doesn’t align with the assigned label.
Step 2: Review Class Outliers

Apply the Outliers Filter and filter by specific labels.
  • Action: Review images flagged as outliers within their assigned class.
  • Result: Finds images that are correctly labeled but visually atypical for their category (e.g., drawings in a photo dataset).
Step 3: Validate with Visual Search

Select a flagged image and run Visual Search.
  • Action: See what other images visually match this item.
  • Result: If most or all of the visually similar images carry a different label, the flagged image is likely mislabeled.
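This check is essentially a nearest-neighbor vote: look at the labels of the most similar images and see whether they agree with the flagged one. The following sketch assumes you have embeddings and labels available offline. The function `neighbor_label_vote` and the sample data are hypothetical.

```python
import math
from collections import Counter


def neighbor_label_vote(target_id, embeddings, labels, k=3):
    """Return the majority label among the k nearest neighbors and its vote share."""
    target = embeddings[target_id]
    neighbors = sorted(
        (other_id for other_id in embeddings if other_id != target_id),
        key=lambda other_id: math.dist(target, embeddings[other_id]),
    )[:k]
    votes = Counter(labels[n] for n in neighbors)
    majority_label, count = votes.most_common(1)[0]
    return majority_label, count / len(neighbors)


# Hypothetical data: img_07 is labeled "scratch" but sits among dent images.
embeddings = {
    "img_07": [0.9, 0.9],
    "dent_01": [1.0, 1.0],
    "dent_02": [0.8, 1.0],
    "dent_03": [1.0, 0.8],
    "scratch_01": [0.0, 0.1],
    "scratch_02": [0.1, 0.0],
}
labels = {
    "img_07": "scratch",
    "dent_01": "dent", "dent_02": "dent", "dent_03": "dent",
    "scratch_01": "scratch", "scratch_02": "scratch",
}
majority, share = neighbor_label_vote("img_07", embeddings, labels)
likely_mislabel = majority != labels["img_07"] and share >= 0.66
```

The 0.66 agreement cutoff is an arbitrary example; a unanimous vote is stronger evidence than a narrow majority.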
Step 4: Bulk Correction

  • Tag all confirmed errors with “Needs Relabeling.”
  • Export this view to CSV/JSON for your annotation team to correct.
  • Track corrections by saving views before and after relabeling.

Recipe 6: Creating Balanced Training Sets

Objective: Build a training dataset with appropriate class distribution and representation across important variations. Example scenarios:

Manufacturing

Equal representation of defect types and severity levels

Medical Imaging

Balanced demographics and presentation variations

Retail

Proportional product categories and seasonal coverage

Defense & Intelligence

Representative samples of normal and anomalous events

Autonomous Vehicles

Balanced weather, lighting, and scenario types
Step 1: Assess Current Distribution

Use Cluster View and group by labels.
  • Action: Review the distribution of images across classes.
  • Result: Identify overrepresented and underrepresented categories.
Step 2: Reduce Overrepresented Classes

For dominant classes, apply Select Uniques.
  • Set threshold to High to keep only the most distinctive examples.
  • Result: Reduces redundancy while preserving diversity within that class.
Step 3: Augment Underrepresented Classes

For rare classes, use Semantic Search to find more examples.
  • Query: Describe the underrepresented category in detail.
  • Review results and tag valid examples to expand that class.
Step 4: Validate Diversity

Within each class, check cluster distribution.
  • Use Visual Search from different cluster centers to ensure visual variety.
  • Apply Select Uniques to prevent any single visual pattern from dominating.
Step 5: Export Balanced Set

  • Save the final balanced distribution as a view.
  • Export with stratified sampling to maintain proportions in train/validation splits.
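Stratified sampling means splitting each class separately so the validation set mirrors the class proportions of the whole. If your export does not do this for you, a minimal standard-library sketch looks like the following. The helper `stratified_split`, its parameters, and the sample labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split image ids into train and validation lists, class by class."""
    by_class = defaultdict(list)
    for image_id, label in labels.items():
        by_class[label].append(image_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)
        # Hold out at least one item per class when the class has more than one.
        n_val = max(1, round(len(ids) * val_fraction)) if len(ids) > 1 else 0
        val.extend(ids[:n_val])
        train.extend(ids[n_val:])
    return train, val


# Hypothetical balanced set: 10 scratch images and 5 dent images.
labels = {f"scratch_{i:02d}": "scratch" for i in range(10)}
labels.update({f"dent_{i:02d}": "dent" for i in range(5)})

train_ids, val_ids = stratified_split(labels, val_fraction=0.2)
val_counts = {
    "scratch": sum(1 for i in val_ids if labels[i] == "scratch"),
    "dent": sum(1 for i in val_ids if labels[i] == "dent"),
}
```

Both splits keep the same 2:1 ratio of scratch to dent images that the full set has.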

Additional Tips for Recipe Success

Combine Filters Strategically

Most recipes work best when you apply filters in a specific order:
  1. Start broad with semantic or visual search to establish scope.
  2. Remove obvious problems with quality filters early.
  3. Refine for diversity with uniqueness and outlier filters.
  4. Final polish with duplicate detection and targeted tagging.

Save Intermediate Steps

Save views at each major step in your recipe:
  • Enables you to backtrack if a filter removes too much.
  • Creates an audit trail for dataset curation decisions.
  • Allows different team members to review at different stages.
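The same habit carries over to any scripted post-processing: snapshot the working set after every step so you can see what each filter removed and return to an earlier state. The sketch below is illustrative only. `run_recipe`, the two step functions, and the item fields are hypothetical and do not correspond to platform features.

```python
def run_recipe(items, steps):
    """Apply steps in order and record a named snapshot after each one."""
    snapshots = [("start", list(items))]
    current = list(items)
    for name, step in steps:
        current = step(current)
        snapshots.append((name, list(current)))
    return current, snapshots


def drop_quality_issues(items):
    """Keep items that have no recorded quality issues."""
    return [item for item in items if not item["issues"]]


def drop_exact_duplicates(items):
    """Keep the first item seen for each checksum."""
    seen, kept = set(), []
    for item in items:
        if item["checksum"] not in seen:
            seen.add(item["checksum"])
            kept.append(item)
    return kept


# Hypothetical working set.
items = [
    {"id": "img_001", "issues": [], "checksum": "aa"},
    {"id": "img_002", "issues": ["blurry"], "checksum": "bb"},
    {"id": "img_003", "issues": [], "checksum": "aa"},
    {"id": "img_004", "issues": [], "checksum": "cc"},
]
final, snapshots = run_recipe(
    items,
    [("quality", drop_quality_issues), ("dedupe", drop_exact_duplicates)],
)
sizes = [(name, len(snapshot)) for name, snapshot in snapshots]
```

If the "dedupe" step turns out to be too aggressive, the "quality" snapshot is the point to return to.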

Iterate and Adjust

These recipes are starting points, not rigid procedures:
  • Adjust thresholds based on your dataset characteristics.
  • Add custom metadata filters for domain-specific criteria.
  • Combine multiple recipes for complex curation workflows.