Recipe 1: Finding Diverse Examples of a Specific Pattern
Objective: Isolate a diverse set of examples for a specific visual pattern, ensuring you capture rare variations without filling the dataset with repetitive images.
Example scenarios:
- Manufacturing: Surface damage patterns for defect detection training
- Medical Imaging: Specific pathology presentations across patient populations
- Retail/Insurance: Product damage types for claims processing
- Defense & Intelligence: Specific threat or anomaly patterns in surveillance data
Broad Search
Start with Semantic Search.
- Query: Describe the pattern in natural language (e.g., “surface damage,” “cracked glass,” “skin lesion”)
- Result: This returns a broad set of candidates, likely including some irrelevant images (false positives).
Visual Refinement
Switch to Visual Search.
- Find a clear, high-quality example of the specific pattern you want in the search results.
- Crop the image to isolate just the pattern (excluding background or irrelevant context).
- Run the visual search to find visually similar patterns.
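Conceptually, a visual search ranks images by how close their embeddings are to the embedding of your cropped query. The sketch below illustrates that idea only; it is not the product's implementation. The `visual_search` function, the three-dimensional vectors, and the image IDs are all hypothetical, and a real system would use embeddings produced by a vision model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def visual_search(query_embedding, catalog, top_k=3):
    """Return the top_k (image_id, score) pairs most similar to the query."""
    scored = [
        (image_id, cosine_similarity(query_embedding, embedding))
        for image_id, embedding in catalog.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical embeddings; real ones would come from a vision model.
catalog = {
    "crack_01": [0.9, 0.1, 0.0],
    "crack_02": [0.8, 0.2, 0.1],
    "scratch_01": [0.1, 0.9, 0.0],
    "clean_01": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # embedding of the cropped example (assumed)
results = visual_search(query, catalog, top_k=2)
print([image_id for image_id, _ in results])
```

Cropping matters for the same reason: the tighter the crop, the more the query embedding reflects the pattern itself rather than its background.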
Ensure Diversity
Apply the Uniques Filter.
- Set the Uniqueness Threshold to High.
- Why: This hides repetitive examples of the same common pattern, surfacing visually distinct variations and edge cases.
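One way to picture a uniqueness filter is as a greedy pass that keeps an image only when it is not too similar to anything already kept. The sketch below assumes that approach; `select_uniques`, the `similarity_threshold` parameter, and the sample vectors are hypothetical, and how a High or Medium setting maps onto a numeric cutoff is not specified here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_uniques(embeddings, similarity_threshold):
    """Greedy selection: keep an image only if its similarity to every
    already-kept image is below similarity_threshold."""
    kept = []
    for image_id, embedding in embeddings.items():
        if all(
            cosine_similarity(embedding, embeddings[k]) < similarity_threshold
            for k in kept
        ):
            kept.append(image_id)
    return kept

# Hypothetical 2-D embeddings: "b" is a near copy of "a".
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.99, 0.05],
    "c": [0.0, 1.0],
    "d": [0.7, 0.7],
}
loose = select_uniques(embeddings, similarity_threshold=0.9)
strict = select_uniques(embeddings, similarity_threshold=0.5)
print(loose, strict)
```

In this sketch a lower cutoff hides more images, so a stricter setting trades volume for variety.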
Recipe 2: Cleaning Raw Data for Labeling
Objective: Rapidly prepare a messy, raw dataset for labeling by removing low-quality data that would waste annotator time and budget.
Example scenarios:
- Manufacturing: Raw production line footage with lighting issues and redundant frames
- Medical Imaging: Scans from multiple sources with varying quality standards
- Retail: User-generated product photos with technical failures
- Defense & Intelligence: Surveillance footage with motion blur and poor lighting
- Research: Web-scraped images with inconsistent quality
Remove Technical Failures
Filter by Quality Issues.
- Set Blurry to IS NOT.
- Set Dark to IS NOT.
- Result: Removes unreadable or low-information images immediately.
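To build intuition for what Blurry and Dark flags can capture, the sketch below uses two common heuristics on a grayscale pixel grid: mean brightness for darkness and the variance of a Laplacian response for blur. This is an illustration under stated assumptions, not how the filter is actually computed; the function names and the `dark_below` and `blur_below` cutoffs are hypothetical.

```python
def mean_brightness(gray):
    """Average pixel value of a 2-D grayscale grid (values 0-255)."""
    pixels = [p for row in gray for p in row]
    return sum(pixels) / len(pixels)

def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.
    Low values mean little edge detail, a common sign of blur."""
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

def quality_flags(gray, dark_below=40, blur_below=100.0):
    """Flag an image as dark and/or blurry. Cutoffs are assumed values."""
    return {
        "dark": mean_brightness(gray) < dark_below,
        "blurry": laplacian_variance(gray) < blur_below,
    }

# A sharp checkerboard and a flat, dim frame (both synthetic, 4x4).
sharp = [[255 if (x + y) % 2 == 0 else 0 for x in range(4)] for y in range(4)]
dim_flat = [[10] * 4 for _ in range(4)]

sharp_flags = quality_flags(sharp)
dim_flags = quality_flags(dim_flat)
print(sharp_flags, dim_flags)
```

Keeping only images where both flags are false mirrors setting Blurry and Dark to IS NOT.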
Remove Annotation Errors
Filter by Mislabels.
- Set Mislabels to IS NOT.
- Result: Excludes images where existing metadata likely conflicts with visual content, preventing bad ground truth from entering the pipeline.
Reduce Redundancy
Apply Select Uniques.
- Set threshold to Medium.
- Result: If the ingest contains burst-mode photos or video sequences, this keeps only representative frames, significantly reducing the total count sent to labeling.
Recipe 3: Balancing Common Scenarios with Rare Edge Cases
Objective: Curate a dataset that captures both typical scenarios and rare edge cases while managing storage volumes efficiently.
Example scenarios:
- Autonomous Vehicles: Common driving conditions vs. rare weather, road signs, or obstacles
- Manufacturing: Standard production vs. unusual failure modes or material variations
- Medical Imaging: Common presentations vs. rare complications or co-morbidities
- Defense & Intelligence: Normal activity patterns vs. anomalous events requiring investigation
- Retail: Standard product views vs. unusual angles or lighting conditions
Reduce Storage Costs
Apply Duplicate Detection.
- Action: Review duplicate clusters from video sequences or burst captures.
- Select: Keep one representative frame per scenario.
- Result: Often reduces dataset size by 30-40% without losing scenario coverage.
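The idea of keeping one representative per duplicate cluster can be sketched as follows: link any two frames whose embeddings are nearly identical, group the linked frames, and keep the first frame of each group. The `duplicate_clusters` function, the 0.98 cutoff, and the frame vectors are hypothetical, and the toy reduction below says nothing about the savings you should expect on real data.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def duplicate_clusters(embeddings, threshold=0.98):
    """Group images whose pairwise similarity meets the threshold,
    using a small union-find so that chains of near-duplicates merge."""
    parent = {image_id: image_id for image_id in embeddings}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in combinations(embeddings, 2):
        if cosine_similarity(embeddings[a], embeddings[b]) >= threshold:
            parent[find(b)] = find(a)

    clusters = {}
    for image_id in embeddings:
        clusters.setdefault(find(image_id), []).append(image_id)
    return list(clusters.values())

# Hypothetical frames: three consecutive video frames plus one distinct scene.
frames = {
    "frame_001": [1.0, 0.0],
    "frame_002": [0.999, 0.01],
    "frame_003": [0.998, 0.02],
    "frame_250": [0.0, 1.0],
}
clusters = duplicate_clusters(frames)
representatives = [cluster[0] for cluster in clusters]
print(clusters, representatives)
```

Picking the first frame is the simplest choice; in practice you might prefer the sharpest or best-exposed frame in each cluster.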
Surface Rare Cases
Apply the Outliers Filter.
- Action: Sort outliers by confidence, highest first.
- Result: Surfaces rare variations that are critical for model robustness but easy to miss in manual review.
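A simple way to think about outlier ranking is distance to the nearest neighbors: an image that sits far from everything else in embedding space scores high. The sketch below assumes that scoring rule, which may differ from what the Outliers Filter does; `outlier_scores`, `k`, and the sample points are hypothetical.

```python
import math

def outlier_scores(embeddings, k=2):
    """Score each image by its mean distance to its k nearest neighbours.
    Higher scores indicate more isolated, and so more unusual, images."""
    scores = {}
    for image_id, embedding in embeddings.items():
        distances = sorted(
            math.dist(embedding, other)
            for other_id, other in embeddings.items()
            if other_id != image_id
        )
        nearest = distances[:k]
        scores[image_id] = sum(nearest) / len(nearest)
    return scores

# Hypothetical embeddings: four similar daytime scenes and one snow scene.
embeddings = {
    "day_01": [0.0, 0.0],
    "day_02": [0.1, 0.0],
    "day_03": [0.0, 0.1],
    "day_04": [0.1, 0.1],
    "snow_01": [5.0, 5.0],
}
scores = outlier_scores(embeddings, k=2)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])
```

Sorting by this score, highest first, puts the rare scene at the top of the review queue.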
Categorize Challenging Conditions
Filter by Quality Issues.
- Filter: Dark and Bright.
- Action: Instead of deleting these, tag them with a descriptive name (e.g., “Challenging Lighting,” “Low Visibility”).
- Result: Creates specific subsets for testing model performance in adverse conditions.
Recipe 4: Managing Large Visual Catalogs
Objective: Consolidate duplicate assets, enforce quality standards, and organize large collections of visual content.
Example scenarios:
- E-commerce: Multi-vendor product catalogs with duplicate stock photos
- Real Estate: Property listings with redundant images from different agents
- Manufacturing: Parts catalogs with multiple photos of the same component
- Media/Creative: Stock photo libraries with similar compositions
- Digital Asset Management: Corporate image libraries across departments
Consolidate Duplicate Assets
Apply Duplicate Detection.
- Scenario: Multiple sources upload the same or nearly identical images.
- Action: Identify duplicate groups and link them to a single master asset.
- Result: Prevents search results from being flooded with identical or near-identical images.
Enforce Quality Standards
Filter by Quality Issues.
- Filter: Blurry OR Dark OR Bright.
- Action: Flag these images for review, replacement, or auto-rejection.
- Result: Ensures only professional-quality images remain in the catalog.
Organize Unlabeled Content
Filter by Labels.
- Filter: Labels IS Unlabeled.
- Action: Isolate unlabeled content and use Semantic Search to bulk-select and categorize items (e.g., “red sneakers,” “two-bedroom apartments,” “hydraulic fittings”).
Recipe 5: Identifying Annotation Inconsistencies
Objective: Find and fix labeling errors or inconsistencies across your dataset to improve model training quality.
Example scenarios:
- Manufacturing: Mixed defect categories or mislabeled quality grades
- Medical Imaging: Inconsistent diagnostic labels across radiologists
- Retail: Product category errors or attribute mismatches
- Defense & Intelligence: Misclassified threat levels or event types
- Autonomous Vehicles: Inconsistent object classifications across annotators
Find Visual-Label Mismatches
Apply the Mislabels Filter.
- Action: Sort mislabels by confidence, highest first.
- Result: Surfaces images where the visual content doesn’t align with the assigned label.
Review Class Outliers
Apply the Outliers Filter and filter by specific labels.
- Action: Review images flagged as outliers within their assigned class.
- Result: Finds images that are correctly labeled but visually atypical for their category (e.g., drawings in a photo dataset).
Validate with Visual Search
Select a flagged image and run Visual Search.
- Action: See what other images visually match this item.
- Result: If most or all of the visual matches carry a different label, the flagged image is likely mislabeled.
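The validation step above amounts to a nearest-neighbor vote: look at the labels of the images closest to the flagged one and see whether they agree with its assigned label. A minimal sketch of that check follows, assuming Euclidean distance over hypothetical embeddings; `neighbor_label_vote` and all IDs, vectors, and labels are made up for illustration.

```python
import math
from collections import Counter

def neighbor_label_vote(target_id, embeddings, labels, k=3):
    """Return the majority label among the k nearest neighbours of
    target_id, plus the fraction of those neighbours that share it."""
    target = embeddings[target_id]
    neighbors = sorted(
        (image_id for image_id in embeddings if image_id != target_id),
        key=lambda image_id: math.dist(target, embeddings[image_id]),
    )[:k]
    votes = Counter(labels[image_id] for image_id in neighbors)
    majority_label, count = votes.most_common(1)[0]
    return majority_label, count / len(neighbors)

# Hypothetical data: img_4 sits among scratches but is labeled "dent".
embeddings = {
    "img_1": [0.0, 0.0],
    "img_2": [0.1, 0.0],
    "img_3": [0.0, 0.1],
    "img_4": [0.05, 0.05],
    "img_5": [3.0, 3.0],
    "img_6": [3.1, 3.0],
}
labels = {
    "img_1": "scratch", "img_2": "scratch", "img_3": "scratch",
    "img_4": "dent", "img_5": "dent", "img_6": "dent",
}
suggested, agreement = neighbor_label_vote("img_4", embeddings, labels, k=3)
likely_mislabel = suggested != labels["img_4"]
print(suggested, agreement, likely_mislabel)
```

A high agreement on a different label is a signal worth a human look, not proof; atypical but correctly labeled images can produce the same pattern.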
Recipe 6: Creating Balanced Training Sets
Objective: Build a training dataset with appropriate class distribution and representation across important variations.
Example scenarios:
- Manufacturing: Equal representation of defect types and severity levels
- Medical Imaging: Balanced demographics and presentation variations
- Retail: Proportional product categories and seasonal coverage
- Defense & Intelligence: Representative samples of normal and anomalous events
- Autonomous Vehicles: Balanced weather, lighting, and scenario types
Assess Current Distribution
Use Cluster View and group by labels.
- Action: Review the distribution of images across classes.
- Result: Identify overrepresented and underrepresented categories.
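If you export labels, the same assessment can be reproduced with a simple count. The sketch below compares each class share against a uniform split, which is only one possible target since the right distribution depends on your use case; `class_distribution`, `flag_imbalance`, the `tolerance` value, and the sample labels are hypothetical.

```python
from collections import Counter

def class_distribution(labels):
    """Map each class to its share of the labeled images."""
    counts = Counter(labels.values())
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def flag_imbalance(distribution, tolerance=0.5):
    """Compare each class share with a uniform share (1 / number of classes).
    Classes beyond +/- tolerance of that share are flagged."""
    uniform = 1 / len(distribution)
    over = [c for c, share in distribution.items() if share > uniform * (1 + tolerance)]
    under = [c for c, share in distribution.items() if share < uniform * (1 - tolerance)]
    return over, under

# Hypothetical export: 7 "ok", 2 "scratch", 1 "dent".
labels = {
    "img_0": "ok", "img_1": "ok", "img_2": "ok", "img_3": "ok",
    "img_4": "ok", "img_5": "ok", "img_6": "ok",
    "img_7": "scratch", "img_8": "scratch",
    "img_9": "dent",
}
distribution = class_distribution(labels)
over, under = flag_imbalance(distribution)
print(distribution, over, under)
```

Classes in `over` are candidates for reduction and classes in `under` are candidates for augmentation, as described in the next two steps.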
Reduce Overrepresented Classes
For dominant classes, apply Select Uniques.
- Set threshold to High to keep only the most distinctive examples.
- Result: Reduces redundancy while preserving diversity within that class.
Augment Underrepresented Classes
For rare classes, use Semantic Search to find more examples.
- Query: Describe the underrepresented category in detail.
- Review results and tag valid examples to expand that class.
Validate Diversity
Within each class, check cluster distribution.
- Use Visual Search from different cluster centers to ensure visual variety.
- Apply Select Uniques to prevent any single visual pattern from dominating.
Additional Tips for Recipe Success
Combine Filters Strategically
Most recipes work best when you apply filters in a specific order:
- Start broad with semantic or visual search to establish scope.
- Remove obvious problems with quality filters early.
- Refine for diversity with uniqueness and outlier filters.
- Finish with duplicate detection and targeted tagging.
Save Intermediate Steps
Save views at each major step in your recipe:
- Enables you to backtrack if a filter removes too much.
- Creates an audit trail for dataset curation decisions.
- Allows different team members to review at different stages.
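If you script part of a recipe outside the UI, the same habit applies: record what survives each step so you can backtrack and explain your decisions later. The sketch below is one minimal way to do that; `run_recipe`, the step names, and the metadata fields `blurry` and `duplicate_of` are hypothetical.

```python
def run_recipe(image_ids, steps):
    """Apply each (name, keep_fn) step in order and record the surviving
    IDs after every step, so any stage can be inspected or restored."""
    current = list(image_ids)
    history = [("start", list(current))]
    for name, keep_fn in steps:
        current = [image_id for image_id in current if keep_fn(image_id)]
        history.append((name, list(current)))
    return current, history

# Hypothetical per-image metadata.
meta = {
    "a": {"blurry": False, "duplicate_of": None},
    "b": {"blurry": True, "duplicate_of": None},
    "c": {"blurry": False, "duplicate_of": "a"},
    "d": {"blurry": False, "duplicate_of": None},
}
steps = [
    ("remove blurry", lambda i: not meta[i]["blurry"]),
    ("remove duplicates", lambda i: meta[i]["duplicate_of"] is None),
]
final, history = run_recipe(meta, steps)
for name, surviving in history:
    print(f"{name}: {len(surviving)} images")
```

Because every stage is stored, you can compare counts between steps and spot a filter that removed more than expected.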
Iterate and Adjust
These recipes are starting points, not rigid procedures:
- Adjust thresholds based on your dataset characteristics.
- Add custom metadata filters for domain-specific criteria.
- Combine multiple recipes for complex curation workflows.