> ## Documentation Index
> Fetch the complete documentation index at: https://docs.visual-layer.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Exploration Recipes

> Step-by-step guides for common dataset curation workflows, from defect detection to training set cleanup.

Effective dataset curation requires combining multiple search and filter tools in a specific sequence. These recipes provide step-by-step workflows for common platform outcomes, helping you achieve high-quality results efficiently.

## Recipe 1: Finding Diverse Examples of a Specific Pattern

**Objective:** Isolate a diverse set of examples for a specific visual pattern, ensuring you capture rare variations without filling the dataset with repetitive images.

**Example scenarios:**

* Surface damage patterns for defect detection training
* Specific pathology presentations across patient populations
* Product damage types for claims processing
* Specific threat or anomaly patterns in surveillance data

1. Start with **Semantic Search**.
   * **Query:** Describe the pattern in natural language (e.g., "surface damage," "cracked glass," "skin lesion")
   * **Result:** This returns a broad set of candidates, likely including some irrelevant images (false positives).
2. Switch to **Visual Search**.
   * Find a clear, high-quality example of the specific pattern you want in the search results.
   * **Crop** the image to isolate just the pattern (excluding background or irrelevant context).
   * Run the visual search to find visually similar patterns.
3. Apply the **Uniques Filter**.
   * Set the **Uniqueness Threshold** to **High**.
   * **Why:** This hides repetitive examples of the same common pattern, surfacing visually distinct variations and edge cases.
4. Clean up the selection.
   * Use **Duplicate Detection** to remove near-identical frames.
   * **Save as View** with a descriptive name (e.g., "Distinctive Scratch Patterns," "Rare Lesion Variants") for your labeling team.
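
The idea behind the uniqueness step in this recipe can be illustrated outside the platform. The sketch below is not Visual Layer's implementation or API. It assumes each candidate image has already been converted to an embedding vector by some model, and the names `select_uniques` and `max_similarity` are hypothetical. It uses a simple greedy rule: keep an image only if it is not too similar to anything already kept.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_uniques(embeddings, max_similarity=0.9):
    """Greedy diversity selection (hypothetical helper, not a platform call).

    Walks the candidates in order and keeps an item only when its similarity
    to every previously kept item is at or below `max_similarity`.
    Returns the indices of the kept items.
    """
    kept = []
    for idx, vec in enumerate(embeddings):
        if all(cosine_similarity(vec, embeddings[k]) <= max_similarity for k in kept):
            kept.append(idx)
    return kept


# Toy 2-D "embeddings" standing in for real image features (assumed data).
candidates = [
    [1.0, 0.0],    # pattern variant A
    [0.99, 0.05],  # near-repeat of variant A
    [0.0, 1.0],    # visually distinct variant B
    [0.7, 0.7],    # variant C, partway between A and B
]

print(select_uniques(candidates, max_similarity=0.9))  # near-repeat is dropped
print(select_uniques(candidates, max_similarity=0.5))  # stricter: C is dropped too
```

In this toy model, lowering `max_similarity` plays the role of raising the uniqueness threshold: fewer, more distinct examples survive. How the platform maps its threshold levels to an internal similarity cutoff is not stated in this guide.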
## Recipe 2: Cleaning Raw Data for Labeling

**Objective:** Rapidly prepare a messy, raw dataset for labeling by removing low-quality data that would waste annotator time and budget.

**Example scenarios:**

* Raw production line footage with lighting issues and redundant frames
* Scans from multiple sources with varying quality standards
* User-generated product photos with technical failures
* Surveillance footage with motion blur and poor lighting
* Web-scraped images with inconsistent quality

1. Filter by **Quality Issues**.
   * Set **Blurry** to `IS NOT`.
   * Set **Dark** to `IS NOT`.
   * **Result:** Removes unreadable or low-information images immediately.
2. Filter by **Mislabels**.
   * Set **Mislabels** to `IS NOT`.
   * **Result:** Excludes images where existing metadata likely conflicts with visual content, preventing bad ground truth from entering the pipeline.
3. Apply **Select Uniques**.
   * Set threshold to **Medium**.
   * **Result:** If the ingest contains burst-mode photos or video sequences, this keeps only representative frames, significantly reducing the total count sent to labeling.
4. Select all remaining items.
5. **Export** the cleaned list to JSON/CSV to hand off to your annotation workforce.

## Recipe 3: Balancing Common Scenarios with Rare Edge Cases

**Objective:** Curate a dataset that captures both typical scenarios and rare edge cases while managing storage volumes efficiently.

**Example scenarios:**

* Common driving conditions vs. rare weather, road signs, or obstacles
* Standard production vs. unusual failure modes or material variations
* Common presentations vs. rare complications or co-morbidities
* Normal activity patterns vs. anomalous events requiring investigation
* Standard product views vs. unusual angles or lighting conditions

1. Apply **Duplicate Detection**.
   * **Action:** Review duplicate clusters from video sequences or burst captures.
   * **Select:** Keep one representative frame per scenario.
   * **Result:** Often reduces dataset size by 30-40% without losing scenario coverage.
2. Apply the **Outliers Filter**.
   * **Action:** Sort by high-confidence outliers.
   * **Result:** Surfaces rare variations that are critical for model robustness but easy to miss in manual review.
3. Filter by **Quality Issues**.
   * **Filter:** `Dark` and `Bright`.
   * **Action:** Instead of deleting these, tag them with a descriptive name (e.g., "Challenging Lighting," "Low Visibility").
   * **Result:** Creates specific subsets for testing model performance in adverse conditions.
4. Use **Cluster View** to verify distribution.
   * Review cluster sizes to ensure no single scenario dominates the dataset.
   * Use **Select Uniques** within overrepresented clusters to balance the distribution.

## Recipe 4: Managing Large Visual Catalogs

**Objective:** Consolidate duplicate assets, enforce quality standards, and organize large collections of visual content.

**Example scenarios:**

* Multi-vendor product catalogs with duplicate stock photos
* Property listings with redundant images from different agents
* Parts catalogs with multiple photos of the same component
* Stock photo libraries with similar compositions
* Corporate image libraries across departments

1. Apply **Duplicate Detection**.
   * **Scenario:** Multiple sources upload the same or nearly identical images.
   * **Action:** Identify duplicate groups and link them to a single master asset.
   * **Result:** Prevents search results from being flooded with identical or near-identical images.
2. Filter by **Quality Issues**.
   * **Filter:** `Blurry` OR `Dark` OR `Bright`.
   * **Action:** Flag these images for review, replacement, or auto-rejection.
   * **Result:** Ensures only professional-quality images remain in the catalog.
3. Filter by **Labels**.
   * **Filter:** `Labels` IS `Unlabeled`.
   * **Action:** Isolate unlabeled content and use **Semantic Search** to bulk-select and categorize items (e.g., "red sneakers," "two-bedroom apartments," "hydraulic fittings").
4. Use **Save as View** to create themed collections (e.g., "Hero Images," "Seasonal Products," "Premium Listings").
5. Share views with relevant teams to ensure everyone works from the same quality-controlled subset.

## Recipe 5: Identifying Annotation Inconsistencies

**Objective:** Find and fix labeling errors or inconsistencies across your dataset to improve model training quality.

**Example scenarios:**

* Mixed defect categories or mislabeled quality grades
* Inconsistent diagnostic labels across radiologists
* Product category errors or attribute mismatches
* Misclassified threat levels or event types
* Inconsistent object classifications across annotators

1. Apply the **Mislabels Filter**.
   * **Action:** Sort by high-confidence mislabels.
   * **Result:** Surfaces images where the visual content doesn't align with the assigned label.
2. Apply the **Outliers Filter** and filter by specific labels.
   * **Action:** Review images flagged as outliers within their assigned class.
   * **Result:** Finds images that are technically correct but visually anomalous for that category (e.g., drawings in a photo dataset).
3. Select a flagged image and run **Visual Search**.
   * **Action:** See what other images visually match this item.
   * **Result:** If all visual matches have a different label, this confirms a likely mislabel.
4. Tag all confirmed errors with "Needs Relabeling."
5. **Export** this view to CSV/JSON for your annotation team to correct.
6. Track corrections by saving views before and after relabeling.

## Recipe 6: Creating Balanced Training Sets

**Objective:** Build a training dataset with appropriate class distribution and representation across important variations.

**Example scenarios:**

* Equal representation of defect types and severity levels
* Balanced demographics and presentation variations
* Proportional product categories and seasonal coverage
* Representative samples of normal and anomalous events
* Balanced weather, lighting, and scenario types

1. Use **Cluster View** and group by labels.
   * **Action:** Review the distribution of images across classes.
   * **Result:** Identify overrepresented and underrepresented categories.
2. For dominant classes, apply **Select Uniques**.
   * Set threshold to **High** to keep only the most distinctive examples.
   * **Result:** Reduces redundancy while preserving diversity within that class.
3. For rare classes, use **Semantic Search** to find more examples.
   * **Query:** Describe the underrepresented category in detail.
   * Review results and tag valid examples to expand that class.
4. Within each class, check cluster distribution.
   * Use **Visual Search** from different cluster centers to ensure visual variety.
   * Apply **Select Uniques** to prevent any single visual pattern from dominating.
5. Save the final balanced distribution as a view.
6. **Export** with stratified sampling to maintain proportions in train/validation splits.

## Additional Tips for Recipe Success

### Combine Filters Strategically

Most recipes work best when you apply filters in a specific order:

1. **Start broad** with semantic or visual search to establish scope.
2. **Remove obvious problems** with quality filters early.
3. **Refine for diversity** with uniqueness and outlier filters.
4. **Final polish** with duplicate detection and targeted tagging.

### Save Intermediate Steps

Save views at each major step in your recipe:

* Lets you backtrack if a filter removes too much.
* Creates an audit trail for dataset curation decisions.
* Allows different team members to review at different stages.

### Iterate and Adjust

These recipes are starting points, not rigid procedures:

* Adjust thresholds based on your dataset characteristics.
* Add custom metadata filters for domain-specific criteria.
* Combine multiple recipes for complex curation workflows.

## Related Resources

* The core Find-Narrow-Refine workflow philosophy
* Detailed guide to every filter operator and option
* How similarity clustering powers many recipes
* Saving and sharing curated datasets