Introduction
Caption generation is one of the most time-consuming operations in Visual Layer’s dataset pipeline. When creating multiple datasets with the same images, re-running caption generation for each dataset wastes valuable time and computational resources. This guide shows you how to extract and reuse caption data from previous pipeline runs on the same data, allowing you to:
- ✓ Skip caption generation on subsequent dataset creations
- ✓ Maintain consistent captions across multiple datasets
Use Case: This approach is ideal when you need to create multiple datasets or dataset versions using the same images but with different configurations.
Overview
After running a dataset pipeline, Visual Layer stores processed data, including the generated captions, in internal parquet files.
Using the Caption Extraction Script
Prerequisites
- Python 3.x
- pandas and pyarrow libraries (`pip install pandas pyarrow`)
What the Script Does
The extraction script processes Visual Layer’s internal parquet files to create a clean annotation file ready for reuse (a minimal sketch follows the list):
- Extracts relevant columns: Keeps only `filename` and `caption`
- Removes system paths: Strips prefixes like `/hostfs`, `/mnt`, etc.
- Creates relative paths: Converts absolute paths to relative filenames
- Outputs clean parquet: Generates a properly formatted `image_annotations.parquet`
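For orientation, here is a minimal sketch of that logic in pandas. The source filename and prefix below are hypothetical placeholders; the complete script linked in the next section handles more cases, including a configurable `--prefix`.

```python
import pandas as pd

# Hypothetical inputs -- substitute the internal parquet you located and the
# path prefix that appears in its "filename" column.
SOURCE = "visual_layer_internal.parquet"
PREFIX = "/hostfs/home/ubuntu/images/"
OUTPUT = "image_annotations.parquet"   # the filename Visual Layer looks for

df = pd.read_parquet(SOURCE)
df = df[["filename", "caption"]]                           # keep only the needed columns
df["filename"] = df["filename"].str.removeprefix(PREFIX)   # absolute -> relative (pandas 1.4+)
df.to_parquet(OUTPUT, index=False)
```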
Script Location
View Complete Script Code
The complete Python script is available on a separate page, where you can view and copy the code.
Installation
Save the script from the linked page locally and install the dependencies listed under Prerequisites.
Step-by-Step Workflow
Step 1: Create Initial Dataset (with Captioning)
Create your first dataset with captioning enabled as usual. This will generate the initial captions and store them in Visual Layer’s internal parquet file.
Step 2: Locate the Internal Parquet File
After the pipeline completes, find the dataset ID and locate the internal parquet file that the pipeline produced (a search sketch is shown below).
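The exact storage location depends on your deployment, so the root below is only a hypothetical placeholder; the idea is simply to search the dataset’s output directory for parquet files:

```python
from pathlib import Path

# Hypothetical data root and dataset ID -- adjust these to your deployment.
data_root = Path("/path/to/visual-layer/data")
dataset_id = "YOUR_DATASET_ID"

# List every parquet file produced for this dataset.
for parquet_path in (data_root / dataset_id).rglob("*.parquet"):
    print(parquet_path)
```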
Step 3: Run the Extraction Script
Process the internal parquet file with the extraction script (from the linked script page) to produce a clean `image_annotations.parquet` containing the captions.
Step 4: Copy to New Dataset Directory
Place the extracted parquet file in your new dataset directory, alongside the images, as sketched below.
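A minimal copy step with hypothetical paths; the only requirement is that the parquet ends up next to the images it describes:

```python
import shutil
from pathlib import Path

# Hypothetical paths -- the parquet must sit alongside the images it describes.
extracted = Path("image_annotations.parquet")
new_dataset_dir = Path("/new/location")        # contains dog_1.jpg, dog_2.jpg, ...

shutil.copy2(extracted, new_dataset_dir / extracted.name)
# Resulting layout:
#   /new/location/
#   ├── dog_1.jpg
#   ├── dog_2.jpg
#   └── image_annotations.parquet
```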
Step 5: Create New Dataset (Fast!)
Now create your new dataset. Visual Layer will:
- ✓ Detect the existing `image_annotations.parquet` file
- ✓ Use the provided captions
- ✓ Complete much faster!
Understanding Relative Paths
Critical Concept: Filenames in the parquet file must be relative to the dataset directory location.
Why Relative Paths?
Visual Layer looks for images relative to where the `image_annotations.parquet` file is located. Absolute paths won’t work because they reference specific system locations that may not exist or may differ across environments.
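As a small illustration (with hypothetical paths), a relative entry resolves against the directory that holds the parquet:

```python
from pathlib import Path

# Hypothetical locations: the parquet sits in the dataset directory.
dataset_dir = Path("/new/location")            # directory holding image_annotations.parquet
filename_in_parquet = "dog_1.jpg"              # relative entry from the parquet

resolved = dataset_dir / filename_in_parquet   # -> /new/location/dog_1.jpg
print(resolved)
print(resolved.exists())                       # True only if the image really lives there
```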
Examples
✗ Wrong - Absolute Paths
If the parquet lists the absolute path /home/ubuntu/images/dog_1.jpg but the new dataset lives in /new/location/, Visual Layer can’t find /home/ubuntu/images/dog_1.jpg.
✓ Correct - Relative Paths
Scenario 1: Parquet in same directory as images
When the parquet sits directly next to the images, filenames in the parquet are just the bare image names (for example, dog_1.jpg alongside image_annotations.parquet).
How the Script Handles Paths
The script automatically removes common system prefixes:

| Original Path (from VL) | After Processing | Notes |
|---|---|---|
| `/hostfs/home/ubuntu/images/dog.jpg` | `dog.jpg` | Removed `/hostfs/home/ubuntu/images/` |
| `/mnt/data/dogs/dog.jpg` | `dog.jpg` | Custom prefix with `--prefix /mnt/data/dogs/` |
| `/workspace/project/images/subdir/dog.jpg` | `dog.jpg` or `subdir/dog.jpg` | Depends on the prefix specified |
Complete Example
Let’s walk through a real example using dog images.
Initial State: Dataset 1 (with captioning)
After running the first dataset pipeline:

| filename | caption |
|---|---|
| `/hostfs/home/ubuntu/images/dog_1.jpg` | “A Golden Retriever sitting on grass. The dog has a friendly expression…” |
| `/hostfs/home/ubuntu/images/dog_2.jpg` | “A playful puppy with a red collar running through a park…” |
Processing with Script
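Running the extraction script over that file yields the cleaned table below. As a condensed equivalent of the same transformation in pandas (the source filename here is a hypothetical placeholder):

```python
import pandas as pd

# Hypothetical source name -- use the internal parquet located for Dataset 1.
df = pd.read_parquet("dataset_1_internal.parquet")[["filename", "caption"]]
df["filename"] = df["filename"].str.removeprefix("/hostfs/home/ubuntu/images/")
df.to_parquet("image_annotations.parquet", index=False)
```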
| filename | caption |
|---|---|
| `dog_1.jpg` | “A Golden Retriever sitting on grass. The dog has a friendly expression…” |
| `dog_2.jpg` | “A playful puppy with a red collar running through a park…” |
Dataset 2 (without captioning - fast!)
Create the new dataset pointing at the directory that now holds the images and `image_annotations.parquet`; Visual Layer detects the file, reuses the captions, and completes much faster.
Troubleshooting
Images Not Found
Error: “Could not find image at path: dog_1.jpg”
Cause: Filenames in the parquet don’t match actual file locations.
Solutions:
- Verify the parquet is in the same directory as the images
- Check that filenames match exactly (case-sensitive)
- Inspect parquet contents to verify paths are relative
Checking Parquet Contents
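A quick way to inspect the extracted file with pandas, run from the dataset directory:

```python
import pandas as pd

df = pd.read_parquet("image_annotations.parquet")
print(df.columns.tolist())    # expect ['filename', 'caption']
print(df["filename"].head())  # entries should be relative, e.g. 'dog_1.jpg'
```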
Captions Not Being Used
Issue: Visual Layer is still generating captions even though `image_annotations.parquet` exists.
Solutions:
- Verify the filename is exactly `image_annotations.parquet` (not `image_annotations_processed.parquet`)
- Ensure the file is in the correct location relative to the images
- Check that the parquet file has both `filename` and `caption` columns
Script Errors
Error: “Missing required columns: [‘caption’]”
Cause: The source parquet file doesn’t contain caption data.
Solution: The source dataset must have had captions generated. Check whether captioning was enabled in the original pipeline.
Related Documentation
- Preparing Annotation Files - Format requirements for annotation files
- Annotations Overview - Complete guide to importing annotations
- Creating Datasets - Dataset creation fundamentals
By following this workflow, you can significantly reduce dataset creation time when working with the same images across multiple datasets or configurations. The initial investment of generating captions once pays off through faster subsequent dataset creations.