
Introduction

Caption generation is one of the most time-consuming operations in Visual Layer’s dataset pipeline. When creating multiple datasets with the same images, re-running caption generation for each dataset wastes valuable time and computational resources. This guide shows you how to extract and reuse caption data from previous pipeline runs on the same data, allowing you to:
  • ✓ Skip caption generation on subsequent dataset creations
  • ✓ Maintain consistent captions across multiple datasets
Use Case: This approach is ideal when you need to create multiple datasets or dataset versions using the same images but with different configurations.

Overview

After running a dataset pipeline, Visual Layer stores processed data in:
/.vl/tmp/[dataset-id]/input/metadata/image_annotations.parquet
This parquet file contains all the caption data you need to reuse.

Using the Caption Extraction Script

Prerequisites

  • Python 3.x
  • pandas and pyarrow libraries (pip install pandas pyarrow)

What the Script Does

The extraction script processes Visual Layer’s internal parquet files to create a clean annotation file ready for reuse (a simplified sketch follows the list):
  1. Extracts relevant columns: Keeps only filename and caption
  2. Removes system paths: Strips prefixes like /hostfs, /mnt, etc.
  3. Creates relative paths: Converts absolute paths to relative filenames
  4. Outputs clean parquet: Generates a properly formatted image_annotations.parquet
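A rough sketch of the core logic is below. The -o and --prefix options and the default output name mirror the usage shown later in this guide, but the complete script on the linked page is authoritative:

#!/usr/bin/env python3
# Simplified sketch of process_annotations.py -- see the linked page for the full script
import argparse

import pandas as pd

def main():
    parser = argparse.ArgumentParser(description="Extract filename/caption pairs")
    parser.add_argument("input", help="source image_annotations.parquet")
    parser.add_argument("-o", "--output", default=None, help="output parquet path")
    parser.add_argument("--prefix", default="/hostfs", help="path prefix to strip")
    args = parser.parse_args()

    df = pd.read_parquet(args.input)

    # Keep only the two columns Visual Layer needs for reuse
    missing = {"filename", "caption"} - set(df.columns)
    if missing:
        raise SystemExit(f"Missing required columns: {sorted(missing)}")
    df = df[["filename", "caption"]]

    # Strip the system prefix, leaving paths relative to the dataset directory
    df["filename"] = df["filename"].str.removeprefix(args.prefix).str.lstrip("/")

    output = args.output or args.input.replace(".parquet", "_processed.parquet")
    df.to_parquet(output, index=False)
    print(f"✓ Successfully processed {len(df)} rows")
    print(f"✓ Output saved to: {output}")

if __name__ == "__main__":
    main()

Note that with the default prefix /hostfs, a path like /hostfs/home/ubuntu/images/dog_1.jpg becomes home/ubuntu/images/dog_1.jpg; pass the full directory prefix (e.g. --prefix /hostfs/home/ubuntu/images) to get bare filenames like those in the sample output below.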

Script Location

The complete Python script is available on a separate page, where you can view and copy the full code.

Installation

# Ensure pandas and pyarrow are installed
pip install pandas pyarrow

# Make script executable (optional)
chmod +x process_annotations.py

Step-by-Step Workflow

Step 1: Create Initial Dataset (with Captioning)

Create your first dataset with captioning enabled, as usual. This generates the initial captions and stores them in the internal parquet file.

Step 2: Locate the Parquet File

After the pipeline completes, find the dataset ID and locate the parquet file:
# List recent datasets
ls -lt /.vl/tmp/

# Navigate to your dataset's metadata
cd /.vl/tmp/[your-dataset-id]/input/metadata/

# Verify the file exists
ls image_annotations.parquet
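
Optionally, confirm the source file actually contains caption data before extracting. A quick check with pandas (replace the dataset ID placeholder as above):

python3 -c "
import pandas as pd
df = pd.read_parquet('/.vl/tmp/[your-dataset-id]/input/metadata/image_annotations.parquet')
print('Shape:', df.shape)
print('Columns:', df.columns.tolist())
print(df['caption'].notna().sum(), 'rows have captions')
"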

Step 3: Run the Extraction Script

Process the parquet file to extract captions:
# Basic usage - creates image_annotations_processed.parquet in same directory
python3 process_annotations.py /.vl/tmp/[dataset-id]/input/metadata/image_annotations.parquet

# Specify custom output location
python3 process_annotations.py /.vl/tmp/[dataset-id]/input/metadata/image_annotations.parquet \
  -o /path/to/new-dataset/image_annotations.parquet

# Custom prefix removal (if needed)
python3 process_annotations.py input.parquet --prefix /custom/prefix/to/remove
Script Output:
Reading parquet file: /.vl/tmp/abc123.../input/metadata/image_annotations.parquet
Original shape: (12, 9)
Columns: ['filename', 'file_size_bytes', 'video', 'frame_timestamp', 'caption', ...]

Removing prefix '/hostfs' from filenames...

Processed shape: (12, 2)

Sample filenames after processing:
['dog_1.jpg', 'dog_2.jpg', 'dog_3.jpg']

✓ Successfully processed 12 rows
✓ Output saved to: /path/to/output.parquet

Step 4: Copy to New Dataset Directory

Place the extracted parquet file in your new dataset directory alongside the images:
# Copy to new dataset location
cp image_annotations_processed.parquet /path/to/new-dataset/image_annotations.parquet

# Directory structure should look like:
# /path/to/new-dataset/
#   ├── image_annotations.parquet  (your extracted file)
#   ├── dog_1.jpg
#   ├── dog_2.jpg
#   └── dog_3.jpg
The parquet file must be named exactly image_annotations.parquet for Visual Layer to recognize it.
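
Before moving on, it can be worth sanity-checking that every filename in the parquet resolves to a real file next to it. A small sketch (the path below is illustrative):

python3 -c "
from pathlib import Path
import pandas as pd

parquet_path = Path('/path/to/new-dataset/image_annotations.parquet')
df = pd.read_parquet(parquet_path)

# Filenames are resolved relative to the parquet's directory
missing = [f for f in df['filename'] if not (parquet_path.parent / f).is_file()]
print(f'{len(df) - len(missing)} of {len(df)} files found')
if missing:
    print('Missing:', missing[:5])
"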

Step 5: Create New Dataset (Fast!)

Now create your new dataset. Visual Layer will:
  • ✓ Detect the existing image_annotations.parquet file
  • ✓ Use the provided captions
  • ✓ Complete much faster!
Note: you will need to remove the captioning step from your dataset configuration to avoid conflicts.

Understanding Relative Paths

Critical Concept: Filenames in the parquet file must be relative to the dataset directory location.

Why Relative Paths?

Visual Layer looks for images relative to where the image_annotations.parquet file is located. Absolute paths won’t work because they reference specific system locations that may not exist or may differ across environments.

Examples

✗ Wrong - Absolute Paths

filename: /home/ubuntu/images/dog_1.jpg
filename: /mnt/data/dogs/dog_2.jpg
filename: /hostfs/workspace/dog_3.jpg
Problem: These paths are tied to specific locations. If the parquet is in /new/location/, Visual Layer can’t find /home/ubuntu/images/dog_1.jpg.

✓ Correct - Relative Paths

Scenario 1: Parquet in same directory as images
Dataset directory: /any/path/dataset/
  ├── image_annotations.parquet
  ├── dog_1.jpg
  ├── dog_2.jpg
  └── dog_3.jpg

Filenames in parquet:
  - dog_1.jpg
  - dog_2.jpg
  - dog_3.jpg
Scenario 2: Images in subdirectory
Dataset directory: /any/path/dataset/
  ├── image_annotations.parquet
  └── images/
      ├── dog_1.jpg
      ├── dog_2.jpg
      └── dog_3.jpg

Filenames in parquet:
  - images/dog_1.jpg
  - images/dog_2.jpg
  - images/dog_3.jpg
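
If a parquet file you already have contains absolute paths, a short pandas pass can rewrite them. A sketch, assuming every image lives under a single root directory (images_root here is illustrative):

import pandas as pd
from pathlib import Path

images_root = Path('/home/ubuntu/images')  # assumption: common root of all images
df = pd.read_parquet('image_annotations.parquet')

# relative_to() raises ValueError if a path is not under images_root
df['filename'] = df['filename'].map(lambda f: str(Path(f).relative_to(images_root)))
df.to_parquet('image_annotations.parquet', index=False)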

How the Script Handles Paths

The script automatically removes common system prefixes:
Original Path (from VL)                     After Processing            Notes
/hostfs/home/ubuntu/images/dog.jpg          dog.jpg                     Removed /hostfs/home/ubuntu/images/
/mnt/data/dogs/dog.jpg                      dog.jpg                     Custom prefix with --prefix /mnt/data/dogs/
/workspace/project/images/subdir/dog.jpg    dog.jpg or subdir/dog.jpg   Depends on prefix specified
Key Insight: The dataset directory can be anywhere on your system. The important thing is that filenames are relative to wherever you place the image_annotations.parquet file.

Complete Example

Let’s walk through a real example using dog images.

Initial State: Dataset 1 (with captioning)

After running the first dataset pipeline:
# Pipeline completed, captions generated
Dataset ID: b451f7c6-f911-4ceb-8b8a-dd6c1ebb50fd
Internal parquet file:
Location: /.vl/tmp/b451f7c6-f911-4ceb-8b8a-dd6c1ebb50fd/input/metadata/image_annotations.parquet
Columns: [filename, file_size_bytes, video, frame_timestamp, caption,
          captions_source_id, default_embedding_index, _vl_stats, stats]
Rows: 12
Sample data:
filename                                 caption
/hostfs/home/ubuntu/images/dog_1.jpg     "A Golden Retriever sitting on grass. The dog has a friendly expression…"
/hostfs/home/ubuntu/images/dog_2.jpg     "A playful puppy with a red collar running through a park…"

Processing with Script

# Run extraction script
python3 process_annotations.py \
  /.vl/tmp/b451f7c6-.../input/metadata/image_annotations.parquet \
  -o /home/ubuntu/new-dataset/image_annotations.parquet
Output parquet file:
Location: /home/ubuntu/new-dataset/image_annotations.parquet
Columns: [filename, caption]
Rows: 12
Processed data:
filename     caption
dog_1.jpg    "A Golden Retriever sitting on grass. The dog has a friendly expression…"
dog_2.jpg    "A playful puppy with a red collar running through a park…"

Dataset 2 (without captioning - fast!)

# Directory structure
/home/ubuntu/new-dataset/
  ├── image_annotations.parquet  (extracted file)
  ├── dog_1.jpg
  ├── dog_2.jpg
  └── ... (all 12 images)

# Create new dataset
# Visual Layer detects image_annotations.parquet

Troubleshooting

Images Not Found

Error: “Could not find image at path: dog_1.jpg”
Cause: Filenames in parquet don’t match actual file locations.
Solutions:
  1. Verify parquet is in the same directory as images
  2. Check that filenames match exactly (case-sensitive)
  3. Inspect parquet contents to verify paths are relative

Checking Parquet Contents

# View parquet file contents
python3 -c "
import pandas as pd
df = pd.read_parquet('image_annotations.parquet')
print('Columns:', df.columns.tolist())
print('\nFirst 5 filenames:')
print(df['filename'].head().tolist())
"
Expected output:
Columns: ['filename', 'caption']

First 5 filenames:
['dog_1.jpg', 'dog_2.jpg', 'dog_3.jpg', 'dog_4.jpg', 'dog_5.jpg']

Captions Not Being Used

Issue: Visual Layer is still generating captions even though image_annotations.parquet exists.
Solutions (a combined check is sketched after this list):
  1. Verify filename is exactly image_annotations.parquet (not image_annotations_processed.parquet)
  2. Ensure file is in the correct location relative to images
  3. Check that parquet file has both filename and caption columns
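
A combined check (the path is illustrative; point it at your actual file):

python3 -c "
from pathlib import Path
import pandas as pd

p = Path('/path/to/new-dataset/image_annotations.parquet')
assert p.name == 'image_annotations.parquet', 'wrong file name'
assert p.is_file(), 'file not found'

df = pd.read_parquet(p)
missing = {'filename', 'caption'} - set(df.columns)
assert not missing, f'missing required columns: {sorted(missing)}'
print('✓ name, location, and columns look correct')
"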

Script Errors

Error: “Missing required columns: [‘caption’]”
Cause: The source parquet file doesn’t contain caption data.
Solution: The source dataset must have had captions generated. Check if captioning was enabled in the original pipeline.
By following this workflow, you can significantly reduce dataset creation time when working with the same images across multiple datasets or configurations. The initial investment of generating captions once pays off through faster subsequent dataset creations.