Visual Layer’s Enrichment Hub lets you generate high-value metadata using pre-trained models tailored for image and video datasets. These enrichment models can:

  • Extract descriptive labels, captions, and tags
  • Enable advanced semantic and object-level search
  • Power downstream filtering, QA, and automation
  • Improve annotation coverage and data understanding

You can apply one or more models during the enrichment process, depending on the type of insights you want to generate.
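If you trigger enrichment programmatically rather than through the UI, a run essentially comes down to choosing a dataset and a list of model names. The sketch below is purely illustrative: the endpoint URL, payload fields, and authentication header are assumptions for demonstration, not Visual Layer's documented API. Consult the official API reference for the exact request format.

```python
import requests

# Hypothetical sketch: the endpoint, payload fields, and auth scheme are
# illustrative assumptions, not Visual Layer's documented API.
API_URL = "https://app.visual-layer.example/api/v1/datasets/{dataset_id}/enrichment"


def start_enrichment(dataset_id: str, api_key: str, models: list[str]) -> dict:
    """Request an enrichment run that applies one or more models to a dataset."""
    response = requests.post(
        API_URL.format(dataset_id=dataset_id),
        headers={"Authorization": f"Bearer {api_key}"},
        json={"models": models},  # e.g. ["VL-Object-Detector", "VL-Image-Captioner"]
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    job = start_enrichment(
        dataset_id="my-dataset-id",
        api_key="YOUR_API_KEY",
        models=["VL-Object-Detector", "VL-Image-Captioner"],
    )
    print(job)
```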

Available Enrichment Models

Visual Layer provides a wide range of built-in models designed for diverse enrichment tasks:

| Model Name | Task Type | Description |
| --- | --- | --- |
| VL-Object-Detector | Object Detection | Identifies and localizes objects with bounding boxes and class labels. |
| VL-Image-Tagger | Multi-Class Classification | Applies multiple labels to the entire image for categorization and metadata generation. |
| VL-Object-Captioner | Object to Text | Generates short captions describing individual objects in context. |
| VL-Image-Captioner | Image to Text | Summarizes the scene or image with natural language. |
| VL-Image-Semantic Search | Semantic Image Search | Enables conceptual search over images using natural language queries. |
| VL-Object-Semantic Search | Semantic Object Search | Enables contextual search for specific objects based on semantics. |
| NVILA-Lite-2B | Image-Text-to-Text (VQA) | Efficient VQA model for visual understanding tasks across multiple frames or images. |
| Janus-Pro-1B | Image-Text-to-Text (VQA) | Autoregressive model for multi-modal reasoning and question answering. |

Important

Some models require pre-existing labels before enrichment.
These models include:

  • Object/Image Captioners
  • Semantic Search (Object or Image)

Labels may come from user annotations, the Object Detector, or the Image Tagger model.
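Because captioning and semantic-search models consume existing labels, a programmatic pipeline would typically run a label-producing model before a label-dependent one. The sequencing below is a minimal sketch that reuses the hypothetical start_enrichment helper from the earlier example; the two-step ordering is an assumption about how you might stage the runs, not a prescribed workflow.

```python
# Minimal sketch using the hypothetical start_enrichment helper above:
# run a label-producing model first, then a model that depends on those labels.
detector_job = start_enrichment(
    dataset_id="my-dataset-id",
    api_key="YOUR_API_KEY",
    models=["VL-Object-Detector"],   # produces object labels
)

# In practice you would wait for the first run to finish before starting the
# second. Once labels exist (from annotations, the detector, or the tagger),
# label-dependent models such as the captioners can be applied.
caption_job = start_enrichment(
    dataset_id="my-dataset-id",
    api_key="YOUR_API_KEY",
    models=["VL-Object-Captioner"],  # requires pre-existing object labels
)
```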

Coming Soon

These models are in development and will be available soon in the enrichment catalog:

  • GPT-4o – Multimodal LLM for advanced captioning and reasoning
  • NV-Grounding-DINO – Grounded object detection with natural language prompts
  • NeVA-22B – High-performance image-to-text transformer
  • NV-CLIP – Lightweight vision-language model for fast embeddings
  • Qwen-VL-2B – VQA and captioning model for diverse use cases
  • Moondream2 – Compact vision-language model
  • Molmo-7B – Multimodal model for contextual enrichment
  • Advanced-Image-Search – Conceptual image retrieval using complex queries
  • Advanced-Object-Search – Enhanced semantic object retrieval
  • VL-Object-Tagger – Assigns tags at the object level across frames or images

Want Early Access?