Visual Layer’s Enrichment Hub lets you generate high-value metadata using pre-trained models tailored for image and video datasets. These enrichment models can:
  • Extract descriptive labels, captions, and tags
  • Enable advanced semantic and object-level search
  • Power downstream filtering, QA, and automation
  • Improve annotation coverage and data understanding

Available Enrichment Models

Visual Layer provides a wide range of built-in models designed for diverse enrichment tasks:
| Model Name | Task Type | Description |
| --- | --- | --- |
| VL-Object-Detector | Object Detection | Identifies and localizes objects with bounding boxes and class labels. |
| VL-Image-Tagger | Multi-Class Classification | Applies multiple labels to the entire image for categorization and metadata generation. |
| VL-Object-Captioner | Object to Text | Generates short captions describing individual objects in context. |
| VL-Image-Captioner | Image to Text | Summarizes the scene or image with natural language. |
| VL-Image-Semantic Search | Semantic Image Search | Enables conceptual search over images using natural language queries. |
| VL-Object-Semantic Search | Semantic Object Search | Enables contextual search for specific objects based on semantics. |
| NVILA-Lite-2B | Image-Text-to-Text (VQA) | Efficient VQA model for visual understanding tasks across multiple frames or images. |
| Janus-Pro-1B | Image-Text-to-Text (VQA) | Autoregressive model for multi-modal reasoning and question answering. |
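As a conceptual illustration of what the semantic-search models in the table do (not Visual Layer's actual implementation), the core idea is to embed images and a text query into a shared vector space and rank images by cosine similarity to the query. The file names and embedding vectors below are toy stand-ins; in practice both would come from a vision-language encoder:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy image embeddings (stand-ins for encoder outputs).
image_embeddings = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "city.jpg": [0.1, 0.9, 0.2],
    "forest.jpg": [0.2, 0.3, 0.9],
}

def semantic_search(query_embedding, top_k=2):
    """Rank stored images by similarity to the query embedding."""
    ranked = sorted(
        image_embeddings.items(),
        key=lambda item: cosine(query_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]

# A toy query embedding, e.g. for the text "sunny coastline".
print(semantic_search([0.8, 0.2, 0.1]))
```

The same ranking scheme applies at the object level: object-semantic search compares the query embedding against per-object embeddings instead of whole-image embeddings.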
Some models require pre-existing labels before they can run enrichment:
  • Object/Image Captioners
  • Semantic Search (Object or Image)
These labels may come from user annotations, the Object Detector, or the Image Tagger model.
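The prerequisite above can be sketched as data: a hypothetical mapping from label-dependent models to the label sources that satisfy them, with a helper that checks whether a model is runnable given the steps already completed. The model names mirror the catalog, but the mapping and function are illustrative, not a Visual Layer API:

```python
# Hypothetical prerequisite check: captioners and semantic search need
# labels from user annotations, the detector, or the tagger.
LABEL_SOURCES = {"user-annotations", "VL-Object-Detector", "VL-Image-Tagger"}

REQUIRES_LABELS = {
    "VL-Object-Captioner",
    "VL-Image-Captioner",
    "VL-Object-Semantic Search",
    "VL-Image-Semantic Search",
}

def can_run(model, completed_steps):
    """A label-dependent model is runnable once any label source has run."""
    if model not in REQUIRES_LABELS:
        return True
    return bool(LABEL_SOURCES & set(completed_steps))

print(can_run("VL-Image-Captioner", []))                   # no labels yet
print(can_run("VL-Image-Captioner", ["VL-Image-Tagger"]))  # tagger labels exist
```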

Coming Soon

These models are in development and will be available soon in the enrichment catalog:
| Model Name | Description |
| --- | --- |
| GPT-4o | Multimodal LLM for advanced captioning and reasoning |
| Nv-grounding dino | Grounded object detection with natural language prompts |
| Neva-22B | High-performance image-to-text transformer |
| Nvclip | Lightweight vision-language model for fast embeddings |
| Qwen-VL-2B | VQA and captioning model for diverse use cases |
| Moondream2 | Compact vision-language model |
| Molmo-7B | Multimodal model for contextual enrichment |
| Advanced-Image-Search | Conceptual image retrieval using complex queries |
| Advanced-Object-Search | Enhanced semantic object retrieval |
| VL-Object-Tagger | Assigns tags at the object level across frames or images |

Want Early Access?