Visual Layer’s Enrichment Hub lets you generate high-value metadata using pre-trained models tailored for image and video datasets. These enrichment models can:

  • Extract descriptive labels, captions, and tags
  • Enable advanced semantic and object-level search
  • Power downstream filtering, QA, and automation
  • Improve annotation coverage and data understanding

You can apply one or more models during the enrichment process, depending on the type of insights you want to generate.
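If you trigger enrichment programmatically rather than through the UI, a run essentially comes down to choosing a dataset and a list of model names. The sketch below is purely illustrative: the endpoint URL, payload fields, and authentication header are assumptions for demonstration, not Visual Layer's documented API. Consult the official API reference for the exact request format.

```python
import requests

# Hypothetical sketch: the endpoint, payload fields, and auth scheme are
# illustrative assumptions, not Visual Layer's documented API.
API_URL = "https://app.visual-layer.example/api/v1/datasets/{dataset_id}/enrichment"


def start_enrichment(dataset_id: str, api_key: str, models: list[str]) -> dict:
    """Request an enrichment run that applies one or more models to a dataset."""
    response = requests.post(
        API_URL.format(dataset_id=dataset_id),
        headers={"Authorization": f"Bearer {api_key}"},
        json={"models": models},  # e.g. ["VL-Object-Detector", "VL-Image-Captioner"]
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    job = start_enrichment(
        dataset_id="my-dataset-id",
        api_key="YOUR_API_KEY",
        models=["VL-Object-Detector", "VL-Image-Captioner"],
    )
    print(job)
```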

Available Enrichment Models

Visual Layer provides a wide range of built-in models designed for diverse enrichment tasks:

| Model Name | Task Type | Description |
| --- | --- | --- |
| VL-Object-Detector | Object Detection | Identifies and localizes objects with bounding boxes and class labels. |
| VL-Image-Tagger | Multi-Class Classification | Applies multiple labels to the entire image for categorization and metadata generation. |
| VL-Object-Captioner | Object to Text | Generates short captions describing individual objects in context. |
| VL-Image-Captioner | Image to Text | Summarizes the scene or image with natural language. |
| VL-Image-Semantic Search | Semantic Image Search | Enables conceptual search over images using natural language queries. |
| VL-Object-Semantic Search | Semantic Object Search | Enables contextual search for specific objects based on semantics. |
| NVILA-Lite-2B | Image-Text-to-Text (VQA) | Efficient VQA model for visual understanding tasks across multiple frames or images. |
| Janus-Pro-1B | Image-Text-to-Text (VQA) | Autoregressive model for multi-modal reasoning and question answering. |

Important

Some models require pre-existing labels before enrichment.
These models include:

  • Object/Image Captioners
  • Semantic Search (Object or Image)

Labels may come from user annotations, the Object Detector, or the Image Tagger model.
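Because captioning and semantic-search models consume existing labels, a programmatic pipeline would typically run a label-producing model before a label-dependent one. The sequencing below is a minimal sketch that reuses the hypothetical start_enrichment helper from the earlier example; the two-step ordering is an assumption about how you might stage the runs, not a prescribed workflow.

```python
# Minimal sketch using the hypothetical start_enrichment helper above:
# run a label-producing model first, then a model that depends on those labels.
detector_job = start_enrichment(
    dataset_id="my-dataset-id",
    api_key="YOUR_API_KEY",
    models=["VL-Object-Detector"],   # produces object labels
)

# In practice you would wait for the first run to finish before starting the
# second. Once labels exist (from annotations, the detector, or the tagger),
# label-dependent models such as the captioners can be applied.
caption_job = start_enrichment(
    dataset_id="my-dataset-id",
    api_key="YOUR_API_KEY",
    models=["VL-Object-Captioner"],  # requires pre-existing object labels
)
```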

Coming Soon

These models are in development and will be available soon in the enrichment catalog:

  • GPT-4o – Multimodal LLM for advanced captioning and reasoning
  • NV-Grounding-DINO – Grounded object detection with natural language prompts
  • NeVA-22B – High-performance image-to-text transformer
  • NV-CLIP – Lightweight vision-language model for fast embeddings
  • Qwen-VL-2B – VQA and captioning model for diverse use cases
  • Moondream2 – Compact vision-language model
  • Molmo-7B – Multimodal model for contextual enrichment
  • Advanced-Image-Search – Conceptual image retrieval using complex queries
  • Advanced-Object-Search – Enhanced semantic object retrieval
  • VL-Object-Tagger – Assigns tags at the object level across frames or images

Want Early Access?