# Explore model catalog

Browse all available enrichment models in Visual Layer’s catalog and learn how each can enhance your dataset.
Visual Layer’s Enrichment Hub lets you generate high-value metadata using pre-trained models tailored for image and video datasets. These enrichment models can:
- Extract descriptive labels, captions, and tags
- Enable advanced semantic and object-level search
- Power downstream filtering, QA, and automation
- Improve annotation coverage and data understanding
You can apply one or more models during the enrichment process, depending on the type of insights you want to generate.
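Applying several models produces complementary metadata for each image. As a rough illustration, enriched records might combine tags, a caption, and detections in a single structure. The schema and field names below are hypothetical, chosen only to make the idea concrete; they are not Visual Layer's actual output format:

```python
from dataclasses import dataclass, field

# Hypothetical schema for illustration only -- not Visual Layer's actual format.

@dataclass
class Detection:
    label: str          # class label from an object detector
    bbox: tuple         # (x_min, y_min, x_max, y_max) in pixels
    confidence: float

@dataclass
class EnrichedImage:
    image_id: str
    tags: list = field(default_factory=list)        # from an image tagger
    caption: str = ""                               # from an image captioner
    detections: list = field(default_factory=list)  # from an object detector

record = EnrichedImage(
    image_id="img_0001",
    tags=["outdoor", "street"],
    caption="A cyclist riding past parked cars on a city street.",
    detections=[
        Detection("bicycle", (120, 80, 340, 300), 0.94),
        Detection("car", (400, 110, 620, 280), 0.88),
    ],
)

# Downstream filtering example: keep images containing a confident "bicycle".
has_bicycle = any(d.label == "bicycle" and d.confidence > 0.9
                  for d in record.detections)
print(has_bicycle)  # True
```

This is the kind of record that downstream filtering, QA, and automation can consume once enrichment has run.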
## Available Enrichment Models
Visual Layer provides a wide range of built-in models designed for diverse enrichment tasks:
| Model Name | Task Type | Description |
| --- | --- | --- |
| VL-Object-Detector | Object Detection | Identifies and localizes objects with bounding boxes and class labels. |
| VL-Image-Tagger | Multi-Class Classification | Applies multiple labels to the entire image for categorization and metadata generation. |
| VL-Object-Captioner | Object to Text | Generates short captions describing individual objects in context. |
| VL-Image-Captioner | Image to Text | Summarizes the scene or image with natural language. |
| VL-Image-Semantic Search | Semantic Image Search | Enables conceptual search over images using natural language queries. |
| VL-Object-Semantic Search | Semantic Object Search | Enables contextual search for specific objects based on semantics. |
| NVILA-Lite-2B | Image-Text-to-Text (VQA) | Efficient VQA model for visual understanding tasks across multiple frames or images. |
| Janus-Pro-1B | Image-Text-to-Text (VQA) | Autoregressive model for multi-modal reasoning and question answering. |
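The semantic search models in the table work by embedding the query text and each image (or object crop) into a shared vector space and ranking results by similarity. A minimal sketch of the ranking step, using made-up 3-dimensional embeddings in place of real model outputs:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings standing in for real model outputs.
query_embedding = [0.9, 0.1, 0.2]   # e.g. for the query "red bicycle"
image_embeddings = {
    "img_0001": [0.8, 0.2, 0.1],
    "img_0002": [0.1, 0.9, 0.3],
    "img_0003": [0.7, 0.1, 0.4],
}

# Rank images by similarity to the query, best match first.
ranked = sorted(
    image_embeddings,
    key=lambda k: cosine_similarity(query_embedding, image_embeddings[k]),
    reverse=True,
)
print(ranked)  # ['img_0001', 'img_0003', 'img_0002']
```

In practice the embedding dimensions number in the hundreds and are produced by the vision-language model itself; only the ranking principle is shown here.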
**Important:** The following models require pre-existing labels before enrichment:

- Object/Image Captioners
- Semantic Search (Object or Image)

Labels may come from user annotations, the Object Detector, or the Image Tagger model.
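Because captioners and semantic search consume labels produced upstream, a valid enrichment run needs at least one label source alongside those models. A toy sketch of that dependency check; the function and the rule are simplified for illustration and are not part of Visual Layer's API:

```python
# Simplified dependency rule: label consumers need at least one label
# producer (or existing user annotations) in the same enrichment run.
LABEL_PRODUCERS = {"user-annotations", "VL-Object-Detector", "VL-Image-Tagger"}
LABEL_CONSUMERS = {"VL-Object-Captioner", "VL-Image-Captioner",
                   "VL-Image-Semantic Search", "VL-Object-Semantic Search"}

def validate_selection(selected):
    """Return the selected label consumers that lack a label source."""
    has_source = any(m in LABEL_PRODUCERS for m in selected)
    if has_source:
        return []
    return [m for m in selected if m in LABEL_CONSUMERS]

# A captioner with no detector, tagger, or annotations is flagged.
print(validate_selection(["VL-Image-Captioner"]))
# ['VL-Image-Captioner']
print(validate_selection(["VL-Object-Detector", "VL-Image-Captioner"]))
# []
```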
## Coming Soon
These models are in development and will be available soon in the enrichment catalog:
- GPT-4o – Multimodal LLM for advanced captioning and reasoning
- NV-Grounding-DINO – Grounded object detection with natural language prompts
- NeVA-22B – High-performance image-to-text transformer
- NVCLIP – Lightweight vision-language model for fast embeddings
- Qwen-VL-2B – VQA and captioning model for diverse use cases
- Moondream2 – Compact vision-language model
- Molmo-7B – Multimodal model for contextual enrichment
- Advanced-Image-Search – Conceptual image retrieval using complex queries
- Advanced-Object-Search – Enhanced semantic object retrieval
- VL-Object-Tagger – Assigns tags at the object level across frames or images
## Want Early Access?

Have questions, or want to try out upcoming models early? Contact us to request access or learn more.