Vision Language Model
A Vision Language Model (VLM), also known as a visual language model, is a type of artificial intelligence system that can understand and process both visual information (images, videos) and textual information (natural language), enabling machines to perceive, reason about, and communicate about visual content.[1] These multimodal models learn rich correlations from billions of web-scale image-text pairs, enabling zero-shot predictions across diverse visual recognition tasks with a single unified architecture.[2]
Unlike traditional computer vision models, which require massive manually labeled datasets for each specific task, VLMs leverage freely available image-text pairs from the internet, fundamentally changing how machines learn to perceive and describe visual content. By combining computer vision and natural language processing (NLP), these models learn to map the complex relationships between images or videos and their corresponding text descriptions.[3] VLMs are a class of foundation model that can be adapted to a wide range of vision-and-language tasks, from describing photographs and summarizing videos to aiding visual search and human-computer interaction.[4]
Architecture
Core Components
Most contemporary Vision Language Models are constructed using a modular, three-part architecture that combines specialized components for vision, language, and the interface between them.[4]
Vision Encoder: The vision encoder is the component responsible for "seeing." It processes visual input through architectures such as the Vision Transformer (ViT) or CNNs, dividing images into patches (typically 16×16 pixels) that are flattened into vectors and processed through transformer layers.[5] With 16×16 patches, a 224×224 image yields a 14×14 grid of 196 patches, each linearly projected to an embedding vector (768-dimensional in ViT-Base). These visual features then flow through 12–24 transformer layers of multi-head self-attention, producing rich representations that capture objects, attributes, spatial relationships, and semantic content.[6] Many state-of-the-art VLMs, including LLaVA, use pre-trained ViT-based encoders from CLIP for their powerful and generalizable visual representations.[7]
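The patch arithmetic above can be made concrete with a short, generic sketch (PyTorch is used purely for illustration; the dimensions correspond to a ViT-Base-style configuration and are not tied to any specific VLM):

```python
# Illustrative ViT-style patch embedding: a 224x224 RGB image is split into
# 16x16 patches, each linearly projected to a 768-dimensional token,
# giving a 14x14 = 196-token grid.
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768
# A Conv2d whose kernel and stride equal the patch size is equivalent to
# flattening each patch and applying a shared linear projection.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)           # a batch with one RGB image
tokens = patch_embed(image)                   # shape (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)    # shape (1, 196, 768): 196 visual tokens
print(tokens.shape)                           # torch.Size([1, 196, 768])
```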
Language Model (LLM): The Large Language Model serves as the cognitive backbone of the VLM. It is responsible for high-level reasoning, contextual understanding, and generating coherent textual output.[4] Typically a large pre-trained transformer like Vicuna, LLaMA, Gemma, or GPT-4, the LLM processes the projected visual tokens alongside text tokens. During generation, the model autoregressively predicts the next token based on the full multimodal context, treating vision-language understanding as sequence prediction where images become another type of token in the language model's input stream.[7]
Vision-Language Connector: The connector, also referred to as a projector or bridge, is a critical component that links the vision encoder and the language model. Its function is to translate the visual embeddings produced by the vision encoder into a format that is compatible with the LLM's input space.[4] Common connector architectures include:
- Simple Linear Projection: A lightweight and data-efficient approach in which a single trainable linear layer maps the visual feature space to the LLM's word embedding space, used by models like LLaVA and PaliGemma (see the sketch after this list).[7]
- Multi-Layer Perceptron (MLP): A slightly more complex connector using a small neural network with one or more hidden layers, employed in models like LLaVA-1.5.[8]
- Cross-Attention Layers: More deeply integrated approach where new cross-attention layers are added directly into the LLM's architecture, used by Flamingo and Llama 3.2 Vision.[9]
- Q-Former: BLIP-2's sophisticated fusion mechanism using 188 million trainable parameters with 32 learnable query embeddings to map visual features into the language model's space through cross-attention.[10]
- Perceiver Resampler: A specialized module used by Flamingo that takes a potentially large and variable number of visual features and uses an attention-based mechanism to "distill" them into a smaller, fixed number of latent visual tokens.[9]
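A minimal sketch of the simplest of these connectors, a LLaVA-style linear projection, is shown below. The dimensions are illustrative assumptions (CLIP ViT-L/14-like features feeding a 7B-class LLM), not any model's exact configuration:

```python
# Hypothetical linear connector: visual features are projected into the LLM's
# word-embedding space and concatenated with the embedded text tokens, so the
# language model can attend over both modalities as one sequence.
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096              # assumed encoder and LLM widths
connector = nn.Linear(vision_dim, llm_dim)    # the only newly trained module here

visual_features = torch.randn(1, 196, vision_dim)  # one image, 196 patch tokens
text_embeddings = torch.randn(1, 32, llm_dim)      # 32 already-embedded prompt tokens

visual_tokens = connector(visual_features)          # (1, 196, 4096)
multimodal_input = torch.cat([visual_tokens, text_embeddings], dim=1)
print(multimodal_input.shape)                       # (1, 228, 4096), fed to the LLM
```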
Fusion Strategies
Three primary fusion strategies enable different architectural trade-offs:
Early Fusion: Combines visual and textual inputs before deep processing. Models like Chameleon use VQ-VAE to tokenize images into discrete tokens (1024 tokens per 256×256 image), processing them alongside text in a single decoder.[11] Fuyu-8B exemplifies early fusion by directly feeding image patches into the decoder model without a separate vision encoder.[12]
Late Fusion: Processes modalities independently and combines them only at the output level. CLIP epitomizes this strategy, training separate image and text encoders that project into a shared embedding space (512-dimensional in the base configuration), where cosine similarity measures alignment.[1]
Intermediate Fusion: Injects visual information into the language model through cross-attention and has emerged as the dominant approach. Flamingo's architecture freezes both the vision encoder and the language model, training only the Perceiver Resampler and gated cross-attention layers; a tanh-gating mechanism initialized to zero allows stable training by gradually opening the gate during learning.[9]
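The tanh-gating mechanism can be illustrated with a simplified module (an illustrative sketch only; Flamingo's actual block also contains a gated feed-forward layer and operates on latents produced by the Perceiver Resampler):

```python
# Sketch of gated cross-attention: text hidden states attend over visual
# tokens, and the result is blended in through a tanh gate that starts at
# zero, so the frozen LLM's behaviour is initially unchanged.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: no-op at initialization

    def forward(self, text_tokens, visual_tokens):
        attended, _ = self.attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        return text_tokens + torch.tanh(self.gate) * attended   # residual, gated blend

layer = GatedCrossAttention(dim=512)
text = torch.randn(2, 16, 512)       # language-model hidden states
vision = torch.randn(2, 64, 512)     # resampled visual latents
print(layer(text, vision).shape)     # torch.Size([2, 16, 512])
```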
History
Early Concepts and Precursors (2010s–2020)
Research at the intersection of vision and language gained significant momentum around 2015, with early efforts focusing on specific tasks like image captioning and visual question answering.[13] These initial models typically paired a Convolutional Neural Network (CNN) for visual feature extraction with a Recurrent Neural Network (RNN) for language generation.[3]
Initial efforts included:
- Neural Image Captioners (2014–2015): Early models like Show and Tell (Google, 2015) combined a CNN image encoder with an RNN language decoder to generate captions, demonstrating the first end-to-end learned image description system.[13]
- Visual Question Answering Models (2015): The first VQA models used CNNs for images and sequence models for questions to output answers, coinciding with the introduction of the VQA dataset.[14]
- Transformer-based Models (2019): Models like ViLBERT and LXMERT extended the BERT architecture to multimodal input, using two-stream transformers with cross-attention for fusion.[15][16]
The Rise of Contrastive Pre-training (2021)
The year 2021 marked a pivotal moment for VLMs with the introduction of models that leveraged large-scale contrastive pre-training. This approach represented a fundamental shift away from relying on smaller, meticulously human-labeled datasets like ImageNet toward using the vast and noisy data available on the internet.
CLIP (Contrastive Language–Image Pre-training): Developed by OpenAI and released in February 2021, CLIP was trained on 400 million web-scraped image-text pairs and achieved zero-shot ImageNet accuracy matching that of a supervised ResNet-50. Its core innovation was a contrastive learning objective: the model learns to maximize the similarity between the embeddings of a correct image-text pair while minimizing the similarity with all other pairs in a batch.[1]
ALIGN: Shortly after CLIP, Google researchers introduced ALIGN, which embraced noisy web data at an unprecedented scale of 1.8 billion image-text pairs, demonstrating that massive scale can compensate for data noise.[2]
Integration with Large Language Models (2022)
April 2022 brought Flamingo from DeepMind, marking the shift from task-specific fine-tuning toward in-context few-shot learning. The 80-billion-parameter model processed arbitrarily interleaved sequences of images, videos, and text, and was trained on MultiModal MassiveWeb (M3W), a dataset of 43 million webpages.[9]
Instruction Tuning Era (2023)
The instruction-tuning revolution arrived in April 2023 with LLaVA (Large Language and Vision Assistant), which connected a CLIP ViT-L/14 encoder to Vicuna through a simple linear projection. LLaVA's key innovation was using GPT-4 to generate 158,000 multimodal instruction-following samples, pioneering instruction tuning in the multimodal domain.[7]
September 2023 marked VLMs entering the mainstream with GPT-4V (GPT-4 with Vision), OpenAI's multimodal extension integrated into ChatGPT, bringing advanced vision capabilities to millions of users.[17]
Current Generation (2024–2025)
March 2024 saw Anthropic release Claude 3 with three variants (Haiku, Sonnet, Opus) all featuring native vision capabilities and 200K token context windows.[18] Google's Gemini family emerged as the first major models trained multimodally from the start, with Gemini 1.0 Ultra achieving human-expert performance on MMLU (90.0%).[19]
Training
Pre-training Objectives
The goal of pre-training is to establish a fundamental alignment between the vision and language encoders using massive, often web-scale, datasets.[20]
Contrastive Learning: A foundational pre-training technique where the model is presented with a batch of image-text pairs and learns to distinguish between corresponding (positive) pairs and non-corresponding (negative) pairs. The training objective, often implemented with a contrastive loss function like InfoNCE, is to pull the vector embeddings of positive pairs closer together in a shared embedding space while pushing the embeddings of negative pairs farther apart.[1]
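A minimal sketch of a symmetric, CLIP-style InfoNCE loss follows (illustrative only; production implementations typically use a learned temperature and very large, distributed batches):

```python
# Contrastive loss over a batch of N image-text pairs: after L2-normalization,
# matching pairs lie on the diagonal of the N x N similarity matrix, and
# cross-entropy pulls them together while pushing mismatched pairs apart.
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # (N, N) cosine similarities
    targets = torch.arange(logits.size(0))             # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i2t + loss_t2i) / 2

print(clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```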
Masked Modeling: Inspired by models like BERT, this technique involves randomly hiding or "masking" a portion of the input and training the model to predict or reconstruct the missing part (a toy masking sketch follows the list below). This can be applied to either modality:
- Masked Language Modeling (MLM): The model predicts masked words based on visual context and surrounding text
- Masked Image Modeling (MIM): The model reconstructs missing image patches based on textual context and visible parts[21]
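The masking step itself is simple; the toy sketch below shows it for text (the token ids and the 15% masking ratio are assumptions borrowed from BERT-style practice, and masked image modeling works analogously on patch embeddings):

```python
# Toy masked-language-modelling corruption: ~15% of positions are replaced by
# a mask token, and the model is trained to recover the original ids from the
# remaining text plus the visual context.
import torch

token_ids = torch.tensor([101, 2009, 2003, 1037, 3899, 102])  # toy token ids
mask_token_id = 103                                           # hypothetical [MASK] id
mask = torch.rand(token_ids.shape) < 0.15                     # choose ~15% of positions
corrupted = token_ids.clone()
corrupted[mask] = mask_token_id                               # inputs seen by the model
print(corrupted, token_ids[mask])                             # targets are the hidden ids
```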
Generative Objectives: Models learn to autoregressively generate captions or images from multimodal inputs, as demonstrated by CoCa and Flamingo.[22]
Supervised Fine-tuning
Supervised fine-tuning adapts aligned models to specific tasks and instruction-following behavior. LLaVA's approach uses 558K image-text pairs for initial alignment followed by 158K instruction samples for task adaptation.[7]
Many modern VLMs employ a two-stage fine-tuning process (sketched in code after the list):
- Feature Alignment: Only the vision-language connector is trained while the vision encoder and LLM remain frozen
- End-to-end Fine-tuning: Both the connector and the LLM (or a subset of its parameters) are trained on instruction-following data[7]
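The freezing schedule behind these two stages can be written in a few lines. The module names below are hypothetical placeholders rather than the API of any real training framework:

```python
# Stage 1 (feature alignment): only the connector is trainable.
# Stage 2 (end-to-end fine-tuning): the LLM (or a subset of it) is unfrozen too;
# the vision encoder typically stays frozen throughout.
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad_(trainable)

def configure_stage(vision_encoder, connector, llm, stage: int) -> None:
    set_trainable(vision_encoder, False)     # frozen in both stages
    set_trainable(connector, True)           # trained in both stages
    set_trainable(llm, stage == 2)           # unfrozen only for stage 2

# Placeholder modules standing in for the real components:
vision_encoder, connector, llm = nn.Identity(), nn.Linear(1024, 4096), nn.Linear(4096, 4096)
configure_stage(vision_encoder, connector, llm, stage=1)   # feature alignment
configure_stage(vision_encoder, connector, llm, stage=2)   # instruction tuning
```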
VILA research demonstrated that interleaving text-only instruction data with vision-language data during SFT remedies text-only task degradation.[23]
Reinforcement Learning
Methods like RLHF (Reinforcement Learning from Human Feedback) and RLVR (Reinforcement Learning with Verifiable Rewards) fine-tune aligned models for safety, helpfulness, and accuracy.[24]
Datasets
Vision-language models are typically trained on large sets of image–text pairs, using various training objectives to learn joint representations.
| Dataset | Size | Description | Usage |
|---|---|---|---|
| LAION-5B | 5.85B pairs | Web-scraped image-text pairs with CLIP filtering, multilingual | Pre-training[25] |
| ALIGN Dataset | 1.8B pairs | Noisy web data from ALT-text | Contrastive alignment[2] |
| Conceptual Captions | CC3M: 3.3M, CC12M: 12.4M | Curated web captions, cleaned and filtered | Pre-training and fine-tuning[26] |
| COCO | 330K images | Image captioning with 5 captions per image, 1.5M object instances | Evaluation/Fine-tuning[27] |
| Visual Genome | 108K images | Dense annotations (regions, relations, scene graphs) | Grounding and detailed understanding[28] |
| VQAv2 | 1.1M questions | Visual question answering on COCO images | Evaluation/Fine-tuning[29] |
| LLaVA-Instruct | 158K samples | GPT-4-generated instruction-following conversations | Instruction tuning[7] |
| MMC4 | 100M+ documents | Web documents with images interleaved into text, built on the C4 corpus | Interleaved pre-training[30] |
Notable Vision-Language Models
Research in vision-language modeling has progressed rapidly, with numerous notable models marking significant milestones:
| Model | Developer | Year | Parameters | Key Features | License |
|---|---|---|---|---|---|
| CLIP | OpenAI | 2021 | Varies (ViT-B/32: 63M to ViT-L/14: 427M) | Zero-shot classification, contrastive learning, dual encoder | Open |
| ALIGN | Google | 2021 | ~650M | Noisy web-scale alignment, 1.8B training pairs | Research |
| Flamingo | DeepMind | 2022 | 3B-80B | Few-shot VQA, perceiver resampler, interleaved inputs | Research |
| BLIP-2 | Salesforce | 2023 | 2.7B–13B | Bootstrapped pretraining, Q-Former, frozen LLMs | Apache 2.0 |
| LLaVA | Liu et al. | 2023 | 7B–34B | Instruction-tuned, chat capabilities, simple projection | Apache 2.0 |
| Kosmos-1/2 | Microsoft | 2023 | 1.3B-1.6B | Grounding, zero-shot detection, end-to-end training | MIT |
| PaLM-E | Google | 2023 | 562B | Embodied reasoning, robotics integration | Research |
| Qwen-VL | Alibaba | 2023 | 7B | Multilingual, OCR, long video understanding | Apache 2.0 |
| MiniGPT-4 | Zhu et al. | 2023 | 7B | LLM alignment via projector, efficient adaptation | BSD 3-Clause |
| GPT-4V/GPT-4o | OpenAI | 2023-2024 | Undisclosed | Real-time multimodality, best commercial performance | Proprietary |
| Gemini | Google | 2024 | Undisclosed | Video understanding, long context (1M+ tokens), native multimodal training | Proprietary |
| Claude 3/3.5 | Anthropic | 2024 | Undisclosed | Strong reasoning, 200K context window, safety focus | Proprietary |
| Llama 3.2 Vision | Meta | 2024 | 11B–90B | Open-source, efficient, cross-attention architecture | Llama License |
| Qwen2.5-VL | Alibaba | 2025 | 3B–72B | Advanced OCR, multilingual, state-of-the-art open performance | Apache 2.0 |
| DeepSeek-VL | DeepSeek | 2024 | 7B | High-resolution support, efficient training | MIT |
| PaliGemma | Google | 2024 | 3B | Strong transferable performance, SigLIP encoder | Apache 2.0 |
| InternVL2 | Shanghai AI Lab | 2024 | 2B–108B | Progressive alignment, dynamic high-resolution | Apache 2.0 |
Applications
VLMs have broad applications wherever visual content needs to be interpreted or generated in conjunction with language.
Visual Understanding Tasks
- Image and Video Captioning: Generating concise and accurate textual descriptions for images or videos, describing actions, interactions between objects, and overall scene context.[3][31]
- Visual Question Answering (VQA): Answering natural language questions about images, ranging from simple identification queries to complex reasoning questions requiring spatial understanding or inference.[3]
- Object Detection and Segmentation: VLMs enable "open-vocabulary" object detection, where models can detect and localize objects described in free-form text, even for categories not in training data.[32]
- Optical Character Recognition (OCR): Reading and transcribing text embedded within images, such as text on street signs, documents, or product labels.[6]
Real-World Applications
Accessibility and Assistive Technology
VLMs power accessibility technologies that make digital content inclusive. Systems like Be My Eyes use GPT-4V to provide real-time descriptions of environments and objects through smartphone cameras for visually impaired users.[33] These applications can describe surroundings in real-time, read text from documents or product labels, and answer questions about visual content.
Healthcare and Medical Imaging
VLMs assist healthcare professionals by analyzing medical images like X-rays or CT scans and generating preliminary reports. Applications include automated radiology report generation and medical VQA systems. However, research reveals concerning fairness gaps, with foundation models consistently underdiagnosing marginalized groups compared to board-certified radiologists.[34]
Robotics and Embodied AI
VLMs form the perceptual core of Vision-Language-Action (VLA) models, allowing robots to understand natural language commands within physical environments. RT-2 unified vision, language, and action tokens, achieving 63% improvement on novel objects.[35] OpenVLA provides open-source VLA trained on 970k robot demonstrations.[36]
Document Understanding
VLMs excel at extracting information from structured documents, interpreting charts and graphs, and understanding document layouts. DocVQA benchmarks document comprehension with 12,000+ document images and 50,000+ questions.[37]
Content Moderation
VLMs can more accurately detect harmful or inappropriate content on social media platforms by analyzing the combined context of images and accompanying text. The KuaiMod framework, deployed at Kuaishou, processes millions of videos daily using VLM chain-of-thought reasoning and achieved a 20% reduction in the user reporting rate.[38]
Autonomous Systems
In applications like autonomous driving, VLMs enhance situational awareness by interpreting non-standard situations, such as handwritten detour signs, combining visual recognition with reasoning capabilities.[39]
E-commerce and Visual Search
Visual search matches user-uploaded images to products; VLMs enable natural language queries about product images and generate product descriptions automatically.[31]
Education
VLMs can generate step-by-step explanations from diagrams, solve visual math problems, and provide interactive tutoring based on visual content.[40]
Evaluation
Benchmarks
VLMs are assessed on benchmarks measuring multimodal capabilities:
| Benchmark | Tasks | Size | Metrics | Description |
|---|---|---|---|---|
| MMBench | 20 ability dimensions | ~3K questions | Accuracy | Object recognition, OCR, spatial reasoning, chart interpretation[41] |
| MMMU | Multi-discipline reasoning | 11.5K questions | Accuracy | College-level expert knowledge across 6 disciplines[42] |
| MMStar | Vision-indispensable tasks | 1.5K samples | Accuracy | Ensures visual dependency with elite samples[43] |
| VQAv2 | Open-ended VQA | 265K questions | Accuracy | Standard VQA benchmark on COCO images[29] |
| MathVista | Visual math reasoning | 6K problems | Accuracy | Mathematical reasoning in visual contexts[40] |
| OCRBench | Document understanding | Varies | F1-score | Text recognition and understanding[44] |
| Winoground | Compositional reasoning | 400 examples | Text/image/group score | Tests understanding of compositional structures[45] |
| POPE | Object hallucination | 3K questions | Accuracy/F1 | Evaluates tendency to hallucinate objects[46] |
Metrics
- CIDEr: Consensus-based Image Description Evaluation for captioning[47]
- BLEU: Bilingual Evaluation Understudy for text generation quality[48]
- ANLS: Average Normalized Levenshtein Similarity for document understanding (a toy scoring sketch follows this list)[49]
- Accuracy: Standard metric for VQA and classification tasks
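To make one of these metrics concrete, the toy sketch below computes an ANLS-style score; it assumes the commonly used threshold τ = 0.5, and real evaluation scripts typically also lower-case answers and take the best score over several ground-truth answers:

```python
# ANLS: 1 minus the normalized edit distance between prediction and ground
# truth, clipped to 0 when the distance exceeds the threshold tau.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls_score(prediction: str, ground_truth: str, tau: float = 0.5) -> float:
    nl = levenshtein(prediction, ground_truth) / max(len(prediction), len(ground_truth), 1)
    return 1.0 - nl if nl < tau else 0.0

print(anls_score("invoice total", "invoice totals"))   # close answers score near 1
```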
Leaderboards like the Open VLM Leaderboard on Hugging Face rank open-source VLMs across these benchmarks.[50]
Limitations and Challenges
Despite their rapid progress and impressive capabilities, Vision Language Models face several significant limitations:
Visual Hallucination
One of the most critical issues is visual hallucination, where models generate text that is fluent and plausible but factually inconsistent with the provided image. Even advanced models exhibit hallucination rates exceeding 10%, with some open-source models showing rates above 40% on specialized benchmarks.[51] This can manifest as describing objects that are not present, misstating attributes of objects, or misinterpreting relationships between objects.[52]
Spatial Reasoning
Models struggle with basic directional concepts despite processing spatial information. Research has found that VLMs allocate only about 10% of their attention to image tokens, even though images account for roughly 90% of the input sequence length.[53] Poor performance on relations such as "left of" or "above" persists without specialized fine-tuning.
Data Bias and Fairness
VLMs are typically trained on vast datasets scraped from the internet, which inevitably contain societal biases, stereotypes, and problematic content. Foundation models consistently underdiagnose marginalized groups in medical imaging.[34] The CulturalVQA benchmark reveals stronger performance on questions about North American cultures and weaker performance for African and Islamic cultures.[54]
Research has shown that VLMs exhibit strong confirmation bias, where they tend to rely on memorized knowledge from training data rather than analyzing visual evidence. For instance, when shown an image of a dog with a digitally added fifth leg, many VLMs will still confidently state that the dog has four legs.[55]
Computational Requirements
Training and deploying large-scale VLMs is extremely resource-intensive; the training of GPT-4 is estimated to have consumed about 2.1 × 10²⁵ FLOPs. However, parameter-efficient techniques such as LoRA (Low-Rank Adaptation) and quantization make fine-tuning far more accessible (see the sketch below).[56]
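As an illustration of the LoRA idea (a generic sketch of the low-rank update, not the API of any particular fine-tuning library):

```python
# LoRA in miniature: a frozen weight matrix W is augmented with a trainable
# low-rank update (alpha/r) * B @ A, so only r * (d_in + d_out) parameters
# are learned instead of d_in * d_out.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                  # freeze W (and bias)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))     # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(4096, 4096))
print(layer(torch.randn(2, 4096)).shape)    # torch.Size([2, 4096])
```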
Robustness and Generalization
While VLMs show impressive performance on many benchmarks, their robustness can be brittle. Studies have shown that some models struggle with simple image transformations like rotation or color inversion, suggesting a lack of deep, compositional understanding of visual scenes.[57]
Future Directions
The field of Vision Language Models is evolving rapidly, with several key research directions:
Multimodal Reasoning
QVQ-72B-Preview and LLaVA-CoT pioneered open-source multimodal reasoning models that perform autonomous, multi-stage reasoning similar to OpenAI's o1.[58] Future work focuses on developing models with more sophisticated and reliable reasoning abilities, better grounding of language in visual reality, and reduced hallucination.[59]
Long-Context Understanding
LongVILA supports over 1 million token context for long video understanding.[60] Qwen2.5-VL scales context through specialized pretraining. Future models aim to process hour-long videos and complex multi-document visual content.
Vision-Language-Action Models
The integration of action prediction with vision-language understanding enables embodied AI. Models such as RT-2 and OpenVLA (see Robotics and Embodied AI above) show that web-scale vision-language knowledge can transfer to robotic control and generalize to novel objects.[35][36]
Efficiency and On-Device Deployment
Research into model compression, quantization, and knowledge distillation aims to reduce computational requirements. SmolVLM demonstrates ultra-efficient models for edge deployment (256M-2.2B parameters).[61] Mixture-of-Experts architectures like MoE-LLaVA enable selective activation for efficiency.[62]
Any-to-Any Multimodality
The vision-language paradigm is being extended to incorporate additional data types including audio, depth information, thermal imaging, and IMU data. The goal is building "any-to-any" multimodal models that can process and generate information across a wide spectrum of sensory inputs.[63]
Enhanced Safety and Alignment
Ensuring that VLMs behave safely, fairly, and in alignment with human values is critical. This includes developing robust methods to detect and mitigate data biases, prevent generation of harmful content, and improve overall controllability and interpretability.[64]
Commercial Implementations
| Provider | Model | Pricing (per 1M tokens) | Key Features | Context Window |
|---|---|---|---|---|
| OpenAI | GPT-4o | $10 input / $30 output | Best general performance, wide availability | 128K tokens[17] |
| Anthropic | Claude 3.5 Sonnet | $3 input / $15 output | Strong reasoning, safety focus | 200K tokens[18] |
| Gemini 2.0 Flash | $30 per 1M tokens | Native multimodal, video understanding | 1M+ tokens[19] | |
| Microsoft | Azure OpenAI GPT-4V | Variable pricing | Enterprise integration, Azure ecosystem | 128K tokens[65] |
| Amazon | Bedrock Claude 3 | Pay per use | AWS integration, multiple models | 200K tokens[66] |
Open-Source Models
The open-source community has produced numerous high-quality VLMs:
- LLaVA Family: Most influential open-source VLM family (7B-34B parameters), with variants like LLaVA-1.5, LLaVA-NeXT[7]
- Qwen2.5-VL: State-of-the-art open performance with 3B, 7B, and 72B variants[67]
- InternVL2: Scalable from 2B to 108B parameters with progressive alignment[68]
- SmolVLM: Ultra-efficient models for edge deployment (256M-2.2B parameters)[61]
- Phi-3-Vision: Microsoft's efficient 4.2B parameter model[69]
- CogVLM2: 19B parameters with strong performance on various benchmarks[70]
See Also
- Multimodal learning
- Computer vision
- Natural language processing
- Transformer (machine learning model)
- Large language model
- CLIP
- GPT-4
- Visual Question Answering
- Image captioning
- Foundation model
- Zero-shot learning
- Contrastive learning
- Vision Transformer
- Instruction tuning
References
- ↑ 1.0 1.1 1.2 1.3 Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision". OpenAI. https://arxiv.org/abs/2103.00020
- ↑ 2.0 2.1 2.2 Jia, C., et al. (2021). "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision". Google Research. https://arxiv.org/abs/2102.05918
- ↑ 3.0 3.1 3.2 3.3 IBM (2024). "What are vision language models (VLMs)?". IBM Think Blog. https://www.ibm.com/think/topics/vision-language-models
- ↑ 4.0 4.1 4.2 4.3 NVIDIA (2024). "What Are Vision Language Models?". NVIDIA Glossary. https://www.nvidia.com/en-us/glossary/vision-language-models/
- ↑ Dosovitskiy, A., et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". Google Research. https://arxiv.org/abs/2010.11929
- ↑ 6.0 6.1 Hugging Face (2024). "Vision Language Models Explained". Hugging Face Blog. https://huggingface.co/blog/vlms
- ↑ 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 Liu, H., et al. (2023). "Visual Instruction Tuning". University of Wisconsin-Madison. https://arxiv.org/abs/2304.08485
- ↑ Liu, H., et al. (2024). "Improved Baselines with Visual Instruction Tuning". https://arxiv.org/abs/2310.03744
- ↑ 9.0 9.1 9.2 9.3 Alayrac, J.B., et al. (2022). "Flamingo: a Visual Language Model for Few-Shot Learning". DeepMind. https://arxiv.org/abs/2204.14198
- ↑ Li, J., et al. (2023). "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models". Salesforce Research. https://arxiv.org/abs/2301.12597
- ↑ Team Chameleon (2024). "Chameleon: Mixed-Modal Early-Fusion Foundation Models". Meta AI. https://arxiv.org/abs/2405.09818
- ↑ Adept (2023). "Fuyu-8B: A Multimodal Architecture for AI Agents". https://www.adept.ai/blog/fuyu-8b
- ↑ 13.0 13.1 Vinyals, O., et al. (2015). "Show and Tell: A Neural Image Caption Generator". Google. https://arxiv.org/abs/1411.4555
- ↑ Antol, S., et al. (2015). "VQA: Visual Question Answering". ICCV. https://arxiv.org/abs/1505.00468
- ↑ Lu, J., et al. (2019). "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations". NeurIPS. https://arxiv.org/abs/1908.02265
- ↑ Tan, H. & Bansal, M. (2019). "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". EMNLP. https://arxiv.org/abs/1908.07490
- ↑ 17.0 17.1 OpenAI (2023). "GPT-4V(ision) System Card". https://openai.com/research/gpt-4v-system-card
- ↑ 18.0 18.1 Anthropic (2024). "The Claude 3 Model Family: Opus, Sonnet, Haiku". https://www.anthropic.com/news/claude-3-family
- ↑ 19.0 19.1 Google (2023). "Gemini: A Family of Highly Capable Multimodal Models". https://arxiv.org/abs/2312.11805
- ↑ Zhang, K., et al. (2024). "A Comprehensive Survey on Applications of Vision Large Language Models". https://arxiv.org/html/2501.02765v1
- ↑ Singh, A., et al. (2022). "FLAVA: A Foundational Language And Vision Alignment Model". Meta AI. https://arxiv.org/abs/2112.04482
- ↑ Yu, J., et al. (2022). "CoCa: Contrastive Captioners are Image-Text Foundation Models". Google Research. https://arxiv.org/abs/2205.01917
- ↑ Lin, J., et al. (2024). "VILA: On Pre-training for Visual Language Models". NVIDIA. https://arxiv.org/abs/2312.07533
- ↑ Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback". OpenAI. https://arxiv.org/abs/2203.02155
- ↑ Schuhmann, C., et al. (2022). "LAION-5B: An open large-scale dataset for training next generation image-text models". NeurIPS Datasets and Benchmarks. https://arxiv.org/abs/2210.08402
- ↑ Changpinyo, S., et al. (2021). "Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts". Google Research. https://arxiv.org/abs/2102.08981
- ↑ Lin, T.Y., et al. (2014). "Microsoft COCO: Common Objects in Context". Microsoft Research. https://arxiv.org/abs/1405.0312
- ↑ Krishna, R., et al. (2017). "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations". Stanford. https://arxiv.org/abs/1602.07332
- ↑ 29.0 29.1 Goyal, Y., et al. (2017). "Making the V in VQA Matter". Facebook AI Research. https://arxiv.org/abs/1612.00837
- ↑ Zhu, D., et al. (2024). "Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text". https://arxiv.org/abs/2304.06939
- ↑ 31.0 31.1 OpenCV (2025). "Applications of Vision Language Models". https://opencv.org/blog/applications-of-vision-language-models/
- ↑ Gu, X., et al. (2022). "Open-Vocabulary Object Detection via Vision and Language Knowledge Distillation". https://arxiv.org/abs/2104.13921
- ↑ Be My Eyes (2023). "Be My AI powered by GPT-4". https://www.bemyeyes.com/blog/announcing-be-my-ai
- ↑ 34.0 34.1 Zhang, Y., et al. (2024). "Fairness in Medical Foundation Models". Nature Medicine. https://www.nature.com/articles/s41591-023-02778-7
- ↑ 35.0 35.1 Brohan, A., et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control". Google DeepMind. https://arxiv.org/abs/2307.15818
- ↑ 36.0 36.1 Kim, M., et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model". UC Berkeley. https://arxiv.org/abs/2406.09246
- ↑ Mathew, M., et al. (2021). "DocVQA: A Dataset for VQA on Document Images". CVPR. https://arxiv.org/abs/2007.00398
- ↑ Wang, J., et al. (2024). "KuaiMod: A Large-scale Content Moderation Framework". Kuaishou Technology. https://arxiv.org/abs/2404.12709
- ↑ Li, L., et al. (2024). "Vision-Language Models for Autonomous Driving: A Survey". https://arxiv.org/abs/2407.08123
- ↑ 40.0 40.1 Lu, P., et al. (2024). "MathVista: Evaluating Mathematical Reasoning in Visual Contexts". https://arxiv.org/abs/2310.02255
- ↑ Liu, Y., et al. (2023). "MMBench: Is Your Multi-modal Model an All-around Player?". OpenCompass. https://arxiv.org/abs/2307.06281
- ↑ Yue, X., et al. (2024). "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark". https://arxiv.org/abs/2311.16502
- ↑ Chen, L., et al. (2024). "Are We on the Right Way for Evaluating Large Vision-Language Models?". https://arxiv.org/abs/2403.20330
- ↑ Liu, Y., et al. (2024). "OCRBench: Hidden Challenges in OCR for Large Multimodal Models". https://arxiv.org/abs/2305.07895
- ↑ Thrush, T., et al. (2022). "Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality". CVPR. https://arxiv.org/abs/2204.03162
- ↑ Li, Y., et al. (2023). "Evaluating Object Hallucination in Large Vision-Language Models". EMNLP. https://arxiv.org/abs/2305.10355
- ↑ Vedantam, R., et al. (2015). "CIDEr: Consensus-based Image Description Evaluation". CVPR. https://arxiv.org/abs/1411.5726
- ↑ Papineni, K., et al. (2002). "BLEU: a Method for Automatic Evaluation of Machine Translation". ACL. https://aclanthology.org/P02-1040/
- ↑ Biten, A.F., et al. (2019). "Scene Text Visual Question Answering". ICCV. https://arxiv.org/abs/1905.13648
- ↑ Hugging Face (2024). "Open VLM Leaderboard". https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
- ↑ Zhou, Y., et al. (2024). "Analyzing and Mitigating Object Hallucination in Large Vision-Language Models". ICLR. https://arxiv.org/abs/2310.00754
- ↑ Leng, Y., et al. (2024). "Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding". CVPR. https://arxiv.org/abs/2408.10253
- ↑ Kamath, A., et al. (2024). "What's Left and Right in Vision Language Models?". https://arxiv.org/abs/2312.01772
- ↑ Nayak, N., et al. (2024). "CulturalVQA: Benchmarking Cultural Understanding in Vision Language Models". https://arxiv.org/abs/2407.19788
- ↑ VLMs Are Blind (2024). "Vision Language Models Are Blind". https://vlmsareblind.github.io/
- ↑ Hu, E.J., et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models". Microsoft. https://arxiv.org/abs/2106.09685
- ↑ Anis, A. M., et al. (2024). "On the Limitations of Vision-Language Models in Understanding Image Transforms". https://arxiv.org/abs/2503.09837
- ↑ Qwen Team (2024). "QVQ: Multimodal Reasoning at Scale". Alibaba. https://qwenlm.github.io/blog/qvq-72b-preview/
- ↑ Xu, P., et al. (2024). "LLaVA-CoT: Let Vision Language Models Reason Step-by-Step". https://arxiv.org/abs/2411.10440
- ↑ Li, F., et al. (2024). "LongVILA: Scaling Long-Context Visual Language Models". NVIDIA. https://arxiv.org/abs/2408.00400
- ↑ 61.0 61.1 Hugging Face (2024). "SmolVLM: Small Vision Language Models". https://huggingface.co/blog/smolvlm
- ↑ Lin, B., et al. (2024). "MoE-LLaVA: Mixture of Experts for Large Vision-Language Models". https://arxiv.org/abs/2401.15947
- ↑ Bai, J., et al. (2024). "Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action". https://arxiv.org/abs/2312.17172
- ↑ Ji, J., et al. (2024). "AI Alignment: A Comprehensive Survey". https://arxiv.org/abs/2310.19852
- ↑ Microsoft (2024). "Azure OpenAI Service". https://azure.microsoft.com/en-us/products/ai-services/openai-service
- ↑ Amazon (2024). "Amazon Bedrock". https://aws.amazon.com/bedrock/
- ↑ Qwen Team (2024). "Qwen2.5-VL: Frontier Vision-Language Understanding". https://qwenlm.github.io/blog/qwen2.5-vl/
- ↑ Chen, Z., et al. (2024). "InternVL: Scaling up Vision Foundation Models with Large Language Models". https://arxiv.org/abs/2312.14238
- ↑ Abdin, M., et al. (2024). "Phi-3 Technical Report". Microsoft. https://arxiv.org/abs/2404.14219
- ↑ Wang, W., et al. (2024). "CogVLM2: Visual Language Models for Image and Video Understanding". https://arxiv.org/abs/2408.16500