Vision Language Model
A Vision Language Model (VLM), also known as a visual language model, is a type of artificial intelligence system that can understand and process both visual information (images, videos) and textual information (natural language), enabling machines to perceive, reason about, and communicate about visual content.[1] These multimodal models learn rich correlations from billions of web-scale image-text pairs, enabling zero-shot predictions across diverse visual recognition tasks with a single unified architecture.[2]
Unlike traditional computer vision models, which require massive manually labeled datasets for each specific task, VLMs leverage freely available image-text pairs from the internet, fundamentally changing how machines learn to perceive and describe visual content. By combining computer vision and natural language processing (NLP), these models learn to map the complex relationships between images or videos and their corresponding text descriptions.[3] VLMs are a class of foundation model that can be adapted to a wide range of vision-and-language tasks, from describing photographs and summarizing videos to aiding visual search and human-computer interaction.[4]
Architecture
Core Components
Most contemporary Vision Language Models are constructed using a modular, three-part architecture that combines specialized components for vision, language, and the interface between them.[4]
Vision Encoder: The vision encoder is the component responsible for "seeing." It processes visual input through architectures such as the Vision Transformer (ViT) or CNNs, dividing images into patches (typically 16×16 pixels) that are flattened into vectors and processed through transformer layers.[5] With 16×16 patches, a 224×224 image yields a 14×14 grid of 196 patches, each linearly projected to an embedding vector (768-dimensional in ViT-Base). These visual features then flow through 12–24 transformer layers of multi-head self-attention, producing rich representations that capture objects, attributes, spatial relationships, and semantic content.[6] Many state-of-the-art VLMs, including LLaVA, use pre-trained ViT-based encoders from CLIP for their powerful and generalizable visual representations.[7]
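The patch arithmetic above can be made concrete with a short, generic sketch (PyTorch is used purely for illustration; the dimensions correspond to a ViT-Base-style configuration and are not tied to any specific VLM):

```python
# Illustrative ViT-style patch embedding: a 224x224 RGB image is split into
# 16x16 patches, each linearly projected to a 768-dimensional token,
# giving a 14x14 = 196-token grid.
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768
# A Conv2d whose kernel and stride equal the patch size is equivalent to
# flattening each patch and applying a shared linear projection.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)           # a batch with one RGB image
tokens = patch_embed(image)                   # shape (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)    # shape (1, 196, 768): 196 visual tokens
print(tokens.shape)                           # torch.Size([1, 196, 768])
```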
Language Model (LLM): The Large Language Model serves as the cognitive backbone of the VLM. It is responsible for high-level reasoning, contextual understanding, and generating coherent textual output.[4] Typically a large pre-trained transformer like Vicuna, LLaMA, Gemma, or GPT-4, the LLM processes the projected visual tokens alongside text tokens. During generation, the model autoregressively predicts the next token based on the full multimodal context, treating vision-language understanding as sequence prediction where images become another type of token in the language model's input stream.[7]
Vision-Language Connector: The connector, also referred to as a projector or bridge, is a critical component that links the vision encoder and the language model. Its function is to translate the visual embeddings produced by the vision encoder into a format that is compatible with the LLM's input space.[4] Common connector architectures include:
- Simple Linear Projection: A lightweight and data-efficient approach in which a single trainable linear layer maps the visual feature space to the LLM's word embedding space, used by models like LLaVA and PaliGemma (see the sketch after this list).[7]
- Multi-Layer Perceptron (MLP): A slightly more complex connector using a small neural network with one or more hidden layers, employed in models like LLaVA-1.5.[8]
- Cross-Attention Layers: More deeply integrated approach where new cross-attention layers are added directly into the LLM's architecture, used by Flamingo and Llama 3.2 Vision.[9]
- Q-Former: BLIP-2's sophisticated fusion mechanism using 188 million trainable parameters with 32 learnable query embeddings to map visual features into the language model's space through cross-attention.[10]
- Perceiver Resampler: A specialized module used by Flamingo that takes a potentially large and variable number of visual features and uses an attention-based mechanism to "distill" them into a smaller, fixed number of latent visual tokens.[9]
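A minimal sketch of the simplest of these connectors, a LLaVA-style linear projection, is shown below. The dimensions are illustrative assumptions (CLIP ViT-L/14-like features feeding a 7B-class LLM), not any model's exact configuration:

```python
# Hypothetical linear connector: visual features are projected into the LLM's
# word-embedding space and concatenated with the embedded text tokens, so the
# language model can attend over both modalities as one sequence.
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096              # assumed encoder and LLM widths
connector = nn.Linear(vision_dim, llm_dim)    # the only newly trained module here

visual_features = torch.randn(1, 196, vision_dim)  # one image, 196 patch tokens
text_embeddings = torch.randn(1, 32, llm_dim)      # 32 already-embedded prompt tokens

visual_tokens = connector(visual_features)          # (1, 196, 4096)
multimodal_input = torch.cat([visual_tokens, text_embeddings], dim=1)
print(multimodal_input.shape)                       # (1, 228, 4096), fed to the LLM
```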
Fusion Strategies
Three primary fusion strategies enable different architectural trade-offs:
Early Fusion: Combines visual and textual inputs before deep processing. Models like Chameleon use VQ-VAE to tokenize images into discrete tokens (1024 tokens per 256×256 image), processing them alongside text in a single decoder.[11] Fuyu-8B exemplifies early fusion by directly feeding image patches into the decoder model without a separate vision encoder.[12]
Late Fusion: Processes modalities independently and combines them only at the output level. CLIP epitomizes this strategy, training separate image and text encoders that project into a shared embedding space (512-dimensional in the base configuration), where cosine similarity measures alignment.[1]
Intermediate Fusion: Injects visual information into the language model through cross-attention and has emerged as the dominant approach. Flamingo's architecture freezes both the vision encoder and the language model, training only the Perceiver Resampler and gated cross-attention layers; a tanh-gating mechanism initialized to zero allows stable training by gradually opening the gate during learning.[9]
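The tanh-gating mechanism can be illustrated with a simplified module (an illustrative sketch only; Flamingo's actual block also contains a gated feed-forward layer and operates on latents produced by the Perceiver Resampler):

```python
# Sketch of gated cross-attention: text hidden states attend over visual
# tokens, and the result is blended in through a tanh gate that starts at
# zero, so the frozen LLM's behaviour is initially unchanged.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: no-op at initialization

    def forward(self, text_tokens, visual_tokens):
        attended, _ = self.attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        return text_tokens + torch.tanh(self.gate) * attended   # residual, gated blend

layer = GatedCrossAttention(dim=512)
text = torch.randn(2, 16, 512)       # language-model hidden states
vision = torch.randn(2, 64, 512)     # resampled visual latents
print(layer(text, vision).shape)     # torch.Size([2, 16, 512])
```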
History
Early Concepts and Precursors (2010s–2020)
Research at the intersection of vision and language gained significant momentum around 2015, with early efforts focusing on specific tasks like image captioning and visual question answering.[13] These initial models typically paired a Convolutional Neural Network (CNN) for visual feature extraction with a Recurrent Neural Network (RNN) for language generation.[3]
Initial efforts included:
- Neural Image Captioners (2014–2015): Early models like Show and Tell (Google, 2015) combined a CNN image encoder with an RNN language decoder to generate captions, demonstrating the first end-to-end learned image description system.[13]
- Visual Question Answering Models (2015): The first VQA models used CNNs for images and sequence models for questions to output answers, coinciding with the introduction of the VQA dataset.[14]
- Transformer-based Models (2019): Models like ViLBERT and LXMERT extended the BERT architecture to multimodal input, using two-stream transformers with cross-attention for fusion.[15][16]
The Rise of Contrastive Pre-training (2021)
The year 2021 marked a pivotal moment for VLMs with the introduction of models that leveraged large-scale contrastive pre-training. This approach represented a fundamental shift away from relying on smaller, meticulously human-labeled datasets like ImageNet toward using the vast and noisy data available on the internet.
CLIP (Contrastive Language–Image Pre-training): Developed by OpenAI and released in February 2021, CLIP was trained on 400 million web-scraped image-text pairs and achieved zero-shot ImageNet accuracy matching that of a supervised ResNet-50. Its core innovation was a contrastive learning objective: the model learns to maximize the similarity between the embeddings of a correct image-text pair while minimizing the similarity with all other pairs in a batch.[1]
ALIGN: Shortly after CLIP, Google researchers introduced ALIGN, which embraced noisy web data at an unprecedented scale of 1.8 billion image-text pairs, demonstrating that massive scale can compensate for data noise.[2]
Integration with Large Language Models (2022)
April 2022 brought Flamingo from DeepMind, marking the shift from task-specific fine-tuning toward in-context few-shot learning. The 80-billion-parameter model processed arbitrarily interleaved sequences of images, videos, and text, and was trained on MultiModal MassiveWeb (M3W), a dataset of 43 million webpages.[9]
Instruction Tuning Era (2023)
The instruction-tuning revolution arrived in April 2023 with LLaVA (Large Language and Vision Assistant), which connected a CLIP ViT-L/14 encoder to Vicuna through a simple linear projection. LLaVA's key innovation was using GPT-4 to generate 158,000 multimodal instruction-following samples, pioneering instruction tuning in the multimodal domain.[7]
September 2023 marked VLMs entering the mainstream with GPT-4V (GPT-4 with Vision), OpenAI's multimodal extension integrated into ChatGPT, bringing advanced vision capabilities to millions of users.[17]
Current Generation (2024–2025)
March 2024 saw Anthropic release Claude 3 with three variants (Haiku, Sonnet, Opus) all featuring native vision capabilities and 200K token context windows.[18] Google's Gemini family emerged as the first major models trained multimodally from the start, with Gemini 1.0 Ultra achieving human-expert performance on MMLU (90.0%).[19]
Training
Pre-training Objectives
The goal of pre-training is to establish a fundamental alignment between the vision and language encoders using massive, often web-scale, datasets.[20]
Contrastive Learning: A foundational pre-training technique where the model is presented with a batch of image-text pairs and learns to distinguish between corresponding (positive) pairs and non-corresponding (negative) pairs. The training objective, often implemented with a contrastive loss function like InfoNCE, is to pull the vector embeddings of positive pairs closer together in a shared embedding space while pushing the embeddings of negative pairs farther apart.[1]
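A minimal sketch of a symmetric, CLIP-style InfoNCE loss follows (illustrative only; production implementations typically use a learned temperature and very large, distributed batches):

```python
# Contrastive loss over a batch of N image-text pairs: after L2-normalization,
# matching pairs lie on the diagonal of the N x N similarity matrix, and
# cross-entropy pulls them together while pushing mismatched pairs apart.
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # (N, N) cosine similarities
    targets = torch.arange(logits.size(0))             # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i2t + loss_t2i) / 2

print(clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```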
Masked Modeling: Inspired by models like BERT, this technique involves randomly hiding or "masking" a portion of the input and training the model to predict or reconstruct the missing part (a toy masking sketch follows the list below). This can be applied to either modality:
- Masked Language Modeling (MLM): The model predicts masked words based on visual context and surrounding text
- Masked Image Modeling (MIM): The model reconstructs missing image patches based on textual context and visible parts[21]
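The masking step itself is simple; the toy sketch below shows it for text (the token ids and the 15% masking ratio are assumptions borrowed from BERT-style practice, and masked image modeling works analogously on patch embeddings):

```python
# Toy masked-language-modelling corruption: ~15% of positions are replaced by
# a mask token, and the model is trained to recover the original ids from the
# remaining text plus the visual context.
import torch

token_ids = torch.tensor([101, 2009, 2003, 1037, 3899, 102])  # toy token ids
mask_token_id = 103                                           # hypothetical [MASK] id
mask = torch.rand(token_ids.shape) < 0.15                     # choose ~15% of positions
corrupted = token_ids.clone()
corrupted[mask] = mask_token_id                               # inputs seen by the model
print(corrupted, token_ids[mask])                             # targets are the hidden ids
```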
Generative Objectives: Models learn to autoregressively generate captions or images from multimodal inputs, as demonstrated by CoCa and Flamingo.[22]
Supervised Fine-tuning
Supervised fine-tuning adapts aligned models to specific tasks and instruction-following behavior. LLaVA's approach uses 558K image-text pairs for initial alignment followed by 158K instruction samples for task adaptation.[7]
Many modern VLMs employ a two-stage fine-tuning process (sketched in code after the list):
- Feature Alignment: Only the vision-language connector is trained while the vision encoder and LLM remain frozen
- End-to-end Fine-tuning: Both the connector and the LLM (or a subset of its parameters) are trained on instruction-following data[7]
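The freezing schedule behind these two stages can be written in a few lines. The module names below are hypothetical placeholders rather than the API of any real training framework:

```python
# Stage 1 (feature alignment): only the connector is trainable.
# Stage 2 (end-to-end fine-tuning): the LLM (or a subset of it) is unfrozen too;
# the vision encoder typically stays frozen throughout.
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad_(trainable)

def configure_stage(vision_encoder, connector, llm, stage: int) -> None:
    set_trainable(vision_encoder, False)     # frozen in both stages
    set_trainable(connector, True)           # trained in both stages
    set_trainable(llm, stage == 2)           # unfrozen only for stage 2

# Placeholder modules standing in for the real components:
vision_encoder, connector, llm = nn.Identity(), nn.Linear(1024, 4096), nn.Linear(4096, 4096)
configure_stage(vision_encoder, connector, llm, stage=1)   # feature alignment
configure_stage(vision_encoder, connector, llm, stage=2)   # instruction tuning
```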
VILA research demonstrated that interleaving text-only instruction data with vision-language data during SFT remedies text-only task degradation.[23]
Reinforcement Learning
Methods like RLHF (Reinforcement Learning from Human Feedback) and RLVR (Reinforcement Learning with Verifiable Rewards) fine-tune aligned models for safety, helpfulness, and accuracy.[24]
Datasets
Vision-language models are typically trained on large sets of image–text pairs, using various training objectives to learn joint representations.
| Dataset | Size | Description | Usage |
|---|---|---|---|
| LAION-5B | 5.85B pairs | Web-scraped image-text pairs with CLIP filtering, multilingual | Pre-training[25] |
| ALIGN Dataset | 1.8B pairs | Noisy web data from ALT-text | Contrastive alignment[2] |
| Conceptual Captions | CC3M: 3.3M, CC12M: 12.4M | Curated web captions, cleaned and filtered | Pre-training and fine-tuning[26] |
| COCO | 330K images | Image captioning with 5 captions per image, 1.5M object instances | Evaluation/Fine-tuning[27] |
| Visual Genome | 108K images | Dense annotations (regions, relations, scene graphs) | Grounding and detailed understanding[28] |
| VQAv2 | 1.1M questions | Visual question answering on COCO images | Evaluation/Fine-tuning[29] |
| LLaVA-Instruct | 158K samples | GPT-4-generated instruction-following conversations | Instruction tuning[7] |
| MMC4 | 100M+ documents | Web documents with images interleaved into text, built on the C4 corpus | Interleaved pre-training[30] |
Notable Vision-Language Models
Research in vision-language modeling has progressed rapidly, with numerous notable models marking significant milestones:
| Model | Developer | Year | Parameters | Key Features | License |
|---|---|---|---|---|---|
| CLIP | OpenAI | 2021 | Varies (ViT-B/32: 63M to ViT-L/14: 427M) | Zero-shot classification, contrastive learning, dual encoder | Open |
| ALIGN | Google | 2021 | ~650M | Noisy web-scale alignment, 1.8B training pairs | Research |
| Flamingo | DeepMind | 2022 | 3B-80B | Few-shot VQA, perceiver resampler, interleaved inputs | Research |
| BLIP-2 | Salesforce | 2023 | 2.7B–13B | Bootstrapped pretraining, Q-Former, frozen LLMs | Apache 2.0 |
| LLaVA | Liu et al. | 2023 | 7B–34B | Instruction-tuned, chat capabilities, simple projection | Apache 2.0 |
| Kosmos-1/2 | Microsoft | 2023 | 1.3B-1.6B | Grounding, zero-shot detection, end-to-end training | MIT |
| PaLM-E | Google | 2023 | 562B | Embodied reasoning, robotics integration | Research |
| Qwen-VL | Alibaba | 2023 | 7B | Multilingual, OCR, long video understanding | Apache 2.0 |
| MiniGPT-4 | Zhu et al. | 2023 | 7B | LLM alignment via projector, efficient adaptation | BSD 3-Clause |
| GPT-4V/GPT-4o | OpenAI | 2023-2024 | Undisclosed | Real-time multimodality, best commercial performance | Proprietary |
| Gemini | Google | 2024 | Undisclosed | Video understanding, long context (1M+ tokens), native multimodal training | Proprietary |
| Claude 3/3.5 | Anthropic | 2024 | Undisclosed | Strong reasoning, 200K context window, safety focus | Proprietary |
| Llama 3.2 Vision | Meta | 2024 | 11B–90B | Open-source, efficient, cross-attention architecture | Llama License |
| Qwen2.5-VL | Alibaba | 2025 | 3B–72B | Advanced OCR, multilingual, state-of-the-art open performance | Apache 2.0 |
| DeepSeek-VL | DeepSeek | 2024 | 7B | High-resolution support, efficient training | MIT |
| PaliGemma | Google | 2024 | 3B | Strong transferable performance, SigLIP encoder | Apache 2.0 |
| InternVL2 | Shanghai AI Lab | 2024 | 2B–108B | Progressive alignment, dynamic high-resolution | Apache 2.0 |
Applications
VLMs have broad applications wherever visual content needs to be interpreted or generated in conjunction with language.
Visual Understanding Tasks
- Image and Video Captioning: Generating concise and accurate textual descriptions for images or videos, describing actions, interactions between objects, and overall scene context.[3][31]
- Visual Question Answering (VQA): Answering natural language questions about images, ranging from simple identification queries to complex reasoning questions requiring spatial understanding or inference.[3]
- Object Detection and Segmentation: VLMs enable "open-vocabulary" object detection, where models can detect and localize objects described in free-form text, even for categories not in training data.[32]
- Optical Character Recognition (OCR): Reading and transcribing text embedded within images, such as text on street signs, documents, or product labels.[6]
Real-World Applications
Accessibility and Assistive Technology
VLMs power accessibility technologies that make digital content inclusive. Systems like Be My Eyes use GPT-4V to provide real-time descriptions of environments and objects through smartphone cameras for visually impaired users.[33] These applications can describe surroundings in real-time, read text from documents or product labels, and answer questions about visual content.
Healthcare and Medical Imaging
VLMs assist healthcare professionals by analyzing medical images like X-rays or CT scans and generating preliminary reports. Applications include automated radiology report generation and medical VQA systems. However, research reveals concerning fairness gaps, with foundation models consistently underdiagnosing marginalized groups compared to board-certified radiologists.[34]
Robotics and Embodied AI
VLMs form the perceptual core of Vision-Language-Action (VLA) models, allowing robots to understand natural language commands within physical environments. RT-2 unified vision, language, and action tokens, achieving 63% improvement on novel objects.[35] OpenVLA provides open-source VLA trained on 970k robot demonstrations.[36]
Document Understanding
VLMs excel at extracting information from structured documents, interpreting charts and graphs, and understanding document layouts. DocVQA benchmarks document comprehension with 12,000+ document images and 50,000+ questions.[37]
Content Moderation
VLMs can more accurately detect harmful or inappropriate content on social media platforms by analyzing the combined context of images and accompanying text. The KuaiMod framework, deployed at Kuaishou, processes millions of videos daily using VLM chain-of-thought reasoning and achieved a 20% reduction in the user reporting rate.[38]
Autonomous Systems
In applications like autonomous driving, VLMs enhance situational awareness by interpreting non-standard situations, such as handwritten detour signs, combining visual recognition with reasoning capabilities.[39]
E-commerce and Visual Search
Visual search matches user-uploaded images to products; VLMs enable natural language queries about product images and generate product descriptions automatically.[31]
Education
VLMs can generate step-by-step explanations from diagrams, solve visual math problems, and provide interactive tutoring based on visual content.[40]
Evaluation
Benchmarks
VLMs are assessed on benchmarks measuring multimodal capabilities:
| Benchmark | Tasks | Size | Metrics | Description |
|---|---|---|---|---|
| MMBench | 20 ability dimensions | ~3K questions | Accuracy | Object recognition, OCR, spatial reasoning, chart interpretation[41] |
| MMMU | Multi-discipline reasoning | 11.5K questions | Accuracy | College-level expert knowledge across 6 disciplines[42] |
| MMStar | Vision-indispensable tasks | 1.5K samples | Accuracy | Ensures visual dependency with elite samples[43] |
| VQAv2 | Open-ended VQA | 265K questions | Accuracy | Standard VQA benchmark on COCO images[29] |
| MathVista | Visual math reasoning | 6K problems | Accuracy | Mathematical reasoning in visual contexts[40] |
| OCRBench | Document understanding | Varies | F1-score | Text recognition and understanding[44] |
| Winoground | Compositional reasoning | 400 examples | Text/image/group score | Tests understanding of compositional structures[45] |
| POPE | Object hallucination | 3K questions | Accuracy/F1 | Evaluates tendency to hallucinate objects[46] |
Metrics
- CIDEr: Consensus-based Image Description Evaluation for captioning[47]
- BLEU: Bilingual Evaluation Understudy for text generation quality[48]
- ANLS: Average Normalized Levenshtein Similarity for document understanding (a toy scoring sketch follows this list)[49]
- Accuracy: Standard metric for VQA and classification tasks
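To make one of these metrics concrete, the toy sketch below computes an ANLS-style score; it assumes the commonly used threshold τ = 0.5, and real evaluation scripts typically also lower-case answers and take the best score over several ground-truth answers:

```python
# ANLS: 1 minus the normalized edit distance between prediction and ground
# truth, clipped to 0 when the distance exceeds the threshold tau.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls_score(prediction: str, ground_truth: str, tau: float = 0.5) -> float:
    nl = levenshtein(prediction, ground_truth) / max(len(prediction), len(ground_truth), 1)
    return 1.0 - nl if nl < tau else 0.0

print(anls_score("invoice total", "invoice totals"))   # close answers score near 1
```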
Leaderboards like the Open VLM Leaderboard on Hugging Face rank open-source VLMs across these benchmarks.[50]
Limitations and Challenges
Despite their rapid progress and impressive capabilities, Vision Language Models face several significant limitations:
Visual Hallucination
One of the most critical issues is visual hallucination, where models generate text that is fluent and plausible but factually inconsistent with the provided image. Even advanced models exhibit hallucination rates exceeding 10%, with some open-source models showing rates above 40% on specialized benchmarks.[51] This can manifest as describing objects that are not present, misstating attributes of objects, or misinterpreting relationships between objects.[52]
Spatial Reasoning
Models struggle with basic directional concepts despite processing spatial information. Research has found that VLMs allocate only about 10% of their attention to image tokens, even though images account for roughly 90% of the input sequence length.[53] Poor performance on relations such as "left of" or "above" persists without specialized fine-tuning.
Data Bias and Fairness
VLMs are typically trained on vast datasets scraped from the internet, which inevitably contain societal biases, stereotypes, and problematic content. Foundation models consistently underdiagnose marginalized groups in medical imaging.[34] The CulturalVQA benchmark reveals stronger performance on questions about North American cultures and weaker performance for African and Islamic cultures.[54]
Research has shown that VLMs exhibit strong confirmation bias, where they tend to rely on memorized knowledge from training data rather than analyzing visual evidence. For instance, when shown an image of a dog with a digitally added fifth leg, many VLMs will still confidently state that the dog has four legs.[55]
Computational Requirements
Training and deploying large-scale VLMs is extremely resource-intensive; the training of GPT-4 is estimated to have consumed about 2.1 × 10²⁵ FLOPs. However, parameter-efficient techniques such as LoRA (Low-Rank Adaptation) and quantization make fine-tuning far more accessible (see the sketch below).[56]
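As an illustration of the LoRA idea (a generic sketch of the low-rank update, not the API of any particular fine-tuning library):

```python
# LoRA in miniature: a frozen weight matrix W is augmented with a trainable
# low-rank update (alpha/r) * B @ A, so only r * (d_in + d_out) parameters
# are learned instead of d_in * d_out.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                  # freeze W (and bias)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))     # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(4096, 4096))
print(layer(torch.randn(2, 4096)).shape)    # torch.Size([2, 4096])
```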
Robustness and Generalization
While VLMs show impressive performance on many benchmarks, their robustness can be brittle. Studies have shown that some models struggle with simple image transformations like rotation or color inversion, suggesting a lack of deep, compositional understanding of visual scenes.[57]
Future Directions
The field of Vision Language Models is evolving rapidly, with several key research directions:
Multimodal Reasoning
QVQ-72B-Preview and LLaVA-CoT pioneered open-source multimodal reasoning models that perform autonomous, multi-stage reasoning similar to OpenAI's o1.[58] Future work focuses on developing models with more sophisticated and reliable reasoning abilities, better grounding of language in visual reality, and reduced hallucination.[59]
Long-Context Understanding
LongVILA supports over 1 million token context for long video understanding.[60] Qwen2.5-VL scales context through specialized pretraining. Future models aim to process hour-long videos and complex multi-document visual content.
Vision-Language-Action Models
The integration of action prediction with vision-language understanding enables embodied AI. Models such as RT-2 and OpenVLA (see Robotics and Embodied AI above) show that web-scale vision-language knowledge can transfer to robotic control and generalize to novel objects.[35][36]
Efficiency and On-Device Deployment
Research into model compression, quantization, and knowledge distillation aims to reduce computational requirements. SmolVLM demonstrates ultra-efficient models for edge deployment (256M-2.2B parameters).[61] Mixture-of-Experts architectures like MoE-LLaVA enable selective activation for efficiency.[62]
Any-to-Any Multimodality
The vision-language paradigm is being extended to incorporate additional data types including audio, depth information, thermal imaging, and IMU data. The goal is building "any-to-any" multimodal models that can process and generate information across a wide spectrum of sensory inputs.[63]
Enhanced Safety and Alignment
Ensuring that VLMs behave safely, fairly, and in alignment with human values is critical. This includes developing robust methods to detect and mitigate data biases, prevent generation of harmful content, and improve overall controllability and interpretability.[64]
Commercial Implementations
| Provider | Model | Pricing (per 1M tokens) | Key Features | Context Window |
|---|---|---|---|---|
| OpenAI | GPT-4o | $10 input / $30 output | Best general performance, wide availability | 128K tokens[17] |
| Anthropic | Claude 3.5 Sonnet | $3 input / $15 output | Strong reasoning, safety focus | 200K tokens[18] |
| Gemini 2.0 Flash | $30 per 1M tokens | Native multimodal, video understanding | 1M+ tokens[19] | |
| Microsoft | Azure OpenAI GPT-4V | Variable pricing | Enterprise integration, Azure ecosystem | 128K tokens[65] |
| Amazon | Bedrock Claude 3 | Pay per use | AWS integration, multiple models | 200K tokens[66] |
Open-Source Models
The open-source community has produced numerous high-quality VLMs:
- LLaVA Family: Most influential open-source VLM family (7B-34B parameters), with variants like LLaVA-1.5, LLaVA-NeXT[7]
- Qwen2.5-VL: State-of-the-art open performance with 3B, 7B, and 72B variants[67]
- InternVL2: Scalable from 2B to 108B parameters with progressive alignment[68]
- SmolVLM: Ultra-efficient models for edge deployment (256M-2.2B parameters)[61]
- Phi-3-Vision: Microsoft's efficient 4.2B parameter model[69]
- CogVLM2: 19B parameters with strong performance on various benchmarks[70]
See Also
- Multimodal learning
- Computer vision
- Natural language processing
- Transformer (machine learning model)
- Large language model
- CLIP
- GPT-4
- Visual Question Answering
- Image captioning
- Foundation model
- Zero-shot learning
- Contrastive learning
- Vision Transformer
- Instruction tuning
References
- ↑ 1.0 1.1 1.2 1.3 Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision". OpenAI. https://arxiv.org/abs/2103.00020
- ↑ 2.0 2.1 2.2 Jia, C., et al. (2021). "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision". Google Research. https://arxiv.org/abs/2102.05918
- ↑ 3.0 3.1 3.2 3.3 IBM (2024). "What are vision language models (VLMs)?". IBM Think Blog. https://www.ibm.com/think/topics/vision-language-models
- ↑ 4.0 4.1 4.2 4.3 NVIDIA (2024). "What Are Vision Language Models?". NVIDIA Glossary. https://www.nvidia.com/en-us/glossary/vision-language-models/
- ↑ Dosovitskiy, A., et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". Google Research. https://arxiv.org/abs/2010.11929
- ↑ 6.0 6.1 Hugging Face (2024). "Vision Language Models Explained". Hugging Face Blog. https://huggingface.co/blog/vlms
- ↑ 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 Liu, H., et al. (2023). "Visual Instruction Tuning". University of Wisconsin-Madison. https://arxiv.org/abs/2304.08485
- ↑ Liu, H., et al. (2024). "Improved Baselines with Visual Instruction Tuning". https://arxiv.org/abs/2310.03744
- ↑ 9.0 9.1 9.2 9.3 Alayrac, J.B., et al. (2022). "Flamingo: a Visual Language Model for Few-Shot Learning". DeepMind. https://arxiv.org/abs/2204.14198
- ↑ Li, J., et al. (2023). "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models". Salesforce Research. https://arxiv.org/abs/2301.12597
- ↑ Team Chameleon (2024). "Chameleon: Mixed-Modal Early-Fusion Foundation Models". Meta AI. https://arxiv.org/abs/2405.09818
- ↑ Adept (2023). "Fuyu-8B: A Multimodal Architecture for AI Agents". https://www.adept.ai/blog/fuyu-8b
- ↑ 13.0 13.1 Vinyals, O., et al. (2015). "Show and Tell: A Neural Image Caption Generator". Google. https://arxiv.org/abs/1411.4555
- ↑ Antol, S., et al. (2015). "VQA: Visual Question Answering". ICCV. https://arxiv.org/abs/1505.00468
- ↑ Lu, J., et al. (2019). "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations". NeurIPS. https://arxiv.org/abs/1908.02265
- ↑ Tan, H. & Bansal, M. (2019). "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". EMNLP. https://arxiv.org/abs/1908.07490
- ↑ 17.0 17.1 OpenAI (2023). "GPT-4V(ision) System Card". https://openai.com/research/gpt-4v-system-card
- ↑ 18.0 18.1 Anthropic (2024). "The Claude 3 Model Family: Opus, Sonnet, Haiku". https://www.anthropic.com/news/claude-3-family
- ↑ 19.0 19.1 Google (2023). "Gemini: A Family of Highly Capable Multimodal Models". https://arxiv.org/abs/2312.11805
- ↑ Zhang, K., et al. (2024). "A Comprehensive Survey on Applications of Vision Large Language Models". https://arxiv.org/html/2501.02765v1
- ↑ Singh, A., et al. (2022). "FLAVA: A Foundational Language And Vision Alignment Model". Meta AI. https://arxiv.org/abs/2112.04482
- ↑ Yu, J., et al. (2022). "CoCa: Contrastive Captioners are Image-Text Foundation Models". Google Research. https://arxiv.org/abs/2205.01917
- ↑ Lin, J., et al. (2024). "VILA: On Pre-training for Visual Language Models". NVIDIA. https://arxiv.org/abs/2312.07533
- ↑ Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback". OpenAI. https://arxiv.org/abs/2203.02155
- ↑ Schuhmann, C., et al. (2022). "LAION-5B: An open large-scale dataset for training next generation image-text models". NeurIPS Datasets and Benchmarks. https://arxiv.org/abs/2210.08402
- ↑ Changpinyo, S., et al. (2021). "Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts". Google Research. https://arxiv.org/abs/2102.08981
- ↑ Lin, T.Y., et al. (2014). "Microsoft COCO: Common Objects in Context". Microsoft Research. https://arxiv.org/abs/1405.0312
- ↑ Krishna, R., et al. (2017). "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations". Stanford. https://arxiv.org/abs/1602.07332
- ↑ 29.0 29.1 Goyal, Y., et al. (2017). "Making the V in VQA Matter". Facebook AI Research. https://arxiv.org/abs/1612.00837
- ↑ Zhu, D., et al. (2024). "Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text". https://arxiv.org/abs/2304.06939
- ↑ 31.0 31.1 OpenCV (2025). "Applications of Vision Language Models". https://opencv.org/blog/applications-of-vision-language-models/
- ↑ Gu, X., et al. (2022). "Open-Vocabulary Object Detection via Vision and Language Knowledge Distillation". https://arxiv.org/abs/2104.13921
- ↑ Be My Eyes (2023). "Be My AI powered by GPT-4". https://www.bemyeyes.com/blog/announcing-be-my-ai
- ↑ 34.0 34.1 Zhang, Y., et al. (2024). "Fairness in Medical Foundation Models". Nature Medicine. https://www.nature.com/articles/s41591-023-02778-7
- ↑ 35.0 35.1 Brohan, A., et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control". Google DeepMind. https://arxiv.org/abs/2307.15818
- ↑ 36.0 36.1 Kim, M., et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model". UC Berkeley. https://arxiv.org/abs/2406.09246
- ↑ Mathew, M., et al. (2021). "DocVQA: A Dataset for VQA on Document Images". CVPR. https://arxiv.org/abs/2007.00398
- ↑ Wang, J., et al. (2024). "KuaiMod: A Large-scale Content Moderation Framework". Kuaishou Technology. https://arxiv.org/abs/2404.12709
- ↑ Li, L., et al. (2024). "Vision-Language Models for Autonomous Driving: A Survey". https://arxiv.org/abs/2407.08123
- ↑ 40.0 40.1 Lu, P., et al. (2024). "MathVista: Evaluating Mathematical Reasoning in Visual Contexts". https://arxiv.org/abs/2310.02255
- ↑ Liu, Y., et al. (2023). "MMBench: Is Your Multi-modal Model an All-around Player?". OpenCompass. https://arxiv.org/abs/2307.06281
- ↑ Yue, X., et al. (2024). "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark". https://arxiv.org/abs/2311.16502
- ↑ Chen, L., et al. (2024). "Are We on the Right Way for Evaluating Large Vision-Language Models?". https://arxiv.org/abs/2403.20330
- ↑ Liu, Y., et al. (2024). "OCRBench: Hidden Challenges in OCR for Large Multimodal Models". https://arxiv.org/abs/2305.07895
- ↑ Thrush, T., et al. (2022). "Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality". CVPR. https://arxiv.org/abs/2204.03162
- ↑ Li, Y., et al. (2023). "Evaluating Object Hallucination in Large Vision-Language Models". EMNLP. https://arxiv.org/abs/2305.10355
- ↑ Vedantam, R., et al. (2015). "CIDEr: Consensus-based Image Description Evaluation". CVPR. https://arxiv.org/abs/1411.5726
- ↑ Papineni, K., et al. (2002). "BLEU: a Method for Automatic Evaluation of Machine Translation". ACL. https://aclanthology.org/P02-1040/
- ↑ Biten, A.F., et al. (2019). "Scene Text Visual Question Answering". ICCV. https://arxiv.org/abs/1905.13648
- ↑ Hugging Face (2024). "Open VLM Leaderboard". https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
- ↑ Zhou, Y., et al. (2024). "Analyzing and Mitigating Object Hallucination in Large Vision-Language Models". ICLR. https://arxiv.org/abs/2310.00754
- ↑ Leng, Y., et al. (2024). "Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding". CVPR. https://arxiv.org/abs/2408.10253
- ↑ Kamath, A., et al. (2024). "What's Left and Right in Vision Language Models?". https://arxiv.org/abs/2312.01772
- ↑ Nayak, N., et al. (2024). "CulturalVQA: Benchmarking Cultural Understanding in Vision Language Models". https://arxiv.org/abs/2407.19788
- ↑ VLMs Are Blind (2024). "Vision Language Models Are Blind". https://vlmsareblind.github.io/
- ↑ Hu, E.J., et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models". Microsoft. https://arxiv.org/abs/2106.09685
- ↑ Anis, A. M., et al. (2024). "On the Limitations of Vision-Language Models in Understanding Image Transforms". https://arxiv.org/abs/2503.09837
- ↑ Qwen Team (2024). "QVQ: Multimodal Reasoning at Scale". Alibaba. https://qwenlm.github.io/blog/qvq-72b-preview/
- ↑ Xu, P., et al. (2024). "LLaVA-CoT: Let Vision Language Models Reason Step-by-Step". https://arxiv.org/abs/2411.10440
- ↑ Li, F., et al. (2024). "LongVILA: Scaling Long-Context Visual Language Models". NVIDIA. https://arxiv.org/abs/2408.00400
- ↑ 61.0 61.1 Hugging Face (2024). "SmolVLM: Small Vision Language Models". https://huggingface.co/blog/smolvlm
- ↑ Lin, B., et al. (2024). "MoE-LLaVA: Mixture of Experts for Large Vision-Language Models". https://arxiv.org/abs/2401.15947
- ↑ Bai, J., et al. (2024). "Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action". https://arxiv.org/abs/2312.17172
- ↑ Ji, J., et al. (2024). "AI Alignment: A Comprehensive Survey". https://arxiv.org/abs/2310.19852
- ↑ Microsoft (2024). "Azure OpenAI Service". https://azure.microsoft.com/en-us/products/ai-services/openai-service
- ↑ Amazon (2024). "Amazon Bedrock". https://aws.amazon.com/bedrock/
- ↑ Qwen Team (2024). "Qwen2.5-VL: Frontier Vision-Language Understanding". https://qwenlm.github.io/blog/qwen2.5-vl/
- ↑ Chen, Z., et al. (2024). "InternVL: Scaling up Vision Foundation Models with Large Language Models". https://arxiv.org/abs/2312.14238
- ↑ Abdin, M., et al. (2024). "Phi-3 Technical Report". Microsoft. https://arxiv.org/abs/2404.14219
- ↑ Wang, W., et al. (2024). "CogVLM2: Visual Language Models for Image and Video Understanding". https://arxiv.org/abs/2408.16500