Vision Language Model


A Vision Language Model (VLM), also known as a visual language model, is a type of artificial intelligence system that can simultaneously understand and process both visual information (images, videos) and textual information (natural language), enabling machines to perceive, reason about, and communicate regarding visual content.[1] These multimodal models learn rich correlations from billions of web-scale image-text pairs, enabling zero-shot predictions across diverse visual recognition tasks with a single unified architecture.[2]

Unlike traditional computer vision models that require massive manually-labeled datasets for each specific task, VLMs leverage freely available image-text pairs from the internet, fundamentally changing how machines perceive and communicate about visual content. These models combine computer vision and natural language processing (NLP) capabilities to understand and process information from both visual and textual data simultaneously, learning to map the complex relationships between images or videos and their corresponding text descriptions.[3] VLMs are a class of foundation model in AI that can be adapted to a wide range of vision-and-language tasks, from describing photographs and summarizing videos to aiding in visual search and human-computer interaction.[4]

Architecture

Core Components

Most contemporary Vision Language Models are constructed using a modular, three-part architecture that combines specialized components for vision, language, and the interface between them.[4]

Vision Encoder: The vision encoder is the component responsible for "seeing." It processes visual input through architectures like Vision Transformer (ViT) or CNNs, dividing images into patches (typically 16×16 pixels) that are flattened into vectors and processed through transformer layers.[5] A 224×224 image becomes 196 patches in a 14×14 grid, with each patch linearly projected to a 768-dimensional embedding. These visual features then flow through 12-24 transformer layers using multi-head self-attention, creating rich representations that capture objects, attributes, spatial relationships, and semantic content.[6] Many state-of-the-art VLMs, including LLaVA, leverage pre-trained ViT-based encoders from CLIP for their powerful and generalizable visual representations.[7]
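
The patch-embedding step described above can be illustrated with a short PyTorch sketch (a toy example rather than any particular model's implementation; the tensor sizes follow the ViT-Base figures quoted in the text):

```python
import torch
import torch.nn as nn

# Illustrative ViT-style patch embedding with the sizes quoted above.
image = torch.randn(1, 3, 224, 224)        # one RGB image, 224x224
patch_size, embed_dim = 16, 768

# A strided convolution is the usual way to "split into 16x16 patches and
# linearly project each one" in a single operation.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

tokens = patch_embed(image)                # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2) # (1, 196, 768): 196 patch embeddings
print(tokens.shape)                        # torch.Size([1, 196, 768])
```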

Language Model (LLM): The Large Language Model serves as the cognitive backbone of the VLM. It is responsible for high-level reasoning, contextual understanding, and generating coherent textual output.[4] Typically a large pre-trained transformer like Vicuna, LLaMA, Gemma, or GPT-4, the LLM processes the projected visual tokens alongside text tokens. During generation, the model autoregressively predicts the next token based on the full multimodal context, treating vision-language understanding as sequence prediction where images become another type of token in the language model's input stream.[7]

Vision-Language Connector: The connector, also referred to as a projector or bridge, is a critical component that links the vision encoder and the language model. Its function is to translate the visual embeddings produced by the vision encoder into a format that is compatible with the LLM's input space.[4] Common connector architectures include the following (a minimal sketch of the simple projection connectors follows this list):

  • Simple Linear Projection: A lightweight and data-efficient approach where a single trainable linear layer maps the visual feature space to the LLM's word embedding space. Used by models like LLaVA and PaliGemma.[7]
  • Multi-Layer Perceptron (MLP): A slightly more complex connector using a small neural network with one or more hidden layers, employed in models like LLaVA-1.5.[8]
  • Cross-Attention Layers: More deeply integrated approach where new cross-attention layers are added directly into the LLM's architecture, used by Flamingo and Llama 3.2 Vision.[9]
  • Q-Former: BLIP-2's sophisticated fusion mechanism using 188 million trainable parameters with 32 learnable query embeddings to map visual features into the language model's space through cross-attention.[10]
  • Perceiver Resampler: A specialized module used by Flamingo that takes a potentially large and variable number of visual features and uses an attention-based mechanism to "distill" them into a smaller, fixed number of latent visual tokens.[9]
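
The two simplest connector designs above can be sketched in a few lines of PyTorch; the feature dimensions here (1024 for the vision encoder, 4096 for the LLM) are illustrative assumptions rather than the values of any specific model:

```python
import torch
import torch.nn as nn

VISION_DIM, LLM_DIM = 1024, 4096           # assumed sizes, for illustration only

# LLaVA-style linear projector: a single trainable matrix mapping vision
# features into the LLM's token-embedding space.
linear_projector = nn.Linear(VISION_DIM, LLM_DIM)

# LLaVA-1.5-style MLP projector: a small two-layer network with a GELU.
mlp_projector = nn.Sequential(
    nn.Linear(VISION_DIM, LLM_DIM),
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)

visual_features = torch.randn(1, 196, VISION_DIM)  # patch features from the vision encoder
visual_tokens = mlp_projector(visual_features)     # (1, 196, LLM_DIM); these are concatenated
                                                   # with text token embeddings and fed to the LLM
```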

Fusion Strategies

Three primary fusion strategies enable different architectural trade-offs:

Early Fusion: Combines visual and textual inputs before deep processing. Models like Chameleon use a VQ-VAE-style image tokenizer to convert images into discrete tokens (1,024 tokens per 512×512 image), processing them alongside text in a single decoder.[11] Fuyu-8B exemplifies early fusion by feeding image patches directly into the decoder model without a separate vision encoder.[12]

Late Fusion: Processes modalities independently before combining at the output level. CLIP epitomizes this strategy, training separate image and text encoders to project into a shared 512-dimensional embedding space where cosine similarity measures alignment.[1]
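
As a usage illustration of this dual-encoder, late-fusion design, the sketch below runs zero-shot classification with a publicly released CLIP checkpoint through the Hugging Face transformers library (the checkpoint name, image path, and label set are assumptions; any CLIP checkpoint with the same interface behaves similarly):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")            # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image and text are encoded independently (late fusion); cosine similarities
# between the embeddings are converted into a distribution over the labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```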

Intermediate Fusion: Combines modalities within the network through cross-attention and has become a dominant approach. Flamingo's architecture freezes both the vision encoder and the language model, training only the Perceiver Resampler and gated cross-attention layers; a tanh-gating mechanism initialized to zero allows stable training by gradually opening the gate during learning.[9]
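
A simplified sketch of the tanh-gated cross-attention idea follows; it is not Flamingo's actual implementation, and the layer sizes and single-block structure are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Toy Flamingo-style block: text tokens attend to visual tokens, and a
    tanh gate initialized at zero lets the visual contribution grow gradually
    during training (the block starts out as an identity mapping)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0

    def forward(self, text_tokens, visual_tokens):
        attended, _ = self.cross_attn(query=text_tokens,
                                      key=visual_tokens,
                                      value=visual_tokens)
        return text_tokens + torch.tanh(self.gate) * attended

block = GatedCrossAttentionBlock()
text = torch.randn(1, 32, 512)     # 32 text tokens
vision = torch.randn(1, 64, 512)   # 64 latent visual tokens (e.g. from a resampler)
out = block(text, vision)          # same shape as the text tokens
```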

History

Early Concepts and Precursors (2010s–2020)

Research at the intersection of vision and language gained significant momentum around 2015, with early efforts focusing on specific tasks like image captioning and visual question answering.[13] These initial models typically paired a Convolutional Neural Network (CNN) for visual feature extraction with a Recurrent Neural Network (RNN) for language generation.[3]

Initial efforts included:

  • Neural Image Captioners (2014–2015): Early models like Show and Tell (Google, 2015) combined a CNN image encoder with an RNN language decoder to generate captions, demonstrating one of the first end-to-end learned image description systems.[13]
  • Visual Question Answering Models (2015): The first VQA models used CNNs for images and sequence models for questions to output answers, coinciding with the introduction of the VQA dataset.[14]
  • Transformer-based Models (2019): Models like ViLBERT and LXMERT extended the BERT architecture to multimodal input, using two-stream transformers with cross-attention for fusion.[15][16]

The Rise of Contrastive Pre-training (2021)

The year 2021 marked a pivotal moment for VLMs with the introduction of models that leveraged large-scale contrastive pre-training. This approach represented a fundamental shift away from relying on smaller, meticulously human-labeled datasets like ImageNet toward using the vast and noisy data available on the internet.

CLIP (Contrastive Language–Image Pre-training): Developed by OpenAI and released in February 2021, CLIP was trained on 400 million web-scraped image-text pairs and achieved zero-shot ImageNet accuracy matching a supervised ResNet-50. Its core innovation was the use of a contrastive learning objective, where the model learns to maximize the similarity between the embeddings of a correct image-text pair while simultaneously minimizing the similarity with all other pairs in a batch.[1]

ALIGN: Shortly after CLIP, Google researchers introduced ALIGN, which embraced noisy web data at unprecedented scale with 1.8 billion image-text pairs, demonstrating that massive scale compensates for data noise.[2]

Integration with Large Language Models (2022)

In April 2022, DeepMind introduced Flamingo, marking the transition from pure image-text understanding to in-context few-shot learning. The 80-billion-parameter model processed arbitrarily interleaved sequences of images, videos, and text and was trained on MultiModal MassiveWeb (M3W), a dataset of 43 million webpages.[9]

Instruction Tuning Era (2023)

The instruction-tuning revolution arrived in April 2023 with LLaVA (Large Language and Vision Assistant), which connected CLIP ViT-L/14 to Vicuna through a simple linear projection. LLaVA's key innovation was using GPT-4 to generate 150,000 multimodal instruction-following samples, pioneering instruction tuning in the multimodal domain.[7]

September 2023 marked VLMs entering the mainstream with GPT-4V (GPT-4 with Vision), OpenAI's multimodal extension integrated into ChatGPT, bringing advanced vision capabilities to millions of users.[17]

Current Generation (2024–2025)

March 2024 saw Anthropic release Claude 3 in three variants (Haiku, Sonnet, Opus), all featuring native vision capabilities and 200K-token context windows.[18] Google's Gemini family emerged as among the first major models trained natively on multiple modalities from the start, with Gemini 1.0 Ultra reported to achieve human-expert-level performance on MMLU (90.0%).[19]

Training

Pre-training Objectives

The goal of pre-training is to establish a fundamental alignment between the vision and language encoders using massive, often web-scale, datasets.[20]

Contrastive Learning: A foundational pre-training technique where the model is presented with a batch of image-text pairs and learns to distinguish between corresponding (positive) pairs and non-corresponding (negative) pairs. The training objective, often implemented with a contrastive loss function like InfoNCE, is to pull the vector embeddings of positive pairs closer together in a shared embedding space while pushing the embeddings of negative pairs farther apart.[1]
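
A minimal sketch of this symmetric, InfoNCE-style objective on a batch of paired image and text embeddings (batch size, embedding dimension, and temperature are illustrative choices):

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) embeddings of matching image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities; the diagonal holds the positive pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(len(image_emb))

    # Symmetric cross-entropy: match each image to its text and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```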

Masked Modeling: Inspired by models like BERT, this technique involves randomly hiding or "masking" a portion of the input and training the model to predict or reconstruct the missing part. This can be applied to either modality: masked language modeling predicts hidden text tokens given the visible text and the image, while masked image modeling reconstructs hidden image patches given the visible patches and the text, as in FLAVA.[21]

Generative Objectives: Models learn to autoregressively generate captions or images from multimodal inputs, as demonstrated by CoCa and Flamingo.[22]
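
In its simplest form, the generative objective is ordinary next-token prediction over the caption, conditioned on the visual tokens that precede it in the decoder's input; the sketch below shows only the loss computation, with random tensors standing in for real decoder outputs:

```python
import torch
import torch.nn.functional as F

vocab_size, caption_len = 32000, 24                 # illustrative sizes
logits = torch.randn(1, caption_len, vocab_size)    # decoder outputs at caption positions
caption_ids = torch.randint(0, vocab_size, (1, caption_len))

# Standard autoregressive objective: the output at position t predicts token t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    caption_ids[:, 1:].reshape(-1),
)
```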

Supervised Fine-tuning

Supervised fine-tuning adapts aligned models to specific tasks and instruction-following behavior. LLaVA's approach uses 558K image-text pairs for initial alignment followed by 158K instruction samples for task adaptation.[7]

Many modern VLMs employ a two-stage fine-tuning process (a minimal sketch of the corresponding parameter freezing follows the list):

  1. Feature Alignment: Only the vision-language connector is trained while the vision encoder and LLM remain frozen
  2. End-to-end Fine-tuning: Both the connector and the LLM (or a subset of its parameters) are trained on instruction-following data[7]
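
A hedged sketch of how this freezing schedule is often expressed in code; the ToyVLM class and its submodule names are hypothetical stand-ins for a real model:

```python
import torch.nn as nn

class ToyVLM(nn.Module):
    """Hypothetical stand-in with the three components described earlier."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(768, 768)   # placeholder modules
        self.connector = nn.Linear(768, 4096)
        self.llm = nn.Linear(4096, 4096)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

vlm = ToyVLM()

# Stage 1 - feature alignment: only the connector is trained.
set_trainable(vlm.vision_encoder, False)
set_trainable(vlm.llm, False)
set_trainable(vlm.connector, True)

# Stage 2 - end-to-end fine-tuning: the LLM (or a subset of its parameters,
# e.g. via adapters) is unfrozen alongside the connector.
set_trainable(vlm.llm, True)
```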

VILA research demonstrated that interleaving text-only instruction data with vision-language data during SFT remedies text-only task degradation.[23]

Reinforcement Learning

Methods like RLHF (Reinforcement Learning from Human Feedback) and RLVR (Reinforcement Learning with Verifiable Rewards) fine-tune aligned models for safety, helpfulness, and accuracy.[24]

Datasets

Vision-language models are typically trained on large sets of image–text pairs, using various training objectives to learn joint representations.

Major VLM Training Datasets
Dataset | Size | Description | Usage
LAION-5B | 5.85B pairs | Web-scraped image-text pairs with CLIP filtering, multilingual | Pre-training[25]
ALIGN dataset | 1.8B pairs | Noisy image-alt-text pairs from the web | Contrastive alignment[2]
Conceptual Captions | CC3M: 3.3M; CC12M: 12.4M | Curated web captions, cleaned and filtered | Pre-training and fine-tuning[26]
COCO | 330K images | Image captioning with 5 captions per image, 1.5M object instances | Evaluation/fine-tuning[27]
Visual Genome | 108K images | Dense annotations (regions, relations, scene graphs) | Grounding and detailed understanding[28]
VQAv2 | 1.1M questions | Visual question answering on COCO images | Evaluation/fine-tuning[29]
LLaVA-Instruct | 150K–558K samples | GPT-4-generated instruction-following data | Instruction tuning[7]
MMC4 | 100M+ documents | Interleaved image-text web documents | Generative pre-training[30]

Notable Vision-Language Models

Research in vision-language modeling has progressed rapidly, with numerous notable models marking significant milestones:

Comparison of Major Vision-Language Models
Model | Developer | Year | Parameters | Key Features | License
CLIP | OpenAI | 2021 | Varies (ViT-B/32: 63M to ViT-L/14: 427M) | Zero-shot classification, contrastive learning, dual encoder | Open
ALIGN | Google | 2021 | ~650M | Noisy web-scale alignment, 1.8B training pairs | Research
Flamingo | DeepMind | 2022 | 3B–80B | Few-shot VQA, Perceiver Resampler, interleaved inputs | Research
BLIP-2 | Salesforce | 2023 | 2.7B–13B | Bootstrapped pre-training, Q-Former, frozen LLMs | Apache 2.0
LLaVA | Liu et al. | 2023 | 7B–34B | Instruction-tuned, chat capabilities, simple projection | Apache 2.0
Kosmos-1/2 | Microsoft | 2023 | 1.3B–1.6B | Grounding, zero-shot detection, end-to-end training | MIT
PaLM-E | Google | 2023 | 562B | Embodied reasoning, robotics integration | Research
Qwen-VL | Alibaba | 2023 | 7B | Multilingual, OCR, long video understanding | Apache 2.0
MiniGPT-4 | Zhu et al. | 2023 | 7B | LLM alignment via projector, efficient adaptation | BSD 3-Clause
GPT-4V/GPT-4o | OpenAI | 2023–2024 | Undisclosed | Real-time multimodality, leading commercial performance | Proprietary
Gemini | Google | 2024 | Undisclosed | Video understanding, long context (1M+ tokens), native multimodal training | Proprietary
Claude 3/3.5 | Anthropic | 2024 | Undisclosed | Strong reasoning, 200K context window, safety focus | Proprietary
Llama 3.2 Vision | Meta | 2024 | 11B–90B | Open weights, efficient, cross-attention architecture | Llama License
Qwen2.5-VL | Alibaba | 2025 | 3B–72B | Advanced OCR, multilingual, state-of-the-art open performance | Apache 2.0
DeepSeek-VL | DeepSeek | 2024 | 7B | High-resolution support, efficient training | MIT
PaliGemma | Google | 2024 | 3B | Strong transferable performance, SigLIP encoder | Apache 2.0
InternVL2 | Shanghai AI Lab | 2024 | 2B–108B | Progressive alignment, dynamic high resolution | Apache 2.0

Applications

VLMs have broad applications wherever visual content needs to be interpreted or generated in conjunction with language.

Visual Understanding Tasks

  • Image and Video Captioning: Generating concise and accurate textual descriptions for images or videos, describing actions, interactions between objects, and overall scene context.[3][31]
  • Visual Question Answering (VQA): Answering natural language questions about images, ranging from simple identification queries to complex reasoning questions requiring spatial understanding or inference (a brief usage sketch with an open-source model follows this list).[3]
  • Object Detection and Segmentation: VLMs enable "open-vocabulary" object detection, where models can detect and localize objects described in free-form text, even for categories not in training data.[32]
  • Optical Character Recognition (OCR): Reading and transcribing text embedded within images, such as text on street signs, documents, or product labels.[6]
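
The sketch below shows VQA-style prompting with an open-source VLM through the Hugging Face transformers library; the checkpoint name, prompt template, and image path are assumptions based on the LLaVA-1.5 model cards and may differ for other releases:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"                      # assumed checkpoint name
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("street_scene.jpg")
prompt = "USER: <image>\nHow many traffic lights are visible, and what color are they? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```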

Real-World Applications

Accessibility and Assistive Technology

VLMs power accessibility technologies that make digital content inclusive. Systems like Be My Eyes use GPT-4V to provide real-time descriptions of environments and objects through smartphone cameras for visually impaired users.[33] These applications can describe surroundings in real-time, read text from documents or product labels, and answer questions about visual content.

Healthcare and Medical Imaging

VLMs assist healthcare professionals by analyzing medical images like X-rays or CT scans and generating preliminary reports. Applications include automated radiology report generation and medical VQA systems. However, research reveals concerning fairness gaps, with foundation models consistently underdiagnosing marginalized groups compared to board-certified radiologists.[34]

Robotics and Embodied AI

VLMs form the perceptual core of Vision-Language-Action (VLA) models, allowing robots to understand natural language commands within physical environments. RT-2 unified vision, language, and action tokens, achieving 63% improvement on novel objects.[35] OpenVLA provides open-source VLA trained on 970k robot demonstrations.[36]

Document Understanding

VLMs excel at extracting information from structured documents, interpreting charts and graphs, and understanding document layouts. DocVQA benchmarks document comprehension with 12,000+ document images and 50,000+ questions.[37]

Content Moderation

VLMs can more accurately detect harmful or inappropriate content on social media platforms by analyzing the combined context of images and accompanying text. The KuaiMod framework, deployed at Kuaishou, processes millions of videos daily using VLM chain-of-thought reasoning and achieved a 20% reduction in the user reporting rate.[38]

Autonomous Systems

In applications like autonomous driving, VLMs enhance situational awareness by interpreting non-standard situations, such as handwritten detour signs, combining visual recognition with reasoning capabilities.[39]

E-commerce and Visual Search

Visual search matches user-uploaded images to products; VLMs enable natural language queries about product images and generate product descriptions automatically.[31]
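
A toy sketch of embedding-based visual search, assuming product-image and query-text embeddings have already been computed with a dual-encoder VLM such as CLIP (random tensors stand in for the real embeddings):

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings; in practice these come from the VLM's image and text encoders.
product_image_embs = F.normalize(torch.randn(1000, 512), dim=-1)  # catalog of 1,000 products
query_text_emb = F.normalize(torch.randn(1, 512), dim=-1)         # e.g. "red leather handbag"

# Cosine similarity between the query and every product image, then the top-5 matches.
scores = (query_text_emb @ product_image_embs.t()).squeeze(0)
top5 = torch.topk(scores, k=5)
print(top5.indices.tolist())
```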

Education

VLMs can generate step-by-step explanations from diagrams, solve visual math problems, and provide interactive tutoring based on visual content.[40]

Evaluation

Benchmarks

VLMs are assessed on benchmarks measuring multimodal capabilities:

Benchmark | Tasks | Size | Metrics | Description
MMBench | 20 ability dimensions | ~3K questions | Accuracy | Object recognition, OCR, spatial reasoning, chart interpretation[41]
MMMU | Multi-discipline reasoning | 11.5K questions | Accuracy | College-level expert knowledge across 6 disciplines[42]
MMStar | Vision-indispensable tasks | 1.5K samples | Accuracy | Ensures visual dependency with elite samples[43]
VQAv2 | Open-ended VQA | 1.1M questions | Accuracy | Standard VQA benchmark on COCO images[29]
MathVista | Visual math reasoning | 6K problems | Accuracy | Mathematical reasoning in visual contexts[40]
OCRBench | Document understanding | Varies | F1-score | Text recognition and understanding[44]
Winoground | Compositional reasoning | 400 pairs | Image/text/group score | Tests understanding of compositional structures[45]
POPE | Object hallucination | 3K questions | Accuracy/F1 | Evaluates tendency to hallucinate objects[46]

Metrics

  • CIDEr: Consensus-based Image Description Evaluation for captioning[47]
  • BLEU: Bilingual Evaluation Understudy for text generation quality[48]
  • ANLS: Average Normalized Levenshtein Similarity for document understanding (a short computation sketch follows this list)[49]
  • Accuracy: Standard metric for VQA and classification tasks
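
The sketch below computes ANLS for a single question as it is commonly defined for document VQA (best similarity over the reference answers, zeroed when the normalized edit distance reaches the 0.5 threshold); it is written here for illustration rather than taken from any official evaluation script:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit-distance dynamic program with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls_score(prediction: str, ground_truths: list[str], threshold: float = 0.5) -> float:
    """ANLS for one question: 1 - normalized edit distance to the closest
    reference answer, or 0 if that distance is at or above the threshold."""
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.strip().lower(), gt.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        if nl < threshold:
            best = max(best, 1.0 - nl)
    return best

print(anls_score("invoice 2021", ["Invoice 2021", "invoice no. 2021"]))
```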

Leaderboards like the Open VLM Leaderboard on Hugging Face rank open-source VLMs across these benchmarks.[50]

Limitations and Challenges

Despite their rapid progress and impressive capabilities, Vision Language Models face several significant limitations:

Visual Hallucination

One of the most critical issues is visual hallucination, where models generate text that is fluent and plausible but factually inconsistent with the provided image. Even advanced models exhibit hallucination rates exceeding 10%, with some open-source models showing rates above 40% on specialized benchmarks.[51] This can manifest as describing objects that are not present, misstating attributes of objects, or misinterpreting relationships between objects.[52]

Spatial Reasoning

Models struggle with basic directional concepts despite processing spatial information. Research found VLMs allocate only approximately 10% attention to image tokens despite images comprising approximately 90% of input sequence length.[53] Poor performance on relations like "left of" or "above" persists without specialized fine-tuning.

Data Bias and Fairness

VLMs are typically trained on vast datasets scraped from the internet, which inevitably contain societal biases, stereotypes, and problematic content. Foundation models consistently underdiagnose marginalized groups in medical imaging.[34] The CulturalVQA benchmark reveals stronger performance on North American cultural content and weaker performance on African and Islamic cultural content.[54]

Research has shown that VLMs exhibit strong confirmation bias, where they tend to rely on memorized knowledge from training data rather than analyzing visual evidence. For instance, when shown an image of a dog with a digitally added fifth leg, many VLMs will still confidently state that the dog has four legs.[55]

Computational Requirements

Training and deployment of large-scale VLMs are extremely resource-intensive: GPT-4's training compute has been estimated at roughly 2.1 × 10²⁵ FLOPs. However, parameter-efficient techniques such as LoRA (Low-Rank Adaptation) and quantization make fine-tuning far more accessible.[56]
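
As an illustration of these parameter-efficient techniques, the sketch below attaches LoRA adapters to a small causal language model with the peft library; the base checkpoint and the target module names are assumptions that vary by architecture, and a VLM's language backbone would be wrapped the same way:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Small base model used purely for illustration.
base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; names vary by model
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of the base model's parameters
```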

Robustness and Generalization

While VLMs show impressive performance on many benchmarks, their robustness can be brittle. Studies have shown that some models struggle with simple image transformations like rotation or color inversion, suggesting a lack of deep, compositional understanding of visual scenes.[57]

Future Directions

The field of Vision Language Models is evolving rapidly, with several key research directions:

Multimodal Reasoning

QVQ-72B-Preview and LLaVA-CoT pioneered open-source multimodal reasoning models that perform autonomous, multi-stage reasoning similar to OpenAI's o1 model.[58] Future work focuses on developing models with more sophisticated and reliable reasoning abilities, better grounding of language in visual reality, and reduced hallucination.[59]

Long-Context Understanding

LongVILA supports over 1 million token context for long video understanding.[60] Qwen2.5-VL scales context through specialized pretraining. Future models aim to process hour-long videos and complex multi-document visual content.

Vision-Language-Action Models

The integration of action prediction with vision-language understanding enables embodied AI. Building on systems such as RT-2 and OpenVLA (discussed under Robotics and Embodied AI above), vision-language-action models remain an active research direction for transferring web-scale vision-language knowledge to robotic control.[35][36]

Efficiency and On-Device Deployment

Research into model compression, quantization, and knowledge distillation aims to reduce computational requirements. SmolVLM demonstrates ultra-efficient models for edge deployment (256M-2.2B parameters).[61] Mixture-of-Experts architectures like MoE-LLaVA enable selective activation for efficiency.[62]

Any-to-Any Multimodality

The vision-language paradigm is being extended to incorporate additional data types including audio, depth information, thermal imaging, and IMU data. The goal is building "any-to-any" multimodal models that can process and generate information across a wide spectrum of sensory inputs.[63]

Enhanced Safety and Alignment

Ensuring that VLMs behave safely, fairly, and in alignment with human values is critical. This includes developing robust methods to detect and mitigate data biases, prevent generation of harmful content, and improve overall controllability and interpretability.[64]

Commercial Implementations

Major Commercial VLM APIs
Provider | Model | Pricing (per 1M tokens) | Key Features | Context Window
OpenAI | GPT-4o | $10 input / $30 output | Strong general performance, wide availability | 128K tokens[17]
Anthropic | Claude 3.5 Sonnet | $3 input / $15 output | Strong reasoning, safety focus | 200K tokens[18]
Google | Gemini 2.0 Flash | $30 | Native multimodal, video understanding | 1M+ tokens[19]
Microsoft | Azure OpenAI GPT-4V | Variable pricing | Enterprise integration, Azure ecosystem | 128K tokens[65]
Amazon | Bedrock Claude 3 | Pay per use | AWS integration, multiple models | 200K tokens[66]

Open-Source Models

The open-source community has produced numerous high-quality VLMs:

  • LLaVA Family: Among the most influential open-source VLM families (7B–34B parameters), with variants such as LLaVA-1.5 and LLaVA-NeXT[7]
  • Qwen2.5-VL: State-of-the-art open performance with 3B, 7B, and 72B variants[67]
  • InternVL2: Scalable from 2B to 108B parameters with progressive alignment[68]
  • SmolVLM: Ultra-efficient models for edge deployment (256M-2.2B parameters)[61]
  • Phi-3-Vision: Microsoft's efficient 4.2B parameter model[69]
  • CogVLM2: 19B parameters with strong performance on various benchmarks[70]

References

  1. Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision". OpenAI. https://arxiv.org/abs/2103.00020
  2. Jia, C., et al. (2021). "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision". Google Research. https://arxiv.org/abs/2102.05918
  3. IBM (2024). "What are vision language models (VLMs)?". IBM Think Blog. https://www.ibm.com/think/topics/vision-language-models
  4. NVIDIA (2024). "What Are Vision Language Models?". NVIDIA Glossary. https://www.nvidia.com/en-us/glossary/vision-language-models/
  5. Dosovitskiy, A., et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". Google Research. https://arxiv.org/abs/2010.11929
  6. Hugging Face (2024). "Vision Language Models Explained". Hugging Face Blog. https://huggingface.co/blog/vlms
  7. Liu, H., et al. (2023). "Visual Instruction Tuning". University of Wisconsin-Madison. https://arxiv.org/abs/2304.08485
  8. Liu, H., et al. (2024). "Improved Baselines with Visual Instruction Tuning". https://arxiv.org/abs/2310.03744
  9. Alayrac, J.B., et al. (2022). "Flamingo: a Visual Language Model for Few-Shot Learning". DeepMind. https://arxiv.org/abs/2204.14198
  10. Li, J., et al. (2023). "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models". Salesforce Research. https://arxiv.org/abs/2301.12597
  11. Team Chameleon (2024). "Chameleon: Mixed-Modal Early-Fusion Foundation Models". Meta AI. https://arxiv.org/abs/2405.09818
  12. Adept (2023). "Fuyu-8B: A Multimodal Architecture for AI Agents". https://www.adept.ai/blog/fuyu-8b
  13. Vinyals, O., et al. (2015). "Show and Tell: A Neural Image Caption Generator". Google. https://arxiv.org/abs/1411.4555
  14. Antol, S., et al. (2015). "VQA: Visual Question Answering". ICCV. https://arxiv.org/abs/1505.00468
  15. Lu, J., et al. (2019). "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations". NeurIPS. https://arxiv.org/abs/1908.02265
  16. Tan, H. & Bansal, M. (2019). "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". EMNLP. https://arxiv.org/abs/1908.07490
  17. OpenAI (2023). "GPT-4V(ision) System Card". https://openai.com/research/gpt-4v-system-card
  18. Anthropic (2024). "The Claude 3 Model Family: Opus, Sonnet, Haiku". https://www.anthropic.com/news/claude-3-family
  19. Google (2023). "Gemini: A Family of Highly Capable Multimodal Models". https://arxiv.org/abs/2312.11805
  20. Zhang, K., et al. (2024). "A Comprehensive Survey on Applications of Vision Large Language Models". https://arxiv.org/html/2501.02765v1
  21. Singh, A., et al. (2022). "FLAVA: A Foundational Language And Vision Alignment Model". Meta AI. https://arxiv.org/abs/2112.04482
  22. Yu, J., et al. (2022). "CoCa: Contrastive Captioners are Image-Text Foundation Models". Google Research. https://arxiv.org/abs/2205.01917
  23. Lin, J., et al. (2024). "VILA: On Pre-training for Visual Language Models". NVIDIA. https://arxiv.org/abs/2312.07533
  24. Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback". OpenAI. https://arxiv.org/abs/2203.02155
  25. Schuhmann, C., et al. (2022). "LAION-5B: An open large-scale dataset for training next generation image-text models". NeurIPS Datasets and Benchmarks. https://arxiv.org/abs/2210.08402
  26. Changpinyo, S., et al. (2021). "Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts". Google Research. https://arxiv.org/abs/2102.08981
  27. Lin, T.Y., et al. (2014). "Microsoft COCO: Common Objects in Context". Microsoft Research. https://arxiv.org/abs/1405.0312
  28. Krishna, R., et al. (2017). "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations". Stanford. https://arxiv.org/abs/1602.07332
  29. Goyal, Y., et al. (2017). "Making the V in VQA Matter". Facebook AI Research. https://arxiv.org/abs/1612.00837
  30. Zhu, D., et al. (2024). "Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text". https://arxiv.org/abs/2304.06939
  31. OpenCV (2025). "Applications of Vision Language Models". https://opencv.org/blog/applications-of-vision-language-models/
  32. Gu, X., et al. (2022). "Open-Vocabulary Object Detection via Vision and Language Knowledge Distillation". https://arxiv.org/abs/2104.13921
  33. Be My Eyes (2023). "Be My AI powered by GPT-4". https://www.bemyeyes.com/blog/announcing-be-my-ai
  34. Zhang, Y., et al. (2024). "Fairness in Medical Foundation Models". Nature Medicine. https://www.nature.com/articles/s41591-023-02778-7
  35. Brohan, A., et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control". Google DeepMind. https://arxiv.org/abs/2307.15818
  36. Kim, M., et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model". UC Berkeley. https://arxiv.org/abs/2406.09246
  37. Mathew, M., et al. (2021). "DocVQA: A Dataset for VQA on Document Images". CVPR. https://arxiv.org/abs/2007.00398
  38. Wang, J., et al. (2024). "KuaiMod: A Large-scale Content Moderation Framework". Kuaishou Technology. https://arxiv.org/abs/2404.12709
  39. Li, L., et al. (2024). "Vision-Language Models for Autonomous Driving: A Survey". https://arxiv.org/abs/2407.08123
  40. Lu, P., et al. (2024). "MathVista: Evaluating Mathematical Reasoning in Visual Contexts". https://arxiv.org/abs/2310.02255
  41. Liu, Y., et al. (2023). "MMBench: Is Your Multi-modal Model an All-around Player?". OpenCompass. https://arxiv.org/abs/2307.06281
  42. Yue, X., et al. (2024). "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark". https://arxiv.org/abs/2311.16502
  43. Chen, L., et al. (2024). "Are We on the Right Way for Evaluating Large Vision-Language Models?". https://arxiv.org/abs/2403.20330
  44. Liu, Y., et al. (2024). "OCRBench: Hidden Challenges in OCR for Large Multimodal Models". https://arxiv.org/abs/2305.07895
  45. Thrush, T., et al. (2022). "Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality". CVPR. https://arxiv.org/abs/2204.03162
  46. Li, Y., et al. (2023). "Evaluating Object Hallucination in Large Vision-Language Models". EMNLP. https://arxiv.org/abs/2305.10355
  47. Vedantam, R., et al. (2015). "CIDEr: Consensus-based Image Description Evaluation". CVPR. https://arxiv.org/abs/1411.5726
  48. Papineni, K., et al. (2002). "BLEU: a Method for Automatic Evaluation of Machine Translation". ACL. https://aclanthology.org/P02-1040/
  49. Biten, A.F., et al. (2019). "Scene Text Visual Question Answering". ICCV. https://arxiv.org/abs/1905.13648
  50. Hugging Face (2024). "Open VLM Leaderboard". https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
  51. Zhou, Y., et al. (2024). "Analyzing and Mitigating Object Hallucination in Large Vision-Language Models". ICLR. https://arxiv.org/abs/2310.00754
  52. Leng, Y., et al. (2024). "Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding". CVPR. https://arxiv.org/abs/2408.10253
  53. Kamath, A., et al. (2024). "What's Left and Right in Vision Language Models?". https://arxiv.org/abs/2312.01772
  54. Nayak, N., et al. (2024). "CulturalVQA: Benchmarking Cultural Understanding in Vision Language Models". https://arxiv.org/abs/2407.19788
  55. VLMs Are Blind (2024). "Vision Language Models Are Blind". https://vlmsareblind.github.io/
  56. Hu, E.J., et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models". Microsoft. https://arxiv.org/abs/2106.09685
  57. Anis, A. M., et al. (2024). "On the Limitations of Vision-Language Models in Understanding Image Transforms". https://arxiv.org/abs/2503.09837
  58. Qwen Team (2024). "QVQ: Multimodal Reasoning at Scale". Alibaba. https://qwenlm.github.io/blog/qvq-72b-preview/
  59. Xu, P., et al. (2024). "LLaVA-CoT: Let Vision Language Models Reason Step-by-Step". https://arxiv.org/abs/2411.10440
  60. Li, F., et al. (2024). "LongVILA: Scaling Long-Context Visual Language Models". NVIDIA. https://arxiv.org/abs/2408.00400
  61. Hugging Face (2024). "SmolVLM: Small Vision Language Models". https://huggingface.co/blog/smolvlm
  62. Lin, B., et al. (2024). "MoE-LLaVA: Mixture of Experts for Large Vision-Language Models". https://arxiv.org/abs/2401.15947
  63. Bai, J., et al. (2024). "Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action". https://arxiv.org/abs/2312.17172
  64. Ji, J., et al. (2024). "AI Alignment: A Comprehensive Survey". https://arxiv.org/abs/2310.19852
  65. Microsoft (2024). "Azure OpenAI Service". https://azure.microsoft.com/en-us/products/ai-services/openai-service
  66. Amazon (2024). "Amazon Bedrock". https://aws.amazon.com/bedrock/
  67. Qwen Team (2024). "Qwen2.5-VL: Frontier Vision-Language Understanding". https://qwenlm.github.io/blog/qwen2.5-vl/
  68. Chen, Z., et al. (2024). "InternVL: Scaling up Vision Foundation Models with Large Language Models". https://arxiv.org/abs/2312.14238
  69. Abdin, M., et al. (2024). "Phi-3 Technical Report". Microsoft. https://arxiv.org/abs/2404.14219
  70. Wang, W., et al. (2024). "CogVLM2: Visual Language Models for Image and Video Understanding". https://arxiv.org/abs/2408.16500