SmolVLA


SmolVLA (Small Vision-Language-Action) is a compact, open-source vision-language-action model (VLA) for robotics developed by Hugging Face together with academic collaborators. Released in June 2025, SmolVLA is a significant step toward making advanced robotic control accessible to researchers and practitioners with limited computational resources.[1] The "Smol" in the name is a playful spelling of "small," reflecting the model's compact size.[2]

Unlike existing VLAs, which typically require billions of parameters and extensive computational resources, SmolVLA achieves competitive performance with only 450 million parameters, roughly one to two orders of magnitude fewer than contemporary VLAs, and can run on consumer-grade hardware, including CPUs, single GPUs, and even devices such as a MacBook or Raspberry Pi-class edge computers.[3][4]

Overview

SmolVLA is designed to address key challenges in robotic learning: the high computational costs of existing VLA models, limited access to training data, and the difficulty of deploying models on affordable hardware.[5] The model democratizes robotics by enabling advanced vision-language-action capabilities on consumer-grade hardware, making it suitable for educational projects, research, and small-scale automation.[6]

The model achieves its efficiency through several innovations:

  • Efficient Architecture: Utilizes only the first half of the vision-language model's layers, reducing computational cost by approximately 50% and latency by ~40%[7]
  • Community-Driven Training: Trained exclusively on open-source, community-contributed datasets under the LeRobot tag[3]
  • Asynchronous Inference: Decouples perception and action prediction from execution, enabling 30% faster response times and 2× task throughput[5]
  • Hardware Accessibility: Can be trained on a single GPU and deployed on consumer-grade hardware, including edge devices for privacy-sensitive installations[4]

SmolVLA follows the same three-modality paradigm as larger VLAs (vision, language, and action) but emphasizes efficiency and accessibility over scale. Its open-source nature and reliance on community-driven data foster collaboration and may accelerate innovation in robotics.[8]

Development

Background

SmolVLA emerged from Hugging Face's broader initiative to democratize robotics through open-source tools and models. The project builds upon the company's LeRobot ecosystem, launched in 2024, which provides a collection of robotics-focused models, datasets, and tools.[4] The development was motivated by the observation that existing VLA models, while powerful, remained inaccessible to most researchers due to their computational requirements and reliance on proprietary datasets.[3]

The development represents a shift in robotics foundation models toward more open, efficient, and reproducible systems. By leveraging community-contributed data and affordable hardware, SmolVLA lowers the barrier to entry for robotics research and encourages broader participation.[9]

Team

SmolVLA was developed by a team of researchers at Hugging Face and collaborating academic institutions. The primary authors include:[1]

  • Mustafa Shukor - PhD student at Sorbonne University[10]
  • Dana Aubakirova - M2 student in MVA at ENS Paris-Saclay[11]
  • Francesco Capuano
  • Pepijn Kooijmans
  • Steven Palma
  • Adil Zouitine
  • Michel Aractingi
  • Caroline Pascal
  • Martino Russi
  • Andres Marafioti
  • Simon Alibert
  • Matthieu Cord
  • Thomas Wolf - Co-founder of Hugging Face
  • Remi Cadene - Research Scientist at Hugging Face

Release Timeline

Date | Milestone | Significance
June 2, 2025 | Initial arXiv paper released[1] | First public disclosure of the model
June 3, 2025 | Official blog post and model release on Hugging Face[3] | Model weights and code made publicly available
June 4, 2025 | Media coverage highlighting the model's efficiency[4] | Widespread recognition of accessibility features
June 10, 2025 | Community adoption milestone | Over 1,000 downloads and first community contributions
June 13, 2025 | Asynchronous inference stack released | Performance improvements made available

Architecture

SmolVLA's architecture consists of two main components that together process visual inputs and language instructions and generate robot actions:[7]

Perception Module (SmolVLM-2)

The perception module is based on SmolVLM-2, an efficient vision-language model optimized for multi-image and video inputs. It comprises a SigLIP visual encoder and a compact language decoder based on SmolLM2.[12] Key features include:

  • Vision Encoder: Uses SigLIP for robust visual feature encoding
  • Language Decoder: Employs SmolLM2, a compact language model
  • Token Efficiency: Limits visual tokens to 64 per frame through pixel-shuffle token reduction techniques
  • Layer Pruning: Uses only the first N layers of the VLM's language decoder (N = L/2, where L is the total number of layers), reducing latency by approximately 40%[13] (see the sketch below)
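
The layer-skipping idea can be illustrated with a short PyTorch sketch. This is a conceptual illustration only, not the LeRobot implementation; it assumes a decoder module that exposes its transformer blocks as an nn.ModuleList attribute named layers.

# Conceptual sketch of VLM layer pruning: keep only the first half of the
# decoder's transformer blocks so features are read from an intermediate depth.
# (Attribute names are assumptions; this is not the actual LeRobot code.)
import torch.nn as nn

def truncate_decoder_layers(decoder: nn.Module, keep_ratio: float = 0.5) -> nn.Module:
    num_layers = len(decoder.layers)                 # e.g. 32 layers in SmolLM2
    num_keep = max(1, int(num_layers * keep_ratio))  # e.g. keep the first 16
    decoder.layers = nn.ModuleList(list(decoder.layers)[:num_keep])
    return decoder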

Action Expert

The action expert is a specialized transformer module (~100M parameters) that generates continuous robot actions:[3]

  • Architecture: Alternates between self-attention and cross-attention blocks with causal masking
  • Training Method: Trained with a flow matching objective that learns to transport noisy action samples toward the ground-truth actions (see the sketch below)
  • Action Generation: Produces "action chunks", i.e. sequences of future robot actions (50 timesteps by default)
  • Temporal Consistency: Applies causal masking to ensure temporal coherence and improve smoothness
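
A minimal flow-matching training step for an action-chunk expert might look like the following. The tensor shapes, the conditioning interface, and the straight-line (rectified-flow) formulation are illustrative assumptions rather than the exact SmolVLA recipe.

# Illustrative flow-matching loss for an action-chunk expert (not the LeRobot code).
import torch
import torch.nn.functional as F

def flow_matching_loss(action_expert, prefix_features, actions):
    # actions: (batch, chunk_len, action_dim) ground-truth action chunk
    noise = torch.randn_like(actions)            # starting point x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, 1)       # interpolation time, one per sample
    x_t = (1 - t) * noise + t * actions          # point on the straight path noise -> data
    target_velocity = actions - noise            # velocity of that path
    pred_velocity = action_expert(x_t, t, prefix_features)  # expert conditioned on VLM features
    return F.mse_loss(pred_velocity, target_velocity)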

Asynchronous Inference Stack

A key innovation is SmolVLA's asynchronous inference system, which introduces a RobotClient ↔ PolicyServer scheme (a simplified client-side sketch follows the list):[14]

  • The robot executes the current action chunk while the policy server predicts the next one
  • The client keeps an action queue and requests a new chunk when the number of remaining actions falls below a guard-band threshold
  • Enables low-latency control suitable for real-time applications
  • Makes the system more adaptable and capable of faster recovery from errors
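
The client-side control loop can be sketched as follows. Class and method names (PolicyServer, predict_chunk, apply_action) are illustrative placeholders, not the LeRobot API.

# Minimal sketch of the RobotClient side of asynchronous inference
# (names are placeholders; the real LeRobot stack differs in detail).
import threading
from collections import deque

class AsyncRobotClient:
    def __init__(self, policy_server, robot, guard_band=10):
        self.server = policy_server      # remote policy that returns action chunks
        self.robot = robot               # exposes get_observation() / apply_action()
        self.queue = deque()             # pending actions to execute
        self.guard_band = guard_band     # refill threshold (remaining actions)
        self._pending = False            # True while a chunk request is in flight

    def _request_chunk(self):
        obs = self.robot.get_observation()
        self.queue.extend(self.server.predict_chunk(obs))
        self._pending = False

    def run(self, num_steps):
        self._request_chunk()            # prime the queue before the first step
        for _ in range(num_steps):
            if len(self.queue) <= self.guard_band and not self._pending:
                # Ask for the next chunk in the background so execution never stalls.
                self._pending = True
                threading.Thread(target=self._request_chunk, daemon=True).start()
            if self.queue:
                self.robot.apply_action(self.queue.popleft())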

Input Processing

SmolVLA processes three types of inputs:[7]

  1. Multiple RGB Images: Up to four frames from different camera views, each resized to 512×512 pixels (a single global view per frame, without tiling)
  2. Language Instructions: Natural language task descriptions tokenized into text tokens
  3. Sensorimotor States: The robot's current state, projected into a single token by a linear layer (see the sketch below)
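
Conceptually, these inputs are embedded and concatenated into a single token sequence (the "prefix") that conditions the action expert. The sketch below is an assumption-laden illustration; module names and shapes do not correspond to the actual LeRobot classes.

# Conceptual assembly of the SmolVLA input prefix (illustrative only).
import torch
import torch.nn as nn

class PrefixBuilder(nn.Module):
    def __init__(self, vision_encoder, text_embedder, state_dim, hidden_dim):
        super().__init__()
        self.vision_encoder = vision_encoder   # SigLIP + pixel shuffle -> 64 tokens per frame
        self.text_embedder = text_embedder     # instruction token ids -> embeddings
        self.state_proj = nn.Linear(state_dim, hidden_dim)  # sensorimotor state -> one token

    def forward(self, images, instruction_ids, state):
        # images: (batch, num_cams, 3, 512, 512); state: (batch, state_dim)
        img_tokens = torch.cat(
            [self.vision_encoder(images[:, i]) for i in range(images.shape[1])], dim=1
        )                                                    # (batch, num_cams * 64, hidden)
        txt_tokens = self.text_embedder(instruction_ids)     # (batch, text_len, hidden)
        state_token = self.state_proj(state).unsqueeze(1)    # (batch, 1, hidden)
        return torch.cat([img_tokens, txt_tokens, state_token], dim=1)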

Training

Datasets

SmolVLA was trained exclusively on community-contributed datasets from the LeRobot ecosystem, totaling approximately 10 million frames across ~30,000 episodes. The training data consists of:[3]

  • 487 high-quality datasets focused on the SO100 robot platform
  • Diverse task coverage: Including pick-and-place, stacking, sorting, and manipulation tasks
  • Natural diversity: Varied lighting conditions, suboptimal demonstrations, and heterogeneous control schemes
  • Multiple environments: Data collected in homes, maker spaces, and research labs

Dataset Family | Episodes | Frames (M) | Robot Types | Notes
SO-100 multi-task | ~20,000 | ~7.2 | SO-100 arm | Primary training data
SO-101 OOD test | ~3,000 | ~1.1 | SO-101 arm | Out-of-distribution testing
LeKiwi mobile base | ~2,000 | ~0.7 | Mobile manipulator | Navigation + manipulation
Misc. hobby datasets | ~5,000 | ~1.6 | Various DIY rigs | Community contributions

The datasets were curated with a custom filtering tool created by Alexandre Chapin and Ville Kuosmanen, with manual review by Marina Barannikov. Noisy task labels were automatically rewritten with Qwen2.5-VL-3B-Instruct into concise, action-verb-led instructions of at most 30 characters.[3]

Camera views were standardized as follows:

Camera View | Description
OBS_IMAGE_1 | Top-down view
OBS_IMAGE_2 | Wrist-mounted camera
OBS_IMAGE_3+ | Additional views
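
Community datasets in this format can be inspected with the LeRobot library. The snippet below is a sketch: the repository id is one example dataset, and the import path and returned keys may differ between lerobot versions.

# Inspect a community LeRobot dataset (example repo id; API may vary by version).
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/svla_so100_stacking")
print(f"{len(dataset)} frames")
sample = dataset[0]
print(sorted(sample.keys()))   # image observations, state, action, timestamps, ...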

Training Process

The training methodology follows a two-phase approach inspired by large language models:[3]

  1. Pretraining Phase: 200,000 steps on general manipulation data from community datasets
  2. Task-Specific Post-Training: 100,000-200,000 steps of fine-tuning on specific tasks

Key training specifications:[7]

  • Can be trained on a single consumer GPU (for example, an RTX 3080 Ti with 12 GB of VRAM)
  • Batch size: 44 (adjustable to the available VRAM; e.g., 16 on 6 GB GPUs)
  • Training time: Approximately 4 hours for 20,000 steps on a single A100 GPU[15]
  • Memory usage: ~11.53 GB GPU memory during training
  • Loss convergence: From 1.198 to 0.004 over 200,000 steps

Performance

Simulation Benchmarks

SmolVLA demonstrates strong performance on established robotics benchmarks despite its compact size:[7]

Benchmark | SmolVLA (0.45B) | π₀ (3.3B) | OpenVLA (7B) | Diffusion Policy | ACT
LIBERO-40 | 87.3% | ~85% | Lower | Lower | Lower
Meta-World MT50 | Outperforms the baselines | - | Lower | Lower | -
Average Success Rate | 82.5% | 80.2% | 78.9% | 75.3% | 76.8%

Real-World Performance

On real-world robotic platforms, SmolVLA achieves:[5]

Platform | Task | Success Rate | Notes
SO100 | Pick-Place | 78.3% (avg) | Trained on this platform
SO100 | Stacking | | In-distribution performance
SO100 | Sorting | | With object variations
SO101 | Pick-Place | 76.5% | Zero-shot generalization
SO101 | Complex manipulation | 72.1% | Out-of-distribution

Impact of Pretraining

The effectiveness of community dataset pretraining is demonstrated by:[3]

  • Without pretraining: 51.7% success rate on SO100 tasks
  • With pretraining: 78.3% success rate (a 26.6-percentage-point absolute improvement)
  • With multitask finetuning: Further improvements in low-data regimes (up to 85% on specific tasks)

Asynchronous Inference Benefits

The asynchronous inference stack provides:[5]

  • 30% reduction in average task completion time
  • 2× increase in completed actions within fixed time scenarios (19 vs. 9 cubes moved)
  • 40% reduction in inference latency through layer pruning
  • Average inference time: approximately 0.087 seconds (≈87 ms)
  • Maximum GPU memory usage: 908.43 MB during inference

Technical Specifications

Model Details

  • Total Parameters: 450 million (roughly one to two orders of magnitude smaller than contemporary VLAs)[3]
  • Action Expert Parameters: ~100 million[7]
  • VLM Layers Used: First 16 layers (out of 32 total)[16]
  • Visual Tokens per Frame: 64[7]
  • Action Chunk Size: Configurable (typically 50 timesteps for 1 second)[7]
  • License: Apache-2.0 (code & model weights)[17]

Hardware Requirements

Operation | Minimum Hardware | Recommended Hardware | Performance Notes
Training | Single consumer GPU (6 GB VRAM) | GPU with 12 GB+ VRAM | 4 hours for 20k steps on an A100
Inference | CPU (modern laptop) | Consumer GPU | Real-time on a MacBook Pro
Fine-tuning | RTX 3080 Ti (12 GB) | A100 GPU | Batch size adjustable
Edge Deployment | Raspberry Pi 4 | Jetson Nano | For privacy-sensitive installations

Software Integration

SmolVLA is fully integrated with the LeRobot framework:[18]

# Example command to fine-tune the pretrained smolvla_base checkpoint on a dataset
python lerobot/scripts/train.py \
    --policy.path=lerobot/smolvla_base \
    --dataset.repo_id=lerobot/your_dataset \
    --batch_size=64 \
    --steps=20000
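
After fine-tuning (or directly with the released checkpoint), the policy can be loaded and queried from Python. The snippet below is a sketch: the SmolVLAPolicy import path follows recent lerobot releases, while the observation keys and state dimension are placeholders that depend on the robot configuration.

# Load the pretrained checkpoint and query an action
# (observation keys and shapes are placeholders for a specific robot setup).
import torch
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
print(sum(p.numel() for p in policy.parameters()) / 1e6, "M parameters")  # ~450M

observation = {
    "observation.images.top": torch.zeros(1, 3, 512, 512),  # placeholder camera frame
    "observation.state": torch.zeros(1, 6),                 # placeholder joint state
    "task": ["Pick up the cube and place it in the box"],
}
with torch.no_grad():
    action = policy.select_action(observation)              # next action from the current chunk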

Applications

SmolVLA has been successfully deployed for various robotic manipulation tasks and environments:[3]

Supported Tasks

  • Pick-and-Place Operations: Grasping and relocating objects with various shapes and sizes
  • Stacking Tasks: Building stable structures with blocks
  • Sorting Activities: Organizing objects by category or properties
  • Assembly Operations: Simple construction tasks
  • Kitchen Tasks: Basic food preparation and cleaning
  • Mobile Manipulation: Combined navigation and manipulation tasks

Robot Platforms

  • SO100: Primary training platform
  • SO101: Demonstrates zero-shot generalization
  • Koch Arm: Community-tested implementation[4]
  • ALOHA-style robots: Compatible through LeRobot framework
  • Raspberry Pi robots: Edge deployment for education
  • Custom DIY platforms: Community-built robots

Use Cases

  • Education & Hobby Robotics: Runs on Raspberry Pi-class edge computers, enabling classroom demos and maker projects[4]
  • Research Prototyping: Quick fine-tuning with only a handful of additional demos using the LeRobot trainer[19]
  • Edge Deployment: Works fully offline on consumer GPUs or CPUs, important for privacy-sensitive installations[8]
  • Research Baseline: Serves as a reproducible small-scale reference when studying VLA design choices[1]

Comparison with Other VLAs

Model | Parameters | Training Data | Hardware Requirements | Open Source | Real-time Capable
SmolVLA | 450M | Community datasets | Consumer GPU/CPU | Yes | Yes
OpenVLA | 7B | OXE dataset | High-end GPU | Yes | -
RT-2 | 55B | Proprietary | Enterprise GPU cluster | No | -
π0 | 3.3B | Mixed proprietary/open | High-end GPU | Partial | -
ACT | 1B | Task-specific | Mid-range GPU | Partial | -

Impact and Reception

Academic Impact

SmolVLA has been cited as a significant advancement in democratizing robotic learning. Researchers have noted its importance in:[20]

  • Lowering barriers to entry for robotics research
  • Demonstrating the effectiveness of community-driven datasets
  • Proving that compact models can achieve competitive performance
  • Challenging the trend of scaling up model sizes
  • Promoting sustainable and efficient AI development

Industry Adoption

The model has seen rapid adoption in:[21]

  • Educational institutions with limited budgets
  • Small robotics startups
  • Research labs in developing countries
  • Hobbyist and maker communities
  • Privacy-conscious industrial applications

Community Response

The robotics community has responded positively, with researchers describing it as potentially a "BERT moment for robotics".[4] Key community contributions include:

  • Over 100 additional dataset contributions to the LeRobot ecosystem
  • Ports to various robot platforms including mobile manipulators
  • Performance optimizations reducing inference time by an additional 15%
  • Integration with popular robotics frameworks like ROS

Limitations

Despite its achievements, SmolVLA has several acknowledged limitations:[16]

  1. Dataset Diversity: Training data is predominantly from SO100 platform, limiting cross-embodiment generalization
  2. Dataset Size: Uses significantly less data (~30k episodes) than state-of-the-art VLAs such as OpenVLA, which were trained on hundreds of thousands to millions of trajectories
  3. Long-Horizon Tasks: Limited evaluation on extended task sequences beyond 1-2 minutes
  4. VLM Backbone: Uses a general-purpose VLM not specifically pretrained for robotics
  5. Single-Arm Focus: Primary evaluation on single-arm manipulation tasks
  6. Complex Language Grounding: Its compact size trades away some complex language understanding compared with billion-parameter VLAs

Future Directions

The SmolVLA team and community have identified several areas for future development:[3]

  • Cross-Embodiment Training: Expanding to more diverse robot platforms including quadrupeds and humanoids
  • Scaling Studies: Investigating optimal model sizes for different applications (exploring 200M-1B parameter variants)
  • Joint Multimodal Training: Combining robotics data with general vision-language datasets
  • Real-Time Optimizations: Further improvements to inference speed targeting sub-50ms latency
  • Sim-to-Real Transfer: Better integration with simulation environments like Isaac Gym
  • Reinforcement Learning: Integration of RL fine-tuning for improved task performance
  • Larger Community Datasets: Goal of reaching 1 million demonstration episodes by 2026

References

  1. Shukor, Mustafa et al. (2025-06-02). "SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics". https://arxiv.org/abs/2506.01844.
  2. "SmolVLA naming convention". Hugging Face. https://huggingface.co/blog/smolvla.
  3. "SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data". Hugging Face. https://huggingface.co/blog/smolvla.
  4. "Hugging Face says its new robotics model is so efficient it can run on a MacBook". TechCrunch. 2025-06-04. https://techcrunch.com/2025/06/04/hugging-face-says-its-new-robotics-model-is-so-efficient-it-can-run-on-a-macbook/.
  5. "Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics". MarkTechPost. 2025-06-03. https://www.marktechpost.com/2025/06/03/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics/.
  6. "An Open-Source Vision-Language-Action Model for Modern Robotics - SmolVLA". Black Coffee Robotics. https://blackcoffeerobotics.com/blog/smolvla-open-source-vision-language-action-model.
  7. "SmolVLA: A vision-language-action model for affordable and efficient robotics". https://arxiv.org/html/2506.01844v1.
  8. "AI and Robotics: With SmolVLA, Hugging Face Opens VLA Models to the Community". ActuIA. 2025-06-17. https://www.actuia.com/english/ai-and-robotics-with-smolvla-hugging-face-opens-vision-language-action-models-to-the-community/.
  9. "SmolVLA (Efficient Vision-Language-Action Model)". Emergent Mind. 2025-06-21. https://www.emergentmind.com/papers/smolvla.
  10. "Mustafa Shukor - Google Scholar". https://scholar.google.com/citations?user=lhp9mRgAAAAJ&hl=en.
  11. "Dana Aubakirova - Google Scholar". https://scholar.google.com/citations?user=iX_-o7IAAAAJ&hl=en.
  12. "SmolVLA architecture variations". https://arxiv.org/html/2506.01844v1.
  13. "SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data". LearnOpenCV. 2025-06-11. https://learnopencv.com/smolvla-lerobot-vision-language-action-model/.
  14. "Decoding SmolVLA: A Vision-Language-Action Model for Efficient and Accessible Robotics". Phospho AI Blog. 2025-06-13. https://blog.phospho.ai/decoding-smolvla.
  15. "Finetune SmolVLA". Hugging Face. https://huggingface.co/docs/lerobot/smolvla.
  16. "Literature Review: SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics". Moonlight. https://www.themoonlight.io/en/review/smolvla-a-vision-language-action-model-for-affordable-and-efficient-robotics.
  17. "LeRobot License". GitHub. https://github.com/huggingface/lerobot/blob/main/LICENSE.
  18. "GitHub - huggingface/lerobot: 🤗 LeRobot: Making AI for Robotics more accessible with end-to-end learning". GitHub. https://github.com/huggingface/lerobot.
  19. "Train SmolVLA – starter pack". Phospho AI Docs. https://docs.phospho.ai/smolvla-training.
  20. "Paper page - SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics". Hugging Face. https://huggingface.co/papers/2506.01844.
  21. "SmolVLA Model: How Hugging Face's Vision-Language-Action AI is Democratizing Robotics". Flowgrammer. 2025-06-20. https://flowgrammer.ca/smolvla-model-democratizing-robotics/.

