SmolVLA
SmolVLA (Small Vision-Language-Action) is a compact, open-source vision-language-action model (VLA) for robotics developed by Hugging Face in collaboration with Google DeepMind. Released in June 2025, SmolVLA is a significant step toward making advanced robotic control accessible to researchers and practitioners with limited computational resources.[1] The "Smol" in the name is an informal spelling of "small," reflecting the model's compact size.[2]
Unlike existing VLAs, which typically require billions of parameters and extensive computational resources, SmolVLA achieves competitive performance with only 450 million parameters, up to two orders of magnitude fewer than some contemporary VLAs, and can run on consumer-grade hardware, including CPUs, single GPUs, and even devices such as a MacBook or Raspberry Pi-class edge computers.[3][4]
Overview
SmolVLA is designed to address key challenges in robotic learning: the high computational costs of existing VLA models, limited access to training data, and the difficulty of deploying models on affordable hardware.[5] The model democratizes robotics by enabling advanced vision-language-action capabilities on consumer-grade hardware, making it suitable for educational projects, research, and small-scale automation.[6]
The model achieves its efficiency through several innovations:
- Efficient Architecture: Utilizes only the first half of the vision-language model's layers, reducing computational cost by approximately 50% and latency by ~40%[7]
- Community-Driven Training: Trained exclusively on open-source, community-contributed datasets under the LeRobot tag[3]
- Asynchronous Inference: Decouples perception and action prediction from execution, enabling 30% faster response times and 2× task throughput[5]
- Hardware Accessibility: Can be trained on a single GPU and deployed on consumer-grade hardware, including edge devices for privacy-sensitive installations[4]
SmolVLA follows the same three-modality paradigm as larger VLAs (vision, language, and action) but emphasizes efficiency and accessibility over scale. Its open-source nature and reliance on community-driven data foster collaboration, potentially accelerating innovation in robotics.[8]
Development
Background
SmolVLA emerged from Hugging Face's broader initiative to democratize robotics through open-source tools and models. The project builds upon the company's LeRobot ecosystem, launched in 2024, which provides a collection of robotics-focused models, datasets, and tools.[4] The development was motivated by the observation that existing VLA models, while powerful, remained inaccessible to most researchers due to their computational requirements and reliance on proprietary datasets.[3]
The development represents a shift in robotics foundation models toward more open, efficient, and reproducible systems. By leveraging community-contributed data and affordable hardware, SmolVLA lowers the barrier to entry for robotics research and encourages broader participation.[9]
Team
SmolVLA was developed by a team of researchers at Hugging Face and collaborating institutions, including Google DeepMind. The primary authors include:[1]
- Mustafa Shukor - PhD student at Sorbonne University[10]
- Dana Aubakirova - M2 student in MVA at ENS Paris-Saclay[11]
- Francesco Capuano
- Pepijn Kooijmans
- Steven Palma
- Adil Zouitine
- Michel Aractingi
- Caroline Pascal
- Martino Russi
- Andres Marafioti
- Simon Alibert
- Matthieu Cord
- Thomas Wolf - Co-founder of Hugging Face
- Remi Cadene - Research Scientist at Hugging Face
Release Timeline
| Date | Milestone | Significance |
|---|---|---|
| June 2, 2025 | Initial arXiv paper released[1] | First public disclosure of the model |
| June 3, 2025 | Official blog post and model release on Hugging Face[3] | Model weights and code made publicly available |
| June 4, 2025 | Media coverage highlighting the model's efficiency[4] | Widespread recognition of accessibility features |
| June 10, 2025 | Community adoption milestone | Over 1,000 downloads and first community contributions |
| June 13, 2025 | Asynchronous inference stack released | Performance improvements made available |
Architecture
SmolVLA's architecture consists of two main components that work together to process visual inputs, language instructions, and generate robot actions:[7]
Perception Module (SmolVLM-2)
The perception module is based on SmolVLM-2, an efficient vision-language model optimized for multi-image and video inputs. It comprises a SigLIP visual encoder and a compact language decoder based on SmolLM2.[12] Key features include:
- Vision Encoder: Uses SigLIP for robust visual feature encoding
- Language Decoder: Employs SmolLM2, a compact language model
- Token Efficiency: Limits visual tokens to 64 per frame via a pixel-shuffle token-reduction technique
- Layer Pruning: Uses only the first N layers (N = L/2, where L is the total number of layers) of the VLM's language decoder, reducing latency by approximately 40% (see the sketch below)[13]
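The layer-pruning idea can be illustrated with a small, self-contained PyTorch sketch. The TinyDecoder below is a toy stand-in, not the actual SmolVLM-2 code; only the slicing of the layer stack reflects what SmolVLA does with the VLM's language decoder.
import torch
import torch.nn as nn

# Toy decoder standing in for the VLM's language decoder; illustrative only.
class TinyDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=32):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

decoder = TinyDecoder(n_layers=32)

# Keep only the first N = L/2 layers and discard the rest.
keep = len(decoder.layers) // 2
decoder.layers = nn.ModuleList(list(decoder.layers)[:keep])

# 64 visual tokens + some text tokens + 1 state token, hidden size 256 (toy values).
tokens = torch.randn(1, 64 + 48 + 1, 256)
features = decoder(tokens)
print(features.shape)  # torch.Size([1, 113, 256])
Running only the first half of the layer stack roughly halves the decoder compute per forward pass, which is consistent with the reported ~40% latency reduction.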
Action Expert
The action expert is a specialized transformer module (~100M parameters) that generates continuous robot actions:[3]
- Architecture: Alternates between self-attention and cross-attention blocks with causal masking
- Training Method: Trained with a flow matching objective that learns to carry noisy samples back toward the ground-truth actions (a toy training step is sketched after this list)
- Action Generation: Produces "action chunks", i.e. sequences of future robot actions (50 timesteps by default)
- Temporal Consistency: Applies causal masking to ensure temporal coherence and improve smoothness
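The sketch below shows one flow-matching training step under simplifying assumptions: the MLP "action expert", the tensor sizes, and the particular interpolation convention are illustrative stand-ins, whereas SmolVLA's actual action expert is a transformer with interleaved self- and cross-attention and its exact parameterization may differ.
import torch
import torch.nn as nn

CHUNK, ACTION_DIM, COND_DIM = 50, 6, 256   # 50-step action chunks, toy dimensions

# Toy stand-in for the action expert (the real one is a ~100M-parameter transformer).
class ToyActionExpert(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ACTION_DIM + COND_DIM + 1, 512), nn.GELU(),
            nn.Linear(512, ACTION_DIM),
        )

    def forward(self, noisy_actions, cond, t):
        b, chunk, _ = noisy_actions.shape
        t = t.expand(b, chunk, 1)                        # broadcast the time step over the chunk
        cond = cond.unsqueeze(1).expand(b, chunk, COND_DIM)
        return self.net(torch.cat([noisy_actions, cond, t], dim=-1))

expert = ToyActionExpert()
actions = torch.randn(8, CHUNK, ACTION_DIM)              # ground-truth action chunks
cond = torch.randn(8, COND_DIM)                          # pooled VLM features (toy)

t = torch.rand(8, 1, 1)                                  # random interpolation time per sample
noise = torch.randn_like(actions)
noisy = (1 - t) * noise + t * actions                    # linear path from noise to data
target_velocity = actions - noise                        # flow-matching regression target

loss = nn.functional.mse_loss(expert(noisy, cond, t), target_velocity)
loss.backward()
print(float(loss))
At inference time the expert is applied iteratively, integrating the predicted velocity field from a noise sample toward a denoised action chunk.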
Asynchronous Inference Stack
A key innovation is SmolVLA's asynchronous inference system, which introduces a RobotClient ↔ PolicyServer scheme (a toy sketch follows this list):[14]
- The robot executes the current action chunk while the server predicts the next
- Maintains an action queue that is refilled whenever it drops below a guard-band threshold
- Enables low-latency control suitable for real-time applications
- Makes the system more adaptable and capable of faster recovery from errors
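A small thread-based toy makes the decoupling concrete, assuming a fixed chunk size and guard band; it is not the actual RobotClient/PolicyServer implementation, and the sleeps merely stand in for inference and actuation latency.
import queue
import threading
import time

CHUNK_SIZE = 50      # actions produced per policy call
GUARD_BAND = 10      # refill when fewer than this many actions remain

action_queue = queue.Queue()
stop = threading.Event()

def policy_server():
    step = 0
    while not stop.is_set():
        if action_queue.qsize() < GUARD_BAND:
            time.sleep(0.05)                   # stand-in for one model inference pass
            for _ in range(CHUNK_SIZE):
                action_queue.put(step)         # enqueue the next action chunk
                step += 1
        else:
            time.sleep(0.005)

def robot_client(n_steps=200):
    for _ in range(n_steps):
        action = action_queue.get()            # the robot never waits for a full inference pass
        time.sleep(0.01)                       # stand-in for executing `action` on the robot

threading.Thread(target=policy_server, daemon=True).start()
robot_client()
stop.set()
Because the server refills the queue while the robot is still executing the previous chunk, control never stalls on a full inference pass, which is the source of the reported latency and throughput gains.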
Input Processing
SmolVLA processes three types of inputs, which are packed into a single token sequence (a toy sketch follows this list):[7]
- Multiple RGB Images: Up to four frames from different camera views (resized to 512×512 pixels, global view only without tiling)
- Language Instructions: Natural language task descriptions tokenized into text tokens
- Sensorimotor States: Robot's current state projected into a single token via linear layer
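The following toy sketch shows how these inputs could be assembled into one token sequence, assuming a hidden size of 256 and two camera views; the names and shapes are illustrative and do not come from SmolVLA's actual code.
import torch
import torch.nn as nn

D = 256                                    # toy hidden size
n_frames, tokens_per_frame = 2, 64         # e.g. top-down + wrist cameras, 64 tokens each
n_text_tokens, state_dim = 12, 6

visual_tokens = torch.randn(1, n_frames * tokens_per_frame, D)   # from the SigLIP encoder + pixel shuffle
text_tokens = torch.randn(1, n_text_tokens, D)                   # embedded instruction tokens

state_proj = nn.Linear(state_dim, D)       # sensorimotor state -> a single token
state = torch.randn(1, state_dim)          # e.g. joint positions and gripper state
state_token = state_proj(state).unsqueeze(1)

sequence = torch.cat([visual_tokens, text_tokens, state_token], dim=1)
print(sequence.shape)                      # torch.Size([1, 141, 256]) = 128 + 12 + 1 tokens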
Training
Datasets
SmolVLA was trained exclusively on community-contributed datasets from the LeRobot ecosystem, totaling approximately 10 million frames across ~30,000 episodes. The training data consists of:[3]
- 487 high-quality datasets focused on the SO100 robot platform
- Diverse task coverage: Including pick-and-place, stacking, sorting, and manipulation tasks
- Natural diversity: Varied lighting conditions, suboptimal demonstrations, and heterogeneous control schemes
- Multiple environments: Data collected in homes, maker spaces, and research labs
| Dataset Family | Episodes | Frames (M) | Robot Types | Notes |
|---|---|---|---|---|
| SO-100 multi-task | ~20,000 | ~7.2 | SO-100 arm | Primary training data |
| SO-101 OOD test | ~3,000 | ~1.1 | SO-101 arm | Out-of-distribution testing |
| LeKiwi mobile base | ~2,000 | ~0.7 | Mobile manipulator | Navigation + manipulation |
| Misc. hobby datasets | ~5,000 | ~1.6 | Various DIY rigs | Community contributions |
The datasets were curated using a custom filtering tool created by Alexandre Chapin and Ville Kuosmanen, with manual review by Marina Barannikov. Noisy labels were automatically rewritten with Qwen2.5-VL-3B-Instruct into short, action-verb-led instructions of at most 30 characters.[3]
Camera views were standardized as follows:
| Camera View | Description |
|---|---|
| OBS_IMAGE_1 | Top-down view |
| OBS_IMAGE_2 | Wrist-mounted camera |
| OBS_IMAGE_3+ | Additional views |
Training Process
The training methodology follows a two-phase approach inspired by large language models:[3]
- Pretraining Phase: 200,000 steps on general manipulation data from community datasets
- Task-Specific Post-Training: 100,000-200,000 steps of fine-tuning on specific tasks
Key training specifications:[7]
- Can be trained on a single consumer GPU (for example, an RTX 3080 Ti with 12 GB VRAM)
- Batch size: 44 (adjustable based on available VRAM; e.g. 16 for 6 GB GPUs)
- Training time: Approximately 4 hours for 20,000 steps on a single A100 GPU[15]
- Memory usage: ~11.53 GB GPU memory during training
- Loss convergence: From 1.198 to 0.004 over 200,000 steps
Performance
Simulation Benchmarks
SmolVLA demonstrates strong performance on established robotics benchmarks despite its compact size:[7]
| Benchmark | SmolVLA (0.45B) | π₀ (3.3B) | OpenVLA (7B) | Diffusion Policy | ACT |
|---|---|---|---|---|---|
| LIBERO-40 | 87.3% | ~85% | Lower | Lower | Lower |
| Meta-World MT50 | Highest | - | Lower | Lower | - |
| Average Success Rate | 82.5% | 80.2% | 78.9% | 75.3% | 76.8% |
Real-World Performance
On real-world robotic platforms, SmolVLA achieves:[5]
| Platform | Task | Success Rate | Notes |
|---|---|---|---|
| SO100 | Pick-Place | 78.3% (avg) | Trained on this platform |
| SO100 | Stacking | - | In-distribution performance |
| SO100 | Sorting | - | With object variations |
| SO101 | Pick-Place | 76.5% | Zero-shot generalization |
| SO101 | Complex manipulation | 72.1% | Out-of-distribution |
Impact of Pretraining
The effectiveness of community dataset pretraining is demonstrated by:[3]
- Without pretraining: 51.7% success rate on SO100 tasks
- With pretraining: 78.3% success rate (a 26.6-percentage-point absolute improvement)
- With multitask finetuning: Further improvements in low-data regimes (up to 85% on specific tasks)
Asynchronous Inference Benefits
The asynchronous inference stack provides:[5]
- 30% reduction in average task completion time
- 2× increase in completed actions within fixed time scenarios (19 vs. 9 cubes moved)
- 40% reduction in inference latency through layer pruning
- Average inference time: approximately 0.087 seconds (~87 ms)
- Maximum GPU memory usage: approximately 908 MB during inference
Technical Specifications
Model Details
- Total Parameters: 450 million (roughly two orders of magnitude smaller than contemporary VLAs)[3]
- Action Expert Parameters: ~100 million[7]
- VLM Layers Used: First 16 layers (out of 32 total)[16]
- Visual Tokens per Frame: 64[7]
- Action Chunk Size: Configurable (typically 50 timesteps for 1 second)[7]
- License: Apache-2.0 (code & model weights)[17]
Hardware Requirements
| Operation | Minimum Hardware | Recommended Hardware | Performance Notes |
|---|---|---|---|
| Training | Single consumer GPU (6GB VRAM) | GPU with 12GB+ VRAM | 4 hours for 20k steps on A100 |
| Inference | CPU (modern laptop) | Consumer GPU | Real-time on MacBook Pro |
| Fine-tuning | RTX 3080Ti (12GB) | A100 GPU | Batch size adjustable |
| Edge Deployment | Raspberry Pi 4 | Jetson Nano | For privacy-sensitive installations |
Software Integration
SmolVLA is fully integrated with the LeRobot framework:[18]
# Example fine-tuning command using the LeRobot trainer
# (replace lerobot/your_dataset with the repository ID of your own dataset)
python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/your_dataset \
  --batch_size=64 \
  --steps=20000
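For inference, a hedged Python sketch of loading the released checkpoint through the LeRobot API is shown below. The import path follows recent LeRobot releases but may differ between versions, and the observation keys are placeholders that must match the camera and state feature names the policy was trained or fine-tuned with; consult the LeRobot documentation for the exact interface.
import torch
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy  # path may vary by version

policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()
policy.reset()   # clear any queued actions before starting an episode

# Placeholder observation; real deployments feed camera frames and robot state
# under the feature names used during (fine-)tuning.
observation = {
    "observation.images.top": torch.rand(1, 3, 512, 512),
    "observation.state": torch.rand(1, 6),
    "task": ["pick up the red cube and place it in the box"],
}
with torch.no_grad():
    action = policy.select_action(observation)
print(action.shape)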
Applications
SmolVLA has been successfully deployed for various robotic manipulation tasks and environments:[3]
Supported Tasks
- Pick-and-Place Operations: Grasping and relocating objects with various shapes and sizes
- Stacking Tasks: Building stable structures with blocks
- Sorting Activities: Organizing objects by category or properties
- Assembly Operations: Simple construction tasks
- Kitchen Tasks: Basic food preparation and cleaning
- Mobile Manipulation: Combined navigation and manipulation tasks
Robot Platforms
- SO100: Primary training platform
- SO101: Demonstrates zero-shot generalization
- Koch Arm: Community-tested implementation[4]
- ALOHA-style robots: Compatible through LeRobot framework
- Raspberry Pi robots: Edge deployment for education
- Custom DIY platforms: Community-built robots
Use Cases
- Education & Hobby Robotics: Runs on Raspberry Pi-class edge computers, enabling classroom demos and maker projects[4]
- Research Prototyping: Quick fine-tuning with only a handful of additional demos using the LeRobot trainer[19]
- Edge Deployment: Works fully offline on consumer GPUs or CPUs, important for privacy-sensitive installations[8]
- Research Baseline: Serves as a reproducible small-scale reference when studying VLA design choices[1]
Comparison with Other VLAs
| Model | Parameters | Training Data | Hardware Requirements | Open Source | Real-time Capable |
|---|---|---|---|---|---|
| SmolVLA | 450M | Community datasets | Consumer GPU/CPU | ✓ | ✓ |
| OpenVLA | 7B | OXE dataset | High-end GPU | ✓ | ✗ |
| RT-2 | 55B | Proprietary | Enterprise GPU cluster | ✗ | ✗ |
| π0 | 3.3B | Mixed proprietary/open | High-end GPU | Partial | ✗ |
| ACT | 1B | Task-specific | Mid-range GPU | ✓ | Partial |
Impact and Reception
Academic Impact
SmolVLA has been cited as a significant advancement in democratizing robotic learning. Researchers have noted its importance in:[20]
- Lowering barriers to entry for robotics research
- Demonstrating the effectiveness of community-driven datasets
- Proving that compact models can achieve competitive performance
- Challenging the trend of scaling up model sizes
- Promoting sustainable and efficient AI development
Industry Adoption
The model has seen rapid adoption in:[21]
- Educational institutions with limited budgets
- Small robotics startups
- Research labs in developing countries
- Hobbyist and maker communities
- Privacy-conscious industrial applications
Community Response
The robotics community has responded positively, with researchers describing it as potentially a "BERT moment for robotics".[4] Key community contributions include:
- Over 100 additional dataset contributions to the LeRobot ecosystem
- Ports to various robot platforms including mobile manipulators
- Performance optimizations reducing inference time by an additional 15%
- Integration with popular robotics frameworks like ROS
Limitations
Despite its achievements, SmolVLA has several acknowledged limitations:[16]
- Dataset Diversity: Training data is predominantly from SO100 platform, limiting cross-embodiment generalization
- Dataset Size: Uses significantly less data (~30k episodes) than state-of-the-art VLAs such as OpenVLA (on the order of a million trajectories)
- Long-Horizon Tasks: Limited evaluation on extended task sequences beyond 1-2 minutes
- VLM Backbone: Uses a general-purpose VLM not specifically pretrained for robotics
- Single-Arm Focus: Primary evaluation on single-arm manipulation tasks
- Complex Language Grounding: Compact size trades off complex language understanding compared to billion-parameter VLAs
Future Directions
The SmolVLA team and community have identified several areas for future development:[3]
- Cross-Embodiment Training: Expanding to more diverse robot platforms including quadrupeds and humanoids
- Scaling Studies: Investigating optimal model sizes for different applications (exploring 200M-1B parameter variants)
- Joint Multimodal Training: Combining robotics data with general vision-language datasets
- Real-Time Optimizations: Further improvements to inference speed targeting sub-50ms latency
- Sim-to-Real Transfer: Better integration with simulation environments like Isaac Gym
- Reinforcement Learning: Integration of RL fine-tuning for improved task performance
- Larger Community Datasets: Goal of reaching 1 million demonstration episodes by 2026
See Also
- Vision-Language-Action Model
- LeRobot
- Hugging Face
- Google DeepMind
- OpenVLA
- RT-2
- π0 (Pi-Zero)
- Embodied AI
- Robot learning
- Foundation models
- Edge AI
References
- ↑ 1.0 1.1 1.2 1.3 Shukor, Mustafa et al. (2025-06-02). "SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics". https://arxiv.org/abs/2506.01844.
- ↑ "SmolVLA naming convention". Hugging Face. https://huggingface.co/blog/smolvla.
- ↑ 3.00 3.01 3.02 3.03 3.04 3.05 3.06 3.07 3.08 3.09 3.10 3.11 "SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data". Hugging Face. https://huggingface.co/blog/smolvla.
- ↑ 4.0 4.1 4.2 4.3 4.4 4.5 4.6 "Hugging Face says its new robotics model is so efficient it can run on a MacBook". TechCrunch. 2025-06-04. https://techcrunch.com/2025/06/04/hugging-face-says-its-new-robotics-model-is-so-efficient-it-can-run-on-a-macbook/.
- ↑ 5.0 5.1 5.2 5.3 "Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics". MarkTechPost. 2025-06-03. https://www.marktechpost.com/2025/06/03/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics/.
- ↑ "An Open-Source Vision-Language-Action Model for Modern Robotics - SmolVLA". Black Coffee Robotics. https://blackcoffeerobotics.com/blog/smolvla-open-source-vision-language-action-model.
- ↑ 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 "SmolVLA: A vision-language-action model for affordable and efficient robotics". https://arxiv.org/html/2506.01844v1.
- ↑ 8.0 8.1 "AI and Robotics: With SmolVLA, Hugging Face Opens VLA Models to the Community". ActuIA. 2025-06-17. https://www.actuia.com/english/ai-and-robotics-with-smolvla-hugging-face-opens-vision-language-action-models-to-the-community/.
- ↑ "SmolVLA (Efficient Vision-Language-Action Model)". Emergent Mind. 2025-06-21. https://www.emergentmind.com/papers/smolvla.
- ↑ "Mustafa Shukor - Google Scholar". https://scholar.google.com/citations?user=lhp9mRgAAAAJ&hl=en.
- ↑ "Dana Aubakirova - Google Scholar". https://scholar.google.com/citations?user=iX_-o7IAAAAJ&hl=en.
- ↑ "SmolVLA architecture variations". https://arxiv.org/html/2506.01844v1.
- ↑ "SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data". 2025-06-11. https://learnopencv.com/smolvla-lerobot-vision-language-action-model/.
- ↑ "Decoding SmolVLA: A Vision-Language-Action Model for Efficient and Accessible Robotics". Phospho AI Blog. 2025-06-13. https://blog.phospho.ai/decoding-smolvla.
- ↑ "Finetune SmolVLA". Hugging Face. https://huggingface.co/docs/lerobot/smolvla.
- ↑ 16.0 16.1 "Literature Review: SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics". Moonlight. https://www.themoonlight.io/en/review/smolvla-a-vision-language-action-model-for-affordable-and-efficient-robotics.
- ↑ "LeRobot License". GitHub. https://github.com/huggingface/lerobot/blob/main/LICENSE.
- ↑ "GitHub - huggingface/lerobot: 🤗 LeRobot: Making AI for Robotics more accessible with end-to-end learning". GitHub. https://github.com/huggingface/lerobot.
- ↑ "Train SmolVLA – starter pack". Phospho AI Docs. https://docs.phospho.ai/smolvla-training.
- ↑ "Paper page - SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics". Hugging Face. https://huggingface.co/papers/2506.01844.
- ↑ "SmolVLA Model: How Hugging Face's Vision-Language-Action AI is Democratizing Robotics". 2025-06-20. https://flowgrammer.ca/smolvla-model-democratizing-robotics/.