
jonyzhang2023/awesome-embodied-vla-va-vln

A curated list of state-of-the-art research in embodied AI, focusing on vision-language-action (VLA) models, vision-language navigation (VLN), and related multimodal learning approaches.


Top Related Projects

  • CLIP: Contrastive Language-Image Pretraining; predicts the most relevant text snippet given an image.
  • habitat-lab: A modular high-level library to train embodied AI agents across a variety of tasks and environments.
  • habitat-sim: A flexible, high-performance 3D simulator for Embodied AI research.
  • AI2-THOR: An open-source platform for Visual AI.
  • ml-agents: The Unity Machine Learning Agents Toolkit (ML-Agents), an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.

Quick Overview

The GitHub repository jonyzhang2023/awesome-embodied-vla-va-vln is a curated list of resources related to vision-language-action (VLA) models, vision-action (VA) models, and vision-language navigation (VLN). It serves as a comprehensive collection of papers, datasets, and tools for researchers and practitioners in these fields, focusing on the intersection of computer vision, natural language processing, and robotics.

Pros

  • Extensive collection of up-to-date research papers and resources
  • Well-organized structure, categorizing content by specific sub-fields
  • Includes both theoretical papers and practical implementations
  • Regularly updated with new contributions from the community

Cons

  • May be overwhelming for beginners due to the large volume of information
  • Lacks detailed explanations or summaries of individual resources
  • Some links may become outdated over time if not actively maintained
  • Limited coverage of certain niche areas within the broader field

Code Examples

This repository is not a code library but a curated list of resources. Therefore, there are no code examples to provide.

Getting Started

As this is not a code library, there are no specific getting started instructions. However, users can navigate the repository by exploring the different sections and following the links to the resources that interest them most.
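For readers who prefer to browse programmatically, the short sketch below (not part of the repository; the branch name in the raw URL is an assumption) fetches the README and prints every link labeled "paper":

import re
import urllib.request

# Hypothetical raw-README URL; the default branch name ("main") is an assumption.
URL = "https://raw.githubusercontent.com/jonyzhang2023/awesome-embodied-vla-va-vln/main/README.md"

with urllib.request.urlopen(URL) as resp:
    readme = resp.read().decode("utf-8")

# Collect markdown links of the form [text](url) and keep only the paper links.
for text, url in re.findall(r"\[([^\]]+)\]\((https?://[^)]+)\)", readme):
    if text.lower() == "paper":
        print(url)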

Competitor Comparisons


CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

Pros of CLIP

  • Robust and versatile model for connecting text and images
  • Extensive documentation and examples for implementation
  • Backed by OpenAI, with ongoing development and support

Cons of CLIP

  • Focused solely on image-text understanding, not embodied AI or navigation
  • May require more computational resources for training and inference
  • Less specialized for specific tasks like vision-language navigation

Code Comparison

CLIP example:

import torch
from PIL import Image
import clip

model, preprocess = clip.load("ViT-B/32", device="cuda")
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to("cuda")
text = clip.tokenize(["a dog", "a cat"]).to("cuda")

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
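
A typical continuation of this example (standard CLIP usage, shown here only for illustration) scores the two candidate captions against the image:

with torch.no_grad():
    # Joint forward pass returns image-to-text and text-to-image similarity logits.
    logits_per_image, logits_per_text = model(image, text)
    # Probability that the image matches each caption ("a dog" vs. "a cat").
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Caption probabilities:", probs)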

awesome-embodied-vla-va-vln doesn't provide specific code examples, as it's a curated list of resources for vision-language-action (VLA), vision-action (VA), and vision-language navigation (VLN) research.

Summary

CLIP is a powerful model for image-text understanding, while awesome-embodied-vla-va-vln is a comprehensive resource list for embodied AI and related fields. CLIP offers ready-to-use implementations but is limited to image-text tasks, whereas awesome-embodied-vla-va-vln provides a broader scope of resources for embodied AI applications, including manipulation and navigation.

A modular high-level library to train embodied AI agents across a variety of tasks and environments.

Pros of habitat-lab

  • Comprehensive simulation platform for embodied AI research
  • Extensive documentation and tutorials for easy onboarding
  • Active development and support from Facebook AI Research

Cons of habitat-lab

  • Steeper learning curve due to complex architecture
  • Focused primarily on 3D environments, less versatile for other domains
  • Requires more computational resources for running simulations

Code comparison

habitat-lab:

import habitat
env = habitat.Env(
    config=habitat.get_config("benchmark/nav/pointnav/pointnav_gibson.yaml")
)
observations = env.reset()
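
A common continuation (mirroring habitat-lab's basic example; exact config names can vary between releases) rolls out one episode with randomly sampled actions:

while not env.episode_over:
    # Sample a discrete navigation action (e.g., move forward / turn) and step the env.
    observations = env.step(env.action_space.sample())
env.close()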

awesome-embodied-vla-va-vln:

No code available for comparison. This repository is a curated list of resources
rather than a functional codebase.

Summary

habitat-lab is a powerful simulation platform for embodied AI research, offering comprehensive tools and documentation. However, it has a steeper learning curve and focuses primarily on 3D environments. awesome-embodied-vla-va-vln, on the other hand, serves as a curated list of resources for vision-language-action (VLA), vision-action (VA), and vision-language navigation (VLN) research. It provides a broader overview of the field but doesn't offer a functional codebase like habitat-lab.

A flexible, high-performance 3D simulator for Embodied AI research.

Pros of habitat-sim

  • Comprehensive simulation platform for embodied AI research
  • Highly optimized C++ core with Python bindings for performance
  • Extensive documentation and tutorials for ease of use

Cons of habitat-sim

  • Steeper learning curve due to complex architecture
  • Requires more computational resources for running simulations
  • Limited to specific types of embodied AI tasks

Code comparison

habitat-sim:

import habitat_sim
cfg = habitat_sim.SimulatorConfiguration()
agent_cfg = habitat_sim.agent.AgentConfiguration()
sim = habitat_sim.Simulator(habitat_sim.Configuration(cfg, [agent_cfg]))
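
Assuming a scene asset and sensors have been configured (omitted above), the agent can then be stepped through its default action space; this is a minimal sketch rather than a complete setup:

# Default agent actions include "move_forward", "turn_left", and "turn_right".
observations = sim.step("move_forward")  # dict of sensor observations (empty if no sensors were attached)
sim.close()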

awesome-embodied-vla-va-vln:

# Awesome Embodied Vision-Language-Action (VLA) and Vision-and-Language Navigation (VLN)

A curated list of resources for Embodied VLA and VLN research.

The code comparison shows that habitat-sim is a functional simulation framework, while awesome-embodied-vla-va-vln is a curated list of resources. habitat-sim provides an actual implementation, whereas awesome-embodied-vla-va-vln serves as a reference guide for researchers in the field.


An open-source platform for Visual AI.

Pros of AI2-THOR

  • Comprehensive 3D environment for AI agents with physics simulation
  • Extensive documentation and tutorials for easy onboarding
  • Active development and regular updates from a dedicated team

Cons of AI2-THOR

  • Focused solely on indoor environments, limiting outdoor scenarios
  • Requires more computational resources due to 3D rendering
  • Less flexibility for custom environments compared to curated lists

Code Comparison

AI2-THOR example:

from ai2thor.controller import Controller

controller = Controller()
event = controller.step(dict(action="MoveAhead"))
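
For orientation, a typical next step is to inspect the returned event's metadata (standard AI2-THOR usage, not specific to either repository):

print(event.metadata["lastActionSuccess"])   # whether the MoveAhead action succeeded
print(event.metadata["agent"]["position"])   # agent position after the step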

awesome-embodied-vla-va-vln doesn't contain code samples as it's a curated list.

Summary

AI2-THOR is a robust 3D simulation platform for embodied AI research, offering realistic indoor environments with physics simulation. It provides a complete solution for training and testing AI agents in various tasks.

awesome-embodied-vla-va-vln, on the other hand, is a curated list of resources for vision-language-action (VLA), vision-action (VA), and vision-language navigation (VLN) research. It serves as a comprehensive reference for researchers and developers in these fields.

While AI2-THOR offers a ready-to-use environment, awesome-embodied-vla-va-vln provides a broader overview of available tools, datasets, and research papers across multiple domains of embodied AI.


The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.

Pros of ml-agents

  • Comprehensive toolkit for developing AI agents in Unity environments
  • Extensive documentation and tutorials for easy onboarding
  • Active community and regular updates from Unity Technologies

Cons of ml-agents

  • Focused solely on Unity engine, limiting its application outside of Unity projects
  • Steeper learning curve for those not familiar with Unity or game development
  • Requires more computational resources for training complex agents

Code comparison

ml-agents:

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

channel = EngineConfigurationChannel()
env = UnityEnvironment(file_name="MyEnvironment", side_channels=[channel])
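
A hedged continuation (standard mlagents_envs usage; the environment binary name above is a placeholder) resets the environment and pulls the first behavior's agent steps:

env.reset()
behavior_name = list(env.behavior_specs)[0]            # name of the first registered behavior
decision_steps, terminal_steps = env.get_steps(behavior_name)
print(behavior_name, len(decision_steps))              # agents currently requesting a decision
env.close()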

awesome-embodied-vla-va-vln:

# No direct code comparison available as this repository is a curated list of resources
# rather than a specific implementation or framework

Summary

ml-agents is a powerful toolkit for developing AI agents within Unity environments, offering comprehensive tools and documentation. However, it's limited to Unity projects and may have a steeper learning curve. awesome-embodied-vla-va-vln, on the other hand, is a curated list of resources for embodied vision-language AI, vision-and-language navigation, and related topics. It provides a broader overview of the field but doesn't offer a specific implementation framework like ml-agents does.


README

awesome-embodied-vla/va/vln

As more and more outstanding vision-language-based policies emerge, this repository aims to organize and showcase the state-of-the-art technologies in robot learning, including vision-language-action (VLA) models, vision-language-navigation (VLN) models, vision-action (VA) models and other MLLM-based embodied learning. We hope that in the near future, robotics will experience its own 'LLM moment.'

This repository will be continuously updated, and we warmly invite contributions from the community. If you have any papers, projects, or resources that are not yet included, please feel free to submit them via a pull request or open an issue for discussion.

Let's build a comprehensive resource for the robotics and AI community!

Jony and Sage

📌 Note on Paper Ordering

Within each year section, papers are generally listed in chronological order (newer papers appear lower in the list).
However, particularly influential or representative works may be highlighted at the top of each year's section, regardless of their exact publication date.

📚 Survey

  • [2025] [PKU-PsiBot] A Survey on Vision-Language-Action Models: An Action Tokenization Perspective [paper]
  • [2025] A Survey on Vision-Language-Action Models for Autonomous Driving [paper] [project]
  • [2025] Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends [paper] [project]
  • [2025] [IJRR 25] Foundation Models in Robotics: Applications, Challenges, and the Future [paper] [project]
  • [2025] Deep Reinforcement Learning for Robotics: A Survey of Real-World Successes [paper]
  • [2025] A Survey on Diffusion Policy for Robotic Manipulation: Taxonomy, Analysis, and Future Directions [paper] [project]
  • [2025] Embodied Intelligent Industrial Robotics: Concepts and Techniques [paper] [project]
  • [2025] Neural Brain: A Neuroscience-inspired Framework for Embodied Agents [paper] [project]
  • [2025] Vision-Language-Action Models: Concepts, Progress, Applications and Challenges [paper]
  • [2025] A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI [paper]
  • [2025] Multimodal Perception for Goal-oriented Navigation: A Survey [paper]
  • [2025] Diffusion Models for Robotic Manipulation: A Survey [paper]
  • [2025] Dexterous Manipulation through Imitation Learning: A Survey [paper]
  • [2025] Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision [paper] [project]
  • [2025] SE(3)-Equivariant Robot Learning and Control: A Tutorial Survey [paper]
  • [2025] Generative Artificial Intelligence in Robotic Manipulation: A Survey [paper] [project]
  • [2025] Development Report of Embodied Intelligence (Chinese) [paper]
  • [2025] Survey on Vision-Language-Action Models [paper]
  • [2025] Exploring Embodied Multimodal Large Models: Development, Datasets, and Future Directions [paper]
  • [2024] Embodied-AI with large models: research and challenges [paper]
  • [2024] A Survey on Vision-Language-Action Models for Embodied AI [paper]
  • [2024] A Survey of Embodied Learning for Object-Centric Robotic Manipulation [paper]
  • [2024] Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI [paper]
  • [2024] Vision-language navigation: a survey and taxonomy [paper]

💥 Vision Language Action (VLA) Models

2025

  • [2025] [Gemini Robotics] Gemini Robotics On-Device brings AI to local robotic devices [report]
  • [2025] [Meta] V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning [paper] [project] [code]
  • [2025] [Physical Intelligence] Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better [paper] [project]
  • [2025] [Physical Intelligence] π0.5: A Vision-Language-Action Model with Open-World Generalization [paper] [project]
  • [2025] [Nvidia] GR00T N1: An Open Foundation Model for Generalist Humanoid Robots [paper] [project]
  • [2025] [Gemini Robotics] Gemini Robotics: Bringing AI into the Physical World [report]
  • [2025] [AgiBot] AgiBot World Colosseo: Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems [paper] [project]
  • [2025] [PsiBot] DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping [paper] [project]
  • [2025] [RSS 25] Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success [paper] [project]
  • [2025] [Physical Intelligence] Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models [paper] [project]
  • [2025] [Figure] Helix: A Vision-Language-Action Model for Generalist Humanoid Control [report]
  • [2025] [AgiBot] EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation [paper]
  • [2025] FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning [paper] [project]
  • [2025] Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing [paper]
  • [2025] Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding [paper]
  • [2025] [Physical Intelligence] FAST: Efficient Action Tokenization for Vision-Language-Action Models [paper]
  • [2025] GeoManip: Geometric Constraints as General Interfaces for Robot Manipulation [paper]
  • [2025] Universal Actions for Enhanced Embodied Foundation Models [paper]
  • [2025] SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model [paper]
  • [2025] RoboHorizon: An LLM-Assisted Multi-View World Model for Long-Horizon Robotic Manipulation [paper]
  • [2025] SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation [paper]
  • [2025] Improving Vision-Language-Action Model with Online Reinforcement Learning [paper]
  • [2025] Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation [paper]
  • [2025] VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation [paper]
  • [2025] From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment [paper]
  • [2025] GRAPE: Generalizing Robot Policy via Preference Alignment [paper]
  • [2025] DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control [paper]
  • [2025] HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation [paper]
  • [2025] Temporal Representation Alignment: Successor Features Enable Emergent Compositionality in Robot Instruction Following [paper]
  • [2025] RoboBERT: An End-to-end Multimodal Robotic Manipulation Model [paper]
  • [2025] Diffusion Transformer Policy: Scaling Diffusion Transformer for Generalist Visual-Language-Action Learning [paper]
  • [2025] GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation [paper]
  • [2025] SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation [paper]
  • [2025] Pre-training Auto-regressive Robotic Models with 4D Representations [paper]
  • [2025] Magma: A Foundation Model for Multimodal AI Agents [paper]
  • [2025] An Atomic Skill Library Construction Method for Data-Efficient Embodied Manipulation [paper]
  • [2025] VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation [paper]
  • [2025] Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration [paper]
  • [2025] ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model [paper]
  • [2025] ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration [paper] [project]
  • [2025] [CVPR 25] RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete [paper] [project]
  • [2025] SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning [paper] [project]
  • [2025] CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs [paper] [project]
  • [2025] UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation [paper]
  • [2025] Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding [paper]
  • [2025] RaceVLA: VLA-based Racing Drone Navigation with Human-like Behaviour [paper] [project]
  • [2025] OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction [paper] [project]
  • [2025] A Generative System for Robot-to-Human Handovers: from Intent Inference to Spatial Configuration Imagery [paper] [project]
  • [2025] VLA Model-Expert Collaboration for Bi-directional Manipulation Learning [paper] [project]
  • [2025] PointVLA: Injecting the 3D World into Vision-Language-Action Models [paper] [project]
  • [2025] Towards Safe Robot Foundation Models [paper] [project]
  • [2025] VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [paper] [project]
  • [2025] iManip: Skill-Incremental Learning for Robotic Manipulation [paper]
  • [2025] Refined Policy Distillation: From VLA Generalists to RL Experts [paper]
  • [2025] LUMOS: Language-Conditioned Imitation Learning with World Models [paper] [project]
  • [2025] HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model [paper] [project]
  • [2025] MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models [paper]
  • [2025] TLA: Tactile-Language-Action Model for Contact-Rich Manipulation [paper] [project]
  • [2025] FP3: A 3D Foundation Policy for Robotic Manipulation [paper] [project]
  • [2025] [CVPR 25] MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation [paper] [project]
  • [2025] Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [paper] [project]
  • [2025] Think Small, Act Big: Primitive-level Skill Prompt Learning for Lifelong Robot Manipulation [paper]
  • [2025] JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse [paper] [project]
  • [2025] RoboFlamingo-Plus: Fusion of Depth and RGB Perception with Vision-Language Models for Enhanced Robotic Manipulation [paper]
  • [2025] CubeRobot: Grounding Language in Rubik's Cube Manipulation via Vision-Language Model [paper]
  • [2025] MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation [paper] [project]
  • [2025] [RSS 25] ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy [paper] [project]
  • [2025] [CVPR 25] CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [paper] [project]
  • [2025] DexTOG: Learning Task-Oriented Dexterous Grasp with Language Condition [paper] [project]
  • [2025] SPECI: Skill Prompts based Hierarchical Continual Imitation Learning for Robot Manipulation [paper]
  • [2025] [RSS 25] Gripper Keypose and Object Pointflow as Interfaces for Bimanual Robotic Manipulation [paper] [project]
  • [2025] NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks [paper] [project]
  • [2025] [CVPR 25] RoboGround: Robotic Manipulation with Grounded Vision-Language Priors [paper] [project]
  • [2025] ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow [paper] [project]
  • [2025] [RA-L 25] GR-MG: Leveraging Partially-Annotated Data via Multi-Modal Goal-Conditioned Policy [paper] [project]
  • [2025] Task Reconstruction and Extrapolation for using Text Latent [paper]
  • [2025] Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions [paper] [project]
  • [2025] OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation [paper] [project]
  • [2025] CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation [paper]
  • [2025] [RSS 25] Learning to Act Anywhere with Task-centric Latent Actions [paper] [project]
  • [2025] Pixel Motion as Universal Representation for Robot Control [paper] [project]
  • [2025] [RSS 25] CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision [paper] [project]
  • [2025] Training Strategies for Efficient Embodied Reasoning [paper]
  • [2025] A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation [paper] [project]
  • [2025] VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation [paper] [project]
  • [2025] OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning [paper] [project]
  • [2025] InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning [paper] [project]
  • [2025] ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation [paper] [project]
  • [2025] Hume: Introducing System-2 Thinking in Visual-Language-Action Model [paper] [project]
  • [2025] Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents [paper] [project]
  • [2025] TrackVLA: Embodied Visual Tracking in the Wild [paper] [project]
  • [2025] SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models [paper] [project]
  • [2025] OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation [paper] [project]
  • [2025] LoHoVLA: A Vision-Language-Action Model for Long-Horizon Embodied Tasks [paper]
  • [2025] SmolVLA: A vision-language-action model for affordable and efficient robotics [paper] [project]
  • [2025] VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning [paper] [project]
  • [2025] Online RL with Simple Reward Enables Training VLA Models with Only One Trajectory [project]
  • [2025] BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation [paper] [project]
  • [2025] Fast ECoT: Efficient Embodied Chain-of-Thought via Thoughts Reuse [paper]
  • [2025] BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models [paper] [project]
  • [2025] SAFE: Multitask Failure Detection for Vision-Language-Action Models [paper] [project]
  • [2025] From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models [paper] [project]
  • [2025] Time- Diffusion Policy with Action Discrimination for Robotic Manipulation [paper]
  • [2025] RationalVLA: A Rational Vision-Language-Action Model with Dual System [paper] [project]
  • [2025] Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning [paper] [project]
  • [2025] [Physical Intelligence] Real-Time Execution of Action Chunking Flow Policies [paper] [project]
  • [2025] LLaVA-VLA: A Simple Yet Powerful Vision-Language-Action Model [project]
  • [2025] LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction [paper]
  • [2025] CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding [paper] [project]
  • [2025] DreamGen: Unlocking Generalization in Robot Learning through Video World Models [paper] [project]
  • [2025] RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models [paper] [project]
  • [2025] MinD: Unified Visual Imagination and Control via Hierarchical World Model [paper] [project]
  • [2025] CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation [paper] [project]
  • [2025] WorldVLA: Towards Autoregressive Action World Model [paper] [project]
  • [2025] ACTLLM: Action Consistency Tuned Large Language Model [paper]
  • [2025] TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control [paper] [project]
  • [2025] [ICCV 25] VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers [paper] [project]
  • [2025] Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding [paper] [project]
  • [2025] AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation [paper] [project]
  • [2025] MultiGen: Using Multimodal Generation in Simulation to Learn Multimodal Policies in Real [paper]

2024

  • [2024] [Physical Intelligence] π0: A Vision-Language-Action Flow Model for General Robot Control [paper]
  • [2024] [ICLR 25] RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation [paper] [project]
  • [2024] [CoRL 24] OpenVLA: An Open-Source Vision-Language-Action Model [paper] [project]
  • [2024] MiniVLA: A Better VLA with a Smaller Footprint [paper] [project]
  • [2024] [RSS 24] Octo: An Open-Source Generalist Robot Policy [paper] [project]
  • [2024] [ICRA 24 Best Paper] Open X-Embodiment: Robotic Learning Datasets and RT-X Models [paper] [project]
  • [2024] RT-H: Action Hierarchies Using Language [paper]
  • [2024] Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models [paper]
  • [2024] Baku: An Efficient Transformer for Multi-Task Policy Learning [paper]
  • [2024] Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals [paper]
  • [2024] TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [paper]
  • [2024] Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression [paper]
  • [2024] CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation [paper]
  • [2024] 3D-VLA: A 3D Vision-Language-Action Generative World Model [paper]
  • [2024] Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations [paper]
  • [2024] An Embodied Generalist Agent in 3D World [paper]
  • [2024] RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation [paper]
  • [2024] SpatialBot: Precise Spatial Understanding with Vision Language Models [paper]
  • [2024] Depth Helps: Improving Pre-trained RGB-based Policy with Depth Information Injection [paper]
  • [2024] HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers [paper]
  • [2024] LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [paper]
  • [2024] RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation [paper]
  • [2024] Robotic Control via Embodied Chain-of-Thought Reasoning [paper]
  • [2024] GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation [paper]
  • [2024] Latent Action Pretraining from Videos [paper]
  • [2024] DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [paper]
  • [2024] RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation [paper]
  • [2024] Moto: Latent Motion Token as the Bridging Language for Robot Manipulation [paper]
  • [2024] TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies [paper]
  • [2024] Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments [paper]
  • [2024] RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics [paper]
  • [2024] Yell At Your Robot: Improving On-the-Fly from Language Corrections [paper]
  • [2024] Any-point Trajectory Modeling for Policy Learning [paper]
  • [2024] Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust [paper]
  • [2024] RoboNurse-VLA: Robotic Scrub Nurse System based on Vision-Language-Action Model [paper]
  • [2024] Actra: Optimized Transformer Architecture for Vision-Language-Action Models in Robot Learning [paper]
  • [2024] QUAR-VLA: Vision-Language-Action Model for Quadruped Robots [paper]
  • [2024] General Flow as Foundation Affordance for Scalable Robot Learning [paper] [project]

2023

  • [2023] RT-1: Robotics Transformer for Real-World Control at Scale [paper]
  • [2023] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [paper]
  • [2023] PaLM-E: An Embodied Multimodal Language Model [paper]
  • [2023] Vision-Language Foundation Models as Effective Robot Imitators [paper]
  • [2023] Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation [paper]
  • [2023] Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models [paper]
  • [2023] Learning Universal Policies via Text-Guided Video Generation [paper]
  • [2023] Learning to Act from Actionless Videos through Dense Correspondences [paper]
  • [2023] Compositional Foundation Models for Hierarchical Planning [paper]
  • [2023] VIMA: General Robot Manipulation with Multimodal Prompts [paper]
  • [2023] Prompt a Robot to Walk with Large Language Models [paper]
  • [2023] Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning [paper]

🚶 Vision Language Navigation (VLN) Models

2025

  • [2025] [IJRR 25] Multimodal Spatial Language Maps for Robot Navigation and Manipulation [paper] [project]
  • [2025] [RSS 25] NaVILA: Legged Robot Vision-Language-Action Model for Navigation [paper] [project]
  • [2025] Learned Perceptive Forward Dynamics Model for Safe and Platform-aware Robotic Navigation [paper] [project]
  • [2025] Semantic Mapping in Indoor Embodied AI - A Comprehensive Survey and Future Directions [paper]
  • [2025] VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning [paper]
  • [2025] TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation [paper]
  • [2025] VR-Robo: A Real-to-Sim-to-Real Framework for Visual Robot Navigation and Locomotion [paper]
  • [2025] NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants [paper]
  • [2025] MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation [paper]
  • [2025] OpenFly: A Versatile Toolchain and Large-scale Benchmark for Aerial Vision-Language Navigation [paper]
  • [2025] Ground-level Viewpoint Vision-and-Language Navigation in Continuous Environments [paper]
  • [2025] WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation [paper] [project]
  • [2025] Dynamic Path Navigation for Motion Agents with LLM Reasoning [paper]
  • [2025] SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation [paper]
  • [2025] Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments [paper]
  • [2025] UniGoal: Towards Universal Zero-shot Goal-oriented Navigation [paper] [project]
  • [2025] PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation [paper]
  • [2025] Do Visual Imaginations Improve Vision-and-Language Navigation Agents? [paper] [project]
  • [2025] HA-VLN: A Benchmark for Human-Aware Navigation in Discrete-Continuous Environments with Dynamic Multi-Human Interactions, Real-World Validation, and an Open Leaderboard [paper] [project]
  • [2025] FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks [paper]
  • [2025] P3Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction [paper]
  • [2025] Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation [paper] [project]
  • [2025] COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation [paper]
  • [2025] ForesightNav: Learning Scene Imagination for Efficient Exploration [paper] [project]
  • [2025] CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory [paper] [project]
  • [2025] NavDP: Learning Sim-to-Real Navigation Diffusion Policy with Privileged Information Guidance [paper]
  • [2025] VISTA: Generative Visual Imagination for Vision-and-Language Navigation [paper]
  • [2025] Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation [paper] [project]
  • [2025] Aux-Think: Exploring Reasoning Strategies for Data-Efficient Vision-Language Navigation [paper]
  • [2025] Active Test-time Vision-Language Navigation [paper]
  • [2025] BeliefMapNav: 3D Voxel-Based Belief Map for Zero-Shot Object Navigation [paper] [project]
  • [2025] OctoNav: Towards Generalist Embodied Navigation [paper] [project]
  • [2025] Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding [paper] [project]
  • [2025] DyNaVLM: Zero-Shot Vision-Language Navigation System with Dynamic Viewpoints and Self-Refining Graph Memory [paper]
  • [2025] TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation [paper]

2024

  • [2024] [RSS 24] Navid: Video-based vlm plans the next step for vision-and-language navigation [paper]
  • [2024] The One RING: a Robotic Indoor Navigation Generalist [paper]
  • [2024] Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs [paper]

2023

  • [2023] Bridging Zero-shot Object Navigation and Foundation Models through Pixel-Guided Navigation Skill [paper]
  • [2023] [CVPR 23] Adaptive Zone-Aware Hierarchical Planner for Vision-Language Navigation [paper]

🎬 Vision Action (VA) Models

2025

  • [2025] Steering Your Diffusion Policy with Latent Space Reinforcement Learning [paper] [project]
  • [2025] [ByteDance Seed] Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation [paper] [project]
  • [2025] [RSS 25] Unified Video Action Model [paper] [project]
  • [2025] Streaming Flow Policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories [paper] [project]
  • [2025] Modality-Composable Diffusion Policy via Inference-Time Distribution-level Composition [paper] [project]
  • [2025] Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning [paper] [project]
  • [2025] BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities [paper] [project]
  • [2025] [RSS 25] Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation [paper] [project]
  • [2025] Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics [paper]
  • [2025] You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations [paper]
  • [2025] ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills [paper]
  • [2025] VILP: Imitation Learning with Latent Video Planning [paper]
  • [2025] Learning the RoPEs: Better 2D and 3D Position Encodings with STRING [paper]
  • [2025] When Pre-trained Visual Representations Fall Short: Limitations in Visuo-Motor Robot Learning [paper]
  • [2025] RoboGrasp: A Universal Grasping Policy for Robust Robotic Control [paper]
  • [2025] CordViP: Correspondence-based Visuomotor Policy for Dexterous Manipulation in Real-World [paper]
  • [2025] Learning to Group and Grasp Multiple Objects [paper]
  • [2025] Beyond Behavior Cloning: Robustness through Interactive Imitation and Contrastive Learning [paper]
  • [2025] COMBO-Grasp: Learning Constraint-Based Manipulation for Bimanual Occluded Grasping [paper]
  • [2025] DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References [paper]
  • [2025] S2-Diffusion: Generalizing from Instance-level to Category-level Skills in Robot Manipulation [paper]
  • [2025] MTDP: Modulated Transformer Diffusion Policy Model [paper]
  • [2025] FUNCTO: Function-Centric One-Shot Imitation Learning for Tool Manipulation [paper]
  • [2025] RHINO: Learning Real-Time Humanoid-Human-Object Interaction from Human Demonstrations [paper]
  • [2025] Responsive Noise-Relaying Diffusion Policy: Responsive and Efficient Visuomotor Control [paper]
  • [2025] Learning a High-quality Robotic Wiping Policy Using Systematic Reward Analysis and Visual-Language Model Based Curriculum [paper]
  • [2025] IMLE Policy: Fast and Sample Efficient Visuomotor Policy Learning via Implicit Maximum Likelihood Estimation [paper]
  • [2025] X-IL: Exploring the Design Space of Imitation Learning Policies [paper]
  • [2025] Towards Fusing Point Cloud and Visual Representations for Imitation Learning [paper]
  • [2025] Pick-and-place Manipulation Across Grippers Without Retraining: A Learning-optimization Diffusion Policy Approach [paper]
  • [2025] FACTR: Force-Attending Curriculum Training for Contact-Rich Policy Learning [paper]
  • [2025] DemoGen: Synthetic Demonstration Generation for Data-Efficient Visuomotor Policy Learning [paper]
  • [2025] Human2Robot: Learning Robot Actions from Paired Human-Robot Videos [paper]
  • [2025] AnyDexGrasp: General Dexterous Grasping for Different Hands with Human-level Learning Efficiency [paper]
  • [2025] COMPASS: Cross-embOdiment Mobility Policy via ResiduAl RL and Skill Synthesis [paper]
  • [2025] Retrieval Dexterity: Efficient Object Retrieval in Clutters with Dexterous Hand [paper]
  • [2025] From planning to policy: distilling Skill-RRT for long-horizon prehensile and non-prehensile manipulation [paper]
  • [2025] FetchBot: Object Fetching in Cluttered Shelves via Zero-Shot Sim2Real [paper]
  • [2025] Point Policy: Unifying Observations and Actions with Key Points for Robot Manipulation [paper] [project]
  • [2025] FuseGrasp: Radar-Camera Fusion for Robotic Grasping of Transparent Objects [paper]
  • [2025] Sensor-Invariant Tactile Representation [paper]
  • [2025] Generalist World Model Pre-Training for Efficient Reinforcement Learning [paper]
  • [2025] ProDapt: Proprioceptive Adaptation using Long-term Memory Diffusion [paper] [project]
  • [2025] Falcon: Fast Visuomotor Policies via Partial Denoising [paper]
  • [2025] HGDiffuser: Efficient Task-Oriented Grasp Generation via Human-Guided Grasp Diffusion Models [paper] [project]
  • [2025] SHADOW: Leveraging Segmentation Masks for Cross-Embodiment Policy Transfer [paper] [project]
  • [2025] Phantom: Training Robots Without Robots Using Only Human Videos [paper] [project]
  • [2025] General Force Sensation for Tactile Robot [paper]
  • [2025] Action Tokenizer Matters in In-Context Imitation Learning [paper]
  • [2025] AVR: Active Vision-Driven Robotic Precision Manipulation with Viewpoint and Focal Length Optimization [paper] [project]
  • [2025] FRMD: Fast Robot Motion Diffusion with Consistency-Distilled Movement Primitives for Smooth Action Generation [paper]
  • [2025] Variable-Friction In-Hand Manipulation for Arbitrary Objects via Diffusion-Based Imitation Learning [paper] [project]
  • [2025] Learning Dexterous In-Hand Manipulation with Multifingered Hands via Visuomotor Diffusion [paper] [project]
  • [2025] RGBSQGrasp: Inferring Local Superquadric Primitives from Single RGB Image for Graspability-Aware Bin Picking [paper] [project]
  • [2025] ArticuBot: Learning Universal Articulated Object Manipulation Policy via Large Scale Simulation [paper] [project]
  • [2025] SRSA: Skill Retrieval and Adaptation for Robotic Assembly Tasks [paper] [project]
  • [2025] GAGrasp: Geometric Algebra Diffusion for Dexterous Grasping [paper] [project]
  • [2025] OPG-Policy: Occluded Push-Grasp Policy Learning with Amodal Segmentation [paper]
  • [2025] RA-DP: Rapid Adaptive Diffusion Policy for Training-Free High-frequency Robotics Replanning [paper]
  • [2025] Robotic Compliant Object Prying Using Diffusion Policy Guided by Vision and Force Observations [paper] [project]
  • [2025] CoinRobot: Generalized End-to-end Robotic Learning for Physical Intelligence [paper]
  • [2025] Persistent Object Gaussian Splat (POGS) for Tracking Human and Robot Manipulation of Irregularly Shaped Objects [paper] [project]
  • [2025] How to Train Your Robots? The Impact of Demonstration Modality on Imitation Learning [paper]
  • [2025] One-Shot Dual-Arm Imitation Learning [paper] [project]
  • [2025] GAT-Grasp: Gesture-Driven Affordance Transfer for Task-Aware Robotic Grasping [paper]
  • [2025] Enhanced View Planning for Robotic Harvesting: Tackling Occlusions with Imitation Learning [paper]
  • [2025] ES-Parkour: Advanced Robot Parkour with Bio-inspired Event Camera and Spiking Neural Network [paper]
  • [2025] NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models [paper]
  • [2025] World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning [paper]
  • [2025] RILe: Reinforced Imitation Learning [paper]
  • [2025] HumanoidPano: Hybrid Spherical Panoramic-LiDAR Cross-Modal Perception for Humanoid Robots [paper]
  • [2025] Distillation-PPO: A Novel Two-Stage Reinforcement Learning Framework for Humanoid Robot Perceptive Locomotion [paper]
  • [2025] Trinity: A Modular Humanoid Robot AI System [paper]
  • [2025] LiPS: Large-Scale Humanoid Robot Reinforcement Learning with Parallel-Series Structures [paper]
  • [2025] Elastic Motion Policy: An Adaptive Dynamical System for Robust and Efficient One-Shot Imitation Learning [paper] [project]
  • [2025] Learning Gentle Grasping Using Vision, Sound, and Touch [paper] [project]
  • [2025] RoboCopilot: Human-in-the-loop Interactive Imitation Learning for Robot Manipulation [paper]
  • [2025] Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework [paper]
  • [2025] MoE-Loco: Mixture of Experts for Multitask Locomotion [paper] [project]
  • [2025] Humanoid Policy ~ Human Policy [paper] [project]
  • [2025] Dense Policy: Bidirectional Autoregressive Learning of Actions [paper] [project]
  • [2025] Learning to Play Piano in the Real World [paper] [project]
  • [2025] CCDP: Composition of Conditional Diffusion Policies with Guided Sampling [paper] [project]
  • [2025] DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation [paper] [project]
  • [2025] AdaWorld: Learning Adaptable World Models with Latent Actions [paper] [project]
  • [2025] Visuo-Tactile Object Pose Estimation for a Multi-Finger Robot Hand with Low-Resolution In-Hand Tactile Sensing [paper]
  • [2025] Empirical Analysis of Sim-and-Real Cotraining Of Diffusion Policies For Planar Pushing from Pixels [paper]
  • [2025] ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning [paper] [project]
  • [2025] Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation [paper] [project]
  • [2025] HACTS: a Human-As-Copilot Teleoperation System for Robot Learning [paper]
  • [2025] ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos [paper] [project]
  • [2025] Learning Coordinated Bimanual Manipulation Policies using State Diffusion and Inverse Dynamics Models [paper]
  • [2025] Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets [paper] [project]
  • [2025] RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics [paper]
  • [2025] Slot-Level Robotic Placement via Visual Imitation from Single Human Video [paper] [project]
  • [2025] Robust Dexterous Grasping of General Objects from Single-view Perception [paper] [project]
  • [2025] Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation [paper] [project]
  • [2025] ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping [paper] [project]
  • [2025] Novel Demonstration Generation with Gaussian Splatting Enables Robust One-Shot Manipulation [paper] [project]
  • [2025] Grasping Deformable Objects via Reinforcement Learning with Cross-Modal Attention to Visuo-Tactile Inputs [paper]
  • [2025] Few-Shot Vision-Language Action-Incremental Policy Learning [paper] [project]
  • [2025] Latent Diffusion Planning for Imitation Learning [paper] [project]
  • [2025] Physically Consistent Humanoid Loco-Manipulation using Latent Diffusion Models [paper]
  • [2025] PRISM-DP: Spatial Pose-based Observations for Diffusion-Policies via Segmentation, Mesh Generation, and Pose Tracking [paper]
  • [2025] Rethinking Latent Representations in Behavior Cloning: An Information Bottleneck Approach for Robot Manipulation [paper] [project]
  • [2025] Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation [paper] [project]
  • [2025] Fast Flow-based Visuomotor Policies via Conditional Optimal Transport Couplings [paper] [project]
  • [2025] KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation [paper] [project]
  • [2025] CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations [paper] [project]
  • [2025] H3DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning [paper] [project]
  • [2025] UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations [paper] [project]
  • [2025] Learning Long-Context Diffusion Policies via Past-Token Prediction [paper] [project]
  • [2025] DataMIL: Selecting Data for Robot Imitation Learning with Datamodels [paper] [project]
  • [2025] [ICLR 25] Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning [paper] [project]
  • [2025] IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning [paper] [project]
  • [2025] NVSPolicy: Adaptive Novel-View Synthesis for Generalizable Language-Conditioned Policy Learning [paper]
  • [2025] EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation [paper]
  • [2025] FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation [paper] [project]
  • [2025] Conditioning Matters: Training Diffusion Policies is Faster Than You Think [paper]
  • [2025] H2R: A Human-to-Robot Data Augmentation for Robot Pre-training from Videos [paper]
  • [2025] GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation [paper] [project]
  • [2025] Zero-Shot Visual Generalization in Robot Manipulation [paper] [project]
  • [2025] Object-Centric Representations Improve Policy Generalization in Robot Manipulation [paper]
  • [2025] LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation [paper]
  • [2025] GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation [paper] [project]
  • [2025] A Practical Guide for Incorporating Symmetry in Diffusion Policy [paper]
  • [2025] Adaptive Visuo-Tactile Fusion with Predictive Force Attention for Dexterous Manipulation [paper] [project]
  • [2025] EquAct: An SE(3)-Equivariant Multi-Task Transformer for Open-Loop Robotic Manipulation [paper]
  • [2025] Spatial RoboGrasp: Generalized Robotic Grasping Control Policy [paper]
  • [2025] Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt [paper]
  • [2025] [AAAI 25] FlowPolicy: Enabling Fast and Robust 3D Flow-Based Policy via Consistency Flow Matching for Robot Manipulation [paper] [project]
  • [2025] Object-centric 3D Motion Field for Robot Learning from Human Videos [paper] [project]
  • [2025] Evaluating Robot Policies in a World Model [paper] [project]
  • [2025] 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model [paper] [project]
  • [2025] SpikePingpong: High-Frequency Spike Vision-based Robot Learning for Precise Striking in Table Tennis Game [paper]
  • [2025] SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies [paper] [project]
  • [2025] Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation [paper] [project]
  • [2025] Touch begins where vision ends: Generalizable policies for contact-rich manipulation [paper] [project]
  • [2025] AMPLIFY: Actionless Motion Priors for Robot Learning from Videos [paper] [project]
  • [2025] GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation [paper]
  • [2025] Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation [paper]
  • [2025] Latent Action Diffusion for Cross-Embodiment Manipulation [paper] [project]
  • [2025] Vision in Action: Learning Active Perception from Human Demonstrations [paper] [project]
  • [2025] [IROS 25] Robust Instant Policy: Leveraging Student’s t-Regression Model for Robust In-context Imitation Learning of Robot Manipulation [paper] [project]
  • [2025] [RSS 25] Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation [paper] [project]
  • [2025] DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy [paper] [project]
  • [2025] World4Omni: A Zero-Shot Framework from Image Generation World Model to Robotic Manipulation [paper] [project]

2024

  • [2024] Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching [paper]
  • [2024] Point Cloud Matters: Rethinking the Impact of Different Observation Spaces on Robot Learning [paper]
  • [2024] [RSS 25] 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations [paper]
  • [2024] Sparse diffusion policy: A sparse, reusable, and flexible policy for robot learning [paper]
  • [2024] ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation [paper]
  • [2024] 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations [paper]
  • [2024] [ICLR 25] Diffusion Policy Policy Optimization [paper]
  • [2024] Language-Guided Object-Centric Diffusion Policy for Collision-Aware Robotic Manipulation [paper]
  • [2024] EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning [paper]
  • [2024] Equivariant Diffusion Policy [paper]
  • [2024] [IROS 25] Mamba Policy: Towards Efficient 3D Diffusion Policy with Hybrid Selective State Models [paper]
  • [2024] Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies [paper]
  • [2024] Motion Before Action: Diffusing Object Motion as Manipulation Condition [paper]
  • [2024] One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation [paper]
  • [2024] Consistency policy: Accelerated visuomotor policies via consistency distillation [paper]
  • [2024] SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation [paper]
  • [2024] Few-Shot Task Learning through Inverse Generative Modeling [paper]
  • [2024] G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation [paper]
  • [2024] Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation [paper]
  • [2024] Diffusion Policy Attacker: Crafting Adversarial Attacks for Diffusion-based Policies [paper]
  • [2024] Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies [paper]
  • [2024] Scaling diffusion policy in transformer to 1 billion parameters for robotic manipulation [paper]
  • [2024] Data Scaling Laws in Imitation Learning for Robotic Manipulation [paper]
  • [2024] Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation [paper]
  • [2024] Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning [paper]
  • [2024] Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation [paper]
  • [2024] GenDP: 3D Semantic Fields for Category-Level Generalizable Diffusion Policy [paper]
  • [2024] Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation [paper]
  • [2024] Prediction with Action: Visual Policy Learning via Joint Denoising Process [paper]
  • [2024] Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations [paper]
  • [2024] Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling [paper]
  • [2024] Streaming Diffusion Policy: Fast Policy Synthesis with Variable Noise Diffusion Models [paper]
  • [2024] CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction [paper]
  • [2024] In-Context Imitation Learning via Next-Token Prediction [paper] [project]
  • [2024] Learning Diffusion Policies from Demonstrations For Compliant Contact-rich Manipulation [paper]

2023

  • [2023] Diffusion policy: Visuomotor policy learning via action diffusion [paper]
  • [2023] Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods [paper]

🧠 Other Multimodal Large Language Model (MLLM)-based/related Embodied Learning

2025

  • [2025] [ICRA 25 Best Paper Finalist] UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation [paper] [project]
  • [2025] [NVIDIA] Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning [paper] [project]
  • [2025] EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks [paper]
  • [2025] RoboDexVLM: Visual Language Model-Enabled Task Planning and Motion Control for Dexterous Robot Manipulation [paper] [project]
  • [2025] Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation [paper]
  • [2025] RoboGrasp: A Universal Grasping Policy for Robust Robotic Control [paper]
  • [2025] VLP: Vision-Language Preference Learning for Embodied Manipulation [paper]
  • [2025] 3D-AffordanceLLM: Harnessing Large Language Models for Open-Vocabulary Affordance Detection in 3D Worlds [paper]
  • [2025] CLEA: Closed-Loop Embodied Agent for Enhancing Task Execution in Dynamic Environments [paper]
  • [2025] TRACE: A Self-Improving Framework for Robot Behavior Forecasting with Vision-Language Models [paper] [project]
  • [2025] AffordGrasp: In-Context Affordance Reasoning for Open-Vocabulary Task-Oriented Grasping in Clutter [paper] [project]
  • [2025] Code-as-Symbolic-Planner: Foundation Model-Based Robot Planning via Symbolic Code Generation [paper] [project]
  • [2025] Large Language Models as Natural Selector for Embodied Soft Robot Design [paper] [project]
  • [2025] OVAMOS: A Framework for Open-Vocabulary Multi-Object Search in Unknown Environments [paper]
  • [2025] Bridging VLM and KMP: Enabling Fine-grained robotic manipulation via Semantic Keypoints Representation [paper]
  • [2025] FlowPlan: Zero-Shot Task Planning with LLM Flow Engineering for Robotic Instruction Following [paper] [project]
  • [2025] UAV-VLRR: Vision-Language Informed NMPC for Rapid Response in UAV Search and Rescue [paper] [video]
  • [2025] UAV-VLPA: A Vision-Language-Path-Action System for Optimal Route Generation on a Large Scales [paper]
  • [2025] Afford-X: Generalizable and Slim Affordance Reasoning for Task-oriented Manipulation [paper] [project]
  • [2025] Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models [paper] [video]
  • [2025] LensDFF: Language-enhanced Sparse Feature Distillation for Efficient Few-Shot Dexterous Manipulation [paper]
  • [2025] Learning Generalizable Language-Conditioned Cloth Manipulation from Long Demonstrations [paper] [project]
  • [2025] Look Before You Leap: Using Serialized State Machine for Language Conditioned Robotic Manipulation [paper]
  • [2025] Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation [paper]
  • [2025] AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning [paper] [project]
  • [2025] AffordDexGrasp: Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance [paper] [project]
  • [2025] Self-Corrective Task Planning by Inverse Prompting with Large Language Models [paper]
  • [2025] HELM: Human-Preferred Exploration with Language Models [paper]
  • [2025] SafePlan: Leveraging Formal Logic and Chain-of-Thought Reasoning for Enhanced Safety in LLM-based Robotic Task Planning [paper]
  • [2025] Graphormer-Guided Task Planning: Beyond Static Rules with LLM Safety Perception [paper]
  • [2025] RoboDesign1M: A Large-scale Dataset for Robot Design Understanding [paper]
  • [2025] STAR: A Foundation Model-driven Framework for Robust Task Planning and Failure Recovery in Robotic Systems [paper]
  • [2025] MatchMaker: Automated Asset Generation for Robotic Assembly [paper]
  • [2025] Object-Centric World Model for Language-Guided Manipulation [paper]
  • [2025] KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation [paper] [project]
  • [2025] IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models [paper] [project]
  • [2025] Multi-Agent LLM Actor-Critic Framework for Social Robot Navigation [paper] [project]
  • [2025] PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability [paper]
  • [2025] MetaFold: Language-Guided Multi-Category Garment Folding Framework via Trajectory Generation and Foundation Model [paper]
  • [2025] NVP-HRI: Zero Shot Natural Voice and Posture-based Human-Robot Interaction via Large Language Model [paper] [project]
  • [2025] MindEye-OmniAssist: A Gaze-Driven LLM-Enhanced Assistive Robot System for Implicit Intention Recognition and Task Execution [paper]
  • [2025] HybridGen: VLM-Guided Hybrid Planning for Scalable Data Generation of Imitation Learning [paper]
  • [2025] Free-form language-based robotic reasoning and grasping [paper] [project]
  • [2025] Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided, Self-Consistent MLLMs for Food Preparation Task Planning [paper]
  • [2025] VISO-Grasp: Vision-Language Informed Spatial Object-centric 6-DoF Active View Planning and Grasping in Clutter and Invisibility [paper]
  • [2025] Diffusion Dynamics Models with Generative State Estimation for Cloth Manipulation [paper]
  • [2025] GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions [paper]
  • [2025] Safety Aware Task Planning via Large Language Models in Robotics [paper]
  • [2025] LLM+MAP: Bimanual Robot Task Planning using Large Language Models and Planning Domain Definition Language [paper] [project]
  • [2025] Leveraging Language Models for Out-of-Distribution Recovery in Reinforcement Learning [paper] [project]
  • [2025] RoboEngine: Plug-and-Play Robot Data Augmentation with Semantic Robot Segmentation and Background Generation [paper] [project]
  • [2025] IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes [paper]
  • [2025] Cooking Task Planning using LLM and Verified by Graph Network [paper]
  • [2025] Context-Aware Human Behavior Prediction Using Multimodal Large Language Models: Challenges and Insights [paper]
  • [2025] Visual Environment-Interactive Planning for Embodied Complex-Question Answering [paper]
  • [2025] GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill [paper] [project]
  • [2025] Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks [paper] [project]
  • [2025] Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning [paper]
  • [2025] EmbodiedAgent: A Scalable Hierarchical Approach to Overcome Practical Challenge in Multi-Robot Control [paper]
  • [2025] Trajectory Adaptation Using Large Language Models [paper]
  • [2025] Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models [paper] [project]
  • [2025] SAS-Prompt: Large Language Models as Numerical Optimizers for Robot Self-Improvement [paper] [project]
  • [2025] Identifying Uncertainty in Self-Adaptive Robotics with Large Language Models [paper]
  • [2025] Robotic Visual Instruction [paper]
  • [2025] DeCo: Task Decomposition and Skill Composition for Zero-Shot Generalization in Long-Horizon 3D Manipulation [paper] [project]
  • [2025] LLM-based Interactive Imitation Learning for Robotic Manipulation [paper] [project]
  • [2025] Dynamic Robot Tool Use with Vision Language Models [paper]
  • [2025] RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation [paper] [project]
  • [2025] Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [paper] [project]
  • [2025] From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation [paper] [project]
  • [2025] Real-Time Verification of Embodied Reasoning for Generative Skill Acquisition [paper]
  • [2025] RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics [paper] [project]
  • [2025] Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces [paper] [project]
  • [2025] RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills [paper] [project]
  • [2025] CASPER: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models [paper] [project]
  • [2025] FrankenBot: Brain-Morphic Modular Orchestration for Robotic Manipulation with Vision-Language Models [paper]
  • [2025] RoboPearls: Editable Video Simulation for Robot Manipulation [paper] [project]
  • [2025] RoboBrain 2.0 Technical Report [paper] [project]

2024

  • [2024] Building Cooperative Embodied Agents Modularly with Large Language Models [paper] [project]
  • [2024] AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation [paper]
  • [2024] OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints [paper]
  • [2024] Grasp What You Want: Embodied Dexterous Grasping System Driven by Your Voice [paper]
  • [2024] Towards Open-World Grasping with Large Vision-Language Models [paper]
  • [2024] ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter [paper]
  • [2024] MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World [paper]
  • [2024] Physically Grounded Vision-Language Models for Robotic Manipulation [paper]
  • [2024] Eureka: Human-Level Reward Design via Coding Large Language Models [paper]
  • [2024] Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [paper]

2023

  • [2023] LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models [paper]
  • [2023] LLM+P: Empowering Large Language Models with Optimal Planning Proficiency [paper]
  • [2023] Code as Policies: Language Model Programs for Embodied Control [paper] (see the sketch after this list)
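
Code as Policies illustrates a pattern shared by several of the planning entries in this section: the language model writes executable policy code against a small, documented set of robot primitives. The sketch below is a hedged, generic rendering of that pattern; `detect_objects`, `pick`, `place`, and `query_llm` are hypothetical placeholders rather than the paper's actual API, and in practice any LLM-generated code should be sandboxed before execution.

```python
# Generic code-as-policies-style sketch (illustrative; the primitives and
# query_llm below are hypothetical placeholders, not the paper's API).

PRIMITIVE_DOC = """
detect_objects(name: str) -> list[tuple[float, float]]  # 2D positions of matching objects
pick(xy: tuple[float, float]) -> None                    # grasp at a position
place(xy: tuple[float, float]) -> None                   # release at a position
"""

PROMPT_TEMPLATE = """You control a tabletop robot through these Python primitives:
{api}
Write a Python function `policy()` that accomplishes: "{task}".
Return only code."""


def query_llm(prompt: str) -> str:
    """Stub: call your LLM provider of choice and return the generated code as a string."""
    raise NotImplementedError


def run_task(task: str, primitives: dict) -> None:
    prompt = PROMPT_TEMPLATE.format(api=PRIMITIVE_DOC, task=task)
    code = query_llm(prompt)
    namespace = dict(primitives)   # expose only a whitelist of robot primitives
    exec(code, namespace)          # defines policy(); sandbox this in a real system
    namespace["policy"]()          # run the generated program on the robot


# Usage sketch (assumes real detect_objects/pick/place implementations exist):
# run_task("put the red block on the green plate",
#          {"detect_objects": detect_objects, "pick": pick, "place": place})
```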

2022

  • [2022] Inner Monologue: Embodied Reasoning through Planning with Language Models [paper]
  • [2022] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [paper]

Physics-aware Policy

  • [2025] Surface-Based Manipulation [paper]

Sim-to-Real Transfer

  • [2025] RE3SIM: Generating High-Fidelity Simulation Data via 3D-Photorealistic Real-to-Sim for Robotic Manipulation [paper]
  • [2025] VR-Robo: A Real-to-Sim-to-Real Framework for Visual Robot Navigation and Locomotion [paper]
  • [2025] A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards [paper]
  • [2025] A Distributional Treatment of Real2Sim2Real for Vision-Driven Deformable Linear Object Manipulation [paper]
  • [2025] Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids [paper] [project]
  • [2025] Impact of Static Friction on Sim2Real in Robotic Reinforcement Learning [paper]
  • [2025] Few-shot Sim2Real Based on High Fidelity Rendering with Force Feedback Teleop [paper]
  • [2025] A Real-Sim-Real (RSR) Loop Framework for Generalizable Robotic Policy Transfer with Differentiable Simulation [paper] [project]
  • [2025] Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware [paper] [project]
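
A recurring ingredient across the sim-to-real entries above is domain randomization: perturbing simulator physics (friction, mass, and similar parameters) so a policy cannot overfit to a single parameter setting. The snippet below is only a generic illustration of that idea using the open-source `mujoco` Python bindings; the toy model, the scaling ranges, and the placeholder policy are arbitrary choices for the example, not taken from any listed paper.

```python
# Generic domain-randomization sketch with the mujoco Python bindings
# (pip install mujoco); the model, ranges, and policy are illustrative only.
import numpy as np
import mujoco

XML = """
<mujoco>
  <worldbody>
    <geom type="plane" size="1 1 0.1"/>
    <body name="pole" pos="0 0 0.6">
      <joint name="hinge" type="hinge" axis="0 1 0"/>
      <geom type="capsule" fromto="0 0 0 0 0 0.5" size="0.02" mass="0.5"/>
    </body>
  </worldbody>
  <actuator>
    <motor joint="hinge" gear="1"/>
  </actuator>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)
nominal_friction = model.geom_friction.copy()
nominal_mass = model.body_mass.copy()

def randomized_rollout(policy, steps=500, rng=np.random.default_rng()):
    """Scale friction/mass around nominal values, then roll out one episode.
    (A careful setup would also rescale inertias consistently with mass.)"""
    model.geom_friction[:] = nominal_friction * rng.uniform(0.7, 1.3)
    model.body_mass[:] = nominal_mass * rng.uniform(0.8, 1.2)
    mujoco.mj_resetData(model, data)
    for _ in range(steps):
        data.ctrl[:] = policy(data.qpos, data.qvel)   # placeholder controller
        mujoco.mj_step(model, data)

# Roll out a zero-torque placeholder policy under several randomized settings.
for _ in range(5):
    randomized_rollout(policy=lambda q, v: np.zeros(model.nu))
```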

Benchmark

  • [2025] RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation [paper] [project]
  • [2025] RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies [paper] [project]
  • [2025] RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation [paper] [project]
  • [2025] DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects [paper] [project]
  • [2025] EWMBENCH: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models [paper] [project]
  • [2025] ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation [paper] [project]
  • [2025] [CVPR 25] RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins [paper] [project]
  • [2025] LocoMuJoCo [documentation] [project]
  • [2025] RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning [paper] [project]
  • [2025] AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World [paper] [project]
  • [2025] RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints [paper] [project]
  • [2025] OpenFly: A Versatile Toolchain and Large-scale Benchmark for Aerial Vision-Language Navigation [paper]
  • [2025] BOSS: Benchmark for Observation Space Shift in Long-Horizon Task [paper]
  • [2025] OpenBench: A New Benchmark and Baseline for Semantic Navigation in Smart Logistics [paper]
  • [2024] EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models [paper]
  • [2024] VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Task [paper]
  • [2024] Towards Diverse Behaviors: A Benchmark for Imitation Learning with Human Demonstrations [paper]

Simulator

  • [2025] DexGarmentLab: Dexterous Garment Manipulation Environment with Generalizable Policy [paper] [project]
  • [2025] MuBlE: MuJoCo and Blender simulation Environment and Benchmark for Task Planning in Robot Manipulation [paper] [project]
  • [2025] MuJoCo Playground [paper]
  • [2024] Genesis: A Generative and Universal Physics Engine for Robotics and Beyond [paper]
  • [2024] ManiSkill V1-V3 [v1] [v2] [v3]
  • [2024] NVIDIA Isaac [Isaac Lab] [Isaac Sim] [Isaac Gym]
  • [2022] MuJoCo [documentation] (see the minimal stepping example after this list)
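
For readers new to the simulators above, here is a minimal, hedged "hello world" for the `mujoco` Python bindings showing how a model is defined in MJCF XML, compiled, and stepped; the toy sliding-box model is an arbitrary example, not taken from any listed project.

```python
# Minimal MuJoCo "hello world" with the official Python bindings (pip install mujoco).
# The sliding-box model below is an arbitrary illustrative example.
import mujoco

XML = """
<mujoco>
  <worldbody>
    <light pos="0 0 3"/>
    <geom type="plane" size="1 1 0.1"/>
    <body pos="0 0 0.5">
      <joint name="slide_x" type="slide" axis="1 0 0"/>
      <geom type="box" size="0.05 0.05 0.05" mass="1"/>
    </body>
  </worldbody>
  <actuator>
    <motor joint="slide_x" gear="1"/>
  </actuator>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)   # compile the MJCF model
data = mujoco.MjData(model)                   # per-simulation state

for step in range(1000):
    data.ctrl[0] = 0.5                        # constant force on the sliding joint
    mujoco.mj_step(model, data)               # advance physics by one timestep
    if step % 200 == 0:
        print(f"t={data.time:.2f}s  qpos={data.qpos}")

# An interactive viewer also ships with the bindings (see the mujoco.viewer module).
```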

Related Works

  • Awesome-VLA-RL [repo]
  • Awesome-Robotics-Foundation-Models [repo]
  • Awesome VLA for Robotics [repo]
  • Awesome-Generalist-Agents [repo]
  • Awesome-LLM-Robotics [repo]
  • Awesome World Models for Robotics [repo]
  • Awesome-VLA-Post-Training [repo]
  • Awesome-BFM-Papers [repo]

Star History

Star History Chart