Embodied-AI-papers

Awesome · License: MIT

What can LLMs do for ROBOTs?

🙌 This repository collects papers integrating Embodied AI and Large Language Models (LLMs).

😎 Recommendations of missing papers are welcome via Issues or Pull Requests.

🥽 The papers are ranked according to our subjective judgment.

📜 Table of Contents

✨︎ Outstanding Papers

  • [arXiv 2024] Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation [Project]
  • [arXiv 2024] Real-World Robot Applications of Foundation Models: A Review
  • [arXiv 2023] Grounding Language with Visual Affordances over Unstructured Data (HULC++)

📥 Paper Inbox

Survey

  • [arXiv 2024] Language-conditioned Learning for Robotic Manipulation: A Survey
  • [arXiv 2024] Real-World Robot Applications of Foundation Models: A Review
  • [arXiv 2023] Foundation Models in Robotics: Applications, Challenges, and the Future

Datasets & Simulators

  • [CoRL 2023 Workshop TGR] Open X-Embodiment: Robotic Learning Datasets and RT-X Models [Project]
  • [IEEE RA-L 2023] Orbit: A Unified Simulation Framework for Interactive Robot Learning Environments [Project]
  • [arXiv 2023] Towards Building AI-CPS with NVIDIA Isaac Sim: An Industrial Benchmark and Case Study for Robotics Manipulation [Project]
  • [IROS 2023] HANDAL: A Dataset of Real-World Manipulable Object Categories with Pose Annotations, Affordances, and Reconstructions [Project]
  • [CoRL 2023] BridgeData V2: A dataset for robot learning at scale [Project]
  • [RSS 2023 LTAMP] RH20T: A robotic dataset for learning diverse skills in one-shot [Project]
  • [arXiv 2024] DROID: A large-scale in-the-wild robot manipulation dataset [Project]
  • [CoRL 2023] AR2-D2: Training a robot without a robot [Project]
  • [CVPR 2024] OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion [Project] [Dual-Arm]

Algorithms

  • [TMLR 2024] RoboCat: A self-improving foundation agent for robotic manipulation
  • [arXiv 2023] RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking
  • [NeurIPS 2023] Supervised pretraining can learn in-context reinforcement learning
  • [NeurIPS 2023] EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [Project]
  • [ICML 2023] PaLM-E: An embodied multimodal language model [Project]
  • [arXiv 2023] RT-2: Vision-language-action models transfer web knowledge to robotic control [Project]
  • [NeurIPS 2023] STEVE-1: A generative model for text-to-behavior in Minecraft [Project] [Minecraft]
  • [CoRL 2023] Q-transformer: Scalable offline reinforcement learning via autoregressive q-functions [Project]
  • [NeurIPS 2023] Diffusion model is an effective planner and data synthesizer for multi-task reinforcement learning
  • [ICML 2023] LIV: Language-image representations and rewards for robotic control
  • [PMLR 2023] ViNT: A foundation model for visual navigation
  • [arXiv 2023] Foundation reinforcement learning: towards embodied generalist agents with foundation prior assistance
  • [arXiv 2023] Physically grounded vision-language models for robotic manipulation [Project]
  • [arXiv 2022] RT-1: Robotics transformer for real-world control at scale [Project]

Applications

Perception

  • [arXiv 2024] Affordancellm: Grounding affordance from vision language models [Project]
  • [arXiv 2024] Physically Grounded Vision-Language Models for Robotic Manipulation
  • [CoRL 2023] REFLECT: Summarizing Robot Experiences for FaiLure Explanation and CorrecTion [Project]
  • [arXiv 2023] Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding [3D]
  • [ICRA 2022] Affordance Learning from Play for Sample-Efficient Policy Learning [Project] (VAPO) [w/o LLM]
  • [IEEE RA-L 2022] What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data [Project]
  • [CVPR 2022] Learning affordance grounding from exocentric images

Policy

  • [ICRA 2024 Workshop VLMNM] Octo: An Open-Source Generalist Robot Policy
  • [IEEE RA-L 2024] Language models as zero-shot trajectory generators [Project]
  • [arXiv 2024] BTGenBot: Behavior Tree Generation for Robotic Tasks with Lightweight LLMs
  • [arXiv 2024] Grounding Language Plans in Demonstrations Through Counterfactual Perturbations
  • [arXiv 2024] ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models [Code]
  • [CoRL 2023] Mutex: Learning unified policies from multimodal task specifications [Project]
  • [ICRA 2023] Code as Policies: Language Model Programs for Embodied Control [Project]
  • [CoRL 2023] Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners [Project]
  • [NeurIPS 2023] RoboCLIP: One demonstration is enough to learn robot policies [Project]
  • [ICCV 2023] Skill transformer: A monolithic policy for mobile manipulation
  • [arXiv 2023] Generalizable long-horizon manipulations with large language models
  • [arXiv 2023] LLM-MARS: Large Language Model for Behavior Tree Generation and NLP-enhanced Dialogue in Multi-Agent Robot Systems

Action

  • [arXiv 2024] Reconciling Reality through Simulation: A Real-to-Sim-to-Real Approach for Robust Manipulation [Project]
  • [arXiv 2024] Language-Conditioned Imitation Learning with Base Skill Priors under Unstructured Data
  • [arXiv 2024] Object-centric instruction augmentation for robotic manipulation [Project]
  • [arXiv 2024] OpenVLA: An Open-Source Vision-Language-Action Model [Project]
  • [NeurIPS 2023 GCRL workshop] Zero-shot robotic manipulation with pretrained image-editing diffusion models [Project]
  • [NeurIPS 2023 Poster] Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents [Project] [Minecraft]
  • [CoRL 2023 Poster] Open-world object manipulation using pre-trained vision-language models [Project]
  • [arXiv 2023] Pave the way to grasp anything: Transferring foundation models for universal pick-place robots
  • [arXiv 2023] Waypoint-based imitation learning for robotic manipulation [Project]
  • [IROS 2023] MOMA-Force: Visual-Force Imitation for Real-World Mobile Manipulation [Project]

Control

  • [arXiv 2024] Spatiotemporal Predictive Pre-training for Robotic Motor Control
  • [PMLR 2023] Robot learning with sensorimotor pre-training
  • [arXiv 2023] A generalist dynamics model for control
  • [ICLR 2023 RRL Poster] Chain-of-thought predictive control
  • [ICML 2023] On pre-training for visuo-motor control: Revisiting a learning-from-scratch baseline
  • [arXiv 2023] MoDem-V2: Visuo-Motor World Models for Real-World Robot Manipulation

Long Video Understanding

  • [arXiv 2024] GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension [Project]
  • [arXiv 2024] Koala: Key frame-conditioned long video-LLM
  • [arXiv 2024] ST-LLM: Large Language Models Are Effective Temporal Learners
  • [arXiv 2024] MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
  • [arXiv 2024] LongVLM: Efficient Long Video Understanding via Large Language Models
  • [arXiv 2024] ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models
  • [arXiv 2024] ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  • [arXiv 2024] An Introduction to Vision-Language Modeling
  • [arXiv 2023] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
  • [arXiv 2023] EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
  • [arXiv 2023] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

LLM-based Video Agents

  • [arXiv 2023] Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
  • [arXiv 2023] ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
  • [arXiv 2023] X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
  • [arXiv 2023] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
  • [arXiv 2023] ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System
  • [arXiv 2023] MM-VID: Advancing Video Understanding with GPT-4V(ision)
  • [arXiv 2023] MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
  • [arXiv 2023] MISAR: A Multimodal Instructional System with Augmented Reality
  • [arXiv 2022] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
  • [arXiv 2022] Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

System Implementation

  • [SIGCHI 2024] Language, Camera, Autonomy! Prompt-engineered Robot Control for Rapidly Evolving Deployment (CLEAR) [Software]
  • [Autonomous Robots 2023] TidyBot: personalized robot assistance with large language models [Project]
  • [RSS 2023] Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ALOHA)
  • [arXiv 2024] Embodied AI with Two Arms: Zero-shot Learning, Safety and Modularity [Dual-Arm]
    • [arXiv 2023] PaLM 2 Technical Report
    • [arXiv 2022] Simple Open-Vocabulary Object Detection with Vision Transformers
    • [ICRA 2023] Robotic Table Wiping via Reinforcement Learning and Whole-body Trajectory Optimization
  • [arXiv 2024] HumanPlus: Humanoid Shadowing and Imitation from Humans [Project]
