Embodied-AI-papers

Awesome · License: MIT

What can LLMs do for ROBOTs?

🙌 This repository collects papers integrating Embodied AI and Large Language Models (LLMs).

😎 Recommendations of missing papers are welcome via Issues or Pull Requests.

🥽 The papers are ranked according to our subjective judgment.

📜 Table of Contents

✨︎ Outstanding Papers

  • [arXiv 2024] Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation [Project]
  • [arXiv 2024] Real-World Robot Applications of Foundation Models: A Review
  • [arXiv 2023] Grounding Language with Visual Affordances over Unstructured Data (HULC++)

📥 Paper Inbox

Survey

  • [arXiv 2024] Language-conditioned Learning for Robotic Manipulation: A Survey
  • [arXiv 2024] Real-World Robot Applications of Foundation Models: A Review
  • [arXiv 2023] Foundation Models in Robotics: Applications, Challenges, and the Future

Datasets & Simulators

  • [CoRL 2023 Workshop TGR] Open X-Embodiment: Robotic Learning Datasets and RT-X Models [Project]
  • [IEEE RA-L 2023] Orbit: A Unified Simulation Framework for Interactive Robot Learning Environments [Project]
  • [arXiv 2023] Towards Building AI-CPS with NVIDIA Isaac Sim: An Industrial Benchmark and Case Study for Robotics Manipulation [Project]
  • [IROS 2023] HANDAL: A Dataset of Real-World Manipulable Object Categories with Pose Annotations, Affordances, and Reconstructions [Project]
  • [CoRL 2023] BridgeData V2: A dataset for robot learning at scale [Project]
  • [RSS 2023 LTAMP] RH20T: A robotic dataset for learning diverse skills in one-shot [Project]
  • [arXiv 2024] DROID: A large-scale in-the-wild robot manipulation dataset [Project]
  • [CoRL 2023] AR2-D2: Training a robot without a robot [Project]
  • [CVPR 2024] OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion [Project] [Dual-Arm]

Algorithms

  • [TMLR 2024] RoboCat: A self-improving foundation agent for robotic manipulation
  • [arXiv 2023] RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking
  • [NeurIPS 2023] Supervised pretraining can learn in-context reinforcement learning
  • [NeurIPS 2023] EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [Project]
  • [ICML 2023] PaLM-E: An embodied multimodal language model [Project]
  • [arXiv 2023] RT-2: Vision-language-action models transfer web knowledge to robotic control [Project]
  • [NeurIPS 2023] STEVE-1: A generative model for text-to-behavior in Minecraft [Project] [Minecraft]
  • [CoRL 2023] Q-transformer: Scalable offline reinforcement learning via autoregressive q-functions [Project]
  • [NeurIPS 2023] Diffusion model is an effective planner and data synthesizer for multi-task reinforcement learning
  • [ICML 2023] LIV: Language-image representations and rewards for robotic control
  • [PMLR 2023] ViNT: A foundation model for visual navigation
  • [arXiv 2023] Foundation reinforcement learning: towards embodied generalist agents with foundation prior assistance
  • [arXiv 2023] Physically grounded vision-language models for robotic manipulation [Project]
  • [arXiv 2022] RT-1: Robotics transformer for real-world control at scale [Project]

Applications

Perception

  • [arXiv 2024] Affordancellm: Grounding affordance from vision language models [Project]
  • [arXiv 2024] Physically Grounded Vision-Language Models for Robotic Manipulation
  • [CoRL 2023] REFLECT: Summarizing Robot Experiences for FaiLure Explanation and CorrecTion [Project]
  • [arXiv 2023] Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding [3D]
  • [ICRA 2022] Affordance Learning from Play for Sample-Efficient Policy Learning [Project] (VAPO) [w/o LLM]
  • [IEEE RA-L 2022] What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data [Project]
  • [CVPR 2022] Learning affordance grounding from exocentric images

Policy

  • [ICRA 2024 Workshop VLMNM] Octo: An Open-Source Generalist Robot Policy
  • [IEEE RA-L 2024] Language models as zero-shot trajectory generators [Project]
  • [arXiv 2024] BTGenBot: Behavior Tree Generation for Robotic Tasks with Lightweight LLMs
  • [arXiv 2024] Grounding Language Plans in Demonstrations Through Counterfactual Perturbations
  • [arXiv 2024] ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models [Code]
  • [CoRL 2023] Mutex: Learning unified policies from multimodal task specifications [Project]
  • [ICRA 2023] Code as Policies: Language Model Programs for Embodied Control [Project]
  • [CoRL 2023] Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners [Project]
  • [NeurIPS 2023] RoboCLIP: One demonstration is enough to learn robot policies [Project]
  • [ICCV 2023] Skill transformer: A monolithic policy for mobile manipulation
  • [arXiv 2023] Generalizable long-horizon manipulations with large language models
  • [arXiv 2023] LLM-MARS: Large Language Model for Behavior Tree Generation and NLP-enhanced Dialogue in Multi-Agent Robot Systems

Action

  • [arXiv 2024] Reconciling Reality through Simulation: A Real-to-Sim-to-Real Approach for Robust Manipulation [Project]
  • [arXiv 2024] Language-Conditioned Imitation Learning with Base Skill Priors under Unstructured Data
  • [arXiv 2024] Object-centric instruction augmentation for robotic manipulation [Project]
  • [arXiv 2024] OpenVLA: An Open-Source Vision-Language-Action Model [Project]
  • [NeurIPS 2023 GCRL workshop] Zero-shot robotic manipulation with pretrained image-editing diffusion models [Project]
  • [NeurIPS 2023 Poster] Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents [Project] [Minecraft]
  • [CoRL 2023 Poster] Open-world object manipulation using pre-trained vision-language models [Project]
  • [arXiv 2023] Pave the way to grasp anything: Transferring foundation models for universal pick-place robots
  • [arXiv 2023] Waypoint-based imitation learning for robotic manipulation [Project]
  • [IROS 2023] MOMA-Force: Visual-Force Imitation for Real-World Mobile Manipulation [Project]

Control

  • [arXiv 2024] Spatiotemporal Predictive Pre-training for Robotic Motor Control
  • [PMLR 2023] Robot learning with sensorimotor pre-training
  • [arXiv 2023] A generalist dynamics model for control
  • [ICLR 2023 RRL Poster] Chain-of-thought predictive control
  • [ICML 2023] On pre-training for visuo-motor control: Revisiting a learning-from-scratch baseline
  • [arXiv 2023] MoDem-V2: Visuo-Motor World Models for Real-World Robot Manipulation

Long Video Understanding

  • [arXiv 2024] GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension [Project]
  • [arXiv 2024] Koala: Key frame-conditioned long video-LLM
  • [arXiv 2024] ST-LLM: Large Language Models Are Effective Temporal Learners
  • [arXiv 2024] MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
  • [arXiv 2024] LongVLM: Efficient Long Video Understanding via Large Language Models
  • [arXiv 2024] ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models
  • [arXiv 2024] ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  • [arXiv 2024] An Introduction to Vision-Language Modeling
  • [arXiv 2023] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
  • [arXiv 2023] EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
  • [arXiv 2023] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

LLM-based Video Agents

  • [arXiv 2023] Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
  • [arXiv 2023] ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
  • [arXiv 2023] X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
  • [arXiv 2023] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
  • [arXiv 2023] ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System
  • [arXiv 2023] MM-VID: Advancing Video Understanding with GPT-4V(ision)
  • [arXiv 2023] MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
  • [arXiv 2023] MISAR: A Multimodal Instructional System with Augmented Reality
  • [arXiv 2022] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
  • [arXiv 2022] Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

System Implementation

  • [SIGCHI 2024] Language, Camera, Autonomy! Prompt-engineered Robot Control for Rapidly Evolving Deployment (CLEAR) [Software]
  • [Autonomous Robots 2023] TidyBot: personalized robot assistance with large language models [Project]
  • [RSS 2023] Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ALOHA)
  • [arXiv 2024] Embodied AI with Two Arms: Zero-shot Learning, Safety and Modularity [Dual-Arm]
    • [arXiv 2023] PaLM 2 Technical Report
    • [arXiv 2022] Simple Open-Vocabulary Object Detection with Vision Transformers
    • [ICRA 2023] Robotic Table Wiping via Reinforcement Learning and Whole-body Trajectory Optimization
  • [arXiv 2024] HumanPlus: Humanoid Shadowing and Imitation from Humans [Project]
