# "Language meet Driving" Papers Collection
> This repo is all you need for doing research in autonomous driving systems integrated with LLMs and VLMs. The papers are organized in different sections based on the topics they cover.
## Table of Contents
- [LLMs-only models](#llms-only-models)
- [LLMs-to-action models](#llms-to-action-models)
- [VLMs-based models (sensory bridge)](#vlms-based-models-sensory-bridge)
- [VLMs-based models (perception tasks)](#vlms-based-models-perception-tasks)
- [Full VLMs closed-loop models](#full-vlms-closed-loop-models)
- [Surveys](#surveys)
## LLMs-only models
Here we collect papers that integrate large language models for explainability or reasoning purposes only.
| Title (Year) | Short Description | Main Contributions | Dataset / Simulator | Project Page |
|-------|-------------------|--------------------|---------|--------------|
|[Drive Like a Human: Rethinking Autonomous Driving with Large Language Models](https://www.researchgate.net/profile/Pinlong-Cai/publication/372404110_Drive_Like_a_Human_Rethinking_Autonomous_Driving_with_Large_Language_Models/links/65ee5c9db7819b433bf52adc/Drive-Like-a-Human-Rethinking-Autonomous-Driving-with-Large-Language-Models.pdf) (2023) | This is one of the first published papers on this topic. In a highway environment, information about lanes and other agents is converted to text and processed by GPT-3.5, which reasons about it. The authors also introduce the concept of **MEMORY SELF-REFLECTION**: when the model makes a mistake, it stores it in memory and learns from it, so that it knows how to behave the next time it faces a similar situation. |
- One of the first papers to employ LLMs in driving scenarios.
- Extensive experiments demonstrate impressive scene comprehension and the ability to handle long-tail cases.
- First paper to leverage LLMs to enable multi-vehicle collaborative driving.
- They introduce other features like cognitive memory, lifelong learning and chain-of-thought reasoning. | [HighwayEnv](https://highway-env.farama.org/) |None |
| [A Language Agent for Autonomous Driving](https://arxiv.org/pdf/2311.10813) (2023) | They introduce the usage of a versatile tool library (like `get_leading_object`, `get_lane` etc.) accessible via function calls, a cognitive memory of common sense and experiential knowledge for decision-making, and a reasoning engine capable of chain-of-thought reasoning, task planning, motion planning, and self-reflection. The system demonstrates superior interpretability and few-shot learning ability, which are important for adapting to new and unforeseen driving scenarios. |
- Agent-Driver introduces a tool library for dynamic perception and prediction, a cognitive memory for human knowledge, and a reasoning engine that emulates human decision-making. | [nuScenes](https://www.nuscenes.org/) | [Project Page](https://usc-gvl.github.io/Agent-Driver/) |
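The "tool library accessible via function calls" idea from the Agent-Driver entry above can be illustrated with a minimal sketch. This is not the paper's implementation: only the tool names `get_leading_object` and `get_lane` come from the description above, while the scene dictionary layout, the `tool` decorator, and `call_tool` are hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Dict, Optional

# Assumed scene format (hypothetical): ego lane index plus a list of objects,
# each with an id, a lane index, and a signed longitudinal gap in meters.
Scene = Dict[str, Any]

# Registry mapping tool names to Python callables.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so it can be invoked by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_lane(scene: Scene) -> int:
    """Return the lane index of the ego vehicle."""
    return scene["ego_lane"]


@tool
def get_leading_object(scene: Scene) -> Optional[Dict[str, Any]]:
    """Return the closest object ahead of the ego vehicle in its lane, or None."""
    ahead = [
        obj for obj in scene["objects"]
        if obj["lane"] == scene["ego_lane"] and obj["gap_m"] > 0
    ]
    return min(ahead, key=lambda obj: obj["gap_m"], default=None)


def call_tool(call: Dict[str, Any], scene: Scene) -> Any:
    """Dispatch a function call of the form {"name": ..., "arguments": {...}}."""
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](scene, **call.get("arguments", {}))
```

In such a setup, the LLM would emit a structured call like `{"name": "get_leading_object"}` and receive the tool result as text for its next reasoning step; how the real system formats these calls is not covered here.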
## LLMs-to-action models
These papers translate the reasoning capabilities of LLMs into driving actions and maneuvers (text-to-action).
| Title (Year) | Short Description | Main Contributions | Dataset / Simulator | Project Page |
|-------|-------------------|--------------------|---------|--------------|
| [LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving](https://arxiv.org/pdf/2310.03026) (2023) | In this paper, an LLM is able to control and tune the parameters of an MPC controller. For example, you can say in a prompt "*Drive fast, I am late*" or "*I am a conservative driver*" and the LLM will tune the MPC parameters accordingly. |
- Devised a dedicated chain-of-thought framework for LLMs for driving scenarios.
- Possibility to tune bottom-level controller parameters using high-level textual decisions provided by LLMs. | [IDSim](https://www.researchgate.net/profile/Shengbo-Li-2/publication/375804851_A_Reinforcement_Learning_Benchmark_for_Autonomous_Driving_in_General_Urban_Scenarios/links/6601233dd3a08551424b116b/A-Reinforcement-Learning-Benchmark-for-Autonomous-Driving-in-General-Urban-Scenarios.pdf) | [Project Page](https://sites.google.com/view/llm-ad)|
|[LANGPROP: a code optimization framework using Large Language Models applied to driving](https://arxiv.org/pdf/2401.10314) (2024) | They introduce LangProp, a framework designed to iteratively optimize code generated by Large Language Models (LLMs) in both supervised and reinforcement learning settings. While LLMs can produce viable coding solutions in a zero-shot manner, these solutions are often suboptimal and fail on edge cases. LangProp addresses this by automatically evaluating code performance on input-output pairs, identifying issues, and feeding the results back into the training loop to refine the code. |
- They propose LangProp, a novel framework for iterative code optimization that adapts the machine learning training paradigm (e.g., imitation learning, reinforcement learning) to symbolic systems using LLMs as optimizers.
- It generates interpretable code to control vehicles and improves driving performance with iterative optimization and more training data compared to zero-shot LLM code generation. | Wayve Private Dataset | [GitHub Page](https://github.com/shuishida/LangProp) |
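The evaluate-and-feed-back idea described for LangProp above (score generated code on input-output pairs, collect the failures, use them to refine the code) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not LangProp's actual API: `evaluate`, `rank_candidates`, the scalar gap-to-speed policy signature, and the tolerance are all hypothetical, and the LLM call that would rewrite the code from the failure messages is omitted.

```python
from typing import Callable, Dict, List, Tuple

# Assumed toy task (hypothetical): map the gap to an obstacle in meters
# to a target speed in m/s. Each example is (gap_m, expected_speed).
Example = Tuple[float, float]
Policy = Callable[[float], float]


def evaluate(
    policy: Policy, examples: List[Example], tol: float = 0.5
) -> Tuple[float, List[str]]:
    """Score a candidate on input-output pairs and collect failure messages."""
    failures: List[str] = []
    for gap_m, expected in examples:
        try:
            got = policy(gap_m)
        except Exception as exc:  # a crashing candidate counts as a failure
            failures.append(f"gap={gap_m}: raised {exc!r}")
            continue
        if abs(got - expected) > tol:
            failures.append(f"gap={gap_m}: expected {expected}, got {got}")
    score = 1.0 - len(failures) / len(examples)
    return score, failures


def rank_candidates(
    candidates: Dict[str, Policy], examples: List[Example]
) -> List[Tuple[str, float, List[str]]]:
    """Return (name, score, failures) tuples sorted from best to worst score."""
    results = [(name, *evaluate(fn, examples)) for name, fn in candidates.items()]
    return sorted(results, key=lambda r: r[1], reverse=True)
```

In a full loop, the failure messages of the best candidates would be placed in a prompt asking the LLM for revised code, and the revised candidates would be evaluated again.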
## VLMs-based models (sensory bridge)
These papers integrate VLMs for autonomous driving tasks, focusing more on bridging sensor information as input for VLMs.
| Title (Year) | Short Description | Main Contributions | Dataset / Simulator | Project Page |
|-------|-------------------|--------------------|---------|--------------|
| [DriveGPT4: Interpretable End-to-End Autonomous Driving Via Large Language Model](https://arxiv.org/pdf/2310.01412) (2024) | They introduce DriveGPT4, which leverages Multimodal Large Language Models (MLLMs). DriveGPT4 processes multi-frame video inputs and textual queries to predict vehicle control signals, interpret actions, provide reasoning, and address user questions. It utilizes a custom visual instruction tuning dataset and a mix-finetuning strategy tailored for autonomous driving. |
- DriveGPT4 can process multimodal input data and generate text responses as well as low-level control signals.
- New visual instruction tuning dataset for interpretable autonomous driving with the assistance of ChatGPT. | [BDD-X](https://github.com/JinkyuKimUCB/BDD-X-dataset) | [Project Page](https://tonyxuqaq.github.io/projects/DriveGPT4/) |
| [LingoQA: Visual Question Answering for Autonomous Driving](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/09911.pdf) (2024) | LingoQA is a new dataset and benchmark for visual question answering in autonomous driving. They also propose Lingo-Judge, a truthfulness classifier with a high correlation to human evaluations, outperforming traditional metrics like METEOR and BLEU. The dataset and benchmark, along with a baseline model and extensive ablation studies, are provided as a platform to enhance vision-language models for autonomous driving. |
- Novel LingoQA Benchmark for autonomous driving.
- LingoQA Dataset stands out with its freeform questions and answers, covering not just perception but also driving reasoning from the drivers directly.
- The most effective LingoQA baseline partially fine-tunes the attention layers of their vision-language model, which is equipped with Vicuna-1.5-7B and a late video fusion technique. | [LingoQA](https://github.com/wayveai/LingoQA?tab=readme-ov-file) | [GitHub Page](https://github.com/wayveai/LingoQA) |
## VLMs-based models (perception tasks)
In this section, we present VLM papers that focus on autonomous driving perception tasks such as bounding-box generation and segmentation.
| Title (Year) | Short Description | Main Contributions | Dataset / Simulator | Project Page |
|-------|-------------------|--------------------|---------|--------------|
| [Embodied Understanding of Driving Scenarios](https://arxiv.org/pdf/2403.04593) (2024) | This is one of the most advanced papers in the field. They introduce a time token that unlocks temporal predictions: the model can predict what will happen in the future and refer to what happened in the past. For example, you can ask the model for the coordinates of a pedestrian 2 seconds ago. The result is a sophisticated perception system. |
- ELM (Embodied Language Model), a vision-language model for embodied understanding in driving scenarios.
- They propose a space-aware pre-training strategy and time-aware token selection that enhance agents’ comprehension in long-range four-dimensional space. |
- [nuScenes](https://www.nuscenes.org/)
- [Waymo](https://waymo.com/open/)
- [Ego4D](https://ego4d-data.org/)
- [YouTube](https://docs.google.com/spreadsheets/d/1HV-zOO6bh1sKjimhM1ZBcxWqPxgbalE3FDGyh2UHwPw/edit?gid=1708687592#gid=1708687592) | [Project Page](https://opendrivelab.github.io/elm.github.io/) |
| [HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving](https://arxiv.org/pdf/2309.05186) (2023) | HiLM-D consolidates multiple tasks, such as identifying and interpreting risk objects, understanding ego-vehicle intentions, and providing motion suggestions. Traditional MLLMs struggle with high-resolution (HR) details, often missing small objects and over-focusing on salient ones. Thanks to its contributions, HiLM-D is able to address this problem. |
- Low-resolution reasoning branch: utilizes existing MLLMs for general reasoning from low-resolution videos.
- High-resolution perception branch (HR-PB): A plug-and-play module that processes HR images to enhance detection of small and less prominent risk objects. | [DRAMA](https://usa.honda-ri.com/drama) | None |
## Full VLMs closed-loop models
These works cover the whole pipeline of VLMs, from sensory bridge to perception tasks.
| Title (Year) | Short Description | Main Contributions | Dataset / Simulator | Project Page |
|-------|-------------------|--------------------|---------|--------------|
| [Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models](https://arxiv.org/pdf/2310.17642) (2024) | It is one of the first papers in this section. They introduce a novel end-to-end multimodal autonomous driving model that integrates language and visual perception using large multimodal foundation models. The approach helps address challenges like open-set environments and black-box model complexity by extracting detailed spatial features using transformers. |
- SOTA results in OOD settings.
- Usage of pixel/patch-aligned feature descriptors to expand the capabilities of foundational models.
- Incorporation of latent space simulation for enhanced training and policy debugging.
- Enhanced generalization of end-to-end driving policies across diverse scenarios; the system can drive seamlessly in environments not seen during training and avoid obstacles it was not trained on.
- Deployment and validation on a full-scale autonomous vehicle in real-world environments. | [VISTA](https://github.com/vista-simulator/vista) | [Project Page](https://drive-anywhere.github.io/) |
| [LMDrive: Closed-Loop End-to-End Driving with Large Language Models](https://openaccess.thecvf.com/content/CVPR2024/papers/Shao_LMDrive_Closed-Loop_End-to-End_Driving_with_Large_Language_Models_CVPR_2024_paper.pdf) (2024) | They introduce LMDrive, a pioneering language-guided, end-to-end autonomous driving framework that integrates multi-modal sensor data with natural language instructions. It enhances interaction with humans and navigation systems, addressing limitations of previous methods that rely solely on sensor data. The framework is accompanied by a 64K clip dataset and the LangAuto benchmark, designed to evaluate performance in complex instruction-following and challenging driving scenarios. |
- LMDrive, a novel end-to-end, closed-loop, language-based autonomous driving framework.
- They provide a dataset with 64k data clips.
- Extensive closed-loop experiments. |
- [CARLA](https://carla.org/)
- [LMDrive](https://huggingface.co/datasets/OpenDILabCommunity/LMDrive) | [Project Page](https://hao-shao.com/projects/lmdrive.html) |
| [AD-H: Autonomous Driving with Hierarchical Agents](https://arxiv.org/pdf/2406.03474) (2024) | AD-H is a hierarchical multi-agent driving system that bridges the gap between high-level instructions and low-level control signals using mid-level language-driven commands. The system features a multimodal large language model (MLLM) planner for high-level reasoning and a lightweight controller for execution. By focusing the MLLM on perception, reasoning, and planning, AD-H enhances performance and generalization capabilities. |
- New autonomous driving dataset which can effectively facilitate hierarchical policy learning.
- Intensive experiments prove generalization to novel scenarios and long-horizon instructions. |
- [CARLA](https://carla.org/)
- [LMDrive](https://huggingface.co/datasets/OpenDILabCommunity/LMDrive) | [GitHub Page](https://github.com/zhangzaibin/AD-H) |
| [CarLLaVA: Vision language models for camera-only closed-loop driving](https://arxiv.org/pdf/2406.10165) (2024) | CarLLaVA uses the LLaVA vision encoder and the LLaMA architecture, and achieves state-of-the-art closed-loop driving performance with only camera input. It employs a semi-disentangled output representation, combining path predictions for lateral control and waypoints for longitudinal control. An efficient training strategy ensures effective use of large driving datasets. It can also generate language commentary alongside driving outputs. |
- Input is camera only, without the need for expensive labels (like BEV, depth or semantic segmentation).
- High-resolution input, for accessing smaller details in the image.
- Efficient training recipe.
- CarLLaVA ranks first in the sensor track of the CARLA Autonomous Driving Challenge. |
- [CARLA](https://carla.org/)
- [PDM-Lite](https://huggingface.co/datasets/autonomousvision/PDM_Lite_Carla_LB2) | [Project Video](https://www.youtube.com/watch?v=E1nsEgcHRuc&ab_channel=KatrinRenz) |
| [Hidden Biases of End-to-End Driving Datasets](https://arxiv.org/pdf/2412.09602) (2024) | This study applies end-to-end driving to CARLA Leaderboard 2.0, emphasizing the importance of training datasets over architectures. Key findings: expert driving style affects performance, simplistic frame-weighting fails for complex datasets, and prioritizing frames that change target labels reduces dataset size effectively. The model achieves top rankings in CARLA Challenge and Bench2Drive, while proposing improved evaluation metrics. |
- Introduced the first IL-based method for CARLA Leaderboard 2.0, leveraging the PDM-Lite planner to collect high-quality training data for complex driving scenarios.
- Shifted focus from model architecture to the underexplored impact of training dataset characteristics, analyzing factors beyond dataset scale.
- Demonstrated that expert driving style significantly impacts IL performance. Effective experts should rely on interpretable signals rather than privileged inputs, mirroring human driving behavior.
- Proposed a data filtering strategy to reduce dataset size by ~50% without sacrificing model performance.
- Achieved second place in the 2024 CARLA Challenge and first on Bench2Drive test routes with a robust IL model. |
- [CARLA](https://carla.org/)
- [PDM-Lite](https://huggingface.co/datasets/autonomousvision/PDM_Lite_Carla_LB2) | [GitHub Page](https://github.com/autonomousvision/carla_garage) |
## Surveys
Here we show some survey papers regarding LLMs and VLMs for autonomous driving.
| Title (Year) | Short Description | Project Page |
|-------|-------------------|--------------|
| [End-To-End Planning of Autonomous Driving in Industry and Academia: 2022-2023](https://arxiv.org/pdf/2401.08658) (2023) | This paper offers a concise review of current end-to-end planning methods in autonomous driving, covering technologies from both industry and academia. It highlights key developments from companies such as Tesla (FSD V12), Momenta, Horizon Robotics, Motional RoboTaxi, Woven Planet (Toyota: Urban Driver), and Nvidia, alongside state-of-the-art academic research from 2022-2023. The review provides a structured overview aimed at beginners seeking an introduction to the field and advanced researchers looking for supplementary insights into recent advancements. | None |
| [Towards Knowledge-driven Autonomous Driving](https://arxiv.org/pdf/2312.04316) (2023) | This paper investigates knowledge-driven autonomous driving as a solution to the limitations of current systems, such as data bias, challenges in long-tail scenarios, and lack of interpretability. It emphasizes the potential of knowledge-driven approaches, which integrate cognition, generalization, and lifelong learning to address these issues. The paper examines key components (datasets and benchmarks, environments, and driver agents) and explores advanced techniques like large language models, world models, and neural rendering. It reviews previous research efforts and provides guidance for future advancements, aiming to create more adaptive, intelligent, and holistic autonomous driving systems. | [GitHub Page](https://github.com/PJLab-ADG/awesome-knowledge-driven-AD) |
| [Vision Language Models in Autonomous Driving: A Survey and Outlook](https://arxiv.org/pdf/2310.14414) (2023) | This paper surveys the use of Vision-Language Models (VLMs) in autonomous driving, highlighting their role in improving safety and efficiency through deeper environmental understanding. It reviews advancements in perception, navigation, decision-making, and end-to-end driving, while summarizing tasks, metrics, datasets, and challenges, offering insights for future research. | [GitHub Page](https://github.com/ge25nab/Awesome-VLM-AD-ITS) |
| [A Survey on Multimodal Large Language Models for Autonomous Driving](https://openaccess.thecvf.com/content/WACV2024W/LLVM-AD/papers/Cui_A_Survey_on_Multimodal_Large_Language_Models_for_Autonomous_Driving_WACVW_2024_paper.pdf) (2024) | This paper reviews the integration of Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs) in autonomous driving and mapping systems, highlighting their potential to mimic human-like perception and decision-making. It provides an overview of MLLM tools, datasets, and benchmarks, summarizes insights from the 1st WACV Workshop on LLMs for Autonomous Driving (LLVM-AD), and identifies key challenges and opportunities for advancing this field. | None |
| [Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities](https://arxiv.org/pdf/2401.08045) (2024) | The paper explores the challenges in developing Vision Foundation Models (VFMs) for autonomous driving, highlighting data limitations and task diversity. It reviews key techniques and advancements like NeRF and diffusion models, providing a roadmap for future VFM research. | [GitHub Page](https://github.com/zhanghm1995/Forge_VFM4AD) |