
VLA-McGill-2025-09-Autonomous-Driving.pptx (17 slides)
A Survey on Vision-Language-Action Models for Autonomous Driving
Sicong Jiang*, Zilin Huang*, Kangan Qian*, Ziang Luo, Tianze Zhu, Yihong Tang, Menglin Kong, and others
McGill University, Tsinghua University, Xiaomi Corporation, University of Wisconsin-Madison, University of Minnesota-Twin Cities

Outline
1. Introduction: From End-to-End AD to VLA4AD
2. The VLA4AD Architecture
3. Progress of VLA4AD Models
4. Datasets & Benchmarks
5. Training & Evaluation
6. Challenges & Future Directions
7. Conclusion

1. From End-to-End AD to VLA4AD

(a) End-to-End Autonomous Driving
One neural network maps raw sensors to steering/brake commands, removing hand-crafted perception and planning modules.
Pros:
- Simpler pipeline
- Holistic optimization
Cons:
- Black box, hard to audit
- Fragile on long-tail events
- No natural-language interface: difficult to explain decisions or follow commands
Figure 1. Driving paradigms: End-to-End Models for Autonomous Driving

(b) Vision-Language Models for Autonomous Driving
Fuse a vision encoder with an LLM for scene captioning, QA, and high-level maneuvers.
Pros:
- Zero-shot generalization to rare objects
- Human-readable explanations
Cons:
- Action gap remains
- Latency and limited spatial awareness
- Risk of LLM hallucinations
A first step toward interactive, explainable driving systems.
Figure 2. Driving paradigms: Vision-Language Models for Autonomous Driving

(c) Vision-Language-Action Models for Autonomous Driving
Unified policy: multimodal encoder + language tokens + action head.
Outputs: driving trajectory/control + textual rationale.
Pros:
- Unified Vision-Language-Action system
- Enables free-form instruction following and CoT reasoning
- Human-readable explanations
- Improved robustness in corner cases
Open issues:
- Runtime gap
- Tri-modal data scarcity
Demonstrates great potential for driving autonomous vehicles with human-level reasoning and clear explanations.
Figure 3. Driving paradigms: Vision-Language-Action Models for Autonomous Driving

2. The VLA4AD Architecture

Input and Output Paradigm
Multimodal inputs:
- Vision (cameras): capturing the dynamic scene
- Sensors (LiDAR, radar): providing precise 3D structure and velocity
- Language (commands, QA): defining high-level user intent
Outputs:
- Control action (low-level): direct steering/throttle signals
- Plans (trajectory): a sequence of future waypoints
- Explanations (combined with other actions): rationale for decisions
(Diagram labels: Object Detection, Lane Detection, Occupancy, Planning Trajectory, Steering Control, Brake Control)

Core Architectural Modules
- Vision Encoder: self-supervised backbones (DINOv2, CLIP); BEV projection and LiDAR fusion
- Language Processor: task-finetuned LLaMA, Qwen, Vicuna, GPT, and other large language base models; lightweight adaptation methods such as LoRA and adapters are often used
- Action Decoder: autoregressive tokens and diffusion planners
- Hierarchical Controller: high-level plans executed by low-level PID/MPC controllers
Figure 4. Architectural Paradigm of the VLA for Autonomous Driving Model

3. Progress of VLA4AD Models

Key Stages of VLA Models for Autonomous Driving
Figure 5. The Progression from Passive Explainers to Active Reasoning Agents

Representative VLA4AD Models

1. Pre-VLA: Language as Explainer (e.g., DriveGPT4 [1])
Role: A frozen LLM provides post-hoc textual descriptions of the scene or intended maneuvers.
Limitation: Language is a passive overlay, not integral to decision-making, leading to a semantic gap and potential hallucinations.

2. Modular VLA4AD (e.g., CoVLA-Agent [2], SafeAuto [3])
Role: Language becomes an active intermediate representation for planning, often validated by symbolic rules.
Limitation: Multi-stage pipelines introduce latency and are prone to cascading failures at module boundaries.

[1] Xu, Zhenhua, et al. "DriveGPT4: Interpretable end-to-end autonomous driving via large language model." IEEE Robotics and Automation Letters (2024).
[2] Arai, Hidehisa, et al. "CoVLA: Comprehensive vision-language-action dataset for autonomous driving." 2025 IEEE/CVF WACV. IEEE, 2025.
[3] Zhang, Jiawei, et al. "SafeAuto: Knowledge-enhanced safe autonomous driving with multimodal foundation models." arXiv preprint arXiv:2503.00211 (2025).
Figure 6. DriveGPT4 (2024): Interpretable LLM for AD
Figure 7. CoVLA-Agent (2025), trained with the CoVLA dataset

3. Unified End-to-End VLA (e.g., EMMA [1])
Role: A single unified network maps multimodal sensor inputs directly to control actions or trajectories.
Limitation: While reactive, these models can struggle with long-horizon reasoning and complex multi-step planning.

4. Reasoning-Augmented VLA4AD (e.g., ORION [2], AutoVLA [3])
Role: Language models are central to the control loop, enabling long-term memory and Chain-of-Thought reasoning before acting.
Status: Show promising results in long-term reasoning and interaction, but inference delay could be a potential problem.

[1] Hwang, Jyh-Jing, et al. "EMMA: End-to-end multimodal model for autonomous driving." TMLR, 2025.
[2] Fu
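The input/output paradigm described in Section 2 (multimodal encoder + language tokens + action head emitting waypoints plus a rationale) can be illustrated with a minimal numerical sketch. This is a toy stand-in, not any published VLA4AD model: the "encoders" are random projections, the text encoder is a character histogram, and the fusion is a plain sum where real systems use cross-attention.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # shared embedding width (illustrative)

W_img = rng.normal(size=(D, 64))    # stand-in for a vision backbone feature head
W_pts = rng.normal(size=(D, 3))     # stand-in LiDAR point encoder
W_txt = rng.normal(size=(D, 16))    # stand-in language-token embedding
W_act = rng.normal(size=(2 * 5, D)) # action head: five future (x, y) waypoints

def encode_text(command: str) -> np.ndarray:
    """Toy tokenizer: bucket characters into a fixed 16-bin histogram."""
    v = np.zeros(16)
    for ch in command.lower():
        v[ord(ch) % 16] += 1.0
    return v

def vla_step(image_feat, lidar_pts, command):
    # 1. Encode each modality into the shared embedding space.
    z_img = W_img @ image_feat
    z_pts = W_pts @ lidar_pts.mean(axis=0)  # pool the point cloud
    z_txt = W_txt @ encode_text(command)
    # 2. Fuse modalities (sum here; real models use cross-attention).
    z = np.tanh(z_img + z_pts + z_txt)
    # 3. Decode a 5-step (x, y) trajectory plus a textual rationale.
    waypoints = (W_act @ z).reshape(5, 2)
    rationale = f"Following command '{command}'; planned {len(waypoints)} waypoints."
    return waypoints, rationale

waypoints, rationale = vla_step(
    image_feat=rng.normal(size=64),
    lidar_pts=rng.normal(size=(100, 3)),
    command="turn left at the intersection",
)
print(waypoints.shape)  # (5, 2)
print(rationale)
```

The key structural point the sketch captures is that all three modalities meet in one shared embedding before a single action head, which is what distinguishes the unified VLA policy from the modular pipelines of stage 2.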

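The "hierarchical controller" module (high-level plans executed by low-level PID/MPC) can likewise be sketched: the VLA policy supplies a target waypoint at a low rate, and a classical PID loop tracks it at a high rate. The gains, time step, and the simple velocity-command plant below are illustrative assumptions, not values from any cited system.

```python
class PID:
    """Minimal discrete PID controller."""
    def __init__(self, kp: float, ki: float, kd: float, dt: float):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = 0.0

    def step(self, err: float) -> float:
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

# High level: the planner's next waypoint gives a target lateral offset.
target_y = 1.0

# Low level: track it with PID; toy plant where the command sets lateral velocity.
pid = PID(kp=2.0, ki=0.1, kd=0.5, dt=0.05)
y = 0.0
for _ in range(200):          # 10 s of simulated tracking at 20 Hz
    u = pid.step(target_y - y)
    y += u * pid.dt           # plant: velocity command integrated once

print(round(y, 3))            # settles near target_y
```

Splitting the stack this way keeps the slow, expensive VLA inference out of the safety-critical inner loop, which is one common answer to the "runtime gap" listed among the open issues.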