Workshop

VLM4OTSU:
Vision-Language Models (VLMs) in Open Traffic Scene Understanding

Introduction

Recent progress in scene understanding has been accompanied by significant improvements in the multimodal information fusion capabilities of vision-language models (VLMs). The open traffic scene is a representative natural environment characterized by high dynamism, uncertain weather conditions, background complexity, environmental diversity, and dependence on traffic rules, among other factors; it poses numerous challenges yet is also closely tied to daily life. Understanding the open traffic scene can provide essential information for practical applications such as autonomous driving, smart transportation, intelligent traffic signal systems, and route optimization. Compared with traditional CNN-based models, VLMs offer a natural and effective approach to understanding open traffic scenes thanks to their powerful image description abilities. However, many challenges remain in VLM-based traffic scene understanding. For instance, in autonomous driving systems, how can we ensure that VLMs perceive driving situations while remaining sensitive to changing natural conditions and adapting to the uncertainty of open traffic scenes? How can we design and construct VLMs that effectively fuse different types of inputs (infrared images, RGB images, 3D point cloud data, etc.) to enhance scene understanding capabilities? Furthermore, how can VLMs analyze directional guidance objects (e.g., traffic signs, text, lights, and ground markers) that carry strong expert priors to support decision-making?

This workshop aims to provide a platform for researchers in related fields to share their latest work, while offering opportunities to discuss the current state of open traffic scene understanding and the limitations of VLMs in this task. Key questions include: Can VLMs truly achieve open traffic scene understanding? How far are current VLMs from reaching this goal? What optimizations are necessary for VLMs to adapt to the hardware constraints of edge devices in real-world deployment, and what are the associated impacts?



Call for Papers

The WACV 2025 Vision-Language Models (VLMs) in Open Traffic Scene Understanding Workshop (https://vlm4otsu.github.io/) seeks to cover a wide range of topics related to advancements in VLMs for traffic scene understanding, including but not limited to:

  • VLM-based Multimodal Data Fusion for Open Traffic Scene Understanding
  • Adapting VLMs to Uncertain Natural Conditions in Autonomous Driving
  • Traffic Sign and Signal Interpretation Using VLMs with Expert Priors
  • Real-Time Edge Deployment of VLMs for Smart Transportation
  • VLMs for Predictive Analytics in Dynamic Traffic Environments
  • VLM-Based Robust Scene Parsing for Autonomous Vehicles
  • Leveraging VLMs for Multi-Agent Collaboration in Traffic Systems
  • Understanding Complex Traffic Situations Through VLM-Enhanced Scene Comprehension
  • Evaluation of VLMs for Open-World Traffic Scene Tasks
  • Benchmarking VLMs for Multimodal Traffic Scene Datasets

Style and Author Instructions

  • Paper Length: We ask authors to use the official WACV 2025 template and to limit submissions to 4-8 pages, excluding references.
  • Dual Submissions: The workshop is non-archival. In addition, in light of the new single-track policy of WACV 2025, we strongly encourage authors of papers accepted to WACV 2025 to also present them at our workshop.
  • Presentation Forms: All accepted papers will be presented as posters during the workshop; selected papers will also be given oral presentations.

All submissions should be anonymized. Papers with more than 4 pages (excluding references) will be reviewed as long papers, and papers with more than 8 pages (excluding references) will be rejected without review. Supplementary material is optional; supported formats are PDF, MP4, and ZIP. All papers that have not previously been presented at a major conference will be peer-reviewed by three experts in the field in a double-blind manner. If you are submitting a previously accepted conference paper, please also attach a copy of the acceptance notification email to the supplementary material.

All submissions should adhere to the WACV 2025 author guidelines.

Contact: If you have any questions, please contact otsuorg@gmail.com.

Submission Portal: https://cmt3.research.microsoft.com/VLM4OTSU2025/Submission/Index

Paper Review Timeline:

  • Paper submission and supplementary material deadline: Nov 22nd, 2024 (PST)
  • Reviews and final decisions released to authors: Dec 28th, 2024 (PST)
  • Camera-ready deadline: Jan 5th, 2025 (PST)



Invited Speakers

Jian Zhao

Leader of Evolutionary Vision+x Oriented Learning (EVOL) Lab & Principal Research Scientist

Jian Zhao is currently the Leader of the EVOL Lab and a Principal Research Scientist with the Institute of AI (TeleAI), China Telecom, Beijing, P.R. China, and a Researcher and Ph.D. Supervisor with the School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University (NWPU), Xi'an, Shaanxi, P.R. China. Previously, he was an Assistant Researcher with the Academy of Military Sciences (AMS), Beijing, P.R. China. He received his Ph.D. degree from the National University of Singapore (NUS) in 2019 under the supervision of Assist. Prof. Jiashi Feng, Assoc. Prof. Shuicheng Yan, and Prof. Hengzhu Liu; his master's degree from the National University of Defense Technology (NUDT) in 2014 under the supervision of Prof. Xucan Chen; and his bachelor's degree from Beihang University (BUAA) in 2012 under the supervision of Dr. Shaopeng Dong and Prof. Mei Yuan. He was supported by the China Scholarship Council (CSC) and the School of Computer, NUDT, to pursue his Ph.D. degree with the Learning and Vision Group, Department of Electrical and Computer Engineering (ECE), Faculty of Engineering (FOE), NUS, Singapore.

Zuxuan Wu

Associate Professor at Fudan University

Zuxuan Wu is an Associate Professor in the School of Computer Science at Fudan University and a member of the Fudan Vision and Learning Laboratory. He received his Ph.D. in Computer Science from the University of Maryland in 2020, advised by Prof. Larry Davis. His research interests are in computer vision and deep learning, with a current focus on large-scale video understanding, video generation, and efficient architectures.

Jia Wan

Professor at the Harbin Institute of Technology (Shenzhen)

Jia Wan is a Professor in the School of Computer Science and Technology at the Harbin Institute of Technology (Shenzhen). Before joining HITSZ, he was a Postdoc in the Statistical Visual Computing Laboratory (SVCL) at the University of California, San Diego, advised by Prof. Nuno Vasconcelos, and at Boston College, supervised by Prof. Donglai Wei. He received his Ph.D. degree from the Video, Image, and Sound Analysis Lab (VISAL) at the City University of Hong Kong, supervised by Prof. Antoni B. Chan. He received his M.Sc. degree from the School of Computer Science and the Center for OPTical IMagery Analysis and Learning (OPTIMAL), supervised by Prof. Qi Wang, and his B.Eng. degree from the Software Engineering School, both at Northwestern Polytechnical University, Xi'an, Shaanxi, China. In 2018, he was an intern at Tencent AI Lab in Shenzhen, working with Dr. Wenhan Luo and Dr. Baoyuan Wu. His research interests include computer vision, intelligent transportation, crowd analysis, and brain image analysis.

Haoyu Chen

Assistant Professor at the University of Oulu

Haoyu Chen is a tenure-track Assistant Professor at CMVS, University of Oulu. Before that, he conducted postdoctoral research at CMVS, University of Oulu, working on Emotion AI (an Academy of Finland project) and trustworthy AI (an Infotech project). He received his Ph.D. from the University of Oulu, Finland, where he was advised by Academy Professor Guoying Zhao. During his Ph.D. studies, he visited CEL, TU Delft, the Netherlands. Prior to that, he received his B.E. degree from the China University of Geosciences, China, and his master's degree from the University of Oulu, Finland. His research interests include machine learning, human behaviour analysis, Emotion AI, and adversarial learning.



Tentative Schedule (Feb 28th, 2025)

Opening remarks and welcome 02:00 PM - 02:05 PM
Jian Zhao 02:10 PM - 02:35 PM
Zuxuan Wu 02:40 PM - 03:05 PM
Jia Wan 03:10 PM - 03:35 PM
Haoyu Chen 03:40 PM - 04:05 PM
Oral Session 04:15 PM - 05:00 PM



