Workshop

VLM4OTSU:
Vision-Language Models (VLMs) in Open Traffic Scene Understanding

Introduction

Recent progress in scene understanding has been accompanied by significant improvements in the multimodal information fusion capabilities of vision-language models (VLMs). The open traffic scene is a representative natural environment characterized by high dynamism, uncertain weather conditions, background complexity, environmental diversity, and dependence on traffic rules, among other factors; it poses numerous challenges yet is also closely tied to daily life. Understanding the open traffic scene can provide essential information for practical applications such as autonomous driving, smart transportation, intelligent traffic signal systems, and route optimization. Compared with traditional CNN-based models, VLMs offer a natural and effective approach to understanding open traffic scenes thanks to their powerful image description abilities. However, many challenges remain in VLM-based traffic scene understanding. For instance, in autonomous driving systems, how can we ensure that VLMs perceive driving situations while remaining sensitive to changing natural conditions and adapting to the uncertainty of open traffic scenes? How can we design and construct VLMs that effectively fuse different types of inputs (infrared images, RGB images, 3D point cloud data, etc.) to enhance scene understanding capabilities? Furthermore, how can VLMs analyze directional guidance objects (e.g., traffic signs, text, lights, and ground markers) that carry strong expert priors to support decision-making?

This workshop aims to provide a platform for researchers in related fields to share their latest work, while offering opportunities to discuss the current state of open traffic scene understanding and the limitations of VLMs in this task. Key questions include: Can VLMs truly achieve open traffic scene understanding? How far are current VLMs from reaching this goal? What optimizations are necessary for VLMs to adapt to the hardware constraints of edge devices in real-world deployment, and what are the associated impacts?



Call for Papers

The WACV 2025 Vision-Language Models (VLMs) in Open Traffic Scene Understanding Workshop (https://vlm4otsu.github.io/) seeks to cover a wide range of topics related to advancements in VLMs for traffic scene understanding, including but not limited to:

  • VLM-based Multimodal Data Fusion for Open Traffic Scene Understanding
  • Adapting VLMs to Uncertain Natural Conditions in Autonomous Driving
  • Traffic Sign and Signal Interpretation Using VLMs with Expert Priors
  • Real-Time Edge Deployment of VLMs for Smart Transportation
  • VLMs for Predictive Analytics in Dynamic Traffic Environments
  • VLM-Based Robust Scene Parsing for Autonomous Vehicles
  • Leveraging VLMs for Multi-Agent Collaboration in Traffic Systems
  • Understanding Complex Traffic Situations Through VLM-Enhanced Scene Comprehension
  • Evaluation of VLMs for Open-World Traffic Scene Tasks
  • Benchmarking VLMs for Multimodal Traffic Scene Datasets

Style and Author Instructions

  • Paper Length: We ask authors to use the official WACV 2025 template and to limit submissions to 4-8 pages, excluding references.
  • Dual Submissions: The workshop is non-archival. In addition, in light of the new single-track policy of WACV 2025, we strongly encourage authors of papers accepted to WACV 2025 to also present them at our workshop.
  • Presentation Forms: All accepted papers will be presented as posters during the workshop; selected papers will also be given oral presentations.

All submissions should be anonymized. Papers with more than 4 pages (excluding references) will be reviewed as long papers, and papers with more than 8 pages (excluding references) will be rejected without review. Supplementary material is optional; supported formats are PDF, MP4, and ZIP. All papers that have not previously been presented at a major conference will be peer-reviewed by three experts in the field in a double-blind manner. If you are submitting a previously accepted conference paper, please also attach a copy of the acceptance notification email to the supplementary material.

All submissions should adhere to the WACV 2025 author guidelines.

Contact: If you have any questions, please contact otsuorg@gmail.com.

Submission Portal: https://cmt3.research.microsoft.com/VLM4OTSU2025/Submission/Index

Paper Review Timeline:

  • Paper submission and supplementary material deadline: Nov 22nd, 2024 (PST)
  • Reviews and final decisions released to authors: Dec 28th, 2024 (PST)
  • Camera-ready deadline: Jan 5th, 2025 (PST)



Invited Speakers

Jian Zhao

Leader of Evolutionary Vision+x Oriented Learning (EVOL) Lab & Principal Research Scientist

Jian Zhao is currently the Leader of the EVOL Lab and a Principal Research Scientist with the Institute of AI (TeleAI), China Telecom, Beijing, P.R. China, and a Researcher and Ph.D. Supervisor with the School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University (NWPU), Xi'an, Shaanxi, P.R. China. Previously, he was an Assistant Researcher with the Academy of Military Sciences (AMS), Beijing, P.R. China. He received his Ph.D. degree from the National University of Singapore (NUS) in 2019 under the supervision of Assist. Prof. Jiashi Feng, Assoc. Prof. Shuicheng Yan, and Prof. Hengzhu Liu; his master's degree from the National University of Defense Technology (NUDT) in 2014 under the supervision of Prof. Xucan Chen; and his bachelor's degree from Beihang University (BUAA) in 2012 under the supervision of Dr. Shaopeng Dong and Prof. Mei Yuan. He was supported by the China Scholarship Council (CSC) and the School of Computer, NUDT, to pursue his Ph.D. degree with the Learning and Vision Group, Department of Electrical and Computer Engineering (ECE), Faculty of Engineering (FOE), NUS, Singapore.

Zuxuan Wu

Associate Professor at Fudan University

Zuxuan Wu is an Associate Professor in the School of Computer Science at Fudan University and a member of the Fudan Vision and Learning Laboratory. He received his Ph.D. in Computer Science from the University of Maryland in 2020, advised by Prof. Larry Davis. His research interests are in computer vision and deep learning, with a current focus on large-scale video understanding, video generation, and efficient architectures.

Jia Wan

Professor at the Harbin Institute of Technology (Shenzhen)

Jia Wan is a Professor in the School of Computer Science and Technology at the Harbin Institute of Technology (Shenzhen). Before joining HITSZ, he was a Postdoc in the Statistical Visual Computing Laboratory (SVCL) at the University of California, San Diego, advised by Prof. Nuno Vasconcelos, and at Boston College, supervised by Prof. Donglai Wei. He received his Ph.D. degree from the Video, Image, and Sound Analysis Lab (VISAL) at the City University of Hong Kong, supervised by Prof. Antoni B. Chan. He received his M.Sc. degree from the School of Computer Science and the Center for OPTical IMagery Analysis and Learning (OPTIMAL), supervised by Prof. Qi Wang, and his B.Eng. degree from the Software Engineering School, both at Northwestern Polytechnical University, Xi'an, Shaanxi, China. In 2018, he was an intern at Tencent AI Lab in Shenzhen, working with Dr. Wenhan Luo and Dr. Baoyuan Wu. His research interests include computer vision, intelligent transportation, crowd analysis, and brain image analysis.

Haoyu Chen

Assistant Professor at the University of Oulu

Haoyu Chen is a tenure-track Assistant Professor at CMVS, University of Oulu. Before that, he conducted postdoctoral research at CMVS, University of Oulu, working on Emotion AI (an Academy of Finland project) and trustworthy AI (an Infotech project). He received his Ph.D. from the University of Oulu, Finland, where he was advised by Academy Professor Guoying Zhao. During his Ph.D. studies, he visited CEL, TU Delft, the Netherlands. Prior to that, he received his B.E. degree from the China University of Geosciences, China, and his master's degree from the University of Oulu, Finland. His research interests include machine learning, human behaviour analysis, Emotion AI, and adversarial learning.



Tentative Schedule (Feb 28th, 2025)

Opening remarks and welcome 02:00 PM - 02:05 PM
Jian Zhao 02:10 PM - 02:35 PM
Zuxuan Wu 02:40 PM - 03:05 PM
Jia Wan 03:10 PM - 03:35 PM
Haoyu Chen 03:40 PM - 04:05 PM
Oral Session 04:15 PM - 05:00 PM



