Kai Chen's Homepage

Email / CV / Github / Twitter / Google Scholar

About Me

I am currently a Ph.D. candidate in CSE department of Hong Kong University of Science and Technology (HKUST), supervised by Prof. Dit-Yan Yeung. Previously, I was an undergraduate student majoring in Computer Science in Fudan University honored as the Outstanding Undergraduate of Shanghai (上海市优秀毕业生), supervised by Prof. Yanwei Fu. My research aims at building reliable Multi-modal AI systems from a data-centric perspective. Currently, I'm trying to answer, 1) How to build end-to-end Multi-modal LLMs with frontier visual, textual, and speech abilities? 2) How to build 3D visual world models in a controllable and scalable way? 3) How to get effective feedback from (M)LLMs themselves without reward models? 4) Does more data always result in better performance?

👋 I'm currently on job market of both academics and industry. Feel free to send me emails if we are a good fit!

Some recent works include:

Multi-modal Foundation Models - Omni-modality and Reasoning: EMOVA, RAPID.
Multi-modal Foundation Models - Mixture of Cluster-conditional Experts (MoCE): MoCLE, MoCE, SDR.
Multi-modal Foundation Models - Scalable Oversight for MLLM Self-align: Mistake analysis, MoTE, ECSO, Tri-HE, Val_PPL.
Visual World Models - Corner Cases for Autonomous Driving: ECCV 2024 W-CODA Workshop, CODA-LM, CODA.
Visual World Models - Geometric-controllable Visual Generation: GeoDiffusion, TrackDiffusion, MagicDrive, MagicDrive-V2.

News

[2025.06] [New!] We open-source RACRO, an efficient and scalable method of build multi-modal reasoning models which can flexibly adapt to any advanced reasoning LLMs during infernece time. Welcome to try our demo!
[2025.03] [New!] We open-source EMOVA, a frontier end-to-end Omni-modal LLM with SoTA vision-language and speech abilities, which has been accepted by CVPR 2025!
[2025.11] One paper (MagicDrive3D) accepted by WACV 2026! See you in Tucson!
[2025.10] One paper (MoCLE) accepted by IEEE TIP 2025!
[2025.10] Invited to serve as a reviewer for CVPR 2026, ARR 2026, ICLR 2026, WACV 2026!
[2025.08] One paper (Val_PPL) accepted by EMNLP 2025 (Oral)! See you in Suzhou!
[2025.06] One paper (MagicDrive-V2) accepted by ICCV 2025! See you in Hawaii!
[2025.06] One paper (Tri-HE) accepted by TMLR 2025!
[2025.05] One paper (MoTE) accepted by ACL 2025! See you in Vienna!
[2025.05] I give a talk on EMOVA at AI TIME!
[2025.03] Invited to serve as a reviewer for NeurIPS 2025, ICCV 2025, ACM MM 2025!
[2025.02] One paper (EMOVA) accepted by CVPR 2025! See you in Nashville!
[2024.12] Invited to serve as an area chair for IJCAI 2025!
[2024.12] Invited to serve as a reviewer for ICML 2025!
[2024.10] Two papers (CODA-LM and TrackDiffusion) accepted by WACV 2025! See you in Tucson!
[2024.09] Invited to serve as a reviewer for ICLR 2025!
[2024.09] We will hold the Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving: Towards Next-Generation Solutions at ECCV 2024! Looking forward to see you in Milano, Italy!
[2024.07] Two papers accepted by ECCV 2024! See you in Milano, Italy!
[2024.05] I give a talk about Geometric-controllable Visual Generation: A Systemetic Solution (GeoDiffusion, TrackDiffusion, MagicDrive, W-CODA2024) at VALSE Webinar!
[2024.04] I give a talk about corner case generation for autonomous driving (GeoDiffusion, TrackDiffusion, MagicDrive, DetDiffusion) at AIDriver!
[2024.04] CODA-LM, the new multi-modal version of CODA, is online!
[2024.03] Invited to serve as a reviewer for NeuIPS 2024, TCSVT!
[2024.02] One paper accepted by CVPR 2024! See you in Seattle!
[2024.02] Code and checkpoints of GeoDiffusion and MagicDrive have been released. Welcome to try!
[2024.02] I give a talk about our ICLR 2024 work Mistake Analysis at AI TIME!
[2024.01] I give a talk about our ICLR 2024 work Mistake Analysis at TechBeat!
[2024.01] Three papers accepted by ICLR 2024! See you in Vienna!
[2024.01] Invited to serve as a reviewer for TPAMI, ECCV 2024, ACCV 2024!
[2023.12] Our MoCLE is reported by Liangziwei!
[2023.12] Our MoCLE, the first MLLM with MoE architecture for instruction customization and generalization, is on Arxiv!
[2023.12] Recent surveys [1][2] show the remarkble GPT-4V still suffers from corner cases from our CODA dataset!
[2023.12] Invited to serve as a reviewer for IJCAI 2024, CVPR 2024, ICLR 2024!
[2023.10] Our MagicDrive is reported by Xinzhiyuan, and Mistake Analysis is reported by Liangziwei!
[2023.05] Our papers MixedAE (CVPR 2023), MoCE (ICLR 2023) and CODA (ECCV 2022) will be presented in VALSE 2023! See you in Wuxi!
[2023.05] One paper accepted by Workshop of Self-supervised Learning, VALSE 2023 (spotlight)!
[2023.05] One paper accepted by Workshop of Autonomous Driving, VALSE 2023 (spotlight)!
[2023.03] Invited to serve as a reviewer for NeurIPS 2023!
[2023.02] One paper accepted by CVPR 2023! See you in Vancouver!
[2023.01] One paper accepted by ICLR 2023 (spotlight Top25%)! Happy Lunar New Year!
[2023.01] Invited to serve as a reviewer for ICCV 2023, IJCAI 2023!
[2022.11] Invited to serve as a reviewer for CVPR 2023!
[2022.08] Our CODA dataset will be utilized to hold the 2nd SSLAD ECCV 2022 workshop and competition at CodaLab!
[2022.08] Invited to serve as a reviewer for ICLR 2023!
[2022.07] One paper accepted by ECCV 2022!
[2022.06] Invited to serve as a reviewer for TIP!
[2022.05] Invited to serve as a reviewer for NeurIPS 2022, ECCV 2022!
[2021.12] One paper accepted by AAAI 2022!
[2021.11] Invited to serve as a reviewer for CVPR 2022, ICRA 2022 and AAAI 2022!
[2021.10] One paper accepted by NeurIPS 2021!
[2021.07] One paper accepted by ICCV 2021!
[2021.07] Our SODA10M dataset will be utilized to hold the SSLAD ICCV 2021 workshop on Self-supervised Learning for Next-Generation Industry-level Autonomous Driving. All are welcome!
[2021.06] Invited to serve as a reviewer for NeurIPS 2021!
[2020.06] Successful undergrad thesis defend!
[2020.03] One paper accepted by IEEE Access!
[2019.06] One paper accepted by IROS 2019!

Publications

Full publication list on Google Scholar. (* denotes equal contribution, highlighted blocks denote the represnetative works)

Works are organized with respect to topics, including:

Multi-modal Foundation Models: Omni-modal LLMs, Mixture of Experts, (M)LLM Self-alignment
Visual World Models: Corner Cases for Autonomous Driving, Geometric-controllable Visual Generation
Representation Learning: Object-level Self-supervised Learning

Multi-modal Foundation Models - Omni-modality and Reasoning

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Kai Chen*, Yunhao Gou*, Runhui Huang*, Zhili Liu*, Daxin Tan* and other 26 authors

Fully open-sourced Omni-modal LLMs with SoTA vision-language and speech abilities!

IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2025

[PDF] [Webpage] [Talk] [Talk (Chinese)] [Wechat Post] [Code] GitHub Repo stars

Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning

Yunhao Gou*, Kai Chen*, Zhili Liu*, Lanqing Hong, Xin Jin, Zhenguo Li, James T. Kwok, Yu Zhang

Scaling reasoning MLLMs via adopting any advanced LLM reasoners during inference time!

Arxiv preprint, 2025

[PDF] [Demo] [Code] GitHub Repo stars

Multi-modal Foundation Models - Mixture of Cluster-conditional Experts (MoCE)

Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning

Yunhao Gou*, Zhili Liu*, Kai Chen*, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James Kwok, Yu Zhang.

First MLLM with MoE for instruction customization and generalization!

IEEE Transactions on Image Processing (TIP), 2025.

[PDF] [Webpage] [Talk] [Wechat Post] [Code] GitHub Repo stars

Task-customized Masked Autoencoder via Mixture of Cluster-conditional Experts

Zhili Liu*, Kai Chen*, Jianhua Han, Lanqing HONG, Hang Xu, Zhenguo Li, James Kwok.

International Conference on Learning Representations (ICLR), 2023 (spotlight Top25%).

[PDF][Wechat Post]

Task-Customized Self-Supervised Pre-training with Scalable Dynamic Routing

Zhili Liu, Jianhua Han, Kai Chen, Lanqing Hong, Hang Xu, Chunjing Xu, Zhenguo Li.

AAAI Conference on Artificial Intelligence (AAAI), 2022.

[PDF]

Multi-modal Foundation Models - Scalable Oversight for (M)LLM Self-alignment

Corrupted but Not Broken: Rethinking the Impact of Corrupted Data in Visual Instruction Tuning

Yunhao Gou, Hansi Yang, Zhili Liu, Kai Chen, Yihan Zeng, Lanqing Hong, Zhenguo Li, Qun Liu, James T Kwok, Yu Zhang.

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025 (Oral).

[PDF]

Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models

Junjie Wu*, Tsz Ting Chung*, Kai Chen*, Dit-Yan Yeung.

Transactions on Machine Learning Research (TMLR), 2025.

[PDF] [Webpage] [Code] GitHub Repo stars

Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

Yunhao Gou*, Kai Chen*, Zhili Liu*, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James Kwok, Yu Zhang.

Aligning a MLLM with its own LLM!

European Conference on Computer Vision (ECCV), 2024.

[PDF] [Webpage] [Code] GitHub Repo stars

Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment

Zhili Liu*, Yunhao Gou*, Kai Chen*, Lanqing Hong, Jiahui Gao, Fei Mi, Yu Zhang, Zhenguo Li, Xin Jiang, Qun Liu, James T. Kwok

Annual Meeting of the Association for Computational Linguistics (ACL), 2025.

[PDF]

Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis

Kai Chen*, Chunwei Wang*, Kuo Yang, Jianhua Han, Lanqing Hong, Fei Mi, Hang Xu, Zhengying Liu, Wenyong Huang, Zhenguo Li, Dit-Yan Yeung, Lifeng Shang, Xin Jiang, Qun Liu.

Scalable oversight for LLM's generation ability via its own discrimination ability!

International Conference on Learning Representations (ICLR), 2024.

[PDF] [Talk] [Wechat Post] [Mistake Analysis @ ICLR 2024]

Visual World Models - Corner Cases for Autonomous Driving (CODA)

ECCV 2024 W-CODA: 1st Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving

Kai Chen*, Ruiyuan Gao*, Lanqing Hong*, Hang Xu, Xu Jia, Holger Caesar, Dengxin Dai, Bingbing Liu, Dzmitry Tsishkou, Songcen Xu, Chunjing Xu, Qiang Xu, Huchuan Lu, and Dit-Yan Yeung.

First workshop on Multi-modal Foundation Models for Autonomous Driving Corner Cases!

European Conference on Computer Vision (ECCV), 2024.

[PDF] [Webpage] [Wechat Post 1] [Wechat Post 2]

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases

Kai Chen*, Yanze Li*, Wenhua Zhang*, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, Dit-Yan Yeung, Huchuan Lu, and Xu Jia.

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.

[PDF] [Webpage] [ECCV 2024 Workshop] [Code] GitHub Repo stars

CODA: A Real-World Corner Case Dataset for Object Detection in Autonomous Driving

Kaican Li*, Kai Chen*, Haoyu Wang*, Lanqing Hong, Chaoqiang Ye, Jianhua Han, Yukuai Chen, Wei Zhang, Chunjing Xu, Dit-Yan Yeung, Xiaodan Liang, Zhenguo Li, Hang Xu.

European Conference on Computer Vision (ECCV), 2022.

Workshop of Automonous Driving, Vision and Learning Seminar (VALSE), 2023 (spotlight).

[PDF] [Webpage] [Talk] [ECCV 2022 Workshop] [GPT-4V suffers from CODA]

Visual World Models - Geometric-controllable Visual Generation

MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes

Ruiyuan Gao, Kai Chen, Zhihao Li, Lanqing Hong, Zhenguo Li, Qiang Xu.

First work on geometric-controllable 3D scene generation!

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026.

[PDF] [Webpage] [Wechat Post] [Code] GitHub Repo stars

MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu.

Stronger and longer multi-view video generation!

IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

[PDF] [Webpage] [Wechat Post] [Code] GitHub Repo stars

Implicit Concept Removal of Diffusion Models

Zhili Liu*, Kai Chen*, Yifan Zhang, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James Kwok.

European Conference on Computer Vision (ECCV), 2024.

[PDF] [Webpage] [Talk]

DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception

Yibo Wang*, Ruiyuan Gao*, Kai Chen*, Kaiqiang Zhou, Yingjie Cai, Lanqing Hong, Zhenguo Li, Lihui Jiang, Dit-Yan Yeung, Qiang Xu, Kai Zhang.

First work on corner case generation for object detection!

IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024.

[PDF] [Wechat Post]

MagicDrive: Street View Generation with Diverse 3D Geometry Control

Ruiyuan Gao*, Kai Chen*, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, Qiang Xu.

Foundational work of visual world modeling for autonomous driving!

First work on geometric-controllable multi-view video generation!

International Conference on Learning Representations (ICLR), 2024.

[PDF] [Webpage] [Talk] [Wechat Post] [HDC 2024] [ECCV 2024 Workshop] [Code] GitHub Repo stars

TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models

Pengxiang Li*, Kai Chen*, Zhili Liu*, Ruiyuan Gao, Lanqing Hong, Guo Zhou, Hua Yao, Dit-Yan Yeung, Huchuan Lu, Xu Jia.

First work on geometric-controllable video generation!

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.

[PDF] [Webpage] [Code] GitHub Repo stars

GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation

Kai Chen*, Enze Xie*, Zhe Chen, Yibo Wang, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung.

First work on geometric-controllable image generation!

International Conference on Learning Representations (ICLR), 2024.

[PDF] [Webpage] [Code] GitHub Repo stars

Representation Learning - Object-level Self-supervised Learning

Mixed Autoencoder for Self-supervised Visual Representation Learning

Kai Chen*, Zhili Liu*, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung.

IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.

Workshop of Self-supervised Learning, Vision and Learning Seminar (VALSE), 2023 (spotlight).

[PDF][Wechat Post][Talk]

MultiSiam: Self-supervised Multi-instance Siamese Representation Learning for Autonomous Driving

Kai Chen, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung.

IEEE/CVF International Conference on Computer Vision (ICCV), 2021.

[PDF]

SODA10M: A Large-Scale 2D Self/Semi-Supervised Object Detection Dataset for Autonomous Driving.

Jianhua Han, Xiwen Liang, Hang Xu, Kai Chen, Lanqing Hong, Jiageng Mao, Chaoqiang Ye, Wei Zhang, Zhenguo Li, Xiaodan Liang, Chunjing Xu.

Datasets and Benchmarks Track, Neural Information Processing Systems (NeurIPS), 2021.

[PDF] [Webpage] [Talk] [ICCV 2021 Workshop]

Talks

[AI TIME Online] EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions. [Recording]
[VALSE Webinar] Geometric-controllable Visual Generation: A Systemetic Solution. [Recording]
[AIDriver Online] Controllable Corner Case Generation for Autonomous Driving. [Recording]
[AI TIME Online] Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis. [Recording]
[TechBeat Online] Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis. [Recording]
[VALSE 2023@Wuxi] Mixed Autoencoder for Self-supervised Visual Representation Learning. [Recording]
[VALSE 2023@Wuxi] CODA: A Real-World Road Corner Case Dataset for Object Detection in Autonomous Driving. [Recording]

Experiences

Indiana University Bloomington

Indiana, U.S.A.

June 2019 - Sep. 2019

Visiting Scholar at Computer Vision lab, supervised by Prof. David Crandall

University of Manchester

Manchester, U.K.

Sep. 2018 - Jan. 2019

International exchange student, supervised by Dr. Tingting Mu

Academic Services

Program committee/Organizer:

The 1st W-CODA Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving at ECCV'24.
The 2nd SSLAD Workshop at ECCV 2022.
The 1st SSLAD (Self-supervised Learning for Next-generation Industry-level Autonomous Driving) Workshop at ICCV 2021.

Area chair:

Conference: IJCAI 2025.

Reviewer:

Conference: CVPR 2026-2022, ICLR 2026-2023, WACV 2026, ARR 2025, NeurIPS 2025-2021, MM 2025, ICCV 2025-2023, ICML 2025, ECCV 2024-2022, ACCV 2024, IJCAI 2024-2023, ICRA 2022, AAAI 2022.
Journal: TPAMI, TCSVT, TIP and IEEE Access.

Selected Awards

CVPR 2025 Doctoral Consortium Awards

2025

HKUST Research Travel Grant

2023-2025

HKUST Postgraduate Scholarship

2020

Outstanding Graduate of Shanghai [post]

2020

Scholarship for Outstanding Graduates of Fudan University

2020

Oversea Visiting Student Stipend of Fudan University

2019

Joel & Ruth Spira Scholarship

2019

National Scholarship

2018

Scholarship for Outstanding Undergraduates of Fudan University

2017

Interest

I love basketball and I'm also a big fan of Stepfen Curry, MVP point guard of Golden State Warriors, NBA. I'm a team member of my class's basketball team and often play Score / Power forward (SF/PF). In my spare time, I also play the role of a basketball game referee. Hope one day I can have a chance to see a home game of Warriors in Chase Center San Francisco!

Kai Chen (陈铠)

Ph.D. Candidate @ HKUST

Multi-modal Foundation Models - Omni-modality and Reasoning

Multi-modal Foundation Models - Mixture of Cluster-conditional Experts (MoCE)

Multi-modal Foundation Models - Scalable Oversight for (M)LLM Self-alignment

Visual World Models - Corner Cases for Autonomous Driving (CODA)

Visual World Models - Geometric-controllable Visual Generation

Representation Learning - Object-level Self-supervised Learning