Peng Li — Datasets

Datasets, benchmarks, and resources released alongside our research publications.

How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game

Ziyue Wang, Yurui Dong, Fuwen Luo, Minyuan Ruan, Zhili Cheng, Chi Chen, Peng Li, Yang Liu. ICCV 2025, 4807-4817. [pdf] [arXiv] [code] [Slides] [Video] [Media 1] [Media 2] [Poster]

Turns complex multimodal reasoning into an extensible escape-game setting, making it clear whether MLLMs can combine clues rather than merely recognize isolated images.

Evaluating Time Awareness and Cross-modal Active Perception of Large Models via 4D Escape Room Task

Yurui Dong, Ziyue Wang, Shuyun Lu, Dairu Liu, Xuechen Liu, Fuwen Luo, Peng Li, Yang Liu. Preprint. [arXiv] [code]

Extends escape-room evaluation into 4D time-aware interaction, testing whether large models can actively gather cross-modal evidence as scenes change over time.

StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding

Junming Lin, Zheng Fang, Chi Chen, Haoxuan Cheng, Zihao Wan, Fuwen Luo, Ziyue Wang, Peng Li, Yang Liu, Maosong Sun. ICASSP 2026, 12147-12151. [pdf] [arXiv] [code] [Project Webpage] [Poster]

Targets real streaming video use cases by evaluating online video understanding, where models must reason from incomplete, continuously arriving visual information.

CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models

Fuwen Luo, Chi Chen, Zihao Wan, Zhaolu Kang, Qidong Yan, Yingjie Li, Xiaolong Wang, Siyu Wang, Ziyue Wang, Xiaoyue Mi, Peng Li, Ning Ma, Maosong Sun, Yang Liu. ACL 2024, 10639-10659. [pdf] [arXiv] [code] [Project Webpage]

Focuses on context-dependent visual comprehension, exposing whether MLLMs can use surrounding context to resolve what an image actually means.

MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models

Xiaolong Wang, Zhaolu Kang, Wangyuxuan Zhai, Xinyue Lou, Yunghwei Lai, Ziyue Wang, Yawen Wang, Kaiyu Huang, Yile Wang, Peng Li, Yang Liu. EMNLP 2025, 15026-15048. [pdf] [arXiv]

Broadens multimodal evaluation to multilingual ambiguity, testing whether MLLMs can resolve meaning when both language and vision leave room for interpretation.

ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models

Ziyue Wang, Chi Chen, Fuwen Luo, Yurui Dong, Yuanchi Zhang, Yuzhuang Xu, Xiaolong Wang, Peng Li, Yang Liu. ACL 2025, 7605-7633. [pdf] [arXiv] [code]

Tests active perception: whether MLLMs know when to seek additional views instead of making brittle guesses from a single observation.

CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models

Yiqi Zhu, Ziyue Wang, Can Zhang, Peng Li, Yang Liu. CVPR 2025, 29569-29579. [pdf] [arXiv] [code] [Project Webpage]

Measures continuous-space perception for vision-language models, turning spatial understanding from a qualitative impression into a structured benchmark.

EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models

Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, Yang Liu. CVPR 2024, 14291-14302. [pdf] [arXiv] [code] [Project Webpage]

CVPR 2024 Highlight

Evaluates first-person perspective thinking, asking whether models can reason from an embodied viewer's point of view rather than an outside observer's.

StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models

Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, Yang Liu. Findings of ACL 2024, 11143-11156. [pdf] [arXiv] [code] [Project Webpage]

Makes tool-learning evaluation more reliable through stable, large-scale benchmarking, reducing noise from changing tools and environments.

DongbaMIE: A Multimodal Information Extraction Dataset for Evaluating Semantic Understanding of Dongba Pictograms

Xiaojun Bi, Shuo Li, Junyao Xing, Ziyue Wang, Fuwen Luo, Weizheng Qiao, Lu Han, Ziwei Sun, Peng Li, Yang Liu. Findings of EMNLP 2025, 976-990. [pdf] [arXiv] [code] [Dataset]

Builds a multimodal information extraction dataset for Dongba pictograms, supporting semantic understanding of a culturally distinctive writing system.

PATIMT-Bench: A Multi-Scenario Benchmark for Position-Aware Text Image Machine Translation in Large Vision-Language Models

Wanru Zhuang, Wenbo Li, Zhibin Lan, Xu Han, Peng Li, Jinsong Su. Findings of EMNLP 2025, 16572-16588. [pdf]

Highlights a practical translation gap by testing whether vision-language models understand where text appears in an image, not just what it says.

Bench4Merge: A Comprehensive Benchmark for Merging in Realistic Dense Traffic with Micro-Interactive Vehicles

Zhengming Wang, Junli Wang, Pengfei Li, Zhaohan Li, Chunyang Liu, Bo Zhang, Peng Li, Yilun Chen. IROS 2025, 8910-8917. [pdf] [arXiv] [code] [Video]

Creates realistic dense-traffic merge scenarios with micro-interactive vehicles, pushing autonomous-driving evaluation closer to the difficult merging decisions drivers face every day.

DocRED: A Large-Scale Document-Level Relation Extraction Dataset

Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, Maosong Sun. ACL 2019, 764-777. [pdf] [code & data] [Leaderboard]

Establishes a large-scale benchmark for document-level relation extraction, requiring models to connect evidence across an entire document rather than a single sentence.

CodRED: A Cross-Document Relation Extraction Dataset for Acquiring Knowledge in the Wild

Yuan Yao, Jiaju Du, Yankai Lin, Peng Li, Zhiyuan Liu, Jie Zhou, Maosong Sun. EMNLP 2021, 4452-4472. [pdf] [code & data] [Leaderboard]

Moves relation extraction into the wild by requiring evidence across documents, closer to how knowledge is acquired from real web corpora.

MAVEN: A Massive General Domain Event Detection Dataset

Xiaozhi Wang, Ziqi Wang, Xu Han, Wangyi Jiang, Rong Han, Zhiyuan Liu, Juanzi Li, Peng Li, Yankai Lin, Jie Zhou. EMNLP 2020, 1652-1671. [pdf] [code]

Provides a massive general-domain event detection resource, broadening event detection evaluation to text that is more representative of real-world usage.

MAVEN-ERE: A Unified Large-scale Dataset for Event Coreference, Temporal, Causal and Subevent Relation Extraction

Xiaozhi Wang, Yulin Chen, Ning Ding, Hao Peng, Zimu Wang, Yankai Lin, Xu Han, Lei Hou, Juanzi Li, Zhiyuan Liu, Peng Li, Jie Zhou. EMNLP 2022, 926-941. [pdf] [arXiv] [code]

Unifies event coreference, temporal, causal, and subevent relations, supporting richer event reasoning beyond simply detecting that an event occurred.

MOOCCubeX: A Large Knowledge-centered Repository for Adaptive Learning in MOOCs

Jifan Yu, Yuquan Wang, Qingyang Zhong, Gan Luo, Yiming Mao, Kai Sun, Wenzheng Feng, Wei Xu, Shulin Cao, Kaisheng Zeng, Zijun Yao, Lei Hou, Yankai Lin, Peng Li, Jie Zhou, Bin Xu, Juanzi Li, Jie Tang, Maosong Sun. CIKM 2021, 4643-4652. [pdf] [code]

CIKM 2021 Best Resource Paper Nomination

Builds a knowledge-centered MOOC repository for adaptive learning, connecting courses, concepts, exercises, and learners at large scale.

FewRel 2.0: Towards More Challenging Few-Shot Relation Classification

Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou. EMNLP 2019, 6250-6255. [pdf] [code] [benchmark]

Raises the bar for few-shot relation classification with harder settings that better test whether models can adapt from limited examples.