Turns complex multimodal reasoning into an extensible escape-game setting, testing whether MLLMs can combine clues rather than merely recognize isolated images.
Evaluating Time Awareness and Cross-modal Active Perception of Large Models via 4D Escape Room Task
Extends escape-room evaluation into 4D time-aware interaction, testing whether large models can actively gather cross-modal evidence as scenes change over time.
StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding
Targets real streaming video use cases by evaluating online video understanding, where models must reason from incomplete, continuously arriving visual information.
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models
Broadens multimodal evaluation to multilingual ambiguity, testing whether MLLMs can resolve meaning when both language and vision leave room for interpretation.
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models
Tests active perception: whether MLLMs know when to seek additional views instead of making brittle guesses from a single observation.
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models
Yiqi Zhu, Ziyue Wang, Can Zhang, Peng Li, Yang Liu. CVPR 2025, 29569-29579. [pdf] [arXiv] [code] [Project Webpage]
Measures continuous-space perception for vision-language models, replacing qualitative impressions of spatial understanding with a structured benchmark.
EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models
Evaluates first-person perspective thinking, asking whether models can reason from an embodied viewer's point of view rather than an outside observer's.
StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
Makes tool-learning evaluation more reliable through stable, large-scale benchmarking, reducing noise from changing tools and environments.
DongbaMIE: A Multimodal Information Extraction Dataset for Evaluating Semantic Understanding of Dongba Pictograms
Xiaojun Bi, Shuo Li, Junyao Xing, Ziyue Wang, Fuwen Luo, Weizheng Qiao, Lu Han, Ziwei Sun, Peng Li, Yang Liu. Findings of EMNLP 2025, 976-990. [pdf] [arXiv] [code] [Dataset]
Builds a multimodal information extraction dataset for Dongba pictograms, supporting semantic understanding of a culturally distinctive writing system.
PATIMT-Bench: A Multi-Scenario Benchmark for Position-Aware Text Image Machine Translation in Large Vision-Language Models
Establishes large-scale document-level relation extraction, requiring models to connect evidence across an entire document rather than a single sentence.
CodRED: A Cross-Document Relation Extraction Dataset for Acquiring Knowledge in the Wild