Yuzhi Huang 「黄誉之」

| CV | Email | Github |
| Google Scholar | HuggingFace |
| LinkedIn | RedNote |

I am an incoming Ph.D. student at MMLab@SIGS, Tsinghua Shenzhen International Graduate School (SIGS), advised by Prof. Zhi Wang. Currently, I am pursuing my master's degree in the School of Informatics at Xiamen University, advised by Prof. Xinghao Ding and Prof. Yue Huang. I also collaborate with Dr. Chenxin Li from the AIM group at The Chinese University of Hong Kong. My long-term research goal is to develop embodied agents that are fundamentally grounded in the dynamic physical world, progressing from 4D scene modeling and multimodal dynamic reasoning toward agents capable of robust planning and acting in open-ended, long-horizon real-world environments.

My current research primarily covers the following topics:

  • Embodied AI & Robotic Agents: Long-horizon manipulation, spatio-temporal reasoning with persistent memory, VLM-based robotic planning.
  • 4D World Modeling & Perception: Dynamic scene reconstruction, spatio-temporal understanding, streaming vision for embodied 4D perception.
  • Multimodal LLM Evaluation & Adaptation: Benchmarking MLLMs on dynamic reasoning, task-specific fine-tuning for spatial and temporal understanding.
  • Vision Foundation Models: Video anomaly detection, ambiguity-aware segmentation, SAM-based perception.

WeChat: Swaggyzz-13    Email: yzhuang13@stu.xmu.edu.cn


  Recent News
  • [2026.02]  One paper (Dyn-Bench) accepted at CVPR 2026 🎉!
  • [2025.09]  One paper (DynamicVerse) accepted at NeurIPS 2025.
  • [2025.02]  One paper (Track Any Anomalous Object) accepted at CVPR 2025.
  • [2024.09]  One paper (Flaws can be Applause) accepted at NeurIPS 2024.
  • [2024.07]  One paper (P²SAM) accepted at ACM MM 2024.

  Publications    ( * denotes equal contribution, † denotes project lead )

RoboStream: Weaving Spatio-Temporal Reasoning with Memory in Vision-Language Models for Robotics
Yuzhi Huang*†, Jie Wu*, Weijue Bu, Ziyi Xiong, Gaoyang Jiang, Ye Li, Kangye Ji, Shuzhao Xie, Yue Huang, Chenglei Wu, Jingyan Jiang, Zhi Wang
Preprint

Project | Paper | Abstract

    Current VLM-based robotic planners treat each manipulation step independently, recomputing scene geometry from pixels at every decision point without memory of prior actions or occluded objects. This causes perceptual errors to accumulate and breaks preconditions for subsequent steps, fundamentally limiting long-horizon task completion.
    To address this, we propose RoboStream, a training-free framework that weaves spatio-temporal reasoning with persistent memory into VLMs. RoboStream introduces two core components: Spatio-Temporal Fusion Tokens (STF-Tokens), which bind visual evidence with 3D geometric attributes for persistent object grounding across long horizons, and a Causal Spatio-Temporal Graph (CSTG), which records action-triggered state transitions to capture how actions reshape the environment over time. Together, they enable geometric anchoring and causal reasoning akin to human mental models. RoboStream achieves a 90.5% success rate on long-horizon RLBench tasks and 44.4% on challenging real-world block-building tasks, substantially outperforming comparable methods.


Zip-VGGT: Object-Centric Spatiotemporal KV Compression for Streaming Vision Transformers
Kairun Wen†, Runyu Chen, Wenyan Cong, Peiwei Lin, Tao Lu, Lihan Jiang, Weiguang Zhao, Junting Dong, Yunlong Lin, Yuzhi Huang, Xinghao Ding, Hongsheng Li, Linning Xu, Mulin Yu
Preprint

Project | Abstract

    Streaming vision architectures have emerged as a powerful paradigm for embodied 4D spatial perception, enabling on-the-fly scene reconstruction. However, deploying these models in real-world infinite-horizon scenarios exposes a critical memory bottleneck: the unbounded accumulation of historical representations in the Key-Value (KV) cache. Existing token compression strategies attempt to mitigate this but suffer from decoupled, heuristic processing: they blindly aggregate tokens across physical boundaries, destroying high-frequency 3D geometry, and employ motion-blind eviction policies that erase the temporal trajectories of highly dynamic objects.
    To resolve this dilemma, we propose Zip-VGGT, the first object-centric spatiotemporal KV compression framework tailored for streaming vision models. Our core insight is to guide compression using high-level semantic and kinematic priors without introducing heavy computational overhead. Specifically, Zip-VGGT synergizes crisp 2D entity boundaries from a lightweight Segment Anything Model (SAM) with implicit 3D motion representations naturally encoded within the backbone's global attention layers. This enables object-bounded geometric spatial merging that faithfully preserves 3D topologies, alongside motion-aware temporal eviction that dynamically allocates cache budgets to retain interactive dynamic elements. Extensive experiments demonstrate that Zip-VGGT achieves high compression ratios and substantial inference speedups, preventing memory explosion while maintaining high-fidelity 4D reconstruction for long-term embodied perception.


Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World
Yuzhi Huang*†, Kairun Wen*, Rongxin Gao*, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yunlong Lin, Chenxin Li, Panwang Pan, Junbin Lu, Jingyan Jiang, Xinghao Ding, Yue Huang, Zhi Wang
CVPR 2026

Project | Paper | Abstract | HF Data | Code

    Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (spatial with temporal dimension). While current Multimodal Large Language Models (MLLMs) excel in static visual understanding, can they also be adept at "thinking in dynamics", i.e., perceive, track and reason about spatio-temporal dynamics in evolving scenes?
    To systematically assess their spatio-temporal reasoning and localized dynamics perception capabilities, we introduce Dyn-Bench, a large-scale benchmark built from diverse real-world and synthetic video datasets, enabling robust and scalable evaluation of spatio-temporal understanding. Through multi-stage filtering from massive 2D and 4D data sources, Dyn-Bench provides a high-quality collection of dynamic scenes, comprising 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding pairs. We probe general, spatial and region-level MLLMs to express how they think in dynamics both linguistically and visually, and find that existing models cannot simultaneously maintain strong performance in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction. Notably, conventional prompting strategies (e.g., chain-of-thought or caption-based hints) provide limited improvement, whereas structured integration approaches, including Mask-Guided Fusion and Spatio-Temporal Textual Cognitive Map (ST-TCM), significantly enhance MLLMs' dynamics perception and spatio-temporal reasoning in the physical 4D world.

DynamicVerse: Physically-Aware Multimodal Modeling for Dynamic 4D Worlds
Kairun Wen*, Yuzhi Huang*, Runyu Chen, Hui Zheng, Yunlong Lin, Panwang Pan, Chenxin Li, Wenyan Cong, Jian Zhang, Junbin Lu, Chenguo Lin, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Yue Huang, Xinghao Ding, Rakesh Ranjan, Zhiwen Fan
NeurIPS 2025

Project | Paper | Abstract

    Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or rely on traditional Structure-from-Motion for up-to-scale annotation, and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D modeling framework for real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks---video depth estimation, camera pose estimation, and camera intrinsics estimation---validate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.


Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline
Yuzhi Huang*, Chenxin Li*, Haitao Zhang, Zixu Lin, Yunlong Lin, Hengyu Liu, Wuyang Li, Xinyu Liu, Jiechao Gao, Yue Huang, Xinghao Ding, Yixuan Yuan
CVPR 2025

Project | Paper | Abstract | Code

    Video anomaly detection (VAD) is crucial in scenarios such as surveillance and autonomous driving, where timely detection of unexpected activities is essential. Although existing methods have primarily focused on detecting anomalous objects in videos—either by identifying anomalous frames or objects—they often neglect finer-grained analysis, such as anomalous pixels, which limits their ability to capture a broader range of anomalies. To address this challenge, we propose an innovative VAD framework called Track Any Anomalous Object (TAO), a granular video anomaly detection pipeline that, for the first time, integrates the detection of multiple fine-grained anomalous objects into a unified framework. Unlike methods that assign anomaly scores to every pixel at each moment, our approach transforms the problem into pixel-level tracking of anomalous objects. By linking anomaly scores to subsequent tasks such as image segmentation and video tracking, our method eliminates the need for threshold selection and achieves more precise anomaly localization, even in long and challenging video sequences. Experiments on extensive datasets demonstrate that TAO achieves state-of-the-art performance, marking new progress for VAD by providing a practical, granular, and holistic solution.


Flaws can be Applause: Unleashing Potential of Segmenting Ambiguous Objects in SAM
Chenxin Li*, Yuzhi Huang*, Wuyang Li, Hengyu Liu, Xinyu Liu, Qing Xu, Zhen Chen, Yue Huang, Yixuan Yuan
NeurIPS 2024

Project | Paper | Abstract | Code

    While vision foundation models like the Segment Anything Model (SAM) demonstrate potent universality, they also present challenges in giving ambiguous and uncertain predictions. Significant variations in model output and granularity can arise from even subtle changes in the prompt, contradicting the consensus requirement for model robustness. While some established works have been dedicated to stabilizing and fortifying the predictions of SAM, this paper takes a unique path to explore how this flaw can be inverted into an advantage when modeling inherently ambiguous data distributions. We introduce an optimization framework based on a conditional variational autoencoder, which jointly models the prompt and the granularity of the object with a latent probability distribution. This approach enables the model to adaptively perceive and represent the real ambiguous label distribution, taming SAM to produce a series of diverse, convincing, and reasonable segmentation outputs controllably. Extensive experiments on several practical deployment scenarios involving ambiguity demonstrate the exceptional performance of our framework.


P²SAM: Probabilistically Prompted SAMs Are Efficient Segmentator for Ambiguous Medical Images
Yuzhi Huang*, Chenxin Li*, Zixu Lin, Hengyu Liu, Haote Xu, Yifan Liu, Yue Huang, Xinghao Ding, Yixuan Yuan
ACM MM 2024

Project | Paper | Abstract | Code

    The ability to generate an array of plausible outputs for a single input has profound implications for dealing with inherent ambiguity in visual scenarios. This is evident in scenarios where diverse semantic segmentation annotations for a single medical image are provided by various experts. Existing methods hinge on probabilistic modelling of representations to depict this ambiguity and rely on extensive multi-output annotated data to learn this probabilistic space. However, these methods often falter when only a limited amount of ambiguously labelled data is available, which is a common occurrence in real-world applications. To surmount these challenges, we propose a novel framework, termed P²SAM, that leverages the prior knowledge of the Segment Anything Model (SAM) during the segmentation of ambiguous objects. Specifically, we delve into an inherent drawback of SAM in deterministic segmentation, i.e., the sensitivity of its output to prompts, and transform this into an advantage for ambiguous segmentation tasks by introducing a prior probabilistic space for prompts. Experimental results demonstrate that our strategy significantly enhances the precision and diversity of medical segmentation using only a small number of samples ambiguously annotated by doctors. Rigorous benchmarking against state-of-the-art methods indicates that our method achieves superior segmentation precision and diversified outputs with less training data (using only 5.5% of samples, +12% Dmax). P²SAM signifies a substantial step towards the practical deployment of probabilistic models in real-world scenarios with limited data.

Honors & Awards
Huang Xilie University-Level Scholarship, 2025
University-Level Outstanding Student Award, 2024
Meritorious Winner in the Mathematical Contest in Modeling (MCM), USA, 2024

Reviewer Services
International Conference on Machine Learning (ICML), 2025
International Conference on Learning Representations (ICLR), 2025
Conference on Neural Information Processing Systems (NeurIPS), 2025, 2024





Website template from here and here