Haoji Zhang (张颢继)

I am a first-year master's student at Shenzhen International Graduate School, Tsinghua University. I am fortunate to be supervised by Prof. Yansong Tang in the IVG@SZ group. Before that, I received my B.S. in Mathematics and Physics from Tsinghua University (THU) in 2024.

My research interests lie in the fields of Computer Vision and Efficient Deep Learning. My current research focuses on Long Video Understanding and Large Multimodal Models.

Google Scholar  /  Email  /  GitHub  /  LinkedIn

News

  • 2025.05: Ponder & Press is accepted to Findings of ACL 2025.
  • 2024.12: Uni-AdaFocus is accepted by TPAMI, IF=20.8, 2025.
  • 2024.06: Flash-VStream wins 1st place in Track 1 of the LOVEU challenge at CVPR 2024.
  • 2023.09: Started an internship at ByteDance.
  • 2023.03: PREIM3D is accepted by CVPR 2023.
Publications and Preprints

    * indicates equal contribution, † indicates corresponding author

    Ponder & Press: Advancing Visual GUI Agent towards General Computer Control
    Yiqin Wang*, Haoji Zhang*, Jingqi Tian, Yansong Tang
    Findings of the Association for Computational Linguistics (ACL Findings), 2025
    [arXiv] [Code] [Project Page]
    We propose Ponder & Press, a divide-and-conquer GUI agent framework that only relies on visual input to mimic human-like interaction with GUIs.
    Uni-AdaFocus: Spatial-Temporal Dynamic Computation for Video Recognition
    Yulin Wang*, Haoji Zhang*, Yang Yue, Shiji Song, Chao Deng, Junlan Feng, Gao Huang
    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI, IF=20.8), 2025
    [arXiv] [IEEE Paper] [Code]
    We explore the phenomenon of spatial/temporal/sample-wise redundancy and propose Uni-AdaFocus, an efficient end-to-end video recognition framework.
    Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
    Haoji Zhang*, Yiqin Wang*, Yansong Tang†, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin
    Preprint, 1st place solution of LOVEU@CVPR'24 challenge track 1, 2024 [Award]
    [arXiv] [Code] [Project Page]
    We propose Flash-VStream, a video-language model that simulates the human memory mechanism and can process long video streams in real time.
    PREIM3D: 3D Consistent Precise Image Attribute Editing from a Single Image
    Jianhui Li, Jianmin Li†, Haoji Zhang, Shilong Liu, Zhengyi Wang, Zihao Xiao, Kaiwen Zheng, Jun Zhu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
    [arXiv] [Code] [Project Page]
    We propose PREIM3D, a novel framework for 3D-aware image attribute editing that achieves better 3D consistency and precision at large camera poses.
    UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
    Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, Yansong Tang
    Preprint, 2025
    [arXiv] [Code] [Project Page]
    We propose UniVG-R1, a reasoning-guided MLLM for universal visual grounding.
    Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation
    Sule Bai*, Yong Liu*, Yifei Han, Haoji Zhang, Yansong Tang
    Preprint, 2024
    [arXiv] [Code]
    We propose SC-CLIP, a training-free open-vocabulary segmentation framework that achieves competitive performance on various segmentation tasks.
Selected Honors and Awards

  • Outstanding Bachelor Graduate of Beijing, 2024. (Top 5% in Tsinghua University)
  • Comprehensive Outstanding Scholarship of Tsinghua University, 2023. (University-level, first class)
  • Comprehensive Outstanding Scholarship of Tsinghua University, 2022. (University-level, first class)
  • Comprehensive Outstanding Scholarship of Tsinghua University, 2021. (University-level, first class)
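  • THUWC2019 Gold Medal, 2019. (Tsinghua University Winter Camp in Informatics for Outstanding High School Students)
  • NOIWC2019 Silver Medal, 2019. (Winter Camp of the 36th National Olympiad in Informatics)
  • NOI2019 Bronze Medal, 2019. (36th National Olympiad in Informatics)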

    © Haoji Zhang | Last updated: May 20, 2025