Haoji Zhang (张颢继)
I am a first-year master's student at Shenzhen International Graduate School, Tsinghua University.
I am fortunate to be supervised by Prof. Yansong Tang in the IVG@SZ group.
Before that, I received a B.S. in Mathematics and Physics from Tsinghua University (THU) in 2024.
My research interests lie in the fields of Computer Vision and Efficient Deep Learning.
My current research focuses on Long Video Understanding and Large Multimodal Models.
Google Scholar / Email / GitHub / LinkedIn
News
2025.05: Ponder & Press is accepted to Findings of ACL 2025.
2024.12: Uni-AdaFocus is accepted by TPAMI (IF=20.8), 2025.
2024.06: Flash-VStream wins 1st place in the LOVEU challenge track 1 at CVPR 2024.
2023.09: Started an internship at ByteDance.
2023.03: PREIM3D is accepted by CVPR 2023.
Publications and Preprints
* indicates equal contribution, † indicates corresponding author
Ponder & Press: Advancing Visual GUI Agent towards General Computer Control
Yiqin Wang*,
Haoji Zhang*,
Jingqi Tian,
Yansong Tang†
Findings of the Association for Computational Linguistics (ACL Findings), 2025
[arXiv]
[Code]
[Project Page]
We propose Ponder & Press, a divide-and-conquer GUI agent framework that only relies on visual input to mimic human-like interaction with GUIs.
Uni-AdaFocus: Spatial-Temporal Dynamic Computation for Video Recognition
Yulin Wang*,
Haoji Zhang*,
Yang Yue,
Shiji Song,
Chao Deng,
Junlan Feng,
Gao Huang†
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI, IF=20.8), 2025
[arXiv]
[IEEE Paper]
[Code]
We explore the phenomenon of spatial/temporal/sample-wise redundancy and propose Uni-AdaFocus, an efficient end-to-end video recognition framework.
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
Haoji Zhang*,
Yiqin Wang*,
Yansong Tang†,
Yong Liu,
Jiashi Feng,
Jifeng Dai,
Xiaojie Jin†
Preprint, 2024 (1st place solution of LOVEU@CVPR'24 challenge track 1)
[Award]
[arXiv]
[Code]
[Project Page]
We propose Flash-VStream, a video-language model that simulates the human memory mechanism and can process long video streams in real time.
PREIM3D: 3D Consistent Precise Image Attribute Editing from a Single Image
Jianhui Li,
Jianmin Li†,
Haoji Zhang,
Shilong Liu,
Zhengyi Wang,
Zihao Xiao,
Kaiwen Zheng,
Jun Zhu†
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
[arXiv]
[Code]
[Project Page]
We propose PREIM3D, a novel framework for 3D-aware image attribute editing that achieves better 3D consistency and precision at large camera poses.
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
Sule Bai,
Mingxing Li,
Yong Liu,
Jing Tang,
Haoji Zhang,
Lei Sun,
Xiangxiang Chu,
Yansong Tang†
Preprint, 2025
[arXiv]
[Code]
[Project Page]
We propose UniVG-R1, a reasoning-guided MLLM for universal visual grounding.
Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation
Sule Bai*,
Yong Liu*,
Yifei Han,
Haoji Zhang,
Yansong Tang†
Preprint, 2024
[arXiv]
[Code]
We propose SC-CLIP, a training-free open-vocabulary segmentation framework that achieves competitive performance on various segmentation tasks.
Selected Honors and Awards
Outstanding Bachelor Graduate of Beijing, 2024.
(Beijing Outstanding Graduate, top 5% at Tsinghua University)
Comprehensive Outstanding Scholarship of Tsinghua University, 2023.
(Tsinghua University Comprehensive Excellence Scholarship, university-level first class)
Comprehensive Outstanding Scholarship of Tsinghua University, 2022.
(Tsinghua University Comprehensive Excellence Scholarship, university-level first class)
Comprehensive Outstanding Scholarship of Tsinghua University, 2021.
(Tsinghua University Comprehensive Excellence Scholarship, university-level first class)
THUWC2019 Gold Medal, 2019.
(Gold medal, Tsinghua University National Informatics Winter Camp for Outstanding High School Students)
NOIWC2019 Silver Medal, 2019.
(Silver medal, 36th National Olympiad in Informatics Winter Camp)
NOI2019 Bronze Medal, 2019.
(Bronze medal, 36th National Olympiad in Informatics)