Publications
Selected papers and preprints.
2026
- ARM: Advantage Reward Modeling for Long-Horizon Manipulation
Yiming Mao, Zixi Yu, Weixin Mao, Yinhao Li, Qirui Hu, Zihan Lan, Minzhao Zhu, and Hua Chen
In CVPR workshop 2026, 2026
Long-horizon robotic manipulation remains challenging for reinforcement learning (RL) because sparse rewards provide limited guidance for credit assignment. Practical policy improvement thus relies on richer intermediate supervision, such as dense progress rewards, which are costly to obtain and ill-suited to non-monotonic behaviors such as backtracking and recovery. To address this, we propose Advantage Reward Modeling (ARM), a framework that shifts from hard-to-quantify absolute progress to estimating relative advantage. We introduce a cost-effective tri-state labeling strategy -- Progressive, Regressive, and Stagnant -- that reduces human cognitive overhead while ensuring high cross-annotator consistency. By training on these intuitive signals, ARM enables automated progress annotation for both complete demonstrations and fragmented DAgger-style data. Integrating ARM into an offline RL pipeline allows for adaptive action-reward reweighting, effectively filtering suboptimal samples. Our approach achieves a 99.4% success rate on a challenging long-horizon towel-folding task, demonstrating improved stability and data efficiency over current VLA baselines with near-zero human intervention during policy training.
@inproceedings{mao2026arm,
  title = {{ARM}: Advantage Reward Modeling for Long-Horizon Manipulation},
  author = {Mao, Yiming and Yu, Zixi and Mao, Weixin and Li, Yinhao and Hu, Qirui and Lan, Zihan and Zhu, Minzhao and Chen, Hua},
  booktitle = {CVPR workshop 2026},
  year = {2026},
}
- TEAR of the SUNSET: A Benchmark for Road Detection in Semi-Structured Environments
In 2026 IEEE International Conference on Multimedia and Expo (ICME), Bangkok, Thailand, Jul 2026
Recent advances in autonomous driving technology have enabled its mature deployment in structured urban scenarios that rely on standardized artificial markers for perception. However, this reliance limits its applicability in unstructured environments. Semi-structured environments are a subset of unstructured environments characterized by the absence of artificial road markings but the presence of road traces. To address road detection in such environments, this work proposes a dedicated benchmark. We design a novel road representation that models the road edge lines using higher-order Bézier curves. Meanwhile, we construct the annotated SUNSET dataset, tailored for road detection tasks in such environments. Furthermore, we present TEAR, an end-to-end road detection method which employs an Interconvertible Dual-Instance decoder to decouple road and line instances. We also design a Hierarchical Bipartite Matching strategy for instance association. The experimental results demonstrate that TEAR achieves excellent performance on the proposed benchmark.
@inproceedings{xu2026tear,
  author = {Xu, Haonan and Hu, Qirui and Liu, Xinyuan and Li, Hu and Xu, Hang and Ma, Yike and Zhang, Yucheng and Dai, Feng},
  title = {TEAR of the SUNSET: A Benchmark for Road Detection in Semi-Structured Environments},
  booktitle = {2026 IEEE International Conference on Multimedia and Expo (ICME)},
  month = jul,
  year = {2026},
}
2025
- Preprint
OmniD: Generalizable Robot Manipulation Policy via Image-Based BEV Representation
Jilei Mao, Jiarui Guan, Yingjuan Tang, Qirui Hu, Zhihang Li, Junjie Yu, Yongjie Mao, Yunzhe Sun, Shuang Liu, and Xiaozhu Ju
2025
The visuomotor policy can easily overfit to its training datasets, such as fixed camera positions and backgrounds. This overfitting makes the policy perform well in the in-distribution scenarios but underperform in the out-of-distribution generalization. Additionally, the existing methods also have difficulty fusing multi-view information to generate an effective 3D representation. To tackle these issues, we propose Omni-Vision Diffusion Policy (OmniD), a multi-view fusion framework that synthesizes image observations into a unified bird's-eye view (BEV) representation. We introduce a deformable attention-based Omni-Feature Generator (OFG) to selectively abstract task-relevant features while suppressing view-specific noise and background distractions. OmniD achieves 11%, 17%, and 84% average improvement over the best baseline model for in-distribution, out-of-distribution, and few-shot experiments, respectively.
@misc{mao2025omnid,
  title = {{OmniD}: Generalizable Robot Manipulation Policy via Image-Based {BEV} Representation},
  author = {Mao, Jilei and Guan, Jiarui and Tang, Yingjuan and Hu, Qirui and Li, Zhihang and Yu, Junjie and Mao, Yongjie and Sun, Yunzhe and Liu, Shuang and Ju, Xiaozhu},
  year = {2025},
}