Shaobo Wang (王少博)

Ph.D. Candidate, SAI, SJTU


Mail: gszfwsb@gmail.com

Tel: (+86) 15000937315

City: Shanghai, 200240

I am a second-year Ph.D. candidate in the School of Artificial Intelligence, Shanghai Jiao Tong University (SJTU), fortunate to be advised by Prof. Linfeng Zhang. I am also a research intern at the Alibaba Qwen Team, where I am supervised by Dr. Dayiheng Liu and Xingzhang Ren. There, I collaborate closely with Dr. Fei Huang, Huiqiang Jiang, Kexin Yang, Yubo Ma, and Beichen Zhang.

Previously, I was a master’s student in ReThinkLab at SJTU, where I was grateful to be mentored by Prof. Junchi Yan. I have also collaborated closely with Prof. Xuming Hu at the Hong Kong University of Science and Technology (Guangzhou) and Dr. Conghui He at Shanghai AI Laboratory, and I previously worked with Prof. Zhuoran Yang at Yale University.

Research. My research bridges empirical and theoretical perspectives on data-centric AI. I am primarily focused on data selection, synthesis, and sampling strategies for Large Language Model pre-training. Previously, my work centered on Explainable AI.

Awards & Honors

  • 🏆 [May 2026] Our paper, “OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration,” was selected as an ICML 2026 Spotlight and invited for a talk at Pinterest.
  • 🏆 [July 2025] Tencent Hunyuan Scholar (Tencent PhD Research Incentive Program) (one of 23 recipients in China).
  • 🏆 [March 2025] Our paper, “Dataset Distillation with Neural Characteristic Function: A Minmax Perspective,” was selected as a CVPR 2025 Highlight, received full scores (5/5/5) from all three reviewers, and was invited for a talk at BAAI.

Short bio. I was born in Hefei, China, in 1999, two years after IBM’s Deep Blue defeated Garry Kasparov. In some sense, I grew up in the long shadow of AI. I started playing chess at an early age and later won several chess championships in Anhui Province, China, under the guidance of Chess Grandmaster Chongsheng Zeng and Chess Master Yongjin Zhou. When AlphaGo defeated Lee Sedol during my high school years, I was deeply shaken: as a chess player, I felt both the beauty of human intelligence and how small we are in front of a new kind of intelligence.

Music gave me a different but related feeling. I have been devoted to the piano for 15 years and once had the privilege of performing alongside the world-renowned pianist Lang Lang. My musical inspirations come from the Romantic era—especially Frédéric Chopin and Franz Liszt—as well as R&B, Jazz, and Neo-Soul. During my undergraduate years, seeing early demos of AI-generated music made me reflect on piano practice as a nearly religious form of self-cultivation: beautiful, demanding, and deeply personal, yet not necessarily a direct path to broader social progress. This reflection, together with my experiences in chess, gradually convinced me that AI was something I needed to work on.

The journey has not always been smooth, but I have continued along this path with the hope that my work can make a small contribution to human knowledge and progress.

Selected Publications

* denotes equal contribution.
  1. OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration
    Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, and 4 more authors
    In International Conference on Machine Learning, 2026
  2. CircuitSeer: Mining High-Quality Data by Probing Mathematical Reasoning Circuits in LLMs
    Shaobo Wang*, Yongliang Miao*, Yuancheng Liu, Qianli Ma, Ning Liao, and Linfeng Zhang
    In The 64th Annual Meeting of the Association for Computational Linguistics, 2026
  3. MelTrim: Coarse-to-Fine Data Pruning for Speech Classification
    Shaobo Wang, Tianle Niu, Xuan Ouyang, Xintong Li, Zhengkun Ge, Yue Min, Xiaoqian Liu, Hankun Wang, and 9 more authors
    In Findings of the Association for Computational Linguistics: ACL 2026, 2026
  4. Socratic-Geo: Synthetic Data Generation and Cross-Modal Geometric Reasoning via Multi-Agent Interaction
    Zhengbo Jiao*, Shaobo Wang*, Zifan Zhang*, Wei Wang, Bing Zhao, Hu Wei, and Linfeng Zhang
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026
  5. Grounding and Enhancing Informativeness and Utility in Dataset Distillation
    Shaobo Wang, Yantai Yang, Guo Chen, Peiru Li, Kaixin Li, Yufa Zhou, Zhaorun Chen, and Linfeng Zhang
    In The Fourteenth International Conference on Learning Representations, 2026
  6. Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data?
    Shaobo Wang*, Cong Wang*, Wenjie Fu*, Yue Min, Mingquan Feng, Isabel Guan, Xuming Hu, Conghui He, and 6 more authors
    In The Fourteenth International Conference on Learning Representations, 2026
  7. UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective
    Furui Xu*, Shaobo Wang*, Jiajun Zhang, Chenghao Sun, Haixiang Tang, and Linfeng Zhang
    In Annual AAAI Conference on Artificial Intelligence, 2026
  8. ImagebindDC: Compressing Multi-modal Data with Imagebind-based Condensation
    Yue Min*, Shaobo Wang*, Jiaze Li, Tianle Niu, Junxin Fan, Yongliang Miao, Lijin Yang, and Linfeng Zhang
    In Annual AAAI Conference on Artificial Intelligence, 2026
  9. Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning
    Shaobo Wang, Xiangqi Jin, Ziming Wang, Jize Wang, Jiajun Zhang, Kaixin Li, Zichen Wen, Zhong Li, and 3 more authors
    In Annual Meeting of the Association for Computational Linguistics, 2025
  10. Dataset Distillation with Neural Characteristic Function: A Minmax Perspective
    Shaobo Wang, Yicun Yang, Zhiyuan Liu, Chenghao Sun, Xuming Hu, Conghui He, and Linfeng Zhang
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
  11. Gnothi Seauton: Empowering Faithful Self-Interpretability in Black-Box Transformers
    Shaobo Wang, Hongxuan Tang, Mingyang Wang, Hongrui Zhang, Xuyang Liu, Weiya Li, Xuming Hu, and Linfeng Zhang
    In International Conference on Learning Representations, 2025