Blog of C.P.F.
🎓 Education
University of Science and Technology of China (Sep. 2023 - Present)
Master's Student, Department of Electronic Engineering and Information Science
- Research Focus:
  - Computational auditory perception, including sound event detection (SED), multimodal audio understanding, and LLM-based audio understanding
  - Audio AIGC (speech/music/audio)
- Advisor: Assoc. Prof. Yan Song (National Engineering Research Center for Speech and Language Information Processing)
- Expected Graduation: June 2026
Dalian University of Technology (Sep. 2019 - June 2023)
Bachelor of Engineering, Electronic Information Engineering
- GPA: 93.20 / 100
- Rank: 2 / 183 (Top 1%)
📧 Contact
- Email: cqi525@mail.ustc.edu.cn or good_luck_cpf@163.com
- GitHub: cai525
📖 Works
Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries
Pengfei Cai, Yan Song, Qing Gu, Nan Jiang, Haoyu Song, Ian McLoughlin
In ACM MM, 2025 [ paper | demo | code ]

SegTune: Structured and Fine-Grained Control for Song Generation
Pengfei Cai, Joanna Wang, Haorui Zheng, Xu Li, Zihao Ji, Teng Ma, Zhongliang Liu, Chen Zhang, Pengfei Wan

Prototype based Masked Audio Model for Self-Supervised Learning of Sound Event Detection
Pengfei Cai, Yan Song, Nan Jiang, Qing Gu, Ian McLoughlin
In ICASSP, 2025 [ paper | code ]
MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection
Pengfei Cai, Yan Song, Kang Li, Haoyu Song, Ian McLoughlin
In Interspeech, 2024 [ paper | code ]
Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation
Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, Zihan Li, Yuzhe Liang, Xiaopeng Wang, Haorui Zheng, Ming Wen, Kang Yin, Yiran Wang, Nan Li, Feng Deng, Liang Dong, Chen Zhang, Di Zhang, Kun Gai
Parameter-Efficient Tuning of Large Audio-Language Models for DCASE 2025 Challenge Task 5
Pengfei Cai, Yanfeng Shi, Qing Gu, Nan Jiang, Yan Song
DCASE 2025 Challenge, Audio Question Answering task, second place [ DCASE | technical report ]