Hello, I'm Mingru Huang

I am a Master's student at Wuhan University of Technology. I'm passionate about computer vision research, particularly video understanding. My work spans video question answering, video-text retrieval, and video captioning. I also explore large language models, prompt engineering, operator development, knowledge graphs, and question-answering systems. My goal is to develop an affordable, secure, and trustworthy general-purpose multimodal video model for everyone.

CV Scholar GitHub

Latest News

2024

Nov. 2024: Invited as a reviewer for the ICME2025 conference.
Aug. 2024: Joined a project on automotive maintenance inspection using a large multimodal model.
Jul. 2024: Joined the SpConv operator optimization project on the MetaX MXMACA computing platform.
Jun. 2024: Granted a Chinese software copyright for the "Dermatology Clinical Feature Detection and Diagnosis System".
May 2024: The paper "ST-CLIP" was accepted at the ICIC 2024 conference.
Jan. 2024: Invited as a reviewer for the ICME2024 conference.

2023

Dec. 2023: Joined the school-enterprise cooperation program of Haluo Corporation, responsible for AI speech generation.
Nov. 2023: Completed the Transformer Heterogeneous Bisheng C++ Arithmetic Development Project of the Huawei Crowd Intelligence Program.
Sept. 2023: Joined a video understanding project focused on dense video captioning.

Publications

Scene Knowledge Enhanced Multimodal Retrieval Model for Dense Video Captioning

Mingru Huang, Pengfei Duan, Yifang Zhang, Huimin Chen, Jiawang Peng, Shengwu Xiong

2025 Twenty-first International Conference on Intelligent Computing (ICIC 2025)

We introduce a Memory Enhanced Visual-Speech Aggregation model for dense video captioning, inspired by work in cognitive informatics on human memory recall. The model enriches visual representations by merging them with relevant text features, which it retrieves from a memory bank through multimodal retrieval over transcribed speech and visual inputs.

Project Page PDF arXiv

LDiT: Pseudo-Label Noise Adaptation via Label Diffusion Transformer

Jiawang Peng, Pengfei Duan, Mingru Huang, Shengwu Xiong

2025 Twenty-first International Conference on Intelligent Computing (ICIC 2025)

We reformulate label prediction as a progressive refinement process starting from an initial random guess, and propose LDiT (Label Diffusion Transformer) for pseudo-label noise adaptation. By modeling label uncertainty through a diffusion process, LDiT enables more robust learning under noisy supervision. In addition, to effectively capture the long-range dependencies in textual data, we adopt a Transformer-based latent denoising architecture with self-attention mechanisms.

PDF arXiv

ST-CLIP: Spatio-Temporal enhanced CLIP towards Dense Video Captioning

Huimin Chen, Pengfei Duan, Mingru Huang, Jingyi Guo, Shengwu Xiong

2024 Twentieth International Conference on Intelligent Computing (ICIC 2024)

We propose a new factorized spatio-temporal self-attention paradigm that addresses inaccurate event descriptions caused by insufficient modeling of temporal relationships between video frames, and we apply it to dense video captioning.

Project Page PDF arXiv

Get In Touch

I'm always open to discussing research collaborations, new projects, or opportunities. Feel free to reach out!

Template inspired by Keunhong Park