March 6, 2025
By Zhang Zhang

We warmly invite you to join us for the upcoming computer science proposal defense.

Proposal Title: Enhancing Endoscopic Lesion Detection with Lightweight Transformers, Multi-Scale Attention, and Vision-Language Models

Candidate Name: Zhang Zhang
Time: Thursday, March 20, 2025, 10:30-11:30 a.m. EDT
Location: Virtual defense via Zoom
Meeting ID: 472 310 4869

Committee Members:

Dr. Yu Cao (Advisor), Professor, Miner School of Computer & Information Sciences; Director, UMass Center for Digital Health (CDH)
Dr. Benyuan Liu (Advisor), Professor, Miner School of Computer & Information Sciences; Director, Computer Networking Lab; UMass Center for Digital Health (CDH); CHORDS
Dr. Hengyong Yu (Member), Professor, FIEEE, FAAPM, FAIMBE, FAAIA, FAIIA, Department of Electrical and Computer Engineering
Dr. Ming Shao (Member), Associate Professor, Miner School of Computer & Information Sciences

Abstract:

Gastrointestinal (GI) endoscopy is pivotal for diagnosing digestive tract disorders, yet its efficacy remains constrained by operator-dependent variability and the difficulty of detecting subtle lesions. While convolutional neural networks (CNNs) have advanced automated lesion detection in endoscopic imaging, they remain limited in modeling long-range dependencies and complex anatomical contexts.

To address this, we propose a dual-axis enhancement framework. First, we integrate a lightweight transformer head into the YOLOX detection architecture, leveraging self-attention to improve global feature representation while maintaining computational efficiency. Second, we introduce a novel multi-level and multi-scale attention (MLMSA) module, independent of the YOLOX framework, that refines lesion localization through hierarchical feature fusion and adaptive scale aggregation.
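For concreteness, the sketch below shows a minimal PyTorch version of the first component: a lightweight transformer block applied to one FPN feature level before the YOLOX detection head. All module, parameter, and variable names here are illustrative assumptions, not the proposal's actual implementation; a real design would tune channel width, head count, and which pyramid levels receive attention.

```python
# Minimal sketch (illustrative only): a lightweight transformer block
# refining one YOLOX FPN feature level before the detection head.
import torch
import torch.nn as nn

class LightweightTransformerHead(nn.Module):
    def __init__(self, channels: int = 256, num_heads: int = 4, mlp_ratio: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels * mlp_ratio),
            nn.GELU(),
            nn.Linear(channels * mlp_ratio, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from the FPN neck
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C) spatial tokens
        t = self.norm1(tokens)
        # Global self-attention: every position attends to every other
        tokens = tokens + self.attn(t, t, t, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))  # position-wise MLP
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# Example: refine a stride-32 feature map before classification/regression
feat = torch.randn(1, 256, 20, 20)
print(LightweightTransformerHead()(feat).shape)  # torch.Size([1, 256, 20, 20])
```

Flattening the (H, W) grid into tokens is what gives every spatial position a global receptive field, the capability plain CNN heads lack; the MLMSA module would additionally fuse such refined maps across pyramid levels.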

Looking forward, we outline a strategic direction for extending these advancements through large vision-language models (LVLMs) in gastroscopy. Future research will focus on constructing a multimodal dataset of high-resolution endoscopic images paired with structured clinical narratives, enabling cross-modal alignment of visual abnormalities and diagnostic descriptors. Advanced processing techniques, including dynamic pixel optimization, will preserve critical fine-grained details, while chain-of-thought reasoning architectures will be adapted to simulate clinical diagnostic workflows. To enhance generalizability, reinforcement learning paradigms will be explored to improve model robustness against domain shifts and out-of-distribution data. These efforts aim to transition LVLMs from passive detection tools into interactive diagnostic assistants capable of contextualizing lesions within broader clinical narratives.
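As one illustration of the cross-modal alignment step, the sketch below implements a generic CLIP-style contrastive objective between endoscopic image embeddings and clinical-report embeddings. This is a standard technique offered as an assumption of how alignment could work, not the proposal's specific method; the function name, embedding size, and temperature are all illustrative.

```python
# Minimal sketch (illustrative only): CLIP-style contrastive alignment
# between image embeddings and clinical-text embeddings.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb: torch.Tensor,
                               txt_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    # Normalize so the dot product is cosine similarity
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))       # matched pairs on the diagonal
    # Symmetric cross-entropy: image-to-text and text-to-image directions
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: random 512-d embeddings for a batch of 8 image-report pairs
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```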

By synergizing immediate improvements in lesion detection architectures with future multimodal reasoning systems, this work bridges technical innovation and clinical applicability. The proposed YOLOX-transformer hybrid addresses current limitations in spatial modeling, while the standalone MLMSA module advances localization precision through multi-scale analysis. Together with the envisioned LVLM integration, this research establishes a pathway toward AI systems that not only detect abnormalities but also interpret them within the rich contextual framework of clinical practice, ultimately reducing diagnostic variability and improving patient outcomes.

Thank you!

Best Regards,
Zhang Zhang