Abstract
As the demand for applications such as industrial quality inspection and medical image analysis continues to grow, anomaly detection technology has attracted increasing attention. However, in real-world scenarios, anomalous samples are often scarce or even impossible to obtain, and traditional supervised learning methods that rely on labeled data face significant bottlenecks.
Research Breakthrough
The research team led by Professor Wang Quan of the Spectral Imaging Technology Laboratory at the Xi'an Institute of Optics and Precision Mechanics (XIOPM), Chinese Academy of Sciences, has recently made new progress in zero-shot anomaly detection and localization, a computer vision task. The work has been accepted by the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026). The first author is Hu Ming, a master's student who enrolled at XIOPM in 2024. The corresponding authors are Dr. Hu Cong of Zhongnan Hospital of Wuhan University and, from XIOPM, Researcher Hu Bingliang and Professor Wang Quan. XIOPM is the first corresponding institution.
Challenges in Zero-Shot Anomaly Detection
Zero-shot anomaly detection methods based on vision-language models draw on knowledge from large-scale pre-training to detect anomalies without requiring labeled anomalous samples. In fine-grained anomaly detection tasks, however, these methods still face three major challenges. First, models struggle to separate foreground objects from complex backgrounds, so anomalous features mix with background features and detection accuracy suffers. Second, reliance on a single text representation limits semantic expressiveness, making it difficult to provide fine-grained evidence for anomaly discrimination. Third, uncertainty in image-text semantic matching during cross-modal alignment limits further performance gains.
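For readers unfamiliar with the general approach, the following is a minimal sketch of how a CLIP-style model can score anomalies without anomalous training labels: each image patch embedding is compared with one text embedding for a "normal" prompt and one for an "anomalous" prompt, and a two-way softmax turns the similarities into a per-patch anomaly probability. It illustrates the generic technique only, not the team's implementation. The function names, the temperature value, and the three-dimensional toy embeddings are all assumptions made for illustration; a real system would obtain embeddings from a pre-trained vision-language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anomaly_map(patch_embeddings, normal_text, anomalous_text, temperature=0.07):
    """Return one anomaly probability per patch (hypothetical helper).

    The temperature of 0.07 is an assumed value, not taken from the article.
    """
    scores = []
    for patch in patch_embeddings:
        s_normal = cosine(patch, normal_text) / temperature
        s_anomalous = cosine(patch, anomalous_text) / temperature
        # Two-way softmax, shifted by the maximum for numerical stability.
        shift = max(s_normal, s_anomalous)
        e_normal = math.exp(s_normal - shift)
        e_anomalous = math.exp(s_anomalous - shift)
        scores.append(e_anomalous / (e_normal + e_anomalous))
    return scores


def image_score(patch_scores):
    """Image-level score: here simply the most anomalous patch."""
    return max(patch_scores)


# Toy embeddings (made up): the third patch points toward the "anomalous" prompt.
normal_text = [1.0, 0.0, 0.0]
anomalous_text = [0.0, 1.0, 0.0]
patches = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.0]]

patch_scores = anomaly_map(patches, normal_text, anomalous_text)
overall = image_score(patch_scores)
```

In this toy setup the per-patch scores play the role of a localization map and the maximum plays the role of the image-level detection score. The challenges listed above arise in exactly this comparison: background patches and a single coarse text prompt can both blur the two similarities.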
The FB-CLIP Framework
To address these issues, the research team proposed a novel framework called FB-CLIP (Foreground-Background Disentangled CLIP). The framework introduces innovations at three levels:
Text Modeling: A multi-strategy text feature fusion method is proposed that combines sentence-level representations, global contextual information, and attention-weighted features to construct richer task-aware semantic representations, enhancing the model's understanding of anomalous semantics.
Visual Modeling: A multi-perspective foreground-background disentanglement mechanism is designed to decouple image features across semantic, spatial, and structural dimensions. Combined with a background suppression strategy, it reduces interference from complex scenes, enabling the model to focus more precisely on anomalous regions.
Cross-Modal Alignment: A semantic consistency regularization constraint is introduced to improve prediction confidence and widen the semantic gap between normal and anomalous samples, thereby strengthening the model's discriminative ability for anomaly detection.
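Two of the three ideas above can be illustrated with a rough sketch. The code below is not the FB-CLIP formulation, which this article does not give; it is a stand-in built on stated assumptions. Text fusion is approximated as a weighted sum of three feature vectors followed by L2 normalization. The consistency constraint is approximated as an entropy term that rewards confident predictions plus a term that penalizes similarity between the "normal" and "anomalous" text features. All function names, weights, and toy vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def fuse_text_features(sentence_feat, context_feat, attention_feat,
                       weights=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed fusion rule: weighted sum of three text features, then L2 norm."""
    w_s, w_c, w_a = weights
    fused = [w_s * s + w_c * c + w_a * a
             for s, c, a in zip(sentence_feat, context_feat, attention_feat)]
    return l2_normalize(fused)


def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli(p) prediction; low when p is near 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def consistency_penalty(anomaly_probs, normal_text, anomalous_text,
                        entropy_weight=1.0, separation_weight=1.0):
    """Assumed regularizer: mean prediction entropy plus prompt similarity.

    A lower value means more confident predictions and better separated
    "normal" and "anomalous" text features.
    """
    entropy_term = sum(binary_entropy(p) for p in anomaly_probs) / len(anomaly_probs)
    n = l2_normalize(normal_text)
    a = l2_normalize(anomalous_text)
    similarity = sum(x * y for x, y in zip(n, a))
    # Only positive similarity is penalized; orthogonal prompts cost nothing.
    separation_term = max(0.0, similarity)
    return entropy_weight * entropy_term + separation_weight * separation_term


# Toy inputs (made up for illustration).
fused = fuse_text_features([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
confident = consistency_penalty([0.01, 0.99], [1.0, 0.0], [0.0, 1.0])
uncertain = consistency_penalty([0.5, 0.5], [1.0, 0.2], [0.8, 0.6])
```

Under these assumptions, confident predictions with well-separated prompts yield a smaller penalty than uncertain predictions with overlapping prompts, which is the behavior the regularization described above aims for. The foreground-background disentanglement step is omitted because the article gives too little detail to sketch it responsibly.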
Experimental Results
Experimental results show that FB-CLIP performs strongly across multiple industrial inspection and medical imaging datasets, with its clearest advantage in fine-grained anomaly localization. The team describes the overall performance as internationally leading. Without requiring any annotated anomalous samples, the method detects and localizes subtle anomalies in complex scenes, which indicates strong potential for practical applications.
Application Prospects
This achievement holds promise for applications in medical image-assisted diagnosis, industrial defect detection, and other related fields.
Professor Wang Quan's team at XIOPM has long worked at the intersection of computer vision, biomedical imaging, and brain-computer intelligence. In recent years, the team has made a series of significant advances in these fields, with results published at CVPR 2025, in Pattern Recognition, and in other top venues.
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) is one of the most influential international academic conferences in the field of computer vision, rated as a CCF-A conference by the China Computer Federation.