Zero-Shot Food Image Detection Based on Transformer

宋静茹; 闵巍庆; 周鹏飞; 饶全瑞; 盛国瑞; 杨延村; 王丽丽; 蒋树强

doi:10.13386/j.issn1002-0306.2024030027

Abstract

As a fundamental task in food computing, food detection locates and recognizes food items in input images and plays a crucial role in applications such as smart canteen checkout and dietary health management. In real-world scenarios, however, food categories are constantly updated, so a food detector trained on a fixed set of categories struggles to accurately detect food categories it has never seen. To address this problem, this paper proposes a zero-shot food image detection method. First, a Transformer-based food primitive generator is constructed, in which each primitive contains fine-grained attributes related to food categories; these primitives can be selectively assembled according to the characteristics of a food to synthesize features of unseen classes. Second, to impose more constraints on the visual features of unseen classes, an enhancement component for visual feature disentanglement is proposed. It decomposes the visual features of food images into semantically related and semantically unrelated features, so that the semantic knowledge of food categories is better transferred to their visual features. The method is evaluated through extensive experiments and ablation studies on two food datasets, ZSFooD and UEC-FOOD256. Under the zero-shot detection (ZSD) setting, it achieves the best average precision on unseen classes, reaching 4.9% and 24.1%, respectively. Under the generalized zero-shot detection (GZSD) setting, the harmonic mean (HM) of seen and unseen classes reaches 5.8% and 22.0%, respectively. These results demonstrate the effectiveness of the proposed method.
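For readers unfamiliar with the GZSD metric, the harmonic mean of a seen-class score S and an unseen-class score U is conventionally computed as HM = 2·S·U / (S + U). The sketch below illustrates that standard formula only. The function name `harmonic_mean` and the input values are hypothetical, chosen for illustration; they are not taken from the paper, which reports only the resulting HM values.

```python
def harmonic_mean(seen_score: float, unseen_score: float) -> float:
    """Harmonic mean of a seen-class and an unseen-class score.

    Both scores must be in the same unit (for example, percent).
    Returns 0.0 when both scores are zero to avoid dividing by zero.
    """
    total = seen_score + unseen_score
    if total == 0:
        return 0.0
    return 2.0 * seen_score * unseen_score / total


# Illustrative inputs only (assumed values, not results from the paper).
balanced = harmonic_mean(20.0, 20.0)  # equal scores: HM equals that score
skewed = harmonic_mean(38.0, 2.0)     # same arithmetic mean, very unequal scores

print(f"balanced HM = {balanced:.1f}")
print(f"skewed HM   = {skewed:.1f}")
```

Both input pairs have the same arithmetic mean, yet the skewed pair yields a much lower HM. This is why the metric suits GZSD: a detector cannot score well by performing strongly on seen classes while largely failing on unseen ones.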