Abstract:
As a fundamental task in food computing, food detection plays a crucial role in locating and identifying food items in input images, particularly in applications such as intelligent canteen settlement and dietary health management. However, food categories are constantly updated in practical scenarios, making it difficult for food detectors trained on a fixed set of categories to accurately detect previously unseen food categories. To address this issue, this paper proposes a zero-shot food image detection method. First, a Transformer-based food primitive generator is constructed, in which each primitive contains fine-grained attributes relevant to food categories. These primitives can be selectively assembled according to food characteristics to synthesize features for new food categories. Second, a visual feature disentanglement enhancement component is proposed to impose additional constraints on the visual features of unseen food categories. It decomposes the visual features of food images into semantically related and semantically unrelated features, thereby better transferring the semantic knowledge of food categories to their visual features. The proposed method is evaluated on the ZSFooD and UEC-FOOD256 datasets through extensive experiments and ablation studies. Under the zero-shot detection (ZSD) setting, the best average precision on unseen classes reaches 4.9% and 24.1%, respectively, demonstrating the effectiveness of the proposed approach. Under the generalized zero-shot detection (GZSD) setting, the harmonic mean over seen and unseen classes reaches 5.8% and 22%, respectively, further validating the method.
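
The harmonic mean reported under the GZSD setting combines detection performance on seen and unseen classes into a single score that stays low unless both are reasonably high. The sketch below illustrates the standard harmonic mean formula, HM = 2 * S * U / (S + U). The function name and the input values are hypothetical and chosen only for illustration; they are not results from this paper.

```python
def harmonic_mean(seen_ap: float, unseen_ap: float) -> float:
    """Harmonic mean of seen-class and unseen-class average precision.

    Both inputs are assumed to be percentages on the same scale.
    Returns 0.0 when both inputs are zero to avoid division by zero.
    """
    total = seen_ap + unseen_ap
    if total == 0:
        return 0.0
    return 2 * seen_ap * unseen_ap / total


# Hypothetical values, not taken from the paper: a detector that is
# strong on seen classes but weak on unseen ones gets a low combined score.
example_score = harmonic_mean(40.0, 10.0)
print(f"harmonic mean: {example_score:.1f}%")
```

Because the harmonic mean is dominated by the smaller of the two values, a detector cannot obtain a high GZSD score by performing well on seen classes alone.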