A Preliminary Result of Food Object Detection using Swin Transformer

Publication Date


Document Type

Conference Proceeding

Publication Title

ACM International Conference Proceeding Series



First Page


Last Page



An inappropriate diet is one of the main causes of poor health. However, quantitative diet assessment is difficult to sustain in everyday living environments. Food object detection is a key method for solving this problem, yet few studies apply recent object detection techniques to it. In addition, current high-performance food object detection models use a specialized architecture that chains two deep learning models in series, one for food localization and one for food classification, to achieve high accuracy. The disadvantage of this architecture is that the scalability of the resulting model is difficult to predict. In this study, we built an end-to-end food object detection model using the Swin Transformer, one of the latest backbone models. We conducted experiments on the UECFOOD datasets to compare its performance with that of other food object detection studies. The model obtained a mean Average Precision (mAP) of 0.522 on the UECFOOD-100 dataset and a mAP of 0.52 on the UECFOOD-256 dataset. The findings show that the proposed model, which uses only end-to-end object detection, outperforms previous approaches that combine food localization and food classification.
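The mAP figures reported above are averages of per-class Average Precision. The abstract does not state the IoU threshold or the AP interpolation protocol used, so the following is only a minimal sketch of how such a score is commonly computed, assuming an IoU threshold of 0.5 and all-point interpolation. The function names (iou, average_precision, mean_average_precision) and data layout are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), assumed layout
Detection = Tuple[int, float, Box]        # (image_id, confidence score, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(dets: List[Detection],
                      gts: Dict[int, List[Box]],
                      iou_thr: float = 0.5) -> float:
    """AP for one class; iou_thr=0.5 is an assumption, not the paper's setting."""
    n_gt = sum(len(boxes) for boxes in gts.values())
    if n_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in gts.items()}
    hits = []
    # Visit detections from most to least confident.
    for img, _score, box in sorted(dets, key=lambda d: d[1], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts.get(img, [])):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        # A detection is a true positive only if it claims an unmatched box.
        if best_j >= 0 and best_iou >= iou_thr and not matched[img][best_j]:
            matched[img][best_j] = True
            hits.append(1)
        else:
            hits.append(0)
    precisions, recalls, tp = [], [], 0
    for k, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / k)
        recalls.append(tp / n_gt)
    # Make precision non-increasing from right to left (all-point interpolation).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(
    per_class: Dict[str, Tuple[List[Detection], Dict[int, List[Box]]]]
) -> float:
    """Mean of per-class AP values; per_class maps class name to (dets, gts)."""
    aps = [average_precision(d, g) for d, g in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0
```

In this sketch an end-to-end detector would supply class-labelled boxes directly, whereas a two-stage pipeline would supply boxes from the localization model with labels assigned afterwards by the classifier; the scoring itself is the same in both cases.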

Funding Sponsor

Ministry of Science, ICT and Future Planning


Keywords

Deep learning, Food object detection, Food recognition, Vision Transformer


Applied Data Science; Electrical Engineering