Abstract:
Metabolomics analysis was conducted on gas chromatography-mass spectrometry (GC-MS) data using principal component analysis (PCA) and partial least squares-discriminant analysis (PLS-DA). Two hundred samples of purple-skinned garlic from five regions (Yunnan, Shandong, Henan, Anhui, and Jiangsu) were analyzed. Three preprocessing methods were applied: maximum-minimum normalization (MMS), standard deviation standardization (SS), and the standard normal variate transform (SNV). Five machine learning models, namely Random Forest (RF), Support Vector Machine (SVM), XGBoost, Convolutional Neural Network (CNN), and Long Short-Term Memory neural network (LSTM), were used to classify the garlic samples by origin. A total of 66 metabolites were screened across garlic samples from the different origins, of which 12 were identified as differential metabolites by PLS-DA. These differential metabolites were associated with six metabolic pathways: valine, leucine, and isoleucine biosynthesis; galactose metabolism; cyanoamino acid metabolism; glyoxylate and dicarboxylic acid metabolism; glycine, serine, and threonine metabolism; and D-amino acid metabolism. Among the five machine learning models, LSTM performed best, achieving 100% accuracy on the test set under all three preprocessing methods. By combining metabolomics with LSTM, this study identifies the origin of garlic with high accuracy and reliability, providing a dependable technical means for tracing the source of garlic products.
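
As an illustration of the three preprocessing methods named above, the following is a minimal sketch in plain Python. It assumes the data are arranged as a samples-by-features table, that MMS and SS operate per feature (column), and that SNV operates per sample (row) using the sample standard deviation. The function names and the toy values are hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev, stdev


def min_max_scale(rows):
    """MMS: rescale each feature (column) to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def standard_scale(rows):
    """SS: centre each feature (column) and divide by its standard deviation.

    Assumption: the population standard deviation is used here.
    """
    cols = list(zip(*rows))
    mu = [mean(c) for c in cols]
    sd = [pstdev(c) for c in cols]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, mu, sd)]
        for row in rows
    ]


def snv(rows):
    """SNV: centre each sample (row) and divide by that row's standard deviation.

    Assumption: the sample standard deviation (n - 1 denominator) is used here.
    """
    out = []
    for row in rows:
        m = mean(row)
        s = stdev(row)
        out.append([(x - m) / s if s > 0 else 0.0 for x in row])
    return out


# Hypothetical toy table: 3 samples (rows) x 3 metabolite intensities (columns).
toy = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 400.0],
    [3.0, 60.0, 700.0],
]

mms_out = min_max_scale(toy)
ss_out = standard_scale(toy)
snv_out = snv(toy)
```

The main distinction is the axis of normalization: MMS and SS remove scale differences between features, whereas SNV removes scale differences between individual samples.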