Metabolomics Combined with Machine Learning for Geographical Origin Tracing of Garlic

韩浩; 胡梦雅; 王珍珍; 沙如意; 黄俊; 毛建卫; 崔艳丽

doi:10.13386/j.issn1002-0306.2024020220

Metabolomics Combined with Machine Learning for Geographical Origin Tracing of Garlic

Abstract

Abstract: Two hundred samples of purple-skinned garlic from five producing regions (Yunnan, Shandong, Henan, Anhui, and Jiangsu) were profiled by gas chromatography-mass spectrometry (GC-MS). The metabolite data were analyzed by principal component analysis (PCA) and partial least squares-discriminant analysis (PLS-DA). Three preprocessing methods, namely min-max scaling (min-max scaler, MMS), standardization (standard scaler, SS), and standard normal variate transformation (SNV), were each combined with five machine learning models, namely random forest (RF), support vector machine (SVM), XGBoost, convolutional neural network (CNN), and long short-term memory neural network (LSTM), to classify garlic by origin. A total of 66 metabolites were detected in garlic from the different origins, and PLS-DA screened out 12 differential metabolites involved in six metabolic pathways: valine, leucine, and isoleucine biosynthesis; galactose metabolism; cyanoamino acid metabolism; glyoxylate and dicarboxylate metabolism; glycine, serine, and threonine metabolism; and D-amino acid metabolism. Among the five models, LSTM performed best, reaching 100% test-set accuracy under all three preprocessing methods. Combining metabolomics with LSTM therefore identified garlic origin with high accuracy and reliability, providing a reliable technical means of tracing the source of garlic products.
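The three preprocessing methods named in the abstract differ mainly in the axis along which they operate, which a short sketch can make concrete. The Python code below is an illustration only, not the authors' pipeline: the data are synthetic, the 160/40 train/test split is a hypothetical choice (the abstract does not state the split), and fitting MMS and SS on the training set alone is an assumption about good practice rather than something the abstract confirms. The function names are likewise made up for this example.

```python
import numpy as np


def min_max_scale(X_train, X_test):
    """MMS: rescale each metabolite (column) to [0, 1] using training-set min and max."""
    lo = X_train.min(axis=0)
    span = X_train.max(axis=0) - lo
    span[span == 0] = 1.0  # guard against constant columns
    return (X_train - lo) / span, (X_test - lo) / span


def standard_scale(X_train, X_test):
    """SS: center each metabolite (column) to mean 0 and scale to unit variance."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    return (X_train - mu) / sd, (X_test - mu) / sd


def snv(X):
    """SNV: center and scale each sample (row) by its own mean and standard deviation."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, ddof=1, keepdims=True)
    sd[sd == 0] = 1.0  # guard against constant rows
    return (X - mu) / sd


# Synthetic stand-in for the GC-MS table: 200 samples x 66 metabolites.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 66))

# Hypothetical 160/40 split; the abstract does not report the actual split.
X_train, X_test = X[:160], X[160:]

mms_train, mms_test = min_max_scale(X_train, X_test)
ss_train, ss_test = standard_scale(X_train, X_test)
snv_all = snv(X)  # row-wise, so no statistics are shared between samples
```

MMS and SS compute their statistics per metabolite across samples, so test samples are transformed with training-set parameters. SNV normalizes each sample on its own and therefore needs no fitting step.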