Iranian National Conference on Bioinformatics

4th International Edition and 13th Iranian Conference on Bioinformatics

Enhancing NAFLD Diagnosis with AI: Insights from the Persian Fasa Cohort Through Advanced Machine Learning Techniques

Authors:

Marzie Shadpirouz¹, Mohammad Reza Zabihi², Zahra Salehi³, Kiarash Zare⁴, Mohammad Mehdi Naghizadeh⁵, Kaveh Kavousi⁶

1- University of Tehran; 2- University of Tehran; 3- Tehran University of Medical Sciences; 4- Shiraz University of Medical Sciences; 5- Fasa University of Medical Sciences; 6- University of Tehran

Keywords:

NAFLD, CNN, OWA, Artificial intelligence, Sugeno Fuzzy Integral

Abstract:

Non-alcoholic fatty liver disease (NAFLD) is the hepatic manifestation of metabolic syndrome, characterized by fat accumulation in the liver of individuals who do not consume excessive alcohol. Over the past three decades, its prevalence has risen globally, posing a significant public health challenge. NAFLD can progress to cirrhosis and liver failure and is associated with an increased risk of cardiovascular disease, ultimately contributing to higher overall mortality. Despite its widespread occurrence, early detection remains difficult because of limitations in current screening methods. Here, we aimed to develop an AI-driven model for diagnosing NAFLD from blood parameters and anthropometric indices.

This study used data from the Persian Fasa cohort, originally comprising 10,138 records and 226 features, both discrete and continuous. After preprocessing, normalization, and dimensionality reduction, statistical analyses were conducted in Python. Patients were categorized into three groups based on the Fatty Liver Index (FLI): healthy (<30), borderline (30–60), and NAFLD (>60). The dataset was divided into training (70%) and testing (30%) subsets. Seven feature selection methods were applied to extract common features: ANOVA, Mutual Information (MI), Independent Component Analysis (ICA), Non-negative Matrix Factorization (NMF), Principal Component Analysis (PCA), L1-penalized Support Vector Machine (SVM_L1), and Elastic Net Logistic Regression. The Random Forest algorithm then identified the most important of the extracted features, which were validated through Receiver Operating Characteristic (ROC) curve analysis. Several machine learning models, including Random Forest, Support Vector Machine (SVM), Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree, CatBoost, AdaBoost, and XGBoost, were trained, and their classification performance was evaluated using 5-fold cross-validation.
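The FLI-based grouping used in the methods above can be sketched as follows. This is only an illustration: the abstract does not restate the FLI formula, so the code assumes the commonly used definition of Bedogni et al. (2006), with triglycerides in mg/dL, GGT in U/L, and waist circumference in cm. The function names and the treatment of the exact boundary values 30 and 60 as borderline are hypothetical choices, not taken from the study.

```python
import math


def fatty_liver_index(triglycerides_mg_dl, bmi, ggt_u_l, waist_cm):
    """Return the FLI on a 0-100 scale.

    Assumed formula (not restated in the abstract): a logistic function of
    ln(triglycerides), BMI, ln(GGT), and waist circumference.
    """
    z = (
        0.953 * math.log(triglycerides_mg_dl)
        + 0.139 * bmi
        + 0.718 * math.log(ggt_u_l)
        + 0.053 * waist_cm
        - 15.745
    )
    return 100.0 / (1.0 + math.exp(-z))


def fli_group(fli):
    """Map an FLI value to the three groups named in the abstract.

    Assumption: values of exactly 30 or 60 are counted as borderline.
    """
    if fli < 30:
        return "healthy"
    if fli <= 60:
        return "borderline"
    return "NAFLD"


# Hypothetical patient values, for illustration only.
example_fli = fatty_liver_index(
    triglycerides_mg_dl=100, bmi=22, ggt_u_l=20, waist_cm=80
)
example_group = fli_group(example_fli)
```

Because the FLI is itself computed from BMI, waist circumference, triglycerides, and GGT, labels derived this way are closely tied to those four inputs.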
Model diversity was assessed using Kappa statistics and error analysis to ensure robustness. To further improve performance, Optimized Weighted Averaging (OWA) and Sugeno Fuzzy Integral methods were applied for model combination. Finally, a Convolutional Neural Network (CNN) was trained with 5-fold cross-validation to integrate the robust models and enhance classification results.

The final dataset comprised 70 clinical and lifestyle variables, including hypertension and smoking status, collected from 10,007 patients (45.2% male and 54.8% female). The groups contained 4,444 healthy, 2,892 borderline, and 2,671 NAFLD patients. Five key features, including BMI, waist-to-hip ratio, triglycerides, and GGT, were identified as the most significant predictors using the Random Forest method. The diagnostic value of these features was confirmed through ROC curve analysis, with an Area Under the Curve (AUC) greater than 0.7. The SVM and CatBoost models performed exceptionally well, with a Kappa score of 0.96 and an error rate of 0.01, indicating high model diversity and minimal error. Combining these two models using the Sugeno Fuzzy Integral, OWA, and CNN-based meta-learning produced outstanding results: accuracy 0.99, precision 0.99, recall 0.99, F1 score 0.99, and AUC 1.00.

By highlighting the factors most informative for diagnosis, this work underscores the potential of AI in NAFLD diagnosis and provides valuable insights for early detection and intervention.
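The two fusion operators named in the methods can be illustrated for the two-model case (SVM and CatBoost) with a minimal sketch. Several things here are assumptions rather than details from the study: OWA is implemented as Yager's ordered weighted averaging, whereas the abstract expands the acronym as Optimized Weighted Averaging and does not define the operator; the fuzzy densities, OWA weights, and probability vectors are made-up values; and all function names are hypothetical.

```python
def sugeno_pair(h_a, h_b, g_a, g_b):
    """Sugeno fuzzy integral of two scores in [0, 1].

    g_a and g_b are assumed fuzzy densities (how much each model is trusted
    on its own); the measure of both models together is 1 by definition.
    """
    # Order the sources by score, highest first.
    (h_hi, g_hi), (h_lo, _) = sorted(
        [(h_a, g_a), (h_b, g_b)], key=lambda pair: pair[0], reverse=True
    )
    # max over i of min(i-th largest score, measure of the top-i sources).
    return max(min(h_hi, g_hi), min(h_lo, 1.0))


def owa_pair(h_a, h_b, weights=(0.6, 0.4)):
    """Ordered weighted average: weights attach to rank, not to a model."""
    h_hi, h_lo = max(h_a, h_b), min(h_a, h_b)
    return weights[0] * h_hi + weights[1] * h_lo


def fuse(probs_a, probs_b, combine):
    """Fuse two per-class probability vectors; return (label, fused scores)."""
    fused = [combine(p_a, p_b) for p_a, p_b in zip(probs_a, probs_b)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused


# Made-up probabilities for (healthy, borderline, NAFLD) from two models.
probs_svm = [0.10, 0.25, 0.65]
probs_catboost = [0.05, 0.15, 0.80]

label_sugeno, fused_sugeno = fuse(
    probs_svm, probs_catboost,
    lambda a, b: sugeno_pair(a, b, g_a=0.6, g_b=0.7),
)
label_owa, fused_owa = fuse(probs_svm, probs_catboost, owa_pair)
```

With more than two models, the Sugeno integral would also need the measure of every intermediate subset, for example from a λ-fuzzy measure; the two-model case avoids that step because the only subsets are the two singletons and the full set.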