Improving Machine Learning Classification Predictions through SHAP and Features Analysis Interpretation

Bibliographic Details
Main Author:	Leonardo Bernal (22461790) (author)
Other Authors:	Giulio Rastelli (1513714) (author), Luca Pinzi (6268934) (author)
Published:	2025
Subjects:	Pharmacology; Developmental Biology; Space Science; Biological Sciences not elsewhere classified; Mathematical Sciences not elsewhere classified; Information Systems not elsewhere classified; virtual screening applications; three prostate cancer; shapley additive explanations; reduced interpretability issues; range typically associated; known antiproliferative activity; early drug discovery; demonstrating satisfactory accuracy; cell lines; achieved mcc values; benchmarked classifiers based; based machine learning; features value analyses; algorithms often suffer; shap values revealed; lncap test sets; likely misclassified compounds; improve classifier performance; shap values; misclassified compounds; data sets; models based; raw; widely used; valuable approach; thus providing; systematic exclusion; shap; several approaches; prediction performance; performing models; opposite class; novel approach; model predictions; integrates shap; fall within; extra trees; ecfp4 descriptors; commonly used

Description
Summary:	Tree-based machine learning (ML) algorithms, such as Extra Trees (ET), Random Forest (RF), Gradient Boosting Machine (GBM), and XGBoost (XGB), are among the most widely used in early drug discovery, given their versatility and performance. However, models based on these algorithms often suffer from misclassification and reduced interpretability, which limit their applicability in practice. To address these challenges, several approaches have been proposed, including the use of SHapley Additive Explanations (SHAP). While SHAP values are commonly used to elucidate the importance of the features driving a model's predictions, they can also be employed in strategies to improve prediction performance. Building on these premises, we propose a novel approach that integrates SHAP and feature value analyses to reduce misclassification in model predictions. Specifically, we benchmarked classifiers based on the ET, RF, GBM, and XGB algorithms using data sets of compounds with known antiproliferative activity against three prostate cancer (PC) cell lines (i.e., PC3, LNCaP, and DU-145). The best-performing models, based on RDKit and ECFP4 descriptors with the GBM and XGB algorithms, achieved MCC values above 0.58 and F1-scores above 0.8 across all data sets, demonstrating satisfactory accuracy and precision. Analyses of SHAP values revealed that many misclassified compounds possess feature values that fall within the range typically associated with the opposite class. Based on these findings, we developed a misclassification-detection framework using four filtering rules, which we termed “RAW”, “SHAP”, “RAW OR SHAP”, and “RAW AND SHAP”. These filtering rules successfully identified several potentially misclassified predictions, with the “RAW OR SHAP” rule retrieving up to 21%, 23%, and 63% of misclassified compounds in the PC3, DU-145, and LNCaP test sets, respectively. The developed flagging rules enable the systematic exclusion of likely misclassified compounds, even across progressively higher prediction confidence levels, thus providing a valuable approach to improve classifier performance in virtual screening applications.
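The summary names four flagging rules but does not define them, so the following is only an illustrative sketch of how such rules could be combined, not the authors' implementation. It assumes binary labels (0 = inactive, 1 = active), assumes that SHAP values have already been computed elsewhere (for example with a tree explainer), and uses a hypothetical criterion: a prediction is flagged when at least half of a compound's features lie closer to the training median of the opposite class than to that of the predicted class. All function names, the `min_fraction` parameter, and the toy numbers are assumptions made for illustration.

```python
from statistics import median


def class_medians(rows, labels):
    """Per-class, per-feature medians over training rows (lists of equal length)."""
    medians = {}
    for cls in set(labels):
        cls_rows = [r for r, y in zip(rows, labels) if y == cls]
        medians[cls] = [median(col) for col in zip(*cls_rows)]
    return medians


def opposite_class_flag(row, predicted, medians, min_fraction=0.5):
    """Hypothetical criterion: True if enough features sit nearer the other class."""
    other = 1 - predicted  # assumes binary labels 0/1
    closer = sum(
        abs(v - medians[other][j]) < abs(v - medians[predicted][j])
        for j, v in enumerate(row)
    )
    return closer / len(row) >= min_fraction


def flag_prediction(raw_row, shap_row, predicted, raw_medians, shap_medians,
                    rule="RAW OR SHAP"):
    """Combine the raw-value and SHAP-value flags under one of the four rule names."""
    raw = opposite_class_flag(raw_row, predicted, raw_medians)
    shp = opposite_class_flag(shap_row, predicted, shap_medians)
    rules = {
        "RAW": raw,
        "SHAP": shp,
        "RAW OR SHAP": raw or shp,
        "RAW AND SHAP": raw and shp,
    }
    return rules[rule]


# Toy data (invented): two features, two training compounds per class.
train_raw = [[1.0, 10.0], [2.0, 12.0], [8.0, 30.0], [9.0, 32.0]]
train_shap = [[-0.4, -0.2], [-0.3, -0.1], [0.3, 0.2], [0.4, 0.1]]
train_labels = [0, 0, 1, 1]

raw_medians = class_medians(train_raw, train_labels)
shap_medians = class_medians(train_shap, train_labels)

# A compound predicted active whose raw feature values resemble the inactives.
compound_raw, compound_shap, predicted = [1.8, 11.5], [0.3, 0.2], 1
flags = {
    rule: flag_prediction(compound_raw, compound_shap, predicted,
                          raw_medians, shap_medians, rule)
    for rule in ("RAW", "SHAP", "RAW OR SHAP", "RAW AND SHAP")
}
```

In this toy case only the raw-value check fires, so the compound is flagged under “RAW” and “RAW OR SHAP” but not under “SHAP” or “RAW AND SHAP”. This mirrors the general trade-off between the two combined rules: the OR rule flags more predictions, while the AND rule is stricter.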
