Scalable Nonparametric Supervised Learning for Streaming and Massive Data: Applications in Healthcare Monitoring and Credit Risk

This paper introduces novel nonparametric supervised learning techniques for classifying massive datasets, addressing key limitations of existing methods in the Big and Streaming Data framework. We propose an offline kernel-based classifier enhanced by Batch Principal Compon...

Bibliographic Details
Main Author: Mohamed Chaouch (17983846) (author)
Other Authors: Omama M. Al-Hamed (18021667) (author)
Published: 2025
_version_ 1864513531473297408
author Mohamed Chaouch (17983846)
author2 Omama M. Al-Hamed (18021667)
author2_role author
author_facet Mohamed Chaouch (17983846)
Omama M. Al-Hamed (18021667)
author_role author
dc.creator.none.fl_str_mv Mohamed Chaouch (17983846)
Omama M. Al-Hamed (18021667)
dc.date.none.fl_str_mv 2025-07-30T12:00:00Z
dc.identifier.none.fl_str_mv 10.1109/access.2025.3591883
dc.relation.none.fl_str_mv https://figshare.com/articles/journal_contribution/Scalable_Nonparametric_Supervised_Learning_for_Streaming_and_Massive_Data_Applications_in_Healthcare_Monitoring_and_Credit_Risk/30971302
dc.rights.none.fl_str_mv CC BY 4.0
info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv Biomedical and clinical sciences
Reproductive medicine
Health sciences
Health services and systems
Information and computing sciences
Data management and data science
Machine learning
Big data applications
classification algorithms
dimensionality reduction
kernel methods
machine learning
nonparametric statistics
recursive estimation
principal component analysis
stochastic approximation algorithms
supervised learning
Vectors
Principal component analysis
Posterior probability
Covariance matrices
Accuracy
Random forests
Probability distribution
dc.title.none.fl_str_mv Scalable Nonparametric Supervised Learning for Streaming and Massive Data: Applications in Healthcare Monitoring and Credit Risk
dc.type.none.fl_str_mv Text
Journal contribution
info:eu-repo/semantics/publishedVersion
text
contribution to journal
description <p dir="ltr">This paper introduces novel nonparametric supervised learning techniques for classifying massive datasets, addressing key limitations of existing methods in the Big and Streaming Data framework. We propose an offline kernel-based classifier enhanced by Batch Principal Component Analysis (PCA) for dimensionality reduction to mitigate the “curse of dimensionality”. Additionally, an online classifier is developed for streaming data, combining online PCA with a kernel-based recursive classifier using a stochastic approximation algorithm. Application to fetal well-being monitoring demonstrates that the online classifier achieves a competitive median misclassification rate (11.92%), comparable to the offline classifier (11.54%) and Random Forest (11.31%), while requiring only 1/15th of the offline classifier’s computation time. Receiver Operating Characteristic (ROC) analysis shows superior Area Under the Curve (AUC) for the offline classifier but at a significant computational cost. A second study on a larger credit scoring database confirms these findings, showing that the online classifier achieves an F1-score of 96.40% and an accuracy of 93.08%, closely matching the performance of neural networks (96.46%, 93.22%) and boosting (96.51%, 93.31%). Notably, the online classifier accomplishes this with a CPU time of only 0.87 seconds per classification - over 600 times faster than neural networks - demonstrating its effectiveness for high-frequency, real-time financial decision-making.</p><h2 dir="ltr">Other Information</h2><p dir="ltr">Published in: IEEE Access<br>License: <a href="https://creativecommons.org/licenses/by/4.0/deed.en" target="_blank">https://creativecommons.org/licenses/by/4.0/</a><br>See article on publisher's website: <a href="https://dx.doi.org/10.1109/access.2025.3591883" target="_blank">https://dx.doi.org/10.1109/access.2025.3591883</a></p>
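The description mentions a kernel-based recursive classifier for streaming data. The record gives no formulas, so the following is only a minimal sketch of the general idea, assuming a Gaussian kernel, a bandwidth schedule h_n = h0 * n^(-decay), and posteriors tracked at a fixed set of query points. The class name, parameters, and defaults are hypothetical, the online PCA step is omitted, and this should not be read as the authors' algorithm.

```python
import numpy as np


class RecursiveKernelClassifier:
    """Recursive kernel estimate of class posteriors at fixed query points.

    Illustrative only: kernel, bandwidth schedule, and names are assumptions.
    """

    def __init__(self, query_points, n_classes, h0=1.0, decay=0.2):
        self.q = np.atleast_2d(np.asarray(query_points, dtype=float))
        self.n_classes = n_classes
        self.h0 = h0        # initial bandwidth (hypothetical default)
        self.decay = decay  # bandwidth shrinks as h_n = h0 * n**(-decay)
        self.n = 0          # number of observations seen so far
        # One running kernel-weight sum per (query point, class).
        self.sums = np.zeros((self.q.shape[0], n_classes))

    def update(self, x, y):
        """Fold one labelled observation into the running sums, then discard it."""
        self.n += 1
        h = self.h0 * self.n ** (-self.decay)
        d = self.q.shape[1]
        sq_dist = np.sum((self.q - np.asarray(x, dtype=float)) ** 2, axis=1)
        weights = np.exp(-0.5 * sq_dist / h**2) / h**d
        self.sums[:, y] += weights

    def posterior(self):
        """Estimated P(class | query point); uniform where no weight has accrued."""
        total = self.sums.sum(axis=1, keepdims=True)
        uniform = np.full_like(self.sums, 1.0 / self.n_classes)
        return np.divide(self.sums, total, out=uniform, where=total > 0)

    def predict(self):
        return self.posterior().argmax(axis=1)


# Simulated stream: two well-separated Gaussian classes in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[-2.0, -2.0], [2.0, 2.0]])
clf = RecursiveKernelClassifier(query_points=centers, n_classes=2)
for _ in range(500):
    label = int(rng.integers(0, 2))
    clf.update(centers[label] + rng.normal(size=2), label)

pred = clf.predict()
print(pred, clf.posterior().round(3))
```

Each update touches only the running sums, so memory use does not grow with the length of the stream. That is the property that makes a recursive estimator attractive for streaming data compared with an offline kernel classifier that revisits all stored observations.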
eu_rights_str_mv openAccess
id Manara2_cc620cbb844fcb2473cd8b6406fe9721
identifier_str_mv 10.1109/access.2025.3591883
network_acronym_str Manara2
network_name_str Manara2
oai_identifier_str oai:figshare.com:article/30971302
publishDate 2025
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
rights_invalid_str_mv CC BY 4.0
status_str publishedVersion
title Scalable Nonparametric Supervised Learning for Streaming and Massive Data: Applications in Healthcare Monitoring and Credit Risk