The effects of data balancing approaches: A case study

Imbalanced datasets affect the performance of machine learning algorithms adversely. To cope with this problem, several resampling methods have been developed recently. In this article, we present a case study approach for investigating the effects of data balancing appr...

Bibliographic Details
Main Author: Paul Mooijman (4453189) (author)
Other Authors: Cagatay Catal (6897842) (author), Bedir Tekinerdogan (6897839) (author), Arjen Lommen (471283) (author), Marco Blokland (12644072) (author)
Published: 2023
Subjects:
_version_ 1864513541619318784
author Paul Mooijman (4453189)
author2 Cagatay Catal (6897842)
Bedir Tekinerdogan (6897839)
Arjen Lommen (471283)
Marco Blokland (12644072)
author2_role author
author
author
author
author_role author
dc.creator.none.fl_str_mv Paul Mooijman (4453189)
Cagatay Catal (6897842)
Bedir Tekinerdogan (6897839)
Arjen Lommen (471283)
Marco Blokland (12644072)
dc.date.none.fl_str_mv 2023-01-01T21:00:00Z
dc.identifier.none.fl_str_mv 10.1016/j.asoc.2022.109853
dc.relation.none.fl_str_mv https://figshare.com/articles/journal_contribution/The_effects_of_data_balancing_approaches_A_case_study/24501118
dc.rights.none.fl_str_mv CC BY 4.0
info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv Engineering
Biomedical engineering
Information and computing sciences
Artificial intelligence
Data management and data science
Machine learning
Imbalanced dataset
Resampling
Supervised machine learning
Classification
Feature selection
Missing data
LC–MS
Hormone abuse detection
Cattle
dc.title.none.fl_str_mv The effects of data balancing approaches: A case study
dc.type.none.fl_str_mv Text
Journal contribution
info:eu-repo/semantics/publishedVersion
text
contribution to journal
description <p dir="ltr">Imbalanced datasets affect the performance of machine learning algorithms adversely. To cope with this problem, several resampling methods have been developed recently. In this article, we present a case study approach for investigating the effects of data balancing approaches. The case study concerns the discrimination between growth-hormone-treated and non-treated animals using Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) data. Our LC-HRMS dataset contains 1241 bovine urine samples, of which only 65 specimens came from animal studies and were guaranteed to contain growth-stimulating hormones, while the rest were reported to be untreated, so the treated class makes up only about 5% of the dataset. In this research, classification algorithms, combined with resampling strategies and dimensionality reduction methods, were investigated to find a prediction model that correctly identifies the samples of treated animals. Furthermore, to cope with the large number of missing data points in the dataset, missing values were replaced with random low values. Our results showed that this replacement method was effective, and that the best performing models were LogisticRegression combined with the oversampling algorithms SMOTE or ADASYN, GaussianProcessClassifier combined with the oversampling algorithm SMOTE, and LinearDiscriminantAnalysis, in each case when log transformation of the dataset was followed by Recursive Feature Elimination.</p><h2>Other Information</h2><p dir="ltr">Published in: Applied Soft Computing<br>License: <a href="http://creativecommons.org/licenses/by/4.0/" target="_blank">http://creativecommons.org/licenses/by/4.0/</a><br>See article on publisher's website: <a href="https://dx.doi.org/10.1016/j.asoc.2022.109853" target="_blank">https://dx.doi.org/10.1016/j.asoc.2022.109853</a></p>
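The preprocessing steps named in the description (class imbalance, replacement of missing values with random low values, log transformation) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the "random low values" were drawn, so the rule used here (uniform between half the smallest observed value and that value) is an assumption, as are the function names, the `low_fraction` parameter, and the toy intensity values.

```python
import math
import random


def imbalance_ratio(n_minority, n_total):
    """Return the fraction of samples that belong to the minority class."""
    return n_minority / n_total


def fill_missing_with_low_values(column, rng, low_fraction=0.5):
    """Replace None entries with random values at or below the smallest observed value.

    Assumption (not from the source): a "low" value is drawn uniformly from
    [low_fraction * min_observed, min_observed]. Observed values must be positive.
    """
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    floor = min(observed)
    return [
        v if v is not None else rng.uniform(low_fraction * floor, floor)
        for v in column
    ]


def log_transform(column):
    """Apply the natural logarithm to every (positive) value."""
    return [math.log(v) for v in column]


# Sample counts come from the description; the intensities are made-up toy data.
ratio = imbalance_ratio(65, 1241)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
intensities = [1200.0, None, 860.0, None, 15000.0]
filled = fill_missing_with_low_values(intensities, rng)
logged = log_transform(filled)

print(f"minority share: {ratio:.1%}")
print("filled:", filled)
print("log-transformed:", logged)
```

In a full pipeline, resampling (for example SMOTE or ADASYN) and Recursive Feature Elimination would follow these steps. They are left out here to keep the sketch free of third-party dependencies.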
eu_rights_str_mv openAccess
id Manara2_52a95cd5592ec16d5ffcf97363b78a44
identifier_str_mv 10.1016/j.asoc.2022.109853
network_acronym_str Manara2
network_name_str Manara2
oai_identifier_str oai:figshare.com:article/24501118
publishDate 2023
rights_invalid_str_mv CC BY 4.0
status_str publishedVersion
title The effects of data balancing approaches: A case study