The effects of data balancing approaches: A case study
Imbalanced datasets affect the performance of machine learning algorithms adversely. To cope with this problem, several resampling methods have been developed recently. In this article, we present a case study approach for investigating the effects of data balancing appr...
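| Main author: | Paul Mooijman |
|---|---|
| Other authors: | Cagatay Catal, Bedir Tekinerdogan, Arjen Lommen, Marco Blokland |
| Published: | 2023 |
| Subjects: | Engineering, Biomedical engineering, Information and computing sciences, Artificial intelligence, Data management and data science, Machine learning, Imbalanced dataset, Resampling, Supervised machine learning, Classification, Feature selection, Missing data, LC–MS, Hormone abuse detection, Cattle |
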
| _version_ | 1864513541619318784 |
|---|---|
| author | Paul Mooijman (4453189) |
| author2 | Cagatay Catal (6897842); Bedir Tekinerdogan (6897839); Arjen Lommen (471283); Marco Blokland (12644072) |
| author2_role | author; author; author; author |
| author_facet | Paul Mooijman (4453189); Cagatay Catal (6897842); Bedir Tekinerdogan (6897839); Arjen Lommen (471283); Marco Blokland (12644072) |
| author_role | author |
| dc.creator.none.fl_str_mv | Paul Mooijman (4453189); Cagatay Catal (6897842); Bedir Tekinerdogan (6897839); Arjen Lommen (471283); Marco Blokland (12644072) |
| dc.date.none.fl_str_mv | 2023-01-01T21:00:00Z |
| dc.identifier.none.fl_str_mv | 10.1016/j.asoc.2022.109853 |
| dc.relation.none.fl_str_mv | https://figshare.com/articles/journal_contribution/The_effects_of_data_balancing_approaches_A_case_study/24501118 |
| dc.rights.none.fl_str_mv | CC BY 4.0; info:eu-repo/semantics/openAccess |
| dc.subject.none.fl_str_mv | Engineering; Biomedical engineering; Information and computing sciences; Artificial intelligence; Data management and data science; Machine learning; Imbalanced dataset; Resampling; Supervised machine learning; Classification; Feature selection; Missing data; LC–MS; Hormone abuse detection; Cattle |
| dc.title.none.fl_str_mv | The effects of data balancing approaches: A case study |
| dc.type.none.fl_str_mv | Text; Journal contribution; info:eu-repo/semantics/publishedVersion; text; contribution to journal |
| description | Imbalanced datasets affect the performance of machine learning algorithms adversely. To cope with this problem, several resampling methods have been developed recently. In this article, we present a case study approach for investigating the effects of data balancing approaches. The case study concerns the discrimination between growth hormone treated and non-treated animals using Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) data. Our LC-HRMS dataset contains 1241 bovine urine samples, of which only 65 specimens were from animal studies and guaranteed to contain growth-stimulating hormones while the rest has been reported to be untreated, making it a ∼5% imbalanced dataset. In this research, classification algorithms, combined with resampling strategies and dimensionality reduction methods, were investigated to find a prediction model to correctly identify the samples of treated animals. Furthermore, to cope with a large number of missing data points in the given dataset, a replacement with random low values strategy was applied. Our results showed that the replacement method was effective, and LogisticRegression combined with the oversampling algorithms SMOTE or ADASYN, GaussianProcessClassifier with the oversampling algorithm SMOTE, and LinearDiscriminantAnalysis were the best performing models after log transformation of the dataset was followed by Recursive Feature Elimination. Other Information: Published in: Applied Soft Computing. License: http://creativecommons.org/licenses/by/4.0/. See article on publisher's website: https://dx.doi.org/10.1016/j.asoc.2022.109853 |
| eu_rights_str_mv | openAccess |
| id | Manara2_52a95cd5592ec16d5ffcf97363b78a44 |
| identifier_str_mv | 10.1016/j.asoc.2022.109853 |
| network_acronym_str | Manara2 |
| network_name_str | Manara2 |
| oai_identifier_str | oai:figshare.com:article/24501118 |
| publishDate | 2023 |
| repository.mail.fl_str_mv | |
| repository.name.fl_str_mv | |
| repository_id_str | |
| rights_invalid_str_mv | CC BY 4.0 |
| status_str | publishedVersion |
| title | The effects of data balancing approaches: A case study |
| topic | Engineering; Biomedical engineering; Information and computing sciences; Artificial intelligence; Data management and data science; Machine learning; Imbalanced dataset; Resampling; Supervised machine learning; Classification; Feature selection; Missing data; LC–MS; Hormone abuse detection; Cattle |
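
The abstract in this record reports 65 treated samples out of 1241 and names SMOTE among the oversampling algorithms that were evaluated. As a rough illustration only, the sketch below computes that class ratio and shows the interpolation idea behind SMOTE-style oversampling in plain Python. It is not the authors' code and not the imbalanced-learn implementation; the function names, the toy two-feature data, and the parameters `k` and `seed` are all assumptions made for the example.

```python
import random


def class_ratio(n_minority, n_total):
    """Fraction of all samples that belong to the minority class."""
    return n_minority / n_total


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours.

    Simplified sketch of the idea behind SMOTE; needs at least two
    minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Figures taken from the abstract: 65 treated samples out of 1241.
ratio = class_ratio(65, 1241)

# Hypothetical toy minority class (not data from the study).
minority = [[1.0, 2.0], [1.5, 1.8], [0.9, 2.2], [1.2, 2.1]]
new_points = smote_like_oversample(minority, n_new=6)
```

Each synthetic point lies on a line segment between two existing minority samples, so oversampling of this kind adds no information beyond the region the minority class already occupies. In practice, resampling is normally applied to the training split only, so that synthetic points do not leak into evaluation data.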