Variable Selection in Data Analysis: A Synthetic Data Toolkit

Variable (feature) selection plays an important role in data analysis and mathematical modeling. This paper aims to address the significant lack of formal evaluation benchmarks for feature selection algorithms (FSAs). To evaluate FSAs effectively, controlled environments are required, and the use of...

Bibliographic details
Main author: Mitra, Rohan (author)
Other authors: Ali, Eyad (author), Varam, Dara (author), Sulieman, Hana (author), Kamalov, Firuz (author)
Format: article
Published: 2024
Subjects: Variable selection; Data analysis; Synthetic datasets; Synthetic data generation; Feature selection algorithms
Online access: https://hdl.handle.net/11073/32528
dc.creator.none.fl_str_mv Mitra, Rohan
Ali, Eyad
Varam, Dara
Sulieman, Hana
Kamalov, Firuz
dc.date.none.fl_str_mv 2024
2025-12-08T06:57:52Z
dc.format.none.fl_str_mv application/pdf
dc.identifier.none.fl_str_mv Mitra, R.; Ali, E.; Varam, D.; Sulieman, H.; Kamalov, F. Variable Selection in Data Analysis: A Synthetic Data Toolkit. Mathematics 2024, 12, 570. https://doi.org/10.3390/math12040570
2227-7390
https://hdl.handle.net/11073/32528
10.3390/math12040570
dc.language.none.fl_str_mv en_US
dc.publisher.none.fl_str_mv MDPI
dc.relation.none.fl_str_mv https://doi.org/10.3390/math12040570
dc.subject.none.fl_str_mv Variable selection
Data analysis
Synthetic datasets
Synthetic data generation
Feature selection algorithms
dc.title.none.fl_str_mv Variable Selection in Data Analysis: A Synthetic Data Toolkit
dc.type.none.fl_str_mv Peer-Reviewed
Published version
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/article
description Variable (feature) selection plays an important role in data analysis and mathematical modeling. This paper addresses the lack of formal evaluation benchmarks for feature selection algorithms (FSAs). Evaluating FSAs effectively requires controlled environments, and synthetic datasets offer significant advantages for this purpose. We introduce a set of ten synthetically generated datasets with known relevance, redundancy, and irrelevance of features, derived from various mathematical, logical, and geometric sources. Eight FSAs, selected for their relevance and novelty, are evaluated on these datasets. The paper first introduces the datasets and then provides a comprehensive experimental analysis of the selected FSAs’ performance on them, including tests of the FSAs’ resilience to two types of induced data noise. The analysis guided the grouping of the generated datasets into four groups of data complexity. Lastly, we provide public access to the generated datasets via our GitHub repository to facilitate benchmarking of new feature selection algorithms. The contributions of this paper aim to foster the development of novel feature selection algorithms and advance their study.
format article
id aus_3bc1a00faa324bfd3289c38f50bb0cec
network_acronym_str aus
network_name_str aus
oai_identifier_str oai:repository.aus.edu:11073/32528
spelling American University of Sharjah
2025-12-08T11:17:46Z
status_str publishedVersion
title Variable Selection in Data Analysis: A Synthetic Data Toolkit