Variable Selection in Data Analysis: A Synthetic Data Toolkit
Variable (feature) selection plays an important role in data analysis and mathematical modeling. This paper aims to address the significant lack of formal evaluation benchmarks for feature selection algorithms (FSAs). To evaluate FSAs effectively, controlled environments are required, and the use of...
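| Main author: | Mitra, Rohan |
|---|---|
| Other authors: | Ali, Eyad; Varam, Dara; Sulieman, Hana; Kamalov, Firuz |
| Format: | article |
| Published: | 2024 |
| Subjects: | Variable selection; Data analysis; Synthetic datasets; Synthetic data generation; Feature selection algorithms |
| Online access: | https://hdl.handle.net/11073/32528 |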
| _version_ | 1864513438837899264 |
|---|---|
| author | Mitra, Rohan |
| author2 | Ali, Eyad; Varam, Dara; Sulieman, Hana; Kamalov, Firuz |
| author2_role | author; author; author; author |
| author_facet | Mitra, Rohan; Ali, Eyad; Varam, Dara; Sulieman, Hana; Kamalov, Firuz |
| author_role | author |
| dc.creator.none.fl_str_mv | Mitra, Rohan; Ali, Eyad; Varam, Dara; Sulieman, Hana; Kamalov, Firuz |
| dc.date.none.fl_str_mv | 2024; 2025-12-08T06:57:52Z; 2025-12-08T06:57:52Z |
| dc.format.none.fl_str_mv | application/pdf |
| dc.identifier.none.fl_str_mv | Mitra, R.; Ali, E.; Varam, D.; Sulieman, H.; Kamalov, F. Variable Selection in Data Analysis: A Synthetic Data Toolkit. Mathematics 2024, 12, 570. https://doi.org/10.3390/math12040570 2227-7390 https://hdl.handle.net/11073/32528 10.3390/math12040570 |
| dc.language.none.fl_str_mv | en_US |
| dc.publisher.none.fl_str_mv | MDPI |
| dc.relation.none.fl_str_mv | https://doi.org/10.3390/math12040570 |
| dc.subject.none.fl_str_mv | Variable selection; Data analysis; Synthetic datasets; Synthetic data generation; Feature selection algorithms |
| dc.title.none.fl_str_mv | Variable Selection in Data Analysis: A Synthetic Data Toolkit |
| dc.type.none.fl_str_mv | Peer-Reviewed; Published version; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/article |
| description | Variable (feature) selection plays an important role in data analysis and mathematical modeling. This paper aims to address the significant lack of formal evaluation benchmarks for feature selection algorithms (FSAs). To evaluate FSAs effectively, controlled environments are required, and the use of synthetic datasets offers significant advantages. We introduce a set of ten synthetically generated datasets with known relevance, redundancy, and irrelevance of features, derived from various mathematical, logical, and geometric sources. Additionally, eight FSAs, selected for their relevance and novelty, are evaluated on these datasets. The paper first introduces the datasets and then provides a comprehensive experimental analysis of the performance of the selected FSAs on these datasets, including tests of the FSAs’ resilience to two types of induced data noise. The analysis has guided the grouping of the generated datasets into four groups of data complexity. Lastly, we provide public access to the generated datasets via our GitHub repository to facilitate benchmarking of new feature selection algorithms in the field. The contributions of this paper aim to foster the development of novel feature selection algorithms and advance their study. |
| format | article |
| id | aus_3bc1a00faa324bfd3289c38f50bb0cec |
| identifier_str_mv | Mitra, R.; Ali, E.; Varam, D.; Sulieman, H.; Kamalov, F. Variable Selection in Data Analysis: A Synthetic Data Toolkit. Mathematics 2024, 12, 570. https://doi.org/10.3390/math12040570 2227-7390 10.3390/math12040570 |
| language_invalid_str_mv | en_US |
| network_acronym_str | aus |
| network_name_str | aus |
| oai_identifier_str | oai:repository.aus.edu:11073/32528 |
| publishDate | 2024 |
| publisher.none.fl_str_mv | MDPI |
| spelling | Variable Selection in Data Analysis: A Synthetic Data Toolkit; Mitra, Rohan; Ali, Eyad; Varam, Dara; Sulieman, Hana; Kamalov, Firuz; Variable selection; Data analysis; Synthetic datasets; Synthetic data generation; Feature selection algorithms; Variable (feature) selection plays an important role in data analysis and mathematical modeling. This paper aims to address the significant lack of formal evaluation benchmarks for feature selection algorithms (FSAs). To evaluate FSAs effectively, controlled environments are required, and the use of synthetic datasets offers significant advantages. We introduce a set of ten synthetically generated datasets with known relevance, redundancy, and irrelevance of features, derived from various mathematical, logical, and geometric sources. Additionally, eight FSAs, selected for their relevance and novelty, are evaluated on these datasets. The paper first introduces the datasets and then provides a comprehensive experimental analysis of the performance of the selected FSAs on these datasets, including tests of the FSAs’ resilience to two types of induced data noise. The analysis has guided the grouping of the generated datasets into four groups of data complexity. Lastly, we provide public access to the generated datasets via our GitHub repository to facilitate benchmarking of new feature selection algorithms in the field. The contributions of this paper aim to foster the development of novel feature selection algorithms and advance their study.; American University of Sharjah; MDPI; 2025-12-08T06:57:52Z; 2025-12-08T06:57:52Z; 2024; Peer-Reviewed; Published version; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/article; application/pdf; Mitra, R.; Ali, E.; Varam, D.; Sulieman, H.; Kamalov, F. Variable Selection in Data Analysis: A Synthetic Data Toolkit. Mathematics 2024, 12, 570. https://doi.org/10.3390/math12040570; 2227-7390; https://hdl.handle.net/11073/32528; 10.3390/math12040570; en_US; https://doi.org/10.3390/math12040570; oai:repository.aus.edu:11073/32528; 2025-12-08T11:17:46Z |
| spellingShingle | Variable Selection in Data Analysis: A Synthetic Data Toolkit; Mitra, Rohan; Variable selection; Data analysis; Synthetic datasets; Synthetic data generation; Feature selection algorithms |
| status_str | publishedVersion |
| title | Variable Selection in Data Analysis: A Synthetic Data Toolkit |
| topic | Variable selection; Data analysis; Synthetic datasets; Synthetic data generation; Feature selection algorithms |
| url | https://hdl.handle.net/11073/32528 |
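
The benchmarking idea summarized in the description, synthetic data in which the relevant, redundant, and irrelevant features are known in advance, can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the toy generator `make_dataset`, the five-feature layout, and the use of absolute Pearson correlation as a stand-in filter-type FSA are all hypothetical.

```python
import random

# Ground truth for the hypothetical toy dataset (column indices).
RELEVANT = {0, 1}      # the label depends on these
REDUNDANT = {2}        # exact linear copy of feature 0
IRRELEVANT = {3, 4}    # independent noise


def make_dataset(n_samples=500, seed=0):
    """Hypothetical generator: label is 1 when x0 + x1 > 0, else 0."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n_samples):
        x0 = rng.gauss(0, 1)
        x1 = rng.gauss(0, 1)
        x2 = 2.0 * x0          # redundant feature
        x3 = rng.gauss(0, 1)   # irrelevant feature
        x4 = rng.gauss(0, 1)   # irrelevant feature
        rows.append([x0, x1, x2, x3, x4])
        labels.append(1 if x0 + x1 > 0 else 0)
    return rows, labels


def pearson(xs, ys):
    """Sample Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def rank_features(rows, labels):
    """Stand-in filter FSA: rank columns by |correlation| with the label."""
    n_features = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels)) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


rows, labels = make_dataset()
order, scores = rank_features(rows, labels)
top3 = set(order[:3])

# Compare the selection against the known ground truth.
print("ranking:", order)
print("irrelevant features selected:", sorted(top3 & IRRELEVANT))
print("redundant features selected:", sorted(top3 & REDUNDANT))
```

Because the ground truth is fixed by construction, the selection can be scored directly. In this sketch the univariate filter keeps the redundant copy alongside the relevant features, which is the kind of behavior a controlled dataset makes visible.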