Variable Selection in Data Analysis: A Synthetic Data Toolkit

Variable (feature) selection plays an important role in data analysis and mathematical modeling. This paper aims to address the significant lack of formal evaluation benchmarks for feature selection algorithms (FSAs). To evaluate FSAs effectively, controlled environments are required, and the use of...

Bibliographic Details
Main Author:	Mitra, Rohan (author)
Other Authors:	Ali, Eyad (author), Varam, Dara (author), Sulieman, Hana (author), Kamalov, Firuz (author)
Format:	article
Published:	2024
Subjects:	Variable selection Data analysis Synthetic datasets Synthetic data generation Feature selection algorithms
Online Access:	https://hdl.handle.net/11073/32528

_version_	1864513438837899264
author	Mitra, Rohan
author2	Ali, Eyad Varam, Dara Sulieman, Hana Kamalov, Firuz
author2_role	author author author author
author_facet	Mitra, Rohan Ali, Eyad Varam, Dara Sulieman, Hana Kamalov, Firuz
author_role	author
dc.creator.none.fl_str_mv	Mitra, Rohan Ali, Eyad Varam, Dara Sulieman, Hana Kamalov, Firuz
dc.date.none.fl_str_mv	2024 2025-12-08T06:57:52Z 2025-12-08T06:57:52Z
dc.format.none.fl_str_mv	application/pdf
dc.identifier.none.fl_str_mv	Mitra, R.; Ali, E.; Varam, D.; Sulieman, H.; Kamalov, F. Variable Selection in Data Analysis: A Synthetic Data Toolkit. Mathematics 2024, 12, 570. https://doi.org/10.3390/math12040570 2227-7390 https://hdl.handle.net/11073/32528 10.3390/math12040570
dc.language.none.fl_str_mv	en_US
dc.publisher.none.fl_str_mv	MDPI
dc.relation.none.fl_str_mv	https://doi.org/10.3390/math12040570
dc.subject.none.fl_str_mv	Variable selection Data analysis Synthetic datasets Synthetic data generation Feature selection algorithms
dc.title.none.fl_str_mv	Variable Selection in Data Analysis: A Synthetic Data Toolkit
dc.type.none.fl_str_mv	Peer-Reviewed Published version info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/article
description	Variable (feature) selection plays an important role in data analysis and mathematical modeling. This paper aims to address the significant lack of formal evaluation benchmarks for feature selection algorithms (FSAs). To evaluate FSAs effectively, controlled environments are required, and the use of synthetic datasets offers significant advantages. We introduce a set of ten synthetically generated datasets with known relevant, redundant, and irrelevant features, derived from various mathematical, logical, and geometric sources. Additionally, eight FSAs, selected for their relevance and novelty, are evaluated on these datasets. The paper first introduces the datasets and then provides a comprehensive experimental analysis of the performance of the selected FSAs on these datasets, including tests of the FSAs' resilience to two types of induced data noise. The analysis has guided the grouping of the generated datasets into four groups of data complexity. Lastly, we provide public access to the generated datasets via our GitHub repository to facilitate benchmarking of new feature selection algorithms. The contributions of this paper aim to foster the development of novel feature selection algorithms and advance their study.
format	article
id	aus_3bc1a00faa324bfd3289c38f50bb0cec
identifier_str_mv	Mitra, R.; Ali, E.; Varam, D.; Sulieman, H.; Kamalov, F. Variable Selection in Data Analysis: A Synthetic Data Toolkit. Mathematics 2024, 12, 570. https://doi.org/10.3390/math12040570 2227-7390 10.3390/math12040570
language_invalid_str_mv	en_US
network_acronym_str	aus
network_name_str	aus
oai_identifier_str	oai:repository.aus.edu:11073/32528
publishDate	2024
publisher.none.fl_str_mv	MDPI
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
spelling	Variable Selection in Data Analysis: A Synthetic Data Toolkit Mitra, Rohan Ali, Eyad Varam, Dara Sulieman, Hana Kamalov, Firuz Variable selection Data analysis Synthetic datasets Synthetic data generation Feature selection algorithms Variable (feature) selection plays an important role in data analysis and mathematical modeling. This paper aims to address the significant lack of formal evaluation benchmarks for feature selection algorithms (FSAs). To evaluate FSAs effectively, controlled environments are required, and the use of synthetic datasets offers significant advantages. We introduce a set of ten synthetically generated datasets with known relevant, redundant, and irrelevant features, derived from various mathematical, logical, and geometric sources. Additionally, eight FSAs, selected for their relevance and novelty, are evaluated on these datasets. The paper first introduces the datasets and then provides a comprehensive experimental analysis of the performance of the selected FSAs on these datasets, including tests of the FSAs' resilience to two types of induced data noise. The analysis has guided the grouping of the generated datasets into four groups of data complexity. Lastly, we provide public access to the generated datasets via our GitHub repository to facilitate benchmarking of new feature selection algorithms. The contributions of this paper aim to foster the development of novel feature selection algorithms and advance their study. American University of Sharjah MDPI 2025-12-08T06:57:52Z 2025-12-08T06:57:52Z 2024 Peer-Reviewed Published version info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/article application/pdf Mitra, R.; Ali, E.; Varam, D.; Sulieman, H.; Kamalov, F. Variable Selection in Data Analysis: A Synthetic Data Toolkit. Mathematics 2024, 12, 570. https://doi.org/10.3390/math12040570 2227-7390 https://hdl.handle.net/11073/32528 10.3390/math12040570 en_US https://doi.org/10.3390/math12040570 oai:repository.aus.edu:11073/32528 2025-12-08T11:17:46Z
spellingShingle	Variable Selection in Data Analysis: A Synthetic Data Toolkit Mitra, Rohan Variable selection Data analysis Synthetic datasets Synthetic data generation Feature selection algorithms
status_str	publishedVersion
title	Variable Selection in Data Analysis: A Synthetic Data Toolkit
title_full	Variable Selection in Data Analysis: A Synthetic Data Toolkit
title_fullStr	Variable Selection in Data Analysis: A Synthetic Data Toolkit
title_full_unstemmed	Variable Selection in Data Analysis: A Synthetic Data Toolkit
title_short	Variable Selection in Data Analysis: A Synthetic Data Toolkit
title_sort	Variable Selection in Data Analysis: A Synthetic Data Toolkit
topic	Variable selection Data analysis Synthetic datasets Synthetic data generation Feature selection algorithms
url	https://hdl.handle.net/11073/32528

Similar Items