A Novel Two-Fold Loss Function for Data Clustering and Reconstruction: Application to Document Analysis

In the midst of the ongoing COVID-19 pandemic, there has been a surge in scientific literature aimed at understanding the virus and its impact. However, it has become challenging for a researcher to deal with thousands of articles published daily. This paper proposes a n...

Bibliographic Details
Main Author: Mebarka Allaoui (17983795) (author)
Other Authors: Mohammed Lamine Kherfi (17983798) (author), Oussama Aiadi (17983801) (author), Samir Brahim Belhaouari (9427347) (author)
Published: 2023
Subjects:
_version_ 1864513527414259712
author Mebarka Allaoui (17983795)
author2 Mohammed Lamine Kherfi (17983798)
Oussama Aiadi (17983801)
Samir Brahim Belhaouari (9427347)
author2_role author
author
author
author_facet Mebarka Allaoui (17983795)
Mohammed Lamine Kherfi (17983798)
Oussama Aiadi (17983801)
Samir Brahim Belhaouari (9427347)
author_role author
dc.creator.none.fl_str_mv Mebarka Allaoui (17983795)
Mohammed Lamine Kherfi (17983798)
Oussama Aiadi (17983801)
Samir Brahim Belhaouari (9427347)
dc.date.none.fl_str_mv 2023-09-06T06:00:00Z
dc.identifier.none.fl_str_mv 10.1109/access.2023.3312622
dc.relation.none.fl_str_mv https://figshare.com/articles/journal_contribution/A_Novel_Two-Fold_Loss_Function_for_Data_Clustering_and_Reconstruction_Application_to_Document_Analysis/25239511
dc.rights.none.fl_str_mv CC BY 4.0
info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv Engineering
Electrical engineering
Electronics, sensors and digital hardware
Materials engineering
Noise measurement
COVID-19
Training
Encoding
Data models
Computational modeling
Mathematical models
Document handling
Clustering
deep learning
dimensionality reduction
document organization
topic modeling
dc.title.none.fl_str_mv A Novel Two-Fold Loss Function for Data Clustering and Reconstruction: Application to Document Analysis
dc.type.none.fl_str_mv Text
Journal contribution
info:eu-repo/semantics/publishedVersion
text
contribution to journal
description <p dir="ltr">In the midst of the ongoing COVID-19 pandemic, there has been a surge in scientific literature aimed at understanding the virus and its impact. However, it has become challenging for researchers to keep up with the thousands of articles published daily. This paper proposes a novel deep-learning architecture that organizes a large dataset of COVID-19-related scientific literature and provides a clear overview of the current state of knowledge. The proposed model rests on two main principles intended to ensure robustness and efficiency. First, we train a denoising autoencoder on both clean and noisy data so that the model balances preserving the underlying structure of the data with generalizing to new, unseen data. Second, and this is the cornerstone of the proposed architecture, we train the autoencoder with a two-fold objective function that jointly incorporates data reconstruction and clustering. This combination avoids distorting the latent space and improves the efficiency of the model. Afterward, we use Latent Dirichlet Allocation (LDA) to analyze the topics of the documents. For the sake of computational efficiency, instead of feeding LDA the whole dataset of documents, we feed it the clusters produced in the dimensionality reduction and clustering phase and count the frequency of topics in each cluster. The model was trained on a large public corpus of COVID-19-related articles and evaluated using a set of evaluation metrics. Experimental results indicate that the proposed model outperforms several recent studies.</p><h2>Other Information</h2><p dir="ltr">Published in: IEEE Access<br>License: <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank">https://creativecommons.org/licenses/by/4.0/</a><br>See article on publisher's website: <a href="https://dx.doi.org/10.1109/access.2023.3312622" target="_blank">https://dx.doi.org/10.1109/access.2023.3312622</a></p>
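The two-fold objective described in the abstract can be illustrated with a minimal sketch. The record does not state the actual loss terms, so everything below is an assumption: the reconstruction term is taken to be the average squared error between the clean input and the decoder output (obtained from a noisy copy of that input), the clustering term is taken to be a k-means-style distance from each latent code to its nearest centroid, and the names `two_fold_loss` and `gamma` are hypothetical.

```python
# Illustrative sketch only: the abstract does not specify the loss terms.
# Assumed: squared-error reconstruction plus a k-means-style clustering
# penalty, combined with a hypothetical weight `gamma`.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reconstruction_loss(clean_inputs, reconstructions):
    """Average squared error per sample between clean inputs and decoder
    outputs; in a denoising setup the outputs come from noisy inputs."""
    total = sum(squared_distance(x, r)
                for x, r in zip(clean_inputs, reconstructions))
    return total / len(clean_inputs)


def clustering_loss(latent_codes, centroids):
    """Average squared distance from each latent code to its nearest centroid."""
    total = sum(min(squared_distance(z, c) for c in centroids)
                for z in latent_codes)
    return total / len(latent_codes)


def two_fold_loss(clean_inputs, reconstructions, latent_codes, centroids,
                  gamma=0.1):
    """Joint objective: reconstruction term + gamma * clustering term."""
    return (reconstruction_loss(clean_inputs, reconstructions)
            + gamma * clustering_loss(latent_codes, centroids))
```

Optimizing both terms together, rather than clustering a latent space learned from reconstruction alone, is what the abstract credits with avoiding distortion of the latent space; the weighting between the two terms is a design choice this sketch leaves to `gamma`.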
eu_rights_str_mv openAccess
id Manara2_826c763c5dd6940fbf12d3229056e8d8
identifier_str_mv 10.1109/access.2023.3312622
network_acronym_str Manara2
network_name_str Manara2
oai_identifier_str oai:figshare.com:article/25239511
publishDate 2023
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
rights_invalid_str_mv CC BY 4.0
status_str publishedVersion
title A Novel Two-Fold Loss Function for Data Clustering and Reconstruction: Application to Document Analysis