Unsupervised Topical Organization of Documents using Corpus-based Text Analysis

This study aims to automate the topical keyword organization of a set of documents in an input text corpus. It is conducted in the context of a larger project investigating efficient unsupervised learning techniques to automatically extract relevant classes and their keyword descriptions...

Bibliographic Details
Main Author: Sarkissian, Sarkis (author)
Other Authors: Tekli, Joe (author)
Format: conferenceObject
Published: 2021
Subjects:
Online Access: http://hdl.handle.net/10725/16285
https://doi.org/10.1145/3444757.3485078
http://libraries.lau.edu.lb/research/laur/terms-of-use/articles.php
https://dl.acm.org/doi/abs/10.1145/3444757.3485078
_version_ 1864513472549617664
author Sarkissian, Sarkis
author2 Tekli, Joe
author2_role author
author_facet Sarkissian, Sarkis
Tekli, Joe
author_role author
dc.creator.none.fl_str_mv Sarkissian, Sarkis
Tekli, Joe
dc.date.none.fl_str_mv 2021
2021-11-09
2024-11-08T08:23:24Z
2024-11-08T08:23:24Z
dc.identifier.none.fl_str_mv 9781450383141
http://hdl.handle.net/10725/16285
https://doi.org/10.1145/3444757.3485078
Sarkissian, S., & Tekli, J. (2021, November). Unsupervised topical organization of documents using corpus-based text analysis. In Proceedings of 2021 13th International Conference on Management of Digital EcoSystems (MEDES 2021), (pp. 87-94). New York: ACM.
http://libraries.lau.edu.lb/research/laur/terms-of-use/articles.php
https://dl.acm.org/doi/abs/10.1145/3444757.3485078
dc.language.none.fl_str_mv en
dc.publisher.none.fl_str_mv The Association for Computing Machinery
dc.rights.*.fl_str_mv info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv Big data -- Congresses
Computer security -- Congresses
Database management -- Congresses
dc.title.none.fl_str_mv Unsupervised Topical Organization of Documents using Corpus-based Text Analysis
dc.type.none.fl_str_mv Conference Paper / Proceeding
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/conferenceObject
description This study aims to automate the topical keyword organization of a set of documents in an input text corpus. It is conducted in the context of a larger project investigating efficient unsupervised learning techniques to automatically extract relevant classes and their keyword descriptions from a set of United Nations (UN) documents, and to use the latter to produce reference corpora for classifying future UN documents. We assume that the reference classes are unknown in advance, and thus propose an unsupervised clustering approach which accepts as input a set of unstructured text documents and produces as output groups of similar documents describing similar topics. The input document feature vectors are augmented with term co-occurrence and relatedness scores produced from a distributional thesaurus built on the same (or a related) corpus. The augmented feature vectors are then run through a hierarchical clustering process to identify groups of similar documents, which serve as candidates for topical organization and keyword extraction. Experiments on a manually labelled dataset of documents classified against the UN's Sustainable Development Goals (SDGs) confirm the quality and potential of the approach.
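The pipeline summarized in the description (feature vectors, augmentation with relatedness scores, hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' implementation. The record does not say how the distributional thesaurus is built, which linkage is used, or where the hierarchy is cut, so everything below is an assumption for illustration: relatedness is approximated by Jaccard overlap of the document sets in which two terms occur, clustering uses average linkage over cosine similarity, and the function names and the `weight` and `threshold` parameters are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def term_vectors(docs):
    """Raw term-frequency vector (a Counter) for each document."""
    return [Counter(w for w in d.lower().split() if w.isalpha()) for d in docs]


def cooccurrence_relatedness(vectors):
    """Assumed stand-in for a distributional thesaurus: two terms are
    related in proportion to the Jaccard overlap of the documents
    they appear in."""
    doc_sets = {}
    for i, vec in enumerate(vectors):
        for term in vec:
            doc_sets.setdefault(term, set()).add(i)
    related = {}
    for a, b in combinations(sorted(doc_sets), 2):
        both = len(doc_sets[a] & doc_sets[b])
        if both:
            score = both / len(doc_sets[a] | doc_sets[b])
            related.setdefault(a, {})[b] = score
            related.setdefault(b, {})[a] = score
    return related


def augment(vec, related, weight=0.5):
    """Spread part of each term's frequency onto its related terms."""
    out = {t: float(f) for t, f in vec.items()}
    for term, freq in vec.items():
        for other, score in related.get(term, {}).items():
            out[other] = out.get(other, 0.0) + weight * score * freq
    return out


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster(vectors, threshold=0.3):
    """Average-linkage agglomerative clustering over cosine similarity.
    Merging stops once the closest pair falls below `threshold`."""
    clusters = [[i] for i in range(len(vectors))]

    def link(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = max(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        if link(clusters[x], clusters[y]) < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe: combinations guarantees x < y
    return clusters


def organize(docs, weight=0.5, threshold=0.3):
    """Vectorize, augment, and cluster; returns lists of document indices."""
    vectors = term_vectors(docs)
    related = cooccurrence_relatedness(vectors)
    augmented = [augment(v, related, weight) for v in vectors]
    return cluster(augmented, threshold)
```

Calling `organize` on a list of document strings returns groups of document indices; the most heavily weighted terms in each group's augmented vectors could then serve as candidate topical keywords.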
eu_rights_str_mv openAccess
format conferenceObject
id LAURepo_d33f4075c0eeff597ce0813ea3c41639
identifier_str_mv 9781450383141
Sarkissian, S., & Tekli, J. (2021, November). Unsupervised topical organization of documents using corpus-based text analysis. In Proceedings of 2021 13th International Conference on Management of Digital EcoSystems (MEDES 2021), (pp. 87-94). New York: ACM.
language_invalid_str_mv en
network_acronym_str LAURepo
network_name_str Lebanese American University repository
oai_identifier_str oai:laur.lau.edu.lb:10725/16285
publishDate 2021
publisher.none.fl_str_mv The Association for Computing Machinery
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
spelling Unsupervised Topical Organization of Documents using Corpus-based Text Analysis
Sarkissian, Sarkis
Tekli, Joe
Big data -- Congresses
Computer security -- Congresses
Database management -- Congresses
This study aims to automate the topical keyword organization of a set of documents in an input text corpus. It is conducted in the context of a larger project investigating efficient unsupervised learning techniques to automatically extract relevant classes and their keyword descriptions from a set of United Nations (UN) documents, and to use the latter to produce reference corpora for classifying future UN documents. We assume that the reference classes are unknown in advance, and thus propose an unsupervised clustering approach which accepts as input a set of unstructured text documents and produces as output groups of similar documents describing similar topics. The input document feature vectors are augmented with term co-occurrence and relatedness scores produced from a distributional thesaurus built on the same (or a related) corpus. The augmented feature vectors are then run through a hierarchical clustering process to identify groups of similar documents, which serve as candidates for topical organization and keyword extraction. Experiments on a manually labelled dataset of documents classified against the UN's Sustainable Development Goals (SDGs) confirm the quality and potential of the approach.
ACM
SIGAPP
Includes bibliographical references
The Association for Computing Machinery
2024-11-08T08:23:24Z
2024-11-08T08:23:24Z
2021
2021-11-09
Conference Paper / Proceeding
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/conferenceObject
9781450383141
http://hdl.handle.net/10725/16285
https://doi.org/10.1145/3444757.3485078
Sarkissian, S., & Tekli, J. (2021, November). Unsupervised topical organization of documents using corpus-based text analysis. In Proceedings of 2021 13th International Conference on Management of Digital EcoSystems (MEDES 2021), (pp. 87-94). New York: ACM.
http://libraries.lau.edu.lb/research/laur/terms-of-use/articles.php
https://dl.acm.org/doi/abs/10.1145/3444757.3485078
en
info:eu-repo/semantics/openAccess
oai:laur.lau.edu.lb:10725/16285
2024-11-08T08:43:07Z
spellingShingle Unsupervised Topical Organization of Documents using Corpus-based Text Analysis
Sarkissian, Sarkis
Big data -- Congresses
Computer security -- Congresses
Database management -- Congresses
status_str publishedVersion
title Unsupervised Topical Organization of Documents using Corpus-based Text Analysis
title_full Unsupervised Topical Organization of Documents using Corpus-based Text Analysis
title_fullStr Unsupervised Topical Organization of Documents using Corpus-based Text Analysis
title_full_unstemmed Unsupervised Topical Organization of Documents using Corpus-based Text Analysis
title_short Unsupervised Topical Organization of Documents using Corpus-based Text Analysis
title_sort Unsupervised Topical Organization of Documents using Corpus-based Text Analysis
topic Big data -- Congresses
Computer security -- Congresses
Database management -- Congresses
url http://hdl.handle.net/10725/16285
https://doi.org/10.1145/3444757.3485078
http://libraries.lau.edu.lb/research/laur/terms-of-use/articles.php
https://dl.acm.org/doi/abs/10.1145/3444757.3485078