Unsupervised Topical Organization of Documents using Corpus-based Text Analysis
This study aims at automating the process of topical keyword organization of a set of documents in an input text corpus. It is conducted in the context of a larger project to investigate efficient unsupervised learning techniques to automatically extract relevant classes and their keyword descriptions...
| Main Author: | Sarkissian, Sarkis |
|---|---|
| Other Authors: | Tekli, Joe |
| Format: | conferenceObject |
| Published: | 2021 |
| Subjects: | Big data -- Congresses Computer security -- Congresses Database management -- Congresses |
| Online Access: | http://hdl.handle.net/10725/16285 https://doi.org/10.1145/3444757.3485078 http://libraries.lau.edu.lb/research/laur/terms-of-use/articles.php https://dl.acm.org/doi/abs/10.1145/3444757.3485078 |
| _version_ | 1864513472549617664 |
|---|---|
| author | Sarkissian, Sarkis |
| author2 | Tekli, Joe |
| author2_role | author |
| author_facet | Sarkissian, Sarkis Tekli, Joe |
| author_role | author |
| dc.creator.none.fl_str_mv | Sarkissian, Sarkis Tekli, Joe |
| dc.date.none.fl_str_mv | 2021 2021-11-09 2024-11-08T08:23:24Z 2024-11-08T08:23:24Z |
| dc.identifier.none.fl_str_mv | 9781450383141 http://hdl.handle.net/10725/16285 https://doi.org/10.1145/3444757.3485078 Sarkissian, S., & Tekli, J. (2021, November). Unsupervised topical organization of documents using corpus-based text analysis. In Proceedings of 2021 13th International Conference on Management of Digital EcoSystems (MEDES 2021), (pp. 87-94). New York: ACM. http://libraries.lau.edu.lb/research/laur/terms-of-use/articles.php https://dl.acm.org/doi/abs/10.1145/3444757.3485078 |
| dc.language.none.fl_str_mv | en |
| dc.publisher.none.fl_str_mv | The Association for Computing Machinery |
| dc.rights.*.fl_str_mv | info:eu-repo/semantics/openAccess |
| dc.subject.none.fl_str_mv | Big data -- Congresses Computer security -- Congresses Database management -- Congresses |
| dc.title.none.fl_str_mv | Unsupervised Topical Organization of Documents using Corpus-based Text Analysis |
| dc.type.none.fl_str_mv | Conference Paper / Proceeding info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/conferenceObject |
| description | This study aims at automating the process of topical keyword organization of a set of documents in an input text corpus. It is conducted in the context of a larger project to investigate efficient unsupervised learning techniques to automatically extract relevant classes and their keyword descriptions from a set of United Nations (UN) documents, and to use the latter to produce reference corpora for classifying future UN documents. We assume that the reference classes are unknown in advance, and thus suggest an unsupervised clustering approach which accepts as input a collection of unstructured text documents, and produces as output groups of similar documents describing similar topics. The input document feature vectors are augmented with term co-occurrence and relatedness scores produced from a distributional thesaurus built on the same (or a related) corpus. The augmented feature vectors are then run through a hierarchical clustering process to identify groups of similar documents, which serve as candidates for topical organization and keyword extraction. Experiments on a manually labelled dataset of documents classified against the UN's Sustainable Development Goals (SDGs) confirm the quality and potential of the approach. |
| eu_rights_str_mv | openAccess |
| format | conferenceObject |
| id | LAURepo_d33f4075c0eeff597ce0813ea3c41639 |
| identifier_str_mv | 9781450383141 Sarkissian, S., & Tekli, J. (2021, November). Unsupervised topical organization of documents using corpus-based text analysis. In Proceedings of 2021 13th International Conference on Management of Digital EcoSystems (MEDES 2021), (pp. 87-94). New York: ACM. |
| language_invalid_str_mv | en |
| network_acronym_str | LAURepo |
| network_name_str | Lebanese American University repository |
| oai_identifier_str | oai:laur.lau.edu.lb:10725/16285 |
| publishDate | 2021 |
| publisher.none.fl_str_mv | The Association for Computing Machinery |
| repository.mail.fl_str_mv | |
| repository.name.fl_str_mv | |
| repository_id_str | |
| spelling | Unsupervised Topical Organization of Documents using Corpus-based Text Analysis Sarkissian, Sarkis Tekli, Joe Big data -- Congresses Computer security -- Congresses Database management -- Congresses This study aims at automating the process of topical keyword organization of a set of documents in an input text corpus. It is conducted in the context of a larger project to investigate efficient unsupervised learning techniques to automatically extract relevant classes and their keyword descriptions from a set of United Nations (UN) documents, and to use the latter to produce reference corpora for classifying future UN documents. We assume that the reference classes are unknown in advance, and thus suggest an unsupervised clustering approach which accepts as input a collection of unstructured text documents, and produces as output groups of similar documents describing similar topics. The input document feature vectors are augmented with term co-occurrence and relatedness scores produced from a distributional thesaurus built on the same (or a related) corpus. The augmented feature vectors are then run through a hierarchical clustering process to identify groups of similar documents, which serve as candidates for topical organization and keyword extraction. Experiments on a manually labelled dataset of documents classified against the UN's Sustainable Development Goals (SDGs) confirm the quality and potential of the approach. ACM SIGAPP Includes bibliographical references The Association for Computing Machinery 2024-11-08T08:23:24Z 2024-11-08T08:23:24Z 2021 2021-11-09 Conference Paper / Proceeding info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/conferenceObject 9781450383141 http://hdl.handle.net/10725/16285 https://doi.org/10.1145/3444757.3485078 Sarkissian, S., & Tekli, J. (2021, November). Unsupervised topical organization of documents using corpus-based text analysis. In Proceedings of 2021 13th International Conference on Management of Digital EcoSystems (MEDES 2021), (pp. 87-94). New York: ACM. http://libraries.lau.edu.lb/research/laur/terms-of-use/articles.php https://dl.acm.org/doi/abs/10.1145/3444757.3485078 en info:eu-repo/semantics/openAccess oai:laur.lau.edu.lb:10725/16285 2024-11-08T08:43:07Z |
| spellingShingle | Unsupervised Topical Organization of Documents using Corpus-based Text Analysis Sarkissian, Sarkis Big data -- Congresses Computer security -- Congresses Database management -- Congresses |
| status_str | publishedVersion |
| title | Unsupervised Topical Organization of Documents using Corpus-based Text Analysis |
| title_full | Unsupervised Topical Organization of Documents using Corpus-based Text Analysis |
| title_fullStr | Unsupervised Topical Organization of Documents using Corpus-based Text Analysis |
| title_full_unstemmed | Unsupervised Topical Organization of Documents using Corpus-based Text Analysis |
| title_short | Unsupervised Topical Organization of Documents using Corpus-based Text Analysis |
| title_sort | Unsupervised Topical Organization of Documents using Corpus-based Text Analysis |
| topic | Big data -- Congresses Computer security -- Congresses Database management -- Congresses |
| url | http://hdl.handle.net/10725/16285 https://doi.org/10.1145/3444757.3485078 http://libraries.lau.edu.lb/research/laur/terms-of-use/articles.php https://dl.acm.org/doi/abs/10.1145/3444757.3485078 |
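
The pipeline summarized in the description field (term vectors augmented with co-occurrence relatedness scores, then hierarchical clustering, then keyword extraction) can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the record does not specify the weighting scheme, the relatedness measure, the linkage criterion, or the stopping rule, so the Dice coefficient, the `alpha` weight, average linkage, and the similarity `threshold` below are all illustrative choices, and every function name is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase whitespace tokenizer that keeps alphabetic tokens only."""
    return [w for w in text.lower().split() if w.isalpha()]


def cooccurrence_relatedness(docs_tokens):
    """Assumed stand-in for a distributional thesaurus: document-level
    co-occurrence between terms, scored with the Dice coefficient."""
    df, pair = Counter(), Counter()
    for toks in docs_tokens:
        vocab = sorted(set(toks))
        df.update(vocab)
        pair.update(combinations(vocab, 2))
    rel = {}
    for (a, b), n in pair.items():
        score = 2 * n / (df[a] + df[b])
        rel.setdefault(a, {})[b] = score
        rel.setdefault(b, {})[a] = score
    return rel


def augment(toks, rel, alpha=0.5):
    """Term-frequency vector plus alpha-weighted mass for related terms."""
    tf = Counter(toks)
    out = dict(tf)
    for term, freq in tf.items():
        for other, score in rel.get(term, {}).items():
            out[other] = out.get(other, 0.0) + alpha * freq * score
    return out


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster(vectors, threshold=0.5):
    """Agglomerative clustering with average linkage; merging stops when
    the best pair of clusters is less similar than the threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def sim(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        if sim(clusters[a], clusters[b]) < threshold:
            break
        clusters[a] += clusters.pop(b)  # a < b, so index a stays valid
    return clusters


def top_keywords(members, vectors, k=2):
    """Highest-weight terms of a cluster, summed over its member vectors."""
    total = Counter()
    for i in members:
        total.update(vectors[i])
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


docs = [
    "clean water sanitation access",
    "water sanitation hygiene services",
    "quality education schools teachers",
    "education schools learning teachers",
]
tokens = [tokenize(d) for d in docs]
rel = cooccurrence_relatedness(tokens)
vectors = [augment(t, rel) for t in tokens]
for members in cluster(vectors):
    print(sorted(members), top_keywords(members, vectors))
```

On this toy corpus the two water-related documents and the two education-related documents share no terms across topics, so their augmented vectors have zero cross-topic similarity and the merge loop stops with two clusters. A real corpus would need stop-word removal, a more robust relatedness measure, and a tuned threshold.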