Data & code.
<div><p>Text embedding plays a crucial role in natural language processing (NLP). Among various approaches, nonnegative matrix factorization (NMF) is an effective method for this purpose. However, the standard NMF approach, fundamentally based on the bag-of-words model, fails to utilize...
Saved in:
| Main Author: | |
|---|---|
| Other Authors: | , , |
| Published: |
2024
|
| Subjects: | |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1852024683292000256 |
|---|---|
| author | Mingming Li (502877) |
| author2 | Xingjie Wang (4796859) Chunhua Li (18618) Anping Zeng (9524345) |
| author2_role | author author author |
| author_facet | Mingming Li (502877) Xingjie Wang (4796859) Chunhua Li (18618) Anping Zeng (9524345) |
| author_role | author |
| dc.creator.none.fl_str_mv | Mingming Li (502877) Xingjie Wang (4796859) Chunhua Li (18618) Anping Zeng (9524345) |
| dc.date.none.fl_str_mv | 2024-12-05T18:44:43Z |
| dc.identifier.none.fl_str_mv | 10.1371/journal.pone.0314762.s001 |
| dc.relation.none.fl_str_mv | https://figshare.com/articles/dataset/Data_code_/27971963 |
| dc.rights.none.fl_str_mv | CC BY 4.0 info:eu-repo/semantics/openAccess |
| dc.subject.none.fl_str_mv | Sociology Science Policy Space Science Biological Sciences not elsewhere classified Mathematical Sciences not elsewhere classified Information Systems not elsewhere classified processing text embeddings natural language processing enhances semantic representation among various approaches nonnegative matrix factorization leverage semantic information model &# 8217 standard nmf approach conventional nmf models wasserstein metric used regularization term based wasserstein metric contextual information context matrix words model fundamentally based based regularization regularized nmf ordinary nmf novel nmf nmf framework wr model typically symmetric topic modeling spd ), significant loss results indicate proposed algorithms positive definite popular datasets numerical algorithms nlp ). may result manifold structure introduce modifications gradient computation findings suggest explicit expression effective method document clustering crucial role conduct experiments build upon also underscores |
| dc.title.none.fl_str_mv | Data & code. |
| dc.type.none.fl_str_mv | Dataset info:eu-repo/semantics/publishedVersion dataset |
| description | <div><p>Text embedding plays a crucial role in natural language processing (NLP). Among various approaches, nonnegative matrix factorization (NMF) is an effective method for this purpose. However, the standard NMF approach, fundamentally based on the bag-of-words model, fails to utilize the contextual information of documents and may result in a significant loss of semantics. To address this issue, we propose a new NMF scheme incorporating a regularization term based on the Wasserstein metric, which quantifies an approximation of a word-context matrix to leverage semantic information. Since the word-context matrix is typically symmetric and positive definite (SPD), the Wasserstein metric used in the regularization is related to a natural SPD manifold structure. We then build upon this manifold structure to integrate an explicit expression of the Wasserstein metric into the NMF framework. Additionally, we introduce modifications to the gradient computation to adapt three fundamental classes of numerical algorithms from ordinary NMF to tackle the Wasserstein-regularized NMF (NMF-WR) problem. To demonstrate the effectiveness of the NMF-WR model and the proposed algorithms, we conduct experiments on popular datasets, focusing on topic modeling and document clustering. The results indicate that the NMF-WR model exhibits superior performance compared to other conventional NMF models in processing text embeddings. These findings suggest that this novel NMF-WR framework not only enhances semantic representation but also underscores a commitment to the model’s reliability and interpretability.</p></div> |
| eu_rights_str_mv | openAccess |
| id | Manara_5bb91f52d4fedb331a995f21d737df4e |
| identifier_str_mv | 10.1371/journal.pone.0314762.s001 |
| network_acronym_str | Manara |
| network_name_str | ManaraRepo |
| oai_identifier_str | oai:figshare.com:article/27971963 |
| publishDate | 2024 |
| repository.mail.fl_str_mv | |
| repository.name.fl_str_mv | |
| repository_id_str | |
| rights_invalid_str_mv | CC BY 4.0 |
| spelling | Data & code.Mingming Li (502877)Xingjie Wang (4796859)Chunhua Li (18618)Anping Zeng (9524345)SociologyScience PolicySpace ScienceBiological Sciences not elsewhere classifiedMathematical Sciences not elsewhere classifiedInformation Systems not elsewhere classifiedprocessing text embeddingsnatural language processingenhances semantic representationamong various approachesnonnegative matrix factorizationleverage semantic informationmodel &# 8217standard nmf approachconventional nmf modelswasserstein metric usedregularization term basedwasserstein metriccontextual informationcontext matrixwords modelfundamentally basedbased regularizationregularized nmfordinary nmfnovel nmfnmf frameworkwr modeltypically symmetrictopic modelingspd ),significant lossresults indicateproposed algorithmspositive definitepopular datasetsnumerical algorithmsnlp ).may resultmanifold structureintroduce modificationsgradient computationfindings suggestexplicit expressioneffective methoddocument clusteringcrucial roleconduct experimentsbuild uponalso underscores<div><p>Text embedding plays a crucial role in natural language processing (NLP). Among various approaches, nonnegative matrix factorization (NMF) is an effective method for this purpose. However, the standard NMF approach, fundamentally based on the bag-of-words model, fails to utilize the contextual information of documents and may result in a significant loss of semantics. To address this issue, we propose a new NMF scheme incorporating a regularization term based on the Wasserstein metric, which quantifies an approximation of a word-context matrix to leverage semantic information. Since the word-context matrix is typically symmetric and positive definite (SPD), the Wasserstein metric used in the regularization is related to a natural SPD manifold structure. We then build upon this manifold structure to integrate an explicit expression of the Wasserstein metric into the NMF framework. Additionally, we introduce modifications to the gradient computation to adapt three fundamental classes of numerical algorithms from ordinary NMF to tackle the Wasserstein-regularized NMF (NMF-WR) problem. To demonstrate the effectiveness of the NMF-WR model and the proposed algorithms, we conduct experiments on popular datasets, focusing on topic modeling and document clustering. The results indicate that the NMF-WR model exhibits superior performance compared to other conventional NMF models in processing text embeddings. These findings suggest that this novel NMF-WR framework not only enhances semantic representation but also underscores a commitment to the model’s reliability and interpretability.</p></div>2024-12-05T18:44:43ZDatasetinfo:eu-repo/semantics/publishedVersiondataset10.1371/journal.pone.0314762.s001https://figshare.com/articles/dataset/Data_code_/27971963CC BY 4.0info:eu-repo/semantics/openAccessoai:figshare.com:article/279719632024-12-05T18:44:43Z |
| spellingShingle | Data & code. Mingming Li (502877) Sociology Science Policy Space Science Biological Sciences not elsewhere classified Mathematical Sciences not elsewhere classified Information Systems not elsewhere classified processing text embeddings natural language processing enhances semantic representation among various approaches nonnegative matrix factorization leverage semantic information model &# 8217 standard nmf approach conventional nmf models wasserstein metric used regularization term based wasserstein metric contextual information context matrix words model fundamentally based based regularization regularized nmf ordinary nmf novel nmf nmf framework wr model typically symmetric topic modeling spd ), significant loss results indicate proposed algorithms positive definite popular datasets numerical algorithms nlp ). may result manifold structure introduce modifications gradient computation findings suggest explicit expression effective method document clustering crucial role conduct experiments build upon also underscores |
| status_str | publishedVersion |
| title | Data & code. |
| title_full | Data & code. |
| title_fullStr | Data & code. |
| title_full_unstemmed | Data & code. |
| title_short | Data & code. |
| title_sort | Data & code. |
| topic | Sociology Science Policy Space Science Biological Sciences not elsewhere classified Mathematical Sciences not elsewhere classified Information Systems not elsewhere classified processing text embeddings natural language processing enhances semantic representation among various approaches nonnegative matrix factorization leverage semantic information model &# 8217 standard nmf approach conventional nmf models wasserstein metric used regularization term based wasserstein metric contextual information context matrix words model fundamentally based based regularization regularized nmf ordinary nmf novel nmf nmf framework wr model typically symmetric topic modeling spd ), significant loss results indicate proposed algorithms positive definite popular datasets numerical algorithms nlp ). may result manifold structure introduce modifications gradient computation findings suggest explicit expression effective method document clustering crucial role conduct experiments build upon also underscores |