Data & code.

<div><p>Text embedding plays a crucial role in natural language processing (NLP). Among various approaches, nonnegative matrix factorization (NMF) is an effective method for this purpose. However, the standard NMF approach, fundamentally based on the bag-of-words model, fails to utilize...

Full description

Saved in:
Bibliographic Details
Main Author: Mingming Li (502877) (author)
Other Authors: Xingjie Wang (4796859) (author), Chunhua Li (18618) (author), Anping Zeng (9524345) (author)
Published: 2024
Subjects:
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1852024683292000256
author Mingming Li (502877)
author2 Xingjie Wang (4796859)
Chunhua Li (18618)
Anping Zeng (9524345)
author2_role author
author
author
author_facet Mingming Li (502877)
Xingjie Wang (4796859)
Chunhua Li (18618)
Anping Zeng (9524345)
author_role author
dc.creator.none.fl_str_mv Mingming Li (502877)
Xingjie Wang (4796859)
Chunhua Li (18618)
Anping Zeng (9524345)
dc.date.none.fl_str_mv 2024-12-05T18:44:43Z
dc.identifier.none.fl_str_mv 10.1371/journal.pone.0314762.s001
dc.relation.none.fl_str_mv https://figshare.com/articles/dataset/Data_code_/27971963
dc.rights.none.fl_str_mv CC BY 4.0
info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv Sociology
Science Policy
Space Science
Biological Sciences not elsewhere classified
Mathematical Sciences not elsewhere classified
Information Systems not elsewhere classified
processing text embeddings
natural language processing
enhances semantic representation
among various approaches
nonnegative matrix factorization
leverage semantic information
model &# 8217
standard nmf approach
conventional nmf models
wasserstein metric used
regularization term based
wasserstein metric
contextual information
context matrix
words model
fundamentally based
based regularization
regularized nmf
ordinary nmf
novel nmf
nmf framework
wr model
typically symmetric
topic modeling
spd ),
significant loss
results indicate
proposed algorithms
positive definite
popular datasets
numerical algorithms
nlp ).
may result
manifold structure
introduce modifications
gradient computation
findings suggest
explicit expression
effective method
document clustering
crucial role
conduct experiments
build upon
also underscores
dc.title.none.fl_str_mv Data & code.
dc.type.none.fl_str_mv Dataset
info:eu-repo/semantics/publishedVersion
dataset
description <div><p>Text embedding plays a crucial role in natural language processing (NLP). Among various approaches, nonnegative matrix factorization (NMF) is an effective method for this purpose. However, the standard NMF approach, fundamentally based on the bag-of-words model, fails to utilize the contextual information of documents and may result in a significant loss of semantics. To address this issue, we propose a new NMF scheme incorporating a regularization term based on the Wasserstein metric, which quantifies an approximation of a word-context matrix to leverage semantic information. Since the word-context matrix is typically symmetric and positive definite (SPD), the Wasserstein metric used in the regularization is related to a natural SPD manifold structure. We then build upon this manifold structure to integrate an explicit expression of the Wasserstein metric into the NMF framework. Additionally, we introduce modifications to the gradient computation to adapt three fundamental classes of numerical algorithms from ordinary NMF to tackle the Wasserstein-regularized NMF (NMF-WR) problem. To demonstrate the effectiveness of the NMF-WR model and the proposed algorithms, we conduct experiments on popular datasets, focusing on topic modeling and document clustering. The results indicate that the NMF-WR model exhibits superior performance compared to other conventional NMF models in processing text embeddings. These findings suggest that this novel NMF-WR framework not only enhances semantic representation but also underscores a commitment to the model’s reliability and interpretability.</p></div>
eu_rights_str_mv openAccess
id Manara_5bb91f52d4fedb331a995f21d737df4e
identifier_str_mv 10.1371/journal.pone.0314762.s001
network_acronym_str Manara
network_name_str ManaraRepo
oai_identifier_str oai:figshare.com:article/27971963
publishDate 2024
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
rights_invalid_str_mv CC BY 4.0
spelling Data & code.Mingming Li (502877)Xingjie Wang (4796859)Chunhua Li (18618)Anping Zeng (9524345)SociologyScience PolicySpace ScienceBiological Sciences not elsewhere classifiedMathematical Sciences not elsewhere classifiedInformation Systems not elsewhere classifiedprocessing text embeddingsnatural language processingenhances semantic representationamong various approachesnonnegative matrix factorizationleverage semantic informationmodel &# 8217standard nmf approachconventional nmf modelswasserstein metric usedregularization term basedwasserstein metriccontextual informationcontext matrixwords modelfundamentally basedbased regularizationregularized nmfordinary nmfnovel nmfnmf frameworkwr modeltypically symmetrictopic modelingspd ),significant lossresults indicateproposed algorithmspositive definitepopular datasetsnumerical algorithmsnlp ).may resultmanifold structureintroduce modificationsgradient computationfindings suggestexplicit expressioneffective methoddocument clusteringcrucial roleconduct experimentsbuild uponalso underscores<div><p>Text embedding plays a crucial role in natural language processing (NLP). Among various approaches, nonnegative matrix factorization (NMF) is an effective method for this purpose. However, the standard NMF approach, fundamentally based on the bag-of-words model, fails to utilize the contextual information of documents and may result in a significant loss of semantics. To address this issue, we propose a new NMF scheme incorporating a regularization term based on the Wasserstein metric, which quantifies an approximation of a word-context matrix to leverage semantic information. Since the word-context matrix is typically symmetric and positive definite (SPD), the Wasserstein metric used in the regularization is related to a natural SPD manifold structure. We then build upon this manifold structure to integrate an explicit expression of the Wasserstein metric into the NMF framework. Additionally, we introduce modifications to the gradient computation to adapt three fundamental classes of numerical algorithms from ordinary NMF to tackle the Wasserstein-regularized NMF (NMF-WR) problem. To demonstrate the effectiveness of the NMF-WR model and the proposed algorithms, we conduct experiments on popular datasets, focusing on topic modeling and document clustering. The results indicate that the NMF-WR model exhibits superior performance compared to other conventional NMF models in processing text embeddings. These findings suggest that this novel NMF-WR framework not only enhances semantic representation but also underscores a commitment to the model’s reliability and interpretability.</p></div>2024-12-05T18:44:43ZDatasetinfo:eu-repo/semantics/publishedVersiondataset10.1371/journal.pone.0314762.s001https://figshare.com/articles/dataset/Data_code_/27971963CC BY 4.0info:eu-repo/semantics/openAccessoai:figshare.com:article/279719632024-12-05T18:44:43Z
spellingShingle Data & code.
Mingming Li (502877)
Sociology
Science Policy
Space Science
Biological Sciences not elsewhere classified
Mathematical Sciences not elsewhere classified
Information Systems not elsewhere classified
processing text embeddings
natural language processing
enhances semantic representation
among various approaches
nonnegative matrix factorization
leverage semantic information
model &# 8217
standard nmf approach
conventional nmf models
wasserstein metric used
regularization term based
wasserstein metric
contextual information
context matrix
words model
fundamentally based
based regularization
regularized nmf
ordinary nmf
novel nmf
nmf framework
wr model
typically symmetric
topic modeling
spd ),
significant loss
results indicate
proposed algorithms
positive definite
popular datasets
numerical algorithms
nlp ).
may result
manifold structure
introduce modifications
gradient computation
findings suggest
explicit expression
effective method
document clustering
crucial role
conduct experiments
build upon
also underscores
status_str publishedVersion
title Data & code.
title_full Data & code.
title_fullStr Data & code.
title_full_unstemmed Data & code.
title_short Data & code.
title_sort Data & code.
topic Sociology
Science Policy
Space Science
Biological Sciences not elsewhere classified
Mathematical Sciences not elsewhere classified
Information Systems not elsewhere classified
processing text embeddings
natural language processing
enhances semantic representation
among various approaches
nonnegative matrix factorization
leverage semantic information
model &# 8217
standard nmf approach
conventional nmf models
wasserstein metric used
regularization term based
wasserstein metric
contextual information
context matrix
words model
fundamentally based
based regularization
regularized nmf
ordinary nmf
novel nmf
nmf framework
wr model
typically symmetric
topic modeling
spd ),
significant loss
results indicate
proposed algorithms
positive definite
popular datasets
numerical algorithms
nlp ).
may result
manifold structure
introduce modifications
gradient computation
findings suggest
explicit expression
effective method
document clustering
crucial role
conduct experiments
build upon
also underscores