An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification

End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new proper...

Bibliographic Details
Main Author: Cristina Espana-Bonet (19720063) (author)
Other Authors: Adam Csaba Varga (19720066) (author), Alberto Barron-Cedeno (19720069) (author), Josef van Genabith (19720072) (author)
Published: 2017
Subjects:
_version_ 1864513557125660672
author Cristina Espana-Bonet (19720063)
author2 Adam Csaba Varga (19720066)
Alberto Barron-Cedeno (19720069)
Josef van Genabith (19720072)
author2_role author
author
author
author_facet Cristina Espana-Bonet (19720063)
Adam Csaba Varga (19720066)
Alberto Barron-Cedeno (19720069)
Josef van Genabith (19720072)
author_role author
dc.creator.none.fl_str_mv Cristina Espana-Bonet (19720063)
Adam Csaba Varga (19720066)
Alberto Barron-Cedeno (19720069)
Josef van Genabith (19720072)
dc.date.none.fl_str_mv 2017-10-18T03:00:00Z
dc.identifier.none.fl_str_mv 10.1109/jstsp.2017.2764273
dc.relation.none.fl_str_mv https://figshare.com/articles/journal_contribution/An_Empirical_Analysis_of_NMT-Derived_Interlingual_Embeddings_and_Their_Use_in_Parallel_Sentence_Identification/27082699
dc.rights.none.fl_str_mv CC BY 4.0
info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv Information and computing sciences
Data management and data science
Machine learning
Training
Machine learning
Knowledge discovery
Vocabulary
Natural language processing
dc.title.none.fl_str_mv An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification
dc.type.none.fl_str_mv Text
Journal contribution
info:eu-repo/semantics/publishedVersion
text
contribution to journal
description <p dir="ltr">End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a new semantic representation of words, or sentences, which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present paper is twofold. First, we systematically study the neural machine translation (NMT) context vectors, i.e., the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as across semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining F<sub>1</sub> = 98.2% on data from a shared task when using only NMT context vectors. When context vectors are used jointly with similarity measures, F<sub>1</sub> reaches 98.9%.</p><h2>Other Information</h2><p dir="ltr">Published in: IEEE Journal of Selected Topics in Signal Processing<br>License:<a href="https://creativecommons.org/licenses/by/4.0/" rel="noreferrer" target="_blank"> https://creativecommons.org/licenses/by/4.0/</a><br>See article on publisher's website: <a href="https://dx.doi.org/10.1109/jstsp.2017.2764273" target="_blank">https://dx.doi.org/10.1109/jstsp.2017.2764273</a></p>
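The idea in the abstract of comparing sentences through their encoder outputs can be illustrated with a minimal sketch. It is not the authors' implementation: the mean pooling of per-token context vectors, the cosine measure, the `threshold` value of 0.5, and all function names are assumptions made for illustration, since the abstract does not specify them. The vectors are toy values, not real encoder outputs.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sentence_vector(context_vectors: Sequence[Sequence[float]]) -> Vector:
    """Mean-pool per-token encoder outputs into one sentence vector.

    Mean pooling is an assumption; the abstract only says that the
    encoder output is used as the sentence representation.
    """
    if not context_vectors:
        raise ValueError("need at least one context vector")
    n = len(context_vectors)
    dim = len(context_vectors[0])
    return [sum(v[i] for v in context_vectors) / n for i in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_parallel(src_ctx: Sequence[Sequence[float]],
                tgt_ctx: Sequence[Sequence[float]],
                threshold: float = 0.5) -> bool:
    """Flag a sentence pair as parallel when similarity meets the threshold.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out data.
    """
    sim = cosine(sentence_vector(src_ctx), sentence_vector(tgt_ctx))
    return sim >= threshold


# Toy 2-dimensional "context vectors" (one inner list per token).
source = [[1.0, 0.0], [0.8, 0.2]]
translation = [[0.9, 0.1]]
unrelated = [[0.0, 1.0]]

print(is_parallel(source, translation))
print(is_parallel(source, unrelated))
```

In this toy setup the translation pair scores much higher than the unrelated pair, which is the kind of separation the intrinsic evaluation in the abstract measures across translations, related sentences, and unrelated sentences.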
eu_rights_str_mv openAccess
id Manara2_e0465b45af2bf993d1664fcf75f0bbf2
identifier_str_mv 10.1109/jstsp.2017.2764273
network_acronym_str Manara2
network_name_str Manara2
oai_identifier_str oai:figshare.com:article/27082699
publishDate 2017
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
rights_invalid_str_mv CC BY 4.0
status_str publishedVersion
title An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification