An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification

End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new proper...

Bibliographic details
Main author:	Cristina Espana-Bonet (19720063) (author)
Other authors:	Adam Csaba Varga (19720066) (author), Alberto Barron-Cedeno (19720069) (author), Josef van Genabith (19720072) (author)
Published:	2017
Subjects:	Information and computing sciences; Data management and data science; Machine learning; Training; Knowledge discovery; Vocabulary; Natural language processing

_version_	1864513557125660672
author	Cristina Espana-Bonet (19720063)
author2	Adam Csaba Varga (19720066) Alberto Barron-Cedeno (19720069) Josef van Genabith (19720072)
author2_role	author author author
author_facet	Cristina Espana-Bonet (19720063) Adam Csaba Varga (19720066) Alberto Barron-Cedeno (19720069) Josef van Genabith (19720072)
author_role	author
dc.creator.none.fl_str_mv	Cristina Espana-Bonet (19720063) Adam Csaba Varga (19720066) Alberto Barron-Cedeno (19720069) Josef van Genabith (19720072)
dc.date.none.fl_str_mv	2017-10-18T03:00:00Z
dc.identifier.none.fl_str_mv	10.1109/jstsp.2017.2764273
dc.relation.none.fl_str_mv	https://figshare.com/articles/journal_contribution/An_Empirical_Analysis_of_NMT-Derived_Interlingual_Embeddings_and_Their_Use_in_Parallel_Sentence_Identification/27082699
dc.rights.none.fl_str_mv	CC BY 4.0 info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv	Information and computing sciences Data management and data science Machine learning Training Machine learning Knowledge discovery Vocabulary Natural language processing
dc.title.none.fl_str_mv	An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification
dc.type.none.fl_str_mv	Text Journal contribution info:eu-repo/semantics/publishedVersion text contribution to journal
description	<p dir="ltr">End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a new semantic representation of words (or sentences) which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present paper is twofold. First, we systematically study the neural machine translation (NMT) context vectors, i.e., the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as across semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining F<sub>1</sub> = 98.2% on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, F<sub>1</sub> reaches 98.9%.</p><h2>Other Information</h2><p dir="ltr">Published in: IEEE Journal of Selected Topics in Signal Processing<br>License: <a href="https://creativecommons.org/licenses/by/4.0/" rel="noreferrer" target="_blank">https://creativecommons.org/licenses/by/4.0/</a><br>See article on publisher's website: <a href="https://dx.doi.org/10.1109/jstsp.2017.2764273" target="_blank">https://dx.doi.org/10.1109/jstsp.2017.2764273</a></p>
eu_rights_str_mv	openAccess
id	Manara2_e0465b45af2bf993d1664fcf75f0bbf2
identifier_str_mv	10.1109/jstsp.2017.2764273
network_acronym_str	Manara2
network_name_str	Manara2
oai_identifier_str	oai:figshare.com:article/27082699
publishDate	2017
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
rights_invalid_str_mv	CC BY 4.0
spellingShingle	An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification Cristina Espana-Bonet (19720063) Information and computing sciences Data management and data science Machine learning Training Machine learning Knowledge discovery Vocabulary Natural language processing
status_str	publishedVersion
title	An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification
title_full	An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification
title_fullStr	An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification
title_full_unstemmed	An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification
title_short	An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification
title_sort	An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification
topic	Information and computing sciences Data management and data science Machine learning Training Machine learning Knowledge discovery Vocabulary Natural language processing

Similar items