An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification
End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new proper...
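| Main Author: | Cristina Espana-Bonet |
|---|---|
| Other Authors: | Adam Csaba Varga, Alberto Barron-Cedeno, Josef van Genabith |
| Published: | 2017 |
| Subjects: | Information and computing sciences; Data management and data science; Machine learning; Training; Knowledge discovery; Vocabulary; Natural language processing |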
| _version_ | 1864513557125660672 |
|---|---|
| author | Cristina Espana-Bonet (19720063) |
| author2 | Adam Csaba Varga (19720066); Alberto Barron-Cedeno (19720069); Josef van Genabith (19720072) |
| author2_role | author author author |
| author_facet | Cristina Espana-Bonet (19720063); Adam Csaba Varga (19720066); Alberto Barron-Cedeno (19720069); Josef van Genabith (19720072) |
| author_role | author |
| dc.creator.none.fl_str_mv | Cristina Espana-Bonet (19720063); Adam Csaba Varga (19720066); Alberto Barron-Cedeno (19720069); Josef van Genabith (19720072) |
| dc.date.none.fl_str_mv | 2017-10-18T03:00:00Z |
| dc.identifier.none.fl_str_mv | 10.1109/jstsp.2017.2764273 |
| dc.relation.none.fl_str_mv | https://figshare.com/articles/journal_contribution/An_Empirical_Analysis_of_NMT-Derived_Interlingual_Embeddings_and_Their_Use_in_Parallel_Sentence_Identification/27082699 |
| dc.rights.none.fl_str_mv | CC BY 4.0 info:eu-repo/semantics/openAccess |
| dc.subject.none.fl_str_mv | Information and computing sciences; Data management and data science; Machine learning; Training; Machine learning; Knowledge discovery; Vocabulary; Natural language processing |
| dc.title.none.fl_str_mv | An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification |
| dc.type.none.fl_str_mv | Text Journal contribution info:eu-repo/semantics/publishedVersion text contribution to journal |
| description | End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a new semantic representation of words (or sentences) which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present paper is twofold. First, we systematically study the neural machine translation (NMT) context vectors, i.e., the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining F<sub>1</sub> = 98.2% on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, F<sub>1</sub> reaches 98.9%. Other Information: Published in: IEEE Journal of Selected Topics in Signal Processing. License: https://creativecommons.org/licenses/by/4.0/. See article on publisher's website: https://dx.doi.org/10.1109/jstsp.2017.2764273 |
| eu_rights_str_mv | openAccess |
| id | Manara2_e0465b45af2bf993d1664fcf75f0bbf2 |
| identifier_str_mv | 10.1109/jstsp.2017.2764273 |
| network_acronym_str | Manara2 |
| network_name_str | Manara2 |
| oai_identifier_str | oai:figshare.com:article/27082699 |
| publishDate | 2017 |
| repository.mail.fl_str_mv | |
| repository.name.fl_str_mv | |
| repository_id_str | |
| rights_invalid_str_mv | CC BY 4.0 |
| status_str | publishedVersion |
| title | An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification |
| topic | Information and computing sciences; Data management and data science; Machine learning; Training; Machine learning; Knowledge discovery; Vocabulary; Natural language processing |
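
The description above mentions measuring similarities between sentence representations and using them to identify parallel sentences. As a rough illustration of that idea only, the sketch below pairs each source sentence with its most similar target sentence by cosine similarity and keeps the pair if the score clears a threshold. It is not the authors' system: the record does not describe their classifier or features, and the function names, the toy two-dimensional vectors, and the `threshold` value are all hypothetical. In practice the vectors would come from an NMT encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_parallel(src_vecs, tgt_vecs, threshold=0.8):
    """Return (source index, target index, score) for each source sentence
    whose best-matching target sentence scores at or above `threshold`.

    `threshold` is a hypothetical value chosen for illustration only.
    """
    if not tgt_vecs:
        return []
    pairs = []
    for i, u in enumerate(src_vecs):
        # Score every target candidate and keep the most similar one.
        best_j, best_score = max(
            ((j, cosine(u, v)) for j, v in enumerate(tgt_vecs)),
            key=lambda item: item[1],
        )
        if best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


if __name__ == "__main__":
    # Toy stand-ins for sentence vectors; real ones would be high-dimensional.
    source = [[1.0, 0.0], [0.0, 1.0]]
    target = [[0.0, 2.0], [3.0, 0.1], [-1.0, 0.0]]
    for i, j, score in find_parallel(source, target):
        print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

A fixed threshold on a single similarity score is the simplest possible decision rule; the description reports that combining context vectors with additional similarity measures improved F<sub>1</sub>, which suggests a learned combination of several features rather than one cutoff.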