Turkronicles: diachronic resources for the fast evolving Turkish language
Over the past century, the Turkish language has undergone substantial changes, mainly driven by governmental interventions. The relatively rapid linguistic evolution of the Turkish language complicates the processing of historical Turkish documents. In this work, we intr...
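| Main Author: | Togay Yazar |
|---|---|
| Other Authors: | Mucahid Kutlu, İsa Kerem Bayırlı |
| Published: | 2025 |
| Subjects: | Information and computing sciences; Artificial intelligence; Language, communication and culture; Language studies; Diachronic corpora; Diachronic analysis; Turkish corpus; Frequency analysis |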
| _version_ | 1864513531579203584 |
|---|---|
| author | Togay Yazar (22828175) |
| author2 | Mucahid Kutlu (14274942); İsa Kerem Bayırlı (22828178) |
| author2_role | author; author |
| author_facet | Togay Yazar (22828175); Mucahid Kutlu (14274942); İsa Kerem Bayırlı (22828178) |
| author_role | author |
| dc.creator.none.fl_str_mv | Togay Yazar (22828175); Mucahid Kutlu (14274942); İsa Kerem Bayırlı (22828178) |
| dc.date.none.fl_str_mv | 2025-07-29T09:00:00Z |
| dc.identifier.none.fl_str_mv | 10.1007/s10579-025-09857-w |
| dc.relation.none.fl_str_mv | https://figshare.com/articles/journal_contribution/Turkronicles_diachronic_resources_for_the_fast_evolving_Turkish_language/30860318 |
| dc.rights.none.fl_str_mv | CC BY 4.0; info:eu-repo/semantics/openAccess |
| dc.subject.none.fl_str_mv | Information and computing sciences; Artificial intelligence; Language, communication and culture; Language studies; Diachronic corpora; Diachronic analysis; Turkish corpus; Frequency analysis |
| dc.title.none.fl_str_mv | Turkronicles: diachronic resources for the fast evolving Turkish language |
| dc.type.none.fl_str_mv | Text; Journal contribution; info:eu-repo/semantics/publishedVersion; text; contribution to journal |
| description | Over the past century, the Turkish language has undergone substantial changes, mainly driven by governmental interventions. The relatively rapid linguistic evolution of the Turkish language complicates the processing of historical Turkish documents. In this work, we introduce Turkronicles, a diachronic corpus for Turkish derived from the Official Gazette of Türkiye and the records of the Grand National Assembly of Türkiye, spanning the period from 1920 to 2024. Turkronicles contains 46,328 documents and 1.1B tokens, making it an important resource for analyzing the linguistic evolution of Turkish and developing models to process historical Turkish documents. In addition, we develop a library to conduct linguistic analysis on diachronic corpora easily. Furthermore, we train a model to fix OCR errors within the documents. Moreover, we explore how the Turkish vocabulary and the writing conventions have changed since 1920 using our corpus. Our analysis reveals that the vocabulary has changed significantly and that multiple spellings exist for several words. Specifically, we show that vocabulary divergence increases over time, as expected. Due to such significant vocabulary change in Turkish over time, the similarity between the periods 1920–1929 and 2010–2019 is 57%. Despite the substantial vocabulary changes, we demonstrate that it is possible to identify old Turkish words that have the same meanings as newly coined ones using word embeddings. Regarding writing conventions, we found a noticeable decrease in the use of the circumflex. In addition, words ending with the letters ‘-b’ and ‘-d’ have been largely replaced by their counterparts ending with ‘-p’ and ‘-t’, respectively, although the former are still in use. Lastly, we observe an increase in the usage of words that comply with vowel harmony rules as a result of the “purification” process of the Turkish Language Reform. Overall, our study quantitatively highlights the dramatic changes in Turkish from various linguistic aspects.<br>Other Information<br>Published in: Language Resources and Evaluation<br>License: https://creativecommons.org/licenses/by/4.0<br>See article on publisher's website: https://dx.doi.org/10.1007/s10579-025-09857-w |
| eu_rights_str_mv | openAccess |
| id | Manara2_889c7ef0acdaa216d2edb8773b9e5c1b |
| identifier_str_mv | 10.1007/s10579-025-09857-w |
| network_acronym_str | Manara2 |
| network_name_str | Manara2 |
| oai_identifier_str | oai:figshare.com:article/30860318 |
| publishDate | 2025 |
| repository.mail.fl_str_mv | |
| repository.name.fl_str_mv | |
| repository_id_str | |
| rights_invalid_str_mv | CC BY 4.0 |
| spellingShingle | Turkronicles: diachronic resources for the fast evolving Turkish language; Togay Yazar (22828175); Information and computing sciences; Artificial intelligence; Language, communication and culture; Language studies; Diachronic corpora; Diachronic analysis; Turkish corpus; Frequency analysis |
| status_str | publishedVersion |
| title | Turkronicles: diachronic resources for the fast evolving Turkish language |
| title_full | Turkronicles: diachronic resources for the fast evolving Turkish language |
| title_fullStr | Turkronicles: diachronic resources for the fast evolving Turkish language |
| title_full_unstemmed | Turkronicles: diachronic resources for the fast evolving Turkish language |
| title_short | Turkronicles: diachronic resources for the fast evolving Turkish language |
| title_sort | Turkronicles: diachronic resources for the fast evolving Turkish language |
| topic | Information and computing sciences; Artificial intelligence; Language, communication and culture; Language studies; Diachronic corpora; Diachronic analysis; Turkish corpus; Frequency analysis |