Turkronicles: diachronic resources for the fast evolving Turkish language
| Main author: | |
|---|---|
| Other authors: | |
| Published in: | 2025 |
| Subjects: | |
| Abstract: | <p dir="ltr">Over the past century, the Turkish language has undergone substantial changes, mainly driven by governmental interventions. This relatively rapid linguistic evolution complicates the processing of historical Turkish documents. In this work, we introduce Turkronicles, a diachronic corpus for Turkish derived from the Official Gazette of Türkiye and the records of the Grand National Assembly of Türkiye, spanning the period from 1920 to 2024. Turkronicles contains 46,328 documents and 1.1B tokens, making it an important resource for analyzing the linguistic evolution of Turkish and developing models to process historical Turkish documents. In addition, we develop a library for conducting linguistic analysis on diachronic corpora, and we train a model to fix OCR errors within the documents. We also explore how Turkish vocabulary and writing conventions have changed since 1920 using our corpus. Our analysis reveals that the vocabulary has changed significantly and that multiple spellings exist for several words. Specifically, we show that vocabulary divergence increases over time, as expected: the similarity between the periods 1920–1929 and 2010–2019 is 57%. Despite the substantial vocabulary changes, we demonstrate that it is possible to identify old Turkish words that have the same meanings as newly coined ones using word embeddings. Regarding writing conventions, we find a noticeable decrease in the use of the circumflex. In addition, words ending with the letters ‘-b’ and ‘-d’ have been largely replaced by their counterparts ending with ‘-p’ and ‘-t’, respectively, although the former are still in use. Lastly, we observe an increase in the usage of words that comply with vowel harmony rules as a result of the “purification” process of the Turkish Language Reform. Overall, our study quantitatively highlights the dramatic changes in Turkish across various linguistic aspects.</p><h2 dir="ltr">Other Information</h2><p dir="ltr">Published in: Language Resources and Evaluation<br>License: <a href="https://creativecommons.org/licenses/by/4.0" target="_blank">https://creativecommons.org/licenses/by/4.0</a><br>See article on publisher's website: <a href="https://dx.doi.org/10.1007/s10579-025-09857-w" target="_blank">https://dx.doi.org/10.1007/s10579-025-09857-w</a></p> |
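The abstract mentions two writing-convention measurements: circumflex usage and compliance with vowel harmony. This record does not show the authors' library, so the Python sketch below is only an illustration of how such checks might be approximated. All function names are hypothetical. It covers front/back harmony only, and it treats circumflexed vowels as their plain back or front counterparts, which is a simplification.

```python
# Illustrative sketch only; not the Turkronicles library.
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")
CIRCUMFLEX = set("âîû")

# Assumption: for harmony checks, a circumflexed vowel counts as its plain base.
_BASE = {"â": "a", "î": "i", "û": "u"}


def turkish_lower(word):
    # Turkish casing: 'I' lowercases to dotless 'ı' and 'İ' to dotted 'i'.
    return word.replace("I", "ı").replace("İ", "i").lower()


def has_circumflex(word):
    # True if the word contains â, î, or û in either case.
    return any(ch in CIRCUMFLEX for ch in turkish_lower(word))


def obeys_front_back_harmony(word):
    # A word passes if its vowels are all front or all back.
    letters = [_BASE.get(ch, ch) for ch in turkish_lower(word)]
    vowels = [v for v in letters if v in FRONT_VOWELS or v in BACK_VOWELS]
    all_front = all(v in FRONT_VOWELS for v in vowels)
    all_back = all(v in BACK_VOWELS for v in vowels)
    return all_front or all_back


def harmony_rate(words):
    # Fraction of words passing the front/back check; 0.0 for an empty list.
    words = list(words)
    if not words:
        return 0.0
    return sum(obeys_front_back_harmony(w) for w in words) / len(words)
```

Applying `harmony_rate` and a circumflex count to the word list of each decade would give a rough per-period trend. A faithful analysis would also need rounding harmony, compound words, and the many suffixes that do not alternate.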