Turkronicles: diachronic resources for the fast evolving Turkish language

Over the past century, the Turkish language has undergone substantial changes, mainly driven by governmental interventions. The relatively rapid linguistic evolution of the Turkish language complicates the processing of historical Turkish documents. In this work, we intr...

Bibliographic Details
Main Author: Togay Yazar (22828175) (author)
Other Authors: Mucahid Kutlu (14274942) (author), İsa Kerem Bayırlı (22828178) (author)
Published: 2025
Subjects:
_version_ 1864513531579203584
author Togay Yazar (22828175)
author2 Mucahid Kutlu (14274942)
İsa Kerem Bayırlı (22828178)
author2_role author
author
author_facet Togay Yazar (22828175)
Mucahid Kutlu (14274942)
İsa Kerem Bayırlı (22828178)
author_role author
dc.creator.none.fl_str_mv Togay Yazar (22828175)
Mucahid Kutlu (14274942)
İsa Kerem Bayırlı (22828178)
dc.date.none.fl_str_mv 2025-07-29T09:00:00Z
dc.identifier.none.fl_str_mv 10.1007/s10579-025-09857-w
dc.relation.none.fl_str_mv https://figshare.com/articles/journal_contribution/Turkronicles_diachronic_resources_for_the_fast_evolving_Turkish_language/30860318
dc.rights.none.fl_str_mv CC BY 4.0
info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv Information and computing sciences
Artificial intelligence
Language, communication and culture
Language studies
Diachronic corpora
Diachronic analysis
Turkish corpus
Frequency analysis
dc.title.none.fl_str_mv Turkronicles: diachronic resources for the fast evolving Turkish language
dc.type.none.fl_str_mv Text
Journal contribution
info:eu-repo/semantics/publishedVersion
text
contribution to journal
description <p dir="ltr">Over the past century, the Turkish language has undergone substantial changes, mainly driven by governmental interventions. This relatively rapid linguistic evolution complicates the processing of historical Turkish documents. In this work, we introduce Turkronicles, a diachronic corpus for Turkish derived from the Official Gazette of Türkiye and the records of the Grand National Assembly of Türkiye, spanning the period from 1920 to 2024. Turkronicles contains 46,328 documents and 1.1B tokens, making it an important resource for analyzing the linguistic evolution of Turkish and for developing models to process historical Turkish documents. In addition, we develop a library for conducting linguistic analysis on diachronic corpora, and we train a model to fix OCR errors within the documents. Using our corpus, we explore how the Turkish vocabulary and writing conventions have changed since 1920. Our analysis reveals that the vocabulary has changed significantly and that multiple spellings exist for several words. Specifically, we show that vocabulary divergence increases over time, as expected. As a result of this vocabulary change, the similarity between the periods 1920–1929 and 2010–2019 is 57%. Despite the substantial vocabulary changes, we demonstrate that word embeddings can be used to identify old Turkish words that have the same meanings as newly coined ones. Regarding writing conventions, we found a noticeable decrease in the use of the circumflex. In addition, words ending with the letters ‘-b’ and ‘-d’ have been largely replaced by their counterparts ending with ‘-p’ and ‘-t’, respectively, although the former are still in use. Lastly, we observe an increase in the usage of words that comply with vowel harmony rules as a result of the “purification” process of the Turkish Language Reform. Overall, our study quantitatively highlights the dramatic changes in Turkish across various linguistic aspects.</p><h2 dir="ltr">Other Information</h2><p dir="ltr">Published in: Language Resources and Evaluation<br>License: <a href="https://creativecommons.org/licenses/by/4.0" target="_blank">https://creativecommons.org/licenses/by/4.0</a><br>See article on publisher's website: <a href="https://dx.doi.org/10.1007/s10579-025-09857-w" target="_blank">https://dx.doi.org/10.1007/s10579-025-09857-w</a></p>
eu_rights_str_mv openAccess
id Manara2_889c7ef0acdaa216d2edb8773b9e5c1b
identifier_str_mv 10.1007/s10579-025-09857-w
network_acronym_str Manara2
network_name_str Manara2
oai_identifier_str oai:figshare.com:article/30860318
publishDate 2025
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
rights_invalid_str_mv CC BY 4.0
spellingShingle Turkronicles: diachronic resources for the fast evolving Turkish language
Togay Yazar (22828175)
Information and computing sciences
Artificial intelligence
Language, communication and culture
Language studies
Diachronic corpora
Diachronic analysis
Turkish corpus
Frequency analysis
status_str publishedVersion
title Turkronicles: diachronic resources for the fast evolving Turkish language
title_full Turkronicles: diachronic resources for the fast evolving Turkish language
title_fullStr Turkronicles: diachronic resources for the fast evolving Turkish language
title_full_unstemmed Turkronicles: diachronic resources for the fast evolving Turkish language
title_short Turkronicles: diachronic resources for the fast evolving Turkish language
title_sort Turkronicles: diachronic resources for the fast evolving Turkish language
topic Information and computing sciences
Artificial intelligence
Language, communication and culture
Language studies
Diachronic corpora
Diachronic analysis
Turkish corpus
Frequency analysis