Motivation: Are Language Proficiency Assessment Resources Globally Accessible? 🌍

Language proficiency assessment is foundational to advancing educational research and many natural language processing (NLP) applications, particularly tasks like automatic readability assessment (ARA) and automated essay scoring (AES). The Common European Framework of Reference for Languages (CEFR) is one of the most globally recognized frameworks for the standardised assessment of proficiency across different languages and countries. However, despite CEFR's role as the de facto framework, most language proficiency assessment datasets based on it suffer from several access and interoperability limitations:

1. Access Restrictions: Since the CEFR originated in Europe, a significant number of learner corpora produced by European government-funded projects are accessible only to institutions and researchers based in Europe, owing to data privacy regulations such as the GDPR.
2. Lack of Standardised Data Processing: In the absence of unified data processing practices, corpora and datasets often come in non-uniform, highly variant formats, including TXT, XML, CSV, Excel (XLS), and PDF files, requiring substantial preprocessing effort from researchers.
3. Limitations in Language Coverage: Existing corpora and datasets available for request primarily cover English and major European languages such as Spanish and French, with far less representation and availability for mid- and low-resource languages such as Welsh.
4. Gaps in Text Granularities: Many corpora and dataset resources are annotated only at the sentence level, limiting research on other granularities such as paragraph, document, or dialogue formats.

Together, these limitations make multilingual research on CEFR-based language proficiency assessment tedious and difficult. Moreover, as interest grows in using large language models (LLMs) and cross-lingual techniques for educational assessment and NLP tasks, access to open, machine-readable, high-quality, multilingual datasets becomes increasingly important.

Our Solution: The UniversalCEFR Initiative 📕

We introduce UniversalCEFR, a large, standardised, open, multilingual dataset of CEFR-labelled texts for language proficiency assessment, consolidated from 26 existing corpora and accessible for non-commercial research. It brings together 505,807 texts labelled with CEFR levels across 13 languages and 4 scripts (Latin, Cyrillic, Arabic, and Devanagari).

The dataset covers a variety of text categories, including learner texts (e.g., essays, writing samples) and reference texts (e.g., curated reading materials); diverse levels of text granularity, including sentence, paragraph, document, and dialogue; and the full CEFR scale from A1 to C2.
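
To give a feel for how the collection can be used once indexed on Hugging Face, here is a minimal sketch using the `datasets` library. The repository id below is a hypothetical placeholder, not an actual dataset name; browse the UniversalCEFR organization page for the real ones.

```python
from datasets import load_dataset

# Hypothetical repository id -- substitute an actual dataset listed under
# the UniversalCEFR Hugging Face organization.
ds = load_dataset("UniversalCEFR/example-corpus", split="train")

# Every instance carries a CEFR label (see the data format section below),
# so slicing by proficiency level is a one-liner.
b2_texts = ds.filter(lambda ex: ex["cefr_level"] == "B2")
print(f"{len(b2_texts)} B2-level texts out of {len(ds)} total")
```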

UniversalCEFR Standardised Data Format 💽

For UniversalCEFR, we use a structured and standardised JSON format containing the necessary per-instance information in the form of eight (8) metadata fields for each CEFR-labelled text included in the dataset. The table below lists each field with its description and examples where applicable; an illustrative example instance follows the table. All instances that were validated and included in the collection of CEFR-labelled corpora for UniversalCEFR conform to this format.

| Field | Description |
| --- | --- |
| title | The unique title of the text retrieved from its original corpus (NA if the text has no title, as with CEFR-assessed sentences or paragraphs). |
| lang | The source language of the text in ISO 639-1 format (e.g., en for English). |
| source_name | The name of the source dataset the text was collected from, as indicated in its source dataset, paper, and/or documentation (e.g., cambridge-exams from Xia et al. (2016)). |
| format | The granularity of the text, as indicated in its source dataset, paper, and/or documentation. The recognized formats are: document-level, paragraph-level, discourse-level, sentence-level. |
| category | The classification of the text in terms of who created the material: reference for texts created by experts, teachers, and language learning professionals, and learner for texts written by language learners and students. |
| cefr_level | The CEFR level associated with the text. The six recognized CEFR levels are: A1, A2, B1, B2, C1, C2. |
| license | The licensing information associated with the text (e.g., CC-BY-NC-SA 4.0). |
| text | The actual content of the text itself. |
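
To make the schema concrete, below is a minimal sketch of a single instance in this format, written as Python so the conformance checks can actually run. The field values are illustrative assumptions echoing the examples in the table above, not an actual record from the dataset.

```python
import json

# Illustrative UniversalCEFR instance following the eight-field schema above.
# The values are invented for demonstration, not taken from the dataset.
instance = {
    "title": "NA",                     # NA when the source text has no title
    "lang": "en",                      # ISO 639-1 language code
    "source_name": "cambridge-exams",  # source dataset identifier
    "format": "sentence-level",        # one of the four recognized granularities
    "category": "learner",             # "reference" or "learner"
    "cefr_level": "B1",                # one of A1, A2, B1, B2, C1, C2
    "license": "CC-BY-NC-SA 4.0",
    "text": "I have been studying English for three years.",
}

# Lightweight conformance checks against the schema described above.
assert set(instance) == {"title", "lang", "source_name", "format",
                         "category", "cefr_level", "license", "text"}
assert instance["cefr_level"] in {"A1", "A2", "B1", "B2", "C1", "C2"}
assert instance["category"] in {"reference", "learner"}
assert instance["format"] in {"document-level", "paragraph-level",
                              "discourse-level", "sentence-level"}

print(json.dumps(instance, indent=2))
```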

Why Do We Need UniversalCEFR? 🔭

Beyond its data and technical contributions, UniversalCEFR also carries broader sociolinguistic significance: it addresses the growing linguistic inequality in modern AI development by focusing on under-represented languages alongside English.

We also hope this initiative can lead to more responsible AI development that actively resists the growing linguistic centralization around English in global AI research, a modern Matthew effect in which well-resourced languages receive disproportionate technological attention while smaller languages (like Czech or Welsh) are left behind. The UniversalCEFR initiative is a strong first step towards mitigating this effect.

Join the Initiative! 🤝

We want to grow a community of researchers, language experts, and educators to further advance openly accessible CEFR and language proficiency assessment corpora for all. If you're interested in this direction, please feel free to join the Hugging Face and GitHub organizations linked above.

Moreover, if you have corpora or datasets you would like to add to UniversalCEFR for better exposure and utility to researchers around the world, please fill out this form.

When we index your dataset in UniversalCEFR, we will credit you and cite the paper or project the dataset came from across the UniversalCEFR platforms. Ownership and copyright of any dataset indexed in UniversalCEFR remain with the original data creators.

Citation

@article{imperial2025universalcefr,
  title = {{UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment}},
  author = {Joseph Marvin Imperial and Abdullah Barayan and Regina Stodden and Rodrigo Wilkens and Ricardo Muñoz Sánchez and Lingyun Gao and Melissa Torgbi and Dawn Knight and Gail Forey 
  and Reka R. Jablonkai and Ekaterina Kochmar and Robert Reynolds and Eugénio Ribeiro and Horacio Saggion and Elena Volodina and Sowmya Vajjala and Thomas François and Fernando Alva-Manchego and Harish Tayyar Madabushi},
  journal = {arXiv preprint arXiv:2506.01419},
  year = {2025},
  url = {https://arxiv.org/abs/2506.01419}}