The Alkhalil Corpora consist of two open-source datasets for Modern Standard Arabic. The first corpus contains about 100 million words collected from websites across the Arab region and organized by thematic category. The second is a 1-million-word sub-corpus annotated with lemmas, produced through a semi-automatic pipeline that combines the Alkhalil Lemmatizer, MADAMIRA, and manual validation.
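
As a rough illustration of what a semi-automatic lemmatization pipeline can look like, the sketch below runs two lemmatizers over the same tokens, accepts the lemma when both agree, and queues disagreements for manual validation. This agreement rule is an assumption made for illustration; the description above does not state how the Alkhalil Lemmatizer and MADAMIRA outputs are actually combined. The function `triage_lemmas` and the callables passed to it are hypothetical stand-ins, not the API of either tool.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: a lemmatizer maps one token to one lemma string.
Lemmatizer = Callable[[str], str]


def triage_lemmas(
    tokens: List[str],
    lemmatizer_a: Lemmatizer,
    lemmatizer_b: Lemmatizer,
) -> Tuple[List[Tuple[str, Optional[str]]], List[Tuple[int, str, str, str]]]:
    """Accept lemmas both tools agree on; queue the rest for manual review.

    Returns (annotated, needs_review):
      annotated    -- (token, lemma) pairs, with lemma None where the tools disagree
      needs_review -- (position, token, lemma_a, lemma_b) for each disagreement
    """
    annotated: List[Tuple[str, Optional[str]]] = []
    needs_review: List[Tuple[int, str, str, str]] = []
    for position, token in enumerate(tokens):
        lemma_a = lemmatizer_a(token)
        lemma_b = lemmatizer_b(token)
        if lemma_a == lemma_b:
            # Assumed rule: agreement between the two tools is accepted as-is.
            annotated.append((token, lemma_a))
        else:
            # Disagreement: leave the lemma open for a human annotator.
            annotated.append((token, None))
            needs_review.append((position, token, lemma_a, lemma_b))
    return annotated, needs_review
```

In a setup like this, the manual effort is limited to the tokens in `needs_review`, which is what makes the process semi-automatic rather than fully manual.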

For further details, see the following paper:

  • Belayachi, S., Mazroui, A. Alkhalil Corpus: An Open-Source Thematic and Lemmatized Corpus for Modern Standard Arabic. In: Proceedings of the AbjadNLP Workshop at EACL 2026, Rabat, Morocco (2026).

You can download the corpora from Hugging Face at the following link:

Link