The Alkhalil Corpora comprise two open-source datasets for Modern Standard Arabic. The first contains about 100 million words collected from websites across the Arab region and organized by thematic category. The second is a 1-million-word sub-corpus annotated with lemmas, produced through a semi-automatic pipeline that combines the Alkhalil Lemmatizer, MADAMIRA, and manual validation.
For further details, see the following paper:
- Belayachi, S., Mazroui, A. Alkhalil Corpus: An Open-Source Thematic and Lemmatized Corpus for Modern Standard Arabic. In: Proceedings of the AbjadNLP Workshop at EACL 2026, Rabat, Morocco (2026).

