The Nemlar corpus is a set of Arabic texts initially annotated by the Egyptian company RDI on behalf of the NEMLAR consortium, which owns the rights. It contains approximately 500,000 Arabic words spread over 489 files and covering 13 different domains. Our team then enriched the corpus with a lemma tag and obtained Nemlar's agreement to make it openly available. The tags provided in the Nemlar corpus for a given word are:

  • Its vowelized form
  • Its lemma
  • Its stem
  • The clitics attached to the stem
  • Its grammatical category
  • Its scheme (morphological pattern)
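
As a rough illustration, the six tags listed above can be modelled as one record per word. The sketch below is only an assumption about how such a record might be read from an XML file: the element name `word`, the attribute names, and the placeholder values are all hypothetical and are not taken from the actual Nemlar files, whose schema should be checked before adapting this.

```python
from dataclasses import dataclass, fields
import xml.etree.ElementTree as ET


@dataclass
class WordAnnotation:
    """One annotated word, with one field per tag in the list above."""
    vowelized: str  # vowelized form
    lemma: str      # lemma
    stem: str       # stem
    clitics: str    # clitics attached to the stem
    category: str   # grammatical category
    scheme: str     # scheme


# Hypothetical XML fragment: the element name, attribute names and values
# are placeholders for illustration, not the real Nemlar schema.
SAMPLE = """<text>
  <word vowelized="VOWELIZED_1" lemma="LEMMA_1" stem="STEM_1"
        clitics="CLITICS_1" category="CATEGORY_1" scheme="SCHEME_1"/>
  <word vowelized="VOWELIZED_2" lemma="LEMMA_2" stem="STEM_2"
        clitics="" category="CATEGORY_2" scheme="SCHEME_2"/>
</text>"""


def parse_words(xml_text: str) -> list[WordAnnotation]:
    """Build one WordAnnotation per <word> element (hypothetical layout)."""
    root = ET.fromstring(xml_text)
    names = [f.name for f in fields(WordAnnotation)]
    # A missing attribute is stored as an empty string.
    return [
        WordAnnotation(**{name: elem.get(name, "") for name in names})
        for elem in root.iter("word")
    ]


words = parse_words(SAMPLE)
for w in words:
    print(w.vowelized, w.lemma, w.category)
```

Keeping every tag in a single record makes it straightforward to filter words by grammatical category or to group them by lemma once the real element and attribute names have been substituted.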

For further details, please see the following paper:

  • Boudchiche, M.; Mazroui, A.; 2015. Enrichment of the Nemlar corpus with the lemma label. In: Study day “Arabic language resources for NLP: construction, standardization, management and operation”. November 26, 2015. Rabat, Morocco.

XML

The Nemlar corpus is available for download in XML format.

Download