The OSIAN (Open Source International Arabic News) corpus was collected from 32 popular Arabic newspapers around the world. The corpus has been processed, archived, and published in the CLARIN infrastructure under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). The OSIAN corpus consists of:
- About 3.5 million articles
- More than 37 million sentences
- Roughly 1 billion tokens
- Descriptive metadata for each article (source, date of extraction, and so on)
For further details, please see the following paper:
- Imad Zeroual, Dirk Goldhahn, Thomas Eckart, and Abdelhak Lakhouaja. 2019. OSIAN: Open Source International Arabic News Corpus - Preparation and Integration into the CLARIN-infrastructure. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 175–182, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-4619

