alelm – Arabic language engineering and learning modeling

RESOURCES

Lexicons

Arabic WordNet ontology

Description: This improved version extends the original Arabic WordNet (http://globalwordnet.org/arabic-wordnet/awn-browser/). It was enriched with new verbs and nouns, including broken plurals, a plural form specific to Arabic words.

Clarin: Arabic WordNet ontology
ISLRN: 576-499-135-548-6
Citation: L. Abouenour, K. Bouzoubaa and P. Rosso, "On the evaluation and improvement of Arabic WordNet coverage and usability," Language Resources and Evaluation, vol. 47, n° 13, pp. 891-917, 2013.

Characters lexicon

Description: An LMF conformant XML-based file containing all Arabic characters (letters, vowels and punctuation marks). Each character is described with a description, its different displays (isolated, and at the beginning, middle and end of a word), a codification (Unicode; others could be added later), and two transliterations (Buckwalter and wiki). The lexicon is composed of 42 characters: the 28 known letters, 5 hamza forms, 9 special letters, 9 vowels and 3 punctuation marks.

Clarin: Arabic Enclitics Lexicon (LMF) Arabic Enclitics Lexicon (XML)
ISLRN: 250-846-271-090-1 (LMF)
306-352-322-908-4 (XML)
Citations:
- T. Loukili, K. Bouzoubaa, "Structuration et Standardisation des ressources linguistiques de l'Arabe - cas de l'alphabet, préfixes et suffixes", Journées Doctorales en Technologies de l'Information et Communication, Tangier, Morocco, 7/2011.
- Driss Namly, Yasser Regragui and Karim Bouzoubaa. "Interoperable Arabic language resources building and exploitation in SAFAR platform". International Conference on Computer Systems and Applications, AICCSA 2016.

Clitics (Enclitics)

Description: An XML-based file containing all Arabic enclitics. It consists of 14 atomic enclitics, which generate about 73 enclitics when their association rules are applied.

Clarin: Arabic Enclitics Lexicon
ISLRN: 356-004-001-278-7
Citation: Driss Namly, Yasser Regragui and Karim Bouzoubaa. "Interoperable Arabic language resources building and exploitation in SAFAR platform". International Conference on Computer Systems and Applications, AICCSA 2016.

Clitics (Proclitics)

Description: An XML-based file containing all Arabic proclitics. It consists of 12 atomic proclitics, which generate about 94 proclitics when their association rules are applied.

Clarin: Arabic Proclitics Lexicon
ISLRN: 382-029-397-588-7
Citation: Driss Namly, Yasser Regragui and Karim Bouzoubaa. "Interoperable Arabic language resources building and exploitation in SAFAR platform". International Conference on Computer Systems and Applications, AICCSA 2016.

Stop-words

Description: An XML-based file containing Arabic stop-words, covering nouns, particles and verbs. The lexicon is composed of 27796 stop-words.

Clarin: Arabic Stop-words Lexicon
ISLRN: 324-965-777-406-1
Citations:
- Driss Namly et al. "Development of Arabic particles lexicon using the LMF framework". Colloque pour les Etudiants Chercheurs en Traitement Automatique du Langage Naturel et ses applications (CEC-TAL 2015). Sousse, Tunisia, 23-25 March 2015.
- Driss Namly et al. "A Complex Arabic stop-words list design". The Second National Doctoral Symposium On Arabic Language Engineering (JDILA'2015), ENSA of Fez, USMBA, 28-29 October 2015.

MORALEX-Morphology

Description: MORALEX is a lexicon of morphemes that includes 402 Moroccan Arabic affixes and clitics that were manually created and linguistically checked. It is composed of 24 atomic affixes, 43 atomic clitics and 335 compound morphemes. The main advantage of this resource is its rich morphological information, such as POS, form and person. It can be used in different contexts, particularly in morphological tasks.

Clarin: Arabic Morphological evaluation corpus
Citation: R. Tachicart, K. Bouzoubaa, "Towards Automatic Normalization of the Moroccan Dialectal Arabic User Generated Text" in the 7th International Conference on Arabic Language Processing ICALP'19, October 2019, Nancy, France

Triliteral roots

Description: This XML file is a lexicon containing all 21952 (28x28x28) Arabic triliteral combinations (roots). The file is split into three parts, as follows: the first part contains the phonetic constraints that must be taken into account in the formation of Arabic roots (for more details, see all_phonetic_rules.xml at http://arabic.emi.ac.ma/alelm/?q=Resources); the second part lists the lexicons that were used to create this lexicon (see the lexicons tag); the third part contains the roots.

Clarin: Arabic Triliteral roots Lexicon
ISLRN: 813-907-570-946-2
Developed by: Ebtihal Mustafa and Mohammed Karim BOUZOUBAA
Citation: Mustafa, Ebtihal, and Karim Bouzoubaa. 2023. A Bi-Gram Approach for an Exhaustive Arabic Triliteral Roots Lexicon. Languages 8: 83. https://doi.org/10.3390/languages8010083
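
As a rough illustration of the 28x28x28 figure quoted in the description above, the following Python sketch enumerates every ordered triple over the 28 basic Arabic letters. The letter inventory and the function name are assumptions made for this example; they are not taken from the lexicon file itself.

```python
from itertools import product

# The 28 basic Arabic letters (assumed inventory for this sketch; hamza forms
# and other special characters are deliberately left out).
ARABIC_LETTERS = "ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي".split()


def all_triliteral_combinations(letters=ARABIC_LETTERS):
    """Return every ordered three-letter combination over `letters`."""
    return ["".join(triple) for triple in product(letters, repeat=3)]


combinations = all_triliteral_combinations()
print(len(ARABIC_LETTERS), len(combinations))
```

The enumeration is purely combinatorial: it applies no phonetic constraints, so it only shows where the total count comes from.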

Phonetic rules phonology

Description: This XML file describes the Arabic phonetic constraints (rules) resulting from the analysis of the lexicons (Taj Alarous, Al ain, Lisan Al arab, Alwassit and elke moassir). These rules are to be applied to Arabic roots and are classified into four categories, each covering a certain type of constraint, as follows: the first category states that a root must not consist of three identical letters; the second category states that a root must not start with two repeating letters; the third category lists the letters that must not occur in the same root, regardless of their order; the fourth category lists the letters that may not be used together in a certain order in a root.

Clarin: Arabic Phonetic Rules
ISLRN: 190-535-098-473-3
Developed by: Ebtihal Mustafa and Mohammed Karim BOUZOUBAA
Citation: Mustafa, Ebtihal, and Karim Bouzoubaa. 2023. A Bi-Gram Approach for an Exhaustive Arabic Triliteral Roots Lexicon. Languages 8: 83. https://doi.org/10.3390/languages8010083
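
The four rule categories described above can be sketched as a simple filter over candidate roots. This is a minimal illustration, not the authors' implementation: the function name, the shape of the rule data, and the placeholder Latin letters are all hypothetical, and the fourth category is interpreted here as "the first letter appears somewhere before the second", which is an assumption.

```python
def violates_phonetic_rules(root, incompatible_sets, forbidden_orders):
    """Return True if a three-letter root breaks any of the four rule categories.

    incompatible_sets: iterable of frozensets of letters that must not
        co-occur in a root, in any order (category 3).
    forbidden_orders: iterable of (earlier, later) letter pairs that must not
        appear in that order (category 4, assumed interpretation).
    """
    if len(root) != 3:
        raise ValueError("expected a three-letter root")
    first, second, third = root
    # Category 1: the root consists of three identical letters.
    if first == second == third:
        return True
    # Category 2: the root starts with two repeating letters.
    if first == second:
        return True
    # Category 3: letters that must not occur together, whatever their order.
    letters = set(root)
    if any(group <= letters for group in incompatible_sets):
        return True
    # Category 4: letters that must not occur together in a given order.
    for earlier, later in forbidden_orders:
        for i in range(3):
            for j in range(i + 1, 3):
                if root[i] == earlier and root[j] == later:
                    return True
    return False


# Placeholder rule data (invented for illustration, not real Arabic constraints).
INCOMPATIBLE = [frozenset("ac")]
FORBIDDEN_ORDERS = [("d", "e")]

candidates = ["aaa", "aab", "abc", "bde", "bed", "bff"]
kept = [r for r in candidates
        if not violates_phonetic_rules(r, INCOMPATIBLE, FORBIDDEN_ORDERS)]
print(kept)
```

In practice the rule data would be read from the XML file rather than hard-coded.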

Addressed Arabic phonetic rules

Description: This XML file describes the Arabic phonetic constraints to be applied to Arabic roots. The first rule category lists the letters that may not occur in the same root, regardless of their order. The second category lists the letters that may not be used together in a root in a specific order. The third and fourth categories state that contiguous letters must not be redundant.

Clarin: Addressed Arabic Phonetic Rules
ISLRN: 991-445-325-823-5
Developed by: Ebtihal Mustafa and Mohammed Karim BOUZOUBAA
Citation: Mustafa, Ebtihal, and Karim Bouzoubaa. 2023. A Bi-Gram Approach for an Exhaustive Arabic Triliteral Roots Lexicon. Languages 8: 83. https://doi.org/10.3390/languages8010083

Broken plural list

Description: An LMF conformant XML-based file containing a comprehensive Arabic broken plural list. The file contains 12,249 singular words with their corresponding broken plurals (BPs).

Clarin: Broken plural list
ISLRN: 340-952-913-841-9
Developed by: Karim Bouzoubaa, Mariame Ouamer, Rachida Tajmout
Citation: M. Ouamer, R. Tajmout, K. Bouzoubaa, "Arabic Broken Plural Model Based on the Broken Pattern", In International Conference on Digital Technologies and Applications, January 2022.

CALEM (Comprehensive Arabic LEMmas)

Description: Comprehensive Arabic LEMmas is a lexicon covering a large list of Arabic lemmas and their corresponding inflected word forms (stems), with details (POS and root). Each lexical entry represents a lemma followed by all its possible stems, and each stem is enriched with its morphological features, in particular the root and the POS.
It is composed of 164,845 lemmas representing 7,200,918 stems, detailed as follows:
- 757 Arabic particles
- 2,464,631 verbal stems
- 4,735,587 nominal stems
The lexicon is provided as an LMF conformant XML-based file in UTF-8 encoding, which represents about 1.22 GB of data.

Developed by: Driss Namly, Abdelhamid El Jihad, Karim Bouzoubaa
Clarin: Comprehensive Arabic LEMmas
ISLRN: 462-532-124-988-8
Citation: Namly Driss, Karim Bouzoubaa, Abdelhamid El Jihad, and Si Lhoussain Aouragh. "Improving Arabic Lemmatization Through a Lemmas Database and a Machine-Learning Technique." In Recent Advances in NLP: The Case of Arabic Language, pp. 81-100. Springer, Cham, 2020.
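
Because the lexicon is a single large XML file, reading it as a stream is usually more practical than loading it whole. The sketch below shows one way to do that with Python's standard library. It assumes the generic LMF layout (LexicalEntry elements holding a Lemma and WordForm children, with attributes stored as feat att/val pairs); the actual element and feature names in the CALEM file may differ, and the sample entry is invented for illustration.

```python
import io
import xml.etree.ElementTree as ET


def _features(node):
    """Collect the <feat att="..." val="..."/> children of a node into a dict."""
    return {feat.get("att"): feat.get("val") for feat in node.findall("feat")}


def iter_lexical_entries(source):
    """Yield (lemma_features, word_form_features_list) for each LexicalEntry.

    The file is read incrementally with iterparse, so the whole lexicon never
    has to be held in memory as one tree.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "LexicalEntry":
            continue
        lemma = elem.find("Lemma")
        lemma_feats = _features(lemma) if lemma is not None else {}
        forms = [_features(form) for form in elem.findall("WordForm")]
        yield lemma_feats, forms
        elem.clear()  # drop the children of the entry that was just processed


# Invented sample in the assumed layout (transliterated placeholder values).
SAMPLE = b"""<LexicalResource>
  <Lexicon>
    <LexicalEntry>
      <Lemma><feat att="writtenForm" val="kataba"/></Lemma>
      <WordForm>
        <feat att="writtenForm" val="katab"/>
        <feat att="partOfSpeech" val="verb"/>
        <feat att="root" val="ktb"/>
      </WordForm>
    </LexicalEntry>
  </Lexicon>
</LexicalResource>"""

for lemma_feats, forms in iter_lexical_entries(io.BytesIO(SAMPLE)):
    print(lemma_feats, forms)
```

For the real file, a path or an open binary file object can be passed in place of the in-memory sample.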

MORV (Moroccan Morphological vocabulary)

Description: The Moroccan Morphological vocabulary is a lexicon containing more than 4.6 million entries, each describing a Moroccan Arabic word with fourteen (14) morphological and semantic features: the word orthographic form, the segmentation (prefix and suffix), part-of-speech (POS), gender, number, tense and transitivity (for verbs), its origin, dialectal lemma, Arabic lemma, the root, voice, state, and affirmative/negative form. This vocabulary is provided as an XML file and represents more than 900 MB of data.

Developed by: Ridouane Tachicart, Karim Bouzoubaa
Citation: Ridouane Tachicart, Karim Bouzoubaa, Moroccan Arabic vocabulary generation using a rule-based approach, Journal of King Saud University - Computer and Information Sciences, Volume 34, Issue 10, Part A, 2022, Pages 8538-8548, ISSN 1319-1578

Patterns lexicon

Description: This lexicon is composed of verbal and nominal patterns. Nominal patterns vary according to the category to which they belong. These categories can be reduced to derivative nouns, non-derivative nouns and Massader.

Developed by: ALELM Team

Dictionaries

"Al wassit" Arabic dictionary

Description: An LMF conformant XML-based file containing the electronic version of the Al wassit dictionary, an Arabic monolingual dictionary compiled by the Academy of the Arabic Language in Cairo. The Al wassit dictionary consists of 6900 roots, 61101 lexical entries (18199 verbs, 42731 nouns and 171 particles), 8821 examples (5231 for verbs and 3590 for nouns) and 119140 meanings.

Clarin: "Al wassit" Arabic dictionary (LMF) "Al wassit" Arabic dictionary (XML)
ISLRN: 795-847-093-546-5 (LMF)
283-443-022-502-4 (XML)
Citation: Driss Namly, Yasser Regragui and Karim Bouzoubaa. "Interoperable Arabic language resources building and exploitation in SAFAR platform". International Conference on Computer Systems and Applications, AICCSA 2016.

Contemporary

Description: An LMF conformant XML-based file containing the electronic version of the al logha al arabia al moassira (Contemporary Arabic) dictionary, an Arabic monolingual dictionary compiled by Ahmed Mukhtar Abdul Hamid Omar (d. 1424 AH) with the help of a working group. The Contemporary dictionary is composed of 5778 roots, 32300 lexical entries (10475 verbs, 21457 nouns and 368 particles), 29118 entry examples and 43384 additional examples, 63019 meanings and 17883 contextual expressions.

Clarin: Contemporary Arabic dictionary (LMF) Contemporary Arabic dictionary (XML)
ISLRN: 264-069-820-478-0 (LMF)
065-323-843-026-9 (XML)
Citations:
- Driss Namly, Karim Bouzoubaa. "LMF conversion of an editorial dictionary: the case of the Contemporary Arabic dictionary". Journée d'étude Ressources langagières de l'arabe pour le TAL : construction, standardisation, gestion et exploitation, 26 Novembre 2015 Institut d'Etudes et de Recherches pour l'Arabisation, Rabat
- Driss Namly, Yasser Regragui and Karim Bouzoubaa. "Interoperable Arabic language resources building and exploitation in SAFAR platform". International Conference on Computer Systems and Applications, AICCSA 2016.

MADED

Description: The Moroccan Arabic Dialect Electronic Dictionary (MADED) is an electronic lexicon containing almost 11,500 entries. The entries are written in Arabic script, and each Moroccan Arabic dialect (MA) lemma is provided with its Modern Standard Arabic (MSA) equivalent. In addition, MADED entries are annotated with useful metadata such as part-of-speech (POS), root and origin (Arabic, French, ...). This dictionary is provided as an XML file and represents about 1 MB of data.

Clarin: MADED
ISLRN: 977-057-254-691-5
Citation: R. Tachicart, K. Bouzoubaa, "Building a Moroccan dialect electronic Dictionary (MDED)", in the 5th International Conference on Arabic Language Processing CITALA, Oujda, Morocco, 11/2014.

Corpora

CLEF-TREC Q/A

Description: A list of 2264 questions and answers from CLEF and TREC, translated into Arabic.

Clarin: CLEF-TREC Q/A
ISLRN: 680-984-485-076-7
Citation: Abouenour L., Bouzoubaa K., Rosso P. "On the Evaluation and Improvement of Arabic WordNet Coverage and Usability", Language Resources and Evaluation, Springer Netherlands, 10.1007/s10579-013-9237-0, 6/2013.

Stemming evaluation NAFIS Gold Standard

Description: Normalized Arabic Fragments for Inestimable Stemming (NAFIS) is an Arabic stemming gold standard corpus composed of a collection of texts, selected to be representative of Arabic stemming tasks and manually annotated.

Clarin: NAFIS Arabic Stemming Gold Standard Corpus
ISLRN: 305-450-745-774-1
Citation: Driss Namly, Rachida Tajmout, Karim Bouzoubaa, Lahsen Abouenour. "NAFIS: A Gold Standard Corpus for Arabic Stemmers Evaluation". International Business Information Management Association (IBIMA), November 2016, Seville, Spain.

LID language Identification

Description: This resource is a corpus containing 34k Moroccan Colloquial Arabic sentences collected from different sources. The sentences are written in Arabic letters. This resource can be useful in some NLP applications such as Language Identification.

Clarin: LID
ISLRN: 048-993-307-382-7
Citation: R. Tachicart, K. Bouzoubaa, Si Lhoucine Aouragh and Hamid Jaafar "Automatic Identification of Moroccan Colloquial Arabic", in the 6th International Conference on Arabic Language Processing ICALP'17, October 2017, Fez, Morocco.

Spell checking

Description: This file is a text corpus for Arabic spell checking. A group of people edited different files, and all the spelling errors they made were recorded. The participants were chosen to cover a broad range of profiles: male and female; old, middle-aged and young; heavy and light computer users; etc. Through this work, we aim to help researchers and others interested in Arabic NLP by providing an Arabic spell-checking corpus that is ready and open for exploitation and interpretation. The study also made it possible to inventory most of the spelling mistakes made by editors of Arabic texts. The file contains the following sections (tags): people, the documents they printed, the types of possible errors, and the errors they made. Each section (tag) contains data that explains its details and content, which helps researchers extract research-oriented results. The people section contains basic information about each person and their level of computer use. The documents section lists all the sentences in each document, numbering each sentence so that it can be referenced in the errors section. The "type of errors" section lists all the possible errors, each with a description in Arabic and an illustrative example.

Clarin: Manual Arabic spelling-errors correction for collected documents
ISLRN: 922-673-450-479-2
Developed by: Ahmed Abdalrhman Saty, Karim Bouzoubaa, Aouragh Si Lhoussain
Citation: Saty, Ahmed A.; Aouragh, Si Lhoussain; Bouzoubaa, Karim (2023). "A New Spell-Checking Approach Based on the User Profile". International Journal of Computing and Digital Systems.

Arabic ACL corpus

Description: This corpus contains all the sentences representing the Arabic Controlled Language (ACL). It contains 551 sentences taken from four textbooks and websites dedicated to teaching Arabic to children, such as: a) First grade book, Republic of Sudan (كتاب الصف الاول جمهورية السودان), b) Al Jazeera Educational Site (موقع الجزيرة التعليمي), c) Bella Preparatory School Girls Forum (منتدى مدرسة بيلا الاعدادية بنات), and d) Albahr website (موقع انا البحر). The sentences respect 52 ACL rules, with an average of 10.6 sentences per rule. All sentences in the corpus were parsed with the Farasa syntactic parser, and the validity of the parses was checked manually by expert linguists. The corpus consists of a header and a body. The header is a set of metadata describing the corpus, such as the corpus name, the authors and the sources. The body contains the rules. Each rule has a code, a structure and all the sentences respecting that rule. For each sentence, the corpus stores an id, the vowelled and unvowelled text, and the result of parsing with Farasa.

Clarin: Arabic ACL corpus
ISLRN: 813-396-094-059-3
Developed by: Salah Elfahal Elebaed Hoyam, Kasbi Mohammed, Nasri Mohammed and Bouzoubaa Karim
Citation: H. Salah El Fahal, M. Nasri, K. Bouzoubaa, A. Kabbaj, "Resources for developing an Arabic Controlled Language", In International Journal of Computer Science Trends and Technology - IJCST, Volume 9 Issue 6, Nov-Dec 2021.
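
The header/body structure described above maps naturally onto a few small record types. The following sketch models it with Python dataclasses; all class and field names are hypothetical and chosen for readability, not taken from the corpus file.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    sentence_id: str
    vowelled: str
    unvowelled: str
    parse: str  # Farasa parser output, kept as opaque text in this sketch


@dataclass
class Rule:
    code: str
    structure: str
    sentences: list[Sentence] = field(default_factory=list)


@dataclass
class Corpus:
    # Header metadata
    name: str
    authors: list[str]
    sources: list[str]
    # Body: the rules, each with its sentences
    rules: list[Rule] = field(default_factory=list)

    def average_sentences_per_rule(self) -> float:
        """Mean number of sentences per rule (0.0 for an empty body)."""
        if not self.rules:
            return 0.0
        return sum(len(rule.sentences) for rule in self.rules) / len(self.rules)


# Tiny invented example: two rules holding one and two placeholder sentences.
demo = Corpus(
    name="demo",
    authors=["(placeholder)"],
    sources=["(placeholder)"],
    rules=[
        Rule("R1", "(placeholder structure)", [Sentence("1", "", "", "")]),
        Rule("R2", "(placeholder structure)",
             [Sentence("2", "", "", ""), Sentence("3", "", "", "")]),
    ],
)
print(demo.average_sentences_per_rule())
```

With the figures quoted in the description, 551 sentences over 52 rules gives 551 / 52 ≈ 10.6 sentences per rule.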

Quranic stemming evaluation

Description: This is a reduced version of the Quranic corpus developed by Kais Dukes et al. (http://corpus.quran.com/). It contains 18352 words with their stems, roots and lemmas. We created this reduced version to serve as a stemming evaluation corpus.

Citation: Jaafar, Y., Namly, D., Bouzoubaa, K., & Yousfi, A. (2017). "Enhancing Arabic stemming process using resources and benchmarking tools". Journal of King Saud University-Computer and Information Sciences, 29(2), 164-170.

Morphological evaluation

Description: An annotated corpus dedicated to the benchmarking and evaluation of Arabic morphological analyzers. It consists of 100 words with all their possible analyses. The corpus contains several kinds of morphological information, such as stem, pattern, root and lemma.

Citation: Y. Jaafar, K. Bouzoubaa, A. Yousfi, R. Tajmout, H. Khamar, "Improving Arabic Morphological Analyzers Benchmark", In The International Journal of Speech Technology (IJST), pp. 1-9, April 2016

Arabic Keywords Extraction Corpus

Description: Arabic Keywords Extraction Corpus (AKEC) is a corpus for the evaluation of keywords detection systems. It is composed of 2448 news articles, totaling approximately 4M tokens, sourced from a news website. Articles in the corpus span four distinct domains: Art, Economy, Politics, and Sport. The dataset was manually annotated with its corresponding keywords, indicated by a binary tag (1 for keyword, 0 otherwise).
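
Since each token carries a binary keyword tag, a keyword detection system can be scored against the corpus with standard precision, recall and F1. The sketch below is one common way to compute them; the function name and the toy tag sequences are invented for illustration and do not come from the corpus.

```python
def precision_recall_f1(gold_tags, predicted_tags):
    """Score binary keyword tags (1 = keyword, 0 = otherwise) against the gold tags."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    pairs = list(zip(gold_tags, predicted_tags))
    true_pos = sum(1 for gold, pred in pairs if gold == 1 and pred == 1)
    false_pos = sum(1 for gold, pred in pairs if gold == 0 and pred == 1)
    false_neg = sum(1 for gold, pred in pairs if gold == 1 and pred == 0)
    # Guard against empty denominators when nothing is predicted or annotated.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: six tokens, three gold keywords, three predicted keywords.
gold = [1, 0, 0, 1, 0, 1]
predicted = [1, 0, 1, 1, 0, 0]
print(precision_recall_f1(gold, predicted))
```

Here two of the three predicted keywords are correct and two of the three gold keywords are found, so precision, recall and F1 all come out at two thirds.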

Models

Coming soon ...