ARLEX: A Large Scale Comprehensive Lexical Inventory for Modern Standard Arabic

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper introduces ARLEX, a lexical resource for Modern Standard Arabic (MSA) that explicitly lists ambiguity at the lexical and syntactic levels for each token. Arabic orthography is typically underspecified for diacritics, the marks that indicate short vowels and other features such as letter doubling and glottal stops. This underspecification increases orthographic ambiguity, with real impact on natural language processing (NLP) applications as well as on readability and human language processing. We specifically target listing the alternative forms of ambiguous words within and across parts of speech (POS), namely cases where a token with no specified diacritics has multiple possible diacritized alternatives. The entries in this dictionary are constrained to five POS tags: verbs, nouns, adjectives, adverbs, and prepositions. A morphological analyzer and disambiguator is leveraged to generate the desired linguistic properties. The resulting inventory, ARLEX, is a large-scale, comprehensive resource of words that records their degree of ambiguity at various levels, with example usages. ARLEX could be most useful for NLP applications, pedagogical applications, and socio- and psycholinguistic studies.
Original language: English (US)
Title of host publication: Proceedings of the LREC 2018 Workshop
Pages: 1-7
State: Published - 2018
