
NLP for Arabic – The case of lemmatization

Arabic is a complex language for NLP tasks, even for simple ones like lemmatization.

There are several reasons for this:

  • Arabic creates words based on roots: for example, the word كتاب (kitab, “book”) is derived from the root ك ت ب (k t b). Many related words are derived from the same root.
  • Arabic can also create words by attachment, in a process similar to compounding but more limited: certain parts of speech, such as prepositions, conjunctions and pronouns, combine with nouns, adjectives and verbs. For example, وكتابي (wakitabi, “and my book”) is composed of و (wa, “and”) + كتاب (kitab, “book”) + ي (i, “my”).
  • More often than not, Arabic speakers omit short vowels when writing, which makes it hard to determine the correct lemma of a word.
  • Finally, Arabic is written in one canonical form across countries (Modern Standard Arabic, or MSA) but has many spoken variants, both within countries (Egyptian, Najdi…) and across them (Gulf Arabic is used in Kuwait, the UAE, Qatar…); some regional dialects, like Egyptian, are understood across most of the Arabic-speaking world thanks to their widespread use in media. There are also different registers: Classical Arabic, used for old texts and for reciting the Qur’an; MSA, used for writing, broadcasting and interviews; the colloquial regional dialect, which is the everyday language of informal contexts; and more.

As a result of this complexity, developing NLP tools for Arabic, and lemmatization in particular, requires:

  • A comprehensive, accurate data source that tags morphological attributes and also language variants. For example, the plural of سيارة (“car”) differs between MSA (سيارات) and Gulf Arabic (سيايير).
  • A good data architecture that integrates different information sources: prefixes, suffixes, roots, forms…
  • Highly efficient processing software, designed to handle the millions of distinct potential tokens that can be generated in MSA alone.
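One way to picture the second requirement, integrating affixes with a tagged stem lexicon, is a lookup that only accepts a segmentation when the remaining stem exists in the lexicon. This is a minimal sketch, not Bitext's implementation; `LEXICON`, `analyze` and the affix lists are hypothetical names invented for illustration.

```python
# Hypothetical architecture sketch: candidate segmentations are accepted
# only when the remaining stem is found in a lexicon tagged with
# morphological attributes and language variant.

LEXICON = {
    "كتاب": {"lemma": "كتاب", "pos": "NOUN", "variant": "MSA"},
    "سيارات": {"lemma": "سيارة", "pos": "NOUN", "variant": "MSA"},
    "سيايير": {"lemma": "سيارة", "pos": "NOUN", "variant": "Gulf"},
}
PREFIXES = ["", "و", "ال", "وال"]   # empty string = no prefix
SUFFIXES = ["", "ي", "ها"]          # empty string = no suffix

def analyze(token: str):
    """Return every prefix+stem+suffix split whose stem is in the lexicon."""
    analyses = []
    for pre in PREFIXES:
        if not token.startswith(pre):
            continue
        rest = token[len(pre):]
        for suf in SUFFIXES:
            if suf and not rest.endswith(suf):
                continue
            stem = rest[:-len(suf)] if suf else rest
            if stem in LEXICON:
                analyses.append({"prefix": pre, "stem": stem,
                                 "suffix": suf, **LEXICON[stem]})
    return analyses

print(analyze("وكتابي"))   # one analysis: و + كتاب + ي, lemma كتاب
print(analyze("سيايير"))   # Gulf-variant plural, lemma سيارة
```

Because every analysis carries its variant tag, the same machinery can answer both "what is the lemma?" and "which variant is this form?", which is exactly why the data source needs variant labels in the first place.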

At Bitext we have developed a set of NLP tools, including lemmatization, that

  • covers the different variants: MSA, Najdi, Egyptian, Gulf…
  • processes 30 million words per second
  • provides linguistic data on 35 million words

Are you interested in our services or want more information? Let’s get in touch!
