Capitalization and Normalization for English and Arabic NLP

Most natural language processing models today have a pre-processing stage, where the text is cleaned up before it is fed into the model. This has a few benefits, mainly reducing the amount of information the model has to learn and giving more consistent results. For example, you (normally) wouldn't want your model to treat iphone and IPHONE as two different things. A similar process happens for Arabic text encoded in Unicode. Tools like CAMeL usually start with a Unicode normalization step, which is necessary because the same Arabic letters can be encoded in several different ways ("لا", for example, may be stored as a single ligature character even though it is composed of two letters, ل and ا). On top of this, a second step, orthographic normalization, converts visually similar letters into a single form¹.
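To make those two steps concrete, here is a minimal sketch using only Python's standard library. The mapping table and the `normalize_orthography` helper are my own illustrative assumptions, not what CAMeL or any other tool actually ships; real tools cover more letters and make these choices configurable.

```python
import unicodedata

# Step 1: Unicode normalization.
# U+FEFB is the lam-alef ligature stored as a single code point.
ligature = "\uFEFB"
decomposed = unicodedata.normalize("NFKC", ligature)
print([f"U+{ord(c):04X}" for c in decomposed])  # lam and alef as separate code points

# Step 2: orthographic normalization.
# Hypothetical, deliberately tiny table of look-alike letters.
ORTHOGRAPHIC_MAP = str.maketrans({
    "\u0622": "\u0627",  # alef with madda above -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064A",  # alef maksura -> yeh
})


def normalize_orthography(text: str) -> str:
    """Apply Unicode normalization, then collapse look-alike letters."""
    return unicodedata.normalize("NFKC", text).translate(ORTHOGRAPHIC_MAP)


ahmad_with_hamza = "\u0623\u062D\u0645\u062F"  # the name Ahmad, spelled with hamza
ahmad_bare_alef = "\u0627\u062D\u0645\u062F"   # same name, typed without the hamza
print(normalize_orthography(ahmad_with_hamza) == ahmad_bare_alef)
```

Both spellings of the name end up as the same string, which is the point: the model no longer has to learn that they are the same word.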

One thing that concerns me is that Arabic tools seem to take this a step further and remove diacritics from words as well. To me, this seems like a problem, since diacritics are not just a feature of the written form of Arabic but actually bear on the meaning of a word. There are three layers here:

The meaning of the word in Arabic <– diacritics are here
The written form of the word in Arabic <– orthographic normalization is here
The written form of the word in Unicode <– Unicode normalization is here
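To show what gets lost at the top layer, here is a rough sketch of diacritic stripping. The `strip_diacritics` helper and its character range are assumptions of mine for illustration (the range covers the common short-vowel marks, tanween, shadda, sukun, and the superscript alef), not the exact behavior of any particular tool.

```python
import re

# Common Arabic diacritics: U+064B (fathatan) through U+0652 (sukun),
# plus U+0670 (superscript alef). Illustrative, not exhaustive.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove the diacritic marks matched by DIACRITICS."""
    return DIACRITICS.sub("", text)


# Three different words built on the same consonants (kaf, teh, beh).
words = {
    "kataba (he wrote)": "\u0643\u064E\u062A\u064E\u0628\u064E",
    "kutiba (it was written)": "\u0643\u064F\u062A\u0650\u0628\u064E",
    "kutub (books)": "\u0643\u064F\u062A\u064F\u0628",
}

stripped = {gloss: strip_diacritics(word) for gloss, word in words.items()}
for gloss, form in stripped.items():
    print(gloss, "->", form)

# All three collapse into one surface form once the diacritics are gone.
print(len(set(stripped.values())))
```

After stripping, the model sees one token where there were three meanings, and has to recover the difference from context alone.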

For natural language processing models to work, they need data that is as close as possible to the meanings of the words (which may not map neatly onto the written words themselves), without stripping too much meaning from each word. Unicode normalization and orthographic normalization are necessary because they iron out quirks of the Arabic script that are largely independent of what a word means (a misspelled word still means the same thing given enough context).

To put it in English terms, take the Mocking SpongeBob text meme (wRiTiNg LiKe ThIs). Capitalization here really does modify the meaning of the text: it makes the text sarcastic! If you feed a model only lowercased text, it will never learn to recognize when capitalization has a bearing on meaning. For tasks such as sentiment analysis, this is a major problem.
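As a small illustration of the signal that lowercasing throws away, here is a sketch of a "case flip rate" feature. The `case_flip_rate` function and the example sentences are hypothetical, made up for this post, and not taken from any sentiment analysis library.

```python
def case_flip_rate(text: str) -> float:
    """Fraction of adjacent letter pairs that switch between upper and lower case."""
    letters = [c for c in text if c.isalpha()]
    if len(letters) < 2:
        return 0.0
    flips = sum(a.isupper() != b.isupper() for a, b in zip(letters, letters[1:]))
    return flips / (len(letters) - 1)


mocking = "yOu ShOuLd LoWeRcAsE eVeRyThInG"
plain = "You should lowercase everything"

print(case_flip_rate(mocking))          # every adjacent pair flips
print(case_flip_rate(plain))            # only the leading capital flips
print(case_flip_rate(mocking.lower()))  # the signal is gone after lowercasing
```

Once both sentences are lowercased they are the same string, so nothing downstream can tell the sarcastic one from the sincere one.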

The way I see it, capitalization in English and diacritics in Arabic play similar roles: both can sometimes add meaning on top of the words themselves. And while it's kind of lame to end a post by saying that people need to be more careful about their data and their goals, we shouldn't blindly treat text pre-processing as a "solved" problem.

Footnotes:

¹ This is done, for example, for the datasets used in Zampieri, Marcos, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Çağrı Çöltekin. “SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020).” ArXiv:2006.07235 [Cs], September 30, 2020. http://arxiv.org/abs/2006.07235.