3 NLP Assumptions that don’t hold up
Introduction
As Natural Language Processing (NLP) develops as a subfield of AI, I have noticed a worrying trend: developers entering the field with no background in Linguistics. At best, this is a great opportunity for interested individuals to learn about the complexities of Language, focusing on new and exciting languages. At worst, programmers who think they know everything about Language (but couldn’t tell a realis from a deontic) can end up leading multilingual projects with very real consequences.
Having an interdisciplinary background in Computer Science and Linguistics, and having worked in industry as a computational linguist, I have compiled a few (somewhat) common assumptions that the principled NLP engineer should avoid. These are not intended to be groundbreaking, complete, or academically rigorous. My goal is simply to highlight the diversity of world languages, especially in ways in which they behave unexpectedly compared to English.
All uncited language data are my own, based on my study (to varying extents) of those languages.
1. You can tokenize by whitespace.
Tokenization is the process of splitting a text into discrete meaningful units, or tokens. This is one of the major components of data preprocessing, allowing us to target individual words instead of entire sentences or documents. So if each token corresponds to one word, we can simply segment our text by splitting on whitespace, right?
Let’s think about this a little more. In English, there are some concepts which are represented by sequences of characters including spaces. For example, consider ⟨New Jersey⟩1. These two words together represent one of the United States (another good example). New and jersey are valid words on their own, and so their combined meaning would be lost if they were tokenized separately due to the intervening space. In general, English orthography is inconsistent with compound words.
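To make this concrete, here is a minimal Python sketch (with an invented example sentence) of what naive whitespace splitting does to a multiword expression:

# Naive whitespace tokenization: multiword expressions fall apart,
# and punctuation stays glued to neighboring words.
text = "I moved from New Jersey to New York."

tokens = text.split()
print(tokens)
# ['I', 'moved', 'from', 'New', 'Jersey', 'to', 'New', 'York.']
# 'New Jersey' is no longer a single unit, and 'York.' keeps its period.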
Finnish
Other languages present different challenges. We have just seen the problem that arises when a single “word” contains whitespace. But what about “words” that pack several intuitive tokens together without any whitespace? In agglutinative languages like Finnish, it is possible to have compound words such as:
käsipyyherullajärjestelmä
(Uusi Kielemme)
This translates to ‘hand towel roll system’. In English, it seems perfectly reasonable to tokenize each of these words on its own, but doing the same in Finnish would require a more sophisticated approach. It is at this point that our concept of a “word” starts to break down—and so too does our understanding of tokens. We now have to make judgments on what it means to be a word.
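If we did want to recover those component words, one naive option is greedy longest-match decompounding against a word list. The Python sketch below is a toy illustration, assuming a hand-built vocabulary of the four component words; real Finnish tokenizers rely on proper morphological analyzers or subword models instead.

# Toy greedy longest-prefix decompounding over a hand-built vocabulary.
vocab = {"käsi", "pyyhe", "rulla", "järjestelmä"}  # 'hand', 'towel', 'roll', 'system'

def decompound(word, vocab):
    """Split a compound by repeatedly taking the longest known prefix."""
    parts = []
    while word:
        match = next(
            (word[:i] for i in range(len(word), 0, -1) if word[:i] in vocab),
            None,
        )
        if match is None:  # unknown prefix: keep the remainder as one piece
            parts.append(word)
            break
        parts.append(match)
        word = word[len(match):]
    return parts

print(decompound("käsipyyherullajärjestelmä", vocab))
# ['käsi', 'pyyhe', 'rulla', 'järjestelmä']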
Inuit
As an extreme case against simple whitespace tokenization, consider the polysynthetic Inuit languages of Alaska, Canada, and Greenland. In these languages, compound “words” can encode information about arbitrary nouns, adjectives, verbs, and pronouns. Thus, it’s possible to utter a complete sentence in one word:
iglukpijumalaaktuŋa
‘I am anxious about building a house’
(Eugene Nida 1949)
Yes, iglukpijumalaaktuŋa could be parsed as a single token. Yet this defeats the motivation behind tokenization in the first place: such a token would be incredibly information-dense, abandoning the desired simplification. In English, the amount of information in a whitespace-delimited word is generally quite small (with plenty of exceptions); in the Inuit languages, this is not the case.
Chinese
As a final case against whitespace tokenization, there are some scriptio continua languages like Chinese, which don’t use whitespace at all:
我只想是你的朋友
‘I just want to be your friend’
Try splitting that sentence on spaces alone!
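Whitespace splitting simply hands the whole sentence back as a single “token”. Chinese text needs a dedicated word segmenter; the sketch below assumes the open-source jieba package is installed, and the exact segmentation it produces may differ from the comment.

text = "我只想是你的朋友"

print(text.split())
# ['我只想是你的朋友']  (the entire sentence comes back as one "token")

# A dedicated segmenter is needed instead (assumes `pip install jieba`).
import jieba
print(list(jieba.cut(text)))
# something like ['我', '只', '想', '是', '你', '的', '朋友']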
2. All morphology involves prefixes or suffixes.
Once we’ve segmented our text into tokens, the next step is often to lemmatize those tokens, transforming them into their “dictionary citation” (a.k.a. lemma) forms. For example, the word running becomes run. This is useful for NLP as it allows us to treat all forms of a word in the same way, which comes in handy for tasks like sentiment analysis.
One more key term is morphology, which refers to the forms of words. When running is created from run, we can say that the suffix -ing was applied as a morphological process.
When lemmatizing data, many approaches simply trim off common prefixes and suffixes like ⟨ing⟩ to isolate the “stem” of a token. This approach is feasible for English, but it assumes that all morphological processes will involve prefixes or suffixes (affixes in general). In other words, this method of lemmatization assumes that all morphology is concatenative.
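A minimal affix-stripping lemmatizer might look like the Python sketch below (the suffix list and length check are arbitrary choices of mine). It handles regular forms reasonably well, but it has no answer for anything non-concatenative:

# Naive suffix stripping: workable for regular forms, useless for irregular ones.
SUFFIXES = ("ing", "ed", "s")

def strip_suffix(token):
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

print(strip_suffix("walked"))   # 'walk'
print(strip_suffix("running"))  # 'runn'  (the doubled consonant is left behind)
print(strip_suffix("drank"))    # 'drank' (the past tense goes undetected)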
English
We need look no further than English to discover words that break this pattern. Consider verb lemmas like drink, swim, and fly. The past tense forms of these words are notably not formed with the standard suffix -ed, but rather by changing the vowel: drank, swam, flew. This specific pattern is known as ablaut and occurs in English for historical reasons. More broadly, it is an example of non-concatenative morphology, where a grammatical change is marked without using affixes.
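Because English has only a limited set of these irregular forms, one common workaround is an exception table consulted before any suffix stripping. A small sketch, reusing the strip_suffix toy from the sketch above:

# A tiny, illustrative table of ablaut past-tense forms.
IRREGULAR_PAST = {"drank": "drink", "swam": "swim", "flew": "fly"}

def lemmatize(token):
    if token in IRREGULAR_PAST:   # check the exceptions first
        return IRREGULAR_PAST[token]
    return strip_suffix(token)    # fall back to the affix-stripping sketch above

print(lemmatize("flew"))    # 'fly'
print(lemmatize("walked"))  # 'walk'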
Arabic
The Semitic languages are notable for their “root-and-pattern” morphological systems. Let’s take Arabic as an example. We start with the root meaning ‘write’, consisting of three consonants, written from right to left:
ك - ت - ب
k - t - b
To say ‘I write’, we fill in the gaps between the consonants with a vowel pattern and add the prefix ’a-, which together convey the first-person singular present tense:
أكْتُب
‘aktub2
To say ‘she writes’, we once again apply the vowel pattern to the root, but this time with the prefix ta-, which marks the third-person feminine:
تكْتُب
taktub
Because of this system, lemmatization of Arabic cannot be accomplished simply by stripping off affixes—the task requires a more sophisticated approach. It may be more valuable to isolate the root of each word where possible, a notion considerably different from the English lemma.
For English, non-concatenative morphology manifests in a handful of verbs and nouns with irregular tenses or plurals, which could easily populate a lookup table. Yet in Arabic, many terms follow the root-and-pattern process. When preprocessing data in any language, it is often desirable to transform tokens into a simpler form—lemmas. Understanding the language’s morphology is the key to reversing the processes that derive word forms from those lemmas, and to figuring out what the lemmas should represent in the first place.
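As a very rough illustration of how different this is from suffix stripping, the Python sketch below removes the short-vowel diacritics and peels off a single subject prefix to recover the consonantal root of the two forms above. The prefix list and the whole approach are drastic simplifications of my own; real Arabic morphological analyzers handle far more prefixes, infixed patterns, and clitics.

import unicodedata

# Subject prefixes this toy knows about ('a-, ta-, ya-, na-).
SUBJECT_PREFIXES = ("أ", "ا", "ت", "ي", "ن")

def consonantal_skeleton(word):
    # The short-vowel diacritics (fatha, damma, kasra, sukun, ...) are
    # combining marks, so they can be filtered out with unicodedata.
    return "".join(ch for ch in word if not unicodedata.combining(ch))

def rough_root(word):
    skeleton = consonantal_skeleton(word)
    if skeleton[:1] in SUBJECT_PREFIXES:
        skeleton = skeleton[1:]
    return skeleton

print(rough_root("أكْتُب"))  # كتب (k-t-b)
print(rough_root("تكْتُب"))  # كتب (k-t-b)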
3. Textual data will include as much relevant information as necessary.
Obviously, writing doesn’t convey the same information as speech. As NLP engineers, we concede that textual data can’t capture the same intonations, gestures, and contextual nuances as a real-life conversation. Yet we tend to assume that the text in our datasets is complete—at least complete enough to make reasonable judgments.
In English, this may be a reasonable assumption. The most commonly omitted items are things like diacritics and punctuation, and it is usually pretty easy to understand a text without these3. Yet in other languages, the extent to which items are customarily omitted from text, and the extent to which this impacts meaning, can vary greatly.
Arabic, again
We already saw how Arabic morphology works by filling in vowels for a consonantal pattern. In Arabic orthography, these vowels are often omitted entirely, left to the reader to infer from context. This type of writing system is known as an abjad. The lack of vowels can introduce ambiguity into a parsing system if unaccounted for. Consider the following:
كتب
ktb
We know from earlier that these three letters comprise the root for ‘write’. By altering just the vowels, Arabic speakers can convey multiple meanings. Pay close attention to the diacritics on each word:
كَتَبَ
kataba
‘he wrote’
or:
كُتُب
kutub
‘books’
Both words are represented by the same textual form, ⟨كتب⟩, which omits the vowel diacritics. Even though the two are semantically related, this ambiguity can lead to incoherent readings of the text, especially since the words represent different parts of speech.
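The collapse is easy to see programmatically: strip the combining diacritics from both fully vocalized forms and you are left with the same string, which is exactly what ordinary written Arabic gives a parser to work with. A small Python sketch:

import unicodedata

def strip_diacritics(word):
    # Drop combining marks (here, the short-vowel diacritics).
    return "".join(ch for ch in word if not unicodedata.combining(ch))

kataba = "كَتَبَ"  # 'he wrote'
kutub = "كُتُب"   # 'books'

print(strip_diacritics(kataba))                             # كتب
print(strip_diacritics(kataba) == strip_diacritics(kutub))  # True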
Kalabari
Some languages use changes in pitch—known as tone—to convey various levels of information. Famously, Chinese uses tones to differentiate between words. The Chinese words for ‘mother’ and ‘horse’ are both pronounced with the same syllable, but with different tones4. This difference is reflected in the orthography, as the two words are written with different characters. Yet in some languages, tone conveys grammatical information. Let’s consider the Kalabari language of southern Nigeria.
In Kalabari, tone is used to mark the transitivity of verbs. A transitive verb takes a direct object, while an intransitive verb does not. In the following examples (taken from Larry Hyman 2016), pay attention to the tonal diacritics:
kíkíma
‘hide [something]’
kìkimá
‘be hidden’
As you can see, the tone has changed to convey a difference in meaning between the two verbs. Yet tones are not marked in Kalabari orthography. This difference is only apparent when analyzing spoken language. In text, it would be difficult to achieve an accurate understanding of any given word without analyzing the term’s distribution across a dataset.
Kalabari’s grammatical tone is a rather esoteric example of this phenomenon, and it probably won’t have a massive NLP impact given its limited grammatical function. Yet it points to a more valuable takeaway: the pronunciation of a word is not always separable from its grammar.
Conclusion
The three specific assumptions I described above may never impact any of your projects. Yet an encounter with bad multilingual assumptions is inevitable. In NLP, there’s really only one assumption to avoid: Language X works like this, so Language Y probably will as well.
You don’t need to be a fluent speaker of another language to create powerful systems for it—you don’t really need to speak it at all. But it is imperative that you do your research about it. Look up what linguists have to say about its features. Find out how others have approached its challenges before. What special considerations do they make to handle its differences from English?