Morpheme and Allomorph

Why This Matters

A morpheme is the smallest unit of a language that carries meaning or grammatical function. The English word unhappiness contains three morphemes: un- (negation), happy (root), -ness (noun-forming suffix). Decomposing words into morphemes is the foundational analytical move in morphology, parallel to decomposing speech into phonemes in phonology (see phoneme-vs-allophone).

An allomorph is a context-determined surface realization of a morpheme. The English plural -s has three allomorphs:

[s] after voiceless non-sibilant consonants: cats, books, ships.
[z] after vowels and voiced non-sibilant consonants: dogs, toys, eggs.
[ɪz] after sibilants: kisses, churches, wishes.

The three allomorphs are different sounds but the same morpheme (plural). The choice between them is determined phonologically.

Understanding morpheme/allomorph precisely matters for:

Tokenization in NLP: BPE, WordPiece, and SentencePiece tokenizers operate on character or byte strings, so their tokens approximate but do not match morpheme structure. Tokenizer choice affects how multi-morpheme words are represented in downstream models. See tokenization and information theory on DSAPath.
Morphologically rich languages: Finnish, Turkish, and Korean (agglutinative) and Arabic (root-and-pattern) can exhibit thousands of word forms per root. ML models for these languages often benefit from morpheme-aware processing.
Cross-linguistic typology: languages differ dramatically in how much morpheme structure they exhibit (English: little; Turkish: lots).
Speech recognition: ASR systems handling morphologically rich languages often benefit from morph-aware lexicons and language models, because the allomorphs of one morpheme differ in pronunciation.

Definitions

Definition

Morpheme

The smallest unit of a language that carries meaning or grammatical function. Morphemes can be:

Free: standing as words on their own (cat, run, happy).
Bound: occurring only attached to other morphemes (-ness, un-, -ing).

Bound morphemes split further into derivational (forming new words: un- + happy gives unhappy) and inflectional (marking grammatical features: -s plural, -ed past).

Definition

Allomorph

A surface realization of a morpheme, occurring in a specific phonological or grammatical context. Different allomorphs of the same morpheme have the same meaning but different phonetic/orthographic forms.

Definition

Morph

The actual surface form of a morpheme in a specific instance. The morph in cats is the [s] segment; that morph realizes the plural morpheme.

Worked Example: English Plural

The plural morpheme in English (a single grammatical feature "more than one") has three regular allomorphs determined by the final segment of the noun stem:

Stem ends in...	Allomorph	Examples
Voiceless consonant other than sibilant	`[s]`	cats, books, sticks
Vowel, or voiced consonant other than a sibilant	`[z]`	dogs, kids, eggs, cars, toys
Sibilant ([s], [z], [ʃ], [ʒ], [tʃ], [dʒ])	`[ɪz]`	kisses, churches, wishes, mazes

The rule: insert [ɪ] before the plural suffix when the stem ends in a sibilant; otherwise voice-assimilate the suffix.
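The rule above can be sketched as a small selection function. This is a minimal illustration, assuming the stem's final segment is already available as an IPA string; the segment sets and the name `plural_allomorph` are simplifications introduced here, not a full English phonology.

```python
# Simplified, illustrative segment inventories (IPA strings).
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ", "s", "ʃ", "tʃ", "h"}


def plural_allomorph(final_segment: str) -> str:
    """Pick the regular plural allomorph from the stem-final segment."""
    if final_segment in SIBILANTS:
        return "ɪz"  # epenthesis: insert [ɪ] before the suffix
    if final_segment in VOICELESS:
        return "s"   # suffix assimilates to a voiceless stem
    return "z"       # vowels and voiced non-sibilant consonants


# (word, stem-final segment) pairs taken from the table above.
for word, final in [("cat", "t"), ("dog", "g"), ("toy", "ɔɪ"),
                    ("kiss", "s"), ("church", "tʃ"), ("maze", "z")]:
    print(word, "->", plural_allomorph(final))
```

The sibilant check comes first because [s], [ʃ], and [tʃ] are also voiceless; testing voicing first would wrongly assign them [s].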

Beyond regular plurals, English has irregular plurals (children, feet, mice, oxen) which are stored as unanalyzed morpheme combinations or as lexical exceptions.

Worked Example: English Past Tense

The regular past-tense morpheme has three allomorphs, mirroring the plural pattern:

Stem ends in...	Allomorph	Examples
Voiceless consonant other than [t]	`[t]`	walked, kicked, stopped
Vowel, or voiced consonant other than [d]	`[d]`	played, lived, judged
[t] or [d]	`[ɪd]`	wanted, needed, decided

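The pattern is structural: voicing-assimilate the suffix to match the stem; insert epenthetic [ɪ] when the stem ends in an alveolar stop ([t] or [d]), that is, when the stem-final consonant and the suffix would otherwise differ at most in voicing. The plural follows the same logic, with epenthesis after sibilants. This is one of the textbook cases of morphophonology: the interaction between morphology (allomorph selection) and phonology (sound assimilation).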
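One way to make the shared structure concrete is a single function parameterized by the suffix and by the set of stem-final segments that trigger epenthesis. This is a sketch under assumptions: treating the voiced forms as basic is one common analysis, and the names `realize_suffix`, `DEVOICED`, and the segment sets are hypothetical and simplified.

```python
# Simplified, illustrative segment inventories (IPA strings).
VOICELESS = {"p", "t", "k", "f", "θ", "s", "ʃ", "tʃ", "h"}
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
ALVEOLAR_STOPS = {"t", "d"}

# Voiceless counterpart of each basic (voiced) suffix consonant.
DEVOICED = {"z": "s", "d": "t"}


def realize_suffix(final_segment: str, basic: str, epenthesis_triggers: set) -> str:
    """Surface form of a one-consonant suffix after a given stem-final segment."""
    if final_segment in epenthesis_triggers:
        return "ɪ" + basic       # too similar to the suffix: insert [ɪ]
    if final_segment in VOICELESS:
        return DEVOICED[basic]   # voicing assimilation
    return basic                 # vowels and voiced consonants


def past_allomorph(final_segment: str) -> str:
    return realize_suffix(final_segment, "d", ALVEOLAR_STOPS)


def plural_allomorph(final_segment: str) -> str:
    return realize_suffix(final_segment, "z", SIBILANTS)


print(past_allomorph("k"))    # walk
print(past_allomorph("v"))    # live
print(past_allomorph("t"))    # want
print(plural_allomorph("s"))  # kiss
```

Only the basic consonant and the trigger set differ between the two suffixes; the selection logic is identical.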

Worked Example: Turkish Vowel Harmony

Turkish exhibits vowel harmony: suffix vowels match the stem in backness and (often) rounding. The plural morpheme has two allomorphs:

-lar: after back-vowel stems, as in kuş-lar "birds"; *ev-lar is wrong.
-ler: after front-vowel stems, as in ev-ler "houses"; *kuş-ler is wrong.

The locative case marker has four allomorphs:

Stem vowel	Allomorph	Example
Front unrounded (e, i)	-de / -te	ev-de "in the house"
Front rounded (ö, ü)	-de / -te	göl-de "at the lake", otobüs-te "on the bus"
Back unrounded (a, ı)	-da / -ta	kitap-ta "in the book"
Back rounded (o, u)	-da / -ta	yol-da "on the road"

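The -de/-te alternation is voicing assimilation (-te/-ta after a voiceless stem-final consonant); the -de/-da alternation is vowel harmony. The locative harmonizes only in backness, so rounded and unrounded stems pattern together. Turkish word-formation chains can stack 5-10 suffixes, each with multiple allomorphs, producing thousands of possible forms per root.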
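The two alternations can be sketched as independent choices: one for the suffix vowel (harmony with the last stem vowel) and one for the suffix consonant (voicing of the stem-final segment). This is a minimal sketch that works on lowercase Turkish orthography and ignores loanword exceptions to harmony; the function names are hypothetical.

```python
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")
VOICELESS = set("çfhkpsşt")  # voiceless consonants in Turkish orthography


def last_vowel(stem: str) -> str:
    """Return the last vowel of the stem, which controls harmony."""
    for ch in reversed(stem):
        if ch in FRONT_VOWELS or ch in BACK_VOWELS:
            return ch
    raise ValueError(f"no vowel found in {stem!r}")


def plural(stem: str) -> str:
    """-ler after front-vowel stems, -lar after back-vowel stems."""
    suffix = "ler" if last_vowel(stem) in FRONT_VOWELS else "lar"
    return f"{stem}-{suffix}"


def locative(stem: str) -> str:
    """Four allomorphs: consonant by voicing, vowel by backness."""
    consonant = "t" if stem[-1] in VOICELESS else "d"
    vowel = "e" if last_vowel(stem) in FRONT_VOWELS else "a"
    return f"{stem}-{consonant}{vowel}"


for stem in ["ev", "göl", "otobüs", "kitap", "yol"]:
    print(locative(stem))
for stem in ["kuş", "ev"]:
    print(plural(stem))
```

Because the two choices are independent, the four locative allomorphs fall out of two binary decisions rather than a four-way lookup.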

Determining Whether Two Morphs Are Allomorphs

The morphologist's procedure (parallel to the phoneme/allophone test):

Check meaning first. If two morphs differ in meaning or grammatical function, they realize different morphemes (not allomorphs), whatever their form. The English plural [s] and the third-singular [s] sound identical but mark different features; they are different morphemes that happen to share a surface form.
List the contexts of each morph. Identify what phonological or grammatical environment determines the choice.
Check for complementary distribution. If morph A occurs only in contexts where morph B does not, and vice versa, they are allomorphs in complementary distribution.
Assign an underlying morpheme. The convention varies; in Distributed Morphology (Halle-Marantz 1993), morphemes are abstract feature bundles, and allomorph selection is handled by vocabulary insertion rules in a post-syntactic module.
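The context-listing and complementary-distribution steps can be sketched over a list of (morph, environment) observations. This is an illustrative sketch only: it assumes the morphs have already passed the meaning check, and the environment labels and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def environments(observations):
    """Map each morph to the set of environments where it was observed."""
    envs = defaultdict(set)
    for morph, env in observations:
        envs[morph].add(env)
    return dict(envs)


def in_complementary_distribution(observations) -> bool:
    """True if no two morphs ever share an environment."""
    envs = environments(observations)
    return all(envs[a].isdisjoint(envs[b]) for a, b in combinations(envs, 2))


# Observations for the English regular plural, labeled by stem-final class.
plural_obs = [
    ("s", "voiceless non-sibilant"),
    ("z", "voiced non-sibilant"),
    ("z", "vowel"),
    ("ɪz", "sibilant"),
]
print(in_complementary_distribution(plural_obs))

# A made-up overlapping observation breaks complementary distribution.
print(in_complementary_distribution(plural_obs + [("s", "vowel")]))
```

Disjoint environments are necessary but not sufficient: the function says nothing about whether the morphs share a meaning, which is why the meaning check comes first in the procedure.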

Cross-Linguistic Typology

Languages vary dramatically in morphological complexity:

Isolating languages (Mandarin, Vietnamese, Classical Chinese): few or no bound morphemes; meaning is conveyed by word order and free morphemes.
Analytic languages (much of modern English): grammatical relations are expressed largely by word order, auxiliary words, and free morphemes rather than rich inflection.
Fusional languages (Latin, Russian, Spanish): a single affix can bundle several grammatical features. Latin amabamus "we were loving" combines root am- with tense, person, and number morphology.
Agglutinative languages (Turkish, Finnish, Japanese, Korean, Swahili): rich morphology with each morpheme separable; words can have many sequential affixes each with a distinct meaning.
Polysynthetic languages (Inuktitut, Mohawk, Yup'ik): extreme morphology; what would be a sentence in English is a single multi-morpheme word. Inuktitut tusaatsiarunnanngittualuujunga "I can't hear very well" is one word built from about six morphemes (hear + well + be able + not + very much + first-person singular).

Greenberg's universals state cross-linguistic generalizations over such patterns, and WALS (World Atlas of Language Structures) catalogs the typological data across thousands of languages.

ML Connections

Tokenization vs morphological analysis

Subword tokenizers (BPE, WordPiece) approximate morphological structure but do not reproduce it:

For English, BPE often splits words at morpheme boundaries by accident: unhappiness may tokenize as un + happiness or un + happy + ness depending on training-corpus frequencies.
For agglutinative languages, BPE typically produces sub-morphemic splits: a Turkish word with 8 morphemes might tokenize as 12-15 BPE tokens with token boundaries crossing morpheme boundaries.
For polysynthetic languages, BPE typically fails to capture the morpheme structure at all; the long words become long token sequences with no morphological coherence.

This is one plausible contributor to multilingual-LLM performance gaps: low-resource morphologically rich languages are often represented by longer, less morpheme-aligned token sequences, which can make learning and evaluation harder. The size of the effect depends on the tokenizer, corpus, script, and downstream task.
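Morpheme alignment can be quantified by comparing the internal boundaries of a token sequence with those of a reference morpheme segmentation. The sketch below is one simple way to do this, assuming both segmentations spell out the same string; the example tokenizations are hypothetical and are not the output of any particular tokenizer.

```python
def boundaries(segments):
    """Character offsets of the internal boundaries of a segmentation."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts


def boundary_precision_recall(tokens, morphemes):
    """Precision and recall of token boundaries against morpheme boundaries."""
    if "".join(tokens) != "".join(morphemes):
        raise ValueError("segmentations must cover the same string")
    tok, gold = boundaries(tokens), boundaries(morphemes)
    hits = len(tok & gold)
    precision = hits / len(tok) if tok else 1.0
    recall = hits / len(gold) if gold else 1.0
    return precision, recall


# Orthographic morpheme segmentations used as the reference.
english = ["un", "happi", "ness"]
turkish = ["ev", "ler", "imiz", "de"]  # evlerimizde, "in our houses"

# Hypothetical subword splits, for illustration only.
print(boundary_precision_recall(["un", "happiness"], english))
print(boundary_precision_recall(["evl", "erim", "izde"], turkish))
```

Low precision means tokens cut inside morphemes; low recall means morpheme boundaries are hidden inside tokens. Averaged over a corpus, the two numbers give a rough, tokenizer-specific measure of alignment.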

Morphologically-aware models

Several lines of work re-introduce morpheme-level structure or sidestep subword tokenization:

Morfessor (Creutz-Lagus 2007): unsupervised morpheme segmentation from raw text. Used as a preprocessing step in some low-resource NMT systems.
Character-level transformers: skip tokenization entirely; process raw bytes or characters. Avoids tokenizer mismatch at the cost of much longer sequences.
BPE with pre-tokenization rules or morphological priors: hybrids used in some systems. LLaMA-2, for example, splits numbers into individual digits, which is a hand-written pre-tokenization rule rather than a morphological prior; morphology-informed variants instead bias the vocabulary toward likely morpheme boundaries.
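The sequence-length cost of character-level and byte-level processing is easy to see directly. The sketch below counts characters and UTF-8 bytes for a word; the helper name `sequence_lengths` is hypothetical, and the input is normalized to NFC so that letters such as ş count as one character.

```python
import unicodedata


def sequence_lengths(word: str) -> dict:
    """Sequence length of a word under character-level and byte-level input."""
    normalized = unicodedata.normalize("NFC", word)
    return {
        "characters": len(normalized),
        "utf8_bytes": len(normalized.encode("utf-8")),
    }


for word in ["cats", "kuşlar", "göl", "otobüste"]:
    print(word, sequence_lengths(word))
```

Non-ASCII letters take more than one byte in UTF-8, so byte-level models see longer sequences for Turkish than for English words of the same character length, and both are longer than a typical subword split of the same word.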

Probing for morphological structure

Probing classifiers (probing-classifiers-for-linguistic-structure) can test whether transformer representations contain information about morphological features or morpheme boundaries even when the tokenizer is sub-morphemic. A successful probe under controls shows that the information is extractable; it does not by itself show that the model causally uses a human-like morpheme representation.

Common Mistakes

Watch Out

Confusing morpheme with syllable

Syllables are phonological units (vowel + surrounding consonants); morphemes are meaningful units. Strawberry has 3 syllables but 2 morphemes (straw + berry). Banana has 3 syllables but only 1 morpheme.

Watch Out

Treating allomorphs as different morphemes

The plural [s], [z], and [ɪz] are the same morpheme realized in three contexts. The grammatical feature "plural" is one; the surface form differs.

Watch Out

Assuming all languages have the same morpheme structure

English speakers often assume words have a clear stem-plus-affix structure. This is true of English and Romance languages but not of all languages. Polysynthetic languages have words that are entire propositions; isolating languages have minimal morpheme structure.

Watch Out

Forgetting suppletive forms

Went is the past tense of go, but there is no phonological process that derives went from go. This is a suppletive form: a single morpheme realized as phonologically unrelated allomorphs in different grammatical contexts. Suppletion is rare but real; be (am / is / are / was / were) is the most suppletive verb in English.

Cross-Network Links

LinguisticsPath internal: prerequisite phoneme-vs-allophone is the parallel phonology distinction; next natural topics are inflectional morphology, derivational morphology, and Distributed Morphology.
TheoremPath: tokenization and information theory is the technical page to read when the question becomes LLM tokenization rather than linguistic morphology.
ComputationPath direction: finite-state morphology (Kaplan-Kay 1994) formalizes morpheme-allomorph rules as finite-state transducers.

References

Canonical:

Haspelmath, Martin, and Andrea D. Sims. Understanding Morphology (2010, 2nd ed.), Chapters 1-3.
Aronoff, Mark, and Kirsten Fudeman. What is Morphology? (2010, 2nd ed.).
Bauer, Laurie. The Linguistics Student's Handbook (2007), Chapter 9.
Spencer, Andrew. Morphological Theory: An Introduction to Word Structure in Generative Grammar (1991).
Halle, Morris, and Alec Marantz. "Distributed Morphology and the Pieces of Inflection." The View from Building 20 (1993) 111-176.

Computational:

Creutz, Mathias, and Krista Lagus. "Unsupervised Models for Morpheme Segmentation and Morphology Learning." ACM Transactions on Speech and Language Processing 4(1) (2007) 1-34.
Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Neural Machine Translation of Rare Words with Subword Units." ACL (2016).
Bojanowski, Piotr, et al. "Enriching Word Vectors with Subword Information." TACL 5 (2017) 135-146.