Morpheme and Allomorph
Why This Matters
A morpheme is the smallest unit of a language that carries
meaning or grammatical function. The English word unhappiness
contains three morphemes: un- (negation), happy (root),
-ness (noun-forming suffix). Decomposing words into morphemes
is the foundational analytical move in morphology, exactly
parallel to the phoneme decomposition in phonology
(phoneme-vs-allophone).
An allomorph is a context-determined surface realization of a morpheme. The English plural -s has three allomorphs:
- [s] after voiceless non-sibilant consonants: cats, books, ships.
- [z] after vowels and voiced non-sibilant consonants: dogs, toys, eggs.
- [ɪz] after sibilants: kisses, churches, wishes.
The three allomorphs are different sounds but the same morpheme (plural). The choice between them is determined phonologically.
Understanding morpheme/allomorph precisely matters for:
- Tokenization in NLP: BPE, WordPiece, and SentencePiece tokenizers treat tokens as letter-strings, which approximates but does not match morpheme structure. Tokenizer choice affects how multi-morpheme words are represented in downstream models. See tokenization and information theory on DSAPath.
- Morphologically rich languages: Finnish, Turkish, Korean, and Arabic can exhibit thousands of word forms per root, through agglutinative suffixation (Finnish, Turkish, Korean) or root-and-pattern morphology (Arabic). ML models for these languages often benefit from morpheme-aware processing.
- Cross-linguistic typology: languages differ dramatically in how much morpheme structure they exhibit (English: little; Turkish: lots).
- Speech recognition: ASR systems handling morphologically rich languages often benefit from morphology-aware lexicons and language models that account for allomorphic variation.
Definitions
Morpheme
The smallest unit of a language that carries meaning or grammatical function. Morphemes can be:
- Free: standing as words on their own (cat, run, happy).
- Bound: occurring only attached to other morphemes (-ness, un-, -ing).
Bound morphemes split further into derivational (forming new words: un- + happy gives unhappy) and inflectional (marking grammatical features: -s plural, -ed past).
Allomorph
A surface realization of a morpheme, occurring in a specific phonological or grammatical context. Different allomorphs of the same morpheme have the same meaning but different phonetic/orthographic forms.
Morph
The actual surface form of a morpheme in a specific instance.
The plural morph in cats is the [s] segment; that morph
realizes the plural morpheme.
Worked Example: English Plural
The plural morpheme in English (a single grammatical feature "more than one") has three regular allomorphs determined by the final segment of the noun stem:
| Stem ends in... | Allomorph | Examples |
|---|---|---|
| Voiceless consonant other than sibilant | [s] | cats, books, sticks |
| Vowel, or voiced consonant other than sibilant | [z] | dogs, kids, eggs, cars, toys |
| Sibilant ([s], [z], [ʃ], [ʒ], [tʃ], [dʒ]) | [ɪz] | kisses, churches, wishes, mazes |
The rule: insert [ɪ] before the plural suffix when the stem ends in a sibilant; otherwise voice-assimilate the suffix.
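The rule above can be sketched as a small lookup over the stem-final segment. This is a minimal illustration, not a full phonological model: the segment inventories and the function name `plural_allomorph` are assumptions made for the example, and stem-final segments are passed in as IPA strings rather than derived from spelling.

```python
# Simplified segment classes (an assumption covering the table's examples).
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ", "s", "ʃ", "tʃ", "h"}


def plural_allomorph(final_segment: str) -> str:
    """Return the regular plural allomorph for a stem-final segment."""
    # The sibilant check must come first: [s] and [ʃ] are also voiceless.
    if final_segment in SIBILANTS:
        return "ɪz"  # epenthesis: kisses, churches, mazes
    if final_segment in VOICELESS:
        return "s"   # voiceless non-sibilant: cats, books
    return "z"       # elsewhere (voiced consonants and vowels): dogs, toys


# Hypothetical mini-lexicon: word -> final segment of its stem.
examples = {"cat": "t", "dog": "g", "toy": "ɔɪ", "kiss": "s", "church": "tʃ", "maze": "z"}
for word, final in examples.items():
    print(word, "->", plural_allomorph(final))
```

The ordering of the checks encodes the "otherwise" in the rule: the sibilant case takes priority, and [z] is the elsewhere form.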
Beyond regular plurals, English has irregular plurals (children, feet, mice, oxen) which are stored as unanalyzed morpheme combinations or as lexical exceptions.
Worked Example: English Past Tense
The regular past-tense morpheme has three allomorphs, mirroring the plural pattern:
| Stem ends in... | Allomorph | Examples |
|---|---|---|
| Voiceless consonant other than [t] | [t] | walked, kicked, stopped |
| Vowel, or voiced consonant other than [d] | [d] | played, lived, judged |
| [t] or [d] | [ɪd] | wanted, needed, decided |
The pattern is structural: voicing-assimilate the suffix to match the stem; insert epenthetic [ɪ] when the stem ends in an alveolar stop ([t] or [d]), where the stem-final consonant and the suffix would otherwise share both place and manner of articulation. This is one of the textbook cases of morphophonology: the interaction between morphology (allomorph selection) and phonology (sound assimilation).
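To make the parallel with the plural concrete, the sketch below uses one function for both suffixes, parameterized by the suffix consonant and by the set of stem-final segments that trigger epenthesis. The function names, the choice of /z/ and /d/ as basic forms, and the simplified segment inventory are assumptions for illustration only.

```python
# Simplified inventory of voiceless segments (an assumption for this sketch).
VOICELESS = {"p", "t", "k", "f", "θ", "s", "ʃ", "tʃ", "h"}
DEVOICED = {"z": "s", "d": "t"}  # voiceless counterpart of each suffix consonant


def regular_suffix(final_segment: str, basic: str, clash: set) -> str:
    """Pick the allomorph of a suffix whose assumed basic form is `basic`."""
    if final_segment in clash:
        return "ɪ" + basic      # epenthesis when stem and suffix are too similar
    if final_segment in VOICELESS:
        return DEVOICED[basic]  # voicing assimilation to a voiceless stem
    return basic                # elsewhere: voiced consonants and vowels


def past_allomorph(final_segment: str) -> str:
    return regular_suffix(final_segment, "d", {"t", "d"})


def plural_allomorph(final_segment: str) -> str:
    return regular_suffix(final_segment, "z", {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"})


print(past_allomorph("k"))   # walk  -> t
print(past_allomorph("v"))   # live  -> d
print(past_allomorph("t"))   # want  -> ɪd
```

Only the two parameters differ between the plural and the past tense; the decision procedure itself is shared.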
Worked Example: Turkish Vowel Harmony
Turkish exhibits vowel harmony: suffix vowels match the stem in backness and (often) rounding. The plural morpheme has two allomorphs:
- -lar: after back-vowel stems, as in kuş-lar "birds"; *ev-lar is wrong.
- -ler: after front-vowel stems, as in ev-ler "houses"; *kuş-ler is wrong.
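The locative case marker has four allomorphs, determined by the backness of the last stem vowel and the voicing of the stem-final segment:
| Last stem vowel | Stem-final segment | Allomorph | Example |
|---|---|---|---|
| Front (e, i, ö, ü) | Vowel or voiced consonant | -de | ev-de "in the house", göl-de "at the lake" |
| Front (e, i, ö, ü) | Voiceless consonant | -te | otobüs-te "on the bus" |
| Back (a, ı, o, u) | Vowel or voiced consonant | -da | yol-da "on the road" |
| Back (a, ı, o, u) | Voiceless consonant | -ta | kitap-ta "in the book" |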
The -de/-te alternation is voicing assimilation to the stem-final segment; the -de/-da alternation is vowel harmony. Turkish word-formation chains can stack 5-10 suffixes, each with multiple allomorphs, so a single root can surface in thousands of distinct word forms.
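The plural and locative patterns can be sketched directly on Turkish spelling, which tracks the relevant vowel and voicing contrasts closely. This is a simplified illustration: the function names are hypothetical, input is assumed to be lowercase, and loanwords that violate harmony are not handled.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOICELESS_CONSONANTS = set("çfhkpsşt")


def last_vowel(stem: str) -> str:
    """Return the last vowel of the stem, which governs harmony."""
    for ch in reversed(stem):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    raise ValueError(f"no vowel in stem: {stem!r}")


def plural(stem: str) -> str:
    """Attach -lar after back-vowel stems and -ler after front-vowel stems."""
    return stem + ("lar" if last_vowel(stem) in BACK_VOWELS else "ler")


def locative(stem: str) -> str:
    """Attach -de/-te/-da/-ta: vowel by harmony, consonant by voicing."""
    vowel = "a" if last_vowel(stem) in BACK_VOWELS else "e"
    consonant = "t" if stem[-1] in VOICELESS_CONSONANTS else "d"
    return stem + consonant + vowel


for stem in ["ev", "göl", "yol", "otobüs"]:
    print(stem, "->", plural(stem), locative(stem))
```

The two choices in `locative` are independent, which is why two binary alternations yield four allomorphs.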
Determining Whether Two Morphs Are Allomorphs
The standard procedure (parallel to the phoneme/allophone test):
- Check for a contrast in meaning. If two morphs differ in meaning, they are different morphemes (not allomorphs), even when their forms coincide. The English plural [s] and the third-singular [s] look identical but mean different things; they are different morphemes that happen to share surface form.
- List the contexts of each morph. Identify what phonological or grammatical environment determines the choice.
- Check for complementary distribution. If morph A occurs only in contexts where morph B does not, and vice versa, they are allomorphs in complementary distribution.
- Assign an underlying morpheme. The convention varies; in Distributed Morphology (Halle-Marantz 1993), morphemes are abstract feature bundles, and allomorph selection is handled by vocabulary insertion rules in a post-syntactic module.
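The complementary-distribution step can be sketched as a pairwise disjointness check over the contexts recorded for each morph. The function name and the context sets below are illustrative assumptions (a small sample of stem-final segments for the English plural), and the check presupposes that the morphs have already been shown to share a meaning.

```python
from itertools import combinations


def in_complementary_distribution(contexts: dict) -> bool:
    """True if no context is shared by two different morphs."""
    return all(
        contexts[a].isdisjoint(contexts[b])
        for a, b in combinations(contexts, 2)
    )


# Hypothetical observations: morph -> stem-final segments it was seen after.
plural_contexts = {
    "s": {"p", "t", "k", "f", "θ"},
    "z": {"b", "d", "g", "v", "m", "n", "l", "vowel"},
    "ɪz": {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"},
}
print(in_complementary_distribution(plural_contexts))  # True

# If [s] and [z] were both attested after [t], the test would fail.
overlapping = {"s": {"t", "k"}, "z": {"t", "g"}}
print(in_complementary_distribution(overlapping))  # False
```

A passing check on a finite sample is evidence, not proof; more data can always reveal an overlapping context.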
Cross-Linguistic Typology
Languages vary dramatically in morphological complexity:
- Isolating languages (Mandarin, Vietnamese, classical Chinese): few or no bound morphemes; meaning is conveyed by word order and free morphemes.
- Analytic languages (much of modern English): grammatical relations are expressed largely by word order, auxiliary words, and free morphemes rather than rich inflection.
- Fusional languages (Latin, Russian, Spanish): a single affix can bundle several grammatical features. Latin amabamus "we were loving" combines root am- with tense, person, and number morphology.
- Agglutinative languages (Turkish, Finnish, Japanese, Korean, Swahili): rich morphology with each morpheme separable; words can have many sequential affixes each with a distinct meaning.
- Polysynthetic languages (Inuktitut, Mohawk, Yup'ik): extreme morphology; what would be a sentence in English is a single multi-morpheme word. Inuktitut tusaatsiarunnanngittualuujunga "I can't hear very well" is one word built from a verb root and a chain of suffixes (about six morphemes on a common segmentation).
The Greenberg universals and the typological data in WALS (World Atlas of Language Structures) catalog these patterns across thousands of languages.
ML Connections
Tokenization vs morphological analysis
Subword tokenizers (BPE, WordPiece) approximate morphological structure but do not reproduce it:
- For English, BPE often splits words at morpheme boundaries by accident: unhappiness may tokenize as un + happiness or un + happy + ness depending on training-corpus frequencies.
- For agglutinative languages, BPE typically produces sub-morphemic splits: a Turkish word with 8 morphemes might tokenize as 12-15 BPE tokens with token boundaries crossing morpheme boundaries.
- For polysynthetic languages, BPE typically fails to capture the morpheme structure at all; the long words become long token sequences with no morphological coherence.
This is one plausible contributor to multilingual-LLM performance gaps: low-resource morphologically rich languages are often represented by longer, less morpheme-aligned token sequences, which can make learning and evaluation harder. The size of the effect depends on the tokenizer, corpus, script, and downstream task.
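One way to quantify how well a tokenizer's output lines up with morpheme structure is to compare boundary positions. The sketch below is an assumed, simplified metric rather than a standard benchmark: both segmentations are given as lists of surface substrings, the token splits are invented for illustration and are not the output of any real tokenizer, and `boundary_scores` is a hypothetical helper name.

```python
def boundaries(segments: list) -> set:
    """Character offsets of the internal boundaries of a segmentation."""
    offsets, position = set(), 0
    for segment in segments[:-1]:
        position += len(segment)
        offsets.add(position)
    return offsets


def boundary_scores(tokens: list, morphemes: list) -> tuple:
    """Return (precision, recall) of token boundaries against morpheme boundaries."""
    if "".join(tokens) != "".join(morphemes):
        raise ValueError("segmentations must cover the same string")
    token_b, morph_b = boundaries(tokens), boundaries(morphemes)
    shared = len(token_b & morph_b)
    precision = shared / len(token_b) if token_b else 1.0
    recall = shared / len(morph_b) if morph_b else 1.0
    return precision, recall


# Surface segmentation of "unhappiness" (happy appears as "happi").
gold = ["un", "happi", "ness"]
print(boundary_scores(["un", "happiness"], gold))     # (1.0, 0.5)
print(boundary_scores(["unh", "app", "iness"], gold))  # (0.0, 0.0)

# Turkish ev-ler-iniz-den "from your houses" with an invented token split.
print(boundary_scores(["evl", "erin", "izden"], ["ev", "ler", "iniz", "den"]))  # (0.0, 0.0)
```

Precision asks how many token boundaries fall on morpheme boundaries; recall asks how many morpheme boundaries the tokenizer recovers. Averaging these over a gold-segmented word list would give a rough alignment score for a tokenizer.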
Morphologically-aware models
Several approaches either re-introduce morpheme-level structure or sidestep subword tokenization:
- Morfessor (Creutz-Lagus 2007): unsupervised morpheme segmentation from raw text. Used as a preprocessing step in some low-resource NMT systems.
- Character-level transformers: skip tokenization entirely; process raw bytes or characters. Avoids tokenizer mismatch at the cost of much longer sequences.
- Byte-level BPE with morphological priors: a hybrid used in some production systems. LLaMA-2, for example, splits numbers into individual digits, a choice usually motivated by arithmetic robustness, and some multilingual models bias their vocabulary toward likely morpheme boundaries.
Probing for morphological structure
Probing classifiers
(probing-classifiers-for-linguistic-structure)
can test whether transformer representations contain information
about morphological features or morpheme boundaries even when the
tokenizer is sub-morphemic. A successful probe under controls shows
that the information is extractable; it does not by itself show
that the model causally uses a human-like morpheme representation.
Common Mistakes
Confusing morpheme with syllable
Syllables are phonological units (vowel + surrounding consonants); morphemes are meaningful units. Strawberry has 3 syllables but 2 morphemes (straw + berry). Banana has 3 syllables but only 1 morpheme.
Treating allomorphs as different morphemes
The plural [s], [z], and [ɪz] are the same morpheme
realized in three contexts. The grammatical feature "plural"
is one; the surface form differs.
Assuming all languages have the same morpheme structure
English speakers often assume words have a clear stem-plus-affix structure. This is true of English and Romance languages but not of all languages. Polysynthetic languages have words that are entire propositions; isolating languages have minimal morpheme structure.
Forgetting suppletive forms
Went is the past tense of go, but no phonological process derives went from go. This is a suppletive form: a single morpheme realized as phonologically unrelated allomorphs in different grammatical contexts. Suppletion is rare but real; be (am / is / are / was / were) is the most suppletive verb in English.
Cross-Network Links
- LinguisticsPath internal: prerequisite phoneme-vs-allophone is the parallel phonology distinction; next natural topics are inflectional morphology, derivational morphology, and Distributed Morphology.
- TheoremPath: tokenization and information theory is the technical page to read when the question becomes LLM tokenization rather than linguistic morphology.
- ComputationPath direction: finite-state morphology (Kaplan-Kay 1994) formalizes morpheme-allomorph rules as finite-state transducers.
References
Canonical:
- Haspelmath, Martin, and Andrea D. Sims. Understanding Morphology (2010, 2nd ed.), Chapters 1-3.
- Aronoff, Mark, and Kirsten Fudeman. What is Morphology? (2010, 2nd ed.).
- Bauer, Laurie. The Linguistics Student's Handbook (2007), Chapter 9.
- Spencer, Andrew. Morphological Theory: An Introduction to Word Structure in Generative Grammar (1991).
- Halle, Morris, and Alec Marantz. "Distributed Morphology and the Pieces of Inflection." The View from Building 20 (1993) 111-176.
Computational:
- Creutz, Mathias, and Krista Lagus. "Unsupervised Models for Morpheme Segmentation and Morphology Learning." ACM Transactions on Speech and Language Processing 4 (2007) 1-34.
- Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Neural Machine Translation of Rare Words with Subword Units." ACL (2016).
- Bojanowski, Piotr, et al. "Enriching Word Vectors with Subword Information." TACL 5 (2017) 135-146.