
Computational Linguistics · 14 min

Vector Semantics and word2vec, Revisited

Words as vectors in a high-dimensional space, where geometric closeness reflects semantic similarity. The algorithmic instantiation of the distributional hypothesis: Mikolov's 2013 word2vec (skip-gram and CBOW) and the Levy-Goldberg 2014 result that skip-gram-with-negative-sampling implicitly factorizes a shifted PMI matrix.


Why This Matters

The distributional hypothesis says: words that occur in similar contexts have similar meanings. Vector semantics is the algorithmic instantiation: represent each word as a vector in $\mathbb{R}^d$ such that geometric closeness (cosine similarity, dot product) reflects distributional, and therefore semantic, similarity.

The word2vec algorithms (Mikolov et al. 2013) made vector semantics tractable at scale: vocabularies of hundreds of thousands of words can be embedded as 100-300-dimensional vectors trained on billions of tokens in hours of CPU time. The result was a phase change in NLP: pretrained word embeddings were the default input representation for most NLP tasks from 2013 to 2018, before contextual embeddings (ELMo, BERT) replaced them.

Even after the contextual-embedding revolution, vector semantics remains:

  • The conceptual substrate of all dense-representation NLP.
  • The starting point for understanding what transformer representations capture.
  • The teaching example for distributional semantics.
  • A practical baseline for many tasks (lightweight, no GPU needed).

This page treats word2vec as the canonical instance and discusses the Levy-Goldberg 2014 result that gives it formal grounding: skip-gram-with-negative-sampling implicitly factorizes a shifted PMI matrix.

The Skip-Gram Model

Skip-gram predicts context words from a center word. Given a corpus of sentences:

  1. Slide a window of size $2k + 1$ across each sentence.
  2. The center word is the input; the $2k$ surrounding words are the targets.
  3. For each (center, context) pair, train the model to assign high probability to the observed pair and low probability to (center, random word) pairs.
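
For concreteness, here is a minimal sketch of the pair-extraction step; the function name and the toy sentence are illustrative, not part of the original implementation.

```python
def skipgram_pairs(tokens, k=2):
    """Generate (center, context) training pairs with window half-size k."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - k), min(len(tokens), i + k + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# e.g. skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"], k=2)
```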

The architecture: each word $w$ has two vectors, $\mathbf{v}_w$ (center vector) and $\mathbf{v}'_w$ (context vector). The probability of context word $c$ given center word $w$ is

$$P(c \mid w) = \frac{\exp(\mathbf{v}_w \cdot \mathbf{v}'_c)}{\sum_{c' \in V} \exp(\mathbf{v}_w \cdot \mathbf{v}'_{c'})}.$$
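
A direct numpy translation of this softmax (a sketch; `V` and `V_ctx` are assumed to hold the center and context vectors row-wise) makes the per-prediction cost explicit:

```python
import numpy as np

def context_probs(w_idx, V, V_ctx):
    """Full-softmax P(c | w): dot the center vector of word w against every
    context vector, then normalize; cost is O(|V| * d) per prediction."""
    scores = V_ctx @ V[w_idx]          # shape (vocab_size,)
    scores -= scores.max()             # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()
```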

This softmax over the full vocabulary is expensive. Two techniques mitigate the cost:

  • Hierarchical softmax: organize the vocabulary as a binary tree; predict each binary decision rather than the full softmax. Cost: $O(\log V)$ per prediction.
  • Negative sampling: replace the softmax with a binary classification task: distinguish the true context word from $k$ random negative samples.

Negative sampling is the dominant choice. The objective:

$$\max \sum_{(w, c) \in D} \log \sigma(\mathbf{v}_w \cdot \mathbf{v}'_c) + \sum_{(w, c') \in D'} \log \sigma(-\mathbf{v}_w \cdot \mathbf{v}'_{c'})$$

where $D$ is the set of true (word, context) pairs from the corpus, $D'$ is the set of negative samples, and $\sigma$ is the sigmoid.
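
A minimal numpy sketch of this objective for one positive pair and its sampled negatives (the matrix names `V`, `V_ctx` are illustrative assumptions, not the reference implementation):

```python
import numpy as np

def sgns_loss(w_idx, c_idx, neg_idx, V, V_ctx):
    """Negative-sampling loss for one true pair (w, c) and k sampled negatives.
    Maximizing the objective above = minimizing this negated sum of log-sigmoids."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = np.log(sigmoid(V[w_idx] @ V_ctx[c_idx]))
    neg = np.log(sigmoid(-(V_ctx[neg_idx] @ V[w_idx]))).sum()
    return -(pos + neg)

# In the original word2vec implementation, negatives are drawn from the
# unigram distribution raised to the 3/4 power.
```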

The CBOW Model

The dual of skip-gram: predict the center word from the context. Architecture:

  1. Compute the average of the context-word vectors.
  2. Predict the center word from this average.

CBOW is faster to train than skip-gram (one prediction per window instead of $2k$) but produces slightly worse representations on most evaluations. Skip-gram's separate loss term for each context word gives a finer-grained gradient signal.
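
For comparison with the skip-gram sketch above, a minimal CBOW forward pass might look like the following; `V_in` and `V_out` are assumed input and output embedding matrices:

```python
import numpy as np

def cbow_scores(context_idx, V_in, V_out):
    """CBOW forward pass: average the input vectors of the context words,
    then score every vocabulary word as a candidate center word."""
    h = V_in[context_idx].mean(axis=0)   # averaged context representation
    return V_out @ h                      # one score per vocabulary word
```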

The Levy-Goldberg PMI Theorem

Theorem

Skip-Gram-Negative-Sampling Implicitly Factorizes Shifted PMI

Statement

At convergence, the skip-gram-negative-sampling objective satisfies $\mathbf{v}_w \cdot \mathbf{v}'_c = \mathrm{PMI}(w, c) - \log k$, where $\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}$ is the pointwise mutual information of the (word, context) pair and $k$ is the number of negative samples.

Intuition

The negative-sampling loss is a binary cross-entropy that can be solved analytically for each (w, c) pair. The optimum dot product is exactly the shifted PMI value.

Proof Sketch

Levy-Goldberg 2014 showed: at the optimum, the binary cross-entropy loss for each pair is minimized when $\sigma(\mathbf{v}_w \cdot \mathbf{v}'_c) = \frac{P(w, c) / (P(w)\,P(c))}{P(w, c) / (P(w)\,P(c)) + k} = \frac{e^{\mathrm{PMI}(w, c)}}{e^{\mathrm{PMI}(w, c)} + k}$.

Solving for the dot product: $\mathbf{v}_w \cdot \mathbf{v}'_c = \log \frac{e^{\mathrm{PMI}(w, c)}}{k} = \mathrm{PMI}(w, c) - \log k$.

Therefore the matrix $M$ with $M_{w, c} = \mathbf{v}_w \cdot \mathbf{v}'_c$ is, at convergence, the shifted PMI matrix. The skip-gram optimization is implicitly a low-rank factorization of this matrix.

Why It Matters

The result connects skip-gram to the count-based distributional-semantics tradition (Turney-Pantel 2010). Count-based methods explicitly compute PMI matrices and SVD-factorize them; skip-gram does the same factorization implicitly via SGD. The two paradigms are not different theories but different algorithmic routes to the same matrix factorization.
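
The count-based route can be sketched directly. This is a simplified illustration, not Levy-Goldberg's exact recipe (in practice they use positive PMI and consider several ways of weighting the SVD factors):

```python
import numpy as np

def shifted_pmi_embeddings(counts, k=5, dim=100):
    """Explicit counterpart of SGNS: build the shifted PMI matrix from a
    (word x context) co-occurrence count matrix and factorize it with SVD."""
    total = counts.sum()
    p_wc = counts / total
    p_w = p_wc.sum(axis=1, keepdims=True)
    p_c = p_wc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    spmi = pmi - np.log(k)
    spmi[~np.isfinite(spmi)] = 0.0        # unobserved pairs: PMI undefined, zero out
    U, S, _ = np.linalg.svd(spmi, full_matrices=False)
    return U[:, :dim] * np.sqrt(S[:dim])  # word embeddings (one common convention)
```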

Failure Mode

The theorem holds at the optimum and assumes the embedding dimension is large enough to represent the shifted PMI matrix exactly. In practice, the dimension is far smaller than the vocabulary, training runs do not reach the global optimum, and the resulting embeddings differ slightly from the analytic shifted-PMI factorization. The differences are typically small but observable.

Analogy Completion: king − man + woman ≈ queen

The most-cited word2vec result. The analogy task: "$a$ is to $b$ as $c$ is to ?"; the answer is the word $d$ whose vector is nearest to $\mathbf{v}_b - \mathbf{v}_a + \mathbf{v}_c$.
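
A minimal sketch of this nearest-neighbor search; the embedding matrix `E` and the vocabulary list are assumed inputs, and excluding the three query words follows standard practice (without it, one of the inputs is usually returned):

```python
import numpy as np

def analogy(a, b, c, vocab, E):
    """Return the word d whose embedding is most cosine-similar to b - a + c."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = E[idx[b]] - E[idx[a]] + E[idx[c]]
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ (target / np.linalg.norm(target))
    sims[[idx[a], idx[b], idx[c]]] = -np.inf   # never return the query words
    return vocab[int(np.argmax(sims))]
```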

Examples that work:

  • king − man + woman ≈ queen
  • Paris − France + Italy ≈ Rome
  • walking − walked + ran ≈ running

Examples that don't work as cleanly:

  • Unfamiliar entity pairs (rare names; concepts not in the training corpus).
  • Semantic categories with non-vector-additive structure (color analogies, prepositional analogies).
  • Many proposed "analogies" in the original Mikolov paper turn out to be artifacts of the test set rather than general linguistic patterns.

The analogy result is a partial validation: vector semantics captures some but not all linguistic structure.

Production Use and Decline

Word2vec dominated NLP from 2013 to ~2018, then was largely replaced by contextual embeddings (ELMo, BERT, GPT). The reasons:

  • Word2vec gives one vector per word type. Bank (river edge) and bank (financial institution) get the same vector, an obvious limitation.
  • Contextual embeddings produce per-token representations that differ across contexts.
  • Pretrained transformers ship with substantially better representations for most downstream tasks.

Word2vec persists in:

  • Lightweight production deployments: no GPU, embedded systems, low-latency contexts.
  • Information-retrieval baselines: efficient document embedding via averaged word2vec vectors (see the sketch after this list).
  • Scientific-literature embeddings: trained on domain-specific corpora; competitive with general-purpose contextual models in narrow domains.
  • Pedagogy: the cleanest example of distributional semantics at the introductory level.
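
As an illustration of the averaged-vector baseline mentioned above, a minimal sketch (assuming `word_vectors` is any dict-like mapping from token to vector):

```python
import numpy as np

def doc_embedding(tokens, word_vectors, dim=300):
    """Average the vectors of in-vocabulary tokens: a cheap, GPU-free
    document representation for retrieval baselines."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```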

ML Connections

From word2vec to BERT to LLMs

The progression:

  1. Static word embeddings (word2vec, GloVe, fastText): one vector per word type, learned from co-occurrence statistics.
  2. Contextual word embeddings (ELMo, CoVe): vectors depend on the surrounding sentence; word-sense disambiguation emerges.
  3. Bidirectional contextual embeddings (BERT, RoBERTa): pretrained masked-language-model objective; the representations capture deep syntactic and semantic structure.
  4. Causal-language-model embeddings (GPT family): trained on next-token prediction; useful for generation, not just classification.

Each generation absorbed the previous and added context-sensitivity. The distributional-hypothesis substrate runs through all of them.

Compositional distributional semantics

The Coecke-Sadrzadeh-Clark 2010 framework extends vector semantics to compositional meanings: phrase and sentence meanings as tensor contractions over word vectors. The mapping: each syntactic type has a tensor of corresponding order; function application is tensor contraction.
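
A toy illustration of the type-to-tensor-order mapping, with made-up dimensions and random placeholder values (a sketch of the idea, not the full categorical machinery):

```python
import numpy as np

d = 4                                # toy embedding dimension
noun = np.random.rand(d)             # a noun meaning: an order-1 tensor (vector)
adjective = np.random.rand(d, d)     # an adjective meaning: an order-2 tensor (matrix)

# Function application as tensor contraction: the adjective acts on the noun,
# yielding another noun-type vector ("red car" from "red" and "car").
adjective_noun = adjective @ noun
```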

The framework is mathematically clean but didn't dominate NLP because contextual transformers absorbed the compositionality task implicitly. Compositional distributional semantics persists in categorical-quantum-mechanics-flavored NLP (DisCoCat, quantum NLP) and as a theoretical bridge between formal semantics and distributional semantics.

Biases and ethical issues

Word embeddings encode social biases present in the training corpus: gender stereotypes, racial biases, occupational associations. Bolukbasi et al. 2016 documented that man − woman ≈ programmer − homemaker in word2vec embeddings trained on Google News. Subsequent work proposed debiasing techniques; the conceptual question (whether biases reflect the data or can be excised from representations) remains contested.

Common Mistakes

Watch Out

Treating word2vec as the only distributional method

Word2vec is one algorithm. GloVe and fastText use different objectives but produce similar embeddings. The distributional hypothesis is the underlying claim; many algorithms instantiate it.

Watch Out

Conflating cosine similarity with semantic similarity

Cosine similarity in word2vec space correlates with semantic similarity but is not identical to it. Antonyms (hot / cold) have high cosine similarity because they appear in similar contexts: they are distributionally similar but semantically opposite.

Watch Out

Assuming static embeddings handle polysemy

Bank (river) and bank (financial) get the same vector in word2vec. Static embeddings are word-type-level; they do not disambiguate senses. Contextual embeddings (BERT, GPT) handle this.

Watch Out

Treating analogies as universally working

The famous king − man + woman ≈ queen analogy works in specific cases but is far from universal. Many "analogy benchmarks" tested with word2vec turned out to have artifacts that didn't generalize. The analogy story is more nuanced than popular accounts suggest.


References

Canonical:

  • Turney, Peter D., and Patrick Pantel. "From Frequency to Meaning: Vector Space Models of Semantics." JAIR 37 (2010) 141-188.
  • Manning, Christopher D., and Hinrich Schütze. Foundations of Statistical Natural Language Processing (1999), Chapter 8.
  • Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space." ICLR Workshop (2013).
  • Mikolov, Tomas, et al. "Distributed Representations of Words and Phrases and their Compositionality." NeurIPS (2013).

Embedding Models and Composition:

  • Levy, Omer, and Yoav Goldberg. "Neural Word Embedding as Implicit Matrix Factorization." NeurIPS (2014).
  • Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." EMNLP (2014).
  • Bojanowski, Piotr, et al. "Enriching Word Vectors with Subword Information." TACL 5 (2017) 135-146.
  • Coecke, Bob, Mehrnoosh Sadrzadeh, and Stephen Clark. "Mathematical Foundations for a Compositional Distributional Model of Meaning." Linguistic Analysis 36 (2010) 345-384.