Language is not a fixed system of rules; it is a dynamic cognitive system, continuously reshaped by every encounter we have with it. At the heart of my work is a question about the architecture of that system: how do humans and machines trade off between storage and computation?

On the storage side, linguistic knowledge can be represented holistically: whole phrases, collocations, and constructions stored as units in memory. On the computation side, we assemble meaning from parts: abstract rules and schemas that apply broadly. Most linguistic behavior doesn't fall cleanly into either camp.

What I study is the balance between them: how much exposure is needed to shift a pattern from computed to stored (which may be more of a continuum than a binary), how that shift manifests cognitively and behaviorally, and whether the computational systems we build for language mirror or diverge from humans. I approach these questions using neural networks, Bayesian models, and error-driven learning frameworks, alongside corpus analyses and experimental psycholinguistic methods.

Mental Representations

Meaning isn't stored in a single place. It's distributed.

There is no single neuron or brain region that "holds" the word dog. Meaning emerges from patterns of activation distributed across millions of interconnected neurons, shaped by every context in which that word has appeared. Click words from the two groups below. Notice that words within the same category activate overlapping but distinct patterns, while words from different categories share only a small core of neurons encoding general biological knowledge.

// interactive demo · distributed neural activation
Animals
Plants

Within-category words share most of their pattern. Cross-category words share only a small "living things" core.

Neuron overlap is weighted by GloVe 50d cosine similarity — so dog/cat share more than dog/wolf, and animals share little with plants. Layout is illustrative, not measured neural data.

// research Distributed representations are central to my view of linguistic knowledge: not a rule system applied to input, but a high-dimensional space of patterns learned from exposure. This framing motivates my interest in storage: a holistically stored phrase is one whose representational geometry has been shaped by its full usage history, not merely assembled from its parts.

Learning

Prediction error drives shifts in representations.

A core claim of usage-based linguistics is that linguistic knowledge is learned, not innate, and that the mechanism is error-driven: representations are updated in proportion to how wrong a prediction was. The Rescorla-Wagner model formalizes this: ΔV = αβ(λ − ΣV), where α is cue salience, β is the learning rate, λ is the asymptote of learning (the maximum associative strength the outcome can support, analogous to a training label), and ΣV is the total current prediction across all active cues. The term (λ − ΣV) is the prediction error: how far the system's current prediction falls short of the outcome. The notation V for associative strength follows the original Rescorla-Wagner convention. This rule captures several counterintuitive effects (blocking, extinction, overexpectation) purely from the logic of prediction-error minimization.
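The update rule can be sketched in a few lines of Python. This is a minimal illustration, not a fitted model: the learning parameters and trial counts are arbitrary. Blocking falls out directly: a cue trained alongside an already-predictive cue gains almost no associative strength, because the compound leaves almost no prediction error.

```python
# Minimal Rescorla-Wagner sketch. Parameter values (alpha, beta, lam)
# and trial counts are illustrative, not fitted to data.

def rw_update(V, active_cues, lam, alpha=0.3, beta=1.0):
    """One trial: Delta-V = alpha * beta * (lam - sum of V over active cues)."""
    error = lam - sum(V[c] for c in active_cues)  # prediction error
    for c in active_cues:
        V[c] += alpha * beta * error
    return V

# Blocking: pretrain cue A alone, then train the compound A+B.
V = {"A": 0.0, "B": 0.0}
for _ in range(50):
    rw_update(V, ["A"], lam=1.0)        # A alone comes to predict the outcome
for _ in range(50):
    rw_update(V, ["A", "B"], lam=1.0)   # A already predicts it: error is ~0

print(round(V["A"], 2), round(V["B"], 2))  # B stays near 0: it was "blocked"
```

Nothing about blocking is stipulated here; it emerges because learning is driven by the shared error term, not by each cue's own co-occurrence with the outcome.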

// interactive demo · Rescorla-Wagner model
0.20
0.20

// research Error-driven learning is the primary mechanism in my models of how linguistic representations change with exposure, including how binomial order preferences become entrenched. The key prediction: as a phrase becomes more predictable, prediction error decreases, and the representation shifts from computed to stored.

Computational Modeling

How much of language is memorized, and how much is assembled?

One test case for the storage-computation tradeoff is binomial expressions: phrases like bread and butter or black and white, where word order is conventionalized. A purely compositional system derives a phrase's meaning by combining its parts, treating both orderings as equivalent (since the operation is symmetric). A holistic system stores the canonical form as its own unit, with a representation shaped by its specific usage history. Under holistic storage, the two orderings are no longer equivalent; they occupy distinct positions in representational space.

Frequency predicts where a binomial falls on this continuum: high-frequency binomials are more likely to have developed holistic representations; low-frequency ones remain compositional. The visualization below uses GloVe word embeddings (reduced to 2D) to illustrate this. At Compositional, both orderings of a phrase map to the same point (their meaning is just the average of the two words, regardless of order). Drag toward Holistic: the two orderings separate, with high-frequency binomials (larger, brighter) diverging more than low-frequency ones (smaller, dimmer).

// demo · canonical (●) vs reversed (○) ordering in embedding space

Each binomial appears twice: a filled dot (●) for the canonical ordering and an outlined dot (○) for the reversed ordering, connected by a line. At Compositional, both dots sit at the same point — the average of their two words' GloVe vectors — because a fully compositional system treats both orderings as equivalent. Drag toward Holistic to reveal each ordering's actual position in GloVe space. Color encodes relative frequency (dark = low, light = high). High-frequency binomials diverge more, reflecting that their canonical and reversed forms have developed more distinct distributional histories. Hover over a dot to see the label.
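The contrast the demo visualizes can be stated in a few lines of code. The 2-D vectors below are made-up toy values, not real GloVe embeddings; they only illustrate the structural point.

```python
# Toy sketch of the compositional vs. holistic contrast.
# The 2-D vectors are hypothetical illustrations, not real GloVe embeddings.

def compositional(v1, v2):
    """Order-symmetric composition: average the two word vectors."""
    return tuple((a + b) / 2 for a, b in zip(v1, v2))

bread  = (0.9, 0.1)   # hypothetical word vector
butter = (0.7, 0.3)   # hypothetical word vector

# A purely compositional system cannot distinguish the two orderings:
assert compositional(bread, butter) == compositional(butter, bread)

# A holistic system stores each ordering as its own unit, so the two
# phrase vectors can sit at distinct points shaped by usage history:
holistic = {
    "bread and butter": (0.85, 0.12),  # hypothetical stored phrase vector
    "butter and bread": (0.78, 0.25),  # hypothetical stored phrase vector
}
assert holistic["bread and butter"] != holistic["butter and bread"]
```

The averaging operation is symmetric by construction, which is exactly why a compositional system collapses both orderings to one point; storage is what makes order matter.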

Noisy-Channel Comprehension & Frequency-Dependent Preference Extremity

Language is transmitted through a noisy channel. Words get swallowed in a crowd, misheard over a bad connection, or garbled by a fast speaker. A rational listener doesn't simply record what they heard — they infer what the speaker most likely intended, given both the signal that arrived and prior knowledge about what speakers tend to say. This is the core insight behind noisy-channel models of comprehension (Gibson et al., 2013): perception is not passive recording but active probabilistic inference.

For binomials, the relevant "noise" is word-order error: the possibility that a speaker intended one ordering but produced the other. A listener who hears butter and bread doesn't necessarily update toward believing that order is acceptable — they may instead infer that the speaker meant bread and butter and made a slip. The noisy-channel model formalizes this: P(Sᵢ | Sₚ) ∝ P(Sᵢ) · P(Sᵢ→Sₚ), where Sᵢ is the inferred intended ordering, Sₚ is what was perceived, P(Sᵢ) encodes abstract ordering preferences (shorter before longer, animate before inanimate, and so on), and P(Sᵢ→Sₚ) is the probability that the intended form was distorted into the perceived form. Frequency does not strengthen the prior — it determines N: how many times a learner encounters the expression.

This framework can be used to model how ordering preferences evolve across generations of learners. In an iterated learning paradigm, each generation hears N tokens, updates beliefs via noisy-channel inference, and passes the language on. When listeners infer slightly more noise than speakers actually produce, a small bias compounds across generations: the preferred ordering becomes increasingly dominant, but only for high-N expressions. Low-frequency binomials lack the exposure for the compounding to take hold, so their preferences stay weak and variable. Toggle noise below to see: without noise, all N-curves collapse to the diagonal and frequency has no effect.
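The compounding dynamic can be sketched schematically. This is not the Houghton & Morgan (2024) implementation: the noise rates, prior strength, and generation count are illustrative assumptions, and the learner's belief is summarized as a Beta posterior mean rather than a full posterior.

```python
# Schematic sketch of frequency-dependent preference extremity under
# iterated noisy-channel learning. Illustrative parameters throughout;
# not the Houghton & Morgan (2024) code.

def infer_pref_fraction(p, noise_spk, noise_lis):
    """Expected fraction of tokens decoded as the preferred ordering, when
    speakers flip orderings at rate noise_spk but the listener assumes
    a flip rate of noise_lis."""
    heard_pref = p * (1 - noise_spk) + (1 - p) * noise_spk
    post_given_pref = p * (1 - noise_lis) / (
        p * (1 - noise_lis) + (1 - p) * noise_lis)
    post_given_disp = p * noise_lis / (
        p * noise_lis + (1 - p) * (1 - noise_lis))
    return heard_pref * post_given_pref + (1 - heard_pref) * post_given_disp

def iterate(p0, N, generations=50, noise_spk=0.01, noise_lis=0.1, prior=5.0):
    """Each generation hears N tokens, decodes them via noisy-channel
    inference, and forms a new belief as a Beta posterior mean
    (a symmetric Beta(prior, prior) pulls sparse evidence toward 0.5)."""
    p = p0
    for _ in range(generations):
        frac = infer_pref_fraction(p, noise_spk, noise_lis)
        p = (prior + N * frac) / (2 * prior + N)
    return p

low  = iterate(0.6, N=5)    # low-frequency binomial: preference stays weak
high = iterate(0.6, N=200)  # high-frequency binomial: preference polarizes
```

Two properties of the sketch match the text: with `noise_lis == noise_spk`, decoding is calibrated and no polarization occurs at any N; with a listener who over-infers noise, only high-N expressions accumulate enough decoded evidence per generation for the bias to compound.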

// demo · frequency-dependent preference extremity (Houghton & Morgan, 2024)
pnoise = 0.04

Without noise: all four N-curves lie on the diagonal — no frequency effect. With listener noise > speaker noise: higher N steepens the S-curve.

// research This model (Houghton & Morgan, 2024) demonstrates that frequency-dependent preference extremity can arise purely from noisy-channel processing, and that it requires listeners to infer more noise than speakers actually produce. The model accounts for the corpus-wide distribution of binomial ordering preferences across 594 binomials from Morgan & Levy (2015). It also offers a natural explanation for why the most frozen, non-reversible binomials — bread and butter, black and white, rise and fall — tend to be the most frequent: high frequency means more exposures, more inferential compounding, and stronger polarization away from the dispreferred order. Freezing is not an accident of convention; it is a predictable consequence of being heard often.

Statistics

Quantifying uncertainty, updating beliefs from data.

Most of my work combines Bayesian modeling and mixed-effects regression, chosen because they make assumptions explicit and quantify uncertainty rather than collapsing it into a p-value. This section walks through the intuition behind each.

01 · Bayesian Modeling

For Bayesian modeling, I walk through an extended example from the social deduction game Werewolf, a game I genuinely enjoy and one that maps almost perfectly onto the structure of Bayesian inference. No statistics background required.

The Setup

You're playing Werewolf: 5 players, 1 is secretly the werewolf. You don't know who. During the game you observe someone acting suspiciously. Should that update your belief that they're the werewolf? It depends entirely on their baseline rate of acting suspicious, and critically on whether that rate differs between their werewolf and innocent roles. Compare two players from past games:

Charles: always suspicious

As werewolf, Charles acted suspiciously 8 out of 10 rounds. As an innocent, he acted suspiciously 8 out of 10 rounds. His behavior tells you nothing about his role; it's just his personality.

θ_w ~ Beta(8, 2) · mean ≈ 0.80
θ_i ~ Beta(8, 2) · mean ≈ 0.80
→ suspicious acts: uninformative

Spencer: only suspicious as werewolf

As werewolf, Spencer acted suspiciously 8 out of 10 rounds. As an innocent, he acted suspiciously only 2 out of 10 rounds. His suspicious behavior strongly distinguishes his roles.

θ_w ~ Beta(8, 2) · mean ≈ 0.80
θ_i ~ Beta(2, 8) · mean ≈ 0.20
→ suspicious acts: highly informative

Behavioral History

Before the current round starts, you've been keeping mental tally marks across all previous rounds. For each player you track two columns: how often they acted suspiciously when they were the werewolf, and how often they acted suspiciously when they were innocent. We encode these counts as Beta distributions:

// behavioral history in count form · W = werewolf role, I = innocent role
θ_w ~ Beta(α_w, β_w)    where α_w = suspicious acts as W
                              β_w = non-suspicious acts as W

θ_i ~ Beta(α_i, β_i)    where α_i = suspicious acts as I
                              β_i = non-suspicious acts as I

// E[θ_w] = α_w / (α_w + β_w) · E[θ_i] = α_i / (α_i + β_i)
// More past rounds → larger α+β → narrower distribution → more reliable estimate.
// Charles: α_w=8, β_w=2, α_i=8, β_i=2 → E[θ_w] = E[θ_i] = 0.80. Identical.
// Spencer: α_w=8, β_w=2, α_i=2, β_i=8 → E[θ_w] = 0.80, E[θ_i] = 0.20. Separated.

These distributions are fixed for the duration of the current round — you're using what you learned before to make inferences now.

Updating P(W)

A new round begins. P(W) = 1/5 — one werewolf among five players. Each observation updates it via Bayes' rule, and the result becomes the new P(W) going into the next observation. The likelihoods come directly from the behavioral history:

// likelihoods · from behavioral history
P(S | W) = E[θ_w] = α_w / (α_w + β_w)
P(S | I) = E[θ_i] = α_i / (α_i + β_i)

P(¬S | W) = 1 − E[θ_w] = β_w / (α_w + β_w)
P(¬S | I) = 1 − E[θ_i] = β_i / (α_i + β_i)
// full update rule · after observing S
P(W | S) =           P(W) · α_w/(α_w+β_w)
           ──────────────────────────────────────────────
           P(W) · α_w/(α_w+β_w) + (1−P(W)) · α_i/(α_i+β_i)

// after ¬S: replace α_w with β_w and α_i with β_i above
// α_w, β_w, α_i, β_i are fixed within a round — they come from past rounds

How much P(W) moves is controlled by the likelihood ratio P(S|W) / P(S|I). For Charles, 0.80 / 0.80 = 1: S is equally likely under either role, so it leaves P(W) unchanged. For Spencer, 0.80 / 0.20 = 4: each S quadruples the odds he's the werewolf. Same observation, same rule — different behavioral history, different inference.
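The Charles/Spencer contrast can be run directly. This is a minimal sketch of the update rule above, using the toy counts from the text.

```python
# Sequential update of P(W) from observed behavior.
# Counts for Charles and Spencer are the toy values from the text.

def update_pw(pw, suspicious, a_w, b_w, a_i, b_i):
    """One Bayes update of P(werewolf) after observing S (True) or not-S."""
    if suspicious:
        like_w = a_w / (a_w + b_w)   # P(S | W) = E[theta_w]
        like_i = a_i / (a_i + b_i)   # P(S | I) = E[theta_i]
    else:
        like_w = b_w / (a_w + b_w)   # P(not-S | W)
        like_i = b_i / (a_i + b_i)   # P(not-S | I)
    return pw * like_w / (pw * like_w + (1 - pw) * like_i)

# Three suspicious acts in a row, starting from P(W) = 1/5:
pw_charles = pw_spencer = 0.2
for _ in range(3):
    pw_charles = update_pw(pw_charles, True, 8, 2, 8, 2)  # LR = 0.8/0.8 = 1
    pw_spencer = update_pw(pw_spencer, True, 8, 2, 2, 8)  # LR = 0.8/0.2 = 4

print(round(pw_charles, 3), round(pw_spencer, 3))
# Charles is unchanged at 0.2; Spencer climbs to ~0.94 (odds 1/4 · 4³ = 16)
```

Same observations, same rule: only the behavioral history differs, and it alone determines whether P(W) moves.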

// deeper dive · where the likelihoods come from

Why is P(S|W) = α_w / (α_w + β_w)? The answer is marginalization: P(S|W) = ∫ P(S|θ_w) · p(θ_w) dθ_w, where p(θ_w) = Beta(α_w, β_w) is your accumulated belief about the player's true suspicious rate after all past rounds as W. Each past round contributed one Bayesian update — here's what a single update looks like, starting from a current prior Beta(α_w, β_w) and observing S:

// one round's update to θ_w after observing S as W
p(θ_w | S) ∝ P(S | θ_w) · p(θ_w)
           = θ_w · Beta(α_w, β_w)
           = Beta(α_w + 1, β_w)   // by Beta-Bernoulli conjugacy

// symmetrically, after ¬S: p(θ_w | ¬S) = Beta(α_w, β_w + 1)
// each round's posterior becomes the prior for the next
// after all past rounds as W: p(θ_w) = Beta(α_w, β_w)
// same process builds p(θ_i) = Beta(α_i, β_i) from rounds as I

That accumulated posterior — p(θ_w | all past rounds as W) = Beta(α_w, β_w) — is the p(θ_w) term in the marginalization integral below. We write it as p(θ_w) for short because α_w and β_w already encode all past observations. Rather than using a single fixed value for θ_w, we integrate P(S|θ_w) over all its plausible values, weighted by Beta(α_w, β_w):

// substituting p(θ_w) = Beta(α_w, β_w) into the marginalization
P(S|W) = ∫ P(S|θ_w) · p(θ_w) dθ_w
       = ∫ θ_w · p(θ_w) dθ_w   // since P(S|θ_w) = θ_w
       = E[θ_w] = α_w / (α_w + β_w)

The integral reduces to the mean of the Beta because P(S|θ_w) = θ_w — the Bernoulli likelihood is linear in θ_w, so integrating it against the Beta density gives exactly the first moment. The behavioral history, accumulated round by round, collapses to a single number that directly controls how much each new observation moves P(W).
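The marginalization result is easy to verify numerically. A Monte Carlo sketch using only the standard library: sample θ_w from Beta(α_w, β_w) and average, which approximates the integral of θ_w against the Beta density.

```python
import random

# Numeric check of the marginalization: integrating the Bernoulli
# likelihood P(S|theta) = theta against a Beta(alpha, beta) density
# yields the Beta mean alpha / (alpha + beta).

random.seed(0)
alpha_w, beta_w = 8, 2  # Charles's / Spencer's werewolf-role counts

samples = [random.betavariate(alpha_w, beta_w) for _ in range(100_000)]
mc_estimate = sum(samples) / len(samples)     # Monte Carlo ~ E[theta_w]
analytic = alpha_w / (alpha_w + beta_w)       # closed form = 0.80

print(round(mc_estimate, 2), analytic)

# Beta-Bernoulli conjugacy: observing one more S just increments alpha.
posterior_mean_after_S = (alpha_w + 1) / (alpha_w + 1 + beta_w)  # 9/11 ~ 0.818
```

The Monte Carlo mean and the closed-form α/(α+β) agree to two decimals, confirming that the whole behavioral history really does collapse to one number.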

// fun fact · neural networks as Bayesian reasoners

The Bayesian framework may be more than a metaphor for cognition. Recent work by Kapatsinski (2026) demonstrates that GPT-2 performs adaptive partial pooling: rare contexts borrow more information from similar contexts, while frequent contexts rely on their own specific evidence. This is mathematically equivalent to Bayesian hierarchical regression. A neural network trained purely on next-word prediction converges on the same inferential solution that a Bayesian statistician derives analytically.

// interactive demo · Bayesian updating with Beta priors

Select a player. The top chart shows their Beta priors built from past rounds: the purple curve is θ_w (suspicious rate in role W) and the gray curve is θ_i (suspicious rate in role I). Wide overlap (Charles) means no S can move P(W); clear separation (Spencer) means each S carries real information. Add observations and watch P(W) update below.

end round →
Round: 1  |  n: 0  |  S: 0  |  P(W): 0.200

The key takeaway: what you know about a player from past rounds directly shapes what you can infer from their behavior now. The priors θ_w and θ_i aren't just background noise. They encode the entire history of how each player has acted across roles. A player who was always suspicious as the werewolf and always suspicious as an innocent gives you nothing to work with. A player whose suspicious behavior reliably tracks their role is the one whose actions actually move the needle.

// connection

This structure is related to the noisy-channel processing model used in my research on frequency-dependent preference extremity. Both use a Beta distribution built from accumulated evidence to inform current inference via Bayes' rule. The difference is in what the Beta is doing: here, the Beta priors are on the likelihood parameters — encoding how diagnostic behavior is per role — and the inference is about a hidden categorical state (the player's role). In the noisy-channel model, the Beta is on the prior over utterances, and the inference is a decoding problem: given what was heard, what did the speaker intend?

// research The Bayesian framework is more than a statistical convenience. Comprehenders appear to function as probabilistic inference engines, combining prior knowledge with incoming evidence. My experimental work tries to measure both quantities directly, and the connection to neural network behavior (Kapatsinski, 2026) suggests these may not be two separate stories.

02 · Mixed-Effects Regression

Mixed-effects regression is the everyday workhorse of experimental linguistics. It models how one variable predicts another while accounting for structure in the data: multiple observations from the same participant, the same words appearing across participants. Treating those groups as independent inflates confidence; mixed effects pool information across groups in a principled way, a process called partial pooling.

The Question

Does word frequency predict reaction time in a lexical decision task? High-frequency words should be recognized faster. But the size of that effect probably varies across participants: some readers are more sensitive to frequency than others. A mixed-effects model estimates both the population-level trend and participant-specific deviations from it simultaneously.

The Model

// mixed-effects model · reaction time and word frequency
RT_ij = β₀ + β₁·log(freq_j) + b₀ᵢ + b₁ᵢ·log(freq_j) + ε_ij

// RT_ij: reaction time for participant i on word j (milliseconds)
// β₀: population intercept (average RT when log freq = 0)
// β₁: fixed slope (average frequency effect across all participants)
// b₀ᵢ: random intercept for participant i (overall faster/slower than average)
// b₁ᵢ: random slope for participant i (how much their frequency effect differs)
// ε_ij: residual error, ε ~ N(0, σ²)

The random effects [b₀ᵢ, b₁ᵢ] are assumed jointly normal, centered at zero:

// random effects structure
[b₀ᵢ, b₁ᵢ] ~ MVN(0, Σ)

// Σ captures variance in intercepts, variance in slopes, and their correlation.
// In R (lme4): lmer(RT ~ log_freq + (1 + log_freq | participant), data = df)

Why Not Just Fit Each Participant Separately?

Each participant contributes only a limited number of observations, so per-participant ordinary least squares estimates are noisy. The mixed model exploits the fact that participants are drawn from the same population: extreme estimates get pulled toward the group mean in proportion to how uncertain they are. This shrinkage improves estimation without assuming everyone is identical. The dashed white line in the demo below is the fixed effect; the colored lines show each participant's model-estimated slope, already partially pooled toward the population.
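The shrinkage principle can be sketched for the simplest case, a per-participant mean. This is not what lme4 computes internally for random slopes, but the same empirical-Bayes logic applies; the two variance components here are illustrative values, not estimates.

```python
# Schematic sketch of partial pooling (shrinkage) for per-participant
# estimates. Same principle as a mixed model, simplified to group means;
# resid_var and group_var are illustrative, not estimated.

def shrink(group_mean, n, grand_mean, resid_var=100.0, group_var=25.0):
    """Empirical-Bayes shrinkage: weight the group's own mean by its
    reliability. More observations -> higher weight -> less shrinkage."""
    w = n / (n + resid_var / group_var)
    return w * group_mean + (1 - w) * grand_mean

grand = 600.0  # grand mean RT (ms), illustrative

print(shrink(640.0, n=4,  grand_mean=grand))  # few trials: pulled toward 600
print(shrink(640.0, n=40, grand_mean=grand))  # many trials: stays near 640
```

A participant observed 4 times is pulled halfway to the group mean; one observed 40 times keeps most of their own estimate. This is exactly the "shrinkage in proportion to uncertainty" described above.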

// interactive demo · reaction time and word frequency

Simulated data: 6 participants, 20 words each. Each point is one trial. Colored lines show per-participant regression lines (fixed effect plus random effects). The dashed white line is the population-level fixed effect only. Toggle each to see how participants cluster around the group trend.

// research Mixed-effects models are foundational to my experimental work. When measuring reading times, ordering preferences, or recognition responses, the model must account for the fact that items and participants each introduce their own variance. The random effects structure reflects the actual data-generating process and determines which conclusions the data can support.