Machine Learning
Naive Bayes Classifier
A wrong assumption that trains in one pass — and still wins on text
A Naive Bayes classifier applies Bayes' rule while pretending every feature is independent given the class — a wrong assumption that still trains in O(n·d), needs no iteration, and beats far heavier models on text.
- TrainingO(n·d) one pass
- PredictionO(d·k)
- Iterations0 (no gradient descent)
- Hyperparameters1 (smoothing α)
- Core assumptionconditional independence
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The intuition: count, then bet
Suppose you want to decide whether the email in front of you is spam. You don't reason about grammar or intent — you just remember what spam looks like. The word "viagra" shows up in 1 of every 30 spam messages and almost never in real mail. "Meeting" is the reverse. A Naive Bayes classifier formalises exactly that gut feeling: it counts how often each word appears in each class during training, then for a new message it multiplies those word-frequencies together and bets on whichever class scores higher.
What makes the method "naive" is the multiplication step. Multiplying the per-word probabilities is only valid if the words are independent — if seeing "San" tells you nothing about whether "Francisco" follows. That is obviously false in real language. Naive Bayes assumes it anyway. The astonishing result, formalised by Harry Zhang in 2004, is that the assumption is wrong in a way that usually doesn't change the final answer: the inter-word dependencies tend to push every class's score in the same direction, so the ranking of classes survives even when the probabilities themselves are nonsense.
The payoff is speed and simplicity. There is no gradient descent, no iteration, no weight matrix to learn. Training is a single pass that tallies counts; prediction is a handful of additions in log space. That is why, decades after it was first used for spam filtering (Sahami et al., 1998; Paul Graham's influential "A Plan for Spam," 2002), it remains the default baseline every text classifier is measured against.
Bayes' rule and the independence trick
Start with Bayes' theorem. For a class c and an observed feature vector x = (x₁, …, x_d):
P(c | x) = P(x | c) · P(c) / P(x)
We want the class that maximises P(c | x). The denominator P(x) is identical for every class, so it drops out of the comparison — we only need to rank the numerator P(x | c) · P(c). The hard term is the likelihood P(x | c) = P(x₁, x₂, …, x_d | c), the joint distribution over all features. Estimating a full joint over d features needs exponentially many counts; it is hopeless.
Here is the naive leap. Assume the features are conditionally independent given the class. Then the joint factorises into a product of one-dimensional terms:
P(x | c) = ∏ᵢ P(xᵢ | c)
ĉ = argmax_c P(c) · ∏ᵢ P(xᵢ | c)
Each factor P(xᵢ | c) is now a single number you can estimate by counting. The whole model is just a table: for every (word, class) pair, the probability of that word given that class, plus a prior P(c) per class.
Complexity. Training reduces to counting tokens, so it is O(n · d̄) where n is the number of documents and d̄ the average document length — a single linear pass with no iteration. The model stores O(k · V) numbers for k classes and vocabulary size V. Prediction scores each of the k classes by summing over the document's tokens, so it is O(d · k) per document.
Two engineering necessities. First, work in log space: log P(c) + Σᵢ log P(xᵢ | c). A product of hundreds of sub-1 probabilities underflows to 0.0 in IEEE-754 double within roughly 300 terms; the log-sum stays representable and the argmax is unchanged because log is monotonic. Second, smooth: add a pseudocount α (usually 1, "Laplace smoothing") to every word count so no probability is ever exactly zero.
The zero-frequency trap
The single most common way to break a from-scratch Naive Bayes is to skip smoothing. Imagine the word "blockchain" never appeared in any ham message during training. Then P("blockchain" | ham) = 0. The moment a legitimate email mentions blockchain, the entire ham score collapses to zero — because you are multiplying — and the message is classified spam no matter how innocent the other 200 words are. One unseen token vetoes everything.
Laplace (add-α) smoothing fixes it by pretending you saw every vocabulary word α extra times in every class:
P(wᵢ | c) = (count(wᵢ, c) + α) / (Σ_w count(w, c) + α · V)
With α = 1 and a vocabulary of V words, an unseen token still gets a small but non-zero probability of 1 / (N_c + V), where N_c is the total token count in class c. The α is the model's one real hyperparameter; values between 0.01 and 1.0 are typical, and smaller α trusts the data more.
When to reach for Naive Bayes
- High-dimensional, sparse text. Spam filters, sentiment, topic and language detection, document routing. Bag-of-words gives tens of thousands of features and Naive Bayes shrugs at the dimensionality.
- Tiny training sets. Because it estimates only
k·Vindependent parameters rather than fitting a joint, it needs far less data to stop overfitting than logistic regression or a neural net. - Streaming / online updates. Counts are additive, so you can update the model with new documents in O(1) per token without retraining.
- A strong, instant baseline. Before you spend a day tuning an SVM or a transformer, train Naive Bayes in a second and find out how hard the problem actually is.
Avoid it when features are heavily correlated and you need calibrated probabilities, when feature interactions carry the signal (XOR-like patterns are invisible to it), or when continuous features are wildly non-Gaussian and you're forced into the Gaussian variant.
Naive Bayes vs other classifiers
| Naive Bayes | Logistic Regression | Linear SVM | Decision Tree | k-NN | |
|---|---|---|---|---|---|
| Model type | Generative | Discriminative | Discriminative | Discriminative | Instance-based |
| Training time | O(n·d), one pass | O(n·d·iters) | O(n·d) to O(n²) | O(n·d·log n) | O(1) (lazy) |
| Prediction time | O(d·k) | O(d·k) | O(d·SV) | O(depth) | O(n·d) |
| Needs iteration | No | Yes (gradient) | Yes (QP/SGD) | No (greedy) | No |
| Handles correlated features | Poorly (assumes none) | Yes | Yes | Yes | Yes |
| Probability calibration | Poor (overconfident) | Good | None (needs Platt) | Moderate | Moderate |
| Small-data behavior | Excellent | Overfits | Overfits | Overfits | Noisy |
| Sweet spot | Sparse text, baselines | General linear | Margin-rich text | Tabular, interpretable | Low-dim, local |
The deep contrast is generative vs discriminative. Naive Bayes models P(x | c) — how the data is generated — and applies Bayes' rule. Logistic regression models P(c | x) directly. Ng and Jordan's classic 2001 result: the generative Naive Bayes reaches its (higher) asymptotic error faster, so it wins with little data, while logistic regression's (lower) asymptotic error wins once you have enough. They are the same coin's two faces.
What the numbers actually say
- Spam filtering: ~98–99% accuracy. Paul Graham's 2002 statistical filter and the SpamAssassin Bayes module routinely hit false-positive rates well under 0.1% on personal mail — the kind of number that made Bayesian spam filtering ubiquitous by 2004.
- 20 Newsgroups (~18,800 docs, 20 classes): ~80–85% accuracy for multinomial Naive Bayes, a few points behind a tuned linear SVM (~88%) but trained in a fraction of a second versus seconds-to-minutes for the SVM.
- Training cost: zero iterations. One linear scan of the corpus. On the 20-newsgroups corpus that is well under a second on a laptop; the heaviest cost is tokenisation, not the model.
- Memory: O(k·V). A 20-class model over a 50,000-word vocabulary is one million log-probabilities — a few megabytes — with no support vectors, no embedding matrix, and no trees to store.
- One knob. The only thing to tune is the smoothing α. Everything else is determined by counting, which is why it is the canonical "fit it and forget it" baseline.
JavaScript implementation
A complete multinomial Naive Bayes text classifier — train by counting, predict in log space, smooth with α:
class MultinomialNB {
constructor(alpha = 1) {
this.alpha = alpha;
this.vocab = new Set();
this.classCounts = {}; // docs per class
this.tokenCounts = {}; // {class: {word: count}}
this.classTotals = {}; // total tokens per class
this.nDocs = 0;
}
// tokens: array of words for one document; label: its class
train(tokens, label) {
this.nDocs++;
this.classCounts[label] = (this.classCounts[label] || 0) + 1;
this.tokenCounts[label] ??= {};
this.classTotals[label] ??= 0;
for (const w of tokens) {
this.vocab.add(w);
this.tokenCounts[label][w] = (this.tokenCounts[label][w] || 0) + 1;
this.classTotals[label]++;
}
}
// log P(word | class) with Laplace smoothing
logLikelihood(word, label) {
const count = (this.tokenCounts[label][word] || 0) + this.alpha;
const denom = this.classTotals[label] + this.alpha * this.vocab.size;
return Math.log(count / denom);
}
predict(tokens) {
let best = null, bestScore = -Infinity;
for (const c of Object.keys(this.classCounts)) {
// log prior + sum of log likelihoods (add, never multiply)
let score = Math.log(this.classCounts[c] / this.nDocs);
for (const w of tokens) {
if (this.vocab.has(w)) score += this.logLikelihood(w, c);
}
if (score > bestScore) { bestScore = score; best = c; }
}
return best;
}
}
// usage
const nb = new MultinomialNB(1);
nb.train(['win', 'free', 'money', 'now'], 'spam');
nb.train(['free', 'gift', 'click'], 'spam');
nb.train(['meeting', 'tomorrow', 'project'], 'ham');
nb.train(['lunch', 'project', 'update'], 'ham');
console.log(nb.predict(['free', 'meeting', 'money'])); // 'spam'
Two details mirror every production implementation. The prior log(classCounts / nDocs) is added once per class; the inner loop only sums log-likelihoods — there is no multiplication anywhere, which is precisely what avoids underflow. And out-of-vocabulary words are skipped at predict time, since they carry no class signal and smoothing already handled the train-time unseen case.
Python implementation
The same model, plus the Gaussian variant for continuous features so you can see how only the per-feature likelihood changes:
import math
from collections import defaultdict, Counter
class MultinomialNB:
def __init__(self, alpha=1.0):
self.alpha = alpha
self.vocab = set()
self.class_docs = Counter()
self.token_counts = defaultdict(Counter) # class -> Counter(word)
self.class_total = Counter() # class -> total tokens
self.n_docs = 0
def train(self, tokens, label):
self.n_docs += 1
self.class_docs[label] += 1
for w in tokens:
self.vocab.add(w)
self.token_counts[label][w] += 1
self.class_total[label] += 1
def _log_likelihood(self, word, label):
count = self.token_counts[label][word] + self.alpha
denom = self.class_total[label] + self.alpha * len(self.vocab)
return math.log(count / denom)
def predict(self, tokens):
best, best_score = None, float('-inf')
for c in self.class_docs:
score = math.log(self.class_docs[c] / self.n_docs) # log prior
for w in tokens:
if w in self.vocab:
score += self._log_likelihood(w, c) # add, don't multiply
if score > best_score:
best, best_score = c, score
return best
# Gaussian Naive Bayes — same skeleton, continuous likelihood
class GaussianNB:
def fit(self, X, y):
self.classes = set(y)
self.stats = {} # class -> list of (mean, var) per feature
self.prior = {}
for c in self.classes:
rows = [x for x, t in zip(X, y) if t == c]
self.prior[c] = len(rows) / len(X)
cols = list(zip(*rows))
self.stats[c] = [(sum(col)/len(col),
sum((v - sum(col)/len(col))**2 for v in col)/len(col) + 1e-9)
for col in cols]
def predict(self, x):
best, best_score = None, float('-inf')
for c in self.classes:
score = math.log(self.prior[c])
for xi, (mu, var) in zip(x, self.stats[c]):
# log of the normal density N(xi; mu, var)
score += -0.5 * math.log(2 * math.pi * var) - (xi - mu)**2 / (2 * var)
if score > best_score:
best, best_score = c, score
return best
Notice the architecture is identical: estimate a prior, sum per-feature log-likelihoods, take the argmax. Only the _log_likelihood term differs — a counted frequency for multinomial, a Gaussian density for continuous data. The + 1e-9 on the variance is a guard against zero-variance features, the continuous analogue of Laplace smoothing.
Variants worth knowing
Multinomial Naive Bayes. The text workhorse. Features are token counts; the likelihood is the relative frequency of each word in a class. This is the variant in both code samples above and the one meant when someone says "Naive Bayes for text."
Bernoulli Naive Bayes. Features are binary present/absent flags, ignoring how many times a word appears. Crucially it includes a term for words that are absent — explicit non-occurrence is signal. It often wins on short documents (tweets, titles) where counts are mostly 0 or 1.
Gaussian Naive Bayes. For continuous features, assume each feature follows a per-class normal distribution and estimate its mean and variance. Used for sensor data, medical measurements, the Iris dataset — anything that isn't word counts.
Complement Naive Bayes (Rennie et al., 2003). Estimates each class's parameters from the complement (all other classes) to correct the skew multinomial NB suffers on imbalanced datasets. A near-free accuracy bump that scikit-learn ships as ComplementNB.
TF-IDF + NB and the NBSVM hybrid. Feeding TF-IDF-weighted features instead of raw counts, and Wang & Manning's 2012 NBSVM (Naive Bayes log-count ratios as features for a linear SVM), both close much of the gap to discriminative models while keeping NB's speed.
Common bugs and edge cases
- Multiplying probabilities instead of adding logs. The classic underflow bug: works on toy 3-word inputs, silently returns garbage on real documents once the product hits the denormal floor. Always sum log-probabilities.
- Forgetting smoothing. A single unseen word zeros out a class. The symptom is bizarre: the model is confident and wrong, and adding one innocuous word flips the label.
- Letting α scale with the wrong denominator. The smoothing term in the denominator must be
α · V(vocabulary size), notαalone — otherwise probabilities for a class don't sum to one and the comparison is biased. - Trusting the output probabilities. Because of the false independence assumption, NB's posteriors are wildly overconfident — routinely 0.9999 or 0.0001. Use them to rank, not as calibrated confidences; if you need calibration, apply isotonic regression or Platt scaling.
- Ignoring class imbalance. A 95%-ham corpus gives ham a dominating prior. Either resample, drop the prior, or switch to Complement NB.
- Double-counting in Bernoulli. Bernoulli NB takes the set of words, not the list — converting a count vector straight in without binarising inflates frequent words and corrupts the absence term.
Frequently asked questions
Why is Naive Bayes called "naive"?
Because it assumes every feature is conditionally independent of every other feature given the class. In real text, words are heavily correlated — "San" and "Francisco" co-occur constantly — so the assumption is almost always false. It's called naive because believing it is naive, yet the classifier works anyway.
If the independence assumption is wrong, why does it still classify well?
Classification only needs the correct class to score highest, not for the probabilities to be calibrated. Zhang's 2004 analysis showed the dependencies often cancel across classes, leaving the argmax intact. The output probabilities are usually badly overconfident — near 0 or 1 — but the ranking that decides the label survives.
What is Laplace (add-one) smoothing and why is it mandatory?
It adds a pseudocount (usually 1) to every word count before computing probabilities. Without it, any word never seen in a class gets probability zero, and because the model multiplies probabilities, a single zero annihilates the entire class score. Smoothing guarantees no probability is ever exactly zero.
Why does real code add log-probabilities instead of multiplying?
Multiplying hundreds of probabilities (each below 1) underflows to 0.0 in floating point within a few dozen terms. Working in log space turns the product into a sum of negative numbers, which stays representable. The argmax is unchanged because log is monotonic.
What's the difference between multinomial, Bernoulli, and Gaussian Naive Bayes?
They differ in the per-feature likelihood. Multinomial models word counts (how many times a token appears) and dominates text classification. Bernoulli models binary present/absent flags and rewards explicit absence of words. Gaussian assumes each feature is a normal distribution per class and is used for continuous data like sensor readings.
How fast is Naive Bayes compared to logistic regression or an SVM?
Training is a single pass — O(n·d) for n documents averaging d tokens — with no iteration, no gradient descent, and no hyperparameter search beyond the smoothing constant. On a 20-newsgroups-sized corpus it trains in well under a second, often 10–100× faster than a tuned linear SVM, while landing within a few accuracy points.