Machine Learning
Support Vector Machine (SVM)
The classifier that throws away almost all your data — and keeps the few points that matter
A support vector machine finds the maximum-margin hyperplane separating two classes, defined entirely by a handful of critical points called support vectors — and the kernel trick lets it bend that boundary into any shape without ever leaving the original feature space.
- Training time (kernel)O(n²)–O(n³)
- Prediction timeO(nsv · d)
- MemoryO(n²) kernel matrix
- Model parametersC, kernel, γ
- Sweet spot10³–10⁵ rows
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How a support vector machine works
Suppose you have two classes of points and a hundred different straight lines that separate them perfectly. Which line is best? A perceptron stops at the first line it finds. Logistic regression picks one based on a probabilistic loss. A support vector machine asks a sharper question: which separator leaves the widest empty corridor between the two classes? That corridor is the margin, and the line down its center is the maximum-margin hyperplane.
Formally, a hyperplane is w·x + b = 0. The signed distance from a point to it is (w·x + b) / ‖w‖. We label classes y ∈ {−1, +1} and require every point to sit on the correct side with at least unit functional margin: yᵢ(w·xᵢ + b) ≥ 1. The geometric width of the corridor is then 2 / ‖w‖. Widening the margin means shrinking ‖w‖, so the training problem is:
minimize ½‖w‖²
subject to yᵢ(w·xᵢ + b) ≥ 1 for all i
This is a convex quadratic program — one global optimum, no local minima to get stuck in, unlike a neural network. The crucial consequence falls out of the Karush-Kuhn-Tucker conditions: at the optimum, almost every constraint is slack (the point sits comfortably away from the corridor), and its Lagrange multiplier αᵢ is zero. Only the points lying exactly on the margin boundary have αᵢ > 0. Those are the support vectors. Delete every other training point and retrain — the boundary does not move an inch.
This is the idea that makes SVM unusual. A trained model with a million examples might be defined by a few dozen support vectors. The rest of the data was scaffolding.
Soft margins: when the classes overlap
Real data is rarely cleanly separable. One outlier in the wrong place makes the hard-margin problem above infeasible — no hyperplane satisfies every constraint. The fix is the soft margin: introduce a slack variable ξᵢ ≥ 0 for each point measuring how far it violates its margin, and pay for that slack in the objective:
minimize ½‖w‖² + C · Σ ξᵢ
subject to yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ , ξᵢ ≥ 0
The hyperparameter C sets the exchange rate between margin width and misclassification. Large C buys a narrow margin that bends to avoid every error (high variance, overfit risk). Small C tolerates errors in exchange for a fatter, smoother margin (high bias). C is the single most important knob to tune. Equivalently — and this is the form most useful in practice — minimizing the soft-margin objective is exactly minimizing hinge loss max(0, 1 − yᵢ(w·xᵢ + b)) plus an L2 penalty ‖w‖². SVM is "linear model + hinge loss + L2 regularization" — that framing connects it directly to gradient descent and to logistic regression (which swaps hinge for log loss).
The kernel trick: bending the boundary
A line can't separate concentric rings or an XOR pattern. The classic trick is to map points into a higher-dimensional space where they become linearly separable — for rings, lift each 2-D point to (x, y, x²+y²) and the inner ring drops below the outer one, separable by a flat plane.
But here is the elegant part. Rewrite the SVM in its dual form and the training data appears only inside dot products xᵢ·xⱼ — never alone. So we never need the coordinates of the high-dimensional space; we only need a function that returns the dot product as if we had mapped both points there. That function is a kernel K(xᵢ, xⱼ) = φ(xᵢ)·φ(xⱼ). Common choices:
- Linear:
K = xᵢ·xⱼ. No mapping; just the plain SVM. Best for high-dimensional sparse data like text. - Polynomial:
K = (γ xᵢ·xⱼ + r)^d. Models feature interactions up to degree d. - RBF / Gaussian:
K = exp(−γ‖xᵢ − xⱼ‖²). The default. Its implicit feature space is infinite-dimensional, yet the program stays finite because we only ever evaluate the kernel on the n training points.
The RBF parameter γ sets how far a single support vector's influence reaches. Large γ → tight, local bumps → a wiggly boundary that can memorize noise. Small γ → broad, smooth influence → a nearly linear boundary. C and γ are tuned together on a grid.
When to choose an SVM
- Small-to-medium datasets — a few thousand to a few hundred thousand rows, where O(n²) memory is still affordable.
- High-dimensional data — text classification, gene expression, where features outnumber samples. A linear SVM is often the strongest baseline here.
- Clean margins matter — when you want the most robust separator, not just a separator. The max-margin principle gives good generalization bounds.
- Non-linear but not huge — RBF SVM captures curved boundaries that logistic regression can't, without the data appetite of a deep network.
Avoid SVM when you have millions of rows (use a linear model with SGD, or gradient-boosted trees), when you need calibrated probabilities out of the box, or when interpretability of individual feature effects is the goal — the kernel boundary is opaque.
SVM vs other classifiers
| Kernel SVM | Logistic regression | Random forest | k-NN | Neural network | |
|---|---|---|---|---|---|
| Training time | O(n²)–O(n³) | O(n·d) per epoch | O(n log n · trees) | O(1) (lazy) | O(n·d) per epoch |
| Prediction time | O(nsv · d) | O(d) | O(depth · trees) | O(n · d) | O(layers · d) |
| Memory | O(n²) kernel matrix | O(d) | O(n · trees) | O(n · d) | O(params) |
| Non-linear boundary | Yes (kernels) | No (needs features) | Yes | Yes | Yes |
| Native probabilities | No (Platt scaling) | Yes | Yes | Yes (vote share) | Yes (softmax) |
| Scales to millions | No | Yes | Yes | Poorly | Yes |
| Sweet spot | 10³–10⁵ rows, high-dim | linear, huge data | tabular, mixed types | tiny, low-dim | huge, perceptual data |
The headline trade-off: SVM gives you a principled, convex, max-margin boundary with strong small-data performance, but it does not scale and it does not hand you probabilities. Boosted trees have largely displaced it as the default tabular classifier; linear SVM remains a top-tier text baseline.
What the numbers actually say
- Training is roughly quadratic to cubic in n. SMO-based solvers (libsvm, the engine behind scikit-learn's
SVC) are empirically about O(n²·²). On 100,000 points the kernel matrix alone holds 10¹⁰ doubles ≈ 80 GB — which is exactly why you don't run kernel SVM there. - Support vectors are typically a small fraction of the data on clean problems, but on noisy or heavily overlapping data they can balloon to a large fraction of n — and prediction cost scales with the count of support vectors, so a noisy SVM is slow at inference too.
- Linear SVM trains in O(n·d) with a dedicated solver like LIBLINEAR or SGD, handling millions of sparse text features in seconds — a different complexity class from the kernel version.
- The C × γ grid is small but expensive. A typical search is 7 values of C × 7 of γ = 49 fits, each with 5-fold CV = 245 SVM trainings. Budget for it.
- Feature scaling is not optional. RBF uses Euclidean distance; a feature on a 0–10000 scale drowns one on a 0–1 scale. Standardizing to zero mean and unit variance can swing accuracy by 20+ points.
JavaScript implementation
A from-scratch linear soft-margin SVM trained by subgradient descent on the hinge loss. This is the form that scales and the one most worth understanding — the quadratic-program/kernel machinery is an optimization detail layered on top.
// Linear soft-margin SVM via subgradient descent on hinge loss + L2.
// X: array of feature vectors, y: labels in {-1, +1}.
function trainSVM(X, y, { C = 1, lr = 0.001, epochs = 200 } = {}) {
const n = X.length, d = X[0].length;
let w = new Array(d).fill(0);
let b = 0;
for (let ep = 0; ep < epochs; ep++) {
for (let i = 0; i < n; i++) {
const margin = y[i] * (dot(w, X[i]) + b); // yᵢ(w·xᵢ + b)
if (margin >= 1) {
// point is safely outside the margin: only the L2 term pulls w down
for (let j = 0; j < d; j++) w[j] -= lr * w[j];
} else {
// point violates the margin: hinge term pushes it back
for (let j = 0; j < d; j++) w[j] -= lr * (w[j] - C * y[i] * X[i][j]);
b += lr * C * y[i];
}
}
}
return { w, b };
}
const dot = (a, c) => a.reduce((s, v, j) => s + v * c[j], 0);
// Prediction: the SIGN of the distance is the class.
function predict(model, x) {
return dot(model.w, x) + model.b >= 0 ? 1 : -1;
}
Two things to notice. First, the update has two regimes split by margin ≥ 1 — that branch is the hinge: points already correct-and-confident contribute no classification gradient, only the regularizer. Second, only points that violate their margin push on w. Run this to convergence and the points that kept pushing are precisely your support vectors.
Python implementation
In practice you call scikit-learn. The real work is the preprocessing and the hyperparameter grid — get those wrong and the algorithm choice is irrelevant.
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
# Scaling MUST be inside the pipeline so CV folds don't leak test statistics.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = {
"svc__C": [0.1, 1, 10, 100],
"svc__gamma": [0.001, 0.01, 0.1, 1, "scale"],
}
search = GridSearchCV(pipe, grid, cv=5, n_jobs=-1, scoring="f1")
search.fit(X_train, y_train)
print("best params:", search.best_params_)
best = search.best_estimator_
# Inspect what the model actually leaned on:
svc = best.named_steps["svc"]
print("support vectors:", svc.n_support_) # per class
print("fraction of data:", svc.support_.size / len(X_train))
print("test score:", best.score(X_test, y_test))
Below is the same hinge-loss subgradient idea in NumPy, mirroring the JavaScript above so the mechanics stay visible:
import numpy as np
def train_svm(X, y, C=1.0, lr=1e-3, epochs=200):
n, d = X.shape
w, b = np.zeros(d), 0.0
for _ in range(epochs):
for i in range(n):
if y[i] * (X[i] @ w + b) >= 1:
w -= lr * w # L2 only
else:
w -= lr * (w - C * y[i] * X[i]) # hinge + L2
b += lr * C * y[i]
return w, b
Variants worth knowing
SVR (support vector regression). The same machinery for regression. Instead of a margin between classes, fit a tube of width ε around the function; only points outside the tube become support vectors and incur loss. Insensitive to small errors by design.
One-class SVM. No labels — learn a boundary that encloses the "normal" data so anything outside is flagged. The workhorse of novelty and anomaly detection when you only have examples of the normal class.
ν-SVM (nu-SVM). Reparameterizes C with ν ∈ (0, 1], which directly bounds both the fraction of margin errors and the fraction of support vectors. More interpretable to tune than the unbounded C.
Linear SVM solvers (LIBLINEAR, Pegasos). Drop the kernel and you can solve the primal directly in O(n·d). Pegasos is stochastic gradient descent on the hinge objective — essentially the loop above with a decaying learning rate — and trains on text corpora that kernel SVM could never touch.
Structured SVM. Generalizes the binary output to structured objects — sequences, trees, segmentations — by defining a joint feature map and a structured loss. Used in NLP and computer vision before deep learning largely took over.
Common bugs and edge cases
- Forgetting to scale features. The single most common SVM mistake. RBF and the margin are distance-based; unscaled features let one column dominate. Always standardize, and do it inside the CV pipeline to avoid leakage.
- Tuning C without γ (or vice versa). They interact strongly. A good C at one γ is a bad C at another. Search the 2-D grid, not two 1-D lines.
- Treating the distance as a probability. The raw decision value is an unbounded signed distance, not a probability. Enable
probability=Truefor Platt scaling only if you truly need it — it adds an internal cross-validation and can disagree with the hardpredictnear the boundary. - Running kernel SVM on a million rows. The O(n²) kernel matrix quietly tries to allocate tens of gigabytes and the job dies or thrashes. Switch to
LinearSVCor SGD past ~10⁵ rows. - Imbalanced classes. The margin objective ignores class frequency, so a rare class gets swallowed. Set
class_weight="balanced"to reweight C per class. - A huge support-vector count is a symptom. If nearly every point is a support vector, your γ is too large (overfitting) or the data is genuinely inseparable — and prediction will be slow because every support vector is evaluated at inference time.
Frequently asked questions
What exactly is a support vector?
A support vector is a training point that lies exactly on the margin boundary (or, in the soft-margin case, inside or on the wrong side of it). Only these points have non-zero weight in the solution — every other point could be deleted and the decision boundary would not move. That is why the model is named after them.
What is the kernel trick and why does it matter?
The dual SVM only ever touches the data through dot products xᵢ·xⱼ. A kernel K(xᵢ, xⱼ) computes the dot product in a higher-dimensional feature space without ever building the coordinates of that space. The RBF kernel corresponds to an infinite-dimensional space, so SVM can learn arbitrarily curved boundaries while still solving a finite quadratic program over the n training points.
How is C different from the RBF gamma parameter?
C controls the soft-margin trade-off: large C punishes every misclassification hard and gives a narrow, overfit-prone margin, while small C tolerates errors for a wider, smoother margin. Gamma controls the reach of a single RBF support vector: large gamma makes each point's influence local and the boundary wiggly, small gamma makes influence broad and the boundary flat. You tune them jointly on a 2-D grid by cross-validation.
Why is SVM rarely used on very large datasets?
Training a kernel SVM costs roughly O(n²) to O(n³) time and O(n²) memory because the kernel matrix has n² entries. At a million points that matrix alone is terabytes. SVM dominates on small-to-medium tabular and text data — thousands to low-hundred-thousands of rows — and gives way to linear models, gradient-boosted trees, or neural networks past that scale.
Does SVM output probabilities?
Not natively — the raw output is a signed distance to the hyperplane, not a probability. Libraries add Platt scaling, which fits a logistic-regression sigmoid to that distance using cross-validation. It works but is slow, can be inconsistent with the hard decision near the boundary, and is off by default in most implementations.
How does SVM handle more than two classes?
SVM is binary at its core. Multiclass is built on top: one-vs-rest trains k classifiers (each class against all others) and picks the highest score, while one-vs-one trains k(k−1)/2 pairwise classifiers and votes. scikit-learn's SVC uses one-vs-one; LinearSVC uses one-vs-rest. One-vs-one trains more models but each on a smaller subset, so it is often faster for kernel SVM.