L1 & L2 Regularization

Two ways to punish big weights — one makes them vanish, one just shrinks them

L1 and L2 regularization add a penalty on weight size to a model's loss to curb overfitting: L1 (Lasso) can drive some weights exactly to zero, which gives built-in feature selection, while L2 (Ridge) shrinks all weights smoothly toward zero.

L1 penalty: λ·Σ|wᵢ|
L2 penalty: λ·Σwᵢ²
L1 effect: sparse (exact zeros)
L2 effect: smooth shrinkage
Tune λ by: cross-validation

The intuition: penalize complexity, not just error

An over-parameterized model will happily memorize its training set. Give a degree-15 polynomial twelve noisy points and it threads every one of them, wiggling violently in between — training error near zero, test error catastrophic. The wiggles are powered by enormous coefficients that cancel each other out, so the first thing to attack is the size of the weights.

Regularization changes the objective. Instead of minimizing the raw loss L(w), you minimize L(w) + λ·R(w), where R(w) is a penalty that grows with the magnitude of the weights and λ ≥ 0 controls how much you care. The optimizer now faces a tug-of-war: fit the data, but keep the weight vector small. Small weights mean a smoother, lower-variance function that generalizes — the bias–variance trade-off, dialed by a single knob.

Two choices of R(w) dominate practice, and they behave very differently:

L2 (Ridge / weight decay): R(w) = Σ wᵢ². Shrinks every weight smoothly toward zero; nothing ever reaches it.
L1 (Lasso): R(w) = Σ |wᵢ|. Drives the least-useful weights to exactly zero, deleting features outright.
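
As a minimal sketch of the objective L(w) + λ·R(w), assuming a mean-squared-error data loss and a linear model without an intercept (both illustrative choices; the function names are hypothetical), the two penalties differ only in the function passed in as R:

```python
def l1_penalty(w):
    """R(w) = sum of absolute weights."""
    return sum(abs(wi) for wi in w)


def l2_penalty(w):
    """R(w) = sum of squared weights."""
    return sum(wi * wi for wi in w)


def objective(w, X, y, lam, penalty):
    """Mean squared error of the linear model X @ w, plus lam * R(w)."""
    n = len(X)
    mse = sum(
        (sum(wj * xj for wj, xj in zip(w, row)) - yi) ** 2
        for row, yi in zip(X, y)
    ) / n
    return mse + lam * penalty(w)


# Toy data: one example that w fits exactly, so only the penalty remains.
X = [[1.0, 0.0, 0.0]]
y = [3.0]
w = [3.0, -4.0, 0.0]

loss_l1 = objective(w, X, y, lam=0.1, penalty=l1_penalty)  # 0.1 * (3 + 4)
loss_l2 = objective(w, X, y, lam=0.1, penalty=l2_penalty)  # 0.1 * (9 + 16)
```

For the same weight vector, L2 charges the large weights far more than L1 does, which is the first hint of why the two behave differently.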

The mechanism: why the gradients differ

The whole personality difference between L1 and L2 lives in the derivative of the penalty. Consider one weight w and a gradient-descent step w ← w − η·(∂L/∂w + λ·∂R/∂w).

L2: ∂(w²)/∂w = 2w. The penalty's pull is proportional to w. A big weight is shoved hard; as it shrinks, the pull fades. The update becomes w ← w(1 − 2ηλ) − η·∂L/∂w — every step multiplies the weight by a factor slightly below 1. That's literally "weight decay." A weight approaches zero geometrically but never lands on it, because the force vanishes in lockstep with the weight.

L1: ∂(|w|)/∂w = sign(w) for w ≠ 0. The pull is a constant ±λ regardless of how small w is. Once the data's gradient can no longer overcome that fixed shove, the weight is dragged all the way to zero and pinned there. This is the source of sparsity. Because |w| isn't differentiable at zero, the clean implementation is the soft-thresholding operator from proximal gradient methods:

soft(w, t) = sign(w) · max(|w| − t,  0)     // t = η·λ

You take the ordinary gradient step on the data loss, then shrink the result toward zero by t and clamp anything inside [−t, +t] to exactly 0.
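
To make the contrast concrete, here is a small sketch that applies each penalty step repeatedly to a single weight with no opposing data gradient. The values of eta and lam are illustrative assumptions, and `l2_step` is a hypothetical helper name.

```python
def l2_step(w, eta, lam):
    """One L2 (weight-decay) step with zero data gradient: w * (1 - 2*eta*lam)."""
    return w * (1.0 - 2.0 * eta * lam)


def soft(w, t):
    """Soft-thresholding: shrink |w| by t, clamp to 0 inside [-t, +t]."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


eta, lam = 0.1, 0.5            # illustrative values, so t = eta * lam = 0.05
w_l2 = w_l1 = 1.0
for _ in range(25):
    w_l2 = l2_step(w_l2, eta, lam)   # multiplies by 0.9 each step
    w_l1 = soft(w_l1, eta * lam)     # the data-gradient step is a no-op here

print("L2 after 25 steps:", w_l2)    # small but still positive
print("L1 after 25 steps:", w_l1)    # exactly 0.0
```

The L2 weight keeps shrinking by a constant ratio and stays positive, while the L1 weight loses a fixed amount per step and is clamped to exactly zero once it falls inside the threshold.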

The geometric picture seals it. Minimizing L subject to a budget on R means the elliptical loss contours expand until they kiss the constraint region. L2's region is a disk (a ball in higher dimensions): first contact is generically off-axis, so all weights stay nonzero. L1's region is a diamond with sharp corners on the axes: contours frequently touch a corner (or, in higher dimensions, an edge) first, where one or more coordinates are exactly zero. The corners are the sparsity.

When to choose L1, L2, or both

Reach for L2 when you believe every feature contributes a little, when features are correlated (L2 spreads weight across a correlated group; L1 arbitrarily keeps one and zeros the rest), and as the default for neural nets — weight decay is in almost every training recipe.
Reach for L1 when you suspect most features are useless and want the model to tell you which ones — automatic feature selection. The zeroed coefficients give a sparse, interpretable, cheap-to-serve model.
Reach for Elastic Net (both, weighted) when p ≫ n or features come in correlated clusters: L1 alone is unstable there, L2 alone won't sparsify, the blend gets both.

If your real problem is a noisy training signal rather than model size, regularization is the wrong tool — get more data or use early stopping. Regularization fights variance from an over-flexible hypothesis class, not label noise per se.

L1 vs L2 at a glance

	L1 (Lasso)	L2 (Ridge)	Elastic Net
Penalty term	λ·Σ|wᵢ|	λ·Σwᵢ²	λ(α·Σ|wᵢ| + (1−α)·Σwᵢ²)
Penalty gradient	λ·sign(w) (constant)	2λw (proportional)	mix of both
Effect on weights	exact zeros (sparse)	small but nonzero	sparse + grouped
Feature selection	yes, built in	no	yes
Correlated features	keeps one, drops rest	spreads weight evenly	keeps the group together
Differentiable everywhere	no (kink at 0)	yes	no
Closed-form solution	none (iterative / LARS)	yes: (XᵀX + λI)⁻¹Xᵀy	none
Bayesian prior	Laplace	Gaussian	blend
Typical home	high-dim sparse, genomics	neural nets, collinear data	p ≫ n with feature groups

The headline: L2 is the smooth, always-solvable, group-friendly default; L1 buys you a sparse, self-selecting model at the cost of a non-smooth objective and instability under correlation. Ridge even has a closed form, and for any λ > 0 the matrix XᵀX + λI is invertible even when XᵀX alone is singular. That is why Ridge predates machine learning, as Tikhonov regularization (Andrey Tikhonov, 1943) and Hoerl & Kennard's ridge regression (1970). Lasso is younger: Robert Tibshirani coined it in 1996.
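
A minimal sketch of that invertibility point, assuming a toy dataset with two identical feature columns and a hand-written 2×2 solver (`solve_2x2` is a hypothetical helper, not a library call). It evaluates the closed form (XᵀX + λI)⁻¹Xᵀy from the table:

```python
def solve_2x2(A, b):
    """Solve A w = b for a 2x2 matrix by Cramer's rule; None if A is singular."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0:
        return None
    w0 = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    w1 = (A[0][0] * b[1] - A[1][0] * b[0]) / det
    return [w0, w1]


# Two identical feature columns: X^T X is singular.
X = [[1, 1], [2, 2], [3, 3]]
y = [2, 4, 6]

XtX = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(2)]

lam = 1.0
ridge_A = [[XtX[i][j] + (lam if i == j else 0.0) for j in range(2)]
           for i in range(2)]

w_ols = solve_2x2(XtX, Xty)        # None: no unique least-squares solution
w_ridge = solve_2x2(ridge_A, Xty)  # unique; weight split equally across the twins
print(w_ols, w_ridge)
```

Without the penalty the normal equations have no unique solution here; with it, the two duplicated features each receive the same weight (28/29 in this toy case), which is the "spreads weight" behavior in miniature.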

What the numbers actually say

Ridge has a closed form; Lasso does not. Ridge costs one matrix solve, O(p³ + p²n) for the normal equations. Lasso needs an iterative solver (coordinate descent or LARS). Coordinate descent costs roughly O(np) per full sweep over the weights, repeated until convergence and usually along a path of λ values; each weight update is a single soft-threshold, so sweeps are cheap.
L1 can recover a sparse signal from far fewer samples. Compressed-sensing theory shows that an s-sparse vector in p dimensions is recoverable from on the order of s·log(p/s) measurements via L1 — sub-linear in p, the result that launched the field (Candès–Tao, Donoho, ~2006).
The decay factor is exact. L2 with rate η and strength λ multiplies each weight by (1 − 2ηλ) per step. With η = 0.01 and λ = 0.1 the per-step factor is 1 − 2ηλ = 0.998, so a weight with no opposing data gradient halves in about ln(2)/(2ηλ) = ln(2)/0.002 ≈ 347 steps.
AdamW's decoupling matters. Loshchilov & Hutter (2017) reported that decoupling weight decay from Adam's adaptive denominator improves Adam's generalization on image classification (CIFAR-10 and ImageNet32x32), enough to make it competitive with SGD with momentum, which is part of why AdamW is now the default transformer optimizer.
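
The half-life arithmetic for the decay factor can be checked directly. This sketch uses the η = 0.01 and λ = 0.1 values quoted above and assumes a zero data gradient, so each step is a pure multiplication:

```python
import math

eta, lam = 0.01, 0.1
factor = 1.0 - 2.0 * eta * lam            # per-step multiplier, 0.998

# Approximation quoted in the text: ln(2) / (2 * eta * lam)
approx_steps = math.log(2) / (2.0 * eta * lam)

# Direct simulation: count steps until the weight first drops to half.
w, steps = 1.0, 0
while w > 0.5:
    w *= factor
    steps += 1

print(f"approx = {approx_steps:.1f}, simulated = {steps}")
```

The simulated count agrees with the ln(2)/(2ηλ) estimate to within a step, because ln(1 − x) ≈ −x when x = 2ηλ is small.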

JavaScript implementation

Ridge and Lasso gradient-descent training for linear regression, side by side. The only difference is how the penalty enters the weight update: L2 is folded into the gradient, while L1 is applied as a soft-threshold after the data-gradient step.

// X: array of feature rows (already standardized). y: targets.
// type: 'l1' | 'l2'. lambda: strength. lr: learning rate.
function train(X, y, { type = 'l2', lambda = 0.1, lr = 0.05, epochs = 500 } = {}) {
  const n = X.length, p = X[0].length;
  let w = new Array(p).fill(0);
  let b = 0;                                  // bias is NOT regularized

  const soft = (z, t) => Math.sign(z) * Math.max(Math.abs(z) - t, 0);

  for (let e = 0; e < epochs; e++) {
    const gw = new Array(p).fill(0);
    let gb = 0;
    for (let i = 0; i < n; i++) {
      let pred = b;
      for (let j = 0; j < p; j++) pred += w[j] * X[i][j];
      const err = pred - y[i];                // residual; d(MSE)/dpred is (2/n) * err
      for (let j = 0; j < p; j++) gw[j] += (2 / n) * err * X[i][j];
      gb += (2 / n) * err;
    }
    b -= lr * gb;                             // bias: plain step, no penalty
    for (let j = 0; j < p; j++) {
      if (type === 'l2') {
        // L2: smooth shrink — gradient of lambda*w^2 is 2*lambda*w
        w[j] -= lr * (gw[j] + 2 * lambda * w[j]);
      } else {
        // L1: gradient step, then soft-threshold toward zero (proximal)
        w[j] = soft(w[j] - lr * gw[j], lr * lambda);
      }
    }
  }
  return { w, b };
}

// L1 yields exact zeros; count them to see the feature selection happen.
const { w } = train(X, y, { type: 'l1', lambda: 0.3 });
console.log('zeroed features:', w.filter(v => v === 0).length);

Note the two patterns. L2 folds straight into the gradient (gw[j] + 2λw[j]) because w² is differentiable. L1 cannot: its kink at zero means you take the data step first, then apply soft(), the proximal operator. That clamp is what manufactures exact zeros; a naive gw[j] + λ·sign(w[j]) update would jitter around zero forever.
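
The jitter is easy to reproduce. The sketch below uses a hypothetical 1-D toy problem, minimizing 0.5·(w − a)² + λ|w| with |a| < λ so the true minimizer is exactly 0, and runs the naive sign() update next to the proximal one. All constants are illustrative assumptions.

```python
def sign(z):
    return (z > 0) - (z < 0)


def soft(z, t):
    return sign(z) * max(abs(z) - t, 0.0)


# 1-D toy problem: minimize 0.5 * (w - a)**2 + lam * |w|.
# With |a| < lam the true minimizer is exactly w = 0.
a, lam, lr, steps = 0.05, 0.1, 0.1, 200

w_naive = w_prox = 1.0
naive_path, prox_path = [], []
for _ in range(steps):
    # Naive: add lam * sign(w) to the data gradient (w - a).
    w_naive -= lr * ((w_naive - a) + lam * sign(w_naive))
    # Proximal: data-gradient step, then soft-threshold by lr * lam.
    w_prox = soft(w_prox - lr * (w_prox - a), lr * lam)
    naive_path.append(w_naive)
    prox_path.append(w_prox)

naive_tail, prox_tail = naive_path[-50:], prox_path[-50:]
print("naive range:", min(naive_tail), max(naive_tail))
print("prox  range:", min(prox_tail), max(prox_tail))
```

In this setup the proximal iterate settles at exactly 0.0, while the naive iterate keeps hopping across zero within a band whose width scales with the learning rate.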

Python implementation

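In practice you reach for scikit-learn, where the choice is just a class. The penalty strength is named alpha there. It plays the role of λ, but Ridge and Lasso scale their loss terms differently (Lasso divides the squared error by 2n, Ridge does not), so alpha values are not directly comparable between the two.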

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Always standardize first: the penalty treats every weight equally,
# so features must be on the same scale.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
enet  = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5))

# Pick alpha by cross-validation over a log grid (the right way).
# X, y: your feature matrix and target vector, assumed to be loaded already.
for alpha in [1e-3, 1e-2, 1e-1, 1.0, 10.0]:
    m = make_pipeline(StandardScaler(), Lasso(alpha=alpha))
    score = cross_val_score(m, X, y, cv=5,
                            scoring='neg_mean_squared_error').mean()
    print(f"alpha={alpha:<6} cv_mse={-score:.4f}")

# Soft-thresholding from scratch — the heart of a Lasso coordinate-descent step.
def soft_threshold(rho, t):
    if   rho < -t: return rho + t
    elif rho >  t: return rho - t
    else:          return 0.0          # the snap-to-zero that gives sparsity

Two things teams get wrong. First, forgetting StandardScaler — without it a feature in kilometers and one in millimeters are penalized wildly unequally. Second, tuning alpha on the training set: the penalty exists to improve generalization, so it can only be chosen on held-out folds.
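
For context on where soft-thresholding sits in a solver, here is a minimal coordinate-descent sketch in plain Python. It assumes the objective (1/(2n))·‖y − Xw‖² + λ‖w‖₁ (the scaling scikit-learn's Lasso uses), no intercept, and centered data. `lasso_cd` is a hypothetical name and this is not scikit-learn's implementation.

```python
def soft_threshold(rho, t):
    if rho < -t:
        return rho + t
    if rho > t:
        return rho - t
    return 0.0


def lasso_cd(X, y, lam, sweeps=100):
    """Cyclic coordinate descent for (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)                                  # y - Xw, with w = 0
    z = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if z[j] == 0.0:
                continue                             # all-zero column
            # Correlation of feature j with the residual that excludes w[j].
            rho = sum(X[i][j] * (resid[i] + w[j] * X[i][j])
                      for i in range(n)) / n
            new_wj = soft_threshold(rho, lam) / z[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]      # keep the residual in sync
                w[j] = new_wj
    return w


# Toy design with orthogonal columns: a strong feature and a weak one.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [3.0, 0.1, -3.0, -0.1]
w = lasso_cd(X, y, lam=0.1)
print(w)   # the strong weight is shrunk, the weak one is exactly 0.0
```

Each coordinate update costs one pass over the n rows, so a full sweep is about n·p multiply-adds, and the weak feature is removed by the same snap-to-zero branch as in soft_threshold.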

Variants worth knowing

Elastic Net. Penalty λ(α‖w‖₁ + (1−α)‖w‖₂²). Zou & Hastie (2005) introduced it to fix Lasso's two failures: when p > n, Lasso selects at most n features; and among correlated features it picks one at random. Elastic Net keeps correlated groups together while still sparsifying.
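
In a proximal-gradient solver, the Elastic Net penalty above changes only the shrink step. The sketch below assumes the same soft-thresholding convention used earlier, with t standing for the step size; `enet_prox` is a hypothetical name. The formula follows from setting the derivative of 0.5·(w − z)² + t·λ·(α|w| + (1 − α)w²) to zero.

```python
def soft(z, t):
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def enet_prox(z, t, lam, alpha):
    """Minimizer over w of 0.5*(w - z)**2 + t*lam*(alpha*|w| + (1 - alpha)*w**2)."""
    return soft(z, t * lam * alpha) / (1.0 + 2.0 * t * lam * (1.0 - alpha))


print(enet_prox(2.0, 1.0, 0.5, 1.0))  # alpha = 1: pure L1, soft(2.0, 0.5) = 1.5
print(enet_prox(2.0, 1.0, 0.5, 0.0))  # alpha = 0: pure L2, 2.0 / (1 + 1) = 1.0
print(enet_prox(0.2, 1.0, 0.5, 0.5))  # inside the L1 threshold: exactly 0.0
```

The L1 part still produces exact zeros through the threshold, and the L2 part adds a multiplicative shrink on whatever survives.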

Group Lasso. Penalizes the L2 norm of predefined groups of weights, so an entire group is zeroed together. Used when features come in natural blocks — dummy-encoded categoricals, or all the weights feeding one neuron (structured pruning).

Weight decay vs AdamW. For SGD, L2 and weight decay coincide. For Adam they don't: L2 enters the adaptive denominator and gets unevenly scaled per-parameter. AdamW (2017) applies decay as a separate w ← w(1 − ηλ) step, restoring the clean behavior and improving generalization. It's the default in modern transformer training.
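
A single-parameter sketch makes the difference visible. This is a simplified illustration, not any library's optimizer: one scalar weight, default Adam hyperparameters, a zero data gradient, and the 2λw penalty gradient used throughout this article. The helper names are hypothetical.

```python
import math


def adam_update(grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return Adam's step for one scalar parameter and update its moments."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return lr * m_hat / (math.sqrt(v_hat) + eps)


def adam_l2_step(w, data_grad, state, lr, lam):
    # L2 in the loss: the penalty gradient 2*lam*w passes through Adam's scaling.
    return w - adam_update(data_grad + 2 * lam * w, state, lr)


def adamw_step(w, data_grad, state, lr, lam):
    # Decoupled: Adam sees only the data gradient; decay is a separate shrink.
    return w * (1 - lr * lam) - adam_update(data_grad, state, lr)


def fresh():
    return {"t": 0, "m": 0.0, "v": 0.0}


lr, lam = 0.01, 0.1
big, small = 10.0, 0.1

l2_big = adam_l2_step(big, 0.0, fresh(), lr, lam)
l2_small = adam_l2_step(small, 0.0, fresh(), lr, lam)
w_big = adamw_step(big, 0.0, fresh(), lr, lam)
w_small = adamw_step(small, 0.0, fresh(), lr, lam)

print("Adam + L2 drops:", big - l2_big, small - l2_small)   # both about lr
print("AdamW ratios:   ", w_big / big, w_small / small)     # both 1 - lr*lam
```

With L2 inside Adam, the adaptive denominator normalizes the penalty gradient, so on this first step a weight of 10 and a weight of 0.1 lose about the same absolute amount. With decoupled decay both shrink by the same ratio, which is the proportional behavior weight decay is meant to have.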

Dropout and early stopping. Other implicit regularizers for nets. Dropout randomly zeros activations and behaves like an adaptive L2 on the weights; early stopping halts training before the weights grow large, which for linear models is provably close to L2.

Tikhonov regularization. The general L2 form ‖Γw‖² with a matrix Γ; Ridge is the special case Γ = √λ·I. This is the framing that ties machine-learning regularization back to ill-posed inverse problems in physics and imaging.

Common bugs and edge cases

Regularizing the bias. The intercept shouldn't be penalized — it only shifts the output and adds no complexity. Penalizing it biases predictions toward zero. Libraries exclude it by default; hand-rolled code often forgets.
Skipping standardization. The penalty is scale-sensitive. Un-standardized features make the penalty punish large-unit features and ignore small-unit ones, silently distorting which weights survive.
Naive L1 via sign() gradient. Stepping by λ·sign(w) makes weights oscillate ±ε around zero forever and never reach it. Use soft-thresholding (proximal / coordinate descent) to get true zeros.
λ on the wrong scale. If the data loss is summed over examples, doubling the dataset doubles the data loss but not the penalty, shifting the effective λ. Average the loss (divide by n) so λ means the same thing across dataset sizes.
Tuning λ on the training set. Training error falls monotonically as λ → 0, so it always picks zero regularization. Choose λ on validation folds only.
Expecting L1 to be stable under correlation. With two near-identical features, Lasso flips a coin about which to keep and may swap on a tiny data change. If you need stable selection across folds, use Elastic Net.
Assuming "more λ = better." Past the sweet spot, larger λ underfits: every weight is crushed toward zero and the model predicts the mean. The validation curve is U-shaped, not monotone.
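
The λ-scale bug in the list above can be seen with a one-feature ridge fit, where the minimizer has a simple closed form. This is a toy sketch with made-up data, no intercept, and hypothetical helper names:

```python
def ridge_1d_sum(x, y, lam):
    """Minimizer of sum((w*x_i - y_i)**2) + lam * w**2 (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def ridge_1d_mean(x, y, lam):
    """Minimizer of mean((w*x_i - y_i)**2) + lam * w**2 (no intercept)."""
    n = len(x)
    return (sum(a * b for a, b in zip(x, y)) / n) / (sum(a * a for a in x) / n + lam)


x, y = [1.0, 2.0], [2.0, 4.0]
x2, y2 = x * 2, y * 2                      # the same data, duplicated

w_sum = ridge_1d_sum(x, y, 5.0)            # 10 / (5 + 5)
w_sum_doubled = ridge_1d_sum(x2, y2, 5.0)  # 20 / (10 + 5): less shrinkage
w_mean = ridge_1d_mean(x, y, 2.5)          # 5 / (2.5 + 2.5)
w_mean_doubled = ridge_1d_mean(x2, y2, 2.5)

print(w_sum, w_sum_doubled, w_mean, w_mean_doubled)
```

With a summed loss the fitted weight drifts toward the unregularized value of 2 as the data grows, even though λ never changed; with an averaged loss it stays put.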

Frequently asked questions

Why does L1 produce exactly-zero weights but L2 does not?

L1's penalty |w| has a constant-magnitude gradient (±λ) that keeps pushing a weight toward zero with the same force no matter how small it gets, so a weakly supported weight reaches exactly zero (with soft-thresholding) and stays pinned there. L2's penalty w² has gradient 2λw that vanishes as w shrinks, so weights approach zero asymptotically but never arrive.

Is L2 regularization the same as weight decay?

For plain SGD they are equivalent: adding λ‖w‖² to the loss makes the update multiply every weight by (1 − 2ηλ) each step (the factor-of-2 comes from differentiating w²; libraries that define a decay coefficient directly fold it in as (1 − ηλ)). For adaptive optimizers like Adam they diverge, which is why AdamW decouples weight decay from the gradient-based L2 term — it measurably improves generalization.

What is the difference between Lasso and Ridge regression?

Ridge regression is linear regression with an L2 penalty; it shrinks all coefficients smoothly and keeps every feature. Lasso regression uses an L1 penalty; it zeros out the coefficients of irrelevant features, performing automatic feature selection. Elastic Net blends both.

How do you choose the regularization strength lambda?

Tune λ by cross-validation over a logarithmic grid (e.g. 1e-4, 1e-3, … 1e2). Too small and you still overfit; too large and the model underfits, with every weight crushed toward zero. The sweet spot minimizes validation error, not training error.

Should I regularize the bias term?

No. The bias just shifts the output and doesn't contribute to model complexity or overfitting, so penalizing it only biases the predictions toward zero for no benefit. Standard libraries exclude the intercept from the penalty by default.

Does regularization require standardized features?

Yes, almost always. The penalty treats every weight equally, but weight size depends on units: a feature measured in millimeters has large numeric values and gets a tiny weight, while the same feature in kilometers needs a huge one, and the penalty then punishes the large-weight (kilometer) version far more. Standardize to zero mean and unit variance first.
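
A small sketch of the unit effect, using a one-feature ridge fit with a closed-form minimizer. The data is made up. For brevity the sketch only rescales to unit root-mean-square and fits no intercept, whereas a real standardizer such as StandardScaler also centers each feature.

```python
import math


def ridge_1d(x, y, lam):
    """Minimizer of mean((w*x_i - y_i)**2) + lam * w**2 (no intercept)."""
    n = len(x)
    return (sum(a * b for a, b in zip(x, y)) / n) / (sum(a * a for a in x) / n + lam)


def unit_rms(x):
    """Rescale a feature so its mean square is 1 (scaling only, no centering)."""
    s = math.sqrt(sum(a * a for a in x) / len(x))
    return [a / s for a in x]


y = [1.0, 2.0, 3.0]
x_km = [1.0, 2.0, 3.0]                    # distance in kilometers
x_mm = [v * 1e6 for v in x_km]            # the same distance in millimeters
lam = 1.0

# Raw features: the prediction for the middle point depends on the unit.
pred_km = ridge_1d(x_km, y, lam) * x_km[1]
pred_mm = ridge_1d(x_mm, y, lam) * x_mm[1]

# Rescaled features: the unit no longer matters.
s_km, s_mm = unit_rms(x_km), unit_rms(x_mm)
pred_km_scaled = ridge_1d(s_km, y, lam) * s_km[1]
pred_mm_scaled = ridge_1d(s_mm, y, lam) * s_mm[1]

print(pred_km, pred_mm, pred_km_scaled, pred_mm_scaled)
```

On the raw features the kilometer version is visibly shrunk while the millimeter version is almost untouched by the same λ; after rescaling both give the same prediction.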