Logistic Regression

Bend a straight line into a probability — the linear classifier that refuses to die

Logistic regression fits a sigmoid to a weighted sum of features, mapping any input to a probability between 0 and 1 — the workhorse linear classifier, trained by minimizing log loss with gradient descent in O(n·d) per epoch.

  • Model: σ(w·x + b)
  • Loss: Log loss (convex)
  • Training cost: O(n·d) per epoch
  • Prediction cost: O(d)
  • Decision boundary: Linear hyperplane

How logistic regression works

Suppose you want to predict whether an email is spam. You have features — counts of certain words, the time it was sent, whether it has an attachment. Linear regression would multiply each feature by a weight, add them up, and hand you a number. But that number is unbounded: it might say "−3.7" or "12,000," and neither is a probability. Logistic regression takes that exact same weighted sum and pushes it through one more step — a squashing function called the sigmoid — so the output always lands between 0 and 1.

The model has two parts. First, the linear part computes a score, often called the logit:

z = w₁x₁ + w₂x₂ + ... + w_d x_d + b  =  w·x + b

Then the sigmoid maps that score to a probability:

σ(z) = 1 / (1 + e^(−z))

When z is large and positive, e^(−z) shrinks to zero and σ approaches 1. When z is large and negative, σ approaches 0. At z = 0, σ is exactly 0.5 — the tipping point. So the model's prediction is P(y = 1 | x) = σ(w·x + b), and the decision boundary — the surface where the model is maximally undecided — is the flat hyperplane w·x + b = 0.
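To make the saturation and the 0.5 tipping point concrete, here is a minimal sketch in plain Python. The weights, bias, and sample points are made-up illustrative values, not taken from any real model.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written so exp never receives a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    """P(y = 1 | x) = sigmoid(w·x + b)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical weights for a two-feature model (illustrative values only).
w, b = [2.0, -1.0], 0.5

on_boundary = [0.25, 1.0]    # z = 2*0.25 - 1*1 + 0.5 = 0, so p = 0.5
far_positive = [5.0, 0.0]    # z = 10.5, p is very close to 1
far_negative = [-5.0, 1.0]   # z = -10.5, p is very close to 0

for x in (on_boundary, far_positive, far_negative):
    print(x, round(predict_proba(w, b, x), 6))
```

Any point that satisfies w·x + b = 0 lands on exactly 0.5, which is why the boundary is a flat hyperplane no matter how steep the sigmoid is.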

The crucial reframing: the weighted sum doesn't predict the probability directly, it predicts the log-odds. Rearranging the sigmoid gives z = log(p / (1 − p)). So each weight wⱼ has a clean meaning — increasing feature xⱼ by one unit multiplies the odds by e^(wⱼ). That interpretability is a big part of why logistic regression is still everywhere in medicine, credit scoring, and A/B testing, decades after fancier models arrived.
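The odds-multiplier reading can be checked numerically. The sketch below uses hypothetical weights and a hypothetical input; it bumps one feature by one unit, holds the other fixed, and compares the ratio of the odds against e^(wⱼ).

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def odds(p: float) -> float:
    """Odds in favour of the positive class: p / (1 - p)."""
    return p / (1.0 - p)

# Hypothetical model and input (illustrative values only).
w, b = [0.7, -0.3], -1.0
x_before = [1.0, 2.0]
x_after = [2.0, 2.0]   # feature 0 increased by one unit, feature 1 unchanged

def logit(x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

p_before = sigmoid(logit(x_before))
p_after = sigmoid(logit(x_after))

odds_ratio = odds(p_after) / odds(p_before)
print(odds_ratio, math.exp(w[0]))               # the two numbers agree
print(math.log(odds(p_before)), logit(x_before))  # log-odds equals the logit
```

Note that the multiplier applies to the odds, not to the probability itself: the change in p depends on where on the sigmoid the example currently sits.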

Training: maximum likelihood and log loss

Fitting the model means choosing the weights that make the observed labels most probable. For a single example with true label y ∈ {0, 1} and predicted probability p = σ(z), the likelihood is p if y = 1 and 1 − p if y = 0. Taking the negative log and summing over all n examples gives the cost we actually minimize — log loss, also called binary cross-entropy:

L(w, b) = −(1/n) Σᵢ [ yᵢ·log(pᵢ) + (1 − yᵢ)·log(1 − pᵢ) ]

There is no closed-form solution like there is for ordinary linear regression; you can't just invert a matrix. But the function has a wonderful property: it's convex in the weights, so every local minimum is a global one and there are no false bottoms to get trapped in. With an L2 penalty the loss is strictly convex, so that minimum is also unique. The gradient turns out to be strikingly clean. For weight wⱼ:

∂L/∂wⱼ = (1/n) Σᵢ (pᵢ − yᵢ)·xᵢⱼ
∂L/∂b  = (1/n) Σᵢ (pᵢ − yᵢ)

That (prediction − truth) term is identical in form to the gradient of linear regression with squared error — a lucky cancellation between the sigmoid's derivative and the cross-entropy's. You feed these gradients to gradient descent, take a step, recompute, and repeat until the loss stops dropping. Each epoch touches every example once across every feature, so the cost is O(n·d) per epoch for n examples and d features.
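A quick way to convince yourself that the gradient formulas match the loss is a finite-difference check. The sketch below is plain Python on a tiny made-up dataset; the data, starting weights, and step size h are arbitrary choices for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b, X, y):
    """Mean binary cross-entropy over the dataset."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return -total / len(X)

def gradients(w, b, X, y):
    """Analytic gradients: mean of (p - y) * x for w, mean of (p - y) for b."""
    n = len(X)
    gw, gb = [0.0] * len(w), 0.0
    for xi, yi in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
        for j, xij in enumerate(xi):
            gw[j] += err * xij
        gb += err
    return [g / n for g in gw], gb / n

# Toy data and an arbitrary starting point (illustrative values only).
X = [[0.5, 1.2], [-1.0, 0.3], [1.5, -0.7], [-0.2, -1.1]]
y = [1, 0, 1, 0]
w, b = [0.1, -0.2], 0.05

gw, gb = gradients(w, b, X, y)

# Central finite differences on the loss itself.
h = 1e-6
numeric_w = []
for j in range(len(w)):
    w_hi, w_lo = w.copy(), w.copy()
    w_hi[j] += h
    w_lo[j] -= h
    numeric_w.append((log_loss(w_hi, b, X, y) - log_loss(w_lo, b, X, y)) / (2 * h))
numeric_b = (log_loss(w, b + h, X, y) - log_loss(w, b - h, X, y)) / (2 * h)

print(gw, numeric_w)
print(gb, numeric_b)
```

If the analytic and numeric values disagree beyond rounding noise, the bug is almost always a sign error or a missing 1/n.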

When to reach for logistic regression

  • You need calibrated probabilities, not just labels. "This transaction is 87% likely to be fraud" is far more useful than a bare yes/no — it lets you rank, threshold by business cost, and abstain.
  • You need to explain the model. Each coefficient is an interpretable odds multiplier. Regulators in lending and insurance frequently require this; a random forest can't be defended in court the same way.
  • You have many features and limited data. A linear model with regularization rarely overfits the way a deep net does on a few thousand rows.
  • You need a fast, strong baseline. It trains in seconds, predicts in microseconds, and is the first thing every experienced practitioner fits before reaching for anything heavier.

Skip it when the true boundary is genuinely non-linear and you can't hand-engineer the right features — a kernel SVM, random forest, or neural network will pull ahead. And skip it when features interact in complicated ways you can't anticipate; logistic regression only sees additive, linear effects unless you feed it crafted interaction terms.
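XOR is the classic case where crafted interaction terms matter: no weights on x₁ and x₂ alone classify all four points correctly, but adding the product x₁·x₂ as a third feature makes the problem linear. The sketch below uses hand-picked (not trained) weights purely to show that such a solution exists; `with_interaction` is a hypothetical helper name.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, features, threshold=0.5):
    z = sum(wj * fj for wj, fj in zip(w, features)) + b
    return 1 if sigmoid(z) >= threshold else 0

def with_interaction(x1, x2):
    """Hypothetical feature map: append the product x1*x2 to the raw inputs."""
    return [x1, x2, x1 * x2]

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]              # XOR

# Hand-picked weights (an assumption for illustration, not a fitted model):
# z = 10*x1 + 10*x2 - 20*x1*x2 - 5
w, b = [10.0, 10.0, -20.0], -5.0

preds = [predict(w, b, with_interaction(x1, x2)) for x1, x2 in points]
print(preds)   # [0, 1, 1, 0]
```

The model is still linear in its inputs; the curvature lives entirely in the feature map, which is why you have to know (or guess) the interaction in advance.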

Logistic regression vs other classifiers

| Property | Logistic regression | Linear regression | Linear SVM | Decision tree | Naive Bayes | Neural net |
| --- | --- | --- | --- | --- | --- | --- |
| Output | Probability (0–1) | Real value | Signed margin | Class / leaf prob | Probability | Probability |
| Decision boundary | Linear | n/a | Linear (max-margin) | Axis-aligned steps | Linear (Gaussian NB with shared variance) | Arbitrary |
| Loss | Log loss (convex) | Squared error | Hinge loss | Gini / entropy | Likelihood (closed-form) | Any (non-convex) |
| Training cost | O(n·d) / epoch | O(n·d² + d³) | O(n·d) / epoch | O(n·d·log n) | O(n·d) | O(n·d·W) / epoch |
| Probabilities calibrated? | Yes, natively | No | No (needs Platt scaling) | Roughly | Often poorly | Usually, after softmax |
| Interpretable? | Yes: odds ratios | Yes | Coefficients, less so | Yes: the path | Yes | No |
| Handles non-linearity? | Only via feature engineering | No | With kernels | Yes | No | Yes |

The headline contrast is with linear regression: same linear core, but logistic wraps it in a sigmoid and swaps squared error for log loss so the output is a probability. Against a linear SVM the difference is the loss function — log loss cares about all points and gives probabilities; hinge loss cares only about points near the margin and gives none. Against a decision tree, logistic regression trades flexible, jagged boundaries for one clean, interpretable hyperplane.

What the numbers actually say

  • Prediction is one dot product. Scoring a 1,000-feature model is 1,000 multiply-adds plus one exp — on the order of a microsecond. That's why logistic regression powers ad-click and fraud systems that must respond in single-digit milliseconds across billions of requests a day.
  • Training scales linearly. One epoch over 1 million examples with 100 features is 100 million multiply-adds — well under a second on a modern CPU. Convergence typically takes tens to low hundreds of epochs, so a model fits in seconds, not hours.
  • Memory is tiny. The model is the weight vector: d + 1 floats. A 1,000-feature classifier stored as 32-bit floats is about 4 KB, small enough to ship inside a cookie or an edge function.
  • The convexity guarantee is real. With log loss plus an L2 penalty there is exactly one optimum, so two practitioners with the same data and the same regularization converge to the same weights. No random-seed lottery, unlike neural nets.
  • Perfect separation breaks it. On linearly separable data the unregularized weights grow without bound — in practice the solver hits its iteration cap with coefficients in the hundreds and probabilities pinned at 0 or 1. Always regularize.
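The perfect-separation bullet is easy to reproduce. The sketch below fits a one-feature model to four made-up, linearly separable points with plain gradient descent; `fit_1d`, the learning rate, and the penalty strength are illustrative assumptions, not tuned values.

```python
import math

def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_1d(xs, ys, lr=0.5, epochs=1000, l2=0.0):
    """Batch gradient descent for a single-feature logistic model."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(epochs):
        errs = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        gw = sum(e * x for e, x in zip(errs, xs)) / n + l2 * w
        gb = sum(errs) / n           # bias is not regularized
        w -= lr * gw
        b -= lr * gb
    return w, b

# Perfectly separable toy data: every negative x is class 0, every positive x is class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

free, reg = {}, {}
for epochs in (100, 1000, 10000):
    free[epochs], _ = fit_1d(xs, ys, epochs=epochs)           # no penalty
    reg[epochs], _ = fit_1d(xs, ys, epochs=epochs, l2=0.1)    # with L2 penalty
    print(epochs, round(free[epochs], 3), round(reg[epochs], 3))
```

Without the penalty the weight keeps creeping upward however long you train; with it, the weight settles at a finite value and stays there.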

JavaScript implementation

A complete batch-gradient-descent trainer. Features should be standardized (zero mean, unit variance) first so a single learning rate works across all dimensions.

const sigmoid = z => 1 / (1 + Math.exp(-z));

function dot(w, x) {
  let s = 0;
  for (let j = 0; j < w.length; j++) s += w[j] * x[j];
  return s;
}

// X: array of feature rows, y: array of 0/1 labels
function fit(X, y, { lr = 0.1, epochs = 500, l2 = 0.0 } = {}) {
  const n = X.length, d = X[0].length;
  const w = new Array(d).fill(0);
  let b = 0;

  for (let e = 0; e < epochs; e++) {
    const gw = new Array(d).fill(0);
    let gb = 0;

    for (let i = 0; i < n; i++) {
      const p = sigmoid(dot(w, X[i]) + b);
      const err = p - y[i];                  // the clean (pred − truth) term
      for (let j = 0; j < d; j++) gw[j] += err * X[i][j];
      gb += err;
    }

    for (let j = 0; j < d; j++) {
      gw[j] = gw[j] / n + l2 * w[j];          // L2 shrinks weights toward 0
      w[j] -= lr * gw[j];
    }
    b -= lr * (gb / n);
  }
  return { w, b };
}

function predictProba({ w, b }, x) { return sigmoid(dot(w, x) + b); }
function predict(model, x, threshold = 0.5) {
  return predictProba(model, x) >= threshold ? 1 : 0;
}

Two details that bite beginners. First, the bias b is not regularized — penalizing it just shifts every prediction off-center. Second, the gradient is averaged over n (divide by n) so the learning rate doesn't have to be retuned every time the dataset size changes.

Python implementation

The same algorithm, vectorized with NumPy so each epoch is two matrix-vector products instead of nested loops. This is orders of magnitude faster, and it is how real libraries do it.

import numpy as np

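def sigmoid(z):
    # Numerically stable: exp only ever sees a non-positive argument, so it
    # cannot overflow. (np.where evaluates both branches, so each branch must
    # be safe for every z, not just for the values it is selected for.)
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1 / (1 + e), e / (1 + e))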

def fit(X, y, lr=0.1, epochs=500, l2=0.0):
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        z = X @ w + b
        p = sigmoid(z)
        err = p - y                       # shape (n,)
        grad_w = (X.T @ err) / n + l2 * w
        grad_b = err.mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_proba(w, b, X):
    return sigmoid(X @ w + b)

def log_loss(y, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)          # never log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

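# In practice, reach for scikit-learn; its default solver (L-BFGS) is a far better optimizer:
# from sklearn.linear_model import LogisticRegression
# clf = LogisticRegression(C=1.0).fit(X, y)   # C = inverse regularization strength.
# scikit-learn sums the per-example loss instead of averaging it, so for the
# averaged loss used in fit() above, C corresponds to roughly 1 / (n * l2).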

Note the stable sigmoid and the np.clip in the loss. A naive 1/(1+exp(-z)) overflows for large negative z, and log(0) is -inf — both are classic sources of NaN losses that silently poison training.

Variants worth knowing

Softmax (multinomial) regression. Swap the sigmoid for the softmax and give each of K classes its own weight vector. The scores are normalized to a probability distribution over all classes that sums to 1. Binary logistic regression is exactly the two-class case. This is also the final layer of nearly every classification neural network.
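A minimal softmax sketch in plain Python, with a check that the two-class case reduces to a sigmoid of the score difference. Subtracting the maximum score before exponentiating is a standard stability trick; the scores are arbitrary illustrative values.

```python
import math

def softmax(scores):
    """Normalize raw class scores into probabilities that sum to 1."""
    m = max(scores)                       # shift so the largest exponent is 0
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Three-class example with made-up scores.
probs = softmax([2.0, 1.0, -1.0])
print(probs, sum(probs))

# Two-class case: softmax over (z0, z1) equals sigmoid(z1 - z0).
z0, z1 = -0.4, 1.3
p_softmax = softmax([z0, z1])[1]
p_sigmoid = sigmoid(z1 - z0)
print(p_softmax, p_sigmoid)

# The shift keeps huge scores from overflowing exp.
big = softmax([1000.0, 1001.0])
```

Only score differences matter, which is why binary logistic regression gets away with a single weight vector: it models z₁ − z₀ directly.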

Ridge (L2) and Lasso (L1) penalties. Adding λ·‖w‖² (ridge) keeps weights small and fixes the perfect-separation blow-up. Adding λ·‖w‖₁ (lasso) drives some weights to exactly zero, doubling as automatic feature selection. Elastic net blends the two.
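Why lasso produces exact zeros while ridge does not is easiest to see in the penalty step alone. The sketch below isolates that step, assuming the L1 penalty is handled with soft-thresholding (the proximal-gradient approach, one common choice among several); the function names, weights, learning rate, and λ are hypothetical.

```python
def ridge_step(w, lr, lam):
    """Gradient step on lam * ||w||^2 alone: each weight shrinks by a constant factor."""
    return [wj - lr * 2.0 * lam * wj for wj in w]

def lasso_step(w, lr, lam):
    """Soft-thresholding for lam * ||w||_1: move each weight toward 0 by lr*lam, clamping at 0."""
    t = lr * lam
    out = []
    for wj in w:
        if wj > t:
            out.append(wj - t)
        elif wj < -t:
            out.append(wj + t)
        else:
            out.append(0.0)       # small weights are set to exactly zero
    return out

# Illustrative weights: two large, two small.
w = [0.8, -0.03, 0.02, -1.5]
lr, lam = 0.1, 0.5

ridge_w = ridge_step(w, lr, lam)
lasso_w = lasso_step(w, lr, lam)
print(ridge_w)   # all four weights scaled by 0.9, none exactly zero
print(lasso_w)   # the two small weights become exactly 0.0
```

In a full trainer this step would alternate with the usual data-gradient update; it is shown in isolation here so the two shrinkage behaviours are easy to compare.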

Stochastic and mini-batch gradient descent. Instead of summing the gradient over all n examples per step, update from one example (SGD) or a small batch. This converges in far fewer passes on huge datasets and is how logistic regression scales to billions of rows in online ad systems.
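A minimal mini-batch sketch in plain Python, run on synthetic data. The batch size, learning rate, epoch count, seeds, and the data-generating rule are all illustrative assumptions; `fit_minibatch` is a hypothetical name.

```python
import math
import random

def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_minibatch(X, y, lr=0.1, epochs=50, batch_size=8, seed=0):
    """Mini-batch gradient descent: one weight update per small batch."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)                      # new example order every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            gw, gb = [0.0] * d, 0.0
            for i in batch:
                z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
                err = sigmoid(z) - y[i]
                for j in range(d):
                    gw[j] += err * X[i][j]
                gb += err
            m = len(batch)                      # average over the batch, not over n
            for j in range(d):
                w[j] -= lr * gw[j] / m
            b -= lr * gb / m
    return w, b

# Synthetic data: label is 1 when x0 + x1 > 0 (made up for this example).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
y = [1 if x0 + x1 > 0 else 0 for x0, x1 in X]

w, b = fit_minibatch(X, y)
correct = sum(
    (sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5) == bool(yi)
    for xi, yi in zip(X, y)
)
accuracy = correct / len(X)
print(w, b, accuracy)
```

Setting batch_size=1 gives pure SGD and batch_size=len(X) recovers the batch trainer; everything in between trades gradient noise for update frequency.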

Ordinal and conditional logit. Ordinal logistic regression handles ranked categories (poor/fair/good) with shared cutpoints; conditional logit handles choice data where each decision picks one option from a varying set. Both keep the log-odds linear core.

Logistic regression as one neuron. A single sigmoid unit trained with cross-entropy is logistic regression. Stack many of them with non-linearities between layers and you have a neural network — logistic regression is the atom from which deep learning is built.

Common bugs and edge cases

  • Forgetting to standardize features. If one feature ranges 0–1 and another ranges 0–1,000,000, a single learning rate can't serve both — the large-scale feature dominates the gradient and training crawls or diverges. Standardize first.
  • Squared error instead of log loss. MSE on a sigmoid is non-convex and has vanishing gradients when the model is confidently wrong, so learning stalls exactly when it should correct fastest. Always use cross-entropy.
  • Unstable sigmoid. In 1/(1+exp(-z)) the exp overflows for very negative z and the probability saturates to exactly 0; log(0) then turns the loss into inf or NaN. Use the stable form shown above.
  • Not regularizing on separable data. Perfect separation sends the weights to infinity; the loss approaches zero but the model is wildly overconfident and the solver never converges. Add an L2 penalty.
  • Reading the 0.5 threshold as sacred. 0.5 is only optimal when the classes are balanced and false positives cost the same as false negatives. For rare-event detection (fraud, disease) you almost always move the threshold — or reweight the classes.
  • Trusting coefficients under multicollinearity. When two features are highly correlated, their weights become unstable and can flip sign between runs even though predictions are fine. Drop, combine, or regularize before interpreting individual odds ratios.
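The threshold point above can be made concrete. Assuming the predicted probabilities are calibrated and that correct decisions cost nothing, flagging an example is cheaper in expectation whenever p·cost_fn > (1 − p)·cost_fp, which rearranges to p > cost_fp / (cost_fp + cost_fn). The cost figures below are invented for illustration.

```python
import math

def cost_optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Threshold that minimizes expected cost, given calibrated probabilities."""
    return cost_fp / (cost_fp + cost_fn)

def expected_cost(p: float, flag: bool, cost_fp: float, cost_fn: float) -> float:
    """Expected cost of one decision when the true positive probability is p."""
    return (1.0 - p) * cost_fp if flag else p * cost_fn

# Equal costs recover the familiar 0.5.
t_equal = cost_optimal_threshold(1.0, 1.0)

# Hypothetical fraud setting: a missed fraud costs 50x a false alarm.
t_fraud = cost_optimal_threshold(1.0, 50.0)
print(t_equal, t_fraud)

# At p = 0.10 the 0.5 rule would not flag, but flagging is far cheaper in expectation.
p = 0.10
cost_if_flagged = expected_cost(p, True, 1.0, 50.0)
cost_if_ignored = expected_cost(p, False, 1.0, 50.0)
print(cost_if_flagged, cost_if_ignored)
```

The model does not change at all here; only the decision rule layered on top of its probabilities does.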

Frequently asked questions

Why use the sigmoid instead of fitting a straight line to 0/1 labels?

A straight line is unbounded: it predicts probabilities below 0 and above 1, which are meaningless. The sigmoid squashes any real number into (0, 1), and its inverse is the log-odds, so the linear weighted sum is exactly equal to log(p / (1 − p)). The math stays linear while the output stays a valid probability.

Why is log loss used instead of squared error?

Squared error on top of a sigmoid is non-convex, so gradient descent can get stuck in local minima. Log loss (binary cross-entropy) is convex in the weights, guaranteeing a single global optimum, and it penalizes confident wrong predictions far more harshly — the loss goes to infinity as a confident prediction approaches the wrong label.
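The difference in how the two losses treat a confident mistake shows up directly in the gradient with respect to the logit z. For log loss it is p − y; for squared error (p − y)² it picks up an extra factor of the sigmoid's slope, 2·(p − y)·p·(1 − p). The sketch below evaluates both at an arbitrarily chosen confidently wrong point.

```python
import math

def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def grad_log_loss(z: float, y: int) -> float:
    """d/dz of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z)."""
    return sigmoid(z) - y

def grad_squared_error(z: float, y: int) -> float:
    """d/dz of (p - y)^2 with p = sigmoid(z)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

# Confidently wrong: the true label is 1 but the logit is strongly negative.
z, y = -10.0, 1
g_ce = grad_log_loss(z, y)
g_mse = grad_squared_error(z, y)
print(g_ce, g_mse)
```

The log-loss gradient stays close to −1, a strong push in the right direction, while the squared-error gradient is several orders of magnitude smaller because p·(1 − p) collapses as the sigmoid saturates.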

Is logistic regression actually a regression or a classifier?

Both, depending on how you read it. It regresses a probability — a continuous value in (0, 1) — onto the features. You turn it into a classifier by thresholding that probability, usually at 0.5. The name comes from the regression of log-odds, not from predicting a real-valued target.

What is the decision boundary of logistic regression?

It is always linear in the feature space: the set of points where the weighted sum w·x + b equals 0, which is exactly where the predicted probability equals 0.5. To get curved boundaries you must add non-linear features (polynomial terms, interactions) — the model itself only ever draws a hyperplane.

How does logistic regression extend to more than two classes?

Replace the sigmoid with the softmax function, giving multinomial (softmax) regression: each class gets its own weight vector, and softmax normalizes the scores into a probability distribution that sums to 1. The binary sigmoid is the two-class special case of softmax.

Why does logistic regression fail on perfectly separable data?

If a hyperplane perfectly separates the classes, the maximum-likelihood weights diverge to infinity — the model keeps pushing the sigmoid steeper to drive log loss toward zero, and the weights never converge. L2 regularization fixes this by penalizing large weights, pulling the optimum back to a finite point.