Machine Learning

ROC Curves & AUC

One curve, every threshold — and a single number that scores the whole classifier

A ROC curve plots a classifier's true positive rate against its false positive rate as you sweep the decision threshold; the area under it (AUC) is the probability the model ranks a random positive above a random negative, from 0.5 (coin flip) to 1.0 (perfect).

  • x-axis: False positive rate
  • y-axis: True positive rate
  • AUC range: 0.5 → 1.0
  • AUC compute: O(n log n)
  • Random baseline: AUC = 0.5

The intuition: one classifier, many classifiers

A binary classifier — logistic regression, a gradient-boosted tree, a neural network — rarely outputs a clean "yes" or "no." It outputs a score: a number between 0 and 1 that ranks how confident it is that an example is positive. The decision to call something positive is a separate, downstream choice: pick a threshold, and everything scoring above it becomes a prediction of "positive."

That threshold is a dial, not a constant. Crank it to 0.99 and you predict positive almost never — you catch only the most blatant cases (few false alarms, but you miss most true positives). Crank it to 0.01 and you predict positive almost always — you catch everything, at the price of drowning in false alarms. Every threshold in between is a different operating point of the same model.
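To make the dial concrete, here is a minimal sketch that counts outcomes at three thresholds. The scores and labels are invented for illustration, and the helper name `confusion_at` is hypothetical; it also assumes the convention that a score equal to the threshold counts as positive.

```python
# Hypothetical scores and labels, invented for illustration.
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

def confusion_at(threshold, scores, labels):
    """Count (TP, FP, FN, TN) when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

for t in (0.99, 0.50, 0.01):
    tp, fp, fn, tn = confusion_at(t, scores, labels)
    print(f"threshold={t:.2f}  TP={tp} FP={fp} FN={fn} TN={tn}")
# threshold=0.99  TP=0 FP=0 FN=4 TN=4
# threshold=0.50  TP=3 FP=1 FN=1 TN=3
# threshold=0.01  TP=4 FP=4 FN=0 TN=0
```

The same eight scores yield three different classifiers; only the cutoff changed.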

The ROC curve — Receiver Operating Characteristic, a name inherited from World War II radar operators deciding whether a blip was an enemy plane or noise — is what you get when you sweep that dial from one extreme to the other and trace the trade-off. Each threshold gives one point; the curve is all of them. It answers a question accuracy can't: how good is the model's ranking, independent of where you happen to set the cutoff?

How the curve is built

Two rates define the axes, and both are computed within a class, which is the source of ROC's robustness:

  • True Positive Rate (TPR, also called sensitivity or recall) = TP / (TP + FN) — of all the actual positives, what fraction did we catch? This is the y-axis.
  • False Positive Rate (FPR, also called fall-out, equal to 1 − specificity) = FP / (FP + TN) — of all the actual negatives, what fraction did we wrongly flag? This is the x-axis.
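As a small illustration of why within-class rates are robust, the sketch below computes both rates from confusion-matrix counts and then multiplies the negative class by ten. The counts are hypothetical and the helper name `rates` is an assumption.

```python
def rates(tp, fp, fn, tn):
    """Return (TPR, FPR) from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # fraction of actual positives caught
    fpr = fp / (fp + tn)   # fraction of actual negatives wrongly flagged
    return tpr, fpr

# Hypothetical counts: 100 positives, 10,000 negatives.
print(rates(tp=80, fp=200, fn=20, tn=9_800))      # (0.8, 0.02)

# Ten times as many negatives, flagged at the same per-negative rate:
# both rates are unchanged, because each is normalized within its own class.
print(rates(tp=80, fp=2_000, fn=20, tn=98_000))   # (0.8, 0.02)
```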

Now sweep. Start the threshold above the highest score: nothing is predicted positive, so TP = FP = 0, and you sit at the origin (0, 0). Lower the threshold one example at a time. Each time the threshold drops past a real positive's score, TPR ticks up — the curve steps up. Each time it drops past a real negative's score, FPR ticks up — the curve steps right. When the threshold falls below every score, everything is predicted positive: TPR = FPR = 1, the top-right corner (1, 1).

A perfect classifier scores every positive above every negative, so the sweep goes straight up the y-axis to (0, 1) and then straight across to (1, 1) — hugging the top-left corner, enclosing the entire unit square. A useless classifier scores positives and negatives interchangeably, so up-steps and right-steps interleave evenly and the curve follows the diagonal from (0,0) to (1,1). The area under the curve is the score: 1.0 for perfect, 0.5 for the diagonal.
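The sweep described above can be sketched in a few lines. This is an illustrative version on invented data, not a reference implementation; `roc_points` and `trapezoid_area` are hypothetical helper names, and tied scores are collapsed into one point.

```python
def roc_points(scores, labels):
    """Sweep the threshold from high to low; emit one (FPR, TPR) point per distinct score."""
    P = sum(labels)
    N = len(labels) - P
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]          # threshold above every score
    tp = fp = 0
    for k, (s, y) in enumerate(ranked):
        if y == 1:
            tp += 1                # passing a positive: step up
        else:
            fp += 1                # passing a negative: step right
        last_of_tie = k == len(ranked) - 1 or ranked[k + 1][0] != s
        if last_of_tie:
            points.append((fp / N, tp / P))
    return points

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
mixed = [1, 1, 0, 1, 0, 0]      # one negative outranks one positive
perfect = [1, 1, 1, 0, 0, 0]    # every positive outranks every negative

print(trapezoid_area(roc_points(scores, mixed)))    # about 0.889 (8 of 9 pairs ordered correctly)
print(trapezoid_area(roc_points(scores, perfect)))  # about 1.0
```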

What AUC actually measures

The deepest fact about AUC is that it is not merely "the area under a curve" — it has an exact probabilistic meaning. AUC equals the probability that the classifier assigns a higher score to a randomly drawn positive example than to a randomly drawn negative example:

AUC = P( score(random positive) > score(random negative) )

This is why AUC is a measure of ranking quality, not of calibration or of any particular decision. It doesn't care whether the scores are 0.001 and 0.002 or 0.5 and 0.9, only that positives tend to outrank negatives. And it connects ROC directly to a classical rank statistic: AUC is the normalized Mann-Whitney U statistic (equivalently, a rescaling of the Wilcoxon rank-sum statistic). That equivalence, brought to wide attention by Hanley and McNeil in 1982, is what lets you compute AUC in O(n log n) by sorting and ranking, with no curve drawn at all:

AUC = (S_pos − P·(P+1)/2) / (P · N)

where S_pos is the sum of the ranks of the P positive examples (ranking all P+N examples by score, smallest = rank 1), and N is the number of negatives. Ties are handled by assigning the average of the tied ranks, which correctly contributes 0.5 to the pairwise count for each tied positive-negative pair.
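A quick way to convince yourself of the equivalence is to compute both sides on a tiny example that includes a tie. The sketch below uses invented data and hypothetical helper names (`auc_pairwise`, `auc_rank`); the brute-force version is O(P·N) and is there only as a check.

```python
def auc_pairwise(scores, labels):
    """Fraction of positive-negative pairs ranked correctly; a tie counts 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def auc_rank(scores, labels):
    """Same quantity from the rank-sum formula, with average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2   # average of ranks i+1 .. j
        i = j
    P = sum(labels)
    N = len(labels) - P
    s_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (s_pos - P * (P + 1) / 2) / (P * N)

scores = [0.9, 0.7, 0.7, 0.6, 0.4, 0.2]   # one positive ties one negative at 0.7
labels = [1,   1,   0,   1,   0,   0]

print(auc_pairwise(scores, labels))  # 7.5 / 9, about 0.833
print(auc_rank(scores, labels))      # same value
```

Here the positive ranks are 6, 4.5 and 3, so S_pos = 13.5 and (13.5 − 6) / 9 = 7.5 / 9, matching the pairwise count of 3 + 2.5 + 2 wins.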

When to use ROC AUC — and when not to

  • Model comparison and selection. A single threshold-free scalar makes it trivial to rank candidate models or track a model over time. This is ROC AUC's home turf.
  • Roughly balanced classes. When positives and negatives are comparable in count, ROC and precision-recall agree and ROC's symmetry is a clean summary.
  • When false-positive and false-negative costs are unknown. AUC integrates over all thresholds, so you don't have to commit to a cost ratio up front.
  • Avoid for heavily imbalanced data with costly false positives. When 99.9% of cases are negative, FPR's denominator is enormous; thousands of false alarms barely nudge the x-axis, and ROC AUC looks great while the deployed system is unusable. Reach for a precision-recall curve and average precision (AP) instead.
  • Don't deploy on AUC alone. It tells you the model ranks well; it does not tell you which cutoff to ship. That is a separate decision (see below).

ROC AUC vs other evaluation metrics

| | ROC AUC | PR AUC / Avg precision | Accuracy | F1 score | Log loss |
|---|---|---|---|---|---|
| Threshold-free | Yes | Yes | No (one cutoff) | No (one cutoff) | Yes |
| Robust to class imbalance | Partly (optimistic) | Yes | No (collapses) | Partly | Partly |
| Measures ranking | Yes | Yes | No | No | No |
| Measures calibration | No | No | No | No | Yes |
| Range | 0.5–1.0 | baseline–1.0 | 0–1 | 0–1 | 0→∞ (lower better) |
| Best when | Comparing models, balanced | Rare positives, costly FP | Balanced, equal costs | Single cutoff, imbalance | Probabilities matter |

The headline distinction is ROC AUC vs PR AUC. The two curves share one axis, since recall is the same quantity as TPR, and differ in the other: ROC pairs it with FPR, which divides false positives by the (huge) negative count, while PR pairs it with precision, which divides true positives by all predicted positives. On a 1:1000 imbalanced fraud dataset, a model can post ROC AUC = 0.95 and average precision = 0.30 simultaneously. Both are true: the ranking is good, but the precision at any usable recall is poor. Report both when classes are skewed.
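The gap between the two metrics can be reproduced with a deterministic toy ranking. The layout below is an assumption made for illustration (10 positives among 1,000 negatives, a milder skew than the fraud example), and both helper names are hypothetical; average precision is computed here as the mean of precision at the rank of each positive.

```python
def roc_auc(labels_ranked):
    """AUC from labels listed in descending score order (assumes no tied scores)."""
    P = sum(labels_ranked)
    N = len(labels_ranked) - P
    negatives_seen = 0
    correct_pairs = 0
    for y in labels_ranked:
        if y == 1:
            correct_pairs += N - negatives_seen  # negatives ranked below this positive
        else:
            negatives_seen += 1
    return correct_pairs / (P * N)

def average_precision(labels_ranked):
    """Mean of precision measured at the rank of each positive."""
    hits = 0
    total = 0.0
    for k, y in enumerate(labels_ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits

# Hypothetical ranking: every positive sits behind 5 more negatives,
# and the remaining 950 negatives trail at the bottom.
ranked = ([0] * 5 + [1]) * 10 + [0] * 950

print(roc_auc(ranked))            # 0.9725
print(average_precision(ranked))  # about 0.167
```

Every positive outranks at least 95% of the negatives, so ROC AUC is high, yet at each positive's rank only one in six flagged cases is real.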

What the numbers actually say

  • The diagonal is exactly AUC = 0.5. Not "around" 0.5 — a classifier outputting random scores has expected AUC precisely 0.5, the area of the lower triangle of the unit square.
  • Compute cost is the sort: O(n log n) time, O(n) space, via the Mann-Whitney rank formula. Trapezoidal integration of an explicit curve is O(n log n) to sort plus O(n) to sweep. The asymptotics are the same, but the rank trick avoids materializing the curve.
  • Imbalance optimism is quantifiable. On a 100:1 negative:positive split with 100,000 negatives and 1,000 positives, adding 1,000 false positives raises FPR by only 1,000 / 100,000 = 0.01 but can crater precision from 0.9 to roughly 0.5. ROC barely moves; PR plummets.
  • Sampling variance is real. The standard error of AUC roughly scales as 1/√(min(P, N)). With only 20 positives, a measured AUC of 0.80 has a 95% confidence interval that can span 0.65 to 0.92 — don't over-read AUC differences on tiny test sets.
  • AUC ≈ 1.0 on real-world data is a red flag, not a trophy. In practice it almost always means a leaked feature (e.g. an ID correlated with the label, or a post-outcome column) rather than a genuinely separable problem.
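The imbalance arithmetic in the list above is easy to check directly. The counts below are hypothetical, chosen to match a 100:1 split with a starting precision of 0.9.

```python
# Hypothetical counts: 100,000 negatives, 1,000 positives (100:1).
negatives = 100_000
tp, fp = 900, 100           # starting point: precision 0.9

def fpr(fp):
    return fp / negatives

def precision(tp, fp):
    return tp / (tp + fp)

extra_fp = 1_000            # a flood of new false alarms, same true positives
print(f"FPR:       {fpr(fp):.3f} -> {fpr(fp + extra_fp):.3f}")
print(f"precision: {precision(tp, fp):.2f} -> {precision(tp, fp + extra_fp):.2f}")
# FPR:       0.001 -> 0.011
# precision: 0.90 -> 0.45
```

The ROC x-coordinate moves by one hundredth of the axis, while precision is cut in half.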

JavaScript implementation

Two functions: the fast O(n log n) AUC via the rank formula (with tie handling), and an explicit curve sweep for plotting.

// Fast AUC via the Mann-Whitney rank statistic. O(n log n).
// scores[i] = model score, labels[i] = 1 (positive) or 0 (negative).
function auc(scores, labels) {
  const n = scores.length;
  const idx = [...Array(n).keys()].sort((a, b) => scores[a] - scores[b]);

  // Assign ranks 1..n, averaging ranks across tied scores.
  const rank = new Array(n);
  let i = 0;
  while (i < n) {
    let j = i;
    while (j < n && scores[idx[j]] === scores[idx[i]]) j++;
    const avg = (i + 1 + j) / 2;            // average of ranks (i+1)..j
    for (let k = i; k < j; k++) rank[idx[k]] = avg;
    i = j;
  }

  let P = 0, N = 0, rankSumPos = 0;
  for (let t = 0; t < n; t++) {
    if (labels[t] === 1) { P++; rankSumPos += rank[t]; }
    else N++;
  }
  if (P === 0 || N === 0) return NaN;       // AUC undefined with one class
  return (rankSumPos - P * (P + 1) / 2) / (P * N);
}

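// Explicit ROC curve for plotting: returns [{fpr, tpr, threshold}, ...].
// Tied scores are collapsed into one point so the trapezoid area matches auc().
function rocCurve(scores, labels) {
  const n = scores.length;
  const order = [...Array(n).keys()].sort((a, b) => scores[b] - scores[a]); // high → low
  const P = labels.filter(l => l === 1).length;
  const N = n - P;
  const points = [{ fpr: 0, tpr: 0, threshold: Infinity }];
  let tp = 0, fp = 0;
  for (let k = 0; k < n; k++) {
    const i = order[k];
    if (labels[i] === 1) tp++; else fp++;
    // Emit a point only after the last example of a tied group; otherwise
    // the staircase would depend on the arbitrary order within the tie.
    if (k === n - 1 || scores[order[k + 1]] !== scores[i]) {
      points.push({ fpr: fp / N, tpr: tp / P, threshold: scores[i] });
    }
  }
  return points;
}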

The rank-based auc and the trapezoid area of rocCurve agree to floating-point precision, provided tied scores are collapsed into a single curve point; they are two views of the same statistic. The curve version is what you draw; the rank version is what you report.

Python implementation

In production you would call scikit-learn — but the point of writing it out is that AUC is just a rank sum, not magic.

import numpy as np

def auc(scores, labels):
    """AUC via the Mann-Whitney U / normalized rank-sum. O(n log n)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores, kind="mergesort")  # stable
    s_sorted = scores[order]

    # Average ranks for ties (ranks are 1-based).
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        while j < len(scores) and s_sorted[j] == s_sorted[i]:
            j += 1
        ranks[order[i:j]] = (i + 1 + j) / 2.0
        i = j

    pos = labels == 1
    P, N = pos.sum(), (~pos).sum()
    if P == 0 or N == 0:
        return float("nan")
    return (ranks[pos].sum() - P * (P + 1) / 2.0) / (P * N)

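def roc_curve(scores, labels):
    """Returns (fpr, tpr, thresholds), sorted by descending threshold."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    order = np.argsort(-scores, kind="mergesort")  # high → low, stable
    scores, labels = scores[order], labels[order]
    # Keep one point per distinct score (the last index of each tied group),
    # so ties form a single diagonal segment and the area matches auc().
    last = np.append(np.flatnonzero(np.diff(scores)), len(scores) - 1)
    tps = np.cumsum(labels == 1)[last]
    fps = np.cumsum(labels == 0)[last]
    P, N = (labels == 1).sum(), (labels == 0).sum()
    fpr = np.concatenate([[0], fps / N])
    tpr = np.concatenate([[0], tps / P])
    return fpr, tpr, np.concatenate([[np.inf], scores[last]])

# Verify the curve area equals the rank-based AUC (use np.trapz on NumPy < 2.0):
# fpr, tpr, _ = roc_curve(s, y);  np.isclose(np.trapezoid(tpr, fpr), auc(s, y))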

Both implementations deliberately handle the degenerate single-class case (returning NaN) — a classic crash in hand-rolled metrics, where a validation fold happens to contain no positives and you divide by zero.

Variants and relatives worth knowing

Gini coefficient. In credit scoring, the standard metric is Gini = 2·AUC − 1, a linear rescaling that maps the random baseline to 0 and perfect to 1. Same information, different convention.

Partial AUC (pAUC). When only a slice of the curve matters — say, the region with FPR < 0.1 because you can't tolerate more false alarms — integrate the area only over that range. A model can win on full AUC yet lose on the FPR region you actually operate in.
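A minimal sketch of partial AUC, assuming the curve is already available as (FPR, TPR) points sorted by FPR. The two curves are invented to show a ranking reversal, `partial_auc` is a hypothetical helper, and the result is the raw, unnormalized area (so its maximum is `max_fpr`).

```python
def partial_auc(points, max_fpr):
    """Trapezoid area under ROC points (sorted by FPR), restricted to FPR <= max_fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate TPR linearly where the segment crosses the cutoff.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical ROC curves for two models.
model_a = [(0.0, 0.0), (0.1, 0.3), (0.3, 0.95), (1.0, 1.0)]
model_b = [(0.0, 0.0), (0.1, 0.6), (0.5, 0.80), (1.0, 1.0)]

print(partial_auc(model_a, 1.0), partial_auc(model_b, 1.0))  # about 0.8225 vs 0.76: A wins overall
print(partial_auc(model_a, 0.1), partial_auc(model_b, 0.1))  # about 0.015 vs 0.03: B wins where FPR < 0.1
```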

Precision-Recall curve and Average Precision. Plot precision against recall instead of TPR against FPR. AP summarizes the area under the PR curve and is the metric of choice for rare-positive detection and information retrieval.

Multiclass AUC. ROC is inherently binary. Extend it by one-vs-rest (averaging the AUC of each class against all others) or one-vs-one (averaging over all class pairs, the Hand-Till method). Used to evaluate a softmax classifier head.
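One-vs-rest can be sketched as a loop over classes. The version below assumes an unweighted (macro) average, which is only one of several possible weightings; the probabilities are invented and the helper names are hypothetical.

```python
def binary_auc(scores, labels):
    """Pairwise AUC; a tied positive-negative pair counts 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def ovr_macro_auc(prob_rows, y_true, classes):
    """Average the AUC of each class scored against all other classes."""
    aucs = []
    for col, cls in enumerate(classes):
        scores = [row[col] for row in prob_rows]
        labels = [1 if y == cls else 0 for y in y_true]
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / len(aucs), aucs

# Hypothetical softmax outputs for six examples, columns ordered a, b, c.
classes = ["a", "b", "c"]
probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.4, 0.4, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.4, 0.3],
]
y_true = ["a", "a", "b", "b", "c", "c"]

macro, per_class = ovr_macro_auc(probs, y_true, classes)
print(per_class)  # [1.0, 0.9375, 1.0]
print(macro)      # about 0.979
```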

Youden's J and the operating point. J = TPR − FPR is maximized at the ROC point geometrically farthest above the diagonal — a common, cost-agnostic rule for choosing the deployment threshold once you've accepted the model.
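Given curve points that carry their thresholds, picking the Youden point is a one-line maximization. The points below are hypothetical, and `youden_point` is an illustrative helper name.

```python
def youden_point(points):
    """points: (fpr, tpr, threshold) triples. Return the triple maximizing J = TPR - FPR."""
    return max(points, key=lambda p: p[1] - p[0])

# Hypothetical ROC points, ordered from the strictest cutoff to the loosest.
points = [
    (0.00, 0.00, float("inf")),
    (0.05, 0.50, 0.9),
    (0.10, 0.80, 0.7),
    (0.30, 0.90, 0.5),
    (1.00, 1.00, 0.1),
]

fpr, tpr, threshold = youden_point(points)
print(threshold, round(tpr - fpr, 2))  # 0.7 0.7
```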

Common bugs and edge cases

  • Passing predicted labels instead of scores. Feed predict() (hard 0/1) into an AUC function and you collapse the curve to three points; you must pass predict_proba() or decision_function() continuous scores.
  • Trusting AUC on extreme imbalance. A 0.97 ROC AUC on 1:10000 data can hide an average precision of 0.05. Always report PR AUC alongside when positives are rare.
  • Reading AUC as a deployable threshold. AUC is an average over all cutoffs; it never tells you the cutoff. Choose the operating point with Youden's J, a fixed FPR budget, or a cost matrix.
  • Inverted scores giving AUC < 0.5. If your "positive" probability is actually the probability of the negative class, AUC comes out below 0.5. Don't conclude the model is broken: flip the sign of the scores and an AUC of a becomes 1 − a.
  • Mishandling ties. Assigning sequential ranks to tied scores (instead of the average) biases AUC. A constant classifier (all scores equal) must yield exactly 0.5; if your code returns 0 or 1, your tie handling is wrong.
  • Computing AUC on the training set. Like any metric, it must be measured on held-out data. A near-1.0 AUC on train and a 0.6 on test is overfitting, not a great model.
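Several of these bugs can be caught with cheap sanity checks. The sketch below uses a brute-force pairwise AUC (a hypothetical helper, O(P·N), fine for tests) on invented data.

```python
def auc(scores, labels):
    """Pairwise AUC; a tied positive-negative pair counts 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1,   1,   0,   1,   0,   0]

base = auc(scores, labels)                               # 8/9, about 0.889
constant = auc([0.5] * len(labels), labels)              # all ties
inverted = auc([-s for s in scores], labels)             # sign flipped
hard = auc([1 if s >= 0.5 else 0 for s in scores], labels)  # labels, not scores

print(constant)          # 0.5: correct tie handling
print(base + inverted)   # about 1.0: flipping scores maps a to 1 - a
print(base, hard)        # about 0.889 vs 0.833: hard labels discard ranking information
```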

Frequently asked questions

What does an AUC of 0.5 mean?

An AUC of 0.5 means the classifier ranks positives above negatives no better than a coin flip — its ROC curve hugs the diagonal. AUC is exactly the probability that the model scores a randomly chosen positive higher than a randomly chosen negative, so 0.5 is no discrimination, 1.0 is perfect separation, and below 0.5 means the scores are inverted (flip the sign and you beat random).

Why use ROC AUC instead of accuracy?

Accuracy is computed at a single threshold and collapses on imbalanced data: a classifier that always predicts 'negative' scores 99% accuracy when only 1% of cases are positive. ROC AUC is threshold-independent: it summarizes performance across every possible cutoff, and because both axes are rates (normalized within each class), the curve does not shift when the positive/negative ratio changes. That same invariance is why it can look optimistic when positives are rare and false positives are costly.

When should I use a precision-recall curve instead of ROC?

Use a precision-recall curve when positives are rare and false positives are costly — fraud, disease screening, ad clicks. ROC's false positive rate divides by the large negative count, so a flood of false alarms barely moves the x-axis and the curve looks deceptively good. Precision divides by predicted positives, exposing that flood directly. On balanced data the two tell the same story.

How is AUC actually computed from predictions?

The fast way skips the curve entirely. Because AUC equals the probability a positive outranks a negative, it is identical to the normalized Mann-Whitney U statistic: sort all examples by score, assign ranks (averaging ranks across ties), sum the ranks of the positives to get S_pos, compute U = S_pos − P(P+1)/2, then divide by P·N. The cost is O(n log n) for the sort, the same order as building the curve and integrating it with trapezoids, but no curve is ever materialized.

Does AUC tell me which threshold to deploy?

No. AUC aggregates over all thresholds, so two models with the same AUC can behave completely differently at the cutoff you actually ship. Pick the operating point separately — by maximizing Youden's J (TPR − FPR), by fixing a tolerable false positive rate, or by minimizing expected cost when false positives and false negatives have different prices.
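For the cost-based option, a minimal sketch: score every candidate operating point by its expected cost and keep the cheapest. The curve points, class counts, and prices are all hypothetical, and `min_cost_point` is an illustrative helper name.

```python
def min_cost_point(points, n_pos, n_neg, cost_fp, cost_fn):
    """points: (fpr, tpr, threshold) triples. Return the triple with the lowest expected cost."""
    def expected_cost(point):
        fpr, tpr, _ = point
        false_positives = fpr * n_neg
        false_negatives = (1 - tpr) * n_pos
        return cost_fp * false_positives + cost_fn * false_negatives
    return min(points, key=expected_cost)

# Hypothetical ROC points with their thresholds.
points = [
    (0.00, 0.00, float("inf")),
    (0.05, 0.50, 0.9),
    (0.10, 0.80, 0.7),
    (0.30, 0.90, 0.5),
    (1.00, 1.00, 0.1),
]

# 100 positives, 10,000 negatives; a false alarm costs 1 unit.
print(min_cost_point(points, 100, 10_000, cost_fp=1, cost_fn=50)[2])  # 0.7
print(min_cost_point(points, 100, 10_000, cost_fp=1, cost_fn=12)[2])  # 0.9
```

The curve is identical in both calls; only the price of a miss changed, and the cheapest cutoff moved with it.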

What is a good AUC value?

It depends on the domain, but rough field conventions: 0.5 is random, 0.7 is acceptable, 0.8 is good, and above 0.9 is excellent — though in easy problems 0.9 may be mediocre and in hard ones (predicting human behavior, noisy biology) 0.65 can be state of the art. An AUC near 1.0 on real data usually signals target leakage, not a brilliant model.