Security
Differential Privacy
Mathematically proven privacy — by adding exactly the right amount of noise
Differential privacy is a mathematical guarantee that a query's output barely changes whether or not any single person is in the dataset, achieved by adding calibrated noise scaled to the query's sensitivity and a privacy budget ε.
- Privacy parameterε (epsilon)
- Laplace noise scaleΔf / ε
- Count sensitivityΔf = 1
- Compositionε values add
- Local DP error≈ √n worse
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The idea: your presence shouldn't move the needle
Suppose a hospital publishes a statistic: "47,213 patients in this county have diabetes." You're worried someone could ask the same query the day before you enrolled — getting 47,212 — and the day after, getting 47,213, and conclude that you have diabetes. That subtraction attack is the whole game. Differential privacy makes it impossible by guaranteeing that the published answer looks essentially the same whether or not you're in the data at all.
The trick is to add random noise to the true answer before releasing it. Instead of "47,213" the system might report "47,210" or "47,216" — close enough to be useful for research, but blurry enough that no one can tell whether the count moved because of you or because of the noise. Crucially, the noise is calibrated: just enough to hide one person, not so much that the statistic becomes garbage.
Differential privacy was defined by Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith in 2006 (the "Calibrating Noise to Sensitivity" paper). It earned the 2017 Gödel Prize. The key conceptual leap: privacy is a property of the algorithm that releases the answer, not of the dataset. That's why it survives attackers who already know everything else about you — side information can't break a guarantee that never depended on secrecy of the other rows.
The definition and the Laplace mechanism
Call two datasets neighbors if they differ in exactly one person's record. A randomized algorithm M is ε-differentially private if, for every pair of neighboring datasets D and D', and every possible output S:
Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S]
Read it as: removing or adding one person multiplies the probability of any outcome by at most e^ε. When ε is small, e^ε ≈ 1 + ε, so ε = 0.1 means the output distribution shifts by at most ~10%. That bound holds simultaneously for every individual and every output — that's what makes it a worst-case guarantee, not an average-case one.
To achieve it for a numeric query f, you need its sensitivity — the most one person can change the answer:
Δf = max over neighbors D, D' of | f(D) − f(D') |
For a count, one person changes it by 1, so Δf = 1. For a sum of salaries capped at $500k, Δf = 500,000. The Laplace mechanism then releases:
M(D) = f(D) + Laplace(0, b), where b = Δf / ε
The Laplace distribution has density (1/2b)·exp(−|x|/b) and standard deviation b·√2. The proof that this satisfies the definition is two lines: the ratio of densities at any point, between the D and D' answers, is bounded by exp(|f(D)−f(D')|/b) ≤ exp(Δf/b) = e^ε. That exact cancellation — sensitivity over scale equals ε — is why Laplace is the canonical mechanism. For the Gaussian mechanism you instead use the L2 sensitivity and get a slightly weaker (ε, δ)-DP guarantee, which composes better over many queries.
When to reach for differential privacy
- Publishing aggregate statistics from sensitive data — census tables, hospital counts, salary distributions — where individual rows must stay protected even under linkage attacks.
- Telemetry from millions of devices — typing habits, crash reports, emoji frequencies — where you want population trends but should never see one user's data in the clear (use local DP).
- Training machine-learning models on private data, so the model can't memorize and regurgitate a training example (DP-SGD).
- Interactive query systems where analysts run many ad-hoc queries and you must bound total leakage with a privacy budget.
Skip it when the data is already public, when you can compute on encrypted data instead (homomorphic encryption protects the inputs, DP protects the outputs — they solve different problems), or when your population is so small that the noise needed to hide one person would swamp the signal. DP is a tool for the regime where n is large and the per-person contribution is bounded.
Differential privacy vs other privacy approaches
| Differential privacy | k-anonymity | Naive anonymization | Homomorphic encryption | Secure aggregation | |
|---|---|---|---|---|---|
| Protects | Output of a query | Rows in a release | Direct identifiers | Inputs during compute | Inputs during aggregation |
| Survives side information? | Yes — provable | No (linkage attacks) | No (87% re-identifiable) | Yes (inputs hidden) | Yes (inputs hidden) |
| Composes over many queries? | Yes — budget adds up | No formal model | No | N/A | N/A |
| Accuracy cost | Noise ∝ Δf/ε | Generalization loss | None (but insecure) | None (exact result) | None (exact sum) |
| Per-record overhead | ~0 (noise at release) | Bucketing | ~0 | 1000×+ slower compute | Crypto handshakes |
| Trust model | Central or local (none) | Trusted curator | Trusted curator | Untrusted server OK | Untrusted server OK |
| Real-world use | US Census 2020, Apple, Chrome | Early HIPAA de-ID | The AOL/Netflix breaches | Encrypted analytics | Federated learning |
The headline distinction: k-anonymity and naive anonymization try to make the data safe to release, and both have been broken by linkage attacks that combine the release with outside knowledge. Differential privacy gives up on that and instead makes a provable promise about the mechanism — which is the only kind of promise that holds against an attacker who already knows your ZIP code, birthday, and Netflix ratings.
What the numbers actually say
- 87% of Americans are uniquely identified by the triple (ZIP code, birth date, sex), per Latanya Sweeney's 2000 study — the result that detonated naive anonymization and motivated DP.
- Noise is cheap on big counts. At ε = 1, count sensitivity 1, Laplace noise has standard deviation
√2 ≈ 1.41. On a true count of 10,000,000 that's a relative error of 0.0000141% — invisible. On a true count of 5, the same ±1.41 is a 28% error. DP's cost lands almost entirely on small subgroups. - The 2020 US Census applied DP to all published tables, the first national statistical agency to do so, with a total budget of roughly ε ≈ 19.6 split across the redistricting (PL 94-171) statistics — a number the Bureau debated publicly for years.
- Apple ships local DP in iOS/macOS with per-datatype daily budgets reported around ε = 2 to 8 (e.g. ε = 4 for emoji, ε = 8 for some Safari telemetry), which privacy researchers criticized as too loose.
- Local DP costs about √n in accuracy. To estimate a population mean to the same error, local DP needs roughly n times more samples than central DP — for n = 1,000,000 users that's a √n ≈ 1000× error penalty per sample.
JavaScript implementation
The core is just "sample Laplace noise, add it." Sampling Laplace from a uniform: take U ∈ (−0.5, 0.5] and return −b·sign(U)·ln(1−2|U|) (the inverse-CDF method).
// Sample from Laplace(0, b) via inverse transform.
function sampleLaplace(b) {
const u = Math.random() - 0.5; // U in (-0.5, 0.5]
return -b * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}
// ε-DP count: sensitivity Δf = 1, so noise scale b = 1/ε.
function privateCount(rows, predicate, epsilon) {
const trueCount = rows.filter(predicate).length;
return trueCount + sampleLaplace(1 / epsilon);
}
// A budget tracker — composition theorem: ε values add.
class PrivacyBudget {
constructor(total) { this.total = total; this.spent = 0; }
spend(epsilon) {
if (this.spent + epsilon > this.total + 1e-9)
throw new Error(`budget exhausted: ${this.spent}+${epsilon} > ${this.total}`);
this.spent += epsilon;
return epsilon;
}
remaining() { return this.total - this.spent; }
}
// Usage: total budget ε = 1, answer two queries at ε = 0.5 each.
const budget = new PrivacyBudget(1.0);
const patients = [/* ... records ... */];
const a = privateCount(patients, p => p.diabetic, budget.spend(0.5));
const b = privateCount(patients, p => p.age > 65, budget.spend(0.5));
// A third 0.5 query would throw — the guarantee is gone once ε runs out.
Two things to flag. First, b = Δf / ε — for a count Δf is 1, but for a sum you must clamp each value to a known range first, or the sensitivity is unbounded and the guarantee is void. Second, the budget tracker enforces the composition theorem: you cannot answer infinitely many queries, because each leaks a little and the leaks add.
Python implementation: randomized response
The oldest DP mechanism predates the name by 40 years. Stanley Warner's 1965 randomized response lets pollsters ask "Have you committed crime X?" with provable deniability: flip a coin; if heads, answer truthfully; if tails, answer "yes." Any individual can claim the coin forced their answer — yet the population rate is recoverable by de-biasing.
import math, random
def randomized_response(true_bit: bool) -> bool:
"""Local DP: each respondent randomizes their own answer.
Coin 1 heads -> tell truth; tails -> coin 2 decides the reported bit."""
if random.random() < 0.5: # heads: truthful
return true_bit
return random.random() < 0.5 # tails: uniform random answer
# This mechanism is ln(3)-DP ≈ 1.0986-DP:
# P(report yes | yes) = 0.5 + 0.25 = 0.75
# P(report yes | no) = 0.25
# ratio = 0.75 / 0.25 = 3 = e^ε -> ε = ln 3
EPSILON = math.log(3)
def estimate_true_rate(responses: list[bool]) -> float:
"""De-bias: observed = 0.5*p + 0.25, so p = 2*observed - 0.5."""
observed = sum(responses) / len(responses)
return 2 * observed - 0.5
def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
"""Central DP: trusted curator adds one Laplace draw to the true answer."""
b = sensitivity / epsilon
u = random.random() - 0.5
noise = -b * math.copysign(1, u) * math.log(1 - 2 * abs(u))
return true_value + noise
# 100k people, true 'yes' rate 30%. Recover it from noisy local answers:
truth = [random.random() < 0.30 for _ in range(100_000)]
noisy = [randomized_response(b) for b in truth]
print(round(estimate_true_rate(noisy), 3)) # ~0.30, each answer still deniable
Randomized response is local DP — the noise is added on each person's device, so the pollster (or Apple, or Chrome) never sees a truthful individual answer. The price is the √n accuracy penalty: you need a large population before the de-biased estimate is tight, which is exactly why local DP is used for telemetry from hundreds of millions of devices and not for a hospital with 50,000 patients.
Variants worth knowing
(ε, δ)-DP (approximate DP). Relaxes the pure definition by allowing a tiny failure probability δ: Pr[M(D)∈S] ≤ e^ε·Pr[M(D')∈S] + δ. You set δ smaller than 1/n (so it's unlikely to ever catastrophically leak). This is what the Gaussian mechanism gives, and it composes far better over thousands of queries.
Rényi DP and zero-concentrated DP (zCDP). Tighter accounting frameworks for composition. Naive composition adds ε linearly; advanced and Rényi composition let total ε grow like √k over k queries instead of k, which is the difference between a usable and a useless budget when training a model for thousands of steps.
DP-SGD. Differentially private stochastic gradient descent: clip each example's gradient to bound sensitivity, add Gaussian noise to the batch gradient, and account the privacy cost with the moments accountant. It's how you train a neural net on private data without it memorizing individual records.
Local vs central DP. Central trusts a curator and adds noise once (low error). Local trusts no one and adds noise per device (≈√n higher error). The shuffle model sits between them: devices add modest local noise, then a trusted shuffler strips identities and breaks the link, recovering much of central DP's accuracy without a fully trusted aggregator.
The exponential mechanism. For non-numeric outputs (pick the best category, the best price, the best location), you can't just add noise. The exponential mechanism samples an output with probability proportional to exp(ε·utility / 2Δu), giving DP for "argmax"-style releases.
Common bugs and pitfalls
- Forgetting to clamp before summing. A sum or average has unbounded sensitivity unless you cap each value to a fixed range. One unclamped outlier (a billionaire's salary) makes Δf huge and either destroys accuracy or, if you used Δf = 1 by mistake, voids the guarantee entirely.
- Spending the budget without tracking it. Answering 100 queries at ε = 1 each is ε = 100 total, not ε = 1. Forgetting composition is the most common way to think you're private when you aren't.
- Floating-point leakage. Naive Laplace sampling has known attacks: the low-order bits of a float reveal whether noise was added, breaking the guarantee. Production libraries use the snapping mechanism or discrete (geometric/Gaussian) noise to close this hole.
- Post-processing confusion (the safe direction). Any function of a DP output is still DP — you can round, clip, or visualize freely. But re-querying the raw data to "correct" the noisy answer spends more budget; don't.
- Using DP on a tiny population. With n = 30, the noise needed to hide one person dwarfs the signal. DP is a large-n tool; below some threshold, suppression (don't publish small cells) is the honest choice.
- Treating ε as comparable across systems. ε = 1 with per-query accounting is not the same privacy as ε = 1 per day across all datatypes. Always ask: ε over what, and composed how?
Frequently asked questions
What does epsilon (ε) mean in differential privacy?
ε bounds how much one person's data can change the output distribution: the probability of any outcome differs by at most a factor of e^ε between the dataset with you and the dataset without you. Smaller ε means stronger privacy and more noise. ε around 0.1–1 is considered strong, ε up to ~10 is weak but still used; the US Census Bureau spent roughly ε ≈ 19.6 across all 2020 redistricting statistics.
What is the difference between local and central differential privacy?
Central DP adds noise once, at a trusted aggregator that already holds the raw data, so accuracy scales well. Local DP adds noise on each person's own device before anything is sent, so no one needs to be trusted — but the error grows by a factor of about √n, making local DP far noisier for the same ε. Apple, Google, and Microsoft use local DP for telemetry; the Census uses central DP.
How much noise does the Laplace mechanism add?
The Laplace mechanism adds noise drawn from Laplace(0, Δf/ε), where Δf is the query's sensitivity — the most one person can change the answer. For a simple count, Δf = 1, so you add Laplace noise with scale 1/ε. At ε = 1 that is a standard deviation of about 1.41 counts, which is negligible against a population of millions but enough to mask any single individual.
Why can't you just anonymize the data instead?
Removing names doesn't work: 87% of Americans are uniquely identified by ZIP code, birth date, and sex alone, and linkage attacks famously re-identified Netflix Prize users and a Massachusetts governor's medical records. Differential privacy is a guarantee about the algorithm, not the data, so it holds even against an attacker with unlimited side information.
What is the privacy budget and why does it run out?
Each query you answer leaks a little information, and under the composition theorem the ε values add up across queries. A fixed total budget (say ε = 1) means you can answer one query at ε = 1, or ten queries at ε = 0.1 each. Once the budget is spent the guarantee is gone — there is no way to un-leak — so practical systems track and ration ε like a finite resource.
Does differential privacy make the data useless?
No, for large populations. Because Laplace noise has a fixed standard deviation while true counts grow with n, the relative error vanishes: ±1.4 noise on a count of 10 million is a 0.00001% error. DP hurts most on small subgroups and rare categories, which is exactly where individual privacy is most at risk — so the accuracy cost lands where it should.