Machine Learning

The Kalman Filter

How a noisy sensor and a guess become a confident estimate

The Kalman filter is a recursive algorithm that fuses a noisy measurement with a motion model to estimate a hidden state in real time, optimally weighting each by its uncertainty via the Kalman gain — O(n³) per step in the state dimension.

Per-step costO(n³) in state dim n
State storedmean x̂ + covariance P
OptimalityMMSE for linear-Gaussian
MemoryO(n²) — no history kept
InventedRudolf Kálmán, 1960

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The intuition: two bad estimates make one good one

Suppose you are tracking a drone. Your GPS gives you a position, but it jitters by a few meters every reading. You also have a physics model — last frame the drone was here, moving this fast, so this frame it should be roughly there. The model drifts over time, but it is smooth. Neither source alone is trustworthy. The Kalman filter's whole idea is that if you know how wrong each source tends to be, you can blend them into an estimate that is better than either one.

Rudolf Kálmán published the algorithm in 1960. Within a decade it was running on the Apollo Guidance Computer, fusing inertial measurements with star sightings to navigate to the Moon — on a machine with about 2 KB of RAM. Today it sits underneath GPS, radar tracking, robot localization, battery state-of-charge estimation, and the sensor fusion in your phone that merges accelerometer and gyroscope into a stable orientation.

The filter keeps two things, and only two things, between steps: a best-guess state vector x̂ (position, velocity, whatever you are tracking) and a covariance matrix P that encodes how uncertain that guess is and how the uncertainties correlate. It never stores past measurements. Every cycle it does two moves: predict (push the state forward through the motion model, growing uncertainty) and update (pull the prediction toward a fresh measurement, shrinking uncertainty). That recursive predict–update loop is the entire algorithm.

The mechanism: predict, then update

Model the system as linear with Gaussian noise. The state evolves by xₖ = F·xₖ₋₁ + B·uₖ + wₖ and you observe zₖ = H·xₖ + vₖ, where w ~ N(0, Q) is process noise and v ~ N(0, R) is measurement noise. F is the state-transition matrix, B maps a control input u, and H maps state into measurement space.

Predict. Move the estimate forward and inflate the uncertainty by the process noise:

x̂ₖ⁻ = F x̂ₖ₋₁ + B uₖ           (project the state)
Pₖ⁻  = F Pₖ₋₁ Fᵀ + Q           (project the covariance, grow it by Q)

Update. A measurement zₖ arrives. Compute the innovation (how surprised you are), its covariance, the gain, and then correct:

yₖ = zₖ − H x̂ₖ⁻               (innovation: measurement minus prediction)
Sₖ = H Pₖ⁻ Hᵀ + R             (innovation covariance)
Kₖ = Pₖ⁻ Hᵀ Sₖ⁻¹             (Kalman gain — the optimal blend factor)
x̂ₖ = x̂ₖ⁻ + Kₖ yₖ            (corrected state)
Pₖ = (I − Kₖ H) Pₖ⁻          (shrunk covariance)

The Kalman gain K is the heart of it. Read the gain as a ratio: K = P⁻Hᵀ / (HP⁻Hᵀ + R). When measurement noise R is huge, the denominator dominates, K → 0, and the correction term Ky vanishes — you ignore the sensor and keep your prediction. When R is tiny and the prediction uncertainty P⁻ is large, K → H⁻¹ and you snap the estimate onto the measurement. The filter computes this trade-off optimally, per dimension, every step.

Complexity. For a state of dimension n and a measurement of dimension m, the cost is dominated by the matrix products and the m×m inverse in S⁻¹: O(n³ + m³) arithmetic per step, with O(n²) memory. Critically there is no dependence on the number of past steps — the filter is constant-memory and constant-time-per-step, which is exactly why it runs in real time on tiny hardware.

When to reach for a Kalman filter

Your system is roughly linear and your noise is roughly Gaussian. Constant-velocity or constant-acceleration motion, RC-circuit dynamics, anything well-described by a linear difference equation.
You need real-time, streaming estimates. The filter processes one measurement at a time and never revisits the past, so latency is bounded and predictable.
You are fusing multiple sensors. GPS + IMU, lidar + wheel odometry, multiple thermometers. Each sensor is just another H and R; you run an update per sensor.
You want a calibrated uncertainty, not just a point estimate. P tells downstream code how much to trust the estimate — invaluable for gating, data association, and decision-making.

Reach for something else when the dynamics are strongly nonlinear (EKF/UKF), the posterior is multimodal — "the robot is in one of three rooms" (particle filter), or you can afford to wait and process the whole batch offline for a smoother estimate (a Kalman smoother, or full batch least squares).

Kalman vs other state estimators

	Kalman (KF)	Extended (EKF)	Unscented (UKF)	Particle filter	Complementary filter	Batch least squares
System type	Linear	Nonlinear (linearized)	Nonlinear	Any	Linear, hand-tuned	Any (offline)
Belief shape	Single Gaussian	Single Gaussian	Single Gaussian	Arbitrary (samples)	None (no covariance)	Gaussian over batch
Cost per step	O(n³)	O(n³) + Jacobian	O(n³), ~2n+1 sigma pts	O(N·n), N = particles	O(n)	O((nT)³) one-shot
Optimality	MMSE (provably)	1st-order approx	2nd-order approx	Asymptotic as N→∞	None	MMSE over full window
Multimodal?	No	No	No	Yes	No	No
Real-time?	Yes	Yes	Yes	Yes (if N small)	Yes (cheapest)	No
Typical use	Radar, GPS tracking	SLAM, navigation	Attitude, robotics	Localization, vision	Drone IMU fusion	Survey, calibration

The headline trade-off: the plain Kalman filter is the cheapest provably optimal estimator, but only for linear-Gaussian systems. Every other row buys generality — nonlinearity, multimodality, or offline accuracy — by giving up either the closed-form optimality or the constant per-step budget.

What the numbers actually say

One step is a handful of matrix multiplies. For a 6-state tracker (3D position + 3D velocity) with a 3D position measurement, predict-update is on the order of a few thousand floating-point operations — under a microsecond on a modern CPU, and entirely feasible at the 1 kHz update rates used in IMU fusion.
Memory is fixed at O(n²). A 9-state filter stores a 9×9 covariance = 81 doubles ≈ 648 bytes, regardless of whether you have processed 10 measurements or 10 billion. Contrast that with a particle filter tracking the same belief at 10,000 particles: 9 × 10,000 = 90,000 doubles, plus weights.
The Apollo Guidance Computer ran it in ~2 KB of erasable (read/write) memory on a ~2 MHz clock, navigating the spacecraft by fusing inertial integration with crew star sightings.
Variance reduction is multiplicative, not additive. Fusing two independent measurements of variances σ₁² and σ₂² yields a fused variance of σ₁²σ₂²/(σ₁²+σ₂²) — strictly smaller than the smaller of the two. Two sensors each with 1 m std-dev fuse to 0.71 m; ten of them to 0.32 m.

JavaScript implementation

A scalar (1-D) Kalman filter is the cleanest way to see the predict–update cycle without matrix bookkeeping. This tracks a value drifting under constant-velocity-ish dynamics from a noisy sensor:

class ScalarKalman {
  // x: state estimate, p: estimate variance
  // q: process noise variance, r: measurement noise variance
  constructor({ x = 0, p = 1, q = 1e-3, r = 0.1 } = {}) {
    this.x = x; this.p = p; this.q = q; this.r = r;
  }

  // Predict: state assumed roughly constant (F = 1), so just grow uncertainty.
  predict() {
    // x stays the same; variance inflates by process noise
    this.p = this.p + this.q;
    return this.x;
  }

  // Update: fold in a measurement z.
  update(z) {
    const k = this.p / (this.p + this.r);  // Kalman gain in [0, 1]
    this.x = this.x + k * (z - this.x);    // blend toward measurement
    this.p = (1 - k) * this.p;             // shrink uncertainty
    return this.x;
  }

  step(z) { this.predict(); return this.update(z); }
}

// Track a true value of 10 hidden behind noisy ±2 readings:
const kf = new ScalarKalman({ x: 0, p: 1, q: 1e-3, r: 4 });
const readings = [12.1, 8.4, 11.7, 9.2, 10.8, 9.9, 10.3];
for (const z of readings) console.log(kf.step(z).toFixed(2));
// Output converges toward ~10 and barely wobbles, despite ±2 noise.

Notice the gain k = p/(p+r) is exactly the optimal-blend ratio from the math section, collapsed to one dimension. Early on p is large so k is near 1 and the filter leans on measurements; as p shrinks, k drops and the filter starts smoothing.

Python implementation — the full matrix filter

The real, multivariable filter in NumPy. This is the canonical constant-velocity tracker: state is [position, velocity], and we only ever measure position.

import numpy as np

class KalmanFilter:
    def __init__(self, F, H, Q, R, x0, P0):
        self.F, self.H, self.Q, self.R = F, H, Q, R
        self.x, self.P = x0, P0
        self.I = np.eye(F.shape[0])

    def predict(self, u=None, B=None):
        self.x = self.F @ self.x + (B @ u if u is not None else 0)
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x

    def update(self, z):
        y = z - self.H @ self.x                       # innovation
        S = self.H @ self.P @ self.H.T + self.R       # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + K @ y
        # Joseph form — stays symmetric & positive-definite under round-off
        A = self.I - K @ self.H
        self.P = A @ self.P @ A.T + K @ self.R @ K.T
        return self.x

# Constant-velocity model, dt = 1 second
dt = 1.0
F = np.array([[1, dt],
              [0,  1]])
H = np.array([[1.0, 0.0]])          # we observe position only
Q = np.array([[0.05, 0.0],
              [0.0,  0.05]])         # small process noise
R = np.array([[4.0]])               # measurement variance (±2 std-dev)
x0 = np.array([0.0, 1.0])           # start at pos 0, velocity 1
P0 = np.eye(2) * 500.0              # large initial uncertainty

kf = KalmanFilter(F, H, Q, R, x0, P0)

true_pos = 0.0
for t in range(8):
    true_pos += 1.0                                  # truth moves at v = 1
    z = np.array([true_pos + np.random.randn() * 2]) # noisy sensor
    kf.predict()
    est = kf.update(z)
    print(f"t={t}  meas={z[0]:6.2f}  est_pos={est[0]:6.2f}  est_vel={est[1]:4.2f}")
# The filter recovers velocity ≈ 1 even though velocity is never measured —
# it infers it from the position trend. That's the magic of the covariance
# coupling position and velocity off the diagonal of P.

The standout result: we never measure velocity, yet the filter estimates it. The off-diagonal terms of P couple position and velocity, so each position correction also nudges the velocity estimate. This is why a Kalman filter can recover hidden, unobserved state dimensions — the famous reason it outperforms naive averaging.

Variants worth knowing

Extended Kalman Filter (EKF). For nonlinear models x = f(x) and z = h(x), linearize each step by taking the Jacobians F = ∂f/∂x and H = ∂h/∂x at the current estimate, then run the ordinary update. The workhorse of GPS receivers and early SLAM. Diverges if the linearization point is far from the truth.

Unscented Kalman Filter (UKF). Instead of a Taylor approximation, deterministically sample 2n+1 "sigma points" around the mean, push each through the true nonlinear function, and recover a Gaussian from the transformed cloud. Captures the distribution to second order with no Jacobians, and is usually more accurate and more robust than the EKF for strong nonlinearities.

Square-root and Joseph-form filters. Numerically stable reformulations that propagate a Cholesky factor of P (square-root) or use the symmetric update (Joseph) so the covariance never goes non-positive-definite under floating-point round-off.

Information filter. Propagate the inverse covariance (the information matrix) instead of P. Fusing many sensors becomes a simple sum of information contributions, which is why it underlies distributed and decentralized estimation.

Kalman smoother (RTS). A two-pass, offline variant: run the filter forward, then a Rauch–Tung–Striebel backward pass that corrects each estimate using future measurements. Strictly more accurate than the filter, but not real-time.

Common bugs and edge cases

Covariance loses positive-definiteness. The naive P = (I − KH)P drifts asymmetric and can go negative under round-off, especially with very small R. Use the Joseph form or a square-root filter.
Initializing P too small. If you start over-confident (tiny P0), the gain stays near zero and the filter ignores measurements — it never converges to the truth. Start P0 large and let it shrink.
Wrong units / mismatched dt. F encodes the timestep. If measurements arrive irregularly you must rebuild F and Q with the actual elapsed dt each step, not a fixed constant.
Treating Q as a free knob with no meaning. Too small and the filter lags real maneuvers (the estimate trails the truth); too large and it chases sensor noise. Validate with the normalized innovation squared (NIS) against its chi-squared band.
Forgetting angle wraparound. Innovations on headings or angles must be wrapped to (−π, π]. A raw subtraction near ±π produces a ~2π innovation that yanks the estimate the wrong way.
Outliers with no gating. A single bad measurement (multipath GPS, a dropped lidar return) gets fully trusted unless you reject it — gate on the Mahalanobis distance yᵀS⁻¹y before applying the update.
Using a linear KF on a nonlinear system. If H or F aren't truly linear, the optimality guarantee is gone and the filter can quietly bias. Switch to the EKF or UKF.

Frequently asked questions

What does the Kalman gain actually do?

The Kalman gain K is the optimal blend factor between your prediction and your measurement. It ranges from 0 to 1 (per dimension): when the sensor is very noisy relative to the prediction, K shrinks toward 0 and you trust the model; when the prediction is uncertain relative to the sensor, K grows toward 1 and you trust the measurement. The filter computes K from the covariances so the fused estimate has the smallest possible variance.

Why is the Kalman filter called optimal?

For a linear system with Gaussian, zero-mean, white noise, the Kalman filter is the minimum-mean-square-error estimator — no other estimator, linear or nonlinear, produces a state estimate with lower expected squared error. If you drop the Gaussian assumption but keep linearity, it is still the best linear unbiased estimator (BLUE). When the system is nonlinear, that optimality guarantee disappears and you fall back to the EKF or UKF.

What is the difference between the Kalman filter and a particle filter?

The Kalman filter represents the state distribution with a single Gaussian (a mean vector and a covariance matrix) and updates it in closed form. A particle filter represents the distribution with thousands of weighted samples, so it can model multimodal, non-Gaussian beliefs at the cost of far more compute. Use Kalman when the system is roughly linear-Gaussian; use a particle filter when the posterior has multiple peaks the Gaussian can't capture.

When should I use an Extended Kalman Filter instead of the plain one?

Use the Extended Kalman Filter (EKF) when your motion model or measurement model is nonlinear — for example a GPS range that depends on the square root of position differences. The EKF linearizes those functions with a Jacobian at the current estimate each step. If the nonlinearity is strong, the Unscented Kalman Filter (UKF) is usually more accurate because it propagates sample sigma points instead of a first-order Taylor approximation.

Why does the covariance matrix sometimes blow up or go non-positive-definite?

The naive update P = (I − KH)P loses symmetry and positive-definiteness to floating-point round-off, especially with very accurate sensors. The fix is the Joseph form, P = (I − KH)P(I − KH)ᵀ + KRKᵀ, which stays symmetric and positive semi-definite by construction, or a square-root filter that propagates a Cholesky factor instead of P itself.

How do I tune the process and measurement noise Q and R?

R comes from the sensor: it is the variance of your measurement, usually from the datasheet or empirical calibration. Q is the harder one — it captures how much you trust the motion model. Too small a Q makes the filter sluggish and slow to react to maneuvers; too large makes it jittery and over-trusting of noisy measurements. A common practical tactic is to scale Q until the normalized innovation squared (NIS) sits inside its chi-squared confidence band.