Computer Architecture

Floating-Point (IEEE 754)

Sign, exponent, mantissa — and why 0.1 + 0.2 isn't 0.3

IEEE 754 floating-point stores a real number as a sign bit, a biased exponent, and a fraction (mantissa), encoding value = ±1.f × 2^(e−bias) — which is why 0.1 + 0.2 lands at 0.30000000000000004 instead of 0.3.

Double layout1 + 11 + 52 bits
Significant digits (double)≈ 15–17 decimal
Exponent bias1023 (double) / 127 (float)
Machine epsilon2^−52 ≈ 2.2e−16
Representable doubles≈ 1.8 × 10^19

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How IEEE 754 packs a real number into bits

A computer has a fixed number of bits and an infinite number of reals to represent, so something has to give. IEEE 754 — the 1985 standard that almost every CPU, GPU, and language implements — chooses scientific notation in binary. Every finite number is stored as three fields laid out left to right: a sign, an exponent, and a fraction (also called the mantissa or significand).

The value of a normal number is:

value = (−1)^sign × 1.fraction × 2^(exponent − bias)

For a 64-bit double the split is 1 sign bit, 11 exponent bits, and 52 fraction bits. For a 32-bit float it's 1 + 8 + 23. Three ideas make this work:

The implicit leading 1. In binary scientific notation the significand always starts with 1 (you normalize until it does), so that bit is never stored — you get 53 bits of precision out of 52 stored bits, for free.
The biased exponent. Rather than a separate sign for the exponent, IEEE 754 stores exponent + bias as an unsigned integer. A double's bias is 1023, so a stored field of 1023 means 2^0, 1024 means 2^1, 1022 means 2^−1. The bias also makes the bit pattern of positive floats sort in the same order as signed integers — handy for fast comparisons.
Reserved exponents. An all-zero exponent means subnormal (or zero); an all-ones exponent means Infinity (zero fraction) or NaN (nonzero fraction). That carves out the special values from the normal range.

So 0.15625 becomes 1.25 × 2^−3 → sign 0, fraction .01, exponent field 127 − 3 = 124 for a float. Clean. The trouble starts when the number you want isn't a sum of a handful of powers of two.

Why 0.1 + 0.2 ≠ 0.3

This is the most-searched fact about floating point, and the answer is purely about base. In decimal, 1/3 is the repeating fraction 0.3333…; you can't write it exactly with a finite number of digits. In binary, 1/10 is the repeating fraction 0.0001100110011… — also infinite. So 0.1 can't be stored exactly in any binary float; it gets rounded to the nearest 53-bit value, which is about 0.1000000000000000055511151231257827021181583404541015625.

Both 0.1 and 0.2 are stored a hair too large. Add the two rounded values, round the result to fit, and you land on 0.30000000000000004 — the representable double closest to the true sum, but not the representable double closest to 0.3. The numbers were never wrong; they were never 0.1 and 0.2 to begin with.

0.1 + 0.2            // 0.30000000000000004
0.1 + 0.2 === 0.3    // false
(0.1 + 0.2).toFixed(17)   // "0.30000000000000004"
0.3.toFixed(17)           // "0.29999999999999999"  ← different stored value

The same effect explains why 0.1 * 3 !== 0.3, why a running total of money in floats drifts by pennies, and why for (let f = 0.0; f !== 1.0; f += 0.1) is an infinite loop — f steps right past 1.0 to 0.9999999999999999.

When to use floats — and when not to

Use floating point for physics, graphics, signal processing, statistics, machine learning — anything where inputs are already approximate and a relative error of 10^−16 is invisible.
Use double, not float, by default. A 32-bit float carries only ~7 decimal digits; intermediate sums in a long calculation erode that fast. Doubles cost twice the memory but rarely twice the time on a modern CPU.
Do NOT use floats for money. Use integer cents, a fixed-point type, or a decimal library. A bank that adds a million float transactions can be off by dollars.
Do NOT use == on computed floats. Compare with a tolerance.
Watch catastrophic cancellation — subtracting two nearly-equal large numbers throws away most of the significant bits. (1e16 + 1) − 1e16 gives 0, not 1.

float vs double vs other number formats

	float (binary32)	double (binary64)	half (binary16)	bfloat16	Decimal (e.g. decimal128)	Integer / fixed-point
Total bits	32	64	16	16	128	varies (32/64)
Sign / exp / mantissa	1 / 8 / 23	1 / 11 / 52	1 / 5 / 10	1 / 8 / 7	1 / decimal exp / 34 digits	n/a
Significant digits	~7	~15–17	~3	~2–3	~34	exact
Largest magnitude	~3.4 × 10^38	~1.8 × 10^308	65504	~3.4 × 10^38	~10^6144	bounded by width
Machine epsilon	2^−23 ≈ 1.2e−7	2^−52 ≈ 2.2e−16	2^−10 ≈ 9.8e−4	2^−7 ≈ 7.8e−3	10^−33	0 (exact)
0.1 + 0.2 exact?	No	No	No	No	Yes	n/a
Typical use	GPU, audio, games	Default scalar math	ML inference, GPUs	ML training (wide range)	Money, accounting	Counts, money in cents

The headline trade-off is range versus precision versus base. Float and double trade precision for memory; bfloat16 keeps double's exponent range but slashes the mantissa so neural nets don't overflow; decimal formats fix the 0.1 problem by being base-10 at the cost of slower hardware and more bits. Integers and fixed-point are exact but can't span 10^308.

What the numbers actually say

A double resolves ~15.95 decimal digits. With 53 bits of significand, log₁₀(2^53) ≈ 15.95 — so 15 digits always round-trip, 17 digits are needed to print a double uniquely, and the safe assumption is "about 15 significant figures."
The relative rounding error of any single operation is ≤ ½ epsilon = 2^−53 ≈ 1.1 × 10^−16. Errors don't usually accumulate to the worst case, but a sum of n terms can drift by up to about n × epsilon in naive summation.
Kahan summation cuts that drift to O(epsilon), independent of n. Adding 10 million floats naively can lose 3–4 digits; Kahan's compensated sum keeps full precision for a couple of extra additions per element.
Denormals can be 10–100× slower. Many CPUs trap to microcode for subnormal operands; an audio reverb tail that decays into denormals can suddenly drop a real-time thread, which is why DAWs set the flush-to-zero (FTZ) and denormals-are-zero (DAZ) flags.
There are exactly 2^64 − 2^53 finite-ish doubles after removing the NaN payloads — about 1.8 × 10^19 distinct values, spread non-uniformly: half of all doubles lie between −2 and 2, because spacing doubles with every power of two.

JavaScript implementation

JavaScript numbers are IEEE 754 doubles, so you can dissect one directly with typed arrays. Here's a decoder that pulls out the sign, the biased exponent, and the mantissa, plus a tolerance-based comparison you should reach for instead of ==.

function decodeDouble(x) {
  const buf = new ArrayBuffer(8);
  new Float64Array(buf)[0] = x;
  const bits = new BigUint64Array(buf)[0];     // raw 64 bits
  const sign = Number(bits >> 63n) & 1;
  const expRaw = Number((bits >> 52n) & 0x7ffn); // 11 exponent bits
  const frac = bits & 0xfffffffffffffn;          // 52 fraction bits

  let kind = 'normal', exp = expRaw - 1023;
  if (expRaw === 0)        kind = frac === 0n ? 'zero' : 'subnormal';
  else if (expRaw === 0x7ff) kind = frac === 0n ? 'infinity' : 'NaN';

  return {
    sign,
    storedExponent: expRaw,
    unbiasedExponent: kind === 'normal' ? exp : (kind === 'subnormal' ? -1022 : null),
    fractionBits: frac.toString(2).padStart(52, '0'),
    kind,
  };
}

decodeDouble(0.1).fractionBits;
// "1001100110011001100110011001100110011001100110011010"  ← repeating, rounded

// Compare floats safely — never `a === b` on computed values.
function nearlyEqual(a, b, relTol = 1e-9, absTol = 1e-12) {
  if (a === b) return true;                    // exact / both Infinity
  const diff = Math.abs(a - b);
  return diff <= Math.max(absTol, relTol * Math.max(Math.abs(a), Math.abs(b)));
}

nearlyEqual(0.1 + 0.2, 0.3);   // true

Two notes. First, Number.EPSILON in JS is exactly machine epsilon (2^−52); it's the right scale only near 1.0, which is why a robust comparison multiplies the tolerance by the operands' magnitude. Second, Number.isInteger works up to 2^53 — beyond Number.MAX_SAFE_INTEGER consecutive integers stop being representable and 2^53 + 1 rounds to 2^53.

Python implementation

Python's float is also a double. The standard library gives you exact decomposition without bit-twiddling, and math.isclose is the canonical tolerant comparison.

import struct, math

def decode_double(x: float) -> dict:
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]  # 64-bit big-endian
    sign = (bits >> 63) & 1
    exp_raw = (bits >> 52) & 0x7ff                       # 11 exponent bits
    frac = bits & ((1 << 52) - 1)                        # 52 fraction bits

    if exp_raw == 0:
        kind = 'zero' if frac == 0 else 'subnormal'
    elif exp_raw == 0x7ff:
        kind = 'infinity' if frac == 0 else 'NaN'
    else:
        kind = 'normal'

    return {
        'sign': sign,
        'stored_exponent': exp_raw,
        'unbiased_exponent': exp_raw - 1023 if kind == 'normal' else None,
        'fraction_bits': format(frac, '052b'),
        'kind': kind,
    }

# The exact rational value Python actually stored for 0.1:
from fractions import Fraction
print(Fraction(0.1))   # 3602879701896397/36028797018963968  (≠ 1/10)
print(0.1 + 0.2)       # 0.30000000000000004

# Safe comparison — relative + absolute tolerance, like nearlyEqual above.
math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9, abs_tol=1e-12)   # True

# Kahan compensated summation — kills naive drift over many adds.
def kahan_sum(values):
    total = 0.0
    comp = 0.0                     # running compensation for lost low-order bits
    for v in values:
        y = v - comp
        t = total + y
        comp = (t - total) - y     # recovers what the add just dropped
        total = t
    return total

data = [0.1] * 10_000_000
naive = 0.0
for v in data: naive += v
print(naive)            # 999999.9998389754   ← naive drift
print(kahan_sum(data))  # 1000000.0  ← compensated, recovers full precision
# (CPython's built-in sum() also got float compensated summation in 3.12, so sum(data) likewise prints 1000000.0)

Note the asymmetry: struct.pack('>d', x) gives you the IEEE 754 byte layout, but the byte order on the wire is your choice — the format character > forces big-endian so the sign bit is the top bit of the first byte regardless of your CPU. And Fraction(0.1) is the killer demo: it prints the exact dyadic rational the machine stored, proving the stored value is not 1/10.

Variants and special values worth knowing

Subnormals (denormals). With the exponent field all zeros, the implicit leading 1 vanishes and the value is ±0.fraction × 2^(1−bias). These fill the gap between the smallest normal number and zero so underflow is gradual, not a cliff — but they cost performance, hence FTZ/DAZ.

±Infinity. Exponent all ones, fraction zero. Produced by overflow or 1.0/0.0. Infinity arithmetic mostly does the sensible thing: Inf + 1 == Inf, 1/Inf == 0.

NaN. Exponent all ones, fraction nonzero. 0.0/0.0, Inf − Inf, and sqrt(−1) all yield NaN. Its defining property: NaN != NaN, which is the standard idiom for detecting it (x !== x). There are quiet NaNs and signaling NaNs, distinguished by the top fraction bit; the rest of the 51 bits are a payload most code ignores.

Signed zero. +0.0 and −0.0 are distinct bit patterns that compare equal, but 1/−0.0 is −Infinity. Matters for complex branch cuts and conventions like "this counter underflowed from below."

bfloat16 and FP8. Machine-learning hardware introduced 16- and 8-bit formats that keep a wide exponent (so gradients don't overflow) at the cost of a tiny mantissa. They follow the same sign/exponent/mantissa structure — just with the dials turned way down.

Rounding modes. The default is round-to-nearest, ties-to-even (also called banker's rounding), which avoids the upward bias of always-round-half-up. IEEE 754 also defines round-toward-zero, round-up, and round-down, used by interval arithmetic to bound error rigorously.

Common bugs and edge cases

Using == on computed floats. 0.1 + 0.2 == 0.3 is false. Compare with a relative-plus-absolute tolerance.
Looping by float increment. for (f = 0; f != 1; f += 0.1) never hits exactly 1.0. Loop with an integer counter and multiply.
Storing money in floats. Cents accumulate rounding error; a million transactions can be off by dollars. Use integer minor units or a decimal type.
Forgetting NaN poisons comparisons. NaN < 0, NaN > 0, and NaN == NaN are all false, so a sort or a min/max can silently misbehave when a NaN sneaks in. Test with x !== x.
Catastrophic cancellation. Subtracting nearly equal values destroys precision. Rearrange the algebra — e.g. use (−b ± sqrt(...)) / 2a's numerically stable form, or compute log1p(x) instead of log(1 + x) for small x.
Assuming addition is associative. (a + b) + c can differ from a + (b + c), so parallel reductions and compiler reordering (-ffast-math) can change results. If reproducibility matters, fix the summation order.
Trusting 32-bit floats for big integers. A float can't represent every integer past 2^24; a double stops at 2^53. IDs and counters above those break silently.

Frequently asked questions

Why is 0.1 + 0.2 not exactly 0.3?

Neither 0.1 nor 0.2 has an exact binary representation — both are infinitely repeating fractions in base 2, rounded to 53 significant bits. Their stored values are each slightly off, and the rounded sum lands on 0.30000000000000004, the nearest representable double to the true 0.3, which itself isn't exactly representable either.

What do the sign, exponent, and mantissa actually mean?

The sign bit is 0 for positive, 1 for negative. The exponent is stored with a bias (127 for float, 1023 for double) so it can encode negative powers without a separate sign. The mantissa (fraction) holds the bits after an implicit leading 1, so the value is ±1.fraction × 2^(exponent − bias).

What is machine epsilon for a double?

Machine epsilon is the gap between 1.0 and the next representable number: 2^−52 ≈ 2.22 × 10^−16 for a 64-bit double, and 2^−23 ≈ 1.19 × 10^−7 for a 32-bit float. It sets the relative rounding error of every operation — about 15–17 significant decimal digits for double.

Why are there two zeros, +0.0 and −0.0?

The sign bit is independent of the rest of the encoding, so a zero mantissa and zero exponent can carry either sign. +0.0 and −0.0 compare equal under ==, but 1/+0.0 is +Infinity while 1/−0.0 is −Infinity, which matters for branch cuts and signed underflow.

What are denormal (subnormal) numbers?

When the exponent field is all zeros, the implicit leading 1 is dropped and the value becomes ±0.fraction × 2^(1−bias). These subnormals fill the gap between the smallest normal number and zero, giving gradual underflow — but on many CPUs they run 10–100× slower, which is why audio and physics code often enables flush-to-zero.

Should I ever compare two floats with ==?

Almost never for computed values. Use an absolute-or-relative tolerance instead: abs(a − b) <= max(absTol, relTol × max(abs(a), abs(b))). Exact == is only safe for sentinel values you stored yourself (like exactly 0.0) or for integer-valued floats below 2^53.