Machine Learning

Model Quantization

Q: How much smaller and faster does quantization make a model?

Going from FP32 to INT8 cuts the weight footprint 4× (32 bits → 8 bits); FP32 to INT4 cuts it 8×. A 7-billion-parameter model is about 28 GB in FP32, 14 GB in FP16, 7 GB in INT8, and roughly 3.5–4 GB in INT4. Speed gains depend on whether the operation is memory-bound — LLM token generation often runs 2–4× faster in INT4 because it is bottlenecked on reading weights from memory, not on arithmetic.

Q: What is the difference between post-training quantization and quantization-aware training?

Post-training quantization (PTQ) takes an already-trained FP32 model and converts its weights to integers using a small calibration set — minutes of work, no labels, no gradients. Quantization-aware training (QAT) simulates the rounding error during training (or fine-tuning) so the network learns weights that survive being rounded. QAT costs GPU-hours but typically recovers most of the accuracy lost at INT4 and below, where PTQ alone degrades.

Q: How does dequantization recover a float from an integer?

Quantization maps a float x to an integer q = round(x / s) + z, where s is the scale and z is the zero-point. Dequantization reverses it: x ≈ s · (q − z). The result is an approximation, not the original value — the error is bounded by half a quantization step, s/2. Choosing a good scale s from the tensor's value range is what determines accuracy.

Q: Why does quantizing activations hurt more than quantizing weights?

Weights are fixed after training and have a fairly compact, predictable distribution, so a single scale covers them well. Activations change with every input and large transformers develop a handful of huge outlier activation channels — values 100× the rest. One global scale then either clips the outliers or starves the normal values of precision. Methods like LLM.int8(), SmoothQuant, and AWQ exist specifically to handle these activation outliers.

Q: What is the difference between symmetric and asymmetric quantization?

Symmetric quantization fixes the zero-point at 0 and maps the range [−max, +max], which is cheap because the integer 0 represents the float 0 exactly and matrix-multiply accumulation needs no zero-point correction. Asymmetric quantization uses the true [min, max] with a nonzero zero-point, which wastes no codes and suits one-sided distributions like post-ReLU activations, at the cost of extra cross-terms in the matmul.

Q: Is INT4 quantization good enough for production LLMs?

For inference, yes, in many cases. Modern 4-bit schemes like GPTQ and AWQ with group-wise scales (one scale per 64–128 weights) typically lose only about 0.1–0.5 perplexity on large models, which is barely perceptible in generated text. Below 4 bits — 3-bit, 2-bit, or 1-bit (BitNet-style) — accuracy degrades sharply unless the model was trained for it from scratch.

Trade a few decimal places for a model that fits on your phone

Model quantization stores and computes a neural network's weights in 8- or 4-bit integers instead of 32-bit floats, shrinking the model 4–8× and speeding up inference with little accuracy loss.

FP32 → INT8 size4× smaller
FP32 → INT4 size8× smaller
Mapq = round(x / s) + z
Round error≤ s / 2
INT4 LLM perplexity loss≈ 0.1–0.5

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The intuition: a ruler with fewer marks

A trained neural network is a giant pile of numbers — its weights. By default each weight is a 32-bit floating-point value (FP32), which can represent any magnitude from roughly 10⁻³⁸ to 10³⁸ with about 7 decimal digits of precision. That precision is overkill for a network whose weights almost all sit between −1 and +1. You are paying 32 bits to store a number you could pin down with far fewer.

Quantization swaps that fine-grained float ruler for a coarse integer one. Pick the range a tensor actually uses — say weights in [−0.6, +0.6] — and slice it into a fixed number of evenly spaced bins. INT8 gives you 256 bins (2⁸), INT4 gives you just 16 (2⁴). Every weight snaps to the nearest bin and is stored as a small integer index. To use a weight, you multiply that index back by the bin width. The model gets smaller because indices are tiny; it gets faster because integer math and the reduced memory traffic are cheaper than wide-float math.

The whole bet is that a neural network is robust to a little rounding. Drop each weight to the nearest of 256 buckets and the network's output barely moves, because the errors are small, roughly zero-mean, and partly cancel across millions of multiply-accumulates. That robustness is what makes quantization a near-free lunch — until you push the bins too coarse.

The mechanism: scale, zero-point, round

The core of quantization is an affine map between a float x and an integer q:

quantize:    q = round(x / s) + z          (then clamp to [q_min, q_max])
dequantize:  x ≈ s · (q − z)

Two parameters define the map:

Scale s — the width of one bin, a positive float. For b-bit symmetric quantization of a tensor whose maximum absolute value is α, the scale is s = α / (2^(b−1) − 1). For INT8 that denominator is 127.
Zero-point z — the integer that represents the float 0.0. In symmetric quantization z = 0; in asymmetric quantization z = round(−x_min / s) so the integer grid lines up with the true [x_min, x_max] range.

The error from this round-trip is bounded by half a bin: |x − s·(q−z)| ≤ s/2. Make the bins narrower (more bits, or a tighter range) and the error shrinks. The art of quantization is choosing the range — too wide and you waste codes on values that never occur; too tight and you clip the tails, which is usually worse.

The payoff shows up in matrix multiply, the operation that dominates a network. With symmetric INT8 weights and INT8 activations, a dot product becomes an integer accumulation that hardware runs cheaply, and the float scales factor out of the whole sum:

y = Σ  w_i · x_i
  ≈ Σ (s_w · q_w,i)(s_x · q_x,i)
  = s_w · s_x · Σ (q_w,i · q_x,i)     ← one INT32 accumulator, rescaled once

That is why symmetric quantization is preferred for weights: with z = 0 there are no extra cross-terms, just a clean integer sum scaled at the end.

Granularity: per-tensor, per-channel, per-group

One scale for an entire weight matrix (per-tensor) is cheapest but blunt — a single outlier column forces a coarse scale on everyone. Per-channel quantization gives each output channel (row) its own scale, costing one extra float per row but dramatically improving accuracy. Group-wise quantization goes finer still: one scale per contiguous block of 64 or 128 weights. Group size 128 at 4 bits is the workhorse behind GPTQ and AWQ — a 16-bit scale shared by 128 four-bit weights adds only 16 / 128 = 0.125 bits per weight (about 16 / 512 ≈ 3% storage overhead) yet recovers nearly all the accuracy.

When to quantize — and when not to

Inference on memory-bound hardware. Phones, edge devices, and even data-center LLM serving are usually bottlenecked on memory bandwidth, not FLOPs. Shrinking weights 4–8× cuts the bytes you must read per token, so latency drops even when the math itself isn't faster.
Fitting a model where it didn't fit. A 70B-parameter model is ~140 GB in FP16 (two A100-80GB cards); in INT4 it's ~35 GB and fits on a single card. This is the killer use case for local LLMs.
Batch-1, latency-sensitive serving. Generating one token at a time reads the whole weight matrix to produce a single vector — pure memory traffic, exactly what quantization helps.

Avoid quantizing when you're training (gradients need float dynamic range — though FP16/BF16 mixed precision is standard), when the workload is compute-bound with huge batches (the matmul, not memory, is the limit), or when the model is tiny and accuracy-critical and the few megabytes saved aren't worth any risk. Also be wary of quantizing the embedding and final output layers of an LLM too aggressively — they're often kept at higher precision.

Precision formats compared

	FP32	FP16 / BF16	INT8	INT4 (group-wise)	INT4 (per-tensor)	FP8 (E4M3)
Bits per weight	32	16	8	~4.5 (incl. scales)	4	8
Size vs FP32	1×	2× smaller	4× smaller	~7× smaller	8× smaller	4× smaller
7B model footprint	28 GB	14 GB	7 GB	~4 GB	3.5 GB	7 GB
Typical accuracy loss	none (baseline)	negligible	< 1%	small (≈0.1–0.5 ppl)	noticeable	negligible
Needs calibration?	no	no	yes (PTQ)	yes (GPTQ/AWQ)	yes	usually no
Hardware INT/FP path	FP	FP (Tensor Cores)	INT (DP4A, INT8 cores)	INT (custom kernels)	INT	FP8 Tensor Cores (Hopper+)
Best for	training reference	training, serving	CNNs, edge, robust serving	local LLM weights	extreme size pressure	training & inference on new GPUs

The headline trade-off: each halving of bit-width roughly halves the footprint and the memory traffic, but the accuracy cliff steepens as you go. INT8 is almost always safe with simple PTQ. INT4 needs a smart scheme (group-wise scales, outlier handling) to stay safe. Below 4 bits you generally need quantization-aware training or a from-scratch low-bit architecture.

What the numbers actually say

Footprint scales linearly with bits. A 7B model: 28 GB (FP32) → 14 GB (FP16) → 7 GB (INT8) → ~4 GB (INT4 with group scales). The same model that needs a 16 GB GPU in FP16 runs on an 8 GB GPU in INT4.
INT8 on classic vision models is essentially free. Published TensorRT and TFLite results put ResNet-50 INT8 top-1 accuracy within ~0.5% of FP32, while running 2–4× faster on INT8-capable hardware.
INT4 LLMs lose very little. GPTQ and AWQ on models like Llama-2-70B report perplexity increases on the order of 0.1–0.5 — imperceptible in generated text — for an 8× weight shrink.
Speed comes from bandwidth, not arithmetic. Token-by-token LLM generation reads every weight once per token. At INT4 you read a quarter of the bytes of FP16, so a bandwidth-bound decode can run 2–4× faster even though the dot products are unrolled the same number of times.
The cost of going too low is a cliff, not a slope. 3-bit PTQ often loses several points of accuracy; naive 2-bit can break a model entirely. Recovering it requires QAT, which costs real GPU-hours.

JavaScript implementation

A complete symmetric and asymmetric INT8 quantizer, plus the dequantization round-trip. This is the exact math a runtime applies tensor-by-tensor.

// Symmetric INT8: zero-point fixed at 0, range [-α, +α].
function quantizeSymmetric(weights, bits = 8) {
  const qmax = (1 << (bits - 1)) - 1;          // 127 for INT8
  const alpha = Math.max(...weights.map(Math.abs)) || 1e-8;
  const s = alpha / qmax;                        // scale = bin width
  const q = weights.map(w =>
    Math.max(-qmax - 1, Math.min(qmax, Math.round(w / s)))
  );
  return { q, scale: s, zero: 0 };
}

// Asymmetric INT8 (unsigned 0..255): uses the true [min, max] range.
function quantizeAsymmetric(weights, bits = 8) {
  const qmax = (1 << bits) - 1;                  // 255
  const lo = Math.min(...weights), hi = Math.max(...weights);
  const s = (hi - lo) / qmax || 1e-8;
  const z = Math.round(-lo / s);                 // integer that maps to 0.0
  const q = weights.map(w =>
    Math.max(0, Math.min(qmax, Math.round(w / s) + z))
  );
  return { q, scale: s, zero: z };
}

// Reverse the map — recover an approximate float.
function dequantize({ q, scale, zero }) {
  return q.map(v => scale * (v - zero));
}

const w = [-0.51, 0.02, 0.33, -0.18, 0.6];
const packed = quantizeSymmetric(w);            // q are small ints, one float scale
const restored = dequantize(packed);            // ≈ w, error ≤ scale/2 per element

Two details matter. First, the clamp is not optional: an outlier outside the chosen range must saturate to q_max rather than overflow the integer. Second, notice that the only float you keep per tensor is the scale (and zero if asymmetric) — that's the "~4.5 bits" in the table: 4 bits per weight plus a shared scale amortized over the group.

Python implementation

The same logic vectorized with NumPy, plus the famous problem learners actually search for: group-wise INT4 quantization of a weight matrix, the scheme underneath GPTQ/AWQ-style 4-bit LLMs.

import numpy as np

def quantize_symmetric(w, bits=8):
    qmax = (1 << (bits - 1)) - 1          # 127
    alpha = np.abs(w).max()
    s = alpha / qmax if alpha > 0 else 1e-8
    q = np.clip(np.round(w / s), -qmax - 1, qmax).astype(np.int8)
    return q, s

def dequantize_symmetric(q, s):
    return q.astype(np.float32) * s

# Group-wise INT4: one scale per `group` contiguous weights per row.
# This is what makes 4-bit LLM weights usable.
def quantize_int4_groupwise(W, group=128):
    rows, cols = W.shape
    qmax = (1 << 3) - 1                   # 7  -> signed 4-bit range [-8, 7]
    Wq = np.zeros_like(W, dtype=np.int8)
    scales = np.zeros((rows, cols // group), dtype=np.float32)
    for r in range(rows):
        for g in range(cols // group):
            blk = W[r, g*group:(g+1)*group]
            a = np.abs(blk).max()
            s = a / qmax if a > 0 else 1e-8
            Wq[r, g*group:(g+1)*group] = np.clip(np.round(blk / s), -8, 7)
            scales[r, g] = s
    return Wq, scales

def dequantize_int4_groupwise(Wq, scales, group=128):
    rows, cols = Wq.shape
    W = np.zeros((rows, cols), dtype=np.float32)
    for r in range(rows):
        for g in range(cols // group):
            W[r, g*group:(g+1)*group] = (
                Wq[r, g*group:(g+1)*group].astype(np.float32) * scales[r, g]
            )
    return W

W = np.random.randn(4, 256).astype(np.float32) * 0.1
Wq, scales = quantize_int4_groupwise(W, group=128)
err = np.abs(W - dequantize_int4_groupwise(Wq, scales, 128)).mean()
print(f"4-bit group-wise mean abs error: {err:.5f}")   # tiny relative to weight scale

The group loop is the key idea: each block of 128 weights gets a scale tuned to its magnitude, so a quiet block isn't forced to share a coarse scale with a loud one. That local adaptivity is why group-wise INT4 holds accuracy where per-tensor INT4 falls apart.

Variants worth knowing

Post-training quantization (PTQ). Quantize an already-trained model using a small unlabeled calibration set to estimate activation ranges. Fast — minutes — and the default for INT8. No retraining, no gradients.

Quantization-aware training (QAT). Insert "fake quantize" nodes during training that round in the forward pass but pass gradients through with the straight-through estimator. The network learns weights that are robust to rounding. Essential below INT8 when PTQ degrades; costs GPU-hours.

GPTQ. A one-shot, layer-by-layer PTQ method that quantizes weights one column at a time and uses second-order (Hessian) information to update the remaining unquantized weights, compensating for each rounding error. Pushes large models to INT4/INT3 with minimal loss.

AWQ (Activation-aware Weight Quantization). Observes that a small fraction of weight channels are "salient" because they multiply large activations. It scales those channels up before quantizing so they keep more precision, then folds the scale back — protecting the channels that matter without mixed precision.

LLM.int8() and SmoothQuant. Two ways to tame activation outliers. LLM.int8() runs the rare outlier dimensions in FP16 and the rest in INT8 (mixed-precision decomposition). SmoothQuant migrates the difficulty from activations into weights via a per-channel rescaling so both can be quantized uniformly.

BitNet / 1-bit and ternary networks. Train from scratch with weights constrained to {−1, +1} or {−1, 0, +1}. Not a PTQ method — the architecture is designed for extreme low-bit from the start, replacing multiplies with additions.

Common bugs and edge cases

Calibrating on the wrong data. PTQ estimates activation ranges from a calibration set. If that set isn't representative of real inputs, the ranges are wrong and accuracy tanks. A few hundred in-distribution samples is usually enough — more isn't always better.
Ignoring activation outliers. Quantizing transformer activations per-tensor without outlier handling clips the few huge channels and wrecks the model. Use LLM.int8(), SmoothQuant, or AWQ for large language models.
Forgetting to clamp. Omitting the saturating clamp lets an out-of-range value overflow the integer type and wrap around to a wildly wrong number — a single weight can poison a layer.
Quantizing the wrong layers. The first and last layers (input embeddings, output projection / softmax logits) are disproportionately sensitive. Many production recipes keep them in FP16 and quantize only the middle.
Assuming smaller always means faster. On compute-bound, large-batch workloads the matmul dominates; if the hardware lacks fast INT4 kernels, dequantizing back to FP16 to multiply can erase the gain. The win is real mainly when you're memory-bound.
Symmetric scale on a one-sided distribution. Post-ReLU activations are all ≥ 0. Symmetric quantization wastes the entire negative half of the integer range on values that never occur — use asymmetric (unsigned) quantization there.

Frequently asked questions

How much smaller and faster does quantization make a model?