Databases

Columnar Storage

Turn the table on its side so analytics reads only what it needs

Columnar storage lays each column out contiguously on disk instead of each row, so analytic queries read only the columns they touch and compress 5–20× better because neighbouring values share a type and a domain.

Best forOLAP scans & aggregation
Worst forSingle-row OLTP writes
Typical compression5–20× vs row store
I/O for a 3-column queryreads ~3 of N columns
Reconstructionby row position (offset i)

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How columnar storage works

A table is a logical grid of rows and columns, but a disk is a one-dimensional sequence of bytes. Something has to decide the order in which the grid is flattened. A row store writes record by record: all fields of row 1, then all fields of row 2, and so on. A column store writes field by field: every value of column 1, then every value of column 2. Same data, transposed layout.

Consider an events table with columns user_id, country, device, and revenue. On disk the two layouts look like this:

Row store   : [u1,US,ios,4.99][u2,DE,web,0.00][u3,US,ios,2.50] ...
Column store: [u1,u2,u3,...][US,DE,US,...][ios,web,ios,...][4.99,0.00,2.50,...]

Now run SELECT country, SUM(revenue) FROM events GROUP BY country. The query names two columns out of four. In the row store, every record sits between the fields you want, so the scan drags user_id and device off disk just to skip over them — you read the whole table to use half of it. In the column store, you read the country chunk and the revenue chunk and nothing else. On a 4-column table that is roughly half the I/O; on a 200-column data-warehouse fact table it is a 50–100× reduction.

The second win is compression. Because a column is a contiguous run of one type with a narrow domain, the values next to each other look alike, and three cheap encodings do most of the work:

Dictionary encoding — map each distinct value to a small integer. A country column with 200 distinct strings becomes a stream of 1-byte codes plus a 200-entry dictionary.
Run-length encoding (RLE) — if the column is sorted or clustered, replace a run of identical values with a (value, count) pair. A sorted country column of a billion rows might collapse to a few hundred pairs.
Bit-packing / frame-of-reference — store integers in the fewest bits they actually need, or as small deltas from a per-block base, instead of a full 64-bit word each.

A general-purpose compressor like LZ4 or Zstd then runs over the already-encoded stream and squeezes out the rest. Row storage can never reach these ratios because each record interleaves an int, a string, and a float, so the byte stream has no run of like-typed data for the encoder to exploit.

When to use a column store — and when not to

The layout is a bet on your access pattern. Pick columnar when:

Analytic scans dominate. Aggregations, group-bys, and reports that touch a few columns across many rows are exactly what columnar is built for.
Tables are wide and queries are narrow. The more columns the table has and the fewer each query touches, the bigger the I/O saving.
Data is append-mostly. Logs, events, metrics, and historical facts that are written once and read forever play to columnar's strength.
Storage and scan cost matter. Cloud warehouses bill by bytes scanned; columnar plus compression cuts the bill directly.

Avoid columnar — or keep a row store alongside it — when:

You do point lookups and single-row writes. Fetching or updating one whole record means touching every column segment; a row store does it in one place.
You need SELECT * on individual rows. Reassembling a full wide row from N separate columns is the worst case for a column store.
The workload is write-heavy and latency-sensitive. Mutating compressed, immutable column segments forces decode-modify-re-encode cycles or background rewrites.

This is the OLTP-versus-OLAP split. Many modern systems hedge with a hybrid: a row-format write buffer for fresh data that is later compacted into read-optimized columnar segments (SAP HANA, Apache Druid, and DuckDB-style storage all do variations of this).

Row store vs column store vs hybrid

	Row store	Column store	PAX / hybrid
On-disk order	all fields of a row together	all values of a column together	columnar within a row-group page
Single-row read	1 seek	N seeks (one per column)	1 page, then per-column offsets
Analytic scan of k of N columns	reads all N	reads only k	reads only k within touched pages
Compression ratio	1–3×	5–20×	4–15×
Single-row insert	1 append	N segment writes (batched)	1 page write
Update in place	cheap	decode → modify → re-encode	page rewrite
Best workload	OLTP, point access	OLAP, wide scans	mixed / single-node analytics
Real-world use	PostgreSQL heap, MySQL InnoDB	Redshift, ClickHouse, Vertica, BigQuery	Parquet, ORC, DuckDB

"PAX" (Partition Attributes Across) is the middle path used by file formats like Parquet: split the table into row groups of, say, a million rows, then store each row group's columns contiguously within that group. You keep most of columnar's scan and compression wins while bounding how far apart one logical row's fields can drift, which helps both reconstruction and parallelism.

What the numbers actually say

I/O scales with columns touched, not rows. A query naming 3 columns of a 100-column fact table reads roughly 3% of the data a row store would scan — before compression. With 10× compression on top, you move ~0.3% of the bytes.
5–20× compression is typical, and 100×+ is real for low-cardinality sorted columns. A boolean is_active column or a sorted event_date column run-length-encodes almost to nothing. The original C-Store and Vertica papers reported whole-table footprints a fraction of the equivalent row store.
Vectorized scans hit billions of values per second per core. Processing a dense typed array in 1,024-value batches lets the CPU keep data in L1/L2 cache and use SIMD; engines like ClickHouse and DuckDB routinely report 1–10 GB/s/core scan throughput on compressed columns.
Sort order is a tuning knob, not a free win. RLE only helps the column you physically sort by; secondary columns compress at dictionary rates. Picking the right clustering/sort key is one of the highest-leverage decisions in a column store.
Writes pay the bill. A single-row insert that costs one append in a row store may touch dozens of column segments, which is why column stores buffer writes and compact in the background rather than mutating in place.

JavaScript implementation

A minimal column store: store each column as its own array, project by selecting columns, filter by scanning one column and reusing the surviving row indices (late materialization), and show dictionary + run-length encoding on a single column.

class ColumnStore {
  constructor(schema) {           // schema: ['user_id','country','device','revenue']
    this.schema = schema;
    this.cols = Object.fromEntries(schema.map(c => [c, []]));
    this.n = 0;
  }
  insert(row) {                   // row keyed by column name
    for (const c of this.schema) this.cols[c].push(row[c]);
    this.n++;
  }
  // Scan only the named columns; filter on one, aggregate another.
  // `predicate(value)` runs on the filter column — we never touch the others
  // for rejected rows (late materialization).
  sumWhere(sumCol, filterCol, predicate) {
    const f = this.cols[filterCol];   // one contiguous array — cache friendly
    const s = this.cols[sumCol];
    let total = 0;
    for (let i = 0; i < this.n; i++) {
      if (predicate(f[i])) total += s[i];   // gather sumCol only on survivors
    }
    return total;
  }
  // Reassemble full rows by position — the slow path a column store avoids.
  row(i) {
    const r = {};
    for (const c of this.schema) r[c] = this.cols[c][i];
    return r;
  }
}

// Dictionary + run-length encoding for one low-cardinality column.
function encodeColumn(values) {
  const dict = [], code = new Map();
  const codes = values.map(v => {
    if (!code.has(v)) { code.set(v, dict.length); dict.push(v); }
    return code.get(v);
  });
  // RLE over the integer codes (great when the column is sorted/clustered).
  const runs = [];
  for (let i = 0; i < codes.length; ) {
    let j = i;
    while (j < codes.length && codes[j] === codes[i]) j++;
    runs.push([codes[i], j - i]);   // [code, runLength]
    i = j;
  }
  return { dict, runs };            // store this instead of the raw strings
}

const t = new ColumnStore(['user_id', 'country', 'device', 'revenue']);
t.insert({ user_id: 'u1', country: 'US', device: 'ios', revenue: 4.99 });
t.insert({ user_id: 'u2', country: 'DE', device: 'web', revenue: 0.00 });
t.insert({ user_id: 'u3', country: 'US', device: 'ios', revenue: 2.50 });

// SELECT SUM(revenue) WHERE country = 'US' — touches 2 of 4 columns.
console.log(t.sumWhere('revenue', 'country', c => c === 'US')); // 7.49
console.log(encodeColumn(t.cols.country));
// { dict: ['US','DE'], runs: [[0,1],[1,1],[0,1]] }

Two ideas carry over to real engines. First, sumWhere reads only the filter and sum columns and never materializes user_id or device — that is the whole point. Second, you store the encoded form ({dict, runs}), not the raw strings; the scan can even run predicates directly on dictionary codes without decoding.

Python implementation

The same store with a vectorized scan via NumPy — exactly how production engines turn a column into a tight, branch-free loop over a typed buffer.

import numpy as np

class ColumnStore:
    def __init__(self, schema):
        self.schema = schema
        self.cols = {c: [] for c in schema}
        self.n = 0

    def insert(self, row):
        for c in self.schema:
            self.cols[c].append(row[c])
        self.n += 1

    def finalize(self):                 # freeze columns into typed arrays
        self.arr = {c: np.array(v) for c, v in self.cols.items()}

    # Vectorized: build a boolean mask on ONE column, apply to another.
    # No Python-level per-row loop — NumPy runs the comparison in C with SIMD.
    def sum_where(self, sum_col, filter_col, value):
        mask = self.arr[filter_col] == value      # one pass over a dense array
        return self.arr[sum_col][mask].sum()       # gather survivors, then add

    def row(self, i):                   # reconstruct by position (slow path)
        return {c: self.cols[c][i] for c in self.schema}


def encode_column(values):
    """Dictionary + run-length encoding for a low-cardinality column."""
    dict_, code = [], {}
    codes = []
    for v in values:
        if v not in code:
            code[v] = len(dict_)
            dict_.append(v)
        codes.append(code[v])
    runs, i = [], 0
    while i < len(codes):
        j = i
        while j < len(codes) and codes[j] == codes[i]:
            j += 1
        runs.append((codes[i], j - i))   # (code, run_length)
        i = j
    return dict_, runs


t = ColumnStore(['user_id', 'country', 'device', 'revenue'])
for r in [
    {'user_id': 'u1', 'country': 'US', 'device': 'ios', 'revenue': 4.99},
    {'user_id': 'u2', 'country': 'DE', 'device': 'web', 'revenue': 0.00},
    {'user_id': 'u3', 'country': 'US', 'device': 'ios', 'revenue': 2.50},
]:
    t.insert(r)
t.finalize()

print(t.sum_where('revenue', 'country', 'US'))   # 7.49
print(encode_column(t.cols['country']))          # (['US','DE'], [(0,1),(1,1),(0,1)])

The win is mask = arr == value: NumPy evaluates the predicate over the whole contiguous array in compiled C, with no Python object dispatch per element. That is the essence of vectorized execution — the dense, single-type column layout is what makes it possible.

Variants worth knowing

Pure column store (DSM). Decomposition Storage Model — every column is a fully separate file or segment. Maximum scan and compression efficiency; worst-case single-row reconstruction. C-Store, Vertica, ClickHouse's MergeTree, and Amazon Redshift sit here.

PAX / row-group columnar. Partition rows into groups, lay out columns contiguously within each group. Parquet, ORC, and Arrow Feather use this so a single scan task owns a self-contained group and reconstruction stays local.

Column groups. Cluster columns that are queried together (e.g. lat + lng) into one segment so common queries seek once. Cassandra and Bigtable-style "column families" are a coarse version of this idea.

Hybrid / delta architectures. Keep a small, mutable row-format buffer for fresh writes and merge it into immutable columnar segments in the background. SAP HANA's delta store, Apache Druid's real-time vs historical segments, and the read-store/write-store split in the original C-Store all follow this pattern.

Late vs early materialization. An execution-layer variant, not a storage one: run filters and joins on raw column vectors and stitch rows together as late as possible (late materialization) versus reconstructing rows up front (early materialization). Late wins when predicates are selective; early can win when nearly every row survives.

Common bugs and edge cases

Losing the position correspondence. Tuple reconstruction relies on value i meaning the same row in every column. Any operation that filters or reorders one column must carry a row-index vector, not drop it — sort one column independently and the table is silently corrupted.
NULLs break run-length and bit-packing. A column with NULLs needs a separate presence bitmap (a "definition level" in Parquet); folding NULL into a sentinel value wrecks both compression and aggregation correctness.
Decoding too early. Materializing rows before filtering throws away columnar's biggest advantage. Push predicates down to the compressed column and reconstruct only survivors.
Treating it as an OLTP store. High-rate single-row inserts and updates against immutable column segments cause write amplification and constant background compaction. Batch writes, or front the column store with a row-format buffer.
Bad sort/clustering key. RLE only pays off on the column you physically sort by. Sort by a high-cardinality key like user_id and your low-cardinality columns lose most of their run-length savings.
Ignoring per-chunk min/max stats. Columnar formats store min/max per row group so the engine can skip groups that cannot match a predicate. If your data is unsorted, the min/max of every group overlaps and skipping never fires — you scan everything anyway.

Frequently asked questions

Why does columnar storage compress so much better than row storage?

All values in a column share one type and usually a narrow domain — a country column holds maybe 200 distinct strings, a status column three. Storing them contiguously lets the encoder exploit that locality: dictionary encoding replaces each value with a tiny integer code, run-length encoding collapses long sorted runs to a (value, count) pair, and the remaining integers pack into the minimum number of bits. Row storage interleaves an int, a string, and a float on every row, so a general-purpose compressor never sees a clean stream of like-typed data.

What is the difference between a row store and a column store?

A row store keeps all fields of one record together, so reading or updating a whole row touches one place on disk — ideal for OLTP point lookups and writes. A column store keeps all values of one field together, so an analytic scan reads only the columns the query names and skips the rest — ideal for OLAP aggregations over a few columns and billions of rows. Same logical table, opposite physical layout.

Why is columnar storage bad for transactional workloads?

Inserting one row means writing into every column file, and updating one row means decoding a compressed block, changing one value, and re-encoding. A single-row insert that costs one append in a row store can touch dozens of separate column segments in a column store. That is why column stores batch writes into immutable segments and rebuild them in the background rather than mutating in place.

How does a query engine reassemble a row from separate columns?

By position. The value at offset i in every column belongs to the same logical row, so tuple reconstruction is just gathering index i from each requested column. Engines delay this stitching as long as possible — called late materialization — running filters and aggregations on raw compressed column vectors and only assembling rows for the handful of values that survive to the output.

Is Parquet a columnar format?

Yes. Apache Parquet stores data in row groups, and within each row group every column is written as its own contiguous chunk with per-chunk min/max statistics, dictionary and run-length encoding, and a final pass of general compression. A query reads only the column chunks it needs and skips entire row groups whose min/max bounds cannot match the predicate. ORC and Arrow's on-disk Feather format use the same idea.

What is vectorized execution and why does columnar storage enable it?

Because a column lives in memory as a dense array of one type, the engine can process it a batch of 1,024 values at a time in a tight loop with no per-value type dispatch and no pointer chasing. The CPU keeps the array in cache, the branch predictor sees a predictable loop, and SIMD instructions add or compare many values per cycle. Row-at-a-time engines pay an interpreter overhead on every field of every row; vectorized column engines amortize it across a whole batch.