Why √N as the block size?

Range queries touch up to two partial blocks (left edge, right edge) and a span of full blocks. The partial-block work is O(block_size); the full-block work is O(N / block_size). Setting these equal gives block_size = √N, the value that minimises total per-query work. For N = 10⁵ that's 316 blocks of size 316.

Is sqrt decomposition slower than a segment tree?

Asymptotically yes — O(√N) versus O(log N). For N = 10⁵, √N ≈ 316 versus log₂N ≈ 17 — roughly 19× more cells per query. But sqrt-decomp has smaller constants, is cache-friendlier, and is shorter to write. For N ≤ 10⁵ with 10⁶ queries, both fit in a second in C++.

When does sqrt decomposition beat segment trees?

When the operation doesn't compose well — counting distinct elements in a range, finding the mode, k-th smallest of a range with updates. These problems don't have an associative merge that fits a segment tree node; storing a frequency table per block does. The Mo's algorithm framework is built on this insight.

What is Mo's algorithm?

An offline technique: sort range queries by (block of left endpoint, right endpoint), then sweep a moving window. Each pointer moves O(N√N) total time across all queries. Used for problems with no easy online query — distinct counts, range mode, range gcd of frequencies. Sqrt decomposition is the engine.

How do you handle updates with sqrt decomposition?

Point update: O(1) — change the cell, change its block's aggregate. Range update with lazy add: store a delta-per-block, apply on query. Range update with arbitrary monoid: O(√N) — flush lazy to affected partials, mark full blocks. The structure is forgiving — almost any operation has a workable variant.

What size of N is the sweet spot?

N around 10⁴ to 10⁵ with 10⁵ to 10⁶ queries. Below 10³ even naive O(N) per query fits; above 10⁶ the √N factor (≈ 1000) blows the time budget. In that window sqrt decomposition is the sweet spot of simplicity and speed.

Sqrt Decomposition — Block Arrays in O(√N) Per Range Query

How sqrt decomposition works

You have an array of N numbers, point updates, and range-sum queries. Two naive solutions sit at opposite ends:

Plain array: update O(1), range query O(N).
Maintained prefix-sum array: range query O(1), update O(N).

Sqrt decomposition meets in the middle. Pick a block size B ≈ √N. Slice the array into ⌈N/B⌉ contiguous blocks. For every block, store the precomputed aggregate (sum, in this case). The array itself stays intact.

Point update: change one cell, fix that cell's block aggregate. Two memory writes — O(1).

Range query over [l, r]: walk through three pieces.

The partial left block — cells from l to the end of its block, summed cell-by-cell. At most B cells.
The span of full blocks — every block fully covered by [l, r]. Read each block's aggregate directly. At most N/B blocks.
The partial right block — cells from the start of r's block to r. At most B cells.

Total work: 2B + N/B cells. Setting B = √N minimises this to 2√N + √N = 3√N — that's where the structure's name comes from.

Worked example

Take 16 elements and B = 4, so 4 blocks of 4:

arr   = [3, 1, 4, 1 | 5, 9, 2, 6 | 5, 3, 5, 8 | 9, 7, 9, 3]
block = [    9     |     22      |     21     |     28    ]

Range query [3, 12] (0-indexed, inclusive on both ends):

Partial left block (block 0 from index 3): arr[3] = 1.
Full blocks: blocks 1 and 2 → 22 + 21 = 43.
Partial right block (block 3 from index 12 to 12): arr[12] = 9.

Total: 1 + 43 + 9 = 53. We touched 1 + 2 + 1 = 4 cells instead of 10. With N = 16 and √N = 4, the savings are visible; with N = 10⁵, the savings are dramatic — 316 cells per query versus 100,000.

When to use sqrt decomposition

Quick implementations under contest time pressure — you can write it correctly in 5 minutes.
Operations that don't compose nicely into a segment-tree node — distinct counts, mode, range gcd with frequencies, k-th smallest in a range.
Offline batch queries via Mo's algorithm — sort queries by (left-block, right) and amortise total work.
Cache-friendly workloads — 316-element blocks fit comfortably in L1, beating segment-tree pointer chasing.
Range updates where lazy propagation is overkill — keep a per-block delta, push when needed.

Reach for a segment tree or Fenwick tree when N exceeds 10⁶ or when the query rate is so high that √N starts to dominate.

Sqrt decomposition vs Fenwick tree vs segment tree

	Sqrt-decomp	Fenwick tree	Segment tree
Point update	O(1)	O(log n)	O(log n)
Range query	O(√n)	O(log n)	O(log n)
Range update	O(√n) (lazy)	O(log n) (BIT2 trick, sum only)	O(log n) (lazy)
Code length	~25 lines	~12 lines	~40 lines
Supports min/max/gcd	Yes	No (needs inverse)	Yes
Supports distinct/mode	Yes (frequency block)	No	Hard (heavy node state)
Constant factor	Smallest (cache-friendly)	Very small	2–3× larger
Persistence	Easy (block snapshots)	Hard	Easy (path-copy)

The trade-off is asymptotic versus practical: √N is worse than log N on paper, but the constant factor and code simplicity often make sqrt decomposition the right tool for medium-N problems.

Pseudo-code

B = ceil(sqrt(N))
blocks = array of ceil(N / B) sums, computed on build

function update(i, value):
    blocks[i / B] += value - arr[i]
    arr[i] = value

function query(l, r):
    s = 0
    blockL = l / B
    blockR = r / B
    if blockL == blockR:
        for i in l..r: s += arr[i]
        return s
    for i in l..(blockL + 1) * B - 1: s += arr[i]
    for b in (blockL + 1)..(blockR - 1): s += blocks[b]
    for i in blockR * B..r: s += arr[i]
    return s

JavaScript implementation

class SqrtDecomposition {
  constructor(arr) {
    this.n = arr.length;
    this.b = Math.max(1, Math.ceil(Math.sqrt(this.n)));
    this.arr = Int32Array.from(arr);
    const numBlocks = Math.ceil(this.n / this.b);
    this.blocks = new Int32Array(numBlocks);
    for (let i = 0; i < this.n; i++) this.blocks[Math.floor(i / this.b)] += this.arr[i];
  }

  update(i, value) {
    const block = Math.floor(i / this.b);
    this.blocks[block] += value - this.arr[i];
    this.arr[i] = value;
  }

  query(l, r) {
    let s = 0;
    const blockL = Math.floor(l / this.b);
    const blockR = Math.floor(r / this.b);
    if (blockL === blockR) {
      for (let i = l; i <= r; i++) s += this.arr[i];
      return s;
    }
    for (let i = l; i < (blockL + 1) * this.b; i++) s += this.arr[i];
    for (let b = blockL + 1; b < blockR; b++) s += this.blocks[b];
    for (let i = blockR * this.b; i <= r; i++) s += this.arr[i];
    return s;
  }
}

const sd = new SqrtDecomposition([3,1,4,1,5,9,2,6,5,3,5,8,9,7,9,3]);
console.log(sd.query(3, 12));  // 53
sd.update(7, 0);
console.log(sd.query(3, 12));  // 53 - 6 = 47

Python implementation — distinct counts with frequency blocks

One reason to learn sqrt decomposition is that it supports operations segment trees can't. Counting distinct elements in a range with point updates is one such problem: each block stores a hashmap of value → count, and the answer is the size of the union of block hashmaps (plus partial-block scans).

import math
from collections import Counter, defaultdict

class DistinctSqrt:
    def __init__(self, arr):
        self.n = len(arr)
        self.b = max(1, int(math.sqrt(self.n)))
        self.arr = list(arr)
        self.freq = []  # one Counter per block
        for i in range(0, self.n, self.b):
            self.freq.append(Counter(arr[i:i + self.b]))

    def update(self, i, value):
        block = i // self.b
        old = self.arr[i]
        self.freq[block][old] -= 1
        if self.freq[block][old] == 0: del self.freq[block][old]
        self.freq[block][value] += 1
        self.arr[i] = value

    def distinct(self, l, r):
        block_l = l // self.b
        block_r = r // self.b
        seen = defaultdict(int)
        if block_l == block_r:
            for v in self.arr[l:r + 1]: seen[v] += 1
            return len(seen)
        for v in self.arr[l:(block_l + 1) * self.b]: seen[v] += 1
        for b in range(block_l + 1, block_r):
            for v in self.freq[b]: seen[v] += 1
        for v in self.arr[block_r * self.b:r + 1]: seen[v] += 1
        return len(seen)

ds = DistinctSqrt([1, 2, 1, 3, 2, 4, 1, 5, 1])
print(ds.distinct(0, 8))   # 5
print(ds.distinct(2, 6))   # 4

The total work is O(√N) per query — exactly the same shape as the sum version. The cost of carrying a Counter per block is O(N) extra memory, but you get an operation no segment tree handles cleanly.

Variants

Range update with lazy add

Store a add[block] per block. A range update touches at most two partial blocks (cell-by-cell apply) and N/B full blocks (bump add[block]). Range queries read both arr[i] + add[block(i)]. All operations stay O(√N).

Mo's algorithm

Offline queries sorted by (block of L, R). When you move the active window, each pointer travels O(N · √Q) total across all queries — sub-O(N) per query amortised. Used for range mode, range distinct, range frequency-gcd. Sqrt decomposition is the block-size choice that makes the sort work.

Persistent sqrt decomposition

Each update snapshots the affected block plus the changed cell — O(√N) memory per version. Useful when you need "what was the sum in version k" alongside live mutation.

Adaptive block sizes

If updates dominate queries 100:1, pick a smaller B so update work shrinks. Theoretical sweet spot is B = √(query_count / update_count · N). Most contest problems just use ⌈√N⌉ and accept the constant-factor loss.

Common bugs and edge cases

Single-block ranges. When blockL == blockR the three-piece loop falls apart — the "full blocks" middle is empty and the two partials overlap. Always handle this case first with a direct cell scan from l to r.
Block index off-by-one. The right partial starts at blockR * B, not blockR * B + 1. The full-blocks loop goes from blockL + 1 to blockR − 1 inclusive, not blockR.
Last-block underflow. If N isn't divisible by B, the last block is short. Bounds-check the partial-right loop with min((blockR+1)*B, N) to avoid reading past the array.
Updates that forget to fix the block aggregate. Mutating arr[i] without updating blocks[i/B] produces a structure that silently drifts. Use the delta form blocks[i/B] += new - old and write to arr[i] afterward.
Wrong B for the workload. Distinct-count queries often have answer counts proportional to block scans, not block count. Profile and tune. The default ⌈√N⌉ is fine for sum, suboptimal for mode or distinct.
Floating-point sqrt. Computing B via Math.sqrt in JavaScript can give one-off values like 9 when you wanted 10. Always wrap with Math.max(1, Math.ceil(Math.sqrt(N))).

Sqrt Decomposition

Interactive visualization

Watch the 60-second explainer

How sqrt decomposition works

Worked example

When to use sqrt decomposition

Sqrt decomposition vs Fenwick tree vs segment tree

Pseudo-code

JavaScript implementation

Python implementation — distinct counts with frequency blocks

Variants

Range update with lazy add

Mo's algorithm

Persistent sqrt decomposition

Adaptive block sizes

Common bugs and edge cases

Frequently asked questions

Interactive visualization

Watch the 60-second explainer

How sqrt decomposition works

Worked example

When to use sqrt decomposition

Sqrt decomposition vs Fenwick tree vs segment tree

Pseudo-code

JavaScript implementation

Python implementation — distinct counts with frequency blocks

Variants

Range update with lazy add

Mo's algorithm

Persistent sqrt decomposition

Adaptive block sizes

Common bugs and edge cases

Frequently asked questions

Related concepts