How fast is a 2D Fenwick tree in practice?

On a 1000×1000 grid, each update touches log₂(1000) ≈ 10 rows × 10 columns = 100 cells. Each cell update is a few cycles. A modern CPU handles around half a million 2D BIT updates per second on a million-cell grid — fast enough for interactive image filtering or live-board game state, but not for full-frame video at 60 fps.

Why O(log²) and not O(log)?

Each dimension contributes one logarithmic factor independently. The outer loop walks log N rows; for every row touched, the inner loop walks log M columns. Multiply the two — log N · log M — and that is your work per update or query. The 2D BIT cannot beat this by being clever: rank-and-coordinate-compressed inputs still need both walks.

When is a 2D Fenwick tree the wrong choice?

Sparse grids. If only a few thousand cells are non-zero in a 10⁶ × 10⁶ space, an offline sweep with a 1D BIT (Mo's algorithm in 2D), a kd-tree, or a hash-backed BIT beats the dense O(NM) memory cost. The 2D BIT is also a poor fit for range min/max — no inverse exists, so the trick that powers BIT for sum cannot be ported.

Can a 2D Fenwick tree do range updates plus range queries?

Yes, with four 2D BITs in parallel, generalising the BIT2 trick from 1D. Each rectangular update adds a constant to a submatrix; queries return the rectangular sum. Cost stays O(log² N) per operation but constants quadruple, so check whether your problem really needs it before writing 100 lines of 2D BIT bookkeeping.

How does the inversion-count trick generalise to 2D?

It becomes the 2D dominance problem: count pairs where one point strictly dominates another in both coordinates. Sort by one axis, use a 1D BIT indexed by the other after coordinate compression. The 2D BIT itself appears when both axes are dynamic — for example, counting kings dominating points as both move.

Why is the submatrix sum formula prefix(r2,c2) − prefix(r1−1,c2) − prefix(r2,c1−1) + prefix(r1−1,c1−1)?

Inclusion-exclusion. The large prefix overcounts everything above-left of (r2, c2). Subtracting the two strip prefixes removes that overhang twice in the corner, so you add the corner prefix back. Four 2D queries — still O(log²) per submatrix.

2D Fenwick Tree (Binary Indexed Tree) — Submatrix Sums in O(log²)

How a 2D Fenwick tree works

The 1D Fenwick tree stores partial sums of an array along ranges whose lengths are powers of two — every bit[i] covers arr[i − lowbit(i) + 1 .. i]. The 2D version applies the same trick to each axis independently. A cell bit[r][c] stores the sum of a rectangular subgrid whose row-span equals lowbit(r) and column-span equals lowbit(c).

Concretely, with N = M = 8, cell bit[4][6] stores the sum of arr[1..4][5..6] — a 4×2 rectangle ending at (4, 6). Cell bit[8][8] stores the entire grid. Cell bit[3][3] stores only arr[3][3] because both rows and columns have lowbit = 1.

To update position (r, c) by δ, walk row-up via r += r & −r until you exceed N. For every row touched, walk column-up via c += c & −c until you exceed M, adding δ at each (r, c) pair. To query the prefix sum over [1..r][1..c], walk both axes downward by subtracting low-bits, accumulating bit[r][c]. Both operations touch O(log N · log M) cells.

To get a submatrix sum from (r1, c1) to (r2, c2), apply 2D inclusion-exclusion:

submatrix(r1,c1,r2,c2) = prefix(r2,c2)
                        - prefix(r1-1,c2)
                        - prefix(r2,c1-1)
                        + prefix(r1-1,c1-1)

Four prefix queries — each O(log² N) — give the rectangle in total O(log² N) time.

Cost in concrete numbers

On a 1000 × 1000 grid, one BIT operation touches roughly log₂(1000)² ≈ 100 cells. Compare that to the brute-force alternatives:

Re-scan the grid for every query: O(NM) = 10⁶ cells per query.
Static 2D prefix-sum array: O(1) query but updates cost O(NM) — useless if the grid mutates.
2D Fenwick tree: 100 cells per query and per update — six orders of magnitude better than re-scan when both operations interleave.

For a 10⁶-cell grid with mixed update/query workload, expect roughly 500k operations per second in C++, ~100k in JavaScript with Int32Array. Image-filter passes, dynamic heat-maps and live leaderboards all sit in this range.

When to use a 2D Fenwick tree

Dynamic submatrix sums where both rows and columns are bounded and dense.
2D dominance counting (after coordinate compression) — how many earlier points sit below-left of a new point.
Online image processing where pixel updates require integral-image refreshes (medical imaging, satellite raster live edits).
Game boards with up-to-thousands of cells and frequent area scoring.
Statistical heat-maps where a 2D BIT is cheaper than re-deriving cumulative tables on each tick.

Avoid it when the grid is sparse (use a hashmap-backed nested BIT, or a kd-tree), or when you need range min/max (no group inverse — fall back to 2D sparse tables or a 2D segment tree).

2D Fenwick tree vs 2D segment tree vs 2D prefix-sum

	2D Fenwick tree	2D segment tree	Static 2D prefix-sum
Update	O(log N · log M)	O(log N · log M)	O(NM) — must rebuild
Submatrix query	O(log N · log M)	O(log N · log M)	O(1)
Memory	N · M ints	~16 · N · M ints	N · M ints
Code length	~25 lines	~120 lines	~10 lines
Operations supported	Any group (sum, XOR)	Any monoid (min, gcd)	Any group
Range updates	Yes (4 BITs, 2D BIT2 trick)	Yes, with 2D lazy propagation	No (only static)
Constant factor	Smallest dynamic option	~4–8× slower than 2D BIT	Fastest queries; useless if dynamic

The 2D segment tree's appeal is range min/max, but it eats roughly 16× the memory of a 2D BIT — a 10³ × 10³ board uses 64 MB instead of 4 MB. For sum-only problems, the BIT wins on every axis except code complexity (the BIT is shorter).

Pseudo-code

tree[1..N][1..M] = 0       # 1-indexed

function update(r, c, delta):
    i = r
    while i <= N:
        j = c
        while j <= M:
            tree[i][j] += delta
            j += j & -j
        i += i & -i

function prefixSum(r, c):
    s = 0
    i = r
    while i > 0:
        j = c
        while j > 0:
            s += tree[i][j]
            j -= j & -j
        i -= i & -i
    return s

function submatrix(r1, c1, r2, c2):
    return prefixSum(r2, c2)
         - prefixSum(r1-1, c2)
         - prefixSum(r2, c1-1)
         + prefixSum(r1-1, c1-1)

JavaScript implementation

class Fenwick2D {
  constructor(n, m) {
    this.n = n; this.m = m;
    // (n+1) rows × (m+1) cols, flat Int32Array for cache locality
    this.tree = new Int32Array((n + 1) * (m + 1));
  }
  _idx(r, c) { return r * (this.m + 1) + c; }

  update(r, c, delta) {
    for (let i = r; i <= this.n; i += i & -i) {
      for (let j = c; j <= this.m; j += j & -j) {
        this.tree[this._idx(i, j)] += delta;
      }
    }
  }

  prefix(r, c) {
    let s = 0;
    for (let i = r; i > 0; i -= i & -i) {
      for (let j = c; j > 0; j -= j & -j) {
        s += this.tree[this._idx(i, j)];
      }
    }
    return s;
  }

  submatrix(r1, c1, r2, c2) {
    return this.prefix(r2, c2)
         - this.prefix(r1 - 1, c2)
         - this.prefix(r2, c1 - 1)
         + this.prefix(r1 - 1, c1 - 1);
  }
}

// Example: 4×4 grid, set a few cells, query a 2×2 block
const bit = new Fenwick2D(4, 4);
bit.update(1, 1, 3);
bit.update(2, 2, 5);
bit.update(2, 3, 2);
bit.update(3, 1, 7);
console.log(bit.submatrix(2, 2, 3, 3)); // 5 + 2 = 7

The flat Int32Array with a manual index function is roughly 3× faster than an array of arrays in V8, because nested boxed arrays force pointer chasing and bounds checks on every dereference. For very wide grids consider tiling the inner BIT into cache-line-sized rows.

Python implementation

class Fenwick2D:
    def __init__(self, n, m):
        self.n, self.m = n, m
        self.tree = [[0] * (m + 1) for _ in range(n + 1)]

    def update(self, r, c, delta):
        i = r
        while i <= self.n:
            j = c
            while j <= self.m:
                self.tree[i][j] += delta
                j += j & -j
            i += i & -i

    def prefix(self, r, c):
        s = 0
        i = r
        while i > 0:
            j = c
            while j > 0:
                s += self.tree[i][j]
                j -= j & -j
            i -= i & -i
        return s

    def submatrix(self, r1, c1, r2, c2):
        return (self.prefix(r2, c2)
                - self.prefix(r1 - 1, c2)
                - self.prefix(r2, c1 - 1)
                + self.prefix(r1 - 1, c1 - 1))

bit = Fenwick2D(4, 4)
for r, c, v in [(1,1,3), (2,2,5), (2,3,2), (3,1,7)]:
    bit.update(r, c, v)
print(bit.submatrix(2, 2, 3, 3))   # 7

Variants

Range update + range query (2D BIT2)

Maintain four 2D BITs storing different correction terms — a direct generalisation of the 1D BIT2 trick. A rectangular update at [r1, c1, r2, c2] becomes 16 point updates (four BITs × four inclusion-exclusion corners); a rectangular query becomes 16 prefix queries. Total work stays O(log² N) but constants grow by ~16×.

Offline coordinate compression

For sparse inputs over a large coordinate range, read all updates and queries up front, compress (r, c) values to ranks 1..k, build a k × k dense 2D BIT (or smaller if rectangles allow). Memory becomes O(events²) instead of O(N · M).

BIT-of-BITs versus BIT-of-segtrees

A nested 2D BIT is symmetric — both axes use BIT semantics. A BIT-of-segtrees lets the inner axis support range min/max while the outer remains a BIT. Useful when one axis has point updates and the other needs richer queries.

Persistent 2D BIT (rare)

Pure persistence is awkward — path-copy across both axes inflates memory to O(events · log² N). For persistence problems, switch to a 2D segment tree or to offline sweep-line techniques.

Common bugs and edge cases

Forgetting 1-indexing on both axes. Either dimension's lowbit hangs at zero. Allocate (N+1) × (M+1) and shift inputs by +1 when reading from a 0-indexed source.
Submatrix off-by-one. The inclusion-exclusion uses r1−1 and c1−1 — not r1 and c1. Pass prefix(0, c) safely; the function returns 0 by construction.
Overflow. Even on modest grids, a million summed cells with values up to 10⁶ overflow signed 32-bit. Use 64-bit accumulators (BigInt or 64-bit typed arrays) whenever the cell range times the cell count can exceed 2³¹.
Memory explosion. A 10⁴ × 10⁴ 2D BIT is 100 million ints — 400 MB at 4 bytes each. Coordinate-compress first if your grid is sparser than 10% filled.
Naïve build instead of O(NM). Calling update for every cell is O(NM log² N) — for a 1000 × 1000 grid that's 10⁸ ops, several seconds. Instead initialise the array of values and propagate each cell to its 2D parent in one pass.
Mismatched n and m in inner loop. Using j <= n instead of j <= m in the inner walk works only on a square grid. Non-square grids silently drop columns.

2D Fenwick Tree (Binary Indexed Tree)

Interactive visualization

Watch the 60-second explainer

How a 2D Fenwick tree works

Cost in concrete numbers

When to use a 2D Fenwick tree

2D Fenwick tree vs 2D segment tree vs 2D prefix-sum

Pseudo-code

JavaScript implementation

Python implementation

Variants

Range update + range query (2D BIT2)

Offline coordinate compression

BIT-of-BITs versus BIT-of-segtrees

Persistent 2D BIT (rare)

Common bugs and edge cases

Frequently asked questions

Interactive visualization

Watch the 60-second explainer

How a 2D Fenwick tree works

Cost in concrete numbers

When to use a 2D Fenwick tree

2D Fenwick tree vs 2D segment tree vs 2D prefix-sum

Pseudo-code

JavaScript implementation

Python implementation

Variants

Range update + range query (2D BIT2)

Offline coordinate compression

BIT-of-BITs versus BIT-of-segtrees

Persistent 2D BIT (rare)

Common bugs and edge cases

Frequently asked questions

Related concepts