Data Structures
Piece Table
Edits never move bytes — they splice pieces
A piece table represents a document with two immutable buffers plus a sequence of piece descriptors. Edits splice the piece list — bytes are never moved. Powers VS Code and classic Word.
- Insert at cursorO(1) splice + O(log n) lookup
- Delete rangeO(log n) piece edits
- Read full documentO(n) over pieces
- MemoryOriginal + add + piece list
- Undo costO(1) — restore old piece list
- Bytes moved on editZero
Interactive visualization
Press play, or step through manually. Watch the piece list grow as edits splice — original bytes never move.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How a piece table works
Open a 100 MB log file in your editor and start typing in the middle. A naive editor representing the document as a flat string would copy 50 MB of bytes on every keystroke. A piece table never copies the file at all. Instead, it stores the file once in an original buffer and never touches it again. Every character you type goes to a separate add buffer that only grows. The document itself is described by a small list of pieces — pointers that say "first take 1000 bytes from the original buffer at offset 0, then 5 bytes from the add buffer at offset 0, then 99,000 bytes from the original buffer at offset 1000."
Three components:
- Original buffer. The file as loaded from disk. Immutable. Never modified for the lifetime of the document.
- Add buffer. A growing byte array. Every typed character, every paste, every macro insertion appends bytes here. Bytes already written are never overwritten.
- Piece list. An ordered sequence of
(buffer, offset, length)tuples. Reading the document means walking the list and emitting each slice in order.
That's it. An insertion at cursor position p:
- Find the piece containing p — call it piece P spanning bytes
[a, b). - Append the new text to the add buffer; remember its
(addBuf, addOffset, len)as new piece N. - Split P into two: P1 =
[a, p), P2 =[p, b). - Replace P with the sequence P1, N, P2 in the piece list.
No bytes were copied. The total work is constant for the splice plus the cost of finding piece P, which in a piece-tree variant is O(log n).
Deletion is even easier
Deleting characters in [p, q):
- Find the piece containing p (call it Pleft) and the piece containing q (Pright).
- Trim Pleft to end at p; trim Pright to start at q.
- Remove every piece strictly between them.
Again no bytes are copied. The original and add buffers retain the deleted bytes verbatim — they're simply not referenced by any active piece. This is exactly what makes undo so cheap: the previous piece list pointed at those bytes; restore the list and the bytes "reappear."
When to use a piece table
- Text editors and IDEs. Insertions, deletions, and copy/paste at arbitrary cursor positions are the dominant operation. Piece tables make them trivial.
- Large-file editing. A 100 MB file occupies 100 MB once — for the original buffer — plus the size of your edits. Compare to gap buffers, which copy the whole thing on cursor jump.
- Collaborative editing. Operational transforms and CRDTs play well with piece lists because each piece is an immutable byte range with a stable identity.
- Cheap undo/redo. Each piece-list snapshot is small and never duplicates bytes. Linear undo stacks are practical even for hour-long editing sessions.
- Code intelligence services. Language servers want stable byte offsets across edits. Piece offsets give every original byte a permanent address.
Piece table vs gap buffer vs rope
| Piece table | Gap buffer | Rope | Flat string | |
|---|---|---|---|---|
| Insert at cursor | O(1) splice + O(log n) lookup | O(1) at gap, O(n) on move | O(log n) | O(n) |
| Delete range | O(log n) | O(n) on cursor jump | O(log n) | O(n) |
| Random read | O(log n) | O(1) | O(log n) | O(1) |
| Full document scan | O(n) piece walk | O(n) skip gap | O(n) | O(n) |
| Memory overhead | Add buffer + piece list | Gap of unused bytes | ~30% tree nodes | None |
| Undo cost | O(1) — restore list | Replay log | Persistent variant: O(log n) | O(n) per step |
| Best at | Large files, many edits | Small files, local edits | Concurrent edits | Read-only |
| Used by | VS Code, Word | Emacs | Xi editor, Atom Teletype | Bash readline |
The historical view: gap buffers won in the Emacs era because cursor motion was sequential and memory was tight. Piece tables won in the GUI editor era because files got larger and undo became a first-class feature. Ropes are the academic favorite but pay a constant factor in practice; only Xi editor and a few CRDT engines use them at scale.
Pseudo-code
// Piece descriptor.
Piece { buffer: ORIG | ADD, offset: int, length: int }
state:
original: byte[] // immutable, file contents
add: byte[] // append-only, edits
pieces: list<Piece>
insert(pos, text):
addOffset = len(add)
add.appendBytes(text)
newPiece = Piece(ADD, addOffset, len(text))
(pieceIndex, offsetIntoPiece) = findPiece(pos)
P = pieces[pieceIndex]
leftPart = Piece(P.buffer, P.offset, offsetIntoPiece)
rightPart = Piece(P.buffer, P.offset + offsetIntoPiece, P.length - offsetIntoPiece)
pieces.splice(pieceIndex, 1, leftPart, newPiece, rightPart)
delete(from, to):
trim left piece to end at `from`
trim right piece to start at `to`
remove all pieces strictly in between
read():
out = ""
for P in pieces:
out += buffers[P.buffer].slice(P.offset, P.offset + P.length)
return out
JavaScript implementation
class PieceTable {
constructor(original) {
this.original = original; // immutable string from disk
this.add = ""; // append-only typed text
this.pieces = [
{ buffer: "orig", offset: 0, length: original.length }
];
}
// Find which piece contains the document offset `pos`.
// Returns [pieceIndex, offsetInsidePiece].
findPiece(pos) {
let acc = 0;
for (let i = 0; i < this.pieces.length; i++) {
const p = this.pieces[i];
if (pos <= acc + p.length) return [i, pos - acc];
acc += p.length;
}
return [this.pieces.length - 1, this.pieces.at(-1).length];
}
insert(pos, text) {
if (!text) return;
const addOffset = this.add.length;
this.add += text;
const newPiece = { buffer: "add", offset: addOffset, length: text.length };
const [idx, off] = this.findPiece(pos);
const P = this.pieces[idx];
if (off === 0) {
this.pieces.splice(idx, 0, newPiece);
} else if (off === P.length) {
this.pieces.splice(idx + 1, 0, newPiece);
} else {
const left = { buffer: P.buffer, offset: P.offset, length: off };
const right = { buffer: P.buffer, offset: P.offset + off, length: P.length - off };
this.pieces.splice(idx, 1, left, newPiece, right);
}
}
delete(from, to) {
if (from >= to) return;
const [iStart, oStart] = this.findPiece(from);
const [iEnd, oEnd] = this.findPiece(to);
if (iStart === iEnd) {
const P = this.pieces[iStart];
const left = { buffer: P.buffer, offset: P.offset, length: oStart };
const right = { buffer: P.buffer, offset: P.offset + oEnd, length: P.length - oEnd };
const out = [];
if (left.length) out.push(left);
if (right.length) out.push(right);
this.pieces.splice(iStart, 1, ...out);
return;
}
const Pstart = this.pieces[iStart];
const Pend = this.pieces[iEnd];
const trimmedStart = { buffer: Pstart.buffer, offset: Pstart.offset, length: oStart };
const trimmedEnd = { buffer: Pend.buffer, offset: Pend.offset + oEnd, length: Pend.length - oEnd };
const out = [];
if (trimmedStart.length) out.push(trimmedStart);
if (trimmedEnd.length) out.push(trimmedEnd);
this.pieces.splice(iStart, iEnd - iStart + 1, ...out);
}
toString() {
let out = "";
for (const p of this.pieces) {
const buf = p.buffer === "orig" ? this.original : this.add;
out += buf.slice(p.offset, p.offset + p.length);
}
return out;
}
}
// Usage:
const doc = new PieceTable("Hello, world!");
doc.insert(5, " beautiful"); // -> "Hello beautiful, world!"
doc.delete(0, 6); // -> "beautiful, world!"
console.log(doc.toString());
Python implementation
from dataclasses import dataclass
from typing import List, Tuple
@dataclass
class Piece:
buffer: str # 'orig' or 'add'
offset: int
length: int
class PieceTable:
def __init__(self, original: str):
self.original = original
self.add = ""
self.pieces: List[Piece] = [Piece('orig', 0, len(original))]
def _find(self, pos: int) -> Tuple[int, int]:
acc = 0
for i, p in enumerate(self.pieces):
if pos <= acc + p.length:
return i, pos - acc
acc += p.length
return len(self.pieces) - 1, self.pieces[-1].length
def insert(self, pos: int, text: str):
if not text:
return
add_off = len(self.add)
self.add += text
new = Piece('add', add_off, len(text))
idx, off = self._find(pos)
P = self.pieces[idx]
if off == 0:
self.pieces.insert(idx, new)
elif off == P.length:
self.pieces.insert(idx + 1, new)
else:
left = Piece(P.buffer, P.offset, off)
right = Piece(P.buffer, P.offset + off, P.length - off)
self.pieces[idx:idx+1] = [left, new, right]
def __str__(self):
buffers = {'orig': self.original, 'add': self.add}
return "".join(buffers[p.buffer][p.offset:p.offset+p.length] for p in self.pieces)
doc = PieceTable("Hello, world!")
doc.insert(5, " beautiful")
print(doc) # Hello beautiful, world!
This Python sketch uses a plain Python list for the piece sequence; lookup is O(n). Production implementations replace it with a balanced tree keyed by document offset (the "piece tree" in VS Code's case), making findPiece O(log n).
Variants
- Classic piece table. The pieces are stored in a doubly-linked list or plain array. Edits at the cursor are O(1), but finding a piece by document offset is O(n). Fine for files under a few MB.
- Piece tree (VS Code). The pieces live in a red-black tree keyed by cumulative byte offset. Insert and lookup are both O(log n). Powers VS Code's editor since 2018; handles 100 MB files with microsecond per-edit latency.
- Append-only piece table. Both original and add are append-only; old pieces are never mutated. Enables structural sharing across undo states — each undo entry is just a pointer to a previous piece-list root.
- CRDT piece tables. Each piece carries a unique ID; concurrent inserts from different users become independent pieces, merged by the CRDT ordering. Foundation of collaborative editors like Y-text.
- Piece tables on disk. Microsoft Word's classic .doc format embeds the piece table directly — the file stores both the original text region and the add region plus the piece descriptors, so the next open continues exactly where you left off without ever rewriting the whole file.
Performance — costed claims
- VS Code piece tree. The 2018 migration report measured: opening a 35 MB JSON file dropped from out-of-memory to under one second; per-keystroke latency stayed under 1 ms even on a 100 MB synthetic file. The piece tree was the key change.
- Per-edit work. Each insert or delete touches at most 3 pieces (split a piece, insert a new one). With a balanced-tree implementation, lookup adds O(log P) where P is the number of pieces.
- Memory overhead. One piece descriptor is ~24 bytes (3 fields, padding). For a 1 MB file with 10,000 edits, you have ~30,000 pieces × 24 B = 720 KB of piece overhead — under 1% of the original file size.
- Undo storage. Each undo entry is one piece-list snapshot (~24 B × number of changed pieces). A typical 1000-step undo history fits in a few hundred KB regardless of the size of the edited text.
- Coalescing wins. Adjacent pieces from the same buffer with contiguous offsets can merge into one. Editors that coalesce on idle keep piece counts roughly linear in distinct edit positions, not keystrokes.
Common bugs and edge cases
- Off-by-one in piece splits. Splitting at offset zero (insert at piece start) and offset equal to length (insert at piece end) should both avoid creating zero-length pieces. Empty pieces accumulate as junk and break offset arithmetic.
- Multi-byte characters at piece boundaries. If you split a UTF-8 sequence mid-codepoint, the document becomes corrupt. Always split on codepoint boundaries; index lookups should treat offsets as codepoint indices or always validate.
- Line-count drift. Editors usually cache line counts per piece for fast line-number lookup. Forgetting to recompute the cache after split/merge produces stale line numbers — the bug VS Code's original piece-table implementation had before the tree migration.
- Add buffer growth without bound. If a user types and deletes the same paragraph 10,000 times, the add buffer keeps growing while no active piece references most of it. Periodically GC by rewriting the document into a fresh single-piece state during save.
- Linear search through pieces. The naive list-based implementation is O(P) per lookup. At 100,000 pieces and 60 FPS, that's 6M comparisons/s on the hot path. Switch to a tree before this becomes a problem.
- Concurrent reader during edit. If a syntax highlighter walks the piece list while the editor splices it, the walk can crash or skip characters. Use a copy-on-write piece list or an immutable persistent tree.
Frequently asked questions
Why don't text editors just use a flat string?
Inserting one character at position 50,000 in a 1 MB flat string copies the remaining 950,000 bytes. At 60 FPS, that's a budget-killer. Real editors need an O(log n) — or amortized O(1) — insertion at any cursor position. Piece tables, gap buffers, and ropes are the three classic answers; each trades different overheads.
How does a piece table actually represent a document?
Two immutable byte buffers and a list of pieces. The 'original' buffer holds the file as loaded from disk and is never modified. The 'add' buffer is append-only and accumulates every typed character. Each piece is a triple (buffer, offset, length) — pointing into one of the two buffers. The document is the concatenation of those slices in piece-list order. Editing splices the list; bytes never move.
What's the difference between a piece table, gap buffer, and rope?
A gap buffer keeps the document as one array with a movable gap around the cursor — fast local edits, slow cursor jumps. A rope stores the document in a balanced binary tree of small chunks — uniformly fast everywhere but more complex. A piece table uses two append-only buffers and a piece list — edits are pure pointer manipulation and undo is trivial because original bytes never change.
Why does VS Code use a piece tree?
VS Code switched from a line-based array to a piece tree (a red-black tree of pieces) in 2018 specifically to handle large files. The team reported opening a 35 MB file went from out-of-memory to under one second, and per-edit latency stayed in microseconds even at 100 MB. The tree gives O(log n) lookup of which piece contains a given offset, on top of the basic piece-table semantics.
Why is undo cheap with a piece table?
Because the original buffer is immutable and the add buffer is append-only, the bytes that were 'there before' an edit still exist verbatim. Undo just restores the previous piece list — a snapshot of pointers, not bytes. Modern editors store the piece list per edit in an undo stack. Memory cost is proportional to the number of pieces changed, not the size of edited text.
When does a piece table degrade?
Pathologically large numbers of tiny pieces. If a user makes 100,000 random one-character edits, the piece list balloons to 100,000 entries; iterating the document becomes painful. Mitigations: keep the piece list in a balanced tree (piece tree), coalesce adjacent pieces from the same buffer when possible, and periodically rebuild the document into a fresh single piece during save or idle time.
Did Microsoft Word really use a piece table?
Yes — classic Word for DOS and early Windows versions used a piece table descended from the Bravo editor at Xerox PARC, and its lineage runs back to the late 1970s. Charles Simonyi brought the idea to Microsoft. Modern Word's binary .doc format still exposes the piece-table structure on disk; the runtime structure has evolved but the family resemblance remains.