Journaling File Systems

Write your intentions to a log before you act — so a crash can never leave the disk half-changed

A journaling file system writes intended changes to a log before touching the real on-disk structures, so a crash mid-write can be replayed or rolled back — turning a multi-block update into one atomic commit.

  • Recovery after crash: Seconds (replay log)
  • Full fsck: Minutes–hours (scan disk)
  • Default journal size (ext4): ≈ 128 MB
  • Default mode: Metadata only (ordered)
  • Write amplification (data=journal): ≈ 2×

The problem journaling solves

Renaming a file or appending a block is rarely a single disk write. To move a file between directories the file system must remove the entry from one directory block, add it to another, possibly update a link count in the inode, and maybe touch a free-space bitmap. Those are several physically separate sectors. The disk commits them one at a time, in whatever order the elevator scheduler likes, and a power cut can strike between any two.

Land in that gap and you get a file system that violates its own invariants: an inode that two directories both claim, a block marked free that an inode still references, a directory pointing at an inode that no longer exists. None of these are "your data is slightly stale" — they are structural corruption that a naive mount might dereference into a kernel panic or silent data loss.
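To make that failure concrete, here is a minimal sketch of a non-journaled rename torn by a power cut. The block names, the `unsafe_rename` helper, and its `crash_after` knob are hypothetical, invented only for this illustration; no real file system lays out its blocks this way.

```python
# Toy model: the "disk" is a dict of named blocks. Purely illustrative.
def make_disk():
    return {
        "dirA": {"report.txt": 7},   # directory block: name -> inode number
        "dirB": {},
        "inode7": {"links": 1},
    }

def unsafe_rename(disk, name, src, dst, crash_after=None):
    """Move `name` from src to dst as separate in-place writes.

    `crash_after` (hypothetical knob) stops after that many writes,
    simulating a power cut between two sector writes.
    """
    ino = disk[src][name]
    steps = [
        lambda: disk[dst].__setitem__(name, ino),   # write 1: add new entry
        lambda: disk[src].__delitem__(name),        # write 2: remove old entry
    ]
    for done, step in enumerate(steps):
        if crash_after is not None and done >= crash_after:
            return False        # power lost: remaining writes never happen
        step()
    return True

def claimed_by(disk, ino):
    """Directories whose entries reference inode `ino`."""
    return [d for d in ("dirA", "dirB") if ino in disk[d].values()]

disk = make_disk()
unsafe_rename(disk, "report.txt", "dirA", "dirB", crash_after=1)
print(claimed_by(disk, 7), disk["inode7"]["links"])
```

After the simulated crash both directories claim inode 7 while its link count still says 1, which is exactly the kind of invariant violation described above.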

The classic fix was to scan the whole disk on every boot after an unclean shutdown — fsck — rebuilding a consistent picture by cross-checking every inode against every block. That works, but its cost scales with the size of the disk, not the size of the interrupted operation. On a multi-terabyte volume that is a coffee-break-length outage on every crash. The insight that fixes it is the same one databases reached decades earlier with write-ahead logging: before you modify the real structures, first write down exactly what you intend to do, in one append, to a dedicated region called the journal. That is journaling.

The transaction lifecycle

The unit of work is a transaction — a bundle of block changes that must apply all-or-nothing. The journal is a small fixed-size circular log (a reserved area of the disk, or a separate device). A transaction moves through a strict, ordered protocol:

  1. Begin. An in-memory transaction collects every block this operation will dirty. In ext4's JBD2 layer many independent operations are batched into one compound transaction that commits together, amortizing the cost.
  2. Journal write (the log records). Copies of the changed blocks — or just descriptions of them — are written to the journal. This is a large sequential append, cheap on any medium.
  3. Commit. A commit record is written and the write barrier is enforced so it cannot reach the platter before the log records it depends on. The transaction is now durable. This is the linearization point: if a crash happens before the commit block lands, the transaction never happened; after it, the transaction is guaranteed to complete.
  4. Checkpoint (write-back). Lazily, in the background, the real in-place blocks are overwritten from the committed journal entries. Only after every block of a transaction is safely checkpointed can its journal space be reclaimed and reused by the circular log.
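The four steps above can be sketched as a tiny state machine with a bounded log. This is an illustration under stated assumptions, not how JBD2 is implemented: `TinyJournal`, `JournalFull`, and the one-slot-per-block accounting are all hypothetical.

```python
class JournalFull(Exception):
    """Raised when the log has no room until a checkpoint frees space."""

class TinyJournal:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.log = []        # committed transactions awaiting checkpoint
        self.disk = {}       # in-place blocks
        self.open = None     # step 1: the in-memory transaction

    def used(self):
        # Assumed accounting: one slot per logged block, plus one per commit record.
        return sum(len(txn) + 1 for txn in self.log)

    def begin(self):
        self.open = {}

    def dirty(self, addr, value):
        self.open[addr] = value              # staged in memory only

    def commit(self):
        need = len(self.open) + 1
        if self.used() + need > self.capacity:
            raise JournalFull("checkpoint needed before this commit fits")
        self.log.append(dict(self.open))     # steps 2 and 3: log records, then commit
        self.open = None

    def checkpoint(self):
        for txn in self.log:                 # step 4: in-place write-back
            self.disk.update(txn)
        self.log.clear()                     # log space is reclaimed only now

j = TinyJournal(capacity_blocks=4)
j.begin(); j.dirty(100, "a"); j.dirty(200, "b"); j.commit()   # uses 3 of 4 slots
j.begin(); j.dirty(300, "c")
try:
    j.commit()               # needs 2 slots, only 1 is free
except JournalFull:
    j.checkpoint()           # drain the log, then retry
    j.commit()
print(j.disk, j.log)
```

The retry shows why journal space can only be reused after a checkpoint: the second commit has to wait until the first transaction's blocks are safely in place.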

The ordering constraint is the entire ballgame: log first, commit second, in-place writes last. Recovery becomes trivial. On mount the file system scans the journal, finds the last transaction with a valid commit record, and replays every committed transaction by re-issuing its in-place writes (idempotent: replaying twice is harmless). Anything past the last good commit is discarded. The disk is restored to a state that is consistent and reflects every transaction that committed, which includes every operation an application was told is durable (for example, via a successful fsync()).
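A rough sketch of that recovery scan follows. The flat record shapes (`("write", txid, addr, value)` and `("commit", txid)`) are assumptions made for this example, and it also assumes transactions commit in log order, so "has a commit record" is equivalent to "at or before the last good commit".

```python
def recover(journal, disk):
    """Replay committed transactions from a flat record list onto `disk`.

    Assumed record shapes for this sketch:
      ("write", txid, addr, value)  and  ("commit", txid)
    """
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    for rec in journal:
        if rec[0] == "write" and rec[1] in committed:
            _, _, addr, value = rec
            disk[addr] = value      # physical redo: same result if repeated
    return disk

journal = [
    ("write", 1, 100, "removed"),
    ("write", 1, 200, "added"),
    ("commit", 1),
    ("write", 2, 300, "half-done"),   # crash hit before txn 2 committed
]
disk = {100: "old", 200: "old", 300: "old"}
recover(journal, disk)
once = dict(disk)
recover(journal, disk)                # a second replay changes nothing
print(disk == once, disk)
```

Transaction 1 is pushed to completion, transaction 2 is ignored because it never committed, and running recovery a second time leaves the disk unchanged.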

Why the barrier matters: redo vs undo

There are two ways to journal, borrowed straight from database recovery theory (the ARIES family of algorithms, Mohan et al., 1992):

  • Redo (physical) journaling — the dominant choice for file systems. The journal holds the new contents of each block. Recovery replays committed transactions forward. ext3, ext4, and the original design all use redo logging. Uncommitted transactions simply vanish; committed ones are pushed to completion.
  • Undo / undo-redo journaling — the journal also (or instead) holds old values so an in-progress transaction can be rolled back. This lets in-place writes start before commit, but the bookkeeping is heavier. NTFS leans on undo+redo so it can both roll back partial transactions and redo completed ones.
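For contrast with redo logging, here is a minimal sketch of the undo side: old values are saved before each in-place write so an uncommitted transaction can be rolled back. `UndoLogStore` is a hypothetical toy, not a model of NTFS's $LogFile or of any real recovery manager.

```python
class UndoLogStore:
    """Toy undo log: save the OLD value before each in-place write."""

    def __init__(self):
        self.disk = {}
        self.undo = []            # (addr, old_value, existed) records

    def begin(self):
        self.undo = []

    def write(self, addr, value):
        existed = addr in self.disk
        self.undo.append((addr, self.disk.get(addr), existed))  # log old value first
        self.disk[addr] = value   # then write in place, before any commit

    def commit(self):
        self.undo = []            # old values are no longer needed

    def rollback(self):
        # Undo in reverse order so the earliest saved value is restored last.
        for addr, old, existed in reversed(self.undo):
            if existed:
                self.disk[addr] = old
            else:
                self.disk.pop(addr, None)
        self.undo = []

store = UndoLogStore()
store.disk = {100: "A"}
store.begin()
store.write(100, "B")
store.write(100, "C")
store.write(200, "new")
store.rollback()              # crash before commit: restore the old values
print(store.disk)
```

The extra bookkeeping is visible even at this scale: every write has to capture the previous contents, and rollback order matters when one block is written more than once.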

Whichever flavor, correctness rests on a single hardware promise: the storage stack must honor a write barrier / flush so that the commit record is not reordered ahead of the data it certifies. If a disk's volatile write cache lies about flushing (a notorious problem with cheap consumer drives circa 2005), the journal can record a commit whose backing data never landed — and replay will confidently reconstruct corruption. ext4 guards against this with an optional commit-block checksum (journal_checksum): if the recorded blocks don't match the checksum in the commit record, the transaction is rejected rather than trusted.
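The checksum idea can be sketched in a few lines. This is an illustration only: plain CRC-32 from Python's `zlib` stands in for whatever checksum a real journal uses, and the commit-record layout (`nblocks`, `crc`) is hypothetical.

```python
import zlib

def checksum(blocks):
    """Running CRC-32 over every logged block, in order."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)
    return crc

def make_commit(blocks):
    """Build a commit record that certifies the logged blocks."""
    return {"nblocks": len(blocks), "crc": checksum(blocks)}

def commit_is_valid(blocks, commit):
    """True only if the blocks found in the log match the commit record."""
    return len(blocks) == commit["nblocks"] and checksum(blocks) == commit["crc"]

logged = [b"inode 7: links=1", b"dirB: report.txt -> 7"]
commit = make_commit(logged)

# Simulate a lying cache: the commit record landed, the second block did not.
torn = [logged[0], b"\x00" * len(logged[1])]
print(commit_is_valid(logged, commit), commit_is_valid(torn, commit))
```

Recovery would replay the transaction only when `commit_is_valid` returns true, treating a mismatch as if the commit record had never been written.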

When journaling is the right tool (and when it isn't)

  • General-purpose disks where fast, bounded recovery matters. This is why nearly every desktop and server file system shipped since the early 2000s journals by default.
  • Metadata-heavy workloads — lots of small creates, renames, and deletes — benefit most, because metadata journaling coalesces many random updates into one sequential append.
  • You need atomicity across several blocks but don't want the full versioning machinery of copy-on-write.

It is the wrong tool when: the data is disposable scratch space (mount with the journal disabled and skip the double work); you need full data-versioning and snapshots (reach for a copy-on-write file system like ZFS or Btrfs); or the workload is overwhelmingly large sequential writes where a log-structured design that only ever appends will outperform a journal that writes twice.

Journaling vs the alternatives

| | Metadata journaling | Full data journaling | Copy-on-write | Log-structured FS | Soft updates | fsck only |
|---|---|---|---|---|---|---|
| What's protected | Structure only | Structure + data | Structure + data (snapshots) | Structure + data | Structure only | Nothing until repair |
| Recovery time | Seconds | Seconds | Instant (atomic root swap) | Roll forward from checkpoint | Background fsck | O(disk size) |
| Write amplification | ≈1× data, 2× metadata | ≈2× | ≈1× (+ tree updates) | ≈1× (+ GC churn) | | |
| Most recent data loss? | Possible (ordered mode) | No (if committed) | No (if committed) | No (if committed) | Possible | Possible + corruption |
| Complexity | Moderate | Moderate | High | High (GC needed) | Very high (dep tracking) | Low |
| Examples | ext4, XFS, NTFS | ext4 data=journal | ZFS, Btrfs, APFS, ReFS | LFS, F2FS, NILFS | FreeBSD UFS+SU | old ext2, FAT |

The headline trade-off is the same one databases face: durability and bounded recovery cost some write amplification. Journaling buys you a fixed-cost recovery in exchange for re-writing changed blocks (always for metadata, optionally for data). Copy-on-write sidesteps the double write by never overwriting in place, but pays in tree-update overhead and fragmentation. Soft updates (Ganger & McKusick, FreeBSD UFS) cleverly order writes so the disk is always consistent without a journal — but the dependency tracking is so intricate it never saw broad adoption beyond UFS.

What the numbers actually say

  • Recovery cost does not grow with disk size. Journal replay touches only the log, and the log is bounded: ext4's default journal is about 128 MB in the configuration described here (newer e2fsprogs releases may pick a larger cap for big volumes), rather than growing in proportion to the volume, whether that is 50 GB or 50 TB. A full fsck on a 16 TB drive can run tens of minutes to over an hour; journal replay on the same drive completes in seconds.
  • data=journal roughly halves write throughput. Every data block is written twice, once to the journal and once in place, so sustained sequential writes see about 2× write amplification. Ordered mode avoids this by journaling metadata only, which is why it is the default.
  • Batching is what makes it cheap. ext4 commits compound transactions roughly every 5 seconds (the commit= mount option). A burst of a thousand small create() calls turns into a handful of large sequential journal appends instead of thousands of scattered random metadata writes.
  • Ordered mode can still lose your latest writes. It guarantees you never read garbage after a crash, but data written in the last few seconds before a crash — not yet fsync'd — may be gone. Journaling protects consistency, not recency. Only fsync() guarantees a specific write is durable.
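The arithmetic behind the first three points can be sketched as follows. The model is deliberately simplified (it ignores descriptor and commit blocks and block rounding), and the function names are hypothetical.

```python
def bytes_written(data_bytes, meta_bytes, mode):
    """Total device bytes for one update under each ext4-style mode."""
    if mode == "journal":                  # data and metadata logged, then written in place
        return 2 * (data_bytes + meta_bytes)
    if mode in ("ordered", "writeback"):   # only metadata is written twice
        return data_bytes + 2 * meta_bytes
    raise ValueError(f"unknown mode: {mode}")

def amplification(data_bytes, meta_bytes, mode):
    """Device bytes divided by the bytes the application asked to change."""
    return bytes_written(data_bytes, meta_bytes, mode) / (data_bytes + meta_bytes)

def journal_commits(window_seconds, commit_interval=5):
    """Compound commits in a window, assuming one commit per interval."""
    return -(-window_seconds // commit_interval)   # ceiling division

MiB = 1024 * 1024
for mode in ("ordered", "journal"):
    print(mode, round(amplification(100 * MiB, 1 * MiB, mode), 2))

# A 12-second burst of small creates lands in a few compound commits.
print(journal_commits(12))
```

With 100 MiB of data and 1 MiB of metadata, ordered mode stays close to 1× while data=journal is exactly 2× in this model, and a 12-second burst at the default 5-second interval is covered by three commits.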

JavaScript implementation: a redo journal

This models the redo protocol against a simulated block device. Writes go to the journal, get a commit record, and only then checkpoint to the "disk". A crash before commit is invisible after recovery; a crash after commit is replayed.

class BlockDevice {
  constructor() { this.blocks = new Map(); }   // the "real" in-place storage
  write(addr, val) { this.blocks.set(addr, val); }
  read(addr) { return this.blocks.get(addr); }
}

class JournalingFS {
  constructor(dev) {
    this.dev = dev;
    this.journal = [];          // circular log of committed transactions
    this.txn = null;            // open in-memory transaction
  }

  begin() { this.txn = { writes: [], committed: false }; }

  // Stage a change — it does NOT touch the device yet.
  stage(addr, val) {
    if (!this.txn) throw new Error('no open transaction');
    this.txn.writes.push({ addr, val });
  }

  // Step 1: append log records. Step 2: write the commit record + barrier.
  commit() {
    if (!this.txn) throw new Error('no open transaction');
    this.journal.push({ writes: this.txn.writes.slice() }); // log records (durable)
    this.journal[this.journal.length - 1].committed = true;  // commit block + flush
    this.txn = null;
    // Step 3: in-place write-back. A real FS defers this; it runs eagerly here for simplicity.
    this.checkpoint();
  }

  // Push committed journal entries to their real locations, then reclaim log space.
  checkpoint() {
    for (const entry of this.journal) {
      if (!entry.committed) continue;
      for (const { addr, val } of entry.writes) this.dev.write(addr, val);
    }
    this.journal = this.journal.filter(e => !e.committed); // free reclaimed space
  }

  // Simulate a crash: in-memory transaction is lost; only durable journal survives.
  crash() { this.txn = null; }

  // On mount: replay every committed transaction, discard the rest.
  recover() {
    for (const entry of this.journal) {
      if (!entry.committed) continue;          // partial txn → ignored
      for (const { addr, val } of entry.writes) this.dev.write(addr, val);
    }
    this.journal = this.journal.filter(e => !e.committed);
  }
}

// Atomic two-block "rename": both writes land, or neither does.
const dev = new BlockDevice();
const fs  = new JournalingFS(dev);
fs.begin();
fs.stage(/*dirA*/ 100, 'removed');
fs.stage(/*dirB*/ 200, 'added');
fs.crash();                 // power cut BEFORE commit
fs.recover();
console.log(dev.read(100), dev.read(200)); // undefined undefined — never happened

The crucial property is that stage() never writes to the device. The device only ever changes during checkpoint() or recover(), and both only act on entries flagged committed. A transaction that crashes before its commit record therefore leaves no trace on the device: by construction, either all of its writes apply or none do.

Python: ordered mode and the data-vs-metadata distinction

Real file systems usually journal only metadata. The subtlety that makes ordered mode safe is a constraint between two streams: data blocks must hit the disk before the metadata that points at them commits. Otherwise a replayed inode could reference a block that holds stale, unrelated bytes.

class OrderedJournalFS:
    """Metadata is journaled; data is written in place but ORDERED before commit."""
    def __init__(self):
        self.disk_data = {}      # in-place data blocks
        self.disk_meta = {}      # in-place metadata (inodes, bitmaps)
        self.journal = []        # committed metadata transactions

    def write_file_block(self, data_addr, data, meta_addr, meta):
        # 1. Flush DATA to its final location first (the "ordered" guarantee).
        self.disk_data[data_addr] = data
        # 2. Append the METADATA change to the journal as one transaction.
        txn = {"meta": [(meta_addr, meta)], "committed": False}
        self.journal.append(txn)
        # 3. Write the commit record (with a barrier in a real FS).
        txn["committed"] = True
        # 4. Checkpoint metadata to its in-place home (eagerly here; a real FS does this lazily).
        self._checkpoint()

    def _checkpoint(self):
        for txn in self.journal:
            if not txn["committed"]:
                continue
            for addr, val in txn["meta"]:
                self.disk_meta[addr] = val
        self.journal = [t for t in self.journal if not t["committed"]]

    def recover(self):
        # Replay committed metadata. Because data was forced out BEFORE the
        # commit, any replayed inode is guaranteed to point at real contents,
        # never at uninitialized garbage. THAT is what ordered mode buys you.
        for txn in self.journal:
            if txn["committed"]:
                for addr, val in txn["meta"]:
                    self.disk_meta[addr] = val
        self.journal = [t for t in self.journal if not t["committed"]]


# writeback mode would skip step 1's ordering — then a crash could leave the
# inode (metadata) pointing at a data block that was never written, exposing
# whatever stale bytes happened to sit there. That is the classic
# "old data exposure" hole that ordered mode closes.

Note what ordered mode does not promise: it never journals the data itself, so a crash can still lose the most recent un-fsync'd data block. It only promises you will never read someone else's old bytes through a freshly written file — a security and integrity guarantee, not a durability one.
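The difference between the two modes can be shown side by side. This is a sketch under an assumed scenario: block 42 still holds bytes from a deleted file, and the crash lands after the metadata commit in both modes. The function name and block numbers are hypothetical.

```python
def read_after_crash(mode):
    """Return what a reader sees in a newly extended file after a crash.

    Assumed scenario: block 42 holds leftovers from a previous owner and
    the crash happens after the metadata transaction has committed.
    """
    disk_data = {42: "OLD SECRET"}
    journal = []
    new_data = "hello"

    if mode == "ordered":
        disk_data[42] = new_data                     # data forced out BEFORE commit
        journal.append({"inode": 7, "block": 42})    # then metadata commits
    elif mode == "writeback":
        journal.append({"inode": 7, "block": 42})    # metadata commits first
        # crash: the data write to block 42 was still sitting in memory
    else:
        raise ValueError(f"unknown mode: {mode}")

    # Recovery replays the committed metadata, then the file is read back.
    inode = journal[-1]
    return disk_data[inode["block"]]

print(read_after_crash("ordered"))
print(read_after_crash("writeback"))
```

In ordered mode the reader gets the new contents; in writeback mode the same committed inode exposes the previous owner's bytes.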

Variants worth knowing

ext3 / ext4 + JBD2. The canonical example. The Journaling Block Device layer (JBD2 in ext4) is a generic transaction engine the file system calls into. Three modes: data=writeback (metadata only, no ordering), data=ordered (metadata journaled, data forced first — the default), and data=journal (everything journaled, 2× writes, strongest guarantee).

XFS. A metadata-only journal designed by SGI in 1993 for huge volumes and high parallelism. It uses asynchronous, delayed logging and per-CPU log buffers, making it excellent for large files and parallel metadata workloads.

NTFS. Uses the $LogFile with both undo and redo records, enabling roll-back of partial transactions and roll-forward of completed ones — closer to a full database recovery manager than the typical redo-only file-system journal.

ext4 with journal checksums. A torn or reordered commit block used to be silently trusted. The journal_checksum feature lets recovery detect a commit whose logged blocks don't match, and discard it — turning a possible corruption into a clean rollback of the last transaction.

Log-structured and copy-on-write cousins. A log-structured file system (LFS, F2FS) makes the whole disk the log and never overwrites in place, eliminating the double write. Copy-on-write file systems (ZFS, Btrfs, APFS) write new versions to free space and atomically swap a root pointer, getting crash consistency and snapshots without a separate journal.
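The root-pointer swap is easy to sketch. `CowStore` below is a hypothetical toy that copies the whole tree on every update, which no real copy-on-write file system does; it only illustrates why a single pointer write gives both atomicity and snapshots.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never overwritten, the root is one pointer."""

    def __init__(self):
        self.blocks = {0: {}}     # block id -> a version of the tree
        self.root = 0             # the single pointer that defines "current"
        self.next_id = 1

    def update(self, changes, crash_before_swap=False):
        new_tree = {**self.blocks[self.root], **changes}   # copy, never overwrite
        new_id = self.next_id
        self.blocks[new_id] = new_tree                     # written to free space
        self.next_id += 1
        if crash_before_swap:
            return self.root      # new blocks are orphaned; the old root is still valid
        self.root = new_id        # the atomic step: one pointer write
        return self.root

    def read(self):
        return self.blocks[self.root]

store = CowStore()
snapshot = store.update({"a.txt": "v1"})
store.update({"a.txt": "v2"}, crash_before_swap=True)
after_crash = dict(store.read())
store.update({"a.txt": "v2"})
print(after_crash, store.read(), store.blocks[snapshot])
```

A crash before the swap leaves the old version fully intact, and keeping the old root id around after a successful update is all a snapshot needs.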

Common bugs and edge cases

  • Trusting a lying write cache. If the drive acknowledges a flush it didn't perform, the journal can record a commit ahead of its data and replay will reconstruct corruption. Disable volatile write caching for the journal device, or rely on checksums and barriers — never assume the disk is honest.
  • Confusing consistency with durability. A journal keeps the file system structurally consistent; it does not guarantee your last write is on disk. Applications that need a specific write to survive must call fsync() (and on the directory too, for renames).
  • The fsync/rename ordering trap. The infamous 2009 ext4 delayed-allocation issue: apps did write(temp); rename(temp, real) expecting atomic replacement, but without an fsync the data blocks weren't yet allocated at commit, so a crash left a zero-length file. The fix was both kernel heuristics and teaching apps to fsync before rename.
  • Replay must be idempotent. Recovery may run, crash mid-replay, and run again. Physical redo logging is naturally idempotent (writing the same new value twice is harmless); logical logging is not, and needs sequence numbers (LSNs) to know what's already applied.
  • Journal too small for the workload. If transactions fill the circular log faster than checkpointing can drain it, writers stall waiting for space. Bursty metadata workloads sometimes need a larger journal (tune2fs -J size=) or an external journal device.
  • Double-journaling on a journaled database. Running a database with its own write-ahead log on top of data=journal writes the same bytes up to four times. For DB volumes, ordered or writeback mode plus the app's own journal is the right layering.
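The fsync-before-rename advice above can be sketched in application code. This is a minimal example for POSIX systems, not a hardened library routine; `atomic_replace` is a hypothetical helper name, and opening a directory with `os.open` for the final fsync is not expected to work on Windows.

```python
import os
import tempfile

def atomic_replace(path, data: bytes):
    """Replace `path` so that after a crash it holds either the old or the new bytes."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)   # temp file on the same file system
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()                # Python buffer -> kernel
            os.fsync(f.fileno())     # kernel -> stable storage, BEFORE the rename
        os.replace(tmp, path)        # atomic rename over the target
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)           # do not leave a stray temp file behind
        raise
    # fsync the directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Skipping the first fsync reproduces the zero-length-file hazard described above; skipping the directory fsync means the rename may not survive a crash even though the data did.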

Frequently asked questions

Does journaling guarantee my file data survives a crash?

Not by default. Most journaling file systems only journal metadata — inodes, bitmaps, directory blocks. In ext4's default ordered mode, your file's data blocks are flushed before the metadata commits, so you never read garbage, but the most recent unsynced writes can still be lost. Only data=journal mode logs the file contents themselves, at roughly half the write throughput.

Why is a journaling file system faster to recover than fsck?

A full fsck scans every inode and block in the file system to rebuild a consistent picture, which scales with disk size — minutes to hours on a multi-terabyte volume. Journal recovery only replays the committed transactions still sitting in the log, a fixed-size region (typically 128 MB on ext4), so it finishes in seconds regardless of how big the disk is.

What stops the file system from replaying a half-written transaction?

Each transaction ends with a commit record, and the journal only treats a transaction as complete once that commit block is durably on disk. On recovery the file system replays transactions up to the last valid commit and discards anything after it. Modern journals also protect against the commit block itself landing before its data using a checksum (ext4's journal_checksum) so a torn write is detected, not blindly trusted.

Isn't writing everything twice wasteful?

Yes — the double-write is the core tax of physical journaling, and it is why metadata-only journaling is the default. The win is that journal writes are sequential and batched: many small scattered metadata updates get coalesced into one large append to the log, which is far cheaper on both spinning disks and SSDs than the random in-place writes they replace. Log-structured and copy-on-write file systems avoid the double write entirely by never overwriting in place.

What is the difference between writeback, ordered, and journal mode in ext4?

Writeback journals only metadata and imposes no ordering, so after a crash a metadata block can point at data that was never written — you may see stale or garbage contents. Ordered (the default) journals metadata but forces data blocks out before the metadata commit, preventing garbage. Journal (data=journal) logs both data and metadata to the journal first, giving the strongest guarantee at the highest cost.

Do SSDs still need journaling?

Yes. Journaling protects against a crash leaving file-system structures inconsistent, which is independent of the storage medium. SSDs do their own internal logging (the flash translation layer is essentially a log), but that protects the device's block mapping, not your inode and directory consistency. A power loss can still tear a multi-block file-system update, so the journal is still doing real work.