Biotechnology

DNA Sequencing

Reading the exact order of the bases

DNA sequencing is the process of determining the exact order of the four nucleotide bases — adenine (A), thymine (T), cytosine (C) and guanine (G) — along a strand of DNA. The classic Sanger method copies a template with DNA polymerase in the presence of chain-terminating bases that stop extension at random positions, producing a nested ladder of fragments of every length; fluorescently labeled fragments are sorted by size, and the order in which they emerge spells the sequence. Modern next-generation platforms read millions of short fragments in parallel, while nanopore devices read single native molecules in real time. The first human genome — about 3.2 billion base pairs — took 13 years and ~$3 billion; today the same genome costs a few hundred dollars and finishes in roughly a day.

  • Invented by: Frederick Sanger, 1977 (chain-termination)
  • Reads: the order of A, T, C, G bases
  • Sanger read length: ~500–1,000 bases, >99.99% accurate
  • Human genome: ~3.2 billion base pairs
  • Cost trajectory: $3 billion (2003) → ~$200 (today)
  • Long reads: Nanopore, >100,000 bases per read

What DNA sequencing actually reads

A DNA molecule is a long double helix, but the information it carries is one-dimensional: a string drawn from a four-letter alphabet (A, T, C, G). Sequencing is the act of reading that string off, in order, with no gaps. A gene like the one encoding human beta-globin is only about 1,600 bases long; one copy of the human genome is roughly 3.2 billion base pairs spread across 23 chromosomes, and most cells carry two copies (23 chromosome pairs). Reading a single base by eye is impossible: each is a sub-nanometer chemical group buried inside a polymer thinner than the wavelength of visible light. Every sequencing technology is, at bottom, a clever trick for converting "which base is here?" into a signal a machine can detect: a color, an electrical current, or a flash of light.

The reason the order matters is that the order is the meaning. The genetic code reads bases three at a time, so a single substitution can change one amino acid. One such change, an A→T substitution at one position in the beta-globin gene, causes sickle-cell anemia. To find such a variant you do not need to understand the protein; you only need to read the sequence precisely enough to see the letter that differs from the reference.
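Finding a variant by comparison with a reference can be sketched in a few lines. This is a minimal illustration, not a real variant caller: the nine-base strings are a short illustrative stretch around the affected codon, the function names are made up for this example, the reading frame is assumed to start at the first base, and the codon table holds only the two codons involved in the sickle-cell change (GAG for glutamate, GTG for valine).

```python
# Toy example: spot a single-base substitution and its effect on the codon.
# Only the two codons needed here are listed; a real tool would use the
# full 64-codon genetic code.
CODON_TABLE = {"GAG": "Glu", "GTG": "Val"}


def find_substitutions(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only compares equal-length sequences")
    return [
        (i, ref_base, alt_base)
        for i, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


def codon_change(reference: str, sample: str, position: int) -> tuple[str, str]:
    """Translate the codon containing `position` in both sequences.

    Assumes the reading frame starts at index 0 of the strings.
    """
    start = (position // 3) * 3
    return CODON_TABLE[reference[start:start + 3]], CODON_TABLE[sample[start:start + 3]]


reference = "CCTGAGGAG"  # illustrative stretch, codons CCT GAG GAG
sample = "CCTGTGGAG"     # same stretch with the A->T substitution

for pos, ref_base, alt_base in find_substitutions(reference, sample):
    print(pos, ref_base, alt_base, codon_change(reference, sample, pos))
# prints: 4 A T ('Glu', 'Val')
```

Note that the comparison itself knows nothing about proteins; the translation step is only there to show why one differing letter matters.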

Sanger sequencing: the chain-terminator trick

Frederick Sanger's 1977 method — which won him a second Nobel Prize — is still the conceptual heart of the field and the gold standard for accuracy. It exploits the chemistry of DNA polymerase, the enzyme that copies DNA. Polymerase builds a new strand by adding nucleotides to the free 3'-hydroxyl (3'-OH) group at the growing end. Sanger's insight was to mix in a small fraction of dideoxynucleotides (ddNTPs) — bases that are missing that 3'-OH. The moment a ddNTP is incorporated, there is no hook for the next base, and that copy terminates. These are the "chain terminators."

Because termination is random — sometimes the polymerase adds a normal base, sometimes a terminator — a single template produces a population of copies that stop at every possible position: one ending at the first A, another at the third C, another at the tenth G, and so on. This nested set of fragments, differing in length by a single base, is the key. In the modern version, each of the four ddNTPs (ddA, ddT, ddC, ddG) carries a distinct fluorescent dye, so a fragment's final base is color-coded.

The fragments are then pushed through a thin capillary filled with gel in a process called capillary electrophoresis. DNA is negatively charged, so an electric field drags it through the gel; shorter fragments move faster and exit first. As each fragment passes a laser at the end of the capillary, its dye fluoresces, and a detector logs the color. Read the colors in the order they arrive — shortest fragment to longest — and you have read the sequence base by base. The output, a series of colored peaks, is called a chromatogram. A single Sanger run yields a high-quality read of roughly 500 to 1,000 bases at better than 99.99% per-base accuracy.
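The logic of random termination followed by size sorting can be simulated as a toy model. The sketch below makes several simplifying assumptions: it works directly on the sequence of the newly synthesized strand (ignoring that this strand is the complement of the template), it uses a single made-up termination probability per base, and the function names are hypothetical. The terminal base of each fragment stands in for its dye color.

```python
import random


def sanger_fragments(new_strand: str, n_molecules: int = 5000,
                     p_terminate: float = 0.05, seed: int = 0) -> list[str]:
    """Simulate chain termination for many copies of one strand.

    At each position a copy picks up a terminator with probability
    p_terminate and stops there; otherwise it keeps extending.
    """
    rng = random.Random(seed)
    fragments = []
    for _ in range(n_molecules):
        for i in range(len(new_strand)):
            if rng.random() < p_terminate:
                fragments.append(new_strand[: i + 1])
                break  # this copy is terminated
    return fragments


def read_by_size(fragments: list[str]) -> str:
    """Mimic electrophoresis: order fragments shortest to longest and
    report the terminal (dye-labeled) base seen at each length."""
    terminal_base = {len(f): f[-1] for f in fragments}
    return "".join(terminal_base[n] for n in sorted(terminal_base))


strand = "ATGCCGTAAGCTTGACCGTA"
print(read_by_size(sanger_fragments(strand)) == strand)
```

With enough molecules every fragment length is represented, so reading terminal bases in size order recovers the strand. If some length were missing, this toy would silently skip that base, which is one reason real reads degrade toward their ends.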

Next-generation sequencing: massive parallelism

Sanger reads one fragment per capillary, which is far too slow for a 3.2-billion-base genome. Next-generation sequencing (NGS), dominated by Illumina, changed the economics by reading millions to billions of fragments at once. The genome is shattered into short pieces, and each piece is anchored to a flow-cell surface and copied locally into a tight cluster of identical molecules (so its signal is bright enough to see). Sequencing then proceeds by synthesis, one base per cycle: the machine adds fluorescently labeled, reversibly terminated nucleotides, images the entire surface to record which color appeared at each of the billions of clusters, chemically removes the dye and the block, and repeats. After 100–300 cycles, every cluster has yielded a short read.
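The cycle-by-cycle structure of sequencing by synthesis can be sketched as a loop that takes one "image" of all clusters per cycle. This is a toy under stated assumptions: the dye colors are invented for illustration, each cluster is represented by the sequence being synthesized on it, and imaging is assumed to be error-free.

```python
# Hypothetical dye assignment, for illustration only.
DYE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_DYE = {color: base for base, color in DYE.items()}


def sequence_by_synthesis(clusters: list[str], n_cycles: int) -> list[str]:
    """Build one read per cluster, adding one base per imaging cycle."""
    reads = ["" for _ in clusters]
    for cycle in range(n_cycles):
        # One image per cycle: the color showing at every cluster at once.
        image = [DYE[seq[cycle]] if cycle < len(seq) else None for seq in clusters]
        for idx, color in enumerate(image):
            if color is not None:
                reads[idx] += BASE_FOR_DYE[color]
        # On a real instrument the dye and the block would be removed here.
    return reads


print(sequence_by_synthesis(["ACGTAC", "GGTCA"], n_cycles=4))
# prints: ['ACGT', 'GGTC']
```

The point of the sketch is the loop order: the outer loop is over cycles, not over fragments, which is why adding more clusters costs almost no extra time.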

Individual NGS reads are short (50–300 bases), but the sheer number makes up for it. Software either aligns reads to a known reference genome or assembles them de novo from their overlaps. The crucial metric is coverage (depth): the average number of reads spanning each position. Human clinical genomes are typically sequenced to 30× coverage, meaning each base is read about thirty times on average, so that random errors average out and true variants stand out from noise. Because individual NGS base calls (~99.9% accurate) are less accurate than Sanger's, redundancy is how NGS achieves reliable variant detection.
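Average coverage follows from simple arithmetic: coverage ≈ (number of reads × read length) / genome size. The sketch below rearranges that to estimate how many reads a target depth requires, and shows a majority vote over the bases observed at one position as the simplest possible form of "errors average out". The 150-base read length is an assumed value within the 50–300 range above, and the function names are hypothetical; real variant callers use base qualities and statistical models rather than a plain vote.

```python
from collections import Counter


def reads_needed(genome_size: int, read_length: int, coverage: int) -> int:
    """Number of reads for a target average depth (integer ceiling)."""
    total_bases = coverage * genome_size
    return (total_bases + read_length - 1) // read_length


def consensus_base(pileup: str) -> str:
    """Majority vote over the bases that reads report at one position."""
    return Counter(pileup).most_common(1)[0][0]


# 30x coverage of a 3.2-billion-base genome with 150-base reads (assumed length).
print(reads_needed(3_200_000_000, 150, 30))  # prints: 640000000

# Thirty observations of one position, two of them sequencing errors.
print(consensus_base("AAAAAAAAAGAAAAATAAAAAAAAAAAAAA"))  # prints: A
```

So a 30× genome at this read length needs on the order of 640 million reads, which is only practical because they are read in parallel.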

Long reads and reading the native molecule

Short reads struggle with repetitive regions — when a sequence repeats thousands of times, a 150-base read cannot tell which copy it came from, leaving gaps in assemblies. Long-read platforms solve this. PacBio's single-molecule real-time (SMRT) sequencing watches a single polymerase add fluorescent bases in a tiny well, producing reads tens of thousands of bases long. Nanopore sequencing (Oxford Nanopore) is even more direct: it reads the native DNA molecule with no copying and no dyes at all. A single strand is pulled through a nanometer-wide protein pore in a membrane under a voltage; as each base passes, it partially blocks the ionic current by a characteristic amount, and that current trace is decoded into sequence in real time.
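The idea of decoding a current trace can be illustrated with a deliberately oversimplified model. Everything numeric below is invented: the four current levels are hypothetical, and the trace is assumed to be already segmented into exactly one measurement per base. Real basecallers are far more involved, since several neighboring bases influence the current at once; this sketch only follows the one-base-one-level picture used in the paragraph above.

```python
# Hypothetical characteristic current levels in picoamps (not real values).
LEVELS = {"A": 50.0, "C": 60.0, "G": 70.0, "T": 80.0}


def decode_trace(trace: list[float]) -> str:
    """Call each measurement as the base whose level is nearest."""
    return "".join(
        min(LEVELS, key=lambda base: abs(LEVELS[base] - current))
        for current in trace
    )


# A noisy, pre-segmented trace with one measurement per base.
print(decode_trace([51.2, 78.9, 61.0, 69.5]))  # prints: ATCG
```

Noise that pushes a measurement closer to a neighboring level produces a wrong call, which gives an intuition for why raw single-molecule reads carry a higher error rate.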

Because nanopore reads the molecule itself, it can span 100,000+ bases in a single read, detect chemical modifications such as DNA methylation directly, and run on a device the size of a USB stick. That portability let researchers sequence Ebola in West African field clinics and Zika on the move. The trade-off is a higher raw error rate, again managed with coverage and consensus.

Sequencing methods compared

Method | Read length | Reads per run | Raw accuracy | Best for
Sanger (capillary) | 500–1,000 bp | ~96 (one per capillary) | >99.99% | Single targets, confirming variants
Illumina (NGS) | 50–300 bp | Billions | ~99.9% | Whole genomes, exomes, RNA-seq
PacBio HiFi | 10,000–25,000 bp | Millions | >99.9% (consensus) | De novo assembly, phasing
Oxford Nanopore | 10,000–>100,000 bp | Millions | ~95–99% | Long reads, field work, methylation

The collapsing cost of a genome

The most staggering fact about sequencing is its price curve. The Human Genome Project finished the first reference genome in 2003 after about 13 years of work by an international consortium, at a cost near $3 billion. As NGS matured, the per-genome cost fell far faster than Moore's law for chips: roughly $10 million in 2007, the long-promised "$1,000 genome" by 2014, and today a 30× human genome runs in the low hundreds of dollars and finishes in about a day. This roughly ten-million-fold drop is precisely why sequencing migrated from a once-in-history megaproject to a routine line item on a clinical lab order.
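The size of the drop, and the comparison with Moore's law, can be checked with back-of-the-envelope arithmetic. The inputs are assumptions chosen to match the figures above: $300 stands in for "low hundreds of dollars", 20 years stands in for the span since 2003, and Moore's law is taken in its common form of a doubling every two years.

```python
def fold_drop(start_cost: float, end_cost: float) -> float:
    """How many times cheaper the end cost is than the start cost."""
    return start_cost / end_cost


def moore_fold(years: float, doubling_period_years: float = 2.0) -> float:
    """Improvement factor if performance per dollar doubles every period."""
    return 2 ** (years / doubling_period_years)


print(fold_drop(3e9, 300))  # prints: 10000000.0
print(moore_fold(20))       # prints: 1024.0
```

Under these assumptions the genome became about ten million times cheaper over a period in which a two-year doubling would account for only about a thousandfold.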

Why sequencing matters

  • Medical diagnosis. Pinpoints the exact mutation behind inherited disease and guides newborn screening.
  • Cancer genomics. Profiles a tumor's mutations to match patients to targeted therapies.
  • Pathogen surveillance. SARS-CoV-2 variants were tracked by sequencing millions of viral genomes worldwide.
  • Evolution and the tree of life. Shared mutations reconstruct relationships; ancient DNA reads Neanderthal and mammoth genomes.
  • Forensics and ancestry. Identification and genealogy from genetic markers.
  • Agriculture and conservation. Breeding, biodiversity monitoring, and population genetics.

Common misconceptions

  • "Sequencing reads the whole genome in one piece." No — DNA is fragmented first; reads are short and reassembled computationally.
  • "More reads always means better." Quality and even coverage matter; biased or error-prone reads can mislead.
  • "Sequencing tells you what genes do." It tells you the order of bases; interpreting function needs separate analysis.
  • "Sanger is obsolete." It is still the accuracy benchmark and the go-to for single, short targets.
  • "A genome is a single answer." Coverage, variant-calling thresholds, and the reference used all shape the result.

Frequently asked questions

How does DNA sequencing work?

At its core, sequencing converts the chemical order of bases into a readable signal. In Sanger sequencing, DNA polymerase copies a single-stranded template, but the reaction is spiked with chain-terminating dideoxynucleotides (ddNTPs) that lack the 3'-OH needed to add the next base. Each time one is incorporated, that copy stops, so the reaction produces fragments ending at every possible position. Each ddNTP carries a different fluorescent dye, so a fragment's terminal base is color-coded. Separating the fragments by size — shortest first — and reading the colors in order spells out the sequence one base at a time.

What is the difference between Sanger and next-generation sequencing?

Sanger sequencing reads one DNA fragment at a time, giving long (~500–1000 base), highly accurate reads (>99.99%), but it is slow and expensive per base. Next-generation sequencing (NGS), such as Illumina, reads millions to billions of short fragments (typically 50–300 bases) simultaneously on a flow cell, detecting one fluorescent base per cycle across the whole surface. NGS slashed the cost of a human genome from billions of dollars to a few hundred. Sanger remains the gold standard for short, single targets and for confirming NGS calls; NGS dominates whole-genome and large-scale work.

What is a 'read' in DNA sequencing?

A read is a single stretch of sequence the machine reports: the inferred order of bases for one fragment. Reads are short relative to genomes: Illumina reads run 50–300 bases, while long-read platforms (PacBio, Oxford Nanopore) routinely produce reads of 10,000 to more than 100,000 bases. Because a genome is far longer than any read, software stitches overlapping reads back together, either by alignment to a reference or by de novo assembly. 'Coverage' (or depth) is how many reads cover each position on average; 30× coverage of a human genome means each base is read about 30 times, which is enough to call variants confidently.
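Stitching reads together from their overlaps can be shown with two toy reads. This is a minimal sketch of the overlap idea only: it merges a pair of error-free reads by their longest exact suffix-to-prefix overlap, with hypothetical function names and an arbitrary minimum overlap of 3 bases. Real assemblers handle millions of reads, sequencing errors, and repeats, typically with graph-based methods.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> str:
    """Join two reads into one longer sequence across their overlap."""
    n = overlap_length(left, right, min_overlap)
    if n == 0:
        raise ValueError("reads do not overlap enough to merge")
    return left + right[n:]


print(merge_reads("ATGCGTAC", "GTACCTTA"))  # prints: ATGCGTACCTTA
```

The minimum-overlap threshold matters: with very short overlaps, unrelated reads match by chance, which is the same ambiguity that repeats cause at larger scale.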

How much does it cost to sequence a human genome?

The first human genome, finished by the Human Genome Project in 2003, took about 13 years and roughly $3 billion. Costs then fell faster than Moore's law: by 2007 a genome was ~$10 million, by 2014 the '$1,000 genome' arrived, and today whole-genome sequencing at 30× coverage runs in the low hundreds of dollars and finishes in roughly a day. This collapse in cost is why sequencing has moved from a grand international project to a routine clinical and research tool.

What is nanopore sequencing?

Nanopore sequencing reads DNA directly, with no copying and no fluorescent labels. A single DNA strand is threaded through a nanometer-scale protein pore embedded in a membrane while a voltage is applied. As each base passes, it blocks the ionic current through the pore by a characteristic amount; the pattern of current dips is decoded into a sequence in real time. Because it reads the native molecule, nanopore can produce very long reads (tens to hundreds of thousands of bases), detect base modifications like methylation directly, and run on a device the size of a USB stick — even in the field.

What is DNA sequencing used for?

Sequencing underpins much of modern biology and medicine. Clinically it diagnoses inherited disorders, profiles tumors to guide targeted therapy, screens newborns, and tracks pathogens — SARS-CoV-2 variants were identified by sequencing millions of viral genomes. It powers ancestry and forensic identification, reconstructs the evolutionary tree of life from shared mutations, reads ancient DNA from Neanderthal bones, and enables agriculture and conservation genetics. Essentially any question about the precise genetic content of an organism is answered by reading its sequence.