
Selective Sweep

A strongly favored mutation drags its neighboring DNA to high frequency — erasing variation around it through genetic hitchhiking

A selective sweep is the rise of a strongly favored mutation to high frequency or fixation, which drags the neighboring DNA it sits on along for the ride — erasing genetic variation in a window that can span hundreds of kilobases. Maynard Smith and Haigh named the effect "genetic hitchhiking" in 1974; real sweeps include lactase persistence (LCT/MCM6 -13910*T, ~10,000 years), the Duffy-null malaria allele, and the Tibetan EPAS1 altitude variant introgressed from Denisovans.

  • Mechanism: Genetic hitchhiking (Maynard Smith & Haigh 1974)
  • Genomic signature: Diversity valley + long haplotypes
  • SFS skew: Tajima's D < 0, Fay & Wu's H < 0
  • Window width: scales with s / r; up to ~1 Mb in humans
  • Detection: iHS, XP-EHH, SweepFinder
  • Classic case: Lactase persistence, s ≈ 0.01–0.10

How a selective sweep works

Imagine a population of chromosomes, each a slightly different string of DNA. Most of the differences between them are neutral — single-nucleotide variants that do nothing. Now a beneficial mutation appears on exactly one of those chromosomes. That chromosome already carries a particular set of neutral variants, a specific haplotype. Natural selection cannot pick out the one beneficial base and copy it alone; it can only favor the whole chromosome. So as the beneficial allele rises from one copy to high frequency, every neutral variant riding on the same haplotype rises with it. The variants that happened to sit on rival chromosomes are driven out. By the time the favored allele reaches fixation, the population is nearly uniform around the selected site — that is the selective sweep.

John Maynard Smith and John Haigh formalized this in 1974 in a paper titled "The hitch-hiking effect of a favourable gene." Their key insight was that selection at one locus has collateral consequences for linked loci, and that the reach of those consequences is set by a tug-of-war between selection and recombination. Selection drives the favored haplotype up; recombination tries to peel neutral variants off it and onto other backgrounds. The closer a neutral site sits to the selected locus, the harder it is for recombination to rescue its variation before the sweep completes. The farther away, the more chances recombination has. The outcome is a V-shaped valley of diversity centered on the selected site, deep in the middle and recovering toward the edges.

Three things have to be true for a strong, clean sweep. The selection must be strong (a large selection coefficient s) so the rise is fast. The mutation must be new and on a single background (a hard sweep) so there is only one winning haplotype. And recombination near the locus must be modest, so it cannot break the linkage in time. Loosen any of these and the sweep softens, broadens, or fades.
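The tug-of-war between selection and recombination described above can be sketched with a small deterministic two-locus model. This is an illustration under stated assumptions, not Maynard Smith and Haigh's original derivation: a haploid population, a favored allele B with fitness 1 + s that starts at frequency 1/(2N) on a chromosome carrying neutral allele A, and no genetic drift. The function name `hitchhike` and the parameters `q0` and `two_n` are hypothetical names chosen for this sketch.

```python
def hitchhike(s, r, q0=0.2, two_n=20_000, max_gen=100_000):
    """Deterministic two-locus haploid sketch (assumed model, no drift).

    s     : advantage of favored allele B
    r     : recombination fraction between the selected and neutral site
    q0    : starting frequency of neutral allele A (hypothetical default)
    two_n : number of chromosomes, so B starts at 1 / two_n
    Returns (final frequency of A, generations until B is nearly fixed).
    """
    p0 = 1.0 / two_n
    # Haplotype frequencies: B arises on a chromosome that carries A.
    x_BA, x_Ba = p0, 0.0
    x_bA, x_ba = (1 - p0) * q0, (1 - p0) * (1 - q0)
    gens = 0
    while x_BA + x_Ba < 1 - p0 and gens < max_gen:
        # Selection: B-bearing haplotypes have fitness 1 + s.
        w_bar = 1 + s * (x_BA + x_Ba)
        x_BA, x_Ba = x_BA * (1 + s) / w_bar, x_Ba * (1 + s) / w_bar
        x_bA, x_ba = x_bA / w_bar, x_ba / w_bar
        # Recombination erodes the association (linkage disequilibrium).
        d = x_BA * x_ba - x_Ba * x_bA
        x_BA -= r * d
        x_ba -= r * d
        x_Ba += r * d
        x_bA += r * d
        gens += 1
    return x_BA + x_bA, gens


if __name__ == "__main__":
    for r in (0.0, 0.0005, 0.005, 0.05):
        freq_a, gens = hitchhike(s=0.05, r=r)
        print(f"r={r:<7} final freq of A={freq_a:.3f} after {gens} generations")
```

With complete linkage (r = 0) the neutral allele rides to near fixation; as r approaches s it ends close to its starting frequency, which is the hitchhiking effect fading with distance.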

The genomic signature step by step

A completed hard sweep stamps a recognizable, three-part fingerprint on the genome. Read in order, these are the clues geneticists use to find sweeps without ever seeing them happen.

  • 1. A valley of reduced diversity. Nucleotide diversity, written π (pi) — the average number of pairwise differences per site — collapses near the selected locus. In a deep sweep it can fall close to zero at the center and rise back to the genome-wide background over hundreds of kilobases. Heterozygosity (the fraction of individuals carrying two different alleles) drops the same way.
  • 2. A skewed site-frequency spectrum. Right after the sweep, the swept region is a single haplotype, so almost no variation exists. New mutations then start accumulating, and because they are all young, they are rare — an excess of low-frequency variants. This drives Tajima's D strongly negative. Meanwhile, the few neutral variants that hitchhiked all the way up are now at high frequency in the derived (mutant) state — an excess of high-frequency derived alleles that drives Fay and Wu's H negative. The pair together is a much stronger signal than either alone.
  • 3. Long, shared haplotypes. Because the winning haplotype spread faster than recombination could chop it up, many individuals carry the same long unbroken stretch of identical DNA around the locus. This long-range linkage disequilibrium is measured by extended haplotype homozygosity (EHH) and summarized as iHS (within one population) and XP-EHH or Rsb (comparing two populations). Long haplotypes are the best signal for recent or ongoing sweeps, because the diversity valley needs time to fully form.

The first two signatures fade as time and mutation refill the valley. The third fades fastest of all because recombination steadily erodes long haplotypes — which is why long-haplotype tests in humans detect mostly sweeps from the last ~30,000 years, while the diversity valley can reveal older events.
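The first two signatures can be computed directly from a sample of haplotypes. The sketch below uses the standard textbook definitions of nucleotide diversity, Watterson's estimator, Tajima's D and the unnormalized Fay and Wu's H; the input encoding (strings of '0' for ancestral and '1' for derived) and the function name `sfs_statistics` are assumptions made for this example, and the toy sample is invented purely to show the signs.

```python
import math


def sfs_statistics(haplotypes):
    """haplotypes: equal-length strings, '0' = ancestral, '1' = derived.

    Returns a dict with pi, Watterson's theta, Tajima's D and Fay & Wu's H
    (all per region, not per site). D is None if nothing segregates.
    """
    n = len(haplotypes)
    # Derived-allele count at every segregating site.
    counts = []
    for j in range(len(haplotypes[0])):
        k = sum(h[j] == "1" for h in haplotypes)
        if 0 < k < n:
            counts.append(k)
    seg = len(counts)
    pairs = n * (n - 1) / 2
    pi = sum(k * (n - k) for k in counts) / pairs       # mean pairwise differences
    theta_h = sum(k * k for k in counts) / pairs        # weights high-frequency derived alleles
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = seg / a1
    tajima_d = None
    if seg > 0:
        b1 = (n + 1) / (3 * (n - 1))
        b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
        c1 = b1 - 1 / a1
        c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
        e1 = c1 / a1
        e2 = c2 / (a1 ** 2 + a2)
        tajima_d = (pi - theta_w) / math.sqrt(e1 * seg + e2 * seg * (seg - 1))
    return {"pi": pi, "theta_w": theta_w, "tajima_d": tajima_d, "fay_wu_h": pi - theta_h}


# Toy post-sweep sample (invented): 6 singleton sites plus 4 sites where
# the derived allele is carried by 9 of 10 chromosomes.
sample = []
for i in range(10):
    row = ["0"] * 10
    if i < 6:
        row[i] = "1"
    for j in range(6, 10):
        if i != j:
            row[j] = "1"
    sample.append("".join(row))

if __name__ == "__main__":
    print(sfs_statistics(sample))
```

In this toy sample both Tajima's D and Fay and Wu's H come out negative, the combination the text describes for a recent hard sweep.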

When sweeps happen — players and conditions

Selective sweeps are the genomic footprint of strong, recent, positive selection. They show up whenever an environment changes fast enough that one allele confers a large advantage:

  • New diets and lifestyles. Lactase persistence is the textbook case: the spread of dairying gave milk-digesters a major caloric and nutritional edge, and the -13910*T regulatory allele swept across Europe in roughly 350 generations, leaving a ~1 Mb haplotype.
  • Pathogen pressure. Malaria has driven some of the strongest known human sweeps. The Duffy-null FY*O allele, which switches off the Duffy antigen receptor that Plasmodium vivax uses to enter red blood cells, reached near-fixation across sub-Saharan Africa. Sickle-cell and G6PD-deficiency alleles show related, more complex selection signatures.
  • New altitudes and climates. The EPAS1 allele that blunts the overproduction of red blood cells at altitude reached ~87 percent in Tibetans — and was introgressed from Denisovans, an archaic human group. Light-skin pigmentation swept in Europeans at SLC24A5 (the Ala111Thr / A111T change, rs1426654), one of the largest-effect pigmentation loci known.
  • Human-imposed selection. Pesticide and insecticide resistance produces rapid sweeps in pests — the kdr sodium-channel mutations and Ace-1 acetylcholinesterase variants in Anopheles and Drosophila. Domestication left sweep signatures at coat-color, body-size and behavior loci in dogs, cattle and maize (e.g. tb1 for plant architecture).
  • Standing variation and recurrence. When the beneficial allele is already common before selection starts, or arises many times, you get a soft sweep — several haplotypes rising together. Soft sweeps dominate in large, mutation-rich populations and fast-adapting microbes and viruses.

Hard sweep vs soft sweep

Property | Hard sweep | Soft sweep
Origin of beneficial allele | Single new mutation, one copy | Standing variation or recurrent mutation
Number of winning haplotypes | One | Two or more
Diversity reduction | Near-total at the center | Partial; several backgrounds survive
Tajima's D signal | Strongly negative | Weak or near zero
Long-haplotype signal (iHS/EHH) | Strong, single dominant haplotype | Diffuse across multiple haplotypes
Detectability | Easy with classical scans | Hard; often missed by standard tests
Favored by | Small populations, low mutation supply, novel environments | Large populations, high mutation supply, common targets
Typical example | Lactase persistence, Duffy-null | Insecticide resistance in mosquitoes, viral drug resistance
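The "number of winning haplotypes" row can be turned into a number. One option, not discussed elsewhere in this article, is the haplotype-homozygosity family of statistics (H1, H12 and H2/H1) introduced by Garud and colleagues for soft-sweep detection. The sketch below is a minimal version under the assumption that each haplotype in the window is given as a string; the function name and the toy samples are hypothetical.

```python
from collections import Counter


def haplotype_homozygosity(haplotypes):
    """Return (H1, H12, H2/H1) for a list of haplotype strings in one window.

    H1    = sum of squared haplotype frequencies
    H12   = H1 after pooling the two most common haplotypes
    H2/H1 = share of homozygosity not due to the most common haplotype
    """
    n = len(haplotypes)
    freqs = sorted((c / n for c in Counter(haplotypes).values()), reverse=True)
    p1 = freqs[0]
    p2 = freqs[1] if len(freqs) > 1 else 0.0
    h1 = sum(p * p for p in freqs)
    h12 = (p1 + p2) ** 2 + sum(p * p for p in freqs[2:])
    h2_over_h1 = (h1 - p1 * p1) / h1
    return h1, h12, h2_over_h1


# Invented toy windows: one dominant haplotype vs two co-dominant ones.
hard_like = ["AAAA"] * 18 + ["AAAT", "TAAA"]
soft_like = ["AAAA"] * 9 + ["CCCC"] * 9 + ["AAAT", "TAAA"]

if __name__ == "__main__":
    print("hard-like:", haplotype_homozygosity(hard_like))
    print("soft-like:", haplotype_homozygosity(soft_like))
```

Both toy windows have high H12, but H2/H1 is near zero when one haplotype dominates and substantially larger when two haplotypes share the win, which mirrors the hard versus soft contrast in the table.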

Selective sweep vs other ways diversity is lost

A sweep is not the only process that erases variation. Two others look superficially similar in some statistics, and telling them apart is the central challenge of detecting selection.

Property | Selective sweep | Genetic drift / bottleneck | Background selection
Cause | Positive selection on one linked locus | Random sampling; population crash | Purifying selection removing deleterious mutations
Genomic scale | Localized: one valley at one locus | Genome-wide | Localized to low-recombination, gene-dense regions
Diversity effect | Sharp local valley of π | Uniform reduction everywhere | Gradual reduction, no sharp center
Site-frequency spectrum | Excess rare + excess high-frequency derived variants | Excess rare variants only, after post-bottleneck expansion (across whole genome) | Excess rare variants, milder
Long shared haplotypes | Yes, one dominant haplotype | No specific haplotype enrichment | No
Population specificity | Often one population (local adaptation) | Affects all loci in that population | Same across populations
Key discriminator | Local outlier vs genome background | Genome-wide flatness | Correlates with recombination + gene density, not with a single peak
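The "local outlier vs genome background" discriminator can be illustrated with a simple window scan. This is a hedged sketch, not a published method: it assumes per-window π values are already computed, uses a median/MAD robust z-score, and the cutoff of -2.5 is an arbitrary choice for the example. The function name and the toy values are hypothetical.

```python
from statistics import median


def low_diversity_outliers(window_pi, z_cutoff=-2.5):
    """Flag windows whose diversity is far below the genome-wide background.

    window_pi : per-window nucleotide diversity values along a chromosome
    z_cutoff  : robust z-score threshold (arbitrary, assumed for this sketch)
    Returns a list of (window index, robust z-score) for flagged windows.
    """
    values = list(window_pi)
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to standardize against
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    flagged = []
    for i, v in enumerate(values):
        z = (v - med) / scale
        if z < z_cutoff:
            flagged.append((i, z))
    return flagged


# Invented example: background near 0.001 with a valley in windows 6 to 8.
toy_pi = [0.0011, 0.0009, 0.0010, 0.0012, 0.0008, 0.0010,
          0.0002, 0.0001, 0.0003, 0.0011, 0.0009, 0.0010]

if __name__ == "__main__":
    for idx, z in low_diversity_outliers(toy_pi):
        print(f"window {idx}: z = {z:.2f}")
```

Because the score is measured against the genome-wide median, a bottleneck that lowers diversity everywhere shifts the background rather than creating outliers, while a sweep stands out as a local dip. Separating sweeps from background selection would additionally require conditioning on recombination rate and gene density, which this sketch does not do.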

The numbers — how wide, how fast, how strong

The reach of a sweep is governed by the ratio of the selection coefficient s to the recombination rate r between the selected site and a neutral site. As a rule of thumb, diversity is substantially depressed out to a genetic distance on the order of s. Concretely:

  • Selection coefficients. "Strong" human sweeps run s ≈ 0.01–0.10 — a 1 to 10 percent fitness advantage per generation. Lactase persistence is estimated around s ≈ 0.01–0.10 depending on the study; some malaria-resistance loci are even higher.
  • Time to fixation. A new additive allele with selection coefficient s in a population of effective size Ne typically fixes in roughly (2/s)·ln(2Ne) generations. For s = 0.05 and Ne ≈ 10,000, that is ~400 generations, or roughly 10,000–12,000 years in humans (~25–30 years per generation). That is fast enough that recombination cannot keep up nearby.
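  • Window width. Human recombination averages ~1 centimorgan per megabase (cM/Mb), i.e. ~1 crossover per 100 Mb per generation, so a recombination fraction of 0.01 corresponds to about 1 Mb. A sweep with s = 0.05 depresses diversity out to roughly the distance where r becomes comparable to s. Taken literally, r ≈ s gives an upper bound of ~5 Mb; a commonly used refinement that accounts for how long the sweep takes, r ≈ s / ln(2Ne), gives ~0.5 cM, or about 0.5 Mb on each side. That is hundreds of kilobases to ~1 Mb, the span of the observed lactase-persistence haplotype.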
  • Recombination matters as much as selection. The same s in a recombination cold spot clears a far wider region; in a hotspot, a much narrower one. This is why sweep windows vary by an order of magnitude across the genome.
  • Detection horizon. Long-haplotype statistics (iHS, XP-EHH) in humans are sensitive to sweeps younger than ~30,000 years (~1,000–1,200 generations); the diversity valley and SFS-based scans (SweepFinder) can reach somewhat older events before mutation refills the signal.
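The rules of thumb in this list are easy to tabulate. The sketch below assumes the fixation-time formula quoted above and, for the width, the approximation that diversity is depressed out to a recombination fraction of about s / ln(2Ne); that logarithmic refinement is an assumption of this example (the cruder rule r ≈ s is available through a flag). Function and parameter names are hypothetical.

```python
import math


def fixation_time(s, ne):
    """Approximate generations to fixation: (2 / s) * ln(2 * Ne)."""
    return (2.0 / s) * math.log(2.0 * ne)


def sweep_half_width_mb(s, ne, cm_per_mb=1.0, log_correction=True):
    """Rough one-sided reach of a sweep in megabases.

    log_correction=True  -> r ~ s / ln(2 * Ne)   (assumed refinement)
    log_correction=False -> r ~ s                (crude upper bound)
    """
    r = s / math.log(2.0 * ne) if log_correction else s  # recombination fraction
    centimorgans = r * 100.0                              # 0.01 fraction = 1 cM
    return centimorgans / cm_per_mb


if __name__ == "__main__":
    ne = 10_000
    for s in (0.01, 0.05, 0.10):
        t = fixation_time(s, ne)
        for rate, label in ((0.1, "cold spot"), (1.0, "average"), (10.0, "hotspot")):
            w = sweep_half_width_mb(s, ne, cm_per_mb=rate)
            print(f"s={s:<5} {label:<9} ~{t:5.0f} generations, ~{w:6.3f} Mb per side")
```

For s = 0.05 and Ne = 10,000 this gives about 396 generations and about 0.5 Mb per side at 1 cM/Mb. Changing the local recombination rate by a factor of ten changes the width by the same factor, which is the point of the "recombination matters as much as selection" bullet.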

Worked example — sizing the lactase sweep

Take the European lactase-persistence allele -13910*T. Suppose it confers a per-generation advantage of s = 0.05 in a dairying population of effective size Ne ≈ 10,000. The expected time to fixation is about (2/s)·ln(2Ne) = (2/0.05)·ln(20,000) ≈ 40 × 9.9 ≈ ~400 generations, or roughly 10,000–12,000 years at ~25–30 years per generation — squarely consistent with the spread of dairying since the Neolithic.
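For readers who want to see where the (2/s)·ln(2Ne) figure comes from, here is one short derivation. It assumes the favored allele grows deterministically and logistically at rate s from a single copy, frequency 1/(2Ne), to one copy short of fixation; this is a simplifying assumption of the sketch and ignores drift at the two ends of the trajectory.

```latex
\frac{dx}{dt} = s\,x\,(1-x)
\quad\Longrightarrow\quad
\ln\frac{x(t)}{1-x(t)} = \ln\frac{x_0}{1-x_0} + s\,t .

\text{With } x_0 = \frac{1}{2N_e} \text{ and } x(T) = 1 - \frac{1}{2N_e}:
\qquad
T = \frac{1}{s}\,\ln\!\left[(2N_e - 1)^2\right]
  = \frac{2}{s}\,\ln(2N_e - 1)
  \approx \frac{2}{s}\,\ln(2N_e).

\text{For } s = 0.05,\ N_e = 10{,}000:
\qquad
T \approx \frac{2}{0.05}\,\ln(20{,}000) \approx 40 \times 9.90 \approx 396 \text{ generations}.
```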

How wide a region should hitchhike? To a first approximation, diversity is cleared out to about the map distance where the recombination rate is comparable to the selection coefficient. A recombination fraction of r = s = 0.05 is 5 cM, which at ~1 cM/Mb would be ~5 Mb on each side; that is an upper bound, because recombination has roughly (2/s)·ln(2Ne) generations in which to act. Dividing by ln(2Ne) ≈ 9.9 gives r ≈ 0.005 (0.5 cM), or ~0.5 Mb on each side and ~1 Mb in total. The empirically observed strong-LD haplotype does extend to roughly 1 Mb, among the longest in the human genome. The allele itself reached ~70 percent or more in northern Europeans, and the same selective pressure produced independent persistence alleles (such as -14010*C) that swept separately in African pastoralists: convergent adaptation at one gene, with three or more distinct sweep signatures around the world.

Where sweeps show up in real organisms

  • Humans — recent local adaptation. Lactase persistence (LCT/MCM6), Duffy-null malaria resistance (FY/DARC), Tibetan altitude tolerance (EPAS1, introgressed from Denisovans; EGLN1), light pigmentation (SLC24A5, SLC45A2, KITLG), and ethanol/arsenic-metabolism loci all carry sweep signatures of varying age and strength.
  • Drosophila. A founding system for sweep theory. Sweeps at loci like the insecticide-resistance gene Ace and the Cyp6g1 DDT-resistance allele are textbook examples; Drosophila's large effective population size makes soft sweeps common, which reshaped the field's expectations.
  • Pathogens and pests. Drug-resistance alleles sweep through malaria parasites (pfcrt, dhfr for chloroquine and pyrimethamine resistance), bacteria (antibiotic resistance), and viruses. Insecticide resistance (kdr, Ace-1) sweeps repeatedly through mosquito vectors — often softly.
  • Domestication. Maize architecture (tb1, teosinte branched1), dog coat and body-size loci, chicken thyroid-stimulating-hormone receptor (TSHR), and many crop "domestication syndrome" genes sit under sweep signatures left by thousands of years of artificial selection.
  • Conservation and disease genomics. Sweep scans help map adaptation in threatened species and identify candidate loci for human disease where selection on a pathogen-defense or metabolic allele also nudged disease risk.

Common misconceptions and pitfalls

  • "A sweep removes the beneficial mutation's variation too." The opposite — the favored allele itself goes to high frequency. What is removed is the neutral variation linked to it. The selected site becomes uniform because everyone now carries the beneficial allele, not because it was deleted.
  • "Negative Tajima's D proves positive selection." No. A population that recently expanded after a bottleneck shows negative Tajima's D genome-wide, mimicking a sweep at every locus. The discriminator is that a true sweep is a local outlier — a sharp valley against a flatter genomic background — plus the long-haplotype and high-frequency-derived-allele signals that demography does not produce locally.
  • "All sweeps are hard sweeps." Early theory assumed single new mutations, but standing variation and recurrent mutation produce soft sweeps that retain multiple haplotypes and largely escape classical scans. In large populations soft sweeps may be the rule, not the exception.
  • "Background selection and sweeps are the same." Background selection removes deleterious variants and also lowers diversity in low-recombination regions, but it produces a gradual reduction correlated with gene density and recombination — not a single sharp valley with high-frequency derived alleles. Separating the two is an active methodological problem.
  • "A wider valley means stronger selection." Width depends on s/r, not s alone. A modest sweep in a recombination cold spot can clear a wider region than a strong sweep in a hotspot. You must condition on the local recombination rate before reading off selection strength.
  • "Sweeps last forever in the genome." The signatures decay. Long haplotypes erode by recombination within tens of thousands of years; the diversity valley refills by mutation over hundreds of thousands. Ancient sweeps may leave only faint traces, which is why most detected human sweeps are recent.

Frequently asked questions

What is genetic hitchhiking?

Genetic hitchhiking is the change in frequency of a neutral allele caused not by selection acting on it, but by selection acting on a different allele physically linked to it on the same chromosome. When a strongly beneficial mutation sweeps to high frequency, every neutral variant that happens to lie on the same haplotype rises with it, while variants on competing haplotypes are lost. John Maynard Smith and John Haigh coined the term in their 1974 paper "The hitch-hiking effect of a favourable gene" (Genetical Research). Hitchhiking is the mechanism; a selective sweep is the resulting pattern of reduced variation around the selected site. The width of the swept region depends on the ratio of the selection coefficient s to the recombination rate r — roughly, diversity is depressed out to a genetic distance of order s, so a stronger sweep or a region of low recombination clears a wider window.

What is the difference between a hard sweep and a soft sweep?

In a hard sweep, a single new beneficial mutation arises once and sweeps to fixation, so the entire population ends up carrying one haplotype around the selected site and diversity is almost completely erased. In a soft sweep, the beneficial allele is present on multiple haplotype backgrounds when selection begins — either because the mutation recurred independently several times, or because it was already segregating as standing variation before the environment changed. Multiple haplotypes then rise together, so several distinct backgrounds survive and the diversity reduction is partial rather than total. Soft sweeps leave a much weaker classical signature and are harder to detect; they appear to be common in large populations like Drosophila and in rapidly adapting pathogens, where the supply of new mutations is high. Insecticide-resistance alleles in mosquitoes often sweep softly because resistance mutations arise repeatedly.

How do you detect a selective sweep in genome data?

A completed hard sweep leaves three linked signatures. First, reduced heterozygosity and nucleotide diversity (pi) in a window around the selected site — sometimes a near-total valley of variation. Second, a skewed site-frequency spectrum: an excess of rare variants (because new mutations accumulate after the sweep on the now-uniform background) gives a strongly negative Tajima's D, and an excess of high-frequency derived variants gives a negative Fay and Wu's H. Third, long-range linkage disequilibrium and unusually long, identical haplotypes shared by many individuals — captured by extended haplotype homozygosity (EHH) and its summary statistics iHS (within a population) and XP-EHH or Rsb (between populations). Composite-likelihood methods like SweepFinder and SweeD scan the site-frequency spectrum for the characteristic spatial valley. An ongoing or incomplete sweep is best caught by the long-haplotype tests, because the diversity valley has not yet fully formed.
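The third signature, extended haplotype homozygosity, has a simple definition that can be sketched directly: among chromosomes carrying the core allele, EHH at a given distance is the probability that two randomly chosen chromosomes are identical over the whole stretch from the core to that position. The code below is a minimal illustration of that definition, not the implementation used by any particular tool; the string encoding, the function name `ehh_decay` and the toy haplotypes are assumptions of this example.

```python
from collections import Counter
from math import comb


def ehh_decay(haplotypes, core, allele="1"):
    """EHH around a core site for chromosomes carrying `allele` at `core`.

    haplotypes : equal-length strings (one character per site)
    core       : index of the core (putatively selected) site
    Returns {signed offset from core: EHH value}.
    """
    carriers = [h for h in haplotypes if h[core] == allele]
    n = len(carriers)
    if n < 2:
        raise ValueError("need at least two chromosomes carrying the core allele")
    total_pairs = comb(n, 2)

    def homozygosity(segments):
        # Fraction of pairs that are identical over the whole segment.
        groups = Counter(segments)
        return sum(comb(c, 2) for c in groups.values()) / total_pairs

    result = {}
    length = len(carriers[0])
    for end in range(core, length):          # extend to the right
        result[end - core] = homozygosity(h[core:end + 1] for h in carriers)
    for start in range(core - 1, -1, -1):    # extend to the left
        result[start - core] = homozygosity(h[start:core + 1] for h in carriers)
    return result


# Invented toy data: four carriers of the core allele at index 2, two non-carriers.
toy = ["01110", "01110", "01111", "11110", "00000", "10001"]

if __name__ == "__main__":
    for offset, value in sorted(ehh_decay(toy, core=2).items()):
        print(f"offset {offset:+d}: EHH = {value:.2f}")
```

EHH starts at 1 at the core and decays as recombinant or mutated haplotypes split the carriers into groups. Statistics such as iHS integrate this decay curve and compare the derived and ancestral core alleles; that step is omitted here.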

Is lactase persistence a selective sweep?

Yes — it is one of the best-documented recent sweeps in humans. The ability to digest milk into adulthood is driven mainly by the -13910*T variant (rs4988235) in an enhancer within the MCM6 gene that keeps the neighboring LCT (lactase) gene switched on. After dairy farming spread in Europe over the last ~10,000 years, this allele rose to roughly 70 percent or more in northern Europeans, and it carries one of the longest extended haplotypes ever measured in the human genome — about 1 megabase of strong linkage disequilibrium, the classic signature of a fast, recent sweep. Estimated selection coefficients are large, on the order of 0.01 to 0.10. Independent lactase-persistence alleles arose and swept separately in African and Middle Eastern pastoralist populations (for example -14010*C), a striking case of convergent adaptation at the same gene.

Why does a selective sweep reduce genetic diversity?

Diversity falls because the sweep replaces many competing chromosome backgrounds with copies of just one. When the beneficial mutation arises, it sits on a single haplotype carrying a particular set of neutral variants. As selection drives that haplotype from one copy to fixation, all the other haplotypes — and the variation they carried near the locus — are eliminated. Recombination is the only thing that can rescue a neutral variant onto the winning background, but a strong sweep finishes in a few hundred to a few thousand generations, leaving recombination too little time to act over short physical distances. So the closer a site is to the selected locus, the more completely its variation is wiped out, producing a V-shaped valley of diversity that recovers gradually with distance as recombination breaks up the linkage.
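The V-shaped valley can be sketched with a first-order approximation. The formula used here is an assumption of this example rather than something stated in the article: the chance that a sampled lineage stays on the sweeping haplotype is taken as exp(-(r/s)·ln(2Ne)), and diversity immediately after the sweep, relative to background, as one minus the square of that chance. Published approximations differ in details such as the argument of the logarithm, so treat the numbers as indicative only. Names and defaults are hypothetical.

```python
import math


def relative_diversity(distance_kb, s, ne, cm_per_mb=1.0):
    """Approximate pi / pi_background just after a hard sweep (assumed formula).

    distance_kb : physical distance from the selected site in kilobases
    s, ne       : selection coefficient and effective population size
    cm_per_mb   : local recombination rate
    """
    # Physical distance -> recombination fraction (1 cM = 0.01).
    r = (distance_kb / 1000.0) * cm_per_mb / 100.0
    # Probability a lineage hitchhikes all the way without recombining off.
    p_hitch = math.exp(-(r / s) * math.log(2.0 * ne))
    # Diversity survives unless both sampled lineages hitchhiked.
    return 1.0 - p_hitch ** 2


if __name__ == "__main__":
    for kb in (0, 5, 50, 200, 500, 1000, 5000):
        rel = relative_diversity(kb, s=0.05, ne=10_000)
        print(f"{kb:>5} kb from the selected site: {rel:.2f} of background diversity")
```

Under these assumptions, diversity is zero at the selected site, climbs steadily with distance, and is back near background within a few megabases, which is the valley shape described above.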

How wide is the region affected by a sweep?

The width scales with the strength of selection relative to recombination — to an order of magnitude, diversity is depressed out to a genetic map distance comparable to the selection coefficient s. A strong human sweep (s of a few percent) in a region of average recombination (~1 centimorgan per megabase) can flatten diversity across hundreds of kilobases to a megabase, which is why the lactase-persistence haplotype spans ~1 Mb. The same selection strength in a low-recombination region clears a much wider window, while in recombination hotspots the swept region is narrow. After the sweep, the valley refills over time as new mutations accumulate and recombination reshuffles haplotypes, so old sweeps leave shallow, broad signatures and recent sweeps leave deep, sharp ones — which is why long-haplotype tests pick up only sweeps from roughly the last 30,000 years in humans.