Genetics
Linkage Disequilibrium
Non-random co-occurrence of alleles at nearby loci — D = pAB − pA·pB, decays with recombination over generations
Linkage disequilibrium (LD) is the non-random co-occurrence of alleles at two or more loci. It is measured as D = pAB − pA·pB, and recombination multiplies D by (1 − c) each generation, where c is the recombination fraction. Two normalised statistics dominate reporting: D' (between −1 and +1; |D'| = 1 means at least one of the four haplotypes is unobserved) and r² (the squared correlation between alleles, which drives GWAS power because the sample size needed to detect a causal variant through a tag SNP scales with 1/r²). Loci ~1 cM apart lose about 1 percent of their LD per generation. In humans, LD blocks span roughly 10–100 kb in non-African populations and 5–10 kb in African populations, reflecting the larger long-term effective population size of African populations.
- Definition: D = pAB − pA·pB
- Normalised stats: D' (∈ [−1, 1]) · r² (∈ [0, 1])
- Decay rate: ~1 % per generation per cM
- LD blocks (non-African): ~10–100 kb
- LD blocks (African): ~5–10 kb
- Catalogued by: HapMap 2002–2010 · 1000G 2008–2015
Why linkage disequilibrium matters
- Foundation of modern GWAS. The 600,000 to 2,000,000 SNPs on a typical genotyping array tag the rest of the common-variant genome through LD. Without LD, every causal variant would have to be directly typed, multiplying genotyping costs by 50 to 100. The GWAS Catalog (curated since 2008) holds over 600,000 reported associations across more than 6,000 traits, almost all detected via tag SNPs in LD with the actual causal variants.
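- Population history is encoded in LD. The Out-of-Africa bottleneck around 60,000 years ago elevated LD across non-African genomes. Because r² at a given recombination distance reflects effective population size Ne at a particular time depth, LD-decay curves let demographers reconstruct recent Ne trajectories. Coalescent methods such as PSMC and SMC++ reach further back, on the order of ~10,000 generations, by modelling how ancestry changes along the genome through historical recombination rather than by fitting r² directly.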
- Selective sweeps leave LD signatures. When a beneficial mutation rises rapidly to high frequency, it drags surrounding variants with it — a selective sweep. The lactase-persistence allele in Europeans sits at the centre of an LD haplotype extending more than 1 Mb, a signature so strong it was the first selective sweep convincingly detected in modern humans (Bersaglieri et al 2004).
- Fine-mapping requires breaking LD. When a GWAS hit covers a 100 kb LD block containing 50 SNPs, identifying the causal one requires dense sequencing, functional assays, or trans-ethnic meta-analysis. Populations with shorter LD blocks (such as African-ancestry populations) sometimes break the locus into smaller pieces.
- Admixture mapping uses LD as a clock. In recently admixed populations (e.g., African Americans, Latinos), admixture creates LD even between distant loci, and that LD is multiplied by (1 − c) in each post-admixture generation, so it survives longest between closely linked markers. Measuring how admixture LD falls off with genetic distance calibrates the date of admixture: African American genomes show admixture LD consistent with 6 to 10 generations since the trans-Atlantic slave trade.
- Polygenic risk scores depend on LD weighting. A polygenic score sums effect sizes across millions of SNPs. Naive summing double-counts effects shared via LD; methods like LDpred and PRS-CS reweight using a reference LD matrix. Cross-population transferability is poor partly because LD differs: a European-derived score loses 40 to 80 percent of its predictive accuracy in African ancestry samples.
- 1 cM ≈ 1 Mb in human autosomes (sex-averaged). The total human genetic map is ~3,500 cM over ~3,200 Mb, so the average c per Mb is roughly 1 percent. Sex chromosomes deviate sharply: the X chromosome recombines only in females, and the Y has no homologous recombination over most of its length, so LD on the Y extends across the entire chromosome.
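The admixture-clock idea from the list above can be sketched in a few lines. This is a minimal illustration, assuming admixture LD between loci at recombination fraction c is multiplied by (1 − c) each generation, so that after t generations it is proportional to (1 − c)^t. The function name `estimate_admixture_generations` and the synthetic data are hypothetical and not taken from any published tool.

```python
import math

def estimate_admixture_generations(rec_fractions, ld_values):
    """Least-squares slope of ln(LD) against ln(1 - c), fitted with an intercept.

    Under the assumed model LD_t(c) = LD_0 * (1 - c)**t, the slope equals t.
    """
    xs = [math.log(1.0 - c) for c in rec_fractions]
    ys = [math.log(d) for d in ld_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic, noise-free data generated with 8 generations since admixture.
true_t = 8
cs = [0.01, 0.02, 0.05, 0.10, 0.20]
lds = [0.05 * (1.0 - c) ** true_t for c in cs]
t_hat = estimate_admixture_generations(cs, lds)
print(round(t_hat, 3))
```

Real admixture LD is noisy and is affected by background LD and by admixture spread over several generations, so a fit like this should be read as a rough calibration rather than a precise date.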
Common misconceptions
- LD is the same as genetic linkage. Linkage refers to whether two loci segregate together within a family across one meiosis. LD is a population-level statistic measuring correlation across all chromosomes in the gene pool, integrating thousands of generations of recombination, drift, selection, and demography. Two loci can be tightly linked (low recombination fraction) and still in low LD if recombination has had enough time to break the association.
- D' = 1 means perfect tagging. D' = 1 only requires one haplotype to be missing — three of the four possible haplotypes can still be observed at very different frequencies, and r² can be near zero. r² is the right statistic for tagging power, because GWAS sample-size requirements scale with 1/r² between the typed SNP and the causal variant.
- LD only matters for nearby loci. In recently admixed populations or after strong selection, LD can span unlinked loci. Population stratification — failing to correct for ancestry differences — induces LD between any pair of differentially distributed alleles and is the leading cause of false-positive GWAS associations. Principal component correction and mixed-effect models exist precisely to remove this artefact.
- LD decay is uniform. Decay rates vary by 100x or more across the genome because recombination is concentrated in hotspots. Within a hotspot, LD breaks down in 5–20 kb; outside hotspots it can persist over hundreds of kilobases. The HLA region on chromosome 6 is famous for LD blocks spanning megabases, sustained by epistatic selection and low local recombination.
- r² has the same meaning across populations. r² depends on allele frequencies in the population sampled. A SNP that is common in Europeans but rare in East Asians will tag different variants with different efficiencies in the two populations, which is why population-specific reference panels are essential for accurate tagging-SNP design and post-hoc fine-mapping.
- LD is always positive. D can be positive or negative depending on which alleles co-occur. The sign of D depends on the labelling convention; r² is always non-negative because it is squared. Reporting both r² and D' (with sign) gives the most complete picture.
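Two of the misconceptions above (that D' = 1 implies perfect tagging, and that LD is always positive) are easy to check numerically. The sketch below is illustrative only: the function name `d_prime_and_r2` and the haplotype frequencies are made up for the example, and D_max is taken as the largest magnitude D can reach for its sign given the allele frequencies.

```python
def d_prime_and_r2(x_AB, x_Ab, x_aB, x_ab):
    """Return (D, D', r2) from the four haplotype frequencies (which must sum to 1)."""
    p_A = x_AB + x_Ab
    p_B = x_AB + x_aB
    D = x_AB - p_A * p_B
    # D_max depends on the sign of D.
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = D / d_max if d_max > 0 else 0.0
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = D * D / denom if denom > 0 else 0.0
    return D, d_prime, r2

# Haplotype Ab is absent, and allele A is rare (5 %) while B is common (50 %).
D, d_prime, r2 = d_prime_and_r2(0.05, 0.0, 0.45, 0.50)
print(round(D, 4), round(d_prime, 4), round(r2, 4))

# Relabel the alleles at the second locus (swap B and b): the signs of D and D'
# flip, but r2 is unchanged.
D_sw, d_prime_sw, r2_sw = d_prime_and_r2(0.0, 0.05, 0.50, 0.45)
print(round(D_sw, 4), round(d_prime_sw, 4), round(r2_sw, 4))
```

In the first call D' is 1 because one haplotype is missing, yet r² is only about 0.05, so the rare-allele SNP would be a poor tag for the common one.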
How LD evolves
Consider two biallelic loci with alleles A/a and B/b. Define haplotype frequencies xAB, xAb, xaB, xab. Allele frequencies are pA = xAB + xAb and pB = xAB + xaB. If alleles at the two loci were drawn independently (that is, at linkage equilibrium), xAB would equal pA·pB. The LD coefficient D measures the deviation: D = xAB − pA·pB = xAB·xab − xAb·xaB. Each generation, recombination at rate c between the two loci shuffles alleles between haplotypes: xAB and xab each change by −c·D, while xAb and xaB each change by +c·D. Allele frequencies are unchanged, and D is multiplied by (1 − c). Without other forces, D approaches zero geometrically with time.
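The recursion described above can be iterated directly. The following is a minimal deterministic sketch, assuming an infinite randomly mating population with no drift, selection, or mutation; the function names `ld` and `next_generation` are illustrative.

```python
def ld(x_AB, x_Ab, x_aB, x_ab):
    """LD coefficient D from the four haplotype frequencies."""
    return x_AB * x_ab - x_Ab * x_aB

def next_generation(x_AB, x_Ab, x_aB, x_ab, c):
    """Haplotype frequencies after one generation of recombination at rate c."""
    D = ld(x_AB, x_Ab, x_aB, x_ab)
    return (x_AB - c * D, x_Ab + c * D, x_aB + c * D, x_ab - c * D)

haps = (0.5, 0.0, 0.0, 0.5)  # only AB and ab present, so D starts at 0.25
c = 0.01
d_values = [ld(*haps)]
for _ in range(100):
    haps = next_generation(*haps, c)
    d_values.append(ld(*haps))

# D shrinks by the factor (1 - c) every generation; allele frequencies stay at 0.5.
print(round(d_values[1] / d_values[0], 6))
print(round(d_values[100], 4), round(haps[0] + haps[1], 6))
```

Tracking the four haplotype frequencies, rather than D alone, makes it visible that recombination moves mass from the over-represented haplotype pair to the under-represented pair without touching allele frequencies.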
Drift in finite populations counteracts decay. The expected equilibrium r² between two loci is approximately E[r²] ≈ 1 / (1 + 4 Ne c), where Ne is effective population size. For human autosomes with Ne ≈ 10,000 and c = 0.0001 (about 10 kb apart at 1 cM per Mb), 4 Ne c ≈ 4, giving E[r²] ≈ 0.2, the empirical baseline measured by HapMap and 1000 Genomes. Bottlenecks slash Ne temporarily, raising LD across the genome. Selection on a haplotype generates LD over a window that grows with the strength of selection and shrinks with the local recombination rate: a strong sweep with selection coefficient s = 0.1 leaves a hitchhiking footprint extending roughly s/(2c) base pairs before LD breaks, with c here expressed per base pair.
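The drift-recombination balance can be evaluated with a short helper. This sketch uses only the approximation E[r²] ≈ 1 / (1 + 4 Ne c) quoted above and ignores the extra r² contributed by finite sample size; the function names are illustrative.

```python
def expected_r2(ne, c):
    """Approximate equilibrium r2 for effective size ne and recombination fraction c."""
    return 1.0 / (1.0 + 4.0 * ne * c)

def ne_from_r2(r2, c):
    """Invert the same approximation to get the Ne implied by an observed r2."""
    return (1.0 / r2 - 1.0) / (4.0 * c)

# The worked example from the text: Ne = 10,000 and c = 0.0001 (about 10 kb).
print(round(expected_r2(10_000, 1e-4), 3))

# The same distance in a population ten times smaller gives much higher LD.
print(round(expected_r2(1_000, 1e-4), 3))

# Going the other way: which Ne is implied by r2 = 0.2 at c = 0.0001?
print(round(ne_from_r2(0.2, 1e-4)))
```

The inverse form is the basic idea behind LD-based Ne estimation, although practical estimators add corrections for sample size and phasing that are not shown here.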
Computing D from genotype data requires inferring haplotypes when phase is unknown. The EM algorithm (Excoffier and Slatkin 1995) and Bayesian methods (Stephens et al PHASE, Browning's Beagle) reconstruct haplotypes from unphased genotypes; modern reference-panel-based imputation (IMPUTE2, Beagle, Minimac) leverages large phased panels (1000 Genomes, TOPMed) so any new study can impute LD-linked variants from a sparse genotyping chip. The HapMap 3 release (2010) catalogued ~1.6 million SNPs; 1000 Genomes Phase 3 (2015) catalogued ~84 million variants in 2,504 individuals, and TOPMed (2021) extends this to ~140,000 individuals and >700 million variants.
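For two biallelic loci, the only phase-ambiguous genotype is the double heterozygote, which can be AB/ab or Ab/aB. The sketch below shows an EM iteration for that two-locus case. It is a simplified illustration in the spirit of the EM approach, not a reproduction of Excoffier and Slatkin's implementation or of PHASE or Beagle; the function name, the genotype coding (counts of allele A and allele B per individual), and the toy data are assumptions made for the example.

```python
def em_haplotype_freqs(genotypes, n_iter=100):
    """Estimate two-locus haplotype frequencies from unphased genotypes.

    genotypes: list of (g_a, g_b), each the count (0, 1 or 2) of allele A / allele B.
    """
    base = {"AB": 0.0, "Ab": 0.0, "aB": 0.0, "ab": 0.0}
    n_double_het = 0
    for g_a, g_b in genotypes:
        if g_a == 1 and g_b == 1:
            n_double_het += 1  # phase unknown: AB/ab or Ab/aB
            continue
        # At least one locus is homozygous, so both haplotypes are determined.
        a_alleles = ["A"] * g_a + ["a"] * (2 - g_a)
        b_alleles = ["B"] * g_b + ["b"] * (2 - g_b)
        for a, b in zip(a_alleles, b_alleles):
            base[a + b] += 1

    total = 2.0 * len(genotypes)
    freqs = {h: 0.25 for h in base}  # start from equal frequencies
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is AB/ab rather than Ab/aB.
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        p_cis = cis / (cis + trans) if cis + trans > 0 else 0.5
        exp_cis = n_double_het * p_cis
        exp_trans = n_double_het - exp_cis
        # M-step: update frequencies from observed plus expected haplotype counts.
        freqs = {
            "AB": (base["AB"] + exp_cis) / total,
            "ab": (base["ab"] + exp_cis) / total,
            "Ab": (base["Ab"] + exp_trans) / total,
            "aB": (base["aB"] + exp_trans) / total,
        }
    return freqs

# Toy sample: 40 AB/AB, 40 ab/ab and 20 double heterozygotes of unknown phase.
genotypes = [(2, 2)] * 40 + [(0, 0)] * 40 + [(1, 1)] * 20
freqs = em_haplotype_freqs(genotypes)
d_hat = freqs["AB"] * freqs["ab"] - freqs["Ab"] * freqs["aB"]
print({h: round(f, 4) for h, f in freqs.items()}, round(d_hat, 4))
```

Because no Ab or aB haplotype is ever observed unambiguously in this toy sample, the EM assigns essentially all double heterozygotes to the AB/ab phase. A fixed iteration count is used for brevity; a convergence check on the change in frequencies would be the more usual stopping rule.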
Linkage vs LD vs association
| Concept | Linkage | Linkage disequilibrium | Association |
|---|---|---|---|
| Scale | Within a single family | Across a population | Across a population (case-control or cohort) |
| Measure | Recombination fraction c (LOD score) | D, D', r² | Odds ratio, β coefficient, P-value |
| Time horizon | One meiosis (one generation) | Thousands of generations | Current population |
| Detects | Co-segregation of marker and trait in pedigrees | Allelic correlation independent of phenotype | Correlation between genotype and phenotype |
| Sample size | 10–100 families | 1,000–100,000 chromosomes for fine LD | 10,000–1,000,000 individuals (modern GWAS) |
| Resolution | ~1–10 cM | Down to single-SNP via dense LD blocks | Locus-level; needs LD/fine-mapping for causal variant |
| Classic study | Sturtevant 1913 Drosophila | HapMap 2002, 1000 Genomes | WTCCC 2007, UK Biobank 2018 |
Famous experiments
- Sturtevant 1913. Plotted six Drosophila melanogaster X-chromosome loci using recombination frequencies — first genetic map. Distinct from LD (it measures linkage in pedigrees), but laid the conceptual groundwork.
- Lewontin 1964. Defined the normalised statistic D' to disentangle LD from allele-frequency confounding — the standard reporting convention to this day.
- HapMap Project 2002–2010. Genotyped 270 (Phase I) then 1,184 individuals (Phase III) across multiple populations to build the first genome-wide LD map. Showed haplotype block structure interrupted by recombination hotspots.
- 1000 Genomes Project 2008–2015. Whole-genome sequenced 2,504 individuals across 26 populations, catalogued ~84 million variants, and provided the reference panel that powers modern GWAS imputation.
- WTCCC 2007. Wellcome Trust Case Control Consortium published the first large multi-disease GWAS — 14,000 cases across 7 diseases plus 3,000 controls — proving that LD-tagging on commercial chips could detect novel risk loci.
Frequently asked questions
What is the formula for linkage disequilibrium?
For two biallelic loci with alleles A/a and B/b, with allele frequencies p_A, p_a, p_B, p_b and haplotype frequency p_AB, the standard pairwise LD coefficient is D = p_AB − p_A · p_B. D ranges from −0.25 to +0.25, and the range it can actually reach depends on the allele frequencies. Two normalised statistics are typically reported. D' = D / D_max, where D_max is the largest magnitude D can reach, for its sign, given the marginal allele frequencies; D' ranges from −1 to +1, with |D'| = 1 meaning at least one of the four possible haplotypes is unobserved. r² = D² / (p_A · p_a · p_B · p_b) is the squared correlation between the indicator variables for the two alleles; r² is the statistic that drives statistical power for SNP-tag association studies, because the sample size needed to detect a causal variant via a tag SNP scales as 1/r².
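The statement that r² is a squared correlation can be verified on phased data. In this sketch each chromosome is coded as 1 if it carries allele A (or B) and 0 otherwise. The function name `r2_from_haplotypes` and the ten example chromosomes are invented for illustration.

```python
def r2_from_haplotypes(a, b):
    """Return (D, r2 from the D formula, r2 as squared Pearson correlation).

    a, b: equal-length lists of 0/1 indicators, one entry per chromosome.
    """
    n = len(a)
    p_A = sum(a) / n
    p_B = sum(b) / n
    p_AB = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    D = p_AB - p_A * p_B
    r2_formula = D * D / (p_A * (1 - p_A) * p_B * (1 - p_B))

    # Squared Pearson correlation of the two indicator variables, computed directly.
    cov = sum((x - p_A) * (y - p_B) for x, y in zip(a, b)) / n
    var_a = sum((x - p_A) ** 2 for x in a) / n
    var_b = sum((y - p_B) ** 2 for y in b) / n
    r2_corr = cov * cov / (var_a * var_b)
    return D, r2_formula, r2_corr

# Ten phased chromosomes: p_A = 0.4, p_B = 0.4, p_AB = 0.3.
locus_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
locus_b = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
D, r2_formula, r2_corr = r2_from_haplotypes(locus_a, locus_b)
print(round(D, 4), round(r2_formula, 4), round(r2_corr, 4))
```

The two r² values agree because the covariance of the indicators is exactly D and their variances are p_A · p_a and p_B · p_b.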
How fast does LD decay?
Each generation, LD between two loci with recombination fraction c shrinks: D_{t+1} = (1 − c) · D_t. After t generations D_t = (1 − c)^t · D_0. For loci 1 cM apart (c ≈ 0.01), LD decays by about 1 percent per generation; after 100 generations only ~37 percent of the initial LD remains, and after 300 generations ~5 percent. Loci 0.1 cM apart (about 100 kb in humans on average) lose only 0.1 percent per generation, so half of the initial LD is still present after roughly 700 generations. This is why LD blocks in non-African human populations extend 10–100 kb and still carry the imprint of demographically recent events such as the Out-of-Africa bottleneck; African populations, with a larger long-term effective population size and no comparable recent bottleneck, show LD blocks several-fold shorter (~5–10 kb).
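The closed form D_t = (1 − c)^t · D_0 makes these figures easy to reproduce. The helper names below are illustrative, and the calculation ignores drift, selection, and new mutation.

```python
import math

def ld_remaining(c, t):
    """Fraction of the initial D left after t generations at recombination fraction c."""
    return (1.0 - c) ** t

def generations_to_fraction(c, fraction):
    """Generations needed for D to fall to the given fraction of its initial value."""
    return math.log(fraction) / math.log(1.0 - c)

# Loci 1 cM apart (c = 0.01): LD remaining after 100 and 300 generations.
print(round(ld_remaining(0.01, 100), 3), round(ld_remaining(0.01, 300), 3))

# Half-life of LD at 1 cM and at 0.1 cM, in generations.
print(round(generations_to_fraction(0.01, 0.5)), round(generations_to_fraction(0.001, 0.5)))
```

For small c the half-life is close to ln(2)/c, so halving the genetic distance roughly doubles the time LD persists.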
Why is LD essential for GWAS?
GWAS typically genotype 0.5 to 2 million tagging SNPs across roughly 3 billion bp of human genome, far fewer than the 80–100 million variants catalogued by sequencing projects, of which several million are common. The strategy works because tagging SNPs are chosen so that any common (minor allele frequency > 0.05) untyped variant is in high LD (r² > 0.8) with at least one typed SNP. A causal variant that the chip does not measure directly will still produce an association signal at any tag SNP in LD with it, and the sample size needed to detect that signal scales as 1/r². The HapMap and 1000 Genomes projects measured LD across populations to allow chip designs that maximise tagging efficiency. The flip side is that the association signal is shared by every variant in the block, so fine-mapping the causal variant within an associated LD block requires either dense sequencing or trans-ethnic comparison.
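The 1/r² scaling translates into a simple rule of thumb for study design. The sketch below assumes a study that would need a given number of samples if the causal variant were typed directly; the baseline of 10,000 and the function name are hypothetical.

```python
def required_sample_size(n_if_causal_typed, r2):
    """Approximate sample size needed when testing a tag SNP with the given r2."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must be in (0, 1]")
    return n_if_causal_typed / r2

baseline = 10_000  # hypothetical sample size if the causal variant were genotyped
for r2 in (1.0, 0.8, 0.5, 0.2):
    print(r2, round(required_sample_size(baseline, r2)))
```

At the conventional tagging threshold of r² = 0.8 the penalty is a 25 percent larger sample, whereas a weak tag at r² = 0.2 needs five times as many samples.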
Why is LD different across populations?
LD is shaped by population history. Three forces matter most. First, effective population size N_e: drift generates more LD in smaller populations, so they settle at a higher equilibrium level, r² ≈ 1 / (1 + 4 N_e c). Second, bottlenecks: a recent severe bottleneck (e.g., out-of-Africa migration ~60,000 years ago) elevates LD across the entire genome for thousands of generations afterward. Third, admixture: recent mixing of two diverged populations creates extensive LD even between physically unlinked loci, decaying with each post-admixture generation. Empirically, African genomes show short LD blocks (~5–10 kb), East Asian and European genomes show longer blocks (~20–60 kb), and Finns and Ashkenazi Jews show even longer blocks (>100 kb in some regions), consistent with their bottlenecks.
What causes high LD between distant loci?
Several non-recombination forces sustain or generate LD. Recent admixture between populations with different allele frequencies creates LD between unlinked loci that decays at rate c per generation. Selection on a haplotype carrying linked variants (a selective sweep) drags the entire haplotype to high frequency, generating long-range LD that persists for thousands of years afterward — the lactase persistence haplotype in Europeans extends LD over more than 1 Mb. Population structure: if a sample mixes two subpopulations with different allele frequencies, alleles common in one subpopulation but absent in the other appear correlated even when physically unlinked; failing to control for this is the leading cause of false-positive GWAS hits. Epistatic selection (favouring particular allele combinations) and inversions that suppress recombination locally also create persistent long-range LD.
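The population-structure effect can be shown with a two-subpopulation mixture. In this sketch each subpopulation is itself at linkage equilibrium (D = 0), yet the pooled sample shows non-zero D whenever allele frequencies differ at both loci. The function name and the allele frequencies are made up for illustration.

```python
def pooled_ld(m, p_a1, p_b1, p_a2, p_b2):
    """D in a sample made of a fraction m from subpopulation 1 and 1 - m from subpopulation 2.

    Within each subpopulation the two loci are independent.
    """
    x_ab = m * p_a1 * p_b1 + (1 - m) * p_a2 * p_b2  # pooled frequency of the A-B haplotype
    p_a = m * p_a1 + (1 - m) * p_a2
    p_b = m * p_b1 + (1 - m) * p_b2
    return x_ab - p_a * p_b

m = 0.5
d_pooled = pooled_ld(m, p_a1=0.8, p_b1=0.7, p_a2=0.2, p_b2=0.1)

# Closed form for the same quantity: m * (1 - m) * (difference at A) * (difference at B).
d_closed = m * (1 - m) * (0.8 - 0.2) * (0.7 - 0.1)
print(round(d_pooled, 4), round(d_closed, 4))

# If the subpopulations do not differ at one of the loci, no LD is induced.
print(round(pooled_ld(m, 0.8, 0.4, 0.2, 0.4), 4))
```

The induced D does not depend on the physical distance between the loci, which is why stratification can produce spurious associations anywhere in the genome.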
What is a haplotype block?
A haplotype block is a stretch of the genome over which most pairs of common SNPs are in high LD with each other and only a few haplotype configurations are observed. The HapMap consortium showed that human chromosomes can be partitioned into roughly 100,000 to 300,000 such blocks, separated by short recombination hotspots that account for the majority of crossovers. Within a block, typically 4 to 10 distinct haplotypes account for >95 percent of all observed chromosomes, so a small number of tagging SNPs (often 3 to 6) can capture most of the variation. Block boundaries correspond largely to the same hotspots PRDM9 directs in meiosis, with hotspot recombination rates 10 to 1,000 times the genome-wide average concentrated in 1–2 kb windows.
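One way to see why a handful of tagging SNPs can cover a block is a greedy set-cover heuristic over a pairwise r² matrix. This is an illustrative sketch, not the procedure used by HapMap or by any array manufacturer; the function name, the toy matrix, and the r² ≥ 0.8 cutoff are assumptions for the example.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Pick tag SNPs so that every SNP has r2 >= threshold with at least one tag.

    r2: symmetric matrix (list of lists) of pairwise r2 values.
    Returns the chosen SNP indices in the order they were picked.
    """
    n = len(r2)
    untagged = set(range(n))
    tags = []
    while untagged:
        # Choose the SNP that covers the most still-untagged SNPs (itself included).
        best = max(
            range(n),
            key=lambda i: sum(1 for j in untagged if i == j or r2[i][j] >= threshold),
        )
        tags.append(best)
        untagged -= {j for j in untagged if j == best or r2[best][j] >= threshold}
    return tags

# Toy matrix with two blocks: SNPs 0-2 are in high mutual LD, as are SNPs 3-4.
r2_matrix = [
    [1.00, 0.90, 0.85, 0.10, 0.05],
    [0.90, 1.00, 0.90, 0.10, 0.10],
    [0.85, 0.90, 1.00, 0.20, 0.10],
    [0.10, 0.10, 0.20, 1.00, 0.95],
    [0.05, 0.10, 0.10, 0.95, 1.00],
]
tags = greedy_tag_snps(r2_matrix)
print(tags)  # [0, 3]
```

Two tags cover all five SNPs here, one per block. Greedy selection is simple but not guaranteed to find the smallest possible tag set.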