Algorithms

Modular Exponentiation

How a 2048-bit power that should take longer than the universe finishes in a millisecond

Modular exponentiation computes a^b mod m in O(log b) multiplications by repeatedly squaring and reducing — the workhorse of RSA, Diffie-Hellman, and primality tests, turning a number with millions of digits into one that never overflows.

MultiplicationsΘ(log b)
Naive methodΘ(b)
Bit-cost (schoolbook)O(log b · log²m)
Operand size≤ 2 log₂ m bits
Invented byPingala, ~200 BCE

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The problem: a^b mod m, where everything is enormous

RSA signing a message computes something like c = m^d mod n where d and n are each 2048-bit numbers — roughly 617 decimal digits. Compute m^d the obvious way, by multiplying m by itself d − 1 times, and two things go catastrophically wrong. First, you'd perform about 2²⁰⁴⁸ multiplications — more than the number of atoms in the observable universe, squared, many times over. Second, even one of those products would be a number with more digits than there are particles in the cosmos; no amount of RAM holds it.

Modular exponentiation fixes both problems at once with two ideas you can hold in your head:

Square instead of multiply repeatedly. You don't need b multiplications to raise something to the bth power. Squaring doubles the exponent each time, so log₂ b squarings get you to b. That's the exponentiation by squaring half.
Reduce mod m after every operation. Modular arithmetic is closed under multiplication: (x · y) mod m = ((x mod m) · (y mod m)) mod m. So fold the mod m into every step and no intermediate value ever exceeds m². That's the modular half.

Put them together and a power that should take longer than the heat death of the universe finishes in a few thousand machine operations.

Square-and-multiply: reading the exponent in binary

The key insight is that any exponent can be built from squarings if you read it in binary. Take b = 13 = 1101₂. Then

a^13 = a^(8 + 4 + 1) = a^8 · a^4 · a^1

The powers a^1, a^2, a^4, a^8, … are each the square of the previous one. So you walk the bits of b, keep a running square of the base, and whenever you hit a set bit, fold that square into the accumulated result. Here is the full trace for 3^13 mod 7, scanning the exponent right-to-left:

b = 13 = 1101 in binary   (bits from LSB: 1, 0, 1, 1)
result = 1, base = 3

bit 0 = 1 → result = 1·3 mod 7 = 3      base → 3²=9 mod 7 = 2
bit 1 = 0 → result unchanged = 3        base → 2²=4 mod 7 = 4
bit 2 = 1 → result = 3·4 mod 7 = 12 mod 7 = 5   base → 4²=16 mod 7 = 2
bit 3 = 1 → result = 5·2 mod 7 = 10 mod 7 = 3   base → (done)

3^13 mod 7 = 3

Four bits, three squarings, three multiplies — six modular operations instead of twelve. (The squaring after the top set bit is skipped because no higher bit needs it.) The saving looks small here, but it's the difference between log₂ b and b: for a 2048-bit exponent, ~3072 operations instead of ~2²⁰⁴⁸.

Complexity: count multiplications, then count bits

There are two cost models, and conflating them is the most common analysis mistake.

Multiplication count. The exponent b has ⌊log₂ b⌋ + 1 bits. Each bit triggers one squaring; each set bit triggers one extra multiply. So the algorithm does between log₂ b and 2 log₂ b modular multiplications — Θ(log b). On average half the bits are set, so expect about 1.5 log₂ b multiplications.

Bit complexity. Each modular multiplication is itself an operation on numbers up to log₂ m bits wide. With schoolbook multiplication that's O(log²m) bit operations per multiply, plus an O(log²m) reduction. So the whole exponentiation costs O(log b · log²m) bit operations. Swap in Karatsuba multiplication and each multiply drops to O(log^1.585 m); swap in FFT-based multiplication and it's near-linear, O(log m · log log m), which is what GMP uses for very large operands.

The operands never blow up: because every product is reduced mod m, each operand stays below m, and the un-reduced product of two such operands is below m² — at most 2 log₂ m bits. That bounded memory footprint is what makes the algorithm practical at all.

When to reach for it (and its three companions)

Public-key cryptography. RSA encryption, decryption, and signing are all a^b mod n. Diffie-Hellman key exchange computes g^secret mod p. The entire security of these systems rests on this operation being cheap to do forward and (without the key) infeasible to invert.
Primality testing. Fermat and Miller-Rabin tests hinge on evaluating a^(n−1) mod n for random bases — pure modular exponentiation.
Hashing and PRNGs. Rolling hashes, the BBS generator, and many MACs are powered by modular powers.
Competitive programming. Anything that needs x mod (10⁹ + 7), including modular inverses via Fermat's little theorem (x⁻¹ ≡ x^(p−2) mod p) and combinatorics with large factorials.

Reach for the matrix version when the recurrence is linear (Fibonacci, linear DP transitions), and elliptic-curve scalar multiplication when you want the same security as RSA at a fraction of the key size. Both reuse the square-and-multiply skeleton on a different "multiply."

How the exponentiation strategies compare

	Naive repeated multiply	Square & multiply (R→L)	Montgomery ladder	Fixed-window (k-ary)	Matrix exponentiation
Multiplications	b − 1	~1.5 log₂ b	2 log₂ b (always)	~log₂ b + log₂ b / k	Θ(log b) matrix mults
Big-O (mults)	Θ(b)	Θ(log b)	Θ(log b)	Θ(log b)	Θ(d³ log b)
Constant-time?	no	no (leaks bits)	yes	yes (with care)	n/a
Extra memory	O(1)	O(1)	O(1) (two registers)	O(2^k) precomputed powers	O(d²) per matrix
Best use	tiny exponents only	general / teaching	side-channel-safe crypto	fastest software crypto	linear recurrences

The headline trade-off: square-and-multiply is the fastest and simplest, but its running time depends on the secret exponent's bits, so production crypto pays a constant-factor tax (the ladder, or windowing with dummy multiplies) to make the operation count independent of the data.

What the numbers actually say

A 2048-bit exponent: ~3072 modular multiplications. That's ~2048 squarings plus ~1024 multiplies (half the bits set). On a modern CPU with 64-bit limbs and Montgomery reduction, that's well under a millisecond — which is why TLS handshakes can do thousands of RSA operations per second per core.
The naive method would need ~2²⁰⁴⁸ multiplications — about 10⁶¹⁶. The observable universe is roughly 10⁸⁰ atoms and 10¹⁸ seconds old; you would never finish a single signature.
CRT cuts RSA decryption to ~25% of the cost. Two exponentiations mod the 1024-bit primes p and q cost about 2 · (1/2)² = 1/2 of one full 2048-bit exponentiation in multiply-count, and the smaller operands make each multiply ~4× cheaper, netting a 3–4× speedup in practice.
Euler's theorem can shrink the exponent first. If gcd(a, m) = 1, then a^b ≡ a^(b mod φ(m)) mod m, so an exponent with a trillion digits collapses to one below φ(m) before squaring even begins.

JavaScript implementation

Native JavaScript number loses precision above 2⁵³, so for anything cryptographic you must use BigInt. The right-to-left square-and-multiply translates directly:

// Compute (base ** exp) % mod for non-negative BigInt inputs.
function modPow(base, exp, mod) {
  if (mod === 1n) return 0n;            // everything is 0 mod 1
  let result = 1n;
  base = base % mod;                    // normalize base into [0, mod)
  while (exp > 0n) {
    if (exp & 1n) {                      // current low bit is set
      result = (result * base) % mod;   // fold this power into the result
    }
    exp >>= 1n;                         // drop the bit we just handled
    base = (base * base) % mod;         // square the base for the next bit
  }
  return result;
}

modPow(3n, 13n, 7n);                    // → 3n
modPow(2n, 1_000_000n, 1_000_000_007n); // → 235042059n, in microseconds

Two details that matter. First, base = base % mod up front guards against a base larger than the modulus. Second, the order inside the loop is "multiply on the set bit, then always square" — squaring even on the final iteration is harmless because the loop exits before that square is ever used. Note this code is not constant-time: the if (exp & 1n) branch leaks the exponent's Hamming weight through timing, which is fine for competitive programming but unacceptable for a private key.

Python implementation (and the built-in)

Python's integers are arbitrary-precision by default and the three-argument pow built-in already does fast modular exponentiation in C — so in real code you simply write pow(a, b, m). Here is the explicit algorithm for understanding, alongside the one-liner you should actually ship:

def mod_pow(base, exp, mod):
    if mod == 1:
        return 0
    result = 1
    base %= mod
    while exp > 0:
        if exp & 1:                  # low bit set → multiply in
            result = (result * base) % mod
        exp >>= 1                     # next bit
        base = (base * base) % mod   # square
    return result


# In practice, just use the built-in — it's constant-folded in C
# and uses windowed exponentiation internally:
pow(3, 13, 7)                        # → 3
pow(2, 1_000_000, 1_000_000_007)     # → 235042059

# Modular inverse via Fermat's little theorem (p prime, gcd(a, p) == 1):
def mod_inverse(a, p):
    return pow(a, p - 2, p)          # a^(p-2) ≡ a^(-1)  (mod p)

# Python 3.8+: pow also handles negative exponents as modular inverse:
pow(3, -1, 7)                        # → 5, because 3·5 = 15 ≡ 1 (mod 7)

The negative-exponent form (added in Python 3.8) is the cleanest way to compute a modular inverse without writing the extended Euclidean algorithm by hand — though it only works when the inverse exists, i.e. when gcd(a, m) = 1.

Variants worth knowing

Left-to-right square-and-multiply. Scan the exponent from the most-significant bit. Start with result = 1; each step square the result, and if the current bit is set, multiply in the base. Same operation count, but it needs only one persistent accumulator and meshes naturally with windowing.

Fixed-window (k-ary) exponentiation. Precompute a^0, a^1, …, a^(2^k − 1), then consume the exponent k bits at a time: square k times, then one multiply by the right precomputed power. This cuts multiplies from ~1.5 log b toward log b · (1 + 1/k) at the cost of a small table. Sliding-window does even better by skipping runs of zeros.

Montgomery ladder. Maintain two values and do exactly one square and one multiply per bit regardless of whether the bit is 0 or 1. The operation sequence no longer depends on the secret, defeating timing and simple power-analysis attacks. This is the standard for production RSA and ECC, and it's the scalar-multiplication core of elliptic-curve crypto.

Montgomery / Barrett reduction. Orthogonal to the squaring strategy — these replace the expensive mod m division inside each step with shifts and multiplies, the single biggest constant-factor win in real big-number libraries.

Matrix exponentiation. Replace the scalar multiply with matrix multiply and you can evaluate any linear recurrence in O(d³ log n) — the famous trick for the n-th Fibonacci number in logarithmic time via [[1,1],[1,0]]^n.

Common bugs and edge cases

Integer overflow. In C/C++/Java the product base * base can overflow a 64-bit integer even when both operands fit, because the un-reduced product needs up to 2 log₂ m bits. Use __int128, BigInt, or Montgomery multiplication; this is the most common silent failure.
Forgetting to reduce the base first. If base ≥ mod and you skip base %= mod, the first square can already overflow.
mod == 1. Every value is 0 mod 1, including a^0. Without the early return, a naive implementation returns 1, which is wrong.
exp == 0. The answer is 1 mod m (the loop never runs). The convention 0^0 = 1 is what these routines return — fine for almost every application.
Negative or huge exponents. A negative exponent only makes sense as a modular inverse and requires gcd(a, m) = 1. For exponents larger than φ(m) you can pre-reduce with Euler's theorem — but only when gcd(a, m) = 1; otherwise you need the Carmichael/lifting-the-exponent rules.
Using square-and-multiply for secrets. The data-dependent multiply branch leaks the exponent through timing and power. Reach for the Montgomery ladder or a windowed constant-time routine in any security context.

Frequently asked questions

Why is modular exponentiation fast when naive exponentiation isn't?

Naive a^b does b − 1 multiplications, which is exponential in the number of bits of b. Square-and-multiply rewrites the exponent in binary and only ever squares or multiplies once per bit, so it needs Θ(log b) multiplications. For a 2048-bit exponent that's the difference between ~2^2048 operations and ~3000 — the gap between forever and microseconds.

Why reduce mod m after every multiplication instead of at the end?

Because (x·y) mod m = ((x mod m)·(y mod m)) mod m, you can fold the reduction inside the loop. If you didn't, the intermediate a^b would be a number with billions of digits long before you could reduce it. Reducing every step keeps every operand below m, so the values never exceed roughly 2·log₂(m) bits.

What's the difference between left-to-right and right-to-left square-and-multiply?

Right-to-left scans exponent bits from the least significant, keeping a running base that it squares every step and multiplies into the result on a set bit. Left-to-right starts from the most significant bit, squaring the accumulating result each step and multiplying in the base on a set bit. They do the same number of operations; left-to-right is friendlier to fixed-window optimizations, right-to-left parallelizes the squarings.

Is square-and-multiply safe to use directly in cryptography?

No. The multiply step only runs on a 1 bit, so the running time and power draw leak the secret exponent's bits to a timing or power side-channel attack. Real RSA and Diffie-Hellman use constant-time variants like the Montgomery ladder or fixed-window exponentiation that perform the same operations regardless of the exponent's bits.

How does the Chinese Remainder Theorem speed up RSA decryption?

Instead of one exponentiation mod the full modulus n = p·q, you do two exponentiations mod p and mod q — each with half-size operands — then recombine with CRT. Because Karatsuba and schoolbook multiplication cost grows superlinearly, two half-size exponentiations are about 3 to 4× faster than one full-size one. Every production RSA library uses this.

What does Euler's theorem let you do to a huge exponent?

When gcd(a, m) = 1, a^φ(m) ≡ 1 (mod m), so a^b ≡ a^(b mod φ(m)) (mod m). You can shrink an astronomically large exponent down below φ(m) before you even start squaring. RSA's whole key structure — choosing e and d so that e·d ≡ 1 (mod φ(n)) — is built on exactly this fact.