Bit Manipulation
Bit Manipulation Tricks
A handful of operators turn whole algorithms into single CPU cycles
Bit hacks compress whole algorithms into a few CPU cycles. Clear lowest bit with x & (x-1), isolate it with x & -x, count bits with Brian Kernighan or POPCNT, vectorize with SWAR.
- Clear lowest 1-bitx & (x-1)
- Isolate lowest 1-bitx & -x
- Fill below lowest 1x | (x-1)
- Test power of 2x & (x-1) == 0
- popcount (POPCNT)1 instr, ~1 ns
- SWAR popcount12 ops, 0 branches
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The core identities
Most useful bit hacks come from a few observations about how arithmetic operations interact with the binary representation. Each works because subtracting 1 or negating a number does something specific and reversible to its bits.
x & (x − 1): clear the lowest set bit
x = 0 0 0 0 1 1 0 0 (= 12)
x − 1 = 0 0 0 0 1 0 1 1 (= 11)
x & (x-1) = 0 0 0 0 1 0 0 0 (= 8)
Subtracting 1 from x flips the lowest set bit and every bit below it: the lowest 1 becomes 0, and the 0s below become 1s (because of borrow propagation). ANDing with the original keeps only the bits above the lowest 1 — which are unchanged by the subtraction. The lowest 1 itself disappears.
x & −x: isolate the lowest set bit
x = 0 0 0 0 1 1 0 0 (= 12)
~x = 1 1 1 1 0 0 1 1
~x + 1 = 1 1 1 1 0 1 0 0 (= -12 in two's complement, = 244 unsigned)
x & −x = 0 0 0 0 0 1 0 0 (= 4)
In two's complement, −x = ~x + 1. The +1 ripples through any trailing 1s in ~x, stopping at the first 0 — which is exactly the lowest 1 in x. After the carry, bits below the lowest set bit are 0, the bit at the lowest set bit is 1, and bits above are the complement of x's. ANDing keeps only the position where x and −x both have 1 — the lowest set bit, alone.
This is the backbone of Fenwick trees (binary indexed trees): the parent of node i in the implicit tree is at index i − (i & −i), and the next node responsible for updates is i + (i & −i).
x | (x − 1): fill 1s from the lowest set bit downward
x = 0 0 0 0 1 1 0 0 (= 12)
x − 1 = 0 0 0 0 1 0 1 1 (= 11)
x | (x-1) = 0 0 0 0 1 1 1 1 (= 15)
OR combines the keeping-the-high-bits behavior of x with the trailing 1s of x − 1. Useful for constructing masks of "everything below this bit position."
x & (x − 1) == 0: is x a power of 2?
A power of 2 has exactly one set bit. Clear it via x & (x − 1) and you get 0. So (x != 0) && ((x & (x − 1)) == 0) tests power-of-two-ness in 3 operations, all branch-predictor-friendly. Beats log2(x) + check (which involves floating point) or repeated division (which loops).
Population count (Hamming weight)
Counting the number of 1-bits in an integer — also called the Hamming weight — comes up everywhere from error-correcting codes to bitmap-set cardinality. Three approaches, fastest to slowest:
Hardware: POPCNT instruction
// GCC / Clang
int hw = __builtin_popcount(x); // 32-bit
int hw64 = __builtin_popcountll(x); // 64-bit
// C++20
#include <bit>
int hw = std::popcount(x);
// Rust
let hw = x.count_ones();
// Java
int hw = Integer.bitCount(x);
// JavaScript — no native popcount, must emulate
function popcount(x) {
x = x - ((x >>> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >>> 2) & 0x33333333);
x = (x + (x >>> 4)) & 0x0F0F0F0F;
return (x * 0x01010101) >>> 24;
}
On x86 with SSE4.2 (every CPU made since 2008), POPCNT runs in 1 cycle latency throughput on Skylake+ — effectively 0.5 ns per call. The C++ standard requires std::popcount to compile to POPCNT when the target supports it.
Brian Kernighan's loop
int popcount_kernighan(unsigned x) {
int count = 0;
while (x) {
x &= x - 1; // clear lowest set bit
count++;
}
return count;
}
The loop runs once per set bit, not once per bit position. For sparse integers (a few 1s among many 0s), this is dramatically faster than the naïve "shift and test" loop. Counting bits in a 64-bit integer with 5 set bits: 5 iterations, ~10 ns. The naïve approach: 64 iterations, ~80 ns.
SWAR (SIMD Within A Register)
int popcount_swar(uint64_t x) {
const uint64_t m1 = 0x5555555555555555; // 01 01 01 01 ...
const uint64_t m2 = 0x3333333333333333; // 00110011 00110011 ...
const uint64_t m4 = 0x0F0F0F0F0F0F0F0F; // 00001111 00001111 ...
const uint64_t h = 0x0101010101010101;
x = x - ((x >> 1) & m1); // sum bits in pairs
x = (x & m2) + ((x >> 2) & m2); // sum pair-sums into nibbles
x = (x + (x >> 4)) & m4; // sum nibble-sums into bytes
return (x * h) >> 56; // sum byte-sums via multiply
}
Twelve operations, zero branches, no hardware instruction needed. Works on every 64-bit CPU since 1985. The trick: treat the 64-bit register as a vector of smaller fields and sum them in parallel by masking.
The first line x − ((x >> 1) & m1) sums 32 pairs of bits in place, producing 2-bit counts. Line 2 sums adjacent pair-counts into 4-bit nibble-counts. Line 3 sums adjacent nibble-counts into 8-bit byte-counts. The final multiply by 0x0101...01 sums all 8 byte-counts into the high byte by virtue of how multiplication accumulates. Right-shift extracts the answer.
Trace: count bits in 0xCAFEBABE using SWAR
x = 0xCAFEBABE = 1100 1010 1111 1110 1011 1010 1011 1110
(32-bit; popcount = 22)
Step 1: x = x - ((x >> 1) & 0x55555555)
Each pair of bits is replaced by the count of 1s in that pair.
Pairs: 11 00 10 10 11 11 11 10 10 11 10 10 10 11 11 10
Sums: 10 00 01 01 10 10 10 01 01 10 01 01 01 10 10 01
x = 0x82A9 6595
Step 2: x = (x & 0x33...) + ((x >> 2) & 0x33...)
Sum adjacent pair-counts into 4-bit nibble counts (0-4).
Nibbles: 0010 0010 ... (each < 4)
x = 0x4444 4444 4444 4444 → for our number: nibble sums
Step 3: x = (x + (x >> 4)) & 0x0F...
Sum adjacent nibble counts into byte counts (0-8 per byte).
Step 4: (x * 0x01010101) >> 24
The multiply broadcasts the sum of all four bytes into byte 3,
which the shift extracts.
Final: 22.
When bit tricks pay off
- Bitmask DP. Subsets of a small universe (≤ 32 elements) live in a single int. Operations like "iterate subsets of mask" use
for (s = mask; s; s = (s - 1) & mask). Traveling Salesman with 20 cities uses 2²⁰ × 20 = 20 M DP states fitting in 80 MB. - Bloom filters and hash sets. Each hash function maps to a bit position; setting and testing is
arr[i >> 6] |= 1ULL << (i & 63)andarr[i >> 6] & (1ULL << (i & 63)). Zero allocation, sub-nanosecond. - Fenwick trees / binary indexed trees. Both traversal directions use
i & −i. Range sum query in O(log n) with two pointer operations per step. - Cache-line-aware data layout. Aligning structures with
x & ~63floors to the cache line. Alignment checks withx & 63. - Network and binary parsers. Endian swap (
__builtin_bswap32), extract bit fields with(x >> shift) & mask, set fields with(x & ~mask) | (val << shift). - Cryptographic primitives. XOR encryption, SHA round functions, AES S-box masking. The bit-level structure of these algorithms is fundamentally non-arithmetic.
Avoid bit tricks in code that doesn't need the speed. The compiler optimizer is good enough that a clear if (n > 0 && (n & (n - 1)) == 0) reads better than !(n & -n >> 1) and runs the same speed.
Bit tricks vs naïve approaches
| Operation | Bit trick | Naïve | Bit trick cost | Naïve cost |
|---|---|---|---|---|
| Clear lowest 1 | x & (x-1) | Loop testing each bit | 2 ops | O(bit position) ops |
| Power of 2 test | x & (x-1) == 0 | floor(log2(x)) round-trip | 3 ops | ~50 cycles FP |
| Lowest set bit position | __builtin_ctz(x) | Loop while LSB == 0 | 1 instr | O(position) ops |
| Popcount (POPCNT) | __builtin_popcount(x) | Shift-and-add loop | 1 instr (~1 ns) | ~30 ns |
| Endian swap | __builtin_bswap32(x) | Shift & OR 4 bytes | 1 instr | ~12 ops |
| Sign-flip a sign-magnitude | x ^ (1 << 31) | if-else negation | 1 op | 1 branch |
| Min(a, b) branchless | b ^ ((a^b) & -(a<b)) | if-else | 5 ops, 0 branches | 1 branch |
| Abs(x) branchless | (x ^ (x >> 31)) - (x >> 31) | if-else | 3 ops | 1 branch |
What "single cycle" means in nanoseconds
Modern CPUs run at 3–5 GHz, so one cycle is 200–333 picoseconds. POPCNT, BSR (bit scan reverse), BSF, LZCNT, and TZCNT (count trailing zeros) are all single-cycle latency / single-cycle throughput on Skylake and later. The compiler picks them for free given __builtin_* intrinsics:
| Operation | Hardware | SWAR | Loop |
|---|---|---|---|
| popcount (32 bits) | ~1 ns | ~3 ns | ~25 ns |
| popcount (64 bits) | ~1 ns | ~3 ns | ~50 ns |
| ctz (lowest 1 position) | ~1 ns | ~5 ns | ~20 ns |
| clz (highest 1 position) | ~1 ns | ~5 ns | ~20 ns |
| popcount array of 10⁶ ints | ~1 ms | ~3 ms | ~25 ms |
The hardware-vs-loop gap matters when counting bits is in your inner loop. Bloom filter membership tests at 10⁸/sec throughput need popcount to be sub-nanosecond. Roaring bitmaps' set intersection runs popcount on millions of 64-bit words per second.
JavaScript implementations
// JavaScript: bitwise ops are 32-bit signed. Use >>> for unsigned shift.
const popcount = x => {
x = x - ((x >>> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >>> 2) & 0x33333333);
x = (x + (x >>> 4)) & 0x0F0F0F0F;
return (x * 0x01010101) >>> 24;
};
const clearLowest = x => x & (x - 1);
const lowestBit = x => x & -x;
const isPow2 = x => x > 0 && (x & (x - 1)) === 0;
const ctz = x => popcount((x & -x) - 1); // count trailing zeros
const clz = x => { // count leading zeros
if (x === 0) return 32;
let n = 0;
if ((x & 0xFFFF0000) === 0) { n += 16; x <<= 16; }
if ((x & 0xFF000000) === 0) { n += 8; x <<= 8; }
if ((x & 0xF0000000) === 0) { n += 4; x <<= 4; }
if ((x & 0xC0000000) === 0) { n += 2; x <<= 2; }
if ((x & 0x80000000) === 0) { n += 1; }
return n;
};
const swapBits = (x, i, j) => {
const bit = ((x >>> i) ^ (x >>> j)) & 1;
return x ^ ((bit << i) | (bit << j));
};
const reverseBits = x => {
x = ((x & 0xAAAAAAAA) >>> 1) | ((x & 0x55555555) << 1);
x = ((x & 0xCCCCCCCC) >>> 2) | ((x & 0x33333333) << 2);
x = ((x & 0xF0F0F0F0) >>> 4) | ((x & 0x0F0F0F0F) << 4);
x = ((x & 0xFF00FF00) >>> 8) | ((x & 0x00FF00FF) << 8);
return ((x >>> 16) | (x << 16)) >>> 0;
};
JavaScript's bitwise operators coerce to 32-bit signed integers, so always use >>> (zero-fill right shift) for unsigned semantics. For 64-bit operations, use BigInt — but the constant factor goes up 10–100× because BigInt operations heap-allocate.
Famous problem: single number (XOR trick)
Given an array where every element appears twice except one, find the unique element in O(n) time and O(1) space:
function singleNumber(nums) {
let x = 0;
for (const n of nums) x ^= n;
return x;
}
singleNumber([2, 2, 1]); // 1
singleNumber([4, 1, 2, 1, 2]); // 4
XOR is associative and commutative, and a ^ a = 0. So XORing all elements cancels every pair, leaving just the unique element. The trick generalizes: to find two singletons in a "every-other-element-twice" array, XOR everything to get a ^ b, isolate a bit where they differ with (a^b) & -(a^b), then partition by that bit and XOR each half separately.
Python implementation
def popcount(x):
# Python's int has arbitrary precision; this returns popcount for non-negative x.
count = 0
while x:
x &= x - 1
count += 1
return count
# Python 3.10+ has a built-in
n = (1234).bit_count() # → 6
# Other tricks (32-bit semantics required)
def clear_lowest(x): return x & (x - 1)
def lowest_bit(x): return x & -x
def is_power_of_2(x): return x > 0 and (x & (x - 1)) == 0
def trailing_zeros(x): return (x & -x).bit_length() - 1 # for x > 0
# Numpy for bulk operations
import numpy as np
a = np.array([12, 7, 256, 1024], dtype=np.uint32)
# numpy has no native popcount; use:
def numpy_popcount(arr):
a = arr.copy()
a = a - ((a >> 1) & 0x55555555)
a = (a & 0x33333333) + ((a >> 2) & 0x33333333)
a = (a + (a >> 4)) & 0x0F0F0F0F
return (a * 0x01010101) >> 24
More tricks worth knowing
- Round up to next power of 2.
x--; x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16; x++;— 7 ops, no branches. - Branchless absolute value.
int mask = x >> 31; return (x + mask) ^ mask;— uses arithmetic right shift to broadcast the sign bit. - Branchless min and max.
y ^ ((x ^ y) & -(x < y))for min; replace<with>for max. Faster than CMOV when the branch is unpredictable. - Toggle a bit.
x ^= 1 << n. Set:x |= 1 << n. Clear:x &= ~(1 << n). Test:(x >> n) & 1. - Detect overflow on add.
(a + b) < afor unsigned. For signed, check the sign-bit XOR pattern of operands and result. - Average without overflow.
(a & b) + ((a ^ b) >> 1)computes(a + b) / 2without intermediate overflow. - Sign extension from k bits.
(x ^ (1 << (k - 1))) - (1 << (k - 1))— turns a k-bit value into its sign-extended 32-bit equivalent. - Find lowest bit position.
__builtin_ctz(x)= count trailing zeros = log₂(x & -x). Single-cycle TZCNT on Haswell+.
Common bugs and edge cases
- x − 1 when x = 0. Unsigned subtraction wraps around to
UINT_MAX, then ANDing gives 0 — accidentally fine. Signed: implementation-defined in C before C23. Always guard withif (x)if unsure. - Negation overflow.
−INT_MINis undefined behavior in C (overflows becauseINT_MAX = −INT_MIN − 1). Use−(unsigned)xfor the bit trick to be safe. - Shift by ≥ word size.
x << 32on a 32-bit int is UB in C, returns x in JavaScript (where shift amounts mod 32). Always check the shift range. - Sign-bit propagation on arithmetic shift.
−1 >> 1in C with a signed int is implementation-defined; on virtually every compiler it preserves the sign bit. Use unsigned types or>>>in JavaScript when you want logical shift. - 32-bit assumptions in 64-bit code. Mask constants like
0x55555555are 32 bits. For 64-bit popcount you need0x5555555555555555ULL. Forgetting theULLtruncates the constant. - Compiler may already do it. Modern compilers recognize
(x & -x),x & (x - 1), and SWAR popcount patterns. Writing them out by hand sometimes hurts performance because it confuses the optimizer. Use intrinsics (__builtin_*) when speed matters.
Frequently asked questions
Why does x & (x-1) clear the lowest set bit?
Subtracting 1 from x flips every bit from the lowest set bit downward. The lowest 1 becomes 0; all the 0s below it become 1s. ANDing with the original x then keeps only the bits above the lowest set bit (the ones unchanged by the subtraction) — the lowest 1 itself is cleared. For x = 12 (00001100), x-1 = 11 (00001011), and x & (x-1) = 8 (00001000).
Why does x & -x isolate the lowest set bit?
In two's complement, -x is ~x + 1. The +1 ripples up through any trailing 1s in ~x (which were trailing 0s in x), stopping at the first 0 in ~x — which corresponds to the lowest 1 in x. After the carry, ~x + 1 has 1s in all positions above the lowest set bit and 1 exactly at the lowest set bit position. ANDing with x preserves only the lowest set bit, since the bits above it differ between x and -x.
How fast is __builtin_popcount on modern hardware?
On x86 with POPCNT (introduced in SSE4.2, available on every CPU since 2008), __builtin_popcount compiles to a single instruction with 3-cycle latency and 1-cycle throughput — roughly 1 ns per call. Without POPCNT, GCC falls back to a SWAR sequence of ~12 instructions, still under 5 ns. Compare to a naïve loop counting bits one at a time: ~30 ns for a 32-bit integer.
What is SWAR?
SIMD Within A Register. Treat a 64-bit integer as a vector of smaller values and process them in parallel. For popcount: sum bits in adjacent pairs (mask 0x5555...), then sum pair-sums into nibbles (mask 0x3333...), then nibbles into bytes (mask 0x0F0F...), then multiply by 0x01010101... and shift to get the total. 12 operations, 0 branches, works on every CPU since 1985.
When should I reach for bit tricks?
Hot inner loops where the same operation runs millions of times per second. Bitmask dynamic programming (subset DP), Bloom filters, cryptographic hashes, error-correcting codes, set operations on small universes, fast paths in databases (postings lists, roaring bitmaps), and graphics (color masking, alpha blending). Avoid bit tricks in code that doesn't need them — they hurt readability and yield no measurable speed gain over the compiler's optimizer.
Are bit tricks portable across word sizes?
Most are word-size-agnostic in their math but require careful typing. x & -x works for any signed integer type. Multi-bit tricks (SWAR popcount) need different masks for 32-bit vs 64-bit (0x55555555 vs 0x5555555555555555). The C standard explicitly does not guarantee two's complement before C23, so some tricks involving negation are technically undefined behavior on hypothetical sign-magnitude or one's-complement machines — none exist in practice.