Linear Algebra

Trace of a Matrix

Sum the diagonal — and get the sum of every eigenvalue for free

The trace of a square matrix is the sum of its diagonal entries. It equals the sum of eigenvalues, is invariant under similarity, and obeys the cyclic identity tr(AB) = tr(BA) — the single most useful property in matrix calculus.

  • Definition: tr(A) = Σᵢ Aᵢᵢ (sum of diagonal entries)
  • Cyclic property: tr(AB) = tr(BA); generalises to tr(ABC) = tr(BCA) = tr(CAB)
  • Eigenvalue identity: tr(A) = Σᵢ λᵢ (with algebraic multiplicity)
  • Basis-independent: tr(P⁻¹AP) = tr(A) for any invertible P
  • Cost: O(n), no multiplications needed
  • Used in: matrix calculus, Frobenius norm, divergence, Gibbs free energy, machine learning losses

What it is

For a square n × n matrix A, the trace is the sum of its diagonal entries:

tr(A) = A₁₁ + A₂₂ + ⋯ + Aₙₙ = Σᵢ Aᵢᵢ

That's it — one line of code, O(n) work, no multiplications. The trace is undefined for non-square matrices.

What makes it interesting is not the formula but the identities it satisfies. The individual diagonal entries generally change when you change basis, but their sum does not. The trace is a coordinate-free invariant of a linear operator.

The cyclic property

The single most useful trace identity:

tr(AB) = tr(BA)

This even holds when A and B are non-square — as long as AB and BA are both well-defined. Expand:

tr(AB) = Σᵢ (AB)ᵢᵢ = Σᵢ Σⱼ Aᵢⱼ Bⱼᵢ
       = Σⱼ Σᵢ Bⱼᵢ Aᵢⱼ
       = Σⱼ (BA)ⱼⱼ = tr(BA).

Grouping factors extends the cyclic property to longer products. Treating BC as a single matrix gives tr(A(BC)) = tr((BC)A), and repeating the step gives:

tr(ABC) = tr(BCA) = tr(CAB)

You can rotate the factors as if they were on a circle — but not arbitrarily permute them. tr(ABC) is generally not equal to tr(BAC).
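
A minimal sketch of the difference between rotating and swapping factors, in plain Python. The three 2 × 2 matrices and the helper names `matmul` and `trace` are illustrative choices, not part of the text above.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def trace(M):
    """Sum of the main-diagonal entries of a square matrix."""
    return sum(M[i][i] for i in range(len(M)))

# Illustrative matrices (assumed for this example only).
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[1, 0], [0, 2]]

rotations = [
    trace(matmul(matmul(A, B), C)),  # tr(ABC)
    trace(matmul(matmul(B, C), A)),  # tr(BCA)
    trace(matmul(matmul(C, A), B)),  # tr(CAB)
]
swapped = trace(matmul(matmul(B, A), C))  # tr(BAC): a swap, not a rotation

print(rotations)  # [8, 8, 8]
print(swapped)    # 7
```

All three rotations agree, while the swapped order gives a different value for these particular matrices.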

Worked example — a 3×3 matrix

Take

A = [ 4   1  −2 ]
    [ 0   3   5 ]
    [ 7  −1   2 ]

The trace is 4 + 3 + 2 = 9. Three additions, done.

Verify the eigenvalue identity. The characteristic polynomial is det(λI − A) = λ³ − 9λ² + 45λ − 121 (after expansion). The coefficient of λ² is −tr(A) = −9, and because that coefficient is also minus the sum of the roots, the eigenvalues sum to 9. ✓
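
One way to check the expansion without doing the determinant by hand is the Faddeev–LeVerrier recurrence, which builds the characteristic polynomial from traces alone. The sketch below is an illustration under that choice of method (the text does not say how the polynomial was expanded); `char_poly`, `matmul` and `trace` are hypothetical helper names, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def trace(M):
    """Sum of the main-diagonal entries of a square matrix."""
    return sum(M[i][i] for i in range(len(M)))

def char_poly(A):
    """Coefficients of det(lambda*I - A), highest power first (Faddeev-LeVerrier)."""
    n = len(A)
    coeffs = [Fraction(1)]                      # coefficient of lambda^n
    M = [[Fraction(0)] * n for _ in range(n)]   # M_0 = 0
    for k in range(1, n + 1):
        # M_k = A * M_{k-1} + (previous coefficient) * I
        AM = matmul(A, M)
        M = [[AM[i][j] + (coeffs[-1] if i == j else 0) for j in range(n)]
             for i in range(n)]
        # next coefficient = -tr(A * M_k) / k
        coeffs.append(-trace(matmul(A, M)) / k)
    return coeffs

A = [[4, 1, -2],
     [0, 3, 5],
     [7, -1, 2]]

coeffs = char_poly(A)
print([int(c) for c in coeffs])   # coefficients of lambda^3, lambda^2, lambda, 1
print(coeffs[1] == -trace(A))     # True: the lambda^2 coefficient is -tr(A)
```

The first step of the recurrence is exactly the statement that the λ² coefficient equals −tr(A); the later steps fill in the remaining coefficients.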

Verify the cyclic property. Pick

B = [ 1   0 ]      C = [ 2  −1   0 ]
    [ 2   1 ]          [ 0   3   4 ]
    [ 3  −1 ]

Here B is 3 × 2 and C is 2 × 3. Then BC is 3 × 3, CB is 2 × 2. Compute each diagonal sum: tr(BC) = (1·2 + 0·0) + (2·(−1) + 1·3) + (3·0 + (−1)·4) = 2 + 1 + (−4) = −1. And tr(CB) = (2·1 + (−1)·2 + 0·3) + (0·0 + 3·1 + 4·(−1)) = 0 + (−1) = −1. ✓ Same value — different size matrices.
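
The same check can be run mechanically. This is a small sketch in plain Python using the B and C given above; the helper names are illustrative.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def trace(M):
    """Sum of the main-diagonal entries of a square matrix."""
    return sum(M[i][i] for i in range(len(M)))

B = [[1, 0], [2, 1], [3, -1]]   # 3 x 2
C = [[2, -1, 0], [0, 3, 4]]     # 2 x 3

BC = matmul(B, C)  # 3 x 3
CB = matmul(C, B)  # 2 x 2

print(len(BC), len(CB))        # 3 2
print(trace(BC), trace(CB))    # -1 -1
```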

Variants and generalisations

  • Partial trace. In quantum information, for a bipartite system on H_A ⊗ H_B, the partial trace tr_B(ρ) sums over the B subsystem to give the reduced density matrix on H_A. It is the quantum analogue of marginalising a joint probability distribution: it applies the trace block by block, and for a diagonal ρ it reduces to the classical marginal.
  • Trace class operators. On an infinite-dimensional Hilbert space, the sum Σᵢ ⟨eᵢ, T eᵢ⟩ may not converge for an arbitrary bounded operator. Trace-class operators are those for which this sum converges absolutely regardless of orthonormal basis. They form a Banach space with norm ‖T‖₁ = Σ σᵢ(T) (the sum of singular values).
  • Frobenius inner product. ⟨A, B⟩_F = tr(AᵀB) = Σᵢⱼ Aᵢⱼ Bᵢⱼ — the trace turns matrices into a Hilbert space. The induced norm ‖A‖_F = √tr(AᵀA) is the Euclidean norm of the matrix viewed as a vector.
  • Generalised trace via tensor contractions. tr(A) is the contraction of the two indices of A. In tensor calculus this generalises to arbitrary index contractions, the building block of every Einstein-summation expression.
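
To make the partial trace concrete, here is a minimal sketch for a two-qubit state. It assumes the usual Kronecker ordering of basis states (row index = a · dim_b + b); the function name `partial_trace_b` and the choice of a Bell state are illustrative.

```python
def partial_trace_b(rho, dim_a, dim_b):
    """Trace out subsystem B from a (dim_a*dim_b) x (dim_a*dim_b) matrix.

    Assumes Kronecker ordering: combined index = a * dim_b + b.
    """
    return [
        [sum(rho[i * dim_b + j][k * dim_b + j] for j in range(dim_b))
         for k in range(dim_a)]
        for i in range(dim_a)
    ]

# Density matrix of the Bell state (|00> + |11>) / sqrt(2).
rho = [
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
]

rho_a = partial_trace_b(rho, 2, 2)
print(rho_a)  # [[0.5, 0.0], [0.0, 0.5]]
```

Each entry of the result is the trace of one 2 × 2 block of ρ, and the total trace is preserved: the reduced matrix still sums to 1 on its diagonal.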

Algorithm

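import math

def trace(A):
    """Sum of the main-diagonal entries of a square matrix (list of rows)."""
    n = len(A)
    if any(len(row) != n for row in A):
        raise ValueError("trace is only defined for square matrices")
    s = 0
    for i in range(n):
        s += A[i][i]
    return s

# The Frobenius norm equals sqrt(trace(A^T A)); summing squared
# entries directly gives the same value without forming A^T A.
def frobenius(A):
    return math.sqrt(sum(x * x for row in A for x in row))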

The trace itself is O(n) work with no multiplications, and the sum parallelises trivially. The Frobenius norm costs O(n²) if you sum squared entries directly, versus O(n³) for a naive explicit product AᵀA followed by its trace.

Key identities

Identity | Where it matters
tr(A + B) = tr(A) + tr(B) | Linearity: trace is a linear functional on matrices
tr(cA) = c · tr(A) | Pulls scalars out for free
tr(Aᵀ) = tr(A) | Diagonal is unchanged by transpose
tr(AB) = tr(BA), even when sizes are m×n and n×m | Cyclic property: engine of matrix calculus
tr(P⁻¹AP) = tr(A) | Similarity invariant: basis-independent
tr(A) = Σᵢ λᵢ (with algebraic multiplicity) | Sum of eigenvalues: quick spectral check
d/dt det(I + tH) at t = 0 equals tr(H) | Jacobi's formula: trace is the linearisation of det at I
∂ tr(AX) / ∂X = Aᵀ; ∂ tr(XᵀAX) / ∂X = (A + Aᵀ)X | Cleanest derivatives in matrix calculus, used in every ML gradient
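
The derivative identities are easy to sanity-check numerically. The sketch below compares a central-difference estimate of ∂ tr(XᵀAX) / ∂X with the closed form (A + Aᵀ)X. The matrices, the step size h and the helper names are assumptions made for the example.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def numeric_grad(f, X, h=1e-3):
    """Central-difference estimate of d f / d X, one entry at a time."""
    grad = [[0.0] * len(X[0]) for _ in X]
    for i in range(len(X)):
        for j in range(len(X[0])):
            plus = [row[:] for row in X]
            minus = [row[:] for row in X]
            plus[i][j] += h
            minus[i][j] -= h
            grad[i][j] = (f(plus) - f(minus)) / (2 * h)
    return grad

# Illustrative values (assumed for this example only).
A = [[1.0, 2.0], [0.0, 3.0]]
X = [[1.0, -1.0], [2.0, 0.5]]

def quadratic(X):
    """The scalar tr(X^T A X)."""
    return trace(matmul(matmul(transpose(X), A), X))

numeric = numeric_grad(quadratic, X)
A_plus_At = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, transpose(A))]
analytic = matmul(A_plus_At, X)  # closed form (A + A^T) X

max_error = max(abs(n - a)
                for rn, ra in zip(numeric, analytic)
                for n, a in zip(rn, ra))
print(analytic)          # [[6.0, -1.0], [14.0, 1.0]]
print(max_error < 1e-6)  # True
```

Because tr(XᵀAX) is quadratic in X, the central difference has no truncation error, so any mismatch here would point to a wrong formula rather than a poor step size.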

Common mistakes

  • Confusing tr(AB) with tr(A)·tr(B). They are unrelated in general. Even for diagonal matrices tr(AB) = Σ aᵢbᵢ but tr(A)·tr(B) = (Σ aᵢ)(Σ bᵢ) — almost never the same.
  • Assuming the cyclic property holds for arbitrary permutations. tr(ABC) = tr(BCA) = tr(CAB) is rotation only. tr(BAC) and tr(ACB) are usually different.
  • Trying to take the trace of a non-square matrix. Undefined. The trace lives on square matrices; the Frobenius inner product extends it via tr(AᵀB) which is defined for matching shapes.
  • Forgetting that trace counts algebraic, not geometric, multiplicities. A defective matrix with eigenvalue 2 of algebraic multiplicity 3 contributes 2·3 = 6 to the trace, even if it has only one linearly independent eigenvector.
  • Misreading the partial-trace direction. In quantum mechanics, tr_B(ρ_(AB)) "traces out B" — it leaves a reduced density matrix on A, not on B. The subscript labels what is summed away, not what survives.
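  • Using trace inequality bounds incorrectly. |tr(AB)| ≤ ‖A‖_F · ‖B‖_F is Cauchy–Schwarz in the Frobenius inner product. |tr(AB)| ≤ ‖A‖₂ · ‖B‖₁ is the Hölder inequality on singular values, where ‖A‖₂ is the spectral norm (largest singular value) and ‖B‖₁ is the nuclear norm (sum of singular values). Neither bound dominates the other in general, so pick the one that matches the norms you can control.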
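
The first and last mistakes in this list can be illustrated with a few lines of plain Python. The matrices are arbitrary examples chosen here, and only the Frobenius (Cauchy–Schwarz) bound is checked, since it needs no singular values.

```python
import math

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def frobenius(M):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in M for x in row))

# Illustrative matrices (assumed for this example only).
A = [[1, 2], [0, 1]]
B = [[3, 0], [1, 2]]

tr_ab = trace(matmul(A, B))          # trace of the product
tr_a_tr_b = trace(A) * trace(B)      # product of the traces
bound = frobenius(A) * frobenius(B)  # Cauchy-Schwarz upper bound

print(tr_ab, tr_a_tr_b)     # 7 10
print(abs(tr_ab) <= bound)  # True
```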

Where the trace shows up

  • Matrix calculus. Every gradient with respect to a matrix variable simplifies via the trace trick — d(scalar)/d(matrix) is computed by rewriting the scalar as a trace and applying tr(AB) = tr(BA). Without it, ML loss derivatives become unmanageable.
  • Frobenius norm and low-rank approximation. ‖A‖²_F = tr(AᵀA) = Σ σᵢ². Truncating the SVD to its k largest singular values gives the rank-k matrix B that minimises ‖A − B‖_F (Eckart–Young theorem).
  • Quantum mechanics. ⟨A⟩ = tr(ρ A) is the expectation of observable A in state ρ. Probabilities are tr(ρ P) for projectors P. Von Neumann entropy is −tr(ρ log ρ).
  • Statistical mechanics. The partition function Z = tr(e^(−βH)) sums Boltzmann factors over all energy eigenstates simultaneously, by exploiting basis-independence of the trace.
  • Differential geometry. The divergence of a vector field is the trace of its Jacobian. The Ricci scalar is the trace of the Ricci tensor. The Laplacian is the trace of the Hessian.
  • Determinant derivatives in optimisation. d log det(A)/dA = (A⁻¹)ᵀ for invertible A with det(A) > 0. This follows from Jacobi's formula via the trace, and it is the gradient that appears when differentiating Gaussian log-likelihoods.
  • Machine learning regularisers. Frobenius regularisation tr(WᵀW), nuclear norm regularisation tr(√(WᵀW)), graph Laplacian regularisation tr(WᵀLW) — every one of these is a trace under the hood.
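
As one concrete instance from this list, the log-determinant gradient d log det(A)/dA = (A⁻¹)ᵀ can be checked by finite differences. The sketch below restricts itself to a 2 × 2 matrix so the determinant and inverse have closed forms; the matrix, the step size and the helper names are assumptions for illustration.

```python
import math

def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inverse_transpose2(M):
    """Transpose of the inverse of a 2 x 2 matrix, via the adjugate formula."""
    d = det2(M)
    return [[M[1][1] / d, -M[1][0] / d],
            [-M[0][1] / d, M[0][0] / d]]

def numeric_grad_logdet(M, h=1e-6):
    """Central-difference estimate of d log det(M) / d M."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            plus = [row[:] for row in M]
            minus = [row[:] for row in M]
            plus[i][j] += h
            minus[i][j] -= h
            grad[i][j] = (math.log(det2(plus)) - math.log(det2(minus))) / (2 * h)
    return grad

# Illustrative matrix with positive determinant (assumed for this example).
A = [[2.0, 1.0], [0.5, 3.0]]

numeric = numeric_grad_logdet(A)
analytic = inverse_transpose2(A)
max_error = max(abs(numeric[i][j] - analytic[i][j])
                for i in range(2) for j in range(2))
print(max_error < 1e-6)  # True
```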

Frequently asked questions

What is the trace of a matrix?

For a square n × n matrix A, the trace is the sum of its main-diagonal entries: tr(A) = A_11 + A_22 + ⋯ + A_nn. It is defined only for square matrices. Despite its simple definition, it carries deep information — it equals the sum of eigenvalues, the derivative of the determinant at the identity, and it is invariant under change of basis.

Why is tr(AB) = tr(BA)?

Write tr(AB) = Σ_i (AB)_ii = Σ_i Σ_j A_ij · B_ji. The double sum is finite, so the order of summation can be swapped, and the scalar factors commute: Σ_j Σ_i B_ji · A_ij = Σ_j (BA)_jj = tr(BA). The identity holds even when A and B are non-square, as long as A is m × n and B is n × m so that AB and BA are both defined. Applying it to grouped factors gives the full cyclic property: tr(ABC) = tr(BCA) = tr(CAB).

Is the trace basis-independent?

Yes. For any invertible P, tr(P^(-1) A P) = tr(A P P^(-1)) = tr(A) by the cyclic property. That makes trace a similarity invariant, so the trace of a linear operator is well-defined regardless of which basis you write it in. Eigenvalues and determinant share this property; the off-diagonal entries do not.
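
A small integer example shows the diagonal entries changing under a change of basis while their sum stays fixed. The matrices A and P below, and the helper names, are illustrative choices.

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

# Illustrative matrices (assumed for this example only).
A = [[4, 1], [2, 3]]
P = [[1, 1], [0, 1]]
P_inv = [[1, -1], [0, 1]]   # inverse of P, checked below

assert matmul(P, P_inv) == [[1, 0], [0, 1]]

similar = matmul(matmul(P_inv, A), P)   # P^(-1) A P

print(similar)                    # [[2, 0], [2, 5]]
print(trace(A), trace(similar))   # 7 7
```

The diagonal goes from (4, 3) to (2, 5), but both sum to 7.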

Why does the trace equal the sum of eigenvalues?

The characteristic polynomial det(λI − A) expands as λ^n − tr(A) λ^(n-1) + ⋯ + (−1)^n det(A). It also factors over the complex numbers as (λ − λ_1)(λ − λ_2)⋯(λ − λ_n), with each eigenvalue repeated according to its algebraic multiplicity. The coefficient of λ^(n-1) is −Σ λ_i in the factored form and −tr(A) in the expanded form, so tr(A) = Σ λ_i. For diagonalisable A this also follows from A = PDP^(-1) and tr(PDP^(-1)) = tr(D).
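
For a 2 × 2 matrix the comparison of coefficients can be written out in full. The entry names a, b, c, d are chosen here for illustration.

```latex
\det(\lambda I - A)
  = \det\begin{pmatrix} \lambda - a & -b \\ -c & \lambda - d \end{pmatrix}
  = (\lambda - a)(\lambda - d) - bc
  = \lambda^{2} - (a + d)\,\lambda + (ad - bc),
\qquad
(\lambda - \lambda_1)(\lambda - \lambda_2)
  = \lambda^{2} - (\lambda_1 + \lambda_2)\,\lambda + \lambda_1 \lambda_2 .
```

Matching the coefficients of λ gives a + d = λ_1 + λ_2, and matching the constant terms gives ad − bc = λ_1 λ_2.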

What is the Frobenius inner product?

It is the natural inner product on matrices, defined by ⟨A, B⟩_F = tr(A^T B) = Σ_ij A_ij B_ij. The corresponding norm is the Frobenius norm ‖A‖_F = sqrt(tr(A^T A)) = sqrt(Σ A_ij²). It is what generalises the dot product from vectors to matrices and underlies low-rank approximation (Eckart–Young), least-squares fitting on matrix variables, and reproducing-kernel methods.
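
A short sketch confirming that tr(A^T B) and the entrywise sum agree, using two 2 × 3 matrices to show that the inputs need matching shapes but need not be square. The values and helper names are illustrative.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

# Illustrative 2 x 3 matrices (assumed for this example only).
A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 0, -1], [2, 1, 0]]

via_trace = trace(matmul(transpose(A), B))   # tr(A^T B), A^T B is 3 x 3
entrywise = sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

frobenius_norm = math.sqrt(trace(matmul(transpose(A), A)))

print(via_trace, entrywise)                        # 11 11
print(math.isclose(frobenius_norm ** 2, 91))       # True
```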

How is the trace related to derivatives of the determinant?

Jacobi's formula: d/dt det(A(t)) = det(A) · tr(A^(-1) · dA/dt). At A = I this collapses to d/dt det(I + tH)|_(t=0) = tr(H). The trace is literally the linearisation of det at the identity. That is why the divergence of a vector field (the trace of the Jacobian) controls how volumes expand under the flow.
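
The linearisation at the identity can be checked numerically. The sketch below estimates d/dt det(I + tH) at t = 0 by a central difference for a 3 × 3 matrix H; the matrix, the step size and the helper names are assumptions for illustration.

```python
def det3(M):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def identity_plus(t, H):
    """Return I + t * H for a square matrix H."""
    n = len(H)
    return [[(1.0 if i == j else 0.0) + t * H[i][j] for j in range(n)]
            for i in range(n)]

# Illustrative matrix (assumed for this example only); its trace is 2.
H = [[1.0, 2.0, 0.0],
     [0.0, -1.0, 3.0],
     [4.0, 0.0, 2.0]]

h = 1e-5
derivative = (det3(identity_plus(h, H)) - det3(identity_plus(-h, H))) / (2 * h)
trace_h = sum(H[i][i] for i in range(3))

print(abs(derivative - trace_h) < 1e-6)  # True
```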

Where does the trace appear in machine learning?

Everywhere matrix calculus shows up. The Frobenius norm regulariser is tr(W^T W). The cross-entropy with one-hot targets reduces to a trace. The trace of the Hessian is the Laplacian of the loss. The trace of a covariance (scatter) matrix is the total variance, and the trace of the within-cluster scatter matrix is the k-means objective. Differentiating tr(AXB) with respect to X gives A^T B^T, the cleanest pattern in matrix calculus.
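
As an example of the cross-entropy claim, the sketch below writes the summed cross-entropy over a tiny batch as −tr(Y^T log P), where Y holds one-hot targets (one row per sample) and P holds predicted probabilities. The batch, the variable names and the choice to sum rather than average are assumptions for illustration.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

# Two samples, three classes (illustrative values).
Y = [[1, 0, 0],
     [0, 0, 1]]
P = [[0.7, 0.2, 0.1],
     [0.1, 0.3, 0.6]]

log_p = [[math.log(p) for p in row] for row in P]

# Trace form: -tr(Y^T log P), where Y^T log P is classes x classes.
loss_trace = -trace(matmul(transpose(Y), log_p))

# Direct form: minus the log-probability of each sample's true class.
loss_direct = -sum(math.log(P[n][row.index(1)]) for n, row in enumerate(Y))

print(math.isclose(loss_trace, loss_direct))  # True
```

The diagonal of Y^T log P picks out exactly the log-probabilities of the true classes, which is why the trace reproduces the usual per-sample sum.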