A Neat Not-Randomized Algorithm: Polar Express

Every once in a while, a paper comes out that is so delightful that I can’t help but share it on this blog, and I’ve started a little series, Neat Randomized Algorithms, for exactly this purpose. Today’s entry in this collection is The Polar Express: Optimal Matrix Sign Methods and their Application to the Muon Algorithm by Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. Despite its authors belonging to the randomized linear algebra community, this paper is actually about a plain-old deterministic algorithm. But it’s just so delightful that I couldn’t help but share it in this series anyway.

The authors of The Polar Express are motivated by the recent Muon algorithm for neural network optimization. The basic idea of Muon is that it helps to orthogonalize the search directions in a stochastic gradient method. That is, rather than update a weight matrix W with search direction G using the update rule

    \[W \gets W - \eta G,\]

instead use the update

    \[W\gets W - \eta \operatorname{polar}(G).\]

Here,

    \[\operatorname{polar}(G) \coloneqq \operatorname*{argmin}_{Q \textrm{ with orthonormal columns}} \norm{G - Q}_{\rm F}\]

is the closest matrix with orthonormal columns to G and is called the (unitary) polar factor of G. (Throughout this post, we shall assume for simplicity that G is tall and full-rank.) Muon relies on efficient algorithms for rapidly approximating \operatorname{polar}(G).

Given a singular value decomposition G = U\Sigma V^\top, the polar factor may be computed in closed form as \operatorname{polar}(G) = UV^\top. But computing the SVD is computationally expensive, particularly in GPU computing environments. Are there more efficient algorithms that avoid the SVD? In particular, can we design algorithms that use only matrix multiplications, for maximum GPU efficiency?
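For reference, here is a minimal SVD-based sketch of this baseline, using an arbitrary tall test matrix of my own choosing; the rest of the post is about avoiding exactly this SVD. The QR comparison at the end is only there to illustrate the argmin characterization of the polar factor.

G = randn(1000, 50);                 % tall, (almost surely) full-rank test matrix
[U, ~, V] = svd(G, "econ");          % economy-size SVD, G = U*S*V'
polarG = U*V';                       % polar factor: all singular values replaced by one
norm(polarG'*polarG - eye(50))       % should be ~ machine precision (orthonormal columns)
norm(G - polarG, "fro")              % Frobenius distance to the closest orthonormal-columns matrix
[Q, ~] = qr(G, "econ");              % a different matrix with orthonormal columns...
norm(G - Q, "fro")                   % ...should be at least as far from G as polarG is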

The Polar Factor as a Singular Value Transformation

Computing the polar factor \operatorname{polar}(G) of a matrix G effectively applies an operation to G which replaces all of its singular values by one. Such operations are studied in quantum computing, where they are called singular value transformations.

Definition (singular value transformation): Given an odd function f, the singular value transformation of G = U\Sigma V^\top by f is f[G] \coloneqq Uf(\Sigma)V^\top.

On its face, it might seem that the polar factor of G cannot be obtained as a singular value transformation. After all, the constant function f(x) = 1 is not odd. But, to obtain the polar factor, we only need a function f which sends positive inputs to 1. Thus, the polar decomposition \operatorname{polar}(G) is given by the singular value transformation associated with the sign function:

    \[\operatorname{sign}(x) = \begin{cases} 1, & x > 0, \\ 0, & x = 0, \\ -1, & x < 0. \end{cases}\]

The sign function is manifestly odd, and the polar factor satisfies

    \[\operatorname{polar}(G) = \operatorname{sign}[G].\]
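As a quick numerical check (on a small, tall, full-rank test matrix of my own choosing), applying the sign function to the singular values does indeed reproduce the polar factor:

G = randn(6, 3);              % tall, full-rank test matrix
[U, S, V] = svd(G, "econ");
signG = U*sign(S)*V';         % singular value transformation by the sign function
polarG = U*V';                % polar factor
norm(signG - polarG)          % zero: sign(S) is the identity when G is full-rank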

Singular Value Transformations and Polynomials

How might we go about computing the singular value transformation of a matrix? For an (odd) polynomial, this computation can be accomplished using a sequence of matrix multiplications alone. Indeed, for p(x) = a_1 x + a_3 x^3 + \cdots + a_{2k+1} x^{2k+1}, we have

    \[p[G] = a_1 G + a_3 G(G^\top G) + \cdots + a_{2k+1} G(G^\top G)^k.\]

For a general (odd) function f, we can approximately compute the singular value transformation f[G] by first approximating f by a polynomial p, and then using p[G] as a proxy for f[G]. Here is an example:

>> G = randn(2)                            % Random test matrix
G =
   0.979389080992349  -0.198317114406418
  -0.252310961830649  -1.242378171072736
>> [U,S,V] = svd(G);
>> fG = U*sin(S)*V'                        % Singular value transformation
fG =
   0.824317193982434  -0.167053523352195
  -0.189850719961322  -0.935356030417109
>> pG = G - (G*G'*G)/6 + (G*G'*G*G'*G)/120 % Polynomial approximation
pG =
   0.824508188218982  -0.167091255945116
  -0.190054681059327  -0.936356677704568

We see that we get reasonably high accuracy by approximating \sin[G] using its degree-five Taylor polynomial p(x) = x - x^3/6 + x^5/120.

The Power of Composition

The most basic approach to computing the sign function would be to use a fixed polynomial of degree 2k+1. However, this approach converges fairly slowly as we increase the degree k.

A better strategy is to use compositions. A nice feature of the sign function is the fixed point property: For every x, \operatorname{sign}(x) is a fixed point of the \operatorname{sign} function:

    \[\operatorname{sign}(\operatorname{sign}(x)) = \operatorname{sign}(x) \quad \text{for all } x \in \real.\]

The fixed point property suggests an alternative strategy for computing the sign function using polynomials. Rather than using one polynomial of large degree, we can instead compose many polynomials of low degree. The simplest such compositional algorithm is the Newton–Schulz iteration, which consists of initializing P\gets G and applying the following fixed point iteration until convergence:

    \[P \gets \frac{3}{2} P - \frac{1}{2} PP^\top P.\]

Here is an example execution of the algorithm:

>> P = randn(100) / 25;
>> [U,~,V] = svd(P); polar = U*V'; % True polar decomposition
>> for i = 1:20
      P = 1.5*P-0.5*P*P'*P; % Newton-Schulz iteration
      fprintf("Iteration %d\terror %e\n",i,norm(P - polar));
   end
Iteration 1	error 9.961421e-01
Iteration 2	error 9.942132e-01
Iteration 3	error 9.913198e-01
Iteration 4	error 9.869801e-01
Iteration 5	error 9.804712e-01
Iteration 6	error 9.707106e-01
Iteration 7	error 9.560784e-01
Iteration 8	error 9.341600e-01
Iteration 9	error 9.013827e-01
Iteration 10	error 8.525536e-01
Iteration 11	error 7.804331e-01
Iteration 12	error 6.759423e-01
Iteration 13	error 5.309287e-01
Iteration 14	error 3.479974e-01
Iteration 15	error 1.605817e-01
Iteration 16	error 3.660929e-02
Iteration 17	error 1.985827e-03
Iteration 18	error 5.911348e-06
Iteration 19	error 5.241446e-11
Iteration 20	error 6.686995e-15

As we see, the initial rate of convergence is very slow, and we obtain only a single digit of accuracy after 15 iterations. After this burn-in period, convergence is very rapid, and the method achieves machine accuracy after 20 iterations.

The Polar Express

The Newton–Schulz iteration approximates the sign function by composing the same polynomial p with itself repeatedly. But we can get better approximations by applying a sequence of different polynomials p_1,\ldots,p_t, resulting in an approximation of the form

    \[\operatorname{sign}[G] \approx p_t[p_{t-1}[\cdots p_2[p_1[G]]\cdots]].\]

The Polar Express paper asks the question:

What is the optimal choice of polynomials p_1,\ldots,p_t?

For simplicity, the authors of The Polar Express focus on the case where all of the polynomials p_i have the same (odd) degree 2k+1.

On its face, this problem might seem intractable, as the best choice of polynomial p_{i+1} could seemingly depend in a complicated way on all of the previous polynomials p_1,\ldots,p_i. Fortunately, the authors of The Polar Express show that there is a very simple way of computing the optimal polynomials. Begin by assuming that the singular values of G lie in an interval [\ell_0,u_0]. We then choose p_1 to be the odd, degree-(2k+1) polynomial approximation to the sign function on [\ell_0,u_0] that minimizes the L_\infty error:

    \[p_1 = \operatorname*{argmin}_{\text{odd degree-($2k+1$) polynomial } p} \max_{x \in [\ell_0,u_0]} |p(x) - \operatorname{sign}(x)|.\]

This optimal polynomial can be computed by a version of the Remez algorithm provided in the Polar Express paper. After applying p_1 to G, the singular values of p_1[G] lie in a new interval [\ell_1,u_1]. To build the next polynomial p_2, we simply find the optimal approximation to the sign function on this interval:

    \[p_2 = \operatorname*{argmin}_{\text{odd degree-($2k+1$) polynomial } p} \max_{x \in [\ell_1,u_1]} |p(x) - \operatorname{sign}(x)|.\]

Continuing in this way, we can generate as many polynomials p_1,p_2,\ldots as we want.

For given values of \ell_0 and u_0, the coefficients of the optimal polynomials p_1,p_2,\ldots can be computed in advance and stored, allowing for rapid deployment at runtime. Moreover, we can always ensure the upper bound is u_0 = 1 by normalizing G\gets G / \norm{G}_{\rm F}. As such, there is only one parameter \ell_0 that we need to know in order to compute the optimal coefficients. The authors of The Polar Express are motivated by applications in deep learning using 16-bit floating point numbers. In this setting, the lower bound \ell_0 = 0.001 is appropriate. (As the authors stress, their method remains convergent even if too large a value of \ell_0 is chosen, though convergence may be slowed somewhat.)

Below, I repeat the experiment from above using (degree-5) Polar Express instead of Newton–Schulz. The coefficients for the optimal polynomials are taken from the Polar Express paper.

>> P = randn(100) / 25;
>> [U,~,V] = svd(P); polar = U*V';
>> P2 = P*P'; P = ((17.300387312530933*P2-23.595886519098837*eye(100))*P2+8.28721201814563*eye(100))*P; fprintf("Iteration 1\terror %e\n",norm(P - polar));
Iteration 1	error 9.921347e-01
>> P2 = P*P'; P = ((0.5448431082926601*P2-2.9478499167379106*eye(100))*P2+4.107059111542203*eye(100))*P; fprintf("Iteration 2\terror %e\n",norm(P - polar));
Iteration 2	error 9.676980e-01
>> P2 = P*P'; P = ((0.5518191394370137*P2-2.908902115962949*eye(100))*P2+3.9486908534822946*eye(100))*P; fprintf("Iteration 3\terror %e\n",norm(P - polar));
Iteration 3	error 8.725474e-01
>> P2 = P*P'; P = ((0.51004894012372*P2-2.488488024314874*eye(100))*P2+3.3184196573706015*eye(100))*P; fprintf("Iteration 4\terror %e\n",norm(P - polar));
Iteration 4	error 5.821937e-01
>> P2 = P*P'; P = ((0.4188073119525673*P2-1.6689039845747493*eye(100))*P2+2.300652019954817*eye(100))*P; fprintf("Iteration 5\terror %e\n",norm(P - polar));
Iteration 5	error 1.551595e-01
>> P2 = P*P'; P = ((0.37680408948524835*P2-1.2679958271945868*eye(100))*P2+1.891301407787398*eye(100))*P; fprintf("Iteration 6\terror %e\n",norm(P - polar));
Iteration 6	error 4.588549e-03
>> P2 = P*P'; P = ((0.3750001645474248*P2-1.2500016453999487*eye(100))*P2+1.8750014808534479*eye(100))*P; fprintf("Iteration 7\terror %e\n",norm(P - polar));
Iteration 7	error 2.286853e-07
>> P2 = P*P'; P = ((0.375*P2-1.25*eye(100))*P2+1.875*eye(100))*P; fprintf("Iteration 8\terror %e\n",norm(P - polar));
Iteration 8	error 1.113148e-14

We see that the Polar Express algorithm converges to machine accuracy in only 8 iterations (24 matrix products), a speedup over the 20 iterations (40 matrix products) required by Newton–Schulz. The Polar Express paper contains further examples with even more significant speedups.
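For convenience, here is one way to package the hard-coded coefficients used above into a single routine. This is a sketch of my own, not code from the paper: the function name is just a label, the coefficient table is simply copied from the commands above, and the input is assumed to have been scaled so that its singular values lie in [0.001, 1] (for instance, by dividing by the Frobenius norm).

function P = polar_express_deg5(G)
   % Sketch: apply the eight precomputed degree-5 polynomials from above.
   % Each row of coeffs holds [a b c] for the update
   %    P <- a*P + b*P*(P'*P) + c*P*(P'*P)^2,
   % which, after regrouping, equals ((c*P2 + b*I)*P2 + a*I)*P with P2 = P*P'
   % as used in the transcript above.
   coeffs = [ 8.28721201814563    -23.595886519098837   17.300387312530933
              4.107059111542203    -2.9478499167379106   0.5448431082926601
              3.9486908534822946   -2.908902115962949    0.5518191394370137
              3.3184196573706015   -2.488488024314874    0.51004894012372
              2.300652019954817    -1.6689039845747493   0.4188073119525673
              1.891301407787398    -1.2679958271945868   0.37680408948524835
              1.8750014808534479   -1.2500016453999487   0.3750001645474248
              1.875                -1.25                 0.375              ];
   P = G; I = eye(size(G,2));
   for i = 1:size(coeffs,1)
      a = coeffs(i,1); b = coeffs(i,2); c = coeffs(i,3);
      PtP = P'*P;                       % Gram matrix, small when G is tall
      P = P*(a*I + (b*I + c*PtP)*PtP);  % odd quintic in P, three matrix products
   end
end

With P = randn(100)/25 as in the experiment above, polar_express_deg5(P) should reproduce the final iterate from the transcript up to rounding.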

Make sure to check out the Polar Express paper for many details not shared here, including extra tricks to improve stability in 16-bit floating point arithmetic, discussions about how to compute the optimal polynomials, and demonstrations of the Polar Express algorithm for training GPT-2.

References: Muon was first formally described in the blog post Muon: An optimizer for hidden layers in neural networks (2024); for more, see this blog post by Jeremy Bernstein and this paper by Jeremy Bernstein and Laker Newhouse. The Polar Express is proposed in The Polar Express: Optimal Matrix Sign Methods and their Application to the Muon Algorithm (2025) by Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. For more on the matrix sign function and computing it, chapter 5 of Functions of Matrices: Theory and Computation (2008) by Nicholas H. Higham is an enduringly useful reference.

Neat Randomized Algorithms: RandDiag for Rapidly Diagonalizing Normal Matrices

Consider two complex-valued square matrices A\in\complex^{n\times n} and B\in\complex^{n\times n}. The first matrix A is Hermitian, meaning it is equal to its conjugate transpose, A = A^*. The other matrix B is non-Hermitian, B \ne B^*. Let’s see how long it takes to compute their eigenvalue decompositions in MATLAB:

>> A = randn(1e3) + 1i*randn(1e3); A = (A+A')/2;
>> tic; [V_A,D_A] = eig(A); toc % Hermitian matrix
Elapsed time is 0.415145 seconds.
>> B = randn(1e3) + 1i*randn(1e3);
>> tic; [V_B,D_B] = eig(B); toc % non-Hermitian matrix
Elapsed time is 1.668246 seconds.

We see that it takes 4\times longer to compute the eigenvalues of the non-Hermitian matrix B as compared to the Hermitian matrix A. Moreover, the matrix V_A of eigenvectors for a Hermitian matrix A = V_AD_AV_A^{-1} is a unitary matrix, V_A^*V_A = V_AV_A^* = I.

There is another class of matrices with nice eigenvalue decompositions: normal matrices. A square, complex-valued matrix C is normal if C^*C = CC^*. The matrix V_C of eigenvectors for a normal matrix C = V_C D_C V_C^{-1} is also unitary, V_C^*V_C = V_CV_C^* = I. Unitary matrices themselves form an important class of normal matrices: a unitary matrix U satisfies U^*U = UU^* = I, so it is automatically normal.

Let’s see how long it takes MATLAB to compute the eigenvalue decomposition of a unitary (and thus normal) matrix:

>> U = V_A;                     % unitary, and thus normal, matrix
>> tic; [V_U,D_U] = eig(U); toc % normal matrix
Elapsed time is 2.201017 seconds.

Even longer than it took to compute an eigenvalue decomposition of the non-normal matrix B! Can we make the normal eigenvalue decomposition closer to the speed of the Hermitian eigenvalue decomposition?

Here is the start of an idea. Every square matrix C has a Cartesian decomposition:

    \[C = H + iS, \quad H = \frac{C+C^*}{2}, \quad S = \frac{C-C^*}{2i}.\]

We have written C as a combination of its Hermitian part H and its skew-Hermitian part iS. Note that both H and S themselves are Hermitian matrices. The Cartesian decomposition of a square matrix is analogous to the decomposition of a complex number into its real and imaginary parts.

For a normal matrix C, the Hermitian and skew-Hermitian parts commute, HS = SH. We know from matrix theory that commuting Hermitian matrices are simultaneously diagonalizable, i.e., there exists Q such that H = QD_HQ^* and S = QD_SQ^* for diagonal matrices D_H and D_S. Thus, given access to such Q, C has eigenvalue decomposition

    \[C = Q(D_H+iD_S)Q^*.\]
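As a quick sanity check (on a small unitary, hence normal, test matrix of my own choosing), we can confirm both that the Cartesian decomposition recovers C and that H and S commute:

C = orth(randn(3) + 1i*randn(3)); % unitary, hence normal, test matrix
H = (C + C')/2; S = (C - C')/2i;  % Cartesian decomposition: C = H + i*S
norm(C - (H + 1i*S))              % should be ~ machine precision
norm(H*S - S*H)                   % should be ~ machine precision, since C is normal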

Here’s a first attempt to turn this insight into an algorithm. First, compute the Hermitian part H of C, diagonalize H = QD_HQ^*, and then see if Q diagonalizes C. Let’s test this out on a 2\times 2 example:

>> C = orth(randn(2) + 1i*randn(2)); % unitary matrix
>> H = (C+C')/2;                     % Hermitian part
>> [Q,~] = eig(H);
>> Q'*C*Q                            % check to see if diagonal
ans =
  -0.9933 + 0.1152i  -0.0000 + 0.0000i
   0.0000 + 0.0000i  -0.3175 - 0.9483i

Yay! We’ve succeeded at diagonalizing the matrix C using only a Hermitian eigenvalue decomposition. But we should be careful about declaring victory too early. Here’s a bad example:

>> C = [1 1i;1i 1]; % normal matrix
>> H = (C+C')/2;
>> [Q,~] = eig(H);
>> Q'*C*Q           % oh no! not diagonal
ans =
   1.0000 + 0.0000i   0.0000 + 1.0000i
   0.0000 + 1.0000i   1.0000 + 0.0000i

What’s going on here? The issue is that the Hermitian part H = I of this matrix has a repeated eigenvalue. Thus, H has multiple different valid matrices of eigenvectors. (In this specific case, every unitary matrix Q diagonalizes H.) By looking at H alone, we don’t know which matrix Q to pick so that it also diagonalizes S.

He and Kressner developed a beautifully simple randomized algorithm called RandDiag to circumvent this failure mode. The idea is straightforward:

  1. Form a random linear combination M = \gamma_1 H + \gamma_2 S of the Hermitian and skew-Hermitian parts of C, with standard normal random coefficients \gamma_1 and \gamma_2.
  2. Compute Q that diagonalizes M.

That’s it!

To get a sense of why He and Kressner’s algorithm works, suppose that H has some repeated eigenvalues and S has all distinct eigenvalues. Given this setup, it seems likely that a random linear combination of H and S will also have all distinct eigenvalues. (It would take very special circumstances for a random linear combination to yield two eigenvalues that are exactly the same!) Indeed, this intuition is a fact: With 100% probability, a matrix Q diagonalizing a Gaussian random linear combination of simultaneously diagonalizable matrices H and S also diagonalizes H and S individually.

MATLAB code for RandDiag is as follows:

function Q = rand_diag(C)
   H = (C+C')/2; S = (C-C')/2i;
   M = randn*H + randn*S;
   [Q,~] = eig(M);
end

When applied to our hard 2\times 2 example from before, RandDiag succeeds at giving a matrix that diagonalizes C:

>> Q = rand_diag(C);
>> Q'*C*Q
ans =
   1.0000 - 1.0000i  -0.0000 + 0.0000i
  -0.0000 - 0.0000i   1.0000 + 1.0000i

For computing the matrix of eigenvectors for a 1000\times 1000 unitary matrix, RandDiag takes 0.4 seconds, just as fast as the Hermitian eigendecomposition did.

>> tic; V_U = rand_diag(U); toc
Elapsed time is 0.437309 seconds.

He and Kressner’s algorithm is delightful. Ultimately, it uses randomness in only a small way. For most coefficients a_1,a_2 \in \real, a matrix Q diagonalizing a_1 H + a_2 S will also diagonalize C = H+iS. But, for any specific choice of a_1,a_2, there is a possibility of failure. To avoid this possibility, we can just pick a_1 and a_2 at random. It’s really as simple as that.

References: RandDiag was proposed in A simple, randomized algorithm for diagonalizing normal matrices by He and Kressner (2024), building on their earlier work in Randomized Joint Diagonalization of Symmetric Matrices (2022) which considers the general case of using random linear combinations to (approximately) simultaneous diagonalize (nearly) commuting matrices. RandDiag is an example of a linear algebraic algorithm that uses randomness to put the input into “general position”; see Randomized matrix computations: Themes and variations by Kireeva and Tropp (2024) for a discussion of this, and other, ways of using randomness to design matrix algorithms.

Neat Randomized Algorithms: Randomized Cholesky QR

As a research area, randomized numerical linear algebra (RNLA) is as hot as ever. To celebrate the exciting work in this space, I’m starting a new series on my blog highlighting cool recent algorithms in the area. At some future point, I might talk about my own work in this series, but for now I’m hoping to use it to showcase some of the awesome work being done by my colleagues.


Given a tall matrix A \in \real^{m\times n} with m\ge n, its (economy-size) QR factorization is a decomposition of the form A = QR, where Q \in \real^{m\times n} is a matrix with orthonormal columns and R \in \real^{n\times n} is upper triangular. QR factorizations are used to solve least-squares problems and as a computational procedure to orthonormalize the columns of a matrix.

Here’s an example in MATLAB, where we use QR factorization to orthonormalize the columns of a 10^6\times 10^2 test matrix. It takes about 2.5 seconds to run.

>> A = randn(1e6, 1e2) * randn(1e2) * randn(1e2); % test matrix
>> tic; [Q,R] = qr(A,"econ"); toc
Elapsed time is 2.647317 seconds.

The classical algorithm for computing a QR factorization uses Householder reflectors and is exceptionally numerically stable. Since Q has orthonormal columns, Q^\top Q = I is the identity matrix. Indeed, this relation holds up to a tiny error for the Q computed by Householder QR:

>> norm(Q'*Q - eye(1e2)) % || Q^T Q - I ||
ans =
   7.0396e-14

The relative error \|A - QR\|/\|A\| is also small:

>> norm(A - Q*R) / norm(A)
ans =
   4.8981e-14

Here is an alternate procedure for computing a QR factorization, known as Cholesky QR:

function [Q,R] = cholesky_qr(A)
   R = chol(A'*A);
   Q = A / R;       % Q = A * R^{-1}
end

This algorithm works by forming A^\top A, computing its (upper triangular) Cholesky decomposition A^\top A = R^\top R, and setting Q = AR^{-1}. Cholesky QR is very fast, about 5\times faster than Householder QR for this example:

>> tic; [Q,R] = cholesky_qr(A); toc
Elapsed time is 0.536694 seconds.

Unfortunately, Cholesky QR is much less accurate and numerically stable than Householder QR. Here, for instance, is the value of \|Q^\top Q - I\|, about ten million times larger than for Householder QR!:

>> norm(Q'*Q - eye(1e2))
ans =
   7.5929e-07

What’s going on? As we’ve discussed before on this blog, forming A^\top A is typically problematic in linear algebraic computations. The “badness” of a matrix A is measured by its condition number, defined to be the ratio of its largest and smallest singular values \kappa(A) = \sigma_{\rm max}(A)/\sigma_{\rm min}(A). The condition number of A^\top A is the square of the condition number of A, \kappa(A^\top A) = \kappa(A)^2, which is at the root of Cholesky QR’s loss of accuracy. Thus, Cholesky QR is only appropriate for matrices that are well-conditioned, having a small condition number \kappa(A) \approx 1, say \kappa(A) \le 10.
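To see this squaring in action, here is a small illustration with an artificial ill-conditioned test matrix of my own choosing (the exact values depend on the random draw):

M = randn(1e3, 50) * diag(logspace(0, -4, 50)); % singular values spread over ~4 orders of magnitude
cond(M)      % roughly 1e4
cond(M'*M)   % roughly 1e8, the square of cond(M)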

The idea of randomized Cholesky QR is to use randomness to precondition A, producing a matrix R_1 such that B = AR_1^{-1} is well-conditioned. Then, since B is well-conditioned, we can apply ordinary Cholesky QR to it without issue. Here are the steps:

  1. Draw a sketching matrix S of size 2n\times m; see these posts of mine for an introduction to sketching.
  2. Form the sketch SA. This step compresses the very tall m\times n matrix A to the much shorter 2n\times n matrix SA.
  3. Compute a QR factorization SA = Q_1R_1 using Householder QR. Since the matrix SA is small, this factorization will be quick to compute.
  4. Form the preconditioned matrix B = AR_1^{-1}.
  5. Apply Cholesky QR to B to compute B = QR_2.
  6. Set R = R_2R_1. Observe that A = BR_1 = QR_2R_1 = QR, as desired.

MATLAB code for randomized Cholesky QR is provided below. (Code for the sparsesign subroutine can be found here.)

function [Q,R] = rand_cholesky_qr(A)
   S = sparsesign(2*size(A,2),size(A,1),8); % sparse sign embedding
   R1 = qr(S*A,"econ"); % sketch and (Householder) QR factorize
   B = A / R1; % B = A * R_1^{-1}
   [Q,R2] = cholesky_qr(B);
   R = R2*R1;
end

Randomized Cholesky QR is still faster than ordinary Householder QR, about 3\times faster in our experiment:

>> tic; [Q,R] = rand_cholesky_qr(A); toc
Elapsed time is 0.920787 seconds.

Randomized Cholesky QR greatly improves on ordinary Cholesky QR in terms of accuracy and numerical stability. In fact, the size of \|Q^\top Q - I\| is even smaller than for Householder QR!

>> norm(Q'*Q - eye(1e2))
ans =
   1.0926e-14

The relative error \|A - QR\|/\|A\| is small, too! Even smaller than for Householder QR in fact:

>> norm(A - Q*R) / norm(A)
ans =
   4.0007e-16

Like many great ideas, randomized Cholesky QR was developed independently by a number of research groups. A version of this algorithm was first introduced in 2021 by Fan, Guo, and Lin. Similar algorithms were investigated in 2022 and 2023 by Balabanov, Higgins et al., and Melnichenko et al. Check out Melnichenko et al.’s paper in particular, which shows very impressive results for using randomized Cholesky QR to compute column pivoted QR factorizations.

References: Primary references are A Novel Randomized XR-Based Preconditioned CholeskyQR Algorithm by Fan, Guo, and Lin (2021); Randomized Cholesky QR factorizations by Balabanov (2022); Analysis of Randomized Householder-Cholesky QR Factorization with Multisketching by Higgins et al. (2023); CholeskyQR with Randomization and Pivoting for Tall Matrices (CQRRPT) by Melnichenko et al. (2023). The idea of using sketching to precondition tall matrices originates in the paper A fast randomized algorithm for overdetermined linear least-squares regression by Rokhlin and Tygert (2008).