Every once in a while, there’s a paper that comes out that is so delightful that I can’t help but share it on this blog, and I’ve started a little series Neat Randomized Algorithms for exactly this purpose. Today’s entry into this collection is The Polar Express: Optimal Matrix Sign Methods and their Application to the Muon Algorithm by Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. Despite its authors belonging to the randomized linear algebra community, this paper is actually about a plain-old deterministic algorithm. But it’s just so delightful that I couldn’t help but share it in this series anyway.
The authors of The Polar Express are motivated by the recent Muon algorithm for neural network optimization. The basic idea of Muon is that it helps to orthogonalize the search directions in a stochastic gradient method. That is, rather than update a weight matrix $W$ with search direction $G$ using the update rule

\[W \gets W - \eta G,\]

Muon uses the orthogonalized update rule

\[W \gets W - \eta \operatorname{polar}(G).\]

Here, $\eta > 0$ is a learning rate and

\[\operatorname{polar}(G) \coloneqq \operatorname*{argmin}_{Q \text{ with orthonormal columns}} \|G - Q\|_{\rm F}\]

is the closest matrix with orthonormal columns to $G$ and is called the (unitary) polar factor of $G$. (Throughout this post, we shall assume for simplicity that $G$ is tall and full-rank.) Muon relies on efficient algorithms for rapidly approximating $\operatorname{polar}(G)$.
Given a singular value decomposition $G = U\Sigma V^\top$, the polar factor may be computed in closed form as $\operatorname{polar}(G) = UV^\top$. But computing the SVD is computationally expensive, particularly in GPU computing environments. Are there more efficient algorithms that avoid the SVD? In particular, can we design algorithms that use only matrix multiplications, for maximum GPU efficiency?
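For concreteness, here is a minimal MATLAB sketch of the SVD-based computation (an economy-size SVD suffices for a tall matrix), followed by a single Muon-style update; the step size eta is just an illustrative value, not a recommendation from the paper:

G = randn(500, 100);          % tall test "gradient" matrix
[U, ~, V] = svd(G, 'econ');   % economy-size SVD: G = U*S*V'
polarG = U*V';                % polar factor: all singular values replaced by one

W = randn(500, 100);          % weight matrix
eta = 0.01;                   % illustrative learning rate
W = W - eta*polarG;           % one Muon-style orthogonalized update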
The Polar Factor as a Singular Value Transformation
Computing the polar factor $\operatorname{polar}(G)$ of a matrix $G$ effectively applies an operation to $G$ which replaces all of its singular values by one. Such operations are studied in quantum computing, where they are called singular value transformations.
Definition (singular value transformation): Given an odd function $f$, the singular value transformation of a matrix $G$ with singular value decomposition $G = U\Sigma V^\top$ by $f$ is $f[G] \coloneqq U f(\Sigma) V^\top$.
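As a reference implementation (the name svt is my own, not from the paper), a singular value transformation can be computed in MATLAB directly from the SVD:

function fG = svt(G, f)
% SVT  Singular value transformation f[G] = U*f(S)*V' for an odd function f,
% computed via an (expensive) economy-size SVD. f is a scalar function handle
% applied to each singular value.
    [U, S, V] = svd(G, 'econ');
    fG = U * diag(f(diag(S))) * V';
end

Of course, this implementation defeats the purpose, since it requires an SVD; the rest of the post is about avoiding it.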
On its face, it might seem like the polar factor of $G$ cannot be obtained as a singular value transformation. After all, the constantly one function $f(x) = 1$ is not odd. But, to obtain the polar factor, we only need a function which sends positive inputs to $1$; since $G$ is assumed full-rank, all of its singular values are positive. Thus, the polar decomposition $\operatorname{polar}(G)$ is given by the singular value transformation associated with the sign function

\[\operatorname{sign}(x) = \begin{cases} 1, & x > 0, \\ 0, & x = 0, \\ -1, & x < 0, \end{cases}\]

that is,

\[\operatorname{polar}(G) = \operatorname{sign}[G].\]
Singular Value Transformations and Polynomials
How might we go about computing the singular value transformation of a matrix? For an (odd) polynomial, this computation can be accomplished using a sequence of matrix multiplications alone. Indeed, for $p(x) = a_1 x + a_3 x^3 + \cdots + a_{2k+1} x^{2k+1}$, we have

\[p[G] = a_1 G + a_3 G(G^\top G) + \cdots + a_{2k+1} G(G^\top G)^k.\]
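Here is a small MATLAB sketch of this formula; the helper name odd_poly_apply and the convention that a lists the odd coefficients $a_1, a_3, \ldots, a_{2k+1}$ are my own choices:

function pG = odd_poly_apply(G, a)
% ODD_POLY_APPLY  Evaluate p[G] = a(1)*G + a(2)*G*(G'*G) + ... + a(end)*G*(G'*G)^k
% using only matrix multiplications (no SVD).
    GtG = G' * G;
    M = eye(size(GtG));        % running power (G'*G)^(j-1)
    pG = zeros(size(G));
    for j = 1:length(a)
        pG = pG + a(j) * G * M;
        M = M * GtG;
    end
end

For instance, odd_poly_apply(G, [1, -1/6, 1/120]) reproduces the Taylor approximation to $\sin[G]$ used in the example below.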
For a general odd function $f$, we can approximately compute the singular value transformation $f[G]$ by first approximating $f$ by a polynomial $p$, and then using $p[G]$ as a proxy for $f[G]$. Here is an example:
>> G = randn(2)                            % Random test matrix
G =
   0.979389080992349  -0.198317114406418
  -0.252310961830649  -1.242378171072736
>> [U,S,V] = svd(G);
>> fG = U*sin(S)*V'                        % Singular value transformation
fG =
   0.824317193982434  -0.167053523352195
  -0.189850719961322  -0.935356030417109
>> pG = G - (G*G'*G)/6 + (G*G'*G*G'*G)/120 % Polynomial approximation
pG =
   0.824508188218982  -0.167091255945116
  -0.190054681059327  -0.936356677704568

We see that we get reasonably high accuracy by approximating $\sin[G]$ using its degree-five Taylor polynomial $x - x^3/6 + x^5/120$.
The Power of Composition
The most basic approach to computing the sign function would be to use a single polynomial approximation of some fixed degree. However, this approach converges fairly slowly as we increase the degree.
A better strategy is to use compositions. A nice feature of the sign function is the fixed point property: for every $x$, $\operatorname{sign}(x)$ is a fixed point of the $\operatorname{sign}$ function:

\[\operatorname{sign}(\operatorname{sign}(x)) = \operatorname{sign}(x) \quad \text{for all } x \in \mathbb{R}.\]

Consequently, if a polynomial roughly approximates the sign function, composing it with itself pushes the singular values even closer to one. This suggests computing $\operatorname{polar}(G)$ by initializing $P \gets G$ (rescaled, if necessary, so that its singular values lie in $(0,\sqrt{3})$) and applying the following fixed point iteration until convergence:

\[P \gets \frac{3}{2} P - \frac{1}{2} PP^\top P.\]

This is the classical Newton–Schulz iteration; each step applies the odd cubic polynomial $p(x) = \tfrac{3}{2}x - \tfrac{1}{2}x^3$ to the singular values of $P$. Here is a demonstration on a random test matrix:
>> P = randn(100) / 25;
>> [U,~,V] = svd(P); polar = U*V'; % True polar decomposition
>> for i = 1:20
      P = 1.5*P-0.5*P*P'*P; % Newton-Schulz iteration
      fprintf("Iteration %d\terror %e\n",i,norm(P - polar));
   end
Iteration 1	error 9.961421e-01
Iteration 2	error 9.942132e-01
Iteration 3	error 9.913198e-01
Iteration 4	error 9.869801e-01
Iteration 5	error 9.804712e-01
Iteration 6	error 9.707106e-01
Iteration 7	error 9.560784e-01
Iteration 8	error 9.341600e-01
Iteration 9	error 9.013827e-01
Iteration 10	error 8.525536e-01
Iteration 11	error 7.804331e-01
Iteration 12	error 6.759423e-01
Iteration 13	error 5.309287e-01
Iteration 14	error 3.479974e-01
Iteration 15	error 1.605817e-01
Iteration 16	error 3.660929e-02
Iteration 17	error 1.985827e-03
Iteration 18	error 5.911348e-06
Iteration 19	error 5.241446e-11
Iteration 20	error 6.686995e-15

As we see, the initial rate of convergence is very slow, and we obtain only a single digit of accuracy after 15 iterations. (Singular values much smaller than one grow by only a factor of about $3/2$ per iteration, since $p(x) \approx \tfrac{3}{2}x$ for small $x$.) After this burn-in period, the rate of convergence is very rapid, and the method achieves machine accuracy after 20 iterations.
The Polar Express
The Newton–Schulz iteration approximates the sign function by composing the same polynomial $p(x) = \tfrac{3}{2}x - \tfrac{1}{2}x^3$ with itself over and over. But we can get better approximations by applying a sequence of different polynomials $p_1, p_2, \ldots, p_t$, resulting in an approximation of the form

\[\operatorname{sign}[G] \approx p_t[p_{t-1}[\cdots p_2[p_1[G]] \cdots]].\]

What is the optimal choice of polynomials $p_1, \ldots, p_t$?
For simplicity, the authors of The Polar Express focus on the case where all of the polynomials $p_i$ have the same (odd) degree $2k+1$.
On its face, it seems like this problem might be intractable, as the best choice of polynomial $p_i$ could seemingly depend in a complicated way on all of the previous polynomials $p_1, \ldots, p_{i-1}$. Fortunately, the authors of The Polar Express show that there is a very simple way of computing the optimal polynomials. Begin by assuming that the singular values of $G$ lie in an interval $[\ell_0, u_0]$. We then choose $p_1$ to be the degree-$(2k+1)$ odd polynomial approximation to the sign function on $[\ell_0, u_0]$ that minimizes the worst-case error:

\[p_1 = \operatorname*{argmin}_{\text{odd degree-$(2k+1)$ polynomial } p} \; \max_{x \in [\ell_0, u_0]} |p(x) - \operatorname{sign}(x)|.\]
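The Polar Express paper describes how to solve this minimax problem exactly. Purely as an illustration (this is not the authors' method), one can get an approximate solution by exploiting oddness to restrict to the positive interval, discretizing it, and solving a linear program. The sketch below assumes MATLAB's Optimization Toolbox for linprog, and the interval endpoints and degree are example values:

l = 1e-3; u = 1; k = 2;              % odd degree 2k+1 = 5; illustrative interval
x = linspace(l, u, 2000)';           % grid on the positive interval (sign(x) = 1 there)
V = x.^(1:2:(2*k+1));                % columns x, x^3, x^5
m = length(x);
f = [zeros(k+1, 1); 1];              % variables [a_1; a_3; a_5; t]; minimize the bound t
A = [ V, -ones(m,1);                 %  p(x_i) - 1 <= t
     -V, -ones(m,1)];                %  1 - p(x_i) <= t
b = [ones(m,1); -ones(m,1)];
sol = linprog(f, A, b);
a = sol(1:k+1)                       % approximately optimal odd coefficients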
After applying $p_1$ to $G$, the singular values of $p_1[G]$ lie in a new interval $[\ell_1, u_1]$. To build the next polynomial $p_2$, we simply find the optimal approximation to the sign function on this interval:

\[p_2 = \operatorname*{argmin}_{\text{odd degree-$(2k+1)$ polynomial } p} \; \max_{x \in [\ell_1, u_1]} |p(x) - \operatorname{sign}(x)|.\]

Continuing in this way, we can construct as many polynomials $p_3, p_4, \ldots$ as we want.
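One detail is how the new interval $[\ell_1, u_1]$ is obtained from $[\ell_0, u_0]$ and $p_1$. Since each singular value $\sigma$ of $G$ is mapped to $p_1(\sigma)$, the new interval is just the range of $p_1$ over the old one. Here is a crude MATLAB sketch that estimates this range on a fine grid (the function name and the grid-based approach are mine, not the paper's):

function [l_new, u_new] = next_interval(a, l, u)
% NEXT_INTERVAL  Estimate the range of the odd polynomial
%   p(x) = a(1)*x + a(2)*x^3 + ... + a(end)*x^(2k+1)
% over [l, u] using a fine grid. Provided p stays positive on [l, u] (true for
% the optimal polynomials, whose worst-case error is below one), the singular
% values of p[G] lie in [l_new, u_new] whenever those of G lie in [l, u].
    x = linspace(l, u, 10000);
    px = zeros(size(x));
    for j = 1:length(a)
        px = px + a(j) * x.^(2*j - 1);
    end
    l_new = min(px);
    u_new = max(px);
end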
For given values of $k$ and $t$, the coefficients of the optimal polynomials $p_1, \ldots, p_t$ can be computed in advance and stored, allowing for rapid deployment at runtime. Moreover, we can always ensure the upper bound is $u_0 = 1$ by normalizing $G$ (dividing by the Frobenius norm $\|G\|_{\rm F}$, an upper bound on the largest singular value, does the trick). As such, there is only one parameter, the lower bound $\ell_0$, that we need to know in order to compute the optimal coefficients. The authors of The Polar Express are motivated by applications in deep learning using 16-bit floating point numbers, and they recommend a lower bound $\ell_0$ matched to that precision. (As the authors stress, their method remains convergent even if too large a value of $\ell_0$ is chosen, though convergence may be slowed somewhat.)
Below, I repeat the experiment from above using the (degree-5) Polar Express iteration instead of Newton–Schulz. The coefficients for the optimal polynomials are taken from the Polar Express paper.
>> P = randn(100) / 25;
>> [U,~,V] = svd(P); polar = U*V';
>> P2 = P*P'; P = ((17.300387312530933*P2-23.595886519098837*eye(100))*P2+8.28721201814563*eye(100))*P; fprintf("Iteration 1\terror %e\n",norm(P - polar));
Iteration 1	error 9.921347e-01
>> P2 = P*P'; P = ((0.5448431082926601*P2-2.9478499167379106*eye(100))*P2+4.107059111542203*eye(100))*P; fprintf("Iteration 2\terror %e\n",norm(P - polar));
Iteration 2	error 9.676980e-01
>> P2 = P*P'; P = ((0.5518191394370137*P2-2.908902115962949*eye(100))*P2+3.9486908534822946*eye(100))*P; fprintf("Iteration 3\terror %e\n",norm(P - polar));
Iteration 3	error 8.725474e-01
>> P2 = P*P'; P = ((0.51004894012372*P2-2.488488024314874*eye(100))*P2+3.3184196573706015*eye(100))*P; fprintf("Iteration 4\terror %e\n",norm(P - polar));
Iteration 4	error 5.821937e-01
>> P2 = P*P'; P = ((0.4188073119525673*P2-1.6689039845747493*eye(100))*P2+2.300652019954817*eye(100))*P; fprintf("Iteration 5\terror %e\n",norm(P - polar));
Iteration 5	error 1.551595e-01
>> P2 = P*P'; P = ((0.37680408948524835*P2-1.2679958271945868*eye(100))*P2+1.891301407787398*eye(100))*P; fprintf("Iteration 6\terror %e\n",norm(P - polar));
Iteration 6	error 4.588549e-03
>> P2 = P*P'; P = ((0.3750001645474248*P2-1.2500016453999487*eye(100))*P2+1.8750014808534479*eye(100))*P; fprintf("Iteration 7\terror %e\n",norm(P - polar));
Iteration 7	error 2.286853e-07
>> P2 = P*P'; P = ((0.375*P2-1.25*eye(100))*P2+1.875*eye(100))*P; fprintf("Iteration 8\terror %e\n",norm(P - polar));
Iteration 8	error 1.113148e-14

We see that the Polar Express algorithm converges to machine accuracy in only 8 iterations (24 matrix products), a speedup over the 20 iterations (40 matrix products) required by Newton–Schulz. The Polar Express paper contains further examples with even more significant speedups.
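For convenience, here is the same computation packaged as a reusable MATLAB function. The packaging, the name polar_express_deg5, and the Frobenius-norm prescaling are my own choices; the coefficient table is copied from the transcript above:

function P = polar_express_deg5(G)
% POLAR_EXPRESS_DEG5  Approximate polar(G) with the eight degree-5 Polar Express
% iterations shown above. Row i of C holds the odd coefficients (a_1, a_3, a_5)
% of the i-th update P <- a_1*P + a_3*P*(P'*P) + a_5*P*(P'*P)^2.
    C = [ 8.28721201814563    -23.595886519098837   17.300387312530933;
          4.107059111542203    -2.9478499167379106   0.5448431082926601;
          3.9486908534822946   -2.908902115962949    0.5518191394370137;
          3.3184196573706015   -2.488488024314874    0.51004894012372;
          2.300652019954817    -1.6689039845747493   0.4188073119525673;
          1.891301407787398    -1.2679958271945868   0.37680408948524835;
          1.8750014808534479   -1.2500016453999487   0.3750001645474248;
          1.875                -1.25                 0.375 ];
    P = G / norm(G, 'fro');      % prescale so the largest singular value is at most one
    I = eye(size(G, 1));
    for i = 1:size(C, 1)
        P2 = P * P';
        P = ((C(i,3)*P2 + C(i,2)*I) * P2 + C(i,1)*I) * P;
    end
end

Calling P = polar_express_deg5(G) then approximates the polar factor using only matrix multiplications (after the one-time normalization).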
Make sure to check out the Polar Express paper for many details not shared here, including extra tricks to improve stability in 16-bit floating point arithmetic, discussions about how to compute the optimal polynomials, and demonstrations of the Polar Express algorithm for training GPT-2.
References: Muon was first formally described in the blog post Muon: An optimizer for hidden layers in neural networks (2024); for more, see this blog post by Jeremy Bernstein and this paper by Jeremy Bernstein and Laker Newhouse. The Polar Express is proposed in The Polar Express: Optimal Matrix Sign Methods and their Application to the Muon Algorithm (2025) by Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. For more on the matrix sign function and computing it, chapter 5 of Functions of Matrices: Theory and Computation (2008) by Nicholas H. Higham is an enduringly useful reference.