Anthony Polloreno, Ph.D.

@ampolloreno

Engineer

Thoughts on Muon and Second Order Optimization

Muon has a clean derivation for linear layers, because for a linear layer \(y = xW\) one can ask for the steepest first-order update under a spectral-norm constraint on the update itself, in which case the answer is the polar factor of the gradient. If \(G = \nabla_W L\), then

\[ \Delta W \propto -\operatorname{Polar}(G) = -UV^\top, \qquad G = U\Sigma V^\top. \]

Muon approximates this polar factor with Newton–Schulz iterations, and in that sense the standard derivation is genuinely a derivation for linear maps. Once attention enters, however, the constrained object changes, because for one attention head \(Q = XW_Q\), \(K = XW_K\), and \(V = XW_V\), while the attention logits are \(S = \frac{QK^\top}{\sqrt d} = \frac{XW_QW_K^\top X^\top}{\sqrt d}\).
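As a concrete sketch, the classic cubic Newton–Schulz iteration already recovers the polar factor numerically. Production Muon implementations use a tuned higher-order polynomial, so the iteration below is illustrative rather than the one used in practice; the shapes and step count are arbitrary.

```python
import numpy as np

def polar_via_svd(G):
    # Exact polar factor UV^T from the SVD G = U Sigma V^T.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def polar_via_newton_schulz(G, steps=20):
    # Classic cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X (X^T X).
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the iteration's basin of convergence.
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

G = np.random.default_rng(0).normal(size=(256, 128))
print(np.linalg.norm(polar_via_svd(G) - polar_via_newton_schulz(G)))  # close to zero
```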

That scope is not accidental, since Bernstein’s derivation is explicit that Muon is defined for linear layers and traces the underlying RMS-to-RMS operator norm back to Appendix E of A Spectral Condition for Feature Learning. [Bernstein, 2025, and Yang et al., 2024] The object that directly controls attention is therefore not \(W_Q\) or \(W_K\) separately but the bilinear product \(W_QW_K^\top\), so if one orthogonalizes the updates to \(W_Q\) and \(W_K\) independently one is still applying the linear-layer derivation to two matrices that attention does not use independently. What attention actually uses is their product, and so the relevant constraint belongs on the induced change in \(QK^\top\).

Large-scale Muon training already shows this mismatch. Moonshot’s Muon scaling report identifies two crucial fixes for scaling Muon, namely weight decay and per-matrix update-RMS rescaling, while also noting that Muon’s update RMS depends on matrix shape in a way that can become problematic for small attention matrices such as per-head KV matrices in GQA/MLA-style attention. In their training dynamics, the maximum attention logit can rise above 100 in some layers even when loss and gradient norm remain stable. [Liu et al., 2025]

Kimi K2 introduces MuonClip, whose main new ingredient is QK-Clip, and the paper says that QK-Clip rescales query and key projection weights after the optimizer step in order to constrain attention logits explicitly, which is already a strong indication that Muon alone is not enough for attention. [Kimi Team, 2025] DeepSeek-V4 takes the architectural route instead, using Muon for most matrix parameters while applying RMSNorm to attention queries and compressed KV entries immediately before core attention, and the paper attributes that normalization to logit stability and does not use QK-Clip. [DeepSeek-AI, 2026] Across these reports, Muon is used for matrix parameters whereas attention receives extra QK/logit control, which is exactly what one should expect once the linear-layer derivation is applied to a bilinear operator.

For a linear layer, the idealized Muon problem is

\[ \min_{\Delta W}\; \langle G, \Delta W \rangle \quad\text{s.t.}\quad \|\Delta W\|_2 \le \rho. \]

For attention, however, the analogous constraint should not be placed on \(\Delta W_Q\) and \(\Delta W_K\) separately, but on the induced change in the attention logits. For one head, \(S = \frac{QK^\top}{\sqrt d}\), and a first-order perturbation gives

\[ \Delta S = \frac{(X\Delta W_Q)K^\top + Q(X\Delta W_K)^\top}{\sqrt d}. \]
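A quick numerical check of this first-order expression, with randomly drawn stand-ins for \(X\), \(W_Q\), and \(W_K\); the shapes and perturbation scale are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d = 16, 32, 8
X = rng.normal(size=(T, d_model))
W_Q, W_K = rng.normal(size=(d_model, d)), rng.normal(size=(d_model, d))
dW_Q, dW_K = 1e-4 * rng.normal(size=(d_model, d)), 1e-4 * rng.normal(size=(d_model, d))

def logits(W_Q, W_K):
    Q, K = X @ W_Q, X @ W_K
    return Q @ K.T / np.sqrt(d)

Q, K = X @ W_Q, X @ W_K
dS_first_order = ((X @ dW_Q) @ K.T + Q @ (X @ dW_K).T) / np.sqrt(d)
dS_exact = logits(W_Q + dW_Q, W_K + dW_K) - logits(W_Q, W_K)
print(np.abs(dS_exact - dS_first_order).max())  # only the second-order cross term remains
```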

The corresponding attention-aware Muon problem is

\[ \min_{\Delta W_Q,\Delta W_K} \langle G_Q,\Delta W_Q\rangle + \langle G_K,\Delta W_K\rangle \]

subject to

\[ \left\| \frac{(X\Delta W_Q)K^\top + Q(X\Delta W_K)^\top}{\sqrt d} \right\| \le \rho. \]

If we write \(\theta=(W_Q,W_K)\), \(g=\nabla_\theta L\), and \(J_S\) for the Jacobian mapping parameter updates to logit updates, then the exact trust-region update has the natural-gradient/Gauss–Newton form

\[ \Delta\theta^* = -\rho\, \frac{(J_S^\top J_S + \lambda I)^{-1}g} {\sqrt{g^\top (J_S^\top J_S + \lambda I)^{-1}g}}. \]
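On a toy problem one can materialize \(J_S\) column by column and evaluate this expression directly; the gradient `g` below is a random placeholder rather than a real loss gradient, and the sizes are chosen only so the dense Jacobian fits in memory.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_model, d = 4, 6, 3
rho, lam = 0.1, 1e-3
X = rng.normal(size=(T, d_model))
W_Q, W_K = rng.normal(size=(d_model, d)), rng.normal(size=(d_model, d))
Q, K = X @ W_Q, X @ W_K
g = rng.normal(size=2 * d_model * d)  # placeholder for the stacked gradients (G_Q, G_K)

def flat_to_logit_update(delta):
    # The linear map J_S: flattened (dW_Q, dW_K) -> first-order change in the logits S.
    dW_Q = delta[: d_model * d].reshape(d_model, d)
    dW_K = delta[d_model * d :].reshape(d_model, d)
    return ((X @ dW_Q) @ K.T + Q @ (X @ dW_K).T) / np.sqrt(d)

# Materialize J_S column by column (only feasible at toy scale).
n_params = 2 * d_model * d
J = np.stack([flat_to_logit_update(e).ravel() for e in np.eye(n_params)], axis=1)

M = J.T @ J + lam * np.eye(n_params)        # damped Gauss-Newton metric in logit space
Minv_g = np.linalg.solve(M, g)
step = -rho * Minv_g / np.sqrt(g @ Minv_g)  # trust-region update from the formula above
print(step.shape, np.linalg.norm(step))
```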

What matters in this expression is that the update couples \(W_Q\) and \(W_K\), because logit changes couple them, so the step should be judged by the induced change in \(QK^\top\) rather than by the two matrices separately. Muon is cheap precisely because it is local to one matrix, so one takes a matrix-shaped momentum buffer, runs a few Newton–Schulz iterations, rescales the update, and applies it. The attention-aware update is not local in that sense, because it depends on the current activations \(X\), the current queries \(Q\), the current keys \(K\), the attention mask, the head structure, the sequence length, the softmax geometry, and, in modern models, compressed, sparse, latent, grouped, or multi-query attention layouts. Even before considering the softmax, the natural object is the sequence-by-sequence logit matrix \(S\), which in dense attention is \(T \times T\) per head, while in sparse or compressed attention it becomes architecture-specific. In MLA-style or compressed-KV attention, the key may not even be materialized in the same way at training and inference. The exact optimizer-side solution is therefore expensive, which is why the architecture is the cheaper place to impose the same geometry.

The scalar attention logit is \(\ell(q,k) = \frac{q^\top k}{\sqrt d}\), and the dangerous degrees of freedom are the radial norms of \(q\) and \(k\). If Muon updates \(W_Q\) and \(W_K\) independently, the product \(QK^\top\) can grow even when the individual matrix updates look reasonable. A direct architectural fix is to set \(\hat q = R_q \frac{q}{\|q\|_2 + \epsilon}\) and \(\hat k = R_k \frac{k}{\|k\|_2 + \epsilon}\), and then compute \(\ell(q,k)=\frac{\hat q^\top \hat k}{\sqrt d}\). In transformer notation this becomes \(\hat Q_h = \operatorname{RMSNorm}(Q_h)\) and \(\hat K_h = \operatorname{RMSNorm}(K_h)\), followed by \(S_h = \alpha_h\frac{\hat Q_h\hat K_h^\top}{\sqrt d}\).
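A minimal sketch of per-head QK RMS normalization with a per-head temperature, in plain numpy; the function and variable names are mine, not from any of the cited papers.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMS-normalize each query/key vector along the head dimension.
    return x / (np.sqrt(np.mean(x**2, axis=-1, keepdims=True)) + eps)

def qk_normed_logits(Q, K, alpha):
    # Q, K: (heads, T, d_head); alpha: per-head temperature, shape (heads,).
    d_head = Q.shape[-1]
    Qn, Kn = rms_norm(Q), rms_norm(K)
    return alpha[:, None, None] * (Qn @ np.swapaxes(Kn, -1, -2)) / np.sqrt(d_head)

H, T, d_head = 4, 16, 32
rng = np.random.default_rng(0)
Q = 50.0 * rng.normal(size=(H, T, d_head))  # deliberately huge raw norms
K = 50.0 * rng.normal(size=(H, T, d_head))
S = qk_normed_logits(Q, K, alpha=np.ones(H))
print(np.abs(S).max())  # bounded by alpha * sqrt(d_head), independent of the raw norms
```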

The scalar \(\alpha_h\) sets a per-head temperature or radius budget, and one scalar per head matches the isotropic version of the argument, whereas a full RMSNorm gamma vector learns a diagonal metric instead. With this normalization, the logits are bounded, since \(|\ell(q,k)| \le \frac{\|\hat q\|_2\|\hat k\|_2}{\sqrt d} = \frac{R_qR_k}{\sqrt d}\), and, more importantly, the local sensitivity is bounded as well, because \(\nabla_{\hat q}\ell = \frac{\hat k}{\sqrt d}\) and \(\nabla_{\hat k}\ell = \frac{\hat q}{\sqrt d}\), so \(\|\nabla_{\hat q}\ell\|_2 = \frac{R_k}{\sqrt d}\) and \(\|\nabla_{\hat k}\ell\|_2 = \frac{R_q}{\sqrt d}\). This is much closer to the operator control that Muon gives a linear map.

Take exact L2 normalization, \(\hat q = R\frac{q}{\|q\|_2}\), and write \(u = \frac{q}{\|q\|_2}\). Then

\[ D\hat q = \frac{R}{\|q\|_2} \left(I - uu^\top\right). \]

The term \(I - uu^\top\) is the projection onto the tangent space of the sphere, so QK norm removes the radial direction in the literal differential sense that an update which merely increases \(\|q\|\) does not change the attention logits. Under normalization, the optimizer no longer has to learn not to spend update budget on radial Q/K growth, because the forward map removes that direction directly.
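A small numerical check of both claims, with an arbitrary \(q\) and \(k\): a purely radial rescaling of \(q\) leaves the normalized logit unchanged, and a finite-difference Jacobian of the normalization matches \(\frac{R}{\|q\|_2}(I - uu^\top)\).

```python
import numpy as np

rng = np.random.default_rng(2)
d, R = 8, 1.0
q, k = rng.normal(size=d), rng.normal(size=d)

def normed_logit(q):
    q_hat = R * q / np.linalg.norm(q)
    return q_hat @ k / np.sqrt(d)

# A purely radial update rescales q but leaves the normalized logit unchanged.
print(normed_logit(q), normed_logit(3.0 * q))

# Finite-difference Jacobian of q -> q_hat matches (R / ||q||) (I - u u^T).
eps = 1e-6
u = q / np.linalg.norm(q)
J_num = np.stack(
    [(R * (q + eps * e) / np.linalg.norm(q + eps * e) - R * u) / eps for e in np.eye(d)],
    axis=1,
)
J_ana = (R / np.linalg.norm(q)) * (np.eye(d) - np.outer(u, u))
print(np.abs(J_num - J_ana).max())  # small; dominated by finite-difference error
```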

Attention-aware trust regions and QK normalization are therefore addressing the same instability at different levels, since the former constrains the update in logit space whereas the latter changes the parameterization so that the unstable radial direction is mostly absent. This is also the role of DeepSeek-V4’s pre-attention normalization: Muon remains matrix-local, while the model removes the unstable radial Q/K direction before attention.

The Muon derivation points to per-head L2/RMS normalization on Q and K, because Muon’s motivating geometry is spectral norm, that is, the operator norm induced by Euclidean vector norms, namely \(\|W\|_2 = \sup_{\|x\|_2=1}\|Wx\|_2\). The vector geometry paired with this is \(\ell_2\), and, equivalently, if one wants to bound the dot product then Hölder’s inequality gives \(|q^\top k| \le \|q\|_p\|k\|_{p^*}\) with \(\frac1p + \frac1{p^*}=1\), for which the only symmetric self-dual choice is \(p=2\).

It is also the only rotation-invariant choice, which matters because Muon’s polar update is built from SVD and orthogonal geometry, whereas an \(\ell_1\) or \(\ell_\infty\) constraint picks out coordinate axes and therefore changes the geometry of the problem. This is why the natural prediction is per-head RMS/L2 normalization on Q and K before the dot product, with RMSNorm appearing simply as L2 normalization written in the usual transformer scale, since \(\operatorname{RMS}(q)=\frac{\|q\|_2}{\sqrt d}\).

A scalar cap such as \(\ell \mapsto \tau\tanh(\ell/\tau)\) or \(\ell \mapsto \tau\arctan(\ell/\tau)\) bounds the logit value after the dot product, but that is not the same thing as bounding the operator that generates the logit in the first place. If \(c(\ell)\) is the scalar cap, then \(\nabla_q c(\ell) = c'(\ell)\frac{k}{\sqrt d}\), so near \(\ell=0\), where most softcaps have \(c'(0)\approx 1\), one can still have a large local sensitivity whenever \(k\) has huge norm but happens to be nearly orthogonal to \(q\). In that sense a softcap controls values whereas QK norm controls the operator.
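A toy illustration of the difference: with a huge key that is orthogonal to the query, a tanh softcap leaves the local sensitivity essentially untouched, while L2-normalizing the key bounds it. The values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
d, tau = 64, 30.0
q = rng.normal(size=d)
k = 1000.0 * rng.normal(size=d)     # huge key norm
k -= (k @ q) / (q @ q) * q          # make k orthogonal to q, so the logit is ~0

logit = q @ k / np.sqrt(d)
# Gradient of the capped logit tau * tanh(logit / tau) with respect to q.
cap_grad_q = (1.0 - np.tanh(logit / tau) ** 2) * k / np.sqrt(d)
print(logit, np.linalg.norm(cap_grad_q))   # tiny logit, huge local sensitivity

# L2-normalizing k instead bounds that same sensitivity by R_k / sqrt(d).
k_hat = k / np.linalg.norm(k)
print(np.linalg.norm(k_hat / np.sqrt(d)))  # = 1 / sqrt(d) here, since R_k = 1
```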

Under the Muon/operator-norm story, QK RMSNorm is therefore the architectural match, while a scalar cap can still serve as a guardrail without actually fixing the missing QK geometry. The same constraint can then appear in three places, since Kimi K2 applies it after the optimizer step through QK-Clip, an attention-aware Muon would impose it in the update itself, and DeepSeek-V4 moves it into the forward pass with QK normalization. In each case the underlying object is the same, namely control of the change in \(QK^\top\).

Muon is not a cheap Hessian inverse but a structured matrix update: Adam-style methods use diagonal gradient statistics, whereas Muon uses matrix geometry and a controlled spectral profile, and this is precisely the setting in which maximal-update arguments apply. Bernstein’s derivation makes the hyperparameter-transfer point explicit, because in the RMS-to-RMS geometry the update scale transfers across width rather than being retuned layer by layer. [Bernstein, 2025] Shah et al. study Muon together with maximal update parameterization and find that Muon expands the compute-time Pareto frontier over AdamW while retaining data efficiency at large batch sizes. [Shah et al., 2025]

Moonshot’s Muon report gives a similar empirical result, since after adding weight decay and correcting update RMS their scaling-law experiments show roughly a 2x compute-efficiency improvement over AdamW under their compute-optimal setup. [Liu et al., 2025] The attention story, however, says that matrix-level maximal updates are not enough, because for MLP matrices the matrix itself is the operator whereas for attention Q/K the operator is the product. The practical rule is therefore to use Muon where the parameter is the operator and to add QK geometry where the operator is bilinear.

Preconditioning changes the usual critical-batch calculation. In the first-order setting, the expected progress has the familiar diminishing-returns form

\[ \Delta L_{\mathrm{opt}}(B) = \frac{\Delta L_{\max}}{1+B_{\mathrm{noise}}/B}. \]

The noise scale predicts the point at which increasing batch size gives diminishing returns, but if we add a fixed deterministic preconditioner \(P\) then the same calculation gives a modified critical batch size

\[ B_P = \frac{\operatorname{tr}(HP\Sigma P)}{g^\top P H P g}. \]

For Newton or natural-gradient style preconditioning with \(P=H^{-1}\), this becomes

\[ B_N = \frac{\operatorname{tr}(H^{-1}\Sigma)}{g^\top H^{-1}g}. \]
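On a synthetic quadratic with a hand-picked curvature \(H\) and noise covariance \(\Sigma\), the preconditioned formula can be evaluated directly; the matrices below are illustrative, not from any experiment.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 50
# Toy quadratic: ill-conditioned curvature H, gradient g, per-sample noise covariance Sigma.
H = np.diag(np.logspace(0, 3, d))
g = rng.normal(size=d)
Sigma = np.diag(rng.uniform(0.5, 2.0, size=d))

def critical_batch(P):
    # B_P = tr(H P Sigma P) / (g^T P H P g)
    return np.trace(H @ P @ Sigma @ P) / (g @ P @ H @ P @ g)

B_noise = critical_batch(np.eye(d))          # first-order noise scale (P = I)
B_newton = critical_batch(np.linalg.inv(H))  # curvature-whitened noise scale (P = H^{-1})
print(B_noise, B_newton)
```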

So even before stochastic curvature estimation, preconditioning changes the noise metric, because the relevant gradient noise is now whitened by curvature. If the curvature itself is estimated stochastically, there is a second statistical problem, since the update is no longer just \(H^{-1}\hat g\) but \(\hat H^{-1}\hat g\), and that in turn creates two noise scales, one on the gradient side and one on the curvature side. In the notation of my note,

\[ B_g = \frac{\operatorname{tr}(H^{-1}\Sigma_g)}{g^\top H^{-1}g}, \]

and

\[ B_h = \frac{u^\top \mathbb E[\delta H^{-1}\delta]u}{g^\top H^{-1}g}, \qquad u=H^{-1}g. \]

The second-order progress law is approximately

\[ \frac{\Delta L_{\mathrm{opt}}(B_1,B_2)}{\Delta L^N_{\max}} \approx \frac{1}{1+B_g/B_1+B_h/B_2}. \]

So second-order optimization has a critical surface, not a single critical batch size, given approximately by

\[ \frac{B_g}{B_1}+\frac{B_h}{B_2}\approx 1. \]
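A small numeric sketch of the surface, with made-up values for \(B_g\) and \(B_h\): every batch pair on the surface gives roughly half of the maximum per-step progress under the approximate law above.

```python
# Hypothetical gradient-side and curvature-side noise scales for illustration only.
B_g, B_h = 4096.0, 512.0

def relative_progress(B1, B2):
    # Approximate second-order progress law from the text.
    return 1.0 / (1.0 + B_g / B1 + B_h / B2)

# Points on the critical surface B_g/B1 + B_h/B2 = 1 all give ~50% of max progress.
for B1 in (8192.0, 16384.0, 65536.0):
    B2 = B_h / (1.0 - B_g / B1)
    print(B1, round(B2), relative_progress(B1, B2))
```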

For full stochastic curvature estimation there are therefore two batch scales, one for gradients and one for curvature, whereas practical Muon is not full stochastic Newton at all, because it neither draws a separate curvature minibatch nor inverts an explicit curvature matrix, but instead uses a strong structural prior, namely that the update should be matrix-orthogonalized, together with temporal averaging through momentum.

If an estimator is exponentially averaged as \(s_t=(1-\beta)x_t+\beta s_{t-1}\), then under stationarity its variance is reduced by a factor of \(\frac{1-\beta}{1+\beta}\) relative to a single sample. The effective sample multiplier is therefore \(M_\beta = \frac{1+\beta}{1-\beta}\). For \(\beta=0.95\), \(M_\beta \approx 39\). For \(\beta=0.99\), \(M_\beta \approx 199\).
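This closed form is easy to confirm by simulation; the step and trial counts below are arbitrary.

```python
import numpy as np

def ema_variance(beta, n_steps=5000, n_trials=4000, seed=0):
    # Stationary variance of s_t = (1-beta) x_t + beta s_{t-1} for iid unit-variance x_t,
    # measured empirically and compared with the closed form (1-beta)/(1+beta).
    rng = np.random.default_rng(seed)
    s = np.zeros(n_trials)
    for _ in range(n_steps):
        s = (1 - beta) * rng.normal(size=n_trials) + beta * s
    return s.var(), (1 - beta) / (1 + beta)

for beta in (0.95, 0.99):
    empirical, closed_form = ema_variance(beta)
    print(beta, empirical, closed_form, (1 + beta) / (1 - beta))  # last value is M_beta
```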

Matrix-whitening optimizers can therefore behave in a more second-order way without requiring a huge separate curvature batch, because they combine structure and averaging, and second-order-ish optimizers change the statistical estimation problem rather than merely rescaling the old first-order one. For full curvature, the curvature-side batch can be enormous, whereas for structured or factorized curvature it can be much smaller. In the critical-curvature note, full outer-product curvature has \(B_h=d+1\) in a whitened Gaussian block, while a K-FAC block of shape \(n\times m\) has

\[ B_h^{\mathrm{KFAC}} \approx n+m+2. \]

The same lesson appears here, namely that scalable second-order behavior does not come from estimating a giant dense curvature matrix but from choosing a structured geometry with a much smaller statistical burden. This is consistent with the empirical Muon story, because public dense training batches are still largely governed by the gradient-side critical batch and hardware utilization, while the second-order side can be kept under control by structure, momentum, and update-scale normalization.

There is also a simple stability reason to like Muon, because for a linear layer an exactly polar-normalized Muon update satisfies \(\|\operatorname{Polar}(G)\|_2 = 1\), so with learning rate \(\eta\) and trust radius \(\rho\) the operator norm of the raw update is controlled, \(\|\Delta W\|_2 \le \eta\rho\). This does not prove global stability, but it does say that the per-step perturbation of the layer as an operator is bounded in the geometry that the layer actually uses.

For attention, the analogous bound would be on \(\Delta(W_QW_K^\top)\). Using first-order expansion,

\[ \Delta(W_QW_K^\top) \approx \Delta W_Q W_K^\top + W_Q\Delta W_K^\top. \]

Independent bounds on \(\Delta W_Q\) and \(\Delta W_K\) do not directly bound the product in the same clean way unless one also controls the norms of \(W_Q\) and \(W_K\) or normalizes the resulting queries and keys. QK norm is again the architecture-side fix, because if \(q,k \mapsto \hat q,\hat k\) then \(|\hat q^\top\hat k| \le \|\hat q\|_2\|\hat k\|_2\), and in that sense the spectral story and the attention story really do match.
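A quick numeric illustration: even when \(\Delta W_Q\) and \(\Delta W_K\) are exactly polar-normalized and tiny in spectral norm, the induced change in \(W_QW_K^\top\) scales with the norms of the existing weights. The scale factor below is deliberately exaggerated.

```python
import numpy as np

rng = np.random.default_rng(5)
d_model, d = 64, 16
scale = 100.0                                  # deliberately large existing Q/K weights
W_Q = scale * rng.normal(size=(d_model, d))
W_K = scale * rng.normal(size=(d_model, d))

def polar(G):
    # Exact polar factor, standing in for a Muon-style orthogonalized update.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

eta = 1e-2
dW_Q = -eta * polar(rng.normal(size=(d_model, d)))
dW_K = -eta * polar(rng.normal(size=(d_model, d)))

d_prod = dW_Q @ W_K.T + W_Q @ dW_K.T           # first-order change in W_Q W_K^T
print(np.linalg.norm(dW_Q, 2), np.linalg.norm(dW_K, 2))  # both exactly eta
print(np.linalg.norm(d_prod, 2))               # orders of magnitude larger
```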

A Muon transformer should therefore treat these pieces separately: one uses Muon on parameters whose forward role is linear, such as MLP matrices, expert matrices, dense projections, routers, and possibly value/output projections, while treating Q/K separately with per-head RMS/L2 normalization before the dot product. QK-Clip is best viewed as a guardrail for settings in which the architecture makes direct normalization awkward, and the quantities worth logging are the ones that correspond to this geometry, such as max logit, RMS logit, large-logit ratio, query/key RMS, and update RMS by matrix shape. Likewise, critical batch size should be estimated at small scale for the actual optimizer rather than imported from an SGD schedule.
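As one possible shape for that logging, a helper along these lines could compute the listed quantities per step; the metric names and the large-logit threshold are my own illustrative choices.

```python
import numpy as np

def attention_health_stats(S, Q, K, threshold=50.0):
    # S: attention logits (heads, T, T); Q, K: projected queries/keys (heads, T, d_head).
    # Returns the geometry-aligned diagnostics discussed above.
    return {
        "max_logit": float(np.max(S)),
        "rms_logit": float(np.sqrt(np.mean(S**2))),
        "large_logit_frac": float(np.mean(np.abs(S) > threshold)),
        "query_rms": float(np.sqrt(np.mean(Q**2))),
        "key_rms": float(np.sqrt(np.mean(K**2))),
    }
```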

In short, Muon gives a principled matrix update for linear operators, but attention depends on the bilinear operator \(QK^\top\), so the same derivation does not apply unchanged, and QK normalization is the simplest way I know to put that missing geometry back into the model.