That pesky equation 5.25

statistics
maths
R
linear algebra
Author: Loïc Grobol
Published: August 1, 2023

A view of Times Square.

A cute property of Penalized Least Squares that took me more work than I’m comfortable disclosing.

Header by Terabass, CC BY-SA 3.0, via Wikimedia Commons

I’m currently reading Douglas M. Bates’ “lme4: Mixed-effects modeling with R” (Bates (2018), the latest version, which for some reason seems only available from someone at ETH Zurich’s personal stash). It’s generally well written and I have learned a lot, but I find it a bit too terse in some places, with unclear jumps in the reasoning or computations. One such jump is equation 5.25:

\[ \lVert y_{\mathrm{obs}} - X\beta - Z\Lambda_{\theta}u\rVert^2 + \lVert u\rVert^2 = r_{\theta,\beta}^2 + \lVert L_{\theta}^TP(u-\tilde{u})\rVert^2 \tag{1}\]

I have had inordinate trouble understanding how the author arrived at this, so for the benefit of (at least) the future versions of myself who will have forgotten it, I will try to fill the gaps, leaving as little as possible implicit.

Problem statement

For brevity’s sake, I will try to keep to the useful definitions, but here are some pointers on what’s going on here:

  • We’re trying to fit a mixed-effects linear model (MELM) with observations \(y_{\mathrm{obs}}\). \(\beta\) and \(u\) 1 are the parameters associated with the fixed and the random effects respectively.

  • The left-hand side is a Penalized Residual Sum of Squares (PRSS)

    • The first part, \(\lVert y_{\mathrm{obs}} - X\beta - Z\Lambda_{\theta}u\rVert^2\), is the sum of squared residuals for the problem \(y = X\beta + Z\Lambda_{\theta}u + \varepsilon\).
    • The second part is a penalty term on the norm of \(u\), which comes from the chosen formulation of the MELM where the random effects follow a centered normal distribution. It also has the nice effect of favoring fixed-effect explanations (more on this in another post, hopefully).
  • The whole left-hand side is minimized at \(\tilde{u}\) with value \(r_{\theta,\beta}^2\), i.e.

    \[ \tilde{u} = \mathop{\mathrm{argmin}}_u \lVert y_{\mathrm{obs}} - X\beta - Z\Lambda_{\theta}u\rVert^2 + \lVert u\rVert^2 \] and

    \[ r_{\theta,\beta}^2 = \lVert y_{\mathrm{obs}} - X\beta - Z\Lambda_{\theta}\tilde{u}\rVert^2 + \lVert \tilde{u}\rVert^2 \]

In other words, Equation 1 means that for all \(u\), the PRSS can be written as the sum of its minimal value \(r_{\theta,\beta}^2\) and a residual term: \(\lVert L_{\theta}^TP(u-\tilde{u})\rVert^2\), which contains the difference between \(u\) and its optimal value \(\tilde{u}\) (that was to be expected since this residual must be zero when \(u=\tilde{u}\)).

Note that I have not described every variable here, most glaringly \(L_{\theta}\) and \(P\); their time will come later.

Ordinary sum of squared residuals

We have called the left-hand side of Equation 1 a Penalized Residual Sum of Squares. There are two notable things here:

  • Manipulating this is not nice. Sums of norms, even squared ones, are not very appealing in general. If we attempt to prove equation (5.25) directly, this will be a major pain.
  • This PRSS looks tantalizingly like a normal sum of squared residuals. If you squint, it’s actually a sum of squared sums of squares.

So the first thing we will do is to actually make it look like the SSR of an ordinary least squares problem using the following trick: for all \(u\) we have,

\[ \begin{aligned} \lVert y_{\mathrm{obs}} - X\beta - Z\Lambda_{\theta}u\rVert^2 + \lVert u\rVert^2 &= \lVert y_{\mathrm{obs}} - X\beta - Z\Lambda_{\theta}u\rVert^2 + \lVert -u\rVert^2\\ &= \left\lVert \begin{bmatrix}y_{\mathrm{obs}} - X\beta - Z\Lambda_{\theta}u\\-u\end{bmatrix}\right\rVert^2\\ &= \left\lVert \underbrace{\begin{bmatrix}y_{\mathrm{obs}} - X\beta\\0\end{bmatrix}}_{y^*} - \underbrace{\begin{bmatrix}Z\Lambda_{\theta}\\I\end{bmatrix}}_{X^*}u\right\rVert^2 \end{aligned} \tag{2}\]

where the brackets denote stacking (of vectors and of matrices) and \(I\) is the identity matrix (of the appropriate size, here \(q\), the number of random effects). The second line simply rewrites the fact that the squared norm of a vector is the sum of its squared coordinates; the third line just factors out the common \(u\).
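If, like me, you want to double-check this kind of manipulation numerically, here is a quick R sketch of Equation 2, with random matrices standing in for \(y_{\mathrm{obs}} - X\beta\) and \(Z\Lambda_{\theta}\) (all the names below are made up for the example):

```r
set.seed(1)
n <- 10  # number of observations
q <- 3   # number of random effects
ZLambda <- matrix(rnorm(n * q), n, q)  # stand-in for Z %*% Lambda_theta
resid_fixed <- rnorm(n)                # stand-in for y_obs - X %*% beta
u <- rnorm(q)

# Left-hand side: the PRSS as a sum of two squared norms
prss <- sum((resid_fixed - ZLambda %*% u)^2) + sum(u^2)

# Right-hand side: a single squared norm after stacking
y_star <- c(resid_fixed, rep(0, q))
X_star <- rbind(ZLambda, diag(q))
stacked <- sum((y_star - X_star %*% u)^2)

all.equal(prss, stacked)  # TRUE
```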

Now, this is the SSR for an ordinary least squares problem:

\[ y^* = X^*u + \varepsilon \tag{3}\]

and \(\tilde{u}\) must also be its least-squares solution, with \(r_{\theta, \beta}^2\) its minimal SSR.

Note that Equation 3 is not just any OLS problem: \(y^*\) and \(X^*\) have very specific properties that might seem necessary to obtain Equation 1. It turns out, however, that they aren't: Equation 1 is a property of OLS problems in general.

A property of ordinary least squares problems

Let’s take a step back and consider a general OLS problem

\[ Y = X\beta + \varepsilon \]

It’s well known that (provided the problem is overdetermined and \(X\) has full column rank) the sum of squared residuals is minimal for \(\beta = \hat{\beta}\), where

\[ \hat{\beta} = (X^TX)^{-1}X^Ty \]

and therefore

\[ X^TX\hat{\beta} = X^Ty \tag{4}\]
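As a quick sanity check of Equation 4 (and of the formula above), here is a small R example on made-up data:

```r
set.seed(2)
X <- cbind(1, rnorm(20))          # a small full-column-rank design matrix
y <- 2 + 3 * X[, 2] + rnorm(20)   # some response

# beta_hat = (X'X)^{-1} X'y, computed by solving the normal equations
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Equation 4: X'X beta_hat == X'y (up to floating-point error)
all.equal(crossprod(X) %*% beta_hat, crossprod(X, y))

# and it matches what lm() finds
all.equal(as.vector(beta_hat), unname(coef(lm(y ~ X[, 2]))))
```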

Now let’s look at what an equivalent of Equation 1 would be. Probably something like

\[ \lVert y - X\beta \rVert^2 = \lVert y - X\hat{\beta} \rVert^2 + \lVert ???\rVert^2 \]

and we have to figure out what the question marks would stand for.

So let's go! From the definition of the Euclidean norm, we have:

\[ \begin{align} \lVert y - X\beta \rVert^2 - \lVert y - X\hat{\beta} \rVert^2 &= (y - X\beta)^T(y - X\beta) - (y - X\hat{\beta})^T(y - X\hat{\beta})\\ &= \begin{aligned}[t] & y^Ty - y^TX\beta - (X\beta)^Ty +(X\beta)^TX\beta\\ - & y^Ty + y^TX\hat{\beta} + (X\hat{\beta})^Ty -(X\hat{\beta})^TX\hat{\beta} \end{aligned}\\ &= \begin{aligned}[t] & - y^TX\beta - \beta^TX^Ty +\beta^TX^TX\beta\\ & + y^TX\hat{\beta} + \hat{\beta}^TX^Ty - \hat{\beta}^TX^TX\hat{\beta} \end{aligned} \end{align} \]

using Equation 4, we have \(X^Ty=X^TX\hat{\beta}\) and therefore also \(y^TX=\hat{\beta}^TX^TX\), so

\[ \begin{align} \phantom{\lVert y - X\beta \rVert^2 - \lVert y - X\hat{\beta} \rVert^2} &= \begin{aligned}[t] & - \hat{\beta}^TX^TX\beta - \beta^TX^TX\hat{\beta} +\beta^TX^TX\beta\\ & + \hat{\beta}^TX^TX\hat{\beta} + \hat{\beta}^TX^TX\hat{\beta} - \hat{\beta}^TX^TX\hat{\beta} \end{aligned}\\ &= \hat{\beta}^TX^TX(\hat{\beta}-\beta) - \beta^TX^TX(\hat{\beta}-\beta) \\ &= (\hat{\beta}-\beta)^TX^TX(\hat{\beta}-\beta)\\ &= \lVert X(\hat{\beta}-\beta)\rVert^2 \end{align} \tag{5}\]

Which gives us

\[ \lVert y - X\beta \rVert^2 = \lVert y - X\hat{\beta} \rVert^2 + \lVert X(\hat{\beta}-\beta)\rVert^2 \tag{6}\]
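Equation 6 is also easy to check numerically; here is a small R sketch (any full-column-rank \(X\), any \(y\) and any \(\beta\) will do, the values below are arbitrary):

```r
set.seed(3)
X <- cbind(1, rnorm(15), rnorm(15))               # a full-column-rank design matrix
y <- rnorm(15)                                    # any response works
beta <- c(1, -2, 0.5)                             # an arbitrary beta
beta_hat <- solve(crossprod(X), crossprod(X, y))  # the OLS estimate

lhs <- sum((y - X %*% beta)^2)
rhs <- sum((y - X %*% beta_hat)^2) + sum((X %*% (beta_hat - beta))^2)
all.equal(lhs, rhs)  # TRUE
```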

This looks very much like Equation 1! Now we only need a tiny bit of work to go back to our PRSS. For that, we will actually not use Equation 6 but the second to last line of Equation 5:

\[ \lVert y - X\beta \rVert^2 - \lVert y - X\hat{\beta} \rVert^2 = (\hat{\beta}-\beta)^TX^TX(\hat{\beta}-\beta) \]

Which, applied to Equation 3, gives us:

\[ \lVert y^* - X^*u \rVert^2 - \lVert y^* - X^*\tilde{u} \rVert^2 = (\tilde{u}-u)^T{X^*}^TX^*(\tilde{u}-u) \tag{7}\]

Cholesky and his friend

One thing I have skipped over previously is the meaning of the \(L_{\theta}\) and \(P\) symbols. These are both matrices that are used to optimize the computations needed to obtain \(\tilde{u}\). To be more precise, since \(\tilde{u}\) is a solution of Equation 3, the theory of OLS gives us that:

\[ ({X^*}^TX^*)\tilde{u} = {X^*}^Ty \]

Solving this linear system directly is computationally intensive, but in this case \({X^*}^TX^*\) is positive definite2 and the computation can be made more efficient using the Cholesky decomposition, which here amounts to finding a lower triangular matrix \(L_{\theta}\) such that

\[ P^TL_{\theta}L_{\theta}^TP = {X^*}^TX^* \tag{8}\]

where \(P\) is the permutation matrix obtained from an (approximate) minimum degree algorithm, which makes the decomposition process even more efficient3.
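To see Equation 8 in action without pulling in lme4’s sparse machinery, here is a small R sketch using base R’s chol(pivot = TRUE). Beware that its pivoting is chosen for numerical reasons rather than by a minimum degree ordering, so the permutation won’t be the one lme4 would pick, but the algebraic identity is the same; ZLambda is again a random stand-in for \(Z\Lambda_{\theta}\).

```r
set.seed(4)
n <- 10
q <- 4
ZLambda <- matrix(rnorm(n * q), n, q)  # stand-in for Z %*% Lambda_theta
X_star <- rbind(ZLambda, diag(q))      # the stacked design matrix of Equation 3
A <- crossprod(X_star)                 # X*'X*, positive definite thanks to the identity block

R <- chol(A, pivot = TRUE)             # upper triangular, so L = t(R)
piv <- attr(R, "pivot")
P <- diag(q)[piv, ]                    # permutation matrix such that P %*% x == x[piv]
L <- t(R)

# Equation 8: P'LL'P == X*'X* (up to floating-point error)
all.equal(t(P) %*% L %*% t(L) %*% P, A)
```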

Wrap up

We have all the ingredients we need, let’s just plug Equation 8 into Equation 7:

\[ \begin{align} \lVert y^* - X^*u \rVert^2 - \lVert y^* - X^*\tilde{u} \rVert^2 &= (\tilde{u}-u)^T{X^*}^TX^*(\tilde{u}-u)\\ &= (\tilde{u}-u)^TP^TL_{\theta}L_{\theta}^TP(\tilde{u}-u)\\ &= \left[L_{\theta}^TP(\tilde{u}-u)\right]^T\left[L_{\theta}^TP(\tilde{u}-u)\right]\\ &= \lVert L_{\theta}^TP(\tilde{u}-u)\rVert^2 \end{align} \]

which, since \(\lVert L_{\theta}^TP(\tilde{u}-u)\rVert = \lVert L_{\theta}^TP(u-\tilde{u})\rVert\), proves Equation 1.
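Finally, here is an end-to-end numerical check of Equation 1 in R, with random matrices standing in for \(X\), \(Z\Lambda_{\theta}\), \(\beta\) and \(y_{\mathrm{obs}}\). This is only a sketch of the algebra above, not of how lme4 actually organizes these computations: I use dense matrices and base R’s pivoted Cholesky here.

```r
set.seed(5)
n <- 12
q <- 3
X <- cbind(1, rnorm(n))                # fixed-effects design matrix
ZLambda <- matrix(rnorm(n * q), n, q)  # stand-in for Z %*% Lambda_theta
beta <- c(0.5, -1)
y_obs <- rnorm(n)
u <- rnorm(q)                          # an arbitrary u

# The stacked OLS problem of Equation 3
y_star <- c(y_obs - X %*% beta, rep(0, q))
X_star <- rbind(ZLambda, diag(q))

# Its solution u_tilde and the minimal PRSS r2
u_tilde <- solve(crossprod(X_star), crossprod(X_star, y_star))
r2 <- sum((y_star - X_star %*% u_tilde)^2)

# L_theta and P from a pivoted Cholesky factorization of X*'X* (Equation 8)
R <- chol(crossprod(X_star), pivot = TRUE)
P <- diag(q)[attr(R, "pivot"), ]
L <- t(R)

# Equation 1: PRSS(u) == r2 + ||L'P(u - u_tilde)||^2
lhs <- sum((y_obs - X %*% beta - ZLambda %*% u)^2) + sum(u^2)
rhs <- r2 + sum((t(L) %*% P %*% (u - u_tilde))^2)
all.equal(lhs, rhs)  # TRUE
```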

Further reading

  • As I said at the top, Bates (2018) is a really good piece of work, which I recommend if you want to understand the inner workings of lme4 in depth. I will probably write more about the points I found interesting or confusing in upcoming posts.
  • In general, I recommend the articles of the English Wikipedia on these topics, which are usually good and provide good references if you want to delve into the technicalities. See in particular “Proofs Involving Ordinary Least Squares” (2022).
  • As a reference for linear models, I recommend Rao et al. (2007), an extremely comprehensive — if dry — textbook.
  • Should you happen — as I did — to fall into a rabbit hole of “but why?!” regarding the probability theory tools used for these topics, I have to recommend my personal reference: Klenke (2020)

Have fun and see you next time!


References

Bates, Douglas. 2018. lme4: Mixed-effects Modeling with R. https://stat.ethz.ch/~maechler/MEMo-pages/lMMwR.pdf.
Ingram, Stephen. 2006. “Minimum Degree Reordering Algorithms: A Tutorial.” 2006. http://sfingram.net/cs517_final.pdf.
Klenke, Achim. 2020. Probability Theory: A Comprehensive Course. Universitext. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-56402-5.
“Proofs Involving Ordinary Least Squares.” 2022. In Wikipedia. https://en.wikipedia.org/w/index.php?title=Proofs_involving_ordinary_least_squares&oldid=1066540290.
Rao, C. Radhakrishna, Helge Toutenburg, Shalabh, and Christian Heumann. 2007. Linear Models and Generalizations: Least Squares and Alternatives. 3rd ed. Springer Series in Statistics. Springer Berlin, Heidelberg. https://books.google.com?id=3LK9JoGEyN4C.

Footnotes

  1. \(u\) is not really a vector of parameters, as you’ll find out if you read through the book, but for the purpose of this exposition we can think of it as one.↩︎

  2. See the exercises at the end of chapter 5 in Bates (2018) to see why it can be guaranteed in non-made-up scenarios.↩︎

  3. See Ingram (2006) for a good intro to this.↩︎