Time for another item in the infinite stream of “things I said I’d do but didn’t make time for”. This week, it is a walkthrough of Bayesian multiple linear regression.

This is one of many things I have come across during my Ph.D. that have made me feel stupid. I find this quite alright: I know that I am “stupid” relative to other people, and if you never encounter people much smarter than you, you will likely not learn as quickly as you otherwise might. However, it was disheartening to find that lots of different resources skipped over a few fairly tricky but important steps. I thought it would be useful to talk through the process to make sure I understand it reasonably well, and it might be useful to somebody else, either as a sanity check for their own work or as a quick walkthrough.

In Bayesian linear regression, we posit that the vector of \(n\) observations, \(y\), was generated by a linear process such that \(y = X\beta + \epsilon\).1 Here \(X\) is the \(n \times k\) design matrix, where \(k\) is the number of available predictors. Typically, \(\epsilon_i \sim N(0, \sigma^2)\), where \(\sigma^2\) controls the magnitude of the observed noise.
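
To make the setup concrete, here is a minimal NumPy sketch of that generative process; the dimensions, coefficient values and noise level below are made up purely for illustration.

```python
# Simulate y = X beta + eps with arbitrary, made-up values of n, k,
# beta_true and sigma, purely for illustration.
import numpy as np

rng = np.random.default_rng(42)
n, k = 100, 3
X = rng.normal(size=(n, k))                          # n x k design matrix
beta_true = np.array([1.5, -2.0, 0.5])               # "true" coefficients
sigma = 0.8                                          # noise standard deviation
y = X @ beta_true + rng.normal(scale=sigma, size=n)  # y = X beta + eps
```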

The likelihood of this model is

\[p(y | X, \beta, \sigma^2) \propto (\sigma^2)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}(y- X \beta)^{\rm T}(y- X \beta)\right)\]
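
Continuing the simulated example above, the log of this kernel (dropping additive constants) is straightforward to write down in code; `log_likelihood` is just an illustrative helper, not anything from a library.

```python
# Log-likelihood up to an additive constant: the log of the kernel above.
# X, y, beta_true and sigma come from the simulation sketch earlier.
def log_likelihood(beta, sigma2, X, y):
    resid = y - X @ beta
    return -0.5 * len(y) * np.log(sigma2) - resid @ resid / (2 * sigma2)

print(log_likelihood(beta_true, sigma**2, X, y))
```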

We can sensibly focus on the part of this expression that involves the regression coefficients, \(\beta\): taking the log, the only term that depends on \(\beta\) is (up to the factor \(-\frac{1}{2\sigma^2}\)) the quadratic form

\[(y - X\beta)^{\rm T}(y - X\beta)\]

This is said (without explanation) to be equal to the expression

\[(y- X \hat{\beta})^{\rm T}(y- X \hat{\beta}) + (\beta - \hat{\beta})^{\rm T}(X^{\rm T}X)(\beta - \hat{\beta})\]

where \(\hat{\beta}\) are the ordinary least squares estimates of the regression coefficients, such that

\[\hat{\beta} = (X^{\rm T}X)^{-1}X^{\rm T}y\]
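
Continuing the simulated example, \(\hat{\beta}\) is easy to compute; solving the normal equations with `np.linalg.solve` is numerically preferable to forming the inverse explicitly, but both mirror the formula above.

```python
# OLS estimate from the normal equations (X^T X) beta_hat = X^T y,
# using X and y from the simulation sketch above.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should be close to beta_true
```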

I found this pretty confusing, and while this wasn’t the element of Bayesian regression that I was directly interested in, I wanted to know what was going on there. The referenced book2 gives slightly more detail, mentioning that we want to “complete the square”, but this is still pretty vague. It writes the problem as

\[\begin{align} (y - X\beta)^{\rm T}(y - X\beta) = & \ \beta^{\rm T} X^{\rm T} X \beta - \beta^{\rm T} X^{\rm T} y - y^{\rm T} X\beta + y^{\rm T}y \\ = & \ (\beta - \hat{\beta})^{\rm T}X^{\rm T}X(\beta - \hat{\beta}) + S \\ \text{where} \ S = & \ (y - X\hat{\beta})^{\rm T}(y - X\hat{\beta}) \\ \text{and} \ \hat{\beta} = & \ (X^{\rm T}X)^{-1}X^{\rm T}y \end{align}\]
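
Before unpacking the algebra, it is reassuring to check this decomposition numerically with the simulated data; the test value of \(\beta\) below is arbitrary.

```python
# Numerical check of the decomposition, using X, y and beta_hat from the
# earlier sketches and an arbitrary test value of beta.
beta = np.array([0.3, -1.0, 2.0])
lhs = (y - X @ beta) @ (y - X @ beta)
S = (y - X @ beta_hat) @ (y - X @ beta_hat)
rhs = (beta - beta_hat) @ (X.T @ X) @ (beta - beta_hat) + S
print(np.allclose(lhs, rhs))  # True
```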

So how does this step happen exactly? Two key results that are used throughout are that \((X^{-1})^{\rm T} = (X^{\rm T})^{-1}\) and that \((XY)^{\rm T} = y^{\rm T}X^{\rm T}\). So what do we do here to complete the square? Well, working back from the last result, we can expand the expression

\[\begin{align} (\beta - \hat{\beta})^{\rm T}X^{\rm T}X(\beta - \hat{\beta}) + (y - X\hat{\beta})^{\rm T}(y - X\hat{\beta}) = & \ \beta^{\rm T}X^{\rm T}X\beta - \beta^{\rm T}X^{\rm T}X \hat{\beta} - \hat{\beta}^{\rm T} X^{\rm T}X\beta + \hat{\beta}^{\rm T} X^{\rm T} X \hat{\beta} \\ & + y^{\rm T}y - y^{\rm T} X\hat{\beta} - \hat{\beta}^{\rm T} X^{\rm T} y + \hat{\beta}^{\rm T} X^{\rm T} X \hat{\beta} \end{align}\]

Unfortunately, this isn’t really any clearer or more obvious to me! If we replace \(\hat{\beta}\) with \((X^{\rm T}X)^{-1}X^{\rm T}y\) we might get a bit closer, though. Using the two results mentioned above, we can show that

\[\begin{align} \hat{\beta}^{\rm T} = & \ ((X^{\rm T}X)^{-1}X^{\rm T}y)^{\rm T} \\ = & \ (X^{\rm T}y)^{\rm T}((X^{\rm T}X)^{-1})^{\rm T} \\ = & \ y^{\rm T}X((X^{\rm T}X)^{\rm T})^{-1} \\ = & \ y^{\rm T} X (X^{\rm T} (X^{\rm T})^{\rm T})^{-1} \\ = & \ y^{\rm T}X(X^{\rm T}X)^{-1} \end{align}\]
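
This, too, is easy to check numerically with the simulated data; with one-dimensional NumPy arrays the transposes are implicit, so we simply compare the two vectors.

```python
# Check that beta_hat^T equals y^T X (X^T X)^{-1}, using X and y from the
# simulation sketch; with 1-D arrays the transpose is implicit.
XtX_inv = np.linalg.inv(X.T @ X)
lhs = XtX_inv @ X.T @ y       # beta_hat
rhs = y @ X @ XtX_inv         # y^T X (X^T X)^{-1}
print(np.allclose(lhs, rhs))  # True
```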

Using this result, we can rewrite our previous expression as

\[\begin{align} & \beta^{\rm T}X^{\rm T}X\beta \\ & - \beta^{\rm T}X^{\rm T}X(X^{\rm T}X)^{-1}X^{\rm T}y \\ & - y^{\rm T}X(X^{\rm T}X)^{-1}X^{\rm T}X\beta \\ & + y^{\rm T}X(X^{\rm T}X)^{-1}X^{\rm T}X(X^{\rm T}X)^{-1}X^{\rm T}y \\ & + y^{\rm T}y \\ & - y^{\rm T} X (X^{\rm T}X)^{-1} X^{\rm T} y \\ & - y^{\rm T} X (X^{\rm T}X)^{-1} X^{\rm T} y \\ & + y^{\rm T}X (X^{\rm T}X)^{-1} X^{\rm T} X (X^{\rm T}X)^{-1} X^{\rm T} y \end{align}\]

Since we know that by definition \((X^{\rm T}X)(X^{\rm T}X)^{-1} = I\) and \((X^{\rm T}X)^{-1}(X^{\rm T}X) = I\), we can reduce this expression considerably, and it starts to become familiar:

\[\beta^{\rm T} X^{\rm T} X \beta - \beta^{\rm T} X^{\rm T}y - y^{\rm T} X \beta + y^{\rm T}y + 2y^{\rm T}X (X^{\rm T}X)^{-1} X^{\rm T}y - 2y^{\rm T} X (X^{\rm T}X)^{-1} X^{\rm T} y\]

The last two terms cancel, so you can see that this is the same as the expanded form of the original expression,

\[(y - X\beta)^{\rm T}(y - X\beta) = \beta^{\rm T} X^{\rm T} X \beta - \beta^{\rm T} X^{\rm T} y - y^{\rm T} X\beta + y^{\rm T}y\]

So, as is usually the case, completing the square amounted to adding and subtracting the same quantity, in this case \(2y^{\rm T}X (X^{\rm T}X)^{-1} X^{\rm T}y\). We can then insert identity matrices in the form of \((X^{\rm T}X)^{-1}(X^{\rm T}X)\) or \((X^{\rm T}X)(X^{\rm T}X)^{-1}\) to factorise the expression into the desired form.
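
Written out, the grouping works like this: one copy of the added \(y^{\rm T}X(X^{\rm T}X)^{-1}X^{\rm T}y\) goes with the terms involving \(\beta\), while the two subtracted copies and the remaining added copy go with the terms involving only \(y\):

\[\begin{align} & \left(\beta^{\rm T} X^{\rm T} X \beta - \beta^{\rm T} X^{\rm T}y - y^{\rm T} X \beta + y^{\rm T}X (X^{\rm T}X)^{-1} X^{\rm T}y\right) + \left(y^{\rm T}y - 2y^{\rm T}X (X^{\rm T}X)^{-1} X^{\rm T}y + y^{\rm T}X (X^{\rm T}X)^{-1} X^{\rm T}y\right) \\ = & \ \left(\beta^{\rm T} X^{\rm T} X \beta - \beta^{\rm T} (X^{\rm T}X)\hat{\beta} - \hat{\beta}^{\rm T}(X^{\rm T}X) \beta + \hat{\beta}^{\rm T}(X^{\rm T}X)\hat{\beta}\right) + \left(y^{\rm T}y - y^{\rm T}X\hat{\beta} - \hat{\beta}^{\rm T}X^{\rm T}y + \hat{\beta}^{\rm T}X^{\rm T}X\hat{\beta}\right) \\ = & \ (\beta - \hat{\beta})^{\rm T}X^{\rm T}X(\beta - \hat{\beta}) + (y - X\hat{\beta})^{\rm T}(y - X\hat{\beta}) \end{align}\]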

With this rewritten likelihood, we can search for a conjugate prior. O’Hagan describes this portion in a bit more detail - I may revisit it at some point!

  1. Note that I am not implying a causal relationship between \(X\) and \(y\). 

  2. O’Hagan, Anthony (1994). Bayesian Inference. Kendall’s Advanced Theory of Statistics. 2B (First ed.). Halsted. ISBN 0-340-52922-9.