PCA

Notes from Deep Learning Foundations and Concepts (Christopher M. Bishop, Hugh Bishop)

Principal component analysis, or PCA, is widely used for applications such as dimensionality reduction, lossy data compression, feature extraction, and data visualization. It is also known as the Kosambi–Karhunen–Loève transform.

PCA can be defined as the linear projection that maximizes the variance of the projected data, or equivalently as the linear projection that minimizes the average projection cost (the mean squared distance between the data points and their projections).

Given a dataset, PCA seeks a space of lower dimensionality (known as the principal subspace) that satisfies the definitions above.


Maximum variance formulation

Consider a data set of observations $\{x_n\}$, where $n = 1, \dots, N$ and $x_n$ is a Euclidean variable with dimensionality $D$.
Our goal is to project the data onto a space having dimensionality $M < D$, while maximizing the variance of the projected data.

Assume that the value of $M$ is given.

When $M = 1$, we can define the direction of the space using a $D$-dimensional vector $u_1$, and without loss of generality, we choose $u_1$ that satisfies $u_1^\mathsf{T} u_1 = 1$.
The projection of $x_n$ will then be $u_1^\mathsf{T} x_n$.
The mean of the projected data is $u_1^\mathsf{T} \bar{x}$, where $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$ is the sample mean.

The variance will be

$$\frac{1}{N}\sum_{n=1}^{N} \left( u_1^\mathsf{T} x_n - u_1^\mathsf{T} \bar{x} \right)^2 = u_1^\mathsf{T} S u_1$$

where $S = \frac{1}{N}\sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^\mathsf{T}$ is the data covariance matrix.
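As a quick sanity check, here is a small NumPy sketch (my own illustration, not from the book; the toy data and variable names are assumptions) showing numerically that the variance of the 1-D projections equals $u_1^\mathsf{T} S u_1$:

```python
import numpy as np

# Check numerically that the variance of the projections u1^T x_n equals u1^T S u1
# for a random data set and a random unit direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))              # N = 500 samples, D = 3
u1 = rng.normal(size=3)
u1 /= np.linalg.norm(u1)                   # enforce u1^T u1 = 1

x_bar = X.mean(axis=0)                     # sample mean
S = (X - x_bar).T @ (X - x_bar) / len(X)   # data covariance matrix S

proj = X @ u1                              # projections u1^T x_n
var_direct = np.mean((proj - u1 @ x_bar) ** 2)
var_quadratic = u1 @ S @ u1

print(var_direct, var_quadratic)           # the two values agree
```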

We then maximize the projected variance $u_1^\mathsf{T} S u_1$ with respect to $u_1$.
To prevent $\| u_1 \| \to \infty$, an appropriate constraint comes from the normalization condition $u_1^\mathsf{T} u_1 = 1$.
Introduce a Lagrange multiplier denoted by $\lambda_1$; then the objective becomes

$$u_1^\mathsf{T} S u_1 + \lambda_1 \left( 1 - u_1^\mathsf{T} u_1 \right)$$

Setting the derivative with respect to $u_1$ equal to zero, there is a stationary point when

$$S u_1 = \lambda_1 u_1$$

It says that $u_1$ must be an eigenvector of $S$. If we left-multiply by $u_1^\mathsf{T}$ and use the condition $u_1^\mathsf{T} u_1 = 1$, the variance will be

$$u_1^\mathsf{T} S u_1 = \lambda_1$$

Maximizing the variance therefore amounts to choosing the eigenvector $u_1$ that has the largest eigenvalue $\lambda_1$. This eigenvector is known as the first principal component.

We can then define additional principal components in an incremental fashion by choosing each new direction to be the one that maximizes the projected variance amongst all possible directions orthogonal to those already considered.

In the case of an $M$-dimensional projection space, the optimal solution is given by the $M$ eigenvectors $u_1, \dots, u_M$ of the covariance matrix $S$ that correspond to the $M$ largest eigenvalues $\lambda_1, \dots, \lambda_M$.

In summary, to find the $M$-dimensional projection space: first compute the mean, use it to compute the covariance matrix, then solve for the eigenvectors corresponding to the $M$ largest eigenvalues of the covariance matrix.
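A minimal sketch of this recipe in NumPy (my own illustration, not code from the book; the function name `pca` and the toy data are made up for the example):

```python
import numpy as np

def pca(X, M):
    """Top-M principal directions and the projected data (illustrative sketch)."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                            # mean-center the data
    S = Xc.T @ Xc / len(X)                    # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:M]     # indices of the M largest eigenvalues
    U = eigvecs[:, order]                     # principal directions u_1, ..., u_M
    Z = Xc @ U                                # data projected onto the principal subspace
    return U, Z, eigvals[order]

# Usage: project 5-D toy data onto its 2-D principal subspace.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
U, Z, top_eigvals = pca(X, M=2)
```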

Minimum-error formulation

Introduce a complete orthonormal set of $D$-dimensional basis vectors $\{u_i\}$, where $i = 1, \dots, D$, that satisfy

$$u_i^\mathsf{T} u_j = \delta_{ij}$$

Each data point can be represented exactly by a linear combination of the basis vectors:

$$x_n = \sum_{i=1}^{D} \alpha_{ni} u_i$$

This can be regarded as a rotation of the coordinate system to a new system defined by the $\{u_i\}$.

Taking the inner product of $x_n$ with $u_j$, and making use of the orthonormality property, we obtain $\alpha_{nj} = x_n^\mathsf{T} u_j$, so that we can write

$$x_n = \sum_{i=1}^{D} \left( x_n^\mathsf{T} u_i \right) u_i$$
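A tiny numerical illustration of this expansion (my own sketch; using the Q factor of a QR decomposition to build an orthonormal basis is an assumption of the example):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 4
# Build an orthonormal basis {u_i}: the Q factor of a QR decomposition.
U, _ = np.linalg.qr(rng.normal(size=(D, D)))   # columns are u_1, ..., u_D

x = rng.normal(size=D)
alpha = U.T @ x                                # coefficients alpha_i = x^T u_i
x_reconstructed = U @ alpha                    # sum_i alpha_i u_i

print(np.allclose(x, x_reconstructed))         # True: x is recovered exactly
```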

However, this expression uses the full $D$-dimensional space, while our goal is to represent the data in an $M$-dimensional subspace.
We can approximate each data point by

$$\tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i$$

where the $z_{ni}$ depend on the particular data point, and the $b_i$ are constants shared by all data points.
We are free to choose the $\{u_i\}$, the $\{z_{ni}\}$, and the $\{b_i\}$ so as to minimize the projection error introduced by the reduction in dimensionality.

The error can be defined as the mean squared distance between each data point and its approximation:

$$J = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2$$

Substituting the approximation into $J$, setting the derivative with respect to $z_{nj}$ to zero, and making use of the orthonormality conditions, we obtain

$$z_{nj} = x_n^\mathsf{T} u_j, \quad j = 1, \dots, M$$

Similarly, setting the derivative with respect to $b_j$ to zero, we obtain

$$b_j = \bar{x}^\mathsf{T} u_j, \quad j = M+1, \dots, D$$

where $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$ is the sample mean.

Substituting for $z_{ni}$ and $b_i$, the difference between a data point and its projection becomes

$$x_n - \tilde{x}_n = \sum_{i=M+1}^{D} \left\{ \left( x_n - \bar{x} \right)^\mathsf{T} u_i \right\} u_i$$

We can see that this displacement lies entirely in the space orthogonal to the principal subspace, so the minimum error is indeed given by the orthogonal projection.
We therefore obtain an expression for the error purely in terms of the $\{u_i\}$:

$$J = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} \left( x_n^\mathsf{T} u_i - \bar{x}^\mathsf{T} u_i \right)^2 = \sum_{i=M+1}^{D} u_i^\mathsf{T} S u_i$$
To avoid the trivial solution $u_i = 0$, we must add the normalization constraints $u_i^\mathsf{T} u_i = 1$ to the minimization.

For some intuition about the result, let's consider the case $D = 2$ and $M = 1$, where we have to choose a single discarded direction $u_2$.
By adding a Lagrange multiplier $\lambda_2$, we minimize

$$\tilde{J} = u_2^\mathsf{T} S u_2 + \lambda_2 \left( 1 - u_2^\mathsf{T} u_2 \right)$$

This has the same form as in the maximum variance derivation; we just look for the minimum instead.
Setting the derivative with respect to $u_2$ to zero, we obtain $S u_2 = \lambda_2 u_2$; back-substituting this into $J$, we obtain $J = \lambda_2$.
With the goal of minimizing $J$, we choose the eigenvector with the smaller eigenvalue as the discarded direction, and therefore we choose the eigenvector corresponding to the larger eigenvalue as the principal subspace.

The general solution is to choose the $\{u_i\}$ to be eigenvectors of the covariance matrix,

$$S u_i = \lambda_i u_i$$

and $J$ is given by the sum of the eigenvalues of the discarded directions:

$$J = \sum_{i=M+1}^{D} \lambda_i$$

We choose the discarded directions to have the $D - M$ smallest eigenvalues, and hence the eigenvectors defining the principal subspace are those corresponding to the $M$ largest eigenvalues.
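A short numerical check of this result (my own sketch, not from the book): the mean squared reconstruction error of the orthogonal projection equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)       # covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)           # eigenvalues in ascending order
M = 2
U_keep = eigvecs[:, -M:]                       # principal subspace (M largest eigenvalues)

# Orthogonal projection of each point onto the (affine) principal subspace.
X_tilde = x_bar + (X - x_bar) @ U_keep @ U_keep.T
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

print(np.isclose(J, eigvals[:-M].sum()))       # True: J equals the sum of discarded eigenvalues
```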

PLSR

notes for 16 Partial Least Squares Regression | All Models Are Wrong: Concepts of Statistical Learning

Initially, I wanted to learn about Partial Least Squares in general, but I found that it might be too broad or challenging, so I will just start from PLSR first.

The PLS methods form a big family, and PLSR may be a friendly one to start with.
PLSR is another dimension-reduction method used to regularize a model. Like PCR, PLSR seeks a subspace described by components $Z$ (i.e. linear combinations of the inputs $X$).

Moreover, there is an implicit assumption: $X$ and $y$ are assumed to be functions of a reduced number of components $Z$ that can be used to decompose them.



So the model in the $Z$-space is

$$X = Z P^\mathsf{T} + E$$

and $y$ becomes

$$y = Z b + e$$

But how do we get the components?

Assume that both the inputs and the response are mean-centered (and possibly standardized). Compute the covariances between all the input variables and the response:

$$w_j = \mathrm{cov}(x_j, y), \quad j = 1, \dots, p$$

In vector-matrix notation,

$$w = X^\mathsf{T} y$$
In fact, the two expressions above are not exactly the same: the matrix expression $X^\mathsf{T} y$ is larger than the covariance vector by a constant factor.
But we only care about the direction, and the directions are the same as long as the inputs and the response are mean-centered (their means are 0),
so it doesn't matter.

We can also rescale these weights by regressing each predictor onto $y$,

$$w_j = \frac{x_j^\mathsf{T} y}{y^\mathsf{T} y}$$

and then normalize the weight vector to unit norm:

$$w_1 = \frac{w}{\| w \|}$$

We use these weights to compute the first component:

$$z_1 = \frac{X w_1}{w_1^\mathsf{T} w_1}$$

Since $w_1$ is a unit vector, the formula can be expressed as

$$z_1 = X w_1$$

Then we regress the inputs onto the component to obtain the first PLS loading,

$$p_1 = \frac{X^\mathsf{T} z_1}{z_1^\mathsf{T} z_1}$$

and regress the response onto it to obtain the first PLS coefficient,

$$b_1 = \frac{y^\mathsf{T} z_1}{z_1^\mathsf{T} z_1}$$

And thus we have a first rank-one approximation:

$$X \approx z_1 p_1^\mathsf{T}, \qquad y \approx b_1 z_1$$

Then we can obtain the residual matrix (this step is called deflation),

$$X_1 = X - z_1 p_1^\mathsf{T}$$

and deflate the response:

$$y_1 = y - b_1 z_1$$

This is the first round; we can obtain $k$ components by repeating the process $k$ times, iteratively.

Each round reduces the residual by building a rank-one approximation of the previous residual matrix and the deflated response; the effects of these approximations are combined through the matrix products in the decomposition.

In my opinion, the purpose of doing it this way is to fit by regression, obtaining the lower-dimensional subspace while at the same time reducing the error it introduces (each round fits the current residual error); the results of these fits end up in the matrices of the decomposition given at the beginning, where their effects are combined.

What PLS is doing is calculating all the different ingredients (e.g. the weights $w_1$, the loadings $p_1$, the coefficients $b_1$) separately, using least squares regressions. Hence the reason for its name: partial least squares.
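Putting the rounds together, here is a compact sketch of the whole procedure as I understand it (my own implementation of the steps above, not code from the book; the function name `plsr` and the toy data are made up):

```python
import numpy as np

def plsr(X, y, k):
    """Components Z, loadings P, and coefficients b for k PLS components (sketch)."""
    X = X - X.mean(axis=0)                 # mean-center the inputs
    y = y - y.mean()                       # mean-center the response
    Z, P, b = [], [], []
    for _ in range(k):
        w = X.T @ y                        # covariance direction (up to a constant factor)
        w /= np.linalg.norm(w)             # normalize the weights
        z = X @ w                          # component z = X w (w is a unit vector)
        p = X.T @ z / (z @ z)              # loading: regress the inputs onto z
        c = y @ z / (z @ z)                # coefficient: regress the response onto z
        X = X - np.outer(z, p)             # deflate the inputs (remove rank-one approximation)
        y = y - c * z                      # deflate the response
        Z.append(z); P.append(p); b.append(c)
    return np.column_stack(Z), np.column_stack(P), np.array(b)

# Usage: fit 2 components and rebuild the fitted response y_hat = Z b.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=100)
Z, P, b = plsr(X, y, k=2)
y_hat = Z @ b
```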

References

PCA

PLSR

Difference

Not yet