PCA

Notes from Deep Learning Foundations and Concepts (Christopher M. Bishop, Hugh Bishop)

Principal component analysis, or PCA, is widely used for applications such as dimensionality reduction, lossy data compression, feature extraction, and data visualization. It is also known as the Kosambi–Karhunen–Loève transform.

PCA can be defined as the linear projection that maximizes the variance of the projected data, or equivalently as the linear projection that minimizes the average projection cost (the mean squared distance between the data points and their projections).

Given a dataset, PCA seeks a space of lower dimensionality (known as the principal subspace) that satisfies the definitions above.


Maximum variance formulation

Consider a data set of observations $\{x_n\}$, where $n = 1, \dots, N$ and $x_n$ is a Euclidean variable with dimensionality $D$.
Our goal is to project the data onto a space having dimensionality $M < D$, while maximizing the variance of the projected data.

Assume that the value of $M$ is given.

When $M = 1$, we can define the direction of the space using a $D$-dimensional vector $u_1$, and without loss of generality, we choose $u_1$ that satisfies $u_1^\mathsf{T} u_1 = 1$.
The projection of $x_n$ will then be $u_1^\mathsf{T} x_n$.
The mean of the projected data is $u_1^\mathsf{T} \bar{x}$, where $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$ is the sample mean.

The variance will be

$$\frac{1}{N}\sum_{n=1}^{N} \left( u_1^\mathsf{T} x_n - u_1^\mathsf{T} \bar{x} \right)^2 = u_1^\mathsf{T} S u_1$$

where $S = \frac{1}{N}\sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^\mathsf{T}$ is the data covariance matrix.
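As a quick sanity check, here is a small NumPy sketch (my own illustration, not from the book; the toy data and variable names are assumptions) showing numerically that the variance of the 1-D projections equals $u_1^\mathsf{T} S u_1$:

```python
import numpy as np

# Check numerically that the variance of the projections u1^T x_n equals u1^T S u1
# for a random data set and a random unit direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))              # N = 500 samples, D = 3
u1 = rng.normal(size=3)
u1 /= np.linalg.norm(u1)                   # enforce u1^T u1 = 1

x_bar = X.mean(axis=0)                     # sample mean
S = (X - x_bar).T @ (X - x_bar) / len(X)   # data covariance matrix S

proj = X @ u1                              # projections u1^T x_n
var_direct = np.mean((proj - u1 @ x_bar) ** 2)
var_quadratic = u1 @ S @ u1

print(var_direct, var_quadratic)           # the two values agree
```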

We then maximize the projected variance $u_1^\mathsf{T} S u_1$ with respect to $u_1$.
To prevent $\| u_1 \| \to \infty$, an appropriate constraint comes from the normalization condition $u_1^\mathsf{T} u_1 = 1$.
Introduce a Lagrange multiplier denoted by $\lambda_1$; then the objective becomes

$$u_1^\mathsf{T} S u_1 + \lambda_1 \left( 1 - u_1^\mathsf{T} u_1 \right)$$

Setting the derivative with respect to $u_1$ equal to zero, there is a stationary point when

$$S u_1 = \lambda_1 u_1$$

It says that $u_1$ must be an eigenvector of $S$. If we left-multiply by $u_1^\mathsf{T}$ and use the condition $u_1^\mathsf{T} u_1 = 1$, the variance will be

$$u_1^\mathsf{T} S u_1 = \lambda_1$$

Maximizing the variance therefore amounts to choosing the eigenvector $u_1$ that has the largest eigenvalue $\lambda_1$. This eigenvector is known as the first principal component.

We can then define additional principal components in an incremental fashion by choosing each new direction to be the one that maximizes the projected variance amongst all possible directions orthogonal to those already considered.

In the case of an $M$-dimensional projection space, the optimal solution is given by the $M$ eigenvectors $u_1, \dots, u_M$ of the covariance matrix $S$ that correspond to the $M$ largest eigenvalues $\lambda_1, \dots, \lambda_M$.

In summary, to find the $M$-dimensional projection space: first compute the mean, use it to compute the covariance matrix, then solve for the eigenvectors corresponding to the $M$ largest eigenvalues of the covariance matrix.
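A minimal sketch of this recipe in NumPy (my own illustration, not code from the book; the function name `pca` and the toy data are made up for the example):

```python
import numpy as np

def pca(X, M):
    """Top-M principal directions and the projected data (illustrative sketch)."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                            # mean-center the data
    S = Xc.T @ Xc / len(X)                    # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:M]     # indices of the M largest eigenvalues
    U = eigvecs[:, order]                     # principal directions u_1, ..., u_M
    Z = Xc @ U                                # data projected onto the principal subspace
    return U, Z, eigvals[order]

# Usage: project 5-D toy data onto its 2-D principal subspace.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
U, Z, top_eigvals = pca(X, M=2)
```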

Minimum-error formulation

Introduce a complete orthonormal set of $D$-dimensional basis vectors $\{u_i\}$, where $i = 1, \dots, D$, that satisfy

$$u_i^\mathsf{T} u_j = \delta_{ij}$$

Each data point can be represented exactly by a linear combination of the basis vectors:

$$x_n = \sum_{i=1}^{D} \alpha_{ni} u_i$$

This can be regarded as a rotation of the coordinate system to a new system defined by the $\{u_i\}$.

Taking the inner product of $x_n$ with $u_j$, and making use of the orthonormality property, we obtain $\alpha_{nj} = x_n^\mathsf{T} u_j$, so that we can write

$$x_n = \sum_{i=1}^{D} \left( x_n^\mathsf{T} u_i \right) u_i$$
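A tiny numerical illustration of this expansion (my own sketch; using the Q factor of a QR decomposition to build an orthonormal basis is an assumption of the example):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 4
# Build an orthonormal basis {u_i}: the Q factor of a QR decomposition.
U, _ = np.linalg.qr(rng.normal(size=(D, D)))   # columns are u_1, ..., u_D

x = rng.normal(size=D)
alpha = U.T @ x                                # coefficients alpha_i = x^T u_i
x_reconstructed = U @ alpha                    # sum_i alpha_i u_i

print(np.allclose(x, x_reconstructed))         # True: x is recovered exactly
```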

However, this expression uses the full $D$-dimensional space, while our goal is to represent the data in an $M$-dimensional subspace.
We can approximate each data point by

$$\tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i$$

where the $z_{ni}$ depend on the particular data point, and the $b_i$ are constants shared by all data points.
We are free to choose the $\{u_i\}$, the $\{z_{ni}\}$, and the $\{b_i\}$ so as to minimize the projection error introduced by the reduction in dimensionality.

The error can be defined as the mean squared distance between each data point and its approximation:

$$J = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2$$

Substituting the approximation into $J$, setting the derivative with respect to $z_{nj}$ to zero, and making use of the orthonormality conditions, we obtain

$$z_{nj} = x_n^\mathsf{T} u_j, \quad j = 1, \dots, M$$

Similarly, setting the derivative with respect to $b_j$ to zero, we obtain

$$b_j = \bar{x}^\mathsf{T} u_j, \quad j = M+1, \dots, D$$

where $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$ is the sample mean.

Substituting for $z_{ni}$ and $b_i$, the difference between a data point and its projection becomes

$$x_n - \tilde{x}_n = \sum_{i=M+1}^{D} \left\{ \left( x_n - \bar{x} \right)^\mathsf{T} u_i \right\} u_i$$

We can see that this displacement lies entirely in the space orthogonal to the principal subspace, so the minimum error is indeed given by the orthogonal projection.
We therefore obtain an expression for the error purely in terms of the $\{u_i\}$:

$$J = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} \left( x_n^\mathsf{T} u_i - \bar{x}^\mathsf{T} u_i \right)^2 = \sum_{i=M+1}^{D} u_i^\mathsf{T} S u_i$$
To avoid the trivial solution $u_i = 0$, we must add the normalization constraints $u_i^\mathsf{T} u_i = 1$ to the minimization.

For some intuition about the result, let's consider the case $D = 2$ and $M = 1$, where we have to choose a single discarded direction $u_2$.
By adding a Lagrange multiplier $\lambda_2$, we minimize

$$\tilde{J} = u_2^\mathsf{T} S u_2 + \lambda_2 \left( 1 - u_2^\mathsf{T} u_2 \right)$$

This has the same form as in the maximum variance derivation; we just look for the minimum instead.
Setting the derivative with respect to $u_2$ to zero, we obtain $S u_2 = \lambda_2 u_2$; back-substituting this into $J$, we obtain $J = \lambda_2$.
With the goal of minimizing $J$, we choose the eigenvector with the smaller eigenvalue as the discarded direction, and therefore we choose the eigenvector corresponding to the larger eigenvalue as the principal subspace.

The general solution is to choose the $\{u_i\}$ to be eigenvectors of the covariance matrix,

$$S u_i = \lambda_i u_i$$

and $J$ is given by the sum of the eigenvalues of the discarded directions:

$$J = \sum_{i=M+1}^{D} \lambda_i$$

We choose the discarded directions to have the $D - M$ smallest eigenvalues, and hence the eigenvectors defining the principal subspace are those corresponding to the $M$ largest eigenvalues.
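A short numerical check of this result (my own sketch, not from the book): the mean squared reconstruction error of the orthogonal projection equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)       # covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)           # eigenvalues in ascending order
M = 2
U_keep = eigvecs[:, -M:]                       # principal subspace (M largest eigenvalues)

# Orthogonal projection of each point onto the (affine) principal subspace.
X_tilde = x_bar + (X - x_bar) @ U_keep @ U_keep.T
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

print(np.isclose(J, eigvals[:-M].sum()))       # True: J equals the sum of discarded eigenvalues
```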

PLSR

notes for 16 Partial Least Squares Regression | All Models Are Wrong: Concepts of Statistical Learning

Initially, I wanted to learn about Partial Least Squares in general, but I found that it might be too broad or challenging, so I will just start from PLSR first.

The PLS methods form a big family, and PLSR may be a friendly one to start with.
PLSR is another dimension-reduction method used to regularize a model. Like PCR, PLSR seeks a subspace described by components $Z$ (i.e. linear combinations of the inputs $X$).

Moreover, there is an implicit assumption: $X$ and $y$ are assumed to be functions of a reduced number of components $Z$ that can be used to decompose them.



So the model in the $Z$-space is

$$X = Z P^\mathsf{T} + E$$

and $y$ becomes

$$y = Z b + e$$

But how do we get the components?

Assume that both the inputs and the response are mean-centered (and possibly standardized). Compute the covariances between all the input variables and the response:

$$w_j = \mathrm{cov}(x_j, y), \quad j = 1, \dots, p$$

In vector-matrix notation,

$$w = X^\mathsf{T} y$$
In fact, the two expressions above are not exactly the same: the matrix expression $X^\mathsf{T} y$ is larger than the covariance vector by a constant factor.
But we only care about the direction, and the directions are the same as long as the inputs and the response are mean-centered (their means are 0),
so it doesn't matter.

We can also rescale these weights by regressing each predictor onto $y$,

$$w_j = \frac{x_j^\mathsf{T} y}{y^\mathsf{T} y}$$

and then normalize the weight vector to unit norm:

$$w_1 = \frac{w}{\| w \|}$$

We use these weights to compute the first component:

$$z_1 = \frac{X w_1}{w_1^\mathsf{T} w_1}$$

Since $w_1$ is a unit vector, the formula can be expressed as

$$z_1 = X w_1$$

Then we regress the inputs onto the component to obtain the first PLS loading,

$$p_1 = \frac{X^\mathsf{T} z_1}{z_1^\mathsf{T} z_1}$$

and regress the response onto it to obtain the first PLS coefficient,

$$b_1 = \frac{y^\mathsf{T} z_1}{z_1^\mathsf{T} z_1}$$

And thus we have a first rank-one approximation:

$$X \approx z_1 p_1^\mathsf{T}, \qquad y \approx b_1 z_1$$

Then we can obtain the residual matrix (this step is called deflation),

$$X_1 = X - z_1 p_1^\mathsf{T}$$

and deflate the response:

$$y_1 = y - b_1 z_1$$

This is the first round; we can obtain $k$ components by repeating the process $k$ times, iteratively.

Each round reduces the residual by building a rank-one approximation of the previous residual matrix and the deflated response; the effects of these approximations are combined through the matrix products in the decomposition.

In my opinion, the purpose of doing it this way is to fit by regression, obtaining the lower-dimensional subspace while at the same time reducing the error it introduces (each round fits the current residual error); the results of these fits end up in the matrices of the decomposition given at the beginning, where their effects are combined.

What PLS is doing is calculating all the different ingredients (e.g. the weights $w_1$, the loadings $p_1$, the coefficients $b_1$) separately, using least squares regressions. Hence the reason for its name: partial least squares.
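Putting the rounds together, here is a compact sketch of the whole procedure as I understand it (my own implementation of the steps above, not code from the book; the function name `plsr` and the toy data are made up):

```python
import numpy as np

def plsr(X, y, k):
    """Components Z, loadings P, and coefficients b for k PLS components (sketch)."""
    X = X - X.mean(axis=0)                 # mean-center the inputs
    y = y - y.mean()                       # mean-center the response
    Z, P, b = [], [], []
    for _ in range(k):
        w = X.T @ y                        # covariance direction (up to a constant factor)
        w /= np.linalg.norm(w)             # normalize the weights
        z = X @ w                          # component z = X w (w is a unit vector)
        p = X.T @ z / (z @ z)              # loading: regress the inputs onto z
        c = y @ z / (z @ z)                # coefficient: regress the response onto z
        X = X - np.outer(z, p)             # deflate the inputs (remove rank-one approximation)
        y = y - c * z                      # deflate the response
        Z.append(z); P.append(p); b.append(c)
    return np.column_stack(Z), np.column_stack(P), np.array(b)

# Usage: fit 2 components and rebuild the fitted response y_hat = Z b.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=100)
Z, P, b = plsr(X, y, k=2)
y_hat = Z @ b
```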

References

PCA

PLSR

Difference

Not yet