Overview
These notes are almost copied from the *Deep Learning* book.
In machine learning, a norm is a function that measures the size of a vector.
The $L^p$ norm is given by
$$\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$$
for $p \in \mathbb{R}$, $p \geq 1$.
Intuitively, the norm of a vector $x$ measures the distance from the origin to the point $x$.
Rigorously, a norm is any function $f$ that satisfies the following properties:
- $f(x) = 0 \Rightarrow x = 0$
- $f(x + y) \leq f(x) + f(y)$ (triangle inequality)
- $f(\alpha x) = |\alpha| f(x)$ for all $\alpha \in \mathbb{R}$
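As a quick sanity check, here is a minimal NumPy sketch of the $L^p$ norm and the properties above (`lp_norm` is an illustrative helper, not a library function):

```python
import numpy as np

def lp_norm(x, p):
    """L^p norm of a vector (valid for p >= 1)."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0])
y = np.array([1.0, 2.0])

print(lp_norm(x, 2))  # 5.0, the Euclidean norm

# Triangle inequality: f(x + y) <= f(x) + f(y)
assert lp_norm(x + y, 2) <= lp_norm(x, 2) + lp_norm(y, 2)

# Absolute homogeneity: f(alpha * x) == |alpha| * f(x)
assert np.isclose(lp_norm(-2.0 * x, 2), 2.0 * lp_norm(x, 2))
```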
The $L^2$ norm, with $p = 2$, is known as the Euclidean norm. It is simply the Euclidean distance from the origin to the point $x$.
The $L^2$ norm is used frequently in machine learning and is often denoted simply as $\|x\|$, with the subscript 2 omitted.
It is more common to use its squared version, $\|x\|_2^2$, which can be calculated simply as $x^\top x$.
The squared $L^2$ norm is more convenient to work with mathematically and computationally; for example, each derivative of the squared $L^2$ norm depends only on the corresponding element of $x$, while the derivatives of the $L^2$ norm depend on the entire vector.
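A quick sketch of this point: the gradient of $x^\top x$ is $2x$, elementwise, while the gradient of $\|x\|_2$ couples all the elements through the norm in the denominator.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Gradient of the squared L2 norm (x^T x): each component is 2 * x_i,
# depending only on the corresponding element of x.
grad_squared = 2.0 * x

# Gradient of the L2 norm itself: x / ||x||_2, where every component
# depends on the entire vector through the denominator.
grad_l2 = x / np.linalg.norm(x)

print(grad_squared)  # [2. 4. 6.]
print(grad_l2)       # [0.2673 0.5345 0.8018]
```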
In other contexts, the squared $L^2$ norm may be undesirable because it increases very slowly near the origin.
It may be important to discriminate between elements that are exactly zero and elements that are small but nonzero, and in these cases we turn to the $L^1$ norm:
$$\|x\|_1 = \sum_i |x_i|$$
The $L^1$ norm grows at the same rate (linearly) at all locations, yet retains mathematical simplicity.
The $L^1$ norm is often used as a substitute for the number of nonzero entries of a vector.
Other norms that may be needed include the max norm, or $L^\infty$ norm:
$$\|x\|_\infty = \max_i |x_i|$$
It is the maximum absolute value of the components of the vector.
In the context of deep learning, the most common way to measure the size of a matrix is the Frobenius norm:
$$\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}$$
which is analogous to the $L^2$ norm of a vector.
By the way, we can use norms to rewrite the dot product:
$$x^\top y = \|x\|_2 \, \|y\|_2 \cos\theta$$
where $\theta$ is the angle between $x$ and $y$.
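All of these norms are available through `np.linalg.norm`; a minimal sketch:

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
y = np.array([0.5, 0.5, 0.5])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.linalg.norm(x, 1))       # L1 norm: sum of absolute values
print(np.linalg.norm(x, 2))       # L2 (Euclidean) norm
print(x @ x)                      # squared L2 norm as a dot product
print(np.linalg.norm(x, np.inf))  # max (L-infinity) norm
print(np.linalg.norm(A, 'fro'))   # Frobenius norm of a matrix

# Dot product rewritten with norms: x^T y = ||x||_2 ||y||_2 cos(theta)
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_theta)  # cosine of the angle between x and y
```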
Regularization with Norms
From PRML
3.1.4 Regularized least squares
The error function with a regularization term takes the form
$$E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$$
where $\lambda$ is the regularization coefficient that controls the relative importance of the data-dependent error $E_D(\mathbf{w})$ and the regularization term $E_W(\mathbf{w})$.
Regularization aims to control over-fitting, so it is natural to add a regularization term on the parameters.
We can use a norm to describe the magnitude (or other properties) of the parameters.
One simple choice of regularizer is the squared $L^2$ norm:
$$E_W(\mathbf{w}) = \frac{1}{2} \mathbf{w}^\top \mathbf{w}$$
where the factor $\frac{1}{2}$ is added for later convenience.
The entire penalized error function becomes
$$\frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^\top \phi(\mathbf{x}_n) \}^2 + \frac{\lambda}{2} \mathbf{w}^\top \mathbf{w}$$
This particular choice of regularizer is known in the machine learning literature as weight decay, because it encourages weight values to decay towards zero unless supported by the data.
In statistics, it provides an example of a parameter shrinkage method. Because the total error function remains quadratic in $\mathbf{w}$, its exact minimizer has a closed form:
$$\mathbf{w} = (\lambda \mathbf{I} + \boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{t}$$
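A minimal NumPy sketch of this closed-form solution on synthetic data (`Phi`, `t`, and `lam` are made-up illustrative names following PRML's notation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix Phi (N x M) and targets t.
N, M = 50, 5
Phi = rng.normal(size=(N, M))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
t = Phi @ w_true + 0.1 * rng.normal(size=N)

lam = 0.1  # regularization coefficient lambda

# Closed-form minimizer: w = (lambda * I + Phi^T Phi)^{-1} Phi^T t
w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(w)  # close to w_true, with components shrunk slightly towards zero
```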
A more general regularizer is sometimes used, for which the regularized error takes the form
$$\frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^\top \phi(\mathbf{x}_n) \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$$
where $q = 2$ corresponds to the quadratic regularizer above, and $q = 1$ is known as the lasso.
Since the norm describes the distance from the origin to the point $\mathbf{w}$, the regularization term can be visualized as contours of constant $\sum_j |w_j|^q$ for different values of $q$ (figure omitted).
From another perspective, minimizing the objective function with a norm penalty ($L^1$ norm or $L^2$ norm) can be analyzed using Lagrange multipliers.
In that view, minimizing the regularized objective is equivalent to minimizing the unregularized sum-of-squares error subject to the constraint
$$\sum_{j=1}^{M} |w_j|^q \leq \eta$$
for an appropriate value of $\eta$.
This constrained view can also be visualized in the same style as the picture above (figure omitted): the contours of the unregularized error function are plotted together with the constraint region, one panel with the circular $L^2$ region and the other with the diamond-shaped $L^1$ region, and the constrained minimum is denoted $\mathbf{w}^\star$.
The $L^1$ norm leads to a sparse model in which some coefficients are set exactly to zero, so the corresponding basis functions play no role.
The figure shows that the lasso gives a sparse solution in which one of the parameters is exactly 0.
It should also be recognized that with $L^2$ regularization both parameters decrease in magnitude, but neither is driven exactly to zero.
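To see the contrast empirically, here is a minimal scikit-learn sketch on synthetic data (the data and `alpha` values are made up for illustration; `alpha` plays the role of $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first and fourth features matter; the rest have zero true weight.
w_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # quadratic (L2) regularizer
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 regularizer

print(ridge.coef_)  # all coefficients shrunk towards zero, none exactly 0
print(lasso.coef_)  # irrelevant coefficients typically driven exactly to 0
```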