The rectified linear activation function, or ReLU, separates inputs by their sign (positive vs. negative).
It is basically the default activation function.

Whichever is larger wins: the output is the larger of the input and zero.

Limitations of sigmoid and Tanh

Gradient Problems
  • Sigmoid and tanh saturate for large positive or negative inputs, so their gradients shrink toward zero and the error signal vanishes in deep networks (to be expanded; see the sketch below)
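A minimal sketch of this saturation issue in plain Python (the names sigmoid and sigmoid_grad are mine, not from the notes): the derivative s(x) * (1 - s(x)) collapses toward zero for large |x|, which is what starves deep networks of gradient.

import math

# sigmoid and its derivative; the derivative saturates toward 0 for large |x|
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0, 10.0):
    # prints roughly 0.25, 0.105, 0.0066, 0.000045: the gradient dies off quickly
    print(x, sigmoid_grad(x))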

ReLU Activation

Avoids vanishing/exploding gradient issues (its gradient is either 0 or 1), but suffers from the dying ReLU problem.

  • dying ReLU problem — negative inputs lead to inactive neurons
    • Cause: a negative input produces a 0 output, so there is no gradient and no weight update
    • Impact: once a ReLU neuron gets stuck in this state where it only outputs zero, it is unlikely to recover; the output is 0 and the gradient is 0, so the signal is cut off at this neuron, and because the gradient is 0 its weights are never updated, leaving the neuron permanently inactive (it "dies")
    • Resulting issues: the network may fail to fit the data
# ReLU: identity for positive inputs, zero otherwise
def relu(x):
    if x > 0:
        return x
    return 0.0


Although the function is not smooth at 0, the slope there can simply be taken to be 0.
This causes no problems in practice.
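A minimal sketch of the matching gradient, assuming the convention above that the slope at 0 is taken to be 0 (relu_grad is my own name for it):

# subgradient of ReLU: 1 for positive inputs, 0 otherwise (including at x = 0)
def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# the gradient is exactly 0 or 1, so backpropagation through this activation
# neither shrinks nor amplifies the error signal
print([relu_grad(x) for x in (-2.0, 0.0, 3.0)])   # [0.0, 0.0, 1.0]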

advantages

  • Computational Simplicity
    • just return max(0, x), with no exponential computation (unlike sigmoid & tanh)
  • Representational Sparsity (see the sketch after this list)
    • returns a true zero (0.0) rather than merely approaching zero like sigmoid and tanh
    • allowing the activation of hidden layers in neural networks to contain one or more true zero values
    • this is called a sparse representation
      • sparse representations matter in autoencoders 👈 not fully understood; see Deep Learning (the Goodfellow et al. book) for details
  • Linear Behavior
    • easy to optimize
    • avoid vanishing gradients
  • Enables training deep networks
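A minimal NumPy sketch of the sparsity point above (the pre-activation values are arbitrary examples of mine): ReLU maps every non-positive pre-activation to an exact 0.0, while sigmoid only gets close to zero.

import numpy as np

# arbitrary pre-activations; several of them are negative
z = np.array([-3.0, -0.5, 0.0, 0.7, 2.0, -1.2])

relu_out = np.maximum(0.0, z)           # true zeros for all non-positive inputs
sigmoid_out = 1.0 / (1.0 + np.exp(-z))  # never exactly zero

print(relu_out)                  # [0.  0.  0.  0.7 2.  0. ]
print(np.mean(relu_out == 0.0))  # ~0.67 of the activations are exactly zero -> sparse
print(np.min(sigmoid_out))       # ~0.047, small but never a true zero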

tips

  • use ReLU as default activation function
  • Use with MLPs, CNNs, but probably not RNNs
    • networks often perform dramatically better after switching to ReLU
    • The surprising answer is that using a rectifying non-linearity is the single most important factor in improving the performance of a recognition system.
    • When using ReLU with CNNs, they can be used as the activation function on the filter maps themselves, followed then by a pooling layer.
    • ReLUs were thought not to be appropriate for Recurrent Neural Networks (RNNs) such as the Long Short-Term Memory Network (LSTM) by default
  • use a small input bias value
    • When using ReLU in your network, consider setting the bias to a small value, such as 0.1
    • the Deep Learning book recommends this, but it is somewhat contested; try both and compare
  • Use “He Weight Initialization” 👈 did not fully understand; see PyTorch中的Xavier以及He权重初始化方法解释_pytorch中he初始化-CSDN博客 (a sketch follows this list)
    • Kaiming He is seriously impressive
  • Scale Input Data
    • standardizing variables to have zero mean and unit variance (a variance of 1), or normalizing each value to the 0-to-1 scale
  • Use Weight Penalty
    • ReLU is unbounded in the positive domain
    • use an L1 or L2 vector norm penalty; L1 tends to work better here
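A minimal NumPy sketch of two of the tips above, He weight initialization and input scaling, plus a small positive bias; the layer sizes, data, and names are illustrative assumptions, not anything from the original notes.

import numpy as np

rng = np.random.default_rng(0)

# standardize inputs to zero mean and unit variance, per feature
X = rng.uniform(0.0, 255.0, size=(64, 100))   # e.g. raw pixel-like features
X = (X - X.mean(axis=0)) / X.std(axis=0)

# He initialization for a ReLU layer: weights drawn with std = sqrt(2 / fan_in)
fan_in, fan_out = 100, 50
W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
b = np.full(fan_out, 0.1)                     # small positive bias, as suggested above

# forward pass through the ReLU layer
h = np.maximum(0.0, X @ W + b)
print(h.shape, (h == 0.0).mean())             # (64, 50) and the fraction of inactive units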

Limitations of ReLU

  • dying ReLU
    • with large weights or outlier inputs, fitting the target output can push the bias to a large negative value; then even normal inputs become negative at this unit, every input yields a 0 output, and the gradient vanishes on the spot (see the sketch below)
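A minimal numeric sketch of this failure mode (the weights, bias, and inputs are made-up values for illustration): once the bias is strongly negative, every pre-activation is negative, so the output and the gradient are both zero and gradient descent can no longer move the unit out of that state.

import numpy as np

w = np.array([0.5, -0.3])        # made-up weights for a single ReLU unit
b = -10.0                        # bias pushed to a large negative value

X = np.array([[1.0, 2.0],        # a few "normal" inputs
              [0.3, -0.5],
              [2.0, 1.0]])

z = X @ w + b                    # pre-activations: all well below zero
out = np.maximum(0.0, z)         # outputs: all exactly 0
grad = (z > 0).astype(float)     # dReLU/dz: all 0 -> no gradient reaches w or b

print(z)     # [-10.1  -9.7  -9.3], negative for every input
print(out)   # [0. 0. 0.]
print(grad)  # [0. 0. 0.], the unit is "dead"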

Other ReLU variants

  • Leaky ReLU (LReLU or LReL) modifies the function to allow small negative values when the input is less than zero.
  • The Exponential Linear Unit, or ELU, is a generalization of the ReLU that uses a parameterized exponential function to transition from the positive to small negative values.
  • The Parametric ReLU, or PReLU, learns parameters that control the shape and leaky-ness of the function.
  • Maxout is an alternative piecewise linear function that returns the maximum of the inputs, designed to be used in conjunction with the dropout regularization technique.

Leaky ReLU

Leaky ReLU introduces a small gradient for negative inputs
The slope is a hyperparameter, tuned (like the learning rate) by adjusting it and evaluating the results.

The negative part has a small, non-zero slope.

disadvantage: inconsistent outputs for negative inputs
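A minimal sketch of Leaky ReLU, assuming the commonly used default negative-side slope of 0.01 (the value is my assumption, not from the notes):

# Leaky ReLU: identity for positive inputs, a small fixed slope otherwise
def leaky_relu(x, negative_slope=0.01):
    return x if x > 0 else negative_slope * x

print(leaky_relu(3.0))    # 3.0
print(leaky_relu(-5.0))   # -0.05: small but non-zero, so a gradient can still flow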

Parametric ReLU (PReLU)

ReLU with a learnable slope parameter for the negative side (sketched below).
It has shown effectiveness in various applications:

  • computer vision
  • speech recognition; fine-tuning is needed to obtain the corresponding parameter (the learnable slope)
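A minimal sketch of PReLU, where the negative-side slope alpha is itself learned by gradient descent; the derivative d(output)/d(alpha) = x for x < 0 follows directly from the definition, and the initial alpha of 0.25 is a common choice rather than something from the notes.

# PReLU: like Leaky ReLU, but the negative-side slope alpha is learned
def prelu(x, alpha):
    return x if x > 0 else alpha * x

# derivative of the output with respect to alpha (0 for positive inputs, x otherwise)
def prelu_grad_alpha(x, alpha):
    return 0.0 if x > 0 else x

alpha = 0.25                      # common initial value
x, upstream_grad = -2.0, 1.0      # made-up example values
alpha -= 0.1 * upstream_grad * prelu_grad_alpha(x, alpha)   # one SGD step on alpha
print(alpha)                      # 0.45: the slope itself has been updated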

Gaussian Error Linear Unit (GeLU)

GeLU has probabilistic foundations and smooth approximation characteristics:
it is a smooth approximation of the rectifier function, scaling inputs by their percentile rather than by their sign.

  • gained notable popularity in transformer architectures
    • smooth and non-linear, so it can fit complex models well (CV…); see the sketch below
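A minimal sketch of GeLU, using the exact form x * Phi(x) with the standard Gaussian CDF Phi, plus the widely used tanh approximation; both formulas are standard, not taken from these notes.

import math

# exact GeLU: x * Phi(x), where Phi is the standard normal CDF
def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# common tanh approximation of GeLU
def gelu_tanh(x):
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    # small negative inputs give small negative outputs, unlike ReLU's hard zero
    print(x, round(gelu(x), 4), round(gelu_tanh(x), 4))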

references