# Understanding weight decay

January 3, 2019

Weight decay is a regularization method for neural networks. Its purpose is to keep the weights from becoming too large, and in practice this effectively prevents overfitting. As for why it has this effect, even some textbook authors cannot explain it clearly.

$$C = C_0 + \frac{\lambda}{2n} \sum_w w^2$$

$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w$$

Plugging this gradient into the gradient-descent update shows where the name comes from:

$$w \to w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{n} w = \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}$$

Every step first shrinks (decays) each weight by a constant factor before applying the usual gradient step on $C_0$. Weight decay is also called L2 regularization, or the L2 parameter norm penalty.
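Below is a minimal NumPy sketch of one such update step. The learning rate `eta`, regularization strength `lam`, sample count `n`, and the gradient `grad_C0` of the unregularized cost are all illustrative placeholders, not values from the post; the point is only that the two equivalent forms of the update coincide.

```python
import numpy as np

def sgd_step_weight_decay(w, grad_C0, eta=0.1, lam=0.01, n=1000):
    """One gradient-descent step on C = C0 + (lam / (2 * n)) * sum(w ** 2).

    The penalty contributes (lam / n) * w to the gradient, which is the
    same as first shrinking w by the factor (1 - eta * lam / n) and then
    taking the usual step on C0 alone -- hence the name "weight decay".
    """
    # Form A: add the penalty's gradient explicitly.
    w_a = w - eta * (grad_C0 + (lam / n) * w)

    # Form B: decay the weights first, then apply the plain C0 step.
    w_b = (1.0 - eta * lam / n) * w - eta * grad_C0

    assert np.allclose(w_a, w_b)  # the two forms are algebraically identical
    return w_a

# Toy usage with made-up numbers.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
grad_C0 = rng.normal(size=5)
print(sgd_step_weight_decay(w, grad_C0))
```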

（1）Making the weights a bit smaller makes the whole network less sensitive to noise (or small perturbations) in the input; if a weight is too large, a tiny change in its corresponding input will dominate and drastically change the output.

（2）From the formula, weight decay shrinks larger weights more and smaller weights less; in other words, the larger a weight is, the larger its penalty, so decaying it reduces the cost function more effectively.

（3）It biases the neural network toward simpler models with smaller "slopes"; for example, a nonlinear relationship could be represented by a complicated high-order polynomial, but by tolerating some noise it can also be represented by a simple low-order polynomial, or even a linear function.

For example, in linear regression, this gives us solutions that have a smaller slope, or that put weight on fewer of the features. In other words, even though the model is capable of representing functions with much more complicated shapes, weight decay encourages it to use a simpler function described by smaller coefficients.
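As a concrete, made-up illustration of this point, the sketch below fits noisy and roughly linear data with a high-order polynomial, once without and once with an L2 penalty (ridge regression); the polynomial degree, the value of `lam`, and the data are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples from a roughly linear relationship.
x = np.linspace(-1.0, 1.0, 30)
y = 2.0 * x + 0.3 * rng.normal(size=x.size)

# Degree-9 polynomial features: the model *can* represent very wiggly curves.
X = np.vander(x, N=10, increasing=True)

def fit(X, y, lam):
    """Closed-form minimizer of ||X w - y||^2 + lam * ||w||^2 (ridge regression)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_plain = fit(X, y, lam=0.0)   # no penalty: larger, oscillating coefficients
w_decay = fit(X, y, lam=1.0)   # with weight decay: much smaller coefficients

print("coefficient norm without decay:", np.linalg.norm(w_plain))
print("coefficient norm with decay:   ", np.linalg.norm(w_decay))
```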

Intuitively, in parameter space, only directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact. In directions that do not contribute to reducing the objective, movement does not significantly increase the gradient, so the components of the weight vector corresponding to such unimportant directions are decayed away by the regularizer throughout training.

Another simple explanation: when your weights are large, the network is more sensitive to small noise in the input data. So when a small amount of noise is propagated through a network with large weights, it produces a much more different value at the output layer than it would in a network with small weights.
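Here is a tiny numerical illustration of this sensitivity argument, with arbitrary numbers: the same small input perturbation changes a unit's pre-activation far more when the weights are scaled up.

```python
import numpy as np

rng = np.random.default_rng(2)

x = rng.normal(size=10)               # an input vector
noise = 0.01 * rng.normal(size=10)    # a tiny perturbation of that input

w_small = rng.normal(size=10)         # modest weights
w_large = 50.0 * w_small              # same direction, 50x the magnitude

# Change in a unit's pre-activation (w . x) caused by the perturbation.
delta_small = abs(w_small @ (x + noise) - w_small @ x)
delta_large = abs(w_large @ (x + noise) - w_large @ x)

print("small weights, output change:", delta_small)
print("large weights, output change:", delta_large)   # 50x larger
```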

Note that weight decay is not the only regularization technique. In the past few years, other approaches have been introduced, such as Dropout, Bagging, Early Stopping, and Parameter Sharing, which work very well in NNs.
