




vector activation vs scalar activation
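The distinction can be sketched in NumPy: a scalar activation such as sigmoid acts elementwise (each output depends only on its own input), while a vector activation such as softmax couples all outputs through its normalizing sum. A minimal sketch; the function names are just for illustration:

```python
import numpy as np

def sigmoid(z):
    # scalar activation: applied elementwise, output i
    # depends only on input i
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # vector activation: every output depends on the whole
    # input vector through the normalizing sum
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
s = softmax(z)   # components sum to 1
g = sigmoid(z)   # each component computed independently
```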







sigmoid output -> probability of the positive class
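A tiny illustration of reading the sigmoid output as a class probability; the example logit and the 0.5 decision threshold are assumptions, not from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# raw network output (logit) for one example -- hypothetical value
logit = 0.8
p = sigmoid(logit)        # read as P(class = 1 | input)
pred = int(p >= 0.5)      # hard decision: threshold at 0.5
```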



how to define the error???

first choice: squared Euclidean distance
L2 divergence -> differentiation is simple: dDiv/dy_i = y_i - d_i
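The differentiation step can be made concrete. Assuming Div(y, d) = 1/2 * ||y - d||^2, the gradient with respect to each y_i is just the residual y_i - d_i:

```python
import numpy as np

def l2_divergence(y, d):
    # Div(y, d) = 1/2 * ||y - d||^2
    return 0.5 * np.sum((y - d) ** 2)

def l2_divergence_grad(y, d):
    # d Div / d y_i = y_i - d_i: just the residual
    return y - d

y = np.array([0.7, 0.2, 0.1])   # network output
d = np.array([1.0, 0.0, 0.0])   # one-hot target
g = l2_divergence_grad(y, d)
# g[0] < 0, so y[0] should increase to reduce the divergence
```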


gradient<0 => y_i should increase to reduce the div

the smoothed targets are arithmetically "wrong", but label smoothing helps gradient descent!
avoids overshooting toward the hard 0/1 targets
https://leimao.github.io/blog/Label-Smoothing/
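A minimal sketch of the smoothing rule discussed in the linked post, assuming the standard form y_ls = (1 - eps) * y + eps / K for K classes:

```python
import numpy as np

def smooth_labels(onehot, eps=0.1):
    # soften the hard 0/1 targets:
    # 1 -> 1 - eps + eps/K,   0 -> eps/K   (K = number of classes)
    K = onehot.shape[-1]
    return onehot * (1.0 - eps) + eps / K

d = np.array([0.0, 0.0, 1.0])
d_smooth = smooth_labels(d)   # still sums to 1, but no exact 0s or 1s
```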

it's a heuristic









forward NN
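The forward pass over fully connected layers might look like this; a sketch only, with sigmoid as the assumed activation and the cached activations kept for the backward pass:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # push the input through each affine layer + activation,
    # caching every layer's output for the backward pass
    ys = [x]
    for W, b in zip(weights, biases):
        z = W @ ys[-1] + b        # affine combination
        ys.append(sigmoid(z))     # elementwise (scalar) activation
    return ys
```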


backward NN
(1) trivial: grad of output

(2) grad of the final activation layer

(3) grad of the last group of weights


(4) grad of the second last group of y

(5) in summary: pseudocode & backward/forward comparison
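Steps (1)-(5) can be collected into one backward sketch, assuming sigmoid activations and the L2 divergence from earlier; the forward pass is repeated so the block stands alone:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    ys = [x]
    for W, b in zip(weights, biases):
        ys.append(sigmoid(W @ ys[-1] + b))
    return ys

def backward(ys, d, weights):
    # (1) trivial: grad of the output, dDiv/dy = y - d for L2
    dy = ys[-1] - d
    dWs, dbs = [], []
    for i in reversed(range(len(weights))):
        # (2) through the activation: sigmoid'(z) = y * (1 - y)
        dz = dy * ys[i + 1] * (1.0 - ys[i + 1])
        # (3) grad of this group of weights: outer product with the
        #     layer input
        dWs.insert(0, np.outer(dz, ys[i]))
        dbs.insert(0, dz)
        # (4) grad of the previous layer's y
        dy = weights[i].T @ dz
    # (5) gradients laid out like weights/biases, ready for an update
    return dWs, dbs
```

Note the mirror symmetry with the forward pass: forward walks the layers left to right accumulating activations, backward walks them right to left accumulating gradients, reusing the cached `ys`.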













