neural networks

Author: 习惯了千姿百态 | Published 2018-08-17 12:01

1. Representation

a_i^{(j)}: activation of unit i in layer j
\theta^{(j)}: matrix of weights controlling the function mapping from layer j to layer j+1

Figure 1
Figure 2
If the network has s_j units in layer j and s_{j+1} units in layer j+1, then \theta^{(j)} will be of dimension s_{j+1}\times(s_j+1)

z^{(j)}: the input of layer j (j\ge2)
a^{(j)}: the output of layer j (j\ge2)
\theta_{i,j}^{(l)}: weight controlling the function mapping from unit j in layer l to unit i in layer l+1
Since we use the sigmoid function in this neural network, we get the relation:
a^{(j)}=g(z^{(j)})
To keep the form consistent, a^{(1)}=[1;X^{(1)}] (add the bias unit to layer 1)
According to Figure 2, z_1^{(2)}=\theta_1^{(1)}a^{(1)}\qquad z_2^{(2)}=\theta_2^{(1)}a^{(1)}\qquad z_3^{(2)}=\theta_3^{(1)}a^{(1)}, where \theta_i^{(1)} denotes the i-th row of \theta^{(1)}
so we get z^{(2)}=\theta^{(1)}a^{(1)} and then a^{(2)}=g(z^{(2)}), but note that layer 2 has a bias unit, so we should add it to a^{(2)}, that is, a_0^{(2)}=1; the MATLAB command is a2=[1;a2].
In a similar way, z^{(3)}=\theta^{(2)}a^{(2)}, but we need not add a bias unit, because this is the last layer (the output layer), so the output is h_\theta(x)=a^{(3)}=g(z^{(3)})
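As a concrete sketch of this forward pass, in Python/NumPy rather than MATLAB (the layer sizes, random weights, and input below are made-up assumptions, not values from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 3 input units, 5 hidden units, 2 output units.
rng = np.random.default_rng(0)
theta1 = rng.standard_normal((5, 3 + 1))   # theta^{(1)}: layer 1 -> layer 2
theta2 = rng.standard_normal((2, 5 + 1))   # theta^{(2)}: layer 2 -> layer 3
x = rng.standard_normal(3)                 # one training example

a1 = np.concatenate(([1.0], x))            # a^{(1)} = [1; x] (bias unit)
z2 = theta1 @ a1                           # z^{(2)} = theta^{(1)} a^{(1)}
a2 = np.concatenate(([1.0], sigmoid(z2)))  # a^{(2)} = g(z^{(2)}), then a2=[1;a2]
z3 = theta2 @ a2                           # z^{(3)} = theta^{(2)} a^{(2)}
h = sigmoid(z3)                            # h_theta(x) = a^{(3)}, no bias added
print(h.shape)  # (2,)
```

Note how each \theta^{(j)} has one extra column to absorb the bias unit, matching the s_{j+1}\times(s_j+1) dimension above.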

2. Learning

2.1. Cost function

h_\theta(x)\in R^K\qquad (h_\theta(x))_k=k^{th} output
J(\Theta)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\bigg[ y_{k}^{(i)}\log\big((h_{\Theta}(x^{(i)}))_{k}\big)+(1-y_{k}^{(i)})\log\big(1-(h_{\Theta}(x^{(i)}))_{k}\big) \bigg] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_{l}}\sum_{j=1}^{s_{l+1}}\big(\Theta_{j,i}^{(l)}\big)^2
Our goal is to find the \theta that minimizes J(\theta); we can use gradient descent.
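A minimal sketch of computing this cost (Python/NumPy; the toy predictions H, labels Y, and \lambda are made-up assumptions):

```python
import numpy as np

def nn_cost(H, Y, thetas, lam):
    """Regularized cross-entropy cost J(Theta).

    H: (m, K) outputs (h_theta(x^{(i)}))_k, Y: (m, K) one-hot labels,
    thetas: list of weight matrices Theta^{(l)}, lam: regularization strength.
    """
    m = H.shape[0]
    cross_entropy = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # The bias columns (j = 0) are excluded from the regularization term.
    reg = lam / (2 * m) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)
    return cross_entropy + reg

# Tiny made-up example: m = 2 examples, K = 2 classes, no regularization.
H = np.array([[0.9, 0.1], [0.2, 0.8]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
print(nn_cost(H, Y, thetas=[], lam=0.0))  # ≈ 0.3285
```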

2.2. Gradient descent

We only need to compute J(\theta) and \frac{\partial}{\partial\theta_{i,j}^{(l)}}J(\theta), then use an advanced optimization method such as fmincg.

2.2.1. Forward propagation

a^{(1)}=x
z^{(2)}=\theta^{(1)}a^{(1)}
a^{(2)}=g(z^{(2)}) (add a_0^{(2)}=1)
z^{(3)}=\theta^{(2)}a^{(2)}
a^{(3)}=g(z^{(3)}) (add a_0^{(3)}=1)
z^{(4)}=\theta^{(3)}a^{(3)}
a^{(4)}=g(z^{(4)}) (output)
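The steps above generalize to any number of layers. A sketch of the generic loop (Python/NumPy; the 4-layer sizes and random weights are made-up assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, thetas):
    """Return the activations a^{(1)}, ..., a^{(L)} for one example x.

    thetas[l] is the matrix mapping layer l+1 to layer l+2 (0-based list).
    """
    a = np.concatenate(([1.0], x))   # a^{(1)} with its bias unit
    activations = [a]
    for l, theta in enumerate(thetas):
        z = theta @ a                # z^{(l+1)} = theta^{(l)} a^{(l)}
        a = sigmoid(z)               # a^{(l+1)} = g(z^{(l+1)})
        if l < len(thetas) - 1:      # no bias unit on the output layer
            a = np.concatenate(([1.0], a))
        activations.append(a)
    return activations

# Made-up 4-layer network: 3 -> 4 -> 4 -> 2 units.
rng = np.random.default_rng(1)
thetas = [rng.standard_normal((4, 4)),
          rng.standard_normal((4, 5)),
          rng.standard_normal((2, 5))]
acts = forward_propagate(rng.standard_normal(3), thetas)
print([a.shape for a in acts])  # [(4,), (5,), (5,), (2,)]
```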

2.2.2. Backpropagation algorithm

Focusing on a single example (x^{(i)},y^{(i)}), the case of one output unit, and ignoring regularization (\lambda=0):
J(\theta)=C(\theta)=-\bigg[ y\log(a^{(L)})+(1-y)\log(1-a^{(L)}) \bigg]
we define \delta_j^{(l)} as the "error" of the cost for a_j^{(l)} (unit j in layer l).
Formally, \delta^{(l)}=\frac{\partial C(\theta)}{\partial z^{(l)}},
Goal:
compute \frac{\partial C}{\partial \theta^{(l)}} and then update \theta^{(l)}:=\theta^{(l)}-\alpha\frac{\partial C}{\partial \theta^{(l)}} (with learning rate \alpha)
Given:
\delta^{(l)}=\frac{\partial C}{\partial z^{(l)}}
z^{(l+1)}=\theta^{(l)}*a^{(l)}=\theta^{(l)}*g(z^{(l)})
g'(z^{(l)})=\frac{\partial g(z^{(l)})}{\partial z^{(l)}}=g(z^{(l)})[1-g(z^{(l)})]=a^{(l)}[1-a^{(l)}]
Derivation:
By the chain rule, \frac{\partial C}{\partial a^{(l)}} = \frac{\partial z^{(l+1)}}{\partial a^{(l)}} \frac{\partial C}{\partial z^{(l+1)}}=(\theta^{(l)})^{T}\delta^{(l+1)}, so
\delta^{(l)}=\frac{\partial C}{\partial z^{(l)}}=\frac{\partial C}{\partial a^{(l)}}\cdot\frac{\partial a^{(l)}}{\partial z^{(l)}}=\big((\theta^{(l)})^{T}\delta^{(l+1)}\big).*a^{(l)}.*\big(1-a^{(l)}\big)\qquad (1)
So we get a recursion for \delta^{(l)}: after computing \delta^{(L)}, we can loop over l=L-1,L-2,\dots,2.
\delta^{(L)}=\frac{\partial C}{\partial z^{(L)}}=\frac{\partial C}{\partial a^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}=\frac{a^{(L)}-y}{a^{(L)}(1-a^{(L)})}\big[a^{(L)}(1-a^{(L)})\big]=a^{(L)}-y\qquad (2)
Finally, having computed \delta^{(l)}, we use it to compute \frac{\partial C}{\partial \theta^{(l)}}:
\frac{\partial C}{\partial \theta^{(l)}}=\frac{\partial C}{\partial z^{(l+1)}}\frac{\partial z^{(l+1)}}{\partial \theta^{(l)}}=\delta^{(l+1)}\big(a^{(l)}\big)^T
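For one example, recursion (1), formula (2), and the gradient formula can be sketched as follows (Python/NumPy; the small network and example are made-up assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up 3-layer network: 2 -> 3 -> 1 units, one example (x, y).
rng = np.random.default_rng(2)
theta1 = rng.standard_normal((3, 3))   # theta^{(1)}: layer 1 -> 2
theta2 = rng.standard_normal((1, 4))   # theta^{(2)}: layer 2 -> 3
x, y = rng.standard_normal(2), 1.0

# Forward pass
a1 = np.concatenate(([1.0], x))
a2 = np.concatenate(([1.0], sigmoid(theta1 @ a1)))
a3 = sigmoid(theta2 @ a2)

# Backward pass
delta3 = a3 - y                               # formula (2)
delta2 = (theta2.T @ delta3) * a2 * (1 - a2)  # formula (1)
delta2 = delta2[1:]                           # drop the bias unit's error

grad2 = np.outer(delta3, a2)   # dC/dtheta^{(2)} = delta^{(3)} (a^{(2)})^T
grad1 = np.outer(delta2, a1)   # dC/dtheta^{(1)} = delta^{(2)} (a^{(1)})^T
print(grad1.shape, grad2.shape)  # (3, 3) (1, 4)
```

Each gradient matrix has the same shape as its \theta^{(l)}, as it must for the update step.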


Algorithm:
Training set (x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)}) (m training examples).
Set \Delta_{ij}^{(l)}=0 (for all l,i,j), used to accumulate \frac{\partial}{\partial\Theta_{i,j}^{(l)}}J(\Theta) over the loop.
For training example t=1 to m:
 1. Set a^{(1)}:=x^{(t)}
 2. Perform forward propagation to compute a^{(l)} for l=2,3,\dots,L
 3. Compute the error of the output layer by the formula \delta^{(L)}=a^{(L)}-y^{(t)}
 4. Compute \delta^{(L-1)},\delta^{(L-2)},\dots,\delta^{(2)} using \delta^{(l)}=\big((\Theta^{(l)})^{T}\delta^{(l+1)}\big).*a^{(l)}.*\big(1-a^{(l)}\big)
 5. \Delta_{i,j}^{(l)}:=\Delta_{i,j}^{(l)}+a_{j}^{(l)}\delta_{i}^{(l+1)}, or with vectorization, \Delta^{(l)}:=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T
ENDFOR
update:
D_{i,j}^{(l)}:=\frac{1}{m}\big(\Delta_{i,j}^{(l)}+\lambda\Theta_{i,j}^{(l)} \big),if \quad j \neq 0
D_{i,j}^{(l)}:=\frac{1}{m}\Delta_{i,j}^{(l)},if\quad j =0
\boxed{\frac{∂}{∂\Theta_{i,j}^{(l)}}J(\Theta)=D_{i,j}^{(l)}}
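The whole algorithm above, with the \Delta accumulation and the final D matrices, can be sketched as (Python/NumPy; the sizes, data, and \lambda are made-up assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(X, Y, thetas, lam):
    """Accumulate Delta over all m examples and return the D matrices."""
    m = X.shape[0]
    Deltas = [np.zeros_like(t) for t in thetas]
    for t in range(m):
        # Forward propagation, storing a^{(l)} for every layer.
        a = np.concatenate(([1.0], X[t]))
        acts = [a]
        for l, theta in enumerate(thetas):
            a = sigmoid(theta @ a)
            if l < len(thetas) - 1:
                a = np.concatenate(([1.0], a))  # bias unit on hidden layers
            acts.append(a)
        # Backward propagation.
        delta = acts[-1] - Y[t]                    # delta^{(L)}
        for l in range(len(thetas) - 1, -1, -1):
            Deltas[l] += np.outer(delta, acts[l])  # Delta += delta (a^{(l)})^T
            if l > 0:
                delta = (thetas[l].T @ delta) * acts[l] * (1 - acts[l])
                delta = delta[1:]                  # drop the bias unit's error
    # D: average, regularizing every column except the bias column (j = 0).
    Ds = []
    for theta, Delta in zip(thetas, Deltas):
        D = Delta / m
        D[:, 1:] += lam / m * theta[:, 1:]
        Ds.append(D)
    return Ds

# Made-up data: 5 examples, network 2 -> 3 -> 1.
rng = np.random.default_rng(3)
thetas = [rng.standard_normal((3, 3)), rng.standard_normal((1, 4))]
X = rng.standard_normal((5, 2))
Y = rng.integers(0, 2, (5, 1)).astype(float)
Ds = backprop_gradients(X, Y, thetas, lam=0.1)
print([D.shape for D in Ds])  # [(3, 3), (1, 4)]
```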

3. Summary

1. Randomly initialize the weights
2. Implement forward propagation to get h_\theta(x^{(i)}) for any x^{(i)}
3. Implement code to compute the cost function J(\theta)
4. Implement backpropagation to compute the partial derivatives \frac{\partial}{\partial\theta_{j,k}^{(l)}}J(\theta)
5. Use gradient checking to compare \frac{\partial}{\partial\theta_{j,k}^{(l)}}J(\theta) computed using backpropagation with a numerical estimate of the gradient of J(\theta); then disable the gradient-checking code
6. Use gradient descent or an advanced optimization method with backpropagation to minimize J(\theta) as a function of the parameters \theta
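Step 5's numerical estimate can be sketched with two-sided finite differences (Python/NumPy; the test function J and \epsilon are illustrative assumptions, checked here on a function whose gradient is known exactly):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Two-sided finite-difference estimate of dJ/dtheta, element by element."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        t_plus, t_minus = theta.copy(), theta.copy()
        t_plus[idx] += eps
        t_minus[idx] -= eps
        grad[idx] = (J(t_plus) - J(t_minus)) / (2 * eps)
    return grad

# Check on J(theta) = sum(theta^2), whose exact gradient is 2 * theta.
theta = np.array([[1.0, -2.0], [0.5, 3.0]])
approx = numerical_gradient(lambda t: np.sum(t ** 2), theta)
print(np.max(np.abs(approx - 2 * theta)) < 1e-6)  # True
```

In practice J would be the network cost and theta the unrolled weights; the loop over every element is why gradient checking is far too slow to leave enabled during training.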


Source: https://www.haomeiwen.com/subject/udxfbftx.html