In 2014, before batch normalization was invented, training deep neural networks was hard.
For example, VGG was first trained as an 11-layer network; extra (randomly initialized) layers were then inserted into the middle and training continued, so that the deeper network could converge.
Another example: GoogLeNet attached auxiliary classifiers (extra early outputs) to intermediate layers, so that gradient could be injected directly into the lower layers.
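A minimal sketch of the auxiliary-classifier idea, assuming PyTorch; the layer sizes, names, and the 0.3 loss weight below are illustrative choices, not GoogLeNet's actual values. An extra classifier head attached to an intermediate feature map adds a second loss term, so gradient flows directly into the lower layers of a deep network.

import torch
import torch.nn as nn

class TinyNetWithAuxHead(nn.Module):
    # A small conv net with an auxiliary classifier attached partway up the stack.
    def __init__(self, num_classes=10):
        super().__init__()
        self.lower = nn.Sequential(  # early layers
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.upper = nn.Sequential(  # later layers
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.main_head = nn.Linear(128, num_classes)
        self.aux_head = nn.Sequential(  # auxiliary classifier on the intermediate features
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))

    def forward(self, x):
        mid = self.lower(x)
        main_logits = self.main_head(self.upper(mid).flatten(1))
        return main_logits, self.aux_head(mid)

# One training step: the auxiliary loss injects extra gradient into self.lower.
model, loss_fn = TinyNetWithAuxHead(), nn.CrossEntropyLoss()
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
main_logits, aux_logits = model(x)
loss = loss_fn(main_logits, y) + 0.3 * loss_fn(aux_logits, y)  # 0.3 is an illustrative weight
loss.backward()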
Backpropagating through an entire long sequence is bad: slow, and you only get one gradient update per full pass.
Truncated backpropagation through time: only backpropagate within a chunk of the sequence, carrying the hidden state forward across chunks.
Similar in spirit to mini-batch SGD, which uses a small piece of the data for each update.
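A minimal sketch of truncated backpropagation through time, assuming PyTorch; the chunk length of 25 and the model sizes are made-up values. The long sequence is processed chunk by chunk, gradients are backpropagated only inside each chunk, and the hidden state is carried forward but detached so no gradient flows across chunk boundaries.

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
readout = nn.Linear(16, 8)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)

seq = torch.randn(1, 1000, 8)     # one long input sequence
target = torch.randn(1, 1000, 8)  # per-step regression targets (dummy data)
chunk = 25                        # truncation length
h = None                          # hidden state carried across chunks

for t in range(0, seq.size(1), chunk):
    x, y = seq[:, t:t + chunk], target[:, t:t + chunk]
    out, h = rnn(x, h)
    loss = nn.functional.mse_loss(readout(out), y)
    opt.zero_grad()
    loss.backward()               # backprop only within this chunk
    opt.step()
    h = h.detach()                # keep the state, drop its gradient history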
Example: a character-level RNN trained on the Linux kernel source code learned to recite the GNU license.
675 Mass Ave -> central square ???
The results are not perfect; there are clear failure cases.
Soft attention: take a weighted combination of features from all image locations.
Hard attention: force the model to select exactly one location to look at. Trickier, because it is not differentiable; covered later in the reinforcement learning lecture.
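A rough sketch of soft attention over image locations, assuming PyTorch; the shapes and the additive scoring function are illustrative choices, not the exact model from the lecture. Attention weights over the L feature-grid locations are computed from the current hidden state, softmaxed, and used to form a weighted combination of all location features, so everything stays differentiable; hard attention would instead sample a single location, which is what breaks differentiability.

import torch
import torch.nn.functional as F

def soft_attention(features, hidden, W_feat, W_hid, v):
    # features: (B, L, D) image features at L grid locations; hidden: (B, H) RNN state.
    scores = torch.tanh(features @ W_feat + (hidden @ W_hid).unsqueeze(1)) @ v  # (B, L)
    weights = F.softmax(scores, dim=1)                        # distribution over all locations
    context = (weights.unsqueeze(-1) * features).sum(dim=1)   # (B, D) weighted combination
    return context, weights

B, L, D, H, A = 2, 49, 512, 256, 128  # e.g. a 7x7 feature grid -> L = 49 locations
features, hidden = torch.randn(B, L, D), torch.randn(B, H)
W_feat, W_hid, v = torch.randn(D, A), torch.randn(H, A), torch.randn(A)
context, weights = soft_attention(features, hidden, W_feat, W_hid, v)
print(context.shape, weights.shape)   # torch.Size([2, 512]) torch.Size([2, 49])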
RNNs are typically not very deep: 2, 3, or 4 layers is usual.
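For reference, a stacked ("deep") RNN is just a num_layers argument in frameworks like PyTorch; the sizes below are arbitrary. Each layer's hidden-state sequence is fed as the input sequence to the layer above it.

import torch
import torch.nn as nn

# A 3-layer LSTM: layer k's hidden sequence is the input to layer k+1.
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=3, batch_first=True)
x = torch.randn(4, 50, 128)   # (batch, time, features)
out, (h, c) = lstm(x)
print(out.shape, h.shape)     # torch.Size([4, 50, 256]) torch.Size([3, 4, 256])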