Batch Normalization

[Diagram: … → FC → BN → Activation → FC → BN → Activation → …
A BN layer is inserted after each fully connected layer (FC: $Wx + b$) and before the non-linearity. Normalization is done via mini-batch statistics, and the normalization parameters are learned through back-propagation instead of being fixed.]
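To make the placement concrete, here is a minimal sketch of the diagram's FC → BN → Activation stack, assuming PyTorch; the layer sizes are made up for illustration:

```python
import torch
import torch.nn as nn

# FC -> BN -> Activation, repeated: BN sits between the affine layer
# and the non-linearity, as in the diagram above.
model = nn.Sequential(
    nn.Linear(784, 256),   # FC: Wx + b
    nn.BatchNorm1d(256),   # BN via mini-batch statistics, per feature
    nn.ReLU(),             # non-linearity
    nn.Linear(256, 10),    # output layer
)

x = torch.randn(32, 784)   # a mini-batch of 32 samples
y = model(x)               # training mode -> batch statistics are used
```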
The normalization statistics are calculated for every mini-batch, for each feature:
$\mu_B$: vector containing $\mu_1, \mu_2, \ldots, \mu_d$
$\sigma_B$: vector containing $\sigma_1, \sigma_2, \ldots, \sigma_d$
(one entry per feature dimension)

Advantages of Batch Normalization

1. Normalization enables the network to maintain the non-linearity interposed between layers: the output of each BN layer is scaled & shifted by $\gamma, \beta$, which are optimized during training, and the original input can be recovered (by setting $\gamma = \sqrt{\sigma_B^2 + \epsilon}$, $\beta = \mu_B$) if that is what is optimal.
2. Makes training more resilient to changes in the scale of weights & learning rates → prevents gradient-vanishing & exploding problems; weights & learning rates can be set larger → faster & more stable convergence (a quick numeric check follows this list).
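A quick numeric check of advantage 2, as a sketch in NumPy (the helper `batchnorm` and all sizes are made up): because BN standardizes each feature, scaling the weight matrix by a constant is absorbed into $\mu_B$ and $\sigma_B$ and leaves the normalized activations unchanged.

```python
import numpy as np

def batchnorm(z, eps=1e-5):
    # Standardize each feature over the mini-batch dimension.
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 100))     # mini-batch of inputs
W = rng.normal(size=(100, 50))     # FC weights

z1 = batchnorm(x @ W)              # BN after the original weights
z2 = batchnorm(x @ (10.0 * W))     # BN after weights scaled by 10

# Identical normalized outputs: the weight scale cannot blow up (or
# shrink away) the activations that reach the non-linearity.
print(np.allclose(z1, z2, atol=1e-4))   # True
```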
The BN transform, for a mini-batch $B = \{x_1, \ldots, x_m\}$:

$\mu_B = \frac{1}{m}\sum_{i=1}^m x_i$ (mini-batch mean)
$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^m (x_i - \mu_B)^2$ (mini-batch variance)
$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$ (normalize)
$y_i = \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)$ (scale & shift)

$\gamma, \beta$ are learned parameters. Every step of the normalization is a simple differentiable transformation → batch normalization produces outputs which can then fully participate in the back-propagation step.
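The equations translate directly into code; a minimal training-time forward pass in NumPy (function and variable names are my own):

```python
import numpy as np

def bn_forward_train(x, gamma, beta, eps=1e-5):
    """BN forward pass over a mini-batch x of shape (m, d).

    gamma, beta: learned scale & shift, each of shape (d,).
    """
    mu = x.mean(axis=0)                    # mu_B: per-feature mean, shape (d,)
    var = x.var(axis=0)                    # sigma_B^2: per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale & shift

m, d = 32, 4
x = np.random.default_rng(1).normal(loc=3.0, scale=2.0, size=(m, d))
y = bn_forward_train(x, gamma=np.ones(d), beta=np.zeros(d))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 and ~1 per feature
```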
Why are $\gamma, \beta$ introduced?

Simply normalizing would achieve a fixed distribution of the inputs before entering the activation function, but it would constrain the inputs of a sigmoid or tanh activation to the (almost) linear regime of the function → breaks the non-linearity of the network!

→ introduce $\gamma, \beta$ that scale & shift the normalized value, & optimize them through gradient back-propagation like the other network parameters (a short numeric check follows below).
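To see the "almost linear regime" claim numerically, a small sketch (the value $\gamma = 4$ is made up): on standardized inputs, the sigmoid stays close to its linearization $0.5 + x/4$, while a learned scale pushes inputs back into the saturating, genuinely non-linear region.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x_hat = np.linspace(-2, 2, 401)   # typical range of normalized inputs
linear = 0.5 + x_hat / 4.0        # first-order Taylor expansion at 0

# Without gamma/beta: sigmoid is nearly linear on the normalized range.
print(np.abs(sigmoid(x_hat) - linear).max())                 # ~0.12

# With a learned scale gamma = 4 (beta = 0): inputs reach the
# saturating regime, so the non-linearity is effective again.
scaled = 4.0 * x_hat
print(np.abs(sigmoid(scaled) - (0.5 + scaled / 4.0)).max())  # ~1.5
```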
Why Batch Normalization?

→ it mitigates internal covariance shift, where the input distribution of each layer keeps changing during training as the weights below it are updated.

Internal Covariance Shift: the change in the distribution of the input data to a layer due to the change in the network parameters during training.

Training step: for each mini-batch $B$ (batch size: $m$), calculate the normalization statistics $\mu_B, \sigma_B^2$ for each feature as above, then normalize, scale & shift the inputs to the activation functions before the forward pass continues.
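A toy illustration of internal covariance shift, as a sketch with made-up numbers: with the input mini-batch held fixed, a single (exaggerated) update of the first-layer weights visibly shifts the distribution of the second layer's inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 50))        # fixed input mini-batch
W1 = 0.5 * rng.normal(size=(50, 50))  # first-layer weights

h = np.maximum(x @ W1, 0.0)           # inputs seen by the second layer
print(h.mean().round(2), h.std().round(2))

# One exaggerated (fake) gradient step on W1: the second layer now sees
# a different input distribution -- internal covariance shift.
W1 += 0.5 * rng.normal(size=W1.shape)
h = np.maximum(x @ W1, 0.0)
print(h.mean().round(2), h.std().round(2))  # noticeably different statistics
```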
Test step (Inference): replace the batch statistics by population statistics estimated from the mini-batch statistics collected during training:

$\mu_{\mathrm{inf}} = E_B[\mu_B]$ (average of the mini-batch means calculated throughout training)
$\sigma_{\mathrm{inf}}^2 = \frac{m}{m-1} E_B[\sigma_B^2]$ (unbiased estimate of the variance)

$y_{\mathrm{inf}} = \gamma \cdot \frac{x_{\mathrm{inf}} - \mu_{\mathrm{inf}}}{\sqrt{\sigma_{\mathrm{inf}}^2 + \epsilon}} + \beta$

Since $\gamma, \beta, \mu_{\mathrm{inf}}, \sigma_{\mathrm{inf}}$ are all fixed during inference, $\mathrm{BN}_{\mathrm{inf}}$ is a simple fixed linear transformation.
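Because everything is fixed at test time, BN can be folded into a single affine map $y = a x + b$ with $a = \gamma / \sqrt{\sigma_{\mathrm{inf}}^2 + \epsilon}$ and $b = \beta - a\,\mu_{\mathrm{inf}}$; a small sketch verifying this (all values made up):

```python
import numpy as np

gamma, beta = 1.5, -0.2       # learned parameters (made up)
mu_inf, var_inf = 0.7, 4.0    # population statistics from training (made up)
eps = 1e-5

def bn_inference(x):
    return gamma * (x - mu_inf) / np.sqrt(var_inf + eps) + beta

# Fold BN into one fixed affine transformation y = a*x + b.
a = gamma / np.sqrt(var_inf + eps)
b = beta - a * mu_inf

x = np.linspace(-3, 3, 7)
print(np.allclose(bn_inference(x), a * x + b))  # True
```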
BN for CNN

Input data: $N \times N \times D$; filter: $F \times F \times D$ (stride: 1) → output feature map: $(N-F+1) \times (N-F+1)$ per filter.

All nodes $a^{(n)}_{ij}$ of the same output feature map are produced by the same filter → they share the same parameters $\gamma, \beta$ (one pair for each filter / output channel), with the normalization statistics computed over the mini-batch & all spatial locations of that map.
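A sketch of per-channel BN for a convolutional feature map in NumPy (names and shapes are my own for the demo): statistics pool the batch and both spatial axes, and $\gamma, \beta$ have one entry per channel.

```python
import numpy as np

def bn_conv_train(x, gamma, beta, eps=1e-5):
    """x: feature maps of shape (m, H, W, C); gamma, beta: shape (C,).

    One (mu, sigma^2, gamma, beta) per channel, shared by every node of
    that channel's feature map; the effective sample size is m*H*W.
    """
    mu = x.mean(axis=(0, 1, 2))            # per-channel mean, shape (C,)
    var = x.var(axis=(0, 1, 2))            # per-channel variance, shape (C,)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale & shift

m, H, W, C = 8, 28, 28, 16
x = np.random.default_rng(2).normal(size=(m, H, W, C))
y = bn_conv_train(x, gamma=np.ones(C), beta=np.zeros(C))
print(y.mean(axis=(0, 1, 2)).round(6))  # ~0 for every channel
```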