
Hi everyone. Today we are continuing our implementation of makemore, still in the multilayer perceptron, along the lines of Bengio et al. 2003, for character-level language modeling. So we followed this paper: we took in a few characters of context and used an MLP to predict the next character. What we'd like to do now is move on to more complex and larger neural networks, like recurrent neural networks and their variations. Before we do that, though, we have to stick around at the level of the multilayer perceptron for a bit longer, and I'd like to do this because I would like us to have a very good intuitive understanding of the activations in the neural net during training, and especially of the gradients that are flowing backwards: how they behave and what they look like. This is going to be very important for understanding the history of the development of these architectures, because we'll see that recurrent neural networks, while they are very expressive function approximators and can in principle implement all kinds of algorithms, are not very easily optimizable with the gradient-based techniques that we have available to us and that we use all the time. The key to understanding why they are not easily optimizable is to understand the activations and the gradients and how they behave during training, and we'll see that a lot of the variants since recurrent neural networks have tried to improve that situation. So that's the path that we have to take. Let's get started.

The starting code for this lecture is largely the code from before, but I've cleaned it up a little. You'll see that we are importing all the torch and matplotlib utilities; we're reading in the words just like before (these are eight example words, and there's a total of 32,000 of them); here's the vocabulary of all the lowercase letters; and here we are reading in the data set, processing it, and creating three splits: the train, dev, and test split.

Now, this is the identical same MLP, except you see that I removed a bunch of magic numbers that we had here. Instead, we have the dimensionality of the embedding space of the characters and the number of hidden units in the hidden layer, and I've pulled them out here so that we don't have to go and change all these magic numbers all the time. It's the same neural net with 11,000 parameters, which we now optimize over 200,000 steps with a batch size of 32. You'll see that I refactored the code here a little bit, but there are no functional changes: I just created a few extra variables and a few more comments, and I removed all the magic numbers. Otherwise, it's the exact same thing.

Then, when we optimize, we saw that our loss looked something like this, and we saw roughly where the train and val loss ended up. I refactored the code a little bit for the evaluation of arbitrary splits: you pass in a string for which split you'd like to evaluate, and then here, depending on train, val, or test, I index in and get the correct split. Then this is the forward pass of the network, the evaluation of the loss, and the printing of it. So, just making it nicer. One thing you'll notice is that I'm using a decorator, torch.no_grad, which you can also look up and read the documentation of. Basically, what this decorator does on top of a function is that whatever happens inside the function is assumed by torch to never require any gradients, so torch will not do any of the bookkeeping it normally does to keep track of all the gradients in anticipation of an eventual backward pass. It's almost as if all the tensors that get created here have requires_grad set to False, and it just makes everything much more efficient, because you're telling torch that you will not call .backward on any of this computation and it doesn't need to maintain the graph. So that's what the decorator does; you can also use torch.no_grad as a context manager, and you can look those up.
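A minimal sketch of what the decorator described above does (the function and tensors here are illustrative, not the lecture's actual evaluation code):

```python
import torch

# Inside a @torch.no_grad() function, torch builds no autograd graph,
# so results track no gradients even if the inputs require them.
@torch.no_grad()
def evaluate(x, w):
    return (x @ w).sum()

x = torch.randn(4, 4)
w = torch.randn(4, 4, requires_grad=True)
out = evaluate(x, w)
print(out.requires_grad)  # False: the graph was never built
```

The same effect is available as a context manager, `with torch.no_grad(): ...`, which is handy for one-off evaluation blocks.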

Then here we have the sampling from the model, just as before: a forward pass of the neural net, getting the distribution, sampling from it, adjusting the context window, and repeating until we get the special end token. We see that we are starting to get much nicer-looking words sampled from the model. They're still not amazing, and still not fully name-like, but it's much better than what we had with the bigram model.

So that's our starting point. Now, the first thing I would like to scrutinize is the initialization. I can tell that our network is very improperly configured at initialization, and there are multiple things wrong with it, but let's just start with the first one. Look here: on the zeroth iteration, we are recording a loss of 27, and this rapidly comes down to roughly one or two or so. I can tell that the initialization is all messed up because this is way too high. In training neural nets, it is almost always the case that you will have a rough idea of what loss to expect at initialization, and that just depends on the loss function and the problem setup. In this case, I do not expect 27; I expect a much lower number, and we can calculate it. Basically, at initialization, here's what we'd like: there are 27 characters that could come next for any one training example, and at initialization we have no reason to believe any character to be much more likely than the others. So we'd expect the probability distribution that comes out initially to be a uniform distribution, assigning about equal probability to all 27 characters. So basically, what we'd like is for the probability of any character to be 1/27. That is the probability we should record, and then the loss is the negative log probability. So let's wrap this in a tensor; then we can take the log of it, and the negative log probability is about 3.29, much, much lower than 27.
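That expected starting loss is a one-liner to check:

```python
import math

# At a uniform initialization, every one of the 27 characters gets
# probability 1/27, so the expected loss is the negative log of that.
p_uniform = 1.0 / 27
expected_loss = -math.log(p_uniform)
print(round(expected_loss, 2))  # 3.3
```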

So what's happening right now is that at initialization, the neural net is creating probability distributions that are all messed up: some characters are very confident, and some characters are not. And then basically, the network is very confidently wrong, and that's what makes it record a very high loss.

So here's a smaller, four-dimensional example of the issue. Let's say we only have four characters, and the logits that come out of the neural net are all zero. Then, when we take the softmax of all zeros, we get probabilities that are a nice diffuse distribution: it sums to one and is exactly uniform. In this case, if the label is, say, 2, it doesn't actually matter if the label is 2 or 3 or 1 or 0, because it's a uniform distribution and we're recording the exact same loss, in this case 1.38. So this is the loss we would expect for a four-dimensional example.

And of course, as we start to manipulate these logits, we're going to be changing the loss. It could be that we luck out, and by chance the correct logit is a very high number, like, you know, five or something like that; in that case we'll record a very low loss, because we're assigning the correct probability at initialization, by chance. But it's much more likely that some other dimension will have a high logit, and then we record a much higher loss. What can happen is that the logits come out like something like this, you know, and take on extreme values. For example, we can take torch.randn of four, so these are, sorry, normally distributed numbers, and here we can also print the logits, the probabilities that come out of them, and the loss. Because these logits are near zero for the most part, the loss is close to the uniform one. But suppose this is multiplied by 10: now, because these are more extreme values, it's very unlikely that you're going to be guessing the correct bucket, and then you're confidently wrong and you record a very high loss. If your logits come out even more extreme, you might get extremely insane losses, like infinity, even at initialization.
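The four-dimensional example above can be reproduced with a short sketch (the label index 2 is just the one used in the walkthrough):

```python
import math
import torch
import torch.nn.functional as F

label = torch.tensor([2])

# All-zero logits: softmax is uniform, so the loss is ln(4)
# no matter which of the four labels is correct.
uniform_loss = F.cross_entropy(torch.zeros(1, 4), label)
print(round(uniform_loss.item(), 2))  # 1.39

# Extreme logits: scaling random logits by 10 usually makes the
# net confidently wrong, and the loss blows up well past ln(4).
torch.manual_seed(0)
extreme_loss = F.cross_entropy(torch.randn(1, 4) * 10, label)
print(extreme_loss.item())  # far above ln(4) for this seed
```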

So basically, this is not good, and we want the logits to be roughly zero when the network is initialized. In fact, the logits don't have to be exactly zero; they just have to be equal. For example, if all the logits are one, then because of the normalization inside the softmax, this will actually come out the same. But by symmetry, we don't want it to be any arbitrary positive or negative number; we just want it to be all zeros and record the loss that we expect at initialization.

So let's now quickly see where things go wrong in our example. Let me reinitialize the neural net, and here, let me break after the very first iteration, so we only see the initial loss, which is way too high. Intuitively, we can now inspect the variables involved, and we see that if we just print the first row of the logits, they take on quite extreme values. That's what's creating the fake confidence in incorrect answers and making the loss get very, very high. So these logits should be much, much closer to zero.

Now let's think through how we can get the logits coming out of this neural net to be closer to zero. You see here that the logits are calculated as the hidden states multiplied by W2, plus b2. So, first of all, we're currently initializing b2 as random values of the right size, but because we want the logits to be roughly zero, we don't actually want to be adding a bias of random numbers. In fact, I'm going to add a "times zero" here to make sure that b2 is just basically zero at initialization. And second, this is h multiplied by W2, so if we want the logits to be very, very small, then we should be scaling down W2. If we scale down W2 and run again just the very first iteration, you see that we are getting much closer to what we expect: roughly what we want is about 3.29, and this is 4.2. I can make the scale maybe even smaller, and get 3.32. Okay, so we're getting closer and closer. Now you're probably wondering, can we just set this to zero? Then we would, of course, get exactly the loss we're looking for at initialization.
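As a sketch, assuming the layer sizes used earlier in the lecture (200 hidden units, vocabulary of 27), the output-layer initialization described here looks like:

```python
import torch

g = torch.Generator().manual_seed(2147483647)
n_hidden, vocab_size = 200, 27  # sizes assumed from earlier in the lecture

# Scale the output weights down and zero the bias so the logits
# start near zero and the initial loss lands near -log(1/27).
W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01
b2 = torch.randn(vocab_size, generator=g) * 0

print(b2.abs().max().item())  # 0.0
```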

But the reason I don't usually do this is because I'm very nervous, and I'll show you in a second why you don't want to be setting the weights of a neural net exactly to zero. You usually want them to be small numbers instead of exactly zero. For this output layer, in this specific case, I think it would be fine, but I'll show you in a second where things go wrong very quickly if you do that. In this case, our loss is close enough to what we expect, but the weights are not exactly zero; they have a little bit of entropy in them, and that's used for symmetry breaking, as we'll see. The logits are now coming out much closer to zero, and everything is well and good.

So if I just erase these breaks, we can run the optimization with this new initialization, and let's just see what losses we record. Okay, so I let it run, and you see that we started off good, and the plot of the loss now doesn't have that hockey-stick appearance. Basically, what was happening in the hockey stick, in the very first few iterations, is that the optimization was just squashing down the logits and then rearranging them. So we took away this easy part of the loss function, where the weights were just being shrunk down, and therefore we don't get those easy gains in the beginning; we're just getting some of the hard gains of training the actual neural net, and so there's no hockey stick.

So good things are happening: number one, the loss at initialization is what we expect, and the loss doesn't look like a hockey stick; this is true for any neural net that you might train. And second, the loss that came out is actually improved. Unfortunately, I erased what we had here before, but this is 2.12, and I believe this was 2.16 before, so we get a slightly improved result. The reason for that is that we're actually optimizing the neural net from the start, instead of spending the first several thousand iterations just squashing down the weights, because they are way too high in the beginning, at initialization.

So that's something to look out for, and that's problem number one. Now let's look at the second problem. Let me re-initialize our neural net and reintroduce the break, so we have a reasonable initial loss. Even though everything is looking good on the level of the loss, and we get something that we expect, there's still a deeper problem lurking inside this neural net and its initialization. The logits are now okay; the problem is with the values of h, the activations of the hidden states. If we just visualize this tensor h, it's kind of hard to see, but the problem here, roughly speaking, is: you see how many of the elements are 1 or -1? Recall that torch.tanh, the tanh function, is a squashing function: it takes arbitrary numbers and squashes them into the range of -1 to 1.

So let's look at the histogram of h to get a better idea of the distribution of the values inside this tensor. We can see that h is 32 examples with 200 activations in each example. We can use view(-1) to stretch it out into one large vector, and then call .tolist() to convert it into one large Python list of floats. Then we can pass this into plt.hist for a histogram, say we want 50 bins, and add a semicolon to suppress a bunch of output we don't want. We see in this histogram that most of the values, by far, take on the values of -1 and 1, so this tanh is very, very saturated. We can also look at the pre-activations that feed into the tanh, and we can see that their distribution is very, very broad: these take on numbers between -15 and 15. That's why in torch.tanh everything is being squashed and capped into the range of -1 to 1, and lots of numbers here take on very extreme values.
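The saturation can be quantified with a small sketch; `hpreact` here is a stand-in with a broad spread like the one in the lecture, not the actual tensor:

```python
import torch

torch.manual_seed(42)
# Stand-in for the (32, 200) hidden pre-activations (embcat @ W1 + b1);
# a standard deviation of ~5 mimics the broad spread seen above.
hpreact = torch.randn(32, 200) * 5
h = torch.tanh(hpreact)

# Fraction of activations in the flat tails of tanh:
frac_saturated = (h.abs() > 0.99).float().mean().item()
print(frac_saturated > 0.5)  # True: most activations are saturated
# In the notebook, this is visualized with:
#   plt.hist(h.view(-1).tolist(), 50);
```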

Now, if you are new to neural networks, you might not actually see this as an issue. But if you're well versed in the dark arts of backpropagation, and have an intuitive sense of how gradients flow through a neural net, then you are looking at your distribution of tanh activations here and you are sweating. So let me show you why. We have to keep in mind that during backpropagation, just like we saw in micrograd, we are doing the backward pass starting at the loss and flowing through the network backwards. In particular, we're going to backpropagate through this torch.tanh; this layer is made up of 200 neurons for each one of these examples, and it implements an element-wise tanh. So let's look at what happens in tanh in the backward pass. We can actually go back to our micrograd code from the very first lecture and see how we implemented tanh. We saw that the input was x, and then we calculate t, which is the tanh of x; so t is the output of the tanh, and it is between -1 and 1. Then, in the backward pass, how do we backpropagate through tanh? We take out.grad and multiply it (this is the chain rule) by the local gradient, which took the form of 1 minus t squared. So what happens if the outputs of your tanh are very close to -1 or 1? If you plug in t = 1 here, you're going to get a zero multiplying out.grad: no matter what out.grad is, we are killing the gradient, effectively stopping the backpropagation through this tanh unit. Similarly, when t is -1, this will again become zero, and out.grad is killed.
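A micrograd-style sketch of that local gradient (`out_grad` stands in for `out.grad`):

```python
import math

def tanh_backward(x, out_grad):
    t = math.tanh(x)
    return (1 - t**2) * out_grad  # chain rule through the tanh

print(tanh_backward(0.0, 1.0))  # 1.0: the gradient passes straight through
print(tanh_backward(5.0, 1.0) < 1e-3)  # True: deep in the flat tail
```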

Intuitively, this makes sense: if the tanh's output is very close to one, then we are in the flat tail of the tanh, so changing the input is not going to impact the output of the tanh too much, because we're in a flat region, and therefore there's no impact on the loss. Indeed, the weights and biases feeding into this tanh neuron do not impact the loss, because the output of the tanh unit is in the flat region and there's no influence: we can change them however we want, and the loss is not impacted. That's another way to justify that, indeed, the gradient would be basically zero; it vanishes. Conversely, when t is exactly zero, we get 1 times out.grad: when the tanh takes on exactly the value of zero, out.grad is just passed through. So basically, what this is doing is: if t is equal to zero, the tanh unit is sort of inactive, and the gradient just passes through, but the more you are in the flat tails, the more the gradient is squashed. In fact, you'll see that the gradient flowing through tanh can only ever decrease, and the amount that it decreases depends, through the 1 minus t squared factor, on how far you are in the flat tails. And so that's kind of what's happening here, and the concern is that if all of these outputs h are in the flat regions of -1 and 1, then the gradients flowing through the network will just get destroyed at this layer.

Now, there is some redeeming quality here, and we can actually get a sense of it. I've written some code here: basically, we take h, take its absolute value, and see how often it is in a flat region, say greater than 0.99. What you get is the following. This is a boolean tensor, in which you get white if this is true and black if this is false. So basically, what we have here are the 32 examples and the 200 hidden neurons, and we see that a lot of this is white. What that's telling us is that a lot of these tanh neurons are very saturated: they're in a flat tail, and in all these cases the backward gradient would get destroyed.
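A sketch of that check, with `h` again a saturated stand-in rather than the lecture's actual tensor:

```python
import torch

torch.manual_seed(42)
h = torch.tanh(torch.randn(32, 200) * 5)  # saturated stand-in activations

saturated = h.abs() > 0.99  # boolean (32, 200) tensor: True in the flat tails
# In the notebook, this is shown with plt.imshow(saturated, cmap='gray').
# A column that is True for every example would indicate a dead neuron:
dead = saturated.all(dim=0)
print(dead.any().item())  # True would mean at least one dead neuron
```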

Now, we would be in a lot of trouble if, for any one of these 200 neurons, the entire column were white, because in that case we would have what's called a dead neuron. This could be a tanh neuron where the initialization of the weights and biases is such that no single example ever activates this tanh in its active region. If all the examples land in the tail, then this neuron will never learn: it is a dead neuron. So, scrutinizing this and looking for columns of completely white, we see that this is not the case: I don't see a single neuron that is all white. So, for every one of these tanh neurons, we do have some examples that activate it in the active part of the tanh; some gradients will flow through, this neuron will learn, and it will change and move.

But you can sometimes get yourself into cases where you have dead neurons. The way this manifests for a tanh neuron is that, no matter what inputs you plug in from your data set, the tanh neuron always fires as completely 1 or completely -1, and then it will just not learn, because all the gradients will be zeroed out.

This is true not just for tanh, but for a lot of other nonlinearities that people use in neural networks. We certainly use tanh a lot, but sigmoid will have the exact same issue, because it is also a squashing function, so basically the same applies to sigmoid. The same also applies to a ReLU. A ReLU has a completely flat region below zero: a ReLU neuron passes the pre-activation through if it is positive, and if the pre-activation is negative, it just shuts it off. Since the region below zero is completely flat, during backpropagation all of the gradient there would be set exactly to zero, instead of just a very, very small number.

And so you can get, for example, a dead ReLU neuron. A dead ReLU neuron is basically a neuron with a ReLU nonlinearity that never activates: for any example that you plug in from the data set, it never turns on; it's always in this flat region. Then this ReLU neuron is a dead neuron: its weights and bias will never learn, because they will never get a gradient. This can sometimes happen at initialization, because the weights and biases just make it so that, by chance, some neurons are forever dead. But it can also happen during optimization: if you have, say, too high of a learning rate, sometimes neurons get too much of a gradient and they get knocked off the data manifold. From then on, no example ever activates the neuron, so it remains dead forever. It's kind of like permanent brain damage in a mind of a network.

So sometimes what can happen is this: if your learning rate is very high, for example, and you have a neural net with ReLU neurons, you train the neural net and you get some final loss, but then when you go through the entire training set and forward all your examples, you can find neurons that never activate. They are dead neurons in your network, and those neurons will never turn on. Usually, what happens is that during training these ReLU neurons are changing, moving, etc., and then, because of a high gradient somewhere, by chance they get knocked off, and then nothing ever activates them, and from then on they are just dead. So that's kind of like a permanent brain damage that can happen to some of these neurons. Other nonlinearities, like leaky ReLU, will not suffer from this issue as much, because a leaky ReLU doesn't have flat tails: you almost always get a gradient. ELU is also fairly frequently used; it also might suffer from this issue.
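The dead-ReLU check described above, forwarding the data set and looking for units that never turn on, can be sketched like this (the tensors are hypothetical stand-ins, and the very negative biases are chosen deliberately so that some units come out dead):

```python
import torch

torch.manual_seed(0)
X = torch.randn(1000, 10)       # stand-in "data set": 1000 examples
W = torch.randn(10, 200)
b = torch.randn(200) * 10 - 5   # some biases end up very negative on purpose

act = torch.relu(X @ W + b)
dead = (act > 0).sum(dim=0) == 0   # units that never activated on any example
print(dead.sum().item(), "dead units out of", act.shape[1])
```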

So that's just something to be aware of and to be concerned about. In this case, we have way too many activations h that take on extreme values, but because there's no column of white, I think we will be okay, and indeed the network optimizes and gives us a pretty decent loss; it's just not optimal. This is not something you want, especially during initialization. So basically, what's happening is that this hpreact, the pre-activation feeding into the tanh, is too extreme, it's too large; it's creating a distribution that is too saturated on both sides of the tanh. And that's not something you want, because it means there's less training for these neurons: they update less frequently.

So how do we fix this? Well, hpreact is embcat, which comes from C, multiplied by W1, plus b1. The embeddings start out roughly unit gaussian, but after the multiplication by W1 and the addition of b1, hpreact ends up too far off from zero, and that's causing the issue. So we want this pre-activation to be closer to zero, very similar to the logits; here we want something very similar. Now, it's okay to set the biases to a very small number: we can multiply b1 by 0.01 to get a little bit of entropy. I sometimes like to do that, just so there's a little bit of variation and diversity in the original initialization of these tanh neurons, and I find in practice that that can help. And then the weights we can also just squash, so let's scale W1 down.

And now let's look at this. Well, you see, because we scaled W1 down by multiplying it by 0.1, we have a much better histogram, and that's because the pre-activations are now roughly between -1.5 and 1.5. With this, we expect much, much less white: basically, that's because there are no neurons saturated above 0.99 in either direction. This is actually an improvement; maybe we can even go up a little bit. Sorry, am I changing W1 here? Okay, so maybe something like 0.2 gives a nice distribution, so maybe that's what our initialization should be. So, starting with this initialization, let me run the full optimization without the break, and let's see what we get.
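Putting the two fixes together, a sketch of the full initialization, assuming the lecture's layer sizes (10-dimensional embeddings, context of 3 characters, 200 hidden units) and a stand-in batch of embeddings:

```python
import torch

g = torch.Generator().manual_seed(2147483647)
n_embd, block_size, n_hidden, vocab_size = 10, 3, 200, 27
fan_in = n_embd * block_size

C  = torch.randn((vocab_size, n_embd),   generator=g)
W1 = torch.randn((fan_in, n_hidden),     generator=g) * 0.2   # squash pre-activations
b1 = torch.randn(n_hidden,               generator=g) * 0.01  # tiny, for diversity
W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01  # squash logits
b2 = torch.randn(vocab_size,             generator=g) * 0     # start at zero

embcat = torch.randn(32, fan_in, generator=g)  # stand-in batch of embeddings
h = torch.tanh(embcat @ W1 + b1)
print((h.abs() > 0.99).float().mean().item() < 0.1)  # True: saturation is now rare
```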

Okay, so the optimization finished, and I re-plotted the loss, and this is the result that we get. Just as a reminder, I put down all the losses that we saw previously, and we see that we actually do get an improvement here. We started off with a validation loss of 2.17; when we fixed the softmax being confidently wrong, we came down to 2.13; and by fixing the tanh layer being way too saturated, we came down even further. The reason this is happening, of course, is that our initialization is better, so we're spending more time doing productive training, instead of unproductive training where our gradients are set to zero, or where we have to learn very simple things, like the overconfidence of the softmax in the beginning, spending cycles just squashing down the weight matrix.

So basically, we improved the initialization, and its impact on performance, just by being aware of the internals of these neural nets: their activations and their gradients. Now, we're working with a very small network here; this is just a one-hidden-layer multilayer perceptron. Because the network is so shallow, the optimization problem is actually quite easy and very forgiving: even though our initialization was terrible, the network still learned eventually; it just got a bit worse of a result. This is not the case in general, though. Once we start working with much deeper networks that have, say, 50 layers, things can get much more complicated, and these problems stack up. You can actually get into a place where the network is basically not training at all if your initialization is bad enough, and the deeper your network is and the more complex it is, the less forgiving it is of these errors. So this is definitely something to be aware of, something to scrutinize, something to plot, and something to be careful with.

Yeah, okay. So that's great that that worked for us, but what we have here now are all these magic numbers, like 0.2: where do I come up with this, and how am I supposed to set these if I have a large neural net with lots and lots of layers? Obviously, no one does this by hand; there are actually some relatively principled ways of setting these scales that I would like to introduce to you. So let me paste some code here that I prepared, just to motivate the discussion.

What I'm doing here is this: we have some random input x that is drawn from a gaussian, and there are 1,000 examples that are 10-dimensional. Then we have a weight layer here that is also initialized using a gaussian, just like we did before. These neurons in the hidden layer look at 10 inputs, and there are 200 neurons in this hidden layer. Then we have, in this case, the multiplication x times w, which gives the pre-activations of these neurons. Basically, the analysis here looks at the following: suppose these inputs are unit gaussian, and these weights are unit gaussian; if I do x times w, and we forget for now the bias and the nonlinearity, then what are the mean and the standard deviation of y? In the beginning, the input is just a normal gaussian distribution: mean zero, and the standard deviation is one. The standard deviation, again, is just a measure of the spread of the gaussian.

But once we multiply here and look at the histogram of y, we see that the mean, of course, stays the same; it's about zero, because this is a symmetric operation. But we see that the standard deviation has expanded to about three. So the input standard deviation was one, but now it has grown to three, and what you're seeing in the histogram is this gaussian expanding relative to the input. We don't want that: we want most of the neural net to have relatively similar activations, so roughly unit gaussian throughout the neural net. And so the question is: how do we scale these w's to preserve this distribution, so that it remains a unit gaussian? Intuitively, if I multiply these elements of w by a large number, then this gaussian grows and grows in standard deviation; now we're at 15. So these numbers in the output y take on more and more extreme values. But if we scale it down, say by 0.2, then conversely this gaussian gets smaller and smaller, it's shrinking, and you can see that the standard deviation is 0.6. And so the question is: what do I multiply by here to exactly preserve the standard deviation at one? It turns out that the correct answer, mathematically, when you work through the variance of this multiplication, is that you are supposed to divide by the square root of the fan-in. The fan-in is basically the number of input elements, here 10. So we are supposed to divide by the square root of 10, and one way to do the square root is to raise it to the power of 0.5, which is the same thing. So when you divide by the square root of the fan-in, the output gaussian has exactly a standard deviation of one.
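That scaling rule can be verified empirically; a quick sketch:

```python
import torch

torch.manual_seed(42)
fan_in = 10
x = torch.randn(1000, fan_in)  # 1000 unit-gaussian, 10-dimensional inputs

w_naive  = torch.randn(fan_in, 200)
w_scaled = torch.randn(fan_in, 200) * fan_in**-0.5  # divide by sqrt(fan_in)

print((x @ w_naive).std().item())   # ~3.16, i.e. sqrt(10): the gaussian expands
print((x @ w_scaled).std().item())  # ~1.0: the spread is preserved
```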

Now, unsurprisingly, a number of papers have looked into how to best initialize neural networks. In the case of multilayer perceptrons, we can have fairly deep networks that have these nonlinearities in between, and we want to make sure that the activations are well behaved: that they don't expand to infinity or shrink all the way to zero. The question is how to initialize the weights so that the activations take on reasonable values throughout the network.

One paper that has studied this in quite a bit of detail, and that is often referenced, is this paper by Kaiming He et al., called "Delving Deep into Rectifiers". In this case, they actually study convolutional neural networks, and they study especially the ReLU nonlinearity and the PReLU nonlinearity instead of a tanh nonlinearity, but the analysis is very similar. Basically, the nonlinearity they care about quite a bit, the ReLU, is a squashing function where all the negative numbers are simply clamped to zero: the positive numbers are passed through, but everything negative is just set to zero. Because you are basically throwing away half of the distribution, they find in their analysis of the forward activations in the neural net that you have to compensate for that. They find that, when they initialize their weights, they have to do it with a zero-mean gaussian whose standard deviation is the square root of 2 over the fan-in.

What we have here is that we are initializing a gaussian with the square root of the fan-in in the denominator; this nl here is the fan-in, so what we have is the square root of 1 over the fan-in, because of the division. They have to add this factor of 2 because of the ReLU, which discards half of the distribution and clamps it at zero, and that's where the 2 comes from.

In addition to that, this paper studies not just the behavior of the activations in the forward pass of the neural net, but also the backpropagation, and we have to make sure that the gradients are also well behaved, because ultimately they end up updating our parameters. What they find there, through a lot of analysis that I invite you to read through (it's very interesting and approachable), is basically this: if you properly initialize the forward pass, the backward pass is also approximately initialized, up to a constant factor that has to do with the number of hidden neurons. And they find empirically that this is not a choice that matters too much.

Now, this Kaiming initialization is also implemented in PyTorch: if you go to the torch.nn.init documentation, you'll find kaiming_normal_, and in my opinion this is probably the most common way of initializing neural networks today. It takes a few keyword arguments. Number one, it wants to know the mode: would you like to normalize the activations, or would you like to normalize the gradients, to always be gaussian with zero mean and a unit (one) standard deviation? Because they find in the paper that this doesn't matter too much, most people just leave it as the default, which is fan_in. And second, you pass in the nonlinearity that you are using, because depending on the nonlinearity, we need to calculate a slightly different gain. If your nonlinearity is just linear, so there's no nonlinearity, then the gain here will be one, and we have the exact same kind of formula that we've got here. But if the nonlinearity is something else, we get a slightly different gain. So if we come up here, we see that, for example, in the case of ReLU this gain is the square root of 2; and the reason it's a square root is that in this paper, you see how the 2 is inside the square root, so the gain is a square root of 2. In the case of linear, or identity, we just get a gain of one. In the case of tanh, which is what we're using here, the advertised gain is 5/3.
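A small sketch of the PyTorch API mentioned here; the (200, 30) shape follows the nn.Linear convention of (fan_out, fan_in), with the 30-dimensional fan-in assumed from our MLP:

```python
import torch

# calculate_gain returns the per-nonlinearity gain discussed above.
print(torch.nn.init.calculate_gain('tanh'))  # 5/3 ~ 1.667

w = torch.empty(200, 30)  # (fan_out, fan_in), as in an nn.Linear weight
torch.nn.init.kaiming_normal_(w, mode='fan_in', nonlinearity='tanh')
print(w.std().item())  # roughly (5/3) / sqrt(30), about 0.30
```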

And intuitively, why do we need a gain on top of the initialization? It's because tanh, just like relu, is a contractive transformation. What that means is you're taking the output distribution from this matrix multiplication and then you are squashing it in some way. Relu squashes it by taking everything below zero and clamping it to zero; tanh also squashes it, because it's a contractive operation it will take the tails and squeeze them in. And so in order to fight the squeezing in, we need to boost the weights a little bit, so that we renormalize everything back to unit standard deviation. That's why there's a little bit of a gain involved.

Now, I'm skipping through this section a little bit quickly, and I'm doing that intentionally. The reason is that about seven years ago, when this paper was written, you had to be extremely careful with the activations and gradients and their ranges and their histograms; you had to be very careful with the precise setting of gains and the scrutinizing of the nonlinearities used, and so on. Everything was very finicky and very fragile and had to be very properly arranged for the neural net to train, especially if your neural network was very deep.

But there are a number of modern innovations that have made everything significantly more stable and more well-behaved, and it has become less important to initialize these networks exactly right. Some of those modern innovations, for example, are residual connections, which we will cover in the future; the use of a number of normalization layers, like for example batch normalization, layer normalization, group normalization, and we're going to go into a lot of these as well; and, number three, much better optimizers: not just stochastic gradient descent, the simple optimizer we're basically using here, but slightly more complex optimizers like RMSprop and especially Adam. All of these modern innovations make it less important to get the initialization of the neural net exactly right.

All that being said, what should we do in practice? When I initialize these neural nets, I basically just normalize my weights by the square root of the fan-in, so roughly what we did here is what I do. Now, if we want to be exactly accurate here and go by the init of kaiming_normal_, this is how we would implement it: we want to set the standard deviation to be gain over the square root of fan_in. So to set the standard deviation of our weights, we will proceed as follows.

Basically, when we have a torch.randn, and let's say I just create a thousand numbers, we can look at the standard deviation of this, and of course that's one; that's the amount of spread. Let's make this a bit bigger so it's less noisy. So that's the spread of the gaussian of zero mean and unit standard deviation. Now, when you take these numbers and multiply them by, say, 0.2, that scales down the gaussian and makes its standard deviation 0.2. So the number that you multiply by here ends up being the standard deviation of this gaussian: this is a standard deviation 0.2 gaussian here, when we sample our W1.
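As a quick sanity check of this scaling behavior (just a sketch, not the lecture's own notebook code):

```python
import torch

x = torch.randn(1000)        # unit gaussian: std close to 1
y = torch.randn(1000) * 0.2  # the multiplier becomes the new standard deviation
print(x.std().item(), y.std().item())
```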

But we want to set the standard deviation to gain over the square root of fan-in. In other words, we want to multiply by the gain, which for tanh is 5/3, over the square root of the fan-in. In this example here the fan-in was 10, but I just noticed that actually the fan-in for W1 is n_embd times block_size, which as you'll recall is actually 30: that's because each character is 10-dimensional, but then we have three of them and we concatenate them. So actually the fan-in here was 30, and I should have used 30 here. So basically we want 5/3 over the square root of 30; this is the number, this is what we want our standard deviation to be, and this number turns out to be roughly 0.3.

Whereas here, just by fiddling with it, looking at the distribution, and making sure it looks okay, we came up with 0.2. So instead, what we want to do here is make the standard deviation 5/3, which is our gain, divided by the square root of the fan-in, which is n_embd times block_size. These brackets here are not strictly necessary, but I'll put them here for clarity. This is basically what we want; this is the Kaiming init in our case, for a tanh nonlinearity, and this is how we would initialize the neural net. So we can initialize this way, and then we can train the neural net and see what we get.
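Putting that together, here is roughly what the W1 initialization looks like, using the dimensions from this lecture (n_embd=10, block_size=3, n_hidden=200); this is a sketch of the idea rather than the exact notebook cell:

```python
import torch

n_embd, block_size, n_hidden = 10, 3, 200
fan_in = n_embd * block_size            # 30: three concatenated 10-d character embeddings
gain = 5 / 3                            # the tanh gain
W1 = torch.randn((fan_in, n_hidden)) * gain / fan_in ** 0.5
print(W1.std().item())                  # close to (5/3)/sqrt(30), i.e. roughly 0.3
```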

Okay, so I trained the neural net and we end up in roughly the same spot: looking at the validation loss, we now get 2.10, and previously we also had 2.10. There's a little bit of a difference, but that's just the randomness of the process, I suspect. The big deal, of course, is that we get to the same spot, but we did not have to introduce any magic numbers that we got from just looking at histograms and guessing and checking. We have something that is semi-principled, that will scale to much bigger networks, and that we can sort of use as a guide.

Now, I mentioned that the precise setting of these initializations is not as important today, due to some modern innovations, and I think now is a pretty good time to introduce one of those modern innovations, and that is batch normalization. Batch normalization came out in 2015, from a team at Google, and it was an extremely impactful paper, because it made it possible to train very deep neural nets quite reliably; it basically just worked. So here's what batch normalization does.

Basically, we have these hidden state pre-activations, hpreact, and we were talking about how we don't want these pre-activation states to be way too small, because then the tanh is not doing anything, but we don't want them to be too large either, because then the tanh is saturated. In fact, we want them to be roughly gaussian, so zero mean and a unit, or one, standard deviation, at least at initialization. So the insight from the batch normalization paper is: okay, you have these hidden states and you'd like them to be roughly gaussian, then why not take the hidden states and just normalize them to be gaussian? It sounds kind of crazy, but you can just do that, because standardizing hidden states so that they are unit gaussian is a perfectly differentiable operation, as we'll soon see. And so that was kind of the big insight in this paper, and when I first read it my mind was blown, because you can just normalize these hidden states: if you'd like unit gaussian states in your network, at least at initialization, you can just normalize them to be unit gaussian. So let's see how that works.

So we're going to scroll to our pre-activations here, just before they feed into the tanh. Now, the idea again is, remember, we're trying to make these roughly gaussian, and that's because if these are way too small numbers then the tanh is kind of inactive, but if these are very large numbers then the tanh is way too saturated and gradients don't flow; so we'd like this to be roughly gaussian. So the insight in batch normalization, again, is that we can just standardize these activations so they are exactly gaussian. hpreact has a shape of 32 by 200: 32 examples by 200 neurons in the hidden layer. So basically what we can do is take hpreact and calculate the mean, and the mean we want to calculate across the 0th dimension, and we want keepdim=True so that we can easily broadcast this later. The shape of the mean is 1 by 200; in other words, we are doing the mean over all the elements in the batch. Similarly, we can calculate the standard deviation of these activations, and that will also be 1 by 200.
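In code, with a stand-in hpreact of the same 32 by 200 shape (the variable names follow the lecture; the actual tensor here is made up for illustration):

```python
import torch

hpreact = torch.randn(32, 200) * 3 + 1   # stand-in pre-activations: 32 examples x 200 neurons
bnmeani = hpreact.mean(0, keepdim=True)  # 1 x 200: mean over the batch dimension
bnstdi = hpreact.std(0, keepdim=True)    # 1 x 200: std over the batch dimension
hnorm = (hpreact - bnmeani) / bnstdi     # each neuron is now ~unit gaussian over the batch
```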

Now, in this paper they have this sort of prescription here, and you see, here we are calculating the mean, which is just taking the average value of any neuron's activation, and then the standard deviation is basically this measure of the spread that we've been using: the distance of every one of these values away from the mean, squared and averaged; that's the variance, and then if you want the standard deviation you square root the variance. So these are the two quantities that we're calculating, and now we're going to normalize, or standardize, these x's by subtracting the mean and dividing by the standard deviation. So basically we're taking hpreact, subtracting the mean, and then dividing by the standard deviation. This is exactly what these two terms are: this is the mean and this is the variance; you see how sigma is usually the standard deviation, so sigma squared, which is the variance, is the square of the standard deviation.

So this is how you standardize these values, and what this will do is that every single neuron, and its firing rate, will be exactly unit gaussian on these 32 examples, at least, of this batch. That's why it's called batch normalization: we are normalizing over the batch. And then we could, in principle, train this. Notice that calculating the mean and the standard deviation, these are just mathematical formulas; they're perfectly differentiable, all of this is perfectly differentiable, and we can just train this.

The problem is, you actually won't achieve a very good result with this, and the reason is: we want these to be roughly gaussian, but only at initialization; we don't want these to be forced to be gaussian always. We would actually like to allow the neural net to move this around, to potentially make it more diffuse, to make it more sharp, to make some tanh neurons more trigger-happy or less trigger-happy. So we'd like this distribution to move around, and we'd like the backpropagation to tell us how that distribution should move around. And so, in addition to this idea of standardizing the activations at any point in the network, we also have to introduce this additional component in the paper, described here as the scale and shift.

And so basically, what we're doing is taking these normalized inputs and additionally scaling them by some gain and offsetting them by some bias, to get our final output from this layer. What that amounts to is the following: we are going to allow a batch normalization gain, bngain, to be initialized at just ones, and the ones will be of the shape 1 by n_hidden; and then we also will have bnbias, which will be torch.zeros, and it will also be of the shape 1 by n_hidden. The bngain will multiply the normalized activations, and the bnbias will offset them here.
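A sketch of the scale and shift, again using the lecture's variable names on a made-up batch:

```python
import torch

n_hidden = 200
bngain = torch.ones((1, n_hidden))       # multiplicative scale, starts at one
bnbias = torch.zeros((1, n_hidden))      # additive shift, starts at zero

hpreact = torch.randn(32, n_hidden) * 2  # stand-in pre-activations
hpreact = bngain * (hpreact - hpreact.mean(0, keepdim=True)) / hpreact.std(0, keepdim=True) + bnbias
```

Because bngain starts at ones and bnbias at zeros, at initialization this is exactly the plain standardization, and only during training do these parameters move the distribution around.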

So because the gain is initialized to ones and the bias to zeros, at initialization each neuron's firing values in this batch will be exactly unit gaussian, and we'll have nice numbers: no matter what the distribution of the incoming hpreact is, coming out it will be unit gaussian for each neuron, and that's roughly what we want, at least at initialization. And then during optimization we'll be able to backpropagate into bngain and bnbias and change them, so the network is given the full ability to do with this whatever it wants internally. Here we just have to make sure that we include these in the parameters of the neural net, because they will be trained with backpropagation. And then we should be able to train. So this is the batch normalization layer, here on a single line of code, and we're going to swing down here and also do the exact same thing at test time: similar to training time, we're going to normalize and then scale, and that's it. And we'll see in a second that we're actually going to change this a little bit, but for now I'm going to keep it this way.

So I'm just going to wait for this to converge. Okay, so I allowed the neural net to converge here, and when we scroll down we see that our validation loss here is 2.10, roughly, which I wrote down here, and we see that this is actually kind of comparable to some of the results we've achieved previously. Now, I'm not actually expecting an improvement in this case, and that's because we are dealing with a very simple neural net that has just a single hidden layer. In fact, in this very simple case of just one hidden layer, we were able to actually calculate what the scale of W should be to make these pre-activations already roughly gaussian, so the batch normalization is not doing much here.

But you might imagine that once you have a much deeper neural net that has lots of different types of operations, and there are also, for example, residual connections, which we'll cover, and so on, it will become very, very difficult to tune the scales of your weight matrices such that all the activations throughout the neural net are roughly gaussian. That's going to become intractable very quickly. Compared to that, it's going to be much, much easier to sprinkle batch normalization layers throughout the neural net.

So, in particular, it's common to look at every single linear layer, like this one, which multiplies by a weight matrix and adds a bias, or, for example, convolutions, which we'll cover later and which also perform basically a multiplication with a weight matrix, but in a more spatially structured format. It's customary to take the linear layer or convolutional layer and append a batch normalization layer right after it, to control the scale of these activations at every point in the neural net. So we'd be adding these batch norm layers throughout the neural net, and then this controls the scale of these activations throughout the neural net. It doesn't require us to do perfect mathematics and care about the activation distributions for all these different types of neural network Lego building blocks that you might want to introduce into your neural net, and it significantly stabilizes the training, and that's why these layers are quite

popular. Now, the stability offered by batch normalization actually comes at a terrible cost, and that cost is that, if you think about what's happening here, something terribly strange and unnatural is going on. It used to be that we had a single example feeding into the neural net, and then we calculated its activations and its logits, and this was a deterministic process, so you arrive at some logits for this example; and then, because of the efficiency of training, we suddenly started to use batches of examples, but those batches of examples were processed independently, and it was just an efficiency thing. But now, suddenly, in batch normalization, because of the normalization through the batch, we are coupling these examples mathematically, in both the forward pass and the backward pass of the neural net. So now the hidden state activations, hpreact, and your logits for any one input example are not just a function of that example and its input; they're also a function of all the other examples that happen to come for a ride in that batch, and these examples are sampled randomly.

and these examples are sampled randomly

and so what's happening is for example

when you look at each preact that's

going to feed into H the hidden State

activations for for example for for any

one of these input examples is going to

actually change slightly depending on

what other examples there are in a batch

and and depending on what other examples

H is going to change subtly and it's

going to like Jitter if you imagine

sampling different examples because the

statistics of the mean and the standard

deviation are going to be impacted

and so you'll get a Jitter for H and

And you'd think that this would be a bug, or something undesirable, but in a very strange way this actually turns out to be good in neural network training, as a side effect. The reason for that is that you can think of this as kind of like a regularizer, because what's happening is: you have your input and you get your h, and then, depending on the other examples, this is generating a bit of noise. What that does is it's effectively padding out any one of these input examples, and it's introducing a bit of entropy. Because of the padding out, it's actually kind of like a form of data augmentation, which we'll cover in the future: it's augmenting the input a little bit and jittering it, and that makes it harder for the neural net to overfit to these concrete, specific examples. So by introducing all this noise, it actually pads out the examples and regularizes the neural net, and that's why, deceivingly, as a second-order effect, this is actually a regularizer, and that has made it harder for us to remove the use of batch normalization.

Because, basically, no one likes this property that the examples in the batch are coupled mathematically, in the forward pass; it leads to all kinds of strange results, we'll go into some of that in a second as well, and it leads to a lot of bugs and so on. And so no one likes this property, and people have tried to deprecate the use of batch normalization and move to other normalization techniques that do not couple the examples of a batch: examples are layer normalization, instance normalization, group normalization, and so on, and we'll cover some of these later.

But basically, long story short, batch normalization was the first kind of normalization layer to be introduced; it worked extremely well, it happens to have this regularizing effect, and it stabilized training. People have been trying to remove it and move to some of the other normalization techniques, but it's been hard, because it just works quite well. And some of the reason it works quite well is, again, because of this regularizing effect, and because it is quite effective at controlling the activations and their distributions.

So that's kind of the brief story of batch normalization, and I'd like to show you one of the other weird outcomes of this coupling. Here's one of the strange outcomes that I only glossed over previously, when I was evaluating the loss on the validation set. Basically, once we've trained a neural net, we'd like to deploy it in some kind of a setting, and we'd like to be able to feed in a single individual example and get a prediction out from our neural net. But how do we do that, when our neural net now, in the forward pass, estimates the statistics of the mean and the standard deviation of a batch? The neural net expects batches as an input now; so how do we feed in a single example and get sensible results?

And so the proposal in the batch normalization paper is the following: what we would like to do is basically have a step, after training, that calculates and sets the batch norm mean and standard deviation a single time over the training set.

So I wrote this code here in the interest of time, and we're going to call it calibrating the batch norm statistics. Basically, what we do is use torch.no_grad, telling PyTorch that we will never call .backward() on any of this, so it can be a bit more efficient; we're going to take the training set, get the pre-activations for every single training example, and then, one single time, estimate the mean and standard deviation over the entire training set.
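A sketch of that calibration step; C, Xtr, W1, b1 here are stand-ins with the lecture's shapes, not the actual trained values:

```python
import torch

C = torch.randn(27, 10)                  # stand-in embedding table (27 characters, 10-d)
Xtr = torch.randint(0, 27, (1000, 3))    # stand-in training inputs (context length 3)
W1 = torch.randn(30, 200) * (5 / 3) / 30 ** 0.5
b1 = torch.randn(200) * 0.01

with torch.no_grad():                    # no graph needed: we'll never call .backward() here
    emb = C[Xtr]                         # (1000, 3, 10)
    embcat = emb.view(emb.shape[0], -1)  # (1000, 30)
    hpreact = embcat @ W1 + b1
    bnmean = hpreact.mean(0, keepdim=True)  # estimated a single time over the whole set
    bnstd = hpreact.std(0, keepdim=True)
```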

And then we're going to get bnmean and bnstd, and now these are fixed numbers, estimated once over the training set; and here, instead of estimating the statistics on the batch, we are going to instead use bnmean, and here we're just going to use bnstd. So at test time we fix these, clamp them, and use them during inference. And now you see that we get basically identical results; but the benefit we've gained is that we can now also forward a single example, because the mean and standard deviation are now fixed tensors.

That said, nobody actually wants to estimate this mean and standard deviation as a second stage after neural network training, because everyone is lazy, and so the batch normalization paper actually introduced one more idea: we can estimate the mean and standard deviation in a running manner, during the training of the neural net, and then we can simply have a single stage of training, and on the side of that training we are estimating the running mean and standard deviation. So let's see what that would look like.

Let me basically take the mean here that we are estimating on the batch and call it bnmeani, and then here this is bnstdi; the mean comes here and the std comes here. So far I've done nothing, I've just moved things around and created these extra variables for the mean and standard deviation and put them here; so far nothing has changed. But what we're going to do now is keep a running mean of both of these values during training. So let me swing up here and create a bnmean_running, which I'm going to initialize at zeros, and a bnstd_running, which I'll initialize at ones.

In the beginning, because of the way we initialized W1 and b1, hpreact will be roughly unit gaussian, so the mean will be roughly zero and the standard deviation roughly one; that's why I'm initializing these this way. But then here I'm going to update these. These running mean and standard deviation are not actually part of the gradient-based optimization; we're never going to derive gradients with respect to them, they're updated on the side of training.

And so what we're going to do here is say with torch.no_grad, telling PyTorch that the update here is not supposed to be building out a graph, because there will be no .backward(). This running mean is basically going to be 0.999 times its current value, plus 0.001 times this batch's value, and in the same way bnstd_running will receive a small update in the direction of what the current standard deviation is. As you see here, this update is outside of, and on the side of, the gradient-based optimization; it's not updated using gradient descent, it's just updated using a janky, smooth sort of running average.
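The running update itself looks roughly like this (the batch statistics here are made up; in the real training loop they would be the bnmeani and bnstdi computed on the current batch):

```python
import torch

bnmean_running = torch.zeros((1, 200))  # hpreact starts roughly zero-mean...
bnstd_running = torch.ones((1, 200))    # ...with roughly unit std, so these are good starting points

bnmeani = torch.randn(1, 200) * 0.1     # stand-in batch mean
bnstdi = 1 + torch.randn(1, 200) * 0.1  # stand-in batch std
with torch.no_grad():                   # this update lives outside the gradient-based optimization
    bnmean_running = 0.999 * bnmean_running + 0.001 * bnmeani
    bnstd_running = 0.999 * bnstd_running + 0.001 * bnstdi
```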

And so, while the network is training, and these pre-activations are shifting around during backpropagation, we are keeping track of the typical mean and standard deviation. Whereas before we estimated them with an explicit calibration stage, now I'm keeping track of them in a running manner, and what we're hoping for, of course, is that bnmean_running and bnstd_running are going to be very similar to the ones we calculated explicitly. That way we don't need a second stage, because we've sort of combined the two stages and put them side by side.

And this is also how it's implemented in the batch normalization layer in PyTorch: during training the exact same thing happens, and then later, at inference, it uses the estimated running mean and standard deviation of those hidden states. So let's wait for the optimization to converge, and hopefully the running mean and standard deviation are roughly equal to these two, and then we can simply use them here, and we don't need this stage of explicit calibration at the end.

Okay, so I'll rerun the explicit estimation, and then the bnmean from the explicit estimation is here, and the bnmean from the running estimation during the optimization, you can see, is not identical, but it's pretty close. In the same way, bnstd is this, and bnstd_running, as you can see, once again they are fairly similar values: not identical, but pretty close. And so then, here, instead of bnmean we can use bnmean_running, and instead of bnstd we can use bnstd_running, and hopefully the validation loss will not be impacted too much. Okay, so it's basically identical, and this way we've eliminated the need for the explicit stage of calibration, because we are doing it inline, over here.

Okay, so we're almost done with batch normalization; there are only two more notes that I'd like to make. Number one: I've skipped a discussion over what this plus epsilon is doing here. This epsilon is usually some small fixed number, for example 1e-5 by default, and what it's doing is basically preventing a division by zero, in the case that the variance over the batch is exactly zero. In that case we would normally have a division by zero, but because of the plus epsilon this becomes a small number in the denominator instead, and things are more well-behaved. So feel free to also add a plus epsilon of a very small number here; it doesn't substantially change the result. I'm going to skip it in our case, just because this is unlikely to happen in our very simple example here.

And the second thing I want you to notice is that we're being wasteful here, and it's very subtle, but right here, where we are adding the bias b1: these biases are now actually useless, because we're adding them to hpreact, but then we are calculating the mean for every one of these neurons and subtracting it, so whatever bias you add here is going to get subtracted right back out. And so these biases are not doing anything; in fact, they're being subtracted out, and they don't impact the rest of the calculation. If you look at b1.grad, it's actually going to be zero, because b1 is being subtracted out and doesn't have any effect.

And so, whenever you're using batch normalization layers, if you have any weight layers before, like a linear or a conv or something like that, you're better off coming here and just not using a bias: you don't want to use a bias, and then here you don't want to add it, because it's spurious. Instead, we have this batch normalization bias here, and that batch normalization bias is now in charge of the biasing of this distribution, instead of this b1 that we had here originally. So basically, the batch normalization layer has its own bias, and there's no need to have a bias in the layer before it, because that bias is going to be subtracted out anyway.
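In PyTorch terms, that motif with the bias disabled might look like this (a sketch, not the lecture's own network):

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(30, 200, bias=False),  # bias omitted: batch norm would subtract it out anyway
    nn.BatchNorm1d(200),             # this layer carries its own learnable bias (beta)
    nn.Tanh(),
)
out = block(torch.randn(32, 30))
```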

So that's the other small detail to be careful with. Sometimes it's not going to do anything catastrophic: this b1 will just be useless, it will never get any gradient, it will not learn, it will stay constant, and it's just wasteful, but it doesn't actually really impact anything otherwise.

Okay, so I rearranged the code a little bit, with comments, and I just wanted to give a very quick summary of the batch normalization layer. We are using batch normalization to control the statistics of the activations in the neural net. It is common to sprinkle batch normalization layers across the neural net, and usually we place them after layers that have multiplications, like, for example, a linear layer or a convolutional layer, which we may cover in the future. Batch normalization internally has parameters, for the gain and the bias, and these are trained using backpropagation. It also has two buffers: the running mean and the running standard deviation, and these are not trained using backpropagation; they are updated using this janky, running-average kind of update. So those are the parameters and the buffers of the batch norm layer. Then, really, what it's doing is: it's calculating the mean and the standard deviation of the activations feeding into the batch norm layer, over that batch; then it's centering that batch to be unit gaussian, and it's offsetting and scaling it by the learned bias and gain; and then, on top of that, it's keeping track of the mean and standard deviation of the inputs, and maintaining this running mean and standard deviation, which will later be used at inference, so that we don't have to re-estimate the mean and standard deviation all the time. In addition, that allows us to forward individual examples at test time. So that's the batch normalization layer; it's a fairly complicated layer, but this is what it's doing internally.
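To summarize in code: at training time the layer computes roughly the following, which we can check against PyTorch's own nn.BatchNorm1d (a sketch; note PyTorch uses the biased variance, and the default gain is one and bias zero):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 200) * 3 + 2  # a made-up batch of pre-activations
bn = nn.BatchNorm1d(200)
bn.train()                        # training mode: normalize with the batch statistics
manual = (x - x.mean(0, keepdim=True)) / torch.sqrt(x.var(0, keepdim=True, unbiased=False) + bn.eps)
print(torch.allclose(bn(x), manual, atol=1e-5))
```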

Now I wanted to show you a real example, so you can search for ResNet, which is a residual neural network; these are a type of deep neural network commonly used for image classification. Of course, we haven't covered ResNets in detail, so I'm not going to explain all the pieces of it, but for now just note that the image feeds into the ResNet at the top here, and there are many, many layers with a repeating structure, all the way to the predictions of what's inside that image. This repeating structure is made up of these blocks, and these blocks are sequentially stacked up in this deep neural network. Now, the code for the block that's used and repeated sequentially, in series, is this bottleneck block here.

And there's a lot here; this is all PyTorch, and of course we haven't covered all of it, but I want to point out some small pieces of it. Here, in the init, is where we initialize the neural net, so this code block here is basically the kind of stuff we're doing here: we're initializing all the layers. And in the forward, we are specifying how the neural net acts once you actually have the input, so this code here is along the lines of what we're doing here. And now these blocks are replicated and stacked up serially, and that's what a residual network is, roughly.

So notice what's happening here: conv1, conv2, conv3. These convolutional layers are basically the same thing as a linear layer, except convolutional layers are used for images, and so they have spatial structure. Basically, the linear multiplication and bias offset are done on overlapping patches of the input, instead of on the full input; so because these images have spatial structure, convolutions just basically do wx + b, but they do it on overlapping patches of the input. Otherwise, it's wx + b.

Then we have the norm layer, which by default here is initialized to be BatchNorm2d, so a two-dimensional batch normalization layer. And then we have a nonlinearity, like relu; so instead of tanh, here they use relu, but both are just nonlinearities, and you can use them relatively interchangeably. For very deep networks, relu typically, empirically, works a bit better.

So see the motif that's being repeated here: convolution, batch normalization, relu; convolution, batch normalization, relu; etc. And then here there's a residual connection, which we haven't covered yet. But basically that's the exact same pattern we have here: we have a weight layer, like a convolution or a linear layer, then batch normalization, and then tanh, which is a nonlinearity. Basically: a weight layer, a normalization layer, and a nonlinearity; and that's the motif you would be stacking up when you create these deep neural networks, exactly as is done here.

And one more thing I'd like you to notice is that here, when they are initializing the conv layers, like the 1-by-1 convolution: it's initializing an nn.Conv2d, which is a convolutional layer in PyTorch, and there's a bunch of keyword arguments here that I'm not going to explain yet, but you see how there's bias=False. That bias=False is there for exactly the same reason that bias is not used in our case: there's no point in using a bias there, and such a bias would be spurious, because after this weight layer there's a batch normalization, and the batch normalization subtracts that bias and then has its own bias; so there's no need to introduce these spurious parameters. It wouldn't hurt performance, it's just useless.

And so, because they have this motif of conv, batch norm, relu, they don't need a bias here, because there's a bias inside the batch norm. By the way, this example here is very easy to find: just search for "resnet pytorch", and it's this example here. So this is kind of the stock implementation of a residual neural network in PyTorch, and you can find it here; but of course I haven't covered many of these parts yet.

And I would also like to briefly descend into the definitions of these PyTorch layers and the parameters they take. Now, instead of a convolutional layer, we're going to look at a linear layer, because that's the one we're using here. This is a linear layer, and I haven't covered convolutions yet, but as I mentioned, convolutions are basically linear layers, except on patches. So a linear layer performs wx + b, except here they're calling the W the "weight" and the b the "bias"; the math is wx + b, very much like we did here.

To initialize this layer, you need to know the fan-in and the fan-out, and that's so that they can initialize this W; this is the fan-in and the fan-out, so they know how big the weight matrix should be. You also need to pass in whether or not you want a bias, and if you set it to False, then no bias will be used inside this layer. You may want to do that, exactly like in our case, if your layer is followed by a normalization layer such as batch norm; so this allows you to basically disable the bias.

Now, in terms of the initialization, if we swing down here, this is reporting the variables used inside this linear layer, and our linear layer here has two parameters: the weight and the bias; in the same way, they have a weight and a bias. And they're talking about how they initialize them by default: by default, PyTorch will initialize your weights by taking the fan-in and then doing one over fan-in square root, and then, instead of a normal distribution, they are using a uniform distribution. So it's very much the same thing, but they are using a gain of one instead of five over three; there's no gain being calculated here, the gain is just one. Otherwise, it's exactly one over the square root of fan-in, exactly as we have it.

so 1 over the square root of K is

the scale of the weights but when they

are drawing the numbers they're not

using a gaussian by default they're

using a uniform distribution by default

and so they draw uniformly from negative

square root of K to square root of K

but it's the exact same thing and the

same motivation with respect to

what we've seen in this lecture and the

reason they're doing this is if you have

a roughly gaussian input this will

ensure that out of this layer you will

have a roughly gaussian output and

you basically achieve that by scaling

the weights by one over the square root of fan in
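as a rough pure-python sketch of why this scaling matters (the fan in and neuron counts here are made up for illustration, this is not the lecture's code):

```python
import math, random

def layer_output_std(scale, fan_in=200, n_neurons=500, seed=0):
    """Push one roughly-gaussian input through n_neurons independent
    neurons whose weights are uniform in (-scale, +scale), and return
    the empirical standard deviation of the pre-activations."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
    outs = []
    for _ in range(n_neurons):
        w = [rng.uniform(-scale, scale) for _ in range(fan_in)]
        outs.append(sum(wi * xi for wi, xi in zip(w, x)))
    mean = sum(outs) / len(outs)
    return math.sqrt(sum((o - mean) ** 2 for o in outs) / len(outs))

bound = math.sqrt(1.0 / 200)      # pytorch-style bound: sqrt(k), k = 1/fan_in
print(layer_output_std(bound))    # stays on the order of 1
print(layer_output_std(1.0))      # roughly sqrt(fan_in) times larger
```

with the sqrt(k) bound the outputs stay at a controlled scale, and without any scaling they blow up with the fan in.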

and then the second thing is the batch

normalization layer so let's look at

what that looks like in pytorch

so here we have a one-dimensional batch

normalization layer exactly as we are

and there are a number of keyword

arguments going into it as well

so we need to know the number of

features uh for us that is 200 and that

is needed so that we can initialize

these parameters here the gain the bias

and the buffers for the running mean and running variance

then they need to know the value of

Epsilon here and by default this is 1e-5

you don't typically change

this too much then they need to know the

momentum and the momentum here as they

explain is basically used for these uh

running mean and running standard deviation

so by default the momentum here is 0.1

the momentum we are using here in this example is 0.001

and basically you may want to

change this sometimes and roughly

speaking if you have a very large batch

then typically what you'll see is that

when you estimate the mean and the

standard deviation for every single batch if it's

large enough you're going to get roughly

the same result and so therefore you can use a slightly higher momentum

but for a batch size as small as 32 the

mean and standard deviation here might

take on slightly different numbers

because there's only 32 examples we are

using to estimate the mean and standard

deviation so the value is changing

around a lot and if your momentum is 0.1

that might not be good enough for this value to settle

and converge to the actual mean and

standard deviation over the entire training set

and so basically if your batch size is

very small momentum of 0.1 is

potentially dangerous and it might make

it so that the running mean and standard

deviation is thrashing too much during

training and it's not actually converging
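the running-stat update is just an exponential moving average which can be sketched in plain python (the stream of noisy batch means here is simulated for illustration, not taken from the lecture's code):

```python
import random

random.seed(0)

momentum = 0.1            # pytorch's default for BatchNorm1d
running_mean = 0.0
true_mean = 3.0

# simulate noisy per-batch estimates of the mean (small batch -> noisy)
for _ in range(1000):
    batch_mean = true_mean + random.gauss(0.0, 0.5)
    # exponential moving average: keep (1 - momentum) of the old value
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean

print(running_mean)   # hovers near 3.0, jittering in proportion to momentum
```

a larger momentum tracks the batches faster but inherits more of their noise, which is exactly the thrashing risk with small batch sizes.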

then affine equals true determines

whether this batch normalization layer has

these learnable affine parameters the uh

the gain and the bias and this is almost

always kept at true I'm not actually

sure why you would want to change this

then track running stats is determining

whether or not the batch normalization layer of pytorch will be doing this

and one reason you may you may want to

skip the running stats is because you

may want to for example estimate them at

the end as a stage two like this and in

that case you don't want the batch

normalization layer to be doing all this

extra compute that you're not going to use

and finally we need to know which device

we're going to run this batch

normalization on a CPU or a GPU and what

Precision single Precision double

so that's the batch normalization layer

otherwise the link to the paper is there it's the

same formula we've implemented and

everything is exactly the same as we've done

okay so that's everything that I wanted

to cover for this lecture really what I

wanted to talk about is the importance

of understanding the activations and the

gradients and their statistics in neural

networks and this becomes increasingly

important especially as you make your

neural networks bigger larger and deeper

we looked at the distributions basically

at the output layer and we saw that if

you have too confident mispredictions

because the activations are too messed

up at the last layer you can end up with

these hockey stick losses and if you fix

this you get a better loss at the end of

training because your training is not

then we also saw that we need to control

the activations we don't want them to

you know squash to zero or explode to

infinity and because of that you can run

into a lot of trouble with all of these

non-linearities in these neural nets and

basically you want everything to be

fairly homogeneous throughout the neural net you want roughly gaussian

activations throughout the neural net

then I talked about okay if we want

roughly gaussian activations how do we

scale these weight matrices and biases

during initialization of the neural net

so that we don't get um you know so

everything is as controlled as possible

um so that gave us a large boost in

Improvement and then I talked about how

that strategy is not actually uh

possible for much much deeper neural

Nets because when you have much deeper

neural nets with lots of different types

of layers it becomes really really hard

to precisely set the weights and the

biases in such a way that the

activations are roughly uniform

so then I introduced the notion of the

normalization layer now there are many

normalization layers that people use in

practice batch normalization layer

normalization instance normalization

group normalization we haven't covered

most of them but I've introduced the

first one and also the one that I

believe came out first and that's called batch normalization

and we saw how batch normalization works

this is a layer that you can sprinkle

throughout your deep neural net and the

basic idea is if you want roughly

gaussian activations well then take your

activations and take the mean and standard

deviation and Center your data and you

can do that because the centering operation is differentiable

but and on top of that we actually had

to add a lot of bells and whistles and

complexities of the batch normalization

layer because now we're centering the

data that's great but suddenly we need

the gain and the bias and now those are trainable

and then because we are coupling all the

training examples now suddenly the

question is how do you do the inference

or to do to do the inference we need to

now estimate these mean and standard

deviation once over the entire training

set and then use those at inference but

then no one likes to do stage two so

instead we fold everything into the

batch normalization layer during

training and try to estimate these in

a running manner so that everything is simpler

um and as I mentioned no one likes this

layer it causes a huge amount of bugs

um and intuitively it's because it is

coupling examples in the forward pass of

a neural net and I've shot myself in the

foot with this layer over and over again

in my life and I don't want you to

uh so basically try to avoid it as much

as possible uh some of the other

alternatives to these layers are for

example group normalization or layer

normalization and those have become more

common uh in more recent deep learning

but we haven't covered those yet but

definitely batch normalization was very

influential at the time when it came out

in roughly 2015 because it was kind of

the first time that you could train

reliably uh much deeper neural nets and

fundamentally the reason for that is

because this layer was very effective at

controlling the statistics of the

so that's the story so far and um that's

all I wanted to cover and in the future

lectures hopefully we can start going

into recurrent neural nets and recurrent

neural nets as we'll see are just very

very deep networks because you

unroll the loop and when you actually

optimize these neural nets that's where a lot of this

analysis around the activation

statistics and all these normalization

layers will become very very important

for a good performance so we'll see that

okay so I lied I would like us to do one

more summary here as a bonus and I think

it's useful to have one more summary

of everything I've presented in this

lecture but also I would like us to

start by torchifying our code a little

bit so it looks much more like what you

would encounter in pi torch so you'll

see that I will structure our code into

these modules like a linear module and a

batchnorm module and I'm putting the code

inside these modules so that we can

construct neural networks very much like

we would construct them in pytorch and I

will go through this in detail and

then we will do the optimization loop as before

and then the one more thing that I want

to do here is I want to look at the

activation statistics both in the

forward pass and in the backward pass

and then here we have the evaluation and

so let me rewind all the way up here and

so here I'm creating a linear layer

you'll notice that torch.nn has lots of

different types of layers and one of

those layers is the linear layer

linear takes a number of input features

output features whether or not we should

have bias and then the device that we

want to place this layer on and the data

type so I will omit these two but

otherwise we have the exact same thing

we have the fan in which is number of

inputs fan out the number of outputs and

whether or not we want to use a bias and

internally inside this layer there's a

weight and a bias if you like it

it is typical to initialize the weight

using say random numbers drawn from a

gaussian and then here's the coming

initialization that we've discussed

already in this lecture and that's a

good default and also the default that I

believe pytorch uses and by default

the bias is usually initialized to zeros

now when you call this module this will

basically calculate W times X plus b if there is a bias

and then when you call dot

parameters on this module it will return

the tensors that are the parameters of this layer
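a pure-python sketch of such a linear module, using lists instead of tensors (an illustration of the idea only, not pytorch's actual implementation):

```python
import math, random

class Linear:
    def __init__(self, fan_in, fan_out, bias=True, seed=0):
        rng = random.Random(seed)
        # kaiming-style init: gaussian scaled by 1 / sqrt(fan_in)
        self.weight = [[rng.gauss(0.0, 1.0) / math.sqrt(fan_in)
                        for _ in range(fan_out)] for _ in range(fan_in)]
        # bias initialized to zeros, or absent entirely
        self.bias = [0.0] * fan_out if bias else None

    def __call__(self, x):
        # x is a plain list of length fan_in; compute x @ W + b
        out = [sum(x[i] * self.weight[i][j] for i in range(len(x)))
               for j in range(len(self.weight[0]))]
        if self.bias is not None:
            out = [o + b for o, b in zip(out, self.bias)]
        self.out = out
        return out

    def parameters(self):
        return [self.weight] + ([self.bias] if self.bias is not None else [])

layer = Linear(10, 5)
y = layer([1.0] * 10)
print(len(y))                    # 5 outputs
print(len(layer.parameters()))   # 2: the weight and the bias
```

passing bias=False simply drops the bias from both the forward pass and the parameter list, which is what you would do in front of a batch norm layer.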

now next we have the batch normalization

layer so I've written that here and this

is very similar to pytorch's nn.BatchNorm1d

so I'm kind of taking these three

parameters here the dimensionality the

Epsilon that we'll use in the division

and the momentum that we will use in

keeping track of these running stats the

running mean and the running variance

um now pytorch actually takes quite a

few more things but I'm assuming some of

their settings so for us affine will be

true that means that we will be using a

gamma and beta after the normalization

the track running stats will be true so

we will be keeping track of the running

mean and the running variance

our device by default is the CPU and the

data type by default is a float

so those are the defaults otherwise we

are taking all the same parameters in

this batchnorm layer so first I'm just

now here's something new there's a DOT

training which by default is true in

pytorch and nn modules also have

this attribute dot training and that's

because many modules and batch Norm is

included in that have a different

behavior of whether you are training

your neural net or whether you are

running it in an evaluation mode and

calculating your evaluation loss or

using it for inference on some test

and batchnorm is an example of this

because when we are training we are

going to be using the mean and the

variance estimated from the current

batch but during inference we are using

the running mean and running variants

and so also if we are training we are

updating mean and variants but if we are

testing then these are not being updated

and so this flag is necessary and by

default true just like in pytorch

now the parameters in batchnorm1d are the gamma and the beta

and then the running mean and running

variants are called buffers in pytorch

nomenclature and these buffers are

trained using exponential moving average

here explicitly and they are not part of

the back propagation and stochastic

gradient descent so they are not sort of

like parameters of this layer and that's

why when we call dot

parameters here we only return gamma and

beta we do not return the mean and the

variance this is trained sort of like

internally here on every forward pass using a moving average

now in a forward pass if we are training

then we use the mean and the variance

estimated by the batch let me plot the

we calculate the mean and the variance

now up above I was estimating the

standard deviation and keeping track of

the standard deviation here in the

running standard deviation instead of

running variance but let's follow the

paper exactly here they calculate the

variance which is the standard deviation

squared and that's what's kept track of

in the running variance instead of a running standard deviation

but those two would be very very similar

if we are not training then we use

running mean and variance we normalize

and then here I'm calculating the output

of this layer and I'm also assigning it

to an attribute called dot out

now dot out is something that I'm using

in our modules here this is not what you

would find in pytorch we are slightly

deviating from it I'm creating a DOT out

because I would like to very easily

maintain all those variables so that we

can create statistics of them and plot

them but pytorch nn modules will not have this

then finally here we are updating the

buffers using again as I mentioned

uh provide given the provided momentum

and importantly you'll notice that I'm

using the torch.no_grad context

manager and I'm doing this because if we

don't use this then pytorch will start

building out an entire computational

graph out of these tensors because it is

expecting that we will eventually call

dot backward but we are never going

to be calling dot backward on anything

that includes running mean and running

variance so that's why we need to use

this context manager so that we are not

sort of maintaining them using all this

additional memory so this will make it

more efficient and it's just telling

pytorch that there will be no backward we

just have a bunch of tensors we want to update
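the whole batchnorm layer described above can be sketched in plain python (lists instead of tensors, and no backprop, so this is only an illustration of the forward logic and the buffer updates, not the lecture's exact code):

```python
import math

class BatchNorm1d:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        # learnable affine parameters (trained with backprop in a real setup)
        self.gamma = [1.0] * dim
        self.beta = [0.0] * dim
        # buffers, maintained with a running update, not with backprop
        self.running_mean = [0.0] * dim
        self.running_var = [1.0] * dim

    def __call__(self, xs):
        # xs: a batch, i.e. a list of examples, each a list of length dim
        dim = len(xs[0])
        if self.training:
            mean = [sum(x[j] for x in xs) / len(xs) for j in range(dim)]
            var = [sum((x[j] - mean[j]) ** 2 for x in xs) / len(xs)
                   for j in range(dim)]
            # update the buffers with an exponential moving average
            m = self.momentum
            self.running_mean = [(1 - m) * rm + m * bm
                                 for rm, bm in zip(self.running_mean, mean)]
            self.running_var = [(1 - m) * rv + m * bv
                                for rv, bv in zip(self.running_var, var)]
        else:
            mean, var = self.running_mean, self.running_var
        self.out = [[self.gamma[j] * (x[j] - mean[j]) /
                     math.sqrt(var[j] + self.eps) + self.beta[j]
                     for j in range(dim)] for x in xs]
        return self.out

    def parameters(self):
        # only gamma and beta; the running buffers are not parameters
        return [self.gamma, self.beta]

bn = BatchNorm1d(2)
out = bn([[1.0, 10.0], [3.0, 20.0]])   # each column is normalized
```

note how the training flag selects batch statistics versus running statistics, and how only the affine parameters are exposed through parameters().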

okay now scrolling down we have the tanh

layer this is very very similar to

torch.tanh and it doesn't do too much it

just calculates tanh as you might expect

so that's torch.tanh and there are no parameters in this layer
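a matching pure-python sketch of the tanh layer (plain lists, for illustration only):

```python
import math

class Tanh:
    def __call__(self, x):
        # squash each value into (-1, 1); this layer has no parameters
        self.out = [math.tanh(v) for v in x]
        return self.out

    def parameters(self):
        return []

t = Tanh()
print(t([0.0, 100.0, -100.0]))   # large inputs land in the saturated tails
```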

but because these are layers

um it now becomes very easy to sort of

like stack them up into basically just a list

and we can do all the initializations

that we're used to so we have the

initial sort of embedding Matrix we have

our layers and we can call them

and then again with torch.no_grad

there's some initializations here so we

want to make the output softmax a bit

less confident like we saw and in

addition to that because we are using a

six layer multi-layer perceptron here so

you see how I'm stacking linear tanh

I'm going to be using the gain here and

I'm going to play with this in a second

so you'll see how when we change this

what happens to the statistics

finally the parameters are basically the

embedding Matrix and all the parameters

in all the layers and notice here I'm

using a double list comprehension if you

want to call it that but for every layer

in layers and for every parameter in

each of those layers we are just

stacking up all those parameters

now in total we have 46 000 parameters

and I'm telling pytorch that all of them require gradients

then here we have everything we are

actually mostly used to we are sampling

batch we are doing forward pass the

forward pass now is just a linear

application of all the layers in order

followed by the cross entropy
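the pattern of calling the layers in order can be sketched with plain callables standing in for the real layers (toy stand-ins, not the lecture's modules):

```python
# toy "layers": any callables, applied in order, exactly like
# `for layer in layers: x = layer(x)` in the training loop
layers = [
    lambda x: [v * 2.0 for v in x],   # stand-in for one layer
    lambda x: [v + 1.0 for v in x],   # stand-in for another layer
]

def forward(x, layers):
    for layer in layers:
        x = layer(x)
    return x

print(forward([1.0, 2.0], layers))   # [3.0, 5.0]
```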

and then in the backward path you'll

notice that for every single layer I now

iterate over all the outputs and I'm

telling pytorch to retain the gradient

and then here we are already used to all

all the gradients set to none do the

backward to fill in the gradients do an

update using stochastic gradient descent and

then track some statistics and then I am

going to break after a single iteration

now here in this cell in this diagram

I'm visualizing the histogram the

histograms of the forward pass

activations and I'm specifically doing

so iterating over all the layers except

for the very last one which is basically

if it is a tanh layer and I'm

looking at the tanh layers just because they

have a finite output negative one to one

and so it's very easy to visualize here

so you see negative one to one and it's

a finite range and easy to work with

I take the out tensor from that layer

into T and then I'm calculating the mean

the standard deviation and the percent

and the way I Define the percent

saturation is that t dot absolute value

is greater than 0.97 so that means we

are here at the tails of the tanh and

remember that when we are in the Tails

of the tanh that will actually stop

gradients so we don't want this to be too high
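the saturation measurement can be sketched in plain python (the pre-activation distributions here are made up to show the two regimes, they are not the lecture's data):

```python
import math, random

def tanh_saturation(preacts, threshold=0.97):
    """Fraction of tanh outputs sitting in the tails, where the
    local gradient (1 - t**2) is nearly zero."""
    ts = [math.tanh(a) for a in preacts]
    return sum(abs(t) > threshold for t in ts) / len(ts)

random.seed(0)
calm = [random.gauss(0.0, 1.0) for _ in range(10000)]   # well-scaled preacts
wild = [random.gauss(0.0, 5.0) for _ in range(10000)]   # preacts way too large
print(tanh_saturation(calm))   # a few percent
print(tanh_saturation(wild))   # most units saturated
```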

here I'm calling torch.histogram and

then I am plotting this histogram so

basically what this is doing is that

every different type of layer and they

all have a different color we are

looking at how many values in these tensors take on any

of the values Below on this axis here

so the first layer is fairly saturated

here at 20 percent so you can see that it's got

Tails here but then everything sort of

stabilizes and if we had more layers

here it would actually just stabilize at

around the standard deviation of about

0.65 and the saturation would be roughly five percent

and the reason that this stabilizes and

gives us a nice distribution here is

because gain is set to 5 over 3.

now here this gain you see that by

default we initialize with one over

square root of fan in but then here

during initialization I come in and I

iterate all the layers and if it's a

linear layer I boost that by the gain

now we saw that one so basically if we

just do not use a gain then what happens

if I redraw this you will see that

the standard deviation is shrinking and

the saturation is coming to zero and

basically what's happening is the first

layer is you know pretty decent but then

further layers are just kind of like

shrinking down to zero and it's

happening slowly but it's shrinking to

zero and the reason for that is when you

just have a sandwich of linear layers

alone then initializing our

weights in this manner we saw previously

would have conserved the standard deviation

but because we have these tanh layers interspersed

in there the tanh layers are squashing

functions and so they take your

distribution and they slightly squash it

and so some gain is necessary to keep

expanding it to fight the squashing

so it just turns out that 5 over 3 is a

good value so if we have something too

small like one we saw that things will

come towards zero but if it's something too large

well let me do something a bit more

extreme so it's a bit more visible

okay so we see here that the saturations

are going to be way too large

okay so three would create way too saturated activations

so five over three is a good setting for

a sandwich of linear layers with tanh

activations and it roughly stabilizes

the standard deviation at a reasonable

now honestly I have no idea where five

over three came from in pytorch when we were looking at the kaiming

initialization I see empirically that it

stabilizes this sandwich of linear and tanh layers

and that the saturation is in a good

range but I don't actually know if this

came out of some math formula I tried

searching briefly for where this comes

from but I wasn't able to find anything

but certainly we see that empirically

these are very nice ranges our

saturation is roughly five percent which

is a pretty good number and uh this is a

good setting of The gain in this context
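the effect of the gain on a linear tanh stack can be sketched in plain python (a toy single-example simulation with made-up sizes, not the lecture's six-layer network):

```python
import math, random

def final_layer_std(gain, fan_in=100, n_layers=6, seed=0):
    """Push one gaussian input vector through a stack of (linear, tanh)
    layers whose weights have std gain / sqrt(fan_in), and return the
    std of the activations after the last layer."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
    for _ in range(n_layers):
        w = [[rng.gauss(0.0, gain / math.sqrt(fan_in))
              for _ in range(fan_in)] for _ in range(fan_in)]
        x = [math.tanh(sum(w[j][i] * x[i] for i in range(fan_in)))
             for j in range(fan_in)]
    mean = sum(x) / len(x)
    return math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))

print(final_layer_std(1.0))      # shrinks layer by layer
print(final_layer_std(5 / 3))    # stays roughly constant
```

with gain 1 every tanh squashes the distribution a little more, while the 5 over 3 gain keeps re-expanding it to fight the squashing.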

similarly we can do the exact same thing

with the gradients so here is a very

same loop if it's a tanh layer but instead of

taking the layer dot out I'm taking

the grad and then I'm also showing the

mean and the standard deviation and I'm

plotting the histogram of these values

and so you'll see that the gradient

distribution is fairly reasonable and in

particular what we're looking for is

that all the different layers in this

sandwich have roughly the same gradient

things are not shrinking or exploding

so we can for example come here and we

can take a look at what happens if this

gain was way too small so this was 0.5

first of all the activations are

shrinking to zero but also the gradients

are doing something weird the gradient

started out here and then they're expanding out

and similarly if we for example have a gain that's too high

then we see that also the gradients have

there's some asymmetry going on where as

you go into deeper and deeper layers the

activations are also changing and so

that's not what we want and in this case

we saw that without the use of batch normalization

as we are going through right now we

have to very carefully set those gains

to get nice activations in both the

forward pass and the backward pass now

before we move on to batch normalization I would also like to take

a look at what happens when we have no

tanh units here so erasing all the tanh layers

but keeping the gain at 5 over 3 we now

have just a giant linear sandwich so

let's see what happens to the activation statistics

as we saw before the correct gain here

is one that is the standard deviation

1.667 is too high and so what's going to happen is the following

I have to change this to be linear

because there's no more tanh layers

and let me change this to linear as well

um the activations started out on the

blue and have by layer 4 become very

diffuse so what's happening to the

and with the gradients on the top layer

the gradient statistics

are the purple and then they diminish as

you go down deeper in the layers and so

basically you have an asymmetry like in

the neural net and you might imagine

that if you have very deep neural

networks say like 50 layers or something

like that this just this is not a good

place to be so that's why before batch

normalization this was incredibly tricky

to set in particular if this is too

large of a gain this happens and if it's too small of a gain

then this happens also the opposite of

that basically happens here we have a um

shrinking and a diffusion depending on

which direction you look at it from

and so certainly this is not what you

want and in this case the correct

setting of The gain is exactly one

just like we're doing at initialization

the statistics for the forward and the

backward pass are well behaved and so

the reason I want to show you this is

that basically getting neural nets to

train before these normalization layers

and before the use of advanced

optimizers like atom which we still have

to cover and residual connections and so

on training neural nets basically looked like

this it's like a total Balancing Act you

have to make sure that everything is

precisely orchestrated and you have to

care about the activations and the

gradients and their statistics and then

maybe you can train something but it was

basically impossible to train very deep

networks and this is fundamentally the

reason for that you'd have to be very

very careful with your initialization

um the other point here is you might be

asking yourself by the way I'm not sure

if I covered this why do we need these

tanh layers at all why do we include them

and then have to worry about the gain

and uh the reason for that of course is

that if you just have a stack of linear layers

then certainly we're getting very easily

nice activations and so on but this is

just a massive linear sandwich and it

turns out that it collapses to a single linear layer in terms of its

representation power so if you were to

plot the output as a function of the

input you're just getting a linear

function no matter how many linear

layers you stack up you still just end

up with a linear transformation all the

W X Plus B's just collapse into a large

WX plus b with a slightly different W and b

um but interestingly even though the

forward pass collapses to just a linear

layer because of back propagation and

the Dynamics of the backward pass the

optimization really is not identical

you actually end up with all kinds of

Dynamics in the backward pass because of

the uh the way the chain rule is

calculating it and so optimizing a

linear layer by itself and optimizing a

sandwich of ten linear layers in

both cases those are just a linear

transformation in the forward pass but

the training Dynamics would be different

and there's entire papers that analyze

in fact like infinitely layered linear

layers and so on and so there's a lot of

things too that you can play with there

but basically the tanh nonlinearities allow us to

turn this sandwich from just a linear

function into a neural network that can

in principle approximate any arbitrary function

okay so now I've reset the code to use

the linear tanh sandwich like before and

I reset everything so the gain is five

over three we can run a single step of

optimization and we can look at the

activation statistics of the forward and backward pass

but I've added one more plot here that I

think is really important to look at

when you're training your neural nets

and to consider and ultimately what

we're doing is we're updating the

parameters of the neural net so we care

about the parameters and their values

so here what I'm doing is I'm actually

iterating over all the parameters

um restricting it to the two-dimensional

parameters which are basically the

weights of these linear layers and I'm

skipping the biases and I'm skipping the

um gammas and the betas in the batchnorm

but you can also take a look at those as

well but what's happening with the

weights is um instructive by itself

so here we have all the different

weights their shapes so this is the

embedding layer the first linear layer

all the way to the very last linear

layer and then we have the mean the

standard deviation of all these parameters

the histogram and you can see that it

actually doesn't look that amazing so

there's some trouble in Paradise even

though these gradients looked okay

there's something weird going on here

I'll get to that in a second and the

last thing here is the gradient to data

ratio so sometimes I like to visualize

this as well because what this gives you

a sense of is what is the scale of the

gradient compared to the scale of the

actual values and this is important

because we're going to end up taking a

that is the learning rate times the

gradient onto the data and so if the

gradient has too large of a magnitude if

the numbers in there are too large

compared to the numbers in data then you'd be in trouble

but in this case the gradient to data ratios are

low numbers so the values inside

grad are 1000 times smaller than the

values inside data in these weights for the most part
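the gradient to data ratio can be sketched in plain python (the weight and gradient values here are hypothetical, chosen only to illustrate the scale comparison):

```python
import math

def std(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

# hypothetical numbers: a weight tensor and its gradient
data = [0.10, -0.20, 0.30, -0.10, 0.25]
grad = [0.0001, -0.0003, 0.0002, -0.0001, 0.0002]

ratio = std(grad) / std(data)
print(math.log10(ratio))   # roughly -3: gradients ~1000x smaller than data
```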

now notably that is not true about the

last layer and so the last layer

actually here the output layer is a bit

of a troublemaker in the way that this

is currently arranged because you can

the last layer here in pink takes on

values that are much larger than some of

the other values inside the neural net so the standard

deviations are roughly 1e-3

throughout except for the

last layer which actually has roughly

1e-2 as a standard deviation

of gradients and so the gradients on the

last layer are currently about 100 times

greater sorry 10 times greater than all

the other weights inside the neural net

and so that's problematic because in the

simple stochastic gradient descent setup

you would be training this last layer

about 10 times faster than you would be training the other layers

now this actually like kind of fixes

itself a little bit if you train for a

bit longer so for example if i is greater

than 1000 only then do a break

let me reinitialize and then let me do

it 1000 steps and after 1000 steps we

okay so you see how the neurons are

saturating a bit and we can also

look at the backward pass but otherwise

they look good they're about equal and

there's no shrinking to zero or exploding to infinity

and you can see that here in the weights

things are also stabilizing a little bit

so the Tails of the last pink layer are

actually coming down coming in during

but certainly this is a little bit

troubling especially if you are using

a simple stochastic gradient descent instead of a modern optimizer like Adam

now I'd like to show you one more plot

that I usually look at when I train

neural networks and basically the

gradient to data ratio is not actually

that informative because what matters at

the end is not the gradient to data

ratio but the update to the data ratio

because that is the amount by which we

will actually change the data in these tensors

so coming up here what I'd like to do is

I'd like to introduce a new update to data ratio

it's going to be a list and we're going

to build it out every single iteration

and here I'd like to keep track of

without any gradients I'm comparing the

update which is learning rate times the gradient

that is the update that we're going to

apply in stochastic gradient descent to all the

parameters and then I'm taking the

basically standard deviation of the

update we're going to apply and divide

it by the actual content the data of

that parameter and its standard deviation

so this is the ratio of basically how

great are the updates to the values in these tensors

then we're going to take a log of it and

actually I'd like to take a log10

just so it's a nicer visualization so

we're going to be basically looking at the exponent

of this division here and then dot item

to pop out the float and we're going to

be keeping track of this for all the

parameters and adding it to this ud list
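the update to data ratio can be sketched the same way in plain python (hypothetical parameter and gradient values and a made-up learning rate, for illustration only):

```python
import math, random

def std(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

random.seed(0)
lr = 0.1
# hypothetical parameter tensor and its gradient
p_data = [random.gauss(0.0, 0.2) for _ in range(1000)]
p_grad = [random.gauss(0.0, 0.002) for _ in range(1000)]

# size of the update relative to the size of the data, on a log10 scale
ud = math.log10(std([lr * g for g in p_grad]) / std(p_data))
print(ud)   # around -3 is a healthy zone; far below means learning too slowly
```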

so now let me reinitialize and run a

we can look at the activations the

gradients and the parameter gradients as

we did before but now I have one more

now what's happening here is we're iterating

over all the parameters and I'm

constraining it again like I did here to

so the number of dimensions in these

tensors is two and then I'm basically

plotting all of these update ratios over time

I plot those ratios and you can see that

during initialization they take on certain values

and then these updates sort of like

start stabilizing usually during training

then the other thing that I'm plotting

here is I'm plotting here like an

approximate value that is a Rough Guide

for what it roughly should be and it

should be like roughly 1e-3

and so that means that basically there are

some values in this tensor and

the updates to them at every single

iteration are no more than roughly one thousandth

of the actual like magnitude in those

tensors uh if this was much larger like

if the log of this was like say negative

one this is actually updating those

values quite a lot they're undergoing a lot of change

but the reason that the

final layer here is an outlier is

because this layer was artificially

shrunk down to keep the softmax unconfident

you see how we multiply The Weight by

uh in the initialization to make the

last layer prediction less confident

that made that artificially made the

values inside that tensor way too low

and that's why we're getting temporarily

a very high ratio but you see that that

stabilizes over time once that weight

starts to learn

but basically I like to look at the

evolution of this update ratio for all

my parameters usually and I like to make

sure that it's not too much above 1e-3

so around negative three on this log plot

if it's below negative three usually

that means that the parameters are not training fast enough

so if our learning rate was very low

let's initialize and then let's actually

do a very low learning rate

if your learning rate is way too low

this plot will typically reveal it

so you see how all of these updates are

way too small so the size of the update is very small compared

in magnitude to the size of the numbers

in that tensor in the first place so

this is a symptom of training way too slow

so this is another way to sometimes set

the learning rate and to get a sense of

what that learning rate should be and

ultimately this is something that you would keep track of

if anything the learning rate here is a

little bit on the higher side because

um we're above the black line of

negative three we're somewhere around

negative 2.5 it's like okay and uh but

everything is like somewhat stabilizing

and so this looks like a pretty decent

setting of the learning rate and so on but this is

but this is something to look at, and when things are miscalibrated you will see it very quickly. everything looks pretty well behaved right now, but just as a comparison, when things are not properly calibrated, what does that look like? let me come up here,

and let's say that, for example, we forgot to apply this fan-in normalization, so the weights inside the linear layers are just sampled from a gaussian in all of those stages.
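for contrast, here's a tiny sketch (toy sizes of my own, not the lecture's network) of what that fan-in normalization buys you: dividing the gaussian weights by sqrt(fan_in) keeps the pre-activation standard deviation near 1, while plain gaussian weights blow it up to roughly sqrt(fan_in), which is what drives the tanh into saturation:

```python
import torch

torch.manual_seed(1)
fan_in, fan_out = 200, 100
x = torch.randn(1000, fan_in)  # unit-gaussian inputs

w_plain = torch.randn(fan_in, fan_out)                 # forgot the normalization
w_scaled = torch.randn(fan_in, fan_out) / fan_in**0.5  # fan-in normalized

print((x @ w_plain).std().item())   # roughly sqrt(200) ~ 14, tanh will saturate
print((x @ w_scaled).std().item())  # roughly 1, tanh stays in its active range
```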

what happens to our network, and how do we notice? well, the activation plot will tell you: whoa, your neurons are way too saturated, the gradients are going to be all messed up, and the histograms for these weights are
going to be all messed up as well,

with a lot of asymmetry, and then if we look here, I suspect it's all going to be pretty messed up too. so you see there's a lot of discrepancy in how fast these layers are learning, and some of them are learning way too fast: negative one, negative 1.5, those are very large values of this ratio; again, you should be somewhere around negative three, and not much above that. so this is how miscalibrations of your neural nets are going to manifest, and these kinds of plots bring those problems to your attention so that you can address them. okay, so far we've seen that when

we have this linear-tanh sandwich, we can actually precisely calibrate the gains and make the activations, the gradients, the parameters, and the updates all look pretty decent; but it definitely feels a little bit like balancing a pencil on your finger, and that's because this gain has to be very precisely calibrated. so let's introduce batch normalization layers into the mix, and let's see how that helps.

I'm going to take the BatchNorm1d module and start placing it inside the network, and as I mentioned before, the standard, typical place you would put it is between the linear layer and the nonlinearity, so right after the linear layer but before the nonlinearity; but people have definitely played with that, and in fact you can get very similar results even if you place it after the nonlinearity.

and the other thing that I wanted to mention is that it's totally fine to also place it at the end, after the last linear layer and before the loss function, so this is potentially fine as well, and in this case this would be the output layer. now, because the last layer is a batchnorm, we would not be changing the weight of the linear layer to make the softmax less confident; we'd be changing gamma instead, because gamma, remember, in the batchnorm is the variable that multiplicatively interacts with the output of that normalization.
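as a sketch (using torch.nn modules and layer sizes I've made up rather than the lecture's hand-built classes), that sandwich with a batchnorm at the very end would look something like this, shrinking the final batchnorm's gamma (called `weight` in PyTorch) instead of the linear weights:

```python
import torch
import torch.nn as nn

n_embd, n_hidden, vocab_size = 30, 100, 27  # assumed sizes, for illustration

model = nn.Sequential(
    # linear -> batchnorm -> tanh sandwiches; bias=False because the
    # batchnorm's beta makes the linear layer's bias redundant
    nn.Linear(n_embd, n_hidden, bias=False), nn.BatchNorm1d(n_hidden), nn.Tanh(),
    nn.Linear(n_hidden, n_hidden, bias=False), nn.BatchNorm1d(n_hidden), nn.Tanh(),
    # final linear layer followed by a batchnorm that produces the logits
    nn.Linear(n_hidden, vocab_size, bias=False), nn.BatchNorm1d(vocab_size),
)

with torch.no_grad():
    # shrink gamma of the last batchnorm so the initial softmax is unconfident
    model[-1].weight *= 0.1

x = torch.randn(32, n_embd)  # a dummy batch
logits = model(x)
print(logits.std().item())   # small initial logits -> diffuse softmax
```

note the bias=False choice: any bias a linear layer adds would immediately be subtracted out by the following batchnorm, whose own beta parameter plays that role instead.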

so we can initialize this sandwich now, and we can train, and we can see that the activations are of course going to look very good; they are necessarily going to look good, because now before every single tanh layer there is a normalization in the batchnorm. so this, unsurprisingly, all looks pretty good: it's going to be a standard deviation of roughly 0.65, about two percent saturation, and a roughly equal standard deviation throughout all the layers, so everything looks very uniform.

the gradients look good, the weights look good, and their distributions also look pretty reasonable; we're going above negative three a little bit, but not by too much, so all the parameters are training at roughly the same rate. but now what we've gained is that we are much less brittle with respect to the gain of these linear layers: for example, I can make the gain much smaller than what we had before,

but as we'll see, the activations will actually be exactly unaffected, and that's because of, again, this explicit normalization. the gradients are going to look okay, the weight gradients are going to look okay, but the updates will actually change: even though the forward and backward pass to a very large extent look okay, because of the backward pass of the batchnorm, and how the scale of the incoming activations interacts with the batchnorm in its backward pass, the scale of the updates on these parameters is affected, so the gradients of these weights are affected.

so we still don't get a completely free pass to use arbitrary weights here, but everything else is significantly more robust in terms of the forward pass, the backward pass, and the weight gradients; it's just that you may have to retune your learning rate if you are sufficiently changing the scale of the activations that are coming into the batchnorms. so here, for example, we changed the gains of these linear layers to be greater, and we're seeing that the updates are coming out lower as a result.

and then finally, if we are using batchnorms, we don't actually necessarily (let me reset this to one, so there's no gain) we don't necessarily even need to normalize by fan-in sometimes: if I take out the fan-in, so these weights are just sampled from a gaussian, we'll see that because of the batchnorm this will actually be relatively well behaved.
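here's a quick illustration of why the forward pass tolerates this (a toy example of my own, with made-up sizes): batchnorm standardizes each feature over the batch, so scaling the incoming weights up or down leaves the normalized output essentially unchanged:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 100, 200
x = torch.randn(32, fan_in)  # a dummy batch

w = torch.randn(fan_in, fan_out)  # no fan-in normalization
bn = nn.BatchNorm1d(fan_out)

out_raw = bn(x @ w)                     # pre-activations had std ~10
out_scaled = bn(x @ (w / fan_in**0.5))  # same weights, fan-in normalized

# batchnorm wipes out the scale difference in the forward pass
print(out_raw.std().item(), out_scaled.std().item())  # both ~1
print(torch.allclose(out_raw, out_scaled, atol=1e-4))  # True
```

the backward pass is a different story, as discussed next: the gradient flowing into w does still depend on its scale, which is why the update-to-data ratios shift even though the activations don't.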

the activations of course look good in the forward pass, the gradients look good, and the backward pass and the weight updates look okay, with a little bit of fat tails in some of the layers, and this looks okay as well. but as you can see, we're significantly below negative three, so we'd have to bump up the learning rate of this batchnorm neural net so that we are training more properly, and in particular, looking at this, it roughly looks like we have to 10x the learning rate.

so if we come here and change the learning rate accordingly, then we'll see that everything still of course trains, and now we are roughly where we want to be, and we expect this to be an okay training run.

so, long story short, we are significantly more robust to the gain of these linear layers and to whether or not we have to apply the fan-in normalization, and we can change the gain; but we do actually have to worry a little bit about the update scales and making sure that the learning rate is properly calibrated here. but the activations of the forward pass, the backward pass, and the updates are all looking significantly more well behaved, except for the global scale that potentially has to be adjusted here.

okay, so now let me summarize. there are three things I was hoping to achieve with this section. number one, I wanted to introduce you to batch normalization, which is one of the first modern innovations that we're looking into that helped stabilize very deep neural networks and their training, and I hope you understand how batch normalization works and how it would be used in a neural network.

number two, I was hoping to pytorch-ify some of our code and wrap it up into these modules, like Linear, BatchNorm1d, Tanh, etc. these are layers, or modules, and they can be stacked up into neural nets like lego building blocks, and these layers actually exist in pytorch: if you import torch.nn, then, the way I've constructed things, you can simply use pytorch by prepending nn. to all of these layer names, and actually everything will just work, because the API that I've developed here is identical to the API that pytorch uses, and the implementation is also, as far as I'm aware, basically identical.
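for example (a minimal sketch; the sizes here are placeholders, not the lecture's configuration), the same kind of stack written directly with the torch.nn layers:

```python
import torch
import torch.nn as nn

# the same building blocks, but taken straight from torch.nn
layers = [
    nn.Linear(30, 100, bias=False), nn.BatchNorm1d(100), nn.Tanh(),
    nn.Linear(100, 27),
]

x = torch.randn(32, 30)  # a dummy batch of 30-dimensional inputs
for layer in layers:
    x = layer(x)  # forward through the stack, just like the hand-built version
print(x.shape)    # torch.Size([32, 27])
```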

and number three, I tried to introduce you to the diagnostic tools that you would use to understand whether your neural network is in a good state dynamically. so we are looking at the statistics and histograms of the forward pass activations and the backward pass gradients, and then we're also looking at the weights that are going to be updated as part of stochastic gradient descent, and we're looking at their means, their standard deviations, and also the ratio of gradients to data, or even better, the ratio of updates to data.

and we saw that typically we don't actually look at this as a single snapshot frozen in time at some particular iteration; typically people look at this over time, just like I've done here, and they look at these update-to-data ratios and make sure everything looks okay. in particular, negative 3, or basically negative 3 on the log scale, is a good rough heuristic for what you want this ratio to be: if it's way too high, then probably the learning rate or the updates are a little too big, and if it's way too small, then the learning rate is probably too small.

so those are some of the things you may want to play with when you try to get your neural network to work very well.

now, there's a number of things I did not try to achieve. I did not try to beat our previous performance, as an example, by introducing the batchnorm layer. I used the learning rate finding mechanism that I've described before, I tried to train a batchnorm neural net, and I actually ended up with results that are very, very similar to what we've seen before, and that's because our performance now is not bottlenecked by the optimization, which is what batchnorm is helping with.

the performance at this stage is instead bottlenecked, I suspect, by the context length: currently we are taking three characters to predict the fourth one, and I think we need to go beyond that, and we need to look at more powerful architectures, like recurrent neural networks and transformers, in order to further push the log probabilities that we're achieving.

and I also did not try to give a full explanation of all of these activations, the gradients, the backward pass, and the statistics of all these gradients, and so you may have found some of the parts here unintuitive; maybe you're slightly confused about, okay, if I change the gain here, how come we need a different learning rate? I didn't go into the full detail, because you'd have to actually look at the backward pass of all these different layers and get an intuitive understanding of how that works, and I did not go into that in this lecture. the purpose really was just to introduce you to the diagnostic tools and what they look like, but there's still a lot of work remaining on the intuitive level to understand the initialization, the backward pass, and how all of that interacts. but you shouldn't

feel too bad, because, honestly, we are getting to the cutting edge of where the field is: we certainly haven't, I would say, solved initialization, and we haven't solved backpropagation, and these are still very much active areas of research; people are still trying to figure out the best way to initialize these networks, and so on, so none of this is really solved, and we don't really have all the answers to all of these questions. but at least, you know, we're making progress, and at least we have some tools to tell us whether or not things are on the right track for now. I think we've made positive progress in this lecture, and I hope you enjoyed that.