JASON MAYES: Now you know what pre-trained models are, how do you select one to use in a real-world project? In this section, you'll walk through a hypothetical, real-world scenario, breaking down the problem into the following steps. First, what exactly is the problem you're trying to solve? Second, once you know the problem, what models do you know of that could assist you? And finally, considering all identified usable models, how do you select one to deliver the best solution?
So first up, imagine a real customer has contacted you and states, "I'd like to create a web app that can detect and alert me when an intruder is in my garden, but does not send any of my images to the Cloud." One thing you can note right away is that the customer wants a web app. This means TensorFlow.js could be a great fit, as you know it can run both client side in the browser and also server side via Node.js in the Cloud. However, privacy is also of key importance, and in this case, running TensorFlow.js client side in the browser satisfies that, as no images would ever leave the device.
Diving deeper into the requirements of the app, it will need to detect when an intruder is present. Upon clarifying with the customer what that really means, that is, what counts as an intruder, it seems you just need to detect humans, and not cats, dogs, or other animals that may also enter. This means you'll want to make a system that knows when a human is in a given frame of the video stream.
All right, so you've already seen a few models that might help here. In fact, can you guess which ones could detect a human? Feel free to pause the video here and note down which ones you might consider before you continue to see the answer. I'll wait for you to do that. All right, let's check your options from these known pre-trained models.
Technically speaking, all of the listed vision and human body models could be used to detect the presence of a person. And yes, that includes the face and hand models too. It's also worth noting that, depending on the requirements, sound could also be used to detect the presence of talking, even if the person's not visible. It's often quite surprising how many ways there are to approach the same problem.
Now let's assume, after checking with the customer, that sound is not suitable, as other sounds could occur nearby and they don't want to accidentally detect that. In that case, you've got a choice of the following: image classification, object detection, body segmentation, pose estimation, face-landmark detection, or even hand pose estimation, if the hands are in the shot. Let's break these down to see the pros and cons of each.
In these situations, it's helpful to fill out a table of the features that matter to the task at hand. Some common key things to always check include the inference speed, which is the time it takes from sending new data as input to the model to getting an output back.
Lower times, typically measured in milliseconds, mean faster performance, which can also sometimes be expressed as frames per second. For example, 10 frames per second means it takes 100 milliseconds to run once, as you could run it 10 times before a second had elapsed. This is important when working with real-time applications, such as video, as often video will need to run at 24 frames per second or greater.
If the model runs slower than this, some video frames will have to be skipped for classification, which might be acceptable, depending on your use case. But typically, if the classification rate drops below 10 to 15 frames per second, the application may start to feel laggy, especially if you're trying to deal with fast-moving objects, which would lead to a poor user experience.
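To make the frame-skipping trade-off concrete, here is a minimal sketch; the function name and return shape are my own for illustration, not part of any TensorFlow.js API:

```javascript
// Sketch: given a video frame rate and a measured inference time,
// work out how many source frames get skipped between classifications.
// All names here are illustrative placeholders.
function frameSkipPlan(videoFps, inferenceMs) {
  // Frames that elapse while one inference runs (round up to whole frames).
  const framesPerInference = Math.ceil((inferenceMs / 1000) * videoFps);
  const skipped = Math.max(0, framesPerInference - 1);
  const effectiveFps = videoFps / framesPerInference;
  return { skipped, effectiveFps };
}

// A model taking 100 ms against 24 fps video: classify every 3rd frame.
console.log(frameSkipPlan(24, 100)); // { skipped: 2, effectiveFps: 8 }
```

At 8 effective frames per second, you are just below the 10 to 15 frames-per-second comfort zone mentioned above, which is exactly the kind of check this arithmetic lets you make before committing to a model.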
You might also want to check the amount of memory the model uses, both in terms of raw file size, typically measured in megabytes, where less is better if you want the page to load faster, but also in terms of the runtime memory used to execute it, which is the amount of RAM it needs to run on the machine, again typically measured in megabytes. With modern machines having a lot of RAM and fast internet connections, this is becoming less of an issue, but you should still be mindful of these things, as your end user might not have the same luxuries.
And on that point, knowing your users' expected working environment will help you rule out unsuitable models faster. As web engineers and designers, you know that websites need to be designed responsively based on the device they're being run on, and the same rules apply here for machine learning models.
After checking with the customer, you learn they want to run the system on a spare smartphone, but that fast internet is available at all times, as it will be connected to the house Wi-Fi.
So it's now time to fill out a table of these values for each model. Well, if you're lucky, the documentation for a model may include a performance section that details them. In fact, a screenshot of the pose estimation model documentation is shown here, which lists the expected frames per second you can achieve on a range of devices. However, no details for memory usage and file size are given. For undocumented items, you will need to benchmark the model yourself by creating a simple website that loads the model and uses it, recording these values as you go. Let's walk through how to record these values if they're not already provided for you.
First, frames per second. You can simply record a timestamp just before you execute the model, and then record the timestamp once you get a result. Subtract the two, and you'll have the number of milliseconds the model took to run. To convert this into frames per second, you can simply divide 1,000 by this number, as 1,000 milliseconds represents one second. So if it took 50 milliseconds to run, 1,000 divided by 50 would give you 20 frames per second.
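The timestamp approach can be sketched as follows. Note that `runInference` here is a stand-in for whatever call your chosen model exposes, not a real TensorFlow.js function:

```javascript
// Sketch: time one model execution and convert the result to frames
// per second. Assumes `performance.now()` is available (browsers and
// modern Node.js); `runInference` is a placeholder, not a real API.
async function measureFps(runInference, input) {
  const start = performance.now();    // timestamp just before execution
  await runInference(input);          // run the model once
  const elapsedMs = performance.now() - start;
  return 1000 / elapsedMs;            // 1,000 ms in one second
}

// Example with a fake "model" that takes roughly 50 ms, so expect a
// figure in the region of 20 frames per second:
const fakeModel = () => new Promise((resolve) => setTimeout(resolve, 50));
measureFps(fakeModel, null).then((fps) => console.log(fps.toFixed(0)));
```

In practice you would run this many times and average the result, as the very first inference is often slower while the model warms up.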
Now, for file size, you can check this pretty easily with Chrome's developer tools. You can press F12 on your demo page to open them, or simply right-click anywhere and choose Inspect. Once loaded, switch to the Network tab and ensure Disable Cache is selected. Then refresh the web page, using Control+F5 or by manually refreshing the page in the browser window. As the page loads, you'll see all the web page's resources listed as they download. Note that it might take some time to complete loading all of them.
Now, TensorFlow.js models consist of a model.json file along with associated binary files that use the .bin extension, of which there can be several. Once the web page has stopped loading and is working as intended, you can view the recorded results for the model.json and .bin files, and add up the file sizes of each to get a total model file size. For this example model, it's around 4.8 megabytes of total data transferred to the browser, as you can see.
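If you'd rather script this than eyeball the Network tab, the browser's Resource Timing API exposes a `transferSize` for each fetched file via `performance.getEntriesByType('resource')`. The helper below sums model files from such a list; the entries shown are made up to mirror the roughly 4.8-megabyte figure above:

```javascript
// Sketch: total a model's download size from resource timing entries.
// In the browser you could collect real entries with
// performance.getEntriesByType('resource'); this list is fabricated
// purely to illustrate the arithmetic.
function modelDownloadMB(entries) {
  const bytes = entries
    .filter((e) => /model\.json$|\.bin$/.test(e.name))
    .reduce((sum, e) => sum + e.transferSize, 0);
  return bytes / (1024 * 1024);
}

const fakeEntries = [
  { name: 'https://example.com/model.json', transferSize: 120 * 1024 },
  { name: 'https://example.com/group1-shard1of2.bin', transferSize: 2.4 * 1024 * 1024 },
  { name: 'https://example.com/group1-shard2of2.bin', transferSize: 2.3 * 1024 * 1024 },
  { name: 'https://example.com/app.js', transferSize: 50 * 1024 }, // ignored
];
console.log(modelDownloadMB(fakeEntries).toFixed(1)); // prints "4.8"
```

Note that `transferSize` reflects bytes over the wire, so compressed transfers can report less than the raw file size shown on disk.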
Similarly for RAM usage: again, open the Chrome Developer Tools once the application is running, and ensure the model has executed at least once. You can now switch to the Memory tab and take a snapshot, then switch to the Statistics view to get an overview of how much RAM is being used. Note how, in this case, this is larger than the file size of the model itself, totaling around 56.6 megabytes.
OK, so let's assume these are the benchmarks you get for the models of interest. You know that for real-time applications, you probably want at least 10 frames per second. Looking at the table, you can see that the body segmentation model only runs at 2 frames per second on the target device. For this reason alone, it's probably not suitable for this mobile use case, so you can safely drop it from the contenders. However, it should be noted that there might be other body segmentation models out there that run faster, but that would require further research and investigation. Let's just stick to the models you know for now.
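You could encode that first filtering pass as a simple threshold check. The 2-frames-per-second figure for body segmentation comes from the scenario above; the other numbers are placeholders standing in for whatever your own benchmarks produce:

```javascript
// Sketch: drop any model that can't hit a minimum frames-per-second
// target on the device you benchmarked. Only body segmentation's
// 2 fps comes from the scenario; the rest are illustrative.
function viableModels(benchmarks, minFps = 10) {
  return benchmarks.filter((m) => m.fps >= minFps).map((m) => m.name);
}

const benchmarks = [
  { name: 'image classification', fps: 30 }, // placeholder
  { name: 'object detection', fps: 20 },     // placeholder
  { name: 'body segmentation', fps: 2 },     // from the scenario
  { name: 'pose estimation', fps: 40 },      // placeholder
];
console.log(viableModels(benchmarks));
// body segmentation is the only model dropped
```

The same pattern extends naturally to file size and RAM columns as further filters once you have those numbers recorded.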
So now let's look at the file sizes. Typically, a smaller file size may be preferable for a web page that's reloaded all the time, especially if it's loaded over a 3G connection. However, given your use case, where you load the page once and then leave the app running, this is less of an issue for our customer. The overhead of waiting, say, two more seconds to potentially use a better model is acceptable here. And remember, the customer for this use case also specified that the device will have access to fast Wi-Fi, so download time is not really an issue. You can, therefore, still consider all of the models shown for this example scenario.
Now, looking at the RAM usage, you can see that face landmarks and hand pose are really high, at 70 and 125 megabytes respectively. Given that the intention is to run on a mobile device, which might also be an older-generation model, maybe with just one gigabyte of RAM, it would be wise to be mindful of this in case the user decides to run other apps simultaneously. Furthermore, not all humans have hands, which makes the hand pose model less reliable for those use cases. And detecting the face would not work well if the human was wearing a mask, so that is also potentially a poor solution. Let's discard both of these for now.
You are now left with three contenders: image classification, object detection, and pose estimation. The next step would be to try all three of these models to check if there are any obvious flaws in using them, and also to see what unique benefits each can provide. One thing you might consider next is the number of humans in the image. While this was not an original request from the customer, two of the remaining models, object detection and pose estimation, can report how many humans are present and where they are. As a developer, it makes sense to offer such functionality, as this information could be beneficial, given the use case. Remember, image classification will simply give you a binary yes or no as to whether a human is somewhere in the image, but not where or how many, which is less useful to you. So given that object detection offers more information, with faster performance and almost half the RAM usage, it makes sense to drop image classification from consideration at this point too.
So now you have two models left to choose from. Diving deeper into the documentation for both, you discover that object detection, by default, can detect up to 20 objects simultaneously, and can even go higher if needed, at the sacrifice of some performance. The pose estimation model, by contrast, can detect a maximum of six people at one time and uses over twice the amount of RAM to do so. For this use case, the pose model may not scale well if a larger group of people came along and you wanted to know how many people there were. That being said, it is over two times faster to execute, so it could be better for fast-moving people.
Both of these models are suitable contenders. You might, at this point, want to ask the customer if detecting the presence of six people is enough. If they agree, then it makes sense to use the pose estimation model, as it's faster. If they need the extra granularity to count a larger crowd, then maybe the object detection model will be the way to go.
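That final trade-off can be written down as a one-line decision rule. The function name and shape are mine; the six-person and 20-object limits come from the documentation figures quoted above:

```javascript
// Sketch: pose estimation is faster but caps out at six people;
// object detection handles up to 20 by default. Pick based on the
// largest group the customer expects to count.
function chooseModel(maxPeopleToCount) {
  return maxPeopleToCount <= 6 ? 'pose estimation' : 'object detection';
}

console.log(chooseModel(4));  // 'pose estimation', the faster option
console.log(chooseModel(12)); // 'object detection' for larger crowds
```

Writing the rule down like this makes the assumption explicit, so it is easy to revisit if the customer's requirements change later.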
As you've seen, the model you ultimately choose will depend on your customer's needs and the environment in which it will run. It's not always immediately obvious which model will perform best without further investigation and research, and as a developer, it's up to you to decide what provides the optimal solution to bring the idea to life in the best way possible.
Now, in the next section, you will create your first full end-to-end project from a completely blank canvas, where I'll show you, step by step, how to make your very own smart camera, just like the one you explored in this scenario, using the object detection model. So get your coding hat on, and I'll see you in the next section to put what you've learned into practice.