
3.2: Selecting an ML model to use

JASON MAYES: Now you know what pre-trained models are, how do you select one to use in a real-world project? In this section, you'll walk through a hypothetical, real-world scenario, breaking down the problem into the following steps.

First, what exactly is the problem you're trying to solve? Secondly, once you know the problem, what models do you know of that could assist you? And then finally, considering all identified usable models, how do you select one to deliver the best solution?

So first up, imagine a real customer has contacted you and states, "I'd like to create a web app that can detect and alert me when an intruder is in my garden, but does not send any of my images to the Cloud."

One thing right away that you can take note of is that the customer wants a web app. This means TensorFlow.js could be a great fit, as you know it can run both client side in the browser and also server side via Node.js in the Cloud.

However, privacy was also of key importance. And in this case, running TensorFlow.js client side in the browser means none of the images ever need to leave the user's device.

Diving deeper into the requirements of the app, it will need to detect when an intruder is present. Upon clarifying with the customer what an intruder really means, it seems you just need to detect humans, and not cats, dogs, or other animals that may also enter. This means you'll want to make a system that knows when a human is in a given frame of the video stream.

All right, so you've already seen a few models in earlier sections. In fact, can you guess which ones could detect the presence of a human? Feel free to pause the video here and note down which ones you might consider before you continue to see the answer. I'll wait for you to do that.

All right, let's check your options from these known pre-trained models.

Technically speaking, all of the listed vision and human body models could be used to detect the presence of a person. And yes, that includes the face and hand models too. It's also worth noting that, depending on the requirements, sound could also be used to detect the presence of talking, even if the person's not visible. It's often quite surprising how many ways there are to solve a problem.

Now let's assume, after checking with the customer, that sound is not suitable, as other noises may also occur near the house, and they don't want to accidentally detect those.

In that case, you've got a choice of the following: image classification, object detection, body segmentation, pose estimation, face-landmark detection, or even hand pose estimation, if the hands are in the shot.

Let's break these down to see the pros and cons of each. In these situations, it's helpful to fill out a table of features that matter to the task at hand. Some common key things to always check include the inference speed, which is the time it takes from sending new data as input to the model to getting an output back. Lower times, typically measured in milliseconds, mean faster performance, which can also be expressed as frames per second.

For example, 10 frames per second means it takes 100 milliseconds to run, as you could run it 10 times before a second has elapsed. This is important when working with real-time applications, such as video, as often video will need to run at 24 frames per second or greater.

If the model runs slower than this, some video frames will have to be skipped for classification, which might be acceptable, depending on your use case. But typically, if the classification rate drops below 10 to 15 frames per second, the application may appear to feel laggy, especially if you're trying to deal with fast-moving objects, which would lead to a less than beneficial user experience.

You might also want to check the amount of memory the model uses, both in terms of raw file size, typically measured in megabytes, where less is better if you want the page to load faster, but also in terms of runtime memory used to execute it, which is the amount of RAM it would need to run on the machine, again typically measured in megabytes. With modern machines having a lot of RAM and fast internet connections, this is becoming less of an issue, but you should still be mindful of these things, as your end user might not have the same luxuries.

And on that point, knowing your users' expected working environment will help you decide faster which models are not suitable. As web engineers and designers, you already know this well: modern websites need to be designed responsively, based on the device they're being run on. And the same rules apply here for machine learning models.

After checking with the customer, you learn they want to run the system on a spare smartphone. But fast internet is available at all times, as it will be connected to the house Wi-Fi.

So it's now time to fill out a table of these key properties. Well, if you're lucky, the documentation for a model may include a performance section that details them. In fact, a screenshot of the pose estimation model documentation is shown here that shows the expected frames per second you can achieve on a range of devices. However, no details for memory usage and file size are provided.

For undocumented items, you will need to benchmark the model yourself by creating a simple website that loads the model and uses it, recording these values yourself. Let's walk through how to record these values if they're not already provided for you.

First, frames per second. You can simply record a timestamp just before you execute the model, and then record the timestamp once you get a result. Subtract the two, and you'll have the number of milliseconds the inference took. To convert this into frames per second, you can simply divide this number into 1,000, as 1,000 milliseconds represents one second. So if it took 50 milliseconds to run, 1,000 divided by 50 would give you 20 frames per second.
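The timing steps above can be sketched in a few lines of JavaScript. The conversion helper is generic; the commented usage assumes a hypothetical loaded model with a `detect()` method, which is not part of the original text.

```javascript
// Convert a single inference time in milliseconds to frames per second.
// 1,000 ms make up one second, so FPS is 1000 divided by the elapsed time.
function msToFps(elapsedMs) {
  return 1000 / elapsedMs;
}

// Hypothetical usage in the browser (model and videoElement are assumptions):
//   const start = performance.now();
//   const result = await model.detect(videoElement);
//   const elapsed = performance.now() - start;
//   console.log(`${msToFps(elapsed).toFixed(1)} FPS`);

console.log(msToFps(50)); // 50 ms per inference gives 20 FPS
```

Averaging over many inferences, rather than timing just one, will give a more stable number, as the very first run is often slower while things warm up.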

Now, for file size, you can check this pretty easily with your browser's developer tools. You can press F12 on your demo page to open them, or simply right-click anywhere and choose Inspect. Once loaded, switch to the Network tab and ensure Disable Cache is selected. Then refresh the web page using Control+F5, or by manually refreshing the web page in the browser window. And as the page loads, you'll see all the web page's resources being fetched. Note that it might take some time to complete loading all of them.

Now, TensorFlow.js models consist of a model.json file along with associated binary files that use the .bin extension, of which there could be several. Once the web page has stopped loading and is working as intended, you can view the recorded results for the model.json and .bin files, from which you can add up the file sizes of each to get a total model file size. For this example model, it's around 4.8 megabytes of total data transferred to the browser, as you can see.
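If you'd rather total these sizes in code than by hand, a minimal sketch is possible using the browser's Resource Timing API, where `performance.getEntriesByType('resource')` returns entries with a `name` (the URL) and a `transferSize` in bytes. The URLs and sizes below are made-up example data, not the model discussed here.

```javascript
// Sum the transfer sizes of the model.json and .bin files from a list of
// resource-timing-style entries ({ name, transferSize }).
function totalModelBytes(entries) {
  return entries
    .filter((e) => e.name.endsWith('.json') || e.name.endsWith('.bin'))
    .reduce((sum, e) => sum + e.transferSize, 0);
}

// Hypothetical usage on the demo page:
//   const bytes = totalModelBytes(performance.getEntriesByType('resource'));
//   console.log((bytes / (1024 * 1024)).toFixed(1) + ' MB');

// Example with mocked entries:
const entries = [
  { name: 'https://example.com/model.json', transferSize: 150000 },
  { name: 'https://example.com/group1-shard1of2.bin', transferSize: 2500000 },
  { name: 'https://example.com/group1-shard2of2.bin', transferSize: 2350000 },
  { name: 'https://example.com/app.js', transferSize: 40000 },
];
console.log(totalModelBytes(entries)); // 5000000 bytes
```

Note that `transferSize` reflects bytes over the wire, so it will be smaller than the raw file size when the server compresses responses.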

Similarly for RAM usage: again, open the Chrome Developer Tools once the application is running, and ensure that the model has finished loading. You can now switch to the Memory tab and take a snapshot, and then switch to the Statistics view to get an overview of how much RAM is being used. Note how, in this case, this is larger than the file size of the model itself, totaling around 56.6 megabytes.
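As a rough in-code alternative to a manual snapshot, Chrome exposes a non-standard `performance.memory` object with a `usedJSHeapSize` in bytes; this is a Chrome-only approximation, not a substitute for the Memory tab. The conversion helper below is a plain utility.

```javascript
// Convert a byte count to mebibytes (1 MB = 1024 * 1024 bytes here).
function toMegabytes(bytes) {
  return bytes / (1024 * 1024);
}

// Hypothetical usage in Chrome (performance.memory is non-standard,
// so guard for its presence):
//   if (performance.memory) {
//     const used = performance.memory.usedJSHeapSize;
//     console.log(toMegabytes(used).toFixed(1) + ' MB of JS heap in use');
//   }

console.log(toMegabytes(56623104)); // 56,623,104 bytes is 54 MB
```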

OK, so let's assume these are the benchmarks you get for the models of interest. You know that for real-time applications, you probably want at least 10 frames per second. Looking at the table, you can see that the body segmentation model only runs at 2 frames per second on the target device. For this reason alone, it's probably not suitable for this mobile use case, so you can safely drop it from the contenders.
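The "at least 10 frames per second" rule can be applied to a benchmark table in code. The FPS numbers below are illustrative placeholders standing in for your own measurements, except for body segmentation's 2 FPS from the scenario.

```javascript
// Benchmark table of candidate models with measured frames per second.
const benchmarks = [
  { model: 'image classification', fps: 25 },
  { model: 'object detection', fps: 20 },
  { model: 'body segmentation', fps: 2 },
  { model: 'pose estimation', fps: 40 },
  { model: 'face-landmark detection', fps: 30 },
  { model: 'hand pose estimation', fps: 15 },
];

// Keep only models fast enough for real-time use.
const MIN_FPS = 10;
const contenders = benchmarks.filter((b) => b.fps >= MIN_FPS);
console.log(contenders.map((b) => b.model).join(', '));
// body segmentation is the only model filtered out
```

The same filtering approach extends naturally to the other columns of the table, such as file size or RAM usage, as further constraints become known.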

However, it should be noted that there might be other body segmentation models out there that could perform better, but that would require further research and investigation. Let's just stick to the models you know for now.

So now let's look at the file sizes. Typically, a smaller file size may be preferable for a web page that's reloaded all the time, especially if it's loaded over a 3G connection. However, given your use case, where you load the page once and then leave the app running, this is less of an issue for our customer. The overhead of waiting, say, 2 more seconds to potentially use a better model is likely acceptable. And remember, the customer for this use case also specified that the device will have access to fast Wi-Fi, so download time is not really an issue. You can, therefore, still consider all of the models shown for this example scenario.

Now, looking towards the RAM usage, you can see that face landmarks and hand pose are really high, at 70 and 125 megabytes respectively. Given that the intention is to run on a mobile device, which might also be an older-generation model, maybe with just one gigabyte of RAM, it would be wise to be mindful of this, in case the user decides to run other apps simultaneously. Furthermore, not all humans have hands, which makes the hand pose model less reliable for those cases. And also, detecting the face would not work well if the human was wearing a mask, so this is also potentially a poor solution. Let's discard both of these for now.

You are now left with three contenders: image classification, object detection, and pose estimation. The next step would be to try all three of these models to check if there are any obvious flaws in using them for this task. And also, what unique benefits can each provide?

One thing you might consider next is the number of humans in the image. While this was not an original request from the user, two of the remaining models support it: object detection and pose estimation. As a developer, it makes sense to offer such functionality, as this information could be beneficial, given the use case. Remember, image classification will simply give you a binary yes or no as to whether a human is somewhere in the image, but not where or how many, which is less useful to you. So given that object detection offers more information, with faster performance and almost half the RAM usage, it makes sense to drop image classification from consideration at this point too.

So now, you have two models left to choose from. Diving deeper into the documentation for both, you can discover that object detection, by default, can detect up to 20 objects simultaneously, and can even go higher if needed, at the sacrifice of some speed. The pose estimation model, by contrast, can detect a maximum of six people at one time and uses over twice the amount of RAM to do so. For this use case, the pose model may not scale well if a larger group of people came and you wanted to know how many people there were. That being said, it is over two times faster, so it could be better for fast-moving people.

Both of these models are suitable contenders. You might, at this point, want to ask the customer if detecting the presence of six people is enough. And if they agree, then it makes sense to use the pose estimation model, as it's faster. If they need the extra granularity to count a larger crowd, then maybe the object detection model will be the way to go.
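Whichever model you pick, counting people from its output reduces to filtering detections. The sketch below assumes predictions shaped like those from a COCO-SSD-style object detection model, an array of objects with a `class` and a confidence `score`; the sample data and the `alertOwner` name are illustrative, not from the original text.

```javascript
// Count confident 'person' detections in an object detection result.
function countPeople(predictions, minScore = 0.5) {
  return predictions.filter(
    (p) => p.class === 'person' && p.score >= minScore
  ).length;
}

// Hypothetical usage per video frame:
//   const predictions = await model.detect(videoElement);
//   if (countPeople(predictions) > 0) alertOwner();

// Example with mocked predictions:
const predictions = [
  { class: 'person', score: 0.92 },
  { class: 'dog', score: 0.88 },
  { class: 'person', score: 0.31 },
];
console.log(countPeople(predictions)); // 1
```

Note how the score threshold quietly encodes a design choice: raising it reduces false alarms from the dog, while lowering it catches partially visible people.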

As you've seen, the model you ultimately choose will depend on your customer's needs and the environment it will be used in. It's not always immediately obvious which model will perform the best without further investigation and research. And as a developer, it's up to you to decide what provides the optimal solution to bring the idea to life in the best way possible.

Now, in the next section, you will create your first full end-to-end project from a completely blank canvas, where I'll show you, step by step, how to make your very own smart camera, just like the one you explored in this scenario, using the object detection model. So get your coding hat on, and I'll see you in the next section to put what you've learned into practice.
