
3.2: Selecting an ML model to use

JASON MAYES: Now you know what pre-trained models are, how do you select one to use in a real-world project? In this section, you'll walk through a hypothetical, real-world scenario, breaking down the problem into the following steps.

First, what exactly is the problem you're trying to solve? Secondly, once you know the problem, what models do you know of that could assist you? And then finally, considering all identified usable models, how do you select one to deliver the best solution?

So first up, imagine a real customer has contacted you and states, "I'd like to create a web app that can detect and alert me when an intruder is in my garden, but does not send any of my images to the Cloud."

One thing right away that you can take note of is that the customer wants a web app. This means TensorFlow.js could be a great fit, as you know it can run both client side in the browser and also server side via Node.js in the Cloud.

However, privacy was also of key importance. And in this case, running TensorFlow.js client side in the browser means none of the images ever need to leave the user's device.

Diving deeper into the requirements of the app, it will need to detect when an intruder is present. Upon clarifying with the customer what an intruder really means, it seems you just need to detect humans, and not cats, dogs, or other animals that may also enter. This means you'll want to make a system that knows when a human is in a given frame of the video stream.

All right, so you've already seen a few models in earlier sections. In fact, can you guess which ones could detect the presence of a human? Feel free to pause the video here and note down which ones you might consider before you continue to see the answer. I'll wait for you to do that.

All right, let's check your options from these known pre-trained models.

Technically speaking, all of the listed vision and human body models could be used to detect the presence of a person. And yes, that includes the face and hand models too. It's also worth noting that, depending on the requirements, sound could also be used to detect the presence of talking, even if the person's not visible. It's often quite surprising how many ways there are to solve a problem.

Now let's assume, after checking with the customer, that sound is not suitable, as other noises may also occur near the house, and they don't want to accidentally detect those.

In that case, you've got a choice of the following: image classification, object detection, body segmentation, pose estimation, face-landmark detection, or even hand pose estimation, if the hands are in the shot.

Let's break these down to see the pros and cons of each. In these situations, it's helpful to fill out a table of features that matter to the task at hand. Some common key things to always check include the inference speed, which is the time it takes from sending new data as input to the model to getting an output back. Lower times, typically measured in milliseconds, mean faster performance, which can also be expressed as frames per second.

For example, 10 frames per second means it takes 100 milliseconds to run, as you could run it 10 times before a second has elapsed. This is important when working with real-time applications, such as video, as often video will need to run at 24 frames per second or greater.

If the model runs slower than this, some video frames will have to be skipped for classification, which might be acceptable, depending on your use case. But typically, if the classification rate drops below 10 to 15 frames per second, the application may appear to feel laggy, especially if you're trying to deal with fast-moving objects, which would lead to a less than beneficial user experience.

You might also want to check the amount of memory the model uses, both in terms of raw file size, typically measured in megabytes, where less is better if you want the page to load faster, but also in terms of runtime memory used to execute it, which is the amount of RAM it would need to run on the machine, again typically measured in megabytes. With modern machines having a lot of RAM and fast internet connections, this is becoming less of an issue, but you should still be mindful of these things, as your end user might not have the same luxuries.

And on that point, knowing your users' expected working environment will help you decide faster which models are not suitable. As web engineers and designers, you already know this well: modern websites need to be designed responsively, based on the device they're being run on. And the same rules apply here for machine learning models.

After checking with the customer, you learn they want to run the system on a spare smartphone. But fast internet is available at all times, as it will be connected to the house Wi-Fi.

So it's now time to fill out a table of these key properties. Well, if you're lucky, the documentation for a model may include a performance section that details them. In fact, a screenshot of the pose estimation model documentation is shown here that shows the expected frames per second you can achieve on a range of devices. However, no details for memory usage and file size are provided.

For undocumented items, you will need to benchmark the model yourself by creating a simple website that loads the model and uses it, recording these values yourself. Let's walk through how to record these values if they're not already provided for you.

First, frames per second. You can simply record a timestamp just before you execute the model, and then record the timestamp once you get a result. Subtract the two, and you'll have the number of milliseconds the inference took. To convert this into frames per second, you can simply divide this number into 1,000, as 1,000 milliseconds represents one second. So if it took 50 milliseconds to run, 1,000 divided by 50 would give you 20 frames per second.
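The timing steps above can be sketched in a few lines of JavaScript. The conversion helper is generic; the commented usage assumes a hypothetical loaded model with a `detect()` method, which is not part of the original text.

```javascript
// Convert a single inference time in milliseconds to frames per second.
// 1,000 ms make up one second, so FPS is 1000 divided by the elapsed time.
function msToFps(elapsedMs) {
  return 1000 / elapsedMs;
}

// Hypothetical usage in the browser (model and videoElement are assumptions):
//   const start = performance.now();
//   const result = await model.detect(videoElement);
//   const elapsed = performance.now() - start;
//   console.log(`${msToFps(elapsed).toFixed(1)} FPS`);

console.log(msToFps(50)); // 50 ms per inference gives 20 FPS
```

Averaging over many inferences, rather than timing just one, will give a more stable number, as the very first run is often slower while things warm up.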

Now, for file size, you can check this pretty easily with your browser's developer tools. You can press F12 on your demo page to open them, or simply right-click anywhere and choose Inspect. Once loaded, switch to the Network tab and ensure Disable Cache is selected. Then refresh the web page using Control+F5, or by manually refreshing the web page in the browser window. And as the page loads, you'll see all the web page's resources being fetched. Note that it might take some time to complete loading all of them.

Now, TensorFlow.js models consist of a model.json file along with associated binary files that use the .bin extension, of which there could be several. Once the web page has stopped loading and is working as intended, you can view the recorded results for the model.json and .bin files, from which you can add up the file sizes of each to get a total model file size. For this example model, it's around 4.8 megabytes of total data transferred to the browser, as you can see.
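If you'd rather total these sizes in code than by hand, a minimal sketch is possible using the browser's Resource Timing API, where `performance.getEntriesByType('resource')` returns entries with a `name` (the URL) and a `transferSize` in bytes. The URLs and sizes below are made-up example data, not the model discussed here.

```javascript
// Sum the transfer sizes of the model.json and .bin files from a list of
// resource-timing-style entries ({ name, transferSize }).
function totalModelBytes(entries) {
  return entries
    .filter((e) => e.name.endsWith('.json') || e.name.endsWith('.bin'))
    .reduce((sum, e) => sum + e.transferSize, 0);
}

// Hypothetical usage on the demo page:
//   const bytes = totalModelBytes(performance.getEntriesByType('resource'));
//   console.log((bytes / (1024 * 1024)).toFixed(1) + ' MB');

// Example with mocked entries:
const entries = [
  { name: 'https://example.com/model.json', transferSize: 150000 },
  { name: 'https://example.com/group1-shard1of2.bin', transferSize: 2500000 },
  { name: 'https://example.com/group1-shard2of2.bin', transferSize: 2350000 },
  { name: 'https://example.com/app.js', transferSize: 40000 },
];
console.log(totalModelBytes(entries)); // 5000000 bytes
```

Note that `transferSize` reflects bytes over the wire, so it will be smaller than the raw file size when the server compresses responses.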

Similarly for RAM usage: again, open the Chrome Developer Tools once the application is running, and ensure that the model has finished loading. You can now switch to the Memory tab and take a snapshot, and then switch to the Statistics view to get an overview of how much RAM is being used. Note how, in this case, this is larger than the file size of the model itself, totaling around 56.6 megabytes.
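As a rough in-code alternative to a manual snapshot, Chrome exposes a non-standard `performance.memory` object with a `usedJSHeapSize` in bytes; this is a Chrome-only approximation, not a substitute for the Memory tab. The conversion helper below is a plain utility.

```javascript
// Convert a byte count to mebibytes (1 MB = 1024 * 1024 bytes here).
function toMegabytes(bytes) {
  return bytes / (1024 * 1024);
}

// Hypothetical usage in Chrome (performance.memory is non-standard,
// so guard for its presence):
//   if (performance.memory) {
//     const used = performance.memory.usedJSHeapSize;
//     console.log(toMegabytes(used).toFixed(1) + ' MB of JS heap in use');
//   }

console.log(toMegabytes(56623104)); // 56,623,104 bytes is 54 MB
```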

OK, so let's assume these are the benchmarks you get for the models of interest. You know that for real-time applications, you probably want at least 10 frames per second. Looking at the table, you can see that the body segmentation model only runs at 2 frames per second on the target device. For this reason alone, it's probably not suitable for this mobile use case, so you can safely drop it from the contenders.
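The "at least 10 frames per second" rule can be applied to a benchmark table in code. The FPS numbers below are illustrative placeholders standing in for your own measurements, except for body segmentation's 2 FPS from the scenario.

```javascript
// Benchmark table of candidate models with measured frames per second.
const benchmarks = [
  { model: 'image classification', fps: 25 },
  { model: 'object detection', fps: 20 },
  { model: 'body segmentation', fps: 2 },
  { model: 'pose estimation', fps: 40 },
  { model: 'face-landmark detection', fps: 30 },
  { model: 'hand pose estimation', fps: 15 },
];

// Keep only models fast enough for real-time use.
const MIN_FPS = 10;
const contenders = benchmarks.filter((b) => b.fps >= MIN_FPS);
console.log(contenders.map((b) => b.model).join(', '));
// body segmentation is the only model filtered out
```

The same filtering approach extends naturally to the other columns of the table, such as file size or RAM usage, as further constraints become known.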

However, it should be noted that there might be other body segmentation models out there that could perform better, but that would require further research and investigation. Let's just stick to the models you know for now.

So now let's look at the file sizes. Typically, a smaller file size may be preferable for a web page that's reloaded all the time, especially if it's loaded over a 3G connection. However, given your use case, where you load the page once and then leave the app running, this is less of an issue for our customer. The overhead of waiting, say, 2 more seconds to potentially use a better model is likely acceptable. And remember, the customer for this use case also specified that the device will have access to fast Wi-Fi, so download time is not really an issue. You can, therefore, still consider all of the models shown for this example scenario.

Now, looking towards the RAM usage, you can see that face landmarks and hand pose are really high, at 70 and 125 megabytes respectively. Given that the intention is to run on a mobile device, which might also be an older-generation model, maybe with just one gigabyte of RAM, it would be wise to be mindful of this, in case the user decides to run other apps simultaneously. Furthermore, not all humans have hands, which makes the hand pose model less reliable for those cases. And also, detecting the face would not work well if the human was wearing a mask, so this is also potentially a poor solution. Let's discard both of these for now.

You are now left with three contenders: image classification, object detection, and pose estimation. The next step would be to try all three of these models to check if there are any obvious flaws in using them for this task. And also, what unique benefits can each provide?

One thing you might consider next is the number of humans in the image. While this was not an original request from the user, two of the remaining models support it: object detection and pose estimation. As a developer, it makes sense to offer such functionality, as this information could be beneficial, given the use case. Remember, image classification will simply give you a binary yes or no as to whether a human is somewhere in the image, but not where or how many, which is less useful to you. So given that object detection offers more information, with faster performance and almost half the RAM usage, it makes sense to drop image classification from consideration at this point too.

So now, you have two models left to choose from. Diving deeper into the documentation for both, you can discover that object detection, by default, can detect up to 20 objects simultaneously, and can even go higher if needed, at the sacrifice of some speed. The pose estimation model, by contrast, can detect a maximum of six people at one time and uses over twice the amount of RAM to do so. For this use case, the pose model may not scale well if a larger group of people came and you wanted to know how many people there were. That being said, it is over two times faster, so it could be better for fast-moving people.

Both of these models are suitable contenders. You might, at this point, want to ask the customer if detecting the presence of six people is enough. And if they agree, then it makes sense to use the pose estimation model, as it's faster. If they need the extra granularity to count a larger crowd, then maybe the object detection model will be the way to go.
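Whichever model you pick, counting people from its output reduces to filtering detections. The sketch below assumes predictions shaped like those from a COCO-SSD-style object detection model, an array of objects with a `class` and a confidence `score`; the sample data and the `alertOwner` name are illustrative, not from the original text.

```javascript
// Count confident 'person' detections in an object detection result.
function countPeople(predictions, minScore = 0.5) {
  return predictions.filter(
    (p) => p.class === 'person' && p.score >= minScore
  ).length;
}

// Hypothetical usage per video frame:
//   const predictions = await model.detect(videoElement);
//   if (countPeople(predictions) > 0) alertOwner();

// Example with mocked predictions:
const predictions = [
  { class: 'person', score: 0.92 },
  { class: 'dog', score: 0.88 },
  { class: 'person', score: 0.31 },
];
console.log(countPeople(predictions)); // 1
```

Note how the score threshold quietly encodes a design choice: raising it reduces false alarms from the dog, while lowering it catches partially visible people.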

As you've seen, the model you ultimately choose will depend on your customer's needs and the environment it will be used in. It's not always immediately obvious which model will perform the best without further investigation and research. And as a developer, it's up to you to decide what provides the optimal solution to bring the idea to life in the best way possible.

Now, in the next section, you will create your first full end-to-end project from a completely blank canvas, where I'll show you, step by step, how to make your very own smart camera, just like the one you explored in this scenario, using the object detection model. So get your coding hat on, and I'll see you in the next section to put what you've learned into practice.
