Python Jumpstart by Building 10 Apps Transcripts
Chapter: App 8: File Searcher App
Lecture: Generators save the day
0:00
We've seen this concept of generator methods, so let's go apply it to everything we have going on here,
0:06
and I'll even show you another keyword we haven't had a chance to talk about yet. So let's start at the bottom.
0:10
Here is a traditional method: it puts all the stuff into a list, and then once that computation is done
0:16
and everything is computed, it returns that whole list. Then we have another list above that we add even more to
0:22
and keep extending as we have more and more files. We can do better. So, instead of doing matches here, let's get rid of this,
0:30
and instead of doing append m, we will say yield m, and we won't have to return the matches; maybe I'll even comment that out for you, like so,
0:41
so no more lists, we are just doing yield. Now, this would already work: what we are doing up here is calling search file, and here we get our generator,
0:51
and we can take any collection, generators or other iterable collections, and extend this list with it. But we can actually do better still:
1:01
this is a generator method, and this is a regular one, but we can also apply the exact same idea here and the same idea here.
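As a rough sketch of that change (the function name search_file and the tuple it yields are my assumptions, not necessarily the course's exact code), the generator version replaces the append/return pattern like this:

```python
def search_file(filename, search_text):
    # Generator version: instead of building up a list with
    # matches.append(m) and returning it at the end, we yield
    # each match the moment it is found.
    with open(filename, encoding='utf-8') as fin:
        for line_num, line in enumerate(fin, start=1):
            if search_text.lower() in line.lower():
                # The old version did matches.append(...) here.
                yield (filename, line_num, line.rstrip())
    # No more "return matches" -- the caller pulls results lazily.
```

Calling search_file(...) no longer runs the loop at all; it returns a generator, and each match is produced only when the caller asks for the next one.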
1:08
Now, this gets a little tricky, because I have to say for m in matches: yield m, and that's not the most fun thing to write;
1:17
it would work, but I'll show you something better. Same thing down here for all the matches, we want to do that,
1:22
and then we no longer have our return. So here is the generator method, and this is going to come through,
1:27
and each time we sort of go pull something out of this collection, it's going to run until it hits one of these,
1:34
which is a yield in the generator, and it's going to hand one back. So if we only wanted the first 4 matches, we could compute that extremely quickly.
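To see why grabbing just the first few matches is cheap, here is a small self-contained sketch (not the course's code): itertools.islice pulls only four items from a generator that could otherwise run forever.

```python
import itertools

def squares():
    # A generator over an effectively endless sequence;
    # each value is computed only when requested.
    for n in itertools.count(1):
        yield n * n

# Pulling the first 4 values stops after the 4th yield;
# nothing past it is ever computed.
first_four = list(itertools.islice(squares(), 4))
print(first_four)  # [1, 4, 9, 16]
```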
1:41
However, this on lines 65 and 66 is not the coolest thing. It turns out that Python 3.3 added basically a keyword that will do the same thing,
1:52
take a whole collection and sort of hand its items back one at a time, so we can simplify this and just say yield from matches.
1:58
And if we really wanted to simplify this, we could actually come down here and write it as one line: we could just say yield from that,
2:06
never even store matches here, and similarly yield from that. So down below, we have search files;
2:13
the searching of an individual file is a generator, and we only ever have a single line in memory at a time.
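Those two spellings, the explicit re-yield loop and yield from, produce exactly the same stream of items; a minimal sketch of the equivalence (names are illustrative):

```python
def relay_loop(items):
    # The verbose pre-3.3 form: re-yield each item one at a time.
    for m in items:
        yield m

def relay_delegate(items):
    # Python 3.3+: delegate to the underlying iterable and hand
    # its items back one at a time, with the same effect.
    yield from items

data = ["first", "second", "third"]
assert list(relay_loop(data)) == list(relay_delegate(data)) == data
```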
2:21
Now up here, as we work through all the files in our directory or even recurse into a tree of directories and their files,
2:28
we are only pulling back one item from either here or here at a given time, and that means we only have one line in memory at a time,
2:38
really one search result, and then we can go up here where we are printing out. Now, let me just show you that this is still working.
2:45
Let's bring this back, and then let me search the simple files again, just to show that we are actually still searching just like before.
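The chain just described, a folder-level generator delegating with yield from to a per-file generator, might be sketched like this (function and variable names are my assumptions, not the course's exact code):

```python
import os

def search_file(filename, search_text):
    # Per-file generator: only the current line is in memory.
    with open(filename, encoding='utf-8') as fin:
        for line in fin:
            if search_text in line:
                yield (filename, line.rstrip())

def search_folder(folder, search_text):
    # Folder-level generator: pulling one result from here pulls
    # exactly one matching line from exactly one open file.
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            yield from search_file(path, search_text)
```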
2:54
So let's search the small set of books for Holmes. There you go, you can see 468 matches, and we are searching Ulysses
3:03
or searching The Adventures of Sherlock Holmes. Perfect, it works exactly the same, but from a performance perspective it's not the same.
3:13
Let's run it again, and this time we are going to search the large set of files,
3:18
and again, we are going to search for how many question marks there are. There were something like 2.78 million question marks, and remember,
3:26
we had to use almost 400 MB of memory to answer that question. Remember, 400 MB. What's it going to do this time, can we do better?
3:36
Oh, here, hold on, let me stop this really quick. Remember, we didn't do the output, it was too much really;
3:42
we just did a little count, so let's rerun it this way. Ok, here we go again,
3:48
it's going, 3.8 MB. Remember, it should have jumped up to 300 MB, a gigabyte. What is going on? This is so absolutely amazing;
3:58
look at this, you guys: we are processing gigabytes and gigabytes of data with almost identical algorithms, and yet the memory usage is the same
4:06
as if we were processing a single line, because that's all we are ever holding: a single line in memory. OK,
4:14
granted, we do have the file stream open to some huge file at some point, but we are seeking over it, we are streaming across it.
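The "little count" works the same way: we can count matches without ever materializing them, consuming and discarding one item at a time. A small sketch (the stand-in generator is illustrative):

```python
def fake_matches():
    # Stand-in for the real search generator; yields lazily.
    for n in range(1_000_000):
        if n % 3 == 0:
            yield n

# sum(1 for ...) consumes the generator one item at a time,
# so memory stays flat no matter how many matches there are.
count = sum(1 for _ in fake_matches())
print(count)  # 333334
```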
4:21
Let's just let it run and see where it goes. It's done, look at that, look at the memory usage, look at the CPU,
4:35
look at the performance; it is so much better than it was before. In fact, I kept the previous one around,
4:43
so let's have a look at it. It's not really fair to put them side by side, because the scale of the graphs is not the same,
4:50
but I think we'll get the sense anyway. So on the left is the old, bad, sort of standard procedural code style. Now look at the memory:
4:59
it goes from 3 MB when I was starting out to 394 MB; ours went from 3 MB to 4 MB. And that was it.
5:12
If you look at the size of the CPU graph or sort of the length of any of these graphs, you'll see they are basically identical in computational time,
5:19
and it actually looks lower on CPU usage, presumably because it's doing less garbage collection, less allocation,
5:27
less doubling of lists and copying them, and things like this. And all we have done to change that algorithm is use the yield
5:34
and yield from keywords instead of making lists, appending to them, and extending them. The code we wrote actually got a couple of lines shorter.
5:41
So this is the power of generator methods: any time you are processing a pipeline of lots of data, you saw that you can chain them together
5:50
to create these pipelines basically effortlessly. We'll see that there is an even simpler way to create
5:56
this type of structure, something called a generator expression, but we'll save that for the next app.