Python Jumpstart by Building 10 Apps Transcripts
Chapter: App 8: File Searcher App
Lecture: Generators save the day

0:00 We've seen this concept of generator methods; let's go apply it to everything we have going on here,
0:06 and I'll even show you another keyword we haven't had a chance to talk about yet. So let's start at the bottom.
0:10 Here is a traditional method: it puts all the stuff into a list, and then once that computation is done
0:16 and everything is computed, it returns that whole list. And then we have another list above that we have to
0:22 keep extending as we have more and more files. We can do better. So instead of building up matches here, let's get rid of this,
0:30 and instead of doing append(m) we will say yield m, and we won't have to return the matches; maybe I'll even comment it out for you, like so.
0:41 So no more lists, we are just doing yield. Now this would already work: what we are doing is calling search_file up here, and here we get our generator,
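As a rough sketch of that change (the function and parameter names here are my assumptions for illustration, not necessarily the exact code from the app), the before-and-after looks something like this:

    # Before: builds the entire list in memory, then returns it.
    def search_file(filename, search_text):
        matches = []
        with open(filename, 'r', encoding='utf-8') as fin:
            for line in fin:
                if search_text in line.lower():
                    matches.append(line)
        return matches

    # After: a generator method; hands back one match at a time.
    def search_file(filename, search_text):
        with open(filename, 'r', encoding='utf-8') as fin:
            for line in fin:
                if search_text in line.lower():
                    yield line
        # return matches  # no longer needed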
0:51 but we can take any collection, generators or iterable collections, and extend this list. But we can actually do better still,
1:01 so this is a generator method, this is a regular one, but we can also apply exactly the same idea here and the same idea here.
1:08 Now this gets a little tricky, because I have to say for m in matches: yield m. Now, that's not the most fun thing to write;
1:17 it would work, but I'll show you something better. Same thing down here for all the matches, there we want to do that,
1:22 and then we no longer have our return. So here is the generator method, and this is going to come through,
1:27 and each time that we sort of go pull something out of this collection, it's going to go until it hits one of these,
1:34 which is the yield in the generator, and it's going to hand one back. So if we only wanted the first 4 matches, we could compute that extremely quickly.
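That "only the first 4 matches" idea is easy to demonstrate; a minimal sketch, assuming the search_file generator above (the filename is a placeholder):

    import itertools

    # Pull just the first 4 matches; the generator stops running
    # as soon as we stop asking for more items.
    matches = search_file('ulysses.txt', 'holmes')
    first_four = list(itertools.islice(matches, 4))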
1:41 However, this line 65, 66, this is not the coolest thing. It turns out that Python 3.3 added basically a keyword that will do the same thing,
1:52 like take a whole set and sort of hand them back one at a time, and so we can simplify this and just say yield from matches.
1:58 And if we really wanted to simplify this, we could actually come down here and just write it as one line, we could just say yield from that,
2:06 never even store matches here; similarly, yield from that. So down below, we have search_file;
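Put together, the directory-level function might look roughly like this (os.listdir and the exact structure are my assumptions, not necessarily the app's code):

    import os

    # A generator that delegates to another generator with yield from.
    def search_files(folder, search_text):
        for name in os.listdir(folder):
            full_path = os.path.join(folder, name)
            if os.path.isfile(full_path):
                # Hand back each match from this file, one at a time,
                # instead of extending a list.
                yield from search_file(full_path, search_text)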
2:13 that individual searching of a single file is a generator, and we only ever have a single line in memory at a time.
2:21 Now up here, as we work through all the files in our directory, or even recurse into a tree of directories and their files,
2:28 we are only pulling back one item from either here or here at a given time, and that means we only have one line in memory at a time,
2:38 really one search result. And then we can go up here and we are printing out. Now, let me just show you that this is still working;
2:45 let's bring this back, and then let me search the simple files again, just to show that we are actually still searching just like before.
2:54 So let's search the small set of books for Holmes. There you go, you can see 468 matches, and we are searching the Ulysses,
3:03 or searching The Adventures of Sherlock Holmes. Perfect, it works exactly the same, but from a performance perspective it's not the same.
3:13 Let's run it again, and this time we are going to search the large set of files,
3:18 and again, we are going to search for how many question marks there are. There were something like 2.78 million question marks, and remember,
3:26 we had to use almost 400 MB of memory to answer that question. Remember, 400 MB. What's it going to do this time, can we do better?
3:36 Oh, here, hold on, let me stop this really quick. Remember, we didn't do the output, it was too much, really;
3:42 we just did a little count, so let's rerun it this way. Ok, here we go again.
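A little count like that never needs to store the matches at all; one hedged way to write it (the folder path is a placeholder):

    # Count matches without ever holding them all in memory:
    # the generator is consumed one item at a time.
    matches = search_files('./books', '?')
    count = sum(1 for _ in matches)
    print('{:,} matches'.format(count))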
3:48 It's going, 3.8 MB. Remember, it should have jumped up to 300 MB, a gigabyte. What is going on? This is so absolutely amazing;
3:58 look at this, you guys, we are processing gigabytes and gigabytes of text with almost identical algorithms, and yet the memory usage is the same
4:06 as if we were processing a single line in memory, because that's all we are ever holding: a single line in memory. Ok,
4:14 granted, we do have the file stream open to some huge file at some point, but we are seeking over it, we are streaming across it.
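That streaming behavior comes from the file object itself being lazy: iterating it reads one line per step, never the whole file. A minimal sketch (the filename is hypothetical):

    # Iterating a file object is itself lazy: one line per iteration.
    with open('huge_file.txt', 'r', encoding='utf-8') as fin:
        for line in fin:
            if '?' in line:        # per-line work; memory stays constant
                print(line.rstrip())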
4:21 Let's just let it run and see where it goes. It's done. Look at that, look at the memory usage, look at the CPU,
4:35 look at the performance; it is so much better than it was before. In fact, I kept the previous one around,
4:43 let's have a look at it. It's not really fair to put them side by side, because the scale of the graph is not the same,
4:50 but I think we'll get the sense anyway. So on the left is the old, bad, sort of standard procedural code style, and now look at the memory:
4:59 it goes from 3 MB when it was starting out to 394 MB; ours went from 3 MB to 4 MB. And that was it.
5:12 If you look at the size of the CPU graph, or sort of the length of any of these graphs, you'll see they are basically identical in computational time.
5:19 It actually looks lower on CPU usage; presumably it's doing less garbage collection, less allocation,
5:27 less doubling of lists and copying them, and things like this. And all we have done to change that algorithm is use the yield
5:34 and yield from keywords instead of making lists and appending and extending them. The code we wrote actually got a couple of lines shorter.
5:41 So this is the power of generator methods: any time you are processing a pipeline of lots of data, you saw that you can chain them together
5:50 to create these pipelines basically effortlessly. We'll see that there is an even simpler way to create
5:56 this type of structure, something called a generator expression, but we'll save that for the next app.
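As a taste of that chaining (all the names here are illustrative, not from the app's code), generator stages can be stacked into a pipeline, and nothing runs until something pulls from the end:

    def read_lines(filename):
        # Stage 1: stream lines lazily from disk.
        with open(filename, 'r', encoding='utf-8') as fin:
            yield from fin

    def matching(lines, search_text):
        # Stage 2: keep only lines containing the search text.
        for line in lines:
            if search_text in line.lower():
                yield line

    # Chain the stages; memory use stays at roughly one line,
    # no matter how large the input file is.
    pipeline = matching(read_lines('ulysses.txt'), 'holmes')
    for match in pipeline:
        print(match.rstrip())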

