Python Jumpstart by Building 10 Apps Transcripts
Chapter: App 8: File Searcher App
Lecture: The performance problem
0:01 So let's jump over here to Windows 10 for a minute and continue working on our app. The reason I want to come over to Windows is
0:09 that I want to explore this performance problem, and the tooling on Windows is super good
0:14 for understanding the performance characteristics of individual applications rather than system-wide.
0:21 So, we are going to change the problem space a little bit. Previously we have been searching this books folder,
0:26 and if you go look at the properties you will see that's about 5 MB of text, that's a serious quantity of text,
0:33 and it's blasting through it, so that's already impressive. But let's make this harder.
0:37 Here we have some more books, but now we have 2.27 GB of txt files; that is a ridiculous amount of text.
0:47 You'll see that if we try to search that content, well the app actually does surprisingly well,
0:53 it really does go through and it finally has results and so on. But if we leverage this concept of generator methods and related ideas,
0:59 which we will build on in other applications further down the line, we can actually do amazingly better, ok,
1:05 so just to make sure everything is working on Windows, let me just search the same stuff here, ok so we want to search c/users/mkennedy/desktop/books
1:17 and let's search for "incredible". Excellent, so it looks like we've found some matches...
1:26 right, Ulysses, A Doll's House, not too many results there, but you can see it's working. Fantastic,
1:33 now, I happen to know from trying this earlier that we need to change this output here,
1:39 and in fact if we print out all of this, it's going to be so much output when we go through the 2 GB of files that it actually causes problems;
1:49 a significant part of the performance cost is literally that print right there. So instead of doing this we are just going to do a count,
1:58 so we'll say match count. Now, right now this is a list and I could just do len of the list and print that out,
2:08 but it's going to turn out that when this becomes a generator, len of a generator doesn't mean the same thing,
2:16 so let me just independently keep track of the count, and we'll just say something like this, and let's put a little comma separator,
2:23 and we'll do .format and match count. So let's run this one more time, ok, same place, let's search for "funny", and apparently we've found
2:34 33 matches of the word "funny" in the 5 MB of text. That was really quick, that's awesome, right?
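The course's actual search function is written elsewhere in the chapter, so here is a minimal sketch of the counting change described above (search_lines and the sample data are illustrative stand-ins, not the app's real code):

```python
def search_lines(lines, phrase):
    # Stand-in for the app's search. In the lecture this returns a list for
    # now; once it becomes a generator, len() no longer applies, which is
    # why the count is tracked independently below.
    return [line for line in lines if phrase in line]

lines = ["a funny thing", "nothing here", "so funny", "funny again"]

# Count matches one by one instead of printing each match.
match_count = 0
for _match in search_lines(lines, "funny"):
    match_count += 1

# The comma separator mentioned in the lecture: "{:,}" groups digits
# by thousands, e.g. 2700000 -> "2,700,000".
print("Found {:,} matches.".format(match_count))  # Found 3 matches.
```

Tracking the count in the loop, rather than calling len, is what lets the same code work unchanged once the search function yields results instead of returning a list.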
2:41 But let's just push a little bit harder from a performance perspective:
2:44 let's search the 2.27 GB of text, and we are going to search for something, maybe a question mark. So let me introduce you to Process Explorer.
2:55 Process Explorer is kind of like Task Manager, or Activity Monitor from OS X, but it gets a ton of information, both visually
3:04 as well as through things like performance counters on Windows, to tell you what is going on with the apps. So,
3:11 here is our Python app that is waiting for us to hit go. All right, here we go, our app goes; here is our operating memory down here,
3:18 and it's climbing into the distance. It turns out that, searching for question marks in these files,
3:25 there are a ton of them, so we are building them up into our list as we are recursively going through these files on disk.
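The recursive, list-based search being profiled here is built earlier in the chapter; as a rough stand-in, its shape might look like this (search_folder and the tiny demo tree are illustrative, not the course's code):

```python
import os
import tempfile

def search_folder(folder, phrase):
    # Recursively walk the folder, accumulating every matching line in one
    # list. This is the memory-hungry shape described above: with 2+ GB of
    # text and a common search term, this list grows and grows.
    matches = []
    for item in os.listdir(folder):
        full_path = os.path.join(folder, item)
        if os.path.isdir(full_path):
            matches.extend(search_folder(full_path, phrase))
        elif item.endswith(".txt"):
            with open(full_path, encoding="utf-8") as fin:
                for line in fin:
                    if phrase in line:
                        matches.append(line)
    return matches

# Tiny demo tree (a stand-in for the books folder on the desktop).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
with open(os.path.join(root, "a.txt"), "w", encoding="utf-8") as fout:
    fout.write("any questions?\nno.\n")
with open(os.path.join(root, "sub", "b.txt"), "w", encoding="utf-8") as fout:
    fout.write("really?\n")

print(len(search_folder(root, "?")))  # 2 matching lines across the tree
```

Because every match is appended before anything is returned, peak memory scales with the total number of matches, which is exactly what the Process Explorer graph is showing.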
3:32 You can see it's pretty computationally heavy, pretty IO intensive, but really the memory is just growing and growing.
3:40 You'll see, when we get to talking about generators, that maybe this is not the way this app has to behave, right,
3:46 we can actually incorporate very minor changes into our app and get dramatically better performance, at least from a memory perspective.
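The "very minor changes" foreshadowed here are, presumably, swapping the list accumulation for a generator. A sketch under that assumption (these function names are illustrative, not the course's):

```python
def search_all_list(lines, phrase):
    # Current approach: build the full list of matches in memory at once.
    matches = []
    for line in lines:
        if phrase in line:
            matches.append(line)
    return matches

def search_all_gen(lines, phrase):
    # Generator version: yield each match as it is found. The caller can
    # count or print matches one at a time with O(1) extra memory.
    for line in lines:
        if phrase in line:
            yield line

lines = ["is it odd?", "no", "really?", "yes"]
count = 0
for _match in search_all_gen(lines, "?"):
    count += 1
print(count)  # 2, without ever holding all matches in memory at once
```

Both functions visit the same lines and find the same matches; the only difference is that the generator never materializes the whole result set, which is why the change is so small yet the memory profile is so different.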
4:00 All right, our process has finished and we found 2.7 million question marks in those files, and look at the memory,
4:09 this is not the most amazing outcome that we could have had. It turns out it took almost 400 MB, the way we implemented our algorithm,
4:18 and depending on how we hold the data, or the size of the data, it could be even worse.