Python Memory Management and Tips Transcripts
Chapter: Memory and functions
Lecture: Converting the pipeline to generators
0:00 Well, if this switch to reusing the variable name seemed impressive to you,
0:04 what we're going to do now will probably blow you away.
0:08 So let's go over here and I'm gonna make a totally new copy of this.
0:12 I'm gonna call this "app_one_at_a_time".
0:18 We're going to switch this out to use a totally different style,
0:23 and we'll leave the seed the same, we're still gonna import this, we're still
0:26 going to need that and so on.
0:27 But instead of using this greedy_main,
0:29 I'm just gonna have the regular main here and we're actually...
0:33 And no, let's actually go back to the greedy_main.
0:35 That's even more interesting, right?
0:36 This was the worst case scenario.
0:39 So we're going to get rid of this one.
0:41 Super. So what we're gonna do is we're going to rewrite,
0:45 reimplement these functions so that they use less memory. And let's go ahead and just run
0:50 this and see where we are.
0:54 Looks like we're using 83 megabytes of memory
0:57 additionally, to run that code.
0:59 Can we do better? Well,
1:01 the answer is yes. The mechanism for doing so will really impress you,
1:05 I think. So, let's come along here and let's change this.
1:12 Now I'm gonna change this in two ways.
1:13 All three of these are list comprehensions, and converting list comprehensions to where we're going
1:18 is really, really easy. But I want to rewrite one in a way that
1:23 makes it more obvious what the general solution is.
1:27 So let's first do the easy piece.
1:30 Let's say this one, instead of creating the list,
1:34 filling up the list with all the items,
1:36 keeping every one of them in memory at once and then processing them later,
1:40 what I'd like to do is set up this function in a way so that it can
1:43 give one back, let that entirely be processed through the pipeline and then offer up
1:49 the next, and the next, and the next.
1:51 So instead of loading a million things,
1:53 we're gonna load one, let it get taken care of and then load the next.
1:57 How do we accomplish this?
1:59 It seems like there's some interesting or tricky coordination we're gonna have to do from here
2:04 to make that happen. It turns out it's mind-blowingly simple.
2:08 So see these square brackets? When you have square brackets,
2:11 this generates a list that is fully realized into memory.
2:14 If we change those from square brackets to parentheses, and nothing else,
2:19 this will now generate a generator, and the generator gives up one item at a time
2:24 and does not load them all into memory.
2:27 And because range is lazy as well, we're pulling on a chain. As we pull on
2:32 this generator from load_data, it pulls on range,
2:34 so nothing extra is going to be loaded.
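The bracket-to-parenthesis change described here can be sketched as follows. This is a minimal illustration, not the lecture's actual file: the `count` parameter and the random number bounds are assumptions chosen for the example.

```python
import random
from typing import Iterator, List


def load_data_list(count: int = 1_000) -> List[int]:
    # Square brackets: the whole list is built in memory before returning.
    # (The bounds 1_000..10_000 are assumed for illustration.)
    return [random.randint(1_000, 10_000) for _ in range(count)]


def load_data(count: int = 1_000) -> Iterator[int]:
    # Parentheses: a generator expression that produces one value per request.
    return (random.randint(1_000, 10_000) for _ in range(count))


eager = load_data_list()
lazy = load_data()

print(type(eager).__name__)  # list
print(type(lazy).__name__)   # generator

# Only now is the first random number actually produced.
first = next(lazy)
```

Strictly speaking, `range` is a lazy sequence object rather than a generator, but the effect in this pipeline is the same: it hands out one number at a time.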
2:37 You see this warning, as it's no longer a list we're returning.
2:40 We're gonna return a generator, like so, if we import it.
2:48 Alright, well, that's pretty cool.
2:50 What about this one? This "filter_data"?
2:53 Let's say the filter_data we're going to do in a more interesting way.
2:57 Let's make this an "Iterator", make a little more general,
3:00 I suppose. And then over here,
3:02 we're gonna say this is no longer taking a list. We'll go up,
3:05 and you'll see, now this has got a warning,
3:07 so this is now gonna be something that takes an iterator,
3:11 which would still work for a list, and it's going to return an iterator,
3:15 and let's just keep flowing this along.
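A sketch of what the changed annotations might look like, assuming the `typing.Iterator` hint is the one being used. Type hints are not enforced at runtime, so passing a list still works; a static checker would likely prefer `Iterable[int]` for the parameter, since a list is iterable but is not itself an iterator. The function body here is illustrative.

```python
from typing import Iterator


def filter_data(data: Iterator[int]) -> Iterator[int]:
    # The function only needs to loop over `data`,
    # so any iterable works at runtime.
    return (n for n in data if n % 5 != 0)


from_list = list(filter_data([5, 6, 10, 11]))          # a list is accepted
from_gen = list(filter_data(n for n in range(5, 12)))  # so is a generator
```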
3:20 This is cool. What are we gonna do with this one?
3:22 This is the one we're going to rewrite.
3:24 So let's do this "scale_data" one.
3:25 And our fancy conversion, we'll just put parentheses instead of this, and let's see
3:32 if it works. It probably will run.
3:37 Yes, this is that little part I threw in to make things interesting.
3:40 So we're going to, just, we're going to have to deal with this in just a second.
3:48 But let's throw it away for a minute.
3:49 And how are we doing? We used 31 MB.
3:52 That's less than half of what the better version used.
3:56 Oh, but it's going to get a whole lot better because this filter one in
4:00 the middle is actually creating a list.
4:02 Let's see what we can do around that one.
4:05 Well, again, we could just put parentheses
4:06 there, but I'll show you the general solution. So, we can create these generator expressions,
4:11 but to create a proper generator,
4:14 it's really simple. In Python
4:16 you just use the yield keyword. So we'll say "for n in data", we're gonna do an
4:19 if, "if n mod 5 is not equal to zero",
4:26 then we want to say "here's one of the items". And the way you say that
4:28 is use the yield keyword and then just the item. There,
4:33 that's it. Let's try again,
4:35 Whoohoo! Zero megabytes used! Zero!
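The `yield` version of the filter step, as described in words above, would look roughly like this. The sample input values are made up for the example.

```python
from typing import Iterator


def filter_data(data: Iterator[int]) -> Iterator[int]:
    # A generator function: each `yield` hands one item to the caller
    # and pauses here until the next item is requested.
    for n in data:
        if n % 5 != 0:
            yield n


# Sample input chosen for illustration only.
kept = list(filter_data(iter([10, 11, 12, 15, 16])))
print(kept)
```

This is equivalent to the generator expression `(n for n in data if n % 5 != 0)`; the explicit form becomes useful when the logic no longer fits in a single expression.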
4:39 Now, for those of you who are aware of what generators do, you'll realize we
4:44 haven't actually done any work by pulling on them,
4:47 right? So if we didn't do this,
4:49 we haven't made anything happen. Now doing the slice,
4:53 though, this is a little bit tricky.
4:55 We need to figure out a way to get this data back,
4:58 right? The reason I left these in here is because we want to have this constraint,
5:02 this somewhat realistic constraint of like dealing with the last bit of data or something
5:06 like that. If you're just gonna pass them along and process them one at a
5:09 time, it would look something like this "for s in scaled",
5:13 we could print, I guess we won't print it,
5:16 but we could say "z equals s times s". Just do something with it
5:21 so it goes through all of them.
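To see that building the pipeline does no work until something pulls on it, here is a small sketch. The `calls` list is a hypothetical helper added only to record when items are really produced; it is not part of the lecture code.

```python
from typing import Iterator, List

calls: List[int] = []  # hypothetical: records each item as it is produced


def load_data(count: int) -> Iterator[int]:
    for n in range(count):
        calls.append(n)
        yield n


def scale_data(data: Iterator[int], factor: float) -> Iterator[float]:
    return (n * factor for n in data)


scaled = scale_data(load_data(5), 2.0)
produced_before = len(calls)  # building the pipeline produced nothing yet

for s in scaled:
    z = s * s  # do something with each item so the whole pipeline runs

produced_after = len(calls)  # now every item has been produced
print(produced_before, produced_after)
```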
5:22 Let's see what that does for the memory.
5:26 Oh, my, goodness! 9 megabytes!
5:29 We started at 9, we stayed at 9.
5:31 We actually added to it a little bit,
5:33 but it's like you know what?
5:35 Really? We haven't used any memory.
5:37 We've used less than one megabyte,
5:39 right? Because we're showing it to zero decimal places.
5:41 We've used 200 kilobytes. That is insane.
5:46 That is insane! If we take what we added before,
5:49 which was 63 in the best case, or 90, and it was 83 in the worst
5:53 case, divided by 200 kilobytes,
5:57 that's a 415-times improvement. How awesome is that?
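The ratio quoted here is 83 MB over roughly 0.2 MB, which gives 415. To reproduce the general effect yourself, one option is the standard library's `tracemalloc`. This is an assumption on my part: the transcript does not show how the lecture measures memory, and `tracemalloc` only counts allocations made by Python, so the absolute numbers will differ from the ones on screen.

```python
import tracemalloc
from typing import Callable


def peak_bytes(work: Callable[[], None]) -> int:
    # Run `work` and return the peak traced memory in bytes.
    tracemalloc.start()
    work()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


def with_list() -> None:
    data = [n * 2.0 for n in range(200_000)]  # fully realized in memory
    total = 0.0
    for x in data:
        total += x


def with_generator() -> None:
    data = (n * 2.0 for n in range(200_000))  # one item at a time
    total = 0.0
    for x in data:
        total += x


list_peak = peak_bytes(with_list)
gen_peak = peak_bytes(with_generator)
print(f"list: {list_peak:,} bytes, generator: {gen_peak:,} bytes")
print(f"ratio: {list_peak / max(gen_peak, 1):.0f}x")
```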
6:04 That is so amazing. And what did we have to do?
6:07 Well, we put parentheses rather than square brackets.
6:11 And here we use the yield keyword just to show you the general solution,
6:14 but we could have put parentheses instead of square brackets.
6:17 However, certain things we thought we wanted to do, like this, turned out to be
6:20 tricky because you can't slice them.
6:21 You should really be able to, right. Like you could interpret
6:24 that as like "give me these in this range" or whatever,
6:26 but it doesn't work that way.
6:28 So what we can do is we can come over here and use this thing called
6:31 an "islice". So I'll say "islice", and it itself is a generator,
6:35 so it's not going to realize itself for printing unless we wrap it in a list.
6:40 But this comes from itertools,
6:42 and we say the thing that we would like to slice from 0 to 10, and
6:47 let's give that a shot. Here's the head, and that's exactly what we had before.
6:52 And we're still using zero bytes.
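A short sketch of the `islice` step. The `scaled` generator here is a stand-in with made-up values, not the lecture's data.

```python
from itertools import islice

# Stand-in for the pipeline's final generator.
scaled = (n * 2.0 for n in range(1_000_000))

slice_error = ""
try:
    scaled[0:10]  # generators do not support slicing
except TypeError as err:
    slice_error = str(err)

# islice is lazy too, so wrap it in list() to realize the ten items.
head = list(islice(scaled, 0, 10))
print(head)
```

Only the first ten items are ever produced; the remaining values are never computed unless something keeps pulling on `scaled`.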
6:55 Zero megabytes, still at our 200 K. Now it gets a little more tricky to do this
7:00 tail. I could be misunderstanding or not finding the right library to give us
7:06 the tail, but I don't know how to get it.
7:09 So I'm just going to do a little loop here,
7:11 right? So we're gonna say "for n in scaled".
7:15 Now, you got to be careful.
7:16 We've kind of used up, we've consumed the first 10 and as you go through
7:20 these generators, it doesn't like reset to the beginning.
7:23 So this only really works if there's more than 20, otherwise you would just store them
7:28 in a list. Alright,
7:29 we're dealing with tons of data,
7:30 but anyway, that's what we're doing.
7:32 So we're going through here and we're saying "tail.append(n)", and that would add
7:38 all of them into the list.
7:40 And you know, it's not horrible,
7:41 what's gonna happen with the memory, it's a little bit slow,
7:44 but we've still only used 31 MB instead of 60 or 83 or whatever.
7:49 But we only want the last 10.
7:51 So we'll say "if length of tail is greater than 10,
7:54 we're gonna throw away the one on the front and let it move towards the back".
7:58 So we'll just say "pop zero" and then we go. Takes a tiny moment,
8:05 but like zero megabytes again and now we can get our tail back.
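The tail loop described above can be sketched like this, again with a stand-in `scaled` generator. As an alternative that is not used in the lecture, the standard library's `collections.deque` with `maxlen` keeps only the last N items of any iterable, which does the same job without the `pop(0)`.

```python
from collections import deque
from itertools import islice

scaled = (n * 2.0 for n in range(100))  # stand-in data
head = list(islice(scaled, 0, 10))      # consumes the first 10 items

# The lecture's approach: keep a rolling window of the last 10 items.
tail = []
for n in scaled:  # continues from item 11, it does not restart
    tail.append(n)
    if len(tail) > 10:
        tail.pop(0)  # drop the oldest so only 10 remain

# Alternative (not from the lecture): a bounded deque does the same.
tail_alt = list(deque((n * 2.0 for n in range(100)), maxlen=10))

print(tail)
```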
8:10 Let's just say "tail". See, it takes a moment to go through it,
8:15 but that's because we never actually processed anything.
8:18 We never went through that iteration until we tried to get to the end.
8:22 If you compare these numbers against the ones we had before,
8:25 they're the same numbers. This is the same result.
8:28 What did we have before? We went from 9 MB at the beginning,
8:31 up to like 100. In the good case,
8:34 it was 80 or something like that.
8:37 This time it did not move. It finishes at 9 megabytes.
8:40 So this pattern, this ability to use generators in a pipeline to say we're not
8:46 actually gonna load all the originals,
8:47 we're gonna load one and we're going to go back and we're gonna pass it to
8:52 here, we're gonna pass it to here. So when we try to loop over this,
8:55 like right here, or right here,
8:58 we pull one out of scaled, scaled reaches into the
9:01 generator it has been given, and it pulls one out of here.
9:04 This may pull a couple from original because it could be skipping, all the while transforming
9:10 it in scaled, and then that's gonna pull on this generator,
9:13 which is gonna pull the next random number, driven by the range.
9:17 So it's like one generator pulls on the next and pulls on the next.
9:21 So we're never really loading more than one or two things into the memory at a
9:24 time. It doesn't matter if it's one or a million like it is
9:28 in this case, we use basically zero memory to make that happen.
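The "one generator pulls on the next" behavior can be made visible with a small trace. The `events` list and the three input values are hypothetical, added only to show the order in which things happen.

```python
from typing import Iterator, List

events: List[str] = []  # hypothetical trace of what happens, in order


def load_data() -> Iterator[int]:
    for n in [10, 11, 12]:  # tiny fixed input for illustration
        events.append(f"load {n}")
        yield n


def filter_data(data: Iterator[int]) -> Iterator[int]:
    for n in data:
        if n % 5 != 0:
            events.append(f"filter keeps {n}")
            yield n


def scale_data(data: Iterator[int], factor: float) -> Iterator[float]:
    for n in data:
        events.append(f"scale {n}")
        yield n * factor


for value in scale_data(filter_data(load_data()), 2.0):
    events.append(f"consume {value}")

for line in events:
    print(line)
```

Each item travels through the whole pipeline before the next one is loaded, and the filter pulls twice from `load_data` when it has to skip a value.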
9:33 This is not something that always works.
9:36 But if you have a use case where a generator makes sense,
9:39 use it. Look how awesome this is. The scenario where it doesn't necessarily make sense is
9:44 if you want to create the data and then go over and over and index into
9:48 it, and pass it around and use it again.
9:50 Remember, generators get used up. But we could always just do like this where we
9:56 go through the whole pipeline and then realize the last bit into a list.
10:01 We saw that's still a more than 50% improvement, and it still gives us that,
10:05 like, in-memory-list, work-with-the-data style.
10:08 So we have a bunch of options. So we could do this, sort of, realize it
10:14 as a list, but only at the end and not the intermediate data.
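A sketch of the "realize it only at the end" option, together with the "generators get used up" caveat. The item count, random bounds, seed value, and scale factor are all assumptions made for this example.

```python
import random
from typing import Iterator


def load_data(count: int) -> Iterator[int]:
    return (random.randint(1_000, 10_000) for _ in range(count))


def filter_data(data: Iterator[int]) -> Iterator[int]:
    return (n for n in data if n % 5 != 0)


def scale_data(data: Iterator[int], factor: float) -> Iterator[float]:
    return (n * factor for n in data)


random.seed(42)  # assumed seed, for repeatable output

# Intermediate steps stay lazy; only the final result becomes a list.
scaled = list(scale_data(filter_data(load_data(1_000)), 2.5))
head = scaled[:10]   # ordinary slicing works again
tail = scaled[-10:]

# A generator can only be walked once.
gen = scale_data(iter([1, 2, 3]), 2.0)
first_pass = list(gen)
second_pass = list(gen)  # already exhausted
print(first_pass, second_pass)
```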
10:17 And check that out. We end up using our same design patterns, the same way
10:21 of working with stuff, and we use basically zero extra megabytes.
10:25 So, this is such a cool pattern if you can make it work.