Python Memory Management and Tips Transcripts
Chapter: Memory and functions
Lecture: Dropping intermediate data
0:00 Let's look at this code and think about how we can make it better.
0:04 We'll see that we could make the smallest little change and actually dramatically improve the memory
0:09 usage. But first, I just really quickly recorded how much memory is used at
0:13 the beginning and at the end, and have the printout show that
0:17 exactly. So when I run this,
0:18 you'll see it'll say, used basically 83 MB.
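The measurement code itself isn't shown in the transcript. As a rough sketch, here's one way to record before-and-after memory using only the standard library's tracemalloc; the course may well use a different tool (such as psutil's process RSS), so treat the helper and its numbers as illustrative only:

```python
import tracemalloc


def used_mb() -> float:
    # Size of memory currently allocated by Python objects, in MB.
    # Note: this tracks Python allocations, not the whole process,
    # so absolute numbers will differ from what the video shows.
    current, _peak = tracemalloc.get_traced_memory()
    return current / 1024 ** 2


tracemalloc.start()
before = used_mb()

# Stand-in for the real pipeline work.
data = [x * 1.5 for x in range(1_000_000)]

after = used_mb()
print(f"used basically {after - before:,.0f} MB")
tracemalloc.stop()
```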
0:22 Great. So what I want to do is I'm gonna keep this one around for
0:24 you. I'll go up here and call this one "greedy_main", something like that, because
0:34 this one uses extra memory, and this one here,
0:37 we'll just leave as "main".
0:40 It's gonna be our improved version.
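The actual course code isn't visible in the transcript, so here's a minimal, hypothetical sketch of the shape being described: three pipeline steps, each bound to its own variable, and all three alive until the function returns. The `load_data` helper and the filter/scale steps are stand-ins for whatever the real pipeline does:

```python
def load_data(n: int = 1_000_000) -> list[int]:
    # Hypothetical stand-in for the real data source.
    return list(range(n))


def greedy_main() -> list[float]:
    original = load_data()                          # step 1: ~1,000,000 rows
    filtered = [x for x in original if x % 5 != 0]  # step 2: drops roughly 20%
    scaled = [x * 1.5 for x in filtered]            # step 3: ints become floats
    # original and filtered are never used again past this point, but Python
    # keeps both lists alive until the function returns, because the local
    # variables still hold references to them.
    return scaled
```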
0:43 So thinking about this, why is so much memory being used?
0:46 You can see it grow each time when we run it.
0:49 It's using first 48, and then 60, and then 90.
0:54 And actually, what's happening is, at least in this step, going from original to
0:58 filtered, we're using less data, right?
1:01 We're dropping 20% of the data, from a million rows down to 799,000.
1:06 It should use less memory, not more,
1:07 right? Well, when you think about how reference counting works,
1:10 when do these variables go away?
1:13 The variables go away when they're no longer defined in the function, right?
1:18 There's no optimization that says, Well,
1:20 "original is not used after this line so we can clean it up".
1:23 Some runtimes have that. .NET and its JIT compiler, in production, will clean those
1:27 variables up as soon as possible,
1:28 but in debug, it'll keep them around.
1:30 So, like, if you set a breakpoint here,
1:31 you can see them. But Python doesn't make those types of optimizations.
1:36 What it does is as long as this function is running,
1:39 the variables defined in it still exist.
1:41 So that means original and filtered still have references and won't be cleaned
1:45 up, even though clearly, you know,
1:47 original is not needed after line 39, filtered is not needed after line
1:50 40. How do we fix it?
1:52 Well, one not as beautiful way,
1:55 but really easy way to fix this is to just reuse the variable.
2:00 What if we just call this data?
2:01 Here's the current data in the pipeline.
2:05 And that goes here. And now here's the current data in the pipeline,
2:09 and we're gonna pass that along.
2:10 And now here's the current data in the pipeline,
2:12 and then we're going to work with the data
2:14 at the step it's at. Now, this is not as nice,
2:18 right? If I read this code,
2:20 you know, which data from which step am I working on? Somebody doing a code review
2:25 of this might say, Well,
2:26 "this variable means three different things along the way",
2:29 and that's really crummy, because before
2:30 you had original, filtered, and scaled, which doesn't need as much documentation or as many comments to
2:36 understand what's happening. But here's the thing,
2:40 this reference here, when you go and set the next line like this,
2:44 it replaces it and drops the reference to what was called "original".
2:47 This line is going to drop the reference to what was called "filtered", and so on.
2:51 So we shouldn't be holding on to those from step to step to step.
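As a sketch of the rewrite (same hypothetical pipeline as before, not the course's actual code), the single-variable version looks like this. Each re-assignment of `data` drops the reference to the previous step's list, so it can be freed right away instead of living until the function returns:

```python
def load_data(n: int = 1_000_000) -> list[int]:
    # Hypothetical stand-in for the real data source.
    return list(range(n))


def main() -> list[float]:
    data = load_data()                      # was: original
    data = [x for x in data if x % 5 != 0]  # was: filtered; rebinding `data` drops
                                            # the only reference to the original
                                            # list, so it's freed immediately
    data = [x * 1.5 for x in data]          # was: scaled; the filtered list is
                                            # freed the same way
    return data
```

One subtlety: while each comprehension runs, the previous list is still referenced (it's being iterated), so two adjacent steps briefly coexist in memory. But nothing older than the previous step survives, which is the win over the greedy version.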
2:54 Let's just run it again and see what this, like, literally five-word change means for
2:59 memory. How cool is that?
3:02 So we've come along and we've started out at nine again.
3:07 This is the same. But then notice this step up to 59 was less, and
3:11 then 79, or 78, I guess 79 if you round it up,
3:15 and then we get the data back.
3:16 So this is 78 at the end, minus where we started,
3:19 which is 69 used. And what did we have before? We had 83. I'll call
3:31 this "single variable mode" or something like that,
3:35 Right? So we've saved, not a huge amount,
3:38 but we've saved a non-trivial amount of memory by just using a different variable name.
3:44 How cool is that? So I think that's a pretty big deal.
3:48 The more data that we load, like if this was 10 million or larger,
3:53 it would make a bigger difference.
3:54 If we had more steps, this technique would make a bigger difference,
3:58 right? It comes down to how much, cumulatively, you had to hang on to as you went
4:02 along. I think because we're converting from maybe ints to floats here,
4:06 probably this last step, it takes the most memory.
4:08 So if we started with floats or something like that,
4:11 we could probably see a
4:12 bigger difference. But very cool.
4:15 We were able to basically delete original and delete
4:18 filtered and just keep what we had called "scaled" here to work with, and that was
4:23 it. I think that's super cool.
4:25 I guess a parting comment is, if I was writing this code,
4:28 you know, I would have some kind of comment here, something like: "using a
4:33 single variable name ensures the data is cleaned up as fast as possible." I don't know,
4:41 something like that. I'm not generally a big fan of code comments,
4:44 because usually it means your code is not clear,
4:47 but here we made it unclear on purpose.
4:49 It's worthwhile to reduce that amount of memory,
4:52 definitely in this case, and in some real cases,
4:55 right? This could be huge.
4:56 What we're going to see later is that we could actually do much, much better than
5:00 this. But there will be a performance trade-off to a small degree,
5:04 right? So here's one variation on trying to take this like pipeline of data processing
5:09 and make it much more efficient by not holding onto the intermediate steps.
5:13 We do that by having a single variable name that we're just reusing over and over.