Python Memory Management and Tips Transcripts
Chapter: Memory and functions
Lecture: Dropping intermediate data

0:00 Let's look at this code and think about how we can make it better.
0:05 We'll see that we could make the smallest little change and actually dramatically improve the memory
0:10 usage. But first, I just really quickly recorded how much memory is used at the beginning and at the end, and have the printout show that
0:18 exactly. So when I run this, you'll see it'll say used basically 83 MB. Great. So what I want to do is I'm gonna keep this one around for
0:25 you. I'll go up here and call this one "greedy_main", something like that, because this one uses extra memory, and this one here,
0:38 we'll just leave this as "main". It's gonna be our improved version. So thinking about this, why is so much memory being used?
0:47 You can see it grow each time we run it. It's using first 48, then 60, and then 90.
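
(The measurement helper isn't shown in the transcript; here is a minimal sketch of how those printouts might be produced. The psutil resident-set-size reading is an assumption, not necessarily what the course's actual helper uses.)

    import psutil

    def memory_used_mb() -> float:
        # Resident set size of this process, converted to MB.
        return psutil.Process().memory_info().rss / 1024 ** 2

    print(f"Memory used: {memory_used_mb():,.0f} MB")
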
0:55 And actually, what's happening is, at least from this step, from going from original to filtered, we're using less data, right?
1:02 We're dropping out 20% of the data, from a million to 799,000. It should use less, not more,
1:08 right? Well, when you think about how reference counting works, when do these variables go away?
1:14 The variables go away when they're no longer defined in the function, right? There's no optimization that says,
1:21 "Well, original is not used after this line, so we can clean it up." Some things have that: .NET and its JIT compiler in production will clean those
1:28 variables up as soon as possible, but in debug, it'll keep them around. So if you set a breakpoint here,
1:32 you can see it. But Python doesn't make those types of optimizations. What it does is as long as this function is running,
1:40 the variables defined in it still exist. So that means original and filtered still have references to their data and won't be cleaned
1:46 up, even though clearly, you know, original is not needed after line 39 and filtered is not needed after line 40. How do we fix it?
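
(The code on screen isn't included in the transcript; this is a sketch of the shape being described. The helper names load_data, filter_data, and scale_data, and the exact sizes, are assumptions, not the course's actual functions.)

    def load_data(n=1_000_000):
        # Build the starting dataset; a list of ints stands in for real records.
        return list(range(n))

    def filter_data(data):
        # Drop roughly 20% of the items, as in the lecture (1,000,000 -> ~800,000).
        return [x for x in data if x % 5 != 0]

    def scale_data(data):
        # Converting ints to floats; the step the lecture guesses costs the most.
        return [x * 1.5 for x in data]

    def greedy_main():
        original = load_data()
        filtered = filter_data(original)  # original is never used again...
        scaled = scale_data(filtered)     # ...but both stay referenced until return
        print(f"Processed {len(scaled):,} records")

    greedy_main()
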
1:53 Well, one not-as-beautiful but really easy way to fix this is to just reuse the variable. What if we just call this data?
2:02 Here's the current data in the pipeline. And that goes here. And now here's the current data in the pipeline,
2:10 and we're gonna pass that along. And now here's the current data in the pipeline, and then we're going to work with the data
2:15 at the step it's at. Now, this is not as nice, right? If I read this code,
2:21 how do I know which data from which step I'm working on? Somebody doing a code review like this might say,
2:27 "Well, this variable means three different things along the way," and that's really crummy, because before
2:31 you had original, filtered, and scaled, which doesn't need as much documentation or as many comments to understand what's happening. But here's the thing:
2:41 this reference here, when you go and set the next line like this, it replaces it and drops the reference to what was called "original".
2:48 This line is going to drop the reference to what was called "filtered", and so on. So we're no longer holding on to those from step to step to step.
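
(Again a sketch, reusing the assumed helpers from the snippet above: the same pipeline after the single-variable change.)

    def main():
        data = load_data()
        data = filter_data(data)   # rebinding releases the original list
        data = scale_data(data)    # rebinding releases the filtered list
        print(f"Processed {len(data):,} records")

    main()
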
2:55 Let's just run it again and see what this, like, literally five-word change means for memory. How cool is that?
3:03 So we've come along and we've run it again. This is the same. But then notice this step up to 59 was less, and
3:12 then 79, or 78, I guess 79 if you round it up, and then we get the data back. So this is 78, and the final
3:20 is 69. And what did we have before? We had 83. Call this "single variable mode" or something like that,
3:36 Right? So we've saved, not a huge amount, but we've saved a non-trivial amount of memory by just reusing a single variable name.
3:45 How cool is that? So I think that's a pretty big deal. The more data that we load, like if this was 10 million or larger,
3:54 it would make a bigger difference. If we had more steps, this technique would make a bigger difference,
3:59 right? It's about how much, cumulatively, you had to hang on to as you went along. I think because we're converting from maybe ints to floats here,
4:07 this last step probably takes the most memory. So if we started with floats or something like that, we could probably see a
4:13 bigger difference. But very cool. We were able to basically delete original and delete
4:19 filtered and just keep what we had called "scaled" here to work with, and that was it. I think that's super cool.
4:26 I guess a parting comment is, if I was writing this code, you know, I would have some kind of comment here that says something like "using a
4:34 single variable name ensures data is cleaned up as fast as possible." I don't know, something like this. I'm not generally a big fan of code comments,
4:45 because usually it means your code is not clear, but here we made it unclear on purpose. It's worthwhile to reduce that amount of memory,
4:53 definitely in this case, and in some real cases, right? This could be huge.
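
(A sketch of where such a comment might live; the wording is paraphrased from the lecture, not the actual course code.)

    def main():
        # NOTE: deliberately reusing one variable name ("data") so each
        # intermediate result is released as soon as possible. Don't
        # refactor back to original/filtered/scaled without measuring memory.
        ...
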
4:57 What we're going to see later is that we could actually do much, much better than this. But there will be a performance trade-off to a small degree,
5:05 right? So here's one variation on trying to take this like pipeline of data processing
5:10 and make it much more efficient by not holding onto the intermediate steps.
5:14 We do that by having a single variable name that we're just reusing over and over.

