Python Memory Management and Tips Transcripts
Chapter: Memory and functions
Lecture: Dropping intermediate data
0:00
Let's look at this code and think about how we can make it better.
0:04
We'll see that we could make the smallest little change and actually dramatically improve the memory
0:09
usage. But first, I just really quickly recorded how much memory is used at
0:13
the beginning and at the end, and have the printout show it
0:17
exactly. So when I run this,
0:18
you'll see it'll say, used basically 83 MB.
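A rough sketch of the kind of before-and-after measurement being described, assuming the psutil package (the actual helper used in the course isn't shown in this transcript and may work differently):

import psutil

def used_memory_mb() -> float:
    # Resident set size of the current process, in MB.
    return psutil.Process().memory_info().rss / 1024 ** 2

start_mb = used_memory_mb()
# ... run the pipeline here ...
print(f"used {used_memory_mb() - start_mb:.0f} MB")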
0:22
Great. So what I want to do is I'm gonna keep this one around for
0:24
you. I'll go up here and call this one "greedy_main", something like that, because
0:34
this one uses extra memory and this one here,
0:37
we'll just leave this as "main".
0:40
It's gonna be our improved version.
0:43
So thinking about this, why is so much memory being used?
0:46
You can see it grow each time when we run it.
0:49
It's using first 48, and then 60, and then 90.
0:54
And actually, what's happening is, at least at this step, going from original to
0:58
filtered, we're using less data, right?
1:01
We're dropping out 20% of the data from a million to 799,000.
1:06
Should use less, not more,
1:07
right? Well, when you think about how reference counting works,
1:10
when do these variables go away?
1:13
The variables go away when the function they're defined in finishes running, right?
1:18
There's no optimization that says, Well,
1:20
"original is not used after this line so we can clean it up".
1:23
Some things have that, like .NET: its JIT compiler in production will clean those
1:27
variables up as soon as possible,
1:28
but in debug, it'll keep it around.
1:30
So, like if you set a breakpoint here,
1:31
you can see it. But Python doesn't make those types of optimizations.
1:36
What it does is as long as this function is running,
1:39
the variables defined in it still exist.
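To picture the shape of the greedy version being discussed, here is a hypothetical sketch; load_data, filter_data, and scale_data are made-up placeholder names, not the course's real functions:

import random

def load_data(n=1_000_000):
    # Placeholder for the real loading step: a million small ints.
    return [random.randint(0, 100) for _ in range(n)]

def filter_data(rows):
    # Placeholder filter that drops roughly 20% of the records.
    return [r for r in rows if r >= 20]

def scale_data(rows):
    # Placeholder scaling step; the ints become floats here.
    return [r / 100 for r in rows]

def greedy_main():
    original = load_data()
    filtered = filter_data(original)
    scaled = scale_data(filtered)
    # original and filtered are still referenced by these local names, so
    # reference counting can't free those lists until the function returns.
    print(f"kept {len(scaled):,} of 1,000,000 records")

greedy_main()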
1:41
So that means original and filtered still have references to them and won't be cleaned
1:45
up even though clearly, you know,
1:47
original is not needed after line 39, filtered is not needed after line
1:50
40. How do we fix it?
1:52
Well, one not as beautiful way,
1:55
but a really easy way to fix this is to just reuse the variable.
2:00
What if we just call this "data"?
2:01
Here's the current data in the pipeline.
2:05
And that goes here. And now here's the current data in the pipeline,
2:09
and we're gonna pass that along.
2:10
And now here's the current data in the pipeline,
2:12
and then we're going to work with the data
2:14
at the step it's at. Now, this is not as nice,
2:18
right? If I read this code,
2:20
you know, which data from which step am I working on? Somebody doing code review
2:25
like this might say, Well,
2:26
"this variable means three different things along the way",
2:29
and that's really crummy, because here
2:30
you had original, filtered, and scaled, which doesn't need as much documentation or as many comments to
2:36
understand what's happening. But here's the thing,
2:40
this reference here, when you go and set the next line like this,
2:44
it replaces it and drops the reference to what was called "original".
2:47
This line is going to drop the reference to what was called "filtered" and so on.
2:51
So we shouldn't be holding on to those from step to step to step.
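And a sketch of that single-variable version, reusing the placeholder helpers from the sketch above:

def main():
    # Each reassignment drops the reference to the previous step's list,
    # so it can be freed immediately instead of at the end of the function.
    data = load_data()
    data = filter_data(data)   # the list from load_data can be released here
    data = scale_data(data)    # the filtered list can be released here
    print(f"kept {len(data):,} records")

main()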
2:54
Let's just run it again and see what this, like, literally five-word change means for
2:59
memory. How cool is that?
3:02
So we've come along, and we've started it again.
3:07
This is the same. But then notice this step up to 59 was less, and
3:11
then 79, or 78, I guess 79 if you round it up,
3:15
and then we get the data back.
3:16
So this is 78, and the final
3:19
is 69. And what did we have before? We had 83. I'll call
3:31
this one "single variable mode" or something like that,
3:35
Right? So we've saved, not a huge amount,
3:38
but we've saved a non-trivial amount of memory just by reusing a single variable name.
3:44
How cool is that? So I think that's a pretty big deal.
3:48
The more data that we load, like if this was 10 million or larger,
3:53
it would make a bigger difference.
3:54
If we had more steps, this technique would make a bigger difference,
3:58
right? It's about how much, cumulatively, you had to hang on to as you went
4:02
along. I think because we're converting from maybe ints to floats here,
4:06
probably this last step, it takes the most memory.
4:08
So if we started with floats or something like that,
4:11
we could probably see a
4:12
bigger difference. But very cool.
4:15
We were able to basically delete original and delete
4:18
filtered and just keep what we had called "scaled" here to work with, and that was
4:23
it. I think that's super cool.
4:25
I guess a parting comment is if I was writing this code,
4:28
you know, I would have some kind of comment here that says something like: using a
4:33
single variable name ensures the data is cleaned up as fast as possible. I don't know,
4:41
Something like this. I'm not generally a big fan of code comments,
4:44
because usually it means your code is not clear,
4:47
but here we made it unclear on purpose.
4:49
It's worthwhile to reduce that amount of memory,
4:52
definitely in this case, and in some real cases,
4:55
right? This could be huge.
4:56
What we're going to see later is that we could actually do much, much better than
5:00
this. But there will be a performance trade-off to a small degree,
5:04
right? So here's one variation on trying to take this like pipeline of data processing
5:09
and make it much more efficient by not holding onto the intermediate steps.
5:13
We do that by having a single variable name that we're just reusing over and over.