Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 5: Cleaning Heart Disease Data in Pandas
Lecture: Refactoring to a Function in Pandas for Cleanup
0:00
After I've gone through this whole process, what I want to do now is make this into a function so that I can use this code really easily.
0:07
Let me show you how to do that. Here's our original data, and we're just loading our original data frame. This is the raw data.
0:15
I do like to have the raw data around so I can clean it up. Then here is my remove question function.
0:22
We had that before, and I like to make this little tweak function. The tweak function is going to take in
0:26
the raw data and it's just going to return our chain. Let's run that and make sure that that works. It looks like that did work.
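A minimal sketch of the tweak-function pattern described here. The toy data, the column names, the body of `remove_question`, and the name `tweak_heart` are assumptions for illustration; the notebook's actual chain is not reproduced in the transcript.

```python
import pandas as pd

# Hypothetical stand-in for the raw heart disease data; the real column
# names and values in the course notebook may differ.
raw = pd.DataFrame({
    "age": ["63", "67", "41"],
    "chol": ["233", "?", "204"],
})

def remove_question(df, col):
    """Assumed helper: treat '?' as missing and parse the rest as numbers."""
    return pd.to_numeric(df[col].where(df[col] != "?"))

def tweak_heart(df):
    """Take the raw frame and return the cleaned frame as a single chain."""
    return (
        df
        .assign(
            age=lambda df_: remove_question(df_, "age"),
            chol=lambda df_: remove_question(df_, "chol"),
        )
    )

# The raw frame is left untouched, so both versions stay available.
heart = tweak_heart(raw)
```

Because `assign` returns a new DataFrame, `raw` still holds the original strings after the call, which is what makes it possible to trace a single row back through the chain later.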
0:32
Once you've done this, this is really nice. What I like to do is I like to take this code and move it to the top of my notebook.
0:39
I have the code that loads the raw data. When I come to my notebook, I can just load the raw data,
0:44
and then I have the code right below it that cleans it up. I don't have to run through 50 cells to do that.
0:48
I'm ready to go with both the raw data and the cleaned up data. Why do I like to work with raw data?
0:54
I like to work with raw data because in my experience, it seems like the boss or whoever asked me to do this process,
1:00
comes back and asks me some questions. If I don't have the raw data, it's hard to track through and explain
1:06
what went on with a particular example, and that's typically, why did this row do this? Well, if I have the raw data,
1:12
I can trace through that chain and see what happened. Now, if you look at this, if you squint at this, it looks like we have a bunch of lambdas here.
1:20
Can we clean up this code a little bit? Let's try and do that. What I'm going to do is I'm going to take
1:25
a dictionary called types and just map the columns to those types. Then I'm going to use a dictionary comprehension in here and say,
1:35
let's stick this dictionary comprehension inside of the assign. This is the dictionary comprehension with a curly brace,
1:43
and that double star unpacks it and sticks it into our assign, as in each key from the dictionary comprehension is a parameter for the assign method.
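To make the unpacking step concrete, here is a small sketch that uses plain Series values rather than lambdas. The frame and the column names `double` and `triple` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# A dictionary comprehension: each key will become a column name.
new_cols = {
    name: df["a"] * factor
    for name, factor in {"double": 2, "triple": 3}.items()
}

# ** turns each key/value pair into a keyword argument of assign ...
unpacked = df.assign(**new_cols)

# ... so the call above is equivalent to spelling the keywords out by hand.
by_hand = df.assign(double=df["a"] * 2, triple=df["a"] * 3)
```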
1:54
Let's do that and see if that works. It looks like it works. However, if you look closer, it looks like all of these columns here have the same value.
2:05
That's disconcerting. What is going on there? The issue here is that Python, if you look at this,
2:15
we are looping over column and dtype in here in a loop. When you stick this into a lambda, lambda sees column over here,
2:24
and it's just going to use the last column from the loop, which is annoying. How do we get around that?
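The problem can be reproduced on a tiny frame. This is a sketch with made-up data and a hypothetical `types` mapping, not the notebook's actual code.

```python
import pandas as pd

raw = pd.DataFrame({"age": ["63", "67"], "chol": ["233", "286"]})
types = {"age": "int64", "chol": "float64"}

# Buggy: the lambdas are only created inside the comprehension. They look up
# `col` and `dtype` later, when assign calls them, and by then the loop has
# finished and both names hold the last item ("chol", "float64").
buggy = raw.assign(**{
    col: lambda df_: pd.to_numeric(df_[col]).astype(dtype)
    for col, dtype in types.items()
})

print(buggy)  # both columns contain the chol values
```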
2:31
We get around that by doing this little step here. When we create the lambda, we pass in the column. Why does that work?
2:43
Well, now we are passing in the column when the lambda is created. We are not evaluating the column when the lambda is executed.
2:51
When we execute this for loop, it hasn't evaluated the lambdas, it's just created the lambdas. It's going to take that last value of column
2:59
unless we pass that in directly, and then that is set when the lambda is created, not when the lambda is evaluated.
3:07
That's a little tweak that you might need to do if you want to refactor that. Let's run that and see if that works. That looks like it does work.
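One common way to pass the column in at creation time is a default argument on the lambda. The sketch below assumes that is the tweak being shown, and again uses made-up data and a hypothetical `types` mapping.

```python
import pandas as pd

raw = pd.DataFrame({"age": ["63", "67"], "chol": ["233", "286"]})
types = {"age": "int64", "chol": "float64"}

# Fixed: default argument values are evaluated when each lambda is created,
# so every lambda keeps its own col and dtype from that loop iteration.
fixed = raw.assign(**{
    col: lambda df_, col=col, dtype=dtype: pd.to_numeric(df_[col]).astype(dtype)
    for col, dtype in types.items()
})

print(fixed)  # age and chol now hold their own values
```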
3:14
You can see that our values are not all the same. This example is just showing that we could also use this syntax here and do a loc.
3:22
We're pulling off those columns and we're just going to convert all of those columns to floating point values. Could we do that? We could.
3:32
Why don't these all end up looking like floating point values? Because then afterwards, we're going to use this astype to specify the types as well.
3:40
My preference here is probably to use this commented-out version because then I don't have to put a secondary astype in there.
3:48
Note that if I comment this out, we get a failure because PyArrow really wants us to convert from strings to numbers before we do that astype.
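A rough sketch of that two-step order: convert the string columns to numbers first, then set the final types with astype. The data, the `cols` list, and the use of `pd.to_numeric` with `errors="coerce"` are assumptions, and NumPy dtypes stand in for the PyArrow-backed types so the example does not depend on PyArrow being installed.

```python
import pandas as pd

raw = pd.DataFrame({"age": ["63", "67"], "chol": ["233", "?"]})
cols = ["age", "chol"]

def tweak_heart(df):
    return (
        df
        # Step 1: pull off the columns with loc and parse the strings as
        # floats; anything unparseable (such as "?") becomes missing.
        .assign(**df.loc[:, cols]
                .apply(pd.to_numeric, errors="coerce")
                .astype(float))
        # Step 2: now that the values are numeric, set the final types.
        .astype({"age": "int64", "chol": "float64"})
    )

clean = tweak_heart(raw)
```

The second astype is why the columns do not all end up as floats: the intermediate float conversion only exists to get from strings to numbers.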
3:58
In this section, I showed you how to make a function and gave you some of the benefits behind that. Make sure you're doing this with your code.
4:04
I don't care if you chain. I think chaining is going to make your code better, but I do think a really good practice is once you've got
4:11
that code to clean up your data, move it to a function and put that at the top. That's going to make your life using notebooks a lot easier.