Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 5: Cleaning Heart Disease Data in Pandas
Lecture: Creating a Function to Repeat Common Cleanup in the Chol Column
0:00
Our next column is the serum cholesterol column. So we've got our existing code here.
0:06
I'm going to tack on chol, and then I'm going to call describe on the end of that.
0:11
And we can see that, again, we are not getting a lot of useful information out of describe because the column type is kind of messed up.
0:18
Let's do a value counts on that. We can see that we've got, looks like two zeros, which is a little bit weird.
0:26
We also have a question mark and we have a 220. This is telling me that we have string types mixed in with numeric types.
0:32
And we probably have two different types of numeric types, probably integers and floating point values.
0:37
So when we concatenate these together, we did not get a nice clean type. It got a little bit confused.
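The mixed-type situation described above can be reproduced with a small sketch. The values below are made up for illustration and are not the course data; the point is only that a column holding strings alongside integers and floats ends up with the generic object type.

```python
import pandas as pd

# Toy values only: a stand-in for the concatenated chol column.
chol = pd.Series([220, 0, "0", "?", 220.0, "289"], dtype=object, name="chol")

print(chol.dtype)           # object
print(chol.value_counts())  # the string "0" and the integer 0 are tallied separately

# Counting the Python type of each element shows what is mixed together.
type_counts = chol.map(lambda value: type(value).__name__).value_counts()
print(type_counts)
```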
0:44
So this is probably going to be the same thing that we did up above with the blood pressure value.
0:51
So what I'm going to do is I'm going to refactor this and I'm going to make a function here called remove question.
0:59
And you can see that it takes a data frame as the first parameter. Then it takes a column name and then it takes a dtype, a final dtype.
1:06
And I'm just defaulting that to int8 PyArrow. And then I'm going to take the chain that I had up above.
1:13
If we look at the code up above here, I'm just taking that chain and sticking it into
1:21
this function here, replacing the final type with that defaulted type.
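A minimal sketch of what such a function might look like follows. The chain from earlier in the lecture is not shown in this transcript, so the body here is an assumption: it blanks out the question marks, converts what is left to numbers, and casts to the final type. The default of "Int16" (a pandas nullable integer) is also an assumption chosen so the sketch runs without PyArrow installed and so values like 220 fit, since int8 tops out at 127; a PyArrow-backed string such as "int16[pyarrow]" could be passed instead.

```python
import pandas as pd

def remove_question(df, col, dtype="Int16"):
    """Return df[col] with "?" entries made missing, cast to dtype.

    Hypothetical reconstruction; the lecture's actual chain may differ.
    """
    return (
        df[col]
        .where(df[col] != "?")   # keep everything except "?", which becomes NaN
        .pipe(pd.to_numeric)     # parse strings like "0" into numbers
        .astype(dtype)           # final type, overridable per column
    )

# Toy data for illustration only.
df = pd.DataFrame({"chol": pd.Series([220, "0", "?", 289.0], dtype=object)})
cleaned = remove_question(df, "chol")
print(cleaned)
```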
1:25
And then what I'm going to do is I'm going to come down here to my assign and I'm going to replace that chain with a lambda that calls the function.
1:32
Why do I need to use lambda here? Why can't I call this function directly?
1:36
I can't call this function directly because in order to use a function with assign, it
1:43
can only take one parameter, and our function takes more than one. We have to pass in the data frame and the column, so we wrap that in a lambda.
1:51
So assign is able to call it with just the data frame, and then the code inside the lambda sticks in the column that we want. Let's test that out.
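The lambda wrapping can be sketched as below. The helper body, the "Int16" default, the toy values, and the blood pressure column name trestbps are all assumptions for illustration; only the pattern of passing a one-argument lambda to assign comes from the lecture.

```python
import pandas as pd

def remove_question(df, col, dtype="Int16"):
    # Hypothetical cleanup: "?" becomes missing, the rest becomes numeric.
    return pd.to_numeric(df[col].where(df[col] != "?")).astype(dtype)

# Toy data; the column names are assumed, not taken from the transcript.
raw = pd.DataFrame({
    "trestbps": pd.Series([130, "?", "120"], dtype=object),
    "chol": pd.Series(["?", 220, "0"], dtype=object),
})

# assign calls each callable with the data frame as its only argument,
# so each lambda supplies the column name itself.
clean = raw.assign(
    trestbps=lambda df_: remove_question(df_, "trestbps"),
    chol=lambda df_: remove_question(df_, "chol"),
)
print(clean.dtypes)
```

Passing remove_question directly would fail because assign would call it with the data frame alone, leaving the col argument unfilled.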
1:58
I'm going to do that to the previous column to make sure that still works. And I'm also going to do that with the cholesterol.
2:03
And then we'll look at the value counts of cholesterol. Looks like the value counts of cholesterol did work. Let's do a histogram of that.
2:12
And there we go. Oh, this histogram is interesting. It looks like we have a bunch of zeros and then we have a bunch of data along the way.
2:21
I love these visualizations because they let me see what's going on.
2:24
It seems that we have a lot of people that have a value of zero for cholesterol.
2:29
Again, zero is probably not a valid number for cholesterol, but there are also probably values that are missing as well. Let me just validate that here.
2:37
So I'm going to say isna. We saw that we can do this already. And then I'm going to say, let's do a sum of the isna results there.
2:45
There are 30 values that are missing in addition to that. So if this were my data set, what I would do at this point is talk to the subject matter
2:53
expert, someone who knows about this data, and figure out whether a zero is the same as a missing value, and take the appropriate action.
3:00
I might go and change all of those zeros to missing or vice versa, depending on what my end goal was.
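Both directions of that choice can be sketched as follows. The values are toy data, not the course data set, and the "Int16" nullable type is an assumption; which direction is appropriate depends on what the subject matter expert says.

```python
import pandas as pd

# Toy cleaned column: two zeros and one missing value.
chol = pd.Series([220, 0, pd.NA, 289, 0], dtype="Int16", name="chol")

n_missing = int(chol.isna().sum())   # count of missing entries
is_zero = chol.eq(0).fillna(False).astype(bool)
n_zero = int(is_zero.sum())          # count of zero entries

# Option 1: treat zeros as missing.
zeros_as_missing = chol.mask(is_zero)

# Option 2: treat missing values as zeros.
missing_as_zero = chol.fillna(0)

print(n_missing, n_zero)
print(zeros_as_missing)
print(missing_as_zero)
```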