Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 5: Cleaning Heart Disease Data in Pandas
Lecture: Dealing with ? Characters in the Trestbps Numeric Column
0:00
The next column that we're going to look at is the resting blood pressure column.
0:04
Now I've got my chain that I built up. Again, I'm just going to keep building this up.
0:08
This is where I was after our last column cleanup. So let's pull off that
0:13
resting blood pressure column. It looks like it is an object column. Let's describe that.
0:20
We don't get super useful information here. We get a count, the unique entries. There are 103
0:26
unique entries. The top entry is 104. The bottom entry is 94. But we saw that there are also
0:32
values that aren't numeric in there. So to understand those, I'm going to do value counts.
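As a rough sketch of that inspection step, the snippet below calls `describe` and `value_counts` on a small made-up object column. The series `raw` and its values are assumptions for illustration, not the course dataset.

```python
import pandas as pd

# Hypothetical stand-in for the resting blood pressure column:
# integers, numeric strings, and '?' placeholders mixed in one object column.
raw = pd.Series(
    [120, "130", "?", 120, "138.0", "?", 120],
    dtype=object,
    name="trestbps",
)

# For an object column, describe() reports count, unique, top, and freq
# rather than numeric statistics such as mean or max.
summary = raw.describe()
print(summary)

# value_counts() tallies every distinct entry, which exposes the '?' rows.
counts = raw.value_counts()
print(counts)
```

On an object column, `top` is the most common entry and `freq` is how often it occurs, so neither tells you about the numeric range.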
0:37
That's our friend. You can see that there is a question mark in there. It looks like we've got
0:41
floating point numbers in there. And we also have integer values in there. That probably explains
0:46
why we have that object type in there. So here's what I'm going to do. I'm going to say, let's
0:51
replace question mark with none. And then let's convert that whole thing to an integer. In this
0:57
case, I'm going to do an unsigned 8-bit integer. And I get an error. It says it could not convert
1:05
95 with the type string to an unsigned 8-bit integer. Again, we're seeing that PyArrow is a
1:12
little bit picky about type conversion. So I'm going to have to jump through a little hoop to
1:16
clean this up. So instead of going directly from the object type to PyArrow, I'm first going to
1:23
convert this to a string type. And then from that, I'm going to convert that to a PyArrow type. Let's
1:28
run that and see if that works. And it looks like we got an error here. What's our error? It says
1:36
it could not convert 138 as a uint8. Okay, I'm probably going to have to jump through a few more
1:44
hoops here. Let's see if we can do this. So I'm going to convert from a string to a float and
1:50
then to an int16. And it looks like that works. So let's just do summary statistics now. And
2:00
it looks like our values go from 0 to 230. We do have some missing values. You can see that we have
2:07
861 as the count. Remember that count is the number of non-missing values. In this section,
2:13
I showed changing those types and cleaning them up. We did see that PyArrow is a little bit picky
2:19
about changing types. Sometimes we have to go through a few more conversions to get them to convert to the correct types.
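The string, then float, then integer route can be sketched on a small made-up series. Two assumptions to flag: the series `raw` is invented for illustration, and the sketch uses pandas' nullable `Float64` and `Int16` dtypes in place of the PyArrow-backed types from the lecture, so that it runs without pyarrow installed.

```python
import pandas as pd

# Hypothetical stand-in for the resting blood pressure column.
raw = pd.Series([130, "138.0", "?", "95", 120.0], dtype=object, name="trestbps")

cleaned = (
    raw
    .replace("?", pd.NA)   # treat the '?' placeholder as missing
    .astype("string")      # step 1: every entry becomes text ('130', '138.0', ...)
    .astype("Float64")     # step 2: parse the text as floating point
    .astype("Int16")       # step 3: store the whole numbers as 16-bit integers
)

print(cleaned)
print(cleaned.describe())  # count covers only the non-missing values
```

Going through float first matters because text such as `'138.0'` does not parse directly as an integer. With PyArrow installed, the last step would target `'int16[pyarrow]'` as in the lecture; the exact intermediate casts that PyArrow accepts may vary by pandas version.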