Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 5: Cleaning Heart Disease Data in Pandas
Lecture: Dealing with ? Characters in the Trestbps Numeric Column

0:00 The next column that we're going to look at is the resting blood pressure column.
0:04 Now I've got my chain that I built up. Again, I'm just going to keep building this up.
0:08 This is where I was after our last column cleanup. So let's pull off that
0:13 resting blood pressure column. It looks like it is an object column. Let's describe that.
0:20 We don't get super useful information here. We get a count, the unique entries. There are 103
0:26 unique entries. The top entry is 104. The bottom entry is 94. But we saw that there are also
0:32 values that aren't numeric in there. So to understand those, I'm going to do value counts.
0:37 That's our friend. You can see that there is a question mark in there. It looks like we've got
0:41 floating point numbers in there. And we also have integer values in there. That probably explains
0:46 why we have that object type in there. So here's what I'm going to do. I'm going to say, let's
0:51 replace the question mark with None. And then let's convert that whole thing to an integer. In this
0:57 case, I'm going to do an unsigned 8-bit integer. And I get an error. It says it could not convert
1:05 95 with the type string to an unsigned 8-bit integer. Again, we're seeing that PyArrow is a
1:12 little bit picky about type conversion. So I'm going to have to jump through a little hoop to
1:16 clean this up. So instead of going directly from the object type to PyArrow, I'm first going to
1:23 convert this to a string type. And then from that, I'm going to convert that to a PyArrow type. Let's
1:28 run that and see if that works. And it looks like we got an error here. What's our error? It says
1:36 it could not convert 138 as a uint8. Okay, I'm probably going to have to jump through a few more
1:44 hoops here. Let's see if we can do this. So I'm going to convert from a string to a float and
1:50 then to an int16. And it looks like that works. So let's just do summary statistics now. And
2:00 it looks like our values go from 0 to 230. We do have some missing values. You can see that we have
2:07 861 as the count. Remember that count is the number of non-missing values. In this section,
2:13 I showed changing those types and cleaning them up. We did see that PyArrow is a little bit picky
about changing types. Sometimes we have to go through a few more conversions to get them to convert to the correct types.
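
The cleanup described in this lecture can be sketched roughly as follows. This is a minimal illustration, not the code shown in the video: the sample values and the helper name clean_trestbps are made up for the example, and it swaps in pandas' nullable Int16 dtype instead of the PyArrow-backed types used in the lecture so that it runs without pyarrow installed. The idea is the same: turn the ? placeholder into a missing value, pass through a float, and only then convert to a 16-bit integer.

```python
import pandas as pd

# Hypothetical sample mimicking the messy column: integers, floats,
# numeric strings, and a "?" placeholder, all stored as object dtype.
raw = pd.Series([130, "138.0", "?", 95.0, "120"], dtype="object", name="trestbps")


def clean_trestbps(ser: pd.Series) -> pd.Series:
    """Convert a messy blood-pressure column to a nullable 16-bit integer.

    Hypothetical helper for illustration only.
    """
    return (
        ser
        .mask(ser == "?")     # treat "?" as missing (becomes NaN)
        .pipe(pd.to_numeric)  # parse strings such as "138.0" into floats
        .astype("Int16")      # nullable integer dtype keeps missing values
    )


cleaned = clean_trestbps(raw)

# value_counts on the raw column exposes the "?" entries;
# describe on the cleaned column reports count as the non-missing total.
print(raw.value_counts(dropna=False))
print(cleaned.describe())
```

Going through a float first matters because a string like "138.0" is not a valid integer literal, so a direct string-to-integer conversion can fail even when the number itself would fit.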

