Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 5: Cleaning Heart Disease Data in Pandas
Lecture: Understanding the Heart Data to Cleanup

0:00 In this next section, we're going to do a quick overview of our columns.
0:04 We're going to understand what columns we have and what are the types of those columns.
0:07 Let's run the describe method. This gives me summary statistics. It looks like I've got 920 non-blank entries for each of these numeric values.
0:16 Look at my minimum values, my maximum values. It looks like these are all integer-like values. They don't go very high.
0:22 They're probably being stored as 64-bit integers, and from the looks of this, they could all be 8-bit integers.
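As a rough illustration of the check described here, the sketch below runs `describe` and compares each column's minimum and maximum against the int8 range. The frame, its column names, and its values are made-up stand-ins, not the course's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the heart disease frame; names and values are invented.
df = pd.DataFrame({"age": [63, 67, 41, 29], "sex": [1, 1, 0, 1], "num": [0, 2, 1, 4]})

# Summary statistics: count, mean, std, min, quartiles, max for numeric columns.
stats = df.describe()
print(stats.loc[["count", "min", "max"]])

# An 8-bit signed integer holds -128..127, so compare each column's range to that.
limits = np.iinfo(np.int8)
fits_int8 = (stats.loc["min"] >= limits.min) & (stats.loc["max"] <= limits.max)
print(fits_int8)
```

A column that reports False here would need a wider type such as int16.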
0:31 Let's look at our string columns. We've got two string columns. You can see that these actually look numeric, but they've got question marks in them,
0:40 so we'll show how to deal with those. Let's look at object columns in here, and we've got a few object columns as well.
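One way to see how many question marks are hiding in text columns is to select them by type and count the placeholder. This is a minimal sketch on invented data; the column names `trestbps` and `chol` are assumptions used for illustration only.

```python
import pandas as pd

# Hypothetical columns that look numeric but carry "?" placeholders.
df = pd.DataFrame({
    "trestbps": pd.Series(["130", "?", "120"], dtype="object"),
    "chol": pd.Series(["233", "250", "?"], dtype="object"),
    "age": [63, 67, 41],
})

# Keep only the object columns, then count the "?" entries in each one.
text_cols = df.select_dtypes(include="object")
question_counts = (text_cols == "?").sum()
print(question_counts)
```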
0:51 Let's examine our dtypes, and we can see that we have a bunch of object columns in here, which is a little concerning given that we are using PyArrow.
1:05 If you look at our code up above here, we are using the PyArrow backend, and the engine is PyArrow. Where did these object columns come from?
1:15 These probably came from us reading multiple files, and some of the files had missing values in them.
1:20 Some of them did not, and so we ended up getting different types in them.
1:25 Because we don't have proper PyArrow types in here, I do want to clean these up, and we'll show the process for doing that.
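The sketch below shows how combining pieces with different types can leave a column as object, and one possible cleanup. It is not the course's code: the data is invented, and it uses pandas' nullable `Int64` dtype as a stand-in so it runs without PyArrow installed. With the PyArrow backend, the analogous dtype string would be `"int64[pyarrow]"`.

```python
import pandas as pd

# Two hypothetical file-sized pieces: one parsed as integers, one kept "?" as text.
part_a = pd.DataFrame({"chol": [233, 250]})
part_b = pd.DataFrame({"chol": pd.Series(["?", "204"], dtype="object")})

# Mixing an integer column with a text column falls back to object.
combined = pd.concat([part_a, part_b], ignore_index=True)
print(combined.dtypes)

cleaned = combined.copy()
for col in combined.select_dtypes(include="object").columns:
    # Blank out the "?" placeholder, parse what is left, then use a nullable integer.
    without_marks = cleaned[col].mask(cleaned[col] == "?")
    cleaned[col] = pd.to_numeric(without_marks).astype("Int64")

print(cleaned.dtypes)
```

Masking only the known placeholder, rather than coercing every unparseable value, means any other unexpected string raises an error instead of silently becoming missing.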
1:32 This data comes with a little descriptive file here, and if you look at this, the goal is... Scroll down a little bit, we can find the goal.
1:42 It says in this section 4, the database has 76 attributes, but we're using 14 of them, and the goal field refers to the presence of heart disease.
1:53 So it's an integer from 0 (no presence) to 4. And then we have a bunch of other attributes down here.
2:00 Here are the 14 that are used, and then there's some documentation about that down here.
2:10 Interesting that section 9 says that missing values have the value of minus 9.
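Since the documentation names one missing-value marker and the data showed another, it can help to count both before deciding how to clean. The following is a sketch on an invented column; whether -9 actually appears in these files is not something the lecture confirms.

```python
import pandas as pd

# Hypothetical column mixing the documented marker (-9) with the observed one ("?").
raw = pd.Series(["140", "?", "-9", "120"], dtype="object", name="trestbps")

# Count each candidate marker to see which ones really occur.
markers = ["?", "-9"]
marker_counts = {m: int((raw == m).sum()) for m in markers}
print(marker_counts)

# Treat both markers as missing, then parse the rest as nullable integers.
numeric = pd.to_numeric(raw.mask(raw.isin(markers))).astype("Int64")
print(numeric)
```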
2:15 We saw some question marks in there, so we might have to dive into that a little bit. In this section, I showed a quick dive into my data.
2:22 We did summary statistics, and then we selected the different types. We did notice that we had an issue with object types being in there.
2:31 Again, you do want to check those dtypes, and make sure that all of your columns have the proper types.
2:36 We also noticed that some of our numeric types probably don't need to be 64-bit integers.
2:42 They can probably be shrunk, and we can save a little bit of space if we need to.
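To make the space saving concrete, here is a small sketch that downcasts integer columns with `pd.to_numeric(..., downcast="integer")` and compares memory use before and after. The frame is invented, and this is one possible approach rather than the method used later in the course.

```python
import pandas as pd

# Hypothetical small-valued integer columns, stored as 64-bit by default.
df = pd.DataFrame({"age": [63, 67, 41, 29], "num": [0, 2, 1, 4]})
before = int(df.memory_usage(index=False).sum())

# downcast="integer" picks the smallest integer type that holds each column's values.
small = df.apply(pd.to_numeric, downcast="integer")
after = int(small.memory_usage(index=False).sum())

print(small.dtypes)
print(before, after)
```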

