Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 8: Predicting Heart Disease with Machine Learning
Lecture: Preparing a Pandas Dataset to Create an XGBoost Model

0:00 Okay, we're going to get our data ready for prediction now. Let's look at our columns.
0:04 We want to make sure that we have certain columns and that they're in the right order.
0:09 Let's look at our target here. This is num, again, zero meaning no heart disease, above zero
0:15 meaning heart disease. We can look at the value counts of those. So we have a lot with no heart
0:20 disease. And there's about 400 that do have heart disease, but they're spread among those other
0:24 values. Are there missing values in here? That's another thing that we need to be aware of. You can
0:29 see that there are missing values. Let's quantify that. Which columns have missing values? We just
0:35 do an any on that, which collapses it down to one value per column. We can do a sum to count them. You can see that ca has a lot
0:42 of missing values. We can do a mean to quantify that. So you can see that 66% of ca is missing.
0:51 We might want to make sure that num doesn't have missing values. You can see that it doesn't here.
0:55 We can also do a value_counts and pass dropna=False; if there were missing values, they would show up there.
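
A minimal sketch of those checks, assuming the heart disease DataFrame is named hd and the target column is num (hd is a hypothetical name; use whatever your notebook calls it):

```python
# Distribution of the target: 0 = no heart disease, 1-4 = heart disease
hd['num'].value_counts()

# Which columns have any missing values?
hd.isna().any()

# How many missing values per column?
hd.isna().sum()

# What fraction of each column is missing? (ca is around 66% here)
hd.isna().mean()

# Confirm the target itself has no missing values
hd['num'].value_counts(dropna=False)
```
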
1:01 Okay, let's look at the dtypes. And our dtypes are mostly PyArrow-backed. We do have a categorical
1:09 column in there as well. Let's look at the int types. There aren't any int64 types. Let's look at
1:19 the number types. These are all the columns that hold numbers.
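
The dtype inspection being described might look roughly like this, again assuming the DataFrame is called hd:

```python
# Mostly PyArrow-backed dtypes, plus one categorical column
hd.dtypes

# No plain NumPy int64 columns here (they're PyArrow-backed)
hd.select_dtypes(int).columns

# All of the numeric columns
hd.select_dtypes('number')
```
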
1:23 That's looking okay. What I'm going to do now is I'm going to use our XGBoost model. XGBoost
1:31 basically makes a decision tree that's going to predict a value and then it makes subsequent
1:36 decision trees that correct the value. I like to compare this to golfing. If you think of a decision
1:41 tree, a single decision tree can hit the ball once and it might be some amount off from the hole.
1:46 XGBoost is like a model that can hit the ball once. That's the first tree. And then subsequent
1:52 trees look at the error, how far the ball is off the hole, and try and fix that. And so you can
1:58 make multiple trees. Those are called boosters that try and get the ball in the hole or in essence
2:04 make the correct prediction. That's why this is a super useful model because it tends to get the
2:09 ball pretty close to the hole. Okay, so what am I going to do? I'm going to make X. And X is a
2:14 common variable name that we use for our data here. Let's look at this chain that I have here.
2:18 I'm going to take all of my object columns and I'm going to convert those to categoricals.
2:24 And I'm going to take all of my number columns and convert those to just normal numpy floats.
2:30 This is a cool trick that we can do with pandas. So this right here, what I've highlighted,
2:35 is a DataFrame. If I stick a double star (**) in front of it, it's going to unpack it. If I unpack it inside
2:42 an assign, it's going to replace those columns. So this is basically replacing the object columns
2:46 with categoricals. Same thing here. This is taking all the number columns and converting those to
2:51 float. Now after I've done that, I'm going to say let's make sex a category, fbs a float,
2:58 exang a float, and slope a category. And then I'm going to drop the num column. So that is my X.
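
That chain might look roughly like the following; hd is the same hypothetical DataFrame name as above, and the exact selectors could differ from the notebook, so treat this as a sketch rather than the lecture's exact code:

```python
X = (hd
     .assign(
         # ** unpacks each selected sub-DataFrame into keyword arguments,
         # so the converted columns replace the originals of the same name
         **hd.select_dtypes('object').astype('category'),
         **hd.select_dtypes('number').astype(float)
     )
     # a few columns still need individual treatment
     .astype({'sex': 'category', 'fbs': float,
              'exang': float, 'slope': 'category'})
     # X is everything except the label column
     .drop(columns='num')
)
```
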
3:06 It's everything but the num column. And then our y is going to be those labels, those num
3:11 labels there. Let's split that up. I'm going to use scikit-learn to split it up. Remember,
3:16 we want to split up our data. We want to train our model on some portion of data. And then we
3:21 want to have a holdout, some data that it hasn't seen that we have the real labels for. And we can
3:26 evaluate how our model performs with that data that it hasn't seen. Why do we need a holdout?
3:31 Well, if you evaluate your model on data it has already seen, it's really easy to make a model that
3:35 performs well. You just memorize the data. But in the real world, that's not going to work very well,
3:39 memorizing the data, because presumably the data is going to come in a little bit different and
3:43 you won't have exact matches. In fact, if you had exact matches, you wouldn't have to use machine
3:47 learning because you could just memorize the data and you could use an if statement to check whether
3:51 they're an exact match or not. Okay, so we've split up our data. And what I'm going to do is
3:57 make an XGBoost classifier. I'm going to say enable_categorical=True. So because we have those
4:02 categorical columns, we're going to enable that. And we need to set the tree_method to 'hist' so it can
4:07 handle the categorical features. And then we just call fit with our training data. Let's let that run.
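
Continuing the sketch, the split and the model fit could look like this; the test_size, random_state, and stratify arguments are illustrative choices rather than values given in the lecture:

```python
from sklearn.model_selection import train_test_split
import xgboost as xgb

# y is the num column that was dropped from X
y = hd['num']

# Hold out data the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Native categorical support needs enable_categorical plus the hist tree method
model = xgb.XGBClassifier(enable_categorical=True, tree_method='hist')
model.fit(X_train, y_train)
```
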
4:14 And it looked like that worked. Let's evaluate our model on our testing data. It looks like it gets
4:21 58% correct. Is 58% the right number? It might be, it might not be. We know that it's better than
4:27 guessing. So further evaluation might be useful to understand what's going on there. Let's look
4:32 at how it did on the training data. And you can see that it actually did really well on the training
4:37 data, indicating that we might be overfitting. What does overfitting mean? It means that your
4:43 model is memorizing or extracting too much information from what it was trained on and
4:49 it's not generalizing as well as it could be. My experience has shown that XGBoost does tend to
4:55 slightly overfit out of the box, even though it tends to get pretty good results.
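
The evaluation calls might be as simple as the scikit-learn score method (accuracy) on each split; the percentages in the comments are the ones quoted in the lecture:

```python
# Accuracy on the held-out test data -- about 58% in the lecture
print(model.score(X_test, y_test))

# Accuracy on the training data -- much higher here, which suggests overfitting
print(model.score(X_train, y_train))
```
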

