Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 8: Predicting Heart Disease with Machine Learning
Lecture: Preparing a Pandas Dataset to Create an XGBoost Model
0:00
Okay, we're going to get our data ready for prediction now. Let's look at our columns.
0:04
We want to make sure that we have certain columns and that they're in the right order.
0:09
Let's look at our target here. This is num: again, zero meaning no heart disease, above zero
0:15
meaning heart disease. We can look at the value counts of those. So we have a lot with no heart
0:20
disease. And there are about 400 that do have heart disease, but they're spread among those other
0:24
values. Are there missing values in here? That's another thing that we need to be aware of. You can
0:29
see that there are missing values. Let's quantify that. Which columns have missing values? We just
0:35
do an any on that, which collapses it down to one value per column. We can do a sum to count them. You can see that CA has a lot
0:42
of missing values. We can do a mean to quantify that. So you can see that 66% of CA is missing.
0:51
We might want to make sure that num doesn't have missing values. You can see that it doesn't here.
0:55
We can also do a value counts with dropna set to False; if there were missing values, they would show up there.
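Here's a minimal sketch of those checks in code, assuming the heart disease data is already loaded in a DataFrame named df with a target column named num (those names are assumptions, since they aren't spelled out here):

```python
# Assumes the heart disease data is already loaded in a DataFrame named df
# and that the target column is named num (both names are assumptions).
print(df["num"].value_counts())              # distribution of the target labels
print(df.isna().any())                       # which columns contain any missing values
print(df.isna().sum())                       # count of missing values per column
print(df.isna().mean())                      # fraction missing per column (about 0.66 for ca)
print(df["num"].isna().sum())                # confirm the target itself has no missing values
print(df["num"].value_counts(dropna=False))  # NaN would appear here if the target had any
```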
1:01
Okay, let's look at the dtypes. And our dtypes are mostly PyArrow. We do have a categorical
1:09
value in there as well. There aren't any int64 types in there. Let's look at
1:19
the number types. These are all the columns that are numbers.
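A small sketch of those dtype checks, under the same df assumption as above:

```python
# Same assumed df as above.
print(df.dtypes)                       # mostly PyArrow-backed dtypes, plus a categorical
print(df.select_dtypes(int).columns)   # empty: no plain int64 columns
print(df.select_dtypes("number"))      # just the numeric columns
```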
1:23
That's looking okay. What I'm going to do now is use an XGBoost model. XGBoost
1:31
basically makes a decision tree that's going to predict a value and then it makes subsequent
1:36
decision trees that correct the value. I like to compare this to golfing. If you think of a decision
1:41
tree, a single decision tree can hit the ball once and it might be some amount off from the hole.
1:46
XGBoost is like a model that can hit the ball once. That's the first tree. And then subsequent
1:52
trees look at the error, how far the ball is from the hole, and try to fix that. And so you can
1:58
make multiple trees, called boosters, that try to get the ball in the hole, or in essence
2:04
make the correct prediction. That's why this is a super useful model: it tends to get the ball pretty close to the hole.
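Here's a toy illustration of that boosting idea, using plain scikit-learn decision trees rather than XGBoost itself; the data and variable names are made up just to show one tree correcting another's leftover error:

```python
# Toy residual-correction demo (not the course's code): tree1 takes the first
# "swing", tree2 is fit on the leftover error and corrects it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo.ravel()) + rng.normal(0, 0.1, size=200)

tree1 = DecisionTreeRegressor(max_depth=2).fit(X_demo, y_demo)
residual = y_demo - tree1.predict(X_demo)          # how far the ball is from the hole
tree2 = DecisionTreeRegressor(max_depth=2).fit(X_demo, residual)

pred_one_tree = tree1.predict(X_demo)
pred_boosted = tree1.predict(X_demo) + tree2.predict(X_demo)
print(np.abs(y_demo - pred_one_tree).mean())       # error after one "swing"
print(np.abs(y_demo - pred_boosted).mean())        # smaller error after the correction
```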
2:09
Okay, so what am I going to do? I'm going to make X. And X is a
2:14
common variable name that we use for our data here. Let's look at this chain that I have here.
2:18
I'm going to take all of my object columns and I'm going to convert those to categoricals.
2:24
And I'm going to take all of my number columns and convert those to just normal numpy floats.
2:30
This is a cool trick that we can do with pandas. So this right here, what I've highlighted,
2:35
is a DataFrame. If I stick a double star (**) in front of it, it's going to unpack it. If I unpack it inside
2:42
an assign, it's going to replace those columns. So this is basically replacing the object columns
2:46
with categoricals. Same thing here. This is taking all the number columns and converting those to
2:51
float. Now after I've done that, I'm going to make sex a category, fbs a float,
2:58
exang a float, and slope a category. And then I'm going to drop the num column. So that is my X.
3:06
It's everything but the num column. And then our y is going to be those labels, those num labels there.
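A hedged sketch of that chain; the column names (sex, fbs, exang, slope, num) follow the UCI heart disease data, and the astype step is my reconstruction of what's described:

```python
# Reconstruction of the chain described above; column names are assumptions
# based on the UCI heart disease data.
X = (df
     .assign(**df.select_dtypes("object").astype("category"),   # object columns -> category
             **df.select_dtypes("number").astype(float))        # numeric columns -> plain floats
     .astype({"sex": "category", "fbs": float,
              "exang": float, "slope": "category"})
     .drop(columns="num"))                                      # everything but the target
y = df["num"]                                                   # the labels
```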
3:11
Let's split that up. I'm going to use scikit-learn to split it up. Remember,
3:16
we want to split up our data. We want to train our model on some portion of data. And then we
3:21
want to have a holdout, some data that it hasn't seen that we have the real labels for. And we can
3:26
evaluate how our model performs with that data that it hasn't seen. Why do we need a holdout?
3:31
Well, if you evaluate your model on data that it has seen, it's really easy to make a model that
3:35
performs well. You just memorize the data. But in the real world, that's not going to work very well,
3:39
memorizing the data, because presumably the data is going to come in a little bit different and
3:43
you won't have exact matches. In fact, if you had exact matches, you wouldn't have to use machine
3:47
learning, because you could just memorize the data and use an if statement to check whether they're an exact match or not.
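A minimal sketch of the split, assuming scikit-learn's train_test_split; the test_size and random_state values are placeholders of mine, not necessarily the ones used in the video:

```python
# Sketch of the split; test_size and random_state here are placeholder values.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```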
3:51
Okay, so we've split up our data. And what I'm going to do is
3:57
make an XGBoost classifier. I'm going to set enable categorical. Because we have those
4:02
categoricals, we're going to enable that. And we need to set the tree method to hist so it can
4:07
handle categorical features. And then we just call fit with our training data. Let's let that run.
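A sketch of that model setup, assuming xgboost's scikit-learn wrapper:

```python
# enable_categorical and tree_method="hist" are the parameters described above.
from xgboost import XGBClassifier

model = XGBClassifier(enable_categorical=True, tree_method="hist")
model.fit(X_train, y_train)
```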
4:14
And it looked like that worked. Let's evaluate our model on our testing data. It looks like it gets 58% correct.
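That evaluation would look roughly like this, using the classifier's built-in accuracy score:

```python
# Mean accuracy on the held-out test data; in the video this is about 0.58.
print(model.score(X_test, y_test))
```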
4:21
Is 58% the right number? It might be, it might not be. We know that it's better than
4:27
guessing. So further evaluation might be useful to understand what's going on there. Let's look
4:32
at how it did on the training data. And you can see that it actually did really well on the training
4:37
data, indicating that we might be overfitting. What does overfitting mean? It means that your
4:43
model is memorizing or extracting too much information from what it was trained on and
4:49
it's not generalizing as well as it could be. My experience has shown that XGBoost does tend to
4:55
slightly overfit out of the box, even though it tends to get pretty good results.
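A sketch of that train-versus-test comparison; the gap between the two scores is the overfitting signal described above:

```python
# A large gap between these two accuracies suggests overfitting.
print(model.score(X_train, y_train))   # much higher on data the model has seen
print(model.score(X_test, y_test))     # noticeably lower on unseen data
```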