Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 8: Predicting Heart Disease with Machine Learning
Lecture: Exploring heart disease with aggregations and scatterplots
0:00
In this section we're going to explore the data a little bit more and just try and understand what's going on in there.
0:05
Oftentimes this is an important step before making a machine learning model. The better you can understand the data, the better your model will be.
0:13
Okay, so what we're going to try and predict is whether someone has heart disease. That's this num column here.
0:18
And so I'm going to group by num and look at the mean values for all the numeric columns in there.
0:23
And I'm going to color that. I'm just going to see if there are things that stand out here.
0:28
So remember zero means no heart disease. Everything else means that there is some heart disease. It looks like zero is red for age, meaning
0:37
people without heart disease apparently tend to be younger. Their cp value is lower. Their
0:43
trestbps value is lower. It looks like their cholesterol value is actually higher here. The restecg value, it looks like that is lower as well.
0:56
The thalach column is higher and the oldpeak is lower and the ca is lower as well. Coloring these makes that really easy to see.
1:05
If we didn't color this, let me just show you what it would look like. This is nice data, but it's just hard to see what's going on there.
1:13
So I do like sticking that background gradient on that. I said axis equals index.
1:18
So we are coloring each of the columns down the index. That's what that means. Here's an alternate view of that.
1:25
I'm just transposing that. That's what this T is right here.
1:28
Then I'm sticking that style in there as well. We might also want to consider doing a correlation.
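A minimal sketch of the grouped means and the column-wise coloring described above. The rows are made up for illustration, the `RdBu` colormap is an assumption, and the helper names `color_by_column` and `color_transposed` are hypothetical, not from the lecture.

```python
import pandas as pd

# Made-up rows for illustration only; not the lecture's dataset.
heart = pd.DataFrame({
    "age": [41, 45, 58, 62, 66, 70],
    "thalach": [178, 170, 150, 141, 130, 118],
    "oldpeak": [0.0, 0.2, 1.1, 1.8, 2.5, 3.0],
    "num": [0, 0, 1, 1, 2, 3],
})

# Mean of every numeric column for each value of the target column.
means = heart.groupby("num").mean(numeric_only=True)


def color_by_column(frame):
    """Return a Styler that shades each column separately, down the index."""
    # axis="index" computes the gradient within each column.
    # The colormap name is an assumption; any matplotlib colormap works.
    return frame.style.background_gradient(cmap="RdBu", axis="index")


def color_transposed(frame):
    """Return a Styler for the transposed view of the same table."""
    # After .T each original column is a row, so shade along axis="columns".
    return frame.T.style.background_gradient(cmap="RdBu", axis="columns")
```

In a notebook, evaluating `color_by_column(means)` or `color_transposed(means)` in a cell displays the shaded table. Rendering the gradient needs matplotlib and jinja2 installed.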
1:34
So I'm going to do a Spearman correlation coefficient here. Let's see if there's anything that correlates. What I'm looking for is whether something
1:41
correlates with num here. You can see that cp has a pretty strong correlation. ca has a pretty strong correlation. thalach has a negative correlation.
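The correlation step can be sketched roughly like this. The rows are made up for illustration, and pulling out and sorting the `num` column is an added convenience, not something shown in the lecture.

```python
import pandas as pd

# Made-up rows for illustration only; not the lecture's dataset.
heart = pd.DataFrame({
    "cp": [1, 2, 2, 3, 4, 4],
    "ca": [0, 0, 1, 1, 2, 3],
    "thalach": [178, 170, 150, 141, 130, 118],
    "num": [0, 0, 1, 1, 2, 3],
})

# Spearman correlates ranks rather than raw values, so it picks up
# monotonic relationships even when they are not linear.
corr = heart.corr(method="spearman", numeric_only=True)

# Keep only the correlations with the target, sorted from most negative
# to most positive.
target_corr = corr["num"].drop("num").sort_values()
```

With these toy rows, `thalach` ends up at the negative end of `target_corr` and `cp` and `ca` at the positive end, which mirrors the direction of the relationships described above.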
1:51
So maybe we want to come in here and scatter that and see what's going on here.
1:55
So I might say let's look at the relationship between num and thalach. This has a slight negative correlation.
2:01
It's a little bit hard to see that. This is really dark. So I might clean this up a little bit. Remember we did have jittering before.
2:08
So that's one way that we can look at that. We can also adjust the alpha there.
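Here is one way jittering and alpha might look in code. This is a sketch under assumptions: the `jitter` and `plot_jittered` helpers are hypothetical (the lecture's own jitter code is not shown in this transcript), the noise amount is arbitrary, and the rows are made up.

```python
import numpy as np
import pandas as pd


def jitter(series, amount=0.3, seed=42):
    """Add uniform noise in [-amount, amount] so stacked points spread out."""
    rng = np.random.default_rng(seed)
    return series + rng.uniform(-amount, amount, size=len(series))


# Made-up rows for illustration only; not the lecture's dataset.
heart = pd.DataFrame({
    "num": [0, 0, 0, 0, 1, 2, 3],
    "thalach": [178, 172, 170, 165, 150, 130, 118],
})

# Jitter only the discrete column; thalach is left untouched.
jittered = heart.assign(num=jitter(heart["num"]))


def plot_jittered(frame, x="num", y="thalach", alpha=0.3):
    """Scatter with transparency so dense regions show up darker."""
    # Requires matplotlib; alpha is passed through to the scatter call.
    return frame.plot.scatter(x=x, y=y, alpha=alpha)
```

Calling `plot_jittered(jittered)` in a notebook draws the plot. Jitter spreads out points that share the same `num` value, and a low alpha lets overlapping points build up, so the larger group becomes visible.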
2:12
So this, I think, tells a completely different story than the plot up here.
2:16
Yeah, you can see that if num is zero, thalach tends to be higher, and it goes down as num goes up. It also
2:22
shows that we have a lot more zero entries than we do the other ones.
2:26
It's not really clear from looking at the original plot that there are a lot more zero entries there.
2:30
Let's look at the scatter plot between num and cp. This is, I would say, less than useful.
2:38
It looks like it's just points on a grid. So I'm going to jitter both the X and the Y here.
2:44
Here's the code to do that, and let's scatter that and see what's going on there. It looks like as the cp values go up,
2:52
there is more heart disease. There aren't many lower values for that. Let's also look at a categorical relationship here.
3:02
We're just going to group by sex and then we're going to look at the various values for those. So it looks like in this case male has a
3:12
higher value for num here. Again, that's your heart disease. So presumably males, at least in this sample, are more likely to have heart disease.
3:22
Because that was encoded as a categorical, we didn't see that male pop up over here. If we left it as a binary zero or one, we would see that
3:34
represented. Let me show you how I could do that here. I could say heart and then say assign, sex equals heart
3:51
dot sex equals equals male. So now we have the sex there, and then we can say let's do a corr here of this.
4:02
That gives an error because PyArrow doesn't like non-numeric values. So I'm going to say numeric only is equal to true.
4:12
If we look at sex here, we should see a relationship, and I'm not seeing sex pop in there. So I'm going to convert this to an
4:21
integer with astype, and we'll do int8 PyArrow. And now we have sex pop in there. Okay, so there is a slight
4:39
positive correlation there with num, meaning that as sex goes to male, num tends to go up as well. A higher num indicates
4:50
heart disease. In this section I showed some exploration. It's often very useful to explore your data.
4:56
Look at those relationships, especially between that target variable that you're trying to predict.
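The categorical comparison and the binary encoding of sex from earlier in this lecture can be sketched as follows. The rows are made up, the `site` text column is hypothetical, and this sketch uses a NumPy `int8` rather than the PyArrow-backed type from the lecture so that it runs without pyarrow installed. The lecture does not say which correlation method it used at this step; the pandas default (Pearson) is assumed here.

```python
import pandas as pd

# Made-up rows for illustration only; not the lecture's dataset.
heart = pd.DataFrame({
    "sex": ["male", "male", "female", "male", "female", "female"],
    "site": ["a", "b", "a", "b", "a", "b"],  # hypothetical text column
    "age": [63, 58, 41, 67, 45, 60],
    "num": [2, 1, 0, 3, 0, 1],
})

# Categorical view: average target value for each sex.
by_sex = heart.groupby("sex")["num"].mean()

# Binary view: 1 for male, 0 otherwise, stored as a small integer so the
# column takes part in the correlation.
encoded = heart.assign(sex=(heart["sex"] == "male").astype("int8"))

# numeric_only=True skips the remaining text column instead of raising.
corr = encoded.corr(numeric_only=True)
sex_vs_num = corr.loc["sex", "num"]
```

With these toy rows, `by_sex` is higher for male than for female and `sex_vs_num` is positive, which is the same direction the lecture reports for its data.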