Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 8: Predicting Heart Disease with Machine Learning
Lecture: Combining Multiple Datasets with Pandas and concat
0:00
We're going to look at a heart disease data set. This comes, again, from the University of California, Irvine Machine Learning Repository.
0:07
Let's load our imports here and let's load our data. Our data frame looks like this. We're predicting whether someone has heart disease or not.
0:14
We actually saw this data set previously. Once we have our data, we can go through it and start looking at it.
0:21
For example, I can pull up the fbs column and look at the value counts for that. We can ask, what are the dtypes for this? It looks like they're all doubles.
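A minimal sketch of this kind of inspection, assuming a small hand-made frame in place of the real data. The rows and the variable name `heart` are hypothetical, and `fbs` is assumed to be the column the speaker is referring to:

```python
import pandas as pd

# Hypothetical stand-in rows; the lecture loads the real UCI data instead.
heart = pd.DataFrame({
    "age": [63.0, 67.0, 41.0, 56.0],
    "sex": [1.0, 1.0, 0.0, 1.0],
    "fbs": [1.0, 0.0, 0.0, 0.0],
})

# How often does each value occur in one column?
fbs_counts = heart["fbs"].value_counts()
print(fbs_counts)

# What type does pandas use for each column?
print(heart.dtypes)
```

Every column in this toy frame is float64, which is what "doubles" means here.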
0:30
Here is my tweak_heart function. This probably looks similar to what we saw before. Let's run this code and let's look at the output of that.
0:39
There's the output of that. Let me just walk through this code. This is a chain here. The first thing we're doing is converting types,
0:48
and then we're making a sex column. We're replacing one with male and zero with female. For thal, we are replacing the numbers with normal,
0:59
fixed, and reversible, and we're changing that to a categorical type. For slope, we're also changing that to text values there.
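The chain described above might look roughly like the following. This is a sketch, not the code shown in the video: the function name `tweak_heart`, the sample rows, and the numeric codes for `thal` (3, 6, 7) and `slope` (1, 2, 3) are assumptions based on the usual UCI encoding, and only a few columns are included.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; all columns start out as floats, as in the lecture.
raw = pd.DataFrame({
    "age": [63.0, 67.0, 41.0],
    "sex": [1.0, 1.0, 0.0],
    "thal": [6.0, 3.0, np.nan],
    "slope": [3.0, 2.0, 1.0],
})

def tweak_heart(df: pd.DataFrame) -> pd.DataFrame:
    """Clean the frame in one chain: convert types, then relabel codes."""
    return (
        df
        # Step 1: shrink numeric columns that have no missing values.
        .astype({"age": "int8"})
        .assign(
            # Step 2: 1 becomes male, 0 becomes female.
            sex=lambda d: d["sex"].map({1.0: "male", 0.0: "female"}),
            # Step 3: relabel thal and store it as a categorical.
            # The codes 3/6/7 are an assumption; missing values stay missing.
            thal=lambda d: d["thal"]
            .map({3.0: "normal", 6.0: "fixed", 7.0: "reversible"})
            .astype("category"),
            # Step 4: relabel slope as text. The codes 1/2/3 are an assumption.
            slope=lambda d: d["slope"].map(
                {1.0: "upsloping", 2.0: "flat", 3.0: "downsloping"}
            ),
        )
    )

clean = tweak_heart(raw)
print(clean)
print(clean.dtypes)
```

Writing the steps as one chain keeps the original frame untouched, so the cleaning can be rerun from the raw data at any time.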
1:08
There's our output after doing that. Let's just look at our memory usage after doing that.
1:15
Here's our original memory usage and here's our cleaned up memory usage. It's gone down. We're using about 40 percent of the original memory usage.
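One way to make this comparison is with `DataFrame.memory_usage`. The sketch below uses a hypothetical frame and a simple type conversion, so the ratio it prints will not match the figure quoted in the lecture:

```python
import pandas as pd

# Hypothetical data: 1,000 rows of float64 values.
raw = pd.DataFrame({
    "age": [63.0, 67.0, 41.0, 56.0] * 250,
    "sex": [1.0, 1.0, 0.0, 1.0] * 250,
})

# A stand-in for the cleanup step: store small whole numbers as int8.
clean = raw.astype({"age": "int8", "sex": "int8"})

# deep=True also counts the contents of object (string) columns.
raw_bytes = raw.memory_usage(deep=True).sum()
clean_bytes = clean.memory_usage(deep=True).sum()

print(f"original: {raw_bytes} bytes, cleaned: {clean_bytes} bytes")
print(f"cleaned frame uses {clean_bytes / raw_bytes:.0%} of the original memory")
```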
1:23
At this point, we've loaded our data and cleaned it up a little bit. Now, we're not done yet. We need to do a little bit more processing
1:29
of the data to get it ready for machine learning. Many machine learning models don't work with non-numeric data.
1:35
It turns out that models like XGBoost are more flexible and they will work with non-numeric data, which is nice.
1:41
Also, a lot of machine learning models don't work with missing values like NaN. It turns out XGBoost does work with NaN,
1:48
so that makes it a little easier to shove the data into XGBoost. But if you want to compare it to other models like,
1:54
say, logistic regression or linear regression, you would have to do further processing on your data to make it work with those models.
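As a rough illustration of that further processing, the sketch below fills missing numeric values with the column median and one-hot encodes the text columns, using only pandas. The frame, the column names, and the choice of median imputation are assumptions for the example, not the steps taken in the course:

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned frame with one missing number and one missing category.
clean = pd.DataFrame({
    "age": [63, 67, 41, 56],
    "chol": [233.0, np.nan, 204.0, 236.0],
    "sex": ["male", "male", "female", "male"],
    "thal": pd.Categorical(["fixed", "normal", None, "reversible"]),
})

numeric_cols = clean.select_dtypes(include="number").columns

X = (
    clean
    # Replace missing numeric values with each column's median.
    .assign(**{col: clean[col].fillna(clean[col].median()) for col in numeric_cols})
    # Turn each text or categorical column into 0/1 indicator columns.
    # A missing category becomes a row of all zeros for that column group.
    .pipe(pd.get_dummies, columns=["sex", "thal"], dtype="int8")
)

print(X)
```

The result has only numeric columns and no NaN, which is the form a model like logistic regression expects. In a real workflow the median would be computed on the training split only.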