Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 7: Working with Movie Review Text Data in Pandas
Lecture: Using XGBoost to Create a Classification Model
0:00
In this section we're going to make a model that predicts whether a review is positive or negative.
0:04
I'm going to use the XGBoost model. Make sure you have XGBoost installed; we're going to import it. All I'm going to do here is
0:13
I'm going to say x is equal to this TF-IDF DataFrame, and our y is going to be whether each review is positive.
0:23
So what's our TF-IDF DataFrame? That's the DataFrame we stuck onto the end of the other one earlier. So let's look at x. x looks like this.
0:33
It's a bunch of numbers. Each value tells us how important a given term is in a given document; so this cell is basically how important the term "10" is in document 3.
0:43
You can see that the word "zone" appeared in this document, but a lot of these values are zeros. This is sparse, because not all of the reviews contain all of the
0:50
terms. Okay, what does y look like? y is just a series indicating whether each review is positive or negative. And what am I going
0:58
to use scikit-learn to split our data into a training set and a testing set. Why do we want a training set and testing set? Well we want to see how our
1:05
model would perform on data that it hasn't seen before. So what we do is we hold out some subset of the data, train a model, and then with
1:11
the held-out subset we evaluate our model. We already know the true positive and negative labels, but we see
1:18
how well our model predicts them on data it hasn't seen, giving us a feel for how it might perform in the real world.
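The hold-out split described above is done with scikit-learn's train_test_split. A minimal sketch on synthetic data (the split size and random_state here are common defaults, not necessarily what the course uses):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the TF-IDF features and labels.
X, y = make_classification(n_samples=100, n_features=5, random_state=42)

# Hold out 25% of the data for evaluation; stratify keeps the
# positive/negative balance the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # prints "75 25"
```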
1:25
Okay, so we've split up our data; let's train our model. It's really easy to train a model: you just say fit. So we're going to fit it with the x and
1:32
the y. The x are the features, and the y is the label, whether something was positive or negative. That takes a while because we've got a
1:39
lot of columns, but it looks like we did get a model out of that. Now let's evaluate it. One way to evaluate it is to use the score method.
1:46
We're going to pass in the testing data, the data that it hasn't seen, and we're going to pass in the real labels for it, and this is going to
1:52
give us back an accuracy. It looks like it got 78% right. Is 78% good? Well, the answer to that is: it depends.
2:01
It might be good, it might not be. This is saying that roughly four-fifths of the model's positive/negative predictions
2:11
are correct. Now, if you have a situation where you're making a model that predicts, say, fraud, and fraud is not very common, like you
2:21
could imagine fraud might occur in maybe one in a thousand transactions, well, you could make a model that's highly accurate: just always predict not fraud.
2:27
It's 99.9% accurate. So accuracy in and of itself might not be a sufficient metric to determine whether a model is good, but
2:35
it does give us a baseline. This is better than flipping a coin: it's nearly 80% accurate.
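The fraud caveat above can be checked in a few lines. With 1 fraud case in 1,000 (the transcript's own illustration), a model that always predicts "not fraud" is 99.9% accurate while catching zero fraud:

```python
import numpy as np

y_true = np.zeros(1000, dtype=int)
y_true[0] = 1                       # one fraudulent transaction in 1,000
y_pred = np.zeros(1000, dtype=int)  # a "model" that always says not fraud

accuracy = (y_true == y_pred).mean()
print(accuracy)  # prints 0.999
```

This is why metrics like precision and recall matter for imbalanced problems, even though accuracy is a fine starting baseline here.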