Fundamentals of Dask Transcripts
Chapter: Dask-ML
Lecture: K-nearest neighbors with scikit-learn
0:00
So now it's time to build our very first machine learning model together.
0:04
We'll be using a basic model here called 'K-nearest neighbors classification'. If you haven't seen that before,
0:10
you can check it out in scikit-learn's documentation or anywhere else for that matter.
0:15
Essentially, it creates a model which makes a prediction for a point based on its neighbors, the other points near it.
0:23
scikit-learn actually makes it super easy to train this model, as we'll see. So first we import the KNeighborsClassifier.
0:32
We're doing some timing here just to make sure that everything is running smoothly. What we then do is instantiate the
0:37
classifier, passing it the keyword argument n_neighbors, which specifies how many data points it wants to look at
0:45
around each point in order to perform the prediction.
0:49
And then we fit this model we've just built using the fit() method and pass it the features and the target. Great. That took next to no time at all,
1:01
800 milliseconds. Now, what we can do is use this model, clf, either to predict on new points or to check out the score as well,
1:12
see how well it performs. So what we're going to do is see how well this model performs on our original data.
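The steps described so far can be sketched as follows. The lecture's actual features and target aren't shown in this excerpt, so a synthetic dataset stands in for them:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the lecture's features (X) and target (y)
X, y = make_classification(n_samples=1_000, n_features=4, random_state=0)

# n_neighbors: how many nearby points each prediction consults
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

# For a classifier, score() reports accuracy: the fraction of
# points predicted correctly
print(clf.score(X, y))
```

Note that this scores the model on the same data it was trained on, which, as the lecture points out next, overestimates how well it would do on new points.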
1:19
Now, in general, you don't want to do this. You want to see how well it performs on a holdout set. You may have seen
1:25
train_test_split or cross-validation previously, which we'll get to in a minute. But just to see how this works, and for the purpose of learning,
1:32
we're going to look at the score on the data set that we trained it on. So let's execute that. Now, this may take a bit longer.
1:44
Now, the reason this may take a bit longer is that to train the model, you essentially just need to store all the data points. But to score the model,
1:52
you need to compute the distances between each point and its nearest three points, and we're doing that for 100,000 points currently.
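This asymmetry can be seen by timing the two steps separately. A rough sketch, again on a synthetic dataset (smaller than the lecture's 100,000 points, but large enough to show the effect):

```python
import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data; the lecture uses ~100,000 points
X, y = make_classification(n_samples=20_000, n_features=10, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)

t0 = time.perf_counter()
clf.fit(X, y)                  # cheap: mostly stores the training points
fit_time = time.perf_counter() - t0

t0 = time.perf_counter()
clf.score(X, y)                # expensive: computes neighbor distances for every point
score_time = time.perf_counter() - t0

print(f"fit: {fit_time:.3f}s, score: {score_time:.3f}s")
```

On most machines the scoring step takes noticeably longer than the fit, which is exactly what the lecture observes.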
2:03
Great. And what we see is that we had a score of 0.93. Now, the score we've computed here, as we've written above,
2:12
is the accuracy. This is the fraction of the data the model gets right. So this model got 93% of the points correct. Now, of course,
2:20
this is an overestimate because we're computing the score on the data we used to build the
2:25
model, and we'll see soon, with cross-validation, how to do that differently.