Fundamentals of Dask Transcripts
Chapter: Dask-ML
Lecture: K-nearest neighbors with scikit-learn

0:00 So now it's time to build our very first machine learning model together.
0:04 We'll be using a basic model here called 'K-nearest neighbors classification'. If you haven't seen that before,
0:10 you can check it out in scikit-learn's documentation, or anywhere else for that matter.
0:15 Essentially, it creates a model which makes a prediction for a point based on its neighbors, that is, the other points near it.
0:23 scikit-learn actually makes it super easy to train this model, as we'll see. So first we import the KNeighborsClassifier.
0:32 We're doing some timing here just to make sure that everything is running smoothly. What we then do is instantiate the classifier,
0:37 passing it the keyword argument 'n_neighbors', which specifies how many data points
0:45 each point wants to look at around it in order to perform the prediction.
0:49 And then we fit this model we've just built using the 'fit()' method and pass it the features and the target. Great. That took next to no time at all,
1:01 800 milliseconds. Now, what we can do is use this model, 'clf', either to predict on new points or to check out its score as well,
1:12 to see how well it performs. So what we're going to do is see how well this model performs on our original data.
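The course notebook itself isn't reproduced in the transcript, so here is a minimal sketch of the steps described so far, with a synthetic dataset from `make_classification` standing in for the course's features and target:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for the course's features (X) and target (y)
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

# n_neighbors controls how many nearby points each prediction consults
clf = KNeighborsClassifier(n_neighbors=3)

# Fitting a KNN model is quick: it essentially just stores the training data
clf.fit(X, y)
```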
1:19 Now, in general, you don't want to do this. You want to see how well it performs on a holdout set. You may have seen train_test_split
1:25 or cross-validation previously, which we'll get to in a minute. But just to see how this works, and for the purpose of learning,
1:32 we're going to look at the score on the data set that we trained it on. So let's execute that. Now, this may take a bit longer.
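Scoring the fitted model on the same data it was trained on, as described here, is a single call to `score` (again sketched with a synthetic stand-in dataset):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# For a classifier, score() returns accuracy: the fraction of points
# whose predicted label matches the true label
train_accuracy = clf.score(X, y)
```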
1:44 Now, the reason this may take a bit longer is that to fit the model, you essentially just store all the data points. But to score the model,
1:52 you need to compute the distances between each point and its nearest three points, and we're doing that for 100,000 points currently.
2:03 Great. And what we see is that we had a score of 0.93. Now, this score we've computed here, as we've written above,
2:12 is the accuracy. This is the fraction of the data the model gets right. So this model got 93% of the points correct. Now, of course,
2:20 this is an overestimate because we're computing the score on the data we used to build the
2:25 model, and we'll figure out soon with cross-validation how to do that differently.
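As a preview of the holdout and cross-validation ideas just mentioned, here is one way they might look in scikit-learn (the exact course code may differ, and the dataset is again a synthetic stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
clf = KNeighborsClassifier(n_neighbors=3)

# Holdout: fit on one portion, score on data the model has never seen
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
holdout_accuracy = clf.fit(X_train, y_train).score(X_test, y_test)

# Cross-validation: repeat the holdout idea across 5 folds
cv_scores = cross_val_score(clf, X, y, cv=5)
```

Because each fold is scored on points the model was not fitted on, these numbers are typically a bit lower, and more honest, than the training-set accuracy.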

