Fundamentals of Dask Transcripts
Chapter: Dask-ML
Lecture: Generating dataset with scikit-learn
0:00
Welcome back and I'm so excited to jump into scikit-learn for machine learning with you
0:05
now. So you may recall that scikit-learn is a powerful library for machine learning in Python which provides, among many other things,
0:13
tools for pre-processing, model training, evaluation and more. If your model and data fit on your computer,
0:19
definitely use scikit-learn with no parallelism. We'll soon see how to generalize your scikit-learn code, using Dask-ML,
0:26
to a parallel and distributed setting. So let's now see how you can train machine learning models in scikit-learn.
0:33
So first we want to create a data set. You could import one, but scikit-learn (sklearn) has nice utility functions for creating data sets.
0:41
So I've just executed this to use the 'make_classification' function to create a data set
0:48
that has 100,000 data points and 10 features for each data point.
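The exact call is not shown in the transcript, so the following is only a minimal sketch of what such a cell might look like. The sample and feature counts come from the lecture; `random_state=0` is an assumption added here so the result is reproducible.

```python
from sklearn.datasets import make_classification

# n_samples and n_features match the numbers mentioned in the lecture.
# random_state=0 is an assumption (not from the lecture) for reproducibility.
X, y = make_classification(n_samples=100_000, n_features=10, random_state=0)

print(X.shape)  # (100000, 10): one row per data point, one column per feature
print(y.shape)  # (100000,): one target value per data point
```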
0:55
Now you may note that we've unpacked the result of 'make_classification' into two variables, X and y. It's worth spending a minute talking about these.
1:04
You may recall from your knowledge of machine learning that a machine learning challenge has feature
1:09
variables, which you input to your model, and then output or target variables that your model is trying to predict.
1:16
What we're doing is unpacking the features into capital X, by convention, and the target into lowercase 'y', also by convention.
1:25
As we've written here, X is the set of input variables and 'y' is the output or target variable.
1:32
If we look at, let's say, the first five entries, or in this case rows, of X, we should get something that's five by 10.
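As a rough sketch of that check, slicing the feature array gives the first five rows. The data set is regenerated here so the snippet runs on its own; `random_state=0` and the name `first_rows` are assumptions for illustration, not part of the lecture.

```python
from sklearn.datasets import make_classification

# Regenerate the data set so this snippet is self-contained.
# random_state=0 is an assumption, not taken from the lecture.
X, y = make_classification(n_samples=100_000, n_features=10, random_state=0)

# Slicing with [:5] keeps the first five rows and all 10 feature columns.
first_rows = X[:5]
print(first_rows.shape)  # (5, 10)
print(first_rows)
```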
1:43
So here we have five rows of 10 columns each. Similarly, what we hope to see when looking at the first five entries of
1:51
'y' is five binary elements, so zeros and ones. Because we're working in classification, we expect to see discrete outputs, and the default here
2:02
is binary, so we should see five zeros and ones, as we do. Now, in the next video,
2:08
we're going to come back and build our very first machine learning model together.
2:11
It will be a k-nearest neighbors classifier for the data set that we've just generated.
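Going back to the target check described a moment ago, here is a small sketch of how one might confirm that the targets are binary. It relies on the default of two classes in 'make_classification'; `random_state=0` and the name `labels` are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

# Regenerate the data set so this snippet is self-contained.
# random_state=0 is an assumption, not taken from the lecture.
X, y = make_classification(n_samples=100_000, n_features=10, random_state=0)

# The first five targets: each entry is either 0 or 1.
print(y[:5])

# np.unique lists the distinct target values; with the default of two
# classes these are 0 and 1.
labels = np.unique(y)
print(labels)  # [0 1]
```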