Fundamentals of Dask Transcripts
Chapter: Dask-ML
Lecture: Generating dataset with scikit-learn

0:00 Welcome back and I'm so excited to jump into scikit-learn for machine learning with you
0:05 now. So you may recall that scikit-learn is a powerful library for machine learning in Python which provides, among many other things,
0:13 tools for pre-processing, model training, evaluation and more. If your model and data fit on your computer,
0:19 definitely use scikit-learn with no parallelism. We'll soon see how to generalize your scikit-learn code, using Dask-ML,
0:26 to a parallel and distributed setting. So let's now see how you can train machine learning models in scikit-learn.
0:33 So first we want to create a data set. You could import one, but scikit-learn (sklearn) has nice utility functions for creating datasets.
0:41 So I've just executed this to use the 'make_classification' function to create a data set
0:48 that has 100,000 data points and 10 features for each data point.
0:55 Now you may note that we've unpacked the result of 'make_classification' into two variables, X and y. It's worth spending a minute talking about these.
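The exact cell run in the video is not reproduced in this transcript, so the following is only a minimal sketch of what such a call might look like. The `n_samples` and `n_features` values come from the narration; `random_state=0` is an assumption added here for reproducibility and is not mentioned in the lecture.

```python
from sklearn.datasets import make_classification

# 100,000 data points with 10 features each, as described in the lecture.
# random_state=0 is an assumption added so that reruns give the same data.
X, y = make_classification(n_samples=100_000, n_features=10, random_state=0)

# X holds the features (one row per data point), y holds the targets.
print(X.shape)
print(y.shape)
```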
1:04 You may recall from your knowledge of machine learning that a machine learning challenge has feature
1:09 variables, which you input to your model, and then output or target variables that your model is trying to predict.
1:16 What we're doing is unpacking the features into capital 'X', by convention, and the target into lowercase 'y', also by convention.
1:25 As we've written here, 'X' is the set of input variables and 'y' is the output or target variable.
1:32 If we look at, let's say, the first five entries, or in this case rows, of X, we should get something that's 5 by 10.
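As a rough sketch of that check, slicing the first five rows of the feature array should give a 5 by 10 block. The dataset is regenerated here so the snippet runs on its own; `random_state=0` and the name `first_rows` are assumptions for illustration, not taken from the video.

```python
from sklearn.datasets import make_classification

# Regenerate the dataset so this snippet is self-contained.
# random_state=0 is an assumption, not something stated in the lecture.
X, y = make_classification(n_samples=100_000, n_features=10, random_state=0)

# Slice the first five rows; every row keeps all 10 feature columns.
first_rows = X[:5]
print(first_rows.shape)
print(first_rows)
```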
1:43 So here we have five rows of 10 columns each. Similarly, what we hope to see when looking at the first five entries of
1:51 'y' is five binary elements, so zeros and ones. Because we're working in classification, we expect to see discrete outputs, and the default here
2:02 is binary, so we should see five zeros and ones, as we do. Now, in the next video,
2:08 we're going to come back and build our very first machine learning model together.
2:11 It will be a k-nearest neighbors classifier for the data set that we've just generated.
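The check on the target variable described above can be sketched in the same way. This assumes the default binary setting of `make_classification` that the lecture mentions; `random_state=0` and the names `first_labels` and `classes` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification

# Regenerate the dataset so this snippet is self-contained.
# random_state=0 is an assumption, not something stated in the lecture.
X, y = make_classification(n_samples=100_000, n_features=10, random_state=0)

# The first five targets: each one is a discrete class label.
first_labels = y[:5]
print(first_labels)

# With the default of two classes, the only labels present are 0 and 1.
classes = np.unique(y)
print(classes)
```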

