Hi everyone.
Welcome to fundamentals of DASK.
We are so excited to get started with the second course on DASK.
To follow along with the course, all you need is some basic experience with Python programming.
The first Dask course, Getting Started with Dask, covered some basic Dask concepts and discussed Dask DataFrame in detail.
The topics in this course will be built on previous topics.
So if you haven't watched the first course yet, we encourage you to check it out.
Experience working with NumPy, pandas, and scikit-learn will help you follow the course better.
But don't worry if you haven't used these before.
We'll explain everything we use.
In this course we'll look at three more Dask collections: Dask Array, which helps parallelize NumPy code; Dask Delayed, which can parallelize general Python code; and Dask Bag, which allows us to work with unstructured and messy data in parallel.
We're then going to take a deeper dive and look at Dask schedulers. We'll also talk more about Dask-ML for machine learning, which lets us parallelize scikit-learn code, and we'll complete the course by scaling the machine learning code to the cloud.
I hope you're as excited as me to learn about Dask.
Let's get started
Hi, I'm Hugo Bowne-Anderson, one of your instructors for this course.
I'm the Head of Data Science Evangelism at Coiled.
Previously, I've worked as a Data Scientist, Evangelist and Educator, at Data Camp where I created several courses on Data Science with Python.
My background is in Cell Biology, where I witnessed the need for scalable compute firsthand.
I've been a big fan of Dask for many years and I can't wait to teach you more about it.
My co-creator for this course, Matt Rocklin, is the CEO of Coiled and the co-creator of Dask.
Matt helped to create this course and built many reference materials that we share during the course.
So let's get started.
Hi everyone, Hugo here.
I am very excited to be telling you about Dask Array, which is a wonderful generalization of the NumPy array that allows you to do array computation at scale with super large datasets, among other things.
The real purpose of the Dask Array is to have a high level user interface to things that are kind of like 'NumPy Arrays' but may not fit in memory.
So, essentially what we're doing is scaling 'NumPy' code.
But one of the really cool parts of the 'Dask Array' is that the code you write, the API mimics the NumPy code that you write as well.
So on the left here, you see you have 'x = np.array', etc.
And on the right, the Dask code is 'x = da.from_array'.
And similarly with mean, it mimics the code.
It isn't always exactly the same.
You'll see we have a '.compute()' for Dask; as we'll see, that's because Dask does something called lazy evaluation, but that's by the by.
The point really is that the code you write is relatively similar.
Okay.
And the other thing to note here is that it's actually doing computation on NumPy Arrays themselves in the back end.
So, your mental model of what's happening is actually what's happening under the hood, which is pretty cool if you ask me.
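To make that mimicry concrete, here's a minimal sketch; the array sizes and chunk shape are made up for illustration, not taken from the slide.

```python
import numpy as np
import dask.array as da

# Plain NumPy: everything lives in memory and runs eagerly.
x_np = np.ones((1000, 1000))
print(x_np.mean())                 # computed immediately

# Dask Array: nearly identical code, but chunked and lazy.
x_da = da.from_array(x_np, chunks=(250, 250))
result = x_da.mean()               # a lazy task graph, not a number yet
print(result.compute())            # .compute() runs the NumPy pieces in parallel
```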
So, I just want to say a few more words about where Dask and Dask Array get used.
You may already be aware that Dask is used everywhere.
It's used in retail at Walmart and Grubhub.
It's used in the Life Sciences at Harvard Medical School, among many other places.
It's used in finance.
It's used at geophysical facilities.
It's used by a lot of other software projects such as RAPIDS, Pangeo and PyTorch.
Now, I just want to make clear that a lot of the time, all of these things actually start with Dask Array.
So this is a really cool place to get started diving a bit deeper into Dask. Here's what we're going to cover: first, we'll demonstrate NumPy, having a look at its basics to familiarize or re-familiarize yourself with it.
Then we're going to talk about 'Blocked algorithms' and in short a blocked algorithm executes on a large dataset by breaking it up into many smaller chunks.
Then we're going to introduce the 'Dask Array', which you've already had a little hint of in the previous slides.
After that, we're gonna have a frank conversation about some of the limitations of Dask Array.
There aren't many, but it's worth talking about.
And then we're going to provide some references.
See you in the next video.
All right, so it's time to jump into the 'NumPy Library' and 'NumPy Arrays'.
So, NumPy, as we've mentioned, is a Python library that provides multi-dimensional arrays along with routines for fast operations on them; on top of this, it has a collection of high-level mathematical functions, among many other things.
What we're doing now is an introduction to a very small subset of NumPy Arrays, but it will provide a lot of motivation for what we do with Dask in a minute.
Let's jump into a Jupyter Notebook in Jupyter lab to see Numpy in action.
So here we are in Jupyter lab in a Jupyter Notebook about to jump into some NumPy stuff.
Okay, so in this notebook we're gonna demonstrate NumPy look at some blocked algorithms, then jump into Dask Array, which I'm pretty, pretty excited to show you.
So the first thing that we want to look at is NumPy, we're just gonna show some basic functionality there.
So NumPy has a ones() function to create arrays of all ones.
We're gonna use it, after doing an import, to create a 10 by 10 matrix or array (we use those terms interchangeably) of ones, and we'll print it.
Okay?
So yeah, we've got our array of 1's.
Now.
We can use the sum() method on this array to add up all the entries, and we use the %%time magic command to time it; an array of one hundred 1s sums to 100.
That's good.
We can see the 'wall time' was 135 microseconds there.
And what we're gonna do is kind of similar things with larger arrays, and see that the time to do them gets larger and larger.
Right?
So we're gonna use the random module, which I love a lot, to create an array of random data.
We're going to create a larger one; it's gonna be 1000 by 1000 here.
So we see that now we're going to perform the sum ( ) and we'll see instead of on the order of hundreds of microseconds that took on the order of milliseconds there.
Okay, so the time to do it is growing.
NumPy has a bunch of helpful operations like matrix transpose, matrix addition and mean; we're going to use these to create a new array y by adding x to its transpose, and we'll see that took 24 milliseconds.
We're also gonna take the mean of y, which took on the order of milliseconds.
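Here is a minimal sketch of the NumPy code being described; the %%time magic only works in a notebook, so this uses time.perf_counter instead, and your timings will differ.

```python
import time
import numpy as np

x = np.ones((10, 10))
print(x.sum())                        # 100 ones sum to 100.0

x = np.random.random((1000, 1000))    # larger array of random data
t0 = time.perf_counter()
y = x + x.T                           # add x to its transpose
print(y.mean(), f"({time.perf_counter() - t0:.3f} s)")
```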
Now, I'm going to execute this next block of code because it's going to take a little bit of time, so I'll start it running while we talk through it.
So we're creating an even larger matrix.
This is going to be 20,000 by 20,000.
And of course we're using the random module's normal function there; it's going to give us normally distributed random variables, and we're also computing its mean, so this is going to take some time and should be done, give or take, in ten or so seconds.
So here we have it computed, and we see that it took around 40 seconds, which is significantly longer; if this took any longer, it definitely wouldn't make my workflow comfortable.
And so that's an example of when we may want to start moving, moving to something like Dask.
Okay.
But before we do that, I'm going to try to do something that people will do occasionally, which is import or create an array of a really large size.
So this one I'm trying to create one with a billion values along each axis.
Okay.
And look at that: it throws a MemoryError, which means that NumPy isn't even able to handle data at this size.
Okay.
What we're gonna do in the next video is work around this limitation using 'Blocked Algorithms'.
But after that, we're gonna see how we can achieve success with these types of challenges using Dask also.
See you in a minute.
So now I'm excited to tell you about 'Blocked Algorithms'.
These essentially execute on large datasets by breaking up the datasets into smaller blocks.
So in the above example we had a billion by a billion numbers, or something like that.
If we want to take the sum of all the numbers, we could break up the array into 1000 chunks, for example, take the sum of each chunk, and then take the sum of the intermediate sums. Okay, so let's do this with a random dataset that we've generated here.
This creates a pointer to the data but doesn't actually load it, so we execute that cell.
Now, this dataset is a small example; we're not doing it on a billion by billion array, but on a smaller one for pedagogical, instructive purposes.
Okay.
So we create a list called 'sums', where we collect all the intermediate sums; then we iterate through the smaller chunks, take the sum of each chunk, and append it to 'sums'.
So we get a list of all the smaller sums there.
And then for the total we take the sum of all the sums in the list, and then we print the total.
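As a minimal sketch of that blocked pattern, here's the same idea on an in-memory NumPy array rather than the stored dataset from the notebook:

```python
import numpy as np

data = np.random.random(1_000_000)     # stand-in for the larger stored dataset
chunks = np.array_split(data, 1000)    # break it up into 1000 smaller blocks

sums = []
for chunk in chunks:                   # sum each chunk in turn...
    sums.append(chunk.sum())

total = sum(sums)                      # ...then sum the intermediate sums
print(total)
```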
Okay, so we do that, and look, that took around 800 milliseconds, close to a second, give or take.
Okay?
But note that this is a sequential process in the notebook kernel: the loading and then the summing.
And what I want to make clear is that this is something that we can do in parallel, particularly when we have multiple cores on an individual workstation.
Okay, so after this video we're going to come back for a checkpoint, and then we're going to show you how to parallelize this type of code. Can't wait to see you there.
After that whirlwind introduction to blocked algorithms, we'd like you to check in with yourselves and with us to test your understanding of what we've been through.
So we have a Checkpoint.
A question: what we want you to do is to create a random matrix that's 1000 by 1000 and compute its standard deviation.
You can write your code in that cell, execute it, see how you go.
And when you want to check out the answer, you can click on those three dots to open the answer.
All right.
Best of luck.
Let's now take a look at how Dask Arrays help us scale NumPy.
We saw how NumPy throws a MemoryError when given large datasets; Dask Array can handle this larger-than-memory data.
A Dask Array is composed of multiple NumPy Arrays, as shown in this diagram.
Also note that Dask Array computes them in 'Parallel'.
Dask array operations use NumPy operations internally.
So the syntax will be familiar to you.
Dask Array lets us define a chunk size property to divide our Dask Array into appropriate blocks, and for the calculations it leverages the concept of blocked algorithms that we just learned about to give us good performance.
Now, let's jump into the 'Jupyter Notebook' and see how you can use Dask Arrays.
First we need to spin up a new cluster here.
Here we are using four workers, and let's open some diagnostic dashboards: the Cluster Map and the Task Stream.
And let me rearrange these tabs to the right.
Great.
Let's now create a 10,000 by 10,000 array with 100 by 100 chunks.
We'll be using the ones function from Dask Array for this.
Looking at this, we see Dask Array has created the array and displayed some metadata.
This is incredibly useful if you read a large file, this Metadata can help you understand what's going on without needing to compute and display the entire array.
Here we see information about the size of the array and the chunks, the shape of the array and the chunks, a count of tasks and chunks, and the data type of the values in the array.
The diagram helps us visualize the chunks.
Now let's compute the sum of this array and time it.
That returned almost instantly, because Dask Array also evaluates lazily; recall how lazy evaluation refers to computing the results only when necessary.
So if we look at what the variable result is, it displays a Dask Array; we need to call compute() to get the actual results, and we also see some activity happening in our dashboards.
All right.
That gives us 100 million, which makes sense for a 10,000 by 10,000 array of ones.
Next, let's do the same computation as the earlier NumPy one: we use da.random to create an array of random values, calculate the mean, and compute every 100th value.
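Here's a minimal sketch of this video's workflow, assuming a local cluster; the sizes and chunks follow the narration.

```python
from dask.distributed import Client
import dask.array as da

client = Client(n_workers=4)               # local cluster, dashboards included

x = da.ones((10_000, 10_000), chunks=(100, 100))
result = x.sum()                           # lazy: only builds the task graph
print(result.compute())                    # now the blocked sums run in parallel

y = da.random.random((10_000, 10_000), chunks=(1000, 1000))
print(y.mean().compute())                  # same idea on random data
print(y[::100, ::100].compute())           # every 100th value

client.close()                             # tidy up the cluster when done
```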
Finally, always remember to close your Cluster.
Congrats on getting this far with Dask Arrays.
Now, test your knowledge with this checkpoint.
What we want you to do is to use Dask Array and create a random matrix that's one million by one million, and then compute the standard deviation.
Now, once you've got your solution, or if you're really stuck, click on the three dots to take a look at ours. See you in the next video.
So before wrapping up, we thought it prudent to tell you about some of the limitations of Dask Array.
So, for example, Dask Array doesn't implement the entire NumPy interface.
I mean, the NumPy interface is huge these days.
So for example, it doesn't implement 'np.linalg' or 'np.sometrue'.
Dask Array also doesn't support some operations where the resulting shape depends on the values of the array.
On top of this Dask Array doesn't implement or attempt operations like 'sort', which are quite difficult to do in 'Parallel'.
So before wrapping up, as always, we'd love to supply you with some references.
So you can find these in the notebook, you can find the Dask Array documentation, the Dask Array API, some examples and then the Array section of the Dask tutorial.
Welcome back! After working with Dask Arrays, it's now time to jump into Dask Delayed.
You may remember Dask Delayed from the first course. To recap, the Dask Delayed API is a low-level API that a lot of other distributed Dask machinery calls all the time; for example, Dask Array and Dask DataFrame call Dask Delayed.
And as we've seen, Dask Array and Dask DataFrame can't be used everywhere.
In the places they can't, Dask Delayed can come to the rescue, because it allows us to write custom parallel computations using Dask.
On top of that, you can parallelize existing Python code using Dask Delayed, and that's what we're about to do.
A few words about what we're going to cover.
First, we're gonna recap the Delayed API. Then we'll parallelize some Python code with it.
Then we'll discuss some best practices for using Dask Delayed, and we'll wrap up with some references for those of you who are eager to use Dask Delayed a lot more.
All right, everyone.
Now it's time to jump into our Jupyter Notebook in Jupyter lab to check out Dask Delayed.
So I just want to recap a few things about Dask Delayed: as I've said, and as you know from the first course, it can be used to parallelize regular Python code, and it's important to recognize that it's evaluated lazily.
What that means is that you need to call .compute() in order for the computation to be evaluated.
And that's important, because you don't want these big computations to run unless you explicitly say so.
And we can also generate a task graph, which we'll see. These are functions that we created in course one as well: one is an increment function which takes x and adds one to it; the other takes two arguments, x and y, and adds them together.
And what we did initially is put a little sleep() of one second into these functions, for pedagogical purposes, to show you that when you parallelize the code the time is reduced compared to doing it serially.
Executed that code.
Now, what we do is import delayed, which allows us to pass the function that we want to delay as the first argument to delayed, and then pass to that the arguments that we want to pass into the function.
Okay, so we execute that now.
It returns a Delayed object; it hasn't been computed yet.
We're going to compute it now. What do we expect?
We increment 10 twice and add the results together; that's 11 plus 11, so 22 is what we expect.
Okay, that's a good sanity check.
Now let's visualize the task graph that was created here, to really make explicit what Dask Delayed has done.
So z.visualize(): what it has done is perform these increments in parallel and then add them together.
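A minimal sketch of the inc/add example being described; the one-second sleeps are the same teaching device from course one.

```python
import time
from dask import delayed

def inc(x):
    time.sleep(1)          # artificial pause, purely for demonstration
    return x + 1

def add(x, y):
    time.sleep(1)
    return x + y

a = delayed(inc)(10)       # pass the function to delayed, then its argument
b = delayed(inc)(10)
z = delayed(add)(a, b)     # still a Delayed object; nothing has run yet

print(z.compute())         # 22
z.visualize()              # draws the task graph (requires graphviz)
```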
Okay, so we'll be back in a second to parallelize further Python code with Dask Delayed.
Now it's time to parallelize even more Python code with Delayed.
We're going to first parallelize a for-loop, which is ripe for parallelization (try to say that fast ten times), because when you run a for-loop the results are computed serially.
So it's ripe to be parallelized, and that's what we're gonna do here.
We're going to create a list, which we call 'data', a basic list of some dummy data; then we're going to go through the list, and we're going to %%time this in order to see how long it takes.
We're going to increment each item in data and append it to the results list, and then take the sum to compute the total; it should be done in a second.
All right.
Took around eight seconds and you can verify that that's the result that you wanted.
As I said, all of these increments are happening in serial but they could be happening in parallel.
Okay.
So we can wrap certain functions with delayed to make that happen.
This is pretty much the same code as before, but we're wrapping the increment function in delayed, and we're wrapping the sum in delayed as well.
So let's check out how long this takes.
Okay, now you may think, wow, that was quick.
But remember, this hasn't actually performed the computation yet, because Dask evaluates lazily.
So we need to call compute in order to do this. Great.
So we see we got out the same result in 1/8 of the time here on my system.
Now, what I want to do is visualize the 'task graph' so we can get a sense of what actually happened there.
So look at that.
We had all these increments occurring in Parallel, then feeding into the final sum.
And that is why our computation took about an eighth of the time.
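A minimal sketch of the serial loop and its delayed counterpart, using the same toy inc function with a sleep:

```python
import time
from dask import delayed

def inc(x):
    time.sleep(1)
    return x + 1

data = [1, 2, 3, 4, 5, 6, 7, 8]

# Serial: roughly eight seconds, one increment after another.
serial_total = sum(inc(x) for x in data)

# Delayed: the increments can run in parallel.
results = [delayed(inc)(x) for x in data]
total = delayed(sum)(results)    # still lazy
print(total.compute())           # now the work actually happens
total.visualize()                # parallel incs feeding into a single sum
```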
Now let's see how we can use pandas groupby() in parallel by leveraging Dask Delayed.
Note that this is purely for demonstration purposes, and Dask DataFrame should always be preferred in real-world situations.
We'll be going back to the NYC taxi cab dataset that was used in the first course.
If you don't have the dataset, you can uncomment this cell to download it; note that you should move all the files to a data subdirectory, as we have done here. Let's start by importing the data for January 2019 using pandas and calculating the mean tip_amount as a function of passenger_count.
We use the group by function in Pandas for this computation.
Now, to compute this over the entire 12 months of data without Dask DataFrame, we can go through the file that corresponds to each month, one by one.
We perform a pandas groupby on it, and for each possible value of the passenger count we calculate two things: first, the sum of the tip amounts, and second, the total number of data points which had that passenger count.
We then save these values and calculate the 'mean'.
After we have gone through all the files we encourage you to pause the video and take your time to go through this block of code.
Now we'll introduce 'Parallelism' into this code.
Using Delayed.
This code block is similar to the previous block, but notice how we read the CSV files in a delayed fashion.
This makes all the subsequent operations Delayed objects as well. We then compute the sum and count values after going through all the files, and then we calculate the mean as before. Notice the time difference here: it's not a lot, but significant enough to add up when we work with really large datasets.
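A minimal sketch of that per-file pattern; the filename glob is illustrative, the columns are the passenger_count and tip_amount from the narration, and here the per-file work is wrapped in one delayed function, which has the same effect as delaying each step.

```python
from glob import glob
import pandas as pd
import dask
from dask import delayed

@delayed
def tip_stats(path):
    df = pd.read_csv(path)                       # reading becomes a lazy task
    grouped = df.groupby("passenger_count")["tip_amount"]
    return grouped.sum(), grouped.count()        # per-file sums and counts

files = sorted(glob("data/yellow_tripdata_2019-*.csv"))     # illustrative pattern
stats = dask.compute(*[tip_stats(f) for f in files])        # one compute for all files

total_sum = sum(s for s, _ in stats)     # assumes each month has the same passenger counts
total_count = sum(c for _, c in stats)
print(total_sum / total_count)           # mean tip amount per passenger count
```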
Welcome back and congrats on getting to know 'Dask Delayed' even more.
Now.
It's time for a short checkpoint.
So what we'd like you to do is to use the Dask Delayed API yourself to parallelize the following: create a NumPy array of any size and compute the sum of all entries in the array. As before, enter your answer below, and once you're done, or if you find yourself stuck, click on the three dots to find out how we answered this checkpoint.
Welcome back, fellow Dask users, and congratulations on making it through that checkpoint.
Now, I want to tell you about some best practices when using 'Dask Delayed'.
Now, there are lots of best practices we could discuss.
We're going to talk about three now; they're really the handful that we think are worth mentioning upfront, based on the type of code we see people write all the time with Dask and Dask Delayed.
So, best practice number one: don't call delayed on the result of a function, as it will execute immediately. Do call delayed on the function itself and then pass it the arguments of the function, as in the code you see here. Best practice two: call compute once instead of repeatedly.
So here we see that we're calling compute within a for-loop; let's avoid that completely.
What we want to do is collect as many delayed calls as possible for one compute, so we loop through them here and then pass them to dask.compute().
Now, the third one is worth spending a moment on. So far we've usually seen functions passed to the delayed wrapper; instead of doing that, when defining a function you can add the @dask.delayed decorator at the top, which does pretty much the same thing as passing the function to the delayed wrapper.
Now, the best practice here is: do not mutate inputs in the function, so don't do the mutation shown here within the function.
The best practice is to return the new value, or to return copies. For more best practices, I'd encourage you to refer to the Dask documentation.
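A minimal sketch of those three practices side by side, using a toy inc function:

```python
import dask
from dask import delayed

def inc(x):
    return x + 1

# 1. Delay the function, not its result.
# bad:  delayed(inc(1))        # inc(1) runs immediately
good = delayed(inc)(1)          # lazy, as intended

# 2. Collect many delayed calls and compute them once.
lazy = [delayed(inc)(x) for x in range(10)]
results = dask.compute(*lazy)   # one compute instead of compute() inside the loop

# 3. Decorator form, and no mutation of inputs.
@dask.delayed
def double_all(xs):
    return [x * 2 for x in xs]  # return a new list rather than mutating xs in place
```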
Now, to wrap up, we have included some references below to what we think is some pretty cool documentation and a few tutorials on Dask Delayed.
See you back here soon for the next video.
Let's talk about another high-level Dask collection: Dask Bag.
We've already seen 'Dask Array' and 'Dask Data Frame'.
These collections are great for structured data, but we don't always have organized data like that, now do we?
Sometimes we have huge XML or JSON files that come with inconsistencies and noise.
In other words, the data is messy.
Dask Bag helps us work with this type of data.
Typically, users start with Dask Bag for pre-processing data, which means making the data suitable for further analyses.
Then they move to other Dask collections to work with the data, most often Dask DataFrames.
Dask Bag is powerful because we can use it to work with general Python data structures as well, like lists, dictionaries and sets.
Dask Bag implements operations like map, filter, fold and more on these data structures by leveraging parallel compute; if you've worked with itertools or PyToolz before, you can think of Dask Bag as a parallel version of these.
Now, we can condense Dask Bag's benefits into two key areas: computing in parallel, which means we can use all the compute power your machine has, and iterating lazily: Dask Bag computes lazily, which allows us to work with large datasets comfortably, even on a single machine with a single core.
In this chapter, we will learn to read and manipulate different types of data using Dask Bag; we will also see how you can convert a Dask Bag to a Dask DataFrame, a common workflow among data professionals.
Again, we'll share the limitations you need to be aware of and leave you with some references to explore further.
Now, let's jump into the notebook.
It's time to jump into Dask Bags, and we're going to first learn how to read from a Python list, and other types of collections and sequences, into Dask Bags.
Okay, before reading data into a Dask Bag, what we want to do, as always, is start by creating a cluster.
So I'm going to do that here, and you can code along in your Jupyter Notebook, spinning up a cluster with 4 workers; then we're going to open a couple of dashboards once this cluster is created.
Cool, look at that: 4 workers, 8 cores, 8 gigs of memory.
I am going to open my Cluster Map now and drag it over to the side, and I also want my Task Stream.
Beautiful.
I'm gonna close this for our viewing pleasure, drag this down here, and we are ready to go. What we're gonna do is create a Dask Bag from a Python list, but just to be clear, you can create bags similarly from sets, dictionaries and other general Python objects such as collections and sequences.
We want to partition the data into blocks.
The following example is a small one for learning purposes; in this example there are two partitions with five elements each.
Now you may say, well, I could do that in one partition. Of course you could, but we're doing this for teaching purposes, as we've said.
So first we import dask.bag as 'db', which is a convention, just as numpy as 'np' and pandas as 'pd' are.
So now we're going to execute this code using db.from_sequence, passing it the list and the keyword argument npartitions, setting that equal to two.
And look, it's returned a Dask Bag object, as you may have anticipated.
No computation has occurred, because Dask evaluates things lazily.
So we need to call 'compute( )' to get the result.
So let's do that.
And let's be prepared to see some things happen in our task stream and our cluster map.
Fantastic.
So we've seen a couple of tasks occur.
We saw a list light up a bit.
Okay.
And we can see that it's returned the result, as expected.
The other thing we can do is use the take method to display the first few values directly.
So I'm going to apply the take method to the Dask Bag b and give it the argument 3, and there we go: the first three elements of the Dask Bag, Alaska, Minnesota and Georgia.
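A minimal sketch of what this cell is doing; the list of state names is toy data in the same spirit.

```python
from dask.distributed import Client
import dask.bag as db

client = Client(n_workers=4)                 # local cluster, as in the video

states = ["Alaska", "Minnesota", "Georgia", "Texas", "Oregon",
          "Ohio", "Utah", "Maine", "Idaho", "Iowa"]

b = db.from_sequence(states, npartitions=2)  # a lazy bag in two partitions
print(b.compute())                           # materialise the whole list
print(b.take(3))                             # or just peek at the first three items
```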
All right.
We'll be back in a minute to start reading some more messy unstructured data into Dask Bags.
Welcome back.
Now it's time to see how Dask Bags can be used to read from JSON files.
Now, if you've been working in data science for a hot minute, let's say, you've probably had to work with all types of JSONs before, perhaps JSONs that are pretty large as well.
So hopefully Dask Bag can help you in this type of work too.
To begin with, we're simply going to create a Dask Bag from some JSON files; to do that, we'll first create some random data and store it as JSON files.
So we perform our imports.
Then we use a utility function to create some data and write it to some JSON files, as we see here.
Now we can see that they've been put in our data directory.
What I want to do is just pop over to the data directory: as you see, we're in the Dask fundamentals directory, then we go into our data directory, and we see the JSONs that we've just created there.
And we also saw some cool activity in the Task Stream and in the Cluster Map as well.
So now what we're gonna do is use the read_text function to read these JSONs in as a Dask Bag and assign them to the variable b.
Now, I want to say that read_text is mainly used for .txt files, and the items in the bag will be strings.
It can also handle compressed files, as we see here, and it can also be used for .json files.
Okay, we're going to do this, and we see of course that we haven't computed yet; lazy evaluation for the win once again.
So we're going to take the first two elements and look at those. I would say beautiful, but raw JSON is rarely beautiful; as we've written here, the data comes out as lines of text, and we can make it more readable using json.loads.
And what we need to do with json.loads is map it across the Dask Bag.
So what we do here is map json.loads across the Dask Bag and then take the first two elements of our new b.
Fantastic. And now it's more human readable, which is great because we're humans, occasionally trying to read.
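A minimal sketch of the read-and-parse step, assuming line-delimited JSON files in a data/ subdirectory like the ones generated above:

```python
import json
import dask.bag as db

lines = db.read_text("data/*.json")    # a lazy bag of raw text lines
records = lines.map(json.loads)        # parse each line into a dictionary
print(records.take(2))                 # peek at the first two parsed records
```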
I will say that Dask Bag can also read binary files and delayed values.
I'd love it if you went and checked it out in the API documentation and we'll be back soon to talk about manipulating data with Dask Bag.
So we've seen how to create Dask Bags and how to get JSON data, for example, into Dask Bags.
And as we've written here, Bag objects have a standard functional API found in projects like the Python standard library, toolz, and PySpark.
So they include 'map' and 'filter' and 'groupby'.
And we're going to see all of these things in action now.
So the first thing to note is that operations on bag objects create new bags.
So without further ado, let's look at several common operations.
So filter is an important one.
The reason filter is so important is that when you have your data, you may want to look at certain values of interest, and so you may want to filter your records, for example.
Right?
So we use a 'lambda function' here to do that, which we pass to the filter method.
And then we use take with the argument(5) in order to look at the first five records performed after this filter And there we go.
We have Harold for example, who's 42, then we have Jack 66, etc.
So we have the first five records of people who are older than 25.
So that demonstrates how to filter the 'Dask Bag'.
We can also map functions across bags.
For example, you may want to get all the first names from our JSON data; the way we do this is to map a function which extracts the first name across the entire bag.
So that's what we do here, and we take the first 10, so we're gonna get the first 10 names.
And we see we have Harold, Jack, Emmett, Jonah, Eugenia, Sterling, Rudolph, Erlin, Lawrence and Valentine; wow, that was a mouthful, but we got those first 10 there.
Okay.
Another common operation that data professionals, data analysts, data scientists and citizen data scientists use is a groupby, which you may recall using in pandas all the time.
So essentially a 'groupby' allows you to group data by some property or function.
So what we're going to do here, recalling that 'x' is a bag of all the first names of people in the records, is group by the length of the name and then compute.
What this will do is return a list of name lengths, each paired with the names that have that length, essentially.
So let's see that.
Great.
So we have a list where we have six and then all the names or the first names that have six characters in them.
Then 4,8,7 and 9.
One thing to note about the Dask groupby operation: it can be slow.
So I just want to say a bit about an alternative, which is called foldby.
Okay.
And I encourage you to check out the Dask documentation on 'foldby' and I'll tell you briefly what some of the documentation says.
So the 'groupby' method is straightforward according to the documentation, but forces a full shuffle of the data, which is expensive.
Now 'foldby' is slightly harder to use but faster.
So go and check out the docs and see for your particular use case whether you'd like to use 'groupby' or 'foldby'.
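A minimal sketch of filter, map, groupby and foldby on that bag, assuming records with an 'age' field and a 'name' field holding a [first, last] pair (the field layout is an assumption based on the narration):

```python
import json
import dask.bag as db

records = db.read_text("data/*.json").map(json.loads)

# Filter: keep only the records of people older than 25.
older = records.filter(lambda r: r["age"] > 25)
print(older.take(5))

# Map: extract the first name from every record.
first_names = records.map(lambda r: r["name"][0])
print(first_names.take(10))

# Groupby: group first names by their length (forces a full shuffle, can be slow).
print(first_names.groupby(len).compute())

# Foldby: the faster, slightly trickier alternative, here counting names per length.
counts = first_names.foldby(len, lambda total, _: total + 1, 0,
                            lambda a, b: a + b, 0)
print(counts.compute())
```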
Welcome back and I'm more than excited to bring you to your next checkpoint where you get to do a bit of coding yourself.
So the question here is to find all the cities from the 'json data' that we created earlier.
So what I want you to do is have a go at it: pop your answer in here, execute it, and see whether it returns all the cities. Have a play around.
If you need some help, try it again.
Maybe search on google a little bit, go back through the previous examples we've been through if you, if you really can't do it or if you can and did it and want to check your answer, click on those three dots to see how we solved it.
Best of Luck.
Congrats on making it through that checkpoint.
We've worked a bunch with Dask Bags.
Sometimes, though, we really want to be working with Dask DataFrames.
Okay, so Dask Bag can be used for simple analysis, but Dask DataFrame and Dask Array are sometimes more useful for complex operations.
One way to think about it is that they're faster than Dask Bags, for the same reason that pandas and NumPy are faster than pure Python.
They also have more functionality suited for data analysis.
How do we do it?
Well, we have a wonderful to_dataframe() method.
Once again, we recreate our bag from before, from our JSON files; then we apply the to_dataframe method, and then we check out the head of the DataFrame.
All right, and look at that.
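A minimal sketch of that conversion, again assuming the JSON files from earlier:

```python
import json
import dask.bag as db

b = db.read_text("data/*.json").map(json.loads)   # bag of dictionaries
df = b.to_dataframe()                             # now a Dask DataFrame
print(df.head())                                  # pandas-style operations from here on
```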
So having done all that, remember, it's good to be tidy in your workspace, so let's close the cluster with client.close(). We'll be back for one last video on Dask Bag, to talk about Dask Bag's limitations and provide some further references.
To wrap up on Dask Bag, I'd like to tell you about some of the limitations of Dask Bag.
So, firstly, Dask Bag doesn't always perform that well on computations that include inter-worker communication, which is due to restrictions in the default multiprocessing scheduler, and we'll see this in the next chapter.
On top of that, Bag operations are slower than Array and DataFrame computations, just as pure Python is slower than NumPy or pandas for these types of operations. And as we saw in the previous video, groupby is slow and you should use foldby if possible, as we've also already discussed.
On top of this, note that Bags are immutable and so you cannot change individual elements.
Now, if you're excited by Bags and want to use them for your work, we've provided a list of references in the notebook and I'd also encourage you to check out the wonderful Dask documentation that the open source community has built for us.
In this chapter we'll discuss Dask schedulers in a little more detail. We've all done a lot of compute recently, so for this one you can sit back, relax and listen.
So, if you recall from course one, the scheduler ingests the task graph generated by Dask collections such as Dask Bags, Dask Arrays or Dask DataFrames; the scheduler then communicates with all the workers, manages resources and gets the computation done.
So in this chapter we'll be covering the types of Dask schedulers there are (and yes, there are multiple); we'll also cover how to select a scheduler, particularly if you need to select a different one; then we'll discuss how the different schedulers compare to each other; and at the end, as we like to, we'll provide some references from which you can learn more.
Now with great pleasure, it is time to introduce you to the different types of Dask schedulers.
There are two main types of schedulers: single-machine and distributed. As the name suggests, the single-machine scheduler works only on a single machine and does not scale to more than one machine.
It is lightweight and simple to use, and it's the default for many collections.
Now, there are three types of single-machine schedulers available in Dask.
First, the threaded scheduler, which is backed by a thread pool: all the computations happen in a single process, but on multiple threads, which means that no data transfer happens between tasks. It's lightweight.
On top of that, it's used mainly when the computation isn't Python-dominant.
What I mean by this is that, for example, NumPy and pandas are written in languages like C and Cython for efficiency, so this is the default scheduler for Dask Array, Dask DataFrame and Dask Delayed.
After threaded, we have the multiprocessing scheduler, which is backed by a process pool.
It's still lightweight but here we have multiple processes involved.
There is some data transfer between the different processes, which adds some overhead.
It will perform best if we can minimize data transfer, which is common while reading and writing data, for example; Dask Bag uses this scheduler by default.
Third, we have the single-threaded, or synchronous, scheduler. In some cases, like debugging, certain operations will fail because they don't support parallelism; in such cases we can use this scheduler.
It computes on a single thread with no parallelism.
Next up we have the distributed scheduler. If you want to scale beyond a single machine, this is your only choice, and in fact we recommend using it even if you're working locally: it has more features and better performance optimizations, as we'll see in the following videos.
Now it's time to have even more fun and see how to select a scheduler. To recap, we've learned about the different available schedulers, so now it's time to see how to select them.
The first way is to do it inline, if you want to use it for just a single compute call; for example, as you see here, while debugging you may want to compute one value without parallelism.
Next, you can set it to be the default scheduler, just within a block, as shown here.
In this case we're using a context manager.
Using "with" essentially.
You can also set it globally with the code here.
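A minimal sketch of those three ways of selecting a scheduler, using a toy Dask Array computation:

```python
import dask
import dask.array as da

x = da.ones((1000, 1000), chunks=(100, 100)).sum()

# 1. Inline, for a single compute call (handy while debugging).
x.compute(scheduler="synchronous")

# 2. As the default within a block, via a context manager.
with dask.config.set(scheduler="processes"):
    x.compute()

# 3. Globally, for everything that follows.
dask.config.set(scheduler="threads")
x.compute()
```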
Up next, we'll be jumping into a Jupyter Notebook to get practicing with everything we've just learn.
So now it's time to see the schedulers in action. To do so, we're going to go back to the NYC taxi cab dataset, and, if you recall this example from Dask DataFrame, what we're doing is computing the mean tip amount.
The first thing we do is import Client and instantiate our client with four workers; you can see that's what we've done here.
Then we import Dask DataFrame as "dd" and import all our data as a Dask DataFrame.
We set up the computation we want to do, which doesn't compute anything yet (remember, lazy evaluation), and then we compute the amount.
Okay.
And that's exactly what we've done here.
We can see it took around two minutes.
Okay.
Now what we're going to do is see this computation using different schedulers and look at the results.
Okay.
So what we're doing here is selecting the scheduler inline while calling compute, and we're doing it for the threading, processes and synchronous (single-threaded) schedulers in a for-loop. Looking at what we get, we can see that the results are the same, but the time to compute varies.
Now, this is because each scheduler works differently and is best suited for specific purposes.
So let's just have a look at the compute time.
We see that threading took just under two minutes, processes took several minutes, and synchronous took about two and a half minutes.
So it looks as though the multiprocessing scheduler took the longest here.
As we mentioned earlier, we always recommend using the distributed scheduler, and we've been using it throughout this course already. It's the only scheduler that supports all the diagnostic dashboards and the improved memory management capabilities.
It's also a separate sub project with a separate team of maintainers.
So, just a few points: the distributed scheduler, as we have here in the notebook, will also work well for workloads on a single machine.
On top of that, it is recommended for workloads that hold the GIL, such as dask.bag and custom code wrapped in dask.delayed, even on a single machine. On top of this, it's more intelligent and provides better diagnostics than the processes scheduler, and it's absolutely required for scaling out work across a cluster.
Now let's not forget our Dask distributed hygiene: as always, we close the cluster.
To wrap up this chapter on schedulers, I just want to remind you that Dask has a lot of excellent documentation on everything Dask-y, and in particular on schedulers, which you can check out at docs.dask.org and distributed.dask.org.
Enjoy.
All right, welcome back Now we're in the last stretch of the course and what I personally consider a very exciting part, Machine Learning and Distributed Machine Learning.
So before we jump in, I just want to start with a bit of background about scalability more generally in machine learning.
This is a figure that I first saw Tom Augspurger present; he's a maintainer of the Dask-ML project, among many other things. The figure describes dimensions of scale: data size is on the x-axis and model size is on the y-axis.
Now, I want to make very clear that a lot of people mistakenly think about distributed compute and Dask as being helpful only for big data, whereas actually it's also incredibly helpful as your model size or your compute requirements increase, and we'll get to this.
In the bottom-left quadrant, when both model size and data size are small, your computation fits in RAM.
Beyond that point, we become bound by memory or by compute.
Let's think about being compute bound first: when your model size or complexity increases, you reach a state where you are compute bound or CPU bound, and tasks like training, prediction, evaluation and more will take a long time to compute. One solution that I really dig here is using joblib, as we'll demonstrate soon, and scikit-learn offers joblib out of the box, which is pretty cool.
Now, the next dimension of scale we need to consider is being memory bound: this is when your data is too large to fit in RAM, and in that case we have a memory-bound problem.
In this case we can't even read the data without Dask collections like Dask DataFrame, as we saw earlier.
And here, what we'll do is use Dask-ML estimators that parallelize scikit-learn code.
|
So let's jump in.
First, I want to tell you that ml.dask.org, which you can check out, describes Dask-ML as providing a single unified interface around the familiar NumPy, pandas and scikit-learn APIs.
Now this is true.
It implements the scikit-learn API and interoperates with Dask DataFrame and Dask Array to provide a seamless experience for machine learning tasks in a distributed setting.
Wowie.
So, let me tell you what we're going to look at in this chapter, which I'm very excited to bring to you.
First, I'll start by demonstrating scikit-learn, which, as you may be aware, is a library for machine learning in Python; this will be a crash course in using scikit-learn for machine learning.
Then we'll jump into solving compute-bound challenges with joblib and Dask. After this, we'll solve memory-bound problems using Dask-ML estimators, and we'll end with some references where you can learn more.
Now it's time to jump into the notebook.
Welcome back and I'm so excited to jump into scikit-learn for machine learning with you now.
So you may recall that scikit-learn is a powerful library for machine learning in Python which provides among many other things, tools for pre-processing, model training, evaluation and more.
If your model and data fit on your computer, definitely use scikit-learn with no parallelism.
We'll soon see how to generalize your scikit-learn code to a parallel and distributed setting using Dask-ML.
So let's now see how you can train machine learning models in scikit-learn.
First, we want to create a dataset. You could import one, but scikit-learn (sklearn) has nice utility functions for creating datasets.
So I've just executed this, using the make_classification function to create a dataset that has 100,000 data points and 10 features for each data point.
Now, you may note that we've unpacked the result of make_classification into two variables, X and y; it's worth spending a minute talking about these.
You may recall from your knowledge of machine learning that a machine learning challenge has feature variables which you input to your model and then output or target variables.
That your model is trying to predict.
What we're doing is unpacking the features into capital X, by convention, and the target into lowercase y, also by convention; as we've written here, X is the set of input variables and y is the output or target variable.
If we look at, let's say, the first five entries, or in this case rows, of X, we should get something that's five by ten: five rows of ten columns each.
Similarly, what we hope to see when looking at the first five entries of y is five binary elements, zeros and ones: because we're working on classification we expect discrete outputs, and the default here is binary, so we should see five zeros and ones, as we do.
In the next video, we're going to come back and build our very first machine learning model together; it will be a k-nearest neighbors classifier for the dataset that we've just generated.
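A minimal sketch of the dataset creation described here; the random_state is an extra I've added so the example is reproducible.

```python
from sklearn.datasets import make_classification

# 100,000 samples, 10 features each, binary target by default.
X, y = make_classification(n_samples=100_000, n_features=10, random_state=42)

print(X[:5])   # five rows of ten feature values
print(y[:5])   # five 0/1 labels
```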
So now it's time to build our very first machine learning model together.
We'll be using a basic model here called 'K-nearest neighbors classification'.
If you haven't seen that before, you can check it out in 'scikit-learns' documentation or anywhere else for that matter.
Essentially, it creates a model which makes a prediction for a point based on its neighbors.
The other points near it, scikit-learn actually makes it super easy to train this model as we'll see.
So first we "import the kNeighborsClassifier".
We're doing some timing here just to make sure that everything is running smoothly.
What we then do is we insaniate Classifier.
The classifier passing at the keyword argument and '_neighbors', which specifies how many data points it wants to look at.
Each point, wants to look at around it in order to perform the prediction.
And then we fit this model we've just built using the 'fit( ) method and pass it the features and the target.
Great.
That took next to no time at all, 800 milliseconds.
Now, what we can do is we can use this model 'clf' either to predict on new points or to check out the score as well, see how well it performs.
So what we're going to do is see how well this model performs on our original data.
Now in general, you don't want to do this.
You want to see how well it performs on a holdout set.
You may have seen train test, split or cross validation, which we'll get to a minute previously.
But just to see how this works.
And for the purpose of learning, we're going to look at the score on the data set that we that we trained it on.
So let's execute that.
Now.
This may take a bit longer.
Now, the reason this may take a bit longer is that to train the model you essentially just need to store all the data points, but to score the model you need to compute the distances between each point and its nearest three points, and we're doing that for 100,000 points.
Great.
And what we see is that we had a score of 0.93.
Now, the score we've computed here, as we've written above, is the accuracy: the fraction of the data the model gets right.
So this model got 93% of the points correct.
Of course, this is an overestimation, because we're computing the score on the data we used to build the model, and we'll soon figure out, with cross-validation, how to do that differently.
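A minimal sketch of the fit and (in-sample) score steps, recreating the toy dataset so the snippet stands alone:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100_000, n_features=10, random_state=42)

clf = KNeighborsClassifier(n_neighbors=3)   # each prediction looks at 3 neighbours
clf.fit(X, y)                               # quick: mostly just stores the points

# Accuracy on the training data itself (an overestimate, as discussed above).
print(clf.score(X, y))
```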
So we've successfully fit our first model, the k-nearest neighbors model.
Now it's time to talk about hyperparameters and hyperparameter tuning, which can be very compute-intensive and demanding in terms of CPU.
And so this is a realm in which distributed compute, even in a small- or medium-data sense, can be very useful, as we'll see.
'Hyperparameters'.
What are they?
They're predefined attributes of models that impact their performance.
So remember that in the k-nearest neighbors example we selected K equal to three; we defined that ahead of time.
But perhaps a model with K = 5 has more predictive power and is more accurate, so maybe we want that. Essentially, hyperparameter tuning means looking at the space of all these different hyperparameters, seeing how the model performs for each of them, and then choosing the one that performs best.
Okay, so in our case we want to check how the model performs with different values of K and then select the most performant value of K.
There are lots of ways to do this.
We're going to use a method called GridSearchCV, which is grid search cross-validation. Essentially, this sets up a big grid across all the hyperparameters and performs something called cross-validation, which we're not going to get into now, but if you want, go and check out the docs.
We're not going to execute the code here because it takes a bit of time, so for pedagogical purposes, and purposes of efficiency, we're just going to talk through it.
So we "import GridSearchCV".
Then we want to specify the hyperparameters to be explored.
We set them up in a Python dictionary where the keys are the names of the hyperparameters we want to tune and the values are a list of the values we want to explore.
So we're gonna explore 3, 5 and 8 as values for n_neighbors, and there's another hyperparameter called "weights"; you can check out the documentation as to what it actually is, and we're going to choose two different values, "uniform" and "distance".
All right.
We execute this code: we instantiate our classifier as we have before, and we then assign to grid_search a GridSearchCV with the following arguments: the estimator we want to fit, which is neigh, the k-neighbors classifier; the parameter grid; verbose=2, which gives us detailed output; and then how many folds of cross-validation we want to do.
Once again, you can check out the documentation; you can use 2, 3, 4 or 5 folds (5 is relatively standard), but the more folds, the more time it takes. Then you perform the fit as before, passing it the feature variables and then the output or target variable.
You can see what happened here: fitting two folds for each of six candidates.
Now, why are there six candidates? We've got a three-by-two grid, so we have six pairs: 3-uniform, 3-distance, 5-uniform, 5-distance, 8-uniform and 8-distance.
Okay, So then we saw that took around four minutes on my system.
Now we've done that.
We want to see what the best parameters were and what score they produced and what we actually see.
When we look at the 'best_params' attributes.
We see 'n_neighbors' was 8 and 'weights': 'distance'.
Then we can see what the best score was.
And we saw that when 'n_neighbors = 8 and weights = distance.
The score was very close to 90% accuracy, which is pretty good.
See you in the next video to explore doing this all in a Distributed Setting.
I can't wait.
Alright and welcome back.
I am very excited now to show you how distributed compute can be leveraged for your machine learning workflows.
Having fit a model and checked out hyperparameter tuning, you may have noticed that hyperparameter tuning is something that's embarrassingly parallelizable.
What I mean by that is that we have a bunch of tasks that could happen in parallel and don't need anything from each other: there's no need for data transfer or shared information between them.
So these are tasks you can essentially send to different workers, and that's exactly what we're going to do.
First we're going to use "Single machine parallelism" using scikit-learn and something scikit-learn leverages called 'Joblib'.
And then we're going to look at "Multi machine parallelism" with scikit-learn and joblib and Dask as in the last video, I'm not going to execute all of this code for efficiency and pedagogical purposes, but I'll talk you through it and I really excited for you to execute it yourself and incorporate it into your own machine learning parallelizable workflows without further ado before using Dask I want you to try something called 'Joblib' and this is something that scikit-learned leverages and offers.
With joblib, the only thing you need to do is alter the n_jobs parameter.
So what we're doing is using GridSearchCV again, passing it the same things as before, plus the keyword argument n_jobs, which essentially tells it how many cores to use.
There's a little trick: if you don't know how many cores you have locally, you can set it to n_jobs=-1, which essentially means the maximum number of cores.
So if I had four cores, n_jobs=-1 is exactly the same as n_jobs=4, and that's exactly what we've done here.
Now.
I want you to notice that for me this took two minutes and 44 seconds, whereas previously, without leveraging joblib at all, it took four minutes.
So all that's to say that the compute time was reduced significantly, almost by half in fact, which I think is pretty exciting when all we needed to do was add an extra kwarg and set n_jobs equal to minus one.
This is all well and good for single-machine parallelism, but let's say you wanted to do multi-machine parallelism and leverage a whole bunch of cores and clusters for your computation.
So this is where Dask comes in.
So Dask offers a parallel back end to scale this computation to a cluster.
The first thing that you need to do is spin up a cluster and open the dashboards as you've done before.
And you see this is what I've done here and then what you do is you pretty much use the same APIs as before using grid search CV.
And what we do is we once again in santiate a grid search CV and assign it to grid search.
Now, what we do is we set up a context manager using "with" and we execute "with joblib.parallel_back end", selecting Dask and assigning to the scatter, quag, X and y.
Then within that context we fit grid search to X and y as as we've done previously.
What I'll get you to notice is really the only thing we're doing differently now is doing it within the context of using the "daskparallel_backend" forth, scikit-learn.
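A minimal sketch of both forms of parallelism described here; the cluster is a plain local one, and the dataset and grid are the same toy versions as above.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100_000, n_features=10, random_state=42)
param_grid = {"n_neighbors": [3, 5, 8], "weights": ["uniform", "distance"]}

# Single-machine parallelism: n_jobs=-1 means "use every core I have locally".
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=2, n_jobs=-1)
grid_search.fit(X, y)

# Multi-machine parallelism: hand the same work to a Dask cluster.
client = Client(n_workers=4)
with joblib.parallel_backend("dask", scatter=[X, y]):
    grid_search.fit(X, y)
client.close()
```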
And you'll see that took three minutes and 22 seconds for me. I hope you enjoyed that a great deal.
And when we come back, we'll be having a checkpoint for you to exercise your new muscles here.
Thank you for spending so much time listening and watching and hopefully coding along.
But now it's time to get your hands even dirtier with a checkpoint.
So the question here is to fit a logistic regression cross validation model on the given data.
Let's unpack that slightly.
Logistic regression is another type of classification model, and that's really all you need to know for the purposes of this checkpoint.
So it's kind of similar to k-nearest neighbors: it's a different type of model, but the API generally works in the same way, with fit and predict.
Now there's a little twist here.
This is actually a cross validation estimator.
So you'll be using LogisticRegressionCV, which combines logistic regression with the GridSearchCV-style capability from above.
So to see how that works, I'd encourage you to click on the documentation here.
So what we want you to do now is to implement this with and without parallelism and see how long each of them takes.
Best of luck.
All right.
Okay.
So we have seen how to use and leverage distributed compute in the compute-intensive, CPU-bound case; for example, we've looked at hyperparameter tuning. But as we have stated, distributed compute can also be leveraged for memory-bound problems.
As we've seen, these types of problems arise when your dataset is too large to store in memory, and this is where Dask can help.
In the previous course, you saw how Dask DataFrames can be used to perform pandas-like operations on larger-than-memory data; in the same fashion, we can use Dask-ML to perform scikit-learn-ish operations on our large datasets.
So that's what we're going to do now.
We're not going to import a significantly larger dataset, but we're going to show how the API works; we'll just walk through the code for pedagogical purposes.
So, first we "import dask_ml.model_selection as dcv".
Now, Dask-ML model selection has something in it which is a grid search cross validation method but generalizes to out of memory situations.
So we do that.
Then we set up the parameter grid as we've done beforehand And then once again we set up the grid search and use grid search.fit supplying it with the arguments X and Y.
As before.
Let's also have a brief look at another algorithm: in the previous checkpoint you met logistic regression, and now we're going to show logistic regression using Dask-ML.
This really showcases, if you know a bit of scikit-learn, how Dask-ML mimics it in a very ergonomic and user-friendly way: as Dask-ML implements the scikit-learn API, the code is similar. From dask_ml.linear_model we import LogisticRegression.
Then we take the logistic regression and fit it to X and Y.
And then we check out the score on top of that.
We can also use it to predict on new data, but we're doing it on X here, of course.
But we can generalize that to new data as well.
And we'll check out the first five elements there where we see it predicts false, false, false, false and true.
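Here's a minimal sketch of those steps, using synthetic Dask-backed data as a stand-in for the real arrays:

from dask_ml.linear_model import LogisticRegression
from dask_ml.datasets import make_classification

# Stand-in Dask-backed data; in practice X and y would be your larger-than-memory arrays
X, y = make_classification(n_samples=10_000, n_features=20, chunks=1_000, random_state=0)

lr = LogisticRegression()
lr.fit(X, y)                 # fit to X and y
print(lr.score(X, y))        # check out the score
preds = lr.predict(X)        # predict (here on X; new data would work the same way)
print(preds[:5].compute())   # first five predicted labels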
All right, that's it.
And we'll be back in a minute for a checkpoint
|
|
show
|
0:44 |
All right, everybody, now it's time for a checkpoint.
We've just seen how to use Dask-ML in order to generalize the scikit-learn API, code and algorithms to out-of-memory data sets.
So now it's your turn to use Dask-ML to do the same with a "Naive Bayes classifier" on the given data set.
So we've seen it with k-nearest neighbors and also with logistic regression.
Now you can do it with a Naive Bayes classifier.
If you want to know a bit more about that, you can check out the scikit-learn resources either by googling it or clicking on the hyperlink here.
So what we're going to get you to do is put your answer in there, and of course always respect distributed data science hygiene by closing the cluster afterwards.
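As a starting point, a rough skeleton might look like the following; it assumes the GaussianNB estimator from dask_ml.naive_bayes is the one intended here, and X, y and client stand in for the given data set and your running cluster:

from dask_ml.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X, y)            # X, y: the given Dask-backed data set
preds = nb.predict(X)   # scikit-learn-style predict

client.close()          # distributed data science hygiene: close the cluster afterwards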
|
|
show
|
2:31 |
Welcome back.
Now it's time to talk about Dask in the cloud.
What do I even mean by this, and why would we want to do something along these lines when we've seen how we can leverage the distributed computation of our local workstations?
The truth is, scaling to the cloud can help us a great deal: larger workflows will benefit a huge amount from more computational resources, such as large clusters, which you may not have locally.
So there are cloud services that can help you leverage these types of clusters such as Amazon Web Services, Google Cloud Platform, Microsoft Azure and so on.
How do you get Dask up and running on these services?
There are different types of Dask cloud deployments, such as the 'Kubernetes integration' and 'Yarn integration', among many others.
Now, there are a significant number to choose from, and you will need to know a bunch about containerization, Dockerization, maybe Kubernetes, these types of things, in order to get this done. There are also significant challenges, such as environment and data management.
These involve questions such as: do all the machines have the same software installed?
Can many people share the same hardware and where is the actual data?
Another challenge involved with cloud deployments is security and compliance, which your team leads and IT will be very much interested in.
These are questions of authentication, such as who has access to these machines, and security: what stops others from connecting and running arbitrary code as me or you, the user?
Now, there's another challenge, which is cost management, and this is huge.
You want to know what will stop a novice from leaving 100 GPUs idling.
You want to track costs, so you want to know how much money everyone is spending, and you want to optimize costs and optimize workflows for cost. So how do we profile and tune for cost?
So if you're going to get up and running on the cloud, these are the types of questions that you'll need to answer.
So I gave a talk at the 'Dask distributed summit' in 2021 about getting Dask working on the cloud and hoping to get Dask available to everyone.
And I encourage you to check that out at the bit.ly link for "Dask for everyone", if you're interested.
But what's happening next is we're going to jump into a notebook and check out how to get Dask up and running on the cloud with a particular service called Coiled. And, disclaimer: I work for Coiled and I love it a lot.
I'll see you in the notebook.
|
|
show
|
4:02 |
All right, so now it's time to jump in and look at a bit of machine learning in the cloud.
Now this section is optional.
We've given you some resources to think about how to do your data science workflows in the cloud.
This section is optional because we'll be using the product that we work on, Coiled.
You can also get set up on AWS yourself with Dask-Kubernetes or any of the other technologies we just introduced, but we'll be doing it on Coiled, so feel free to get started.
Sign up and code along as well.
So what I've already done is signed up to Coiled Cloud and got my login token there, and I'm going to use some of my free credits. Now, for the purposes of time, I've already imported coiled and created a cluster there.
What I'm gonna do now is instantiate a client and then look at the dashboard and then we're going to do some machine learning.
So what this has actually done is create a cluster on AWS for me, using all the Coiled technology, according to a predefined software environment.
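A minimal sketch of that setup, assuming you've already signed up and logged in to Coiled; the worker count is a placeholder:

import coiled
from dask.distributed import Client

# Create a cluster on AWS via Coiled (worker count is a placeholder)
cluster = coiled.Cluster(n_workers=10)
client = Client(cluster)
client  # displays the link to the Dask dashboard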
It throws a little warning here. All that's signaling, it's not an error mind you, is that there are version mismatches between what's in my cloud environment and what's happening locally.
So that's generally cool for the time being.
Remember, you can click on this thing here to get our Dask dashboards.
We can also look at our opinionated Coiled dashboard here, which shows you what we consider to be the most important diagnostic tools for Dask.
We've got, you know, the task stream, processes, progress, things along these lines, and we'll see a bit of action there in a minute.
So the dashboard link of course points to AWS. So what we're going to do is fit a "KMeans" model to some data that we're going to generate using scikit-learn, and we're going to use the "dask_ml" KMeans.
So here we're not actually trying to predict a label; we've got all these data points and we're trying to find clusters of them.
Okay, so for those of you who have done a bit of machine learning, this is something called unsupervised learning.
That doesn't really matter.
We're just gonna show you how the API works and how easy it is to scale your workloads to the cloud here.
So we're going to generate some fake data or synthetic data once again.
And it's a small data set.
And we're doing this for pedagogical purposes.
But we'll see how quickly it's processed on the cloud.
We're going to import KMeans, and now we're going to fit KMeans to the data that we've generated.
What we're going to see on the dashboard is a bunch of work starting to be done.
So we can see a bunch of arrays, getitems, data transfer, these types of things.
That's good.
We can see all that work happening across all our workers there and it looks like it may have stopped.
Let's go and see in JupyterLab. Yep, that took 20 seconds, and that all happened on AWS through Coiled Cloud.
What we want to do is see what labels it actually predicted.
So I think it would be finding five or so clusters.
That's the data we generated anyway.
So we'll see whether that's the case.
This will be a 100-by-1 array of the predicted labels, or clusters, for each of our data points.
And let's just compute the first 10 and see what it looks like.
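Putting those steps together, a minimal sketch of the workflow might look like this; the sample sizes, chunking and number of clusters are assumptions for illustration:

import dask.array as da
from sklearn.datasets import make_blobs
from dask_ml.cluster import KMeans

# Small synthetic data set, wrapped as a Dask array (sizes here are placeholders)
X_np, _ = make_blobs(n_samples=100, centers=5, n_features=10, random_state=0)
X = da.from_array(X_np, chunks=25)

km = KMeans()                  # default number of clusters
km.fit(X)                      # the work runs on the Coiled/AWS cluster via the client

labels = km.labels_            # one predicted cluster label per data point
print(labels[:10].compute())   # inspect the first 10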
Perfect.
Looks like it found more than five clusters.
That's fine.
Maybe it's an algorithm we need to tweak in future.
That wasn't the point of this video.
Rather, it was to show you, with products such as Coiled Cloud, how easy it can be to scale to the cloud. But I definitely encourage you all not only to check out Coiled Cloud, but also to check out whether you want to figure out how to provision your own AWS clusters and that type of stuff using Dask-Kubernetes or whatever it may be.
And if that works for you, as always, we practice healthy distributed data science hygiene and close our client, and we'll be back soon to tell you about a few references for further work.
|
|
|
1:39 |
|
show
|
1:39 |
With that, we come to the end of this course.
We hope that you found it helpful and have started tinkering with Dask yourself. To recap, we covered many concepts in this course.
We covered Dask Arrays, Dask Delayed, Dask Bag, Dask Schedulers and Dask-ML with a brief introduction to scaling to the cloud.
This is just the tip of the iceberg.
We have covered the fundamental topics to help you kick-start your Dask and scalable data science journey, but there is so much more to Dask.
For example, there are many advanced Dask concepts that give you finer control over your computations.
We looked at Coiled cloud but there are many ways to deploy Dask.
One of Dask's superpowers is how well it can work with other tools, and there is so much to discuss there.
If you'd like to continue learning, and seriously, who wouldn't, Dask's official documentation is one of the best resources, followed closely by Dask's examples repository.
If you look up Dask on GitHub, you'll notice that Dask has numerous sub-projects maintained by different teams.
Some of these projects have their own documentation, which is also worth checking out.
As you can see, this is a huge community effort, and we would like to thank all the Dask contributors for maintaining and improving this amazing project.
Thank you for taking this journey with us and we hope to see you around in all the Dask spaces.
Happy scaling!
|