Hello everyone and welcome to the Talk Python Training Course: "Getting started with Dask".
My name is Matthew Rocklin, I'll be one of your instructors for this course.
So first, what is Dask?
Dask is a library for parallel computing in Python. That means Dask runs like any other Python library, but it controls a variety of parallel resources for you so that you can execute your code at scale.
Dask is free and open source software.
It's developed with much the same practices as other libraries like Pandas or Jupyter, and it was originally designed to scale out a lot of the data science, data engineering, and machine learning workloads that we often find in Python today.
Dask was designed to scale up to use all of the cores on your local laptop or to scale out to distributed machines either on prem or in the cloud.
Dask makes it very easy to parallelize common processing operations in workflows built on Numpy, Pandas, Scikit-learn, and many other libraries throughout the PyData ecosystem.
So you can really think of Dask as infrastructure for the scientific Python, or Python data science, ecosystem.
It sits at the same layer as libraries like Numpy, Jupyter, or Cython: it's a tool that many other libraries in the Python stack use in order to add a little bit of parallelism.
So it's really a framework for building distributed applications, and as a result it's been used with lots of other systems throughout the ecosystem, including very common libraries like Pandas, with which Dask was co-developed, but also lots of new and exciting capabilities within Python, such as time series processing, workflow management, machine learning, GPU processing, and many more.
Because Dask is so flexible and so lightweight, it's been used inside of lots of other libraries, adding parallelism to many parts within the ecosystem.
However, most people start using Dask with traditional Data Science or Data Engineering workloads.
Here we're thinking of libraries like Pandas, Numpy, and Scikit-learn, and Dask has been developed to include APIs that look very, very familiar to users of those libraries.
So for example, if you're using the Pandas read_csv and groupby operations, Dask has equivalent operations built in that allow you to operate on much larger datasets with the same familiar APIs.
The same for Numpy and Scikit-learn as well.
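To make that concrete, here's a minimal sketch of the idea; the file and column names below are placeholders rather than anything from the course.

```python
import pandas as pd
import dask.dataframe as dd

# Pandas: eager and in-memory -- the whole file must fit in RAM
df = pd.read_csv("data.csv")
df.groupby("key").value.mean()

# Dask: the same API, but lazy and split into many partitions that run in parallel
ddf = dd.read_csv("data-*.csv")            # can point at many files at once
ddf.groupby("key").value.mean().compute()  # .compute() triggers the parallel run
```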
In this course we're going to assume that you have a basic understanding of Python programming.
It's also quite useful if you have some understanding of Pandas and Numpy and the rest of the scientific stack, but it's not necessary.
This will help you move a bit faster.
But we'll teach you some of those things as well.
In this course we're gonna cover five sections.
There's "What is Dask", which we've just covered.
Next we'll go into a brief introduction to Big Data, then we'll talk about how to set up your environment to do the exercises within this course.
Then we'll dive more deeply into using Dask DataFrames to scale Pandas code, this is probably the most common use case.
Finally, we'll get into a bit of diagnostics and show you how you can get a lot of visual feedback from your computations.
Again, my name is Matthew Rocklin, I'm one of the lead maintainers of Dask and the CEO of Coiled.
I'm really excited to get started with this course.
Hi, my name is Michal and I will also be your instructor for this course.
I'm a Data Science Evangelist at Coiled.
Prior to joining Coiled, I led a successful consulting career, delivering numerous solutions in Data Science and scaling them using high-performance Python.
I'm super excited to talk to you about Dask because it's the perfect tool for scaling the PyData stack.
When you follow along with the course, execute code and work on exercises, you may sometimes run into errors like this one here.
No worries; in these situations, the first step is to scroll to the bottom of the traceback and try to figure out what the issue is.
As you can see here, Dask gives you some useful information about the error and how you can fix it.
In this example, as Dask says, specifying dtype values manually should fix this error.
What if the error message isn't so verbose and well explained?
You can copy the error message and paste it into your search engine.
Someone else might have come across the same issue before and written about possible solutions.
A common place you might search is Stack Overflow or GitHub.
Stack Overflow is a very popular forum where developers help one another by asking and answering questions.
It's a great knowledge exchange center, built up over many, many years.
In fact, for modern libraries like Dask, that are growing and improving so fast, Stack Overflow can be the leading source for information about problems with new features.
Many Dask maintainers follow the Dask topic on Stack Overflow.
It is very likely that you'll find an answer here.
If not, you can always click on the blue button and ask the question yourself.
Finally, if all else fails, you can reach out to the course instructors or ask on the corresponding GitHub repository.
You can create a new issue on the project issue tracker, but make sure you go through the closed issues to avoid duplication.
Okay, now we're well armed to solve problems and are ready to jump into the course.
In the beginning, Matthew talked about how Dask is used to scale your data science workflows but let's pause for a second and consider what scaling to larger datasets actually means.
We can broadly divide datasets into three categories: small, medium, and large.
A small dataset can be loaded into your local RAM and you can do analysis on it comfortably using the tools you know and love.
Medium datasets are datasets that exceed your RAM capacity but still fit on your local disk; normally our tools would fail on them because of insufficient memory.
You can take advantage of parallel computing by loading only one part of the data into RAM at a time; Dask helps you scale up.
Large datasets, however, do not fit on your physical local drive, and distributed cloud computing is the only viable solution.
Dask helps you scale out to fleets of machines in the cloud as well.
We typically call these large datasets Big Data.
Another way to think about Big Data is any data that can't be processed with traditional methods, any data that needs extra engineering to handle it efficiently or to even begin working with it.
When we talk about Big Data, it's worth mentioning the four V's that define it.
The first is Volume: technology is enabling the collection of more and more data.
Velocity is the second one: data is being generated at a speed like never before, thanks again to technology, to social media, and to internet access being democratized for larger and larger groups of people.
Veracity: collecting data always raises concerns about accuracy, authenticity, and bias.
When data grows, so do the reliability concerns.
Variety: data is available in so many different formats, from text to images to videos to social content and more.
New formats are invented every year. I encourage you to pause and look at this infographic prepared by the IBM Big Data & Analytics Hub.
Let's stop for a checkpoint.
Can you think of some examples of Big Data?
Big Data is all around us.
Any field you look into, examples abound.
Financial transactions, social media activity, robots in factories, genome sequencing.
Data is integral to modern life.
Okay, now we can get started actually using Dask, let's get into setup.
Before we dive into the course, let's set up our environment.
We encourage you to use one of these two ways: either set up a local environment with Anaconda on your computer or, if you're in a hurry, use Binder. Binder is a great open platform that offers executable environments for anyone, anywhere, at a moment's notice.
We have one created for this course.
However, Binder will not store your changes to the notebooks.
You can get started with Binder very, very quickly by pressing this badge on our GitHub repository README page for this course.
That said, we would like to encourage you to modify the notebooks that come with this course, and as we introduce new notebooks, we invite you to try things and explore the Dask API.
In our experience, trying things out in the interactive Jupyter environment is a fantastic way to learn new concepts quickly and through practice.
That gives you a deeper level of experience.
Setting up your local environment gives you the ability to keep your notes and come back to them.
Let's go through the setup right now.
We will start the set up by cloning either your own fork of the course repository or the original course repository.
Go to your terminal and type git clone, paste the course repository link and hit enter.
This will quickly download the cloned copy onto your disk.
Then we just need to change directory into the course directory, where you will find an environment.yml file which has the conda environment specification.
To install that environment, we only need to type conda env create, to create a new environment, and specify the file from which the instructions come.
So in our case that's environment.yml.
Let's launch that.
This triggers the set up process.
This can take a few minutes so we'll skip through that part and see you on the other side.
After the environment is done installing, we need to activate it.
To do so, we can follow this instruction that the installer has provided for us up here.
So just type in conda activate talkPython-dask.
You can see that on the left hand side at the beginning of the line, the environment name now appears instead of base.
Okay.
We're ready to start jupyter lab.
After typing jupyter lab into the console, we'll just wait for it to start.
It will automatically open a browser window for us and show us our repository.
JupyterLab starts in the directory from which it is launched in the console.
We're in the course repository, introduction-to-dask, and we need to open the first notebook, which is 00-setup.ipynb.
Once we're in, we see a typical JupyterLab window where on the left hand side we have the utility bar and in the main window we have notebooks.
We also have access to a launcher which allows us to create new notebooks, new consoles, terminals, files and so forth.
Let's go back to the setup notebook.
A notebook consists of cells, which can be Markdown cells like this one here; these provide rich text and can contain images, links, and headings.
It really helps with making the whole notebook feel like a story, like a book or a journal.
Then we see code cells which also have output.
For example, running this cell would provide us with an output.
First, actually we need to import a function that this cell uses.
Then we can run it right here.
These two functions take about two minutes to run, and to run a cell we just need to press Shift + Enter or Ctrl + Enter.
And we've got the result.
You can see that JupyterLab features syntax highlighting, and the interactive environment allows us to move cells around.
So if I wanted to, I could move those cells up and down.
I can also copy and paste cells.
It's a very very nice working environment, especially for data science and for interactive workloads.
Let's explore setting up a variable and printing it.
So Jupyter cells, by default, print out the value of the expression on the last line.
That's an easy way to get to know what is in a particular variable.
I can add a cell just to see what is inside the variable a; I can see it's 11, which makes sense: 10 plus one.
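As a tiny illustration of that behavior (using the same 10 plus 1 from the walkthrough):

```python
# cell 1: an assignment on its own produces no output
a = 10 + 1

# cell 2: a bare expression on the last line is echoed back by Jupyter
a   # displays 11
```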
Before we go, let's just have a look at keyboard shortcuts.
So those are very very useful, they accelerate your work.
You can run cells by pressing this play button up here in the top bar on the notebook screen.
You can also add cells there by pressing the plus, use copy, cut, and paste, and you can also restart the notebook kernel.
But a lot of these activities can be accomplished by keyboard shortcuts.
So for example, "A" and "B" are for inserting cells, and "M" and "Y" are for switching cells between Markdown and code.
Of course you can run cells as well with Shift + Enter, Ctrl + Enter, or even Alt + Enter.
All of those are variations; you can read more about them in the Help menu.
It's time well spent getting to know keyboard shortcuts for JupyterLab.
As for the exercises in this course, every notebook will have them scattered around so that the learning experience is deeper.
We encourage you to always spend the time to try the exercise and, ideally, to make it even richer on your own; that will unleash your creativity and help you understand the framework more deeply.
Please try that whenever you can; it will be very worth your while.
At this point, just to let you know the format: questions will be written like this, we'll invite you to type in your answer, and then, hidden behind this ellipsis, there will be a cell with an answer that we provided for you, which you can check against.
So before you open that ellipsis, make sure to give it an honest try; or, if you don't have time, just go right to the answer and learn from it.
Great.
By the way, if you want to collapse a cell, you can do so by pressing on this blue bar here; you can collapse and expand cells.
Like I said before, it's a fantastic environment to work in.
You can have multiple notebooks running at the same time, you can have multiple tabs, arrange them as you wish.
If this is your first foray into JupyterLab, I hope you enjoy it very much and have fun with the course.
We are almost ready to jump into the notebook and start using Dask.
But before we do that, let's look under the hood and find out what are the components of Dask and how they work together.
At a high level, Dask has collections that create task graphs.
Then the task graphs are consumed by schedulers, which delegate the computations to workers.
Collections are the APIs you use to write Dask code. Collections can be high-level, like Array (corresponding to Numpy), DataFrame (corresponding to Pandas), and Bag, or they can be low-level collections, such as Delayed and Futures.
These collections create a task graph; let's look at what a task graph is.
For example, take these two functions: they do simple mathematical operations and sleep for one second each.
The tasks x and y can be executed in parallel.
However, the task z depends on the results of x and y.
Therefore the total time is two seconds.
Each individual task takes one second, so executed in sequence they would take three seconds.
But because the task graph captures which parts of the work are independent, x and y are actually executed in parallel, without us having to say so explicitly.
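Here is a minimal sketch of that graph in code; the function names inc and add are stand-ins for the two functions on the slide, not necessarily the course's exact names.

```python
import time
from dask import delayed

def inc(i):
    time.sleep(1)
    return i + 1

def add(a, b):
    time.sleep(1)
    return a + b

x = delayed(inc)(1)      # x and y do not depend on each other...
y = delayed(inc)(2)
z = delayed(add)(x, y)   # ...but z needs both of their results

# run sequentially this would be 3 x 1 s; because x and y run in parallel,
# z.compute() finishes in roughly 2 seconds
z.compute()
```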
Finally, these pieces come together in a cluster; let's look at what a cluster is comprised of.
First, there is the scheduler, which is the beating heart.
It consumes the task graph and sends tasks to the workers; it manages the workers and their interactions, and it knows where the workers are, which part of the data is on which worker, and so forth.
Then, the workers are the machines that can be added or removed and they perform the actual computation.
Dask is quite dynamic, so new workers can even appear while a workflow is being executed, which is known as dynamic scaling.
Then finally there is the Client; the Client is the window to the world.
It lives where you write your Python code, in your JupyterLab session, in your command line interface and so forth.
It's the entry point for you to interact with the cluster.
This is what it looks like in JupyterLab: the Client has a nice output presentation, which tells you where the dashboard is; you can use the dashboard to further inspect the inner workings of the cluster.
It also has information on the resources allocated to the cluster.
Great, now let's do a checkpoint.
Why is the scheduler important?
The scheduler is a core part of any distributed computing system.
Amish barn raising is a great metaphor for distributed computing.
The scheduler is like the foreman that delegates tasks to all the builders.
The Dask scheduler takes care of a lot of things, including task stream optimization, data locality, and it is constantly being improved by world class experts.
Now we're ready to introduce the Dask API.
Here is where the fun starts, let's write some actual Dask code.
In this chapter we'll learn how to spin up a Dask cluster and introduce the Dask delayed API.
Let's dive into the notebook.
We are in notebook number one.
All we really need to start a Dask cluster is in this first block of code.
We import the Client class from the dask.distributed module and then create a new client object while specifying the number of workers that we want.
That's it, that's really all it takes.
Looking at our client, we can see the locations of the scheduler and the dashboard, as well as the available cluster resources: the workers, the cores, and the memory.
The dashboard URL will take you to Dask's diagnostic dashboards that display real time information about the state of your cluster.
Go ahead, click the link, explore the dashboard.
When you're done with your computation, always remember to close the cluster.
If you have multiple clients running, it may cause a lot of confusion, and if you're connected to a remote cluster, you might be accumulating idle charges.
We don't want that now, do we?
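A minimal sketch of that first block, assuming a local cluster with four workers (the notebook's exact worker count may differ):

```python
from dask.distributed import Client

client = Client(n_workers=4)    # starts a local scheduler plus four worker processes
print(client.dashboard_link)    # URL of the diagnostic dashboard shown in the client repr

# ... run your computations here ...

client.close()                  # always close the cluster when you're done
```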
Dask Delayed is one of the low-level APIs in Dask.
Let's look at how we can parallelize and distribute any Python code with Dask.
Consider these two functions that do basic arithmetic operations and sleep for one second each.
In regular Python, incrementing two numbers and adding them together happens sequentially and takes three seconds in total.
We can parallelize this regular Python code using Dask Delayed.
All you need to do is use the delayed decorator for the appropriate operations.
Well, let's look at that.
It took 600 microseconds.
Fantastic, right?
But wait a second: what actually happened here is lazy evaluation.
I don't want to burst your bubble, but Dask has not actually computed your result yet.
If you recall from "Dask under the hood", Dask has created a task graph here, and it's ready to compute whenever you ask it to.
This is called lazy evaluation.
It evaluates only when you need it to.
Dask computes your result only when you call compute.
You can also visualize the task graph by calling visualize().
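For reference, here is the same inc-and-add example written with the delayed decorator, compute, and visualize, as described above; visualize needs the optional graphviz dependency, and the function names are again illustrative.

```python
import time
from dask import delayed

@delayed
def inc(x):
    time.sleep(1)
    return x + 1

@delayed
def add(x, y):
    time.sleep(1)
    return x + y

x = inc(1)
y = inc(2)
z = add(x, y)     # builds the task graph almost instantly -- nothing has run yet

z.visualize()     # draws the task graph (requires graphviz)
z.compute()       # now the work runs: inc(1) and inc(2) in parallel, then add
```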
Dask DataFrame is the high-level API that we use to scale Pandas code.
Take a look at these two code snippets.
Do you see how Dask code on the right is almost identical to Pandas code on the left?
That is not a coincidence, it's an intentional deliberate choice of design.
The Dask creators wanted to invent nothing new; they wanted Dask to be as familiar as possible to users of the PyData stack.
In this chapter we'll start by downloading the New York City Yellow Taxicab Dataset.
We'll perform some Pandas operations on it.
We'll then scale this same Pandas code using Dask DataFrame and use the Dask dashboards to understand the parallel computations happening live.
And finally, we'll discuss some limitations of Dask DataFrame and share some resources where you can learn more.
Pandas is an incredibly popular library for analyzing tabular data.
Data practitioners use Pandas for pre-processing tasks and for exploratory analysis.
Pandas is a very powerful library but it has a limitation that is hard to overcome when working with Big Data.
It can only work on data in your RAM.
When your dataset exceeds that, Pandas throws a memory error like this one here.
We are in notebook two of this course.
For this notebook, we'll be using the New York City Yellow Taxi Trips Dataset.
This is a public dataset, released by the city of New York for everyone to use and download.
It contains Taxi Cab trip records dating all the way back to 2009, which amounts to over 200 gigabytes of files.
We'll be using only data for 2019.
To download the dataset, either run the cell given here, the wget cell or download it directly from the website.
The website is linked here, at the blue New York City Yellow Taxi Trips Dataset hyperlink in the cell above.
If you're on Windows this wget instruction might not work.
Okay, as we will rely on this dataset now and in the future, we recommend curating all the files in a subdirectory called data which is easily accessible from your workspace.
Okay let's read in the data.
Pandas has a read_csv method to read CSV files into your Python session.
We will read data for only one month, which is already a lot.
We'll use January and time the reading using the %%time magic, the double-percent time magic.
Note that it takes us 12 seconds to load from disk into memory.
It may be different depending on your machine configuration but it will be a number in seconds.
Now we have over seven million rows loaded directly into memory ready to use.
It's over seven million rows with 18 columns of various types.
We can learn more about the dataset by running the df.info() method.
It shows us the different column names as well as their data types and the total memory usage.
Great, after reading the data, the next step is to do some meaningful computation on it.
Let's find the mean of the tip_amount as a function of passenger_count of the vehicle.
We can use the mean and groupby functions for these operations.
If you've used Pandas before, you've probably seen them.
Those are very popular methods.
groupby splits the dataset by column values while mean calculates the mean for those groups.
Again we time it and wow, we have the result in just 100 milliseconds, that is pretty impressive.
That makes the price we paid at the beginning, the 12 seconds it took to load the entire file into memory, well worth it.
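Here's a sketch of those Pandas steps; the file path follows the standard TLC naming and may differ from where you saved the data. Run each step in its own cell, with the %%time magic on the first line if you want the timings.

```python
import pandas as pd

df = pd.read_csv("data/yellow_tripdata_2019-01.csv")   # one month, over 7 million rows

df.info()                                              # column names, dtypes, memory usage

df.groupby("passenger_count").tip_amount.mean()        # mean tip per passenger count
```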
Fantastic, we've seen some basic operations in Pandas.
As mentioned earlier, Pandas can't deal with data that is larger than memory; it really shines right up until we run into datasets that are too big.
In this cell, we try reading data from all months using Pandas.
In this cell right here, we read every file, all 12 of them, and then concatenate them together.
Feel free to uncomment and run this cell on your own to see the dreaded memory error.
This is where Dask comes to our aid.
We can use Dask DataFrame to handle larger than memory data while relying on the familiar Pandas API.
We don't need to change our code. Dask DataFrames are made up of multiple smaller Pandas DataFrames split along the index.
In fact, Dask DataFrame actually executes Pandas operations internally.
Each of those smaller DataFrames is called a chunk, or partition, and the upper and lower bounds separating them are called divisions.
Dask operations are executed in a coordinated way across the smaller Pandas DataFrames.
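You can inspect that structure directly on a Dask DataFrame; the path below is illustrative.

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/yellow_tripdata_2019-*.csv")

ddf.npartitions    # how many underlying Pandas DataFrames make up this Dask DataFrame
ddf.divisions      # the index boundaries between partitions (often unknown for CSV input)
ddf.partitions[0]  # the first partition, itself a (lazy) single-partition Dask DataFrame
```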
We are back in the notebook.
Let's again use the same New York Taxi dataset but this time with Dask DataFrame.
First, let's spin up a cluster.
I hope you remember how to do that.
When using Dask, it's a good practice to have the dashboard open to the side.
You can access the dashboard by clicking the link here, which opens a tab with the Dask status (currently empty), or by using the Dask JupyterLab extension, which is available here in the sidebar.
Let me click the Dask logo in the sidebar which opens an entire section here with all the different views that Dask dashboard provides.
I'm going to click on the Cluster Map, which shows the interactions between the client and the workers; the Task Stream, which shows what each worker is doing at any given moment; and the Dask Workers, which tells me the CPU and memory usage across the cluster.
This might be uncomfortable to work with so I need to rearrange my tabs.
First, I'll close the sidebar and then arrange my tabs neatly on the side so that I can work with my code as I view the status of the cluster.
Great, after doing that, I need to make some more room for my code, perhaps move the Dask Workers to the lower tab or keep it up here.
That's up to you.
Okay, this setup works for me.
Let's perform the same read operation as we did before but with Dask.
Let's time it as well.
I'm going to use Dask DataFrame read_csv to load the entire year instead of one month as I did last time.
It took 400 milliseconds to complete this operation.
But what actually happened? Did my data get loaded into memory?
No, it didn't.
This is a lazy operation as we've covered before.
What we've got instead is a task graph.
The task graph will be invoked when we actually want to perform computations.
On that note, let's do some basic reading.
Let's see what are the top few rows by using the head method.
We can see in the Task Stream that it took 1.6 seconds to just get a little bit of the data in, and we can see the result up here.
To look at the last few rows, we can use the tail method.
Oh, but we get a big nasty error.
Let's scroll down and look at it closer.
Okay, Dask provides a verbose explanation of what the error might be and where the mismatch is.
Because Dask is not loading the entire dataset into memory all at once, it has to estimate the data types.
Because the last few rows differ from what Dask initially estimated, we get this error.
The official recommendation, which we can see in the error message is to specify manually what data types belong to what columns.
Dask estimates data types with a small sample of data to stay efficient so it's common to run into this error.
This is also why you get to see the message with all the recommendations in it.
Great, let's apply the recommendation and load the dataset but this time with the specified data types.
The head method worked and so did the tail method.
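A sketch of that fix; the exact dtype overrides to use are the ones Dask lists in its error message, so the columns below are only examples from the 2019 yellow-taxi schema.

```python
import dask.dataframe as dd

ddf = dd.read_csv(
    "data/yellow_tripdata_2019-*.csv",
    dtype={"RatecodeID": "float64", "VendorID": "float64"},  # example overrides
)

ddf.head()   # reads only the first partition
ddf.tail()   # reads only the last partition -- now consistent with the declared dtypes
```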
Excellent, let's move on and have a look at some basic operations like we did last time.
Let's group by passenger_count and calculate the mean tip_amount.
Again, this is not the actual result; it took just 12 milliseconds to accomplish.
To get the actual result I need to call compute, which will trigger all the steps, from reading the CSVs from disk all the way through the groupby and mean computations.
At the end, we'll get a Pandas DataFrame or Pandas Series with the output.
Let's time that.
As I run this operation, I can see the Task Stream getting busy.
I can see worker cores coming online, matching the number of available cores, which should be 16 in my case.
I can see the different chunks being read in, I can see the cluster nodes communicating with each other and with the scheduler and I can see CPU utilization as well as memory use across my cluster.
The tasks continue, the workers continue communicating and I'm still waiting for my result.
I can also see that the entire operation has been running for just over 30 seconds now, and I can see successive parts of the CSVs being read.
This is very helpful to see, and as there are no white gaps between the tasks, I can see the workers are busy at every moment.
As the CSVs are read, we can see tasks in different colors, which signal that the data, now in memory, is being processed.
And in just under 60 seconds, I was able to get my results with the same code as I would with regular Pandas, but on a dataset that's bigger than memory.
Along with lazily computing task graphs, Dask also releases intermediate and final results from memory unless specifically asked to keep them.
In order to store intermediate results for future use, we can use the persist method.
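In code, the difference looks roughly like this, continuing with the same DataFrame and groupby pattern as above (the path is illustrative):

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/yellow_tripdata_2019-*.csv")  # illustrative path, as before

# compute(): run the whole graph (read CSVs -> groupby -> mean) and
# return the final result as an ordinary Pandas object
mean_tips = ddf.groupby("passenger_count").tip_amount.mean().compute()

# persist(): run the graph too, but keep the results in the workers' memory
# as a Dask collection, so later computations can start from there
ddf = ddf.persist()
```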
It's time for a check point.
Can you compute the standard deviation for tip_amount as a function of passenger_count for the entire dataset?
Please type your answer here.
In some cases, you can benefit from sharing intermediate results.
We've learned that Dask computations are created as task graphs.
In other words, they are just plans.
When two computations are related, they can use the same intermediate steps.
We can use that phenomenon to our advantage for shared efficiency.
For example, when computing minimum and maximum values.
Let's look at an example here.
Just like in Pandas, you can use min and max to compute the minimum and maximum values in Dask.
Computing without sharing is what we have been doing so far: executing each task graph separately.
If we time both approaches to examine the difference, we can see very clearly that running the two computations together, as in the "with sharing" section, saves us nearly half the time.
To share intermediate results, we pass both the maximum and the minimum of tip_amount together to dask.compute.
The shared computation is much faster because we save on loading the data from disk and on the groupby.
Similarly, if we had persisted the dataset into memory earlier, the entire process would be faster still.
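Here's a sketch of the two approaches, assuming the same DataFrame and groupby pattern as earlier in the notebook:

```python
import dask
import dask.dataframe as dd

ddf = dd.read_csv("data/yellow_tripdata_2019-*.csv")  # illustrative path

min_tip = ddf.groupby("passenger_count").tip_amount.min()
max_tip = ddf.groupby("passenger_count").tip_amount.max()

# without sharing: each call re-reads the CSVs and redoes the groupby
# min_tip.compute()
# max_tip.compute()

# with sharing: one call, and the common tasks (reading, grouping) run only once
min_result, max_result = dask.compute(min_tip, max_tip)
```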
Now it's time for another checkpoint.
Can you compute the mean and standard deviation for the total fare amount, with sharing intermediate results?
Try it yourself and verify your answer.
After you're done, remember to close your cluster.
Dask DataFrame is fantastic and incredibly powerful, but it does have some limitations.
It does not implement the entire Pandas API because not all Pandas operations are suited for a parallel and distributed environment.
For example, operations that require data shuffling.
As we know, Dask DataFrames consist of multiple Pandas DataFrames.
Each has an index starting from zero.
Pandas indexing operations like set_index and reset_index are slower in Dask because they may need the data to be sorted, which requires a lot of time-consuming shuffling and synchronization of data among the workers; moving data across different machines has network and communication costs.
To avoid sorting, you can presort the index and make logical partitions.
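For example, if the data really is already sorted on the column you want as the index, you can tell Dask so and skip the shuffle entirely; the path and column name here are just an illustration.

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/yellow_tripdata_2019-*.csv")  # illustrative path

# set_index normally shuffles data between workers in order to sort it;
# sorted=True promises the data is already in order, so no shuffle is needed
ddf2 = ddf.set_index("tpep_pickup_datetime", sorted=True)

ddf2.divisions   # now known, so lookups by index can skip irrelevant partitions
```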
Finally, to learn more, go through the resources shared here.
So far, we have worked with a medium dataset that is too big to fit in RAM, but it fits on your local disk.
We have used Dask to parallelize computation locally and leverage all the cores of your local machine.
But what if you want to scale further?
Maybe you don't have enough cores on your local machine, or not enough storage, or you just want to get your work done faster.
This is where you can start considering using the cloud, with providers like AWS, Azure, GCP, and others.
To show how to do it with Dask, we will be using Coiled Cloud.
This is not the only way to do cloud deployments with Dask, and this section is not mandatory to follow.
We're using Coiled Cloud here because honestly, it's the most convenient way to do this today.
It also offers a generous free tier, so you can get started right away without worrying about paying for cloud services.
In this step, we just have to jump out of Jupyter for one quick second to log into Coiled Cloud.
To do that, we can click the link, cloud.coiled.io or type the address into the browser.
Then, we can use our social media or GitHub credentials, or type in our email and password.
I'll skip this part now and show you how it looks from the other side.
In the dashboard, we can monitor our currently running clusters.
The only reason we're here now, however, is to get our unique account token, so that we can start clusters directly from our JupyterLab session or any other Python script.
To log in, we can open a terminal window and execute the coiled login instruction from below here, or we can provide the token when prompted in our interactive session.
That's it.
That's all it takes to get connected to Coiled Cloud.
The cluster is ready.
We can open the dashboards (note the AWS link here), or we can connect our JupyterLab plugin to this remote cluster.
I'll do so by copying the link to the scheduler, opening the Dask tab, and pasting it over here.
After a second, I'm connected to my new cluster, my distributed cluster.
Now I'll run the same computation as before.
This time we read the data directly from Amazon S3.
Do you remember at the outset of the course, how long it took to download the entire dataset to local disk?
Now the entire network exchange is happening in the cloud.
We can see the different parts of the dataset being read by the different workers.
The Dask cluster dashboard gives us live updates on what's going on in the remote setup.
In fact the entire computation happened in just nine seconds.
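Putting the pieces together, here is a hedged sketch of what that remote run can look like; the cluster options, S3 path, and dtype overrides are illustrative, not the course's exact values.

```python
import coiled
import dask.dataframe as dd
from dask.distributed import Client

cluster = coiled.Cluster(n_workers=10)   # workers run in the cloud, not on your laptop
client = Client(cluster)

ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv",    # public bucket; path may change
    dtype={"RatecodeID": "float64", "VendorID": "float64"},
    storage_options={"anon": True},                         # anonymous read access
)
ddf.groupby("passenger_count").tip_amount.mean().compute()

client.close()
cluster.close()
```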
We just saw how the same notebook we started with, doing some Pandas computation on our local machine, moved to using Dask DataFrame locally to parallelize and work on larger data, and then connected to AWS to work on even larger data.
Everything from your laptop and the same notebook, from anywhere in the world.
This is the real power of Dask.
Thank you, Michal.
Finally, to end on a few different notes here.
First, Dask isn't one library; it's actually dozens of libraries, all working together.
All of these libraries are maintained by hundreds of different individuals, like yourself, spread around the world, working at many different institutions on many different projects.
Go to the github.com/dask organization.
You'll find lots of information about the Dask developer community.
You might also want to look at the Dask documentation.
This is at docs.dask.org and there's tons of information here if you want to dive more deeply and see the capabilities that are now available to you.
I also highly recommend looking at examples.dask.org.
This website contains dozens of fully worked examples in many different domains.
You may find something here that speaks much more to your workloads than the exercises we've just gone through.
Additionally, you can run all of these examples yourself by clicking the launch binder button to be taken to a notebook running in the cloud, where you can run Dask yourself.
Again, that lets you try some of these different, and maybe more exciting or more relevant, applications for yourself.
Our next course is going to be about the fundamentals of Dask; today we've touched just on using Dask to scale out Pandas workloads with Dask DataFrame, in a very simple way, but there's a lot more to Dask.
As we've mentioned before, at the beginning of the series, Dask is used inside of many different libraries.
Some of them are built into Dask: Dask DataFrame, which we've just seen; Dask Array for parallel Numpy; Dask Delayed for parallelizing general-purpose Python code; and Dask Bag for dealing with JSON or text data; as well as things like Dask-ML for machine learning.
We'll also get into using Dask in the cloud and figure out how to deploy these workloads at scale.
So that's it.
Again, thank you all for your time, we look forward to seeing you in our next session.
Cheers!