Getting started with Dask Transcripts
Chapter: Dask under the hood
Lecture: Take a peek under the hood
0:00
We are almost ready to jump into the notebook and start using Dask. But before we do that, let's look under the hood and find out what the
0:08
components of Dask are and how they work together. At a high level, Dask has collections that
0:13
create task graphs. Then the task graphs are consumed by schedulers, which delegate the tasks to workers that do the computations.
0:20
Collections are the APIs you use to write Dask code. Collections can be high-level, like Array (corresponding to NumPy), DataFrame (corresponding to
0:29
Pandas), and Bag, or they can be low-level collections, such as Delayed and Futures. These collections create a task graph.
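As a quick illustration, here is a minimal sketch of a high-level collection in use; the array shape and chunk size are arbitrary values chosen for the example:

```python
import dask.array as da

# da.ones mirrors np.ones, but the array is split into chunks,
# and nothing is computed yet: operations only build a task graph.
x = da.ones((1000, 1000), chunks=(100, 100))
total = x.sum()         # still lazy: a task graph, not a number
print(total.compute())  # triggers the computation → 1000000.0
```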
0:37
Let's look at what a task graph is. For example, consider two functions that each do a simple mathematical operation and sleep for one second.
0:45
x and y can be executed in parallel. However, task z depends on the results of x and y, so it has to wait for both. Therefore the total time is two seconds.
0:55
Each individual task takes one second, so if executed in sequence they would take three seconds.
1:02
But because the task graph captures which parts of the work can be done in parallel, x and y are actually executed in parallel,
1:09
without you having to say so explicitly. Finally, these pieces come together in a cluster; let's look at what a cluster is comprised of.
1:17
First, it has the scheduler, which is the beating heart of the cluster. It consumes the task graph and sends tasks to the
1:23
workers. It manages the workers and their interactions, and knows where the workers are and which part of the data is on
1:29
which worker, and so forth. Then there are the workers, the machines that can be added or removed, and they perform the actual computation.
1:37
Dask is quite dynamic, so new workers can even appear while the workflow is being executed, which is known as dynamic scaling.
1:44
Then, finally, there is the Client. The Client is your window to the world: it lives where you write your Python code,
1:50
in your JupyterLab session, in your command-line interface, and so forth. It is the entry point for you to interact with the cluster.
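As a sketch of how these pieces connect, this is a typical way to start a local cluster and attach a Client to it; the worker counts here are arbitrary example values:

```python
from dask.distributed import Client, LocalCluster

# LocalCluster starts a scheduler and workers on this machine;
# the Client connects your Python session to that cluster.
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)

link = client.dashboard_link  # URL of the diagnostic Dashboard
print(link)

# When you are done, shut everything down cleanly.
client.close()
cluster.close()
```

As a shorthand, calling `Client()` with no arguments creates a LocalCluster for you automatically.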
1:58
This is what it looks like in JupyterLab: the Client has a nice output representation, which tells you where the Dashboard is,
2:06
which you can use to further inspect the inner workings of the cluster. It also shows the resources allocated to the cluster.