Getting started with Dask Transcripts
Chapter: Welcome to the course
Lecture: Intro to the course and to Matthew Rocklin
0:00
Hello everyone and welcome to the Talk Python Training Course: "Getting started with Dask". My name is Matthew Rocklin, and I'll be one of your instructors
0:09
for this course. So first, what is Dask? Dask is a library for parallel computing
0:16
in Python. That means that Dask is designed to be run like any other Python library, but to control a variety of parallel resources for you,
0:25
so you can execute your code at scale. Dask is free and open source software.
0:31
It's developed with much the same practices as other libraries like Pandas or Jupyter,
0:35
and it was originally designed to scale out a lot of the data science, engineering, or machine learning workloads that we often find in Python
0:42
today. Dask was designed to scale up to use all of the cores on your local
0:47
laptop, or to scale out to distributed machines, either on-prem or in the cloud. Dask
0:53
makes it very easy to parallelize common processing operations in workflows built on NumPy, Pandas,
0:59
Scikit-learn and many other libraries throughout the PyData ecosystem.
1:04
So you can really think of Dask as infrastructure for the scientific Python
1:09
or Python Data Science ecosystem. It's at the same layer as libraries like NumPy or
1:15
Jupyter or Cython. It's a tool that many other libraries in the Python stack use in order to add a little bit of parallelism.
1:24
So it's really more of a framework to build distributed applications and as a result it's been
1:28
used with lots of other systems throughout the ecosystem,
1:31
including very common libraries like Pandas, with which Dask was co-developed, but also lots of new
1:36
and exciting capabilities within Python, like time series processing,
1:41
workflow management, machine learning, GPU processing and many more. Because Dask is so flexible
1:48
and so lightweight, it's been used inside of lots of other libraries, adding parallelism to many parts within the ecosystem.
1:56
However, most people start using Dask with traditional Data Science or Data Engineering workloads. Here we're thinking of libraries like Pandas,
2:04
NumPy, and Scikit-learn, and Dask has been developed to include APIs that look very, very familiar to users of those libraries. So for example,
2:13
if you're using the Pandas read_csv and groupby operations, Dask has equivalent operations built into it that allow you to operate on much larger data
2:21
sets with the same familiar APIs. The same goes for NumPy and Scikit-learn as well.
2:26
In this course we're going to assume that you have a basic understanding of Python programming.
2:31
It's also quite useful if you have some understanding of Pandas and NumPy and the rest of the scientific stack, but it's not necessary.
2:38
This will help you accelerate a bit. But we'll teach you some of those things as well. In this course we're gonna cover five sections.
2:45
There's "What is Dask", which we've just covered. Next we'll go into a brief introduction to Big Data,
2:52
then we'll talk about how to set up your environment to do the exercises within this
2:55
course. Then we'll dive more deeply into using Dask DataFrames to scale Pandas code. This is probably the most common use case.
3:03
Finally, we'll get into a bit of diagnostics and show you how you can get a lot of visual feedback from your computations. Again, my name is Matthew Rocklin,
3:12
I'm one of the lead maintainers of Dask and the CEO of Coiled. I'm really excited to get started with this course.