Getting started with Dask Transcripts
Chapter: Welcome to the course
Lecture: Intro to the course and to Matthew Rocklin

0:00 Hello everyone and welcome to the Talk Python Training Course: "Getting started with Dask". My name is Matthew Rocklin, I'll be one of your instructors
0:09 for this course. So first, what is Dask? Dask is a library for parallel computing
0:16 in Python. That means Dask is designed to be run like any other Python library, but to control a variety of parallel resources for you,
0:25 so you can execute your code at scale. Dask is free and open source software.
0:31 It's developed using much the same practices as other libraries like Pandas or Jupyter,
0:35 and it was originally designed to scale out a lot of the data science, engineering, or machine learning workloads that we often find in Python
0:42 today. Dask was designed to scale up to use all of the cores on your local
0:47 laptop or to scale out to distributed machines either on prem or in the cloud. Dask
0:53 makes it very easy to parallelize common processing operations using workflows like Numpy, Pandas,
0:59 Scikit-learn and many other libraries throughout the PyData ecosystem.
1:04 So Dask is really, you can think of it as infrastructure for the scientific Python
1:09 or Python Data Science ecosystem. It's at the same layer as libraries like Numpy or
1:15 Jupyter or Cython, it's a tool that many other libraries in the Python stack use in order to add a little bit of parallelism.
1:24 So it's really more of a framework to build distributed applications and as a result it's been
1:28 used with lots of other systems throughout the ecosystem,
1:31 including very common libraries like Pandas with which Dask was co-developed, but also lots of new
1:36 and exciting capabilities within Python, like time series processing,
1:41 workflow management, machine learning, GPU processing and many more. Because Dask is so flexible
1:48 and so lightweight, it's been used inside of lots of other libraries, adding parallelism to many parts within the ecosystem.
1:56 However, most people start using Dask with traditional Data Science or Data Engineering workloads. Here we're thinking of libraries like Pandas,
2:04 Numpy and Scikit-learn, and Dask has been developed to include APIs that look very, very familiar to users of those libraries. So for example,
2:13 if you're using the Pandas read_csv and groupby operations, Dask has equivalent operations built in that allow you to operate on much larger data
2:21 sets with the same familiar APIs. The same for Numpy and Scikit-learn as well.
2:26 In this course we're going to assume that you have basic Python programming understanding.
2:31 It's also quite useful if you have some understanding of Pandas and Numpy and the rest of the scientific stack, but it's not necessary.
2:38 This will help you accelerate a bit. But we'll teach you some of those things as well. In this course we're gonna cover five sections.
2:45 There's "What is Dask", which we've just covered. Next we'll go into a brief introduction to Big Data,
2:52 then we'll talk about how to set up your environment to do the exercises within this
2:55 course. Then we'll dive more deeply into using Dask DataFrames to scale Pandas code; this is probably the most common use case.
3:03 Finally, we'll get into a bit of diagnostics and show you how you can get a lot of visual feedback from your computations. Again, my name is Matthew Rocklin,
3:12 I'm one of the lead maintainers of Dask and the CEO of Coiled. I'm really excited to get started with this course.

