Fundamentals of Dask: High Performance Data Science Course

1.2 hours, 100% free
Take this course for FREE

Course Summary

The Python data science stack, consisting of tools like pandas, NumPy, scikit-learn, and many more, is extremely powerful, but it rarely leverages the parallel computing potential of modern hardware. Dask can help bridge this gap. Dask has a flexible and familiar API that integrates seamlessly with the PyData ecosystem. It parallelizes your favorite tools and allows you to scale your workflow quickly. And that’s just the beginning! Dask also provides a complete framework for distributed computing in Python that powers specialized tools like Prefect and RAPIDS. This course will teach you how to parallelize everything from array computations to general Python code with Dask and even perform distributed machine learning to train models at scale.
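
To give a feel for the "familiar API" mentioned above, here is a minimal sketch (not taken from the course material) that runs the same computation with NumPy and with Dask array. The array size and chunk size are arbitrary values chosen for illustration.

```python
import numpy as np
import dask.array as da

# NumPy: the whole array lives in memory and is processed eagerly.
x_np = np.arange(1_000_000, dtype="float64")
mean_np = (x_np ** 2).mean()

# Dask: the same expression, but the array is split into 10 chunks
# of 100,000 elements that can be processed in parallel.
x_da = da.arange(1_000_000, dtype="float64", chunks=100_000)
lazy_mean = (x_da ** 2).mean()  # builds a task graph; nothing runs yet

# .compute() executes the graph and returns a concrete NumPy value.
mean_da = lazy_mean.compute()

print(x_da.numblocks)                # number of chunks along each axis
print(np.isclose(mean_np, mean_da))  # both approaches should agree
```

The only visible differences from the NumPy version are the `chunks` argument and the explicit `.compute()` call, which is what makes Dask array feel like a drop-in parallel alternative for many workloads.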

What students are saying

If you’re at all interested in Python, or new to programming, take any (or all!) of Talk Python’s courses. A friendly, approachable, and incredibly practical set of courses.
-- James G.

Source code and course GitHub repository

github.com/coiled/talkpython-fundamentals-of-dask

What's this course about and how is it different?

This course is a quick, no-fluff introduction to the fundamentals of Dask. It's authored by the folks over at Coiled, who offer Dask as a service, including Matthew Rocklin, one of the co-creators of Dask. So you know you're getting definitive information from people who use Dask in practice.

What topics are covered

In this course, you will learn to:

  • Scale array computations using a parallel alternative to NumPy
  • Parallelize general Python code including for-loops
  • Work with unstructured data in parallel
  • Train machine learning models faster using distributed computing
  • And lots more!
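
As a rough illustration of two of the topics above, the sketch below parallelizes a simple for-loop with `dask.delayed` and filters a few JSON records with Dask bag. It is not code from the course; the `square` function and the sample records are hypothetical and exist only for this example.

```python
import json

import dask
import dask.bag as db

# --- Parallelizing a for-loop with dask.delayed ---
def square(n):
    return n * n

# Wrapping each call records it as a task instead of running it immediately.
lazy_squares = [dask.delayed(square)(n) for n in range(5)]
lazy_total = dask.delayed(sum)(lazy_squares)

total = lazy_total.compute()  # 0 + 1 + 4 + 9 + 16 = 30
print(total)

# --- Working with unstructured data using Dask bag ---
# Hypothetical JSON lines, made up for illustration.
lines = [
    '{"name": "Alice", "age": 31}',
    '{"name": "Bob", "age": 17}',
    '{"name": "Chen", "age": 45}',
]

bag = db.from_sequence(lines, npartitions=2).map(json.loads)
adults = bag.filter(lambda record: record["age"] >= 18).pluck("name")

# The threaded scheduler is requested explicitly to keep this sketch
# lightweight; Dask bag would otherwise default to a process pool.
names = adults.compute(scheduler="threads")
print(names)
```

In both cases the work is described first and only executed when `.compute()` is called, which gives Dask the chance to run independent tasks in parallel.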

View the full course outline.

Who is this course for?

This course is for anyone with basic Python and data science experience who would like to use Dask to scale their workflows. You'll need to know things like variables, modules, import statements, etc. in Python. A general understanding of machine learning and of data science tools like NumPy and scikit-learn is also helpful but not necessary. The Python code used is not deep or advanced, so it should be broadly accessible to most.

Note: All software used during this course, including editors, the Python language, etc., is 100% free and open source. You won't have to buy anything to take the course.

Get hands-on for almost every chapter

While watching videos is great to give you that high-level overview of what you need to know about a technology, nothing makes that skill your own like writing actual code and scaling data science computations in your notebooks.

In this course, you'll have access to all the source code at github.com/coiled/talkpython-fundamentals-of-dask. You're encouraged to follow along and play with the notebooks throughout this course.

This course is delivered in very high resolution

This course is delivered in 1440p (4x the pixels of 720p). When you're watching the videos for this course, it will feel like you're sitting next to the instructor looking at their screen.

Every little detail, menu item, and icon is clear and crisp. Watch the introductory video at the top of this page to see an example.

Follow along with subtitles and transcripts

Each course comes with subtitles and full transcripts. The transcripts are available as a separate searchable page for each lecture. They are also included in course-wide search results to help you find just the right lecture.

Free office hours keep you from getting stuck

One of the challenges of self-paced online learning is getting stuck. It can be hard to get the help you need to get unstuck.

That's why at Talk Python Training, we offer live, online office hours. You drop in and join a group of fellow students to chat about your course progress and see solutions via screen sharing.

Just visit your account page to see the upcoming office hour schedule.

The time to act is now

If you are working with data using pandas or other data science libraries, you owe it to yourself to see how to process significantly larger datasets and how to run Python computation free of the GIL's grip, across cores, and all the way out to an entire cluster. This course will get you up to speed in just a few hours!
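
How a Dask computation is spread across threads, processes, or a cluster is decided by the scheduler. The sketch below, which is not from the course, shows one way to select a scheduler per call or for a block of code. The `count_vowels` function and the word list are hypothetical.

```python
import dask

def count_vowels(word):
    # Pure-Python work like this holds the GIL while it runs.
    return sum(ch in "aeiou" for ch in word)

words = ["dask", "parallel", "python", "cluster"]
tasks = [dask.delayed(count_vowels)(w) for w in words]

# Single-threaded scheduler: no parallelism, but handy for debugging.
serial = dask.compute(*tasks, scheduler="synchronous")

# Thread pool: works well when the underlying code releases the GIL,
# as much of NumPy and pandas does.
threaded = dask.compute(*tasks, scheduler="threads")

# The scheduler can also be set for a whole block of code.
with dask.config.set(scheduler="threads"):
    via_config = dask.compute(*tasks)

print(serial)
```

For GIL-bound pure-Python code, Dask also offers a `"processes"` scheduler and a distributed scheduler; they are left out of this sketch because they involve extra process or cluster setup.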

Course Outline: Chapters and Lectures

Welcome to the course (1:45)
  • Introduction (1:05)
  • Meet the instructors (0:40)

Dask Array (11:57)
  • Introduction to Dask array (2:25)
  • Demonstrating numpy (3:37)
  • Blocked algorithm (1:38)
  • Checkpoint 1 (0:29)
  • Dask array for parallel numpy (2:39)
  • Checkpoint 2 (0:23)
  • Dask array limitations and references (0:46)

Dask Delayed (8:47)
  • Introducing Delayed (1:00)
  • Recap: Delayed (1:53)
  • Parallel for loops (1:46)
  • Pandas groupby using Dask delayed (1:53)
  • Checkpoint (0:29)
  • Best practices and references (1:46)

Dask Bag (12:57)
  • Introducing Dask bag (1:51)
  • Reading from Python collections (2:49)
  • Reading from JSON (2:31)
  • Manipulating data (3:07)
  • Checkpoint (0:38)
  • Dask bag to Dask DataFrame (1:03)
  • Dask bag limitations (0:58)

Dask Schedulers (6:35)
  • Introducing schedulers (0:53)
  • Types of schedulers (2:03)
  • Selecting a scheduler (0:46)
  • Comparing different schedulers (1:41)
  • Distributed scheduler (0:54)
  • Scheduler references (0:18)

Dask-ML (25:27)
  • Dimensions of scale (1:55)
  • Introducing Dask-ML (1:05)
  • Generating dataset with scikit-learn (2:16)
  • K-nearest neighbors with scikit-learn (2:30)
  • Hyperparameter tuning (3:26)
  • Joblib and Dask for compute-bound problems (3:42)
  • Checkpoint 1 (1:03)
  • Dask-ML for memory bound problems (2:13)
  • Checkpoint 2 (0:44)
  • Dask in the cloud (2:31)
  • Machine learning in the cloud (4:02)

Next Steps (1:39)
  • What's next (1:39)