Fundamentals of Dask Transcripts
Chapter: Dask-ML
Lecture: Joblib and Dask for compute-bound problems

0:00 Alright and welcome back. I am very excited now to show you how distributed compute can be leveraged for your machine learning workflows.
0:10 Having fit a model and checked out hyper parameter tuning.
0:13 You may have noticed that hyper parameter tuning is something that's embarrassingly parallelizable.
0:19 And really what I mean by that is we have a bunch of tasks that could happen in parallel and that really don't need each other
0:27 for any form of operation. There's no need for data transfer or information between them.
0:32 So these are tasks you can essentially send to different workers and that's exactly what we're
0:37 going to do. First we're going to look at "single machine parallelism" using scikit-learn and something scikit-learn leverages called 'Joblib'.
0:45 And then we're going to look at "multi machine parallelism" with scikit-learn, joblib, and Dask. As in the last video,
0:51 I'm not going to execute all of this code for efficiency and pedagogical purposes,
0:56 but I'll talk you through it, and I'm really excited for you to execute it yourself
1:00 and incorporate it into your own parallelizable machine learning workflows. Without further ado, before using
1:07 Dask I want you to try something called 'Joblib', and this is something that
1:10 scikit-learn leverages and offers. The only thing you need to do is alter the 'n_jobs' parameter.
1:22 So what we're doing is we're using GridSearchCV again, we're passing it the same things as before, and then we're using the kwarg 'n_jobs'.
1:32 What that essentially does is tell it how many cores to use. We can do a little trick if you don't know how many
1:39 cores you have locally: you can set it to "n_jobs=-1" and that essentially makes it the maximum number of cores. Okay, so if I had four
1:48 cores, 'n_jobs=-1' is exactly the same as 'n_jobs=4', and that's exactly what we've done here.
1:56 Now, I want you to notice that for me this took two minutes and 44 seconds, whereas previously, not leveraging joblib at all, it took four minutes.
2:06 So all that's to say is that this compute time was reduced significantly, almost by half
2:12 in fact, which I think is pretty exciting when all we needed to do was add an extra kwarg and set n_jobs equal to minus one.
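To make that concrete, here is a minimal sketch of the single-machine version. The dataset, estimator, and parameter grid below are stand-ins rather than the ones from the course notebook; the key piece is simply passing n_jobs=-1 to GridSearchCV.

```python
# Single-machine parallelism with scikit-learn and joblib.
# The data, estimator, and grid are placeholders, not the course's own.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1_000, random_state=0)
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["rbf", "poly"]}

# n_jobs=-1 tells joblib to use every core available on this machine.
grid_search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)
grid_search.fit(X, y)
print(grid_search.best_params_)
```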
2:21 This is all well and good for single machine parallelism, but let's say you wanted to do multi machine parallelism and leverage a whole bunch of
2:30 cores and clusters for your computation. So this is where Dask comes in.
2:36 So Dask offers a parallel back end to scale this computation to a cluster. The first thing that you need to do is spin up a cluster and open the
2:45 dashboards as you've done before. And you see this is what I've done here and
2:49 then what you do is you pretty much use the same APIs as before, using GridSearchCV. And what we do is we once again instantiate a GridSearchCV
2:59 and assign it to grid_search. Now, what we do is we set up a context manager using "with" and we execute "with joblib.parallel_backend",
3:10 selecting Dask and passing X and y to the scatter kwarg. Then within that context we fit grid search to X and y, as we've done previously.
3:23 What I'll get you to notice is really the only thing we're doing differently now is
3:27 doing it within the context of using the Dask parallel backend for scikit-learn. And you'll see that took three minutes and 22 seconds for me.
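Here is a rough sketch of that pattern. Again the data, estimator, and grid are placeholders, and Client() here starts a local cluster, whereas in the lecture it would point at your own cluster with its dashboard open.

```python
# Multi-machine parallelism: Dask as the joblib backend for scikit-learn.
# Placeholder data/estimator/grid; Client() starts a local cluster for illustration.
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

client = Client()  # spin up the cluster and open the dashboard

X, y = make_classification(n_samples=1_000, random_state=0)
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["rbf", "poly"]}
grid_search = GridSearchCV(SVC(), param_grid, cv=3)

# scatter=[X, y] pre-sends the data to the workers so it isn't
# re-serialized for every individual fit.
with joblib.parallel_backend("dask", scatter=[X, y]):
    grid_search.fit(X, y)

print(grid_search.best_params_)
```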
3:36 I hope you enjoyed that a great deal. And when we come back, we'll be having a checkpoint for you to exercise your new muscles here.

