Fundamentals of Dask Transcripts
Chapter: Dask-ML
Lecture: Dimensions of scale
0:00
All right, welcome back. Now we're in the last stretch of the course, and what I personally consider a very exciting part:
0:07
Machine Learning and Distributed Machine Learning. So before we jump in, I just want to start with a bit of background about
0:14
scalability in machine learning more generally. This is a figure that I first saw Tom Augspurger present. He's a maintainer of the Dask-ML project,
0:23
among many other things. The figure describes dimensions of scale: data size is on the x-axis and model size is on the y-
0:33
axis. Now, I want to make very clear that a lot of people mistakenly think of distributed compute and Dask as being helpful only for big data,
0:42
whereas actually it's incredibly helpful as your model size or your compute needs increase as well, and we'll get to this, okay?
0:50
In the bottom-left quadrant, when both model size and data size are small, your computation fits in RAM. Beyond that point,
0:59
we become bound by memory or compute. Let's think about being compute bound first: when your model size or complexity increases, you reach a state
1:10
where you're compute bound or CPU bound. Examples of this: tasks like training,
1:15
prediction, evaluation, and more will take a long time to compute. One solution that I
1:22
really dig here is using joblib, as we'll demonstrate soon, and scikit-learn offers joblib support out of the box, which is pretty cool.
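For reference, here's a minimal sketch of what that compute-bound pattern can look like (not shown on screen here), assuming a local Dask cluster and a small, made-up grid search:

```python
# Sketch: routing a compute-bound scikit-learn workload through Dask via joblib.
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

client = Client()  # start/connect to a local Dask cluster

# Small synthetic dataset; the data fits in RAM, the work is what's expensive
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Hypothetical parameter grid for illustration
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["rbf", "linear"]}
search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)

# Send joblib's parallel tasks to the Dask cluster instead of local threads
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)
```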
1:31
Now, the next dimension of scale we need to consider is being memory bound. This is when your data is too large to fit in RAM, giving us
1:39
a memory-bound problem. In this case we can't even read the data without Dask collections like Dask DataFrame,
1:46
as we saw earlier. What we'll do here is use dask_ml estimators that parallelize scikit-learn code.
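As a rough sketch of that memory-bound workflow, assuming the dask_ml package is installed and using a synthetic chunked dataset in place of data you would normally load lazily (for example with dask.dataframe.read_csv):

```python
# Sketch: fitting a dask_ml estimator on chunked Dask arrays,
# so the full dataset never has to sit in memory at once.
from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

# Chunked Dask arrays: 100,000 rows split into 10,000-row blocks
X, y = make_classification(n_samples=100_000, n_features=20,
                           chunks=10_000, random_state=0)

lr = LogisticRegression()  # scikit-learn-style, drop-in API
lr.fit(X, y)               # operates on the Dask arrays without materializing them all at once

print(lr.predict(X[:5]).compute())
```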