Fundamentals of Dask Transcripts
Chapter: Dask-ML
Lecture: Dimensions of scale

0:00 All right, welcome back. Now we're in the last stretch of the course and what I personally consider a very exciting part,
0:07 Machine Learning and Distributed Machine Learning. So before we jump in, I just want to start with a bit of background about
0:14 scalability more generally in machine learning. So this is a figure that I first saw Tom Augspurger present. He's a maintainer of the Dask-ML project,
0:23 among many other things. This figure describes dimensions of scale: data size is on the x axis and model size is on the y
0:33 axis. Now, I want to make very clear that a lot of people mistakenly think about Distributed Compute and Dask as being helpful only for big data,
0:42 whereas actually it's incredibly helpful as your model size or your compute needs increase as well, and we'll get to this.
0:50 In the bottom left quadrant, when both model size and data size are low, your computation fits in RAM. Beyond that point,
0:59 we become bound by memory or compute. Let's think about being compute bound first: when your model size or complexity increases, you reach a state
1:10 where you're compute bound or CPU bound. Examples of this are tasks like training,
1:15 prediction, evaluation and more taking a long time to compute. One solution that I
1:22 really dig here is using 'joblib', as we'll demonstrate soon, and scikit-learn offers joblib out of the box, which is pretty cool.
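
To give a rough idea of what that looks like, here is a minimal sketch of routing scikit-learn's internal joblib parallelism through a Dask cluster. The dataset, estimator, and parameter grid are illustrative placeholders, not from the lecture:

```python
# A minimal sketch, assuming dask.distributed, scikit-learn and joblib are installed.
# The dataset, estimator and parameter grid are illustrative placeholders.
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

client = Client()  # start a local Dask cluster

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10]},
    cv=3,
)

# Compute-bound work: scikit-learn's internal joblib calls run on the Dask workers
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)
```

The data here still fits in RAM; Dask is only being used to spread the CPU-heavy fitting work across workers.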
1:31 Now the next dimension of scale we need to consider is being memory bound. This is when your data is too large to fit in RAM, and in this case
1:39 we have a memory bound problem. Here we can't even read the data without Dask collections like Dask DataFrame,
1:46 as we saw earlier. In this case, what we'll do is use dask_ml estimators that parallelize scikit-learn code.
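
To make that concrete, here is a minimal sketch of a dask_ml estimator fitting a chunked Dask array, so the full dataset never has to sit in memory at once. The particular estimator and dataset sizes are illustrative:

```python
# A minimal sketch, assuming dask and dask-ml are installed.
# The estimator and dataset sizes are illustrative placeholders.
from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

# A chunked Dask array: only one block needs to be in memory at a time
X, y = make_classification(n_samples=1_000_000, n_features=20, chunks=100_000)

clf = LogisticRegression()
clf.fit(X, y)            # fits against the chunked array without loading it all into RAM
preds = clf.predict(X)   # returns a lazy Dask array of predictions
print(preds[:5].compute())
```

The dask_ml estimators mirror the familiar scikit-learn fit/predict API, but they accept Dask collections as input, which is what makes them useful for memory bound problems.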

