Fundamentals of Dask Transcripts
Chapter: Dask Array
Lecture: Blocked algorithm

0:00 So now I'm excited to tell you about 'Blocked Algorithms'. These essentially execute on large datasets by breaking up the datasets into smaller blocks.
0:09 So in the above example we had a billion times a billion numbers or something like that. And if we want to take the sum of all numbers,
0:16 we could break up the array into 1000 chunks, for example, take the sum of each chunk, and then take the sum of the intermediate sums.
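As a minimal sketch of the idea just described, and not the code from the lecture notebook, the following pure-Python example sums a small list chunk by chunk and then sums the intermediate results. The function name `blocked_sum` and the `chunk_size` parameter are hypothetical names introduced here for illustration.

```python
def blocked_sum(values, chunk_size):
    """Sum `values` one chunk at a time, then sum the per-chunk results.

    `blocked_sum` and `chunk_size` are illustrative names, not part of the lecture.
    """
    partial_sums = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]  # one block of the data
        partial_sums.append(sum(chunk))           # intermediate sum for this block
    return sum(partial_sums)                      # sum of the intermediate sums


numbers = list(range(1, 1001))          # the integers 1..1000
total = blocked_sum(numbers, chunk_size=100)
print(total)                            # 500500, the same as sum(numbers)
```

Because addition can be regrouped freely, the result does not depend on the chunk size for integers like these; only one chunk needs to be worked on at any moment.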
0:24 Okay, so let's do this with a random dataset that we've generated here. So this creates a pointer to the data but doesn't actually load it.
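The transcript does not show how the notebook creates this pointer, so the following is only a rough sketch of the same idea using NumPy's `np.memmap` as a stand-in. The file name `random.dat`, the array size, and the variable names are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

# Write a small random dataset to disk (hypothetical file name and size).
n = 1_000_000
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "random.dat")
np.random.default_rng(0).random(n).tofile(path)  # stored as float64

# Open the file as a memory map: this gives a handle to the on-disk data
# without reading the whole array into memory.
dset = np.memmap(path, dtype="float64", mode="r", shape=(n,))
print(dset.shape)

# Data is only read when a slice is actually accessed and copied.
first_block = np.array(dset[:1000])
print(first_block.shape)
```

The point of the sketch is that shape and dtype are known up front, while the values are read from disk only for the slices that are requested.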
0:33 So we execute that cell. Now, for what we're going to do, this dataset is a small example. We're not doing it on a billion by a billion,
0:41 but we're doing it on a smaller one for pedagogical, instructive purposes. Okay. So we create a list called 'sums',
0:48 where we add all the intermediate sums, and then we loop, we iterate through
0:54 chunks, smaller chunks, take the sum of each chunk and then append that to 'sums'. So we get a list of all the smaller sums there.
1:03 And then for the total we take the sum of all the sums in the list and then we print the total.
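The loop described above might look roughly like the sketch below. The exact notebook cell is not shown in the transcript, so the in-memory NumPy array, its size, and `chunk_size` are assumptions; only the `sums` list and the final total follow the description.

```python
import numpy as np

# Stand-in for the lecture's dataset: a small in-memory random array (assumed size).
data = np.random.default_rng(42).normal(size=1_000_000)
chunk_size = 100_000  # assumed chunk length

sums = []  # collects the intermediate sums
for i in range(0, len(data), chunk_size):
    chunk = data[i:i + chunk_size]  # take one smaller chunk
    sums.append(chunk.sum())        # sum that chunk and keep the result

total = sum(sums)  # sum of all the sums in the list
print(total)
```

Each pass through the loop touches one chunk only, and the passes run one after another.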
1:08 Okay, so we do that and look, that took 800 milliseconds, around a second, give or take. Okay? But note that this is a sequential process in the
1:18 notebook kernel. Okay. The loading and then the summing. And what I want to make clear is that this is something that we can do
1:25 in parallel, particularly when we have multiple cores on an individual workstation.
1:30 Okay, so after this video we're going to come back for a checkpoint, and then we're going to show you how to parallelize
1:35 this type of code. Can't wait to see you there.

