Fundamentals of Dask Transcripts
Chapter: Dask Array
Lecture: Blocked algorithm
0:00
So now I'm excited to tell you about 'Blocked Algorithms'. These essentially execute on large datasets by breaking them up into smaller blocks.
0:09
So in the above example we had a billion times a billion numbers or something like that. And if we want to take the sum of all numbers,
0:16
we could break up the array into 1000 chunks, for example, take the sum of each chunk, and then take the sum of the intermediate sums.
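As a minimal sketch of the idea just described (not the notebook's actual code), the following pure-Python example splits a sequence into 1000 equal chunks, sums each chunk, and then sums the intermediate results. The function name `blocked_sum` and the data size are illustrative assumptions.

```python
def blocked_sum(values, n_chunks):
    """Sum `values` by summing equal-sized chunks, then summing the partial sums."""
    chunk_size = len(values) // n_chunks  # assumes len(values) divides evenly
    partial_sums = []
    for i in range(n_chunks):
        chunk = values[i * chunk_size:(i + 1) * chunk_size]  # one block
        partial_sums.append(sum(chunk))                      # intermediate sum
    return sum(partial_sums)  # sum of the intermediate sums


data = range(1_000_000)          # stand-in for a large dataset
total = blocked_sum(data, 1000)  # 1000 chunks of 1000 numbers each
print(total == sum(data))
```

The blocked result matches the direct sum because addition can be regrouped freely; only one chunk needs to be in hand at a time.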
0:24
Okay, so let's do this with a random data set that we've generated here. So this creates a pointer to the data but doesn't actually load it.
0:33
So we execute that cell. Now, this data set is a small example. We're not doing it on a billion by a billion,
0:41
but we're doing it on a smaller one for pedagogical purposes. Okay. So we create a list called 'sums',
0:48
where we collect all the intermediate sums, and then we iterate through
0:54
chunks, smaller chunks, take the sum of each chunk and then append that to 'sums'. So we get a list of all the smaller sums there.
1:03
And then for the total we take the sum of all the sums in the list and then we print the total.
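The loop described above might look roughly like the following. This is a sketch under stated assumptions, not the notebook's own cell: the in-memory random array, the name `chunk_size`, and the chunk length of 1000 are all hypothetical stand-ins.

```python
import numpy as np

# Hypothetical stand-in for the lecture's dataset: one million random numbers.
rng = np.random.default_rng(seed=0)
data = rng.normal(size=1_000_000)

chunk_size = 1_000  # assumed chunk length; the lecture's notebook may differ

sums = []
for start in range(0, len(data), chunk_size):
    chunk = data[start:start + chunk_size]  # slice out one block
    sums.append(chunk.sum())                # intermediate sum for this block

total = sum(sums)  # sum of the intermediate sums
print(total)
```

In the lecture the data is only pointed to rather than loaded, so each slice would also trigger a load; here everything is already in memory, so the timing will not match the one quoted in the video.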
1:08
Okay, so we do that and look, that took about 800 milliseconds, around a second, give or take. Okay? But note that this is a sequential process in the
1:18
notebook kernel: the loading and then the summing. And what I want to make clear is that this is something that we can do
1:25
in 'Parallel', particularly when we have multiple cores on an individual workstation.
1:30
Okay, so after this video we're going to come back for a checkpoint and then we're going to show you how to 'Parallelize'
1:35
this type of code. Can't wait to see you there.