Fundamentals of Dask Transcripts
Chapter: Dask Array
Lecture: Dask array for parallel numpy

0:00 Let's now take a look at how Dask Arrays help us scale NumPy. We saw how NumPy throws a MemoryError when given large datasets;
0:09 Dask Array can handle this larger-than-memory data. A Dask Array is composed of multiple NumPy arrays,
0:17 as shown in this diagram. Also note that Dask Array computes them in parallel. Dask Array operations use NumPy operations internally,
0:27 so the syntax will be familiar to you. Dask Array lets us define a chunk size property to divide our Dask Array into appropriate blocks.
0:36 For optimal calculations, it leverages the concept of 'Blocked Algorithms' that we just learned about to give us good performance.
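For reference, here is a tiny sketch of that idea, assuming dask is installed; the variable names are illustrative and not from the video:

```python
import dask.array as da

# A Dask Array is a grid of NumPy arrays: each 2 x 2 chunk of this
# 4 x 4 array of ones is an ordinary NumPy array under the hood.
a = da.ones((4, 4), chunks=(2, 2))

block = a.blocks[0, 0].compute()  # materialize just one chunk
print(type(block))                # <class 'numpy.ndarray'>
print(block)                      # a 2 x 2 NumPy array of ones
```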
0:46 Now, let's jump into the Jupyter Notebook and see how you can use Dask Arrays. First, we need to spin up a new cluster here.
0:54 We are using four workers. Let's open some diagnostic dashboards: the 'Cluster Map' and the 'Task Stream'.
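The exact setup cell isn't shown in the transcript, but a minimal sketch with dask.distributed might look like this (n_workers matches the four workers mentioned above):

```python
from dask.distributed import Client, LocalCluster

# Spin up a local cluster with four workers and connect a client to it.
cluster = LocalCluster(n_workers=4)
client = Client(cluster)

# The client exposes a link to the diagnostic dashboard, where panels
# like the Cluster Map and Task Stream live.
print(client.dashboard_link)
```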
1:03 And let me rearrange these tabs to the right. Great. Let's now create a 10,000 by 10,000 array with 100 x 100 chunks.
1:15 We'll be using the ones function from Dask Array, which mirrors NumPy's, for this. Looking at the output, we see Dask Array has created the array and displays some
1:24 metadata. This is incredibly useful: if you read a large file,
1:28 this metadata can help you understand what's going on without needing to compute and display the
1:34 entire array. Here we see information about the size of the array and its chunks, the shape of the array and its chunks,
1:41 a count of tasks and chunks, and the data type of the array's values. The diagram helps us visualize the chunks.
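A sketch of that cell; `darr` is an illustrative name, and the attributes below expose the same metadata the notebook renders:

```python
import dask.array as da

# 10,000 x 10,000 array of ones, divided into 100 x 100 chunks.
darr = da.ones((10000, 10000), chunks=(100, 100))

print(darr.shape)        # (10000, 10000)
print(darr.chunksize)    # (100, 100)
print(darr.npartitions)  # 10000 -- the count of chunks
print(darr.dtype)        # float64 -- the data type of the values
```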
1:50 Now let's compute the sum of this array and time it. That only took several seconds,
1:56 because Dask Array also evaluates lazily. Recall how 'Lazy Evaluation' refers to computing the results only
2:03 when necessary. So if we look at what the variable result is, it displays a Dask Array; we need to call compute to get the actual results,
2:13 and we also see some activity happening in our dashboards. All right, that gives us 100 million.
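Roughly, those two cells look like this, reusing the darr sketched above:

```python
# This line is nearly instant: it only builds the task graph.
result = darr.sum()
print(result)  # displays a lazy Dask Array, not a number

# compute() actually runs the graph on the cluster; the dashboards
# show the chunks being summed in parallel.
print(result.compute())  # 100000000.0 -- the sum of 10,000 x 10,000 ones
```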
2:21 That makes sense. Next, let's do the same NumPy computation as earlier. We use 'da.random' to create an array of random values,
2:31 calculate the mean, and compute every 100th value. Finally, always remember to close your cluster.
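The final cells aren't shown verbatim; one plausible sketch, assuming the mean is taken along an axis before slicing, and reusing the client and cluster from the setup sketch above:

```python
# Create a random array with the same shape and chunking as before.
x = da.random.random((10000, 10000), chunks=(100, 100))

# Mean along the first axis, then every 100th value -- still lazy.
y = x.mean(axis=0)[::100]
print(y.compute())  # triggers the parallel computation

# Always close the client and cluster when you are done.
client.close()
cluster.close()
```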

