Getting started with Dask Transcripts
Chapter: Using the Dask DataFrame
Lecture: Sharing intermediate results
0:00
In some cases, you can benefit from sharing intermediate results. We've learned that Dask computations are created as task graphs.
0:08
In other words, they are just plans. When two computations are related, they can use the same intermediate steps.
0:15
We can use this to our advantage: when two computations share steps, executing them together avoids repeating the shared work. Computing minimum and maximum values is a good example.
0:22
Let's look at an example here. Just like in Pandas, you can use min and max to compute the minimum and
0:27
maximum values in Dask. Computing without sharing, as we have done so far, executes each
0:34
task graph separately. If we time both approaches to examine the difference, we can see very clearly that running the two computations together,
0:42
as we do in the "with sharing" section, saves us nearly half the time. To share intermediate results, we can pass both the maximum and minimum of
0:51
tip_amount together to dask.compute. The shared computation is much faster because we avoid repeating the load of data from disk and the groupby.
0:59
Similarly, if we had persisted the dataset in memory earlier, the entire process would be faster still.