Getting started with Dask Transcripts
Chapter: Using the Dask DataFrame
Lecture: Reading and working with data in pandas
0:00
We are back in the notebook. Let's again use the same New York Taxi dataset but this time with Dask DataFrame. First, let's spin up a cluster.
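For reference, a minimal sketch of spinning up a local cluster with dask.distributed; calling Client() with no arguments, so that Dask uses all local cores, is an assumption, since the exact setup isn't shown here.

    from dask.distributed import Client

    # With no arguments, Client() starts a local cluster on this machine,
    # using all available cores by default
    client = Client()
    client  # displaying the client shows a link to the dashboard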
0:08
I hope you remember how to do that. When using Dask, it's a good practice to
0:12
have the dashboard open to the side. You can access the dashboard by clicking the link here
0:17
which will open a tab with the Dask status (currently empty), or by using the Dask JupyterLab extension, which is available here in the sidebar.
0:27
Let me click the Dask logo in the sidebar which opens an entire section here with all the different views that Dask dashboard provides.
0:35
I'm going to click on the Cluster Map, which shows the interactions between the client and the
0:39
workers, the Task Stream, which shows what each worker is doing at any given moment
0:45
and the Dask Workers, which tells me the CPU and memory usage across the cluster.
0:50
This layout might be uncomfortable to work with, so I need to rearrange my tabs. First,
0:55
I'll close the sidebar and then arrange my tabs neatly on the side so that I can work with my code as I view the status of the cluster.
1:05
Great, after doing that, I need to make some more room for my code, perhaps by moving the Dask Workers view to the lower tab or keeping it up here.
1:14
That's up to you. Okay, this setup works for me. Let's perform the same read operation as we did before but with Dask. Let's time it as well.
1:23
I'm going to use Dask DataFrame read_csv to load the entire year instead of one month as I did
1:29
last time. The call completed in just 400 milliseconds. But what actually happened? Did my data get loaded into memory?
1:37
No, it didn't. This is a lazy operation as we've covered before. What we've got instead is a task graph.
1:44
The task graph will be invoked when we actually want to perform computations.
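As a sketch, the read we just timed might look like this; the file path and glob pattern are assumptions, since the actual location of the taxi CSVs isn't shown.

    import dask.dataframe as dd

    # The glob pattern picks up all monthly files for the year; this returns
    # almost immediately because it only builds a task graph (hypothetical path)
    df = dd.read_csv("data/nyc-taxi/yellow_tripdata_2019-*.csv")

    # Printing df shows the column names and dtypes, but no actual data
    print(df)

    # With graphviz installed, the task graph itself can be rendered:
    # df.visualize()

On that note, let's do some basic reading.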
1:50
Let's look at the top few rows by using the head method. We can see in the Task Stream that it took
1:57
1.6 seconds just to pull in a small piece of the data, and we can see the result up here. To look at the last few rows, we can use the tail method.
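As a sketch, both calls are one-liners on the same df:

    df.head()  # computes only the first partition, enough for the first 5 rows
    df.tail()  # reads the last partition of the dataset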
2:05
Oh, but we get a big nasty error. Let's scroll down and look at it closer.
2:10
Okay, Dask provides a verbose explanation of what the error might be and where
2:14
the mismatch lies. Because Dask is not loading the entire dataset into memory all
2:19
at once, it has to estimate the data types up front. Because the last few rows differ from what Dask initially estimated, we get this error.
2:28
The official recommendation, which we can see in the error message, is to manually specify which data types belong to which columns. Dask estimates
2:36
data types from a small sample of the data to stay efficient, so it's common to run into this error. That's also why the error message includes all
2:45
of these recommendations. Great, let's apply the recommendation and load the dataset again, this time with the specified data types.
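Applying it might look like the sketch below; the exact columns and dtypes to use come from the error message, so the ones here are only illustrative.

    # Re-read with explicit dtypes for the columns Dask flagged (illustrative)
    df = dd.read_csv(
        "data/nyc-taxi/yellow_tripdata_2019-*.csv",
        dtype={
            "RatecodeID": "float64",
            "VendorID": "float64",
            "passenger_count": "float64",
        },
    )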
2:53
The head method worked and so did the tail method. Excellent, let's move on and have a look at some basic operations like we did
3:01
last time. Let's group the trips by passenger_count and calculate the mean tip_amount. This again is not the true result,
3:08
it took just 12 milliseconds to build. To get the actual result, I need to call compute, which will trigger all the steps, from reading the CSVs
3:16
from disk all the way through the groupby and mean computations. At the end, we'll get a pandas DataFrame or Series with the output.
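As a sketch, assuming the same df as above:

    # Lazy: this only builds the task graph, which is why it takes milliseconds
    mean_tip = df.groupby("passenger_count").tip_amount.mean()

    # Eager: reads the CSVs, runs the groupby and mean across the workers,
    # and returns a plain pandas Series
    result = mean_tip.compute()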
3:23
Let's time that. As I run this operation, I can see the Task Stream getting busy.
3:30
I can see worker cores coming online, matching the number of available cores, which should be 16 in my case.
3:37
I can see the different chunks being read in, I can see the cluster nodes communicating with each other and with the scheduler and
3:44
I can see CPU utilization as well as memory use across my cluster.
3:49
The tasks continue, the workers continue communicating and I'm still waiting for my result.
3:54
I can also see that the entire operation has been running for just over 30 seconds now, and I can watch the successive parts of the CSVs being read.
4:02
This is very helpful to see, and since there are no white gaps between the tasks, I can tell the workers are busy at every moment. As the CSVs
4:10
are read, we can see tasks of different colors, which signal that the data is already in memory and is now being processed.
4:18
And in just under 60 seconds, I was able to get my results with the same
4:22
code as I would with regular Pandas, but on a dataset that's bigger than memory. Along with lazy computation of task graphs,
4:29
Dask also releases results from memory, both intermediate and final, unless specifically asked to keep them.
4:35
In order to store intermediate results for future use, we can use the persist method.
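A minimal sketch, assuming the same df as above:

    # Triggers the computation now and keeps the resulting partitions in
    # worker memory, so later operations on df reuse them instead of
    # re-reading the CSVs from disk
    df = df.persist()

It's time for a checkpoint.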
4:42
Can you compute the standard deviation for tip_amount as a function of passenger_count for the entire dataset? Please type your answer here
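If you want to check your work afterwards, one possible answer follows the same pattern as the mean above:

    df.groupby("passenger_count").tip_amount.std().compute()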