Fundamentals of Dask Transcripts
Chapter: Dask Delayed
Lecture: Pandas groupby using Dask delayed
0:00
Now let's see how we can use Pandas 'groupby()' in parallel by leveraging
0:05
Dask Delayed. Note that this is purely for demonstration purposes, and Dask DataFrame should always be preferred in real-world situations.
0:14
We'll be going back to the NYC taxi cab dataset that was used in the first course. If you don't have the dataset,
0:21
you can uncomment this cell to download it. Remember to move all the files to a data subdirectory, as we have done here.
0:27
Let's start by importing the data for January 2019 using Pandas and calculating the mean 'tip_amount' as a function of 'passenger_count'.
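As a rough illustration of this step, here is a minimal sketch of the single-month computation. The transcript does not give the file name, so a small hand-made DataFrame stands in for the January 2019 data (an assumption); only the column names 'tip_amount' and 'passenger_count' come from the lecture.

```python
import pandas as pd

# Hypothetical stand-in for one month of taxi data. In the lecture this
# would be loaded from a CSV file with pd.read_csv instead.
df = pd.DataFrame({
    "passenger_count": [1, 1, 2, 2, 3],
    "tip_amount": [1.0, 3.0, 2.0, 4.0, 5.0],
})

# Group rows by passenger count, then average the tips within each group.
mean_tip = df.groupby("passenger_count")["tip_amount"].mean()
print(mean_tip)
```

The result is a Series indexed by 'passenger_count', with one mean tip per group.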
0:38
We use the 'groupby()' function in Pandas for this computation. Now, to compute this over the entire 12 months of data without Dask DataFrame, we
0:48
can go through the files, each of which corresponds to one month, one by one. We perform a Pandas 'groupby()' on each file, and for each possible value
0:56
of the number of passengers, we calculate two things: first, the sum of the tip amount, and second, the total number of data points
1:05
that had that value for the number of passengers. We then save these values and calculate the mean
1:12
after we have gone through all the files. We encourage you to pause the video and take your time to go through this block of code.
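The code block shown in the video is not part of this transcript, so the following is only a sketch of the approach described, under stated assumptions: two tiny CSV files written to a temporary directory stand in for the twelve monthly files, and the file names are hypothetical. Per-file sums and counts are kept, rather than per-file means, because months can have different numbers of rows, so averaging the monthly means would weight them incorrectly.

```python
import os
import tempfile

import pandas as pd

# Write two tiny hypothetical "monthly" CSV files so the example is self-contained.
tmpdir = tempfile.mkdtemp()
months = [
    pd.DataFrame({"passenger_count": [1, 1, 2], "tip_amount": [1.0, 3.0, 2.0]}),
    pd.DataFrame({"passenger_count": [1, 2, 2], "tip_amount": [5.0, 4.0, 6.0]}),
]
files = []
for i, frame in enumerate(months, start=1):
    path = os.path.join(tmpdir, f"month_{i:02d}.csv")
    frame.to_csv(path, index=False)
    files.append(path)

# Go through the files one by one, saving the per-group sum and count of tips.
sums, counts = [], []
for path in files:
    df = pd.read_csv(path)
    by_passengers = df.groupby("passenger_count")["tip_amount"]
    sums.append(by_passengers.sum())
    counts.append(by_passengers.count())

# Combine the per-file results, then divide total sum by total count.
total_sum = pd.concat(sums).groupby(level=0).sum()
total_count = pd.concat(counts).groupby(level=0).sum()
mean_tip = total_sum / total_count
print(mean_tip)
```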
1:21
Now we'll introduce parallelism into this code using Delayed. This code block is similar to the previous block, but notice how we read the CSV
1:30
files in a delayed fashion. This makes all the subsequent operations Delayed objects as well.
1:35
We then compute the sum and count values here, after going through all the files,
1:42
and then we calculate the mean as before. Notice the time difference here.
1:47
It's not a lot, but it is significant enough to add up when we work with really large datasets.
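The delayed version described above might look roughly like the sketch below. It is not the lecture's actual code: the tiny temporary CSV files and their names are assumptions made so the example runs on its own. Wrapping 'pd.read_csv' in 'dask.delayed' returns a Delayed object, so the 'groupby', 'sum', and 'count' calls that follow only build a task graph; nothing is read until 'dask.compute' is called.

```python
import os
import tempfile

import dask
import pandas as pd
from dask import delayed

# Write two tiny hypothetical "monthly" CSV files so the example is self-contained.
tmpdir = tempfile.mkdtemp()
months = [
    pd.DataFrame({"passenger_count": [1, 1, 2], "tip_amount": [1.0, 3.0, 2.0]}),
    pd.DataFrame({"passenger_count": [1, 2, 2], "tip_amount": [5.0, 4.0, 6.0]}),
]
files = []
for i, frame in enumerate(months, start=1):
    path = os.path.join(tmpdir, f"month_{i:02d}.csv")
    frame.to_csv(path, index=False)
    files.append(path)

sums, counts = [], []
for path in files:
    # Reading is delayed, so every operation on df is a Delayed object too.
    df = delayed(pd.read_csv)(path)
    by_passengers = df.groupby("passenger_count")["tip_amount"]
    sums.append(by_passengers.sum())
    counts.append(by_passengers.count())

# Run all the delayed tasks in one call; Dask may execute them in parallel.
sums, counts = dask.compute(sums, counts)

# From here on the results are ordinary Pandas Series, combined as before.
total_sum = pd.concat(sums).groupby(level=0).sum()
total_count = pd.concat(counts).groupby(level=0).sum()
mean_tip = total_sum / total_count
print(mean_tip)
```

With files this small, scheduling overhead will likely outweigh any gain; a speedup should only be expected when each file is large enough for the reads and groupbys to dominate.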