Fundamentals of Dask Transcripts
Chapter: Dask Delayed
Lecture: Pandas groupby using Dask delayed
Now let's see how we can use the pandas 'groupby()' in parallel by leveraging Dask Delayed.
Note that this is purely for demonstration purposes, and Dask DataFrame should always be preferred in real-world situations.
We'll be going back to the NYC taxi cab data set that we used in the first course. If you don't have the data set,
you can uncomment this cell to download it. Remember to move all the files to a data subdirectory, as we have done here.
Let's start by importing the data for January 2019 using pandas and calculating the mean 'tip_amount' as a function of 'passenger_count'.
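As a rough illustration of that single-month calculation, here is a minimal sketch. It assumes the column names 'passenger_count' and 'tip_amount' from the narration and uses a small made-up DataFrame in place of the January 2019 file, so the numbers are not real taxi data.

```python
import pandas as pd

# Tiny stand-in for one month of taxi data (values are made up for illustration).
df = pd.DataFrame(
    {
        "passenger_count": [1, 1, 2, 2, 2, 3],
        "tip_amount": [1.0, 3.0, 2.0, 4.0, 6.0, 5.0],
    }
)

# Mean tip for each distinct passenger count.
mean_tip = df.groupby("passenger_count")["tip_amount"].mean()
print(mean_tip)
```

With the real data you would replace the hand-built DataFrame with a 'pd.read_csv' call on the January file.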
We use the groupby function in pandas for this computation. Now, to compute this over the entire 12 months of data without Dask DataFrame, we
can go through the files, one per month, one by one. We perform a pandas groupby on each file, and for each possible value
of the number of passengers we calculate two things: first, the sum of the tip amount, and second, the total number of data points
that had that value for the number of passengers. We then save these values and calculate the mean
after we have gone through all the files. We encourage you to pause the video and take your time to go through this block of code.
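The file-by-file approach described above might look roughly like the sketch below. The file names, the temporary directory, and the values are hypothetical stand-ins for the monthly taxi files, and this is not necessarily the exact code shown in the video.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write two tiny CSV files standing in for the monthly files (made-up values).
data_dir = Path(tempfile.mkdtemp())
months = {
    "month_01.csv": {"passenger_count": [1, 1, 2], "tip_amount": [1.0, 3.0, 2.0]},
    "month_02.csv": {"passenger_count": [1, 2, 2], "tip_amount": [5.0, 4.0, 6.0]},
}
for name, cols in months.items():
    pd.DataFrame(cols).to_csv(data_dir / name, index=False)

sums, counts = [], []
for path in sorted(data_dir.glob("*.csv")):
    df = pd.read_csv(path)
    by_passengers = df.groupby("passenger_count")["tip_amount"]
    sums.append(by_passengers.sum())     # total tips per passenger count
    counts.append(by_passengers.count()) # number of rows per passenger count

# Combine the per-file results, then divide to get the overall mean.
total_sum = pd.concat(sums).groupby(level=0).sum()
total_count = pd.concat(counts).groupby(level=0).sum()
mean_tip = total_sum / total_count
print(mean_tip)
```

Keeping the sum and the count separately, rather than averaging the monthly means, gives the correct overall mean even when the months contain different numbers of rows.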
Now we'll introduce parallelism into this code using Delayed. This code block is similar to the previous block, but notice how we read the CSV
files in a delayed fashion. This makes all the subsequent operations Delayed objects as well.
We then compute the sum and count values here, after going through all the files,
and then we calculate the mean as before. Notice the time difference here.
It's not a lot, but it is significant enough to add up when we work with really large data sets.
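A minimal sketch of the delayed version follows, again using hypothetical file names and made-up values rather than the real taxi files. Wrapping 'pd.read_csv' in 'dask.delayed' means every later operation on the result is also a Delayed object, and nothing runs until 'dask.compute' is called. On files this small, do not expect any speedup; the point is only to show the structure.

```python
import tempfile
from pathlib import Path

import dask
import pandas as pd
from dask import delayed

# Write two tiny CSV files standing in for the monthly files (made-up values).
data_dir = Path(tempfile.mkdtemp())
months = {
    "month_01.csv": {"passenger_count": [1, 1, 2], "tip_amount": [1.0, 3.0, 2.0]},
    "month_02.csv": {"passenger_count": [1, 2, 2], "tip_amount": [5.0, 4.0, 6.0]},
}
for name, cols in months.items():
    pd.DataFrame(cols).to_csv(data_dir / name, index=False)

sums, counts = [], []
for path in sorted(data_dir.glob("*.csv")):
    df = delayed(pd.read_csv)(path)  # lazy: the file is not read yet
    by_passengers = df.groupby("passenger_count")["tip_amount"]  # still Delayed
    sums.append(by_passengers.sum())
    counts.append(by_passengers.count())

# Trigger the work for all files at once; Dask can run the per-file tasks in parallel.
sums, counts = dask.compute(sums, counts)

# From here on everything is plain pandas, as in the serial version.
total_sum = pd.concat(sums).groupby(level=0).sum()
total_count = pd.concat(counts).groupby(level=0).sum()
mean_tip = total_sum / total_count
print(mean_tip)
```

Passing both lists to a single 'dask.compute' call lets Dask share the file read between the sum and the count for each file, instead of reading every file twice.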