Fundamentals of Dask Transcripts
Chapter: Dask Delayed
Lecture: Pandas groupby using Dask delayed

0:00 Now let's see how we can use Pandas 'groupby()' in parallel by leveraging
0:05 Dask Delayed. Note that this is purely for demonstration purposes, and Dask DataFrame should always be preferred in real-world situations.
0:14 We'll be going back to the NYC taxi cab data set that was used in the first course. If you don't have the data set,
0:21 you can uncomment this cell to download it. Remember to move all the files to a data subdirectory, as we have done here.
0:27 Let's start by importing the data for January 2019 using Pandas and calculating the mean 'tip_amount' as a function of 'passenger_count'.
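The single-month calculation described here can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the tiny in-memory CSV is a hypothetical stand-in for the January 2019 file, and only the column names 'passenger_count' and 'tip_amount' come from the lecture.

```python
import io

import pandas as pd

# Hypothetical stand-in for one month of taxi data. In the course,
# the data is read from a CSV file in the data subdirectory instead.
csv_text = """passenger_count,tip_amount
1,2.0
1,4.0
2,1.0
2,3.0
2,5.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Group the rows by passenger count and average the tip within each group.
mean_tip = df.groupby("passenger_count")["tip_amount"].mean()
print(mean_tip)
```

The result is a Pandas Series indexed by 'passenger_count', with one mean tip per group.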
0:38 We use the 'groupby()' function in Pandas for this computation. Now, to compute this over the entire 12 months of data without Dask DataFrame, we
0:48 can go through the file for each month, one by one. We perform a Pandas groupby on it, and for each possible value
0:56 of the number of passengers, we calculate two things: first, the sum of the tip amount, and second, the total number of data points
1:05 that had that value for the number of passengers. We then save these values and calculate the mean
1:12 after we have gone through all the files. We encourage you to pause the video and take your time to go through this block of code.
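The file-by-file approach described above might look roughly like this. It is a sketch under stated assumptions: the helper 'write_sample_files' and the file names are hypothetical, and two tiny generated files stand in for the 12 monthly CSVs so the example runs without the real data set.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_sample_files(directory):
    """Write two tiny hypothetical monthly CSV files and return their paths."""
    months = {
        "month_01.csv": "passenger_count,tip_amount\n1,2.0\n1,4.0\n2,1.0\n",
        "month_02.csv": "passenger_count,tip_amount\n1,6.0\n2,3.0\n2,5.0\n",
    }
    paths = []
    for name, text in months.items():
        path = Path(directory) / name
        path.write_text(text)
        paths.append(path)
    return paths


sums, counts = [], []
with tempfile.TemporaryDirectory() as tmp:
    for path in write_sample_files(tmp):
        df = pd.read_csv(path)
        by_passengers = df.groupby("passenger_count")["tip_amount"]
        # Save the per-file tip total and row count for each passenger count.
        sums.append(by_passengers.sum())
        counts.append(by_passengers.count())

# Add up the per-file results, then divide to get the overall mean.
total_sum = pd.concat(sums).groupby(level=0).sum()
total_count = pd.concat(counts).groupby(level=0).sum()
mean_tip = total_sum / total_count
print(mean_tip)
```

Sums and counts are kept separately because averaging the per-file means directly would give the wrong answer whenever the files contain different numbers of rows for a given passenger count.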
1:21 Now we'll introduce parallelism into this code using Delayed. This code block is similar to the previous block, but notice how we read the CSV
1:30 files in a delayed fashion. This makes all the subsequent operations Delayed objects as well.
1:35 We then compute the sum and count values here, after going through all the files,
1:42 and then we calculate the mean as before. Notice the time difference here.
1:47 It's not a lot, but it is significant enough to add up when we work with really large data sets.
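A delayed version of the same loop could be sketched like this. As before, the helper 'write_sample_files' and the file names are hypothetical stand-ins for the real monthly files, and the notebook's actual code may differ in detail. The example is too small to show any speedup; it only illustrates the pattern.

```python
import tempfile
from pathlib import Path

import pandas as pd
from dask import compute, delayed


def write_sample_files(directory):
    """Write two tiny hypothetical monthly CSV files and return their paths."""
    months = {
        "month_01.csv": "passenger_count,tip_amount\n1,2.0\n1,4.0\n2,1.0\n",
        "month_02.csv": "passenger_count,tip_amount\n1,6.0\n2,3.0\n2,5.0\n",
    }
    paths = []
    for name, text in months.items():
        path = Path(directory) / name
        path.write_text(text)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    sums, counts = [], []
    for path in write_sample_files(tmp):
        # Wrapping read_csv in delayed means nothing is read yet.
        df = delayed(pd.read_csv)(str(path))
        # Every operation on a Delayed object is itself Delayed.
        by_passengers = df.groupby("passenger_count")["tip_amount"]
        sums.append(by_passengers.sum())
        counts.append(by_passengers.count())

    # Trigger the actual work for all files at once, while the files still exist.
    sums, counts = compute(sums, counts)

# From here on we have ordinary Pandas objects, combined as in the serial version.
total_sum = pd.concat(sums).groupby(level=0).sum()
total_count = pd.concat(counts).groupby(level=0).sum()
mean_tip = total_sum / total_count
print(mean_tip)
```

Calling 'compute' once on both lists, rather than once per file, lets Dask see the whole task graph and schedule the per-file work in parallel.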

