Getting started with Dask Transcripts
Chapter: Using the Dask DataFrame
Lecture: pandas

0:00 We are in notebook two of this course. For this notebook, we'll be using the New York City Yellow Taxi Trips Dataset.
0:07 This is a public dataset, released by the city of New York for everyone to use
0:11 and download. It contains taxi cab trip records dating all the way back to 2009, which amounts to over 200 gigabytes of files.
0:20 We'll be using only data for 2019. To download the dataset, either run the wget cell given here or download it directly from the
0:29 website. The website is linked here at the New York City Yellow Taxi Trips Dataset blue hyperlink in the cell above.
0:36 If you're on Windows, this wget instruction might not work. Okay, as we will rely
0:42 on this dataset now and in the future, we recommend keeping all the files in
0:46 a subdirectory called data, which is easily accessible from your workspace. Okay, let's read in the data. Pandas has a read_csv
0:54 function to read CSV files into your Python session. We will read data for only one month, which is already a lot.
1:00 We'll use January and time the reading using the %%time magic, the double-percent time magic. Note that it takes us 12 seconds to load from disk
1:10 into memory. It may be different depending on your machine configuration but it will be
1:15 a number in seconds. Now we have over seven million rows loaded directly into memory
1:19 ready to use. It's over seven million rows with 18 columns of various types. We can learn more about the dataset by running the info method.
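As a rough sketch of the load-and-inspect step just described, the snippet below times a read_csv call and then prints a summary of the result. It assumes the summary method used in the notebook is info, and it reads a tiny made-up CSV from memory so that it runs anywhere. With the real data you would pass a file path instead; the file name in the comment is an assumption, not something shown in the lecture.

```python
import io
import time

import pandas as pd

# Tiny stand-in for the January file so the sketch runs anywhere.
# With the real data you would pass a path instead, for example
# "data/yellow_tripdata_2019-01.csv" (an assumed file name).
sample_csv = io.StringIO(
    "passenger_count,trip_distance,tip_amount\n"
    "1,1.5,2.0\n"
    "2,3.2,4.0\n"
    "1,0.8,0.0\n"
)

start = time.perf_counter()  # plain-Python stand-in for the %%time magic
df = pd.read_csv(sample_csv)
elapsed = time.perf_counter() - start

print(f"read {len(df)} rows in {elapsed:.4f} s")

# Prints column names, dtypes, and total memory usage.
df.info(memory_usage="deep")
```

Outside a notebook, time.perf_counter plays the role of %%time. Inside Jupyter, putting %%time on the first line of the cell does the same job.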
1:30 It shows us the different column names as well as their data types and the total memory usage. Great, after reading the data,
1:38 the next step is to do some meaningful computation on it. Let's find the mean of the tip_amount as a function of passenger_count of the
1:46 vehicle. We can use the mean and groupby functions for these operations. If you've used Pandas before, you've probably seen them.
1:53 Those are very popular methods. groupby splits the dataset by column values while mean
1:59 calculates the mean for those groups. Again we time it and wow, we have the result in just 100 milliseconds, that is pretty impressive.
2:07 That is well worth paying the price we've paid in the beginning, the 12 seconds it took to load the entire file into memory.
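The groupby and mean computation described here can be sketched as follows. The column names tip_amount and passenger_count come from the lecture; the five rows of data are made up for illustration and stand in for the January DataFrame.

```python
import pandas as pd

# Made-up rows standing in for the January taxi data.
df = pd.DataFrame(
    {
        "passenger_count": [1, 1, 2, 2, 3],
        "tip_amount": [2.0, 4.0, 1.0, 3.0, 5.0],
    }
)

# groupby splits the rows by passenger_count;
# mean then averages tip_amount within each group.
mean_tip = df.groupby("passenger_count")["tip_amount"].mean()

print(mean_tip)
```

Selecting the tip_amount column before calling mean keeps the computation to the one column of interest rather than averaging every numeric column.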
2:14 Fantastic, we've seen some basic operations in Pandas. As mentioned earlier,
2:19 Pandas can't deal with data that is larger than memory. It really shines, except when we run into datasets that are too big. In this cell,
2:27 we try reading data from all months using Pandas. This cell right here, we are reading every file, all 12 files, and then concatenating them together.
2:35 Feel free to uncomment and run this cell on your own to see the dreaded memory error. This is where Dask comes to our aid.
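A minimal sketch of that read-everything-and-concatenate pattern is shown below. The helper name read_all_months and the file naming scheme are assumptions made for illustration, and the demo writes 12 tiny CSV files into a temporary directory so that it runs without the real dataset. On the full 2019 files, the same pattern holds every month in memory at once, which is where a memory error can appear.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_all_months(data_dir, year=2019):
    """Read one CSV per month and concatenate them into a single DataFrame."""
    # Assumed naming scheme; adjust it to match the files in your data directory.
    paths = [
        Path(data_dir) / f"yellow_tripdata_{year}-{month:02d}.csv"
        for month in range(1, 13)
    ]
    # Every monthly DataFrame is held in memory at the same time.
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Demo on 12 tiny made-up files instead of the real multi-gigabyte ones.
with tempfile.TemporaryDirectory() as tmp:
    for month in range(1, 13):
        tiny = pd.DataFrame({"passenger_count": [1, 2], "tip_amount": [1.0, 2.0]})
        tiny.to_csv(Path(tmp) / f"yellow_tripdata_2019-{month:02d}.csv", index=False)
    all_trips = read_all_months(tmp)

print(all_trips.shape)
```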

