Getting started with Dask Transcripts
Chapter: Using the Dask DataFrame
Lecture: pandas

0:00 We are in notebook two of this course. For this notebook, we'll be using the New York City Yellow Taxi Trips Dataset.
0:07 This is a public dataset, released by the city of New York for everyone to use
0:11 and download. It contains Taxi Cab trip records dating all the way back to 2009, which amounts to over 200 gigabytes of files.
0:20 We'll be using only data for 2019. To download the dataset, either run the cell given here (the wget cell) or download it directly from the
0:29 website. The website is linked here at the New York City Yellow Taxi Trips Dataset blue hyperlink in the cell above.
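For machines where wget is not available, the download step could be sketched in plain Python as below. This is not the notebook's cell: the helper name download_to_data_dir is hypothetical, and the URL has to be taken from the dataset website, since it is not given in this transcript.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_to_data_dir(url: str, data_dir: str = "data") -> Path:
    """Download `url` into `data_dir` (created if missing) and return the local path.

    Hypothetical helper for illustration; it is not part of the course notebook.
    """
    target_dir = Path(data_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Use the last path component of the URL as the local file name.
    target = target_dir / url.rsplit("/", 1)[-1]
    if not target.exists():  # skip files that were already downloaded
        urlretrieve(url, str(target))
    return target
```

Calling the helper once per monthly file URL would collect everything under the data subdirectory.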
0:36 If you're on Windows, this wget instruction might not work. Okay, as we will rely
0:42 on this dataset now and in the future, we recommend curating all the files in
0:46 a subdirectory called data, which is easily accessible from your workspace. Okay, let's read in the data. Pandas has a read_csv
0:54 function to read CSV files into your Python session. We will read data for only one month, which is already a lot.
1:00 We'll use January and time the reading using the %%time magic, the double percent time magic. Note that it takes us 12 seconds to load from disk
1:10 into memory. It may be different depending on your machine configuration but it will be
1:15 a number in seconds. Now we have over seven million rows loaded directly into memory
1:19 ready to use. It's over seven million rows with 18 columns of various types. We can learn more about the dataset by running the df.info() method.
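The read, time, and inspect steps described above might look roughly like this outside a notebook. It is only a sketch: a three-row in-memory CSV with made-up values stands in for the January file, and time.perf_counter replaces the %%time magic, which only works inside Jupyter or IPython.

```python
import io
import time

import pandas as pd

# Tiny stand-in for the January file. The column names come from the lecture,
# the values are made up. In the notebook you would pass the path of the
# downloaded file to read_csv instead.
sample_csv = io.StringIO(
    "passenger_count,tip_amount\n"
    "1,1.65\n"
    "2,0.00\n"
    "1,2.00\n"
)

start = time.perf_counter()
df = pd.read_csv(sample_csv)
elapsed = time.perf_counter() - start

print(f"read_csv took {elapsed:.4f} s")
print(df.shape)  # (number of rows, number of columns)
df.info(memory_usage="deep")  # column names, dtypes, and total memory usage
```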
1:30 It shows us the different column names as well as their data types and the total memory usage. Great, after reading the data,
1:38 the next step is to do some meaningful computation on it. Let's find the mean of the tip_amount as a function of passenger_count of the
1:46 vehicle. We can use the mean and groupby functions for these operations. If you've used Pandas before, you've probably seen them.
1:53 Those are very popular methods. groupby splits the dataset by column values while mean
1:59 calculates the mean for those groups. Again we time it and, wow, we have the result in just 100 milliseconds. That is pretty impressive.
2:07 That is well worth paying the price we've paid in the beginning, the 12 seconds it took to load the entire file into memory.
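As a small illustration of the groupby and mean combination, here is a sketch on a five-row DataFrame with made-up values. The exact expression used in the notebook is not shown in this transcript, so treat the spelling below as one common way to write it.

```python
import pandas as pd

# Made-up sample data using the two column names from the lecture.
df = pd.DataFrame(
    {
        "passenger_count": [1, 1, 2, 2, 3],
        "tip_amount": [1.0, 3.0, 2.0, 4.0, 5.0],
    }
)

# Split the rows by passenger_count, then average tip_amount within each group.
mean_tip = df.groupby("passenger_count")["tip_amount"].mean()
print(mean_tip)
```

The result is a Series indexed by passenger_count, with one mean tip per group.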
2:14 Fantastic, we've seen some basic operations in Pandas. As mentioned earlier,
2:19 Pandas can't deal with data that is larger than memory. It really shines, except when we run into datasets that are too big. In this cell,
2:27 we try reading data from all months using Pandas. This cell right here, we are reading every file, 12 files, and then concatenating them together.
2:35 Feel free to uncomment and run this cell on your own to see the dreaded memory error. This is where Dask comes to our aid.
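The read-everything-and-concatenate pattern could be sketched as follows. The helper name read_all_csvs and the glob pattern passed to it are assumptions for illustration, not the notebook's own code. Every file is loaded fully before pd.concat runs, so all of the data has to fit in memory at once, and with the full-size monthly files that is the step that can fail.

```python
import glob

import pandas as pd


def read_all_csvs(pattern: str) -> pd.DataFrame:
    """Read every CSV matching `pattern` and stack them into one DataFrame.

    Hypothetical helper for illustration; all files are held in memory at once.
    """
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # One DataFrame per file, each loaded completely into memory.
    frames = [pd.read_csv(path) for path in paths]
    # Concatenate row-wise and renumber the index from 0.
    return pd.concat(frames, ignore_index=True)
```

A call might look like read_all_csvs("data/*.csv"), assuming the monthly files sit in the data subdirectory.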

