Getting started with Dask Transcripts
Chapter: Using the Dask DataFrame
Lecture: pandas
0:00
We are in notebook two of this course. For this notebook, we'll be using the New York City Yellow Taxi Trips Dataset.
0:07
This is a public dataset, released by the city of New York for everyone to use
0:11
and download. It contains Taxi Cab trip records dating all the way back to 2009, which amounts to over 200 gigabytes of files.
0:20
We'll be using only data for 2019. To download the dataset, either run the wget cell given here or download it directly from the
0:29
website. The website is linked in the cell above, at the blue New York City Yellow Taxi Trips Dataset hyperlink.
0:36
If you're on Windows, this wget instruction might not work. Okay, as we will rely
0:42
on this dataset now and in the future, we recommend curating all the files in
0:46
a subdirectory called data, which is easily accessible from your workspace. Okay, let's read in the data. Pandas has a read_csv
0:54
function to read CSV files into your Python session. We will read data for only one month, which is already a lot.
1:00
We'll use January and time the reading using the %%time magic, the double-percent time magic. Note that it takes us 12 seconds to load from disk
1:10
into memory. It may be different depending on your machine configuration but it will be
1:15
a number in seconds. Now we have over seven million rows loaded directly into memory
1:19
ready to use. It's over seven million rows with 18 columns of various types. We can learn more about the dataset by running the df.info() method.
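The read-and-time step can be sketched outside a notebook roughly as follows. This is a minimal illustration, not the notebook's own cell: the helper name load_month and the file path are assumptions, and time.perf_counter stands in for the %%time magic, which only works inside Jupyter or IPython.

```python
import time

import pandas as pd


def load_month(csv_path):
    """Read one CSV file fully into memory and report how long it took."""
    start = time.perf_counter()
    df = pd.read_csv(csv_path)
    elapsed = time.perf_counter() - start
    print(f"Loaded {len(df):,} rows x {df.shape[1]} columns in {elapsed:.1f} s")
    return df


# Hypothetical path: adjust to wherever the January 2019 file was saved.
# df = load_month("data/yellow_tripdata_2019-01.csv")
# df.info()  # prints column names, dtypes, and total memory usage
```

The timing you see will depend on your disk and machine, so treat the 12 seconds mentioned in the lecture as an order of magnitude rather than a target.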
1:30
It shows us the different column names as well as their data types and the total memory usage. Great, after reading the data,
1:38
the next step is to do some meaningful computation on it. Let's find the mean of the tip_amount as a function of passenger_count of the
1:46
vehicle. We can use the groupby and mean methods for these operations. If you've used Pandas before, you've probably seen them.
1:53
Those are very popular methods. groupby splits the dataset by column values while mean
1:59
calculates the mean for those groups. Again, we time it and wow, we have the result in just 100 milliseconds. That is pretty impressive.
2:07
That is well worth paying the price we've paid in the beginning, the 12 seconds it took to load the entire file into memory.
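As a rough sketch of the groupby-then-mean computation described above, here it is on a tiny made-up DataFrame. The column names passenger_count and tip_amount come from the dataset; the helper name and the sample values are invented purely for illustration.

```python
import pandas as pd


def mean_tip_by_passengers(df):
    """Average tip_amount for each distinct passenger_count value."""
    return df.groupby("passenger_count")["tip_amount"].mean()


# Made-up rows, only to show the shape of the result.
sample = pd.DataFrame(
    {
        "passenger_count": [1, 1, 2, 2],
        "tip_amount": [1.0, 3.0, 2.0, 4.0],
    }
)

result = mean_tip_by_passengers(sample)
print(result)  # passenger_count 1 -> 2.0, passenger_count 2 -> 3.0
```

On the real January data the same call would be applied to the DataFrame returned by read_csv, and the result is a Series indexed by passenger_count.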
2:14
Fantastic, we've seen some basic operations in Pandas. As mentioned earlier,
2:19
Pandas can't deal with data that is larger than memory. It really shines, except when we run into datasets that are too big. In this cell,
2:27
we try reading data from all months using Pandas. In this cell right here, we are reading every file, 12 files, and then concatenating them together.
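A cell like the one described might look roughly like this. It is a sketch under assumptions: the helper name load_all_months and the file pattern are hypothetical, and the call is left commented out because, on the full 2019 data, holding all twelve files in memory at once is exactly what exhausts RAM on a typical machine.

```python
import glob

import pandas as pd


def load_all_months(pattern):
    """Read every CSV matching the pattern and stack them into one DataFrame."""
    paths = sorted(glob.glob(pattern))
    # Each file is read fully into memory before any concatenation happens.
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Hypothetical pattern for the twelve 2019 files; commented out on purpose.
# all_2019 = load_all_months("data/yellow_tripdata_2019-*.csv")
```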
2:35
Feel free to uncomment and run this cell on your own to see the dreaded memory error. This is where Dask comes to our aid.