Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 1: Working with Student Information CSV Files
Lecture: Loading CSV data from a ZIP file with Pandas and Pyarrow

0:00 So the data we're going to be looking at is from University of California, Irvine's machine learning repository.
0:06 This is a data set of student performance from Portugal. Let's load our libraries. I'm loading the pandas library,
0:13 and I'm also loading some libraries from the Python standard library to help me fetch files from the internet and read zip files.
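A minimal sketch of that imports cell (the exact standard-library modules are an assumption):

```python
import zipfile          # read individual members out of a .zip archive
import urllib.request   # fetch files from the internet (standard library)

import pandas as pd
```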
0:22 The data is hosted at the University of California, Irvine as a zip file.
0:26 And if you look inside of the zip file, there are various files inside of it. We are interested in the student-mat.csv file.
0:35 So what I'm going to do is I'm going to download the zip file using curl.
0:38 You'll note that at the front of this cell I have an exclamation point, indicating that I am running an external command.
0:45 So curl is not a Python command, but I have curl installed on my Mac machine, and this is using curl to download that data.
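A sketch of that download cell, assuming the archive is saved as student.zip (the exact UCI URL may differ):

```python
# The leading ! tells Jupyter to run this as a shell command, not Python.
!curl -L -o student.zip https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip
```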
0:53 Once I've got this zip file, I have it locally. I can look at it and see that it has the same files.
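Listing the members of the local archive might look like this (the student.zip filename is assumed):

```python
import zipfile

# Confirm the local archive has the same members as the one on the UCI site
with zipfile.ZipFile("student.zip") as zf:
    print(zf.namelist())
```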
1:00 Now pandas has the ability to read a CSV from a zip file if there's only one CSV in the zip file. In this case, there are multiple CSV files inside of it.
1:09 So I'm going to have to use this code here, combining the zipfile library with pandas, to pull out the file that I want. Let's run that.
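A sketch of that cell, under the same student.zip assumption; note that this UCI data set is semicolon-delimited, hence sep=";":

```python
import zipfile
import pandas as pd

# Pull just student-mat.csv out of the archive and hand the open file to pandas
with zipfile.ZipFile("student.zip") as zf:
    with zf.open("student-mat.csv") as fin:
        df = pd.read_csv(fin, sep=";")
```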
1:21 It looks like that worked. I've stored the result in this df variable. Let's look at that. This is a data frame.
1:28 We're going to be seeing this a lot in this course. A data frame represents a table of data. Down the left-hand side in bold, you see the index.
1:36 In this case, it's numeric. Pandas puts that in for us if we didn't specify one. There are 395 rows and 33 columns.
1:43 So we're only seeing the first five rows and the last five rows. We're actually only seeing the first 10 columns
1:49 and the last 10 columns. You can see that there's an ellipsis in the middle, separating the first 10 columns from the last 10 columns.
1:56 And you can also see that there's an ellipsis separating the first five rows from the last five rows.
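If you want to check the dimensions rather than read them off the truncated display, something like this works:

```python
df.shape   # (395, 33): rows, columns

# Optional: widen the display so the column ellipsis goes away
pd.set_option("display.max_columns", None)
```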
2:01 Now, once you have a data frame in pandas, there are various things you can do with it. One of them might be to look at the memory usage.
2:07 I'm going to look at the memory usage from this data frame. And it looks like it's using 454 kilobytes of memory.
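The memory check is likely a call along these lines (deep=True so the Python strings are counted too):

```python
# Per-column memory plus the index; sum() gives the total in bytes
df.memory_usage(deep=True).sum()
```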
2:15 Now, one of the things that pandas 2 introduced is this pyarrow backend.
2:20 So I'm going to reload the file using dtype backend as pyarrow and engine is equal to pyarrow. It looks like that worked.
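A sketch of the reload, under the same student.zip assumption (requires pandas 2.0 or later plus the pyarrow package):

```python
with zipfile.ZipFile("student.zip") as zf:
    with zf.open("student-mat.csv") as fin:
        df = pd.read_csv(
            fin,
            sep=";",
            dtype_backend="pyarrow",  # store the columns as pyarrow-backed dtypes
            engine="pyarrow",         # parse the CSV with pyarrow's multi-threaded reader
        )
```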
2:27 Let's look at our memory usage now. And we see that our memory usage has gone to 98 kilobytes.
2:33 Prior to pandas 2, pandas would back the data using numpy arrays. And numpy arrays didn't have a type for storing string data.
2:43 So it was not really optimized for storing string data. Pandas 2, if you use pyarrow as a backend, does have a string type that we can leverage.
2:53 And that's probably where we're seeing the memory savings. Now, we are getting that memory savings by saying dtype backend is pyarrow.
3:01 So instead of using numpy, the dtype backend parameter says use pyarrow to store the data.
3:06 The other parameter there, engine is equal to pyarrow, is what is used to parse the CSV file. The pyarrow library is multi-threaded and presumably can
3:16 parse files faster than the native pandas parser. Okay, the next thing I want to do is I want to run this microbenchmark here.
3:24 And that's going to tell us how long it takes to read this file using pyarrow as the engine. And it says it takes six milliseconds.
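That benchmark cell probably looks roughly like this (file names assumed as before; zipfile and pandas are already imported above):

```python
%%timeit
# Time the read with the pyarrow engine and pyarrow-backed dtypes
with zipfile.ZipFile("student.zip") as zf:
    with zf.open("student-mat.csv") as fin:
        pd.read_csv(fin, sep=";", dtype_backend="pyarrow", engine="pyarrow")
```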
3:35 Let's run it without using pyarrow and see how long that takes. Now, %%timeit is not Python code. This is a cell magic.
3:44 This is something that's unique to Jupyter that allows us to do a microbenchmark.
3:47 Basically, it's going to run the code inside the cell a number of times and report how long it took.
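A sketch of the comparison cell, reading the same file with pandas' default C engine and numpy-backed dtypes:

```python
%%timeit
with zipfile.ZipFile("student.zip") as zf:
    with zf.open("student-mat.csv") as fin:
        pd.read_csv(fin, sep=";")
```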
3:53 Interestingly, in this case, it looks like we are not getting a performance benefit from using the pyarrow engine to read the CSV file.
4:03 It looks like it's a little bit slower. When you're running a benchmark with Python, make sure you benchmark it
4:09 with what you will be using in production, the size of the data that you will be using in production.
4:13 In this case, we saw that using that pyarrow engine actually didn't help us. It ran a little bit slower.
4:20 But the numbers are so small that it's not really a big deal. If you were going from minutes down to seconds, that would be a huge savings.
4:29 Another thing that you can do with Jupyter is you can put a question mark after a method or a function and you can pull up the documentation here.
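For example, in a Jupyter cell:

```python
# The trailing question mark pops open the docstring for read_csv
pd.read_csv?
```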
4:36 You see that read_csv has something like 40 different parameters. If we scroll down a little bit, I think we'll find engine in here. Let's see if we can find it.
4:44 And there it is right here. So let's scroll down a little bit more. There is documentation about engine. So let's read that. Here it is.
4:52 It says that this is the parser engine to use. The C and pyarrow engines are faster, while the Python engine is currently more feature complete.
5:00 The pandas developers have taken it upon themselves to write a CSV parser that will read 99.99% of CSVs in existence.
5:08 The pyarrow parser is not quite as feature complete, but can run faster on certain data sets.
5:15 To summarize, we've learned that we can use pandas to read CSV files. We also learned that pandas 2 has some optimizations to make it use less memory.

