Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 3: Merging AirBnB Temperature Data
Lecture: Cleanup columns after merging with loc
0:00
I want to show you how to clean up the columns a little bit, so let's look at that. Okay, here's my chain up here.
0:06
Let me just comment this out and we'll walk through it. So one of the things that people say is it's hard to understand
0:12
what's going on with a chain, and generally when I'm making these chains, I'm building them up from scratch. So I might say, here's my data frame here,
0:20
and then maybe I'll just stick in this debugging info before. Note that this debug returns the df;
0:30
if I scroll up here and look at the debug function that I defined up here, you can see that it returns df.
0:35
When I'm using it with pipe, I can continue operating on that data frame. So this is just having a side effect of printing out that output.
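The debug function itself is not shown in this transcript, so the following is only a minimal sketch of what such a helper might look like. The name `debug`, its `label` parameter, and the sample `listings` data are assumptions for illustration. The point is that the function prints as a side effect and returns the same data frame, so the chain keeps going.

```python
import pandas as pd


def debug(df, label=""):
    # Hypothetical helper: print the shape as a side effect,
    # then hand back the same data frame so the chain can continue.
    print(f"{label} shape: {df.shape}")
    return df


# Made-up sample data, standing in for the real listings.
listings = pd.DataFrame({"id": [1, 2, 3], "price": [100, 150, 80]})

result = (
    listings
    .pipe(debug, "before")   # prints, returns listings unchanged
    .query("price > 90")     # next step operates on the returned frame
    .pipe(debug, "after")
)
```

Because `debug` returns its input, commenting a `.pipe(debug, ...)` line in or out does not change the result of the chain.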
0:44
Now let's do our merge. Okay, it looks like our merge returned this output over here.
0:50
I can pipe in how big that is, and now let's look at the columns of that. Here are the columns after I've done the merge.
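As a rough sketch of building the chain up step by step, here is a merge with a size and column check piped in after it. The two small frames, the `month` join key, and the `debug` helper are all assumptions made for illustration, not the actual course data or code.

```python
import pandas as pd

# Made-up stand-ins for the listings and temperature data.
listings = pd.DataFrame({"id": [1, 2], "month": [1, 2], "price": [100, 150]})
temps = pd.DataFrame({"month": [1, 2], "avg_temp": [30.5, 35.0]})


def debug(df, label=""):
    # Hypothetical helper: report size and columns, return df unchanged.
    print(f"{label} shape: {df.shape}")
    print(f"{label} columns: {list(df.columns)}")
    return df


merged = (
    listings
    .pipe(debug, "before merge")
    .merge(temps, on="month", how="left")  # assumed join key
    .pipe(debug, "after merge")
)
```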
0:57
One of the things that I like to do is explicitly list out the columns. So I'm going to use .loc to do that,
1:06
and loc, if you're not familiar with it, is a little bit weird. To be specific, it's not a method, it's a property,
1:12
and we index off of the property, so we use the square bracket syntax here. Also, this is a little bit interesting as well.
1:18
If you look at what we're indexing with, in this case, we're actually indexing with a tuple. We've got a colon here.
1:23
This is the row selector, so this is saying take all of the rows, and then we've got a comma, and this is the column selector.
1:29
So these are the columns that we're taking. So let's run that, and you can see that we've limited it to 17 columns.
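To make the indexing concrete, here is a small sketch with made-up column names (the real chain keeps 17 columns, which are not listed in this transcript). It shows that the colon and the column list inside the square brackets really are a two-item tuple: a row selector and a column selector.

```python
import pandas as pd

# Made-up frame; "scratch" stands in for a column we do not want to keep.
df = pd.DataFrame({
    "id": [1, 2],
    "price": [100, 150],
    "avg_temp": [30.5, 35.0],
    "scratch": [0, 0],
})

keep = ["id", "price", "avg_temp"]

# The colon selects all rows, the list selects the columns.
selected = df.loc[:, keep]

# The same selection with the tuple written out explicitly:
# slice(None) is what a bare colon means inside square brackets.
same = df.loc[(slice(None), keep)]
```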
1:36
So this might seem like a small thing, but this is one of those practices that I've found is actually very useful.
1:43
You want to be flexible in what you receive, but you want to be strict in what you output. So we can use that loc just to add a hint to the end user.
1:52
These are the columns that are coming out of this. This makes it really easy when you come back to the code and want to know what's coming out of it.
1:58
You can see it at that last step of the chain. So I recommend, especially if you're doing long pipelines or you're doing things for machine learning,
2:07
as a last step, just put in that loc and make sure that you are explicit about what columns come out of that.
2:13
Especially if your data is changing in the future and new columns come in, you want to be explicit about what columns come out.
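Here is one way this might play out when the data changes, sketched with hypothetical names (`OUTPUT_COLUMNS`, `tidy`, and the sample columns are not from the course). An extra upstream column is dropped by the final loc, and a missing expected column raises a KeyError instead of silently passing through.

```python
import pandas as pd

# Hypothetical list of the columns this pipeline promises to output.
OUTPUT_COLUMNS = ["id", "price", "avg_temp"]


def tidy(df):
    # Last step of the chain: be strict about what comes out.
    return df.loc[:, OUTPUT_COLUMNS]


today = pd.DataFrame({"id": [1], "price": [100], "avg_temp": [30.5]})

# Later, a new column shows up in the incoming data.
later = today.assign(new_upstream_col=["x"])

# The output columns are the same either way.
print(list(tidy(today).columns))
print(list(tidy(later).columns))

# If an expected column goes missing, loc raises a KeyError.
try:
    tidy(today.drop(columns=["avg_temp"]))
except KeyError as err:
    print("missing column:", err)
```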