Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 3: Merging AirBnB Temperature Data
Lecture: Cleanup columns after merging with loc

0:00 I want to show you how to clean up the columns a little bit, so let's look at that. Okay, here's my chain up here.
0:06 Let me just comment this out and we'll walk through it. So one of the things that people say is it's hard to understand
0:12 what's going on with a chain, and generally when I'm making these chains, I'm building them up from scratch. So I might say, here's my data frame here,
0:20 and then maybe I'll just stick in this debugging info before, and note that because this debug returns the df,
0:30 if I scroll up here and look at the debug function that I defined up here, you can see that it returns df.
0:35 When I'm using it with pipe, I can continue operating on that data frame. So this is just having a side effect of printing out that output.
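The debug function itself is not shown in this part of the transcript, so what follows is only a minimal sketch of the pattern being described: a helper that prints something as a side effect and then returns the same data frame so the chain can keep going. The name `debug`, the `label` parameter, and the toy `listings` data are assumptions made for illustration.

```python
import pandas as pd


def debug(df, label=""):
    # Hypothetical stand-in for the lecture's debug helper:
    # print the shape as a side effect, then hand the same frame back.
    print(f"{label} shape: {df.shape}")
    return df


# Toy data, invented for this sketch.
listings = pd.DataFrame({"id": [1, 2, 3], "price": [100, 150, 90]})

result = (
    listings
    .pipe(debug, label="before")  # prints, then passes listings along
    .assign(price_eur=lambda d: d["price"] * 0.9)
    .pipe(debug, label="after")   # prints again with the new column
)
```

Because the helper returns df unchanged, it can be dropped in or commented out at any step without affecting the result of the chain.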
0:44 Now let's do our merge. Okay, it looks like our merge returned this output over here.
0:50 I can pipe in how big that is, and now let's look at the columns of that. Here are the columns after I've done the merge.
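As a rough sketch of that step, a merge can be followed by a piped helper that prints the column names. The frames, the `date` join key, and the `show_columns` helper are all hypothetical here; the actual AirBnB and temperature data and the real merge arguments are not shown in this part of the transcript.

```python
import pandas as pd

# Toy stand-ins for the listings and temperature data (assumed shapes).
listings = pd.DataFrame(
    {"id": [1, 2], "date": ["2023-01-01", "2023-01-02"], "price": [100, 150]}
)
temps = pd.DataFrame(
    {"date": ["2023-01-01", "2023-01-02"], "temp_f": [41.0, 38.5], "station": ["A", "A"]}
)


def show_columns(df):
    # Hypothetical helper: print the column names, return the frame unchanged.
    print(list(df.columns))
    return df


merged = (
    listings
    .merge(temps, on="date", how="left")  # join key "date" is an assumption
    .pipe(show_columns)
)
```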
0:57 One of the things that I like to do is I like to explicitly list out the columns. So I'm going to use .loc to do that,
1:06 and loc, if you're not familiar with it, is a little bit weird. To be specific, it's not a method, it's a property,
1:12 and we index off of the property, so we use the square bracket syntax here. Also, this is a little bit interesting as well.
1:18 If you look at what we're indexing with, in this case, we're actually indexing with a tuple. We've got a colon here.
1:23 This is the row selector, so this is saying take all of the rows, and then we've got a comma, and this is the column selector.
1:29 So these are the columns that we're taking. So let's run that, and you can see that we've limited it to 17 columns.
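To make the tuple point concrete, here is a small sketch with made-up column names. Writing a colon and a comma inside the square brackets hands .loc a tuple of a row selector and a column selector, which can also be spelled out explicitly.

```python
import pandas as pd

# Toy frame, invented for this sketch.
df = pd.DataFrame({"id": [1, 2], "price": [100, 150], "scratch": ["x", "y"]})

cols = ["id", "price"]

# Usual spelling: the colon selects all rows, the list selects the columns.
a = df.loc[:, cols]

# The same index written out as an explicit tuple:
# slice(None) is what a bare colon means inside square brackets.
b = df.loc[(slice(None), cols)]
```

Both spellings return the same two-column frame, and the `scratch` column is left out.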
1:36 So this might seem like a small thing, but this is one of those practices that I've found is actually very useful.
1:43 You want to be flexible in what you receive, but you want to be strict in what you output. So we can use that loc just to add a hint to the end user.
1:52 These are the columns that are coming out of this. This makes it really easy when you come to the code. You want to know what's coming out of it.
1:58 You can see it at that last step of the chain. So I recommend, especially if you're doing long pipelines or you're doing things for machine learning,
2:07 as a last step, just put in that loc and make sure that you are explicit about what columns come out of that.
2:13 Especially if your data is changing in the future and new columns come in, you want to be explicit about what columns come out.
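A minimal sketch of that idea, with hypothetical names throughout (`OUTPUT_COLUMNS`, `build`, and the toy frames are not from the lecture): the chain ends with .loc and an explicit column list, so a new column showing up in the input does not change what comes out.

```python
import pandas as pd

# The explicit output contract: only these columns leave the pipeline.
OUTPUT_COLUMNS = ["id", "date", "price", "temp_f"]


def build(listings, temps):
    return (
        listings
        .merge(temps, on="date", how="left")  # join key is an assumption
        .loc[:, OUTPUT_COLUMNS]               # last step: be strict about output
    )


listings = pd.DataFrame(
    {"id": [1, 2], "date": ["2023-01-01", "2023-01-02"], "price": [100, 150]}
)
temps = pd.DataFrame(
    {"date": ["2023-01-01", "2023-01-02"], "temp_f": [41.0, 38.5]}
)

out = build(listings, temps)

# Later the upstream data gains a column; the output stays the same shape.
temps_v2 = temps.assign(humidity=[0.5, 0.6])
out_v2 = build(listings, temps_v2)
```

In recent pandas versions, selecting a listed column that is missing raises a KeyError, so the same final step also flags an expected column that has disappeared.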

