Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 3: Merging AirBnB Temperature Data
Lecture: Debugging Merging by piping dataframe size

0:00 In this part, I want to show you a really cool debugging trick that I think is super useful in Pandas,
0:07 especially when you start building up these chains. One of the methods in Pandas is the pipe method.
0:13 And what the pipe method allows you to do is pass in an arbitrary Python function,
0:19 and Pandas is going to call that Python function passing in the current state of the data frame as the first parameter.
0:25 In this case, I've defined a function called limit, and it has two other parameters, n rows and n columns,
0:31 and it's just going to do some slicing using iloc to do that.
0:35 So not particularly interesting, other than we are leveraging that with pipe instead of calling iloc directly there.
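The code shown on screen is not part of this transcript, so what follows is only a minimal sketch of what a limit helper used with pipe might look like. The parameter names n_rows and n_cols, their defaults, and the small sample DataFrame are assumptions made for illustration.

```python
import pandas as pd


def limit(df, n_rows=5, n_cols=3):
    """Return the first n_rows rows and n_cols columns via positional slicing."""
    return df.iloc[:n_rows, :n_cols]


# Hypothetical sample data, standing in for the course dataset.
df = pd.DataFrame({
    "id": range(10),
    "price": [100 + i for i in range(10)],
    "nights": [1] * 10,
    "reviews": [0] * 10,
})

# pipe passes the current DataFrame as the first argument to limit,
# and forwards the keyword arguments unchanged.
small = df.pipe(limit, n_rows=4, n_cols=2)
print(small.shape)
```

Calling df.pipe(limit, n_rows=4, n_cols=2) is equivalent to limit(df, n_rows=4, n_cols=2); the benefit is that it fits inside a method chain.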
0:43 Now, let me show another way of using pipe. I like to make these little debug helpers here. So I've got a function up here called debug.
0:51 Note that the first parameter is df, and then I have an optional extra there. And here's my chain down here. Note that I've got merge in there,
0:59 but before I do the merge, I'm going to do this debug call, and I'm going to say extra is before,
1:04 and then after I do the merge, I'm going to say extra is after. Let's run that, and you can see that it outputs the before shape and the after shape.
1:14 So before we had 16 columns, after we have 19 columns. You can see that the number of rows does not change.
1:21 This is super useful if you're doing complicated merges, and you can see what happens to the number of rows and columns when you're doing that.
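A rough sketch of the debug helper described above, assuming it simply prints the shape and hands the DataFrame back. The column names and the two small tables are hypothetical stand-ins for the listings and temperature data.

```python
import pandas as pd


def debug(df, extra=""):
    """Print a label and the current shape, then return df unchanged."""
    print(f"{extra} shape: {df.shape}")
    return df  # returning df is what lets the chain continue


# Hypothetical stand-ins for the listings and temperature tables.
listings = pd.DataFrame({
    "listing_id": [1, 2, 3],
    "month": [1, 1, 2],
    "price": [100, 150, 80],
})
temps = pd.DataFrame({
    "month": [1, 2],
    "avg_temp": [30.5, 33.1],
    "max_temp": [38.0, 41.2],
})

merged = (
    listings
    .pipe(debug, extra="Before")
    .merge(temps, on="month", how="left")
    .pipe(debug, extra="After")
)
```

In this sketch the row count stays the same and the column count grows by the two temperature columns. If the row count had grown, that would point to duplicate keys on the right side of the merge.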
1:29 So let's do a little bit more here. I'm going to do a merge, and then after that, I'm going to group by the neighborhood
1:35 and get the mean values of all the neighborhoods. Okay, and it looks like that failed. It did not work. Let's scroll up and look at the output here.
1:43 We do see the before and after, but we see a complaint here, and it's complaining about this mean here.
1:49 It says that you can't take the mean of non-numeric columns, such as string columns.
1:58 So in Pandas 2, if we want to do this mean, we need to pass numeric_only=True.
2:03 So let's run that again here, and now we can see that we have the before, after,
2:08 and then we have the summary there, and you can see that the shape does change. The shape had 48,000 rows before, and after we do the group by,
2:17 there's only five neighborhood groups, so we have five rows after doing that, and we only have 13 numeric columns.
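A minimal sketch of the group by step with the numeric_only fix, again on hypothetical data. The column name neighbourhood_group and the sample values are assumptions; the debug helper is the same shape-printing function sketched from the description earlier in the lecture.

```python
import pandas as pd


def debug(df, extra=""):
    """Print a label and the current shape, then return df unchanged."""
    print(f"{extra} shape: {df.shape}")
    return df


# Hypothetical merged data with one string column besides the group key.
df = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Queens"],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "price": [100, 200, 80],
    "avg_temp": [30.0, 32.0, 31.0],
})

summary = (
    df
    .pipe(debug, extra="Before")
    .groupby("neighbourhood_group")
    # numeric_only=True skips string columns such as room_type
    # instead of failing on them.
    .mean(numeric_only=True)
    .pipe(debug, extra="Summary")
)
print(summary)
```

The summary has one row per group and only the numeric columns, which is why both dimensions of the shape change after the group by.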
2:24 This pipe trick is super useful. I use it all the time. So in this case, I'm using it to debug the shape, but you can use it for other things as well.
2:34 It's limited by what you can stick in a function.
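As one illustration of using pipe for something other than shape, here is a hedged sketch of a helper that counts missing values around a merge. The report_missing function and the sample data are hypothetical and do not come from the course.

```python
import pandas as pd


def report_missing(df, extra=""):
    """Print the total number of missing values, then return df unchanged."""
    n_missing = int(df.isna().sum().sum())
    print(f"{extra} missing values: {n_missing}")
    return df


# Hypothetical data: month 3 has no matching temperature row.
listings = pd.DataFrame({"month": [1, 2, 3], "price": [100, 150, 80]})
temps = pd.DataFrame({"month": [1, 2], "avg_temp": [30.5, 33.1]})

merged = (
    listings
    .pipe(report_missing, extra="Before")
    .merge(temps, on="month", how="left")
    .pipe(report_missing, extra="After")
)
```

A jump in missing values after a left merge suggests that some keys on the left had no match on the right.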

