Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 3: Merging AirBnB Temperature Data
Lecture: Validating one to one and one to many merges
Login or
purchase this course
to watch this video and the rest of the course contents.
0:00
One of the things you might want to do with merged data is to validate it and see if it was successful.
0:06
This is a challenge with SQL when you try and merge something. Oftentimes it's unclear, did the join work? What's going on there?
0:15
Pandas has some functionality to make this a little easier. Okay, so I'm going to demonstrate this with just a different dataset here.
0:22
I've got the left and the right dataset. In this cell, I'm showing the display function which Jupyter provides you.
0:28
This allows us just to output multiple things in a single cell. So we've got this data frame left that has names and pets,
0:37
and we have a data frame right that has names and ages. So let's look at the different merges that we can do. We can do an inner merge.
0:48
And so here is an inner merge. You can see that Ravi and Jose are in both of my datasets, but Sally is only in one.
0:58
So the inner merge doesn't include Sally. I can do a left merge, which is going to say, take everyone from the left data frame.
1:06
In this case, it should have Ravi and Jose in it. I can do a right merge, which should include Sally.
1:13
And we can do an outer, which is going to include everyone from both. And we can also do a cross merge.
1:20
And a cross merge is something you need to be careful with. Basically, it's going to say for every row in one data frame,
1:28
merge it with every row in the other data frame. So you can see we have for the first Ravi from left, we have Ravi, Jose, and Sally.
1:38
From the right-hand side, for every Jose in the left, we have Ravi, Jose, and Sally from the other side.
1:46
One of the super convenient parameters that comes with this is the validate parameter.
1:51
And here I'm going to say I want to validate that this is a one-to-one merge. And when I run this, I get an error.
1:58
And it says that the merge keys are not unique in the left data set. And if we look at that left data set, again, it looks like this.
2:09
And you can see that Jose is repeated there. So it is impossible to do a one-to-one merge here because Jose is repeated.
2:16
So if we want to do this, what it's really going to be is a many-to-one merge, and we can specify that with an m-to-one.
2:22
And you can see that that comes out without a complaint. Pandas provides a nice functionality in this validate parameter.
2:31
And you can specify 1,1 for a one-to-one merge, m,1 for many-to-one, or 1,m, or an m,m. Note that validation on m,m really doesn't exist,
2:45
but you can put it in there as a hint to your reader about what kind of merge you're expecting to perform.