Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 4: Understanding Grouping and Aggregation Retail Data
Lecture: Grouping by Month and Country and Visualizing with a Line Plot
0:00
Okay, let's do something a little bit more complicated. Your boss is like, I really like what you do. Let's look at sales by month,
0:09
by your top N countries. Okay, so let's walk through this. We're actually grouping by two things now. We want month and countries.
0:19
So I'm going to say let's group by, and I'm going to use pd.Grouper, and I'm also going to pass country in there. So I'm passing in a list,
0:27
and we're going to group by the day frequency, and then country there. And that's lazy; it doesn't do anything until we aggregate.
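A minimal sketch of the grouping just described. The tiny DataFrame and the column names InvoiceDate, Country, and Total are assumptions standing in for the real retail data.

```python
import pandas as pd

# Made-up stand-in for the retail data; column names are assumptions.
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2011-01-03 09:00", "2011-01-03 14:30",
        "2011-01-04 10:15", "2011-01-04 11:00",
    ]),
    "Country": ["United Kingdom", "Spain", "United Kingdom", "United Kingdom"],
    "Total": [10.0, 5.0, 7.5, 2.5],
})

# Pass a list to groupby: a daily time bucket (pd.Grouper) plus the country.
grouped = df.groupby([pd.Grouper(key="InvoiceDate", freq="D"), "Country"])

# Lazy: this is only a groupby object, nothing has been totaled yet.
print(type(grouped).__name__)
```

Nothing is computed until an aggregation such as sum is called on the grouped object.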
0:36
Now, if you look at this, we have the end of each day, and for each day we have each country. And then if we get the total there,
0:45
we get something that looks like this. This is kind of weird looking. Remember I said that a series is one dimension.
0:52
This does not look like one dimension; it's starting to look like a data frame. However, it is in a monospace font. This is a series. What's going on here?
0:59
On the left-hand side, we've got invoice date and country. Both of those are actually the index.
1:06
The value of the series is a number on the right-hand side. So we call this a multi-index or hierarchical index.
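To make the multi-index concrete, here is a small sketch on made-up data. The column names InvoiceDate, Country, and Total are assumptions, not necessarily the names in the course notebook.

```python
import pandas as pd

# Made-up stand-in for the retail data; column names are assumptions.
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2011-01-03 09:00", "2011-01-03 14:30",
        "2011-01-04 10:15", "2011-01-04 11:00",
    ]),
    "Country": ["United Kingdom", "Spain", "United Kingdom", "United Kingdom"],
    "Total": [10.0, 5.0, 7.5, 2.5],
})

# Grouping by two keys and summing one column gives a Series...
daily = df.groupby([pd.Grouper(key="InvoiceDate", freq="D"), "Country"])["Total"].sum()
print(type(daily).__name__)

# ...whose index has two levels: the date and the country.
print(daily.index.nlevels)
print(list(daily.index.names))

# A single value is addressed by a tuple with one entry per level.
spain_total = daily.loc[(pd.Timestamp("2011-01-03"), "Spain")]
print(spain_total)
```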
1:12
This is what happens when we group by two or more items: we get a hierarchical index. These can be somewhat of a challenge to work with,
1:20
so let me show you some best practices for that. So one of the things I like to do with this is to unstack this. So here's our data,
1:27
and let me show you what unstack does. What unstack is going to do is it's going to take the innermost index and rotate it up into the columns.
1:35
So watch what happens here. Okay, this is what we had before. We're going to take country and rotate it up. And there we go.
1:44
Now in the index, we have invoice date. Now we have a data frame, and in the columns, we have all of the countries.
1:50
Now there are a lot of missing values. We see a lot of those NaNs because this is sparse data. Well, what we can do is we can fill those values
1:59
with zero, and we get something that looks like this. Okay, so 38 columns, 305 rows. One of the cool things that we can do now is we can plot that.
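The unstack and fill steps can be sketched like this on a tiny made-up dataset (the column names are assumptions). The real data gives 38 columns and 305 rows; this toy version gives a 2 by 2 frame.

```python
import pandas as pd

# Made-up stand-in for the retail data; column names are assumptions.
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2011-01-03 09:00", "2011-01-03 14:30",
        "2011-01-04 10:15", "2011-01-04 11:00",
    ]),
    "Country": ["United Kingdom", "Spain", "United Kingdom", "United Kingdom"],
    "Total": [10.0, 5.0, 7.5, 2.5],
})

daily = df.groupby([pd.Grouper(key="InvoiceDate", freq="D"), "Country"])["Total"].sum()

# unstack rotates the innermost index level (Country) up into the columns.
wide = daily.unstack()
print(wide)      # Spain has no sales on the second day, so that cell is NaN

# Sparse data: replace the missing day/country combinations with zero.
filled = wide.fillna(0)
print(filled)
```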
2:09
So there's what's going on there. Is this the world's greatest plot? No, not necessarily. People would call this a spaghetti plot.
2:16
There's a lot of data that we're plotting here. You can see this pink line here. That's probably United Kingdom, which is the majority of our data.
2:24
So this plot is not particularly useful. So what can we do? Let's remove United Kingdom. Okay, now it's looking like just a bunch of colors there.
2:36
Let's move the legend to the side. So I'm going to put in this line right here. This is going to push the legend off to the side.
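A sketch of those two tweaks, dropping the dominant country and moving the legend, on a small made-up wide table. The numbers and the use of bbox_to_anchor are assumptions about how the notebook does it.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; drop this line in a notebook
import pandas as pd

# Made-up wide table: one row per day, one column per country.
dates = pd.date_range("2011-01-03", periods=3, freq="D")
wide = pd.DataFrame(
    {
        "Finland": [0.0, 3.0, 1.0],
        "Spain": [5.0, 0.0, 4.0],
        "United Kingdom": [90.0, 80.0, 95.0],
    },
    index=dates,
)

# Remove the country that dwarfs everything else, then plot the rest.
ax = wide.drop(columns="United Kingdom").plot()

# Anchor the legend at the top right corner of the axes, pushing it off to the side.
ax.legend(bbox_to_anchor=(1, 1))

labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(labels)
```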
2:43
Okay, there's still a lot of colors. This is hard to see. Here's a technique that visualization experts use. If you want to emphasize a single thing,
2:52
color it and make everything else gray. So let's see if we can apply that here. In this case, I'm going to emphasize Spain.
2:59
Everything here is similar to what we had before, but I'm going to say, let's pipe this through set_colors here before we call plot.
3:07
So let's look at what set_colors does. You can see there's a global variable called colors, which is a list.
3:12
I get the data frame as the first parameter. I get Spain as the country, and I'm going to loop through my columns.
3:19
And if my column is not equal to that, I'm going to append the normal, which is this gray color. After I do that, at the very end,
3:27
I'm going to append the highlight color, which is that red color. And I'm going to append my country to my columns here.
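A function along the lines just described might look like the sketch below. The hex color values and the parameter names are assumptions. Moving the highlighted country to the last column matters because the last column is drawn on top.

```python
import pandas as pd

colors = []  # global list, filled by set_colors and read later by the plotting step


def set_colors(df, country, normal="#999999", highlight="#990000"):
    """Build the global color list and move `country` to the last column."""
    colors.clear()
    cols = []
    for col in df.columns:
        if col != country:
            colors.append(normal)   # gray for everything we are not emphasizing
            cols.append(col)
    colors.append(highlight)        # the highlight color goes last...
    cols.append(country)            # ...lined up with the highlighted country
    return df[cols]


# Made-up wide table with one column per country.
wide = pd.DataFrame({"Spain": [5.0, 4.0], "Finland": [3.0, 1.0], "France": [2.0, 6.0]})
reordered = set_colors(wide, "Spain")
print(list(reordered.columns))
print(colors)
```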
3:36
I'm actually reordering the data frame and making a global colors list that's aligned with the columns there. Why am I reordering the data frame?
3:44
Because the last column is going to be plotted on top, and I want that to be the colored column. Okay, so once I've done that,
3:51
it actually is not going to do probably what we think it will. So it doesn't spit anything out. Why doesn't it spit anything out?
3:59
Because we're assigning to ax here. So if I spit out ax, here is ax. So this is the data. We want to plot this. So let's call this piped plot,
4:12
and let's see what happens when we do that. So we're going to take a data frame here, and we're going to say, plot the data frame,
4:19
and we're going to use colors, which is this global variable. And we're going to add a title here. This is going to return a Matplotlib axis.
4:26
That's the ax here. I'm going to, with that Matplotlib axis, move the legend to the side.
4:32
So (1, 1) means move it to the right-hand side at the top. And then I'm going to make it have two columns.
4:37
I'm also going to set the y label to US dollars, and I'm going to return the data frame as the output of that. So this is saying ax here.
4:45
This shouldn't say ax. It should be named like some other data frame. We'll just call that final here. But if we look at the output down here,
4:54
we can see that here is our plot. So here is Spain in red, and we can clearly see what's going on with Spain.
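Putting the pieces together, the piped chain could be sketched as follows. The function name plot_highlighted, the title text, the color values, and the data are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

colors = []  # global list shared between set_colors and plot_highlighted


def set_colors(df, country, normal="#999999", highlight="#990000"):
    """Gray for every column except `country`, which is moved last and highlighted."""
    colors.clear()
    cols = [col for col in df.columns if col != country]
    colors.extend([normal] * len(cols))
    colors.append(highlight)
    return df[cols + [country]]


def plot_highlighted(df, title):
    """Plot with the global colors, tidy the axis, and hand the data frame back."""
    ax = df.plot(color=colors, title=title)   # df.plot returns a Matplotlib axis
    ax.legend(bbox_to_anchor=(1, 1), ncol=2)  # top right, legend in two columns
    ax.set_ylabel("US Dollars")
    return df


# Made-up wide table: one row per day, one column per country.
dates = pd.date_range("2011-01-03", periods=3, freq="D")
wide = pd.DataFrame(
    {"Spain": [5.0, 0.0, 4.0], "Finland": [0.0, 3.0, 1.0], "United Kingdom": [90.0, 80.0, 95.0]},
    index=dates,
)

# The chain returns a data frame, so the result is named `final` rather than `ax`.
final = (
    wide
    .drop(columns="United Kingdom")
    .pipe(set_colors, "Spain")
    .pipe(plot_highlighted, title="Daily sales, Spain highlighted")
)
print(list(final.columns))
```

Changing "Spain" to "Finland" in the set_colors step is enough to switch the emphasis to a different country.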
5:03
Again, the other columns are not super bright, but we're focusing on Spain. If we wanted to focus, let's say on Finland,
5:12
we could just come in here and change this to Finland. And that's what's going on with Finland. We just showed you how to do a more complicated
5:22
group by, grouping by two things. We showed how to do unstack to rearrange the data, and then we showed a visualization technique
5:28
to actually draw attention to what we want to focus on.