Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 4: Understanding Grouping and Aggregation Retail Data
Lecture: Using Groupby in Pandas to visualize Sales by country

Login or purchase this course to watch this video and the rest of the course contents.
0:00 Let's assume that your boss comes in and says, thanks, I also want to see summaries by country. How would we do that?
0:08 I've got a chain here that does it. Again, I'm going to show you how I'd build this up. First of all, we would want to get our totals and our years.
0:18 Now I'm going to say, let's group by country. We already have that column in there. When we do that, it is lazy.
0:24 It's not going to do anything until we tell it to aggregate. In this case, we'll say, sum the numeric values.
0:29 And now we can see for each country, those are the sums. I'm going to pull off the total here. And we have that. Now I'm going to visualize that.
0:40 We can do a bar plot. And this lets us clearly see what that looks like. Now let's assume that we want to look at everything but the UK.
0:52 The UK is blowing everything else out. So how would we do that? Well, the first thing that we could do is we could put a query in here and say,
0:59 I want to have all the rows without UK. And again, I can come in here and comment this out and sort of walk through this chain here, what's going on.
1:10 So this is our data without the UK. And you can see that instead of 500,000 rows, we only have 46,000 rows now. And then we'll add our columns.
1:20 And then we'll do our group by. And that's lazy. Let's take the sum. There we go. Let's pull off the total. OK, and let's plot that.
1:29 Remember, a bar plot is going to take the index and put in the x-axis and then plot a bar for each of those values.
1:36 OK, let's sort this and change it to horizontal. So I'm not going to walk through all of this. But here's what we had before.
1:48 If we want to sort that, we say sort values. If we want to do a horizontal bar plot, we say bar h instead. We might want to change the figure size
1:58 to make it so these aren't overlapping. We could make this a little bit taller if we really cared about that.
2:03 Let me show you another technique that we do. And instead of doing that, we might want to say, OK, let's look at the first n values and everything else
2:10 let's put into an other bucket. So let's say here's countries. And here's the counts for those countries.
2:16 And let's say we want to look at the top 10 countries. So how could we do that? I'm going to say, OK, let's take the cells
2:25 and pull off countries and then look at the top counts for each of those countries. And then that's going to, well, let me run that here.
2:34 And if you look at top here, this is the value counts. And it is in order. If you pull off the index, those countries are in order.
2:43 So we could pull off the first 10 of those. And that would be the top 10 countries. So that's what's going on down here.
2:49 I'm saying top n, those are the top 10 countries. Now down here, I'm updating the country column. And I'm saying, OK, I want the country column
2:57 to be the country column. And then we're using this where command. This where method is a little bit weird.
3:02 But basically what you do is you pass in a Boolean operation here where this Boolean operation is true. You keep the value of this.
3:12 Where it's not true, you put in this value. So it's a little bit weird. It's like an else statement.
3:18 If it's true, keep the value else put in this other value. Let's just look at that. So you can see that we have United Kingdom and France.
3:29 We're not seeing other here because there's a lot of United Kingdom in there. If we pull off that, we're still not seeing that.
3:35 But let's do a value counts there. And now we see that other is right there. OK, this is actually super useful.
3:44 I actually make a little helper here, put this in a function so we can do something like this where we say limit n.
3:51 So you pass in a data frame into that as the first parameter. Because you have that, you can use this for pipe. But you can also use this for a sign.
4:00 And then you pass in the column that you want to limit. And I could also pass in how many values I want to limit and other as well.
4:08 But here we're going to say let's just limit country. And here's the top 20 countries with everything else in other.
4:16 So here's how I'd use this in a chain here. Let's just walk through this here. I want you to get used to understanding how these things work.
4:25 Here's our original data frame here. You can see that we have United Kingdom in the first couple rows there. I'm going to filter out United Kingdom.
4:32 So we have 541,000 rows. Now we have 46,000 rows here. The next thing I'm going to do is I'm going to make some new columns or override columns.
4:40 I'm going to update the country column. So instead of being country, I'm going to limit it to the top 20 countries. So I'm going to use a lambda here.
4:50 And inside of the lambda, I'm going to say let's call limit with dfn country. Why can't I use limit directly here?
4:57 Because if you look at my function up here, I need to specify a column. And I can't do that just by passing in the column here.
5:06 I need to basically wrap that with a lambda that dispatches to limit n with the column that I want. Let's just run that.
5:13 We should also get a total in a year column. There is a total in a year column. We're not really seeing any update here
5:18 because France was in the top 20. But we should see that down below here. Let's group by country. That's lazy.
5:26 It doesn't do anything until we do an aggregation. We'll do a sum aggregation. And if you look at this, we should see an other in here. There it is.
5:33 There is an other. Again, if I comment this out, this will be a lot longer here. So by changing that, we are limiting that to just the top 20.
5:47 And then we'll pull off the total column. Let's sort those values. And then let's plot that. In this example, we built up operations
5:57 to summarize by country. And we also showed you a nice technique to limit the number of countries, but include the extra companies in a new column.


Talk Python's Mastodon Michael Kennedy's Mastodon