Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 4: Understanding Grouping and Aggregation Retail Data
Lecture: Aggregating in Pandas to Calculate Sales by Year
Login or
purchase this course
to watch this video and the rest of the course contents.
0:00
Okay, so let's start looking at aggregation. Let's assume that your boss
0:04
says I want to look at cells by year. What does that even mean? Well, it means that we
0:08
want to group things by year and then summarize the cells for each of those years. Turns out that this is relatively easy to do in Pandas. Once you
0:18
understand the syntax, Pandas will allow you to do these group by operations with
0:23
relative ease. The first thing I'm going to want to do here is make a total value
0:28
here. So I'm going to take my quantity and multiply it by the unit price and then I
0:33
have a total value over there. Next thing I want to do is get a year entry. Now I
0:40
don't have a year entry right now, so I'm going to want to make a new column that
0:45
has a year entry. How can I do that? Well, I have this invoice date that is a date
0:51
object. If you look at the D type of it, so because it is a date object, it has
0:57
this DT accessor which allows us to access various attributes of that. This
1:02
is really cool. Once you've converted something to a date, you can do various
1:05
operations with that. So in this case, we want to pull off the year and you can
1:09
see that I can make a year column over there. Now at that point, once I have the
1:14
year column, I can say I want to group by year. Now Pandas is lazy when you do a
1:18
group by. It doesn't do anything until you summarize that. So what do we want to
1:21
do? Let's just say let's take the sum of every numeric column for each year.
1:27
When we do that, we get our friend not implemented error. This is an error that
1:33
we see often in Pandas 2 because we're using the arrow back end, but it's
1:38
complaining about strings not supporting sum. So how do we get around that? We're
1:43
going to say numeric only is true. Here is our summarization by year. You can see that for the year 2010, we have a total of 748,000 pounds in
1:56
this is Europe or in Great Britain. For 2011, it looks like we have almost 9 million. Now one of the nice things I can do here once I have this
2:06
little chain here, so this is what I had before, I can say let's pull off that
2:09
total column. There's a series and now let's do a bar plot of that. We can
2:13
visualize that relatively easily. So there is our visualization of cells from
2:18
2010 to 2011. In this example, I gave a walkthrough of how we would summarize by
2:23
year. First of all, we want to figure out what we want to summarize by and we
2:28
might need to make a column to summarize by that column. Once you've made a column
2:32
with what you want to summarize by, you can say group by with that column in it
2:36
and then you need to provide the aggregation. In our case, the aggregation was the sum and we summed up everything for that year column.