Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 4: Understanding Grouping and Aggregation Retail Data
Lecture: Aggregating in Pandas to Calculate Sales by Year

Login or purchase this course to watch this video and the rest of the course contents.
0:00 Okay, so let's start looking at aggregation. Let's assume that your boss
0:04 says I want to look at cells by year. What does that even mean? Well, it means that we
0:08 want to group things by year and then summarize the cells for each of those years. Turns out that this is relatively easy to do in Pandas. Once you
0:18 understand the syntax, Pandas will allow you to do these group by operations with
0:23 relative ease. The first thing I'm going to want to do here is make a total value
0:28 here. So I'm going to take my quantity and multiply it by the unit price and then I
0:33 have a total value over there. Next thing I want to do is get a year entry. Now I
0:40 don't have a year entry right now, so I'm going to want to make a new column that
0:45 has a year entry. How can I do that? Well, I have this invoice date that is a date
0:51 object. If you look at the D type of it, so because it is a date object, it has
0:57 this DT accessor which allows us to access various attributes of that. This
1:02 is really cool. Once you've converted something to a date, you can do various
1:05 operations with that. So in this case, we want to pull off the year and you can
1:09 see that I can make a year column over there. Now at that point, once I have the
1:14 year column, I can say I want to group by year. Now Pandas is lazy when you do a
1:18 group by. It doesn't do anything until you summarize that. So what do we want to
1:21 do? Let's just say let's take the sum of every numeric column for each year.
1:27 When we do that, we get our friend not implemented error. This is an error that
1:33 we see often in Pandas 2 because we're using the arrow back end, but it's
1:38 complaining about strings not supporting sum. So how do we get around that? We're
1:43 going to say numeric only is true. Here is our summarization by year. You can see that for the year 2010, we have a total of 748,000 pounds in
1:56 this is Europe or in Great Britain. For 2011, it looks like we have almost 9 million. Now one of the nice things I can do here once I have this
2:06 little chain here, so this is what I had before, I can say let's pull off that
2:09 total column. There's a series and now let's do a bar plot of that. We can
2:13 visualize that relatively easily. So there is our visualization of cells from
2:18 2010 to 2011. In this example, I gave a walkthrough of how we would summarize by
2:23 year. First of all, we want to figure out what we want to summarize by and we
2:28 might need to make a column to summarize by that column. Once you've made a column
2:32 with what you want to summarize by, you can say group by with that column in it
2:36 and then you need to provide the aggregation. In our case, the aggregation was the sum and we summed up everything for that year column.


Talk Python's Mastodon Michael Kennedy's Mastodon