Move from Excel to Python with Pandas Transcripts
Chapter: Aggregating, grouping, and merging
Lecture: Examples of aggregation in pandas
0:00 Now let's go through some examples of how to aggregate in group data in Panis.
0:06 So Ivory started this notebook to read in my sample sales data and take a look at the top five rows, the data like we have been doing in the past
0:16 So now let's try and do a simple aggregation of the Price column. If we use the ag function and call that on the price calm,
0:29 we can pass in a list of one or more functions to apply to that column and get the results. So in this example,
0:36 I did the mean function, which tells us that we have an average price of 22.8. I can also do something like dinner deviation, or I can do men Max,
0:48 and there are many different aggregations you can apply. But this basic feature that we use on a column we can also use when we
0:57 group data, which I will show in a second. One interesting aspect about aggregations is that you can run it on an entire data frame
1:06 as well, so we can say and pendants will then run a mean in a max on all of the columns that it can,
1:18 and so you'll notice that it will run the functions on day columns,
1:23 object columns, numeric columns and in those places where it can actually do a compass Correct calculation. It returns this in a N,
1:32 which means not a number function. We can define a dictionary that has the key as each column name and then a
1:39 list of all the aggregation functions we want to apply on those columns. So let's go ahead. And to that, just a show what I'm talking about.
1:48 So if we want to get the average quantity, or maybe let's just let's do the total 20 and then for price,
2:01 we can pass our standard deviation in the mean we can all student invoice and maybe for this one I want to count.
2:11 So this dictionary says for each of these columns, we want to perform these operations on those columns, and then if we want to actually apply it,
2:21 we two DF AG and add the AG columns to that so you can see here that it went through each column and it did various functions for each of the calm
2:33 So I said, Let's get us some of the quantity, the mean and standard deviation price, the count of the invoice and the some of the extended amount.
2:43 And we have those in a in values there because pandas is constructing a single table or data frame and doesn't know what fill it in.
2:51 So it will just place the in A in there. If you want to fix that so that you get a zero, we use fill in A. And that function will fill all of your in a
3:02 N values with zero so that we've talked about aggregating. We're gonna talk about grouping because they really go together.
3:10 While you can aggregate on a data frame by itself most of the time you're going to group first. So let's do a quick example.
3:17 Room in group. Buy everything in the product column, and then we tell it the some. So what this does is a groups,
3:24 all the products. So we have four products that has grouped together, and it's summed all the numerical. Um, so we know the total quantity,
3:32 the total price and the extended amount. But you think about this. The some of the price really doesn't make a whole
3:39 lot of sense. So maybe what we really want to dio is just get the quantity was copied. That and what we can do is tell it.
3:50 So now we've said group byproduct, but just run the some operation on the quantity column,
3:57 and this functionality works. But it's not the preferred way that I like to do
4:02 aggregations in pandas. And there's the structure that makes the most sense for dealing with
4:09 more complex grouping and aggregations. So I define a prod calls. So I say in quantity column what some that.
4:18 So if you followed what I did on the aggregation, I'm going down that same path. So now I say, group by product and now I ag broad calls.
4:29 So, combining what we've talked about earlier with aggregations and group buys,
4:34 we defined the columns that we want to perform mathematical operations on. In this case, we want to some all the values in the quantity.
4:42 We want a group by product and therefore we have be counts or our books, pins and posters, which, if you recall,
4:50 is very similar to what we did with be some ifs function in excel. So the groupie and aggregation that you've been doing an Excel.
4:58 This is just a different way of doing it, and it's really more powerful because you have a lot of flexibility and a very small
5:06 subset of code to do at a lot of summary information and grouping on your data For instance, copy and paste this,
5:14 but let's say we don't want just do the the some. Maybe we also want to see what the average size of each transaction is.
5:24 This starts toe give you a feel for how quickly you can iterated through analysing your data. I just adding in the mean function,
5:32 you start to get mawr information about what the purchase patterns look like, and we can go back in and out of state or deviation.
5:41 Um, maybe we want to see what's the max. Someone is purchased, and this gives us a really quick way to reiterate through our
5:48 data and keep looping back and making adjustments and figuring out what insights we can get
5:55 from our data. And we can also do our group by on multiple columns. So right now, up until now,
6:03 we just grew by product. But what if we want to see company in products Now we're gonna get more results.
6:11 So we're gonna see for each company what products they purchased. The total amount of the average Mount Sandy eight deviation,
6:18 the max in the men. And once again, we have our Entei ends. And what that's telling us is there are some negative
6:24 values or some other values in there that are causing the standard deviation to not calculate correctly. So if we want to,
6:33 if we don't like, be in ends, we can fill in a was zero to turn those into zeros.
6:38 And one of the things you likely noticed is that once we do these aggregations, our indices and our columns look a little bit different.
6:50 And what what Pandas has done is put together a multi index force. I'm not gonna go through the details of what a multi indexes or why you would
7:00 want to use it. But I am going to introduce this concept of doing reset index. And once you call that you can see now that instead of having company
7:12 and then product and book pin poster group together, it's reset so that there's one row for each value.
7:21 And I highlight that because sometimes some of the future summaries and analyses that you want
7:26 to do are much easier when you've reset the index this way. So I want to introduce that concept to you. Now, for the final example,
7:35 we're gonna introduce a concept called named Aggregations. So probably the best way is just Ah, walk through this and then I'll explain what I'm doing.
7:58 Okay, so let me walk through this what we've done with our named aggregation here
8:02 So we want to count the number of invoices that each company company has. And then what is there? Max Purchase?
8:11 And instead of just leaving the columns to the default naming convention based on the extended mountain invoice actually defined the column name.
8:22 So this says that the invoice total should be the name of the column, and Max purchase should be the name of the call.
8:28 Now, this is really helpful as you start developing these aggregations and want to present
8:34 the results. Other people you want to make sure that the column names are consistent