Move from Excel to Python with Pandas Transcripts
Chapter: Aggregating, grouping, and merging
Lecture: Examples of aggregation in pandas
0:00 Now let's go through some examples of how to aggregate in group data in Panis.
0:05 So Ivory started this notebook to read in my sample sales data and take a look
0:11 at the top five rows, the data like we have been doing in the past
0:15 So now let's try and do a simple aggregation of the Price column.
0:23 If we use the ag function and call that on the price calm,
0:28 we can pass in a list of one or more functions to apply to that column
0:33 and get the results. So in this example,
0:35 I did the mean function, which tells us that we have an average price of
0:39 22.8. I can also do something like dinner deviation,
0:44 or I can do men Max,
0:47 and there are many different aggregations you can apply.
0:50 But this basic feature that we use on a column we can also use when we
0:56 group data, which I will show in a second.
0:59 One interesting aspect about aggregations is that you can run it on an entire data frame
1:05 as well, so we can say and pendants will then run a mean in a
1:14 max on all of the columns that it can,
1:17 and so you'll notice that it will run the functions on day columns,
1:22 object columns, numeric columns and in those places where it can actually do a compass
1:27 Correct calculation. It returns this in a N,
1:31 which means not a number function.
1:33 We can define a dictionary that has the key as each column name and then a
1:38 list of all the aggregation functions we want to apply on those columns.
1:42 So let's go ahead. And to that,
1:43 just a show what I'm talking about.
1:47 So if we want to get the average quantity,
1:51 or maybe let's just let's do the total 20 and then for price,
2:00 we can pass our standard deviation in the mean we can all student invoice and maybe
2:06 for this one I want to count.
2:10 So this dictionary says for each of these columns,
2:13 we want to perform these operations on those columns,
2:18 and then if we want to actually apply it,
2:20 we two DF AG and add the AG columns to that so you can see here
2:26 that it went through each column and it did various functions for each of the calm
2:32 So I said, Let's get us some of the quantity,
2:36 the mean and standard deviation price,
2:38 the count of the invoice and the some of the extended amount.
2:42 And we have those in a in values there because pandas is constructing a single table
2:47 or data frame and doesn't know what fill it in.
2:50 So it will just place the in A in there.
2:53 If you want to fix that so that you get a zero,
2:56 we use fill in A. And that function will fill all of your in a
3:01 N values with zero so that we've talked about aggregating.
3:05 We're gonna talk about grouping because they really go together.
3:09 While you can aggregate on a data frame by itself most of the time you're going
3:13 to group first. So let's do a quick example.
3:16 Room in group. Buy everything in the product column,
3:19 and then we tell it the some.
3:20 So what this does is a groups,
3:23 all the products. So we have four products that has grouped together,
3:26 and it's summed all the numerical.
3:29 Um, so we know the total quantity,
3:31 the total price and the extended amount.
3:33 But you think about this. The some of the price really doesn't make a whole
3:38 lot of sense. So maybe what we really want to dio is just get the
3:42 quantity was copied. That and what we can do is tell it.
3:49 So now we've said group byproduct,
3:52 but just run the some operation on the quantity column,
3:56 and this functionality works. But it's not the preferred way that I like to do
4:01 aggregations in pandas. And there's the structure that makes the most sense for dealing with
4:08 more complex grouping and aggregations. So I define a prod calls.
4:13 So I say in quantity column what some that.
4:17 So if you followed what I did on the aggregation,
4:19 I'm going down that same path.
4:21 So now I say, group by product and now I ag broad calls.
4:28 So, combining what we've talked about earlier with aggregations and group buys,
4:33 we defined the columns that we want to perform mathematical operations on.
4:37 In this case, we want to some all the values in the quantity.
4:41 We want a group by product and therefore we have be counts or our books,
4:46 pins and posters, which, if you recall,
4:49 is very similar to what we did with be some ifs function in excel.
4:54 So the groupie and aggregation that you've been doing an Excel.
4:57 This is just a different way of doing it,
5:00 and it's really more powerful because you have a lot of flexibility and a very small
5:05 subset of code to do at a lot of summary information and grouping on your data
5:10 For instance, copy and paste this,
5:13 but let's say we don't want just do the the some.
5:18 Maybe we also want to see what the average size of each transaction is.
5:23 This starts toe give you a feel for how quickly you can iterated through analysing your
5:28 data. I just adding in the mean function,
5:31 you start to get mawr information about what the purchase patterns look like,
5:36 and we can go back in and out of state or deviation.
5:40 Um, maybe we want to see what's the max.
5:43 Someone is purchased, and this gives us a really quick way to reiterate through our
5:47 data and keep looping back and making adjustments and figuring out what insights we can get
5:54 from our data. And we can also do our group by on multiple columns.
6:00 So right now, up until now,
6:02 we just grew by product. But what if we want to see company in products
6:08 Now we're gonna get more results.
6:10 So we're gonna see for each company what products they purchased.
6:13 The total amount of the average Mount Sandy eight deviation,
6:17 the max in the men. And once again,
6:18 we have our Entei ends. And what that's telling us is there are some negative
6:23 values or some other values in there that are causing the standard deviation to not calculate
6:29 correctly. So if we want to,
6:32 if we don't like, be in ends,
6:33 we can fill in a was zero to turn those into zeros.
6:37 And one of the things you likely noticed is that once we do these aggregations,
6:45 our indices and our columns look a little bit different.
6:49 And what what Pandas has done is put together a multi index force.
6:55 I'm not gonna go through the details of what a multi indexes or why you would
6:59 want to use it. But I am going to introduce this concept of doing reset
7:04 index. And once you call that you can see now that instead of having company
7:11 and then product and book pin poster group together,
7:15 it's reset so that there's one row for each value.
7:20 And I highlight that because sometimes some of the future summaries and analyses that you want
7:25 to do are much easier when you've reset the index this way.
7:29 So I want to introduce that concept to you.
7:31 Now, for the final example,
7:34 we're gonna introduce a concept called named Aggregations.
7:37 So probably the best way is just Ah,
7:39 walk through this and then I'll explain what I'm doing.
7:57 Okay, so let me walk through this what we've done with our named aggregation here
8:01 So we want to count the number of invoices that each company company has.
8:07 And then what is there? Max Purchase?
8:10 And instead of just leaving the columns to the default naming convention based on the extended
8:18 mountain invoice actually defined the column name.
8:21 So this says that the invoice total should be the name of the column,
8:24 and Max purchase should be the name of the call.
8:27 Now, this is really helpful as you start developing these aggregations and want to present
8:33 the results. Other people you want to make sure that the column names are consistent