Move from Excel to Python with Pandas Transcripts
Chapter: Aggregating, grouping, and merging
Lecture: Examples of aggregation in pandas
Login or
purchase this course
to watch this video and the rest of the course contents.
0:00
Now let's go through some examples of how to aggregate in group data in Panis.
0:05
So Ivory started this notebook to read in my sample sales data and take a look
0:11
at the top five rows, the data like we have been doing in the past
0:15
So now let's try and do a simple aggregation of the Price column.
0:23
If we use the ag function and call that on the price calm,
0:28
we can pass in a list of one or more functions to apply to that column
0:33
and get the results. So in this example,
0:35
I did the mean function, which tells us that we have an average price of
0:39
22.8. I can also do something like dinner deviation,
0:44
or I can do men Max,
0:47
and there are many different aggregations you can apply.
0:50
But this basic feature that we use on a column we can also use when we
0:56
group data, which I will show in a second.
0:59
One interesting aspect about aggregations is that you can run it on an entire data frame
1:05
as well, so we can say and pendants will then run a mean in a
1:14
max on all of the columns that it can,
1:17
and so you'll notice that it will run the functions on day columns,
1:22
object columns, numeric columns and in those places where it can actually do a compass
1:27
Correct calculation. It returns this in a N,
1:31
which means not a number function.
1:33
We can define a dictionary that has the key as each column name and then a
1:38
list of all the aggregation functions we want to apply on those columns.
1:42
So let's go ahead. And to that,
1:43
just a show what I'm talking about.
1:47
So if we want to get the average quantity,
1:51
or maybe let's just let's do the total 20 and then for price,
2:00
we can pass our standard deviation in the mean we can all student invoice and maybe
2:06
for this one I want to count.
2:10
So this dictionary says for each of these columns,
2:13
we want to perform these operations on those columns,
2:18
and then if we want to actually apply it,
2:20
we two DF AG and add the AG columns to that so you can see here
2:26
that it went through each column and it did various functions for each of the calm
2:32
So I said, Let's get us some of the quantity,
2:36
the mean and standard deviation price,
2:38
the count of the invoice and the some of the extended amount.
2:42
And we have those in a in values there because pandas is constructing a single table
2:47
or data frame and doesn't know what fill it in.
2:50
So it will just place the in A in there.
2:53
If you want to fix that so that you get a zero,
2:56
we use fill in A. And that function will fill all of your in a
3:01
N values with zero so that we've talked about aggregating.
3:05
We're gonna talk about grouping because they really go together.
3:09
While you can aggregate on a data frame by itself most of the time you're going
3:13
to group first. So let's do a quick example.
3:16
Room in group. Buy everything in the product column,
3:19
and then we tell it the some.
3:20
So what this does is a groups,
3:23
all the products. So we have four products that has grouped together,
3:26
and it's summed all the numerical.
3:29
Um, so we know the total quantity,
3:31
the total price and the extended amount.
3:33
But you think about this. The some of the price really doesn't make a whole
3:38
lot of sense. So maybe what we really want to dio is just get the
3:42
quantity was copied. That and what we can do is tell it.
3:49
So now we've said group byproduct,
3:52
but just run the some operation on the quantity column,
3:56
and this functionality works. But it's not the preferred way that I like to do
4:01
aggregations in pandas. And there's the structure that makes the most sense for dealing with
4:08
more complex grouping and aggregations. So I define a prod calls.
4:13
So I say in quantity column what some that.
4:17
So if you followed what I did on the aggregation,
4:19
I'm going down that same path.
4:21
So now I say, group by product and now I ag broad calls.
4:28
So, combining what we've talked about earlier with aggregations and group buys,
4:33
we defined the columns that we want to perform mathematical operations on.
4:37
In this case, we want to some all the values in the quantity.
4:41
We want a group by product and therefore we have be counts or our books,
4:46
pins and posters, which, if you recall,
4:49
is very similar to what we did with be some ifs function in excel.
4:54
So the groupie and aggregation that you've been doing an Excel.
4:57
This is just a different way of doing it,
5:00
and it's really more powerful because you have a lot of flexibility and a very small
5:05
subset of code to do at a lot of summary information and grouping on your data
5:10
For instance, copy and paste this,
5:13
but let's say we don't want just do the the some.
5:18
Maybe we also want to see what the average size of each transaction is.
5:23
This starts toe give you a feel for how quickly you can iterated through analysing your
5:28
data. I just adding in the mean function,
5:31
you start to get mawr information about what the purchase patterns look like,
5:36
and we can go back in and out of state or deviation.
5:40
Um, maybe we want to see what's the max.
5:43
Someone is purchased, and this gives us a really quick way to reiterate through our
5:47
data and keep looping back and making adjustments and figuring out what insights we can get
5:54
from our data. And we can also do our group by on multiple columns.
6:00
So right now, up until now,
6:02
we just grew by product. But what if we want to see company in products
6:08
Now we're gonna get more results.
6:10
So we're gonna see for each company what products they purchased.
6:13
The total amount of the average Mount Sandy eight deviation,
6:17
the max in the men. And once again,
6:18
we have our Entei ends. And what that's telling us is there are some negative
6:23
values or some other values in there that are causing the standard deviation to not calculate
6:29
correctly. So if we want to,
6:32
if we don't like, be in ends,
6:33
we can fill in a was zero to turn those into zeros.
6:37
And one of the things you likely noticed is that once we do these aggregations,
6:45
our indices and our columns look a little bit different.
6:49
And what what Pandas has done is put together a multi index force.
6:55
I'm not gonna go through the details of what a multi indexes or why you would
6:59
want to use it. But I am going to introduce this concept of doing reset
7:04
index. And once you call that you can see now that instead of having company
7:11
and then product and book pin poster group together,
7:15
it's reset so that there's one row for each value.
7:20
And I highlight that because sometimes some of the future summaries and analyses that you want
7:25
to do are much easier when you've reset the index this way.
7:29
So I want to introduce that concept to you.
7:31
Now, for the final example,
7:34
we're gonna introduce a concept called named Aggregations.
7:37
So probably the best way is just Ah,
7:39
walk through this and then I'll explain what I'm doing.
7:57
Okay, so let me walk through this what we've done with our named aggregation here
8:01
So we want to count the number of invoices that each company company has.
8:07
And then what is there? Max Purchase?
8:10
And instead of just leaving the columns to the default naming convention based on the extended
8:18
mountain invoice actually defined the column name.
8:21
So this says that the invoice total should be the name of the column,
8:24
and Max purchase should be the name of the call.
8:27
Now, this is really helpful as you start developing these aggregations and want to present
8:33
the results. Other people you want to make sure that the column names are consistent