Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 4: Understanding Grouping and Aggregation Retail Data
Lecture: Using Grouper in Pandas to Groupby by Month Frequency
0:00
Your boss is liking all this stuff that you're able to give them now. They're asking for sales by month. Okay,
0:07
so here is our original data, and we're gonna say, let's make that total column, and I'm gonna show you a new way to
0:13
get that value here, the month, without actually making a column.
0:18
What I'm gonna do is I'm gonna say pd.Grouper, and we'll say I want to group by this invoice date. But look at this: I say freq is equal to M.
0:27
So what this is going to do is, it's going to complain. It's gonna say, nope, I can't do that.
0:33
So this is only valid with DatetimeIndex, but got an instance of Index.
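As a rough illustration of that complaint, here is a minimal sketch. It does not reproduce the PyArrow case from the lecture; instead it triggers the same kind of TypeError by leaving the dates as plain strings, which is a swapped-in stand-in. The column names InvoiceDate and Quantity and all of the values are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical sample data; the dates are deliberately left as plain strings,
# so the column is not a datetime64 column that pd.Grouper can bin.
sales = pd.DataFrame({
    "InvoiceDate": ["2011-01-04", "2011-01-20", "2011-02-03"],
    "Quantity": [2, 1, 5],
})

error_message = None
try:
    # A frequency-based Grouper needs datetime-like values to build its bins.
    sales.groupby(pd.Grouper(key="InvoiceDate", freq="D")).sum(numeric_only=True)
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```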
0:40
So this is actually a pandas 2 bug. Hopefully they'll be fixing this soon.
0:43
So I'm gonna look at the dtypes of sales here, and you can see that we have the date. It's this timestamp.
0:51
It's a PyArrow timestamp. I'm gonna actually convert this to a pandas time, so let's do that here. I'm gonna say astype, and
1:00
if you look at this, we don't really see much difference here, but if we look at the dtypes now of this, you can see this is datetime64.
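The conversion described here might look something like the sketch below. The sample dates are invented, and the PyArrow step is wrapped in a try block so the sketch still runs where PyArrow is not installed; in that case the column simply starts out as datetime64 and the astype call changes nothing.

```python
import pandas as pd

# Hypothetical invoice dates, made up for this sketch.
dates = pd.Series(pd.to_datetime(["2011-01-04", "2011-01-20", "2011-02-03"]))

try:
    import pyarrow  # noqa: F401  (only needed for the PyArrow-backed dtype)
    invoice_date = dates.astype("timestamp[ns][pyarrow]")
except ImportError:
    # Assumption: without PyArrow we keep the NumPy-backed column as is.
    invoice_date = dates

# Convert to the NumPy-backed datetime64 type that pd.Grouper accepts.
converted = invoice_date.astype("datetime64[ns]")

print(invoice_date.dtype)
print(converted.dtype)
```

The printed values look the same before and after; only the dtype differs.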
1:14
So hopefully in a soon-to-be-released version of pandas this will be fixed,
1:19
but I'm gonna add that total column there, and then we're gonna group by, and I'm gonna say
1:25
that column, which is the date column that we just changed to a
1:28
NumPy date now, and this freq here, M, is the month frequency. So let's do that. That's gonna be lazy.
1:36
It's gonna give us that groupby object, and then we're gonna say I want to summarize the numeric columns there. Now look at the index
1:41
here. Instead of having a month here, what we have is the end of each month. So this is really cool. With relatively little code (again,
1:52
I did have to change the type because PyArrow doesn't support that), I was able to summarize by month.
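A minimal sketch of that monthly aggregation follows. The column names InvoiceDate, Quantity, Price and Total, the sample values, and the helper month_end_alias are all assumptions for illustration. The helper exists because newer pandas releases spell the month-end alias "ME" while older ones use the "M" from the lecture.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def month_end_alias():
    """Hypothetical helper: return "ME" where pandas accepts it, else "M"."""
    try:
        to_offset("ME")
        return "ME"
    except ValueError:
        return "M"


# Hypothetical retail rows, made up for this sketch.
sales = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2011-01-04 10:00", "2011-01-20 15:30",
        "2011-02-03 09:15", "2011-02-28 12:00",
    ]),
    "Quantity": [2, 1, 5, 3],
    "Price": [2.5, 10.0, 1.0, 4.0],
})

monthly = (
    sales
    .assign(Total=lambda df_: df_.Quantity * df_.Price)  # total per row
    .groupby(pd.Grouper(key="InvoiceDate", freq=month_end_alias()))
    .sum(numeric_only=True)  # the groupby is lazy until we aggregate
)

print(monthly)
```

Note that the resulting index holds the last day of each month rather than a month number.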
2:00
I'm just going to look at the memory usage of our old data here. There's not a
2:07
difference in memory usage. It's just that one's using NumPy and the other one's using PyArrow.
2:13
Okay, so one of the things I like to do with this once I have that... So here's what we had. Let's throw on a plot here to visualize that.
2:24
So look at this. This is going to do a line plot. We haven't seen line plots yet,
2:31
but here's our data. This is a series. When we just do a plot, by default it's going to do a line plot.
2:37
It's going to put the index in the x-axis. In this case, the index is dates, and then it's going to draw a line plot for those values there.
2:45
So it's really easy to make a line plot in pandas.
2:48
Just call .plot there. This makes it really clear that November is where we have the most sales.
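Here is a small sketch of that default line plot. The monthly totals are invented numbers, not the course data, and the Agg backend is an assumption so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for this sketch

import pandas as pd

# Hypothetical month-end totals, made up for illustration.
monthly_total = pd.Series(
    [15.0, 17.0, 42.0, 30.0],
    index=pd.to_datetime(["2011-09-30", "2011-10-31", "2011-11-30", "2011-12-31"]),
    name="Total",
)

# With no arguments, Series.plot draws a line plot:
# the index goes on the x-axis and the values on the y-axis.
ax = monthly_total.plot()
ax.set_ylabel("Total sales")
```

In a notebook the figure shows up on its own; in a script you would typically save it or call matplotlib's show function.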
2:55
But this is aggregated at the month level. But watch this: I can change this freq value here from M to W, and
3:03
now we are looking at an aggregation of the weekly values. In fact, I can change it to a D and we can aggregate at the day level, which is kind of cool.
3:14
I can do a 3D here and aggregate at every three-day interval. So pandas makes it really flexible to aggregate at different date intervals.
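To make the effect of swapping the frequency concrete, here is a sketch that runs the same aggregation with "W", "D" and "3D". The data and the helper name total_by are assumptions for this example only.

```python
import pandas as pd

# Hypothetical rows spanning a little over a week (2011-01-03 is a Monday).
sales = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2011-01-03", "2011-01-04", "2011-01-06", "2011-01-10"]
    ),
    "Total": [1, 2, 3, 4],
})


def total_by(freq):
    """Hypothetical helper: sum Total in bins of the given offset alias."""
    return sales.groupby(pd.Grouper(key="InvoiceDate", freq=freq))["Total"].sum()


weekly = total_by("W")      # weeks ending on Sunday
daily = total_by("D")       # one row per calendar day, empty days sum to 0
three_day = total_by("3D")  # consecutive three-day bins

print(weekly)
print(daily)
print(three_day)
```

Only the alias string changes between the three calls; the grand total stays the same however the rows are binned.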
3:24
We call this the offset alias, and we're going to use that with pd.Grouper.
3:28
So the pd.Grouper syntax is a little weird, but once you get used to it, it's able to do very powerful things.