Eve: Building RESTful APIs with MongoDB and Flask Transcripts
Chapter: Setup and tools
Lecture: SQL, Elastic and other alternative backends
0:00
In this section, I'm going to take you through what I like to do with categorical columns. So let's get going.
0:07
First of all, let's just select what our categorical columns are. In pandas 1, we would do it this way. We would call select_dtypes and pass in object.
0:15
Again, that's because Pandas 1 didn't have a native way to represent strings, and so it used Python strings, which are objects in NumPy parlance.
0:25
In pandas 2, we do have that ability. So if we pass string here, we get back a data frame in which all of the columns are string columns.
0:35
Now, I want to summarize these. I can't use the same summary statistics that I used with describe up above,
0:42
but I can do some other things and I'll show you those. Alternatively, we could pass select_dtypes the word string followed by pyarrow in square brackets, that is, string[pyarrow],
0:51
and that gives us the same result in this case. Is there any value to that? Not necessarily. It's a little bit more typing.
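As a minimal sketch of the two selection styles just described, the snippet below uses a small made-up data frame (the values and the age column are assumptions, not the lecture's data). It converts the text columns with astype("string") so it runs without PyArrow; the string[pyarrow] variant mentioned above additionally needs the pyarrow package installed.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's data frame; values are invented.
raw = pd.DataFrame({
    "famsize": pd.Series(["GT3", "LE3", "GT3"], dtype=object),
    "higher": pd.Series(["yes", "yes", "no"], dtype=object),
    "age": [18, 17, 15],
})

# pandas 1 style: text lives in object columns, so select on object.
object_cols = raw.select_dtypes(object)

# pandas 2 style: convert to the dedicated string dtype, then select on it.
typed = raw.astype({"famsize": "string", "higher": "string"})
string_cols = typed.select_dtypes("string")

print(list(object_cols.columns))
print(list(string_cols.columns))
```

Both selections pick out the same two text columns and leave the numeric column behind.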
0:59
I want to show you my go-to method. So when we're doing a lot of these operations, I like to think of pandas as a tool belt.
1:08
There are about 400 different attributes on a data frame and about 400 different things that you can do to a series.
1:14
Do you have to memorize all of those? No, you don't. But I want to show you common ones and you can think of them as tools.
1:19
You put them in your tool belt and then we use these chains to build up these operations. Your go-to when you're dealing with string or
1:26
categorical data is going to be the value_counts method. Let's look at that. Let's assume that I want to look at this famsize column,
1:33
which is the size of the family. You can see the column over here, but let's explore that a little bit more.
1:40
So all I'm going to do is take my data frame, pull off that famsize column, and then call value_counts on that.
1:48
What this returns is a Pandas series. Now, let me just explain what's being output here because it might be a little bit confusing.
1:56
At the top, we see famsize, and that is the name of the original column, which is carried over as the name of the index in this case. Then on the left-hand side, we see GT3 and LE3.
2:09
Those are the values and they are in the index. The actual values of the series 281 and 114 are on the right-hand side. At the bottom, we see name.
2:18
Name is count. So that is derived from doing value_counts there. We see the dtype. It says this is an int64.
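The layout of that output can be checked piece by piece. This is a small sketch on invented data (five rows rather than the lecture's 395), so the counts differ from the 281 and 114 shown in the video.

```python
import pandas as pd

# Invented sample; the lecture's data frame is not reproduced here.
df = pd.DataFrame({"famsize": ["GT3", "LE3", "GT3", "GT3", "LE3"]})

counts = df["famsize"].value_counts()

# The categories sit in the index, sorted from most to least frequent.
print(counts.index.tolist())
# The counts themselves are the values of the series.
print(counts.tolist())
# In pandas 2 the series is named "count"; older versions label it differently.
print(counts.name)
```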
2:26
So the type of the series is a PyArrow int64. Let's do the same thing for the higher column. We'll call value_counts,
2:34
and you can see that we get back a series with those counts in that. Now, if we want to compare two categorical or string columns with each other,
2:44
pandas has a built-in function to do that called crosstab, short for cross tabulation. What that is going to give us is a data frame,
2:52
and we'll see in this case we have sex in the index and higher in the columns, and then it gives us the count of each of those.
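Here is a minimal sketch of that cross tabulation, again on a handful of invented rows rather than the lecture's data.

```python
import pandas as pd

# Invented sample rows; only the column names follow the lecture.
df = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "F"],
    "higher": ["yes", "no", "yes", "yes", "yes"],
})

# First argument becomes the index, second becomes the columns.
table = pd.crosstab(df["sex"], df["higher"])
print(table)
```

Combinations that never occur, such as M with no in this sample, show up as a count of zero.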
3:00
This has various options. Again, we can put our cursor there, hold down shift and hit tab four times there to pull up the documentation.
3:07
So there are a lot of things we can do. Turns out pandas has pretty good documentation. So check that out if you want to.
3:13
I'm not going to go over all that right now. But an example is we can say normalize. Now, instead of having the counts there, we have the proportions,
3:23
and this is normalized over all of the values in there. If I want to format that and convert that into a percent, we can say style.format,
3:34
and now I'm getting percents there. I can say I want to normalize this across the index. So what does that do?
3:42
It says I want to take each row and normalize each row. So we're going down the index and normalizing each row.
3:51
I think that normalizing across the index is a little bit weird. To me, this seems backwards.
3:57
To me, it seems like we're normalizing across the columns instead of the index. But if we want to normalize down a column,
4:05
then we would set normalize to columns there, and we're normalizing down the columns that way. pandas has some warts. I'll be the first to admit it.
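The three normalize settings can be compared side by side. This sketch uses invented rows; the optional style.format call is left as a comment because the styler needs the jinja2 package.

```python
import pandas as pd

# Invented sample rows; only the column names follow the lecture.
df = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "F"],
    "higher": ["yes", "no", "yes", "yes", "yes"],
})

# normalize=True: every cell is a share of the grand total.
overall = pd.crosstab(df["sex"], df["higher"], normalize=True)

# normalize="index": each row sums to 1.
by_row = pd.crosstab(df["sex"], df["higher"], normalize="index")

# normalize="columns": each column sums to 1.
by_col = pd.crosstab(df["sex"], df["higher"], normalize="columns")

# Optional percent display in a notebook (requires jinja2):
# overall.style.format("{:.1%}")

print(overall, by_row, by_col, sep="\n\n")
```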
4:13
And oftentimes, when we are doing aggregation operations, if we want to sum across the columns, we would say axis is equal to columns,
4:20
and we would sum across the columns. In this case, this normalize here seems a little bit backwards, but we'll just deal with it. It is what it is.
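To make the mismatch concrete, this sketch (invented rows again) rebuilds normalize="index" by hand. The row totals come from summing with axis="columns", yet the matching crosstab option is spelled "index".

```python
import pandas as pd

# Invented sample rows; only the column names follow the lecture.
df = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "F"],
    "higher": ["yes", "no", "yes", "yes", "yes"],
})

table = pd.crosstab(df["sex"], df["higher"])

# Summing across the columns yields one total per row.
row_totals = table.sum(axis="columns")

# Dividing each row by its total reproduces normalize="index".
manual = table.div(row_totals, axis="index")
built_in = pd.crosstab(df["sex"], df["higher"], normalize="index")

print(manual)
print(built_in)
```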
4:28
In this section, I showed you how I would look at string data. Generally, I'm going to use value_counts to quantify what is in there.
4:36
That often shows us whether we have low cardinality, meaning few unique values, or whether all of the values are unique,
4:42
and we can see that relatively quickly. If I want to compare two categorical columns, I'm going to use cross tabulation to do that.
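As a rough sketch of that cardinality check, nunique gives the number of distinct values per column. The student_id column here is hypothetical, added only to show a column where every value is unique.

```python
import pandas as pd

# Invented sample; student_id is a hypothetical all-unique column.
df = pd.DataFrame({
    "famsize": ["GT3", "LE3", "GT3", "GT3", "LE3"],
    "student_id": ["s1", "s2", "s3", "s4", "s5"],
})

# Number of distinct values in each column.
cardinality = df.nunique()
print(cardinality)

# A column is all-unique when its distinct count equals the row count.
all_unique = cardinality == len(df)
print(all_unique)
```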