Eve: Building RESTful APIs with MongoDB and Flask Transcripts
Chapter: Setup and tools
Lecture: VSCode editor
0:00
In this section I want to show you some visualizations that you can do really easily with pandas.
0:06
So if I've got a numeric column, I like to do a histogram on it.
0:10
So I'm going to say, let's take the health column, which is this numeric value from 1 to 5. And this is the health of the student.
0:20
And all I do is pull off the column and then say .hist. Now I am saying figsize is equal to (8, 3). figsize is a matplotlib-ism.
0:30
This is leveraging matplotlib. Now you do see a space in here around 2.5.
0:35
The issue here is that by default we are using 10 bins here and these values only go up to 5.
0:40
So I might want to come in here and say bins is equal to 5 and change that. Oftentimes people say they want to look at a table of data.
0:50
And again, humans aren't really optimized for that. If I gave you a table of the health column and said like, what does this have in it?
0:57
It's hard for you to really understand that too much. But if you plot it, if you visualize it using a histogram, it makes sense.
1:04
And that's a great way to understand what's going on with your data. Let's just take another numeric column.
1:09
We'll take the final grade and do a histogram of that. In this case, I'm going to say bins is equal to 20 because this value goes up to 20.
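The two histogram calls described so far might look roughly like the sketch below. The data is synthetic, and the column names `health` and `final_grade` are assumptions, since the transcript does not show the real ones.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the student data (hypothetical column names).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "health": rng.integers(1, 6, size=200),        # whole numbers 1..5
    "final_grade": rng.integers(0, 21, size=200),  # whole numbers 0..20
})

# Series.hist draws on the current figure, so start a fresh one each time.
plt.figure()
ax_health = df["health"].hist(bins=5, figsize=(8, 3))

plt.figure()
ax_grade = df["final_grade"].hist(bins=20, figsize=(8, 3))
```

With the default of 10 bins, a column that only holds the values 1 to 5 leaves empty bins between the bars, which is the gap mentioned above; `bins=5` gives each value its own bar.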
1:15
This is really interesting to me. You can see that there's a peak there at 0, indicating that you do have a large percent of people who fail.
1:24
And then it looks like around 10, you have another peak. That's probably your average student.
1:31
So this is illustrating not a bell curve, so to speak, but the distribution of grades, which I think is interesting.
1:39
And it tells a story just by looking at this. Again, could we tell this by looking at the column of data? It would be really hard to do.
1:47
But giving that plot there makes it relatively easy. If I have two numeric columns and I want to compare them, I like to use a scatter plot.
1:54
We're going to plot one value in the x-axis and another value in the y-axis. Pandas makes it really easy to do this as well.
2:01
What we're going to do is we're going to say df and then an attribute on the data frame is plot. And from that plot, we can do various plots here.
2:08
So one of those is scatter. In fact, there's also a hist there as well. So hist is on data frame and it's on a series directly.
2:17
But also those are both available from the plot accessor.
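As a small sketch of that point, the same histogram can be requested directly on a Series or through the `plot` accessor. The `health` values here are made up for illustration, and explicit axes are passed so the two charts do not draw on top of each other.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical health scores, not the real course data.
health = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 5], name="health")

fig, (ax_direct, ax_accessor) = plt.subplots(1, 2, figsize=(8, 3))

health.hist(bins=5, ax=ax_direct)          # method directly on the Series
health.plot.hist(bins=5, ax=ax_accessor)   # same chart via the plot accessor

# Both routes should produce the same bar heights.
direct_counts = [p.get_height() for p in ax_direct.patches]
accessor_counts = [p.get_height() for p in ax_accessor.patches]
```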
2:21
In order to use the scatter plot, we need to say what column we want to plot in the x-direction and what column we want to plot in the y-direction.
2:29
So we're going to plot the mother's education in the x-direction and their final grade in the y-direction.
2:35
And I'm just going to change the size of that so it's 8 by 3. Here's our plot.
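A minimal sketch of that scatter call, assuming hypothetical column names `mother_edu` and `final_grade` and synthetic data in place of the real student file:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import numpy as np
import pandas as pd

# Synthetic stand-in for the student data (hypothetical column names).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mother_edu": rng.integers(0, 5, size=200),    # assumed coded levels 0..4
    "final_grade": rng.integers(0, 21, size=200),  # whole numbers 0..20
})

# x and y name the columns to plot; figsize is passed through to matplotlib.
ax = df.plot.scatter(x="mother_edu", y="final_grade", figsize=(8, 3))
```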
2:40
When I look at this plot, a couple of things stand out to me immediately.
2:46
One is that we see values at regular intervals here. So this tells me that we have gradations that are at some level, which kind of makes sense.
3:00
Our grade is at the whole number level. You don't have like a 15.2 or a 15.1. You just have 15, 16, 17, et cetera.
3:09
The scatter plot makes that very clear. The other one is that we're seeing columns there.
3:14
And so you can think of the mother's education, it is a numeric value, but it's also somewhat categorical in that it's lined up in columns.
3:23
So I'm going to show you some tricks to tease that apart and understand what's going on here.
3:27
If you just look at this plot on its own, it's hard to tell where the majority of the data is.
3:33
So I'm going to show you how we can find out what's going on behind this plot. One of my favorite tricks with a scatter plot is to adjust the alpha.
3:43
Now, if I just see a bunch of dark values there, what I want to do is I want to lower that alpha,
3:48
which is the transparency, until I start to see some separation there. I think that looks pretty good. I might even go a little bit lower.
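Lowering the alpha could be sketched like this. The data is synthetic and deliberately weighted toward an education level of 4, and the column names and the value `alpha=0.1` are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import numpy as np
import pandas as pd

# Synthetic data: more rows at level 4 so overlap there is heavier.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mother_edu": rng.choice([0, 1, 2, 3, 4], size=300,
                             p=[0.02, 0.13, 0.25, 0.25, 0.35]),
    "final_grade": rng.integers(0, 21, size=300),
})

# alpha sets the opacity of each marker; stacked points look darker.
ax = df.plot.scatter(x="mother_edu", y="final_grade",
                     alpha=0.1, figsize=(8, 3))
```

In practice the alpha is tuned by eye: keep lowering it until the dense regions separate from the sparse ones.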
3:59
You can see that I'm now starting to see some faded values here. So by looking at this, this tells a different story to me than this plot up here.
4:09
This is telling me that we have more values at 4. How do I know that we have more values at 4? Because it's darker there when we lowered the alpha.
4:17
We're not really seeing that so much on this plot. What's another thing we can do? Another thing that we can do is add jitter.
4:26
Basically, we'll add a random amount to the data to spread it apart and let us see what's going on inside of that.
4:33
So I'm going to add jitter in the x direction to spread apart that mother's education value.
4:39
I'm going to use NumPy to do that, and this is going to use the assign method. The assign method lets us create or update columns on our data frame.
4:47
I'm going to say let's make a new column called EduJit, and it's going to consist of the mother's education plus some random amount.
4:56
I'm using NumPy random to generate some random values there. In this case, the amount is 0.5.
5:04
I don't want my jittered points to overlap the points from a neighboring education value, so I'm keeping them within a certain width.
5:14
Then I'm going to say on that new data frame, let's plot that. Let me just show you that this is pretty easy to debug once you have these chains here.
5:24
You can actually say here's my data frame, and then I want to make a new column. There is my new column. It popped over there on the end.
5:31
Now once I have that, I'm going to plot the new column in the X direction and plot the grade in the Y direction. We get something that looks like this.
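The inlined jitter might look something like the sketch below. It assumes the 0.5 is the total width of the noise, centered on the original value, and uses hypothetical column names (`mother_edu`, `final_grade`, `edu_jit`) with synthetic data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mother_edu": rng.integers(0, 5, size=200),    # assumed coded levels 0..4
    "final_grade": rng.integers(0, 21, size=200),
})

# assign returns a new DataFrame; the new column is appended at the end.
# Noise is uniform in [-0.25, 0.25), so neighboring levels never overlap.
jittered = df.assign(
    edu_jit=df["mother_edu"] + rng.random(len(df)) * 0.5 - 0.25
)

ax = jittered.plot.scatter(x="edu_jit", y="final_grade",
                           alpha=0.3, figsize=(8, 3))
```

Because `assign` leaves `df` untouched, the intermediate `jittered` frame can be inspected on its own before the plot call is chained on.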
5:40
This also tells us a different story than this one up here. I think this is a much better plot, letting us see where the majority of the data is.
5:52
Now I have inlined that Jitter functionality right here, but it's pretty easy to make a function to do that.
6:00
I'm going to write a function down here in this next one called jitter. Then to leverage that, I'm going to say,
6:07
okay, EduJit is now this result over here. Now let's explain what's going on here. On the right-hand side of a parameter in assign,
6:19
up above here you can see that what we passed in is a series, and we're adding some amount to it. This is a Pandas series up here.
6:26
Down here, this is a lambda function. We can pass in a lambda function on the right-hand side. What happens when we pass in a lambda function?
6:33
When you have a lambda function inside of assign, Pandas is going to pass in the current state of the data frame
6:39
as the first parameter to that lambda function. Generally, you will want that lambda function to return a series
6:47
because you want that to be what the column is. Now do you have to use lambdas? No, you don't have to use lambdas.
6:52
You can use normal functions as well. Oftentimes, it is nice to use lambdas because you want that logic directly there inside.
7:00
When you're looking at your code, the logic's right there. If you were to repeatedly use the same lambda all over the place,
7:07
then I might recommend moving that out to a function so you only have to write it one place. Let's run that and make sure that that works.
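Pulling the jitter out into a function and calling it from a lambda inside `assign` could be sketched as follows. The signature of `jitter`, the column names, and the extra `jit_offset` column are all assumptions made for illustration; the second lambda is only there to show that each callable receives the data frame as it stands at that point in the chain.

```python
import numpy as np
import pandas as pd

def jitter(df_, col, amount=0.5, rng=None):
    """Return df_[col] plus uniform noise in [-amount/2, amount/2)."""
    rng = np.random.default_rng() if rng is None else rng
    return df_[col] + rng.random(len(df_)) * amount - amount / 2

# Synthetic stand-in for the student data (hypothetical column names).
data_rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mother_edu": data_rng.integers(0, 5, size=200),
    "final_grade": data_rng.integers(0, 21, size=200),
})

jittered = (
    df
    # The lambda receives the current DataFrame and returns a Series.
    .assign(edu_jit=lambda df_: jitter(df_, "mother_edu",
                                       rng=np.random.default_rng(1)))
    # This lambda already sees the edu_jit column created just above.
    .assign(jit_offset=lambda df_: df_["edu_jit"] - df_["mother_edu"])
)
```

Any callable that accepts the data frame works here, so a named function can replace the lambda when the same logic is reused in several places.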
7:15
That looks like that works as well. If this jitter was useful, what I would do is make a helpers file,
7:22
and I would stick that jitter into the helpers file so I can leverage that. I also want to look at how to visualize string data.
7:31
What I'm going to do is I'm just going to tack a .plot.bar onto my value_counts. When we do a bar plot in Pandas, what it does is it takes the index
7:41
and it puts it in the x-axis. Then, for each of those index values, it plots the corresponding value as a bar.
7:49
Once you understand that, it makes it really easy to do bar plots. Let's see what happens when we run .plot.bar.
7:55
We should see mother and father and other go into the x-axis. We do see that.
8:02
This is a little bit hard to read because I have to tilt my head to the side.
8:05
Generally, when I'm making these bar plots, I prefer them to be horizontal. To make a horizontal bar plot, I just say barh.
8:13
There we go. There's our visualization of that. We can see that most of the guardians are actually the mother in this case.
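The two bar plots might be sketched like this, using a small made-up `guardian` column (the name and the counts are assumptions). Explicit axes are passed so the vertical and horizontal versions land side by side instead of on the same chart.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical guardian column: 6 mothers, 3 fathers, 1 other.
guardian = pd.Series(["mother"] * 6 + ["father"] * 3 + ["other"],
                     name="guardian")

# value_counts puts the categories in the index and the counts in the values.
counts = guardian.value_counts()

fig, (ax_vertical, ax_horizontal) = plt.subplots(1, 2, figsize=(8, 3))
counts.plot.bar(ax=ax_vertical)      # index on the x-axis
counts.plot.barh(ax=ax_horizontal)   # index on the y-axis, easier to read
```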
8:22
In this section, we looked at how to visualize your data. I'm a huge fan of visualization because I think it tells stories
8:29
that you wouldn't get otherwise. Once you understand how to make these visualizations in Pandas, it's going to make your life really easy.