Eve: Building RESTful APIs with MongoDB and Flask Transcripts
Chapter: Setup and tools
Lecture: Installing Mongo
0:00
I want to explore correlations. Correlations are the relationships between two numeric columns.
0:07
And this is a good way to understand if one value is going up, does the other value go up
0:11
or down or does it have no impact on it. So let's see how we can do that with pandas.
0:16
I'm going to say df.corr() and I'm going to pass in numeric_only=True because otherwise it's going to
0:21
complain about that. And look at what this returns. It's a data frame. In the index we have all the
0:27
numeric columns and in the columns we have all the numeric columns. In the values here we have
0:33
what's called the Pearson correlation coefficient. This is a number between negative one and one.
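As a minimal sketch of the call described above: the student data and column names here are made up for illustration and are not the dataset used in the lecture.

```python
import pandas as pd

# Hypothetical student data; the column names are assumptions for this sketch.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Di", "Ed"],  # non-numeric column
    "failures": [0, 1, 0, 3, 2],
    "grade1": [15, 11, 14, 6, 9],
    "grade2": [16, 10, 15, 5, 8],
})

# numeric_only=True leaves the string column out of the calculation.
corr = df.corr(numeric_only=True)

# The result is a DataFrame with the numeric columns in both the index and the columns.
print(corr.round(2))
```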
0:38
A value of one means that as one value goes up the other value goes up in a linear fashion. If
0:44
you were to scatter plot that you would see a line going up and to the right. A correlation of negative
0:49
one means that if you scatter plotted it you'd see a line going down and to the right. A correlation
0:56
of zero means that as one value is going up the other value might go up or down. You might see a
1:01
flat line but you also might see alternating values. As one value increases the other value
1:07
may or may not increase. They don't have a linear relationship to each other. Now humans are
1:12
not optimized for looking at big tables of data like this. Generally what I want to do when I have this
1:17
correlation table is to look for the highest values and the lowest values. But I might want
1:22
to look for values around zero and it's kind of hard to pick those out. If you look you might
1:27
notice that along the diagonal we do see a bunch of ones and that's because the correlation of a
1:34
column with itself is always one: when the column goes up, the column goes up. So you do see that value there
1:40
but we're actually not interested in that value. We want to look at the off diagonal values. So let
1:46
me give you some hints on how we can do this. One of the things that pandas allows us to do is add a
1:51
style. So I'm going to use the .style attribute and off of that I can call background_gradient().
1:59
Let me note one more thing here. This is showing how to use what's called chaining in pandas. I'm
2:05
actually doing multiple operations to the same data frame here and I put parentheses around it.
2:12
What that allows me to do is put each step on its own line and that makes it read like a recipe.
2:18
I'm first going to do this then I'm going to do this then I'm going to do this. Do I need parentheses?
2:21
No I don't. If I didn't use parentheses I would have to put all of that code on one line and it
2:27
gets really hard to read. So I recommend that when you write your chains you put parentheses at the
2:31
front and then parentheses at the end and then just space it out, each operation on its own line. It's
2:36
going to make your life a lot easier. Okay so what we've done is we've added this background gradient.
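Here is a rough sketch of the chaining layout, using made-up data and a made-up sequence of steps (round, then select two columns) rather than the lecture's actual chain. It compares the one-line form with the parenthesized form.

```python
import pandas as pd

# Hypothetical data; the column names are assumptions for this sketch.
df = pd.DataFrame({
    "failures": [0, 1, 0, 3, 2],
    "grade1": [15, 11, 14, 6, 9],
    "grade2": [16, 10, 15, 5, 8],
})

# Everything on one line: it works, but it gets hard to read as steps pile up.
one_line = df.corr(numeric_only=True).round(2).loc[:, ["grade1", "grade2"]]

# The same chain wrapped in parentheses, one operation per line.
chained = (
    df
    .corr(numeric_only=True)        # first compute the correlation table
    .round(2)                       # then round the coefficients
    .loc[:, ["grade1", "grade2"]]   # then keep only the grade columns
)

print(chained)
```

Both forms produce the same table. A final .style.background_gradient() step would slot in as one more line inside the parentheses.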
2:41
The default gradient here is a blue gradient. It goes from white to blue, dark blue. Again along
2:48
that diagonal you do see the dark blue but this is actually not a good gradient. What we want to
2:55
use when we're doing a heat map of a correlation is to use a color map that is diverging. Meaning
3:02
it goes from one color and then hopefully passes through like a light or white color and goes to
3:07
another color. That way we can look for one color for the negative values and the other color for
3:12
the positive values. So let's see if we can do that. I'm going to specify a diverging color map.
3:17
That's RdBu, the red-blue color map. And it looks like we are seeing those diverging values
3:23
now. Now there is one issue with this. The issue is that if you look for the reddest values I'm
3:29
seeing pretty red values for example around negative 0.23. That's not negative one and I
3:36
would like my red values to actually be at negative one because I also want my white values to be
3:42
around zero. If I look at my white values it looks like they're around 0.42 right now. Note that the
3:49
blue values are at one. Again that's because that diagonal by definition is going to be one.
3:56
So pandas has an option for us to do that. We can specify these vmin and vmax values to specify
4:02
where those get pinned down. And when we do that we actually get a proper coloring here. Now this
4:09
makes it really easy to find the reddest values and I can see that failures have a large negative
4:17
correlation with the grade. Again we do have that diagonal there but we want to look at the
4:22
off diagonal values for correlations. And over there at grades we can see that grades are pretty
4:28
highly correlated with each other. Probably makes sense that if you did good on the first test you
4:33
probably did good on the second test etc. Another thing that you can do with the correlation is you
4:40
can change the method. I can say instead of doing the Pearson correlation coefficient which is the
4:44
default one I can do a Spearman correlation. A Spearman correlation does not assume a linear
4:49
relationship; rather, it's also called a rank correlation. So you might see a relationship where,
4:55
if you did a scatterplot, it curves like that. That could have a correlation of one: as the rank of one
5:01
goes up the rank of the other one goes up but it's not a linear correlation. So oftentimes I do like
5:07
to do a Spearman correlation instead of the Pearson correlation which is the default value.
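To make the difference concrete, here is a small sketch on made-up data where y is the cube of x, so the relationship always increases but is curved rather than linear.

```python
import pandas as pd

# Hypothetical data: y always rises with x, but along a curve.
df = pd.DataFrame({"x": range(1, 11)})
df["y"] = df["x"] ** 3

# Pearson (the default) measures how close the points are to a straight line.
pearson = df.corr(method="pearson").loc["x", "y"]

# Spearman compares the ranks, so any strictly increasing curve scores 1.
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"pearson={pearson:.3f} spearman={spearman:.3f}")
```

On this data the Spearman coefficient is one, while the Pearson coefficient is high but below one.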
5:12
In this section I showed you how to look at correlations. I also showed you one of my pet peeves:
5:18
I often see in social media and other places people showing these correlation heatmaps and
5:22
they'll throw a color on them but they don't pin those values. So make sure you use a diverging
5:27
color map when you're coloring this and make sure you pin those values so that the negative
5:32
value is pinned at negative one and that light value goes at zero.