Eve: Building RESTful APIs with MongoDB and Flask Transcripts
Chapter: Setup and tools
Lecture: Installing Mongo
I want to explore correlations. Correlations are the relationships between two numeric columns.
And this is a good way to understand if one value is going up, does the other value go up
or down or does it have no impact on it. So let's see how we can do that with pandas.
I'm going to say df.corr and I'm going to pass in numeric_only=True because otherwise it's going to
complain about that. And look at what this returns. It's a data frame. In the index we have all the
numeric columns and in the columns we have all the numeric columns. In the values here we have
what's called the Pearson correlation coefficient. This is a number between negative one and one.
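Here is a minimal sketch of that call. The data frame and its column names are made up for illustration and are not the data used in the video.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame(
    {
        "hours": [1, 2, 3, 4, 5],
        "score": [52, 54, 56, 58, 60],      # rises linearly with hours
        "absences": [10, 8, 6, 4, 2],       # falls linearly with hours
        "name": ["a", "b", "c", "d", "e"],  # non-numeric, left out below
    }
)

# numeric_only=True drops the string column instead of raising an error.
corr = df.corr(numeric_only=True)
print(corr)
```

The result has the numeric column names in both the index and the columns, and every value sits between negative one and one.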
A value of one means that as one value goes up the other value goes up in a linear fashion. If
you were to scatter plot that you would see a line going up and to the right. A correlation of negative
one means that if you scatter plotted it you'd see a line going down and to the right. A correlation
of zero means that as one value is going up the other value might go up or down. You might see a
flat line but you also might see alternating values. As one value increases the other value
may or may not increase. They don't have a linear relationship to each other. Now humans are not
optimized for looking at big tables of data like this. Generally what I want to do when I have this
correlation table is to look for the highest values and the lowest values. But I might want
to look for values around zero and it's kind of hard to pick those out. If you look you might
notice that along the diagonal we do see a bunch of ones and that's because the correlation of a
column with itself is always one: as the column goes up, the column goes up. So you do see that value there
but we're actually not interested in that value. We want to look at the off diagonal values. So let
me give you some hints on how we can do this. One of the things that pandas allows us to do is add a
style. So I'm going to use this .style attribute and off of that I can say background_gradient.
Let me note one more thing here. This is showing how to use what's called chaining in pandas. I'm
actually doing multiple operations to the same data frame here and I put parentheses around it.
What that allows me to do is put each step on its own line and that makes it read like a recipe.
I'm first going to do this then I'm going to do this then I'm going to do this. Do I need parentheses?
No I don't. If I didn't use parentheses I would have to put all of that code on one line and it
gets really hard to read. So I recommend that when you write your chains you put parentheses at the
front and then parentheses at the end and then just put each operation on its own line. It's
going to make your life a lot easier. Okay so what we've done is we've added this background gradient.
The default gradient here is a blue gradient. It goes from white to blue, dark blue. Again along
that diagonal you do see the dark blue but this is actually not a good gradient. What we want to
use when we're doing a heat map of a correlation is to use a color map that is diverging. Meaning
it goes from one color and then hopefully passes through like a light or white color and goes to
another color. That way we can look for one color for the negative values and the other color for
the positive values. So let's see if we can do that. I'm going to specify a diverging color map.
That's RdBu, the red blue color map. And it looks like we are seeing those diverging values
now. Now there is one issue with this. The issue is that if you look for the reddest values I'm
seeing pretty red values for example around negative 0.23. That's not negative one and I
would like my red values to actually be at negative one because I also want my white values to be
around zero. If I look at my white values it looks like they're around 0.42 right now. Note that the
blue values are at one. Again that's because that diagonal by definition is going to be one.
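One way to see why the colors land in the wrong place is to look at the range each column covers. This sketch uses made-up data and assumes the default behavior of background_gradient, which scales the colors for each column (axis=0) between that column's own minimum and maximum. The function name unpinned_heatmap is hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the course's data frame.
df = pd.DataFrame(
    {
        "failures": [0, 1, 0, 3, 2, 0],
        "grade_1": [15, 10, 14, 6, 8, 16],
        "grade_2": [16, 9, 15, 5, 9, 17],
    }
)

corr = df.corr(numeric_only=True)

# Without pinning, the reddest color sits at each column's minimum, the
# bluest at its maximum, and the light middle color halfway between them.
col_min = corr.min()
col_max = corr.max()  # one in every column because of the diagonal
midpoint = (col_min + col_max) / 2
print(pd.DataFrame({"min": col_min, "max": col_max, "midpoint": midpoint}))


def unpinned_heatmap(frame):
    """Diverging color map, but with the color range left unpinned."""
    return (
        frame
        .corr(numeric_only=True)
        .style
        .background_gradient(cmap="RdBu")
    )
```

Because every column's maximum is the one on the diagonal, the midpoint ends up above zero unless a column also contains a negative one.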
So pandas has an option for us to do that. We can specify these vmin and vmax values to specify
where those get pinned down. And when we do that we actually get a proper coloring here. Now this
makes it really easy to find the reddest values and I can see that failures have a large negative
correlation with the grade. Again we do have that diagonal there but we want to look at the
off diagonal values for correlations. And over there at grades we can see that grades are pretty
highly correlated with each other. Probably makes sense that if you did good on the first test you
probably did good on the second test etc. Another thing that you can do with the correlation is you
can change the method. I can say instead of doing the Pearson correlation coefficient which is the
default one I can do a Spearman correlation. A Spearman correlation does not assume a linear
relationship. It's also called a rank correlation. So if you did a scatter plot you might see a relationship
that curves. That could still have a correlation of one, because as the rank of one
goes up the rank of the other one goes up, but it's not a linear correlation. So oftentimes I do like
to do a Spearman correlation instead of the Pearson correlation which is the default value.
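As a small sketch of the difference, here is made-up data where y always increases with x but follows a curve rather than a line.

```python
import pandas as pd

# Hypothetical data: y always increases with x, but along a curve (y = x**3).
df = pd.DataFrame({"x": range(1, 11)})
df["y"] = df["x"] ** 3

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"pearson:  {pearson:.3f}")   # below 1: the relationship is not a line
print(f"spearman: {spearman:.3f}")  # 1.000: the ranks agree exactly
```

The Spearman value is one because the ordering of x and y matches exactly, while the Pearson value falls short of one because the points do not lie on a straight line.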
In this section I showed you how to look at correlations. I also showed you one of my pet peeves:
I often see people on social media and other places showing these correlation heatmaps, and
they'll throw a color on them but they don't pin those values. So make sure you use a diverging
color map when you're coloring this and make sure you pin those values so that the negative
value is pinned at negative one and that light value goes at zero.