# Eve: Building RESTful APIs with MongoDB and Flask Transcripts Chapter: Setup and tools Lecture: Installing Mongo

0:00 I want to explore correlations. Correlations are the relationships between two numeric columns.
0:07 And this is a good way to understand if one value is going up, does the other value go up
0:11 or down or does it have no impact on it. So let's see how we can do that with pandas.
0:16 I'm going to say df.corr and I'm going to pass in numeric_only=True because otherwise it's going to
0:21 complain about that. And look at what this returns. It's a data frame. In the index we have all the
0:27 numeric columns and in the columns we have all the numeric columns. In the values here we have
0:33 what's called the Pearson correlation coefficient. This is a number between negative one and one.
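
As a minimal sketch of the call described above: the student-style DataFrame here is made up for illustration (the column names and values are assumptions, not the dataset from the lecture), and it assumes a pandas version that accepts numeric_only in corr.

```python
import pandas as pd

# Hypothetical data; column names and values are invented for this sketch.
df = pd.DataFrame({
    "school": ["GP", "GP", "MS", "MS", "GP"],  # non-numeric, skipped below
    "failures": [0, 1, 3, 0, 2],
    "G1": [15, 11, 6, 14, 9],
    "G2": [16, 10, 5, 14, 8],
})

# numeric_only=True restricts the calculation to the numeric columns.
corr = df.corr(numeric_only=True)

# The result is a square DataFrame: numeric columns in both the index and the columns.
print(corr)
```

The values in the result are Pearson coefficients, since that is the default method.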
0:38 A value of one means that as one value goes up the other value goes up in a linear fashion. If
0:44 you were to scatter plot that you would see a line going up and to the right. A correlation of negative
0:49 one means that if you scatter plotted it you'd see a line going down and to the right. A correlation
0:56 of zero means that as one value is going up the other value might go up or down. You might see a
1:01 flat line but you also might see alternating values. As one value increases the other value
1:07 may or may not increase. They don't have a relationship to each other. Now humans are
1:12 not optimized for looking at big tables of data like this. Generally what I want to do when I have this
1:17 correlation table is to look for the highest values and the lowest values. But I might want
1:22 to look for values around zero and it's kind of hard to pick those out. If you look you might
1:27 notice that along the diagonal we do see a bunch of ones and that's because the correlation of a
1:34 column with itself is one: as the column goes up, the column goes up. So you do see that value there
1:40 but we're actually not interested in that value. We want to look at the off diagonal values. So let
1:46 me give you some hints on how we can do this. One of the things that pandas allows us to do is add a
1:51 style. So I'm going to use this style attribute and off of that I can say background gradient.
1:59 Let me note one more thing here. This is showing how to use what's called chaining in pandas. I'm
2:05 actually doing multiple operations to the same data frame here and I put parentheses around it.
2:12 What that allows me to do is put each step on its own line and that makes it read like a recipe.
2:18 I'm first going to do this then I'm going to do this then I'm going to do this. Do I need parentheses?
2:21 No I don't. If I didn't use parentheses I would have to put all of that code on one line and it
2:27 gets really hard to read. So I recommend that when you write your chains you put parentheses at the
2:31 front and then parentheses at the end and then just space it out, with each operation on its own line. It's
2:36 going to make your life a lot easier. Okay so what we've done is we've added this background gradient.
2:41 The default gradient here is a blue gradient. It goes from white to blue, dark blue. Again along
2:48 that diagonal you do see the dark blue but this is actually not a good gradient. What we want to
2:55 use when we're doing a heat map of a correlation is to use a color map that is diverging. Meaning
3:02 it goes from one color and then hopefully passes through like a light or white color and goes to
3:07 another color. That way we can look for one color for the negative values and the other color for
3:12 the positive values. So let's see if we can do that. I'm going to specify a diverging color map.
3:17 That's RdBu, the red-blue color map. And it looks like we are seeing those diverging values
3:23 now. Now there is one issue with this. The issue is that if you look for the reddest values I'm
3:29 seeing pretty red values for example around negative 0.23. That's not negative one and I
3:36 would like my red values to actually be at negative one because I also want my white values to be
3:42 around zero. If I look at my white values it looks like they're around 0.42 right now. Note that the
3:49 blue values are at one. Again that's because that diagonal by definition is going to be one.
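
A rough way to see why white lands away from zero, as a sketch: this assumes that when no limits are given the color scale is stretched between the smallest and largest value it sees, which is my understanding of the default. The numbers are hypothetical, chosen to resemble one column of a correlation table.

```python
import pandas as pd

# Hypothetical column of correlation values, including the 1.0 from the diagonal.
col = pd.Series([1.0, 0.85, 0.10, -0.16])

# Unpinned: the scale runs from the smallest to the largest value in the data,
# so the middle (white) color sits halfway between them.
lo, hi = col.min(), col.max()
unpinned_mid = (lo + hi) / 2

# Pinned to the full correlation range of -1 to 1, the middle color sits at zero.
pinned_mid = (-1 + 1) / 2

print(unpinned_mid, pinned_mid)
```

With these made-up numbers the unpinned midpoint comes out well above zero, so mildly positive correlations would be drawn as white.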
3:56 So pandas has an option for us to do that. We can specify these vmin and vmax values to specify
4:02 where those get pinned down. And when we do that we actually get a proper coloring here. Now this
4:09 makes it really easy to find the reddest values and I can see that failures have a large negative
4:17 correlation with the grade. Again we do have that diagonal there but we want to look at the
4:22 off diagonal values for correlations. And over there at grades we can see that grades are pretty
4:28 highly correlated with each other. Probably makes sense that if you did good on the first test you
4:33 probably did good on the second test etc. Another thing that you can do with the correlation is you
4:40 can change the method. I can say instead of doing the Pearson correlation coefficient which is the
4:44 default one I can do a Spearman correlation. A Spearman correlation does not assume a linear
4:49 relationship; it's also called a rank correlation. So you might see a relationship where,
4:55 if you did a scatterplot, it curves. That could have a correlation of one, because as the rank of one
5:01 goes up the rank of the other one goes up, but it's not a linear correlation. So oftentimes I do like
5:07 to do a Spearman correlation instead of the Pearson correlation which is the default value.
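
Here is a small sketch of that difference, using made-up data where y always rises with x but along a curve (y is x cubed). The variable names are just for illustration.

```python
import pandas as pd

# Hypothetical curved relationship: y always increases with x, but not linearly.
x = pd.Series(range(1, 11))
curved = pd.DataFrame({"x": x, "y": x ** 3})

# Pearson (the default) measures how close the points are to a straight line.
pearson = curved.corr(method="pearson").loc["x", "y"]

# Spearman works on ranks, so any strictly increasing relationship scores one.
spearman = curved.corr(method="spearman").loc["x", "y"]

print(pearson, spearman)
```

The Spearman value is one here because the ranks of x and y line up exactly, while the Pearson value comes out lower because the points do not fall on a straight line.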
5:12 In this section I showed you how to look at correlations. I showed you one of my pet peeves:
5:18 I often see, in social media and other places, people showing these correlation heatmaps, and
5:22 they'll throw a color on them but they don't pin those values. So make sure you use a diverging
5:27 color map when you're coloring this and make sure you pin those values so that the negative
5:32 value is pinned at negative one and that light value goes at zero.
