Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 2: Excel Integration with Adult Income Data
Lecture: Understanding Numbers with Correlations, Scatterplots, and Histograms
0:00
Okay, so I want to look at the numeric columns as well and understand those. This might seem a little repetitive, but the point here is that
0:09
these are standard procedures that I'm going to do to a lot of data sets. So
0:13
let's do a correlation here. We can see our code that we had from before. We're
0:18
going to say I want to do correlation and use a diverging color map from red
0:23
to blue, and we're going to pin those values from negative 1 to 1. We can see the bluest values and the reddest values. In this case there
0:33
don't appear to be many strong correlations other than the correlation
0:38
of a column with itself. Let's do a scatter plot to see if we can get some insight into what's going on here. So to do a scatter plot you need to say
0:49
.plot.scatter on a data frame, and then we specify the column that we want in
0:53
the x and y direction here. So I'm going to say in the x direction, education
0:57
number, in the y the capital gain, and again we're seeing something similar
1:03
that we saw previously in our other scatter plots. We're seeing these things
1:07
line up in columns. So one of the first things I'm going to do is I'm going to
1:13
adjust my alpha. My general standard practice is just to lower that until we
1:18
start seeing some of these values fade away. I might go even lower than
1:23
that, but I think this tells a slightly different story than what we're seeing
1:28
above here. We can also use our jitter code. Since I put it in the helpers file, now
1:34
I can say import helpers, and I can say okay let's make education be a column
1:39
that is jittered from the education num column, and then let's plot the education
1:46
column there. Okay and so I think this tells a different story. I probably
1:50
would lower the alpha even more, so maybe let's bring it down to 0.2 or 0.1. I
1:56
might even go lower than this. How do I tell how much alpha is enough? There's no
2:02
hard rule of thumb. My take on this is if I'm seeing really dark concentrated
2:07
values, I kind of want to lower that until I start seeing some transition in
2:12
that. So maybe let's go 0.05 here, and what I'm seeing here is around 9 and 10
2:19
and 13. It's still really dark, and so maybe I'll even go to 0.01. So I think
2:25
this is telling a different story than what we saw above. We're seeing that
2:30
education is concentrated around 9. Again, if we want to evaluate this, we can say df education
2:35
number, and if I try and do education number like this, I'm going to get an
2:41
issue here. It's actually trying to do education minus number, so to pull that off I need to use this index operation syntax here. And then with that I
2:52
can do a histogram here, and that validates what I said before that around
2:57
9 we're seeing the majority of our data or a concentration, and we are seeing
3:02
that. We're also seeing 13 here, which is probably high school graduation and probably a lot of junior high growth. So that is a way to understand
3:11
just the single value there, but not the relationship between columns. We just demonstrated how to visualize and understand the relationship between
3:20
numeric columns. Again, correlation is your friend. Oftentimes you might not see
3:25
a strong correlation. Visualization using scatter plots can also show you relationships between those columns, or at least the distributions.
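
The steps walked through in this lecture can be sketched roughly as follows. This is a minimal sketch, not the course's actual notebook: the data is a small synthetic stand-in for the Adult income data, the column names "education-num" and "capital-gain" are assumptions, and jitter is a hypothetical replacement for the function in the helpers file, which is not shown in the transcript.

```python
import numpy as np
import pandas as pd


def jitter(series, amount=0.5, seed=None):
    """Hypothetical stand-in for helpers.jitter: add small uniform noise."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-amount / 2, amount / 2, size=len(series))
    return series + noise


def make_demo_frame(n=1_000, seed=0):
    """Synthetic stand-in for the Adult income data (column names assumed)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "education-num": rng.choice([9, 10, 13], size=n, p=[0.5, 0.3, 0.2]),
        "capital-gain": rng.exponential(1_000, size=n).round(),
    })


def styled_correlation(df):
    """Correlation matrix colored red to blue, with the scale pinned to [-1, 1]."""
    return df.corr().style.background_gradient(cmap="RdBu", vmin=-1, vmax=1)


def jittered_scatter(df, alpha=0.05):
    """Scatter plot of jittered education against capital gain.

    Lower alpha until the darkest regions start to show some transition.
    """
    return (
        df.assign(education=jitter(df["education-num"], seed=42))
          .plot.scatter(x="education", y="capital-gain", alpha=alpha)
    )


def education_hist(df):
    """Histogram of a single column.

    Attribute access such as df.education-num would be parsed as a
    subtraction, so the column is pulled out with bracket indexing.
    """
    return df["education-num"].hist()


df = make_demo_frame()
corr = df.corr()  # plain correlation matrix, no styling
```

In a notebook, calling styled_correlation(df), jittered_scatter(df, alpha=0.05), and education_hist(df) would render the colored table and the two plots. The styled table needs jinja2 installed and the plots need matplotlib.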