Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 2: Excel Integration with Adult Income Data
Lecture: Understanding Numbers with Correlations, Scatterplots, and Histograms

0:00 Okay, so I want to look at the numeric columns as well, understand those. I'm doing this, it might seem a little repetitive, but the point here is that
0:09 these are standard procedures that I'm going to do to a lot of data sets. So
0:13 let's do a correlation here. We can see our code that we had from before. We're
0:18 going to say I want to do correlation and use a diverging color map from red
0:23 to blue, and we're going to pin those values from negative 1 to 1. We can see the bluest values and the reddest values. In this case there
0:33 don't appear to be very many strong correlations other than the correlation
0:38 of a column with itself. Let's do a scatter plot to see if we can get some insight into what's going on here. So to do a scatter plot you need to say
0:49 .plot.scatter on a data frame, and then we specify the column that we want in
0:53 the x and y direction here. So I'm going to say in the x direction, education
0:57 number, in the y the capital gain, and again we're seeing something similar
1:03 to what we saw previously in our other scatter plots. We're seeing these things
1:07 line up in columns. So one of the first things I'm going to do is I'm going to
1:13 adjust my alpha. My general standard practice is just to lower that until we
1:18 start seeing some of these values fade away. I might go even lower than
1:23 that, but I think this tells a slightly different story than what we're seeing
1:28 above here. We can also use our jitter code. If I put it in the helpers file, now
1:34 I can say import helpers, and I can say okay let's make education be a column
1:39 that is jittered from the education num column, and then let's plot the education
1:46 column there. Okay and so I think this tells a different story. I probably
1:50 would lower the alpha even more, so maybe let's bring it down to 0.2 or 0.1. I
1:56 might even go lower than this. How do I tell how much alpha is enough? There's no
2:02 hard rule of thumb. My take on this is if I'm seeing really dark concentrated
2:07 values, I kind of want to lower that until I start seeing some transition in
2:12 that. So maybe let's go 0.05 here, and what I'm seeing here is around 9 and 10
2:19 and 13. It's still really dark, and so maybe I'll even go to 0.01. So I think
2:25 this is telling a different story than what we saw above. We're seeing that
2:30 education around 9. Again, if we want to evaluate this, we can pull the education-num
2:35 column out of df, and if I try to write df.education-num with attribute access, I'm going to get an
2:41 issue here. Python actually parses that as df.education minus num, so to pull that off I need to use the index operation syntax with square brackets. And then with that I
2:52 can do a histogram here, and that validates what I said before that around
2:57 9 we're seeing the majority of our data or a concentration, and we are seeing
3:02 that. We're also seeing 13 here, which is probably high school graduation and probably a lot of like junior high growth, so that is a way to understand
3:11 just the single value there, but not the relationship between the columns. We just demonstrated how to visualize and understand that relationship between
3:20 numeric columns. Again, correlation is your friend. Oftentimes you might not see
3:25 a strong correlation. Visualization using scatter plots can also show you relationships between those columns as well, or at least the distributions.
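
The steps in this lecture can be sketched in pandas roughly as follows. This is a minimal sketch, not the code shown in the video: the DataFrame is a small synthetic stand-in for the Adult income data, the column names education-num and capital-gain are assumed from the narration, and jitter is a hypothetical replacement for the function in the course's helpers file, whose actual signature is not shown in the transcript.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Adult income data (values are made up for illustration).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "education-num": rng.choice([9, 10, 13], size=500, p=[0.5, 0.3, 0.2]),
    "capital-gain": rng.integers(0, 10_000, size=500),
    "age": rng.integers(18, 80, size=500),
})


def jitter(series, amount=0.5, seed=0):
    """Hypothetical helper: add small uniform noise so stacked points spread out."""
    noise = np.random.default_rng(seed).uniform(-amount / 2, amount / 2, len(series))
    return series + noise


# Correlation of every numeric column with every other numeric column.
corr = df.corr()

# Attribute access fails for this column: df.education-num parses as
# (df.education) - num, so use the index (square bracket) syntax instead.
edu = df["education-num"]

# New "education" column holding the jittered education-num values.
jittered = df.assign(education=jitter(edu))


def show_plots(data, alpha=0.05):
    """Draw the visuals from the lecture (needs matplotlib; styling also needs jinja2)."""
    # Diverging red-to-blue color map with values pinned to the range [-1, 1].
    styled = data.corr().style.background_gradient(cmap="RdBu", vmin=-1, vmax=1)
    # Lower alpha until the dense columns of points stop looking uniformly dark.
    data.plot.scatter(x="education-num", y="capital-gain", alpha=alpha)
    data.assign(education=jitter(data["education-num"])).plot.scatter(
        x="education", y="capital-gain", alpha=alpha
    )
    # Histogram of the single column to check where the values concentrate.
    data["education-num"].hist()
    return styled
```

In a notebook, calling show_plots(df) as the last expression in a cell displays the colored correlation table along with the plots. The alpha default of 0.05 is only a starting point; as the lecture says, there is no hard rule, so adjust it until the dense regions show some gradation.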

