Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 2: Excel Integration with Adult Income Data
Lecture: Quantifying Strings with filter and value_counts
0:00
I'm going to show how to explore some of the object columns again. Let's jump into that code.
0:06
So again, I can call select_dtypes and pass string. This gives me all of the string columns.
0:12
In Pandas 1, you would say object there, but because we have those PyArrow types, we can say string here. Remember what I said previously,
0:21
value_counts is your friend here. So let's explore education. I'm going to say education.value_counts. And here's a summary of that.
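Here is a minimal sketch of those two calls on a tiny made-up frame standing in for the adult income data. The rows, and the cast to the pandas string dtype, are assumptions for illustration; the course data gets its string columns from the PyArrow backend instead.

```python
import pandas as pd

# Hypothetical sample rows; the real data set has many more rows and columns.
df = pd.DataFrame(
    {
        "age": [39, 50, 38, 53],
        "education": ["Bachelors", "HS-grad", "HS-grad", "11th"],
        "workclass": ["State-gov", "Private", "Private", "Private"],
    }
).astype({"education": "string", "workclass": "string"})

# Keep only the columns whose dtype is a string type.
string_cols = df.select_dtypes("string")

# Count how often each education level appears, most frequent first.
edu_counts = df["education"].value_counts()
print(string_cols.columns.tolist())
print(edu_counts)
```

If the columns were plain object dtype, as in older pandas, the selector would be "object" rather than "string".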
0:30
If I wanted to visualize that, look how easy this is. I'm going to say .plot.barh to do a horizontal bar plot,
0:38
and we can visualize that really easily there. So we see that most of these are high school graduates.
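A rough sketch of that horizontal bar plot, assuming matplotlib is installed and using made-up education values rather than the real column. The Agg backend line is only there so the snippet runs outside Jupyter; in a notebook it can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside Jupyter

import pandas as pd

# Hypothetical values standing in for the education column.
education = pd.Series(
    ["HS-grad", "HS-grad", "Bachelors", "Masters", "HS-grad", "Bachelors"],
    name="education",
)

# One horizontal bar per distinct value, sized by its count.
ax = education.value_counts().plot.barh()
ax.set_xlabel("count")
```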
0:43
We have some college graduates, some masters, etc. If I want to filter columns, I want to get the columns that have education in them.
0:54
One of the things I can do is use this filter operation, and here are the columns that have education in them.
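The transcript does not show the exact call, so the following is a sketch of three ways filter can pick columns by name. The column names are assumed from the adult income data, and the rows are made up. Note that filter matches on labels, not on the values in the cells.

```python
import pandas as pd

# Hypothetical sample with assumed column names.
df = pd.DataFrame(
    {
        "age": [39, 50],
        "education": ["Bachelors", "HS-grad"],
        "education-num": [13, 9],
        "occupation": ["Adm-clerical", "Exec-managerial"],
    }
)

# Columns whose name contains the substring "education".
by_substring = df.filter(like="education")

# Columns whose name matches a regular expression.
by_regex = df.filter(regex=r"^education")

# Columns listed explicitly by name.
by_name = df.filter(items=["age", "occupation"])

print(by_substring.columns.tolist())
```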
1:02
Note that value_counts works with numbers as well. So if we take the age column, we might want to summarize the age.
1:10
Again, I'd probably do a histogram here, but we can do a value_counts on that. We can sort the index there.
1:17
In fact, if we do a plot and we do a bar on that, we're kind of getting the histogram by doing that.
1:24
So this is a very manual way of doing a histogram here. Again, I would probably just do age.hist to get a similar thing here.
1:32
If we want to bump up the bins, we'd say bins=20, and maybe we set figsize to 8 by 3 so it doesn't come off the screen.
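To compare the manual approach with hist, here is a small sketch on made-up ages, assuming matplotlib is available. The explicit plt.figure() call is an addition so the histogram does not draw over the bar plot when both run in one script.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside Jupyter

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ages standing in for the age column.
age = pd.Series(
    [25, 38, 28, 44, 18, 34, 29, 63, 24, 55, 38, 25, 25, 44], name="age"
)

# Manual histogram: count each distinct age, order by age, draw one bar each.
age_counts = age.value_counts().sort_index()
bar_ax = age_counts.plot.bar()

# Built-in histogram: 20 equal-width bins on a wide, short figure.
plt.figure()
hist_ax = age.hist(bins=20, figsize=(8, 3))
```

The bar plot gives one bar per distinct age and treats ages as categories, so gaps between ages are not shown to scale; hist places the bins on a numeric axis.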
1:44
In this section, we looked at pulling out those object columns. Again, value_counts is your friend to summarize those.
1:51
We also saw that you can use filter with a regular expression to limit which columns you pull out.
1:59
Filter has a bunch of other options as well. Again, I recommend that you pull that documentation up in Jupyter and see how to use it in other contexts.