Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 2: Excel Integration with Adult Income Data
Lecture: Understanding Counts and Frequencies of Missing Data in Pandas with isna, any, sum, and mean

0:00 Another thing that's very common to do with data is to look for missing values. So let's see how we can do that.
0:07 This is another tool to stick in your tool belt. One of the 400 things that you can do with a Pandas DataFrame is call isna.
0:13 And when we do that, we get back a DataFrame, but if you look at this DataFrame, in the values of it, we have true-false values.
0:21 So there's going to be a true any place where the value is missing. And if I look at this, it looks like I'm not seeing any trues
0:28 just from spot-checking it, but really, as I've said multiple times, humans are not optimized for looking at tables of data. They're optimized for
finding patterns in visualizations or seeing things that pop out. So if you feel this urge to look through a big table of data,
0:47 your spidey sense should go off, telling you that instead of doing that, you should use a computer to either visualize that
0:53 or use a computer to filter the data that you want. In this case, we might want to filter to see if those values actually are missing
or quantify those. One of the many things that we can do with a DataFrame is call any. any is an aggregation,
and what it's going to do is aggregate the DataFrame above. In this case, it asks: are any of the values truthy in each of the columns?
1:18 So this is going to collapse that, give us back a series, and it looks like there aren't any missing values.
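A minimal sketch of the isna and any steps just described. The small DataFrame here is a made-up stand-in, not the course's adult income data, and it deliberately includes missing values so the result is not all False:

```python
import pandas as pd

# Hypothetical stand-in data (assumption: not the actual course dataset)
df = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "workclass": ["State-gov", None, "Private", "Private", None],
    "hours_per_week": [40, 13, 40, 40, 40],
})

# isna returns a DataFrame of the same shape, True wherever a value is missing
missing_mask = df.isna()

# any collapses each column to a single boolean: does it contain any True?
has_missing = missing_mask.any()
print(has_missing)
```

The result is a Series indexed by column name, so it can be used to pick out just the columns that need attention.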
1:24 Alternatively, this is a cool thing that you can do. You can say sum here, and in this case, this counts how many missing values there are.
1:32 You can also do mean and multiply that by 100. That gives you the percent of missing values. In this case, this is not very interesting,
1:39 but let me just show you another example here. I'm going to say df, and let's say I want to know what is the count of folks
1:47 whose age is greater than 50. So I'm going to come down here and say age, and then say .gt(50), and this is a Boolean array.
1:59 It has true-false values in it, so if I wanted to find the count of those, I can just say let's sum that. So there are 6,460 people
2:08 whose age is greater than 50. If I want to know what percent that is, I can say mean and then multiply that by 100,
2:16 and this didn't work. It didn't have a mul here, because if you look at mean, mean is an aggregation. In this case,
2:22 because I am working on one dimension and I aggregate it to a scalar value, I'm now entering into the Python realm. This is 0.19, so in this case,
2:31 I would come in here and say times 100. So about 20% of our people are older than 50. In this section, I showed you how to quantify
2:43 how many values are missing. That can be really important, especially if you're doing machine learning, because a lot of machine learning
2:50 algorithms don't like to have missing values. Also, if you have survey data, you might want to understand what percent is missing.
2:57 If you're reporting on that, and 99% of it is missing, it might not be really interesting to report on.
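The counting and percentage steps from this lecture can be sketched as follows. This is an illustration under assumptions: the DataFrame and its column names are invented for the example, so the numbers will not match the 6,460 people or roughly 20% seen in the video.

```python
import pandas as pd

# Hypothetical stand-in data (assumption: not the actual course dataset)
df = pd.DataFrame({
    "age": [25, 61, 47, 52, 38],
    "occupation": ["Sales", None, "Tech-support", None, "Sales"],
})

# True counts as 1 and False as 0, so sum counts missing values per column
missing_counts = df.isna().sum()

# mean on a DataFrame returns a Series, which has a .mul method
missing_percent = df.isna().mean().mul(100)

# Comparing a single column gives a Boolean Series
over_50 = df["age"].gt(50)
count_over_50 = over_50.sum()

# mean on a Series returns a scalar, which has no .mul method,
# so use the plain * operator instead
percent_over_50 = over_50.mean() * 100

print(missing_counts)
print(missing_percent)
print(count_over_50, percent_over_50)
```

The difference between the two percentage lines is the dimension being aggregated: a two-dimensional DataFrame collapses to a Series, while a one-dimensional Series collapses to a single number.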

