Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 6: Working with Time Series - Air Quality over Time
Lecture: Interpolating and Filling in Missing values in Pandas

0:00 In this section I want to look at missing values. So here's our plot from the last section.
0:09 Let's show what we can do with that. One of the things we can do is interpolate that. Let's try that. And I got an error here. It says I got an invalid
0:21 fill error. So what's going on here? Interpolate is supposed to take a value and connect it to the next value if there are missing values in between.
0:33 It turns out that the PyArrow-backed version of the data doesn't like to do interpolation. So I'm going to convert this
0:41 back to NumPy values, so I can show you how interpolation works. I'm going to say
0:49 astype(float) and then we'll do interpolate. And you can see, for example, here between 9 and 10 we have those values
0:57 connected. There it is without the interpolation. There's also a small portion over here after 11 but it's a little bit hard to see that.
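The conversion and interpolation steps can be sketched on a small series. This is a minimal illustration, not the lecture's notebook code: the series name `temps` and its values are made up, and the data here already uses NumPy floats, so `astype(float)` is included only to mirror the step described above.

```python
import numpy as np
import pandas as pd

# Hypothetical readings indexed by hour, with a two-hour gap (not the course data).
temps = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0], index=[8, 9, 10, 11, 12])

# astype(float) yields a NumPy float64 series; interpolate() defaults to
# linear interpolation, treating the points as evenly spaced.
filled = temps.astype(float).interpolate()
print(filled)
```

The gap between the known readings is replaced by evenly spaced values, which is the "connected" line in the plot.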
1:05 Another option is we can do what's called a forward fill. Forward fill is going to take the last known value and it's going to
1:13 push it forward. Let's do that. And you can see that it's pushing this value forward. We do get a warning here.
1:25 This warning is a little bit weird. It's saying it's falling back on a non-PyArrow path. It's odd that this fallback works
1:33 for forward fill but doesn't work for interpolate. I wish Pandas was consistent
1:37 here. There's also a back fill. The back fill is going to take the next known value and pull it back. Let's run that.
1:45 And it's a little bit hard to tell the difference here. The forward fill is pushing this value forward. The back fill is pulling this value back. A note on
1:53 back fill. If you're doing machine learning, you want to stay away from back fill, and
1:57 you probably want to stay away from interpolate as well because those are working
2:01 with values from the future. Interpolate is connecting the previous value
2:05 and the next seen value. The next seen value is a value possibly from the future. A back fill is taking possibly future values and pulling them back.
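To make the difference concrete, here is a small side-by-side sketch of the three methods. The series `s` and the `compare` table are made up for illustration and are not the air quality data from the lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a two-value gap in the middle.
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])

compare = pd.DataFrame({
    "raw": s,
    "ffill": s.ffill(),              # pushes the last known value forward
    "bfill": s.bfill(),              # pulls the next known value back
    "interpolate": s.interpolate(),  # connects the values on both sides of the gap
})
print(compare)
```

In the gap rows, only the ffill column is built purely from earlier values. The bfill and interpolate columns both depend on the 4.0 that arrives later.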
2:13 So if you are doing machine learning, you don't want to use these, because they leak information from the future into your data. That's often considered cheating, and it can lead
2:21 to models that won't work or will give you bad results. Another thing you can do is replace missing values. Here I'm just saying
2:29 replace that with 22. Is 22 the right value? No, not necessarily. But you can see that 22 pops up here and here. So what is the correct
2:37 value to replace it with? Again, the answer is it depends. You probably should talk to a subject matter expert and see what to replace it with.
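Filling with a constant can be sketched as follows. The use of `fillna` here is an assumption, since the transcript does not show which call the lecture used, and the series `temps` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with two missing values.
temps = pd.Series([20.5, np.nan, 23.0, np.nan, 21.5])

# Replace every missing value with the constant 22 (an arbitrary choice,
# as the lecture points out).
filled = temps.fillna(22)
print(filled)
```

Every gap gets the same value, which is why 22 shows up in more than one place on the plot.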
2:45 Here I'm saying replace -200 with NaN, and then look at this. I'm using our friend pipe, and I'm saying let's just fill in the missing values with
2:57 the mean of the current data frame. Why am I using a lambda instead of just fillna? Because pipe passes the current data frame as the first
3:05 argument to the function you give it, and the lambda lets me refer to that intermediate data frame to compute its mean. So here we see that the mean value is getting
3:13 stuck in there over here. It's a little bit hard to see but the blue
3:17 value is getting the mean value as well. In this section I showed you how to deal with
3:21 missing values. Pandas has a lot of flexibility there. Again, talk to a subject matter expert and make sure you're doing the right thing.
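The replace-then-pipe chain described in this lecture can be sketched like this. It is a minimal example under stated assumptions: the data frame `air` and its column names are made up, and -200 is the missing-value marker mentioned in the lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical data where -200 marks a missing reading.
air = pd.DataFrame({
    "temp": [10.0, -200.0, 14.0],
    "humidity": [40.0, 50.0, -200.0],
})

filled = (
    air
    .replace(-200, np.nan)  # turn the sentinel into real missing values
    # pipe hands the intermediate frame to the lambda as df_, so the column
    # means are computed after the -200 values have been removed
    .pipe(lambda df_: df_.fillna(df_.mean()))
)
print(filled)
```

Calling `air.fillna(air.mean())` directly would compute the means with the -200 values still included, which is why the lambda is applied to the intermediate frame instead.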

