#100DaysOfCode in Python Transcripts
Chapter: Days 16-18: List comprehensions and generators
Lecture: Cleaning data with list comprehensions
0:00 Let's do a more interesting example.
0:02 I'm going to load in the text of Harry Potter,
0:05 split it into words, and use list comprehensions to filter
0:10 out stop words, or other words that are not meaningful.
0:15 So let's load in Harry Potter, and parse the response,
0:20 which is response.text. I lowercase it,
0:25 and I split it into a list of words.
0:28 And you can see that by just getting a slice.
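A minimal sketch of this step, for readers following along without the video. The lecture takes the text from an HTTP response (response.text); so that the example runs offline, a short made-up string named text stands in for it here. That string and its contents are assumptions, not the lecture's data.

```python
# Stand-in for response.text; the lecture loads the full book over HTTP.
# This sample sentence is made up for illustration.
text = "Harry looked at the owl - and the owl looked at Harry."

# Lowercase the text and split it on whitespace into a list of words.
words = text.lower().split()

# Peek at a slice to check the result.
print(words[:5])
```

Note that split() only breaks on whitespace, so punctuation stays attached to words and a free-standing dash becomes a "word" of its own.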
0:34 Cool. And let's see the most common words so far.
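Counting the most common words can be sketched with collections.Counter from the standard library. The word list below is a made-up stand-in for the book's words, and the names counter and top are illustrative.

```python
from collections import Counter

# Made-up stand-in for the list of words from the book.
words = "the owl and the cat saw the wizard - and the owl left".split()

# Counter tallies how often each word occurs.
counter = Counter(words)

# most_common(n) returns the n most frequent (word, count) pairs.
top = counter.most_common(3)
print(top)
```

On uncleaned text, the top of the list is typically dominated by stop words such as "the".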
0:47 Right, well here are stop words
0:50 we're not really interested in,
0:52 and the dataset also has a couple of other characters,
0:56 that should not really be taken into account,
0:59 for example, do we have a dash in words?
1:06 Right, so we need to filter that out as well.
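The check for a dash can be written as a simple membership test. This is a sketch on a made-up word list; the exact expression typed in the video is not part of the transcript, so treat it as an assumption.

```python
# Made-up stand-in for the word list from the book.
words = "harry looked at the owl - and the owl looked back".split()

# "in" tests whether the list contains an element equal to '-'.
has_dash = '-' in words
print(has_dash)
```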
1:08 So let's clean out any non-alphabetic characters first.
1:17 So this loops over the words, and for any word that contains
1:20 one or more non-alphabetic, or rather non-alphanumeric,
1:24 characters, those characters get stripped out, and I do realize
1:28 that that might lead to empty words in the result list,
1:31 but next we will have another list comprehension that
1:35 takes care of that. So is the dash gone?
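One way to sketch this cleaning step is re.sub inside a list comprehension. The pattern \W+ is an assumption, since the transcript does not show the exact code; it matches runs of characters that are not letters, digits, or underscores. The sample words and the name cleaned are made up for illustration.

```python
import re

# Made-up sample: a lone dash plus words with punctuation attached.
words = ["harry", "-", "owl,", "wizard's", "stone."]

# Remove every run of non-word characters from each word.
# A word made up only of such characters becomes an empty string.
cleaned = [re.sub(r'\W+', '', word) for word in words]

print(cleaned)
print('-' in cleaned)
```

The lone dash turns into an empty string rather than disappearing from the list, which is why a second filtering pass is still needed.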
1:42 And yes, it's gone, but we still have stop words,
1:46 for example "the", which we're not really interested in.
1:50 So let's do another list comprehension to filter those out,
1:54 but for that I need a list of stop words.
1:57 I already prepared it, and the code is the same
2:00 as loading in Harry, I'm just going to copy/paste that.
2:05 And here you have a list of all the stop words.
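Building the stop word list follows the same lowercase-and-split pattern. In this sketch the string stopwords_text stands in for the downloaded response text; its contents and the one-word-per-line layout are assumptions, not the lecture's actual file.

```python
# Hypothetical stand-in for the downloaded stop word text,
# assumed here to hold one word per line.
stopwords_text = "the\nand\nat\na\nof\n"

# Same pattern as for the book: lowercase, then split on whitespace.
stopwords = stopwords_text.lower().split()
print(stopwords[:3])
```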
2:11 Let's wipe those stop words out of the words list so far.
2:16 So words equals word for word in words,
2:22 if word.strip(), and that's what I said before.
2:26 There might be some empty strings in there,
2:28 and by checking if word.strip() is truthy,
2:32 you're basically saying, discard any empty strings.
2:37 So if you have a non-empty string,
2:40 and the word is not in stop words, then it's a go.
2:47 So we need non-empty words, and a word
2:49 that's not a stop word. If so, store that into the new list.
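The comprehension described here can be sketched as follows. The sample words and stopwords lists are made up; only the shape of the comprehension follows the narration.

```python
# Made-up sample: includes an empty string and a whitespace-only string,
# like the leftovers of the cleaning step.
words = ["harry", "", "the", "owl", " ", "and", "dumbledore"]
stopwords = ["the", "and", "at", "a", "of"]

# Keep a word only if it is non-empty after stripping whitespace
# and it is not a stop word.
words = [word for word in words if word.strip() and word not in stopwords]

print(words)
print("the" in words)
```

For a long stop word list, converting it to a set first (stopwords = set(stopwords)) would likely make the "not in" test faster, since set lookups do not scan the whole collection.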
2:54 And then we can do a simple check
2:57 whether "the" is still in words, and now it's gone.
3:01 Now let's do the counter again,
3:06 and see if we have a more relevant result.
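Putting the steps together, the whole flow from raw text to word counts might look like the sketch below. The text, the stop word set, and the \W+ pattern are all assumptions chosen to keep the example small and runnable offline; the lecture works on the full book and a longer stop word list.

```python
import re
from collections import Counter

# Made-up sample text and stop words (assumptions for illustration).
text = ("The owl and the wizard - Dumbledore - saw the owl. "
        "Dumbledore smiled at the owl.")
stopwords = {"the", "and", "at", "saw"}

# 1. Lowercase and split into words.
words = text.lower().split()

# 2. Strip non-word characters from each word.
words = [re.sub(r'\W+', '', word) for word in words]

# 3. Drop empty strings and stop words.
words = [word for word in words if word.strip() and word not in stopwords]

# 4. Count what is left.
top = Counter(words).most_common(2)
print(top)
```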
3:12 And there you go, there's Dumbledore.
3:15 I have to confess I didn't read Harry Potter,
3:18 but this sounds more like Harry Potter.
3:21 So, I think this was a great example to show you
3:25 how you can use list comprehensions to clean up data
3:29 for analysis using just a few lines of code.