#100DaysOfCode in Python Transcripts
Chapter: Days 16-18: List comprehensions and generators
Lecture: Cleaning data with list comprehensions
0:00
Let's do a more interesting example. I'm going to load in the text of Harry Potter, split it into words, and use list comprehensions to filter
0:11
out stop words, or other words that are not meaningful. So let's load in Harry Potter, and parse the response, which is response.text, I lowercase it,
0:26
and I split it into a list of words. And you can see that by just getting a slice. Cool. And let's see the most common words so far.
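A minimal sketch of these first steps. The short sample_text string is a hypothetical stand-in for response.text, since the download itself is not shown here; the lowercase, split, slice, and count steps follow what is described above.

```python
from collections import Counter

# Hypothetical stand-in for response.text (the real text comes from the download).
sample_text = "The boy who lived - the boy and the owl - Harry and the Dursleys"

# Lowercase the text and split on whitespace to get a list of words.
words = sample_text.lower().split()

# Peek at a slice of the list to confirm the split worked.
first_words = words[:5]
print(first_words)

# Count occurrences and look at the most common words so far.
top_words = Counter(words).most_common(3)
print(top_words)
```

On this sample, the stop word "the" dominates the count and the dash shows up as a word of its own, which is the same problem seen in the real data.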
0:48
Right, well here are stop words we're not really interested in, and the dataset also has a couple of other characters,
0:57
that should not really be taken into account. For example, do we have a dash in words? Right, so we need to filter that out as well.
1:09
So let's clean out any non-alphabetic characters first. So this loops over the words, and in any word that contains
1:21
one or more non-alphabetic, or rather non-alphanumeric, characters, those characters get stripped out, and I do realize
1:29
that that might lead to empty words in the result list, but next we will have another list comprehension that takes care of that. So is the dash gone?
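A sketch of what this cleaning comprehension could look like. The exact pattern is not visible in the transcript, so the use of re.sub with \W+ here is an assumption, and the word list is made up for illustration. Note that \W keeps underscores and digits, so it removes non-alphanumeric rather than strictly non-alphabetic characters.

```python
import re

# Hypothetical sample of words as they might look after splitting.
words = ["harry", "-", "potter's", "owl,", "the"]

# Remove every run of non-word characters from each word (assumed pattern).
# A word made only of such characters, like "-", becomes an empty string.
words = [re.sub(r"\W+", "", word) for word in words]

print(words)
print("-" in words)
```

The dash is gone, but it leaves an empty string behind in the list, which is what the next comprehension has to deal with.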
1:43
And yes, it's gone, but we still have stop words, for example "the", which we're not really interested in.
1:51
So let's do another list comprehension to filter those out, but for that I need a list of stop words. I already prepared it, and the code is the same
2:01
as loading in Harry, I'm just going to copy/paste that. And here you have a list of all the stop words.
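A sketch of turning the downloaded stop word text into a list, reusing the same lowercase and split steps. The stop_words_text string is a hypothetical stand-in for the response text, and converting to a set is an optional extra that is not taken from the video.

```python
# Hypothetical stand-in for the downloaded stop word text (one word per line).
stop_words_text = "A\nabout\nThe\nand\n"

# Same steps as for the book text: lowercase, then split on whitespace.
stop_words = stop_words_text.lower().split()
print(stop_words)

# Optional: a set makes the later "word not in stop_words" checks faster.
stop_word_set = set(stop_words)
```

With a list, each membership test scans the whole list; with a set, the lookup takes roughly constant time, which matters when filtering every word of a book.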
2:12
Let's wipe those stop words out of the words list so far. So words equals word for word in words, if word.strip(), and that's what I mentioned before.
2:27
There might be some empty strings in there, and by checking if word.strip() is truthy, you're basically saying: discard any empty strings.
2:38
So if you have a non-empty string, and the word is not in stop words, then it's a go. So we need non-empty words, and a word
2:50
that's not a stop word. If so, store that into the new list. And then we can do a simple check. If "the" is still in words, and now it's gone.
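The filter just described can be sketched as follows. The words and stop_words values are small made-up samples, not the real data.

```python
# Hypothetical sample data: cleaned words (including empty strings) and stop words.
stop_words = {"the", "and", "a", "of"}
words = ["harry", "", "the", "owl", "and", "dumbledore", " "]

# Keep a word only if it is non-empty after stripping whitespace
# and it is not one of the stop words.
words = [word for word in words if word.strip() and word not in stop_words]

print(words)
print("the" in words)
```

An empty string is falsy in Python, so word.strip() on its own is enough to drop both empty and whitespace-only entries.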
3:02
Now let's do the counter again, and see if we have a more relevant result. And there you go, there's Dumbledore.
3:16
I have to confess I didn't read Harry Potter, but this sounds more like Harry Potter. So, I think this was a great example to show you
3:26
how you can use list comprehensions to clean up data for analysis using just a few lines of code.