Eve: Building RESTful APIs with MongoDB and Flask Transcripts
Chapter: Your first Eve service
Lecture: Defining document schemas
0:00
In this section, I'm going to show you how to remove stop words. Stop words are words that don't add value to your text.
0:07
Oftentimes when we're doing natural language processing, we want to get rid of stop words.
0:11
Things like "a" and "the": words that occur a lot but don't really mean anything or add value.
0:17
We're going to use the spaCy library to do that. Make sure you install that.
0:21
After you install it, you need to download an English model so that spaCy understands how to process English.
0:27
This is the command to download the small English model. Then you can validate that your spaCy install worked.
0:37
You can see that I have downloaded the small one. I'm going to import spaCy and then load that small English model.
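As a rough sketch of the install, validate, and load steps described above: the shell commands appear as comments, and the model name `en_core_web_sm` is an assumption (it is spaCy's small English pipeline, but the transcript does not name it). The helpers `is_installed` and `load_english` are hypothetical names used for illustration only.

```python
import importlib.util

MODEL_NAME = "en_core_web_sm"  # assumed: spaCy's small English pipeline

# Shell commands, run once outside Python:
#   pip install spacy
#   python -m spacy download en_core_web_sm
#   python -m spacy validate


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be imported in this environment."""
    return importlib.util.find_spec(package) is not None


def load_english(model: str = MODEL_NAME):
    """Load the spaCy pipeline, raising a readable hint if something is missing."""
    if not is_installed("spacy"):
        raise RuntimeError("spaCy is not installed; run: pip install spacy")
    if not is_installed(model):
        raise RuntimeError(f"Model missing; run: python -m spacy download {model}")
    import spacy  # imported lazily so the checks above can report first

    return spacy.load(model)
```

Calling `nlp = load_english()` would then give the NLP object used in the rest of the section. The package check works because spaCy models are installed as ordinary Python packages.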
0:47
Now I'm going to remove the stop words. I'm going to use apply here and say, okay, here's the review text. We're going to apply this function to it.
0:57
And we pass in this NLP object. If we look at the function, it passes the text to the NLP object to get a document,
1:07
which understands what's going on with the text. Then I'm going to loop over the tokens in the document.
1:14
And if a token is not a stop word, I'm going to keep it. So let's run that. I'm also using the time cell magic at the top.
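The function described here might look something like the sketch below. It is not the lecture's exact code: the names `remove_stop_words`, `demo`, and `reviews` are assumptions, and so is the choice to join the kept tokens with single spaces. The `token.is_stop` attribute and the forwarding of extra keyword arguments by `Series.apply` are standard spaCy and pandas behavior.

```python
def remove_stop_words(text, nlp):
    """Return `text` with stop words dropped, joined by single spaces."""
    doc = nlp(text)  # the NLP object turns the raw string into a document
    kept = [token.text for token in doc if not token.is_stop]
    return " ".join(kept)


def demo():
    """Apply the function to a tiny Series (needs spaCy, pandas, and the model)."""
    import pandas as pd
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model name
    reviews = pd.Series(["I saw this at the premiere in Melbourne."])
    # Extra keyword arguments to Series.apply are forwarded to the function,
    # which is how the NLP object gets passed in.
    return reviews.apply(remove_stop_words, nlp=nlp)
```

Calling `demo()` returns a new Series with the cleaned text. Punctuation tokens are kept in this sketch, so they end up separated from the preceding word by a space.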
1:24
This is going to take a while. This is using apply, which is slow. It's also working with strings, which tend to be slow as well.
1:32
But there's not really a way to vectorize this and make it much quicker. So we'll just deal with that. Okay, so this takes about 30 seconds.
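Outside a notebook, where the time cell magic is not available, a timing like the one above could be taken with a small helper. This is a minimal sketch using only the standard library. The helper name `timed` is hypothetical, and the workload here is a stand-in rather than the actual apply call.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload. In the notebook this would wrap the apply call instead,
# for example timed(reviews.apply, remove_stop_words, nlp=nlp), where those
# names are hypothetical.
result, seconds = timed(sum, range(1_000_000))
print(f"took {seconds:.3f} s")
```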
1:44
You can see that there's what looks like some HTML in here. So I might want to further replace some of that HTML.
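One possible way to clean up leftover HTML is sketched below. It assumes the residue is simple markup such as `<br />` tags and entities like `&amp;`. The function name `strip_html` is hypothetical, and a regular expression is only a crude approach: it would also remove ordinary text that happens to sit between `<` and `>`.

```python
import html
import re

BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)  # line-break tags
ANY_TAG = re.compile(r"<[^>]+>")  # any other simple tag


def strip_html(text: str) -> str:
    """Replace <br> tags with spaces, drop other tags, and decode entities."""
    text = BR_TAG.sub(" ", text)  # keep words on either side of a break apart
    text = ANY_TAG.sub("", text)  # remove remaining tags such as <b> or </i>
    text = html.unescape(text)  # turn &amp; into & and so on
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
```

On a pandas Series of strings, the same function could be applied with `reviews.apply(strip_html)`, where `reviews` is a hypothetical name, before the stop words are removed.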
1:53
And I could put in code like this to do further manipulation there. Let's just load the original data so you can compare the two data sets
2:11
and see that the stop words are being removed. Okay, so that's looking better. Here in the original data you can see "for a movie that gets no respect."
2:24
It got changed to "movie gets respect, sure, lot, memorable quotes." You can see the bottom one here: "I saw this at the premiere in Melbourne."
2:34
That became "Saw premiere in Melbourne." Do you need to remove stop words? No, you don't, but this is something that's going to make your models
2:41
perform better because there's a lot of noise in those stop words.