Eve: Building RESTful APIs with MongoDB and Flask Transcripts
Chapter: Your first Eve service
Lecture: Defining document schemas
In this section, I'm going to show you how to remove stop words. Stop words are words that don't add value to your text.
Oftentimes when we're doing natural language processing, we want to get rid of stop words.
Things like "a" and "the": words that occur a lot but don't really mean anything or add value.
We're going to use the spaCy library to do that. Make sure you install that.
After you install it, you need to download an English model so that it understands how to process English.
This is the command to download the small English model. Then you can validate that your spaCy install worked.
You can see that I have downloaded the small one. I'm going to import spaCy and then load that small English model.
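As a rough sketch of those steps: the install, download, and validate commands appear as comments because they run in a shell, and `en_core_web_sm` is assumed to be the small English model mentioned here. The helper name `load_small_english_model` is made up for illustration.

```python
# Shell commands (run these in a terminal, not in Python):
#   pip install spacy
#   python -m spacy download en_core_web_sm
#   python -m spacy validate

# Assumption: the "small" English model is spaCy's en_core_web_sm package.
MODEL_NAME = "en_core_web_sm"


def load_small_english_model(name=MODEL_NAME):
    """Load a spaCy pipeline, with a clearer error if it was never downloaded."""
    # Imported lazily so that defining this helper does not itself require spaCy
    import spacy

    try:
        return spacy.load(name)
    except OSError as err:
        # spacy.load raises OSError when the model package is not installed
        raise RuntimeError(
            f"Model {name!r} not found; run: python -m spacy download {name}"
        ) from err
```

Calling `nlp = load_small_english_model()` would then give the NLP object that gets passed around in the rest of this section.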
Now I'm going to remove the stop words. I'm going to use apply here and say, okay, here's the remove text. We're going to apply this function here.
And we pass in this NLP object. If we look at the function, it gets a document from the NLP object,
which understands what's going on with the text. Then I loop over the tokens in the document.
If a token is not a stop word, I keep it. So let's run that. I'm also using the time cell magic at the top.
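A minimal sketch of what such a function might look like. The names `remove_stop_words` and `demo` and the `review` column are hypothetical, not taken from the notebook. `is_stop` and `text` are real spaCy token attributes, and extra keyword arguments given to pandas `Series.apply` are passed through to the function.

```python
def remove_stop_words(text, nlp):
    """Return `text` with the tokens spaCy flags as stop words dropped."""
    doc = nlp(text)  # tokenize and annotate the text
    # Keep only the tokens that are not stop words, joined by single spaces
    return " ".join(token.text for token in doc if not token.is_stop)


def demo():
    """Apply the function to a tiny one-row DataFrame (hypothetical column name)."""
    import pandas as pd
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the small model is installed
    df = pd.DataFrame({"review": ["I saw this at the premiere in Melbourne."]})
    # apply() calls remove_stop_words once per row; nlp is forwarded as a keyword
    df["no_stop"] = df["review"].apply(remove_stop_words, nlp=nlp)
    print(df)
```

In a notebook, putting `%%time` on the first line of the cell that calls `demo()` reports how long the run takes.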
This is going to take a while. This is using apply, which is slow. It's also working with strings, which tend to be slow as well.
But there's not really a way to vectorize this and make it much quicker. So we'll just deal with that. Okay, so this takes about 30 seconds.
You can see that I've got what looks like some HTML in here. So I might want to further replace some of that HTML.
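One possible way to do that cleanup, sketched with the standard library only. It assumes the leftover HTML is simple tags such as `<br />` plus character entities; the function name `strip_html` is hypothetical, and a regular expression like this is not a full HTML parser.

```python
import html
import re

# Matches anything that looks like a tag, e.g. <br /> or <i>
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html(text):
    """Replace simple HTML tags with spaces and decode entities like &amp;."""
    without_tags = TAG_PATTERN.sub(" ", text)
    decoded = html.unescape(without_tags)
    # Collapse the runs of whitespace left behind by the removed tags
    return " ".join(decoded.split())
```

On a pandas column, the tag removal alone could also be written as `Series.str.replace(r"<[^>]+>", " ", regex=True)`.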
And I could put in code like this to do further manipulation there. Let's just load the original data so you can compare the two data sets
and see that the stop words are being removed. Okay, so that's looking better. Here is the original data: you can see "for a movie that gets no respect."
It got changed to "movie gets respect, sure, lot, memorable quotes." You can see the bottom one here: "I saw this at the premiere in Melbourne"
became "Saw premiere in Melbourne." Do you need to remove stop words? No, you don't, but this is something that's going to make your models
perform better because there's a lot of noise in those stop words.