Getting Started with NLP and spaCy Transcripts
Chapter: Part 1: spaCy syntax
Lecture: Tokens

0:00 All right, so let's explore spaCy in the notebook. I will start by doing an import statement there. So spaCy is now imported.
0:11 And next what I will do is I will load a spaCy pipeline. So just to be explicit, this en_core_web_md is the name of the model.
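[Editor's sketch] The import and load steps described here might look roughly like the following. The helper name load_pipeline and the fallback to a blank English pipeline are assumptions added so the snippet still runs when the model has not been downloaded; the lecture itself simply loads en_core_web_md.

```python
import spacy
from spacy.language import Language


def load_pipeline(name: str = "en_core_web_md") -> Language:
    """Load a pre-trained pipeline by name (hypothetical helper).

    The model is normally installed first with:
        python -m spacy download en_core_web_md
    If it is missing, spacy.load raises OSError, and this sketch falls
    back to a blank English pipeline that only contains the tokenizer.
    """
    try:
        return spacy.load(name)
    except OSError:
        return spacy.blank("en")


nlp = load_pipeline()

# Whatever was loaded, the result is a spaCy Language object.
print(type(nlp))
# Names of the components in the pipeline (empty for a blank pipeline).
print(nlp.pipe_names)
```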
0:21 It is a pre-trained model that the spaCy library provides, but what we're getting back here is a language object. And one way to think about that
0:31 is that we have an NLP pipeline of sorts. So what can we do with such a pipeline? Well, one thing that I can do is I can give it a sentence
0:40 like, "Hi, my name is Vincent and I like to write Python." When I type that, it almost looks as if we get a string back,
0:54 a string that says, "Hi, my name is Vincent and I like to write Python." But we actually get something different back.
1:00 And we can confirm by checking the type of what comes out of this pipeline. And we can see that what comes out is a spaCy doc object,
1:09 which stands for document. So let's rewrite that a little bit just so it's more explicit. So we give text to a spaCy pipeline and out comes a document.
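[Editor's sketch] A small way to confirm the point about types, under one assumption: it uses spacy.blank("en") instead of en_core_web_md so that it runs without downloading a model. Calling either pipeline on a string returns a Doc.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline stands in for en_core_web_md here.
nlp = spacy.blank("en")

text = "Hi, my name is Vincent and I like to write Python."
doc = nlp(text)

# In a notebook the doc displays like the original string,
# but it is a Doc object, not a str.
print(type(doc))          # <class 'spacy.tokens.doc.Doc'>
print(isinstance(doc, str))  # False

# The original text is still available on the document.
print(doc.text == text)   # True
```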
1:20 That's what we see here. And this document has many properties, but I guess like one property to maybe start with is that a document has tokens.
1:29 So one thing I could do is I could loop over all the tokens inside of this document. Now you might be tempted to think originally
1:40 that a token inside of a document would be a word. And to a large extent, that's accurate. But in spaCy, punctuation can also be a token.
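[Editor's sketch] Looping over the tokens of that sentence could look like this. As an assumption, it uses a blank English pipeline (tokenizer only) rather than en_core_web_md; token.is_punct is a lexical flag, so it does not need the trained components.

```python
import spacy

# Assumption: blank English pipeline, so no model download is needed.
nlp = spacy.blank("en")
doc = nlp("Hi, my name is Vincent and I like to write Python.")

# Each item in the document is a Token; print its text and
# whether spaCy flags it as punctuation.
for token in doc:
    print(token.text, token.is_punct)

tokens = [token.text for token in doc]
punctuation = [token.text for token in doc if token.is_punct]
```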
1:50 So we can see that this comma over here gets printed as well as this dot at the end of the sentence. And there's a couple of interesting examples
2:01 when it comes to tokens. So let's try another sentence. So Python isn't just a language, comma, it's a community, exclamation mark.
2:16 Now something that I think is pretty interesting here is that after "Python" we get "is" and then "n't". That is to say, this last part of "isn't",
2:26 that's now also considered a separate token, just like the abbreviated "is" over here, the "'s" from "it's". Now that might come across as unintuitive,
2:35 but in this particular case, you could also argue that "n't" can be translated to "not", and "'s" over here could be translated to "is",
2:44 if we're talking about like the meaning of characters in the sentence. So the first thing that spaCy gives you is tokenization really.
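[Editor's sketch] The contraction example can be reproduced with the tokenizer alone. This assumes a blank English pipeline in place of en_core_web_md.

```python
import spacy

# Assumption: blank English pipeline, tokenizer only.
nlp = spacy.blank("en")
doc = nlp("Python isn't just a language, it's a community!")

tokens = [token.text for token in doc]
print(tokens)
# ['Python', 'is', "n't", 'just', 'a', 'language', ',',
#  'it', "'s", 'a', 'community', '!']
```

Note that "isn't" becomes two tokens ("is" and "n't") and "it's" becomes "it" and "'s", while the comma and exclamation mark are tokens of their own.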
2:54 We have a document and inside of a document, there are tokens, but the way that it handles these tokens and the way that they are parsed
3:02 is determined by a rule-based system that's internal to spaCy. And these rules are language specific. So the parsing rules that you might have for English
3:12 are different from the ones for Dutch. I won't focus too much on this in the rest of the course, but I do think it's pretty good to just acknowledge
3:19 that a token in a sentence isn't necessarily a word because we can also have punctuation, but also words can theoretically be split up because again,
3:29 a token is not necessarily the same thing as a word. Now, what is the whole point of the spaCy library? Well, the whole point of the spaCy library
3:37 is to attach properties that you might be interested in to these documents and to these tokens. And just to give an example of this,
3:46 we have some part of speech information that's attached to each of these tokens. Now, part of speech in this case gives us information
3:55 about what kind of word we're dealing with grammatically. So is it a noun? Is it a proper noun? Is it an auxiliary verb? That sort of a thing.
4:06 And under the hood, there's actually a statistical machine learning model that spaCy has pre-trained to give you this information.
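[Editor's sketch] Reading the part-of-speech property off each token might look like this. The helper name pos_tags is hypothetical, and the fallback to a blank pipeline is an assumption added so the snippet runs without the model; in that case there is no trained tagger and every label is an empty string. The labels only become meaningful with a trained pipeline such as en_core_web_md.

```python
import spacy


def pos_tags(nlp, text):
    """Return (token text, coarse part-of-speech label) pairs (hypothetical helper)."""
    return [(token.text, token.pos_) for token in nlp(text)]


try:
    # The pre-trained pipeline used in the lecture.
    nlp = spacy.load("en_core_web_md")
except OSError:
    # Assumption: fall back to a tokenizer-only pipeline if the model is
    # not installed. token.pos_ is then "" for every token.
    nlp = spacy.blank("en")

pairs = pos_tags(nlp, "Python isn't just a language, it's a community!")
for text, pos in pairs:
    print(text, pos)
```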
4:15 These models are not necessarily perfect, but the whole point again is to give you models that give you properties on these tokens
4:24 and on these documents, some of which are rule-based and some of which are based on machine learning.
4:30 And what we'll do in this first part of the course is we'll just explore what spaCy has to offer from these pre-trained models from the get-go.

