Getting Started with NLP and spaCy Transcripts
Chapter: Part 1: spaCy syntax
Lecture: Tokens
0:00
All right, so let's explore spaCy in the notebook. I will start with an import statement. So spaCy is now imported.
0:11
And next what I will do is load a spaCy pipeline. So just to be explicit, this en_core_web_md is the name of the model.
0:21
It is a pre-trained model that the spaCy library provides, but what we're getting back here is a Language object. And one way to think about that
0:31
is that we have an NLP pipeline of sorts. So what can we do with such a pipeline? Well, one thing that I can do is I can give it a sentence
0:40
like, "Hi, my name is Vincent and I like to write Python." When I type that, it almost looks as if we get a string back,
0:54
a string that says, "Hi, my name is Vincent and I like to write Python." But we actually get something different back.
1:00
And we can confirm that by checking the type of what comes out of this pipeline. And we can see that what comes out is a spaCy Doc object,
1:09
which stands for document. So let's rewrite that a little bit just so it's more explicit. So we give text to a spaCy pipeline and out comes a document.
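The steps so far (import, load a pipeline, pass in text, check the type) can be sketched roughly as follows. This is a minimal sketch, not the exact notebook from the video. It assumes the en_core_web_md model has been downloaded; the fallback to a blank English pipeline is an addition for this sketch so that it still runs without the model.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

try:
    # The model named in the lecture; it has to be downloaded separately.
    nlp = spacy.load("en_core_web_md")
except OSError:
    # Assumption for this sketch only: if the model is not installed,
    # fall back to a blank English pipeline (tokenizer only).
    nlp = spacy.blank("en")

text = "Hi, my name is Vincent and I like to write Python."
doc = nlp(text)

print(type(nlp))  # the pipeline is a Language object
print(type(doc))  # the result is a Doc, not a str
print(doc)        # printing a Doc shows the original text
```

Printing the Doc shows the original text, which is why it can be mistaken for a plain string in the notebook.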
1:20
That's what we see here. And this document has many properties, but one property to start with is that a document has tokens.
1:29
So one thing I could do is I could loop over all the tokens inside of this document. Now you might be tempted to think originally
1:40
that a token inside of a document would be a word. And to a large extent, that's accurate. But in spaCy, punctuation can also be a token.
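Looping over the tokens of a document might look like the sketch below. It uses a blank English pipeline, which is an assumption made here to keep the example free of a model download; tokenization only needs the tokenizer. The is_punct flag is printed next to each token to make the punctuation tokens easy to spot.

```python
import spacy

# Assumption for this sketch: a blank English pipeline, which contains
# only the tokenizer and needs no downloaded model.
nlp = spacy.blank("en")
doc = nlp("Hi, my name is Vincent and I like to write Python.")

# A Doc is iterable; each item is a Token object.
for token in doc:
    print(token.text, token.is_punct)

tokens = [token.text for token in doc]
punctuation = [token.text for token in doc if token.is_punct]
```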
1:50
So we can see that this comma over here gets printed as well as this dot at the end of the sentence. And there's a couple of interesting examples
2:01
when it comes to tokens. So let's try another sentence: "Python isn't just a language, it's a community!"
2:16
Now something that I think is pretty interesting here is that we get "Python", "is" and then "n't". That is to say, this last part of "isn't",
2:26
that's now also considered a separate token, just like the abbreviated "is" over here, the "'s" from "it's". Now that might come across as unintuitive,
2:35
but in this particular case, you could also argue that this "n't" can be translated to "not". And this "'s" over here could be translated to "is"
2:44
if we're talking about the meaning of the characters in the sentence. So the first thing that spaCy gives you is really tokenization.
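The contraction example can be reproduced with a short sketch like this one. As before, the blank English pipeline is an assumption made to avoid a model download. The norm_ attribute exists on every token, but what it holds for each piece depends on spaCy's tokenizer exceptions, so no expected values are shown for it here.

```python
import spacy

# Assumption for this sketch: tokenizer-only English pipeline.
nlp = spacy.blank("en")
doc = nlp("Python isn't just a language, it's a community!")

pieces = [token.text for token in doc]
print(pieces)

# token.norm_ is the normalized form the tokenizer assigns to a token.
# It is printed for inspection only.
for token in doc:
    print(repr(token.text), repr(token.norm_))

# Tokenization does not lose text: joining every token together with
# its trailing whitespace gives back the original sentence.
rebuilt = "".join(token.text_with_ws for token in doc)
```

Here "isn't" comes out as the two tokens "is" and "n't", and "it's" as "it" and "'s", matching what is shown in the video.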
2:54
We have a document and inside of a document, there are tokens, but the way that it handles these tokens and the way that they are parsed
3:02
is determined by a rule-based system that's internal to spaCy. And these rules are language specific. So the parsing rules that you might have for English
3:12
are different from those for Dutch. I won't focus too much on this in the rest of the course, but I do think it's pretty good to just acknowledge
3:19
that a token in a sentence isn't necessarily a word because we can also have punctuation, but also words can theoretically be split up because again,
3:29
a token is not necessarily the same thing as a word. Now, what is the whole point of the spaCy library? Well, the whole point of the spaCy library
3:37
is to attach properties that you might be interested in to these documents and to these tokens. And just to give an example of this,
3:46
we have some part of speech information that's attached to each of these tokens. Now, part of speech in this case gives us information
3:55
about what kind of word we're dealing with grammatically. So is it a noun? Is it a proper noun? Is it an auxiliary verb? That sort of thing.
4:06
And under the hood, there's actually a statistical machine learning model that spaCy has pre-trained to give you this information.
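Reading the part-of-speech information off each token could be sketched as below. The helper name pos_table is hypothetical and introduced only for this illustration. The sketch assumes en_core_web_md is installed; the fallback to a blank pipeline is an addition here, and in that case no tagger runs, so the pos_ values are simply empty strings.

```python
import spacy


def pos_table(nlp, text):
    """Return (token text, coarse part-of-speech tag) pairs for a text.

    Hypothetical helper for this sketch, not part of spaCy.
    """
    return [(token.text, token.pos_) for token in nlp(text)]


try:
    # The pre-trained pipeline includes the statistical tagging components.
    nlp = spacy.load("en_core_web_md")
except OSError:
    # Assumption for this sketch only: without the model there is no
    # tagger, so token.pos_ stays an empty string for every token.
    nlp = spacy.blank("en")

rows = pos_table(nlp, "Hi, my name is Vincent and I like to write Python.")
for text, pos in rows:
    print(f"{text:10} {pos}")
```

With the model loaded, each row pairs a token with a coarse tag such as a noun, proper noun, or auxiliary label. The exact tags are model predictions, so they are not listed here.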
4:15
These models are not necessarily perfect, but the whole point again is to give you models that give you properties on these tokens
4:24
and on these documents, some of which are rule-based and some of which are based on machine learning.
4:30
And what we'll do in this first part of the course is we'll just explore what spaCy has to offer from these pre-trained models from the get-go.