Getting Started with NLP and spaCy Transcripts
Chapter: Part 1: spaCy syntax
Lecture: Properties
0:00
So what I want to do now is just dive into some of the properties that are on
0:04
these tokens that spaCy provides, and what I've done to help me is I've made this function that gives me a pretty overview. Internally, this function is
0:14
using a library called Wasabi, which is a dependency of spaCy, so if you've downloaded spaCy you will have also downloaded Wasabi. This is a pretty
0:24
printing library that spaCy uses internally but let's just run this and then I'll explain the function in more detail.
0:32
I have a function over here called text to doc table, and I'm giving it this sentence
0:37
that I also mentioned earlier: "Hi, my name is Vincent. I like to write Python."
0:42
The text goes into this function and I'm turning that into a document and then this document is used in this list comprehension over here that's
0:53
generating me a bunch of data. I'm looping over all the tokens in the
0:57
document and then I'm accessing the text property, the lemma property, the part of
1:03
speech property, the entity type property, the shape property and the is
1:09
punctuation property and the morphology property. It's a whole bunch but then all
1:15
of that's put into this table function over here and then I'm printing it and
1:19
then this is the table that we get. So let's go over some of these properties.
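The function described above is not shown in the transcript, so here is a minimal sketch of what it could look like. The name `text_to_doc_table`, the helper `doc_rows`, the column order, the punctuation in the example sentence and the `en_core_web_sm` model are all assumptions for illustration. The sketch falls back to a blank English pipeline when the model is not installed, in which case the lemma, part of speech, entity and morphology columns stay empty.

```python
import spacy
from wasabi import table


def load_pipeline():
    """Load a small English pipeline; fall back to a blank one if it is missing."""
    try:
        return spacy.load("en_core_web_sm")  # assumed model, not named in the lecture
    except OSError:
        # A blank pipeline still tokenizes, but has no trained components.
        return spacy.blank("en")


nlp = load_pipeline()

# Column names are hypothetical; they follow the order listed in the lecture.
HEADER = ["text", "lemma", "pos", "ent_type", "shape", "is_punct", "morph"]


def doc_rows(doc):
    """Build one row of string values per token."""
    return [
        [
            t.text,
            t.lemma_,
            t.pos_,
            t.ent_type_,
            t.shape_,
            str(t.is_punct),
            str(t.morph),
        ]
        for t in doc
    ]


def text_to_doc_table(text):
    """Turn the text into a Doc and print a table of token properties."""
    doc = nlp(text)
    print(table(doc_rows(doc), header=HEADER, divider=True))


text_to_doc_table("Hi, my name is Vincent. I like to write Python.")
```

Keeping the row-building step in its own function makes it easy to reuse the data without printing it.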
1:22
The first one is relatively simple this is just the text of each token. No surprises here. But then we have the lemma and that's something that's kind
1:33
of interesting. The way to think about the lemma is that it turns a token into
1:40
its base form and that's going to be relevant to some nouns and some verbs. So
1:45
"is" is turned into "be", for example. I could change the verb to "was", and then you'll
1:53
also notice it gets turned into "be", because that's, I guess you could say, the base form of the verb. Let's write down another
2:04
example: I own two books. So let's consider just one more example to make the point of the lemma more clear. So let's say: my name is Vincent, I own two
2:15
books. In this case the lemma on the noun "books" turns it into the singular "book", which again kind of feels like a base form. Next we have the part of
2:28
speech, which we saw earlier, which says something like: hey, is the word a noun or
2:32
a verb, that sort of thing. Followed by that is something that's called an entity, and this is something that's also generated by a statistical model.
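To check the lemma and part of speech properties described so far, a small sketch like the following could be used. The helper name `lemma_and_pos` and the `en_core_web_sm` model are assumptions. Both properties come from trained pipeline components, so with the blank fallback they are simply empty strings, and the exact values depend on the model version.

```python
import spacy

try:
    nlp = spacy.load("en_core_web_sm")  # assumed model, not named in the lecture
except OSError:
    # Fallback: tokenization only, lemma_ and pos_ will be empty strings.
    nlp = spacy.blank("en")


def lemma_and_pos(text):
    """Return (text, lemma, part of speech) for every token in the text."""
    return [(t.text, t.lemma_, t.pos_) for t in nlp(text)]


sentences = [
    "My name is Vincent.",
    "My name was Vincent.",
    "I own two books.",
]

for sentence in sentences:
    for text, lemma, pos in lemma_and_pos(sentence):
        print(f"{text:10} {lemma:10} {pos}")
    print()
```

With a trained English pipeline you would expect "is" and "was" to share one lemma and "books" to be reduced to its singular form, as described in the lecture.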
2:41
Entities tend to be quite useful because they are things that you might be interested in detecting in a sentence and in particular Vincent in this
2:50
sentence is indeed a person and if you want to detect the name of a person in
2:54
a sentence then this is a useful entity to detect. There's another entity being
3:00
detected here called CARDINAL, which basically deals with numeric values,
3:05
and this is also a pretty useful entity in
3:09
general because sometimes you're dealing with numbers in text form instead of
3:13
written down as a numeric value. Let's move on. So we also have the shape which
3:21
says something about the capitalization and length of a token, so in this case "Hi"
3:27
is a capital letter X and then a lowercase x. Followed by that we have whether or not a
3:34
token is part of punctuation. You technically get the same information from the part of speech but it's also nice to have this as a property on the
3:42
token as well. And this final feature is a bit of a mouthful but these are the
3:47
morphological features and especially if you're interested in more linguistic
3:52
properties this is something you might be interested in but this tells you
3:56
things like: what is the tense of a verb, is it past or present tense, is a word possessive, yes or no, that sort of thing. There are use cases where
4:07
information like this can be useful but it's also a feature where having more knowledge about linguistics can definitely help. Out of all the
4:15
properties that I've shown you here I think the part of speech and the entities are the two items that I've used the most in the past but I do want
4:23
to give you a good overview of all the different properties that we do have
4:27
access to, because who knows, maybe they are useful to you. It's just good to know that there are lots of properties that spaCy does provide.
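As a recap of the entity, shape, punctuation and morphology properties, here is a rough sketch that collects them for a sentence. The helper name `remaining_properties` and the `en_core_web_sm` model are assumptions, and `token.morph.get` assumes spaCy v3. The shape and punctuation flags are computed from the token text alone, so they work even with the blank fallback, while entities and morphological features need a trained pipeline.

```python
import spacy

try:
    nlp = spacy.load("en_core_web_sm")  # assumed model, not named in the lecture
except OSError:
    # Fallback: no entities or morphology, but shape_ and is_punct still work.
    nlp = spacy.blank("en")


def remaining_properties(text):
    """Return the detected entities and a few per-token properties."""
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    tokens = [
        {
            "text": t.text,
            "shape": t.shape_,               # capitalization and length pattern
            "is_punct": t.is_punct,          # True for punctuation tokens
            "tense": t.morph.get("Tense"),   # list of values, empty if not set
        }
        for t in doc
    ]
    return entities, tokens


entities, tokens = remaining_properties("My name is Vincent. I own two books.")

print("Entities:", entities)
for token in tokens:
    print(token)
```

With a trained pipeline, the entity list is where labels such as PERSON or CARDINAL would show up, although the exact predictions depend on the model.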