Getting Started with NLP and spaCy Transcripts
Chapter: Part 1: spaCy syntax
Lecture: Properties

0:00 So what I want to do now is dive into some of the properties that are on
0:04 these tokens that spaCy provides. To help with that, I've made this function that gives me a pretty overview. Internally, this function is
0:14 using a library called Wasabi, which is a dependency of spaCy, so if you've installed spaCy you will also have installed Wasabi. This is a
0:24 pretty-printing library that spaCy uses internally. But let's just run this, and then I'll explain the function in more detail.
0:32 I have a function over here, text_to_doc_table, and I'm giving it this sentence
0:37 that I also mentioned earlier: "Hi, my name is Vincent. I like to write Python."
0:42 The text goes into this function, where I turn it into a document, and then this document is used in this list comprehension over here that's
0:53 generating a bunch of data. I'm looping over all the tokens in the
0:57 document and then I'm accessing the text property, the lemma property, the part of
1:03 speech property, the entity type property, the shape property, the is
1:09 punctuation property and the morphology property. It's a whole bunch, but all
1:15 of that is put into this table function over here, and then I'm printing it, and
1:19 this is the table that we get. So let's go over some of these properties.
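The function itself is not reproduced in this transcript, so what follows is only a minimal sketch of what such a helper might look like, assuming spaCy v3 and Wasabi's `table` function. The name `text_to_doc_table` follows how it is spoken in the lecture; the `token_rows` helper, the `nlp` parameter and the column headers are assumptions made for illustration, not the lecturer's code.

```python
import spacy
from wasabi import table

# Column names are an assumption; they mirror the properties listed in the lecture.
HEADER = ("text", "lemma", "pos", "ent_type", "shape", "is_punct", "morph")


def token_rows(doc):
    """Collect one row of token properties per token in the document."""
    return [
        (
            token.text,
            token.lemma_,
            token.pos_,
            token.ent_type_,
            token.shape_,
            str(token.is_punct),
            str(token.morph),
        )
        for token in doc
    ]


def text_to_doc_table(nlp, text):
    """Run the pipeline on the text and return the token properties as a table string."""
    doc = nlp(text)
    return table(token_rows(doc), header=HEADER, divider=True)
```

With a trained pipeline such as `en_core_web_sm` loaded through `spacy.load`, printing `text_to_doc_table(nlp, "Hi, my name is Vincent. I like to write Python.")` should give a table along the lines of the one in the video. A blank pipeline from `spacy.blank("en")` also runs, but leaves the model-driven columns (lemma, part of speech, entity type, morphology) empty.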
1:22 The first one is relatively simple: this is just the text of each token. No surprises here. But then we have the lemma, and that's something that's kind
1:33 of interesting. The way to think about the lemma is that it turns a token into
1:40 its base form, and that's going to be relevant to some nouns and some verbs. So
1:45 "is" is turned into "be", for example. I could change the verb to "was" and then you'll
1:53 also notice it gets turned into "be", because that's, I guess you could say, the base form of the verb. Let's write down another
2:04 example to make the point of the lemma more clear. So let's say: "My name is Vincent. I own two
2:15 books." In this case the lemma on this noun "books" turns it into a singular, which again kind of feels like a base form. Next we have the part of
2:28 speech, which we saw earlier, which says something like: hey, is the word a noun or
2:32 a verb, that sort of thing. Followed by that is something that's called an entity, and this is something that's also generated by a statistical model.
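To look at just the lemma and part-of-speech columns discussed above, a small sketch like the following can help. It assumes spaCy v3; the function names `lemma_and_pos` and `show_lemmas` and the `nlp` parameter are hypothetical and chosen only for this example.

```python
import spacy


def lemma_and_pos(nlp, text):
    """Return (text, lemma, part of speech) for every token in the text."""
    return [(token.text, token.lemma_, token.pos_) for token in nlp(text)]


def show_lemmas(nlp, text):
    """Print one token per line so that lemma changes are easy to spot."""
    for word, lemma, pos in lemma_and_pos(nlp, text):
        print(f"{word:<10} {lemma:<10} {pos}")
```

Calling `show_lemmas` with a trained pipeline (for example one loaded with `spacy.load("en_core_web_sm")`) on "My name is Vincent. I own two books." should, as described in the lecture, show "be" as the lemma of "is" and a singular lemma for "books". Both columns come from the trained pipeline, so a blank pipeline will leave them empty.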
2:41 Entities tend to be quite useful because they are things that you might be interested in detecting in a sentence. In particular, Vincent in this
2:50 sentence is indeed a person, and if you want to detect the name of a person in
2:54 a sentence then this is a useful entity to detect. There's another entity being
3:00 detected here called CARDINAL, which basically deals with numeric values, and
3:05 this is also a pretty useful entity in
3:09 general because sometimes you're dealing with numbers in text form instead of
3:13 written down as a numeric value. Let's move on. So we also have the shape, which
3:21 says something about capitalization and length of a token, so in this case "Hi"
3:27 is a capital letter X and then a lowercase x. Followed by that we have whether or not a
3:34 token is punctuation. You technically get the same information from the part of speech, but it's also nice to have this as a property on the
3:42 token as well. And this final feature is a bit of a mouthful, but these are the
3:47 morphological features, and especially if you're interested in more linguistic
3:52 properties this is something you might be interested in. This tells you
3:56 things like: what is the tense of a verb, is it past or present tense, is a word possessive, yes or no, that sort of thing. There are use cases where
4:07 information like this can be useful, but it's also a feature where having more knowledge about linguistics can definitely help. Out of all the
4:15 properties that I've shown you here, I think the part of speech and the entities are the two items that I've used the most in the past, but I do want
4:23 to give you a good overview of all the different properties that we do have
4:27 access to, because who knows, maybe they are useful to you. It's just good to know that there are lots of properties that spaCy does provide.
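The remaining properties (entities, shape, punctuation and morphology) can be pulled out in the same way. This is a rough sketch assuming spaCy v3; the helper names `entity_spans`, `surface_rows` and `verb_tenses` are hypothetical and not part of spaCy or the lecture.

```python
import spacy


def entity_spans(doc):
    """Return (text, label) pairs for the entities the pipeline detected."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def surface_rows(doc):
    """Return (text, shape, is_punct, morph) for each token."""
    return [
        (token.text, token.shape_, token.is_punct, str(token.morph))
        for token in doc
    ]


def verb_tenses(doc):
    """Return (text, tense values) for tokens that carry a Tense feature."""
    return [
        (token.text, token.morph.get("Tense"))
        for token in doc
        if token.morph.get("Tense")
    ]
```

The shape and punctuation flags are computed from the token text alone, so they are available even on a blank pipeline, whereas the entities and morphological features need a trained pipeline such as `en_core_web_sm`. With such a pipeline, `entity_spans` on the lecture's sentence should report Vincent as a person, as described above.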

