Getting Started with NLP and spaCy Transcripts
Chapter: Part 1: spaCy syntax
Lecture: Document properties
0:00
In the last few videos we've been looking at properties that spaCy provides on tokens,
0:07
but there are also a couple of properties that spaCy provides on the document itself, and I figured I'd just show a few demos of those.
0:16
One thing that spaCy provides on the document is this sents property.
0:21
This returns a generator, which you can unravel by turning it into a list, and it gives you the separate sentences in the original document.
0:34
And you might be tempted to think that that's relatively easy because you just have to split
0:37
on this one dot over here, but just to show you that it doesn't necessarily have to be
0:44
the case, if I were to write down "My name is Mr. Warmerdam", then the dot here that's
0:52
part of "Mr." should not be the reason why the sentence splits.
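As a rough illustration of the sents property, here is a minimal sketch. It assumes a blank English pipeline with spaCy's rule-based "sentencizer" component so that it runs without downloading a model; the lecture presumably uses a trained pipeline, which sets sentence boundaries in a different way. The second sentence in the example text is made up for the demo.

```python
import spacy

# Assumption: a blank English pipeline plus the rule-based sentencizer.
# A trained pipeline (loaded with spacy.load) would also populate doc.sents.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("My name is Mr. Warmerdam. I like Star Wars.")

# doc.sents is lazy, so collect it into a list to inspect it.
sentences = [sent.text for sent in doc.sents]
print(sentences)
# The tokenizer keeps "Mr." as a single token, so the dot inside it
# does not trigger a sentence split.
```

Each item in doc.sents is a Span, so sentence-level code can work with sent.text or with the individual tokens in each sentence.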
0:57
So the fact that spaCy can detect these separate sentences for you is indeed a very useful
1:02
and likable feature, because it might be the case that you have systems that work on a
1:06
sentence level, and then spaCy can be used to generate the separate sentences on your behalf. Let's now consider a very different sentence.
1:16
So I have "Star Wars is a very popular science fiction series." Besides sentences, a document also has a property called noun_chunks.
1:27
Again, that gives me a generator, so I need to turn it into a list.
1:32
But the way to look at this is that spaCy is able to detect that multiple words together,
1:38
as far as grammar goes, can be seen as a single noun chunk, almost as if they form a single phrase.
1:50
And the reason why this might be of interest is that a lot of entities you're typically interested in detecting, they tend to be nouns.
1:58
They're usually not verbs, so having something that can just give you noun chunks to evaluate is also pretty useful in practice, I would say.
2:07
And then finally, here's a property that could be useful if you want to turn this into an API at some point.
2:12
A doc object also has a to_json method attached, and this will give you back a dictionary that
2:19
just contains a lot of the information that spaCy has detected. And that's kind of neat, because that means you can use this inside of a web app or API
2:30
to communicate with a front end. Now, as you go through all the properties here, you'll notice that typically there's
2:35
a start as well as an end being specified. So for example, the sentence starts somewhere and it ends somewhere, and another sentence
2:43
starts somewhere and it ends somewhere. And these refer to the character indices in this sentence.
2:49
So you can also see that this first token over here, it starts somewhere, it stops at
2:55
the second character, then we have the comma, the punctuation that goes from two to three. So these represent the character indices.
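To make the start and end values concrete, here is a small sketch. It assumes a blank English pipeline with the sentencizer so it runs without a model, and the example text is hypothetical (chosen so that the first token is two characters long and is followed by a comma). With a trained pipeline, the token dictionaries would carry extra keys such as the lemma and morphological information.

```python
import json
import spacy

# Assumption: blank pipeline + sentencizer; no trained model is needed
# to see the character offsets.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "Hi, my name is Vincent. I like Star Wars."
doc = nlp(text)

data = doc.to_json()  # a plain Python dictionary

# Sentence offsets are character indices into the original text.
for sent in data.get("sents", []):
    print(sent, repr(text[sent["start"]:sent["end"]]))

# Token offsets work the same way: "Hi" spans 0-2, the comma spans 2-3.
for tok in data["tokens"][:2]:
    print(tok, repr(text[tok["start"]:tok["end"]]))

# The dictionary can be serialized for an API response.
payload = json.dumps(data)
```

Slicing the original text with a start and end pair recovers the matching sentence or token, which is what a front end would do to highlight it.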
3:04
But there's also all sorts of other information, especially on these tokens.
3:07
So we have the lemma, for example, we have the morphological information, etc. You can also scroll down and see all the different tokens.
3:16
But one thing you will notice as you look around is that even though we have lots of
3:20
information in here, it doesn't show all the possible properties on this document. So the noun chunks, for example, aren't being shown here.
3:30
But that said, a lot of the time this will be sufficient, especially if you're mainly interested in these entities, so to say.
3:38
So if you're building web APIs, this tends to be a very useful method to know about.