Getting Started with NLP and spaCy Transcripts
Chapter: Part 1: spaCy syntax
Lecture: Spans
0:00
So far in these videos, we've been talking about some of the building blocks in spaCy. So what we've seen is that we have a doc object,
0:09
a document, and that it has some tokens. That's all well and good, but we also saw this thing called an entity. And there's an interesting thing there
0:20
because we also noticed that an entity, even though it is definitely part of a document,
0:26
can actually contain one or more tokens. So you might wonder what is up with that. In short, an entity is an instance of a concept
0:37
that we haven't explained yet, called a span. And a span can be thought of as a sequence of consecutive tokens, in order. And to maybe help explain that,
0:46
I'll go ahead and explore that with some code right now. I have my sentence here, "Hi, my name is Vincent." That gives me a document.
0:55
And just to confirm, this is the representation of the document. It looks like a string, but it's actually a spaCy document.
1:04
The type is being confirmed here. And I can do the same thing for the first token in that document. So just for good measure,
1:11
let's just grab that thing; that's the token "Hi". And we can confirm that that's indeed a token. But let's now grab some more.
1:21
So this is grabbing the first two tokens. That will be "Hi" plus the comma over here. Those are two separate tokens.
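The steps so far can be sketched roughly like this. It is a minimal sketch, not the exact code from the video: it assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) so that no trained model needs to be downloaded, and the variable names `first_token` and `first_two` are just illustrative.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Assumption: a blank English pipeline (tokenizer only) stands in for
# whatever pipeline is loaded in the video.
nlp = spacy.blank("en")
doc = nlp("Hi, my name is Vincent.")

first_token = doc[0]   # indexing with an integer gives a single token
first_two = doc[:2]    # slicing gives a sequence of consecutive tokens

# Print the class of each object, then the texts inside the slice.
print(type(doc).__name__)
print(type(first_token).__name__)
print(type(first_two).__name__)
print([token.text for token in first_two])
```

Indexing and slicing behave much like they do on a Python list, except that a slice of a document is its own kind of object rather than a plain list of tokens.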
1:34
And the type of those two tokens together, attached like this, that's a span. Now we'll want to remember that, because one property that this document has
1:46
is that it holds all the available entities. And we can confirm that Vincent is indeed an entity on that document.
1:53
So let's loop over that: for ent in doc.ents, let's print that. So we can see that the entity Vincent is actually a span.
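A loop like the one just described might look as follows. This is a sketch under one loud assumption: in the video the entity is predicted by a trained pipeline, whereas here the entity is set by hand on a blank pipeline, so the example runs without a model download.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Hi, my name is Vincent.")

# Assumption: the entity is attached manually. A trained pipeline would
# normally fill doc.ents with its own predictions.
doc.ents = [Span(doc, 5, 6, label="PERSON")]

for ent in doc.ents:
    # Each item in doc.ents is a Span, not a separate entity class.
    print(ent.text, type(ent).__name__)
```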
2:10
It's not a separate entity class. It is really just a span object. And spans also have a couple of properties.
2:18
So they have a start and an end. In this case, that means the span starts at token index five and ends at index six, where the end index is exclusive.
2:30
So let's count one, two, three, four, five. That's where it starts. And then six where it ends. So that seems correct.
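To check that count, here is a small sketch. As before, it assumes a blank pipeline with the entity set by hand rather than predicted by a trained model; `ent` is just an illustrative name.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Hi, my name is Vincent.")
doc.ents = [Span(doc, 5, 6, label="PERSON")]  # assumption: set manually

ent = doc.ents[0]

# start and end are token indices; end is exclusive, like a Python slice.
print(ent.start, ent.end)

# Listing every token with its index makes the count easy to follow.
for token in doc:
    print(token.i, token.text)
```

Token indices start at zero, so "Hi" is index 0, the comma is index 1, and "Vincent" lands on index 5.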
2:40
But I can also query for the starting character and the ending character. Depending on what the use case is, you might be more interested
2:49
in where the characters start and end. Now at this point, you might wonder, well, if an entity is just a span, what makes it so special?
2:57
And the primary reason is that an entity has a label that is attached. So we can confirm that this span, this Vincent span, so to say,
3:09
has a PERSON label attached. We can see that through this property. And that's not the case for this span that I can select,
3:19
like the first three tokens. If I were to query for the label there, it is going to tell me that it's an empty string.
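The character offsets and the label comparison can be sketched like this. The same assumption applies: a blank pipeline with a hand-set entity stands in for the trained pipeline in the video, and `plain_span` is an illustrative name.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Hi, my name is Vincent.")
doc.ents = [Span(doc, 5, 6, label="PERSON")]  # assumption: set manually

ent = doc.ents[0]
plain_span = doc[:3]  # an ordinary slice, not an entity

# Character offsets point into the original text of the document.
print(ent.start_char, ent.end_char)
print(doc.text[ent.start_char:ent.end_char])

# The entity span carries a label; the plain slice has an empty one.
print(repr(ent.label_))
print(repr(plain_span.label_))
```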
3:25
So this label is something that I would only expect on a span that is actually an entity in a sentence. So that's just really good to remember.
3:34
Moreover, the reason we need a span here is that an entity can have more than one token in it. So as we can see now,
3:44
if I were to change my name to my first and last name, then the entity updates to this full name over here. That's the entity that's being detected.
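A multi-token entity can be sketched as below. Two assumptions: the name "Jane Doe" is a hypothetical placeholder, not the name used in the video, and the entity is again set by hand on a blank pipeline instead of being detected by a trained model.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
# Assumption: "Jane Doe" is a made-up first and last name for illustration.
doc = nlp("Hi, my name is Jane Doe.")

# One entity that covers two consecutive tokens (indices 5 and 6).
doc.ents = [Span(doc, 5, 7, label="PERSON")]

ent = doc.ents[0]
print(ent.text)                        # the full name as one span
print(len(ent))                        # number of tokens in the span
print([token.text for token in ent])   # the individual tokens
```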
3:53
I need something that can represent that. And that's what we have the span for inside of spaCy. Now, again, this span needs to have tokens
4:02
that are consecutive. So first name and then last name, but you can't skip tokens in the middle. It all has to be sequential.
4:10
And we can have many different kinds of spans. We can select many of them, but typically the entities as found on this doc.ents property,
4:18
those will have a label that we are typically interested in. So maybe in summary, an entity in spaCy is a span,
4:26
but not every span in spaCy is an entity.