Getting Started with NLP and spaCy Transcripts
Chapter: Part 1: spaCy syntax
Lecture: Spans

0:00 So far in these videos, we've been talking about some of the building blocks in spaCy. So what we've seen is that we have a doc object,
0:09 a document, and that it has some tokens. That's all well and good, but we also saw this thing called an entity. And there's an interesting thing there
0:20 because an entity, even though it is definitely part of a document,
0:26 can actually contain one or more tokens. So you might wonder what is up with that. In short, an entity is an example of a concept
0:37 that we haven't explained yet, called a span. And a span can be thought of as a sequence of tokens in order. And to maybe help explain that,
0:46 I'll go ahead and explore that with some code right now. I have my sentence here, "Hi, my name is Vincent." That gives me a document.
0:55 And just to confirm, this is the representation of the document. It looks like a string, but it's actually a spaCy document.
1:04 The type is being confirmed here. And I can do the same thing for the first token in that document. So just for good measure,
1:11 let's just grab that thing that's the token "Hi". And we can confirm that that's indeed a token. But let's now grab some more.
1:21 So this is grabbing the first two tokens. That will be "Hi" plus the comma over here. Those are two separate tokens.
1:34 And the type of those two tokens together, attached like this, that's a span. Now we will remember that because one property that this document has
1:46 is it has all the available entities. And we can confirm that Vincent is indeed an entity on that document.
1:53 So let's loop over that: for ent in doc.ents. Let's print that. So we can see that the entity Vincent is actually a span.
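
The type checks described so far can be sketched roughly as follows. This is a minimal sketch, not the exact code from the video: it assumes a blank English pipeline (tokenizer only) so that nothing needs to be downloaded, and it marks "Vincent" as an entity by hand, whereas in the video a trained pipeline detects it.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Assumption: a blank English pipeline, which only tokenizes.
nlp = spacy.blank("en")
doc = nlp("Hi, my name is Vincent.")

first_token = doc[0]   # indexing a Doc gives a single Token
first_two = doc[0:2]   # slicing a Doc gives a Span ("Hi" and ",")

# Assumption: the entity is set manually here; a trained pipeline
# with an NER component would normally fill doc.ents for us.
doc.ents = [Span(doc, 5, 6, label="PERSON")]

print(type(doc), type(first_token), type(first_two))
for ent in doc.ents:
    # Each entity is a Span object, not a separate entity class.
    print(ent.text, type(ent))
```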
2:10 It's not a separate entity class. It is really just a span object. And spans also have a couple of properties.
2:18 So they have a start and an end. In this case, that means that the start token will be index five and it would end at index six.
2:30 So let's count one, two, three, four, five. That's where it starts. And then six where it ends. So that seems correct.
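
The token indices, together with the character offsets available on the same span, can be checked with a short sketch like the one below. It again assumes a blank English pipeline and selects the "Vincent" span by slicing rather than through a trained entity recognizer.

```python
import spacy

# Assumption: a blank English pipeline, which only tokenizes.
nlp = spacy.blank("en")
doc = nlp("Hi, my name is Vincent.")

# Token indices start at zero: Hi=0, ","=1, my=2, name=3, is=4, Vincent=5, "."=6
for token in doc:
    print(token.i, token.text)

vincent = doc[5:6]

# start is the index of the first token; end is exclusive.
print(vincent.start, vincent.end)

# The same span expressed as character offsets into the original text.
print(vincent.start_char, vincent.end_char)
print(doc.text[vincent.start_char:vincent.end_char])
```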
2:40 But I can also query for the starting character and the ending character. Depending on what the use case is, you might be more interested
2:49 in where the characters start and end. Now at this point, you might wonder, well, if an entity is just a span, what makes it so special?
2:57 And the primary reason is that an entity has a label that is attached. So we can confirm that this span, this Vincent span, so to say,
3:09 that has a person label attached. We can see that through this property. And that's not the case for this span that I can select,
3:19 like the first three tokens. If I were to query for the label there, it is going to tell me that it's an empty string.
3:25 So this label is something that I would only expect on a span that is actually an entity in a sentence. So that's just really good to remember.
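
The difference in labels can be illustrated with a small sketch. As an assumption, the PERSON label is attached by hand on a blank pipeline here; in the video it comes from the trained pipeline's entity recognizer.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank English pipeline, which only tokenizes.
nlp = spacy.blank("en")
doc = nlp("Hi, my name is Vincent.")

# Assumption: the entity is set manually instead of being predicted.
doc.ents = [Span(doc, 5, 6, label="PERSON")]

entity = doc.ents[0]   # the "Vincent" span, which carries a label
plain = doc[0:3]       # an arbitrary slice of the first three tokens

print(repr(entity.label_))   # label attached to the entity span
print(repr(plain.label_))    # a plain slice has an empty-string label
```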
3:34 But moreover, the reason why we need a span here, that's related to the fact that an entity can have more than one token in it. So as we can see now,
3:44 if I were to change my name to my first and last name, then the entity updates to this full name over here. That's the entity that's being detected.
3:53 I need something that can represent that. And that's what we have the span for inside of spaCy. Now, again, this span needs to have tokens
4:02 that are consecutive. So first name and then last name, but you can't have empty tokens in the middle. It all has to be sequential.
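
A multi-token entity could look something like this sketch. The name "Ada Lovelace" is a hypothetical stand-in rather than the name used in the video, and the entity is again set by hand on a blank pipeline instead of being detected by a trained model.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank English pipeline, which only tokenizes.
nlp = spacy.blank("en")

# Hypothetical two-token name used purely for illustration.
doc = nlp("Hi, my name is Ada Lovelace.")

# A span is defined by a start and an (exclusive) end token index,
# so the tokens it covers are always consecutive.
full_name = Span(doc, 5, 7, label="PERSON")
doc.ents = [full_name]

ent = doc.ents[0]
print(ent.text, len(ent))
print([(token.i, token.text) for token in ent])
```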
4:10 And we can have many different kinds of spans. We can select many of them, but typically the entities as found on this doc.ents property,
4:18 those will have a label that we are typically interested in. So maybe in summary, an entity in spaCy is a span,
4:26 but not every span in spaCy is an entity.

