Getting Started with NLP and spaCy Transcripts
Chapter: Part 1: spaCy syntax
Lecture: Displacy
0:00
In the previous video I made a little table to show information on separate tokens, but spaCy also provides a utility via its displacy submodule
0:13
to visualize documents right from the get-go, and I figured that might be good to show too. So just for good measure, I have my sentence here,
0:22
Hi, my name is Vincent. I would like to write Python. That's my sentence going in, that is turned into a document,
0:28
and then the displacy module has a function called render that I can pass the doc into, and here's what that looks like.
0:37
Now this is a visualization that is pretty big, so I need to scroll to see it properly. There are two sentences here that are being plotted,
0:47
and what you're also seeing here is yet another property that spaCy provides. We have the token here with the part of speech attached,
0:54
but there are also grammatical relationships between the tokens that spaCy can estimate on your behalf, and these are also shown in this visualization.
1:04
There are, however, also other visualizations possible. If you're interested in grammar, this might be cool,
1:09
but sometimes you're more interested in just looking at the entities, so that's a style that you can select as well.
1:17
So in this case, we no longer see the arcs, but we do see that Vincent is detected as a person.
1:23
One thing that's actually kind of nice about this visualization is that it also shows a property of entities.
1:30
So in this case, I'm saying my somewhat full name, Vincent Warmerdam,
1:35
and if I were now to run this, you will see that Vincent Warmerdam together is seen as a single entity,
1:42
and that's something that this visual shows you quite nicely. So if you feel like playing around with spaCy and what it can detect,
1:50
you'll see that this can be a very fun interface to do that in.
1:53
One thing that I also like to do in general is show you when the model maybe doesn't work out so well.
2:00
So let's go to this base example again. So hi, my name is Vincent. I like to write Python,
2:05
and in this case, we can see that Vincent is indeed a person. Well, let's see what happens if I were to sort of introduce a slight misspelling
2:13
by writing Vincent with a lowercase v. Well, then I get a warning.
2:20
spaCy is warning me that no entity was detected,
2:25
but I can also see from the visual that right now, Vincent is no longer being detected as an entity. So that also serves as kind of a nice reminder.
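The warning is not specific to misspellings: displaCy emits it whenever the doc it is asked to render in entity style has no entities. To show that deterministically, the sketch below uses a blank English pipeline, which is an assumption made for illustration and not what the video uses. A blank pipeline only tokenizes, so doc.ents is always empty, whereas with a trained model the result for a lowercase name depends on the model's prediction.

```python
import warnings

import spacy
from spacy import displacy

# Assumption for illustration: a blank pipeline with no entity recognizer,
# so no entities are ever predicted, whatever the spelling.
nlp = spacy.blank("en")
doc = nlp("Hi, my name is vincent. I like to write Python.")

# Record warnings instead of letting them print to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    html = displacy.render(doc, style="ent", jupyter=False)

print("entities found:", len(doc.ents))
for w in caught:
    print(w.category.__name__, "-", w.message)
```

Checking doc.ents before rendering is a simple way to avoid the warning in your own scripts.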
2:35
These entities that are being predicted are part of a statistical model, and the statistical model will not be perfect.
2:43
And this is especially true if you consider how the spaCy models were trained. These spaCy models were trained on a preexisting corpus,
2:50
and if you think about the data set that the model was trained on, there are a couple of properties.
2:57
One property is that the data set that spaCy has been trained on historically has always had pretty good spelling.
3:03
Names were always capitalized, but that also means that if your use case involves social media data,
3:10
let's say, where spelling isn't necessarily immaculate, well, then that might be a reason why a spaCy model doesn't perform as well,
3:17
because the data that it was originally trained on did have good spelling. And second, I also think related to that,
3:24
it might be fair to say that the data set that spaCy was trained on was relatively formal.
3:29
A lot of the data sets that were used have also been used in academia, and that's all fair and good,
3:35
but maybe not all text out there is like the text you would have in an academic setting.
3:40
Even if you have immaculate spelling, things like slang might also be hard for the spaCy model to detect. And that brings me to the final point,
3:50
and that is that the data set might be just a little bit dated. A lot of new concepts in language can be introduced over time.
3:58
Just to give one example, Brexit is definitely a phenomenon that's been in the news,
4:03
but only if you've been paying attention in the last couple of years, I suppose just like COVID.
4:09
And as far as I'm aware at least, spaCy hasn't had data sets that have these concepts in them as well.
4:16
So that means that it could be tricky for spaCy to understand these topics out of the box natively,
4:21
but there are also many other topics that might just be too new for spaCy to detect, or I should say for these base models to detect.
4:30
You can always train your own models on your own data, and we'll see later in this course how to do that,
4:35
but I do think it's fair to not expect too much from the pre-trained models that spaCy provides you. Anyway, this was a slight tangent.
4:45
If you're exploring entities in spaCy models though, I highly recommend you play around with this displaCy tool.
4:51
It is a very likable and interactive way to understand what models are detecting in sentences.