Getting Started with NLP and spaCy Transcripts
Chapter: Part 1: spaCy syntax
Lecture: Displacy

0:00 In the previous video I made a little table to show information on separate tokens, but spaCy also provides a utility via its displacy submodule
0:13 to visualize documents right from the get-go, and I figured that might be good to show too. So just for good measure, I have my sentence here,
0:22 "Hi, my name is Vincent, I would like to write Python." That's my sentence going in, that is turned into a document,
0:28 and then the displacy module has a function called render that I can pass the doc into, and here's what that looks like.
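As a rough sketch of the call described here, the block below builds a small Doc by hand and passes it to displacy.render. The heads, dependency labels, and part-of-speech tags are hand-written assumptions so that the sketch runs without a downloaded model. In the lecture the doc comes from a trained pipeline instead, and the transcript does not say which one. The sketch assumes spaCy v3.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

# Blank English pipeline: tokenizer and vocab only, no trained components.
nlp = spacy.blank("en")

# Hand-annotated parse (an assumption for illustration, not model output).
words = ["I", "would", "like", "to", "write", "Python"]
pos = ["PRON", "AUX", "VERB", "PART", "VERB", "PROPN"]
heads = [2, 2, 2, 4, 2, 4]  # absolute index of each token's head
deps = ["nsubj", "aux", "ROOT", "aux", "xcomp", "dobj"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

# style="dep" is the default; jupyter=False returns the SVG markup as a string
# instead of trying to display it in a notebook.
svg = displacy.render(doc, style="dep", jupyter=False)
print(svg[:60])
```

In a notebook, leaving out jupyter=False lets displacy detect the environment and draw the arcs inline, which matches what the video shows.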
0:37 Now this is a visualization that is pretty big, so I need to scroll to see it properly. There are two sentences here that are being plotted,
0:47 and what you're also seeing here is yet another property that spaCy does provide you. We have the token here with the part of speech attached,
0:54 but there are also grammatical relationships between the tokens that spaCy can estimate on your behalf, and these are also shown in this visualization.
1:04 There are, however, also other visualizations possible. If you're interested in grammar, this might be cool,
1:09 but sometimes you're more interested in just looking at the entities, so that's a style that you can select as well.
1:17 So in this case, we no longer see the arcs, but we do see that Vincent is detected as a person.
1:23 One thing that's actually kind of nice about this visualization is that it also shows a property of entities.
1:30 So in this case, I'm saying my somewhat full name, Vincent Warmerdam,
1:35 and if I were now to run this, you will see that Vincent Warmerdam together is seen as a single entity,
1:42 and that's something that this visual shows you quite nicely. So if you feel like playing around with spaCy and what it can detect,
1:50 you'll see that this can be a very fun interface to do that in.
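The entity style can be sketched in the same way. To stay runnable without a model download, the example below marks "Vincent Warmerdam" as a PERSON span by hand. That annotation is an assumption standing in for what the trained pipeline predicts in the video. The point it illustrates is that one entity can cover several tokens, and that style="ent" highlights it as a single unit.

```python
import spacy
from spacy import displacy

# Blank English pipeline: tokenizes text but predicts no entities.
nlp = spacy.blank("en")
text = "Hi, my name is Vincent Warmerdam. I would like to write Python."
doc = nlp(text)

# Hand-set entity (illustrative assumption, not model output).
start = text.index("Vincent Warmerdam")
end = start + len("Vincent Warmerdam")
person = doc.char_span(start, end, label="PERSON")
doc.ents = [person]

# One entity, two tokens.
for ent in doc.ents:
    print(ent.text, ent.label_, len(ent))

# style="ent" highlights entity spans instead of drawing dependency arcs.
html = displacy.render(doc, style="ent", jupyter=False)
```

With a trained pipeline loaded through spacy.load, doc.ents is filled in by the model and the same render call applies unchanged.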
1:53 One thing I also like to do in general is show you when the model maybe doesn't work out so well.
2:00 So let's go to this base example again: "Hi, my name is Vincent. I like to write Python,"
2:05 and in this case, we can see that Vincent is indeed a person. Well, let's see what happens if I introduce a slight misspelling
2:13 by writing Vincent with a lowercase v. Well, then I get a warning.
2:20 spaCy is warning me that no entities were detected,
2:25 and I can also see from the visual that right now, Vincent is no longer being detected as an entity. So that also serves as kind of a nice reminder.
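The warning itself can be reproduced with a small sketch. It uses a blank pipeline, which has no entity recognizer, so doc.ents is always empty. That is an assumption standing in for the trained model missing the lowercase name. When displacy is asked for the entity style on a doc with no entities, it emits a warning and still returns markup for the plain text.

```python
import warnings

import spacy
from spacy import displacy

# Blank pipeline: no NER component, so doc.ents is empty for any input.
nlp = spacy.blank("en")
doc = nlp("hi, my name is vincent. I like to write Python.")

# Record warnings so they can be inspected instead of only printed to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    html = displacy.render(doc, style="ent", jupyter=False)

print(len(doc.ents))
for w in caught:
    print(w.category.__name__, w.message)
```

Checking doc.ents directly, as the last lines do, is a quick way to confirm whether a missing highlight means the model found nothing.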
2:35 These entities that are being predicted are part of a statistical model, and the statistical model will not be perfect.
2:43 And this is especially true if you consider how the spaCy models were trained. These spaCy models were trained on a preexisting corpus,
2:50 and if you think about the data set that the model was trained on, there are a couple of properties.
2:57 One property is that the data set that spaCy has trained on historically has always had pretty good spelling.
3:03 Names were always capitalized, but that also means that if your use case involves social media data,
3:10 let's say, where spelling isn't necessarily immaculate, well, then that might be a reason why a spaCy model doesn't perform as well,
3:17 because the data that it trained on originally did have this property. And second, I also think related to that,
3:24 it might be fair to say that the data set that spaCy was trained on was relatively formal.
3:29 A lot of the data sets that were used have also been used in academia, and that's all fair and good,
3:35 but maybe not all text out there is like the text you would have in an academic setting.
3:40 Even if you have immaculate spelling, things like slang might also be hard for the spaCy model to handle. And that brings me to the final point,
3:50 which is that the data set might be just a little bit dated. A lot of new concepts in language can be introduced over time.
3:58 Just to give one example, Brexit is definitely a phenomenon that's been in the news,
4:03 but only if you've been paying attention in the last couple of years, I suppose just like COVID.
4:09 And as far as I'm aware at least, spaCy hasn't had data sets that have these concepts in them as well.
4:16 So that means that it could be tricky for spaCy to understand these topics out of the box natively,
4:21 but there are also many other topics that might just be too new for spaCy to detect, or I should say for these base models to detect.
4:30 You can always train your own models on your own data, and we'll see later in this course how to do that,
4:35 but I do think it's fair to not expect too much from the pre-trained models that spaCy provides you. Anyway, this was a slight tangent.
4:45 If you're exploring entities in spaCy models though, I highly recommend you play around with this displacy tool.
4:51 It is a very likable and interactive way to understand what models are detecting in sentences.

