Getting Started with NLP and spaCy Transcripts
Chapter: Part 1: spaCy syntax
Lecture: Displacy

0:00 In the previous video I made a little table to show information on separate tokens, but spaCy also provides a utility via its displacy submodule

0:13 to visualize documents right from the get-go, and I figured that might be good to show too. So just for good measure, I have my sentence here,

0:22 "Hi, my name is Vincent. I like to write Python." That's my sentence going in, that is turned into a document,

0:28 and then the displacy module has a function called render that I can pass the doc into, and here's what that looks like.

0:37 Now this is a visualization that is pretty big, so I need to scroll to see it properly. There are two sentences here that are being plotted,

0:47 and what you're also seeing here is yet another property that spaCy provides. We have the token here with the part of speech attached,

0:54 but there are also grammatical relationships between the tokens that spaCy can estimate on your behalf, and these are also shown in this visualization.

1:04 There are, however, also other visualizations possible. If you're interested in grammar, this might be cool,

1:09 but sometimes you're more interested in just looking at the entities, so that's a style that you can select as well.

1:17 So in this case, we no longer see the arcs, but we do see that Vincent is detected as a person.

1:23 One thing that's actually kind of nice about this visualization is that it also shows a property of entities.

1:30 So in this case, I'm saying my somewhat full name, Vincent Warmerdam,

1:35 and if I were now to run this, you will see that Vincent Warmerdam together is seen as a single entity,

1:42 and that's something that this visual shows you quite nicely. So if you feel like playing around with spaCy and what it can detect,

1:50 you'll see that this can be a very fun interface to do that in.

1:53 One thing that I also like to do in general is show you when the model maybe doesn't work out so well.

2:00 So let's go to this base example again: "Hi, my name is Vincent. I like to write Python",

2:05 and in this case, we can see that Vincent is indeed a person. Well, let's see what happens if I were to sort of introduce a slight misspelling

2:13 by writing Vincent with a lowercase v. Well, then I get a warning.

2:20 It is warning me that no entity was detected, so spaCy is giving me a warning about it,

2:25 but I can also see from the visual that right now, Vincent is no longer being detected as an entity. So that also serves as kind of a nice reminder.

2:35 These entities that are being predicted are part of a statistical model, and the statistical model will not be perfect.

2:43 And this is especially true if you consider how the spaCy models were trained. These spaCy models were trained on a preexisting corpus,

2:50 and if you think about the data set that the model was trained on, there are a couple of properties.

2:57 One property is that the data set that spaCy has been trained on historically has always had pretty good spelling.

3:03 Names were always capitalized, but that also means that if your use case involves social media data,

3:10 let's say, where spelling isn't necessarily immaculate, well, then that might be a reason why a spaCy model doesn't perform as well,

3:17 because the data that it trained on originally did have this property. And second, I also think related to that,

3:24 it might be fair to say that the data set that spaCy was trained on was relatively formal.

3:29 A lot of the data sets that were used have also been used in academia, and that's all fair and good,

3:35 but maybe not all text out there is like the text you would have in an academic setting.

3:40 Even if you have immaculate spelling, things like slang might also be hard for the spaCy model to detect. And that brings me to the final point,

3:50 and that is also that the data set might be just a little bit dated. A lot of new concepts in language can be introduced over time.

3:58 Just to give one example, Brexit is definitely a phenomenon that's been in the news,

4:03 but only if you've been paying attention in the last couple of years, I suppose just like COVID.

4:09 And as far as I'm aware at least, spaCy hasn't had data sets that have these concepts in them as well.

4:16 So that means that it could be tricky for spaCy to understand these topics out of the box natively,

4:21 but there are also many other topics that might just be too new for spaCy to detect, or I should say for these base models to detect.

4:30 You can always train your own models on your own data, and we'll see later in this course how to do that,

4:35 but I do think it's fair to not expect too much from the pre-trained models that spaCy provides you. Anyway, this was a slight tangent.

4:45 If you're exploring entities in spaCy models though, I highly recommend you play around with this displaCy tool.

4:51 It is a very likable and interactive way to understand what models are detecting in sentences.
