Getting Started with NLP and spaCy Transcripts
Chapter: Part 3: spaCy Projects
Lecture: Annotation
0:00
So let's talk about data annotation for just a bit. And again, we have our transcripts and we would somehow like to turn these into annotations.
0:12
Then one thing that I could do is I could just go through this list of texts one by one, and then I could use some sort of a UI
0:22
to highlight where my entities are. And that could give me my annotations. And again, what I would need
0:29
is this: if I have a sentence like "Python is nice", then I would like to have some sort of user interface that allows me to highlight "Python" in this case
0:38
and say that that's a tech tool, let's say. This is all well and good. And one direct approach would be to say, well, just take that big list
0:47
and go through a lot of them in this user interface. But there are a couple of problems with that. In particular, these transcripts are sorted
0:56
and it could be that we have to go through a very specific episode, maybe an episode that's all about Django. And it might take us half an hour
1:04
before we get to the next episode that's all about Click. And before you know it, you've spent an hour annotating
1:11
while you've only covered a small amount of the surface area of all the tools that you would like to get examples of in your annotated dataset.
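To make the target annotation concrete, here is a minimal sketch of what a single highlighted example, like the "Python is nice" sentence, could look like as data. The record layout (a text plus character-offset spans) and the label name TECH_TOOL are assumptions for illustration, not something the lecture specifies.

```python
# Hypothetical record layout: the text plus character-offset spans.
# The helper name and the label "TECH_TOOL" are illustrative assumptions.
def make_annotation(text, phrase, label):
    """Return a record marking the first occurrence of `phrase` in `text`."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    span = {"start": start, "end": start + len(phrase), "label": label}
    return {"text": text, "spans": [span]}


example = make_annotation("Python is nice", "Python", "TECH_TOOL")

# Slicing the text with the stored offsets recovers the highlighted phrase.
span = example["spans"][0]
highlighted = example["text"][span["start"]:span["end"]]
```

Storing offsets rather than only the phrase keeps the record unambiguous when the same word appears more than once in a text.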
1:19
So maybe the right way to go about this is to try and see if we can do things that make it easy for ourselves. And in this particular case,
1:27
I actually did a little bit of extra work to do just that. Because you see one thing that we can do is we can take all of these transcripts
1:35
and we can train a little search engine. In particular, there's a lovely little Python library called lunr.py,
1:43
which is something that I've used in the past. What that allows us to do is build an index
1:49
such that if we ever have a specific query like Django, for example, then the search engine can retrieve, say, 50 examples that have Django in them.
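The idea of indexing the transcripts and pulling back matching examples can be sketched with a tiny inverted index. This is a standard-library stand-in written for illustration, not the lunr.py API, and the sample sentences, function names, and the limit default are all made up here.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(texts):
    """Map each token to the set of positions of the texts containing it."""
    index = defaultdict(set)
    for i, text in enumerate(texts):
        for token in tokenize(text):
            index[token].add(i)
    return index


def search(index, texts, query, limit=50):
    """Return up to `limit` texts that contain every token of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return [texts[i] for i in sorted(hits)[:limit]]


# Hypothetical transcript snippets, invented for this sketch.
transcripts = [
    "We rewrote the admin views in Django last year.",
    "Click makes it easy to build command line apps.",
    "Just click the button and the page reloads.",
    "Django and Flask are both web frameworks.",
]
index = build_index(transcripts)
django_hits = search(index, transcripts, "Django")
click_hits = search(index, transcripts, "click")
```

A keyword search like this returns every literal match, so the query for "click" brings back both sample sentences that contain the word, and a human annotator still decides which hits actually mention a tool.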
2:00
And then once we've done Django, we might be able to move on to another project like Click. Click might be especially interesting
2:07
because not every instance of the word Click will refer to the Python library that's called Click. But again, you can imagine
2:13
that having such a search engine around might actually make our day a whole lot easier. Especially if we're able to use that search engine
2:21
inside of this user interface to help us steer the stuff that we'd like to annotate next. And for this particular project in this particular demo,
2:30
I've actually been doing just this. For the labeling tool, I am going to be using Prodigy. Note that Prodigy is made by the same people
2:39
who are making spaCy, but it deserves to be said that Prodigy is a paid tool. And I should also be upfront and mention
2:46
that I was a core developer of this product when I was employed over at Explosion. I definitely feel that Prodigy is a very powerful tool,
2:54
but I'll gladly leave it up to your own discretion to see if you need it. There are other annotation interfaces out there as well.
3:01
For this particular course, the main thing that I do think is relevant and important is that you think about ways
3:06
to make annotation easier for yourself. In this particular case, I felt that having a search engine around was going to make it easier for me,
3:14
but there are many techniques out there. And as you're going to be iterating, odds are that you will be using one technique for one part of the data
3:22
and maybe another technique for another. Having said all this though, what I would now like to do is just give you a quick demo
3:29
of the UI and the setup. So you can also kind of see what it's like to be annotating this dataset.
3:35
All right, so this is what my user interface looks like. What I'm able to do is I'm able to say, well, I'm interested in looking for instances
3:44
where FastAPI made an appearance. And then this interface allows me to say, well, that's FastAPI, let's highlight that. That's also FastAPI.
3:53
Let's accept that. That's also FastAPI. So far, so good. And then after a while, after I feel that I've, whoops,
4:02
annotated enough of these FastAPI examples, I can also just hit save for now and maybe look for Flask instead. So Flask makes an appearance there,
4:16
makes an appearance there, et cetera. And accept that. So I hope that you agree that being able to annotate this way is actually really, really nice,
4:27
but I still need to be in the loop, so to say, as a human. I need to make sure that I cover enough ground with these queries that I get a good portion
4:36
of the Python tools in here. And this will also require a little bit of iteration. It is possible that at some point we have a trained model
4:43
and we learn that it's really bad at detecting some kinds of tools. And then I will have to iterate and then make sure that I add tools
4:52
that the model gets wrong in here. So as I mentioned before, I have already been annotating for a bit. I have a small dataset annotated for now,
5:02
about 140 examples or so. And while I will definitely need more data moving forward at some point, I do think that this is enough
5:13
to start talking about the project some more. So let's move on to that.