Getting Started with NLP and spaCy Transcripts
Chapter: Part 2: Exploring data with spaCy
Lecture: Giving the setup a spin
0:00
With the speed improvements in place, it now feels like I can actually test my hypothesis. I am importing two extra tools.
0:10
I'm importing the Counter object from Python's collections module. And I'm also using a library called tqdm that will give me a progress bar.
0:18
Next, I have my spaCy model that only does named entity recognition. And I also rewrote the loop that we had before
0:28
because it does a couple of extra things now. So what am I doing? First, I am just making sure that I'm dealing with a fresh generator.
0:36
I am then initializing a Counter object that I'll use in a bit. And then I'm saying how many lines
0:41
I actually want to go ahead and read from this generator. In this case, I'm just doing 500, but I can easily increase this number.
0:48
I am then making my subset just like I did before. I am then making my generator tuples again, just like before.
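The setup described so far can be sketched roughly as follows. This is only an illustration: read_lines is a hypothetical stand-in for the course's transcript reader, and the dictionary layout of each example is an assumption, not the code shown on screen.

```python
from itertools import islice


def read_lines(n_total=2_000):
    """Hypothetical stand-in for the course's transcript reader."""
    for i in range(n_total):
        yield {"text": f"line {i}", "meta": {"line_no": i}}


n_lines = 500  # easy to raise later, for example to 50_000

g = read_lines()             # start from a fresh generator on every run
subset = islice(g, n_lines)  # lazily take only the first n_lines items
# Pair each text with its original example so the context travels along.
tuples = ((ex["text"], ex) for ex in subset)

first_text, first_example = next(tuples)
print(first_text, first_example["meta"])
```

Because everything stays lazy, nothing is read until the loop actually asks for the next item, which is why the line count has to be known separately if a progress bar should show a total.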
0:56
But then I'm using this progress bar library, which I'm able to give a generator and I'm also able to pass the total number of items
1:03
in that generator as an integer. And that's nice because then this progress bar can give me lots of relevant information.
1:10
And I know the number of lines that I'm about to draw from the get-go. So that's something I can totally put in here.
1:16
Next comes the big for loop over here. I am passing it this tqdm-wrapped variable over here. And that's something that spaCy can still batch.
1:25
I'm still treating this as tuples. So I have my document and my original example at the ready. But for every document that I got here,
1:33
what I'll be doing is I'll be looping over all the detected entities. And then if any of the entities have the label PRODUCT,
1:42
then I'm keeping track of the text that the entity has. This gives me a list of entities. I can then pass that to a new Counter object.
1:54
And this is going to count how often each entity appears. And then this Counter object can be used to update what I will call the global counter.
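The counting step just described could look something like the sketch below. The function name count_entities and the fake_doc helper are made up for illustration; the fake documents only mimic the doc.ents, ent.text and ent.label_ attributes that spaCy documents expose, so the example runs without a model installed.

```python
from collections import Counter
from types import SimpleNamespace


def count_entities(doc_tuples, label="PRODUCT"):
    """Count entity texts with the given label across (doc, example) pairs."""
    global_counter = Counter()
    for doc, _example in doc_tuples:
        # Keep only the text of entities that carry the label of interest.
        texts = [ent.text for ent in doc.ents if ent.label_ == label]
        # Count this document's entities, then fold them into the running total.
        global_counter.update(Counter(texts))
    return global_counter


def fake_doc(*ents):
    """Stand-in for a spaCy Doc: just an object with an .ents list."""
    return SimpleNamespace(
        ents=[SimpleNamespace(text=t, label_=l) for t, l in ents]
    )


docs = [
    (fake_doc(("Django", "PRODUCT"), ("Python", "ORG")), {"line": 0}),
    (fake_doc(("Django", "PRODUCT"), ("Flask", "PRODUCT")), {"line": 1}),
]
counts = count_entities(docs)
print(counts)
```

With a real pipeline, the iterable passed in would presumably be the progress-bar-wrapped stream of (doc, example) pairs that spaCy yields when it processes tuples in batches.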
2:05
And therefore every time I go through the loop, this counter is going to get an update. Hopefully when I run this,
2:11
I should just get an overview of examples that get detected as a product. So that ran relatively quickly, which is nice.
2:21
But let's now inspect the counter. Django got detected a bunch of times. FastAPI got detected, JavaScript,
2:30
EuroPython, which is a conference, not a tool really. Twitter is also not really a tool, but Flask got detected.
2:38
I'm seeing Ninja here, which might refer to Jinja instead. But in general, if I were to just look at this, it seems that my product hypothesis
2:48
is not that big of a stretch. There's definitely a couple of programming tools in here. And that is pretty interesting.
2:55
If I'm interested in finding programming languages in these transcripts, this might not be a bad starting point.
3:02
Okay, so let's just go through a whole bunch of lines now, not 500, let's go through 50,000. All right, so that took a bit less than a minute,
3:13
but we definitely went through a whole bunch of data. I'm happy we took the effort of making somewhat performant code here.
3:18
That speedup is definitely something we're getting benefits from now, but let's explore the counter one more time. Okay, so again, not bad.
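For exploring a counter like this one, a small sketch may help. The names are borrowed from the lecture, but the counts are invented purely for illustration and are not the results from the video.

```python
from collections import Counter

# Invented counts, for illustration only.
counts = Counter({"Django": 12, "Flask": 7, "Excel": 3, "Twitter": 1})

# The most frequent entities are usually the quickest sanity check.
top_two = counts.most_common(2)

# Entities seen only once are good candidates for a manual look.
rare = [name for name, n in counts.items() if n < 2]

print(top_two)
print(rare)
```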
3:32
I guess we see some operating systems, which you could argue are kind of like programming tools. Is Excel a programming tool? Yes, no, I mean,
3:41
that's more of a philosophical debate at some point. But just from glancing at this again, a lot of this stuff definitely feels like it's tools.
3:50
Probably not everything, but there's definitely a bunch of stuff in here that does feel appropriate. And it does feel like I hit a nice balance
3:58
between effort and reward. This is actually kind of a nice example of how you might be able to use spaCy.
4:07
I'm able to reuse an entity that a spaCy model does provide. And even though it is not a perfect match, given that I have a very specific dataset,
4:16
I might still be able to reuse it in an interesting way. I should remember that even though there's a couple of entities here that have been detected,
4:23
it is likely that there's also a bunch of entities in this document that could be a programming tool
4:30
that I'm missing because this is definitely only a subset. But again, as a first iteration, I think this is pretty nice.