Getting Started with NLP and spaCy Transcripts
Chapter: Part 2: Exploring data with spaCy
Lecture: Performance: Part 3

0:00 As a next and also final step for this line of work, I figured that I would run the same exercise but on all the lines that I've got.
0:11 So I'm going over all the lines, counting one for each line, and taking the sum, and that gives me about 84,000 lines.
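The count-one-per-line-and-sum step can be sketched like this; `lines` is a hypothetical list standing in for the roughly 84,000 transcript lines loaded elsewhere in the course.

```python
# `lines` stands in for the ~84,000 transcript lines from the course.
lines = ["first transcript line", "second transcript line"]

# Count one for each line and take the sum.
n_lines = sum(1 for _ in lines)
```
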
0:21 I've updated the number of lines over here, so the code that's listed here will now actually give me the counter and find all the detected products, which might well be programming utilities, from the transcripts, and that's great.
0:38 And looking at this setup, it seems to take about two minutes, which, you know, is pretty decent given the number of items I've got here. I could argue that's pretty quick.
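The counting loop described above looks roughly like this. This is a minimal sketch, not the course's exact code: the course uses a trained pipeline, while here a blank English pipeline with an illustrative entity ruler keeps the example self-contained, and `lines` stands in for the transcript lines.

```python
from collections import Counter

import spacy

# Blank pipeline plus a rule-based entity ruler; the PRODUCT patterns
# here are illustrative stand-ins for whatever the trained model detects.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "pandas"},
    {"label": "PRODUCT", "pattern": "numpy"},
])

# Stand-in for the ~84,000 transcript lines.
lines = ["I installed pandas and numpy today.", "pandas is great."]

counter = Counter()
# nlp.pipe batches the texts internally, which is already faster than
# calling nlp(line) on one text at a time.
for doc in nlp.pipe(lines, batch_size=50):
    for ent in doc.ents:
        if ent.label_ == "PRODUCT":
            counter[ent.text] += 1
```
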
0:50 But there is one extra thing we can actually do to make it just a bit faster. What I'm about to suggest won't always make things go faster, but in this case I found that it actually did.
1:03 And that is that I can add this one extra parameter to my nlp.pipe method.
1:10 You see, this pipe over here is able to batch the data in, and that's already a good performance boost, but it also has some multi-core capabilities in it.
1:21 The thing with multi-core processing, though, is that it can be a bit hit or miss, because there is a little bit of syncing that needs to happen as well.
1:29 Going through these batches is something we might be able to do in parallel if we give it more cores.
1:36 But the stuff that I'm doing inside of this for loop, well, that's still very much a single-threaded thing.
1:44 So again, your mileage may vary if you do stuff like this, but if you're working on big datasets it can make a difference.
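The one extra parameter being discussed is `n_process` on `nlp.pipe`, which spreads the batches over multiple worker processes. A hedged sketch, again with a self-contained blank pipeline and illustrative patterns rather than the course's trained model:

```python
from collections import Counter

import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PRODUCT", "pattern": "pandas"}])

# Stand-in data; repeated to give the workers something to chew on.
lines = ["I installed pandas today.", "pandas is great."] * 100

counter = Counter()
# n_process=2 runs the pipeline batches in two worker processes, and
# batch_size controls how many texts each worker gets at a time.
# The body of this for loop still runs single-threaded in the main
# process, which is part of why the speed-up can be hit or miss.
for doc in nlp.pipe(lines, n_process=2, batch_size=50):
    for ent in doc.ents:
        if ent.label_ == "PRODUCT":
            counter[ent.text] += 1
```

Note that multiprocessing adds per-worker startup and syncing overhead, so on small datasets or machines with few cores this can be slower than the default `n_process=1`.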
1:51 Because here you can definitely see that, from the two minutes, we're almost down to 1:40 here.
1:58 That's still not a bad chunk of performance, I guess. So that's something to keep in mind if you're dealing with very big datasets.
2:07 And if you're working on a machine that actually has a couple of cores then this is something I would also try out.
2:12 Now, having said all that, there is also another line of work that we should pursue, because you could wonder if we really have to go through the effort of resorting to somewhat heavy machine learning models.
2:24 Maybe, if we want to detect Python tools in these transcripts, there is just another, simpler technique that we can try.

