Getting Started with NLP and spaCy Transcripts
Chapter: Part 2: Exploring data with spaCy
Lecture: Performance: Part 1

0:00 I have my spaCy model loaded, and right now I want to do something with the entities of all the lines that I've got.
0:09 So your first inclination might be to write code that looks a little bit like this. To keep things relatively lightweight,
0:15 what I'm doing first is I'm making sure that I'm only grabbing the first 1000 examples from my lines over here. But after that, I'm saying,
0:23 well, let's loop over every single line in that subset. Let's grab the text from that line. Let's pass that text into my NLP model,
0:31 and then I have a document that can totally give me the entities that I need. Now, this code will work,
0:37 but let's just track how long it takes to actually run this. All right, it seems to take about seven seconds.
0:46 Note, by the way, that what I'm using here is something called a Jupyter magic. In particular, I'm using the %%time magic on this cell.
0:53 And effectively what it does is it's just going to try and run the cell while keeping track of how long it took to run everything in it.
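
To make the pattern concrete, here is a minimal sketch of what that first cell might look like. The lines generator, its "text" key, and the en_core_web_sm model are assumptions for illustration; the actual names on screen in the video may differ.

```python
import itertools
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; the course may use a different one

# Stand-in for the course's `lines` generator, which yields one
# dictionary per transcript line (the keys here are assumptions).
def lines():
    yield {"text": "Apple is looking at buying a U.K. startup.", "speaker": "guest"}
    yield {"text": "Guido van Rossum created Python in the Netherlands.", "speaker": "host"}

# %%time  <- in a Jupyter cell, this magic reports how long the cell took

# Keep things lightweight: only the first 1000 examples.
subset = itertools.islice(lines(), 1000)

# Naive approach: one full pipeline run per single text.
for line in subset:
    doc = nlp(line["text"])
    print(doc.ents)
```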
1:00 So, okay, seven seconds for 1000 examples, and I have many thousands of them. It would be kind of nice if we could speed this up.
1:08 And there is one big thing we can do right from the get-go. When we have a look at what's happening here,
1:15 I have my NLP model and I'm giving it a single line of text. Now, you can imagine kickstarting
1:21 the big machine learning engine just to analyze one text. We're going to do that over and over again as we are running this for loop.
1:30 Instead, what might be better is if we kickstart the big machine learning model over here once and then give it a batch of texts,
1:37 because then there's all sorts of internal optimizations that might be able to happen. Stuff might get vectorized and that sort of thing.
1:44 spaCy also has support for this. So let's rewrite the cell just so we can see how we might be able to improve it.
1:51 All right, so here is a revised version. A lot of stuff is still the same. I still have my subset,
1:59 but the next thing that I do is grab every text that I have in this subset. Remember that this lines generator that I've got
2:10 returns me dictionaries, and the spaCy model really just needs the text. So by doing it this way, texts right now is a generator of strings.
2:20 And that is something that I can pass to the nlp.pipe method. By doing this, spaCy actually has the opportunity to do some batching internally,
2:30 which means that this should run a whole lot quicker. And when I iterate over this, I just get my document objects directly this way.
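
A sketch of that batched version, reusing the nlp model and lines generator from the sketch above, might look like this:

```python
subset = itertools.islice(lines(), 1000)

# Pull out just the strings; `texts` is a lazy generator of str.
texts = (line["text"] for line in subset)

# nlp.pipe batches the texts internally, so the heavy model spins up
# once for the whole stream instead of once per text.
for doc in nlp.pipe(texts):
    print(doc.ents)
```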
2:38 And indeed, this definitely runs a whole lot quicker. So that's certainly very nice. However, there is this one awkward thing at the moment
2:49 with the way that this loop is currently set up. And that is, if I were to call next on the lines again, then sure, I am using the text here for spaCy
2:58 and that's great, but I am losing this meta information, which might actually be useful too,
3:03 depending on what I want to do with this data set afterwards. So with that in mind, there is this one extra thing that we can do
3:11 if we were to rewrite this one more time. And there we go. What I've now done is I've rewritten this line that turns my dictionaries into texts
3:23 and I've adapted it to make a generator that returns me tuples. The first item in the tuple is the text
3:31 that I do want to see translated into a document, but the second item can just remain a dictionary. Now what I can do is I can actually tell
3:38 this nlp.pipe method that the data stream that's coming in represents tuples. And then the assumption is that spaCy
3:46 should only really treat the first item of a tuple as text and then the second item will just remain intact, which means that within the for loop,
3:54 I still have access to the document and my entities, but I also still have access to the original dictionary with all the meta information.
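
Roughly, that final version might look like the sketch below, again reusing nlp and lines from before. With as_tuples=True, nlp.pipe yields (doc, context) pairs, so the metadata dictionary rides along untouched.

```python
subset = itertools.islice(lines(), 1000)

# Generator of (text, context) tuples: spaCy processes the first
# item and passes the second item through unchanged.
data = ((line["text"], line) for line in subset)

for doc, meta in nlp.pipe(data, as_tuples=True):
    print(doc.ents, meta)
```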
4:03 Let's run this. And this is definitely nice because I can confirm that we're not really taking a performance hit by doing this.
4:11 So if we're going to run this product hypothesis on all of our data, this might be a very nice way to do that.

