Getting Started with NLP and spaCy Transcripts
Chapter: Part 2: Exploring data with spaCy
Lecture: Why generators?
So at this point I hope you recognize that we have a nice little way to loop over all of these
separate files, and after that we are doing some generator stuff to go through each line, so to say. This is one big generator:
it is going to give us all the lines of all the files.
But I can imagine that you might have been trained to use different tools. Maybe you're more accustomed to
pandas when you're dealing with data, in which case you might also be more familiar with a data frame.
And with that in mind you might be wondering: well, why would we prefer generators in the first place?
It's a pretty good question, but this example actually highlights one reason why generators could be seen as a good thing, and
that has to do with memory use. You see, when I'm looping over this folder, at most
one file will actually be open at a time. We're not going to open multiple files in one go, and
that's kind of nice. I don't have to load all these separate files into memory in order to do some analysis; I can really just take it line by line.
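The one-file-at-a-time behavior can be sketched roughly like this. The function name `read_lines`, the `*.txt` pattern, and the `"text"`/`"meta"` keys are assumptions made for illustration, not the exact code from the lecture, and the demo folder is a throwaway one.

```python
from pathlib import Path
import tempfile


def read_lines(folder):
    """Yield one dict per line; only one file is open at any moment."""
    for path in sorted(Path(folder).glob("*.txt")):
        # The file is opened here and closed again before the next one opens.
        with path.open(encoding="utf-8") as f:
            for i, line in enumerate(f):
                yield {
                    "text": line.rstrip("\n"),
                    "meta": {"file": path.name, "line": i},  # hypothetical metadata
                }


# Demo on a temporary folder with two small files.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "a.txt").write_text("first line\nsecond line\n", encoding="utf-8")
(demo_dir / "b.txt").write_text("third line\n", encoding="utf-8")

gen = read_lines(demo_dir)
first = next(gen)  # at this point only a.txt has been opened
print(first)
```

Nothing is read until `next` is called, so the cost of each step is one line, no matter how many files sit in the folder.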
But there is also another reason, and that has to do with nested data structures.
So let's import spaCy to demonstrate that. I will get an nlp object by loading the medium model, and I
will reset this generator just for good measure. Let's now make a function called to_sentences. It will accept a generator, and
let's pretend that I am going to be passing the text in each line to my spaCy model and that I'm going to get all the sentences out.
Then I could say: for every sentence in this document (let's add a variable for that, just for good measure), I can yield again,
saying something like: the text that I've got here is
the text from that sentence, and I can keep the metadata attached that was attached to that line.
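A minimal sketch of such a function might look like the following. It assumes each item is a dict with `"text"` and `"meta"` keys. To stay runnable without a model download it swaps the medium model for a blank English pipeline with spaCy's rule-based `sentencizer`, so the sentence boundaries may differ from what the lecture's model would give.

```python
import spacy

# Assumption: a blank pipeline plus the rule-based sentencizer stands in
# for the medium model used in the lecture.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")


def to_sentences(lines):
    """Turn a stream of lines into a flat stream of sentences."""
    for line in lines:
        doc = nlp(line["text"])
        for sent in doc.sents:
            # One item per sentence; the metadata of the source line is kept.
            yield {"text": sent.text, "meta": line["meta"]}


# Hypothetical input that mimics the line generator.
lines = iter([
    {"text": "Hello there. How are you?", "meta": {"file": "a.txt"}},
    {"text": "Fine, thanks.", "meta": {"file": "b.txt"}},
])

sentences = to_sentences(lines)
first = next(sentences)
print(first)
```

The nested loop is the whole trick: one line can hold several sentences, but the caller only ever sees one flat item per `next` call.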
But what's kind of nice is that I can just use this to_sentences function on the generator I had before, and I can call next on it
just like I would before. It still kind of feels flat, and that's the nice thing here when you keep everything inside of a generator.
Being able to always call next allows you to take something
that's kind of nested, like multiple sentences in a single doc, and very easily make it flat.
The fact that we are able to keep memory use low and also do stuff like this is just kind of pragmatic:
not just because of the memory, but also because these documents tend to have nested objects in them, and
using a generator is just a nice way to unnest them.
Not to mention the fact that if at some point we're going to be doing this with huge datasets,
then this whole "we're not loading all the data into memory immediately" aspect of it is going to matter a lot, too.
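To get a feel for why that matters, here is a small sketch that uses `itertools.islice` from the standard library to peek at the first few items of a very large stream. The `numbered_lines` generator is a hypothetical stand-in for a big dataset; only the items actually requested are ever created.

```python
from itertools import islice


def numbered_lines(n):
    """Hypothetical stand-in for a huge dataset: yields n small dicts lazily."""
    for i in range(n):
        yield {"text": f"line {i}", "meta": {"line": i}}


big = numbered_lines(10_000_000)  # nothing has been produced yet

# Take just the first three items; the other ~10 million are never built.
head = list(islice(big, 3))
print(head)
```

Calling `list(big)` instead would materialize every item at once, which is exactly what the generator approach avoids.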