Getting Started with NLP and spaCy Transcripts
Chapter: Part 2: Exploring data with spaCy
Lecture: Why generators?
0:00
So at this point I hope that you recognize that we have a nice little way to loop over all of these
0:07
separate files, and then after that we are doing some generator stuff to go through each line, so to speak. And this is one big generator:
0:17
it is going to give us all the lines of all the files.
0:20
But I can imagine that you might have been trained to use different tools. Maybe you're more accustomed to
0:28
pandas when you're dealing with data, in which case you might also be more familiar with a DataFrame.
0:33
And with that in mind you might be wondering: well, why would we prefer generators in the first place?
0:38
It's a pretty good question, but this example actually highlights a reason why generators could be seen as a good thing, and
0:46
that has to do with memory use. You see, when I'm looping over this folder, at most
0:54
one file will actually be open at a time. We're not going to open multiple files in one go, and
1:00
that's kind of nice. I don't have to load all these separate files into memory in order to do some analysis; I can really just take it line by line.
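To make the "at most one file open" point concrete, here is a minimal sketch. The function name read_lines, the *.txt pattern, and the text/meta dictionary layout are assumptions for illustration, not necessarily what is on screen in the video.

```python
import tempfile
from pathlib import Path


def read_lines(folder):
    """Yield one dict per line, across every .txt file in `folder` (hypothetical layout)."""
    for path in sorted(Path(folder).glob("*.txt")):
        # Only this one file is open while its lines are being consumed.
        with path.open(encoding="utf-8") as f:
            for i, line in enumerate(f):
                yield {"text": line.rstrip("\n"), "meta": {"file": path.name, "line": i}}


# Demo on a throwaway folder with two tiny files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first line\nsecond line\n", encoding="utf-8")
    Path(tmp, "b.txt").write_text("third line\n", encoding="utf-8")

    lines = read_lines(tmp)
    first = next(lines)  # opens a.txt and reads a single line
    rest = list(lines)   # finishes a.txt, closes it, then moves on to b.txt
```

Nothing is read until next is called, and each file is closed before the next one is opened, so memory use stays roughly at one line at a time.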
1:10
But there is also another reason, and that has to do with nested data structures.
1:15
So let's import spaCy to demonstrate that. I will get myself an nlp object, and I will load the medium model. I
1:24
will reset this generator just for good measure. So let's now make a function called to_sentences. It will accept a generator, and
1:37
let's pretend that I am going to be passing the text in that line to my spaCy model, and that I'm going to get all the sentences out.
1:49
Then I could say: well, for every sentence in this document (let's add a variable for that, just for good measure), I can yield again,
2:01
saying something like: the text that I've got here is
2:04
the text from that sentence, and I can keep the metadata attached, the metadata that was attached to that line.
2:12
But here's what's kind of nice: I can just use this to_sentences function on that generator I had before, and I can call next on it
2:25
just like I would before. And it kind of feels flat still, and that's kind of the nice thing here when you keep everything inside of a generator.
2:37
Being able to always call next allows you to turn something
2:42
that's kind of nested, like multiple sentences in a single doc, into something flat, and you can do that very easily.
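The flattening step described here might look roughly like the sketch below. To keep it runnable without a model download, it uses a naive regex splitter (split_sentences, a hypothetical helper) as a stand-in for spaCy; in the lecture the sentences come from the spaCy doc instead, and the text/meta dictionary layout is again an assumption.

```python
import re


def split_sentences(text):
    """Naive stand-in for spaCy's sentence segmentation; splits after ., ! or ?.

    With a loaded spaCy pipeline `nlp`, this could instead be:
        return [sent.text for sent in nlp(text).sents]
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def to_sentences(lines):
    """Turn a stream of lines into a flat stream of sentences."""
    for line in lines:
        for sent in split_sentences(line["text"]):
            # One item per sentence, with the metadata of the line it came from.
            yield {"text": sent, "meta": line["meta"]}


# Hypothetical input: any iterator of dicts with "text" and "meta" works.
lines = iter([
    {"text": "Generators are lazy. They also flatten nicely.", "meta": {"line": 0}},
    {"text": "One sentence here.", "meta": {"line": 1}},
])

sentences = to_sentences(lines)
first = next(sentences)      # first sentence of the first line
remaining = list(sentences)  # the rest, still one flat sequence
```

The caller never sees the nesting: two sentences from one line and one sentence from the next arrive as three items in a row, each carrying its line's metadata.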
2:48
The fact that we are able to keep memory use low and also do stuff like this is just kind of pragmatic,
2:56
not just because of the memory, but also because these documents tend to have nested objects in them, and
3:02
using a generator is just kind of a nice way to unnest them.
3:06
Not to mention the fact that if at some point we're going to be doing this with huge datasets,
3:11
then this whole "we're not loading all the data into memory immediately" aspect of it is going to matter a lot, too.
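As a small illustration of why this matters at scale, the sketch below peeks at the first few items of a very long stream without materializing the rest. The numbered_lines generator is a hypothetical stand-in for a large dataset; itertools.islice is from the Python standard library.

```python
from itertools import islice


def numbered_lines(n):
    """Hypothetical stand-in for a huge dataset: yields n items lazily."""
    for i in range(n):
        yield {"text": f"line {i}", "meta": {"line": i}}


big = numbered_lines(10_000_000)

# Only three items are ever created here; the other ~10 million are not.
preview = list(islice(big, 3))

# The generator remembers where it stopped, so next picks up right after.
following = next(big)
```

Calling list(big) directly would build every item in memory at once, which is exactly what the generator approach avoids.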