Getting Started with NLP and spaCy Transcripts
Chapter: Part 2: Exploring data with spaCy
Lecture: Why generators?

0:00 So at this point I hope that you recognize that we have a nice little way to loop over all of these
0:07 separate files, and after that we are doing some generator stuff to go through each line, so to say, and this is one big generator:
0:17 it is going to give us all the lines of all the files.
0:20 But I can imagine that you might have been trained to use a different tool. Maybe you're more accustomed to
0:28 pandas when you're dealing with data, in which case you might also be more familiar with a DataFrame.
0:33 And with that in mind you might be wondering: well, why would we prefer generators in the first place?
0:38 It's a pretty good question, but this example actually highlights a reason why generators can be seen as a good thing, and
0:46 that has to do with memory use. You see, when I'm looping over this folder, then at most
0:54 one file will actually be open. We're not going to open multiple files in one go, and
1:00 that's kind of nice. I don't have to load all these separate files into memory in order to do some analysis; I can really just take it line by line.
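The looping being described could look something like this; a minimal sketch, where the function name, the `.txt` glob pattern, and the `text`/`meta` dictionary shape are assumptions rather than the course's exact code:

```python
import pathlib

def lines_from_folder(folder):
    """Yield the lines of every .txt file in `folder`, one at a time.

    Only one file is ever open at once, so memory use stays flat
    no matter how many files the folder contains.
    """
    for path in sorted(pathlib.Path(folder).glob("*.txt")):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield {"text": line.strip(), "meta": {"file": path.name}}
```

Because this is a generator, nothing is read until you ask for it with `next()` or a `for` loop.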
1:10 But there is also another reason, and that has to do with nested data structures.
1:15 So let's import spaCy to demonstrate that. I will get an nlp object and I will load the medium model. I
1:24 will reset this generator just for good measure. So let's now make a function called to_sentences; it will accept a generator, and
1:37 let's pretend that I am going to be passing the text in that line to my spaCy model and that I'm going to get all the sentences out.
1:49 Then I could say: well, for every sentence in this document, let's add a variable for that just for good measure. Well, then I can yield again,
2:01 saying something like: the text that I've got here is
2:04 the text from that sentence, and I can keep the metadata attached that was attached to that line.
2:12 But what's kind of nice: I can just use this to_sentences function on that generator I had before, and I can call next on it
2:25 just like I would before, and it kind of feels flat still. And that's kind of the nice thing here: when you keep everything inside of a generator,
2:37 Being able to always call next allows you to turn something
2:42 That's kind of nested like multiple sentences in a single doc. You can very easily make that flat
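The function being walked through here can be sketched roughly as follows. This is an assumption-laden sketch, not the course's exact code: it assumes `lines` yields dictionaries with `text` and `meta` keys, and that `nlp` is a loaded spaCy pipeline (such as the medium English model) whose docs expose `.sents`:

```python
def to_sentences(lines, nlp):
    """Flatten line-level items into sentence-level items.

    Each incoming line may contain several sentences; we yield them
    one by one, keeping the original metadata attached, so the output
    still feels flat even though the docs are nested.
    """
    for line in lines:
        doc = nlp(line["text"])
        for sent in doc.sents:
            yield {"text": sent.text, "meta": line["meta"]}

# Usage sketch (assumes the medium English model is installed):
# import spacy
# nlp = spacy.load("en_core_web_md")
# sentences = to_sentences(lines, nlp)
# next(sentences)  # one sentence plus its metadata at a time
```

Note that `doc.sents` needs a pipeline component that sets sentence boundaries (the parser in the medium model provides this).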
2:48 The fact that we are able to keep things low on memory and also do stuff like this, that's just kind of pragmatic,
2:56 not just because of the memory but also because these documents tend to have nested objects in them, and
3:02 using a generator is just kind of a nice way to unnest it.
3:06 Not to mention the fact that if at some point we're going to be doing this with huge datasets,
3:11 then this whole "we're not loading all the data into memory immediately" aspect of it is going to matter a lot, too.

