Getting Started with NLP and spaCy Transcripts
Chapter: Part 2: Exploring data with spaCy
Lecture: Cleaning transcripts
0:00
Right, so what I've now done is I have downloaded the transcripts, I have my little transcripts
0:08
folder over here and I can confirm just from opening up a file over here that these are
0:13
indeed transcripts and what I will be doing is I'll be focusing in on the txt files that are in this folder.
0:23
But as we saw in a previous video, this is a text file where we're going to have these
0:27
timestamps so I'm going to have to do something clever that turns this into some clean usable data. And I have written a little bit of logic for that.
0:37
Let's open that. And here is just some utility code. What I'll do is I'll just quickly go over what's happening here.
0:46
Definitely feel free to just copy this code. But the reason I want to sort of just go through this is also because usually some data cleaning
0:53
needs to happen before you're going to do NLP and this serves kind of as a nice tangible example.
0:59
So just as a rough sketch, what is this code doing over here? Well I'm using a regex, that's what you see me do over here, and that regex is basically
1:07
there to detect the timestamp that we have on the line. So if I were to look at this function over here, I give it a path and then the goal of
1:18
this function is to give me a generator that yields every single line, properly parsed, with some meta information.
1:26
So I'm going over every line in that path and I'm going to match a regex and if it matches then I'm going to do some logic.
1:33
So if I see a timestamp appear on the line then I'm dealing with a line that I'm interested in.
1:39
Then this variable is basically that line without the timestamp.
1:43
And next what I do is I use the colon to figure out if there is maybe a name, because remember
1:49
some of the files that we saw had a name attached as well. And if there is, there's just a little bit of extra logic for me to find the speaker.
1:58
All of this stuff is pretty useful. Sometimes I will have some meta information about the speaker, but the main thing I'm
2:04
interested in is just every single line that's appearing and I'm outputting that in this yield statement over here.
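As a rough illustration of the steps just described, here is a minimal sketch of such a function. It is not the code shown in the video: the name `episode_lines`, the timestamp pattern (`MM:SS` or `HH:MM:SS` at the start of a line), the speaker heuristic, and the keys of the yielded dictionary are all assumptions made for this example.

```python
import re
from pathlib import Path

# Assumed format: a timestamp such as "00:05" or "01:02:03" at the start of a line.
TIMESTAMP = re.compile(r"^(\d{1,2}:\d{2}(?::\d{2})?)\s+(.*)$")


def episode_lines(path):
    """Yield one dict per timestamped line of a transcript file (hypothetical helper)."""
    path = Path(path)
    speaker = "unknown"  # used until a speaker label is seen
    turn = 0             # incremented every time a speaker label is found
    with path.open(encoding="utf-8") as handle:
        for raw in handle:
            match = TIMESTAMP.match(raw.strip())
            if match is None:
                continue  # no timestamp, so not a line we are interested in
            text = match.group(2).strip()  # the line without the timestamp
            # Heuristic (an assumption): a short alphabetic prefix before a
            # colon, such as "Jane Doe: hello", is treated as a speaker name.
            name, sep, rest = text.partition(":")
            looks_like_name = (
                name.replace(" ", "").isalpha() and len(name.split()) <= 3
            )
            if sep and rest.strip() and looks_like_name:
                speaker = name.strip()
                turn += 1
                text = rest.strip()
            yield {
                "text": text,
                "meta": {"speaker": speaker, "turn": turn, "file": path.name},
            }
```

The speaker check is deliberately simple and will misfire on some lines, which is in keeping with the point that cleaning code like this is rarely complete.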
2:13
So maybe just for good measure, let's come back to that little bit of extra code at the bottom later, but let's just give a quick demo of this.
2:21
So I'm saying episode lines, let's just give it one of the files. So I have my transcripts folder and then I have that htmx for Django developers file.
2:38
This function returns a generator, so what I should be able to do is just call next on it.
2:44
And we can see the first sentence that was spoken in that transcript file. Part of the metadata here is telling me that the speaker is unknown.
2:51
This was the first turn in the episode in terms of speakers, and the file is attached just for good measure.
2:59
And this little generator will just loop over every single line. And this is just kind of nice.
3:07
It gives me a nice way to just loop over all the different lines in a single file.
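To make the generator behaviour concrete, here is a small sketch using a stand-in generator. The function `demo_lines` and its contents are invented for illustration and are not taken from the transcripts.

```python
def demo_lines():
    # Stand-in for a transcript generator; the texts below are made up.
    yield {"text": "first line", "meta": {"speaker": "unknown", "turn": 0}}
    yield {"text": "second line", "meta": {"speaker": "unknown", "turn": 0}}
    yield {"text": "third line", "meta": {"speaker": "unknown", "turn": 0}}


lines = demo_lines()

# next() pulls exactly one item from the generator.
first = next(lines)

# Iterating afterwards continues where next() stopped.
rest = [item["text"] for item in lines]

print(first["text"])
print(rest)
```

Because a generator is consumed as it is read, calling `demo_lines()` again is the way to start over from the first line.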
3:12
Now of course doing that for a single file over here is nice and all, but I also want to do this for every single file.
3:20
So that's what this function does: basically, it just allows me to do the same trick.
3:27
I have all of the lines in an episode, I can just call next on it and this is going to give me every single line in the generator.
3:34
Note by the way that I'm doing this with reversed sorting, so newest episodes kind of go first.
3:41
But again I really just want to have a generator here that can loop over all the different sentences.
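A minimal sketch of that "every file" generator could look as follows. The names `all_lines` and `read_lines` are hypothetical, `read_lines` is only a simplified stand-in for the per-file parser, and the claim that reverse sorting puts the newest episodes first assumes that the file names sort chronologically.

```python
from pathlib import Path


def read_lines(path):
    # Simplified stand-in for the per-file parser: yields every non-empty line.
    path = Path(path)
    with path.open(encoding="utf-8") as handle:
        for raw in handle:
            if raw.strip():
                yield {"text": raw.strip(), "meta": {"file": path.name}}


def all_lines(folder, parse=read_lines):
    """Yield the lines of every .txt file in a folder, one file after another."""
    # Reverse-sorted by path; newest first only if names sort chronologically.
    for path in sorted(Path(folder).glob("*.txt"), reverse=True):
        yield from parse(path)
```

Chaining the per-file generators with `yield from` keeps everything lazy, so only one file is open at a time no matter how many transcripts are in the folder.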
3:46
What we're going to do soon is we're going to use the text over here and that's something we're going to pass to spaCy eventually.
3:52
But again I do hope that it's clear that even though this cleaning code is probably not
3:57
complete, when you're doing NLP there's always a step that kind of looks like this.
4:02
You are going to have to think about what data is coming in and how you want to pass that forward in a somewhat clean way.
4:07
Investing in a function like this definitely saves a whole lot of time later.