Getting Started with NLP and spaCy Transcripts
Chapter: Part 2: Exploring data with spaCy
Lecture: Cleaning transcripts

0:00 Right, so what I've now done is I have downloaded the transcripts, I have my little transcripts
0:08 folder over here and I can confirm just from opening up a file over here that these are
0:13 indeed transcripts and what I will be doing is I'll be focusing in on the txt files that are in this folder.
0:23 But as we saw in a previous video, this is a text file where we're going to have these
0:27 timestamps so I'm going to have to do something clever that turns this into some clean usable data. And I have written a little bit of logic for that.
0:37 Let's open that. And here is just some utility code. What I'll do is I'll just quickly go over what's happening here.
0:46 Definitely feel free to just copy this code. But the reason I want to sort of just go through this is also because usually some data cleaning
0:53 needs to happen before you're going to do NLP and this serves kind of as a nice tangible example.
0:59 So just as a rough sketch, what is this code doing over here? Well I'm using a regex, that's what you see me do over here, and that regex is basically
1:07 there to detect the timestamp that we have on the line. So if I were to look at this function over here, I give it a path and then the goal of
1:18 this function is to give me a generator with every single line properly parsed with some meta information.
1:26 So I'm going over every line in that path and I'm going to match a regex and if it matches then I'm going to do some logic.
1:33 So if I see a timestamp appear on the line then I'm dealing with a line that I'm interested in.
1:39 Then this variable is basically that line without the timestamp.
1:43 And next what I do is I use the colon to figure out if there is maybe a name, because remember
1:49 some of the files that we saw had a name attached as well. And if there is, there's just a little bit of extra logic for me to find the speaker.
1:58 All of this stuff is pretty useful. Sometimes I will have some meta information about the speaker, but the main thing I'm
2:04 interested in is just every single line that's appearing and I'm outputting that in this yield statement over here.
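
The utility code itself is not shown in the transcript, so here is a minimal sketch of what a function like this could look like. The name episode_lines comes from the demo; the regex patterns, the speaker heuristic, the dictionary keys and the turn numbering (starting at 0) are all assumptions for illustration, not the course's actual code.

```python
import re

# Assumed formats: "0:37 some text" or "0:37 Speaker Name: some text".
TIMESTAMP = re.compile(r"^(\d+:\d{2}(?::\d{2})?)\s+(.*)$")
# Heuristic (assumption): one to three capitalized words before a colon are a speaker name.
SPEAKER = re.compile(r"^([A-Z][\w.'-]*(?:\s[A-Z][\w.'-]*){0,2}):\s+(.+)$")


def episode_lines(path):
    """Yield one dict per timestamped line in a transcript file."""
    speaker, turn = "unknown", 0
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            stamped = TIMESTAMP.match(raw.strip())
            if not stamped:
                continue  # no timestamp: header, blank line, or footer
            text = stamped.group(2)  # the line without its timestamp
            named = SPEAKER.match(text)
            if named:
                if named.group(1) != speaker:
                    turn += 1  # a different speaker starts a new turn
                speaker, text = named.group(1), named.group(2)
            yield {"text": text, "speaker": speaker, "turn": turn, "file": str(path)}
```

Lines without a timestamp are silently skipped here, which is a design choice: it drops page headers and footers for free, but it would also drop real content if a transcript ever wrapped a sentence onto an unstamped line.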
2:13 So maybe just for good measure, let's come back to that little bit of extra code at the bottom later, but let's just give a quick demo of this.
2:21 So I'm saying episode lines, let's just give it one of the files. So I have my transcripts folder and then I have that htmx for Django developers file.
2:38 This function returns a generator, so what I should be able to do is just call next on it.
2:44 And we can see the first sentence that was spoken in that transcript file. Part of the metadata here is telling me that the speaker is unknown.
2:51 This was the first turn in the episode in terms of speakers and the file is attached just for good measure.
2:59 And this little generator will just loop over every single line. And this is just kind of nice.
3:07 Gives me a nice way to just loop over all the different lines in a single file.
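
For anyone less familiar with generators, here is a small sketch of the behaviour being demoed. The numbered_lines function is a hypothetical stand-in, not the course's function; it only exists to show that next pulls a single item and that a loop afterwards picks up where next left off.

```python
def numbered_lines(lines):
    """Hypothetical stand-in generator: yields one dict per input line, lazily."""
    for number, text in enumerate(lines):
        yield {"text": text, "line_number": number}


gen = numbered_lines(["first line", "second line", "third line"])

first = next(gen)  # pulls exactly one item; nothing else has been processed yet
remaining = [row["text"] for row in gen]  # the loop resumes after the first item

print(first)
print(remaining)
```

Because nothing is computed until it is asked for, the same pattern works on a file with thousands of lines without loading everything into memory first.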
3:12 Now of course doing that for a single file over here is nice and all, but I also want to do this for every single file.
3:20 So that's what this function does. But basically it just allows me to do the same trick.
3:27 I have all of the lines in an episode, I can just call next on it and this is going to give me every single line in the generator.
3:34 Note by the way that I'm doing this with reversed sorting, so newest episodes kind of go first.
3:41 But again I really just want to have a generator here that can loop over all the different sentences.
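
A minimal sketch of how the every-file version could be written. The name all_lines, the parse_file parameter and the idea that file names sort chronologically are assumptions for illustration; the transcript only confirms that the files are .txt files visited in reverse sorted order.

```python
from pathlib import Path


def all_lines(folder, parse_file):
    """Yield every row from every .txt file in `folder`, in reverse sorted order.

    `parse_file` is any function that takes a path and returns an iterable of rows.
    """
    # Assumption: file names sort chronologically (for example, they start with
    # an episode number), so reverse sorting puts the newest episodes first.
    for path in sorted(Path(folder).glob("*.txt"), reverse=True):
        yield from parse_file(path)
```

Passing the per-file parser in as an argument keeps this function independent of how a single file is cleaned, so the cleaning logic can change without touching the loop over files.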
3:46 What we're going to do soon is we're going to use the text over here and that's something we're going to pass to spaCy eventually.
3:52 But again I do hope that it's clear that even though this cleaning code is probably not
3:57 complete, when you're doing NLP there's always a step that kind of looks like this.
4:02 You are going to have to think about what data is coming in and how do I want to pass that forward in a somewhat clean way.
4:07 Investing in a function like this definitely saves a whole lot of time later.

