Getting Started with NLP and spaCy Transcripts
Chapter: Part 2: Exploring data with spaCy
Lecture: Diving into transcripts

0:00 Okay. So far in this series of videos we've really just been discussing the spaCy API
0:08 and I think at this point we've kind of got some of the basics covered. So now what I
0:12 would like to do is just get a fun dataset in and just really start using spaCy. And
0:17 as I was looking for a fun dataset I was kind of reminded that we have this Talk Python
0:23 podcast. It's a podcast that you might have heard of, it's a pretty good one, it's about
0:27 Python. But the cool thing about this podcast in particular is that Michael, the
0:32 host of this program, actually maintains a GitHub repository with all the transcripts.
0:38 So if you go to GitHub to Mike Kennedy and then Talk Python transcripts, that's the name
0:44 of the repo, then you have this repository that actually has all the transcripts of past
0:49 episodes. And these go back a long time, like many, many years. So I'm just going to grab
0:56 one at random, let's go for this one. And this is one kind of transcript where you can
1:03 see a timestamp, then a name and then a colon. And then basically we can read what was spoken
1:11 at that point in time in the podcast. Now one thing to keep in mind, and this is usually
1:17 true when you're dealing with text data, is that the data is not necessarily perfect.
1:21 And there's a couple of reasons for it. One is we see that we have these multiple formats
1:26 that we might want to deal with, and not just in the file names, but we can also see that
1:30 here I've got a file where I do have a timestamp, but I don't have the name of a person saying
1:35 something. But there is also something else which this GitHub message is actually hinting
1:41 at, and that is the fact that all of these transcripts are generated by a machine learning
1:47 model. So we shouldn't assume that these transcripts are going to be a perfect representation of
1:51 what was said. I am going to assume they're good enough though for what we're going to
1:55 try and do. But if you want to follow along, basically now might be a good time to go to
2:02 this GitHub repository and clone it locally, just so you have access to this transcripts
2:07 folder over here, because we're going to do a bunch of fun stuff with this text data.
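
To make the two transcript shapes mentioned above a bit more concrete, here is a minimal sketch of how a single line could be split into a timestamp, an optional speaker name, and the spoken text. The exact layout of the files is an assumption here (a timestamp like `0:05` or `00:00:05`, then optionally a capitalized name followed by a colon), and the names `Utterance`, `parse_line`, and `parse_transcript` are hypothetical helpers, not part of spaCy or of the repository. Check a few real files before relying on it.

```python
import re
from typing import List, NamedTuple, Optional


class Utterance(NamedTuple):
    timestamp: str
    speaker: Optional[str]  # None when the line has no "Name:" part
    text: str


# ASSUMPTION: a line looks like "<timestamp> <Speaker Name>: <text>",
# where the speaker part is optional. The real files may differ.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{1,2}:\d{2}(?::\d{2})?)\s+"
    r"(?:(?P<speaker>[A-Z][\w.'-]*(?: [A-Z][\w.'-]*){0,3}):\s+)?"
    r"(?P<text>.*)$"
)


def parse_line(line: str) -> Optional[Utterance]:
    """Return an Utterance, or None if the line has no leading timestamp."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return Utterance(match["timestamp"], match["speaker"], match["text"])


def parse_transcript(raw: str) -> List[Utterance]:
    """Parse every line of a transcript, skipping lines that do not match."""
    parsed = (parse_line(line) for line in raw.splitlines())
    return [utterance for utterance in parsed if utterance is not None]


if __name__ == "__main__":
    sample = (
        "0:05 Michael Kennedy: Hello and welcome.\n"
        "0:12 Welcome to the show.\n"
    )
    for utterance in parse_transcript(sample):
        print(utterance)
```

The speaker match is only a heuristic: it accepts one to four capitalized words directly before a colon, so a line that happens to start with something like "Note:" would be read as a speaker. Since the transcripts are machine generated anyway, that kind of noise is worth keeping in mind rather than trying to engineer away up front.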

