Getting Started with NLP and spaCy Transcripts
Chapter: Part 2: Exploring data with spaCy
Lecture: Diving into transcripts
0:00
Okay. So far in this series of videos we've really just been discussing the spaCy API,
0:08
and I think at this point we've kind of got some of the basics covered. So now what I
0:12
would like to do is just get a fun dataset in and just really start using spaCy. And
0:17
as I was looking for a fun dataset I was kind of reminded that we have this Talk Python
0:23
podcast. It's a podcast that you might have heard of, it's a pretty good one, it's about
0:27
Python. But the cool thing about this podcast in particular is that Michael, the
0:32
host of this program, actually maintains a GitHub repository with all the transcripts.
0:38
So if you go to GitHub, to Mike Kennedy and then Talk Python transcripts (that's the name
0:44
of the repo), then you have this repository that actually has all the transcripts of past
0:49
episodes. And these go back a long time, like many, many years. So I'm just going to grab
0:56
one at random, let's go for this one. And this is one kind of transcript where you can
1:03
see a timestamp, then a name and then a colon. And then basically we can read what was spoken
1:11
at that point in time in the podcast. Now one thing to keep in mind, and this is usually
1:17
true when you're dealing with text data, is that the data is not necessarily perfect.
1:21
And there's a couple of reasons for it. One is we see that we have these multiple formats
1:26
that we might want to deal with, and not just in the file names, but we can also see that
1:30
here I've got a file where I do have a timestamp, but I don't have the name of a person saying
1:35
something. But there is also something else which this GitHub message is actually hinting
1:41
at, and that is the fact that all of these transcripts are generated by a machine learning
1:47
model. So we shouldn't assume that these transcripts are going to be a perfect representation of
1:51
what was said. I am going to assume they're good enough though for what we're going to
1:55
try and do. But if you want to follow along, basically now might be a good time to go to
2:02
this GitHub repository and clone it locally, just so you have access to this transcripts
2:07
folder over here, because we're going to do a bunch of fun stuff with this text data.
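As a rough sketch of the line format described above (a timestamp, then sometimes a name and a colon, then the spoken text), something like the following could split a transcript into its parts. The exact timestamp shape, the `Utterance` type, the helper names, and the sample lines are all assumptions made up for illustration; they are not taken from the repository, and the real files may need a different pattern.

```python
import re
from typing import List, NamedTuple, Optional


class Utterance(NamedTuple):
    """One transcript line split into its parts (hypothetical helper type)."""
    timestamp: str
    speaker: Optional[str]  # None when the line has no "Name:" prefix
    text: str


# Assumed line shapes (not confirmed against the real files):
#   "<timestamp> <Speaker Name>: <text>"
#   "<timestamp> <text>"
# where the timestamp looks like "0:05", "12:34" or "00:12:34".
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\d{1,2}:\d{2}(?::\d{2})?)\s+"
    r"(?:(?P<speaker>[A-Z][\w .'-]{0,40}?):\s+)?"
    r"(?P<text>.+)$"
)


def parse_line(line: str) -> Optional[Utterance]:
    """Return the parsed line, or None if it does not start with a timestamp."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return Utterance(match["timestamp"], match["speaker"], match["text"])


def parse_transcript(raw: str) -> List[Utterance]:
    """Parse every line of a transcript, skipping lines that do not match."""
    parsed = (parse_line(line) for line in raw.splitlines())
    return [utterance for utterance in parsed if utterance is not None]


# Made-up sample lines, only to exercise both assumed formats.
SAMPLE = """\
00:00:00 Michael Kennedy: Welcome to the show.
00:00:04 Guest Name: Thanks for having me.

00:00:09 this line has a timestamp but no speaker
"""

if __name__ == "__main__":
    for utterance in parse_transcript(SAMPLE):
        print(utterance)
```

The speaker match is only a heuristic: a line without a name whose text starts with a capitalized phrase followed by a colon would be misread as having a speaker. To try it on the cloned repository, the contents of each file in the transcripts folder could be read with `pathlib.Path.read_text()` and passed to `parse_transcript`.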