Getting Started with NLP and spaCy Transcripts
Chapter: Part 2: Exploring data with spaCy
Lecture: Diving into transcripts

0:00 Okay. So far in this series of videos we've really just been discussing the spaCy API

0:08 and I think at this point we've kind of got some of the basics covered. So now what I

0:12 would like to do is just get a fun dataset in and just really start using spaCy. And

0:17 as I was looking for a fun dataset I was kind of reminded that we have this Talk Python

0:23 podcast. It's a podcast that you might have heard of, it's a pretty good one, it's about

0:27 Python. But the cool thing about this podcast in particular is that Michael, the

0:32 host of this program, actually maintains a GitHub repository with all the transcripts.

0:38 So if you go to GitHub to Mike Kennedy and then Talk Python transcripts, that's the name

0:44 of the repo, then you have this repository that actually has all the transcripts of past

0:49 episodes. And these go back a long time, like many, many years. So I'm just going to grab

0:56 one at random, let's go for this one. And this is one kind of transcript where you can

1:03 see a timestamp, then a name and then a colon. And then basically we can read what was spoken

1:11 at that point in time in the podcast. Now one thing to keep in mind, and this is usually

1:17 true when you're dealing with text data, is that the data is not necessarily perfect.

1:21 And there's a couple of reasons for it. One is we see that we have these multiple formats

1:26 that we might want to deal with, and not just in the file names, but we can also see that

1:30 here I've got a file where I do have a timestamp, but I don't have the name of a person saying

1:35 something. But there is also something else which this GitHub message is actually hinting

1:41 at, and that is the fact that all of these transcripts are generated by a machine learning

1:47 model. So we shouldn't assume that these transcripts are going to be a perfect representation of

1:51 what was said. I am going to assume they're good enough though for what we're going to

1:55 try and do. But if you want to follow along, basically now might be a good time to go to

2:02 this GitHub repository and clone it locally, just so you have access to this transcripts

2:07 folder over here, because we're going to do a bunch of fun stuff with this text data.
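
As a rough sketch of what reading one of these files could look like once the repository is cloned: the snippet below assumes each line starts with a timestamp, optionally followed by a speaker name and a colon, and then the spoken text. The exact line format, the `parse_line` and `parse_transcript` helpers, and the capitalised-name heuristic for spotting a speaker are all assumptions made for illustration, not something the repository guarantees.

```python
import re
from pathlib import Path

# Assumed line shape: "<timestamp> [<Speaker Name>:] <text>".
# The speaker part is optional because some files only carry a timestamp.
# Heuristic (an assumption): a speaker is one to three capitalised words
# directly followed by a colon.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{1,2}:\d{2}(?::\d{2})?)\s+"
    r"(?:(?P<speaker>[A-Z][\w.'-]*(?:\s[A-Z][\w.'-]*){0,2}):\s+)?"
    r"(?P<text>.*)$"
)


def parse_line(line):
    """Return a dict with timestamp, speaker (or None) and text, or None if no match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


def parse_transcript(path):
    """Parse every recognisable line of one transcript file (hypothetical helper)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [parsed for parsed in map(parse_line, lines) if parsed is not None]


if __name__ == "__main__":
    # Two made-up example lines: one with a speaker, one without.
    print(parse_line("00:01:03 Michael Kennedy: Hello and welcome."))
    print(parse_line("0:12 would like to do is just get a fun dataset in"))
```

Because the transcripts are machine generated and come in more than one format, a heuristic like this will misfire on some lines (for example, text that happens to start with a capitalised word and a colon), so it is worth spot-checking the output on a few files before relying on it.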
