Getting Started with NLP and spaCy Transcripts
Chapter: Part 3: spaCy Projects
Lecture: What is an NLP project
0:00
So I'm just going to draw out schematically what kind of things I need in my NLP project, just to kind of get the project structure maybe going.
0:10
So one thing I've got, let's draw that over here, are my transcripts.
0:16
These are the things that were spoken inside of a podcast, and there's stuff in here that I would like to predict.
0:22
However, if I'm going to have a machine learning model learn anything, then I will also need to have some labels.
0:29
I will need to figure out some sort of way to turn at least a subset of these transcripts into a subset that I will call annotated.
0:39
And just to give a quick example, if I have a sentence, something like "Python is nice",
0:46
then this annotated subset would have that sentence, but also something that indicates that Python over here, that is a tech tool, let's say.
0:53
And I need to have some sort of data set where my machine learning model is able to learn from these annotated patterns.
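
As a rough sketch of what one annotated example could look like, the snippet below stores the sentence together with character offsets for the labelled span. The label name `TECH_TOOL` and the dictionary layout are assumptions made for illustration, not necessarily the format used later in the course.

```python
# Hypothetical annotated example: the raw text plus labelled character spans.
example = {
    "text": "Python is nice",
    "spans": [{"start": 0, "end": 6, "label": "TECH_TOOL"}],
}


def labelled_substrings(example):
    """Return (substring, label) pairs for each annotated span."""
    return [
        (example["text"][span["start"]:span["end"]], span["label"])
        for span in example["spans"]
    ]


print(labelled_substrings(example))  # [('Python', 'TECH_TOOL')]
```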
1:01
Once I've got my annotated subset, there's actually another step, and that is to maybe prepare this data set for training.
1:10
There's a little bit of a detail here. Typically what you want to do is have one set of data that you are going to
1:16
train on, and another set of data that you're going to use for evaluation.
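
A minimal sketch of such a split, assuming the annotated examples fit in a plain Python list. The function name `train_eval_split`, the 20% evaluation fraction, and the fixed seed are illustrative choices, not something prescribed by the lecture.

```python
import random


def train_eval_split(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into train and eval lists."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Placeholder data standing in for annotated transcript sentences.
annotated = [f"sentence {i}" for i in range(10)]
train, evaluation = train_eval_split(annotated)
print(len(train), len(evaluation))  # 8 2
```

Keeping the seed fixed means the evaluation set stays the same across runs, so scores from different training runs remain comparable.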
1:23
Then this training data set over here, that can be used to train a machine learning model.
1:29
And that machine learning model, maybe we want to be able to package that.
1:33
And as you can see from this little overview, I do hope that you appreciate that there are actually a bunch of steps here that depend on each other.
1:41
And it'd be nice if we can structure our project accordingly.
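
One way to picture steps that depend on each other is as a small dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to work out an order in which the steps could run. The step names are hypothetical labels for the boxes in the diagram.

```python
from graphlib import TopologicalSorter

# Each key maps to the things it depends on (names are hypothetical).
# "transcripts" is the raw input, so it has no dependencies of its own.
steps = {
    "annotate": {"transcripts"},
    "prepare": {"annotate"},
    "train": {"prepare"},
    "package": {"train"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(steps).static_order())
print(order)  # ['transcripts', 'annotate', 'prepare', 'train', 'package']
```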
1:44
Another aspect of this: suppose that I've got my annotated subset over here. Well, then I can train a machine learning model.
1:54
But if this subset doesn't change, then there's also no need to retrain this machine learning model.
1:58
So there's also something I would like to have in the system that is going to prevent unnecessary work.
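
To make the idea of preventing unnecessary work concrete, here is a hand-rolled sketch that hashes a step's input files and skips the step when nothing has changed since the last run. The helper `run_if_changed`, the lock file name, and the file layout are all assumptions for illustration. This is not how any particular tool implements it.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_if_changed(name, deps, lock_path, action):
    """Run `action` only when the hashes of `deps` differ from the last run."""
    lock_path = Path(lock_path)
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    current = {str(p): file_digest(p) for p in deps}
    if lock.get(name) == current:
        return False  # inputs unchanged, so skip the step
    action()
    lock[name] = current
    lock_path.write_text(json.dumps(lock, indent=2))
    return True


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "annotated.jsonl"
    lock_file = Path(tmp) / "steps.lock.json"
    data.write_text('{"text": "Python is nice"}\n')

    runs = []  # records each time the (pretend) training step executes

    def train():
        runs.append("train")

    first = run_if_changed("train", [data], lock_file, train)
    second = run_if_changed("train", [data], lock_file, train)  # skipped
    data.write_text('{"text": "spaCy is nice"}\n')  # the subset changes
    third = run_if_changed("train", [data], lock_file, train)

print(first, second, third)  # True False True
```

The training step only runs again once the annotated file's contents change, which is the behaviour described above.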
2:04
So hopefully this diagram paints you a picture of what we need. We are going to need separate steps in this entire process.
2:12
But before diving into the code, what I would like to do first is give a glimpse of how to do this part.
2:22
Creating proper training data is an art in and of itself. But there are things that we have at our disposal to make this easier.
2:29
And I'm going to discuss that first before moving on to how I'm going to implement this project structure.