Getting Started with NLP and spaCy Transcripts
Chapter: Part 3: spaCy Projects
Lecture: What is an NLP project

0:00 So I'm just going to draw out schematically what kind of things I need in my NLP project, just to kind of get the project structure maybe going.
0:10 So one thing I've got, let's draw that over here, are my transcripts.
0:16 These are the things that were spoken inside of a podcast, and there's stuff in here that I would like to predict.
0:22 However, if I'm going to have a machine learning model learn anything, then I will also need to have some labels.
0:29 I will need to figure out some sort of way to turn at least a subset of these transcripts into a subset that I will call annotated.
0:39 And just to give a quick example, if I have a sentence, something like "Python is nice,"
0:46 then this annotated subset would have that sentence, but also something that indicates that Python over here, that is a tech tool, let's say.
0:53 And I need to have some sort of data set where my machine learning model is able to learn from these annotated patterns.
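To make the annotated example concrete, here is a minimal sketch of how such a sentence could be stored. The dictionary layout, the character-offset convention, and the label name "TECH_TOOL" are assumptions for illustration; the lecture does not specify a format at this point.

```python
# Hypothetical annotated example: the label name "TECH_TOOL" and the
# dictionary layout are illustrative assumptions, not the course's format.
annotated_example = {
    "text": "Python is nice",
    # Character offsets: text[0:6] == "Python"
    "spans": [{"start": 0, "end": 6, "label": "TECH_TOOL"}],
}


def labeled_substrings(example):
    """Return (substring, label) pairs for every annotated span."""
    text = example["text"]
    return [(text[s["start"]:s["end"]], s["label"]) for s in example["spans"]]


print(labeled_substrings(annotated_example))  # [('Python', 'TECH_TOOL')]
```

The point is only that each annotated sentence carries both the raw text and a pointer to the part that the model should learn to recognize.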
1:01 Once I've got my annotated subset, there's actually another step, and that is to maybe prepare this data set for training.
1:10 There's a little bit of a detail here. Typically, what you want is to have one set of data that you are going to
1:16 train on, and another set of data that you're going to use for evaluation.
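The train/evaluation split described here could be sketched as follows. The function name `split_train_eval`, the 20% evaluation fraction, and the fixed seed are all assumptions chosen for illustration, not values taken from the lecture.

```python
import random


def split_train_eval(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, eval) lists.

    eval_fraction and seed are illustrative defaults, not recommendations.
    """
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Placeholder data standing in for annotated sentences.
examples = [f"sentence {i}" for i in range(10)]
train_set, eval_set = split_train_eval(examples)
print(len(train_set), len(eval_set))  # 8 2
```

Keeping the evaluation set separate means the model is scored on sentences it never saw during training.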
1:23 Then this training data set over here, that can be used to train a machine learning model.
1:29 And that machine learning model, maybe we want to be able to package that.
1:33 And as you can see from this little overview, I do hope that you appreciate that there are actually a bunch of steps here that depend on each other.
1:41 And it'd be nice if we can structure our project accordingly.
1:44 Another aspect of this: suppose that I've got my annotated subset over here. Well, then I can train a machine learning model.
1:54 But if this subset doesn't change, then there's also no need to retrain this machine learning model.
1:58 So there's also something I would like to have in the system that is going to prevent unnecessary work.
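One common way to prevent unnecessary work is to record a hash of a step's input and skip the step when that hash has not changed. The sketch below illustrates the idea in plain Python; the helper names `file_hash` and `run_if_changed`, the file names, and the JSON state file are hypothetical and are not how any particular tool implements it.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_if_changed(input_path, state_path, step):
    """Run step() only if the input differs from the last recorded hash.

    Returns True if the step ran, False if it was skipped.
    """
    state_path = Path(state_path)
    current = file_hash(input_path)
    previous = None
    if state_path.exists():
        previous = json.loads(state_path.read_text())["hash"]
    if current == previous:
        return False
    step()
    state_path.write_text(json.dumps({"hash": current}))
    return True


runs = []


def train():
    # Stand-in for an expensive training step.
    runs.append("trained")


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "annotated.jsonl"      # hypothetical file name
    state = Path(tmp) / "train.state.json"    # hypothetical file name
    data.write_text('{"text": "Python is nice"}\n')
    first = run_if_changed(data, state, train)    # input is new: runs
    second = run_if_changed(data, state, train)   # unchanged: skipped
    data.write_text('{"text": "Python is very nice"}\n')
    third = run_if_changed(data, state, train)    # changed: runs again

print(first, second, third)  # True False True
```

Hashing the contents rather than checking timestamps means the step is skipped even if the file was rewritten with identical data.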
2:04 So hopefully this diagram paints you a picture of what we need. We are going to need separate steps in this entire process.
2:12 But before diving into the code, what I would just like to do first is just give a glimpse of how to do this part.
2:22 Creating proper training data is an art in and of itself. But there are things that we have at our disposal to make this easier.
2:29 And I'm going to discuss that first before moving on to how I'm going to implement this project structure.

