Getting Started with NLP and spaCy Transcripts
Chapter: Part 3: spaCy Projects
Lecture: Training the model

0:00 All of that brings us to the train command, which is where we're actually gonna start training our machine learning model.
0:08 Under the hood, it is just using the spaCy command line. In particular, we are passing it our configuration file over here,
0:17 but you'll notice that we are actually setting a few extra options. And there's just one interesting detail here to zoom in on.
0:25 Notice here, we're saying paths.train, and we're passing in the train file. Well, there's a correspondence between this parameter
0:36 that we're setting here and this configuration file. So let's remember, there's a paths.train value that I've set on the command line over here.
0:49 And in this configuration file, we can see that there is this paths key and that there is a train key under it.
1:00 The way to read this is that from the command line, you can actually choose to override configuration settings.
1:06 And there are moments when this is quite convenient. I personally like that it's very nice and explicit
1:13 that this is the place where we're going to be taking our training data from. This is particularly nice if we have more than one config file.
1:22 And you'll notice that we actually have a couple of these settings that we set this way. So the path to the evaluation set
1:30 is also something that I've got listed here. I'm also able to override the maximum number of steps that we're going to train for.
1:38 But you'll also notice that I'm setting an output folder over here as well. And this is basically the folder
1:45 where we're going to save our trained model. This parameter is part of the spaCy command line. This is not part of the configuration file.
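To make that concrete, here's a rough sketch of how the two pieces line up. The paths below are placeholders rather than the exact values from this project, but the shape is the same. In the config file there's a paths section with unset values:

    [paths]
    train = null
    dev = null

And the project's train command then expands to a spaCy CLI call along these lines, where the dotted flags fill those values in and --output names the folder for the trained model:

    python -m spacy train config.cfg --output trained --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy --training.max_steps 2000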
1:55 But what is going to happen when I run this step is that we are going to be training a machine learning model
1:59 and the saved representation is going to be stored in a folder called trained. So let's run this: python -m spacy project run train.
2:12 So we can see the command that's being listed here. We can see some confirmation that stuff got initialized.
2:21 And we can also see something of a progress bar or table, I guess you could say. From this table, we can see that there are steps that are being taken.
2:30 And every 200 steps, it seems that we get our evaluation that's being shown here. So we get some metrics out.
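That every-200-steps cadence is typically governed by the eval_frequency setting in the training block of the config file; assuming this project keeps spaCy's default, the relevant lines look something like:

    [training]
    eval_frequency = 200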
2:37 Some of the metrics are kind of detailed and nitty gritty, like these loss functions over here. But there are also some other metrics
2:46 that have a pretty clear interpretation. So we have the recall score, which tells us how many of the actual entities we detected.
2:53 We have the precision score over here, which tells us how often we're right when we say something is an entity.
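As a rough rule of thumb, the two numbers break down like this:

    precision = correct entity predictions / all entity predictions the model made
    recall    = correct entity predictions / all entities actually present in the evaluation data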
2:59 To get more of an intuition on these, definitely check the docs. For now, though, the main thing that's important
3:06 is that the higher these numbers are in general, the better. And as time moves forward, you can see that sometimes it is making an improvement.
3:15 Sometimes it's also degrading a little bit. And once we're done after 2000 steps, which I configured, we get this notification that indeed,
3:24 the model is done training. We can have a look in this trained folder. And one interesting thing here is that we actually see two folders appear here.
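On disk, that ends up looking roughly like the layout below. The trained folder name comes from the output setting we passed earlier; the two subfolder names are what spaCy uses by default:

    trained/
        model-best/
        model-last/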
3:33 We have one, model-last, which represents the last state that we had in training. That will be the model that we ended up with at the end over here.
3:43 That's the model that gets stored there. But we've got this other folder called model-best. Typically, this is the folder
3:51 that you'll be most interested in. And that's because theoretically it is possible, as the model is searching for the best weights,
3:59 that there is a degradation. You do see sometimes that we go from a higher number to a somewhat lower number as time moves forward.
4:08 There are all sorts of numerical reasons for this, but because we are making sure that we're storing the best model at all times,
4:15 we don't have to be afraid that we actually lose information. When you open up this folder, you can confirm that there are lots
4:23 of different files and folders over here. These are all very spaCy-specific files. But one quick thing we can do now, just as a final demo,
4:34 if I were now to open up this IPython REPL over here, what I could do is import spacy. But now I can call spacy.load
4:46 and I can point it to that model-best folder over there. You might remember that before we would write something
4:53 like en_core_web_md to specify a downloaded model, but in this case, I can actually point it at a model that we just trained.
5:02 And I can give it a sentence, something like, I enjoy using Django and FastAPI. Let's have that be the sentence going in.
5:14 I'm gonna store that in a document. I can ask for the entities. And there we go. What you're seeing now is an NLP pipeline
5:23 that we trained from scratch that is able to detect some Python libraries on our behalf. Now, what I don't wanna suggest
5:31 is that this model will be perfect because what you are seeing here is a reflection of the way that I've been annotating.
5:38 I've been annotating lots of examples like Django and FastAPI. But if I were to pick a more obscure Python library, like I enjoy using lunr.py,
5:50 which is a very cool little library actually, but I don't think I annotated that. That doesn't get detected.
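For reference, that quick demo amounts to something like the following sketch in Python. The folder path assumes the output setting from earlier, and the printed entities depend entirely on what was annotated, so treat the comments as expectations rather than guarantees:

    import spacy

    # Load the pipeline we just trained, straight from the best-model folder
    # that spaCy wrote during training.
    nlp = spacy.load("trained/model-best")

    # Libraries that showed up in the annotated training data.
    doc = nlp("I enjoy using Django and FastAPI.")
    print(doc.ents)   # expected: Django and FastAPI come back as entities

    # A more obscure library that was likely never annotated.
    doc = nlp("I enjoy using lunr.py.")
    print(doc.ents)   # likely empty: the model has not seen this one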
5:58 So again, it's definitely cool as a milestone that we are able to load up our own custom model, but I also hope that it's clear
6:05 that we're not exactly done yet because our model is definitely still making some mistakes.

