Getting Started with NLP and spaCy Transcripts
Chapter: Part 3: spaCy Projects
Lecture: ML Config
0:00
In a previous video, we showed this convert script from this convert step. And we also showed that it generated these .spacy files,
0:10
which are binary representations of documents. And because those binary representations have a very strict, well understood internal schema,
0:19
the format is relatively lightweight and spaCy knows what to look for. So that's great. We need this in order to train a machine learning model.
0:27
However, there are lots of ways to train a machine learning model. And there are also lots of settings. So the next thing that we will need
0:35
is some sort of configuration file. And I have this one extra step over here that is going to generate one such file.
0:43
Under the hood, it uses the spaCy command line utility to initialize a config file. I can tell it where I want the config file to go,
0:53
but I can also give it some extra settings. In this particular case, I'm saying, well, I want to do named entity recognition.
1:00
I'm working with the English language and I want you to care about efficiency. By choosing this setting, we are effectively saying that we don't care
1:10
about having the best of the best of the best model, because that might imply that we have a model
1:15
that's very heavy and might be very compute intensive. Instead, we are actually fine with having some settings
1:22
that are pretty good, but can actually run quite quickly. In general, I advise everyone to go with this setting.
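As a rough sketch of what that step runs under the hood, the snippet below assembles the `spacy init config` call in Python. The output path `configs/config.cfg` and the helper name `build_init_config_command` are assumptions made for illustration; `--lang`, `--pipeline`, and `--optimize` are options of spaCy's command line utility.

```python
import shlex
import sys


def build_init_config_command(output_path, lang="en", pipeline=("ner",), optimize="efficiency"):
    """Assemble the argument list for `spacy init config` (hypothetical helper)."""
    if optimize not in ("efficiency", "accuracy"):
        raise ValueError("optimize must be 'efficiency' or 'accuracy'")
    return [
        sys.executable, "-m", "spacy", "init", "config",
        str(output_path),                   # where the config file should go
        "--lang", lang,                     # English in this lecture
        "--pipeline", ",".join(pipeline),   # named entity recognition only
        "--optimize", optimize,             # favor speed over the very best scores
    ]


# The output path is an assumption; point it wherever your project keeps configs.
command = build_init_config_command("configs/config.cfg")

# Show the arguments in a copy-pasteable form. Passing `command` to
# subprocess.run(command, check=True) would execute it if spaCy is installed.
print(shlex.join(command[1:]))
```

Swapping `optimize="efficiency"` for `optimize="accuracy"` is the one change needed if your use case cares more about model quality than speed.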
1:29
If you optimize for accuracy instead, you might need a GPU, but I figured I'd mention it explicitly because you might have a use case
1:36
where you care more about this. So that is definitely an easy setting that you can change. After you've run this config command,
1:44
you will see a configuration file that looks something like this. This is my base configuration file. And at first glance,
1:53
you will notice that there are lots and lots of settings. And I can definitely imagine that this is somewhat intimidating at first.
2:02
In general, my advice would be not to sweat this too much. If you have a background in machine learning, then you may recognize some of the names
2:12
of the settings here, and you might find a way to make an improvement. And if you want to go further and read the docs,
2:19
then there are also some settings like this vector setting over here that actually can make a bit of an impact.
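To make the shape of such a file a bit less intimidating, here is a minimal sketch that reads a small config fragment with Python's standard library. The fragment is illustrative and much shorter than a generated file, and its values are assumptions rather than a copy of the file in the video. spaCy loads these files with its own config system, which also resolves variable references that this sketch does not handle.

```python
import configparser
import json

# Illustrative fragment only: a real generated config has many more sections.
SAMPLE_CONFIG = """
[paths]
train = null
dev = null
vectors = null

[nlp]
lang = "en"
pipeline = ["ner"]

[training]
dropout = 0.1
max_steps = 20000
"""

# Interpolation is switched off so the values are read exactly as written.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE_CONFIG)

# The values use a JSON-like syntax, so json.loads can decode them.
settings = {
    section: {key: json.loads(value) for key, value in parser[section].items()}
    for section in parser.sections()
}

for section, values in settings.items():
    print(section, values)
```

The `vectors` entry under `[paths]` is the kind of setting meant by the vector setting: it is empty here, and pointing it at pretrained word vectors is one of the few changes that can matter early on.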
2:26
That said, if I were to take a step back and think about the larger project, then really tweaking the settings in this file,
2:35
that is something I would do quite late in the project. When you're very early in a project, you're much better off focusing on these annotations.
2:43
And that's partially because these expose you to the problem, but also at some point, we're going to have our first machine learning algorithm,
2:50
and that algorithm is going to make mistakes. And the easiest way to fix those mistakes is to make sure that those mistakes are properly annotated,
2:59
and that we add enough examples like the mistakes to improve the model. Training data in the end is a very well understood steering wheel,
3:08
and tweaking things here sometimes requires you to just get a little bit lucky. So for now, the main thing that's really important
3:16
is that we need some sort of a config file for the spaCy model to train, and we have this step that basically just does that.
3:24
But don't worry too much about the contents of this file. It's not the most pressing thing for us to focus on right now.
3:31
What is important for us to focus on next is to actually train a model, because after this step, we have our training data,
3:39
and we have a configuration file, so that should be everything spaCy needs.
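A minimal sketch of how those two ingredients come together, assuming hypothetical file locations (`configs/config.cfg`, `corpus/train.spacy`, `corpus/dev.spacy`, and a `training` output folder). The helper names are made up for illustration; `spacy train` with `--output`, `--paths.train`, and `--paths.dev` is part of spaCy's command line utility.

```python
import shlex
import sys
from pathlib import Path


def missing_inputs(*paths):
    """Return the paths that do not exist yet (hypothetical helper)."""
    return [str(p) for p in paths if not Path(p).exists()]


def build_train_command(config_path, train_path, dev_path, output_dir):
    """Assemble the argument list for `spacy train` (hypothetical helper)."""
    return [
        sys.executable, "-m", "spacy", "train", str(config_path),
        "--output", str(output_dir),        # where trained pipelines are saved
        "--paths.train", str(train_path),   # fills in [paths] train in the config
        "--paths.dev", str(dev_path),       # fills in [paths] dev in the config
    ]


# All four locations below are assumptions about the project layout.
config = "configs/config.cfg"
train = "corpus/train.spacy"
dev = "corpus/dev.spacy"

command = build_train_command(config, train, dev, "training")
print(shlex.join(command[1:]))

# Report anything that still has to be generated by the earlier steps.
for path in missing_inputs(config, train, dev):
    print("still missing:", path)
```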