Getting Started with NLP and spaCy Transcripts
Chapter: Part 3: spaCy Projects
Lecture: ML Config

0:00 In a previous video, we showed this convert script from this convert step. And we also showed that it generated these .spacy files,
0:10 which are binary representations of documents. And because those binary representations have a very strict, well understood internal schema,
0:19 it's relatively lightweight and spaCy knows what to look for. So that's great. We need this in order to train a machine learning model.
0:27 However, there are lots of ways to train a machine learning model. And there are also lots of settings. So the next thing that we will need
0:35 is some sort of configuration file. And I have this one extra step over here that is going to generate one such file.
0:43 Under the hood, it uses the spaCy command line utility to initialize a config file. I can tell it where I want the config file to go,
0:53 but I can also give it some extra settings. In this particular case, I'm saying, well, I want to do named entity recognition.
1:00 I have the English language and I want you to care about efficiency. By choosing this setting, we are effectively saying that we don't care
1:10 about having the best of the best of the best model, because that might imply that we have a model
1:15 that's very heavy and might be very compute intensive. Instead, we are actually fine with having some settings
1:22 that are pretty good, but can actually run quite quickly. In general, I advise everyone to go with this setting.
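
As a rough sketch of the command this step describes, the snippet below assembles a `spacy init config` call with the settings mentioned in the video (named entity recognition, English, optimize for efficiency). The helper name `build_init_config_command` and the output path `config.cfg` are assumptions for illustration and are not taken from the course project. The helper only builds the argument list, so it can be inspected without spaCy installed.

```python
import sys


def build_init_config_command(
    output_path="config.cfg",   # hypothetical output location
    lang="en",                  # English, as in the video
    pipeline=("ner",),          # named entity recognition only
    optimize="efficiency",      # the alternative is "accuracy"
):
    """Return the argv list for `python -m spacy init config ...`."""
    return [
        sys.executable, "-m", "spacy", "init", "config", output_path,
        "--lang", lang,
        "--pipeline", ",".join(pipeline),
        "--optimize", optimize,
    ]


command = build_init_config_command()
print(" ".join(command[1:]))
```

With spaCy installed, the list could be handed to `subprocess.run(command, check=True)` to actually write the file. Switching `optimize` to `"accuracy"` is the one-word change for the use case where you care more about model quality than speed.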
1:29 If you go for the most optimal setting out there, you might need a GPU, but I figured I'd mention it explicitly because you might have a use case
1:36 where you care more about this. So that is definitely an easy setting that you can change. After you run this config command,
1:44 you will see a configuration file that looks a little something like this. This is my base configuration file. And at first glance,
1:53 you will notice that there are lots and lots of settings. And I can definitely imagine that this is also somewhat intimidating.
2:02 In general, my advice would be not to sweat this too much. If you have a background in machine learning, then you may recognize some of the names
2:12 of the settings here, and you might find your way to try to make an improvement. And if you want to go further and read the docs,
2:19 then there are also some settings like this vector setting over here that actually can make a bit of an impact.
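
To make the file a bit less intimidating, here is a minimal sketch that reads a small excerpt of such a config and looks up the vectors setting. The excerpt is an assumption about what a generated file roughly looks like, not a copy of the file shown in the video. The sketch uses Python's standard `configparser` so that it runs without spaCy; spaCy has its own config loader that also resolves the `${...}` references, which this sketch leaves as plain text.

```python
import configparser
import json

# Illustrative excerpt only; a real generated config has many more sections.
EXCERPT = """
[paths]
train = null
dev = null
vectors = null

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]

[initialize]
vectors = ${paths.vectors}
"""

# Disable interpolation so "${paths.vectors}" is kept as a literal string.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(EXCERPT)

# Values in this excerpt are written in JSON style, so decode them with json.
vectors = json.loads(parser["paths"]["vectors"])
pipeline = json.loads(parser["nlp"]["pipeline"])

print("vectors:", vectors)
print("pipeline:", pipeline)
print("initialize.vectors refers to:", parser["initialize"]["vectors"])
```

In this excerpt the vectors path is empty (`null`), so pointing it at a set of word vectors would be the kind of single-setting change the video refers to.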
2:26 That said, if I were to take a step back and think about the larger project, then really tweaking the settings in this file,
2:35 that is something I would do quite late in the project. When you're very early in a project, you're much better off focusing on these annotations.
2:43 And that's partially because these expose you to the problem, but also at some point, we're going to have our first machine learning algorithm,
2:50 and that algorithm is going to have mistakes. And the easiest way to fix those mistakes is to make sure that those mistakes are properly annotated,
2:59 and that we add enough examples like the mistakes to improve the model. Training data in the end is a very well understood steering wheel,
3:08 and tweaking things here sometimes requires you to just get a little bit lucky. So for now, the main thing that's really important
3:16 is we need some sort of a config file for the spaCy model to train, and we have this step that basically just does that.
3:24 But don't worry too much about the contents of this file. It's not the most pressing thing for us to focus on right now.
3:31 What is important for us to focus on next is to actually train a model, because after this step, we have our training data,
3:39 and we have a configuration file, so that should be everything spaCy needs.

