Getting Started with NLP and spaCy Transcripts
Chapter: Part 3: spaCy Projects
Lecture: spaCy projects

0:00 All right, so let's talk about project setup. Now what I've got here are some files and folders, and one file in particular that
0:11 I would like to point our attention to first is this project.yaml file. It's a file that I've got open right here
0:19 and this is a spaCy-specific YAML file. That's kind of like a Makefile, if you're aware of that, but the main thing in this file
0:29 that's going to be interesting and important is that I'm able to have this collection of commands that I can reuse later.
0:36 And I'll just highlight one such command. So I've got a command over here called annotation export that I could go ahead and run.
0:48 And this is going to get annotations out of my annotation tool and into one of these folders. So the way to read this script by the way
0:56 is I am going to be exporting, that's what this command does. And I'm going to be exporting a particular name of a dataset into a folder.
1:05 And then I'm going to say, well, let's take that file name that got generated and actually make a file called annots.jsonl, short for annotations.
1:15 And let's move that into the data folder. So as we can see right now, that file is not in here. But what I should be able to do now
1:22 is call python -m spacy and then run the project command. And then this command will pick up that there is this project.yaml file
1:30 and that there are these commands in it. And I'm telling it to run this annotation export command. Then spaCy, on our behalf,
1:39 is going to run all of these scripts and let's just confirm that that works. And there we go. We have our annotations file.
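The invocation described above can also be driven from Python. This is a minimal sketch, not the lecture's own code: it assumes the command is registered in project.yaml under the name annotation-export (the exact spelling is not shown in the transcript), and the helper names project_run_argv and run_project_command are hypothetical.

```python
import subprocess
import sys


def project_run_argv(command_name: str) -> list[str]:
    """Build the argument list for `python -m spacy project run <command_name>`."""
    # sys.executable keeps us on the same interpreter (and virtualenv) as this script.
    return [sys.executable, "-m", "spacy", "project", "run", command_name]


def run_project_command(command_name: str, project_dir: str = ".") -> None:
    """Run one command from the project.yaml found in project_dir.

    Requires spaCy to be installed; raises CalledProcessError if the command fails.
    """
    subprocess.run(project_run_argv(command_name), cwd=project_dir, check=True)


# "annotation-export" is an assumed name; use whatever name your project.yaml defines.
print(" ".join(project_run_argv("annotation-export")[1:]))
```

Calling run_project_command("annotation-export") from the project folder would then do the same thing as typing the command in the terminal.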
1:49 That's now listed over here. That's all well and good. There is another step though that we can run now. And that is a step that follows,
1:57 which is we are going to take our annotations over here and we are going to turn these annotations into a format that spaCy can go ahead and use.
2:07 We will dive into this script in the next video, but I want to highlight a thing that happens when I actually run this command.
2:14 So let's run the convert command. When this runs, we will generate some spaCy data. That's going to happen as a side effect.
2:23 But notice that this command over here, it's got outputs defined, but it's also got dependencies defined. And what I'm able to say here
2:35 is that this particular script, it depends on this file as input. Note by the way, that this other command that I used before
2:47 mentions the same file as an output over here. Under the hood, that is super useful information because what spaCy can now do on our behalf
2:56 is it can keep track of this lock file. And what it's going to do is it's going to say, ah, there's a command over here.
3:05 This command is generating a dataset. And from here, if this dataset didn't change, then any scripts that depend on it don't have to run anew either.
3:19 So if I run this convert command one more time now, you can see that this command actually got skipped because nothing changed.
3:30 There were no new annotations that were moved into this file over here, which means that the script doesn't have to run to generate these files.
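The skipping behaviour can be illustrated with a small stand-alone sketch. This is not spaCy's implementation: the function names, the in-memory dict standing in for the lock file, and the choice of SHA-256 are all assumptions made for illustration. It only shows the general idea of comparing a stored hash of each dependency against its current hash.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return a hex digest of the file's bytes (SHA-256 chosen for this sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_rerun(deps: list[Path], lock: dict[str, str]) -> bool:
    """True if any dependency is new to the lock or has a different hash than recorded."""
    return any(lock.get(str(p)) != file_hash(p) for p in deps)


def update_lock(deps: list[Path], lock: dict[str, str]) -> None:
    """Record the current hash of every dependency after a successful run."""
    for p in deps:
        lock[str(p)] = file_hash(p)


def demo() -> list[bool]:
    """Simulate three attempts to run a command that depends on annots.jsonl."""
    results = []
    lock: dict[str, str] = {}  # stand-in for the lock file
    with tempfile.TemporaryDirectory() as tmp:
        annots = Path(tmp) / "annots.jsonl"
        annots.write_text('{"text": "first example"}\n')

        results.append(needs_rerun([annots], lock))  # first run: nothing recorded yet
        update_lock([annots], lock)

        results.append(needs_rerun([annots], lock))  # file unchanged: can be skipped

        with annots.open("a") as f:  # new annotations arrive
            f.write('{"text": "second example"}\n')
        results.append(needs_rerun([annots], lock))  # hash differs: must run again
    return results


print(demo())
```

The second attempt corresponds to the skipped convert run in the video: the dependency's hash matches what was recorded, so there is no reason to run the script again.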
3:41 Now, in this example, that's not going to save a whole lot of time, but you can imagine as we have a project that's going to grow and grow,
3:48 the fact that we can have this collection of scripts that we can write unit tests for, but also that we have this framework
3:55 such that they don't run unless they really have to, that is going to be super nice. We can kind of make a collection of scripts that need to run
4:05 and this project.yaml file gives us a nice way to orchestrate that. Now, if you're curious about the details,
4:11 definitely go and have a look in this file, this project.lock file. And when you look around, you'll notice that we have specific names
4:20 of commands over here, and that for all these different outputs, we have this hash that's readily available. And in this case,
4:27 we can confirm that the last time that this convert command was run, the same hash appeared as what we've got over here. So under the hood,
4:37 this is the method that spaCy uses to understand which commands need to be rerun and which commands don't. So I'm going to be using this a whole bunch.
4:46 I'll take the time to explain the steps, but I hope that the orchestration of what we're about to do is also clear. Having such a system around
4:53 is going to make it a lot easier for us to have a proper project, as opposed to having lots of different scripts in a Jupyter notebook.

