Getting Started with NLP and spaCy Transcripts
Chapter: Part 3: spaCy Projects
Lecture: Converting data

0:00 In the previous video, we noticed that the spaCy project framework makes sure that commands
0:06 like convert that depend on a previous step don't run unless they really need to.
0:11 And in particular, we saw that if and only if this file over here changes, then this script will run to generate these spaCy specific files.
0:22 What I would now like to do though, is actually dive into the scripts that we're using to
0:27 generate these files, because they paint a somewhat general picture of something that you'll probably need in most spaCy projects.
0:35 So I have a folder over here with lots of scripts, and in particular, let's just have a look at this convert.py script.
0:44 So here is the script, and as this little docstring confirms, this script basically makes sure that data is pushed into the .spacy format.
0:56 This is a binary representation of the data that's nice and specific and also relatively lightweight.
1:03 And the way you should kind of think about it is that we're really just storing spaCy documents.
1:09 There is this object called a DocBin that we are importing, and as we are looping over
1:14 all the examples from our annotation file over here, what's really just happening is
1:21 that we are taking a text, turning that into a spaCy document, and then this JSON file
1:28 has a key called spans, and we are just adding all of those spans as entities.
1:38 By the time that we're over here, we have a document with entities, and then the main
1:42 thing that's happening here is that I'm saying, well, let's have a few documents for training and a few other documents for evaluation.
1:51 In general, it is a good idea to keep evaluation sets separate from your train sets, and that's the final bit of logic that's happening here.
2:00 Everything else that's happening above is really just creating a spaCy document object
2:06 with all the properties that I would like to predict in my machine learning model.
2:10 And then finally at the end, I have this DocBin object with lots of documents for my train
2:15 set, and that needs to be stored to disk, and I'm doing the same for the validation set as well.
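As a reference point, here is a minimal sketch of what such a conversion script can look like. The file path and the "text", "spans", "start", "end" and "label" keys are assumptions about the annotation format, so adjust them to whatever your own data actually contains.

```python
# Minimal sketch: convert annotated examples into spaCy's binary .spacy format.
# Assumes a JSONL file where each line has a "text" key and a "spans" key with
# character offsets and labels; the path and key names are illustrative.
import json

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # a blank pipeline is enough for tokenization

train_db = DocBin()
dev_db = DocBin()

with open("assets/annotations.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

for i, example in enumerate(examples):
    doc = nlp(example["text"])
    ents = []
    for span in example["spans"]:
        # char_span returns None if the offsets don't line up with token boundaries
        ent = doc.char_span(span["start"], span["end"], label=span["label"])
        if ent is not None:
            ents.append(ent)
    doc.ents = ents
    # keep a portion of the data aside for evaluation
    if i % 5 == 0:
        dev_db.add(doc)
    else:
        train_db.add(doc)

train_db.to_disk("corpus/train.spacy")
dev_db.to_disk("corpus/dev.spacy")
```

Reading the data back in later is just as direct: DocBin().from_disk("corpus/train.spacy").get_docs(nlp.vocab) yields the stored Doc objects again.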
2:23 Note that in this particular case, we are interested in doing named entity recognition,
2:28 which is commonly abbreviated as NER, and that's why we really have to make sure
2:33 that any entities that appear in our annotations actually get set here.
2:39 But if you're dealing with another task, effectively you will write a very similar script as what we've got over here.
2:45 You just have to make sure that the thing you're predicting is properly attached to the document.
2:49 That's the main thing that really needs to happen here. So you might wonder, well, what do I do if I don't have named entities, but I've got
2:57 this other task that I'm interested in? Well, then my best advice is to go to GitHub and go to the Explosion Projects repository.
3:04 There's a folder here with lots and lots of tutorials. Some of these tutorials are for named entity recognition, but we've also got some for text
3:16 classification, and in particular, here's one for classifying documentation issues.
3:22 And what you can find here is a project.yml file, just like we had before, but moreover, what you can find here are just example scripts.
3:30 Each one of these projects typically has some sort of pre-processing script that takes some
3:35 sort of JSON file, and it's then assumed that this JSON file has a specific format, but again, the pattern remains the same.
3:42 In this case, we are not adding entities, we are adding categories to this document over here.
3:48 And again, we are adding that to a DocBin, and then at the end, that DocBin is saved to disk. And this kind of holds true for any spaCy project.
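Here is a similarly minimal sketch of that pattern for text classification. Again, the file path, the "text" and "label" keys, and the two category names are assumptions for illustration; the real scripts in the projects repository define their own.

```python
# Minimal sketch: the same DocBin pattern, but attaching categories instead of
# entities. Assumes a JSONL file with "text" and "label" keys; names are illustrative.
import json

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()

with open("assets/docs_issues.jsonl", encoding="utf8") as f:
    for line in f:
        example = json.loads(line)
        doc = nlp(example["text"])
        # for text classification, every category gets an explicit 0.0/1.0 score
        doc.cats = {
            "DOCUMENTATION": 1.0 if example["label"] == "DOCUMENTATION" else 0.0,
            "OTHER": 1.0 if example["label"] != "DOCUMENTATION" else 0.0,
        }
        db.add(doc)

db.to_disk("corpus/train.spacy")
```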
3:58 You will always have to pre-process data into the .spacy format, and the way that you would
4:02 go about that does depend on the task, but there are plenty of these examples in the projects repository.
4:09 So if you ever feel lost, I do advise you to just go ahead and copy some relevant scripts from here. Quite frankly, that's actually what I always do.

