Getting Started with NLP and spaCy Transcripts
Chapter: Part 3: spaCy Projects
Lecture: Converting data
0:00
In the previous video, we noticed that the spaCy project framework makes sure that commands
0:06
like convert that depend on a previous step don't run unless they really need to.
0:11
And in particular, we saw that if and only if this file over here changes, then this script will run to generate these spaCy specific files.
0:22
What I would now like to do though, is actually dive into the scripts that we're using to
0:27
generate these files, because they paint a somewhat general picture of something that you'll probably need in most spaCy projects.
0:35
So I have a folder over here with lots of scripts, and in particular, let's just have a look at this convert.py script.
0:44
So here is the script, and as this little docstring can confirm, this script basically makes sure that data is pushed into this .spacy format.
0:56
This is a binary representation of the data that's nice and specific and also relatively lightweight.
1:03
And the way you should kind of think about it is that we're really just storing spaCy documents.
1:09
There is this object called a DocBin that we are importing, and as we are looping over
1:14
all the examples from our annotation file over here, what's really just happening is
1:21
that we are taking a text, turning that into a spaCy document, and then this JSON file
1:28
has a key called spans, and we are just adding all of those spans as entities.
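The loop described here can be sketched roughly as follows. The annotation record and the label are made up for illustration, and the real file's schema may differ, but the pattern is the same: turn each text into a spaCy document, then attach its spans as entities.

```python
import spacy
from spacy.tokens import DocBin

# Hypothetical annotation records; the course's data has a "spans" key
# with character offsets and a label, which is what we mimic here.
annotations = [
    {
        "text": "We trained the model on an NVIDIA GPU.",
        "spans": [{"start": 27, "end": 33, "label": "HARDWARE"}],
    },
]

nlp = spacy.blank("en")  # a blank pipeline is enough for tokenization
doc_bin = DocBin()

for example in annotations:
    doc = nlp(example["text"])
    ents = []
    for span in example["spans"]:
        # char_span returns None if the offsets don't line up with token
        # boundaries, so guard against silently dropping annotations.
        ent = doc.char_span(span["start"], span["end"], label=span["label"])
        if ent is None:
            raise ValueError(f"Misaligned span in: {example['text']!r}")
        ents.append(ent)
    doc.ents = ents  # attach the spans as named entities
    doc_bin.add(doc)
```

The `char_span` check is worth keeping in any real conversion script: misaligned character offsets are one of the most common reasons annotations quietly vanish from training data.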
1:38
By the time that we're over here, we have a document with entities, and then the main
1:42
thing that's happening here is that I'm saying, well, let's have a few documents for training and a few other documents for evaluation.
1:51
In general, it is a good idea to keep evaluation sets separate from your train sets, but that's the final bit of logic that's happening here.
2:00
Everything else that's happening above is really just creating a spaCy document object
2:06
with all the properties that I would like to predict in my machine learning model.
2:10
And then finally at the end, I have this DocBin object with lots of documents for my train
2:15
set, and that needs to be stored to disk, and I'm doing that for the validation set as well.
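That final step, splitting the documents and writing each group to disk, can be sketched like this. The split ratio and filenames are assumptions for illustration; the course's script may use different values.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
texts = ["First example.", "Second example.", "Third example.", "Fourth example."]
docs = [nlp(text) for text in texts]

# Simple holdout split: first 75% for training, the rest for evaluation.
split = int(len(docs) * 0.75)
train_docs, dev_docs = docs[:split], docs[split:]

# Each DocBin is serialized to the binary .spacy format.
DocBin(docs=train_docs).to_disk("train.spacy")
DocBin(docs=dev_docs).to_disk("dev.spacy")
```

These two `.spacy` files are exactly the artifacts that later training steps in the project depend on.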
2:23
Note that in this particular case, we are interested in doing named entity recognition
2:28
(NER), and that's why we really have to make sure
2:33
that any entities that appear in our annotations actually get set here.
2:39
But if you're dealing with another task, effectively you will write a very similar script as what we've got over here.
2:45
You just have to make sure that the thing you're predicting is properly attached to the document.
2:49
That's the main thing that really needs to happen here. So you might wonder, well, what do I do if I don't have named entities, but I've got
2:57
this other task that I'm interested in? Well, then my best advice is to go to GitHub and go to the Explosion Projects repository.
3:04
There's a folder here with lots and lots of tutorials. Some of these tutorials are for named entity recognition, but we've also got some for text
3:16
classification, and in particular, here's one for docs issue tags.
3:22
And what you can find here is a project.yml file, just like we had before, but moreover, what you can find here are just example scripts.
3:30
Each one of these projects typically has some sort of pre-processing script that takes some
3:35
sort of JSON file, and it's then assumed that this JSON file has a specific format, but again, the pattern remains the same.
3:42
In this case, we are not adding entities, we are adding categories to this document over here.
3:48
And again, we are adding that to a DocBin, and then at the end, that DocBin is saved to disk. And this kind of holds true for any spaCy project.
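For a text classification task, the only part of the pattern that changes is the attachment step: instead of setting `doc.ents`, you set `doc.cats`. A minimal sketch, with made-up texts and category labels:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

# Hypothetical labelled examples; a real project would read these from
# an annotation file, but the attachment step is the same.
examples = [
    ("The docs page returns a 404", {"DOCUMENTATION": 1.0, "OTHER": 0.0}),
    ("Crash on startup", {"DOCUMENTATION": 0.0, "OTHER": 1.0}),
]

for text, cats in examples:
    doc = nlp(text)
    doc.cats = cats  # categories instead of entities
    doc_bin.add(doc)

doc_bin.to_disk("textcat_train.spacy")
```

Note that `doc.cats` expects a score for every label on every document, which is why each dictionary lists both categories explicitly.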
3:58
You will always have to pre-process data into the spaCy format, and the way that you would
4:02
go about that does depend on the task, but there are plenty of these examples on the project's repository.
4:09
So if you ever feel lost, I do advise you to just go ahead and copy some relevant scripts from here. Quite frankly, that's actually what I always do.