Getting Started with NLP and spaCy Transcripts
Chapter: Part 3: spaCy Projects
Lecture: Converting data
0:00
In the previous video, we noticed that the spaCy project framework makes sure that commands
0:06
like convert that depend on a previous step don't run unless they really need to.
0:11
And in particular, we saw that if and only if this file over here changes, then this script will run to generate these spaCy specific files.
0:22
What I would now like to do though, is actually dive into the scripts that we're using to
0:27
generate these files, because they paint a somewhat general picture of something that you'll probably need in most spaCy projects.
0:35
So I have a folder over here with lots of scripts, and in particular, let's just have a look at this convert.py script.
0:44
So here is the script, and as this little docstring can confirm, this script basically makes sure that data is pushed into this .spacy format.
0:56
This is a binary representation of the data that's nice and specific and also relatively lightweight.
1:03
And the way you should kind of think about it is that we're really just storing spaCy documents.
1:09
There is this object called a DocBin that we are importing, and as we are looping over
1:14
all the examples from our annotation file over here, what's really just happening is
1:21
that we are taking a text, turning that into a spaCy document, and then this JSON file
1:28
has a key called spans, and we are just adding all of those spans as entities.
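The loop described here can be sketched roughly as follows. The annotation record and the label are made up for illustration, and the real file's schema may differ, but the pattern is the same: turn each text into a spaCy document, then attach its spans as entities.

```python
import spacy
from spacy.tokens import DocBin

# Hypothetical annotation records; the course's data has a "spans" key
# with character offsets and a label, which is what we mimic here.
annotations = [
    {
        "text": "We trained the model on an NVIDIA GPU.",
        "spans": [{"start": 27, "end": 33, "label": "HARDWARE"}],
    },
]

nlp = spacy.blank("en")  # a blank pipeline is enough for tokenization
doc_bin = DocBin()

for example in annotations:
    doc = nlp(example["text"])
    ents = []
    for span in example["spans"]:
        # char_span returns None if the offsets don't line up with token
        # boundaries, so guard against silently dropping annotations.
        ent = doc.char_span(span["start"], span["end"], label=span["label"])
        if ent is None:
            raise ValueError(f"Misaligned span in: {example['text']!r}")
        ents.append(ent)
    doc.ents = ents  # attach the spans as named entities
    doc_bin.add(doc)
```

The `char_span` check is worth keeping in any real conversion script: misaligned character offsets are one of the most common reasons annotations quietly vanish from training data.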
1:38
By the time that we're over here, we have a document with entities, and then the main
1:42
thing that's happening here is that I'm saying, well, let's have a few documents for training and a few other documents for evaluation.
1:51
In general, it is a good idea to keep evaluation sets separate from your train sets, but that's the final bit of logic that's happening here.
2:00
Everything else that's happening above is really just creating a spaCy document object
2:06
with all the properties that I would like to predict in my machine learning model.
2:10
And then finally at the end, I have this DocBin object with lots of documents for my train
2:15
set, and that needs to be stored to disk, and I'm doing that for the validation set as well.
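That final step, splitting the documents and writing each group to disk, can be sketched like this. The split ratio and filenames are assumptions for illustration; the course's script may use different values.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
texts = ["First example.", "Second example.", "Third example.", "Fourth example."]
docs = [nlp(text) for text in texts]

# Simple holdout split: first 75% for training, the rest for evaluation.
split = int(len(docs) * 0.75)
train_docs, dev_docs = docs[:split], docs[split:]

# Each DocBin is serialized to the binary .spacy format.
DocBin(docs=train_docs).to_disk("train.spacy")
DocBin(docs=dev_docs).to_disk("dev.spacy")
```

These two `.spacy` files are exactly the artifacts that later training steps in the project depend on.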
2:23
Note that in this particular case, we are interested in doing named entity recognition
2:28
(NER), and that's why we really have to make sure
2:33
that any entities that appear in our annotations actually get set here.
2:39
But if you're dealing with another task, effectively you will write a very similar script as what we've got over here.
2:45
You just have to make sure that the thing you're predicting is properly attached to the document.
2:49
That's the main thing that really needs to happen here. So you might wonder, well, what do I do if I don't have named entities, but I've got
2:57
this other task that I'm interested in? Well, then my best advice is to go to GitHub and go to the Explosion Projects repository.
3:04
There's a folder here with lots and lots of tutorials. Some of these tutorials are for named entity recognition, but we've also got some for text
3:16
classification, and in particular, here's one for docs issue tags.
3:22
And what you can find here is a project.yml file, just like we had before, but moreover, what you can find here are just example scripts.
3:30
Each one of these projects typically has some sort of pre-processing script that takes some
3:35
sort of JSON file, and it's then assumed that this JSON file has a specific format, but again, the pattern remains the same.
3:42
In this case, we are not adding entities, we are adding categories to this document over here.
3:48
And again, we are adding that to a DocBin, and then at the end, that DocBin is saved to disk. And this kind of holds true for any spaCy project.
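For a text classification task, the only part of the pattern that changes is the attachment step: instead of setting `doc.ents`, you set `doc.cats`. A minimal sketch, with made-up texts and category labels:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

# Hypothetical labelled examples; a real project would read these from
# an annotation file, but the attachment step is the same.
examples = [
    ("The docs page returns a 404", {"DOCUMENTATION": 1.0, "OTHER": 0.0}),
    ("Crash on startup", {"DOCUMENTATION": 0.0, "OTHER": 1.0}),
]

for text, cats in examples:
    doc = nlp(text)
    doc.cats = cats  # categories instead of entities
    doc_bin.add(doc)

doc_bin.to_disk("textcat_train.spacy")
```

Note that `doc.cats` expects a score for every label on every document, which is why each dictionary lists both categories explicitly.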
3:58
You will always have to pre-process data into the spaCy format, and the way that you would
4:02
go about that does depend on the task, but there are plenty of these examples on the project's repository.
4:09
So if you ever feel lost, I do advise you to just go ahead and copy some relevant scripts from here. Quite frankly, that's actually what I always do.