Getting Started with NLP and spaCy Transcripts
Chapter: Part 1: spaCy syntax
Lecture: Document properties

0:00 In the last few videos we've been looking at properties that spaCy provides on tokens,

0:07 but there are also a couple of properties that spaCy provides on the document itself, and I figured I'd just show a few demos of those.

0:16 One thing that spaCy provides on the document is this sents property.

0:21 This will return a generator, which you can unravel by turning that into a list, but this gives you the separate sentences in the original document.
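A minimal sketch of reading the sents property follows. The example text is made up, and the code assumes spaCy v3 with a blank English pipeline plus the rule-based sentencizer component, so that it runs without downloading anything. The video most likely uses a trained pipeline, which sets sentence boundaries in a different way.

```python
import spacy

# Assumption: a blank English pipeline with the rule-based "sentencizer"
# stands in for whatever trained pipeline is loaded in the video.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Hypothetical example text, not necessarily the one shown on screen.
doc = nlp("Hi, my name is Vincent. I like to write Python.")

# doc.sents is lazy, so wrap it in list() to see every sentence at once.
sentences = list(doc.sents)
for sent in sentences:
    print(sent.text)
```

Each item in the list is a Span, so it still has access to its tokens and their properties.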

0:34 And you might be tempted to think that that's relatively easy because you just have to split

0:37 on this one dot over here, but just to show you that it doesn't necessarily have to be

0:44 the case, if I were to write down "My name is Mr. Warmerdam", then the dot here that's

0:52 part of "Mr." should not be the reason why the sentence splits.
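To make the point concrete, here is a small sketch that compares a naive split on ". " with spaCy's sentence detection. The second sentence in the text is invented for illustration, and the blank English pipeline with the rule-based sentencizer is an assumption that keeps the example free of model downloads. It is not necessarily the setup used in the video.

```python
import spacy

# Hypothetical text: the second sentence is added only for illustration.
text = "My name is Mr. Warmerdam. I like to write Python."

# Naive approach: treat every ". " as a sentence boundary.
# This wrongly cuts the text right after "Mr".
naive = text.split(". ")
print(naive)

# Assumption: blank English pipeline plus the rule-based sentencizer.
# The English tokenizer keeps "Mr." as one token, so its dot is not
# treated as sentence-final punctuation.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
spacy_sentences = [sent.text for sent in nlp(text).sents]
print(spacy_sentences)
```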

0:57 So the fact that spaCy can detect these separate sentences for you is indeed a very useful

1:02 and likable feature, because it might be the case that you have systems that work on a

1:06 sentence level, and then spaCy can be used to generate the separate sentences on your behalf. Let's now consider a very different sentence.

1:16 So I have "Star Wars is a very popular science fiction series." Besides sentences, a document also has a property called noun_chunks.

1:27 Again, that gives me a generator, so I need to turn it into a list.

1:32 But the way to look at this is that spaCy is able to detect that multiple words together,

1:38 as far as grammar goes, can be seen as a noun chunk, almost as if they form a single phrase.

1:50 And the reason why this might be of interest is that a lot of entities you're typically interested in detecting, they tend to be nouns.

1:58 They're usually not verbs, so having something that can just give you noun chunks to evaluate is also pretty useful in practice, I would say.
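A sketch of listing noun chunks for the sentence above. Noun chunks depend on a dependency parse, so this assumes the small English pipeline en_core_web_sm has been installed (python -m spacy download en_core_web_sm). That model name is an assumption, since the transcript does not say which pipeline is loaded. If the pipeline is missing, the sketch simply yields an empty list.

```python
import spacy

# Assumption: en_core_web_sm is installed. noun_chunks needs a dependency
# parse, so a blank pipeline would not be enough here.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = None  # pipeline not installed; the demo below is skipped

text = "Star Wars is a very popular science fiction series."

chunks = []
if nlp is not None:
    doc = nlp(text)
    # doc.noun_chunks is lazy as well, so collect the chunk texts in a list.
    chunks = [chunk.text for chunk in doc.noun_chunks]

print(chunks)
```

The exact chunks depend on the pipeline and its version, so no expected output is shown.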

2:07 And then finally, here's a property that could be useful if you want to turn this into an API at some point.

2:12 A doc object also has a to_json method attached, and this will give you back a dictionary that

2:19 just contains a lot of the information that spaCy has detected. And that's kind of neat, because that means you can use this inside of a web app or API

2:30 to communicate with a front end. Now, as you go through all the properties here, you'll notice that typically there's

2:35 a start as well as an end being specified. So for example, the sentence starts somewhere and it ends somewhere, and another sentence

2:43 starts somewhere and it ends somewhere. And these refer to the character indices in the original text.

2:49 So you can also see that this first token over here, it starts somewhere, it stops at

2:55 the second character, then we have the comma, the punctuation that goes from two to three. So these represent the character indices.
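The start and end values can be checked by slicing the original text with them. The sketch below uses a made-up sentence and assumes a blank English pipeline with the rule-based sentencizer, so the dictionary carries fewer token fields than a trained pipeline would produce.

```python
import json
import spacy

# Assumption: blank English pipeline plus sentencizer, so nothing has to be
# downloaded. A trained pipeline would add more fields to each token.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Hypothetical example text.
text = "Hi, my name is Vincent. I like to write Python."
doc = nlp(text)

# to_json() returns a plain dictionary that json.dumps can serialize.
data = doc.to_json()
print(json.dumps(data, indent=2))

# Sentence offsets are character positions in the original text.
for sent in data["sents"]:
    print(sent["start"], sent["end"], repr(text[sent["start"]:sent["end"]]))

# The first two tokens are "Hi" (characters 0 to 2) and "," (2 to 3).
first, second = data["tokens"][0], data["tokens"][1]
print(repr(text[first["start"]:first["end"]]))
print(repr(text[second["start"]:second["end"]]))
```

Because the offsets index into the text itself, a front end can highlight a sentence or token without knowing anything about spaCy.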

3:04 But there's also all sorts of other information, especially on these tokens.

3:07 So we have the lemma, for example, we have the morphological information, etc. You can also scroll down and see all the different tokens.

3:16 But one thing you will notice as you look around is that even though we have lots of

3:20 information in here, it doesn't show all the possible properties on this document. So the noun chunks, for example, aren't being shown here.
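If noun chunks are needed in the response as well, one option is to add them to the dictionary by hand. The helper below, doc_to_payload, is a hypothetical name and not part of spaCy. It also assumes en_core_web_sm may or may not be installed, and falls back to a blank pipeline, in which case no noun chunks are added.

```python
import spacy


def doc_to_payload(doc):
    """Hypothetical helper: to_json() output plus noun chunk offsets."""
    payload = doc.to_json()
    # noun_chunks needs a dependency parse; skip it when none is available.
    if doc.has_annotation("DEP"):
        payload["noun_chunks"] = [
            {"start": chunk.start_char, "end": chunk.end_char}
            for chunk in doc.noun_chunks
        ]
    return payload


# Assumption: use en_core_web_sm when installed, else a blank pipeline.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

doc = nlp("Star Wars is a very popular science fiction series.")

print("noun_chunks" in doc.to_json())  # False: to_json() has no such key
payload = doc_to_payload(doc)
print(sorted(payload.keys()))
```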

3:30 But that said, a lot of the time this will be sufficient, especially if you're mainly interested in these entities, so to say.

3:38 So if you're building web APIs, this tends to be a very useful method to know about.
