Getting Started with NLP and spaCy Transcripts
Chapter: Part 1: spaCy syntax
Lecture: Document properties

0:00 In the last few videos we've been looking at properties that spaCy provides on tokens,
0:07 but there are also a couple of properties that spaCy provides on the document itself, and I figured I'd just show a few demos of those.
0:16 One thing that spaCy provides on the document is this sents property.
0:21 This will return a generator, which you can unravel by turning that into a list, but this gives you the separate sentences in the original document.
0:34 And you might be tempted to think that that's relatively easy because you just have to split
0:37 on this one dot over here, but just to show you that it doesn't necessarily have to be
0:44 the case, if I were to write down "My name is Mr. Warmerdam", then the dot here that's
0:52 part of "Mr." should not be the reason why the sentence splits.
0:57 So the fact that spaCy can detect these separate sentences for you is indeed a very useful
1:02 and likable feature, because it might be the case that you have systems that work on a
1:06 sentence level, and then spaCy can be used to generate the separate sentences on your behalf. Let's now consider a very different sentence.
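A minimal sketch of the sentence splitting just described. It assumes spaCy v3 is installed and, so that it runs without a model download, swaps in a blank English pipeline with the rule-based sentencizer component rather than the trained pipeline loaded in the video. The second sentence of the text is made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline plus the rule-based "sentencizer",
# used here instead of a trained pipeline so no model download is needed.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "My name is Mr. Warmerdam. I like spaCy."
doc = nlp(text)

# doc.sents is a generator of Span objects, so build a list to inspect it.
sentences = [sent.text for sent in doc.sents]
print(sentences)

# For contrast: naively splitting on ". " also breaks after "Mr".
naive = text.split(". ")
print(naive)
```

The naive split produces three pieces because it treats the dot in "Mr." as a sentence end, while the tokenizer keeps "Mr." together as one token, so the sentencizer has no boundary to split on there.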
1:16 So I have "Star Wars is a very popular science fiction series." Besides sentences, a document also has a property called noun_chunks.
1:27 Again, that gives me a generator, so I need to turn it into a list.
1:32 But the way to look at this is that spaCy is able to detect that multiple words together,
1:38 as far as grammar goes, can be seen as a noun chunk, almost as if they form a single phrase.
1:50 And the reason why this might be of interest is that a lot of entities you're typically interested in detecting, they tend to be nouns.
1:58 They're usually not verbs, so having something that can just give you noun chunks to evaluate is also pretty useful in practice, I would say.
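A small sketch of reading noun chunks from a document. Noun chunks are derived from part-of-speech tags and the dependency parse, which a trained pipeline predicts in the video. As an assumption for illustration, the annotations below are written by hand (spaCy v3 style, with absolute head indices) so the sketch runs without a model download.

```python
import spacy
from spacy.tokens import Doc

# Assumption: hand-written annotations stand in for a trained pipeline's output.
vocab = spacy.blank("en").vocab
words = ["Star", "Wars", "is", "a", "very", "popular",
         "science", "fiction", "series", "."]
spaces = [True] * 8 + [False, False]
pos = ["PROPN", "PROPN", "AUX", "DET", "ADV", "ADJ",
       "NOUN", "NOUN", "NOUN", "PUNCT"]
# Absolute index of each token's syntactic head (the root points at itself).
heads = [1, 2, 2, 8, 5, 8, 7, 8, 2, 2]
deps = ["compound", "nsubj", "ROOT", "det", "advmod", "amod",
        "compound", "compound", "attr", "punct"]

doc = Doc(vocab, words=words, spaces=spaces, pos=pos, heads=heads, deps=deps)

# doc.noun_chunks is a generator of Span objects, so build a list to inspect it.
chunks = [chunk.text for chunk in doc.noun_chunks]
print(chunks)
```

With a trained pipeline the call would simply be nlp(text) followed by doc.noun_chunks; the chunks you get then depend on what that model predicts.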
2:07 And then finally, here's a property that could be useful if you want to turn this into an API at some point.
2:12 A doc object also has a to_json method attached, and this will give you back a dictionary that
2:19 just contains a lot of the information that spaCy has detected. And that's kind of neat, because that means you can use this inside of a web app or API
2:30 to communicate with a front end. Now, as you go through all the properties here, you'll notice that typically there's
2:35 a start as well as an end being specified. So for example, the sentence starts somewhere and it ends somewhere, and another sentence
2:43 starts somewhere and it ends somewhere. And these refer to the character indices in the original text.
2:49 So you can also see that this first token over here, it starts somewhere, it stops at
2:55 the second character, then we have the comma, the punctuation that goes from two to three. So these represent the character indices.
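A minimal sketch of the dictionary and its character offsets. It assumes spaCy v3 and again uses a blank English pipeline with the sentencizer so that it runs without a trained model; the example text is made up and is not necessarily the one shown on screen.

```python
import spacy

# Assumption: blank pipeline plus sentencizer instead of a trained pipeline.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "Hi, my name is Vincent. I like spaCy."
doc = nlp(text)

data = doc.to_json()  # a plain Python dictionary

# Each sentence entry holds character offsets into the original text.
for sent in data["sents"]:
    print(sent["start"], sent["end"], repr(text[sent["start"]:sent["end"]]))

# Token entries carry the same kind of offsets.
for token in data["tokens"][:2]:
    print(token["start"], token["end"], repr(text[token["start"]:token["end"]]))
```

Slicing the text with a start and end pair recovers the sentence or token it describes. With a trained pipeline the token entries would also carry fields such as the lemma and morphological information, which a blank pipeline does not predict.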
3:04 But there's also all sorts of other information, especially on these tokens.
3:07 So we have the lemma, for example, we have the morphological information, etc. You can also scroll down and see all the different tokens.
3:16 But one thing you will notice as you look around is that even though we have lots of
3:20 information in here, it doesn't show all the possible properties on this document. So the noun chunks, for example, aren't being shown here.
3:30 But that said, a lot of the time this will be sufficient, especially if you're mainly interested in these entities, so to say.
3:38 So if you're building web APIs, this tends to be a very useful method to know about.
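Since noun chunks are not part of the to_json output, one option is to add them yourself before sending the data to a front end. The sketch below is only an illustration under stated assumptions: doc_to_payload is a hypothetical helper name, the "noun_chunks" key is a made-up convention, and the document is annotated by hand (spaCy v3 style) so the code runs without a trained model.

```python
import json

import spacy
from spacy.tokens import Doc


def doc_to_payload(doc):
    """Hypothetical helper: to_json() output plus noun-chunk character offsets."""
    payload = doc.to_json()
    # "noun_chunks" is our own key; spaCy does not add it to the dictionary.
    payload["noun_chunks"] = [
        {"start": chunk.start_char, "end": chunk.end_char}
        for chunk in doc.noun_chunks
    ]
    return payload


# Assumption: hand-written annotations stand in for a trained pipeline's output.
vocab = spacy.blank("en").vocab
doc = Doc(
    vocab,
    words=["Star", "Wars", "is", "a", "very", "popular",
           "science", "fiction", "series", "."],
    spaces=[True] * 8 + [False, False],
    pos=["PROPN", "PROPN", "AUX", "DET", "ADV", "ADJ",
         "NOUN", "NOUN", "NOUN", "PUNCT"],
    heads=[1, 2, 2, 8, 5, 8, 7, 8, 2, 2],  # absolute head indices
    deps=["compound", "nsubj", "ROOT", "det", "advmod", "amod",
          "compound", "compound", "attr", "punct"],
)

payload = doc_to_payload(doc)
body = json.dumps(payload)  # a string that a web API could return
print(body)
```

Storing character offsets for the chunks keeps the extra key consistent with the start and end convention used by the rest of the dictionary.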

