Getting Started with NLP and spaCy Transcripts
Chapter: Part 2: Exploring data with spaCy
Lecture: Testing a product hypothesis

0:00 If you're tagging along and you're also exploring this data set, then you might have noticed something
0:06 and that's related to these product entities that it sometimes detects.
0:11 Right off the bat, in this case Django makes a couple of appearances in this document and the model doesn't always have consistent predictions.
0:21 Over here Django is detected as a person, over here it's detected as a product
0:26 and there are also instances where it's not detected as an entity one way or another.
0:32 Again, there are all sorts of statistical reasons for this that depend on the data set that spaCy has used
0:38 but one thing that I have noticed, if the spaCy model detects a product
0:45 it is commonly, at least seemingly, referring to a Python tool or a programming language
0:52 and that kind of makes sense if you think about how people like to talk about products or programming languages
1:00 because usually it's a noun that has utility, so when I read a sentence like "we can just start with product itself before we dive into", etc.
1:10 Well, given the kind of text that I'm dealing with, I am curious if we were to reuse this product prediction from spaCy
1:19 do we actually get a bunch of programming related entities in return? And I wrote some code to just quickly test this hypothesis.
1:30 And here's a little script. 20 times I'm doing the following. I'm grabbing text from my lines generator, I'm turning it into a document
1:38 and then I'm checking all the entities that are in there and then I'm checking the label for those entities
1:43 and if the product string appears in any of those labels well, let's just render the document then
1:49 just so we can see what kind of products got detected. And if I just have a quick glance over here then Python is a product. Flask is a product.
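The script itself is not reproduced in the transcript, so what follows is only a rough sketch of the loop described above, not the code from the video. The helper names `docs_with_label` and `demo`, the default model name `en_core_web_sm`, and the shape of the `lines` generator (an iterator of plain strings) are all assumptions made for illustration.

```python
from itertools import islice


def docs_with_label(nlp, lines, label="PRODUCT", n=20):
    """Run `nlp` over the first `n` texts and keep the docs containing `label`.

    `nlp` is any callable that returns an object with an `.ents` sequence
    whose items expose `.label_`, such as a loaded spaCy pipeline.
    """
    hits = []
    for text in islice(lines, n):
        doc = nlp(text)
        # Collect the label of every entity the model found in this document.
        labels = [ent.label_ for ent in doc.ents]
        if label in labels:
            hits.append(doc)
    return hits


def demo(lines, model_name="en_core_web_sm"):
    """Render matching docs. Needs spaCy and the named model installed.

    The model name is an assumption; the lecture may use a different one.
    """
    import spacy
    from spacy import displacy

    nlp = spacy.load(model_name)
    for doc in docs_with_label(nlp, lines):
        # In a notebook this highlights the entities inline;
        # elsewhere it returns the markup as a string.
        displacy.render(doc, style="ent")
```

Calling `demo(lines)` with the transcript generator should show only the documents in which at least one PRODUCT entity was predicted, which is enough for a quick visual check of the hypothesis.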
2:04 I also see that Twitter is a product. JavaScript makes an appearance. So even though it's definitely not perfect
2:14 it does feel that for this particular corpus I might be able to reuse spaCy's product entity over here
2:23 to see if I can grab me a bunch of programming languages and/or programming tools that are used in the Talk Python transcripts.
2:32 As we'll see in a bit, it's not going to be perfect but it's not going to be horrible as a starting point either.

