Build An Audio AI App Transcripts
Chapter: Feature 2: Search
Lecture: Running the Search Engine

0:00 To get us just a little bit more familiar with the search engine, let's just do one more thing.
0:05 Let's see how we could bring in our transcript information, because the way it works now, it's only using the basic info.
0:12 So let's go down here and it says, "Build index for a podcast." And it comes down here, and basically it says, "Give us each episode."
0:23 If it doesn't have any changed contents here, then that's fine. We don't need to re-index it if nothing's changed.
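The "skip it if nothing changed" check could be sketched like this. The lecture does not show how the change is detected, so the content hash and the names `content_fingerprint` and `needs_reindex` are assumptions for illustration only.

```python
import hashlib
from typing import Optional


def content_fingerprint(*parts: Optional[str]) -> str:
    """Hash the raw text fields so a change in any one of them is detectable."""
    # A separator character keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(p or "" for p in parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def needs_reindex(stored_fingerprint: Optional[str], *parts: Optional[str]) -> bool:
    """True when the episode was never indexed or its contents have changed."""
    return stored_fingerprint != content_fingerprint(*parts)
```

Storing the fingerprint next to the keywords would let the indexer skip unchanged episodes on every later run.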
0:30 And then it says, "We're going to come up with some episode text," is what we're calling it. And episode text is going to include the title.
0:40 And if there's no title, it uses an empty string rather than None, because that avoids a crash when you try to add them together. Give it a space.
0:47 Come up with the description. Some of the podcasts, they ship their show notes as HTML. Others ship them as a PDF.
0:59 Others ship theirs as plain text. So this little thing, its job is to, if it's HTML, turn it into plain text, kind of like Markdown.
1:07 And then it takes all the tags that might be in the RSS and throws them in there. And then the base text up here has to do with,
1:16 what is the podcast title, podcast description, et cetera, et cetera. So it just takes all the words it can find and makes one giant string out of it.
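A minimal sketch of that episode text step, assuming the pieces arrive as plain Python strings. The names `html_to_text` and `build_episode_text` are hypothetical, and the standard library `html.parser` stands in for whatever HTML-to-text helper the course code actually uses.

```python
from html.parser import HTMLParser
from typing import Iterable, Optional


class _TextExtractor(HTMLParser):
    """Collects only the text nodes and ignores the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def html_to_text(markup: Optional[str]) -> str:
    """Strip tags from show notes; plain text passes through unchanged."""
    if not markup:
        return ""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse the whitespace left behind by removed tags.
    return " ".join(" ".join(parser.chunks).split())


def build_episode_text(
    base_text: Optional[str],
    title: Optional[str],
    description: Optional[str],
    tags: Iterable[str] = (),
) -> str:
    """Join every available piece into one big string of words."""
    # "or ''" swaps None for an empty string so the join never fails.
    parts = [base_text or "", title or "", html_to_text(description), " ".join(tags)]
    return " ".join(p for p in parts if p)
```

For example, `build_episode_text("Talk Python", None, "<p>Show <b>notes</b></p>", ["python", "web"])` gives `"Talk Python Show notes python web"`. Word order does not matter here, because the text is only used as a source of words.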
1:26 And then it says, "Hey, give me the transcript too." Right? Here's the full transcript for the episode.
1:31 And if there's a transcript, go to every word that appears in the transcript. Don't turn it into sentences and stuff. Just give me all the words.
1:41 Jam them in there as well. They don't have to be in any order. Remember, we're just looking for unique words that appear.
1:47 And then we add on any sort of summary information. We'll generate that later, but eventually we'll have summary information.
1:56 And in the end, this episode text is just, what words can we find about this episode?
2:03 The ones that it comes with and the ones that we generate through AssemblyAI.
2:06 And then we just say, turn that into a huge distinct set without duplication. That's what set does for the keywords.
2:16 And we're just going to save that, right? We're just going to stash that into the database like that and then save it.
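The keyword step could look roughly like the sketch below. It assumes a simple lowercase regex tokenizer as a stand-in for the spaCy processing the course uses, and `build_keywords` is a hypothetical name.

```python
import re
from typing import Iterable, List, Optional

# Assumption: a word is a run of letters, digits, or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def build_keywords(
    episode_text: str,
    transcript_words: Iterable[str] = (),
    summary: Optional[str] = None,
) -> List[str]:
    """Pool every word we know about the episode and drop the duplicates."""
    text = " ".join([episode_text, " ".join(transcript_words), summary or ""])
    # set() removes duplicates; sorting only makes the stored list stable.
    return sorted(set(WORD_RE.findall(text.lower())))
```

The resulting list is what would be stored on the search record and indexed in the database.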
2:24 So then when we do a search, it's incredibly simple, actually. Let's go see the search. Search episodes means SearchRecord.find,
2:34 where the keywords contain the word that you specified. And it builds up this query to say, and the next word and the next word and the next word.
2:45 If you pass in something like "geese of Canada", that would be three different AND statements, right? It shows you the latest one first and then boom, off it goes.
2:55 Just iterates it and gets the results. So that's how this works. That's how we're able to take things like transcripts and summaries
3:02 and plug them into this search engine. And with that, we should be able to run it and see stuff going into the database. So let's give that a go.
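The search described above could be sketched in memory as follows. The course runs this as a database query, so the list filtering here is only a stand-in, and the fields on `SearchRecord` are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass
class SearchRecord:
    # Hypothetical fields; the real record lives in the database.
    episode_title: str
    published: date
    keywords: List[str]


def search_episodes(records: Iterable[SearchRecord], query: str) -> List[SearchRecord]:
    """Return records whose keywords contain every word in the query, newest first."""
    terms = query.lower().split()
    if not terms:
        return []
    # One "contains" test per word, all joined with AND.
    matches = [r for r in records if all(t in r.keywords for t in terms)]
    return sorted(matches, key=lambda r: r.published, reverse=True)
```

A query such as "geese of Canada" becomes three conditions, and a record has to satisfy all of them to be returned.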
3:09 Restart it. And remember in five seconds right there, I'll clean this up. It should kick off. Let's see what happens.
3:18 One more thing that we've got to make sure we've got going here. So when I was showing you spaCy, that model
3:26 that they're using right here, this load model, the small one is not super large, but I want to have this work well for you.
3:36 So I chose the large English model. So we need to download this, and we can just go over here into our virtual environment
3:44 and say pip and they already got it. So we're just going to run python -m spacy download for this, which will ultimately pip install the thing.
3:54 So let it go. You can see it's 587 megs, which is why it doesn't come with it. So give it a second, but it's coming in nice and fast here. Excellent.
4:06 Now that's loaded. The search engine needs that to be present, because it needs it for that goose-to-geese trick I was talking about. Try again.
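A small sketch of checking for the model before starting the indexer. spaCy models install as regular Python packages, so the standard library can look for one without importing spaCy. The model name `en_core_web_lg` is an assumption based on "the large English model", and both function names are hypothetical.

```python
import importlib.util
import sys
from typing import List

# Assumption: spaCy's large English pipeline is the model the lecture means.
MODEL_NAME = "en_core_web_lg"


def model_installed(name: str = MODEL_NAME) -> bool:
    """True when a package with this name can be found in the active environment."""
    return importlib.util.find_spec(name) is not None


def download_command(name: str = MODEL_NAME) -> List[str]:
    """The command to run inside the virtual environment to fetch the model."""
    return [sys.executable, "-m", "spacy", "download", name]


if __name__ == "__main__":
    if not model_installed():
        print("Language model missing. Run:", " ".join(download_command()))
```

Printing the exact command when the model is missing gives the same kind of hint the app shows when it cannot load the model.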
4:13 Look at that. It's indexing Fresh Air, Talk Python, Accidental Tech Podcast, pulling in the episodes that it knows about. All right.
4:24 Indexing complete in 10 seconds. If we go over to our database again,
4:37 you can see here for episode number 572 for, what is this? Let's see. Accidental Tech Podcast.
4:45 We've got month, store, programming, Mac, fee, bootleg, schatzel, ferrite, and become and so on. Right?
4:57 So those are the words and we have an index on it. Super, super cool. And how many search records do we have? 2,052. That is pretty excellent. Awesome.
5:10 So that's how the search engine goes. That's how we've got to run it.
5:14 Make sure you install the language model through that command; it just tells you what to do if it doesn't work.
5:18 Then let it go in the background and just live in that asyncio space and do its thing.
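The background behavior could be sketched with plain asyncio as below. The five second startup delay comes from the lecture, while `build_index`, `delayed_indexing`, and the log list are hypothetical placeholders for the real indexing work.

```python
import asyncio
from typing import List


async def build_index(log: List[str]) -> None:
    """Placeholder for the real indexing pass over every podcast."""
    log.append("indexing")
    await asyncio.sleep(0)  # yield control, as real database calls would
    log.append("indexing complete")


async def delayed_indexing(log: List[str], startup_delay: float = 5.0) -> None:
    """Wait a few seconds after startup, then index in the background."""
    await asyncio.sleep(startup_delay)
    await build_index(log)


async def main(startup_delay: float = 5.0) -> List[str]:
    log: List[str] = []
    # create_task schedules the indexer without blocking the rest of startup.
    task = asyncio.create_task(delayed_indexing(log, startup_delay))
    log.append("app started")
    await task  # a web app would keep serving requests here instead
    return log
```

Running `asyncio.run(main())` logs the app start first and the indexing messages about five seconds later.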

