Async Techniques and Examples in Python Transcripts
Chapter: async and await with asyncio
Lecture: Synchronous web scraping

0:00 Let me introduce you to another program that we're going to use here. This is the web scraping example that I just spoke about,
0:06 and it's going to use a couple of libraries, requests and Beautiful Soup, so I've added a requirements file for it.
0:11 Just start by pip installing those requirements, which we can do by copying the path like that. Okay, so we have our code, and this is just
0:21 where we're going to start from. We don't have to write anything, but I want to walk you through it so you see what it is we're doing
0:28 when we get to the async stuff. So there are a couple of functions up here, get_HTML and get_title. Now get_title is a purely in-memory,
0:36 CPU-bound sort of thing. It's going to use Beautiful Soup, which is a library that understands the HTML DOM in memory
0:43 on the client side, like here, and it's going to let you do queries against it that are basically like CSS queries.
0:50 So we're going to say give me the header, and either we're going to get "missing" or it'll give us the text, cleaned up, out of the header.
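As a rough sketch of what a get_title like this might look like, here is one way to pull the first header out of an HTML string with Beautiful Soup. The "h1" selector, the "html.parser" backend, and the exact "MISSING" return string are assumptions made for illustration; the course's actual code may differ.

```python
from bs4 import BeautifulSoup


def get_title(html: str) -> str:
    """Return the text of the first <h1> in the page, or "MISSING"."""
    # Parse the HTML into an in-memory DOM (pure CPU work, no network).
    soup = BeautifulSoup(html, "html.parser")

    # select_one takes a CSS selector; "h1" is an assumed choice of header.
    header = soup.select_one("h1")
    if header is None:
        return "MISSING"

    # Strip surrounding whitespace so the title prints cleanly.
    return header.text.strip()
```

For example, get_title("<h1> Hello </h1>") gives back the string Hello, and a page with no h1 element gives back MISSING.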
0:57 This get_HTML is more interesting. We're going to give it an episode number, so we're going to go to my podcast, talkpython.fm,
1:04 and it's going to use this short little URL here, this shortcut, talkpython.fm/<some episode number>,
1:10 and it will redirect us to the right place, follow that redirect, and then get that data.
1:16 So we're just going to use requests, do a get on the URL. We're going to verify that that worked, and then we're going to return just the text.
1:24 And if we look down here, we're going to say get_title. So what it's going to do is go from episode 150 to 160.
1:32 It's going to first get the HTML, and then it's going to get the title, and then it's going to print. Now of course this is all serial code, right?
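The serial flow described above could be sketched roughly like this. It is not the course's exact code: the lower-case name get_html, the function name get_title_range, the fetch parameter, the 10 second timeout, the "h1" selector, and the "MISSING" string are all assumptions for illustration, and whether episode 160 itself is included is left to the caller.

```python
from typing import Callable, Iterable

import requests
from bs4 import BeautifulSoup


def get_html(episode_number: int) -> str:
    """Download the page for one episode via the short redirecting URL."""
    url = f"https://talkpython.fm/{episode_number}"
    # requests follows redirects on GET by default; the timeout is an assumed value.
    resp = requests.get(url, timeout=10)
    # Raise an exception if the server answered with a 4xx or 5xx status.
    resp.raise_for_status()
    return resp.text


def get_title(html: str) -> str:
    """Return the first <h1> text, or "MISSING" (assumed placeholder)."""
    header = BeautifulSoup(html, "html.parser").select_one("h1")
    return "MISSING" if header is None else header.text.strip()


def get_title_range(
    episodes: Iterable[int],
    fetch: Callable[[int], str] = get_html,  # hypothetical hook, handy for testing
) -> list[str]:
    """Fetch and parse each episode strictly one after another."""
    titles = []
    for n in episodes:
        html = fetch(n)          # blocks here until the server responds
        title = get_title(html)  # quick in-memory parse
        print(f"Title found: {title}")
        titles.append(title)
    return titles
```

Calling get_title_range(range(150, 160)) would make real requests to talkpython.fm, one at a time. Each call to fetch has to finish before the next one starts, which is where the waiting piles up.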
1:40 This is in order, so when we run this, don't expect any speedups, but you should see it working. So here it is: it's getting the HTML for episode 150,
1:53 then it's getting the title, and it found that it's Technical Lessons Learned from Pythonic Refactoring, then 151, and then Gradual Typing
2:03 for Production Applications, and so on, and eventually we're done. Let me just run it one more time
2:08 without me around, and just watch how long it takes, to get a sense for the speed here. It's doing one. It's waiting on the server to respond.
2:15 It's getting a response back. It's not terribly slow. My website's pretty fast, but it's not that fast,
2:21 and of course, I'm on the West Coast of the United States and the server is on the East Coast. There's a long ping time just to go from
2:30 west to east and back, about 100 milliseconds, I'm guessing. There's a lot of room for improvement here. That's what we're going to work with,
2:39 and our job is going to be to take this program, apply asyncio to the places where we're waiting, namely that line right there,
2:48 where you're doing lots of waiting, and make this go much, much faster. And it turns out, the algorithm that we have here,
2:54 as it's written, won't actually go faster, at least for this particular application. If we had a bunch of different attempts running in parallel,
3:02 it's a really straightforward transition, but there's one more thing to be learned to make this actually go faster in pure performance.
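To get a feel for why the serial version spends most of its time waiting, here is a small simulation. It makes no real requests: time.sleep stands in for the round trip to the server, and the function name simulate_serial_requests and its parameters are made up for this illustration.

```python
import time


def simulate_serial_requests(count: int, wait_seconds: float) -> float:
    """Run `count` pretend requests one after another; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(count):
        # time.sleep stands in for waiting on the server; nothing else runs meanwhile.
        time.sleep(wait_seconds)
    return time.perf_counter() - start
```

With 10 requests and a guessed wait of 0.1 seconds each, the waits alone add up to roughly one second, before any parsing or printing happens. That idle time is what applying asyncio is meant to reclaim.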

