Async Techniques and Examples in Python Transcripts
Chapter: async and await with asyncio
Lecture: Synchronous web scraping
0:00 Let me introduce you to another program
0:02 that we're going to use here.
0:03 This is the web scraping example that I just spoke about
0:05 and it's going to use a couple of libraries requests
0:07 and Beautiful Soup, and so I've added
0:09 a requirements file for it.
0:10 Just start by pip installing those requirements
0:13 which we can do by copying the path like that.
0:19 Okay, so we have our code, and this is just
0:20 where we're going to start from.
0:22 We don't have to write anything
0:23 but I want to walk you through it
0:24 so you see what it is we're doing
0:27 when we get to the async stuff.
0:28 So there are a couple of functions up here, get_html and get_title.
0:32 Now get_title is a purely in-memory
0:35 CPU-bound sort of thing.
0:37 It's going to use Beautiful Soup which is a library
0:39 that understands the HTML DOM in memory
0:42 on the client side like here, and it's going to let you
0:45 do queries against it that are basically like CSS queries.
0:49 So we're going to say give me the header
0:51 and either we're going to get missing
0:52 or it'll give us the text cleaned up out of the header.
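A minimal sketch of what a function like get_title might look like, assuming Beautiful Soup is installed as the bs4 package. The "h1" selector and the "MISSING" placeholder string are assumptions for illustration; the lecture only says the header is queried and that a missing value comes back when it is absent.

```python
import bs4


def get_title(html: str) -> str:
    # Parse the HTML into an in-memory DOM; no network access happens here.
    soup = bs4.BeautifulSoup(html, "html.parser")

    # select_one takes a CSS-style selector and returns the first match or None.
    # Assumption: the episode title lives in the page's first <h1>.
    header = soup.select_one("h1")
    if header is None:
        return "MISSING"

    # Strip surrounding whitespace and newlines from the header text.
    return header.text.strip()
```

Because this step only works on a string that is already in memory, it never waits on anything external, which is why it is not where asyncio would help.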
0:56 This get_html is more interesting.
0:58 We're going to give it an episode number
1:00 so we're going to go to my podcast talkpython.fm
1:03 and it's going to use this short little URL here
1:06 this shortcut, talkpython.fm/<some episode number>
1:09 and it will go and redirect us to the right place
1:12 follow that redirect, and then get that data.
1:15 So we're just going to use requests, do a get on the URL.
1:18 We're going to verify that that worked
1:20 and then we're going to return just the text.
1:23 And if we look down here, we're going to say get_title.
1:26 So what it's going to do is go from episode 150 to 160.
1:31 It's going to first get the HTML, and then it's going to
1:34 get the title, and then it's going to print.
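The get_html step described a moment ago could be sketched roughly like this, assuming the requests library is installed. The timeout value is an addition of this sketch and is not mentioned in the lecture.

```python
import requests


def get_html(episode_number: int) -> str:
    # The short URL redirects to the full episode page.
    url = f"https://talkpython.fm/{episode_number}"

    # requests follows redirects for GET by default. This call blocks until
    # the server has responded. The timeout is an assumption of this sketch.
    resp = requests.get(url, timeout=10)

    # Raise an exception if the server answered with a 4xx or 5xx status.
    resp.raise_for_status()

    # Return just the body text of the response.
    return resp.text
```

Calling get_html(150) would then return the page text for episode 150. The requests.get line is where the program sits idle while the network does its work.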
1:36 Now of course this is all serial code, right?
1:39 This is in order, so when we run this
1:42 don't expect any speedups, but you should see it working.
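The serial flow can be sketched as a plain loop. To keep the example runnable without a network connection, the fetch and parse steps are passed in as parameters and replaced by stand-ins; fake_get_html and fake_get_title are hypothetical helpers, and the half-open range (150 up to but not including 160) is an assumption, since the lecture does not say whether 160 is included.

```python
import time
from typing import Callable, List


def get_title_range(get_html: Callable[[int], str],
                    get_title: Callable[[str], str],
                    start: int = 150,
                    stop: int = 160) -> List[str]:
    titles = []
    for n in range(start, stop):
        # Each iteration waits for the fetch to finish before anything else
        # happens, so the waits add up one after another.
        html = get_html(n)
        title = get_title(html)
        print(f"Title found: {title}")
        titles.append(title)
    return titles


def fake_get_html(n: int) -> str:
    # Hypothetical stand-in for a network call: sleep to simulate latency.
    time.sleep(0.01)
    return f"<h1>Episode {n}</h1>"


def fake_get_title(html: str) -> str:
    # Hypothetical stand-in for the HTML parsing step.
    return html[len("<h1>"):-len("</h1>")]
```

Running get_title_range(fake_get_html, fake_get_title) processes the episodes strictly in order, with every simulated wait happening back to back.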
1:48 So here it is, it's getting the HTML for episode 150
1:52 then it's getting the title, and it found that
1:54 it's Technical Lessons Learned from Pythonic Refactoring
1:57 then 151, and then Gradual Typing
2:02 for Production Applications, and so on
2:03 and eventually we're done.
2:06 Let me just run it one more time
2:07 without me around, and just watch how long it takes
2:09 to get a sense for the speed here. It's doing one.
2:12 It's waiting on the server to respond.
2:14 It's getting a response back. It's not terribly slow.
2:16 My website's pretty fast, but it's not that fast
2:20 and of course, I'm on the West Coast of the United States
2:23 the server is on the East Coast.
2:24 There's a long ping time just to go from
2:29 west to east and back, about 100 milliseconds, I'm guessing.
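As a back-of-the-envelope check on that guess, here is a small sketch that multiplies the round-trip time by the number of requests. The count of ten requests is an assumption based on the episode range mentioned earlier, and the real figure would be higher because each short URL also involves a redirect and server processing time.

```python
def min_network_wait(request_count: int, round_trip_seconds: float) -> float:
    """Lower bound on idle time when requests run strictly one after another."""
    return request_count * round_trip_seconds


# Assumed values: ten episodes, roughly 100 ms per round trip.
estimate = min_network_wait(request_count=10, round_trip_seconds=0.100)
print(f"At least {estimate:.1f} s spent purely waiting on the network")
```

Since the serial version pays this cost once per request, the total grows linearly with the number of episodes.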
2:34 There's a lot of places for improvement here.
2:37 That's what we're going to work with
2:38 and our job is going to be to take this program
2:41 apply asyncio to the places where we're waiting
2:44 namely that line right there
2:47 you're doing lots of waiting there
2:48 and make this go much, much faster.
2:50 And it turns out, with the algorithm that we have here
2:53 asyncio alone won't actually make it go faster, at least for
2:56 this particular application.
2:58 If we had a bunch of different attempts running in parallel
3:01 it's a really straightforward transition
3:02 but there's one more thing to be learned
3:04 to make this actually go faster in pure performance.