Async Techniques and Examples in Python Transcripts
Chapter: async and await with asyncio
Lecture: async web scraping

0:00 Now we're in our asynchronous program, but notice it's still using requests and it's still doing the synchronous version.
0:07 Our job, our goal, during the next few sections is to convert this over to an async version that works much, much better.
0:18 Now, in order for us to actually write async code that does anything interesting, we need to use a library that supports asyncio, that has async methods
0:28 and coroutines that we can actually await. So instead of using requests, we're going to switch over to this thing called aiohttp.
0:35 Now, this is both a server and a client; it does WebSockets and all sorts of stuff. What we care about is this client thing.
0:42 So we're going to use this to very simply convert our requests code over to asyncio. So let's get started. So instead of using requests,
0:52 we're going to use aiohttp, and we're going to need to install some new requirements. So we don't need requests anymore, and we're going to use aiohttp.
1:02 That's the main library, but there are actually two other libraries that will make this even faster. So aiohttp has to do DNS lookups and other things,
1:13 so there's also aiodns, and cchardet. These two are going to make it a little bit better. So, we're going to copy that path and install those requirements.
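In case it helps to see it written out, the requirements amount to roughly this (a sketch; the actual file in the course may pin specific versions):

    aiohttp     # the async HTTP client (and server) library
    aiodns      # faster, async DNS resolution for aiohttp
    cchardet    # faster character-set detection for response decoding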
1:23 With the requirements in place, now we can start writing our code. We actually don't have to change very much.
1:37 This line right here, I'll comment it out for a second so we still have it; that's the line we have to change.
1:42 Now, I'm going to introduce you to a new bit of syntax that is a little bit funky. We've seen how to make this method async. We say, async, right?
1:51 And you might think, I would just write await, but it turns out the aiohttp client runs in a context manager, otherwise known as a with block.
2:03 And the with block itself has to do asynchronous stuff. So, Python has been extended to have what are called asynchronous with blocks,
2:13 or asynchronous context managers. So what we're going to write is async with aiohttp.ClientSession(), and then within the session
2:23 we're going to make a request, so we have another with block: we're going to get the URL as the response,
2:32 and then it's pretty similar to what requests has. We do this, and we do that. Now, this text here, if we look at it, it's an asynchronous function.
2:45 So, first of all, it wasn't a function in requests, it is here, but it's also async, so we have to await it.
2:51 This line right here is probably the most important one. This one and this one, these are the two most important ones here for what we're trying to do.
2:59 So we take this one line, and yeah, it gets a little more complicated, but trust me, the benefit is huge. All right, so let's go make this run.
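Putting those pieces together, the converted download code looks roughly like this. It's a sketch rather than the course's exact file, and the function name get_html is a placeholder:

    import aiohttp

    async def get_html(url: str) -> str:
        # The aiohttp client runs inside async context managers: one
        # "async with" for the session, and one for the request itself.
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                resp.raise_for_status()
                # In requests, .text is an attribute; here .text() is an
                # async method, so it has to be awaited.
                return await resp.text()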
3:07 So if I just try to run it down here notice this is not going to work so much. So this is actually returning a coroutine
3:14 not a string, and when we try to pass that where a string is expected, it goes whoa whoa whoa. Not great. So, how do we do this?
3:22 Actually, sorry, I don't want to run it here. Let's go up here and do it in main. Then over here, I'll just say loop.run_until_complete
3:31 and we're going to give it this, which means we're going to make this async as well, and then this gets pretty simple. All we have to do is await.
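So the change in main has roughly this shape (a sketch; asyncio.get_event_loop() stands in for however the loop was created earlier in the course, and asyncio.run() would be the more modern equivalent):

    import asyncio

    async def get_title_range():
        # Loops over the episode pages and awaits get_html() for each;
        # the body is shown a bit further down.
        ...

    def main():
        loop = asyncio.get_event_loop()
        # Calling get_title_range() only creates a coroutine; the event
        # loop is what actually runs it to completion.
        loop.run_until_complete(get_title_range())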
3:39 Now, this is absolutely going to run, it's going to do it asynchronously, I think everything is going to be perfect.
3:45 But it turns out, there's one little problem that we're going to run into. But, let's just run to see that it still works
3:51 at least the way it did before. So, we're going to run the program. It's working, you can see the titles are
3:57 correct: Understanding and Using Python's AST, How Python Evolves, etc., etc. But did you notice a difference in speed?
4:05 Did you see things happening concurrently? No. Let's look at that. That's a little bit weird here. So, if we look at the main part
4:13 we're running this function so let's go look at the get_title_range. And I'm going to make a copy of this so you can see how it was
4:21 I'll call this version one, this will be the old version, let's call it old version. This is the real one. So what happens when we run this
4:31 is we go through here, and each time we block and stop before we do anything else, and then we get the title and go on. So, yeah, there's this event loop,
4:40 but we're only doing one thing at a time. What we need to do is start all the requests and then go process the responses as they come in.
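To make the problem concrete, here's roughly what that first, still-sequential version does. The function name, the URL, the episode range, and the get_title() helper are placeholders, not the course's exact code:

    async def get_title_range_old_version():
        for n in range(1, 11):
            # Each await blocks this coroutine until the whole page has
            # downloaded, so only one request is ever in flight at a time.
            html = await get_html(f"https://example.com/episode/{n}")
            print(f"Title of episode {n}: {get_title(html)}")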
4:50 So we need to make a tiny little change here. Let's call this tasks equals this and we're going to kick them all off
4:59 so we're going to say, we're going to append, now I want to store basically this. So I'd love to just store this response here
5:09 that we get back; however, this is a coroutine, and this alone is not actually going to start it. Remember, these are like generators,
5:17 you have to trigger them to go. So, what I can do over here is I can say asyncio.create_task of that. I also need, when I print this out,
5:27 to pass the number and the HTML, so I'm going to need that later. So let's also pass the number as a tuple, so we're passing one thing to our list,
5:37 which is actually this tuple right here. So our goal is to start all the tasks, and then for each one of them
5:51 we then want to do the other work. So we'll say the HTML is await t, and then we're going to put it in there. So we start all the tasks, they're running,
6:01 and then we're going to either just get their value right back or we're going to now block and wait for that response to come in
6:09 and then get the next task; maybe it's already done and we'll get its response right away, or we've got to wait on the next one,
6:14 so the key thing here is instead of trying to do one at a time, we're going to start them all and then process them all.
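Here's a sketch of that fixed version, with the same placeholders as before (the URL, the episode range, and the get_title() helper are not the course's exact code):

    import asyncio

    async def get_title_range():
        tasks = []
        for n in range(1, 11):
            # create_task() schedules the coroutine on the running event
            # loop right away, so every download starts before we wait
            # on any of them.
            task = asyncio.create_task(get_html(f"https://example.com/episode/{n}"))
            tasks.append((n, task))

        for n, task in tasks:
            # Either the task has already finished and we get its value
            # right back, or we block here until that response comes in.
            html = await task
            print(f"Title of episode {n}: {get_title(html)}")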
6:21 Now, if you were asking for hundreds or thousands of pages, you might want to somehow rate limit this so that the tasks don't get too out of control
6:29 but if we're only doing 10, it's not too bad. Are you ready for the grand finale? For the big moment, to see if we actually got our code
6:37 one, working, and two, faster? I think we have, let's try. Look at that. Man, that is just awesome. I did nothing to speed that up.
6:48 I didn't edit the video at all. Let me run it one more time. That is so awesome, let me run it one more time. Start. Done. Bam.
7:01 Notice, we started all of the requests and then as they came in, we started to process them. The way in which we processed them
7:09 was the order we started them and it's probably not the order they actually finished. But that doesn't matter because
7:14 all the latency around the ping time, you know, we're making 10 requests over to the server, that's a whole second right there
7:20 just waiting on the internet. Well, we can do all those requests and get them all started and really just incur probably more or less
7:28 the ping time of one for this particular server. Maybe 100 milliseconds, not 1,000, which is really, really great.
7:35 And then of course, all the concurrent processing that the server's doing as well. So really, really awesome and that's how we were able to use asyncio
7:44 and a library that can do web requests that itself supports asyncio to dramatically increase the speed. While we're on the subject of aiohttp
7:56 let me just tell you a really, really quick story to drive this point home of how useful this library and this technique can be.
8:02 We talked about this on my other podcast, Python Bytes, and there was a listener, he knows I share this story every now and then, and it's pretty fun.
8:10 So, he had some project where he was requesting a whole bunch of pages and he was using requests, and it was taking hours or something like that.
8:20 He switched to this technique where he's using aiohttp and async and await and things like that, it went so fast that it actually crashed his server
8:29 because the server ran out of memory trying to process all the requests it was getting back all at once. So, I think that's awesome.
8:36 It goes from hours to less than a minute, and returns so much data that you actually have to think about the performance of receiving that much data at a time,
8:45 because you're adding so much concurrency to the system. And how hard was it? Well, yeah, this was like four lines instead of two
8:54 maybe instead of three? So, not too bad at all. The real key to the technique is to make sure you start all of the work and then
9:03 start to process the responses. 'Cause we saw in our first version, our old version, that we actually got zero speedup from that.
9:10 Just a little bit of added complexity for no real benefit. So here's doing some real work with asyncio and async and await.

