Async Techniques and Examples in Python Transcripts
Chapter: async and await with asyncio
Lecture: async web scraping
0:00
Now we're in our asynchronous program, but notice it's still using requests and it's still doing the synchronous version.
0:07
Our job, our goal, during the next few sections is to convert this over to an async version that works much, much better.
0:18
Now, in order for us to actually write async code that does anything interesting we need to use a library that supports asyncio, that has async methods
0:28
and coroutines that we can actually await. So instead of requests, we're going to switch over to this thing called aiohttp.
0:35
Now, this is both a server and a client; it does web sockets and all sorts of stuff. What we care about is this client thing.
0:42
So we're going to use this to very simply convert our requests code over to asyncio. So let's get started. So instead of using requests
0:52
we're going to use aiohttp, and we're going to need to install some new requirements. So we don't need requests anymore and we're going to use aiohttp.
1:02
That's the main library but there's actually two other libraries that will make this even faster. So aiohttp has to do DNS lookups and other things
1:13
so there's actually aiodns and cchardet. These two are going to make things a little bit faster. So, we're going to copy that path
1:23
and install those requirements. With the requirements in place now we can start writing our code. We actually don't have to change very much.
1:37
This line right here, I'll comment it out for a second so we still have it; that's the line we have to change.
1:42
Now, I'm going to introduce you to a new bit of syntax that is a little bit funky. We've seen how to make this method async. We say, async, right?
1:51
And you might think I would just write await, but it turns out the aiohttp client runs in a context manager, otherwise known as a with block.
2:03
And the with block itself has to do asynchronous stuff. So, Python has been extended to have what are called asynchronous with blocks
2:13
or asynchronous context managers. So what we're going to write is async with aiohttp.ClientSession(), and then within the session
2:23
we're going to make a request, so we have another async with block where we're going to get the URL as the response
2:32
and then it's pretty similar to what requests has. We do this, and we do that. Now, this text here, if we look at it, is an asynchronous function.
2:45
So, first of all, it wasn't a function in requests, but it is here, and it's also async, so we have to await it.
2:51
This line right here is probably the most important one. This one and this one, these are the two most important ones here for what we're trying to do.
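To make that conversion concrete, here's a rough sketch of what the converted download function might look like. The function name, URL pattern, and episode-number parameter are placeholders for this example, not necessarily what's on screen, and you'd install the libraries above with pip first.

```python
# Requires: pip install aiohttp aiodns cchardet
import aiohttp


async def get_html(episode_number: int) -> str:
    # Placeholder URL pattern for this sketch.
    url = f'https://talkpython.fm/{episode_number}'

    # "async with": an asynchronous context manager, whose setup and
    # teardown can themselves await.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            resp.raise_for_status()
            # In requests, .text is a property; in aiohttp, .text() is a
            # coroutine, so we have to await it.
            return await resp.text()
```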
2:59
So we take this one line and, yeah, it gets a little more complicated, but trust me, the benefit is huge. All right, so let's go make this run.
3:07
So if I just try to run it down here, notice this is not going to work so well. So this is actually returning a coroutine
3:14
not a string, and when we try to pass that where a string is expected, it goes whoa whoa whoa. Not great. So, how do we do this?
3:22
Actually, sorry, I don't want to run it here. Let's go up here and do it in main. Then over here, I'll just say loop.run_until_complete
3:31
and we're going to give it this, which means we're going to make this async as well, and then this gets pretty simple. All we have to do is await.
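Put together, the wiring might look roughly like this, reusing the hypothetical get_html sketch from above and assuming a simple synchronous get_title helper that pulls the title out of the HTML; the episode range is made up for the example.

```python
import asyncio


async def get_title_range():
    # Await each download in turn and print the episode title.
    for n in range(150, 160):
        html = await get_html(n)
        print(f'Title for episode {n}: {get_title(html, n)}')


def main():
    loop = asyncio.get_event_loop()
    # Drive the async function to completion from regular synchronous code.
    loop.run_until_complete(get_title_range())


if __name__ == '__main__':
    main()
```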
3:39
Now, this is absolutely going to run, it's going to do it asynchronously, and I think everything is going to be perfect.
3:45
But it turns out, there's one little problem that we're going to run into. But, let's just run to see that it still works
3:51
at least the way it did before. So, we're going to run the program. It's working, you can see the titles are
3:57
correct: Understanding and using Python's AST, How Python evolves, etc., etc. But did you notice a difference in speed?
4:05
Did you see things happening concurrently? No. Let's look at that. That's a little bit weird here. So, if we look at the main part
4:13
we're running this function so let's go look at the get_title_range. And I'm going to make a copy of this so you can see how it was
4:21
I'll call this version one; this will be the old version, let's call it old version. This is the real one. So what happens when we run this
4:31
is we go through here and each time we block and stop before we do anything else and then we get the title and go on. So, yeah, there's this event loop
4:40
but we're only doing one thing at a time. What we need to do is start all the requests and then go process the responses as they come in.
4:50
So we need to make a tiny little change here. Let's call this tasks equals this and we're going to kick them all off
4:59
so we're going to append, and now I want to store basically this. So I'd love to just store this response here
5:09
that we get back; however, just creating this coroutine is not actually going to start it running. Remember, these are like generators,
5:17
you have to trigger them to go. So, what I can do over here is say asyncio create_task of that. Also, when I print this out,
5:27
I need to pass the number and the HTML, so I'm going to need those later. Let's also pass the number as a tuple, so we're passing one thing to our list
5:37
which is actually this tuple right here. So our goal is to start all the tasks and then, for each one of them,
5:51
we then want to do the other work. So we'll say the HTML is await t, and then we're going to put it in there. So we start all the tasks, they're running,
6:01
and then we're going to either just get their value right back or we're going to now block and wait for that response to come in
6:09
and then get to the next task; maybe it's already done and we'll get its response right away, or we have to wait on the next one,
6:14
so the key thing here is instead of trying to do one at a time, we're going to start them all and then process them all.
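As a sketch, the reworked version might look something like this, again with the hypothetical get_html and get_title helpers from the earlier sketches; asyncio.create_task here stands in for whatever task-creation call is used on screen (loop.create_task behaves the same way inside a running loop).

```python
import asyncio


async def get_title_range():
    # Phase 1: create all the tasks up front so every download starts right away.
    tasks = []
    for n in range(150, 160):
        # create_task schedules the coroutine on the running event loop;
        # keep the episode number alongside it as a tuple for later.
        tasks.append((n, asyncio.create_task(get_html(n))))

    # Phase 2: now await the results; by the time we reach a task it may
    # well have finished already.
    for n, task in tasks:
        html = await task
        print(f'Title for episode {n}: {get_title(html, n)}')
```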
6:21
Now, if you were asking for hundreds or thousands of pages, you might want to somehow rate limit this so that the tasks don't get too out of control
6:29
but if we're only doing 10, it's not too bad.
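One possible way to do that rate limiting, not shown in the course, is an asyncio.Semaphore that caps how many downloads are in flight at once; this reuses the hypothetical helpers from the sketches above.

```python
import asyncio


async def get_html_limited(episode_number: int, sem: asyncio.Semaphore) -> str:
    # Only a fixed number of downloads get past this point at the same time.
    async with sem:
        return await get_html(episode_number)


async def get_title_range_limited():
    sem = asyncio.Semaphore(10)  # at most 10 requests in flight at once
    tasks = [(n, asyncio.create_task(get_html_limited(n, sem)))
             for n in range(150, 300)]
    for n, task in tasks:
        html = await task
        print(f'Title for episode {n}: {get_title(html, n)}')
```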
6:37
Are you ready for the grand finale? For the big moment, to see if we actually got our code, one, working, and two, faster? I think we have, let's try. Look at that. Man, that is just awesome. I did nothing to speed that up.
6:48
I didn't edit the video at all. That is so awesome; let me run it one more time. Start. Done. Bam.
7:01
Notice, we started all of the requests and then as they came in, we started to process them. The way in which we processed them
7:09
was the order we started them in, and that's probably not the order they actually finished in. But that doesn't matter because
7:14
all the latency around the ping time, you know, we're making 10 requests over to the server, that's a whole second right there
7:20
just waiting on the internet. Well, we can do all those requests and get them all started and really just incur probably more or less
7:28
the ping time of one for this particular server. Maybe 100 milliseconds, not 1,000, which is really, really great.
7:35
And then of course, all the concurrent processing that the server's doing as well. So really, really awesome and that's how we were able to use asyncio
7:44
and a library that can do web requests that itself supports asyncio to dramatically increase the speed. While we're on the subject of aiohttp
7:56
let me just tell you a really, really quick story to drive this point home of how useful this library and this technique can be.
8:02
We talked about this on my other podcast, Python Bytes, and there was a listener, he knows I share this story every now and then, and it's pretty fun.
8:10
So, he had some project where he was requesting a whole bunch of pages and he was using requests, and it was taking hours or something like that.
8:20
He switched to this technique, using aiohttp and async and await and things like that, and it went so fast that it actually crashed his server
8:29
because the server ran out of memory trying to process all the requests it was getting back all at once. So, I think that's awesome.
8:36
It went from hours to less than a minute, and so much data that you actually have to think about the performance of receiving that much data at a time
8:45
because you're adding so much concurrency to the system. And how hard was it? Well, yeah, this was like four lines instead of two
8:54
maybe instead of three? So, not too bad at all. The real key to the technique is to make sure you start all of the work and then
9:03
start to process the responses. 'Cause we saw in our first version, our old version, that we actually got zero speedup from that.
9:10
Just a little bit of added complexity for no real benefit. So that's doing some real work with asyncio and async and await.