Async Techniques and Examples in Python Transcripts
Chapter: async and await with asyncio
Lecture: async web scraping
0:00 Now we're in our asynchronous program
0:02 but notice it's still using requests
0:04 and it's still doing the synchronous version.
0:06 Our job, our goal, over the next few sections
0:10 is to convert this over to an async version
0:13 that works much, much better.
0:17 Now, in order for us to actually write
0:19 async code that does anything interesting
0:21 we need to use a library that supports
0:25 asyncio, that has async methods
0:27 and coroutines that we can actually await.
0:30 So we're no longer using requests
0:31 we're going to switch over to this thing called aiohttp.
0:34 Now, this is both a server and a client
0:36 it does web sockets and all sorts of stuff.
0:39 What we care about is this client thing.
0:41 So we're going to use this to very simply
0:45 convert our requests code over to asyncio.
0:49 So let's get started. So instead of using requests
0:51 we're going to use aiohttp, and we're going to need
0:55 to install some new requirements.
0:57 So we don't need requests anymore
0:59 and we're going to use aiohttp.
1:01 That's the main library
1:03 but there's actually two other libraries
1:05 that will make this even faster.
1:07 So aiohttp has to do DNS lookups and other things
1:12 so there's actually an aiodns, and a cchardet.
1:17 These two are going to be a little bit better.
1:20 So, we're going to copy that path
1:22 and install those requirements.
1:30 With the requirements in place
1:32 now we can start writing our code.
1:34 We actually don't have to change very much.
1:36 This line right here, I'll comment it
1:37 out for a second so we still have it
1:39 that's the line we have to change.
1:41 Now, I'm going to introduce you to a new bit
1:44 of syntax that is a little bit funky.
1:46 We've seen how to make this method async.
1:48 We say, async, right?
1:50 And you might think, I would just write, await
1:55 but it turns out aiohttp client runs in a context manager.
2:00 Otherwise known as a with block.
2:02 And the with block itself has to do asynchronous stuff.
2:06 So, Python has been extended to have
2:09 what are called asynchronous with blocks
2:12 or asynchronous context managers.
2:14 So what we're going to write is async with
2:17 aiohttp.ClientSession, and then within the session
2:22 we're going to make a request
2:24 so we have another with block
2:28 we're going to get the URL as the response
2:31 and then it's pretty similar to what requests has.
2:35 We do this, and we do that.
2:38 Now, this text here, if we look at it
2:41 it's an asynchronous function.
2:44 So, first of all, it wasn't a function in requests
2:46 it is here, but it's also async so we have to await it.
2:50 This line right here is probably the most important one.
2:54 This one and this one, these are the
2:55 two most important ones here
2:57 for what we're trying to do.
2:58 So we take this one line, and yeah
3:00 it gets a little more complicated
3:02 but trust me, the benefit is huge.
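Assembled, the converted request code looks roughly like this. This is a sketch of the pattern described above, not the course's exact file, and the function name get_html is illustrative:

```python
import asyncio
import aiohttp  # pip install aiohttp aiodns cchardet

async def get_html(url: str) -> str:
    # async with: the asynchronous context manager syntax
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            resp.raise_for_status()
            # Unlike requests' .text attribute, .text() here is a
            # coroutine method, so it has to be awaited.
            return await resp.text()

# Usage (from other async code):
#     html = await get_html("https://example.com")
```

The two awaits noted in the lecture are the session.get context entry and the resp.text() call.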
3:04 All right, so let's go make this run.
3:06 So if I just try to run it down here
3:07 notice this is not going to work so much.
3:10 So this is actually returning a coroutine
3:13 not a string, and when we try to pass that
3:15 where a string is expected, it goes whoa whoa whoa.
3:18 Not great. So, how do we do this?
3:21 Actually, sorry, I don't want to run it here.
3:23 Let's go up here and do it in main.
3:27 Then over here, I'll just say loop.run_until_complete
3:30 and we're going to give it this
3:31 which means we're going to make this async as well
3:35 then this gets pretty simple.
3:36 All we have to do is await.
3:38 Now, this is going to absolutely run
3:40 it's going to do it asynchronously
3:42 I think everything is going to be perfect.
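That loop wiring, sketched with a stand-in body so it runs on its own (the real get_title_range awaits the actual page fetches):

```python
import asyncio

async def get_title_range():
    # Stand-in body; the real version awaits get_html() for each page.
    await asyncio.sleep(0)
    return "titles"

def main():
    # Hand the coroutine to the event loop and block until it finishes.
    loop = asyncio.new_event_loop()
    try:
        result = loop.run_until_complete(get_title_range())
    finally:
        loop.close()
    return result

print(main())  # → titles
```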
3:44 But it turns out, there's one little problem
3:47 that we're going to run into.
3:48 But, let's just run to see that it still works
3:50 at least the way it did before.
3:52 So, we're going to run the program.
3:55 It's working, you can see the titles are
3:56 correct: Understanding and Using Python's AST
3:59 How Python evolves, etc, etc.
4:02 But, did you notice a difference in speed?
4:04 Did you see things happening concurrently?
4:06 No. Let's look at that.
4:08 That's a little bit weird here.
4:10 So, if we look at the main part
4:12 we're running this function
4:14 so let's go look at the get_title_range.
4:17 And I'm going to make a copy of this
4:19 so you can see how it was
4:20 I'll call this version one
4:22 this will be the old version
4:23 let's call it old version.
4:27 This is the real one.
4:29 So what happens when we run this
4:30 is we go through here and each time we block and stop
4:34 before we do anything else
4:36 and then we get the title and go on.
4:38 So, yeah, there's this event loop
4:39 but we're only doing one thing at a time.
4:41 What we need to do is start all the requests and then
4:46 then go process the responses as they come in.
4:49 So we need to make a tiny little change here.
4:51 Let's call this tasks equals this
4:56 and we're going to kick them all off
4:58 so we're going to say, we're going to append
5:02 now I want to store basically this.
5:05 So I'd love to just store this response here
5:08 that we get back, as a coroutine
5:11 that's been started. However, this alone is not
5:14 actually going to start it.
5:15 Remember, these are like generators:
5:16 you have to trigger them to go.
5:18 So, what I can do over here is
5:19 I can say asyncio, I create_task of that
5:24 I also need when I print this out
5:26 I need to pass the number and the html
5:30 so I'm going to need that later.
5:31 So let's also pass the number as a tuple
5:35 so we're passing one thing to our list
5:36 which is actually this tuple right here.
5:42 So what our goal is, is to start all the tasks
5:46 and then for each one of them
5:50 we then want to do the other work.
5:53 So we'll say the html is await t
5:56 and then we're going to put it in there.
5:58 So we start all the tasks, they're running
6:00 and then we're going to either
6:02 just get their value right back
6:04 or we're going to now block and wait
6:06 for that response to come in
6:08 and then get the next task
6:09 maybe it's already done, we'll get its response right away
6:11 and then we wait on the next one
6:13 so the key thing here is instead of trying
6:16 to do one at a time, we're going to
6:18 start them all and then process them all.
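The restructured loop looks roughly like this. The fetch is simulated with asyncio.sleep so the sketch runs standalone; episode numbers and names are illustrative:

```python
import asyncio

async def get_html(episode: int) -> str:
    # Stand-in for the aiohttp request; simulates network latency.
    await asyncio.sleep(0.05)
    return f"<html>episode {episode}</html>"

async def get_title_range():
    # First, start ALL the requests as tasks...
    tasks = []
    for n in range(150, 160):
        # Store a (number, task) tuple so we keep the episode
        # number alongside the running coroutine.
        tasks.append((n, asyncio.create_task(get_html(n))))

    # ...then process the responses as they come in.
    titles = []
    for n, t in tasks:
        html = await t
        titles.append((n, html))
    return titles

results = asyncio.run(get_title_range())
print(len(results))  # → 10
```

Because all ten tasks are already running, the awaits in the second loop mostly just collect results that are finished or nearly finished.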
6:20 Now, if you were asking for hundreds or thousands
6:22 of pages, you might want to somehow
6:24 rate limit this so that the tasks
6:27 don't get too out of control
6:28 but if we're only doing 10, it's not too bad.
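One common way to do that rate limiting (a sketch, not from the course) is an asyncio.Semaphore that caps how many fetches run at once:

```python
import asyncio

async def fetch(n: int, sem: asyncio.Semaphore) -> str:
    # At most 10 fetches hold the semaphore at a time.
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for the real request
        return f"page {n}"

async def main():
    sem = asyncio.Semaphore(10)
    tasks = [asyncio.create_task(fetch(n, sem)) for n in range(100)]
    return [await t for t in tasks]

pages = asyncio.run(main())
print(len(pages))  # → 100
```

All 100 tasks are created up front, but only 10 are inside the semaphore at any moment, keeping the in-flight work bounded.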
6:31 Are you ready for the grand finale?
6:34 For the big moment, to see if we actually got our code
6:36 one, working, and two, faster?
6:39 I think we have, let's try.
6:42 Look at that. Man, that is just awesome.
6:45 I did nothing to speed that up.
6:47 I didn't edit the video at all.
6:48 Let me run it one more time.
6:54 That is so awesome, let me run it one more time.
6:57 Start. Done. Bam.
7:00 Notice, we started all of the requests
7:04 and then as they came in, we started to process them.
7:07 The way in which we processed them
7:08 was the order we started them
7:11 and it's probably not the order they actually finished.
7:12 But that doesn't matter because
7:13 all the latency around the ping time
7:16 you know we're making 10 requests over to the server
7:18 that's a whole second right there
7:19 just waiting on the internet.
7:21 Well, we can do all those requests
7:23 and get them all started
7:24 and really just incur probably more or less
7:27 the ping time of one for this particular server.
7:31 Maybe 100 milliseconds, not 1,000
7:33 which is really, really great.
7:34 And then of course, all the concurrent processing
7:35 that the server's doing as well.
7:37 So really, really awesome
7:39 and that's how we were able to use asyncio
7:43 and a library that can do web requests
7:46 that itself supports asyncio
7:49 to dramatically increase the speed.
7:53 While we're on the subject of aiohttp
7:55 let me just tell you a really, really quick story
7:57 to drive this point home of how useful
7:59 this library and this technique can be.
8:01 We talked about this on my other podcast
8:03 PythonBytes, and there was a listener
8:05 he knows I share this story every now and then
8:08 and it's pretty fun.
8:09 So, he had some project where he was
8:11 requesting a whole bunch of pages
8:15 and he was using requests, and it was taking
8:16 hours or something like that.
8:19 He switched to this technique
8:20 where he's using aiohttp and async and await
8:23 and things like that, it went so fast
8:26 that it actually crashed his server
8:28 because the server ran out of memory
8:30 trying to process all the requests
8:31 it was getting back all at once.
8:33 So, I think that's awesome.
8:35 It goes from hours to less than a minute
8:39 and so much data you actually have to think
8:40 about the performance of receiving
8:42 that much data at a time
8:44 because you're adding so much concurrency
8:47 to the system.
8:48 And how hard was it? Well, yeah, this was like four lines instead of two
8:53 or maybe three.
8:55 So, not too bad at all.
8:56 The real key to the technique is to make sure
8:58 you start all of the work and then
9:02 start to process the responses.
9:04 'Cause we saw in our first version
9:05 our old version, that we actually got
9:08 zero speedup from that.
9:09 Just a little bit of added complexity for no real benefit.
9:12 So here's doing some real work with asyncio
9:15 and async and await.