Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Downloading transcript html

0:01 Okay, next up, we want to download the actual html from each one of these pages.
0:06 So, over here, I've created a function called download transcript pages,
0:10 and I am going to pass the urls in and I want to actually download
0:14 the html for each one of those and do a little transformation on it.
0:17 Now to make that transformation nicer, I want to define two named tuples,
0:22 I want to define a page and a paragraph, and the page is going to contain
0:28 a list of paragraphs as well as the title and url,
0:31 and the paragraph is going to contain
0:34 the text of the paragraph and the seconds, the timing.
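As a rough sketch, the two named tuples described here could look like the following. The field names and their order are assumptions based on the description (title, url, and paragraphs for the page; text and seconds for the paragraph), not necessarily what appears in the video.

```python
from collections import namedtuple

# Field names are assumed from the narration, not copied from the video.
Page = namedtuple("Page", "url title paragraphs")
Paragraph = namedtuple("Paragraph", "text seconds")

# A page holds its url, its title, and a list of timed paragraphs.
intro = Paragraph(text="Welcome to the show.", seconds=0)
page = Page(url="https://example.com/transcripts/1",  # hypothetical URL
            title="Episode 1",
            paragraphs=[intro])

print(page.title, page.paragraphs[0].seconds)
```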
0:37 If we go and look over here, at one of these right here,
0:41 you can't tell yet, but if you actually view the source of this,
0:46 this is what our application is going to see,
0:51 you can see that the timing here is on each one of these,
0:55 the timing isn't super precise for this stuff,
0:58 it's much better for the courses, but basically we have these classes
1:01 that are transcript segments and each one of them has a time
1:04 and some text we've got to strip off from the beginning, as well as some text
1:09 that is just the body of the paragraph, okay, so our goal would be
1:14 to actually parse these out and turn this into like a list,
1:17 so we have this text associated with this time and we can ask questions like
1:22 hey what was said at this time, or if you want to know where this time is,
1:26 we get to seek to that second in the audio and just start playing exactly what this is,
1:30 okay, so that is going to be our goal, is to download this whole page
1:34 or a whole bunch of these pages like a list of these pages,
1:38 because we got them all from the site map, download these and parse them.
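To make the "what was said at this time" idea concrete, here is a minimal sketch of such a lookup once the paragraphs are parsed. The helper name paragraph_at is hypothetical, and it assumes the paragraphs are already sorted by their starting second.

```python
from bisect import bisect_right
from collections import namedtuple

# Assumed shape of a parsed paragraph: its text and its starting second.
Paragraph = namedtuple("Paragraph", "text seconds")


def paragraph_at(paragraphs, seconds):
    """Return the paragraph being spoken at `seconds`, or None if the
    time falls before the first paragraph. Assumes sorted input."""
    starts = [p.seconds for p in paragraphs]
    # Index of the last paragraph whose start is <= the requested time.
    index = bisect_right(starts, seconds) - 1
    return paragraphs[index] if index >= 0 else None


paragraphs = [
    Paragraph("Welcome to the show.", 0),
    Paragraph("Let's meet our guest.", 65),
    Paragraph("On to the main topic.", 130),
]
print(paragraph_at(paragraphs, 70).text)
```

Going the other way, seeking the audio to a paragraph is just a matter of reading its seconds field.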
1:41 Now, I just ask when you do this, be kind,
1:44 let's not just completely power over the server,
1:47 it's not really going to kill it but let's just get, let's say the top 5.
1:50 Alright, so we are going to download the top 5 pages,
1:53 this is 0 up to but not including index 5, and then we'll get them out,
1:56 so just to keep it a little bit chill here.
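The "0 up to but not including index 5" slice can be sketched like this. The URLs are made-up placeholders standing in for the ones read from the site map.

```python
# Hypothetical URLs standing in for the ones read from the site map.
tx_urls = [f"https://example.com/transcripts/{n}" for n in range(1, 76)]

# Indexes 0, 1, 2, 3, 4: the stop index 5 is not included.
top_five = tx_urls[:5]

print(len(top_five))
```

Slicing is also safe when the list has fewer than five entries; it just returns whatever is there.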
1:59 Okay, you guys don't want to watch it download 75 pages and parse it anyway.
2:02 Okay, so let's go over here, we are going to do this download
2:05 and we'll say for url in tx_urls, we've got to download and parse that page,
2:11 okay so let's say page=build_page_from_url, and give it the url,
2:22 let's go ahead and add that function as well, and here then we'll just say
2:25 pages.append(page), okay, so keep that nice and simple,
2:29 let's change the order to read from high level to low level, so I'll put it like that,
2:36 now, over here, we're going to say something totally normal,
2:38 response = requests.get(url), right, and I guess we could put
2:44 some sort of error handling, I am going to assume this works; in a real app,
2:48 you put your own error handling here, maybe we could check if this is None,
2:53 we don't add it, something to that effect.
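One way the download loop and the "check if this is None" idea might fit together is sketched below. The fetch_html helper, the fetch parameter, and the 10 second timeout are assumptions made for illustration; the lecture itself skips error handling and assumes the request works.

```python
from typing import Callable, List, Optional


def fetch_html(url: str) -> Optional[str]:
    """Return the page HTML, or None if the request fails."""
    # requests is third-party; it is imported here so the rest of the
    # sketch can be exercised without it (a choice made for this example).
    import requests

    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
    except requests.RequestException:
        return None
    return response.text


def download_transcript_pages(
    tx_urls: List[str],
    fetch: Callable[[str], Optional[str]] = fetch_html,
) -> List[str]:
    """Download each URL in order, skipping any page that failed."""
    pages = []
    for url in tx_urls:
        html = fetch(url)
        if html is None:
            continue  # the "if this is None, we don't add it" case
        pages.append(html)
    return pages
```

Passing the fetch function in as a parameter is only there so the loop can be tried with canned HTML instead of hitting the server.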
2:56 Alright, so let's assume this works, we'll say html=response.text,
2:59 that's what we're going to get back, and then, we need to do a couple of things,
3:03 we need to get the title and we need to get all of these pieces here,
3:09 now let me just start with: regular expressions are not the answer,
3:13 okay, they are definitely not the answer, so we want to do two things,
3:17 we are going to find over here somewhere that there is an h1 I believe,
3:21 there we go, so we have our h1, we have our sort of ems,
3:27 our spaces, our new lines, our brs all that kind of stuff,
3:30 but what we are going to do is we are going to get this h1
3:33 and that is going to be the title of our page, so that seems pretty straightforward,
3:37 there is only one h1, if I look that's the only one right there,
3:41 so we can just go after that one thing right, that is how you should design pages,
3:45 they should have only one h1,
3:49 but, how do we do that, how do we get started?
3:51 Well, this is why we are going to bring Beautiful Soup into action.
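As a preview, pulling the title out of the single h1 with Beautiful Soup might look roughly like this. The helper name get_title and the sample HTML are assumptions, and the sketch needs the third-party beautifulsoup4 package installed.

```python
from typing import Optional

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def get_title(html: str) -> Optional[str]:
    """Return the text of the first h1, or None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    if h1 is None:
        return None
    # Collapse the new lines and runs of spaces inside the heading and
    # drop nested tags such as <em>, keeping only the text.
    return " ".join(h1.get_text().split())


# Made-up HTML, loosely shaped like the page described above.
sample = """
<html><body>
  <h1>
    Transcript for <em>Episode 1</em>
  </h1>
  <p class="transcript-segment">0:00 Welcome to the show.</p>
</body></html>
"""
print(get_title(sample))
```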