Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Downloading transcript html

0:01 Okay, next up, we want to download the actual HTML from each one of these pages.
0:07 So, over here, I've created a function called download_transcript_pages, and I am going to pass the URLs in and I want to actually download
0:15 the HTML for each one of those and do a little transformation on it. Now to make that transformation nicer, I want to define two named tuples,
0:23 I want to define a page and a paragraph, and the page is going to contain a list of paragraphs as well as the title and url,
0:32 and the paragraph is going to contain the text of the paragraph and the seconds, the timing. If we go and look over here, at one of these right here,
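The two named tuples described here could be sketched like this; the field names are my guesses based on what the lecture says each one holds:

```python
import collections

# A Page holds the page title, its url, and a list of Paragraph entries;
# each Paragraph holds the paragraph text and its timing in seconds.
Page = collections.namedtuple('Page', 'title, url, paragraphs')
Paragraph = collections.namedtuple('Paragraph', 'text, seconds')
```

Named tuples keep these lightweight and immutable while still giving readable attribute access like `page.title` or `paragraph.seconds`.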
0:42 you can't tell yet, but if you actually view the source of this, this is what our application is going to see,
0:52 you can see that the timing here is on each one of these; the timing isn't super precise for this stuff,
0:59 it's much better for the courses, but basically we have these classes that are transcript segments, and each one of them has a time
1:05 and some text we've got to strip off from the beginning, as well as the body of the paragraph, okay, so our goal would be
1:15 to actually parse these out and turn this into like a list, so we have this text associated with this time and we can ask questions like
1:23 hey, what was said at this time, or if you want to know where this time is,
1:27 we get to seek to that second in the audio and just start playing exactly what this is,
1:31 okay, so that is going to be our goal, is to download this whole page or a whole bunch of these pages like a list of these pages,
1:39 because we got them all from the site map, download these and parse them. Now, I just ask when you do this, be kind,
1:45 let's not just completely hammer the server, it's not really going to kill it, but let's just get, let's say, the top 5.
1:51 Alright, so we are going to download the top 5 pages, this is 0 up to but not including index 5, and then we'll get them out,
1:57 so just to keep it a little bit chill here. Okay, you guys don't want to watch it download 75 pages and parse it anyway.
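The slicing being described could look like this; the placeholder list just stands in for the URLs gathered from the sitemap earlier:

```python
# Keep the load on the server light: take only the first five URLs.
# This placeholder list stands in for the ~75 URLs from the sitemap.
url_list = ['url-%d' % i for i in range(75)]

# Slice from index 0 up to, but not including, index 5.
top_five = url_list[0:5]
```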
2:03 Okay, so let's go over here, we are going to do this download and we'll say for url in tx_URLs, we've got to download and parse that page,
2:12 okay so let's say page=build_page_from_url, and give it the url, let's go ahead and add that function as well, and here then we'll just say
2:26 pages.append(page), okay, so keep that nice and simple, let's change the order to read from high level to low level, so I'll put it like that,
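The high-level loop being typed here could be sketched as follows; `build_page_from_url` is stubbed out, since its real body (downloading and parsing the HTML) is built up over the rest of the lecture:

```python
def build_page_from_url(url):
    # Placeholder: the real version, developed in the lecture,
    # downloads the HTML and parses out the title and paragraphs.
    return url

def download_transcript_pages(urls):
    # High-level function reads first: walk the URLs and build
    # a page object for each one.
    pages = []
    for url in urls:
        page = build_page_from_url(url)
        pages.append(page)
    return pages
```

Putting `download_transcript_pages` above `build_page_from_url` in the file is the "read from high level to low level" ordering mentioned above.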
2:37 now, over here, we're going to say something totally normal, response = requests.get(url), right, and I guess we could put
2:45 some sort of error handling, I am going to assume this works; in a real app, you put your own error handling here, maybe we could check if this is none,
2:54 we don't add it, something to that effect. Alright, so let's assume this works, we'll say html = response.text,
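A minimal sketch of this download step with the kind of error handling the lecture suggests adding; it assumes the third-party `requests` package, and returning `None` to mean "skip this page" is my choice, not something the lecture specifies:

```python
import requests

def get_html(url):
    # Download one page; on any request failure, signal "skip this
    # page" by returning None instead of raising.
    try:
        response = requests.get(url)
        response.raise_for_status()  # treat 4xx/5xx as failures too
    except requests.RequestException:
        return None
    return response.text
```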
3:00 that's what we're going to get back, and then, we need to do a couple of things, we need to get the title and we need to get all of these pieces here,
3:10 now let me just start with this: regular expressions are not the answer, okay, they are definitely not the answer, so we want to do two things,
3:18 we are going to find over here somewhere that there is an h1 I believe, there we go, so we have our h1, we have our sort of ems,
3:28 our spaces, our new lines, our brs all that kind of stuff, but what we are going to do is we are going to get this h1
3:34 and that is going to be the title of our page, so that seems pretty straightforward, there is only one h1, if I look that's the only one right there,
3:42 so we can just go after that one thing right, that is how you should design pages, they should have only one h1,
3:50 but, how do we do that, how do we get started? Well, this is why we are going to bring Beautiful Soup into action.
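Grabbing that single h1 with Beautiful Soup could look like this; it assumes the third-party `bs4` package, and the exact parsing code the lecture builds may differ:

```python
import bs4

def parse_title(html):
    # Parse the page and pull the text out of the one h1,
    # which serves as the page title; None if no h1 is found.
    soup = bs4.BeautifulSoup(html, 'html.parser')
    h1 = soup.find('h1')
    return h1.get_text(strip=True) if h1 else None
```

Because a well-designed page has exactly one h1, `soup.find('h1')` is enough; there is no need to search by class or id for the title.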

