Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Downloading transcript html
0:01
Okay, next up, we want to download the actual HTML from each one of these pages.
0:07
So, over here, I've created a function called download_transcript_pages, and I am going to pass the URLs in, and I want to actually download
0:15
the HTML for each one of those and do a little transformation on it. Now to make that transformation nicer, I want to define two named tuples,
0:23
I want to define a Page and a Paragraph, and the Page is going to contain a list of paragraphs as well as the title and URL,
0:32
and the Paragraph is going to contain the text of the paragraph and the seconds, the timing.
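A minimal sketch of what those two named tuples might look like; the exact field names and ordering here are my assumptions based on the description above, not necessarily the course's code:

```python
import collections

# Assumed shapes: a Page holds its title, URL, and list of Paragraphs;
# a Paragraph holds its text and its timing in seconds.
Page = collections.namedtuple('Page', 'title url paragraphs')
Paragraph = collections.namedtuple('Paragraph', 'text seconds')
```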
0:42
If we go and look over here at one of these right here, you can't tell yet, but if you actually view the source of this, this is what our application is going to see,
0:52
you can see that the timing here is on each one of these; the timing isn't super precise for this stuff,
0:59
it's much better for the courses, but basically we have these classes that are transcript segments and each one of them has a time
1:05
and some text we've got to strip off from the beginning, as well as some text that is just the body of the paragraph. Okay, so our goal would be
1:15
to actually parse these out and turn this into like a list, so we have this text associated with this time and we can ask questions like
1:23
hey, what was said at this time, or if you want to know where this time is,
1:27
we get to seek to that second in the audio and just start playing exactly what this is.
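To make that concrete, here is a rough sketch of the kind of lookup we could do once a page is parsed into those named tuples; the helper name and the "last paragraph at or before this time" rule are my own assumptions:

```python
def paragraph_at(page, seconds):
    # Hypothetical helper: assumes page.paragraphs is ordered by its
    # seconds field, and returns the last paragraph that starts at or
    # before the requested time.
    match = None
    for paragraph in page.paragraphs:
        if paragraph.seconds > seconds:
            break
        match = paragraph
    return match
```

So something like paragraph_at(page, 95) would answer "what was being said at 1:35?"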
1:31
Okay, so that is going to be our goal: to download this whole page, or a whole bunch of these pages, like a list of these pages,
1:39
because we got them all from the sitemap, download these and parse them. Now, I just ask, when you do this, be kind,
1:45
let's not just completely hammer the server, it's not really going to kill it, but let's just get, let's say, the top 5.
1:51
Alright, so we are going to download the top 5 pages, this is 0 up to but not including index 5, and then we'll get them out,
1:57
so just to keep it a little bit chill here. Okay, you guys don't want to watch it download 75 pages and parse it anyway.
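Just to spell that slice out with throwaway data, since this is exactly the Python slicing rule being described:

```python
urls = ['u0', 'u1', 'u2', 'u3', 'u4', 'u5', 'u6']
top_five = urls[:5]    # index 0 up to, but not including, index 5
print(top_five)        # ['u0', 'u1', 'u2', 'u3', 'u4']
```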
2:03
Okay, so let's go over here, we are going to do this download and we'll say for url in tx_URLs, we've got to download and parse that page,
2:12
okay, so let's say page = build_page_from_url(url), let's go ahead and add that function as well, and here then we'll just say
2:26
pages.append(page), okay, so keep that nice and simple. Let's change the order to read from high level to low level, so I'll put it like that.
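Put together, the high-level function might read something like this; it's a sketch based on the walkthrough here, the parameter name is mine, whether the top-5 slice lives inside the function or at the call site is a guess, and the failed-download check echoes the error handling discussed just below (build_page_from_url itself is filled in a bit further on):

```python
def download_transcript_pages(urls):
    # Be kind to the server: only grab the first five pages for now.
    pages = []
    for url in urls[:5]:
        page = build_page_from_url(url)
        if page is None:
            continue  # a page that failed to download just gets skipped
        pages.append(page)
    return pages
```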
2:37
now, over here, we're going to say something totally normal, response = requests.get(url), right, and I guess we could put
2:45
some sort of error handling. I am going to assume this works; in a real app, you put your own error handling here. Maybe we could check if this is None,
2:54
and if it is, we don't add it, something to that effect. Alright, so let's assume this works, we'll say html = response.text,
3:00
that's what we're going to get back. And then we need to do a couple of things: we need to get the title, and we need to get all of these pieces here.
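As a sketch, the download half of build_page_from_url might look like this; the status-code check is my stand-in for the bare-minimum error handling being described, and the parsing half is what comes next:

```python
import requests

def build_page_from_url(url):
    # Download one transcript page and (eventually) parse it into a Page.
    response = requests.get(url)
    if response.status_code != 200:
        # In a real app, put proper error handling here; returning None
        # lets the caller simply skip a page that failed to download.
        return None

    html = response.text
    # Next step: pull the title (the page's single h1) and the transcript
    # segments out of `html` and build Page/Paragraph tuples from them.
    ...
```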
3:10
Now, let me just start with this: regular expressions are not the answer, okay, they are definitely not the answer. So we want to do two things,
3:18
we are going to find over here somewhere that there is an h1 I believe, there we go, so we have our h1, we have our sort of ems,
3:28
our spaces, our new lines, our brs all that kind of stuff, but what we are going to do is we are going to get this h1
3:34
and that is going to be the title of our page, so that seems pretty straightforward, there is only one h1, if I look that's the only one right there,
3:42
so we can just go after that one thing, right? That is how you should design pages, they should have only one h1.
3:50
But how do we do that, how do we get started? Well, this is why we are going to bring Beautiful Soup into action.
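Here is roughly what that title lookup could look like with Beautiful Soup, as a sketch of the idea rather than the course's exact code; it assumes the bs4 package is installed and that the page really does have the single h1 described above:

```python
import bs4

def get_title(html):
    # Parse the downloaded HTML and take the page's one h1 as the title.
    soup = bs4.BeautifulSoup(html, 'html.parser')
    h1 = soup.find('h1')
    return h1.get_text(strip=True) if h1 else None
```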