Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Downloading transcript html
0:01
Okay, next up, we want to download the actual HTML from each one of these pages.
0:07
So, over here, I've created a function called download_transcript_pages, and I am going to pass the URLs in, and I want to actually download
0:15
the HTML for each one of those and do a little transformation on it. Now to make that transformation nicer, I want to define two named tuples,
0:23
I want to define a Page and a Paragraph, and the Page is going to contain a list of paragraphs as well as the title and URL,
0:32
and the Paragraph is going to contain the text of the paragraph and the seconds, the timing.
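A minimal sketch of what those two named tuples might look like; the exact field names and ordering here are my assumptions based on the description above, not necessarily the course's code:

```python
import collections

# Assumed shapes: a Page holds its title, URL, and list of Paragraphs;
# a Paragraph holds its text and its timing in seconds.
Page = collections.namedtuple('Page', 'title url paragraphs')
Paragraph = collections.namedtuple('Paragraph', 'text seconds')
```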
0:42
If we go and look over here at one of these right here, you can't tell yet, but if you actually view the source of this, this is what our application is going to see,
0:52
you can see that the timing here is on each one of these; the timing isn't super precise for this stuff,
0:59
it's much better for the courses, but basically we have these classes that are transcript segments and each one of them has a time
1:05
and some text we've got to strip off from the beginning, as well as some text that is just the body of the paragraph. Okay, so our goal would be
1:15
to actually parse these out and turn this into like a list, so we have this text associated with this time and we can ask questions like
1:23
hey, what was said at this time, or if you want to know where this time is,
1:27
we get to seek to that second in the audio and just start playing exactly what this is.
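To make that concrete, here is a rough sketch of the kind of lookup we could do once a page is parsed into those named tuples; the helper name and the "last paragraph at or before this time" rule are my own assumptions:

```python
def paragraph_at(page, seconds):
    # Hypothetical helper: assumes page.paragraphs is ordered by its
    # seconds field, and returns the last paragraph that starts at or
    # before the requested time.
    match = None
    for paragraph in page.paragraphs:
        if paragraph.seconds > seconds:
            break
        match = paragraph
    return match
```

So something like paragraph_at(page, 95) would answer "what was being said at 1:35?"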
1:31
Okay, so that is going to be our goal: to download this whole page, or a whole bunch of these pages, like a list of these pages,
1:39
because we got them all from the sitemap, download these and parse them. Now, I just ask, when you do this, be kind,
1:45
let's not just completely hammer the server, it's not really going to kill it, but let's just get, let's say, the top 5.
1:51
Alright, so we are going to download the top 5 pages, this is 0 up to but not including index 5, and then we'll get them out,
1:57
so just to keep it a little bit chill here. Okay, you guys don't want to watch it download 75 pages and parse it anyway.
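Just to spell that slice out with throwaway data, since this is exactly the Python slicing rule being described:

```python
urls = ['u0', 'u1', 'u2', 'u3', 'u4', 'u5', 'u6']
top_five = urls[:5]    # index 0 up to, but not including, index 5
print(top_five)        # ['u0', 'u1', 'u2', 'u3', 'u4']
```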
2:03
Okay, so let's go over here, we are going to do this download and we'll say for url in tx_URLs, we've got to download and parse that page,
2:12
okay, so let's say page = build_page_from_url(url), let's go ahead and add that function as well, and here then we'll just say
2:26
pages.append(page), okay, so keep that nice and simple. Let's change the order to read from high level to low level, so I'll put it like that.
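Put together, the high-level function might read something like this; it's a sketch based on the walkthrough here, the parameter name is mine, whether the top-5 slice lives inside the function or at the call site is a guess, and the failed-download check echoes the error handling discussed just below (build_page_from_url itself is filled in a bit further on):

```python
def download_transcript_pages(urls):
    # Be kind to the server: only grab the first five pages for now.
    pages = []
    for url in urls[:5]:
        page = build_page_from_url(url)
        if page is None:
            continue  # a page that failed to download just gets skipped
        pages.append(page)
    return pages
```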
2:37
now, over here, we're going to say something totally normal, response = requests.get(url), right, and I guess we could put
2:45
some sort of error handling. I am going to assume this works; in a real app, you put your own error handling here. Maybe we could check if this is None,
2:54
and if it is, we don't add it, something to that effect. Alright, so let's assume this works, we'll say html = response.text,
3:00
that's what we're going to get back. And then we need to do a couple of things: we need to get the title, and we need to get all of these pieces here.
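As a sketch, the download half of build_page_from_url might look like this; the status-code check is my stand-in for the bare-minimum error handling being described, and the parsing half is what comes next:

```python
import requests

def build_page_from_url(url):
    # Download one transcript page and (eventually) parse it into a Page.
    response = requests.get(url)
    if response.status_code != 200:
        # In a real app, put proper error handling here; returning None
        # lets the caller simply skip a page that failed to download.
        return None

    html = response.text
    # Next step: pull the title (the page's single h1) and the transcript
    # segments out of `html` and build Page/Paragraph tuples from them.
    ...
```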
3:10
Now, let me just start with this: regular expressions are not the answer, okay, they are definitely not the answer. So we want to do two things,
3:18
we are going to find over here somewhere that there is an h1 I believe, there we go, so we have our h1, we have our sort of ems,
3:28
our spaces, our new lines, our brs all that kind of stuff, but what we are going to do is we are going to get this h1
3:34
and that is going to be the title of our page, so that seems pretty straightforward, there is only one h1, if I look that's the only one right there,
3:42
so we can just go after that one thing, right? That is how you should design pages, they should have only one h1.
3:50
But how do we do that, how do we get started? Well, this is why we are going to bring Beautiful Soup into action.
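Here is roughly what that title lookup could look like with Beautiful Soup, as a sketch of the idea rather than the course's exact code; it assumes the bs4 package is installed and that the page really does have the single h1 described above:

```python
import bs4

def get_title(html):
    # Parse the downloaded HTML and take the page's one h1 as the title.
    soup = bs4.BeautifulSoup(html, 'html.parser')
    h1 = soup.find('h1')
    return h1.get_text(strip=True) if h1 else None
```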