Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Downloading transcript html
Okay, next up, we want to download the actual HTML from each one of these pages.
So, over here, I've created a function called download_transcript_pages. I am going to pass the URLs in, and I want to actually download
the HTML for each one of those and do a little transformation on it. Now, to make that transformation nicer, I want to define two named tuples:
a page and a paragraph. The page is going to contain a list of paragraphs as well as the title and URL,
and the paragraph is going to contain the text of the paragraph and the seconds, the timing. If we go and look over here, at one of these right here,
you can't tell yet, but if you actually view the source of this, which is what our application is going to see,
you can see that the timing is on each one of these. The timing isn't super precise for this stuff,
it's much better for the courses, but basically we have these elements with a transcript segment class, and each one of them has a time,
some text we've got to strip off from the beginning, and then the text that is just the body of the paragraph. So our goal is
to actually parse these out and turn them into a list, so we have this text associated with this time and we can ask questions like
"hey, what was said at this time?" Or, if you want to know where this time is,
we can seek to that second in the audio and just start playing exactly what was said,
okay, so that is going to be our goal: to download this whole page, or a whole bunch of these pages, like a list of these pages,
because we got them all from the site map, and then parse them. Now, I'd just ask that when you do this, be kind,
let's not completely hammer the server. It's not really going to kill it, but let's just get, let's say, the top 5.
Alright, so we are going to download the top 5 pages, this is index 0 up to but not including index 5, and then we'll get them out,
just to keep it a little bit chill here. You don't want to watch it download 75 pages and parse them anyway.
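As a rough sketch of the two named tuples and the top-5 slice described above: the field names and their order are assumptions based on the spoken description, and the example.com URLs, the sample data, and the paragraph_at helper are hypothetical, not taken from the course code.

```python
from collections import namedtuple

# Field names and order are assumptions based on the spoken description.
Page = namedtuple("Page", "url title paragraphs")
Paragraph = namedtuple("Paragraph", "text seconds")

# Hypothetical stand-in for the 75 URLs read from the site map.
tx_urls = [f"https://example.com/transcripts/{n}" for n in range(75)]

# Be kind to the server: index 0 up to, but not including, index 5.
top_five = tx_urls[:5]


def paragraph_at(page, second):
    """Return the last paragraph starting at or before `second` (hypothetical helper).

    Assumes page.paragraphs is ordered by increasing `seconds`.
    """
    spoken = [p for p in page.paragraphs if p.seconds <= second]
    return spoken[-1] if spoken else None


# Made-up sample data, only to show the shape of the tuples.
sample = Page(
    url=top_five[0],
    title="Sample episode",
    paragraphs=[
        Paragraph("Welcome to the show.", 0),
        Paragraph("Our first topic.", 42),
    ],
)
print(paragraph_at(sample, 50).text)
```

Named tuples keep the call sites readable (page.title, paragraph.seconds) without the ceremony of a full class, which is probably why they fit this kind of small transformation.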
Okay, so let's go over here. We are going to do this download, and we'll say for url in tx_URLs, we've got to download and parse that page,
so let's say page = build_page_from_url(url). Let's go ahead and add that function as well, and here then we'll just say
pages.append(page). Okay, so keep that nice and simple. Let's change the order to read from high level to low level, so I'll put it like that.
Now, over here, we're going to say something totally normal: response = requests.get(url). I guess we could put in
some sort of error handling. I am going to assume this works; in a real app, you'd put your own error handling here. Maybe we could check if this is None and,
if so, not add it, something to that effect. Alright, so let's assume this works. We'll say html = response.text,
that's what we're going to get back. Then we need to do a couple of things: we need to get the title, and we need to get all of these pieces here.
Now, let me just start with this: regular expressions are not the answer, okay, they are definitely not the answer. So we want to do two things.
We are going to find over here somewhere that there is an h1, I believe. There we go, so we have our h1, we have our sort of ems,
our spaces, our new lines, our brs, all that kind of stuff. But what we are going to do is get this h1,
and that is going to be the title of our page. That seems pretty straightforward: there is only one h1, if I look, that's the only one right there,
so we can just go after that one thing, right? That is how you should design pages, they should have only one h1.
But how do we do that, how do we get started? Well, this is why we are going to bring Beautiful Soup into action.
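Putting the steps so far together, a minimal sketch might look like the following. It assumes the requests and beautifulsoup4 packages are installed. The fetch_html and parse_title helpers, the timeout value, the Page field order, and the choice to skip pages that fail to download are illustrative assumptions, not the course's actual code.

```python
from collections import namedtuple

# Field order is an assumption; the lecture only names the three fields.
Page = namedtuple("Page", "url title paragraphs")


def fetch_html(url):
    """Download one page; return its HTML text, or None on any request error."""
    import requests  # third-party; imported here so the sketch loads without it

    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
    except requests.RequestException:
        return None
    return response.text


def parse_title(html):
    """Return the text of the first <h1>, or None if the page has none."""
    from bs4 import BeautifulSoup  # third-party (beautifulsoup4)

    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    return h1.get_text(strip=True) if h1 else None


def build_page_from_url(url, fetch=fetch_html, get_title=parse_title):
    """Download one URL and wrap the result in a Page, or return None on failure."""
    html = fetch(url)
    if html is None:
        return None
    # Paragraph parsing has not been covered yet, so the list starts empty.
    return Page(url=url, title=get_title(html), paragraphs=[])


def download_transcript_pages(urls, build_page=build_page_from_url):
    """High-level entry point: build a Page per URL, skipping failed downloads."""
    pages = []
    for url in urls:
        page = build_page(url)
        if page is not None:
            pages.append(page)
    return pages
```

The fetch and get_title parameters exist only so the flow can be tried without touching the network, for example by passing a function that returns a fixed HTML string.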