Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Scraping Talk Python by leveraging sitemaps
So let's put this into practice. Let's pick a real world example of something that there is no API for, that you might want to go and get,
and I am going to pick on my own website, so I don't mess with anyone else's. Alright, so here is the deal, for each one of my episodes,
(this is the Talk Python To Me podcast website) there is a transcript. Some of the newer ones don't have them yet because I am a little behind
on getting those produced, but let's just go down here to this one. This is a very popular episode, you should totally check this out
if you haven't heard what John has to say; but notice, there is a full transcript section here,
and I could even search for "entrepreneur" and you can see there are like five paragraphs here, right? So maybe I want this data
that is available right here. Now technically, there is a GitHub repo that has all the transcripts, and you could just go get them, they are text files
(here is the GitHub link), but for whatever reason, let's just assume they are not available directly on GitHub and you want to get them through here.
Alright, we want to go and find these. Now, each episode you can sort of see the structure up here,
you could possibly come up with that, but we are going to do something more clever. Now, because my site cares about showing up in search engines,
(and yours probably would as well), I created a thing called a site map. That looks kind of nasty, right, but if you look at the actual way
it looks without formatting, this has a lot of things that I want the search engines to go find, so you can see I wanted them to find
all these episodes that are listed here, and if I go down past those, notice there is a whole transcript section. Okay,
so these are all the transcripts that I have created, and what we are going to do is we are actually going to grab this URL,
download this XML, and it will tell us everywhere in the site we have to go look
to pull this transcript data. So we are going to do this in several parts:
we are going to do straight XML against the site map, and then once we get to these pages, which are HTML, we are going to break into screen scraping mode,
okay, so first of all, let's go over here, and we'll just create a new project,
we'll say talk_python_tx_scraper, something like that for the transcripts, right,
let me just run it, we can get rid of that, let's go and run this so it's set up, you can see it's running Python 3 in our virtual environment,
and I'll give it the site map url, this is what I had copied there. Okay, so what we are going to do is we are going to download this,
and in order to download we are going to use requests, and then we are going to parse the xml, so then let's go and define a main method,
like this and at the bottom we'll go and call it. Okay, so let's go over here and start this process,
so what we are going to do is we are going to need to download this,
and this should be old hat, and you should be totally comfortable doing this by now, so we are going to do our get, and we'll go ahead
and, just to be careful in case people's network is off or something, check the status code: if it is not 200, something is wrong, so we'll print "cannot get site map",
and here we'll just print response.status_code and response.text, if there is any. And of course, we'll bail. Okay, so now that we have it,
we can parse the XML into a DOM, so all we need to do is call ElementTree.fromstring and give it the response.text,
and let's just take a moment to see that we got everything working okay so far,
let's run this, it takes a moment, and boom, there is an element right here. Now, notice this namespace; if we go back and look at this,
up here at the very top, this namespace turns out to make our XPath queries kind of not so fun, and instead of worrying
about whether we have set up the namespaces correctly, we want to keep this simple, so we are just going to say response.text.replace,
and it's going to drop that namespace; I am going to put nothing in here as the replacement. Okay, so now if we run it, we should just get, oh of course,
if I actually parse the correct text, the updated text, we should get something that is just the urlset. Okay, so back here again, the urlset is the root,
so in the XPath queries we don't say its name, we just say the subsequent thing,
so the path will be url/loc, and so we should be able to really simply, not print,
but let's do this: let's say tx_urls and we'll just do a list comprehension here. Okay, so we want to get the text of each node,
so we are going to say n.text for n in dom.findall, and what I would like to find is url/loc.
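The namespace workaround and the url/loc query described above can be sketched offline as follows. This is only an illustration under stated assumptions: the sample XML and its example.com URLs are made up, and the namespace string assumes the site map declares the standard sitemaps.org namespace on its root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for response.text; the real site map would be downloaded.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/episodes/show/1/intro</loc></url>
  <url><loc>https://example.com/episodes/transcript/1/intro</loc></url>
</urlset>"""

# Assumed: the standard sitemap namespace declaration, exactly as written on the root.
NAMESPACE_DECL = ' xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'

# Parsed as-is, every tag name is prefixed with the namespace in braces,
# so a plain "url/loc" path matches nothing.
namespaced_root = ET.fromstring(SAMPLE_SITEMAP)

# Dropping the declaration from the text first gives plain tag names.
dom = ET.fromstring(SAMPLE_SITEMAP.replace(NAMESPACE_DECL, ""))

# The root <urlset> is implicit; the path starts at its children.
all_urls = [n.text for n in dom.findall("url/loc")]

print(namespaced_root.tag)
print(dom.tag)
print(all_urls)
```

ElementTree's findall can also take a namespaces mapping as a second argument, which avoids editing the text; the string replacement is simply the shortcut taken in this lecture.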
So url/loc, and then we want to keep it only in the case where the word transcript appears (let me go down a few of these to check),
so let's say /episodes/transcript, and we'll say if n.text.find of this is greater than 0.
And let's print the tx_urls. Okay, I am not convinced this is going to work,
but let's give it a shot. Look at that, there they all are, all in one enormous line, but if I click this, yeah, that is the transcript page. Okay,
so that was step one, we were able to actually leverage the site map
which a lot of sites have to sort of shortcut a lot of digging through the data
and the HTML, all the stuff that is brittle; the site maps are way less brittle. So, this gets us to where we actually have the URLs,
and let's actually factor this in a better way, create a method out of this,
I'll call this... it doesn't like it because it returned, so let me just do this:
I'll say tx_urls = get_transcript_urls, like so, and that is going to be here,
and I'll say return nothing if we have an error, otherwise, we'll return tx_urls, and let's change this. Okay,
so now we've finished this task of getting the transcript URLs; let me just run it one more time and make sure it's all hanging together,
and it is. The next thing that we are going to need to do is parse each one of those, and get the transcript data out.
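Putting the steps from this lecture together, the refactored script might look roughly like the sketch below. It is not the exact code from the video: the SITEMAP_URL value is a placeholder for the URL copied from the browser, the helper name parse_transcript_urls is hypothetical, returning an empty list on error is one reading of "return nothing", and the filter uses `in` rather than `find(...) > 0`. The namespace string again assumes the standard sitemaps.org declaration.

```python
import xml.etree.ElementTree as ET

# Placeholder: substitute the real site map URL copied from the browser.
SITEMAP_URL = "https://example.com/sitemap.xml"
# Assumed: the standard sitemap namespace declaration on the root element.
NAMESPACE_DECL = ' xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
# Path fragment from the lecture that marks a transcript page.
TRANSCRIPT_MARKER = "/episodes/transcript"


def parse_transcript_urls(xml_text):
    """Return the transcript URLs listed in the site map XML text."""
    dom = ET.fromstring(xml_text.replace(NAMESPACE_DECL, ""))
    return [
        n.text.strip()
        for n in dom.findall("url/loc")
        if n.text and TRANSCRIPT_MARKER in n.text
    ]


def get_transcript_urls(sitemap_url):
    """Download the site map and return its transcript URLs ([] on error)."""
    # Third-party package; imported here so the parser above works without it.
    import requests

    response = requests.get(sitemap_url, timeout=10)
    if response.status_code != 200:
        print("Cannot get site map:", response.status_code, response.text)
        return []
    return parse_transcript_urls(response.text)


def main():
    tx_urls = get_transcript_urls(SITEMAP_URL)
    print(tx_urls)


# Offline check of the parsing step with made-up URLs (no network needed).
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/episodes/show/1/intro</loc></url>"
    "<url><loc>https://example.com/episodes/transcript/1/intro</loc></url>"
    "</urlset>"
)
print(parse_transcript_urls(sample))
```

Calling main() at the bottom of the script, as described in the lecture, would run the download against SITEMAP_URL. Keeping the parsing in its own function means the filter can be checked against a small XML string without touching the network.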