Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Scraping Talk Python by leveraging sitemaps
0:01
So let's put this into practice. Let's pick a real world example of something that there is no API for, that you might want to go and get,
0:09
and I am going to pick on my own website, so I don't mess with anyone else's. Alright, so here is the deal, for each one of my episodes,
0:16
here is the Talk Python To Me website podcast, each one of the episodes, some of the newer ones don't have them because I am a little behind
0:23
on getting those produced, but let's just go down here to this one, this is a very popular episode, you should totally check this out
0:28
if you haven't heard what John has to say; but notice, there is a full transcript section here,
0:32
and I could even search for entrepreneur and you can see there is like five paragraphs here, right, so maybe I want this data,
0:39
that is available right here, now technically, there is a github repo that has all the transcripts that you could just go get, it's text files,
0:45
but maybe for whatever reason let's just assume you want to get it, through here it's not available and here is the github link,
0:52
let's assume it's not available directly on github, alright, we want to go and find these. Now, each episode you can sort of see the structure up here,
1:00
you could possibly come up with that, but we are going to do something more clever. Now, because my site cares about showing up in search engines,
1:07
yours probably would as well, I created a thing called the site map, and that looks kind of nasty, right, but if you look at the actual way
1:15
it looks without formatting, this has a lot of things that I want the search engines to go find, so you can see I want them to find
1:22
like all these episodes that are listed here, if I go down to get past those, notice there is a whole transcript section, okay,
1:29
so these are all the transcripts that I have created, and what we are going to do is we are actually going to grab this url
1:36
download this xml and it will tell us everywhere in the site we've got to go look,
1:40
to sort of like pull this transcript data, so we are going to do this in several parts,
1:43
we are going to do straight xml against the site map, and then once we get to these pages, which are HTML, we are going to break into screen scraping mode,
1:51
okay, so first of all, let's go over here, and we'll just create a new project,
1:57
we'll say talk_Python_tx_scraper, something like that for the transcripts, right,
2:02
let me just run it, we can get rid of that, let's go and run this so it's set up, you can see it's running Python 3 on our virtual environment,
2:09
and I'll give it the site map url, this is what I had copied there. Okay, so what we are going to do is we are going to download this,
2:18
and in order to download we are going to use requests, and then we are going to parse the xml, so then let's go and define a main method,
2:30
like this and at the bottom we'll go and call it. Okay, so let's go over here and start this process,
2:42
so what we are going to do is we are going to need to download this,
2:45
and this should be old hat, you should be totally comfortable doing this by now, so we are going to do our get and we'll go ahead
2:53
and just to be clear, in case people's networks are off or something, if there is not a 200, something is wrong, we'll print cannot get site map,
3:01
and here we'll just do response.status_code, response.text if there is any. And of course, we'll bail, okay, so now that we have it,
3:09
we can parse the xml into a DOM, so all we are going to need to do, is go ElementTree fromstring and we want to give it the response.text,
3:17
and let's just take a moment to see that we got everything working okay so far,
3:22
let's run this, it takes a moment, and boom, there is an element right here. Now, notice this name space, if we go back and look at this,
3:30
up here at the very top, this namespace turns out to make our xpath queries kind of not so fun, and instead of trying to worry
3:37
about whether we want to, like, set up the namespaces correctly, we want to keep this simple, so we are just going to say response.text.replace,
3:44
and it's going to drop that namespace, I am going to put like nothing in here, okay, so now if we run it, we should just get, oh of course,
3:51
if I actually parse the correct text, the updated text, we should get something that is just the urlset, okay, so back here again, for the urlset,
3:59
in the xpath queries we don't say the name here, we just say the subsequent thing,
4:03
so the path will be url/location, and so we should be able to really simply not print,
4:08
but let's do this, let's say tx_URLs and we'll just do a list comprehension here, okay, so we want to say, we want to get a node,
4:18
so we are going to say n.text, for n in dom and then findall, and we are going to say I would like to find a url/location.
4:28
So url/loc and then we want to have it only in the case where the word transcript, let me go down the few of these, transcript appears,
4:38
so let's say /episodes/transcript, so we'll say if n.text.find of this is greater than 0,
4:51
and let's print the tx URLs, okay, I am not convinced this is going to work,
4:56
but let's give it a shot, look at that, there they all are, all in one enormous line, but if I click this, yeah, that is the transcript page, okay,
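Here is a self-contained sketch of the parsing half of this step, run against a tiny inline sitemap instead of the live response.text so you can try it offline. The episode slugs in the sample XML are made up for illustration; the namespace string is the standard sitemaps.org one you'd see at the top of the real file.

```python
from xml.etree import ElementTree

# A tiny stand-in for the sitemap we just downloaded (slugs are invented).
raw = ('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
       '<url><loc>https://talkpython.fm/episodes/show/100/example</loc></url>'
       '<url><loc>https://talkpython.fm/episodes/transcript/100/example</loc></url>'
       '<url><loc>https://talkpython.fm/episodes/transcript/99/another</loc></url>'
       '</urlset>')

# Drop the default namespace before parsing, so the xpath stays simple.
text = raw.replace('xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"', '')
dom = ElementTree.fromstring(text)

# findall paths are relative to the root <urlset>, so we start at 'url',
# not 'urlset/url'. str.find returns -1 when the substring is missing,
# so > 0 keeps only the transcript links.
tx_urls = [n.text for n in dom.findall('url/loc')
           if n.text.find('/episodes/transcript') > 0]
print(tx_urls)
```

Without the replace, every query would need the namespace spelled out, like `dom.findall('{http://www.sitemaps.org/schemas/sitemap/0.9}url/...')`, which is exactly the "not so fun" part we're sidestepping.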
5:06
so that was step one, we were able to actually leverage the site map
5:10
which a lot of sites have to sort of shortcut a lot of really digging through the data
5:17
and the HTML, all the stuff that is brittle; the site maps are way less brittle. So, this gets us to where we actually have the URLs,
5:25
and let's actually factor this in a better way, create a method out of this,
5:28
I'll call this- it doesn't like it because of the return, let me just do this,
5:34
I'll say get, tx_URLs=get_transcript_URLs, like so, and that is going to be here,
5:44
and I'll say return nothing if we have an error, otherwise, we'll return tx_URLs, and let's change this, okay,
5:56
so now we've finished this task of getting the transcript URLs, let me just run it one more time and make sure it's all hanging together,
6:02
and it is, the next thing that we are going to need to do is parse each one of those, and get the transcript data out.
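The refactored version might look roughly like this. The sitemap URL is my assumption of where the file lives, and the helper split (`get_transcript_urls` for the HTTP part, `parse_transcript_urls` for the XML part) is my naming, not necessarily the course's; the error path returns None just as described above.

```python
import requests
from xml.etree import ElementTree

SITEMAP_URL = 'https://talkpython.fm/sitemap.xml'  # assumed location


def get_transcript_urls(url: str = SITEMAP_URL):
    """Download the sitemap and return the transcript URLs, or None on error."""
    resp = requests.get(url)
    if resp.status_code != 200:
        print('Cannot get sitemap:', resp.status_code, resp.text)
        return None
    return parse_transcript_urls(resp.text)


def parse_transcript_urls(xml_text: str):
    # Strip the default namespace so plain xpath paths like 'url/loc' work,
    # then keep only the transcript pages.
    xml_text = xml_text.replace(
        'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"', '')
    dom = ElementTree.fromstring(xml_text)
    return [n.text for n in dom.findall('url/loc')
            if n.text and '/episodes/transcript' in n.text]


# Usage: tx_urls = get_transcript_urls()
# A None result means the download failed; otherwise tx_urls is the list of
# transcript page URLs to scrape in the next step.
```

Splitting the network call from the parsing also makes the XML half easy to test without hitting the site at all.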