Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Scraping Talk Python by leveraging sitemaps
0:01
So let's put this into practice. Let's pick a real world example
0:04
of something that there is no API for, that you might want to go and get,
0:08
and I am going to pick on my own website, so I don't mess with anyone else's.
0:12
Alright, so here is the deal, for each one of my episodes,
0:15
here is the Talk Python To Me website podcast, each one of the episodes,
0:19
some of the newer ones don't have them because I am a little behind
0:22
on getting those produced, but let's just go down here to this one,
0:25
this is a very popular episode, you should totally check this out
0:27
if you haven't heard what John has to say; but notice, there is a full transcript section here,
0:31
and I could even search for entrepreneur and you can see
0:35
there is like five paragraphs here, right, so maybe I want this data,
0:38
that is available right here. Now technically, there is a GitHub repo
0:41
that has all the transcripts, you could just go get them, they're text files,
0:44
but maybe for whatever reason, let's just assume you want to get it
0:47
through here and it's not available there, and here is the GitHub link,
0:51
let's assume it's not available directly on GitHub,
0:54
alright, we want to go and find these.
0:56
Now, each episode you can sort of see the structure up here,
0:59
you could possibly come up with that, but we are going to do something more clever.
1:03
Now, because my site cares about showing up in search engines,
1:06
yours probably would as well, I created a thing called the site map,
1:11
and that looks kind of nasty, right, but if you look at the actual way
1:14
it looks without formatting, this has a lot of things that I want
1:18
the search engines to go find, so you can see I wanted them to find
1:21
like all these episodes that are listed here, if I go down to get past those,
1:25
notice there is a whole transcript section, okay,
1:28
so these are all the transcripts that I have created,
1:31
and what we are going to do is we are actually going to grab this url,
1:35
download this xml, and it will tell us everywhere in the site we've got to go look
1:39
to sort of pull this transcript data, so we are going to do this in several parts,
1:42
we are going to do straight xml against the site map and then once we get to these pages,
1:46
this is html we are going to break into screen scraping mode,
1:50
okay, so first of all, let's go over here, and we'll just create a new project,
1:56
we'll say talk_python_tx_scraper, something like that for the transcripts, right,
2:01
let me just run it, we can get rid of that, let's go and run this so it's set up,
2:05
you can see it's running Python 3 on our virtual environment,
2:08
and I'll give it the site map url, this is what I had copied there.
2:14
Okay, so what we are going to do is we are going to download this,
2:17
and in order to download we are going to use requests,
2:21
and then we are going to parse the xml, so then let's go and define a main method,
2:29
like this and at the bottom we'll go and call it.
2:35
Okay, so let's go over here and start this process,
2:41
so what we are going to do is we are going to need to download this,
2:44
and this should be old hat, you should be totally comfortable doing this by now,
2:50
so we are going to do our get and we'll go ahead
2:52
and just to be clear, in case people's networks are off or something,
2:55
if it's not 200, something is wrong, we'll print "cannot get site map",
3:00
and here we'll just do response.status_code, response.text if there is any.
3:05
And of course, we'll bail, okay, so now that we have it,
3:08
we can parse the xml into a DOM, so all we are going to need to do,
3:12
is call ElementTree.fromstring and give it the response.text,
3:16
and let's just take a moment to see that we got everything working okay so far,
3:21
let's run this, it takes a moment, and boom, there is an element right here.
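The download-and-parse step described so far might look roughly like the sketch below. The sitemap URL, the function names, and the tiny sample document are assumptions for illustration; the lecture only says the URL was copied from the browser and does not show the code in this transcript.

```python
import xml.etree.ElementTree as ET

# Assumed sitemap location; the transcript does not spell out the URL.
SITEMAP_URL = "https://talkpython.fm/sitemap.xml"


def download_sitemap(url):
    """Return the sitemap XML as text, or None if the request failed."""
    # Imported here so the parsing half below runs even without requests installed.
    import requests

    response = requests.get(url)
    if response.status_code != 200:
        print("Cannot get site map: {} {}".format(response.status_code, response.text))
        return None
    return response.text


def parse_sitemap(xml_text):
    """Parse the XML text into an ElementTree element (the root of the DOM)."""
    return ET.fromstring(xml_text)


# Offline demonstration with a hand-written, made-up sitemap fragment.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/episodes/show/1/sample-episode</loc></url>"
    "</urlset>"
)
dom = parse_sitemap(sample)
print(dom.tag)  # {http://www.sitemaps.org/schemas/sitemap/0.9}urlset
```

Against the live site you would call parse_sitemap(download_sitemap(SITEMAP_URL)) after checking for None. Notice that the namespace is baked into the tag name of the parsed element.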
3:26
Now, notice this name space, if we go back and look at this,
3:29
up here at the very top, this name space turns out to make our xpath queries
3:33
kind of not so fun, and instead of trying to worry
3:36
about whether we've set up the name spaces correctly,
3:40
we want to keep this simple, we are just going to say response.text.replace,
3:43
and it's going to drop that name space, I am going to put like nothing in here,
3:47
okay, so now if we'd run it, we should just get, oh of course,
3:50
if I actually parse the correct text, the updated text we should get something
3:53
that is just the url set. Okay, so back here again, for the url set,
3:58
in the xpath queries we don't say the name here, we just say the subsequent thing,
4:02
so the path will be url/location, and so we should be able to really simply not print,
4:07
but let's do this, let's say tx_urls and we'll just do a list comprehension here,
4:13
okay, so we want to say, we want to get a node,
4:17
so we are going to say n.text for n in dom.findall,
4:25
and what I would like to find is url/location.
4:27
So url/loc and then we want to have it only in the case where the word transcript,
4:33
let me go down the few of these, transcript appears,
4:37
so let's say /episodes/transcript, so we'll say if n.text.find this is greater than 0
4:50
and let's print the tx urls, okay, I am not convinced this is going to work,
4:55
but let's give it a shot, look at that, there they all are, all in one enormous line,
5:02
but if I click this, yeah, that is the transcript page, okay,
5:05
so that was step one, we were able to actually leverage the site map
5:09
which a lot of sites have, to sort of shortcut a lot of the digging through the data
5:16
and the html, all the stuff that is brittle, the site maps are way less brittle.
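As a minimal sketch of the two steps just described, here is the namespace being dropped with a plain string replace, followed by the list comprehension over url/loc. The sample sitemap and its example.com URLs are made up so the snippet runs offline; the real sitemap is much larger.

```python
import xml.etree.ElementTree as ET

# Made-up sitemap fragment standing in for the downloaded response.text.
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/episodes/show/1/sample-episode</loc></url>"
    "<url><loc>https://example.com/episodes/transcript/1/sample-episode</loc></url>"
    "</urlset>"
)

# Drop the default namespace so the path queries can use bare tag names.
text = SAMPLE_SITEMAP.replace('xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"', "")
dom = ET.fromstring(text)
print(dom.tag)  # urlset

# The root (urlset) is not named in the path; start at its children.
# find() > 0 mirrors the lecture; "'/episodes/transcript' in n.text" is the
# more common spelling and gives the same result for these URLs.
tx_urls = [
    n.text
    for n in dom.findall("url/loc")
    if n.text.find("/episodes/transcript") > 0
]
print(tx_urls)  # ['https://example.com/episodes/transcript/1/sample-episode']
```

The string replace is a shortcut that keeps things simple; ElementTree can also take a namespace mapping in findall if you would rather leave the XML untouched.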
5:21
So, this gets us to where we actually have the urls,
5:24
and let's actually factor this in a better way, create a method out of this,
5:27
I'll call this- it doesn't like it because it returned, let me just do this,
5:33
I'll say tx_urls = get_transcript_urls(), like so, and that is going to be here,
5:43
and I'll say return nothing if we have an error, otherwise, we'll return tx_urls,
5:51
and let's change this, okay,
5:55
so now we've finished this task of getting the transcript urls,
5:58
let me just run it one more time and make sure it's all hanging together,
6:01
and it is, the next thing that we are going to need to do is
6:04
parse each one of those, and get the transcript data out.
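Putting it together, the refactored version could look something like this sketch. The name get_transcript_urls comes from the lecture; the extract_transcript_urls helper, the sitemap_url parameter, and returning an empty list on error are assumptions made here so the parsing part can be exercised without a network connection.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'


def extract_transcript_urls(xml_text):
    """Hypothetical helper: pull the transcript page URLs out of sitemap XML."""
    dom = ET.fromstring(xml_text.replace(SITEMAP_NS, ""))
    return [
        n.text
        for n in dom.findall("url/loc")
        if n.text.find("/episodes/transcript") > 0
    ]


def get_transcript_urls(sitemap_url):
    """Download the sitemap and return the transcript URLs ([] on error)."""
    # Imported here so the helper above can be tried without requests installed.
    import requests

    response = requests.get(sitemap_url)
    if response.status_code != 200:
        print("Cannot get site map: {} {}".format(response.status_code, response.text))
        return []  # assumption: "return nothing" is modeled as an empty list
    return extract_transcript_urls(response.text)


def main(sitemap_url):
    tx_urls = get_transcript_urls(sitemap_url)
    print(tx_urls)


# Offline check of the parsing half with a made-up sitemap fragment.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/episodes/show/2/another-episode</loc></url>"
    "<url><loc>https://example.com/episodes/transcript/2/another-episode</loc></url>"
    "</urlset>"
)
found = extract_transcript_urls(sample)
print(found)
```

Calling main with the real sitemap URL performs the download; keeping the parsing in its own function means the next step, fetching each transcript page, only has to deal with a plain list of URLs.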