Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Scraping Talk Python by leveraging sitemaps

0:01 So let's put this into practice. Let's pick a real world example
0:04 of something that there is no API for, that you might want to go and get,
0:08 and I am going to pick on my own website, so I don't mess with anyone else's.
0:12 Alright, so here is the deal, for each one of my episodes,
0:15 here is the Talk Python To Me website podcast, each one of the episodes,
0:19 some of the newer ones don't have them because I am a little behind
0:22 on getting those produced, but let's just go down here to this one,
0:25 this is a very popular episode, you should totally check this out
0:27 if you haven't heard what John has to say; but notice, there is a full transcript section here,
0:31 and I could even search for entrepreneur and you can see
0:35 there is like five paragraphs here, right, so maybe I want this data,
0:38 that is available right here, now technically, there is a github repo
0:41 that has all the transcripts that you could just go get, they're text files,
0:44 but maybe for whatever reason let's just assume you want to get it
0:47 through here, and even though here is the github link,
0:51 let's assume it's not available directly on github,
0:54 alright, we want to go and find these.
0:56 Now, each episode you can sort of see the structure up here,
0:59 you could possibly come up with that, but we are going to do something more clever.
1:03 Now, because my site cares about showing up in search engines,
1:06 yours probably would as well, I created a thing called the site map,
1:11 and that looks kind of nasty, right, but if you look at the actual way
1:14 it looks without formatting, this has a lot of things that I want
1:18 the search engines to go find, so you can see I wanted to find
1:21 like all these episodes that are listed here, if I go down to get past those,
1:25 notice there is a whole transcript section, okay,
1:28 so these are all the transcripts that I have created,
1:31 and what we are going to do is we are actually going to grab this url
1:35 download this xml and it will tell us everywhere in the site we have to go look
1:39 to sort of like pull this transcript data, so we are going to do this in several parts,
1:42 we are going to do straight xml against the site map and then once we get to these pages,
1:46 this is html we are going to break into screen scraping mode,
1:50 okay, so first of all, let's go over here, and we'll just create a new project,
1:56 we'll say talk_python_tx_scraper, something like that for the transcripts, right,
2:01 let me just run it, we can get rid of that, let's go and run this so it's set up,
2:05 you can see it's running Python 3 on our virtual environment,
2:08 and I'll give it the site map url, this is what I had copied there.
2:14 Okay, so what we are going to do is we are going to download this,
2:17 and in order to download we are going to use requests,
2:21 and then we are going to parse the xml, so then let's go and define a main method,
2:29 like this and at the bottom we'll go and call it.
2:35 Okay, so let's go over here and start this process,
2:41 so what we are going to do is we are going to need to download this,
2:44 and this should be old hat, you should be totally comfortable with doing this by now,
2:50 so we are going to do our get and we'll go ahead
2:52 and just to be clear, in case people's networks are off or something,
2:55 if there is not a 200, something is wrong, we'll print cannot get site map,
3:00 and here we'll just do response.status_code, response.text if there is any.
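
For reference, the download-and-check part built up so far looks roughly like the sketch below; the sitemap URL is an assumption standing in for the one copied in the video.

    import requests

    SITEMAP_URL = 'https://talkpython.fm/sitemap.xml'  # assumed location of the sitemap

    def main():
        response = requests.get(SITEMAP_URL)
        if response.status_code != 200:
            print('Cannot get sitemap:', response.status_code, response.text)
            return  # bail, there is nothing more we can do

    if __name__ == '__main__':
        main()
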
3:05 And of course, we'll bail, okay, so now that we have it,
3:08 we can parse the xml into a DOM, so all we are going to need to do,
3:12 is go ElementTree.fromstring and we want to give it the response.text,
3:16 and let's just take a moment to see that we got everything working okay so far,
3:21 let's run this, it takes a moment, and boom, there is an element right here.
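
As a rough sketch of where the code stands, the naive parse is something like this; the URL is an assumption, and the namespace shown in the printed output is the standard sitemap one.

    import requests
    from xml.etree import ElementTree

    response = requests.get('https://talkpython.fm/sitemap.xml')  # assumed URL
    dom = ElementTree.fromstring(response.text)
    print(dom)
    # prints something like:
    # <Element '{http://www.sitemaps.org/schemas/sitemap/0.9}urlset' at 0x10...>
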
3:26 Now, notice this name space, if we go back and look at this,
3:29 up here at the very top, this name space turns out to make our xpath queries
3:33 kind of not so fun, and instead of trying to worry
3:36 about whether we want to like set up the namespaces correctly,
3:40 we want to keep this simple, we are just going to say response.text.replace,
3:43 and it's going to drop that name space, I am going to put like nothing in here,
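
A minimal sketch of that namespace-stripping trick follows; the exact xmlns string is assumed to be the standard sitemap declaration that appears at the top of the document.

    import requests
    from xml.etree import ElementTree

    # Replacing the namespace declaration with nothing means the parsed tags
    # come out as plain urlset/url/loc, which keeps the xpath queries simple.
    namespace = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'  # assumed xmlns
    response = requests.get('https://talkpython.fm/sitemap.xml')  # assumed URL
    text = response.text.replace(namespace, '')
    dom = ElementTree.fromstring(text)
    print(dom)  # now just something like: <Element 'urlset' at 0x10...>
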
3:47 okay, so now if we run it, we should just get, oh of course,
3:50 if I actually parse the correct text, the updated text, we should get something
3:53 that is just the url set, okay, so back here again, with the url set,
3:58 in the xpath queries we don't say that name, we just say the subsequent thing,
4:02 so the path will be url/location, and so we should be able to do this really simply, not print,
4:07 but let's do this, let's say tx_urls and we'll just do a list comprehension here,
4:13 okay, so we want to say, we want to get a node,
4:17 so we are going to say n.text, for n in dom and then findall,
4:25 and we are going to say I would like to find a url/location.
4:27 So url/loc, and then we want to have it only in the case where the word transcript,
4:33 let me go down to a few of these, transcript appears,
4:37 so let's say /episodes/transcript, so we'll say if n.text.find of this is greater than 0
4:50 and let's print the tx urls, okay, I am not convinced this is going to work,
4:55 but let's give it a shot, look at that, there they all are, all in one enormous line,
5:02 but if I click this, yeah, that is the transcript page, okay,
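
Put together, the filtering step looks roughly like the sketch below; 'url/loc' and the '/episodes/transcript' fragment mirror what is shown on screen, but treat them as assumptions about the current sitemap layout.

    # dom is the namespace-free element parsed in the earlier sketch.
    # str.find returns -1 when the fragment is missing, and an index > 0
    # when it appears somewhere after the scheme and host in the URL.
    tx_urls = [
        n.text
        for n in dom.findall('url/loc')
        if n.text.find('/episodes/transcript') > 0
    ]
    print(tx_urls)
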
5:05 so that was step one, we were able to actually leverage the site map
5:09 which a lot of sites have to sort of shortcut a lot of really digging through the data
5:16 and the html, all the stuff that is brittle; the site maps are way less brittle.
5:21 So, this gets us to where we actually have the urls,
5:24 and let's actually factor this in a better way, create a method out of this,
5:27 I'll call this... it doesn't like it because of the return, let me just do this,
5:33 I'll say tx_urls = get_transcript_urls, like so, and that is going to go here,
5:43 and I'll say return nothing if we have an error, otherwise, we'll return tx_urls,
5:51 and let's change this, okay,
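
After the refactoring, the whole script might look roughly like this sketch, with the same assumed URL and namespace string as before.

    import requests
    from xml.etree import ElementTree

    SITEMAP_URL = 'https://talkpython.fm/sitemap.xml'  # assumed URL
    NAMESPACE = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'  # assumed xmlns

    def get_transcript_urls():
        response = requests.get(SITEMAP_URL)
        if response.status_code != 200:
            print('Cannot get sitemap:', response.status_code, response.text)
            return None  # return nothing if we have an error

        # strip the namespace so the xpath below stays simple
        dom = ElementTree.fromstring(response.text.replace(NAMESPACE, ''))
        return [
            n.text
            for n in dom.findall('url/loc')
            if n.text.find('/episodes/transcript') > 0
        ]

    def main():
        tx_urls = get_transcript_urls()
        print(tx_urls)

    if __name__ == '__main__':
        main()
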
5:55 so now we've finished this task of getting the transcript urls,
5:58 let me just run it one more time and make sure it's all hanging together,
6:01 and it is, the next thing that we are going to need to do is
6:04 parse each one of those, and get the transcript data out.