Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: What is screen scraping and web scraping?
0:02 Now we've made our way in the course to screen scraping.
0:06 And, the little subtitle I've added here is
0:09 these are for the sites with missing services.
0:12 So, what is screen scraping? Well, if there is an API,
0:16 we've seen there is a specific endpoint url that we can call
0:19 and things like json or xml or even soap data come back.
0:23 But a lot of data is out there, the majority of data is out there with no API behind it,
0:27 so imagine we want to learn about stuff on the python.org homepage,
0:32 now there is probably an rss feed or something
0:35 where we can get this information, but let's assume it doesn't exist,
0:39 let's assume only this web page contains the information that we need,
0:42 and our goal is to know which current versions of Python
0:46 are available for download. Right here, you can see the two buttons,
0:49 like I can download Python 3.6.0 or I could download Python 2.7.13,
0:53 so if we want to use screen scraping to get this,
0:56 what we do is, just like before, issue an http get to that url,
1:00 and what we get back is not some nice structured data but probably malformed,
1:05 almost certainly malformed html,
1:08 however, html does have some structure that we can work with,
1:12 notice that we have a paragraph, it has a class called download-buttons,
1:16 and in there, there are some hyperlinks with text inside like download Python 3.6.0,
1:19 download Python 2.7.13 so we can feed that to an html parser
1:24 which can deal with the malformed parts,
1:28 because it's not usually exact xhtml,
1:33 and even html 5 doesn't necessarily parse as, say, straight xml,
1:37 so you've got to do a little work to parse that, load it into a dom,
1:41 and then we can use this in our app, we can query this data,
1:46 either by navigating the hierarchy or even using css,
1:49 so I could easily write a css selector, say .download-buttons a,
1:56 and that would give me back two elements,
1:59 and those would be the two download links,
2:02 and the links would contain the actual url to download
2:04 as well as the text, on which I could do some kind of work,
2:07 some kind of string search, to figure out what the details are.
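To make that workflow concrete, here is a minimal sketch of the parse-and-query step. The lecture doesn't name a parser at this point, so this one swaps in the standard library's html.parser as a stand-in for a css selector like `.download-buttons a`. The names DownloadLinkParser, find_download_links and SAMPLE_HTML, and the hrefs in the sample markup, are all made up for illustration. In a real run the html string would come from the http get described above.

```python
from html.parser import HTMLParser

# Hypothetical markup shaped like the snippet described in the lecture.
# The href values are invented for this example.
SAMPLE_HTML = """
<div>
  <p class="download-buttons">
    <a class="button" href="/ftp/python/3.6.0/">Download Python 3.6.0</a>
    <a class="button" href="/ftp/python/2.7.13/">Download Python 2.7.13</a>
  </p>
  <p><a href="/other/">Not a download link</a></p>
</div>
"""


class DownloadLinkParser(HTMLParser):
    """Collect (href, text) for each <a> inside the 'download-buttons' element.

    A rough stand-in for the css selector '.download-buttons a'. It does not
    handle a container nested inside another element of the same tag name.
    """

    def __init__(self):
        super().__init__()
        self.links = []         # finished (href, text) pairs
        self._container = None  # tag name of the download-buttons element
        self._href = None       # href of the <a> currently being read
        self._text = []         # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._container is None:
            # Look for the element that carries the download-buttons class.
            if "download-buttons" in (attrs.get("class") or "").split():
                self._container = tag
        elif tag == "a":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._container is None:
            return
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None
        elif tag == self._container:
            self._container = None


def find_download_links(html):
    """Return a list of (href, text) pairs for the download links."""
    parser = DownloadLinkParser()
    parser.feed(html)
    return parser.links


for href, text in find_download_links(SAMPLE_HTML):
    print(href, text)
```

The third link in the sample sits outside the download-buttons paragraph, so it is skipped, which is the same filtering the css selector would do.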
2:10 So that is how screen scraping works, that is the screen scraping workflow,
2:14 so, it's surprisingly easy, surprisingly effective,
2:18 however, there are some rules that you should keep in mind.
2:21 Basically, try not to rock the boat, be a good citizen,
2:25 know the terms and conditions for the site.
2:29 Many of these sites have things saying
2:31 basically you can't just do random screen scraping and consume their data,
2:35 it is their data after all, so there is what you can do with screen scraping
2:40 and there is what you can legally do with screen scraping
2:43 and then there is kind of what you should do, and so,
2:46 be sure that you are on good terms. Somebody wanted to work with my data,
2:49 my transcript data, off of my website in a live fashion,
2:52 not out of the github repo that I have, and they sent me messages
2:55 and said hey Michael, do you mind if I screen scrape your source
2:58 for some, like, data science analysis of the transcripts? No, not at all, I don't mind,
3:03 and, I gave them permission to do it, and it's great,
3:06 they will probably do something like what I am going to show you here.
3:09 But, consider asking and getting permission if it's not allowed
3:11 or at least check the terms and conditions.
3:15 Also, be aware that your scraping code will break.
3:18 If you get an email from a site that you've been doing screen scraping against,
3:22 and they are like "big news, we've redesigned our site,
3:25 it's beautiful", you can just think: okay, you just broke my code,
3:29 because even with what I was just describing before, the fact
3:32 that the thing that contains the buttons had the class download-buttons
3:36 and it was hyperlinks that were the actual things that we're after,
3:40 if something about that changes, like they change that class
3:43 or the links become actual buttons, not hyperlinks, right, broken.
3:46 So, little changes to the layout will break your code, it's not usually too hard to fix,
3:51 you want to isolate that stuff off into one or two functions,
3:54 but just be aware that these things need care and feeding because of this.
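One way to picture that isolation, as a rough sketch: keep everything that knows about the page layout in a single function, and have it fail loudly when the layout no longer matches. The names current_versions and LayoutChangedError are hypothetical, and the regular expression is only an illustrative stand-in for a real parser query.

```python
import re

# The only place that knows what the page looks like. If the site is
# redesigned, this pattern (an illustrative stand-in for a parser query)
# is the one thing that should need fixing.
_DOWNLOAD_LINK = re.compile(r"<a[^>]*>\s*Download Python ([\d.]+)\s*</a>")


class LayoutChangedError(Exception):
    """Raised when the page no longer has the structure we scrape."""


def current_versions(html):
    """Return the version strings found in the download links."""
    versions = _DOWNLOAD_LINK.findall(html)
    if not versions:
        # Fail loudly instead of silently returning nothing.
        raise LayoutChangedError(
            "no download links found; has the page been redesigned?"
        )
    return versions


page = (
    '<p class="download-buttons">'
    '<a href="#">Download Python 3.6.0</a> '
    '<a href="#">Download Python 2.7.13</a>'
    "</p>"
)
print(current_versions(page))
```

The rest of the app only calls current_versions, so a redesign means touching one function rather than hunting through the whole code base.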
3:57 The resulting data that comes back is going to look somewhat nasty,
4:02 if you look at the html a lot of times there will be extra line breaks,
4:06 there will be new line characters interspersed in there, and so on,
4:09 so you are going to have to do a little bit of work to take the values you pulled out
4:13 and actually clean them up, especially the raw text in between the tags,
4:17 like the download Python 3.6.0 text, that might come back
4:22 with lots of junk around it.
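As a small sketch of that cleanup, assuming the text came back padded with newlines and indentation: collapse the whitespace first, then do the string search for the version number. The function names clean_text and extract_version and the sample string are made up for this example.

```python
import re


def clean_text(raw):
    """Collapse runs of whitespace, including newlines, into single spaces."""
    return " ".join(raw.split())


def extract_version(text):
    """Return the first dotted version number in the text, or None."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    return match.group(0) if match else None


# Hypothetical raw text as it might come out of the parsed page.
raw = "\n      Download\n   Python 3.6.0   \n"

cleaned = clean_text(raw)
version = extract_version(cleaned)
print(repr(cleaned), version)
```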
4:25 If you are getting attributes and things, maybe a little less so,
4:27 because there is less flexibility there. Finally, don't hammer the server,
4:30 the sites are built to have users come and do a couple of requests a minute,
4:33 and sort of cruise around. You could just pound this thing from a good cloud-based server,
4:38 trying to do screen scraping against it, so consider adding some sort of delay,
4:42 some little time.sleep type thing to make
4:46 what you do to these guys not so intense,
4:49 like just be considerate of their server resources
4:52 and don't effectively do a denial-of-service thing on there,
4:55 so consider some sort of slowdown. I adapted these notes,
4:59 or these rules, from Greg Reda's article, which you can see at the bottom,
5:02 which is a screen scraping 101 in Python. I thought he had some good rules,
5:06 so this is sort of my adaptation of his, so thank you Greg.
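For the "don't hammer the server" rule, the time.sleep idea could look something like this minimal sketch. The function fetch_all, its parameters, and the one second default are assumptions for illustration; fetch is whatever callable does the actual http get, and sleep is only a parameter so the pause can be swapped out when testing.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each url in turn, pausing between requests.

    fetch is any callable that takes a url and returns the page.
    The one second default is an arbitrary, hypothetical choice.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause between requests, but not before the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Usage with a fake fetch function, so nothing touches the network here.
pauses = []
pages = fetch_all(
    ["a", "b", "c"],
    fetch=lambda url: url.upper(),
    delay_seconds=0.5,
    sleep=pauses.append,  # record the pauses instead of really sleeping
)
print(pages, pauses)
```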