Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: What is screen scraping and web scraping?
0:02
Now we've made our way in the course to screen scraping, and the little subtitle I've added here is that this is for the sites with missing services.
0:13
So, what is screen scraping? Well, if there is an API, we've seen there is a specific endpoint URL that we can call
0:20
and things like JSON or XML or even SOAP data come back. But a lot of data is out there, the majority of data is out there, with no API behind it,
0:28
so imagine we want to learn about stuff on the Python.org homepage. Now, there is probably an RSS feed or something
0:36
where we can get this information, but let's assume it doesn't exist; let's assume only this web page contains the information that we need,
0:43
and our goal is to know what the current versions of Python available for download are. Right here, you can see the two buttons,
0:50
like download Python 3.6.0, or I could download Python 2.7.13. So if we want to use screen scraping to get this,
0:57
what we do is, just like before, issue an HTTP GET to that URL,
1:01
and what we get back is not some nice structured data but probably malformed, almost certainly malformed, HTML.
1:09
However, HTML does have a few things that we can work with. Notice that we have a paragraph, it has a class called download-buttons,
1:17
and in there, there are some hyperlinks with text inside, like download Python 3.6.0 and download Python 2.7.13. So we can feed that to an HTML parser
1:25
which can deal with the malformed components, because it's usually not exact XHTML,
1:34
even HTML5 doesn't necessarily match, say, straight XML, so you've got to do a little work to parse that, load it into a DOM,
1:42
and then we can use this in our app. We can query this data, either by navigating the hierarchy or even by using CSS,
1:50
so I could easily write a CSS selector, say .download-buttons a, and that would give me two elements back,
2:00
and those would be the two download links. The links would contain the actual link to download
2:05
as well as the text, on which I could do some kind of work, some kind of string search, to figure out what the details there are.
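The workflow just described (GET the page, parse the HTML, pick out the links inside the download-buttons paragraph) can be sketched roughly as follows. This is only a minimal sketch using the standard library's html.parser rather than a CSS selector engine; a third-party parser with CSS selector support would be the more typical choice. The names DownloadLinkParser, fetch_html, extract_download_links, and the SAMPLE_HTML snippet with its href values are all hypothetical, made up for illustration and not taken from the real page.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Hypothetical sample, modeled on the structure described in the lecture.
SAMPLE_HTML = """
<p class="download-buttons">
  <a class="button" href="/example/python-3.6.0/">Download Python 3.6.0</a>
  <a class="button" href="/example/python-2.7.13/">Download Python 2.7.13</a>
</p>
<p><a href="/elsewhere/">Not a download link</a></p>
"""


class DownloadLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags inside class="download-buttons"."""

    def __init__(self):
        super().__init__()
        self.container_tag = None  # tag name of the container we are inside, if any
        self.current_href = None   # href of the <a> we are inside, if any
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.container_tag is None:
            if "download-buttons" in classes:
                self.container_tag = tag
        elif tag == "a":
            self.current_href = attrs.get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            self.links.append((self.current_href, "".join(self.current_text)))
            self.current_href = None
        elif tag == self.container_tag:
            self.container_tag = None


def fetch_html(url):
    """Issue an HTTP GET and return the body as text (not called in this sketch)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_download_links(html):
    """Return the (href, text) pairs found inside the download-buttons element."""
    parser = DownloadLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_download_links(SAMPLE_HTML)
for href, text in links:
    print(href, "->", text)
```

Against a live site, the string passed to extract_download_links would come from something like fetch_html instead of the inline sample.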
2:11
So that is how screen scraping works, that is the screen scraping workflow. It's surprisingly easy, surprisingly effective;
2:19
however, there are some rules that you should keep in mind. Basically, try not to rock the boat, be a good citizen,
2:26
know the terms and conditions for the site. Many of these sites have things saying
2:32
basically that you can't just do, like, random screen scraping and consume their data,
2:36
it is their data after all. So there is what you can do with screen scraping, and there is what you can legally do with screen scraping,
2:44
and then there is kind of what you should do. So, be sure that you are on good terms. Somebody wanted to work with my data,
2:50
my transcript data, off of my website in a live fashion, not out of the GitHub repo that I have, and they sent me messages
2:56
and said, hey Michael, do you mind if I screen scrape your source for some, like, data science analysis of the transcripts? No, not at all, I don't mind,
3:04
and I gave them permission to do it, and it's great. They will probably do something like what I am going to show you here.
3:10
But consider asking and getting permission if it's not allowed, or at least check the terms and conditions.
3:16
Also, be aware that your scraping code will break. If you get an email from a site that you've been doing screen scraping against,
3:23
and they are like, big news, we've redesigned our site, it's beautiful, you can just think, okay, you just broke my code.
3:30
Because, going back to what I was just describing, the fact that the thing that contains the buttons had the class download-buttons
3:37
and it was hyperlinks that were the actual things that we're after: if something about that changes, like they change that class
3:44
or they become actual buttons, not hyperlinks, right, broken. So, little changes to the layout will break your code. It's not usually too hard to fix,
3:52
you want to isolate that stuff off into one or two functions, but just be aware that these things need care and feeding because of this.
3:58
The resulting data that comes back is going to look somewhat nasty. If you look at the HTML, a lot of times there will be extra line breaks,
4:07
there will be newline characters interspersed in there, and so on,
4:10
so you are going to have to do a little bit of work to take the values you pulled out and actually clean them up, especially the raw data in between the tags,
4:18
like the download Python 3.6.0 text, that might come back really with lots of junk around it.
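The cleanup step mentioned above might look something like this. It is a rough sketch only: clean_text, extract_version, VERSION_PATTERN, and the messy raw_text value are hypothetical names and data invented for illustration, and the version pattern assumes dotted numeric versions such as 3.6.0.

```python
import re

# Assumption: versions look like dotted numbers, e.g. 3.6.0 or 2.7.13.
VERSION_PATTERN = re.compile(r"\d+(?:\.\d+)+")


def clean_text(raw):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(raw.split())


def extract_version(raw):
    """Return the first dotted version number found in the text, or None."""
    match = VERSION_PATTERN.search(clean_text(raw))
    return match.group(0) if match else None


# Hypothetical example of link text with junk whitespace around it.
raw_text = "\n      Download\n  Python 3.6.0 \n\n"
print(clean_text(raw_text))
print(extract_version(raw_text))
```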
4:26
When you are getting attributes and things, maybe a little less so, because there is less flexibility there. Finally, don't hammer the server.
4:31
The sites are built to have users come and do a couple of requests a minute
4:34
and sort of cruise around. You could just pound this thing with a good cloud-based server
4:39
trying to do screen scraping against it, so consider adding some sort of delay, some little time.sleep type thing, to make
4:47
what you do to these guys not so intense. Just be considerate of their server resources and don't effectively do a denial-of-service thing on there,
4:56
so consider some sort of slowdown. I adapted these notes, or these rules, from Greg Reda's article, which you can see at the bottom,
5:03
which is a screen scraping 101 in Python. I thought he had some good rules, so this is sort of my adaptation of his, so thank you, Greg.
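The "don't hammer the server" rule, with its time.sleep type delay, could be sketched along these lines. This is just one possible shape under stated assumptions: fetch_politely, its parameters, and the stand-in fake_fetch are hypothetical names, and the two-second default delay is an arbitrary illustrative value, not a recommendation from the lecture.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns its content.
    `sleep` is injectable so the pause can be replaced in tests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in for a real HTTP GET so this sketch makes no network calls."""
    return "<html>content of " + url + "</html>"


# Hypothetical paths; delay set to 0 here only so the demo runs instantly.
pages = fetch_politely(["/page-1", "/page-2"], fake_fetch, delay_seconds=0.0)
print(len(pages))
```

In real use, fetch would be a function that issues the HTTP GET, and the delay would stay well above zero.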