Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: What is screen scraping and web scraping?
Now we've made our way in the course to screen scraping. And, the little subtitle I've added here is these are for the sites with missing services.
So, what is screen scraping? Well, if there is an API, we've seen there is a specific endpoint URL that we can call
and things like JSON or XML or even SOAP data come back. But a lot of data is out there, the majority of data is out there, with no API behind it.
So imagine we want to learn about stuff on the Python homepage. Now, there is probably an RSS feed or something
where we can get this information, but let's assume it doesn't exist, let's assume only this web page contains the information that we need,
and our goal is to know which current versions of Python are available for download. Right here, you can see the two buttons,
like I could download Python 3.6.0 or I could download Python 2.7.13. So if we want to use screen scraping to get this,
what we do is, just like before, issue an HTTP GET to that URL,
and what we get back is not some nice structured data but probably malformed, almost certainly malformed, HTML,
however, HTML does have a few things that we can work with. Notice that we have a paragraph, it has a class called download-buttons,
and in there, there are some hyperlinks with text inside like Download Python 3.6.0 and Download Python 2.7.13. So we can feed that to an HTML parser
which can deal with the malformed parts of the HTML, because it's not usually exact XHTML,
and even HTML5 doesn't necessarily parse as straight XML. So you've got to do a little work to parse that and load it into a DOM,
and then we can use this in our app. We can query this data, either by navigating the hierarchy or even using CSS,
so I could easily write a CSS selector, say .download-buttons a, and that would give me two elements back,
and those would be the two download links. The links would contain the actual link to download
as well as the text, on which I could do some kind of work, some kind of string search, to figure out what the details are.
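The workflow just described (get the HTML, parse it, pull out the links inside the download-buttons element) can be sketched with nothing but the standard library. This is only an illustration under stated assumptions: the class name DownloadLinkParser, the helper names, and the sample markup with its href values are all hypothetical, and a dedicated HTML parsing library with real CSS selector support would normally be more convenient.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class DownloadLinkParser(HTMLParser):
    """Collect (href, text) for each <a> inside an element with class 'download-buttons'."""

    def __init__(self):
        super().__init__()
        self.links = []          # finished (href, text) pairs
        self._container = None   # tag name of the open download-buttons element
        self._depth = 0          # nesting depth of same-named tags inside it
        self._href = None        # href of the <a> currently being read
        self._text = []          # text fragments of that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._container is None:
            # Not inside the container yet: only look for its opening tag.
            if "download-buttons" in (attrs.get("class") or "").split():
                self._container, self._depth = tag, 1
            return
        if tag == self._container:
            self._depth += 1
        if tag == "a":
            self._href, self._text = attrs.get("href"), []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._container is None:
            return
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None
        elif tag == self._container:
            self._depth -= 1
            if self._depth == 0:
                self._container = None


def find_download_links(html):
    """Rough equivalent of the CSS selector '.download-buttons a'."""
    parser = DownloadLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch_html(url):
    """Issue an HTTP GET and return the body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical markup modeled on the description above; the hrefs are made up.
SAMPLE_HTML = """
<p class="download-buttons">
  <a class="button" href="/downloads/release/python-360/">Download Python 3.6.0</a>
  <a class="button" href="/downloads/release/python-2713/">Download Python 2.7.13</a>
</p>
<p><a href="/about/">About</a></p>
"""

links = find_download_links(SAMPLE_HTML)
```

Against a live page you would pass the result of fetch_html(url) to find_download_links instead of the sample string; each returned pair holds the link target and the visible link text.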
So that is how screen scraping works, that is the screen scraping workflow, so, it's surprisingly easy, surprisingly effective,
however, there are some rules that you should keep in mind. Basically, try not to rock the boat, be a good citizen,
know the terms and conditions for the site. Many of these sites have things saying
basically you can't do random screen scraping and consume their data,
it is their data after all. So there is what you can do with screen scraping, there is what you can legally do with screen scraping,
and then there is what you should do, and so, be sure that you are on good terms. Somebody wanted to work with my data,
my transcript data, off of my website in a live fashion, not out of the GitHub repo that I have, and they sent me messages
and said, hey Michael, do you mind if I screen scrape your source for some data science analysis of the transcripts? No, not at all, I don't mind,
and, I gave them permission to do it, and it's great, they will probably do something like what I am going to show you here.
But, consider asking and getting permission if it's not allowed or at least check the terms and conditions.
Also, be aware that your scraping code will break. If you get an email from a site that you've been doing screen scraping against,
and they are like, big news, we've redesigned our site, it's beautiful, you can just think, okay, you just broke my code.
Because with what I was just describing before, the fact that the thing that contains the buttons had the class download-buttons
and it was hyperlinks that were the actual things that we're after, if something about that changes, like they change that class
or they become actual buttons, not hyperlinks, right, broken. So, little changes to the layout will break your code. It's not usually too hard to fix,
you want to isolate that stuff into one or two functions, but just be aware that these things need care and feeding because of this.
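One way to picture that isolation: keep every assumption about the page in a single small function, and have it fail loudly when the page stops matching. The sketch below is hypothetical (the names VERSION_PATTERN, LayoutChangedError, and parse_version are made up for illustration), and it assumes the link text keeps the form "Download Python X.Y.Z".

```python
import re

# The one place that knows what the link text looks like.
# This pattern is an assumption about the page, not something the site guarantees.
VERSION_PATTERN = re.compile(r"Download Python (\d+(?:\.\d+)+)")


class LayoutChangedError(Exception):
    """Raised when the page no longer looks the way the scraper expects."""


def parse_version(link_text):
    """Pull the version number out of a download link's text."""
    match = VERSION_PATTERN.search(link_text)
    if match is None:
        # Failing loudly makes a redesign show up as a clear error
        # instead of silently producing wrong data.
        raise LayoutChangedError(f"Unexpected download link text: {link_text!r}")
    return match.group(1)


versions = [parse_version(text) for text in
            ("Download Python 3.6.0", "Download Python 2.7.13")]
```

When the site is redesigned, only the pattern and this function should need to change; the rest of the app keeps working with plain version strings.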
The resulting data that comes back is going to look somewhat nasty, if you look at the HTML a lot of times there will be extra line breaks,
there will be new line characters interspersed in there, and so on,
so you are going to have to do a little bit of work to take the values you pulled out and actually clean them up, especially the raw text in between the tags,
like the Download Python 3.6.0 text, which might come back with lots of junk around it.
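A minimal sketch of that cleanup step, assuming the junk is mostly stray whitespace and newlines around and inside the text (the function name clean_text and the sample string are made up for illustration):

```python
def clean_text(raw):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces and trim the ends."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(raw.split())


# Hypothetical raw text as it might come out of the parsed HTML.
messy = "\n      Download\n   Python 3.6.0   \n"
tidy = clean_text(messy)
```

Here tidy ends up as "Download Python 3.6.0", which is much easier to search or compare than the raw value.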
If you are getting attributes and things, maybe a little less so, because there is less flexibility there. Finally, don't hammer the server.
These sites are built to have users come and do a couple of requests a minute
and sort of cruise around. You could just pound this thing with a good cloud-based server
trying to do screen scraping against it, so consider adding some sort of delay, some little time.sleep type of thing, to make
what you do to these guys not so intense. Just be considerate of their server resources and don't effectively do a denial-of-service thing on there,
so consider some sort of slowdown. I adapted these notes, or these rules, from Greg Reda's article, which you can see at the bottom,
which is a screen scraping 101 in Python. I thought he had some good rules, so this is sort of my adaptation of his, so thank you Greg.
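The time.sleep idea mentioned above could look something like this. It is only a sketch: fetch_politely, its parameters, and the two-second default are hypothetical choices, and the right delay depends on the site. The fetch and sleep functions are passed in so the pacing logic can be tried out without touching a real server.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests so the server is not hammered.

    fetch is any function that takes a URL and returns the page content.
    sleep defaults to time.sleep and is a parameter only so it can be swapped out in tests.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay_seconds)
        pages.append(fetch(url))
    return pages


# Dry run with stand-ins: no network access and no real waiting.
recorded_pauses = []
pages = fetch_politely(
    ["page-1", "page-2", "page-3"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.5,
    sleep=recorded_pauses.append,
)
```

In real use you would leave sleep at its default and pass in whatever function actually downloads a page.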