Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: What is screen scraping and web scraping?
0:02
Now we've made our way in the course to screen scraping.
0:06
And, the little subtitle I've added here is
0:09
these are for the sites with missing services.
0:12
So, what is screen scraping? Well, if there is an API,
0:16
we've seen there is a specific endpoint URL that we can call
0:19
and things like JSON or XML or even SOAP data come back.
0:23
But a lot of data is out there, the majority of data is out there with no API behind it,
0:27
so imagine we want to learn about stuff on the python.org homepage,
0:32
now there is probably an rss feed or something
0:35
where we can get this information, but let's assume it doesn't exist,
0:39
let's assume only this web page contains the information that we need,
0:42
and our goal is to know which current versions of Python
0:46
are available for download right here, you can see the two buttons,
0:49
like I can download Python 3.6.0 or I could download Python 2.7.13,
0:53
so if we want to use screen scraping to get this,
0:56
what we do is, just like before, issue an HTTP GET to that URL,
1:00
and what we get back is not some nice structured data but probably malformed,
1:05
almost certainly malformed HTML,
1:08
however, HTML does have some structure that we can work with,
1:12
notice that we have a paragraph, it has a class called download-buttons,
1:16
and in there, there are some hyperlinks with text inside like Download Python 3.6.0,
1:19
Download Python 2.7.13, so we can feed that to an HTML parser
1:24
which can deal with the malformed components
1:28
of the markup, because it's not usually exact XHTML,
1:33
even HTML5 doesn't necessarily match, say, straight XML,
1:37
so you've got to do a little work to parse that, load it into a DOM,
1:41
and then we can use this in our app, we can query this data,
1:46
either by navigating the hierarchy or even using CSS,
1:49
so I could easily write a CSS selector, say .download-buttons a,
1:56
and that would give me back two elements,
1:59
and those would be the two download links,
2:02
and the links would contain actually the link to download
2:04
as well as the text, on which I could do some kind of work,
2:07
some kind of string search to figure out what the details there are.
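The workflow just described (GET the page, parse the HTML, pull out the links inside the download-buttons paragraph) could be sketched roughly like this. It is only a minimal illustration using the standard library's html.parser; the sample markup, the href values, and the names DownloadLinkParser, fetch_html, and find_download_links are made up for the example. A library such as Beautiful Soup would let you write the CSS selector directly, for instance soup.select(".download-buttons a").

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Hypothetical sample markup, loosely modeled on the page described in the lecture.
SAMPLE_HTML = """
<div><p class="download-buttons">
  <a class="button" href="/ftp/python/3.6.0/python-3.6.0.exe">Download Python 3.6.0</a>
  <a class="button" href="/ftp/python/2.7.13/python-2.7.13.msi">Download Python 2.7.13</a>
</p><a href="/other">Other</a></div>
"""


class DownloadLinkParser(HTMLParser):
    """Collect (href, text) for every <a href=...> inside an element with a given class."""

    def __init__(self, container_class="download-buttons"):
        super().__init__()
        self.container_class = container_class
        self.container_tag = None  # tag name of the container we are inside, if any
        self.depth = 0             # nesting depth of same-named tags inside the container
        self.current_href = None   # href of the link currently being read
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.container_tag is None:
            # Not inside the container yet: check whether this tag opens it.
            classes = (attrs.get("class") or "").split()
            if self.container_class in classes:
                self.container_tag = tag
                self.depth = 1
            return
        if tag == self.container_tag:
            self.depth += 1
        if tag == "a":
            self.current_href = attrs.get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if self.container_tag is None:
            return
        if tag == "a" and self.current_href is not None:
            self.links.append((self.current_href, "".join(self.current_text)))
            self.current_href = None
        elif tag == self.container_tag:
            self.depth -= 1
            if self.depth == 0:
                self.container_tag = None


def fetch_html(url):
    """Issue an HTTP GET and return the body as text (not called in this sketch)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def find_download_links(html):
    parser = DownloadLinkParser()
    parser.feed(html)
    return parser.links


for href, text in find_download_links(SAMPLE_HTML):
    print(text, "->", href)
```

Against a live page you would pass the result of fetch_html(...) to find_download_links instead of the sample string.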
2:10
So that is how screen scraping works, that is the screen scraping workflow,
2:14
so, it's surprisingly easy, surprisingly effective,
2:18
however, there are some rules that you should keep in mind.
2:21
Basically, try not to rock the boat, be a good citizen,
2:25
know the terms and conditions for the site.
2:29
Many of these sites have things saying
2:31
basically that you can't do random screen scraping and consume their data,
2:35
it is their data after all, so there is what you can do with screen scraping
2:40
and there is what you can legally do with screen scraping
2:43
and then there is kind of what you should do, and so,
2:46
be sure that you are on good terms, somebody wanted to work with my data,
2:49
my transcript data, off of my website in a live fashion,
2:52
not out of the GitHub repo that I have, and they sent me messages
2:55
and said hey Michael, do you mind if I screen scrape your source
2:58
for some data science analysis of the transcripts? No, not at all, I don't mind,
3:03
and, I gave them permission to do it, and it's great,
3:06
they will probably do something like what I am going to show you here.
3:09
But, consider asking and getting permission if it's not allowed
3:11
or at least check the terms and conditions.
3:15
Also, be aware that your scraping code will break,
3:18
if you get an email from a site that you've been doing screen scraping against,
3:22
and they are like, "Big news, we've redesigned our site,
3:25
it's beautiful," you can just think, "Okay, you just broke my code,"
3:29
because even what I was just describing before, the fact
3:32
that the thing that contains the buttons had the class download-buttons
3:36
and it was hyperlinks that were the actual things that we're after,
3:40
if something about that changes, like they change that class
3:43
or the links become actual buttons, not hyperlinks, right, broken.
3:46
So, little changes to the layout will break your code, it's not usually too hard to fix,
3:51
you want to isolate that stuff in one or two functions,
3:54
but just be aware that these things need care and feeding because of this.
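One way to picture "isolate that stuff in one or two functions" is to keep every layout assumption in a single place and fail loudly when the page no longer matches. The sketch below is an assumption-laden illustration, not the course's code: the names LayoutChanged, extract_download_links, and current_download_texts are hypothetical, and the regular expressions stand in for a real HTML parser only to keep the example short.

```python
import re

# Everything that depends on the page layout lives here.
# Assumed markup: <p class="download-buttons"> ... <a href="...">text</a> ... </p>
_CONTAINER = re.compile(
    r'<p[^>]*class="[^"]*\bdownload-buttons\b[^"]*"[^>]*>(.*?)</p>',
    re.S | re.I,
)
_LINK = re.compile(r'<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>', re.S | re.I)


class LayoutChanged(Exception):
    """Raised when the page no longer looks the way the scraper expects."""


def extract_download_links(html):
    """The only function that knows about the class name and the <a> tags."""
    container = _CONTAINER.search(html)
    if container is None:
        raise LayoutChanged("no <p> with class 'download-buttons' found")
    links = _LINK.findall(container.group(1))
    if not links:
        raise LayoutChanged("download-buttons contains no hyperlinks")
    return links  # list of (href, text) tuples


def current_download_texts(html):
    """The rest of the app only sees plain data, never the markup details."""
    return [text.strip() for _href, text in extract_download_links(html)]
```

If the site is redesigned, only extract_download_links needs to change, and the LayoutChanged error says so immediately instead of silently returning nothing.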
3:57
The resulting data that comes back is going to look somewhat nasty,
4:02
if you look at the html a lot of times there will be extra line breaks,
4:06
there will be newline characters interspersed in there, and so on,
4:09
so you are going to have to do a little bit of work to take the values you pulled out
4:13
and actually clean them up, especially the raw data in between
4:17
like the download Python 3.6.0 text, that might come back really
4:22
with lots of junk around it.
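The cleanup step mentioned here might look something like the following. This is a small sketch with hypothetical helper names (clean_text, parse_version): it collapses the stray line breaks and spaces, then does the "string search" for a dotted version number.

```python
import re


def clean_text(raw):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    return " ".join(raw.split())


def parse_version(text):
    """Return the first dotted version number found in the text, or None."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    return match.group(0) if match else None


raw = "\n      Download\n   Python   3.6.0 \r\n"
label = clean_text(raw)
print(repr(label))
print(parse_version(label))
```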
4:25
When you are getting attributes and things, maybe a little less so,
4:27
because there is less flexibility there. Finally, don't hammer the server,
4:30
the sites are built to have users come and do a couple of requests a minute,
4:33
and sort of cruise around, you could just pound this thing with a good cloud based server,
4:38
trying to do screen scraping against it, so consider adding some sort of delay,
4:42
some little time.sleep type thing to make it
4:46
not so intense what you do to these guys,
4:49
like just be considerate of their server resources
4:52
and don't do effectively a denial-of-service thing on there,
4:55
so consider some sort of slowdown. I adapted these notes,
4:59
or these rules from Greg Reda's article which you can see at the bottom,
5:02
which is a screen scraping 101 in Python, I thought he had some good rules,
5:06
so this is sort of my adaptation of his, so thank you Greg.
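As an illustration of the "don't hammer the server" rule, here is a minimal sketch of fetching several pages with a time.sleep pause between requests. The function name fetch_all, the two-second default, and the injectable fetch and sleep parameters are all assumptions made for the example, not something from the course; pick a delay that suits the site you are working with.

```python
import time
from urllib.request import urlopen


def _fetch(url):
    """Plain HTTP GET returning the response body as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_all(urls, delay_seconds=2.0, fetch=_fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to go easy on the server."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch(url))
    return pages
```

Passing fetch and sleep as parameters keeps the sketch easy to try out without touching the network, for example fetch_all(urls, fetch=lambda u: "stub", sleep=lambda s: None).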