Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Survey of screen scraping libraries
0:01
So let's talk about some of the options we have for web scraping. You certainly don't want to just load the HTML and parse it by hand yourself,
0:10
so one really nice combination is to use this library called Beautiful Soup. Beautiful Soup doesn't download the content, it just parses the text,
0:18
so you want to make sure requests is in there to actually get the content, and we've seen how to do a basic HTTP GET with requests
0:24
the whole way through this class, so that is not a big deal. We just hand off the HTML to Beautiful Soup and it lets us do things like
0:29
search by CSS selector, and things like that. We can also use Scrapy. Scrapy is really nice and there is a whole range of things you can do with it,
0:40
so I definitely recommend that you check out Scrapy as well. Originally I had chosen Beautiful Soup because for a while Scrapy didn't support Python 3,
0:50
but now it fully supports Python 3, so that is really great news. I had started using Beautiful Soup previously,
0:56
before Scrapy started working with Python 3, but Scrapy has actually got some really interesting ways of working
1:02
and you'll see that it can actually grow a little bit further than just bringing this package into your code and writing the scraper yourself.
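The requests plus Beautiful Soup combination described above can be sketched roughly as follows. This is a minimal sketch, not code from the course: the function names, the `h2.title a` selector, and the sample markup are all assumptions made up for illustration, and the built-in `html.parser` backend is used so no extra parser needs to be installed.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page with requests and return its HTML text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_titles(html: str) -> list[str]:
    """Parse HTML with Beautiful Soup and pull out link text by CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    # 'h2.title a' is a hypothetical selector; adapt it to the real page.
    return [link.get_text(strip=True) for link in soup.select("h2.title a")]


# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h2 class="title"><a href="/ep/1">Episode one</a></h2>
  <h2 class="title"><a href="/ep/2">Episode two</a></h2>
  <p class="title">Not a heading</p>
</body></html>
"""

if __name__ == "__main__":
    # Offline demo; for a live page, pass fetch_html(some_url) instead.
    print(extract_titles(SAMPLE_HTML))
```

Keeping the download step and the parsing step in separate functions mirrors the split in the lecture: requests gets the content, Beautiful Soup only parses the text it is handed.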
1:10
The founders of Scrapy created this service called Scrapinghub, which is like web scraping as a service.
1:20
So there are all sorts of retry, caching, staleness, and infrastructure things that you've really got to handle to do large-scale web scraping,
1:29
so if that is your goal, check out Scrapinghub; they've got all of that set up for you,
1:33
and you take the same code that you would write in Scrapy locally, drop it in here and it runs in their infrastructure.
1:39
So that is pretty sweet. I also did an entire episode on screen scraping with the founder of Scrapinghub and the creator of Scrapy, Pablo Hoffman,
1:47
so we talked about web scraping, some of the techniques, Scrapinghub, some of the rules around this and so on,
1:53
so if you are interested in going deeper on this topic, go ahead and check out talkpython.fm/50.