Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Survey of screen scraping libraries
0:01 So let's talk about some of our options that we can use for web scraping.
0:05 Certainly you don't want to just load the HTML and parse it yourself,
0:09 so one really nice combination is to use this library called Beautiful Soup
0:13 and Beautiful Soup doesn't download the content, it just parses the text,
0:17 so you want to make sure requests is in there to actually get the content,
0:20 and we've seen how to do a basic HTTP GET with requests,
0:23 the whole way through this class so that is not a big deal;
0:25 and we just hand off the HTML to Beautiful Soup and it lets us do things like
0:28 search by CSS selector, and things like that. We can also use Scrapy;
0:33 Scrapy is really nice and there is a whole range of things you can do with Scrapy,
0:39 so I definitely recommend that you check out Scrapy as well.
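As a rough illustration of the requests plus Beautiful Soup combination described above, here is a minimal sketch. The sample markup, the `div.episode a` selector, and the function names `fetch_html` and `extract_episode_links` are all made up for this example; a real page would need its own selector. The parsing step runs on an inline HTML string so the sketch works without a network call.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <div class="episode"><a href="/episodes/1">First episode</a></div>
  <div class="episode"><a href="/episodes/2">Second episode</a></div>
  <div class="sidebar"><a href="/about">About</a></div>
</body></html>
"""


def fetch_html(url):
    """Download a page with requests and return its text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_episode_links(html):
    """Parse HTML text and return (title, href) pairs found by a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.select("div.episode a")  # assumed page structure
    ]


# Beautiful Soup only parses; here we hand it text we already have.
links = extract_episode_links(SAMPLE_HTML)
print(links)
```

Against a live site you would pass `fetch_html(some_url)` instead of `SAMPLE_HTML`; the split keeps downloading (requests) and parsing (Beautiful Soup) as separate steps.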
0:42 Originally I had chosen Beautiful Soup
0:46 because for a while Scrapy didn't support Python 3,
0:49 but now it fully supports Python 3, so that is really great news,
0:52 and I had started using Beautiful Soup previously,
0:55 before Scrapy started working with Python 3,
0:58 but Scrapy has actually got some really interesting ways of working
1:01 and you'll see that it can actually go a little bit further than
1:05 just bringing this package into your code and writing it yourself.
1:09 The founders of Scrapy created this service called Scrapinghub
1:15 which is like web scraping as a service.
1:19 So there is all sorts of retry, caching, staleness, infrastructure,
1:24 things that you really have to do for large-scale web scraping,
1:28 so if that is your goal, check out Scrapinghub, they've got all of that set up for you,
1:32 and you take the same code that you would write in Scrapy locally,
1:35 drop it in here and it runs in their infrastructure.
1:38 So that is pretty sweet and I also did an entire episode on screen scraping
1:41 with the founder of Scrapinghub and the creator of Scrapy, Pablo Hoffman,
1:46 so we talked about web scraping, some of the techniques,
1:49 Scrapinghub, some of the rules around this and so on,
1:52 so if you are interested in going deeper on this topic,
1:55 go ahead and check out talkpython.fm/50.