#100DaysOfCode in Python Transcripts
Chapter: Days 46-48: Web Scraping with BeautifulSoup4
Lecture: Requests best practice
0:00
Alright, one quick public service announcement regarding this code that we've just written: this section here, the pulling of the site.
0:10
It's not actually kosher to keep that in this script. The reason for that is we don't want to submit a request to a website
0:18
every time we want to scrape some data. We might run a scraper like this at a different interval than the one we use to actually pull the website.
0:28
The reason for that is, if you think about it, not every site is going to update every few minutes, and not every site is going to update every day.
0:36
So if you keep pinging that site with a request, you're going to very quickly spam them. You might even get yourself blocked.
0:44
And you could use up their bandwidth limit. There are certain websites that, you know, can only support a certain number of hits
0:52
per minute, alright? And if you keep doing that, you're going to make the website that you enjoy viewing so much pretty unhappy.
1:00
So the best practice here is to put all of this into a different script, run that on a cron job at a different interval
1:09
or whatever other automated way you want to do that, and then use Beautiful Soup 4 to point at the downloaded HTML file
1:19
or at the page that you have pulled down from requests, alright. Nice and easy. It's actually much more pleasant for everyone
1:30
to do it that way, and I would totally recommend doing it. The only reason we've done it this way right now is just for demonstration purposes.
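To make that concrete, here's a minimal sketch of what the fetching half of that split could look like. The URL, filename, and script name here are placeholder assumptions for illustration, not code from the lecture.

    # download_page.py - the "fetch" half of the split (names are hypothetical)
    import requests

    URL = 'https://pybit.es'   # placeholder target site
    OUTFILE = 'page.html'      # local copy the parsing script will read later

    def main():
        resp = requests.get(URL)
        resp.raise_for_status()  # stop early if the site returned an error
        with open(OUTFILE, 'w', encoding='utf-8') as f:
            f.write(resp.text)

    if __name__ == '__main__':
        main()

A cron entry like 0 6 * * * python3 /path/to/download_page.py would run this once a day at 6am; you'd tune that interval to however often the site actually changes.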
1:37
It's much easier. But in production, definitely put your requests in a different script and use Beautiful Soup 4
1:45
to talk to a static file, not the actual URL, unless of course it's a one-off thing.
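And here's a sketch of the parsing half, pointing Beautiful Soup 4 at the saved file instead of the live URL. It assumes the same hypothetical page.html written by the downloader above.

    # scrape_page.py - the "parse" half; no network request involved
    from bs4 import BeautifulSoup

    with open('page.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')

    # work with the parsed tree as usual, e.g. print the page title
    print(soup.title.string)

Because this script only reads a local file, you can rerun it as often as you like while developing your parsing logic without sending a single extra request to the site.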