#100DaysOfCode in Python Transcripts
Chapter: Days 46-48: Web Scraping with BeautifulSoup4
Lecture: Building your first BS4 scraper
0:00 Time for some code. We need to think about what we're going to do first. The first thing we need to do is actually pull down
0:08 our website, and what are we going to use for that? We're going to use requests because we pip installed it,
0:14 didn't we, that was a bit of a dead giveaway. We're also going to import bs4. That's it. Let's specify the actual URL that we're going to be
0:26 dealing with here. I'm just going to copy and paste it. This is the URL of our PyBites projects page. Looking here,
0:38 we have our PyBites Code Challenges. What we're going to do with this one is bring down all of these PyBites projects headers, so, our 100 days.
0:51 Our 100 Days Of Code, our 100 Days Of Django. These different headers. We're going to pull all of those down
0:57 and we're just going to use that as a nice, simple example for this script. Let's start off our code with the standard dunder main block.
1:12 Now, what are we going to do? The first thing, obviously, as I said, is to pull down the website, so let's create a function for that.
1:20 def pull_site, nice and creative. We're going to use requests, so I'm going to do this really quickly just so that we can get a move on
1:31 because we've dealt with requests before. So, requests.get(URL). That will get the page and store it in the raw_site_page object.
1:42 So, raw_site_page.raise_for_status(). Now, this is just to make sure that the request worked. If it didn't, we'll get an error.
1:54 And then, we're just going to return raw_site_page. Nice and easy. Let's just assign that to something down here called site. This will
2:07 assign the returned page to a variable called site. Now, we'll record a video after this explaining why what we're doing here is not a good idea.
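A minimal sketch of the download step described above. The URL below is a placeholder, since the transcript never spells out the real address, and the timeout argument is an addition that is not part of the lecture.

```python
import requests

# Placeholder address (an assumption): swap in the page you actually want to scrape.
URL = "https://example.com/projects.html"


def pull_site(url=URL):
    """Download the page and return the requests Response object."""
    # timeout is an extra safety net, not something the lecture uses
    raw_site_page = requests.get(url, timeout=10)
    # Raises requests.HTTPError for 4xx/5xx responses instead of failing silently
    raw_site_page.raise_for_status()
    return raw_site_page


# Usage (performs a real network request, so it is left commented out here):
# site = pull_site()
```

Nothing is downloaded until pull_site is actually called, which keeps the function easy to import and reuse.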
2:20 But, for now, as just a nice little explainer, this will do. Let's create another function. This function we're going to call scrape.
2:30 It's going to be used against our site object. We need to think ahead a little bit. I'm going to think ahead by putting this list here.
2:43 If you think about our page, as we pull this data out of the page, these headers, we need to store them somewhere, don't we?
2:54 We must store them in a header list. We create the empty list. Now we get to the Beautiful Soup 4 stuff,
3:04 and this is really, really easy, so don't worry if you don't wrap your head around it right away. It's only a couple of lines of code,
3:10 which is why we all love Python, right? We're going to create a soup object, and it's not a bowl of soup, it's just a normal object.
3:20 bs4.BeautifulSoup. We're going to take the text of the site. So, site.text, we're going to take that.
3:34 That's going to be passed to Beautiful Soup 4. We're going to use html.parser, the HTML parser built into Python, in order to get our HTML code
3:49 nicely into this soup object. Once we do that we have our soup object and HTML header list. This is going to be a list of our HTML headers.
4:06 You can see what we're doing. This is already really, really simple. We've taken Beautiful Soup 4 and we've told it
4:15 to parse the text of the site using the HTML parser because site is going to be an HTML document, pretty much, right? We're going to store that in soup.
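As a rough illustration of that step, here is the soup object being built from a small inline HTML string instead of a live page. The markup and the projectHeader class spelling are assumptions made up for the example.

```python
import bs4

# Stand-in for site.text (assumed markup, not the real page source)
html_text = """
<html><body>
  <h3 class="projectHeader">PyBites apps</h3>
  <p>Some description.</p>
</body></html>
"""

# "html.parser" selects the parser that ships with Python's standard library
soup = bs4.BeautifulSoup(html_text, "html.parser")

print(type(soup).__name__)
print(soup.h3)
```

The second argument only names the parser; other parsers can be plugged in there if they are installed.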
4:28 We're creating an object here that is going to be... I'll show you. html_header_list equals soup.select. What are we selecting?
4:42 The select method on soup takes a CSS selector, so it allows us to pull down exactly what we need. We get to select something out of the HTML source.
4:54 Let's look at the HTML source again. We'll go view page source. We'll get to these headers. The first header is called zero.PyBites apps.
5:12 We find that on the page, it's going to be in a nice little h3 tag. What's unique about it, and this is where you really
5:20 have to get thinking and analyzing this page, the thing that's unique about all of our headers here, so, here's zero, here's number one down here,
5:31 is that they all have the project header CSS class. Playing with bs4 does need some tinkering. Occasionally, you'll find someone will have reused
5:44 the same CSS class somewhere else in the same page, so when you select it you'll get more than just what you wanted.
5:52 But in this case, I know, because this is our site, we've only used project header against these headers that we want to see, these ones here.
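To make the reused-class problem concrete, here is a small sketch on made-up markup where a div borrows the same class as the headers. The class spelling projectHeader and the sidebar element are assumptions for illustration only.

```python
import bs4

# Assumed markup: two real headers plus one unrelated element reusing the class
html_text = """
<h3 class="projectHeader">PyBites apps</h3>
<h3 class="projectHeader">100 Days Of Code</h3>
<div class="projectHeader">Sidebar box that reuses the class</div>
"""

soup = bs4.BeautifulSoup(html_text, "html.parser")

broad = soup.select(".projectHeader")     # every element with the class
narrow = soup.select("h3.projectHeader")  # only h3 elements with the class

print(len(broad), len(narrow))
```

Prefixing the class with the tag name is one way to tighten the selector when a class turns up in more places than expected.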
6:04 We're going to select everything that has the project header class. Let's copy that. We'll go down here, this is what we're selecting.
6:14 We have to put the dot because it is a CSS class. And let's see it. All this has done is we've created the soup object
6:24 of the site using the HTML parser, and then we've selected, within our soup object, everything with the CSS class project header.
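A quick sketch of why the dot matters, again on assumed markup with an assumed projectHeader class name. With the dot the selector matches a CSS class. Without it, the same word is read as a tag name, and no such tag exists.

```python
import bs4

# Assumed markup standing in for the projects page
html_text = """
<h3 class="projectHeader">PyBites apps</h3>
<h3 class="projectHeader">100 Days Of Code</h3>
"""

soup = bs4.BeautifulSoup(html_text, "html.parser")

with_dot = soup.select(".projectHeader")    # class selector
without_dot = soup.select("projectHeader")  # type (tag name) selector

print(len(with_dot))
print(without_dot)
```

An empty list from select is often a hint that the selector syntax, rather than the page, is the problem.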
6:35 We've stored those, we've stored everything that it finds into HTML header list. Easy peasy. Now all we need to do is iterate over this and store
6:49 the information that we need into this header list. We'll do that. We'll go, for headers in html_header_list, we're going to go header_list.append,
7:07 as we know how to do that, with headers.get text. We're saying, just get the text. Just to show you what that means.
7:20 Everything in here, in the class project header, we actually got the whole h3 tag. That soup select pulled the whole tag,
7:33 but all we wanted was the text. That's what this get text method does right here.
7:42 It strips out the tags, the HTML code, and gets you just the plain string text that we asked for. That's it.
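The difference between the selected tag and its text can be seen in a tiny sketch. The markup is invented for the example, and get_text() is used here as the Beautiful Soup method for extracting the text.

```python
import bs4

# Assumed markup: one header with a nested tag inside it
html_text = '<h3 class="projectHeader">PyBites <em>apps</em></h3>'

soup = bs4.BeautifulSoup(html_text, "html.parser")
tag = soup.select(".projectHeader")[0]

print(tag)             # the whole h3 element, tags included
print(tag.get_text())  # only the text inside it
```

Nested tags such as the em element are stripped too, so only the readable string is left.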
7:53 We want to see these headers, so let's just quickly create another for loop here. For headers in header_list print, ooo, what did I do there?
8:06 Print headers. And that's it. Save that, and what this will now allow us to do is print out the headers that we have stored
8:17 in header list in this for loop here. Let's have a look at that and see what it looks like.
8:23 Silly me, I've forgotten one thing, we actually have to call our scrape function. So now we will write scrape(site). Simple.
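Putting the pieces together, a sketch of the scrape function and the call to it might look like the following. To keep it runnable offline, a fake site object with a text attribute stands in for the downloaded page. The markup and the class spelling are assumptions, and returning the list is a small deviation from the lecture, which prints inside the function.

```python
from types import SimpleNamespace

import bs4


def scrape(site):
    """Collect the text of every element carrying the projectHeader class."""
    header_list = []
    soup = bs4.BeautifulSoup(site.text, "html.parser")
    html_header_list = soup.select(".projectHeader")
    for header in html_header_list:
        header_list.append(header.get_text())
    return header_list


# Stand-in for the Response returned by requests (only .text is needed here)
fake_site = SimpleNamespace(text="""
<h3 class="projectHeader">PyBites apps</h3>
<h3 class="projectHeader">100 Days Of Code</h3>
<h3 class="projectHeader">100 Days Of Django</h3>
""")

headers = scrape(fake_site)
for header in headers:
    print(header)
```

With a real download, the object returned by requests would be passed in place of fake_site.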
8:37 Save that, and let's give it a crack. I'll just move the screen to the right. There's my command prompt. Let's just run python scraper.py. Hit Enter.
8:52 And it worked, look at that. There's that plain text that we asked for with the get text. We got the first header, PyBites apps.
9:03 We got the second header, 100 Days Of Code, then 100 Days Of Django, and so on, and so forth. Now we have a list with just our headers.
9:12 This is really cool. If you think about it, you could use this to create a newsletter. You could use this to save your website's headers for
9:19 who knows what, to print out in a list and stick on the wall as a trophy. But this is the idea of Beautiful Soup 4.
9:26 You can get a website and you can just strip out all the tags and find the information you want, pull it out, and then, do things with it.
9:34 So, it's really, really cool. The next video we're going to cover some more interesting...