#100DaysOfCode in Python Transcripts
Chapter: Days 46-48: Web Scraping with BeautifulSoup4
Lecture: Building your first BS4 scraper
0:00 Time for some code.
0:02 We need to think about what we're going to do first.
0:05 The first thing we need to do is actually pull down
0:07 our website, and what are we going to use for that?
0:10 We're going to use requests because we pip installed it,
0:13 didn't we, that was a bit of a dead giveaway.
0:16 We're also going to import bs4.
0:19 That's it.
0:22 Let's specify the actual URL that we're going to be
0:25 dealing with here, I'm just going to copy and paste it.
0:28 This is the URL of
0:31 our Pybites projects page.
0:34 Looking here,
0:37 we have our PyBites Code Challenges.
0:41 What we're going to do with this one is bring down
0:44 all of these PyBites
0:47 projects headers,
0:48 so, our 100 days.
0:50 Our 100 Days Of Code, our 100 Days Of Django.
0:53 These different headers.
0:54 We're going to pull all of those down
0:56 and we're just going to use that as a nice,
0:58 simple example for this script.
1:03 Let's start off our code with the standard
1:08 dunder main: if __name__ == '__main__':
1:11 Now, what are we going to do?
1:12 The first thing, obviously, as I said, is to pull down
1:15 the website, so let's create a function for that.
1:19 def pull_site, nice and creative.
1:24 We're going to use requests, so I'm going to do this
1:26 really quickly
1:28 just so that we can get a move on
1:30 because we've dealt with requests before.
1:33 So, requests.get(URL).
1:36 That will get the page and store it
1:38 in the raw site page object.
1:41 So, raw_site_page.raise_for_status()
1:47 Now, this is just to make sure that it works.
1:50 If it doesn't work, we'll get an error.
1:53 And then, we're just going to return, raw site page.
1:57 Nice and easy.
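As a sketch, the pull_site function described so far might look like this. The URL here is a stand-in, since the transcript only says it was copied and pasted; substitute the actual PyBites projects page address.

```python
import requests

# Stand-in URL; substitute the actual PyBites projects page address.
URL = "https://pybit.es/pages/projects.html"


def pull_site():
    """Download the page and return the raw response object."""
    raw_site_page = requests.get(URL)
    raw_site_page.raise_for_status()  # error out early if the request failed
    return raw_site_page
```

Calling it and storing the result gives us the site object used below: `site = pull_site()`.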
2:00 Let's just assign that to something
2:02 down here called site.
2:04 This will
2:06 assign the raw site page
2:09 to a variable called site.
2:13 Now, we'll record a video after this explaining why
2:16 this is not a good idea, what we're doing.
2:19 But, for now, as just a nice little explainer, this will do.
2:25 Let's create another function.
2:27 This function we're going to call scrape.
2:29 It's going to be used against our site object.
2:37 We need to think ahead a little bit.
2:39 I'm going to think ahead by putting this list here.
2:42 If you think about our page,
2:47 as we pull this data out of the page, these headers,
2:51 we need to store them somewhere, don't we?
2:53 We must store them in a header list.
2:58 We create the empty list.
3:00 Now we get to the Beautiful Soup 4 stuff,
3:03 and this is really, really easy, so don't worry
3:06 if you can't wrap your head around it right away.
3:08 But it's only a couple of lines of code,
3:09 which is why we all love Python, right?
3:13 We're going to create a soup object,
3:15 and it's not a bowl of soup, it's just a normal object.
3:23 bs4.BeautifulSoup, sorry.
3:26 We're going to take the text of the site.
3:29 So, site.text, we're going to take that.
3:33 That's going to be called against our Beautiful Soup 4.
3:38 We're going to use the Beautiful Soup 4
3:40 HTML parser
3:43 in order to get our sort of HTML code
3:48 nicely into this soup object.
3:52 Once we do that, we have our soup object
3:56 and an html_header_list.
4:01 This is going to be a list of our HTML headers.
4:05 You can see what we're doing.
4:06 This is already really, really simple.
4:09 We've taken
4:11 Beautiful Soup 4 and we've told it
4:14 to get the text of the site using the HTML parser
4:18 because site is going to be a HTML document,
4:21 pretty much, right?
4:23 We're going to store that in soup.
4:27 We're creating an object here
4:30 that is going to be...
4:32 I'll show you.
4:33 HTML header list equals
4:40 What are we selecting?
4:41 The select option here for soup,
4:45 it allows us to pull down exactly what we need.
4:48 We get to select something out of the HTML source.
4:53 Let's look at the HTML source again.
4:57 We'll go view page source.
5:01 We'll get to these headers.
5:03 The first header
5:05 is called
5:07 0. PyBites apps.
5:11 We find that on the page, it's going to be
5:13 in a nice little h3 class.
5:16 What's unique about it, and this is where you really
5:19 have to get thinking and analyzing this page,
5:22 the thing that's unique about all of our headers here,
5:25 so, here's zero,
5:27 here's number one down here,
5:30 but they all have the project header CSS class.
5:36 Playing with bs4 does need some tinkering.
5:40 Occasionally, you'll find someone will have reused
5:43 the same CSS class somewhere else
5:46 in the same page, so when you select it you'll get more
5:50 than just what you wanted.
5:51 But in this case, I know, because this is our site,
5:54 we've only used project header against these headers
5:59 that we want to see, these ones here.
6:03 We're going to select everything that has
6:07 the project header class.
6:10 Let's copy that.
6:11 We'll go down here, this is what we're selecting.
6:13 We have to put the dot because it is a CSS class.
6:18 And let's see it.
6:19 All this has done is we've created the soup object
6:23 of the site using the HTML parser, and then we've selected,
6:28 within our soup object, everything with the CSS class
6:32 project header.
6:34 We've stored those, we've stored everything that it finds
6:37 into HTML header list.
6:41 Easy peasy.
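Here is a sketch of the scrape function up to this point. Note that the CSS class name project-header is an assumption based on what's spoken in the lecture; check the actual page source for the exact spelling.

```python
import bs4


def scrape(site):
    """Parse the page and pull out every element with the project-header class."""
    soup = bs4.BeautifulSoup(site.text, "html.parser")
    # The leading dot means "CSS class"; 'project-header' is assumed
    # to be the class name used on the PyBites page.
    html_header_list = soup.select(".project-header")
    return html_header_list
```

select() returns the matching tags, markup and all, which is why the next step strips them down to plain text.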
6:43 Now all we need to do is iterate over this and store
6:48 the information that we need into this header list.
6:53 We'll do that.
6:54 We'll go, for headers in html_header_list:
7:02 we're going to go header_list.append,
7:06 as we know how to do that.
7:08 headers.get_text().
7:13 We're saying, just get the text.
7:17 Just to show you what that means.
7:19 Everything in here, in the class project header,
7:25 we actually got the whole h3 tag.
7:29 That soup select pulled the whole tag,
7:32 but all we wanted was the text.
7:45 That's what the get_text() method does, that's what this
7:40 get_text() does right here.
7:41 It strips out the tags, the HTML code, and gets you
7:46 just the plain string text that we asked for.
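To see the difference between the whole tag and just its text, here is a tiny standalone example; the sample h3 is made up for illustration.

```python
import bs4

# A made-up snippet in the same shape as the page's headers.
sample = '<h3 class="project-header">0. PyBites apps</h3>'
soup = bs4.BeautifulSoup(sample, "html.parser")
tag = soup.select(".project-header")[0]

print(tag)             # the whole <h3> tag, markup included
print(tag.get_text())  # just the plain string: 0. PyBites apps
```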
7:51 That's it.
7:52 We want to see these headers, so let's just quickly create
7:55 another for loop here.
7:57 for headers in header_list:
8:02 print, ooo, what did I do there?
8:05 print(headers).
8:07 And that's it.
8:09 Save that, and what this will now allow us to do
8:12 is print out the headers that we have stored
8:16 in header list in this for loop here.
8:20 Let's have a look at that and see what it looks like.
8:22 Silly me, I've forgotten one thing, we actually have to call
8:26 our scrape function.
8:28 So now we will write scrape
8:36 Save that, and let's give it a crack.
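Putting it all together, the whole scraper.py script sketched in this lecture might look like the following. The URL and the project-header class name are stand-ins based on what's described, not confirmed values.

```python
import requests
import bs4

# Stand-in URL; substitute the actual PyBites projects page address.
URL = "https://pybit.es/pages/projects.html"


def pull_site():
    """Download the page and return the raw response object."""
    raw_site_page = requests.get(URL)
    raw_site_page.raise_for_status()
    return raw_site_page


def scrape(site):
    """Collect the text of every project-header element and print it."""
    header_list = []
    soup = bs4.BeautifulSoup(site.text, "html.parser")
    html_header_list = soup.select(".project-header")  # assumed class name
    for headers in html_header_list:
        header_list.append(headers.get_text())
    for headers in header_list:
        print(headers)


# Entry point -- uncomment to run against the live page
# with `python scraper.py`:
# if __name__ == "__main__":
#     scrape(pull_site())
```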
8:39 I'll just move the screen to the right.
8:42 There's my command prompt.
8:44 Let's just run Python scraper.py.
8:48 Hit Enter.
8:51 And it worked, look at that.
8:53 There's that plain text that we asked for with the get text.
8:58 We got the first header, PyBites apps.
9:02 We got the second header, 100 Days Of Code, 100 Days Of Django,
9:06 and so on, and so forth.
9:08 Now we have a list with just our headers.
9:11 This is really cool. If you think about it,
9:12 you could use this to create a newsletter.
9:14 You could use this to save your website's headers for
9:18 who knows what, to print out in a list and stick on the wall
9:20 as a trophy.
9:22 But this is the idea of Beautiful Soup 4.
9:25 You can get a website and you can just strip out
9:28 all the tags and find the information you want,
9:31 pull it out, and then, do things with it.
9:33 So, it's really, really cool.
9:35 The next video we're going to cover some more interesting...