#100DaysOfCode in Python Transcripts
Chapter: Days 46-48: Web Scraping with BeautifulSoup4
Lecture: Detailed BS4 scraping and searching
0:00 Alrighty, now we want to actually do something interesting
0:04 so let's go to our article page here on PyBites.
0:08 I want to do something like pull down every single article
0:12 name that we have written.
0:15 Very, very daunting so the first thing we want to do
0:18 is view the page source.
0:22 If you, just a quick tip, if you ever get stuck
0:24 and you're not too sure what is what in this page here,
0:30 you can click on inspect.
0:34 As you scroll down through inspect through the code
0:37 the HTML code within the inspect, you'll actually be able
0:41 to see what is which part of the page.
0:45 We need to lower this down here, lower this down here.
0:49 And here is our HTML code that runs the page.
0:53 As we hover down through these little tabs,
0:56 these little arrows here, these drop downs,
0:59 you'll see parts of the page get highlighted.
1:02 That's how you know which code is actioning which part
1:06 of the page.
1:07 We know this is our main here.
1:09 We find the main tag and sure enough that's highlighted.
1:13 Now we can click on article here.
1:15 You can see that's highlighted the section in the middle
1:17 that we want, right?
1:19 We can keep drilling down.
1:21 Here is the unordered list, ul, that is our list
1:25 of article names.
1:27 Then here are all the list elements, the li.
1:30 Open them up and there is our href link
1:37 and the actual name of the article.
1:40 That's what we want to pull down.
1:42 We can look at that quite simply here.
1:45 But in case this was a much more complex page
1:47 such as news websites and game websites and whatever,
1:51 you might want to use that inspect methodology.
1:56 Alright, now that we know we want to get the unordered list,
2:01 let's do some playing in the Python shell.
2:05 I've started off by importing Beautiful Soup 4
2:08 and requests.
2:09 I've already done the requests.get
2:11 of our articles page.
2:14 We've done the raise_for_status to make sure it worked.
2:17 Now let's create our soup object.
2:21 So soup equals bs4.BeautifulSoup,
2:27 oops, don't need the capital O.
2:31 So this is very similar to our other script.
2:33 And html.parser.
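A minimal sketch of the shell session up to this point. The lecture does not show the URL, so `fetch_soup` and its `url` parameter are hypothetical names, and the demo underneath parses a small made-up HTML string rather than the real articles page.

```python
import bs4


def fetch_soup(url):
    """Download a page and parse it, mirroring the shell steps described above."""
    import requests  # imported here so the offline demo below only needs bs4

    response = requests.get(url)
    response.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx response
    return bs4.BeautifulSoup(response.text, "html.parser")


# Offline demo with made-up markup, so nothing is downloaded.
sample_html = "<html><head><title>Articles</title></head><body><main></main></body></html>"
soup = bs4.BeautifulSoup(sample_html, "html.parser")
print(soup.title.string)  # Articles
```

Passing "html.parser" explicitly selects Python's built-in parser, so no extra parser package is assumed.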
2:38 Alright, and with that done what can we do?
2:41 We can search.
2:43 We don't have to use that select object.
2:44 That was very specific and that was very CSS oriented.
2:48 But now we're going to look at tags, alright?
2:50 Back on our webpage, how's this for tricky?
2:53 The individual links don't have specific CSS classes.
2:59 Uh-oh, so how are we going to, how are we going to find them?
3:02 Alright, we're going to have to do some searching right?
3:05 How about we search for the unordered list.
3:10 We could just do soup.tagname, isn't that cool?
3:14 Now you just have to specify the tag that you want to find.
3:17 You don't even have to put anything around it.
3:19 It doesn't need to be in brackets, nothing special.
3:24 Hang on a minute, what do we get?
3:27 Now look at this, we got an unordered list.
3:30 Ah but, we got this unordered list.
3:36 We got the actual menu bar on the left.
3:38 We didn't get this one.
3:41 Now why is that?
3:43 That is because soup.ul,
3:47 or soup.tag only returns the very first tag
3:52 that it finds that matches.
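A small sketch of that first-match behaviour. The markup is made up to resemble the layout described here (a menu list first, then the article list inside main); it is not the real PyBites page.

```python
import bs4

# Made-up markup: a menu list first, then the article list inside <main>.
html = (
    '<nav><ul><li><a href="/">Home</a></li></ul></nav>'
    '<main><article><ul>'
    '<li><a href="/first.html">First article</a></li>'
    '<li><a href="/second.html">Second article</a></li>'
    '</ul></article></main>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

first_ul = soup.ul  # shorthand for soup.find("ul"): first match only
print(first_ul.get_text())  # Home
```

Only the menu list comes back, because it appears first in the document.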
3:55 If we look at that source code again you'll find
3:58 we actually have another unordered list on the page.
4:02 That is our menu here, okay.
4:06 That's not going to work, using soup.ul is not going to work,
4:09 because we're only pulling this one here.
4:13 What do we need to do to find the next one?
4:16 Well we need to do a little bit more digging.
4:19 We could try the soup.find_all options.
4:26 Let's see what that returns.
4:27 So soup.find_all.
4:30 Then we specify the name of the tag that we want.
4:33 We're going to specify ul.
4:35 Let's hit enter and see what happens.
4:37 Look at that, we've got everything.
4:40 We got all of the article names.
4:42 Let's scroll up quickly here.
4:44 But no, we also got the very first unordered list as well.
4:50 We got that same menu bar.
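A sketch of what find_all gives back, again on made-up markup with one menu list and one article list. The variable name `all_ul` is just for illustration.

```python
import bs4

# Made-up markup: a menu list first, then the article list inside <main>.
html = (
    '<nav><ul><li><a href="/">Home</a></li></ul></nav>'
    '<main><article><ul>'
    '<li><a href="/first.html">First article</a></li>'
    '<li><a href="/second.html">Second article</a></li>'
    '</ul></article></main>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

all_ul = soup.find_all("ul")  # every <ul> in the document, in page order
print(len(all_ul))  # 2
for ul in all_ul:
    print([a.get_text() for a in ul.find_all("a")])
# ['Home']
# ['First article', 'Second article']
```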
4:52 How do we strip that out?
4:53 Well, I suppose we could go through and do regex
4:58 and all sorts of crazy stuff.
5:00 But Beautiful Soup actually lets you drill down
5:03 a little bit further.
5:05 What makes our unordered list here unique?
5:10 On our page, on PyBites, it's the only unordered list
5:15 that lives within our main tag.
5:19 Look up here, here's main.
5:22 That's the opening of main and we saw that closing of main
5:24 down on the bottom.
5:25 And look, it's the only unordered list.
5:28 What we can do is we can do soup.main.ul.
5:37 You can see how far we can actually drill down.
5:39 Isn't this really cool?
5:41 We run that and what do we get?
5:45 Let's go back up to where we ran the command
5:47 and from soup.main.ul, we got the unordered list
5:51 and we got all of the list elements with the a tags,
5:56 the hrefs, and the actual plain text in the middle.
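A sketch of the nested lookup, assuming a page where the article list is the only unordered list inside main. The markup and the name `article_list` are made up for illustration.

```python
import bs4

# Made-up markup: a menu list first, then the article list inside <main>.
html = (
    '<nav><ul><li><a href="/">Home</a></li></ul></nav>'
    '<main><article><ul>'
    '<li><a href="/first.html">First article</a></li>'
    '<li><a href="/second.html">Second article</a></li>'
    '</ul></article></main>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

# soup.main narrows the search to <main>; .ul then finds the first <ul> inside it.
article_list = soup.main.ul
print(article_list.prettify())
```

The menu list is skipped because it sits outside main. The lookup still works even though the list is nested one level deeper, inside article.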
6:01 There we go, we've already gotten exactly what we want.
6:05 But we actually don't need all of these tags
6:09 around the side.
6:10 We don't even need the URL.
6:13 How are we going to get around that?
6:15 Well let's see what we can do.
6:18 We don't actually need the unordered list, do we?
6:21 We don't need this whole UL tag,
6:23 we only need the list elements and we know
6:26 from looking at the code that these are the only
6:30 list elements within the main tag.
6:33 Main tag ends here, there's no more list elements.
6:36 What if we do that same find_all, but just on li.
6:42 Let's go soup.main.find_all,
6:46 because we're only going to search within the main element.
6:53 There we go, look at that.
6:54 It's not formatted as nicely as the other one
6:56 that we saw, but it is the information that we want.
7:01 You've got your list elements, you've got a URL,
7:04 and then you've got your title, and then it closes off.
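A sketch of scoping find_all to main, on made-up markup. The name `items` is hypothetical.

```python
import bs4

# Made-up markup: a menu list first, then the article list inside <main>.
html = (
    '<nav><ul><li><a href="/">Home</a></li></ul></nav>'
    '<main><article><ul>'
    '<li><a href="/first.html">First article</a></li>'
    '<li><a href="/second.html">Second article</a></li>'
    '</ul></article></main>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

# Only <li> tags that live somewhere inside <main> are returned.
items = soup.main.find_all("li")
print(len(items))  # 2
for li in items:
    print(li.a["href"], li.a.get_text())
# /first.html First article
# /second.html Second article
```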
7:09 Let's give ourselves some white space to make this
7:13 a little easier to read.
7:15 What can we do, we want to store all of that.
7:18 Let's store that in a list called all_li.
7:22 We're getting all of the li options.
7:26 We do soup.main.find_all('li').
7:34 Now all of that is stored in all_li.
7:39 Now how cool is this?
7:42 The text in here in each one of these list elements
7:49 is actually the string, right?
7:51 You know that, we discussed it in the previous video.
7:54 What we can do, we can do for title in all_li
8:01 I guess for items for each one of these items.
8:04 I shouldn't really use the word title.
8:06 Let's actually change that to be for item
8:10 in all_li.
8:13 It's these items we're going through, each one of these,
8:18 just so you can follow along.
8:21 What do we want?
8:22 Well we want just the subject line.
8:24 We just want this string, we just want the text.
8:27 How do we specify that?
8:29 We go print, now this is going to be really easy
8:33 and I'll tell you what, it's just crazy sometimes
8:36 how simple they make it.
8:39 item.string, and that is it.
8:44 Can you believe it?
8:45 To strip out all of this, all the HTML and all we want
8:50 is that, item.string.
8:53 You ready?
8:58 How cool is that?
8:59 We've just gone through I think it's like 100 objects,
9:03 100 articles that we've written.
9:05 We've just printed out all the names.
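A sketch of the loop just described, run against made-up markup instead of the real page. The `titles` list is an extra added here for illustration.

```python
import bs4

# Made-up markup: a menu list first, then the article list inside <main>.
html = (
    '<nav><ul><li><a href="/">Home</a></li></ul></nav>'
    '<main><article><ul>'
    '<li><a href="/first.html">First article</a></li>'
    '<li><a href="/second.html">Second article</a></li>'
    '</ul></article></main>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

all_li = soup.main.find_all("li")
for item in all_li:
    print(item.string)
# First article
# Second article

# Collect the names as plain str values for later use.
titles = [str(item.string) for item in all_li]
```

One caveat: .string is None when a tag has more than one child, for example an li holding a link plus extra text. In that case item.get_text() is a common fallback.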
9:07 This would be so cool to store in a database
9:10 and check back on it from time to time.
9:13 Email it out, keep it saved somewhere.
9:16 I love this sort of stuff.
9:18 There you go, we've looked at quite a few things here.
9:22 We've looked at find_all.
9:24 We've looked at using just a tag name,
9:30 and we've looked at using a sort of nested tag name.
9:36 So you can actually drill down through your HTML document
9:40 which is super cool to help you find nested
9:44 or children objects within your HTML.
9:48 That's Beautiful Soup 4.
9:50 That's pretty much it and it's absolutely wonderful.
9:55 There's obviously so much more to cover
9:57 and that could be a course in itself.
9:59 The documentation is wonderful.
10:02 Definitely have a look through that if you get stuck.
10:04 But for most people this is pretty much the bread
10:08 and butter of Beautiful Soup 4,
10:10 so enjoy it, come up with cool stuff.
10:12 If you do make anything cool using bs4, let us know.
10:16 Send us an email or something like that.