#100DaysOfCode in Python Transcripts
Chapter: Days 46-48: Web Scraping with BeautifulSoup4
Lecture: Detailed BS4 scraping and searching
0:00 Alrighty, now we want to actually do something interesting so let's go to our article page here on PyBites.
0:09 I want to do something like pull down every single article name that we have written. Very, very daunting so the first thing we want to do
0:19 is view the page source. Just a quick tip: if you ever get stuck and you're not too sure what is what in this page here,
0:31 you can click on Inspect. As you scroll down through the HTML code within Inspect, you'll actually be able
0:42 to see which code is which part of the page. We need to lower this down here, lower this down here. And here is our HTML code that runs the page.
0:54 As we hover down through these little tabs, these little arrows here, these drop downs, you'll see parts of the page get highlighted.
1:03 That's how you know which code is actioning which part of the page. We know this is our main here.
1:10 We find the main tag and sure enough that's highlighted. Now we can click on article here. You can see that's highlighted the section in the middle
1:18 that we want, right? We can keep drilling down. Here is the unordered list, ul, that is our list of article names.
1:28 Then here are all the list elements, the li. Open them up and there is our href link and the actual name of the article.
1:41 That's what we want to pull down. We can look at that quite simply here. But in case this was a much more complex page
1:48 such as news websites and game websites and whatever, you might want to use that inspect methodology.
1:57 Alright, now that we know we want to get the unordered list, let's do some playing in the Python shell. I've started off by importing Beautiful Soup 4
2:09 and requests. I've already done the requests.get of our articles page. We've done the raise_for_status to make sure it worked.
2:18 Now let's create our soup object. So soup equals bs4.BeautifulSoup, oops, don't need the capital O. site.text.
2:32 So this is very similar to our other script. And "html.parser" as the parser. Alright, and with that done what can we do? We can search.
2:44 We don't have to use that select method. That was very specific and that was very CSS oriented. But now we're going to look at tags, alright?
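As a minimal sketch of creating the soup object: the `html` string below is a small, hypothetical stand-in for `site.text` (it is not the real PyBites page), so the example runs without a network call. In the shell session described above, the text would come from `requests.get` instead.

```python
import bs4

# Hypothetical stand-in for site.text: a menu list plus an article list inside <main>.
html = """
<html><head><title>Articles</title></head>
<body>
<nav><ul><li><a href="/">Home</a></li><li><a href="/about.html">About</a></li></ul></nav>
<main><article><ul>
<li><a href="/one.html">Article One</a></li>
<li><a href="/two.html">Article Two</a></li>
</ul></article></main>
</body></html>
"""

# Same call as in the lecture, with the sample string in place of site.text.
soup = bs4.BeautifulSoup(html, "html.parser")

# The soup object can now be searched by tag.
print(soup.title.string)
```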
2:51 Back on our webpage, how's this for tricky? The individual links don't have specific CSS classes.
3:00 Uh-oh, so how are we going to find them? Alright, we're going to have to do some searching, right?
3:06 How about we search for the unordered list. We could just do soup.tagname, isn't that cool? Now you just have to specify the tag that you want to find.
3:18 You don't even have to put anything around it. It doesn't need to be in brackets, nothing special. Soup.ul. Hang on a minute, what do we get?
3:28 Now look at this, we got an unordered list. Ah but, we got this unordered list. We got the actual menu bar on the left. We didn't get this one.
3:42 Now why is that? That is because soup.ul, or soup.tag only returns the very first tag that it finds that matches.
3:56 If we look at that source code again you'll find we actually have another unordered list on the page. That is our menu here, okay.
4:07 Using soup.ul is not going to work, because we're only pulling this one here. What do we need to do to find the next one?
4:17 Well we need to do a little bit more digging. We could try the soup.find_all option. Let's see what that returns. So soup.find_all.
4:31 Then we specify the name of the tag that we want. We're going to specify ul. Let's hit enter and see what happens. Look at that, we've got everything.
4:41 We got all of the article names. Let's scroll up quickly here. But no we also got the very first unordered list as well. We got that same menu bar.
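Here is a small sketch of why `find_all` gives back too much. Again the `html` string is a hypothetical page with a menu list and an article list, used only for illustration.

```python
import bs4

# Hypothetical page with two unordered lists: the menu and the articles.
html = (
    "<body>"
    "<nav><ul><li><a href='/'>Home</a></li></ul></nav>"
    "<main><ul><li><a href='/one.html'>Article One</a></li></ul></main>"
    "</body>"
)
soup = bs4.BeautifulSoup(html, "html.parser")

# find_all returns every matching tag in document order, menu included.
all_ul = soup.find_all("ul")
print(len(all_ul))
```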
4:53 How do we strip that out? Well, I suppose we could go through and do regex and all sorts of crazy stuff.
5:01 But Beautiful Soup actually lets you drill down a little bit further. What makes our unordered list here unique?
5:11 On our page, on PyBites, it's the only unordered list that lives within our main tag. Look up here, here's main.
5:23 That's the opening of main and we saw that closing of main down on the bottom. And look, it's the only unordered list.
5:29 What we can do is we can do soup.main.ul. You can see how far we can actually drill down. Isn't this really cool? We run that and what do we get?
5:46 Let's go back up to where we ran the command and from soup.main.ul, we got the unordered list and we got all of the list elements with the a tags,
5:57 the hrefs, and the actual plain text in the middle. There we go, we've already gotten exactly what we want.
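The drill-down can be sketched as follows, assuming a hypothetical page where the article list is the only unordered list inside `main`.

```python
import bs4

# Hypothetical page: only the article list lives inside <main>.
html = (
    "<body>"
    "<nav><ul><li><a href='/'>Home</a></li></ul></nav>"
    "<main><article><ul>"
    "<li><a href='/one.html'>Article One</a></li>"
    "<li><a href='/two.html'>Article Two</a></li>"
    "</ul></article></main>"
    "</body>"
)
soup = bs4.BeautifulSoup(html, "html.parser")

# soup.main narrows the search to <main>; .ul then finds the first <ul> inside it.
article_ul = soup.main.ul
print(article_ul)
```

The `ul` does not have to be a direct child of `main`. The attribute lookup searches all descendants, which is why the `article` tag in between does not get in the way.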
6:06 But we actually don't need all of these tags around the side. We don't even need the URL. How are we going to get around that?
6:16 Well let's see what we can do. We don't actually need the unordered list, do we? We don't need this whole UL tag,
6:24 we only need the list elements and we know from looking at the code that these are the only list elements within the main tag.
6:34 Main tag ends here, there's no more list elements. What if we do that same find_all, but just on li. Let's go soup.main.find_all,
6:47 because we're only going to search within the main element. There we go, look at that. It's not formatted as nicely as the other one
6:57 that we saw, but it is the information that we want. You've got your list elements, you've got a URL,
6:57 and then you've got your title, and then it closes off. Let's give ourselves some white space to make this a little easier to read.
7:16 What can we do, we want to store all of that. Let's store that in a list called all_li. We're getting all of the li options.
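Storing the list elements found inside `main` might look like this. The `html` string is a hypothetical sample page; `all_li` is the name used in the lecture.

```python
import bs4

# Hypothetical page: menu items sit outside <main>, article items sit inside it.
html = (
    "<body>"
    "<nav><ul><li><a href='/'>Home</a></li></ul></nav>"
    "<main><ul>"
    "<li><a href='/one.html'>Article One</a></li>"
    "<li><a href='/two.html'>Article Two</a></li>"
    "</ul></main>"
    "</body>"
)
soup = bs4.BeautifulSoup(html, "html.parser")

# Search only within <main>, so the menu's <li> tags are left out.
all_li = soup.main.find_all("li")
print(all_li)
```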
7:27 We do soup.main.find_all li. Now all of that is stored in all_li. Now how cool is this? The text in here in each one of these list elements
7:27 is actually the string, right? You know that, we discussed it in the previous video. What we can do, we can do for title in all_li
7:50 I guess that's for each one of these items. I shouldn't really use the word title. Let's actually change that to be for item in all_li.
8:02 We're going through each one of these, just so you can follow along. What do we want? Well we want just the subject line.
8:25 We just want this string, we just want the text. How do we specify that? We go print, now this is going to be really easy
8:34 and I'll tell you what, it's just crazy sometimes how simple they make it. item.string, and that is it. Can you believe it?
8:46 To strip out all of this, all the HTML and all we want is that, item.string. You ready? How cool is that?
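Putting the loop together, a minimal sketch on a hypothetical sample page (not the real PyBites HTML) could be:

```python
import bs4

# Hypothetical page: each <li> in <main> holds exactly one <a> with the title text.
html = (
    "<body>"
    "<nav><ul><li><a href='/'>Home</a></li></ul></nav>"
    "<main><ul>"
    "<li><a href='/one.html'>Article One</a></li>"
    "<li><a href='/two.html'>Article Two</a></li>"
    "</ul></main>"
    "</body>"
)
soup = bs4.BeautifulSoup(html, "html.parser")
all_li = soup.main.find_all("li")

# item.string gives just the text, without the <li> and <a> tags or the href.
for item in all_li:
    print(item.string)

# The same titles collected into a plain list of strings.
titles = [str(item.string) for item in all_li]
```

One thing to watch for: `.string` only works when the tag has a single child all the way down. If an `li` contained extra text or whitespace alongside the link, `item.string` would be `None`, and `item.get_text()` would be the safer choice.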
9:00 We've just gone through I think it's like 100 objects, 100 articles that we've written. We've just printed out all the names.
9:08 This would be so cool to store in a database and check back on it from time to time. Email it out, keep it saved somewhere. I love this sort of stuff.
9:19 There you go, we've looked at quite a few things here. We've looked at find_all. We've looked at using just a tag name,
9:31 and we've looked at using a sort of nested tag name. So you can actually drill down through your HTML document
9:41 which is super cool to help you find nested or children objects within your HTML. That's Beautiful Soup 4.
9:51 That's pretty much it and it's absolutely wonderful. There's obviously so much more to cover and that could be a course in itself.
10:00 The documentation is wonderful. Definitely have a look through that if you get stuck. But for most people this is pretty much the bread
10:09 and butter of Beautiful Soup 4, so enjoy it, come up with cool stuff. If you do make anything cool using bs4, let us know.
10:17 Send us an email or something like that.