#100DaysOfCode in Python Transcripts
Chapter: Days 46-48: Web Scraping with BeautifulSoup4
Lecture: Detailed BS4 scraping and searching
0:00
Alrighty, now we want to actually do something interesting
0:04
so let's go to our article page here on PyBites.
0:08
I want to do something like pull down every single article
0:12
name that we have written.
0:15
Very, very daunting so the first thing we want to do
0:18
is view the page source.
0:22
If you, just a quick tip, if you ever get stuck
0:24
and you're not too sure what is what in this page here,
0:30
you can click on inspect.
0:34
As you scroll down through inspect through the code
0:37
the HTML code within the inspect, you'll actually be able
0:41
to see what is which part of the page.
0:45
We need to lower this down here, lower this down here.
0:49
And here is our HTML code that runs the page.
0:53
As we hover down through these little tabs,
0:56
these little arrows here, these drop downs,
0:59
you'll see parts of the page get highlighted.
1:02
That's how you know which code is actioning which part
1:06
of the page.
1:07
We know this is our main here.
1:09
We find the main tag and sure enough that's highlighted.
1:13
Now we can click on article here.
1:15
You can see that's highlighted the section in the middle
1:17
that we want, right?
1:19
We can keep drilling down.
1:21
Here is the unordered list, ul, that is our list
1:25
of article names.
1:27
Then here are all the list elements, the li.
1:30
Open them up and there is our href link
1:37
and the actual name of the article.
1:40
That's what we want to pull down.
1:42
We can look at that quite simply here.
1:45
But in case this was a much more complex page
1:47
such as news websites and game websites and whatever,
1:51
you might want to use that inspect methodology.
1:56
Alright, now that we know we want to get the unordered list,
2:01
let's do some playing in the Python shell.
2:05
I've started off by importing Beautiful Soup 4
2:08
and requests.
2:09
I've already done the requests.get
2:11
of our articles page.
2:14
We've done the raise_for_status to make sure it worked.
2:17
Now let's create our soup object.
2:21
So soup equals bs4.BeautifulSoup,
2:27
oops, don't need the capital O.
2:29
site.text.
2:31
So this is very similar to our other script.
2:33
And HTML parser.
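
A minimal sketch of the setup described so far. The `fetch_soup` helper, its `url` parameter, and the `sample_html` string are assumptions for illustration (the real session fetched the PyBites articles page); the sample only imitates the structure described in the video: a menu list plus a list of articles inside main.

```python
import bs4
import requests


def fetch_soup(url):
    """Hypothetical helper: download a page and parse it, as done in the shell."""
    site = requests.get(url, timeout=10)
    site.raise_for_status()  # raises an exception if the request failed
    return bs4.BeautifulSoup(site.text, "html.parser")


# Assumed stand-in for site.text, so the example runs without a network call.
sample_html = """
<nav><ul><li><a href="/">Home</a></li></ul></nav>
<main><article><ul>
<li><a href="/first.html">First article</a></li>
<li><a href="/second.html">Second article</a></li>
</ul></article></main>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")
print(type(soup))
```

Passing "html.parser" explicitly selects the parser from Python's standard library, so nothing beyond bs4 itself is needed to parse.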
2:38
Alright, and with that done what can we do?
2:41
We can search.
2:43
We don't have to use that select object.
2:44
That was very specific and that was very CSS oriented.
2:48
But now we're going to look at tags, alright?
2:50
Back on our webpage, how's this for tricky?
2:53
The individual links don't have specific CSS classes.
2:59
Uh-oh, so how are we going to, how are we going to find them?
3:02
Alright, we're going to have to do some searching right?
3:05
How about we search for the unordered list.
3:10
We could just do soup.tagname, isn't that cool?
3:14
Now you just have to specify the tag that you want to find.
3:17
You don't even have to put anything around it.
3:19
It doesn't need to be in brackets, nothing special.
3:22
soup.ul.
3:24
Hang on a minute, what do we get?
3:27
Now look at this, we got an unordered list.
3:30
Ah but, we got this unordered list.
3:36
We got the actual menu bar on the left.
3:38
We didn't get this one.
3:41
Now why is that?
3:43
That is because soup.ul,
3:47
or soup.tag only returns the very first tag
3:52
that it finds that matches.
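
To illustrate the point, here is a small sketch on made-up HTML. The `id` attributes are assumptions added only so the two lists can be told apart; the real page may not have them.

```python
import bs4

# Assumed sample: a menu list first, then the article list inside main.
sample_html = """
<nav><ul id="menu"><li><a href="/">Home</a></li></ul></nav>
<main><ul id="articles"><li><a href="/first.html">First article</a></li></ul></main>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# Attribute-style access returns only the first matching tag in the document.
first_ul = soup.ul
print(first_ul["id"])  # menu
```

Accessing `soup.ul` behaves like `soup.find("ul")`: both stop at the first match.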
3:55
If we look at that source code again you'll find
3:58
we actually have another unordered list on the page.
4:02
That is our menu here, okay.
4:06
That's not going to work; using soup.ul is not going to work,
4:09
because we're only pulling this one here.
4:13
What do we need to do to find the next one?
4:16
Well we need to do a little bit more digging.
4:19
We could try the soup.find_all options.
4:26
Let's see what that returns.
4:27
So soup.find_all.
4:30
Then we specify the name of the tag that we want.
4:33
We're going to specify ul.
4:35
Let's hit enter and see what happens.
4:37
Look at that, we've got everything.
4:40
We got all of the article names.
4:42
Let's scroll up quickly here.
4:44
But no we also got the very first unordered list as well.
4:50
We got that same menu bar.
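
A quick sketch of that behaviour on assumed sample HTML (the `id` attributes are hypothetical labels, not taken from the real page):

```python
import bs4

# Assumed sample: a menu list and an article list.
sample_html = """
<nav><ul id="menu"><li><a href="/">Home</a></li></ul></nav>
<main><ul id="articles"><li><a href="/first.html">First article</a></li></ul></main>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag in document order, menu included.
all_ul = soup.find_all("ul")
print(len(all_ul))                 # 2
print([ul["id"] for ul in all_ul])  # ['menu', 'articles']
```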
4:52
How do we strip that out?
4:53
Well, I suppose we could go through and do regex
4:58
and all sorts of crazy stuff.
5:00
But Beautiful Soup actually lets you drill down
5:03
a little bit further.
5:05
What makes our unordered list here unique?
5:10
On our page, on PyBites, it's the only unordered list
5:15
that lives within our main tag.
5:19
Look up here, here's main.
5:22
That's the opening of main and we saw that closing of main
5:24
down on the bottom.
5:25
And look, it's the only unordered list.
5:28
What we can do is we can do soup.main.ul.
5:37
You can see how far we can actually drill down.
5:39
Isn't this really cool?
5:41
We run that and what do we get?
5:45
Let's go back up to where we ran the command
5:47
and from soup.main.ul, we got the unordered list
5:51
and we got all of the list elements with the a tags,
5:56
the hrefs, and the actual plain text in the middle.
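
The drill-down can be sketched like this, again on assumed sample HTML with hypothetical `id` labels:

```python
import bs4

# Assumed sample: only one unordered list lives inside main.
sample_html = """
<nav><ul id="menu"><li><a href="/">Home</a></li></ul></nav>
<main><ul id="articles">
<li><a href="/first.html">First article</a></li>
<li><a href="/second.html">Second article</a></li>
</ul></main>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# soup.main narrows the search to the main tag; .ul then finds the
# first unordered list inside it, skipping the menu entirely.
article_list = soup.main.ul
print(article_list["id"])  # articles
print(article_list)
```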
6:01
There we go, we've already gotten exactly what we want.
6:05
But we actually don't need all of these tags
6:09
around the side.
6:10
We don't even need the URL.
6:13
How are we going to get around that?
6:15
Well let's see what we can do.
6:18
We don't actually need the unordered list, do we?
6:21
We don't need this whole UL tag,
6:23
we only need the list elements and we know
6:26
from looking at the code that these are the only
6:30
list elements within the main tag.
6:33
Main tag ends here, there's no more list elements.
6:36
What if we do that same find_all, but just on the list elements?
6:42
Let's go soup.main.find_all,
6:46
because we're only going to search within the main element.
6:53
There we go, look at that.
6:54
It's not formatted as nicely as the other one
6:56
that we saw, but it is the information that we want.
7:01
You've got your list elements, you've got a URL,
7:04
and then you've got your title, and then it closes off.
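
Roughly what that call looks like, sketched on assumed sample HTML (the article names and URLs are made up):

```python
import bs4

# Assumed sample: the menu has its own li, but it sits outside main.
sample_html = """
<nav><ul><li><a href="/">Home</a></li></ul></nav>
<main><ul>
<li><a href="/first.html">First article</a></li>
<li><a href="/second.html">Second article</a></li>
</ul></main>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# Searching from soup.main limits find_all to tags nested inside main.
main_items = soup.main.find_all("li")
print(len(main_items))          # 2
print(len(soup.find_all("li")))  # 3, because this also counts the menu item
```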
7:09
Let's give ourselves some white space to make this
7:13
a little easier to read.
7:15
What can we do, we want to store all of that.
7:18
Let's store that in a list called all_li.
7:22
We're getting all of the li options.
7:26
We do soup.main.find_all('li').
7:34
Now all of that is stored in all_li.
7:39
Now how cool is this?
7:42
The text in here in each one of these list elements
7:49
is actually the string, right?
7:51
You know that, we discussed it in the previous video.
7:54
What we can do, we can do for title in all_li
8:01
I guess for items for each one of these items.
8:04
I shouldn't really use the word title.
8:06
Let's actually change that to be for item
8:10
in all_li.
8:13
It's this: we're going through each one of these
8:18
just so you can follow along.
8:21
What do we want?
8:22
Well we want just the subject line.
8:24
We just want this string, we just want the text.
8:27
How do we specify that?
8:29
We go print, now this is going to be really easy
8:33
and I'll tell you what, it's just crazy sometimes
8:36
how simple they make it.
8:39
item.string, and that is it.
8:44
Can you believe it?
8:45
To strip out all of this, all the HTML and all we want
8:50
is that, item.string.
8:53
You ready?
8:58
How cool is that?
8:59
We've just gone through I think it's like 100 objects,
9:03
100 articles that we've written.
9:05
We've just printed out all the names.
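
Putting the last steps together, a minimal sketch on assumed sample HTML (two made-up articles instead of the hundred or so on the real page):

```python
import bs4

# Assumed sample: each li holds a single a tag with the article name.
sample_html = """
<main><ul>
<li><a href="/first.html">First article</a></li>
<li><a href="/second.html">Second article</a></li>
</ul></main>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")
all_li = soup.main.find_all("li")

titles = []
for item in all_li:
    # .string reaches through the single nested a tag to its text.
    print(item.string)
    titles.append(str(item.string))
```

One caveat: `.string` is None when a tag has more than one child, for example if an li contained extra text next to the link. In that case `item.get_text()` is a possible fallback.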
9:07
This would be so cool to store in a database
9:10
and check back on it from time to time.
9:13
Email it out, keep it saved somewhere.
9:16
I love this sort of stuff.
9:18
There you go, we've looked at quite a few things here.
9:22
We've looked at find_all.
9:24
We've looked at using just a tag name,
9:30
and we've looked at using a sort of nested tag name.
9:36
So you can actually drill down through your HTML document
9:40
which is super cool to help you find nested
9:44
or children objects within your HTML.
9:48
That's Beautiful Soup 4.
9:50
That's pretty much it and it's absolutely wonderful.
9:55
There's obviously so much more to cover
9:57
and that could be a course in itself.
9:59
The documentation is wonderful.
10:02
Definitely have a look through that if you get stuck.
10:04
But for most people this is pretty much the bread
10:08
and butter of Beautiful Soup 4,
10:10
so enjoy it, come up with cool stuff.
10:12
If you do make anything cool using bs4, let us know.
10:16
Send us an email or something like that.