#100DaysOfCode in Python Transcripts
Chapter: Days 46-48: Web Scraping with BeautifulSoup4
Lecture: Concepts: what did we learn
0:00
And that my friends, is the basic overview
0:03
of web scraping with Beautiful Soup 4.
0:06
There is so much to it as I said,
0:07
but this should get you up and running,
0:09
and I hope it was enough to really get you excited
0:12
about web scraping and get you creating some cool stuff.
0:16
So just a quick recap of everything we did.
0:20
Well we scraped a website, and the first thing we did,
0:24
was we imported Beautiful Soup 4,
0:26
simple, simple, simple stuff.
0:29
Now, we then scraped the site, okay,
0:35
and then we created an empty header list,
0:38
all right, the first step was to create an empty list
0:41
so that when we got all the headers,
0:43
we could pop them in there.
0:46
All right, now this is the important part.
0:47
We're creating the soup object,
0:49
the Beautiful Soup 4 object and we do that by
0:53
running that bs4.BeautifulSoup
0:56
and we tell it to get the text of the site, site.text
1:00
and parse it using the HTML parser.
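As a rough sketch, that first chunk of the script looked something like this; the URL and the variable names (site, header_list) are just illustrative stand-ins for the ones used in the demo:

```python
import bs4
import requests

# Illustrative URL, swap in the site you actually want to scrape.
URL = 'https://example.com/projects'

# Scrape the site: download the raw HTML.
site = requests.get(URL)
site.raise_for_status()

# Empty list that will hold the header text we pull out later.
header_list = []

# Create the Beautiful Soup 4 object from the page text,
# using Python's built-in html.parser.
soup = bs4.BeautifulSoup(site.text, 'html.parser')
```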
1:04
All right, and then, we decided to select,
1:09
okay so we're being very specific here,
1:13
we were selecting the CSS .projectheader class,
1:18
so anything in our document, in our HTML document,
1:22
that had that class was going to appear,
1:25
and we were very lucky, we did the research,
1:27
and we found that our H3 headers,
1:30
were the only tags that used that CSS class,
1:35
okay, that's why it could work.
1:37
So just be careful again in case you pull a class
1:40
that quite a few HTML objects are using,
1:44
'cause then you're going to get a lot of unexpected results.
1:48
All right, and then after that
1:49
the only thing worth noting here,
1:51
is as we were creating our list, our header list,
1:55
we were using get_text() on all of those headers,
1:59
on all of those items that we selected
2:02
using the projectheader class, to just get the text,
2:06
we wanted the get_text() option there.
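Put together, that select-and-extract step looked roughly like this; the .projectheader class comes straight from the demo, while the loop shape and the header_list name are just a sketch:

```python
# Select every element that uses the .projectheader CSS class.
# On the demo site only the h3 headers used it, which is why this worked.
for header in soup.select('.projectheader'):
    # get_text() strips away the tags and keeps just the text.
    header_list.append(header.get_text())

print(header_list)
```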
2:09
Okay, and that's scraping a website,
2:13
pretty simple, next we did some funky command line stuff,
2:17
using the Python Shell, and that was just to demonstrate
2:21
some very simple, yet effective Beautiful Soup 4 features.
2:26
All right, so the first thing we did was
2:28
we imported bs4 of course and then we
2:31
created that soup object again, so skipping through that.
2:35
The first cool thing we did was we were able to
2:37
search the entire site, that soup object that we created
2:42
for the very first ul tag,
2:45
remembering that this sort of search,
2:48
only brings up the first tag, okay
2:50
and that didn't work for us.
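In the shell, that first experiment was roughly this sketch, reusing the same soup object:

```python
# Dot access returns only the very first matching tag in the document,
# which here was not the list we were after.
first_ul = soup.ul
print(first_ul)
```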
2:52
Then we wanted to find all of the ul,
2:55
the unordered list tags, and while that worked,
2:59
that brought up everything and again,
3:01
that's not what we wanted, so find_all,
3:04
will search the entire HTML site, for that specific tag.
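Something like this, as a sketch:

```python
# find_all() searches the whole document and returns every <ul> tag,
# which was far more than we actually wanted.
all_uls = soup.find_all('ul')
print(len(all_uls))
```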
3:12
All right, now, this time we decided to
3:15
drill down into the main tag, so as we've covered,
3:19
you have that nice little nested feature here,
3:23
where you search soup for the main tag
3:25
and then we went, within the main tag,
3:27
drilled down to the unordered list,
3:30
and the first unordered list it pulled,
3:32
was the list we wanted, but again it had ul tags in there
3:37
which we didn't actually need for our purposes.
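The drill-down version was roughly:

```python
# First matching <main> tag, then the first <ul> inside it.
# This was the right list, but it still came back wrapped in its tags.
print(soup.main.ul)
```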
3:40
So then we did something very similar,
3:43
but in this case we wanted find_all,
3:46
because if we had just specified soup.main.li,
3:51
we would have only gotten the first li tag within main,
3:55
so this time we go soup.main.find_all on the li tags,
4:01
find all of the li tags within that HTML document,
4:06
okay, that fall underneath the main tag.
4:10
And then we stored all of that into an object called all_li,
4:17
and then we just iterated over it using a for loop,
4:20
pulled all of the items within there,
4:23
the individual li tags, and then we used .string
4:28
to simply pull the plain text,
4:32
the headers of our articles and that was it.
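As a sketch, that last step looked something like this:

```python
# Every <li> tag that falls underneath the <main> tag.
all_li = soup.main.find_all('li')

# .string pulls just the plain text inside each <li>:
# the headers of our articles.
for li in all_li:
    print(li.string)
```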
4:35
Nice and easy, pretty simple stuff,
4:37
the more practice you do, the better you'll get.
4:39
And that was it, so your turn, very cool stuff.
4:44
Go out, I reckon the challenge for you should be
4:47
to go out to one of your favorite sites, okay,
4:50
find maybe the news article section
4:54
and try and pull down all of their news articles.
4:57
Maybe just on one page, maybe across multiple pages,
5:00
do something like that.
5:02
Even try and pull, rather than just the header,
5:04
maybe try and pull the very first blurb of the news article.
5:09
Do that; maybe it's a gaming news website,
5:12
could be anything you want, but just give it a try.
5:15
This is now your chance to go out there
5:17
and try and parse a website.
5:20
If you want a really fun one,
5:21
try going to talkpython.fm and try
5:24
and pull down maybe all the episodes.
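If you want a nudge to get started on that one, a sketch like the one below will fetch the homepage and print the text of every link; figuring out which tags or classes actually hold the episode titles is deliberately left to you, so nothing about the page structure is assumed here:

```python
import bs4
import requests

site = requests.get('https://talkpython.fm')
site.raise_for_status()
soup = bs4.BeautifulSoup(site.text, 'html.parser')

# First look around: print the text of every link on the page,
# then inspect the HTML to find the tags that hold the episodes.
for link in soup.find_all('a'):
    text = link.get_text(strip=True)
    if text:
        print(text)
```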
5:26
Either way, have fun, keep calm and code.