#100DaysOfCode in Python Transcripts
Chapter: Days 46-48: Web Scraping with BeautifulSoup4
Lecture: Concepts: what did we learn
0:00
And that, my friends, is the basic overview of web scraping with Beautiful Soup 4. There is so much to it, as I said,
0:08
but this should get you up and running, and I hope it was enough to really get you excited about web scraping and get you creating some cool stuff.
0:17
So just a quick recap of everything we did. Well, we scraped a website, and the first thing we did was import Beautiful Soup 4,
0:27
simple, simple, simple stuff. Now, we then scraped the site, okay, and then we created an empty header list,
0:39
all right, the first step was to create an empty list so that when we got all the headers, we could pop them in there.
0:47
All right, now this is the important part. We're creating the soup object, the Beautiful Soup 4 object and we do that by running that bs4.BeautifulSoup
0:57
and we tell it to get the text of the site, site.text, and parse it using the HTML parser. All right, and then, we decided to select,
1:10
okay, so we're being very specific here, we were selecting the CSS .projectheader class, so anything in our document, in our HTML document,
1:23
that had that class was going to appear, and we were very lucky, we did the research, and we found that our H3 headers,
1:31
were the only tags that used that CSS class, okay, that's why it could work. So just be careful again in case you pull a class
1:41
that quite a few HTML objects are using, 'cause then you're going to get a lot of unexpected results. All right, and then after that
1:50
the only thing worth noting here is, as we were creating our list, our header list, we were using get_text on all of those headers,
2:00
on all of those items that we selected using the projectheader class, to just get the text; we wanted the get_text option there.
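To see all of those pieces in one place, here's a minimal sketch of that kind of scraping script. The URL, the function name, and the use of the requests library are placeholders I'm assuming here, not the exact code from the lesson.

```python
import bs4
import requests

# Placeholder URL; swap in the page you actually want to scrape.
URL = "https://example.com/projects"


def pull_headers(url=URL):
    """Return the text of every element with the .projectheader CSS class."""
    site = requests.get(url)
    site.raise_for_status()

    # Build the soup object from the raw HTML using the built-in HTML parser.
    soup = bs4.BeautifulSoup(site.text, 'html.parser')

    headers = []  # empty list to collect the header text
    # select() takes a CSS selector; here, everything with class="projectheader"
    for item in soup.select('.projectheader'):
        headers.append(item.get_text())

    return headers


if __name__ == '__main__':
    for header in pull_headers():
        print(header)
```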
2:10
Okay, and that's scraping a website, pretty simple. Next we did some funky command line stuff,
2:18
using the Python Shell, and that was just to demonstrate some very simple, yet effective Beautiful Soup 4 features.
2:27
All right, so the first thing we did was we imported bs4 of course and then we created that soup object again, so skipping through that.
2:36
The first cool thing we did is we were able to search the entire site, that soup object that we created for the very first ul tag,
2:46
remembering that this sort of search only brings up the first tag, okay, and that didn't work for us. Then we wanted to find all of the ul,
2:56
the unordered list tags and while that works, that brought up everything and again, that's not what we wanted, so find_all,
3:05
will search the entire HTML document for that specific tag. All right, now, this time we decided to drill down into the main tag, so as we've covered,
3:20
you have that nice little nested feature here, where you search soup for the main tag and then we went, within the main tag,
3:28
drilled down to the unordered list, and the first unordered list it pulled was the list we wanted, but again it had all the tags in there
3:38
which we didn't actually need for our purposes. So then we did something very similar, but in this case we wanted find_all,
3:47
because if we had just specified soup.main.li, we would have only gotten the first li tag within main,
3:56
so this time we go soup.main.find_all on the li tags, find all of the li tags within that HTML document, okay, that fall underneath the main tag.
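Roughly, that shell session looks like the sketch below. The URL is a placeholder, and it assumes the page actually has a main tag; on a different page soup.main could come back as None.

```python
import bs4
import requests

site = requests.get("https://example.com")   # placeholder URL
soup = bs4.BeautifulSoup(site.text, 'html.parser')

# Attribute access returns only the FIRST matching tag (same idea as soup.find('ul')).
first_ul = soup.ul

# find_all returns every matching tag in the whole document - usually too much.
all_uls = soup.find_all('ul')

# Drilling down: the first <ul> nested inside the <main> tag (assumes <main> exists).
main_ul = soup.main.ul

# All <li> tags that fall underneath <main>.
all_li = soup.main.find_all('li')
```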
4:11
And then we stored all of that into an object called all_li, and then we just iterated over it using a for loop, pulled all of the items within there,
4:24
the individual li tags, and then we used .string to simply pull the plain text, the headers of our articles, and that was it.
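And a small, self-contained sketch of that final loop, again with a placeholder URL and assuming the page has a main tag:

```python
import bs4
import requests

site = requests.get("https://example.com")   # placeholder URL
soup = bs4.BeautifulSoup(site.text, 'html.parser')

# All the <li> tags under <main>, as in the shell session above.
all_li = soup.main.find_all('li')

for item in all_li:
    # .string gives just the plain text inside the tag, no markup
    # (get_text() is an alternative if the tag contains nested markup).
    print(item.string)
```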
4:36
Nice and easy, pretty simple stuff, the more practice you do, the better you'll get. And that was it, so your turn, very cool stuff.
4:45
I reckon the challenge for you should be to go out to one of your favorite sites, okay, find maybe the news article section
4:55
and try and pull down all of their news articles. Maybe just on one page, maybe across multiple pages, do something like that.
5:03
Rather than just the header, maybe even try and pull the very first blurb of the news article. Maybe it's a gaming news website,
5:13
could be anything you want, but just give it a try. This is now your chance to go out there and try and parse a website. If you want a really fun one,
5:22
try going to talkpython.fm and try and pull down maybe all the episodes. Either way, have fun, keep calm and code.