#100DaysOfCode in Python Transcripts
Chapter: Days 46-48: Web Scraping with BeautifulSoup4
Lecture: Concepts: what did we learn
0:00 And that my friends, is the basic overview
0:03 of web scraping with Beautiful Soup 4.
0:06 There is so much to it as I said,
0:07 but this should get you up and running,
0:09 and I hope it was enough to really get you excited
0:12 about web scraping and get you creating some cool stuff.
0:16 So just a quick recap of everything we did.
0:20 Well we scraped a website, and the first thing we did,
0:24 was we imported Beautiful Soup 4,
0:26 simple, simple, simple stuff.
0:29 Now, we then scraped the site, okay,
0:35 and then we created an empty header list,
0:38 all right, the first step was to create an empty list
0:41 so that when we got all the headers,
0:43 we could put them in there.
0:46 All right, now this is the important part.
0:47 We're creating the soup object,
0:49 the Beautiful Soup 4 object and we do that by
0:53 running that bs4.BeautifulSoup
0:56 and we tell it to get the text of the site, site.text
1:00 and parse it using the HTML parser.
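As a rough sketch of that step: the snippet below builds a soup object from an HTML string with the built-in "html.parser". The html string here is a made-up stand-in for site.text (an assumption for illustration, not the page used in the lecture), so the example runs without a network call.

```python
import bs4

# Hypothetical stand-in for site.text; in the lecture this text came
# from the page that had just been downloaded.
html = """
<html>
  <body>
    <main>
      <h3 class="projectheader">First project</h3>
    </main>
  </body>
</html>
"""

# Build the soup object using Python's built-in HTML parser.
soup = bs4.BeautifulSoup(html, "html.parser")

# The soup object can now be searched; here we peek at the first h3 tag.
print(soup.h3.get_text())
```

With a real page you would pass the response text instead of the hard-coded string.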
1:04 All right, and then, we decided to select,
1:09 okay so we're being very specific here,
1:13 we were selecting the CSS .projectheader class,
1:18 so anything in our document, in our HTML document,
1:22 that had that class was going to appear,
1:25 and we were very lucky, we did the research,
1:27 and we found that our H3 headers,
1:30 were the only tags that used that CSS class,
1:35 okay, that's why it could work.
1:37 So just be careful again in case you pull a class
1:40 that quite a few HTML objects are using,
1:44 'cause then you're going to get a lot of unexpected results.
1:48 All right, and then after that
1:49 the only thing worth noting here,
1:51 is as we were creating our list, our header list,
1:55 we were using get_text on all of those headers,
1:59 on all of those items that we selected
2:02 using the projectheader class, to just get the text,
2:06 we wanted the get_text option there.
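The select-then-get_text step might look something like this. The HTML and the header_list name are illustrative assumptions; only the .projectheader class and the get_text call come from the recap.

```python
import bs4

# Made-up HTML for illustration: two headers share the class,
# and one paragraph does not.
html = """
<h3 class="projectheader">Project one</h3>
<p class="intro">Not a header</p>
<h3 class="projectheader">Project two</h3>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# select() takes a CSS selector; ".projectheader" matches every tag
# that carries that class, whatever the tag name is.
headers = soup.select(".projectheader")

# get_text() strips the tags and keeps only the text of each match.
header_list = [header.get_text() for header in headers]

print(header_list)
```

If another kind of tag also used that class, it would show up in header_list too, which is why it pays to check the page source first.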
2:09 Okay, and that's scraping a website,
2:13 pretty simple, next we did some funky command line stuff,
2:17 using the Python Shell, and that was just to demonstrate
2:21 some very simple, yet effective Beautiful Soup 4 features.
2:26 All right, so the first thing we did was
2:28 we imported bs4 of course and then we
2:31 created that soup object again, so skipping through that.
2:35 The first cool thing we did is we were able to
2:37 search the entire site, that soup object that we created
2:42 for the very first ul tag,
2:45 remembering that this sort of search,
2:48 only brings up the first tag, okay
2:50 and that didn't work for us.
2:52 Then we wanted to find all of the ul,
2:55 the unordered list tags and while that works,
2:59 that brought up everything and again,
3:01 that's not what we wanted, so find_all,
3:04 will search the entire HTML site, for that specific tag.
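To make the difference concrete, here is a small sketch comparing the first-match search with find_all. The HTML (a nav list plus a main list) is an assumption made up for the example.

```python
import bs4

# Made-up page with two unordered lists: one in nav, one in main.
html = """
<nav><ul><li>Home</li></ul></nav>
<main><ul><li>Article one</li><li>Article two</li></ul></main>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag (soup.ul is a shorthand
# for the same thing), so here we get the nav list, not the main one.
first_ul = soup.find("ul")

# find_all() returns every matching tag in the whole document.
all_ul = soup.find_all("ul")

print(first_ul)
print(len(all_ul))
```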
3:12 All right, now, this time we decided to
3:15 drill down into the main tag, so as we've covered,
3:19 you have that nice little nested feature here,
3:23 where you search soup for the main tag
3:25 and then we went, within the main tag,
3:27 drilled down to the unordered list,
3:30 and the first unordered list it pulled,
3:32 was the list we wanted, but again it had ul tags in there
3:37 which we didn't actually need for our purposes.
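A sketch of that nested lookup, again on made-up HTML: soup.main.ul narrows the search to the first ul inside main, but the result is still the whole tag, ul wrapper included.

```python
import bs4

# Made-up page: the list we want lives inside <main>.
html = """
<nav><ul><li>Home</li></ul></nav>
<main><ul><li>Article one</li><li>Article two</li></ul></main>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Attribute access chains: first <main>, then the first <ul> within it.
main_ul = soup.main.ul

# The result is a Tag, so printing it shows the surrounding <ul> markup
# as well as the list items.
print(main_ul)
```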
3:40 So then we did something very similar,
3:43 but in this case we wanted find_all,
3:46 because if we had just specified soup.main.li,
3:51 we would have only gotten the first list item within main,
3:55 so this time we go soup.main.find_all on the li tags,
4:01 find all of the li tags within that HTML document,
4:06 okay, that fall underneath the main tag.
4:10 And then we stored all of that into an object called all_li,
4:17 and then we just iterated over it using a for loop,
4:20 pulled all of the items within there,
4:23 the individual li tags, and then we used .string
4:28 to simply pull the plain text,
4:32 the headers of our articles and that was it.
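Putting those last steps together, a minimal sketch could look like the following. The all_li name comes from the recap; the HTML and the titles list are assumptions added for illustration.

```python
import bs4

# Made-up page: article titles are list items inside <main>.
html = """
<nav><ul><li>Home</li></ul></nav>
<main><ul><li>Article one</li><li>Article two</li></ul></main>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# soup.main.li would give only the first list item inside main;
# find_all("li") collects every li tag underneath main.
all_li = soup.main.find_all("li")

# .string gives the plain text of a tag that has a single text child.
titles = []
for li in all_li:
    titles.append(li.string)
    print(li.string)
```

Note that .string returns None when a tag has more than one child, so get_text() can be the safer choice on messier pages.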
4:35 Nice and easy, pretty simple stuff,
4:37 the more practice you do, the better you'll get.
4:39 And that was it, so your turn, very cool stuff.
4:44 Go out, I reckon the challenge for you should be
4:47 to go out to one of your favorite sites, okay,
4:50 find maybe the news article section
4:54 and try and pull down all of their news articles.
4:57 Maybe just on one page, maybe across multiple pages,
5:00 do something like that.
5:02 Even try and pull, rather than just the header,
5:04 maybe try and pull the very first blurb of the news article.
5:09 Do that; maybe it's a game news website,
5:12 could be anything you want, but just give it a try.
5:15 This is now your chance to go out there
5:17 and try and parse a website.
5:20 If you want a really fun one,
5:21 try going to talkpython.fm and try
5:24 and pull down maybe all the episodes.
5:26 Either way, have fun, keep calm and code.