#100DaysOfWeb in Python Transcripts
Chapter: Days 73-76: Web Scraping
Lecture: What did we learn?

0:00 Well, that was web scraping: a very brief overview of web scraping in Python. We used Beautiful Soup 4 and we used newspaper3k
0:09 to do some very interesting things and I really hope you enjoyed it. Let's quickly go through everything that we learned.
0:16 First, we're going to look at Beautiful Soup 4 and requests. We import the bs4 module, that's how we get started, and requests as well.
0:26 Pretty much everything that you do with web scraping will involve requests, so that's a default. Next, we specify the URL of the site we're scraping.
0:35 That was the training.talkPython website. And the reason we store that in a URL variable is purely to make our code readable.
0:47 We use requests.get, so we're doing a GET request to pull down the URL. We actually pull down the entire page.
0:56 And then we check to see if that was successful using raise_for_status. And finally, we use Beautiful Soup 4 to create a soup object,
1:06 and that simply takes the page that we pulled down, takes the text of that, and runs the HTML parser against it, all part of Beautiful Soup 4,
1:16 and creates our soup object that we can then interact with. And now, to interact with that, we have a whole bunch of different functionality here.
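The setup steps just described can be sketched roughly as follows. This is a minimal sketch, not the course's exact script: the helper names `make_soup` and `fetch_soup`, the timeout value, and the sample HTML are assumptions made for illustration, and Python's built-in `html.parser` is assumed as the parser.

```python
import bs4
import requests


def make_soup(html_text):
    """Parse an HTML string into a soup object with the built-in parser."""
    return bs4.BeautifulSoup(html_text, "html.parser")


def fetch_soup(url):
    """Download a page, fail loudly on an HTTP error, and return its soup."""
    page = requests.get(url, timeout=10)  # GET request for the whole page
    page.raise_for_status()               # raises requests.HTTPError on 4xx/5xx
    return make_soup(page.text)


# Offline demonstration with a made-up page, so nothing is downloaded here.
SAMPLE_HTML = "<html><head><title>Demo page</title></head><body></body></html>"
soup = make_soup(SAMPLE_HTML)
print(soup.title.string)
```

Calling `fetch_soup` with the URL of the page you want to scrape should give back the same kind of object as `soup` here.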
1:25 To start with, we can return the title of the page using soup.title, very easy. Then we can use select to pull down all of a specific tag,
1:36 and we used the div tag to demonstrate that, so we were able to use select('div') and we returned all of the divs and their content within that page.
1:48 Next, we can use select again to return all items, but this time we were able to specify a CSS class.
1:56 So, that's what the dot denotes there, so .bundlelist, and we're able to pull down anything in our article, or in our page, I should say,
2:05 that had bundlelist as the CSS class. Select again: for the production purposes
2:15 of our Talk Python script, we did .select on the h3 tag to get all of the headers, or the titles, of our courses.
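The title lookup and the three select calls can be tried out on a small page. This is only a sketch: the markup and the course names in it are made up for illustration and are not the real Talk Python page, although the `bundlelist` class name is the one mentioned in the lecture.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the course listing page.
SAMPLE_HTML = """
<html>
  <head><title>All courses</title></head>
  <body>
    <div class="bundlelist"><h3>Everything bundle</h3></div>
    <div><h3>Python Jumpstart</h3></div>
    <div><h3>Intro to Flask</h3></div>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.title.string          # text of the <title> tag
all_divs = soup.select("div")           # every <div> on the page
bundles = soup.select(".bundlelist")    # every element with class="bundlelist"
course_titles = [h3.get_text() for h3 in soup.select("h3")]

print(page_title)
print(len(all_divs), len(bundles))
print(course_titles)
```

Note that select always returns a list, even when only one element matches, so the results are usually looped over or indexed.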
2:25 Now, I wanted to add something different in here that we didn't actually go over, but I thought it might be interesting for you to see:
2:31 we can use soup.find. And find allows us to pull down the very first occurrence of what we specify. So, ul in HTML is an unordered list.
2:44 So by using soup.find('ul') we pull down the very first ul tag on the page. So, that can save a lot of effort, because select will pull down everything, whereas
2:56 find will just pull down the first match. Likewise, we can use soup.main.ul, meaning we can chain specific tag names.
3:08 We can actually go down through the HTML tree. So if we do soup.main, soup looks for the first main tag.
3:18 So, soup.main, and then it looks within the main tag and pulls down ul. So, the very first ul tag, or unordered list, within our main tag in our HTML
3:33 is going to be returned using this command. So, if we wanted to take the first list item within our first div, we could do soup.div.li.
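Here is a small sketch of find and of walking down the tree by tag name. The markup and the `id` values are made up purely so the results are easy to tell apart.

```python
from bs4 import BeautifulSoup

# Made-up markup: one list inside a div, two lists inside main.
SAMPLE_HTML = """
<html><body>
  <div><ul id="nav"><li>Home</li><li>Courses</li></ul></div>
  <main>
    <ul id="content"><li>Day 73</li><li>Day 74</li></ul>
    <ul id="extras"><li>Day 75</li></ul>
  </main>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_ul = soup.find("ul")     # first <ul> anywhere in the document
main_ul = soup.main.ul         # first <ul> inside the first <main>
first_item = soup.div.li       # first <li> inside the first <div>
missing = soup.find("table")   # no match: find returns None, not an error

print(first_ul["id"], main_ul["id"], first_item.get_text(), missing)
```

Because a failed lookup gives back None, chaining a further tag name onto a missing tag raises an AttributeError, so it can be worth checking for None on pages you don't control.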
3:47 See what I mean? Next up, we can use findAll. Now, that operates similarly to select in that it will return all of the tags that we search for.
3:59 So, expanding on what we saw just there with soup.main.ul, we can do a soup.main.findAll, meaning find all of the specified tag within main.
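As a rough sketch of that call, again on made-up markup: the code below uses the `find_all` spelling, which is the current name in bs4, with `findAll` being an older alias for the same method.

```python
from bs4 import BeautifulSoup

# Made-up markup: one list outside main, two lists inside it.
SAMPLE_HTML = """
<html><body>
  <ul><li>Outside main</li></ul>
  <main>
    <ul><li>Day 73</li><li>Day 74</li></ul>
    <ul><li>Day 75</li></ul>
  </main>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only the <li> tags nested somewhere inside <main> are returned here.
main_items = soup.main.find_all("li")
item_texts = [li.get_text() for li in main_items]

# Searching from the top of the document also picks up the outside item.
all_items = soup.find_all("li")

print(item_texts)
print(len(all_items))
```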
4:13 So, soup.main.findAll of li will get us all of the list items within our main tag, within the HTML. So that's just some extra functionality there
4:25 that you can use with bs4. It really makes it usable, makes it a lot of fun. So, just keep those ones in mind as well. Next, we did some work with newspaper3k.
4:37 First, we import newspaper. Now notice we don't import newspaper3k. As I touched on in the videos, we just import newspaper. The 3k denotes Python 3,
4:49 and that has to do with installing the package, not with actually using it. So, when we use it, we still use import newspaper.
4:59 Next up, we import the Article class from within newspaper, and that class allows us to actually initialize our article,
5:11 but we'll get to that in a sec. Next, same as we did with bs4 and requests, we specify the URL. I've left the URL out of this slide
5:19 just because it's way too long to include; it would go between the quotes there. Then we initialize the article.
5:27 That means we're getting ready to work with it, so we would pass the URL in there. And then we use download to actually download that article.
5:37 It's similar to our requests.get. Next, we actually start to parse all of that article. The first thing we can do before we parse
5:47 is return the raw HTML of the page. If you recall, that actually looked really disgusting; it was unreadable.
5:57 So, what we do is we parse it first using .parse, and then we can actually extract data from there. And the first one we can extract here is authors.
6:08 That will return the names of the authors of the article. Then the publish date. Then the body text, which is a huge block of text.
6:17 The top image, which is the first image that occurs in the article. The movies, so if there is a YouTube or Vimeo or other embedded video in there
6:27 that the article actually does allow us to scrape, we can get the URL from there. And the same thing goes with that top image:
6:37 we get the URL of the image. And finally, we can get the summary line of the article if it has one. Now obviously, there's a lot more we can do,
6:46 but this is just the sort of stuff that we touched on, and that's that. So, if you're at the end of day three and moving into day four,
6:55 this is the point where you will be creating a Flask app. So for day four, create a Flask app that allows you to scrape a page using newspaper3k,
7:06 pull it down, and then present the data that you want to display in your Flask app. So, similar to the Heroku demo app,
7:14 you should be returning the authors, the publish date, the text, and perhaps an image. So, see if you can do that.
7:22 That's your day four, enjoy. I really hope you enjoyed web scraping. Obviously, there is a lot more to it,
7:28 and this could take an entire course in itself, but this is just a quick overview for you for the #100DaysOfWeb. Enjoy, keep calm, and code in Python.

