#100DaysOfWeb in Python Transcripts
Chapter: Days 73-76: Web Scraping
Lecture: What did we learn?
0:00
Well, that was web scraping: a very brief overview of web scraping in Python. We used Beautiful Soup 4 and we used newspaper3k
0:09
to do some very interesting things and I really hope you enjoyed it. Let's quickly go through everything that we learned.
0:16
First, we're going to look at Beautiful Soup 4 and requests. We import the module bs4, that's how we get started, and requests as well.
0:26
Everything that you do with web scraping will pretty much involve requests, so that's a default. Now, specify the URL of the site we're scraping.
0:35
That was the training.talkpython.fm website. And the reason we assign that to a URL variable is purely to make our code readable.
0:47
We use requests.get, so we're doing a GET request to pull down the URL; we actually pull down the entire page.
0:56
And then we check to see if that was successful using the raise_for_status. And finally, we use Beautiful Soup 4 to create a soup object
1:06
and that simply takes the page that we pulled down, takes the text of that, and runs our HTML parser against it, all part of Beautiful Soup 4,
1:16
and creates our soup object that we can then interact with. And now, to interact with that we have a whole bunch of different functionality here.
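Put together, those steps might look like the following sketch. The URL and the function name are my own choices, and the little offline demo at the bottom parses an inline snippet so the parsing step runs without a network connection:

```python
import bs4
import requests

# Name the site up front purely for readability.
URL = "https://training.talkpython.fm"

def get_soup(url: str) -> bs4.BeautifulSoup:
    resp = requests.get(url)   # GET request: pulls down the entire page
    resp.raise_for_status()    # fail fast if the request was not successful
    # Run the HTML parser over the page text to build the soup object.
    return bs4.BeautifulSoup(resp.text, "html.parser")

# Offline demonstration of the same parsing step, on a tiny inline page:
html = "<html><head><title>Demo Page</title></head><body></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")
print(soup.title.string)  # Demo Page
```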
1:25
To start with, we can return the title of the page using soup.title, very easy. Then we can use select to pull down all of a specific tag
1:36
and we used the div tag to demonstrate that: we were able to use select('div'), and we returned all of the divs and their content within that page.
1:48
Next, we can use select again to return all items but this time, we were able to specify a CSS class.
1:56
So, that's what the dot denotes there: .bundlelist, and we're able to pull down anything in our article, or in our page I should say,
2:05
that had bundlelist as the CSS class. Select again: we actually used it for production purposes
2:15
in our Talk Python script, where we did .select on the h3 header to get all of the headers, or the titles, of our courses.
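As a runnable sketch of those select calls, here is a made-up stand-in for the Talk Python page, not the real markup:

```python
import bs4

# Hypothetical markup mimicking the structure described above.
html = """
<html><head><title>Talk Python Training</title></head>
<body>
  <div class="bundlelist"><h3>Course One</h3></div>
  <div class="bundlelist"><h3>Course Two</h3></div>
  <div class="other"><h3>Course Three</h3></div>
</body></html>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

print(soup.title)                     # the page title tag
divs = soup.select("div")             # every div on the page
bundles = soup.select(".bundlelist")  # the dot means "CSS class"
headers = soup.select("h3")           # the course titles
print([h.get_text() for h in headers])
```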
2:25
Now, I wanted to add something different in here that we didn't actually go over but I thought this might be interesting for you to see
2:31
is that we can use soup.find. And find allows us to pull down the very first occurrence of what we specify. So, ul in HTML is an unordered list.
2:44
So by using soup.find('ul') we pull down the very first ul tag on the site. That can save a lot of effort, because select will pull down everything and
2:56
find will just pull down the first match. Likewise, we can use .main.ul meaning we can use specific tags.
3:08
We can actually go down through the HTML tree. So if we do soup.main, what soup does is search for the main tag.
3:18
So, soup.main, and then it looks within the main tag and pulls down ul. So, the very first ul tag, or unordered list within our main tag in our HTML
3:33
is going to be returned using this command. So, if we wanted to take our first div with the first list item within that div we could do soup.div.li.
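A short sketch of find, and of walking the tree with dotted tag names, again on made-up HTML:

```python
import bs4

html = """
<html><body>
  <ul><li>outside main</li></ul>
  <main>
    <ul><li>first item</li><li>second item</li></ul>
    <ul><li>third item</li></ul>
  </main>
</body></html>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

first_ul = soup.find("ul")     # the very first ul anywhere on the page
main_ul = soup.main.ul         # the first ul inside the first main tag
first_li = soup.main.ul.li     # keep walking down: first li in that ul
print(first_ul.li.get_text())  # outside main
print(first_li.get_text())     # first item
```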
3:47
See what I mean? Next up, we can use findAll. Now, that operates similarly to select in that it will actually take all of the tags that we search for.
3:59
So, expanding on what we saw just there with soup.main.ul, we can do soup.main.findAll, meaning find all of a given tag in main.
4:13
So, soup.main.findAll of li will get us all of the list objects within our main tag, within the HTML. So that's just some extra functionality there
4:25
you can use with bs4. Really makes it usable, makes it a lot of fun. So, just keep those ones in mind as well. Next, we did some work with newspaper3k.
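Here is that findAll pattern as a small runnable sketch, on hypothetical markup:

```python
import bs4

html = """
<html><body>
  <main>
    <ul><li>Course A</li><li>Course B</li></ul>
    <ul><li>Course C</li></ul>
  </main>
  <ul><li>not in main</li></ul>
</body></html>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# All li tags anywhere under main, not just the first match.
items = soup.main.findAll("li")
print([li.get_text() for li in items])  # ['Course A', 'Course B', 'Course C']
```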
4:37
First, we import newspaper. Now notice we don't import newspaper3k. As I touched on in the videos, we just import newspaper. The 3k denotes Python 3
4:49
and that has to do with installing the module not with actually using it. So, when we use it, we still use import newspaper.
4:59
Next up, we import the Article class from within newspaper, and that allows us to actually initialize our article,
5:11
but we'll get to that in a sec. Next, same thing we did with bs4 and requests we specify the URL. I've left the URL out of this slide
5:19
just because it's way too long to include; it would go between the quotes there. Then we initialize the article.
5:27
That means we're getting ready to work with it so we would substitute the URL within there. And then we use download to actually download that article.
5:37
It's similar to our requests.get. Next, we actually start to parse all of that article. First thing we can do before we parse
5:47
is we can return the unformatted HTML of the page. If you recall, that actually looked really disgusting; it was unreadable.
5:57
So, what we do is we parse it first using .parse and then we can actually extract data from there. And the first one we can extract here is authors.
6:08
That will return the name of the authors of the article. The publish date. The body text, which is a huge block of text.
6:17
The top image, which is the first image that occurs in the article. The movie: if there is a YouTube or Vimeo or other embedded video in there
6:27
that the article actually does allow us to scrape, we can get the URL from there. And the same thing goes for that top image:
6:37
we get the URL of the image. And finally, we can get the summary line of the article if it has one. Now obviously, there's a lot more we can do
6:46
but this is just the sort of stuff that we touched on and that's that. So, if you're at the end of day three and moving into day four
6:55
this is the point where you will be creating a Flask app. So for day four, create a Flask app that allows you to scrape a page using newspaper3k
7:06
pull it down, and then present the data that you want to display on your Flask app. So, similar to the one we used in the Heroku demo app
7:14
you should be returning, sorry, authors, publish date, the text, and perhaps an image. So, see if you can do that.
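To get you started on that day-four app, here is one possible skeleton. The route name and the placeholder fields are my own invention, and the newspaper3k calls are commented out because they need a live URL and network access:

```python
from flask import Flask

app = Flask(__name__)

@app.route("/scrape")
def scrape():
    # In the real app you would do something along these lines:
    #   from newspaper import Article
    #   article = Article(url_to_scrape)
    #   article.download()
    #   article.parse()
    # and then pass article.authors, article.publish_date, article.text,
    # and article.top_image into a template. Placeholder data for the sketch:
    return {
        "authors": ["Author Placeholder"],
        "publish_date": None,
        "text": "Body text goes here.",
        "top_image": "image-url-goes-here",
    }

# Run with `flask run` once you have filled in the scraping logic.
```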
7:22
That's your day four, enjoy. I really hope you enjoyed web scraping. Obviously, there is a lot more to it
7:28
and this could take an entire course in itself but this is just a quick overview for you for the #100DaysOfWeb. Enjoy, keep calm, and code in Python.