#100DaysOfWeb in Python Transcripts
Chapter: Days 73-76: Web Scraping
Lecture: Installing and introducing newspaper3k
0:00 We're going to change pace just a little bit here. While we were specifically looking at scraping web pages before,
0:08 we're going to try a different library that we found here. It's called Newspaper3K. It did exist prior to this, as a Python 2 library.
0:19 But now it has a Python 3 version, which is why it's called 3K. And it's actually a really useful Python module,
0:28 a Python library that is specifically designed to interrogate news articles. So if you go to your favorite news site and you look up
0:38 a specific article, that's the sort of thing that this module looks at. That's why it's called Newspaper.
0:44 Now, the best way to show you how it works is using a demo they've created themselves, the makers of the tool,
0:51 but first let's just get your environment set up. So, again, I'm back in our 6-webscraping directory. I've got my virtual environment set up.
1:01 Let's just simply pip install newspaper3k. Now, you have to specifically mention 3k because Newspaper still exists as an older library.
1:13 So let's pip install that. It does use requests but because we already have that installed we don't have to worry about it.
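A quick way to confirm the install worked is a small check like the one below. This is only a sketch: `is_installed` is a hypothetical helper name, not part of the library. One thing worth knowing is that the package is installed as `newspaper3k` but imported as `newspaper`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# After running `pip install newspaper3k` in the virtual environment,
# the import name to look for is `newspaper`, not `newspaper3k`.
print("newspaper available:", is_installed("newspaper"))
print("requests available:", is_installed("requests"))
```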
1:23 Okay, now that that's installed we can sit here and have a play but we'll do that in the next video. The first thing I'd like you to do is
1:31 actually see it in action. So I've got my browser here. If you go to newspaper-demo.herokuapp.com you'll get to the Newspaper demo app.
1:42 And I've just picked a random article from news.com.au. This was the first article that I found that wasn't just, you know, doom and gloom.
1:52 I'm just going to copy the URL, CTRL+C. Head back to our Newspaper demo and watch what this does.
1:59 Watch how it scrapes the page and look at what it returns. Look at the details. So, straight away on the page you can see we have a
2:07 whole lot of rubbish on this page, a whole lot of fluff in our face, we have the bar, we have ads, we have these pop-ups down the bottom.
2:16 Just all sorts of stuff that really takes away from the experience, right? We just want the content.
2:21 So, Newspaper allows you to actually pull out all of that detail that you want. So, at the top of this app it's given us our URL.
2:31 We can get the title, the authors, the body text. Now obviously this is not formatted in anything remotely
2:40 readable, but that's something we can tackle later that you can play with. And it gives you the top image, meaning the first image on the page.
2:48 There you go, this one here. This here is a video so obviously it's not going to give us that straight away.
2:55 And then it gives us keywords and anything else that it can pull that the article actually has tagged in
3:01 its HTML. And that's it! So it's really simple. Now if you were to go through and do this scraping
3:08 manually yourself, there'd be a lot of work involved. So it's almost like magic that this works, and they've done the back-end work for you.
3:16 It's really cool, it's really impressive, and it's a really nice tool to add to your toolkit. So let's have a look in the next video at what we do
3:26 on our end, on the command line in the Python REPL, to get the information that we need.
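As a rough preview, the fields the demo app displays (title, authors, body text, top image, keywords) can be pulled out in Python along these lines. Treat this as a minimal sketch rather than the code from the next video: `fetch_article` and `summarise` are hypothetical helper names, and the sample article at the bottom is made-up stand-in data so the sketch runs without a network connection.

```python
from types import SimpleNamespace


def fetch_article(url, with_keywords=False):
    """Download and parse one article with newspaper3k (needs network access)."""
    # Imported lazily so the rest of this sketch runs without the library.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the raw HTML
    article.parse()     # extract title, authors, text and top image
    if with_keywords:
        # Keywords are only filled in after nlp(), which relies on
        # NLTK tokenizer data being available on the machine.
        article.nlp()
    return article


def summarise(article, preview_chars=200):
    """Collect the fields shown in the demo app into a plain dict."""
    return {
        "title": article.title,
        "authors": list(article.authors),
        "text_preview": article.text[:preview_chars],
        "top_image": article.top_image,
        "keywords": list(article.keywords),
    }


# Hypothetical stand-in for a parsed article (not real scraped data).
sample = SimpleNamespace(
    title="Example headline",
    authors=["A. Reporter"],
    text="First paragraph of the story. " * 20,
    top_image="https://example.com/lead.jpg",
    keywords=["example", "story"],
)
summary = summarise(sample)
for field, value in summary.items():
    print(f"{field}: {value}")
```

With the library installed and a connection available, `summarise(fetch_article(url))` would do the same for a real article URL, such as the one pasted into the demo app.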