#100DaysOfWeb in Python Transcripts
Chapter: Days 73-76: Web Scraping
Lecture: Extra! Extra! Student scrapes a news article!
0:00 Now that we've seen newspaper in action let's actually break it down to the individual commands that will pull out this specific data
0:09 from the webpage at will. Now, the first thing we need to do is bring up our Python Shell, our REPL. Again, good practice just to play in here first
0:19 before you write any scripts. Not that we're going to write a script for this because this is free form fun.
0:25 Now when we import newspaper, you'd think that we would import newspaper3k because that's what we installed, but newspaper3k is just
0:37 the name of the Python 3 package for newspaper. When we actually use the library, we still import the newspaper module,
0:48 just like the old-fashioned way in Python 2. We still call it newspaper; it's just that we needed to pull down the Python 3 version of it,
0:56 and that's where the 3k comes from. With newspaper imported, we now want to pull out Article, so import... sorry, from newspaper import Article.
1:12 Now what Article does, as a class within newspaper, is it allows us to pass in the actual web page, the newspaper article
1:23 that we're pulling down, and this is almost like a combination of BeautifulSoup 4 and requests,
1:33 all right, because we're going to specify the URL and we're going to use Article against that,
1:39 and that's going to allow us to pull it down and parse it. You'll see in a minute. First, let's actually specify that URL,
1:49 so URL is assigned the web page. That's the one we looked at before, and now let's actually pull it down. So we create an article object
2:03 and we specify URL, so we run the capital-A Article against that and assign the result to the article object.
2:13 Now we have to actually tell it to download, right, so article.download() will pull down the page. This is like requests, right?
2:22 And if we want to see what this looks like we can go article.html, and this prints all of the HTML that we pulled down. Right, everything in that page.
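The steps so far can be sketched as one small function. This is only a minimal sketch, assuming the newspaper3k package is installed (pip install newspaper3k); the function name fetch_html is made up for illustration, and the URL is whatever page you want to scrape.

```python
def fetch_html(url):
    """Download a page with newspaper and return its raw HTML."""
    # The package is installed as newspaper3k but imported as newspaper.
    # Imported inside the function so this sketch loads even where the
    # library is not installed yet.
    from newspaper import Article

    article = Article(url)  # build an Article object for the URL
    article.download()      # fetch the page, much like requests would
    return article.html     # the raw HTML as one long string
```

In the REPL you would call it as fetch_html(url) with the URL assigned earlier, and the result is the same unformatted HTML that article.html prints.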
2:35 Now, article was able to pull the page down. If we looked at this in a nicely formatted way, you could break it down and see the little flags
2:46 that newspaper actually keys on, but we're not going to bother; this is obviously too poorly formatted, so what we need to do
2:53 is call article.parse(), and this will parse the article for us and allow us to start pulling out all of the relevant data.
3:03 If you remember, on their web page they had the authors there. Right, they had the authors, they had the title they had everything there.
3:10 What we can do is, we can actually run that, so article.authors will pull down the name of the author of the article.
3:18 So, this newspaper article is written by Lauren McMah. I hope I've pronounced that correctly and now we can pull down other details.
3:27 article.publish_date: there's our publish date as a datetime object, and you can print that nicely. Print publish_date, and there you have your nice format
3:46 there of that date, so 3:22 on the ninth of the fourth. And what else do we have? We have the text, so if you remember that giant text block
3:56 that we saw on the page, let's pull it all down. article.text and what you could do is you could also do some splitting.
4:05 You could split this on new lines but for all intents and purposes on that page they simply ran article.text
4:13 and that gave them that massive dump there. Next, we have article.top_image, and what that does is give us the URL of the image
4:26 that was on that page, so you can imagine if you wanted to present this image when you return this data,
4:33 you would need to be, say, on a webpage of your own, hint hint, and I'll give you a clue: you would need to specify this URL to display it on your page.
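As one way to follow that hint, the image URL can be dropped into an img tag. This is just a sketch using the standard library; image_tag and the example.com URL are placeholders invented for illustration, standing in for the value of article.top_image.

```python
from html import escape


def image_tag(image_url, alt_text="Article image"):
    """Build an HTML <img> tag for a URL such as article.top_image."""
    # escape() guards against quotes or angle brackets in the values.
    return '<img src="{}" alt="{}">'.format(
        escape(image_url, quote=True), escape(alt_text, quote=True)
    )


# Placeholder URL; in the REPL you would pass article.top_image instead.
print(image_tag("https://example.com/photo.jpg"))
```

A web framework's template engine would normally do this escaping for you, so treat the helper as a demonstration of the idea rather than something you need to keep.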
4:45 What else do we have? We have article.movies. Now you didn't see this on their page because there wasn't anything to present back,
4:53 but if you run article.movies, this should give you videos. Obviously our newspaper article here doesn't have the right flags specified for a movie,
5:02 so that's why we're not getting anything. We can do article.summary, and if there is a summary associated with our article in the HTML
5:11 we would get, I guess, the heading overview summary of that page. And last but not least, we actually want the title, right?
5:20 So, article.title: that will give us the actual title of our newspaper article. So seeing how this works now,
5:28 you can pretty much imagine the use cases here. You might want to scrape certain web pages daily
5:35 just to get the titles of all the newspaper articles. You can do all sorts of stuff. This is web scraping, but it's almost too simple.
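Pulling the attributes from this walkthrough together might look like the sketch below. It assumes newspaper3k is installed; scrape_article and the dictionary keys are names invented here, not part of the library. The summary attribute is left out because, as far as the newspaper3k documentation suggests, it is only filled in after an extra article.nlp() call, which the lecture does not cover.

```python
def scrape_article(url):
    """Download and parse one article, returning the fields shown above."""
    # Imported here so the sketch loads without newspaper3k installed.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the HTML
    article.parse()     # extract authors, title, text and so on

    published = article.publish_date  # a datetime, or None if not found
    return {
        "title": article.title,
        "authors": article.authors,
        "published": published.strftime("%Y-%m-%d %H:%M") if published else None,
        "top_image": article.top_image,
        "movies": article.movies,
        "text": article.text,
    }
```

For the daily-titles idea, you could loop over a list of URLs and keep only scrape_article(url)["title"] from each result.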
5:46 It's almost oversimplified, because someone's done that hard work in the background of analyzing what newspaper article tags look like
5:53 and HTML tags look like to be able to make this simple for you. And finally, to look at something a little more simplified,
6:01 I thought I'd show you this. We have this giant block of text here, right? It's not great. It's a pain in the butt to play with.
6:08 You really can't present it nicely if you're simply going to use article.text but one thing you'll notice if we take a look at this closely
6:18 we have these new lines in there, so this is something we can actually work with. So let's take a look at this. If we do article.text.split('\n'),
6:31 we're going to split that text on the newline character. What do we get? Well, we actually get a list. Look at that! We have that square right bracket
6:43 and at the top of that we have a square left bracket. So this is a list, and the list is now split based on new lines.
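The split can be tried without downloading anything. In this sketch, sample_text is a made-up stand-in for article.text, with paragraphs separated by blank lines the way the real text is.

```python
# Stand-in for article.text: paragraphs separated by blank lines.
sample_text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."

# Splitting on the newline character returns a list of strings.
lines = sample_text.split("\n")

print(type(lines).__name__)
print(lines)
```

Each blank line between paragraphs comes through as an empty string in the list.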
6:52 The first entry in our list is this line here the second entry is a space, or a carriage return hopefully and then the next one is that sentence
7:05 ending in decades, and so on and so forth through the article. So now we've split up our article by new lines
7:12 into a list, which means we can cycle through that list, we can pass that list. So let's write for i in... you know, we'll just type it out again:
7:26 article.text.split on new line. Let's just do print(i), let's see what we come up with, and this should print in a nicely formatted way.
7:39 There we go. Look at how much more readable that is, so if you were putting this on a web page, again, another hint for you,
7:49 you'd be able to present the text much more nicely, pretty much by paragraph, so there's your first sentence and there's that space that we had,
7:56 there's your next sentence, and so on. So there's a lot you can do with this but you've seen how we can use newspaper
8:04 to break down and pull out specific things we want from the webpage.
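The loop from the end of the lecture, plus one possible next step for a web page, can be sketched as follows. Again sample_text stands in for article.text, and the paragraphs helper is a name invented for this example; dropping the blank entries is an optional extra, not something the lecture does.

```python
def paragraphs(text):
    """Split text on newlines and drop the blank entries."""
    return [line.strip() for line in text.split("\n") if line.strip()]


# Stand-in for article.text.
sample_text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."

# The loop from the lecture: print each entry on its own line.
for i in sample_text.split("\n"):
    print(i)

# Optional: keep only the real paragraphs, ready for a template.
print(paragraphs(sample_text))
```

With the blank entries removed, each item in the list maps cleanly onto one paragraph element on a page.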