Python for .NET Developers Transcripts
Chapter: Computational notebooks
Lecture: Getting the links from the RSS data

0:00 Now that we have the entries, we need to go through and somehow extract the HTML links, the actual source reference of those links.

0:10 This is sort of XML, I mean, yes RSS is XML and the XML has actually been turned into a dictionary, which is great.

0:17 But the thing that is the description itself is not XML, it's an HTML fragment. So we're going to use Beautiful Soup again

0:25 to parse, to screen scrape that and then pull out the entry. So we're going to do a little juggling there.

0:32 One of the things I'd like to do really quick, and I've shown you the Ctrl + Enter hotkey for running an entry.

0:37 Notice that the execution count at the bottom there incremented. And I can run this over here

0:44 sometimes it'll show you like what the hotkeys are up here like Shift + Enter and stuff. But I want to know how to add a cell below here.

0:52 Now that's B, but let me show you how you can find out. If I go over here and I type cell, there's a ton of stuff.

0:59 If I want to insert a cell below, I can just type B. If I want to select the cell above I could hit like, K, if I want.

1:07 In the code, I could hit K and then you'll run that and then go back down, J back down. Now if I want to insert a new piece, I hit B, there we go.

1:18 If I want to change to markdown, I can hit M. If I want to change to code do I hit C? No, I hit Y. So I come down here and say, a new heading

1:27 parse the HTML from the entry descriptions. We could add even more text, right, this is just markdown. Then we hit ALT + CTRL + Enter to do that.

1:39 B to define another block below, and off we go. So let's define all_links. We're going to just pile up every single link

1:48 that we can find, not worrying about the domains yet. We just want to get the links out of the body. So we'll say for e in entries, let's just print

1:57 to R or something and do that real quick. Now, that's probably, we'd want a quote. Here we go, all right. So it looks like there's a bunch of entries.

2:05 So we're going to go through them, and what are we going to do? We're going to save the description as e.get('description').

2:13 All right, that's going to be the HTML. Then we'll come over here, I want to say bs4, if you want autocomplete

2:19 you have to hit Tab, it doesn't automatically come up. But notice we have a Beautiful Soup and what are we going to pass in?

2:24 Well, we can't give it the fragment I don't think, so let's give it proper HTML. We run that, now, I believe it would prefer

2:38 that we said 'html.parser' here, yeah. We won't see the warning that it's putting out but, you know, deep down in the guts

2:45 I'm sure there's some kind of warning. So let's go over here and we'll say the links and we'll do a cool little list comprehension.

2:52 Now just give it some space, 'cause I want to separate these: for a in soup.find_all('a'). All right, and let's just really quickly here

3:00 print len of links, all right, awesome. So it looks like the data's coming along. See how cool it is? We can re-run and explore this a little bit
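
A minimal sketch of the cell described here, assuming the bs4 package is installed. The description string is a hypothetical stand-in for one entry's description, not data from the actual feed, and this sketch passes the fragment straight to the parser, which may differ from what was typed on screen.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one entry's description (an HTML fragment).
description = (
    '<p>Links: <a href="https://www.example.com/a">A</a> and '
    '<a href="http://example.org/b">B</a></p>'
)

# Naming the parser explicitly avoids Beautiful Soup's
# "no parser was explicitly specified" warning.
soup = BeautifulSoup(description, "html.parser")

# Collect every anchor tag found in the fragment.
links = [a for a in soup.find_all("a")]
print(len(links))
```

Note that the method is spelled find_all; the list still holds whole tag objects at this point, not URLs.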

3:08 without re-computing all this stuff up here, granted it's not that bad, but like I keep thinking computation is expensive, we're doing a bunch of work

3:15 but we want to keep playing with it and seeing the output. All right, so we can just see that, yeah it looks like that's probably decent.

3:22 We could even print out what the links are like let's say the first two. Well, that's messy, but it looks correct, doesn't it?

3:30 Okay, so we're on a great path here. But what we have is the entire hyperlink and I want to get just the href.

3:37 So let's go in here in Beautiful Soup we can go like this and, but wait, wait, print links.

3:45 Run it again, oh yeah, now we're getting just the hyperlinks. Okay, our data's looking better, data's looking better.
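
One way to pull just the href out of each anchor, sketched under the same assumptions (bs4 installed, a made-up description string). The href=True filter is an addition of this sketch, not something stated in the lecture; it skips anchors that have no href attribute so the subscript never fails.

```python
from bs4 import BeautifulSoup

# Hypothetical description fragment; the second anchor has no href on purpose.
description = (
    '<p><a href="https://www.example.com/a">A</a> '
    '<a name="top">no link here</a> '
    '<a href="http://example.org/b">B</a></p>'
)

soup = BeautifulSoup(description, "html.parser")

# Keep only the URL string from each anchor that actually has an href.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```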

3:51 There's a couple of things that I don't like, though. I don't want to talk about www versus not www, don't really care about HTTPS, things like that.

4:01 So we're going to do a little bit of normalization here. Okay, we're going to iterate on top of links and do a little bit of clean up

4:12 so this part should go away, if we run this. Here we go, try again, okay, perfect. So techrepublic and www.techrepublic

4:21 no longer a difference, we just want the base domain name. It turns out we also have some aliases that we sometimes use and sometimes don't.
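
The clean-up step might look something like the following. This is a sketch, not the code typed in the lecture: the normalize helper and the sample URLs are hypothetical, and it strips leading prefixes rather than calling replace so that a "www." in the middle of a URL is left alone.

```python
def normalize(link):
    """Drop the scheme and a leading 'www.' so equivalent links compare equal."""
    link = link.strip()
    # Order matters: remove the scheme first, then the www. prefix.
    for prefix in ("https://", "http://", "www."):
        if link.startswith(prefix):
            link = link[len(prefix):]
    return link


# Hypothetical sample links, not taken from the real feed.
links = [
    "https://www.techrepublic.com/article/x",
    "http://techrepublic.com/article/x",
]
links = [normalize(link) for link in links]
print(links)
```

After this pass, the www and non-www forms of the same address are identical strings.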

4:30 So let's go down here and do this one more time. There's probably a slightly cleaner way to do this, but we're going to just say

4:37 replace do.co with Digital Ocean. This is like a re-director URL. Okay, this is all good, now what we're printing out here
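
A hedged sketch of the alias replacement. The ALIASES table and the expand_aliases helper are hypothetical names, "digitalocean.com" is an assumed target for the do.co redirector, and the links are assumed to have already had their scheme and www prefix removed.

```python
# Assumed mapping from short redirector domains to the full domain.
ALIASES = {"do.co": "digitalocean.com"}


def expand_aliases(link, aliases=ALIASES):
    """Rewrite a link whose domain is a known alias; leave others untouched."""
    for short, full in aliases.items():
        # Match the whole domain only, so 'do.company.com' is not rewritten.
        if link == short or link.startswith(short + "/"):
            link = full + link[len(short):]
    return link


print(expand_aliases("do.co/python"))
print(expand_aliases("do.company.com/x"))
```

A plain str.replace would be shorter, but it would also rewrite any unrelated address that happens to contain "do.co".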

4:45 each time we do a print, like notice, like right there and right there, the closing brace we're printing this out for just that one entry.

4:54 We want all the entries for all 152 or we want all the links for all 152 entries. Now what we're going to do is come down here and say

5:05 all_links.extend, it's like AddRange, links. At the bottom, let's just print the first, I don't know, ten. Let's also print out the len.

5:18 How many links do we have total? How many different unique links do we have? Well, how many times has a link been mentioned?

5:28 All right, 2,721, so that's pretty cool. And see how nice it is to just explore this data we don't have to keep re-running it.

5:36 Like we can forget about how we even got this feed data. Yeah, we're going out to an API doing RSS feed

5:42 and we're hitting it, but all the stuff we've been doing down here, like, it's off the screen and out of mind. We just have this data magically

5:49 by the magic of technology, we have it. We can just work with it over and over and over not concerned about the latency of getting it

5:56 or the computational cost, or whatever. Though maybe it, kind of keeping with our style here let's put out a little statement here.

6:09 We can say something like parsed some number of links from all the episodes and we just run that again. Perfect, we've parsed 2,721 links.
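
The accumulation step can be sketched as follows, with a made-up list of per-entry links standing in for the real parsed data. list.extend plays the role that List<T>.AddRange plays in .NET: it adds each item, where append would add the whole list as a single element.

```python
# Hypothetical per-entry results; each inner list is one entry's links.
links_per_entry = [
    ["a.com/1", "b.com/2"],
    ["a.com/1"],
    [],
]

all_links = []
for links in links_per_entry:
    # extend adds the items one by one, keeping all_links flat.
    all_links.extend(links)

print(all_links[:10])
print(f"Parsed {len(all_links):,} links from all the episodes.")
```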

6:16 And notice when I run this, watch though as soon as this turns to a star and then goes back to 45. That took like two seconds to run, but now that

6:26 it's done, we never have to run that code again. We just work with all links, which is now, remember just the raw links, we don't have to parse

6:34 that 2.5 megs of HTML, RSS XML, blended weirdness, we're done. We just worry about the links.

6:41 Like, now that we're done with that step, we're golden. Let's hit B to add another one and then M to convert it to markdown.

6:51 All right, so we're going to say, extracted domain names. Actually, let's do one more really quick thing here, let's make this a little bit smaller.

6:57 Let's talk about how many unique links there are. And we can do that by creating a list and then going through a set of all_links.

7:08 What does that do? Well, that takes the links, all of them with duplication converts it to a set which has no duplicates

7:13 and then turns it back to a list so that we can deal with it, and who knows? Maybe it's like reasonable to say sort, sort that, right?

7:21 We just do it once and we're not going to compute it again anyway. And let's move this down here so we know how many distinct links we have.

7:30 There we go, we've lost about 400 duplicates that were in there, these are like sponsor links and, you know, maybe links back to our profile.

7:38 Or who knows what those are? But they're gone 'cause they were duplicates.
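
The de-duplication described above, as a small sketch with made-up links. The name unique_links is an assumption; the list, set, and sort steps follow what is said in the lecture.

```python
# Hypothetical links with duplicates, standing in for the real all_links.
all_links = ["b.com/2", "a.com/1", "b.com/2", "c.com/3", "a.com/1"]

# A set drops the duplicates; list() gives back something indexable.
unique_links = list(set(all_links))
# Sets are unordered, so sort once to get a stable, readable order.
unique_links.sort()

print(f"{len(all_links)} links in total, {len(unique_links)} distinct.")
```

The three steps can also be collapsed into sorted(set(all_links)), which returns the same sorted list.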
