Python for the .NET Developer Transcripts
Chapter: Computational notebooks
Lecture: Getting the links from the RSS data
0:00 Now that we have the entries, we need to go through and somehow extract the HTML links, the actual source reference of those links.
0:10 This is sort of XML, I mean, yes RSS is XML and the XML has actually been turned into a dictionary, which is great.
0:17 But the thing that is the description itself is not XML, it's an HTML fragment. So we're going to use Beautiful Soup again
0:25 to parse, to screen scrape that and then pull out the entry. So we're going to do a little juggling there.
0:32 One of the things I'd like to do really quick, and I've shown you the Ctrl + Enter hotkey for running an entry.
0:37 Notice that that incrementing, that execution count at the bottom there, incremented. And I can run this over here
0:44 sometimes it'll show you like what the hotkeys are up here like Shift + Enter and stuff. But I want to know how to add a cell below here.
0:52 Now that's B, but let me show you how you can find out. If I go over here and I type cell, there's a ton of stuff.
0:59 If I want to insert a cell below, I can just type B. If I want to select the cell above I could hit like, K, if I want.
1:07 In the code, I could hit K and then you'll run that and then go back down, J back down. Now if I want to insert a new piece, I hit B, there we go.
1:18 If I want to change to markdown, I can hit M. If I want to change to code do I hit C? No, I hit Y. So I come down here and say, a new heading
1:27 parse the HTML from the entry descriptions. We could add even more text, right, this is just markdown. Then we hit ALT + CTRL + Enter to do that.
1:39 B to define another block below, and off we go. So let's define all_links. We're going to just pile up every single link
1:48 that we can find, not worrying about the domains yet. We just want to get the links out of the body. So we'll say for e in entries, let's just print
1:57 to R or something and do that real quick. Now, that's probably, we'd want a quote. Here we go, all right. So it looks like there's a bunch of entries.
2:05 So we're going to go through them, and what are we going to do? We're going to save the description as e.get('description').
2:13 All right, that's going to be the HTML. Then we'll come over here and I want to say bs4. If you want autocomplete,
2:19 you have to hit Tab, it doesn't automatically come up. But notice we have a Beautiful Soup and what are we going to pass in?
2:24 Well, we can't give it the fragment I don't think, so let's give it proper HTML. We run that, now, I believe it would prefer
2:38 that we said 'html.parser' here, yeah. We won't see the warning that it's putting out but, you know, deep down in the guts
2:45 I'm sure there's some kind of warning. So let's go over here and we'll say the links and we'll do a cool little list comprehension.
2:52 Now just give it some space, 'cause I want to separate these: for a in soup.find_all('a'). All right, and let's just really quickly here
3:00 print len of links, all right, awesome. So it looks like the data's coming along. See how cool it is? We can re-run and explore this little bit
3:08 without re-computing all this stuff up here. Granted, it's not that bad, but I keep thinking computation is expensive, we're doing a bunch of work
3:15 but we want to keep playing with it and seeing the output. All right, so we can just see that, yeah it looks like that's probably decent.
3:22 We could even print out what the links are like let's say the first two. Well, that's messy, but it looks correct, doesn't it?
3:30 Okay, so we're on a great path here. But what we have is the entire hyperlink and I want to get just the href.
3:37 So let's go in here in Beautiful Soup we can go like this and, but wait, wait, print links.
3:45 Run it again, oh yeah, now we're getting just the hyperlinks. Okay, our data's looking better, data's looking better.
3:51 There's a couple of things that I don't like, though. I don't want to talk about www versus not www, don't really care about HTTPS, things like that.
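A minimal sketch of the href extraction step. To keep it runnable without extra packages, it swaps in the standard library's html.parser for Beautiful Soup; the entry dictionary and its URLs are made up for illustration, and the rough Beautiful Soup equivalent is shown in a comment.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_hrefs(html_fragment):
    """Return the href values of all anchors, in document order."""
    parser = HrefCollector()
    parser.feed(html_fragment)
    parser.close()
    return parser.hrefs


# Hypothetical entry, shaped like the dictionaries described in the lecture.
entry = {
    "description": (
        '<p>Sponsor: <a href="https://do.co/example">DO</a> and '
        '<a href="http://www.techrepublic.com/article">an article</a></p>'
    )
}

links = extract_hrefs(entry.get("description"))
print(links)

# Rough Beautiful Soup equivalent (requires the bs4 package):
#   soup = bs4.BeautifulSoup(entry.get("description"), "html.parser")
#   links = [a.get("href") for a in soup.find_all("a")]
```

Either way, the result is a plain list of strings rather than a list of tag objects, which is what makes the clean-up that follows simple string work.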
4:01 So we're going to do a little bit of normalization here. Okay, we're going to iterate on top of links and do a little bit of clean up
4:12 so this part should go away, if we run this. Here we go, try again, okay, perfect. So techrepublic and www.techrepublic
4:21 no longer a difference, we just want the base domain name. It turns out we also have some aliases that we sometimes use and sometimes don't.
4:30 So let's go down here and do this one more time. There's probably a slightly cleaner way to do this, but we're going to just say
4:37 replace do.co with DigitalOcean. This is like a redirector URL. Okay, this is all good. Now what we're printing out here,
4:45 each time we do a print, like notice, like right there and right there, the closing brace we're printing this out for just that one entry.
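The normalization described here might look roughly like the sketch below. The function name, the ALIASES table, and the digitalocean.com target are assumptions for illustration; the lecture only says that do.co is replaced with DigitalOcean and that the www and HTTPS differences should go away.

```python
# Assumed alias table: short redirector host -> the host it stands for.
ALIASES = {"do.co": "digitalocean.com"}


def normalize_link(link):
    """Drop the scheme and a leading 'www.', then expand known host aliases."""
    cleaned = link.strip()

    # Remove the scheme so http and https versions compare equal.
    for scheme in ("https://", "http://"):
        if cleaned.startswith(scheme):
            cleaned = cleaned[len(scheme):]
            break

    # Treat www.example.com and example.com as the same site.
    if cleaned.startswith("www."):
        cleaned = cleaned[len("www."):]

    # Swap the host only, leaving any path untouched.
    host, sep, rest = cleaned.partition("/")
    host = ALIASES.get(host, host)
    return host + sep + rest


links = [
    "https://www.techrepublic.com/article",
    "http://techrepublic.com/article",
    "https://do.co/example",
]
links = [normalize_link(link) for link in links]
print(links)
```

Matching the alias against the host only, instead of calling replace on the whole string, avoids rewriting a URL that merely contains the text do.co somewhere in its path.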
4:54 We want all the entries for all 152 or we want all the links for all 152 entries. Now what we're going to do is come down here and say
5:05 all_links.extend, it's like AddRange, links. At the bottom, let's just print the first, I don't know, ten. Let's also print out the link.
5:18 How many links do we have total? How many different unique links do we have? Well, how many times has a link been mentioned?
5:28 All right, 2,721, so that's pretty cool. And see how nice it is to just explore this data we don't have to keep re-running it.
5:36 Like we can forget about how we even got this feed data. Yeah, we're going out to an API doing RSS feed
5:42 and we're hitting it, but all the stuff we've been doing down here, like, it's off the screen and out of mind. We just have this data magically
5:49 by the magic of technology, we have it. We can just work with it over and over and over not concerned about the latency of getting it
5:56 or the computational cost, or whatever. Though maybe, kind of keeping with our style here, let's put out a little statement here.
6:09 We can say something like "Parsed some number of links from all the episodes," and we just run that again. Perfect, we've parsed 2,721 links.
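Here is a small sketch of the accumulate-and-report step, assuming the per-entry link lists have already been extracted and cleaned. The sample data and the links_per_entry name are invented for illustration; only all_links and the extend call come from the lecture.

```python
all_links = []

# Hypothetical per-entry results; in the notebook these come from the loop
# over the feed entries.
links_per_entry = [
    ["techrepublic.com/article", "digitalocean.com/example"],
    ["digitalocean.com/example"],
]

for links in links_per_entry:
    # extend adds each item of links to all_links, much like List<T>.AddRange
    # in .NET. append would add the whole list as a single nested element.
    all_links.extend(links)

# Peek at the first few links, then report the total with thousands separators.
print(all_links[:10])
message = f"Parsed {len(all_links):,} links from all the episodes."
print(message)
```

Because all_links counts every mention, a link that appears in several episodes shows up several times in the total.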
6:16 And notice when I run this, watch, though, as this turns to a star and then goes back to 45. That took like two seconds to run, but now that
6:26 it's done, we never have to run that code again. We just work with all_links, which is now, remember, just the raw links. We don't have to parse
6:34 that 2.5 megs of HTML, RSS, XML blended weirdness, we're done. We just worry about the links.
6:41 Like, now that we're done with that step, we're golden. Let's hit B to add another one and then M to convert it to markdown.
6:51 All right, so we're going to say "Extract domain names." Actually, let's do one more really quick thing here, let's make this a little bit smaller.
6:57 Let's talk about how many unique links there are. And we can do that by creating a list from a set of all_links.
7:08 What does that do? Well, that takes the links, all of them with duplication converts it to a set which has no duplicates
7:13 and then turns it back to a list so that we can deal with it, and who knows? Maybe it's like reasonable to say sort, sort that, right?
7:21 We just do it once and we're not going to compute it again anyway. And let's move this down here so we know how many distinct links we have.
7:30 There we go, we've lost about 400 duplicates that were in there, these are like sponsor links and, you know, maybe links back to our profile.
7:38 Or who knows what those are? But they're gone 'cause they were duplicates.
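The de-duplication step can be sketched like this. The sample links and the unique_links name are made up for illustration; the list, set, and sort calls are the ones described above.

```python
# Hypothetical stand-in for the full list of links, duplicates included.
all_links = [
    "techrepublic.com/article",
    "digitalocean.com/example",
    "digitalocean.com/example",
    "example.com/profile",
    "techrepublic.com/article",
]

# A set keeps one copy of each value; list() turns it back into something
# we can index and sort.
unique_links = list(set(all_links))

# Sets are unordered, so sort once to get a stable, readable order.
unique_links.sort()

print(f"{len(all_links)} links total, {len(unique_links)} unique")
print(unique_links)
```

The same result is available in one step with sorted(set(all_links)), since sorted always returns a new list.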