Python for the .NET developer Transcripts
Chapter: Computational notebooks
Lecture: Getting the links from the RSS data
0:00 Now that we have the entries
0:01 we need to go through and somehow extract the HTML links
0:06 the actual source reference of those links.
0:09 This is sort of XML, I mean, yes RSS is XML
0:12 and the XML has actually been turned
0:14 into a dictionary, which is great.
0:16 But the thing that is the description itself
0:19 is not XML, it's an HTML fragment.
0:22 So we're going to use Beautiful Soup again
0:24 to parse, to screen scrape that and then pull out the entry.
0:28 So we're going to do a little juggling there.
0:31 One of the things I'd like to do really quick
0:32 is I'd like to, and I've shown you
0:34 the Ctrl + Enter hotkey for running an entry.
0:36 Notice that the execution count
0:39 at the bottom there incremented.
0:41 And I can run this over here
0:43 sometimes it'll show you like what the hotkeys are up here
0:47 like Shift + Enter and stuff.
0:49 But I want to know how to add a cell below here.
0:51 Now that's B, but let me show you how you can find out.
0:54 If I go over here and I type cell, there's a ton of stuff.
0:58 If I want to insert a cell below, I can just type B.
1:01 If I want to select the cell above
1:03 I could hit like, K, if I want.
1:06 In the code, I could hit K and then you'll run that
1:10 and then go back down, J back down.
1:12 Now if I want to insert a new piece, I hit B, there we go.
1:17 If I want to change to markdown, I can hit M.
1:20 If I want to change to code do I hit C? No, I hit Y.
1:24 So I come down here and say, a new heading
1:26 parse the HTML from the entry descriptions.
1:31 We could add even more text, right, this is just markdown.
1:36 Then we hit ALT + CTRL + Enter to do that.
1:38 B to define another block below, and off we go.
1:42 So let's define all_links.
1:44 We're going to just pile up every single link
1:47 that we can find, not worrying about the domains yet.
1:49 We just want to get the links out of the body.
1:52 So we'll say for e in entries, let's just print
1:56 to R or something and do that real quick.
1:58 Now, that's probably, we'd want a quote.
2:02 Here we go, all right. So it looks like there's a bunch of entries.
2:04 So we're going to go through them, and what are we going to do?
2:07 We're going to save the description as e.get('description').
2:12 All right, that's going to be the HTML.
2:15 Then we'll come over here
2:16 I want to say bs4, if you want autocomplete
2:18 you have to hit Tab, it doesn't automatically come up.
2:20 But notice we have a Beautiful Soup
2:22 and what are we going to pass in?
2:23 Well, we can't give it the fragment
2:25 I don't think, so let's give it proper HTML.
2:34 We run that, now, I believe it would prefer
2:37 that we said 'html.parser' here, yeah.
2:41 We won't see the warning that it's putting out
2:43 but, you know, deep down in the guts
2:44 I'm sure there's some kind of warning.
2:46 So let's go over here and we'll say the links
2:48 and we'll do a cool little list comprehension.
2:51 Now just give it some space, 'cause I want to separate these
2:53 for a in soup.find_all('a').
2:58 All right, and let's just really quickly here
2:59 print len of links, all right, awesome.
3:02 So it looks like the data's coming along.
3:04 See how cool it is?
3:05 We can re-run and explore this little bit
3:07 without re-computing all this stuff up here, granted
3:09 it's not that bad, but like I keep thinking computation
3:13 is expensive, we're doing a bunch of work
3:14 but we want to keep playing with it and seeing the output.
3:17 All right, so we can just see that, yeah
3:18 it looks like that's probably decent.
3:21 We could even print out what the links are
3:23 like let's say the first two.
3:27 Well, that's messy, but it looks correct, doesn't it?
3:29 Okay, so we're on a great path here.
3:31 But what we have is the entire hyperlink
3:33 and I want to get just the href.
3:36 So let's go in here in Beautiful Soup we can go like this
3:39 and, but wait, wait, print links.
3:44 Run it again, oh yeah, now we're getting just the hyperlinks.
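A minimal sketch of the step just described, assuming each entry is a dictionary with a 'description' key holding an HTML fragment. The sample entry and its URLs are made up for illustration, and the fragment is passed to Beautiful Soup as-is here, which 'html.parser' tolerates.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one feed entry; the real entries come from the parsed RSS feed.
entry = {
    "description": (
        '<p>Links: <a href="https://www.techrepublic.com/article">TechRepublic</a> '
        'and <a href="https://do.co/python">Digital Ocean</a>.</p>'
    )
}

description = entry.get("description")

# Naming the parser explicitly avoids Beautiful Soup's "no parser specified" warning.
soup = BeautifulSoup(description, "html.parser")

# find_all('a') returns whole anchor tags; indexing a tag with 'href' gives just the URL.
# href=True skips any anchor that has no href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(len(links))
print(links)
```

Without the a["href"] lookup the list holds full tag objects, which is the messy output seen a moment earlier.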
3:47 Okay, our data's looking better, data's looking better.
3:50 There's a couple of things that I don't like, though.
3:52 I don't want to talk about www versus not www
3:57 don't really care about HTTPS, things like that.
4:00 So we're going to do a little bit of normalization here.
4:08 Okay, we're going to iterate on top of links
4:10 and do a little bit of clean up
4:11 so this part should go away, if we run this.
4:14 Here we go, try again, okay, perfect.
4:16 So techrepublic and www.techrepublic
4:20 no longer a difference, we just want the base domain name.
4:24 It turns out we also have some aliases
4:27 that we sometimes use and sometimes don't.
4:29 So let's go down here and do this one more time.
4:32 There's probably a slightly cleaner way
4:34 to do this, but we're going to just say
4:36 replace do.co with Digital Ocean.
4:38 This is like a redirector URL.
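The clean-up could look roughly like this. It is a sketch under assumptions: the exact replacement strings typed in the lecture are not shown in the transcript, so "digitalocean.com" and the sample URLs are illustrative, and normalize is a hypothetical helper name.

```python
# Illustrative links for one entry (made up for this sketch).
links = [
    "https://www.techrepublic.com/article",
    "http://techrepublic.com/article",
    "https://do.co/python",
]


def normalize(link: str) -> str:
    """Strip the scheme and 'www.' so equivalent URLs compare equal."""
    link = link.replace("https://", "").replace("http://", "")
    link = link.replace("www.", "")
    # Expand the short redirector alias. Plain str.replace is naive: it would
    # also rewrite any other URL that happens to contain "do.co".
    link = link.replace("do.co", "digitalocean.com")
    return link


links = [normalize(link) for link in links]
print(links)
```

After this pass the two techrepublic URLs are identical strings, which is what makes the later duplicate counting meaningful.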
4:41 Okay, this is all good, now what we're printing out here
4:44 each time we do a print, like notice, like right there
4:47 and right there, the closing brace
4:50 we're printing this out for just that one entry.
4:53 We want all the entries for all 152
4:57 or we want all the links for all 152 entries.
5:00 Now what we're going to do is come down here and say
5:04 all_links.extend, it's like AddRange, links
5:09 bottom, let's just print the first, I don't know, ten.
5:15 Let's also print out the link.
5:17 How many links do we have total?
5:19 How many different unique links do we have?
5:21 Well, how many times has a link been mentioned?
5:27 All right, 2,721, so that's pretty cool.
5:32 And see how nice it is to just explore this data
5:34 we don't have to keep re-running it.
5:35 Like we can forget about how we even got this feed data.
5:39 Yeah, we're going out to an API doing RSS feed
5:41 and we're hitting it, but all the stuff we've been doing
5:44 down here, like, it's off the screen and out of mind.
5:46 We just have this data magically
5:48 by the magic of technology, we have it.
5:50 We can just work with it over and over and over
5:52 not concerned about the latency of getting it
5:55 or the computational cost, or whatever.
5:59 Though maybe it, kind of keeping with our style here
6:01 let's put out a little statement here.
6:08 We can say something like, parsed some number
6:09 of links from all the episodes and we just run that again.
6:12 Perfect, we've parsed 2,721 links.
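For .NET developers, list.extend is the counterpart of List<T>.AddRange: it adds each item of another sequence, where append would add the sequence itself as a single item. A small sketch of the accumulation loop, using made-up per-entry link lists in place of the real parsed entries:

```python
# Hypothetical per-entry results; in the notebook each list comes from parsing one description.
links_per_entry = [
    ["techrepublic.com/article", "digitalocean.com/python"],
    ["digitalocean.com/python"],
    [],
]

all_links = []
for links in links_per_entry:
    # extend adds the individual links; append(links) would nest a list inside all_links.
    all_links.extend(links)

print(all_links[:10])
print(f"Parsed {len(all_links):,} links from all the episodes.")
```

Because all_links lives in the notebook's kernel after the cell runs, later cells can keep using it without re-running the loop.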
6:15 And notice when I run this, watch though
6:18 as soon as this turns to a star and then goes back to 45.
6:23 That took like two seconds to run, but now that
6:25 it's done, we never have to run that code again.
6:28 We just work with all links, which is now, remember
6:31 just the raw links, we don't have to parse
6:33 that 2.5 megs of HTML, RSS,
6:36 XML, blended weirdness, we're done.
6:38 We just worry about the links.
6:40 Like, now that we're done with that step, we're golden.
6:43 Let's hit B to add another one
6:44 and then M to convert it to markdown.
6:50 All right, so we're going to say
6:51 extracted domain names.
6:52 Actually, let's do one more really quick thing here
6:54 let's make this a little bit smaller.
6:56 Let's talk about how many unique links there are.
7:01 And we can do that by creating a list
7:03 and then passing it a set of all_links.
7:07 What does that do? Well, that takes the links, all of them with duplication
7:10 converts it to a set which has no duplicates
7:12 and then turns it back to a list so that we
7:15 can deal with it, and who knows?
7:17 Maybe it's like reasonable to say sort, sort that, right?
7:20 We just do it once and we're not
7:21 going to compute it again anyway.
7:23 And let's move this down here
7:25 so we know how many distinct links we have.
7:29 There we go, we've lost about 400 duplicates
7:32 that were in there, these are like sponsor links
7:34 and, you know, maybe links back to our profile.
7:37 Or who knows what those are?
7:38 But they're gone 'cause they were duplicates.
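The de-duplication step can be sketched as follows, with a short made-up all_links list standing in for the real one. The name unique_links is an assumption; the transcript does not say what the variable was called.

```python
# Made-up sample; the real all_links holds every link from every entry.
all_links = [
    "techrepublic.com/article",
    "digitalocean.com/python",
    "digitalocean.com/python",
    "talkpython.fm",
    "talkpython.fm",
]

# set() drops duplicates but has no order, so convert back to a list and sort it.
unique_links = list(set(all_links))
unique_links.sort()

duplicates_removed = len(all_links) - len(unique_links)
print(f"{len(unique_links):,} distinct links, {duplicates_removed:,} duplicates removed.")
```

sorted(set(all_links)) does the same thing in one expression and already returns a list.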