Python for the .NET developer Transcripts
Chapter: Computational notebooks
Lecture: From links to domains
0:00 Now that we have our unique links
0:01 let's parse out the domain names
0:04 because things like GitHub and Twitter and Reddit
0:07 all that kind of stuff is going to appear many times
0:09 we want to know how many times each one appears.
0:12 So, press B to add another code block here.
0:16 And we're going to use this thing called urllib
0:19 which is built in to Python
0:20 and we're going to use this urlparse.
0:25 We'll say the domains are urlparse, hit Tab.
0:31 And that's going to be link for link
0:34 in unique links, like so.
0:37 Now we don't want this, this is going to give us an object.
0:39 What we want is the net location, netloc, like so.
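The step just described can be sketched roughly as follows. The contents of unique_links here are hypothetical sample data, not the links from the course; only urlparse and its netloc attribute come from the lecture.

```python
from urllib.parse import urlparse

# Hypothetical sample data standing in for the unique links built earlier.
unique_links = [
    "https://github.com/python/cpython",
    "https://www.reddit.com/r/Python/",
    "https://twitter.com/pythonweekly",
]

# urlparse() returns a ParseResult object; its .netloc attribute
# holds just the domain portion of the URL.
domains = [urlparse(link).netloc for link in unique_links]

# Quick exploration: look at the first ten entries.
print(domains[:10])
```

Note that netloc keeps any subdomain such as www, so www.reddit.com and reddit.com would be treated as different domains.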
0:43 And let's just do a little exploration
0:45 let's print out the first ten domains, something like that.
0:50 You know what those look like?
0:51 Those just look like broken, broken links.
0:55 My goodness.
0:56 This looks like something I've got to deal with later.
0:58 So one, two, three:
1:00 the first three, after we sorted, are broken.
1:03 Let's do this. Three onward, run that one.
1:09 There we go, fixed, like a charm.
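Here is a minimal sketch of that fix, assuming the slice is applied to the sorted list of links. The malformed entries below are invented for illustration; the actual broken links are not shown in the transcript. The filter at the end is an alternative that is not used in the lecture.

```python
from urllib.parse import urlparse

# Hypothetical sorted links; the three malformed entries sort to the front
# because their leading characters come before "h" in ASCII order.
unique_links = sorted([
    "https://www.amazon.com/dp/example",
    "(https://broken.example/a",
    "https://pycon.de/",
    "*https://broken.example/b",
    "<https://broken.example/c",
])

# As in the lecture: skip the first three entries, keep three onward.
links = unique_links[3:]
domains = [urlparse(link).netloc for link in links]
print(domains[:10])

# Alternative (an assumption, not the lecture's approach): drop any link
# whose netloc comes back empty instead of relying on a fixed offset.
filtered = [urlparse(link).netloc for link in unique_links
            if urlparse(link).netloc]
```

The slice only works while exactly three bad entries sort to the front, which is why the offset may need to change if the source data is corrected.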
1:11 I don't know what's going on
1:13 we must have just typed in some bad markdown along the way.
1:16 We did type 2.5 megabytes of text
1:18 so I guess that generates a few errors.
1:21 Anyway, we were able to see that really quickly
1:23 and just go back and change that here.
1:26 You might want to change this in the future
1:28 'cause I'm going to go back and fix that on the site, probably.
1:31 Nonetheless, here we have our domains
1:33 and let's just print out the first ten.
1:35 There, that's more like what I was expecting.
1:37 We're getting PyCon.de, Python Weekly, aka.ms.
1:42 I'm tempted to replace that with Microsoft.com
1:45 but it could redirect via Microsoft to somewhere else
1:47 so I'm just going to leave it like that.
1:49 Amazon, Amazon, Amazon.
1:51 Now, we want the duplication in this list, at the moment.
1:55 We don't want duplication in the links, necessarily.
1:59 I guess maybe, maybe we do.
2:01 But probably we don't, right?
2:02 We probably just want to say
2:03 what are all the things that we pointed at
2:05 and then how many of them are from any given domain.
2:08 So the point is that we want to say
2:09 well Amazon is more popular than the others
2:12 because there are three links to Amazon
2:14 and only one to Python weekly.
2:16 So we want this duplication, this is not a problem
2:19 this is actually the essential part
2:20 of what we are trying to work with here.
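To see why the duplicates matter, here is one possible way to count them. collections.Counter is an assumption on my part and is not shown in this lecture; the domain strings are hypothetical sample values modeled on the ones mentioned above.

```python
from collections import Counter

# Hypothetical domain list; duplicates are kept on purpose.
domains = [
    "pycon.de",
    "www.pythonweekly.com",
    "aka.ms",
    "www.amazon.com",
    "www.amazon.com",
    "www.amazon.com",
]

# Counter tallies how many times each domain appears.
counts = Counter(domains)

# The most frequently linked domain comes first.
print(counts.most_common(1))
```

If the domains had been de-duplicated first, every count would be one and the ranking would carry no information.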
2:22 So, there we have it, it wasn't a lot of work
2:24 it was a lot of talking and like looking at the data
2:27 but yeah, not so much work, right.
2:29 Just a tiny bit of code.
2:31 You'll notice we're using a lot of these list comprehensions
2:33 and other clever little programming techniques
2:37 so that we can write the minimal amount of code.
2:38 Like, this could be three loops, or it could just be that.
2:41 It could even be less
2:43 if we changed it, I'm sure, but nonetheless
2:46 we want to have really focused, small bits of code,
2:48 some explanation, maybe a picture
2:50 we're not there yet, but we're getting close
2:52 as we go through it.
2:53 So this is the notebook style.