Python for decision makers and business leaders Transcripts
Chapter: Data science in Python
Lecture: From links to domain names
0:00 Now that we have our links
0:01 what we want is to get the domains.
0:03 So let's write that here.
0:09 Now it turns out this is not too hard.
0:11 We're going to need a library
0:13 that's not super obvious
0:14 but a little Stack Overflow googling will get you there.
0:20 So we're going to use urllib.parse
0:21 and what we need is the domains.
0:24 This is going to be
0:25 let's just put l for l in all_links for a second
0:31 and if we print out the domains really quick
0:34 well, that's just these.
0:35 But what we need to do is convert this thing
0:38 so we can say urllib.parse
0:42 .urlparse, and we'll pass it that little thing.
0:46 Now notice we get a parse result over and over and over
0:49 but if you look there's a value
0:51 or property, netloc, and that is what we want.
0:54 So we can say .netloc, run it again, and
0:58 there it is. There they all are, and we have duplication.
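A minimal sketch of the step just described, using urlparse from urllib.parse and the netloc attribute of the ParseResult it returns. The sample URLs are made up for illustration and stand in for the all_links list built earlier in the course; the real list will look different.

```python
from urllib.parse import urlparse

# Hypothetical sample data standing in for the lecture's all_links list.
all_links = [
    "https://example.com/posts/1",
    "https://example.com/posts/2",
    "https://docs.python.org/3/library/urllib.parse.html",
]

# urlparse returns a ParseResult; its netloc attribute is the domain part.
# Duplicates are kept on purpose so they can be counted later.
domains = [urlparse(l).netloc for l in all_links]
print(domains)
```

Note that netloc would also include a port or login details if a link carried them, so real-world links may need extra cleanup.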
1:01 We want that duplication
1:02 because these are multiple references back
1:05 to the original site and we're going to count
1:08 how many times each of those appear.
1:11 I guess we'll have as many domains there.
1:13 I guess we could have some little printout
1:14 that means something here at least.
1:15 So we could say how many times
1:22 how many different ones are there?
1:23 That sounds like it could be challenging.
1:25 But we can just use what's called a set
1:27 and a set will take a whole bunch
1:29 of items with duplication
1:30 and just get it down to a unique set.
1:32 So we could say set of domains
1:34 and we have to ask how many of those there are
1:37 and we do that with len. Run it again:
1:39 there are 799 unique domains. Cool, huh?
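Roughly, the uniqueness check can be sketched like this. The domains list here is a small made-up stand-in, not the lecture's real data, so the count it prints will not be 799.

```python
# Hypothetical stand-in for the lecture's domains list (with duplicates).
domains = ["example.com", "example.com", "docs.python.org", "github.com"]

# A set keeps only one copy of each value, dropping the duplicates.
unique_domains = set(domains)

# len tells us how many distinct domains are left.
print(f"There are {len(unique_domains)} unique domains.")
```

The original domains list is left untouched, so the duplicates are still available for counting how often each domain appears.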