Python for Decision Makers and Business Leaders Transcripts
Chapter: Data science in Python
Lecture: From links to domain names
0:00
Now that we have our links
0:01
what we want is to get the domains.
0:03
So let's write that here.
0:09
Now it turns out this is not too hard.
0:11
We're going to need a library
0:13
that's not super obvious
0:14
but a little Stack Overflow googling will get you there.
0:20
So we're going to use urllib.parse
0:21
and what we need is the domains.
0:24
This is going to be
0:25
let's just put l for l in all_links for a second
0:31
and if we print out the domains really quick
0:34
well, that's just these.
0:35
But what we need to do is convert this thing
0:38
so we can say this urllib.parse
0:42
.urlparse and we'll pass it that little thing.
0:46
Now notice we get a parse result over and over and over
0:49
but if you look there's a value
0:51
or property netloc, the network location, and that is what we want.
0:54
So we can say .netloc, run it again, and
0:58
there it is. There they all are, and we have duplication.
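The step just described can be sketched roughly like this. The contents of all_links here are made-up placeholder URLs, since the real list comes from earlier in the course; only urllib.parse.urlparse and its netloc attribute are taken from the lecture.

```python
from urllib.parse import urlparse

# Placeholder data (an assumption): the real all_links is built earlier in the course.
all_links = [
    "https://example.com/episodes/1",
    "https://example.org/about",
    "https://example.com/episodes/2",
]

# urlparse returns a ParseResult; its netloc attribute holds the domain part of the URL.
domains = [urlparse(l).netloc for l in all_links]

# Duplicates are kept on purpose, one entry per link.
print(domains)
```

Note that netloc is the full network location, so a URL that includes a port or a "www." prefix keeps those pieces in the result.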
1:01
We want that duplication
1:02
because these are multiple references back
1:05
to the original site and we're going to count
1:08
how many times each of those appear.
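The counting itself is not shown at this point in the lecture. One possible way to do it, offered only as a sketch and not necessarily the approach the course takes, is collections.Counter from the standard library. The domains list below is placeholder data.

```python
from collections import Counter

# Placeholder data (an assumption): stands in for the list of domains with duplicates.
domains = ["example.com", "example.org", "example.com"]

# Counter tallies how many times each domain appears in the list.
counts = Counter(domains)

# most_common() lists (domain, count) pairs, highest count first.
for domain, count in counts.most_common():
    print(domain, count)
```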
1:11
I guess we'll have as many domains there.
1:13
I guess we could have some little print out
1:14
that means something here at least.
1:15
So we could say how many times
1:22
how many different ones are there?
1:23
That sounds like it could be challenging.
1:25
But we can just use what's called a set
1:27
and a set will take a whole bunch
1:29
of items with duplication
1:30
and just get it down to a unique set.
1:32
So we could say domain
1:34
and we have to ask how many of those there are
1:37
and we do that like this. Run it again
1:39
there are 799 unique domains. Cool, huh?
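A minimal sketch of the set idea, using a small placeholder list instead of the real domains, so the count printed here is 2 and not the 799 from the video. The name unique_domains is made up for illustration.

```python
# Placeholder data (an assumption): stands in for the list of domains with duplicates.
domains = ["example.com", "example.org", "example.com"]

# A set keeps only one copy of each item, so duplicates collapse.
unique_domains = set(domains)

# len() then answers "how many different ones are there?"
print(f"There are {len(unique_domains)} unique domains.")
```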