Python for the .NET developer Transcripts
Chapter: Computational notebooks
Lecture: Counting domains
0:00 Alright, we have our domains here.
0:02 Maybe we'll go ahead and change this
0:03 to say something, really quick.
0:05 First 10, domains are...
0:08 here we go, first 10 domains are like that.
0:10 Let's add some of them below, some markdown.
0:13 Now we're going to write some code
0:15 and I think this will impress you.
0:16 I'm pretty sure.
0:17 It definitely impressed me when I first learned it.
0:20 So here's what we want to do.
0:21 I want to go through that list, find all the unique names
0:24 I want to find that, and I want to find that, and so on.
0:28 Then I want to count how many there are.
0:30 Then I want to sort them by the most common first.
0:33 Give me that name and the count.
0:35 And then the second most common
0:36 then the third most common, and so on.
0:38 So there's a cool library called collections
0:40 so we can say, from collections import Counter.
0:44 And we can say, the counter is going to be
0:45 a Counter of these domain names.
0:49 And then we can ask it questions like
0:51 give me the most common.
0:53 And what that does is basically
0:55 gives us a list of these things.
0:57 Say top_25 is going to be most_common, up to 25.
1:02 Why is it going to work?
1:03 Because this is sorted as I described it
1:05 most popular to least popular.
1:07 Then we can just print, Top 25.
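
[Editor's sketch] The steps described above can be sketched roughly as follows. The `domains` list here is made-up sample data, and the variable names `counter` and `top_25` are assumed from the narration; the lecture's real list comes from earlier notebook cells.

```python
from collections import Counter

# Hypothetical sample data standing in for the notebook's real domain list.
domains = [
    "github.com", "twitter.com", "github.com",
    "youtube.com", "github.com", "twitter.com",
]

# Counter tallies how many times each distinct value appears.
counter = Counter(domains)

# most_common(n) returns up to n (value, count) pairs,
# ordered from the highest count to the lowest.
top_25 = counter.most_common(25)

print(top_25)
```

Asking for 25 when fewer distinct domains exist is fine; `most_common` simply returns everything it has.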
1:10 Are you ready for this?
1:11 Look how little code this is.
1:12 Boom, actually let's put it out like this.
1:14 I think we'll see it better.
1:15 There we go, I like the way that looks better.
1:17 We could do better pretty printing
1:19 but you know what, we got this covered.
1:21 Look at that. 382. 153. 64, and so on. That's it.
1:28 We've gone and found the popular domains.
1:31 GitHub, Twitter, and YouTube, apparently
1:33 as well as python.org, Medium, Reddit.
1:35 Maybe we want to exclude referring back to ourselves.
1:39 So we can come over here and we can do some cool trick
1:41 with that, we call it excluded.
1:46 It's going to be a set like that
1:48 and who knows what else.
1:49 Empty string, possibly, hash.
1:51 Here we could say, if link not in that.
1:55 Oh, whoops, we need to...
1:57 Probably the best way to parse it is like this.
2:00 d for d in domains if d not in excluded.
2:06 I'm going to run down here, see if this Python Bytes, 32
2:08 should go away.
2:09 And it does, because it's excluded.
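
[Editor's sketch] A minimal version of the exclusion step, assuming the filter is applied before counting. The site name `our-own-site.example` is a hypothetical placeholder for whichever domain refers back to the show itself, and the sample `domains` list is invented for illustration.

```python
from collections import Counter

# Hypothetical sample data, including junk entries and a self-reference.
domains = [
    "github.com", "", "our-own-site.example",
    "#", "github.com", "twitter.com",
]

# A set gives fast membership checks; these entries are assumptions.
excluded = {"our-own-site.example", "", "#"}

# Keep only the domains that are not in the excluded set.
filtered = [d for d in domains if d not in excluded]

top_25 = Counter(filtered).most_common(25)
print(top_25)
```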
2:11 We're like, hey, we don't want to count referring to ourselves.
2:13 That's kind of weird.
2:14 No, we're not taking credit for that.
2:15 So here we have it.
2:16 These are the top 25 domains.
2:19 And as I look at this, I'm kind of feeling like
2:21 this unique bit that we were doing.
2:24 I don't think I want to do that anymore.
2:25 Because we might refer to a project five times
2:28 and we'll just go back.
2:31 I think this is still probably going to be exactly
2:32 the same issue.
2:36 So many thousand, and we just need to change
2:38 that little bit right there.
2:40 Rerun it again.
2:41 Ah, guess we've got one more to get rid of.
2:48 First 10 domains look good again.
2:50 There we go, and notice we're pointing to GitHub a lot more,
2:53 Twitter a lot more. That's because there are projects
2:55 that are popular, and I want to count those.
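
[Editor's sketch] To illustrate why dropping the uniqueness step raises the counts, here is a small comparison. It assumes the earlier cells de-duplicated full URLs before extracting domains, which the transcript does not show; the `links` list and the use of `urllib.parse.urlparse` are illustrative choices, not necessarily what the notebook does.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical links: the same project is mentioned three times.
links = [
    "https://github.com/org/project",
    "https://github.com/org/project",
    "https://github.com/org/project",
    "https://twitter.com/someone",
]

# With de-duplication, a repeated link contributes only once.
unique_domains = [urlparse(url).netloc for url in set(links)]

# Without it, every mention counts, so popular projects weigh more.
all_domains = [urlparse(url).netloc for url in links]

unique_counts = Counter(unique_domains)
all_counts = Counter(all_domains)

print(unique_counts["github.com"], all_counts["github.com"])
```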
2:58 All right, that's it.
2:59 So a lot of looking at the data, thinking about it
3:01 bumping around back and forth and playing with it.
3:04 But in order to go from our domains
3:06 to what ones are popular, it's ridiculous, right?