Python for the .NET Developer Transcripts
Chapter: Computational notebooks
Lecture: Counting domains
0:00 Alright, we have our domains here. Maybe we'll go ahead and change this to say something, really quick. First 10 domains are...
0:09 here we go, first 10 domains are like that. Let's add some of them below, some markdown. Now we're going to write some code
0:16 and I think this will impress you. I'm pretty sure. It definitely impressed me when I first learned it. So here's what we want to do.
0:22 I want to go through that list and find all the unique names: I want to find that one, and that one, and so on.
0:29 Then I want to count how many there are. Then I want to sort them by the most common first. Give me that name and the count.
0:36 And then the second most common, then the third most common, and so on. So there's a cool module called collections
0:41 so we can say, from collections import Counter. And we can say, counter is going to be a Counter of these domain names.
0:50 And then we can ask it questions like give me the most common. And what that does is basically give us a list of these things.
0:58 Say top_25 is going to be most_common, up to 25. Why is it going to work? Because this is sorted as I described it, most popular to least popular.
1:08 Then we can just print top_25. Are you ready for this? Look how little code this is. Boom, actually let's put it out like this.
1:15 I think we'll see it better. There we go, I like the way that looks better. We could do better pretty printing but you know what, we got this covered.
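The counting step described here can be sketched roughly as follows. The `domains` list is made-up sample data rather than the data from the lecture, and the variable names `counter` and `top_25` are assumptions about what the notebook cell uses.

```python
from collections import Counter

# Hypothetical sample data; in the lecture the list comes from earlier cells.
domains = [
    "github.com", "twitter.com", "github.com", "youtube.com",
    "github.com", "twitter.com", "python.org",
]

# Counter tallies how many times each domain appears in the list.
counter = Counter(domains)

# most_common(n) returns up to n (name, count) pairs, highest count first.
top_25 = counter.most_common(25)

# Printing one pair per line is easier to read than one long list.
for name, count in top_25:
    print(name, count)
```

Because `most_common` already returns the pairs ordered from most to least frequent, no separate sorting step is needed.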
1:22 Look at that. 382. 153. 64, and so on. That's it. We've gone and found the popular domains. GitHub, Twitter, and YouTube, apparently
1:34 as well as Python.org, Medium, Reddit. Maybe we want to exclude referring back to ourselves. So we can come over here and we can do a cool trick
1:42 with that. We'll call it excluded. It's going to be a set like that and who knows what else. Empty string, possibly, hash.
1:52 Here we could say, if link not in that. Oh, whoops, we need to... Probably the best way to parse it is like this.
2:01 d for d in domains if d is not in excluded. I'm going to run down here, see if this Python Bytes, 32, should go away. And it does, because it's excluded.
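A minimal sketch of the exclusion step, assuming the filter is a list comprehension over `domains`. The sample data is invented, and `example.com` is only a stand-in for whichever domain counts as "ourselves".

```python
from collections import Counter

# Hypothetical sample data; "example.com" stands in for our own site.
domains = [
    "github.com", "example.com", "", "twitter.com",
    "#", "github.com", "example.com",
]

# A set gives fast membership checks for the values we want to skip:
# our own domain, the empty string, and a bare "#".
excluded = {"example.com", "", "#"}

# Keep every domain that is not in the excluded set.
filtered = [d for d in domains if d not in excluded]

top_25 = Counter(filtered).most_common(25)
print(top_25)
```

Filtering before building the `Counter` keeps the excluded entries out of the tally entirely, so they cannot take up slots in the top 25.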
2:12 We're like, hey, we don't want to count referring to ourselves. That's kind of weird. No, we're not taking credit for that. So here we have it.
2:17 These are the top 25 domains. And as I look at this, I'm kind of feeling like this unique bit that we were doing,
2:25 I don't think I want to do that anymore. Because we might refer to a project five times and we'll just go back.
2:32 I think this is still probably going to be exactly the same issue. So many thousand, and we just need to change that little bit right there.
2:41 Rerun it again. Ah, guess we've got one more to get rid of. First 10 domains look good again. There we go, and notice we're pointing to GitHub a lot more,
2:54 Twitter a lot more. That's because there are projects that are popular, and I want to count those. All right, that's it.
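To see why dropping the uniqueness step raises the counts, here is a small sketch. It assumes the earlier "unique" step removed repeated domains within each group of links before counting; the grouping, the data, and the names `links_per_group`, `unique_counts`, and `all_counts` are all hypothetical.

```python
from collections import Counter

# Hypothetical data: each inner list holds the domains linked from one group.
links_per_group = [
    ["github.com", "github.com", "twitter.com"],
    ["github.com", "youtube.com"],
]

# With deduplication: a domain counts at most once per group.
unique_counts = Counter(d for group in links_per_group for d in set(group))

# Without deduplication: every single link counts.
all_counts = Counter(d for group in links_per_group for d in group)

print(unique_counts.most_common(25))
print(all_counts.most_common(25))
```

Counting every link rewards domains that are referenced repeatedly, which matches the goal stated above of giving popular projects their full weight.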
3:00 So a lot of looking at the data, thinking about it, bumping around back and forth, and playing with it. But in order to go from our domains
3:07 to which ones are popular, it's ridiculous how little code that takes, right?