Python for Decision Makers and Business Leaders Transcripts
Chapter: Data science in Python
Lecture: Downloading the RSS feed
0:00 Now our goal is to go and download that RSS feed. Let's get started. What I want you to take away from this, and why I'm doing it live
0:07 and not just showing you some finished product, is that there will be big steps, like: I want to go to the Internet,
0:14 download that data, and parse apart a huge amount of RSS data, like 1.5 or two megabytes, into something we can use
0:23 in memory in a program. You'll see that that is a handful of lines of code, and they're all really straightforward.
0:29 You could look at this even as a non-developer and go, wow, I see what that's doing. That's not actually that complicated.
0:35 So we want to work with some libraries. In our regular program, remember, we had to import them to say we're going to use them.
0:40 Same thing here. The first import is called feedparser, and we're also going to need this thing called bs4. We can go ahead and get those,
0:51 and we could put a little description above them, but just for the sake of time I'm not going to do that. So when I come down here,
1:00 we're going to download something from a URL. Pretty straightforward URL there. Now, how do we go to the Internet, download this, parse it apart,
1:10 and get it into something we can work with? Watch this: feedparser.parse(url). Now I'm going to print out a little bit of this.
1:22 Just the first hundred characters. Going to run this. Look at that: feed, title, Python Bytes, title detail, plain text.
1:31 Language, I guess. It's not set. How cool is that? So this is quite amazing. Let's run it again and just store that.
1:39 And one of the things that's really interesting about these notebooks is that this takes a little while to run. It's not hugely complicated,
1:45 but imagine I was doing some financial calculation that takes 10 minutes. I can go on and keep working, and it will just remember this result.
1:53 If I want to run some other code at the bottom, I don't have to rerun this. I can explore, change, tweak it, and so on
2:00 without having to recompute the whole thing. That's a huge part of this for computationally expensive things.
2:09 The next thing we need to do is get the descriptions. So let's play around just a moment and see how that goes. If we look at the feed,
2:17 we can see it's what's called a dictionary, and we can get the items. Let's see what that is. Okay, there they are. There are actually a whole bunch,
2:27 so let's not get all of them. Let's just get one. That's cool; now here it is. And notice it has a summary, and a little bit farther down,
2:36 woo, this is a mess, somewhere it has a description. I can't quite see it, but if we just come up here and say get('description'), there we go.
2:47 So what we need to do is go through all the items and get out this description thing, and then it's HTML.
2:52 We're going to go through it, we're going to find this hyperlink, and then we're going to convert that
2:56 to get the domain, and then we're going to graph it. It sounds like a really complicated problem, but these first two steps already got us
3:05 downloading and parsing, so you can bet that the remaining steps are going to be big and useful steps
3:13 right along the way. So notice how I wasn't sure what I was doing, so I just explored it. And now I can say the descriptions are...
3:19 and we can use this cool little expression here. We can say item.get('description') for item in feed.get('items'). And we can just print:
3:35 we found some number of descriptions. Let's run this. Dang, look at that. How cool is this? We found 157 descriptions,
3:43 so what we have is a list, or an array, full of each description. We can go through those descriptions and pull out those things.
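That expression is what Python calls a list comprehension. Here's the same pattern on a hypothetical two-item feed dictionary (the real feed object works the same way):

```python
# A made-up miniature version of the parsed feed, shaped like a dictionary.
feed = {
    "items": [
        {"title": "Episode 1", "description": "<a href='https://a.example'>one</a>"},
        {"title": "Episode 2", "description": "<a href='https://b.example'>two</a>"},
    ]
}

# Pull the description out of every item in one expression.
descriptions = [item.get("description") for item in feed.get("items")]

print(f"We found {len(descriptions)} descriptions.")
```

Against the real feed, the same one-liner produces the list of 157 descriptions seen in the lecture.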
3:51 So this is how we're going to get started. We've already gone and downloaded the data,
3:57 converted it from XML to something we can work with, and now we've converted it to a whole bunch of strings. And the question is now,
4:04 'cause these are pretty complicated, what do we do with all this HTML goo? It turns out, like I said, Python's pretty awesome at handling it,
4:15 and we're making big steps right along the way here. Very, very cool.
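The remaining steps the lecture outlines, find the hyperlink inside each description's HTML and convert it to a domain, might be sketched with bs4 (BeautifulSoup) and the standard library like this; the function name and sample description are made up for illustration:

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

def domain_of_first_link(description_html):
    """Return the domain of the first hyperlink in an HTML description, or None."""
    soup = BeautifulSoup(description_html, "html.parser")
    anchor = soup.find("a")  # the first <a href="..."> tag, if any
    if anchor is None or not anchor.get("href"):
        return None
    return urlparse(anchor["href"]).netloc  # just the domain part of the URL

# A hypothetical description like those in the feed:
print(domain_of_first_link('<p>See <a href="https://github.com/psf/requests">requests</a></p>'))
```

Running that over the whole list of descriptions would give the domains to graph in the next step.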