Python for Decision Makers and Business Leaders Transcripts
Chapter: Data science in Python
Lecture: Downloading the RSS feed
0:00
Now our goal is to go and download that RSS feed. Let's get started. What I want you to take away from this, and why I'm doing it live
0:07
rather than just showing you some finished product, is that there will be big steps, like going to the Internet to
0:14
download that data, a huge amount of RSS, maybe 1.5 to 2 megabytes, and parse it apart into something we can use
0:23
in memory, in our program. You'll see that it takes a handful of lines of code, and they're all really straightforward.
0:29
You could look at this even as a non-developer and go, wow, I see what that's doing. That's not actually that complicated.
0:35
So we want to work with some libraries. In our regular program, remember we had to import them to say we're going to use them.
0:40
Same thing here. The first import is called feedparser, and we're also going to need this thing called bs4. We can go ahead and get those
0:51
and we could put a little description above it but just for the sake of time I'm not going to do that. So when I come down here
1:00
we're going to download something from a URL. Pretty straightforward URL there. Now, how do we go to the Internet, download this, parse it apart,
1:10
and get it into something we can work with? Watch this: feedparser.parse(url). Now I'm going to print out a little bit of this.
1:22
Just the first hundred characters. Going to run this. Look at that. Feed, title, Python Bytes, title detail, plain text.
1:31
Language, I guess. It's not set. How cool is that? So this is quite amazing. Let's run it again and just store that.
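For reference, the download-and-parse step described above looks roughly like this sketch. The exact feed URL isn't shown in the transcript; the Python Bytes address below is an assumption based on the feed title that appears in the output.

    import feedparser

    # Assumed address of the Python Bytes RSS feed (not visible in the transcript).
    url = 'https://pythonbytes.fm/episodes/rss'

    # Download the RSS XML from the Internet and parse it into a dictionary-like object.
    feed = feedparser.parse(url)

    # Peek at just the first hundred characters of the result.
    print(str(feed)[:100])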
1:39
And one of the things that's really interesting about these notebooks is this takes a little while to run. It's not hugely complicated
1:45
but imagine I was doing some financial calculation that takes 10 minutes. I can go on and keep working and it will just remember this.
1:53
If I want to run some other code at the bottom I don't have to rerun this. I can explore, change, tweak it, and so on
2:00
without having to recompute the whole thing. That's a huge part of the appeal for computationally expensive work.
2:09
The next thing we need to do is get the descriptions. So let's play around for just a moment and see how that goes. If we look at the feed
2:17
we can see it's what's called a dictionary, so we can get the items. Let's see what that is. Okay, there they are. There are actually a whole bunch,
2:27
so let's not get all of them, let's just get one. That's cool, now here it is. And notice it has a summary, and a little bit farther down,
2:36
woo, this is a mess, somewhere in there it has a description. I can't quite see it, but if we just come up here and say get description, there we go.
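A rough sketch of that exploration, assuming the feed object from the earlier sketch; the exact cell contents aren't shown on screen.

    # The parsed feed acts like a dictionary, so we can ask it for its items.
    items = feed.get('items')
    print(len(items))                # there are a whole bunch of them

    # Just look at one item instead of all of them.
    first = items[0]
    print(first.get('summary'))      # the short summary...
    print(first.get('description'))  # ...and the longer HTML description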
2:47
So what we need to do is go through all the items and pull out this description thing, and that description is HTML.
2:52
We're going to go through it, find this hyperlink, and then we're going to convert that
2:56
to get the domain, and then we're going to graph it. It sounds like a really complicated problem, but these first two steps already got us
3:05
downloading and parsing the data, so you can bet the remaining steps will be just as big and useful
3:13
right along the way. Notice how, when I wasn't sure what I was doing, I just explored, and now I can say the descriptions are
3:19
and we can use this cool little expression here: item.get('description') for item in feed.get('items'). And we can just print that
3:35
We found some number of descriptions. Let's run this. Damn, look at that. How cool is this? We found 157 descriptions
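The cool little expression is a list comprehension. A minimal sketch, assuming the feed object from the earlier sketch:

    # Collect every item's description in a single expression.
    descriptions = [item.get('description') for item in feed.get('items')]
    print(f'We found {len(descriptions)} descriptions.')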
3:43
so what we have is a list, an array full of descriptions. We can go through those descriptions and pull out the pieces we care about.
3:51
So this is how we're going to get started. We've already gone and downloaded the data,
3:57
converted it from XML into something we can work with, and now we've turned it into a whole bunch of strings. The question now,
4:04
because these are pretty complicated, is what do we do with all this HTML goo? It turns out, like I said, Python is pretty awesome at handling it,
4:15
and we're knocking out these steps right along the way. Very, very cool.
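A hedged sketch of where this is heading, assuming the descriptions list from the earlier sketch: bs4's BeautifulSoup can pull the hyperlinks out of one HTML description, and the standard library's urlparse can reduce each link to its domain. The exact code isn't shown in this lecture, so the names here are just illustrative.

    from urllib.parse import urlparse

    from bs4 import BeautifulSoup

    # Parse one HTML description and collect the href of every anchor tag in it.
    soup = BeautifulSoup(descriptions[0], 'html.parser')
    links = [a['href'] for a in soup.find_all('a') if a.has_attr('href')]

    # Reduce each link to just its domain, ready to be counted and graphed later.
    domains = [urlparse(link).netloc for link in links]
    print(domains)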