Python for decision makers and business leaders Transcripts
Chapter: Data science in Python
Lecture: Downloading the RSS feed
0:00 Now our goal is to go and download that RSS feed.
0:02 Let's get started
0:03 and what I want you to take away from this
0:05 and why I'm doing it live
0:06 and I'm not just showing you some finished product.
0:09 What I want you to take away from it is
0:10 there'll be big steps
0:12 like I want to go to the Internet
0:13 download that data
0:14 and parse apart a huge amount of RSS data.
0:17 Like 1.5 or two megabytes of data
0:20 and parse it apart into something we can use
0:22 in memory and programming.
0:24 You'll see that that is a handful of lines of code
0:26 and they're all really straightforward.
0:28 You could look at this even as a non-developer
0:31 and go, wow, I see what that's doing.
0:32 That's not actually that complicated.
0:34 So we want to work with some libraries.
0:36 In our regular program, remember
0:37 we had to import them to say we're going to use them.
0:39 Same thing here. The one we import is called feedparser
0:43 and we're also going to need this thing called bs4.
0:48 We can go ahead and get those
0:50 and we could put a little description above it
0:52 but just for the sake of time
0:53 I'm not going to do that.
0:58 So when I come down here
0:59 we're going to download something from a URL.
1:04 Pretty straightforward URL there.
1:06 Now, how do we go to the Internet
1:08 download this, parse it apart
1:09 and get it into something we can work with?
1:12 Watch this. feedparser.parse(url).
1:17 Now I'm going to print out a little bit of this.
1:21 Just the first hundred characters.
1:23 Going to run this. Look at that.
1:26 Feed, title, Python Bytes, title detail, plain text.
1:30 Language, I guess. It's not set.
1:32 How cool is that? So this is quite amazing.
1:35 Let's run it again and just store that.
1:38 And one of the things that's really interesting
1:39 about these notebooks is this takes a little while to run.
1:42 It's not hugely complicated
1:44 but imagine I was doing some financial calculation
1:47 that takes 10 minutes.
1:48 I can go on and keep working
1:51 and it will just remember this.
1:52 If I want to run some other code at the bottom
1:54 I don't have to rerun this.
1:55 I can explore, change, tweak it, and so on
1:59 without having to recompute the whole thing.
2:01 That's a huge part of this
2:02 for computationally expensive things.
2:08 The next thing we need to do is get the descriptions.
2:11 So let's play around just a moment and see how that goes.
2:15 If we look at the feed
2:16 which is what's called a dictionary,
2:18 we can get the items.
2:20 Let's see what that is.
2:23 Okay, there they are.
2:24 There's actually a whole bunch
2:26 so let's not get all of them.
2:27 Let's just get one.
2:30 That's cool, now here it is.
2:31 And notice it has a summary
2:34 and a little bit farther down
2:35 woo, this is a mess
2:36 somewhere it has a description.
2:38 I can't quite see it
2:39 but if we just come up here and say .get('description')
2:44 there we go.
2:46 So what we need to do is go through all the items
2:48 and get out this description thing
2:49 and then it's HTML.
2:51 We're going to go through and we're going to find
2:52 this hyperlink
2:53 and then we're going to convert that out
2:55 to get the domain, and then we're going to graph it.
2:57 It sounds like a really complicated problem
3:00 but inspired by this
3:02 these two steps already got us
3:04 downloading and parsing to understand that.
3:07 You can bet that the next steps we take
3:10 are going to be big and useful steps
3:12 right along the way. So notice how I wasn't sure what I was doing
3:14 so I just explored it
3:16 and now I can say the descriptions are
3:18 and we can use this cool little expression here.
3:20 We can say item.get('description')
3:26 for item in feed.get('items').
3:32 And we can just print.
3:34 We found some number of descriptions.
3:35 Let's run this. Damn, look at that.
3:38 How cool is this? We found 157 descriptions
3:42 so what we have is a list or an array
3:45 full of each description.
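The "cool little expression" here is a list comprehension. A minimal sketch of it follows, using plain dictionaries as a stand-in for the parsed feed so it runs on its own. The three sample items are made up for illustration; the real feed in the lecture had 157.

```python
# Stand-in for the parsed feed: plain dictionaries shaped like the lecture's data.
feed = {
    "items": [
        {"title": "Episode 1", "description": "<a href='https://example.com/a'>A</a>"},
        {"title": "Episode 2", "description": "<a href='https://example.org/b'>B</a>"},
        {"title": "Episode 3"},  # no description, so .get returns None here
    ]
}

# Build a list holding the description of every item, in order.
descriptions = [item.get("description") for item in feed.get("items")]

print(f"Found {len(descriptions)} descriptions.")
```

Using .get rather than square brackets means an item with no description yields None and does not stop the program with an error.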
3:46 We can go through those descriptions
3:48 and pull out those things.
3:50 So this is how we're going to get started.
3:52 We're going to go through
3:53 and we've already gone and downloaded the data
3:56 converted it from XML to something we can work with
3:58 and now we've converted it to a whole bunch of strings
4:01 and the question is now
4:03 'cause these are pretty complicated
4:05 what do we do with this?
4:10 With all this HTML goo.
4:11 It turns out, like I said
4:13 Python's pretty awesome at handling it
4:14 but we're making our steps right down the way here.
4:17 Very, very cool.
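One possible way to take the next step mentioned earlier, finding the hyperlinks in a description and reducing each to its domain, is sketched below. It uses only Python's standard library as a stand-in for the bs4 package imported in the lecture, so it is not necessarily how the course does it. The LinkCollector class, the link_domains function, and the sample HTML are hypothetical names and data for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def link_domains(description_html):
    """Return the domain of each hyperlink found in one HTML description."""
    collector = LinkCollector()
    collector.feed(description_html)
    return [urlparse(link).netloc for link in collector.links]


# Made-up description text, shaped like the HTML inside a feed item.
sample = (
    '<p>See <a href="https://www.python.org/downloads/">Python</a> and '
    '<a href="https://pypi.org/project/feedparser/">feedparser</a>.</p>'
)

print(link_domains(sample))  # ['www.python.org', 'pypi.org']
```

Running link_domains over every description in the list would give the raw material for counting and graphing the domains.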