Python Jumpstart by Building 10 Apps Transcripts
Chapter: App 5: Real-time weather client
Lecture: Getting started with Beautiful Soup
0:00 Now that we have the HTML downloaded,
0:02 it's time to use Beautiful Soup to pull out
0:05 just the elements that we want to work with.
0:08 Now what you kind of see is what many of you probably know,
0:11 that the web is a messy, dirty place,
0:14 there is all sorts of malformed documents
0:17 and weird white space and odd formatting that we are going to get back,
0:22 so there is going to be a lot of cleanup going on here,
0:24 but this Beautiful Soup package is really going to help us.
0:27 So we get started using Beautiful Soup
0:29 with an import statement, we just say BS4.
0:31 So we are going to write the function parse HTML
0:35 and we are not going to worry about the return value right away,
0:38 so let's just say parse, no let's make it be the more purpose driven,
0:42 get_weather_from_html() and we'll give it the html,
0:47 so let's go write this function, down here,
0:50 so our goal is to take this HTML which is dynamically generated
0:54 by the server including the weather report
0:57 for whatever the weather is in the zip code that the user entered.
1:01 But there is a known structure to this
1:04 and the way that we work with the structure is really
1:08 the same way that people style web pages and layout web pages,
1:11 it has to do with the css, we are going to start
1:14 by parsing this and then we are going to go explore css classes
1:18 and ides and stuff form this HTML
1:20 and then we'll use that in code to pull out the little bits that we need.
1:24 Start by creating something called a Beautiful Soup class,
1:27 so we say bs4. beautiful soup,
1:30 now notice how I can just type BS and it actually finds
1:34 all of the things that have sort of this camel casing capital B capital S
1:39 that's super handy and we are going to give it this HTML.
1:41 Here is the big step, we are going to take the text and parse it
1:44 into an n memory dome so that we can ask questions,
1:47 and let's just make sure that this doesn't crash on us,
1:49 so we'll print out the soup, over here and we'll just enter 97201 wait for a moment,
1:54 and boom- out comes what we were looking for.
1:57 It turns out when you print the soup object,
2:00 it just turns it back into a string that represents the document,
2:03 so it looks like that worked.
2:05 Ok, so we are off to a great start but I noticed something that maybe I didn't expect,
2:09 like a little warning came by, so let's comment out
2:11 that little print statement and do this again.
2:15 And Beautiful Soup now has a warning that says we are choosing
2:18 what we think is the best parser for this particular system
2:22 and this might be a different answer on a different machine,
2:24 so get potentially non deterministic results
2:27 so it says why don't we make this very explicit and put that here
2:30 so now we should get no more warnings and be ready to carry on.
2:35 So everything is running good, just do a little reformat in here,
2:39 so the next thing that we need to do is if we look at this soup object
2:42 you'll see there is all sorts of find methods here
2:45 and we can use this finds to navigate the tree and pull out the elements
2:49 that we need but what do we put in there,
2:52 that's where we go and we inspect the HTML.