Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Finding the title with BeautifulSoup

0:01 Okay, so to use Beautiful Soup, we are going to need to come up here,
0:04 and say import bs4. And, in order to get that to work,
0:09 what we need is we need to have Beautiful Soup 4 and if we go over here to the bs,
0:16 there is none, so we'll say pip install beautifulsoup4,
0:22 now be careful, I think there is a Beautiful Soup that is not 4.
0:25 Okay, great, so now a little error over here went away,
0:30 so now over here, actually we could even write it like this,
0:33 let's say from bs4 import BeautifulSoup, that's the one class,
0:38 like etree, that's really all we're going to use from that one, okay.
0:41 So down here, what we're going to do is we're going to parse it,
0:46 so I'll say soup = BeautifulSoup and I'll give it the html,
0:49 now, it's going to not love this as much as you might hope,
0:52 let me put a little print here so we'll say print downloading, we'll say the url
0:59 and we'll say flush=True to make sure this comes out straight away,
1:03 so if I run this, you'll see it works and then error,
1:06 not error, warning, warning, warning, and the warning is that no parser
1:09 was explicitly specified, so you can do this html.parser,
1:15 or let me just show you can use other ones as well so I could come over here
1:19 and say I want to use the high performance lxml one here,
1:22 so let's go update our requirements doc here,
1:26 make sure we have this and beautiful soup 4, have those there,
1:30 those are not misspelled, thank you, okay, so let's go back down here
1:36 and pip install lxml, it's a nice high performance C-based one,
1:41 it takes a moment to install, if for some reason it doesn't install,
1:44 like this might be tricky on Windows, just use what they specified over there
1:48 which was html.parser I think, you'll see in the error message,
1:52 okay that took a while actually, and this is a pretty fast computer,
1:55 but, it's installed so now we have lxml and we have beautiful soup both installed,
2:00 so if I run this one more time there should be no warnings.
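As a minimal sketch of the step above: passing the parser name as the second argument is what silences the warning. The html string here is a made-up stand-in for a downloaded page, and "html.parser" is used because it ships with Python; "lxml" can be swapped in if that package installed correctly.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML text of a downloaded page.
html = "<html><body><h1>Sample page</h1></body></html>"

# Naming the parser explicitly avoids the "no parser was explicitly specified" warning.
# "html.parser" is built into Python; use "lxml" instead if that package is installed.
soup = BeautifulSoup(html, "html.parser")

print(type(soup).__name__)
```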
2:04 Great, okay so we've downloaded these, None, None, None, None, None,
2:07 apparently is what we returned from this method, five times,
2:10 which by the way Python methods always have a return value,
2:14 it's just None if you don't return anything explicit.
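A small sketch of that point about return values. The function name download and the URL are placeholders for illustration only, and nothing is actually fetched here.

```python
def download(url):
    # Prints a progress message, but has no return statement.
    print("Downloading", url, flush=True)


# With no explicit return, the call evaluates to None.
result = download("https://example.com/")
print(result)
```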
2:17 Okay, so now we have this, let's look for the title, let's give this a shot here,
2:20 so we are going to say soup.find
2:23 there are a few navigational traversal type things here,
2:26 there are many finds, as you see, find parents, siblings, next, previous
2:30 and we also have a select, okay, so let's do this,
2:33 let's say find because find works on nodes basically,
2:38 so I can say I want to find h1 and then I come over here
2:41 and I can say get_text, and let me just print the title,
2:45 just to make sure something is going on here, boom,
2:48 now look at that, that almost is what we wanted, it's so super close,
2:51 look, we have this weird new line and whatnot, let me just add
2:55 a quick function here that we can use to clean this stuff up,
2:58 so let's go down here, put this at the bottom, so the problem is we have,
3:02 when we say get_text like that, the br is converted,
3:05 and then all the white space around it, like the indentation
3:08 in the html, we get exactly what is inside of that section,
3:11 so if you look over here, this from here to there like that stays in,
3:18 the br becomes a new line, but still, all that white space and tabs, that's there,
3:21 so what we got to do is have a little function that goes through and says
3:24 okay, new lines, tabs, all those become just spaces,
3:28 and then you might end up with a bunch of spaces,
3:31 so we'll write a little loop to convert two spaces to one space
3:34 until there is no more two spaces,
3:37 and that turns out to be pretty much the trick we need,
3:40 so let's go over here, and we'll say here is our title, instead of doing this
3:44 we'll say clean text we'll give it that, let's try again,
3:47 boom, just like we were hoping for, okay, so this is really nice,
3:51 we're going to use this clean text over and over again,
3:54 okay, so we're getting the titles, that's really cool, next,
3:57 let's do something a little more advanced, let's go and get the paragraphs.
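The title steps described in this lecture can be sketched roughly as follows. This is an illustration under assumptions, not the code from the video: the sample html, the function name clean_text, and the final strip() call are all made up for this sketch, and "html.parser" is used so nothing beyond Beautiful Soup needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: an h1 with a <br> and indentation inside it.
html = """
<html><body>
  <h1>
      Consuming HTTP Services<br>
      in Python
  </h1>
</body></html>
"""


def clean_text(text):
    """Turn newlines and tabs into spaces, then collapse runs of spaces."""
    if not text:
        return text

    for ch in ("\r", "\n", "\t"):
        text = text.replace(ch, " ")

    # Keep replacing two spaces with one until no double spaces remain.
    while "  " in text:
        text = text.replace("  ", " ")

    # Trimming the ends is an extra assumption, not mentioned in the lecture.
    return text.strip()


soup = BeautifulSoup(html, "html.parser")

# find returns the first matching node; get_text pulls out its text content.
raw_title = soup.find("h1").get_text()
title = clean_text(raw_title)

print(repr(raw_title))  # still carries the newlines and indentation
print(title)
```

The loop is kept instead of a regular expression because it mirrors the approach described above; a single re.sub call would be a reasonable alternative.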