Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Finding the title with BeautifulSoup

0:01 Okay, so to use Beautiful Soup, we are going to need to come up here and say import bs4. And in order to get that to work,
0:10 we need to have Beautiful Soup 4 installed, and if we go over here and look for bs4, there is none, so we'll say pip install beautifulsoup4,
0:23 now be careful, I think there is a Beautiful Soup that is not 4. Okay, great, so now the little error over here went away,
0:31 so now over here, actually we could even write it like this, let's say from bs4 import BeautifulSoup, that's the one class,
0:39 like with etree, that's really all we're going to do with that one, okay. So down here, what we're going to do is we're going to parse it,
0:47 so I'll say soup = BeautifulSoup and I'll give it the HTML, now, it's not going to love this as much as you might hope,
0:53 let me put a little print here, so we'll say print downloading, we'll say the url, and we'll say flush=True to make sure this comes out straight away,
1:04 so if I run this, you'll see it works and then error, not error, warning, warning, warning, and the warning is that no parser
1:10 was explicitly specified, so you can do this html.parser, or let me just show you, you can use other ones as well, so I could come over here
1:20 and say I want to use the high performance lxml one here, so let's go update our requirements doc here,
1:27 make sure we have this and beautifulsoup4, have those there, those are not misspelled, thank you, okay, so let's go back down here
1:37 and pip install lxml, it's a nice high performance C-based one, it takes a moment to install, if for some reason it doesn't install,
1:45 like this might be tricky on Windows, just use what they specified over there, which was html.parser I think, you'll see it in the warning message,
1:53 okay, that took a while actually, and this is a pretty fast computer, but it's installed, so now we have lxml and we have Beautiful Soup both installed,
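
The parser choice discussed here can be sketched as follows. This is a minimal illustration, not the code from the video: the helper name make_soup and the sample HTML are assumptions, and the fallback simply checks whether lxml is importable before asking Beautiful Soup for it.

```python
import importlib.util

from bs4 import BeautifulSoup


def make_soup(html):
    """Parse HTML, naming the parser explicitly to avoid the 'no parser' warning.

    make_soup is a hypothetical helper for this sketch. It prefers lxml when
    the package is installed and otherwise uses the built-in html.parser.
    """
    parser = "lxml" if importlib.util.find_spec("lxml") else "html.parser"
    return BeautifulSoup(html, parser)


# Made-up HTML standing in for a downloaded page
soup = make_soup("<html><body><h1>Hello</h1></body></html>")
print(soup.find("h1").get_text())
```

Passing the parser name as the second argument is what silences the warning; which parser you pick mostly affects speed and how leniently broken markup is handled.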
2:01 so if I run this one more time there should be no warnings. Great, okay, so we've downloaded these, None, None, None, None, None,
2:08 apparently is what we returned from this method, five times, which by the way, Python methods always have a return value,
2:15 it's just None if you don't return anything explicit. Okay, so now we have this, let's look for the title, let's give this a shot here,
2:21 so we are going to say soup.find, there are a few navigational traversal type things here,
2:27 there are many finds, as you see, find_parents, find_next_sibling, find_next, find_previous, and we also have a select, okay, so let's do this,
2:34 let's say find, because find works on nodes basically, so I can say I want to find h1 and then I come over here
2:42 and I can say get_text, and let me just print the title, just to make sure something is going on here, boom,
2:49 now look at that, that is almost what we wanted, it's so super close, look, we have this weird new line and whatnot, let me just add
2:56 a quick function here that we can use to clean this stuff up, so let's go down here, put this at the bottom, so the problem is,
3:03 when we say get_text like that, the br is converted, and then all the white space around it, like the indentation in
3:09 the HTML, stays in, because we get exactly what is inside of that section, so if you look over here, this from here to there, like that, stays in,
3:19 the br becomes a new line, but still, all that white space and tabs, that's there,
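
To make the whitespace problem concrete, here is a small sketch with made-up HTML (the actual page from the video is not shown in this transcript). One detail worth knowing: get_text() only concatenates the text nodes inside the element, so in this sketch the line breaks and indentation in the result come from the HTML source itself rather than from the br tag.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page; the indentation and
# line breaks inside <h1> are deliberate.
html = """<html><body>
    <h1>
        First line of the title<br/>
        second line of the title
    </h1>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

header = soup.find("h1")                      # first <h1> node, or None if absent
title = header.get_text() if header else None

# repr() makes the embedded newlines and leading spaces visible
print(repr(title))
```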
3:22 so what we've got to do is have a little function that goes through and says okay, new lines, tabs, all those become just spaces,
3:29 and then you might end up with a bunch of spaces, so we'll write a little loop to convert two spaces to one space until there are no more double spaces,
3:38 and that turns out to be pretty much the trick we need, so let's go over here, and we'll say here is our title, instead of doing this
3:45 we'll say clean_text, we'll give it that, let's try again, boom, just like we were hoping for, okay, so this is really nice,
3:52 we're going to use this clean_text over and over again, okay, so we're getting the titles, that's really cool, next,
3:58 let's do something a little more advanced, let's go and get the paragraphs.
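
The clean_text helper described above might look roughly like this. It is a sketch reconstructed from the spoken description (new lines and tabs become spaces, then a loop collapses double spaces), so the exact body, the handling of empty input, and the final strip() are assumptions rather than the code shown in the video.

```python
def clean_text(text):
    """Collapse newlines, tabs, and runs of spaces into single spaces."""
    if not text:
        # Assumption: pass empty strings and None through unchanged
        return text

    # Newlines, carriage returns, and tabs all become plain spaces
    text = text.replace("\r", " ").replace("\n", " ").replace("\t", " ")

    # Keep replacing two spaces with one until no double spaces remain
    while "  " in text:
        text = text.replace("  ", " ")

    # Assumption: also trim the single space left at either end
    return text.strip()


print(clean_text("\n\t  First line \n     second line\t"))
```

The loop is easy to follow when spoken aloud; an equivalent one-liner would be " ".join(text.split()), which splits on any whitespace and rejoins with single spaces.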

