Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Concept: Scraping with BeautifulSoup

0:01 You've seen the screen scraping workflow: we issue a basic HTTP GET
0:06 against the page that we're after, we get the raw HTML back,
0:10 and then we feed it to whatever screen scraping library we want,
0:14 in this case Beautiful Soup. What comes out the other side,
0:17 very much like working with an XML DOM, is a set of converted Python objects.
0:22 We feed those Python objects to our app
0:25 and out pops the real analysis that we're after,
0:28 in this case, maybe transcript analysis.
0:31 So let's see all the concepts in code.
0:33 First of all, if you have a sitemap that you can leverage, leverage it;
0:36 it's super simple, well-structured XML that tells you right where to look for so many things,
0:40 so that's great. In order to work with it, since it's basic XML, of course
0:45 we go to ElementTree, and we are going to need to get it from the internet,
0:48 so requests. Then we come up with the URL, usually /sitemap.xml,
0:52 though I suppose it doesn't have to be that way. We'll do a basic GET against it,
0:56 grab the XML text, and then we just treat it like XML, like we always have,
1:00 with ElementTree.fromstring. Now, if you try to write XPath queries,
1:05 you are going to have to use namespaces and the namespace syntax in the queries,
1:08 and whatnot, and so I decided, you know what, forget those namespaces;
1:12 they serve essentially no purpose for us in this case.
1:16 Let's just throw them away, drop them off the root element, and then
1:19 we can write non-namespaced XPath queries. And then, of course,
1:24 we just set up a basic list comprehension, a dom.findall('url/loc'),
1:29 and then we did a substring search: find all the URLs that point to transcripts,
1:34 give me their text. Boom, those are the URLs that we can go download
1:38 in a subsequent loop. Next, we want to download and parse them,
1:41 so we are going to use Beautiful Soup, and we want to have
1:44 somewhere structured to store them, so we're going to import collections
1:48 and create two named tuples: Page, which has url, title, and paragraphs,
1:52 and each paragraph is going to be a named tuple with text and seconds.
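The sitemap scan and the named tuples described so far might be sketched like this. The sitemap URL is whatever the target site exposes, and the 'transcript' substring filter is an assumption about how this particular site names its pages:

```python
# Sketch of the sitemap step, assuming a standard sitemap at /sitemap.xml.
import collections
import requests
import xml.etree.ElementTree as ElementTree

Page = collections.namedtuple('Page', 'url title paragraphs')
Paragraph = collections.namedtuple('Paragraph', 'text seconds')


def strip_namespaces(root):
    # Drop the '{namespace}' prefix from every tag so we can write
    # plain, non-namespaced XPath queries against the tree.
    for element in root.iter():
        if '}' in element.tag:
            element.tag = element.tag.split('}', 1)[1]


def get_transcript_urls(sitemap_url):
    resp = requests.get(sitemap_url)  # basic GET against the sitemap
    resp.raise_for_status()
    dom = ElementTree.fromstring(resp.text)
    strip_namespaces(dom)
    # Sitemaps list each page as <url><loc>...</loc></url>; the
    # substring filter is a guess at this site's URL scheme.
    return [node.text
            for node in dom.findall('url/loc')
            if node.text and 'transcript' in node.text]
```

The namespace stripping is what lets the `findall('url/loc')` query stay readable; with namespaces intact, every tag in the query would need the full `{...}` prefix.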
1:55 And then we go through each url and we go download it, just like you would expect,
1:59 and here we have like zero error handling; we'll say .text,
2:02 and then we go to Beautiful Soup and we say create
2:05 an instance of a Beautiful Soup DOM based on this HTML text.
2:09 And here you can see we are passing lxml as the parser; remember,
2:13 you have to install that, so realize that that's a dependency
2:16 that you are not directly importing but that you are going to need
2:19 if you use it. And after we get it like this, we have our data
2:22 in a Beautiful Soup object; we just need to extract it with find and select.
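A minimal version of that download-and-parse step might look like this, assuming beautifulsoup4 and lxml are both installed; note that lxml only ever appears as a string, never as an import, which is exactly why it's easy to forget as a dependency:

```python
import bs4
import requests


def download_soup(url):
    resp = requests.get(url)
    resp.raise_for_status()  # the lecture skips error handling; this is the minimum
    html_text = resp.text
    # 'lxml' is an indirect dependency: we never import it ourselves,
    # but Beautiful Soup needs it installed to use it as the parser.
    return bs4.BeautifulSoup(html_text, 'lxml')
```

If lxml isn't installed, Beautiful Soup raises a FeatureNotFound error at this line rather than at import time.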
2:26 Okay, so how do we extract this data? Well, we are going to go to the Soup DOM
2:31 and we want to say: give me the node that is an h1, and give me the text.
2:35 So that was cool, we got that back, that gave us the title; it should be just one h1.
2:39 And then we said, I am going to do a select,
2:45 and that was the class name, and . means class in CSS, so select by that class,
2:50 and then for each one of those, we are going to go through some iterations
2:53 of conversion and cleanup and store it in a paragraph, so p.get_text(),
2:57 and then clean that; you've seen how necessary that is,
3:00 and you'd want to apply that to the title as well. Then get the seconds
3:03 out of the attributes and convert that to an integer; here
3:06 again we're assuming that is going to work.
3:09 And then a list comprehension generates a list of paragraphs, using the CSS selectors,
3:12 and then we can create our page, which holds the url,
3:16 the title, and all of these well structured, cleaned up paragraphs, and we're done.
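Putting that extraction together could look something like the sketch below. The `.transcript-segment` class name and the `data-start` attribute are invented here, stand-ins for whatever the real page markup uses:

```python
import collections
import re

Page = collections.namedtuple('Page', 'url title paragraphs')
Paragraph = collections.namedtuple('Paragraph', 'text seconds')


def clean_line(text):
    # Collapse any run of whitespace to a single space, as HTML rendering would.
    return re.sub(r'\s+', ' ', text).strip()


def parse_page(url, soup):
    # There should be just one h1; clean the title like the paragraphs.
    title = clean_line(soup.find('h1').get_text())
    paragraphs = [
        Paragraph(clean_line(p.get_text()),
                  int(p.attrs['data-start']))       # assumed attribute; int() may raise
        for p in soup.select('.transcript-segment')  # assumed CSS class
    ]
    return Page(url, title, paragraphs)
```

As the lecture says, the `int()` conversion simply assumes the attribute is there and well formed; real code would want a try/except around it.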
3:20 Remember, clean_line is pretty important, because of
3:24 the way HTML normally disregards whitespace; but in the extracted text
3:28 it's preserved, right, it matters what comes out. So you want to basically say
3:32 I don't care how many spaces there are; if there are a hundred, or there is one,
3:36 that's just one space, and this clean_line more or less does that for you.
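That whitespace-collapsing behavior comes down to one regular expression; a sketch of what such a clean_line might do:

```python
import re


def clean_line(text):
    # HTML collapses runs of whitespace when it renders, but the raw
    # source (and therefore get_text()) keeps them; collapse them here.
    return re.sub(r'\s+', ' ', text).strip()


print(clean_line('if there   are a\n    hundred    spaces'))
# -> 'if there are a hundred spaces'
```

The `\s+` pattern also swallows newlines and tabs, which is what you want when the HTML source wraps a paragraph across many indented lines.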