Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Concept: Scraping with BeautifulSoup
0:01
You've seen the screen scraping workflow: we issue a basic HTTP GET against the page that we're after, we get the raw HTML back
0:11
and then we feed it to whatever screen scraping library we want; in this case we've chosen Beautiful Soup. What comes out the other side
0:18
(very much like working with an XML DOM) is a set of converted Python objects. We feed those Python objects to our app
0:26
and out pops the real analysis that we're after; in this case, maybe we wanted to do transcript analysis. So let's see all the concepts in code.
0:34
First of all, if you have a sitemap that you can leverage, leverage it,
0:37
it's super simple, well-structured XML that tells you right where to look for so many things,
0:41
so that's great. In order to work with that, it's basic XML, so of course,
0:46
we go to ElementTree, and we are going to need to get it from the internet, so requests. Then we come up with the URL, usually /sitemap.xml,
0:53
but I suppose it doesn't have to be that way. We'll do a basic GET against it,
0:57
and we'll grab the XML text, and then we just treat it like XML, like we always have, so ElementTree.fromstring. Now, if you try to write XPath queries,
1:06
you are going to have to use namespaces and the namespace syntax in the queries,
1:09
and whatnot. So I decided, you know what, forget those namespaces; they serve no purpose for us in this case,
1:17
let's just throw them away, drop them off that root element, and then we can write non-namespaced XPath queries. And then, of course,
1:25
we just wrote a basic list comprehension over dom.findall('url/loc'), and then we did a substring search: find all the URLs that point to transcripts,
1:35
give me their text, and boom, those are the URLs that we can go download in a subsequent loop. Next, we want to download and parse them,
1:42
so we are going to use Beautiful Soup and we want to have somewhere structured to store them, so we're going to import collections,
1:49
and create two named tuples: a page, which has url, title, and paragraphs, where each paragraph is going to be a named tuple with text and seconds.
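The sitemap step and the two named tuples described above can be sketched roughly as follows. This is a minimal sketch, not the lecture's exact code: it assumes the sitemap XML has already been downloaded (for example with requests.get(url).text), and the sample XML, the example.com URLs, the "/transcripts/" substring, the helper name transcript_urls, and the capitalized tuple names are all illustrative assumptions. The namespace is dropped here with a regular expression on the text, which may differ from how the lecture code does it.

```python
import re
from collections import namedtuple
from xml.etree import ElementTree

# Structured containers: a page holds its paragraphs, each with a start time.
Page = namedtuple("Page", "url title paragraphs")
Paragraph = namedtuple("Paragraph", "text seconds")

# Hypothetical sitemap content standing in for the downloaded /sitemap.xml text.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/transcripts/lecture-1</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/transcripts/lecture-2</loc></url>
</urlset>
""".strip()


def transcript_urls(sitemap_xml, marker="/transcripts/"):
    """Return the <loc> URLs in a sitemap whose text contains `marker`."""
    # Drop the default namespace declaration so plain paths like 'url/loc' match.
    no_namespace = re.sub(r'\sxmlns="[^"]+"', "", sitemap_xml, count=1)
    dom = ElementTree.fromstring(no_namespace)
    # Substring search over each <loc> element's text; skip empty elements.
    return [
        loc.text
        for loc in dom.findall("url/loc")
        if loc.text and marker in loc.text
    ]


urls = transcript_urls(SAMPLE_SITEMAP)
print(urls)
```

Keeping the parsing in a function that takes text means it can be tried on a saved sitemap before pointing it at a live site.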
1:56
And then we go through each URL and we go download it, just like you would expect (and here we have zero error handling); we'll say .text
2:03
and then we go to Beautiful Soup and we say, create an instance of a Beautiful Soup DOM based on this HTML text,
2:10
and here you can see we are passing lxml as the parser. Remember, you have to install that, so realize that that's a dependency
2:17
that you are not directly importing but you are going to need if you use it. And after we get it like this, we have our data
2:23
in a Beautiful Soup object; we just need to extract it with find and select.
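A rough sketch of the download-and-parse step just described. The function names download_html and make_soup are hypothetical, and the raise_for_status() call is an addition the lecture does not mention (it describes zero error handling). The parser defaults to "lxml" as in the lecture, which requires lxml to be installed separately; the demo line passes Python's built-in "html.parser" instead so it runs without that dependency.

```python
import bs4


def download_html(url):
    """Fetch a page and return its HTML text."""
    # Imported here so make_soup below also works on saved HTML without requests.
    import requests

    resp = requests.get(url)
    resp.raise_for_status()  # minimal check; an addition, not from the lecture
    return resp.text


def make_soup(html, parser="lxml"):
    """Build a Beautiful Soup DOM from HTML text with the given parser."""
    return bs4.BeautifulSoup(html, parser)


# Built-in parser used here so the demo runs even without lxml installed.
soup = make_soup("<html><body><h1>Demo page</h1></body></html>", parser="html.parser")
print(soup.find("h1").get_text())
```

If lxml is missing, asking for the "lxml" parser fails when the soup is created, which is why it is worth listing as an explicit dependency even though it is never imported directly.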
2:27
Okay, so how do we extract this data? Well, we are going to go to the soup DOM and we want to say, give me the node that is an h1, and give me the text,
2:36
so that was cool, we got that back and it gave us the title; there should be just one h1,
2:40
and then we said, I am going to do a select, so soup.select('.transcript_segments'),
2:46
so that was the class name, and . means class in CSS, so select by that class. Then for each one of those, we are going to go through some gyrations
2:54
of conversion and cleanup and store it in a paragraph, so p.get_text(), and then clean that; you've seen how necessary that is,
3:01
you'd want to apply that to the title as well. Then get the seconds out of the attributes and convert that to an integer; here
3:07
again, we're assuming that is going to work. And then the list comprehension generates a list of paragraphs, so use the CSS selectors,
3:13
and then we can create our page, which holds the url, the title, and all of these well-structured, cleaned-up paragraphs, and we're done.
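The extraction steps above (find the h1, select by class, clean the text, convert the seconds) might look something like this. It is a sketch under assumptions: the attribute name "seconds", the sample HTML, the function name build_page, and the one-line clean_line helper are all hypothetical, and the class name is taken as spoken in the lecture. As in the lecture, there is no error handling, so a missing h1 or a non-numeric attribute would raise.

```python
from collections import namedtuple

import bs4

Page = namedtuple("Page", "url title paragraphs")
Paragraph = namedtuple("Paragraph", "text seconds")

# Hypothetical markup; the real page structure may differ.
SAMPLE_HTML = """
<h1>  Scraping with
      Beautiful Soup </h1>
<p class="transcript_segments" seconds="1">You've   seen the workflow</p>
<p class="transcript_segments" seconds="11">and then we feed it
   to the library</p>
<p class="other">not part of the transcript</p>
"""


def clean_line(text):
    """Collapse every run of whitespace to a single space (assumed helper)."""
    return " ".join(text.split())


def build_page(url, html, parser="lxml"):
    """Turn one transcript page's HTML into a Page of cleaned Paragraphs."""
    soup = bs4.BeautifulSoup(html, parser)
    # There should be just one h1; find returns the first match.
    title = clean_line(soup.find("h1").get_text())
    # '.' selects by CSS class; int() assumes the attribute is always numeric.
    paragraphs = [
        Paragraph(clean_line(p.get_text()), int(p["seconds"]))
        for p in soup.select(".transcript_segments")
    ]
    return Page(url, title, paragraphs)


# Built-in parser used here so the demo runs even without lxml installed.
page = build_page("https://example.com/transcripts/lecture-1", SAMPLE_HTML, "html.parser")
print(page.title)
for para in page.paragraphs:
    print(para.seconds, para.text)
```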
3:21
Remember, clean line is pretty important, because normally HTML disregards whitespace, but in Python,
3:29
it's significant, right, it matters what comes out, so you want to basically apply the rule that
3:33
I don't care about how many spaces there are; if there are a hundred, or there is one,
3:37
that's just one space, so this clean line more or less does that for you.
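One plausible way to write such a helper is shown below. The name clean_line and this particular implementation are assumptions; the lecture only describes the behavior (any run of whitespace counts as one space).

```python
def clean_line(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    # str.split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so joining with ' ' normalizes the string.
    return " ".join(text.split())


# A hundred spaces or one space: the result is the same.
messy = "Hello," + " " * 100 + "world\n\t  again  "
print(repr(clean_line(messy)))
print(clean_line("Hello, world again") == clean_line(messy))
```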