Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Searching for elements via CSS in BeautifulSoup
0:01 Okay, so that was pretty good, that we got the title out
0:04 now, the next thing we want are the paragraphs, so we can go over here
0:07 and we can go to soup, and instead of saying find we can say select,
0:10 and what we can put in here is a CSS selector,
0:14 CSS, once you get used to it, is a really fantastic way to navigate these pages,
0:17 especially since we know this is the only time this class up here
0:21 is on exactly the little data elements we want.
0:24 We just say . (dot), which means class, and then this, and let's print, look at that,
0:27 alright, so that looks like that worked, great.
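A minimal sketch of the class selector described above. The markup and the class name `transcript-segment` are hypothetical stand-ins, since the lecture does not name the class on the real page; only `soup.select` and the leading-dot syntax come from the lecture.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page's class name is not given in the lecture.
html = """
<html><body>
  <h1>Episode title</h1>
  <p class="transcript-segment" seconds="0">  Welcome to the show.  </p>
  <p class="transcript-segment" seconds="120">  2:00 Thanks for having me.  </p>
  <p class="footer">Not part of the transcript.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# A leading dot means "match on class", just as in a stylesheet.
# select() always returns a list of nodes, even for a single match.
segments = soup.select(".transcript-segment")

for node in segments:
    print(node)
```

The footer paragraph is skipped because it does not carry the class, which is why a class that appears only on the data elements makes such a convenient selector.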
0:34 Now we just need to do some transformations on them,
0:38 okay, so we get these like a list of nodes back that we can work with here,
0:43 but we don't want nodes, we want a paragraph
0:46 and I am just going to copy this down here
0:49 so we can see what it looks like, so what we're going for,
0:52 is something that loads up into there, so instead of saying paragraphs this way,
0:55 let's do another list comprehension, so we'll go like so, and we're going to generate,
1:00 this is going to be the element, we want a paragraph and in here
1:03 we have to pass the text and the seconds, so how do we get those?
1:07 Well, let's say for p in soup.select, okay, well, we go to the paragraph
1:13 and we say get_text, like that, and let me just put the number one for now,
1:19 so if we run this, what do we get back, something that looks pretty good actually,
1:24 seconds, 111 but again, all this weird space, let's fix that,
1:28 remember our clean text, I am telling you, this stuff is super messy
1:33 when it comes out of here, what do we get now, oh yeah, look at that,
1:36 this is looking pretty good, there is a paragraph, and then there is another paragraph,
1:40 yeah, this is what we're after, but notice, we want to get this number here,
1:44 and that number is in the seconds attribute. So how do we get that?
1:49 Well, it's going to come off of p and the way we get it out is
1:52 the attributes are just a dictionary, so that is going to give it
1:55 to us as a string, and we want it as an integer.
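The list comprehension built up over the last minute might look roughly like this. It is a sketch under assumptions: the `Paragraph` named tuple, the body of `clean_text`, the class name, and the sample markup are all illustrative, and falling back to 0 when the attribute is missing is a choice made here, not something the lecture shows.

```python
from collections import namedtuple
from bs4 import BeautifulSoup

# Assumed shape of the course's paragraph type: text plus a seconds offset.
Paragraph = namedtuple("Paragraph", "text seconds")


def clean_text(text):
    """Collapse runs of whitespace (a stand-in for the course's helper)."""
    return " ".join(text.split()) if text else ""


# Hypothetical markup and class name.
html = """
<p class="transcript-segment">  Welcome
     to the show.  </p>
<p class="transcript-segment" seconds="120">  Thanks for having me.  </p>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = [
    Paragraph(
        text=clean_text(p.get_text()),
        # attrs is a plain dict of strings, so convert explicitly;
        # the default of "0" for a missing attribute is an assumption.
        seconds=int(p.attrs.get("seconds", "0")),
    )
    for p in soup.select(".transcript-segment")
]

for para in paragraphs:
    print(para)
```

Indexing the node directly, as in `p["seconds"]`, reads the same dictionary but raises `KeyError` when the attribute is absent.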
1:59 Okay, so let's run this again, oh yeah, look at it rolling out,
2:03 so you come over here, there is seconds zero,
2:06 oops, there was seconds zero, the very beginnings don't always have times on them,
2:12 text is this, here we go, this minute was two, which converts to 120 seconds,
2:19 now in practice, I would need to write a little bit of code that strips that segment off,
2:23 because it's actually in the text, we could do a regular expression to say
2:27 find me the numerical, timestamp-looking thing at the front and strip that off,
2:32 we're not going to do that, right, I'll leave that up to you guys,
2:35 but here, we have our paragraphs, we have our title, let's stop printing them,
2:38 and instead, what we're supposed to do with this method,
2:41 is we're supposed to build a thing called a page, a page has a url,
2:44 copy this number for a sec, so the page has a url, a title and paragraphs.
2:51 So we'll just say return page, we have the url passed in, the title we got earlier,
2:57 and the paragraphs we just finished generating.
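A sketch of these last two steps: the timestamp stripping that was left as an exercise, and the return value. The `Page` and `Paragraph` named tuples, the `build_page` and `strip_timestamp` names, the example URL, and the assumed timestamp shape (`m:ss` or `h:mm:ss` at the very start of the text) are all illustrative, not taken from the course code.

```python
import re
from collections import namedtuple

# Assumed shapes for the course's types.
Page = namedtuple("Page", "url title paragraphs")
Paragraph = namedtuple("Paragraph", "text seconds")

# Assumed timestamp format: m:ss or h:mm:ss at the very front of the text.
TIMESTAMP = re.compile(r"^\s*\d{1,2}:\d{2}(?::\d{2})?\s*")


def strip_timestamp(text):
    """Remove one leading timestamp, if present."""
    return TIMESTAMP.sub("", text, count=1)


def build_page(url, title, paragraphs):
    """Bundle the url, title and cleaned paragraphs into one Page."""
    cleaned = [Paragraph(strip_timestamp(p.text), p.seconds) for p in paragraphs]
    return Page(url=url, title=title, paragraphs=cleaned)


page = build_page(
    "https://example.com/episode/1",  # placeholder URL
    "Episode title",
    [
        Paragraph("Welcome to the show.", 0),
        Paragraph("2:00 Thanks for having me.", 120),
    ],
)
print(page.title, len(page.paragraphs))
print(page.paragraphs[1].text)
```

Anchoring the pattern with `^` keeps it from touching times that are mentioned later in the sentence.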
3:00 Okay, let's run it, downloading, downloading, downloading, and bam,
3:05 there is our array of pages, here is the url, there is the title,
3:08 and then there are all the paragraphs. Pretty darn cool, huh,
3:12 so we can use get_text, we can access the attributes like this,
3:17 we can do find for node discovery, we can do select for working with CSS selectors,
3:25 notice there is also a select_one, so that if we know there is going to be
3:30 one thing we're selecting, by id for example, then we can use select_one
3:34 and not have to treat it as a list but get an element, which is cool;
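To illustrate the difference between the two calls, here is a small sketch. The markup and the id `episode-title` are hypothetical; `select` and `select_one` are the actual BeautifulSoup methods.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one element that carries an id.
html = '<div><h1 id="episode-title">Episode title</h1><p class="x">a</p></div>'
soup = BeautifulSoup(html, "html.parser")

# select() returns a list, so a single match still has to be indexed.
title_from_list = soup.select("#episode-title")[0]

# select_one() returns the first matching element directly ...
title_node = soup.select_one("#episode-title")

# ... and None when nothing matches, so check before using the result.
missing = soup.select_one("#does-not-exist")

print(title_node.get_text())
print(missing is None)
```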
3:37 so finally, let's do a little bit of output here,
3:40 let's just go back up to this download pages,
3:43 instead of this print, let me write a better method show pages, I'll pass in pages,
3:47 create that and I'll just sort of print this out here,
3:52 no sense in you guys watching me type that, so we are going to go through
3:56 and say here is the pages, here is the url, it has this number of paragraphs,
4:01 I am not going to show you the text because it's too much and then,
4:04 we'll just show you what the text of the first paragraph is.
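The body of the show pages method is not shown being typed, so the following is only a guess at its shape: for each page, print the title, the url, the paragraph count, and the text of the first paragraph. The `summarize_page` helper, the named tuples, and the sample data are assumptions made for illustration.

```python
from collections import namedtuple

# Assumed shapes for the course's types.
Page = namedtuple("Page", "url title paragraphs")
Paragraph = namedtuple("Paragraph", "text seconds")


def summarize_page(page):
    """Return the summary lines for one page (hypothetical helper)."""
    lines = [
        page.title,
        "  url: {}".format(page.url),
        "  {} paragraphs".format(len(page.paragraphs)),
    ]
    # Only the first paragraph's text is shown; the full text is too long.
    if page.paragraphs:
        lines.append("  first: {}".format(page.paragraphs[0].text))
    return lines


def show_pages(pages):
    """Print a short summary of every page."""
    for page in pages:
        print("\n".join(summarize_page(page)))
        print()


pages = [
    Page(
        "https://example.com/episode/1",  # placeholder URL
        "Episode title",
        [Paragraph("Welcome to the show.", 0), Paragraph("Thanks for having me.", 120)],
    )
]
show_pages(pages)
```

Keeping the formatting in a function that returns lines, separate from the printing, makes the summary easy to check without capturing output.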
4:07 Alright, let's run that, download, download, download, parse, parse, parse,
4:12 and ta-da, look at that rich data we have to work with, p.url, p.paragraphs,
4:17 p.paragraphs[0].text, just pure Python beauty,
4:22 so here is the title, and here is the url, let's click and make sure it works,
4:25 sure, it looks like that is what we were after, doesn't it,
4:29 and it has 238 paragraphs in that hour long conversation,
4:32 and here is the first paragraph in the list.
4:35 So that is how you do screen scraping with Beautiful Soup and requests.