Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Searching for elements via CSS in BeautifulSoup

0:01 Okay, so that was pretty good, that we got the title out now, the next thing we want are the paragraphs, so we can go over here
0:08 and we can go to soup, and instead of saying find we can say select, and what we can put in here is a css selector,
0:15 css, once you get used to it, is a really fantastic way to navigate these pages, especially since we know this class up here
0:22 appears only on exactly the little data elements we want. We just say . (dot), which means class, and then the class name, and let's print, look at that,
0:28 alright, so that looks like that worked, great. Now we just need to do some transformations on them,
0:39 okay, so we get these like a list of nodes back that we can work with here, but we don't want nodes, we want a paragraph
0:47 and I am just going to copy this down here so we can see what it looks like, so what we're going for,
0:53 is something that loads up into there, so instead of saying paragraphs this way,
0:56 let's do another list comprehension, so we'll go like so, and we're going to generate, this is going to be the element, we want a paragraph and in here
1:04 we have to pass the text and the seconds, so how do we get those? Well, let's say for p in soup.select okay, well, we go to the paragraph
1:14 and we say get_text, like that, and let me just put the number one for now,
1:20 so if we run this, what do we get back, something that looks pretty good actually, seconds 1, 1, 1, but again, all this weird space, let's fix that,
1:29 remember our clean text? I am telling you, this stuff is super messy when it comes out of here, what do we get now, oh yeah, look at that,
1:37 this is looking pretty good, there is a paragraph, and then there is another paragraph,
1:41 yeah, this is what we're after, but notice, we want to get this number here, and that number is in the seconds attribute. So how do we get that?
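A minimal sketch of the steps so far: select the nodes by class, then build paragraphs with a list comprehension. The class name transcript-segment, the sample HTML, the namedtuple definition of Paragraph, and the body of clean_text are assumptions for illustration; the transcript does not show them.

```python
import collections

from bs4 import BeautifulSoup

# Assumed shape of the course's Paragraph type: text plus seconds.
Paragraph = collections.namedtuple('Paragraph', 'text seconds')

# Hypothetical stand-in for a downloaded transcript page.
HTML = """
<html><body>
<h1>Episode title</h1>
<p class="transcript-segment" seconds="0">
    Hello   and welcome.
</p>
<p class="transcript-segment" seconds="120">
    2:00 Let's   get started.
</p>
<p>Footer text, not part of the transcript.</p>
</body></html>
"""


def clean_text(text):
    # Assumed behaviour of the helper: collapse runs of whitespace.
    return ' '.join(text.split())


soup = BeautifulSoup(HTML, 'html.parser')

# In a CSS selector, '.' means "class"; select() returns a list of nodes.
nodes = soup.select('.transcript-segment')

# Turn each node into a Paragraph; 1 is the placeholder for seconds.
paragraphs = [
    Paragraph(text=clean_text(p.get_text()), seconds=1)
    for p in nodes
]
print(paragraphs)
```

The footer paragraph has no class, so the selector skips it and only the two transcript segments come back.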
1:50 Well, it's going to come off of p and the way we get it out is the attributes are just a dictionary, so that is going to give it
1:56 to us as a string, and we want it as an integer. Okay, so let's run this again, oh yeah, look at it rolling out,
2:04 so you come over here, there is seconds zero, oops, there was seconds zero, the very beginnings don't always have times on them,
2:13 text is this, here we go, this minute was two, which converts to 120 seconds,
2:20 now in practice, I would need to write a little bit of code that strips that segment off,
2:24 because it's actually in the text, we could do a regular expression to say
2:28 find me the numerical sort of time stamp looking thing, at the front and strip that off,
2:33 we're not going to do that, right, I'll leave that up to you guys, but here, we have our paragraphs, we have our title, let's stop printing them,
2:39 and instead, what we're supposed to do with this method, is we're supposed to build a thing called a page, a page has a url,
2:45 copy this number for a sec, so the page has a url, a title and paragraphs.
2:52 So we'll just say return page, we have the URL that's passed in, the title we got earlier, and the paragraphs we just finished generating.
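The attribute lookup and the returned page can be sketched roughly as follows. This is one possible version, not the course's actual code: the function name build_page, the h1 lookup for the title, the class name transcript-segment, the namedtuple definitions, and the sample HTML are all assumptions. It also includes a hypothetical strip_timestamp helper showing one way to do the regular expression exercise that the lecture leaves to the viewer.

```python
import collections
import re

from bs4 import BeautifulSoup

# Assumed shapes of the course's types.
Paragraph = collections.namedtuple('Paragraph', 'text seconds')
Page = collections.namedtuple('Page', 'url title paragraphs')

# Matches a leading time stamp such as "2:00 " or "1:02:03 ".
TIMESTAMP = re.compile(r'^\d+:\d{2}(?::\d{2})?\s+')

# Hypothetical stand-in for a downloaded transcript page.
HTML = """
<html><body>
<h1>Episode title</h1>
<p class="transcript-segment" seconds="0">Hello   and welcome.</p>
<p class="transcript-segment" seconds="120">2:00 Let's   get started.</p>
</body></html>
"""


def clean_text(text):
    # Assumed behaviour of the helper: collapse runs of whitespace.
    return ' '.join(text.split())


def strip_timestamp(text):
    # Hypothetical helper: drop a time stamp at the front, if present.
    return TIMESTAMP.sub('', text)


def build_page(url, html):
    soup = BeautifulSoup(html, 'html.parser')
    # Assumption: the title lives in the first h1.
    title = clean_text(soup.find('h1').get_text())
    paragraphs = [
        Paragraph(
            text=strip_timestamp(clean_text(p.get_text())),
            # Attributes behave like a dictionary and come back as strings.
            seconds=int(p['seconds']),
        )
        for p in soup.select('.transcript-segment')
    ]
    return Page(url, title, paragraphs)


page = build_page('https://example.com/episode', HTML)
print(page.title, len(page.paragraphs), page.paragraphs[1])
```

Indexing with p['seconds'] raises KeyError when the attribute is missing; p.get('seconds') would return None instead, which may be the safer choice on messier pages.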
3:01 Okay, let's run it, downloading, downloading, downloading, and bam, there is our array of pages, here is the url, there is the title,
3:09 and then there are all the paragraphs. Pretty darn cool, huh, so we can use get_text, we can access the attributes like this,
3:18 we can do find for node discovery, we can do select for working with css selectors,
3:26 notice there is also a select_one, so that if we know there is going to be one thing, say we're selecting by id, then we can use select_one
3:35 and not have to treat it as a list but just get an element, which is cool; so finally, let's do a little bit of output here,
3:41 let's just go back up to this download pages, instead of this print, let me write a better method, show pages, I'll pass in pages,
3:48 create that and I'll just sort of print this out here, no sense in you guys watching me type that, so we are going to go through
3:57 and say here is the pages, here is the url, it has this number of paragraphs, I am not going to show you the text because it's too much and then,
4:05 we'll just show you what the text of the first paragraph is. Alright, let's run that, download, download, download, parse, parse, parse,
4:13 and ta-da, look at that rich data we have to work with, p.url, p.paragraphs, p.paragraphs[0].text, just pure Python beauty,
4:23 so here is the title, and here is the url, let's click and make sure it works, sure looks like that is what we were after, doesn't it,
4:30 and it has 238 paragraphs in that hour long conversation, and here is the first paragraph in the list.
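The output step is typed off screen, so here is a hedged guess at what a show pages function might look like. The names show_pages and summarize, the namedtuple definitions, and the sample data are assumptions; only the fields shown (title, url, paragraph count, first paragraph text) come from the lecture.

```python
import collections

# Assumed shapes of the course's types.
Paragraph = collections.namedtuple('Paragraph', 'text seconds')
Page = collections.namedtuple('Page', 'url title paragraphs')


def summarize(p):
    # Build the lines to print for one page: title, url, count, first text.
    lines = [
        p.title,
        '  url: {}'.format(p.url),
        '  paragraphs: {}'.format(len(p.paragraphs)),
    ]
    if p.paragraphs:
        lines.append('  first: {}'.format(p.paragraphs[0].text))
    return lines


def show_pages(pages):
    for p in pages:
        print('\n'.join(summarize(p)))
        print()


# Hypothetical data standing in for the downloaded pages.
pages = [
    Page(
        url='https://example.com/episode',
        title='Episode title',
        paragraphs=[
            Paragraph('Hello and welcome.', 0),
            Paragraph("Let's get started.", 120),
        ],
    ),
]
show_pages(pages)
```

Splitting the formatting into summarize keeps the printing trivial and makes the output easy to check without capturing stdout.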
4:36 So that is how you do screen scraping with Beautiful Soup and requests.
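As a recap, the three lookups mentioned in this lecture can be put side by side. The HTML, the id title, and the class name transcript-segment below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the real transcript page.
HTML = """
<html><body>
<h1 id="title">Episode title</h1>
<p class="transcript-segment" seconds="0">Hello</p>
<p class="transcript-segment" seconds="120">World</p>
</body></html>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# find: node discovery by tag name, returns the first match or None.
first_h1 = soup.find('h1')

# select: CSS selector, always returns a list of matches.
segments = soup.select('.transcript-segment')

# select_one: CSS selector, returns a single element or None.
title_node = soup.select_one('#title')
missing = soup.select_one('#no-such-id')

print(first_h1.get_text(), len(segments), title_node.get_text(), missing)
```

Since select_one returns None when nothing matches, it is worth checking the result before calling get_text on it.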

