Consuming HTTP Services in Python Transcripts
Chapter: Screen scraping: Adding APIs where there are none
Lecture: Searching for elements via CSS in BeautifulSoup
0:01
Okay, so that was pretty good, that we got the title out.
0:04
Now, the next thing we want is the paragraphs, so we can go over here
0:07
and we can go to soup, and instead of saying find we can say select,
0:10
and what we can put in here is a CSS selector,
0:14
CSS, once you get used to it, is a really fantastic way to navigate these pages,
0:17
especially since we know this is the only time this class up here
0:21
is on exactly the little data elements we want.
0:24
We just say . (dot), which means class, and then the class name, and let's print, look at that,
0:27
alright, so that looks like that worked, great.
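The select call described above might look like the following minimal sketch. The HTML sample and the class name `transcript-segment` are assumptions made for illustration; the class on the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class name "transcript-segment" is an assumption.
html = """
<html><body>
  <h1>Episode title</h1>
  <p class="transcript-segment" seconds="0">  Welcome to the show.  </p>
  <p class="transcript-segment" seconds="120">  2:00 Let's talk about Python.  </p>
  <p class="footer">Not part of the transcript.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# A leading dot in a CSS selector means "elements that have this class".
nodes = soup.select(".transcript-segment")

# select() always returns a list-like result, even for a single match.
for node in nodes:
    print(node)
```

Only the two transcript paragraphs match here; the footer paragraph has a different class and is skipped.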
0:34
Now we just need to do some transformations on them,
0:38
okay, so we get these like a list of nodes back that we can work with here,
0:43
but we don't want nodes, we want a paragraph
0:46
and I am just going to copy this down here
0:49
so we can see what it looks like, so what we're going for,
0:52
is something that loads up into there, so instead of saying paragraphs this way,
0:55
let's do another list comprehension, so we'll go like so, and we're going to generate,
1:00
this is going to be the element, we want a paragraph and in here
1:03
we have to pass the text and the seconds, so how do we get those?
1:07
Well, let's say for p in soup.select, okay, well, we go to the paragraph
1:13
and we say get_text, like that, and let me just put the number one for now,
1:19
so if we run this, what do we get back, something that looks pretty good actually,
1:24
seconds 1, 1, 1, but again, all this weird space, let's fix that,
1:28
remember our clean text? I am telling you, this stuff is super messy
1:33
when it comes out of here, what do we get now, oh yeah, look at that,
1:36
this is looking pretty good, there is a paragraph, and then there is another paragraph,
1:40
yeah, this is what we're after, but notice, we want to get this number here,
1:44
and that number is in the seconds attribute. So how do we get that?
1:49
Well, it's going to come off of p and the way we get it out is
1:52
the attributes are just a dictionary, so that is going to give it
1:55
to us as a string, and we want it as an integer.
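Here is a small sketch of the list comprehension described above, pulling the text with get_text and the seconds from the attributes dictionary. The `Paragraph` namedtuple, the `clean_text` stand-in, and the class name `transcript-segment` are assumptions, not the course's actual code.

```python
from collections import namedtuple

from bs4 import BeautifulSoup

# Assumed shape of the course's Paragraph type: text plus seconds.
Paragraph = namedtuple("Paragraph", "text seconds")


def clean_text(text):
    # Stand-in for the course's clean text helper (assumed behavior):
    # collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


# Hypothetical markup; the class name is an assumption.
html = """
<p class="transcript-segment" seconds="0">
    Welcome to   the show.
</p>
<p class="transcript-segment" seconds="120">
    2:00 Let's talk about Python.
</p>
"""

soup = BeautifulSoup(html, "html.parser")

paragraphs = [
    # attrs is a plain dictionary; its values are strings, so convert to int.
    Paragraph(clean_text(p.get_text()), int(p.attrs["seconds"]))
    for p in soup.select(".transcript-segment")
]

for paragraph in paragraphs:
    print(paragraph)
```

Note that the "2:00" stamp is still part of the second paragraph's text, since get_text returns everything inside the element.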
1:59
Okay, so let’s run this again, oh yeah, look at it rolling out,
2:03
so you come over here, there is seconds zero,
2:06
oops, there was seconds zero, the very beginnings don't always have times on them,
2:12
text is this, here we go, this minute was two, which converts to 120 seconds,
2:19
now in practice, I need to write a little bit of code that strips that segment off,
2:23
because it's actually in the text, we could do a regular expression to say
2:27
find me the numerical, timestamp-looking thing at the front and strip that off,
2:32
we're not going to do that, right, I'll leave that up to you guys,
2:35
but here, we have our paragraphs, we have our title, let's stop printing them,
2:38
and instead, what we're supposed to do with this method,
2:41
is we're supposed to build a thing called a page, a page has a url,
2:44
copy this number for a sec, so the page has a url, a title, and paragraphs.
2:51
So we'll just say return page, we have the url that's passed in, the title we got earlier,
2:57
and the paragraphs we just finished generating.
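The timestamp stripping that was left as an exercise could be done on each paragraph's text before the page is built. This is one possible sketch, assuming the stamp looks like `m:ss` or `h:mm:ss` at the very front of the text; `strip_timestamp` is a hypothetical helper name.

```python
import re

# Leading stamp such as "2:00" or "1:02:03", plus the whitespace around it.
# The exact stamp format is an assumption about the page's text.
TIMESTAMP_PREFIX = re.compile(r"^\s*\d{1,2}(?::\d{2}){1,2}\s*")


def strip_timestamp(text):
    """Remove one leading timestamp, if present; otherwise return text as-is."""
    return TIMESTAMP_PREFIX.sub("", text, count=1)


print(strip_timestamp("2:00 Let's talk about Python."))
print(strip_timestamp("Welcome to the show."))
```

Anchoring the pattern with `^` keeps it from touching times mentioned later in the sentence, and requiring a colon means text that merely starts with a number is left alone.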
3:00
Okay, let's run it, downloading, downloading, downloading, and bam,
3:05
there is our array of pages, here is the url, there is the title,
3:08
and then there are all the paragraphs. Pretty darn cool, huh,
3:12
so we can use get_text, we can access the attributes like this,
3:17
we can do find for node discovery, we can do select for working with CSS selectors,
3:25
notice there is also a select_one, so that if we know there is going to be
3:30
one thing we're selecting, by id for example, then we can use select_one
3:34
and not have to treat it as a list but get an element, which is cool;
3:37
so finally, let's do a little bit of output here,
3:40
let's just go back up to this download pages,
3:43
instead of this print, let me write a better method show pages, I'll pass in pages,
3:47
create that and I'll just sort of print this out here,
3:52
no sense in you guys watching me type that, so we are going to go through
3:56
and say here is the pages, here is the url, it has this number of paragraphs,
4:01
I am not going to show you the text because it's too much and then,
4:04
we'll just show you what the text of the first paragraph is.
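The output method described above was typed off screen, so the following is only a guess at its shape. The `Page` and `Paragraph` namedtuples, the `describe_page` helper, and the sample data are all assumptions for illustration.

```python
from collections import namedtuple

# Assumed shapes for the course's Page and Paragraph types.
Paragraph = namedtuple("Paragraph", "text seconds")
Page = namedtuple("Page", "url title paragraphs")


def describe_page(page):
    """Return summary lines for one page: title, url, count, first text."""
    lines = [
        page.title,
        f"  url: {page.url}",
        f"  paragraphs: {len(page.paragraphs)}",
    ]
    if page.paragraphs:  # guard against a page with no paragraphs
        lines.append(f"  first: {page.paragraphs[0].text}")
    return lines


def show_pages(pages):
    # Print a short summary per page instead of dumping every paragraph.
    for page in pages:
        print("\n".join(describe_page(page)))
        print()


pages = [
    Page(
        url="https://example.com/episodes/1",  # placeholder URL
        title="Episode 1",
        paragraphs=[
            Paragraph("Welcome to the show.", 0),
            Paragraph("Let's talk about Python.", 120),
        ],
    )
]
show_pages(pages)
```

Keeping the formatting in `describe_page` and the printing in `show_pages` is a design choice of this sketch; it makes the summary easy to check without capturing printed output.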
4:07
Alright, let's run that, download, download, download, parse, parse, parse,
4:12
and ta-da, look at that rich data we have to work with, p.url, p.paragraphs,
4:17
p.paragraphs[0].text, just pure Python beauty,
4:22
so here is the title, and here is the url, let's click and make sure it works,
4:25
I am sure it looks like that is what we were after, doesn't it,
4:29
and it has 238 paragraphs in that hour-long conversation,
4:32
and here is the first paragraph in the list.
4:35
So that is how you do screen scraping with Beautiful Soup and requests.
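As a recap, here is a compact sketch that puts find, select, select_one, get_text, and attribute access side by side. The markup, the `main` id, and the class name are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the id and class names are assumptions.
html = """
<div id="main">
  <h1>Episode title</h1>
  <p class="transcript-segment" seconds="0">Welcome.</p>
  <p class="transcript-segment" seconds="120">More talk.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find: first matching node by tag name.
title = soup.find("h1").get_text()

# select: CSS selector, always returns a list of matches.
segments = soup.select(".transcript-segment")

# select_one: CSS selector, returns a single element (or None if no match).
main = soup.select_one("#main")
missing = soup.select_one("#does-not-exist")

# Attributes live in a dictionary and come back as strings.
first_seconds = int(segments[0].attrs["seconds"])

print(title, len(segments), main.name, missing, first_seconds)
```

Because select_one returns None when nothing matches, it is worth checking the result before calling methods on it.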