#100DaysOfWeb in Python Transcripts
Chapter: Days 73-76: Web Scraping
Lecture: Dive into BeautifulSoup4 and Talk Python
0:00
The way we're going to work through this is to first actually do it within the Python REPL. Think of it as the Python command line,
0:09
if you haven't used it before. So from your virtual environment simply type in Python. And that will bring up the shell.
0:16
That's what we see here with the three little greater than arrows. And all we're going to do is we're going to play with the website.
0:23
We're going to pull it down and then just do some quick code against it to see what we get and then we can write our script.
0:30
This is a really good way of doing any code. Play with it in the REPL and once you sort of figure out what you're doing and massage
0:37
your code, then write your script. It's much easier that way. We'll begin by importing bs4 and we'll also import requests.
0:47
Now, let's specify the URL that we're going to actually play with here. Let's just copy that from our website here.
0:54
It's https://training.talkpython.fm/courses/all Let's pop that into there, close it off. Right, we have our URL.
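Just to keep track of it, the REPL session so far looks roughly like this; the variable name url is only a placeholder used here:

    >>> import bs4
    >>> import requests
    >>> url = "https://training.talkpython.fm/courses/all"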
1:10
So the first bit of code we're going to do here actually uses requests, not bs4. Requests is the download side of web scraping.
1:17
It's for pulling down the raw site content, okay? And so we're going to call it raw_site_page. And we're going to assign it requests.get
1:29
so we're doing a get from the website, passing it the URL. That will return with nothing, which is good. It has assigned essentially that entire webpage
1:41
to raw_site_page, to that variable there. Now, what we want to do, and what is good practice, is to call raise_for_status on that raw_site_page object.
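Typed at the prompt, those two lines look something like this, using the url variable from before:

    >>> raw_site_page = requests.get(url)
    >>> raw_site_page.raise_for_status()  # silent on success, raises requests.HTTPError for a 4xx/5xx response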
1:58
Now, what this does is this actually checks to see that the request worked properly. If it did work we'll get no output.
2:07
If it didn't work you'll get some errors and you'll know that something went wrong. Perhaps the URL was broken, maybe you don't
2:13
have an Internet connection, who knows. Now, what we need to do is actually massage this site. We need to parse it.
2:20
So now we're moving away from requests and we're moving into Beautiful Soup 4. So let's create our soup, right? This is our soup object.
2:32
And we're going to use bs4.BeautifulSoup. And we're going to take the raw_site_page.text and we're going to hit it with the HTML
2:49
parser that's built into Python and used by Beautiful Soup 4. Now what that does is it goes through and it takes the plain text and it runs
2:57
the HTML parser against it, dumps it all into our soup object. And now we get to do some fun stuff. If we just type soup here, you can see
3:10
all of the HTML for this website. Now, obviously it's not formatted correctly with all the tabbing but this matches what you
3:19
see here, right, on our view source for the Talk Python webpage. So what we want to do now is start playing with this. We want to pull things out.
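For reference, that parsing step is a single line, and typing the bare name soup afterwards prints the whole thing back:

    >>> soup = bs4.BeautifulSoup(raw_site_page.text, "html.parser")  # "html.parser" is Python's built-in HTML parser
    >>> soup  # echoes the entire parsed page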
3:30
When you're doing web scraping you generally don't want the entire page, right? You don't need all of this extra code all around everything, right?
3:40
Let's close this X and take a look at the page. Generally what you actually want is some sort of specific set of data.
3:49
Sometimes you'll just want the header of every page. Sometimes you might want what's in the nav bar. Sometimes you might want article updates
3:58
or the latest RSS feed updates. You might even just want to get all of the individual headers of certain things or collect all of the hours that each
4:09
of these courses could add up to. Either way that's the beauty of web scraping. You sort of have to dig into the HTML code that you want to pull.
4:22
You need to dig into it and find out what it is that you need, what it is that is going to give you what you need.
4:30
Now for this exercise let's just have a play with a couple of the options here. If we wanted to get the title of the page
4:39
we could go soup.title and that'll give us our title tag. So you can see it searched for the title tag and it's given us the title of the webpage
4:49
Online Course Catalog, and so on. These are really cool shortcuts to get to specific data without you having to go and parse it in any complex way.
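That lookup is just:

    >>> soup.title  # e.g. <title>Online Course Catalog ...</title>, the exact text may vary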
5:00
One of the other interesting things you can do is you can pull all types of one tag. So this is a bit crazy but let's say we want to pull every div.
5:10
We could do soup.select and we could just type in div. And that brings up every div but also everything inside it.
5:19
Now, we can get all of the lists. We could do li. And that'll give us every list that appears on this page.
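As typed, those two selections look like this:

    >>> soup.select("div")  # every <div>, each with everything nested inside it
    >>> soup.select("li")   # every <li> list item on the page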
5:27
Now, we're getting so many of these, aren't we? So this is where certain factors start to come into play. Take a look at this code here.
5:36
Take a look at the HTML on the website. One of the interesting things is that we have CSS. We have classes for our divs.
5:46
So we don't want every div but let's say we wanted just the nav bar. We can actually specify the class. So here's our nav bar header class for the div.
5:58
Let's try and pull that. If we put that in there, this is actually not going to work. It comes back with nothing because there's no
6:07
HTML tag that goes by nav bar header. What we want to do is we want to put a dot there to indicate that this is a class.
6:17
Once we do that then we get all of the HTML that falls below the class nav bar header. Notice that we didn't have to specify div.
6:28
We just had to specify the class. And by doing that we get the button, we get everything else under that. Let's try that again with a different class
6:37
just to demonstrate. We can get the bundle list. So div class bundle list. This will give us just the list of the everything
6:50
bundle on Mike's page. This part here, right? Let's hop back in here. We'll change nav bar header to bundle dash list. And there we go.
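Spelled out, and assuming the class names are navbar-header and bundle-list as spoken (the exact spelling may differ in the page source), those attempts look like this:

    >>> soup.select("navbar-header")   # empty list: there's no <navbar-header> tag
    >>> soup.select(".navbar-header")  # leading dot = CSS class, so this finds the nav bar block
    >>> soup.select(".bundle-list")    # same idea for the bundle listing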
7:06
We get just that HTML code and we'll find that it matches, right? It matches what we see on that source page. There's the Everything Bundle.
7:14
There is the header for it, there's the image alt for it I should say, here is the h3 header tag and so on. Right, now the final part of this exercise
7:24
for this specific video, I want to get every title of every course. I want to get these headers here: Python Jumpstart by Building 10 Apps, and so on.
7:33
Now if we dig through this we'll find something unique to each of those headers. When you're looking through your own website
7:41
you'll probably find that people have specific classes that will indicate headers. In Mike's case here with the Talk Python website
7:52
all of our headers for the actual course names are simply within h3 tags. So if we pull down our h3 tags, just the HTML tag so we don't have to put the dot,
8:08
there are all of our names. And they're all in a list, nice and handy. So that's that. This is where we want to be.
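That last selection, plus one optional extra step to pull just the text out of each tag, looks something like this; the list comprehension is an extra touch, not something from the video:

    >>> soup.select("h3")  # a tag name, so no leading dot
    >>> [h3.get_text(strip=True) for h3 in soup.select("h3")]  # just the course titles as strings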
8:16
We now know the code that we need to write our script and we can then write some more code around that to make it readable, make it nicer.
8:24
So let's leave it here. Feel free to keep playing if you'd like and try to pull out some other information.
8:30
But in the next video we are going to write the script.