#100DaysOfWeb in Python Transcripts
Chapter: Days 73-76: Web Scraping
Lecture: Write a Talk Python Training web scraping script
0:00
Now that we've figured out what we want, and what commands we need to run to get it, let's actually pop it into a script.
0:10
I'll just quickly mention that if you are stuck in that Python REPL or shell, whatever you want to call it, just type in exit() with the parentheses there
0:19
and that should get you out of it and back to the command line. Now, in our folder we only have a virtual environment, right?
0:28
So, let's create a file and we'll call it talkpy_bs4.py. In our file, the first thing we need to do is import.
0:44
So we'll import requests, remembering what we did on the command line, in the REPL there. And the next thing we did was we specified the URL.
0:54
So let's throw that in there now. https://training.talkpython.fm/courses/all Easy peasy. Now we have our, let's just call it a main function.
1:11
You know what, let's just call it main, it's just easy. So def main and this is going to be a nice simple script so let's not go overboard.
1:20
Let's just keep it simple, stupid, alright? So we have our raw_site_page, remembering that the first thing we need to do is actually pull down the page.
1:29
So requests.get, and in the parentheses we pass in the URL. So this is the code to pull down that webpage. Now we need to make sure
1:40
that the webpage pulled down correctly. So, raw_site_page.raise_for_status(). And again, if we see output from that, meaning it raises an error, we have a problem.
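The status check described here can be tried without touching the network. The following is a minimal sketch, assuming the requests library is installed; the hand-built Response objects, the names fake_page and ok_page, and the example.com address are placeholders for illustration and are not part of the lecture's script.

```python
import requests

# Build a response by hand so the example needs no network access.
# In the real script this object would come from requests.get(URL).
fake_page = requests.Response()
fake_page.status_code = 404
fake_page.url = "https://example.com/missing"  # placeholder, not the course URL

try:
    fake_page.raise_for_status()  # raises HTTPError for 4xx and 5xx codes
    outcome = "ok"
except requests.HTTPError as err:
    outcome = "failed"
    print(f"Download problem: {err}")

# A successful response passes silently: the call returns None.
ok_page = requests.Response()
ok_page.status_code = 200
result = ok_page.raise_for_status()
```

In a script, an uncaught HTTPError stops the program with a traceback, which is the "output" to watch for.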
1:53
And next, we need to create that soup object. soup is bs4.BeautifulSoup of raw_site_page.text, so we're taking the text of the page
2:08
and we're hitting it with our HTML parser. Next, what we want to do is take the headers. Remember that the headers for these course names
2:22
are all tagged with h3. So if we select h3, then that's what we get. But the problem here is that we're talking about a script now;
2:32
we're no longer talking about the Python REPL. So, just typing in soup.select is not necessarily going to get us what we want.
2:40
Remember that when we ran soup.select for h3, for the headers, we actually got a list. So let's create ourselves a list.
2:51
Call it html_header_list, which is soup.select, and remember we choose the HTML tag that we want, and we're choosing h3.
3:03
You could also choose a CSS class of your choice, whatever. This is just where you are adding that output or whatever you pulled down into a list.
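As a rough illustration of selecting by tag versus by CSS class, here is a small sketch that runs against an inline HTML string rather than the live site. It assumes bs4 is installed; the sample markup and the course-title class name are invented for the example and are not taken from the Talk Python page.

```python
import bs4

# Made-up markup for illustration only.
SAMPLE_HTML = """
<html><body>
  <h3 class="course-title">Course One</h3>
  <h3>Course Two</h3>
  <p class="course-title">Not a header</p>
</body></html>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# Select by tag name: every h3 element, returned as a list of Tag objects.
html_header_list = soup.select("h3")

# Select by CSS class: any element carrying that class, whatever its tag.
by_class = soup.select(".course-title")

# Combine both: only h3 elements that carry the class.
h3_with_class = soup.select("h3.course-title")

# Printing the list shows the tags still wrapped around the text.
print(html_header_list)
```

Each entry in the list is a whole element, markup included, which is why a further step is needed to get at the text alone.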
3:15
Now we need to do something with that. We need to take this list and we need to get the headers out of it.
3:25
Because, if you remember, soup.select gave us all the HTML code with it. It didn't just give us those headers by themselves;
3:34
we got all the tags and everything around them. And we can do that, and I'll show you, with .getText(). So let's do a quick for loop here.
3:45
Let's loop through all of the headers in our html_header_list. So for headers in html_header_list,
3:56
what do we want to do? Well, we want to actually take the headers out and add them to another list.
4:05
So let's create ourselves a list; we'll specify this up top in a second. So we create a list called header_list. So we have our html_header_list
4:14
where we had all the tags and all the rubbish that came along with it and we're going to create a header list that is just
4:20
the headers without all the HTML tagging around them. So we do header_list.append of headers, so this here, so all the entries in this,
4:33
all the list entries in there, .getText(). And what this does is it actually goes into our h3 header tags
4:42
and it pulls out just the text between the HTML tags. So if we do that, we'll now get a nicely formatted list. We won't have all the rubbish around it;
4:54
we'll have just the text that we need, okay? So let's actually specify this list up top. Let's define it up here, we'll just create an empty one.
5:06
So by creating this empty list up here we can then append to it down here. So, let's go back here. Now, once we have this list created,
5:16
once we have our header_list here populated with just the text from getText, let's loop over it again. For headers in header_list, print headers.
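The two loops described here can be sketched on a small inline sample. This assumes bs4 is installed; the sample HTML is invented for illustration, while the variable names follow the ones used in the lecture.

```python
import bs4

# Made-up markup for illustration; the second header has a nested tag.
SAMPLE_HTML = "<h3>Course One</h3><h3>Course <b>Two</b></h3>"

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
html_header_list = soup.select("h3")

# Empty list that the first loop fills with plain text.
header_list = []

# First loop: pull the text out of each h3 Tag, dropping the markup.
for headers in html_header_list:
    header_list.append(headers.getText())

# Second loop: print each header on its own line.
for headers in header_list:
    print(headers)
```

getText() is an alias for get_text() in Beautiful Soup, so either spelling works. Text from nested tags is included, so the second header comes out as "Course Two".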
5:34
Nice simple for loop there. And that's it, that's all we need from this script because this script will print out the headers that we want.
5:44
Let's just quickly call the main function. Pop that in there: if __name__ == '__main__', and we'll just run our main function here.
5:54
So main(), we'll just call that here. And that's it. Let's go through this script really quickly one more time.
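Putting the pieces together, the script might look roughly like the sketch below. It is a reconstruction from the narration, not the author's exact file, and it departs from the lecture in three labeled ways: the parsing lives in a hypothetical helper called extract_headers so it can be tried without a network connection, header_list is built inside that helper rather than defined at the top of the file, and the download is wrapped in a try/except with a timeout. It assumes requests and bs4 are installed.

```python
import bs4
import requests

URL = "https://training.talkpython.fm/courses/all"


def extract_headers(html):
    """Return the text of every h3 tag in the given HTML string.

    Hypothetical helper, not part of the lecture's script.
    """
    soup = bs4.BeautifulSoup(html, "html.parser")
    html_header_list = soup.select("h3")
    header_list = []
    for headers in html_header_list:
        header_list.append(headers.getText())
    return header_list


def main():
    # Pull down the page and check that the download succeeded.
    try:
        raw_site_page = requests.get(URL, timeout=10)
        raw_site_page.raise_for_status()
    except requests.RequestException as err:
        # Assumption: report the failure instead of letting it crash.
        print(f"Could not download {URL}: {err}")
        return

    # Print just the header text, one per line.
    for headers in extract_headers(raw_site_page.text):
        print(headers)


if __name__ == "__main__":
    main()
```

Whether the printed lines are course names depends on the page still marking them up as h3 tags, which may have changed since the lecture was recorded.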
6:01
Import requests, import bs4, set up our URL, make our empty list that we're going to populate. Then, in the main function that we're going to call
6:11
we have the raw_site_page that we pulled down, the requests.get of the URL, so that's pulling down the Talk Python page.
6:18
Check to see if it was successful, yep. Create our soup object by parsing that raw site page with the HTML parser. We then pull out just the h3 tags
6:31
and anything that came along with them, and pop them into the html_header_list. We then loop over that and use .getText()
6:39
to pull just the text out of those h3 header tags and then we print it. Nice and simple. Let's save that and run it: python talkpy_bs4.py.
6:55
That will go out, do our request, parse it and print it. There we go, the Everything Bundle, we saw that. We have Python Jumpstart by Building 10 Apps
7:07
and we have the rest of the courses and we will find that these match our Talk Python website here.
7:16
Going down here, Mastering PyCharm, is that in our list? Yes it is. Mastering PyCharm is there, and you can go through the rest yourself.
7:24
But that's it, that's how we've written our script. This is a very simple exercise for web scraping. Hope you enjoyed it, let's move on.