#100DaysOfWeb in Python Transcripts
Chapter: Days 73-76: Web Scraping
Lecture: Write a Talk Python Training web scraping script
Now that we've figured out what we want, and we've figured out what commands we need to run to get what we want, let's actually pop it into a script.
Just to quickly mention: if you are stuck in that Python REPL, or shell, whatever you want to call it, just type in exit() with the parentheses there
and that should get you out of it and back to the command line. Now, in our folder we only have a virtual environment, right?
So, let's create a file and we'll call it talkpy_bs4.py. In our file, the first thing we need to do is our imports.
So we'll import requests and bs4, remembering what we did on the command line, in the REPL there. And the next thing we did was we specified the URL.
So let's throw that in there now. https://training.talkpython.fm/courses/all Easy peasy. Now we have our, let's just call it a main function.
You know what, let's just call it main, it's just easy. So def main and this is going to be a nice simple script so let's not go overboard.
Let's just keep it simple, stupid, alright? So we have our raw_site_page, remembering the first thing we need to do is actually pull down the page.
So requests.get, and in the brackets we're going to pass the URL. So this is the code to pull down that webpage. Now we need to make sure
that the webpage pulled down correctly. So, raw_site_page.raise_for_status(). And again, if we see output from that, meaning it raised an error, that means we have a problem.
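The download-and-check step can be sketched roughly like this. The helper name fetch_page and the timeout value are assumptions for illustration and are not taken from the lecture's script. The hand-built Response objects only stand in for real replies, so the check can be tried without touching the network.

```python
import requests

URL = "https://training.talkpython.fm/courses/all"


def fetch_page(url):
    """Download a page and fail loudly on a 4xx/5xx reply (hypothetical helper)."""
    raw_site_page = requests.get(url, timeout=10)  # timeout is an assumption
    # Returns None on success; raises requests.HTTPError on an error status.
    raw_site_page.raise_for_status()
    return raw_site_page


# Offline demonstration with hand-built responses (no network needed).
ok_reply = requests.Response()
ok_reply.status_code = 200
ok_result = ok_reply.raise_for_status()  # silent: nothing is raised

bad_reply = requests.Response()
bad_reply.status_code = 404
try:
    bad_reply.raise_for_status()
    saw_error = False
except requests.HTTPError:
    saw_error = True

print(ok_result, saw_error)
```

In a script, a failed check shows up as an exception that stops the run, rather than as output echoed back in the REPL.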
And next, we need to create that soup object. soup is bs4.BeautifulSoup, passing in raw_site_page.text and the HTML parser, so we're taking the text of the page
and we're hitting it with our HTML parser. Next, what we want to do is take the headers. Remember that the headers for these course names
are all tagged with h3. So if we select h3 then that's what we get. But the problem here is that we're talking about a script,
we're no longer talking about the Python REPL. So, just typing in soup.select is not necessarily going to get us what we want, because a script won't echo the result back the way the REPL does.
When we ran soup.select for h3 for the header, we actually got a list. So let's create ourselves a list.
Call it HTML header list, equals soup.select, and remember we choose the HTML tag that we want, and we're choosing h3.
You could also choose a CSS class of your choice, whatever. This is just where you are adding that output or whatever you pulled down into a list.
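Here is a small sketch of what soup.select hands back, run against a made-up HTML snippet rather than the live page. The sample markup, the title class and the snake_case name html_header_list are assumptions for illustration; "html.parser" is the parser that ships with Python.

```python
import bs4

# Made-up stand-in for the downloaded page text.
sample_html = """
<div>
  <h3 class="title">Course A</h3>
  <p>Some description</p>
  <h3 class="title">Course B</h3>
</div>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector and returns a list of Tag objects.
html_header_list = soup.select("h3")   # select by tag name
by_class = soup.select(".title")       # select by CSS class instead

for tag in html_header_list:
    print(tag)  # each entry is still wrapped in its <h3> markup
```

Both selectors find the same two elements here, and each entry still carries its tags, which is why a second step is needed to get at the text.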
What we need to do now is do something with that. We need to take this list and we need to get the headers out of it.
Because, if you remember, soup.select gave us all the HTML code with it. It didn't just give us those headers by themselves;
we got all the tags and everything around it. And we can do that, and I'll show you, with .getText. So let's do a quick for loop here.
Let's parse through all of the headers in our HTML header list.
So for headers in HTML header list, what do we want to do? Well, we want to actually take the headers out and add them to another list.
So let's create ourselves a list; we'll specify this up top in a second. So we create a list called header list. So we have our HTML header list,
where we had all the tags and all the rubbish that came along with it, and we're going to create a header list that is just
the headers without all the HTML tagging around it. So we do header list .append, and inside it headers, so each of the entries in
our HTML header list, with .getText called on it. And what this does is it actually goes into our h3 header tags
and it pulls out just the text between the HTML tags. So if we do that, we'll now get a nicely formatted list. We won't have all the rubbish around it;
we'll have just the text that we need, okay? So let's actually specify this list up top. Let's define it up here, we'll just create an empty one.
So by creating this empty list up here we can then append to it down here. So, let's go back here. Now, once we have this list created,
once we have our header list here populated with just the text from .getText, let's loop through it again. For headers in header list, print headers.
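The two loops described above can be sketched like this, again on a made-up snippet instead of the real page. The sample HTML and the snake_case names are assumptions; the code uses get_text(), the spelling in the Beautiful Soup documentation, and bs4 also accepts the camelCase getText used in the lecture.

```python
import bs4

# Made-up stand-in for the downloaded page text.
sample_html = "<h3>Course A</h3><p>blurb</p><h3>Course B</h3>"
soup = bs4.BeautifulSoup(sample_html, "html.parser")

html_header_list = soup.select("h3")  # Tag objects, markup included
header_list = []                      # the empty list defined "up top"

for headers in html_header_list:
    # get_text() returns only the text inside the tag, without the markup.
    header_list.append(headers.get_text())

for headers in header_list:
    print(headers)
```

After the first loop, header_list holds plain strings, so the second loop prints one course name per line.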
Nice simple for loop there. And that's it, that's all we need from this script because this script will print out the headers that we want.
Let's just quickly call the main function. Pop that in there: if __name__ == "__main__", and we'll just run our main function here.
So main, the function we wrote with def main, we'll just call that here. And that's it. Let's go through this script really quickly one more time.
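As a minimal sketch of that entry-point pattern, with a placeholder body standing in for the real scraping code (the print message is made up for illustration):

```python
def main():
    # Placeholder body; the real script fetches and prints the headers here.
    print("main() was called")


# __name__ is "__main__" only when the file is run directly,
# e.g. `python talkpy_bs4.py`; when imported it is the module name instead.
if __name__ == "__main__":
    main()
```

The guard means the script runs from the command line but can still be imported elsewhere without kicking off the scrape.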
Import requests, import bs4, set up our URL, make our empty list that we're going to populate. Then, in the main function that we're going to call,
we have the raw site page that we pulled down, the requests.get of the URL, so that's pulling down the Talk Python page.
Check to see if it was successful, yep. Create our soup object by parsing that raw site page with the HTML parser. We then pull out just the h3 tags
and anything that came along with them, and pop them into the HTML header list. We then loop over that and use .getText
to pull just the text out of those h3 header tags, and then we print it. Nice and simple. Let's save that and run it: python talkpy_bs4.py.
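Put together, the recap might look roughly like the sketch below. It is a reconstruction from the spoken walkthrough, not the author's actual file: the snake_case names, the timeout and the optional page_text parameter are assumptions, the last one added so the parsing can be tried on a literal string without the network. The closing call to main() is left off for the same reason.

```python
import bs4
import requests

URL = "https://training.talkpython.fm/courses/all"
header_list = []  # empty list up top, filled in by main()


def main(page_text=None):
    """Print the text of every h3 header on the page.

    page_text is an assumed extra parameter: pass HTML directly to skip
    the download, or leave it as None to fetch URL as in the lecture.
    """
    if page_text is None:
        raw_site_page = requests.get(URL, timeout=10)  # timeout is an assumption
        raw_site_page.raise_for_status()  # raises on a 4xx/5xx reply
        page_text = raw_site_page.text

    soup = bs4.BeautifulSoup(page_text, "html.parser")
    html_header_list = soup.select("h3")  # tags and all

    for headers in html_header_list:
        header_list.append(headers.get_text())  # text only

    for headers in header_list:
        print(headers)
```

Ending the file with the usual `if __name__ == "__main__":` guard that calls main() makes `python talkpy_bs4.py` run it against the live site, while calling `main("<h3>Demo</h3>")` tries the parsing on a literal string instead.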
That will go out, do our requests, parse it and print it. There we go, the Everything Bundle, we saw that. We have Python Jumpstart by Building 10 Apps
and we have the rest of the courses, and we will find that these match our Talk Python website here.
Going down here, Mastering PyCharm, is that in our list? Yes it is. Mastering PyCharm there, and you can go through the rest yourself.
But that's it, that's how we've written our script. This is a very simple exercise for web scraping. Hope you enjoyed it, let's move on.