Python for .NET Developers Transcripts
Chapter: Package management and external libraries
Lecture: Parsing HTML in Python
0:00
Last thing to make our little program zing is to implement this method. So let's do that right here. This is going to be a string and it's
0:09
going to return a string as well. So this is where we start using our second library that we brought in. We had colorama, we've been using that before.
0:18
This one got us the request; this one is going to do the parsing. At the top we're going to say import this. Great, it's
0:26
not being used yet, but that will be fixed very soon. Let's do something like this for now, and use a different color, say cyan or something.
0:38
Then "getting title". Let's also add flush=True on these, just to make sure this goes out right away.
0:47
A lot of stuff is happening, and we want to make sure that the buffer gets flushed; sometimes it can be delayed there.
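As a rough sketch of what those status lines might look like: the helper name status_line and the message wording are made up for illustration, and the raw ANSI escape codes stand in for colorama's Fore.CYAN and Style.RESET_ALL so the snippet runs without colorama installed.

```python
# ANSI escape codes standing in for colorama's Fore.CYAN and Style.RESET_ALL
# (an assumption for illustration, so this runs without colorama installed).
CYAN = "\033[36m"
RESET = "\033[0m"


def status_line(step: str, n: int) -> str:
    """Build a colored progress message for episode n (hypothetical helper)."""
    return f"{CYAN}Getting {step} for episode {n}{RESET}"


# flush=True sends the text out right away instead of leaving it in the buffer.
print(status_line("HTML", 220), flush=True)
print(status_line("TITLE", 220), flush=True)
```

The only part that matters here is the flush=True keyword on print; the coloring is incidental.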
0:54
And to this we also need to pass, I'll call it n, which is an int. I'm not a fan of the name, but let's just keep rolling with it.
1:04
Great, so down here we're doing n as well. So, how are we going to do this? Well, it turns out, again, with this library it's incredibly easy.
1:12
We are going to create a soup which is the bs4.BeautifulSoup like that. How does it work? You give it the HTML.
1:20
Now it's going to want another thing here and I'll go and run it and then it's... I'll show you the warning that comes out, it's
1:27
not a big deal but I'll go ahead and show it. So, then we want to get the header, the main title. So, we'll say header = soup.select_one and
1:38
let's give it a CSS thing here, how about just the tag name as h1 because there should really only be one h1, this should be fine.
1:48
We'll say if not header, meaning we couldn't find anything, we'll return "missing" or something like that. That's what we're going to say the title is.
1:55
Otherwise, we return header.text.strip() and that's it. Well, like I said, we'll get a warning here but it's not a big deal. Warning! Warning! Warning!
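Pulled together, the method described so far might look roughly like this. The function name get_title, the parameter order, the "MISSING" string, and the sample HTML are assumptions for illustration, not the code from the video. It uses select_one, which returns a single tag or None, so that header.text works; plain select returns a list of matches instead.

```python
import bs4


def get_title(html: str, n: int) -> str:
    """Return the text of the first h1 in html, or "MISSING" (hypothetical name)."""
    print(f"Getting TITLE for episode {n}", flush=True)
    # No parser is named here, so bs4 picks one itself and warns about it.
    soup = bs4.BeautifulSoup(html)
    # select_one takes a CSS selector and returns the first match, or None.
    header = soup.select_one("h1")
    if not header:
        return "MISSING"
    return header.text.strip()


sample = "<html><body><h1>  Sample episode title  </h1></body></html>"
print(get_title(sample, 220))
print(get_title("<p>no heading here</p>", 221))
```

Run as written, this should still work, with bs4 printing its parser warning before the output.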
2:13
Let's see, it's printing it out. It hasn't actually printed the title yet, but presumably it's working.
2:19
To get rid of this warning, you need to additionally specify the HTML parser or an alternative HTML parser.
2:27
So what we can put in right here is just this, in quotes. And also, let's print the title; we'll go with green.
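A minimal sketch of that fix: the parser name goes in as the second argument to BeautifulSoup. "html.parser" is the parser built on Python's standard library; that it is the exact string typed in the video is an assumption. The sample HTML and message wording are also made up, and the ANSI code stands in for colorama's Fore.GREEN.

```python
import bs4

# ANSI escape codes standing in for colorama's Fore.GREEN and Style.RESET_ALL
# (an assumption for illustration, so colorama is not required here).
GREEN = "\033[32m"
RESET = "\033[0m"

html = "<html><body><h1>Sample episode title</h1></body></html>"

# Naming the parser explicitly is what silences the warning.
soup = bs4.BeautifulSoup(html, "html.parser")
header = soup.select_one("h1")
title = header.text.strip() if header else "MISSING"

print(f"{GREEN}Title found: {title}{RESET}", flush=True)
```

Other parsers such as lxml can be named the same way if they are installed.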
2:40
Yes, look at it! It totally works! Getting the HTML for 220, title for 220. Boom, there's the title! Getting for 228, Hunting bugs and tech startups
2:50
with Python, Building advanced Pythonic interviews with docassemble, and so on. This is the title; if we were to go to that URL,
2:58
that's what's in the h1 tag. We were able to recreate it using Python and remember how I said it was super cool.
3:05
Over here, we were able to do this in C# in just 76 lines of code. We did 42 over here, well, 41 really. How awesome is that!
3:16
We can completely superpower our applications by using stuff off of PyPI. The way we do it is set up a virtual environment, specify the
3:25
requirements, and we either pip install them all here or we just individually pip install the name of the package you want, and you can
3:32
import it, use it. You're off to the races, beautiful.