Python for the .NET developer Transcripts
Chapter: Package management and external libraries
Lecture: Parsing HTML in Python
0:00 Last thing to make our little program
0:01 zing is to implement this method.
0:03 So let's do that right here.
0:05 This is going to be a string and it's
0:08 going to return a string as well.
0:11 So this is where we start using
0:13 our second library that we brought in.
0:15 We had colorama, we've been using that before.
0:17 This got us the request, this is going to do the parsing.
0:21 At the top we're going to say import bs4. Great, we're
0:25 not using it yet but that will be fixed very soon.
0:29 Let's do something like this for now, and put
0:34 a different color, say cyan or something.
0:37 Then "getting title". Let's also add a flush=True
0:43 on these just to make sure this goes out right away.
0:46 A lot of stuff is happening, want to make sure that the
0:49 buffer gets flushed, sometimes it can be delayed there.
0:53 And this, we also need to pass, I'll do n,
0:58 which was n. I'm not a fan of the name.
1:01 Let's just keep rolling with it.
1:03 Great, so down here doing n as well.
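A minimal sketch of the print call described above, assuming colorama is installed. The helper name `report`, the message text, and the optional `stream` parameter are hypothetical and only there for illustration; the lecture's actual code may look different.

```python
from colorama import Fore


def report(message: str, n: int, stream=None) -> None:
    # flush=True pushes the text out right away instead of leaving it
    # in the output buffer, which matters when a lot is being printed.
    # stream=None means print() falls back to sys.stdout.
    print(Fore.CYAN + f"{message} for episode {n}" + Fore.RESET,
          file=stream, flush=True)


report("Getting TITLE", 220)
```

The color codes are just strings that get prepended to the message, so they combine with f-strings and with `flush=True` without any extra setup.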
1:07 So, how are we going to do this?
1:08 Well, it turns out again with this library
1:10 it's incredibly easy.
1:11 We are going to create a soup which is the
1:13 bs4.BeautifulSoup like that.
1:16 How does it work? You give it the HTML.
1:19 Now it's going to want another thing here and
1:23 I'll go and run it and then it's...
1:25 I'll show you the warning that comes out, it's
1:26 not a big deal but I'll go ahead and show it.
1:28 So, then we want to get the header, the main title.
1:31 So, we'll say header = soup.select_one and
1:37 let's give it a CSS thing here, how about just the
1:40 tag name as h1 because there should really only
1:44 be one h1, this should be fine.
1:47 We'll say if not header, meaning we couldn't find anything,
1:49 we'll return "MISSING" or something like that.
1:52 That's what we're going to say the title is.
1:54 Otherwise, we return header.text.strip() and that's it.
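One detail worth spelling out: calling `.text` on the result only works when the lookup returns a single element. The sketch below, using a made-up HTML snippet, shows how Beautiful Soup's `select` and `select_one` differ. The parser name is passed explicitly here just to keep the example quiet.

```python
import bs4

# Made-up page content, for illustration only.
html = "<html><body><h1>  Episode title  </h1></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

all_h1 = soup.select("h1")        # a list of every matching element
first_h1 = soup.select_one("h1")  # the first match, or None if there is none
missing = soup.select_one("h2")   # no h2 in this page, so this is None

print(len(all_h1))
print(repr(first_h1.text.strip()))
print(missing)
```

Because `select_one` gives back `None` when nothing matches, the `if not header` check is what keeps the function from failing on a page without an h1.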
2:03 Well, like I said, we'll get a warning here but
2:05 it's not a big deal. Warning! Warning! Warning!
2:12 Let's see, it's printing it out, it hasn't actually
2:16 printed the title yet but presumably it's working.
2:18 To get rid of this warning, you need to additionally
2:21 specify the HTML parser or an alternative HTML parser.
2:26 So, what we can put in is right here, just this,
2:29 like this, in some quotes.
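Putting the pieces together, the function might look roughly like the sketch below. The name `get_title` and the "MISSING" placeholder are assumptions, and `"html.parser"` (Python's built-in parser) is one reasonable value for the second argument; the warning text itself names the parser Beautiful Soup picked on your machine. Colors are left out to keep the focus on the parsing.

```python
import bs4


def get_title(html: str, n: int) -> str:
    print(f"Getting TITLE for episode {n}", flush=True)

    # Naming the parser explicitly avoids the "no parser was explicitly
    # specified" warning and keeps results consistent across machines.
    soup = bs4.BeautifulSoup(html, "html.parser")

    header = soup.select_one("h1")
    if not header:
        # No h1 on the page: return a placeholder instead of failing.
        return "MISSING"

    return header.text.strip()


# Made-up page content, for illustration only.
print(get_title("<html><body><h1>  Some episode  </h1></body></html>", 220))
```

If lxml happens to be installed, `"lxml"` is an alternative parser name that can be passed in the same position.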
2:31 And also, let's print the title, go with green.
2:39 Yes, look at it! It totally works!
2:42 Getting the HTML for 220, title for 220.
2:45 Boom, there's the title!
2:47 Getting for 228: "Hunting bugs and tech startups
2:49 with Python", "Building advanced Pythonic interviews
2:53 with docassemble", and so on.
2:55 This is the title, if we were to go to that URL
2:57 that's what's in the h1 tag.
2:59 We were able to recreate it using Python and
3:02 remember how I said it was super cool.
3:04 Over here, we were able to do this in C# in
3:07 just 76 lines of code.
3:10 We did 42 over here, well, 41 really.
3:13 How awesome is that!
3:15 We can completely superpower our applications
3:18 by using stuff off of PyPI. The way we do it is
3:22 set up a virtual environment, specify the
3:24 requirements, and we either pip install them
3:26 all here or we just individually pip install
3:29 the name of the package you want, and you can
3:31 import it, use it. You're off to the races, beautiful.