#100DaysOfCode in Python Transcripts
Chapter: Days 28-30: Regular Expressions
Lecture: Compiling regexes with re.VERBOSE

0:00 Compiling regexes. If you want to run a regex multiple times, it can be more efficient and convenient to define it in a re.compile statement.
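As a minimal sketch of that idea (the sample strings and the name `leading_digits` are made up for illustration and are not the lecture's data), a compiled pattern exposes the same `match` method as the module-level function:

```python
import re

# Hypothetical sample strings, not the lecture's data.
lines = ["12 apples", "pears", "7 plums"]

# One-off style: the pattern string is passed to re.match on every call.
one_off = [bool(re.match(r"\d+", s)) for s in lines]

# Compiled style: build the pattern object once, then reuse its methods.
leading_digits = re.compile(r"\d+")
reused = [bool(leading_digits.match(s)) for s in lines]

print(one_off == reused)  # True
```

The re module also caches recently used patterns internally, so in small scripts the gain from compiling is often more about convenience and readability than raw speed.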
0:12 Let's work with some data. Here I define a list of movies, using two extra things in Python: you can define a multi-line string with triple quotes,
0:22 and you can build up a list by splitting a multi-line string, or any string, on a space or, in this case, a newline. So this gives me a nice list
0:33 with the first element, 1. Citizen Kane, the second element, The Godfather, etc. And the task here is to identify movie titles with exactly two words.
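The data cell itself is not part of the transcript, so the following is a reconstruction from the titles mentioned in the narration. The numbering and the exact selection of movies are assumptions.

```python
# Reconstructed sample data: titles come from the narration,
# the ranking numbers and the selection are assumptions.
movies = """1. Citizen Kane (1941)
2. The Godfather (1972)
3. Casablanca (1942)
4. Schindler's List (1993)
5. Vertigo (1958)
6. The Wizard of Oz (1939)""".split("\n")

print(movies[0])    # 1. Citizen Kane (1941)
print(len(movies))  # 6
```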
0:43 Before moving along, maybe you want to give this a try yourself. I hope you had fun working on this little regex problem.
0:50 And an extra concept I'm going to show you is the use of re.VERBOSE, which allows you to wrap your regular expressions over multiple lines
0:59 and add comments, which is great to teach them and makes them more readable. So let's load in the data and let's start writing a multi-line,
1:11 medium advanced, regular expression. And as we're talking about compiling one, the syntax for that is re.compile.
1:20 I'm using a raw string, and as explained before, you can make a multi-line string with triple quotes. And I'm going to write this out
1:29 because it's quite a large regular expression. And then we come back and I explain line by line what it does. So here you go.
1:38 To define the start of a string, we use the caret then we need to match one or more digits. Let me scroll a little bit up to see the data,
1:49 so the numbering of the movies. Then we match a literal dot, and note that I escape the dot otherwise it would match any character.
1:58 Then we have one or more spaces. And then I use non-capturing parentheses. So we've seen capturing parentheses before,
2:08 but if you add, inside the parentheses, question mark, colon, it kind of undoes the capturing. Then you can group the various things together
2:19 without capturing. And we use a character class then to include uppercase, lowercase, and single quote. And we want one or more of them,
2:31 followed by a space. And I commented all of that on the right. Then we close those parentheses. Then we want exactly two of those.
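To illustrate the difference between the two kinds of parentheses, here is a small sketch. The test string is an assumption chosen for the example.

```python
import re

text = "Citizen Kane "  # hypothetical input: two words, each followed by a space

# Capturing group: the group is remembered (only its last repetition is kept).
capturing = re.match(r"([A-Za-z']+\s){2}", text)

# Non-capturing group: same grouping and repetition, but nothing is stored.
non_capturing = re.match(r"(?:[A-Za-z']+\s){2}", text)

print(capturing.groups())      # ('Kane ',)
print(non_capturing.groups())  # ()
print(capturing.group(0) == non_capturing.group(0))  # True
```

Both patterns match the same text. The `?:` only changes whether the group's content is kept for later retrieval.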
2:45 So just go back to the data, and that's basically a word. Why don't I do just backslash w? Turns out that was my first approach, but,
2:58 and that's the funny thing with parsing data or strings, is that there are always these exceptions. And Singin' has this apostrophe or single quote
3:08 and I had to account for that, so instead of just word, I had to go with a more specific portion of the regular expression.
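A quick way to see why `\w` alone falls short is to test both approaches on a word containing an apostrophe. This sketch uses `re.fullmatch` for brevity, which is not what the lecture uses.

```python
import re

word = "Schindler's"

# \w covers letters, digits and underscore, but not the apostrophe.
with_w = re.fullmatch(r"\w+", word)

# The character class adds the single quote explicitly.
with_class = re.fullmatch(r"[A-Za-z']+", word)

print(with_w)              # None
print(with_class.group())  # Schindler's
```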
3:18 So two of those because we want to match the ones that have exactly two words. Then, we do a literal open parentheses, and as with the dot, right,
3:31 all these characters have a special meaning. Dot matches all, parentheses are for capturing, so if you want a literal one, you have to escape them.
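The effect of escaping can be checked in isolation. The input strings below are made up for the example.

```python
import re

# Unescaped dot matches any character, so "1x" is accepted.
unescaped_dot = bool(re.match(r"1.", "1x"))

# Escaped dot only matches a literal ".".
escaped_dot = bool(re.match(r"1\.", "1x"))

# Escaped parentheses match literal "(" and ")" around a four-digit year.
year = re.search(r"\(\d{4}\)", "Vertigo (1958)")

print(unescaped_dot)  # True
print(escaped_dot)    # False
print(year.group())   # (1958)
```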
3:40 So here I'm doing the same thing as with the dot, and that's escaping the parentheses. I want literal parentheses, because the years
3:47 are in parentheses. Then the years are four digits, and then I do a dollar which is the end of the string. Phew, that was quite a regular expression,
3:59 but the nice thing about re.VERBOSE is that I could add all these comments, which made it super easy to explain it to you.
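The code cell is not included in the transcript, so the following is a reconstruction of the pattern from the line-by-line description. The exact layout and comment wording are assumptions. Only the name `pat` comes from the narration.

```python
import re

# Reconstructed from the narration; layout and comments are assumptions.
pat = re.compile(r"""
    ^              # start of the string
    \d+            # one or more digits (the movie's number)
    \.             # a literal dot
    \s+            # one or more spaces
    (?:            # non-capturing group for one "word"
      [A-Za-z']+   # letters or a single quote, one or more
      \s           # followed by a space
    ){2}           # exactly two of those words
    \(             # literal opening parenthesis
    \d{4}          # four-digit year
    \)             # literal closing parenthesis
    $              # end of the string
""", re.VERBOSE)

print(bool(pat.match("1. Citizen Kane (1941)")))  # True
print(bool(pat.match("3. Casablanca (1942)")))    # False
```

With re.VERBOSE, whitespace in the pattern is ignored and everything after a `#` is treated as a comment, which is why a space has to be written as `\s` here.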
4:07 So run that cell, and it's now stored in pat. And pat is just a variable name, and now I can use that pattern to loop over the movies and match them all.
4:18 So let's do that next. For movie in movies. Print movie. Just the text. And then I can use match on the pattern. So before we did re.match,
4:34 but now we can do pattern.match. And I'm interested in match because I want to match the string from beginning to end. Put in a movie and there you go.
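A sketch of that loop, kept self-contained: the movie list is a reconstruction from the narration, and the one-line pattern is a condensed stand-in for the multi-line regex described earlier, so neither is the lecture's exact cell.

```python
import re

# Assumed sample data, reconstructed from the titles in the narration.
movies = [
    "1. Citizen Kane (1941)",
    "2. The Godfather (1972)",
    "3. Casablanca (1942)",
    "4. Schindler's List (1993)",
    "5. Vertigo (1958)",
    "6. The Wizard of Oz (1939)",
]

# Condensed one-line equivalent of the verbose pattern (an assumption).
pat = re.compile(r"^\d+\.\s+(?:[A-Za-z']+\s){2}\(\d{4}\)$")

# pat.match(movie) replaces re.match(pattern_string, movie).
two_word_titles = []
for movie in movies:
    if pat.match(movie):
        two_word_titles.append(movie)

print(two_word_titles)
```

Under these assumptions, the list ends up holding Citizen Kane, The Godfather, and Schindler's List.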
4:47 So let's check if this regular expression is actually correct. Citizen Kane, two words. Match. The Godfather. Match. Casablanca, one word. Not a match.
4:58 And Schindler's List, this was another tricky one. In the first iteration, I did not match this because again, I had to account for that single quote
5:08 which I told you about before. So this one is actually a match because I consider Schindler's as one word. Vertigo's not and The Wizard of Oz,
5:16 four words, is not. So how cool is that? Let's move on to advanced string replacing.

