#100DaysOfCode in Python Transcripts
Chapter: Days 28-30: Regular Expressions
Lecture: Compiling regexes with re.VERBOSE
0:00 Compiling regexes.
0:03 If you want to run a regex multiple times,
0:06 it can be more efficient and convenient
0:08 to define it in a re.compile statement.
0:11 Let's work with some data.
0:13 Here I define a list of movies
0:15 and show two extra things in Python.
0:17 You can define a multi-line string with triple quotes,
0:21 and you can build up a list by splitting
0:24 a multi-line string, or whatever string,
0:27 on a space or, in this case, a newline.
0:31 So this gives me a nice list
0:32 with the first element, 1. Citizen Kane,
0:35 second element, The Godfather, etc.
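A minimal sketch of the splitting step described above. The six titles are the ones named in this lecture; the exact layout of each line (rank, dot, title, year in parentheses) is an assumption based on the description, and the real data set is probably longer.

```python
# Sample data: an assumption reconstructed from the titles named in the lecture.
movies_text = """1. Citizen Kane (1941)
2. The Godfather (1972)
3. Casablanca (1942)
8. Schindler's List (1993)
9. Vertigo (1958)
10. The Wizard of Oz (1939)"""

# Splitting on the newline character gives one list element per movie.
movies = movies_text.split("\n")

print(movies[0])    # 1. Citizen Kane (1941)
print(len(movies))  # 6
```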
0:37 And the task here is to
0:39 identify movie titles with exactly two words.
0:42 Before moving along, maybe you want to
0:44 give this a try yourself.
0:45 I hope you had fun working on this little regex problem.
0:49 And an extra concept I'm going to show you is
0:50 the use of re.VERBOSE, which allows you to
0:55 wrap your regular expressions over multiple lines
0:58 and add comments, which is great for
1:01 teaching them and makes them more readable.
1:04 So let's load in the data
1:06 and let's start writing a multi-line,
1:10 moderately advanced regular expression.
1:13 And as we're talking about compiling one,
1:16 the syntax for that is re.compile.
1:19 I'm using a raw string, and as explained before,
1:22 you can make a multi-line string with triple quotes.
1:26 And I'm going to write this out
1:28 because it's quite a large regular expression.
1:31 And then we'll come back
1:33 and I'll explain line by line what it does.
1:36 So here you go.
1:37 To define the start of a string,
1:39 we use the caret,
1:41 then we need to match one or more digits.
1:46 Let me scroll a little bit up to see the data,
1:48 so the numbering of the movies.
1:50 Then we match a literal dot,
1:52 and note that I escape the dot
1:54 otherwise it would match any character.
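A small illustration of why the dot needs escaping. The test strings here are made up for the example and are not from the lecture's data.

```python
import re

# Unescaped dot: matches any character, so "1x" is accepted.
unescaped = re.match(r"\d+.", "1x")

# Escaped dot: only a literal "." is accepted after the digits.
escaped_wrong = re.match(r"\d+\.", "1x")
escaped_right = re.match(r"\d+\.", "1.")

print(bool(unescaped))      # True
print(bool(escaped_wrong))  # False
print(bool(escaped_right))  # True
```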
1:57 Then we have one or more spaces.
1:59 And then I use non-capturing parentheses.
2:03 So we've seen capturing parentheses before,
2:07 but if you add a question mark and a colon right after the opening parenthesis,
2:13 it kind of undoes the capturing.
2:16 Then you can group the various things together
2:18 without capturing.
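To make the difference concrete, here is a short sketch comparing a capturing group with a non-capturing one. The pattern and input string are illustrative and not taken from the lecture.

```python
import re

# Capturing group: the grouped text is stored and returned by groups().
capturing = re.match(r"(ab)+c", "ababc")

# Non-capturing group: same grouping for the "+" quantifier, nothing stored.
non_capturing = re.match(r"(?:ab)+c", "ababc")

print(capturing.groups())      # ('ab',)
print(non_capturing.groups())  # ()
print(non_capturing.group(0))  # ababc
```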
2:20 And we use a character class then
2:23 to include uppercase, lowercase, and single quote.
2:28 And we want one or more of them,
2:30 followed by a space.
2:33 And I commented all of that on the right.
2:36 Then we close that parenthesis.
2:41 Then we want exactly two of those.
2:44 So just go back to the data,
2:47 and that's basically a word.
2:50 Why don't I do just backslash w?
2:54 Turns out that was my first approach, but,
2:57 and that's the funny thing with parsing data or strings,
3:01 is that there are always these exceptions.
3:03 And Singin' has this apostrophe or single quote
3:07 and I had to account for that,
3:08 so instead of just word, I had to go with
3:12 a more specific portion of the regular expression.
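A quick sketch of the problem with backslash w, using a word from the movie list. The use of re.fullmatch here is only for the demonstration and is not something the lecture uses.

```python
import re

word = "Schindler's"

# \w covers letters, digits and underscore, but not the apostrophe.
with_w = re.fullmatch(r"\w+", word)

# A character class that explicitly adds the single quote accepts the word.
with_class = re.fullmatch(r"[A-Za-z']+", word)

print(bool(with_w))      # False
print(bool(with_class))  # True
```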
3:17 So two of those because we want to match
3:19 the ones that have exactly two words.
3:23 Then, we do a literal opening parenthesis,
3:27 and as with the dot, right,
3:30 all these characters have a special meaning.
3:33 Dot matches all, parentheses are for capturing,
3:36 so if you want a literal one, you have to escape them.
3:39 So here I'm doing the same thing as with the dot,
3:41 and that's escaping the parentheses.
3:43 I want literal parentheses, because the years
3:46 are in parentheses.
3:50 Then the years are four digits,
3:52 and then I do a dollar which is the end of the string.
3:55 Phew, that was quite a regular expression,
3:58 but the nice thing about re.VERBOSE is that
4:00 I could add all these comments,
4:02 which made it super easy to explain it to you.
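The pattern is not shown in the transcript, so the following is a reconstruction from the line-by-line description above and not necessarily the exact code typed in the video. The variable name pat comes from the lecture; the layout and comment wording are assumptions.

```python
import re

# Reconstructed from the spoken description; treat as a sketch.
pat = re.compile(r"""
    ^                   # start of string
    \d+                 # one or more digits (the ranking number)
    \.                  # literal dot
    \s+                 # one or more spaces
    (?:                 # non-capturing group
        [A-Za-z']+\s    # letters or single quote, followed by a space
    ){2}                # close the group, exactly two of those
    \(                  # literal opening parenthesis
    \d{4}               # four-digit year
    \)                  # literal closing parenthesis
    $                   # end of string
""", re.VERBOSE)

print(bool(pat.match("1. Citizen Kane (1941)")))  # True
print(bool(pat.match("3. Casablanca (1942)")))    # False
```

With re.VERBOSE, whitespace in the pattern is ignored and everything after a # on a line is treated as a comment, which is what makes this layout possible.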
4:06 So run that cell, and it's now stored in pat.
4:09 And pat is just a variable name,
4:11 and now I can use that pattern
4:14 to loop over the movies and match them all.
4:17 So let's do that next.
4:20 For movie in movies.
4:24 Print movie.
4:26 Just the text.
4:28 And then I can use match on the pattern.
4:30 So before we did re.match,
4:33 but now we can do pattern.match.
4:36 And I'm interested in match because it anchors at
4:38 the start of the string, and the dollar in the pattern covers the end.
4:43 Put in a movie
4:45 and there you go.
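A sketch of the loop just described. To keep it short, the pattern is written here in its compact one-line form, which is assumed to be equivalent to the verbose version explained earlier. The sample list is likewise an assumption built from the titles named in this lecture.

```python
import re

# Compact form of the pattern (assumed equivalent to the verbose one).
pat = re.compile(r"^\d+\.\s+(?:[A-Za-z']+\s){2}\(\d{4}\)$")

# Sample data: reconstructed from the titles mentioned in the lecture.
movies = [
    "1. Citizen Kane (1941)",
    "2. The Godfather (1972)",
    "3. Casablanca (1942)",
    "8. Schindler's List (1993)",
    "9. Vertigo (1958)",
    "10. The Wizard of Oz (1939)",
]

matched = []
for movie in movies:
    # pat.match(movie) does the same job as re.match(pattern, movie),
    # but reuses the already compiled pattern.
    m = pat.match(movie)
    print(movie, "->", "match" if m else "no match")
    if m:
        matched.append(movie)
```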
4:46 So let's check if this regular expression is
4:50 actually correct.
4:51 Citizen Kane, two words. Match.
4:53 The Godfather. Match.
4:55 Casablanca, one word. Not a match.
4:57 And Schindler's List, this was another tricky one.
5:00 In the first iteration, I did not match this because
5:04 again, I had to account for that single quote
5:07 which I told you about before.
5:09 So this one is actually a match
5:10 because I consider Schindler's as one word.
5:14 Vertigo is not, and The Wizard of Oz,
5:15 four words, is not.
5:17 So how cool is that?
5:19 Let's move on to advanced string replacing.