#100DaysOfCode in Python Transcripts
Chapter: Days 28-30: Regular Expressions
Lecture: Compiling regexes with re.VERBOSE
0:00
Compiling regexes.
0:03
If you want to run a regex multiple times,
0:06
it can be more efficient and convenient
0:08
to define it in a re.compile statement.
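As a rough illustration of the point above, here is a minimal sketch of compiling a pattern once and reusing it. The pattern and the sample strings are hypothetical and are not taken from the lecture.

```python
import re

# Hypothetical pattern and inputs, chosen only for illustration.
year = re.compile(r"\d{4}")  # compile once

lines = ["released in 1941", "no year here", "remade in 1972"]

# Reuse the same compiled pattern object for every string.
found = [m.group() for m in map(year.search, lines) if m]
print(found)
```

The compiled object exposes the same operations as the module-level functions (`search`, `match`, and so on), so the pattern is written only once.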
0:11
Let's work with some data.
0:13
Here I define a list of movies
0:15
and two extra things in Python.
0:17
You can define a multi-line string with triple quotes,
0:21
and you can build up a list by splitting
0:24
a multi-line string, or whatever string,
0:27
on a space or, in this case, a newline.
0:31
So this gives me a nice list
0:32
with the first element, number one, Citizen Kane,
0:35
the second element, The Godfather, etc.
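A small sketch of the idea just described: a triple-quoted string split on newlines to build a list. The titles are the ones mentioned in this lecture, but the exact numbering and layout of the on-screen data are assumed here.

```python
# Illustrative data: the titles come from the lecture, while the
# numbering and exact formatting are assumptions.
movies = """1. Citizen Kane (1941)
2. The Godfather (1972)
3. Casablanca (1942)
4. Schindler's List (1993)
5. Vertigo (1958)
6. The Wizard of Oz (1939)""".split("\n")

print(len(movies))
print(movies[0])
```

`str.splitlines()` would likely work just as well here; `split("\n")` is used because the lecture talks about splitting on a newline.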
0:37
And the task here is to
0:39
identify movie titles with exactly two words.
0:42
Before moving along, maybe you want to
0:44
give this a try yourself.
0:45
I hope you had fun working on this little regex problem.
0:49
And an extra concept I'm going to show you is
0:50
the use of re.VERBOSE, which allows you to
0:55
wrap your regular expressions over multiple lines
0:58
and add comments, which is great to
1:01
teach them and makes them more readable.
1:04
So let's load in the data
1:06
and let's start writing a multi-line,
1:10
medium-advanced regular expression.
1:13
And as we're talking about compiling one,
1:16
the syntax for that is re.compile.
1:19
I'm using a raw string, and as explained before,
1:22
you can make a multi-line string with triple quotes.
1:26
And I'm going to write this out
1:28
because it's quite a large regular expression.
1:31
And then we come back
1:33
and I explain line by line what it does.
1:36
So here you go.
1:37
To define the start of a string,
1:39
we use the caret
1:41
then we need to match one or more digits.
1:46
Let me scroll a little bit up to see the data,
1:48
so the numbering of the movies.
1:50
Then we match a literal dot,
1:52
and note that I escape the dot
1:54
otherwise it would match any character.
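To make the effect of escaping concrete, here is a short sketch comparing an escaped and an unescaped dot. The second test string is made up for the comparison.

```python
import re

escaped = re.compile(r"^\d+\.")   # \. matches only a literal dot
unescaped = re.compile(r"^\d+.")  # .  matches any character

ok = escaped.match("1. Citizen Kane")        # digit followed by a real dot
rejected = escaped.match("1x Citizen Kane")  # "x" is not a dot
too_loose = unescaped.match("1x Citizen Kane")

print(bool(ok), bool(rejected), bool(too_loose))
```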
1:57
Then we have one or more spaces.
1:59
And then I use non-capturing parentheses.
2:03
So we've seen capturing parentheses before,
2:07
but if you add a question mark and a colon right after the opening parenthesis,
2:13
it kind of undoes the capturing.
2:16
Then you can group the various things together
2:18
without capturing.
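A quick sketch of the difference between capturing and non-capturing parentheses. It uses a simplified `\w+` subpattern and a made-up input string rather than the full pattern from the lecture.

```python
import re

text = "Citizen Kane "

capturing = re.match(r"(\w+\s){2}", text)        # plain parentheses capture
non_capturing = re.match(r"(?:\w+\s){2}", text)  # (?: ... ) only groups

# Both match the same text, but only the first one stores a group.
print(capturing.group(0), capturing.groups())
print(non_capturing.group(0), non_capturing.groups())
```

With a repeated capturing group, only the last repetition is kept in `groups()`, which is one more reason to use the non-capturing form when the group exists only to apply a quantifier.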
2:20
And we use a character class then
2:23
to include uppercase letters, lowercase letters, and the single quote.
2:28
And we want one or more of them,
2:30
followed by a space.
2:33
And I commented all of that on the right.
2:36
Then we close those parentheses.
2:41
Then we want exactly two of those.
2:44
So just go back to the data,
2:47
and that's basically a word.
2:50
Why don't I do just backslash w?
2:54
Turns out that was my first approach, but,
2:57
and that's the funny thing with parsing data or strings,
3:01
is that there are always these exceptions.
3:03
And Singin' has this apostrophe or single quote
3:07
and I had to account for that,
3:08
so instead of just word, I had to go with
3:12
a more specific portion of the regular expression.
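The following sketch shows why `\w` alone falls short for a word that contains an apostrophe, using "Schindler's" from the movie list as the test word.

```python
import re

word = "Schindler's"

with_w = re.fullmatch(r"\w+", word)             # \w does not include the apostrophe
with_class = re.fullmatch(r"[A-Za-z']+", word)  # the character class lists it explicitly

print(with_w)
print(with_class)
```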
3:17
So two of those because we want to match
3:19
the ones that have exactly two words.
3:23
Then we match a literal opening parenthesis,
3:27
and as with the dot, right,
3:30
all these characters have a special meaning.
3:33
Dot matches any character, parentheses are for capturing,
3:36
so if you want a literal one, you have to escape them.
3:39
So here I'm doing the same thing as with the dot,
3:41
and that's escaping the parentheses.
3:43
I want literal parentheses, because the years
3:46
are in parentheses.
3:50
Then the years are four digits,
3:52
and then I do a dollar which is the end of the string.
3:55
Phew, that was quite a regular expression,
3:58
but the nice thing about verbose is that
4:00
I could add all these comments,
4:02
which made it super easy to explain it to you.
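The transcript does not include the code itself, so the following is a reconstruction of the pattern from the line-by-line description above. The comment wording and layout are assumptions.

```python
import re

pat = re.compile(r"""
    ^                 # start of string
    \d+               # one or more digits
    \.                # a literal dot (escaped)
    \s+               # one or more spaces
    (?:               # open non-capturing parentheses
      [A-Za-z']+\s    # uppercase, lowercase or single quote, one or more, then a space
    )                 # close the non-capturing parentheses
    {2}               # exactly two of those
    \(                # literal opening parenthesis (escaped)
    \d{4}             # the year: four digits
    \)                # literal closing parenthesis (escaped)
    $                 # end of string
""", re.VERBOSE)

print(bool(pat.match("1. Citizen Kane (1941)")))
```

With `re.VERBOSE`, whitespace outside character classes and everything after an unescaped `#` is ignored, which is what makes this layout possible.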
4:06
So run that cell, and it's now stored in pat.
4:09
And pat is just a variable name,
4:11
and now I can use that pattern
4:14
to loop over the movies and match them all.
4:17
So let's do that next.
4:20
For movie in movies.
4:24
Print movie.
4:26
Just the text.
4:28
And then I can use match on the pattern.
4:30
So before we did re.match,
4:33
but now we can do pattern.match.
4:36
And I'm interested in match because I want to
4:38
match the string from the beginning, and the dollar in the pattern takes care of the end.
4:43
Put in a movie
4:45
and there you go.
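A self-contained sketch of the loop described above. The pattern is a compact one-line equivalent of the verbose one from the lecture, and the movie strings are assumed sample data built from the titles mentioned here.

```python
import re

# Compact equivalent of the verbose pattern described in the lecture.
pat = re.compile(r"^\d+\.\s+(?:[A-Za-z']+\s){2}\(\d{4}\)$")

# Assumed sample data; the numbering and layout are illustrative.
movies = [
    "1. Citizen Kane (1941)",
    "2. The Godfather (1972)",
    "3. Casablanca (1942)",
    "4. Schindler's List (1993)",
    "5. Vertigo (1958)",
    "6. The Wizard of Oz (1939)",
]

for movie in movies:
    print(movie)
    print(pat.match(movie))  # a match object, or None when there is no match

two_word_titles = [movie for movie in movies if pat.match(movie)]
```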
4:46
So let's check if this regular expression is
4:50
actually correct.
4:51
Citizen Kane, two words. Match.
4:53
The Godfather. Match.
4:55
Casablanca, one word. Not a match.
4:57
And Schindler's List, this was another tricky one.
5:00
In the first iteration, I did not match this because
5:04
again, I had to account for that single quote
5:07
which I told you about before.
5:09
So this one is actually a match
5:10
because I consider Schindler's as one word.
5:14
Vertigo is not, and The Wizard of Oz,
5:15
four words, is not.
5:17
So how cool is that?
5:19
Let's move on to advanced string replacing.