Python Regular Expressions
This article is a comprehensive introduction to regular expressions in the Python language.
1. What is a Regular Expression
If you have ever searched for a file whose name you could not remember exactly, you have probably used one or more special (wildcard) characters to stand in for the characters you had forgotten. For example, to search for all .txt files that start with b, you may have typed b*.txt in a file manager (Windows Explorer, Nautilus or Finder) or in a Linux shell or DOS window. The expression b*.txt is an example of pattern matching that is more specifically called globbing, and it matches the file names b.txt, ba.txt, bla.txt, etc.
Common globbing characters:
- ? matches any single character
- * matches any number of characters
- [...] matches any single character in the set, e.g. [a-c] matches only the characters a, b and c
Let’s see some definitions from Wikipedia:
- Pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. The match usually has to be exact: “either it will or will not be a match.”
- A regular expression, regex or regexp is a sequence of characters that define a search pattern.
- Glob patterns specify sets of filenames with wildcard characters.
So, in the file search example above, b*.txt is a search pattern that is applied to the file names of a directory and matches the file names b.txt, ba.txt and bla.txt.
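The globbing rules above can be tried without touching the file system. The following is a minimal sketch that uses the standard fnmatch module on a made-up list of file names (the names are assumptions for illustration only):

```python
import fnmatch

# Hypothetical file names, used instead of a real directory listing.
names = ["b.txt", "ba.txt", "bla.txt", "a.txt", "b.csv", "b1.txt"]

# fnmatch.filter keeps the names that match the glob pattern.
star = fnmatch.filter(names, "b*.txt")       # 'b', then anything, then '.txt'
single = fnmatch.filter(names, "b?.txt")     # exactly one character after 'b'
in_set = fnmatch.filter(names, "[a-c].txt")  # one character out of a, b, c

print(star)
print(single)
print(in_set)
```

Note that * may also match zero characters, which is why b.txt is accepted by b*.txt but not by b?.txt.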
In this article we will learn about the search patterns, or regular expressions, that are supported by Python, as well as which functions to call in order to use regexes in your Python programs.
2. Introduction to Regular Expressions in Python
Imagine that you have the task of searching for email addresses in a text, e.g.
>>> text = 'You may reach us at this email address: java.info@javacodegeeks.com. Opening hours: @9am-@17pm'
How would you search for the email address in the above text? Without regular expressions you could start with
>>> index = text.find('@')
to find the index of @, and then you would need to locate the other parts of the email address (left as an exercise to the reader; please don’t use regular expressions). Using regular expressions you would type something like:
>>> import re
>>> ans = re.search('@', text)
Much simpler, isn’t it? In the following we shall see how we can refine the regular expression string passed to search() in order to correctly identify the whole email address.
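As a quick check that the two approaches agree, the sketch below runs both on the same text and compares the position they report. It only locates the first @ and does not yet extract the address.

```python
import re

text = ('You may reach us at this email address: '
        'java.info@javacodegeeks.com. Opening hours: @9am-@17pm')

index = text.find('@')         # plain string search: index of the first '@', or -1
match = re.search('@', text)   # regex search: a Match object, or None

# Both approaches point at the same character.
print(index, match.start())
```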
As another example, let’s see how we could search for the pattern b*.txt in our current folder:
>>> import glob
>>> print(glob.glob('b*.txt'))
['b.txt', 'ba.txt', 'bla.txt']
Not that difficult, was it?
2.1 Python Regular Expression methods
The re module provides a number of pattern matching functions:
Method | Explanation |
match(regex, string) | matches the regex at the beginning of the string only; returns a Match object (with start() and end() methods to retrieve the indices) or None if there is no match |
search(regex, string) | finds the first occurrence of the regex anywhere in the string; returns a Match object (with start() and end() methods to retrieve the indices) or None if there is no match |
fullmatch(regex, string) | returns a Match object if the regex matches the string entirely, otherwise it returns None |
findall(regex, string) | finds all occurrences of the regex in the string and returns a list of matching strings |
finditer(regex, string) | returns an iterator to loop over the regex matches in the string, e.g. for m in finditer(regex, string), where m is of type Match |
sub(regex, replacement, string) | replaces all matches of regex in string with replacement and returns a new string with the replacements |
split(regex, string) | returns a list of strings that contains the parts of string between all the regex matches in the string |
compile(regex) | compiles the regex into a regular expression object that offers the same functions as methods |
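To make the table more concrete, here is a short sketch that calls each function once on a made-up sample string (the string and the patterns are assumptions chosen for illustration):

```python
import re

s = "one 1, two 22, three 333"   # sample string made up for this sketch

at_start = re.match(r"\w+", s)                        # anchored at the beginning
anywhere = re.search(r"\d+", s)                       # first match anywhere
whole = re.fullmatch(r"\d+", "12345")                 # must cover the entire string
all_numbers = re.findall(r"\d+", s)                   # list of matching strings
spans = [m.span() for m in re.finditer(r"\d+", s)]    # iterate over Match objects
masked = re.sub(r"\d+", "#", s)                       # new string with replacements
parts = re.split(r",\s*", s)                          # pieces between the matches

print(at_start.group(), anywhere.group(), whole.group())
print(all_numbers, spans)
print(masked)
print(parts)
```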
Similarly, the glob module provides the pattern matching function:
- glob(pattern) finds the files that satisfy the (globbing) pattern
In the rest of this article we shall focus on regular expressions (not on globbing).
You may test the regular expressions in this article either in your Python environment (e.g. the interactive python3 interpreter or IDLE) or on one of the following online regex sites (the list is not exhaustive and you may find a better site for your needs or taste):
- https://regex101.com/
- https://pythex.org/
- http://www.pyregex.com/
- https://www.debuggex.com/
- https://www.regextester.com/
- http://www.regexplanet.com/advanced/python/index.html
The simplest regular expression is just a text string without any special characters, i.e. a literal. Literals are the simplest form of pattern matching in regular expressions: they succeed whenever the literal is found in the text you search in. E.g. let’s search for the pattern email in the text defined above using the Python API:
>>> regex = 'email'
>>> re.match(regex, text)
>>> re.search(regex, text)
<_sre.SRE_Match object; span=(25, 30), match='email'>
>>> re.findall(regex, text)
['email']
Do you understand the results? re.match() tries to find the pattern at the beginning of text, but text doesn’t start with the word email, so it returns None (which the interpreter does not print). re.search() tries to find the pattern anywhere in text and returns a Match object saying that the pattern was found starting at position 25 and ending just before position 30 of text (numbering starts from 0). Finally, re.findall() returns a list of matching strings, and 'email' was found only once. After the above explanation, the following result should now be self-explanatory:
>>> re.findall('@', text)
['@', '@', '@']
Regular expressions can be used to substitute part of a string:
>>> regex = '@'
>>> replacement = '#'
>>> re.sub(regex, replacement, text)
'You may reach us at this email address: java.info#javacodegeeks.com. Opening hours: #9am-#17pm'
Note that text is not altered.
To improve performance, e.g. when we reuse the regular expression in a number of matches, we can compile the regex into a regular expression object:
>>> regex = re.compile('@')
>>> regex.findall(text)
['@', '@', '@']
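A compiled pattern is typically created once and then reused, for example inside a loop. The sketch below counts the @ characters in a few made-up lines. Keep in mind that the re module also caches recently used patterns internally, so the speed-up from compiling is often modest. The main benefit is a named, reusable object.

```python
import re

# Compile once, reuse many times; the pattern object has the same
# methods as the module-level functions (search, findall, sub, ...).
at_sign = re.compile('@')

lines = [   # hypothetical input lines
    'java.info@javacodegeeks.com',
    'Opening hours: @9am-@17pm',
    'no at sign here',
]
counts = [len(at_sign.findall(line)) for line in lines]
print(counts)
```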
2.2 The Match object
The Match object provides a number of methods:
- group() returns the part of the string matched by the entire regular expression
- start() returns the offset in the string of the start of the match (counting from 0)
- end() returns the offset of the character after the end of the match
- span() returns a 2-tuple of start() and end()
>>> regex = 'email'
>>> match = re.search(regex, text)
>>> match.group()
'email'
>>> match.start()
25
>>> match.end()
30
>>> match.span()
(25, 30)
A common mistake is the following:
>>> match = re.search("python", text)
>>> match.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
search() returned None because the pattern was not found. For that reason, check the result before using it, e.g. with a small helper function like this:
def findMatch(regex, text):
    match = re.search(regex, text)
    if match:
        print(match.group())
    else:
        print("Pattern not found!")
re.findall(regex, text) builds the complete list of matching strings up front. re.finditer(regex, text) instead returns an iterator that yields one Match object at a time, which avoids holding all results in memory and gives you access to the position of each match:
>>> for m in re.finditer(regex, text):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
...
25-30: email
The for-loop variable m is a Match object with the details of the current match.
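Because finditer() yields Match objects, it is a convenient way to collect positions rather than just the matched strings. A minimal sketch that records where each @ occurs in the sample text:

```python
import re

text = ('You may reach us at this email address: '
        'java.info@javacodegeeks.com. Opening hours: @9am-@17pm')

# Each item yielded by finditer is a Match object, so positions are available.
positions = [m.start() for m in re.finditer('@', text)]
print(positions)
```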
3. Meta-characters
The following characters have special meanings in regular expressions:
Meta-character | Meaning |
. | Any single character |
[ ], [^ ] | Any single character in the (character) set, or not (^ ) in the set (order doesn’t matter) |
? | Quantifier: Optional, i.e. zero or one of the preceding regular expression |
* | Quantifier: Zero or more of the preceding regular expression |
+ | Quantifier: One or more of the preceding regular expression |
| | Or |
^ | Anchor pattern to the beginning of the string (or of each line in multiline mode) |
$ | Anchor pattern to the end of the string (or of each line in multiline mode) |
( ) | Group characters |
{ } | Quantifier: Number of time(s) of the preceding regular expression. {n} means exactly n times; {n,m} means between n and m times (inclusive); {n,} or {,m} means at least n or at most m times |
\ | Escapes a meta-character, i.e. it means that the character that follows it is not a meta-character. |
So if we wish to search for the above meta-characters literally, we need to escape them with the escape meta-character (\). The backslash also introduces the non-printable characters shown in the following table.
Non-printable character | Meaning |
\n | Newline |
\r | Carriage return |
\x1b | Escape (Python has no \e shorthand) |
\t | Tab |
\v | Vertical tab |
\f | Form feed |
\uXXXX | Unicode characters, e.g. \u20AC represents the € |
Let’s see some examples:
The dot (.) matches a single character, except line break characters.
>>> regex = "gr.y"
>>> text = "gr gry grey gray gryy grrrr graaaay gr%y"
>>> re.findall(regex, text)
['grey', 'gray', 'gryy', 'gr%y']
If we tried to match any three characters followed by y, the dot would also accept spaces, so the matches are not limited to whole words. The quantifiers *, + and ? can be applied to the dot as well:
>>> regex = "...y"
>>> re.findall(regex, text)
[' gry', 'grey', 'gray', ' gry', 'aaay', 'gr%y']
>>> regex = "gr.*y"
>>> re.findall(regex, text)
['gr gry grey gray gryy grrrr graaaay gr%y']
>>> regex = "gr.+y"
>>> re.findall(regex, text)
['gr gry grey gray gryy grrrr graaaay gr%y']
>>> regex = "gr.?y"
>>> re.findall(regex, text)
['gry', 'grey', 'gray', 'gryy', 'gr%y']
.* means any single character zero or more times, .+ means any single character one or more times and .? means any single character zero or one time. You might be astonished by the results of "gr.*y" and "gr.+y", but they are correct: they simply match the whole text, because it indeed starts with gr and ends with y (the one of the word gr%y), containing all the other characters between them.
These quantifiers are called greedy, because they try to match as much as possible. You can make them non-greedy by adding an extra question mark to the quantifier, i.e. ??, *? or +?. A quantifier marked as non-greedy (or reluctant) tries to produce the smallest match possible.
>>> regex = "gr.*?y"
>>> re.findall(regex, text)
['gr gry', 'grey', 'gray', 'gry', 'grrrr graaaay', 'gr%y']
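A classic place where the difference matters is text with paired delimiters. The following sketch uses a made-up HTML-like string to contrast the greedy and the non-greedy form of the same pattern:

```python
import re

html = "<b>bold</b> and <i>italic</i>"   # made-up sample

greedy = re.findall(r"<.+>", html)   # runs from the first '<' to the last '>'
lazy = re.findall(r"<.+?>", html)    # stops at the first '>' each time

print(greedy)
print(lazy)
```

The greedy version returns the whole string as a single match, while the non-greedy version returns the four tags separately.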
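If we need to match one or more of the meta-characters literally in our text, then the escape meta-character shows its use. The r prefix makes the pattern a raw string, so that Python passes the backslash to the regex engine unchanged:
>>> text = "How is life?"
>>> regex = r"life\?"
>>> findMatch(regex, text)
life?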
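When the text to search for comes from elsewhere (user input, for example), escaping every meta-character by hand is error-prone. The standard re.escape() function does it for you. A minimal sketch with a made-up sentence:

```python
import re

text = "How is life? (Fine.)"   # made-up sample
literal = "life? (Fine.)"       # contains the meta-characters ? ( . )

unescaped = re.search(literal, text)   # None: ?, ( ) and . are read as meta-characters
pattern = re.escape(literal)           # backslash-escapes the special characters
found = re.search(pattern, text)

print(unescaped)
print(found.group())
```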
You can also split a string:
>>> text = "gr, gry, grey, gray, gryy, grrrr, graaaay, gr%y"
>>> regex = ", "
>>> re.split(regex, text)
['gr', 'gry', 'grey', 'gray', 'gryy', 'grrrr', 'graaaay', 'gr%y']
4. Character sets or classes
But what if we want to find only the correctly spelled words, i.e. only grey and gray? Character sets (or character classes) to the rescue:
>>> regex = "gr[ae]y"
>>> re.findall(regex, text)
['grey', 'gray']
A character set or class matches only one out of several characters. The order of the characters inside a character class does not matter. A hyphen (-) inside a character class specifies a range of characters, e.g. [0-9] matches a single digit between 0 and 9. You can use more than one range, e.g. [A-Za-z] matches a single letter, either uppercase or lowercase.
How would you match the email address in the first text (repeated here)?
>>> text = 'You may reach us at this email address: java.info@javacodegeeks.com. Opening hours: @9am-@17pm'
Using character classes it shouldn’t be that difficult. In our example, the part of the email address before the @ consists of uppercase or lowercase letters, i.e. [A-Za-z], and maybe a dot (.), i.e. [A-Za-z.] (no need to escape the dot inside a character set). But this matches only one character in the set. To match one or more of these characters we need (you guessed right) the + quantifier: [A-Za-z.]+. So you end up with:
>>> regex = r"[A-Za-z.]+@[A-Za-z]+\.[A-Za-z]{2,3}"
>>> re.findall(regex, text)
['java.info@javacodegeeks.com']
Element | Explanation |
[A-Za-z.] | Matches any latin letter and the dot |
+ | One or more times |
@ | matches the character @ |
[A-Za-z] | matches any latin letter |
+ | One or more times |
\. | Matches the dot |
[A-Za-z] | matches any latin letter |
{2,3} | 2 or 3 times |
Congratulations! You wrote your first actual regular expression. Note that you need to escape the dot outside a character class. To restrict a match to a specific number of characters use { }. {2,3} (written without a space) means at least 2 and at most 3 characters, as the last part of an email address is usually 2 or 3 characters long, e.g. eu or com. (Of course, nowadays there are many other domains, e.g. info or ac.uk, but this is left as an exercise to the reader.)
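The effect of the {2,3} quantifier can be checked in isolation. The sketch below applies it to a few candidate endings with fullmatch(), so that the whole candidate must consist of two or three letters (the candidate list is an assumption for illustration):

```python
import re

tld = re.compile(r"[A-Za-z]{2,3}")   # two or three letters, nothing else

candidates = ["eu", "com", "info", "x"]
results = {s: bool(tld.fullmatch(s)) for s in candidates}
print(results)
```

As the paragraph above hints, info is rejected, so a real-world pattern would need a wider range such as {2,}.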
We matched the dot in the domain by escaping it:
>>> regex = r"[A-Za-z.]+@[A-Za-z]+\.[A-Za-z]{2,3}"
If we didn’t, it would still work:
>>> regex = "[A-Za-z.]+@[A-Za-z]+.[A-Za-z]{2,3}"
>>> re.findall(regex, text)
['java.info@javacodegeeks.com']
but it would also accept strings that are not email addresses:
>>> email = "java.info@javacodegeeks~com"
>>> re.findall(regex, email)
['java.info@javacodegeeks~com']
because an unescaped dot (.) matches any character.
If you need to match characters that are not in the character set, then use the caret ^, as in [^A-Za-z], which means any character that is not a letter.
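A short sketch of a negated character set in action, using a made-up string: first listing every non-letter, then replacing each run of non-letters with a single space.

```python
import re

text = "Opening hours: 9am-17pm"   # made-up sample

non_letters = re.findall(r"[^A-Za-z]", text)       # every character that is not a letter
letters_only = re.sub(r"[^A-Za-z]+", " ", text)    # runs of non-letters become one space

print(non_letters)
print(letters_only)
```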
Since certain character classes are used often, a series of shorthand character classes are available:
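Shorthand | Character set | Matches |
\d (\D) | [0-9] ([^0-9]) | digit (non digit) |
\w (\W) | [A-Za-z0-9_] ([^A-Za-z0-9_]) | word character (non word character) |
\s (\S) | [ \t\r\n\f\v] ([^ \t\r\n\f\v]) | whitespace (non whitespace) |
\A | (anchor, consumes no character) | beginning of string |
\Z | (anchor, consumes no character) | end of string |
\b (\B) | (anchor, consumes no character) | word boundary, i.e. the position between a word character and a non word character such as a space, comma, colon or hyphen (non word boundary) |
The character sets shown are the ASCII equivalents. For str patterns Python 3 also matches the corresponding Unicode characters, unless the re.ASCII flag is used.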
So, the previous regex can be written more compactly (note that \w also admits digits and the underscore, so this version is slightly more permissive):
>>> regex = r"[\w.]+@[\w]+\.[\w]{2,3}"
>>> re.findall(regex, text)
['java.info@javacodegeeks.com']
and to avoid matching ".@javacodegeeks.com":
>>> email = ".@javacodegeeks.com"
>>> regex = r"\w[\w.]+@[\w]+\.[\w]{2,3}"
>>> re.findall(regex, email)
[]
To find the first and last word:
>>> regex = r"^\w+"
>>> findMatch(regex, text)
You
>>> regex = r"\w+$"
>>> findMatch(regex, text)
17pm
Let’s take a look at another example:
>>> text = "Hello do you want to play Othello?"
>>> regex = "[Hh]ello"
>>> re.findall(regex, text)
['Hello', 'hello']
Where does the second 'hello' come from? From 'Othello'. How can we tell Python that we wish to match whole words only?
>>> regex = r"\b[Hh]ello\b"
>>> re.findall(regex, text)
['Hello']
Please note that we define regex as a raw string (the r prefix). In a normal string literal Python would interpret \b as a backspace character, so we would have to type:
>>> regex = "\\b[Hh]ello\\b"
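The difference between the two kinds of string literal is easy to see by looking at their lengths. This sketch also shows what happens when the r prefix is forgotten: the pattern then looks for actual backspace characters and finds nothing.

```python
import re

plain = "\b"   # Python reads this as one character: backspace (ASCII 8)
raw = r"\b"    # two characters: a backslash and the letter b

text = "Hello do you want to play Othello?"
with_raw = re.findall(r"\b[Hh]ello\b", text)    # word boundaries, as intended
with_plain = re.findall("\b[Hh]ello\b", text)   # searches for backspace characters

print(len(plain), len(raw))
print(with_raw, with_plain)
```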
Python 3.4 adds a new re.fullmatch() function which returns a Match object only if the regex matches the entire string, otherwise it returns None. re.fullmatch(regex, text) behaves like re.search(r"\A(?:regex)\Z", text), where the non-capturing group keeps an alternation inside regex from splitting the anchors. If text is an empty string then fullmatch() returns a Match object (which evaluates to True) for any regex that can find a zero-length match.
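The following sketch contrasts match() and fullmatch() on the same pattern, and includes the empty-string case mentioned above (the sample strings are made up):

```python
import re

pattern = r"\d+"

prefix = re.match(pattern, "123abc")       # Match for '123': only the start must fit
partial = re.fullmatch(pattern, "123abc")  # None: 'abc' is left over
whole = re.fullmatch(pattern, "123")       # Match: the entire string fits

# A pattern that can match zero characters fully matches the empty string.
empty = re.fullmatch(r"\d*", "")

print(prefix.group(), partial, whole.group(), repr(empty.group()))
```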
Be careful when using the negated shorthands inside square brackets. E.g. [\D\S] is not the same as [^\d\s]. The latter matches any character that is neither a digit nor whitespace. The former, however, matches any character that is either not a digit, or is not whitespace. Because all digits are not whitespace, and all whitespace characters are not digits, [\D\S] matches any character: digit, whitespace, or otherwise.
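The two character sets can be compared directly on a small made-up string that contains letters, digits, a space and a tab:

```python
import re

sample = "a1 b2\t"   # letters, digits, a space and a tab

either = re.findall(r"[\D\S]", sample)     # non-digit OR non-whitespace: everything
neither = re.findall(r"[^\d\s]", sample)   # not a digit AND not whitespace

print(either)
print(neither)
```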
5. Grouping
Imagine that we wish to match only certain top-level domains (TLDs) of email addresses:
>>> email = "java.info@javacodegeeks.net"
>>> regex = r"\w[\w.]+@[\w]+\.com|net|org|edu"
>>> re.findall(regex, email)
['net']
Apparently, the | (or) meta-character doesn’t do what we want here. It has the lowest precedence of all operators, so the pattern is read as four alternatives: \w[\w.]+@[\w]+\.com, net, org or edu. We need to group the TLDs:
>>> regex = r"\w[\w.]+@[\w]+\.(com|net|org|edu)"
>>> findMatch(regex, email)
java.info@javacodegeeks.net
This can also be useful if we wish to match the name and the domain of the email address, e.g.
>>> regex = r"(\w[\w.]+)@([\w]+\.[\w]{2,3})"
>>> match = re.search(regex, email)
>>> match
<_sre.SRE_Match object; span=(0, 27), match='java.info@javacodegeeks.net'>
>>> match.group()
'java.info@javacodegeeks.net'
>>> match.groups()
('java.info', 'javacodegeeks.net')
>>> match.group(1)
'java.info'
>>> match.group(2)
'javacodegeeks.net'
match.group() or match.group(0) returns the whole match. To push the example a bit further (nested groups):
>>> regex = r"(\w[\w.]+)@([\w]+\.(com|net|org|edu))"
>>> match = re.search(regex, email)
>>> match
<_sre.SRE_Match object; span=(0, 27), match='java.info@javacodegeeks.net'>
>>> match.group()
'java.info@javacodegeeks.net'
>>> match.groups()
('java.info', 'javacodegeeks.net', 'net')
>>> match.group(1)
'java.info'
>>> match.group(2)
'javacodegeeks.net'
>>> match.group(3)
'net'
If you pay attention to the parentheses, you will see 3 groups, numbered by the position of their opening parenthesis:
(\w[\w.]+)
([\w]+\.(com|net|org|edu))
(com|net|org|edu)
The results of match.group() should now be obvious.
If you do not need the group to capture its match, use a non-capturing group with the syntax (?:regex). For example, if we don’t wish the TLD to be captured as a separate group:
>>> regex = r"(\w[\w.]+)@([\w]+\.(?:com|net|org|edu))"
>>> match = re.search(regex, email)
>>> match
<_sre.SRE_Match object; span=(0, 27), match='java.info@javacodegeeks.net'>
>>> match.group()
'java.info@javacodegeeks.net'
>>> match.groups()
('java.info', 'javacodegeeks.net')
>>> match.group(1)
'java.info'
>>> match.group(2)
'javacodegeeks.net'
>>> match.group(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group
Python was the first programming language to introduce named capturing groups. The syntax (?P<name>regex) captures the match of regex into the group name. name must be an alphanumeric sequence starting with a letter. In the replacement string of re.sub() you can reference the contents of the group with \g<name>.
>>> regex = r"(\w[\w.]+)@([\w]+\.(?P<TLD>com|net|org|edu))"
>>> match = re.search(regex, email)
>>> match.group("TLD")
'net'
Python does not allow multiple groups to use the same name. Doing so will give a regex compilation error.
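Named groups pay off when a pattern has several parts. In the sketch below every group gets a name (user and domain are names chosen here for illustration, they do not appear in the article), groupdict() returns all of them at once, and \g<domain> is used in a replacement string:

```python
import re

email = "java.info@javacodegeeks.net"
# 'user' and 'domain' are illustrative group names; 'TLD' is the one used above.
regex = r"(?P<user>\w[\w.]+)@(?P<domain>\w+\.(?P<TLD>com|net|org|edu))"

match = re.search(regex, email)
parts = match.groupdict()   # maps each group name to its captured text

# In a replacement string, \g<name> inserts the text of a named group.
masked = re.sub(regex, r"***@\g<domain>", email)

print(parts)
print(masked)
```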
As an exercise, write the regular expression of an address in the Netherlands, e.g.
text = 'George Maduroplein 1, 2584 RZ, The Hague, The Netherlands'
Make sure that you can return independent matches of the street and house number, the zip code, the city or the country.
5.1 Backreferences
Backreferences match the same text as previously matched by a capturing group. Perhaps the best known example is the regex to find duplicated words.
>>> text = "hello hello world"
>>> regex = r"(\w+) \1"
>>> re.findall(regex, text)
['hello']
In the above example we capture a group made up of one or more word characters, after which the pattern matches a space, and finally the \1 backreference must match exactly the same text as the first group (\w+). Also, note the use of a raw string to avoid doubling the backslashes:
>>> regex = "(\\w+) \\1"
Backreferences can be used with the first 99 groups. Named groups, which we saw earlier, help reduce the complexity when a regular expression contains many groups. To backreference a named group use the syntax (?P=name):
>>> regex = r"(?P<word>\w+) (?P=word)"
>>> re.findall(regex, text)
['hello']
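Finding duplicated words is usually the first step towards removing them. The sketch below is one possible way to do that with re.sub(): the backreference \1 is used both inside the pattern and in the replacement string. The word boundaries (\b) are an addition made here so that only whole words are compared.

```python
import re

text = "hello hello world world world"   # made-up sample with repeated words

# (\w+) captures a word; (?: \1\b)+ then absorbs every immediate repeat of it.
deduped = re.sub(r"\b(\w+)(?: \1\b)+", r"\1", text)
print(deduped)
```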
6. Matching modes
search(regex, string, modes) and match(regex, string, modes) accept an optional third parameter, the matching modes (called flags in the Python documentation).
Matching mode | Grouping letter | Explanation |
re.I or re.IGNORECASE | i | Ignores case |
re.S or re.DOTALL | s | makes the dot (.) match newlines |
re.M or re.MULTILINE | m | makes ^ and $ also match after and before line breaks, i.e. at the start and end of each line |
re.L or re.LOCALE | L | makes \w match all characters that are considered letters given the current locale settings (in Python 3 only allowed with bytes patterns) |
re.U or re.UNICODE | u | treats all letters from all scripts as word characters (already the default for str patterns in Python 3) |
>>> text = "gry Grey grey gray gryy grrrr graaaay"
>>> regex = "gr[ae]y"
>>> re.search(regex, text, re.I)
<_sre.SRE_Match object; span=(4, 8), match='Grey'>
Use the bitwise OR operator (|) to specify more than one matching mode, e.g. re.I | re.M.
Or you can use the grouping letter mentioned in the above table:
>>> text = "gry Grey grey gray gryy grrrr graaaay"
>>> regex = r"(?i)gr[ae]y"
>>> re.search(regex, text)
<_sre.SRE_Match object; span=(4, 8), match='Grey'>
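Combining modes is easiest to understand with multi-line input. The sketch below, on a made-up three-line string, shows the effect of adding re.M to re.I, once through the flags argument and once with the inline letters:

```python
import re

text = "Gray skies\ngrey clouds\nGREY sea"   # made-up three-line sample

# Without MULTILINE, ^ anchors only at the very start of the string.
first_only = re.findall(r"^gr[ae]y", text, re.I)
# Flags are combined with the bitwise OR operator.
every_line = re.findall(r"^gr[ae]y", text, re.I | re.M)
# The same two flags written inline at the start of the pattern.
inline = re.findall(r"(?im)^gr[ae]y", text)

print(first_only)
print(every_line)
print(inline)
```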
7. Unicode
Since version 3.3, Python provides good support for Unicode regex pattern matching. As mentioned above, the \uXXXX syntax can be used to specify a Unicode character by its code point. For example, to match one or more digits followed by €:
>>> text = 'This item costs 33€.'
>>> regex = r"\d+\u20AC"
>>> re.findall(regex, text)
['33€']
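Unicode support also affects the shorthand classes. For str patterns, \w matches letters from any script by default, and the re.ASCII flag restricts it to [A-Za-z0-9_]. A minimal sketch with a made-up sample string:

```python
import re

text = 'Москва Straße 33€'   # made-up sample with non-ASCII letters

unicode_words = re.findall(r"\w+", text)           # Unicode-aware by default
ascii_words = re.findall(r"\w+", text, re.ASCII)   # \w limited to [A-Za-z0-9_]
price = re.findall(r"\d+\u20AC", text)             # re itself understands \uXXXX

print(unicode_words)
print(ascii_words)
print(price)
```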
8. Summary
In this tutorial we provided an overview of regular expressions and saw how to use them in Python, which provides the re module for this job. We saw plenty of examples that you can use in your real-life projects. I hope that after this tutorial you will be a bit less scared of regexes. This article is by no means exhaustive. The interested reader should look at the references for more in-depth knowledge of regular expressions. To quiz yourself, what is the difference between the characters [], () and {} in regular expressions?
9. Download the source code
You can download the full source code of this article here: Python Regular Expressions
10. References
- https://www.regular-expressions.info/
- Friedl J.E.F. (2006), Mastering Regular Expressions, 3rd Ed., O’Reilly.
- Krasnov A. (2017), “Python Regular Expression Tutorial”, WebCodeGeeks.
- Lopez F. & Romero V. (2014), Mastering Python Regular Expressions, Packt.