Some Special Use Cases of Regular Expressions in Python

In this post, we showcase a selection of special applications of regular expressions in Python, aimed at solving real-world problems. As the examples show, word boundaries and lookahead assertions make matching more accurate. We also introduce a small trick for creating consistent match strings for string comparison, which is commonly needed in product or inventory management.

Use word boundaries to get accurate matching

Word boundaries (\b) are a feature that is often overlooked. They are especially handy when the regex patterns come from an external source, for instance user input or a database.

Let’s use word boundaries to match a string that contains “iPhone X”, but not “iPhone XS”:

import re

re.search(r"\biphone x\b", "An iPhone X 16GB.", flags=re.I) # Match
re.search(r"\biphone x\b", "An iPhone XS 16GB.", flags=re.I) # No match

Note that we use the r"" syntax to put the regular expression in a raw string. Otherwise, we would need two backslashes (\\b) to indicate the word boundary, because "\b" in a normal string literal is the backspace character.
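When the value from the external source is a literal search term rather than a full regex (an assumption made here), it is usually safer to escape it before wrapping it in word boundaries. The following is only a minimal sketch of that idea; the helper name contains_term is hypothetical and not part of the re module.

```python
import re

def contains_term(term: str, text: str) -> bool:
    """Return True if `term` occurs in `text` as a whole word or phrase.

    Hypothetical helper for illustration only.
    """
    # re.escape neutralises regex metacharacters in the external term,
    # so the term is matched literally.
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(contains_term("iPhone X", "An iPhone X 16GB."))   # True
print(contains_term("iPhone X", "An iPhone XS 16GB."))  # False

# Why the raw string matters: "\b" is a single backspace character,
# while r"\b" is the two-character regex token for a word boundary.
print(len("\b"), len(r"\b"))  # 1 2
```

Keep in mind that \b only behaves as expected when the term starts and ends with a word character; a term such as "C++" would need different handling.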

Negative lookahead assertion

What if we only want to match “iPhone 14”, but not “iPhone 14 Pro” or “iPhone 14 Plus”? Word boundaries alone are not enough, because the pattern then matches all three strings:

re.search(r"\biphone 14\b", "An iPhone 14.", flags=re.I) # Match
re.search(r"\biphone 14\b", "An iPhone 14 Pro.", flags=re.I) # Match
re.search(r"\biphone 14\b", "An iPhone 14 Plus.", flags=re.I) # Match

In this case, we can use the negative lookahead assertion syntax A(?!B), which matches the expression A only if it is not followed by B.

re.search(r"\biphone 14\b(?! pro| plus)", "An iPhone 14.", flags=re.I) # Match
re.search(r"\biphone 14\b(?! pro| plus)", "An iPhone 14 Pro.", flags=re.I) # No match
re.search(r"\biphone 14\b(?! pro| plus)", "An iPhone 14 Plus.", flags=re.I) # No match

Since we require that “iPhone 14” is not followed by “ Pro” or “ Plus” (note the leading spaces in the regex alternatives), only the first string matches.
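If the list of variants to exclude comes from data rather than being hard-coded, the same negative lookahead can be assembled programmatically. This is a hedged sketch: the function name mentions_base_model and its parameters are hypothetical, and the extra \b inside the lookahead is an added assumption so that a suffix like "Pro" does not also block "promo".

```python
import re
from typing import Sequence

def mentions_base_model(model: str, excluded_suffixes: Sequence[str], text: str) -> bool:
    """Return True if `text` mentions `model` not followed by an excluded suffix.

    Hypothetical helper for illustration only.
    """
    pattern = r"\b" + re.escape(model) + r"\b"
    if excluded_suffixes:
        # Each alternative starts with a space, as in the pattern above.
        alternatives = "|".join(re.escape(" " + s) for s in excluded_suffixes)
        # Assumption: the trailing \b makes the suffix match whole words only.
        pattern += r"(?!(?:" + alternatives + r")\b)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

suffixes = ["Pro", "Plus"]
print(mentions_base_model("iPhone 14", suffixes, "An iPhone 14."))       # True
print(mentions_base_model("iPhone 14", suffixes, "An iPhone 14 Pro."))   # False
print(mentions_base_model("iPhone 14", suffixes, "An iPhone 14 Plus."))  # False
```

The empty-list check matters: an empty negative lookahead, (?!), can never succeed, so the pattern would match nothing.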

Check if a string contains both letters and numbers

Let’s look at a more complex use case, this time with a positive lookahead assertion, A(?=B), which matches the expression A only if it is followed by B.

We will write a regular expression that checks whether a string contains both letters and digits, and nothing else. It looks pretty complex at first sight:

pattern = r"^(?=[a-z0-9]*[a-z])(?=[a-z0-9]*[0-9])[a-z0-9]+$"

The first thing to notice is that two positive lookahead assertions are used here. (?=[a-z0-9]*[a-z]) requires that at least one letter appears somewhere after the start of the string, and (?=[a-z0-9]*[0-9]) requires that at least one digit does. Because lookaheads do not consume any characters, both are checked from the same position, so their order does not matter.

The allowed characters are explicitly restricted to alphanumeric characters by [a-z0-9]+, and the re.I flag used below makes the letters case-insensitive.

re.match(pattern, "ABC100", flags=re.I) # Match
re.match(pattern, "ABC", flags=re.I) # No match
re.match(pattern, "100", flags=re.I) # No match
re.match(pattern, "ABC 100", flags=re.I) # No match
re.match(pattern, "ABC_100", flags=re.I) # No match

re.match is used here, rather than re.search, because it applies the pattern at the start of the string, rather than anywhere in the string. Since this pattern is already anchored with ^ and $, re.search would give the same results.
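To make the pieces of the pattern easier to read, it can be written with re.VERBOSE so that each part carries its own comment. This is just a sketch of the same regular expression; the constant name ALNUM_MIXED is made up for the example.

```python
import re

# Same pattern as above, spread over several lines with comments.
ALNUM_MIXED = re.compile(
    r"""
    ^                     # start of the string
    (?=[a-z0-9]*[a-z])    # lookahead: at least one letter further on
    (?=[a-z0-9]*[0-9])    # lookahead: at least one digit further on
    [a-z0-9]+             # consume the string: letters and digits only
    $                     # end of the string
    """,
    flags=re.IGNORECASE | re.VERBOSE,
)

for s in ["ABC100", "ABC", "100", "ABC 100", "ABC_100"]:
    print(repr(s), bool(ALNUM_MIXED.match(s)))
```

One detail worth knowing: $ also matches just before a trailing newline, so "ABC100\n" is accepted by match. If that matters, calling ALNUM_MIXED.fullmatch instead rejects it.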

However, what if we don’t want to match a whole string, but extract all the substrings that match such a pattern? In this case, we need to use re.findall together with the word boundaries as introduced above:

pattern = r"\b(?=[a-z0-9]*[a-z])(?=[a-z0-9]*[0-9])[a-z0-9]+\b"
re.findall(pattern, "ABC100 ABC 100 ABC_100 DEF200", flags=re.I) # => ['ABC100', 'DEF200']

Matching ASCII and Unicode characters

Finally, let’s use regular expressions to match ASCII and Unicode characters, respectively. In Python 3, the special sequence \w matches Unicode word characters by default for string patterns:

re.search(r"\w+", "ABC_ÅÄÖ") # Match = "ABC_ÅÄÖ"

To match ASCII characters only, we need to pass the special flag re.ASCII (or re.A):

re.search(r"\w+", "ABC_ÅÄÖ", flags=re.ASCII) # Match = "ABC_"
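The same flag also affects \d, which is easy to miss. As a small illustration (the sample characters are chosen only for this example), \d matches decimal digits from other scripts by default and only 0-9 with re.ASCII. The flag can also be written inline as (?a) at the start of the pattern.

```python
import re

# ASCII 3, ARABIC-INDIC DIGIT THREE (U+0663), FULLWIDTH DIGIT THREE (U+FF13)
text = "3 \u0663 \uff13"

unicode_digits = re.findall(r"\d", text)                # all three digits
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)  # ['3']
print(len(unicode_digits), ascii_digits)

# Inline form of the flag: (?a) at the start of the pattern.
print(re.findall(r"(?a)\w+", "ABC_\u00c5\u00c4\u00d6"))  # ['ABC_']
```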

What if we only want to keep letters and digits, and remove all non-alphanumeric characters, including underscores? This is very commonly done to create clean match strings for various matching purposes.

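Naturally, we would want to use r"[^\w\d]", which matches every character that is neither a word character nor a digit. However, \w itself includes the underscore (and the digits, so \d is redundant here), which means underscores will not be removed: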

re.sub(r"[^\w\d]", "", "ABC_ÅÄÖ 123") # => 'ABC_ÅÄÖ123'

Indeed, the space is removed, but the underscore is still there.

We could reach for a third-party library such as regex to do this job, but a small trick works with the built-in re module. We just need to add the underscore to the regex pattern as an alternative:

re.sub(r"[^\w\d]|_", "", "ABC_ÅÄÖ 123") # => 'ABCÅÄÖ123'

Now the string is cleaned properly, with all non-alphanumeric characters removed, including underscores.
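For string comparison, the cleaning step can be wrapped in a small helper. This is a minimal sketch rather than a definitive implementation: the name make_match_string is hypothetical, and the case-folding step is an added assumption that comparisons should ignore case.

```python
import re

def make_match_string(text: str) -> str:
    """Hypothetical helper: strip non-alphanumerics and ignore case."""
    # [\W_] is equivalent to the alternation [^\w\d]|_ used above:
    # any non-word character, or the underscore.
    cleaned = re.sub(r"[\W_]+", "", text)
    # Assumption: comparisons should be case-insensitive.
    return cleaned.casefold()

print(make_match_string("ABC_\u00c5\u00c4\u00d6 123"))
print(make_match_string("iPhone-14 Pro") == make_match_string("IPHONE 14_PRO"))  # True
```

Two product names that differ only in punctuation, spacing, or case then map to the same match string.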

In this post, we introduced some use cases of regular expressions in Python for solving practical problems. Word boundaries and lookahead assertions make matching more accurate, and a small trick with the underscore helps create consistent match strings for string comparison, which is commonly needed in product or inventory management.

SuperDataMiner
