Take first successful match from a batch of regexes - regex

I'm trying to extract a set of data from a string that can match one of three patterns. I have a list of compiled regexes. I want to run through them (in order) and go with the first match.
regexes = [
    compiled_regex_1,
    compiled_regex_2,
    compiled_regex_3,
]
m = None
for reg in regexes:
    m = reg.match(name)
    if m: break
if not m:
    print('ARGL NOTHING MATCHES THIS!!!')
This should work (haven't tested yet) but it's pretty fugly. Is there a better way of boiling down a loop that breaks when it succeeds or explodes when it doesn't?
There might be something specific to re that I don't know about that allows you to test multiple patterns too.

You can use the else clause of the for loop:
for reg in regexes:
    m = reg.match(name)
    if m: break
else:
    print('ARGL NOTHING MATCHES THIS!!!')
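For a runnable illustration of the for/else form, here is a small sketch. The three patterns and the describe helper are made up for the example; they are not from the question.

```python
import re

# Hypothetical patterns, tried in this order
regexes = [
    re.compile(r"(\d+)-(\d+)"),
    re.compile(r"([a-z]+)@([a-z]+)"),
    re.compile(r"(\w+)"),
]

def describe(name):
    for index, reg in enumerate(regexes):
        m = reg.match(name)
        if m:
            break
    else:
        # Runs only when the loop finished without hitting break
        return "no match"
    return "pattern %d matched %r" % (index, m.groups())

print(describe("10-20"))
print(describe("!!!"))
```

The else branch belongs to the for statement, not to the if, so it is skipped whenever break fires.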

If you just want to know whether any of the regexes match, you could use the builtin any function:
if any(reg.match(name) for reg in regexes):
    ....
However, this will not tell you which regex matched.
Alternatively you can combine multiple patterns into a single regex with |:
regex = re.compile(r"(regex1)|(regex2)|...")
This will not directly tell you which regex matched either, but you will have a match object that you can use for further information. For example, you can find out which of the regexes succeeded from the group that is not None:
>>> match = re.match("(a)|(b)|(c)|(d)", "c")
>>> match.groups()
(None, None, 'c', None)
However, this can get complicated if any of the sub-regexes have groups in them as well, since the group numbering will change.
This is probably faster than matching each regex individually since the regex engine has more scope for optimising the regex.

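Since you have a finite set in this case, you could use short-circuit evaluation:
# the parentheses let the expression span several lines;
# print() inside an expression needs Python 3 (or from __future__ import print_function)
m = (compiled_regex_1.match(name) or
     compiled_regex_2.match(name) or
     compiled_regex_3.match(name) or
     print("ARGHHHH!"))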

In Python 2.6 or better:
import itertools as it
m = next(it.ifilter(None, (r.match(name) for r in regexes)), None)
The ifilter call could be made into a genexp, but only a bit awkwardly, i.e., with the usual trick for name binding in a genexp (aka the "phantom nested for clause idiom"):
m = next((m for r in regexes for m in (r.match(name),) if m), None)
but itertools is generally preferable where applicable.
The bit needing 2.6 is the next built-in, which lets you specify a default value if the iterator is exhausted. If you have to simulate it in 2.5 or earlier,
def next(itr, deft):
    try: return itr.next()
    except StopIteration: return deft
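On Python 3, where itertools.ifilter no longer exists and the built-in filter is already lazy, the same idea might look like the sketch below. The two patterns and the first_match name are placeholders for illustration.

```python
import re

# Hypothetical compiled patterns, tried in this order
regexes = [re.compile(r"\d+"), re.compile(r"[a-z]+")]

def first_match(name):
    # filter(None, ...) lazily drops the falsy results (None), so
    # matching stops at the first pattern that succeeds
    return next(filter(None, (r.match(name) for r in regexes)), None)

m = first_match("abc123")
print(m.group() if m else "no match")
```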

I use something like Dave Kirby suggested, but add named groups to the regexps, so that I know which one matched.
regexps = {
    'first': r'...',
    'second': r'...',
}
compiled = re.compile('|'.join('(?P<%s>%s)' % item for item in regexps.iteritems()))
match = compiled.match(my_string)
print match.lastgroup
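A self-contained sketch of the named-group idea follows. It assumes a list of (name, pattern) pairs rather than a dict, so that the order of the alternatives is explicit (a plain dict does not guarantee order on older Pythons). The pattern names and the classify helper are invented for the example.

```python
import re

# Hypothetical patterns in priority order: earlier alternatives win
PATTERNS = [
    ("date", r"\d{4}-\d{2}-\d{2}"),
    ("number", r"\d+"),
    ("word", r"[A-Za-z]+"),
]
combined = re.compile("|".join("(?P<%s>%s)" % pair for pair in PATTERNS))

def classify(text):
    m = combined.match(text)
    # lastgroup is the name of the alternative that matched
    return (m.lastgroup, m.group()) if m else (None, None)

print(classify("2024-01-15"))
print(classify("42abc"))
```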

Eric is on a better track in taking the bigger picture of what the OP is aiming for, though I would use if/else. I also think that using the print function in an or expression is a little questionable. +1 to Nathon for correcting the OP to use a proper else clause.
Then my alternative:
# alternative to any builtin that returns useful result,
# the first considered True value
def first(seq):
    for item in seq:
        if item: return item
regexes = [
    compiled_regex_1,
    compiled_regex_2,
    compiled_regex_3,
]
m = first(reg.match(name) for reg in regexes)
print(m if m else 'ARGL NOTHING MATCHES THIS!!!')

Related

How to make a regex pattern which compares the frequency of a group with another

The problem is to detect whether a string contains a substring made of n copies of a followed by 2 * n copies of b, with n >= 1. This is very easy to do in code, but I want to know if there is any way to do it using regex only.
For example:
zzaabbbbzz should print Yes.
zzaazzbbbbzz should print No.
zzaaabbbbzz should print No.
I tried in python 3 like this:
pattern = re.compile(r'(a+)(b+)')
check = pattern.findall(input())
if(len(check)>0):
    for i in check:
        if(len(i[0])*2 == len(i[1])):
            print('Yes')
            break
    else:
        print('No')
else:
    print('No')
I want to know if there is any way to provide some count for a and b in regex pattern itself.
The question can be done using simple loop and manually checking each character and count a and b occurrences with O(n) complexity but I want to learn something new.
Help me to shorten this code?
While this can't be done using only regex (regex can count deterministically, but can't compare quantities) your example's readability can be improved a bit:
import re
inputs = 'zzaabbbbzz', 'zzaazzbbbbzz', 'zzaaabbbbzz'
regex = re.compile(r'.*?(a+)(b+).*')
for inp in inputs:
    match = regex.match(inp)
    if match:
        a_count = len(match.group(1))
        b_count = len(match.group(2))
        if b_count == 2 * a_count:
            print('YES')
        else:
            print('NO')
    else:
        print('NO')
Outputs
YES
NO
NO
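Note that the snippet above only inspects the first run of a's followed by b's. If every run should be checked, as in the question's own loop, one possible sketch uses finditer with any. The helper name has_double_b_run is made up here.

```python
import re

RUNS = re.compile(r"(a+)(b+)")

def has_double_b_run(text):
    # True if some maximal run of a's is directly followed by
    # exactly twice as many b's
    return any(2 * len(m.group(1)) == len(m.group(2))
               for m in RUNS.finditer(text))

for inp in ("zzaabbbbzz", "zzaazzbbbbzz", "zzaaabbbbzz", "abzzaabbbb"):
    print(inp, "YES" if has_double_b_run(inp) else "NO")
```

The regex still only finds the runs; the comparison of the two lengths happens in Python.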

How would you write a regex for input validation so certain symbols can't be repeated?

I'm attempting to write a regex to prevent certain user input in mathematical expressions. (e.g. '1+1' would be valid whereas '1++1' should be invalidated)
Acceptable characters include digits 0-9 (\d works in lieu of 0-9), + - * / ( ) and whitespace.
I've attempted to put together a regex, but I can't find anything in Python regular expression syntax that would validate (or consequently invalidate) certain characters when typed together.
(( is ok
++, --, +-, */, are not
I hope there is a simple way to do this, but I anticipate that if there isn't, I will have to write regexes for every possible combination of characters I don't want to allow together.
I've tried:
re.compile(r"[\d\s*/()+-]")
re.compile(r"[\d]\[\s]\[*]\[/]\[(]\[)]\[+]\[-]")
I expect to be able to invalidate the expression if someone were to type "1++1"
Edit: Someone suggested the below link is similar to my question...it is not :)
Validate mathematical expressions using regular expression?
Probably the way to go is to invert your logic:
abort if the regex detects any invalid combination; there are far fewer of those than valid combinations.
So e.g.:
re.compile(r"\+\+")
Also, is it possible at all to enumerate all valid terms? If the length of the term is not limited, it is impossible to enumerate all valid terms.
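As a rough sketch of the inverted check, the forbidden pairs from the question can be listed literally and escaped with re.escape, since + and * are regex metacharacters. The FORBIDDEN list and the has_forbidden_pair name are assumptions for illustration.

```python
import re

# The combinations the question lists as not allowed
FORBIDDEN = ["++", "--", "+-", "*/"]

# re.escape turns each literal pair into a safe pattern, e.g. "++" -> r"\+\+"
forbidden_re = re.compile("|".join(re.escape(pair) for pair in FORBIDDEN))

def has_forbidden_pair(expr):
    return forbidden_re.search(expr) is not None

for expr in ("1+1", "1++1", "((1))", "2*/3"):
    print(expr, "invalid" if has_forbidden_pair(expr) else "ok")
```

Extending the check is then a matter of adding more pairs to the list.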
Perhaps one option might be to check the string for the unwanted combinations:
[0-9]\s*(?:[+-][+-]|\*/)\s*[0-9]
For example
import re
pattern = r"[0-9]\s*(?:[+-][+-]|\*/)\s*[0-9]"
strings = [
    'This is test 1 -- 1',
    'This is test 2',
    'This is test 3+1',
    'This is test 4 */4'
]
for s in strings:
    res = re.search(pattern, s)
    if not res:
        print("Valid: " + s)
Result
Valid: This is test 2
Valid: This is test 3+1
Below is a snippet from my code. This is hardly the solution I was originally looking for but it does accomplish what I was trying to do. When a user curls to an api endpoint 1++1, it will return "Forbidden" based on the below regex for "math2 =...." Alternatively, it will return "OK" if a user curls 1+1. I hope I am understanding how Stack Overflow works and have formatted this properly...
# returns True if a valid expression, False if not.
def validate_expression(calc):
    # only digits, whitespace and the characters * / ( ) + - are allowed
    math = re.compile(r"^[\d\s*/()+-]+$")
    # doubled operators such as ++, --, +- and */ are rejected
    math2 = re.compile(r"[+-][+-]|\*/")
    print('2' " " 'validate_expression')
    if math.search(calc) is not None and math2.search(calc) is None:
        return True
    else:
        return False

Using python regex to find repeated values after a header

If I have a string that looks something like:
s = """
...
Random Stuff
...
HEADER
a 1
a 3
# random amount of rows
a 17
RANDOM_NEW_HEADER
a 200
a 300
...
More random stuff
...
"""
Is there a clean way to use regex (in Python) to find all instances of a \d* after HEADER, but before the pattern is broken by SOMETHING_TOTALLY_DIFFERENT? I thought about something like:
import re
pattern = r'HEADER(?:\na \d*)*\na (\d*)'
print re.findall(pattern, s)
Unfortunately, regex doesn't find overlapping matches. If there's no sensible way to do this with regex, I'm okay with anything faster than writing my own for loop to extract this data.
(TL;DR -- There's a distinct header, followed by a pattern that repeats. I want to catch each instance of that pattern, as long as there isn't a break in the repetition.)
EDIT:
To clarify, I don't necessarily know what SOMETHING_TOTALLY_DIFFERENT will be, only that it won't match a \d+. I want to collect all consecutive instances of \na \d+ that follow HEADER\n.
How about a simple loop?
import re
e = re.compile(r'(a\s+\d+)')
header = 'whatever your header field is'
breaker = 'something_different'
breaker_reached = False
header_reached = False
results = []
with open('yourfile.txt') as f:
    for line in f:
        # drop the trailing newline so the comparisons below can succeed
        line = line.strip()
        if line == header:
            # skip processing lines unless we reach the header
            header_reached = True
            continue
        if header_reached:
            i = e.match(line)
            if i and not breaker_reached:
                results.append(i.groups()[0])
            else:
                # There was no match, check if we reached the breaker
                if line == breaker:
                    breaker_reached = True
Not completely sure where you want the regex to stop; please clarify.
'((a \d*)\s){1,}'
import re
sentinel_begin = 'HEADER'
sentinel_end = 'SOMETHING_TOTALLY_DIFFERENT'
re.findall(r'(a \d*)', s[s.find(sentinel_begin): s.find(sentinel_end)])
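When the breaking line is not known in advance, one alternative sketch is to let a first regex capture the unbroken run of "a <digits>" lines right after the header, and then pull the numbers out of that block. The function name values_after_header and the sample text are illustrative only.

```python
import re

s = """
Random Stuff
HEADER
a 1
a 3
a 17
RANDOM_NEW_HEADER
a 200
a 300
More random stuff
"""

def values_after_header(text, header="HEADER"):
    # ^ with re.M anchors the header at the start of a line; the group
    # captures consecutive "a <digits>" lines (each ending in a newline)
    block = re.search(r"^" + re.escape(header) + r"\n((?:a \d+\n)+)",
                      text, re.M)
    if block is None:
        return []
    return re.findall(r"a (\d+)", block.group(1))

print(values_after_header(s))
```

The run ends at the first line that does not look like "a <digits>", whatever that line happens to be.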

Python: get the string between two capitals

I'd like your opinion, as you may be more experienced with Python than I am.
I come from C++ and I'm still not used to the Pythonic way of doing things.
I want to loop over the part of a string between 2 capital letters. For example, I could do it this way:
i = 0
str = "PythonIsFun"
for i, z in enumerate(str):
    if(z.isupper()):
        small = ''
        x = i + 1
        while(not str[x].isupper()):
            small += str[x]
I wrote this on my phone, so I don't know if it even works, but you get the idea, I presume.
I need your help to get the best result here: not just something that is easy on the CPU, but clean code too. Thank you very much.
This is one of those times when regexes are the best bet.
(And don't call a string str, by the way: it shadows the built-in function.)
import re
s = 'PythonIsFun'
result = re.search('[A-Z]([a-z]+)[A-Z]', s)
if result is not None:
    print(result.groups()[0])
You could use regular expressions:
import re
re.findall ( r'[A-Z]([^A-Z]+)[A-Z]', txt )
outputs ['ython'], and
re.findall ( r'(?=[A-Z]([^A-Z]+)[A-Z])', txt )
outputs ['ython', 's']; and if you just need the first match,
re.search ( r'[A-Z]([^A-Z]+)[A-Z]', txt ).group( 1 )
You can use a list comprehension to do this easily.
>>> s = "PythonIsFun"
>>> u = [i for i,x in enumerate(s) if x.isupper()]
>>> s[u[0]+1:u[1]]
'ython'
If you can't guarantee that there are two upper case characters you can check the length of u to make sure it is at least 2. This does iterate over the entire string, which could be a problem if the two upper case characters occur at the start of a lengthy string.
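To avoid scanning the whole string, the index search can be made lazy and cut off after two hits. This is only a sketch; the function name between_first_two_capitals is made up.

```python
from itertools import islice

def between_first_two_capitals(s):
    # The generator yields indexes of upper-case characters on demand;
    # islice stops consuming it after the first two
    idx = list(islice((i for i, ch in enumerate(s) if ch.isupper()), 2))
    if len(idx) < 2:
        return None  # fewer than two capitals in the string
    return s[idx[0] + 1:idx[1]]

print(between_first_two_capitals("PythonIsFun"))
```

It returns None when there are fewer than two capitals, which also covers the length check mentioned above.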
There are many ways to tackle this, but I'd use regular expressions.
This example will take "PythonIsFun" and return "ythonsun"
import re
text = "PythonIsFun"
pattern = re.compile(r'[a-z]') #look for all lower-case characters
matches = re.findall(pattern, text) #returns a list of lower-case characters
lower_string = ''.join(matches) #turns the list into a string
print lower_string
outputs:
ythonsun

Regular expression any character with dynamic size

I want to use a regular expression that does the following (I extracted the part where I'm in trouble in order to simplify):
any character for the first 1 to 5 characters, then an "underscore", then some digits, then an "underscore", then some digits or dots.
With a restriction on "underscore" it should give something like this:
^([^_]{1,5})_([\\d]{2,3})_([\\d\\.]*)$
But I want to allow the "_" in the first 1-5 characters as long as the rest of the regular expression still matches, for example if I had something like:
to_to_123_12.56
I think this is related to greedy matching in the regex engine; I tried using lazy quantifiers as explained here, but without success.
Any ideas?
I used the following regex and it appeared to work fine for your task. I've simply replaced your initial [^_] with a dot (.).
^.{1,5}_\d{2,3}_[\d\.]*$
It's probably best to replace your final * with + too, unless you allow nothing after the final '_'. And note your final part allows multiple '.' (I don't know if that's what you want or not).
For the record, here's a quick Python script I used to verify the regex:
import re
strs = [ "a_12_1",
         "abc_12_134",
         "abcd_123_1.",
         "abcde_12_1",
         "a_123_123.456.7890.",
         "a_12_1",
         "ab_de_12_1",
       ]
myre = r"^.{1,5}_\d{2,3}_[\d\.]+$"
for str in strs:
    m = re.match(myre, str)
    if m:
        print "Yes:",
        if m.group(0) == str:
            print "ALL",
    else:
        print "No:",
    print str
Output is:
Yes: ALL a_12_1
Yes: ALL abc_12_134
Yes: ALL abcd_123_1.
Yes: ALL abcde_12_1
Yes: ALL a_123_123.456.7890.
Yes: ALL a_12_1
Yes: ALL ab_de_12_1
^(.{1,5})_(\d{2,3})_([\d.]*)$
works for your example. The result doesn't change whether you use a lazy quantifier or not.
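A quick way to convince yourself is to run both variants side by side. The sketch below assumes the sample string from the question; because the pattern is anchored at both ends, backtracking forces the same split either way.

```python
import re

greedy = re.compile(r"^(.{1,5})_(\d{2,3})_([\d.]*)$")
lazy = re.compile(r"^(.{1,5}?)_(\d{2,3})_([\d.]*)$")

text = "to_to_123_12.56"
# Both engines must end up giving "to_to" to the first group, because
# any shorter prefix is not followed by "_" plus 2-3 digits
print(greedy.match(text).groups())
print(lazy.match(text).groups())
```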
While answering the comment (writing the lazy expression), I saw that I had made a mistake... If I simply use the following classical regex, it works:
^(.{1,5})_([\\d]{2,3})_([\\d\\.]*)$
Thank you.