how can I get all possible subgroups in python regex? - regex

I would like to get all possible subgroups during regex findall: (group(subgroup))+. Currently it only returns the last matches, for example:
>>> re.findall(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
[('C3', 'C')]
Now I have to do that in two steps:
>>> match = re.match(r'SOME_STRING_(([A-D][0-9]+)+)_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
>>> re.findall(r'([A-D])[0-9]+', match.group(1))
['A', 'B', 'C']
Is there any method can let me get the same result in a single step?

Since (([A-D])[0-9]+)+ is a repeated capturing group, it is no wonder only the last match results are returned.
You may use a PyPi regex library (that you may install by typing pip install regex in the console/terminal and pressing ENTER) and then use:
import regex
results = regex.finditer(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
print( [zip(x.captures(1),x.captures(2)) for x in results] )
# => [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]
The match.captures property keeps track of all captures.
If you can only use re, you need to first extract all your matches, and then run a second regex on them to extract the parts you need:
import re
tmp = re.findall(r'SOME_STRING_((?:[A-D][0-9]+)+)_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
results = []
for m in tmp:
results.append(re.findall(r'(([A-D])[0-9]+)', m))
print( results )
# => [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]
See the Python demo

A single-regex (and possibly single-pass-of-data) solution can be done, provided your sample code and sample data are both well-defined. The assumed premises are:
The length of SOME_STRING_ is fixed. This is based on the example data you give, where SOME_STRING_ reads a literal string and not a regex.
The data contains no [E-Z] or other exceptions in its "alphabet-digits" part. This is based on your working 2-lined solution, which should have returned an error AttributeError: 'NoneType' object has no attribute 'group' if data like SOME_STRING_A1B2Z3_OTK exists. However, the error was not reported, so I assume you did not have such data.
If the above are met, a single regex r"[0-9]+" can be used to perform a straightforward string split. All digits are discarded because the + operator is greedy according to the official documentation. The greedy match could be theoretically done with a single pass of data, so the efficiency should be satisfying if it is indeed the case. (I did not have a check on the implementation details though.)
Solution
import re
s = 'SOME_STRING_A10B20C30_OTK' # len("SOME_STRING_") = 12 is fixed
# may have multiple digits in between
re.compile(r"[0-9]+").split(s[12:])[:-1] # discard the last element
# returns ['A', 'B', 'C']

Related

Preprocess words that do not match list of words

I have a very specific case I'm trying to match: I have some text and a list of words (which may contain numbers, underscores, or ampersand), and I want to clean the text of numeric characters (for instance) unless it is a word in my list. This list is also long enough that I can't just make a regex that matches every one of the words.
I've tried to use regex to do this (i.e. doing something along the lines of re.sub(r'\d+', '', text), but trying to come up with a more complex regex to match my case. This obviously isn't quite working, as I don't think regex is meant to handle that kind of case.
I'm trying to experiment with other options like pyparsing, and tried something like the below, but this also gives me an error (probably because I'm not understanding pyparsing correctly):
from pyparsing import *
import re
phrases = ["76", "tw3nty", "potato_man", "d&"]
text = "there was once a potato_man with tw3nty cars and d& 76 different homes"
parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(lambda word: re.sub(r'\d+', '', word)))
parser.parseString(text)
What's the best way to approach this sort of matching, or are there other better suited libraries that would be worth a try?
You are very close to getting this pyparsing cleaner-upper working.
Parse actions generally get their matched tokens as a list-like structure, a pyparsing-defined class called ParseResults.
You can see what actually gets sent to your parse action by wrapping it in the pyparsing decorator traceParseAction:
parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(traceParseAction(lambda word: re.sub(r'\d+', '', word))))
Actually a little easier to read if you make your parse action a regular def'ed method instead of a lambda:
#traceParseAction
def unnumber(word):
return re.sub(r'\d+', '', word)
parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(unnumber))
traceParseAction will report what is passed to the parse action and what is returned.
>>entering unnumber(line: 'there was once a potato_man with tw3nty cars and d& 76 different homes', 0, ParseResults(['there'], {}))
<<leaving unnumber (exception: expected string or bytes-like object)
You can see that the value passed in is in a list structure, so you should replace word in your call to re.sub with word[0] (I also modified your input string to add some numbers to the unguarded words, to see the parse action in action):
text = "there was 1once a potato_man with tw3nty cars and d& 76 different99 homes"
def unnumber(word):
return re.sub(r'\d+', '', word[0])
and I get:
['there', 'was', 'once', 'a', 'potato_man', 'with', 'tw3nty', 'cars', 'and', 'd&', '76', 'different', 'homes']
Also, you use the '^' operator for your parser. You may get a little better performance if you use the '|' operator instead, since '^' (which creates an Or instance) will evaluate all paths and choose the longest - necessary in cases where there is some ambiguity in what the alternatives might match. '|' creates a MatchFirst instance, which stops once it finds a match and does not look further for any alternatives. Since your first alternative is a list of the guard words, then '|' is actually more appropriate - if one gets matched, don't look any further.

Regex matches string, but doesn't group correctly [duplicate]

While matching an email address, after I match something like yasar#webmail, I want to capture one or more of (\.\w+)(what I am doing is a little bit more complicated, this is just an example), I tried adding (.\w+)+ , but it only captures last match. For example, yasar#webmail.something.edu.tr matches but only include .tr after yasar#webmail part, so I lost .something and .edu groups. Can I do this in Python regular expressions, or would you suggest matching everything at first, and split the subpatterns later?
re module doesn't support repeated captures (regex supports it):
>>> m = regex.match(r'([.\w]+)#((\w+)(\.\w+)+)', 'yasar#webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail.something.edu.tr', 'webmail', '.tr')
>>> m.captures(4)
['.something', '.edu', '.tr']
In your case I'd go with splitting the repeated subpatterns later. It leads to a simple and readable code e.g., see the code in #Li-aung Yip's answer.
You can fix the problem of (\.\w+)+ only capturing the last match by doing this instead: ((?:\.\w+)+)
This will work:
>>> regexp = r"[\w\.]+#(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?"
>>> email_address = "william.adama#galactica.caprica.fleet.mil"
>>> m = re.match(regexp, email_address)
>>> m.groups()
('galactica', '.caprica', '.fleet', '.mil', None, None)
But it's limited to a maximum of six subgroups. A better way to do this would be:
>>> m = re.match(r"[\w\.]+#(.+)", email_address)
>>> m.groups()
('galactica.caprica.fleet.mil',)
>>> m.group(1).split('.')
['galactica', 'caprica', 'fleet', 'mil']
Note that regexps are fine so long as the email addresses are simple - but there are all kinds of things that this will break for. See this question for a detailed treatment of email address regexes.
This is what you are looking for:
>>> import re
>>> s="yasar#webmail.something.edu.tr"
>>> r=re.compile("\.\w+")
>>> m=r.findall(s)
>>> m
['.something', '.edu', '.tr']

Entire text is matched but not able to group in named groups

I have following example text:
my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10
I want to extract sub-fields from it in following way:
appName = my_app,
[
{key = key1, value = value1},
{key = user_id, value = testuser},
{key = ip_address, value = 10.10.10.10}
]
I have written following regex for doing this:
(?<appName>\w+)\|(((?<key>\w+)?(?<equals>=)(?<value>[^\|]+))\|?)+
It matches the entire text but is not able to group it correctly in named groups.
Tried testing it on https://regex101.com/
What am I missing here?
I think the main problem you have is trying to write a regex that matches ALL the key=value pairs. That's not the way to do it. The correct way is based on a pattern that matches ONLY ONE key=value, but is applied by a function that finds all accurances of the pattern. Every languages supplies such a function. Here's the code in Python for example:
import re
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pairs = re.findall(r'(\w+)=([^|]+)', txt)
print(pairs)
This gives:
[('key1', 'value1'), ('user_id', 'testuser'), ('ip_address', '10.10.10.10')]
The pattern matches a key consisting of alpha-numeric chars - (\w+) with a value. The value is designated by ([^|]+), that is everything but a vertical line, because the value can have non-alpha numeric values, such a dot in the ip address.
Mind the findall function. There's a search function to catch a pattern once, and there's a findall function to catch all the patterns within the text.
I tested it on regex101 and it worked.
I must comment, though, that the specific text pattern you work on doesn't require regex. All high level languages supply a split function. You can split by vertical line, and then each slice you get (expcept the first one) you split again by the equal sign.
Use the PyPi regex module with the following code:
import regex
s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
rx = r"(?<appName>\w+)(?:\|(?<key>\w+)=(?<value>[^|]+))+"
print( [(m.group("appName"), dict(zip(m.captures("key"),m.captures("value")))) for m in regex.finditer(rx, s)] )
# => [('my_app', {'ip_address': '10.10.10.10', 'key1': 'value1', 'user_id': 'testuser'})]
See the Python demo online.
The .captures property contains all the values captured into a group at all the iterations.
Not sure, but maybe regular expression might be unnecessary, and splitting similar to,
data='my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
x= data.split('|')
appName = []
for index,item in enumerate(x):
if index>0:
element = item.split('=')
temp = {"key":element[0],"value":element[1]}
appName.append(temp)
appName = str(x[0] + ',' + str(appName))
print(appName)
might return an output similar to the desired output:
my_app,[{'key': 'key1', 'value': 'value1'}, {'key': 'user_id', 'value': 'testuser'}, {'key': 'ip_address', 'value': '10.10.10.10'}]
using dict:
temp = {"key":element[0],"value":element[1]}
temp can be modified to other desired data structure that you like to have.

python regex for parsing filenames

I'm the worst for regex in general, but in python... I need help in fixing my regex for parsing filenames, e.g:
>>> from re import search, I, M
>>> x="/almac/data/vectors_puces_T12_C1_00_d2v_H50_corr_m10_70.mtx"
>>> for i in range(6):
... print search(r"[vectors|pairs]+_(\w+[\-\w+]*[0-9]{0,4})([_T[0-9]{2,3}_C[1-9]_[0-9]{2}]?)(_[d2v|w2v|coocc\w*|doc\w*]*)(_H[0-9]{1,4})(_[sub|co[nvs{0,2}|rr|nc]+]?)(_m[0-9]{1,3}[_[0-9]{0,3}]?)",x, M|I).group(i)
...
It gives the following output:
vectors_puces_T12_C1_00_d2v_H50_corr_m10_70
puces_T
12_C1_00
_d2v
_H50
_corr
However, what I need is
vectors_puces_T12_C1_00_d2v_H50_corr_m10_70
puces
T12_C1_00
_d2v
_H50
_corr
I don't know what exactly is wrong. Thank you
One problem is that \w would also match underscore which you want to be a delimiter between puces and T12_C1_00 in this case. Replace the \w with A-Za-z\-. Also, you should put the underscore between the appropriate saving groups:
(?:vectors|pairs)_([A-Za-z\-]+[0-9]{0,4})_([T[0-9]{2,3}_C[1-9]_[0-9]{2}]?)...
HERE^
Works for me:
>>> import re
>>> re.search(r"(?:vectors|pairs)_([A-Za-z\-]+[0-9]{0,4})_([T[0-9]{2,3}_C[1-9]_[0-9]{2}]?)(_[d2v|w2v|coocc\w*|doc\w*]*)(_H[0-9]{1,4})(_[sub|co[nvs{0,2}|rr|nc]+]?)(_m[0-9]{1,3}[_[0-9]{0,3}]?)",x, re.M|re.I).groups()
('puces', 'T12_C1_00', '_d2v', '_H50', '_corr', '_m10_70')
I've also replaced the [vectors|pairs] with (?:vectors|pairs) which is, I think, what you've actually meant - match either vectors or pairs literal strings, (?:...) is a syntax for a non-capturing group.
I'm not sure what your goal is, but you seem to be interested in what's between each underscore, so it may be simpler to split by it:
path, filename = os.path.split(x)
filename = filename.split('.')
fileparts = filename.split('_')
fileparts will then be this list:
vectors
puces
T12
C1
00
d2v
H50
corr
m10
70
And you can validate / inspect any part, e.g. if fileparts[0] == 'vectors' or tpart = fileparts[2:4]...

Take first successful match from a batch of regexes

I'm trying to extract set of data from a string that can match one of three patterns. I have a list of compiled regexes. I want to run through them (in order) and go with the first match.
regexes = [
compiled_regex_1,
compiled_regex_2,
compiled_regex_3,
]
m = None
for reg in regexes:
m = reg.match(name)
if m: break
if not m:
print 'ARGL NOTHING MATCHES THIS!!!'
This should work (haven't tested yet) but it's pretty fugly. Is there a better way of boiling down a loop that breaks when it succeeds or explodes when it doesn't?
There might be something specific to re that I don't know about that allows you to test multiple patterns too.
You can use the else clause of the for loop:
for reg in regexes:
m = reg.match(name)
if m: break
else:
print 'ARGL NOTHING MATCHES THIS!!!'
If you just want to know if any of the regex match then you could use the builtin any function:
if any(reg.match(name) for reg in regexes):
....
however this will not tell you which regex matched.
Alternatively you can combine multiple patterns into a single regex with |:
regex = re.compile(r"(regex1)|(regex2)|...")
Again this will not tell you which regex matched, but you will have a match object that you can use for further information. For example you can find out which of the regex succeeded from the group that is not None:
>>> match = re.match("(a)|(b)|(c)|(d)", "c")
>>> match.groups()
(None, None, 'c', None)
However this can get complicated however if any of the sub-regex have groups in them as well, since the numbering will be changed.
This is probably faster than matching each regex individually since the regex engine has more scope for optimising the regex.
Since you have a finite set in this case, you could use short ciruit evaluation:
m = compiled_regex_1.match(name) or
compiled_regex_2.match(name) or
compiled_regex_3.match(name) or
print("ARGHHHH!")
In Python 2.6 or better:
import itertools as it
m = next(it.ifilter(None, (r.match(name) for r in regexes)), None)
The ifilter call could be made into a genexp, but only a bit awkwardly, i.e., with the usual trick for name binding in a genexp (aka the "phantom nested for clause idiom"):
m = next((m for r in regexes for m in (r.match(name),) if m), None)
but itertools is generally preferable where applicable.
The bit needing 2.6 is the next built-in, which lets you specify a default value if the iterator is exhausted. If you have to simulate it in 2.5 or earlier,
def next(itr, deft):
try: return itr.next()
except StopIteration: return deft
I use something like Dave Kirby suggested, but add named groups to the regexps, so that I know which one matched.
regexps = {
'first': r'...',
'second': r'...',
}
compiled = re.compile('|'.join('(?P<%s>%s)' % item for item in regexps.iteritems()))
match = compiled.match(my_string)
print match.lastgroup
Eric is in better track in taking bigger picture of what OP is aiming, I would use if else though. I would also think that using print function in or expression is little questionable. +1 for Nathon of correcting OP to use proper else statement.
Then my alternative:
# alternative to any builtin that returns useful result,
# the first considered True value
def first(seq):
for item in seq:
if item: return item
regexes = [
compiled_regex_1,
compiled_regex_2,
compiled_regex_3,
]
m = first(reg.match(name) for reg in regexes)
print(m if m else 'ARGL NOTHING MATCHES THIS!!!')