Regex matches string, but doesn't group correctly [duplicate] - regex

While matching an email address, after I match something like yasar#webmail, I want to capture one or more occurrences of (\.\w+) (what I am actually doing is a little more complicated; this is just an example). I tried adding (\.\w+)+, but it only captures the last match. For example, yasar#webmail.something.edu.tr matches, but the group only contains .tr after the yasar#webmail part, so I lose the .something and .edu groups. Can I do this with Python regular expressions, or would you suggest matching everything first and splitting the subpatterns later?

The re module doesn't support repeated captures (the third-party regex module does):
>>> m = regex.match(r'([.\w]+)#((\w+)(\.\w+)+)', 'yasar#webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail.something.edu.tr', 'webmail', '.tr')
>>> m.captures(4)
['.something', '.edu', '.tr']
In your case I'd go with splitting the repeated subpatterns afterwards. It leads to simple and readable code; e.g., see the code in Li-aung Yip's answer.

You can fix the problem of (\.\w+)+ only capturing the last repetition by using ((?:\.\w+)+) instead: the whole repeated run is then captured as a single group, which you can split afterwards.
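For example, a minimal sketch of this fix, reusing the sample address from the question (plain re, no third-party module):
>>> import re
>>> m = re.match(r'([.\w]+)#(\w+)((?:\.\w+)+)', 'yasar#webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail', '.something.edu.tr')
>>> m.group(3).lstrip('.').split('.')   # the repeated part comes back as one string; split it afterwards
['something', 'edu', 'tr']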

This will work:
>>> import re
>>> regexp = r"[\w\.]+#(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?"
>>> email_address = "william.adama#galactica.caprica.fleet.mil"
>>> m = re.match(regexp, email_address)
>>> m.groups()
('galactica', '.caprica', '.fleet', '.mil', None, None)
But it's limited to a maximum of six subgroups. A better way to do this would be:
>>> m = re.match(r"[\w\.]+#(.+)", email_address)
>>> m.groups()
('galactica.caprica.fleet.mil',)
>>> m.group(1).split('.')
['galactica', 'caprica', 'fleet', 'mil']
Note that regexps are fine so long as the email addresses are simple - but there are all kinds of things that this will break for. See this question for a detailed treatment of email address regexes.

This is what you are looking for:
>>> import re
>>> s="yasar#webmail.something.edu.tr"
>>> r=re.compile(r"\.\w+")
>>> m=r.findall(s)
>>> m
['.something', '.edu', '.tr']

Related

how can I get all possible subgroups in python regex?

I would like to get all possible subgroups during regex findall: (group(subgroup))+. Currently it only returns the last matches, for example:
>>> re.findall(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
[('C3', 'C')]
Now I have to do that in two steps:
>>> match = re.match(r'SOME_STRING_(([A-D][0-9]+)+)_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
>>> re.findall(r'([A-D])[0-9]+', match.group(1))
['A', 'B', 'C']
Is there any method can let me get the same result in a single step?
Since (([A-D])[0-9]+)+ is a repeated capturing group, it is no wonder that only the results of the last repetition are returned.
You may use the PyPI regex library (install it with pip install regex in a console/terminal) and then use:
import regex
results = regex.finditer(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
print( [list(zip(x.captures(1), x.captures(2))) for x in results] )
# => [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]
The match.captures method keeps track of all the captures made by a group.
If you can only use re, you need to first extract all your matches, and then run a second regex on them to extract the parts you need:
import re
tmp = re.findall(r'SOME_STRING_((?:[A-D][0-9]+)+)_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
results = []
for m in tmp:
    results.append(re.findall(r'(([A-D])[0-9]+)', m))
print( results )
# => [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]
A single-regex (and possibly single-pass) solution can be done, provided your sample code and sample data are representative. The assumed premises are:
The length of SOME_STRING_ is fixed. This is based on the example data you give, where SOME_STRING_ is a literal string rather than a regex pattern.
The data contains no [E-Z] letters or other exceptions in its "letters-digits" part. This is based on your working two-step solution, which would have raised AttributeError: 'NoneType' object has no attribute 'group' if data like SOME_STRING_A1B2Z3_OTK existed; since no error was reported, I assume you have no such data.
If the above hold, a single regex r"[0-9]+" can be used to perform a straightforward string split. Each run of digits is consumed as one delimiter because the + quantifier is greedy (per the official documentation). The greedy match can in principle be done in a single pass over the data, so efficiency should be satisfactory, though I have not checked the implementation details.
Solution
import re
s = 'SOME_STRING_A10B20C30_OTK' # len("SOME_STRING_") = 12 is fixed
# may have multiple digits in between
re.compile(r"[0-9]+").split(s[12:])[:-1] # discard the last element
# returns ['A', 'B', 'C']

python regex for parsing filenames

I'm the worst at regex in general, but in Python... I need help fixing my regex for parsing filenames, e.g.:
>>> from re import search, I, M
>>> x="/almac/data/vectors_puces_T12_C1_00_d2v_H50_corr_m10_70.mtx"
>>> for i in range(6):
... print search(r"[vectors|pairs]+_(\w+[\-\w+]*[0-9]{0,4})([_T[0-9]{2,3}_C[1-9]_[0-9]{2}]?)(_[d2v|w2v|coocc\w*|doc\w*]*)(_H[0-9]{1,4})(_[sub|co[nvs{0,2}|rr|nc]+]?)(_m[0-9]{1,3}[_[0-9]{0,3}]?)",x, M|I).group(i)
...
It gives the following output:
vectors_puces_T12_C1_00_d2v_H50_corr_m10_70
puces_T
12_C1_00
_d2v
_H50
_corr
However, what I need is
vectors_puces_T12_C1_00_d2v_H50_corr_m10_70
puces
T12_C1_00
_d2v
_H50
_corr
I don't know what exactly is wrong. Thank you
One problem is that \w also matches the underscore, which you want to act as a delimiter between puces and T12_C1_00 in this case. Replace the \w with A-Za-z\-. Also, you should put the underscore between the appropriate capturing groups, i.e. right after the first group:
(?:vectors|pairs)_([A-Za-z\-]+[0-9]{0,4})_([T[0-9]{2,3}_C[1-9]_[0-9]{2}]?)...
Works for me:
>>> import re
>>> re.search(r"(?:vectors|pairs)_([A-Za-z\-]+[0-9]{0,4})_([T[0-9]{2,3}_C[1-9]_[0-9]{2}]?)(_[d2v|w2v|coocc\w*|doc\w*]*)(_H[0-9]{1,4})(_[sub|co[nvs{0,2}|rr|nc]+]?)(_m[0-9]{1,3}[_[0-9]{0,3}]?)",x, re.M|re.I).groups()
('puces', 'T12_C1_00', '_d2v', '_H50', '_corr', '_m10_70')
I've also replaced the [vectors|pairs] with (?:vectors|pairs) which is, I think, what you've actually meant - match either vectors or pairs literal strings, (?:...) is a syntax for a non-capturing group.
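To illustrate the difference (a small sketch with made-up input, not the question's data): [vectors|pairs] is a character class that matches any single one of the listed characters, including |, whereas (?:vectors|pairs) matches one of the two literal words.
>>> import re
>>> re.match(r'[vectors|pairs]+', 'sceptre_x').group()   # character class: unintended match
'sceptre'
>>> re.match(r'(?:vectors|pairs)', 'sceptre_x')          # alternation group: no match, returns None
>>> re.match(r'(?:vectors|pairs)', 'pairs_x').group()
'pairs'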
I'm not sure what your goal is, but you seem to be interested in what's between each underscore, so it may be simpler to split by it:
import os
path, filename = os.path.split(x)
filename, ext = os.path.splitext(filename)  # drop the ".mtx" extension
fileparts = filename.split('_')
fileparts will then be this list:
vectors
puces
T12
C1
00
d2v
H50
corr
m10
70
And you can validate or inspect any part, e.g. check if fileparts[0] == 'vectors' or take tpart = fileparts[2:4], and so on.
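For instance, a hedged sketch of that kind of validation (the names corpus and t_c_block are just illustrative, not from the question):
import os

x = "/almac/data/vectors_puces_T12_C1_00_d2v_H50_corr_m10_70.mtx"
fileparts = os.path.splitext(os.path.basename(x))[0].split('_')

if fileparts[0] not in ('vectors', 'pairs'):
    raise ValueError("unexpected file kind: %s" % fileparts[0])

corpus = fileparts[1]        # 'puces'
t_c_block = fileparts[2:5]   # ['T12', 'C1', '00']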

Trim string after 5 of the same chars are found

Say I have the string AAAGCTTACGAAAAAAACGTA and I would like to remove anything after and including the occurrence of 4 As, regardless of where it occurs in the string. So for this example we are left with AAAGCTTACG after trimming. What would be a fast and efficient way to go about this?
You can use str.split():
>>> s = "AAAGCTTACGAAAAAAACGTA"
>>> s.split("AAAA", 1)[0]
'AAAGCTTACG'
You could use a greedy match and replace with nothing.
import re
new_string = re.sub(r'AAAA.*', '', original_string)
Alternatively, AAAA can also be expressed as A{4} if you find it more readable.
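For instance (a minimal sketch using the sample string from the question):
import re

original_string = "AAAGCTTACGAAAAAAACGTA"
new_string = re.sub(r'A{4}.*', '', original_string)
print(new_string)  # AAAGCTTACG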
Just find those AAAA if any, and slice:
>>> s = "AAAGCTTACGAAAAAAACGTA"
>>> s[:s.find("AAAA")]
'AAAGCTTACG'
However, with this approach you should first check whether the string actually contains AAAA; otherwise find returns -1 and the slice chops off the last character.
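A guarded version of the same idea might look like this (a small sketch, nothing beyond what the answer describes):
>>> s = "AAAGCTTACGAAAAAAACGTA"
>>> idx = s.find("AAAA")
>>> s[:idx] if idx != -1 else s   # leave the string untouched when "AAAA" is absent
'AAAGCTTACG'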

regex : all strings containing "tomcat/logs"

I want to know how to match all strings containing tomcat/logs?
For example : /home/tomcat/logs, /etc/tomcat/logs, /home/folder/tomcat/logs
Thanks.
Edit:
I'm using this to exclude backup directories; I just need a regular expression independent of any specific language.
You can do something like this (this is in Python):
>>> import re
>>> string_to_find_in = '/home/tomcat/logs'
>>> m = re.search(r'(.*tomcat/logs)', string_to_find_in)
>>> m.group(0)
'/home/tomcat/logs'
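Since the goal is to exclude backup directories, one way to apply it is to filter a list of paths (a minimal Python sketch; the /var/backups path is made up for contrast):
import re

paths = ['/home/tomcat/logs', '/etc/tomcat/logs', '/home/folder/tomcat/logs', '/var/backups']
pattern = re.compile(r'tomcat/logs')
matching = [p for p in paths if pattern.search(p)]
print(matching)
# ['/home/tomcat/logs', '/etc/tomcat/logs', '/home/folder/tomcat/logs']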

Take first successful match from a batch of regexes

I'm trying to extract a set of data from a string that can match one of three patterns. I have a list of compiled regexes. I want to run through them (in order) and go with the first match.
regexes = [
    compiled_regex_1,
    compiled_regex_2,
    compiled_regex_3,
]
m = None
for reg in regexes:
    m = reg.match(name)
    if m: break
if not m:
    print 'ARGL NOTHING MATCHES THIS!!!'
This should work (haven't tested yet) but it's pretty fugly. Is there a better way of boiling down a loop that breaks when it succeeds or explodes when it doesn't?
There might be something specific to re that I don't know about that allows you to test multiple patterns too.
You can use the else clause of the for loop:
for reg in regexes:
    m = reg.match(name)
    if m: break
else:
    print 'ARGL NOTHING MATCHES THIS!!!'
If you just want to know whether any of the regexes match, then you could use the built-in any function:
if any(reg.match(name) for reg in regexes):
....
However, this will not tell you which regex matched.
Alternatively you can combine multiple patterns into a single regex with |:
regex = re.compile(r"(regex1)|(regex2)|...")
Again, this will not tell you which regex matched, but you will have a match object that you can use for further information. For example, you can find out which of the regexes succeeded from the group that is not None:
>>> match = re.match("(a)|(b)|(c)|(d)", "c")
>>> match.groups()
(None, None, 'c', None)
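If all you need is that index, the match object's lastindex attribute gives it directly (a small aside, assuming the same toy pattern as above):
>>> match.lastindex
3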
However, this can get complicated if any of the sub-regexes have groups in them as well, since the numbering will change.
This is probably faster than matching each regex individually since the regex engine has more scope for optimising the regex.
Since you have a finite set in this case, you could use short-circuit evaluation:
m = (compiled_regex_1.match(name) or
     compiled_regex_2.match(name) or
     compiled_regex_3.match(name) or
     print("ARGHHHH!"))
In Python 2.6 or better:
import itertools as it
m = next(it.ifilter(None, (r.match(name) for r in regexes)), None)
The ifilter call could be made into a genexp, but only a bit awkwardly, i.e., with the usual trick for name binding in a genexp (aka the "phantom nested for clause idiom"):
m = next((m for r in regexes for m in (r.match(name),) if m), None)
but itertools is generally preferable where applicable.
The bit needing 2.6 is the next built-in, which lets you specify a default value if the iterator is exhausted. If you have to simulate it in 2.5 or earlier,
def next(itr, deft):
    try: return itr.next()
    except StopIteration: return deft
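In Python 3 the same idea needs no itertools, since filter is already lazy there; a small sketch assuming the regexes and name variables from above:
# Python 3 spelling: filter is lazy and next accepts a default
m = next(filter(None, (r.match(name) for r in regexes)), None)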
I use something like Dave Kirby suggested, but add named groups to the regexps, so that I know which one matched.
regexps = {
    'first': r'...',
    'second': r'...',
}
compiled = re.compile('|'.join('(?P<%s>%s)' % item for item in regexps.iteritems()))
match = compiled.match(my_string)
print match.lastgroup
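A self-contained illustration of what lastgroup returns (the patterns and names here are made up; .items() is the Python 3 spelling of .iteritems()):
import re

regexps = {
    'number': r'[0-9]+',
    'word': r'[A-Za-z]+',
}
compiled = re.compile('|'.join('(?P<%s>%s)' % item for item in regexps.items()))

print(compiled.match('hello').lastgroup)   # word
print(compiled.match('1234').lastgroup)    # number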
Eric is on a better track in taking the bigger picture of what the OP is aiming at, though I would use if/else. I also think that using the print function inside an or expression is a little questionable. +1 to Nathon for correcting the OP to use a proper else statement.
Then my alternative:
# alternative to the any() builtin that returns a useful result:
# the first value that is considered true
def first(seq):
    for item in seq:
        if item: return item
regexes = [
    compiled_regex_1,
    compiled_regex_2,
    compiled_regex_3,
]
m = first(reg.match(name) for reg in regexes)
print(m if m else 'ARGL NOTHING MATCHES THIS!!!')