python regex for parsing filenames - regex

I'm the worst for regex in general, but in python... I need help in fixing my regex for parsing filenames, e.g:
>>> from re import search, I, M
>>> x="/almac/data/vectors_puces_T12_C1_00_d2v_H50_corr_m10_70.mtx"
>>> for i in range(6):
... print search(r"[vectors|pairs]+_(\w+[\-\w+]*[0-9]{0,4})([_T[0-9]{2,3}_C[1-9]_[0-9]{2}]?)(_[d2v|w2v|coocc\w*|doc\w*]*)(_H[0-9]{1,4})(_[sub|co[nvs{0,2}|rr|nc]+]?)(_m[0-9]{1,3}[_[0-9]{0,3}]?)",x, M|I).group(i)
...
It gives the following output:
vectors_puces_T12_C1_00_d2v_H50_corr_m10_70
puces_T
12_C1_00
_d2v
_H50
_corr
However, what I need is
vectors_puces_T12_C1_00_d2v_H50_corr_m10_70
puces
T12_C1_00
_d2v
_H50
_corr
I don't know what exactly is wrong. Thank you

One problem is that \w would also match underscore which you want to be a delimiter between puces and T12_C1_00 in this case. Replace the \w with A-Za-z\-. Also, you should put the underscore between the appropriate saving groups:
(?:vectors|pairs)_([A-Za-z\-]+[0-9]{0,4})_([T[0-9]{2,3}_C[1-9]_[0-9]{2}]?)...
HERE^
Works for me:
>>> import re
>>> re.search(r"(?:vectors|pairs)_([A-Za-z\-]+[0-9]{0,4})_([T[0-9]{2,3}_C[1-9]_[0-9]{2}]?)(_[d2v|w2v|coocc\w*|doc\w*]*)(_H[0-9]{1,4})(_[sub|co[nvs{0,2}|rr|nc]+]?)(_m[0-9]{1,3}[_[0-9]{0,3}]?)",x, re.M|re.I).groups()
('puces', 'T12_C1_00', '_d2v', '_H50', '_corr', '_m10_70')
I've also replaced the [vectors|pairs] with (?:vectors|pairs) which is, I think, what you've actually meant - match either vectors or pairs literal strings, (?:...) is a syntax for a non-capturing group.

I'm not sure what your goal is, but you seem to be interested in what's between each underscore, so it may be simpler to split by it:
path, filename = os.path.split(x)
filename = filename.split('.')
fileparts = filename.split('_')
fileparts will then be this list:
vectors
puces
T12
C1
00
d2v
H50
corr
m10
70
And you can validate / inspect any part, e.g. if fileparts[0] == 'vectors' or tpart = fileparts[2:4]...

Related

Regex matches string, but doesn't group correctly [duplicate]

While matching an email address, after I match something like yasar#webmail, I want to capture one or more of (\.\w+)(what I am doing is a little bit more complicated, this is just an example), I tried adding (.\w+)+ , but it only captures last match. For example, yasar#webmail.something.edu.tr matches but only include .tr after yasar#webmail part, so I lost .something and .edu groups. Can I do this in Python regular expressions, or would you suggest matching everything at first, and split the subpatterns later?
re module doesn't support repeated captures (regex supports it):
>>> m = regex.match(r'([.\w]+)#((\w+)(\.\w+)+)', 'yasar#webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail.something.edu.tr', 'webmail', '.tr')
>>> m.captures(4)
['.something', '.edu', '.tr']
In your case I'd go with splitting the repeated subpatterns later. It leads to a simple and readable code e.g., see the code in #Li-aung Yip's answer.
You can fix the problem of (\.\w+)+ only capturing the last match by doing this instead: ((?:\.\w+)+)
This will work:
>>> regexp = r"[\w\.]+#(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?"
>>> email_address = "william.adama#galactica.caprica.fleet.mil"
>>> m = re.match(regexp, email_address)
>>> m.groups()
('galactica', '.caprica', '.fleet', '.mil', None, None)
But it's limited to a maximum of six subgroups. A better way to do this would be:
>>> m = re.match(r"[\w\.]+#(.+)", email_address)
>>> m.groups()
('galactica.caprica.fleet.mil',)
>>> m.group(1).split('.')
['galactica', 'caprica', 'fleet', 'mil']
Note that regexps are fine so long as the email addresses are simple - but there are all kinds of things that this will break for. See this question for a detailed treatment of email address regexes.
This is what you are looking for:
>>> import re
>>> s="yasar#webmail.something.edu.tr"
>>> r=re.compile("\.\w+")
>>> m=r.findall(s)
>>> m
['.something', '.edu', '.tr']

how can I get all possible subgroups in python regex?

I would like to get all possible subgroups during regex findall: (group(subgroup))+. Currently it only returns the last matches, for example:
>>> re.findall(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
[('C3', 'C')]
Now I have to do that in two steps:
>>> match = re.match(r'SOME_STRING_(([A-D][0-9]+)+)_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
>>> re.findall(r'([A-D])[0-9]+', match.group(1))
['A', 'B', 'C']
Is there any method can let me get the same result in a single step?
Since (([A-D])[0-9]+)+ is a repeated capturing group, it is no wonder only the last match results are returned.
You may use a PyPi regex library (that you may install by typing pip install regex in the console/terminal and pressing ENTER) and then use:
import regex
results = regex.finditer(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
print( [zip(x.captures(1),x.captures(2)) for x in results] )
# => [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]
The match.captures property keeps track of all captures.
If you can only use re, you need to first extract all your matches, and then run a second regex on them to extract the parts you need:
import re
tmp = re.findall(r'SOME_STRING_((?:[A-D][0-9]+)+)_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
results = []
for m in tmp:
results.append(re.findall(r'(([A-D])[0-9]+)', m))
print( results )
# => [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]
See the Python demo
A single-regex (and possibly single-pass-of-data) solution can be done, provided your sample code and sample data are both well-defined. The assumed premises are:
The length of SOME_STRING_ is fixed. This is based on the example data you give, where SOME_STRING_ reads a literal string and not a regex.
The data contains no [E-Z] or other exceptions in its "alphabet-digits" part. This is based on your working 2-lined solution, which should have returned an error AttributeError: 'NoneType' object has no attribute 'group' if data like SOME_STRING_A1B2Z3_OTK exists. However, the error was not reported, so I assume you did not have such data.
If the above are met, a single regex r"[0-9]+" can be used to perform a straightforward string split. All digits are discarded because the + operator is greedy according to the official documentation. The greedy match could be theoretically done with a single pass of data, so the efficiency should be satisfying if it is indeed the case. (I did not have a check on the implementation details though.)
Solution
import re
s = 'SOME_STRING_A10B20C30_OTK' # len("SOME_STRING_") = 12 is fixed
# may have multiple digits in between
re.compile(r"[0-9]+").split(s[12:])[:-1] # discard the last element
# returns ['A', 'B', 'C']

Removing dot in front of digits using regular expression in pandas

I need to remove the dot in front of digits using regular expressions in pandas.
What I have: .9/10 .8/10
What I want: 9/10 8/10
I need to use df.col.str.extract().
Also beware because there are also float numbers 11.25/10, and in those cases I want to keep the dot.
I think this works on the small example you provided (Next time provide more data)
import re
re.sub(r' $', '', re.sub(r'|^.', '', re.sub(r', .', ', ', '.9/10, .8/10 ')))
'9/10, 8/10'
using a sample df as you didn't provide one (make sure you provide a sample dataset and your expected outcome in future for others to help)
df = pd.DataFrame ({'Data' : '20.01/10.'},index=[0])
print(df)
Data
0 20.01/10.
df['Data'] = df['Data'].str.replace('\.$','')
print(df)
Data
0 20.01/10
Explanation
In regex, the $ special character "[matches] the end of the string or just before the newline at the end of the string"
assuming you only need to remove the . from the end you could use the pattern above.
If you need to remove from a non digit char then use
"\.(?!\d)"

Trim string after 5 of the same chars are found

Say I have a the string AAAGCTTACGAAAAAAACGTA and I would like to remove anything after and including the occurrence of 4 As, regardless of where it occurs in the string. So for this example we are left with AAAGCTTACG after trimming. What would be a fast and efficient way to go about this?
You can use str.split():
>>> s = "AAAGCTTACGAAAAAAACGTA"
>>> s.split("AAAA", 1)[0]
'AAAGCTTACG'
You could use a greedy match and replace with nothing.
import re
new_string = re.sub(r'AAAA.*', '', original_string)
Alternatively, AAAA can also be expressed as A{4} if you find it more readable.
Just find those AAAA if any, and slice:
>>> s = "AAAGCTTACGAAAAAAACGTA"
>>> s[:s.find("AAAA")]
'AAAGCTTACG'
However, this way you should first check whether the string contains AAAA, otherwise it will slice away the last character.

Python: get the string between two capitals

I'd like your opinion as you might be more experienced on Python as I do.
I came from C++ and I'm still not used to the Pythonic way to do things.
I want to loop under a string, between 2 capital letters. For example, I could do that this way:
i = 0
str = "PythonIsFun"
for i, z in enumerate(str):
if(z.isupper()):
small = ''
x = i + 1
while(not str[x].isupper()):
small += str[x]
I wrote this on my phone, so I don't know if this even works but you caught the idea, I presume.
I need you to help me get the best results on this, not just in a non-forced way to the cpu but clean code too. Thank you very much
This is one of those times when regexes are the best bet.
(And don't call a string str, by the way: it shadows the built-in function.)
s = 'PythonIsFun'
result = re.search('[A-Z]([a-z]+)[A-Z]', s)
if result is not None:
print result.groups()[0]
you could use regular expressions:
import re
re.findall ( r'[A-Z]([^A-Z]+)[A-Z]', txt )
outputs ['ython'], and
re.findall ( r'(?=[A-Z]([^A-Z]+)[A-Z])', txt )
outputs ['ython', 's']; and if you just need the first match,
re.search ( r'[A-Z]([^A-Z]+)[A-Z]', txt ).group( 1 )
You can use a list comprehension to do this easily.
>>> s = "PythonIsFun"
>>> u = [i for i,x in enumerate(s) if x.isupper()]
>>> s[u[0]+1:u[1]]
'ython'
If you can't guarantee that there are two upper case characters you can check the length of u to make sure it is at least 2. This does iterate over the entire string, which could be a problem if the two upper case characters occur at the start of a lengthy string.
There are many ways to tackle this, but I'd use regular expressions.
This example will take "PythonIsFun" and return "ythonsun"
import re
text = "PythonIsFun"
pattern = re.compile(r'[a-z]') #look for all lower-case characters
matches = re.findall(pattern, text) #returns a list of lower-chase characters
lower_string = ''.join(matches) #turns the list into a string
print lower_string
outputs:
ythonsun