I'm bad at regex in general, and even worse in Python. I need help fixing my regex for parsing filenames, e.g.:
>>> from re import search, I, M
>>> x="/almac/data/vectors_puces_T12_C1_00_d2v_H50_corr_m10_70.mtx"
>>> for i in range(6):
...     print(search(r"[vectors|pairs]+_(\w+[\-\w+]*[0-9]{0,4})([_T[0-9]{2,3}_C[1-9]_[0-9]{2}]?)(_[d2v|w2v|coocc\w*|doc\w*]*)(_H[0-9]{1,4})(_[sub|co[nvs{0,2}|rr|nc]+]?)(_m[0-9]{1,3}[_[0-9]{0,3}]?)", x, M|I).group(i))
...
It gives the following output:
vectors_puces_T12_C1_00_d2v_H50_corr_m10_70
puces_T
12_C1_00
_d2v
_H50
_corr
However, what I need is
vectors_puces_T12_C1_00_d2v_H50_corr_m10_70
puces
T12_C1_00
_d2v
_H50
_corr
I don't know what exactly is wrong. Thank you
One problem is that \w also matches the underscore, which you want to act as the delimiter between puces and T12_C1_00 in this case. Replace the \w with A-Za-z\- inside the character class. Also, you should put the underscore between the appropriate capturing groups:
(?:vectors|pairs)_([A-Za-z\-]+[0-9]{0,4})_([T[0-9]{2,3}_C[1-9]_[0-9]{2}]?)...
                                     HERE^
Works for me:
>>> import re
>>> re.search(r"(?:vectors|pairs)_([A-Za-z\-]+[0-9]{0,4})_([T[0-9]{2,3}_C[1-9]_[0-9]{2}]?)(_[d2v|w2v|coocc\w*|doc\w*]*)(_H[0-9]{1,4})(_[sub|co[nvs{0,2}|rr|nc]+]?)(_m[0-9]{1,3}[_[0-9]{0,3}]?)",x, re.M|re.I).groups()
('puces', 'T12_C1_00', '_d2v', '_H50', '_corr', '_m10_70')
I've also replaced [vectors|pairs] with (?:vectors|pairs), which is, I think, what you actually meant: match either of the literal strings vectors or pairs. (?:...) is the syntax for a non-capturing group.
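The remaining bracketed parts of the pattern (for example [d2v|w2v|coocc\w*|doc\w*]) are still character classes, so they match sets of single characters rather than whole words, and the pattern works partly by accident. A minimal sketch of a stricter version, with every alternation written as a non-capturing group, might look like this. The name FILENAME_RE and the list of alternatives in the fifth group (sub, conv, corr, conc) are guesses at what the original pattern intended, and the T.._C.._.. part is treated as mandatory here.

```python
import re

# Hypothetical cleaned-up pattern; the alternatives in group 5 are assumed,
# since the original pattern does not make the intended words clear.
FILENAME_RE = re.compile(
    r"(?:vectors|pairs)_"                     # literal prefix, not captured
    r"([A-Za-z\-]+[0-9]{0,4})_"               # e.g. "puces"
    r"(T[0-9]{2,3}_C[1-9]_[0-9]{2})"          # e.g. "T12_C1_00"
    r"(_(?:d2v|w2v|coocc[a-z]*|doc[a-z]*))"   # e.g. "_d2v"
    r"(_H[0-9]{1,4})"                         # e.g. "_H50"
    r"(_(?:sub|conv|corr|conc))"              # e.g. "_corr" (assumed list)
    r"(_m[0-9]{1,3}(?:_[0-9]{1,3})?)",        # e.g. "_m10_70"
    re.I,
)

x = "/almac/data/vectors_puces_T12_C1_00_d2v_H50_corr_m10_70.mtx"
match = FILENAME_RE.search(x)
groups = match.groups() if match else ()
print(groups)
# ('puces', 'T12_C1_00', '_d2v', '_H50', '_corr', '_m10_70')
```

Splitting the pattern across several commented string literals keeps each field readable and makes it easier to adjust one field without disturbing the others.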
I'm not sure what your goal is, but you seem to be interested in what's between each underscore, so it may be simpler to split by it:
import os
path, filename = os.path.split(x)
filename = filename.split('.')[0]  # keep only the part before the extension
fileparts = filename.split('_')
fileparts will then be this list:
vectors
puces
T12
C1
00
d2v
H50
corr
m10
70
And you can validate / inspect any part, e.g. if fileparts[0] == 'vectors' or tpart = fileparts[2:4]...
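As a rough sketch of the split-based approach, the steps can be wrapped in a small helper. The function name split_filename and the two checks at the end are illustrative only; which parts need validating depends on your naming scheme.

```python
import os

def split_filename(path):
    """Return the underscore-separated parts of a file's base name."""
    base = os.path.basename(path)        # "vectors_..._m10_70.mtx"
    stem = os.path.splitext(base)[0]     # drop the ".mtx" extension
    return stem.split("_")

x = "/almac/data/vectors_puces_T12_C1_00_d2v_H50_corr_m10_70.mtx"
fileparts = split_filename(x)

# Illustrative checks on individual parts.
is_vectors = fileparts[0] == "vectors"
tpart = "_".join(fileparts[2:5])         # rejoin the three T/C/number parts
print(fileparts)
print(is_vectors, tpart)
```

Note that os.path.splitext only removes the final extension, so a name containing other dots would keep them in the stem.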
While matching an email address, after I match something like yasar@webmail, I want to capture one or more occurrences of (\.\w+) (what I am doing is a little more complicated; this is just an example). I tried adding (\.\w+)+, but it only captures the last match. For example, yasar@webmail.something.edu.tr matches, but the group only contains .tr, so I lose the .something and .edu groups. Can I do this with Python regular expressions, or would you suggest matching everything first and splitting the subpatterns later?
The re module doesn't support repeated captures (the third-party regex module does):
>>> import regex
>>> m = regex.match(r'([.\w]+)@((\w+)(\.\w+)+)', 'yasar@webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail.something.edu.tr', 'webmail', '.tr')
>>> m.captures(4)
['.something', '.edu', '.tr']
In your case I'd go with splitting the repeated subpatterns later. It leads to simple and readable code; see, e.g., the code in @Li-aung Yip's answer.
You can fix the problem of (\.\w+)+ only capturing the last match by doing this instead: ((?:\.\w+)+)
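A small sketch of that idea: the outer group captures the whole repeated run, which can then be split into its pieces. The variable names and the sample address are only for illustration.

```python
import re

address = "yasar@webmail.something.edu.tr"

# The inner group is non-capturing; the outer group captures every repetition.
m = re.match(r"[\w.]+@(\w+)((?:\.\w+)+)", address)
host, tail = m.groups()

# tail starts with a dot, so strip it before splitting.
parts = tail.lstrip(".").split(".")
print(host, tail, parts)
# webmail .something.edu.tr ['something', 'edu', 'tr']
```

This stays within the standard re module, at the cost of one extra split after the match.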
This will work:
>>> import re
>>> regexp = r"[\w\.]+@(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?"
>>> email_address = "william.adama@galactica.caprica.fleet.mil"
>>> m = re.match(regexp, email_address)
>>> m.groups()
('galactica', '.caprica', '.fleet', '.mil', None, None)
But it's limited to a maximum of six subgroups. A better way to do this would be:
>>> m = re.match(r"[\w\.]+@(.+)", email_address)
>>> m.groups()
('galactica.caprica.fleet.mil',)
>>> m.group(1).split('.')
['galactica', 'caprica', 'fleet', 'mil']
Note that regexps are fine so long as the email addresses are simple, but there are all kinds of valid addresses that this will break on. See this question for a detailed treatment of email address regexes.
This is what you are looking for:
>>> import re
>>> s = "yasar@webmail.something.edu.tr"
>>> r = re.compile(r"\.\w+")
>>> m=r.findall(s)
>>> m
['.something', '.edu', '.tr']
I would like to get all possible subgroups during regex findall: (group(subgroup))+. Currently it only returns the last matches, for example:
>>> re.findall(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
[('C3', 'C')]
Now I have to do that in two steps:
>>> match = re.match(r'SOME_STRING_(([A-D][0-9]+)+)_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
>>> re.findall(r'([A-D])[0-9]+', match.group(1))
['A', 'B', 'C']
Is there a way to get the same result in a single step?
Since (([A-D])[0-9]+)+ is a repeated capturing group, it is no wonder only the last match results are returned.
You may use the PyPI regex library (which you can install by typing pip install regex in the console/terminal and pressing ENTER) and then use:
import regex
results = regex.finditer(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
print( [list(zip(x.captures(1), x.captures(2))) for x in results] )
# => [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]
The match.captures() method returns all captures of a group, not just the last one.
If you can only use re, you need to first extract all your matches, and then run a second regex on them to extract the parts you need:
import re
tmp = re.findall(r'SOME_STRING_((?:[A-D][0-9]+)+)_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
results = []
for m in tmp:
    results.append(re.findall(r'(([A-D])[0-9]+)', m))
print( results )
# => [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]
See the Python demo
A single-regex (and possibly single-pass-of-data) solution can be done, provided your sample code and sample data are both well-defined. The assumed premises are:
The length of SOME_STRING_ is fixed. This is based on the example data you give, where SOME_STRING_ reads a literal string and not a regex.
The data contains no [E-Z] or other exceptions in its "alphabet-digits" part. This is based on your working two-line solution, which would have raised AttributeError: 'NoneType' object has no attribute 'group' if data like SOME_STRING_A1B2Z3_OTK existed. No such error was reported, so I assume you did not have such data.
If the above are met, a single regex, r"[0-9]+", can be used to perform a straightforward string split. Each run of digits is consumed whole because the + quantifier is greedy, as the official documentation states. In theory, greedy matching can be done in a single pass over the data, so efficiency should be satisfactory if that is indeed the case. (I have not checked the implementation details, though.)
Solution
import re
s = 'SOME_STRING_A10B20C30_OTK'  # len("SOME_STRING_") == 12 is fixed;
                                 # there may be multiple digits in between
re.compile(r"[0-9]+").split(s[12:])[:-1]  # discard the last element
# returns ['A', 'B', 'C']
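The two premises can also be checked explicitly before splitting, rather than assumed. The following is a rough sketch under the same assumptions; letters_between and its prefix parameter are hypothetical names, and the trailing part is assumed to start at the last underscore.

```python
import re

def letters_between(s, prefix="SOME_STRING_"):
    """Return the letters of the letter-digit run, or [] if a premise fails."""
    if not s.startswith(prefix):
        return []
    # Drop the prefix and everything from the last underscore onwards.
    body = s[len(prefix):].rsplit("_", 1)[0]
    # Premise check: only [A-D] letters, each followed by one or more digits.
    if not re.fullmatch(r"(?:[A-D][0-9]+)+", body):
        return []
    # Splitting on digit runs leaves a trailing empty string; discard it.
    return re.split(r"[0-9]+", body)[:-1]

good = letters_between("SOME_STRING_A10B20C30_OTK")
bad = letters_between("SOME_STRING_A1B2Z3_OTK")
print(good, bad)
# ['A', 'B', 'C'] []
```

Returning an empty list for data that breaks a premise is a design choice made for this sketch; raising an exception may suit stricter pipelines better.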
I need to remove the dot in front of digits using regular expressions in pandas.
What I have: .9/10 .8/10
What I want: 9/10 8/10
I need to use df.col.str.extract().
Also beware because there are also float numbers 11.25/10, and in those cases I want to keep the dot.
I think this works on the small example you provided (next time, please provide more data):
>>> import re
>>> re.sub(r' $', '', re.sub(r'^\.', '', re.sub(r', \.', ', ', '.9/10, .8/10 ')))
'9/10, 8/10'
Using a sample df, as you didn't provide one (in future, please provide a sample dataset and your expected outcome so that others can help):
import pandas as pd
df = pd.DataFrame({'Data': '20.01/10.'}, index=[0])
print(df)
        Data
0  20.01/10.
df['Data'] = df['Data'].str.replace(r'\.$', '', regex=True)
print(df)
       Data
0  20.01/10
Explanation
In regex, the $ special character "[matches] the end of the string or just before the newline at the end of the string"
Assuming you only need to remove the . from the end, you could use the pattern above.
If you need to remove every dot that is not followed by a digit, use
"\.(?!\d)"
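For the case described in the question, a dot in front of the digits that must go while the dot inside 11.25/10 stays, a lookbehind plus a lookahead is one possible approach. This is a sketch using plain re on a list of strings; the names LEADING_DOT and strip_leading_dot are made up for illustration.

```python
import re

# A dot that is followed by a digit but NOT preceded by one.
LEADING_DOT = re.compile(r"(?<!\d)\.(?=\d)")

def strip_leading_dot(text):
    """Remove dots that sit directly in front of a number."""
    return LEADING_DOT.sub("", text)

samples = [".9/10", ".8/10", "11.25/10"]
cleaned = [strip_leading_dot(s) for s in samples]
print(cleaned)
# ['9/10', '8/10', '11.25/10']
```

In pandas, the same pattern string should work with Series.str.replace(pattern, "", regex=True) on the column.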
Say I have the string AAAGCTTACGAAAAAAACGTA and I would like to remove everything from the first occurrence of four As onwards, regardless of where it occurs in the string. So for this example we are left with AAAGCTTACG after trimming. What would be a fast and efficient way to go about this?
You can use str.split():
>>> s = "AAAGCTTACGAAAAAAACGTA"
>>> s.split("AAAA", 1)[0]
'AAAGCTTACG'
You could use a greedy match and replace with nothing.
import re
new_string = re.sub(r'AAAA.*', '', original_string)
Alternatively, AAAA can also be expressed as A{4} if you find it more readable.
Just find those AAAA if any, and slice:
>>> s = "AAAGCTTACGAAAAAAACGTA"
>>> s[:s.find("AAAA")]
'AAAGCTTACG'
However, this way you should first check whether the string contains AAAA, otherwise it will slice away the last character.
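The check mentioned above can be folded into a small helper. This is only a sketch; the function name trim_at and its marker parameter are made up for illustration.

```python
def trim_at(seq, marker="AAAA"):
    """Return seq up to (not including) the first marker, or seq unchanged."""
    idx = seq.find(marker)   # find() returns -1 when the marker is absent
    return seq if idx == -1 else seq[:idx]

trimmed = trim_at("AAAGCTTACGAAAAAAACGTA")
untouched = trim_at("ACGTACGT")
print(trimmed, untouched)
# AAAGCTTACG ACGTACGT
```

Without the -1 check, seq[:-1] would silently drop the last character of any string that does not contain the marker.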
I'd like your opinion, as you are probably more experienced with Python than I am.
I come from C++ and I'm still not used to the Pythonic way of doing things.
I want to loop over the part of a string between two capital letters. For example, I could do it this way:
i = 0
str = "PythonIsFun"
for i, z in enumerate(str):
    if z.isupper():
        small = ''
        x = i + 1
        # collect characters until the next capital letter or the end of the string
        while x < len(str) and not str[x].isupper():
            small += str[x]
            x += 1
I wrote this on my phone, so I don't know if it even works, but you get the idea, I presume.
I need your help to get the best result here: not only code that is easy on the CPU, but clean code too. Thank you very much.
This is one of those times when regexes are the best bet.
(And don't call a string str, by the way: it shadows the built-in type.)
import re
s = 'PythonIsFun'
result = re.search('[A-Z]([a-z]+)[A-Z]', s)
if result is not None:
    print(result.groups()[0])
You could use regular expressions:
import re
txt = 'PythonIsFun'
re.findall(r'[A-Z]([^A-Z]+)[A-Z]', txt)
outputs ['ython'], and
re.findall(r'(?=[A-Z]([^A-Z]+)[A-Z])', txt)
outputs ['ython', 's']; and if you just need the first match,
re.search(r'[A-Z]([^A-Z]+)[A-Z]', txt).group(1)
You can use a list comprehension to do this easily.
>>> s = "PythonIsFun"
>>> u = [i for i,x in enumerate(s) if x.isupper()]
>>> s[u[0]+1:u[1]]
'ython'
If you can't guarantee that there are two upper case characters you can check the length of u to make sure it is at least 2. This does iterate over the entire string, which could be a problem if the two upper case characters occur at the start of a lengthy string.
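If scanning the whole string is a concern, one possible variation is to use a generator and stop after the second capital letter. This is a sketch rather than a drop-in replacement; the function name between_first_two_caps is hypothetical.

```python
from itertools import islice

def between_first_two_caps(s):
    """Return the text between the first two upper-case letters, or None."""
    # Lazy generator: indices are produced only as they are requested.
    caps = (i for i, ch in enumerate(s) if ch.isupper())
    first_two = list(islice(caps, 2))    # stops scanning after the second hit
    if len(first_two) < 2:
        return None
    start, end = first_two
    return s[start + 1:end]

found = between_first_two_caps("PythonIsFun")
missing = between_first_two_caps("python")
print(found, missing)
# ython None
```

Returning None when fewer than two capitals exist mirrors the length check suggested above.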
There are many ways to tackle this, but I'd use regular expressions.
This example will take "PythonIsFun" and return "ythonsun"
import re
text = "PythonIsFun"
pattern = re.compile(r'[a-z]')  # look for all lower-case characters
matches = re.findall(pattern, text)  # returns a list of lower-case characters
lower_string = ''.join(matches)  # turns the list into a string
print(lower_string)
outputs:
ythonsun