Python: get the string between two capitals - c++

I'd like your opinion, as you are probably more experienced with Python than I am.
I came from C++ and I'm still not used to the Pythonic way of doing things.
I want to loop over a string and get the part between two capital letters. For example, I could do it this way:
str = "PythonIsFun"
for i, z in enumerate(str):
    if z.isupper():
        small = ''
        x = i + 1
        # collect characters until the next capital (or the end of the string)
        while x < len(str) and not str[x].isupper():
            small += str[x]
            x += 1
I wrote this on my phone, so I don't know if it even works, but you get the idea, I presume.
I'd like help getting the best result here: not only something easy on the CPU but clean code too. Thank you very much.

This is one of those times when regexes are the best bet.
(And don't call a string str, by the way: it shadows the built-in type.)
import re
s = 'PythonIsFun'
result = re.search(r'[A-Z]([a-z]+)[A-Z]', s)
if result is not None:
    print(result.groups()[0])

You could use regular expressions:
import re
txt = 'PythonIsFun'
re.findall(r'[A-Z]([^A-Z]+)[A-Z]', txt)
outputs ['ython'], and
re.findall(r'(?=[A-Z]([^A-Z]+)[A-Z])', txt)
outputs ['ython', 's']; and if you just need the first match,
re.search(r'[A-Z]([^A-Z]+)[A-Z]', txt).group(1)

You can use a list comprehension to do this easily.
>>> s = "PythonIsFun"
>>> u = [i for i,x in enumerate(s) if x.isupper()]
>>> s[u[0]+1:u[1]]
'ython'
If you can't guarantee that there are two upper case characters you can check the length of u to make sure it is at least 2. This does iterate over the entire string, which could be a problem if the two upper case characters occur at the start of a lengthy string.
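If scanning the whole string is a concern, a lazy variant can stop as soon as the second capital is found. This is only a sketch: the function name between_first_two_capitals is made up for illustration, and returning None when fewer than two capitals exist is an assumption about the desired behaviour.

```python
from itertools import islice


def between_first_two_capitals(s):
    """Return the text between the first two capitals, or None if there are fewer than two."""
    # Generator: positions are produced on demand, so iteration stops early.
    upper_positions = (i for i, ch in enumerate(s) if ch.isupper())
    first_two = list(islice(upper_positions, 2))
    if len(first_two) < 2:
        return None
    start, end = first_two
    return s[start + 1:end]
```

Because islice only pulls two items from the generator, the rest of a long string is never inspected.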

There are many ways to tackle this, but I'd use regular expressions.
This example will take "PythonIsFun" and return "ythonsun":
import re
text = "PythonIsFun"
pattern = re.compile(r'[a-z]')  # look for all lower-case characters
matches = re.findall(pattern, text)  # returns a list of lower-case characters
lower_string = ''.join(matches)  # turns the list into a string
print(lower_string)
outputs:
ythonsun

Related

How to get some number by different groups using re module in python3

Assume that there is a string like s = 'add/10/20/30/4/3/9/' or s = 'add/10/20/30/', which starts with 'add' and is followed by numbers (I don't know how many, only that there are at least three).
I want to get them as a list: ['10', '20', ...]
I tried to use re: r = re.compile(r"add/(?:(\d+)/){3,}")
However, only the last number was matched and returned.
>>> r.findall(s)
['9']
So what's the problem and how to fix that? Thanks in advance.
Is regex a must? The string split method should be faster here if you have such simple patterns:
s = "add/10/20/30/4/3/9/"
nums = [num for num in s.split('/')[1:] if num]
The regex pattern would simply be:
re.findall(r'\d+', s)
This would return all the numeric sequences in the string s.
re.findall(r"[0-9]+", s)
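As for why the original pattern returned only ['9']: a capture group that is repeated keeps just its last repetition. One way to keep the "at least three numbers" requirement is to validate with the repeated pattern and extract with a second call. A minimal sketch, where the function name extract_numbers and the choice to return an empty list for invalid input are assumptions:

```python
import re


def extract_numbers(s):
    """Return all numbers after 'add/' if there are at least three, else an empty list."""
    # Validation only: no capture group, because a repeated group
    # would retain just its final repetition anyway.
    if re.match(r"add/(?:\d+/){3,}$", s) is None:
        return []
    # Extraction: findall returns every non-overlapping run of digits.
    return re.findall(r"\d+", s)
```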

python regex for parsing filenames

I'm the worst for regex in general, but in python... I need help in fixing my regex for parsing filenames, e.g:
>>> from re import search, I, M
>>> x="/almac/data/vectors_puces_T12_C1_00_d2v_H50_corr_m10_70.mtx"
>>> for i in range(6):
...     print(search(r"[vectors|pairs]+_(\w+[\-\w+]*[0-9]{0,4})([_T[0-9]{2,3}_C[1-9]_[0-9]{2}]?)(_[d2v|w2v|coocc\w*|doc\w*]*)(_H[0-9]{1,4})(_[sub|co[nvs{0,2}|rr|nc]+]?)(_m[0-9]{1,3}[_[0-9]{0,3}]?)",x, M|I).group(i))
...
It gives the following output:
vectors_puces_T12_C1_00_d2v_H50_corr_m10_70
puces_T
12_C1_00
_d2v
_H50
_corr
However, what I need is
vectors_puces_T12_C1_00_d2v_H50_corr_m10_70
puces
T12_C1_00
_d2v
_H50
_corr
I don't know what exactly is wrong. Thank you
One problem is that \w would also match underscore which you want to be a delimiter between puces and T12_C1_00 in this case. Replace the \w with A-Za-z\-. Also, you should put the underscore between the appropriate saving groups:
(?:vectors|pairs)_([A-Za-z\-]+[0-9]{0,4})_([T[0-9]{2,3}_C[1-9]_[0-9]{2}]?)...
                                     HERE^
Works for me:
>>> import re
>>> re.search(r"(?:vectors|pairs)_([A-Za-z\-]+[0-9]{0,4})_([T[0-9]{2,3}_C[1-9]_[0-9]{2}]?)(_[d2v|w2v|coocc\w*|doc\w*]*)(_H[0-9]{1,4})(_[sub|co[nvs{0,2}|rr|nc]+]?)(_m[0-9]{1,3}[_[0-9]{0,3}]?)",x, re.M|re.I).groups()
('puces', 'T12_C1_00', '_d2v', '_H50', '_corr', '_m10_70')
I've also replaced the [vectors|pairs] with (?:vectors|pairs) which is, I think, what you've actually meant - match either vectors or pairs literal strings, (?:...) is a syntax for a non-capturing group.
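The difference between the two spellings can be seen on a scrambled word. This is a small sketch for illustration; the variable names are made up here.

```python
import re

# Square brackets define a set of single characters, so this matches any
# run made of the letters v, e, c, t, o, r, s, p, a, i and '|', in any order.
char_class = re.compile(r"[vectors|pairs]+")

# A non-capturing group with | matches one of the two literal words.
alternation = re.compile(r"(?:vectors|pairs)")

scrambled_matches_class = char_class.fullmatch("srotcev") is not None
scrambled_matches_group = alternation.fullmatch("srotcev") is not None
word_matches_group = alternation.fullmatch("pairs") is not None
```

Here the character class accepts the scrambled string, while the group accepts only the literal words.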
I'm not sure what your goal is, but you seem to be interested in what's between each underscore, so it may be simpler to split on it:
import os
path, filename = os.path.split(x)
filename = filename.split('.')[0]  # drop the extension
fileparts = filename.split('_')
fileparts will then be this list:
vectors
puces
T12
C1
00
d2v
H50
corr
m10
70
And you can validate / inspect any part, e.g. if fileparts[0] == 'vectors' or tpart = fileparts[2:4]...

Regex: allow for the occurrence of a certain character up to one time

I want to search for a specific (DNA) string 'AGCTAGCT' and allow for the occurrence of one (and only one) mismatch (signified as 'N').
The following are matches (no or one N):
AGCTAGCT
NGCTAGCT
AGCNAGCT
The following are not matches (two or more Ns):
AGNTAGCN
AGNTANCN
Use a negative lookahead at the start to check whether the string contains two N's.
^(?!.*?N.*N)[AGCTN]{8}$
I assumed that your string contains only the letters A, G, C, T and N.
^(?!.*?N.*N)[AGCTN]+$
Or simply like this,
^(?!.*?N.*N).+$
In any language you could do something like this (shown here in JavaScript):
var count = (str.match(/N/g) || []).length; // count the N's; match() returns null when there are none
if (count == 1 || count == 0) { // and compare it
    // str valid
}
If you only want a regex, you could use this regex
/^[^N]*N?[^N]*$/
You can test if the string matches the above regex or not.
If you are using Python, you can do it without regex (assuming dna is a list of strings):
myList = []
for word in dna:
    if word.count('N') < 2:
        myList.append(word)
And now, if you want to generate all the DNA strings (I don't know how DNA letters combine), this can help:
import itertools
letters = ['A', 'G', 'C', 'T', 'N']
for letter in itertools.permutations(letters):
    print(''.join(letter))
Then you will have all the permutations of the five letters.
I think a regular expression is not the best choice for doing this. I say that because (at least to my knowledge) there is no easy way to express an arbitrary string to match with at most one mistake, other than explicitly considering all the possible mistakes.
That being said, it would be something like this:
AGCTAGCT|NGCTAGCT|ANCTAGCT|AGNTAGCT|AGCNAGCT|AGCTNGCT|AGCTANCT|AGCTAGNT|AGCTAGCN
maybe it can be simplified a bit.
EDIT
Given that N is a mismatch, a regular expression to accept what you want should replace each N with the wrong alternatives.
AGCTAGCT|[GCT]GCTAGCT|A[ACT]CTAGCT|AG[AGT]TAGCT|AGC[AGC]AGCT
|AGCT[GCT]GCT|AGCTA[ACT]CT|AGCTAG[AGT]T|AGCTAGC[AGC]
Simplifying...
(A(G(C(T(A(G(C(T|[AGC])|[AGT]T)|[ACT]CT)|[GCT]GCT)|[AGC]AGCT)|[AGT]TAGCT)|[ACT]CTAGCT)|[GCT]GCTAGCT)
Demo replacing N with wrong choices https://regex101.com/r/bB0gX1/1.
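If the comparison is done in Python anyway, the "at most one wrong letter" rule from the EDIT can also be checked without enumerating alternatives, by counting the positions that differ. A minimal sketch; the function name within_one_mismatch is hypothetical, and it assumes the candidate must have the same length as the target:

```python
def within_one_mismatch(candidate, target="AGCTAGCT"):
    """Return True if candidate differs from target in at most one position."""
    if len(candidate) != len(target):
        return False
    # Count the positions where the two strings disagree.
    mismatches = sum(1 for a, b in zip(candidate, target) if a != b)
    return mismatches <= 1
```

With this check an N in the candidate simply counts as one differing position, so the examples from the question behave as listed there.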

Regex that find words with exactly two 'a'

I'd like a regex that finds words with exactly two a's (not 3, 4, 5, ...). What pattern do I need? The a's don't have to be in a row.
["taat","weagda","aa"] are OK,
but not these: ["a","eta","aaa","aata","ssdfaasdfa"].
This one will work:
^[^a]*a[^a]*a[^a]*$
More generalized version where you can replace 2 with any number:
^(?:[^a]*a){2}[^a]*$
The 2 regexes above make use of the fact that a is a single character, so we can make sure that all other characters are not a. The 2nd one uses repetition notation.
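Applied in Python, the repetition form might be wrapped like this. It is a sketch only: the helper name has_exactly and its parameters are invented for illustration, and it assumes char is a single character.

```python
import re


def has_exactly(word, n, char="a"):
    """Return True if word contains char exactly n times."""
    c = re.escape(char)
    # Same shape as ^(?:[^a]*a){2}[^a]*$ with the character and count filled in.
    pattern = r"^(?:[^%s]*%s){%d}[^%s]*$" % (c, c, n, c)
    return re.match(pattern, word) is not None


words = ["taat", "weagda", "aa", "a", "eta", "aaa", "aata", "ssdfaasdfa"]
two_a_words = [w for w in words if has_exactly(w, 2)]
```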
An even more general version, for exactly n non-overlapping instances of a substring (DOTALL mode enabled):
^(?!(?:.*sstr){3})(?:.*sstr){2}.*$
Here sstr is a regex-escaped substring, and the repetition count in the negative lookahead must be 1 more than the number of instances we want to match.
This one is slightly trickier: the negative lookahead makes sure the string doesn't contain n + 1 non-overlapping instances of the substring sstr, and the rest of the pattern then finds exactly n non-overlapping instances.
In this situation, I think you can just use plain string handling with a for loop.
mylist = ["taat", "weagda", "aa", "eta", "aaa", "aata", "ssdfaasdfa"]
resultList = []
for x in mylist:
    count = 0
    for c in x:
        if c == 'a':
            count = count + 1
    if count == 2:
        resultList.append(x)
print(resultList)
Do it with two regexes rather than trying to cram it all into one.
Check that your word matches a[^a]*a and does not match a.*a.*a
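In Python the two checks could be combined roughly as follows; the function name exactly_two_a is made up for this sketch, and re.search is used so the patterns may match anywhere in the word.

```python
import re


def exactly_two_a(word):
    """Return True if word contains at least two a's but not three."""
    has_two = re.search(r"a[^a]*a", word) is not None
    has_three = re.search(r"a.*a.*a", word) is not None
    return has_two and not has_three
```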
You can also use a Counter object for this task.
In [1]: from collections import Counter
In [2]: words = ["taat","weagda","aa", "a","eta","aaa","aata","ssdfaasdfa"]
In [3]: [word for word in words if Counter(word)['a'] == 2]
Out[3]: ['taat', 'weagda', 'aa']

Take first successful match from a batch of regexes

I'm trying to extract a set of data from a string that can match one of three patterns. I have a list of compiled regexes. I want to run through them (in order) and go with the first match.
regexes = [
    compiled_regex_1,
    compiled_regex_2,
    compiled_regex_3,
]
m = None
for reg in regexes:
    m = reg.match(name)
    if m: break
if not m:
    print('ARGL NOTHING MATCHES THIS!!!')
This should work (I haven't tested it yet) but it's pretty fugly. Is there a better way of boiling down a loop that breaks when it succeeds or explodes when it doesn't?
There might be something specific to re that I don't know about that allows you to test multiple patterns too.
You can use the else clause of the for loop:
for reg in regexes:
    m = reg.match(name)
    if m: break
else:
    print('ARGL NOTHING MATCHES THIS!!!')
If you just want to know if any of the regexes match then you could use the builtin any function:
if any(reg.match(name) for reg in regexes):
    ...
however this will not tell you which regex matched.
Alternatively you can combine multiple patterns into a single regex with |:
regex = re.compile(r"(regex1)|(regex2)|...")
This will not tell you directly which regex matched, but you will have a match object that you can use for further information. For example, you can find out which of the regexes succeeded from the group that is not None:
>>> match = re.match("(a)|(b)|(c)|(d)", "c")
>>> match.groups()
(None, None, 'c', None)
However, this can get complicated if any of the sub-regexes have groups in them as well, since the numbering will change.
This is probably faster than matching each regex individually since the regex engine has more scope for optimising the regex.
Since you have a finite set in this case, you could use short-circuit evaluation:
m = (compiled_regex_1.match(name) or
     compiled_regex_2.match(name) or
     compiled_regex_3.match(name) or
     print("ARGHHHH!"))
In Python 2.6 or better:
import itertools as it
m = next(it.ifilter(None, (r.match(name) for r in regexes)), None)
The ifilter call could be made into a genexp, but only a bit awkwardly, i.e., with the usual trick for name binding in a genexp (aka the "phantom nested for clause idiom"):
m = next((m for r in regexes for m in (r.match(name),) if m), None)
but itertools is generally preferable where applicable.
The bit needing 2.6 is the next built-in, which lets you specify a default value if the iterator is exhausted. If you have to simulate it in 2.5 or earlier,
def next(itr, deft):
    try: return itr.next()
    except StopIteration: return deft
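For readers on Python 3, where itertools.ifilter no longer exists and the built-in filter is already lazy, the same idea might look like the sketch below. The three patterns and the name first_match are placeholders invented for illustration, not taken from the question.

```python
import re

# Placeholder patterns, tried in this order.
regexes = [
    re.compile(r"\d+"),
    re.compile(r"[a-z]+"),
    re.compile(r"[A-Z]+"),
]


def first_match(name):
    """Return the first successful match object, or None if nothing matches."""
    # filter(None, ...) drops falsy results (failed matches return None);
    # next() stops at the first survivor, so later regexes are not tried.
    return next(filter(None, (r.match(name) for r in regexes)), None)
```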
I use something like Dave Kirby suggested, but add named groups to the regexps, so that I know which one matched.
regexps = {
    'first': r'...',
    'second': r'...',
}
compiled = re.compile('|'.join('(?P<%s>%s)' % item for item in regexps.items()))
match = compiled.match(my_string)
print(match.lastgroup)
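A filled-in version of the same idea, as a sketch: the two patterns, the dictionary name patterns and the helper which_pattern are hypothetical and only serve to make the example runnable. It assumes the sub-patterns contain no groups of their own.

```python
import re

# Hypothetical patterns, chosen only for illustration.
patterns = {
    "number": r"\d+",
    "word": r"[A-Za-z]+",
}

# Wrap each pattern in a named group and join them with |.
combined = re.compile("|".join("(?P<%s>%s)" % item for item in patterns.items()))


def which_pattern(text):
    """Return the name of the pattern that matched at the start of text, or None."""
    match = combined.match(text)
    return match.lastgroup if match else None
```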
Eric is on a better track in taking the bigger picture of what the OP is aiming for, though I would use if/else. I also think that using the print function in an or expression is a little questionable. +1 to Nathon for correcting the OP to use a proper else clause.
Then my alternative:
# alternative to any builtin that returns useful result,
# the first considered True value
def first(seq):
    for item in seq:
        if item: return item
regexes = [
    compiled_regex_1,
    compiled_regex_2,
    compiled_regex_3,
]
m = first(reg.match(name) for reg in regexes)
print(m if m else 'ARGL NOTHING MATCHES THIS!!!')