I have a list of 8-letter sequences like this:
['GQPLWLEH', 'TLYSFFPK', 'TYGEIFEK', 'APYWLINK', ...]
How can I use regular expressions to find all the sequences that have the specific letters at specific positions within the sequence? For example, the letters V, I, F, or Y at the 2nd letter in the sequence and the letters M, L, F, Y at the 3rd position in the sequence.
I'm really new to RE, thanks in advance!
You can try using the following regex pattern:
.[VIFY][MLFY].*
This will match any first character, followed by a second and third character using the logic you want.
import re
mylist = ['GQPLWLEH', 'TLYSFFPK', 'TYGEIFEK', 'APYWLINK']
r = re.compile(r".[VIFY][MLFY].*")
# list(...) keeps this working on Python 3, where filter returns an iterator
newlist = list(filter(r.match, mylist))
print(newlist)
Demo here:
Rextester
Note: I added the word BILL to your list in the demo to get something which passes the regex match.
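As a quick check of the note above, here is a minimal sketch that runs the same pattern over the four sample sequences plus the extra entry BILL. The names sequences, pattern and matches are only illustrative.

```python
import re

# Sample sequences from the question plus 'BILL', the extra entry from the demo note.
sequences = ['GQPLWLEH', 'TLYSFFPK', 'TYGEIFEK', 'APYWLINK', 'BILL']

# Any first character, then V/I/F/Y, then M/L/F/Y, then anything.
pattern = re.compile(r".[VIFY][MLFY].*")

matches = [s for s in sequences if pattern.match(s)]
print(matches)  # ['BILL']
```

None of the four original sequences has both required letters in positions 2 and 3, which is why only the added entry comes back.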
\b.[VIFY][MLFY]\w*\b
This may satisfy what you want. You can play with regex online at regex101
Maybe you can avoid using a regexp altogether:
[x for x in mylist if x[1] in 'VIFY' and x[2] in 'MLFY']
I am trying to reject names that include a suffix, for example JR/SR, or any other suffix made up of I, V, and X, using a regular expression. To do this I have implemented the following regex:
((^((?!((\b((I+))\b)|(\b(V+)\b)|(\b(X+)\b)|\b(IV)\b|(\b(V?I){1,2}\b)|(\b(IX)\b)|(\bX[I|IX]{1,2}\b)|(\bX|X+[V|VI]{1,2}\b)|(\b(JR)\b)|(\b(SR)\b))).)*$))
Using this I am able to reject various combinations, e.g.
'Last Name I',
'Last Name II',
'Last Name IJR',
'Last Name SRX' etc.
However, there are still a couple of combinations that this regex matches, e.g. 'Last Name IXV' or 'Last Name VXI'.
I am not able to debug these two. Please suggest which part of this regex I should change to meet the requirement.
Thank you!
Try this pattern: .+\b(?:(?>[JS]R)|X|I|J|V)+$
Explanation:
.+ - match one or more of any characters
\b - word boundary
(?:...) - non-capturing group
(?>...) - atomic group
[JS]R - match either S or J, followed by R
| - alternation: match what is on the left OR what's on the right
+ - quantifier: match the preceding pattern one or more times
$ - match end of the string
Demo
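To try the pattern from Python, here is a rough sketch. It makes one assumption: the atomic group (?>...) is swapped for a plain non-capturing group, because Python's re module only accepts atomic groups from version 3.11. For these inputs the two forms should behave the same. The names suffix_re, names and flagged are hypothetical.

```python
import re

# Assumption: (?>[JS]R) is written as plain [JS]R inside the non-capturing
# group so the sketch also runs on Python versions before 3.11.
suffix_re = re.compile(r".+\b(?:[JS]R|X|I|J|V)+$")

names = ['Last Name', 'Last Name II', 'Last Name IJR',
         'Last Name IXV', 'Last Name VXI']

# Keep only the names that end in a run of suffix tokens.
flagged = [n for n in names if suffix_re.search(n)]
print(flagged)
```

Every entry except the plain 'Last Name' is flagged, including the two combinations the question could not catch.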
To solve this I worked on the regex above a bit more. Here is the final result, which matches Roman numerals up to thirty built from I, V, and X:
"(\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b|\b(V|X)\b|\bV[I]{1,2}\b|\b((?!XVV|XVX)X([IXV]{1,2}))\b|\b[S|J]R\b)|^$"
What I have done here:
I only consider inputs that stand alone, such as SR or XXV. I looked for the incorrect patterns and prevented them from producing a positive match.
Standalone input is enforced with \b, the word boundary.
Word boundary: it marks the start or end of a word. In simple terms it says "yes, a word begins or ends here" or "no, it does not."
The incorrect patterns are excluded with a negative lookahead: (?!(IIX|IIV|IVV|IXX|IXI))
Here is how I arrived at this solution:
I first looked closely at all the patterns from I to X, that is:
I
II
III
IV
V
VI
VII
VIII (this is outside the range of 3 characters)
IX
X
There is an I, V, or X in the first position, then another I, X, or V in the second position, and after that again an I or V. I implemented this in the following part of the regex above:
\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b
Start with I, then look for any of I, V, or X repeated zero to three times, and reject the invalid numerals listed inside (?!(IIX|IIV|IVV|IXX|IXI)). I did the same for the other combinations below.
Then for V and X: \b(V|X)\b
Then for VI and VII: \bV[I]{1,2}\b
Then for XI to XXX: \b((?!XVV|XVX)X([IXV]{1,2}))\b
To validate a name suffix, i.e. JR or SR, one can use the following regex: \b[S|J]R\b
The final ^$ matches a blank string, in other words the case where no input has been provided in the input box or text box.
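The branches described above can be exercised with a short sketch. The pattern is copied verbatim from the answer. Using re.fullmatch is an assumption here (it treats the whole input as a single token), and the names pattern, samples and results are only illustrative.

```python
import re

# The pattern quoted above, copied verbatim and split over two lines.
pattern = re.compile(
    r"(\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b|\b(V|X)\b|\bV[I]{1,2}\b"
    r"|\b((?!XVV|XVX)X([IXV]{1,2}))\b|\b[S|J]R\b)|^$"
)

# Assumption: each sample is one standalone token, hence fullmatch.
samples = ['', 'IV', 'XXV', 'SR', 'IIX', 'VIII']
results = {s: bool(pattern.fullmatch(s)) for s in samples}

for s, ok in results.items():
    print(repr(s), ok)
```

In this sketch the blank string, IV, XXV and SR are accepted, IIX is rejected by the negative lookahead, and VIII is rejected because the V branch allows at most two I's, in line with the note in the list above.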
Feel free to post any questions or suggestions.
Thanks!
PS: This regex is simply a way to validate Roman numerals from 1 to 30 using I, V, and X. I hope it helps every regex newcomer learn a bit.
I solved this with a more explicit pattern:
(.+) (?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$))|(.+)
I know I could do something like [JS]R but I like the way this reads:
(.+) match any characters and then a space
(?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$)) atomically match, but don't capture, endings like JR etc.
|(.+) if you don't find the endings then match any characters
Feel free to add the endings you'd like to suit your needs.
I have a long string S and a string-to-string map M, where keys in M are the results of a regex match on S. I want to do a find-and-replace on S where, whenever one of the matches from that same regex is exactly one of my keys K in M, I replace it with its value M[K].
In order to do this I think I'd need to access the result of regex matches within a regex. If I try to store the result of a match and test equality outside a regex, I can't do my replace because I no longer know where the match was. How do I accomplish my goal?
Examples:
S = "abcd_a", regex = "[a-z]", M = {a:b}
result: "bbcd_b" because the regex would match the a's and replace them with b's
S = "abcd_a", regex = "[a-z]*", M = {a:b}
result: "abcd_b" because the regex would match "abcd" (but not replace it because it is not exactly "a") and the final 'a' (which it would replace because it is exactly "a")
EDIT: Thanks to AlanMoore for the suggestion. The code is now simpler.
I tried using python (2.7x) to solve this simple example, but it can be achieved with any other language. What's important is the approach (algorithm). Hope it helps:
import re
S = "abcd_a"
M = {'a': 'b'}
def ReplaceWithDict(pattern):
    # split by the match group and map each piece against the dict M
    return ''.join([M[v] if v and v in M else v for v in re.split(pattern, S)])
print(ReplaceWithDict('([a-z])'))
print(ReplaceWithDict('([a-z]*)'))
Output:
bbcd_b
abcd_b
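A different way to get the same results, shown here only as a sketch and not as the answerer's method, is to pass a callback to re.sub. The callback sees each match in place, so there is no need to track positions separately. The helper name replace_exact and its parameters are hypothetical.

```python
import re

def replace_exact(pattern, mapping, text):
    """Replace each regex match that is exactly a key of mapping with its value."""
    # The callback receives every match; matches that are not keys are
    # returned unchanged, so the rest of the text is left alone.
    return re.sub(pattern, lambda m: mapping.get(m.group(0), m.group(0)), text)

first = replace_exact(r"[a-z]", {"a": "b"}, "abcd_a")
second = replace_exact(r"[a-z]*", {"a": "b"}, "abcd_a")
print(first)   # bbcd_b
print(second)  # abcd_b
```

With "[a-z]*" the match "abcd" is not a key and stays as it is, while the final "a" is replaced, which is the behaviour the question asks for.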
I'm making a tool to find open reading frames for amino acids as a personal project. I have many strings that have characters consisting of the 26 uppercase English alphabet letters (A through Z). They look like this:
GMGMGRZMQGGRZR
I want to find all possible matches that are between the letters M and Z, with some additional rules.
There should not be any Z's in between an M and a Z
Example: If EMAZAZ is the input string then MAZ should match, MAZAZ should not
There can be multiple M's between an M and a Z
Example: If the input string is GMGMGRZMQGGRZR then MGMGRZ should match, but MGRZ shouldn't since there are more M's before the first M in MGRZ that could be used to match.
For Example
With the above string (GMGMGRZMQGGRZR), only MGMGRZ and MQGGRZ should match. MGMGRZMQGGRZ, MGRZ, and MGRZAMQGGRZ should NOT match.
Does anyone know how to construct a regex like this? I consulted a few Java regex tutorials (I am using Java to write this program) but was unable to come up with a regex that followed all of the above rules.
The closest I have gotten is this regex:
M((?!(Z)))*Z
It shows that the substrings MGMGRZ, MQGGRZ, and MGRZ match. However, I do not want MGRZ to match.
What you want is:
(M[^Z]+Z)
DEMO
The regex works as follows: it tries to match an M, followed by one or more chars that are not a Z, up to a Z.
Every char is consumed only once, from left to right, so in
GMGMGRZMQGGRZR
 ^----^       1st match MGMGRZ
       ^----^ 2nd match MQGGRZ
Consequently, it will match MGRZ if you feed that alone to the regex!
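The left-to-right behaviour can be checked with a small sketch. The question uses Java; Python is used here only for brevity, and the pattern is the same. The names orf_re, found and alone are illustrative.

```python
import re

# M, then one or more non-Z characters, then Z.
orf_re = re.compile(r"M[^Z]+Z")

# Scanning left to right, each character is consumed by at most one match.
found = orf_re.findall("GMGMGRZMQGGRZR")
print(found)  # ['MGMGRZ', 'MQGGRZ']

# Fed on its own, the shorter string matches too.
alone = orf_re.findall("MGRZ")
print(alone)  # ['MGRZ']
```

MGRZ is not reported for the long input because the scan has already consumed those characters as part of MGMGRZ.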
This may be very easy, but for some reason I am unable to get the expression right. I want to find the position/index of all matching words in a given string. For example:
"THIS IS AND NAND XOR NOR AATD". I want to find the index of each matching string that starts with A, can have any char from A-Z in between, and must end with T or D. So the result should look like [9,AND][14,AND][26,AAT][27,ATD]
My expression (?s)(A.[TD]) is missing the last index. Thanks in advance. I am using Python.
If you are trying to do this by using a regular expression, you need a Positive Lookahead assertion. I replaced the dot in your regular expression with [A-Z] since you stated you want to match word characters.
>>> import re
>>> p = re.compile(r'(?=(A[A-Z][TD]))')
>>> for m in p.finditer('THIS IS AND NAND XOR NOR AATD'):
...     print([m.start() + 1, m.group(1)])
...
[9, 'AND']
[14, 'AND']
[26, 'AAT']
[27, 'ATD']
You're not actually matching words but sequences, and the problem is that you are looking at capturing overlapping sequences.
See Overlapping regex matches for a discussion on the subject.
First match the text using:
/^(.*)(A[A-Z]*[TD])/g
The index of the matched element is then the length of the first captured group.
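A minimal sketch of this prefix-length idea in Python follows. Note that the greedy (.*) means it only finds the last candidate, and that the index is 0-based. The variable names are illustrative.

```python
import re

text = 'THIS IS AND NAND XOR NOR AATD'

# Greedy (.*) swallows as much as it can, so only the LAST candidate is found.
m = re.match(r'^(.*)(A[A-Z]*[TD])', text)
if m:
    index = len(m.group(1))   # 0-based start of the second group
    print(index, m.group(2))  # 26 ATD
```

To list every occurrence, including overlapping ones, a lookahead with finditer is still needed.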
In R, is there a better/simpler way than the following of finding the location of the last dot in a string?
x <- "hello.world.123.456"
g <- gregexpr(".", x, fixed=TRUE)
loc <- g[[1]]
loc[length(loc)] # returns 16
This finds all the dots in the string and then returns the last one, but it seems rather clumsy. I tried using regular expressions, but didn't get very far.
Does this work for you?
x <- "hello.world.123.456"
g <- regexpr("\\.[^\\.]*$", x)
g
\. matches a dot
[^\.] matches everything but a dot
* specifies that the previous expression (everything but a dot) may occur between 0 and unlimited times
$ marks the end of the string.
Taking everything together: find a dot that is followed by anything but a dot until the string ends. R requires \ to be escaped, hence \\ in the expression above. See regex101.com to experiment with regex.
How about a minor syntax improvement?
This will work for your literal example where the input vector is of length 1. Use escapes to get a literal "." search, and reverse the result to get the last index as the "first":
rev(gregexpr("\\.", x)[[1]])[1]
A more proper vectorized version (in case x is longer than 1):
sapply(gregexpr("\\.", x), function(x) rev(x)[1])
Another, tidier option is to use tail instead:
sapply(gregexpr("\\.", x), tail, 1)
Someone posted the following answer which I really liked, but I notice that they've deleted it:
regexpr("\\.[^\\.]*$", x)
I like it because it directly produces the desired location, without having to search through the results. The regexp is also fairly clean, which is a bit of an exception where regexps are concerned :)
There is a slick stri_locate_last function in the stringi package, that can accept both literal strings and regular expressions.
To just find a dot, no regex is required, and it is as easy as
stringi::stri_locate_last_fixed(x, ".")[,1]
If you need to use this function with a regex, to find the location of the last regex match in the string, you should replace _fixed with _regex:
stringi::stri_locate_last_regex(x, "\\.")[,1]
Note the . is a special regex metacharacter and should be escaped when used in a regex to match a literal dot char.
See an R demo online:
x <- "hello.world.123.456"
stringi::stri_locate_last_fixed(x, ".")[,1]
stringi::stri_locate_last_regex(x, "\\.")[,1]