Need some help constructing a regex in Java - regex

I'm making a tool to find open reading frames for amino acids as a personal project. I have many strings that have characters consisting of the 26 uppercase English alphabet letters (A through Z). They look like this:
GMGMGRZMQGGRZR
I want to find all possible matches that are between the letters M and Z, with some additional rules.
There should not be any Z's in between an M and a Z
Example: If EMAZAZ is the input string then MAZ should match, MAZAZ should not
There can be multiple M's between an M and a Z
Example: If the input string is GMGMGRZMQGGRZR then MGMGRZ should match, but MGRZ shouldn't since there are more M's before the first M in MGRZ that could be used to match.
For Example
With the above string (GMGMGRZMQGGRZR), only MGMGRZ and MQGGRZ should match. MGMGRZMQGGRZ, MGRZ, and MGRZAMQGGRZ should NOT be match.
Does anyone know how to construct a regex like this? I consulted a few Java regex tutorials (I am using Java to write this program) but was unable to come up with a regex that followed all of the above rules.
The closest I have gotten is this regex:
M((?!(Z)))*Z
It shows that the substrings MGMGRZ, MQGGRZ, and MGRZ match. However, I do not want MGRZ to match.

What you want is:
(M[^Z]+Z)
DEMO
The regex works as follow: It will try to match an M, followed by any number of chars that are not a Z up to a Z
The thing is that every char is consumed only once from left to right, so in
GMGMGRZMQGGRZR
^----^ 1st match MGMGRZ
^----^ 2nd match MQGGRZ
And consequently, it will match MGRZ if you feed it alone to the regex !!

Related

Substitute every character after a certain position with a regex

I'm trying to use a regex replace each character after a given position (say, 3) with a placeholder character, for an arbitrary-length string (the output length should be the same as that of the input). I think a lookahead (lookbehind?) can do it, but I can't get it to work.
What I have right now is:
regex: /.(?=.{0,2}$)/
input string: 'hello there'
replace string: '_'
current output: 'hello th___' (last 3 substituted)
The output I'm looking for would be 'hel________' (everything but the first 3 substituted).
I'm doing this in Typescript, to replace some old javascript that is using ugly split/concatenate logic. However, I know how to make the regex calls, so the answer should be pretty language agnostic.
If you know the string is longer than given position n, the start-part can be optionally captured
(^.{3})?.
and replaced with e.g. $1_ (capture of first group and _). Won't work if string length is <= n.
See this demo at regex101
Another option is to use a lookehind as far as supported to check if preceded by n characters.
(?<=.{3}).
See other demo at regex101 (replace just with underscore) - String length does not matter here.
To mention in PHP/PCRE the start-part could simply be skipped like this: ^.{1,3}(*SKIP)(*F)|.

Regex cannot prevent a match of suffix name made up using I,V,X and SR/JR

I am trying to prevent the inclusion of suffix name, for example, JR/SR, or other suffix made up of using I,V,X using regular expression way. To accomplish this I have implemented the following regex
((^((?!((\b((I+))\b)|(\b(V+)\b)|(\b(X+)\b)|\b(IV)\b|(\b(V?I){1,2}\b)|(\b(IX)\b)|(\bX[I|IX]{1,2}\b)|(\bX|X+[V|VI]{1,2}\b)|(\b(JR)\b)|(\b(SR)\b))).)*$))
Using this I am able to prevent various possible combination eg.,
'Last Name I',
'Last Name II',
'Last Name IJR',
'Last Name SRX' etc.
However, there are still couple of combinations remaining, which this regex can match. eg., 'Last Name IXV' or 'Last Name VXI'
These two I am not able to debug. Please suggest me in which part of this regex I can make changes to satisfy the requirement.
Thank you!
Try this pattern: .+\b(?:(?>[JS]R)|X|I|J|V)+$
Explanation:
.+ - match one or more of any characters
\b - word boudnary
(?:...) - non-capturing group
(?>...) - atomic group
[JS]R - match whether S or J followed by R
| - alternation: match what is on the left OR what's on the right
+ - quantifier: match one or more times preceeding pattern
$ - match end of the string
Demo
In order to solve this I have worked on the above regex a little bit more. And here is the final result that can successfully match up with the "roman numeral" upto thirty constituted I, V, and X.
"(\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b|\b(V|X)\b|\bV[I]{1,2}\b|\b((?!XVV|XVX)X([IXV]{1,2}))\b|\b[S|J]R\b)|^$"
What I have done here is:
I have taken those input into consideration which are standalone,
that is: SR or XXV I have observed the incorrect pattern and
have restricted them to match as a positive result.
Separate input has been ensured using \b the word boundary.
Word-boundary: It suggests that starting of a word, that means in
simple words it says "yes there is a word" or "no it is not."
it has done in the following way-
using negative lookahead (?!(IIX|IIV|IVV|IXX|IXI))
How I have arrived on this solution is given as follows:
I have observed closely all the pattern first, that from I to X - that is:
I
I I
I I I
I V
V
V I
V I I
V I I I (it is out of the range of 3 characters.)
I X
X
we have an I, V, and X at first position. Then there is another I, X and V
on the second position. After then again same I and V. I have
implemented this in the following regex of the above written code:
\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b
Start it with I and then look for any of I, V, or X in a range of 'zero' to 'three' characters, and do neglect invalid numbers written inside the ?!(IIX|IIV|IVV|IXX|IXI) Similarly, I have done with other combinations given below.
Then for V and X : \b(V|X)\b
Then for the VI, VII: \bV[I]{1,2}\b
Then for the XI - XXX: \b((?!XVV|XVX)X([IXV]{1,2}))\b
To validate a suffix name, i.e. JR, SR, one can use following regex: \b[S|J]R\b
and the last (^$) is for matching a blank string or in other words, when no input has provided to the given input-box or textbox.
You may post any question or suggestion, if you have.
Thanks!
Ps: This regex is simply a solution to validate "roman numbers" from 1 to 30 using I, V, and X. I hope it helps to learn a bit to each and every newbie of regex.
I solved this with a more explicit:
(.+) (?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$))|(.+)
I know I could do something like [JS]R but I like the way this reads:
(.+) match any characters and then a space
(?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$)) atomically look for but don't match endings like JR etc
|(.+) if you don't find the endings then match any characters
Feel free to add the endings you'd like to suit your needs.

Look for any character that surrounds one of any character including itself

I am trying to write a regex code to find all examples of any character that surrounds one of any character including itself in the string below:
b9fgh9f1;2w;111b2b35hw3w3ww55
So ‘b2b’ and ‘111’ would be valid, but ‘3ww5’ would not be.
Could someone please help me out here?
Thanks,
Nikhil
You can use this regex which will match three characters where first and third are same using back reference, where as middle can be any,
(.).\1
Demo
Edit:
Above regex will only give you non-overlapping matches but as you want to get all matches that are even overlapping, you can use this positive look ahead based regex which doesn't consume the next two characters instead groups them in group2 so for your desired output, you can append characters from group1 and group2.
(.)(?=(.\1))
Demo with overlapping matches
Here is a Java code (I've never programmed in Ruby) demonstrating the code and the same logic you can write in your fav programming language.
String s = "b9fgh9f1;2w;111b2b35hw3w3ww55";
Pattern p = Pattern.compile("(.)(?=(.\\1))");
Matcher m = p.matcher(s);
while(m.find()) {
System.out.println(m.group(1) + m.group(2));
}
Prints all your intended matches,
111
b2b
w3w
3w3
w3w
Also, here is a Python code that may help if you know Python,
import re
s = 'b9fgh9f1;2w;111b2b35hw3w3ww55'
matches = re.findall(r'(.)(?=(.\1))',s)
for m in re.findall(r'(.)(?=(.\1))',s):
print(m[0]+m[1])
Prints all your expected matches,
111
b2b
w3w
3w3
w3w

Regex for searching strings matching the following one

I am searching strings matching the following one in my source code:
<CONSTANT_STRING_1> <CONSTANT_STRING_2> <VARIABLE_DIGITS> <CONSTANT_STRING_3>
where
<CONSTANT_STRING_1>, <CONSTANT_STRING_2> and <CONSTANT_STRING_3> are constant strings like "ABC", ""DEF" and "GHI".
<VARIABLE_DIGITS> is a random number of 14 digits like "12345678901234"
Note: there are white spaces between words.
What I am looking for is to search <CONSTANT_STRING_1> <CONSTANT_STRING_2> <WHATEVER> <CONSTANT_STRING_3>. How can I build the Regex?
I am reading that by "constant string" you mean character strings? If so the below should work to find that full string you are looking for. Btw the website linked below is really great for visualizing this type of problem... give it a try :)
(([a-zA-Z]+\s){2})[0-9]{14}\s([a-zA-Z]+)$
Debuggex Demo
To break it down...
(([a-zA-Z]+\s){2}) means a string of one or more characters comprised of either LC or UC letters followed by a space and that whole thing (chars + space) repeated twice
[0-9]{14}\s 14 digits followed by a space. As #Avinash said \d{14}\s is another way of writing this portion
([a-zA-Z]+)$ Another string of one or more characters. The $ indicates that this ends the string you are searching for
You could try the below regex.
<CONSTANT_STRING_1> <CONSTANT_STRING_2> \d{14} <CONSTANT_STRING_3>
Where, \d{14} matches exactly the 14 digit number.

Regular expression which will match if there is no repetition

I would like to construct regular expression which will match password if there is no character repeating 4 or more times.
I have come up with regex which will match if there is character or group of characters repeating 4 times:
(?:([a-zA-Z\d]{1,})\1\1\1)
Is there any way how to match only if the string doesn't contain the repetitions? I tried the approach suggested in Regular expression to match a line that doesn't contain a word? as I thought some combination of positive/negative lookaheads will make it. But I haven't found working example yet.
By repetition I mean any number of characters anywhere in the string
Example - should not match
aaaaxbc
abababab
x14aaaabc
Example - should match
abcaxaxaz
(a is here 4 times but it is not problem, I want to filter out repeating patterns)
That link was very helpful, and I was able to use it to create the regular expression from your original expression.
^(?:(?!(?<char>[a-zA-Z\d]+)\k<char>{3,}).)+$
or
^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$
Nota Bene: this solution doesn't answer exaactly to the question, it does too much relatively to the expressed need.
-----
In Python language:
import re
pat = '(?:(.)(?!.*?\\1.*?\\1.*?\\1.*\Z))+\Z'
regx = re.compile(pat)
for s in (':1*2-3=4#',
':1*1-3=4#5',
':1*1-1=4#5!6',
':1*1-1=1#',
':1*2-a=14#a~7&1{g}1'):
m = regx.match(s)
if m:
print m.group()
else:
print '--No match--'
result
:1*2-3=4#
:1*1-3=4#5
:1*1-1=4#5!6
--No match--
--No match--
It will give a lot of work to the regex motor because the principle of the pattern is that for each character of the string it runs through, it must verify that the current character isn't found three other times in the remaining sequence of characters that follow the current character.
But it works, apparently.