Need a Regex that includes all char after expression - regex

I am trying to figure out a regex that matches a pattern plus all the characters after it, but stops before the next occurrence of the pattern so that the matches do not overlap.
This is my current regex:
[a-zA-Z]{2}\d{1}\s?\w?
The pattern is always 2 letters followed by a number, like AE1 or BE3, but I need all the characters following the pattern as well.
So AE1 A E F should match as a whole, but if another pattern occurs in the string, as in
AE1 A D BE1 A D C, the matches cannot overlap and there should be two separate matches.
So, to clarify:
AB3 D T B should be one match on the regex
ABC D A F DE3 D CD A
should have 2 matches, each including all the characters following it, because of the two-letter-plus-number pattern.
How do I achieve this?

I'm not quite following the logic here, yet my guess would be that we might want something similar to this:
([A-Z]{2}\d\s([A-Z]+\s)+)|([A-Z]{3}\s([A-Z]+\s)+)
which allows two letters followed by a digit, or three letters, both followed by ([A-Z]+\s)+.

Consider where your pattern should start. In AE1 A D BE1 A D C, what distinguishes AE1 A E F from BE1 A D C? You don't want to treat both the same way, so you have to separate them, and the only thing that separates them is which one sits at the start of the text.
So adding ^ to the start of your pattern solves the problem.
Your regex should look like this:
^[a-zA-Z]{2}\d{1}\s?\w?

What you want to do is to split a string with your pattern having the current pattern match as the start of the extracted substrings.
You may use
(?!^)(?=[a-zA-Z]{2}\d)
to split the string. Details
(?!^) - not at the start of the string
(?=[a-zA-Z]{2}\d) - a location in the string that is immediately followed with 2 ASCII letters and any digit.
See the Scala demo:
val s = "ABC D A F DE3 D CD A"
val rx = """(?!^)(?=[a-zA-Z]{2}\d)"""
val results = s.split(rx).map(_.trim)
println(results.mkString(", "))
// => ABC D A F, DE3 D CD A

You can just use this regex:
(?i)\b[a-z]{2}\d\b(?:(?:(?!\b[a-z]{2}\d\b).)+\s?)?
Demo and explanations: https://regex101.com/r/DtFU8j/1/
It uses a negative lookahead (?!\b[a-z]{2}\d\b) to add the constraint that the characters matched after the initial pattern (?i)\b[a-z]{2}\d\b must not contain that same pattern.
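As a rough illustration, the same pattern can be tried with Python's re module. Python and the helper name split_records are assumptions made for this sketch, not part of the answer above. Each match starts at a two-letters-plus-digit token and runs until just before the next such token:

```python
import re

# Pattern from the answer above: a two-letter-plus-digit token, then any
# characters that do not begin another such token.
TOKEN_RUN = re.compile(r"(?i)\b[a-z]{2}\d\b(?:(?:(?!\b[a-z]{2}\d\b).)+\s?)?")


def split_records(text):
    """Return each token with the text that follows it (hypothetical helper)."""
    # strip() drops the separator space that the pattern may leave at the end.
    return [m.group(0).strip() for m in TOKEN_RUN.finditer(text)]


print(split_records("AE1 A D BE1 A D C"))  # ['AE1 A D', 'BE1 A D C']
print(split_records("AB3 D T B"))          # ['AB3 D T B']
```

Note that ABC D A F DE3 D CD A yields a single match here, because ABC is three letters rather than two letters and a digit.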


How to reduce this regex to fit in Lua?

I'm trying to do a regex that captures the longest string before a possibly ending letter "c", in Lua.
For example,
Given abc, match ab
Given acc, match ac
Given abcd, match abcd
Given abd, match abd
The solution I came up with is ^(.+(?=c$)|.+(?!c$)). However, Lua does not have lookahead, so I'm wondering if there is a way to reduce this to something that Lua natively supports.
You can use (string.match(str:reverse(), "^c(.+)") or str:reverse()):reverse(). If nothing is matched, then the original string is returned.
[Updated to fix the logic]
You can use a capture group to capture 1 or more occurrences of any char except c, followed by optional c chars.
Then match a c char at the end of the string, outside of the group.
([^c]+c*)c$
The pattern matches
( Capture group 1
[^c]+c* Match 1+ times any char except c followed by matching optional c chars
) Close group 1
c$ Match a c char at the end of the string
In the code, you can print the value of group 1 if it exists, else print the whole string if there is no match.
I have no experience in Lua, but looking at the code (and borrowing it for an example) posted by @Nifim, this would match:
strs = {
"abc",
"acc",
"abd",
"abcd",
"c"
}
for _,s in ipairs(strs) do
print(s:match("([^c]+c*)c$") or s)
end
Output
ab
ac
abd
abcd
c
You can use s:match("([^c]+c?[^c]*)c$") or s. This gets all non-c characters, then potentially a c followed by any remaining non-c characters, and the final c if present. In the event of no match, it returns the whole string.
Here is an example:
strs = {
"abc",
"acc",
"abd",
"abcd",
}
for _,s in ipairs(strs) do
print(s, s:match("([^c]+c?[^c]*)c$") or s)
end
Output:
abc ab
acc ac
abd abd
abcd abcd
Egor Skriptunoff provided the best answer, in my opinion, in a comment here. Since he didn't leave it as a dedicated answer, I'll close this question for him. I do not take any credit.
Original answer:
s:match("(.-)c?$")
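For readers more used to other regex flavors, the lazy-quantifier idea carries over directly. A minimal Python sketch follows; Python and the function name strip_final_c are assumptions made for illustration, with Lua's .- corresponding to .*? here:

```python
import re


def strip_final_c(s):
    """Return s without one trailing 'c', if present (hypothetical helper)."""
    # (.*?) grows only as far as needed so that an optional final 'c'
    # can still reach the end of the string.
    return re.fullmatch(r"(.*?)c?", s, flags=re.DOTALL).group(1)


for s in ["abc", "acc", "abcd", "abd"]:
    print(s, "->", strip_final_c(s))
```

Because the capture is lazy, the optional c gets the first chance at the last character, and the capture takes everything only when the string does not end in c.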

removing one letter except a combination

Trying to remove all single-letter words except for the combination 'r d'. To be more clear, some examples:
a ball -> ball
r something -> something
d someone -> someone
r d something -> r d something
r something d -> something
So far I have managed to remove single letters other than r or d, but this is not what I want. I want to keep only the combination (example 4). I use this:
\b(?!r|d)\w{1}\b
Any idea how to do it?
Edit: The regex engine supports lookbehinds.
You may capture the r d combination and use a backreference in the replacement pattern to restore that combination, and remove all other matches:
\b(r d)\b|\b\w\b\s*
See the regex demo (replace with $1 that will put the r d back into the result).
Details:
\b(r d)\b - a "whole word" r d that is captured into Group 1
| - or
\b\w\b\s* - a single whole word consisting of 1 letter/digit/underscore (\b\w\b) and followed with 0+ whitespaces (\s*, just for removing the excessive whitespace, might not be necessary).
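The capture-and-restore approach can be sketched in Python as follows. Python, the helper name drop_single_letters, and the final strip() call are assumptions added for illustration; Python writes the $1 backreference as \1:

```python
import re

# Group 1 keeps the "r d" pair; the second alternative matches any other
# single-character word plus the whitespace after it.
PATTERN = re.compile(r"\b(r d)\b|\b\w\b\s*")


def drop_single_letters(text):
    """Remove one-character words but keep the 'r d' pair (hypothetical helper)."""
    # \1 is empty when the second alternative matched (Python 3.5+), so
    # those matches are deleted. strip() is an extra step added here to
    # drop the space left behind when the last word is removed.
    return PATTERN.sub(r"\1", text).strip()


for sample in ["a ball", "r d something", "r something d"]:
    print(repr(drop_single_letters(sample)))
```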

Need some help constructing a regex in Java

I'm making a tool to find open reading frames for amino acids as a personal project. I have many strings that have characters consisting of the 26 uppercase English alphabet letters (A through Z). They look like this:
GMGMGRZMQGGRZR
I want to find all possible matches that are between the letters M and Z, with some additional rules.
There should not be any Z's in between an M and a Z
Example: If EMAZAZ is the input string then MAZ should match, MAZAZ should not
There can be multiple M's between an M and a Z
Example: If the input string is GMGMGRZMQGGRZR then MGMGRZ should match, but MGRZ shouldn't since there are more M's before the first M in MGRZ that could be used to match.
For example:
With the above string (GMGMGRZMQGGRZR), only MGMGRZ and MQGGRZ should match. MGMGRZMQGGRZ, MGRZ, and MGRZAMQGGRZ should NOT match.
Does anyone know how to construct a regex like this? I consulted a few Java regex tutorials (I am using Java to write this program) but was unable to come up with a regex that followed all of the above rules.
The closest I have gotten is this regex:
M((?!(Z)))*Z
It shows that the substrings MGMGRZ, MQGGRZ, and MGRZ match. However, I do not want MGRZ to match.
What you want is:
(M[^Z]+Z)
The regex works as follows: it tries to match an M, followed by one or more characters that are not Z, up to a Z.
The thing is that every character is consumed only once, from left to right, so in
GMGMGRZMQGGRZR
 ^----^        1st match MGMGRZ
       ^----^  2nd match MQGGRZ
Consequently, it will still match MGRZ if you feed that string to the regex on its own.
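The left-to-right consumption is easy to check with a quick script. The sketch below uses Python rather than Java purely for brevity, and the helper name find_frames is hypothetical. java.util.regex should behave the same way for this pattern, since it also scans left to right and resumes after each match.

```python
import re

# Same pattern as above: an M, one or more non-Z characters, then a Z.
ORF = re.compile(r"M[^Z]+Z")


def find_frames(seq):
    """Return the non-overlapping M...Z matches, left to right (hypothetical helper)."""
    return ORF.findall(seq)


print(find_frames("GMGMGRZMQGGRZR"))  # ['MGMGRZ', 'MQGGRZ']
print(find_frames("EMAZAZ"))          # ['MAZ']
print(find_frames("MGRZ"))            # ['MGRZ']
```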

a{0,1}|b{0,1} matches only 'a', why?

Why is this happening? I've a complex regexp, but here is what is driving me crazy.
a|b
Matches either single a or single b.
a+|b+
Matches either series of a or series of b.
a{1}|b{1}
Matches a single letter, the same as a|b.
But I need to do this:
a{0,2}|b{0,2}
And this regexp matches only a and no b at all. What's wrong with that?
What is even funnier is that if I change the 0 to 1, so that it's {1,2}, it starts to match correctly (or better, as expected) again.
Since it seems it's not quite clear, I'm adding my real example:
my $launch_regexp = '(\d*)d{0,1}(\d*)(\+{0,2}|-{0,2})(\d*)';
($dice, $fc, $op, $mod) = ($launch =~ /$launch_regexp/);
Where $launch is the same as $ARGV[1].
I want to match many things. Examples:
3 (numbers)
d10 (d + numbers)
3d10 (numbers + d + numbers)
3d10+/-5 (numbers + d + numbers + (+|-) + numbers)
3d10++/--5 (numbers + d + numbers + (++|--) + numbers)
I know my regexp also matches other strings, but right now it works with + and not with -.
If I change the range to {1,2}, it matches strings with both + and - (but I also need to match strings which have no such modifiers).
This is happening on my machine with Perl 5.16.3 and I'm able to reproduce it on this website.
The string "b" can be matched by the regex a{0,2} as it correctly has zero instances of 'a'. It won't capture, but it'll match.
In order to match '','aa' or 'bb', you want (aa|bb)? and to wrap your whole regex in ^ and $
I think what you want for your solution is: (\d*)d?(\d+)(?:(\+{1,2}|\-{1,2})(\d*))?
Perl prefers earliest match in the string over anything else. Next, it prefers the earliest of a series of | alternatives (not the longest, as is the case with some regex engines).
Because your first alternative can match nothing, Perl will do so at the beginning of the string, for any string that doesn't start with an a.
You probably want something like:
my ($find) = ($string) =~ /^[^ab]*(a{1,2}|b{1,2}|\z)/;
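The same effect can be reproduced outside Perl. The sketch below uses Python's re module, which is also a backtracking engine that takes the leftmost match and the first alternative that succeeds. Python and the helper name first_match are assumptions made for illustration:

```python
import re


def first_match(pattern, text):
    """Return the text of the first match, or None (hypothetical helper)."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# a{0,2} succeeds with zero a's at position 0, so b{0,2} is never tried.
print(repr(first_match(r"a{0,2}|b{0,2}", "bb")))  # ''
# With a minimum of one, the first alternative fails and b{1,2} is used.
print(repr(first_match(r"a{1,2}|b{1,2}", "bb")))  # 'bb'
# Swapping the alternatives makes b{0,2} the first one to succeed.
print(repr(first_match(r"b{0,2}|a{0,2}", "bb")))  # 'bb'
```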

How to create a regular expression to match non-consecutive characters?

How to create a regular expression for strings of a,b and c such that aa and bb will be rejected?
For example, abcabccababcccccab will be accepted and aaabc or aaabbcccc or abcccababaa will be rejected.
If this is not a purely academic question, you can simply search for aa and bb and negate your logic, for example:
import re

s = 'abcccabaa'
# Continue only if the string contains neither "aa" nor "bb".
if re.search('(?:aa|bb)', s) is None:
    ...
or simply scan the string for the two patterns, avoiding expensive regular expressions:
if 'aa' not in s and 'bb' not in s:
    ...
For such an easy task RE is probably total overkill.
P.S.: The examples are in Python but the principle applies to other languages of course.
^(?!.*(?:aa|bb))[abc]+$
See it here on Regexr
This regex does two things:
verifies that your string consists only of a, b and c
fails on aa and bb
^ matches the start of the string
(?!.*(?:aa|bb)) negative lookahead assertion, will fail if there is aa or bb in the string
[abc]+ character class, allows only a, b and c, at least one of them (+)
$ matches the end of the string
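A short check of this pattern against the strings from the question, written as a Python sketch. The helper name is_accepted is hypothetical, and the last two inputs are extra cases added here:

```python
import re

# Lookahead rejects any string containing "aa" or "bb"; [abc]+ then
# requires a non-empty string made only of a, b and c.
NO_DOUBLES = re.compile(r"^(?!.*(?:aa|bb))[abc]+$")


def is_accepted(s):
    """True if s uses only a, b, c and contains no 'aa' or 'bb' (hypothetical helper)."""
    return NO_DOUBLES.match(s) is not None


print(is_accepted("abcabccababcccccab"))  # True
print(is_accepted("aaabc"))               # False
print(is_accepted("abcccababaa"))         # False
print(is_accepted("abd"))                 # False
print(is_accepted("acca"))                # True
```

Repeated c is allowed, so acca passes, while abd fails on the character class rather than on the lookahead.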
Using the & operator (intersection) and ~ (complement):
(a|b|c)*&~(.*(aa|bb).*)
Rewriting this without these operators is tricky. The usual approach is to break it into cases.
In this case it is not all that difficult.
Suppose that the letter c is taken out of the picture. The only sequences then which don't have aa and bb are:
e (empty string)
a
b
b?(ab)*a?
Next what we can do is insert some optional 'c' runs into all possible interior places:
e (empty string)
a
b
(bc*)?(ac*bc*)*a?
Next, we have to acknowledge that illegal sequences like aabb become accepted if non-optional 'c' runs are put in the middle, as in for example acacbcbc. We allow a final a or b. This pattern can take care of our lone a and b cases as well as matching the empty string:
(ac+|bc+)*(a|b)?
Then combine them together:
((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?)
We are almost there: we also need to recognize that this pattern can occur an arbitrary number of times, as long as there are dividing 'c'-s between the occurrences, and with arbitrary leading or trailing runs of c-s around the whole thing:
c*((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?)(c+((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?))*c*
Mr. Regex Philbin, I'm not coming up with any cases that this doesn't handle, so I'm leaving it as my final answer.