Find all chars occurring 1-3 times - regex

I'm looking for a regex here. I have a partial solution, but it needs to be completed. It currently looks for a literal B, but I want to look for [A-Z] and use backreferences in the corresponding places:
What I tried
^[^B]*(B)(?!(?:[^B]*B){3}) finds the first 'B' if it occurs 1-3 times.
regex101
Matches non B 0-n times
Matches first B
Does not match if there are 3 (or more) B's ahead
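The three steps above can be checked quickly in Python's re module. This is only a sketch: the helper name first_b_index is made up for illustration and is not part of the question.

```python
import re

# Pattern from the question: skip non-B characters, capture the first B,
# then reject the match if three (or more) further B's follow it.
FIRST_B = re.compile(r"^[^B]*(B)(?!(?:[^B]*B){3})")

def first_b_index(s):
    """Return the index of the first B if B occurs 1-3 times, else None."""
    m = FIRST_B.search(s)
    return m.start(1) if m else None

for s in ["AAB", "BXBXB", "BBBB", "AAA"]:
    print(s, first_b_index(s))
```

The strings with one to three B's report the position of the first B; "BBBB" (four B's) and "AAA" (no B) give None.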
What it should be
What I'm looking for is not just a match on B but a match on any [A-Z]. So I tried ^[^\1]*([A-Z])(?!(?:[^\1]*\1){3}) (replaced the literal B with [A-Z] and backreferenced it).
Problems to solve
The problem is that [^\1] does not seem to work.
I need to negate the backreference and quantify it, two things it doesn't want to do :D
Desired Results (examples)
AAA: result [A], because A is in [A-Z] and occurs no more than 3 times in the string
AABB: result [A,B], because A and B each occur no more than 3 times
ABAB: same as AABB
123AABBAA: result [B], because B is in [A-Z] and occurs no more than 3 times (A occurs 4 times)
EDIT
Something that may help: (?:(?<=(?!\1)))* as replacement for [^\1]*
regex101
This kind of works: it returns A on (A)AA and no match on AAAA, but it won't match BAAA (which should result in [A,B]).
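One possible way around the [^\1] problem, sketched in Python: a backreference is not expanded inside a character class (in Python's re, a numeric escape inside [...] is treated as a plain character), so the sketch below keeps only the "not followed by three more of the same letter" lookahead in the regex and does the "first occurrence" check in ordinary code. The function name letters_up_to_three is hypothetical, and this is a workaround rather than the pure-regex answer the question asks for.

```python
import re

# A letter that is NOT followed by three more copies of itself.
NOT_TOO_MANY = re.compile(r"([A-Z])(?!(?:.*\1){3})")

def letters_up_to_three(s):
    """Return each letter of [A-Z] that occurs 1-3 times, in order of first appearance."""
    found = []
    for m in NOT_TOO_MANY.finditer(s):
        letter = m.group(1)
        # Only trust the lookahead at the FIRST occurrence of the letter;
        # later occurrences see fewer copies ahead and would pass wrongly.
        if letter not in s[:m.start()]:
            found.append(letter)
    return found

for s in ["AAA", "AABB", "ABAB", "123AABBAA", "AAAA", "BAAA"]:
    print(s, letters_up_to_three(s))
```

On the examples from the question this gives [A], [A, B], [A, B] and [B]; AAAA yields nothing and BAAA yields [B, A].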

Related

Regex Match a non-empty sequence of characters, where there is even amount of a given character (including 0) and any amount of the other characters

I am trying to write a regular expression that matches a non-empty sequence of A's and B's in which the number of A's is even (including 0).
For example:
AABBABA -> AABBABA
BBBB -> BBBB
A -> nothing
Here is what I could come up with so far:
(AA+B*|B*AB*A|B*)+
But of course this currently matches only what's in the parentheses, not just any pattern of A's and B's. I am having trouble generalizing it to any even number of A's.
If you have to use regex, you may use something like this:
^(?:B*(?:AB*A)*B*)*$
Demo.
I'm sure this isn't the most efficient way but it seems to do the job.
This will basically match two A characters with zero or more B characters in between, and the whole thing is repeated zero or more times. This guarantees that the A count will be even. Then we have zero or more B characters at the beginning and end in case the string starts or ends with B. And then the whole thing is repeated zero or more times again.
If you want to reject empty strings (and assuming your regex flavor supports Lookaheads), you can add a simple Lookahead that looks for one character to the beginning of the pattern:
^(?=.)(?:B*(?:AB*A)*B*)*$
If you don't want to match an empty string without lookarounds, you might also use
^(?:(?:B*AB*A)+B*|B+)$
Explanation
^ Start of string
(?: Non capture group
(?:B*AB*A)+B* Match 1+ times pairs of A's between optional B's
| Or
B+ Match 1+ occurrences of B
) Close group
$ End of string
Regex demo
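As a sanity check, the lookaround-free pattern can be compared against a plain count of A's in Python. This is only a sketch; the name has_even_as is made up, and the brute-force loop simply tries every string of A's and B's up to length 6.

```python
import re
from itertools import product

# Lookaround-free pattern from the answer above.
EVEN_AS = re.compile(r"^(?:(?:B*AB*A)+B*|B+)$")

def has_even_as(s):
    """True for a non-empty string of A's and B's with an even number of A's."""
    return EVEN_AS.match(s) is not None

for s in ["AABBABA", "BBBB", "A", ""]:
    print(repr(s), has_even_as(s))

# Cross-check the regex against a direct count for all short strings.
for n in range(7):
    for chars in product("AB", repeat=n):
        s = "".join(chars)
        expected = n > 0 and s.count("A") % 2 == 0
        assert has_even_as(s) == expected, s
```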

Match numbers not following words or punctuation

I want a regex that will match numbers that are not preceded by whitespace or punctuation, e.g.:
foo12 -> matches 12
foo 42 -> no match
foo.42 -> no match
I came up with:
(?<![[:space:][:punct:]])\d+
However, this does not work as I intend. For the examples above, the results are as follows:
foo12 -> matches 12 (OK)
foo 42 -> matches 2 (not OK)
foo.42 -> matches 2 (not OK)
I understand why it matches single digits in the last two examples (the negative lookbehind only excludes whitespace and punctuation), but I am not sure how to change my regex to exclude those matches. How can it be corrected?
The reason for the partial match is that nothing stops the engine from starting the match at a later digit. In foo 42 the attempt at 4 fails, but at 2 the preceding character is 4, which is neither whitespace nor punctuation, so the lookbehind passes. You rule this out by including \d in the character class:
(?<![[:space:][:punct:]\d])\d+
                       ^^
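The effect of adding \d can be reproduced in Python. Note the assumption: Python's re module has no [:space:] or [:punct:] classes, so this sketch approximates them with \s and the ASCII characters in string.punctuation. The names WITHOUT_DIGIT and WITH_DIGIT are made up for the comparison.

```python
import re
import string

# Assumption: \s and ASCII punctuation stand in for [:space:] and [:punct:].
PUNCT = re.escape(string.punctuation)

WITHOUT_DIGIT = re.compile(rf"(?<![\s{PUNCT}])\d+")   # original attempt
WITH_DIGIT = re.compile(rf"(?<![\s\d{PUNCT}])\d+")    # corrected: \d added

for text in ["foo12", "foo 42", "foo.42"]:
    print(text, WITHOUT_DIGIT.findall(text), WITH_DIGIT.findall(text))
```

The first pattern still finds the trailing 2 in "foo 42" and "foo.42"; with \d in the lookbehind a match can no longer start in the middle of a number, so only "foo12" yields 12.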
This regex might help you divide your input into two groups, where the second group ($2) is the target number and the first group ($1) is the non-digit text directly before it:
([A-Za-z_+-]+)([0-9]+)
This may be the safer choice if you want to use it for text processing.

How can I create a regex that matches a 6[A-Z] character string, with 3 or more non-consecutive or consecutive Cs?

(?=[A-Z]{6})(?=([C]){3,6}) is what I have tried so far.
I would like it to work like this:
ABYCCC Match
CBTCAC Match
CCTYEC Match
AFEQCB Don't match
CCEEEE Don't match
EEEEEE Don't match
This however just matches strings with consecutive Cs.
I am very new to this, so any help is appreciated. I'm just using the search in Notepad++.
^(?=(?:.*C){3}).*$
Use this regex. See demo.
https://regex101.com/r/rP5pV8/1
So here we go
\b(?=(?:[ABD-Z]*C){3})[A-Z]{6}\b
This will match any string that consists of 6 uppercase letters, of which 3 (or more) are Cs.
It doesn't match:
strings shorter than 6 uppercase letters
strings longer than 6 uppercase letters
strings with fewer than 3 Cs among the 6 letters, even if more Cs follow outside the string
https://regex101.com/r/vV3yS4/2
You can check for the occurrence of at least 3 Cs by using a lookahead.
^(?=(?:[^C]*C){3})[A-Z]{6}$
[^C]*C matches any number of characters that are not C, followed by a C
(?:...){3} repeats the non-capturing group 3 times
[A-Z]{6} requires 6 uppercase letters.
See demo at regex101
(Note that for the demo I put an additional \n in the negated class so that it does not skip over newlines)
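To see the lookahead version against the six test strings from the question, here is a small Python sketch. The function name has_three_cs is made up for illustration.

```python
import re

# Lookahead: at least three C's, each preceded by any run of non-C characters.
# Main part: exactly six uppercase letters.
THREE_CS = re.compile(r"^(?=(?:[^C]*C){3})[A-Z]{6}$")

def has_three_cs(s):
    """True if s is six uppercase letters and at least three of them are C."""
    return THREE_CS.fullmatch(s) is not None

for s in ["ABYCCC", "CBTCAC", "CCTYEC", "AFEQCB", "CCEEEE", "EEEEEE"]:
    print(s, "Match" if has_three_cs(s) else "Don't match")
```

The C's do not need to be adjacent, because [^C]* is allowed to consume other letters between them.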
Here's how I would do it in Python:
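import re

# fullmatch requires the whole string to be exactly six uppercase letters
pattern = re.compile("[A-Z]{6}")
strings = ["AABSDC", "CCCASD", "CAVACC"]

def checkC(letters):
    # six uppercase letters, at least three of them C
    return bool(pattern.fullmatch(letters)) and letters.count('C') >= 3

for string in strings:
    print(checkC(string))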
Output:
False
True
True
(?=(.*C.*){3})[A-Z]{6}
I swapped the two parts and removed the "lookahead" from the [A-Z]{6} so that the expression matches something positive.
Then, to the left and right of "C" I added .*, which matches any character zero or more times. So you still match three or more Cs and allow for anything between them.
After that, I removed the ,6 because "anything" can include more Cs.

Can I make two groups of regex match in same quantity?

I want a regex that matches the following pattern:
b
abc
aabcc
aaabccc
But does NOT match any of:
ab
bc
aabc
abcc
Basically, /(a)*b(c){_Q_}/, where _Q_ is the number of times that group 1 matched. I know how to match group 1 content later in the string, but how can I match group 1 count?
Use this recursive regex:
^(a(?:(?1)|b)c)$|^(b)$
Demo on regex101
The regex can be further reduced to:
^(a(?1)c|b)$
Demo on regex101
The alternation consists of:
The base case b
The recursive case a(?1)c, which matches a, then recurses into group 1, then matches c. Group 1 is the alternation itself, so it can contain more pairs of a and c, or the recursion ends at the base case b.
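The (?1) recursion needs an engine that supports it (PCRE, for example); Python's built-in re module does not. Where recursion is unavailable, one fallback is to capture both runs and compare their lengths in code. The sketch below is that swapped-in technique, not the recursive regex itself, and the name balanced is made up.

```python
import re

# Capture the run of a's and the run of c's around a single b.
ABC = re.compile(r"(a*)b(c*)")

def balanced(s):
    """True if s is n a's, one b, then exactly n c's (n may be 0)."""
    m = ABC.fullmatch(s)
    return m is not None and len(m.group(1)) == len(m.group(2))

for s in ["b", "abc", "aabcc", "aaabccc", "ab", "bc", "aabc", "abcc"]:
    print(s, balanced(s))
```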

Regular expression for crossword solution

This is a crossword problem. Example:
the solution is a 6-letter word which starts with "r" and ends with "r"
thus the pattern is "r....r"
the unknown 4 letters must be drawn from the pool of letters "a", "e", "i" and "p"
each letter must be used exactly once
we have a large list of candidate 6-letter words
Solutions: "rapier" or "repair".
Filtering for the pattern "r....r" is trivial, but finding words which also have [aeip] in the "unknown" slots is beyond me.
Is this problem amenable to a regex, or must it be done by exhaustive methods?
Try this:
r(?:(?!\1)a()|(?!\2)e()|(?!\3)i()|(?!\4)p()){4}r
...or more readably:
r
(?:
(?!\1) a () |
(?!\2) e () |
(?!\3) i () |
(?!\4) p ()
){4}
r
The empty groups serve as check marks, ticking off each letter as it's consumed. For example, if the word to be matched is repair, the e will be the first letter matched by this construct. If the regex tries to match another e later on, that alternative won't match it. The negative lookahead (?!\2) will fail because group #2 has participated in the match, and never mind that it didn't consume anything.
What's really cool is that it works just as well on strings that contain duplicate letters. Take your redeem example:
r
(?:
(?!\1) e () |
(?!\2) e () |
(?!\3) e () |
(?!\4) d ()
){4}
m
After the first e is consumed, the first alternative is effectively disabled, so the second alternative takes it instead. And so on...
Unfortunately, this technique doesn't work in all regex flavors. For one thing, they don't all treat empty/failed group captures the same. The ECMAScript spec explicitly states that references to non-participating groups should always succeed.
The regex flavor also has to support forward references--that is, backreferences that appear before the groups they refer to in the regex. (ref) It should work in .NET, Java, Perl, PCRE and Ruby, that I know of.
Assuming that you meant that the unknown letters must be among [aeip], then a suitable regex could be:
/r[aeip]{4,4}r/
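What's the front-end language being used to compare the strings: Java, .NET, ...?
Here is an example/pseudocode using Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String mandateLetters = "aeip"; // pool of allowed letters
// \\b on both sides lets the word sit inside a sentence; for a fixed length use "]{4}r\\b" instead of "]*r\\b"
String regPattern = "\\br[" + mandateLetters + "]*r\\b";
Pattern pattern = Pattern.compile(regPattern);
Matcher matcher = pattern.matcher("is this repair ");
boolean found = matcher.find(); // true here; matcher.group() is "repair"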
Why not replace each '.' in your original pattern with '[aeip]'?
You'd wind up with a regex string r[aeip][aeip][aeip][aeip]r.
This could of course be shortened to r[aeip]{4,4}r, but that would be a pain to implement in the general case and probably wouldn't improve the code any.
This doesn't address the issue of duplicate letter use. If I were coding it, I'd handle that in code outside the regexp - mostly because the regexp would get more complicated than I'd care to handle.
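Handling the "each letter exactly once" part outside the regex could look like the following Python sketch. It assumes the crossword pattern consists only of literal letters and dots; the function name fits and the sample word list are made up for illustration.

```python
import re
from collections import Counter

def fits(word, pattern="r....r", pool="aeip"):
    """True if word matches the dotted pattern and its unknown slots use
    exactly the letters of pool (as a multiset)."""
    # Assumption: pattern contains only letters and '.', so it is a valid
    # regex and has the same length as any word it fully matches.
    if not re.fullmatch(pattern, word):
        return False
    unknown = [ch for ch, p in zip(word, pattern) if p == "."]
    return Counter(unknown) == Counter(pool)

candidates = ["rapier", "repair", "reaper", "rapper", "roller"]
print([w for w in candidates if fits(w)])
print(fits("redeem", pattern="r....m", pool="deee"))
```

Because the comparison is on letter counts, a pool with repeated letters (such as deee) needs no special handling.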
So the "only once" part is the critical thing. Listing all permutations is obviously not feasible. If your language/environment supports lookaheads and backreferences you can make it a bit easier for yourself:
r(?=[aeip]{4,4})(.)(?!\1)(.)(?!\1|\2)(.)(?!\1|\2|\3).r
Still quite ugly, but here is how it works:
r # match an r
(?= # positive lookahead (doesn't advance position of "cursor" in input string)
[aeip]{4,4}
) # make sure that there are the four desired character ahead
(.) # match any character and capture it in group 1
(?!\1)# make sure that the next character is NOT the same as the previous one
(.) # match any character and capture it in group 2
(?!\1|\2)
# make sure that the next character is neither the first nor the second
(.) # match any character and capture it in group 3
(?!\1|\2|\3)
# same thing again for all three characters
. # match another arbitrary character
r # match an r
Working demo.
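The pattern above runs unchanged in Python, since it only refers back to groups that have already been captured. A short sketch with a made-up word list:

```python
import re

# Four letters from the pool, each different from all letters captured before it.
DISTINCT = re.compile(r"r(?=[aeip]{4})(.)(?!\1)(.)(?!\1|\2)(.)(?!\1|\2|\3).r")

words = ["rapier", "repair", "reaper", "rapper"]
print([w for w in words if DISTINCT.fullmatch(w)])
```

"reaper" and "rapper" fail because a lookahead finds a letter that was already captured.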
This is neither really elegant nor scalable. So you might just want to use r([aiep]{4,4})r (capturing the four critical letters) and ensure the additional condition without regex.
EDIT: In fact, the above pattern is only really useful and necessary if you just want to ensure that you have 4 non-identical characters. For your specific case, again using lookaheads, there is a simpler (albeit longer) solution:
r(?=[^a]*a[^a]*r)(?=[^e]*e[^e]*r)(?=[^i]*i[^i]*r)(?=[^p]*p[^p]*r)[aeip]{4,4}r
Explained:
r # match an r
(?= # lookahead: ensure that there is exactly one a until the next r
[^a]* # match an arbitrary amount of non-a characters
a # match one a
[^a]* # match an arbitrary amount of non-a characters
r # match the final r
) # end of lookahead
(?=[^e]*e[^e]*r) # ensure that there is exactly one e until the next r
(?=[^i]*i[^i]*r) # ensure that there is exactly one i until the next r
(?=[^p]*p[^p]*r) # ensure that there is exactly one p until the next r
[aeip]{4,4}r # actually match the rest to include it in the result
Working demo.
For r....m with a pool of deee, this could be adjusted as:
r(?=[^d]*d[^d]*m)(?=[^e]*(?:e[^e]*){3,3}m)[de]{4,4}m
This ensures that there is exactly one d and exactly 3 es.
Working demo.
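Writing one lookahead per distinct letter by hand gets tedious, so the pattern can be generated instead. The Python sketch below builds a lookahead for each letter of the pool that requires exactly its number of occurrences before the closing letter. The helper build_pattern is hypothetical, and it assumes the closing letter is not itself in the pool and that the result is used with a full-string match.

```python
import re
from collections import Counter

def build_pattern(first, pool, last):
    """Build e.g. r(?=[^d]*(?:d[^d]*){1}m)(?=[^e]*(?:e[^e]*){3}m)[de]{4}m.

    Assumes `last` does not occur in `pool`.
    """
    lookaheads = []
    for letter, n in sorted(Counter(pool).items()):
        c = re.escape(letter)
        # exactly n copies of the letter before the closing letter
        lookaheads.append(f"(?=[^{c}]*(?:{c}[^{c}]*){{{n}}}{re.escape(last)})")
    cls = "".join(re.escape(ch) for ch in sorted(set(pool)))
    return f"{re.escape(first)}{''.join(lookaheads)}[{cls}]{{{len(pool)}}}{re.escape(last)}"

aeip = re.compile(build_pattern("r", "aeip", "r"))
deee = re.compile(build_pattern("r", "deee", "m"))

print([w for w in ["rapier", "repair", "reaper"] if aeip.fullmatch(w)])
print([w for w in ["redeem", "reddem", "reeeem"] if deee.fullmatch(w)])
```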
Not purely regex, since it relies on sed applying two regexes in sequence:
sed -n -e '/^r[aiep]\{4,\}r$/{/\([aiep]\).*\1/!p;}' YourFile
This selects lines made of letters from the group aeip surrounded by r, and prints only those lines where no letter from the group occurs twice.
A more scalable solution (no need to write \1, \2, \3 and so on for each letter or position) is to use negative lookahead to assert that each character is not occurring later:
^r(?:([aeip])(?!.*\1)){4}r$
more readable as:
^r
(?:
([aeip])
(?!.*\1)
){4}
r$
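This version also works as-is in Python, because \1 always refers to the most recent capture of the repeated group. A brief sketch with a made-up word list:

```python
import re

# Each pool letter must not occur again later in the string.
ONCE_EACH = re.compile(r"^r(?:([aeip])(?!.*\1)){4}r$")

words = ["rapier", "repair", "reaper", "rapper"]
print([w for w in words if ONCE_EACH.match(w)])
```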
Improvements
This was a quick solution which works in the situation you gave us, but here are some additional constraints for a more robust version:
If the "pool of letters" may share some letters with the end of string, include the end of pattern in the lookahead:
^r(?:([aeip])(?!.*\1.*\2)){4}(r$)
(may not work as intended in all regex flavors, in which case copy-paste the end of pattern instead of using \2)
If some letters must be present not only once but a different fixed number of times, add a separate lookahead for all letters sharing this number of times. For instance, "r....r" with one "a" and one "p" but two "e" would be matched by this regex (but "rapper" and "repeer" wouldn't):
^r(?:([ap])(?!.*\1.*\3)|([e])(?!.*\2.*\2.*\3)){4}(r$)
The non-capturing group now has 2 alternatives: ([ap])(?!.*\1.*\3), which matches "a" or "p" not followed anywhere before the end by another one, and ([e])(?!.*\2.*\2.*\3), which matches "e" not followed anywhere before the end by 2 other ones (so it fails on the first one if there are 3 in total).
BTW this solution includes the above one, but the end of pattern is here shifted to \3 (also, cf. note about flavors).
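Python's re is one of the flavors where the forward reference to the end-of-pattern group does not work (it rejects a reference to a group that has not been defined yet). The sketch below therefore follows the fallback mentioned above and copies the end of the pattern (r$) into each lookahead. The word list is made up for illustration.

```python
import re

# One "a", one "p", two "e" between the r's; the end of pattern r$ is
# written out inside the lookaheads instead of using a forward reference.
A_P_EE = re.compile(r"^r(?:([ap])(?!.*\1.*r$)|(e)(?!.*\2.*\2.*r$)){4}r$")

words = ["reaper", "repear", "rapper", "repeer"]
print([w for w in words if A_P_EE.match(w)])
```

"rapper" fails on its first p (another p follows) and "repeer" fails on its first e (two more e's follow), as described in the text.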