how to manage duplicated characters in regex - regex

I'm using this regex to find ALL of the following occurrences in an array:
/^.*(?=.*T)(?=.*O)(?=.*T)(?=.*A).*$/
it matches
pOTATO
mATTO
cATeTO
but also
lATO
minAreTO
AnTicO
although these last three words have just one T each
how can I extract only words containing at least two Ts, one A and one O, in any order?

Lookarounds stand their ground: they are zero-width, so once the first lookahead is tried, every subsequent lookahead is checked from exactly the same position. As a result, the two (?=.*T) lookaheads in your pattern are both satisfied by the same single T.
You need to use
/^(?=.*T.*T)(?=.*O)(?=.*A).*/
/^(?=.*T[^T]*T)(?=.*O)(?=.*A).*/
Note the missing .* after ^: it is not necessary, as it is enough to fire the lookaheads once at the string start position. (?=.*T.*T) makes sure the string contains two T chars, each preceded by zero or more chars other than line break chars (as many as possible). (?=.*T[^T]*T) matches zero or more chars other than line break chars (as many as possible), then a T, then zero or more chars other than T, and then another T.
See regex demo #1 and regex demo #2. Note that (?=.*T[^T]*T) can match more than (?=.*T.*T) since [^T] can match line break chars. To avoid that in the demo, I added \n into the negated character class.
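As a quick check of the explanation above, here is a minimal Python sketch that runs the question's pattern and the first suggested pattern over the sample words from the question. The variable names are illustrative only, and Python's re module is an assumption, since the question does not name a language.

```python
import re

# Pattern from the question: both (?=.*T) lookaheads can be satisfied by the same T.
original = re.compile(r"^.*(?=.*T)(?=.*O)(?=.*T)(?=.*A).*$")

# First suggested pattern: one lookahead that requires two T chars.
two_t = re.compile(r"^(?=.*T.*T)(?=.*O)(?=.*A).*")

# Sample words taken from the question.
words = ["pOTATO", "mATTO", "cATeTO", "lATO", "minAreTO", "AnTicO"]

matched_by_original = [w for w in words if original.search(w)]
matched_by_two_t = [w for w in words if two_t.search(w)]

print(matched_by_original)  # all six words
print(matched_by_two_t)     # ['pOTATO', 'mATTO', 'cATeTO']
```

Only the words with at least two Ts survive the second pattern, while the original pattern accepts every word that has a T, an O and an A.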

Related

How to overcome multiple matches within same sentence (regex) [duplicate]

I am trying to implement a regex which includes all the strings which have any number of words but cannot be followed by a : and ignore the match if it does. I decided to use a negative look ahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it matches one character at a time and only drops the last character before the :.
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead and finds a match whenever there are at least two letters before a colon, returning the letters that are not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed with :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
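The backtracking behaviour and the fix can be reproduced with a short Python sketch. The use of Python's re module is an assumption (the question only links to regex101); the sample strings are the ones mentioned in the question and in this answer.

```python
import re

# Pattern from the question: the engine can backtrack inside the word.
original = re.compile(r"([a-zA-Z]+)(?!:)")

# Fixed pattern: the match may not be followed by a letter or a colon,
# so backtracking into the middle of a word can no longer succeed.
fixed = re.compile(r"[A-Za-z]+(?![A-Za-z:])")

samples = ["lame:joker", "real:c", "c:real"]

original_hits = {s: original.findall(s) for s in samples}
fixed_hits = {s: fixed.findall(s) for s in samples}

print(original_hits["lame:joker"])  # ['lam', 'joker']
print(fixed_hits["lame:joker"])     # ['joker']
```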
Preventing backtracking into a word-like pattern by using a word boundary
As @scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in "12," but not in "12:", "12c" or "12_"
\d+(?![\d:]) matches 12 in "12,", "12c" and "12_", and fails only in "12:".
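The difference between the two number patterns can be checked with a small Python sketch (Python's re module is assumed here; the four inputs are the ones listed above).

```python
import re

with_boundary = re.compile(r"\d+\b(?!:)")   # requires a word boundary after the digits
with_class = re.compile(r"\d+(?![\d:])")    # only forbids a following digit or colon

samples = ["12,", "12:", "12c", "12_"]

# Record, for each input, whether the pattern finds a match at all.
boundary_hits = {s: bool(with_boundary.search(s)) for s in samples}
class_hits = {s: bool(with_class.search(s)) for s in samples}

print(boundary_hits)  # only "12," matches
print(class_hits)     # everything except "12:" matches
```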
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.

Capturing uppercase words in text with regex

I'm trying to find words that are in uppercase in a given piece of text. The words must be one after the other to be considered and they must be at least 4 of them.
I have "almost" working code, but it captures much more: [A-Z]*(?: +[A-Z]*){4,}. The match also includes spaces at the start or the end of those words (like a boundary).
I have a playground if you want to test it out: https://regex101.com/r/BmXHFP/2
Is there a way to make the regex in example capture only the words in the first sentence? The language I'm using is Go and it has no look-behind/ahead.
In your regex, you just need to change the second * for a +:
[A-Z]*(?: +[A-Z]+){4,}
Explanation
With (?: +[A-Z]*), you are matching "one or more spaces followed by zero or more uppercase letters", so a run of spaces alone matches. After replacing the * with a +, spaces are matched only if uppercase letters follow them.
Demo on regex101
Replace the *s by +s, and your regex only matches the words in the first sentence.
[A-Z]* also matches the empty string. Looking at your regex and ignoring both [A-Z]*, all that remains is a sequence of spaces. Using + makes sure that there is at least one uppercase char after each run of spaces.
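The effect of the empty-string match can be seen in a minimal sketch. It is written in Python rather than Go purely for illustration; the patterns use no lookarounds. Both input strings are hypothetical samples, not the text from the regex101 link.

```python
import re

loose = re.compile(r"[A-Z]*(?: +[A-Z]*){4,}")   # pattern from the question
strict = re.compile(r"[A-Z]*(?: +[A-Z]+){4,}")  # second * changed to +

# Hypothetical inputs chosen for illustration.
spaces_only = "ab    cd"  # four spaces and no uppercase letters at all
shouting = "HELLO BIG WIDE OPEN WORLD and then lowercase text follows"

loose_hit = loose.search(spaces_only)    # matches the four spaces
strict_hit = strict.search(spaces_only)  # no match
strict_words = strict.findall(shouting)

print(repr(loose_hit.group()))  # '    '
print(strict_hit)               # None
print(strict_words)             # ['HELLO BIG WIDE OPEN WORLD']
```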
You have to require at least 1 uppercase letter per word, as in [A-Z]*(?: +[A-Z]+){4,} (see updated regex).
A more permissive regex also allows words with no spaces between them: [A-Z]*(?: *[A-Z]+){4,} (see better regex).
The * after the space makes the space optional, so uppercase letters are accepted even without spaces.

Regex: Match 'no characters' between strings

I have to verify that strings match the following format before the first whitespace (if there is one):
Up to 3 leading letters
At least 4 consecutive digits
Up to 3 trailing letters
To give examples, the following are valid:
1234
Abc123456DeF
1234 blah+
XyZ01234
I'm having trouble avoiding this case however: 123a+b blah
So far I have (^\w{0,3}\d{4}\w{0,3})\s* but the problem lies in making sure a non-letter isn't caught in the first section.
I can see a couple solutions:
Run regex twice, first getting the string up to the first whitespace ([^\s]+) then apply regex again to that making sure it ends in up to 3 letters (^\w{0,3}\d{4}\w{0,3}$). This is what I do now, but surely there's a way to do this in one expression - I just can't figure out how
Make sure no non-letters exist between the (potential) 3 trailing letters and the (potential) whitespace. (^\w{0,3}\d{4}\w{0,3}no non-letters)\s*
I've tried negative lookahead (?!.*) but that doesn't seem to do anything.
This regex satisfies your specifications.
Regex: ^\w{0,3}\d{4,}\w{0,3}\s?$
Explanation:
According to your specifications.
\w{0,3}? Up to 3 leading letters
\d{4,} At least 4 consecutive digits
\w{0,3}? Up to 3 trailing letters
I have to verify that strings match the following format before the first whitespace (if there is one):
\s? hence an optional space.
Regex101 Demo
Note: I am keeping the above struck out rather than deleting it, because many shortcomings were pointed out in the comments and this preserves their context.
Solution:
Like I said in my comment.
@JCK: Problem is . . even whitespace is optional. Thus making it difficult to differentiate between first and second part.
Now employing a lookahead solves this problem. Complete regex goes like this.
Regex: ^(?=.*[0-9]{4,}[A-Za-z]{0,3}(?:\s|$))[A-Za-z]{0,3}[0-9]{4,}[A-Za-z]{0,3}\s*?(?:\S*\s*)*$
Explanation:
(?=.*[0-9]{4,}[A-Za-z]{0,3}(?:\s|$)) This positive lookahead makes sure that the first part defined by your specifications is matched: it looks for the required digits and optional trailing letters, followed by either \s or $ (end of string).
[A-Za-z]{0,3}[0-9]{4,}[A-Za-z]{0,3}\s*?(?:\S*\s*)* Rest of the regex is as per the specifications.
Check by entering strings one by one.
Regex: (^[A-Za-z]{0,3}\d{4,}[A-Za-z]{0,3})(?:$|\s+)
\w is same as [A-Za-z0-9_], so to match just letters you should use [A-Za-z].
(?:$|\s+) matches end of string or at least one whitespace (hence ignoring the rest of the string).
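A minimal Python sketch of this last pattern, run against the examples from the question. The helper name leading_token is hypothetical, the choice of Python is an assumption, and the input "1234a+b blah" is an extra case added here for illustration.

```python
import re

# Up to 3 letters, at least 4 digits, up to 3 letters, then end of string or whitespace.
ID_PREFIX = re.compile(r"(^[A-Za-z]{0,3}\d{4,}[A-Za-z]{0,3})(?:$|\s+)")

def leading_token(text):
    """Return the validated token before the first whitespace, or None."""
    m = ID_PREFIX.search(text)
    return m.group(1) if m else None

# Valid and invalid examples from the question.
valid = ["1234", "Abc123456DeF", "1234 blah+", "XyZ01234"]
# The second invalid input is not from the question; it is added for illustration.
invalid = ["123a+b blah", "1234a+b blah"]

valid_tokens = [leading_token(s) for s in valid]
invalid_tokens = [leading_token(s) for s in invalid]

print(valid_tokens)    # ['1234', 'Abc123456DeF', '1234', 'XyZ01234']
print(invalid_tokens)  # [None, None]
```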

regex doesn't match the word if it's not the last word

I'm trying to write a regex that matches a word in a string under these conditions:
the word must be 8 characters long.
the word must have 1 alphabetic character at any position in the word.
the word must have 7 digits at any position in the word.
\b(?=\w{8}\z)(?=[^a-zA-Z]*[a-zA-Z]{1})(?=(?:[\D]*[\d]){7}).*\b
this can find "123r1234" and "foo 123r1234" but it doesn't find "foo bar 123r1234 foo".
I tried to add word boundaries but it didn't work.
What is wrong with my regex and how can I fix it?
Thanks.
You can use the following regex:
\b(?=[^a-zA-Z]*[a-zA-Z])(?=(?:\D*\d){7})\w{8}\b
See demo
There are several things to note here:
It is not necessary to enclose single shorthand classes (like \d) into character classes (pattern becomes too awkward and less readable). Thus, use \D instead of [\D].
As a rule, the number of look-aheads should equal the number of conditions minus 1 (see Fine-Tuning: Removing One Condition at rexegg.com). Most often, length-restriction look-aheads with just 1 character/character class are valid candidates for being moved into the base pattern. Here, the (?=\w{8}) look-ahead can be dropped, with \w{8} replacing .* at the end.
The (?=\w{8}\z) look-ahead contains an end-of-string \z anchor that forces a match at the end of the string, while you need (as now I know) the end of a word.
[a-zA-Z]{1} is equal to [a-zA-Z] since {1} means exactly one repetition, so it is redundant (again, regex patterns should be as clean and concise as they can be).
UPDATE (+1 goes to @Jonny5)
There is another way of approaching the current problem: by having the word contain 8 word characters, but matching only 1 letter enclosed with any number of digits. This can be achieved with
(?i)\b(?=\w{8}\b)\d*[a-z]\d*\b
See another demo (Note i modifier is used here)
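Both patterns from this answer can be compared in a short Python sketch. Python's re module is an assumption (the question does not name a language, and \z there suggests another flavor), and the second input string is a hypothetical edge case added for illustration.

```python
import re

# First pattern: two look-aheads plus \w{8} in the base pattern.
lookahead_version = re.compile(r"\b(?=[^a-zA-Z]*[a-zA-Z])(?=(?:\D*\d){7})\w{8}\b")

# Second pattern: 8 word chars, consumed as digits around exactly one letter.
digits_around_letter = re.compile(r"(?i)\b(?=\w{8}\b)\d*[a-z]\d*\b")

sample = "foo bar 123r1234 foo"  # from the question
first_hits = lookahead_version.findall(sample)
second_hits = digits_around_letter.findall(sample)

# Hypothetical edge case: an 8-digit word followed later by a letter.
edge = "12345678 a"
first_edge = lookahead_version.findall(edge)
second_edge = digits_around_letter.findall(edge)

print(first_hits, second_hits)  # ['123r1234'] ['123r1234']
print(first_edge, second_edge)  # ['12345678'] []
```

On the edge case, the look-aheads of the first pattern scan past the end of the current word, so a letter later in the string satisfies them. The second pattern consumes the letter inside the word itself and does not report the all-digit word.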
You can remove the final .* and replace it with the \w{8} counter.
\b(?=[^a-zA-Z]*[a-zA-Z])(?=(?:[\D]*[\d]){7})\w{8}\b
You can view it running here:
https://regex101.com/r/bX6rK8/1

Matching parts of string that contain no consecutive dashes

I need a regex that will match strings of letters that do not contain two consecutive dashes.
I came close with this regex that uses lookaround (I see no alternative):
([-a-z](?<!--))+
Which given the following as input:
qsdsdqf--sqdfqsdfazer--azerzaer-azerzear
Produces three matches:
qsdsdqf-
sqdfqsdfazer-
azerzaer-azerzear
What I want however is:
qsdsdqf-
-sqdfqsdfazer-
-azerzaer-azerzear
So my regex loses the first dash, which I don't want.
Who can give me a hint or a regex that can do this?
This should work:
-?([^-]-?)*
It makes sure that there is at least one non-dash character between every two dashes.
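A minimal Python sketch of this pattern on the input from the question. The choice of Python is an assumption. Because the pattern can also match the empty string, the sketch filters out empty matches.

```python
import re

# At most one dash at a time, with at least one non-dash char between dashes.
no_double_dash = re.compile(r"-?([^-]-?)*")

text = "qsdsdqf--sqdfqsdfazer--azerzaer-azerzear"

# group(0) is the whole match; empty matches (e.g. at the end of the string) are skipped.
chunks = [m.group(0) for m in no_double_dash.finditer(text) if m.group(0)]

print(chunks)  # ['qsdsdqf-', '-sqdfqsdfazer-', '-azerzaer-azerzear']
```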
Looks to me like you do want to match strings that contain double hyphens, but you want to break them into substrings that don't. Have you considered splitting it between pairs of hyphens? In other words, split on:
(?<=-)(?=-)
As for your regex, I think this is what you were getting at:
(?:[^-]+|-(?<!--)|\G-)+
The -(?<!--) will match one hyphen, but if the next character is also a hyphen the match ends. Next time around, \G- picks up the second hyphen because it's the next character; the only way that can happen (except at the beginning of the string) is if a previous match broke off at that point.
Be aware that this regex is more flavor dependent than most; I tested it in Java, but not all flavors support \G and lookbehinds.
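As an illustration of the flavor dependence, here is a sketch of the splitting approach in Python. Python's built-in re module does not support \G, so only the split on (?<=-)(?=-) is shown. Splitting on a pattern that matches the empty string requires Python 3.7 or later.

```python
import re

text = "qsdsdqf--sqdfqsdfazer--azerzaer-azerzear"

# Split at the zero-width position between two consecutive dashes.
parts = re.split(r"(?<=-)(?=-)", text)

print(parts)  # ['qsdsdqf-', '-sqdfqsdfazer-', '-azerzaer-azerzear']
```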