How can different quantifiers make regex behave differently? - regex

Playing around with a question asked earlier (put on hold, but I wanted to fiddle with it ;) I stumbled across a peculiarity I'd like to ask this knowledgeable community about. Namely - why do these two regexes give different results?
(\b\w+(?:\s+\w+)+)(?:.*?(\1))(?:.*?(\1))?(?:.*?(\1))?
vs.
(\b\w+(?:\s+\w+)+)(?:.*?(\1)){1,3}
First at regex101 - Second at regex101
What I wanted to do, was to have this regex:
(\b\w+(?:\s+\w+)+)(?:.*?(\1))+
detect repeated word sequences - regex101. (a word followed by at least one more. Then anything up to a repetition of the identified sequence, then this last part possibly repeated any number of times. I.e. one or more repetitions.)
What it did was find a sequence that repeated it self later in the document, but it skipped to the last one. OK, though I consider me somewhat comfortable around regexes, I know greediness vs. lazy can be confusing. And I wanted it to catch all repetitions.
So I tried to force it by repeating the second part instead of using a quantifier:
(\b\w+(?:\s+\w+)+)(?:.*?(\1))(?:.*?(\1))
and then it worked like expected - regex101.
That made me try the two regexes first mentioned, that in my opinion should yield the same result, but they don't. So, again - What makes them give different results?

Your original pattern, (\b\w+(?:\s+\w+)+)(?:.*?(\1))+, is going to skip to the last repeated sub-pattern because you are telling it to do that with that last + - you are quantifying a capture group, which means that (?:.*?(\1))+ will not stop when it first hits "my cat is black", it'll keep repeating itself until the longest match is found, at which point all intermediate matches of the capture group are discarded.
Generally speaking, don't quantify capture groups, capture quantified groups.
I think what you want is simply this:
(\b\w+(?:\s+\w+)+).*?(\1)
https://regex101.com/r/OzDdCs/7

When you repeat a capture group, only the last "capture" is put in the back reference.
For example /A(B)+/ used on the string "ABBB" would put the last "B" in capture group $1.
But /A(B)(B)(B)/ has 3 capture groups and thus will have a "B" in $1 & $2 & $3
That's why in those 2 regex examples you showed, the first will also mark that 2nd "my cat is black".
But the second regex example won't.

Related

How to write a regular expression inside awk to IGNORE a word as a whole? [duplicate]

I've been looking around and could not make this happen. I am not totally noob.
I need to get text delimited by (including) START and END that doesn't contain START. Basically I can't find a way to negate a whole word without using advanced stuff.
Example string:
abcSTARTabcSTARTabcENDabc
The expected result:
STARTabcEND
Not good:
STARTabcSTARTabcEND
I can't use backward search stuff. I am testing my regex here: www.regextester.com
Thanks for any advice.
Try this
START(?!.*START).*?END
See it here online on Regexr
(?!.*START) is a negative lookahead. It ensures that the word "START" is not following
.*? is a non greedy match of all characters till the next "END". Its needed, because the negative lookahead is just looking ahead and not capturing anything (zero length assertion)
Update:
I thought a bit more, the solution above is matching till the first "END". If this is not wanted (because you are excluding START from the content) then use the greedy version
START(?!.*START).*END
this will match till the last "END".
START(?:(?!START).)*END
will work with any number of START...END pairs. To demonstrate in Python:
>>> import re
>>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
>>> re.findall(r"START(?:(?!START).)*END", a)
['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']
If you only care for the content between START and END, use this:
(?<=START)(?:(?!START).)*(?=END)
See it here:
>>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
['def', 'jlk', 'uvw']
The really pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backwards search" to perhaps mean you cannot or don't want to use this feature.
Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To only capture the shortest possible string, extend the negation to also cover the end delimiter -- START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks to exceed the torture threshold in most cultures, though.
Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc). I fixed this but by the addition of S in [^ST] and adding S* before the non-optional S to allow for arbitrary repetitions of S otherwise.
May I suggest a possible improvement on the solution of Tim Pietzcker?
It seems to me that START(?:(?!START).)*?END is better in order to only catch a START immediately followed by an END without any START or END in between. I am using .NET and Tim's solution would match also something like START END END. At least in my personal case this is not wanted.
[EDIT: I have left this post for the information on capture groups but the main solution I gave was not correct.
(?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END)
as pointed out in the comments would not work; I was forgetting that the ignored characters could not be dropped and thus you would need something such as ...|STA(?![^R])| to still allow that character to be part of END, thus failing on something such as STARTSTAEND; so it's clearly a better choice; the following should show the proper way to use the capture groups...]
The answer given using the 'zero-width negative lookahead' operator "?!", with capture groups, is: (?:START)((?!.*START).*)(?:END) which captures the inner text using $1 for the replace. If you want to have the START and END tags captured you could do (START)((?!.*START).*)(END) which gives $1=START $2=text and $3=END or various other permutations by adding/removing ()s or ?:s.
That way if you are using it to do search and replace, you can do, something like BEGIN$1FINISH. So, if you started with:
abcSTARTdefSTARTghiENDjkl
you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:
abcSTARTdefBEGINghiFINISHjkl
which would allow you to change your START/END tokens only when paired properly.
Each (x) is a group, but I have put (?:x) for each of the ones except the middle which marks it as a non-capturing group; the only one I left without a ?: was the middle; however, you could also conceivably capture the BEGIN/END tokens as well if you wanted to move them around or what-have-you.
See the Java regex documentation for full details on Java regexes.

A short way to capture/back-reference every digit of a number individually

So basically I want to reformat a 10 digit number like so:
1234567890 --> (123) 456-7890
A long way to do this would be to have each number be its own capture group and then back-reference each one individually:
'([0-9])([0-9])...([0-9])' --> (\1\2\3) \4\5\6-\7\8\9\10
This seems unnecessary and verbose, but when I try the following
'([0-9]){10}'
There appears to be only one back-reference and its of the last digit in the number.
Is there is a more elegant way to reference each character as its own capture group?
Thanks!
The following pattern will do the job: ^(\d{3})(\d{3})(\d{4})$
^(\d{3}): beginning of the string, then exactly 3 digits
(\d{3}): exactly 3 digits
(\d{4})$: exactly 4 digits, then end of the string.
Then replace by: (\1) \2-\3
Although the other answer with its example regex patterns hopefully shed light on the correct application of capture groups, it does not directly answer the question. If you fail to understand how regular expressions work (capture groups in particular), you may find yourself wanting to do the same thing with a different pattern in the future.
Is there is a more elegant way to reference each character as its own
capture group?
The initial answer is "No", there is no way to reference an individual capture of a single capture group using traditional replacement syntax - regardless of whether it is a single digit or any other capture group. Consider that you indicate a precise number of matches with {10} and it seems perfectly reasonable to be able to access each capture. But what if you had indicated a variable number of matches with + or {,3}? There would be no well-defined way of knowing how many possible captures occurred. If the same regex pattern had had more capture groups following the "repeated" capture group, there would be no way of correctly referencing the later groups. Example: Given the pattern ([a-z])+(\d){3}, the first capture group could match 4 letters one time, then the next time match 11 letters. If you wanted to refer to the captured digits, how would you do that? You could not, since \1, \2, \3, ... would all be reserved for possible capture instances of the first group.
But the inability of basic regular expressions syntax to do what you want does not remove the validity of your question, nor does it necessarily place the solution outside the realm of many regular expression implementations. Various regex implementations (i.e. language syntax and regex libraries) resolve this limitation by facilitating regex matching with various objects for accessing repeated captures. (c# and .Net regex library is one example, like match.Groups[1].Captures[3]) So even though you can't use basic replacement patterns to get want you want, the answer is often "Yes", depending on the specific implementation.

Regex how to get a full match of nth word (without using non-capturing groups)

I am trying to use Regex to return the nth word in a string. This would be simple enough using other answers to similar questions; however, I do not have access to any of the code. I can only access a regex input field and the server only returns the 'full match' and cannot be made to return any captured groups such as 'group 1'
EDIT:
From the developers explaining the version of regex used:
"...its javascript regex so should mostly be compatible with perl i
believe but not as advanced, its fairly low level so wasn't really
intended for use by end users when originally implemented - i added
the dropdown with the intention of having some presets going
forwards."
/EDIT
Sample String:
One Two Three Four Five
Attempted solution (which is meant to get just the 2nd word):
^(?:\w+ ){1}(\S+)$
The result is:
One Two
I have also tried other variations of the regex:
(?:\w+ ){1}(\S+)$
^(?:\w+ ){1}(\S+)
But these just return the entire string.
I have tried replicating the behaviour that I see using regex101 but the results seem to be different, particularly when changing around the ^ and $.
For example, I get the same output on regex101 if I use the altered regex:
^(?:\w+ ){1}(\S+)
In any case, none of the comparing has helped me actually achieve my stated aim.
I am hoping that I have just missed something basic!
===EDIT===
Thanks to all of you who have contributed thus far, however, I am still running into issues. I am afraid that I do not know the language or restrictions on the regex other than what I can ascertain through trial and error, therefore here is a list of attempts and results all of which are trying to return "Two" from a sample of:
One Two Three Four Five
\w+(?=( \w+){1}$)
returns all words
^(\w+ ){1}\K(\w+)
returns no words atall (so I assume that \K does not work)
(\w+? ){1}\K(\w+?)(?= )
returns no words at all
\w+(?=\s\w+\s\w+\s\w+$)
returns all words
^(?:\w+\s){1}\K\w+
returns all words
====
With all of the above not working, I thought I would test out some others to see the limitations of the system
Attempting to return the last word:
\w+$
returns all words
This leads me to believe that something strange is going on with the start ^ and end $ characters, perhaps the server puts these in automatically if they are omitted? Any more ideas greatly appreciated.
I don't known if your language supports positive lookbehind, so using your example,
One Two Three Four Five
here is a solution which should work in every language :
\w+ match the first word
\w+$ match the last word
\w+(?=\s\w+$) match the 4th word
\w+(?=\s\w+\s\w+$) match the 3rd word
\w+(?=\s\w+\s\w+\s\w+$) match the 2nd word
So if a string contains 10 words :
The first and the last word are easy to find. To find a word at a position, then you simply have to use this rule :
\w+(?= followed by \s\w+ (10 - position) times followed by $)
Example
In this string :
One Two Three Four Five Six Seven Height Nine Ten
I want to find the 6th word.
10 - 6 = 4
\w+(?= followed by \s\w+ 4 times followed by $)
Our final regex is
\w+(?=\s\w+\s\w+\s\w+\s\w+$)
Demo
It's possible to use reset match (\K) to reset the position of the match and obtain the third word of a string as follows:
(\w+? ){2}\K(\w+?)(?= )
I'm not sure what language you're working in, so you may or may not have access to this feature.
I'm not sure if your language does support \K, but still sharing this anyway in case it does support:
^(?:\w+\s){3}\K\w+
to get the 4th word.
^ represents starting anchor
(?:\w+\s){3} is a non-capturing group that matches three words (ending with spaces)
\K is a match reset, so it resets the match and the previously matched characters aren't included
\w+ helps consume the nth word
Regex101 Demo
And similarly,
^(?:\w+\s){1}\K\w+ for the 2nd word
^(?:\w+\s){2}\K\w+ for the 3rd word
^(?:\w+\s){3}\K\w+ for the 4th word
and so on...
So, on the down side, you can't use look behind because that has to be a fixed width pattern, but the "full match" is just the last thing that "full matches", so you just need something whose last match is your word.
With Positive look-ahead, you can get the nth word from the right
\w+(?=( \w+){n}$)
If your server has extended regex, \K can "clear matched items", but most regex engines don't support this.
^(\w+ ){n}\K(\w+)
Unfortunately, Regex doesn't have a standard "match only n'th occurrence", So counting from the right is the best you can do. (Also, Regex101 has a searchable quick reference in the bottom right corner for looking up special characters, just remember that most of those characters are not supported by all regex engines)

Regex - Get string between two words that doesn't contain word

I've been looking around and could not make this happen. I am not totally noob.
I need to get text delimited by (including) START and END that doesn't contain START. Basically I can't find a way to negate a whole word without using advanced stuff.
Example string:
abcSTARTabcSTARTabcENDabc
The expected result:
STARTabcEND
Not good:
STARTabcSTARTabcEND
I can't use backward search stuff. I am testing my regex here: www.regextester.com
Thanks for any advice.
Try this
START(?!.*START).*?END
See it here online on Regexr
(?!.*START) is a negative lookahead. It ensures that the word "START" is not following
.*? is a non greedy match of all characters till the next "END". Its needed, because the negative lookahead is just looking ahead and not capturing anything (zero length assertion)
Update:
I thought a bit more, the solution above is matching till the first "END". If this is not wanted (because you are excluding START from the content) then use the greedy version
START(?!.*START).*END
this will match till the last "END".
START(?:(?!START).)*END
will work with any number of START...END pairs. To demonstrate in Python:
>>> import re
>>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
>>> re.findall(r"START(?:(?!START).)*END", a)
['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']
If you only care for the content between START and END, use this:
(?<=START)(?:(?!START).)*(?=END)
See it here:
>>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
['def', 'jlk', 'uvw']
The really pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backwards search" to perhaps mean you cannot or don't want to use this feature.
Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To only capture the shortest possible string, extend the negation to also cover the end delimiter -- START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks to exceed the torture threshold in most cultures, though.
Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc). I fixed this but by the addition of S in [^ST] and adding S* before the non-optional S to allow for arbitrary repetitions of S otherwise.
May I suggest a possible improvement on the solution of Tim Pietzcker?
It seems to me that START(?:(?!START).)*?END is better in order to only catch a START immediately followed by an END without any START or END in between. I am using .NET and Tim's solution would match also something like START END END. At least in my personal case this is not wanted.
[EDIT: I have left this post for the information on capture groups but the main solution I gave was not correct.
(?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END)
as pointed out in the comments would not work; I was forgetting that the ignored characters could not be dropped and thus you would need something such as ...|STA(?![^R])| to still allow that character to be part of END, thus failing on something such as STARTSTAEND; so it's clearly a better choice; the following should show the proper way to use the capture groups...]
The answer given using the 'zero-width negative lookahead' operator "?!", with capture groups, is: (?:START)((?!.*START).*)(?:END) which captures the inner text using $1 for the replace. If you want to have the START and END tags captured you could do (START)((?!.*START).*)(END) which gives $1=START $2=text and $3=END or various other permutations by adding/removing ()s or ?:s.
That way if you are using it to do search and replace, you can do, something like BEGIN$1FINISH. So, if you started with:
abcSTARTdefSTARTghiENDjkl
you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:
abcSTARTdefBEGINghiFINISHjkl
which would allow you to change your START/END tokens only when paired properly.
Each (x) is a group, but I have put (?:x) for each of the ones except the middle which marks it as a non-capturing group; the only one I left without a ?: was the middle; however, you could also conceivably capture the BEGIN/END tokens as well if you wanted to move them around or what-have-you.
See the Java regex documentation for full details on Java regexes.

Regular expression greedy match not working as expected

I have a very basic regular expression that I just can't figure out why it's not working so the question is two parts. Why does my current version not work and what is the correct expression.
Rules are pretty simple:
Must have minimum 3 characters.
If a % character is the first character must be a minimum of 4 characters.
So the following cases should work out as follows:
AB - fail
ABC - pass
ABCDEFG - pass
% - fail
%AB - fail
%ABC - pass
%ABCDEFG - pass
%%AB - pass
The expression I am using is:
^%?\S{3}
Which to me means:
^ - Start of string
%? - Greedy check for 0 or 1 % character
\S{3} - 3 other characters that are not white space
The problem is, the %? for some reason is not doing a greedy check. It's not eating the % character if it exists so the '%AB' case is passing which I think should be failing. Why is the %? not eating the % character?
Someone please show me the light :)
Edit: The answer I used was Dav below: ^(%\S{3}|[^%\s]\S{2})
Although it was a 2 part answer and Alan's really made me understand why. I didn't use his version of ^(?>%?)\S{3} because it worked but not in the javascript implementation. Both great answers and a lot of help.
The word for the behavior you described isn't greedy, it's possessive. Normal, greedy quantifiers match as much as they can originally, but back off if necessary to allow the whole regex to match (I like to think of them as greedy but accommodating). That's what's happening to you: the %? originally matches the leading percent sign, but if there aren't enough characters left for an overall match, it gives up the percent sign and lets \S{3} match it instead.
Some regex flavors (including Java and PHP) support possessive quantifiers, which never back off, even if that causes the overall match to fail. .NET doesn't have those, but it has the next best thing: atomic groups. Whatever you put inside an atomic group acts like a separate regex--it either matches at the position where it's applied or it doesn't, but it never goes back and tries to match more or less than it originally did just because the rest of the regex is failing (that is, the regex engine never backtracks into the atomic group). Here's how you would use it for your problem:
^(?>%?)\S{3}
If the string starts with a percent sign, the (?>%?) matches it, and if there aren't enough characters left for \S{3} to match, the regex fails.
Note that atomic groups (or possessive quantifiers) are not necessary to solve this problem, as #Dav demonstrated. But they're very powerful tools which can easily make the difference between impossible and possible, or too damn slow and slick as can be.
Regex will always try to match the whole pattern if it can - "greedy" doesn't mean "will always grab the character if it exists", but instead means "will always grab the character if it exists and a match can be made with it grabbed".
Instead, what you probably want is something like this:
^(%\S{3}|[^%\s]\S{2})
Which will match either a % followed by 3 characters, or a non-%, non-whitespace followed by 2 more.
I always love to look at RE questions to see how much time people spend on them to "Save time"
str.len() >= str[0]=='&' ? 4 : 3
Although in real life I'd be more explicit, I just wrote it that way because for some reason some people consider code brevity an advantage (I'd call it an anti-advantage, but that's not a popular opinion right now)
Try the regex modified a little based on Dav's original one:
^(%\S{3,}|[^%\s]\S{2,})
with the regex option "^ and $ match at line breaks" on.