Is there a way to compare regular expression backreferences? - regex

I have the following sample expression that I'm passing to egrep over a word list:
^([a-z])lu([a-z])\2er$
I'd like to further stipulate that the content of \1 and \2 must be different, e.g. this would match "bluffer" but not "blubber". Is there a way to build this into the expression itself (so I can get my results right from egrep or something like it), or am I stuck doing this in some real language with regular expression support and manually checking that none of my groups are the same?

You could add the negative lookahead (?!\1) in front of the 2nd match group. The following regex:
([a-z])lu(?!\1)([a-z])\2er
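matches "bluffer" but not "blubber". This only works properly if both groups match the same number of characters. Note that lookaheads are not part of POSIX extended regular expressions, so egrep does not support them; GNU grep's -P (Perl-compatible) mode does.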
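A quick way to check the lookahead is Python's re module, which supports lookaheads. This is only a sketch: the word list is made up for illustration, and the ^ and $ anchors are carried over from the pattern in the question.

```python
import re

# \1 is the first letter; (?!\1) rejects a doubled letter equal to it.
pattern = re.compile(r"^([a-z])lu(?!\1)([a-z])\2er$")

# Hypothetical sample words standing in for the real word list.
words = ["bluffer", "blubber", "fluffer", "plugger"]
matches = [w for w in words if pattern.match(w)]
print(matches)
```

Words whose doubled letter equals the first letter ("blubber", "fluffer") are filtered out by the lookahead.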

You need something more powerful. Regular expressions can't track state. Sed could probably do what you need.

Related

Why "ab(cd|c)*d" matches "abcdcdd" completely but "ab(c|cd)*d" does not match that? Whereas they're like each other

I tried this regex:
ab(cd|c)*d
in the regex101 and RegExr websites. It matched this text completely:
abcdcdd
Now let's swap "cd" and "c" in the regex:
ab(c|cd)*d
When I try this regex in the websites, I see this regex does not completely match the same text.
Why doesn't the regex engine recognize that ab(cd|c)*d and ab(c|cd)*d are the same, and how can I persuade ab(c|cd)*d to match the longest string?
REGEX: ab(cd|c)*d
Complete text matched in 13 steps: abcdcdd
REGEX: ab(c|cd)*d
Partial text matched in 9 steps: abcdcdd
@MurrayW's answer is excellent, but I would like to add some background information.
Regex as Finite State Automata
When I first learned regular expressions in university, we learned to convert them to finite state automata, essentially compiling them into graphs that were then processed to match the string. When you do that, (cd|c) and (c|cd) get compiled into the same graph, in which case both of your regular expressions would match the whole string. This is what grep actually does:
Both
echo abcdcdd | grep --color -E 'ab(c|cd)*d'
and
echo abcdcdd | grep --color -E 'ab(cd|c)*d'
color the whole string in red.
Patterns we call "regular expressions"
True finite state automata have many limitations that programmers don't like, such as the inability to capture matching groups or to reuse those groups later in the pattern, and other limitations I forget, so the regular expression libraries that we use in most programming languages implement more complex formalisms. I don't remember what they are exactly, maybe push-down automata, but we have memory, we have backtracking, and all sorts of good stuff we use without thinking about it.
At the risk of seeming pedantic, the patterns we use are not "regular" at all. I know, the difference is usually not relevant, we just want our code to work, but once in a while it matters.
So, while the regular expressions (cd|c) and (c|cd) would be compiled into the same finite state automaton, those two (non-regular) patterns are instead turned into logic that says try the variants from left to right, and backtrack only if the rest of the pattern fails to match later, hence the results you observed.
Speed
While the patterns our "regular expression" libraries support offer us lots of goodies we like, those come at a performance cost. True regular expressions are blazingly fast, while our patterns, though usually fast, can sometimes be very expensive. Search for "catastrophic backtracking" on this site for many examples of patterns that take exponential time to fail. The same patterns, used with grep, would be compiled into a graph that is applied in linear time to the string to match no matter what.
Because the | character performs an or operation by testing the left-most condition first. If that matches, nothing further is tested in the or. If that fails, then the next or element is tested, and so on.
Using regex pattern ab(cd|c)*d, you can see that the cd part of (cd|c)* matches in your string, and is also repeated: abcdcdd.
However, in pattern ab(c|cd)*d, the c matches from the or operation in abcdcdd and so cd isn't tested at all. Then, the d at the end of the pattern matches the d after the first c and then the pattern stops, having only matched abcd out of abcdcdd.
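The difference is easy to reproduce with Python's re module, which is a backtracking engine of the kind described here. A minimal sketch; the anchored variant at the end is an addition not taken from the answers, showing one way to make the engine keep backtracking until the whole string is consumed.

```python
import re

text = "abcdcdd"

# cd is tried first, so each iteration of the group consumes "cd".
first = re.match(r"ab(cd|c)*d", text).group(0)

# c is tried first; the engine stops at the first overall success.
second = re.match(r"ab(c|cd)*d", text).group(0)

# Requiring the end of the string makes the short match fail,
# so the engine backtracks into the cd alternative.
anchored = re.match(r"ab(c|cd)*d$", text).group(0)

print(first, second, anchored)
```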
As previously answered in the comments, they are not the same patterns. The alternation in the first one tries to match cd first, the second one c first.
First pattern
abcdcdd
^^^^
||
||
ab(cd|c)*d
Second pattern
abcdcdd
^^____
| |
| |
ab(c|cd)*d
If the d is optional, you can omit the pipe for the alternation and make the d optional.
ab(cd?)*d.
Note that this way you repeat the capturing group which will hold the value of the last iteration.
If you are not interested in the value of the group and non-capturing groups are supported, you could use ab(?:cd?)*d.
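The remark about the repeated capturing group can be seen in a short Python sketch. The second input string, abcdcd, is an assumption added here to show a case where the last iteration differs from the earlier ones.

```python
import re

# Every iteration of (cd?) overwrites group 1; only the last one survives.
m = re.match(r"ab(cd?)*d", "abcdcdd")
whole, last = m.group(0), m.group(1)

# Here the final d must be given back to the trailing d of the pattern,
# so the last iteration of the group is just "c".
m2 = re.match(r"ab(cd?)*d", "abcdcd")
whole2, last2 = m2.group(0), m2.group(1)

# The non-capturing form matches the same text but stores no group.
nc = re.match(r"ab(?:cd?)*d", "abcdcdd")

print(whole, last, whole2, last2, nc.groups())
```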
Regex is always a left to right proposition.
The only way a regex engine will ignore a previous alternation construct
is if it has to satisfy a term on the right side of the alternation group
that cannot be satisfied otherwise.
The regex rule is that the pattern is traversed from left to right,
but is controlled by the target string being traversed from left to right.
The symbiosis ..
Given that the target string was matched like so: "abcd" out of "abcdcdd",
it's easy to assume that the regex subset of the full regex
ab
( c | cd )* # (1)
d
is clearly
ab
c*
d
where the cd term of the alternation to the right was never needed
for a successful match.
This shows that backtracking regex engines have a left-to-right bias.

Regular expression, match anything but these strings

Within Splunk I have a number of field extractions for extracting values from URI stems. A few of them match a specific pattern; I now want another regex which matches anything but these.
^/SiteName/[^/]*/(?<a_request_type>((?!Process)|(?!process)|(?!Assets)|(?!assets))[^/]+)
The regex above is what I have so far. I am expecting the negative lookaheads to prevent it from matching Process, process, assets or Assets. However, it seems that the [^/]+ after these lookaheads can then go ahead and match these strings anyway, resulting in this regex sometimes overriding the other regexes I wrote to accept these strings.
What is the correct syntax for me to make the regex match any string, other than those specified in the negative lookaheads?
Thanks!
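Negative lookaheads do not consume any of the string being searched. When you want multiple negative lookaheads, there is no need to separate them with | (OR). In fact, the alternation defeats them: at any position at least one of the lookaheads succeeds (the text cannot start with both Process and process), and one successful alternative is enough. Placed one after another, all of them must hold. Try this: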
^/SiteName/[^/]*/(?<a_request_type>((?![Pp]rocess)(?![Aa]ssets))[^/]+)
Note that I have combined your lookaheads ([Pp]rocess and [Aa]ssets) to make the regular expression more concise.
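The difference between the two patterns can be checked outside Splunk. The sketch below uses Python's re module, which spells named groups (?P<name>...) rather than (?<name>...); the URI stems and the request_type helper are hypothetical, made up for illustration.

```python
import re

# Pattern from the question: the lookaheads are OR-ed together.
original = re.compile(
    r"^/SiteName/[^/]*/(?P<a_request_type>"
    r"((?!Process)|(?!process)|(?!Assets)|(?!assets))[^/]+)"
)
# Pattern from the answer: the lookaheads are chained, so all must hold.
fixed = re.compile(
    r"^/SiteName/[^/]*/(?P<a_request_type>"
    r"((?![Pp]rocess)(?![Aa]ssets))[^/]+)"
)

def request_type(pattern, uri):
    """Return the captured request type, or None when there is no match."""
    m = pattern.match(uri)
    return m.group("a_request_type") if m else None

print(request_type(original, "/SiteName/foo/Process/x"))
print(request_type(fixed, "/SiteName/foo/Process/x"))
print(request_type(fixed, "/SiteName/foo/Other/x"))
```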

Negation of several characters before pattern

I am trying to create a regex to find the following string:
AGK-XL.
Sometimes before and after this string there are other characters that are usually harmless, except if there is the following pattern before the string:
NOT-
I need to delete/ignore those cases.
This is what I have tried:
^[^N][^O][^T][^\-]AGK-XL\.(\s|\W|$)
But it only seems to match when there are exactly 4 letters in front of the string. How can I express that any other pattern besides NOT- before AGK-XL. is harmless?
Thanks for any hints.
edit: I am using regex in VBA atm.
If you cannot use fancy look-behinds, you can rely on the capturing mechanism: match what you do not want, and match and capture what you do want. See The Best Regex Trick Ever at rexegg.com.
However, in this case, you can match and capture NOT-AGK-XL. (so that you can restore it later with $1 backreference), and only match all other occurrences of AGK-XL. that you will remove. Use alternation operator | to match both alternatives:
(NOT-AGK-XL\.(?!\w))|AGK-XL\.(?!\w)
Note I replaced (\s|\W|$) with (?!\w) that is - IMHO - a better word boundary check.
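The question is about VBA, but the same idea can be sketched in Python for testing. The strip_unprotected helper and the sample text are hypothetical; the replacement function plays the role of the $1 backreference, restoring group 1 when it took part in the match and removing everything else.

```python
import re

pattern = re.compile(r"(NOT-AGK-XL\.(?!\w))|AGK-XL\.(?!\w)")

def strip_unprotected(text):
    # Group 1 is set only for the NOT- form; put it back, drop the rest.
    return pattern.sub(lambda m: m.group(1) or "", text)

sample = "foo AGK-XL. bar NOT-AGK-XL. baz"
print(strip_unprotected(sample))
```

Because the engine scans from left to right, it reaches the N of NOT- before the inner AGK-XL., so the protected form is always consumed as a whole.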

Negative lookahead alternative

For a URL pattern such as this one:
/detail.php?a=BYGhs5w8e9o&b=234844617545&h=9827a
I would like Google Analytics to match only the URLs with the a and b parameters in them:
/orderdetail.php?a=BYGhs5w8e9o&b=234844617545
And thus strip out:
&h=9827a
The main goal is to be able to set up a goal in Google Analytics which covers only the a and b parameters and ignores the h parameter.
Is there an easy way to accomplish this without a negative lookahead?
Standard regular expressions do not need negative lookahead for this. Just do a match and replace. Searching for:
(/detail.php\?a=\w+&b=\w+)&h=\w+
and replacing with \1 works with the regular expressions in Notepad++ version 6.5.5. Google's regular expressions may be subtly different.
The above works by surrounding the wanted text with capturing parentheses and leaving the unwanted part outside. The ? needs escaping because, unescaped, it means the previous item (i.e. the p) is optional. The \w sequence means any "word" character, so \w+ means a run of one or more word characters.
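As a rough check, the same match-and-replace can be run with Python's re.sub. This is a sketch only: the pattern is copied as given above (including its unescaped dot, which matches any character), and Google Analytics' own regex handling may differ.

```python
import re

url = "/detail.php?a=BYGhs5w8e9o&b=234844617545&h=9827a"
pattern = r"(/detail.php\?a=\w+&b=\w+)&h=\w+"

# \1 puts back the captured a and b part; the h parameter is dropped.
cleaned = re.sub(pattern, r"\1", url)
print(cleaned)
```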

Regex for nested matches

Consider the string
cos(t(2))+t(51)
Using a regular expression, I'd like to match cos(t(2)), t(2) and t(51). The general pattern this fits is intended to be something like
variable or function name + opening_parenthesis + contents + closing_parenthesis,
where contents can be any expression that has an equal number of opening and closing parentheses.
I'm using [a-zA-Z]+\([\W\w]*\) which returns cos(t(2))+t(51), i.e. the whole string, which of course is not the desired result.
Any ideas on how to achieve this using regex? I'm particularly stuck at this "equal number of opening and closing parentheses".
Niels, this is an interesting and tricky question because you are looking for overlapping matches. Even with recursion, the task is not trivial.
You asked about any idea how to achieve this with regex, so it sounds like even if this is not available in matlab, you would be interested in seeing an answer that shows you how to do it in regex.
This makes sense to me because tools often change the regex libraries they use. For instance Notepad++, which used to have crippled regex, switched to PCRE in version 6. (As it happens, PCRE would work with this solution.)
In Perl and PCRE, you can use this short regex:
(?=(\b\w+\((?:\d+|(?1))\)))
This will match:
cos(t(2))
t(2)
t(51)
For instance, in PHP, you could use this code (see the results at the bottom of the online demo).
$regex = "~(?=(\b\w+\((?:\d+|(?1))\)))~";
$string = "cos(t(2))+t(51)";
$count = preg_match_all($regex,$string,$matches);
print_r($matches[1]);
How does it work?
To allow overlapping matches, we use a lookahead. That way, after matching cos(t(2)), the engine will position itself NOT after cos(t(2)), but before the o in cos.
In fact the engine does not actually match cos(t(2)) but merely captures it to Group 1. What it matches is the assertion that at this position in the string, looking ahead, we can see x. After matching this assertion, it tries to match it again starting from the next position in the string.
The expression in the lookahead (which describes what we're looking for) is fairly simple: in (\b\w+\((?:\d+|(?1))\)), after the \d+, the alternation | allows us to repeat subroutine number one with (?1), which is to say, the whole expression we are currently within. So we don't recurse the entire regex (which includes a lookahead), but a subexpression thereof.
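Python's built-in re module has no (?1) recursion, so the sketch below swaps in a different technique rather than reproducing the pattern above: a regex finds each name followed by an opening parenthesis, and a depth counter then scans forward to the matching closing parenthesis. The balanced_calls name is hypothetical, and unlike the recursive pattern this version accepts any balanced contents, not only digits or nested calls.

```python
import re

def balanced_calls(text):
    """Return every name(...) substring whose parentheses balance."""
    results = []
    for m in re.finditer(r"[A-Za-z]\w*\(", text):
        depth = 0
        # Start scanning at the opening parenthesis the regex just matched.
        for i in range(m.end() - 1, len(text)):
            if text[i] == "(":
                depth += 1
            elif text[i] == ")":
                depth -= 1
                if depth == 0:
                    results.append(text[m.start():i + 1])
                    break
    return results

print(balanced_calls("cos(t(2))+t(51)"))
```

Each regex match ends at its opening parenthesis, so the name matches never overlap even though the reported substrings do.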