How to order regular expression alternatives to get longest match? - regex

I have a number of regular expressions regex1, regex2, ..., regexN combined into a single regex as regex1|regex2|...|regexN. I would like to reorder the component expressions so that the combined expression gives the longest possible match at the beginning of a given string.
I believe this means reordering the regular expressions such that "if regexK matches a prefix of regexL, then L < K". If this is correct, is it possible to find out, in general, whether regexK can match a prefix of regexL?

Use the right regex flavor!
In some regex flavors, the alternation providing the longest match is the one that is used ("greedy alternation"). Note that most of these regex flavors are old (yet still used today), and thus lack some modern constructs such as back references.
Perl6 is modern (and has many features), yet defaults to the POSIX-style longest alternation. (You can even switch styles, as || creates an alternator that short-circuits to first match.) Note that the :Perl5/:P5 modifier is needed in order to use the "traditional" regex style.
Also, PCRE and the newer PCRE2 have functions that do the same. In PCRE2, it's pcre2_dfa_match. (See the Relevant information about regex engine design section below for more information about DFAs.)
This means you can list the alternatives of an alternation in ANY order and the result will always be the longest match.
(This is different from the "absolute longest" match, as no amount of rearranging the terms in an alternation will change the fact that all regex engines traverse the string left-to-right. With the exception of .NET, apparently, which can go right-to-left. But traversing the string backwards wouldn't guarantee the "absolute longest" match either.) If you really want to find matches at (only) the beginning of a string, you should anchor the expression: ^(regex1|regex2|...).
According to this page*:
The POSIX standard, however, mandates that the longest match be returned. When applying Set|SetValue to SetValue, a POSIX-compliant regex engine will match SetValue entirely.
* Note: I do not have the ability to test every POSIX flavor. Also, some regex flavors (Perl6) have this behavior without being POSIX compliant overall.
Let me give you one specific example that I have verified on my own computer:
echo "ab c a" | sed -E 's/(a|ab)/replacement/'
The regex is (a|ab). When it runs on the string ab c a you get : replacement c a, meaning that you do, in fact, get the longest match that the alternator can provide.
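For contrast, here is a minimal sketch of the same substitution in Python, whose re module is (to my knowledge) a backtracking, leftmost-first engine rather than a POSIX one. The variable names are only illustrative.

```python
import re

subject = "ab c a"

# Leftmost-first: the first alternative that matches wins, so only "a" is replaced.
first_wins = re.sub(r"(a|ab)", "replacement", subject, count=1)

# Listing the longer alternative first reproduces the sed result shown above.
longer_first = re.sub(r"(ab|a)", "replacement", subject, count=1)

print(first_wins)    # replacementb c a
print(longer_first)  # replacement c a
```

In such an engine the order of alternatives matters, which is exactly what a POSIX-style engine spares you.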
This regex, for a more complex example, (a|ab.*c|.{0,2}c*d) applied to abcccd, will return abcccd.
More clarification: the regex engine will not go forward (in the search string) to see if there is an even longer match once it can match something. It will only look through the current list of alternatives to see if another one will match a longer string (from the position where the initial match starts).
In other words, no matter the order of choices in an alternation, POSIX-compliant regexes use the one that matches the most characters.
Other examples of flavors with this behavior:
Tcl ARE
POSIX ERE
GNU BRE
GNU ERE
Relevant information about regex engine design
This question asks about designing an engine, but the answers may be helpful to understand how these engines work. Essentially, DFA-based algorithms determine the common overlap of different expressions, especially those within an alternation. It might be worth checking out this page. It explains how alternatives can be combined into a single path.
Note: at some point, you might just want to consider using an actual programming language. Regexes aren't everything.

Longest Match
Unfortunately, there is no distinct logic to tell a regular expression
engine to get the longest match possible.
Doing so would/could create a cascading backtracking episode gone wild.
It is, by definition a complexity too great to deal with.
All regular expressions are processed from left to right.
Anything the engine can match first it will, then bail out.
This is especially true of alternations, where this|this is|this is here
will always match 'this' first and
will NEVER match 'this is' nor 'this is here'.
Once you realize that, you can reorder the alternation into
this is here|this is|this which gives the longest match every time.
Of course this can be reduced to this(?: is(?: here)?)?
which is the clever way of getting the longest match.
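As a rough illustration in Python (a leftmost-first engine, like Perl), the sketch below shows the effect of alternative order and one way to order purely literal alternatives automatically. The helper name longest_first_alternation is made up for this example, and the length-based ordering is only safe for literal strings, not for arbitrary sub-patterns.

```python
import re

text = "this is here"

# The first alternative that matches wins, regardless of length.
first_wins = re.match(r"this|this is|this is here", text).group()

# Longest alternative first gives the longest match.
reordered = re.match(r"this is here|this is|this", text).group()

# A factored form with nested optional groups gives the same result.
factored = re.match(r"this(?: is(?: here)?)?", text).group()

def longest_first_alternation(literals):
    """Join literal strings into an alternation, longest first.

    A literal that is a proper prefix of another is always shorter,
    so sorting by descending length puts it after the longer one.
    """
    ordered = sorted(literals, key=len, reverse=True)
    return "|".join(re.escape(s) for s in ordered)

auto = re.match(longest_first_alternation(["this", "this is here", "this is"]), text).group()

print(first_wins)  # this
print(reordered)   # this is here
print(factored)    # this is here
print(auto)        # this is here
```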
I haven't seen any examples of the regexes you want to combine,
so this is just some general information.
If you show the regexes you're trying to combine, a better solution could be
provided.
The contents of an alternation affect each other, and whatever precedes or
follows the group can also have an effect on which alternative gets matched.
If you have more questions just ask.
Addendum:
For @Laurel: this could always be done with a Perl 5 regex (>5.10)
because Perl can run code from within regex sub-expressions.
Since it can run code, it can count and get the longest match.
The rule of leftmost first, however, will never change.
If regex were thermodynamics, this would be the first law.
Perl is a strange entity as it tries to create a synergy between regex
and code execution.
As a result, it is possible to overload its operators, to inject
customization into the language itself.
Their regex engine is no different, and can be customized the same way.
So, in theory, the regex below can be made into a regex construct,
a new Alternation construct.
I won't go into details here, but suffice it to say, it's not for the faint of heart.
If you're interested in this type of thing, see the perlre manpage under
section 'Creating Custom RE Engines'
Perl:
Note - The regex alternation form is based on @Laurel's complex example
(a|ab.*c|.{0,2}c*d) applied to abcccd.
Visually, if made into a custom regex construct, would look similar to
an alternation (?:rx1||rx2||rx3) and I'm guessing this is how a lot of
Perl6 is done in terms of integrating regex engine directly into the language.
Also, if used as is, it's possible to construct this regex dynamically as needed.
And note that all the richness of Perl regex constructs are available.
Output
Longest Match Found: abcccd
Code
use strict;
use warnings;
my ($p1,$p2,$p3) = (0,0,0);
my $targ = 'abcccd';
# Formatted using RegexFormat7 (www.regexformat.com)
if ( $targ =~
/
# The Alternation Construct
(?=
( a ) # (1)
(?{ $p1 = length($^N) })
)?
(?=
( ab .* c ) # (2)
(?{ $p2 = length($^N) })
)?
(?=
( .{0,2} c*d ) # (3)
(?{ $p3 = length($^N) })
)?
# Check At Least 1 Match: fail only if none of groups 1-3 matched
(?(1)
| (?(2)
| (?(3)
| (?!)
)
)
)
# Consume Longest Alternation Match
( # (4 start)
(?(?{
$p1>=$p2 && $p1>=$p3
})
\1
| (?(?{
$p2>=$p1 && $p2>=$p3
})
\2
| (?(?{
$p3>=$p1 && $p3>=$p2
})
\3
)
)
)
) # (4 end)
/x ) {
print "Longest Match Found: $4\n";
} else {
print "Did not find a match!\n";
}

Admittedly, a human may be able to judge in some cases whether two given regexps match prefixes of each other. In general this is an NP-complete problem. So don't try.
In the best case, combining the different regexps into a single one will cheaply give a suitable result. However, I'm not aware of any algorithm that can take two arbitrary regexps and combine them in such a way that the resulting regexp still matches what either of the two would match. That would be NP-complete as well.
You must also not rely on the ordering of alternatives. This depends on the internal execution logic of the regexp engine, which may well reorder the alternatives internally, beyond your control. So an ordering that is valid with the current engine might give wrong results with a different engine. (It could help as long as you stay with a single regexp engine implementation.)
Best approach seems to me to simply execute all regexp, keep track of the matched length and then take the longest match.
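A minimal sketch of that approach in Python, assuming the component expressions are available as separate pattern strings. The function name longest_match is invented for this example. Note that each component still reports whatever its own engine finds at that position, which for a backtracking engine is not necessarily the longest string that component could match.

```python
import re

def longest_match(patterns, text, pos=0):
    """Try every pattern at position pos and return the longest match, or None."""
    best = None
    for pattern in patterns:
        m = re.compile(pattern).match(text, pos)
        # Keep the match that ends furthest to the right; ties keep the earlier pattern.
        if m is not None and (best is None or m.end() > best.end()):
            best = m
    return best

patterns = [r"a", r"ab.*c", r".{0,2}c*d"]
result = longest_match(patterns, "abcccd")
print(result.group())  # abcccd
```

Because every pattern is tried, the order of the list does not change the length of the result.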

Related

Conditional regular expression with one section dependent on the result of another section of the regex

Is it possible to design a regular expression in a way that a part of it is dependent on another section of the same regular expression?
Consider the following example:
(ABCHEHG)[HGE]{5,1230}(EEJOPK)[DM]{5}
I want to continue this regex, and at some point I will have a section where the result of that section should depend on the result of [DM]{5}.
For example, D will be complemented by C, and M will be complemented by N.
(ABCHEHG)[HGHE]{5,1230}(EEJOPK)[DM]{5}[ACF]{1,1000}(BBBA)[CU]{2,5}[D'M']{5}
By D' I mean C, and by M' I mean N.
So a resulting string that matches the above regex, if it has DDDMM matching to the section [DM]{5}, it should necessarily have CCCNN matching to [D'M']{5}. Therefore, the result of [D'M']{5} always depends on [DM]{5}, or in other words, what matches to [DM]{5} always dictates what will match to [D'M']{5}.
Is it possible to do such a thing with regex?
Please note that, in this example I have extremely over-simplified the problem. The regex pattern I currently have is really much more complex and longer and my actual pattern includes about 5-6 of such dependent sections.
I cannot think of a way you can do this in pure regex. I would run 2 regex expressions. The first regex to extract the [DM]{5} string, such as
(ABCHEHG)[HGHE]{5,1230}(EEJOPK)[DM]{5}
And take the last 5 characters. Now replace the characters, for example in C# it would be result = result.Substring(result.Length - 5, 5).Replace('D', 'C').Replace('M', 'N'), and then concatenate like
(ABCHEHG)[HGHE]{5,1230}(EEJOPK)[DM]{5}[ACF]{1,1000}(BBBA)[CU]{2,5} + result
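The same two-step idea can be sketched in Python. The names PREFIX, COMPLEMENT and match_dependent are made up for this example, and the sketch assumes the first stage has only one way to match, since the second stage cannot make the first stage backtrack.

```python
import re

# Stage 1: everything up to and including the section that dictates the rest.
PREFIX = re.compile(r"ABCHEHG[HGHE]{5,1230}EEJOPK([DM]{5})")

# D is complemented by C, and M by N.
COMPLEMENT = str.maketrans("DM", "CN")

def match_dependent(text):
    """Return the matched string if the dependent section is the complement, else None."""
    first = PREFIX.match(text)
    if first is None:
        return None
    expected = first.group(1).translate(COMPLEMENT)
    # Stage 2: build the rest of the pattern from what stage 1 captured.
    tail = re.compile(r"[ACF]{1,1000}BBBA[CU]{2,5}" + re.escape(expected))
    second = tail.match(text, first.end())
    if second is None:
        return None
    return text[first.start():second.end()]

good = "ABCHEHG" + "HGHEH" + "EEJOPK" + "DDDMM" + "A" + "BBBA" + "CU" + "CCCNN"
bad = "ABCHEHG" + "HGHEH" + "EEJOPK" + "DDDMM" + "A" + "BBBA" + "CU" + "CCCNC"
print(match_dependent(good) == good)  # True
print(match_dependent(bad))           # None
```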
This is pretty easy to do in Perl:
m{
ABCHEHG
[HGHE]{5,1230}
EEJOPK
( [DM]{5} )
[ACF]{1,1000}
BBBA
[CU]{2,5}
(??{ $1 =~ tr/DM/CN/r })
}x
I've added the x modifier and whitespace for better readability. I've also removed the capturing groups around the fixed strings (they're fixed strings; you already know what they're going to capture).
The crucial part is that we capture the string that was actually matched by [DM]{5} (in $1), which we then use at the end to dynamically generate a subpattern by replacing all D by C and M by N in $1.
This sounds like bioinformatics in Python. Do 2-stage filtering, at the regex level and at the app level.
Wildcard the DM portions, so the regex is permissive in what it accepts. Bury the regex in a token generator that yields several matching sections. Have your app iterate through the generator's results, discarding any result rejected by your business logic, such as finding that one token is not the complement of another token.
Alternatively, you might push some of that work down into a complex generated regex, which likely will perform worse and will be harder to debug. Your DDDMM example might be summarized as D+M+, or [DM]+, not sure if sequence matters. The complement might be C+N+, or [CN]+. Apparently there's two cases here. So start assembling a regex: stuff1 [DM]+ stuff2 [CN]+ stuff3. Then tack on '|' for alternation, and tack on the other case: stuff1 [CN]+ stuff2 [DM]+ stuff3 (or factor out suffix and prefix so alternation starts after stuff1). I can't imagine you'll be happy with such an approach, as the combinatorics get ugly, and the regex engine is forced to do lots of scanning and backtracking. And recompiling additional regexes on the fly doesn't come for free. Instead you should use the regex engine for the simple things that it's good at, and delegate complex business logic decisions to your app.
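A rough Python sketch of the two-stage filtering, under the assumption that the dependent sections have a fixed length of 5 as in the question. PERMISSIVE, candidates and valid_matches are hypothetical names. One caveat: the regex engine commits to a single way of splitting the text, so where a boundary is ambiguous (here [CU]{2,5} followed by [CN]{5}) a valid string could in principle be rejected.

```python
import re

# Permissive pattern: the second section accepts any mix of C and N.
PERMISSIVE = re.compile(
    r"ABCHEHG[HGHE]{5,1230}EEJOPK([DM]{5})[ACF]{1,1000}BBBA[CU]{2,5}([CN]{5})"
)
COMPLEMENT = str.maketrans("DM", "CN")

def candidates(text):
    """Yield (first section, second section, whole match) for every permissive match."""
    for m in PERMISSIVE.finditer(text):
        yield m.group(1), m.group(2), m.group(0)

def valid_matches(text):
    """Keep only the candidates whose second section complements the first."""
    for first, second, whole in candidates(text):
        if second == first.translate(COMPLEMENT):
            yield whole

good = "ABCHEHGHGHEHEEJOPKDDDMMABBBACUCCCNN"
bad = "ABCHEHGHGHEHEEJOPKDDDMMABBBACUCCCNC"
print(list(valid_matches(good)) == [good])  # True
print(list(valid_matches(bad)))             # []
```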

How to invert an arbitrary Regex expression

This question sounds like a duplicate, but I've looked at a LOT of similar questions, and none fit the bill, either because they restrict their question to a very specific example or to a specific use case (e.g. single chars only), or because you need substitution for a successful approach, or because you'd need to use a programming language (e.g. C#'s split, or Match().Value).
I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
For example, let's say I want to find the reverse of the Regex "over" in "The cow jumps over the moon", it would match The cow jumps and also match the moon.
That's only a simple example of course. The Regex could be something more messy such as "o.*?m", in which case the matches would be: The c, ps, and oon.
Here is one possible solution I found after ages of hunting. Unfortunately, it requires the use of substitution in the replace field which I was hoping to keep clear. Also, everything else is matched, but only a character by character basis instead of big chunks.
Just to stress again, the answer should be general-purpose for any arbitrary Regex, and not specific to any particular example.
From post: I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
The answer -
A match is Not Discontinuous, it is continuous !!
Each match is a continuous, unbroken substring. So, within each match there
is no skipping anything within that substring. Whatever matched the
regular expression is included in a particular match result.
So within a single Match, there is no inverting (i.e. match not this only) that can extend past
a negative thing.
This is a tenet of regular expressions.
Further, in this case, since you only want all things NOT something, you have
to consume that something in the process.
This is easily done by just capturing what you want.
So, even with multiple matches, it's not good enough to say (?:(?!\bover\b).)+
because even though it will match up to (but not) over, on the next match
it will match ver ....
There are ways to avoid this that are tedious, requiring variable length lookbehinds.
But, the easiest way is to match up to over, then over, then the rest.
Several constructs can help. One is \K.
Unfortunately, there is no magical recipe to negate a pattern.
As you mentioned in your question, when you have an efficient pattern that you use with a match method, the easiest (and most efficient) way to obtain the complement is to use a split method with the same pattern.
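As a sketch of the split idea in Python: the hypothetical helper below collects the gaps between successive matches instead of calling re.split directly, because re.split also returns the contents of any capture groups in the pattern.

```python
import re

def unmatched_parts(pattern, text):
    """Return the non-empty pieces of text that lie between matches of pattern."""
    parts = []
    last = 0
    for m in re.finditer(pattern, text):
        if m.start() > last:
            parts.append(text[last:m.start()])
        last = m.end()
    if last < len(text):
        parts.append(text[last:])
    return parts

sentence = "The cow jumps over the moon"
print(unmatched_parts(r"over", sentence))   # ['The cow jumps ', ' the moon']
print(unmatched_parts(r"o.*?m", sentence))  # ['The c', 'ps ', 'oon']
```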
To do it with the pattern itself, workarounds are:
1. consuming the characters that match the pattern
"other content" is the content until the next pattern or the end of the string.
alternation + capture group:
(pattern)|other content
Then you must check if the capture group exists to know which part of the alternation succeeds.
"other content" can be for example described in this way: .*?(?=pattern|$)
With PCRE and Perl, you can use backtracking control verbs to avoid the capture group, but the idea is the same:
pattern(*SKIP)(*FAIL)|other content
With this variant, you don't need to check anything after, since the first branch is forced to fail.
or without alternation:
((?:pattern)*)(other content)
variant in PCRE, Perl, or Ruby with the \K feature:
(?:pattern)*\Kother content
Where \K removes all on the left from the match result.
2. checking characters of the string one by one
(?:(?!pattern).)*
While this approach is very simple to write (if lookahead is available), it has the drawback of being slow, since every position of the string is tested with the lookahead.
The number of lookahead tests can be reduced if you can use the first character of the pattern (let's say "a"):
[^a]*(?:(?!pattern)a[^a]*)*
3. list all that is not the pattern.
using character classes
Let's say your pattern is /hello/:
([^h]|h(([^eh]|$)|e(([^lh]|$)|l(([^lh]|$)|l([^oh]|$))))*
This approach quickly becomes tedious when the pattern is long, but it can be useful for regex flavors that don't have many features, like POSIX regex.
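Workaround 1 (alternation plus capture group) can be sketched in Python as follows. The function name invert_matches is made up, \Z is used as the end-of-string anchor in place of $, and wrapping the pattern in an extra group shifts the numbering of any groups or backreferences inside it, so treat this as an illustration rather than a drop-in tool.

```python
import re

def invert_matches(pattern, text):
    """Return the chunks of text not matched by pattern, via (pattern)|other content."""
    combined = re.compile(
        rf"({pattern})|.+?(?=(?:{pattern})|\Z)",
        re.DOTALL,
    )
    # Group 1 is set only when the pattern branch matched; keep the other branch.
    return [m.group(0) for m in combined.finditer(text) if m.group(1) is None]

sentence = "The cow jumps over the moon"
print(invert_matches(r"over", sentence))   # ['The cow jumps ', ' the moon']
print(invert_matches(r"o.*?m", sentence))  # ['The c', 'ps ', 'oon']
```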

greedy operator in regular expression is not working in Tcl 8.5

See this simple regexp code:
puts [ regexp -inline {^\-\-\S+?=\S+} "--tox=9.0" ]
The output is:
>--tox=9
It would seem that the second \S+ is being non-greedy! Only 1 character is being matched.
In Perl, one can see that the result is as I expected; see the one-line output:
perl -e '"--tox=9.0" =~/(^\-\-\S+?=\S+)/ ; print "${1}\n"'
--tox=9.0
How can I get the Perl behaviour in Tcl?
This is an inherent 'feature' of Tcl's regexp implementation. For instance, the below is from Henry Spencer (the one who did most if not all of Tcl's regexp work I believe)
It is very difficult to come up with an entirely satisfactory
definition of the behavior of mixed-greediness regular expressions.
Perl doesn't try: the Perl "specification" is a description of the
implementation, an inherently low-performance approach involving
trying one match at a time. This is unsatisfactory for a number of
reasons, not least being that it takes several pages of text merely to
describe it. (That implementation and its description are distant,
mutated descendants of one of my earlier regexp packages, so I share
some of the blame for this.)
When all quantifiers are greedy, the Tcl 8.2 regexp matches the
longest possible match (as specified in the POSIX standard's
regular-expression definition). When all are non-greedy, it matches
the shortest possible match. Neither of these desirable statements is
true of Perl.
The trouble is that it is very, very hard to write a generalization of
those statements which covers mixed-greediness regular expressions --
a proper, implementation-independent definition of what
mixed-greediness regular expressions should match -- and makes them
do "what people expect". I've tried. I'm still trying. No luck so
far.
The rules in the Tcl 8.2 regexp, which basically give the whole regexp
a long/short preference based on its subexpressions, are the best I've
come up with so far. The code implements them accurately. I agree
that they fall short of what's really wanted. It's trickier than it
looks.
Basically, expressions with mixed greedy and non-greedy quantifiers impact both the simplicity of the implementation and the performance. So, the implementation makes it so that the first 'type' of quantifier is passed on to all other quantifiers.
In other words, if the first quantifier is greedy, all the others will be greedy. If the first is non-greedy, all the others will be non-greedy. And therefore, you cannot force a Tcl regexp to work like a Perl regexp (or maybe you can through exec and using the bash command version of perl, but I'm not familiar with this).
I would advise using negated classes and/or anchors instead of non-greedy.
Since I don't know the exact context of your question, I won't provide an alternative regexp, because that will depend on whether this is really the whole string you are trying to make a match on.
The Tcl regular expression engine is an automata-theoretic one instead of a stack-based one, so it has a very different approach to matching mixed greediness REs. In particular, for the sort of RE you're talking about, that will be interpreted as entirely non-greedy.
The simplest method of fixing this is to use a different RE. Remembering that \S is just a shorthand for [^\s], we can do this (excluding = from the first part):
puts [ regexp -inline {^--[^\s=]+=\S+} "--tox=9.0" ]
(I also changed \- to - as it's not a special character in Tcl's REs.)
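As a quick cross-check outside Tcl, here is a small Python sketch. It assumes Python's re module handles mixed greediness the way Perl does, being a backtracking engine. The point is only that the negated-class form gives the expected result without depending on how an engine treats a lazy quantifier followed by a greedy one.

```python
import re

option = "--tox=9.0"

# Mixed greediness: the form whose result differs between Tcl and Perl.
lazy_then_greedy = re.match(r"^--\S+?=\S+", option).group()

# Negated class: no lazy quantifier needed, so the result is not engine-dependent in this way.
negated_class = re.match(r"^--[^\s=]+=\S+", option).group()

print(lazy_then_greedy)  # --tox=9.0
print(negated_class)     # --tox=9.0
```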
The answer can be found here:
Unfortunately, the answer is that to get the same answer Perl gives,
you have to use Perl's exact regexp implementation.
In your case, I'd use both anchors, ^ and $:
puts [ regexp -inline {^\-\-\S+?=\S+$} "--tox=9.0" ]
The result is: --tox=9.0

How do I match a pattern with optional surrounding quotes?

How would one write a regex that matches a pattern that can contain quotes, but if it does, must have matching quotes at the beginning and end?
"?(pattern)"?
Will not work because it will allow patterns that begin with a quote but don't end with one.
"(pattern)"|(pattern)
Will work, but is repetitive. Is there a better way to do that without repeating the pattern?
You can get a solution without repeating by making use of backreferences and conditionals:
/^(")?(pattern)(?(1)\1|)$/
Matches:
pattern
"pattern"
Doesn't match:
"pattern
pattern"
This pattern is somewhat complex, however. It first looks for an optional quote, and puts it into backreference 1 if one is found. Then it searches for your pattern. Then it uses conditional syntax to say "if backreference 1 is found again, match it, otherwise match nothing". The whole pattern is anchored (which means that it needs to appear by itself on a line) so that unmatched quotes won't be captured (otherwise the pattern in pattern" would match).
Note that support for conditionals varies by engine and the more verbose but repetitive expressions will be more widely supported (and likely easier to understand).
Update: A much simpler version of this regex would be /^(")?(pattern)\1$/, which does not need a conditional. When I was testing this initially, the tester I was using gave me a false negative, which led me to discount it (oops!).
I'll leave the solution with the conditional up for posterity and interest, but this is a simpler version that is more likely to work in a wider variety of engines (backreferences are the only feature being used here which might be unsupported).
This is quite simple as well: (".+"|.+). Make sure the first match is with quotes and the second without.
Depending on the language you're using, you should be able to use backreferences. Something like this, say:
(["'])(pattern)\1|^(pattern)$
That way, you're requiring that either there are no quotes, or that the SAME quote is used on both ends.
This should work with recursive regex (which needs longer to get right). In the meantime: in Perl, you can build a self-modifying regex. I'll leave that as an academic example ;-)
my @stuff = ( '"pattern"', 'pattern', 'pattern"', '"pattern' );
foreach (@stuff) {
print "$_ OK\n" if /^
(")?
\w+
(??{defined $1 ? '"' : ''})
$
/x
}
Result:
"pattern" OK
pattern OK
Generally @Daniel Vandersluis's response would work. However, some regex engines do not treat the optional group (")? as set when it did not participate in the match, so the backreference \1 then fails to match.
In order to avoid this problem a more robust solution would be:
/^("|)(pattern)\1$/
Then the compiler will always detect the first group. This expression can also be modified if there is some prefix in the expression and you want to capture it first:
/^(key)=("|)(value)\2$/
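Python's re module appears to be one of the engines where this matters, which makes it a convenient place to compare the variants discussed in this thread. In the sketch below, CANDIDATES and the helper accepted are names invented for the example.

```python
import re

CANDIDATES = ['pattern', '"pattern"', '"pattern', 'pattern"']

def accepted(regex):
    """Return the candidates that the given regex matches."""
    compiled = re.compile(regex)
    return [s for s in CANDIDATES if compiled.match(s)]

# Conditional: require the closing quote only if the opening one was captured.
conditional = accepted(r'^(")?(pattern)(?(1)\1|)$')

# Empty alternative: group 1 always participates, possibly matching nothing.
empty_alternative = accepted(r'^("|)(pattern)\1$')

# Optional group: in Python, \1 fails when group 1 did not participate.
optional_group = accepted(r'^(")?(pattern)\1$')

print(conditional)        # ['pattern', '"pattern"']
print(empty_alternative)  # ['pattern', '"pattern"']
print(optional_group)     # ['"pattern"']
```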

The Greedy Option of Regex is really needed?

Let's say I have the following text, and I'd like to extract the text inside the [Optionx] and [/Optionx] blocks:
[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]
But with the Regex Greedy option, it gives me:
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
Does anybody need that behavior? If yes, could you let me know?
If I understand correctly, the question is “why (when) do you need greedy matching?”
The answer is – almost always. Consider a regular expression that matches a sequence of arbitrary – but equal – characters, of length at least two. The regular expression would look like this:
(.)\1+
(\1 is a back-reference that matches the same text as the first parenthesized expression).
Now let’s search for repeats in the following string: abbbbbc. What do we find? Well, if we didn’t have greedy matching, we would find bb. Probably not what we want. In fact, in most applications we would be interested in finding the whole substring of bs, bbbbb.
By the way, this is a real-world example: the RLE compression works like that and can be easily implemented using regex.
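A toy sketch of that idea in Python. The function name rle_encode is made up, and the output format (count followed by the character) is an arbitrary choice for illustration that becomes ambiguous if the input itself contains digits.

```python
import re

# Greedy versus lazy repetition of the back-reference.
greedy_run = re.search(r"(.)\1+", "abbbbbc").group()
lazy_run = re.search(r"(.)\1+?", "abbbbbc").group()

def rle_encode(text):
    """Replace each run of identical characters with its length and the character."""
    return re.sub(r"(.)\1*", lambda m: f"{len(m.group(0))}{m.group(1)}", text)

print(greedy_run)             # bbbbb
print(lazy_run)               # bb
print(rle_encode("abbbbbc"))  # 1a5b1c
```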
In fact, if you examine regular expressions all around you will see that a lot of them use quantifiers and expect them to behave greedily. The opposite case is probably a minority. Often, it makes no difference because the searched expression is inside guard clauses (e.g. a quoted string is inside the quote marks) but like in the example above, that’s not always the case.
Regular expressions can potentially match multiple portions of a text.
For example consider the expression (ab)*c+ and the string "abccababccc". There are many portions of the string that can match the regular expressions:
(abc)cababccc
(abcc)ababccc
abcc(ababccc)
abccab(abccc)
ab(c)cababccc
ab(cc)ababccc
abccabab(c)cc
....
Some regular expression implementations are actually able to return the entire set of matches, but it is most common to return a single match.
There are many possible ways to determine the "winning match". The most common one is to take the "longest leftmost match" which results in the greedy behaviour you observed.
This is typical of search and replace (à la grep), where with a+ you probably mean to match the entire aaaa rather than just a single a.
Choosing the "shortest non-empty leftmost" match is the usual non-greedy behaviour. It is the most useful when you have delimiters like your case.
It all depends on what you need, sometimes greedy is ok, some other times, like the case you showed, a non-greedy behaviour would be more meaningful. It's good that modern implementations of regular expressions allow us to do both.
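For the text in the question, the difference can be sketched in Python like this. SAMPLE reproduces the question's input, the exact tag pattern \[Option\d+\] is an assumption about what the real tags look like, and re.DOTALL is needed so that the dot also matches newlines.

```python
import re

SAMPLE = (
    "[Option1]\nStart=1\nEnd=10\n[/Option1]\n"
    "[Option2]\nStart=11\nEnd=20\n[/Option2]\n"
)

# Greedy: .+ runs to the LAST closing tag, swallowing the tags in between.
greedy = re.findall(r"\[Option\d+\](.+)\[/Option\d+\]", SAMPLE, re.DOTALL)

# Non-greedy: .+? stops at the FIRST closing tag after each opening tag.
lazy = re.findall(r"\[Option\d+\](.+?)\[/Option\d+\]", SAMPLE, re.DOTALL)

print(len(greedy))                    # 1
print([body.strip() for body in lazy])  # ['Start=1\nEnd=10', 'Start=11\nEnd=20']
```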
If you're looking for text between the optionx blocks, instead of searching for .+, search for anything that's not "[/".
This is really rough, but works:
\[[^\]]+]([^(\[/)]+)
The first bit searches for anything in square brackets, then the second bit searches for anything that isn't "[/" (strictly, the character class excludes each of the characters (, [, / and ) individually). That way you don't have to care about greediness, just tell it what you don't want to see.
One other consideration: In many cases, greedy and non-greedy quantifiers result in the same match, but differ in performance:
With a non-greedy quantifier, the regex engine needs to backtrack after every single character that was matched until it finally has matched as much as it needs to. With a greedy quantifier, on the other hand, it will match as much as possible "in one go" and only then backtrack as much as necessary to match any following tokens.
Let's say you apply a.*c to
abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc. This finds a match in 5 steps of the regex engine. Now apply a.*?c to the same string. The match is identical, but the regex engine needs 101 steps to arrive at this conclusion.
On the other hand, if you apply a.*c to abcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, it takes 101 steps whereas a.*?c only takes 5.
So if you know your data, you can tailor your regex to match it as efficiently as possible.
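Just use this algorithm, which you can write in your favourite language. No regex needed. In Python, for example (options.txt is a placeholder file name):
in_block = False
with open("options.txt") as f:  # placeholder path; substitute your own file
    for line in f:
        if "[/Option" in line:
            in_block = False  # closing tag: stop collecting
        if "[Option" in line:
            in_block = True  # opening tag: start collecting from the next line
            continue
        if in_block:
            print(line.strip())
            # you can store the values of each option in this part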