How do I match a pattern with optional surrounding quotes? - regex

How would one write a regex that matches a pattern that can contain quotes, but if it does, must have matching quotes at the beginning and end?
"?(pattern)"?
Will not work because it will allow patterns that begin with a quote but don't end with one.
"(pattern)"|(pattern)
Will work, but is repetitive. Is there a better way to do that without repeating the pattern?

You can get a solution without repeating by making use of backreferences and conditionals:
/^(")?(pattern)(?(1)\1|)$/
Matches:
pattern
"pattern"
Doesn't match:
"pattern
pattern"
This pattern is somewhat complex, however. It first looks for an optional quote, and puts it into backreference 1 if one is found. Then it searches for your pattern. Then it uses conditional syntax to say "if backreference 1 is found again, match it, otherwise match nothing". The whole pattern is anchored (which means that it needs to appear by itself on a line) so that unmatched quotes won't be captured (otherwise the pattern in pattern" would match).
Note that support for conditionals varies by engine and the more verbose but repetitive expressions will be more widely supported (and likely easier to understand).
Update: A much simpler version of this regex would be /^(")?(pattern)\1$/, which does not need a conditional. When I was testing this initially, the tester I was using gave me a false negative, which lead me to discount it (oops!).
I'll leave the solution with the conditional up for posterity and interest, but this is a simpler version that is more likely to work in a wider variety of engines (backreferences are the only feature being used here which might be unsupported).

This is quite simple as well: (".+"|.+). Make sure the first match is with quotes and the second without.

Depending on the language you're using, you should be able to use backreferences. Something like this, say:
(["'])(pattern)\1|^(pattern)$
That way, you're requiring that either there are no quotes, or that the SAME quote is used on both ends.

This should work with recursive regex (which needs longer to get right). In the meantime: in Perl, you can build a self-modifying regex. I'll leave that as an academic example ;-)
my #stuff = ( '"pattern"', 'pattern', 'pattern"', '"pattern' );
foreach (#stuff) {
print "$_ OK\n" if /^
(")?
\w+
(??{defined $1 ? '"' : ''})
$
/x
}
Result:
"pattern" OK
pattern OK

Generally #Daniel Vandersluis response would work. However, some compilers do not recognize the optional group (") if it is empty, therefore they do not detect the back reference \1.
In order to avoid this problem a more robust solution would be:
/^("|)(pattern)\1$/
Then the compiler will always detect the first group. This expression can also be modified if there is some prefix in the expression and you want to capture it first:
/^(key)=("|)(value)\2$/

Related

How to order regular expression alternatives to get longest match?

I have a number of regular expressions regex1, regex2, ..., regexN combined into a single regex as regex1|regex2|...|regexN. I would like to reorder the component expressions so that the combined expression gives the longest possible match at the beginning of a given string.
I believe this means reordering the regular expressions such that "if regexK matches a prefix of regexL, then L < K". If this is correct, is it possible to find out, in general, whether regexK can match a prefix of regexL?
Use the right regex flavor!
In some regex flavors, the alternation providing the longest match is the one that is used ("greedy alternation"). Note that most of these regex flavors are old (yet still used today), and thus lack some modern constructs such as back references.
Perl6 is modern (and has many features), yet defaults to the POSIX-style longest alternation. (You can even switch styles, as || creates an alternator that short-circuits to first match.) Note that the :Perl5/:P5 modifier is needed in order to use the "traditional" regex style.
Also, PCRE and the newer PCRE2 have functions that do the same. In PCRE2, it's pcre2_dfa_match. (See my section Relevant info about regex engine design section for more information about DFAs.)
This means, you can have ANY order of statements in a pipe and the result will always be the longest.
(This is different from the "absolute longest" match, as no amount of rearranging the terms in an alternation will change the fact that all regex engines traverse the string left-to-right. With the exception of .NET, apparently, which can go right-to-left. But traversing the string backwards wouldn't guarantee the "absolute longest" match either.) If you really want to find matches at (only) the beginning of a string, you should anchor the expression: ^(regex1|regex2|...).
According to this page*:
The POSIX standard, however, mandates that the longest match be returned. When applying Set|SetValue to SetValue, a POSIX-compliant regex engine will match SetValue entirely.
* Note: I do not have the ability to test every POSIX flavor. Also, some regex flavors (Perl6) have this behavior without being POSIX compliant overall.
Let me give you one specific example that I have verified on my own computer:
echo "ab c a" | sed -E 's/(a|ab)/replacement/'
The regex is (a|ab). When it runs on the string ab c a you get : replacement c a, meaning that you do, in fact, get the longest match that the alternator can provide.
This regex, for a more complex example, (a|ab.*c|.{0,2}c*d) applied to abcccd, will return abcccd.
Try it here!
More clarification: the regex engine will not go forward (in the search string) to see if there is an even longer match once it can match something. It will only look through the current list of alterations to see if another one will match a longer string (from the position where the initial match starts).
In other words, no matter the order of choices in an alteration, POSIX compliant regexes use the one that matches the most characters.
Other examples of flavors with this behavior:
Tcl ARE
POSIX ERE
GNU BRE
GNU ERE
Relevant information about regex engine design
This question asks about designing an engine, but the answers may be helpful to understand how these engines work. Essentially, DFA-based algorithms determine the common overlap of different expressions, especially those within an alternation. It might be worth checking out this page. It explains how alternatives can be combined into a single path:
Note: at some point, you might just want to consider using an actual programming language. Regexes aren't everything.
Longest Match
Unfortunately, there is no distinct logic to tell a regular expression
engine to get the longest match possible.
Doing so would/could create a cascading backtracking episode gone wild.
It is, by definition a complexity too great to deal with.
All regular expressions are processed from left to right.
Anything the engine can match first it will, then bail out.
This is especially true of alternations, where this|this is|this is here
will always match 'this is here' first and
will NEVER ever match this is nor this is here
Once you realize that, you can reorder the alternation into
this is here|this is|this which gives the longest match every time.
Of course this can be reduced to this(?:(?: is)? here)?
which is the clever way of getting the longest match.
Haven't seen any examples of the regex's you want to combine,
so this is just some general information.
If you show the regexes you're trying to combine, better solution could be
provided.
Alternation contents do affect each other, as well as whatever precedes or
follows the cluster can have an affect on which alternation gets matched.
If you have more questions just ask.
Addendum:
For #Laurel. This could always be done with a Perl 5 regex (>5.10)
because Perl can run code from within regex sub-expressions.
Since it can run code, it can count and get the longest match.
The rule of leftmost first, however, will never change.
If regex were thermodynamics, this would be the first law.
Perl is a strange entity as it tries to create a synergy between regex
and code execution.
As a result, it is possible to overload it's operators, to inject
customization into the language itself.
Their regex engine is no different, and can be customized the same way.
So, in theory, the regex below can be made into a regex construct,
a new Alternation construct.
I won't go into detail's here, but suffice it to say, it's not for the faint at heart.
If you're interested in this type of thing, see the perlre manpage under
section 'Creating Custom RE Engines'
Perl:
Note - The regex alternation form is based on #Laurel complex example
(a|ab.*c|.{0,2}c*d) applied to abcccd.
Visually, if made into a custom regex construct, would look similar to
an alternation (?:rx1||rx2||rx3) and I'm guessing this is how a lot of
Perl6 is done in terms of integrating regex engine directly into the language.
Also, if used as is, it's possible to construct this regex dynamically as needed.
And note that all the richness of Perl regex constructs are available.
Output
Longest Match Found: abcccd
Code
use strict;
use warnings;
my ($p1,$p2,$p3) = (0,0,0);
my $targ = 'abcccd';
# Formatted using RegexFormat7 (www.regexformat.com)
if ( $targ =~
/
# The Alternation Construct
(?=
( a ) # (1)
(?{ $p1 = length($^N) })
)?
(?=
( ab .* c ) # (2)
(?{ $p2 = length($^N) })
)?
(?=
( .{0,2} c*d ) # (3)
(?{ $p3 = length($^N) })
)?
# Check At Least 1 Match
(?(1)
(?(2)
(?(3)
| (?!)
)
)
)
# Consume Longest Alternation Match
( # (4 start)
(?(?{
$p1>=$p2 && $p1>=$p3
})
\1
| (?(?{
$p2>=$p1 && $p2>=$p3
})
\2
| (?(?{
$p3>=$p1 && $p3>=$p2
})
\3
)
)
)
) # (4 end)
/x ) {
print "Longest Match Found: $4\n";
} else {
print "Did not find a match!\n";
}
For sure a human might be able judging whther two given regexp are matching prefixes for some cases. In general this is an n-p-complete problem. So don't try.
In the best case combining the different regexp into a single one will give a suitable result cheap. However, I'm not aware of any algorithm that can take two arbitrary regexp and combine them in a way that the resulting regexp is still matching what any of the two would match. It would be n-p-complete also.
You must also not rely on ordering of alternatives. This depends on the internal execution logic of the regexp engine. It could easily be that this is reordering the alternatives internally beyond your control. So, a valid ordering with current engine mmight give wrong results with a different engine. (So, it could help as long as you stay with a single regexp engine implementation)
Best approach seems to me to simply execute all regexp, keep track of the matched length and then take the longest match.

How to invert an arbitrary Regex expression

This question sounds like a duplicate, but I've looked at a LOT of similar questions, and none fit the bill either because they restrict their question to a very specific example, or to a specific usercase (e.g: single chars only) or because you need substitution for a successful approach, or because you'd need to use a programming language (e.g: C#'s split, or Match().Value).
I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
For example, let's say I want to find the reverse of the Regex "over" in "The cow jumps over the moon", it would match The cow jumps and also match the moon.
That's only a simple example of course. The Regex could be something more messy such as "o.*?m", in which case the matches would be: The c, ps, and oon.
Here is one possible solution I found after ages of hunting. Unfortunately, it requires the use of substitution in the replace field which I was hoping to keep clear. Also, everything else is matched, but only a character by character basis instead of big chunks.
Just to stress again, the answer should be general-purpose for any arbitrary Regex, and not specific to any particular example.
From post: I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
The answer -
A match is Not Discontinuous, it is continuous !!
Each match is a continuous, unbroken substring. So, within each match there
is no skipping anything within that substring. Whatever matched the
regular expression is included in a particular match result.
So within a single Match, there is no inverting (i.e. match not this only) that can extend past
a negative thing.
This is a Tennant of Regular Expressions.
Further, in this case, since you only want all things NOT something, you have
to consume that something in the process.
This is easily done by just capturing what you want.
So, even with multiple matches, its not good enough to say (?:(?!\bover\b).)+
because even though it will match up to (but not) over, on the next match
it will match ver ....
There are ways to avoid this that are tedious, requiring variable length lookbehinds.
But, the easiest way is to match up to over, then over, then the rest.
Several constructs can help. One is \K.
Unfortunately, there is no magical recipe to negate a pattern.
As you mentioned it in your question when you have an efficient pattern you use with a match method, to obtain the complementary, the more easy (and efficient) way is to use a split method with the same pattern.
To do it with the pattern itself, workarounds are:
1. consuming the characters that match the pattern
"other content" is the content until the next pattern or the end of the string.
alternation + capture group:
(pattern)|other content
Then you must check if the capture group exists to know which part of the alternation succeeds.
"other content" can be for example described in this way: .*?(?=pattern|$)
With PCRE and Perl, you can use backtracking control verbs to avoid the capture group, but the idea is the same:
pattern(*SKIP)(*FAIL)|other content
With this variant, you don't need to check anything after, since the first branch is forced to fail.
or without alternation:
((?:pattern)*)(other content)
variant in PCRE, Perl, or Ruby with the \K feature:
(?:pattern)*\Kother content
Where \K removes all on the left from the match result.
2. checking characters of the string one by one
(?:(?!pattern).)*
if this way is very simple to write (if the lookahead is available), it has the inconvenient to be slow since each positions of the string are tested with the lookahead.
The amount of lookahead tests can be reduced if you can use the first character of the pattern (lets say "a"):
[^a]*(?:(?!pattern)a[^a]*)*
3. list all that is not the pattern.
using character classes
Lets say your pattern is /hello/:
([^h]|h(([^eh]|$)|e(([^lh]|$)|l(([^lh]|$)|l([^oh]|$))))*
This way becomes quickly fastidious when the number of characters is important, but it can be useful for regex flavors that haven't many features like POSIX regex.

Regex for matching a character, but not when it's enclosed in quotes

I need to match a colon (':') in a string, but not when it's enclosed by quotes - either a " or ' character.
So the following should have 2 matches
something:'firstValue':'secondValue'
something:"firstValue":'secondValue'
but this should only have 1 match
something:'no:match'
If the regular expression implementation supports look-around assertions, try this:
:(?:(?<=["']:)|(?=["']))
This will match any colon that is either preceeded or followed by a double or single quote. So that does only consider construct like you mentioned. something:firstValue would not be matched.
It would be better if you build a little parser that reads the input byte-by-byte and remembers when quotation is open.
Regular expressions are stateless. Tracking whether you are inside of quotes or not is state information. It is, therefore, impossible to handle this correctly using only a single regular expression. (Note that some "regular expression" implementations add extensions which may make this possible; I'm talking solely about "true" regular expressions here.)
Doing it with two regular expressions is possible, though, provided that you're willing to modify the original string or to work with a copy of it. In Perl:
$string =~ s/['"][^'"]*['"]//g;
my $match_count = $string =~ /:/g;
The first will find every sequence consisting of a quote, followed by any number of non-quote characters, and terminated by a second quote, and remove all such sequences from the string. This will eliminate any colons which are within quotes. (something:"firstValue":'secondValue' becomes something:: and something:'no:match' becomes something:)
The second does a simple count of the remaining colons, which will be those that weren't within quotes to start with.
Just counting the non-quoted colons doesn't seem like a particularly useful thing to do in most cases, though, so I suspect that your real goal is to split the string up into fields with colons as the field delimiter, in which case this regex-based solution is unsuitable, as it will destroy any data in quoted fields. In that case, you need to use a real parser (most CSV parsers allow you to specify the delimiter and would be ideal for this) or, in the worst case, walk through the string character-by-character and split it manually.
If you tell us the language you're using, I'm sure somebody could suggest a good parser library for that language.
Uppps ... missed the point. Forget the rest. It's quite hard to do this because regex is not good at counting balanced characters (but the .NET implementation for example has an extension that can do it, but it's a bit complicated).
You can use negated character groups to do this.
[^'"]:[^'"]
You can further wrap the quotes in non-capturing groups.
(?:[^'"]):(?:[^'"])
Or you can use assertion.
(?<!['"]):(?!['"])
I've come up with the following slightly worrying construction:
(?<=^('[^']*')*("[^"]*")*[^'"]*):
It uses a lookbehind assertion to make sure you match an even number of quotes from the beginning of the line to the current colon. It allows for embedding a single quote inside double quotes and vice versa. As in:
'a":b':c::"':" (matches at positions 6, 8 and 9)
EDIT
Gumbo is right, using * within a look behind assertion is not allowed.
You can try to catch the strings withing the quotes
/(?<q>'|")([\w ]+)(\k<q>)/m
First pattern defines the allowed quote types, second pattern takes all Word-Digits and spaces.
Very good on this solution is, it takes ONLY Strings where opening and closing quotes match.
Try it at regex101.com

The Greedy Option of Regex is really needed?

The Greedy Option of Regex is really needed?
Lets say I have following texts, I like to extract texts inside [Optionx] and [/Optionx] blocks
[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]
But with Regex Greedy Option, its give me
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
Anybody need like that? If yes, could you let me know?
If I understand correctly, the question is “why (when) do you need greedy matching?”
The answer is – almost always. Consider a regular expression that matches a sequence of arbitrary – but equal – characters, of length at least two. The regular expression would look like this:
(.)\1+
(\1 is a back-reference that matches the same text as the first parenthesized expression).
Now let’s search for repeats in the following string: abbbbbc. What do we find? Well, if we didn’t have greedy matching, we would find bb. Probably not what we want. In fact, in most application s we would be interested in finding the whole substring of bs, bbbbb.
By the way, this is a real-world example: the RLE compression works like that and can be easily implemented using regex.
In fact, if you examine regular expressions all around you will see that a lot of them use quantifiers and expect them to behave greedily. The opposite case is probably a minority. Often, it makes no difference because the searched expression is inside guard clauses (e.g. a quoted string is inside the quote marks) but like in the example above, that’s not always the case.
Regular expressions can potentially match multiple portion of a text.
For example consider the expression (ab)*c+ and the string "abccababccc". There are many portions of the string that can match the regular expressions:
(abc)cababccc
(abcc)ababccc
abcc(ababccc)
abccab(abccc)
ab(c)cababccc
ab(cc)ababccc
abcabab(c)ccc
....
some regular expressions implementation are actually able to return the entire set of matches but it is most common to return a single match.
There are many possible ways to determine the "winning match". The most common one is to take the "longest leftmost match" which results in the greedy behaviour you observed.
This is tipical of search and replace (a la grep) when with a+ you probably mean to match the entire aaaa rather than just a single a.
Choosing the "shortest non-empty leftmost" match is the usual non-greedy behaviour. It is the most useful when you have delimiters like your case.
It all depends on what you need, sometimes greedy is ok, some other times, like the case you showed, a non-greedy behaviour would be more meaningful. It's good that modern implementations of regular expressions allow us to do both.
If you're looking for text between the optionx blocks, instead of searching for .+, search for anything that's not "[\".
This is really rough, but works:
\[[^\]]+]([^(\[/)]+)
The first bit searches for anything in square brackets, then the second bit searches for anything that isn't "[\". That way you don't have to care about greediness, just tell it what you don't want to see.
One other consideration: In many cases, greedy and non-greedy quantifiers result in the same match, but differ in performance:
With a non-greedy quantifier, the regex engine needs to backtrack after every single character that was matched until it finally has matched as much as it needs to. With a greedy quantifier, on the other hand, it will match as much as possible "in one go" and only then backtrack as much as necessary to match any following tokens.
Let's say you apply a.*c to
abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc. This finds a match in 5 steps of the regex engine. Now apply a.*?c to the same string. The match is identical, but the regex engine needs 101 steps to arrive at this conclusion.
On the other hand, if you apply a.*c to abcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, it takes 101 steps whereas a.*?c only takes 5.
So if you know your data, you can tailor your regex to match it as efficiently as possible.
just use this algorithm which you can use in your fav language. No need regex.
flag=0
open file for reading
for each line in file :
if check "[/Option" in line:
flag=0
if check "[Option" in line:
flag=1
continue
if flag:
print line.strip()
# you can store the values of each option in this part

How can I match a quote-delimited string with a regex?

If I'm trying to match a quote-delimited string with a regex, which of the following is "better" (where "better" means both more efficient and less likely to do something unexpected):
/"[^"]+"/ # match quote, then everything that's not a quote, then a quote
or
/".+?"/ # match quote, then *anything* (non-greedy), then a quote
Assume for this question that empty strings (i.e. "") are not an issue. It seems to me (no regex newbie, but certainly no expert) that these will be equivalent.
Update: Upon reflection, I think changing the + characters to * will handle empty strings correctly anyway.
You should use number one, because number two is bad practice. Consider that the developer who comes after you wants to match strings that are followed by an exclamation point. Should he use:
"[^"]*"!
or:
".*?"!
The difference appears when you have the subject:
"one" "two"!
The first regex matches:
"two"!
while the second regex matches:
"one" "two"!
Always be as specific as you can. Use the negated character class when you can.
Another difference is that [^"]* can span across lines, while .* doesn't unless you use single line mode. [^"\n]* excludes the line breaks too.
As for backtracking, the second regex backtracks for each and every character in every string that it matches. If the closing quote is missing, both regexes will backtrack through the entire file. Only the order in which then backtrack is different. Thus, in theory, the first regex is faster. In practice, you won't notice the difference.
More complicated, but it handles escaped quotes and also escaped backslashes (escaped backslashes followed by a quote is not a problem)
/(["'])((\\{2})*|(.*?[^\\](\\{2})*))\1/
Examples:
"hello\"world" matches "hello\"world"
"hello\\"world" matches "hello\\"
I would suggest:
([\"'])(?:\\\1|.)*?\1
But only because it handles escaped quote chars and allows both the ' and " to be the quote char. I would also suggest looking at this article that goes into this problem in depth:
http://blog.stevenlevithan.com/archives/match-quoted-string
However, unless you have a serious performance issue or cannot be sure of embedded quotes, go with the simpler and more readable:
/".*?"/
I must admit that non-greedy patterns are not the basic Unix-style 'ed' regular expression, but they are getting pretty common. I still am not used to group operators like (?:stuff).
I'd say the second one is better, because it fails faster when the terminating " is missing. The first one will backtrack over the string, a potentially expensive operation. An alternative regexp if you are using perl 5.10 would be /"[^"]++"/. It conveys the same meaning as version 1 does, but is as fast as version two.
I'd go for number two since it's much easier to read. But I'd still like to match empty strings so I would use:
/".*?"/
From a performance perspective (extremely heavy, long-running loop over long strings), I could imagine that
"[^"]*"
is faster than
".*?"
because the latter would do an additional check for each step: peeking at the next character. The former would be able to mindlessly roll over the string.
As I said, in real-world scenarios this would hardly be noticeable. Therefore I would go with number two (if my current regex flavor supports it, that is) because it is much more readable. Otherwise with number one, of course.
Using the negated character class prevents matching when the boundary character (doublequotes, in your example) is present elsewhere in the input.
Your example #1:
/"[^"]+"/ # match quote, then everything that's not a quote, then a quote
matches only the smallest pair of matched quotes -- excellent, and most of the time that's all you'll need. However, if you have nested quotes, and you're interested in the largest pair of matched quotes (or in all the matched quotes), you're in a much more complicated situation.
Luckily Damian Conway is ready with the rescue: Text::Balanced is there for you, if you find that there are multiple matched quote marks. It also has the virtue of matching other paired punctuation, e.g. parentheses.
I prefer the first regex, but it's certainly a matter of taste.
The first one might be more efficient?
Search for double-quote
add double-quote to group
for each char:
if double-quote:
break
add to group
add double-quote to group
Vs something a bit more complicated involving back-tracking?
Considering that I didn't even know about the "*?" thing until today, and I've been using regular expressions for 20+ years, I'd vote in favour of the first. It certainly makes it clear what you're trying to do - you're trying to match a string that doesn't include quotes.