Regular Expression to Match Unescaped Characters Only - regex

Okay, so I'm trying to use a regular expression to match instances of a character only if it hasn't been escaped (with a backslash), and I decided to use a negative look-behind like so:
(?<!\\)[*]
This succeeds and fails as expected with strings such as foo* and foo\* respectively.
However, it doesn't work for strings such as foo\\*, i.e. where the special character is preceded by a back-slash escaping another back-slash (an escape sequence that is itself escaped).
Is it possible to use a negative look-behind (or some other technique) to skip special characters only if they are preceded by an odd number of back-slashes?

I've found the following solution which works for NSRegularExpression but also works in every regexp implementation I've tried that supports negative look-behinds:
(?<!\\)(?:(\\\\)*)[*]
In this case the group after the look-behind consumes any pairs of back-slashes, effectively eliminating them, so the negative look-behind only has to rule out a single remaining (odd-numbered) back-slash. Note that the match then includes the consumed back-slash pairs as well as the special character.
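As a rough illustration, here is a minimal Python sketch comparing the two patterns above. Python's re module is just one engine with fixed-width negative look-behinds; the sample strings and the helper name are made up for this example.

```python
import re

# Naive pattern: only looks at the single character before the asterisk.
naive = re.compile(r'(?<!\\)[*]')
# Pattern from above: also consumes pairs of back-slashes before the asterisk.
paired = re.compile(r'(?<!\\)(?:(\\\\)*)[*]')

samples = [
    'foo*',         # no back-slash: the asterisk is unescaped
    'foo\\*',       # one back-slash: the asterisk is escaped
    'foo\\\\*',     # two back-slashes: the asterisk is unescaped
    'foo\\\\\\*',   # three back-slashes: the asterisk is escaped
]

def finds_star(pattern, text):
    """Return True if the pattern reports an unescaped asterisk in text."""
    return pattern.search(text) is not None

# Maps each sample to (naive result, paired result).
results = {text: (finds_star(naive, text), finds_star(paired, text))
           for text in samples}

for text, (n, p) in results.items():
    print(f'{text!r}: naive={n} paired={p}')
```

Only the third sample differs between the two patterns: the naive look-behind rejects it, while the paired version accepts it.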

In most regex flavors, a lookbehind cannot solve this problem. The only way is to match escaped characters first, so that they are skipped, and then find the unescaped characters.
You can isolate the unescaped character from the result with a capture group:
(?:\\.)+|(\*)
or with the \K (pcre/perl/ruby) feature that removes all on the left from the result:
(?:\\.)*\K\*
or using backtracking control verbs (pcre/perl) to skip escaped characters:
(?:\\.)+(*SKIP)(*FAIL)|\*
The only case where you can use a lookbehind is the .NET framework, which allows unlimited-length lookbehinds:
(?<!(?:[^\\]|\A)(?:\\\\)*\\)\*
or, in a more limited way, with Java:
(?<!(?:[^\\]|\A)(?:\\\\){0,1000}\\)\*
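The capture-group approach above also works in engines without \K or backtracking verbs. A minimal Python sketch, assuming you want the positions of the unescaped asterisks (the function name and the sample string are hypothetical):

```python
import re

# Alternative 1 consumes runs of escaped characters (backslash + any char);
# alternative 2 captures an asterisk that was not consumed as part of an escape.
TOKEN = re.compile(r'(?:\\.)+|(\*)')

def unescaped_star_positions(text):
    """Return the indexes of asterisks that are not escaped by a backslash."""
    return [m.start(1) for m in TOKEN.finditer(text) if m.group(1) is not None]

sample = r'a*b\*c\\*d'
positions = unescaped_star_positions(sample)
print(positions)
```

Matches where group 1 did not participate are escape sequences and are simply ignored.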

Related

Regex expression to match everything between a ? and # OR ? to the end of string [duplicate]

I have a string. The end is different, such as index.php?test=1&list=UL or index.php?list=UL&more=1. The one thing I'm looking for is &list=.
How can I match it, whether it's in the middle of the string or it's at the end? So far I've got [&|\?]list=.*?([&|$]), but the ([&|$]) part doesn't actually work; I'm trying to use that to match either & or the end of the string, but the end of the string part doesn't work, so this pattern matches the second example but not the first.
Use:
/(&|\?)list=.*?(&|$)/
Note that when you use a bracket expression, every character within it (with some exceptions) is going to be interpreted literally. In other words, [&|$] matches the characters &, |, and $.
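To see the difference, here is a small Python sketch that runs the original bracket-expression pattern and the grouped pattern against the two example strings from the question (the variable names are only for illustration):

```python
import re

urls = ['index.php?test=1&list=UL', 'index.php?list=UL&more=1']

# Inside [...] the characters &, | and $ are plain literals, so this needs a
# literal '&', '|' or '$' after the value and cannot match at end of string.
bracket = re.compile(r'[&|\?]list=.*?([&|$])')
# In a group, | is alternation and $ is the end-of-string anchor.
group = re.compile(r'(&|\?)list=.*?(&|$)')

bracket_hits = [bracket.search(u) is not None for u in urls]
group_hits = [group.search(u) is not None for u in urls]

print(bracket_hits)
print(group_hits)
```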
In short
Any zero-width assertions inside [...] lose their meaning of a zero-width assertion. [\b] does not match a word boundary (it matches a backspace, or, in POSIX, \ or b), [$] matches a literal $ char, [^] is either an error or, as in ECMAScript regex flavor, any char. Same with \z, \Z, \A anchors.
You may solve the problem using any of the below patterns:
[&?]list=([^&]*)
[&?]list=(.*?)(?=&|$)
[&?]list=(.*?)(?![^&])
If you need to check for the "absolute", unambiguous string end anchor, you need to remember that in various regex flavors it is expressed with different constructs:
[&?]list=(.*?)(?=&|$) - OK for ECMA regex (JavaScript, default C++ `std::regex`)
[&?]list=(.*?)(?=&|\z) - OK for .NET, Onigmo (Ruby), Perl, PCRE (PHP, base R), Boost, ICU (R `stringr`), Java/Android; Go also uses \z, but its RE2-based engine has no lookarounds, so only the anchor applies there
[&?]list=(.*?)(?=&|\Z) - OK for Python
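The difference between the two anchors shows up when the input ends with a newline. A minimal Python sketch (the input string with its trailing newline is an assumption made for the demonstration):

```python
import re

text = 'index.php?test=1&list=UL\n'  # note the trailing newline

# In Python, $ also matches just before a newline that ends the string ...
dollar = re.search(r'[&?]list=(.*?)(?=&|$)', text)
# ... while \Z only matches at the very end of the string, and . cannot
# cross the newline, so this search finds nothing.
absolute = re.search(r'[&?]list=(.*?)(?=&|\Z)', text)

print(dollar.group(1) if dollar else None)
print(absolute.group(1) if absolute else None)
```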
Matching between a char sequence and a single char or end of string (current scenario)
The .*?([YOUR_SINGLE_CHAR_DELIMITER(S)]|$) pattern (suggested by João Silva) is rather inefficient since the regex engine checks for the patterns that appear to the right of the lazy dot pattern first, and only if they do not match does it "expand" the lazy dot pattern.
In these cases it is recommended to use a negated character class (or bracket expression in the POSIX talk):
[&?]list=([^&]*)
Details:
[&?] - a positive character class matching either & or ? (note the relationships between chars/char ranges in a character class are OR relationships)
list= - a substring, char sequence
([^&]*) - Capturing group #1: zero or more (*) chars other than & ([^&]), as many as possible
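A minimal Python sketch of the negated character class pattern broken down above, applied to the strings from the question (the helper name list_value is hypothetical):

```python
import re

LIST_PARAM = re.compile(r'[&?]list=([^&]*)')

def list_value(url):
    """Return the value of the list parameter, or None if it is absent."""
    m = LIST_PARAM.search(url)
    return m.group(1) if m else None

print(list_value('index.php?test=1&list=UL'))
print(list_value('index.php?list=UL&more=1'))
```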
Checking for the trailing single char delimiter presence without returning it or end of string
Most regex flavors (including JavaScript beginning with ECMAScript 2018) support lookarounds, constructs that only return true or false depending on whether their patterns match. They are crucial when you expect consecutive matches that may start and end with the same char (see the original pattern, it may match a string starting and ending with &). Although this is not expected in a query string, it is a common scenario.
In that case, you can use two approaches:
A positive lookahead with an alternation containing positive character class: (?=[SINGLE_CHAR_DELIMITER(S)]|$)
A negative lookahead with just a negative character class: (?![^SINGLE_CHAR_DELIMITER(S)])
The negative lookahead solution is a bit more efficient because it does not contain an alternation group that adds complexity to the matching procedure. The OP solution would look like
[&?]list=(.*?)(?=&|$)
or
[&?]list=(.*?)(?![^&])
Certainly, in case the trailing delimiters are multichar sequences, only a positive lookahead solution will work since [^yes] does not negate a sequence of chars, but the chars inside the class (i.e. [^yes] matches any char but y, e and s).
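The consecutive-match scenario can be sketched in Python as follows. The input string is a made-up example in which every item starts with & and the next item begins where the previous one ends:

```python
import re

query = '&a=1&b=2&c=3'

# The trailing & is consumed, so the next match cannot start with it.
consuming = re.findall(r'&(.*?)(?:&|$)', query)
# Lookaheads only test for the delimiter and leave it for the next match.
positive = re.findall(r'&(.*?)(?=&|$)', query)
negative = re.findall(r'&(.*?)(?![^&])', query)

print(consuming)
print(positive)
print(negative)
```

The consuming variant skips every second item, while both lookahead variants return all three.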

Do not repeat placeholders in the same string regex

I made a regex to validate arrays that contain variable placeholders surrounded by { and }:
^(\/?(([a-zA-Z0-9\-\_]+)|(\{[a-zA-Z][a-zA-Z0-9]*\}))\/?)*$
It will validate strings like test/{a}/{b} and /some-text/{a}/{a}/ and it's working fine. Here is the test: https://regex101.com/r/nP1tB2/2
Is it possible to block duplicated placeholders?
For example, in the 2nd string, {a} appears twice, but I would like to "block" (regex that doesn't match) it.
You may use a negative lookahead to restrict the matching process:
^(?!.*{([\w-]+)}.*{\1})(\/?(([\w-]+)|(\{[a-zA-Z][a-zA-Z0-9]*\}))\/?)*$
 ^^^^^^^^^^^^^^^^^^^^^^
It means that right after the beginning of the string is detected, (?!.*{([\w-]+)}.*{\1}) checks whether there are 0+ characters other than a newline, followed by a {...} substring (containing only letters, digits, underscores or hyphens), followed later by the same {...} substring. If that is found, the whole match fails.
Note that if you do not use a Unicode aware pattern (and it is not .NET without RegexOptions.ECMAScript), \w is equal to [A-Za-z0-9_]. So, I replaced that with \w in your pattern. Else, restore that subpattern in both lookahead and the main pattern.
Also, [a-zA-Z] can be expressed as [^\W\d_] or \p{L} (or even [[:alpha:]]) and [a-zA-Z0-9] as [^\W_] (or [[:alnum:]], [\p{L}\p{N}]). These subpatterns are handy if you need to make the pattern Unicode aware. A lot depends on the regex flavor.
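For a quick check outside a regex tester, here is a minimal Python sketch of the lookahead pattern. Two things are assumptions of this sketch rather than part of the answer: the braces in the lookahead are escaped for safety, and re.ASCII is passed so that \w stays equal to [A-Za-z0-9_], as the note above requires. The function name is hypothetical.

```python
import re

PATH = re.compile(
    r'^(?!.*\{([\w-]+)\}.*\{\1\})'                      # no placeholder may repeat
    r'(\/?(([\w-]+)|(\{[a-zA-Z][a-zA-Z0-9]*\}))\/?)*$',  # original path structure
    re.ASCII,  # keep \w limited to [A-Za-z0-9_]
)

def is_valid(path):
    """Return True if path is well formed and has no repeated placeholder."""
    return PATH.match(path) is not None

print(is_valid('test/{a}/{b}'))
print(is_valid('/some-text/{a}/{a}/'))
```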

How does the following regex work?

Let's say I have a string in which I wanted to parse from an opening double-quote to a closing double-quote:
asdf"pass\"word"asdf
I was lucky enough to discover that the following PCRE would match from the opening double-quote to the closing double-quote while ignoring the escaped double-quote in the middle (to properly parse the logical unit):
".*?(?:(?!\\").)"
Match:
"pass\"word"
However, I have no idea why this PCRE matches the opening and closing double-quote properly.
I know the following:
" = literal double-quote
.*? = lazy matching of zero or more of any character
(?: = opening of non-capturing group
(?!\") = asserts its impossible to match literal \"
. = single character
) = closing of non-capturing group
" = literal double-quote
It appears that a single character and a negative lookahead are a part of the same logical group. To me, this means the PCRE is saying "Match from a double-quote to zero or more of any character as long as there is no \" right after the character, then match one more character and one single double quote."
However, according to that logic the PCRE would not match the string at all.
Could someone help me wrap my head around this?
It's easier to understand if you change the non-capture group to be a capture group.
Lazy matching generally moves forward one character at a time (vs. greedy matching, which takes everything it can and then gives up what it must). But it "moves forward" only as far as needed to satisfy the required parts of the pattern after it, which here is accomplished by letting the .*? match everything up to the r, then letting the negative lookahead + . match the d.
Update: you asked in comment:
how come it matches up to the r at all? shouldn't the negative
lookahead prevent it from getting passed the \" in the string? thanks
for helpin me understand, by the way
No, because it is not the negative lookahead stuff that is matching it. That is why I suggested you change the non-captured group into a captured group, so that you can see it is .*? that matches the \", not (?:(?!\\").)
.*? has the potential to match the entire string, and the regex engine uses that to satisfy the requirement to match the rest of the pattern.
Update 2:
It is effectively the same as doing this: ".*?[^\\]" which is probably a lot easier to wrap your head around.
A (slightly) better pattern would be to use a negative lookbehind like so: ".*?(?<!\\)" because it will allow for an empty string "" to be matched (a valid match in many contexts), but negative lookbehinds aren't supported in all engines/languages (from your tags, pcre supports it, but I don't think you can really do this in bash except via e.g. grep -P '[pattern]' .. which runs it through the PCRE library, where your grep build supports -P).
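The capture-group suggestion can be tried out directly. A minimal Python sketch, using the string from the question and the two variants mentioned above (Python's re stands in for PCRE here, which is an assumption of this example; the variable names are made up):

```python
import re

text = r'asdf"pass\"word"asdf'

# Same pattern as in the question, with both parts captured.
m = re.search(r'"(.*?)((?!\\").)"', text)
lazy_part, last_char = m.group(1), m.group(2)
print(m.group(0), '|', lazy_part, '|', last_char)

# The two variants discussed above give the same overall match here.
negated = re.search(r'".*?[^\\]"', text).group(0)
lookbehind = re.search(r'".*?(?<!\\)"', text).group(0)

# Only the lookbehind variant accepts an empty quoted string.
empty_negated = re.search(r'".*?[^\\]"', '""')
empty_lookbehind = re.search(r'".*?(?<!\\)"', '""')
```

The first group shows that it is .*? that swallows the \" in the middle, while the lookahead group only ever matches the final d.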
Nothing to add to Crayon Violent's explanation, only a little disambiguation and some ways to match substrings enclosed between double quotes (possibly with quotes escaped by a backslash inside).
First, it seems that in your question you use the acronym "PCRE" (Perl Compatible Regular Expressions), which is the name of a particular regex engine (and, by extension and somewhat imprecisely, refers to its syntax), in place of the word "pattern", which is the regular expression that describes a set of strings (whatever the regex engine used).
With Bash:
A='asdf"pass\"word"asdf'
pattern='"(([^"\\]|\\.)*)"'
[[ $A =~ $pattern ]]
echo "${BASH_REMATCH[1]}"
You can use this pattern too: pattern='"(([^"\\]+|\\.)*)"'
With a PCRE regex engine, you can use the first pattern, but it's better to rewrite it in a more efficient way:
"([^"\\]*+(?:\\.[^"\\]*)*+)"
Note that these three patterns don't need any lookaround. They are able to deal with any number of consecutive backslashes: "abc\\\"def" (a literal backslash and an escaped quote), "abcdef\\\\" (two literal backslashes, the quote is not escaped).
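The backslash handling can be checked with a short Python sketch. It uses the first Bash pattern from above with a non-capturing inner group, and deliberately leaves out possessive quantifiers, since older Python versions do not support them. The function name is hypothetical.

```python
import re

# Either a character that is neither a quote nor a backslash,
# or a backslash followed by any single character.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')

def first_quoted(text):
    """Return the contents of the first double-quoted substring, or None."""
    m = QUOTED.search(text)
    return m.group(1) if m else None

print(first_quoted(r'asdf"pass\"word"asdf'))
print(first_quoted(r'"abc\\\"def"'))
print(first_quoted(r'"abcdef\\\\" tail'))
```

Because each backslash is always consumed together with the character after it, the pattern never has to count backslashes.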