What is the meaning of the \\+ in this regex?

I am trying to parse this example regex.
I know that backslashes can be used as escape characters. So if you wanted to search for ) without implying a grouping, you would write a backslash followed by ) (I'm spelling this out to avoid Stack Overflow's own escaping).
I also know that a plus sign can indicate one or more of the preceding item.
But in the example below, is the plus sign or the backslash getting escaped? It seems like the first backslash "escapes" the second backslash, and then the plus sign indicates that there is at least one backslash before it, but the example says there are at least two + in the string...
What does this regex mean? There are too many new things going on for me to parse it.

But in the example below, is the plus sign or the slash getting escaped?
Both!
The \ is escaped because the query language you are using probably uses it as an escape character itself (i.e. to escape quotes). So \\ is understood as a single \ in the regex, which is then used to escape the +. The regex means a single + followed by zero or many +.
It could probably be rewritten as \\++ where the second + is actually the regex quantifier.
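The two levels of escaping described above can be tried outside SQL. The following is a minimal sketch in Python, assuming only that Python's ordinary string literals and its `re` module behave analogously to a backslash-decoding string plus a regex engine; it is not the query language from the question, and `first_match` is a helper name made up for this example.

```python
import re

# Level 1: the string literal consumes one level of backslashes,
# so the text handed to the regex engine is \+\+* (five characters).
pattern_text = "\\+\\+*"

# Level 2: the regex engine reads \+ as a literal plus sign,
# so the pattern means "one + followed by zero or more +".
plus_run = re.compile(pattern_text)

# The suggested rewrite: a literal + followed by the + quantifier.
plus_run_alt = re.compile("\\++")


def first_match(text, pattern=plus_run):
    """Return the first run of plus signs in text, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(len(pattern_text))
    print(first_match("spam level: +++"))
    print(first_match("no pluses here"))
```

Both compiled patterns should find the same runs of plus signs, which is why the shorter form is a reasonable rewrite.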

That regexp can actually mean two different things, depending on the PostgreSQL version and the value of standard_conforming_strings.
Old versions (before standard_conforming_strings existed, or those that defaulted it to off) would interpret the string as a backslash-escaped string. So PostgreSQL would turn \\+\\+* into \+\+*, i.e. it'd consume a level of escaping. Then the regular expression would consume the remaining level to escape the pluses, so they're interpreted as literal +s, not quantifiers. That regexp says a literal + followed by zero or more further +s, i.e. one or more + in a row.
Newer versions with standard_conforming_strings defaulting to on will, per the SQL standard, not decode the backslashes as escapes. So you'll run the regexp \\+\\+*, which is one or more backslashes, followed by one or more backslashes, followed by ... oops, an asterisk applied to something that already has a quantifier, which is an error.
So we know you must have standard_conforming_strings off, 'cos the query would fail to compile the regexp on a new one.
regress=> SELECT 'blah' ~ '\\+\\+*';
ERROR: invalid regular expression: quantifier operand invalid
postgres=> SHOW standard_conforming_strings;
standard_conforming_strings
-----------------------------
on
(1 row)
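The same kind of failure can be reproduced outside PostgreSQL. This is a sketch in Python, on the assumption that its `re` module is a fair stand-in for illustration only (it is a different regex engine and reports a different message). The raw string plays the role of a string with no backslash decoding; `compiles` is a hypothetical helper.

```python
import re


def compiles(pattern):
    """Return True if the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# No backslash decoding at the string level: the regex engine sees \\+\\+*,
# where the final * is stacked directly on top of the preceding + quantifier.
undecoded = r"\\+\\+*"

# One level of backslash decoding first: the regex engine sees \+\+*.
decoded = "\\+\\+*"

if __name__ == "__main__":
    print(compiles(undecoded))
    print(compiles(decoded))
```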
You'll have this problem later on, so I suggest dealing with it before you upgrade.
Assuming that the x_spam_level field always starts with the pluses, which the regexp doesn't check, that code might be better written as:
x_spam_level LIKE '++%'
If it doesn't start with the pluses use:
x_spam_level LIKE '%++%'
which is roughly what the current regexp is doing (strictly, the unanchored regexp matches any string containing at least one +). PostgreSQL will turn that into a regular expression internally, but you don't have to worry about the escaping.
If you want to use a regular expression and have it behave consistently across all versions, use:
x_spam_level ~ E'\\+\\+*'
The E'' syntax tells PostgreSQL to decode backslash escapes, irrespective of the standard_conforming_strings setting.

Related

How to do a negative lookbehind within a %r<…>-delimited regexp in Ruby?

I like the %r<…> delimiters because they make it really easy to spot the beginning and end of the regex, and I don't have to escape any /. But it seems that they have an insurmountable limitation that other delimiters don't have?
Every other delimiter imaginable works fine:
/(?<!foo)/
%r{(?<!foo)}
%r[(?<!foo)]
%r|(?<!foo)|
%r/(?<!foo)/
But when I try to do this:
%r<(?<!foo)>
it gives this syntax error:
unterminated regexp meets end of file
Okay, it probably doesn't like that it's not a balanced pair, but how do you escape it such that it does like it?
Does something need to be escaped?
According to wikibooks.org:
Any single non-alpha-numeric character can be used as the delimiter,
%[including these], %?or these?, %~or even these things~.
By using this notation, the usual string delimiters " and ' can appear
in the string unescaped, but of course the new delimiter you've chosen
does need to be escaped.
Indeed, escaping is needed in these examples:
%r!(?<\!foo)!
%r?(\?<!foo)?
But if that were the only problem, then I should be able to escape it like this and have it work:
%r<(?\<!foo)>
But that yields this error:
undefined group option: /(?\<!foo)/
So maybe escaping is not needed/allowed? wikibooks.org does list %<pointy brackets> as one of the exceptions:
However, if you use
%(parentheses), %[square brackets], %{curly brackets} or
%<pointy brackets> as delimiters then those same delimiters
can appear unescaped in the string as long as they are in balanced
pairs
Is it a problem with balanced pairs?
Balanced pairs are no problem as long as you are doing something in the Regexp that requires them, like...
%r{(?<!foo{1})} # repetition quantifier
%r[(?<![foo])] # character class
%r<(?<name>foo)> # named capture group
But what if you need to insert a left-side delimiter ({, [, or <) inside the regex? Just escape it, right? Ruby seems to have no problem with escaped unbalanced delimiters most of the time...
%r{(?<!foo\{)}
%r[(?<!\[foo)]
%r<\<foo>
It's just when you try to do it in the middle of the "group options" (which I guess is what the <! characters are classified as here) following a (? that it doesn't like it:
%r<(?\<!foo)>
# undefined group option: /(?\<!foo)/
So how do you do that then and make Ruby happy? (without changing the delimiters)
Conclusion
The workaround is easy. I'll just change this particular regex to use something else, like %r{…}, instead.
But the questions remain...
Is there really no way to escape the < here?
Are there really some regular expressions that are simply impossible to write using certain delimiters like %r<…>?
Is %r<…> the only regular expression delimiter pair that has this problem (where some regular expressions are impossible to write when using it)? If you know of a similar example with %r{…}/%r[…], do share!
Version info
Not that it probably matters since this syntax probably hasn't changed, but I'm using:
⟫ ruby -v
ruby 2.6.0p0 (2018-12-25 revision 66547) [x86_64-linux]
Reference:
https://ruby-doc.org/core-2.6.3/Regexp.html
% Notation
As others have mentioned, seems like an oversight based on how this character differs from other paired boundaries.
As far as "Is there really no way to escape the < here?" there is a way... but you're not going to like it:
%r<(?#{'<'}!foo)> == %r((?<!foo))
Using interpolation to insert the < character seems to work. But given that there are much better options, I would avoid it unless you were planning on splitting the regex into sections anyway...

Seemingly incorrect regex evaluation in regexp_replace

I just stumbled upon a curious behavior of regexp_replace PostgreSQL function. It looks like a bug but I always doubt myself first. When I run
SELECT regexp_replace(E'1%2_3', '([_%])', E'\\ \\1', 'g')
it correctly prefixes either underscore or percent with backslash+space and produces "1\ %2\ _3". However when I remove space (it doesn't have to be space, can be any character)
SELECT regexp_replace(E'1%2_3', '([_%])', E'\\\\1', 'g')
it stops using captured parenthesized expression in substitution and produces "1\12\13" instead of "1\%2\_3". I would appreciate if someone could tell me what am I doing wrong. I simply need to add backslash before certain characters in a string.
UPDATE: I was able to achieve the desired behavior by running
SELECT regexp_replace(E'1%2_3', '([_%])', E'\\\\\\1', 'g')
My original example still seems a bit illogical and inconsistent. The inconsistency is that using the same E'...' syntax 4 backslashes may produce different result.
In your second query, after the backslash escapes are processed at the string level, you have the replacement string \\1.
What's happening is that the escaped backslash prevents \1 from being recognized as a back-reference. You need another set of backslashes, so that the replacement string is \\\1 to get a literal backslash and a back-reference. Since every literal backslash needs to be escaped, you need to double all of them.
SELECT regexp_replace(E'1%2_3', '([_%])', E'\\\\\\1', 'g')
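The effect of the escaped backslash on the back-reference can also be seen in other regex libraries. Here is a small sketch using Python's `re.sub`, whose replacement strings happen to treat `\\` and `\1` the same way. Raw strings are used so that only the regex-level escaping is in play; this is an analogy, not PostgreSQL's implementation.

```python
import re

subject = "1%2_3"
pattern = r"([_%])"

# Replacement text \\1: an escaped backslash followed by a plain "1",
# so no back-reference is recognized.
no_backref = re.sub(pattern, r"\\1", subject)

# Replacement text \\\1: an escaped backslash followed by the
# back-reference \1 to the captured character.
with_backref = re.sub(pattern, r"\\\1", subject)

if __name__ == "__main__":
    print(no_backref)    # 1\12\13
    print(with_backref)  # 1\%2\_3
```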
I would not use the outdated Posix escape syntax in Postgres without need in the first place. Are you running an outdated version with standard_conforming_strings = off? Because if you are not, simplify:
SELECT regexp_replace('1%2_3', '([_%])', '\\\1', 'g')
You only need to add a single \ to escape the special meaning of \ in the replacement string.
Insert text with single quotes in PostgreSQL
Strings prefixed with E have to be processed, which costs a tiny bit extra and there is always the risk of unintended side effects with special characters. It's pointless to write E'1%2_3' for a string you want to provide as is. Make that just '1%2_3' in any case.
And to replace just two characters, simply use:
SELECT replace(replace('1%2_3', '_', '\_'), '%', '\%')
Regular expressions are powerful, but for a price. Even several nested simple replace() calls are cheaper than a single regexp_replace().

How do I escape Regex to search on a period?

I have a simple assignment that's been messing with me and I need another few sets of eyes. I'm sure I'm missing something simple. We have a directory of files that include all kinds of special characters, and I need to strip those out leaving only alpha, numeric, dot (period) and underscore characters. I'm using regex within a PowerShell v2.0 script.
For example:
!foo12.log becomes foo12.log
foo1(bar)2.log becomes foo1bar2.log
[foo]bar_.log becomes foobar_.log
My strategy is to use an exclude list and replace everything else with "". Consider:
$bkpPath = "\\Server\foo"
gci $bkpPath | % {$_.name -replace "[^a-zA-z_0-9]",""}
When I ran this, I ended up with foo12log, foo1bar2log and foobar_log so I change the regex to include .: [^a-zA-Z_\.0-9]. That doesn't remove any special characters. I've also tried [^a-zA-Z_\[\]\(\)\.0-9] with the same results as when I escape a period.
I suspect that there's an issue with my escape of the period \. and regex is reading it as a wildcard. If that's what's going on, how do I fix it? If that's not what's going on, what am I missing?
Because "." means "anything", it would be silly to use that special character inside square brackets. So in this case, the full stop loses its special meaning and you don't have to use the "\" escape character before it.
Also, it's worth noting that:
\w means "any word character" (letter, number, underscore)
\W means "any non-word character" (Although this isn't a time-saver in this case, since you want to match full stops too.)
So in this case, your relevant bit of regex could just be:
[^\w.]
You don't need to escape a period inside a character class:
[^a-zA-Z_.0-9]
should work fine. If it doesn't, there may be something special about the PowerShell regex flavour.
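Both suggested character classes can be checked against the file names from the question. This is a sketch in Python rather than PowerShell, so it only shows that an unescaped period inside a character class is treated as a literal; `clean` is a hypothetical helper, and `re.ASCII` is used so that `\w` is limited to letters, digits and underscore.

```python
import re

# Explicit class: anything that is not a letter, digit, underscore or period.
EXPLICIT = re.compile(r"[^a-zA-Z_.0-9]")

# Shorthand: \w covers letters, digits and underscore (ASCII-only here).
SHORTHAND = re.compile(r"[^\w.]", re.ASCII)


def clean(name, pattern=EXPLICIT):
    """Strip every character the pattern matches from a file name."""
    return pattern.sub("", name)


if __name__ == "__main__":
    for name in ["!foo12.log", "foo1(bar)2.log", "[foo]bar_.log"]:
        print(name, "becomes", clean(name))
```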

white space in Regular expression

I am making use of this software, dk-brics-automaton, to get the number of states of regular expressions. Now, for example, I have this type of RE:
^SEARCH\s+[^\n]{10}
When I insert it below as a string, the compiler says there is an invalid escape sequence:
RegExp r = new RegExp("^SEARCH\s+[^\n]{10}", ALL);
where ALL is a certain FLAG.
When I use double backslashes before the small s, the compiler accepts it as a string. Here \s means whitespace, but I am confused: if I use double backslashes, will it be treated as just a backslash and "s", when I meant whitespace?
I have thousands of such regular expressions for which I want to compute finite automaton states. So, does that mean that I have to add backslashes manually in all the REs?
Here is a link where they have explained something related to this but I am not getting it:
http://www.brics.dk/automaton/doc/index.html
Please help me if anyone has some past experience in this software or if you have any idea to solve this issue.
I had another look at that documentation. "automaton" is a Java package, therefore I think you have to treat them like Java regexes. So just double every backslash inside a regex.
The thing here is, Java does not know "raw" strings. So you have to escape for two levels. The first level that evaluates escape sequences is the string level.
The string level does not know an escape sequence \s; that is the error. \n is fine: the string level evaluates it and stores the single character 0x0A instead of the two characters \ (0x5C) and n (0x6E).
Then the string is stored and handed over to the regex constructor. Here happens the next round of escape sequence evaluation.
So if you want to escape for the regex level, then you have to double the backslashes. The string level will evaluate the \\ to \ and so the regex level gets the correct escape sequences.
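With thousands of expressions, the doubling does not have to be done by hand. Below is a minimal sketch of a helper that turns a regex, as written on paper, into Java string-literal source text. It is written in Python purely as a one-off conversion script; `to_java_literal` is a hypothetical name. It only handles the string level, and assumes the regex library then understands whatever escapes it receives.

```python
def to_java_literal(regex: str) -> str:
    """Return Java source text for a string literal holding `regex`.

    Backslashes are doubled first, then double quotes are escaped,
    so the Java compiler hands the original text to the regex level.
    """
    escaped = regex.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


if __name__ == "__main__":
    # Raw string: this is the regex exactly as it appears in the question.
    print(to_java_literal(r"^SEARCH\s+[^\n]{10}"))
```

Another option that avoids the problem entirely is to read the expressions from a text file at run time, since escape processing of this kind only applies to string literals in source code.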

Regex match anything that is not sub-pattern

I have cookies in my HTTP header like so:
Set-Cookie: frontend=ovsu0p8khivgvp29samlago1q0; adminhtml=6df3s767g199d7mmk49dgni4t7; external_no_cache=1; ZDEDebuggerPresent=php,phtml,php3
and I need to extract the 26 character string that comes after frontend (e.g. ovsu0p8khivgvp29samlago1q0). The following regular expression matches that for me:
(?<=frontend=)(.*)(?=;)
However, I am using Varnish Cache and can only use a regex replace. Therefore, to extract that cookie value (26 character frontend string) I need to match all characters that do not match that pattern (so I can replace them with '').
I've done a fair bit of Googling but so far have drawn a blank. I've tried the following
Match characters that do not match the pattern I want: [^((?<=frontend=)[A-Za-z0-9]{26}(?=;))] which matches random characters, including the ones I want to preserve
I'd be grateful if someone could point me in the right direction, or note where I might have gone wrong.
The Set-Cookie response header is a bit magical in Varnish, since the backends tend to send multiple headers with the same name. This is prohibited by the RFC, but it is the de facto way to do it.
If you are using Varnish 3.0 you can use the Header VMOD, it can parse the response and extract what you need:
https://github.com/varnish/libvmod-header
Use regex pattern
^Set-Cookie:.*?\bfrontend=([^;]*)
and the "26 character string that comes after frontend" will be in group 1 (usually referred to in the replacement string as $1)
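A quick way to check this pattern is to run it against the header from the question. The sketch below uses Python's `re` module rather than Varnish, so the back-reference is written `\1` instead of `$1`; the function names are made up for this example. The second function shows the replace-only variant, where the pattern is extended with `.*$` so that the whole header is replaced by group 1.

```python
import re

HEADER = (
    "Set-Cookie: frontend=ovsu0p8khivgvp29samlago1q0; "
    "adminhtml=6df3s767g199d7mmk49dgni4t7; external_no_cache=1; "
    "ZDEDebuggerPresent=php,phtml,php3"
)

FRONTEND = re.compile(r"^Set-Cookie:.*?\bfrontend=([^;]*)")


def frontend_value(header):
    """Return the frontend cookie value via capture group 1, or None."""
    m = FRONTEND.search(header)
    return m.group(1) if m else None


def frontend_via_replace(header):
    """Replace the entire header with group 1 (unchanged if no match)."""
    return re.sub(r"^Set-Cookie:.*?\bfrontend=([^;]*).*$", r"\1", header)


if __name__ == "__main__":
    print(frontend_value(HEADER))
    print(frontend_via_replace(HEADER))
```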
Do you have control over the replacement string? If so, you can go with Ωmega's answer, and use $1 in your replacement string to write the frontend value back.
Otherwise, you could use this:
^Set-Cookie:.*?(?=frontend=)|(?<=frontend=.{26}).*$
This will match everything from the start of the string up to (but not including) frontend=. Or it will match everything that has frontend= plus exactly 26 characters to the left of it, up until the end of the string. If the value were of variable length instead of 26 characters, it would get significantly more complicated, because only .NET supports variable-length lookbehinds.
For your last question. Let's have a look at your regex:
[^((?<=frontend=)[A-Za-z0-9]{26}(?=;))]
Well, firstly, the negative character class [^...] you tried to surround your pattern with doesn't really work like this. It is still a character class, so it matches only a single character that is not inside that class. But it gets even more complicated (and I wonder why it matches at all). The character class is closed by the first ]. This character class matches anything that is not (, ?, <, =, ), [, a letter or a digit. Then the {26} is applied to that, so we are trying to find 26 of those characters. Then comes the (?=;), which asserts that those 26 characters are followed by ;. Now comes what should not work. The closing ) should actually throw an error. And the final ] would just be interpreted as a literal ].
There are some regex flavors which allow for nesting of character classes (Java does). In this case, you would simply have a character class equivalent to [^a-zA-Z0-9(){}?<=;]. But as far as I could google it, Varnish uses PCRE, and in PCRE your regex should simply not compile.
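How the pattern is actually parsed can be checked in an engine without nested character classes. This is a sketch using Python's `re` module, which is not PCRE but, as far as this construct goes, parses it the same way: the class ends at the first ], and the stray ) makes the full pattern fail to compile. `compiles` is a hypothetical helper.

```python
import re

# Everything up to the first ] is one negated character class.
FIRST_CLASS = re.compile(r"[^((?<=frontend=)[A-Za-z0-9]")

# The full pattern from the question.
FULL_PATTERN = r"[^((?<=frontend=)[A-Za-z0-9]{26}(?=;))]"


def compiles(pattern):
    """Return True if the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


if __name__ == "__main__":
    # A single character outside the class matches; listed ones do not.
    print(bool(FIRST_CLASS.fullmatch(";")))
    print(bool(FIRST_CLASS.fullmatch("a")))
    print(bool(FIRST_CLASS.fullmatch("(")))
    print(compiles(FULL_PATTERN))
```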