Seemingly incorrect regex evaluation in regexp_replace

Seemingly incorrect regex evaluation in regexp_replace - regex

I just stumbled upon a curious behavior of regexp_replace PostgreSQL function. It looks like a bug but I always doubt myself first. When I run
SELECT regexp_replace(E'1%2_3', '([_%])', E'\\ \\1', 'g')
it correctly prefixes either underscore or percent with backslash+space and produces "1\ %2\ _3". However when I remove space (it doesn't have to be space, can be any character)
SELECT regexp_replace(E'1%2_3', '([_%])', E'\\\\1', 'g')
it stops using captured parenthesized expression in substitution and produces "1\12\13" instead of "1\%2\_3". I would appreciate if someone could tell me what am I doing wrong. I simply need to add backslash before certain characters in a string.
UPDATE: I was able to achieve the desired behavior by running
SELECT regexp_replace(E'1%2_3', '([_%])', E'\\\\\\1', 'g')
My original example still seems a bit illogical and inconsistent. The inconsistency is that using the same E'...' syntax 4 backslashes may produce different result.

In your second query, after the backslash escapES are processed at the string level, you have the replacement string \\1.
What's happening is that the escaped backslash prevents \1 from being recognized as a back-reference. You need another set of backslashes, so that the replacement string is \\\1 to get a literal backslash and a back-reference. Since every literal backslash needs to be escaped, you need to double all of them.
SELECT regexp_replace(E'1%2_3', '([_%])', E'\\\\\\1', 'g')

I would not use the outdated Posix escape syntax in Postgres without need in the first place. Are you running an outdated version with standard_conforming_strings = off? Because if you are not, simplify:
SELECT regexp_replace('1%2_3', '([_%])', '\\\1', 'g')
You only need to add a single \ to escape the special meaning of \ in the regexp pattern.
Insert text with single quotes in PostgreSQL
Strings prefixed with E have to be processed, which costs a tiny bit extra and there is always the risk of unintended side effects with special characters. It's pointless to write E'1%2_3' for a string you want to provide as is. Make that just '1%2_3' in any case.
And for just just two characters to replace simple use:
SELECT replace(replace('1%2_3', '_', '\_'), '%', '\%')
Regular expressions are powerful, but for a price. Even several nested simple replace() calls are cheaper than a single regexp_replace().

Related

How to do a negative lookbehind within a %r<…>-delimited regexp in Ruby?

I like the %r<…> delimiters because it makes it really easy to spot the beginning and end of the regex, and I don't have to escape any /. But it seems that they have an insurmountable limitation that other delimiters don't have?
Every other delimiter imaginable works fine:
/(?<!foo)/
%r{(?<!foo)}
%r[(?<!foo)]
%r|(?<!foo)|
%r/(?<!foo)/
But when I try to do this:
%r<(?<!foo)>
it gives this syntax error:
unterminated regexp meets end of file
Okay, it probably doesn't like that it's not a balanced pair, but how do you escape it such that it does like it?
Does something need to be escaped?
According to wikibooks.org:
Any single non-alpha-numeric character can be used as the delimiter,
%[including these], %?or these?, %~or even these things~.
By using this notation, the usual string delimiters " and ' can appear
in the string unescaped, but of course the new delimiter you've chosen
does need to be escaped.
Indeed, escaping is needed in these examples:
%r!(?<\!foo)!
%r?(\?<!foo)?
But if that were the only problem, then I should be able to escape it like this and have it work:
%r<(?\<!foo)>
But that yields this error:
undefined group option: /(?\<!foo)/
So maybe escaping is not needed/allowed? wikibooks.org does list %<pointy brackets> as one of the exceptions:
However, if you use
%(parentheses), %[square brackets], %{curly brackets} or
%<pointy brackets> as delimiters then those same delimiters
can appear unescaped in the string as long as they are in balanced
pairs
Is it a problem with balanced pairs?
Balanced pairs are no problem as long as you are doing something in the Regexp that requires them, like...
%r{(?<!foo{1})} # repetition quantifier
%r[(?<![foo])] # character class
%r<(?<name>foo)> # named capture group
But what if you need to insert a left-side delimiter ({, [, or <) inside the regex? Just escape it, right? Ruby seems to have no problem with escaped unbalanced delimiters most of the time...
%r{(?<!foo\{)}
%r[(?<!\[foo)]
%r<\<foo>
It's just when you try to do it in the middle of the "group options" (which I guess is what the <! characters are classified as here) following a (? that it doesn't like it:
%r<(?\<!foo)>
# undefined group option: /(?\<!foo)/
So how do you do that then and make Ruby happy? (without changing the delimiters)
Conclusion
The workaround is easy. I'll just change this particular regex to just use something else instead like %r{…} instead.
But the questions remain...
Is there really no way to escape the < here?
Are there really some regular expression that are simply impossible to write using certain delimiters like %r<…>?
Is %r<…> the only regular expression delimiter pair that has this problem (where some regular expressions are impossible to write when using it). If you know of a similar example with %r{…}/%r[…], do share!
Version info
Not that it probably matters since this syntax probably hasn't changed, but I'm using:
⟫ ruby -v
ruby 2.6.0p0 (2018-12-25 revision 66547) [x86_64-linux]
Reference:
https://ruby-doc.org/core-2.6.3/Regexp.html
% Notation

As others have mentioned, seems like an oversight based on how this character differs from other paired boundaries.
As far as "Is there really no way to escape the < here?" there is a way... but you're not going to like it:
%r<(?#{'<'}!foo)> == %r((?<!foo))
Using interpolation to insert the < character seems to work. But given that there are much better options, I would avoid it unless you were planning on splitting the regex into sections anyway...

Regex alphanumeric with hyphen, single quotes, and single spacing is timing out (crashing)

I have the following regular expression that I use but it crashes in my browsers (does nothing and then likely times out).
I am trying to accept alphanumeric, as well as dashes and single quotes. I'm also trying to restrict spacing to allow only single spaces (no more than one space consecutively)
<constant>
<constant-name>expressionFormat</constant-name>
<constant-value>^([a-zA-Z0-9'-]+\s?)*$</constant-value>
</constant>
A sample example string that crashes with this is:
"ABCDEFGHIJKLMNOPQ43 5343443RSTUVWXYZ0123456789 ‘ –"
I'm using Struts. Any tips on what I'm doing wrong? Thanks in advance!

I've found a solution.
My OLD expression:
^([a-zA-Z0-9'-]+\s?)*$
First off, I got rid of the \s since it includes other things like tabs, new lines, etc, which I do not want.
The ? is "greedy", which means if the regex fails it continues evaluating the rest of the string until it's sure it's going to return a failure... In essence, the + and ? were making it try and check recursively making it resource intensive for longer strings.
The following expression works much better for my case:
^([a-zA-Z0-9' -])*$

I believe that the browser is just taking a really long time to process the regex search and may even be timing out.
Your sample string
ABCDEFGHIJKLMNOPQ43 5343443RSTUVWXYZ0123456789 ‘ –
will not be matched by your regular expression:
^([a-zA-Z0-9'-]+\s?)*$
Add the special characters (‘ ’ — –), i.e.,
‘ ’ — –
if you want to accept them.
^([a-zA-Z0-9'‘’—–-]+\s?)*$
This regex matches your sample string.
UPDATE:
Try this regex that uses atomic grouping to avoid catastrophic backtracking:
^(?>[a-zA-Z0-9'-]+\s?)*$

What is the meaning of the \\+ in this regex?

I am trying to parse this example regex.
I know that slashes can be used as escape characters. So if you wanted to search for ) without implying a grouping you would do \ and then ) spelling this out to avoid stack overflow regex...
I also know that a plus sign can indicate one or more of the preceding item.
But in the example below, is the plus sign or the slash getting escaped? It seems like the first slash allows you to "escape" the second slash and then the plus sign indicates that there is at least one prior slashes --- but the example says there are at least two + in the string...
What does this regex mean? There are too many new things going on for me to parse it.

But in the example below, is the plus sign or the slash getting escaped?
Both!
The \ is escaped because the query language you are using probably uses it as an escape character itself (i.e. to escape quotes). So \\ is understood as a single \ in the regex, which is then used to escape the +. The regex means a single + followed by zero or many +.
It could probably be rewritten as \\++ where the second + is actually the regex quantifier.

That regexp can actually mean two different things, depending on the PostgreSQL version and the value of standard_conforming_strings.
Old versions (before standard_conforming_strings or those that defaulted to off) would interpret the string as a backslash-escaped string. So PostgreSQL would turn \\+\\+* into \+\+*, i.e. it'd consume a level of escaping. Then the regular expression would consume the remaining level to escape the pluses, so they're interpreted as literal +s not qualifiers. That regexp says ++ followed by zero or more other characters.
Newer versions with standard_conforming_strings defaulting to on will, per the SQL standard, not decode the backslashes as escapes. So you'll run the regexp \\+\\+*, which is one or more backslashes, followed by one or more backslashes, followed by ... oops, the asterisk without a preceding character is an error.
So we know you must have standard_conforming_strings off, 'cos the query would fail to compile the regexp on a new one.
regress=> SELECT 'blah' ~ '\\+\\+*';
ERROR: invalid regular expression: quantifier operand invalid
postgres=> SHOW standard_conforming_strings;
standard_conforming_strings
-----------------------------
on
(1 row)
You'll have this problem later on, so I suggest dealing with it before you upgrade.
Assuming that the x_spam_level field always starts with the pluses, which the regexp doesn't check, that code might be better written as:
x_spam_level LIKE '++%'
If it doesn't start with the pluses use:
x_spam_level LIKE '%++%'
which is what the current regexp is doing. PostgreSQL will turn that into a regular expression internally, but you don't have to worry about the escaping.
If you want to use a regular expression and have it behave consisently across all versions, use:
x_spam_level ~ E'\\+\\+*'
The E'' syntax tells PostgreSQL to decode backslash escapes, irrespective of the standard_conforming_strings setting.

white space in Regular expression

I making use of this software, dk-brics-automaton to get number of states
of regular expressions. Now ,for example I have this type of RE:
^SEARCH\s+[^\n]{10}
When I insert it below as a string, the compiler say that invalid escape sequence
RegExp r = new RegExp("^SEARCH\s+[^\n]{10}", ALL);
where ALL is a certain FLAG
when I use double back slashes before small s, then the compiler accepts it
as a string where as over here \s means space but I am confused when I will make use of
double back slashes then it will consider just back slash and "s" where as I meant white space.
Now, I have thousands of such regular expressions for which I want to compute finite automaton
states.So, does that mean that I have to add manually back slashes in all the RE?
Here is a link where they have explained something related to this but I am not getting it:
http://www.brics.dk/automaton/doc/index.html
Please help me if anyone has some past experience in this software or if you have any idea to solve this issue.

I had another look at that documentation. "automaton" is a java package, therefor I think you have to treat them like java regexes. So just double every backslash inside a regex.
The thing here is, Java does not know "raw" strings. So you have to escape for two levels. The first level that evaluates escape sequences is the string level.
The string does not know an escape sequence \s, that is the error. \n is fine, the string evaluates it and stores instead the two characters \ (0x5C) and n (0x6E) the character 0x0A.
Then the string is stored and handed over to the regex constructor. Here happens the next round of escape sequence evaluation.
So if you want to escape for the regex level, then you have to double the backslashes. The string level will evaluate the \\ to \ and so the regex level gets the correct escape sequences.

regex_match allow '

I am using regex_match for validation for last names
I have this so far regex_match[/^[a-zA-Z -]{0,25}+$/] but I also want to allow ' for names like O'Neal. I tried this regex_match[/^[a-zA-Z -\']{0,25}+$/] but it didnt work
any suggestions?
Thanks,
J

-\' is an invalid range. You need to put the dash at the end of the character class:
/^[a-zA-Z '-]{0,25}$/
The + is superfluous here (in some regex flavors, it activates "possessive matching", but it's definitely not needed here), as is the backslash.
Also, I suspect that the square brackets around the regex are not syntactically correct in whatever language you're using. (Which one is that, by the way?)
But the real problem is your trying to validate a name (with a regex, no less).

You don't need the \ to escape the ', but you probably should put the dash last so it's not creating a range.
Nor do you need the + after the {0,25}, it's not a valid regex with it.
This works fine for me ^[a-zA-Z '-]{0,25}$

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js