Regex working in regex engine but not in PostgreSQL

I am trying to match the number 13 in a pipe-separated string like the ones below:
13 - match
1|2|13 - match
13|1|2 - match
1|13|2 - match
1345|1|2 - should fail
1|1345|2 - should fail
1|2|1345 - should fail
1|4513|2 - should fail
4513|1|2 - should fail
2|3|4|4513 - should fail
So 13 should match only when it occurs as a whole value, whether at the beginning, at the end, or in the middle of the string.
For that I wrote the following regex:
^13$|(\|13\|)?(?(1)|(^13\||\|13$))
In Regex101 it works as expected.
But in PostgreSQL the following query throws an error:
SELECT * FROM tbl_privilage WHERE user_id = 24 and show_id ~ '^13$|(\|13\|)?(?(1)|(^13\||\|13$))';
Error:
ERROR: invalid regular expression: quantifier operand invalid
SQL state: 2201B

Don't use a regex, using an array is more robust (and maybe more efficient as well):
select *
from the_table
where '13' = any (string_to_array(the_column, '|'));
This assumes that there is no whitespace between the values and the delimiter. You can even index that expression, which probably makes searching a lot faster.
But I agree with Frank: you should really fix your data model.
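The split-and-compare idea behind string_to_array can also be sketched outside the database. This is a minimal Python illustration, not PostgreSQL code; the helper name contains_value is made up for the example, and the test strings are the ones from the question.

```python
def contains_value(pipe_separated: str, wanted: str = "13") -> bool:
    """Return True if `wanted` is one of the '|'-separated values."""
    # Same idea as: '13' = any (string_to_array(the_column, '|'))
    return wanted in pipe_separated.split("|")


# Sample strings taken from the question.
should_match = ["13", "1|2|13", "13|1|2", "1|13|2"]
should_fail = ["1345|1|2", "1|1345|2", "1|2|1345", "1|4513|2", "4513|1|2", "2|3|4|4513"]

for value in should_match + should_fail:
    print(value, contains_value(value))
```

Because the comparison is done on whole values, no escaping or boundary handling is needed, which is what makes this approach more robust than a regex.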

The documentation is quite clear: the ~ operator implements POSIX regular expressions. In Regex101 you're using PCRE (Perl-compatible) regular expressions. The two are very different; in particular, the conditional construct (?(1)...) in your pattern is a PCRE feature that PostgreSQL's regex engine does not support.
If you need PCRE regular expressions in PostgreSQL, you can set up an extension such as pgpcre.

You need to match 13 within word boundaries:
[[:<:]]13[[:>:]]
This solution should work even if you have spaces around the numeric values.
See documentation:
There are two special cases of bracket expressions: the bracket
expressions [[:<:]] and [[:>:]] are constraints, matching empty
strings at the beginning and end of a word respectively.
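For comparison, the same word-boundary idea can be tried in Python. This is only a sketch: Python's re module does not understand [[:<:]] and [[:>:]], so \b is used as a stand-in, and the helper name has_13 is hypothetical.

```python
import re

# \b is Python's word-boundary assertion; it stands in here for the
# PostgreSQL constraints [[:<:]] (start of word) and [[:>:]] (end of word).
WHOLE_13 = re.compile(r"\b13\b")


def has_13(value: str) -> bool:
    """Return True if 13 appears as a whole word anywhere in `value`."""
    return WHOLE_13.search(value) is not None


for value in ["13", "1|13|2", "1 | 13 | 2", "1|4513|2", "1345|1|2"]:
    print(repr(value), has_13(value))
```

The third sample shows the point made above: spaces around the values do not break the match, because a space is a non-word character just like the pipe.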

Related

Regular Expression in Tableau returns only Nulls in Calculated Field

In Tableau, I'm trying to extract the first occurrence of a part-of-speech name (e.g. subst, adj, fin), located between { and :, from every line of the column below:
{subst:pl:nom:m3=18, subst:pl:voc:m3=1, subst:pl:acc:m3=5}
{subst:sg:gen:m3=5, subst:sg:inst:m3=1, subst:sg:gen:f=1, subst:sg:nom:m3=1}
{subst:sg:nom:f=3, subst:sg:loc:f=2, subst:sg:inst:f=1, subst:sg:nom:m3=1}
{adj:sg:nom:m3:pos=2, adj:sg:acc:m3:pos=1, adj:sg:acc:n1.n2:pos=3, adj:pl:acc:m1.p1:pos=3, adj:sg:nom:f:pos=1}
{adj:sg:gen:f:pos=2, adj:sg:nom:n:pos=1}
{fin:sg:ter:imperf=5}
To do this I use the following regular expression: {(\w+):(?:.*?)}$. Unfortunately, my calculated field returns only Nulls.
I checked my regular expression in a regex tester (regex101.com) and it is valid.
I don't know what I'm doing wrong, so if anybody has any suggestions I would be grateful.
Tableau's regex engine is ICU, and there are some differences between it and PCRE.
One of them is that braces that should be matched as literal symbols must be escaped.
Your regex also contains a redundant non-capturing group ((?:.*?) is equivalent to .*?) and a lazy quantifier that slows down matching. Since you want to check for a } at the end of the string, it should be changed to a greedy .*.
You can use
REGEXP_EXTRACT([col], '^\{(\w+):.*\}$')
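The pattern itself can be checked outside Tableau. The following Python sketch applies the same expression to a few rows from the question; note that Python's re module is more lenient than ICU about unescaped braces, so this only verifies the logic of the pattern, not Tableau's behaviour. The function name first_part_of_speech is made up for the example.

```python
import re

# Same pattern as in the REGEXP_EXTRACT call: escaped braces, one capture
# group for the part of speech, and a greedy .* up to the closing brace.
PART_OF_SPEECH = re.compile(r"^\{(\w+):.*\}$")


def first_part_of_speech(value):
    """Return the first part-of-speech tag, or None if the row does not match."""
    match = PART_OF_SPEECH.match(value)
    return match.group(1) if match else None


rows = [
    "{subst:pl:nom:m3=18, subst:pl:voc:m3=1, subst:pl:acc:m3=5}",
    "{adj:sg:gen:f:pos=2, adj:sg:nom:n:pos=1}",
    "{fin:sg:ter:imperf=5}",
]
print([first_part_of_speech(row) for row in rows])  # ['subst', 'adj', 'fin']
```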

Postgresql regexp_matches syntax not working as expected

I use the Postgres regexp_matches function to extract numbers.
The regular expression I use is
4([\s\-\/\.]*?0){3}([\s\-\/\.]*?[12]){1}([\s\-\/\.]*?\d){4}
If I use a tool like https://regexr.com/ to verify if it's working and I apply the following test set
4-0001-1234
5-2342-2344
499999999
4-0001-1234 4.0001.12344 4-0-0-0-1-1234
I get the expected extraction result:
4-0001-1234
4-0001-1234
4.0001.1234
4-0-0-0-1-1234
However, if I use the same expression in PostgreSQL, it does not work as expected:
SELECT unnest(regexp_matches('4-0001-1234', '4([\s\-\/\.]*?0){3}([\s\-\/\.]*?[12]){1}([\s\-\/\.]*?\d){4}', 'g'));
Result:
0
1
4
I suspect that it has to do with greediness and/or that quantifiers like {3} are not applied in the right way. Or perhaps it is because PostgreSQL uses the POSIX standard for regular expressions, which always seems to differ a little from the Java syntax.
Any suggestions why it is not working and how to fix it?
The regexp_matches(string text, pattern text [, flags text]) function returns the captured values:
Return all captured substrings resulting from matching a POSIX regular expression against the string.
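Since your pattern contains capturing groups, only their captures are returned rather than the whole match. You may fix the expression using non-capturing groups: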
SELECT unnest(regexp_matches('4-0001-1234 4.0001.12344 4-0-0-0-1-1234', '4(?:[\s/.-]*0){3}(?:[\s/.-]*[12])(?:[\s/.-]*\d){4}', 'g'));
BTW, you do not need to escape - when it is at the start/end of the bracket expression, and there is no need to escape either / or . there. I also suggest removing {1}, as a = a{1} in any regex flavor supporting limiting quantifiers.
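Python's re.findall happens to behave in a similar way to regexp_matches here (it returns the groups when the pattern has capturing groups, and the whole match otherwise), so the effect can be sketched as follows. This is an analogy in Python, not PostgreSQL code.

```python
import re

text = "4-0001-1234 4.0001.12344 4-0-0-0-1-1234"

# Original pattern: three capturing groups, each quantified.
capturing = r"4([\s\-/.]*?0){3}([\s\-/.]*?[12]){1}([\s\-/.]*?\d){4}"
# Fixed pattern: the same structure with non-capturing groups.
non_capturing = r"4(?:[\s/.-]*0){3}(?:[\s/.-]*[12])(?:[\s/.-]*\d){4}"

# With capturing groups, only the last iteration of each group is reported.
groups_only = re.findall(capturing, "4-0001-1234")
# Without capturing groups, the whole match is reported.
whole_matches = re.findall(non_capturing, text)

print(groups_only)    # [('0', '1', '4')]
print(whole_matches)  # ['4-0001-1234', '4.0001.1234', '4-0-0-0-1-1234']
```

The first result mirrors the 0, 1, 4 rows seen in the question: each value is the last repetition of one quantified group.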

Regular expression ((MM)*N{1,3})|((N{1,3})(MM)*) not matching what I think it should

I am trying to write a regular expression that would match pairs of Ms and up to 3 Ns consecutively in any order so
MMMMNN would match
MMNNN would match
NNNMM would match
NMMMM would also match
I used the following regular expression:
((MM)*N{1,3})|(N{1,3}(MM)*)
Each term matches alone, but when I put the | between them it doesn't seem to match both possibilities. I used http://regex101.com/ to test it.
What regular expression would match those?
This matches all the examples you have:
(N{1,3}(MM)+)|((MM)+N{1,3})
The question, however, is whether 'up to 3' should include zero instances.
Edit: The comment is correct, removed the extra plus.
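One possible explanation for the original behaviour, sketched in Python under the assumption that the pattern was tested unanchored: the first alternative allows zero MM pairs, so for input starting with N it succeeds on the leading Ns alone and the second alternative is never tried. Requiring the whole string to match, or using + as above, avoids this.

```python
import re

original = re.compile(r"((MM)*N{1,3})|(N{1,3}(MM)*)")
answer = re.compile(r"(N{1,3}(MM)+)|((MM)+N{1,3})")
samples = ["MMMMNN", "MMNNN", "NNNMM", "NMMMM"]

# Unanchored: the first alternative wins as soon as it can match anything.
partial = {s: original.match(s).group(0) for s in samples}
# Forcing a match of the entire string makes the engine try both alternatives.
full = {s: original.fullmatch(s) is not None for s in samples}
# The pattern from the answer consumes each sample completely.
with_plus = {s: answer.match(s).group(0) for s in samples}

print(partial)    # 'NNNMM' -> 'NNN', 'NMMMM' -> 'N'
print(full)       # all True
print(with_plus)  # every sample matched in full
```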
Does this work for you...
(MMN{0,3})|(N{0,3}MM)

Regular Expression using vbscript for two numbers: not valid:"$123456789012" and valid: "12345678912"

I want to create a regular expression that finds numbers with 10 or more digits, where the number must not have a $ symbol in front of it.
E.g.:
not valid: $123456789012 and valid: 12345678912.
Apart from this I have more validations, for example finding the number patterns 3digits - 4digits and 3digits - 4digits - 5digits.
So far I have been able to create patterns for all of those, but not for excluding $<number>. Could you please help?
Sorry for not mentioning it at the beginning: this is using VBScript.
Code:
RegularExpressionObject.IgnoreCase = True
RegularExpressionObject.Global = True
RegularExpressionObject.Pattern = "(([0-9]{3}-[0-9]{4})|([0-9]{3}-[0-9]{4}-[0-9]{5})|([0-9]{3}-[0-9]{4}-[0-9]{5}-[0-9]{6})|([0-9]{10}))"
' pattern: CCID or 3(digits)-4(digits), 3(digits)-4(digits)-5(digits), 3(digits)-4(digits)-5(digits)-6(digits), or a number with 10 or more digits
Set Matches = RegularExpressionObject.Execute(Rescomts)
If (Matches.Count <> 0) Then
In the above code, the pattern allows 10-digit numbers, but I want to ignore numbers that are preceded by a $ symbol.
As @Sam said, you could use a negative look-behind:
(?<!\$)\b[0-9]{10,}\b
Example: http://regex101.com/r/vR3iF6
Depending on the language selected, look-around may not be available. For example, older JavaScript engines do not support the look-behind (?<!\$), so you may need to write it as
[^$]\b[0-9]{10,}\b
Example: http://regex101.com/r/oH5uE4
Some languages support \d, which may make your regex look cleaner. Others don't, and you'll need to write [0-9].
EDIT
Per @AlanMoore's suggestion, extraneous characters may also interfere. You might be able to get around this by using \b[^$]\b instead of just a single \b:
\b[^$]\b[0-9]{10,}\b
Example: http://regex101.com/r/lL2bN9
This would get rid of preceding spaces, non-digit characters, etc.
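The look-behind variant can be tried in a flavor that supports it. Here is a short Python sketch; Python is used only for illustration, and as far as I know VBScript's RegExp object does not support look-behind, so this shows the intent of the pattern rather than a drop-in VBScript solution. The helper name find_numbers is hypothetical.

```python
import re

# Ten or more digits as a whole word, not immediately preceded by '$'.
NO_DOLLAR = re.compile(r"(?<!\$)\b[0-9]{10,}\b")


def find_numbers(text):
    """Return all 10+ digit numbers in `text` that are not prefixed with '$'."""
    return NO_DOLLAR.findall(text)


print(find_numbers("not valid: $123456789012 and valid: 12345678912"))
# ['12345678912']
```

The \b after the look-behind matters: without it, the engine could skip the first digit after the $ and still match the remaining digits.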
This regular expression seems to work:
^[^\$][0-9]{9,}$

Regular Expression - Want two matches get only one

I'm working with a regular expression and have some lines of JavaScript. My expression should deliver two matches but recognizes only one, and I don't know what the problem is.
The lines of JavaScript look like this:
if(mode==1) var adresse = "?APPNAME=CampusNet&PRGNAME=ACTION&ARGUMENTS=-A7uh6sBXerQwOCd8VxEMp6x0STE.YaNZDsBnBOto8YWsmwbh7FmWgYGPUHysiL9u0.jUsPVdYQAlvwCsiktBzUaCohVBnkyistIjCR77awL5xoM3WTHYox0AQs65SoHAhMXDJVr7="; else var adresse = "?APPNAME=CampusNet&PRGNAME=ACTION&ARGUMENTS=-AHMqmg-jXIDdylCjFLuixe..udPC2hjn6Kiioq7O41HsnnaP6ylFkQLhaUkaWKINEj4l2JqL2eBSzOpmG.b5Av2AvvUxEinUhMBTt5awdgAL4SkBEgYXGejTGUxcgPE-MfiQjefc=";
My expression looks like this:
(?<Popup>(popUp\(')|(adresse...")).*\?((?<Parameters>APPNAME=CampusNet[^>"']*["']))
I want to have two matches with APPNAME...... as Parameters.
[UPDATE] As Tim Pietzcker wrote, I used the greedy version and should have used the lazy version. While he was writing that, I solved it myself by using .*? instead of .* in the middle, so the expression looks like this:
(?<Popup>(popUp\(')|(adresse...")).*?\\?((?<Parameters>APPNAME=CampusNet[^>"']*["']))
That worked. Thanks to Tim Pietzcker
Your regex matches too much - from the very first adresse until the very last " because it uses a greedy quantifier .*.
If you make that quantifier lazy, i.e.
(?<Popup>(popUp\(')|(adresse...")).*?\?((?<Parameters>APPNAME=CampusNet[^>"']*["']))
you get two matches.
Alternatively, if your data allows this, use \S* instead of .* so that only non-space characters are matched. This will match faster (but will of course fail if the text you're trying to match could contain spaces):
(?<Popup>(popUp\(')|(adresse..."))\S*\?((?<Parameters>APPNAME=CampusNet[^>"']*["']))
Usually you must apply the regex with the "global" flag to find all matches. I can't really say more until I see the complete code sample you are working with.
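The greedy versus lazy difference can be reproduced with a small Python sketch. Several things here are assumptions made for illustration: the ARGUMENTS values are shortened placeholders, the outer Popup group is dropped for brevity, and the named group is written as (?P<Parameters>...) because that is the syntax Python's re module expects.

```python
import re

# Shortened stand-in for the JavaScript line from the question.
line = (
    'if(mode==1) var adresse = "?APPNAME=CampusNet&PRGNAME=ACTION&ARGUMENTS=-A7uh"; '
    'else var adresse = "?APPNAME=CampusNet&PRGNAME=ACTION&ARGUMENTS=-AHMq";'
)

# Common tail of both patterns: literal '?', then the named Parameters group.
tail = r"""\?(?P<Parameters>APPNAME=CampusNet[^>"']*["'])"""
greedy = re.compile(r'adresse...".*' + tail)
lazy = re.compile(r'adresse...".*?' + tail)

# finditer plays the role of the "global" flag: it walks over all matches.
greedy_params = [m.group("Parameters") for m in greedy.finditer(line)]
lazy_params = [m.group("Parameters") for m in lazy.finditer(line)]

print(len(greedy_params), greedy_params)  # one match, taken from the second adresse
print(len(lazy_params), lazy_params)      # two matches, one per adresse
```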