Postgresql regexp_matches syntax not working as expected - regex

I use the Postgres regexp_matches function to extract numbers.
The regular expression I use is
4([\s\-\/\.]*?0){3}([\s\-\/\.]*?[12]){1}([\s\-\/\.]*?\d){4}
If I use a tool like https://regexr.com/ to verify if it's working and I apply the following test set
4-0001-1234
5-2342-2344
499999999
4-0001-1234 4.0001.12344 4-0-0-0-1-1234
I get the expected extraction result:
4-0001-1234
4-0001-1234
4.0001.1234
4-0-0-0-1-1234
However, if I use the same expression in PostgreSQL, it does not work as expected:
SELECT unnest(regexp_matches('4-0001-1234', '4([\s\-\/\.]*?0){3}([\s\-\/\.]*?[12]){1}([\s\-\/\.]*?\d){4}', 'g'));
Result:
0
1
4
What I suspect is that it has to do with greediness and/or that quantifiers like {3} are not applied in the right way. Or it uses the POSIX standard for regular expressions, which always seems to differ a little from the Java syntax.
Any suggestions why it is not working and how to fix it?

The regexp_matches(string text, pattern text [, flags text]) function returns the captured values:
Return all captured substrings resulting from matching a POSIX regular expression against the string.
You may fix the expression using non-capturing groups:
SELECT unnest(regexp_matches('4-0001-1234 4.0001.12344 4-0-0-0-1-1234', '4(?:[\s/.-]*0){3}(?:[\s/.-]*[12])(?:[\s/.-]*\d){4}', 'g'));
See the online demo.
BTW, you do not need to escape - when it is at the start/end of the bracket expression, and there is no need to escape either / or . there. I also suggest removing {1}, as a = a{1} in any regex flavor that supports limiting quantifiers.
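The effect of capturing versus non-capturing groups can be reproduced outside PostgreSQL. Below is a minimal sketch in Python, whose re.findall also returns group captures instead of whole matches when the pattern contains capturing groups. Python's re is not PostgreSQL's regex engine, so treat this as an analogy for the behaviour described above rather than a description of regexp_matches itself.

```python
import re

text = "4-0001-1234 4.0001.12344 4-0-0-0-1-1234"

# Original pattern: three capturing groups, each repeated by a quantifier.
# A repeated group keeps only the text of its last iteration.
capturing = r"4([\s\-\/\.]*?0){3}([\s\-\/\.]*?[12]){1}([\s\-\/\.]*?\d){4}"
captured = re.findall(capturing, "4-0001-1234")

# Fixed pattern: non-capturing groups, so whole matches are returned.
non_capturing = r"4(?:[\s/.-]*0){3}(?:[\s/.-]*[12])(?:[\s/.-]*\d){4}"
whole = re.findall(non_capturing, text)

print(captured)  # one tuple holding the last capture of each group
print(whole)     # the full matched substrings
```

The first call yields the same 0, 1, 4 values as the query in the question, which suggests the pattern itself matches and only the choice of what is returned differs.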

Related

I need to get all expressions between parentheses in a mathematical operation

I need to get all expressions between parentheses in a mathematical operation, with Scala.
I tried to do it with a regex. And it works with expressions like:
(2+4)-> Result: 2+4
4*(3+1)-> Result: 3+1
But I can't get all the values, as in the following example:
(2+1)*(4+3)-> Result: 2+1)*(4+3
Expected result:
`2+1`
`4+3`
Where "formula" is the input expression
val regex = Pattern.compile("\\((.*)\\)")
val regexMatcher = regex.matcher(formula)
while (regexMatcher.find()) {
  println(regexMatcher.group(1)) // fetch group 1 from the current match
}
EDIT: In case of (1+(2+3)), the good result would be 1+(2+3)
You can use the reluctant quantifier *? instead of * to capture only the characters until the first ).
In Scala, you don't need to double backslashes if you use triple-quoted strings, which makes regexes much more readable:
Combining:
Pattern.compile("""\((.*?)\)""")
As mrzasa's answer mentions, regular expressions are the wrong tool if you ever need to handle nested parentheses (for a caveat, see How to match string within parentheses (nested) in Java? and Regular expression to match balanced parentheses).
In Scala, you can use parser combinators instead. Huynhjl's answer to Parsing values contained inside nested brackets describes something which you can use as a starting point.
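For the nested case, even a small hand-written scanner may be enough. The sketch below is in Python and uses a plain depth counter, not parser combinators, so it is a simpler stand-in for the approach named above; the function name top_level_groups is hypothetical.

```python
def top_level_groups(formula):
    """Return the text inside each outermost pair of parentheses."""
    groups = []
    depth = 0
    start = None
    for i, ch in enumerate(formula):
        if ch == "(":
            if depth == 0:
                start = i + 1  # first character after the outermost "("
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError(f"unbalanced ')' at index {i}")
            depth -= 1
            if depth == 0:
                groups.append(formula[start:i])
    if depth != 0:
        raise ValueError("unbalanced '('")
    return groups


flat = top_level_groups("(2+1)*(4+3)")
nested = top_level_groups("(1+(2+3))")
print(flat)
print(nested)
```

Because it only tracks depth, this returns 1+(2+3) for the nested input from the EDIT, which a single lazy or negated-class regex cannot do.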
Your regex is too greedy, it takes too much. Try
([^)]*)
this limits repetition to chars that are not a closing bracket.
In Scala it would probably be:
val regex = Pattern.compile("\\(([^\\)]*)\\)")
Demo
Also note that this does not support nested brackets, e.g. (1+(2+3))
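To see the difference between the greedy, lazy and negated-class variants side by side, here is a short sketch in Python. The patterns are the ones discussed in this thread; Python's re is assumed to behave like java.util.regex for these simple constructs.

```python
import re

formula = "(2+1)*(4+3)"

# Greedy: .* runs to the last ")" in the input.
greedy = re.findall(r"\((.*)\)", formula)

# Lazy: .*? stops at the first ")" it can.
lazy = re.findall(r"\((.*?)\)", formula)

# Negated class: [^)]* can never cross a ")".
negated = re.findall(r"\(([^)]*)\)", formula)

# Nested input: the negated class stops at the inner ")".
nested = re.findall(r"\(([^)]*)\)", "(1+(2+3))")

print(greedy, lazy, negated, nested)
```

The last result is cut off at the first closing bracket, which illustrates why neither variant handles nesting.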

Regular Expression in Tableau returns only Nulls in Calculated Field

I'm trying to extract in Tableau the first occurrence of a part-of-speech name (e.g. subst, adj, fin) located between { and : in every line of the column below:
{subst:pl:nom:m3=18, subst:pl:voc:m3=1, subst:pl:acc:m3=5}
{subst:sg:gen:m3=5, subst:sg:inst:m3=1, subst:sg:gen:f=1, subst:sg:nom:m3=1}
{subst:sg:nom:f=3, subst:sg:loc:f=2, subst:sg:inst:f=1, subst:sg:nom:m3=1}
{adj:sg:nom:m3:pos=2, adj:sg:acc:m3:pos=1, adj:sg:acc:n1.n2:pos=3, adj:pl:acc:m1.p1:pos=3, adj:sg:nom:f:pos=1}
{adj:sg:gen:f:pos=2, adj:sg:nom:n:pos=1}
{fin:sg:ter:imperf=5}
To do this I use the following regular expression: {(\w+):(?:.*?)}$. Unfortunately my calculated field returns only Nulls:
Screenshot from Tableau
I checked my regular expression in a regex tester and it is valid:
Screenshot from regex101.com
I don't know what I'm doing wrong, so if anybody has any suggestions I would be grateful.
Tableau's regex engine is ICU, and there are some differences between it and PCRE.
One of them is that braces that should be matched as literal symbols must be escaped.
Your regex also contains a redundant non-capturing group ((?:.*?) = .*?) and a lazy quantifier that slows down matching since you want to check for a } at the end of the string, and thus should be changed to a greedy .*.
You can use
REGEXP_EXTRACT([col], '^\{(\w+):.*\}$')
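The corrected pattern can be checked against the sample rows outside Tableau. This is a rough sketch in Python; note that Python's re is not ICU and happens to accept unescaped braces as literals, so it only confirms what the escaped pattern captures, not the Null behaviour. The helper name first_tag is hypothetical.

```python
import re

rows = [
    "{subst:pl:nom:m3=18, subst:pl:voc:m3=1, subst:pl:acc:m3=5}",
    "{adj:sg:gen:f:pos=2, adj:sg:nom:n:pos=1}",
    "{fin:sg:ter:imperf=5}",
]

# Same pattern as in the REGEXP_EXTRACT call: escaped braces, greedy .*
pattern = re.compile(r"^\{(\w+):.*\}$")


def first_tag(row):
    """Return the first part-of-speech name, or None if the row does not match."""
    match = pattern.search(row)
    return match.group(1) if match else None


tags = [first_tag(row) for row in rows]
print(tags)
```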

Regex working in regex engine but not in postgresql

I tried to match number 13 in pipe separated string like the one below:
13 - match
1|2|13 - match
13|1|2 - match
1|13|2 - match
1345|1|2 - should fail
1|1345|2 - should fail
1|2|1345 - should fail
1|4513|2 - should fail
4513|1|2 - should fail
2|3|4|4513 - should fail
So, if 13 only occurs at the beginning or end, or in-between the string as a whole word it should match.
For that I wrote the following regex:
^13$|(\|13\|)?(?(1)|(^13\||\|13$))
In Regex101 it is working as expected. Please click link to see my sample.
But in PostgreSQL it throws an error for the following query:
SELECT * FROM tbl_privilage WHERE user_id = 24 and show_id ~ '^13$|(\|13\|)?(?(1)|(^13\||\|13$))';
Error:
ERROR: invalid regular expression: quantifier operand invalid
SQL state: 2201B
Don't use a regex, using an array is more robust (and maybe more efficient as well):
select *
from the_table
where '13' = any (string_to_array(the_column, '|'));
This assumes that there is no whitespace between the values and the delimiter. You can even index that expression, which probably makes searching a lot faster.
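The split-and-compare idea is easy to check against the test set from the question. The sketch below mirrors it in Python rather than SQL; has_value is a hypothetical helper, and like the query it assumes no whitespace around the delimiter.

```python
def has_value(column, wanted="13", sep="|"):
    """True if `wanted` is one of the sep-separated items in `column`."""
    return wanted in column.split(sep)


should_match = ["13", "1|2|13", "13|1|2", "1|13|2"]
should_fail = ["1345|1|2", "1|1345|2", "1|2|1345",
               "1|4513|2", "4513|1|2", "2|3|4|4513"]

matched = [s for s in should_match + should_fail if has_value(s)]
print(matched)
```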
But I agree with Frank: you should really fix your data model.
Documentation is quite clear, saying that operator ~ implements the POSIX regular expressions. In Regex101 you're using PCRE (Perl-compatible) regular expressions. The two are very different.
If you need PCRE regular expressions in PostgreSQL you can set up an extension such as pgpcre.
You need to match 13 within word boundaries.
You need
[[:<:]]13[[:>:]]
This solution should work even if you have spaces around the numeric values.
See documentation:
There are two special cases of bracket expressions: the bracket
expressions [[:<:]] and [[:>:]] are constraints, matching empty
strings at the beginning and end of a word respectively.
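The word-boundary idea can be tried on the question's test set as well. The sketch below uses Python, where \b plays roughly the role that [[:<:]] and [[:>:]] play in the pattern above; that correspondence is an assumption about equivalent behaviour on this data, not a statement about PostgreSQL syntax.

```python
import re

# \b13\b: "13" with a word boundary on each side
pattern = re.compile(r"\b13\b")

should_match = ["13", "1|2|13", "13|1|2", "1|13|2", "1 | 13 | 2"]
should_fail = ["1345|1|2", "1|1345|2", "1|2|1345",
               "1|4513|2", "4513|1|2", "2|3|4|4513"]

hits = [s for s in should_match + should_fail if pattern.search(s)]
print(hits)
```

The entry with spaces around the delimiter is included to show the whitespace tolerance mentioned above.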

how to avoid to match the last letter in this regexp?

I have a question about regexp in Tcl:
first output: TIP_12.3.4 %
second output: TIP_12.3.4 %
and sometimes the output maybe look like:
first output: TIP_12 %
second output: TIP_12 %
I want to get the number 12.3.4 or 12 using the following regexp:
output: TIP_(/[0-9].*/[0-9])
But why does it not match 12.3.4 or 12?
You need to escape the dot, else it stands for "match every character". Also, I'm not sure about the slashes in your regexp. Better solution:
/TIP_(\d+\.?)+/
Your problem is that / is not special in Tcl's regular expression language at all. It's just an ordinary printable non-letter character. (Other languages are a little different, as it is quite common to enclose regular expressions in / characters; this is not the case in Tcl.) Because it is a simple literal, using it in your RE makes it expect it in the input (despite it not being there); unsurprisingly, that makes the RE not match.
Fixing things: I'd use a regular expression like this: output: TIP_([\d.]+) under the assumption that the data is reasonably well formatted. That would lead to code like this:
regexp {output: TIP_([0-9.]+)} $input -> dottedDigits
Everything not in parentheses is a literal here, so that the code is able to find what to match. Inside the parentheses (the bit we're saving for later) we want one or more digits or periods; putting them inside a square-bracketed-set is perfect and simple. The net effect is to store the 12.3.4 in the variable dottedDigits (if found) and to yield a boolean result that says whether it matched (i.e., you can put it in an if condition usefully).
NB: the regular expression is enclosed in braces because square brackets are also Tcl language metacharacters; putting the RE in braces avoids trouble with misinterpretation of your script. (You could use backslashes instead, but they're ugly…)
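For readers who want to test the pattern outside Tcl, here is a small sketch of the same regular expression in Python. It assumes the input lines look like the samples in the question; the function name dotted_digits is hypothetical.

```python
import re

# Literal prefix, then one or more digits or periods captured as group 1
pattern = re.compile(r"output: TIP_([0-9.]+)")


def dotted_digits(line):
    """Return the captured version string, or None when the line does not match."""
    match = pattern.search(line)
    return match.group(1) if match else None


versions = [
    dotted_digits("first output: TIP_12.3.4 %"),
    dotted_digits("second output: TIP_12 %"),
]
print(versions)
```

Returning None for a non-matching line plays the role of the boolean result that regexp yields in Tcl.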
Try this:
output: TIP_(/([0-9\.^%]*)/[0-9])
Capture group 1.
Demo here :
http://regexr.com?31f6g
The following expression works for me:
{TIP_((\d+\.?)+)}

get a substring with regular expression not left

I have a text like this:
a = CreateObject("1-SI")
foo bar 'blah blah CreateObject("2-No")
'CreateObject("3-No")
Using a regular expression, I want to select all CreateObject("...") substrings that don't have the ' character to their left.
How can I do this?
You can do it like this (example at RegExr)
^(?:[^']*?)(CreateObject\(".*?"\))
Not sure about VB6's regex, but this doesn't require lookahead or lookbehind.
The first capturing group is the CreateObject(..) part. You will need to use multiline mode (if possible in VB6).
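The line-anchored pattern can be tried on the sample text as follows. This is a sketch in Python, not VB6, so it says nothing about VB6's engine; the negated class is widened to [^'\n] here so that it cannot run across line breaks, which is an addition to the pattern above.

```python
import re

text = '''a = CreateObject("1-SI")
foo bar 'blah blah CreateObject("2-No")
'CreateObject("3-No")'''

# ^ anchors at each line start (MULTILINE); [^'\n]*? refuses to step over a '
pattern = re.compile(r'''^[^'\n]*?(CreateObject\(".*?"\))''', re.MULTILINE)

found = pattern.findall(text)
print(found)
```

Because the match is anchored at the start of the line, this finds at most one CreateObject call per line.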
Why don't you just try [^']*CreateObject(...)?
Another solution would be to use negative lookbehinds. Note that this kind of construct is not supported by all programming languages, not to mention the regexp engines in text editors.