What changed from PostgreSQL 8.3 to 9.2 in regex matching?

If I run this query:
SELECT 'Via Orologio 122 A' SIMILAR TO '(Strada|Via) % [0-9]+( [A-Z])?';
I expect to get TRUE. PostgreSQL 9.1.8 returns the expected value, but version 8.3 returns FALSE. I think the problem is the final question mark. In fact, the query:
SELECT 'Via Orologio 122 A' SIMILAR TO '(Strada|Via) % [0-9]+( [A-Z])';
returns TRUE in both versions.
Does anyone know what the difference between the two versions is?

From changelog of 8.3.2:
Fix a corner case in regular-expression substring matching
(substring(string from pattern)) (Tom)
The problem occurs when there
is a match to the pattern overall but the user has specified a
parenthesized subexpression and that subexpression hasn't got a match.
An example is substring('foo' from 'foo(bar)?'). This should return
NULL, since (bar) isn't matched, but it was mistakenly returning the
whole-pattern match instead (ie, foo)
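The corner case described in the changelog has an analogue in most regex engines. As an illustration only (this uses Python's re module, not PostgreSQL's substring function), an optional group that does not take part in a match reports "no value" even though the overall pattern matched:

```python
import re

# The overall pattern matches "foo", but the optional group (bar)
# does not participate in the match.
m = re.search(r"foo(bar)?", "foo")

whole = m.group(0)  # text matched by the whole pattern
sub = m.group(1)    # text matched by the parenthesized group, if any

print(repr(whole))  # 'foo'
print(repr(sub))    # None
```

This mirrors the fixed PostgreSQL behavior, where the unmatched subexpression yields NULL rather than the whole-pattern match.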

When switching to a regular expression (~), the drop-in replacement would be:
SELECT 'Via Orologio 122 A' ~ '^(?:(?:Strada|Via) .* [0-9]+(?: [A-Z])?)$'
The differences from the SIMILAR TO pattern:
left-anchored and right-anchored
% becomes .* (with *, not +)
non-capturing parentheses
Hint:
You can let Postgres translate SIMILAR TO expressions for you with the technique outlined in this related answer on dba.SE.
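The translation rules listed above can be sketched in a few lines. This is a minimal illustration in Python, not what PostgreSQL does internally: the helper name similar_to_regex is hypothetical, and it assumes the pattern contains no escape characters and no % or _ inside bracket expressions.

```python
import re

def similar_to_regex(pattern: str) -> str:
    """Translate a simple SIMILAR TO pattern into an anchored regex (sketch)."""
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")       # % matches any sequence, including the empty one
        elif ch == "_":
            out.append(".")        # _ matches exactly one character
        elif ch in ".^$":
            out.append("\\" + ch)  # these characters are literal in SIMILAR TO
        else:
            out.append(ch)         # |, *, +, ?, (), [] keep their meaning
    # SIMILAR TO must match the whole string, so anchor both ends.
    return "^(?:" + "".join(out) + ")$"

regex = similar_to_regex("(Strada|Via) % [0-9]+( [A-Z])?")
print(regex)
print(bool(re.search(regex, "Via Orologio 122 A")))
```

The sketch keeps the capturing parentheses of the input as they are, which is harmless for a boolean match but matters if the result is used with substring().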

Following Craig Ringer's advice, changing to:
SELECT 'Via Orologio 122 A' ~ '(Strada|Via) .+ [0-9]+( [A-Z])?';
solved the problem. '~' definitely seems to be a better solution than 'SIMILAR TO'.

Related

Having difficulty in pattern matching Postal Codes for an oracle regexp_like command

The Problem:
All I'm trying to do is come up with a pattern matching string for my regular expression that lets me select Canadian postal codes in this format: 'A1A-2B2' (for example).
The types of data I am trying to insert:
Insert Into Table
(Table_Number, Person_Name, EMail_Address, Street_Address, City, Province, Postal_Code, Hire_Date)
Values
(87, 'Tommy', 'mobster#gmail.com', '123 Street', 'location', 'ZY', 'T4X-1S2', To_Date('30-Aug-2020 08:50:56'));
This is a slightly modified/generic version to protect some of the data. All of the other columns enter just fine with no complaints, but it does not seem to like the postal code when I run a load-data script.
The Column & Constraint in question:
Postal_Code varchar2(7) Constraint Table_Postal_Code Null
Constraint CK_Postal_Code Check ((Regexp_like (Postal_Code, '^\[[:upper:]]{1}[[:digit:]]{1}[[:upper:]][[:punct:]]{1}[[:digit:]]{1}[[:upper:]](1}[[:digit:]]{1}$')),
My logic here, following the regular expression documentation:
I have:
an open quote
a caret (^) to indicate the start of the string
a backslash (I think to interpret a string literal)
1 uppercase letter, 1 digit, 1 uppercase letter, 1 :punct: to account for the hyphen, 1 digit, 1 uppercase letter, 1 digit
$ to indicate the end of the string
a close quote
In my mind, something like this should work, it accounts for every single letter/character and the ranges they have to be in. But something is off regarding my formatting of this pattern matching string.
The error I get is:
ORA-02290: check constraint (user.CK_POSTAL_CODE) violated
(slightly modified once more to protect my identity)
Which tells me that the insert statement is tripping my check constraint, and that's about it. So it's an issue with the condition of the constraint itself, i.e. the string I'm using to match. My instructor has told me that the insert data is valid and doesn't need any fix-up, so I'm at a loss.
Limits/Rules: The Hyphen has to be there/matched to my understanding of the problem. They are all uppercase in the dataset, so I don't have to worry about lowercase for this example.
I have tried countless variations of this regexp statement to see if anything at all would work, including:
changing all those uppers to :alpha: , then using 'i' to not check for case sensitivity for the time being
removing the {1} in case that was redundant
using \- (backslash hyphen), to turn it into a string literal maybe
using only Hyphen by itself
even removing regexp altogether and trying a LIKE [A-Z][0-9][A-Z]-[0-9][A-Z][0-9] etc
keeping the uppers , turning :digit:'s to [0-9] to see if that would maybe work
The only logical thing I can think of now is: the check constraint is actually working fine and tripping off when it matches my syntax. But I didn't write it clearly enough to say "IGNORE these cases and only get tripped/activated if it doesn't meet these conditions"
But I'm at my wits' end and asking here as a last resort. I wouldn't if I thought I could eventually see my mistake, but I have probably tried everything I can think of. I'm sure it's some tiny formatting rule I just can't see (I can feel it). Thank you kindly to anyone who knows how to format a pattern matching string like this properly.
It looks like you may have been overcomplicating the regex a bit. The regex below matches your description based on the first set of bullets you laid out:
REGEXP_LIKE (postal_code, '^[A-Z]\d[A-Z]-\d[A-Z]\d$')
I see two problems with that regexp.
Firstly, you have a spurious \ at the start. It serves you no purpose, get rid of it.
Secondly, the second-to-last {1} appears in your code with mismatched brackets as (1}. I get the error ORA-12725: unmatched parentheses in regular expression because of this.
To be honest, you don't need the {1}s at all: they just tell the regular expression that you want one of the previous item, which is exactly what you'd get without them.
So you can fix the regexp in your constraint by getting rid of the \ and removing the {1}s, including the one with mismatched parentheses.
Here's a demo of the fixed constraint in action:
SQL> CREATE TABLE postal_code_test (
2 Postal_Code varchar2(7) Constraint Table_Postal_Code Null
3 Constraint CK_Postal_Code Check ((Regexp_like (Postal_Code, '^[[:upper:]][[:digit:]][[:upper:]][[:punct:]][[:digit:]][[:upper:]][[:digit:]]$'))));
Table created.
SQL> INSERT INTO postal_code_test (postal_code) VALUES ('T4X-1S2');
1 row created.
SQL> INSERT INTO postal_code_test (postal_code) VALUES ('invalid');
INSERT INTO postal_code_test (postal_code) VALUES ('invalid')
*
ERROR at line 1:
ORA-02290: check constraint (user.CK_POSTAL_CODE) violated
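The same check can be tried outside the database before touching the constraint. The following is a rough Python sketch, not Oracle code: Python's re module has no POSIX classes such as [[:upper:]], so it uses [A-Z] and [0-9] and a literal hyphen instead of [[:punct:]]. The function name is_valid_postal_code is made up for the illustration.

```python
import re

# One uppercase letter, digit, uppercase letter, a hyphen,
# then digit, uppercase letter, digit.
POSTAL_CODE = re.compile(r"[A-Z][0-9][A-Z]-[0-9][A-Z][0-9]")

def is_valid_postal_code(value: str) -> bool:
    # fullmatch plays the role of the ^ and $ anchors in the constraint.
    return POSTAL_CODE.fullmatch(value) is not None

for candidate in ("T4X-1S2", "invalid", "T4X1S2"):
    print(candidate, is_valid_postal_code(candidate))
```

Note that [[:punct:]] in the constraint accepts any punctuation character, so a value such as 'T4X.1S2' would pass the Oracle check but not this stricter sketch.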
You do not need the backslash and you have (1} instead of {1}.
You can simplify the expression to:
Postal_Code varchar2(7)
Constraint Table_Postal_Code Null
Constraint CK_Postal_Code Check (
REGEXP_LIKE(Postal_Code, '^[A-Z]\d[A-Z][[:punct:]]\d[A-Z]\d$')
)
or:
Constraint CK_Postal_Code Check (
REGEXP_LIKE(
Postal_Code,
'^[A-Z][0-9][A-Z][[:punct:]][0-9][A-Z][0-9]$'
)
)
or:
Constraint CK_Postal_Code Check (
REGEXP_LIKE(
Postal_Code,
'^[[:upper:]][[:digit:]][[:upper:]][[:punct:]][[:digit:]][[:upper:]][[:digit:]]$'
)
)
or (although the {1} syntax is redundant here):
Constraint CK_Postal_Code Check (
REGEXP_LIKE(
Postal_Code,
'^[[:upper:]]{1}[[:digit:]]{1}[[:upper:]]{1}[[:punct:]]{1}[[:digit:]]{1}[[:upper:]]{1}[[:digit:]]{1}$'
)
)
Regarding "removing regexp altogether and trying a LIKE [A-Z][0-9][A-Z]-[0-9][A-Z][0-9] etc":
That will not work, as the LIKE operator does not match regular expression patterns; it only supports the % and _ wildcards.

"invalid regular expression: parentheses () not balanced"

I'm getting the error invalid regular expression: parentheses () not balanced while executing a query. The error refers to this part:
substring(every_x1, '\m[0-9]*\.?[0-9]'),
substring(every_x2, '\m[0-9]*\(?|-|to|TO)'),
substring(every_x2, '\m[0-9]*\(?|time|TIME)')
I checked it in an online parentheses checker, and it's supposed to be okay. What am I doing wrong?
I don't know the exact pattern you are looking for, but if you want to match an actual "(" (parenthesis) character, maybe you should escape the closing one as well, something like in this example:
select substring(every_x2, '\m[0-9]*\(?|-|to|TO\)')::float as part_1
Try this regular expression: '\m(\d+(?:\s*-\s*|\s*to\s*)\d+)\M'.
select substring('123 12-56 xyz' from '\m(\d+(?:\s*-\s*|\s*to\s*)\d+)\M');
-- 12-56
select substring('123 12 to 56 xyz' from '\m(\d+(?:\s*-\s*|\s*to\s*)\d+)\M');
-- 12 to 56
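The reason for the original error is that \( is an escaped, literal parenthesis, so the closing ) has no opening partner from the regex engine's point of view. A short Python sketch shows both the failure and the suggested range pattern. Two assumptions: Python's re module stands in for PostgreSQL here, and \b replaces \m and \M, which Python does not support.

```python
import re

# "\(" is a literal "(", so the final ")" is unbalanced.
broken = r"[0-9]*\(?|-|to|TO)"
try:
    re.compile(broken)
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print("rejected:", exc)

# The suggested pattern: digits, then "-" or "to", then digits.
RANGE = re.compile(r"\b(\d+(?:\s*-\s*|\s*to\s*)\d+)\b")

def first_range(text: str):
    """Return the first 'N-M' or 'N to M' range in text, or None."""
    m = RANGE.search(text)
    return m.group(1) if m else None

print(first_range("123 12-56 xyz"))     # 12-56
print(first_range("123 12 to 56 xyz"))  # 12 to 56
```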

Why is this seemingly correct Regex not working correctly in Rascal?

I have the following code:
set[str] noNnoE = { v | str v <- eu, (/\b[^eEnN]*\b/ := v) };
The goal is to filter out of a set of strings (called 'eu'), those strings that have no 'e' or 'n' in them (both upper- and lowercase). The regular expression I've provided:
/\b[^eEnN]?\b/
seems to work like it should, when I try it out in an online regex-tester.
When trying it out in the Rascal terminal it doesn't seem to work:
rascal>/\b[^eEnN]*\b/ := "Slander";
bool: true
I expected no match. What am I missing here? I'm using the latest (stable) Rascal release in Eclipse Oxygen1a.
Actually, the online regex-tester is giving the same match that we are giving. You can look at the match as follows:
if (/<w1:\b[^eEnN]?\b>/ := "Slander")
println("The match is: |<w1>|");
This is assigning the matched string to w1 and then printing it between the vertical bars, assuming the match succeeds (if it doesn't, it returns false, so the body of the if will not execute). If you do this, you will get back a match to the empty string:
The match is: ||
The online regex tester says the same thing:
Match 1
Full match 0-0 ''
If you want to prevent this, you can force at least one occurrence of the characters you are looking for by using a +, versus a ?:
rascal>/\b[^eEnN]+\b/ := "Slander";
bool: false
Note that you can also make the regex match case insensitive by following it with an i, like so:
/\b[^en]+\b/i
This may make it easier to write if you need to add more characters into the character class.
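The empty match is not specific to Rascal. As a cross-check, here is the same experiment in Python's re module, which is an assumption of comparable semantics rather than a statement about Rascal's engine; re.search looks for a match anywhere in the string, like := does above.

```python
import re

# With ? (or *), the engine can satisfy both \b anchors at position 0
# by matching zero characters.
m = re.search(r"\b[^eEnN]?\b", "Slander")
empty = (m.span(), m.group(0))
print(empty)  # ((0, 0), '')

# With +, at least one character is required, and no run of allowed
# characters in "Slander" lies between two word boundaries.
plus = re.search(r"\b[^eEnN]+\b", "Slander")
print(plus)  # None

# Case-insensitive variant of the same pattern.
ci = re.search(r"\b[^en]+\b", "Slander", re.IGNORECASE)
print(ci)  # None
```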
This solution (/\b[^en]+\b/i) doesn't work for strings consisting of two words, such as the Czech Republic.
Try /\b[^en]+\b$/i. That seems to work for me.
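For multi-word strings, the word-boundary patterns can still match a fragment: the unanchored version matches the space in "Czech Republic", and the end-anchored version accepts a string whose last word is clean, such as "New York". A hedged alternative is to require the whole string to consist of allowed characters. The sketch below uses Python to illustrate the idea; the set contents and the name has_no_e_or_n are invented for the example, and in Rascal the equivalent would presumably be a pattern anchored with ^ and $.

```python
import re

# Every character of the string must be something other than e or n.
NO_E_OR_N = re.compile(r"[^en]+", re.IGNORECASE)

def has_no_e_or_n(s: str) -> bool:
    return NO_E_OR_N.fullmatch(s) is not None

# The boundary-based patterns still find a partial match:
space_match = re.search(r"\b[^en]+\b", "Czech Republic", re.IGNORECASE)
tail_match = re.search(r"\b[^en]+\b$", "New York", re.IGNORECASE)
print(repr(space_match.group(0)))  # ' '
print(repr(tail_match.group(0)))   # ' York'

eu = {"Italy", "Slander", "Czech Republic", "Malta"}  # sample data
no_n_no_e = {v for v in eu if has_no_e_or_n(v)}
print(sorted(no_n_no_e))  # ['Italy', 'Malta']
```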

Postgres: regexp_replace & trim

I need to remove '.0' at the end of a string, but I have some issues.
In PG 8.4 I had this expression and it worked fine:
select regexp_replace('10.1.2.3.0', '(\\\\.0)+$', '');
and the result was
'10.1.2.3' - good result.
But after PG was updated to version 9.x, the result is
'10.1.2.3.0' - the unchanged input string, which is not OK.
I also tried the trim function.
In this case it is OK:
select trim('.0' from '10.1.2.3.0');
The result is '10.1.2.3' - OK.
But when I have 10 at the end of the code, I get an unexpected result:
select trim('.0' from '10.1.2.3.10.0');
or
select trim('.0' from '10.1.2.3.10');
The result is 10.1.2.3.1 - the 0 is trimmed from 10.
Can somebody suggest a solution and explain what is wrong with the trim function and what changed in regexp_replace in recent versions?
I would suggest doing something like this:
select (case when col like '%.0' then left(col, length(col) - 2)
else col
end)
This avoids regular expression parsing altogether. Note that left() requires PostgreSQL 9.1 or later; on older versions, substr(col, 1, length(col) - 2) does the same job.
As for the regular expression version, both of these work for me (on recent versions of Postgres):
select regexp_replace('10.1.2.3.0', '(\.0)+$', '');
select regexp_replace('10.1.2.3.0', '([.]0)+$', '');
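I suspect the problem with the earlier version is the string parsing of the backslash escape character. PostgreSQL 9.1 changed the default of standard_conforming_strings to on, so backslashes in ordinary string literals are no longer treated as escapes, and the regex engine receives a different pattern than before. You can use square brackets instead of a backslash, and the pattern should work in any version.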
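As for trim: its first argument is a set of characters to strip, not a substring, which is why the 0 of a trailing 10 disappears. Python's str.strip behaves the same way, so the difference between the two approaches can be sketched as follows. This is an analogy in Python, not PostgreSQL code, and the function name strip_trailing_dot_zero is invented for the example.

```python
import re

def strip_trailing_dot_zero(code: str) -> str:
    """Remove one or more trailing '.0' groups, like '(\\.0)+$' in regexp_replace."""
    return re.sub(r"(?:\.0)+$", "", code)

print(strip_trailing_dot_zero("10.1.2.3.0"))     # 10.1.2.3
print(strip_trailing_dot_zero("10.1.2.3.10.0"))  # 10.1.2.3.10
print(strip_trailing_dot_zero("10.1.2.3.10"))    # 10.1.2.3.10

# Character-set stripping removes any trailing '.' or '0' characters,
# so it also eats the 0 that belongs to the final "10".
print("10.1.2.3.10".strip(".0"))                 # 10.1.2.3.1
```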

Replacing the first vowel-consonant occurrence with consonant-vowel using sub in R

I know that it should be something like this, but I am definitely missing something in the syntax:
yy=sub(r'\b[aeiou][^aeiou]*',r'\b[^aeiou][aeiou]*',"abmmmm")
I expect to have "bammmm" as output
Error: unexpected string constant in "yy=sub(r'\b[aeiou][^aeiou]*'"
I am not sure what the exact syntax is.
Please run your code in RStudio or any R compiler. I am new to regex and you giving me Python code wouldn't help me to understand the situation. Thanks!
This is what you want:
yy=sub("\\b([aeiou])([^aeiou])","\\2\\1","abmmmm")
I'll explain how it works:
Asking to substitute any vowel-consonant pair with any consonant-vowel pair doesn't make much sense on its own: should ab become ba, ce, or da? It could be any of them, because nothing relates the vowel in the match to the vowel in the replacement. That is why the second argument of sub cannot be a regular expression; it is a replacement string.
To achieve what you asked for, add parentheses to the regular expression in the first argument. The first ( marks group 1, the second ( marks group 2, and so on (note that group 0 is the whole matched string). You can then use \1, \2, ... in the second argument (written "\\1", "\\2" in an R string) to insert the text matched by each group.
As an alternative to using a regular expression for this, there's a nice string reversal function in example(strsplit)
> strReverse <- function(x)
sapply(lapply(strsplit(x, NULL), rev), paste, collapse="")
> dd <- "abmmmm"
> paste(strReverse(substr(dd, 1, 2)), substr(dd, 3, nchar(dd)), sep = "")
[1] "bammmm"