Postgres regex to delimit multiple optional matches

Suppose a text field needs to be delimited in PostgreSQL. It is formatted as 'abcd' where each variable can be any one of: 1.4, 3, 5, 10, 15, 20 or N/A. Here is a query with some examples, followed by their expected results:
WITH example AS(
SELECT '10N/AN/AN/A' AS bw
UNION SELECT '1010N/AN/A'
UNION SELECT '101020N/A'
UNION SELECT '35N/A1.4'
UNION SELECT '1010N/A10'
UNION SELECT '105N/AN/A'
UNION SELECT '1.43N/A20'
)
SELECT
bw
,regexp_replace(
regexp_replace(
regexp_replace(
regexp_replace(
regexp_replace(
regexp_replace(
regexp_replace(bw, '(1\.4)', E'\\&|', 'g')
, '(3)', E'\\&|', 'g')
, '(5)', E'\\&|', 'g')
, '(10)', E'\\&|', 'g')
, '(15)', E'\\&|', 'g')
, '(20)', E'\\&|', 'g')
, '(N/A)', E'\\&|', 'g')
FROM
example
Results:
bw:text, regexp_replace:text
'1010N/AN/A', '10|10|N/A|N/A|'
'1010N/A10', '10|10|N/A|10|'
'35N/A1.4', '3|5|N/A|1.4|'
'1.43N/A20', '1.4|3|N/A|20|'
'105N/AN/A', '10|5|N/A|N/A|'
'101020N/A', '10|10|20|N/A|'
'10N/AN/AN/A','10|N/A|N/A|N/A|'
I'm not worried about the trailing pipe '|' since I can deal with it. This gets me what I want, but I'm concerned I could be doing it more succinctly. I experimented with putting each of the capture groups in a single regexp_replace statement while scouring through the documentation, but I was unable to get these results.
Can this be achieved within a single regexp_replace statement?

You may build a (1\.4|3|5|1[50]|20|N/A) capturing group with alternation operators separating the alternatives and replace with \1|:
select regexp_replace('35N/A1.4', '(1\.4|3|5|1[50]|20|N/A)', '\1|','g');
-- 3|5|N/A|1.4|
Details
( - starting the capturing group construct
1\.4 - 1.4 substring (. must be escaped in order to be parsed as a literal dot, else, it matches any char)
| - or
3 - a 3 char
| - or
5 - a 5 char
| - or
1[50] - 1 followed with either 5 or 0 (the [...] is called a bracket expression where you may specify chars, char ranges or even character classes)
| - or
20 - a 20 substring
| - or
N/A - a N/A substring
) - end of the capturing group.
The \1 in the replacement pattern is a numbered replacement backreference (also called a (group) placeholder) that references the value captured into Group 1.
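As a rough sketch, the same pattern can be applied to every sample value from the question in one statement. This assumes PostgreSQL with the default standard_conforming_strings = on, so '\1|' needs no E prefix. The column alias delimited is made up for this illustration, and rtrim is used only to drop the trailing pipe:

```sql
-- Sketch only: one regexp_replace call instead of seven nested ones.
WITH example(bw) AS (
    VALUES ('10N/AN/AN/A'), ('1010N/AN/A'), ('101020N/A'), ('35N/A1.4'),
           ('1010N/A10'), ('105N/AN/A'), ('1.43N/A20')
)
SELECT bw,
       -- 'g' replaces every match; rtrim strips the final '|'
       rtrim(regexp_replace(bw, '(1\.4|3|5|1[50]|20|N/A)', '\1|', 'g'), '|') AS delimited
FROM example;
-- e.g. '1.43N/A20' becomes '1.4|3|N/A|20'
```

The alternation is safe here because no allowed value is a prefix of another value at the same position: a 1 can only begin 1.4, 10 or 15.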

Related

Replacing multiple special characters in oracle

I have a requirement in Oracle to replace the special characters at the first and last positions of the column data.
Requirement: only [][.,$'*&!%^{}-?] and alphanumeric characters are allowed to stay in the address data; all other characters have to be replaced with a space. I have tried the approach below in several variations, but it is not working as expected. Please help me resolve this.
SELECT emp_address,
REGEXP_REPLACE(
emp_address,
'^[^[[][.,$'\*&!%^{}-?\]]]|[^[[][.,$'\*&!%^{}-?\]]]$'
) AS simplified_emp_address
FROM table_name
As per the regular expression operators and metasymbols documentation:
Put ] as the first character of the (negated) character group;
Put - as the last character; and
Do not put . immediately after [ or it can be matched as the start of a collating element [..] if there is a second . later in the expression.
Also:
Double up the single quote (to escape it, so it does not terminate the string literal); and
Include the non-special characters a-zA-Z0-9 in the negated character group too, otherwise they will be matched.
Which gives you the regular expression:
SELECT emp_address,
REGEXP_REPLACE(
emp_address,
'^[^][,.$''\*&!%^{}?a-zA-Z0-9-]|[^][,.$''\*&!%^{}?a-zA-Z0-9-]$'
) AS simplified_emp_address
FROM table_name
Which, for the sample data:
CREATE TABLE table_name (emp_address) AS
SELECT '"test1"' FROM DUAL UNION ALL
SELECT '$test2$' FROM DUAL UNION ALL
SELECT '[test3]' FROM DUAL UNION ALL
SELECT 'test4' FROM DUAL UNION ALL
SELECT '|test5|' FROM DUAL;
Outputs:
EMP_ADDRESS SIMPLIFIED_EMP_ADDRESS
----------- ----------------------
"test1"     test1
$test2$     $test2$
[test3]     [test3]
test4       test4
|test5|     test5
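The question asked for the offending characters to be replaced with a space rather than removed. A minimal sketch of that variant, assuming the same table_name sample data: pass ' ' as the third argument of REGEXP_REPLACE. The alias spaced_emp_address is made up for this illustration.

```sql
-- Sketch only: same pattern as above, but substituting a space
-- instead of deleting the disallowed first/last character.
SELECT emp_address,
       REGEXP_REPLACE(
         emp_address,
         '^[^][,.$''\*&!%^{}?a-zA-Z0-9-]|[^][,.$''\*&!%^{}?a-zA-Z0-9-]$',
         ' '  -- replacement string; when omitted, matches are removed
       ) AS spaced_emp_address
FROM   table_name;
-- e.g. '"test1"' becomes ' test1 '
```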
You do not need regular expressions, because they will have cumbersome escape sequences. Use substrings and the translate function:
with a as (
select
'some [data ]' as val
from dual
union all
select '{test $' from dual
union all
select 'clean $%&* value' from dual
union all
select 's' from dual
)
select
translate(substr(val, 1, 1), q'{ [][.,$'*&!%^{}-?]}', ' ')
|| substr(val, 2, lengthc(val) - 2)
|| case
when lengthc(val) > 1
then translate(substr(val, -1), q'{ [][.,$'*&!%^{}-?]}', ' ')
end
as value_replaced
from a
| VALUE_REPLACED |
| :--------------- |
| some [data |
| test |
| clean $%&* value |
| s |

Capture the last group/word

I want to capture the last word from the matched regexp. Here’s my query.
SELECT REGEXP_SUBSTR(
'The;quick;brown;fox;jumps;over;the;lazy;dog','^([^;]*;){5}([^;]*)') REF
FROM
DUAL
Desired result: over
Actual Result: The;quick;brown;fox;jumps;over
I can nest a second regex, but that will affect performance if there are millions of records…
Nested regex:
SELECT REGEXP_SUBSTR(REGEXP_SUBSTR(
'The;quick;brown;fox;jumps;over;the;lazy;dog',
'^([^;]*;){5}([^;]*)'),'[^;]*$') REF
FROM
DUAL
Don't use regular expressions if you are worried about performance (as they are slow), just use normal string functions:
SELECT SUBSTR(
value,
INSTR(value, ';', 1, 5) + 1,
INSTR(value, ';', 1, 6) - INSTR(value, ';', 1, 5) - 1
) AS DATA
FROM table_name;
If you did want to use a regular expression then just extract the value of a capturing group:
SELECT REGEXP_SUBSTR(
value,
'(.*?);',
1, -- Start from
6, -- Occurrence
NULL, -- Match parameter (none)
1 -- Capturing group to extract
) AS data
FROM table_name;
Which, for the sample data:
CREATE TABLE table_name ( value ) AS
SELECT 'The;quick;brown;fox;jumps;over;the;lazy;dog' FROM DUAL;
Both output:
DATA
over
If you want to use REGEXP_SUBSTR then use its fourth parameter for the occurrence you are looking for:
SELECT REGEXP_SUBSTR(
'The;quick;brown;fox;jumps;over;the;lazy;dog',
'[^;]+',
1,
6) AS ref
FROM dual;
Docs: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_SUBSTR.html#GUID-2903904D-455F-4839-A8B2-1731EF4BD099
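One caveat worth illustrating: '[^;]+' only matches non-empty tokens, so the occurrence number no longer lines up with the field number once a field is empty. The sketch below uses a made-up value 'a;;c' and contrasts it with a capturing-group pattern that counts delimiters instead. It assumes Oracle 11g or later for the subexpression argument; the column aliases are illustrative only.

```sql
-- Sketch only: the second field of 'a;;c' is empty.
WITH t (value) AS (
  SELECT 'a;;c' FROM DUAL
)
SELECT REGEXP_SUBSTR(value, '[^;]+', 1, 2)               AS skips_empty, -- 'c'
       REGEXP_SUBSTR(value, '(.*?)(;|$)', 1, 2, NULL, 1) AS keeps_empty  -- NULL
FROM   t;
```

If empty fields cannot occur in the data, the shorter '[^;]+' form is fine.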

Regex: Get penultimate part of a "path"

I've got something like this:
>AAA>BBB>CCC>DDD
With
([^>]*$)
I get the last part DDD . How can I get the part before it, CCC?
Thanks!
You may use
REGEXP_SUBSTR('>AAA>BBB>CCC>DDD', '([^>]+)>[^>]+$', 1, 1, NULL, 1)
The ([^>]+)>[^>]+$ regex will match and capture into Group 1 any 1+ chars other than >, then will match > followed with any 1+ chars other than > up to the end of the string.
The last argument, 1, tells REGEXP_SUBSTR to return just the captured substring.
Another approach is to replace the whole string but keep the captured part of your choice:
REGEXP_REPLACE( '>AAA>BBB>CCC>DDD', '.*>([^>]+)>[^>]+$', '\1')
Here, .*> will match all the string up to the >, then ([^>]+) will capture any 1+ chars other than > and then >[^>]+$ will match and consume > and 1+ chars other than > at the end of the string.
You don't need regular expressions for this; standard string functions suffice, and they will be much faster.
In the last row of the test data below, notice that there is no "second-to-last" (penultimate) part, so the output is NULL. That is indeed the correct answer in that case.
with
test_data (pth) as (
select '>AAA>BBB>CCC>DDD' from dual union all
select null from dual union all
select '>EEE>GGG' from dual union all
select '>JJJJJ' from dual
)
select pth,
substr(pth, instr(pth, '>', -1, 2) + 1,
instr(pth, '>', -1, 1) - instr(pth, '>', -1, 2) - 1) as stl
from test_data
;
PTH              STL
---------------- ----------------
>AAA>BBB>CCC>DDD CCC
>EEE>GGG         EEE
>JJJJJ
Here is a silly workaround for the lack of support for returning subexpressions in your version of Oracle. I offer this just as a curiosity; I proposed a better solution that doesn't use regular expressions at all in a separate Answer.
with
test_data (pth) as (
select '>AAA>BBB>CCC>DDD' from dual union all
select null from dual union all
select '>EEE>GGG' from dual union all
select '>JJJJJ' from dual
)
select pth,
regexp_substr(pth, '[^>]*', 1, nullif(2*regexp_count(pth, '>')-2, 0)) as stl
from test_data
;
PTH              STL
---------------- ----------------
>AAA>BBB>CCC>DDD CCC
>EEE>GGG         EEE
>JJJJJ

How to split strings using two delimiter in Oracle 11g regexp_substr functions

I need to split a string using two delimiters.
First, split on the , delimiter; then each of the resulting substrings should be split on the - delimiter.
My original string: UMC12I-1234,CSM3-123,VQ,
Expected output:
UMC12I
CSM3
VQ
Each value should be returned as a separate row.
I tried this:
WITH fab_sites AS (
SELECT trim(regexp_substr('UMC12I-1234,CSM3-123,VQ,', '[^,]+', 1, LEVEL)) fab_site
FROM dual
CONNECT BY LEVEL <= regexp_count('UMC12I-1234,CSM3-123,VQ,', '[^,]+')+1
)
SELECT fab_site FROM fab_sites WHERE fab_site IS NOT NULL
-- split on the , delimiter
Output is:
UMC12I-1234
CSM3-123
VQ
How can I get my expected output? (I need to split again on the - delimiter.)
You may extract the "words" before the - with the regexp_substr using
([^,-]+)(-[^,-]+)?
The pattern will match and capture into Group 1 one or more chars other than , and -, then will match an optional sequence of - and 1+ chars other than , and -.
Use this regexp_substr line instead of yours with the above regex:
SELECT trim(regexp_substr('UMC12I-1234,CSM3-123,VQ,', '([^,-]+)(-[^,-]+)?', 1, LEVEL, NULL, 1)) fab_site
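For completeness, here is a sketch of how that line might slot into the full query from the question. It assumes Oracle 11g or later, since the sixth argument of REGEXP_SUBSTR (the subexpression number) is needed. Counting matches of the same pattern in CONNECT BY is a choice made for this illustration; it avoids the extra +1 and the IS NOT NULL filter.

```sql
-- Sketch only: one row per comma-separated item, keeping the part before '-'.
WITH fab_sites AS (
  SELECT TRIM(REGEXP_SUBSTR('UMC12I-1234,CSM3-123,VQ,',
                            '([^,-]+)(-[^,-]+)?', 1, LEVEL, NULL, 1)) fab_site
  FROM   dual
  -- one level per match of the pattern: three for this string
  CONNECT BY LEVEL <= REGEXP_COUNT('UMC12I-1234,CSM3-123,VQ,', '([^,-]+)(-[^,-]+)?')
)
SELECT fab_site
FROM   fab_sites;
-- rows: UMC12I, CSM3, VQ
```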
You might try this query:
WITH fab_sites AS (
SELECT TRIM(',' FROM REGEXP_SUBSTR('UMC12I-1234,CSM3-123,VQ,', '(^|,)[^,-]+', 1, LEVEL)) fab_site
FROM dual
CONNECT BY LEVEL <= REGEXP_COUNT('UMC12I-1234,CSM3-123,VQ,', '(^|,)[^,-]+')
)
SELECT fab_site
FROM fab_sites;
We start by matching any substring that starts either with the start of the whole string ^ or with a comma ,, the delimiter. We then get all the characters that match neither a comma nor a dash -. Once we have that substring we trim any leftover commas from it.
P.S. I think the +1 in the CONNECT BY clause is extraneous, as is the WHERE NOT NULL in the "outer" query.

How to make regular expression correctly?

I need to get the data between the third and fourth occurrences of "*". This is what I have:
with t as (select 'T*76031*12558*test*received percents' as txt from dual)
select regexp_replace(txt, '.*(.{4})[*][^*].*$', '\1')
from t
I receive "test", which is right, but how do I match any number of characters, not just 4?
This should work given the example you have used:
REGEXP_REPLACE( txt, '(^.*\*.*\*.*\*)([[:alnum:]]*)(\*.*$)', '\2')
So the SELECT would be:
WITH t
AS (SELECT 'T*76031*12558*test*received percents' AS txt FROM DUAL)
SELECT REGEXP_REPLACE( txt, '(^.*\*.*\*.*\*)([[:alnum:]]*)(\*.*$)', '\2')
FROM t;
The regex looks for:
Group 1:
Start of string. Any number of characters up to a '*'. Any further characters up to another '*'. Any further characters up to the third '*'.
Group 2:
Any alphanumeric characters
Group 3:
A '*' followed by any other characters up to the end of the string.
Replace all of the above with whatever was found in Group 2.
Hope this helps.
EDIT:
Following on from a great answer from another thread by Rob van Wijk here:
Exracting substring from given string
WITH t
AS (SELECT 'T*76031*12558*test*received percents' AS txt FROM DUAL)
SELECT REGEXP_SUBSTR( txt,'[^\*]+',1,4)
FROM t;
How about the following?
^([^*]*[*]){3}([^*]*)
The first part matches three groups, each made of any non-* characters followed by a *, and the second part matches everything until the next * or the end of the line.
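A sketch of how this pattern could be used with REGEXP_REPLACE, as in the question. This variant does not rely on REGEXP_SUBSTR returning a subexpression. The trailing .*$ is added here so that the whole string is consumed and only the second group is kept; the alias fourth_field is made up for this illustration.

```sql
-- Sketch only: keep the text between the third and fourth '*'.
WITH t AS (SELECT 'T*76031*12558*test*received percents' AS txt FROM DUAL)
SELECT REGEXP_REPLACE(txt, '^([^*]*[*]){3}([^*]*).*$', '\2') AS fourth_field
FROM   t;
-- returns: test
```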
You are assuming that the last * of your text is also the fourth. If this assumption is true, then this:
\b\w*\b(?=\*[^*]*$)
Will get you what you want. But of course this only matches the last word between * before the last star. It only matches test in this case or whatever word characters are inside the *.
Note: 10g REGEXP_SUBSTR doesn't support returning subexpressions, see comments below.
If you are really only selecting a part of the string I recommend using REGEXP_SUBSTR instead. I don't know if it's more efficient, but it will better document your intent:
SQL> select regexp_substr('T*76031*12558*test*received percents',
'^([^*]*[*]){3}([^*]*)', 1, 1, '', 2) from dual;
REGEXP_SUBST
------------
test
Above I have used regexp provided by Pieter-Bas.
See also http://www.regular-expressions.info/oracle.html