Capture the last group/word - regex

I want to capture the last word from the matched regexp. Here’s my query.
SELECT REGEXP_SUBSTR(
'The;quick;brown;fox;jumps;over;the;lazy;dog','^([^;]*;){5}([^;]*)') REF
FROM
DUAL
Desired result: over
Actual Result: The;quick;brown;fox;jumps;over
I could use a nested regex, but that would hurt performance if there are millions of records.
Nested Regex
SELECT REGEXP_SUBSTR(REGEXP_SUBSTR(
'The;quick;brown;fox;jumps;over;the;lazy;dog',
'^([^;]*;){5}([^;]*)'),'[^;]*$') REF
FROM
DUAL

Don't use regular expressions if you are worried about performance (as they are slow), just use normal string functions:
SELECT SUBSTR(
value,
INSTR(value, ';', 1, 5) + 1,
INSTR(value, ';', 1, 6) - INSTR(value, ';', 1, 5) - 1
) AS DATA
FROM table_name;
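To make the arithmetic concrete, the intermediate values can be selected individually. This is only a sketch: it uses the sample string from the question as an inline literal, and the column aliases (fifth_sep, sixth_sep, field_len) are illustrative names, not part of the original answer.

```sql
-- Positions of the 5th and 6th semicolons, and the substring between them.
SELECT INSTR(value, ';', 1, 5)                                   AS fifth_sep,  -- 26
       INSTR(value, ';', 1, 6)                                   AS sixth_sep,  -- 31
       INSTR(value, ';', 1, 6) - INSTR(value, ';', 1, 5) - 1     AS field_len,  -- 4
       SUBSTR(value,
              INSTR(value, ';', 1, 5) + 1,
              INSTR(value, ';', 1, 6) - INSTR(value, ';', 1, 5) - 1) AS data    -- over
FROM (
  SELECT 'The;quick;brown;fox;jumps;over;the;lazy;dog' AS value FROM DUAL
);
```

Note that INSTR returns 0 when the requested occurrence does not exist, so for a string with fewer than six semicolons the computed length is not positive and SUBSTR returns NULL.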
If you did want to use a regular expression then just extract the value of a capturing group:
SELECT REGEXP_SUBSTR(value, '(.*?);', 1, 6, NULL, 1) AS data
-- Arguments after the pattern: 1 = start position, 6 = occurrence,
-- NULL = match parameters, 1 = capturing group to extract
FROM table_name;
Which, for the sample data:
CREATE TABLE table_name ( value ) AS
SELECT 'The;quick;brown;fox;jumps;over;the;lazy;dog' FROM DUAL;
Both output:
DATA
over

If you want to use REGEXP_SUBSTR then use its fourth parameter for the occurrence you are looking for:
SELECT REGEXP_SUBSTR(
'The;quick;brown;fox;jumps;over;the;lazy;dog',
'[^;]+',
1,
6) AS ref
FROM dual;
Docs: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_SUBSTR.html#GUID-2903904D-455F-4839-A8B2-1731EF4BD099
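One caveat worth keeping in mind: [^;]+ requires at least one character, so empty fields are skipped and later fields shift position. The sketch below contrasts it with a pattern that also captures empty fields. The input 'The;;brown' is a made-up example rather than data from the question, and the aliases are illustrative.

```sql
SELECT REGEXP_SUBSTR('The;;brown', '[^;]+', 1, 2)               AS skips_empty, -- brown
       REGEXP_SUBSTR('The;;brown', '(.*?)(;|$)', 1, 2, NULL, 1) AS keeps_empty  -- NULL (the empty 2nd field)
FROM dual;
```

If the data never contains empty fields, the simpler [^;]+ pattern is sufficient.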

Related

How can I extract a word from a string using Oracle regexp_substr?

I'm trying to extract a word from a string using Oracle 12c regexp_substr, but I can't work out how it behaves; there is so much information online that I get confused.
So I want to extract tmp* tables from a string:
query_str:
select
column1 c1,
column2 c2
from tmp_123 foo1, -- some comments here
TAB1_123 TAB1
where 1=1
;
Trying to use this but no "luck":
select regexp_substr(query_str, 'TMP_[A-z]+', 1, 1, 'i');
I want to extract until the space and the tmp table name can have numbers in the middle like this: tmp_123.
Any suggestion?
You can use either of the two:
select regexp_substr(query_str, 'TMP_\w+', 1, 1, 'i');
select regexp_substr(query_str, 'TMP_\S+', 1, 1, 'i');
The \w+ will match alphanumeric or underscore chars after TMP_ and \S+ will match one or more non-whitespace chars.
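The two patterns differ when punctuation directly follows the table name. A small sketch, using a made-up input in which a comma comes right after tmp_123 (the aliases are illustrative):

```sql
SELECT REGEXP_SUBSTR('from tmp_123, tab1', 'TMP_\w+', 1, 1, 'i') AS word_chars, -- tmp_123
       REGEXP_SUBSTR('from tmp_123, tab1', 'TMP_\S+', 1, 1, 'i') AS non_space   -- tmp_123,
FROM dual;
```

So \w+ is probably the safer choice when a name may be followed by a comma or a parenthesis.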
The major problem is that the SELECT statement shown is not valid in Oracle, where a FROM clause is required. Here's an example of how to make this work:
WITH cteData
AS (SELECT 'select' AS QUERY_STR FROM DUAL UNION ALL
SELECT 'column1 c1,' AS QUERY_STR FROM DUAL UNION ALL
SELECT 'column2 c2' AS QUERY_STR FROM DUAL UNION ALL
SELECT 'from tmp_123 foo1, -- some comments here' AS QUERY_STR FROM DUAL UNION ALL
SELECT 'TAB1_123 TAB1' AS QUERY_STR FROM DUAL UNION ALL
SELECT 'where 1=1' AS QUERY_STR FROM DUAL UNION ALL
SELECT ';' AS QUERY_STR FROM DUAL)
select regexp_substr(query_str, 'TMP_\w+', 1, 1, 'i') AS MATCH
FROM cteData
WHERE regexp_substr(query_str, 'TMP_\w+', 1, 1, 'i') IS NOT NULL
Here I've put your data line-for-line into a Common Table Expression (CTE) named "cteData" which the SELECT then uses as the source of its data. The pattern uses \w+ rather than [A-z]+, because [A-z] does not include digits and so cannot match the 123 in tmp_123. This returns
tmp_123

Regex: Get penultimate part of a "path"

I've got something like this:
>AAA>BBB>CCC>DDD
With
([^>]*$)
I get the last part, DDD. How can I get the part before it, CCC?
Thanks!
You may use
REGEXP_SUBSTR('>AAA>BBB>CCC>DDD', '([^>]+)>[^>]+$', 1, 1, NULL, 1)
The ([^>]+)>[^>]+$ regex will match and capture into Group 1 any 1+ chars other than >, then will match > followed with any 1+ chars other than > up to the end of the string.
The last argument, 1, tells REGEXP_SUBSTR to return just the captured substring.
Another approach is to replace the whole string but keep the captured part of your choice:
REGEXP_REPLACE( '>AAA>BBB>CCC>DDD', '.*>([^>]+)>[^>]+$', '\1')
Here, .*> will match all the string up to the >, then ([^>]+) will capture any 1+ chars other than > and then >[^>]+$ will match and consume > and 1+ chars other than > at the end of the string.
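The two approaches behave differently when the path has no penultimate part. A minimal sketch, applying the same two patterns to the single-part input '>JJJJJ' (chosen here for illustration; the aliases are illustrative as well):

```sql
SELECT REGEXP_SUBSTR('>JJJJJ', '([^>]+)>[^>]+$', 1, 1, NULL, 1) AS via_substr,  -- NULL (no match)
       REGEXP_REPLACE('>JJJJJ', '.*>([^>]+)>[^>]+$', '\1')      AS via_replace  -- >JJJJJ (input returned unchanged)
FROM dual;
```

REGEXP_SUBSTR yields NULL when nothing matches, whereas REGEXP_REPLACE returns the source string untouched, so the REGEXP_SUBSTR form is easier to work with if missing parts should show up as NULL.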
You don't need regular expressions for this - standard string functions suffice, and they will be much faster.
In the last example, notice that there is no "second-to-last" or penultimate part; so the output is NULL. That is indeed the correct answer in that case.
with
test_data (pth) as (
select '>AAA>BBB>CCC>DDD' from dual union all
select null from dual union all
select '>EEE>GGG' from dual union all
select '>JJJJJ' from dual
)
select pth,
substr(pth, instr(pth, '>', -1, 2) + 1,
instr(pth, '>', -1, 1) - instr(pth, '>', -1, 2) - 1) as stl
from test_data
;
PTH              STL
---------------- ----------------
>AAA>BBB>CCC>DDD CCC
>EEE>GGG         EEE
>JJJJJ
Here is a silly workaround for the lack of support for returning subexpressions in your version of Oracle. I offer this just as a curiosity; I proposed a better solution that doesn't use regular expressions at all in a separate Answer.
with
test_data (pth) as (
select '>AAA>BBB>CCC>DDD' from dual union all
select null from dual union all
select '>EEE>GGG' from dual union all
select '>JJJJJ' from dual
)
select pth,
regexp_substr(pth, '[^>]*', 1, nullif(2*regexp_count(pth, '>')-2, 0)) as stl
from test_data
;
PTH              STL
---------------- ----------------
>AAA>BBB>CCC>DDD CCC
>EEE>GGG         EEE
>JJJJJ

Postgres regex to delimit multiple optional matches

Suppose a text field needs to be delimited in PostgreSQL. It is formatted as 'abcd' where each variable can be any one of: 1.4, 3, 5, 10, 15, 20 or N/A. Here is a query with some examples, followed by their expected results:
WITH example AS(
SELECT '10N/AN/AN/A' AS bw
UNION SELECT '1010N/AN/A'
UNION SELECT '101020N/A'
UNION SELECT '35N/A1.4'
UNION SELECT '1010N/A10'
UNION SELECT '105N/AN/A'
UNION SELECT '1.43N/A20'
)
SELECT
bw
,regexp_replace(
regexp_replace(
regexp_replace(
regexp_replace(
regexp_replace(
regexp_replace(
regexp_replace(bw, '(1\.4)', E'\\&|', 'g')
, '(3)', E'\\&|', 'g')
, '(5)', E'\\&|', 'g')
, '(10)', E'\\&|', 'g')
, '(15)', E'\\&|', 'g')
, '(20)', E'\\&|', 'g')
, '(N/A)', E'\\&|', 'g')
FROM
example
Results:
bw:text, regexp_replace:text
'1010N/AN/A', '10|10|N/A|N/A|'
'1010N/A10', '10|10|N/A|10|'
'35N/A1.4', '3|5|N/A|1.4|'
'1.43N/A20', '1.4|3|N/A|20|'
'105N/AN/A', '10|5|N/A|N/A|'
'101020N/A', '10|10|20|N/A|'
'10N/AN/AN/A','10|N/A|N/A|N/A|'
I'm not worried about the trailing pipe '|' since I can deal with it. This gets me what I want, but I'm concerned I could be doing it more succinctly. I experimented with putting each of the capture groups in a single regexp_replace statement while scouring through the documentation, but I was unable to get these results.
Can this be achieved within a single regexp_replace statement?
You may build a (1\.4|3|5|1[50]|20|N/A) capturing group with alternation operators separating the alternatives and replace with \1|:
select regexp_replace('35N/A1.4', '(1\.4|3|5|1[50]|20|N/A)', '\1|','g');
-- 3|5|N/A|1.4|
Details
( - starting the capturing group construct
1\.4 - 1.4 substring (. must be escaped in order to be parsed as a literal dot, else, it matches any char)
| - or
3 - a 3 char
| - or
5 - a 5 char
| - or
1[50] - 1 followed with either 5 or 0 (the [...] is called a bracket expression where you may specify chars, char ranges or even character classes)
| - or
20 - a 20 substring
| - or
N/A - a N/A substring
) - end of the capturing group.
The \1 in the replacement pattern is a numbered replacement backreference (also called a (group) placeholder) that references the value captured into Group 1.
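Putting this together for several of the sample rows, here is a sketch in PostgreSQL that also strips the trailing pipe with rtrim. The VALUES list is a subset of the question's examples, the alias delimited is illustrative, and the query assumes standard_conforming_strings is on (the default), so the backslashes need no doubling.

```sql
WITH example(bw) AS (
  VALUES ('10N/AN/AN/A'), ('35N/A1.4'), ('1.43N/A20')
)
SELECT bw,
       -- one pass inserts a pipe after every token; rtrim drops the last pipe
       rtrim(regexp_replace(bw, '(1\.4|3|5|1[50]|20|N/A)', '\1|', 'g'), '|') AS delimited
FROM example;
-- 10N/AN/AN/A -> 10|N/A|N/A|N/A
-- 35N/A1.4    -> 3|5|N/A|1.4
-- 1.43N/A20   -> 1.4|3|N/A|20
```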

How to split strings using two delimiter in Oracle 11g regexp_substr functions

I need to split a string using two delimiters.
First split on the , delimiter, then split each of the resulting strings on the - delimiter.
My original string: UMC12I-1234,CSM3-123,VQ,
Expected output:
UMC12I
CSM3
VQ
Each value should come back as a separate row.
I tried the option
WITH fab_sites AS (
SELECT trim(regexp_substr('UMC12I-1234,CSM3-123,VQ,', '[^,]+', 1, LEVEL)) fab_site
FROM dual
CONNECT BY LEVEL <= regexp_count('UMC12I-1234,CSM3-123,VQ,', '[^,]+')+1
)
SELECT fab_site FROM fab_sites WHERE fab_site IS NOT NULL
-- split on the , delimiter
Output is:
UMC12I-1234
CSM3-123
VQ
How can I get my expected output? (I need to split again on the - delimiter.)
You may extract the "words" before the - with the regexp_substr using
([^,-]+)(-[^,-]+)?
The pattern will match and capture into Group 1 one or more chars other than , and -, then will match an optional sequence of - and 1+ chars other than , and -.
Use this regexp_substr line instead of yours with the above regex:
SELECT trim(regexp_substr('UMC12I-1234,CSM3-123,VQ,', '([^,-]+)(-[^,-]+)?', 1, LEVEL, NULL, 1)) fab_site
You might try this query:
WITH fab_sites AS (
SELECT TRIM(',' FROM REGEXP_SUBSTR('UMC12I-1234,CSM3-123,VQ,', '(^|,)[^,-]+', 1, LEVEL)) fab_site
FROM dual
CONNECT BY LEVEL <= REGEXP_COUNT('UMC12I-1234,CSM3-123,VQ,', '(^|,)[^,-]+')
)
SELECT fab_site
FROM fab_sites;
We start by matching any substring that starts either with the start of the whole string ^ or with a comma ,, the delimiter. We then get all the characters that match neither a comma nor a dash -. Once we have that substring we trim any leftover commas from it.
P.S. I think the +1 in the CONNECT BY clause is extraneous, as is the WHERE ... IS NOT NULL filter in the "outer" query.

Extract data outside of parentheses in oracle

I have this value: (203)1669
My requirement is to extract data which is outside of the parentheses.
I want to use Regular expression for this Oracle query.
Much appreciated!
You can use the Oracle REGEXP_REPLACE() function, and match the group which is outside the parentheses.
SELECT REGEXP_REPLACE(phone_number, '\([[:digit:]]+\)(.*)', '\1') AS newValue
FROM your_table
You can use the combination of SUBSTR and INSTR function.
select substr('(203)1669', instr('(203)1669',')')+1) from dual
This example uses REGEXP_SUBSTR() and the REGEX explicitly follows your spec of getting the 4 digits between the closing paren and the end of the line. If there could be a different number of digits, replace the {4} with a + for one or more digits:
SQL> with tbl(str) as (
select '(203)1669' from dual
)
select regexp_substr(str, '\)(\d{4})$', 1, 1, NULL, 1) nbr
from tbl;
NBR
----
1669
SQL>
For the pattern you mentioned, this should work.
select
rtrim(ltrim(substr(phone_number,instr(phone_number,')')+1,length(phone_number))))
as derived_phone_no
from
(select '(123)456' as phone_number from dual union all
select '(567)99084' as phone_number from dual)
Here I first find the position of ) and then take the substring from the character after it to the end of the string. The trim functions remove any surrounding whitespace, which is good practice.