Regex: Get penultimate part of a "path"

I've got something like this:
>AAA>BBB>CCC>DDD
With
([^>]*$)
I get the last part, DDD. How can I get the part before it, CCC?
Thanks!

You may use
REGEXP_SUBSTR('>AAA>BBB>CCC>DDD', '([^>]+)>[^>]+$', 1, 1, NULL, 1)
The ([^>]+)>[^>]+$ regex matches and captures into Group 1 one or more chars other than >, then matches > followed by one or more chars other than > up to the end of the string.
The last argument, 1, is the subexpression number: it tells REGEXP_SUBSTR to return just the text captured by Group 1.
Another approach is to replace the whole string but keep the captured part of your choice:
REGEXP_REPLACE( '>AAA>BBB>CCC>DDD', '.*>([^>]+)>[^>]+$', '\1')
Here, .*> matches everything up to the > that precedes the penultimate part, then ([^>]+) captures one or more chars other than >, and finally >[^>]+$ matches and consumes > plus one or more chars other than > at the end of the string.
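The capture-group pattern can also be checked outside the database. Here is a minimal Python sketch of the same idea; the function name penultimate is made up for illustration, and Python's re module is not Oracle's regex engine, although this pattern only uses constructs that both support.

```python
import re

# Same pattern as in the REGEXP_SUBSTR call: capture one or more non-'>'
# chars, followed by '>' and a final run of non-'>' chars at the end.
PENULTIMATE = re.compile(r'([^>]+)>[^>]+$')

def penultimate(path):
    """Return the second-to-last '>'-separated part, or None if there is none."""
    m = PENULTIMATE.search(path)
    return m.group(1) if m else None

print(penultimate('>AAA>BBB>CCC>DDD'))  # CCC
print(penultimate('>JJJJJ'))            # None
```

Returning None for a path with a single part mirrors the NULL that REGEXP_SUBSTR gives when the pattern does not match.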

You don't need regular expressions for this - standard string functions suffice, and they will be much faster.
In the last example, notice that there is no "second-to-last" or penultimate part; so the output is NULL. That is indeed the correct answer in that case.
with
  test_data (pth) as (
    select '>AAA>BBB>CCC>DDD' from dual union all
    select null               from dual union all
    select '>EEE>GGG'         from dual union all
    select '>JJJJJ'           from dual
  )
select pth,
       substr(pth, instr(pth, '>', -1, 2) + 1,
              instr(pth, '>', -1, 1) - instr(pth, '>', -1, 2) - 1) as stl
from   test_data
;
PTH              STL
---------------- ----------------
>AAA>BBB>CCC>DDD CCC
>EEE>GGG         EEE
>JJJJJ
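For readers who want to trace the position arithmetic, here is a rough Python analogue of the SUBSTR/INSTR expression. It is only a sketch: the function name is hypothetical, and rfind stands in for INSTR with a negative start position.

```python
def penultimate_by_position(path):
    """Second-to-last '>'-separated part, using only index lookups."""
    if path is None:
        return None
    last = path.rfind('>')            # plays the role of instr(pth, '>', -1, 1)
    if last <= 0:
        return None                   # no '>' at all, or only a leading one
    prev = path.rfind('>', 0, last)   # plays the role of instr(pth, '>', -1, 2)
    if prev == -1:
        return None                   # fewer than two '>' in the string
    # Oracle treats an empty string as NULL, so map '' to None as well.
    return path[prev + 1:last] or None

for p in ['>AAA>BBB>CCC>DDD', None, '>EEE>GGG', '>JJJJJ']:
    print(p, penultimate_by_position(p))
```

The explicit None checks make visible what the SQL gets implicitly: when the second-to-last > is missing, the computed length is zero and SUBSTR returns NULL.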

Here is a silly workaround for the lack of support for returning subexpressions in your version of Oracle. I offer this just as a curiosity; I proposed a better solution that doesn't use regular expressions at all in a separate Answer.
with
  test_data (pth) as (
    select '>AAA>BBB>CCC>DDD' from dual union all
    select null               from dual union all
    select '>EEE>GGG'         from dual union all
    select '>JJJJJ'           from dual
  )
select pth,
       regexp_substr(pth, '[^>]*', 1, nullif(2*regexp_count(pth, '>')-2, 0)) as stl
from   test_data
;
PTH              STL
---------------- ----------------
>AAA>BBB>CCC>DDD CCC
>EEE>GGG         EEE
>JJJJJ

Related

Postgres regex to delimit multiple optional matches

Suppose a text field needs to be delimited in PostgreSQL. It is formatted as 'abcd' where each variable can be any one of: 1.4, 3, 5, 10, 15, 20 or N/A. Here is a query with some examples, followed by their expected results:
WITH example AS (
    SELECT '10N/AN/AN/A' AS bw
    UNION SELECT '1010N/AN/A'
    UNION SELECT '101020N/A'
    UNION SELECT '35N/A1.4'
    UNION SELECT '1010N/A10'
    UNION SELECT '105N/AN/A'
    UNION SELECT '1.43N/A20'
)
SELECT
    bw
    ,regexp_replace(
        regexp_replace(
            regexp_replace(
                regexp_replace(
                    regexp_replace(
                        regexp_replace(
                            regexp_replace(bw, '(1\.4)', E'\\&|', 'g')
                        , '(3)', E'\\&|', 'g')
                    , '(5)', E'\\&|', 'g')
                , '(10)', E'\\&|', 'g')
            , '(15)', E'\\&|', 'g')
        , '(20)', E'\\&|', 'g')
    , '(N/A)', E'\\&|', 'g')
FROM
    example
Results:
bw:text, regexp_replace:text
'1010N/AN/A', '10|10|N/A|N/A|'
'1010N/A10', '10|10|N/A|10|'
'35N/A1.4', '3|5|N/A|1.4|'
'1.43N/A20', '1.4|3|N/A|20|'
'105N/AN/A', '10|5|N/A|N/A|'
'101020N/A', '10|10|20|N/A|'
'10N/AN/AN/A','10|N/A|N/A|N/A|'
I'm not worried about the trailing pipe '|' since I can deal with it. This gets me what I want, but I'm concerned I could be doing it more succinctly. I experimented with putting each of the capture groups in a single regexp_replace statement while scouring through the documentation, but I was unable to get these results.
Can this be achieved within a single regexp_replace statement?

You may build a (1\.4|3|5|1[50]|20|N/A) capturing group with alternation operators separating the alternatives and replace with \1|:
select regexp_replace('35N/A1.4', '(1\.4|3|5|1[50]|20|N/A)', '\1|','g');
-- 3|5|N/A|1.4|
Details
( - starting the capturing group construct
1\.4 - 1.4 substring (. must be escaped in order to be parsed as a literal dot, else, it matches any char)
| - or
3 - a 3 char
| - or
5 - a 5 char
| - or
1[50] - 1 followed by either 5 or 0 (the [...] is called a bracket expression, where you may specify chars, char ranges or even character classes)
| - or
20 - a 20 substring
| - or
N/A - a N/A substring
) - end of the capturing group.
The \1 in the replacement pattern is a numbered replacement backreference (also called a (group) placeholder) that references the value captured into Group 1.
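The single-alternation replacement can be tried against all of the sample values. This is a Python sketch rather than PostgreSQL, with a made-up function name. Python tries alternatives left to right, which is not necessarily how PostgreSQL resolves alternation, but for this value list no alternative is a prefix of another, so the order should not matter here.

```python
import re

# One capturing group listing every allowed value.
ALLOWED = re.compile(r'(1\.4|3|5|1[50]|20|N/A)')

def delimit(bw):
    """Append '|' after every allowed value found in bw."""
    return ALLOWED.sub(r'\1|', bw)

for sample in ['10N/AN/AN/A', '1010N/AN/A', '101020N/A', '35N/A1.4',
               '1010N/A10', '105N/AN/A', '1.43N/A20']:
    print(sample, '->', delimit(sample))
```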

How to split strings using two delimiter in Oracle 11g regexp_substr functions

I am not sure how to split a string using two delimiters.
First the string should be split on the , delimiter, and then each of the resulting strings should be split on the - delimiter.
My original string: UMC12I-1234,CSM3-123,VQ,
Expected output:
UMC12I
CSM3
VQ
Each value should come back as its own row.
I tried this:
WITH fab_sites AS (
    SELECT trim(regexp_substr('UMC12I-1234,CSM3-123,VQ,', '[^,]+', 1, LEVEL)) fab_site
    FROM dual
    CONNECT BY LEVEL <= regexp_count('UMC12I-1234,CSM3-123,VQ,', '[^,]+')+1
)
SELECT fab_site FROM fab_sites WHERE fab_site IS NOT NULL
-- split on the , delimiter only
Output is:
UMC12I-1234
CSM3-123
VQ
How can I get my expected output? (I need to split again on the - delimiter.)

You may extract the "words" before the - with the regexp_substr using
([^,-]+)(-[^,-]+)?
The pattern matches and captures into Group 1 one or more chars other than , and -, then matches an optional sequence of - and one or more chars other than , and -.
Use this regexp_substr line, with the above regex, instead of yours:
SELECT trim(regexp_substr('UMC12I-1234,CSM3-123,VQ,', '([^,-]+)(-[^,-]+)?', 1, LEVEL, NULL, 1)) fab_site
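To see what Group 1 yields for each occurrence, the pattern can be run through Python's finditer, which plays roughly the role of iterating over LEVEL. This is a sketch only; the function name is an assumption, not part of the original answer.

```python
import re

# Group 1: the part before an optional '-suffix'; commas separate items.
PART = re.compile(r'([^,-]+)(-[^,-]+)?')

def fab_sites(s):
    """Return the leading 'word' of every comma-separated item in s."""
    return [m.group(1) for m in PART.finditer(s)]

print(fab_sites('UMC12I-1234,CSM3-123,VQ,'))
# ['UMC12I', 'CSM3', 'VQ']
```

The trailing comma produces no extra item because the pattern needs at least one char that is neither , nor - to match.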

You might try this query:
WITH fab_sites AS (
    SELECT TRIM(',' FROM REGEXP_SUBSTR('UMC12I-1234,CSM3-123,VQ,', '(^|,)[^,-]+', 1, LEVEL)) fab_site
    FROM dual
    CONNECT BY LEVEL <= REGEXP_COUNT('UMC12I-1234,CSM3-123,VQ,', '(^|,)[^,-]+')
)
SELECT fab_site
FROM fab_sites;
We start by matching any substring that starts either with the start of the whole string ^ or with a comma ,, the delimiter. We then get all the characters that match neither a comma nor a dash -. Once we have that substring we trim any leftover commas from it.
P.S. I think the +1 in the CONNECT BY clause is extraneous, as is the WHERE NOT NULL in the "outer" query.

Extract data outside of parentheses in oracle

I have this value: (203)1669
My requirement is to extract data which is outside of the parentheses.
I want to use Regular expression for this Oracle query.
Much appreciated!

You can use the Oracle REGEXP_REPLACE() function, and match the group which is outside the parentheses.
SELECT REGEXP_REPLACE(phone_number, '\([[:digit:]]+\)(.*)', '\1') AS newValue
FROM your_table

You can use the combination of SUBSTR and INSTR function.
select substr('(203)1669', instr('(203)1669',')')+1) from dual

This example uses REGEXP_SUBSTR() and the REGEX explicitly follows your spec of getting the 4 digits between the closing paren and the end of the line. If there could be a different number of digits, replace the {4} with a + for one or more digits:
SQL> with tbl(str) as (
       select '(203)1669' from dual
     )
     select regexp_substr(str, '\)(\d{4})$', 1, 1, NULL, 1) nbr
     from tbl;

NBR
----
1669

SQL>

For the pattern you mentioned, this should work.
select
    rtrim(ltrim(substr(phone_number, instr(phone_number, ')') + 1, length(phone_number))))
        as derived_phone_no
from
    (select '(123)456' as phone_number from dual union all
     select '(567)99084' as phone_number from dual)
Here I first get the position of ), then take the substring that starts one character after it and runs to the end of the string. The ltrim/rtrim calls are there as a precaution, to strip any surrounding whitespace.
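The same position-based idea, written as a small Python sketch for comparison (the function name is hypothetical). When there is no closing parenthesis, find returns -1, so the slice starts at 0 and the whole value is kept, which parallels INSTR returning 0 in Oracle.

```python
def after_paren(value):
    """Return whatever follows the first ')' in value, trimmed of whitespace."""
    pos = value.find(')')            # -1 when there is no ')'
    return value[pos + 1:].strip()   # slice from just past ')' to the end

for v in ['(203)1669', '(123)456', '(567)99084']:
    print(v, '->', after_paren(v))
```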

Oracle Substring after specific character

I already found out that I need to use substr/instr or a regex, but even after reading the documentation about those, I can't get it done...
I am on Oracle 11.2.
So here is what I have.
A list of Strings like:
743H5-34L-56
123HD34-7L
12HSS-34R
23Z67-4R-C23
What I need is the number (length 1 or 2) after the first '-', up to the 'L' or 'R' that follows it.
Does anybody have some advice?

regexp_replace(string, '^.*?-(\d+)[LR].*$', '\1')

Another version (without fancy lookarounds :-) :
with v_data as (
    select '743H5-34L-56' val from dual
    union all
    select '123HD34-7L' val from dual
    union all
    select '12HSS-34R' val from dual
    union all
    select '23Z67-4R-C23' val from dual
)
select
    val,
    regexp_replace(val, '^[^-]+-(\d+)[LR].*', '\1')
from v_data
It matches
the beginning of the string "^"
one or more characters that are not a '-' "[^-]+"
followed by a '-' "-"
followed by one or more digits (capturing them in a group) "(\d+)"
followed by 'L' or 'R' "[LR]"
followed by zero or more arbitrary characters ".*"
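The pieces listed above can be exercised against the four sample strings with a short Python sketch. The function name is made up, and one behavior differs: when nothing matches, this sketch returns None, whereas REGEXP_REPLACE returns the input string unchanged.

```python
import re

# start, non-dash chars, a dash, digits (captured), L or R, then the rest
PATTERN = re.compile(r'^[^-]+-(\d+)[LR].*')

def number_before_side(val):
    """Digits between the first '-' and the following 'L' or 'R', else None."""
    m = PATTERN.match(val)
    return m.group(1) if m else None

for v in ['743H5-34L-56', '123HD34-7L', '12HSS-34R', '23Z67-4R-C23']:
    print(v, '->', number_before_side(v))
```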

How to make regular expression correctly?

I need to get the data between the third and the fourth occurrence of "*". This is what I have:
with t as (select 'T*76031*12558*test*received percents' as txt from dual)
select regexp_replace(txt, '.*(.{4})[*][^*].*$', '\1')
from t
I get "test", which is right, but how can I match any number of characters, not just 4?

This should work given the example you have used:
REGEXP_REPLACE( txt, '(^.*\*.*\*.*\*)([[:alnum:]]*)(\*.*$)', '\2')
So the SELECT would be:
WITH t
AS (SELECT 'T*76031*12558*test*received percents' AS txt FROM DUAL)
SELECT REGEXP_REPLACE( txt, '(^.*\*.*\*.*\*)([[:alnum:]]*)(\*.*$)', '\2')
FROM t;
The regex looks for:
Group 1:
Start of string. Any number of characters up to a '*'. Any further characters up to another '*'. Any further characters up to the third '*'.
Group 2:
Any alphanumeric characters
Group 3:
A '*' followed by any other characters up to the end of the string.
Replace all of the above with whatever was found in Group 2.
Hope this helps.
EDIT:
Following on from a great answer from another thread by Rob van Wijk here:
Exracting substring from given string
WITH t
AS (SELECT 'T*76031*12558*test*received percents' AS txt FROM DUAL)
SELECT REGEXP_SUBSTR( txt,'[^\*]+',1,4)
FROM t;

How about the following?
^([^*]*[*]){3}([^*]*)
The first part matches three runs of non-* characters, each followed by a *, and the second part captures everything up to the next * or the end of the line.
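A quick way to confirm which group holds the wanted text is to run the pattern in Python. This is a sketch with an assumed function name; the result is in Group 2, because Group 1 is the repeated prefix group.

```python
import re

# Three '...*' prefixes, then capture the field that follows them.
FOURTH = re.compile(r'^([^*]*[*]){3}([^*]*)')

def fourth_field(txt):
    """Text between the third and fourth '*', or None if fewer than three '*'."""
    m = FOURTH.match(txt)
    return m.group(2) if m else None

print(fourth_field('T*76031*12558*test*received percents'))  # test
```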

You are assuming that the last * of your text is also the fourth. If this assumption is true then this :
\b\w*\b(?=\*[^*]*$)
Will get you what you want. But of course this only matches the last word between * before the last star. It only matches test in this case or whatever word characters are inside the *.

Note: 10g REGEXP_SUBSTR doesn't support returning subexpressions, see comments below.
If you are really only selecting a part of the string I recommend using REGEXP_SUBSTR instead. I don't know if it's more efficient, but it will better document your intent:
SQL> select regexp_substr('T*76031*12558*test*received percents',
'^([^*]*[*]){3}([^*]*)', 1, 1, '', 2) from dual;
REGEXP_SUBST
------------
test
Above I have used regexp provided by Pieter-Bas.
See also http://www.regular-expressions.info/oracle.html
