Split single row string into multiple rows by multi-character delimiter Oracle - regex

I have tried to adapt the question "Splitting string into multiple rows in Oracle" to my needs, but I'm not very confident with regex and have not been able to solve it by searching.
The answers to that question rely on regexp_substr with [^,]+ as the pattern, so the string is split on a single comma. I need to split on a multi-character delimiter (e.g. #;), but that kind of pattern treats each character in the class as a delimiter on its own, so a # or ; elsewhere in the text also causes a split.
I've worked out that the pattern (#;+) will match every group of #; but I cannot work out how to invert it, as done above, to split the row into multiple rows.
I'm sure I'm just missing something simple so any help would be greatly appreciated!

I think you should use:
[^#;+]+
instead of
(#;+)
This is a negated character class, so it matches a run of characters that are none of #, ; or +, and you can split on those characters accordingly.
You can change it to suit your requirements, but in the regex I shared I am treating #, ; and + each as a delimiter.
So, in the end, the query would look something like this:
with tbl(str) as (
  select ' My, Delimiter# Hello My; Delimiter World My Delimiter My Delimiter test My Delimiter ' from dual
)
SELECT LEVEL AS element,
       REGEXP_SUBSTR( str ,'([^#;+]+)', 1, LEVEL, NULL, 1 ) AS element_value
FROM tbl
CONNECT BY LEVEL <= regexp_count(str, '[#;+]')+1
Output:
ELEMENT ELEMENT_VALUE
1 My, Delimiter
2 Hello My
3 Delimiter World My Delimiter My Delimiter test My Deli
-- EDIT --
In case you want to split only on runs of # and ; and not on a single occurrence of either, I found the regex below, but it is not supported by Oracle.
(?:(?:(?![;#]+).#(?![;#]+).|(?![;#]+).;(?![;#]+).|(?![;#]+).)*)+
So I found no easy way apart from the query below, which will not split on a single occurrence, provided there is only one such occurrence between two delimiters:
with tbl(str) as (
  select ' My, Delimiter;# Hello My Delimiter ;;# World My Delimiter ; My Delimiter test#; My Delimiter ' from dual
)
SELECT LEVEL AS element,
       REGEXP_SUBSTR( str ,'([^#;]+#?[^#;]+;?[^#;]+)', 1, LEVEL, NULL, 1 ) AS element_value
FROM tbl
CONNECT BY LEVEL <= regexp_count(str, '[#;]{2,}')+1
Output:
ELEMENT ELEMENT_VALUE
1 My, Delimiter
2 Hello My Delimiter
3 World My Delimiter ; My Delimiter test
4 My Delimiter
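For the literal two-character delimiter #; from the question, a commonly used alternative is to capture lazily up to the next delimiter or the end of the string, instead of negating a character class. The following is only a minimal sketch and not part of the answer above: the sample string is made up for illustration, and it assumes a single-row source and Oracle 11g or later (for the subexpression argument of REGEXP_SUBSTR).

```sql
-- Sketch: split on the literal two-character delimiter '#;'.
-- The sample string is hypothetical; a lone '#' or ';' stays inside its element.
WITH tbl(str) AS (
  SELECT 'My#Data#;Hello; World#;Last part' FROM dual
)
SELECT LEVEL AS element,
       -- (.*?) lazily captures everything up to the next '#;' or the end of the string;
       -- the final argument 1 returns only that first capture group.
       REGEXP_SUBSTR(str, '(.*?)(#;|$)', 1, LEVEL, NULL, 1) AS element_value
FROM tbl
-- number of elements = number of delimiters + 1
CONNECT BY LEVEL <= REGEXP_COUNT(str, '#;') + 1;
```

With this sample the three rows should be My#Data, Hello; World and Last part. An empty element between two adjacent delimiters comes back as NULL, and a multi-row source would need extra CONNECT BY conditions that are not shown here.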

Related

Split records with complex delimiter

I have an incoming record with a complex column delimiter and need to tokenize the record.
One of the delimiter characters can be part of the data.
I am looking for a regex expression.
Required to use on Teradata 16.1 with the function "REGEXP_SUBSTR".
There can be a maximum of 5 columns to tokenize.
I am planning to use CASE statements in Teradata to tokenize the columns.
I guess a regular expression for one token will do the trick.
Case#1: Column delimiter is ' - '
Sample data: On-e - tw o - thr$ee
Required output : [On-e, tw o, thr$ee]
My attempt : ([\S]*)\s{1}\-{1}\s{1}
Case#2 : Column delimiter is '::'
Sample data : On:e::tw:o::thr$ee
Required output : [On:e, tw:o, thr$ee]
Case#3 : Column delimiter is ':;'
Sample data : On:e:;tw;o:;thr$ee
Required output : [On:e, tw;o, thr$ee]
The above 3 cases are independent and do not occur together, i.e. 3 different solutions are required.
If you absolutely must use RegEx for this, you could do it like in the examples shown below using capture groups.
Generic example:
/(?<data>.+?)($delimiter|$)/gm
(?<data>.+?) named capture group data, matching:
. any character
+? occurring between one and unlimited times, as few times as possible (lazy)
followed by
($delimiter|$) another capture group, matching:
$delimiter - replace this with regex matching your delimiter string
| or
$ end of string
Picking up your examples:
Case #1:
Column delimiter is ' - '
/(?<data>.+?)(\s-\s|$)/gm
https://regex101.com/r/qMYxAY/1
Case #2:
Column delimiter is '::'
/(?<data>.+?)(\:\:|$)/gm
https://regex101.com/r/IzaAoA/1
Case #3:
Column delimiter is ':;'
/(?<data>.+?)(\:\;|$)/gm
https://regex101.com/r/g1MUb6/1
Normally you would use STRTOK to split a string on a delimiter. But strtok can't handle a multi-character delimiter. One moderately over-complicated approach is to replace the multiple characters of the delimiter with a single character and split on that. For example:
select
    -- assumes '|' never appears in the data itself
    strtok(oreplace(<your column>, ' - ', '|'), '|', 1) as one,
    strtok(oreplace(<your column>, ' - ', '|'), '|', 2) as two,
    strtok(oreplace(<your column>, ' - ', '|'), '|', 3) as three,
    strtok(oreplace(<your column>, ' - ', '|'), '|', 4) as four,
    strtok(oreplace(<your column>, ' - ', '|'), '|', 5) as five
from <your table>;
If there are only three occurrences, like in your samples, it just returns null for the other two.

Oracle Database, extract string being between two other strings

I need a regexp that, combined with regexp_substr(), would give me the word between two other specified words.
Example:
source_string => 'First Middle Last'
substring varchar2(100);
substring := regexp_substr(source_string, 'First (.*) Last'); <=== this doesn't work :(
dbms_output.put_line(substring) ===> output should be: 'Middle'
I know it looks simple and to be honest, at the beginning I thought the same.
But now, after spending about 3 hours searching for a solution, I give up...
It's not working because the literal strings 'First' and 'Last' are part of the pattern, and without further arguments REGEXP_SUBSTR() returns the whole match, not just the part in parentheses. Assuming that the strings don't all literally begin 'First' you need to find another way to represent them. You've already done this by representing 'Middle' as (.*)
The next point is that you need to extract a sub-expression (the part in parentheses); this is selected by the 6th parameter of REGEXP_SUBSTR().
If you put these together then the following gives you what you want:
regexp_substr(source_string, '.*\s(.*)\s.*', 1, 1, 'i', 1)
An example of it working:
SQL> select regexp_substr('first middle last', '.*\s(.*)\s.*', 1, 1, 'i', 1)
2 from dual;
REGEXP
------
middle
You can also use an online regex tester to validate that 'middle' is the only captured group.
Depending on what your actual source strings look like you may not want to search for exactly spaces, but use \W (a non-word character) instead.
If you're expecting exactly three words I'd also anchor your expression to the start and end of the string: ^.*\s(.*)\s.*$
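If the surrounding words really are the literals 'First' and 'Last', the original pattern can also be kept as it is and only the subexpression argument added. This is a minimal sketch, assuming Oracle 11g or later; the variable name middle_word is hypothetical, and SET SERVEROUTPUT ON is needed in SQL*Plus to see the output.

```sql
-- Sketch: keep the literal 'First' and 'Last', return only capture group 1.
DECLARE
  source_string VARCHAR2(100) := 'First Middle Last';
  middle_word   VARCHAR2(100);  -- hypothetical variable name
BEGIN
  -- arguments: source, pattern, start position, occurrence, match parameter, subexpression
  middle_word := REGEXP_SUBSTR(source_string, 'First (.*) Last', 1, 1, NULL, 1);
  DBMS_OUTPUT.PUT_LINE(middle_word);  -- should print: Middle
END;
/
```

Without the last argument the same call returns the whole match, 'First Middle Last', which is why the original attempt appeared not to work.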
If the source string always looks the same, i.e. consists of 3 elements (words), then a simple regular expression like this does the job:
SQL> with t (str) as
2 (select 'First Middle Last' from dual)
3 select regexp_substr(str, '\w+', 1, 2) result from t;
RESULT
------
Middle
SQL>
The (\S*) pattern can be used with regexp_replace and regexp_substr in the following way to get the middle word:
with t(str) as
(
select 'First Middle Last' from dual
)
select regexp_substr(trim(regexp_replace(str, '^(\S*)', '')),'(\S*)')
as "Result String"
from t;
Result String
-------------
Middle
The inner regexp_replace strips the first word, and the outer regexp_substr then returns the first word of what remains, so the word Last is dropped as well.
Or, more directly, you can use regexp_replace on its own:
with t(str) as
(
select 'First Middle Last' from dual
)
select regexp_replace(str,'(.*) (.*) (.*)','\2')
as "Result String"
from t;
Result String
-------------
Middle

How to split strings using two delimiters in Oracle 11g regexp_substr functions

I have a question about splitting a string using delimiters.
First, split on the , delimiter; then each of the resulting strings should be split on the - delimiter.
My original string: UMC12I-1234,CSM3-123,VQ,
Expected output:
UMC12I
CSM3
VQ
Each value should come back as a separate row.
I tried this:
WITH fab_sites AS (
SELECT trim(regexp_substr('UMC12I-1234,CSM3-123,VQ,', '[^,]+', 1, LEVEL)) fab_site
FROM dual
CONNECT BY LEVEL <= regexp_count('UMC12I-1234,CSM3-123,VQ,', '[^,]+')+1
)
SELECT fab_site FROM fab_sites WHERE fab_site IS NOT NULL
-- split on the , delimiter
Output is:
UMC12I-1234
CSM3-123
VQ
How can I get my expected output? (I need to split again on the - delimiter.)
You may extract the "words" before the - with the regexp_substr using
([^,-]+)(-[^,-]+)?
The pattern will match and capture into Group 1 one or more chars other than , and -, then will match an optional sequence of - and 1+ chars other than , and -.
Use this regexp_substr line instead of yours with the above regex:
SELECT trim(regexp_substr('UMC12I-1234,CSM3-123,VQ,', '([^,-]+)(-[^,-]+)?', 1, LEVEL, NULL, 1)) fab_site
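Dropped into a complete query, that line could look like the sketch below. It assumes Oracle 11g or later and counts rows with the same pattern that extracts them, so neither the +1 nor a NULL filter is needed; the inline view and its column name str are illustrative only.

```sql
-- Sketch: one row per comma-separated item, keeping only the part before '-'.
SELECT TRIM(REGEXP_SUBSTR(str, '([^,-]+)(-[^,-]+)?', 1, LEVEL, NULL, 1)) AS fab_site
FROM (SELECT 'UMC12I-1234,CSM3-123,VQ,' AS str FROM dual)
-- counting matches of the same pattern gives exactly one row per item
CONNECT BY LEVEL <= REGEXP_COUNT(str, '([^,-]+)(-[^,-]+)?');
```

For the sample string this should return the three rows UMC12I, CSM3 and VQ.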
You might try this query:
WITH fab_sites AS (
SELECT TRIM(',' FROM REGEXP_SUBSTR('UMC12I-1234,CSM3-123,VQ,', '(^|,)[^,-]+', 1, LEVEL)) fab_site
FROM dual
CONNECT BY LEVEL <= REGEXP_COUNT('UMC12I-1234,CSM3-123,VQ,', '(^|,)[^,-]+')
)
SELECT fab_site
FROM fab_sites;
We start by matching any substring that starts either with the start of the whole string ^ or with a comma ,, the delimiter. We then get all the characters that match neither a comma nor a dash -. Once we have that substring we trim any leftover commas from it.
P.S. I think the +1 in the CONNECT BY clause is extraneous, as is the WHERE NOT NULL in the "outer" query.

Need to form pattern for regexp_replace

I have input strings like:
1.2.3.4_abc_4.2.1.44_1.3.4.23
100.11.11.22_xyz-abd_10.2.1.2_12.2.3.4
100.11.11.22_xyz_123_10.2.1.2_1.2.3.4
I have to replace the first string found between two IP addresses, which are separated by _; however, in some strings the _ is part of the string to be replaced (xyz_123).
I have to find abc, xyz-abd and xyz_123 in the above strings so that I can replace them with another column in that table.
_.*?_(?=\d+\.)
matches _abc_, _xyz-abd_ and _xyz_123_ in your examples. Is this working for you?
DECLARE
result VARCHAR2(255);
BEGIN
result := REGEXP_REPLACE(subject, $$_.*?_(?=\d+\.)$$, $$_foo_$$);
END;
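Assuming the target database is PostgreSQL, as the later answers in this thread suggest, the pattern can be tried directly with a plain SELECT. The sketch below reuses the three sample strings from the question and the placeholder replacement _foo_; the aliases t and s are illustrative only.

```sql
-- PostgreSQL sketch: replace the text between the first two IP addresses.
-- Without the 'g' flag, regexp_replace replaces only the first match.
SELECT s AS original,
       regexp_replace(s, '_.*?_(?=\d+\.)', '_foo_') AS replaced
FROM (VALUES
        ('1.2.3.4_abc_4.2.1.44_1.3.4.23'),
        ('100.11.11.22_xyz-abd_10.2.1.2_12.2.3.4'),
        ('100.11.11.22_xyz_123_10.2.1.2_1.2.3.4')
     ) AS t(s);
```

The intent is that each row keeps both IP addresses and only _abc_, _xyz-abd_ or _xyz_123_ becomes _foo_. The lookahead (?=\d+\.) is what stops the lazy match at an underscore that is followed by the start of an IP address.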
Probably this is enough:
_[^.]+_
and replace with
_Replacement_
[^.]+ uses a negated character class to match a sequence of at least one (the + quantifier) non "." characters.
I am also matching a leading and a trailing "_", so you have to put it in again in the replacement string.
If PostgreSQL supports lookbehind and lookahead assertions, then it is possible to avoid the "_" in the replacement string:
(?<=_)[^.]+(?=_)
As @stema and @Tim Pietzcker mentioned, the regex works for matching the first two "_". Then, to append "_" to the column, which is what I was struggling with, the || operator can be used, e.g.:
update table1 set column1=regexp_replace(column1,'_.*?_(?=\d+\.)','_' || column2 || '_')
To use another table in the update query, the example below may be helpful:
update table1 as t set column1=regexp_replace(column1,'_.*?_(?=\d+\.)','_' || column2 || '_') from table2 as t2 where t.id=t2.id [other criteria]

plsql regex to remove text between quotes that has quotes

I am struggling with a regex replacement that would remove all text between quotes from a VARCHAR2 field, even if the text between those quotes itself contains quoted text.
For example, the text:
'text start 'text inside' text end' leftover 'some other text'
after regex replacement should contain: leftover
What I have come up with is this code:
with tbl as (
select
'''text start ''text inside'' text end'' leftover ''some other text''' as str
,'\''(.*?)\''' as regex
from dual
)
select
tbl.str as strA
,regexp_replace(tbl.str,tbl.regex, '') as strB
from tbl;
but the text between subquotes still remains.
Is it even possible to achieve this with regular expressions, or should I split and analyze the contents in some loop?
An ideal solution would handle any number of levels of quoted text inside quoted text.
On handling unlimited levels of nesting: it's impossible with a single regular expression.
Neither recursive regexps nor recursive capture buffers are available in Oracle.
Update:
But it could be done by SQL:
with tbl as (
select
'''text start ''text inside'' text end'' leftover ''some other text'''
as str
from dual
)
select
listagg(text) within group (order by n)
from
(
select
n,
sum(decode(regexp_replace(str, '^(.*?([<>])){'||n||'}.*$', '\2'),
'<', 1, '>', -1, 0)) over (order by n) as nest,
regexp_replace(str, '^(.*?[<>]){'||n||'}([^<>]*).*$', '\2') as text
from
( select regexp_replace(regexp_replace(str, '(\s|^)''', '\1<'),
'''(\s|$)', '>\1') as str from tbl ),
( select level-1 as n from dual
connect by level-1 <= (select regexp_count(str, '''') from tbl) )
)
where nest = 0
Try:
, '^[^'']*(''.*'')[^'']*$' as regex
Caveat: this will dumbly capture all content between the first and the last occurrence of single quotes in the tested text in capture group 1, including the outermost quotes themselves. In particular, it does not check for proper nesting.
More importantly, your replacement expression will be more complex:
, CASE WHEN REGEXP_INSTR(test, regex) > 0
THEN REPLACE ( test, REGEXP_REPLACE(test, regex, '\1'), '' )
ELSE test
END
If the regexp matches, the capture group is extracted first and then used in an ordinary replacement (this works because the matched portion is guaranteed to be maximal).
IMPORTANT: this solution won't produce the desired result for the particular input you have supplied. However, you cannot do any better with the PL/SQL regexp functions, since the Oracle regex engine does not offer extensions to express recursion in the pattern (as e.g. PCRE does). You need this facility to resolve nested constructs (i.e. perform balanced counting).
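Since the question also asks whether the text should be analyzed in a loop, here is a minimal PL/SQL sketch of that route, which does the balanced counting by hand instead of with a regex. The function name strip_quoted is hypothetical. It assumes, in the same spirit as the SQL answer above, that a quote at the start of the string or right after a space opens a level and that any other quote closes one; that heuristic will not suit every input.

```sql
-- Sketch: remove quoted text, including nested quotes, by tracking nesting depth.
-- Assumption: a quote at position 1 or right after a space opens a level;
-- any other quote closes one. strip_quoted is a hypothetical name.
CREATE OR REPLACE FUNCTION strip_quoted(p_str IN VARCHAR2) RETURN VARCHAR2 IS
  v_depth PLS_INTEGER := 0;
  v_out   VARCHAR2(32767);
  v_ch    VARCHAR2(1 CHAR);
BEGIN
  FOR i IN 1 .. NVL(LENGTH(p_str), 0) LOOP
    v_ch := SUBSTR(p_str, i, 1);
    IF v_ch = '''' THEN
      IF i = 1 OR SUBSTR(p_str, i - 1, 1) = ' ' THEN
        v_depth := v_depth + 1;   -- opening quote
      ELSIF v_depth > 0 THEN
        v_depth := v_depth - 1;   -- closing quote
      END IF;
    ELSIF v_depth = 0 THEN
      v_out := v_out || v_ch;     -- keep only text outside all quotes
    END IF;
  END LOOP;
  RETURN TRIM(v_out);
END strip_quoted;
/

-- Usage with the string from the question; should return: leftover
SELECT strip_quoted('''text start ''text inside'' text end'' leftover ''some other text''') AS result
FROM dual;
```

A stray closing quote outside any quoted section is simply dropped, and the depth counter places no limit on the number of nesting levels.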