RegExp not returning the desired results - regex

I would expect the following code to return these two lines
88518-008
89274-021(08518-008,09274-021)
But it returns only the second one, and I don't understand why. Any help would be great!
WITH DATA AS
(
SELECT '88518-008,89274-021(08518-008,09274-021)' str
FROM dual
)
SELECT TRIM(REGEXP_SUBSTR(str, '[^,]+\((.+)\)|[^,]+(?![^\(]*\))+', 1, LEVEL)) str
FROM DATA
CONNECT BY REGEXP_INSTR(str, '\,(?![^\(]*\))', 1, LEVEL - 1) > 0
I tested the regexes online and they work as expected. I took the query from another example and replaced the values to match my needs.

You need the following regex:
'([^,]*),(.*\([^\)]+\))'
It starts by creating Group 1, which matches anything but a comma. It then matches a comma and creates Group 2, which matches anything up to a left parenthesis, then the left parenthesis, then anything up to a right parenthesis, and finally the right parenthesis.
That will give you the first value in Group 1, and the second value in Group 2.
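As a quick cross-check of what the two groups capture, here is a minimal Python sketch that applies the same pattern to the sample string from the question. Python's `re` module is only a stand-in and its dialect differs from Oracle's, so this illustrates the grouping rather than Oracle's behaviour; the helper name `split_two_values` is made up for the example.

```python
import re

# Same pattern as in the answer above; the sample string comes from the question.
PATTERN = re.compile(r'([^,]*),(.*\([^\)]+\))')


def split_two_values(text):
    """Return (group 1, group 2) of the first match, or None if nothing matches."""
    match = PATTERN.search(text)
    return match.groups() if match else None


print(split_two_values('88518-008,89274-021(08518-008,09274-021)'))
# ('88518-008', '89274-021(08518-008,09274-021)')
```

In Oracle itself, the individual groups would be read through the subexpression argument of REGEXP_SUBSTR rather than through a match object.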

Thanks for your help. The query below returns the desired results:
WITH DATA AS
(
SELECT 'word1, word2, word3, word4, word5, word6 (word7, word8)' str FROM dual
)
SELECT trim(regexp_substr(str, '[^,]+\((.+)\)|[^,]+(?![^\(]*\))+|[^,]+', 1, LEVEL)) str
FROM DATA
CONNECT BY REGEXP_INSTR(str, ',', 1, LEVEL) > 0
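One likely reason the original pattern behaved differently from the online tester is that Oracle's regular expression functions follow POSIX-style syntax and do not support lookahead constructs such as `(?!...)`, whereas most online testers use PCRE-style engines. To check the expected split independently of any regex dialect, a small Python sketch can track parenthesis depth instead. The function name `split_outside_parens` is hypothetical, and the sketch assumes parentheses are balanced and not nested in unusual ways.

```python
def split_outside_parens(text):
    """Split on commas that are not inside parentheses and trim each piece."""
    pieces, current, depth = [], [], 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth = max(depth - 1, 0)  # tolerate a stray closing parenthesis
        if ch == ',' and depth == 0:
            # Top-level comma: close the current piece
            pieces.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    pieces.append(''.join(current).strip())
    return pieces


print(split_outside_parens('88518-008,89274-021(08518-008,09274-021)'))
# ['88518-008', '89274-021(08518-008,09274-021)']
```

The result can be compared row by row with what the CONNECT BY query returns.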

Related

how can I in SQL get a list of all occurrences in a certain field using regex

Imagine you have a field with a lot of texts and you want to get a list of substrings that match a certain pattern (it's important to know which ones not just how many).
I'm a bit surprised I cannot do this using REGEXP_SUBSTR but I think there must be a way of doing this, just can't easily figure out how.
An example:
CREATE TABLE test_me (
text varchar(3000)
);
insert into test_me values (
'obj_a, obj_b,
trx_a, trx_c,
obj_c,
obj_d,
obj');
For this you would like to retrieve:
obj_a, obj_b, obj_c, obj_d, obj
I was playing with something like:
SELECT REGEXP_SUBSTR(text, '(obj(.*))', 1, 3) "REGEXP_SUBSTR" FROM test_me;
Which can give me the first, the second, or any other single occurrence, but not all of them.
With regex in other languages I can get all the matching groups and loop over them, but I could not find a way to do this in Oracle. I guess I could "unpivot" this field into a new table with one row per line and then process each line, but that is not exactly what I want, because even a single line can contain several occurrences that matter. Is there not an easy way to do this?
The use case for example could be to know all the dependencies I have in my queries, if for example I would store the full query in a column. Or if I have a list of distributed lists in an email and I want to know all the participants of a certain domain, for example.
Any ideas?
edited with the simple example.
Thanks to all that provided feedback. The question should be clear now.
You can use a recursive sub-query:
WITH matches ( text, match, idx, num_matches ) AS (
SELECT text,
REGEXP_SUBSTR(text, '(obj.*?)(,|$)', 1, 1, NULL, 1),
1,
REGEXP_COUNT(text, '(obj.*?)(,|$)')
FROM test_me
UNION ALL
SELECT text,
REGEXP_SUBSTR(text, '(obj.*?)(,|$)', 1, idx + 1, NULL, 1),
idx + 1,
num_matches
FROM matches
WHERE idx < num_matches
)
SELECT match
FROM matches
WHERE idx <= num_matches
Which, for the sample data:
CREATE TABLE test_me (
text varchar(3000)
);
insert into test_me (text) VALUES (
'obj_a, obj_b,
trx_a, trx_c,
obj_c,
obj_d,
obj'
);
Outputs:
MATCH
obj_a
obj_b
obj_c
obj_d
obj
If you just want the output in a single row then:
SELECT REGEXP_REPLACE(text, '.*?(obj.*?(,|$)|$)', '\1', 1, 0, 'n') AS matches
FROM test_me
Which outputs:
MATCHES
obj_a,obj_b,obj_c,obj_d,obj
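For comparison, this is the "get all matches and loop over them" approach the question mentions, sketched in Python with the same lazy pattern as the recursive query. It is only a reference for the expected result, not an Oracle solution; `find_tokens` and its `prefix` parameter are names invented for the example.

```python
import re

SAMPLE = """obj_a, obj_b,
trx_a, trx_c,
obj_c,
obj_d,
obj"""


def find_tokens(source, prefix='obj'):
    """Return every token that starts with prefix and runs up to a comma or the end."""
    # Lazy match from the prefix up to the next comma or the end of the string
    pattern = re.compile('(' + re.escape(prefix) + r'.*?)(?:,|$)')
    return pattern.findall(source)


matches = find_tokens(SAMPLE)
print(matches)            # one entry per match, like the recursive query
print(','.join(matches))  # single-row form, like the REGEXP_REPLACE variant
```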

Url regex oracle

I want to do something like this :
This is the link I want to replace. So I want only to keep the "textIwantToKeep" part :
http://mylink/aaa-bbb/textIwantToKeep
And I want this :
http://mySecondLink/ccc-ddd/textIwantToKeep
I want to use a regular expression in Oracle SQL Developer. I thought about counting the slashes (4) and replacing only the part before the 4th slash, but it doesn't work.
Thank you for your help.
REGEXP_SUBSTR might be one option; \w+$ returns the last word (i.e. the one "anchored" to the end of the string):
with test (link) as
  (select 'http://mylink/aaa-bbb/textIwantToKeep' from dual union all
   select 'http://mySecondLink/ccc-ddd/textIwantToKeep' from dual
  )
select link,
       regexp_substr(link, '\w+$') result
from test;

LINK                                        RESULT
------------------------------------------- --------------------
http://mylink/aaa-bbb/textIwantToKeep       textIwantToKeep
http://mySecondLink/ccc-ddd/textIwantToKeep textIwantToKeep
There could be other alternatives, but here is something that came to me quickly:
WITH main_table AS (
SELECT 'http://mylink/aaa-bbb/textIwantToKeep' AS original_string FROM dual
)
,
second_table AS (
SELECT 'http://mySecondLink/ccc-ddd/' AS my_second_link FROM dual
)
SELECT
second_table.my_second_link
|| regexp_substr(main_table.original_string, '[^/]+', 1, 4) AS final_string
FROM
main_table,
second_table;
Let me know if that works.
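The "4th run of non-slash characters" idea from the query above can be sketched in Python to see which piece of the URL it picks. The function name `swap_prefix` is hypothetical, and returning `None` for short URLs is a choice made for this sketch, not something the SQL does.

```python
import re


def swap_prefix(original_url, new_prefix):
    """Append the 4th slash-separated token of original_url to new_prefix."""
    # Same tokenisation as regexp_substr(..., '[^/]+', 1, 4): runs of non-slash characters
    tokens = re.findall(r'[^/]+', original_url)
    if len(tokens) < 4:
        return None  # assumption for this sketch: no 4th token means no result
    return new_prefix + tokens[3]


print(swap_prefix('http://mylink/aaa-bbb/textIwantToKeep',
                  'http://mySecondLink/ccc-ddd/'))
# http://mySecondLink/ccc-ddd/textIwantToKeep
```

Note that `http:` counts as the first token, which is why the wanted text is the 4th one.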

Locate where is the nth occurrence of a token in a string separated by pipes

I am a newbie with regex and would like to know if this is possible.
Is it possible to locate the token position of a substring in a string like the sample text below?
AA|BBBBBBBBBB|XXXX||XXXX||FFFFFFFFFFF
Requesting the position of the 1st occurrence of 'XXXX' should return '3', requesting the 2nd occurrence of 'XXXX' should return '5', and requesting the 3rd occurrence of 'XXXX' should return '0' because there is no 3rd occurrence.
Can this be done using just regex?
Thanks in advance.
PS: If it is possible, I will implement this solution on DB2 v7r2 using the REGEX functions, to replace a UDF I wrote a long time ago in PL/SQL to do this job.
This isn't how I'd normally use regex....
But it can get the job done...
create variable mysource varchar(50)
default('AA|BBBBBBBBBB|XXXX||XXXX||FFFFFFFFFFF');
select
regexp_count(
substring(mysource
, 1
,regexp_instr(mysource
,'XXXX'
,1
,2 --occurrence
,1)
)
,'\|')
from sysibm.sysdummy1;
REGEXP_COUNT
5
Might need to concat a '|' to the end of the source if it's possible for the pattern to fall in the last position.
EDIT
Ok, here's a completely different way...using a recursive common table expression (RCTE)
Note that the solution is easiest if you ensure that the text ends with a delimiter...
create variable mysource varchar(50)
default('AA|BBBBBBBBBB|XXXX||XXXX||FFFFFFFFFFF|');
And the code..
with splitstring (pos, data, remain) as (
select 1
, substring(mysource,1,locate('|', mysource) -1 )
, substring(mysource,locate('|', mysource) + 1 )
from sysibm.sysdummy1
union all
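select pos + 1
, substring(remain,1,locate('|', remain) -1 )
, substring(remain,locate('|', remain) + 1 )
from splitstring
-- assumed termination condition: stop once no delimiter remains
where locate('|', remain) > 0
)
, matches as (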
select row_number() over (order by pos) as occur
,pos
from splitString
where data = 'XXXX'
)
select coalesce(pos,0) as pos
from sysibm.sysdummy1
left join matches
on occur = 2 ;
Results
POS
5
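The expected answers from the question (3, 5 and 0) can be checked with a short Python sketch that splits on the delimiter and counts matching tokens. This is only a reference for the logic, not DB2 code; `token_position` and its parameters are names chosen for the example.

```python
def token_position(source, token, occurrence, sep='|'):
    """Return the 1-based token position of the nth occurrence of token, or 0."""
    seen = 0
    for position, value in enumerate(source.split(sep), start=1):
        if value == token:
            seen += 1
            if seen == occurrence:
                return position
    return 0  # fewer than `occurrence` matching tokens


SOURCE = 'AA|BBBBBBBBBB|XXXX||XXXX||FFFFFFFFFFF'
print(token_position(SOURCE, 'XXXX', 1))  # 3
print(token_position(SOURCE, 'XXXX', 2))  # 5
print(token_position(SOURCE, 'XXXX', 3))  # 0
```

Unlike the REGEXP_INSTR approach, this compares whole tokens, so a token such as 'XXXXY' would not count as a match.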

PLSQL select substr between Nth and Mth occurrence of character

I'm sure there is a simple function for exactly this problem, but I can't seem to find it...
I have a string containing multiple slashes, for example a URL. Let's say I want to obtain the substring between the second and fourth occurrence of the slash if the fourth exists; otherwise I want everything following the second slash, or simply "" if the string contains fewer than 2 slashes.
Hence: 'ab/cd/ef/gh/ij' should be selected as 'ef/gh' and 'abc/d' should be selected as ''.
What is the magical function/combination of functions I'm looking for? Tried to play around with substr and regexp_substr, but it got messy quite rapidly, without the desired result.
Apparently I wasn't searching hard enough. The function instr does the trick, hence in combination with substr:
SUBSTR(string, INSTR(string,'/',1,2) + 1, INSTR(string,'/',1,4) - INSTR(string,'/',1,2)-1)
Still looks kind of dirty to me though, creativity is more than welcome.
Give this a try. I suspect the regexes could be simpler, but it meets your requirements. Note that the order of the tests in the case statement is very important; otherwise the string can fall into the wrong branch.
with tbl(rownbr, str) as (
select 1, 'ab/cd/ef/gh/ij/x/x/x' from dual union
select 2, 'aa/bb/cc' from dual union
select 3, 'gg/hh/ii/jj' from dual union
select 4, 'abc/d' from dual union
select 5, 'zz' from dual
)
select rownbr,
case
when regexp_count(str, '/') >= 4 then
regexp_replace(str, '^.*?/.*?/(.*?/.*?)/.*$', '\1')
when regexp_count(str, '/') < 2 then
NULL
when regexp_count(str, '/') < 4 then
regexp_replace(str, '^.*?/.*?/(.*)$', '\1')
end result
from tbl;
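The three cases in the requirement (four or more slashes, two or three slashes, fewer than two) can be written down as a small Python reference to test any SQL version against. It is a sketch of the rule as stated in the question, and `between_slashes` is a hypothetical name.

```python
def between_slashes(text):
    """Return the part between the 2nd and 4th slash, per the rule in the question."""
    parts = text.split('/')
    if len(parts) < 3:
        return ''  # fewer than 2 slashes
    # parts[2:4] is the 3rd and 4th segment; with only 2 or 3 slashes the slice
    # simply yields everything after the second slash
    return '/'.join(parts[2:4])


for sample in ('ab/cd/ef/gh/ij', 'aa/bb/cc', 'gg/hh/ii/jj', 'abc/d', 'zz'):
    print(repr(sample), '->', repr(between_slashes(sample)))
```

Keep in mind that Oracle treats an empty string as NULL, so '' here corresponds to NULL in the query results.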

Oracle regex eliminate all duplicate words

I would like to eliminate all duplicate words in a comma separated list.
I've tried with:
SELECT
REGEXP_REPLACE(
'1234,234,1234,1234,928,1234,123,1234,Abcd,1234,1234',
'([^,\w]+)(,[ ]*[\1])+') AS r
FROM dual
It should return
1234,234,928,123,Abcd
But in fact it returns
1234,234,234,234
Also tried with ([^,\w]+)(,[ ]*\1)+ but with '1234,1234,1234' it returns (null)
Also tried with
SELECT
REGEXP_REPLACE(
'1234,234,1234,1234,928,1234,123,1234,Abcd,1234,1234',
'([^,\w]+)(,[ ]*[\1])+', '\1') AS r
FROM dual
and following replacements, even '\1\2' but none of them is giving the desired result.
Please, any ideas?
I know this isn't exactly the method you were asking for, but it still achieves the same result:
WITH DATA AS
( SELECT '1234,234,1234,1234,928,1234,123,1234,Abcd,1234,1234' str FROM dual)
SELECT DISTINCT trim(regexp_substr(str, '[^,]+', 1, LEVEL)) str
FROM DATA
CONNECT BY instr(str, ',', 1, LEVEL - 1) > 0
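One thing to watch with the DISTINCT approach is that, without an ORDER BY, the order of the returned rows is not guaranteed, while the expected output in the question keeps first-appearance order. As a reference for that expected output, here is a minimal Python sketch that removes duplicates while preserving order; `unique_words` is a name made up for the example.

```python
def unique_words(csv_text):
    """Drop repeated words from a comma-separated list, keeping first appearances."""
    words = (word.strip() for word in csv_text.split(','))
    # dict keys keep insertion order, so this de-duplicates without reordering
    return ','.join(dict.fromkeys(words))


print(unique_words('1234,234,1234,1234,928,1234,123,1234,Abcd,1234,1234'))
# 1234,234,928,123,Abcd
```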