no preceding characters in regexp statement - regex

I tried to use a negative lookbehind in a regexp_substr call. I have looked at other solutions online, but they don't work for me, so I am obviously doing something wrong.
I want a match returned for the first row only; the others should return null. Essentially, I need CT CHEST or CT LUNG.
Any assistance is appreciated. TIA
with test (id, description) as (
select 1, 'CT CHEST HIGH RESOLUTION, NO CONTRAST' from dual union all --want this
select 2, 'INJECTION, THORACIC TRANSFORAMEN EPIDURAL, NON NEUROLYTIC W IMAGE GUIDANCE.' from dual union all --do not want this
select 3, 'The cow came back. But the dog went for a walk' from dual) --do not want this
select id, description, regexp_substr(description, '(?<![a-z]ct).{1,20}(CHEST|THOR|LUNG)',1,1,'i') from test;

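Oracle regular expressions do not support lookbehind. Consume the preceding non-letter character (or the start of the string) instead:
regexp_substr(description,'([^A-Z]|^)CT.{1,20}(CHEST|THOR|LUNG)',1,1,'i')
This works.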

Leverage Oracle's Subexpression Parameter to Check for CT
I would use subexpressions with a pattern like this:
regexp_substr(description, '(^| )((ct ).*((CHEST)|(THOR)|(LUNG)))', 1, 1, 'i', 2)
- subexpression 1 looks for the beginning of the line or a space: (^| )
- subexpression 3 looks for 'CT': (ct )
- allow for other characters: .*
- subexpressions 5, 6 and 7: (CHEST)|(THOR)|(LUNG), wrapped together in subexpression 4
- subexpression 2 contains subexpression 3 and subexpression 4
The last optional parameter specifies that I want subexpression 2 returned.
WITH test (id, description) AS (
SELECT 1
, 'CT CHEST HIGH RESOLUTION, NO CONTRAST' --want this
FROM dual
UNION ALL
SELECT 2
, 'INJECTION, THORACIC TRANSFORAMEN EPIDURAL, NON NEUROLYTIC W IMAGE GUIDANCE.' --do not want this
FROM dual
UNION ALL
SELECT 3
, 'The cow came back. But the dog went for a walk' --do not want this
FROM dual
)
SELECT id
, description
, regexp_substr(description, '(^| )((ct ).*((CHEST)|(THOR)|(LUNG)))', 1, 1, 'i', 2)
FROM test;
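Both answers above work around the missing lookbehind the same way: consume the start of the string or one non-letter character, then return only the part you care about. The logic can be checked outside the database. The following is a minimal Python sketch, not Oracle syntax; the helper name find_ct_study is made up for illustration.

```python
import re

# Emulate "CT not preceded by a letter" without a lookbehind:
# consume either the start of the string or one non-letter character,
# then capture the part we actually want (like Oracle's subexpression argument).
PATTERN = re.compile(
    r"(?:^|[^A-Za-z])(ct.{1,20}(?:chest|thor|lung))",
    re.IGNORECASE,
)

def find_ct_study(description):
    """Return the matched 'CT ... CHEST/THOR/LUNG' text, or None."""
    m = PATTERN.search(description)
    return m.group(1) if m else None

rows = [
    "CT CHEST HIGH RESOLUTION, NO CONTRAST",
    "INJECTION, THORACIC TRANSFORAMEN EPIDURAL, NON NEUROLYTIC W IMAGE GUIDANCE.",
    "The cow came back. But the dog went for a walk",
]
results = [find_ct_study(r) for r in rows]
print(results)
```

The "CT" inside "INJECTION" is skipped because it is preceded by a letter, which is the behaviour the negative lookbehind was meant to provide.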

Related

Regular Expression: changing matching method from OR to AND

I have a regular expression like the following (running on Oracle's regexp_like(), although the question isn't Oracle-specific):
abc|bcd|def|xyz
This checks whether the tags field in the database contains abc OR bcd OR def OR xyz when the user has entered the search query "abc bcd def xyz".
The tags field on the database holds keywords separated by spaces, e.g. "cdefg abcd xyz"
On Oracle, this would be something like:
select ... from ... where
regexp_like(tags, 'abc|bcd|def|xyz');
It works fine as it is, but I want to add an extra option for users to search for results that match all keywords. How should I change the regular expression so that it matches abc AND bcd AND def AND xyz ?
Note: Because I won't know what exact keywords the user will enter, I can't pre-structure the query in the PL/SQL like this:
select ... from ... where
tags like '%abc%' AND
tags like '%bcd%' AND
tags like '%def%' AND
tags like '%xyz%';
You can split the input pattern and check that all the parts of the pattern match:
SELECT t.*
FROM table_name t
CROSS APPLY(
WITH input (match) AS (
SELECT 'abc bcd def xyz' FROM DUAL
)
SELECT 1
FROM input
CONNECT BY LEVEL <= REGEXP_COUNT(match, '\S+')
HAVING COUNT(
REGEXP_SUBSTR(
t.tags,
REGEXP_SUBSTR(match, '\S+', 1, LEVEL)
)
) = REGEXP_COUNT(match, '\S+')
)
Or, if you have Java enabled in the database then you can create a Java function to match regular expressions:
CREATE AND COMPILE JAVA SOURCE NAMED RegexParser AS
import java.util.regex.Pattern;
public class RegexpMatch {
public static int match(
final String value,
final String regex
){
final Pattern pattern = Pattern.compile(regex);
return pattern.matcher(value).matches() ? 1 : 0;
}
}
/
Then wrap it in an SQL function:
CREATE FUNCTION regexp_java_match(value IN VARCHAR2, regex IN VARCHAR2) RETURN NUMBER
AS LANGUAGE JAVA NAME 'RegexpMatch.match( java.lang.String, java.lang.String ) return int';
/
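Then use it in SQL. Because Matcher.matches() has to match the entire value and lookaheads consume no characters, the pattern ends with .* so that the rest of the string is consumed:
SELECT *
FROM table_name
WHERE regexp_java_match(tags, '(?=.*abc)(?=.*bcd)(?=.*def)(?=.*xyz).*') = 1;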
Try this. The idea is to check that the number of matching patterns equals the total number of patterns:
with data(val) AS (
select 'cdefg abcd xyz' from dual union all
select 'cba lmnop xyz' from dual
),
targets(s) as (
select regexp_substr('abc bcd def xyz', '[^ ]+', 1, LEVEL) from dual
connect by regexp_substr('abc bcd def xyz', '[^ ]+', 1, LEVEL) is not null
)
select val from data d
join targets t on
regexp_like(val,s)
group by val having(count(*) = (select count(*) from targets))
;
Result:
cdefg abcd xyz
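The answers so far share one idea: split the search string into keywords and require every keyword to match, either by counting matches or by chaining lookaheads. Here is a minimal Python sketch of that logic, not Oracle code. The function names are hypothetical, and escaping each keyword (treating it as literal text rather than as a regular expression) is an assumption of this sketch.

```python
import re

def matches_all(tags, query):
    """True when every whitespace-separated keyword in query occurs in tags (AND)."""
    return all(re.search(re.escape(k), tags) for k in query.split())

def matches_any(tags, query):
    """True when at least one keyword in query occurs in tags (OR)."""
    return any(re.search(re.escape(k), tags) for k in query.split())

def lookahead_pattern(query):
    """Build one pattern with a lookahead per keyword, for engines that support lookaheads."""
    return "".join("(?=.*%s)" % re.escape(k) for k in query.split())

query = "abc bcd def xyz"
for tags in ("cdefg abcd xyz", "cba lmnop xyz"):
    print(tags, matches_all(tags, query), matches_any(tags, query))
```

The lookahead form only helps where the engine supports lookaheads (for example the Java function above); the counting form works with plain regexp_like.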
I think dynamic SQL will be needed for this. The match all option will require individual matching with logic to ensure every individual match is found.
An easy way would be to build a join condition for each keyword. Concatenate the join statements in a string. Use dynamic SQL to execute the string as a query.
The example below uses the customers table from the sample schemas provided by Oracle.
DECLARE
-- match string should be just the values to match with spaces in between
p_match_string VARCHAR2(200) := 'abc bcd def xyz';
-- need logic to determine match one (OR) versus match all (AND)
p_match_type VARCHAR2(3) := 'OR';
l_sql_statement VARCHAR2(4000);
-- create type if bulk collect is needed
TYPE t_email_address_tab IS TABLE OF customers.EMAIL_ADDRESS%TYPE INDEX BY PLS_INTEGER;
l_email_address_tab t_email_address_tab;
BEGIN
WITH sql_clauses(row_idx,sql_text) AS
(SELECT 0 row_idx -- build select plus beginning of where clause
,'SELECT email_address '
|| 'FROM customers '
|| 'WHERE 1 = '
|| DECODE(p_match_type, 'AND', '1', '0') sql_text
FROM DUAL
UNION
SELECT LEVEL row_idx -- build joins for each keyword
,DECODE(p_match_type, 'AND', ' AND ', ' OR ')
|| 'email_address'
|| ' LIKE ''%'
|| REGEXP_SUBSTR( p_match_string,'[^ ]+',1,level)
|| '%''' sql_text
FROM DUAL
CONNECT BY LEVEL <= LENGTH(p_match_string) - LENGTH(REPLACE( p_match_string, ' ' )) + 1
)
-- put it all together by row_idx
SELECT LISTAGG(sql_text, '') WITHIN GROUP (ORDER BY row_idx)
INTO l_sql_statement
FROM sql_clauses;
dbms_output.put_line(l_sql_statement);
-- can use execute immediate (or ref cursor) for dynamic sql
EXECUTE IMMEDIATE l_sql_statement
BULK COLLECT
INTO l_email_address_tab;
END;
Variable | Value
p_match_string | abc bcd def xyz
p_match_type | AND
l_sql_statement | SELECT email_address FROM customers WHERE 1 = 1 AND email_address LIKE '%abc%' AND email_address LIKE '%bcd%' AND email_address LIKE '%def%' AND email_address LIKE '%xyz%'

Variable | Value
p_match_string | abc bcd def xyz
p_match_type | OR
l_sql_statement | SELECT email_address FROM customers WHERE 1 = 0 OR email_address LIKE '%abc%' OR email_address LIKE '%bcd%' OR email_address LIKE '%def%' OR email_address LIKE '%xyz%'
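The statement-building step can also be sketched outside PL/SQL. The Python below is an illustration only: build_query is a hypothetical helper, and it swaps the concatenated literals of the block above for named bind placeholders (:k1, :k2, ...), which is a variation on the answer rather than what the answer does.

```python
def build_query(match_string, match_type="OR"):
    """Return (sql, binds) for a match-any (OR) or match-all (AND) keyword search."""
    keywords = match_string.split()
    joiner = " AND " if match_type == "AND" else " OR "
    # Neutral seed so every keyword clause can be prefixed with the joiner.
    seed = "1 = 1" if match_type == "AND" else "1 = 0"
    clauses = "".join(
        "%semail_address LIKE :k%d" % (joiner, i)
        for i in range(1, len(keywords) + 1)
    )
    sql = "SELECT email_address FROM customers WHERE " + seed + clauses
    binds = {"k%d" % i: "%" + kw + "%" for i, kw in enumerate(keywords, 1)}
    return sql, binds

sql, binds = build_query("abc bcd def xyz", "AND")
print(sql)
print(binds)
```

Keeping the user's keywords in bind values means they never become part of the SQL text itself.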

Inconsistent results from Oracle's REGEXP_SUBSTR

Given a string of key-value pairs: /* USER='Administrator'; UNV='Universe'; DOC='WebIntellignceReport'; */
My goal is to extract values associated with the USER, UNV, and DOC keys.
Using a pattern of (?<=UNV=')(.*?)(?='), I get the expected value of Universe associated with the UNV key (Fiddle).
However, when I use the pattern with REGEXP_SUBSTR, I get a NULL:
SELECT text
,REGEXP_SUBSTR(text,'(?<=UNV='')(.*?)(?='')') UNV
FROM (
SELECT '/* USER=''Administrator''; UNV=''Universe''; DOC=''WebIntellignceReport''; */' as text
FROM dual
) v
What am I missing?
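Oracle's regular expressions do not support lookbehind or lookahead, so the pattern is not interpreted the way it is in a PCRE tester. Instead, you may extract the contents of group 1: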
SELECT text, REGEXP_SUBSTR(text,'UNV=''(.*?)''', 1, 1 ,NULL, 1) UNV
FROM (
SELECT '/* USER=''Administrator''; UNV=''Universe''; DOC=''WebIntellignceReport''; */' as text
FROM dual
) v
See the online demo.
With UNV='(.*?)', you extract just what is between the closest single quotes after UNV=.
I think the easiest thing to do is just grab the whole key-value pair using REGEXP_SUBSTR, and then do another substr to pull out the value you want.
with v as (select '/* USER=''Administrator''; UNV=''Universe''; DOC=''WebIntellignceReport''; */' as text from dual)
select text, key_val, substr(key_val, instr(key_val, '''')+1, length(key_val)-instr(key_val, '''')-2)
from (
select text,
regexp_substr(text, ' UNV=''[^'']*'';') key_val
from v);
Output:
TEXT KEY_VAL VAL
----------------------------------------------------------------------- ----------------------------------------------------------------------- -----------------------------------------------------------------------
/* USER='Administrator'; UNV='Universe'; DOC='WebIntellignceReport'; */ UNV='Universe'; Universe
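The capture-group approach extends naturally to all three keys at once. As a rough illustration of the idea in Python (not Oracle), the sketch below captures the key and the quoted value as two groups; extract_values is a made-up name, and the assumption is that values never contain a single quote.

```python
import re

TEXT = "/* USER='Administrator'; UNV='Universe'; DOC='WebIntellignceReport'; */"

# Group 1 is the key, group 2 is the text between the closest single quotes.
PAIR = re.compile(r"(\w+)='(.*?)'")

def extract_values(text):
    """Return a dict mapping each key to its quoted value."""
    return dict(PAIR.findall(text))

values = extract_values(TEXT)
print(values["UNV"])
```

No lookaround is needed because the surrounding text (UNV=' and the closing quote) is matched but simply left out of the captured group.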

Can Redshift SQL perform a case insensitive regular expression evaluation?

The documentation says that regexp_instr() and ~ are a case-sensitive POSIX function and operator, respectively.
Is there a POSIX syntax for case-insensitive matching, or a plug-in for a PCRE-based function or operator?
Here is an example of PCRE syntax tried in a Redshift query; it doesn't work as desired because of the POSIX semantics.
select
A.target
, B.pattern
, regexp_instr(A.target, B.pattern) as rx_instr_position
, A.target ~ B.pattern as tilde_operator
, regexp_instr(A.target
, 'm/'||B.pattern||'/i') as rx_instr_position_icase
from
( select 'AbCdEfffghi' as target
union select 'Chocolate' as target
union select 'Cocoa Latte' as target
union select 'coca puffs, delivered late' as target
) A
,
( select 'choc.*late' as pattern
union select 'coca.*late' as pattern
union select 'choc\w+late' as pattern
union select 'choc\\w+late' as pattern
) B
To answer your question: No Redshift-compatible syntax or plugins that I know of. In case you could live with a workaround: We ended up using lower() around the strings to match:
select
A.target
, B.pattern
, regexp_instr(A.target, B.pattern) as rx_instr_position
, A.target ~ B.pattern as tilde_operator
, regexp_instr(A.target, 'm/'||B.pattern||'/i') as rx_instr_position_icase
, regexp_instr(lower(A.target), B.pattern) as rx_instr_position_icase_by_lower
from
( select 'AbCdEfffghi' as target
union select 'Chocolate' as target
union select 'Cocoa Latte' as target
union select 'coca puffs, delivered late' as target
) A
,
( select 'choc.*late' as pattern
union select 'coca.*late' as pattern
union select 'choc\w+late' as pattern
union select 'choc\\w+late' as pattern
) B
The ~* operator matches case-insensitively:
select 'HELLO' ~* 'el' = true
This is currently undocumented (as of 2020-11-05).
Redshift now provides a direct solution for case-insensitive regular expression flags via added function parameters: Amazon Redshift - REGEXP_INSTR
The syntax using the provided query example would be:
select
A.target
, B.pattern
, regexp_instr(A.target, B.pattern) as rx_instr_position
, A.target ~ B.pattern as tilde_operator
, regexp_instr(A.target, B.pattern, 1, 1, 0, 'i') AS rx_instr_position_icase
from
( select 'AbCdEfffghi' as target
union select 'Chocolate' as target
union select 'Cocoa Latte' as target
union select 'coca puffs, delivered late' as target
) A
,
( select 'choc.*late' as pattern
union select 'coca.*late' as pattern
union select 'choc\w+late' as pattern
union select 'choc\\w+late' as pattern
) B
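The difference between the lower() workaround and a real case-insensitive flag shows up when the pattern itself contains upper-case letters. The following Python sketch only mimics the behaviour for comparison; regexp_position is a hypothetical stand-in that returns a 1-based position, or 0 when nothing matches, and it is not Redshift code.

```python
import re

def regexp_position(target, pattern, ignore_case=False):
    """1-based start of the first match, or 0 when there is no match."""
    flags = re.IGNORECASE if ignore_case else 0
    m = re.search(pattern, target, flags)
    return m.start() + 1 if m else 0

target = "Chocolate"

case_sensitive = regexp_position(target, r"choc.*late")
via_lower = regexp_position(target.lower(), r"choc.*late")
via_flag = regexp_position(target, r"choc.*late", ignore_case=True)

# With an upper-case letter in the pattern, lowering only the target is not enough.
mixed_via_lower = regexp_position(target.lower(), r"Choc.*late")
mixed_via_flag = regexp_position(target, r"Choc.*late", ignore_case=True)

print(case_sensitive, via_lower, via_flag, mixed_via_lower, mixed_via_flag)
```

If you rely on lower(), lower-case the pattern as well, or keep patterns lower-case by convention.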

Oracle How do I transform this string field into structured data using regular expressions?

I did start at this answer:
Oracle 11g get all matched occurrences by a regular expression
But it didn't get me far enough. I have a string field that looks like this:
A=&token1&token2&token3,B=&token2&token3&token5
It could have any number of tokens and any number of keys. The desired output is a set of rows looking like this:
Key | Token
A | &token1
A | &token2
A | &token3
B | &token2
B | &token3
B | &token5
This is proving rather difficult to do.
I started here:
SELECT token from
(SELECT REGEXP_SUBSTR(str, '[A-Z=&]+', 1, LEVEL) AS token
FROM (SELECT 'A=&token1&token2&token3,B=&token2&token3&token5' str from dual)
CONNECT BY LEVEL <= LENGTH(REGEXP_REPLACE(str, '[A-Z=&]+', ',')))
Where token is not null
But that yields:
A=&
&
&
B=&
&
&
which is getting me nowhere. I'm thinking I need to do a nested clever select where the first one gets me
A=&token1&token2&token3
B=&token2&token3&token5
And a subsequent select might be able to do a clever extract to get the final result.
Stumped. I'm trying to do this without using procedural or function code -- I would like the set to be something I can union with other queries so if it's possible to do this with nested selects that would be great.
UPDATE:
SET DEFINE OFF
SELECT SUBSTR(token,1,1) as Key, REGEXP_SUBSTR(token, '&\w+', 1, LEVEL) AS token2
FROM
(
-- 1 row per key/value pair
SELECT token from
(SELECT REGEXP_SUBSTR(str, '[^,]+', 1, LEVEL) AS token
FROM (SELECT 'A=&token1&token2&token3,B=&token2&token3&token5' str from dual)
CONNECT BY LEVEL <= LENGTH(REGEXP_REPLACE(str, '[^,]+', ',')))
Where token is not null
)
CONNECT BY LEVEL <= LENGTH(REGEXP_REPLACE(token, '&\w+'))
This gets me
A | &token1
A | &token2
B | &token3
B | &token2
A | &token2
B | &token3
Which is fantastic formatting except for the small problem that it's wrong (A should have a token3, and B's token5 is nowhere to be seen).
Great question! Thanks for it!
select distinct k, regexp_substr(v, '[^&]+', 1, level) t
from (
select substr(regexp_substr(val,'^[^=]+=&'),1,length(regexp_substr(val,'^[^=]+=&'))-2) k, substr(regexp_substr(val,'=&.*'),3) v
from (
select regexp_substr(str, '[^,]+', 1, level) val
from (select 'A=&token1&token2&token3,B=&token2&token3&token5' str from dual)
connect by level <= length(str) - length(replace(str,','))+1
)
) connect by level <= length(v) - length(replace(v,'&'))+1
It is an answer, and one that seems to work. But I don't like the middle step that splits val into k and v; there must be a better way (if the key is always one character, that makes it easy, though). And having to use DISTINCT to get rid of duplicates is horrible. Maybe with further playing you can clean it up (or someone else might).
EDIT based on keeping the leading & and the key being a single character:
select distinct k, regexp_substr(v, '&[^&]+', 1, level) t
from (
select substr(val,1,1) k
, substr(regexp_substr(val,'=&.*'),1) v
from (
select regexp_substr(str, '[^,]+', 1, level) val
from (select 'A=&token1&token2&token3,B=&token2&token3&token5' str from dual)
connect by level <= length(str) - length(replace(str,','))+1
)
) connect by level < length(v) - length(replace(v,'&'))+1
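The two-level split that the queries perform (first on commas, then on & within each pair) is easier to see in procedural form. This Python sketch is only meant to show the intended result set; key_token_rows is a made-up name, and it assumes every pair contains exactly one = sign.

```python
import re

def key_token_rows(s):
    """Split 'A=&t1&t2,B=&t3' into (key, token) rows, keeping the leading &."""
    rows = []
    for pair in s.split(","):            # first level: one entry per key
        key, _, tokens = pair.partition("=")
        for token in re.findall(r"&[^&]+", tokens):  # second level: one row per token
            rows.append((key, token))
    return rows

rows = key_token_rows("A=&token1&token2&token3,B=&token2&token3&token5")
for key, token in rows:
    print(key, "|", token)
```

Each inner loop only sees the tokens of its own key, so no duplicates are produced and nothing like DISTINCT is needed.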

REGEXP_SUBSTR Is taking more time for execution in Oracle

I am trying to split a comma-separated string of email IDs into individual email IDs, each enclosed in single quotes and separated by commas.
My input is 'one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com'
My output should be: 'one#gmail.com','two#gamil.com','three#gmail.com','four#gmail.com'
I am going to use the output string in the WHERE condition of an Oracle query, like this:
WHERE EmailId IN ('one#gmail.com','two#gamil.com','three#gmail.com','four#gmail.com');
I am using the following code to achieve this:
WHERE EMAIL IN
(REGEXP_SUBSTR('one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com' ,'[^,]+', 1, LEVEL))
CONNECT BY LEVEL <= LENGTH('one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com' ) - LENGTH(REPLACE('one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com' , ',', '')) +1;
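But the above query takes 60 seconds to return only 16 records. Can anyone suggest a better approach?
Try this. It moves the CONNECT BY into a subquery against dual, so the row generation runs over a single row rather than over the whole table: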
WHERE email IN (
select regexp_substr('one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com','[^,]+', 1, level) from dual
connect by regexp_substr('one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com', '[^,]+', 1, level) is not null );
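If the query is issued from application code, another option is to split the string once on the client and pass the values as binds. This is a sketch under that assumption, written in Python; in_clause is a hypothetical helper and the :e1, :e2, ... placeholder names are made up for illustration.

```python
def in_clause(csv, column="email"):
    """Return ('<column> IN (:e1, :e2, ...)', binds) for a comma-separated string."""
    items = [part.strip() for part in csv.split(",") if part.strip()]
    placeholders = ", ".join(":e%d" % i for i in range(1, len(items) + 1))
    binds = {"e%d" % i: value for i, value in enumerate(items, 1)}
    return "%s IN (%s)" % (column, placeholders), binds

clause, binds = in_clause("one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com")
print(clause)
print(binds)
```

The database then sees a plain IN list and does no regular-expression work at all.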