Inconsistent results from Oracle's REGEXP_SUBSTR


Given a string of key-value pairs: /* USER='Administrator'; UNV='Universe'; DOC='WebIntellignceReport'; */
My goal is to extract values associated with the USER, UNV, and DOC keys.
Using a pattern of (?<=UNV=')(.*?)(?='), I get the expected value of Universe associated with the UNV key (Fiddle).
However, when I use the pattern with REGEXP_SUBSTR, I get a NULL:
SELECT text
,REGEXP_SUBSTR(text,'(?<=UNV='')(.*?)(?='')') UNV
FROM (
SELECT '/* USER=''Administrator''; UNV=''Universe''; DOC=''WebIntellignceReport''; */' as text
FROM dual
) v
What am I missing?

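Oracle's regular expressions do not support lookbehind or lookahead, so the pattern does not behave as it does in other engines. Use a capturing group instead and extract the contents of group 1: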
SELECT text, REGEXP_SUBSTR(text,'UNV=''(.*?)''', 1, 1 ,NULL, 1) UNV
FROM (
SELECT '/* USER=''Administrator''; UNV=''Universe''; DOC=''WebIntellignceReport''; */' as text
FROM dual
) v
See the online demo.
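With UNV='(.*?)', you extract just what is between the closest single quotes after UNV=. The final argument, 1, tells REGEXP_SUBSTR to return the first capturing group rather than the whole match.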
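The same capture-group idea can be tried outside the database. The following is only a sketch in Python, not Oracle code; the helper name extract_value is made up for illustration, and Python's re module stands in for Oracle's regex engine:

```python
import re

text = "/* USER='Administrator'; UNV='Universe'; DOC='WebIntellignceReport'; */"

def extract_value(text, key):
    """Return the quoted value that follows key=, or None if the key is absent.

    extract_value is a hypothetical helper, not part of any library.
    """
    # Same shape as the Oracle pattern: key='(.*?)' with the value in group 1
    match = re.search(re.escape(key) + r"='(.*?)'", text)
    return match.group(1) if match else None

values = {key: extract_value(text, key) for key in ("USER", "UNV", "DOC")}
print(values)
```

Because the value is taken from group 1, no lookbehind or lookahead is needed, which is why the same pattern also works in REGEXP_SUBSTR.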

I think the easiest thing to do is just grab the whole key-value pair using REGEXP_SUBSTR, and then do another substr to pull out the value you want.
with v as (select '/* USER=''Administrator''; UNV=''Universe''; DOC=''WebIntellignceReport''; */' as text from dual)
select text, key_val, substr(key_val, instr(key_val, '''')+1, length(key_val)-instr(key_val, '''')-2) as val
from (
select text,
regexp_substr(text, ' UNV=''[^'']*'';') key_val
from v);
Output:
TEXT                                                                    KEY_VAL          VAL
----------------------------------------------------------------------- ---------------- --------
/* USER='Administrator'; UNV='Universe'; DOC='WebIntellignceReport'; */  UNV='Universe'; Universe

Related

Regular Expression: changing matching method from OR to AND

I have a regular expression like the following (running on Oracle's regexp_like(), although the question isn't Oracle-specific):
abc|bcd|def|xyz
This checks whether the tags field in the database contains abc OR bcd OR def OR xyz when the user enters the search query "abc bcd def xyz".
The tags field on the database holds keywords separated by spaces, e.g. "cdefg abcd xyz"
On Oracle, this would be something like:
select ... from ... where
regexp_like(tags, 'abc|bcd|def|xyz');
It works fine as it is, but I want to add an extra option for users to search for results that match all keywords. How should I change the regular expression so that it matches abc AND bcd AND def AND xyz ?
Note: Because I won't know what exact keywords the user will enter, I can't pre-structure the query in the PL/SQL like this:
select ... from ... where
tags like '%abc%' AND
tags like '%bcd%' AND
tags like '%def%' AND
tags like '%xyz%';

You can split the input pattern and check that all the parts of the pattern match:
SELECT t.*
FROM table_name t
CROSS APPLY(
WITH input (match) AS (
SELECT 'abc bcd def xyz' FROM DUAL
)
SELECT 1
FROM input
CONNECT BY LEVEL <= REGEXP_COUNT(match, '\S+')
HAVING COUNT(
REGEXP_SUBSTR(
t.tags,
REGEXP_SUBSTR(match, '\S+', 1, LEVEL)
)
) = REGEXP_COUNT(match, '\S+')
)
Or, if you have Java enabled in the database then you can create a Java function to match regular expressions:
CREATE AND COMPILE JAVA SOURCE NAMED RegexParser AS
import java.util.regex.Pattern;
public class RegexpMatch {
public static int match(
final String value,
final String regex
){
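final Pattern pattern = Pattern.compile(regex);
// find() succeeds on a match anywhere in the value; matches() would fail here
// because the zero-width lookaheads never consume the whole string
return pattern.matcher(value).find() ? 1 : 0;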
}
}
/
Then wrap it in an SQL function:
CREATE FUNCTION regexp_java_match(value IN VARCHAR2, regex IN VARCHAR2) RETURN NUMBER
AS LANGUAGE JAVA NAME 'RegexpMatch.match( java.lang.String, java.lang.String ) return int';
/
Then use it in SQL:
SELECT *
FROM table_name
WHERE regexp_java_match(tags, '(?=.*abc)(?=.*bcd)(?=.*def)(?=.*xyz)') = 1;
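The lookahead pattern passed to regexp_java_match has to be built from whatever keywords the user types. A rough sketch of that step, written in Python rather than Java or PL/SQL; build_all_keywords_pattern is a hypothetical name, and re.escape is Python's quoting, so another engine would need its own escaping routine:

```python
import re

def build_all_keywords_pattern(query):
    """Turn a space-separated query into one lookahead per keyword.

    Hypothetical helper for illustration only.
    """
    keywords = query.split()
    # Each (?=.*kw) asserts that kw occurs somewhere after the current position
    return "".join("(?=.*" + re.escape(kw) + ")" for kw in keywords)

pattern = build_all_keywords_pattern("abc bcd def xyz")
print(pattern)

# search() looks for a match anywhere, like Matcher.find() in Java
print(bool(re.search(pattern, "cdefg abcd xyz")))
print(bool(re.search(pattern, "cba lmnop xyz")))
```

The first tags value contains all four keywords as substrings and matches; the second does not.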

Try this; the idea is to check that the number of matching keywords equals the number of keywords:
with data(val) AS (
select 'cdefg abcd xyz' from dual union all
select 'cba lmnop xyz' from dual
),
targets(s) as (
select regexp_substr('abc bcd def xyz', '[^ ]+', 1, LEVEL) from dual
connect by regexp_substr('abc bcd def xyz', '[^ ]+', 1, LEVEL) is not null
)
select val from data d
join targets t on
regexp_like(val,s)
group by val having(count(*) = (select count(*) from targets))
;
Result:
cdefg abcd xyz

I think dynamic SQL will be needed for this. The match-all option requires matching each keyword individually, with logic to ensure that every keyword is found.
An easy way would be to build a LIKE condition for each keyword, concatenate the conditions into a string, and use dynamic SQL to execute the string as a query.
The example below uses the customers table from the sample schemas provided by Oracle.
DECLARE
-- match string should be just the values to match with spaces in between
p_match_string VARCHAR2(200) := 'abc bcd def xyz';
-- need logic to determine match one (OR) versus match all (AND)
p_match_type VARCHAR2(3) := 'OR';
l_sql_statement VARCHAR2(4000);
-- create type if bulk collect is needed
TYPE t_email_address_tab IS TABLE OF customers.EMAIL_ADDRESS%TYPE INDEX BY PLS_INTEGER;
l_email_address_tab t_email_address_tab;
BEGIN
WITH sql_clauses(row_idx,sql_text) AS
(SELECT 0 row_idx -- build select plus beginning of where clause
,'SELECT email_address '
|| 'FROM customers '
|| 'WHERE 1 = '
|| DECODE(p_match_type, 'AND', '1', '0') sql_text
FROM DUAL
UNION
SELECT LEVEL row_idx -- build joins for each keyword
,DECODE(p_match_type, 'AND', ' AND ', ' OR ')
|| 'email_address'
|| ' LIKE ''%'
|| REGEXP_SUBSTR( p_match_string,'[^ ]+',1,level)
|| '%''' sql_text
FROM DUAL
CONNECT BY LEVEL <= LENGTH(p_match_string) - LENGTH(REPLACE( p_match_string, ' ' )) + 1
)
-- put it all together by row_idx
SELECT LISTAGG(sql_text, '') WITHIN GROUP (ORDER BY row_idx)
INTO l_sql_statement
FROM sql_clauses;
dbms_output.put_line(l_sql_statement);
-- can use execute immediate (or ref cursor) for dynamic sql
EXECUTE IMMEDIATE l_sql_statement
BULK COLLECT
INTO l_email_address_tab;
END;
Variable        Value
--------------- -----
p_match_string  abc bcd def xyz
p_match_type    AND
l_sql_statement SELECT email_address FROM customers WHERE 1 = 1 AND email_address LIKE '%abc%' AND email_address LIKE '%bcd%' AND email_address LIKE '%def%' AND email_address LIKE '%xyz%'

Variable        Value
--------------- -----
p_match_string  abc bcd def xyz
p_match_type    OR
l_sql_statement SELECT email_address FROM customers WHERE 1 = 0 OR email_address LIKE '%abc%' OR email_address LIKE '%bcd%' OR email_address LIKE '%def%' OR email_address LIKE '%xyz%'
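The string assembly done by the PL/SQL block can be mirrored in a few lines to check the generated statements. This is a sketch in Python with a made-up function name, build_statement; like the original, it concatenates the keywords directly into the SQL text, so it is only suitable for inspecting the output, and bind variables would be the safer choice for real user input:

```python
def build_statement(match_string, match_type):
    """Rebuild the statement that the PL/SQL block writes to l_sql_statement.

    Hypothetical helper; match_type is 'AND' or anything else for OR.
    """
    match_all = match_type == "AND"
    joiner = " AND " if match_all else " OR "
    # 1 = 1 is a neutral start for AND, 1 = 0 is a neutral start for OR
    sql = "SELECT email_address FROM customers WHERE 1 = " + ("1" if match_all else "0")
    for keyword in match_string.split():
        sql += joiner + "email_address LIKE '%" + keyword + "%'"
    return sql

print(build_statement("abc bcd def xyz", "AND"))
print(build_statement("abc bcd def xyz", "OR"))
```

The seed condition (1 = 1 or 1 = 0) is what lets every keyword clause be prefixed with the same connector without special-casing the first one.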

no preceding characters in regexp statement

I have attempted to use a negative lookbehind in a regexp statement. I have looked at other solutions online, but they don't seem to work for me, so I am obviously doing something wrong.
I am looking for a match on the first row only; the others should return null. Essentially, I need CT CHEST or CT LUNG.
Any assistance is appreciated. TIA
with test (id, description) as (
select 1, 'CT CHEST HIGH RESOLUTION, NO CONTRAST' from dual union all --want this
select 2, 'INJECTION, THORACIC TRANSFORAMEN EPIDURAL, NON NEUROLYTIC W IMAGE GUIDANCE.' from dual union all --do not want this
select 3, 'The cow came back. But the dog went for a walk' from dual) --do not want this
select id, description, regexp_substr(description, '(?<![a-z]ct).{1,20}(CHEST|THOR|LUNG)',1,1,'i') from test;

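This works for the sample data:
regexp_substr(description,'([^A-Z]|^)[CT].{1,20}(CHEST|THOR|LUNG)',1,1,'i')
Note that [CT] is a character class that matches a single C or T; write CT to require both letters.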

Leverage Oracle's Subexpression Parameter to Check for CT
I would use subexpressions with a pattern like this:
regexp_substr(description, '(^| )((ct ).*((CHEST)|(THOR)|(LUNG)))', 1, 1,'i', 2)
-subexpression 1 looks for the beginning of the line or a space: (^| )
-subexpression 3 looks for 'CT': (ct )
-allow for other characters: .*
-subexpressions 5, 6, 7, wrapped in subexpression 4: ((CHEST)|(THOR)|(LUNG))
-subexpression 2 contains subexpression 3 and subexpression 4
I use the last optional parameter to indicate that I want subexpression 2.
WITH test (id, description) as (
SELECT 1
, 'CT CHEST HIGH RESOLUTION , NO CONTRAST'
FROM dual
UNION ALL --want this
SELECT 2
, 'INJECTION , THORACIC TRANSFORAMEN EPIDURAL , NON NEUROLYTIC W IMAGE GUIDANCE.'
FROM dual
UNION ALL --do not want this
SELECT 3
, 'The cow came back. But the dog went FOR a walk'
FROM dual
) --do not want this
SELECT id
, description
, regexp_substr(description, '(^| )((ct ).*((CHEST)|(THOR)|(LUNG)))', 1, 1,'i', 2)
FROM test;
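To see how the subexpressions are numbered, the same pattern can be run in any engine that exposes groups by index. A small sketch in Python (an assumption for illustration; Oracle's engine may differ in details), using the sample rows from the question:

```python
import re

# Same pattern as above; groups are numbered by their opening parenthesis
pattern = re.compile(r"(^| )((ct ).*((CHEST)|(THOR)|(LUNG)))", re.IGNORECASE)

rows = [
    "CT CHEST HIGH RESOLUTION, NO CONTRAST",
    "INJECTION, THORACIC TRANSFORAMEN EPIDURAL, NON NEUROLYTIC W IMAGE GUIDANCE.",
    "The cow came back. But the dog went for a walk",
]

results = []
for description in rows:
    match = pattern.search(description)
    # group(2) corresponds to the trailing 2 in the regexp_substr call
    results.append(match.group(2) if match else None)

print(results)
```

Only the first row contains "CT " at the start or after a space, so it is the only one that yields a value.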

Regex to remove <> from column

I have a column containing a name and an email ID, like this:
Column A
ABX <ABX#gmail.com>
hfgfg <shantanu #gmail.com>
I want to use a regex in a SQL query to retrieve only the name, excluding the <> and the email ID inside it.
I tried:
SELECT REPLACE('s <abc#gmail.com>', SUBSTR('s <abc#gmail.com>', instr('(', 's <abc#gmail.com>'), LENGTH('s <abc#gmail.com>') - instr(')', reverse('s <abc#gmail.com>')) - instr('(', 's <abc#gmail.com>') + 2), '')
FROM dual;

You could use regular expressions; either remove everything from the first opening angle bracket onwards, optionally trimming any remaining spaces as well:
select rtrim(regexp_substr('s <abc#gmail.com>', '[^<]*'), ' ') as name from dual;
Or replace the angle brackets and whatever is inside them, and any immediately preceding whitespace, with null:
select regexp_replace('s <abc#gmail.com>', '\s?<.*>', null) as name from dual;
With some sample data:
with your_table(column_a) as (
select 'Some Name <some.name#example.com>' from dual
union all select 'SingleName <single#example.com>' from dual
)
select column_a,
rtrim(regexp_substr(column_a, '[^<]*'), ' ') as name1,
regexp_replace(column_a, '\s?<.*>', null) as name2
from your_table;
COLUMN_A                          NAME1           NAME2
--------------------------------- --------------- ---------------
Some Name <some.name#example.com> Some Name       Some Name
SingleName <single#example.com>   SingleName      SingleName
If you want the email address as well you could use:
select regexp_substr('s <abc#gmail.com>', '([^<>]*)', 1, 3) as email from dual;
... though there might be a better way. Demoing that too:
with your_table(column_a) as (
select 'Some Name <some.name#example.com>' from dual
union all select 'SingleName <single#example.com>' from dual
)
select column_a,
rtrim(regexp_substr(column_a, '[^<]*'), ' ') as name1,
regexp_replace(column_a, '\s?<.*>', null) as name2,
regexp_substr(column_a, '([^<>]*)', 1, 3) as email
from your_table;
COLUMN_A                          NAME1      NAME2      EMAIL
--------------------------------- ---------- ---------- ---------------------
Some Name <some.name#example.com> Some Name  Some Name  some.name#example.com
SingleName <single#example.com>   SingleName SingleName single#example.com
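As a possible alternative for the email part, a capturing group between the angle brackets avoids counting occurrences. The sketch below is Python, not Oracle SQL; split_name_email is a hypothetical helper, and the name pattern is the same \s?<.*> used above:

```python
import re

def split_name_email(value):
    """Return (name, email) from a 'Name <email>' string.

    Hypothetical helper; email is None when there are no angle brackets.
    """
    # Drop the bracketed part and one preceding whitespace character, if any
    name = re.sub(r"\s?<.*>", "", value)
    # Capture what sits between the angle brackets
    match = re.search(r"<([^<>]*)>", value)
    return name, (match.group(1) if match else None)

for value in ("Some Name <some.name#example.com>", "SingleName <single#example.com>"):
    print(split_name_email(value))
```

In Oracle the equivalent would presumably be a REGEXP_SUBSTR call that returns subexpression 1 of '<([^<>]*)>', though that variant is not tested here.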

Why don't you try something like this:
UPDATE table SET A=TRIM(SUBSTRING(A, 1, INSTR(A,'<') - 1));

How to use regular expressions properly on a SQL files?

I have a lot of undocumented and uncommented SQL queries. I would like to extract some information from the SQL statements. In particular, I'm interested in DB names, table names and, if possible, column names. The queries usually have the following syntax:
SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'
Usually, the statements involve several DBs and tables. I would like to extract only the DBs and tables, without any other information. I wondered whether it is possible to first extract the information that follows FROM, JOIN and LEFT JOIN. Here it is usually db.table; letters such as o, t and s are aliases for the referenced tables, and I suppose they are difficult to capture. What I tried, without any success, is something like:
gsub(".*FROM \\s*|WHERE|ORDER|GROUP.*", "", vec)
This assumes that each statement ends with WHERE/where or ORDER/order or GROUP... But that doesn't work as expected.

You haven't indicated which database system you are using but virtually all such systems have introspection facilities that would allow you to get this information a lot more easily and reliably than attempting to parse SQL statements. The following code which supposes SQLite can likely be adapted to your situation by getting a list of your databases and then looping over the databases and using dbConnect to connect to each one in turn running code such as this:
library(gsubfn)
library(RSQLite)
con <- dbConnect(SQLite()) # use in memory database for testing
# create two tables for purposes of this test
dbWriteTable(con, "BOD", BOD, row.names = FALSE)
dbWriteTable(con, "iris", iris, row.names = FALSE)
# get all table names and columns
tabinfo <- Map(function(tab) names(fn$dbGetQuery(con, "select * from $tab limit 0")),
dbListTables(con))
dbDisconnect(con)
giving an R list whose names are the table names and whose entries are the column names:
> tabinfo
$BOD
[1] "Time" "demand"
$iris
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
or perhaps long form output is preferred:
setNames(stack(tabinfo), c("column", "table"))
giving:
column table
1 Time BOD
2 demand BOD
3 Sepal.Length iris
4 Sepal.Width iris
5 Petal.Length iris
6 Petal.Width iris
7 Species iris

You could use the stringi package for this.
library(stringi)
# Your string vector
myString <- "SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'"
# Three stringi functions used
# stri_extract_all_regex extracts the strings that have FROM or JOIN followed by some text up to the next whitespace
# stri_replace_all_regex replaces each FROM or JOIN followed by a space with an empty string
# stri_unique keeps only the unique strings
t <- stri_unique(stri_replace_all_regex(stri_extract_all_regex(myString, "((FROM|JOIN) [^\\s]+)", simplify = TRUE),
"(FROM|JOIN) ", ""))
> t
[1] "mydb.table1" "mydb.sometable" "otherdb.sometable"
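The same FROM/JOIN extraction can be sketched in Python for comparison. This is an illustration only, using the sample statement from the question; it assumes, as the answer above does, that the db.table name is the first token after FROM or JOIN, so it will not cope with subqueries or quoted identifiers:

```python
import re

sql = """SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'"""

# Capture the first non-whitespace token after FROM or JOIN
found = re.findall(r"\b(?:FROM|JOIN)\s+(\S+)", sql, flags=re.IGNORECASE)

# dict.fromkeys removes duplicates while keeping first-seen order
tables = list(dict.fromkeys(found))
print(tables)

# Split each qualified name into (database, table)
pairs = [tuple(name.split(".", 1)) for name in tables]
print(pairs)
```

The table aliases (m, o, t, s) are skipped because only the token directly after the keyword is captured.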

REGEXP_SUBSTR Is taking more time for execution in Oracle

I am trying to split a comma-separated string of email IDs into individual email IDs, still comma-separated but each enclosed in single quotes.
My input is 'one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com'
My output should be: 'one#gmail.com','two#gamil.com','three#gmail.com','four#gmail.com'
I am going to use the output string above in the WHERE condition of an Oracle query, like:
Where EmailId's in ( 'one#gmail.com','two#gamil.com','three#gmail.com','four#gmail.com');
I am using the following code to achieve this:
WHERE EMAIL IN
(REGEXP_SUBSTR('one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com' ,'[^,]+', 1, LEVEL))
CONNECT BY LEVEL <= LENGTH('one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com' ) - LENGTH(REPLACE('one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com' , ',', '')) +1;
But the above query takes 60 seconds to return only 16 records. Can anyone suggest the best approach for this?

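Try this; moving the split into a subquery on dual keeps the CONNECT BY from being applied to the rows of the main table: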
WHERE email IN (
select regexp_substr('one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com','[^,]+', 1, level) from dual
connect by regexp_substr('one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com', '[^,]+', 1, level) is not null );
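If the list is assembled on the application side instead, the tokenising that '[^,]+' performs is easy to reproduce. A minimal sketch in Python, assuming the same input string as in the question; for a real query, passing each address as a bind variable would be preferable to pasting the quoted string into SQL text:

```python
emails = "one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com"

# Like '[^,]+', keep only the non-empty pieces between commas
parts = [piece for piece in emails.split(",") if piece]

# Wrap each piece in single quotes and join them back with commas
quoted = ",".join("'" + piece + "'" for piece in parts)
print(quoted)
```

The result has the same shape as the desired output shown in the question.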
