Regular Expression: changing matching method from OR to AND - regex

I have a regular expression like the following (running on Oracle's regexp_like(), although the question isn't Oracle-specific):
abc|bcd|def|xyz
This matches the tags field in the database to see whether it contains abc OR bcd OR def OR xyz when the user has entered the search query "abc bcd def xyz".
The tags field on the database holds keywords separated by spaces, e.g. "cdefg abcd xyz"
On Oracle, this would be something like:
select ... from ... where
regexp_like(tags, 'abc|bcd|def|xyz');
It works fine as it is, but I want to give users an extra option to search for results that match all keywords. How should I change the regular expression so that it matches abc AND bcd AND def AND xyz?
Note: Because I won't know what exact keywords the user will enter, I can't pre-structure the query in the PL/SQL like this:
select ... from ... where
tags like '%abc%' AND
tags like '%bcd%' AND
tags like '%def%' AND
tags like '%xyz%';

You can split the input pattern and check that all the parts of the pattern match:
SELECT t.*
FROM table_name t
CROSS APPLY(
WITH input (match) AS (
SELECT 'abc bcd def xyz' FROM DUAL
)
SELECT 1
FROM input
CONNECT BY LEVEL <= REGEXP_COUNT(match, '\S+')
HAVING COUNT(
REGEXP_SUBSTR(
t.tags,
REGEXP_SUBSTR(match, '\S+', 1, LEVEL)
)
) = REGEXP_COUNT(match, '\S+')
)
Or, if you have Java enabled in the database then you can create a Java function to match regular expressions:
CREATE AND COMPILE JAVA SOURCE NAMED RegexParser AS
import java.util.regex.Pattern;
public class RegexpMatch {
public static int match(
final String value,
final String regex
){
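final Pattern pattern = Pattern.compile(regex);
// find() rather than matches(): a pattern made only of lookaheads consumes no characters,
// so matches(), which must consume the whole input, would fail on any non-empty value
return pattern.matcher(value).find() ? 1 : 0;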
}
}
/
Then wrap it in an SQL function:
CREATE FUNCTION regexp_java_match(value IN VARCHAR2, regex IN VARCHAR2) RETURN NUMBER
AS LANGUAGE JAVA NAME 'RegexpMatch.match( java.lang.String, java.lang.String ) return int';
/
Then use it in SQL:
SELECT *
FROM table_name
WHERE regexp_java_match(tags, '(?=.*abc)(?=.*bcd)(?=.*def)(?=.*xyz)') = 1;
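The lookahead pattern above has to be built from whatever keywords the user types. A minimal sketch of that step, written in Python purely for illustration (the function names are hypothetical, and Python's re module stands in for the Java regex engine):

```python
import re

def build_all_keywords_pattern(query: str) -> str:
    """Turn 'abc bcd' into '(?=.*abc)(?=.*bcd)', escaping each keyword."""
    return "".join(f"(?=.*{re.escape(word)})" for word in query.split())

def contains_all(tags: str, query: str) -> bool:
    # re.match anchors at position 0, where every lookahead is tested in turn
    return re.match(build_all_keywords_pattern(query), tags) is not None

pattern = build_all_keywords_pattern("abc bcd def xyz")
print(pattern)
print(contains_all("cdefg abcd xyz", "abc bcd def xyz"))
print(contains_all("cba lmnop xyz", "abc bcd def xyz"))
```

Escaping each keyword keeps characters such as . or * in user input from being interpreted as regex syntax.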

Try this. The idea is to check that the number of patterns that match equals the total number of patterns:
with data(val) AS (
select 'cdefg abcd xyz' from dual union all
select 'cba lmnop xyz' from dual
),
targets(s) as (
select regexp_substr('abc bcd def xyz', '[^ ]+', 1, LEVEL) from dual
connect by regexp_substr('abc bcd def xyz', '[^ ]+', 1, LEVEL) is not null
)
select val from data d
join targets t on
regexp_like(val,s)
group by val having(count(*) = (select count(*) from targets))
;
Result:
cdefg abcd xyz
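The counting idea can be sketched outside the database as well. The following Python snippet is only an illustration of the logic, not a replacement for the query; unlike the SQL above, it escapes each keyword, which is an assumption about how user input should be treated:

```python
import re

def matches_all(tags: str, query: str) -> bool:
    keywords = query.split()
    # count how many keywords are found somewhere in the tags string
    hits = sum(1 for k in keywords if re.search(re.escape(k), tags))
    # "match all" holds when every keyword produced a hit
    return hits == len(keywords)

rows = ["cdefg abcd xyz", "cba lmnop xyz"]
result = [r for r in rows if matches_all(r, "abc bcd def xyz")]
print(result)
```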

I think dynamic SQL is needed for this. The match-all option requires matching each keyword individually, with logic to ensure that every individual match is found.
An easy way is to build one condition per keyword, concatenate the conditions into a string, and use dynamic SQL to execute the string as a query.
The example below uses the customers table from the sample schemas provided by Oracle.
DECLARE
-- match string should be just the values to match with spaces in between
p_match_string VARCHAR2(200) := 'abc bcd def xyz';
-- need logic to determine match one (OR) versus match all (AND)
p_match_type VARCHAR2(3) := 'OR';
l_sql_statement VARCHAR2(4000);
-- create type if bulk collect is needed
TYPE t_email_address_tab IS TABLE OF customers.EMAIL_ADDRESS%TYPE INDEX BY PLS_INTEGER;
l_email_address_tab t_email_address_tab;
BEGIN
WITH sql_clauses(row_idx,sql_text) AS
(SELECT 0 row_idx -- build select plus beginning of where clause
,'SELECT email_address '
|| 'FROM customers '
|| 'WHERE 1 = '
|| DECODE(p_match_type, 'AND', '1', '0') sql_text
FROM DUAL
UNION
SELECT LEVEL row_idx -- build joins for each keyword
,DECODE(p_match_type, 'AND', ' AND ', ' OR ')
|| 'email_address'
|| ' LIKE ''%'
|| REGEXP_SUBSTR( p_match_string,'[^ ]+',1,level)
|| '%''' sql_text
FROM DUAL
CONNECT BY LEVEL <= LENGTH(p_match_string) - LENGTH(REPLACE( p_match_string, ' ' )) + 1
)
-- put it all together by row_idx
SELECT LISTAGG(sql_text, '') WITHIN GROUP (ORDER BY row_idx)
INTO l_sql_statement
FROM sql_clauses;
dbms_output.put_line(l_sql_statement);
-- can use execute immediate (or ref cursor) for dynamic sql
EXECUTE IMMEDIATE l_sql_statement
BULK COLLECT
INTO l_email_address_tab;
END;
Variable         Value
---------------  -----
p_match_string   abc bcd def xyz
p_match_type     AND
l_sql_statement  SELECT email_address FROM customers WHERE 1 = 1 AND email_address LIKE '%abc%' AND email_address LIKE '%bcd%' AND email_address LIKE '%def%' AND email_address LIKE '%xyz%'

Variable         Value
---------------  -----
p_match_string   abc bcd def xyz
p_match_type     OR
l_sql_statement  SELECT email_address FROM customers WHERE 1 = 0 OR email_address LIKE '%abc%' OR email_address LIKE '%bcd%' OR email_address LIKE '%def%' OR email_address LIKE '%xyz%'
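Because the keywords come from user input, concatenating them straight into the statement invites SQL injection. A hedged sketch of the same build-the-WHERE-clause idea with bind parameters follows; it uses Python's sqlite3 module as a stand-in for Oracle, and the items table, tags column and build_query function are hypothetical names chosen for the example:

```python
import sqlite3

def build_query(match_string: str, match_type: str):
    """Return (sql, params) with one LIKE condition per keyword."""
    keywords = match_string.split()
    if not keywords:
        raise ValueError("at least one keyword is required")
    joiner = " AND " if match_type == "AND" else " OR "
    where = joiner.join("tags LIKE ?" for _ in keywords)
    # keywords travel as bind values, never as part of the SQL text
    params = [f"%{k}%" for k in keywords]
    return f"SELECT tags FROM items WHERE {where}", params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (tags TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?)",
    [("cdefg abcd xyz",), ("cba lmnop xyz",)],
)

sql_and, params_and = build_query("abc bcd def xyz", "AND")
rows_and = [r[0] for r in conn.execute(sql_and, params_and)]

sql_or, params_or = build_query("abc bcd def xyz", "OR")
rows_or = sorted(r[0] for r in conn.execute(sql_or, params_or))

print(sql_and)
print(rows_and)
print(rows_or)
```

In PL/SQL the analogous approach would bind the values rather than embed them, but the exact mechanism depends on how the statement is executed.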


duckdb - aggregate string with a given separator

The standard aggregator makes a comma-separated list:
$ SELECT list_string_agg([1, 2, 'sdsd'])
'1,2,sdsd'
How can I make a semicolon-separated or '/'-separated list, like '1;2;sdsd' or '1/2/sdsd'?
I believe the string_agg function is what you want; it also supports DISTINCT.
# Python example
import duckdb as dd
CURR_QUERY = \
'''
SELECT string_agg(distinct a.c, ' || ') AS str_con
FROM (SELECT 'string 1' AS c
UNION ALL
SELECT 'string 2' AS c
UNION ALL
SELECT 'string 1' AS c) AS a
'''
print(dd.query(CURR_QUERY))
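The above returns a single value such as "string 1 || string 2" (the order of the DISTINCT values is not guaranteed). Pass ';' or '/' as the second argument to get a different separator.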

no preceding characters in regexp statement

I have tried to use a negative lookbehind in a regexp statement and have looked at other solutions online, but they don't seem to work for me, so I am obviously doing something wrong.
I want a match returned for the first row; the others should return null. Essentially I need CT CHEST or CT LUNG.
Any assistance appreciated, TIA.
with test (id, description) as (
select 1, 'CT CHEST HIGH RESOLUTION, NO CONTRAST' from dual union all --want this
select 2, 'INJECTION, THORACIC TRANSFORAMEN EPIDURAL, NON NEUROLYTIC W IMAGE GUIDANCE.' from dual union all --do not want this
select 3, 'The cow came back. But the dog went for a walk' from dual) --do not want this
select id, description, regexp_substr(description, '(?<![a-z]ct).{1,20}(CHEST|THOR|LUNG)',1,1,'i') from test;
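Oracle's regular expressions do not support lookbehind, so match the character before CT explicitly (either a non-letter or the start of the string):
regexp_substr(description,'([^A-Z]|^)CT.{1,20}(CHEST|THOR|LUNG)',1,1,'i')
This works for the sample rows. Note that CT must be written as a literal; [CT] would be a character class matching a single C or T.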
Leverage Oracle Subexpression Parameter to Check for CT
I would use subexpressions, with a pattern like this:
regexp_substr(description, '(^| )((ct ).*((CHEST)|(THOR)|(LUNG)))', 1, 1, 'i', 2)
-subexpression 1 looks for the beginning of the line or a space: (^| )
-subexpression 3 looks for 'CT': (ct )
-allow for other characters: .*
-subexpressions 5, 6 and 7: (CHEST)|(THOR)|(LUNG)
-subexpression 2 contains subexpression 3 and subexpression 4
The last, optional parameter specifies that I want subexpression 2 returned.
WITH test (id, description) as (
SELECT 1
, 'CT CHEST HIGH RESOLUTION, NO CONTRAST'
FROM dual --want this
UNION ALL
SELECT 2
, 'INJECTION, THORACIC TRANSFORAMEN EPIDURAL, NON NEUROLYTIC W IMAGE GUIDANCE.'
FROM dual --do not want this
UNION ALL
SELECT 3
, 'The cow came back. But the dog went for a walk'
FROM dual --do not want this
)
SELECT id
, description
, regexp_substr(description, '(^| )((ct ).*((CHEST)|(THOR)|(LUNG)))', 1, 1,'i', 2)
FROM test;
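For comparison, regex engines that do support lookbehind can express the original intent directly. The sketch below uses Python's re module only to illustrate what the negative lookbehind was meant to do; it will not run in Oracle, and the function name is hypothetical:

```python
import re

# (?<![a-z]) rejects a letter directly before "ct"; IGNORECASE makes it cover upper case too.
# \b requires "ct" to end at a word boundary, so the "CT" inside "INJECTION" is skipped.
CT_PATTERN = re.compile(r"(?<![a-z])ct\b.{1,20}?(?:chest|thor|lung)", re.IGNORECASE)

def find_ct_region(description: str):
    """Return the matched text, or None when there is no CT followed by a target word."""
    m = CT_PATTERN.search(description)
    return m.group(0) if m else None

samples = [
    "CT CHEST HIGH RESOLUTION, NO CONTRAST",
    "INJECTION, THORACIC TRANSFORAMEN EPIDURAL, NON NEUROLYTIC W IMAGE GUIDANCE.",
    "The cow came back. But the dog went for a walk",
]
results = [find_ct_region(s) for s in samples]
print(results)
```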

Remove repeated substring in column and only return words in between

I have the following dataframe:
Column1 Column2
0 .com<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> .comFinance
1 .com<br><br>Finance<br><br><br><br><br>DO<br><br><br><br><br><br><br> .comFinanceDO
2 <br><br>Finance<br><br><br>ISV<br><br>DO<br>DO Prem<br><br><br><br><br><br> FinanceISVDODO Prem
3 <br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> Finance
4 <br><br>Finance<br><br><br>TTY<br><br><br><br><br><br><br><br><br> ConsultingTTY
I used the following line of code to get Column2:
df['Column2'] = df['Column1'].str.replace('<br>', '', regex=True)
I want to remove all instances of "<br>" and separate the remaining values with commas, so the column should look like this:
Column2
.com, Finance
.com, Finance, DO
Finance, ISV, DO, DO Prem
Finance
Consulting, TTY
Given the following dataframe:
Column1
.com<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br>
.com<br><br>Finance<br><br><br><br><br>DO<br><br><br><br><br><br><br>
<br><br>Finance<br><br><br>ISV<br><br>DO<br>DO Prem<br><br><br><br><br><br>
<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br>
<br><br>Finance<br><br><br>TTY<br><br><br><br><br><br><br><br><br>
df['Column2'] = df['Column1'].str.replace('<br>', ' ', regex=True).str.strip().replace('\\s+', ', ', regex=True) doesn't work because of sections like <br>DO Prem<br>, which end up as DO, Prem instead of DO Prem.
Split on <br> to make a list, then use a list comprehension to remove the empty strings ('').
This preserves spaces where they're supposed to be.
Join the list values back into a string with (', ').join([...])
import pandas as pd
df['Column2'] = df['Column1'].str.split('<br>').apply(lambda x: (', ').join([y for y in x if y != '']))
# output
Column1 Column2
.com<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> .com, Finance
.com<br><br>Finance<br><br><br><br><br>DO<br><br><br><br><br><br><br> .com, Finance, DO
<br><br>Finance<br><br><br>ISV<br><br>DO<br>DO Prem<br><br><br><br><br><br> Finance, ISV, DO, DO Prem
<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> Finance
<br><br>Finance<br><br><br>TTY<br><br><br><br><br><br><br><br><br> Finance, TTY
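The split, filter and join steps can be checked on a single string without pandas. This is a small sketch of the same logic; clean_br is a hypothetical helper name:

```python
def clean_br(value: str, sep: str = ", ") -> str:
    # split on the tag, drop the empty strings left by consecutive tags, then rejoin
    parts = [part for part in value.split("<br>") if part != ""]
    return sep.join(parts)

row = "<br><br>Finance<br><br><br>ISV<br><br>DO<br>DO Prem<br><br><br><br><br><br>"
cleaned = clean_br(row)
print(cleaned)
```

A function like this could also be passed to df['Column1'].apply(...) in place of the lambda.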
### Replace <br> with a space
df['Column2'] = df['Column1'].str.replace('<br>', ' ')
### Remove spaces before and after the string
df['Column2'] = df['Column2'].str.strip()
### Replace each run of whitespace with a comma
df['Column2'] = df['Column2'].str.replace('\\s+', ',', regex=True)
As TrentonMcKinney pointed out, his solution is better: this one breaks when a value in Column1 itself contains a space (e.g. DO Prem).

Read multiline log with regular expression using python

I want to select executed query from log file. Specifically an example would look something like this:
2019-01-10 10:33:21 +07 dvdrentalLOG: statement: SELECT last_update
From public.actor
2019-03-06 14:07:06 +07 dvdrentalLOG: statement: SELECT film_id, title
FROM public.film
WHERE film_id = 1
I want to extract the queries in a loop. Desired output:
query1 : SELECT last_update From public.actor
query2 : SELECT film_id, title FROM public.film WHERE film_id = 1
This is what I have tried:
import re
def parseFile(filepath):
    line=[]
    with open(filepath,'r') as log:
        regex = re.compile(r'(\d{4}-\d{2}-\d{2})(.*)',re.MULTILINE|re.DOTALL)
        for line in log:
            date = regex.findall(line)
            if date == []:
                print()
            else:
                print(date)
filepath = 'text.txt'
parseFile(filepath)
output:
[('2019-01-10', ' 10:33:21 +07 dvdrentalLOG: statement: SELECT last_update \n')]
[('2019-03-06', ' 14:07:06 +07 dvdrentalLOG: statement: SELECT film_id, title\n')]
The output doesn't contain the complete queries. What should I do?
You need to read the whole file before parsing it. If you read it line by line, as in your code, the regex only ever sees one line at a time and can never match a SQL query that spans several lines. You can adapt your code like this:
import re
def parseFile(filepath):
    with open(filepath,'r') as log:
        regex = re.compile(r'(\d{4}-\d{2}-\d{2})(.*?)(?=\d{4}-\d{2}-\d{2}|$)',re.MULTILINE|re.DOTALL)
        # collapse newlines and runs of whitespace so that each query sits on one line
        lines = re.sub(r'\n|\s{2,}',' ',log.read())
        date = regex.findall(lines)
        if date == []:
            print()
        else:
            print(date)
filepath = 'query.log'
parseFile(filepath)
output:
[('2019-01-10', ' 10:33:21 +07 dvdrentalLOG: statement: SELECT last_update From public.actor '), ('2019-03-06', ' 14:07:06 +07 dvdrentalLOG: statement: SELECT film_id, title FROM public.film WHERE film_id = 1 ')]
Where the regex (using positive lookahead to limit the number of characters matched by .*?) used is detailed here: https://regex101.com/r/nE0omm/1/
(\d{4}-\d{2}-\d{2})(.*?)(?=\d{4}-\d{2}-\d{2}|$)
You're only processing a single line at a time (via the for line in log: loop), so your regex only applies to a single line at a time. It couldn't match across lines because you're not giving it multiple lines at a time to match across.
You could instead read the entire file via log.read() and then call .findall on that.
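To get from the raw matches to the "query1 : ..." format asked for in the question, the statement text can be captured directly. The following is a minimal sketch under the assumption that every log entry starts a line with a date and contains the literal text "statement: "; the sample log is embedded as a string so the snippet runs on its own, and extract_queries is a hypothetical name:

```python
import re

SAMPLE_LOG = """2019-01-10 10:33:21 +07 dvdrentalLOG: statement: SELECT last_update
From public.actor
2019-03-06 14:07:06 +07 dvdrentalLOG: statement: SELECT film_id, title
FROM public.film
WHERE film_id = 1
"""

# An entry starts with a date at the beginning of a line; the query runs until
# the next line that starts with a date, or until the end of the text.
ENTRY = re.compile(
    r"^\d{4}-\d{2}-\d{2} .*?statement: (.*?)(?=^\d{4}-\d{2}-\d{2} |\Z)",
    re.MULTILINE | re.DOTALL,
)

def extract_queries(text):
    # " ".join(q.split()) collapses newlines and repeated spaces into single spaces
    return [" ".join(q.split()) for q in ENTRY.findall(text)]

queries = extract_queries(SAMPLE_LOG)
for i, query in enumerate(queries, start=1):
    print(f"query{i} : {query}")
```

With a real file, the text would come from log.read() as described above.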

Regex to remove <> from column

I have a column with name and email id like
Column A
ABX <ABX#gmail.com>
hfgfg <shantanu #gmail.com>
I want to use a regex in a SQL query to retrieve only the name, excluding the <> and the email id, from the above column.
I tried:
SELECT REPLACE('s <abc#gmail.com>', SUBSTR('s <abc#gmail.com>', instr('(', 's <abc#gmail.com>'), LENGTH('s <abc#gmail.com>') - instr(')', reverse('s <abc#gmail.com>')) - instr('(', 's <abc#gmail.com>') + 2), '')
FROM dual;
You could use regular expressions; either remove everything from first opening angle bracket, optionally trimming any remaining spaces as well:
select rtrim(regexp_substr('s <abc#gmail.com>', '[^<]*'), ' ') as name from dual;
Or replace the angle brackets and whatever is inside them, and any immediately preceding whitespace, with null:
select regexp_replace('s <abc#gmail.com>', '\s?<.*>', null) as name from dual;
With some sample data:
with your_table(column_a) as (
select 'Some Name <some.name#example.com>' from dual
union all select 'SingleName <single#example.com>' from dual
)
select column_a,
rtrim(regexp_substr(column_a, '[^<]*'), ' ') as name1,
regexp_replace(column_a, '\s?<.*>', null) as name2
from your_table;
COLUMN_A                          NAME1           NAME2
--------------------------------- --------------- ---------------
Some Name <some.name#example.com> Some Name       Some Name
SingleName <single#example.com>   SingleName      SingleName
If you want the email address as well you could use:
select regexp_substr('s <abc#gmail.com>', '([^<>]*)', 1, 3) as email from dual;
... though there might be a better way. Demoing that too:
with your_table(column_a) as (
select 'Some Name <some.name#example.com>' from dual
union all select 'SingleName <single#example.com>' from dual
)
select column_a,
rtrim(regexp_substr(column_a, '[^<]*'), ' ') as name1,
regexp_replace(column_a, '\s?<.*>', null) as name2,
regexp_substr(column_a, '([^<>]*)', 1, 3) as email
from your_table;
COLUMN_A                          NAME1      NAME2      EMAIL
--------------------------------- ---------- ---------- ---------------------
Some Name <some.name#example.com> Some Name  Some Name  some.name#example.com
SingleName <single#example.com>   SingleName SingleName single#example.com
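The same split into name and address can be tried quickly outside the database. This Python sketch only illustrates the pattern logic, with one regex capturing both parts; split_name_email is a hypothetical helper, and values without angle brackets are returned unchanged as the name:

```python
import re

def split_name_email(value: str):
    """Return (name, email); email is None when no <...> part is present."""
    # name: everything before the first '<', minus surrounding spaces
    # email: whatever sits between '<' and the next '>'
    m = re.match(r"\s*([^<]*?)\s*<([^>]*)>", value)
    if m is None:
        return value.strip(), None
    return m.group(1), m.group(2)

pairs = [
    split_name_email("Some Name <some.name#example.com>"),
    split_name_email("SingleName <single#example.com>"),
    split_name_email("no brackets here"),
]
print(pairs)
```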
Why not try something like this (note the - 1, so that the '<' itself is not included):
UPDATE table SET A=TRIM(SUBSTRING(A, 1, INSTR(A,'<') - 1));