Postgre regexp does not work in perl code using dbi - regex

i have this simple data in postgres table (data type is character varying):
48
2
L
4XL
25.0
25
7.0
i have this sql query with regexp match (i want match only numeric like values like 7.0 or 48):
SELECT * FROM table WHERE ss.sizecode ~ E'^\\s*[\\d\\.]+\\s*$'
this works perfect in command line client psql,
but does not work in perl code:
my $sth = $dbh->prepare(
q(SELECT * FROM table WHERE ss.sizecode ~ E'^\\s*[\\d\\.]+\\s*$')
);
$sth->execute
while ( my #row = $sth->fetchrow_array() ) {
# no data i want
}

String literal
q(SELECT * FROM table WHERE ss.sizecode ~ E'^\\s*[\\d\\.]+\\s*$')
produces the string
SELECT * FROM table WHERE ss.sizecode ~ E'^\s*[\d\.]+\s*$'
To get
SELECT * FROM table WHERE ss.sizecode ~ E'^\\s*[\\d\\.]+\\s*$'
you need
q(SELECT * FROM table WHERE ss.sizecode ~ E'^\\\\s*[\\\\d\\\\.]+\\\\s*$')

Related

Regular Expression: changing matching method from OR to AND

I have a regular expression like the following: (Running on Oracle's regexp_like(), despite the question isn't Oracle-specific)
abc|bcd|def|xyz
This basically matches a tags field on database to see if tags field contains abc OR bcd OR def OR xyz when user has input for the search query "abc bcd def xyz".
The tags field on the database holds keywords separated by spaces, e.g. "cdefg abcd xyz"
On Oracle, this would be something like:
select ... from ... where
regexp_like(tags, 'abc|bcd|def|xyz');
It works fine as it is, but I want to add an extra option for users to search for results that match all keywords. How should I change the regular expression so that it matches abc AND bcd AND def AND xyz ?
Note: Because I won't know what exact keywords the user will enter, I can't pre-structure the query in the PL/SQL like this:
select ... from ... where
tags like '%abc%' AND
tags like '%bcd%' AND
tags like '%def%' AND
tags like '%xyz%';
You can split the input pattern and check that all the parts of the pattern match:
SELECT t.*
FROM table_name t
CROSS APPLY(
WITH input (match) AS (
SELECT 'abc bcd def xyz' FROM DUAL
)
SELECT 1
FROM input
CONNECT BY LEVEL <= REGEXP_COUNT(match, '\S+')
HAVING COUNT(
REGEXP_SUBSTR(
t.tags,
REGEXP_SUBSTR(match, '\S+', 1, LEVEL)
)
) = REGEXP_COUNT(match, '\S+')
)
Or, if you have Java enabled in the database then you can create a Java function to match regular expressions:
CREATE AND COMPILE JAVA SOURCE NAMED RegexParser AS
import java.util.regex.Pattern;
public class RegexpMatch {
public static int match(
final String value,
final String regex
){
final Pattern pattern = Pattern.compile(regex);
return pattern.matcher(value).matches() ? 1 : 0;
}
}
/
Then wrap it in an SQL function:
CREATE FUNCTION regexp_java_match(value IN VARCHAR2, regex IN VARCHAR2) RETURN NUMBER
AS LANGUAGE JAVA NAME 'RegexpMatch.match( java.lang.String, java.lang.String ) return int';
/
Then use it in SQL:
SELECT *
FROM table_name
WHERE regexp_java_match(tags, '(?=.*abc)(?=.*bcd)(?=.*def)(?=.*xyz)') = 1;
Try this, the idea being counting that the number of matches is == to the number of patterns:
with data(val) AS (
select 'cdefg abcd xyz' from dual union all
select 'cba lmnop xyz' from dual
),
targets(s) as (
select regexp_substr('abc bcd def xyz', '[^ ]+', 1, LEVEL) from dual
connect by regexp_substr('abc bcd def xyz', '[^ ]+', 1, LEVEL) is not null
)
select val from data d
join targets t on
regexp_like(val,s)
group by val having(count(*) = (select count(*) from targets))
;
Result:
cdefg abcd xyz
I think dynamic SQL will be needed for this. The match all option will require individual matching with logic to ensure every individual match is found.
An easy way would be to build a join condition for each keyword. Concatenate the join statements in a string. Use dynamic SQL to execute the string as a query.
The example below uses the customer table from the sample schemas provided by Oracle.
DECLARE
-- match string should be just the values to match with spaces in between
p_match_string VARCHAR2(200) := 'abc bcd def xyz';
-- need logic to determine match one (OR) versus match all (AND)
p_match_type VARCHAR2(3) := 'OR';
l_sql_statement VARCHAR2(4000);
-- create type if bulk collect is needed
TYPE t_email_address_tab IS TABLE OF customers.EMAIL_ADDRESS%TYPE INDEX BY PLS_INTEGER;
l_email_address_tab t_email_address_tab;
BEGIN
WITH sql_clauses(row_idx,sql_text) AS
(SELECT 0 row_idx -- build select plus beginning of where clause
,'SELECT email_address '
|| 'FROM customers '
|| 'WHERE 1 = '
|| DECODE(p_match_type, 'AND', '1', '0') sql_text
FROM DUAL
UNION
SELECT LEVEL row_idx -- build joins for each keyword
,DECODE(p_match_type, 'AND', ' AND ', ' OR ')
|| 'email_address'
|| ' LIKE ''%'
|| REGEXP_SUBSTR( p_match_string,'[^ ]+',1,level)
|| '%''' sql_text
FROM DUAL
CONNECT BY LEVEL <= LENGTH(p_match_string) - LENGTH(REPLACE( p_match_string, ' ' )) + 1
)
-- put it all together by row_idx
SELECT LISTAGG(sql_text, '') WITHIN GROUP (ORDER BY row_idx)
INTO l_sql_statement
FROM sql_clauses;
dbms_output.put_line(l_sql_statement);
-- can use execute immediate (or ref cursor) for dynamic sql
EXECUTE IMMEDIATE l_sql_statement
BULK COLLECT
INTO l_email_address_tab;
END;
Variable
Value
p_match_string
abc bcd def xyz
p_match_type
AND
l_sql_statement
SELECT email_address FROM customers WHERE 1 = 1 AND email_address LIKE '%abc%' AND email_address LIKE '%bcd%' AND email_address LIKE '%def%' AND email_address LIKE '%xyz%'
Variable
Value
p_match_string
abc bcd def xyz
p_match_type
OR
l_sql_statement
SELECT email_address FROM customers WHERE 1 = 0 OR email_address LIKE '%abc%' OR email_address LIKE '%bcd%' OR email_address LIKE '%def%' OR email_address LIKE '%xyz%'

Extract domain from url using PostgreSQL

I need to extract the domain name for a list of urls using PostgreSQL. In the first version, I tried using REGEXP_REPLACE to replace unwanted characters like www., biz., sports., etc. to get the domain name.
SELECT REGEXP_REPLACE(url, ^((www|www2|www3|static1|biz|health|travel|property|edu|world|newmedia|digital|ent|staging|cpelection|dev|m-staging|m|maa|cdnnews|testing|cdnpuc|shipping|sports|life|static01|cdn|dev1|ad|backends|avm|displayvideo|tand|static03|subscriptionv3|mdev|beta)\.)?', '') AS "Domain",
COUNT(DISTINCT(user)) AS "Unique Users"
FROM db
GROUP BY 1
ORDER BY 2 DESC;
This seems unfavorable as the query needs to be constantly updated for list of unwanted words.
I did try https://stackoverflow.com/a/21174423/10174021 to extract from the end of the line using PostgreSQL REGEXP_SUBSTR but, I'm getting blank rows in return. Is there a more better way of doing this?
A dataset sample to try with:
CREATE TABLE sample (
url VARCHAR(100) NOT NULL);
INSERT INTO sample url)
VALUES
("sample.co.uk"),
("www.sample.co.uk"),
("www3.sample.co.uk"),
("biz.sample.co.uk"),
("digital.testing.sam.co"),
("sam.co"),
("m.sam.co");
Desired output
+------------------------+--------------+
| url | domain |
+------------------------+--------------+
| sample.co.uk | sample.co.uk |
| www.sample.co.uk | sample.co.uk |
| www3.sample.co.uk | sample.co.uk |
| biz.sample.co.uk | sample.co.uk |
| digital.testing.sam.co | sam.co |
| sam.co | sam.co |
| m.sam.co | sam.co |
+------------------------+--------------+
So, I've found the solution using Jeremy and Rémy Baron's answer.
Extract all the public suffix from public suffix and store into
a table which I labelled as tlds.
Get the unique urls in the dataset and match to its TLD.
Extract the domain name using regexp_replace (used in this query) or alternative regexp_substr(t1.url, '([a-z]+)(.)'||t1."tld"). The final output:
The SQL query is as below:
WITH stored_tld AS(
SELECT
DISTINCT(s.url),
FIRST_VALUE(t.domain) over (PARTITION BY s.url ORDER BY length(t.domain) DESC
rows between unbounded preceding and unbounded following) AS "tld"
FROM sample s
JOIN tlds t
ON (s.url like '%%'||domain))
SELECT
t1.url,
CASE WHEN t1."tld" IS NULL THEN t1.url ELSE regexp_replace(t1.url,'(.*\.)((.[a-z]*).*'||replace(t1."tld",'.','\.')||')','\2')
END AS "extracted_domain"
FROM(
SELECT a.url,st."tld"
FROM sample a
LEFT JOIN stored_tld st
ON a.url = st.url
)t1
Links to try: SQL Tester
You can try this :
with tlds as (
select * from (values('.co.uk'),('.co'),('.uk')) a(tld)
) ,
sample as (
select * from (values ('sample.co.uk'),
('www.sample.co.uk'),
('www3.sample.co.uk'),
('biz.sample.co.uk'),
('digital.testing.sam.co'),
('sam.co'),
('m.sam.co')
) a(url)
)
select url,regexp_replace(url,'(.*\.)(.*'||replace(tld,'.','\.')||')','\2') "domain" from (
select distinct url,first_value(tld) over (PARTITION BY url order by length(tld) DESC) tld
from sample join tlds on (url like '%'||tld)
) a
I use split_part(url,'/',3) for this :
select split_part('https://stackoverflow.com/questions/56019744', '/', 3) ;
output
stackoverflow.com

Regex to match starts with

I need to update a table setting attribute MATCH to True where the attribute_a STARTS with the Value of attribute_b.
Somehow I can't get the correct syntax in Postgresql to do this pattern match.
UPDATE table
SET match= True
WHERE attribute_a ~ '^attribute_b' ;
eg MATCH TRUE: attribute_a = Nelson Mandela ; attribute_b = 'Nelson'
You do not need pattern matching, use left(), e.g.:
with my_table(attribute_a, attribute_b) as (
values
('Nelson Mandela', 'Nelson'),
('Donald Trump', 'Donald Duck'),
('John Major', 'John M')
)
select *
from my_table
where attribute_b = left(attribute_a, length(attribute_b));
attribute_a | attribute_b
----------------+-------------
Nelson Mandela | Nelson
John Major | John M
(2 rows)
If you absolutely want to use regex, you have to build the pattern with concat() or format(), like this:
select *
from my_table
where attribute_a ~ concat('^', attribute_b)
-- where attribute_a ~ format('^%s', attribute_b)

SQLite extract string from text in column

I have a Spatialite Database and I've imported OSM Data into this database.
With the following query I get all motorways:
SELECT * FROM lines
WHERE other_tags GLOB '*A [0-9]*'
AND highway='motorway'
I use GLOB '*A [0-9]*' here, because in Germany every Autobahn begins with A, followed by a number (like A 73).
There is a column called other_tags with information about the motorway part:
"bdouble"=>"yes","hazmat"=>"designated","lanes"=>"2","maxspeed"=>"none","oneway"=>"yes","ref"=>"A 73","width"=>"7"
If you look closer there is the part "ref"=>"A 73".
I want to extract the A 73 as the name for the motorway.
How can I do this in sqlite?
If the format doesn't change, that means that you can expect that the other_tags field is something like %"ref"=>"A 73","width"=>"7"%, then you can use instr and substr (note that 8 is the length of "ref"=>"):
SELECT substr(other_tags,
instr(other_tags, '"ref"=>"') + 8,
instr(other_tags, '","width"') - 8 - instr(other_tags, '"ref"=>"')) name
FROM lines
WHERE other_tags GLOB '*A [0-9]*'
AND highway='motorway'
The result will be
name
A 73
Check with following condition..
other_tags like A% -- Begin With 'A'.
abs(substr(other_tags, 3,2)) <> 0.0 -- Substring from 3rd character, two character is number.
length(other_tags) = 4 -- length of other_tags is 4
So here is how your query should be:
SELECT *
FROM lines
WHERE other_tags LIKE 'A%'
AND abs(substr(other_tags, 3,2)) <> 0.0
AND length(other_tags) = 4
AND highway = 'motorway'

python replace string function throws asterix wildcard error

When i use * i receive the error
raise error, v # invalid expression
error: nothing to repeat
other wildcard characters such as ^ work fine.
the line of code:
df.columns = df.columns.str.replace('*agriculture', 'agri')
am using pandas and python
edit:
when I try using / to escape, the wildcard does not work as i intend
In[44]df = pd.DataFrame(columns=['agriculture', 'dfad agriculture df'])
In[45]df
Out[45]:
Empty DataFrame
Columns: [agriculture, dfad agriculture df]
Index: []
in[46]df.columns.str.replace('/*agriculture*','agri')
Out[46]: Index([u'agri', u'dfad agri df'], dtype='object')
I thought the wildcard should output Index([u'agri', u'agri'], dtype='object)
edit:
I am currently using hierarchical columns and would like to only replace agri for that specific level (level = 2).
original:
df.columns[0] = ('grand total', '2005', 'agriculture')
df.columns[1] = ('grand total', '2005', 'other')
desired:
df.columns[0] = ('grand total', '2005', 'agri')
df.columns[1] = ('grand total', '2005', 'other')
I'm looking at this link right now: Changing columns names in Pandas with hierarchical columns
and that author says it will get easier at 0.15.0 so I am hoping there are more recent updated solutions
You need to the asterisk * at the end in order to match the string 0 or more times, see the docs:
In [287]:
df = pd.DataFrame(columns=['agriculture'])
df
Out[287]:
Empty DataFrame
Columns: [agriculture]
Index: []
In [289]:
df.columns.str.replace('agriculture*', 'agri')
Out[289]:
Index(['agri'], dtype='object')
EDIT
Based on your new and actual requirements, you can use str.contains to find matches and then use this to build a dict to map the old against new names and then call rename:
In [307]:
matching_cols = df.columns[df.columns.str.contains('agriculture')]
df.rename(columns = dict(zip(matching_cols, ['agri'] * len(matching_cols))))
Out[307]:
Empty DataFrame
Columns: [agri, agri]
Index: []