Regex to match starts with

I need to update a table, setting the attribute MATCH to True where attribute_a STARTS with the value of attribute_b.
Somehow I can't get the correct PostgreSQL syntax for this pattern match.
UPDATE table
SET match= True
WHERE attribute_a ~ '^attribute_b' ;
e.g. MATCH is TRUE when: attribute_a = 'Nelson Mandela'; attribute_b = 'Nelson'

You do not need pattern matching; use left(), e.g.:
with my_table(attribute_a, attribute_b) as (
    values
        ('Nelson Mandela', 'Nelson'),
        ('Donald Trump', 'Donald Duck'),
        ('John Major', 'John M')
)
select *
from my_table
where attribute_b = left(attribute_a, length(attribute_b));
  attribute_a   | attribute_b
----------------+-------------
 Nelson Mandela | Nelson
 John Major     | John M
(2 rows)
If you absolutely want to use regex, you have to build the pattern with concat() or format(), like this:
select *
from my_table
where attribute_a ~ concat('^', attribute_b)
-- where attribute_a ~ format('^%s', attribute_b)
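Applied back to the UPDATE from the question (a sketch, using the my_table name from the example above and the match column from the question), either variant would look like this:
-- starts-with test via left(); it compares exactly as many leading characters
-- of attribute_a as attribute_b contains, so regex metacharacters are harmless
UPDATE my_table
SET match = true
WHERE attribute_b = left(attribute_a, length(attribute_b));

-- regex variant; attribute_b is interpreted as a regular expression here,
-- so any metacharacters it contains would need escaping first
UPDATE my_table
SET match = true
WHERE attribute_a ~ concat('^', attribute_b);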

Regular Expression: changing matching method from OR to AND

I have a regular expression like the following (running on Oracle's regexp_like(), though the question isn't Oracle-specific):
abc|bcd|def|xyz
This basically matches against a tags field in the database to see if it contains abc OR bcd OR def OR xyz when the user has entered the search query "abc bcd def xyz".
The tags field on the database holds keywords separated by spaces, e.g. "cdefg abcd xyz"
On Oracle, this would be something like:
select ... from ... where
regexp_like(tags, 'abc|bcd|def|xyz');
It works fine as it is, but I want to add an extra option for users to search for results that match all keywords. How should I change the regular expression so that it matches abc AND bcd AND def AND xyz?
Note: Because I won't know what exact keywords the user will enter, I can't pre-structure the query in the PL/SQL like this:
select ... from ... where
tags like '%abc%' AND
tags like '%bcd%' AND
tags like '%def%' AND
tags like '%xyz%';
You can split the input pattern and check that all the parts of the pattern match:
SELECT t.*
FROM table_name t
CROSS APPLY (
    WITH input (match) AS (
        SELECT 'abc bcd def xyz' FROM DUAL
    )
    SELECT 1
    FROM input
    CONNECT BY LEVEL <= REGEXP_COUNT(match, '\S+')
    HAVING COUNT(
        REGEXP_SUBSTR(
            t.tags,
            REGEXP_SUBSTR(match, '\S+', 1, LEVEL)
        )
    ) = REGEXP_COUNT(match, '\S+')
)
Or, if you have Java enabled in the database then you can create a Java function to match regular expressions:
CREATE AND COMPILE JAVA SOURCE NAMED RegexParser AS
import java.util.regex.Pattern;

public class RegexpMatch {
    public static int match(
        final String value,
        final String regex
    ){
        final Pattern pattern = Pattern.compile(regex);
        return pattern.matcher(value).matches() ? 1 : 0;
    }
}
/
Then wrap it in an SQL function:
CREATE FUNCTION regexp_java_match(value IN VARCHAR2, regex IN VARCHAR2) RETURN NUMBER
AS LANGUAGE JAVA NAME 'RegexpMatch.match( java.lang.String, java.lang.String ) return int';
/
Then use it in SQL:
SELECT *
FROM table_name
WHERE regexp_java_match(tags, '(?=.*abc)(?=.*bcd)(?=.*def)(?=.*xyz).*') = 1;
Try this; the idea is that the number of matches should equal the number of patterns:
with data(val) AS (
    select 'cdefg abcd xyz' from dual union all
    select 'cba lmnop xyz' from dual
),
targets(s) as (
    select regexp_substr('abc bcd def xyz', '[^ ]+', 1, LEVEL) from dual
    connect by regexp_substr('abc bcd def xyz', '[^ ]+', 1, LEVEL) is not null
)
select val
from data d
join targets t on regexp_like(val, s)
group by val
having count(*) = (select count(*) from targets);
Result:
cdefg abcd xyz
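The same idea applied directly to the tags column (a sketch, assuming the table_name/tags names used in the other answers; grouping by rowid keeps rows with identical tag strings apart):
with targets(s) as (
    select regexp_substr('abc bcd def xyz', '[^ ]+', 1, LEVEL) from dual
    connect by regexp_substr('abc bcd def xyz', '[^ ]+', 1, LEVEL) is not null
)
select t.tags
from table_name t
join targets tg on regexp_like(t.tags, tg.s)
group by t.rowid, t.tags
having count(*) = (select count(*) from targets);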
I think dynamic SQL will be needed for this. The match all option will require individual matching with logic to ensure every individual match is found.
An easy way would be to build a join condition for each keyword. Concatenate the join statements in a string. Use dynamic SQL to execute the string as a query.
The example below uses the customer table from the sample schemas provided by Oracle.
DECLARE
    -- match string should be just the values to match with spaces in between
    p_match_string  VARCHAR2(200) := 'abc bcd def xyz';
    -- need logic to determine match one (OR) versus match all (AND)
    p_match_type    VARCHAR2(3)   := 'OR';
    l_sql_statement VARCHAR2(4000);
    -- create type if bulk collect is needed
    TYPE t_email_address_tab IS TABLE OF customers.EMAIL_ADDRESS%TYPE INDEX BY PLS_INTEGER;
    l_email_address_tab t_email_address_tab;
BEGIN
    WITH sql_clauses(row_idx, sql_text) AS
      (SELECT 0 row_idx  -- build select plus beginning of where clause
             ,'SELECT email_address '
              || 'FROM customers '
              || 'WHERE 1 = '
              || DECODE(p_match_type, 'AND', '1', '0') sql_text
       FROM DUAL
       UNION
       SELECT LEVEL row_idx  -- build joins for each keyword
             ,DECODE(p_match_type, 'AND', ' AND ', ' OR ')
              || 'email_address'
              || ' LIKE ''%'
              || REGEXP_SUBSTR(p_match_string, '[^ ]+', 1, LEVEL)
              || '%''' sql_text
       FROM DUAL
       CONNECT BY LEVEL <= LENGTH(p_match_string) - LENGTH(REPLACE(p_match_string, ' ')) + 1
      )
    -- put it all together by row_idx
    SELECT LISTAGG(sql_text, '') WITHIN GROUP (ORDER BY row_idx)
    INTO l_sql_statement
    FROM sql_clauses;

    dbms_output.put_line(l_sql_statement);

    -- can use execute immediate (or ref cursor) for dynamic sql
    EXECUTE IMMEDIATE l_sql_statement
    BULK COLLECT INTO l_email_address_tab;
END;
p_match_string  : abc bcd def xyz
p_match_type    : AND
l_sql_statement : SELECT email_address FROM customers WHERE 1 = 1 AND email_address LIKE '%abc%' AND email_address LIKE '%bcd%' AND email_address LIKE '%def%' AND email_address LIKE '%xyz%'

p_match_string  : abc bcd def xyz
p_match_type    : OR
l_sql_statement : SELECT email_address FROM customers WHERE 1 = 0 OR email_address LIKE '%abc%' OR email_address LIKE '%bcd%' OR email_address LIKE '%def%' OR email_address LIKE '%xyz%'

How to use regexp_replace in spark.sql() to extract hashtags from string

I need to write a regexp_replace query in spark.sql() and I'm not sure how to handle it. For readability purposes, I have to use SQL for it. I am trying to pull the hashtags out of the table. I know how to do this using the Python method, but most of my team are SQL users.
My dataframe example looks like so:
Insta_post
Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House…
RT @NALCABPolicy: Meeting with @RepDarrenSoto . Thanks for taking the time to meet with @LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…
RT @Tharryry: I am delighted that @RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…
My code:
I create a tempview:
post_df.createOrReplaceTempView("post_tempview")
post_df = spark.sql("""
select
regexp_replace(Insta_post, '.*?(.|'')(#)(\w+)', '$1') as a
from post_tempview
where Insta_post like '%#%'
""")
My end result:
+--------------------------------------------------------------------------------------------------------------------------------------------+
|a |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House… |
|RT @NALCABPolicy: Meeting with @RepDarrenSoto . Thanks for taking the time to meet with @LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…|
|RT @Tharryry: I am delighted that @RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…|
+--------------------------------------------------------------------------------------------------------------------------------------------+
desired result:
+---------------------------------+
|a |
+---------------------------------+
| #SaveTheInternet, #NetNeutrality|
| #NALCABPolicy2018 |
| #NetNeutrality |
+---------------------------------+
I haven't really used regexp_replace too much so this is new to me. Any help would be appreciated as well as an explanation of how to structure the subsets!
For Spark 3.1+, you can use regexp_extract_all function to extract multiple matches:
post_df = spark.sql("""
select regexp_extract_all(Insta_post, '(#\\\\w+)', 1) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+----------------------------------+
#|a |
#+----------------------------------+
#|[#SaveTheInternet, #NetNeutrality]|
#|[#NALCABPolicy2018] |
#|[#NetNeutrality] |
#+----------------------------------+
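If you want the comma-separated strings from the desired result rather than an array, one option (a sketch; array_join has been available since Spark 2.4) is to wrap the call in array_join. Shown here as plain Spark SQL; inside a Python string passed to spark.sql() the backslashes would need doubling again, as above:
-- array_join flattens the extracted array into one comma-separated string
select array_join(regexp_extract_all(Insta_post, '(#\\w+)', 1), ', ') as a
from post_tempview
where Insta_post like '%#%'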
For Spark <3.1, you can use regexp_replace to remove everything that doesn't match the hashtag pattern:
post_df = spark.sql("""
select trim(trailing ',' from regexp_replace(Insta_post, '.*?(#\\\\w+)|.*', '$1,')) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+-------------------------------+
#|a |
#+-------------------------------+
#|#SaveTheInternet,#NetNeutrality|
#|#NALCABPolicy2018 |
#|#NetNeutrality |
#+-------------------------------+
Note the use of trim to remove the unnecessary trailing comma created by the '$1,' replacement.
Do you really need a view? Because the following code might do it:
from pyspark.sql import functions as F

df = df.filter(F.col('Insta_post').like('%#%'))
col_trimmed = F.trim((F.regexp_replace('Insta_post', '.*?(#\w+)|.+', '$1 ')))
df = df.select(F.regexp_replace(col_trimmed,'\s',', ').alias('a'))
df.show(truncate=False)
# +--------------------------------+
# |a |
# +--------------------------------+
# |#SaveTheInternet, #NetNeutrality|
# |#NALCABPolicy2018 |
# |#NetNeutrality |
# +--------------------------------+
I ended up using regexp_replace twice, so there may be a better alternative; I just couldn't think of one.

Extract domain from url using PostgreSQL

I need to extract the domain name for a list of urls using PostgreSQL. In the first version, I tried using REGEXP_REPLACE to replace unwanted characters like www., biz., sports., etc. to get the domain name.
SELECT REGEXP_REPLACE(url, '^((www|www2|www3|static1|biz|health|travel|property|edu|world|newmedia|digital|ent|staging|cpelection|dev|m-staging|m|maa|cdnnews|testing|cdnpuc|shipping|sports|life|static01|cdn|dev1|ad|backends|avm|displayvideo|tand|static03|subscriptionv3|mdev|beta)\.)?', '') AS "Domain",
COUNT(DISTINCT(user)) AS "Unique Users"
FROM db
GROUP BY 1
ORDER BY 2 DESC;
This seems unfavorable, as the query needs to be constantly updated with the list of unwanted words.
I did try https://stackoverflow.com/a/21174423/10174021 to extract from the end of the line using PostgreSQL REGEXP_SUBSTR, but I'm getting blank rows in return. Is there a better way of doing this?
A dataset sample to try with:
CREATE TABLE sample (
    url VARCHAR(100) NOT NULL);
INSERT INTO sample (url)
VALUES
    ('sample.co.uk'),
    ('www.sample.co.uk'),
    ('www3.sample.co.uk'),
    ('biz.sample.co.uk'),
    ('digital.testing.sam.co'),
    ('sam.co'),
    ('m.sam.co');
Desired output
+------------------------+--------------+
| url | domain |
+------------------------+--------------+
| sample.co.uk | sample.co.uk |
| www.sample.co.uk | sample.co.uk |
| www3.sample.co.uk | sample.co.uk |
| biz.sample.co.uk | sample.co.uk |
| digital.testing.sam.co | sam.co |
| sam.co | sam.co |
| m.sam.co | sam.co |
+------------------------+--------------+
So, I've found the solution using Jeremy's and Rémy Baron's answers.
Extract all the public suffixes from the Public Suffix List and store them into a table, which I labelled tlds.
Get the unique urls in the dataset and match each to its TLD.
Extract the domain name using regexp_replace (used in this query) or, alternatively, regexp_substr(t1.url, '([a-z]+)(.)'||t1."tld").
The SQL query is as below:
WITH stored_tld AS (
    SELECT DISTINCT(s.url),
           FIRST_VALUE(t.domain) OVER (PARTITION BY s.url ORDER BY length(t.domain) DESC
                                       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "tld"
    FROM sample s
    JOIN tlds t ON (s.url like '%%'||domain)
)
SELECT t1.url,
       CASE WHEN t1."tld" IS NULL THEN t1.url
            ELSE regexp_replace(t1.url, '(.*\.)((.[a-z]*).*'||replace(t1."tld",'.','\.')||')', '\2')
       END AS "extracted_domain"
FROM (
    SELECT a.url, st."tld"
    FROM sample a
    LEFT JOIN stored_tld st ON a.url = st.url
) t1
Links to try: SQL Tester
You can try this:
with tlds as (
    select * from (values ('.co.uk'), ('.co'), ('.uk')) a(tld)
),
sample as (
    select * from (values ('sample.co.uk'),
                          ('www.sample.co.uk'),
                          ('www3.sample.co.uk'),
                          ('biz.sample.co.uk'),
                          ('digital.testing.sam.co'),
                          ('sam.co'),
                          ('m.sam.co')
                  ) a(url)
)
select url,
       regexp_replace(url, '(.*\.)(.*'||replace(tld,'.','\.')||')', '\2') "domain"
from (
    select distinct url,
           first_value(tld) over (partition by url order by length(tld) desc) tld
    from sample
    join tlds on (url like '%'||tld)
) a
I use split_part(url, '/', 3) for this:
select split_part('https://stackoverflow.com/questions/56019744', '/', 3) ;
output
stackoverflow.com
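Note that the sample urls in the question have no scheme, so split_part(url, '/', 3) would return an empty string for them. A sketch that handles both forms (the CASE expression is my own addition, not part of the original answer), though it returns the full hostname rather than the registrable domain:
SELECT url,
       CASE WHEN url LIKE '%//%' THEN split_part(url, '/', 3)
            ELSE split_part(url, '/', 1)
       END AS host
FROM sample;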

Google Sheets Formula to Extract and Convert Currency from € or £ to USD

I'm trying to do the following:
1. Check the cell for N/A or No; if it has either of these then it should output N/A or No.
2. Check the cell for either £ or € or Yes; if it has one of these then it continues to step 3. If it has $ then it should repeat the same input as the output.
3. Extract the currency from the cell using REGEXEXTRACT(A1, "\$\d+") or REGEXEXTRACT(A1, "\£\d+") (I assume that's the best way).
4. Convert it to $ USD using GoogleFinance("CURRENCY:EURUSD") or GoogleFinance("CURRENCY:GBPUSD").
5. Output the original cell, but replace the extracted currency from step 3 with the output from step 4.
Examples: (Original --> Output)
N/A --> N/A
No --> No
Alt --> Alt
Yes --> Yes
Yes £10 --> Yes $12.19
Yes £10 per week --> Yes $12.19 per week
Yes €5 (Next) --> Yes $5.49 (Next)
Yes $5 22 EA --> Yes $5 22 EA
Yes £5 - £10 --> Yes $5.49 - $12.19
I am unable to get a working IF statement; I could do this in normal code but can't work it out with spreadsheet formulas.
I've tried modifying @Rubén's answer lots of times to include the N/A (the literal text, not the Sheets error); I also tried the same for making any USD inputs come out as USD (no changes), but I really can't get the hang of IF/OR/AND in Excel/Google Sheets.
=ArrayFormula(
SUBSTITUTE(
A1,
OR(IF(A1="No","No",REGEXEXTRACT(A1, "[\£|\€]\d+")),IF(A1="N/A","N/A",REGEXEXTRACT(A1, "[\£|\€]\d+"))),
IF(
A1="No",
"No",
TEXT(
REGEXEXTRACT(A1, "[\£|\€](\d+)")*
IF(
"€"=REGEXEXTRACT(A1, "([\£|\€])\d+"),
GoogleFinance("CURRENCY:EURUSD"),
GoogleFinance("CURRENCY:GBPUSD")
),
"$###,###"
)
)
)
)
Above, I tried to add an OR() before the first IF statement to include N/A as an option; below, I tried it in various different ways (replacing line 4 with this):
IF(
OR(
A1="No",
"No",
REGEXEXTRACT(A1, "[\£|\€]\d+");
A1="No",
"No",
REGEXEXTRACT(A1, "[\£|\€]\d+")
)
)
But that doesn't work either. I thought using ; was a way to separate the OR expressions but apparently not.
Re: Rubén's latest code 16/10/2016
I've modified it to =ArrayFormula(
IF(NOT(ISBLANK(A2)),
IF(IFERROR(SEARCH("$",A2),0),A2,IF(A2="N/A","N/A",IF(A2="No","No",IF(A2="Alt","Alt",IF(A2="Yes","Yes",
SUBSTITUTE(
A2,
REGEXEXTRACT(A2, "[\£|\€]\d+"),
TEXT(
REGEXEXTRACT(A2, "[\£|\€](\d+)")
*
VLOOKUP(
REGEXEXTRACT(A2, "([\£|\€])\d+"),
{
{"£";"€"},
{GoogleFinance("CURRENCY:GBPUSD");GoogleFinance("CURRENCY:EURUSD")}
},
2,0),
"$###,###"
)
)
)))))
,"")
)
This fixes:
Blank cells no longer throw #N/A
Yes only cells no longer throw #N/A
Added another text value Alt
Changes the format of the currency to 0 decimal places rather than my original request of 2 decimal places.
As you can see in the image below, the two red cells aren't quite correct, as I never thought of this scenario: the second of the two values is staying in its input form and not being converted to USD.
Direct answer
Try
=ArrayFormula(
IF(IFERROR(SEARCH("$",A1:A6),0),A1:A6,IF(A1:A6="N/A","N/A",IF(A1:A6="No","No",
SUBSTITUTE(
A1:A6,
REGEXEXTRACT(A1:A6, "[\£|\€]\d+"),
TEXT(
REGEXEXTRACT(A1:A6, "[\£|\€](\d+)")
*
VLOOKUP(
REGEXEXTRACT(A1:A6, "([\£|\€])\d+"),
{
{"£";"€"},
{GoogleFinance("CURRENCY:GBPUSD");GoogleFinance("CURRENCY:EURUSD")}
},
2,0),
"$###,###.00"
)
)
)))
)
Result
+---+------------------+---------------------+
| | A | B |
+---+------------------+---------------------+
| 1 | N/A | N/A |
| 2 | No | No |
| 3 | Yes £10 | Yes $12.19 |
| 4 | Yes £10 per week | Yes $12.19 per week |
| 5 | Yes €5 (Next) | Yes $5.49 (Next) |
+---+------------------+---------------------+
Explanation
OR function
Instead of using the OR function, the above formula uses nested IF functions.
REGEXEXTRACT
Instead of using a REGEXEXTRACT function for each currency symbol, a regex OR operator was used. Example:
REGEXEXTRACT(A1:A6, "[\£|\€]\d+")
Three regular expressions were used:
get the currency symbol and the amount: [\£|\€]\d+
get the amount: [\£|\€](\d+)
get the currency symbol: ([\£|\€])\d+
Currency conversion
Instead of using nested IF to handle currency conversion rates, VLOOKUP and an array are used. This could make the formula easier to maintain, assuming that more currencies could be added in the future.

Postgres regexp does not work in Perl code using DBI

I have this simple data in a Postgres table (the data type is character varying):
48
2
L
4XL
25.0
25
7.0
I have this SQL query with a regexp match (I want to match only numeric-like values such as 7.0 or 48):
SELECT * FROM table WHERE ss.sizecode ~ E'^\\s*[\\d\\.]+\\s*$'
This works perfectly in the command-line client psql,
but does not work in Perl code:
my $sth = $dbh->prepare(
    q(SELECT * FROM table WHERE ss.sizecode ~ E'^\\s*[\\d\\.]+\\s*$')
);
$sth->execute;
while ( my @row = $sth->fetchrow_array() ) {
    # not the data I want
}
String literal
q(SELECT * FROM table WHERE ss.sizecode ~ E'^\\s*[\\d\\.]+\\s*$')
produces the string
SELECT * FROM table WHERE ss.sizecode ~ E'^\s*[\d\.]+\s*$'
To get
SELECT * FROM table WHERE ss.sizecode ~ E'^\\s*[\\d\\.]+\\s*$'
you need
q(SELECT * FROM table WHERE ss.sizecode ~ E'^\\\\s*[\\\\d\\\\.]+\\\\s*$')
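As an aside (my own note, not part of the original answer): with standard_conforming_strings on, which has been the PostgreSQL default since 9.1, you can drop the E prefix and write the pattern with single backslashes; those pass through Perl's q() untouched, so no doubling is needed at all:
-- plain (non-E) literal: backslashes are not escape characters here,
-- so this is the same regex that worked in psql
SELECT * FROM table WHERE ss.sizecode ~ '^\s*[\d\.]+\s*$';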