Regex to exclude values between braces and apply pattern on the rest - regex

I had the following values in a field
Aa11
BBB-
BBB+
A- /*-
A3
Ca
I would use the regex
(([A-Z](([abc]+\d?)|\d))|([A-Z]+[+-]?))
which worked fine. However, I now have a new set of data:
(p)A3
(q)A- /*-
How do I make sure I ignore the brackets and the values between them, and apply my regex above to the rest?
I am doing this using REGEXP_SUBSTR in Oracle.

The regular expression \(.*?\) will match opening and closing brackets with as few characters as possible between them and [^(]*? will match zero-or-as-few-as-possible non-open bracket characters. You can combine these to give the regular expression ^([^(]*?\(.*?\))*?[^(]*? which will match as few as possible bracket groups (provided you do not have nested brackets) until your required pattern is found.
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE data ( value ) AS
SELECT 'Aa11' FROM DUAL UNION ALL
SELECT 'BBB-' FROM DUAL UNION ALL
SELECT 'BBB+' FROM DUAL UNION ALL
SELECT 'A- /*-' FROM DUAL UNION ALL
SELECT 'A3' FROM DUAL UNION ALL
SELECT 'Ca' FROM DUAL UNION ALL
SELECT '(p)A3' FROM DUAL UNION ALL
SELECT '(q)A- /*-' FROM DUAL UNION ALL
SELECT '(Ca)Cb(Cc)' FROM DUAL UNION ALL
SELECT '--(Ca)--(Cb)--Cc(--Ca)' FROM DUAL;
Query 1:
SELECT value,
REGEXP_SUBSTR(
value,
'^([^(]*?\(.*?\))*?[^(]*?([A-Z]([abc]+\d?|\d|[A-Z]*[+-]?))',
1, -- Start at 1st character
1, -- Find the 1st occurrence
NULL, -- No flags
2 -- Return 2nd capturing group
) AS regex_output
FROM data
Results:
| VALUE | REGEX_OUTPUT |
|------------------------|--------------|
| Aa11 | Aa1 |
| BBB- | BBB- |
| BBB+ | BBB+ |
| A- /*- | A- |
| A3 | A3 |
| Ca | Ca |
| (p)A3 | A3 |
| (q)A- /*- | A- |
| (Ca)Cb(Cc) | Cb |
| --(Ca)--(Cb)--Cc(--Ca) | Cc |
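For readers who want to try the pattern outside Oracle, here is a minimal Python sketch of the same idea. It assumes that Python's re module handles the lazy quantifiers the same way as Oracle for these inputs; the helper name extract_rating is made up for illustration, and the sample values come from the table above.

```python
import re

# Same pattern as in the Oracle query above: lazily skip "(...)" groups,
# then capture the wanted value in group 2.
PATTERN = re.compile(
    r'^([^(]*?\(.*?\))*?[^(]*?([A-Z]([abc]+\d?|\d|[A-Z]*[+-]?))'
)


def extract_rating(value):
    """Return the first match found outside brackets, or None."""
    match = PATTERN.search(value)
    return match.group(2) if match else None


samples = ['Aa11', 'BBB-', '(p)A3', '(q)A- /*-', '(Ca)Cb(Cc)',
           '--(Ca)--(Cb)--Cc(--Ca)']
for value in samples:
    print(value, '->', extract_rating(value))
```

As in the Oracle version, this relies on the brackets not being nested.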

SUM with conditions in a query/importrange from another Google sheet

I have a table like this in spreadsheet A:
| title| total |
|----- |--------|
| X1 | 2 |
| Y | 3 |
| Z | 4 |
| X2 | 5 |
Since spreadsheet A is constantly updated and uses other formulas, I need to export it to another sheet to work on.
I also need to sum the total column where the title column matches a condition, such as a regular expression.
The result should be:
| title| total |
|----- |--------|
| X | 7 |
| Y | 3 |
| Z | 4 |
Please advise on this case. I've been looking at QUERY with a SUMIF formula, but it does not support summing when the condition is not matched.
Thanks in advance.
You can try SUMIFS() with the wildcard option. Use the formula below:
=SUMIFS($B$2:$B$5,$A$2:$A$5,D2 & "*")
After you allow access, try:
=INDEX(QUERY({REGEXREPLACE(
IMPORTRANGE("id", "sheetname!A2:A"), "\d+$", ),
IMPORTRANGE("id", "sheetname!B2:B")},
"select Col1,sum(Col2)
where Col1 is not null
group by Col1
label sum(Col2)''"))
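To make the grouping logic of the formula above easier to follow, here is a small Python sketch of the same steps: strip trailing digits from each title (the "\d+$" replacement), then sum the totals per resulting key. The function name sum_by_title_prefix is hypothetical and the rows are the sample data from the question.

```python
import re
from collections import defaultdict


def sum_by_title_prefix(rows):
    """Sum totals per title after dropping trailing digits (X1, X2 -> X)."""
    totals = defaultdict(int)
    for title, total in rows:
        key = re.sub(r'\d+$', '', title)  # same idea as REGEXREPLACE(..., "\d+$", )
        totals[key] += total
    return dict(totals)


rows = [('X1', 2), ('Y', 3), ('Z', 4), ('X2', 5)]
print(sum_by_title_prefix(rows))
```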

How to extract domain name using dynamic regex in Redshift?

I need to extract the domain name from a URL using Redshift PostgreSQL. Example: extract 'google.com' from 'www.google.com'. Each URL in my dataset has a different top-level domain (TLD). My approach was to first join the matching TLD to the dataset and then use a regex to extract 'first_string.TLD'. In Redshift, I'm getting the error 'The pattern must be a valid UTF-8 literal character expression'. Is there a way around this?
A sample of my dataset:
+---+------------------------+--------------+
| id| trimmed_domain | tld |
+---+------------------------+--------------+
| 1 | sample.co.uk | co.uk |
| 2 | www.sample.co.uk | co.uk |
| 3 | www3.sample.co.uk | co.uk |
| 4 | biz.sample.co.uk | co.uk |
| 5 | digital.testing.sam.co | co |
| 6 | sam.co | co |
| 7 | www.google.com | com |
| 8 | 1.11.220 | |
+---+------------------------+--------------+
My code:
SELECT t1.extracted_domain, COUNT(DISTINCT(t1.id))
FROM(
SELECT
d.id,
d.trimmed_domain,
CASE
WHEN d.tld IS null THEN d.trimmed_domain ELSE
regexp_replace(d.trimmed_domain,'(.*\.)((.[a-z]*).*'||replace(tld,'.','\.')||')','\2')
END AS "extracted_domain"
FROM dataset d
)t1
GROUP BY 1
ORDER BY 2;
Expected output:
+------------------------+--------------+
| extracted_domain | count |
+------------------------+--------------+
| sample.co.uk | 4 |
| sam.co | 2 |
| google.com | 1 |
| 1.11.220 | 1 |
+------------------------+--------------+
I'm not so sure about the query. However, you can use this tool to design any expression you need for modifying your query.
My guess is that something like this might help:
^(?!d|b|www3).*
You can list any prefixes that you wish to exclude inside the lookahead, separated by | (OR): (?!d|b|www3).
You may want to add your desired URLs to an expression similar to:
^(sam|www.google|1.11|www.sample|www3.sample).*
So, I've found a solution. Redshift does not support column-based regex patterns, so the alternative is to use a Python UDF:
1. Change the tld column to a regex pattern.
2. Go row by row and extract the domain name using the regex pattern column.
3. Group by the extracted_domain and count the users.
The SQL query is as below:
CREATE OR REPLACE FUNCTION extractor(col_domain varchar)
RETURNS varchar
IMMUTABLE AS $$
    import re
    # No TLD: return an empty pattern, which regex_match treats as "no match needed".
    if col_domain is None:
        return ''
    # Build a pattern matching "<label>.<tld>"; escape the TLD so that
    # the dot in e.g. "co.uk" is matched literally.
    return r'([^/.]+\.({}))'.format(re.escape(col_domain))
$$ LANGUAGE plpythonu;
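CREATE OR REPLACE FUNCTION regex_match(in_pattern varchar, input_str varchar)
RETURNS varchar
IMMUTABLE AS $$
    import re
    # An empty pattern means there is no TLD to match: return the input as-is.
    if in_pattern == '':
        return str(input_str)
    match = re.search(in_pattern, input_str)
    # Fall back to the input when the pattern does not match,
    # instead of failing on None.group().
    return str(match.group()) if match else str(input_str)
$$ LANGUAGE plpythonu;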
SELECT
t2.extracted_domain,
COUNT(DISTINCT(t2.id)) AS "Unique Users"
FROM(
SELECT
t1.id,
t1.trimmed_domain,
regex_match(t1.regex_pattern, t1.trimmed_domain) AS "extracted_domain"
FROM(
SELECT
id,
trimmed_domain,
CASE WHEN tld is null THEN '' ELSE extractor(tld) END AS "regex_pattern"
FROM dataset
)t1
)t2
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
Python UDFs seem to be slow on a large dataset, so I'm open to suggestions on improving the query.
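The row-by-row extraction described in this answer can be checked outside Redshift with a short Python sketch. This is an illustration under assumptions, not the UDF itself: the helper name extract_domain is made up, the pattern is anchored to the end of the string (which assumes trimmed_domain carries no path), and the rows are the sample data from the question.

```python
import re
from collections import Counter


def extract_domain(trimmed_domain, tld):
    """Return '<label>.<tld>' taken from the end of trimmed_domain.

    Falls back to the unchanged input when there is no TLD or no match.
    """
    if not tld:
        return trimmed_domain
    # re.escape keeps the dot in "co.uk" literal; "$" is an assumption
    # that the TLD sits at the very end of the value.
    pattern = r'([^/.]+\.' + re.escape(tld) + r')$'
    match = re.search(pattern, trimmed_domain)
    return match.group(1) if match else trimmed_domain


rows = [
    (1, 'sample.co.uk', 'co.uk'),
    (2, 'www.sample.co.uk', 'co.uk'),
    (3, 'www3.sample.co.uk', 'co.uk'),
    (4, 'biz.sample.co.uk', 'co.uk'),
    (5, 'digital.testing.sam.co', 'co'),
    (6, 'sam.co', 'co'),
    (7, 'www.google.com', 'com'),
    (8, '1.11.220', None),
]

counts = Counter(extract_domain(domain, tld) for _, domain, tld in rows)
for domain, count in counts.most_common():
    print(domain, count)
```

On the sample rows this should give the same grouping as the expected output in the question.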
If you know the prefixes you would like to remove from the domains, then why not just exclude these? The following query simply removes the known www/http/etc. prefixes from domain names and counts the normalized domain names.
SELECT normalized_domain, COUNT(*)
FROM (
SELECT REGEXP_REPLACE(domain, '^(https|http|www|biz)') AS normalized_domain
FROM domains
) t
GROUP BY normalized_domain;

Regular Expression to parse Query Output

I have a need to execute a metadata query which will dump a list of tables into a file. However, I need a way to eliminate all formatting besides the tableId itself. Can this be done through a regex? Appreciate all help in advance.
+-------------------------------------+-------+
| tableId | Type |
+-------------------------------------+-------+
| t_margins | TABLE |
| t_rev_test | TABLE |
| t_rev_share | TABLE |
You have some options, but I would suggest something like this:
^\| (\S+)
It will match on the line from the start, a pipe, a space and then all non-spaces. The non-spaces will be your tableId. Here is a little example in Python:
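import re
my_string = '''| t_margins | TABLE |
| t_rev_test | TABLE |
| t_rev_share | TABLE |'''
my_list = my_string.split('\n')
for line in my_list:
    # Raw string so that \| and \S reach the regex engine unchanged.
    match = re.search(r"^\| (\S+)", line)
    if match:  # skip lines that are not table rows
        print(match.group(1))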
This will give you:
t_margins
t_rev_test
t_rev_share
The following regexp captures just the column values of the first column:
^\| (\w+)
https://regex101.com/r/gODhra/3
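Both patterns above also match the header row (tableId), so a full dump needs one more filter. Below is a minimal Python sketch that applies the same first-column pattern to the complete output from the question and skips the border lines and the header. The function name table_ids and its header parameter are hypothetical.

```python
import re

raw_output = '''+-------------------------------------+-------+
| tableId | Type |
+-------------------------------------+-------+
| t_margins | TABLE |
| t_rev_test | TABLE |
| t_rev_share | TABLE |'''


def table_ids(text, header='tableId'):
    """Collect the first column of each '| ... |' row, skipping borders and the header."""
    ids = []
    for line in text.splitlines():
        # Border lines start with "+", so they never match.
        match = re.match(r'\|\s*(\w+)', line)
        if match and match.group(1) != header:
            ids.append(match.group(1))
    return ids


print('\n'.join(table_ids(raw_output)))
```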

Regex for PostgreSQL for getting domain with sub-domain from URL/Website

Basically, I need to get those rows which contain domain and subdomain name from a URL or the whole website name excluding www.
My DB table looks like this:
+----------+------------------------+
| id | website |
+----------+------------------------+
| 1 | https://www.google.com |
+----------+------------------------+
| 2 | http://www.google.co.in|
+----------+------------------------+
| 3 | www.google.com |
+----------+------------------------+
| 4 | www.google.co.in |
+----------+------------------------+
| 5 | google.com |
+----------+------------------------+
| 6 | google.co.in |
+----------+------------------------+
| 7 | http://google.co.in |
+----------+------------------------+
Expected output:
google.com
google.co.in
google.com
google.co.in
google.com
google.co.in
google.co.in
My Postgres Query looks like this:
select id, substring(website from '.*://([^/]*)') as website_domain from contacts
But the above query returns blank values. So how can I get the desired output?
You must make the "http://" prefix optional, using a non-capturing group (?:...), to cope with the websites that do not start with "http://", like this:
select
id,
substring(website from '(?:.*://)?(?:www\.)?([^/?]*)') as website_domain
from contacts;
SQL Fiddle: http://sqlfiddle.com/#!17/f890c/2/0
PostgreSQL's regular expressions: https://www.postgresql.org/docs/9.3/functions-matching.html#POSIX-ATOMS-TABLE
You may use
SELECT REGEXP_REPLACE(website, '^(https?://)?(www\.)?', '') from tbl;
See the regex demo.
Details
^ - start of string
(https?://)? - 1 or 0 occurrences of http:// or https://
(www\.)? - 1 or 0 occurrences of www.
See the PostgreSQL demo:
CREATE TABLE tb1
(website character varying)
;
INSERT INTO tb1
(website)
VALUES
('https://www.google.com'),
('http://www.google.co.in'),
('www.google.com'),
('www.google.co.in'),
('google.com'),
('google.co.in'),
('http://google.co.in')
;
SELECT REGEXP_REPLACE(website, '^(https?://)?(www\.)?', '') from tb1;
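The same prefix-stripping pattern can be tried quickly in Python. This is only a sketch for checking the regex against the sample rows; the helper name strip_prefix is made up, and Python's re module stands in for PostgreSQL's regex engine here.

```python
import re

# Optional scheme followed by an optional "www.", anchored at the start.
PREFIX = re.compile(r'^(https?://)?(www\.)?')


def strip_prefix(website):
    """Remove a leading http(s):// and www. from a website string."""
    return PREFIX.sub('', website, count=1)


websites = [
    'https://www.google.com',
    'http://www.google.co.in',
    'www.google.com',
    'www.google.co.in',
    'google.com',
    'google.co.in',
    'http://google.co.in',
]
for site in websites:
    print(strip_prefix(site))
```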

Oracle's REGEXP_LIKE not working correctly?

This could probably be something simple I'm missing but anyways:
I'm trying to match strings in a column that end with '01'. I currently have '[A-Z0-9\-]+01$' as my expression, which matches the kinds of strings I want when I check it with regex101. The strings it should match look like this:
1-WA01-0009-01
which works in the linked site.
This is the SQL I'm using to test this:
SELECT *
FROM V_Translog_MDATE_V1
WHERE
REGEXP_LIKE (ITEMNO, '[a-zA-Z0-9\-]+01$')
ORDER BY ARINVT_ID
Why isn't my regex working in the SQL? Why does this return nothing when I know there are strings that match the pattern?
Closed but not working
Nothing seemed out of place with what I was doing, so I'm hoping it is something to do with the column's values. Thanks for the help from those who tried.
I'm trying to match strings in a column that end with '01'.
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE V_Translog_MDATE_V1 ( ARINVT_ID, itemno ) AS
SELECT 1, '1-WA01-0009-01' FROM DUAL UNION ALL
SELECT 2, '1-WA01-0009-02' FROM DUAL UNION ALL
SELECT 3, '%&*^$%"£*&%-01' FROM DUAL UNION ALL
SELECT 4, '%&*^$%"£*&%-02' FROM DUAL;
Query 1:
You could just use LIKE:
SELECT *
FROM V_Translog_MDATE_V1
WHERE ITEMNO LIKE '%01'
ORDER BY ARINVT_ID
Results:
| ARINVT_ID | ITEMNO |
|-----------|----------------|
| 1 | 1-WA01-0009-01 |
| 3 | %&*^$%"£*&%-01 |
Query 2:
If you want to use regular expressions then you do not need to check the preceding characters:
SELECT *
FROM V_Translog_MDATE_V1
WHERE REGEXP_LIKE( ITEMNO, '01$' )
ORDER BY ARINVT_ID
Results:
| ARINVT_ID | ITEMNO |
|-----------|----------------|
| 1 | 1-WA01-0009-01 |
| 3 | %&*^$%"£*&%-01 |
Query 3:
If you want the entire string to be matched by your acceptable preceding characters regular expression then you want to prefix it with ^ (start-of-string):
SELECT *
FROM V_Translog_MDATE_V1
WHERE REGEXP_LIKE( ITEMNO, '^[A-Z0-9\-]+01$' )
ORDER BY ARINVT_ID
Results:
| ARINVT_ID | ITEMNO |
|-----------|----------------|
| 1 | 1-WA01-0009-01 |
(Otherwise, [A-Z0-9\-]+01$ would also match %&*^$%"£*&%-01, since the - immediately before the final 01 is enough to satisfy [A-Z0-9\-]+.)
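The effect of the ^ anchor can be reproduced with a short Python sketch over the same four sample values. Note the assumption: Python's re module is used here in place of Oracle's engine, and the two treat a backslash inside a bracket expression slightly differently, although that makes no difference for these values.

```python
import re

items = [
    '1-WA01-0009-01',
    '1-WA01-0009-02',
    '%&*^$%"£*&%-01',
    '%&*^$%"£*&%-02',
]

# Without ^ the match may start anywhere, so "-01" at the end is enough.
unanchored = [s for s in items if re.search(r'[A-Z0-9\-]+01$', s)]

# With ^ every character before the final "01" must be in the class.
anchored = [s for s in items if re.search(r'^[A-Z0-9\-]+01$', s)]

print('unanchored:', unanchored)
print('anchored:  ', anchored)
```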