How to extract domain name using dynamic regex in Redshift? - regex

I need to extract the domain name from a URL using Redshift (PostgreSQL-based). Example: extract 'google.com' from 'www.google.com'. Each URL in my dataset has a different top-level domain (TLD). My approach was to first join the matching TLD to the dataset and then use a regex to extract 'first_string.TLD'. In Redshift, I'm getting the error 'The pattern must be a valid UTF-8 literal character expression'. Is there a way around this?
A sample of my dataset:
+---+------------------------+--------------+
| id| trimmed_domain | tld |
+---+------------------------+--------------+
| 1 | sample.co.uk | co.uk |
| 2 | www.sample.co.uk | co.uk |
| 3 | www3.sample.co.uk | co.uk |
| 4 | biz.sample.co.uk | co.uk |
| 5 | digital.testing.sam.co | co |
| 6 | sam.co | co |
| 7 | www.google.com | com |
| 8 | 1.11.220 | |
+---+------------------------+--------------+
My code:
SELECT t1.extracted_domain, COUNT(DISTINCT(t1.id))
FROM(
SELECT
d.id,
d.trimmed_domain,
CASE
WHEN d.tld IS null THEN d.trimmed_domain ELSE
regexp_replace(d.trimmed_domain,'(.*\.)((.[a-z]*).*'||replace(tld,'.','\.')||')','\2')
END AS "extracted_domain"
FROM dataset d
)t1
GROUP BY 1
ORDER BY 2;
Expected output:
+------------------------+--------------+
| extracted_domain | count |
+------------------------+--------------+
| sample.co.uk | 4 |
| sam.co | 2 |
| google.com | 1 |
| 1.11.220 | 1 |
+------------------------+--------------+

I'm not so sure about the query. However, you can use a regex testing tool to design and adjust any expression that you wish to use in your query.
My guess is that maybe this would help:
^(?!d|b|www3).*
You can list any prefixes that you wish to exclude, separated by the OR operator |, inside the negative lookahead (?!d|b|www3).
You may want to add your desired URLs to an expression similar to:
^(sam|www.google|1.11|www.sample|www3.sample).*
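As a rough illustration of what the negative lookahead above does, here is a minimal sketch using Python's re module (an assumption: it is not Redshift's regex engine, and whether Redshift accepts lookaheads is not tested here). The sample values come from the question's trimmed_domain column.

```python
import re

# Pattern from the answer: reject strings that start with d, b, or www3.
exclude_prefixes = re.compile(r'^(?!d|b|www3).*')

# Sample values taken from the question's trimmed_domain column.
domains = ['sample.co.uk', 'www3.sample.co.uk', 'biz.sample.co.uk',
           'digital.testing.sam.co', 'sam.co', 'www.google.com']

# Keep only the strings the pattern matches.
kept = [d for d in domains if exclude_prefixes.match(d)]
print(kept)
```

Note that this pattern filters whole rows out rather than extracting the domain part, so it may not produce the counts the question asks for.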

So, I've found a solution. Redshift does not support column-based regex patterns, so the alternative is to use a Python UDF.
Change the tld column to regex pattern.
Go row by row and extract the domain name using the regex pattern column.
Group by the extracted_domain and count the users.
The SQL query is as below:
CREATE OR REPLACE FUNCTION extractor(col_domain varchar)
RETURNS varchar
IMMUTABLE AS $$
    import re
    # Rebuild the TLD string character by character, skipping None values.
    _regex = ''
    for domain in col_domain:
        if domain is None:
            continue
        else:
            _regex += r'{}'.format(domain)
    # Match the label directly before the TLD, followed by the TLD itself.
    domain_regex = r'([^/.]+\.({}))'.format(_regex)
    return domain_regex
$$ LANGUAGE plpythonu;
CREATE OR REPLACE FUNCTION regex_match(in_pattern varchar, input_str varchar)
RETURNS varchar
IMMUTABLE AS $$
    import re
    if in_pattern == '':
        # No TLD pattern: return the input unchanged.
        a = str(input_str)
    else:
        a = str(re.search(in_pattern, input_str).group())
    return a
$$ LANGUAGE plpythonu;
SELECT
t2.extracted_domain,
COUNT(DISTINCT(t2.id)) AS "Unique Users"
FROM(
SELECT
t1.id,
t1.trimmed_domain,
regex_match(t1.regex_pattern, t1.trimmed_domain) AS "extracted_domain"
FROM(
SELECT
id,
trimmed_domain,
CASE WHEN tld is null THEN '' ELSE extractor(tld) END AS "regex_pattern"
FROM dataset
)t1
)t2
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
The Python UDF seems to be slow on a large dataset, so I'm open to suggestions for improving the query.
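To check the extraction logic outside Redshift, the idea can be sketched in plain Python against the sample rows from the question. This is only a sketch under assumptions: the helper name extract_domain is hypothetical, and unlike the UDF above it escapes the TLD with re.escape and anchors the pattern at the end of the string.

```python
import re
from collections import Counter

def extract_domain(trimmed_domain, tld):
    """Return '<label>.<tld>', or the input unchanged when no TLD is given."""
    if not tld:
        return trimmed_domain
    # One label without dots or slashes, a dot, then the literal TLD at the end.
    pattern = r'[^/.]+\.' + re.escape(tld) + r'$'
    match = re.search(pattern, trimmed_domain)
    # Fall back to the input if the TLD does not actually occur at the end.
    return match.group(0) if match else trimmed_domain

# Sample rows from the question: (id, trimmed_domain, tld).
rows = [
    (1, 'sample.co.uk', 'co.uk'),
    (2, 'www.sample.co.uk', 'co.uk'),
    (3, 'www3.sample.co.uk', 'co.uk'),
    (4, 'biz.sample.co.uk', 'co.uk'),
    (5, 'digital.testing.sam.co', 'co'),
    (6, 'sam.co', 'co'),
    (7, 'www.google.com', 'com'),
    (8, '1.11.220', None),
]

# Count rows per extracted domain (ids are unique in this sample).
counts = Counter(extract_domain(domain, tld) for _, domain, tld in rows)
for domain, count in counts.most_common():
    print(domain, count)
```

On the sample data this reproduces the counts in the expected output of the question.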

If you know the prefixes you would like to remove from the domains, then why not just strip these? The following query simply removes the known www/http/etc. prefixes from domain names and counts the normalized domain names.
SELECT COUNT(*) from
(select REGEXP_REPLACE(domain, '^(https|http|www|biz)') FROM domains)
GROUP BY regexp_replace;
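The same prefix-stripping idea can be sketched in Python. The pattern below is a hypothetical variant, not the one from the query: it also consumes an optional scheme and the dot after the prefix, which the query's pattern would leave behind.

```python
import re
from collections import Counter

# Hypothetical variant of the answer's pattern: an optional scheme, then an
# optional known prefix (www or biz) together with its trailing dot.
PREFIX = re.compile(r'^(?:https?://)?(?:(?:www|biz)\.)?')

def normalize(domain):
    """Strip the known prefixes from the start of a domain string."""
    return PREFIX.sub('', domain)

domains = ['sample.co.uk', 'www.sample.co.uk', 'biz.sample.co.uk', 'www.google.com']
counts = Counter(normalize(d) for d in domains)
print(counts)
```

A prefix that is not in the list, such as www3, is left untouched, so this approach only works when the set of prefixes is known in advance.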

Related

Regular Expression to parse Query Output

I have a need to execute a metadata query which will dump a list of tables into a file. However, I need a way to eliminate all formatting besides the tableId itself. Can this be done through a regex? Appreciate all help in advance.
+-------------------------------------+-------+
| tableId | Type |
+-------------------------------------+-------+
| t_margins | TABLE |
| t_rev_test | TABLE |
| t_rev_share | TABLE |
You have some options, but I would suggest something like this:
^\| (\S+)
It will match on the line from the start, a pipe, a space and then all non-spaces. The non-spaces will be your tableId. Here is a little example in Python:
import re
my_string = '''| t_margins | TABLE |
| t_rev_test | TABLE |
| t_rev_share | TABLE |'''
my_list = my_string.split('\n')
for line in my_list:
    # Raw string so that \| and \S reach the regex engine unchanged.
    match = re.search(r"^\| (\S+)", line)
    print(match.group(1))
This will give you:
t_margins
t_rev_test
t_rev_share
The following regexp captures just the column values of the first column:
^\| (\w+)
https://regex101.com/r/gODhra/3

Regex for PostgreSQL for getting domain with sub-domain from URL/Website

Basically, I need to get those rows which contain domain and subdomain name from a URL or the whole website name excluding www.
My DB table looks like this:
+----------+------------------------+
| id | website |
+----------+------------------------+
| 1 | https://www.google.com |
+----------+------------------------+
| 2 | http://www.google.co.in|
+----------+------------------------+
| 3 | www.google.com |
+----------+------------------------+
| 4 | www.google.co.in |
+----------+------------------------+
| 5 | google.com |
+----------+------------------------+
| 6 | google.co.in |
+----------+------------------------+
| 7 | http://google.co.in |
+----------+------------------------+
Expected output:
google.com
google.co.in
google.com
google.co.in
google.com
google.co.in
google.co.in
My Postgres Query looks like this:
select id, substring(website from '.*://([^/]*)') as website_domain from contacts
But the above query returns blank values for some websites. So, how can I get the desired output?
You must use non-capturing groups (?:...), made optional with ?, to cope with the websites that do not start with "http://", like this:
select
id,
substring(website from '(?:.*://)?(?:www\.)?([^/?]*)') as website_domain
from contacts;
SQL Fiddle: http://sqlfiddle.com/#!17/f890c/2/0
PostgreSQL's regular expressions: https://www.postgresql.org/docs/9.3/functions-matching.html#POSIX-ATOMS-TABLE
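To see why the non-capturing groups matter, here is a small Python sketch of the same pattern run against the websites from the question. It assumes Python's re behaves like PostgreSQL's engine for this pattern; the helper name website_domain is hypothetical.

```python
import re

# Optional scheme and optional "www." are non-capturing, so the only
# capturing group is the host part that follows them.
DOMAIN = re.compile(r'(?:.*://)?(?:www\.)?([^/?]*)')

def website_domain(website):
    # Every part of the pattern is optional, so match() always succeeds.
    return DOMAIN.match(website).group(1)

websites = ['https://www.google.com', 'http://www.google.co.in', 'www.google.com',
            'www.google.co.in', 'google.com', 'google.co.in', 'http://google.co.in']
results = [website_domain(w) for w in websites]
for domain in results:
    print(domain)
```

Because the scheme and www. groups do not capture, the first (and only) capturing group is the part that substring returns.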
You may use
SELECT REGEXP_REPLACE(website, '^(https?://)?(www\.)?', '') from tbl;
See the regex demo.
Details
^ - start of string
(https?://)? - 1 or 0 occurrences of http:// or https://
(www\.)? - 1 or 0 occurrences of www.
See the PostgreSQL demo:
CREATE TABLE tb1
(website character varying)
;
INSERT INTO tb1
(website)
VALUES
('https://www.google.com'),
('http://www.google.co.in'),
('www.google.com'),
('www.google.co.in'),
('google.com'),
('google.co.in'),
('http://google.co.in')
;
SELECT REGEXP_REPLACE(website, '^(https?://)?(www\.)?', '') from tb1;
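The replace-based approach can be checked the same way with Python's re.sub. This is a sketch assuming the pattern behaves identically in Python for these inputs; strip_prefix is a hypothetical helper name.

```python
import re

def strip_prefix(website):
    """Remove an optional http:// or https:// and an optional www. from the start."""
    return re.sub(r'^(https?://)?(www\.)?', '', website)

websites = ['https://www.google.com', 'http://www.google.co.in', 'www.google.com',
            'www.google.co.in', 'google.com', 'google.co.in', 'http://google.co.in']
stripped = [strip_prefix(w) for w in websites]
print(stripped)
```

Unlike the capture-based pattern, this keeps any path or query string that follows the host name.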

Oracles REGEXP_LIKE not working correctly?

This could probably be something simple I'm missing but anyways:
I'm trying to match strings in a column that end with '01'. I currently have this for my expression '[A-Z0-9\-]+01$' which matches the types of strings I want matched checking with regex101. The strings it should be matching should be like this:
1-WA01-0009-01
which works in the linked site.
This is the SQL I'm using to test this:
SELECT *
FROM V_Translog_MDATE_V1
WHERE
REGEXP_LIKE (ITEMNO, '[a-zA-Z0-9\-]+01$')
ORDER BY ARINVT_ID
Why isn't my regex working in the SQL? Why does this return nothing when I know there are strings that match the pattern?
Closed but not working
Nothing seemed out of place with what I was doing, so I'm hoping it could be something to do with the column's values themselves. Thanks for the help from those who tried.
I'm trying to match strings in a column that end with '01'.
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE V_Translog_MDATE_V1 ( ARINVT_ID, itemno ) AS
SELECT 1, '1-WA01-0009-01' FROM DUAL UNION ALL
SELECT 2, '1-WA01-0009-02' FROM DUAL UNION ALL
SELECT 3, '%&*^$%"£*&%-01' FROM DUAL UNION ALL
SELECT 4, '%&*^$%"£*&%-02' FROM DUAL;
Query 1:
You could just use LIKE:
SELECT *
FROM V_Translog_MDATE_V1
WHERE ITEMNO LIKE '%01'
ORDER BY ARINVT_ID
Results:
| ARINVT_ID | ITEMNO |
|-----------|----------------|
| 1 | 1-WA01-0009-01 |
| 3 | %&*^$%"£*&%-01 |
Query 2:
If you want to use regular expressions then you do not need to check the preceding characters:
SELECT *
FROM V_Translog_MDATE_V1
WHERE REGEXP_LIKE( ITEMNO, '01$' )
ORDER BY ARINVT_ID
Results:
| ARINVT_ID | ITEMNO |
|-----------|----------------|
| 1 | 1-WA01-0009-01 |
| 3 | %&*^$%"£*&%-01 |
Query 3:
If you want the entire string to be matched by your acceptable preceding characters regular expression then you want to prefix it with ^ (start-of-string):
SELECT *
FROM V_Translog_MDATE_V1
WHERE REGEXP_LIKE( ITEMNO, '^[A-Z0-9\-]+01$' )
ORDER BY ARINVT_ID
Results:
| ARINVT_ID | ITEMNO |
|-----------|----------------|
| 1 | 1-WA01-0009-01 |
(otherwise [A-Z0-9\-]+01$ would match £$£^*&$^"%-01 since there is at least one matched character preceding the final 01 characters.)
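The difference between the unanchored and the anchored pattern can be reproduced with a short Python sketch. This assumes Python's re treats these two patterns the same way Oracle does for the sample values, which is not verified against Oracle here.

```python
import re

# Sample ITEMNO values from the schema setup above.
items = ['1-WA01-0009-01', '1-WA01-0009-02', '%&*^$%"£*&%-01', '%&*^$%"£*&%-02']

unanchored = re.compile(r'[A-Z0-9\-]+01$')
anchored = re.compile(r'^[A-Z0-9\-]+01$')

# Unanchored: one allowed character before the final 01 is enough.
loose = [s for s in items if unanchored.search(s)]
# Anchored: every character before the final 01 must be allowed.
strict = [s for s in items if anchored.search(s)]
print(loose)
print(strict)
```

The unanchored pattern accepts the third value because the hyphen before 01 already satisfies the character class.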

notepad++: keep regex (multi occurence per line) and line structure, remove other characters

I have a 130k line text file with patent information and I just want to keep the dates (regex "[0-9]{4}-[0-9]{2}-[0-9]{2} ") for subsequent work in Excel. For this purpose I need to keep the line structure intact (also blank lines). My main problem is that I can't seem to find a way to identify and keep multiple occurrences of date information in the same line while deleting all other information.
Original file structure:
US20110228428A1 | US | | 7 | 2010-03-19 | SEAGATE TECHNOLOGY LLC
US20120026629A1 | US | | 7 | 2010-07-28 | TDK CORP | US20120127612A1 | US | | EXAMINER | 2010-11-24 | | US20120147501A1 | US | | 2 | 2010-12-09 | SAE MAGNETICS HK LTD,HEADWAY TECHNOLOGIES INC
Desired file structure:
2010-03-19
2010-07-28 2010-11-24 2010-12-09
Thank you for your help!
Search for
.*?(?:([0-9]{4}-[0-9]{2}-[0-9]{2})|$)
And replace with
" $1"
Don't include the quotes; they are only there to show that there is a space before the $1. This will also put a space before the first match in a row.
This regex will match as little as possible (.*?) before it finds either a date or the end of the row (the $). If a date is found, it is stored in $1 because of the parentheses around it. So as the replacement, just put a space to separate the found dates, followed by the found date from $1.
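If the file can be processed outside Notepad++, a different technique gives the same line structure: collect all dates per line and join them with spaces. The sketch below is a Python alternative, not the Notepad++ replacement itself; keep_dates is a hypothetical helper name.

```python
import re

DATE = re.compile(r'[0-9]{4}-[0-9]{2}-[0-9]{2}')

def keep_dates(text):
    """Keep only the dates on each line; lines without dates become blank."""
    return '\n'.join(' '.join(DATE.findall(line)) for line in text.split('\n'))

sample = ('US20110228428A1 | US | | 7 | 2010-03-19 | SEAGATE TECHNOLOGY LLC\n'
          '\n'
          'US20120026629A1 | US | | 7 | 2010-07-28 | TDK CORP | US20120127612A1'
          ' | US | | EXAMINER | 2010-11-24 | |')
cleaned = keep_dates(sample)
print(cleaned)
```

Splitting on the newline and joining again preserves blank lines, so the line count of the output matches the input.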

Regex Match and Replace path to file

I am trying to do a regex match and replace for hotfile.com links to mp3 files I have in my database (WordPress).
I used to use hotfile for streaming mp3 files on my site, but now I have switched to a CDN. Could someone kindly help me out with this:
Replace: http//hotfile.com/dl/157490069/c8732d4/mp3_file_name.mp3
With: http//p.music.cdndomain.com/vod/music.folder/2010/mp3_file_name.mp3
I have been trying a simple search and replace like this: http//hotfile.com/dl/%/%/, but it's not working.
It would have been easier to perform a search and replace if hotfile.com didn't have different folders for all files. Below are two examples of the problem:
http//hotfile.com/dl/155490069/c7932d4/
http//hotfile.com/dl/165490070/c8745e7/
I have over 500 files to replace.
Thanks
Since you must be using MySQL for your WordPress database, you can do this replacement either by regex as you asked:
Regex pattern : #http://(www.)?hotfile.com/\w+/\w+/\w+/#
Replacement pattern: http//p.music.cdndomain.com/vod/music.folder/2010/
An alternative, simpler solution would be to extract the mp3 file name using MySQL's simple string functions, e.g.
Use SUBSTRING or SUBSTRING_INDEX to extract the file name of your mp3 file, i.e. find the string after the last occurrence of "/" in your hotfile URL.
Use CONCAT to append the retrieved file name to the new URL prefix and update it in the database.
Here is an example, you can appropriately change it for your database:
mysql> select * from test_songs;
+---------------------------------------------------------------+
| song_url |
+---------------------------------------------------------------+
| http://hotfile.com/dl/157490069/c8732d4/mp3_file_name.mp3 |
| http://www.hotfile.com/dl/123412312/dd732d4/mp3_song_name.mp3 |
+---------------------------------------------------------------+
Taking substrings:
mysql> select SUBSTRING_INDEX(song_url,"/",-1) from test_songs;
+----------------------------------+
| SUBSTRING_INDEX(song_url,"/",-1) |
+----------------------------------+
| mp3_file_name.mp3 |
| mp3_song_name.mp3 |
+----------------------------------+
2 rows in set (0.03 sec)
Creating final update query:
mysql> Update test_songs set song_url =
CONCAT("http//p.music.cdndomain.com/vod/music.folder/2010/",
SUBSTRING_INDEX(song_url,"/",-1)) ;
Query OK, 2 rows affected (0.00 sec)
Rows matched: 2 Changed: 2 Warnings: 0
Checking the results :
mysql> select * from test_songs;
+---------------------------------------------------------------------+
| song_url |
+---------------------------------------------------------------------+
| http//p.music.cdndomain.com/vod/music.folder/2010/mp3_file_name.mp3 |
| http//p.music.cdndomain.com/vod/music.folder/2010/mp3_song_name.mp3 |
+---------------------------------------------------------------------+
2 rows in set (0.00 sec)
Done !
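For checking the result before running the UPDATE, the SUBSTRING_INDEX plus CONCAT steps can be mirrored in Python. This is a sketch only; move_to_cdn is a hypothetical helper, and the prefix is copied verbatim from the query above.

```python
# New URL prefix, copied as written in the UPDATE statement above.
NEW_PREFIX = 'http//p.music.cdndomain.com/vod/music.folder/2010/'

def move_to_cdn(song_url):
    # Equivalent of SUBSTRING_INDEX(song_url, "/", -1): text after the last "/".
    file_name = song_url.rsplit('/', 1)[-1]
    # Equivalent of CONCAT(prefix, file_name).
    return NEW_PREFIX + file_name

old_urls = ['http://hotfile.com/dl/157490069/c8732d4/mp3_file_name.mp3',
            'http://www.hotfile.com/dl/123412312/dd732d4/mp3_song_name.mp3']
new_urls = [move_to_cdn(u) for u in old_urls]
print(new_urls)
```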
Something as simple as http://regex101.com/r/lK9wH4 should work:
/^.+\/(.+)$/ and replace with <your_new_url>\1.
Good luck.
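A minimal Python sketch of that regex replacement, assuming the pattern is applied to one URL at a time; replace_path and its new_prefix parameter are hypothetical names standing in for <your_new_url>.

```python
import re

def replace_path(url, new_prefix):
    # Greedy ^.+/ consumes everything up to the last slash; group 1 is the file name.
    return re.sub(r'^.+/(.+)$', lambda m: new_prefix + m.group(1), url)

result = replace_path('http://hotfile.com/dl/157490069/c8732d4/mp3_file_name.mp3',
                      'http//p.music.cdndomain.com/vod/music.folder/2010/')
print(result)
```

A function is used for the replacement so that any backslashes in the new prefix are not interpreted as group references.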
You can use Notepad++ to search and replace across all your files.
For this particular sample:
Search and replace in regex mode:
search "http//hotfile.com/(.*)/(.*.mp3) "
replace "http//p.music.cdndomain.com/vod/music.folder/2010/\2 "
Remove the quote marks but keep the space at the end.