Extract domain from url using PostgreSQL

Extract domain from url using PostgreSQL - regex

I need to extract the domain name for a list of urls using PostgreSQL. In the first version, I tried using REGEXP_REPLACE to replace unwanted characters like www., biz., sports., etc. to get the domain name.
SELECT REGEXP_REPLACE(url, '^((www|www2|www3|static1|biz|health|travel|property|edu|world|newmedia|digital|ent|staging|cpelection|dev|m-staging|m|maa|cdnnews|testing|cdnpuc|shipping|sports|life|static01|cdn|dev1|ad|backends|avm|displayvideo|tand|static03|subscriptionv3|mdev|beta)\.)?', '') AS "Domain",
COUNT(DISTINCT(user)) AS "Unique Users"
FROM db
GROUP BY 1
ORDER BY 2 DESC;
This is hard to maintain, because the query has to be updated every time a new unwanted prefix appears.
I also tried https://stackoverflow.com/a/21174423/10174021 to extract from the end of the string using PostgreSQL REGEXP_SUBSTR, but I'm getting blank rows in return. Is there a better way of doing this?
A dataset sample to try with:
CREATE TABLE sample (
url VARCHAR(100) NOT NULL);
INSERT INTO sample (url)
VALUES
('sample.co.uk'),
('www.sample.co.uk'),
('www3.sample.co.uk'),
('biz.sample.co.uk'),
('digital.testing.sam.co'),
('sam.co'),
('m.sam.co');
Desired output
+------------------------+--------------+
| url | domain |
+------------------------+--------------+
| sample.co.uk | sample.co.uk |
| www.sample.co.uk | sample.co.uk |
| www3.sample.co.uk | sample.co.uk |
| biz.sample.co.uk | sample.co.uk |
| digital.testing.sam.co | sam.co |
| sam.co | sam.co |
| m.sam.co | sam.co |
+------------------------+--------------+

I found a solution based on Jeremy's and Rémy Baron's answers:
1. Extract all the public suffixes from the Public Suffix List and store them in a table, which I named tlds.
2. Get the unique urls in the dataset and match each one to its longest matching TLD.
3. Extract the domain name using regexp_replace (used in this query) or, alternatively, regexp_substr(t1.url, '([a-z]+)(.)'||t1."tld").
The SQL query is as below:
WITH stored_tld AS(
SELECT
DISTINCT(s.url),
FIRST_VALUE(t.domain) over (PARTITION BY s.url ORDER BY length(t.domain) DESC
rows between unbounded preceding and unbounded following) AS "tld"
FROM sample s
JOIN tlds t
ON (s.url like '%%'||domain))
SELECT
t1.url,
CASE WHEN t1."tld" IS NULL THEN t1.url ELSE regexp_replace(t1.url,'(.*\.)((.[a-z]*).*'||replace(t1."tld",'.','\.')||')','\2')
END AS "extracted_domain"
FROM(
SELECT a.url,st."tld"
FROM sample a
LEFT JOIN stored_tld st
ON a.url = st.url
)t1

You can try this:
with tlds as (
select * from (values('.co.uk'),('.co'),('.uk')) a(tld)
) ,
sample as (
select * from (values ('sample.co.uk'),
('www.sample.co.uk'),
('www3.sample.co.uk'),
('biz.sample.co.uk'),
('digital.testing.sam.co'),
('sam.co'),
('m.sam.co')
) a(url)
)
select url,regexp_replace(url,'(.*\.)(.*'||replace(tld,'.','\.')||')','\2') "domain" from (
select distinct url,first_value(tld) over (PARTITION BY url order by length(tld) DESC) tld
from sample join tlds on (url like '%'||tld)
) a
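To see the longest-suffix idea outside the database, here is a minimal Python sketch of the same logic. The function name extract_domain and the SUFFIXES list are hypothetical stand-ins for the query and the tlds table; unlike the inner join above, this sketch returns the input unchanged when no suffix matches.

```python
def extract_domain(host, suffixes):
    """Return the label directly before the longest matching suffix, plus that suffix."""
    # Same idea as joining on "url like '%' || tld": keep the suffixes the host ends with.
    matches = [s for s in suffixes if host.endswith(s)]
    if not matches:
        return host  # assumption: leave hosts with no known suffix unchanged
    suffix = max(matches, key=len)  # longest suffix wins, e.g. ".co.uk" over ".uk"
    head = host[: -len(suffix)]  # e.g. "www3.sample"
    label = head.rsplit(".", 1)[-1]  # last label before the suffix, e.g. "sample"
    return label + suffix


# Hypothetical stand-in for the tlds table; each entry keeps its leading dot.
SUFFIXES = [".co.uk", ".co", ".uk"]

for host in ["sample.co.uk", "www3.sample.co.uk", "digital.testing.sam.co", "m.sam.co"]:
    print(host, "->", extract_domain(host, SUFFIXES))
```

Keeping the leading dot on every suffix means that a host such as "sam.co" is matched by ".co" but a host such as "falco" is not.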

I use split_part(url,'/',3) for this:
select split_part('https://stackoverflow.com/questions/56019744', '/', 3) ;
output
stackoverflow.com
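Note that taking the third '/'-separated field relies on the value starting with a scheme such as https://, and it returns the full host rather than the registrable domain. As a rough Python sketch of the same step that also accepts bare hosts like the ones in the question (host_of is a hypothetical helper name, and the "//" check is a simplifying assumption):

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the host part of a URL, with or without a scheme."""
    # urlsplit only recognises a host after a "//" authority marker,
    # so add one for bare hosts such as "www.sample.co.uk".
    if "//" not in url:
        url = "//" + url
    return urlsplit(url).hostname


print(host_of("https://stackoverflow.com/questions/56019744"))
print(host_of("www.sample.co.uk"))
```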

Related

How to use regexp_replace in spark.sql() to extract hashtags from string

I need to write a regexp_replace query in spark.sql() and I'm not sure how to handle it. For readability purposes, I have to use SQL for it. I am trying to pull out the hashtags from the table. I know how to do this in Python, but most of my team are SQL users.
My dataframe example looks like so:
Insta_post
Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House…
RT #NALCABPolicy: Meeting with #RepDarrenSoto . Thanks for taking the time to meet with #LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…
RT #Tharryry: I am delighted that #RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…
My code:
I create a tempview:
post_df.createOrReplaceTempView("post_tempview")
post_df = spark.sql("""
select
regexp_replace(Insta_post, '.*?(.|'')(#)(\w+)', '$1') as a
from post_tempview
where Insta_post like '%#%'
""")
My end result:
+--------------------------------------------------------------------------------------------------------------------------------------------+
|a |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House… |
|RT #NALCABPolicy: Meeting with #RepDarrenSoto . Thanks for taking the time to meet with #LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…|
|RT #Tharryry: I am delighted that #RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…|
+--------------------------------------------------------------------------------------------------------------------------------------------+
desired result:
+---------------------------------+
|a |
+---------------------------------+
| #SaveTheInternet, #NetNeutrality|
| #NALCABPolicy2018 |
| #NetNeutrality |
+---------------------------------+
I haven't really used regexp_replace too much so this is new to me. Any help would be appreciated as well as an explanation of how to structure the subsets!

For Spark 3.1+, you can use regexp_extract_all function to extract multiple matches:
post_df = spark.sql("""
select regexp_extract_all(Insta_post, '(#\\\\w+)', 1) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+----------------------------------+
#|a |
#+----------------------------------+
#|[#SaveTheInternet, #NetNeutrality]|
#|[#NALCABPolicy2018] |
#|[#NetNeutrality] |
#+----------------------------------+
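To experiment with the hashtag pattern outside Spark, a plain-Python sketch of the same "collect every match" idea may help. It uses Python's re module rather than Spark's Java regex engine, so treat it as an approximation: the two engines can differ on which non-ASCII characters \w accepts. The name extract_hashtags is hypothetical.

```python
import re

HASHTAG = re.compile(r"#\w+")  # a '#' followed by one or more word characters


def extract_hashtags(text):
    """Return every hashtag found in text, joined with ', '."""
    return ", ".join(HASHTAG.findall(text))


post = ("Today, Senate Dems vote to #SaveTheInternet. Proud to support "
        "similar #NetNeutrality legislation here in the House…")
print(extract_hashtags(post))  # #SaveTheInternet, #NetNeutrality
```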
For Spark <3.1, you can use regexp_replace to remove all that doesn't match the hashtag pattern :
post_df = spark.sql("""
select trim(trailing ',' from regexp_replace(Insta_post, '.*?(#\\\\w+)|.*', '$1,')) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+-------------------------------+
#|a |
#+-------------------------------+
#|#SaveTheInternet,#NetNeutrality|
#|#NALCABPolicy2018 |
#|#NetNeutrality |
#+-------------------------------+
Note the use of trim to remove the unnecessary trailing commas left by the replacement string '$1,'.

Do you really need a view? Because the following code might do it:
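from pyspark.sql import functions as F

df = df.filter(F.col('Insta_post').like('%#%'))
# Keep only the hashtags, each followed by a space, then trim the trailing spaces
col_trimmed = F.trim(F.regexp_replace('Insta_post', r'.*?(#\w+)|.+', '$1 '))
# Turn each remaining space into ', '
df = df.select(F.regexp_replace(col_trimmed, r'\s', ', ').alias('a'))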
df.show(truncate=False)
# +--------------------------------+
# |a |
# +--------------------------------+
# |#SaveTheInternet, #NetNeutrality|
# |#NALCABPolicy2018 |
# |#NetNeutrality |
# +--------------------------------+
I ended up using two regexp_replace calls, so there may well be a better alternative; I just couldn't think of one.

Compare fields within relationship on Django ORM

I have two models, Route and Stop.
A route can have several stops; each stop has a location name and a number. Within a route, stop numbers are unique.
The problem:
I need to find the routes that contain two given stops, where the number of one stop is less than the number of the other.
Consider the following models:
from django.db import models

class Route(models.Model):
    name = models.CharField(max_length=20)

class Stop(models.Model):
    route = models.ForeignKey(Route)
    number = models.PositiveSmallIntegerField()
    location = models.CharField(max_length=45)
And the following data:
Stop table
| id | route_id | number | location |
|----|----------|--------|----------|
| 1 | 1 | 1 | 'A' |
| 2 | 1 | 2 | 'B' |
| 3 | 1 | 3 | 'C' |
| 4 | 2 | 1 | 'C' |
| 5 | 2 | 2 | 'B' |
| 6 | 2 | 3 | 'A' |
For example:
Given two locations 'A' and 'B', find the routes that contain both locations and where A.number is less than B.number.
With the data above, this should match route id 1 and not route id 2.
In raw SQL, this works with a single query:
SELECT
`route`.id
FROM
`route`
LEFT JOIN `stop` stop_from ON stop_from.`route_id` = `route`.`id`
LEFT JOIN `stop` stop_to ON stop_to.`route_id` = `route`.`id`
WHERE
stop_from.`stop_location_id` = 'A'
AND stop_to.`stop_location_id` = 'B'
AND stop_from.stop_number < stop_to.stop_number
Is it possible to do this with a single query in the Django ORM as well?

Generally, ORM frameworks like Django ORM, SQLAlchemy and even Hibernate are not designed to generate the most efficient query automatically. There is a way to write this query using only model objects; however, since I had a similar issue, I would suggest using a raw query for more complex cases. Here is the documentation for raw queries in Django:
https://docs.djangoproject.com/en/1.11/topics/db/sql/
You can write your query in many ways, but something like the following could help.
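from django.db import connection

def my_custom_sql(self):
    with connection.cursor() as cursor:
        # Triple quotes let the SQL span several lines; the %s placeholders
        # are filled in by the database driver from the list that follows.
        cursor.execute("""
            SELECT
                `route`.id
            FROM
                `route`
            LEFT JOIN `stop` stop_from ON stop_from.`route_id` = `route`.`id`
            LEFT JOIN `stop` stop_to ON stop_to.`route_id` = `route`.`id`
            WHERE
                stop_from.`stop_location_id` = %s
                AND stop_to.`stop_location_id` = %s
                AND stop_from.stop_number < stop_to.stop_number
            """, ['A', 'B'])
        # fetchone() returns only the first matching row;
        # use fetchall() to get every matching route.
        row = cursor.fetchone()
    return row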
hope this helps.
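To check the self-join logic against the sample data without a Django project, here is a small sketch using Python's built-in sqlite3 module. The table and column names follow the Stop table shown in the question (route_id, number, location) rather than the names in the raw query, the route names are made up, and routes_in_order is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE route (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE stop (
        id INTEGER PRIMARY KEY,
        route_id INTEGER REFERENCES route(id),
        number INTEGER,
        location TEXT
    );
    -- Route names are placeholders; the stops mirror the question's sample data.
    INSERT INTO route (id, name) VALUES (1, 'first'), (2, 'second');
    INSERT INTO stop (id, route_id, number, location) VALUES
        (1, 1, 1, 'A'), (2, 1, 2, 'B'), (3, 1, 3, 'C'),
        (4, 2, 1, 'C'), (5, 2, 2, 'B'), (6, 2, 3, 'A');
""")


def routes_in_order(conn, location_from, location_to):
    """Return ids of routes where location_from comes before location_to."""
    rows = conn.execute(
        """
        SELECT route.id
        FROM route
        JOIN stop AS stop_from ON stop_from.route_id = route.id
        JOIN stop AS stop_to ON stop_to.route_id = route.id
        WHERE stop_from.location = ?
          AND stop_to.location = ?
          AND stop_from.number < stop_to.number
        """,
        (location_from, location_to),
    ).fetchall()
    return [row[0] for row in rows]


print(routes_in_order(conn, "A", "B"))  # [1]
```

The sketch uses inner joins: because the WHERE clause filters on both joined tables, the LEFT JOINs in the original query behave the same way.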

Hive: String split based on backslash \

I have a table with column named path which contains values with backslash:
\ModuleCalData\ComputerName
\ModuleCalData\StartTime
\ModuleCalData\EndTime
\ModuleCalData\SummaryParameters\TextMeasured\Value
\ModuleCalDataSummaryParameters\TextMeasured\Name
I'm trying to split and access each element separately. The query is
select split(path,'\\')[0] from test_data_tag;
This query is erroring out
Failed with exception
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
Error evaluating split(path, '\')[0]
Can anyone explain how to split the string on \ in Hive?

select path
,split(path,'\\\\') as split_path
from mytable
;
+-----------------------------+-------------------------------------+
| path | split_path |
+-----------------------------+-------------------------------------+
| \ModuleCalData\ComputerName | ["","ModuleCalData","ComputerName"] |
+-----------------------------+-------------------------------------+
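The four backslashes are needed because the pattern passes through two layers of escaping: the string literal turns '\\\\' into two backslash characters, and the regular expression engine then reads those two characters as one literal backslash. The Python sketch below shows the same two layers with Python's re module; it is only an analogy, under the assumption that Hive's literal and regex handling behave the same way here.

```python
import re

# In a normal (non-raw) literal, "\\" is one backslash character,
# so this is the text \ModuleCalData\ComputerName
path = "\\ModuleCalData\\ComputerName"

# Layer 1: the literal "\\\\" holds two backslash characters.
# Layer 2: the regex engine reads those two characters as one literal backslash.
pattern = "\\\\"
print(len(pattern))              # 2
print(re.split(pattern, path))   # ['', 'ModuleCalData', 'ComputerName']

# With only "\\" the regex engine receives a single backslash,
# which is an incomplete escape and is rejected as a pattern.
try:
    re.split("\\", path)
except re.error as exc:
    print("invalid pattern:", exc)
```

The leading empty string in the result appears because the path starts with the separator, which matches the Hive output above.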

Regex to match starts with

I need to update a table, setting the attribute match to True where attribute_a starts with the value of attribute_b.
Somehow I can't get the correct syntax in PostgreSQL to do this pattern match.
UPDATE table
SET match= True
WHERE attribute_a ~ '^attribute_b' ;
E.g. match should be True for attribute_a = 'Nelson Mandela' and attribute_b = 'Nelson'.

You do not need pattern matching, use left(), e.g.:
with my_table(attribute_a, attribute_b) as (
values
('Nelson Mandela', 'Nelson'),
('Donald Trump', 'Donald Duck'),
('John Major', 'John M')
)
select *
from my_table
where attribute_b = left(attribute_a, length(attribute_b));
attribute_a | attribute_b
----------------+-------------
Nelson Mandela | Nelson
John Major | John M
(2 rows)
If you absolutely want to use regex, you have to build the pattern with concat() or format(), like this:
select *
from my_table
where attribute_a ~ concat('^', attribute_b)
-- where attribute_a ~ format('^%s', attribute_b)
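One difference between the two approaches is worth keeping in mind: a pattern built by concatenation treats any regex metacharacters in attribute_b (such as . or +) as operators, while the left() comparison is always literal. A short Python sketch of both checks on the same sample rows, with hypothetical helper names, where re.escape plays the role of escaping the value before building the pattern:

```python
import re


def starts_with(a, b):
    """Literal prefix check, like attribute_b = left(attribute_a, length(attribute_b))."""
    return a.startswith(b)


def starts_with_regex(a, b):
    """Regex prefix check; re.escape makes metacharacters in b literal."""
    return re.match(re.escape(b), a) is not None


rows = [
    ("Nelson Mandela", "Nelson"),
    ("Donald Trump", "Donald Duck"),
    ("John Major", "John M"),
]
for a, b in rows:
    print(a, "|", b, "|", starts_with(a, b), starts_with_regex(a, b))

# Without escaping, "." matches any character, so "1.5" would also match "105".
print(re.match("1.5", "105") is not None, starts_with_regex("105", "1.5"))
```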

search for specific characters within column and then create different columns from it

I have a param_Value column that holds different kinds of values. I need to extract these values and create a separate column for each kind.
| PARAM_NAME | param_Value |
|------------|-------------|
| Step 4     | SP:0.09     |
| Procedure  | MAX:125     |
| Step 4     | SP:Ambient  |
| (null)     | +/-:N/A     |
| Steam      | SP:2        |
| Step 3     | MIN:0       |
| Step 4     | RDPHN427B   |
| Testing De | N/A         |
I only want columns for the following prefixes, named as shown:
SP: SET_POINT_VALUE,
MAX: MAX_LIMIT,
MIN: MIN_LIMIT,
+/-: UPPER_LOWER_LIMIT
So what I have so far is:
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME,
REGEXP_LIKE("param_Value", 'SP:') SET_POINT_VALUE,
REGEXP_LIKE("param_Value", '+/-:') UPPER_LOWER_LIMIT,
REGEXP_LIKE("param_Value", 'MAX:') MAX_VALUE,
REGEXP_LIKE("param_Value", 'MIN:') MIN_VALUE
FROM PROCESS_STEPS
;

I'm more familiar with TSQL and MySQL, but this ought to do what I think you're looking for. If it doesn't exactly, it should at least point you in the right direction.
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME
, CASE WHEN "param_Value" LIKE 'SP:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END SET_POINT_VALUE
, CASE WHEN "param_Value" LIKE '+/-:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END UPPER_LOWER_LIMIT
, CASE WHEN "param_Value" LIKE 'MAX:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END MAX_VALUE
, CASE WHEN "param_Value" LIKE 'MIN:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END MIN_VALUE
FROM PROCESS_STEPS
;
The basic concept here is to identify the information you want via LIKE, then use SUBSTR and INSTR to extract it. LIKE is normally something to stay away from, but since there's no leading % in your patterns, the predicate is sargable, and thus probably not a total efficiency sink.
Really, though, I have to ask you to question why you're laying out your data like this - substring operations are slow in any language, and a DB is no exception. Why not use another column for your limit type? Why not lay it out in the view you're currently looking at?
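The same "check the prefix, then take everything after the colon" logic can be sketched in Python to test it on the sample values. The mapping uses the column names from the view definition; split_param is a hypothetical helper, and unlike the SQL above it strips surrounding whitespace first.

```python
PREFIX_TO_COLUMN = {
    "SP": "SET_POINT_VALUE",
    "+/-": "UPPER_LOWER_LIMIT",
    "MAX": "MAX_VALUE",
    "MIN": "MIN_VALUE",
}


def split_param(value):
    """Map one param_Value string to the four output columns."""
    row = dict.fromkeys(PREFIX_TO_COLUMN.values())  # every column starts as None
    # partition splits on the first ':' only, like INSTR(value, ':') + 1 in the SQL
    prefix, sep, rest = value.strip().partition(":")
    if sep and prefix in PREFIX_TO_COLUMN:
        row[PREFIX_TO_COLUMN[prefix]] = rest
    return row


for value in ["SP:0.09", "MAX:125", "+/-:N/A", "RDPHN427B", "N/A"]:
    print(value, "->", split_param(value))
```

Values without a recognised prefix, such as RDPHN427B or N/A, leave all four columns empty, which corresponds to the ELSE Null branches in the view.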
