I have two PySpark dataframes that I am trying to join, where some of the values in the columns contain parentheses.
For example one of the values is
"Mangy (Dog)"
If I try joining like so:
df1.join(df2, expr("df1.animal rlike df2.animal_stat"))
I don't get any results.
So I tried filtering using rlike just to see if I am able to capture the values.
The filtering worked on all values except those with parentheses. For example, when I try to filter like so:
df.filter(col('animal').rlike("Mangy (Dog)")).show()
I don't get any results.
However, if I filter with rlike("Mangy") or rlike("(Dog)") it seems to work, even though I specified parentheses in "(Dog)".
Is there a way to make rlike include parentheses in its matches?
EDIT:
I have two dataframes, df1 and df2, like so:
+-----------------+-------+
| animal| origin|
+-----------------+-------+
| mangy (dog)|Streets|
| Cat| house|
|[Bumbling] Bufoon| Utopia|
| Cheetah| Congo|
|(Sprawling) Snake| Amazon|
+-----------------+-------+
+-------------------+-----------+
| animal_stat|destination|
+-------------------+-----------+
| ^dog$| House|
| ^Cat$| Streets|
|^[Bumbling] Bufoon$| Circus|
| ^Cheetah$| Zoo|
| ^(Sprawling)$| Glass Box|
+-------------------+-----------+
I am trying to join the two on rlike using the following method:
from pyspark.sql.functions import expr

dff1 = df1.alias('dff1')
dff2 = df2.alias('dff2')
combine = (dff1.join(dff2, expr("dff1.animal rlike dff2.animal_stat"), how='left')
           .drop(dff2.animal_stat))
I would like the output dataframe to be like so:
+-----------------+-------+-----------+
| animal| origin|destination|
+-----------------+-------+-----------+
| mangy (dog)|Streets| House|
| Cat| house| Streets|
|[Bumbling] Bufoon| Utopia| Circus|
| Cheetah| Congo| Zoo|
|(Sprawling) Snake| Amazon| Glass Box|
+-----------------+-------+-----------+
Edit:
import pyspark.sql.functions as F

combine = df1.alias('df1').join(
    df2.withColumn(
        'animal_stat',
        F.regexp_replace(
            F.regexp_replace(
                F.regexp_replace(
                    F.regexp_replace('animal_stat', '\\(', '\\\\('),
                    '\\)', '\\\\)'),
                '\\[', '\\\\['),
            '\\]', '\\\\]')
    ).alias('df2'),
    F.expr('df1.animal rlike df2.animal_stat'),
    'left'
)
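A possibly tidier equivalent escapes all four metacharacters in one pass with a character class (a sketch; the behavior is assumed to match the chained version above):
import pyspark.sql.functions as F

# Sketch: capture any of ( ) [ ] and prefix it with a literal backslash.
# $1 is the captured character; '\\\\' (after Python string processing)
# becomes a single literal backslash in the Java regex replacement.
df2_escaped = df2.withColumn(
    'animal_stat',
    F.regexp_replace('animal_stat', '([()\\[\\]])', '\\\\$1')
)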
If you're not using any regex, you probably want to use like instead of rlike. For example, you can do
df1.join(df2, expr("df1.animal like concat('%', df2.animal_stat, '%')"))
To do a filter, you can try
df.filter(col('animal').like("%Mangy (Dog)%")).show()
.rlike() is the same as .like() except it uses regex, and in regex parentheses are metacharacters, so you need to escape them. Try filtering like this:
df.filter(col('animal').rlike(r"Mangy \(Dog\)")).show()
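If the literal you're matching might contain other regex metacharacters as well, Python's re.escape can escape them all at once (a sketch; it assumes you're matching a literal value, not a pattern):
import re
from pyspark.sql.functions import col

literal = "Mangy (Dog)"

# re.escape backslash-escapes every regex metacharacter in the literal,
# so rlike treats it as plain text rather than as a pattern.
df.filter(col('animal').rlike(re.escape(literal))).show()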
Not sure I can help with the original join issue without some sample data.
Related
I'm looking for some help. I'm loading some tables in AWS using PySpark, and when I look at the results it shows this:
+-----------+--------+------+-------------+
| Name|LastName|Gender| Birth|
+-----------+--------+------+-------------+
| Javier| ;Leo| n|;M;1999-09-09|
+-----------+--------+------+-------------+
Obviously that isn't the result I want; I need the correct format without the ";":
+-----------+--------+------+-------------+
| Name|LastName|Gender| Birth|
+-----------+--------+------+-------------+
| Javier| Leon| M| 1999-09-09|
+-----------+--------+------+-------------+
I'm reading the file like this:
input_df = spark.read.csv(tables_map[k], header=True, sep=";", encoding="iso-8859-1")
but for some reason the sep attribute doesn't seem to work.
So I was wondering if anyone knows a way to remove the ";". I appreciate your time and thank you!
Note: sorry if I wrote something wrong, English is not my mother tongue.
If you are certain that ';' never appears as a meaningful value anywhere in your data, you can use this:
import pyspark.sql.functions as F
df = input_df.withColumn('LastName', F.regexp_replace('LastName', ';', ''))
regexp_replace docs
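Since the stray ';' shows up in several columns, not just LastName, you could apply the same replacement to every column at once (a sketch; it assumes all columns are string-typed):
import pyspark.sql.functions as F

# Sketch: strip ';' from every column in a single select.
cleaned_df = input_df.select(
    [F.regexp_replace(F.col(c), ';', '').alias(c) for c in input_df.columns]
)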
I need to write a regexp_replace query in spark.sql() and I'm not sure how to handle it. For readability purposes, I have to use SQL for it. I am trying to pull the hashtags out of the table. I know how to do this using the Python method, but most of my team are SQL users.
My dataframe example looks like so:
Insta_post
Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House…
RT #NALCABPolicy: Meeting with #RepDarrenSoto . Thanks for taking the time to meet with #LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…
RT #Tharryry: I am delighted that #RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…
My code:
I create a tempview:
post_df.createOrReplaceTempView("post_tempview")
post_df = spark.sql("""
select
regexp_replace(Insta_post, '.*?(.|'')(#)(\w+)', '$1') as a
from post_tempview
where Insta_post like '%#%'
""")
My end result:
+--------------------------------------------------------------------------------------------------------------------------------------------+
|a |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House… |
|RT #NALCABPolicy: Meeting with #RepDarrenSoto . Thanks for taking the time to meet with #LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…|
|RT #Tharryry: I am delighted that #RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…|
+--------------------------------------------------------------------------------------------------------------------------------------------+
desired result:
+---------------------------------+
|a |
+---------------------------------+
| #SaveTheInternet, #NetNeutrality|
| #NALCABPolicy2018 |
| #NetNeutrality |
+---------------------------------+
I haven't really used regexp_replace much, so this is new to me. Any help would be appreciated, as well as an explanation of how to structure the subsets!
For Spark 3.1+, you can use the regexp_extract_all function to extract multiple matches:
post_df = spark.sql("""
select regexp_extract_all(Insta_post, '(#\\\\w+)', 1) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+----------------------------------+
#|a |
#+----------------------------------+
#|[#SaveTheInternet, #NetNeutrality]|
#|[#NALCABPolicy2018] |
#|[#NetNeutrality] |
#+----------------------------------+
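If you'd rather have the comma-separated strings shown in the desired output than arrays, array_join can flatten the result (a sketch building on the query above):
post_df = spark.sql("""
    select array_join(regexp_extract_all(Insta_post, '(#\\\\w+)', 1), ', ') as a
    from post_tempview
    where Insta_post like '%#%'
""")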
For Spark <3.1, you can use regexp_replace to remove everything that doesn't match the hashtag pattern:
post_df = spark.sql("""
select trim(trailing ',' from regexp_replace(Insta_post, '.*?(#\\\\w+)|.*', '$1,')) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+-------------------------------+
#|a |
#+-------------------------------+
#|#SaveTheInternet,#NetNeutrality|
#|#NALCABPolicy2018 |
#|#NetNeutrality |
#+-------------------------------+
Note the use of trim to remove the unnecessary trailing comma created by the replacement '$1,'.
Do you really need a view? Because the following code might do it:
import pyspark.sql.functions as F

df = df.filter(F.col('Insta_post').like('%#%'))
col_trimmed = F.trim(F.regexp_replace('Insta_post', r'.*?(#\w+)|.+', '$1 '))
df = df.select(F.regexp_replace(col_trimmed, r'\s', ', ').alias('a'))
df.show(truncate=False)
# +--------------------------------+
# |a |
# +--------------------------------+
# |#SaveTheInternet, #NetNeutrality|
# |#NALCABPolicy2018 |
# |#NetNeutrality |
# +--------------------------------+
I ended up using two regexp_replace calls, so there may well be a better alternative; I just couldn't think of one.
I need to extract the domain name for a list of urls using PostgreSQL. In the first version, I tried using REGEXP_REPLACE to strip unwanted prefixes like www., biz., sports., etc. to get the domain name.
SELECT REGEXP_REPLACE(url, '^((www|www2|www3|static1|biz|health|travel|property|edu|world|newmedia|digital|ent|staging|cpelection|dev|m-staging|m|maa|cdnnews|testing|cdnpuc|shipping|sports|life|static01|cdn|dev1|ad|backends|avm|displayvideo|tand|static03|subscriptionv3|mdev|beta)\.)?', '') AS "Domain",
COUNT(DISTINCT(user)) AS "Unique Users"
FROM db
GROUP BY 1
ORDER BY 2 DESC;
This seems unfavorable, as the query needs to be constantly updated as the list of unwanted words grows.
I did try https://stackoverflow.com/a/21174423/10174021 to extract from the end of the line using PostgreSQL REGEXP_SUBSTR, but I'm getting blank rows in return. Is there a better way of doing this?
A dataset sample to try with:
CREATE TABLE sample (
    url VARCHAR(100) NOT NULL);
INSERT INTO sample (url)
VALUES
    ('sample.co.uk'),
    ('www.sample.co.uk'),
    ('www3.sample.co.uk'),
    ('biz.sample.co.uk'),
    ('digital.testing.sam.co'),
    ('sam.co'),
    ('m.sam.co');
Desired output
+------------------------+--------------+
| url | domain |
+------------------------+--------------+
| sample.co.uk | sample.co.uk |
| www.sample.co.uk | sample.co.uk |
| www3.sample.co.uk | sample.co.uk |
| biz.sample.co.uk | sample.co.uk |
| digital.testing.sam.co | sam.co |
| sam.co | sam.co |
| m.sam.co | sam.co |
+------------------------+--------------+
So, I've found the solution using Jeremy and Rémy Baron's answers:
1. Extract all the public suffixes from the public suffix list and store them in a table, which I labelled tlds.
2. Get the unique urls in the dataset and match each one to its TLD.
3. Extract the domain name using regexp_replace (used in the query below) or, alternatively, regexp_substr(t1.url, '([a-z]+)(.)'||t1."tld").
The SQL query is as below:
WITH stored_tld AS (
    SELECT
        DISTINCT(s.url),
        FIRST_VALUE(t.domain) OVER (PARTITION BY s.url ORDER BY length(t.domain) DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "tld"
    FROM sample s
    JOIN tlds t
        ON (s.url LIKE '%'||t.domain)
)
SELECT
    t1.url,
    CASE WHEN t1."tld" IS NULL THEN t1.url
         ELSE regexp_replace(t1.url, '(.*\.)((.[a-z]*).*'||replace(t1."tld", '.', '\.')||')', '\2')
    END AS "extracted_domain"
FROM (
    SELECT a.url, st."tld"
    FROM sample a
    LEFT JOIN stored_tld st
        ON a.url = st.url
) t1
Links to try: SQL Tester
You can try this:
with tlds as (
select * from (values('.co.uk'),('.co'),('.uk')) a(tld)
) ,
sample as (
select * from (values ('sample.co.uk'),
('www.sample.co.uk'),
('www3.sample.co.uk'),
('biz.sample.co.uk'),
('digital.testing.sam.co'),
('sam.co'),
('m.sam.co')
) a(url)
)
select url,regexp_replace(url,'(.*\.)(.*'||replace(tld,'.','\.')||')','\2') "domain" from (
select distinct url,first_value(tld) over (PARTITION BY url order by length(tld) DESC) tld
from sample join tlds on (url like '%'||tld)
) a
I use split_part(url, '/', 3) for this:
select split_part('https://stackoverflow.com/questions/56019744', '/', 3) ;
Output:
stackoverflow.com
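Note that split_part(url, '/', 3) relies on a scheme prefix like https://; for bare hostnames such as the sample data above there is no third field, and PostgreSQL returns an empty string (a quick check):
-- split_part returns '' when the requested field doesn't exist
SELECT split_part('sample.co.uk', '/', 3);                -- ''
SELECT split_part('https://sample.co.uk/page', '/', 3);   -- 'sample.co.uk'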
Trying to create a SPARQL query to extract all the school infobox data on DBpedia. Does that dataset already exist? (I am still confused after trying to read and understand the examples.) It seems like this specific Wikipedia infobox data must exist in DBpedia, but I can't figure out whether it does. If I want to export all college and university infobox data, is it possible to do so easily?
All classes in DBpedia that may have something to do with school or college (YAGO classes omitted for brevity):
select ?cls where {
?cls a owl:Class.
filter(regex(str(?cls), 'college|school', 'i'))
}
Output:
+------------------------------------------+
| cls |
+------------------------------------------+
| http://dbpedia.org/ontology/College |
| http://dbpedia.org/ontology/CollegeCoach |
| http://dbpedia.org/ontology/SambaSchool |
| http://dbpedia.org/ontology/School |
+------------------------------------------+
If we take the http://dbpedia.org/ontology/School as an example, the query to get all the data would be something like
select * where {
?s a <http://dbpedia.org/ontology/School> ;
?p ?o
}
A lot of the data is rdf:type, rdfs:label, owl:sameAs, etc., but to see that the other, more interesting data is also returned, you can try
select * where {
?s a <http://dbpedia.org/ontology/School> .
?s ?p ?o
filter(?p not in (rdf:type, owl:sameAs, rdfs:label, rdfs:comment, rdfs:seeAlso))
}
Note that this query might not return all the data you need; I just wanted to show you how to get started.
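Since the question mentions colleges and universities as well, one way to cover several classes in a single query is VALUES (a sketch; it assumes dbo:University is the relevant class for universities):
select * where {
  values ?cls { <http://dbpedia.org/ontology/School>
                <http://dbpedia.org/ontology/College>
                <http://dbpedia.org/ontology/University> }
  ?s a ?cls ;
     ?p ?o
}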
In an R script, I have a function that creates a data frame of files in a directory that have a specific extension.
The dataframe always has two columns, with as many rows as there are files found with that specific extension.
The data frame ends up looking something like this:
| Path | Filename |
|:------------------------:|:-----------:|
| C:/Path/to/the/file1.ext | file1.ext |
| C:/Path/to/the/file2.ext | file2.ext |
| C:/Path/to/the/file3.ext | file3.ext |
| C:/Path/to/the/file4.ext | file4.ext |
Forgive the archaic way that I express this question. I know that in SQL you can apply where conditions with like instead of =, so I could say `where Filename like '%1%'` and it would pull out all files with a 1 in the name. Is there a way to use something like this to set a variable in R?
I have a couple of different scripts that need to use the Filename pulled from this dataframe. The only reliable way I can think of to tell the script which one to pull is to set a variable like this.
Ultimately I would like these two (pseudo)expressions to yield the same thing.
x <- file1.ext
and
x like '%1%'
should both give x = file1.ext
You can use grepl(), as in this answer:
subset(a, grepl("1", a$filename))
Or, if you're coming from an SQL background, you might want to look into sqldf.
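For example, with sqldf the SQL syntax carries over almost verbatim (a sketch; it assumes your data frame is named df and has the Filename column shown above):
library(sqldf)

# Query the data frame directly with SQL; LIKE works just as in SQL
sqldf("select * from df where Filename like '%1%'")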
You can use %like% from data.table to get SQL-like behaviour here.
From the documentation, see this example:
library(data.table)
DT = data.table(Name=c("Mary","George","Martha"), Salary=c(2,3,4))
DT[Name %like% "^Mar"]
For your problem, suppose you have a data.frame df like this:
path filename
1: C:/Path/to/the/file1.ext file1.ext
2: C:/Path/to/the/file2.ext file2.ext
3: C:/Path/to/the/file3.ext file3.ext
4: C:/Path/to/the/file4.ext file4.ext
do
library(data.table)
DT<-as.data.table(df)
DT[filename %like% "1"]
should give
path filename
1: C:/Path/to/the/file1.ext file1.ext