How to replace or delete a specific character in PySpark

I'm looking for some help. I'm loading some tables in AWS using PySpark, and when I look at the results, I see this:
+-----------+--------+------+-------------+
| Name|LastName|Gender| Birth|
+-----------+--------+------+-------------+
| Javier| ;Leo| n|;M;1999-09-09|
+-----------+--------+------+-------------+
Obviously that isn't the result I want; I need the correct format without the ";":
+-----------+--------+------+-------------+
| Name|LastName|Gender| Birth|
+-----------+--------+------+-------------+
| Javier| Leon| M| 1999-09-09|
+-----------+--------+------+-------------+
I'm reading the file like this:
input_df = spark.read.csv(tables_map[k], header=True, sep=";", encoding="iso-8859-1")
but for some reason the sep parameter doesn't seem to work.
So I was wondering if anyone knows a way to remove the ";". I appreciate your time, thank you!
Note: sorry if I wrote something wrong; English is not my native language.

If you are certain that ';' never appears as meaningful data in your columns, you can strip it with regexp_replace:
import pyspark.sql.functions as F

# regexp_replace takes (column, pattern, replacement): the pattern to remove comes second
df = input_df.withColumn('LastName', F.regexp_replace('LastName', ';', ''))
See the regexp_replace docs.
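As a sanity check on the argument order (pattern before replacement), the same substitution can be sketched outside Spark with Python's re module; the sample values here are made up to mirror the broken row:

```python
import re

# Values mirroring the mis-parsed row (illustrative only)
last_name = ";Leo"
birth = ";M;1999-09-09"

# re.sub(pattern, replacement, string): as with Spark's regexp_replace,
# the pattern to remove comes before the replacement text.
print(re.sub(";", "", last_name))  # -> Leo
print(re.sub(";", "", birth))      # -> M1999-09-09
```

If the stray values really belong in separate columns, fixing the read so that sep=';' is honoured is cleaner than post-processing, but the replacement above removes the characters either way.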

Related

How to use regexp_replace in spark.sql() to extract hashtags from string

I need to write a regexp_replace query in spark.sql() and I'm not sure how to handle it. For readability purposes, I have to use SQL for it. I am trying to pull the hashtags out of the table. I know how to do this using the Python API, but most of my team are SQL users.
My dataframe example looks like so:
Insta_post
Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House…
RT #NALCABPolicy: Meeting with #RepDarrenSoto . Thanks for taking the time to meet with #LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…
RT #Tharryry: I am delighted that #RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…
My code:
I create a tempview:
post_df.createOrReplaceTempView("post_tempview")
post_df = spark.sql("""
select
regexp_replace(Insta_post, '.*?(.|'')(#)(\w+)', '$1') as a
from post_tempview
where Insta_post like '%#%'
""")
My end result:
+--------------------------------------------------------------------------------------------------------------------------------------------+
|a |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House… |
|RT #NALCABPolicy: Meeting with #RepDarrenSoto . Thanks for taking the time to meet with #LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…|
|RT #Tharryry: I am delighted that #RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…|
+--------------------------------------------------------------------------------------------------------------------------------------------+
desired result:
+---------------------------------+
|a |
+---------------------------------+
| #SaveTheInternet, #NetNeutrality|
| #NALCABPolicy2018 |
| #NetNeutrality |
+---------------------------------+
I haven't really used regexp_replace too much so this is new to me. Any help would be appreciated as well as an explanation of how to structure the subsets!
For Spark 3.1+, you can use the regexp_extract_all function to extract multiple matches:
post_df = spark.sql("""
select regexp_extract_all(Insta_post, '(#\\\\w+)', 1) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+----------------------------------+
#|a |
#+----------------------------------+
#|[#SaveTheInternet, #NetNeutrality]|
#|[#NALCABPolicy2018] |
#|[#NetNeutrality] |
#+----------------------------------+
For Spark <3.1, you can use regexp_replace to remove everything that doesn't match the hashtag pattern:
post_df = spark.sql("""
select trim(trailing ',' from regexp_replace(Insta_post, '.*?(#\\\\w+)|.*', '$1,')) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+-------------------------------+
#|a |
#+-------------------------------+
#|#SaveTheInternet,#NetNeutrality|
#|#NALCABPolicy2018 |
#|#NetNeutrality |
#+-------------------------------+
Note the use of trim to remove the unnecessary trailing comma left by the '$1,' replacement.
Do you really need a view? Because the following code might do it:
df = df.filter(F.col('Insta_post').like('%#%'))
col_trimmed = F.trim(F.regexp_replace('Insta_post', r'.*?(#\w+)|.+', '$1 '))
df = df.select(F.regexp_replace(col_trimmed, r'\s', ', ').alias('a'))
df.show(truncate=False)
# +--------------------------------+
# |a |
# +--------------------------------+
# |#SaveTheInternet, #NetNeutrality|
# |#NALCABPolicy2018 |
# |#NetNeutrality |
# +--------------------------------+
I ended up using regexp_replace twice, so there may well be a better alternative; I just couldn't think of one.
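The pattern logic behind regexp_extract_all can be checked outside Spark with Python's re.findall (plain Python, not Spark; the post text is abbreviated from the examples above):

```python
import re

post = "Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation"

# '#\w+' is the same pattern used with regexp_extract_all above:
# a '#' followed by one or more word characters.
tags = re.findall(r"#\w+", post)
print(tags)  # -> ['#SaveTheInternet', '#NetNeutrality']

# Joining with ', ' reproduces the desired comma-separated output.
print(", ".join(tags))  # -> #SaveTheInternet, #NetNeutrality
```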

Including parenthesis when joining dataframes using rlike in pyspark

I have 2 pyspark dataframes that I am trying to join where some of the values in the columns have parenthesis.
For example one of the values is
"Mangy (Dog)"
If I try joining like so:
df1.join(df2, expr("df1.animal rlike df2.animal_stat"))
I don't get any results.
So I tried filtering using rlike just to see if I am able to capture the values.
The filtering worked on all values except those with parentheses. For example, when I try to filter like so:
df.filter(col('animal').rlike("Mangy (Dog)")).show()
I don't get any results.
However, if I filter with rlike("Mangy") or rlike("(Dog)"), it seems to work, even though I specified parentheses in (Dog).
Is there a way to make rlike include parentheses in its matches?
EDIT:
I have 2 dataframes df1 and df2 like so:
+-----------------+-------+
| animal| origin|
+-----------------+-------+
| mangy (dog)|Streets|
| Cat| house|
|[Bumbling] Bufoon| Utopia|
| Cheetah| Congo|
|(Sprawling) Snake| Amazon|
+-----------------+-------+
+-------------------+-----------+
| animal_stat|destination|
+-------------------+-----------+
| ^dog$| House|
| ^Cat$| Streets|
|^[Bumbling] Bufoon$| Circus|
| ^Cheetah$| Zoo|
| ^(Sprawling)$| Glass Box|
+-------------------+-----------+
I am trying to join the two using rlike with the following method:
dff1 = df1.alias('dff1')
dff2 = df2.alias('dff2')
combine = dff1.join(dff2, expr("dff1.animal rlike dff2.animal_stat"), how='left') \
    .drop(dff2.animal_stat)
I would like the output dataframe to be like so:
+-----------------+-------+-----------+
| animal| origin|destination|
+-----------------+-------+-----------+
| mangy (dog)|Streets| House|
| Cat| house| Streets|
|[Bumbling] Bufoon| Utopia| Circus|
| Cheetah| Congo| Zoo|
|(Sprawling) Snake| Amazon| Glass Box|
+-----------------+-------+-----------+
Edit:
combine = df1.alias('df1').join(
    df2.withColumn(
        'animal_stat',
        F.regexp_replace(
            F.regexp_replace(
                F.regexp_replace(
                    F.regexp_replace('animal_stat', '\\(', '\\\\('),
                    '\\)', '\\\\)'),
                '\\[', '\\\\['),
            '\\]', '\\\\]')
    ).alias('df2'),
    F.expr('df1.animal rlike df2.animal_stat'),
    'left'
)
If you're not using any regex, you probably want to use like instead of rlike. For example, you can do
df1.join(df2, expr("df1.animal like concat('%', df2.animal_stat, '%')"))
To do a filter, you can try
df.filter(col('animal').like("%Mangy (Dog)%")).show()
.rlike() is the same as .like() except that it uses regex, so you need to escape the parentheses. Try filtering like this:
df.filter(col('animal').rlike("Mangy \(Dog\)")).show()
Not sure I can help with the original join issue without some sample data.
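Rather than chaining regexp_replace calls to escape each metacharacter by hand, a regex-escaping helper can build the pattern. This sketch uses Python's re.escape; parentheses and brackets are escaped the same way in Python and Java regex, so the escaped string should also be usable with rlike, though that part is not verified here:

```python
import re

animals = ["Mangy (Dog)", "[Bumbling] Bufoon"]

for animal in animals:
    # re.escape backslash-escapes ( ) [ ] and other regex metacharacters
    pattern = "^" + re.escape(animal) + "$"  # anchored, like animal_stat above
    # The escaped, anchored pattern now matches the literal value
    assert re.search(pattern, animal) is not None
    print(pattern)
```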

How to Replace empty string with N/A in Scala Spark?

I'm trying out the age-old problem of replacing empty strings in a certain column of a Spark Scala dataframe with N/A, but to no avail.
Original Dataframe:
+----------+--------------+
|Testing ID|Test this Code|
+----------+--------------+
| 545242| ""|
| 643533| 994A|
| 856563| ""|
+----------+--------------+
First code I tried:
val a = sssd.withColumn("Test this Code", when($"Test this Code" === "", lit("N/A")).otherwise($"Test this Code"))
But nothing happens; no changes are observed. So I tried another way, using regexp_replace:
import org.apache.spark.sql.functions._
val a = sssd.withColumn("Test this Code", regexp_replace(col("Test this Code"), "", "N/A"))
But then the output is strange; it's the following:
+----------+------------------------+
|Testing ID| Test this Code|
+----------+------------------------+
| 545242| N/A"N/A"N/A|
| 643533| 994A|
| 856563| N/A"N/A"N/A|
+----------+------------------------+
I went through other SO answers, but to no avail, any help?
Try this. I suspect it's not an empty string but a literal string of two quote characters (""). The N/A"N/A"N/A output above supports that: an empty pattern matches before, between, and after the two quote characters.
val a = sssd.withColumn("Test this Code", when($"Test this Code" === "\"\"", lit("N/A")).otherwise($"Test this Code"))
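The two-quote-characters theory is easy to check in plain Python (a sketch with made-up values mirroring the table, not the actual Spark code):

```python
# A literal pair of double-quote characters, not an empty string
rows = ['""', "994A", '""']

# Treat both a true empty string and a literal "" pair as missing
cleaned = ["N/A" if v in ("", '""') else v for v in rows]
print(cleaned)  # -> ['N/A', '994A', 'N/A']

# This also explains the strange regexp_replace output: an empty pattern
# matches at every position, so '""' becomes N/A + '"' + N/A + '"' + N/A.
```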

Redshift regexp_substr - extract data from a JSON type format

Help much appreciated - I have a field in Redshift giving data of the form:
{\"frequencyCapList\":[{\"frequencyCapped\":true,\"frequencyCapPeriodCount\":1,\"frequencyCapPeriodType\":\"DAYS\",\"frequencyCapCount\":501}]}
What I would like to do is parse this cleanly as the output of a Redshift query into some columns like:
Frequency Cap Period Count | Frequency Cap Period Type | Frequency Cap Count
1 | DAYS | 501
I believe I need to use the regexp_substr function to achieve this, but I cannot work out the syntax to get the required output :(
Thanks in advance for any assistance,
Carter
Here you go
select json_extract_path_text(
    json_extract_array_element_text(
        json_extract_path_text(
            replace('{\"frequencyCapList\":[{\"frequencyCapped\":true,\"frequencyCapPeriodCount\":1,\"frequencyCapPeriodType\":\"DAYS\",\"frequencyCapCount\":501}]}', '\\', ''),
            'frequencyCapList'),
        0),
    'frequencyCapPeriodCount');
Just replace the last key with each field you want to extract!
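Outside Redshift, the same replace-then-extract idea can be checked in Python: strip the escaping backslashes and the value parses as ordinary JSON (a plain-Python sketch, not Redshift SQL):

```python
import json

raw = '{\\"frequencyCapList\\":[{\\"frequencyCapped\\":true,\\"frequencyCapPeriodCount\\":1,\\"frequencyCapPeriodType\\":\\"DAYS\\",\\"frequencyCapCount\\":501}]}'

# Remove the backslashes, then parse as JSON -- the same idea as
# replace(..., '\\', '') feeding json_extract_path_text in Redshift.
cap = json.loads(raw.replace("\\", ""))["frequencyCapList"][0]
print(cap["frequencyCapPeriodCount"], cap["frequencyCapPeriodType"], cap["frequencyCapCount"])
# -> 1 DAYS 501
```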

Equivalent of SQL LIKE operator in R

In an R script, I have a function that creates a data frame of files in a directory that have a specific extension.
The dataframe is always two columns with however many rows as there are files found with that specific extension.
The data frame ends up looking something like this:
| Path | Filename |
|:------------------------:|:-----------:|
| C:/Path/to/the/file1.ext | file1.ext |
| C:/Path/to/the/file2.ext | file2.ext |
| C:/Path/to/the/file3.ext | file3.ext |
| C:/Path/to/the/file4.ext | file4.ext |
Forgive the archaic way I express this question. I know that in SQL you can write where clauses with like instead of =, so I could say where Filename like '%1%' and it would pull out all files with a 1 in the name. Is there a way to use something like this to set a variable in R?
I have a couple of different scripts that need to use the Filename pulled from this dataframe. The only reliable way I can think to tell the script which one to pull from is to set a variable like this.
Ultimately I would like these two (pseudo)expressions to yield the same thing.
x <- file1.ext
and
x like '%1%'
should both give x = file1.ext
You can use grepl(), as in this answer:
subset(a, grepl("1", a$filename))
Or, if you're coming from an SQL background, you might want to look into sqldf.
You can use %like% from data.table to get your SQL like behaviour here.
From the documentation, see this example:
library(data.table)
DT = data.table(Name=c("Mary","George","Martha"), Salary=c(2,3,4))
DT[Name %like% "^Mar"]
For your problem, suppose you have a data.frame df like this:
path filename
1: C:/Path/to/the/file1.ext file1.ext
2: C:/Path/to/the/file2.ext file2.ext
3: C:/Path/to/the/file3.ext file3.ext
4: C:/Path/to/the/file4.ext file4.ext
do
library(data.table)
DT<-as.data.table(df)
DT[filename %like% "1"]
should give
path filename
1: C:/Path/to/the/file1.ext file1.ext