How to Replace empty string with N/A in Scala Spark? - regex

I'm trying to solve the age-old problem of replacing empty strings in a certain column of a Spark Scala DataFrame with N/A, but to no avail.
Original Dataframe:
+----------+--------------+
|Testing ID|Test this Code|
+----------+--------------+
| 545242| ""|
| 643533| 994A|
| 856563| ""|
+----------+--------------+
First code I tried:
val a = sssd.withColumn("Test this Code", when($"Test this Code" === "", lit("N/A")).otherwise($"Test this Code"))
But nothing happens; no changes are observed. So I tried another way, using regexp_replace:
import org.apache.spark.sql.functions._
val a = sssd.withColumn("Test this Code", regexp_replace(col("Test this Code"), "", "N/A"))
But then the output is strange; it's the following:
+----------+------------------------+
|Testing ID| Test this Code|
+----------+------------------------+
| 545242| N/A"N/A"N/A|
| 643533| 994A|
| 856563| N/A"N/A"N/A|
+----------+------------------------+
I went through other SO answers, but to no avail. Any help?

Try this. I suspect the value is not an empty string but a string containing two literal quote characters. The strange regexp_replace output supports that: an empty pattern matches at every position in the string, so N/A is inserted before, between, and after the two quote characters, giving N/A"N/A"N/A.
val a = sssd.withColumn("Test this Code", when($"Test this Code" === "\"\"", lit("N/A")).otherwise($"Test this Code"))
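If it could be either case, one check can cover both the genuinely empty string and the literal two-quote string. Here is a PySpark sketch of the same idea (adapt to Scala as needed; sssd is the DataFrame from the question):
from pyspark.sql import functions as F

# covers both a genuinely empty value and a literal "" two-quote value
a = sssd.withColumn(
    "Test this Code",
    F.when(F.col("Test this Code").isin("", '""'), F.lit("N/A"))
     .otherwise(F.col("Test this Code"))
)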

Related

How to regexp_extract if a matching pattern resides anywhere in the string - pyspark

I was trying to get some insight into regexp_extract in pyspark, so I ran a check with this option to understand it better.
Below is my dataframe
data = [('2345', 'Checked|by John|for kamal'),
('2398', 'Checked|by John|for kamal '),
('2328', 'Verified|by Srinivas|for kamal than some random text'),
('3983', 'Verified|for Stacy|by John')]
df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+-----------------------------------------------------+
| ID| Notes |
+----+-----------------------------------------------------+
|2345|Checked|by John|for kamal |
|2398|Checked|by John|for kamal |
|2328|Verified|by Srinivas|for kamal than some random text |
|3983|Verified|for Stacy|by John |
+----+-----------------------------------------------------+
So here I was trying to identify whether an ID was checked or verified by John.
With the help of SO members I was able to work out regexp_extract and came to the solution below:
result = df.withColumn('Employee', regexp_extract(col('Notes'), '(Checked|Verified)(\\|by John)', 1))
result.show()
+----+----------------------------------------------------+--------+
|ID  |Notes                                               |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal                           |Checked |
|2398|Checked|by John|for kamal                           |Checked |
|2328|Verified|by Srinivas|for kamal than some random text|        |
|3983|Verified|for Stacy|by John                          |        |
+----+----------------------------------------------------+--------+
For a few IDs this gives me the perfect result, but for the last ID it didn't print Verified. Could someone please let me know whether anything else needs to change in the regular expression?
What I feel is that (Checked|Verified)(\\|by John) matches only adjacent values. I tried * and $, but it still didn't print Verified for ID 3983.
I would have phrased the regex as:
(Checked|Verified)\b.*\bby John
This pattern finds Checked/Verified followed by "by John", with the two separated by any amount of text. Note that I just use word boundaries here instead of pipes.
Updated code (note the raw string, so that \b is a regex word boundary rather than a backspace character):
result = df.withColumn('Employee', regexp_extract(col('Notes'), r'\b(Checked|Verified)\b.*\bby John', 1))
You can try this regex:
import pyspark.sql.functions as F
result = df.withColumn('Employee', F.regexp_extract('Notes', '(Checked|Verified)\\|.*by John', 1))
result.show()
+----+--------------------+--------+
| ID| Notes|Employee|
+----+--------------------+--------+
|2345|Checked|by John|f...| Checked|
|2398|Checked|by John|f...| Checked|
|2328|Verified|by Srini...| |
|3983|Verified|for Stac...|Verified|
+----+--------------------+--------+
Another way is to check whether the column Notes contains the string by John:
df.withColumn('Employee',
              F.when(col('Notes').like('%Checked|by John%'), 'Checked')
               .when(col('Notes').like('%by John'), 'Verified')
               .otherwise(" ")
).show(truncate=False)
+----+----------------------------------------------------+--------+
|ID |Notes |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal |Checked |
|2398|Checked|by John|for kamal |Checked |
|2328|Verified|by Srinivas|for kamal than some random text| |
|3983|Verified|for Stacy|by John |Verified|
+----+----------------------------------------------------+--------+

How to use regexp_replace in spark.sql() to extract hashtags from string

I need to write a regexp_replace query in spark.sql() and I'm not sure how to handle it. For readability purposes, I have to use SQL for it. I am trying to pull the hashtags out of the table. I know how to do this with the Python method, but most of my team are SQL users.
My dataframe example looks like so:
Insta_post
Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House…
RT #NALCABPolicy: Meeting with #RepDarrenSoto . Thanks for taking the time to meet with #LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…
RT #Tharryry: I am delighted that #RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…
My code:
I create a tempview:
post_df.createOrReplaceTempView("post_tempview")
post_df = spark.sql("""
select
regexp_replace(Insta_post, '.*?(.|'')(#)(\w+)', '$1') as a
from post_tempview
where Insta_post like '%#%'
""")
My end result:
+--------------------------------------------------------------------------------------------------------------------------------------------+
|a |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House… |
|RT #NALCABPolicy: Meeting with #RepDarrenSoto . Thanks for taking the time to meet with #LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…|
|RT #Tharryry: I am delighted that #RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…|
+--------------------------------------------------------------------------------------------------------------------------------------------+
desired result:
+---------------------------------+
|a |
+---------------------------------+
| #SaveTheInternet, #NetNeutrality|
| #NALCABPolicy2018 |
| #NetNeutrality |
+---------------------------------+
I haven't really used regexp_replace much, so this is new to me. Any help would be appreciated, as well as an explanation of how to structure the subsets!
For Spark 3.1+, you can use the regexp_extract_all function to extract multiple matches:
post_df = spark.sql("""
select regexp_extract_all(Insta_post, '(#\\\\w+)', 1) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+----------------------------------+
#|a |
#+----------------------------------+
#|[#SaveTheInternet, #NetNeutrality]|
#|[#NALCABPolicy2018] |
#|[#NetNeutrality] |
#+----------------------------------+
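If you want the comma-separated string shown in the desired result rather than an array, array_join can be layered on top (a sketch using the same temp view):
post_df = spark.sql("""
select array_join(regexp_extract_all(Insta_post, '(#\\\\w+)', 1), ', ') as a
from post_tempview
where Insta_post like '%#%'
""")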
For Spark <3.1, you can use regexp_replace to remove everything that doesn't match the hashtag pattern:
post_df = spark.sql("""
select trim(trailing ',' from regexp_replace(Insta_post, '.*?(#\\\\w+)|.*', '$1,')) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+-------------------------------+
#|a |
#+-------------------------------+
#|#SaveTheInternet,#NetNeutrality|
#|#NALCABPolicy2018 |
#|#NetNeutrality |
#+-------------------------------+
Note the use of trim to remove the unnecessary trailing comma created by the '$1,' replacement.
Do you really need a view? Because the following code might do it:
import pyspark.sql.functions as F

df = df.filter(F.col('Insta_post').like('%#%'))
col_trimmed = F.trim(F.regexp_replace('Insta_post', r'.*?(#\w+)|.+', '$1 '))
df = df.select(F.regexp_replace(col_trimmed, r'\s', ', ').alias('a'))
df.show(truncate=False)
# +--------------------------------+
# |a |
# +--------------------------------+
# |#SaveTheInternet, #NetNeutrality|
# |#NALCABPolicy2018 |
# |#NetNeutrality |
# +--------------------------------+
I ended up using regexp_replace twice, so there could be a better alternative; I just couldn't think of one.

Including parentheses when joining dataframes using rlike in pyspark

I have 2 pyspark dataframes that I am trying to join, where some of the values in the columns have parentheses.
For example one of the values is
"Mangy (Dog)"
If I try joining like so:
df1.join(df2, expr("df1.animal rlike df2.animal_stat"))
I don't get any results.
So I tried filtering using rlike just to see if I am able to capture the values.
The filtering worked on all values except those with parentheses. For example, when I try to filter like so:
df.filter(col('animal').rlike("Mangy (Dog)")).show()
I don't get any results.
However, if I filter with rlike("Mangy") or rlike("(Dog)") it seems to work, even though I specified parentheses in (Dog).
Is there a way to make rlike to include parenthesis in its matches?
EDIT:
I have 2 dataframes df1 and df2 like so:
+-----------------+-------+
| animal| origin|
+-----------------+-------+
| mangy (dog)|Streets|
| Cat| house|
|[Bumbling] Bufoon| Utopia|
| Cheetah| Congo|
|(Sprawling) Snake| Amazon|
+-----------------+-------+
+-------------------+-----------+
| animal_stat|destination|
+-------------------+-----------+
| ^dog$| House|
| ^Cat$| Streets|
|^[Bumbling] Bufoon$| Circus|
| ^Cheetah$| Zoo|
| ^(Sprawling)$| Glass Box|
+-------------------+-----------+
I am trying to join the two with rlike using the following method:
dff1 = df1.alias('dff1')
dff2 = df2.alias('dff2')
combine = dff1.join(dff2, expr("dff1.animal rlike dff2.animal_stat"), how='left') \
    .drop(dff2.animal_stat)
I would like the output dataframe to be like so:
+-----------------+-------+-----------+
| animal| origin|destination|
+-----------------+-------+-----------+
| mangy (dog)|Streets| House|
| Cat| house| Streets|
|[Bumbling] Bufoon| Utopia| Circus|
| Cheetah| Congo| Zoo|
|(Sprawling) Snake| Amazon| Glass Box|
+-----------------+-------+-----------+
Edit:
combine = df1.alias('df1').join(
    df2.withColumn(
        'animal_stat',
        F.regexp_replace(
            F.regexp_replace(
                F.regexp_replace(
                    F.regexp_replace('animal_stat', '\\(', '\\\\('),
                    '\\)', '\\\\)'),
                '\\[', '\\\\['),
            '\\]', '\\\\]')
    ).alias('df2'),
    F.expr('df1.animal rlike df2.animal_stat'),
    'left'
)
If you're not using any regex, you probably want to use like instead of rlike. For example, you can do
df1.join(df2, expr("df1.animal like concat('%', df2.animal_stat, '%')"))
To do a filter, you can try
df.filter(col('animal').like("%Mangy (Dog)%")).show()
.rlike() is the same as .like() except it uses regex. You need to escape the parentheses. Try filtering like this:
df.filter(col('animal').rlike("Mangy \(Dog\)")).show()
Not sure I can help with the original join issue without some sample data.
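If the patterns contain several regex metacharacters, they can also be escaped in a single pass before the join instead of nesting regexp_replace calls. A sketch, assuming the DataFrames from the edit above and Spark's Java-style replacement syntax, where '\\\\$1' prefixes each matched bracket with a backslash:
import pyspark.sql.functions as F

# escape ( ) [ ] in one pass so rlike treats them as literal characters
df2_escaped = df2.withColumn(
    'animal_stat',
    F.regexp_replace('animal_stat', '([()\\[\\]])', '\\\\$1')
)
combine = df1.alias('df1').join(
    df2_escaped.alias('df2'),
    F.expr('df1.animal rlike df2.animal_stat'),
    'left'
)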

Extracting main subject from a sentence in python

I am trying to extract the main subject from each sentence in a text file. For example, the file contains the data given below:
I never used tobacco
They smoke tobacco
I do not like today's weather
Good weather
Exercise 3 to 4 times a week
No exercise
Family history of Cancer
No Cancer
,,· Alcohol use
Amazing football match
Pathetic football match
Has Depression
I have to extract the main subject and print it as follows:
I never used tobacco | Tobacco | False
They smoke tobacco | Tobacco | True
I do not like today's weather | Weather | False
Good weather | Weather | True
Exercise 3 to 4 times a week | Exercise | True
No exercise | Exercise | False
Family history of Cancer | Cancer | True
No Cancer | Cancer | False
,,· Alcohol use. | Alcohol | True
Amazing football match | Football Match| True
Pathetic football match | Football Match | False
Has Depression | Depression | True
I am trying spaCy for it but am not able to get the desired output. I tokenized the sentences using spaCy and then used part-of-speech tagging to extract the nouns, but I'm still not getting what is required.
Can anyone help with how this could be done?
There is no exact solution to this, but the code below, which I used, is somewhat helpful:
from collections import Counter
import spacy

nlp = spacy.load('en_core_web_md')

negatedwords = read_words_from_file('false.txt')  # file containing all the negation words
# read_words_from_file() will read words from the file

# Assumption: the sentences are read line by line from an input file and the
# results are written to file_write; the original snippet left this part out.
with open('sentences.txt') as file_read, open('output.txt', 'w') as file_write:
    for line in file_read:
        # flag the sentence as negated if any word from false.txt appears in it
        count = Counter(line.split())
        negated_word_found = False
        for key, val in count.items():
            key = key.rstrip('.,?!\n')  # removing punctuation
            if key in negatedwords:
                negated_word_found = True

        if negated_word_found:
            file_write.write("False")
        else:
            file_write.write("True")
        file_write.write(" | ")

        # keep nouns, adjectives and proper nouns as the candidate subject
        document = nlp(line)
        for word in document:
            look_for_word = word.text
            word_pos = word.pos_
            # the pos_ tag for 'use' shows up as NOUN, so it is excluded explicitly
            if (word_pos == "NOUN" or word_pos == "ADJ" or word_pos == "PROPN") and look_for_word != "use":
                file_write.write(look_for_word)
                file_write.write(' ')
false.txt
never
Never
no
No
NO
not
NOT
Not
NEVER
don't
Don't
DON'T
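A lighter-weight spaCy variant, in case it helps: noun_chunks can pull out a candidate subject phrase directly instead of filtering individual POS tags. This is only a sketch; the negation shortlist and the "take the last chunk" heuristic are assumptions, not part of the answer above:
import spacy

nlp = spacy.load('en_core_web_md')
negations = {'no', 'not', 'never'}  # assumed shortlist, similar to false.txt

for line in ["I never used tobacco", "Good weather", "No exercise"]:
    doc = nlp(line)
    # rough heuristic: take the last noun chunk as the main subject
    chunks = [chunk.text for chunk in doc.noun_chunks]
    subject = chunks[-1] if chunks else ""
    flag = not any(token.lower_ in negations for token in doc)
    print(line, "|", subject, "|", flag)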

Keep words starting with character/letter in Pandas | Python

I'm not sure how to do this in a dataframe context.
I have the table below with text information:
TEXT |
-------------------------------------------|
"Get some new #turbo #stacks today!" |
"Is it one or three? #phone" |
"Mayhaps it be three afterall..." |
"So many new issues with phone... #iphone" |
And I want to edit it down to where only the words with a '#' symbol are kept, like in the result below.
TEXT |
-----------------|
"#turbo #stacks" |
"#phone" |
"" |
"#iphone" |
In some cases, I'd also like to eliminate the rows that end up empty, either by checking for NaN or with some other kind of condition, to get this result:
TEXT |
-----------------|
"#turbo #stacks" |
"#phone" |
"#iphone" |
I'm using Python 2.7 and pandas for this.
You could try using regex and extractall:
df.TEXT.str.extractall('(#\w+)').groupby(level=0)[0].apply(' '.join)
Output:
0 #turbo #stacks
1 #phone
3 #iphone
Name: 0, dtype: object
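For the second part of the question (dropping the rows that end up empty), one option is to assign the result back and drop the NaN rows. A sketch, assuming the column is called TEXT as above:
# rows with no '#' word get NaN after the index-aligned assignment and are then dropped
df['TEXT'] = df.TEXT.str.extractall(r'(#\w+)').groupby(level=0)[0].apply(' '.join)
df = df.dropna(subset=['TEXT']).reset_index(drop=True)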