I was trying to understand regexp_extract in PySpark and ran the following check to get a better feel for it.
Below is my DataFrame:
data = [('2345', 'Checked|by John|for kamal'),
('2398', 'Checked|by John|for kamal '),
('2328', 'Verified|by Srinivas|for kamal than some random text'),
('3983', 'Verified|for Stacy|by John')]
df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+----------------------------------------------------+
|ID  |Notes                                               |
+----+----------------------------------------------------+
|2345|Checked|by John|for kamal                           |
|2398|Checked|by John|for kamal                           |
|2328|Verified|by Srinivas|for kamal than some random text|
|3983|Verified|for Stacy|by John                          |
+----+----------------------------------------------------+
Here I was trying to identify whether an ID was checked or verified by John.
With the help of SO members I worked out how to use regexp_extract and came to the solution below:
result = df.withColumn('Employee', regexp_extract(col('Notes'), '(Checked|Verified)(\\|by John)', 1))
result.show()
+----+----------------------------------------------------+--------+
|ID  |Notes                                               |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal                           |Checked |
|2398|Checked|by John|for kamal                           |Checked |
|2328|Verified|by Srinivas|for kamal than some random text|        |
|3983|Verified|for Stacy|by John                          |        |
+----+----------------------------------------------------+--------+
For some IDs this gives the right result, but for the last ID it didn't print Verified. Could someone tell me whether the regular expression needs to be changed?
My guess is that (Checked|Verified)(\\|by John) only matches when the two parts are adjacent. I tried adding * and $, but it still didn't print Verified for ID 3983.
I would have phrased the regex as:
(Checked|Verified)\b.*\bby John
This pattern finds Checked or Verified followed by "by John", with the two separated by any amount of text. Note that I use word boundaries here instead of matching the pipes.
Updated code:
result = df.withColumn('Employee', regexp_extract(col('Notes'), r'\b(Checked|Verified)\b.*\bby John', 1))  # raw string: a plain '\b' in Python is a backspace character
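The pattern can be checked without a Spark session using Python's re module. This is only a rough sketch: it assumes Python's regex engine treats this particular pattern the same way as the Java engine behind regexp_extract, and the helper name extract_action is made up for illustration.

```python
import re

# Same pattern as in the answer; the raw string keeps \b as a regex word boundary.
PATTERN = re.compile(r'\b(Checked|Verified)\b.*\bby John')

notes = [
    'Checked|by John|for kamal',
    'Checked|by John|for kamal ',
    'Verified|by Srinivas|for kamal than some random text',
    'Verified|for Stacy|by John',
]

def extract_action(note):
    """Return group 1 of the first match, or '' when nothing matches
    (regexp_extract also returns an empty string for a non-match)."""
    match = PATTERN.search(note)
    return match.group(1) if match else ''

results = [extract_action(note) for note in notes]
print(results)  # ['Checked', 'Checked', '', 'Verified']
```

The third note stays empty because it contains no "by John" at all, while the last one now matches even though other text sits between Verified and "by John".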
You can try this regex:
import pyspark.sql.functions as F
result = df.withColumn('Employee', F.regexp_extract('Notes', '(Checked|Verified)\\|.*by John', 1))
result.show()
+----+--------------------+--------+
| ID| Notes|Employee|
+----+--------------------+--------+
|2345|Checked|by John|f...| Checked|
|2398|Checked|by John|f...| Checked|
|2328|Verified|by Srini...| |
|3983|Verified|for Stac...|Verified|
+----+--------------------+--------+
Another way is to check whether the column Notes contains the string "by John":
df.withColumn('Employee',F.when(F.col('Notes').like('%Checked|by John%'), 'Checked').when(F.col('Notes').like('%by John'), 'Verified').otherwise(" ")).show(truncate=False)
+----+----------------------------------------------------+--------+
|ID  |Notes                                               |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal                           |Checked |
|2398|Checked|by John|for kamal                           |Checked |
|2328|Verified|by Srinivas|for kamal than some random text|        |
|3983|Verified|for Stacy|by John                          |Verified|
+----+----------------------------------------------------+--------+
I am trying to extract the main subject from each sentence in a text file. For example, the file contains the following data:
I never used tobacco
They smoke tobacco
I do not like today's weather
Good weather
Exercise 3 to 4 times a week
No exercise
Family history of Cancer
No Cancer
,,· Alcohol use
Amazing football match
Pathetic football match
Has Depression
I have to extract the main subject and print it as follows:
I never used tobacco | Tobacco | False
They smoke tobacco | Tobacco | True
I do not like today's weather | Weather | False
Good weather | Weather | True
Exercise 3 to 4 times a week | Exercise | True
No exercise | Exercise | False
Family history of Cancer | Cancer | True
No Cancer | Cancer | False
,,· Alcohol use. | Alcohol | True
Amazing football match | Football Match| True
Pathetic football match | Football Match | False
Has Depression | Depression | True
I am trying spaCy for this but cannot get the desired output. I tokenized the sentences using spaCy and then used part-of-speech tagging to extract the nouns, but the result is still not what is required.
Can anyone help with how this could be done?
There is no exact solution, but the code below, which I used, is somewhat helpful:
from collections import Counter
import spacy

nlp = spacy.load('en_core_web_md')
# read_words_from_file() reads the words from a file
negatedwords = read_words_from_file('false.txt')  # file containing all the negation words

# `line` holds the current sentence; `file_write` is the output file
count = Counter(line.split())
negated_word_found = False
for key, val in count.items():
    key = key.rstrip('.,?!\n')  # remove trailing punctuation
    if key in negatedwords:
        negated_word_found = True

if negated_word_found:
    file_write.write("False")
else:
    file_write.write("True")
file_write.write(" | ")

document = nlp(line)
for word in document:
    look_for_word = word.text
    word_pos = word.pos_
    # 'use' is tagged as NOUN by spaCy, so it is excluded explicitly
    if word_pos in ("NOUN", "ADJ", "PROPN") and look_for_word != "use":
        file_write.write(look_for_word)
        file_write.write(' ')
false.txt
never
Never
no
No
NO
not
NOT
Not
NEVER
don't
Don't
DON'T
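The negation check can also be made case-insensitive, so that the word list needs only one entry per word instead of every capitalisation. A minimal sketch, assuming the same four negation words as in false.txt; the names NEGATIONS and is_affirmative are hypothetical, and this covers only the True/False column, not the subject extraction.

```python
# One lowercase entry per negation word (assumed equivalent to false.txt).
NEGATIONS = {"never", "no", "not", "don't"}

def is_affirmative(sentence):
    """Return False if any whitespace-separated token is a negation word."""
    for token in sentence.split():
        # Strip surrounding punctuation and ignore case before the lookup.
        word = token.strip('.,?!').lower()
        if word in NEGATIONS:
            return False
    return True

for text in ["I never used tobacco", "They smoke tobacco", "NO exercise"]:
    print(text, "|", is_affirmative(text))
```

Sentences such as "Pathetic football match" still come out as True with this approach, because the word list contains no sentiment words.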
I'm trying to do the following:
1. Check the cell for N/A or No; if it has either of these, it should output N/A or No.
2. Check the cell for £, € or Yes; if it has one of these, continue to step 3. If it has $, it should repeat the input as the output.
3. Extract the currency from the cell using REGEXEXTRACT(A1, "\$\d+") or REGEXEXTRACT(A1, "\£\d+") (I assume that's the best way).
4. Convert it to USD using GoogleFinance("CURRENCY:EURUSD") or GoogleFinance("CURRENCY:GBPUSD").
5. Output the original cell, replacing the currency extracted in step 3 with the output from step 4.
Examples: (Original --> Output)
N/A --> N/A
No --> No
Alt --> Alt
Yes --> Yes
Yes £10 --> Yes $12.19
Yes £10 per week --> Yes $12.19 per week
Yes €5 (Next) --> Yes $5.49 (Next)
Yes $5 22 EA --> Yes $5 22 EA
Yes £5 - £10 --> Yes $5.49 - $12.19
I am unable to get a working IF statement. I could do this in normal code but can't work it out for spreadsheet formulas.
I've tried modifying @Rubén's answer many times to include N/A (the literal text, not the Sheets error). I also tried making any USD inputs come out unchanged as USD, but I really can't get the hang of IF/OR/AND in Excel/Google Sheets.
=ArrayFormula(
  SUBSTITUTE(
    A1,
    OR(IF(A1="No","No",REGEXEXTRACT(A1, "[\£|\€]\d+")),IF(A1="N/A","N/A",REGEXEXTRACT(A1, "[\£|\€]\d+"))),
    IF(
      A1="No",
      "No",
      TEXT(
        REGEXEXTRACT(A1, "[\£|\€](\d+)")*
        IF(
          "€"=REGEXEXTRACT(A1, "([\£|\€])\d+"),
          GoogleFinance("CURRENCY:EURUSD"),
          GoogleFinance("CURRENCY:GBPUSD")
        ),
        "$###,###"
      )
    )
  )
)
In the formula above I tried adding an OR() around the first IF statement to include N/A as an option. I also tried it in various other ways, such as the one below (replace line 4 with this):
IF(
  OR(
    A1="No",
    "No",
    REGEXEXTRACT(A1, "[\£|\€]\d+");
    A1="No",
    "No",
    REGEXEXTRACT(A1, "[\£|\€]\d+")
  )
)
But that doesn't work either. I thought using ; was a way to separate the OR expressions but apparently not.
Re: Rubén's latest code 16/10/2016
I've modified it to:
=ArrayFormula(
  IF(NOT(ISBLANK(A2)),
    IF(IFERROR(SEARCH("$",A2),0),A2,IF(A2="N/A","N/A",IF(A2="No","No",IF(A2="Alt","Alt",IF(A2="Yes","Yes",
      SUBSTITUTE(
        A2,
        REGEXEXTRACT(A2, "[\£|\€]\d+"),
        TEXT(
          REGEXEXTRACT(A2, "[\£|\€](\d+)")
          *
          VLOOKUP(
            REGEXEXTRACT(A2, "([\£|\€])\d+"),
            {
              {"£";"€"},
              {GoogleFinance("CURRENCY:GBPUSD");GoogleFinance("CURRENCY:EURUSD")}
            },
            2,0),
          "$###,###"
        )
      )
    )))))
  ,"")
)
This fixes:
Blank cells no longer throw #N/A
Yes only cells no longer throw #N/A
Added another text value Alt
Changes the format of the currency to 0 decimal places rather than my original request of 2 decimal places.
As you can see in the image below, the two red cells aren't quite correct because I never thought of this scenario: the second of the two values stays in its input form and is not converted to USD.
Direct answer
Try
=ArrayFormula(
  IF(IFERROR(SEARCH("$",A1:A6),0),A1:A6,IF(A1:A6="N/A","N/A",IF(A1:A6="No","No",
    SUBSTITUTE(
      A1:A6,
      REGEXEXTRACT(A1:A6, "[\£|\€]\d+"),
      TEXT(
        REGEXEXTRACT(A1:A6, "[\£|\€](\d+)")
        *
        VLOOKUP(
          REGEXEXTRACT(A1:A6, "([\£|\€])\d+"),
          {
            {"£";"€"},
            {GoogleFinance("CURRENCY:GBPUSD");GoogleFinance("CURRENCY:EURUSD")}
          },
          2,0),
        "$###,###.00"
      )
    )
  )))
)
Result
+---+------------------+---------------------+
| | A | B |
+---+------------------+---------------------+
| 1 | N/A | N/A |
| 2 | No | No |
| 3 | Yes £10 | Yes $12.19 |
| 4 | Yes £10 per week | Yes $12.19 per week |
| 5 | Yes €5 (Next) | Yes $5.49 (Next) |
+---+------------------+---------------------+
Explanation
OR function
Instead of using the OR function, the above formula uses nested IF functions.
REGEXEXTRACT
Instead of using a separate REGEXEXTRACT call for each currency symbol, a single pattern that matches either symbol was used (inside the square brackets the | is a literal character, so [£€] alone would match the same symbols). Example:
REGEXEXTRACT(A1:A6, "[\£|\€]\d+")
Three regular expressions were used:
get the currency symbol and the amount: [\£|\€]\d+
get the amount: [\£|\€](\d+)
get the currency symbol: ([\£|\€])\d+
Currency conversion
Instead of using nested IF functions to handle the currency conversion rates, VLOOKUP with an array is used. This should make the formula easier to maintain if more currencies are added in the future.
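For comparison, the same logic written as ordinary code (the question mentions this would be easy outside a spreadsheet) could look roughly like the sketch below. It is not a Sheets formula. The rates are fixed, hypothetical values chosen to reproduce the examples in the question, where the spreadsheet would use GoogleFinance instead, and the names RATES_TO_USD and to_usd are made up. Because the substitution runs over every match, a cell with two amounts such as "Yes £5 - £10" has both converted.

```python
import re

# Hypothetical fixed rates (assumption); the sheet would look these up
# with GoogleFinance("CURRENCY:GBPUSD") and GoogleFinance("CURRENCY:EURUSD").
RATES_TO_USD = {"£": 1.219, "€": 1.098}

# Currency symbol in group 1, whole-number amount in group 2.
AMOUNT = re.compile(r"([£€])(\d+)")

def to_usd(cell):
    """Replace every £/€ amount in the cell text with its USD value."""
    if "$" in cell:
        return cell  # already in USD: repeat the input unchanged

    def convert(match):
        symbol, amount = match.group(1), int(match.group(2))
        return f"${amount * RATES_TO_USD[symbol]:.2f}"

    # Cells without a £/€ amount (N/A, No, Alt, Yes) pass through untouched.
    return AMOUNT.sub(convert, cell)

print(to_usd("Yes £10 per week"))  # Yes $12.19 per week
print(to_usd("Yes €5 (Next)"))     # Yes $5.49 (Next)
```

The dictionary plays the same role as the VLOOKUP array: adding a currency means adding one entry and one symbol to the character class.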
Got some text:
[23/07 | DEV | FARO | QC Billable | #2032] Unable to Load label
[30/07 | QC | ROLAWN ] Selling products as a bundle
[11/08 | EST | QC BILLABLE | #2015 ISUOG ] On Demand website looping
[05/08 | EST | ROLAWN | Problems with 'find a stockist'
[29/07 | DEV | QUBA] Blog comments loading to error
[24/07 | FROG | EST| QC BILLABLE #2033] Carousel banner not working correctly
I'm trying to match the last sentence at the end of each line so the matches are as follows:
Unable to Load label
Selling products as a bundle
On Demand website looping
Problems with 'find a stockist'
Blog comments loading to error
Carousel banner not working correctly
Unfortunately, I can't depend on the structure of the line to conform, but the information I'm trying to extract should always be the last sentence. I've tried quite a few different things, but I'm struggling here.
If there is always some kind of non-word character before the last sentence, try:
[\w\s']+$
Edit: The answer above by m.cekiera [\w\s']+$ is better.
](.+)$
Here's a pretty naive solution: https://regex101.com/r/yT8jJ7/1.
If you give more details about the actual structure it could be refined.
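Both suggested patterns can be tried against the sample lines in Python. This is only a sketch: the helper name last_sentence is made up, and stripping the leading space from the match is an extra step not mentioned in the answers.

```python
import re

lines = [
    "[23/07 | DEV | FARO | QC Billable | #2032] Unable to Load label",
    "[30/07 | QC | ROLAWN ] Selling products as a bundle",
    "[11/08 | EST | QC BILLABLE | #2015 ISUOG ] On Demand website looping",
    "[05/08 | EST | ROLAWN | Problems with 'find a stockist'",
    "[29/07 | DEV | QUBA] Blog comments loading to error",
    "[24/07 | FROG | EST| QC BILLABLE #2033] Carousel banner not working correctly",
]

# Trailing run of word characters, whitespace and apostrophes.
LAST_SENTENCE = re.compile(r"[\w\s']+$")

def last_sentence(line):
    """Return the trailing sentence without surrounding whitespace, or None."""
    match = LAST_SENTENCE.search(line)
    return match.group(0).strip() if match else None

titles = [last_sentence(line) for line in lines]
for title in titles:
    print(title)

# The bracket-based pattern depends on a closing ']' being present,
# so it finds nothing on the fourth line, where the bracket is missing.
bracket_match = re.search(r"\](.+)$", lines[3])
```

The character-class pattern does not rely on the bracket, which is why it copes with the malformed fourth line.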
Can someone help me replace a value from a text file using REGEXP_REPLACE before the data is stored by SQL*Loader?
My text file:
Andy 0001231231231
Bobby 0000032132132122
Charles 0000456456456
and the expected result in the DB is:
NAME | PHONE
---------------------
Andy | 1231231231
Bobby | 32132132122
Charles | 456456456
Here is my SQL*Loader control file:
PHONE POSITION(10:45) NULLIF PHONE=BLANKS "REGEXP_REPLACE(:PHONE, '^0+([^0]\d+)$','\1')",
But I still get this result:
NAME | PHONE
---------------------
Andy | 0001231231231
Bobby | 0000032132132122
Charles | 0000456456456
What's wrong with my SqlLoader file?
Thank you
Faizal
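I had forgotten to escape the backslashes: every \ in the expression has to be written as \\ in the control file. After that change the result is as expected:
PHONE POSITION(10:45) NULLIF PHONE=BLANKS "REGEXP_REPLACE(:PHONE, '^0+([^0]\\d+)$','\\1')",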
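The regular expression itself can be checked outside the database, for example with Python's re module. This sketch assumes Python handles this particular pattern the same way as Oracle's REGEXP_REPLACE; the helper name strip_leading_zeros is made up for illustration.

```python
import re

# Leading zeros, then a non-zero character followed by at least one digit.
LEADING_ZEROS = re.compile(r'^0+([^0]\d+)$')

def strip_leading_zeros(phone):
    """Drop the leading zeros; values that do not match are returned unchanged."""
    return LEADING_ZEROS.sub(r'\1', phone)

for raw in ['0001231231231', '0000032132132122', '0000456456456']:
    print(raw, '->', strip_leading_zeros(raw))
```

Note that the pattern needs at least two characters after the zeros, so a value such as 0005 would be left as it is.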