Can someone help me replace a value in a text file using REGEXP_REPLACE before the data is stored by SQL*Loader?
My text file:
Andy 0001231231231
Bobby 0000032132132122
Charles 0000456456456
and the expected result in the DB is:
NAME | PHONE
---------------------
Andy | 1231231231
Bobby | 32132132122
Charles | 456456456
Here is the relevant line of my SQL*Loader control file:
PHONE POSITION(10:45) NULLIF PHONE=BLANKS "REGEXP_REPLACE(:PHONE, '^0+([^0]\d+)$','\1')",
But I still get a result like this:
NAME | PHONE
---------------------
Andy | 0001231231231
Bobby | 0000032132132122
Charles | 0000456456456
What's wrong with my SQL*Loader file?
Thank you
Faizal
Solved it: I forgot to escape the backslashes; every \ inside the double-quoted expression must be doubled to \\. With the field defined as
PHONE POSITION(10:45) NULLIF PHONE=BLANKS "REGEXP_REPLACE(:PHONE, '^0+([^0]\\d+)$','\\1')",
the result is as I expected.
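The pattern itself can be sanity-checked outside SQL*Loader; a quick sketch with Python's re module, using the same regex on the sample values from the file:
import re

# strip leading zeros, keeping the first non-zero digit and the rest
for phone in ['0001231231231', '0000032132132122', '0000456456456']:
    print(re.sub(r'^0+([^0]\d+)$', r'\1', phone))
# 1231231231
# 32132132122
# 456456456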
I was trying to get some insight into regexp_extract in PySpark, and I ran the check below to understand it better.
Here is my dataframe:
from pyspark.sql.functions import regexp_extract, col

data = [('2345', 'Checked|by John|for kamal'),
        ('2398', 'Checked|by John|for kamal '),
        ('2328', 'Verified|by Srinivas|for kamal than some random text'),
        ('3983', 'Verified|for Stacy|by John')]
df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+----------------------------------------------------+
|  ID|Notes                                               |
+----+----------------------------------------------------+
|2345|Checked|by John|for kamal                           |
|2398|Checked|by John|for kamal                           |
|2328|Verified|by Srinivas|for kamal than some random text|
|3983|Verified|for Stacy|by John                          |
+----+----------------------------------------------------+
So here I was trying to identify whether an ID is checked or verified by John
With the help of SO members I was able to work out how to use regexp_extract and came to the solution below:
result = df.withColumn('Employee', regexp_extract(col('Notes'), '(Checked|Verified)(\\|by John)', 1))
result.show()
+----+----------------------------------------------------+--------+
|  ID|Notes                                               |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal                           | Checked|
|2398|Checked|by John|for kamal                           | Checked|
|2328|Verified|by Srinivas|for kamal than some random text|        |
|3983|Verified|for Stacy|by John                          |        |
+----+----------------------------------------------------+--------+
For a few IDs this gives the perfect result, but for the last ID it didn't print Verified. Could someone please let me know whether anything else needs to change in the regular expression?
My feeling is that (Checked|Verified)(\\|by John) only matches when the two parts are adjacent. I tried * and $, but it still didn't print Verified for ID 3983.
I would have phrased the regex as:
(Checked|Verified)\b.*\bby John
This pattern finds Checked/Verified followed by 'by John', with any amount of text allowed between the two. Note that I use word boundaries here instead of pipes.
Updated code:
result = df.withColumn('Employee', regexp_extract(col('Notes'), '\\b(Checked|Verified)\\b.*\\bby John', 1))
(Note the doubled backslashes: in a plain Python string literal, '\b' is a backspace character rather than a regex word boundary.)
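To see why the word-boundary version also matches ID 3983, the pattern can be checked with Python's plain re module; a quick sketch on three of the sample strings:
import re

pattern = r'\b(Checked|Verified)\b.*\bby John'
for note in ['Checked|by John|for kamal',
             'Verified|by Srinivas|for kamal than some random text',
             'Verified|for Stacy|by John']:
    m = re.search(pattern, note)
    print(m.group(1) if m else '(no match)')
# Checked
# (no match)
# Verified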
You can try this regex:
import pyspark.sql.functions as F
result = df.withColumn('Employee', F.regexp_extract('Notes', '(Checked|Verified)\\|.*by John', 1))
result.show()
+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|Checked|by John|f...| Checked|
|2398|Checked|by John|f...| Checked|
|2328|Verified|by Srini...|        |
|3983|Verified|for Stac...|Verified|
+----+--------------------+--------+
Another way is to check whether the column Notes contains the string 'by John':
df.withColumn('Employee',
              F.when(col('Notes').like('%Checked|by John%'), 'Checked')
               .when(col('Notes').like('%by John'), 'Verified')
               .otherwise(" ")).show(truncate=False)
+----+----------------------------------------------------+--------+
|ID |Notes |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal |Checked |
|2398|Checked|by John|for kamal |Checked |
|2328|Verified|by Srinivas|for kamal than some random text| |
|3983|Verified|for Stacy|by John |Verified|
+----+----------------------------------------------------+--------+
I need to write a regexp_replace query in spark.sql() and I'm not sure how to handle it. For readability purposes I have to use SQL. I am trying to pull the hashtags out of the table. I know how to do this with the Python method, but most of my team are SQL users.
My dataframe example looks like so:
Insta_post
Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House…
RT #NALCABPolicy: Meeting with #RepDarrenSoto . Thanks for taking the time to meet with #LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…
RT #Tharryry: I am delighted that #RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…
My code:
I create a tempview:
post_df.createOrReplaceTempView("post_tempview")
post_df = spark.sql("""
    select regexp_replace(Insta_post, '.*?(.|'')(#)(\w+)', '$1') as a
    from post_tempview
    where Insta_post like '%#%'
""")
My end result:
+--------------------------------------------------------------------------------------------------------------------------------------------+
|a |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House… |
|RT #NALCABPolicy: Meeting with #RepDarrenSoto . Thanks for taking the time to meet with #LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…|
|RT #Tharryry: I am delighted that #RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…|
+--------------------------------------------------------------------------------------------------------------------------------------------+
desired result:
+---------------------------------+
|a |
+---------------------------------+
| #SaveTheInternet, #NetNeutrality|
| #NALCABPolicy2018 |
| #NetNeutrality |
+---------------------------------+
I haven't really used regexp_replace too much so this is new to me. Any help would be appreciated as well as an explanation of how to structure the subsets!
For Spark 3.1+, you can use the regexp_extract_all function to extract multiple matches:
post_df = spark.sql("""
select regexp_extract_all(Insta_post, '(#\\\\w+)', 1) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+----------------------------------+
#|a |
#+----------------------------------+
#|[#SaveTheInternet, #NetNeutrality]|
#|[#NALCABPolicy2018] |
#|[#NetNeutrality] |
#+----------------------------------+
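The desired result is a comma-separated string rather than an array; if that matters, array_join (available since Spark 2.4) can flatten the array. A sketch on top of the query above:
post_df = spark.sql("""
    select array_join(regexp_extract_all(Insta_post, '(#\\\\w+)', 1), ', ') as a
    from post_tempview
    where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+--------------------------------+
#|a                               |
#+--------------------------------+
#|#SaveTheInternet, #NetNeutrality|
#|#NALCABPolicy2018               |
#|#NetNeutrality                  |
#+--------------------------------+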
For Spark <3.1, you can use regexp_replace to remove everything that doesn't match the hashtag pattern:
post_df = spark.sql("""
select trim(trailing ',' from regexp_replace(Insta_post, '.*?(#\\\\w+)|.*', '$1,')) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+-------------------------------+
#|a |
#+-------------------------------+
#|#SaveTheInternet,#NetNeutrality|
#|#NALCABPolicy2018 |
#|#NetNeutrality |
#+-------------------------------+
Note the use of trim to remove the unnecessary trailing commas created by the '$1,' replacement.
Do you really need a view? Because the following code might do it:
import pyspark.sql.functions as F

df = df.filter(F.col('Insta_post').like('%#%'))
# keep only the hashtags, separated by single spaces
col_trimmed = F.trim(F.regexp_replace('Insta_post', '.*?(#\\w+)|.+', '$1 '))
# turn the single-space separators into ', '
df = df.select(F.regexp_replace(col_trimmed, '\\s', ', ').alias('a'))
df.show(truncate=False)
# +--------------------------------+
# |a |
# +--------------------------------+
# |#SaveTheInternet, #NetNeutrality|
# |#NALCABPolicy2018 |
# |#NetNeutrality |
# +--------------------------------+
I ended up using two regexp_replace calls, so there may well be a better alternative; I just couldn't think of one.
I'm not sure how to do this in a dataframe context. I have the table below with text information:
TEXT |
-------------------------------------------|
"Get some new #turbo #stacks today!" |
"Is it one or three? #phone" |
"Mayhaps it be three afterall..." |
"So many new issues with phone... #iphone" |
And I want to edit it down to where only the words with a '#' symbol are kept, like in the result below.
TEXT |
-----------------|
"#turbo #stacks" |
"#phone" |
"" |
"#iphone" |
I'd also like to know whether it's possible to eliminate the rows that end up empty, either by checking for NaN or with a different kind of condition, to get this result:
TEXT |
-----------------|
"#turbo #stacks" |
"#phone" |
"#iphone" |
I'm using Python 2.7 and pandas for this.
You could try using regex and extractall:
df.TEXT.str.extractall('(#\w+)').groupby(level=0)[0].apply(' '.join)
Output:
0 #turbo #stacks
1 #phone
3 #iphone
Name: 0, dtype: object
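extractall drops rows with no match, which also covers the NaN part of the question: reindexing the result against the original frame puts NaN back for rows like index 2, and dropna can then remove them. A sketch (assigning the result back to the TEXT column is an assumption about the desired shape):
import pandas as pd

df = pd.DataFrame({'TEXT': ["Get some new #turbo #stacks today!",
                            "Is it one or three? #phone",
                            "Mayhaps it be three afterall...",
                            "So many new issues with phone... #iphone"]})

tags = df.TEXT.str.extractall('(#\w+)').groupby(level=0)[0].apply(' '.join)
df['TEXT'] = tags.reindex(df.index)  # NaN where no hashtag was found
df = df.dropna(subset=['TEXT'])      # drop the empty rows
print(df)
#              TEXT
# 0  #turbo #stacks
# 1          #phone
# 3         #iphone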
Got some text:
[23/07 | DEV | FARO | QC Billable | #2032] Unable to Load label
[30/07 | QC | ROLAWN ] Selling products as a bundle
[11/08 | EST | QC BILLABLE | #2015 ISUOG ] On Demand website looping
[05/08 | EST | ROLAWN | Problems with 'find a stockist'
[29/07 | DEV | QUBA] Blog comments loading to error
[24/07 | FROG | EST| QC BILLABLE #2033] Carousel banner not working correctly
I'm trying to match the last sentence at the end of each line so the matches are as follows:
Unable to Load label
Selling products as a bundle
On Demand website looping
Problems with 'find a stockist'
Blog comments loading to error
Carousel banner not working correctly
Unfortunately, I can't depend on the structure of the line to conform, but the information I'm trying to extract should always be the last sentence. I've tried quite a few different things, but I'm struggling here.
If there may also be some non-word character before the last sentence, try:
[\w\s']+$
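A quick check of that pattern against a few of the sample lines, using Python's re (a sketch):
import re

lines = [
    "[23/07 | DEV | FARO | QC Billable | #2032] Unable to Load label",
    "[30/07 | QC | ROLAWN ] Selling products as a bundle",
    "[05/08 | EST | ROLAWN | Problems with 'find a stockist'",
]
for line in lines:
    m = re.search(r"[\w\s']+$", line)
    print(m.group().strip() if m else '(no match)')
# Unable to Load label
# Selling products as a bundle
# Problems with 'find a stockist'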
](.+)$
Here's a pretty naive solution: https://regex101.com/r/yT8jJ7/1.
If you give more details about the actual structure it could be refined.
Edit: The answer above by m.cekiera, [\w\s']+$, is better.
Here is an example I am trying to understand from a website.
People2.txt is as follows.
2323:Doe John California
827:Doe Jane Texas
982982:Neuman Alfred Nebraska
I don't get the output shown below when I run this command:
PS C:\> Get-Content people2.txt | %{$data = [regex]::split($_, '\t|:'); Write-Output "$($data[2]) $($data[1]), $($data[3])"}
John Doe, California
Jane Doe, Texas
Alfred Neuman, Nebraska
I could take out the numbers and swap the first and second fields using:
gc C:\appl\ppl.txt | %{$data = [regex]::split($_, ':'); Write-Output $data[1]} | Out-File c:\appl\ppll.txt
gc C:\appl\ppll.txt | %{$data = $_.split(" "); Write-Output "$($data[1]) $($data[0]), $($data[2])"}
Please help
I need to find a more efficient way to do this.
Also, I want to understand '\t|:' - is it 'split at the first TAB stop and at a :'?
Just threw this off the top of my head: ^(?<number>\d+):(?<first>\w+)\s+(?<last>\w+)\s(?<location>.*)$
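Each (?<name>...) there is a named capture group, and after a -match the groups land in PowerShell's $Matches. (On the '\t|:' question: the | is regex alternation, so [regex]::split splits at every tab or colon, not just the first one.) As a quick cross-check, here is the same pattern in Python's re, which writes named groups as (?P<name>...) rather than (?<name>...) - a sketch:
import re

pattern = re.compile(r'^(?P<number>\d+):(?P<first>\w+)\s+(?P<last>\w+)\s(?P<location>.*)$')
for line in ['2323:Doe John California',
             '827:Doe Jane Texas',
             '982982:Neuman Alfred Nebraska']:
    m = pattern.match(line)
    if m:
        # swap surname and given name, as in the expected output
        print('{0} {1}, {2}'.format(m.group('last'), m.group('first'), m.group('location')))
# John Doe, California
# Jane Doe, Texas
# Alfred Neuman, Nebraska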