How to remove quotes from front and end of the string Scala - regex

I have a DataFrame where some strings contain a double quote (") at the front and at the end of the string.
Eg:
+-------------------------------+
|data                           |
+-------------------------------+
|"john belushi"                 |
|"john mnunjnj"                 |
|"nmnj tyhng"                   |
|"John b-e_lushi"               |
|"john belushi's book"          |
+-------------------------------+
Expected output:
+-------------------------------+
|data                           |
+-------------------------------+
|john belushi                   |
|john mnunjnj                   |
|nmnj tyhng                     |
|John b-e_lushi                 |
|john belushi's book            |
+-------------------------------+
I am trying to remove only the " double quotes from the string. Can someone tell me how I can remove them in Scala?
Python provides ltrim and rtrim. Is there anything equivalent to that in Scala?

Use the expr, substring and length functions and take the substring from position 2 with length length(data) - 2:
val df_d = List("\"john belushi\"", "\"John b-e_lushi\"", "\"john belushi's book\"")
  .toDF("data")
Input:
+---------------------+
|data |
+---------------------+
|"john belushi" |
|"John b-e_lushi" |
|"john belushi's book"|
+---------------------+
Using expr, substring and length functions:
import org.apache.spark.sql.functions.expr
df_d.withColumn("data", expr("substring(data, 2, length(data) - 2)"))
  .show(false)
Output:
+-------------------+
|data |
+-------------------+
|john belushi |
|John b-e_lushi |
|john belushi's book|
+-------------------+
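Since the title mentions regex, an equivalent regexp_replace-based version may also be worth noting. This is only a minimal sketch reusing the df_d DataFrame from above; it strips one leading and one trailing double quote and leaves inner quotes untouched:
import org.apache.spark.sql.functions.{col, regexp_replace}

// remove a double quote anchored at the start or at the end of the value
df_d.withColumn("data", regexp_replace(col("data"), "^\"|\"$", ""))
  .show(false)
This behaves like an ltrim/rtrim limited to the quote character, which is the closest analogue to the Python trimming mentioned in the question.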

How to remove quotes from front and end of the string Scala?
myString.substring(1, myString.length()-1) will remove the double quotes.
import spark.implicits._
val list = List("\"hi\"", "\"I am learning scala\"", "\"pls\"", "\"help\"").toDF()
list.show(false)
val finaldf = list.map { row =>
  val stringdoublequotestoberemoved = row.getAs[String]("value")
  stringdoublequotestoberemoved.substring(1, stringdoublequotestoberemoved.length() - 1)
}
finaldf.show(false)
Result:
+--------------------+
| value|
+--------------------+
| "hi"|
|"I am learning sc...|
| "pls"|
| "help"|
+--------------------+
+-------------------+
| value|
+-------------------+
| hi|
|I am learning scala|
| pls|
| help|
+-------------------+

Try it:
scala> val dataFrame = List("\"john belushi\"","\"john mnunjnj\"" , "\"nmnj tyhng\"" ,"\"John b-e_lushi\"", "\"john belushi's book\"").toDF("data")
scala> dataFrame.map { row => row.mkString.stripPrefix("\"").stripSuffix("\"")}.show
+-------------------+
| value|
+-------------------+
| john belushi|
| john mnunjnj|
| nmnj tyhng|
| John b-e_lushi|
|john belushi's book|
+-------------------+

Related

extract a string before certain punctuation regex

How to extract words before the first punctuation | in presto SQL?
Table
+----+------------------------------------+
| id | title |
+----+------------------------------------+
| 1 | LLA | Rec | po#069762 | saddasd |
| 2 | Hello amustromg dsfood |
| 3 | Hel | sdfke bones. |
+----+------------------------------------+
Output
+----+------------------------------------+
| id | result |
+----+------------------------------------+
| 1 | LLA |
| 2 | |
| 3 | Hel |
+----+------------------------------------+
Attempt
REGEXP_EXTRACT(title, '(.*)([^|]*)', 1)
Thank you
Using the base string functions we can try:
SELECT id,
       CASE WHEN title LIKE '%|%'
            THEN TRIM(SUBSTR(title, 1, STRPOS(title, '|') - 1))
            ELSE ''
       END AS result
FROM yourTable
ORDER BY id;
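If you would rather stay on the regex route from the attempt, a possible sketch (assuming REGEXP_EXTRACT returns NULL when the pattern does not match, and reusing the yourTable name from above) is:
SELECT id,
       COALESCE(TRIM(REGEXP_EXTRACT(title, '^([^|]*)\|', 1)), '') AS result
FROM yourTable
ORDER BY id;
Here group 1 captures everything before the first |, so row 2 (which has no pipe) falls back to an empty string via COALESCE.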

Parsing string using regexp_extract using pyspark

I am trying to split the string into different columns using a regular expression.
Below is my data
decodeData = [('M|C705|Exx05', '2'),
              ('M|Exx05', '4'),
              ('M|C705 P|Exx05', '6'),
              ('M|C705 P|8960 L|Exx05', '7'),
              ('M|C705 P|78|8960', '9')]
df = sc.parallelize(decodeData).toDF(['Decode', ''])

from pyspark.sql.functions import regexp_extract, col

dfNew = df.withColumn('Exx05', regexp_extract(col('Decode'), '(M|P|M)(\\|Exx05)', 1)) \
          .withColumn('C705', regexp_extract(col('Decode'), '(M|P|M)(\\|C705)', 1)) \
          .withColumn('8960', regexp_extract(col('Decode'), '(M|P|M)(\\|8960)', 1))
dfNew.show()
Result
+--------------------+---+-----+----+-----+
| Decode| |Exx05|C705| 8960|
+--------------------+---+-----+----+-----+
| M|C705|Exx05 | 2 | | M| |
| M|Exx05 | 4 | M| | |
| M|C705 P|Exx05 | 6 | P| M| |
|M|C705 P|8960 L|Exx05| 7 | M| M| P|
| M|C705 P|78|8960 | 9 | | M| |
+--------------------+---+-----+----+-----+
Here I am trying to extract the code for the strings Exx05, C705 and 8960, and each of these can fall under the M/P/L codes.
e.g. while decoding 'M|C705 P|8960 L|Exx05' I expect the results L, M and P in the respective columns. However, I am missing some logic here, which I am finding difficult to crack.
Expected results
+--------------------+---+-----+----+-----+
| Decode| |Exx05|C705| 8960|
+--------------------+---+-----+----+-----+
| M|C705|Exx05 | | M| M| |
| M|Exx05 | | M| | |
| M|C705 P|Exx05 | | P| M| |
|M|C705 P|8960 L|Exx05| | L| M| P|
| M|C705 P|78|8960 | | | M| P|
+--------------------+---+-----+----+-----+
When I try to change the regular expression accordingly, it works for some cases and not for other sample cases, and this is just a subset of the actual data I am working on.
e.g.: 1. Exx05 can fall under any code (M/L/P) and it can appear at any position: beginning, middle, end, etc.
2. One Decode value can only belong to one (M or L or P) code per entry/ID, i.e. M|Exx05 P|8960 L|Exx05 - where Exx05 falls under both M and L - is a scenario that will not exist.
You can add ([^ ])* to the regex to extend it so that it matches any consecutive characters that are not separated by a space:
dfNew = df.withColumn(
    'Exx05',
    regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|Exx05)', 1)
).withColumn(
    'C705',
    regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|C705)', 1)
).withColumn(
    '8960',
    regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|8960)', 1)
)
dfNew.show(truncate=False)
+---------------------+---+-----+----+----+
|Decode | |Exx05|C705|8960|
+---------------------+---+-----+----+----+
|M|C705|Exx05 |2 |M |M | |
|M|Exx05 |4 |M | | |
|M|C705 P|Exx05 |6 |P |M | |
|M|C705 P|8960 L|Exx05|7 |L |M |P |
|M|C705 P|78|8960     |9  |     |M   |P   |
+---------------------+---+-----+----+----+
What about using X(?=Y), also known as a lookahead assertion? This ensures we match X only if it is followed by Y.
from pyspark.sql.functions import *

dfNew = df.withColumn('Exx05', regexp_extract(col('Decode'), '([A-Z](?=\|Exx05))', 1)) \
          .withColumn('C705', regexp_extract(col('Decode'), '([A-Z](?=\|C705))', 1)) \
          .withColumn('8960', regexp_extract(col('Decode'), '([A-Z]+(?=\|[0-9]|8960))', 1))
dfNew.show()
+--------------------+---+-----+----+----+
| Decode| t|Exx05|C705|8960|
+--------------------+---+-----+----+----+
| M|C705|Exx05| 2| | M| |
| M|Exx05| 4| M| | |
| M|C705 P|Exx05| 6| P| M| |
|M|C705 P|8960 L|E...| 7| L| M| P|
| M|C705 P|78|8960| 9| | M| P|
+--------------------+---+-----+----+----+

Query array column in BigQuery by condition

I have a table in BigQuery with this format:
+------------+-----------------+------------+-----------------+---------------------------------+
| event_date | event_timestamp | event_name | event_params.key| event_params.value.string_value |
+------------+-----------------+------------+-----------------+---------------------------------+
| 20201110 | 2929929292 | my_event | previous_page | /some-page |
+------------+-----------------+------------+-----------------+---------------------------------+
| | layer | /some-page/layer |
| +-----------------+---------------------------------+
| | session_id | 99292 |
| +-----------------+---------------------------------+
| | user_id | 2929292 |
+------------+-----------------+------------+-----------------+---------------------------------+
| 20201110 | 2882829292 | my_event | previous_page | /some-page |
+------------+-----------------+------------+-----------------+---------------------------------+
| | layer | /some-page/layer |
| +-----------------+---------------------------------+
| | session_id | 29292 |
| +-----------------+---------------------------------+
| | user_id | 229292 |
+-------------------------------------------+-----------------+---------------------------------+
I want to perform a query to get all rows where event_params.value.string_value contains the regex /layer.
I have tried this:
SELECT
"event_params.value.string_value",
FROM `my_project.my_dataset.my_events_20210110`,
UNNEST(event_params) AS event_param
WHERE event_param.key = 'layer' AND
REGEXP_CONTAINS(event_param.value.string_value, r'/layer')
LIMIT 100
But I'm getting this output:
+---------------------------------+
| event_params.value.string_value |
+---------------------------------+
| event_params.value.string_value |
+---------------------------------+
| event_params.value.string_value |
+---------------------------------+
| event_params.value.string_value |
+---------------------------------+
| event_params.value.string_value |
+---------------------------------+
Any ideas what I'm doing wrong?
You are selecting a string - you should select a column.
The other problem is that you're cross joining the table with its arrays - effectively bloating up the table.
Your solution is to use a subquery in the WHERE clause:
SELECT
  *  -- Not sure what you actually need from the table ...
FROM `my_project.my_dataset.my_events_20210110`
WHERE
  -- COUNT(*)>0 means "if you find more than zero" then return TRUE
  (SELECT COUNT(*) > 0
   FROM UNNEST(event_params) AS event_param
   WHERE event_param.key = 'layer'
     AND REGEXP_CONTAINS(event_param.value.string_value, r'/layer')
  )
LIMIT 100
If you actually want the values from the array, your quick solution is removing the quotes and selecting from the unnested alias:
SELECT
  event_param.value.string_value
FROM `my_project.my_dataset.my_events_20210110`,
  UNNEST(event_params) AS event_param
WHERE event_param.key = 'layer'
  AND REGEXP_CONTAINS(event_param.value.string_value, r'/layer')
LIMIT 100

How to retrieve a column from a pyspark dataframe and insert it as a new column in an existing pyspark dataframe?

The problem is:
I've got a pyspark dataframe like this
df1:
+--------+
|index |
+--------+
| 121|
| 122|
| 123|
| 124|
| 125|
| 121|
| 121|
| 126|
| 127|
| 120|
| 121|
| 121|
| 121|
| 127|
| 129|
| 132|
| 122|
| 121|
| 121|
| 121|
+--------+
I want to retrieve the index column from df1 and insert it into the existing dataframe df2 (which has the same length).
df2:
+--------------------+--------------------+
| fact1| fact2|
+--------------------+--------------------+
| 2.4899928731985597|-0.19775025821959014|
| 1.029654847161142| 1.4878188087911541|
| -2.253992428312965| 0.29853121635739804|
| -0.8866000393025826| 0.4032596563578692|
|0.027618408969029146| 0.3218421798358574|
| -3.096711320314157|-0.35825821485752635|
| 3.1758221960731525| -2.0598630487806333|
| 7.401934592245097| -6.359158142708468|
| 1.9954990843859282| 1.9352531243666828|
| 8.728444492631189| -4.644796442599776|
| 3.21061543955211| -1.1472165049607643|
| -0.9619142291174212| -1.2487100946166108|
| 1.0681264788022142| 0.7901514935750167|
| -1.599476182182916| -1.171236788513644|
| 2.657843803002389| 1.456063339439953|
| -1.5683015324294765| -0.6126175010968302|
| -1.6735815834568026| -1.176721177528106|
| -1.4246852948658484| 0.745873761554541|
| 3.7043534046759716| 1.3993120926240652|
| 5.420426369792451| -2.149279759367474|
+--------------------+--------------------+
to get a new df2 with the 3 columns: index, fact1, fact2.
Any ideas?
Thanks in advance.
Hope this helps!
import pyspark.sql.functions as f

df1 = sc.parallelize([[121], [122], [123]]).toDF(["index"])
df2 = sc.parallelize([[2.4899928731985597, -0.19775025821959014],
                      [1.029654847161142, 1.4878188087911541],
                      [-2.253992428312965, 0.29853121635739804]]).toDF(["fact1", "fact2"])

# since there is no common column between these two dataframes, add row_index so that they can be joined
df1 = df1.withColumn('row_index', f.monotonically_increasing_id())
df2 = df2.withColumn('row_index', f.monotonically_increasing_id())

df2 = df2.join(df1, on=["row_index"]).sort("row_index").drop("row_index")
df2.show()
Don't forget to let us know if it solved your problem :)
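One caveat worth flagging: monotonically_increasing_id() only guarantees increasing, unique IDs, not identical consecutive values across two independently partitioned DataFrames, so on multi-partition data the join above can misalign rows. A more defensive sketch (still assuming both DataFrames are already in the desired row order; the *_indexed names are just illustrative) turns the ID into a consecutive row number before joining:
from pyspark.sql import Window
import pyspark.sql.functions as f

# convert the non-consecutive IDs into consecutive row numbers, then join on those
w = Window.orderBy(f.monotonically_increasing_id())
df1_indexed = df1.withColumn("row_index", f.row_number().over(w))
df2_indexed = df2.withColumn("row_index", f.row_number().over(w))

df2 = df2_indexed.join(df1_indexed, on=["row_index"]).sort("row_index").drop("row_index")
df2.show()
Note that the window has no partitionBy, so Spark will warn about pulling the data onto a single partition; for small to medium DataFrames that is an acceptable trade-off for a correct index.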

Sed remove NULL but only when the NULL means empty or no value

I am exporting a table from MySQL, where fields that have no value contain the keyword NULL.
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | NULL |
I have written a script to automatically remove all occurrences of NULL using a sed one-liner, which removes the NULL in the date column correctly:
sed -i 's/NULL//g'
However, how do we handle it if we have the following?
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | NULL |
| 2 | Dees | DJ Null Bee| US| 2017-04-01 |
| 3 | NULL AND VOID | NULLZIET | NULL| 2016-05-13 |
| 4 | Pablo | ALA PUHU MINULLE | GERMANY| 2017-02-14 |
Apparently, the global search-and-replace removes all occurrences of NULL, so even "ALA PUHU MINULLE" becomes "ALA PUHU MIE", which is incorrect.
I suppose a regex could be used to apply the rule? But if so, will "DJ Null Bee" be affected and become "DJ Bee"? The desired outcome should really be:
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | |
| 2 | Dees | DJ Null Bee| US| 2017-04-01 |
| 3 | NULL AND VOID | DJ Null Bee| | 2016-05-13 |
| 4 | Pablo | ALA PUHU MINULLE | GERMANY| 2017-02-14 |
NULL is a special keyword for databases, but there is nothing stopping anyone from calling themselves DJ NULL, or having the word NULL in a field because it means something different in another language.
Any ideas on how to resolve this? Any suggestions welcome. Thank you!
All you need is:
$ sed 's/|[[:space:]]*NULL[[:space:]]*|/| |/g; s/|[[:space:]]*NULL[[:space:]]*|/| |/g' file
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | |
| 2 | Dees | DJ Null Bee| US| 2017-04-01 |
| 3 | NULL AND VOID | NULLZIET | | 2016-05-13 |
| 4 | Pablo | ALA PUHU MINULLE | GERMANY| 2017-02-14 |
That will work in any POSIX sed.
You have to do the substitution twice because each match consumes all of the characters it matches: when you have | NULL | NULL |, the middle | is consumed by the match on the first | NULL |, so all that is left is NULL |, which does not match | NULL |. You therefore need two passes to replace every | NULL |.
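To see the overlap issue in isolation, here is a small illustration on a hypothetical one-line input:
$ printf '| NULL | NULL |\n' | sed 's/|[[:space:]]*NULL[[:space:]]*|/| |/g'
| | NULL |
$ printf '| NULL | NULL |\n' | sed 's/|[[:space:]]*NULL[[:space:]]*|/| |/g; s/|[[:space:]]*NULL[[:space:]]*|/| |/g'
| | |
The first command leaves the second NULL behind because its opening | was already consumed; repeating the substitution cleans it up.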
Use awk:
awk -F\| '{ for (i=2;i<=NF;i++) { if ( $i == " NULL " ) { printf "| " } else if ( $i == " NULL" ) { printf "| DJ Null Bee " } else { printf "|"$i } } printf "\n" }' filename
Using pipe as the field separator, go through each field and check whether it equals " NULL "; if it does, print an empty field. Then check whether the field equals " NULL"; if it does, print "DJ Null Bee", otherwise print the field as is.
$ cat mysql.txt | sed -r 's/(\| )NULL( \|)/\1\2/g'
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | |
| 2 | Dees | DJ Null Bee| US| 2017-04-01 |
| 3 | NULL AND VOID | NULLZIET | NULL| 2016-05-13 |
| 4 | Pablo | ALA PUHU MINULLE | GERMANY| 2017-02-14 |
This will only remove capital NULL fields that are delimited by "| " and " |" on their own.
It will also keep the origin column's "| NULL|" in the line "| 3 | NULL AND VOID | NULLZIET | NULL| 2016-05-13 |", since there is no space before the closing pipe there.
awk '{sub(/BRAZIL \| NULL/,"BRAZIL \| ")sub(/NULLZIET \| NULL/,"DJ Null Bee\| ")}1' file
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | |
| 2 | Dees | DJ Null Bee| US| 2017-04-01 |
| 3 | NULL AND VOID | DJ Null Bee| | 2016-05-13 |
| 4 | Pablo | ALA PUHU MINULLE | GERMANY| 2017-02-14 |