Use non-consuming regular expressions in PySpark SQL functions [duplicate]

This question already has answers here:
Regular Expressions: Is there an AND operator?
(14 answers)
Regex AND operator
(4 answers)
Closed 3 years ago.
How can I use existing pySpark sql functions to find non-consuming regular expression patterns in a string column?
The following is reproducible, but does not give the desired results.
import pyspark
from pyspark.sql import (
    SparkSession,
    functions as F)
spark = (SparkSession.builder
         .master('yarn')
         .appName("regex")
         .getOrCreate()
         )
sc = spark.sparkContext
sc.version  # u'2.2.0'
testdf = spark.createDataFrame([
    (1, "Julie", "CEO"),
    (2, "Janice", "CFO"),
    (3, "Jake", "CTO")],
    ["ID", "Name", "Title"])
ptrn = '(?=Ja)(?=ke)'
testdf.withColumn('contns_ptrn', testdf.Name.rlike(ptrn)).show()
+---+------+-----+-----------+
| ID|  Name|Title|contns_ptrn|
+---+------+-----+-----------+
|  1| Julie|  CEO|      false|
|  2|Janice|  CFO|      false|
|  3|  Jake|  CTO|      false|
+---+------+-----+-----------+
testdf.withColumn('contns_ptrn', F.regexp_extract(F.col('Name'), ptrn, 1)).show()
+---+------+-----+-----------+
| ID|  Name|Title|contns_ptrn|
+---+------+-----+-----------+
|  1| Julie|  CEO|           |
|  2|Janice|  CFO|           |
|  3|  Jake|  CTO|           |
+---+------+-----+-----------+
testdf.withColumn('contns_ptrn', F.regexp_replace(F.col('Name'), ptrn, '')).show()
+---+------+-----+-----------+
| ID|  Name|Title|contns_ptrn|
+---+------+-----+-----------+
|  1| Julie|  CEO|      Julie|
|  2|Janice|  CFO|     Janice|
|  3|  Jake|  CTO|       Jake|
+---+------+-----+-----------+
The desired results would be:
+---+------+-----+-----------+
| ID|  Name|Title|contns_ptrn|
+---+------+-----+-----------+
|  1| Julie|  CEO|      false|
|  2|Janice|  CFO|      false|
|  3|  Jake|  CTO|       true|
+---+------+-----+-----------+
Only the third row's Name contains both 'Ja' and 'ke'.
If regexp_extract or regexp_replace were able to extract or replace non-consuming regular expression patterns, I could also combine them with length to derive a Boolean column.

Found a quick solution; hopefully it helps someone else.
Change ptrn from '(?=Ja)(?=ke)' to '(?=.*Ja)(?=.*ke)' and rlike works. Both lookaheads in the original pattern are anchored at the same position, so it requires 'Ja' and 'ke' to start at the same character, which is impossible; prefixing each with .* lets them match anywhere in the string.
This answer got me close, but led to my problem:
https://stackoverflow.com/a/469951/5060792
These answers solved my problem:
https://stackoverflow.com/a/3041326
https://stackoverflow.com/a/470602/5060792
By the way, with nothing but the change to ptrn, regexp_extract throws a java.lang.IndexOutOfBoundsException: No group 1 exception. After wrapping the entire pattern in parentheses, ptrn = '((?=.*Ja)(?=.*ke))', it returns nulls.
Again, regexp_replace replaces nothing, and the original values are returned.
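The fix can be sanity-checked outside Spark; here is a minimal sketch with Python's re module, whose lookahead semantics match the Java regex engine Spark uses for these constructs:

```python
import re

names = ["Julie", "Janice", "Jake"]

# Both lookaheads anchor at the same position, so 'Ja' and 'ke'
# would have to start at the same character, which is impossible.
broken = re.compile(r'(?=Ja)(?=ke)')
# Prefixing each lookahead with .* lets it scan the rest of the string.
fixed = re.compile(r'(?=.*Ja)(?=.*ke)')

print([bool(broken.search(n)) for n in names])  # [False, False, False]
print([bool(fixed.search(n)) for n in names])   # [False, False, True]
```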


Parsing string using regexp_extract using pyspark

I am trying to split the string into different columns using a regular expression.
Below is my data
decodeData = [('M|C705|Exx05','2'),
              ('M|Exx05','4'),
              ('M|C705 P|Exx05','6'),
              ('M|C705 P|8960 L|Exx05','7'),
              ('M|C705 P|78|8960','9')]
df = sc.parallelize(decodeData).toDF(['Decode',''])
dfNew = df.withColumn('Exx05', regexp_extract(col('Decode'), '(M|P|M)(\\|Exx05)', 1)) \
          .withColumn('C705', regexp_extract(col('Decode'), '(M|P|M)(\\|C705)', 1)) \
          .withColumn('8960', regexp_extract(col('Decode'), '(M|P|M)(\\|8960)', 1))
dfNew.show()
Result
+---------------------+---+-----+----+-----+
|               Decode|   |Exx05|C705| 8960|
+---------------------+---+-----+----+-----+
|         M|C705|Exx05|  2|     |   M|     |
|              M|Exx05|  4|    M|    |     |
|       M|C705 P|Exx05|  6|    P|   M|     |
|M|C705 P|8960 L|Exx05|  7|    M|   M|    P|
|     M|C705 P|78|8960|  9|     |   M|     |
+---------------------+---+-----+----+-----+
Here I am trying to extract the code for the tokens Exx05, C705, and 8960; each can fall under the M/P/L codes.
e.g. While decoding 'M|C705 P|8960 L|Exx05' I expect the results L, M, P in the respective columns. However, I am missing some logic here, which I am finding difficult to crack.
Expected results
+--------------------+---+-----+----+-----+
| Decode| |Exx05|C705| 8960|
+--------------------+---+-----+----+-----+
| M|C705|Exx05 | | M| M| |
| M|Exx05 | | M| | |
| M|C705 P|Exx05 | | P| M| |
|M|C705 P|8960 L|Exx05| | L| M| P|
| M|C705 P|78|8960 | | | M| P|
+--------------------+---+-----+----+-----+
When I try to change the regular expression accordingly, it works for some cases but not for others, and this is just a subset of the actual data I am working on.
e.g. Exx05 can fall under any code (M/L/P), and it can appear at any position: beginning, middle, end, etc.
One Decode can only belong to one code (M, L, or P) per entry/ID, i.e. 'M|Exx05 P|8960 L|Exx05' (Exx05 under both M and L) will not occur.
You can add ([^ ])* in the regex to extend it so that it matches any consecutive patterns that are not separated by a space:
dfNew = df.withColumn(
    'Exx05',
    regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|Exx05)', 1)
).withColumn(
    'C705',
    regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|C705)', 1)
).withColumn(
    '8960',
    regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|8960)', 1)
)
dfNew.show(truncate=False)
+---------------------+---+-----+----+----+
|Decode               |   |Exx05|C705|8960|
+---------------------+---+-----+----+----+
|M|C705|Exx05         |2  |M    |M   |    |
|M|Exx05              |4  |M    |    |    |
|M|C705 P|Exx05       |6  |P    |M   |    |
|M|C705 P|8960 L|Exx05|7  |L    |M   |P   |
|M|C705 P|78|8960     |9  |     |M   |P   |
+---------------------+---+-----+----+----+
What about using X(?=Y), also known as a lookahead assertion? It matches X only if it is followed by Y.
from pyspark.sql.functions import *
dfNew = df.withColumn('Exx05', regexp_extract(col('Decode'), '([A-Z](?=\\|Exx05))', 1)) \
          .withColumn('C705', regexp_extract(col('Decode'), '([A-Z](?=\\|C705))', 1)) \
          .withColumn('8960', regexp_extract(col('Decode'), '([A-Z]+(?=\\|[0-9]|8960))', 1))
dfNew.show()
+--------------------+---+-----+----+----+
|              Decode|   |Exx05|C705|8960|
+--------------------+---+-----+----+----+
|        M|C705|Exx05|  2|     |   M|    |
|             M|Exx05|  4|    M|    |    |
|      M|C705 P|Exx05|  6|    P|   M|    |
|M|C705 P|8960 L|E...|  7|    L|   M|   P|
|    M|C705 P|78|8960|  9|     |   M|   P|
+--------------------+---+-----+----+----+
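For the simple cases where the code letter sits directly before the token, the lookahead idea can be checked without Spark using Python's re module; this sketch mirrors the answer's pattern (the helper name is made up for illustration):

```python
import re

def code_for(token, s):
    # The lookahead (?=\|token) consumes nothing, so only the
    # single code letter right before '|token' is captured.
    m = re.search(r'([A-Z])(?=\|%s)' % re.escape(token), s)
    return m.group(1) if m else ''

decode = 'M|C705 P|8960 L|Exx05'
print(code_for('Exx05', decode))  # L
print(code_for('C705', decode))   # M
print(code_for('8960', decode))   # P
```

Note this simple form does not handle tokens reached through intermediate segments, such as 8960 in 'M|C705 P|78|8960', which is why the answer's third pattern is more involved.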

Change selected rows into columns

I have a dataframe with the below structure
+------+-------------+--------+
|region|          key|     val|
+------+-------------+--------+
|Sample|         row1|       6|
|Sample|row1_category|   Cat 1|
|Sample|    row1_Unit|      Kg|
|Sample|         row2|       4|
|Sample|row2_category|   Cat 2|
|Sample|    row2_Unit|     ltr|
+------+-------------+--------+
I tried adding a column and pushing the values from rows to columns, but could not get the category and unit columns populated correctly.
I want to convert it into the below structure
+------+-------------+--------+--------+--------+
|region|          key|     val|Category|    Unit|
+------+-------------+--------+--------+--------+
|Sample|         row1|       6|   Cat 1|      Kg|
|Sample|         row2|       4|   Cat 2|     ltr|
+------+-------------+--------+--------+--------+
I need to do this for multiple keys; there will be row2, row3, etc.
scala> df.show
+------+-------------+----+
|region|          key| val|
+------+-------------+----+
|Sample|         row1|   6|
|Sample|row1_category|Cat1|
|Sample|    row1_Unit|  Kg|
|Sample|         row2|   4|
|Sample|row2_category|Cat2|
|Sample|    row2_Unit| ltr|
+------+-------------+----+
scala> val df1 = df.withColumn("_temp", split( $"key" , "_")).select(col("region"), $"_temp".getItem(0) as "key",$"_temp".getItem(1) as "colType",col("val"))
scala> df1.show(false)
+------+----+--------+----+
|region|key |colType |val |
+------+----+--------+----+
|Sample|row1|null    |6   |
|Sample|row1|category|Cat1|
|Sample|row1|Unit    |Kg  |
|Sample|row2|null    |4   |
|Sample|row2|category|Cat2|
|Sample|row2|Unit    |ltr |
+------+----+--------+----+
scala> val df2 = df1.withColumn("Category", when(col("colType") === "category", col("val"))).withColumn("Unit", when(col("colType") === "Unit", col("val"))).withColumn("val", when(col("colType").isNull, col("val")))
scala> df2.show(false)
+------+----+--------+----+--------+----+
|region|key |colType |val |Category|Unit|
+------+----+--------+----+--------+----+
|Sample|row1|null    |6   |null    |null|
|Sample|row1|category|null|Cat1    |null|
|Sample|row1|Unit    |null|null    |Kg  |
|Sample|row2|null    |4   |null    |null|
|Sample|row2|category|null|Cat2    |null|
|Sample|row2|Unit    |null|null    |ltr |
+------+----+--------+----+--------+----+
scala> val df3 = df2.groupBy("region", "key").agg(concat_ws("",collect_set(when($"val".isNotNull, $"val"))).as("val"),concat_ws("",collect_set(when($"Category".isNotNull, $"Category"))).as("Category"), concat_ws("",collect_set(when($"Unit".isNotNull, $"Unit"))).as("Unit"))
scala> df3.show()
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1|  6|    Cat1|  Kg|
|Sample|row2|  4|    Cat2| ltr|
+------+----+---+--------+----+
You can achieve it by grouping on your key (and perhaps region) and aggregating with collect_list. Using the regex ^[^_]+ you will get all characters up to the first _ character.
UPDATE: You can use a (\\d+) capturing group to find numbers in a string. For example, if you have row_123_456_unit, then regexp_extract('val, "(\\d+)", 0) returns 123, the first match of the pattern. Note that regexp_extract only examines the first match, so the last parameter selects a group within that match; to reach the second number you need a pattern with two groups, e.g. regexp_extract('val, "(\\d+)_(\\d+)", 2) returns 456. Hope it helps.
df.printSchema()
df.show()
val regex1 = "^[^_]+" // until '_' character
val regex2 = "(\\d{1,})" // capture group of numbers
df.groupBy('region, regexp_extract('key, regex1, 0))
  .agg('region, collect_list('key).as("key"), collect_list('val).as("val"))
  .select('region,
          'key.getItem(0).as("key"),
          'val.getItem(0).as("val"),
          'val.getItem(1).as("Category"),
          'val.getItem(2).as("Unit")
  ).show()
output:
root
|-- region: string (nullable = true)
|-- key: string (nullable = true)
|-- val: string (nullable = true)
+------+-------------+-----+
|region|          key|  val|
+------+-------------+-----+
|Sample|         row1|    6|
|Sample|row1_category|Cat 1|
|Sample|    row1_Unit|   Kg|
|Sample|         row2|    4|
|Sample|row2_category|Cat 2|
|Sample|    row2_Unit|  ltr|
+------+-------------+-----+
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1|  6|   Cat 1|  Kg|
|Sample|row2|  4|   Cat 2| ltr|
+------+----+---+--------+----+
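The regexes in this answer can be verified without Spark; here is a short sketch using Python's re module (the two-group pattern for reaching the second number is an assumption added for illustration, not part of the original answer):

```python
import re

regex1 = r'^[^_]+'       # everything up to the first '_'
regex2 = r'(\d+)_(\d+)'  # two capturing groups of digits

# Group 0 of regex1 is the key prefix used for grouping.
print(re.search(regex1, 'row1_category').group(0))  # row1

# regexp_extract only sees the first match of the pattern, so the
# group index selects among groups *within* that match.
m = re.search(regex2, 'row_123_456_unit')
print(m.group(1), m.group(2))  # 123 456
```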

Replace substring containing dollar sign ($) with other column value pyspark [duplicate]

This question already has answers here:
Spark column string replace when present in other column (row)
(2 answers)
Closed 3 years ago.
I am trying to replace the substring '$NUMBER' with the value in the column 'number' for each row.
I tried
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
replace_udf = udf(
    lambda long_text, number: long_text.replace("$NUMBER", number),
    StringType()
)
df = df.withColumn('long_text', replace_udf(col('long_text'), col('number')))
and
from pyspark.sql.functions import expr
df = df.withColumn('long_text',expr("regexp_replace(long_text, '$NUMBER', number)"))
but nothing works. I can't figure out how another column can be the replacement for the substring.
SAMPLE:
df1 = spark.createDataFrame(
    [
        ("hahaha the $NUMBER is good", 3),
        ("i dont know about $NUMBER", 2),
        ("what is $NUMBER doing?", 5),
        ("ajajaj $NUMBER", 2),
        ("$NUMBER dwarfs", 1)
    ],
    ["long_text", "number"]
)
INPUT:
+--------------------------+------+
|                 long_text|number|
+--------------------------+------+
|hahaha the $NUMBER is good|     3|
|    what is $NUMBER doing?|     5|
|            ajajaj $NUMBER|     2|
+--------------------------+------+
EXPECTED OUTPUT:
+--------------------+------+
|           long_text|number|
+--------------------+------+
|hahaha the 3 is good|     3|
|    what is 5 doing?|     5|
|            ajajaj 2|     2|
+--------------------+------+
Similar question where the answers didn't cover the column replacement:
Spark column string replace when present in other column (row)
The problem is that $ has a special meaning in regular expressions, which means match the end of the line. So your code:
regexp_replace(long_text, '$NUMBER', number)
is trying to match the pattern: end of line, followed by the literal string NUMBER (which can never match anything).
In order to match a $ (or any other regex special character), you have to escape it with a \.
from pyspark.sql.functions import expr
df = df.withColumn('long_text',expr("regexp_replace(long_text, '\$NUMBER', number)"))
df.show()
#+--------------------+------+
#| long_text|number|
#+--------------------+------+
#|hahaha the 3 is good| 3|
#| what is 5 doing?| 5|
#| ajajaj 2| 2|
#+--------------------+------+
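The escaping issue is easy to reproduce with Python's re module, which treats $ the same way:

```python
import re

text = "hahaha the $NUMBER is good"

# Unescaped, '$' asserts end-of-line, so '$NUMBER' can never match:
print(re.sub(r'$NUMBER', '3', text))   # hahaha the $NUMBER is good
# Escaped, '\$' matches a literal dollar sign:
print(re.sub(r'\$NUMBER', '3', text))  # hahaha the 3 is good
```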
You have to cast the number column to string with str() before you can use it with replace in your lambda:
from pyspark.sql import types as T
from pyspark.sql import functions as F
l = [('hahaha the $NUMBER is good', 3),
     ('what is $NUMBER doing?', 5),
     ('ajajaj $NUMBER ', 2)]
df = spark.createDataFrame(l, ['long_text','number'])
#Just added str() to your function
replace_udf = F.udf(lambda long_text, number: long_text.replace("$NUMBER", str(number)), T.StringType())
df.withColumn('long_text',replace_udf(F.col('long_text'),F.col('number'))).show()
+--------------------+------+
|           long_text|number|
+--------------------+------+
|hahaha the 3 is good|     3|
|    what is 5 doing?|     5|
|           ajajaj 2 |     2|
+--------------------+------+

Pyspark extract multivalued column to another table

I have a csv file with a column named id and another named genre, which can contain any number of genres separated by |.
1,Action|Horror|Adventure
2,Action|Adventure
Is it possible to select a row and, for each genre, insert the current id and genre into another dataframe?
1,Action
1,Horror
1,Adventure
2,Action
2,Adventure
You can use a udf to split the genre data and then use the explode function:
from pyspark.sql import functions as f
from pyspark.sql.functions import explode
from pyspark.sql.types import ArrayType, StringType
s = [('1','Action|Adventure'), ('2','Comdey|Action')]
rdd = sc.parallelize(s)
df = sqlContext.createDataFrame(rdd, ['id','Col'])
df.show()
+---+----------------+
| id|             Col|
+---+----------------+
|  1|Action|Adventure|
|  2|   Comdey|Action|
+---+----------------+
newcol = f.udf(lambda x: x.split('|'), ArrayType(StringType()))
df1 = df.withColumn('Genre', explode(newcol('Col'))).drop('Col')
df1.show()
+---+---------+
| id|    Genre|
+---+---------+
|  1|   Action|
|  1|Adventure|
|  2|   Comdey|
|  2|   Action|
+---+---------+
In addition to Suresh's solution, you can also use flatMap after splitting your string to achieve the same (the tuple-unpacking lambdas below are Python 2 syntax):
#Read csv from file (works in Spark 2.x and onwards)
df_csv = sqlContext.read.csv("genre.csv")
#Split the Genre (y) on the character |, but leave the id (x) as is
rdd_split= df_csv.rdd.map(lambda (x,y):(x,y.split('|')))
#Use a list comprehension to add the id column to each Genre(y)
rdd_explode = rdd_split.flatMap(lambda (x,y):[(x,k) for k in y])
#Convert the resulting RDD back to a dataframe
df_final = rdd_explode.toDF(['id','Genre'])
df_final.show() returns this as output:
+---+---------+
| id|    Genre|
+---+---------+
|  1|   Action|
|  1|   Horror|
|  1|Adventure|
|  2|   Action|
|  2|Adventure|
+---+---------+
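Since Spark 1.5 the built-in split function (pyspark.sql.functions.split) can replace the udf entirely, e.g. df.withColumn('Genre', explode(split(df['Col'], '[|]'))); note the pattern argument is a regex, so | must be escaped or bracketed. The split-then-flatten semantics themselves are plain list processing; here is a minimal Spark-free sketch:

```python
# Each (id, 'A|B|C') row fans out to one (id, genre) pair per genre,
# which is exactly what explode over the split array does.
rows = [('1', 'Action|Horror|Adventure'), ('2', 'Action|Adventure')]

exploded = [(i, genre)
            for i, genres in rows
            for genre in genres.split('|')]
print(exploded)
# [('1', 'Action'), ('1', 'Horror'), ('1', 'Adventure'),
#  ('2', 'Action'), ('2', 'Adventure')]
```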

if else in pyspark for collapsing column values

I am trying simple code to collapse my categorical variable in a dataframe into binary classes after indexing.
Currently my column has 3 classes: "A", "B", "C".
I am writing a simple if-else statement to collapse the classes, like:
def condition(r):
    if (r.wo_flag == "SLM" or r.wo_flag == "NON-SLM"):
        r.wo_flag = "dispatch"
    else:
        r.wo_flag = "non_dispatch"
    return r.wo_flag
df_final = df_new.map(lambda x: condition(x))
It is not working; it does not seem to take the else branch.
+-----+--------+------+-------------------+----------------+-------+-------------+----+
|MData|Recode12|Status|DayOfWeekOfDispatch|MannerOfDispatch|Wo_flag|PlaceOfInjury|Race|
+-----+--------+------+-------------------+----------------+-------+-------------+----+
|    M|      11|     M|                  4|               7|      C|           99|   1|
|    M|       8|     D|                  3|               7|      A|           99|   1|
|    F|      10|     W|                  2|               7|      C|           99|   1|
|    M|       9|     D|                  1|               7|      B|           99|   1|
|    M|       8|     D|                  2|               7|      C|           99|   1|
+-----+--------+------+-------------------+----------------+-------+-------------+----+
This is the Sample Data
The accepted answer is not very efficient due to the use of a user defined function (UDF).
I think most people are looking for when.
from pyspark.sql.functions import when
matches = df["wo_flag"].isin("SLM", "NON-SLM")
new_df = df.withColumn("wo_flag", when(matches, "dispatch").otherwise("non-dispatch"))
Try this:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
def modify_values(r):
    if r == "A" or r == "B":
        return "dispatch"
    else:
        return "non-dispatch"
ol_val = udf(modify_values, StringType())
new_df = df.withColumn("wo_flag", ol_val(df.wo_flag))
Things you are doing wrong:
You are trying to modify Rows (Rows are immutable).
When a map operation is done on a dataframe, the resulting data structure is a PipelinedRDD, not a dataframe. You have to apply .toDF() to get a dataframe back.
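The when/otherwise approach boils down to a simple membership test; here is a Spark-free sketch of the collapse logic (using the SLM/NON-SLM labels from the question):

```python
def collapse(wo_flag):
    # Mirrors isin("SLM", "NON-SLM") + when/otherwise in the accepted answer.
    return "dispatch" if wo_flag in ("SLM", "NON-SLM") else "non-dispatch"

print([collapse(v) for v in ["SLM", "NON-SLM", "OTHER"]])
# ['dispatch', 'dispatch', 'non-dispatch']
```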