Pyspark extract multivalued column to another table - python-2.7

I have a CSV file with a column named id and another named genre, which can contain any number of genres separated by |:
1,Action|Horror|Adventure
2,Action|Adventure
Is it possible to take each row and, for each genre, insert the current id and that genre into another dataframe, like this?
1,Action
1,Horror
1,Adventure
2,Action
2,Adventure

You can use a UDF to split the genre string into an array and then apply the explode function.
from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, StringType
s = [('1','Action|Adventure'),('2','Comedy|Action')]
rdd = sc.parallelize(s)
df = sqlContext.createDataFrame(rdd,['id','Col'])
df.show()
+---+----------------+
| id|             Col|
+---+----------------+
|  1|Action|Adventure|
|  2|   Comedy|Action|
+---+----------------+
newcol = udf(lambda x: x.split('|'), ArrayType(StringType()))
df1 = df.withColumn('Genre', explode(newcol('Col'))).drop('Col')
df1.show()
+---+---------+
| id|    Genre|
+---+---------+
|  1|   Action|
|  1|Adventure|
|  2|   Comedy|
|  2|   Action|
+---+---------+
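If you would rather avoid the UDF, Spark's built-in split function (available since 1.5) does the same job; a minimal sketch against the same df (the pipe has to be escaped because split takes a regular expression):
from pyspark.sql.functions import explode, split
df2 = df.withColumn('Genre', explode(split(df['Col'], '\|'))).drop('Col')
df2.show()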

In addition to Suresh's solution, you can also use flatMap after splitting the string to achieve the same result:
#Read the csv file (works in Spark 2.x and onwards)
df_csv = sqlContext.read.csv("genre.csv")
#Split the Genre (y) on the character |, but leave the id (x) as is
rdd_split= df_csv.rdd.map(lambda (x,y):(x,y.split('|')))
#Use a list comprehension to add the id column to each Genre(y)
rdd_explode = rdd_split.flatMap(lambda (x,y):[(x,k) for k in y])
#Convert the resulting RDD back to a dataframe
df_final = rdd_explode.toDF(['id','Genre'])
df_final.show() returns this as output:
+---+---------+
| id|    Genre|
+---+---------+
|  1|   Action|
|  1|   Horror|
|  1|Adventure|
|  2|   Action|
|  2|Adventure|
+---+---------+
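Note that the tuple-unpacking lambdas above are Python 2 only (matching the python-2.7 tag); on Python 3 you would index into the row instead, roughly:
rdd_split = df_csv.rdd.map(lambda row: (row[0], row[1].split('|')))
rdd_explode = rdd_split.flatMap(lambda pair: [(pair[0], genre) for genre in pair[1]])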

Related

Change selected rows into columns

I have a dataframe with the below structure
+------+-------------+--------+
|region|          key|     val|
+------+-------------+--------+
|Sample|         row1|       6|
|Sample|row1_category|   Cat 1|
|Sample|    row1_Unit|      Kg|
|Sample|         row2|       4|
|Sample|row2_category|   Cat 2|
|Sample|    row2_Unit|     ltr|
+------+-------------+--------+
I tried to add a column and push the values from rows to columns, but I could not get the category and unit columns filled in correctly.
I want to convert it into the below structure
+------+-------------+--------+--------+--------+
|region|          key|     val|Category|    Unit|
+------+-------------+--------+--------+--------+
|Sample|         row1|       6|   Cat 1|      Kg|
|Sample|         row2|       4|   Cat 2|     ltr|
+------+-------------+--------+--------+--------+
I need to do this for multiple keys; there will be row2, row3, etc.
scala> df.show
+------+-------------+----+
|region| key| val|
+------+-------------+----+
|Sample| row1| 6|
|Sample|row1_category|Cat1|
|Sample| row1_Unit| Kg|
|Sample| row2| 4|
|Sample|row2_category|Cat2|
|Sample| row2_Unit| ltr|
+------+-------------+----+
scala> val df1 = df.withColumn("_temp", split( $"key" , "_")).select(col("region"), $"_temp".getItem(0) as "key",$"_temp".getItem(1) as "colType",col("val"))
scala> df1.show(false)
+------+----+--------+----+
|region|key |colType |val |
+------+----+--------+----+
|Sample|row1|null |6 |
|Sample|row1|category|Cat1|
|Sample|row1|Unit |Kg |
|Sample|row2|null |4 |
|Sample|row2|category|Cat2|
|Sample|row2|Unit |ltr |
+------+----+--------+----+
scala> val df2 = df1.withColumn("Category", when(col("colType") === "category", col("val"))).withColumn("Unit", when(col("colType") === "Unit", col("val"))).withColumn("val", when(col("colType").isNull, col("val")))
scala> df2.show(false)
+------+----+--------+----+--------+----+
|region|key |colType |val |Category|Unit|
+------+----+--------+----+--------+----+
|Sample|row1|null |6 |null |null|
|Sample|row1|category|null|Cat1 |null|
|Sample|row1|Unit |null|null |Kg |
|Sample|row2|null |4 |null |null|
|Sample|row2|category|null|Cat2 |null|
|Sample|row2|Unit |null|null |ltr |
+------+----+--------+----+--------+----+
scala> val df3 = df2.groupBy("region", "key").agg(concat_ws("",collect_set(when($"val".isNotNull, $"val"))).as("val"),concat_ws("",collect_set(when($"Category".isNotNull, $"Category"))).as("Category"), concat_ws("",collect_set(when($"Unit".isNotNull, $"Unit"))).as("Unit"))
scala> df3.show()
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1| 6| Cat1| Kg|
|Sample|row2| 4| Cat2| ltr|
+------+----+---+--------+----+
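A rough PySpark equivalent of the same idea, using groupBy/pivot instead of building the Category and Unit columns by hand (a sketch, assuming the same region/key/val dataframe):
from pyspark.sql import functions as F
df1 = df.withColumn("_temp", F.split("key", "_")) \
        .select("region",
                F.col("_temp").getItem(0).alias("key"),
                F.coalesce(F.col("_temp").getItem(1), F.lit("val")).alias("colType"),
                "val")
df3 = df1.groupBy("region", "key").pivot("colType", ["val", "category", "Unit"]).agg(F.first("val"))
df3.show()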
You can achieve it by grouping by your key (and possibly region) and aggregating with collect_list; the regex ^[^_]+ captures all characters up to the first _ character.
UPDATE: You can use the regex (\\d{1,}) to find numbers in a string (capturing groups). For example, with row_123_456_unit, regexp_extract('val, "(\\d{1,})", 0) returns 123, the whole first match, and group index 1 returns the first capturing group of that same match (also 123); to pull out 456 you need a pattern that skips past the first number, e.g. regexp_extract('val, "\\d{1,}_(\\d{1,})", 1). Hope it helps.
df.printSchema()
df.show()
val regex1 = "^[^_]+" // until '_' character
val regex2 = "(\\d{1,})" // capture group of numbers
df.groupBy('region, regexp_extract('key, regex1, 0))
  .agg(collect_list('key).as("key"), collect_list('val).as("val"))
  .select('region,
          'key.getItem(0).as("key"),
          'val.getItem(0).as("val"),
          'val.getItem(1).as("Category"),
          'val.getItem(2).as("Unit"))
  .show()
output:
root
|-- region: string (nullable = true)
|-- key: string (nullable = true)
|-- val: string (nullable = true)
+------+-------------+-----+
|region| key| val|
+------+-------------+-----+
|Sample| row1| 6|
|Sample|row1_category|Cat 1|
|Sample| row1_Unit| Kg|
|Sample| row2| 4|
|Sample|row2_category|Cat 2|
|Sample| row2_Unit| ltr|
+------+-------------+-----+
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1| 6| Cat 1| Kg|
|Sample|row2| 4| Cat 2| ltr|
+------+----+---+--------+----+
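To make the group-index behaviour concrete, here is a quick PySpark check of the row_123_456_unit example (a sketch added for illustration; the Scala behaviour is identical):
from pyspark.sql import functions as F
spark.range(1).select(
    F.regexp_extract(F.lit("row_123_456_unit"), r"(\d{1,})", 0).alias("whole_match"),  # '123', the whole first match
    F.regexp_extract(F.lit("row_123_456_unit"), r"(\d{1,})", 1).alias("group_1"),      # also '123', group 1 of that same match
    F.regexp_extract(F.lit("row_123_456_unit"), r"\d{1,}_(\d{1,})", 1).alias("second") # '456', needs a different pattern
).show()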

Removal of special characters in a txt file using spark

I have a txt file which contains data like
3,3e,4,5
3,5s,4#,5
5,6,2,4
and so on
What I have to do is remove these characters using Spark and then add all the values into an aggregated sum.
How do I remove the special characters and sum all the values?
I have created a dataframe and used regexp_replace to remove the special characters.
But with the .withColumn clause I can only remove the special characters one column at a time, not all at once, which I believe is not optimized code.
Secondly, I have to add all the values into an aggregated sum. How do I get the aggregate value?
If you have a fixed number of columns in the input data, you can use the approach below.
//Input Text file
scala> val rdd = sc.textFile("/spath/stack.txt")
scala> rdd.collect()
res108: Array[String] = Array("3,3e,4,5 ", "3,5s,4#,5 ", 5,6,2,4)
//remove special characters
scala> val rdd1 = rdd.map{x => x.replaceAll("[^,0-9]", "")}
scala> rdd1.collect
res109: Array[String] = Array(3,3,4,5, 3,5,4,5, 5,6,2,4)
//Convert the RDD into a DataFrame
scala> val df = rdd1.map(_.split(",")).map(x => (x(0).toInt,x(1).toInt,x(2).toInt,x(3).toInt)).toDF
scala> df.show(false)
+---+---+---+---+
|_1 |_2 |_3 |_4 |
+---+---+---+---+
|3 |3 |4 |5 |
|3 |5 |4 |5 |
|5 |6 |2 |4 |
+---+---+---+---+
//local UDF to sum up the values
scala> import org.apache.spark.sql.Row
scala> val sumUDF = udf((r:Row) => {
| r.getAs[Int]("_1") + r.getAs[Int]("_2") + r.getAs[Int]("_3") + r.getAs[Int]("_4")
| })
//Expected DataFrame
scala> val finaldf = df.withColumn("sumcol", sumUDF(struct(df.columns map col: _*)))
scala> finaldf.show(false)
+---+---+---+---+------+
|_1 |_2 |_3 |_4 |sumcol|
+---+---+---+---+------+
|3 |3 |4 |5 |15 |
|3 |5 |4 |5 |17 |
|5 |6 |2 |4 |17 |
+---+---+---+---+------+
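The same cleanup can be done from PySpark without dropping to the RDD API; a rough sketch, assuming the file is read with spark.read.csv and every cleaned value is an integer:
from functools import reduce
from pyspark.sql import functions as F
df = spark.read.csv("/spath/stack.txt")   # columns _c0 .. _c3
cleaned = df.select([F.regexp_replace(c, "[^0-9]", "").cast("int").alias(c) for c in df.columns])
summed = cleaned.withColumn("sumcol", reduce(lambda a, b: a + b, [F.col(c) for c in cleaned.columns]))
summed.show()
summed.agg(F.sum("sumcol")).show()        # single aggregated total over all rows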

Use non-consuming regular expression in pySpark sql functions [duplicate]

This question already has answers here:
Regular Expressions: Is there an AND operator?
(14 answers)
Regex AND operator
(4 answers)
Closed 3 years ago.
How can I use existing pySpark sql functions to find non-consuming regular expression patterns in a string column?
The following is reproducible, but does not give the desired results.
import pyspark
from pyspark.sql import (
SparkSession,
functions as F)
spark = (SparkSession.builder
.master('yarn')
.appName("regex")
.getOrCreate()
)
sc = spark.sparkContext
sc.version # u'2.2.0'
testdf = spark.createDataFrame([
(1, "Julie", "CEO"),
(2, "Janice", "CFO"),
(3, "Jake", "CTO")],
["ID", "Name", "Title"])
ptrn = '(?=Ja)(?=ke)'
testdf.withColumn('contns_ptrn', testdf.Name.rlike(ptrn) ).show()
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| false|
| 2|Janice| CFO| false|
| 3| Jake| CTO| false|
+---+------+-----+-----------+
testdf.withColumn('contns_ptrn', F.regexp_extract(F.col('Name'), ptrn, 1)).show()
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| |
| 2|Janice| CFO| |
| 3| Jake| CTO| |
+---+------+-----+-----------+
testdf.withColumn('contns_ptrn', F.regexp_replace(F.col('Name'), ptrn, '')).show()
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| Julie|
| 2|Janice| CFO| Janice|
| 3| Jake| CTO| Jake|
+---+------+-----+-----------+
The desired results would be:
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| false|
| 2|Janice| CFO| false|
| 3| Jake| CTO| true|
+---+------+-----+-----------+
The third row should be true, as its Name contains both 'Ja' and 'ke'.
If regexp_extract or regexp_replace are able to extract or replace non-consuming regular expression patterns, then I could also use them together with length to get a Boolean column.
I found a quick solution; hopefully it can help someone else.
Change ptrn from '(?=Ja)(?=ke)' to '(?=.*Ja)(?=.*ke)' and rlike works.
This answer got me close, but led to my problem.
https://stackoverflow.com/a/469951/5060792
These answers solved my problem.
https://stackoverflow.com/a/3041326
https://stackoverflow.com/a/470602/5060792
By the way, with nothing but the change to ptrn, regexp_extract throws a java.lang.IndexOutOfBoundsException: No group 1 exception. After wrapping the entire pattern in parentheses, ptrn = '((?=.*Ja)(?=.*ke))', it returns nulls.
Again, regexp_replace replaces nothing and the original values are returned.
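For reference, applying the corrected pattern to the testdf above produces the desired Boolean column (quick sketch):
ptrn = '(?=.*Ja)(?=.*ke)'
testdf.withColumn('contns_ptrn', testdf.Name.rlike(ptrn)).show()
# true only for 'Jake', since both lookaheads must find a match somewhere in the string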

Replace substring containing dollar sign ($) with other column value pyspark [duplicate]

This question already has answers here:
Spark column string replace when present in other column (row)
(2 answers)
Closed 3 years ago.
I am trying to replace the substring '$NUMBER' with the value in the column 'number' for each row.
I tried
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
replace_udf = udf(
lambda long_text, number: long_text.replace("$NUMBER", number),
StringType()
)
df = df.withColumn('long_text',replace_udf(col('long_text'),col('number')))
and
from pyspark.sql.functions import expr
df = df.withColumn('long_text',expr("regexp_replace(long_text, '$NUMBER', number)"))
but nothing works. I can't figure out how another column can be the replacement for the substring.
SAMPLE:
df1 = spark.createDataFrame(
[
("hahaha the $NUMBER is good",3),
("i dont know about $NUMBER",2),
("what is $NUMBER doing?",5),\
("ajajaj $NUMBER",2),
("$NUMBER dwarfs",1)
],
["long_text","number"]
)
INPUT:
+--------------------------+------+
|                 long_text|number|
+--------------------------+------+
|hahaha the $NUMBER is good|     3|
|    what is $NUMBER doing?|     5|
|            ajajaj $NUMBER|     2|
+--------------------------+------+
EXPECTED OUTPUT:
+--------------------+------+
|           long_text|number|
+--------------------+------+
|hahaha the 3 is good|     3|
|    what is 5 doing?|     5|
|            ajajaj 2|     2|
+--------------------+------+
Similar question where the answers didn't cover the column replacement:
Spark column string replace when present in other column (row)
The problem is that $ has a special meaning in regular expressions, which means match the end of the line. So your code:
regexp_replace(long_text, '$NUMBER', number)
Is trying to match the pattern: end of line followed by the literal string NUMBER (which can never match anything).
In order to match a $ (or any other regex special character), you have to escape it with a \.
from pyspark.sql.functions import expr
df = df.withColumn('long_text',expr("regexp_replace(long_text, '\$NUMBER', number)"))
df.show()
#+--------------------+------+
#| long_text|number|
#+--------------------+------+
#|hahaha the 3 is good| 3|
#| what is 5 doing?| 5|
#| ajajaj 2| 2|
#+--------------------+------+
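As a side note, if you are on Spark 2.3 or later, the SQL replace function does a literal (non-regex) substitution, which sidesteps the escaping problem entirely; a sketch:
from pyspark.sql.functions import expr
df = df.withColumn('long_text', expr("replace(long_text, '$NUMBER', cast(number as string))"))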
You have to cast the number column to a string with str() before you can use it with replace in your lambda:
from pyspark.sql import types as T
from pyspark.sql import functions as F
l = [('hahaha the $NUMBER is good', 3),
     ('what is $NUMBER doing?', 5),
     ('ajajaj $NUMBER ', 2)]
df = spark.createDataFrame(l,['long_text','number'])
#Just added str() to your function
replace_udf = F.udf(lambda long_text, number: long_text.replace("$NUMBER", str(number)), T.StringType())
df.withColumn('long_text',replace_udf(F.col('long_text'),F.col('number'))).show()
+--------------------+------+
| long_text|number|
+--------------------+------+
|hahaha the 3 is good| 3|
| what is 5 doing?| 5|
| ajajaj 2 | 2|
+--------------------+------+

if else in pyspark for collapsing column values

I am trying some simple code to collapse the categorical variables in my dataframe into binary classes after indexing.
Currently my column has 3 classes: "A", "B", "C".
I am writing a simple if-else statement to collapse the classes, like this:
def condition(r):
    if (r.wo_flag=="SLM" or r.wo_flag=="NON-SLM"):
        r.wo_flag="dispatch"
    else:
        r.wo_flag="non_dispatch"
    return r.wo_flag
df_final=df_new.map(lambda x: condition(x))
It's not working; it doesn't seem to understand the else condition.
+-----+--------+------+-------------------+----------------+-------+-------------+----+
|MData|Recode12|Status|DayOfWeekOfDispatch|MannerOfDispatch|Wo_flag|PlaceOfInjury|Race|
+-----+--------+------+-------------------+----------------+-------+-------------+----+
|    M|      11|     M|                  4|               7|      C|           99|   1|
|    M|       8|     D|                  3|               7|      A|           99|   1|
|    F|      10|     W|                  2|               7|      C|           99|   1|
|    M|       9|     D|                  1|               7|      B|           99|   1|
|    M|       8|     D|                  2|               7|      C|           99|   1|
+-----+--------+------+-------------------+----------------+-------+-------------+----+
This is the Sample Data
The accepted answer is not very efficient due to the use of a user defined function (UDF).
I think most people are looking for when.
from pyspark.sql.functions import when
matches = df["wo_flag"].isin("SLM", "NON-SLM")
new_df = df.withColumn("wo_flag", when(matches, "dispatch").otherwise("non-dispatch"))
Try this:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def modify_values(r):
    if r == "A" or r == "B":
        return "dispatch"
    else:
        return "non-dispatch"

ol_val = udf(modify_values, StringType())
new_df = df.withColumn("wo_flag", ol_val(df.wo_flag))
Things you are doing wrong:
You are trying to modify Rows (Rows are immutable).
When a map operation is done on a dataframe, the resulting data structure is a PipelinedRDD and not a dataframe. You have to apply .toDF() to get a dataframe back.
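For completeness, if you do want to keep the map route, return a value instead of mutating the Row and convert the result back to a dataframe; a rough sketch along the lines of the asker's code (it keeps only the flag column, so the when approach above is usually preferable):
def condition(r):
    return "dispatch" if r.wo_flag in ("SLM", "NON-SLM") else "non_dispatch"

df_final = df_new.rdd.map(lambda r: (condition(r),)).toDF(["wo_flag"])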