create new column in pyspark dataframe using existing columns - python-2.7

I am trying to work with pyspark dataframes and I would like to know how I can create and populate a new column using existing columns.
Let's say I have a dataframe that looks like this:
+-----+---+---+
|   _1| _2| _3|
+-----+---+---+
|x1-y1|  3| z1|
|x2-y2|  2| z2|
|x3-y3|  1| z3|
+-----+---+---+
I am looking for a way to create a dataframe which looks like this:
+-----+---+---+----+--------+
|   _1| _2| _3|  _4|      _5|
+-----+---+---+----+--------+
|x1-y1|  3| z1|x1y1|x1=y1=z1|
|x2-y2|  2| z2|x2y2|x2=y2=z2|
|x3-y3|  1| z3|x3y3|x3=y3=z3|
+-----+---+---+----+--------+
_4 is just _1 with the '-' removed, and _5 combines the values from _1 and _3.
I am using spark-2.3.3 and python 2.7
Thanks!

You can use pyspark.sql.functions to achieve it.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
sqlContext = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
data = [('x1-y1', 3, 'z1'),
        ('x2-y2', 2, 'z2'),
        ('x3-y3', 1, 'z3')]
test_df = sqlContext.createDataFrame(data, schema=['_1', '_2', '_3'])
test_df = test_df.withColumn('_4', F.regexp_replace('_1', '-', ''))
test_df = test_df.withColumn('_5', F.concat(F.regexp_replace('_1', '-', '='), F.lit('='), F.col('_3')))
test_df.show()
+-----+---+---+----+--------+
|   _1| _2| _3|  _4|      _5|
+-----+---+---+----+--------+
|x1-y1|  3| z1|x1y1|x1=y1=z1|
|x2-y2|  2| z2|x2y2|x2=y2=z2|
|x3-y3|  1| z3|x3y3|x3=y3=z3|
+-----+---+---+----+--------+
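For reference, the same two new columns can also be built in a single select instead of two withColumn calls; a minimal sketch, starting from the original test_df built above (before the withColumn calls) and reusing the F alias:
test_df = test_df.select(
    '*',
    F.regexp_replace('_1', '-', '').alias('_4'),
    # concat_ws joins the already-rewritten _1 and _3 with '=' separators
    F.concat_ws('=', F.regexp_replace('_1', '-', '='), F.col('_3')).alias('_5')
)
test_df.show()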

Related

pyspark concatenate multiple columns where the column name is in a list

I am trying to concatenate multiple columns into just one column, but only if the column name is in a list.
So issue = {'a','b','c'} is my list, and I need to concatenate those columns into an issue column with a ';' separator.
I have tried:
1.
df_issue = df.withColumn('issue', concat_ws(';',map_values(custom.({issue}))))
which returns an invalid syntax error
2.
df_issue = df.withColumn('issue', lit(issue))
This just returned a b c and not their values.
Thank you
You can simply use concat_ws:
from pyspark.sql import functions as F
columns_to_concat = ['a', 'b', 'c']
df.withColumn('issue', F.concat_ws(';', *columns_to_concat))
So, if your input DataFrame is:
+---+---+---+----------+----------+-----+
|  a|  b|  c|     date1|     date2|value|
+---+---+---+----------+----------+-----+
| k1| k2| k3|2022-11-11|2022-11-14|    5|
| k4| k5| k6|2022-11-15|2022-11-19|    5|
| k7| k8| k9|2022-11-15|2022-11-19|    5|
+---+---+---+----------+----------+-----+
The previous code will produce:
+---+---+---+----------+----------+-----+--------+
|  a|  b|  c|     date1|     date2|value|   issue|
+---+---+---+----------+----------+-----+--------+
| k1| k2| k3|2022-11-11|2022-11-14|    5|k1;k2;k3|
| k4| k5| k6|2022-11-15|2022-11-19|    5|k4;k5;k6|
| k7| k8| k9|2022-11-15|2022-11-19|    5|k7;k8;k9|
+---+---+---+----------+----------+-----+--------+
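If the column names actually live in a collection like the asker's issue, a hedged variant (assuming every name in issue exists in the DataFrame) is to derive columns_to_concat from df.columns, so the concatenation order follows the DataFrame:
from pyspark.sql import functions as F

issue = {'a', 'b', 'c'}
# keep only the DataFrame columns whose name appears in the set
columns_to_concat = [c for c in df.columns if c in issue]
df_issue = df.withColumn('issue', F.concat_ws(';', *columns_to_concat))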

How can I compare pairs of columns in a PySpark dataframe and number of records changed?

I have a situation where I need to compare multiple pairs of columns (the number of pairs will vary and can come from a list, as shown in the code snippet below) and get a 1/0 flag for match/mismatch respectively. Eventually I need to use this to identify the number of records/rows with a mismatch and the % of records mismatched.
NONKEYCOLS= ['Marks', 'Qualification']
The first image is the source df and the second image is the expected df.
Since this is happening for multiple pairs in a loop, it is very slow for about a billion records. I need help with something more efficient.
I have the below code, but the part that calculates the changed records is taking a long time.
for ind, cols in enumerate(NONKEYCOLS):
    print(ind)
    print(cols)
    globals()['new_dataset' + '_char_changes_tmp'] = globals()['new_dataset' + '_char_changes_tmp']\
        .withColumn("records_changed" + str(ind),
                    F.sum(col("records_ch_flag_" + str(ind)))
                     .over(w1))
    globals()['new_dataset' + '_char_changes_tmp'] = globals()['new_dataset' + '_char_changes_tmp']\
        .withColumn("records_changed_cnt" + str(ind),
                    F.count(col("records_ch_flag_" + str(ind)))
                     .over(w1))
i'm not sure what loop you are running, but here's an implementation with a list comprehension within a select.
import pyspark.sql.functions as func

data_ls = [
    (10, 11, 'foo', 'foo'),
    (12, 12, 'bar', 'bar'),
    (10, 12, 'foo', 'bar')
]
data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['marks_1', 'marks_2', 'qualification_1', 'qualification_2'])
col_pairs = ['marks', 'qualification']
data_sdf. \
    select('*',
           *[(func.col(c+'_1') == func.col(c+'_2')).cast('int').alias(c+'_check') for c in col_pairs]
           ). \
    show()
# +-------+-------+---------------+---------------+-----------+-------------------+
# |marks_1|marks_2|qualification_1|qualification_2|marks_check|qualification_check|
# +-------+-------+---------------+---------------+-----------+-------------------+
# |     10|     11|            foo|            foo|          0|                  1|
# |     12|     12|            bar|            bar|          1|                  1|
# |     10|     12|            foo|            bar|          0|                  0|
# +-------+-------+---------------+---------------+-----------+-------------------+
where the list comprehension would yield the following
[(func.col(c+'_1') == func.col(c+'_2')).cast('int').alias(c+'_check') for c in col_pairs]
# [Column<'CAST((marks_1 = marks_2) AS INT) AS `marks_check`'>,
# Column<'CAST((qualification_1 = qualification_2) AS INT) AS `qualification_check`'>]
EDIT
based on the additional (updated) info, you need the count of unmatched records for that pair and then you want to calculate the unmatched percentage.
reversing the aforementioned logic to count the unmatched records
col_pairs = ['marks', 'qualification']
data_sdf. \
    agg(*[func.sum((func.col(c+'_1') != func.col(c+'_2')).cast('int')).alias(c+'_unmatch') for c in col_pairs],
        func.count('*').alias('row_cnt')
        ). \
    select('*',
           *[(func.col(c+'_unmatch') / func.col('row_cnt')).alias(c+'_unmatch_perc') for c in col_pairs]
           ). \
    show()
# +-------------+---------------------+-------+------------------+--------------------------+
# |marks_unmatch|qualification_unmatch|row_cnt|marks_unmatch_perc|qualification_unmatch_perc|
# +-------------+---------------------+-------+------------------+--------------------------+
# |            2|                    1|      3|0.6666666666666666|        0.3333333333333333|
# +-------------+---------------------+-------+------------------+--------------------------+
the code flags (as 1) the records where the pair does not match and takes a sum of the flag, which gives us the pair's unmatched record count. dividing that by the total row count gives the percentage.
the list comprehension will yield the following
[func.sum((func.col(c+'_1') != func.col(c+'_2')).cast('int')).alias(c+'_unmatch') for c in col_pairs]
# [Column<'sum(CAST((NOT (marks_1 = marks_2)) AS INT)) AS `marks_unmatch`'>,
# Column<'sum(CAST((NOT (qualification_1 = qualification_2)) AS INT)) AS `qualification_unmatch`'>]
this is very efficient, as all of it happens in a single select statement which will only project once in the spark plan, as opposed to your approach which projects every time you do a withColumn - and that is inefficient for spark.
df.colRegex may serve you well. If all the values in columns which match the regex are equal, you get 1. The script is efficient, as everything is done in one select.
Inputs:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('p', 1, 2, 'g', 'm'),
     ('a', 3, 3, 'g', 'g'),
     ('b', 4, 5, 'g', 'g'),
     ('r', 8, 8, 'm', 'm'),
     ('d', 2, 1, 'u', 'g')],
    ['Name', 'Marks_1', 'Marks_2', 'Qualification_1', 'Qualification_2'])
col_pairs = ['Marks', 'Qualification']
Script:
def equals(*cols):
    return (F.size(F.array_distinct(F.array(*cols))) == 1).cast('int')

df = df.select(
    '*',
    *[equals(df.colRegex(f"`^{c}.*`")).alias(f'{c}_result') for c in col_pairs]
)
df.show()
# +----+-------+-------+---------------+---------------+------------+--------------------+
# |Name|Marks_1|Marks_2|Qualification_1|Qualification_2|Marks_result|Qualification_result|
# +----+-------+-------+---------------+---------------+------------+--------------------+
# |   p|      1|      2|              g|              m|           0|                   0|
# |   a|      3|      3|              g|              g|           1|                   1|
# |   b|      4|      5|              g|              g|           0|                   1|
# |   r|      8|      8|              m|              m|           1|                   1|
# |   d|      2|      1|              u|              g|           0|                   0|
# +----+-------+-------+---------------+---------------+------------+--------------------+
Proof of efficiency:
df.explain()
# == Physical Plan ==
# *(1) Project [Name#636, Marks_1#637L, Marks_2#638L, Qualification_1#639, Qualification_2#640, cast((size(array_distinct(array(Marks_1#637L, Marks_2#638L)), true) = 1) as int) AS Marks_result#646, cast((size(array_distinct(array(Qualification_1#639, Qualification_2#640)), true) = 1) as int) AS Qualification_result#647]
# +- Scan ExistingRDD[Name#636,Marks_1#637L,Marks_2#638L,Qualification_1#639,Qualification_2#640]
Edit:
def equals(*cols):
    return (F.size(F.array_distinct(F.array(*cols))) != 1).cast('int')

df = df.select(
    '*',
    *[equals(df.colRegex(f"`^{c}.*`")).alias(f'{c}_result') for c in col_pairs]
).agg(
    *[F.sum(f'{c}_result').alias(f'rec_changed_{c}') for c in col_pairs],
    *[(F.sum(f'{c}_result') / F.count(f'{c}_result')).alias(f'{c}_%_rec_changed') for c in col_pairs]
)
df.show()
# +-----------------+-------------------------+-------------------+---------------------------+
# |rec_changed_Marks|rec_changed_Qualification|Marks_%_rec_changed|Qualification_%_rec_changed|
# +-----------------+-------------------------+-------------------+---------------------------+
# |                3|                        2|                0.6|                        0.4|
# +-----------------+-------------------------+-------------------+---------------------------+

Pyspark Mean value of each element in multiple lists

I have a df with 2 columns:
id
vector
This is a sample of how it looks:
+--------------------+----------+
|              vector|        id|
+--------------------+----------+
|[8.32,3.22,5.34,6.5]|1046091128|
|[8.52,3.34,5.31,6.3]|1046091128|
|[8.44,3.62,5.54,6.4]|1046091128|
|[8.31,3.12,5.21,6.1]|1046091128|
+--------------------+----------+
I want to groupBy id and take the mean of each element of the vectors. So, for example, the first value in the aggregated list will be (8.32+8.52+8.44+8.31)/4, and so on.
Any help is appreciated.
This assumes that you know the length of the array column:
from pyspark.sql import functions as F

l = 4  # size of the array column
df1 = df.select("id", *[F.col("vector")[i] for i in range(l)])
out = df1.groupby("id").agg(F.array([F.mean(i) for i in df1.columns[1:]]).alias("vector"))
out.show(truncate=False)
+----------+----------------------------------------+
|id        |vector                                  |
+----------+----------------------------------------+
|1046091128|[8.3975, 3.325, 5.35, 6.325000000000001]|
+----------+----------------------------------------+
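If the array length is not known up front, a small sketch (assuming every row's vector has the same size) is to read it from the first row instead of hard-coding l = 4:
from pyspark.sql import functions as F

# derive the array length from the data itself
l = df.select(F.size("vector").alias("n")).first()["n"]
df1 = df.select("id", *[F.col("vector")[i] for i in range(l)])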
You can use the posexplode function and then aggregate the column based upon the average. Something like below:
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [([8.32,3.22,5.34,6.5], 1046091128 ), ([8.52,3.34,5.31,6.3], 1046091128), ([8.44,3.62,5.54,6.4], 1046091128), ([8.31,3.12,5.21,6.1], 1046091128)]
schema = StructType([ StructField("vector", ArrayType(FloatType())), StructField("id", IntegerType()) ])
df = spark.createDataFrame(data=data,schema=schema)
df.select("id", posexplode("vector")).groupBy("id").pivot("pos").agg(avg("col")).show()
Output would look somewhat like :
+----------+-----------------+------------------+-----------------+-----------------+
|        id|                0|                 1|                2|                3|
+----------+-----------------+------------------+-----------------+-----------------+
|1046091128|8.397500038146973|3.3249999284744263|5.350000023841858|6.325000047683716|
+----------+-----------------+------------------+-----------------+-----------------+
You can rename the columns later if required.
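For instance, a hedged one-liner to rename the pivoted position columns (the mean_0 .. mean_3 names are made up, and four positions are assumed as in this example):
pivoted = df.select("id", posexplode("vector")).groupBy("id").pivot("pos").agg(avg("col"))
# toDF renames all columns positionally: id first, then one mean_<pos> per pivoted position
pivoted = pivoted.toDF("id", *["mean_" + str(i) for i in range(4)])
pivoted.show()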
You could also avoid the pivot by grouping by id and pos, and then grouping by id alone to collect_list:
df.select("id", posexplode("vector")) \
    .groupby('id', 'pos').agg(avg('col').alias('vector')) \
    .groupby('id').agg(collect_list('vector').alias('vector')) \
    .show(truncate=False)
Outcome
+----------+-----------------------------------------------------------------------------+
|id        |vector                                                                       |
+----------+-----------------------------------------------------------------------------+
|1046091128|[8.397500038146973, 5.350000023841858, 3.3249999284744263, 6.325000047683716]|
+----------+-----------------------------------------------------------------------------+
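Note that collect_list gives no ordering guarantee after the shuffle, which is why the outcome above has the position-2 mean before the position-1 mean. A sketch that preserves positional order by collecting (pos, mean) structs and sorting them (same df as above, reasonably recent Spark assumed):
from pyspark.sql import functions as F

(df.select("id", F.posexplode("vector"))
   .groupby("id", "pos")
   .agg(F.avg("col").alias("mean"))
   .groupby("id")
   # sort_array orders the structs by pos, then tmp.mean extracts the means in that order
   .agg(F.sort_array(F.collect_list(F.struct("pos", "mean"))).alias("tmp"))
   .select("id", F.col("tmp.mean").alias("vector"))
   .show(truncate=False))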

pyspark: dataframe from rdd containing list of lists

I am new to Spark (with Python) and couldn't figure this out even after looking through relevant posts.
I have an RDD. Each record of the RDD is a list of lists, as below:
[[1073914607, 0, -1],[1073914607, 2, 7.88],[1073914607, 0, -1],[1073914607, 4, 40.0]]
[[1074079003, 0, -1],[1074079003, 2, 2.87],[1074079003, 0, -1],[1074079003, 4, 35.2]]
I want to convert the RDD to a dataframe with 3 columns, basically stack all the element lists. The dataframe should look like below.
account_id product_id price
1073914607 0 -1
1073914607 2 7.88
1073914607 0 -1
1073914607 4 40
1074079003 0 -1
1074079003 2 2.87
1074079003 0 -1
1074079003 4 35.2
I have tried my_rdd.toDF(), but it gives me two rows and four columns, with each element list in a column. I also tried some solutions suggested in other posts which might be relevant. Since I am pretty new to Spark, I got various errors that I could not figure out. Please help. Thanks.
Added on 07/28/2021. In the end I did the following to loop through each element, generate a long list, and convert it into a dataframe. It is probably not the most efficient way, but it solved my issue.
result_lst = []
for x in my_rdd.toLocalIterator():
    for y in x:
        result_lst.append(y)
result_df = spark.createDataFrame(result_lst, ['account_id', 'product_id', 'price'])
>>> data = ([[1,2],[1,4]],[[2,5],[2,6]])
>>> df = sc.parallelize(data).toDF(['c1','c2'])
>>> df.show()
+------+------+
|    c1|    c2|
+------+------+
|[1, 2]|[1, 4]|
|[2, 5]|[2, 6]|
+------+------+
>>> df1 = df.select(df.c1.alias('c3')).union(df.select(df.c2.alias('c3')))
>>> df1.show()
+------+
|    c3|
+------+
|[1, 2]|
|[2, 5]|
|[1, 4]|
|[2, 6]|
+------+
>>> df1.select(df1.c3,df1.c3[0],df1.c3[1]).show()
+------+-----+-----+
|    c3|c3[0]|c3[1]|
+------+-----+-----+
|[1, 2]|    1|    2|
|[2, 5]|    2|    5|
|[1, 4]|    1|    4|
|[2, 6]|    2|    6|
+------+-----+-----+
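Applied to the question's data, the same union idea could look like the sketch below, assuming (as the asker described) that my_rdd.toDF() yields four array columns _1 to _4, each holding one [account_id, product_id, price] list:
my_df = my_rdd.toDF()
# stack the four array columns into a single 'rec' column, then split it into three fields
stacked = my_df.select(my_df['_1'].alias('rec'))
for c in ['_2', '_3', '_4']:
    stacked = stacked.union(my_df.select(my_df[c].alias('rec')))
result_df = stacked.select(*[stacked.rec[i] for i in range(3)]).toDF('account_id', 'product_id', 'price')
result_df.show()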
I later used another way, shown below, to solve the problem without bringing the RDD to toLocalIterator() and looping through it. I guess this new way is more efficient.
from pyspark.sql.functions import explode
from pyspark.sql import Row
df_exploded = my_rdd.map(lambda x: Row(x)).toDF().withColumn('_1', explode('_1'))
result_df = df_exploded.select([df_exploded._1[i] for i in range(3)]).toDF('account_id', 'product_id', 'price')
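For reference, a shorter route (a sketch assuming my_rdd holds the list-of-lists records shown at the top of the question) is to unnest one level with flatMap before converting to a DataFrame:
# each record is a list of [account_id, product_id, price] lists,
# so flatMap yields the inner lists and toDF names the columns
result_df = my_rdd.flatMap(lambda rec: rec).toDF(['account_id', 'product_id', 'price'])
result_df.show()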

Pyspark Merge WrappedArrays Within a Dataframe

The current Pyspark dataframe has this structure (a list of WrappedArrays for col2):
+---+-------------------------------------------------+
|id |col2                                             |
+---+-------------------------------------------------+
|a  |[WrappedArray(code2), WrappedArray(code1, code3)]|
+---+-------------------------------------------------+
|b  |[WrappedArray(code5), WrappedArray(code6, code8)]|
+---+-------------------------------------------------+
This is the structure I would like to have (a flattened list for col2):
+---+---------------------+
|id |col2                 |
+---+---------------------+
|a  |[code2, code1, code3]|
+---+---------------------+
|b  |[code5, code6, code8]|
+---+---------------------+
but I'm not sure how to do that transformation. I tried a flatMap, but that didn't seem to work. Any suggestions?
You can do this in 2 ways, with the RDD API or a UDF. Here is an example:
df = sqlContext.createDataFrame([
    ['a', [['code2'], ['code1', 'code3']]],
    ['b', [['code5', 'code6'], ['code8']]]
], ["id", "col2"])
df.show(truncate = False)
+---+-------------------------------------------------+
|id |col2                                             |
+---+-------------------------------------------------+
|a |[WrappedArray(code2), WrappedArray(code1, code3)]|
|b |[WrappedArray(code5, code6), WrappedArray(code8)]|
+---+-------------------------------------------------+
RDD:
from functools import reduce  # reduce is a builtin on Python 2; on Python 3 it lives in functools
df.rdd.map(lambda row: (row[0], reduce(lambda x, y: x + y, row[1]))).toDF().show(truncate=False)
+---+---------------------+
|_1 |_2                   |
+---+---------------------+
|a |[code2, code1, code3]|
|b |[code5, code6, code8]|
+---+---------------------+
UDF:
from pyspark.sql import functions as F
import pyspark.sql.types as T

def fudf(val):
    # equivalent to looping over val and concatenating each sub-list
    return reduce(lambda x, y: x + y, val)

flattenUdf = F.udf(fudf, T.ArrayType(T.StringType()))
df.select("id", flattenUdf("col2").alias("col2")).show(truncate=False)
+---+---------------------+
|id |col2                 |
+---+---------------------+
|a |[code2, code1, code3]|
|b |[code5, code6, code8]|
+---+---------------------+
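On Spark 2.4+ you can also skip both the RDD pass and the UDF with the built-in flatten function; a minimal sketch using the same df:
from pyspark.sql import functions as F

# flatten concatenates the nested arrays into a single array (Spark 2.4+)
df.select("id", F.flatten("col2").alias("col2")).show(truncate=False)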