PySpark Combine Rows to Columns StackOverflowError

What I want (very simplified): turn the input dataset, where each (name, typeT, value) is a row, into an output dataset with one column per distinct typeT.
Some of the code I tried:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import Window

def add_columns(cur_typ, target, value):
    if cur_typ == target:
        return value
    return None

schema = T.StructType([T.StructField("name", T.StringType(), True),
                       T.StructField("typeT", T.StringType(), True),
                       T.StructField("value", T.IntegerType(), True)])
data = [("x", "a", 3), ("x", "b", 5), ("x", "c", 7), ("y", "a", 1), ("y", "b", 2),
        ("y", "c", 4), ("z", "a", 6), ("z", "b", 2), ("z", "c", 3)]
df = ctx.spark_session.createDataFrame(ctx.spark_session.sparkContext.parallelize(data), schema)
targets = [i.typeT for i in df.select("typeT").distinct().collect()]
add_columns = F.udf(add_columns)
w = Window.partitionBy("name")
for target in targets:
    df = df.withColumn(target, F.max(add_columns(df["typeT"], F.lit(target), df["value"])).over(w))
df = df.drop("typeT", "value").dropDuplicates()
Another version:

targets = df.select(F.collect_set("typeT").alias("typeT")).first()["typeT"]
w = Window.partitionBy("name")
for target in targets:
    df = df.withColumn(target, F.max(F.when(df["typeT"] == F.lit(target), df["value"])
                                      .otherwise(None)).over(w))
df = df.drop("typeT", "value").dropDuplicates()
For small datasets both work, but I have a dataframe with 1 million rows and 5000 different typeTs, so the result should be a table of about 500 x 5000 (some names do not have certain typeTs). Now I get stack overflow errors (py4j.protocol.Py4JJavaError: An error occurred while calling o7624.withColumn.
: java.lang.StackOverflowError) trying to create this dataframe. Besides increasing the stack size, what can I do? Is there a better way to get the same result?

Using withColumn in a loop is not good when there are many columns to add.
Create an array of columns and select them, which will result in better performance:
cols = [F.col("name")]
for target in targets:
    cols.append(F.max(add_columns(df["typeT"], F.lit(target), df["value"])).over(w).alias(target))
df = df.select(cols)
which results in the same output:
+----+---+---+---+
|name| c| b| a|
+----+---+---+---+
| x| 7| 5| 3|
| z| 3| 2| 6|
| y| 4| 2| 1|
+----+---+---+---+

Related

How can I compare pairs of columns in a PySpark dataframe and number of records changed?

I have a situation where I need to compare multiple pairs of columns (the number of pairs will vary and can come from a list, as shown in the code snippet below) and get a 1/0 flag for match/mismatch respectively. Eventually I use this to identify the number of records/rows with a mismatch and the % of records mismatched.
NONKEYCOLS= ['Marks', 'Qualification']
The first image is source df and second image is expected df.
Since this is happening for multiple pairs in a loop, it is very slow for about a billion records. I need help with something efficient.
I have the below code, but the part that calculates changed records is taking a long time.
for ind, cols in enumerate(NONKEYCOLS):
    print(ind)
    print(cols)
    globals()['new_dataset' + '_char_changes_tmp'] = globals()['new_dataset' + '_char_changes_tmp'] \
        .withColumn("records_changed" + str(ind),
                    F.sum(F.col("records_ch_flag_" + str(ind))).over(w1))
    globals()['new_dataset' + '_char_changes_tmp'] = globals()['new_dataset' + '_char_changes_tmp'] \
        .withColumn("records_changed_cnt" + str(ind),
                    F.count(F.col("records_ch_flag_" + str(ind))).over(w1))
I'm not sure what loop you are running, but here's an implementation with a list comprehension within a select.
from pyspark.sql import functions as func

data_ls = [
    (10, 11, 'foo', 'foo'),
    (12, 12, 'bar', 'bar'),
    (10, 12, 'foo', 'bar')
]
data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['marks_1', 'marks_2', 'qualification_1', 'qualification_2'])
col_pairs = ['marks', 'qualification']
data_sdf. \
    select('*',
           *[(func.col(c + '_1') == func.col(c + '_2')).cast('int').alias(c + '_check') for c in col_pairs]
           ). \
    show()
# +-------+-------+---------------+---------------+-----------+-------------------+
# |marks_1|marks_2|qualification_1|qualification_2|marks_check|qualification_check|
# +-------+-------+---------------+---------------+-----------+-------------------+
# | 10| 11| foo| foo| 0| 1|
# | 12| 12| bar| bar| 1| 1|
# | 10| 12| foo| bar| 0| 0|
# +-------+-------+---------------+---------------+-----------+-------------------+
where the list comprehension would yield the following
[(func.col(c+'_1') == func.col(c+'_2')).cast('int').alias(c+'_check') for c in col_pairs]
# [Column<'CAST((marks_1 = marks_2) AS INT) AS `marks_check`'>,
# Column<'CAST((qualification_1 = qualification_2) AS INT) AS `qualification_check`'>]
EDIT
based on the additional (updated) info, you need the count of unmatched records for that pair and then you want to calculate the unmatched percentage.
reversing the aforementioned logic to count the unmatched records
col_pairs = ['marks', 'qualification']
data_sdf. \
    agg(*[func.sum((func.col(c + '_1') != func.col(c + '_2')).cast('int')).alias(c + '_unmatch') for c in col_pairs],
        func.count('*').alias('row_cnt')
        ). \
    select('*',
           *[(func.col(c + '_unmatch') / func.col('row_cnt')).alias(c + '_unmatch_perc') for c in col_pairs]
           ). \
    show()
# +-------------+---------------------+-------+------------------+--------------------------+
# |marks_unmatch|qualification_unmatch|row_cnt|marks_unmatch_perc|qualification_unmatch_perc|
# +-------------+---------------------+-------+------------------+--------------------------+
# | 2| 1| 3|0.6666666666666666| 0.3333333333333333|
# +-------------+---------------------+-------+------------------+--------------------------+
the code flags (as 1) the records where the pair does not match and takes a sum of the flag - which gives us the pair's unmatched record count. dividing that with the total row count will give the percentage.
the list comprehension will yield the following
[func.sum((func.col(c+'_1') != func.col(c+'_2')).cast('int')).alias(c+'_unmatch') for c in col_pairs]
# [Column<'sum(CAST((NOT (marks_1 = marks_2)) AS INT)) AS `marks_unmatch`'>,
# Column<'sum(CAST((NOT (qualification_1 = qualification_2)) AS INT)) AS `qualification_unmatch`'>]
this is much more efficient, as all of it happens in a single select statement which will only project once in the Spark plan, as opposed to your approach which projects every time you do a withColumn - and that is inefficient in Spark.
df.colRegex may serve you well. If all the values in the columns which match the regex are equal, you get 1. The script is efficient, as everything is done in one select.
Inputs:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('p', 1, 2, 'g', 'm'),
     ('a', 3, 3, 'g', 'g'),
     ('b', 4, 5, 'g', 'g'),
     ('r', 8, 8, 'm', 'm'),
     ('d', 2, 1, 'u', 'g')],
    ['Name', 'Marks_1', 'Marks_2', 'Qualification_1', 'Qualification_2'])
col_pairs = ['Marks', 'Qualification']
Script:
def equals(*cols):
    return (F.size(F.array_distinct(F.array(*cols))) == 1).cast('int')

df = df.select(
    '*',
    *[equals(df.colRegex(f"`^{c}.*`")).alias(f'{c}_result') for c in col_pairs]
)
df.show()
# +----+-------+-------+---------------+---------------+------------+--------------------+
# |Name|Marks_1|Marks_2|Qualification_1|Qualification_2|Marks_result|Qualification_result|
# +----+-------+-------+---------------+---------------+------------+--------------------+
# | p| 1| 2| g| m| 0| 0|
# | a| 3| 3| g| g| 1| 1|
# | b| 4| 5| g| g| 0| 1|
# | r| 8| 8| m| m| 1| 1|
# | d| 2| 1| u| g| 0| 0|
# +----+-------+-------+---------------+---------------+------------+--------------------+
Proof of efficiency:
df.explain()
# == Physical Plan ==
# *(1) Project [Name#636, Marks_1#637L, Marks_2#638L, Qualification_1#639, Qualification_2#640, cast((size(array_distinct(array(Marks_1#637L, Marks_2#638L)), true) = 1) as int) AS Marks_result#646, cast((size(array_distinct(array(Qualification_1#639, Qualification_2#640)), true) = 1) as int) AS Qualification_result#647]
# +- Scan ExistingRDD[Name#636,Marks_1#637L,Marks_2#638L,Qualification_1#639,Qualification_2#640]
Edit:
def equals(*cols):
    return (F.size(F.array_distinct(F.array(*cols))) != 1).cast('int')

df = df.select(
    '*',
    *[equals(df.colRegex(f"`^{c}.*`")).alias(f'{c}_result') for c in col_pairs]
).agg(
    *[F.sum(f'{c}_result').alias(f'rec_changed_{c}') for c in col_pairs],
    *[(F.sum(f'{c}_result') / F.count(f'{c}_result')).alias(f'{c}_%_rec_changed') for c in col_pairs]
)
df.show()
# +-----------------+-------------------------+-------------------+---------------------------+
# |rec_changed_Marks|rec_changed_Qualification|Marks_%_rec_changed|Qualification_%_rec_changed|
# +-----------------+-------------------------+-------------------+---------------------------+
# | 3| 2| 0.6| 0.4|
# +-----------------+-------------------------+-------------------+---------------------------+

Pyspark Mean value of each element in multiple lists

I have a df with 2 columns:
id
vector
This is a sample of how it looks:
+--------------------+----------+
| vector| id|
+--------------------+----------+
|[8.32,3.22,5.34,6.5]|1046091128|
|[8.52,3.34,5.31,6.3]|1046091128|
|[8.44,3.62,5.54,6.4]|1046091128|
|[8.31,3.12,5.21,6.1]|1046091128|
+--------------------+----------+
I want to groupBy id and take the mean of each element of the vectors. So for example the first value in the aggregated list will be (8.32+8.52+8.44+8.31)/4 and so on.
Any help is appreciated.
This assumes that you know the length of the array column:
l = 4  # size of array column
df1 = df.select("id", *[F.col("vector")[i] for i in range(l)])
out = df1.groupby("id").agg(F.array([F.mean(i) for i in df1.columns[1:]]).alias("vector"))
out.show(truncate=False)
+----------+----------------------------------------+
|id |vector |
+----------+----------------------------------------+
|1046091128|[8.3975, 3.325, 5.35, 6.325000000000001]|
+----------+----------------------------------------+
You can use the posexplode function and then aggregate the column based upon the average. Something like below:
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [([8.32,3.22,5.34,6.5], 1046091128 ), ([8.52,3.34,5.31,6.3], 1046091128), ([8.44,3.62,5.54,6.4], 1046091128), ([8.31,3.12,5.21,6.1], 1046091128)]
schema = StructType([ StructField("vector", ArrayType(FloatType())), StructField("id", IntegerType()) ])
df = spark.createDataFrame(data=data,schema=schema)
df.select("id", posexplode("vector")).groupBy("id").pivot("pos").agg(avg("col")).show()
Output would look somewhat like :
+----------+-----------------+------------------+-----------------+-----------------+
| id| 0| 1| 2| 3|
+----------+-----------------+------------------+-----------------+-----------------+
|1046091128|8.397500038146973|3.3249999284744263|5.350000023841858|6.325000047683716|
+----------+-----------------+------------------+-----------------+-----------------+
You can rename the columns later if required.
Could also avoid pivot by grouping by id and pos, and then later grouping by id alone to collect_list:

df.select("id", posexplode("vector")) \
  .groupby('id', 'pos').agg(avg('col').alias('vector')) \
  .groupby('id').agg(collect_list('vector').alias('vector')) \
  .show(truncate=False)
Outcome
+----------+-----------------------------------------------------------------------------+
|id |vector |
+----------+-----------------------------------------------------------------------------+
|1046091128|[8.397500038146973, 5.350000023841858, 3.3249999284744263, 6.325000047683716]|
+----------+-----------------------------------------------------------------------------+

Spark Add New Column with List Count

Let's say I have the following data - order_id and product_names.
data = [["1", ["Organic A", "Apple"]],
        ["2", ["Organic B", "Chocolate", "Organic C"]]]
If I want to create a dataframe and add a new column product_count so the output looks like the following, how can I do that?
Output:
+----------------------------------------------------------------+
|order_id | product_count| product_names|
+----------------------------------------------------------------+
| 1 | 2| ["Organic A", "Apple"]|
| 2 | 3| ["Organic B", "Chocolate", "Organic C"]|
+----------------------------------------------------------------+
You can use the size function to get the length of the product_names field.
df = df.select('order_id', F.size('product_names').alias('product_count'), 'product_names')
df.show(truncate=False)

Creating two dimensional list and then making a DataFrame in Scala

I have a date-format list: (day, hour, minute) --> (5, 3, 12)
I want to insert these data into a list, e.g. ((5,3,12),(1,14,21),...).
I am new to Scala and I do not know how to do this. Then I need to create a DataFrame from these data.
data = Seq(
  (l, m, r)
).toDF("day", "hour", "minutes")
Like this. If anyone can show me the best practice for doing this I would appreciate it. Thanks! Maybe I need to explain my question more. I have done the same thing in Python. It looks like:
for index in table:
    index = str(index)
    parts = index.split(" ")  # first element is part0
    hours_minutes_second = parts[1].split(":")
    year = parts[0].split("-")
    dates = (year[0], year[1], year[2], hours_minutes_second[0], hours_minutes_second[1], hours_minutes_second[2])
    data.append(dates)
df = pd.DataFrame(data, columns=['day', 'hours', 'minutes'])
You do not have to worry about indexes and data formats. What I need is to create something like a two-dimensional list and then make a dataframe out of it!
[
( '25', '06', '55'),
( '24', '14', '51'),
( '24', '06', '24'),
( '24', '03', '42'),
( '23', '19', '30')]
For your question, I assume that you want to create a Spark DataFrame, which is quite different from a Scala List.
To create a DataFrame you need to create the schema first. You can do it like:
val schema = StructType(
  List(
    StructField("day", IntegerType),
    StructField("hour", IntegerType),
    StructField("minutes", IntegerType)
  ))
There are several ways to create a DataFrame, for example:
Given a Seq[Row] you can create a RDD[Row] and then create the DataFrame from it:
val rdd: RDD[Row] = spark.sparkContext.parallelize(Seq(Row(5, 3, 12)))
val df = spark.createDataFrame(rdd, schema)
df.show()
/*
+---+----+-------+
|day|hour|minutes|
+---+----+-------+
| 5| 3| 12|
+---+----+-------+
*/
For the case of a two dimensions list:
val schema2 = StructType(
  List(
    StructField("day", StringType),
    StructField("hour", StringType),
    StructField("minutes", StringType)
  ))
val list = List(
  Seq("25", "06", "55"),
  Seq("24", "14", "51"),
  Seq("24", "06", "24"),
  Seq("24", "03", "42"),
  Seq("23", "19", "30"))
val rdd: RDD[Row] = spark.sparkContext.parallelize(list.map(el => Row.fromSeq(el)))
val df = spark.createDataFrame(rdd, schema2)
df.show()
/*
+---+----+-------+
|day|hour|minutes|
+---+----+-------+
| 25| 06| 55|
| 24| 14| 51|
| 24| 06| 24|
| 24| 03| 42|
| 23| 19| 30|
+---+----+-------+
*/

Get a second header with the units of columns

Sometimes in academic texts one wants to present a Table in which every column has units. It is usual that the units are specified below the column names, like this
|Object |Volume | area | Price |
| |$cm^3$ |$cm^2$ | euros |
|:------------|:-------|--------:|---------:|
|A |3 | 43.36| 567.40|
|B |15 | 43.47| 1000.80|
|C |1 | 42.18| 8.81|
|D |7 | 37.92| 4.72|
How could I achieve this for my bookdown documents?
Thank you in advance.
Here is a way using kableExtra:
```{r}
library(kableExtra)
df <- data.frame(Object = LETTERS[1:5],
                 Volume = round(runif(5, 1, 20)),
                 area = rnorm(5, 40, 3),
                 Price = rnorm(5, 700, 200))
colNames <- names(df)
dfUnits <- c("", "$cm^3$", "$cm^2$", "€")
kable(df, col.names = dfUnits, escape = F, align = "c") %>%
  add_header_above(header = colNames, line = F, align = "c")
```