Failed to execute user defined function($anonfun$createTransformFunc$1: (string) => array<string> [duplicate] - regex

I have a column in my Spark DataFrame:
|-- topics_A: array (nullable = true)
| |-- element: string (containsNull = true)
I'm using CountVectorizer on it:
topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A")
I get NullPointerExceptions, because sometimes the topic_A column contains null.
Is there a way around this? Filling it with a zero-length array would work ok (although it will blow out the data size quite a lot) - but I can't work out how to do a fillNa on an Array column in PySpark.

Personally, I would drop the rows where this column is NULL, since they carry no useful information, but you can replace the nulls with empty arrays. First, some imports:
from pyspark.sql.functions import when, col, coalesce, array
You can define an empty array of specific type as:
fill = array().cast("array<string>")
and combine it with when clause:
topics_a = when(col("topics_A").isNull(), fill).otherwise(col("topics_A"))
or coalesce:
topics_a = coalesce(col("topics_A"), fill)
and use it as:
df.withColumn("topics_A", topics_a)
so with example data:
df = sc.parallelize([(1, ["a", "b"]), (2, None)]).toDF(["id", "topics_A"])
df_ = df.withColumn("topics_A", topics_a)
topic_vectorizer_A.fit(df_).transform(df_)
the result will be:
+---+--------+-------------------+
| id|topics_A| topics_vec_A|
+---+--------+-------------------+
| 1| [a, b]|(2,[0,1],[1.0,1.0])|
| 2| []| (2,[],[])|
+---+--------+-------------------+
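For completeness, here is a self-contained sketch of the whole pipeline (assuming a running SparkSession named spark; the data and names mirror the question, so treat them as placeholders):
from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import coalesce, col, array

# sample data with a null array, mirroring the question
df = spark.createDataFrame([(1, ["a", "b"]), (2, None)], ["id", "topics_A"])

# replace null arrays with an empty array<string> before vectorizing
df_ = df.withColumn("topics_A", coalesce(col("topics_A"), array().cast("array<string>")))

topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A")
topic_vectorizer_A.fit(df_).transform(df_).show()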

I had a similar issue. Based on the comment above, I used the following to remove the null values before tokenizing:
# rows where title is null, before cleaning
clean_text_ddf.where(col("title").isNull()).show()
# drop rows with a null title
cleaned_text = clean_text_ddf.na.drop(subset=["title"])
# verify that no nulls remain
cleaned_text.where(col("title").isNull()).show()
cleaned_text.printSchema()
cleaned_text.show(2)
+-----+
|title|
+-----+
+-----+
+-----+
|title|
+-----+
+-----+
root
|-- title: string (nullable = true)
+--------------------+
| title|
+--------------------+
|Mr. Beautiful (Up...|
|House of Ravens (...|
+--------------------+
only showing top 2 rows
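Applied to the original question's schema, the same approach would look like the sketch below (note that this drops the rows entirely instead of keeping them with an empty vector):
from pyspark.sql.functions import col

# df is assumed to have the nullable array column topics_A from the question
df_clean = df.na.drop(subset=["topics_A"])
df_clean.where(col("topics_A").isNull()).show()  # should now be empty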

Related

pyspark concatenate multiple columns where the column name is in a list

I am trying to concatenate multiple columns into just one column, but only if the column name is in a list.
So issue = {'a','b','c'} is my list, and I need to concatenate those columns into an issue column with a ; separator.
I have tried:
1.
df_issue = df.withColumn('issue', concat_ws(';',map_values(custom.({issue}))))
which returns an invalid syntax error
2.
df_issue = df.withColumn('issue', lit(issue))
which just returned a b c and not their values
Thank you
You can simply use concat_ws:
from pyspark.sql import functions as F
columns_to_concat = ['a', 'b', 'c']
df.withColumn('issue', F.concat_ws(';', *columns_to_concat))
So, if your input DataFrame is:
+---+---+---+----------+----------+-----+
| a| b| c| date1| date2|value|
+---+---+---+----------+----------+-----+
| k1| k2| k3|2022-11-11|2022-11-14| 5|
| k4| k5| k6|2022-11-15|2022-11-19| 5|
| k7| k8| k9|2022-11-15|2022-11-19| 5|
+---+---+---+----------+----------+-----+
The previous code will produce:
+---+---+---+----------+----------+-----+--------+
| a| b| c| date1| date2|value| issue|
+---+---+---+----------+----------+-----+--------+
| k1| k2| k3|2022-11-11|2022-11-14| 5|k1;k2;k3|
| k4| k5| k6|2022-11-15|2022-11-19| 5|k4;k5;k6|
| k7| k8| k9|2022-11-15|2022-11-19| 5|k7;k8;k9|
+---+---+---+----------+----------+-----+--------+
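If the list might contain names that are not actual columns of the DataFrame, one defensive variant (a sketch; issue and df stand in for your own list and DataFrame) is to intersect the list with df.columns first:
from pyspark.sql import functions as F

issue = ['a', 'b', 'c']
cols_present = [c for c in issue if c in df.columns]  # keep only names that really exist
df_issue = df.withColumn('issue', F.concat_ws(';', *cols_present))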

PySpark how to sort array of struct with 2 elements

I have the following schema
|-- node: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- score: string (nullable = true)
| | |-- toID: string (nullable = true)
+---------------------+
|                 node|
+---------------------+
|[[57753, XV0912938...|
| [[51959, 02384848...|
|[[63487, 898898989...|
+---------------------+
I want to sort based on score and select the top 10 toID values. What is an easy PySpark way to do this?
The sort_array(<array column>, asc=False) function can be used to sort the elements within the array. If the elements are structs, the array is sorted on the first field of the struct. Since score is the first field in the struct, you can sort the array of structs directly with this function.
Here's an example to get the top 5 IDs based on the score within the struct.
Suppose your data looks like the following
+----------------------------------------------------------------------------------------------------------------------------------------+
|arr_of_structs |
+----------------------------------------------------------------------------------------------------------------------------------------+
|[{270, foo0}, {237, foo1}, {2440, foo2}, {2059, foo3}, {3960, foo4}, {3009, foo5}, {842, foo6}, {1043, foo7}, {852, foo8}, {3925, foo9}]|
|[{1028, bar0}, {2571, bar1}, {1682, bar2}, {2543, bar3}, {689, bar4}, {11, bar5}, {1, bar6}, {740, bar7}, {113, bar8}, {701, bar9}] |
+----------------------------------------------------------------------------------------------------------------------------------------+
root
|-- arr_of_structs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: long (nullable = true)
| | |-- _2: string (nullable = true)
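For reference, a minimal way to build a DataFrame with that shape (a sketch, assuming a SparkSession named spark; the values are the sample ones above):
data = [
    ([(270, "foo0"), (237, "foo1"), (2440, "foo2"), (2059, "foo3"), (3960, "foo4"),
      (3009, "foo5"), (842, "foo6"), (1043, "foo7"), (852, "foo8"), (3925, "foo9")],),
    ([(1028, "bar0"), (2571, "bar1"), (1682, "bar2"), (2543, "bar3"), (689, "bar4"),
      (11, "bar5"), (1, "bar6"), (740, "bar7"), (113, "bar8"), (701, "bar9")],),
]
data_sdf = spark.createDataFrame(data, ["arr_of_structs"])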
The code is quite simple: use sort_array to sort the array of structs in descending order, then use the slice function to retain the first N elements of the sorted array.
from pyspark.sql import functions as func

data_sdf. \
withColumn('arr_of_structs_top5', func.slice(func.sort_array('arr_of_structs', asc=False), 1, 5)). \
withColumn('top5_id_only', func.expr('transform(arr_of_structs_top5, x -> x._2)')). \
drop('arr_of_structs'). \
show(truncate=False)
# +----------------------------------------------------------------------+------------------------------+
# |arr_of_structs_top5 |top5_id_only |
# +----------------------------------------------------------------------+------------------------------+
# |[{3960, foo4}, {3925, foo9}, {3009, foo5}, {2440, foo2}, {2059, foo3}]|[foo4, foo9, foo5, foo2, foo3]|
# |[{2571, bar1}, {2543, bar3}, {1682, bar2}, {1028, bar0}, {740, bar7}] |[bar1, bar3, bar2, bar0, bar7]|
# +----------------------------------------------------------------------+------------------------------+
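Back in the original question's schema, score is a string, so sorting the raw structs would compare the scores lexicographically. One option (a sketch, assuming Spark 2.4+ and a DataFrame df with the node column from the question) is to rebuild the structs with an integer score before sorting:
from pyspark.sql import functions as F

df_top = df.withColumn(
    'node_top10',
    F.expr("slice(sort_array(transform(node, x -> named_struct('score', cast(x.score as int), 'toID', x.toID)), false), 1, 10)")
).withColumn('top10_toID', F.expr('transform(node_top10, x -> x.toID)'))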

Pyspark Mean value of each element in multiple lists

I have a df with 2 columns, id and vector. This is a sample of how it looks:
+--------------------+----------+
| vector| id|
+--------------------+----------+
|[8.32,3.22,5.34,6.5]|1046091128|
|[8.52,3.34,5.31,6.3]|1046091128|
|[8.44,3.62,5.54,6.4]|1046091128|
|[8.31,3.12,5.21,6.1]|1046091128|
+--------------------+----------+
I want to groupBy id and take the mean of each element of the vectors. So, for example, the first value in the aggregated list will be (8.32+8.52+8.44+8.31)/4, and so on.
Any help is appreciated.
This assumes that you know the length of the array column:
from pyspark.sql import functions as F

l = 4  # size of the array column
df1 = df.select("id", *[F.col("vector")[i] for i in range(l)])
out = df1.groupby("id").agg(
    F.array([F.mean(c) for c in df1.columns[1:]]).alias("vector")
)
out.show(truncate=False)
+----------+----------------------------------------+
|id |vector |
+----------+----------------------------------------+
|1046091128|[8.3975, 3.325, 5.35, 6.325000000000001]|
+----------+----------------------------------------+
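If the array length is not known up front, it can be inferred from the data first (a small sketch; assumes the vector column is never null):
l = df.select(F.max(F.size("vector"))).first()[0]  # infer the array length from the data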
You can use the posexplode function and then aggregate the exploded values by average, something like below:
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [([8.32,3.22,5.34,6.5], 1046091128 ), ([8.52,3.34,5.31,6.3], 1046091128), ([8.44,3.62,5.54,6.4], 1046091128), ([8.31,3.12,5.21,6.1], 1046091128)]
schema = StructType([ StructField("vector", ArrayType(FloatType())), StructField("id", IntegerType()) ])
df = spark.createDataFrame(data=data,schema=schema)
df.select("id", posexplode("vector")).groupBy("id").pivot("pos").agg(avg("col")).show()
Output would look somewhat like:
+----------+-----------------+------------------+-----------------+-----------------+
| id| 0| 1| 2| 3|
+----------+-----------------+------------------+-----------------+-----------------+
|1046091128|8.397500038146973|3.3249999284744263|5.350000023841858|6.325000047683716|
+----------+-----------------+------------------+-----------------+-----------------+
You can rename the columns later if required.
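For example, to fold the pivoted columns back into a single array column (a sketch; the 0-3 column names come from the pivot above):
from pyspark.sql.functions import posexplode, avg, array, col

pivoted = df.select("id", posexplode("vector")).groupBy("id").pivot("pos").agg(avg("col"))
result = pivoted.select("id", array(*[col(c) for c in pivoted.columns if c != "id"]).alias("vector"))
result.show(truncate=False)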
You could also avoid the pivot by grouping by id and pos first, and then grouping by id alone with collect_list:
df.select("id", posexplode("vector")) \
    .groupby('id', 'pos').agg(avg('col').alias('vector')) \
    .groupby('id').agg(collect_list('vector').alias('vector')) \
    .show(truncate=False)
Outcome
+----------+-----------------------------------------------------------------------------+
|id |vector |
+----------+-----------------------------------------------------------------------------+
|1046091128|[8.397500038146973, 5.350000023841858, 3.3249999284744263, 6.325000047683716]|
+----------+-----------------------------------------------------------------------------+
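Note that collect_list after a shuffle does not guarantee element order (in the output above, positions 1 and 2 are swapped relative to the input). One way to make the order deterministic (a sketch, assuming Spark 2.4+) is to collect (pos, mean) structs and sort them:
from pyspark.sql.functions import posexplode, avg, collect_list, sort_array, struct, expr

(df.select("id", posexplode("vector"))
    .groupby("id", "pos").agg(avg("col").alias("mean"))
    .groupby("id").agg(sort_array(collect_list(struct("pos", "mean"))).alias("tmp"))
    .withColumn("vector", expr("transform(tmp, x -> x.mean)"))
    .drop("tmp")
    .show(truncate=False))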

Scala spark how to interact with a List[Option[Map[String, DataFrame]]]

I'm trying to interact with this List[Option[Map[String, DataFrame]]] but I'm having a bit of trouble.
Inside it has something like this:
customer1 -> dataframeX
customer2 -> dataframeY
customer3 -> dataframeZ
Where the customer is an identifier that will become a new column.
I need to do a union of dataframeX, dataframeY and dataframeZ (all the DataFrames have the same columns). Previously I had this:
map(_.get).reduce(_ union _).select(columns:_*)
It worked fine because I only had a List[Option[DataFrame]] and didn't need the identifier, but I'm having trouble with the new list. My idea is to modify my old mapping. I know I can do things like "(0).get", which would give me "Map(customer1 -> dataframeX)", but I'm not quite sure how to do that iteration inside the mapping and get the final DataFrame that is the union of all three plus the identifier. My idea:
map(/*get identifier here along with dataframe*/).reduce(_ union _).select(identifier +: columns:_*)
The final result would be something like:
-------------------------------
|identifier | product |State |
-------------------------------
| customer1| prod1 | VA |
| customer1| prod132 | VA |
| customer2| prod32 | CA |
| customer2| prod51 | CA |
| customer2| prod21 | AL |
| customer2| prod52 | AL |
-------------------------------
You could use collect to unnest Option[Map[String, DataFrame]] to Map[String, DataFrame]. To add the identifier as a column, use withColumn. So your code could look like this:
import org.apache.spark.sql.functions.lit

val result: DataFrame = frames.collect {
  case Some(m) =>
    m.map { case (identifier, dataframe) =>
      dataframe.withColumn("identifier", lit(identifier))
    }.reduce(_ union _)
}.reduce(_ union _)
Something like this perhaps?
import org.apache.spark.sql.functions.lit

list
  .flatten
  .flatMap {
    _.map { case (id, df) =>
      df.withColumn("identifier", lit(id))  // wrap the String in lit() to get a Column
    }
  }.reduce(_ union _)

Pyspark filter dataframe by columns of another dataframe

Not sure why I'm having a difficult time with this, it seems so simple considering it's fairly easy to do in R or pandas. I wanted to avoid using pandas though since I'm dealing with a lot of data, and I believe toPandas() loads all the data into the driver’s memory in pyspark.
I have 2 dataframes: df1 and df2. I want to filter df1 (remove all rows) where df1.userid = df2.userid AND df1.group = df2.group. I wasn't sure if I should use filter(), join(), or SQL. For example:
df1:
+------+----------+--------------------+
|userid| group | all_picks |
+------+----------+--------------------+
| 348| 2|[225, 2235, 2225] |
| 567| 1|[1110, 1150] |
| 595| 1|[1150, 1150, 1150] |
| 580| 2|[2240, 2225] |
| 448| 1|[1130] |
+------+----------+--------------------+
df2:
+------+----------+---------+
|userid| group | pick |
+------+----------+---------+
| 348| 2| 2270|
| 595| 1| 2125|
+------+----------+---------+
Result I want:
+------+----------+--------------------+
|userid| group | all_picks |
+------+----------+--------------------+
| 567| 1|[1110, 1150] |
| 580| 2|[2240, 2225] |
| 448| 1|[1130] |
+------+----------+--------------------+
EDIT:
I've tried many join() and filter() functions; I believe the closest I got was:
cond = [df1.userid == df2.userid, df2.group == df2.group]
df1.join(df2, cond, 'left_outer').select(df1.userid, df1.group, df1.all_picks) # Result has 7 rows
I tried a bunch of different join types, and I also tried different cond values:
cond = ((df1.userid == df2.userid) & (df2.group == df2.group)) # result has 7 rows
cond = ((df1.userid != df2.userid) & (df2.group != df2.group)) # result has 2 rows
However, it seems like the joins are adding additional rows, rather than deleting.
I'm using python 2.7 and spark 2.1.0
Left anti join is what you're looking for:
df1.join(df2, ["userid", "group"], "leftanti")
but the same thing can be done with a left outer join:
(df1
    .join(df2, ["userid", "group"], "leftouter")
    .where(df2["pick"].isNull())
    .drop(df2["pick"]))
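A quick way to check this end to end (a sketch, assuming a SparkSession named spark and recreating the sample data from the question):
df1 = spark.createDataFrame(
    [(348, 2, [225, 2235, 2225]), (567, 1, [1110, 1150]), (595, 1, [1150, 1150, 1150]),
     (580, 2, [2240, 2225]), (448, 1, [1130])],
    ["userid", "group", "all_picks"],
)
df2 = spark.createDataFrame([(348, 2, 2270), (595, 1, 2125)], ["userid", "group", "pick"])

df1.join(df2, ["userid", "group"], "leftanti").show()
# expected: only userids 567, 580 and 448 remain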