I have a df with 2 columns:
id
vector
This is a sample of how it looks:
+--------------------+----------+
| vector| id|
+--------------------+----------+
|[8.32,3.22,5.34,6.5]|1046091128|
|[8.52,3.34,5.31,6.3]|1046091128|
|[8.44,3.62,5.54,6.4]|1046091128|
|[8.31,3.12,5.21,6.1]|1046091128|
+--------------------+----------+
I want to groupBy appid and take the mean of each element of the vectors. So for example the first value in the aggregated list will be (8.32+8.52+8.44+8.31)/4 and so on.
Any help is appreciated.
This assumes that you know the length of the array column:
l = 4 #size of array column
df1 = df.select("id",*[F.col("vector")[i] for i in range(l)])
out = df1.groupby("id").agg(F.array([F.mean(i)
for i in df1.columns[1:]]).alias("vector"))
out.show(truncate=False)
+----------+----------------------------------------+
|id |vector |
+----------+----------------------------------------+
|1046091128|[8.3975, 3.325, 5.35, 6.325000000000001]|
+----------+----------------------------------------
You can use posexplode function and then aggregate the column based upon average. Something like below -
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [([8.32,3.22,5.34,6.5], 1046091128 ), ([8.52,3.34,5.31,6.3], 1046091128), ([8.44,3.62,5.54,6.4], 1046091128), ([8.31,3.12,5.21,6.1], 1046091128)]
schema = StructType([ StructField("vector", ArrayType(FloatType())), StructField("id", IntegerType()) ])
df = spark.createDataFrame(data=data,schema=schema)
df.select("id", posexplode("vector")).groupBy("id").pivot("pos").agg(avg("col")).show()
Output would look somewhat like :
+----------+-----------------+------------------+-----------------+-----------------+
| id| 0| 1| 2| 3|
+----------+-----------------+------------------+-----------------+-----------------+
|1046091128|8.397500038146973|3.3249999284744263|5.350000023841858|6.325000047683716|
+----------+-----------------+------------------+-----------------+-----------------+
You can rename the columns later if required.
Could also avoid pivot by grouping by id and pos and then later grouping by id alone to collect_list
df.select("id", posexplode("vector")).groupby('id','pos').agg(avg('col').alias('vector')).groupby('id').agg(collect_list('vector').alias('vector')).show(truncate=False)
Outcome
+----------+-----------------------------------------------------------------------------+
|id |vector |
+----------+-----------------------------------------------------------------------------+
|1046091128|[8.397500038146973, 5.350000023841858, 3.3249999284744263, 6.325000047683716]|
+----------+-----------------------------------------------------------------------------+
Related
I am trying to concatonate multiple columns to just one column, but only if the column name is in a list.
so issue = {'a','b','c'} is my list and would need to concatonate it as issue column with ; seperator.
I have tried:
1.
df_issue = df.withColumn('issue', concat_ws(';',map_values(custom.({issue}))))
Which returns invalid syntax error
2.
df_issue = df.withColumn('issue', lit(issue))
this just returnd a b c and not their value
Thank you
I have tried:
1.
df_issue = df.withColumn('issue', concat_ws(';',map_values(custom.({issue}))))
Which returns invalid syntax error
2.
df_issue = df.withColumn('issue', lit(issue))
this just returnd a b c and not their value
Thank you
You can simply use concat_ws:
from pyspark.sql import functions as F
columns_to_concat = ['a', 'b', 'c']
df.withColumn('issue', F.concat_ws(';', *columns_to_concat))
So, if your input DataFrame is:
+---+---+---+----------+----------+-----+
| a| b| c| date1| date2|value|
+---+---+---+----------+----------+-----+
| k1| k2| k3|2022-11-11|2022-11-14| 5|
| k4| k5| k6|2022-11-15|2022-11-19| 5|
| k7| k8| k9|2022-11-15|2022-11-19| 5|
+---+---+---+----------+----------+-----+
The previous code will produce:
+---+---+---+----------+----------+-----+--------+
| a| b| c| date1| date2|value| issue|
+---+---+---+----------+----------+-----+--------+
| k1| k2| k3|2022-11-11|2022-11-14| 5|k1;k2;k3|
| k4| k5| k6|2022-11-15|2022-11-19| 5|k4;k5;k6|
| k7| k8| k9|2022-11-15|2022-11-19| 5|k7;k8;k9|
+---+---+---+----------+----------+-----+--------+
I have a pyspark dataframe with two columns, ID and Elements. Column "Elements" has list element in it. It looks like this,
ID | Elements
_______________________________________
X |[Element5, Element1, Element5]
Y |[Element Unknown, Element Unknown, Element_Z]
I want to form a column with the most frequent element in the column 'Elements.' Output should look like,
ID | Elements | Output_column
__________________________________________________________________________
X |[Element5, Element1, Element5] | Element5
Y |[Element Unknown, Element Unknown, Element_Z] | Element Unknown
How can I do that using pyspark?
Thanks in advance.
We can use higher order functions here (available from spark 2.4+)
First use transform and aggregate to get counts for each distinct value in the array.
Then sort the array of structs in descending manner and then get the first element.
from pyspark.sql import functions as F
temp = (df.withColumn("Dist",F.array_distinct("Elements"))
.withColumn("Counts",F.expr("""transform(Dist,x->
aggregate(Elements,0,(acc,y)-> IF (y=x, acc+1,acc))
)"""))
.withColumn("Map",F.arrays_zip("Dist","Counts")
)).drop("Dist","Counts")
out = temp.withColumn("Output_column",
F.expr("""element_at(array_sort(Map,(first,second)->
CASE WHEN first['Counts']>second['Counts'] THEN -1 ELSE 1 END),1)['Dist']"""))
Output:
Note that I have added a blank array for ID z to test. Also you can drop the column Map by adding .drop("Map") to the output
out.show(truncate=False)
+---+---------------------------------------------+--------------------------------------+---------------+
|ID |Elements |Map |Output_column |
+---+---------------------------------------------+--------------------------------------+---------------+
|X |[Element5, Element1, Element5] |[{Element5, 2}, {Element1, 1}] |Element5 |
|Y |[Element Unknown, Element Unknown, Element_Z]|[{Element Unknown, 2}, {Element_Z, 1}]|Element Unknown|
|Z |[] |[] |null |
+---+---------------------------------------------+--------------------------------------+---------------+
For lower versions, you can use a udf with statistics mode:
from pyspark.sql import functions as F,types as T
from statistics import mode
u = F.udf(lambda x: mode(x) if len(x)>0 else None,T.StringType())
df.withColumn("Output",u("Elements")).show(truncate=False)
+---+---------------------------------------------+---------------+
|ID |Elements |Output |
+---+---------------------------------------------+---------------+
|X |[Element5, Element1, Element5] |Element5 |
|Y |[Element Unknown, Element Unknown, Element_Z]|Element Unknown|
|Z |[] |null |
+---+---------------------------------------------+---------------+
You can use pyspark sql functions to achieve that (spark 2.4+).
Here is a generic function that adds a new column containing the most common element in another array column. Here it is:
import pyspark.sql.functions as sf
def add_most_common_val_in_array(df, arraycol, drop=False):
"""Takes a spark df column of ArrayType() and returns the most common element
in the array in a new column of the df called f"MostCommon_{arraycol}"
Args:
df (spark.DataFrame): dataframe
arraycol (ArrayType()): array column in which you want to find the most common element
drop (bool, optional): Drop the arraycol after finding most common element. Defaults to False.
Returns:
spark.DataFrame: df with additional column containing most common element in arraycol
"""
dvals = f"distinct_{arraycol}"
dvalscount = f"distinct_{arraycol}_count"
startcols = df.columns
df = df.withColumn(dvals, sf.array_distinct(arraycol))
df = df.withColumn(
dvalscount,
sf.transform(
dvals,
lambda uval: sf.aggregate(
arraycol,
sf.lit(0),
lambda acc, entry: sf.when(entry == uval, acc + 1).otherwise(acc),
),
),
)
countercol = f"ReverseCounter{arraycol}"
df = df.withColumn(countercol, sf.map_from_arrays(dvalscount, dvals))
mccol = f"MostCommon_{arraycol}"
df = df.withColumn(mccol, sf.element_at(countercol, sf.array_max(dvalscount)))
df = df.select(*startcols, mccol)
if drop:
df = df.drop(arraycol)
return df
I have a column in my Spark DataFrame:
|-- topics_A: array (nullable = true)
| |-- element: string (containsNull = true)
I'm using CountVectorizer on it:
topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A")
I get NullPointerExceptions, because sometimes the topic_A column contains null.
Is there a way around this? Filling it with a zero-length array would work ok (although it will blow out the data size quite a lot) - but I can't work out how to do a fillNa on an Array column in PySpark.
Personally I would drop columns with NULL values because there is no useful information there but you can replace nulls with empty arrays. First some imports:
from pyspark.sql.functions import when, col, coalesce, array
You can define an empty array of specific type as:
fill = array().cast("array<string>")
and combine it with when clause:
topics_a = when(col("topics_A").isNull(), fill).otherwise(col("topics_A"))
or coalesce:
topics_a = coalesce(col("topics_A"), fill)
and use it as:
df.withColumn("topics_A", topics_a)
so with example data:
df = sc.parallelize([(1, ["a", "b"]), (2, None)]).toDF(["id", "topics_A"])
df_ = df.withColumn("topics_A", topics_a)
topic_vectorizer_A.fit(df_).transform(df_)
the result will be:
+---+--------+-------------------+
| id|topics_A| topics_vec_A|
+---+--------+-------------------+
| 1| [a, b]|(2,[0,1],[1.0,1.0])|
| 2| []| (2,[],[])|
+---+--------+-------------------+
I had similar issue, based on comment, I used following syntax to resolve before tokenizing:
remove the null values
clean_text_ddf.where(col("title").isNull()).show()
cleaned_text=clean_text_ddf.na.drop(subset=["title"])
cleaned_text.where(col("title").isNull()).show()
cleaned_text.printSchema()
cleaned_text.show(2)
+-----+
|title|
+-----+
+-----+
+-----+
|title|
+-----+
+-----+
root
|-- title: string (nullable = true)
+--------------------+
| title|
+--------------------+
|Mr. Beautiful (Up...|
|House of Ravens (...|
+--------------------+
only showing top 2 rows
Not sure why I'm having a difficult time with this, it seems so simple considering it's fairly easy to do in R or pandas. I wanted to avoid using pandas though since I'm dealing with a lot of data, and I believe toPandas() loads all the data into the driver’s memory in pyspark.
I have 2 dataframes: df1 and df2. I want to filter df1 (remove all rows) where df1.userid = df2.userid AND df1.group = df2.group. I wasn't sure if I should use filter(), join(), or sql For example:
df1:
+------+----------+--------------------+
|userid| group | all_picks |
+------+----------+--------------------+
| 348| 2|[225, 2235, 2225] |
| 567| 1|[1110, 1150] |
| 595| 1|[1150, 1150, 1150] |
| 580| 2|[2240, 2225] |
| 448| 1|[1130] |
+------+----------+--------------------+
df2:
+------+----------+---------+
|userid| group | pick |
+------+----------+---------+
| 348| 2| 2270|
| 595| 1| 2125|
+------+----------+---------+
Result I want:
+------+----------+--------------------+
|userid| group | all_picks |
+------+----------+--------------------+
| 567| 1|[1110, 1150] |
| 580| 2|[2240, 2225] |
| 448| 1|[1130] |
+------+----------+--------------------+
EDIT:
I've tried many join() and filter() functions, I believe the closest I got was:
cond = [df1.userid == df2.userid, df2.group == df2.group]
df1.join(df2, cond, 'left_outer').select(df1.userid, df1.group, df1.all_picks) # Result has 7 rows
I tried a bunch of different join types, and I also tried different
cond values:
cond = ((df1.userid == df2.userid) & (df2.group == df2.group)) # result has 7 rows
cond = ((df1.userid != df2.userid) & (df2.group != df2.group)) # result has 2 rows
However, it seems like the joins are adding additional rows, rather than deleting.
I'm using python 2.7 and spark 2.1.0
Left anti join is what you're looking for:
df1.join(df2, ["userid", "group"], "leftanti")
but the same thing can be done with left outer join:
(df1
.join(df2, ["userid", "group"], "leftouter")
.where(df2["pick"].isNull())
.drop(df2["pick"]))
I'm looking for a way to add a new column in a Spark DataFrames from a list of strings, in just one simple line.
Given :
rdd = sc.parallelize([((u'2016-10-19', u'2016-293'), 40020),
((u'2016-10-19', u'2016-293'), 143938),
((u'2016-10-19', u'2016-293'), 135891225.0)
])
This is my code to structure my rdd and get a Dataframe :
def structurate_CohortPeriod_metrics_by_OrderPeriod(line):
((OrderDate, OrderPeriod), metrics) = line
metrics = str(metrics)
return OrderDate, OrderPeriod, metrics
(rdd
.map(structurate_CohortPeriod_metrics_by_OrderPeriod)
.toDF(['OrderDate', 'OrderPeriod', 'MetricValue'])
.show())
Result :
+----------+-----------+-----------+
| OrderDate|OrderPeriod|MetricValue|
+----------+-----------+-----------+
|2016-10-19| 2016-293| 40020|
|2016-10-19| 2016-293| 143938|
|2016-10-19| 2016-293|135891225.0|
+----------+-----------+-----------+
I want to add a new column to precise the metric's name. This is what I've done :
def structurate_CohortPeriod_metrics_by_OrderPeriod(line):
(((OrderDate, OrderPeriod), metrics), index) = line
metrics = str(metrics)
return OrderDate, OrderPeriod, metrics, index
df1 = (rdd
.zipWithIndex()
.map(structurate_CohortPeriod_metrics_by_OrderPeriod)
.toDF(['OrderDate', 'OrderPeriod', 'MetricValue', 'index']))
Then
from pyspark.sql.types import StructType, StructField, StringType
df2 = sqlContext.createDataFrame(sc.parallelize([('0', 'UsersNb'),
('1', 'VideosNb'),
('2', 'VideosDuration')]),
StructType([StructField('index', StringType()),
StructField('MetricName', StringType())]))
df2.show()
+-----+--------------+
|index| MetricName|
+-----+--------------+
| 0| UsersNb|
| 1| VideosNb|
| 2|VideosDuration|
+-----+--------------+
And finally :
(df1
.join(df2, df1.index == df2.index)
.drop(df2.index)
.select('index', 'OrderDate', 'OrderPeriod', 'MetricName', 'MetricValue')
.show())
+-----+----------+-----------+--------------+-----------+
|index| OrderDate|OrderPeriod| MetricName|MetricValue|
+-----+----------+-----------+--------------+-----------+
| 0|2016-10-19| 2016-293| VideosNb| 143938|
| 1|2016-10-19| 2016-293| UsersNb| 40020|
| 2|2016-10-19| 2016-293|VideosDuration|135891225.0|
+-----+----------+-----------+--------------+-----------+
This is my expected output, but this method takes considerably longer. I want to do this in just one or two lines.. Something like for exemple the lit method :
from pyspark.sql.functions import lit
df1.withColumn('MetricName', lit('my_string'))
But I need of course to put 3 different strings : 'VideosNb', 'UsersNb' and 'VideosDuration'.
Ideas ? Thank you very much !