Let's say I have the following data - order_id and product_names.
data = [["1", ["Organic A", "Apple"],
["2", ["Organic B", "Chocolate", "Organic C]]
If I want to create a dataframe and add a new column product_count so the output looks like the following, how can I do that?
Output:
+--------+-------------+---------------------------------------+
|order_id|product_count|product_names                          |
+--------+-------------+---------------------------------------+
|1       |2            |["Organic A", "Apple"]                 |
|2       |3            |["Organic B", "Chocolate", "Organic C"]|
+--------+-------------+---------------------------------------+
You can use the size function to get the length of the product_names array column:
from pyspark.sql import functions as F

df = df.select('order_id', F.size('product_names').alias('product_count'), 'product_names')
df.show(truncate=False)
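Putting it together, here is a minimal end-to-end sketch (it assumes an active SparkSession named spark):

from pyspark.sql import functions as F

data = [["1", ["Organic A", "Apple"]],
        ["2", ["Organic B", "Chocolate", "Organic C"]]]
df = spark.createDataFrame(data, ["order_id", "product_names"])

df = df.select("order_id",
               F.size("product_names").alias("product_count"),
               "product_names")
df.show(truncate=False)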
I am trying to concatenate multiple columns into just one column, but only if the column name is in a list.
So issue = {'a','b','c'} is my list, and I need to concatenate those columns into an issue column with a ; separator.
I have tried:
1.
df_issue = df.withColumn('issue', concat_ws(';',map_values(custom.({issue}))))
This returns an invalid syntax error.
2.
df_issue = df.withColumn('issue', lit(issue))
This just returned a b c and not their values.
Thank you
You can simply use concat_ws:
from pyspark.sql import functions as F
columns_to_concat = ['a', 'b', 'c']
df.withColumn('issue', F.concat_ws(';', *columns_to_concat))
So, if your input DataFrame is:
+---+---+---+----------+----------+-----+
| a| b| c| date1| date2|value|
+---+---+---+----------+----------+-----+
| k1| k2| k3|2022-11-11|2022-11-14| 5|
| k4| k5| k6|2022-11-15|2022-11-19| 5|
| k7| k8| k9|2022-11-15|2022-11-19| 5|
+---+---+---+----------+----------+-----+
The previous code will produce:
+---+---+---+----------+----------+-----+--------+
| a| b| c| date1| date2|value| issue|
+---+---+---+----------+----------+-----+--------+
| k1| k2| k3|2022-11-11|2022-11-14| 5|k1;k2;k3|
| k4| k5| k6|2022-11-15|2022-11-19| 5|k4;k5;k6|
| k7| k8| k9|2022-11-15|2022-11-19| 5|k7;k8;k9|
+---+---+---+----------+----------+-----+--------+
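If you only want to include the columns whose names appear in your issue list, one hedged sketch (assuming issue holds column names, as in the question) is to intersect it with df.columns first:

from pyspark.sql import functions as F

issue = {'a', 'b', 'c'}
# keep only the columns that actually exist in the DataFrame and are in the list
columns_to_concat = [c for c in df.columns if c in issue]
df = df.withColumn('issue', F.concat_ws(';', *columns_to_concat))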
I have a df with 2 columns:
id
vector
This is a sample of how it looks:
+--------------------+----------+
| vector| id|
+--------------------+----------+
|[8.32,3.22,5.34,6.5]|1046091128|
|[8.52,3.34,5.31,6.3]|1046091128|
|[8.44,3.62,5.54,6.4]|1046091128|
|[8.31,3.12,5.21,6.1]|1046091128|
+--------------------+----------+
I want to group by id and take the mean of each element of the vectors. So, for example, the first value in the aggregated list should be (8.32 + 8.52 + 8.44 + 8.31) / 4, and so on.
Any help is appreciated.
This assumes that you know the length of the array column:
from pyspark.sql import functions as F

l = 4  # size of the array column
df1 = df.select("id", *[F.col("vector")[i] for i in range(l)])
out = df1.groupby("id").agg(
    F.array([F.mean(c) for c in df1.columns[1:]]).alias("vector"))
out.show(truncate=False)
+----------+----------------------------------------+
|id |vector |
+----------+----------------------------------------+
|1046091128|[8.3975, 3.325, 5.35, 6.325000000000001]|
+----------+----------------------------------------+
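If the array length is not known up front, one possible variation (an assumption on my part, not part of the original answer) is to read it from the data first, provided all arrays have the same length:

from pyspark.sql import functions as F

# take the array length from the first row; assumes every vector has the same size
l = df.select(F.size("vector")).first()[0]
df1 = df.select("id", *[F.col("vector")[i] for i in range(l)])
out = df1.groupby("id").agg(
    F.array([F.mean(c) for c in df1.columns[1:]]).alias("vector"))
out.show(truncate=False)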
You can use the posexplode function and then aggregate the column by average. Something like below:
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [([8.32,3.22,5.34,6.5], 1046091128 ), ([8.52,3.34,5.31,6.3], 1046091128), ([8.44,3.62,5.54,6.4], 1046091128), ([8.31,3.12,5.21,6.1], 1046091128)]
schema = StructType([ StructField("vector", ArrayType(FloatType())), StructField("id", IntegerType()) ])
df = spark.createDataFrame(data=data,schema=schema)
df.select("id", posexplode("vector")).groupBy("id").pivot("pos").agg(avg("col")).show()
Output would look somewhat like:
+----------+-----------------+------------------+-----------------+-----------------+
| id| 0| 1| 2| 3|
+----------+-----------------+------------------+-----------------+-----------------+
|1046091128|8.397500038146973|3.3249999284744263|5.350000023841858|6.325000047683716|
+----------+-----------------+------------------+-----------------+-----------------+
You can rename the columns later if required.
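For instance, a small sketch using DataFrame.toDF to rename everything at once (the v0..v3 names are just illustrative, and it assumes four positions as above):

pivoted = df.select("id", posexplode("vector")).groupBy("id").pivot("pos").agg(avg("col"))
pivoted = pivoted.toDF("id", "v0", "v1", "v2", "v3")  # illustrative names for positions 0..3
pivoted.show()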
You could also avoid the pivot by grouping by id and pos, and then grouping by id alone with collect_list (note that collect_list does not guarantee element order, as the outcome below shows):
df.select("id", posexplode("vector")).groupby('id','pos').agg(avg('col').alias('vector')).groupby('id').agg(collect_list('vector').alias('vector')).show(truncate=False)
Outcome
+----------+-----------------------------------------------------------------------------+
|id |vector |
+----------+-----------------------------------------------------------------------------+
|1046091128|[8.397500038146973, 5.350000023841858, 3.3249999284744263, 6.325000047683716]|
+----------+-----------------------------------------------------------------------------+
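If the element order matters, one possible fix (a sketch, not from the original answer) is to collect (pos, avg) structs, sort them, and then extract the averages:

from pyspark.sql.functions import posexplode, avg, struct, sort_array, collect_list, col

ordered = (df.select("id", posexplode("vector"))
             .groupby("id", "pos").agg(avg("col").alias("avg"))
             .groupby("id")
             .agg(sort_array(collect_list(struct("pos", "avg"))).alias("tmp"))  # sorts by pos
             .select("id", col("tmp.avg").alias("vector")))
ordered.show(truncate=False)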
I am new to Spark (with Python) and couldn't figure this out even after looking through relevant posts.
I have a RDD. Each record of the RDD is a list of lists as below
[[1073914607, 0, -1],[1073914607, 2, 7.88],[1073914607, 0, -1],[1073914607, 4, 40.0]]
[[1074079003, 0, -1],[1074079003, 2, 2.87],[1074079003, 0, -1],[1074079003, 4, 35.2]]
I want to convert the RDD to a DataFrame with 3 columns, basically stacking all the element lists. The DataFrame should look like below.
account_id  product_id  price
1073914607  0           -1
1073914607  2           7.88
1073914607  0           -1
1073914607  4           40
1074079003  0           -1
1074079003  2           2.87
1074079003  0           -1
1074079003  4           35.2
I have tried my_rdd.toDF(), but it gives me two rows and four columns, with each element list in its own column. I also tried some solutions suggested in other posts that might be relevant, but since I am pretty new to Spark I got various errors that I couldn't figure out. Please help. Thanks.
Added on 07/28/2021: In the end I did the following to loop through each element, build one long list, and convert it into a DataFrame. It is probably not the most efficient way, but it solved my issue.
result_lst = []
for x in my_rdd.toLocalIterator():
    for y in x:
        result_lst.append(y)

result_df = spark.createDataFrame(result_lst, ['account_id', 'product_id', 'price'])
>>> data = ([[1,2],[1,4]],[[2,5],[2,6]])
>>> df = sc.parallelize(data).toDF(['c1','c2'])
>>> df.show()
+------+------+
| c1| c2|
+------+------+
|[1, 2]|[1, 4]|
|[2, 5]|[2, 6]|
+------+------+
>>> df1 = df.select(df.c1.alias('c3')).union(df.select(df.c2.alias('c3')))
>>> df1.show()
+------+
| c3|
+------+
|[1, 2]|
|[2, 5]|
|[1, 4]|
|[2, 6]|
+------+
>>> df1.select(df1.c3,df1.c3[0],df1.c3[1]).show()
+------+-----+-----+
| c3|c3[0]|c3[1]|
+------+-----+-----+
|[1, 2]| 1| 2|
|[2, 5]| 2| 5|
|[1, 4]| 1| 4|
|[2, 6]| 2| 6|
+------+-----+-----+
I later used another way, shown below, to solve the problem without bringing the RDD to the driver with toLocalIterator() and looping through it. I guess this new way is more efficient.
from pyspark.sql.functions import explode
from pyspark.sql import Row
df_exploded=my_rdd.map(lambda x : Row(x)).toDF().withColumn('_1', explode('_1'))
result_df=df_exploded.select([df_exploded._1[i] for i in range(3)]).toDF('account_id','product_id','price')
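Another option, just a sketch and not from the original posts, is to flatten the nested lists directly on the RDD with flatMap, which avoids both the driver-side loop and the explode step:

result_df = my_rdd.flatMap(lambda rows: rows).toDF(['account_id', 'product_id', 'price'])
result_df.show()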
I have data in a (day, hour, minute) format, e.g. (5, 3, 12).
I want to insert these tuples into a list, e.g. ((5, 3, 12), (1, 14, 21), ...).
I am new to Scala and I do not know how to do this. I then need to create a DataFrame from this data.
data = Seq(
  (l, m, r)
).toDF("day", "hour", "minutes")
Like this. If anyone can show me the best practice for doing this, I would appreciate it. Thanks! Maybe I need to explain my question more. I have done the same thing in Python. It looks like this:
data = []
for index in table:
    index = str(index)
    parts = index.split(" ")  # first element is parts[0]
    hours_minutes_second = parts[1].split(":")
    year = parts[0].split("-")
    dates = (year[0], year[1], year[2],
             hours_minutes_second[0], hours_minutes_second[1], hours_minutes_second[2])
    data.append(dates)

df = pd.DataFrame(data, columns=['day', 'hours', 'minutes'])
You do not have to worry about the index and the data formats. What I need is to create something like a two-dimensional list and then make a DataFrame from it!
[
( '25', '06', '55'),
( '24', '14', '51'),
( '24', '06', '24'),
( '24', '03', '42'),
( '23', '19', '30')]
For your question, I assume that you want to create a Spark DataFrame, which is quite different from a Scala List.
To create a DataFrame this way, you first need to define the schema. You can do it like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.rdd.RDD

val schema = StructType(
  List(
    StructField("day", IntegerType),
    StructField("hour", IntegerType),
    StructField("minutes", IntegerType)
  ))
There are several ways to create a DataFrame, for example:
Given a Seq[Row], you can create an RDD[Row] and then build the DataFrame from it:
val rdd: RDD[Row] = spark.sparkContext.parallelize(Seq(Row(5, 3, 12)))
val df = spark.createDataFrame(rdd, schema)
df.show()
/*
+---+----+-------+
|day|hour|minutes|
+---+----+-------+
| 5| 3| 12|
+---+----+-------+
*/
For the case of a two-dimensional list:
val schema2 = StructType(
  List(
    StructField("day", StringType),
    StructField("hour", StringType),
    StructField("minutes", StringType)
  ))

val list = List(
  Seq("25", "06", "55"),
  Seq("24", "14", "51"),
  Seq("24", "06", "24"),
  Seq("24", "03", "42"),
  Seq("23", "19", "30"))

val rdd: RDD[Row] = spark.sparkContext.parallelize(list.map(el => Row.fromSeq(el)))
val df = spark.createDataFrame(rdd, schema2)
df.show()
/*
+---+----+-------+
|day|hour|minutes|
+---+----+-------+
| 25| 06| 55|
| 24| 14| 51|
| 24| 06| 24|
| 24| 03| 42|
| 23| 19| 30|
+---+----+-------+
*/
What I want (very simplified):
Input Dataset to Output dataset
Some of the code I tried:
def add_columns(cur_typ, target, value):
    if cur_typ == target:
        return value
    return None

schema = T.StructType([T.StructField("name", T.StringType(), True),
                       T.StructField("typeT", T.StringType(), True),
                       T.StructField("value", T.IntegerType(), True)])
data = [("x", "a", 3), ("x", "b", 5), ("x", "c", 7), ("y", "a", 1), ("y", "b", 2),
        ("y", "c", 4), ("z", "a", 6), ("z", "b", 2), ("z", "c", 3)]
df = ctx.spark_session.createDataFrame(ctx.spark_session.sparkContext.parallelize(data), schema)
targets = [i.typeT for i in df.select("typeT").distinct().collect()]
add_columns = F.udf(add_columns)
w = Window.partitionBy('name')
for target in targets:
    df = df.withColumn(target, F.max(F.lit(add_columns(df["typeT"], F.lit(target), df["value"]))).over(w))
df = df.drop("typeT", "value").dropDuplicates()
Another version:
targets = df.select(F.collect_set("typeT").alias("typeT")).first()["typeT"]
w = Window.partitionBy('name')
for target in targets:
    df = df.withColumn(target, F.max(F.lit(F.when(df["typeT"] == F.lit(target), df["value"])
                                           .otherwise(None))).over(w))
df = df.drop("typeT", "value").dropDuplicates()
For small datasets both work, but I have a DataFrame with 1 million rows and 5000 different typeTs, so the result should be a table of about 500 x 5000 (some names do not have certain typeTs). Now I get stack overflow errors (py4j.protocol.Py4JJavaError: An error occurred while calling o7624.withColumn.
: java.lang.StackOverflowError) trying to create this DataFrame. Besides increasing the stack size, what can I do? Is there a better way to get the same result?
Using withColumn in a loop is not good when the number of columns to be added is large.
Instead, build a list of columns and select them all at once, which results in better performance:
cols = [F.col("name")]
for target in targets:
    cols.append(F.max(F.lit(add_columns(df["typeT"], F.lit(target), df["value"]))).over(w).alias(target))

df = df.select(cols)
This produces the same output:
+----+---+---+---+
|name| c| b| a|
+----+---+---+---+
| x| 7| 5| 3|
| z| 3| 2| 6|
| y| 4| 2| 1|
+----+---+---+---+
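Another route worth trying for this reshape, sketched here rather than taken from the answer above, is Spark's built-in pivot. Passing the known targets explicitly lets Spark skip the extra pass that computes the distinct pivot values:

from pyspark.sql import functions as F

# pivot typeT into columns, keeping the single value per (name, typeT) via max
df_wide = df.groupBy("name").pivot("typeT", targets).agg(F.max("value"))
df_wide.show()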