pyspark collect_set or collect_list with groupby - list

How can I use collect_set or collect_list on a dataframe after a groupby? For example, df.groupby('key').collect_set('values') fails with: AttributeError: 'GroupedData' object has no attribute 'collect_set'

You need to use agg. Example:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
+---+-----+-----+
| id| code| name|
+---+-----+-----+
| a| null| null|
| a|code1| null|
| a|code2|name2|
+---+-----+-----+
Note in the above you have to create a HiveContext. See https://stackoverflow.com/a/35529093/690430 for dealing with different Spark versions.
(df
    .groupby("id")
    .agg(F.collect_set("code"),
         F.collect_list("name"))
    .show())
+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
| a| [code1, code2]| [name2]|
+---+-----------------+------------------+
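To make the output above easier to read: both functions skip nulls, collect_list keeps duplicates, and collect_set keeps each distinct value once. The sketch below is only a plain-Python model of those semantics, not Spark's implementation; the function names mirror the Spark ones for readability, and Spark itself does not guarantee the order of the collected elements.

```python
def collect_list(values):
    """Model of collect_list: keep every non-null value, duplicates included."""
    return [v for v in values if v is not None]


def collect_set(values):
    """Model of collect_set: keep each distinct non-null value once."""
    seen = []
    for v in values:
        if v is not None and v not in seen:
            seen.append(v)
    return seen


# The "code" and "name" columns of the single group id == "a" from the example.
codes = [None, "code1", "code2"]
names = [None, None, "name2"]

print(collect_set(codes))   # ['code1', 'code2']
print(collect_list(names))  # ['name2']
```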

If your dataframe is large, you can try a grouped aggregate pandas UDF (GROUPED_AGG), which may help avoid memory errors and can also be faster.
Grouped aggregate Pandas UDFs are similar to Spark aggregate functions. They are used with groupBy().agg() and pyspark.sql.Window, and define an aggregation from one or more pandas.Series to a scalar value, where each pandas.Series represents a column within the group or window.
Example:
import pyspark.sql.functions as F
@F.pandas_udf('string', F.PandasUDFType.GROUPED_AGG)
def collect_list(name):
    # name is a pandas.Series; drop nulls first, since str.join only accepts strings
    return ', '.join(name.dropna())
grouped_df = df.groupby('id').agg(collect_list(df["name"]).alias('names'))

Related

Change values within AWS Glue DynamicFrame columns

I am trying to change values within some columns of my DynamicFrame in an AWS Glue job.
I see there is a Map function that seems useful for the task, but I cannot make it work.
This is my code:
def map_values_in_columns(self, df):
    df = Map.apply(frame = df, f = self._map_values_in_columns)
    return df
def _map_values_in_columns(self, rec):
    for k, v in self.config['value_mapping'].items():
        column_name = self.config['value_mapping'][k]['column_name']
        values = self.config['value_mapping'][k]['values']
        for old_value, new_value in values.items():
            if rec[column_name] == old_value:
                rec[column_name] = new_value
    return rec
My config file is a yaml file with this structure:
value_mapping:
  column_1:
    column_name: asd
    values:
      - old_value_1: new_value_1
      - old_value_2: new_value_2
  column_2:
    column_name: dsa
      - old_value_1: new_value_1
      - old_value_2: new_value_2
The above method throws a serialisation error:
_pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o81.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
I am not sure if this is due to how I am implementing the Map method, or if I should use a completely different approach.
So the question:
How can I change multiple values within multiple columns using AWS DynamicFrame, trying to avoid conversion back and forth between DynamicFrames and DataFrames?
There are a few problems in your code and YAML config, which I'm not going to debug here. See the working sample below, which can also be executed locally in a Jupyter notebook.
I have simplified the yaml to keep the parsing complexity low.
from awsglue.context import GlueContext
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.dynamicframe import DynamicFrame
import yaml
glueContext = GlueContext(SparkContext.getOrCreate())
# SparkSession used by createDataFrame below
spark = glueContext.spark_session
columns = ["id", "asd", "dsa"]
data = [("1", "retain", "old_val_dsa_1"), ("2", "old_val_asd_1", "old_val_dsa_2"), ("3", "old_val_asd_2", "retain"), ("4", None, "")]
df = spark.createDataFrame(data).toDF(*columns)
dyF = DynamicFrame.fromDF(df, glueContext, "test_dyF")
# safe_load needs no Loader argument, which newer PyYAML versions require for yaml.load
config = yaml.safe_load('''value_mapping:
  asd:
    old_val_asd_1: new_val_asd_1
    old_val_asd_2: new_val_asd_2
  dsa:
    old_val_dsa_1: new_val_dsa_1
    old_val_dsa_2: new_val_dsa_2''')
def map_values(rec):
    # k is the column name, v maps old values to new values for that column
    for k, v in config['value_mapping'].items():
        if rec[k] is not None:
            replacement_val = v.get(rec[k])
            if replacement_val is not None:
                rec[k] = replacement_val
    return rec
print("-- dyF --")
dyF.toDF().show()
mapped_dyF = Map.apply(frame = dyF, f = map_values)
print("-- mapped_dyF --")
mapped_dyF.toDF().show()
-- dyF --
+---+-------------+-------------+
| id| asd| dsa|
+---+-------------+-------------+
| 1| retain|old_val_dsa_1|
| 2|old_val_asd_1|old_val_dsa_2|
| 3|old_val_asd_2| retain|
| 4| null| |
+---+-------------+-------------+
-- mapped_dyF --
+-------------+-------------+---+
| asd| dsa| id|
+-------------+-------------+---+
| retain|new_val_dsa_1| 1|
|new_val_asd_1|new_val_dsa_2| 2|
|new_val_asd_2| retain| 3|
| null| | 4|
+-------------+-------------+---+
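The record-level function can also be checked without Glue, since Map.apply just calls it once per record. Below is a minimal sketch on plain dicts; the value_mapping dict stands in for the parsed YAML, and the sample records are copied from the data above. Note that the function only closes over a plain dict (no self, no GlueContext), which is presumably why it serialises cleanly, unlike the method in the question.

```python
# Stand-in for the parsed YAML config: column name -> {old value: new value}.
value_mapping = {
    "asd": {"old_val_asd_1": "new_val_asd_1", "old_val_asd_2": "new_val_asd_2"},
    "dsa": {"old_val_dsa_1": "new_val_dsa_1", "old_val_dsa_2": "new_val_dsa_2"},
}


def map_values(rec):
    """Replace mapped values in place; leave nulls and unmapped values alone."""
    for column, replacements in value_mapping.items():
        current = rec.get(column)
        if current is not None:
            rec[column] = replacements.get(current, current)
    return rec


records = [
    {"id": "1", "asd": "retain", "dsa": "old_val_dsa_1"},
    {"id": "2", "asd": "old_val_asd_1", "dsa": "old_val_dsa_2"},
    {"id": "3", "asd": "old_val_asd_2", "dsa": "retain"},
    {"id": "4", "asd": None, "dsa": ""},
]

# Copy each record so the originals stay untouched.
mapped = [map_values(dict(r)) for r in records]
for row in mapped:
    print(row)
```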

PySpark - How to pass a list to a User Defined Function?

I have a DataFrame with 2 columns. Column 1 is "code" which can repeat more than 1 time and column 2 which is "Values". For example, column 1 is 1,1,1,5,5 and Column 2 is 15,18,24,38,41. What I want to do is first sort by the 2 columns ( df.sort("code","Values") ) and then do a ("groupBy" "Code") and (agg Values) but I want to apply a UDF on values so I need to pass the "Values" of each code as a "list" to the UDF. I am not sure how many "Values" each Code will have. As you can see in this example "Code" 1 has 3 values and "Code" 5 has 2 Values. So for each "Code" I need to pass all the "Values" of that "Code" as a list to the UDF.
You can do a groupBy and then use the collect_set or collect_list function in PySpark. Below is an example dataframe for your use case (I hope this is what you are referring to):
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F  # needed for F.collect_list below
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
    ("code1", "val1"),
    ("code1", "val2"),
    ("code1", "val3"),
    ("code2", "val1"),
    ("code2", "val2"),
], ["code", "val"])
df.show()
+-----+----+
| code| val|
+-----+----+
|code1|val1|
|code1|val2|
|code1|val3|
|code2|val1|
|code2|val2|
+-----+----+
Now the groupBy and collect_list command:
(df
    .groupby("code")
    .agg(F.collect_list("val"))
    .show())
Output:
+------+------------------+
|code |collect_list(val) |
+------+------------------+
|code1 |[val1, val2, val3]|
|code2 |[val1, val2] |
+------+------------------+
As shown above, the second column contains the list of aggregated values for each code.
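To get from the collected list to the UDF the question asks about, one option is to wrap collect_list in a UDF inside agg. This is only a sketch: range_of_values and apply_per_code are hypothetical names, the column names "code" and "Values" are taken from the question, and the function sorts the list itself because collect_list does not guarantee element order after a shuffle.

```python
def range_of_values(values):
    """Plain Python function that receives all values of one group as a list."""
    ordered = sorted(values)
    return ordered[-1] - ordered[0] if ordered else None


def apply_per_code(df):
    """Group by "code" and hand each group's "Values" to the function as a list."""
    # Imported here so range_of_values stays usable without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    range_udf = F.udf(range_of_values, IntegerType())
    return df.groupBy("code").agg(
        range_udf(F.collect_list("Values")).alias("value_range"))


# The function itself can be checked on the values from the question.
print(range_of_values([15, 18, 24]))  # 9
print(range_of_values([41, 38]))      # 3
```

Calling apply_per_code(df).show() on the dataframe from the question would then give one row per code.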

Identify Partition Key Column from a table using PySpark

I need help finding the partition column names for a Hive table using PySpark. The table might have multiple partition columns, and preferably the output should be a list of the partition columns for the Hive table.
It would be great if the result would also include the datatype of the partitioned columns.
Any suggestions will be helpful.
It can be done using desc as shown below:
df=spark.sql("""desc test_dev_db.partition_date_table""")
>>> df.show(truncate=False)
+-----------------------+---------+-------+
|col_name |data_type|comment|
+-----------------------+---------+-------+
|emp_id |int |null |
|emp_name |string |null |
|emp_salary |int |null |
|emp_date |date |null |
|year |string |null |
|month |string |null |
|day |string |null |
|# Partition Information| | |
|# col_name |data_type|comment|
|year |string |null |
|month |string |null |
|day |string |null |
+-----------------------+---------+-------+
Since this table is partitioned, you can see the partition column information along with the data types.
It seems you are interested in just the partition column names and their respective data types, so I am creating a list of tuples.
partition_list=df.select(df.col_name,df.data_type).rdd.map(lambda x:(x[0],x[1])).collect()
>>> print partition_list
[(u'emp_id', u'int'), (u'emp_name', u'string'), (u'emp_salary', u'int'), (u'emp_date', u'date'), (u'year', u'string'), (u'month', u'string'), (u'day', u'string'), (u'# Partition Information', u''), (u'# col_name', u'data_type'), (u'year', u'string'), (u'month', u'string'), (u'day', u'string')]
partition_details = [partition_list[index+1:] for index,item in enumerate(partition_list) if item[0]=='# col_name']
>>> print partition_details
[[(u'year', u'string'), (u'month', u'string'), (u'day', u'string')]]
It will return an empty list if the table is not partitioned. Hope this helps.
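The slicing step can also be written as a small helper that returns a flat list and stops at the end of the partition section. This is a sketch under one assumption: that a blank or '#'-prefixed row ends the section, which matters only if the describe output carries more rows after the partition columns. The name partition_columns is hypothetical, and the sample rows are copied from the output above.

```python
def partition_columns(rows):
    """Return the (name, type) pairs listed after the '# col_name' marker row."""
    result = []
    in_partition_section = False
    for name, data_type in rows:
        if name == "# col_name":
            in_partition_section = True
            continue
        if in_partition_section:
            # Assumption: a blank or '#'-prefixed row ends the partition section.
            if not name or name.startswith("#"):
                break
            result.append((name, data_type))
    return result


rows = [
    ("emp_id", "int"), ("emp_name", "string"), ("emp_salary", "int"),
    ("emp_date", "date"), ("year", "string"), ("month", "string"),
    ("day", "string"), ("# Partition Information", ""),
    ("# col_name", "data_type"), ("year", "string"), ("month", "string"),
    ("day", "string"),
]

print(partition_columns(rows))
# [('year', 'string'), ('month', 'string'), ('day', 'string')]
```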
The following snippet:
1. Gets the columns for the given table
2. Keeps only the partition columns
3. Extracts (name, datatype) tuples from the partition columns
# s: pyspark.sql.session.SparkSession
# table: str
# 1. Get table columns for given table
columns = s.catalog.listColumns(table)
# 2. Keep only the partition columns
partition_columns = [c for c in columns if c.isPartition]
# 3. Now you can extract the name and dataType (among other attributes)
[ (c.name, c.dataType) for c in partition_columns ]
Another simple method using a PySpark script:
from pyspark.sql.types import StructType, StructField, StringType
import pyspark.sql.functions as f
descschema = StructType([StructField("col_name", StringType()),
                         StructField("data_type", StringType()),
                         StructField("comment", StringType())])
df = spark.sql("describe formatted serve.cust_transactions")
# Keep only the rows named 'Part 0', 'Part 2' and 'Name', and only their data_type value
df2 = df.where((f.col("col_name") == 'Part 0') | (f.col("col_name") == 'Part 2') | (f.col("col_name") == 'Name')).select(f.col('data_type'))
df3 = df2.toPandas().transpose()
display(df3)
Result would be :

How to create a typed empty MapType?

I have a dataframe schema that I want to match that has a column of type MapType(StringType(), StringType()). I tried the following implementations (using Spark 2.2.1):
import pyspark.sql.functions as fx
from pyspark.sql.types import *
df = spark.createDataFrame([[1]], ['id'])
df = df.withColumn("map", fx.udf(dict, MapType(StringType(), StringType()))())
df = df.withColumn("map2", fx.create_map().cast(MapType(StringType(), StringType())))
The second attempt without the udf gives me this casting error:
cannot resolve 'map()' due to data type mismatch: cannot cast MapType(NullType,NullType,false) to MapType(StringType,StringType,true)
Is there a correct way to write the second implementation (without the UDF)?
I'm not sure if this is the "correct way" but here is a way to do this without the udf:
Create a new dataframe by specifying a schema, and do a crossJoin():
df = spark.createDataFrame([[1]], ['id'])
data = [({},)]
schema = StructType([StructField("map2", MapType(StringType(), StringType()))])
df2 = spark.createDataFrame(data, schema)
df.crossJoin(df2).show()
+---+-----+
| id| map2|
+---+-----+
| 1|Map()|
+---+-----+

How to split the elements of a specific column in a dataframe created from a csv file in PySpark?

I have created a dataframe in PySpark from a csv file with data with columns in the following format:
+---+--------------+-------------+
| ID| FileID| TestID|
+---+--------------+-------------+
| 1| HD_Fly_456_34|Gone_YT_78_67|
| 2|FG_Home_567_54|Gone_YT_78_22|
| 3| GD_Go_678_87|Gone_YT_06_82|
| 4| GH_Buy_908_45|Gone_YT_92_70|
| 5| HJ_Get_789_65|Gone_YT_98_43|
+---+--------------+-------------+
I used the following lines of code to create a dataframe:
df=sqlc.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("testfile.csv")
I need to split the elements of the columns FileID, TestID and so on at the _ (underscore) so that they can be stored in a new column.
I am using the following method:
df.join(df['FileID'].str.split('_', 1, expand=True).rename(columns={0:'R', 1:'R1',2:'R2',3:'R3'}))
I get the following error:
df.join(df['FileID'].str.split('_', 1, expand=True).rename(columns={0:'R', 1:'R1',2:'R2',3:'R3'}))
TypeError: 'Column' object is not callable
How do I get to where I need to be?
While sometimes similar, PySpark is not the same as Pandas.
I'd use split:
from pyspark.sql.functions import split
# split returns an array column; index it to get the individual parts
parts = split("TestID", "_")
df.select(
    parts[0].alias("R"), parts[1].alias("R1"),
    parts[2].alias("R2"), parts[3].alias("R3"))
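Since the question mentions FileID, TestID "and so on", the same idea can be looped over several columns. This is a sketch with hypothetical helper names (part_names, split_columns) and a naming scheme of my own (column name plus part index); it assumes every value has four underscore-separated parts, as in the sample data.

```python
def part_names(column, n_parts):
    """Output column names for the parts of one column, e.g. FileID_0 .. FileID_3."""
    return ["{}_{}".format(column, i) for i in range(n_parts)]


def split_columns(df, columns, n_parts=4, sep="_"):
    """Select ID plus one new column per part of each listed column."""
    # Imported here so part_names stays usable without Spark installed.
    from pyspark.sql.functions import split

    exprs = []
    for column in columns:
        parts = split(df[column], sep)
        exprs += [parts[i].alias(name)
                  for i, name in enumerate(part_names(column, n_parts))]
    return df.select("ID", *exprs)


print(part_names("FileID", 4))
# ['FileID_0', 'FileID_1', 'FileID_2', 'FileID_3']
```

Usage would be split_columns(df, ["FileID", "TestID"]).show() on the dataframe from the question.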