Error writing parquet file in s3 with Pyspark - amazon-web-services

I'm trying to read some tables (parquet files), do some joins, and write the result as parquet to S3, but the write either fails with an error or takes more than a couple of hours.
error:
An error was encountered:
Invalid status code '400' from https://.... with error payload: {"msg":"requirement failed: session isn't active."}
I am able to write the other tables as parquet; only this table fails.
This is my sample code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.config("spark.sql.catalogImplementation", "in-memory").getOrCreate()
table1 = spark.read.parquet("s3://.../table1")
table1.createOrReplaceTempView("table1")
table2 = spark.read.parquet("s3://.../table2")
table2.createOrReplaceTempView("table2")
table3 = spark.read.parquet("s3://.../table3")
table3.createOrReplaceTempView("table3")
table4 = spark.read.parquet("s3://.../table4")
table4.createOrReplaceTempView("table4")
Final_table = spark.sql("""
select
    a.col1,
    a.col2,
    ...
    d.coln
from
    table1 a
left outer join
    table2 b
on
    cond1
    and cond2
    and cond3
left outer join
    table3 c
on
    ...
""")
Final_table.count()
# 3813731240
output_file="s3://.../final_table/"
Final_table.write.option("partitionOverwriteMode", "dynamic").mode('overwrite').partitionBy("col1").parquet(output_file)
To add more detail: I've tried repartitioning, but it didn't help. I've also tried different EMR clusters, such as:
Cluster 1: Master m5.24xlarge
Cluster 2: Master m5.24xlarge, 1 core node m5.24xlarge
Cluster 3: Master m5d.2xlarge, 8 core nodes m5d.2xlarge
EMR release version: 5.29.0

Most Spark jobs can be optimized by visualizing their DAG.
In this scenario, if the SQL and the count run quickly and almost all of the time is spent on the write, here are some suggestions:
Since you already know the row count of your dataframe, drop the count() call; it is unnecessary overhead for the job.
You are partitioning the output by col1, so repartition the dataframe on that column before writing so that as little shuffling as possible happens at write time.
You can do something like
df.repartition(100, 'col1').write
You can also choose the number of partitions based on the distinct count of col1 if you know it.
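Putting the pieces together, a minimal sketch of the suggested write, reusing the names from the question (the partition count of 100 is only a placeholder):
# Repartition on the partition column so each output partition is built with
# as little shuffling as possible, then do the dynamic partition overwrite.
(Final_table
    .repartition(100, "col1")
    .write
    .option("partitionOverwriteMode", "dynamic")
    .mode("overwrite")
    .partitionBy("col1")
    .parquet(output_file))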

Related

AWS Glue Job hangs from time to time

I've got an AWS Glue job that reads data from 22 MySQL tables, transforms the data with SQL queries, and fills 8 MySQL tables in a different schema.
These are fairly simple queries: a few joins that execute in at most a few seconds.
All tables combined hold around 1.5 million records. I run this job incrementally every 4 hours; the number of records inserted each time is between 100 and 5,000. Here is an example of one part of the script, containing the query:
# SOURCE TABLES
item_node1647201451763 = glueContext.create_dynamic_frame.from_catalog(
    database="db-" + args['ENVIRONMENT'],
    table_name="item",
    transformation_ctx="chopdb2item_node1647201451763",
)
inventory_node1647206574114 = glueContext.create_dynamic_frame.from_catalog(
    database="bi-db-" + args['BIDB_ENVIRONMENT'],
    table_name="inventory"
)
# [...OTHER SOURCE TABLE DECLARATIONS HERE...]
# SQL
SqlQuery12 = """
select distinct r.id receipt_id,
i.total value_total,
i.qty quantity,
d2.id receipt_upload_date_id,
i.text item_name,
i.brand item_brand,
i.gtin item_gtin,
i.ioi,
cp.client_promotion_id,
inv.inventory_id
from receipt r
join item i on i.receipt_id = r.id
join verdict v on r.verdict_id = v.id and v.awardable = 1
join account a on v.account_id = a.id
join offer o on v.offer_id = o.id
join date_dimension d2
on d2.occurence_date = DATE(r.upload_time)
left join client_promotion cp
on cp.client_key = a.client_key and
cp.promotion_key = o.promotion_key
left join inventory inv on inv.inventory_gtin = i.gtin
"""
extractitemfacttabledata_node1647205714873 = sparkSqlQuery(
    glueContext,
    query=SqlQuery12,
    mapping={
        "item": item_node1647201451763,
        "receipt": receipt_node1647477767036,
        "verdict": verdict_without_bookmark,
        "account": account_without_bookmark,
        "offer": offer_without_bookmark,
        "date_dimension": date_dimension_node1649721691167,
        "client_promotion": client_promotion_node1647198052897,
        "inventory": inventory_node1647206574114,
    },
    transformation_ctx="extractitemfacttabledata_node1647205714873",
)
# WRITING BACK TO MYSQL DATABASE
itemfacttableinwarehouse_node1647210647655 = glueContext.write_from_options(
    frame_or_dfc=extractitemfacttabledata_node1647205714873,
    connection_type="mysql",
    connection_options={
        "url": f"{args['DB_HOST_CM']}:{args['DB_PORT_CM']}/chopbidb?serverTimezone=UTC&useSSL=false",
        "user": args['DB_USERNAME_CM'],
        "password": args['DB_PASSWORD_CM'],
        "dbtable": "bidb.item",  # THIS IS DIFFERENT SCHEMA THAN THE SOURCE ITEM TABLE
        "bulksize": 1
    }
)
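For context, sparkSqlQuery in Glue Studio generated scripts is usually a small helper along these lines (reconstructed here as an assumption; its definition is not shown in the original job):
from awsglue.dynamicframe import DynamicFrame

def sparkSqlQuery(glueContext, query, mapping, transformation_ctx):
    # Register every DynamicFrame in the mapping as a temp view, run the SQL
    # against those views, and wrap the result back into a DynamicFrame.
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = glueContext.spark_session.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)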
The structure of the file is:
ALL source table declarations
ALL transformations (SQL queries)
ALL writes (inserting back into the MySQL database)
My problem is that, from time to time, the job hangs and runs indefinitely. I set the timeout to 8 hours, so it stays in progress for 8 hours and is then cancelled.
I set up the Apache Spark history UI and tried to analyze the logs.
The results are as follows:
90% of the time the hang is on the query quoted above, but sometimes that one succeeds and the next one fails. It looks like it might fail while trying to insert the data.
Usually it happens when there is very little load on the system (in the middle of the night).
What I tried:
in the first version of the script I inserted the data using glueContext.write_dynamic_frame.from_catalog. I thought the issue might be bulk inserts causing deadlocks in the database, but changing it to write_from_options with bulksize = 1 did not help
increased resources in RDS -> did not help
moving the insert into this particular table to a later point in the script usually also resulted in its failing

Spark SQL Union on Empty dataframe

We have a Glue job running a script in which we created two dataframes and registered them as temp views with createOrReplaceTempView. The dataframes are then combined via a union:
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
joined_df = spark.sql(
"""
Select
id
, name
, product
, year
, month
, day
from df1
Union ALL
Select
id
, name
, product
, year
, month
, day
from df2
"""
)
joined_df.createOrReplaceTempView("joined_df")
Everything appeared to be working fine until it failed. After some research, I suspect it is because one of the dataframes is empty. This job runs daily, and occasionally one of the dataframes (df2) has no data for the day.
The error message returned in the CloudWatch log is not entirely clear:
pyspark.sql.utils.AnalysisException: u"cannot resolve '`id`' given input columns:
[]; line 22 pos 7;\n'Union\n:- Project [contract_id#0, agr_type_cd#1, ind_product_cd#2,
report_dt#3, effective_dt#4, agr_trns_cd#5, agreement_cnt#6, dth_bnft#7, admn_system_cd#8,
ind_lob_cd#9, cca_agreement_cd#10, year#11, month#12, day#13]\n: +- SubqueryAlias `df1`\n
'Project ['id, name#20,
'product, 'year, 'month, 'day]\n +- SubqueryAlias `df2`\n +- LogicalRDD false\n"
I'm seeking advice on how to resolve such an issue. Thank you.
UPDATE:
I'm fairly certain I know the issue, but I'm unsure how to solve it. On days when the source file in S3 is "empty" (no records, only a header row), the read from the catalog returns an empty set:
df2 = glueContext.create_dynamic_frame.from_catalog(
    database=glue_database,
    table_name=df2_table,
    push_down_predicate=pred
)
df2 = df2.toDF()
df2.show()
The output:
++
||
++
++
Essentially, the from_catalog method is not reading the schema from Glue. I would expect that even without data the header would be detected, and the union would simply return everything from df1. But since I'm receiving an empty set without a header, the union cannot happen; it acts as though the schema has changed or doesn't exist, even though the underlying S3 files have not changed schema. This job triggers daily once the source files have been loaded to S3. When there is data in the file, the union does not fail, since the schema can be detected from_catalog. Is there a way to read the header even when no data is returned?
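One way to keep the union resolvable (a sketch that assumes the column list from the query above, with string types used purely for illustration) is to pin the schema yourself and substitute an empty DataFrame whenever the catalog read comes back with no columns:
from pyspark.sql.types import StructType, StructField, StringType

# Illustrative schema: the real column names/types come from your Glue table definition.
df2_schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("product", StringType(), True),
    StructField("year", StringType(), True),
    StructField("month", StringType(), True),
    StructField("day", StringType(), True),
])

df2 = glueContext.create_dynamic_frame.from_catalog(
    database=glue_database,
    table_name=df2_table,
    push_down_predicate=pred
).toDF()

# If the read came back with no columns (empty or header-only file),
# replace it with an empty DataFrame carrying the expected schema so the
# UNION ALL still resolves.
if not df2.columns:
    df2 = spark.createDataFrame([], df2_schema)
df2.createOrReplaceTempView("df2")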

Athena query returns empty result because of timing issues

I'm trying to create and query an Athena table based on data located in S3, and it seems there are some timing issues.
How can I know when all the partitions have been loaded into the table?
The following code returns an empty result:
athena_client.start_query_execution(QueryString=app_query_create_table,
ResultConfiguration={'OutputLocation': output_location})
athena_client.start_query_execution(QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`"
.format(athena_db=athena_db, athena_db_partition=athena_db_partition),
ResultConfiguration={'OutputLocation': output_location})
result = query.format(athena_db_partition=athena_db_partition, delta=delta, dt=dt)
But when I add some delay, it works great:
athena_client.start_query_execution(QueryString=app_query_create_table,
ResultConfiguration={'OutputLocation': output_location})
athena_client.start_query_execution(QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`"
.format(athena_db=athena_db, athena_db_partition=athena_db_partition),
ResultConfiguration={'OutputLocation': output_location})
time.sleep(3)
result = query.format(athena_db_partition=athena_db_partition, delta=delta, dt=dt)
The following is the query for creating the table:
query_create_table = '''
CREATE EXTERNAL TABLE `{athena_db}`.`{athena_db_partition}` (
`time` string,
`user_advertiser_id` string,
`predictions` float
) PARTITIONED BY (
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://{bucket}/path/'
'''
app_query_create_table = query_create_table.format(bucket=bucket,
athena_db=athena_db,
athena_db_partition=athena_db_partition)
I would love to get some help.
The start_query_execution call only starts the query; it does not wait for it to complete. You must call get_query_execution periodically until the status of the execution is SUCCEEDED (or FAILED).
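A minimal polling sketch, reusing the boto3 athena_client and the variables from the question:
import time

def wait_for_query(athena_client, query_execution_id, poll_seconds=2):
    # Poll get_query_execution until Athena reports a terminal state.
    while True:
        response = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
        state = response['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return state
        time.sleep(poll_seconds)

repair = athena_client.start_query_execution(
    QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`".format(
        athena_db=athena_db, athena_db_partition=athena_db_partition),
    ResultConfiguration={'OutputLocation': output_location})
if wait_for_query(athena_client, repair['QueryExecutionId']) != 'SUCCEEDED':
    raise RuntimeError("partition load did not succeed")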
Not related to your problem per se, but if you create a table with CREATE TABLE … AS there is no need to add partitions with MSCK REPAIR TABLE … afterwards: a table created that way already contains all the partitions produced by the query.
Also, in general, avoid MSCK REPAIR TABLE; it is slow and inefficient. There are many better ways to add partitions to a table, see https://athena.guide/articles/five-ways-to-add-partitions/
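For example, one of those alternatives is to register just the new partition explicitly (a sketch reusing the question's variables; it assumes the data lands under the table location in Hive-style dt=YYYY-MM-DD/ prefixes):
# Add the partition for the current dt directly instead of scanning the
# whole table location with MSCK REPAIR TABLE.
athena_client.start_query_execution(
    QueryString=(
        "ALTER TABLE `{athena_db}`.`{athena_db_partition}` "
        "ADD IF NOT EXISTS PARTITION (dt = '{dt}')"
    ).format(athena_db=athena_db, athena_db_partition=athena_db_partition, dt=dt),
    ResultConfiguration={'OutputLocation': output_location})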

How do I use EMR with spark DataFrames

I am running an AWS EMR job with Spark. My input data is in my S3 bucket (CSV files compressed as .gz).
I am trying to filter multiple input files (one month's worth of data, 1 file = 1 day) by first reading them into a Spark dataframe, doing some filtering, and writing the result back to my S3 bucket.
My problem: I thought Spark dataframes were already optimized to run on multiple nodes, but when I run my code it only uses one node, resulting in a long computing time.
My code
from pyspark.sql import SparkSession

input_bucket = 's3://my-bucket'  # placeholder bucket
input_path = '/2019/01/*/*.gz'   # reading all January files
spark = SparkSession.builder.appName("Pythonexample").getOrCreate()
df = spark.read.csv(path=input_bucket + input_path, header=True, inferSchema=True)
df = df.drop("Time", "Status")   # keeping only the relevant columns
df = df.dropDuplicates()
df.show()
data = return_duplicates(df, 'ID')  # data = df without unique rows, only duplicates
data.write.format("com.databricks.spark.csv").option("header", "true").save(input_bucket + '/output')
my function
from pyspark.sql import Window
from pyspark.sql import functions as f

def return_duplicates(df, column):
    # Keep only the rows whose value in `column` occurs more than once.
    w = Window.partitionBy(column)
    return df.select('*', f.count(column).over(w).alias('dupeCount')).where('dupeCount > 1').drop('dupeCount')
Question: What should I change?
How can I use MapReduce or something similar (parallelize()?) with Spark dataframes so that multiple nodes are used and the computing time goes down?
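One commonly cited factor (an assumption on my part, not from the original post) is that gzip files are not splittable, so each .gz file is read by a single task; repartitioning right after the read spreads the rows across the executors:
# .gz (gzip) input is not splittable, so each file arrives as exactly one partition.
# Repartitioning after the read lets the rest of the pipeline use all executors.
# The target of 200 partitions is purely illustrative.
df = spark.read.csv(path=input_bucket + input_path, header=True, inferSchema=True)
df = df.repartition(200)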

AWS Glue Dynamic_frame with pushdown predicate not filtering correctly

I am writing a script for AWS Glue that reads parquet files stored in S3, in which I create a DynamicFrame and attempt to use push-down-predicate logic to restrict the data that is read.
The table partitions are (in order): account_id > region > vpc_id > dt
And the code for creating the dynamic_frame is the following:
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
database = DATABASE_NAME,
table_name= TABLE_NAME,
push_down_predicate = "dt='" + DATE + "'")
where DATE = '2019-10-29'
However, it seems that Glue still attempts to read data from other days. Maybe it's because I have to specify a push_down_predicate for the other partition keys as well?
As per the comments, the logs show that the date partition column appears as "dt" in the S3 paths, whereas in your table it is referred to by the name "date".
Logs
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY/dt=2019-07-15
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-10-03
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-08-27
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-10-29 ...
Your Code
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
database = DATABASE_NAME,
table_name= TABLE_NAME,
push_down_predicate = "date='" + DATE + "'")
Change the date partition column's name to dt in your table, and do the same in the push_down_predicate parameter in the code above.
I also see extra forward slashes in some of the paths in the logs above. Were these partitions added manually through Athena using the ALTER TABLE command? If so, I would recommend using the MSCK REPAIR TABLE command to load all partitions into the table, to avoid such issues. Extra empty slashes in the S3 path sometimes lead to errors when doing ETL through Spark.
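A minimal sketch of that repair step, reusing the placeholders from the code above (it assumes the Glue job is configured to use the Glue Data Catalog as its Spark/Hive metastore; otherwise run the same statement from the Athena console):
# Rebuild the partition metadata from the Hive-style prefixes
# (account_id=/region=/vpc_id=/dt=) under the table location.
spark = glueContext.spark_session
spark.sql("MSCK REPAIR TABLE {}.{}".format(DATABASE_NAME, TABLE_NAME))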