Spark SQL Union on Empty dataframe - amazon-web-services

We have a Glue job running a script in which we register two DataFrames as temporary views using createOrReplaceTempView. The views are then combined via a UNION ALL:
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
joined_df = spark.sql(
    """
    SELECT
        id
        , name
        , product
        , year
        , month
        , day
    FROM df1
    UNION ALL
    SELECT
        id
        , name
        , product
        , year
        , month
        , day
    FROM df2
    """
)
joined_df.createOrReplaceTempView("joined_df")
Everything appeared to be working fine until it failed. Upon research, I suspect it is because one of the DataFrames is empty. This job runs daily, and occasionally one of the DataFrames (df2) will have no data for the day.
The error message returned in the CloudWatch logs is not entirely clear:
pyspark.sql.utils.AnalysisException: u"cannot resolve '`id`' given input columns:
[]; line 22 pos 7;\n'Union\n:- Project [contract_id#0, agr_type_cd#1, ind_product_cd#2,
report_dt#3, effective_dt#4, agr_trns_cd#5, agreement_cnt#6, dth_bnft#7, admn_system_cd#8,
ind_lob_cd#9, cca_agreement_cd#10, year#11, month#12, day#13]\n: +- SubqueryAlias `df1`\n
'Project ['id, name#20,
'product, 'year, 'month, 'day]\n +- SubqueryAlias `df2`\n +- LogicalRDD false\n"
I'm seeking advice on how to resolve such an issue. Thank you.
UPDATE:
I'm fairly certain I know the issue, but I'm unsure how to solve it. On days when the source file in S3 is "empty" (no records, only the header row), the read from the catalog returns an empty set:
df2 = glueContext.create_dynamic_frame.from_catalog(
    database=glue_database,
    table_name=df2_table,
    push_down_predicate=pred
)
df2 = df2.toDF()
df2.show()
The output:
++
||
++
++
Essentially, the from_catalog method is not reading the schema from the Glue catalog. I would expect that even without data the header would be detected, and the union would simply return everything from df1. But since I'm receiving an empty set with no header, the union cannot occur: it acts as though the schema has changed or doesn't exist, even though the underlying S3 files have not changed schema. This job triggers daily when source files have been loaded to S3. When there is data in the file, the union does not fail, because the schema can be detected from the catalog. Is there a way to read the header even when no data is returned?
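One workaround that can be sketched here (an assumption, not a confirmed fix: the column names and types below are guessed from the union query and should be adjusted to the real catalog definition) is to fall back to an empty DataFrame with an explicit schema whenever the catalog read comes back with no columns:
from pyspark.sql.types import StructType, StructField, StringType

# Assumed schema for df2, based on the columns used in the UNION ALL query.
df2_schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("product", StringType(), True),
    StructField("year", StringType(), True),
    StructField("month", StringType(), True),
    StructField("day", StringType(), True),
])

df2 = glueContext.create_dynamic_frame.from_catalog(
    database=glue_database,
    table_name=df2_table,
    push_down_predicate=pred
).toDF()

# If the read produced no columns at all, substitute an empty DataFrame
# with the expected schema so the UNION ALL still resolves.
if len(df2.columns) == 0:
    df2 = spark.createDataFrame([], df2_schema)

df2.createOrReplaceTempView("df2")
With the schema pinned this way, on empty days the UNION ALL simply returns everything from df1.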

Related

AWS Glue Job hangs from time to time

I've got an AWS Glue job which reads data from 22 MySQL tables, transforms it using queries, and fills 8 MySQL tables in a different schema.
These are fairly simple queries: a few joins that execute in at most a few seconds.
All tables combined have around 1.5 million records. I run this job incrementally every 4 hours; the number of records inserted each time is between 100 and 5,000. Here is an example of one part of the script, containing the query:
# SOURCE TABLES
item_node1647201451763 = glueContext.create_dynamic_frame.from_catalog(
    database="db-" + args['ENVIRONMENT'],
    table_name="item",
    transformation_ctx="chopdb2item_node1647201451763",
)
inventory_node1647206574114 = glueContext.create_dynamic_frame.from_catalog(
    database="bi-db-" + args['BIDB_ENVIRONMENT'],
    table_name="inventory"
)
# [...OTHER SOURCE TABLE DECLARATIONS HERE...]
# SQL
SqlQuery12 = """
select distinct r.id receipt_id,
i.total value_total,
i.qty quantity,
d2.id receipt_upload_date_id,
i.text item_name,
i.brand item_brand,
i.gtin item_gtin,
i.ioi,
cp.client_promotion_id,
inv.inventory_id
from receipt r
join item i on i.receipt_id = r.id
join verdict v on r.verdict_id = v.id and v.awardable = 1
join account a on v.account_id = a.id
join offer o on v.offer_id = o.id
join date_dimension d2
on d2.occurence_date = DATE(r.upload_time)
left join client_promotion cp
on cp.client_key = a.client_key and
cp.promotion_key = o.promotion_key
left join inventory inv on inv.inventory_gtin = i.gtin
"""
extractitemfacttabledata_node1647205714873 = sparkSqlQuery(
    glueContext,
    query=SqlQuery12,
    mapping={
        "item": item_node1647201451763,
        "receipt": receipt_node1647477767036,
        "verdict": verdict_without_bookmark,
        "account": account_without_bookmark,
        "offer": offer_without_bookmark,
        "date_dimension": date_dimension_node1649721691167,
        "client_promotion": client_promotion_node1647198052897,
        "inventory": inventory_node1647206574114,
    },
    transformation_ctx="extractitemfacttabledata_node1647205714873",
)
# WRITING BACK TO MYSQL DATABASE
itemfacttableinwarehouse_node1647210647655 = glueContext.write_from_options(
    frame_or_dfc=extractitemfacttabledata_node1647205714873,
    connection_type="mysql",
    connection_options={
        "url": f"{args['DB_HOST_CM']}:{args['DB_PORT_CM']}/chopbidb?serverTimezone=UTC&useSSL=false",
        "user": args['DB_USERNAME_CM'],
        "password": args['DB_PASSWORD_CM'],
        "dbtable": "bidb.item",  # THIS IS DIFFERENT SCHEMA THAN THE SOURCE ITEM TABLE
        "bulksize": 1
    }
)
The structure of the file is:
all source table declarations
all transformations (SQL queries)
all writes (inserting back into the MySQL database)
My problem is that, from time to time, the job hangs and runs indefinitely. I set the timeout to 8 hours, so it stays in progress for 8 hours and is then cancelled.
I installed the Apache Spark history UI and tried to analyze the logs.
The results are as follows:
90% of the time the failing query is the one quoted above, but sometimes it succeeds and the next one fails. It looks like it might fail when trying to insert the data.
It usually happens when there is very little load on the system (in the middle of the night).
What I tried:
In the first version of the script I inserted the data using glueContext.write_dynamic_frame.from_catalog. I thought the issue might be bulk inserts causing deadlocks in the database, but changing it to write_from_options with bulksize = 1 did not help.
Increasing resources in RDS did not help.
Moving the insert into this particular table further down the script usually also resulted in it failing.
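For reference, a sketch of what a throttled version of that write could look like (only an idea, not something the post confirms; the partition count of 4 and the frame name are arbitrary placeholders): coalescing the frame before the JDBC write limits how many concurrent connections hit MySQL at once.
from awsglue.dynamicframe import DynamicFrame

# Coalesce the underlying DataFrame so only a few concurrent JDBC
# connections write to MySQL at once (4 is an arbitrary placeholder).
coalesced_frame = DynamicFrame.fromDF(
    extractitemfacttabledata_node1647205714873.toDF().coalesce(4),
    glueContext,
    "coalesced_item_fact",
)
glueContext.write_from_options(
    frame_or_dfc=coalesced_frame,
    connection_type="mysql",
    connection_options={
        "url": f"{args['DB_HOST_CM']}:{args['DB_PORT_CM']}/chopbidb?serverTimezone=UTC&useSSL=false",
        "user": args['DB_USERNAME_CM'],
        "password": args['DB_PASSWORD_CM'],
        "dbtable": "bidb.item",
        "bulksize": 1,
    },
)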

Athena query returns empty result because of timing issues

I'm trying to create and query an Athena table based on data located in S3, and it seems that there are some timing issues.
How can I know when all the partitions have been loaded to the table?
The following code returns an empty result -
athena_client.start_query_execution(QueryString=app_query_create_table,
ResultConfiguration={'OutputLocation': output_location})
athena_client.start_query_execution(QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`"
.format(athena_db=athena_db, athena_db_partition=athena_db_partition),
ResultConfiguration={'OutputLocation': output_location})
result = query.format(athena_db_partition=athena_db_partition, delta=delta, dt=dt)
But when I add some delay, it works great -
athena_client.start_query_execution(QueryString=app_query_create_table,
ResultConfiguration={'OutputLocation': output_location})
athena_client.start_query_execution(QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`"
.format(athena_db=athena_db, athena_db_partition=athena_db_partition),
ResultConfiguration={'OutputLocation': output_location})
time.sleep(3)
result = query.format(athena_db_partition=athena_db_partition, delta=delta, dt=dt)
The following is the query for creating the table -
query_create_table = '''
CREATE EXTERNAL TABLE `{athena_db}`.`{athena_db_partition}` (
`time` string,
`user_advertiser_id` string,
`predictions` float
) PARTITIONED BY (
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://{bucket}/path/'
'''
app_query_create_table = query_create_table.format(bucket=bucket,
athena_db=athena_db,
athena_db_partition=athena_db_partition)
I would love to get some help.
The start_query_execution call only starts the query; it does not wait for it to complete. You must call get_query_execution periodically until the status of the execution is successful (or failed).
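A minimal polling sketch (the wait_for_query helper is not part of the original answer; athena_client, output_location and the MSCK query are the ones from the post):
import time

def wait_for_query(athena_client, query_execution_id, poll_seconds=1):
    """Poll Athena until the query reaches a terminal state, then return it."""
    while True:
        response = athena_client.get_query_execution(
            QueryExecutionId=query_execution_id
        )
        state = response['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return state
        time.sleep(poll_seconds)

repair = athena_client.start_query_execution(
    QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`"
                .format(athena_db=athena_db, athena_db_partition=athena_db_partition),
    ResultConfiguration={'OutputLocation': output_location},
)
state = wait_for_query(athena_client, repair['QueryExecutionId'])
# Only run the dependent query once the repair has SUCCEEDED.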
Not related to your problem per se, but if you create a table with CREATE TABLE … AS there is no need to add partitions with MSCK REPAIR TABLE … afterwards: a table created that way already contains all the partitions produced by the query.
Also, in general, avoid using MSCK REPAIR TABLE, it is slow and inefficient. There are many better ways to add partitions to a table, see https://athena.guide/articles/five-ways-to-add-partitions/
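As one example of such an alternative (a sketch; the dt value and the partition location are placeholders, not values from the post), a known partition can be added directly with ALTER TABLE instead of repairing the whole table:
# Add a single, known partition instead of scanning the whole table
# location with MSCK REPAIR TABLE. The dt value and S3 path pattern
# below are placeholders.
add_partition_query = """
    ALTER TABLE `{athena_db}`.`{athena_db_partition}`
    ADD IF NOT EXISTS PARTITION (dt = '{dt}')
    LOCATION 's3://{bucket}/path/dt={dt}/'
""".format(athena_db=athena_db,
           athena_db_partition=athena_db_partition,
           bucket=bucket,
           dt=dt)

athena_client.start_query_execution(
    QueryString=add_partition_query,
    ResultConfiguration={'OutputLocation': output_location},
)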

Efficiently aggregate (filter/select) a large dataframe in a loop and create a new dataframe

I have one large DataFrame created by importing a CSV file (Spark CSV). This DataFrame has many rows of daily data. The data is identified by date, region, service_offered and count.
I filter for region and service_offered, aggregate the count (sum), and roll that up to month. Each time the filter is run in the loop, it selects a region, then a service_offered, and aggregates that.
If I append that to the DataFrame over and over, the big-O cost kicks in and it becomes very slow. There are 360 offices and about 5-10 services per office. How do I save each select/filter to a list first and append those before making the final DataFrame?
I saw this post, Using pandas .append within for loop, but it only shows list.a and list.b. What about 360 lists?
Here is my code that loops/aggregates the data
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Spark session
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Spark schema
schema = StructType([
    StructField('office', StringType(), True),
    StructField('service', StringType(), True),
    StructField('date', StringType(), True),
    StructField('count', IntegerType(), True)
])

# Empty dataframe to collect the results
office_summary = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
office_summary.printSchema()

x = 1
try:
    for office in office_lookup:
        office = office[0]
        print(office_locations_count - x, " office(s) left")
        x = x + 1
        for transaction in service_lookup:
            transaction = transaction[0]
            # Filter to one office/service pair, then aggregate the counts
            monthly_counts = source_data.filter(
                (col("office").rlike(office)) & (col("service").rlike(transaction))
            ).groupby("office", "service", "date").sum()
            #office_summary = office_summary.unionAll(monthly_counts)
except Exception as e:
    print(e)
I know there are issues with rlike returning more results than expected, but that is not a problem with the current data. The first 30% of the process is very quick, and then it starts to slow down, as expected.
How do I save a filter result to a list, append or join them over and over, and finally create the final DataFrame? This code does finish, but it should not take 30 minutes to run.
Thanks!
With help from @Andrew, I was able to keep using PySpark DataFrames to accomplish this. The command, scrubbed, looks like this:
df2 = df1.groupby("Column1", "Column2", "Column3").agg(sum('COUNT'))
This allowed me to create a new DataFrame based on df1 where the grouping and aggregation is done on one line. This command takes about 0.5 seconds to execute, versus the approach in the initial post.
The constant creation of new DataFrames, while retaining the old ones in memory, was the problem. This is a far more efficient way to do it.
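For completeness, a sketch of what the single-pass version can look like when also rolled up to month (the month derivation is an assumption that the date column parses as something like yyyy-MM-dd; the column names follow the schema from the question):
from pyspark.sql.functions import col, sum as sum_, date_format

# One aggregation over the full source data replaces the nested loops:
# every (office, service, month) combination is summed in a single pass.
office_summary = (
    source_data
    .withColumn("month", date_format(col("date").cast("date"), "yyyy-MM"))
    .groupBy("office", "service", "month")
    .agg(sum_("count").alias("monthly_count"))
)
office_summary.show()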

QuickSight "could not generate any output column after applying transformation" error

I am running a query that works perfectly in AWS Athena. However, when I use Athena as a data source from QuickSight and try to run the query, it keeps giving me the error message "QuickSight could not generate any output column after applying transformation".
Here is my query:
WITH register as (
select created_at as register_time
, serial_number
, node_name
, node_visible_time_name
from table1
where type = 'register'),
bought as (
select created_at as bought_time
, node_name
, serial_number
from table1
where type= 'bought')
SELECT r.node_name
, r.serial_number
, r.register_time
, b.bought_time
, r.node_visible_time_name
FROM register r
LEFT JOIN bought b
ON r.serial_number = b.serial_number
AND r.node_name = b.node_name
AND b.bought_time between r.deploy_time and date(r.deploy_time + INTERVAL '1' DAY)
LIMIT 11;
I've done some searching and found a similar question: Quicksight custom query postgresql functions. In that case, adding INTERVAL '1' DAY caused the problem. I've tried other alternatives but had no luck. Furthermore, running the query without it still outputs the same error message.
No other lines seem to be getting transformed in any other way.
Re-creating the dataset and running the exact same query works.
I think queries that have been run on an existing dataset transform the data. Please let me know if anyone knows why this is so.

Error writing parquet file in s3 with Pyspark

I'm trying to read some tables (Parquet files), do some joins, and write the result in Parquet format to S3, but I'm either getting an error or it takes more than a couple of hours to write the table.
error:
An error was encountered:
Invalid status code '400' from https://.... with error payload: {"msg":"requirement failed: session isn't active."}
I am able to write other tables as Parquet, except for that table.
This is my sample code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.config("spark.sql.catalogImplementation", "in-memory").getOrCreate()
table1 = spark.read.parquet("s3://.../table1")
table1.createOrReplaceTempView("table1")
table2 = spark.read.parquet("s3://.../table2")
table2.createOrReplaceTempView("table2")
table3 = spark.read.parquet("s3://.../table3")
table3.createOrReplaceTempView("table3")
table4 = spark.read.parquet("s3://.../table4")
table4.createOrReplaceTempView("table4")
Final_table = spark.sql("""
select
    a.col1,
    a.col2,
    ...
    d.coln
from
    table1 a
left outer join
    table2 b
on
    cond1
    and cond2
    and cond3
left outer join
    table3 c
on
    ...
""")

Final_table.count()
# 3813731240

output_file = "s3://.../final_table/"
Final_table.write.option("partitionOverwriteMode", "dynamic").mode('overwrite').partitionBy("col1").parquet(output_file)
Just to add more context: I've tried repartitioning, but it didn't work. I've also tried different EMR clusters, such as:
Cluster 1: Master m5.24xlarge
Cluster 2: Master m5.24xlarge, 1 core node m5.24xlarge
Cluster 3: Master m5d.2xlarge, 8 core nodes m5d.2xlarge
EMR release version: 5.29.0
Most Spark jobs can be optimized by visualizing their DAG.
In this scenario, if you are able to run the SQL and get the count quickly, and all of your time is spent on the write, then here are some suggestions:
Since you already know the count of your dataframe, remove the count operation, as it is unnecessary overhead for your job.
You are partitioning your data based on col1, so try repartitioning your data so that the least shuffle is performed at the time of writing.
You can do something like
df.repartition(100, 'col1').write
You can also set the number based on the partition count, if you know it.
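Putting that together with the write from the question (the partition count of 200 is an arbitrary placeholder; ideally set it close to the number of distinct col1 values), the sketch would be:
# Repartition by the partition column first so each Parquet partition is
# written with minimal shuffle; 200 is a placeholder, tune it to your data.
Final_table \
    .repartition(200, "col1") \
    .write \
    .option("partitionOverwriteMode", "dynamic") \
    .mode("overwrite") \
    .partitionBy("col1") \
    .parquet(output_file)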