I've got an AWS Glue job that reads data from 22 MySQL tables, transforms the data using queries, and fills 8 MySQL tables in a different schema.
These are fairly simple queries - a few joins that execute in at most a few seconds.
All tables combined hold around 1.5 million records. I run this job incrementally every 4 hours; the number of records inserted each run is between 100 and 5000. Here is an example of one part of the script, containing the query:
# SOURCE TABLES
item_node1647201451763 = glueContext.create_dynamic_frame.from_catalog(
database="db-" + args['ENVIRONMENT'],
table_name="item",
transformation_ctx="chopdb2item_node1647201451763",
)
inventory_node1647206574114 = glueContext.create_dynamic_frame.from_catalog(
database="bi-db-" + args['BIDB_ENVIRONMENT'],
table_name="inventory"
)
# [...OTHER SOURCE TABLE DECLARATIONS HERE...]
# SQL
SqlQuery12 = """
select distinct r.id receipt_id,
i.total value_total,
i.qty quantity,
d2.id receipt_upload_date_id,
i.text item_name,
i.brand item_brand,
i.gtin item_gtin,
i.ioi,
cp.client_promotion_id,
inv.inventory_id
from receipt r
join item i on i.receipt_id = r.id
join verdict v on r.verdict_id = v.id and v.awardable = 1
join account a on v.account_id = a.id
join offer o on v.offer_id = o.id
join date_dimension d2
on d2.occurence_date = DATE(r.upload_time)
left join client_promotion cp
on cp.client_key = a.client_key and
cp.promotion_key = o.promotion_key
left join inventory inv on inv.inventory_gtin = i.gtin
"""
extractitemfacttabledata_node1647205714873 = sparkSqlQuery(
glueContext,
query=SqlQuery12,
mapping={
"item": item_node1647201451763,
"receipt": receipt_node1647477767036,
"verdict": verdict_without_bookmark,
"account": account_without_bookmark,
"offer": offer_without_bookmark,
"date_dimension": date_dimension_node1649721691167,
"client_promotion": client_promotion_node1647198052897,
"inventory": inventory_node1647206574114,
},
transformation_ctx="extractitemfacttabledata_node1647205714873",
)
# WRITING BACK TO MYSQL DATABASE
itemfacttableinwarehouse_node1647210647655 = glueContext.write_from_options(
frame_or_dfc=extractitemfacttabledata_node1647205714873,
connection_type="mysql",
connection_options={
"url": f"{args['DB_HOST_CM']}:{args['DB_PORT_CM']}/chopbidb?serverTimezone=UTC&useSSL=false",
"user": args['DB_USERNAME_CM'],
"password": args['DB_PASSWORD_CM'],
"dbtable": "bidb.item", # THIS IS DIFFERENT SCHEMA THAN THE SOURCE ITEM TABLE
"bulksize": 1
}
)
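(For context, sparkSqlQuery is the helper that AWS Glue Studio generates for SQL transform nodes. Roughly, it registers each mapped DynamicFrame as a temp view and runs the query through Spark SQL; the exact generated code may differ, but it looks something like this:)
from awsglue.dynamicframe import DynamicFrame

def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
    # Register every mapped DynamicFrame as a temp view, run the SQL, wrap the result back up.
    # 'spark' is the SparkSession created earlier in the generated script (glueContext.spark_session).
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)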
The structure of the file is:
ALL source table declarations
ALL transformations (SQL queries)
ALL writes (inserting back into the MySQL database)
My problem is that, from time to time, the job hangs and runs indefinitely. I set the timeout to 8 hours, so it stays in progress for 8 hours and is then cancelled.
I set up the Apache Spark history UI and tried to analyze the logs.
The results are as follows:
90% of the time it is the query quoted above that hangs, but sometimes it succeeds and the next one fails. It looks like it might fail while trying to insert the data.
It usually happens when there is very little load on the system (in the middle of the night).
What I tried:
In the first version of the script I inserted the data using glueContext.write_dynamic_frame.from_catalog. I thought the issue might be bulk inserts causing deadlocks in the database, but switching to write_from_options with bulksize = 1 did not help.
Increasing the resources on the RDS instance did not help either.
Moving the insert into this particular table further down in the script usually also resulted in its failing.
Related
We have a Glue job running a script in which we have created two dataframes and registered them as temp views using the createOrReplaceTempView command. The dataframes are then combined via a union:
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
joined_df = spark.sql(
"""
Select
id
, name
, product
, year
, month
, day
from df1
Union ALL
Select
id
, name
, product
, year
, month
, day
from df2
"""
)
joined_df.createOrReplaceTempView("joined_df")
Everything appeared to be working fine until it failed. Based on my research, I suspect it is because one of the dataframes is empty. This job runs daily, and occasionally one of the dataframes (df2) will not have any data for the day.
The error message returned in the CloudWatch log is not entirely clear:
pyspark.sql.utils.AnalysisException: u"cannot resolve '`id`' given input columns:
[]; line 22 pos 7;\n'Union\n:- Project [contract_id#0, agr_type_cd#1, ind_product_cd#2,
report_dt#3, effective_dt#4, agr_trns_cd#5, agreement_cnt#6, dth_bnft#7, admn_system_cd#8,
ind_lob_cd#9, cca_agreement_cd#10, year#11, month#12, day#13]\n: +- SubqueryAlias `df1`\n
'Project ['id, name#20,
'product, 'year, 'month, 'day]\n +- SubqueryAlias `df2`\n +- LogicalRDD false\n"
I'm seeking advice on how to resolve such an issue. Thank you.
UPDATE:
I'm fairly certain I know the issue, but I'm unsure how to solve it. On days when the source file in S3 is "empty" (no records, only the header row), the read from the catalog returns an empty set:
df2 = glueContext.create_dynamic_frame.from_catalog(
database = glue_database,
table_name = df2_table,
push_down_predicate = pred
)
df2 = df2.toDF()
df2.show()
The output:
++
||
++
++
Essentially, the from_catalog method is not reading the schema from Glue. I would expect that even without data the header would be detected, and the union would simply return everything from df1. But since I'm receiving an empty set without a header, the union cannot occur, because it acts as though the schema has changed or is non-existent, even though the underlying S3 files have not changed schema. This job triggers daily, when the source files have been loaded to S3. When there is data in the file, the union does not fail, since the schema can be detected from_catalog. Is there a way to read the header even when no data is returned?
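One workaround I'm considering is to detect the empty read and substitute an empty DataFrame built with the expected schema, so the union can still resolve the columns. This is only a sketch; the column names and string types below are assumptions based on the query above:
from pyspark.sql.types import StructType, StructField, StringType

# Expected columns for df2 (assumed from the union query; the real types may differ)
expected_schema = StructType([
    StructField(name, StringType(), True)
    for name in ["id", "name", "product", "year", "month", "day"]
])

# Same variables as in the snippet above; 'spark' is the SparkSession (glueContext.spark_session)
df2 = glueContext.create_dynamic_frame.from_catalog(
    database=glue_database,
    table_name=df2_table,
    push_down_predicate=pred
).toDF()

# When the source file only contains a header, the DynamicFrame comes back with
# no columns at all; substitute an empty DataFrame with the expected schema so
# the union with df1 still resolves.
if len(df2.columns) == 0:
    df2 = spark.createDataFrame([], expected_schema)

df2.createOrReplaceTempView("df2")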
I'm trying to create and query an Athena table based on data located in S3, and it seems there are some timing issues.
How can I know when all the partitions have been loaded to the table?
The following code returns an empty result -
athena_client.start_query_execution(QueryString=app_query_create_table,
ResultConfiguration={'OutputLocation': output_location})
athena_client.start_query_execution(QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`"
.format(athena_db=athena_db, athena_db_partition=athena_db_partition),
ResultConfiguration={'OutputLocation': output_location})
result = query.format(athena_db_partition=athena_db_partition, delta=delta, dt=dt)
But when I add some delay, it works great -
athena_client.start_query_execution(QueryString=app_query_create_table,
ResultConfiguration={'OutputLocation': output_location})
athena_client.start_query_execution(QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`"
.format(athena_db=athena_db, athena_db_partition=athena_db_partition),
ResultConfiguration={'OutputLocation': output_location})
time.sleep(3)
result = query.format(athena_db_partition=athena_db_partition, delta=delta, dt=dt)
The following is the query for creating the table -
query_create_table = '''
CREATE EXTERNAL TABLE `{athena_db}`.`{athena_db_partition}` (
`time` string,
`user_advertiser_id` string,
`predictions` float
) PARTITIONED BY (
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://{bucket}/path/'
'''
app_query_create_table = query_create_table.format(bucket=bucket,
athena_db=athena_db,
athena_db_partition=athena_db_partition)
I would love to get some help.
The start_query_execution call only starts the query; it does not wait for it to complete. You must call get_query_execution periodically until the status of the execution is SUCCEEDED (or FAILED).
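For example, something along these lines (just a sketch, reusing the variable names from your snippets):
import time
import boto3

athena_client = boto3.client('athena')

def wait_for_query(execution_id, poll_seconds=1):
    # Poll Athena until the query reaches a terminal state.
    while True:
        response = athena_client.get_query_execution(QueryExecutionId=execution_id)
        state = response['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return state
        time.sleep(poll_seconds)

# Wait for the MSCK REPAIR TABLE execution to finish before querying the partitions
repair = athena_client.start_query_execution(
    QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`".format(
        athena_db=athena_db, athena_db_partition=athena_db_partition),
    ResultConfiguration={'OutputLocation': output_location}
)
wait_for_query(repair['QueryExecutionId'])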
Not related to your problem per se, but if you create a table with CREATE TABLE … AS, there is no need to add partitions with MSCK REPAIR TABLE … afterwards: the table is created with all the partitions produced by the query, so there will be no new partitions to discover right after it has been created that way.
Also, in general, avoid MSCK REPAIR TABLE; it is slow and inefficient. There are many better ways to add partitions to a table, see https://athena.guide/articles/five-ways-to-add-partitions/
I have one large dataframe that is created by importing a CSV file (Spark CSV). This dataframe has many rows of daily data, identified by date, region, service_offered and count.
I filter by region and service_offered, aggregate the count (sum) and roll that up to month. Each time the filter is run in the loop, it selects a region, then a service_offered, and aggregates that.
If I append that to the dataframe over and over, the big-O cost kicks in and it becomes very slow. There are 360 offices and about 5-10 services per office. How do I save each select/filter to a list first and append those before making the final dataframe?
I saw this post, Using pandas .append within for loop, but it only shows two lists, list.a and list.b. What about 360 lists?
Here is my code that loops over and aggregates the data:
# Spark session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Spark schema
schema = StructType([
    StructField('office', StringType(), True),
    StructField('service', StringType(), True),
    StructField('date', StringType(), True),
    StructField('count', IntegerType(), True)
])

# Empty dataframe
office_summary = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
office_summary.printSchema()

x = 1
try:
    for office in office_lookup:
        office = office[0]
        print(office_locations_count - x, " office(s) left")
        x = x + 1
        for transaction in service_lookup:
            transaction = transaction[0]
            monthly_counts = source_data.filter((col("office").rlike(office)) & (col("service").rlike(transaction))).groupby("office", "service", "date").sum()
            #office_summary = office_summary.unionAll(monthly_counts)
except Exception as e:
    print(e)
I know there are issues with rlike returning more results than expected, but that is not a problem with the current data. The first 30% of the process is very quick, and then it starts to slow down, as expected.
How do I save a filter result to a list, append or join those over and over, and finally create the final dataframe? This code does finish, but it should not take 30 minutes to run.
Thanks!
With help from @Andrew, I was able to keep using PySpark dataframes to accomplish this. The command, scrubbed, looks like this:
df2 = df1.groupby("Column1", "Column2", "Column3").agg(sum('COUNT'))
This allowed me to create a new dataframe based off df1, where the grouping and aggregation are satisfied on one line. This command takes about 0.5 seconds to execute, compared to the approach in the initial post.
The constant creation of new dataframes while retaining the old ones in memory was the problem. This is the most efficient way to do it.
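Applied to the original schema, the whole nested loop collapses into a single aggregation over the source data. A rough sketch using the column names from the question (note that sum must be PySpark's sum, not the Python builtin):
from pyspark.sql.functions import sum as sum_

# One pass over the data: group by office, service and date and sum the counts,
# instead of filtering per office/service inside a loop and unioning the results.
office_summary = (
    source_data
    .groupby("office", "service", "date")
    .agg(sum_("count").alias("count"))
)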
I'm trying to read some tables (Parquet files), do some joins, and write the result in Parquet format to S3, but I'm either getting an error or it takes more than a couple of hours to write the table.
Error:
An error was encountered:
Invalid status code '400' from https://.... with error payload: {"msg":"requirement failed: session isn't active."}
I am able to write the other tables as Parquet; it is only this table that fails.
This is my sample code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.config("spark.sql.catalogImplementation", "in-memory").getOrCreate()
table1 = spark.read.parquet("s3://.../table1")
table1.createOrReplaceTempView("table1")
table2 = spark.read.parquet("s3://.../table2")
table2.createOrReplaceTempView("table2")
table3 = spark.read.parquet("s3://.../table3")
table3.createOrReplaceTempView("table3")
table4 = spark.read.parquet("s3://.../table4")
table4.createOrReplaceTempView("table4")
Final_table = spark.sql("""
select
    a.col1,
    a.col2,
    ...
    d.coln
from
    table1 a
left outer join
    table2 b
on
    cond1
    and cond2
    and cond3
left outer join
    table3 c
on
    ...
""")
Final_table.count()
# 3813731240
output_file = "s3://.../final_table/"
Final_table.write.option("partitionOverwriteMode", "dynamic").mode('overwrite').partitionBy("col1").parquet(output_file)
Just to add more: I've tried repartitioning, but it didn't help. I've also tried different EMR clusters, such as:
Cluster 1: master m5.24xlarge
Cluster 2: master m5.24xlarge, 1 core node m5.24xlarge
Cluster 3: master m5d.2xlarge, 8 core nodes m5d.2xlarge
EMR release version: 5.29.0
Most Spark jobs can be optimized by visualizing their DAG.
In this scenario, if you can run the SQL and get the count quickly, and all the time is consumed by the write, here are some suggestions:
Since you already know the count of your dataframe, remove the count operation; it is unnecessary overhead for your job.
You are partitioning your output by col1, so try repartitioning the dataframe so that as little shuffling as possible happens at write time.
You can do something like:
df.repartition(100, 'col1').write
You can also set the number of partitions based on the partition count, if you know it.
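Put together with the write from the question, that would look roughly like this (just a sketch; 100 is a placeholder to tune to your data volume and cluster size):
# Repartition by the write-partition column before writing, so each col1 value
# is produced by a limited number of tasks and the shuffle happens only once.
(Final_table
    .repartition(100, "col1")
    .write
    .option("partitionOverwriteMode", "dynamic")
    .mode("overwrite")
    .partitionBy("col1")
    .parquet(output_file))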
I have an S3 bucket which is constantly being filled with new data, and I am using Athena and Glue to query that data. The problem is that if Glue doesn't know a new partition has been created, it doesn't know it needs to search there. Making an API call to run the Glue crawler each time I need a new partition is too expensive, so the best solution is to tell Glue that a new partition has been added, i.e. to create the new partition in the table's metadata. I looked through the AWS documentation but had no luck. I am using Java with AWS. Any help?
You may want to use the batch_create_partition() Glue API to register new partitions. It doesn't require any expensive operation like MSCK REPAIR TABLE or re-crawling.
I had a similar use case, for which I wrote a Python script that does the following.
Step 1 - Fetch the table information and parse from it the details required to register the partitions.
# Fetching table information from the Glue catalog
logger.info("Fetching table info for {}.{}".format(l_database, l_table))
try:
    response = l_client.get_table(
        CatalogId=l_catalog_id,
        DatabaseName=l_database,
        Name=l_table
    )
except Exception as error:
    logger.error("Exception while fetching table info for {}.{} - {}"
                 .format(l_database, l_table, error))
    sys.exit(-1)

# Parsing the table info required to create partitions
input_format = response['Table']['StorageDescriptor']['InputFormat']
output_format = response['Table']['StorageDescriptor']['OutputFormat']
table_location = response['Table']['StorageDescriptor']['Location']
serde_info = response['Table']['StorageDescriptor']['SerdeInfo']
partition_keys = response['Table']['PartitionKeys']
Step 2 - Generate a list of dictionaries, where each dictionary contains the information needed to create a single partition. All dictionaries have the same structure, but their partition-specific values (year, month, day, hour) change.
from datetime import datetime, timedelta

def generate_partition_input_list(start_date, num_of_days, table_location,
                                  input_format, output_format, serde_info):
    input_list = []  # Initializing empty list
    today = datetime.utcnow().date()
    if start_date > today:  # To handle scenarios where future partitions were created manually
        start_date = today
    end_date = today + timedelta(days=num_of_days)  # End date up to which partitions need to be created
    logger.info("Partitions to be created from {} to {}".format(start_date, end_date))
    for input_date in date_range(start_date, end_date):
        # Formatting partition values by padding required zeroes and converting to strings
        year = str(input_date)[0:4].zfill(4)
        month = str(input_date)[5:7].zfill(2)
        day = str(input_date)[8:10].zfill(2)
        for hour in range(24):  # Looping over 24 hours to generate partition input for each hour of the day
            hour = str('{:02d}'.format(hour))  # Padding zero to make sure the hour has two digits
            part_location = "{}{}/{}/{}/{}/".format(table_location, year, month, day, hour)
            input_dict = {
                'Values': [
                    year, month, day, hour
                ],
                'StorageDescriptor': {
                    'Location': part_location,
                    'InputFormat': input_format,
                    'OutputFormat': output_format,
                    'SerdeInfo': serde_info
                }
            }
            input_list.append(input_dict.copy())
    return input_list
Step 3 - Call the batch_create_partition() API
for each_input in break_list_into_chunks(partition_input_list, 100):
    create_partition_response = client.batch_create_partition(
        CatalogId=catalog_id,
        DatabaseName=l_database,
        TableName=l_table,
        PartitionInputList=each_input
    )
There is a limit of 100 partitions in a single API call, so if you are creating more than 100 partitions you will need to break your list into chunks and iterate over it (see the helper sketch after the documentation link below).
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition
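The date_range and break_list_into_chunks helpers referenced above are not part of any AWS library and are not shown in the snippets; minimal versions might look like this:
from datetime import timedelta

def date_range(start_date, end_date):
    # Yield every date from start_date up to and including end_date.
    current = start_date
    while current <= end_date:
        yield current
        current += timedelta(days=1)

def break_list_into_chunks(input_list, chunk_size):
    # Yield successive slices of at most chunk_size items.
    for i in range(0, len(input_list), chunk_size):
        yield input_list[i:i + chunk_size]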
You can configure your Glue crawler to be triggered every 5 minutes.
You can create a Lambda function which will either run on a schedule or be triggered by an event from your bucket (e.g. a putObject event), and that function can call Athena to discover partitions:
import boto3

athena = boto3.client('athena')

def lambda_handler(event, context):
    athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE mytable",
        ResultConfiguration={
            'OutputLocation': "s3://some-bucket/_athena_results"
        }
    )
Use Athena to add partitions manually. You can also run SQL queries via the API, as in my Lambda example.
Example from Athena manual:
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016'
PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016';
This question is old but I wanted to put it out there that someone could have s3:ObjectCreated:Put notifications trigger a Lambda function which registers new partitions when data arrives on S3. I would even expand this function to handle deprecations based on object deletes and so on. Here's a blog post by AWS which details S3 event notifications: https://aws.amazon.com/blogs/aws/s3-event-notification/
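A rough sketch of such a function, assuming a Hive-style dt=YYYY-MM-DD key layout (the table name, bucket paths and key parsing here are placeholders, not from any of the posts above):
import boto3

athena = boto3.client('athena')

def lambda_handler(event, context):
    # Register the partition for each object that just landed in S3.
    for record in event['Records']:
        key = record['s3']['object']['key']
        # Pull the dt=... folder out of the key (hypothetical layout: path/to/dt=2016-05-14/file.csv)
        dt = [part for part in key.split('/') if part.startswith('dt=')][0].split('=', 1)[1]
        athena.start_query_execution(
            QueryString=(
                "ALTER TABLE mytable ADD IF NOT EXISTS "
                "PARTITION (dt = '{dt}') "
                "LOCATION 's3://mystorage/path/to/dt={dt}/'".format(dt=dt)
            ),
            ResultConfiguration={'OutputLocation': "s3://some-bucket/_athena_results"}
        )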
AWS Glue recently added a RecrawlPolicy that only crawls the new folders/partitions that you add to your S3 bucket.
https://docs.aws.amazon.com/glue/latest/dg/incremental-crawls.html
This should help you minimize crawling all the data again and again. From what I read, you can define incremental crawls while setting up your crawler, or by editing an existing one. One thing to note, however, is that incremental crawls require the schema of new data to be more or less the same as the existing schema.
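If you create the crawler programmatically, the incremental behaviour is set through the RecrawlPolicy parameter; here is a sketch with placeholder names (and, as far as I remember, incremental crawls also require the schema change policy to be LOG):
import boto3

glue = boto3.client('glue')

# Placeholder crawler that only visits new folders on each run
glue.create_crawler(
    Name='incremental-crawler',
    Role='MyGlueServiceRole',
    DatabaseName='my_database',
    Targets={'S3Targets': [{'Path': 's3://my-bucket/path/'}]},
    RecrawlPolicy={'RecrawlBehavior': 'CRAWL_NEW_FOLDERS_ONLY'},
    SchemaChangePolicy={'UpdateBehavior': 'LOG', 'DeleteBehavior': 'LOG'}
)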