com.amazonaws.services.gluejobexecutor.model.VersionMismatchException - amazon-web-services

Exactly like in this AWS forum question, I was running 2 Jobs concurrently. The Job was configured with Max concurrency: 10, but when executing job.commit() I received this error message:
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.amazonaws.services.glue.util.Job.commit.
: com.amazonaws.services.gluejobexecutor.model.VersionMismatchException:
Continuation update failed due to version mismatch. Expected version 6 but found version 7
(Service: AWSGlueJobExecutor; Status Code: 400; Error Code: VersionMismatchException; Request ID: 123)
The two Jobs read different portions of data.
But I can't understand what the problem is here or how to deal with it. Can anyone help?

Reporting #bgiannini's answer from this other AWS forum question, it looks like the "version" refers to job bookmarking.
If multiple instances of the same job are running simultaneously (i.e. max concurrency > 1) and using bookmarks, then when job run 1 calls job.init() it gets a version, and job.commit() seems to expect a certain value (the version presumably being incremented by 1 for every job.commit() that is executed). If job run 2 started at the same time, got the same initial version from job.init(), and then submits its job.commit() before job run 1 does, job run 1 no longer sees the version it expected.
I was indeed running the 2 Jobs with Job bookmark: Enable, and when I disabled bookmarking the error went away.
I understand it might not be the best solution, but it can be a good compromise.
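If disabling bookmarks on the job definition itself is too blunt, the bookmark option can also be passed per run. Here is a minimal boto3 sketch; the job name is a placeholder, and you should confirm the bookmark behaviour you actually want before relying on it:
import boto3

glue = boto3.client("glue")

# Start a run of the Glue job with bookmarks disabled for this run only.
# "JobA" is a placeholder job name; replace it with your own.
response = glue.start_job_run(
    JobName="JobA",
    Arguments={
        # Special Glue job parameter: disable job bookmarks for this run.
        "--job-bookmark-option": "job-bookmark-disable",
    },
)
print(response["JobRunId"])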

The default JobName for your bookmark is the glue JOB_NAME, but it doesn't have to be.
Suppose you have a Glue job called JobA which executes concurrently, taking different input parameters. You have two concurrent executions with an input parameter contextName; let's call the values passed into this parameter contextA and contextB.
The default initialisation in your pyspark script is:
Job.init(args['JOB_NAME'], args)
but you can change this to be unique for your execution context. Instead:
Job.init(args['JOB_NAME']+args['contextName'], args)
This is unique for each concurrent execution, so the bookmark versions never clash. When you view the bookmark state from the CLI for this job, you'd need to view it like this:
aws glue get-job-bookmark --job-name "jobAcontextA"
or
aws glue get-job-bookmark --job-name "jobAcontextB"
You wouldn't be able to use the UI to pause or reset the bookmark; you'd need to do it programmatically.
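For reference, here is a minimal sketch of how that initialisation could look in a full Glue script; the contextName parameter is the hypothetical per-execution parameter from the example above:
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# contextName is the hypothetical per-execution parameter from the example above.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "contextName"])

glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)

# Use a bookmark name that is unique per execution context so that
# concurrent runs do not share (and fight over) the same bookmark version.
job.init(args["JOB_NAME"] + args["contextName"], args)

# ... your transformations here ...

job.commit()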

Related

AWS Glue Job using awsglueml.transforms.FindMatches gives timeout error seemingly randomly

I have a Glue ETL Job (using pyspark) that gives a timeout error when trying to access the awsglueml.transforms.FindMatches library seemingly randomly. The error given on the glue dashboard is:
An error occurred while calling z:com.amazonaws.services.glue.ml.FindMatches.apply. The target server failed to respond
Basically, if I try to run this Glue ETL job late at night, it succeeds most of the time. But if I try to run it in the middle of the day, it fails with this error. Sometimes just retrying it enough times makes it succeed, but this doesn't seem like a good solution. It seems like the issue is the AWS FindMatches library not having enough bandwidth to support everyone who wants to use it, but I could be wrong here.
The Glue ETL job was set up using the option "A proposed script generated by AWS Glue".
The line of code that this is timing out on is a line that was provided by Glue when I created this job:
from awsglueml.transforms import FindMatches
...
findmatches2 = FindMatches.apply(frame = datasource0, transformId = "<redacted>", computeMatchConfidenceScores = True, transformation_ctx = "findmatches2")
Welcoming any information on this elusive issue.

disable macro which invokes information_schema.table in Athena

I am new to dbt. Currently I am trying to access an S3 bucket which has Parquet files via Glue and Athena. I have the configuration set up as per the dbt documentation; however, after running the dbt run command it tells me how many models I am running and how many tasks there are, so up to this point it is good. But it looks like after that it hangs, and after some time it times out. While checking dbt.log I found a query like the one below that runs for quite a long time and eventually times out. I am not sure why it is running or which configuration I have to check. I suspect it is coming from a macro, but there is no macro of mine that runs the query below. Please let me know if you have any pointers. Thank you.
This query runs by default after the dbt run command, and I am not sure where it comes from:
select table_catalog,table_schema,
case when table_type='BASE_TABLE' then 'table'
when table_type='VIEW' then 'view'
end as table_type
from information_schema.table
where regexp_like(table_schema,'(?i)\A\A')

How to quickly find which Activity in AWS Step Function State Machine fails with CLI?

I checked this guide, AWS CLI For Step Functions, but it only describes whether a State Machine execution passes or not; there is no way to know which exact Activity fails and which exact Activity passes. Is there a quick way to find that out?
The Step Functions UI visually shows which exact Activity fails, but the CLI does not.
You asked which exact Activity fails, but based on your question I think you are looking to find which state in your state machine fails.
In the AWS console, Step Functions visually shows which exact state fails based on the data from the execution history.
To do this from the CLI you can use the get-execution-history command like this:
aws stepfunctions get-execution-history --execution-arn <execution-arn> --reverse-order --max-items 2
--reverse-order lists events in descending order of their timestamp, which is useful because the fail events are the last events in the execution history.
--max-items is used because the last event is the ExecutionFailed event and the event before that is the failing state's event. You can increase it to see more events.
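If you want to script this rather than read the raw CLI output, a minimal boto3 sketch along the same lines might look like this (the execution ARN is a placeholder, and the exact event types you care about depend on your state machine):
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARN; replace with the ARN of the execution you are inspecting.
execution_arn = "arn:aws:states:us-east-1:123456789012:execution:MyStateMachine:my-run"

# Fetch the most recent events first, mirroring --reverse-order / --max-items.
history = sfn.get_execution_history(
    executionArn=execution_arn,
    reverseOrder=True,
    maxResults=10,
)

# Print any event whose type mentions a failure (e.g. ExecutionFailed, TaskFailed).
for event in history["events"]:
    if "Failed" in event["type"]:
        print(event["id"], event["type"])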

Dataflow Pipeline - “Processing stuck in step <STEP_NAME> for at least <TIME> without outputting or completing in state finish…”

Since I'm not allowed to ask my question in the same thread where another person has the same problem (but isn't using a template), I'm creating this new thread.
The problem: I'm creating a Dataflow job from a template in GCP to ingest data from Pub/Sub into BQ. This works fine until the job executes. The job gets "stuck" and does not write anything to BQ.
I can't do much because I can't choose the Beam version in the template. This is the error:
Processing stuck in step WriteSuccessfulRecords/StreamingInserts/StreamingWriteTables/StreamingWrite for at least 01h00m00s without outputting or completing in state finish
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:803)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:867)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:140)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:112)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn$DoFnInvoker.invokeFinishBundle(Unknown Source)
Any ideas how to get this to work?
The issue is coming from the step WriteSuccessfulRecords/StreamingInserts/StreamingWriteTables/StreamingWrite, which suggests a problem while writing the data.
Your error can be replicated (using either the Pub/Sub Subscription to BigQuery or the Pub/Sub Topic to BigQuery template) by:
1. Configuring the template with a table that doesn't exist.
2. Starting the template with a correct table and deleting it during the job execution.
In both cases the "stuck" message appears because the data is being read from Pub/Sub, but the pipeline is waiting for the table to become available before inserting the data. The error is reported every 5 minutes and it gets resolved once the table is created.
To verify the table configured in your template, see the property outputTableSpec in the PipelineOptions in the Dataflow UI.
I had the same issue before. The problem was that I used NestedValueProviders to evaluate the Pub/Sub topic/subscription, and this is not supported in the case of templated pipelines.
I was getting the same error, and the reason was that I had created an empty BigQuery table without specifying a schema. Make sure to create the BQ table with a schema before you load data via Dataflow.
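For example, here is a minimal sketch of creating the table with an explicit schema using the google-cloud-bigquery client; the table ID and fields are placeholders for whatever your pipeline actually writes:
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table ID and schema; use the table and columns your pipeline writes.
table_id = "my-project.my_dataset.my_table"
schema = [
    bigquery.SchemaField("event_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("payload", "STRING"),
    bigquery.SchemaField("published_at", "TIMESTAMP"),
]

# Create the table (with its schema) before starting the Dataflow job,
# so the streaming inserts have somewhere to land.
table = bigquery.Table(table_id, schema=schema)
table = client.create_table(table)
print(f"Created {table.full_table_id}")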

Is there an api to send notifications based on job outputs?

I know there are APIs to configure notifications when a job fails or finishes.
But what if, say, I run a Hive query that counts the number of rows in a table, and if the returned result is zero I want to send out emails to the concerned parties? How can I do that?
Thanks.
You may want to look at Airflow and Qubole's operator for Airflow. We use Airflow to orchestrate all jobs being run using Qubole, and in some cases non-Qubole environments. We use the DataDog API to report the success/failure of each task (Qubole or non-Qubole). DataDog in this case can be replaced by Airflow's email operator. Airflow also has chat operators (like Slack).
There is no direct api for triggering notification based on results of a query.
However there is a way to do this using Qubole:
- Create a workflow in Qubole with the following steps:
1. Your query (any query) that writes its output to a particular location on S3.
2. A shell script - this script reads the result from S3 and fails the job based on any criteria. For instance, in your case, fail the job if the result has 0 rows (see the sketch at the end of this answer).
- Schedule this workflow using the "Scheduler" API to notify on failure.
You can also use the "Sendmail" shell command to send mail based on the results from step 2 above.
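As a rough illustration of step 2, here is a minimal Python sketch that could be invoked from the shell script step; the bucket, key, and the idea that the query writes its row count as plain text are all assumptions about your setup:
import sys

import boto3

# Placeholder locations; point these at wherever step 1 writes its output.
BUCKET = "my-bucket"
KEY = "query-results/row_count.txt"

s3 = boto3.client("s3")
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode("utf-8")

# Assume the query output is a single number (the row count).
row_count = int(body.strip() or 0)

if row_count == 0:
    # A non-zero exit code fails this workflow step, which in turn
    # triggers the scheduler's on-failure notification.
    print("Row count is zero, failing the job to trigger notification.")
    sys.exit(1)

print(f"Row count is {row_count}, nothing to report.")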