Triggering an Airflow DAG once a message arrives at an AWS SQS queue - amazon-web-services

Is it possible to schedule a DAG run when a message arrives at the SQS queue? I also need the DAG to process the message in the queue. From what I know, this could be done using the SQSSensor, but I couldn't find any example and I am confused about how to move forward.

Airflow runs DAGs on a fixed interval, while you're now looking to trigger DAGs per event. You'll have to do this outside of Airflow, e.g. using a Lambda trigger listening on the queue, which triggers an Airflow DAG via the REST API.
The SQSSensor in Airflow won't allow for event-by-event processing because it only polls the queue after a DAG run starts (checking for new messages, pushing them to an XCom with key "messages", and deleting the messages if found). So if your DAG is scheduled to run once a day, an SQSSensor would only start polling for new messages once a day.
I can't find an SQSOperator in Airflow for reading SQS messages, so my best guess for an event-triggered SQS + Airflow workflow is to set up a Lambda that triggers the Airflow DAG via the REST API. The DAG itself starts with an SQSSensor that reads all messages on the queue, and the tasks after it read and process the values from the XCom created by the SQSSensor task. The schedule_interval of the DAG can be set to None since it will be triggered via the REST API.
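As a rough illustration of the Lambda side, here is a minimal sketch that uses only the Python standard library. It assumes Airflow 2.x with the stable REST API and basic authentication enabled; the base URL, DAG id, and environment variable names are hypothetical placeholders. Note that it differs from the suggestion above in one respect: a Lambda SQS trigger removes the messages it processes successfully, so the sketch forwards the message bodies in the DAG run's conf instead of leaving them on the queue for an SQSSensor.

```python
import base64
import json
import os
import urllib.request

# Hypothetical values: replace with your own deployment's settings.
AIRFLOW_BASE_URL = "https://airflow.example.com"
DAG_ID = "process_sqs_messages"


def build_dag_run_request(base_url, dag_id, conf, username, password):
    """Build a POST request that creates a DAG run via Airflow's stable REST API."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps({"conf": conf}).encode(),
        headers={
            "Content-Type": "application/json",
            # Assumes the basic-auth API backend is enabled in Airflow.
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def lambda_handler(event, context):
    """Entry point for a Lambda function that has the SQS queue as its trigger."""
    # Each SQS record delivered to Lambda carries the message text in "body".
    bodies = [record["body"] for record in event.get("Records", [])]
    request = build_dag_run_request(
        AIRFLOW_BASE_URL,
        DAG_ID,
        {"messages": bodies},
        os.environ["AIRFLOW_USER"],      # hypothetical variable name
        os.environ["AIRFLOW_PASSWORD"],  # hypothetical variable name
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return {"status": response.status}
```

Inside the DAG, tasks could then read the forwarded bodies from dag_run.conf["messages"].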

Related

Pub/Sub messages from snapshot not processed in a Dataflow streaming pipeline

We have a Dataflow job consuming from Pub/Sub and writing to BigQuery in streaming mode. Due to a permissions issue, the pipeline got stuck and the messages were not consumed. We restarted the pipeline, saved the unacked messages in a snapshot, and replayed the messages, but they are discarded.
We fixed the problem and redeployed the pipeline with a new subscription to the topic, and all the events are consumed in streaming without a problem.
For all the unacked messages accumulated (20M) in the first subscription, we created a snapshot
This snapshot was then connected to the new subscription via the UI using Replay messages dialog
In the metrics dashboard we see that the unacked messages spike to 20M and then they get consumed
[Image: subscription spike]
But then the events are not sent to BigQuery. Checking the Dataflow job metrics, we can see a spike in the Duplicate message count within the read from Pub/Sub step.
[Image: Dataflow Duplicate counter]
The messages are < 3 days old. Does anybody know why this happens? Thanks in advance.
The pipeline is using Apache Beam SDK 2.39.0 and Python 3.9 with Streaming Engine and Runner v2 enabled.
How long does it take for a Pub/Sub message to process, is it a long process?
In that case, Pub/Sub may redeliver messages, according to subscription configuration/delays. See Subscription retry policy.
Dataflow can work around that, as it acknowledges messages from the source after a successful shuffle. If you add a GroupByKey (or, artificially, a Reshuffle) transform, it may resolve the source duplications.
More information at https://beam.apache.org/contribute/ptransform-style-guide/#performance

Long-running Cloud Run process and Pub/Sub message retry

I have a Cloud Run service that can run for up to 60 minutes. Pub/Sub is the trigger for executing the Cloud Run service.
The Pub/Sub configuration for the retry policy is set to the maximum (600s).
When a message is published, Cloud Run starts executing. The complete execution takes around 60 minutes, but after 600s Pub/Sub retries the message because it hasn't received an acknowledgement from Cloud Run, which causes the Cloud Run service to execute again and again.
How can I handle the Pub/Sub retry so that Cloud Run does not execute again and again because of the retries?
I was thinking of using Cloud Tasks or Cloud Workflows as a proxy for your long-running Cloud Run service. Unfortunately, both services have a max timeout of 1800s (30 minutes). By the way, the upcoming callback feature of Cloud Workflows will have a 12h timeout. In the meantime, I would create a proxy: a Cloud Function triggered by the Pub/Sub message, which is acknowledged immediately. The function calls your Cloud Run service asynchronously with the Pub/Sub message and returns right away.
With push subscriptions, such as what you'd use with a Cloud Run service, the maximum ack deadline for a message is indeed 600s. If using pull, one can call ModifyAckDeadline to extend the deadline for a message. In fact, the client libraries for Cloud Pub/Sub do this automatically for up to a configured amount of time (default is 60m).
There is not going to be a way to extend the deadline if using a push subscription. Therefore, your options are:
1. Switch to a pull subscription. You could potentially do this via Cloud Run, though it would not be the best fit. More likely, you want to spin up a job in an environment that can keep it running without any kind of trigger, e.g., GKE. If you switch to pull, you can extend the ack deadline, though note that duplicates are still possible, even if the ack deadline has not expired or the message has already been acknowledged. They should be rare, but you still have to account for it.
2. When you receive the message, persist it somewhere, either on disk or in a database, and then acknowledge the message once persisted. Once you are actually done processing the message an hour later, you remove it from this persistent storage. Of course, you could just persist the message instead of publishing it via Pub/Sub and rely on the persistence layer's notification mechanisms to learn of the new message. For example, if you write to GCS, you could use Cloud Storage notifications via Pub/Sub. In this case, you probably want to have some periodic read from your storage to see if there are any messages that have not been processed for some period of time and, if so, reprocess them. For example, if you write with the message the time at which processing started, and more than some amount of time has passed since then and the message is still present, you could start the processing over again.
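The persist-then-acknowledge option can be sketched with a small SQLite-backed store. This is only an illustration under assumptions: the table layout, the function names, and the use of SQLite are all hypothetical, and a real service would more likely use a managed database. The message is stored before it is acknowledged, removed once processing finishes, and a periodic check finds entries whose processing started too long ago.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the store of messages that are still being processed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending ("
        "message_id TEXT PRIMARY KEY, "
        "payload TEXT NOT NULL, "
        "started_at REAL NOT NULL)"
    )
    return conn


def persist(conn, message_id, payload, now=None):
    """Store a message before acknowledging it.

    Returns True if the message is new and False if it was already stored,
    which makes a redelivered duplicate harmless.
    """
    now = time.time() if now is None else now
    cursor = conn.execute(
        "INSERT OR IGNORE INTO pending VALUES (?, ?, ?)",
        (message_id, payload, now),
    )
    conn.commit()
    return cursor.rowcount == 1


def complete(conn, message_id):
    """Remove a message once its long-running processing has finished."""
    conn.execute("DELETE FROM pending WHERE message_id = ?", (message_id,))
    conn.commit()


def find_stale(conn, max_age_seconds, now=None):
    """Return (message_id, payload) rows whose processing started too long ago."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT message_id, payload FROM pending "
        "WHERE started_at < ? ORDER BY started_at",
        (now - max_age_seconds,),
    )
    return rows.fetchall()
```

The handler would call persist and return a success response right away so that Pub/Sub treats the message as acknowledged, while a scheduled job would call find_stale and restart the processing for anything it returns.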

Triggering a cron job on AWS manually by sending a message from SQS

I set up a cron job on Elastic Beanstalk using a cron.yaml file, and SQS runs my tasks periodically. Is there a way to trigger a cron job manually through SQS so that, for tasks that run infrequently, I can easily test the results without waiting for the schedule itself? I tried to send a message to the SQS queue attached to the EB instance, but I can't set the HTTP headers required for the cron job.

Failed cron job handling with Elastic Beanstalk and SQS

I have two Elastic Beanstalk environments.
One is the 'primary' web server environment and the other is a worker environment that handles cron jobs.
I have 12 cron jobs, set up via a cron.yaml file, that all point at API endpoints on the primary web server.
Previously my cron jobs were all running on the web server environment, but of course this created duplicate cron jobs when the environment scaled up.
My new implementation works nicely, but when a cron job fails to run as expected, it repeats, generally within a minute or so.
I would rather avoid this behaviour and just attempt to run the cron job again at the next scheduled interval.
Is there a way to configure the worker environment/SQS so that failed jobs do not repeat?
Simply configure a CloudWatch event to take over your cron, and have it create an SQS message (either directly or via a Lambda function).
Your workers will now just have to handle SQS jobs and if needed, you will be able to scale the workers as well.
http://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html
Yes, you can set the Max retries parameter in the Elastic Beanstalk environment and the Maximum Receives parameter in the SQS queue to 1. This will ensure that the message is executed once, and if it fails, it will get sent to the dead letter queue.
With this approach, your environment may turn yellow if there are any failed jobs, because the messages end up in the dead letter queue. You can simply observe and ignore this, but it may be annoying if you want all environments to be green. You can set the Message Retention Period parameter for the dead letter queue to something short so that the messages go away sooner, though.
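For the SQS side of this, the Maximum Receives setting corresponds to maxReceiveCount in the queue's RedrivePolicy attribute. The sketch below builds that attribute and applies it with boto3's set_queue_attributes; the function names and the ARN and URL arguments are hypothetical, and the Max retries value of the Elastic Beanstalk environment still has to be set separately in the environment configuration.

```python
import json


def build_redrive_attributes(dead_letter_queue_arn, max_receive_count=1):
    """Build the queue attributes that move a message to the DLQ after N receives."""
    return {
        "RedrivePolicy": json.dumps(
            {
                "deadLetterTargetArn": dead_letter_queue_arn,
                "maxReceiveCount": str(max_receive_count),
            }
        )
    }


def apply_redrive_policy(queue_url, dead_letter_queue_arn):
    """Apply the policy to the worker queue (requires boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the builder above stays stdlib-only

    sqs = boto3.client("sqs")
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes=build_redrive_attributes(dead_letter_queue_arn),
    )
```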
An alternative approach, if you're interested, is to return a status 200 OK in your code regardless of how the job ran. This will ensure that the SQS daemon deletes the message in the queue, so that it won't get picked up again.
Of course, the downside is that you would have to modify your code, but I can see how this would make sense if you don't care about the result.
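A minimal sketch of the "always return 200" idea, written as a plain WSGI application so it does not depend on any framework. The run_job function is a hypothetical placeholder for the real work; the sketch assumes the worker daemon passes the task name from cron.yaml in the X-Aws-Sqsd-Taskname header. Failures are logged instead of being reported through the status code.

```python
import logging

logger = logging.getLogger(__name__)


def run_job(task_name):
    """Hypothetical placeholder: replace with the real job logic."""
    raise NotImplementedError(task_name)


def make_app(job=run_job):
    """Create a WSGI app that runs the job and answers 200 OK even on failure."""

    def app(environ, start_response):
        # WSGI exposes the X-Aws-Sqsd-Taskname header under this key.
        task_name = environ.get("HTTP_X_AWS_SQSD_TASKNAME", "unknown")
        try:
            job(task_name)
        except Exception:
            # Log the failure, but do not let it change the response status.
            logger.exception("Job %s failed; not asking for a retry", task_name)
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok"]

    return app
```

Because the daemon always sees a success response, the message is deleted and the job is only attempted again at its next scheduled time.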
Here's a link to AWS documentation that explains all of the parameters.

Running periodic tasks on Elastic Beanstalk workers with a FIFO queue

I was trying to set up a periodic task (using cron.yaml) in an EB worker environment that uses a FIFO SQS queue. When the cron job tries to submit a job to SQS, it fails because the message does not have a message group ID, which is required for a FIFO queue.
Is there a way around this (apart from using some other scheduling mechanism or using a standard queue)?
scheduler: dropping leader, due to failed to send message for job
'italian-job', because: The request must contain the parameter
MessageGroupId. (Aws::SQS::Errors::MissingParameter)
Update: As a workaround, I created a CloudWatch trigger to start a Lambda which sends messages to the SQS queue.
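A Lambda along those lines might look like the sketch below. The queue URL, the message body format, the group ID, and the "job" key read from the event are all hypothetical. The key point is that send_message is called with a MessageGroupId, plus a MessageDeduplicationId, which a FIFO queue also requires unless content-based deduplication is enabled on it.

```python
import json
import uuid

# Hypothetical queue URL: replace with the worker environment's FIFO queue.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs.fifo"


def build_fifo_message(queue_url, job_name, group_id="periodic-tasks"):
    """Build send_message arguments that satisfy a FIFO queue's requirements."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps({"job": job_name}),
        # Required for FIFO queues; messages in one group are delivered in order.
        "MessageGroupId": group_id,
        # Needed unless content-based deduplication is enabled on the queue.
        "MessageDeduplicationId": str(uuid.uuid4()),
    }


def lambda_handler(event, context):
    """Entry point for a Lambda started by a scheduled CloudWatch rule."""
    import boto3  # available in the Lambda runtime; imported lazily here

    # Assumes the rule is configured to pass a constant JSON input with "job".
    job_name = event.get("job", "italian-job")
    response = boto3.client("sqs").send_message(
        **build_fifo_message(QUEUE_URL, job_name)
    )
    return {"messageId": response["MessageId"]}
```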