Does SQS kill long running jobs that have Thread.sleep? - amazon-web-services

I have some java code that calls Thread.sleep(100_000) inside a job running in SQS. In production, during the sleep the job is often killed and re-submitted as failed. On dev I can never re-create that. Does SQS in production kill long running jobs?

SQS doesn't kill jobs - and I am not sure what you mean by you have code 'running in SQS' - what SQS does do, is to assume your job (which is running someplace other than SQS), has failed if you don't mark it completed within the timeout (Default Visibility Timeout) you set when you setup the queue.
Your job asks SQS for an item to work on (a message to process) - your job is supposed to do that work and then tell SQS that the job is now done (deletemessage). If you don't tell it it is done, SQS makes an assumption for you that the job has failed and puts the message back in the queue for another task to process.
If you need more time to complete the tasks, you can change the default visibility timeout to a higher value if you want.

Related

Deleting a Google Cloud Task does not stop task running

I have a task queue which users can push tasks onto, only one task can run at a time enforced by the concurrency setting for the queue. In some cases (e.g. long running task) they may wish to cancel a running task in order to free up the queue to process the next task.
To achieve this I have been running the task queue as a Flask application, should a user wish to cancel a task I call the delete_task method of the python client library for a given queue & task.
However, I am seeing that the underlying task continues to be processed even after the task has been deleted. Have been trying to find documentation of how Cloud Tasks handles a task being deleted, but haven't found anything concrete.
Hoping that i'd be able to listen for a signal of some sort in order to gracefully shut down the process if a deletion is received. Or that the underlying process would be killed if the parent task is deleted.
Has anyone worked with the Cloud Tasks API before? Is it correct to assume that a deleted task will cleanup any processes that are running?
I don't see how a worker would be able to find out that the task it is working on has been deleted.
In the eyes of the worker, a task is an incoming Http request. I don't know how the Queue could tell that specific process to stop. I'm fairly certain that "deleting" a task just removes it from the Queue only.
You'd have to build a custom 'cancel' function that would be able to reach out to this worker.
Or this worker would have to periodically check with the Queue to see if its task still exists.
https://cloud.google.com/tasks/docs/reference/rest/v2/projects.locations.queues.tasks/get
https://googleapis.dev/python/cloudtasks/latest/gapic/v2/api.html#google.cloud.tasks_v2.CloudTasksClient.get_task
I'm not actually sure what the Queue will return if you try to call 'get task' a deleted task since i don't see a 'status' property for task. Maybe it will return an error like 'task does not exist'

Approach to crashed workers in amazon swf

We're currently implementing a workflow in Amazon SWF where we submit jobs/workflow executions from our web application. Everything was fairly quick and painless to get set up using the Ruby Flow framework. As long as the deciders/activity workers don't crash we seem to be able to handle most issues/exceptions gracefully.
My question is, what is common practice for the scenario where the decider process crashes midway through a workflow execution? If the task fails in that way, is it possible to push an SNS notification (I've seen no examples) or something to indicate to another process that there's been an unexpected failure/crash?
There are various types of "decider" failures.
Workflow worker crashes while processing a decision. The decision task is automatically rescheduled after specified timeout. Make sure that workflow type defaultTaskStartToCloseTimeout is not set too high. If this crash is not related to code correctness then rescheduled task is processed and workflow execution continues normally.
Workflow worker doesn't crash but workflow execution itself fails. In this case you can use ListClosedWorkflowExecutions to count such failed workflows.
Workflow worker doesn't crash but a decision task cannot complete as RespondDecisionTaskCompleted fails due to a bug in the Flow framework. As from SWF point of view task is never completed it at some point is marked as timed out and rescheduled. As bug is still present a new task is again never completes and rescheduled, and so on. The workflow execution that is experiencing such issue has a history with a tail that consists from repeated "decision task scheduled, decision task timed out" events. If your workflow has a known execution time limit then the best way to catch this issue is to set reasonable executionStartToCloseTimeout and look for timed out workflow executions. If the decision task timeout is set too low such workflows can also hit the limit on history size before the execution timeout.
All swf metrics are not published to cloud watch. So all completed and failed workflows will send the metrics to cloudwatch where you can create alarms to send you notifications when any workflow fails.

AWS SQS: Delay making available a message that failed to process

Here's my scenario:
I have an SQS queue which processes a number of tasks. Those tasks can, and often times do, fail. Their failure is common and somewhat expected.
When a task fails, I want to retry it after a certain amount of time, and fail the item into a DLQ after a certain amount of retries. I do not want to retry immediately.
I have a worker EB app which processes these tasks. When it succeeds, I return 200 (and the task is successfully removed from the queue). When it fails I return 404, and the task is immediately returned to the queue (and, thus, immediately retried). This is not desired, I'd like to delay this failed item before it is retried.
Is it possible to do this with a combination of visibility timeouts and delay queues?
You can do this natively with SQS by calling ChangeMessageVisibility on a message you just failed to process and setting the VisibilityTimeout to whatever you want: http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ChangeMessageVisibility.html
Answered my own question, turns out I was looking in the wrong place (SQS config options, not EB config options). The magic settings I was looking for is "error visibility timeout" in the EB config options, which allows you to control the amount of time a failed item has before returning to its queue.

AWS SWF Simple Workflow - Best Way to Keep Activity Worker Scripts Running?

The maximum amount of time the pollForActivityTask method stays open polling for requests is 60 seconds. I am currently scheduling a cron job every minute to call my activity worker file so that my activity worker machine is constantly polling for jobs.
Is this the correct way to have continuous queue coverage?
The way that the Java Flow SDK does it and the way that you create an ActivityWorker, give it a tasklist, domain, activity implementations, and a few other settings. You set both the setPollThreadCount and setTaskExecutorSize. The polling threads long poll and then hand over work to the executor threads to avoid blocking further polling. You call start on the ActivityWorker to boot it up and when wanting to shutdown the workers, you can call one of the shutdown methods (usually best to call shutdownAndAwaitTermination).
Essentially your workers are long lived and need to deal with a few factors:
New versions of Activities
Various tasklists
Scaling independently on tasklist, activity implementations, workflow workers, host sizes, etc.
Handle error cases and deal with polling
Handle shutdowns (in case of deployments and new versions)
I ended using a solution where I had another script file that is called by a cron job every minute. This file checks whether an activity worker is already running in the background (if so, I assume a workflow execution is already being processed on the current server).
If no activity worker is there, then the previous long poll has completed and we launch the activity worker script again. If there is an activity worker already present, then the previous poll found a workflow execution and started processing so we refrain from launching another activity worker.

How to handle large Emailing queue and delivery with AWS SES?

We are developing an app. that need to handle large email queues. We have planned to store emails in a SQS queue and use SES to send emails. but a bit confused on how to actually handle the queue and process queue. should I use cronjob to regularly read the SQS queue and send emails? What would be the best way to actually trigger the script that will be emailing from our app?
Using SQS with SES is a great way to handle this. If something goes wrong while emailing the request will still be on the queue and will be processed next time around.
I just use a cron job that starts my queue processing/email sending job once an hour. The job runs for an hour as a simple loop:
while i've been running < 1 hour:
if there's a message in the queue:
process the message
delete the message from the queue
I set the WaitTimeSeconds parameter to the maximum (20 seconds) so that the check for a new message will wait a while for a new message if necessary so that the job isn't hitting AWS every few milliseconds. Otherwise, I could put a sleep statement of some kind in the loop.
The reason I run for just an hour is that the job might encounter some error that kills it, or have a memory leak, or some other unanticipated problem. This way any queued email requests will still get handled the next time the job is started.
If you want, you can start the job every fifteen minutes so you'll always have four worker processes handling queue requests. If one of them dies for some reason, you'll still be processing with the other three.