I have an SQS Worker Tier Beanstalk application listening to a queue. If we encounter any issues, for example a database crash, is there a way for us to temporarily stop the worker tier from working that queue without having to terminate the environment and rebuild it when we want to resume?
One hack I guess would be for us to point it to an empty queue, but I'd rather avoid that type of thing.
Thanks
For anybody who is in the same boat as me, I just want to post my own, inelegant solution.
We have created another SQS Queue, and whenever we want to turn off the processing of messages, we just update the worker tier app to point to this new queue. It isn't clean, but it does what we need.
Another option is to just leave it as is. In case of a database crash, or any other error, your application will return, for example, 500 instead of 200, and the message will be returned to the queue for future processing.
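A minimal sketch of that idea, assuming the worker tier delivers each message as an HTTP POST to your application and treats only a 200 response as success. The function name and the `db_is_up` and `process` callables are hypothetical placeholders, not part of any AWS API:

```python
def handle_worker_post(body, db_is_up, process):
    """Return the HTTP status code the worker endpoint should answer with.

    db_is_up: hypothetical callable that returns True when the database is reachable.
    process:  hypothetical callable that does the real work for one message body.
    """
    if not db_is_up():
        # Dependency is down: report failure so the message is not deleted.
        return 500
    try:
        process(body)
    except Exception:
        # Any processing error also leaves the message on the queue for a retry.
        return 500
    return 200
```

With this shape, an outage needs no manual "pause": messages keep failing and reappear on the queue until the dependency is back.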
Not sure if this helps, but you can add a delivery delay to SQS queue: right click the queue -> configure queue -> set Delivery Delay up to 15 minutes. Any message will be received after this delay. This allows me to "pause" the queue for up to 15 minutes.
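The same setting can be changed from code instead of the console. This is a rough sketch that assumes `sqs_client` is a boto3 SQS client (for example `boto3.client("sqs")`); the helper name is made up for illustration. As far as I know, on standard queues a changed queue-level delay only applies to messages sent after the change.

```python
MAX_DELAY_SECONDS = 900  # SQS caps the delivery delay at 15 minutes


def set_delivery_delay(sqs_client, queue_url, seconds):
    """Set the queue-level delivery delay (hypothetical helper).

    sqs_client is assumed to be a boto3 SQS client; it is passed in so the
    function itself has no third-party imports.
    """
    if not 0 <= seconds <= MAX_DELAY_SECONDS:
        raise ValueError("delay must be between 0 and 900 seconds")
    sqs_client.set_queue_attributes(
        QueueUrl=queue_url,
        # Queue attribute values are passed as strings.
        Attributes={"DelaySeconds": str(seconds)},
    )
```

Calling it with 900 gives the 15-minute "pause"; calling it again with 0 removes the delay.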
You can terminate the environment and recreate it. In case you do not have a way to recreate the same environment with just one command, take a look at: https://github.com/ThoughtWorksStudios/eb_deployer
We've got Celery/SQS set up for asynchronous task management. We're running Django for our framework. We have a celery task that has a self.retry() in it. Max_retries is set to 15. The retry is happening with an exponential backoff and takes 182 hours to complete all 15 retries.
Last week, this task went haywire, I think due to a bug in our code not properly handling a service outage. It resulted in exponential creation (retrying?) of the same celery task. It eventually used up all available memory and the worker crashed. Restarting the worker results in another crash a couple hours later, since all those tasks (and their retries) keep retrying and spawning new retries until we run out of memory again. Ultimately we ended up with nearly 600k tasks created!
We need our workers to ignore all the tasks with a specific celery GUID. Ideally we could just get rid of them for good. I was going to use revoke() but, per documentation (http://docs.celeryproject.org/en/3.1/userguide/workers.html#commands), this is only implemented for Redis and RabbitMQ, not SQS. Furthermore, when I go to the SQS service in the AWS console, it's showing zero messages in flight so it's not like I can just flush it.
Is there a way to delete or revoke a specific message from SQS using the Celery task ID? Or is there another way to fix this problem? Obviously we need to fix our code so we don't get into this situation again, but first we need to get our worker up and running because without it our website has reduced functionality. Thanks!
I have an SQS listener with a max message count of 10. When my consumer receives a batch of 10 messages they all get processed, but sometimes (depending on the message) the processing will take 5-6 hours and sometimes as little as 5 minutes. I have 3 consumers (3 different JVMs) polling from the queue with a maxMessageCount of 10. Here is my issue:
If one of those 10 messages takes 5 hours to process it seems as though the listener is waiting to do the next poll of 10 messages until all of the previous messages are 100% complete. Is there a way to allow it to poll a new batch of messages even though another is still being processed?
I'm guessing that I am missing something small here. I am using the Spring Cloud library and the SqsListener annotation. Has anybody run across this before?
Also, I don't think this should matter, but the queue is AWS SQS and the JVMs are running on an ECS cluster.
If you run the task on the poller thread, the next poll won't happen until the current one completes.
You can use an ExecutorChannel or QueueChannel to hand the work off to another thread (or threads) but you risk message loss if you do that.
Your situation is rather unusual; 5 hours is a long time to process a message.
You should perhaps consider redesigning your application to persist these "long running" requests to a database or similar, instead of processing them directly from the message. Or, perhaps put them in a different queue so that they don't impact the shorter tasks.
According to the SQS documentation, the maximum delay we can configure to hide a message from its consumers is 15 minutes - http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-delay-queues.html
Suppose I need to hide the messages for a day. What is the pattern?
For example, I want to mimic a daily cron for doing some action.
Thanks
The simplest way to do this is as follows:
SQS.push_to_queue({perform_message_at : "Thursday November 2022"}, delay: 15 mins)
Inside your worker:
message = SQS.poll_messages
if message.perform_message_at > Time.now
  SQS.push_to_queue({perform_message_at : "Thursday November 2022"}, delay: 15 mins)
else
  process_message(message)
end
Basically, push the message back to the queue with the maximum delay, and only process it once its scheduled time has passed.
HTH.
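For what it's worth, the re-queue pattern above might look roughly like this in Python. It is only a sketch: `sqs_client` is assumed to be a boto3 SQS client, and the `perform_at` field (epoch seconds inside a JSON body), the function name, and the `process` callable are made up for illustration. Unlike the pseudocode, it caps the delay at the remaining time instead of always using the full 15 minutes.

```python
import json
import math
import time

MAX_DELAY_SECONDS = 900  # SQS caps DelaySeconds at 15 minutes


def handle_or_requeue(sqs_client, queue_url, message, process, now=None):
    """Process a due message, or push a copy back with a fresh delay.

    Returns True if the message was processed, False if it was re-queued.
    """
    now = time.time() if now is None else now
    payload = json.loads(message["Body"])
    remaining = payload["perform_at"] - now  # hypothetical field, epoch seconds

    due = remaining <= 0
    if due:
        process(payload)
    else:
        # Not due yet: send a copy back, delayed by at most 15 minutes.
        sqs_client.send_message(
            QueueUrl=queue_url,
            MessageBody=message["Body"],
            DelaySeconds=min(MAX_DELAY_SECONDS, math.ceil(remaining)),
        )
    # Either way, remove the copy that was just received.
    sqs_client.delete_message(
        QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
    )
    return due
```

A message scheduled a day ahead is simply bounced roughly 96 times before it finally runs, so this trades extra SQS requests for not needing any other scheduler.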
Visibility timeout can do up to 12 hours. I think you can hack something together where you process a message but don't delete it, and the next time it is processed it's been 12 hours. So a queue with one message and a visibility timeout of 12 hours. That gets you a 12 hour cron.
CloudWatch Events (now EventBridge) is likely a better way to do it. You can create a scheduled rule and have it trigger either a Lambda function or an API call to whatever comes next.
Another way to do it is to use the Wait state in an AWS Step Functions state machine.
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-wait-state.html
In any case, unless you are extremely sure you will never need anything more than 15 minutes, the SQS backdoor to add the delay seems hacky.
You can do this by adding a DLQ to the first queue, with maxReceiveCount set to 1 in its redrive policy.
Add a simple Lambda on the first queue and fail the message via the Lambda. The message will then be moved to the DLQ automatically, and you can consume from the DLQ.
Both primary queue and DLQ can have max 15 min delay, so finally you get 30 min delay.
So your consumer app receives the message after 30 minutes, without adding any custom logic on it.
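A sketch of the queue attributes this setup would need on the first queue, assuming they are later applied with something like boto3's `set_queue_attributes`. The helper name is hypothetical and the ARN in the usage comment is a placeholder; whether the DLQ's own delay is honoured for redriven messages is taken from the answer above, not verified here.

```python
import json


def first_queue_attributes(dlq_arn, delay_seconds=900):
    """Build the attributes for the primary queue (hypothetical helper).

    dlq_arn: ARN of the dead-letter queue, which is assumed to have its own
             DelaySeconds of 900 configured separately.
    """
    return {
        # First 15-minute delay, applied when the message is sent.
        "DelaySeconds": str(delay_seconds),
        # After a single failed receive, SQS moves the message to the DLQ.
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": 1}
        ),
    }


# Usage (placeholder ARN, not a real queue):
# attrs = first_queue_attributes("arn:aws:sqs:us-east-1:123456789012:my-dlq")
# sqs_client.set_queue_attributes(QueueUrl=first_queue_url, Attributes=attrs)
```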
Two thoughts.
Untested. Perhaps publish to an SNS topic that has no SQS queues subscribed. When delivery needs to happen, subscribe the queue to the topic. (I've not done this, and I'm not sure if it would work as expected.)
Push messages as files to a central store (like S3). Create a worker that looks at the creation timestamp and decides whether to publish them to a queue or not. If created >= 1d ago, publish.
This was a challenge for us as well and I never found a perfect solution so I ended up building a service to address it. Obviously self promotion here but the system allows you to work around the DelaySeconds limitation and set arbitrary date/times at scale.
https://anticipated.io
Some of the challenges working with Step Functions are scale of registered machines (if your system had that requirement). If you use EventBridge to fire them you run out of allowable rulesets (limit is 200 as of this posting). Example: if you need to set 150,000 arbitrary events a month you run into limits quickly.
We are developing an app that needs to handle large email queues. We have planned to store emails in an SQS queue and use SES to send them, but we are a bit confused about how to actually handle and process the queue. Should I use a cron job to regularly read the SQS queue and send emails? What would be the best way to actually trigger the script that will be emailing from our app?
Using SQS with SES is a great way to handle this. If something goes wrong while emailing, the request will still be on the queue and will be processed next time around.
I just use a cron job that starts my queue processing/email sending job once an hour. The job runs for an hour as a simple loop:
while i've been running < 1 hour:
    if there's a message in the queue:
        process the message
        delete the message from the queue
I set the WaitTimeSeconds parameter to the maximum (20 seconds) so that the check for a new message will wait a while for a new message if necessary so that the job isn't hitting AWS every few milliseconds. Otherwise, I could put a sleep statement of some kind in the loop.
The reason I run for just an hour is that the job might encounter some error that kills it, or have a memory leak, or some other unanticipated problem. This way any queued email requests will still get handled the next time the job is started.
If you want, you can start the job every fifteen minutes so you'll always have four worker processes handling queue requests. If one of them dies for some reason, you'll still be processing with the other three.
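A rough Python version of that loop, in case it helps. It assumes `sqs_client` is a boto3 SQS client; `send_email` is a hypothetical callable (for example a thin wrapper around SES), and the `clock` parameter only exists to make the loop easy to test.

```python
import time


def run_for(sqs_client, queue_url, send_email,
            max_runtime_seconds=3600, clock=time.monotonic):
    """Poll the queue until the time budget is used up; return the number handled."""
    deadline = clock() + max_runtime_seconds
    handled = 0
    while clock() < deadline:
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling: wait up to 20 s for a message
        )
        # "Messages" is absent from the response when the queue was empty.
        for message in response.get("Messages", []):
            send_email(message["Body"])
            # Delete only after the email went out; if send_email raises, the
            # message stays on the queue and is retried by a later run.
            sqs_client.delete_message(
                QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
            )
            handled += 1
    return handled
```

The cron entry then just starts a script that calls `run_for` once and exits.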
I'm implementing a task queue with Amazon SQS (but I guess the question applies to any task queue), where the workers are expected to take different actions depending on how many times the job has already been retried (move it to a different queue, increase the visibility timeout, send an alert, etc.).
What would be the best way to keep track of the failed job count? I'd like to avoid having to keep a centralized DB for job:retry-count records. Should I instead look at time spent in the queue in a monitoring process? IMO that would be ugly or unclean at best, iterating over jobs until I find ancient ones.
thanks!
Andras
There is another simpler way. With your message you can request ApproximateReceiveCount information and base your retry logic on that. This way you won't have to keep it in the database and can calculate it from the message itself.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ReceiveMessage.html
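To illustrate, here is a small sketch assuming `sqs_client` is a boto3 SQS client. The function names, thresholds, and action labels are invented for the example; only the `ApproximateReceiveCount` attribute comes from SQS itself.

```python
def receive_with_count(sqs_client, queue_url):
    """Receive up to one message and ask SQS to include its receive count."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateReceiveCount"],
        MaxNumberOfMessages=1,
    )
    return response.get("Messages", [])


def choose_action(message, alert_after=3, give_up_after=5):
    """Pick an action label from the receive count (thresholds are hypothetical)."""
    # SQS returns attribute values as strings.
    count = int(message["Attributes"]["ApproximateReceiveCount"])
    if count >= give_up_after:
        return "move_to_other_queue"
    if count >= alert_after:
        return "alert"
    return "process"
```

The count is approximate and includes the current receive, so a message seen for the first time reports 1.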
I've had good success combining SQS with SimpleDB. It is "centralized", but only as much as SQS is.
Every job gets a record in SimpleDB and a task in SQS. You can put any information you like in SimpleDB, like the job creation time. When a worker pulls a job from the queue it can grab the corresponding record from SimpleDB to determine its history. You can see how old the job is, and you can see how many times it has been attempted. Once you're done, you can add worker data to the SimpleDB record (completion time, outcome, logs, errors, stack trace, whatever) and acknowledge the message from SQS.
I prefer this method because it helps diagnose faults by providing lots of debug info for failed tasks. It also allows workers to handle the job differently depending on how long the job has been queued, how many failures it's had, etc.
It also gives you the ability to query SimpleDB directly and calculate things like average time per task, percent failure rate, etc.
Amazon just released Simple Workflow Service (SWF), which you can think of as a more sophisticated/flexible version of GAE Task Queues.
It will let you monitor your tasks (with heartbeats), configure retry strategies and create complicated workflows. It looks pretty promising for abstracting out task dependencies, scheduling and fault tolerance for tasks (especially asynchronous ones).
Check out http://docs.amazonwebservices.com/amazonswf/latest/developerguide/swf-dg-intro-to-swf.html for an overview.
SQS stands for "Simple Queue Service", which is conceptually a misleading name for that service. The first and foremost feature of a "queue" is FIFO (first in, first out), and standard SQS queues do not guarantee that ordering. Just wanting to clarify.
Also, Azure Queue Services lacks that as well. For the best cloud Queue service, use Azure's Service Bus since it's a TRUE Queue concept.