AWS Elastic Beanstalk Worker timing out after inactivity during long computation

I am trying to use Amazon Elastic Beanstalk to run a very long numerical simulation, up to 20 hours. The code works beautifully when I tell it to do a short, 20-second simulation. However, when running a longer one, I get the error "The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own)".
After browsing the web, it seems the issue is that Elastic Beanstalk allows worker processes to run for 30 minutes at most, after which they time out because the instance has not responded (i.e. finished the simulation). The solution some have proposed is to send a message every 30 seconds or so that "pings" Elastic Beanstalk, letting it know the simulation is going well so it doesn't time out, which would let me run a long worker process. So I have a few questions:
1) Is this the correct approach?
2) If so, what code or configuration would I add to the project to make it stop terminating early?
3) If not, how can I smoothly run a 12+ hour simulation on AWS or, more generally, the cloud?
Additional information
Thank you for the feedback, Rohit. To give some more information, I'm using Python with Flask.
• I am indeed using an Elastic Beanstalk worker tier with SQS queues
• In my code, I'm running a simulation of variable length - from as short as 20 seconds to as long as 20 hours. 99% of the work that Elastic Beanstalk does is running the simulation. The other 1% involves saving results, sending emails, etc.
• The simulation itself involves generating many random numbers and working with objects that I defined. I use numpy heavily here.
Let me know if I can provide any more information. I really appreciate the help :)

After talking to a friend who's more in the know about this stuff than I am, I solved the problem. It's a little sketchy, but it got the job done. For future reference, here is an outline of what I did:
1) Wrote a main script that used Amazon's boto library to connect to my SQS queue, with an infinite while loop that polls the queue every 60 seconds. When there's a message on the queue, it runs a simulation and then continues through the loop (see the sketch after this list)
2) Borrowed a beautiful /etc/init.d/ template to run my script as a daemon (http://blog.scphillips.com/2013/07/getting-a-python-script-to-run-in-the-background-as-a-service-on-boot/)
3) Made my main script and the script in (2) executable
4) Set up a cron job to make sure the script would start back up if it failed.
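For future searchers, here is a minimal sketch of the main script from step 1. The original used the older boto library; this version uses boto3, and the queue name and run_simulation() are illustrative placeholders:

    # Polling daemon sketch: pull simulation jobs off SQS and run them.
    import time
    import boto3

    QUEUE_NAME = "simulation-jobs"  # hypothetical queue name

    def run_simulation(params):
        """Placeholder for the long-running numerical simulation."""
        ...

    def main():
        queue = boto3.resource("sqs").get_queue_by_name(QueueName=QUEUE_NAME)
        while True:
            messages = queue.receive_messages(MaxNumberOfMessages=1,
                                              WaitTimeSeconds=20)
            for message in messages:
                params = message.body
                # Delete up front: a 20-hour run exceeds SQS's 12-hour
                # visibility maximum, so the message cannot stay "in flight"
                # that long. Trade-off: a crash mid-simulation loses the job
                # (see the answer about ChangeMessageVisibility below).
                message.delete()
                run_simulation(params)
            time.sleep(60)  # poll every 60 seconds, as in step 1

    if __name__ == "__main__":
        main()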
Once again, thank you Rohit for taking the time to help me out. I'm glad I still got to use Amazon even though Elastic Beanstalk wasn't the right tool for the job.

From your question it seems you are running into launches timing out because some commands that run on your instance during launch take more than 30 minutes.
As explained here, you can adjust the Timeout option in the aws:elasticbeanstalk:command namespace, which can have values between 1 and 1800. This means that if your commands finish within 30 minutes you won't see this error. The commands might eventually finish, as the error message says, but since Elastic Beanstalk has not received a response within the specified period it does not know what is going on on your instance.
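As a sketch, the option can be set from an .ebextensions file bundled with your application (the file name is arbitrary):

    # .ebextensions/timeout.config -- raise the command timeout to its maximum
    option_settings:
      aws:elasticbeanstalk:command:
        Timeout: 1800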
It would be helpful if you could add more details about your use case. What commands are you running during startup? Apparently you are using ebextensions to launch commands which take a long time. Is it possible to run those commands in the background, or do you need them to run during server startup?
If you are running a Tomcat web app you could also use something like the servlet init method to run app bootstrapping code. That code can take however long it needs without giving you this error message.

Unfortunately, there is no way to 'process a message' from an SQS queue for more than 12 hours (see the description of the ChangeMessageVisibility action).
With that being the case, this approach doesn't fit your application well. I have run into the same problem.
The correct way to do this: I don't know. However, I would suggest an alternate approach where you grab a message off of your queue, spin off a thread or process to run your long running simulation, and then delete the message (signaling successful processing). In this approach, be careful of spinning off too many threads on one machine and also be wary of machines shutting down before the simulation has ended, because the queue message has already been deleted.
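A minimal sketch of that alternate approach, with a semaphore guarding against spinning off too many concurrent simulations (the queue name, MAX_WORKERS, and run_simulation are placeholders):

    import threading
    import boto3

    MAX_WORKERS = 2  # cap simultaneous simulations on one machine
    slots = threading.Semaphore(MAX_WORKERS)

    def run_simulation(params):
        ...  # placeholder for the long-running work

    def worker(params):
        try:
            run_simulation(params)
        finally:
            slots.release()  # free the slot even if the simulation fails

    def main():
        queue = boto3.resource("sqs").get_queue_by_name(QueueName="simulation-jobs")
        while True:
            slots.acquire()  # block until a simulation slot is free
            messages = queue.receive_messages(MaxNumberOfMessages=1,
                                              WaitTimeSeconds=20)
            if not messages:
                slots.release()
                continue
            params = messages[0].body
            # Delete to signal "taken"; a machine shutdown after this point
            # loses the job, exactly the caveat described above.
            messages[0].delete()
            threading.Thread(target=worker, args=(params,)).start()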
Final note: your question is excellently worded and sufficiently detailed :)

For those looking to run jobs shorter than 10 hours, it should be mentioned that the current inactivity timeout limit is 36000 seconds, i.e. exactly 10 hours, and no longer the 30 minutes mentioned in posts all over the web (which led me to think a workaround like the one described above was needed).
Check out the docs: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html
A very nice write-up can be found here: https://dev.to/rizasaputra/understanding-aws-elastic-beanstalk-worker-timeout-42hi
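For reference, a sketch of raising that limit via .ebextensions, using the option names from the worker environment docs linked above:

    option_settings:
      aws:elasticbeanstalk:sqsd:
        InactivityTimeout: 36000
        VisibilityTimeout: 36000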

Related

Need to run an AWS Lambda function which takes more than 15 minutes to complete?

My Lambda function has a 15-minute limit, which was 5 minutes earlier. The Lambda process is automatically terminated after 15 minutes, but my process takes more than 15 minutes. How can I manage this?
There is no way around this. If you're doing some sort of long-running processing then your other option may be to run the task on an EC2 instance. If the long-running process can be broken down into multiple steps then you could look into Step Functions.
15 minutes is the max and this max cannot be extended.
EDIT:
Recently I started running some long-running tasks that are variable in length (anywhere from a couple of minutes to several hours). To accomplish this I've been using AWS Fargate, and my task is a Node.js script stored as a Docker container in ECR. Doing this was fairly easy and also fairly cheap (I think we spent a little over $1 in a month running this task daily). This may be worth looking into for others who come across this answer.
https://docs.aws.amazon.com/AmazonECS/latest/userguide/scheduled_tasks.html
Typically you would use a Fat Lambda strategy or a Step Function.
Fat Lambda Strategy
A Fat Lambda strategy is used when your task is singular but has a long-running execution time and/or heavy hardware requirements. The idea is that you create a script that executes your long process and put it into a Docker container that's hosted in Fargate. That means no limits on execution time and access to powerful hardware (how to create a Fat Lambda: https://youtu.be/XUp9SHIHU8w).
Step Function strategy
A Step Function strategy is used to break down your entire process into smaller steps. Usually, a Step Function strategy will work for you if your process can be split into lots of miniature stages linked together instead of one colossal job attempting to do everything simultaneously. Bear in mind that a "Fat Lambda" can also be triggered by a Step Function (how to create a Step Function: https://www.youtube.com/watch?v=s0XFX3WHg0w).
Also, remember Lambdas can trigger other Lambdas, so you might even be able to have different Lambdas run bits of your Lambda code. For example, a FOR loop sends off a lot of mini Lambdas to run small tasks. You might not even need a Step Function or a Fat Lambda.
If you're stuck on what to choose, follow the progression below. It will help you reason about your problem.
Singular Lambda >> Lambda invokes another Lambda? >> Step Functions? >> Fargate (Fat Lambda)?
If you can checkpoint the task then you can check getRemainingTimeInMillis (docs) and, if the time is running out, invoke the same Lambda with a parameter saying where to continue.
Something like this flow:
start working (0% done)
time is running low (40% done) => start a new lambda telling it to start from 40%
old lambda is terminated, new lambda starts working (40%)
when its time is running low, start a new lambda again (80%)
the third lambda finishes the job
But it requires a very specific type of task to support this. If you require a single execution from start to finish then Lambda is not a good choice.
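As a sketch of that flow in Python, where do_chunk_of_work, TOTAL_STEPS, and the payload shape are illustrative placeholders (get_remaining_time_in_millis is the Python counterpart of getRemainingTimeInMillis):

    import json
    import boto3

    lambda_client = boto3.client("lambda")
    SAFETY_MARGIN_MS = 60_000  # hand off with a minute to spare
    TOTAL_STEPS = 100          # placeholder for the real unit of work

    def do_chunk_of_work(step):
        ...  # placeholder: do one resumable chunk, return the next step
        return step + 1

    def handler(event, context):
        step = event.get("resume_from", 0)
        while step < TOTAL_STEPS:
            if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
                # Out of time: re-invoke this same function asynchronously,
                # telling it where to pick up (the "40% done" handoff above).
                lambda_client.invoke(
                    FunctionName=context.function_name,
                    InvocationType="Event",
                    Payload=json.dumps({"resume_from": step}),
                )
                return {"status": "continued", "resume_from": step}
            step = do_chunk_of_work(step)
        return {"status": "done"}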
What do you think about using a lambda to trigger an ECS task? An ECS task just runs a containerized application for as long as it needs to run.
This blog post is relevant: https://www.gravitywell.co.uk/insights/using-ecs-tasks-on-aws-fargate-to-replace-lambda-functions/.
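A hedged sketch of such a trigger, using boto3's run_task (the cluster, task definition, and subnet values are placeholders):

    import boto3

    ecs = boto3.client("ecs")

    def handler(event, context):
        # Start one Fargate task that runs for as long as it needs.
        ecs.run_task(
            cluster="simulations",             # placeholder cluster name
            launchType="FARGATE",
            taskDefinition="long-simulation",  # placeholder task definition
            count=1,
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": ["subnet-0123456789abcdef0"],  # placeholder
                    "assignPublicIp": "ENABLED",
                }
            },
        )
        return {"status": "task started"}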
AWS Lambda is meant to be used for quick processing. If your task is this long, it is better to choose some other way to develop that functionality. You can define the timeout property for AWS Lambda, but it cannot exceed 15 minutes.
For your use case it is better to deploy your application on EC2, then terminate the EC2 instance when the processing is done or it has remained idle for more than a threshold time.
Refer to the AWS Lambda documentation: https://docs.aws.amazon.com/lambda/index.html
To add to the Step function answer - here's a very simple playbook:
Work for 10 minutes
Write progress to S3
kick off another lambda to consume your progress
terminate
Once you're done, output. Voila: an effectively infinite-runtime Lambda with very little overhead.
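A sketch of the S3 progress handoff, assuming a hypothetical bucket and a JSON progress record:

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "job-progress"  # placeholder bucket

    def save_progress(job_id, state):
        # Persist progress so the next invocation can consume it.
        s3.put_object(Bucket=BUCKET, Key=f"{job_id}.json",
                      Body=json.dumps(state).encode())

    def load_progress(job_id):
        try:
            obj = s3.get_object(Bucket=BUCKET, Key=f"{job_id}.json")
            return json.loads(obj["Body"].read())
        except s3.exceptions.NoSuchKey:
            return {"percent_done": 0}  # first invocation starts from scratch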
No, you cannot run a Lambda for more than 15 minutes!
But yes, you can manage this using signals.
Basically, the signal will tell you to start plan B when plan A cannot finish within the 15 minutes. If you can decouple the tasks in your process and add checkpoints, then the next Lambda invocation can pick things up in plan B, or plan B can record database entries for the unprocessed parts and reprocess them as part of another run.
Framework here -
https://gist.github.com/kuharan/c2bfddac7bd8dc5702f6eec31729fb48
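The gist above is the author's framework; as an illustrative stand-in, here is a SIGALRM-based sketch in Python, where do_all_the_work and save_checkpoint are placeholders:

    import signal

    class OutOfTime(Exception):
        pass

    def handler(event, context):
        def plan_b(signum, frame):
            raise OutOfTime

        signal.signal(signal.SIGALRM, plan_b)
        # Arm a timer to fire 30 seconds before Lambda kills the invocation.
        remaining = context.get_remaining_time_in_millis() / 1000.0
        signal.setitimer(signal.ITIMER_REAL, max(remaining - 30.0, 1.0))
        try:
            do_all_the_work(event)   # plan A: placeholder for the real task
        except OutOfTime:
            save_checkpoint(event)   # plan B: record what's left for a rerun
        finally:
            signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer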

Lambda timed out due to container refresh

I have gone through the site but have been unable to find the root cause of my issue.
We have a Lambda that runs every 50 seconds. The first run is a cold start: during startup, all the necessary dependencies for the Lambda are prepared (all the interfaces). The Lambda handler has its own code to interact with SQS and SWF. From the CloudWatch logs it is clear that on the first run it reads the base file to get all the services, and then the handler starts. From the second run onward, only the handler is invoked every 50 seconds. So far everything goes smoothly.
All of a sudden we noticed the Lambda took more than 50 seconds (in general it finishes in under 10s). The log shows that the Lambda timed out and started initializing all the dependencies again from scratch.
This gives us no clue, since after the timeout the subsequent runs work smoothly. It is not good to see the Lambda time out, and the Lambda code is definitely free of errors.
Could this be a container issue? Does the container have a time period for which it keeps data active until it reaches an expiry timeout?
Can we access the container object to find out more information? We have two or more dev environments, and this behavior differs between them: for some it happens every 3 days; sometimes it happens three times in a day.
If we want to understand the properties of the container object, how can we do it? Is it a grey zone that only AWS can access? The Lambda code is written in C# using .NET Core 2.0. I thought of checking the CloudTrail log for this Lambda during the invocation, but there too I am not able to find the reason behind the timeout.
We have more than 20 Lambdas for dev and 10 for test across different regions, and it is not clear to us which Lambda will time out.
Any suggestions or ideas would help me a lot.
Thank you.
Lambda containers will not live indefinitely. If you are seeing occasional "cold starts" then that is normal behavior. If you're running only 1 invocation at a time (i.e. you only have a single lambda instance) you can still expect to see the container recycled every few hours. In general, I understand AWS is trying to give us fewer cold starts but you can still expect to get a new container and new cold start from time to time.

How to fix CloudRun error 'The request was aborted because there was no available instance'

I'm using managed CloudRun to deploy a container with concurrency=1. Once deployed, I'm firing four long-running requests in parallel.
Most of the time, all works fine -- But occasionally, I'm facing 500's from one of the nodes within a few seconds; logs only provide the error message provided in the subject.
Using retry with exponential back-off did not improve the situation; the retries also end up with 500s. StackDriver logs also do not provide further information.
Potentially relevant gcloud beta run deploy arguments:
--memory 2Gi --concurrency 1 --timeout 8m --platform managed
What does the error message mean exactly -- and how can I solve the issue?
This error message can appear when the infrastructure doesn't scale fast enough to catch up with a traffic spike. The infrastructure only keeps a request in the queue for a certain amount of time (about 10s), then aborts it.
This usually happens when:
traffic suddenly increases sharply
cold start time is long
request time is long
We also faced this issue when traffic suddenly increased during business hours. The issue is usually caused by a sudden increase in traffic and a longer instance start time to accommodate incoming requests. One way to handle this is by keeping warm instances always running, i.e. configuring the --min-instances parameter in the gcloud run deploy command, as shown below. Another, recommended, way is to reduce the service's cold start time (which is difficult to achieve in some languages like Java and Python).
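For example, combined with the deploy arguments from the question (the service and image names are placeholders):

    gcloud run deploy my-service \
      --image gcr.io/my-project/my-image \
      --platform managed \
      --memory 2Gi --concurrency 1 --timeout 8m \
      --min-instances 2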
I also experience the problem, and it is easy to reproduce. I have a Fibonacci container that computes fibo(45) in 6s. I use Hey to perform 200 requests, and I set my Cloud Run concurrency to 1.
Over 200 requests I get 8 such errors. In my case: sudden traffic spike and long processing time. (Cold start is short for me, since it's in Go.)
I was able to resolve this on my service by raising the max autoscaling container count from 2 to 10. There really should be no reason that 2 would be even close to too low for the traffic, but I suspect something in the Cloud Run internals was tying up the 2 containers somehow.
Setting the Max Retry Attempts to anything but zero will remedy this, as it did for me.

Celery/SQS task retry gone haywire - how to get rid of it?

We've got Celery/SQS set up for asynchronous task management. We're running Django for our framework. We have a Celery task that has a self.retry() in it, with max_retries set to 15. The retry happens with an exponential backoff and takes 182 hours to complete all 15 retries.
Last week, this task went haywire, I think due to a bug in our code not properly handling a service outage. It resulted in exponential creation (retrying?) of the same celery task. It eventually used up all available memory and the worker crashed. Restarting the worker results in another crash a couple hours later, since all those tasks (and their retries) keep retrying and spawning new retries until we run out of memory again. Ultimately we ended up with nearly 600k tasks created!
We need our workers to ignore all the tasks with a specific celery GUID. Ideally we could just get rid of them for good. I was going to use revoke() but, per documentation (http://docs.celeryproject.org/en/3.1/userguide/workers.html#commands), this is only implemented for Redis and RabbitMQ, not SQS. Furthermore, when I go to the SQS service in the AWS console, it's showing zero messages in flight so it's not like I can just flush it.
Is there a way to delete or revoke a specific message from SQS using the Celery task ID? Or is there another way to fix this problem? Obviously we need to fix our code so we don't get into this situation again, but first we need to get our worker up and running because without it our website has reduced functionality. Thanks!

Is there an AWS / PagerDuty service that will alert me if it's NOT notified

We've got a little Java scheduler running on AWS ECS. It's doing what cron used to do on our old monolith: it fires up (Fargate) tasks in Docker containers. We've got a task that runs every hour, and it's quite important to us. I want to know if it crashes or fails to run for any reason (e.g. the Java scheduler fails, or someone turns the task off).
I'm looking for a service that will alert me if it's not notified. I want to call the notification system every time the script runs successfully. Then if the alert system doesn't get the "OK" notification as expected, it shoots off an alert.
I figure this kind of service must exist, and I don't want to re-invent the wheel trying to build it myself. I guess my question is, what's it called? And where can I go to get that kind of thing? (we're using AWS obviously and we've got a pagerDuty account).
We use this approach for these types of problems. First, the task has to write a timestamp to a file in S3 or EFS. This file is the external evidence that the task ran to completion. Then you need an http based service that will read that file and calculate if the time stamp is valid ie has been updated in the last hour. This could be a simple php or nodejs script. This process is exposed to the public web eg https://example.com/heartbeat.php. This script returns a http response code of 200 if the timestamp file is present and valid, or a 500 if not. Then we use StatusCake to monitor the url, and notify us via its Pager Duty integration if there is an incident. We usually include a message in the response so a human can see the nature of the error.
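The answer suggests a simple PHP or Node script; here is an equivalent sketch in Python/Flask, with placeholder bucket and key names:

    import time
    import boto3
    from botocore.exceptions import ClientError
    from flask import Flask

    app = Flask(__name__)
    s3 = boto3.client("s3")
    MAX_AGE_SECONDS = 3600  # the task is expected to run hourly

    @app.route("/heartbeat")
    def heartbeat():
        try:
            head = s3.head_object(Bucket="my-heartbeats", Key="hourly-task")
        except ClientError:
            return "MISSING: no timestamp file", 500  # task has never run
        # Compare the file's last-modified time against the expected cadence.
        age = time.time() - head["LastModified"].timestamp()
        if age <= MAX_AGE_SECONDS:
            return f"OK: last success {int(age)}s ago", 200
        return f"STALE: last success {int(age)}s ago", 500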
This may seem tedious, but it is foolproof. Any failure anywhere along the line will be immediately notified. StatusCake has a great free service level. This approach can be used to monitor any critical task in same way. We've learned the hard way that critical cron type tasks and processes can fail for any number of reasons, and you want to know before it becomes customer critical. 24x7x365 monitoring of these types of tasks is necessary, and helps us sleep better at night.
Note: We always have a daily system test event that triggers a PagerDuty notification at 9am each day. For the truly paranoid, this assures that PagerDuty itself has not failed in some way, e.g. misconfiguration. Our support team knows that if they don't get a test alert each day, there is a problem in the notification system itself. The tech on duty has to acknowledge the incident as per SOP. If they do not acknowledge, then it escalates to the next tier, and we know we have to have a talk about response times. It keeps people on their toes. This is the final piece to ensure you have robust monitoring infrastructure.
OpsGenie has a heartbeat service which is basically a watchdog timer. You can configure it to call you if you don't ping them in x number of minutes.
Unfortunately I would not recommend them. I have been using them for 4 years and they have changed their account system twice and left my paid account orphaned silently. I have to find a new vendor as soon as I have some free time.