In one of my ECS clusters I have a scheduled Fargate task that's meant to spin up 8 instances of its given target. However, when the rule procs it starts up way more than 8 tasks, sometimes as many as 50. Does anyone know what could be causing this to happen?
Details:
Cron Expression: cron(40 16 ? * 1-5 *)
Target Definition:
For anyone who might run into this problem in the future:
This problem occurred because we had too many tasks running in the cluster. As of the writing of this answer, AWS sets a limit of 50 tasks running in a single cluster. Before the rule triggered, there were already close to 50 tasks running. The rule would proc and start spinning up new tasks, trying to reach the desired number (8).
However, due to the limit it could never reach 8, because new tasks over the limit would just get shut down. So it would keep trying, and keep trying, and keep trying to spin up tasks, which led to a huge pending queue of tasks that would seemingly push (nearly) all of our existing tasks out of the cluster, leaving us with way more tasks than we had asked for.
The solution: we just moved the scheduled task into a new cluster to avoid the 50-task limit.
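If you want to check whether a cluster is bumping against that ceiling, the task counters are visible through the API. A quick boto3 sketch, where the cluster name is a placeholder:

import boto3

ecs = boto3.client("ecs")
cluster = ecs.describe_clusters(clusters=["my-cluster"])["clusters"][0]  # hypothetical cluster name
# A running + pending count near the limit explains the pileup described above.
print(cluster["runningTasksCount"], cluster["pendingTasksCount"])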
I have an MWAA Airflow environment in my AWS account. The DAG I am setting up is supposed to read massive data from S3 bucket A, filter what I want, and dump the filtered results to S3 bucket B. It needs to run every minute since the data is coming in every minute. Every run processes about 200 MB of JSON data.
My initial setup used environment class mw1.small with 10 worker machines. If I only run the task once in this setting, it takes about 8 minutes to finish each run. But when I start the schedule to run every minute, most runs cannot finish, start to take much longer (around 18 mins), and display the error message:
[2021-09-25 20:33:16,472] {{local_task_job.py:102}} INFO - Task exited with return code Negsignal.SIGKILL
I tried upgrading the environment class to mw1.large with 15 workers. More jobs were able to complete before the error showed up, but it still could not keep up with the speed of ingesting every minute. The Negsignal.SIGKILL error would still show up before the worker count even reached its maximum.
At this point, what should I do to scale this? I can imagine opening another Airflow environment, but that does not really make sense. There must be a way to do it within one environment.
I've found the solution to this. For MWAA, edit the environment and, under Airflow configuration options, set these configs:
celery.sync_parallelism = 1
celery.worker_autoscale = 1,1
This will make sure your worker machine runs one job at a time, preventing multiple jobs from sharing the worker, hence saving memory and reducing runtime.
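If you manage the environment from code instead of the console, the same options can be applied through the UpdateEnvironment API. A minimal sketch with boto3, where the environment name is a placeholder:

import boto3

mwaa = boto3.client("mwaa")
mwaa.update_environment(
    Name="my-mwaa-env",  # hypothetical environment name
    AirflowConfigurationOptions={
        "celery.sync_parallelism": "1",
        "celery.worker_autoscale": "1,1",
    },
)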
I am using Slurm on AWS to manage jobs as part of AWS ParallelCluster. I have two questions:
When using scancel *jobid* to cancel a job, the associated node(s) do not stop. How can I achieve that?
When starting out, I made the mistake of not making my script executable, so sbatch *script.sh* worked but the compute node was doing nothing. How could I identify such behaviour and handle it properly? Is the proper approach to, e.g., stop the idle node after some time and output that in a log? How can I achieve that?
Check out this page in the docs: https://docs.aws.amazon.com/parallelcluster/latest/ug/autoscaling.html
Bottom line is that instances that have had no jobs for longer than the scaledown_idletime (the default is 10 minutes) will get scaled down (terminated) by the cluster, automagically.
You can tweak the setting in the config file when you build your cluster if 10 minutes is too long. Just think about your workload first, because you don't want small delays between jobs to cause a lot of churn while you wait for nodes to die and then get created again shortly after; hence the 10-minute default.
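For reference, in a ParallelCluster 2.x config file the tweak looks roughly like this (the section labels are illustrative):

[cluster default]
scaling_settings = custom

[scaling custom]
scaledown_idletime = 5   # minutes; the default is 10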
For a website I'm developing on AWS, a user can submit a large job (e.g., select a large number of items and ask to update them all in some way). We don't want to limit the size of the jobs users submit, so a job can in theory run for a very long time and require a large amount of memory (this rules out AWS Lambda as a compute engine option). We want jobs to be as independent from one another as possible, so we chose to run each job in its own container in Amazon ECS.
What we currently do when a user submits a job request is send a message with a job id/reference to an SQS queue, have an AWS Lambda poll that queue, and upon receiving a message have the Lambda start an ECS task (SQS -> Lambda -> ECS). The problem is that a new ECS task is started with each request, so a new container must be booted up, which can take minutes. This latency is directly visible to the user and is particularly unacceptable when the user's job is not even particularly large yet they still wait minutes for the container to boot up. Additionally, the cost of a constantly running container or two would not be too problematic.
I've been toying with some ideas for updating this flow.
Attempt 1:
In this updated flow we'd create an ECS task that looks like the following:
message = null;
while (message == null) {
    message = pollForMessages();
}
processMessage(message);
// task finishes, and container can be brought down
We remove the Lambda from the flow and just have SQS -> ECS rather than SQS -> Lambda -> ECS. In this case there would be no cold start, assuming a container is already up and polling for messages. We could set the minimum number of tasks we want running to a number > 0 to ensure all messages are processed at some point. However, this suffers from the problem that it would not auto-scale as the number of messages in the queue increases, so something needs to spawn more containers when traffic increases.
Attempt 2:
In this updated flow we'd create an ECS task that looks like the following:
message = null;
while (message == null) {
    message = pollForMessages();
}
if (number of running tasks < number of messages in queue) {
    spawnMoreContainers();
}
processMessage(message);
// task finishes, and container can be brought down
This comes with the issue that we could end up over-provisioning containers if multiple containers see that there are more messages in the queue than tasks running. Since these tasks run forever until a message is processed, this could result in a large unnecessary cost. It could also under-provision containers: if a task sees that the number of running tasks >= the number of messages, but those running tasks are already busy processing messages, they will not take any of these messages off the queue, and we may end up with messages that wait a very long time to be processed.
Attempt 3:
message = null;
while (message == null) {
    message = pollForMessages();
    if (# of containers > min provisioned && this particular container has been running longer than some timeout) {
        // finish this task so this container can be brought down
        return;
    }
}
if (number of running tasks < number of messages in queue) {
    spawnMoreContainers();
}
processMessage(message);
// task finishes, and container can be brought down
While this may save us some cost compared to Attempt 2, since over-provisioning wouldn't be as much of an issue, there is still the possibility that we could under-provision containers, in which case certain job requests would need to wait for potentially long periods before being processed.
Attempt 4:
We can introduce locking (e.g., https://aws.amazon.com/blogs/database/building-distributed-locks-with-the-dynamodb-lock-client/) to mitigate some of the race conditions. However, we'll always have the issue that a running task is not necessarily a task that is available to pick up messages, and Fargate gives us no way of distinguishing between the two, which makes it difficult to determine how many containers to provision (e.g., we see 5 running containers and 5 messages, but we don't know whether to provision more containers because we don't know whether those containers are already processing a message or waiting for one). Alternatively, we could introduce some mechanism, either an external orchestrator or some logic within the containers plus a data store, to manage the state of these containers.
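For reference, the heart of such a lock can be a single DynamoDB conditional write; the lock client in the linked post builds leases and heartbeats on top of this primitive. A minimal sketch with boto3, where the table name and key are invented for illustration:

import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb")

def try_acquire_lock(lock_id: str) -> bool:
    # Succeeds only if no item with this lock_id exists yet; the table
    # ("container-locks", partition key "lock_id") is hypothetical.
    try:
        ddb.put_item(
            TableName="container-locks",
            Item={"lock_id": {"S": lock_id}},
            ConditionExpression="attribute_not_exists(lock_id)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another container already holds the lock
        raise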
Essentially, to deal with each of these problems the architecture becomes more and more complex, and the implementation would be difficult and error-prone.
It also seems to me like these solutions are reinventing the wheel, and I feel there must be some service out there that has solved this problem already, but I can’t seem to find it.
The suggestions I’ve seen to deal with this are:
Maybe AWS Batch is more suited for this use case - Indeed, AWS Batch might be the more recommended approach for a workload like this, but we don't remove any of the cold-start problem by switching: AWS Batch would still create a new container for each job.
Run the ECS tasks on EC2 rather than Fargate, then cache the container image on the host - With this, we’d be managing our own infrastructure and ideally we’d like this to be serverless.
Have an alarm on the number of messages in the queue and have this alarm trigger a Lambda that then boots up more containers - CloudWatch alarms on SQS metrics have a minimum period of 1 minute. This means the alarm would not be triggered until a minute after we'd received more requests than our provisioned containers can handle. Additionally, we'd have to set up many alarms to scale at different numbers of messages.
I’m wondering if anyone is aware of potential services/frameworks that could make doing this more feasible? Or if anyone has suggestions on alternative architectures?
If you don't mind a slightly slower response to bursts, you may create an Auto Scaling group (I assume there is something similar for ECS). This group can be governed by a custom metric, e.g. queue length divided by the number of workers. A detailed guide is here: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-using-sqs-queue.html
In any case, I'd decouple the scaling decision from the worker code, because there is a varying number of workers that you would need to synchronize. It's much easier to have one overseer that controls how many workers there should be. Because the overseer is not on the critical path to task processing, you don't need to care that much about its uptime. It's OK if it takes a few minutes before it recovers after a failure - the workers are still there, processing at least at some capacity.
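To make that concrete, here is a rough sketch of such an overseer as a scheduled Lambda, assuming the workers run as an ECS service; the cluster, service, and queue names (and the one-message-per-task ratio) are made up:

import boto3

sqs = boto3.client("sqs")
ecs = boto3.client("ecs")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # hypothetical
MESSAGES_PER_TASK = 1  # one long-running job per container

def handler(event, context):
    # Read the backlog and size the worker fleet to match it.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    desired = max(1, backlog // MESSAGES_PER_TASK)  # keep at least one warm worker
    ecs.update_service(
        cluster="jobs-cluster", service="worker-service", desiredCount=desired
    )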
MapReduce tasks are run within a parent pipeline, and of course we all know they can run for a very long time. But at the same time, the Pipeline API documents that a pipeline must complete within 10 minutes (https://github.com/GoogleCloudPlatform/appengine-pipelines/wiki/Python). What is the proper way to understand this?
Thanks.
That pipeline documentation is really old... when it was written, tasks were limited to 10 minutes. Now you can configure a non-default module (these used to be called "backends") using basic/manual scaling, which will allow a task to run for 24 hours.
https://cloud.google.com/appengine/docs/python/modules/#Python_Instance_scaling_and_class
(NOTE: if you run a task on an auto-scaled module, it will still be limited to 10-mins)
The entire pipeline doesn't have to be limited to 24 hours, though. The "root" pipeline (the first task that runs) can yield many child pipelines, and each of those can further yield other pipelines... each pipeline is a task that has to run within the allotted time (10 mins or 24 hrs)... when it is done, it signals the parent to wake up and finish... so the overall pipeline could run for days or months or whatever.
We have our app split into two modules, one for the front-end (default, auto-scaled) that handles web requests, and one for the "back end" (basic scaling) that runs all of our tasks
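For illustration, the back-end module's config might look something like this (module name, instance class, and limits are placeholders, not our actual settings):

# worker.yaml
module: worker
runtime: python27
instance_class: B4
basic_scaling:
  max_instances: 5
  idle_timeout: 10m
handlers:
- url: /.*
  script: worker.app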
I am trying to use Amazon Elastic Beanstalk to run a very long numerical simulation - up to 20 hours. The code works beautifully when I tell it to do a short, 20 second simulation. However, when running a longer one, I get the error "The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own)".
After browsing the web, it seems to me that the issue is that Elastic Beanstalk allows worker processes to run for 30 minutes at most, and then they time out because the instance has not responded (i.e. finished the simulation). The solution some have proposed is to send a message every 30 seconds or so that "pings" Elastic Beanstalk, letting it know that the simulation is going well so it doesn't time out, which would let me run a long worker process. So I have a few questions:
Is this the correct approach?
If so, what code or configuration would I add to the project to make it stop terminating early?
If not, how can I smoothly run a 12+ hour simulation on AWS or more generally, the cloud?
Additional information
Thank you for the feedback, Rohit. To give some more information, I'm using Python with Flask.
• I am indeed using an Elastic Beanstalk worker tier with SQS queues
• In my code, I'm running a simulation of variable length - from as short as 20 seconds to as long as 20 hours. 99% of the work that Elastic Beanstalk does is running the simulation. The other 1% involves saving results, sending emails, etc.
• The simulation itself involves generating many random numbers and working with objects that I defined. I use numpy heavily here.
Let me know if I can provide any more information. I really appreciate the help :)
After talking to a friend who's more in the know about this stuff than me, I solved the problem. It's a little sketchy, but got the job done. For future reference, here is an outline of what I did:
1) Wrote a main script that used Amazon's boto library to connect to my SQS queue, with an infinite while loop that polls the queue every 60 seconds. When there's a message on the queue, run a simulation and then continue through with the loop (a rough sketch follows this list)
2) Borrowed a beautiful /etc/init.d/ template to run my script as a daemon (http://blog.scphillips.com/2013/07/getting-a-python-script-to-run-in-the-background-as-a-service-on-boot/)
3) Made my main script and the script in (2) executable
4) Set up a cron job to make sure the script would start back up if it failed.
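For anyone curious, here is a rough sketch of the main script from step 1, written here with boto3 (the original used the older boto library); the queue name and run_simulation are placeholders:

import time
import boto3

def run_simulation(payload):
    ...  # the long-running numerical simulation (placeholder)

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="simulation-jobs")  # hypothetical queue name

while True:
    for message in queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=20):
        run_simulation(message.body)
        message.delete()  # remove the job once the simulation finishes
        # Note: for very long runs the message can become visible again before
        # deletion; see the answer below about the 12-hour visibility ceiling.
    time.sleep(60)  # poll roughly every 60 seconds, as in the original script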
Once again, thank you Rohit for taking the time to help me out. I'm glad I still got to use Amazon even though Elastic Beanstalk wasn't the right tool for the job
From your question it seems you are running into launches timing out because some commands that run on your instance during launch take more than 30 minutes.
As explained here, you can adjust the Timeout option in the aws:elasticbeanstalk:command namespace. It accepts values between 1 and 1800 (seconds), so if your commands finish within 30 minutes you won't see this error. The commands might eventually finish, as the error message says, but since Elastic Beanstalk has not received a response within the specified period it does not know what is going on in your instance.
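For example, a sketch of raising it to the documented maximum via an .ebextensions file (the file name is arbitrary):

# .ebextensions/command-timeout.config
option_settings:
  aws:elasticbeanstalk:command:
    Timeout: 1800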
It would be helpful if you could add more details about your use case. What commands are you running during startup? Apparently you are using ebextensions to launch commands that take a long time. Is it possible to run those commands in the background, or do you need them to run during server startup?
If you are running a Tomcat web app you could also use something like servlet init method to run app bootstrapping code. This code can take however long it needs without giving you this error message.
Unfortunately, there is no way to 'process a message' from an SQS queue for more than 12 hours (see the description of ChangeMessageVisibility and its visibility timeout limit).
With that being the case, this approach doesn't fit your application well. I have run into the same problem.
The correct way to do this: I don't know. However, I would suggest an alternate approach where you grab a message off of your queue, spin off a thread or process to run your long-running simulation, and then delete the message (signaling successful processing); a rough sketch is below. In this approach, be careful of spinning off too many threads on one machine, and also be wary of machines shutting down before the simulation has ended, because the queue message has already been deleted.
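A rough sketch of that alternate approach with boto3 and multiprocessing; the queue URL and run_simulation are placeholders, and note the trade-off flagged above: deleting first means a crashed machine loses the job.

import boto3
from multiprocessing import Process

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/simulations"  # hypothetical

def run_simulation(payload):
    ...  # the 12+ hour numerical simulation (placeholder)

resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    # Delete up front: the simulation outlives the 12-hour visibility ceiling.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    Process(target=run_simulation, args=(msg["Body"],)).start()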
Final note: your question is excellently worded and sufficiently detailed :)
For those looking to run jobs shorter than 10 hours, it needs to be mentioned that the current inactivity timeout limit is 36000 seconds, i.e. exactly 10 hours, and no longer the 30 minutes mentioned in posts all over the web (which led me to think a workaround like the one described above was needed). A config sketch follows the links below.
Check out the docs: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html
A very nice write-up can be found here: https://dev.to/rizasaputra/understanding-aws-elastic-beanstalk-worker-timeout-42hi
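In case it helps, the setting lives in the worker tier's aws:elasticbeanstalk:sqsd namespace; a sketch via .ebextensions (the file name is arbitrary, and I'm assuming you want the visibility timeout to match):

# .ebextensions/worker.config
option_settings:
  aws:elasticbeanstalk:sqsd:
    InactivityTimeout: 36000
    VisibilityTimeout: 36000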