How to stop a compute node with SLURM? - amazon-web-services

I am using SLURM on AWS to manage jobs as part of AWS parallelcluster. I have two questions :
When using scancel *jobid* to cancel a job, the associated node(s) do not stop. How can I achieve that ?
When starting, I made the mistake of not making my script executable so the sbatch *script.sh* worked but the compute node was doing nothing. How could I identify such behaviour and handle it properly ? Is the proper to e.g. stop the idle node after some time for example and output that in a log ? How can I achieve that ?

Check out this page in the docs: https://docs.aws.amazon.com/parallelcluster/latest/ug/autoscaling.html
Bottom line is that instances that have no jobs for a period of time longer than the scaledown_idletime (the default setting is 10 minutes) will get scaled down (terminated) by the cluster, automagically.
You can tweak the setting in the config file when you build your cluster, if 10 mins is too long. Just think about your workload first, because you don't want small delays between jobs to cause you a lot of churn whilst you wait for nodes to die and then get created again shortly after, hence the 10 minute thing.

Related

MWAA Airflow Scaling: what do I do when I have to run frequent & time consuming scripts? (Negsignal.SIGKILL)

I have an MWAA Airflow env in my AWS account. The DAG I am setting up is supposed to read massive data from S3 bucket A, filter what I want and dump the filtered results to S3 bucket B. It needs to read every minute since the data is coming in every minute. Every run processes about 200MB of json data.
My initial setting was using env class mw1.small with 10 worker machines, if I only run the task once in this setting, it takes about 8 minutes to finish each run, but when I start the schedule to run every minute, most of them could not finish, starts to take much longer to run (around 18 mins) and displays the error message:
[2021-09-25 20:33:16,472] {{local_task_job.py:102}} INFO - Task exited with return code Negsignal.SIGKILL
I tried to expand env class to mw1.large with 15 workers, more jobs were able to complete before the error shows up, but still could not catch up with the speed of ingesting every minute. The Negsignal.SIGKILL error would still show before even reaching worker machine max.
At this point, what should I do to scale this? I can imagine opening another Airflow env but that does not really make sense. There must be a way to do it within one env.
I've found the solution to this, for MWAA, edit the environment and under Airflow configuration options, setup these configs
celery.sync_parallelism = 1
celery.worker_autoscale = 1,1
This will make sure your worker machine runs 1 job at a time, preventing multiple jobs to share the worker, hence saving memory and reduces runtime.

Scheduled Task Produces Too Many Tasks

In one of my ECS clusters I have a scheduled Fargate task that's meant to spin up 8 instances of it's given target. However, when the task procs it starts up waaayyyy more than 8 tasks. Sometimes as many as 50. Does anyone know what could be causing this to happen?
Details:
Cron Expression: cron(40 16 ? * 1-5 *)
Target Definition:
For anyone who might run into this problem in the future:
This problem occurred because we had too many tasks running the cluster. As of the writing of this answer AWS set of limit of 50 tasks running in a single cluster. Before the rule triggered there was already close to 50 tasks running. The rule would proc and would start spinning up new tasks trying to get to the desired number (8).
However, due to the limit it would never be able to get 8 because new tasks over the limit would just get shutdown. So it would keep trying, and keep trying, and keep trying to spin up tasks which led to there being a huge pending queue of tasks that would seemingly push (nearly) all of our tasks out of the cluster and we'd be left with way more tasks than we had asked for.
The solution: we just moved the scheduled task into a new cluster to avoid the 50 task limit.

Is there a way to set a walltime on AWS Batch jobs?

Is there a way to set a maximum running time for AWS Batch jobs (or queues)? This is a standard setting in most batch managers, which avoids wasting resources when a job hangs for whatever reason.
As of April, 2018, AWS Batch now supports setting a Job Timeout when submitting a Job, or in the job definition.
https://aws.amazon.com/about-aws/whats-new/2018/04/aws-batch-adds-support-for-automatic-termination-with-job-execution-timeout/
You specify an attemptDurationSeconds parameter, which must be at least 60 seconds, either in your job definition, or when you submit the job. When this number of seconds has passed following the job attempt's startedAt timestamp, AWS Batch terminates the job. On the compute resource, your job's container receives a SIGTERM signal to give your application a chance to shut down gracefully; if the container is still running after 30 seconds, a SIGKILL signal is sent to forcefully shut down the container.
Source: https://docs.aws.amazon.com/batch/latest/userguide/job_timeouts.html
POST /v1/submitjob HTTP/1.1
Content-type: application/json
{
...
"timeout": {
"attemptDurationSeconds": number
}
}
AFAIK there is no feature to do this. However, a workaround was suggested in the forum for a similar question.
One idea is to call Batch as an Activity from Step Functions, pingback
back on a schedule (e.g. every minute) from that job. If it stops
responding then you can detect that situation as a Timeout in the
activity and act accordingly (terminate the job etc.). Not an ideal
solution (especially if the job continues to ping back as a "zombie"),
but it's a start. You'd also likely have to store activity tokens in a
database to trace them to Batch job id.
Alternatively, you split that setup into 2 steps, and schedule a Batch
job from a Lambda in the first state, then pass the Batch job id to
the second step which then polls Batch (from another Lambda) for its
state with Retry and IntervalSeconds (e.g. once every minute, or even
with exponential backoff), and MaxAttempts calculated based on your
timeout. This way, you don't need any external state storage
mechanism, long polling or even a "ping back" from the job (it CAN be
a zombie), but the downside is more steps.
There is no option to set timeout on batch job but you can setup a lambda function that triggers every 1 hour or so and deletes jobs created before say 24 hours.
working with aws for some time now and could not find a way to set a maximum running time for batch jobs.
However there are some alternative way which you could utilize.
AWS Forum
Sadly there is no way to set the limit execution time on AWS Batch.
One solution may be to edit the docker's entry point to schedule the execution time limit.

AWS Elastic Beanstalk Worker timing out after inactivity during long computation

I am trying to use Amazon Elastic Beanstalk to run a very long numerical simulation - up to 20 hours. The code works beautifully when I tell it to do a short, 20 second simulation. However, when running a longer one, I get the error "The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own)".
After browsing the web, it seems to me that the issue is that Elastic Beanstalk allows worker processes to run for 30 minutes at most, and then they time out because the instance has not responded (i.e. finished the simulation). The solution some have proposed is to send a message every 30 seconds or so that "pings" Elastic Beanstalk, letting it know that the simulation is going well so it doesn't time out, which would let me run a long worker process. So I have a few questions:
Is this the correct approach?
If so, what code or configuration would I add to the project to make it stop terminating early?
If not, how can I smoothly run a 12+ hour simulation on AWS or more generally, the cloud?
Add on information
Thank you for the feedback, Rohit. To give some more information, I'm using Python with Flask.
• I am indeed using an Elastic Beanstalk worker tier with SQS queues
• In my code, I'm running a simulation of variable length - from as short as 20 seconds to as long as 20 hours. 99% of the work that Elastic Beanstalk does is running the simulation. The other 1% involves saving results, sending emails, etc.
• The simulation itself involves using generating many random numbers and working with objects that I defined. I use numpy heavily here.
Let me know if I can provide any more information. I really appreciate the help :)
After talking to a friend who's more in the know about this stuff than me, I solved the problem. It's a little sketchy, but got the job done. For future reference, here is an outline of what I did:
1) Wrote a main script that used Amazon's boto library to connect to my SQS queue. Wrote an infinite while loop to poll the queue every 60 seconds. When there's a message on the queue, run a simulation and then continue through with the loop
2) Borrowed a beautiful /etc/init.d/ template to run my script as a daemon (http://blog.scphillips.com/2013/07/getting-a-python-script-to-run-in-the-background-as-a-service-on-boot/)
3) Made my main script and the script in (2) executable
4) Set up a cron job to make sure the script would start back up if it failed.
Once again, thank you Rohit for taking the time to help me out. I'm glad I still got to use Amazon even though Elastic Beanstalk wasn't the right tool for the job
From your question it seems you are running into launches timing out because some commands during launch that run on your instance take more than 30 minutes.
As explained here, you can adjust the Timeout option in the aws:elasticbeanstalk:command namespace. This can have values between 1 and 1800. This means if your commands finish within 30 minutes you won't see this error. The commands might eventually finish as the error message says but since Elastic Beanstalk has not received a response within the specified period it does not know what is going on your instance.
It would be helpful if you could add more details about your usecase. What commands you are running during startup? Apparently you are using ebextensions to launch commands which take a long time. Is it possible to run those commands in the background or do you need these commands to run during server startup?
If you are running a Tomcat web app you could also use something like servlet init method to run app bootstrapping code. This code can take however long it needs without giving you this error message.
Unfortunately, there is no way to 'process a message' from an SQS queue for more than 12 hours (see the description of ChangeVisibilityTimeout).
With that being the case, this approach doesn't fit your application well. I have ran into the same problem.
The correct way to do this: I don't know. However, I would suggest an alternate approach where you grab a message off of your queue, spin off a thread or process to run your long running simulation, and then delete the message (signaling successful processing). In this approach, be careful of spinning off too many threads on one machine and also be wary of machines shutting down before the simulation has ended, because the queue message has already been deleted.
Final note: your question is excellently worded and sufficiently detailed :)
For those looking to run jobs shorter than 10 hours, it needs to be mentioned that the current inactivity timeout limit is 36000 seconds, so exactly 10 hours and not anymore 30 minutes, like mentioned in posts all over the web (which led me to think a workaround like described above is needed).
Check out the docs: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html
A very nice write-up can be found here: https://dev.to/rizasaputra/understanding-aws-elastic-beanstalk-worker-timeout-42hi

Auto scaling batch jobs on Elastic Beanstalk

I am trying to set up a scalable background image processing using beanstalk.
My setup is the following:
Application server (running on Elastic Beanstalk) receives a file, puts it on S3 and sends a request to process it over SQS.
Worker server (also running on Elastic Beanstalk) polls the SQS queue, takes the request, load original image from S3, processes it resulting in 10 different variants and stores them back on S3.
These upload events are happening at a rate of about 1-2 batches per day, 20-40 pics each batch, at unpredictable times.
Problem:
I am currently using one micro-instance for the worker. Generating one variant of the picture can take anywhere from 3 seconds to 25-30 (it seems first ones are done in 3, but then micro instance slows down, I think this is by its 2 second bursty workload design). Anyway, when I upload 30 pictures that means the job takes: 30 pics * 10 variants each * 30 seconds = 2.5 hours to process??!?!
Obviously this is unacceptable, I tried using "small" instance for that, the performance is consistent there, but its about 5 seconds per variant, so still 30*10*5 = 26 minutes per batch. Still not really acceptable.
What is the best way to attack this problem which will get fastest results and will be price efficient at the same time?
Solutions I can think of:
Rely on beanstalk auto-scaling. I've tried that, setting up auto scaling based on CPU utilization. That seems very slow to react and unreliable. I've tried setting measurement time to 1 minute, and breach duration at 1 minute with thresholds of 70% to go up and 30% to go down with 1 increments. It takes the system a while to scale up and then a while to scale down, I can probably fine tune it, but it still feels weird. Ideally I would like to get a faster machine than micro (small, medium?) to use for these spikes of work, but with beanstalk that means I need to run at least one all the time, since most of the time the system is idle that doesn't make any sense price-wise.
Abandon beanstalk for the worker, implement my own monitor of of the SQS queue running on a micro, and let it fire up larger machine(or group of larger machines) when there are enough pending messages in the queue, terminate them the moment we detect queue is idle. That seems like a lot of work, unless there is a solution for this ready out there. In any case, I lose the benefits of beanstalk of deploying the code through git, managing environments etc.
I don't like any of these two solutions
Is there any other nice approach I am missing?
Thanks
CPU utilization on a micro instance is probably not the best metric to use for autoscaling in this case.
Length of the SQS queue would probably be the better metric to use, and the one that makes the most natural sense.
Needless to say, if you can budget for a bigger base-line machine everything would run that much faster.