MWAA Airflow Scaling: what do I do when I have to run frequent & time consuming scripts? (Negsignal.SIGKILL) - amazon-web-services

I have an MWAA Airflow environment in my AWS account. The DAG I am setting up is supposed to read a large amount of data from S3 bucket A, filter out what I want, and dump the filtered results to S3 bucket B. It needs to run every minute since new data arrives every minute. Every run processes about 200MB of JSON data.
My initial setup used environment class mw1.small with 10 workers. If I run the task only once in this setup, it takes about 8 minutes to finish. But once I schedule it to run every minute, most runs cannot finish; they take much longer (around 18 minutes) and display this error message:
[2021-09-25 20:33:16,472] {{local_task_job.py:102}} INFO - Task exited with return code Negsignal.SIGKILL
I tried upgrading the environment class to mw1.large with 15 workers. More jobs completed before the error showed up, but the DAG still could not keep up with data arriving every minute. The Negsignal.SIGKILL error still appeared even before the worker count reached its maximum.
At this point, what should I do to scale this? I can imagine opening another Airflow env but that does not really make sense. There must be a way to do it within one env.

I've found the solution to this. For MWAA, edit the environment and, under Airflow configuration options, set these options:
celery.sync_parallelism = 1
celery.worker_autoscale = 1,1
This makes sure each worker machine runs one job at a time, so multiple jobs no longer share a worker, which saves memory and reduces runtime.
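If you would rather set these options from a script than from the console, something like the following might work. It is only a sketch: it assumes boto3 is installed with credentials configured, the environment name `my-mwaa-env` is made up, and it does not merge with options already set on the environment, so check what you have there first.

```python
def single_task_worker_options():
    """The two Airflow configuration options from the answer above."""
    return {
        "celery.sync_parallelism": "1",
        "celery.worker_autoscale": "1,1",
    }


def apply_options(environment_name, options, client=None):
    """Send the options to MWAA. `client` can be injected for testing."""
    if client is None:
        import boto3  # imported lazily so the helper above works without boto3
        client = boto3.client("mwaa")
    return client.update_environment(
        Name=environment_name,
        AirflowConfigurationOptions=options,
    )
```

Usage would be `apply_options("my-mwaa-env", single_task_worker_options())`. Updating an environment takes a while, same as when you edit it in the console.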

Related

AWS Glue job run not respecting Timeout and not stopping

I am running AWS Glue jobs using PySpark. They have a Timeout of 1440 minutes (24 hours) set, as visible in the screenshot. Nevertheless, the job keeps running past those 24 hours.
When this particular job had been running for over 5 days I stopped it manually (clicking stop icon in column "Run status" in GUI visible on the screenshot). However, since then (it has been over 2 days) it still hasn't stopped - the "Run status" is Stopping, not Stopped.
Additionally, after about 4 hours of running, new logs (column "Logs") in CloudWatch regarding this Job Run stopped appearing (in my PySpark script I have print() statements which regularly and often log extra data). Also, last error log in CloudWatch (column "Error logs") has been written 24 seconds after the date of the newest log in "Logs".
This behaviour continues for multiple jobs.
My questions are:
1) What could be the reasons for Job Runs not obeying the set Timeout value? How can I fix that?
2) Why is the newest log from 4 hours after the Job Run started, when logs should appear regularly throughout the 24 hours of the (desired) duration of the Job Run?
3) Why don't the Job Runs stop when I try to stop them manually? How can they be stopped?
Thank you in advance for your advice and hints.

How to stop a compute node with SLURM?

I am using SLURM on AWS to manage jobs as part of AWS ParallelCluster. I have two questions:
When using scancel *jobid* to cancel a job, the associated node(s) do not stop. How can I achieve that?
When starting out, I made the mistake of not making my script executable, so sbatch *script.sh* worked but the compute node was doing nothing. How could I identify such behaviour and handle it properly? Is the proper approach to stop the idle node after some time and record that in a log? How can I achieve that?
Check out this page in the docs: https://docs.aws.amazon.com/parallelcluster/latest/ug/autoscaling.html
Bottom line is that instances that have no jobs for a period of time longer than the scaledown_idletime (the default setting is 10 minutes) will get scaled down (terminated) by the cluster, automagically.
You can tweak the setting in the config file when you build your cluster, if 10 mins is too long. Just think about your workload first, because you don't want small delays between jobs to cause you a lot of churn whilst you wait for nodes to die and then get created again shortly after, hence the 10 minute thing.

Is there a way to set a walltime on AWS Batch jobs?

Is there a way to set a maximum running time for AWS Batch jobs (or queues)? This is a standard setting in most batch managers, which avoids wasting resources when a job hangs for whatever reason.
As of April 2018, AWS Batch supports setting a Job Timeout when submitting a job, or in the job definition.
https://aws.amazon.com/about-aws/whats-new/2018/04/aws-batch-adds-support-for-automatic-termination-with-job-execution-timeout/
You specify an attemptDurationSeconds parameter, which must be at least 60 seconds, either in your job definition, or when you submit the job. When this number of seconds has passed following the job attempt's startedAt timestamp, AWS Batch terminates the job. On the compute resource, your job's container receives a SIGTERM signal to give your application a chance to shut down gracefully; if the container is still running after 30 seconds, a SIGKILL signal is sent to forcefully shut down the container.
Source: https://docs.aws.amazon.com/batch/latest/userguide/job_timeouts.html
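Since the container gets a SIGTERM and then only 30 seconds before the SIGKILL, the job itself has to react to the signal. Here is a rough sketch of how a Python job could do that. The names `stop_requested`, `install_handler` and `run` are made up for illustration, and it assumes your work can be split into units short enough to finish inside the grace period.

```python
import signal
import threading

stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Only record the request; the work loop decides when to stop."""
    stop_requested.set()


def install_handler():
    """Register the handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run(work_items, process):
    """Process items until finished or until SIGTERM arrives.

    Returns the number of items completed.
    """
    done = 0
    for item in work_items:
        if stop_requested.is_set():
            break  # leave the loop so cleanup can finish within the grace period
        process(item)
        done += 1
    return done
```

Call `install_handler()` once at startup, then `run(...)`. Anything you need to flush or save should happen right after `run` returns.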
POST /v1/submitjob HTTP/1.1
Content-type: application/json
{
    ...
    "timeout": {
        "attemptDurationSeconds": number
    }
}
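The same request through boto3 could look roughly like this. The helper name and the job, queue and definition names in the usage line are placeholders; the `timeout` / `attemptDurationSeconds` structure and the 60-second minimum are taken from the documentation quoted above.

```python
def build_submit_job_request(job_name, job_queue, job_definition, timeout_seconds):
    """Build keyword arguments for the Batch SubmitJob call with a job timeout."""
    if timeout_seconds < 60:
        raise ValueError("attemptDurationSeconds must be at least 60")
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "timeout": {"attemptDurationSeconds": timeout_seconds},
    }
```

With boto3 installed you would then call `boto3.client("batch").submit_job(**build_submit_job_request("my-job", "my-queue", "my-job-def", 3600))`.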
AFAIK there is no feature to do this. However, a workaround was suggested in the forum for a similar question.
One idea is to call Batch as an Activity from Step Functions and ping back on a schedule (e.g. every minute) from that job. If it stops responding then you can detect that situation as a Timeout in the activity and act accordingly (terminate the job etc.). Not an ideal solution (especially if the job continues to ping back as a "zombie"), but it's a start. You'd also likely have to store activity tokens in a database to trace them to Batch job id.
Alternatively, you split that setup into 2 steps, and schedule a Batch job from a Lambda in the first state, then pass the Batch job id to the second step which then polls Batch (from another Lambda) for its state with Retry and IntervalSeconds (e.g. once every minute, or even with exponential backoff), and MaxAttempts calculated based on your timeout. This way, you don't need any external state storage mechanism, long polling or even a "ping back" from the job (it CAN be a zombie), but the downside is more steps.
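To make the "MaxAttempts calculated based on your timeout" part concrete, here is a small sketch that builds the Retry entry for the polling state. It assumes a fixed polling interval (BackoffRate of 1.0) and a custom error name `JobStillRunning` that the polling Lambda would raise while the job has not finished; that error name is my own invention.

```python
import math


def polling_retry_block(timeout_seconds, interval_seconds=60,
                        error_name="JobStillRunning"):
    """Build a Step Functions Retry entry that polls until the timeout is used up."""
    max_attempts = math.ceil(timeout_seconds / interval_seconds)
    return {
        "ErrorEquals": [error_name],
        "IntervalSeconds": interval_seconds,
        "MaxAttempts": max_attempts,
        "BackoffRate": 1.0,  # fixed interval; raise this for exponential backoff
    }
```

With exponential backoff the attempts are no longer evenly spaced, so MaxAttempts would have to be worked out from the growing intervals instead of a simple division.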
There is no option to set a timeout on a Batch job, but you can set up a Lambda function that triggers every hour or so and deletes jobs created more than, say, 24 hours ago.
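A rough sketch of what such a Lambda could look like, reading "deletes" as calling TerminateJob on running jobs. The queue name `my-job-queue` is a placeholder, and for brevity it only looks at the first page of results from `list_jobs`, so a busy queue would need pagination on top of this.

```python
import time

MAX_AGE_MS = 24 * 60 * 60 * 1000  # 24 hours, as suggested above


def stale_job_ids(job_summaries, now_ms, max_age_ms=MAX_AGE_MS):
    """Return ids of jobs whose createdAt (epoch milliseconds) is too old."""
    return [
        job["jobId"]
        for job in job_summaries
        if now_ms - job["createdAt"] > max_age_ms
    ]


def handler(event, context, client=None, queue="my-job-queue"):
    """Lambda entry point: terminate running jobs older than the limit."""
    if client is None:
        import boto3  # available by default in the Lambda Python runtime
        client = boto3.client("batch")
    now_ms = int(time.time() * 1000)
    page = client.list_jobs(jobQueue=queue, jobStatus="RUNNING")
    stale = stale_job_ids(page["jobSummaryList"], now_ms)
    for job_id in stale:
        client.terminate_job(jobId=job_id, reason="Exceeded maximum running time")
    return stale
```

The hourly trigger itself would be a scheduled rule pointing at this function.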
I have been working with AWS for some time now and could not find a way to set a maximum running time for Batch jobs.
However, there are some alternatives you could use.
AWS Forum
Sadly there is no way to set the limit execution time on AWS Batch.
One solution may be to edit the docker's entry point to schedule the execution time limit.
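For the entry point idea, one possible shape is a small Python wrapper that runs the real command with a time limit. This is only a sketch; `run_with_limit` is a made-up name, and the exit code 124 is borrowed from the GNU `timeout` convention rather than anything Batch requires.

```python
import subprocess
import sys


def run_with_limit(command, limit_seconds):
    """Run `command`; return its exit code, or 124 if it hit the time limit."""
    try:
        return subprocess.run(command, timeout=limit_seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point
        return 124
```

The container entry point would then be a script that ends with something like `sys.exit(run_with_limit(sys.argv[1:], 3600))`, so a non-zero exit code marks the job as failed when the limit is hit.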

How is it that a mapreduce pipeline can run longer than 10 minutes?

MapReduce tasks are run within a parent pipeline, and of course we all know they can run for a very long time. But at the same time, the pipeline api documents that a pipeline must complete within 10 minutes (https://github.com/GoogleCloudPlatform/appengine-pipelines/wiki/Python). What is the proper way to understand this?
Thanks.
That pipeline documentation is really old... when it was written, tasks were limited to 10 minutes. Now you can configure a non-default module (formerly called a "backend") with basic or manual scaling, which allows a task to run for up to 24 hours:
https://cloud.google.com/appengine/docs/python/modules/#Python_Instance_scaling_and_class
(NOTE: if you run a task on an auto-scaled module, it will still be limited to 10-mins)
The entire pipeline doesn't have to be limited to 24hrs though. The "root" pipeline (the first task that runs) can yield many child pipelines, and those each can further yield other pipelines... each pipeline is a task that has to run within the allotted time (10mins or 24hrs)... when it is done, it signals the parent to wake-up and finish... so the overall pipeline could run for days or months or whatever
We have our app split into two modules, one for the front-end (default, auto-scaled) that handles web requests, and one for the "back end" (basic scaling) that runs all of our tasks

AWS Elastic Beanstalk Worker timing out after inactivity during long computation

I am trying to use Amazon Elastic Beanstalk to run a very long numerical simulation - up to 20 hours. The code works beautifully when I tell it to do a short, 20 second simulation. However, when running a longer one, I get the error "The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own)".
After browsing the web, it seems to me that the issue is that Elastic Beanstalk allows worker processes to run for 30 minutes at most, and then they time out because the instance has not responded (i.e. finished the simulation). The solution some have proposed is to send a message every 30 seconds or so that "pings" Elastic Beanstalk, letting it know that the simulation is going well so it doesn't time out, which would let me run a long worker process. So I have a few questions:
Is this the correct approach?
If so, what code or configuration would I add to the project to make it stop terminating early?
If not, how can I smoothly run a 12+ hour simulation on AWS or more generally, the cloud?
Add on information
Thank you for the feedback, Rohit. To give some more information, I'm using Python with Flask.
• I am indeed using an Elastic Beanstalk worker tier with SQS queues
• In my code, I'm running a simulation of variable length - from as short as 20 seconds to as long as 20 hours. 99% of the work that Elastic Beanstalk does is running the simulation. The other 1% involves saving results, sending emails, etc.
• The simulation itself involves generating many random numbers and working with objects that I defined. I use numpy heavily here.
Let me know if I can provide any more information. I really appreciate the help :)
After talking to a friend who's more in the know about this stuff than me, I solved the problem. It's a little sketchy, but got the job done. For future reference, here is an outline of what I did:
1) Wrote a main script that used Amazon's boto library to connect to my SQS queue. Wrote an infinite while loop to poll the queue every 60 seconds. When there's a message on the queue, run a simulation and then continue through with the loop
2) Borrowed a beautiful /etc/init.d/ template to run my script as a daemon (http://blog.scphillips.com/2013/07/getting-a-python-script-to-run-in-the-background-as-a-service-on-boot/)
3) Made my main script and the script in (2) executable
4) Set up a cron job to make sure the script would start back up if it failed.
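Step 1 could be sketched roughly as below. Note this uses boto3 rather than the older boto library mentioned above, and the function names are illustrative. It deletes the message after the simulation finishes, so a simulation that outlasts the queue's visibility timeout may see the message delivered again.

```python
import time


def poll_once(sqs, queue_url, run_simulation):
    """Fetch at most one message, run the simulation for it, then delete it.

    Returns the number of messages handled (0 or 1).
    """
    response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
    messages = response.get("Messages", [])
    for message in messages:
        run_simulation(message["Body"])
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
    return len(messages)


def main(queue_url, run_simulation, poll_interval=60):
    """Infinite loop: poll the queue, sleeping between empty polls."""
    import boto3  # imported here so poll_once can be tested without boto3
    sqs = boto3.client("sqs")
    while True:
        if poll_once(sqs, queue_url, run_simulation) == 0:
            time.sleep(poll_interval)
```

The init.d script from step 2 would start `main` with the real queue URL and simulation function.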
Once again, thank you Rohit for taking the time to help me out. I'm glad I still got to use Amazon even though Elastic Beanstalk wasn't the right tool for the job
From your question it seems you are running into launches timing out because some commands during launch that run on your instance take more than 30 minutes.
As explained here, you can adjust the Timeout option in the aws:elasticbeanstalk:command namespace. This can have values between 1 and 1800. This means if your commands finish within 30 minutes you won't see this error. The commands might eventually finish, as the error message says, but since Elastic Beanstalk has not received a response within the specified period it does not know what is going on on your instance.
It would be helpful if you could add more details about your use case. What commands are you running during startup? Apparently you are using ebextensions to launch commands which take a long time. Is it possible to run those commands in the background, or do you need them to run during server startup?
If you are running a Tomcat web app you could also use something like servlet init method to run app bootstrapping code. This code can take however long it needs without giving you this error message.
Unfortunately, there is no way to 'process a message' from an SQS queue for more than 12 hours (see the description of ChangeVisibilityTimeout).
With that being the case, this approach doesn't fit your application well. I have run into the same problem.
The correct way to do this: I don't know. However, I would suggest an alternate approach where you grab a message off of your queue, spin off a thread or process to run your long running simulation, and then delete the message (signaling successful processing). In this approach, be careful of spinning off too many threads on one machine and also be wary of machines shutting down before the simulation has ended, because the queue message has already been deleted.
Final note: your question is excellently worded and sufficiently detailed :)
For those looking to run jobs shorter than 10 hours: the current inactivity timeout limit is 36000 seconds, i.e. exactly 10 hours, and no longer 30 minutes as mentioned in posts all over the web (which led me to think a workaround like the one described above was needed).
Check out the docs: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html
A very nice write-up can be found here: https://dev.to/rizasaputra/understanding-aws-elastic-beanstalk-worker-timeout-42hi
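If you want to raise the inactivity timeout from code, a sketch along these lines might do it. It assumes the worker tier option `InactivityTimeout` in the `aws:elasticbeanstalk:sqsd` namespace, as described in the docs linked above; the helper name and the environment name `my-worker-env` in the usage line are hypothetical.

```python
def worker_timeout_settings(inactivity_seconds):
    """Build OptionSettings that set the worker daemon's inactivity timeout."""
    if not 1 <= inactivity_seconds <= 36000:
        raise ValueError("InactivityTimeout must be between 1 and 36000 seconds")
    return [
        {
            "Namespace": "aws:elasticbeanstalk:sqsd",
            "OptionName": "InactivityTimeout",
            "Value": str(inactivity_seconds),
        }
    ]
```

With boto3 this would be passed as `boto3.client("elasticbeanstalk").update_environment(EnvironmentName="my-worker-env", OptionSettings=worker_timeout_settings(36000))`. The visibility timeout of the queue probably needs to be raised to match, otherwise the message may be redelivered while the job is still running.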