AWS SWF cancelling child workflows automatically

I have an AWS SWF workflow which creates many child workflows at runtime based on the number of input files: for x input files, it will create x child workflows. It works fine when the number of input files is around 400, successfully creating and executing 400 child workflows.
The issue is that when my input has around 500 files or more, it starts that many child workflows successfully but then automatically cancels some of them. I have tried different configurations but nothing worked.
I believe the AWS limit on the number of child workflows is 1000, so that should not be the issue.
Current child workflow config:
  Execution Start To Close Timeout: 2 hours 1 minute
  Task Start To Close Timeout: 1 minute 30 seconds
Main workflow config:
  Execution Start To Close Timeout: 9 hours
  Task Start To Close Timeout: 1 minute 30 seconds

My guess is that some exception is thrown in the workflow code, which by default cancels workflows in the same cancellation scope. Read the TryCatchFinally documentation for more info about the cancellation semantics.
In general I wouldn't recommend starting that many child workflows in SWF; you can always fan out hierarchically. For example, 30 children that each start 30 children of their own gives you 900 workflows.
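To make the hierarchical fan-out concrete, here is a minimal boto3-level sketch of a decider response that starts one intermediate child per batch of 30 files instead of one child per file. The ProcessBatch workflow type and the surrounding task-token plumbing are hypothetical; with the Flow Framework the same idea would be expressed in Java.

import boto3

FANOUT = 30
swf = boto3.client("swf")

def start_batched_children(task_token, input_files):
    # Split the files into chunks of FANOUT; each chunk becomes one
    # intermediate child workflow, which in turn fans out to the leaves.
    chunks = [input_files[i:i + FANOUT] for i in range(0, len(input_files), FANOUT)]
    decisions = [
        {
            "decisionType": "StartChildWorkflowExecution",
            "startChildWorkflowExecutionDecisionAttributes": {
                "workflowType": {"name": "ProcessBatch", "version": "1.0"},  # hypothetical type
                "workflowId": "batch-{}".format(i),
                "input": ",".join(chunk),
                "executionStartToCloseTimeout": "7260",  # 2 h 1 min, as in the question
                "taskStartToCloseTimeout": "90",
            },
        }
        for i, chunk in enumerate(chunks)
    ]
    swf.respond_decision_task_completed(taskToken=task_token, decisions=decisions)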

Related

AWS Glue job run not respecting Timeout and not stopping

I am running AWS Glue jobs using PySpark. They have a Timeout set (as visible on the screenshot) of 1440 minutes, which is 24 hours. Nevertheless, the jobs continue working past those 24 hours.
When this particular job had been running for over 5 days I stopped it manually (by clicking the stop icon in the "Run status" column of the GUI visible on the screenshot). However, since then (it has been over 2 days) it still hasn't stopped: the "Run status" is Stopping, not Stopped.
Additionally, after about 4 hours of running, new logs (column "Logs") in CloudWatch for this Job Run stopped appearing (in my PySpark script I have print() statements which regularly and often log extra data). Also, the last error log in CloudWatch (column "Error logs") was written 24 seconds after the date of the newest log in "Logs".
This behaviour continues for multiple jobs.
My questions are:
What could be the reasons for Job Runs not obeying the set Timeout value? How can I fix that?
Why is the newest log from about 4 hours after the Job Run started, when logs should appear regularly throughout the 24 hours of the (desired) duration of the Job Run?
Why don't the Job Runs stop when I try to stop them manually? How can they be stopped?
Thank you in advance for your advice and hints.
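As an aside, a stuck run can also be stopped via the API instead of the console; a minimal boto3 sketch, assuming a hypothetical job name:

import boto3

glue = boto3.client("glue")
JOB_NAME = "my-etl-job"  # hypothetical job name

# Find runs that are still in a non-terminal state and ask Glue to stop them.
runs = glue.get_job_runs(JobName=JOB_NAME)["JobRuns"]
running = [r["Id"] for r in runs if r["JobRunState"] == "RUNNING"]
if running:
    resp = glue.batch_stop_job_run(JobName=JOB_NAME, JobRunIds=running)
    print(resp["SuccessfulSubmissions"], resp["Errors"])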

MWAA Airflow Scaling: what do I do when I have to run frequent & time consuming scripts? (Negsignal.SIGKILL)

I have an MWAA Airflow env in my AWS account. The DAG I am setting up is supposed to read massive data from S3 bucket A, filter what I want, and dump the filtered results to S3 bucket B. It needs to run every minute since the data comes in every minute. Every run processes about 200MB of JSON data.
My initial setting was env class mw1.small with 10 worker machines. If I run the task only once in this setting, it takes about 8 minutes to finish. But when I start the schedule to run every minute, most runs cannot finish, start to take much longer (around 18 minutes), and display the error message:
[2021-09-25 20:33:16,472] {{local_task_job.py:102}} INFO - Task exited with return code Negsignal.SIGKILL
I tried expanding the env class to mw1.large with 15 workers; more jobs were able to complete before the error showed up, but it still could not catch up with the speed of ingesting every minute. The Negsignal.SIGKILL error would still show before even reaching the worker machine max.
At this point, what should I do to scale this? I can imagine opening another Airflow env, but that does not really make sense. There must be a way to do it within one env.
I've found the solution to this. For MWAA, edit the environment and, under Airflow configuration options, set these configs:
celery.sync_parallelism = 1
celery.worker_autoscale = 1,1
This will make sure each worker machine runs 1 job at a time, preventing multiple jobs from sharing the worker, hence saving memory and reducing runtime.
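If you prefer to apply these options programmatically rather than through the console, a minimal boto3 sketch (the environment name is hypothetical):

import boto3

mwaa = boto3.client("mwaa")
mwaa.update_environment(
    Name="my-airflow-env",  # hypothetical environment name
    AirflowConfigurationOptions={
        "celery.sync_parallelism": "1",
        "celery.worker_autoscale": "1,1",  # min,max tasks per worker
    },
)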

Google cloud task queues not running in parallel

I have a project in Google Cloud where there are 2 task queues: process-request to receive requests and process them, and send-result to send the result of the processed request to another server. They are both running on an instance called remote-processing.
My problem is that I see the tasks being enqueued in send-result, but they are only executed after the process-request queue is empty and has processed all requests.
This is the instance config:

instance_class: B4
basic_scaling:
  max_instances: 8
Here is the queue config:

- name: send-result
  max_concurrent_requests: 20
  rate: 1/s
  retry_parameters:
    task_retry_limit: 10
    min_backoff_seconds: 5
    max_backoff_seconds: 20
  target: remote-processing
- name: process-request
  bucket_size: 50
  max_concurrent_requests: 10
  rate: 10/s
  target: remote-processing
Clarification: I don't need the queues to run in a specific order, but I find it very strange that the instance apparently only runs one queue at a time, so it will only run the tasks in another queue after it is done with the current queue.
Over what period of time is this all happening?
How long does a process-request task take to run vs a send-result task?
One thing that sticks out is that your rate for process-request (10/s) is much higher than your rate for send-result (1/s). So maybe a couple of send-result tasks ARE squeezing through, but the queue then hits its rate cap and has to run process-request tasks instead.
The same note applies to bucket_size. The bucket_size for process-request is huge compared to its rate:
The bucket size limits how fast the queue is processed when many tasks are in the queue and the rate is high. The maximum value for bucket size is 500. This allows you to have a high rate so processing starts shortly after a task is enqueued, but still limit resource usage when many tasks are enqueued in a short period of time.
If you don't specify bucket_size for a queue, the default value is 5. We recommend that you set this to a larger value because the default size might be too small for many use cases: the recommended size is the processing rate divided by 5 (rate/5).
https://cloud.google.com/appengine/docs/standard/python/config/queueref
Also, given max_instances: 8, does a big backlog of work build up in these queues?
Let's try two things:
Set bucket_size and rate to be the same for both process-request and send-result (a sketch follows below). If that fixes it, then start fiddling with the values to get the desired balance.
Bump up max_instances: 8 to see if removing that bottleneck fixes it.
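For the first suggestion, a queue.yaml sketch with the two queues equalized might look like this (the values are illustrative, not a recommendation):

- name: send-result
  bucket_size: 50
  max_concurrent_requests: 20
  rate: 10/s
  retry_parameters:
    task_retry_limit: 10
    min_backoff_seconds: 5
    max_backoff_seconds: 20
  target: remote-processing
- name: process-request
  bucket_size: 50
  max_concurrent_requests: 10
  rate: 10/s
  target: remote-processing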

Concurrency Thread Group showing more samples than defined

I am using a Concurrency Thread Group with the following values:
Target Concurrency: 200,
Ramp-Up Time: 5 min,
Ramp-Up Step Count: 10,
Hold Target Rate Time: 0 min,
Thread Iteration Limit: 1.
I am using a Throughput Controller as a child of the Concurrency Thread Group, with Total Executions, Throughput = 1, and "Per User" selected.
I have 5 HTTP Requests. What I expected is that each HTTP Request would be executed by 200 users, but it shows more than 300.
Can anyone tell me whether my expectation is wrong or my setup is wrong?
What is the best way to do this?
Your expectation is wrong. With regards to your setup, we don't know what you're trying to achieve.
The Concurrency Thread Group maintains the defined concurrency. With a Target Concurrency of 200, a Ramp-Up Time of 5 minutes, and 10 steps, each step adds 200 / 10 = 20 users every 300 / 10 = 30 seconds, so:
JMeter will start with 20 users
In 30 seconds another 20 users will be kicked off, so you will have 40 users
In 60 seconds another 20 users will arrive, so you will have 60 users
etc.
Once started, the threads will begin executing Sampler(s) upside down (or according to the Logic Controllers), and the actual number of requests will depend on your application's response time.
Your "Thread Iteration Limit" setting allows each thread to loop only once, so a thread is stopped once it has executed all the samplers; however, the Concurrency Thread Group will kick off another thread to replace the ended one in order to maintain the defined concurrency.
If you want to limit the total number of executions to 200, you can go for the Throughput Controller, and this way you will have only 200 executions of its children.
Be aware that in the above setup your test will still run for 5 minutes; however, the threads will not be executing samplers after 200 total executions.

Is there a way to set a walltime on AWS Batch jobs?

Is there a way to set a maximum running time for AWS Batch jobs (or queues)? This is a standard setting in most batch managers, which avoids wasting resources when a job hangs for whatever reason.
As of April 2018, AWS Batch supports setting a Job Timeout when submitting a job, or in the job definition.
https://aws.amazon.com/about-aws/whats-new/2018/04/aws-batch-adds-support-for-automatic-termination-with-job-execution-timeout/
You specify an attemptDurationSeconds parameter, which must be at least 60 seconds, either in your job definition, or when you submit the job. When this number of seconds has passed following the job attempt's startedAt timestamp, AWS Batch terminates the job. On the compute resource, your job's container receives a SIGTERM signal to give your application a chance to shut down gracefully; if the container is still running after 30 seconds, a SIGKILL signal is sent to forcefully shut down the container.
Source: https://docs.aws.amazon.com/batch/latest/userguide/job_timeouts.html
POST /v1/submitjob HTTP/1.1
Content-type: application/json

{
   ...
   "timeout": {
      "attemptDurationSeconds": number
   }
}
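The same parameter can be passed through the SDK; for example, a minimal boto3 sketch (the job name, queue, and definition are hypothetical):

import boto3

batch = boto3.client("batch")
batch.submit_job(
    jobName="nightly-run",          # hypothetical
    jobQueue="default-queue",       # hypothetical
    jobDefinition="my-job-def:1",   # hypothetical
    # Terminate the attempt if it runs longer than 1 hour (minimum is 60 s).
    timeout={"attemptDurationSeconds": 3600},
)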
AFAIK there is no feature to do this. However, a workaround was suggested in the forum for a similar question.
One idea is to call Batch as an Activity from Step Functions, and ping back on a schedule (e.g. every minute) from that job. If it stops responding then you can detect that situation as a Timeout in the activity and act accordingly (terminate the job etc.). Not an ideal solution (especially if the job continues to ping back as a "zombie"), but it's a start. You'd also likely have to store activity tokens in a database to trace them to the Batch job id.
Alternatively, you split that setup into 2 steps, and schedule a Batch job from a Lambda in the first state, then pass the Batch job id to the second step, which then polls Batch (from another Lambda) for its state with Retry and IntervalSeconds (e.g. once every minute, or even with exponential backoff), and MaxAttempts calculated based on your timeout. This way, you don't need any external state storage mechanism, long polling or even a "ping back" from the job (it CAN be a zombie), but the downside is more steps.
There is no option to set a timeout on a Batch job, but you can set up a Lambda function that triggers every hour or so and terminates jobs created more than, say, 24 hours ago.
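A minimal sketch of such a sweeper Lambda, using the Batch ListJobs and TerminateJob APIs (the queue name is hypothetical):

import time
import boto3

batch = boto3.client("batch")
MAX_AGE_MS = 24 * 3600 * 1000  # 24 hours, in milliseconds

def handler(event, context):
    # List jobs still running in a (hypothetical) queue and terminate
    # any whose attempt started more than MAX_AGE_MS ago.
    jobs = batch.list_jobs(jobQueue="default-queue", jobStatus="RUNNING")["jobSummaryList"]
    now_ms = int(time.time() * 1000)
    for job in jobs:
        if now_ms - job.get("startedAt", now_ms) > MAX_AGE_MS:
            batch.terminate_job(jobId=job["jobId"], reason="Exceeded 24 h walltime")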
I have been working with AWS for some time now and could not find a way to set a maximum running time for Batch jobs.
However, there are some alternative ways you could use.
AWS Forum
Sadly, there is no way to set an execution time limit on AWS Batch.
One solution may be to edit the Docker image's entry point to enforce the execution time limit, as sketched below.
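For illustration, a minimal Python entry-point wrapper under these assumptions (job.py is a hypothetical workload; coreutils' timeout would achieve the same from a shell entry point):

import subprocess
import sys

WALLTIME_SECONDS = 3600  # illustrative 1-hour limit

try:
    # Run the actual workload (job.py is hypothetical) under a wall-clock limit;
    # subprocess.run kills the child and raises TimeoutExpired when it elapses.
    subprocess.run(["python", "job.py"], timeout=WALLTIME_SECONDS, check=True)
except subprocess.TimeoutExpired:
    sys.exit("Job exceeded walltime limit; container exiting with failure")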