Faster killing of ECS containers based on inactive tasks? - amazon-web-services

I've been lurking for years and the time has finally come to post my first question!
So, my GitLab/Terraform/AWS pipeline pushes containers to Fargate. Once the task definition gets updated, new containers go live and pass health checks. At this point both the old and the new containers are up.
It takes several minutes until the auto-scaler shuts down the old containers. This is in a dev environment so nobody is accessing anything and there are no connections to drain. Other than manually, is there a way to make this faster or even instant?
Thanks in advance!

There is a way to reduce the time you have to wait for tasks to drain. Go to EC2 -> Target Groups -> (Select your target group) -> Description and scroll down. At the bottom is a property called "Deregistration Delay". This is the amount of time the target group allows in-flight connections to drain before a deregistering target, here the old container, is removed (the default is 300 seconds, i.e. 5 minutes). Reduce that value and you should be able to deploy much quicker. Hope this helps!
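If you would rather script this than click through the console, here is a minimal sketch in Python. It assumes a boto3 `elbv2` client is passed in; the `set_deregistration_delay` helper, the `RecordingClient` stand-in and the ARN string are hypothetical names used only so the example runs without AWS access. The attribute key `deregistration_delay.timeout_seconds` and the `modify_target_group_attributes` call are the real ELBv2 API.

```python
from typing import Any


def set_deregistration_delay(elbv2_client: Any, target_group_arn: str, seconds: int) -> dict:
    """Set the target group's deregistration delay (connection draining time)."""
    # The attribute accepts values from 0 to 3600 seconds.
    if not 0 <= seconds <= 3600:
        raise ValueError("deregistration delay must be between 0 and 3600 seconds")
    return elbv2_client.modify_target_group_attributes(
        TargetGroupArn=target_group_arn,
        Attributes=[
            # Attribute values are passed as strings.
            {"Key": "deregistration_delay.timeout_seconds", "Value": str(seconds)}
        ],
    )


class RecordingClient:
    """Hypothetical stand-in for boto3.client("elbv2"); it only records calls."""

    def __init__(self) -> None:
        self.calls = []

    def modify_target_group_attributes(self, **kwargs):
        self.calls.append(kwargs)
        return {"Attributes": kwargs["Attributes"]}


client = RecordingClient()
# The ARN below is a placeholder, not a real target group.
response = set_deregistration_delay(client, "arn:aws:elasticloadbalancing:example", 10)
```

In a real pipeline you would pass `boto3.client("elbv2")` instead of the stand-in. In a dev environment with no connections to drain, a value close to 0 is probably fine; in production, keep it long enough for in-flight requests to finish.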

Related

How to stop a compute node with SLURM?

I am using SLURM on AWS to manage jobs as part of AWS ParallelCluster. I have two questions:
When using scancel *jobid* to cancel a job, the associated node(s) do not stop. How can I achieve that?
When starting out, I made the mistake of not making my script executable, so sbatch *script.sh* worked but the compute node was doing nothing. How could I identify such behaviour and handle it properly? Is the proper approach to, for example, stop the idle node after some time and record that in a log? How can I achieve that?
Check out this page in the docs: https://docs.aws.amazon.com/parallelcluster/latest/ug/autoscaling.html
Bottom line is that instances that have no jobs for a period of time longer than the scaledown_idletime (the default setting is 10 minutes) will get scaled down (terminated) by the cluster, automagically.
You can tweak the setting in the config file when you build your cluster, if 10 mins is too long. Just think about your workload first, because you don't want small delays between jobs to cause you a lot of churn whilst you wait for nodes to die and then get created again shortly after, hence the 10 minute thing.
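To make the rule concrete, here is a small Python sketch of the idle-time check described above. It is only an illustration of the rule, not how ParallelCluster implements it; the `nodes_to_terminate` function, the node names and the timestamps are all made up for the example.

```python
from datetime import datetime, timedelta


def nodes_to_terminate(last_job_end, now, scaledown_idletime_minutes=10):
    """Return the names of nodes that have been idle longer than the threshold."""
    threshold = timedelta(minutes=scaledown_idletime_minutes)
    return sorted(name for name, ended in last_job_end.items() if now - ended > threshold)


# Hypothetical cluster state: when each node finished its last job.
now = datetime(2024, 1, 1, 12, 0)
last_job_end = {
    "compute-1": now - timedelta(minutes=3),
    "compute-2": now - timedelta(minutes=12),
    "compute-3": now - timedelta(minutes=25),
}

idle_default = nodes_to_terminate(last_job_end, now)  # 10 minute threshold
idle_short = nodes_to_terminate(last_job_end, now, scaledown_idletime_minutes=2)
```

With the 10 minute threshold only the two long-idle nodes would go; with a 2 minute threshold the node that finished 3 minutes ago goes too, which is the churn to watch for if the next job arrives shortly after.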

Running large jobs with low startup time and autoscaling for bursts of traffic

For a website I’m developing on AWS, a user can submit a large job (e.g. select a large number of items and ask to update them all in some way). We don’t want to limit the size of the jobs users submit, so a job can in theory run for a very long time and require a large amount of memory (this rules out AWS Lambda as a compute engine option). We want jobs to be as independent from one another as possible, so we chose to run each job in its own container in Amazon ECS. What we currently do when a user submits a job request is send a message with a job id/reference to an SQS queue and have AWS Lambda poll that queue; upon receiving a message, Lambda starts an ECS task (SQS -> Lambda -> ECS). The problem is that a new ECS task is started for each request, so a new container must be booted, which can take minutes. This latency is directly visible to the user and is particularly unacceptable when the user’s job is not even large yet they still wait minutes for the container to boot. Additionally, the cost of constantly running a container or two would not be too problematic.
I've been toying with some ideas for updating this flow.
Attempt 1:
In this updated flow we'd create an ECS task that looks like the following:
message = null;
while (message == null) {
    message = pollForMessages();
}
processMessage(message);
// task finishes, and container can be brought down
We remove the Lambda from the flow and just have SQS -> ECS rather than SQS -> Lambda -> ECS. In this case there would be no cold start, assuming a container is already up and polling for messages. We could set the minimum number of running tasks to a number > 0 to ensure all messages are processed at some point. However, this would not auto-scale as the number of messages in the queue increases, so something needs to spawn more containers when traffic increases.
Attempt 2:
In this updated flow we'd create an ECS task that looks like the following:
message = null;
while (message == null) {
    message = pollForMessages();
}
if (number of running tasks < number of messages in queue) {
    spawnMoreContainers();
}
processMessage(message);
// task finishes, and container can be brought down
This comes with the issue that we could end up over-provisioning containers if multiple containers see that there are more messages in the queue than tasks running. Since these tasks run until they have processed a message, this could result in a large unnecessary cost. It could also under-provision containers: if a task sees that number of running tasks >= number of messages, but those running tasks are already busy processing messages, they will not take any of these messages off the queue, and we may end up with messages that wait a very long time to be processed.
Attempt 3:
message = null;
while (message == null) {
    message = pollForMessages();
    // only give up while still idle, so a message that was just received is not dropped
    if (message == null && number of containers > min provisioned && this particular container has been running longer than some timeout) {
        // finish this task so this container can be brought down
        return;
    }
}
if (number of running tasks < number of messages in queue) {
    spawnMoreContainers();
}
processMessage(message);
// task finishes, and container can be brought down
While this may save us some cost compared to Attempt 2, since idle containers eventually shut down and over-provisioning is less of an issue, we could still under-provision containers, in which case certain job requests would wait for potentially long periods before being processed.
Attempt 4:
We can introduce locking (e.g. https://aws.amazon.com/blogs/database/building-distributed-locks-with-the-dynamodb-lock-client/) to mitigate some of the race conditions. However, we'll always have the issue that a running task is not necessarily a task that is available to pick up messages, and Fargate gives us no way of distinguishing between the two. This makes it difficult to determine how many containers to provision (e.g. we see 5 running containers and 5 messages, but we don't know whether to provision more because we don't know if those containers are already processing a message or waiting for one). Alternatively, we could introduce some mechanism, either an external orchestrator or some logic within the containers plus a data store, to manage the state of these containers.
Essentially to deal with each of these problems, the architecture becomes more and more complex and implementation would be difficult and error prone.
It also seems to me like these solutions are reinventing the wheel, and I feel there must be some service out there that has solved this problem already, but I can’t seem to find it.
The suggestions I’ve seen to deal with this are:
Maybe AWS Batch is more suited for this use case - Indeed, AWS Batch might be the more recommended approach for a workload like this, but switching doesn't remove any of the cold start problem. AWS Batch would still create a new container for each job.
Run the ECS tasks on EC2 rather than Fargate, then cache the container image on the host - With this, we’d be managing our own infrastructure and ideally we’d like this to be serverless.
Have an alarm on the number of messages in the queue and have this alarm trigger a Lambda that then boots up more containers - alarms on metrics in the AWS/ namespace have a minimum period of 1 minute. This means the alarm would not be triggered until a minute after we’d received more requests than our provisioned containers can handle. Additionally, we'd have to set up many alarms to scale at different numbers of messages.
I’m wondering if anyone is aware of potential services/frameworks that could make doing this more feasible? Or if anyone has suggestions on alternative architectures?
If you don't mind a somewhat slower response to bursts, you can create an auto scaling group (I assume there is something similar for ECS). The group can be governed by a custom metric, e.g. queue length divided by the number of workers. A detailed guide is here: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-using-sqs-queue.html
In any case, I'd decouple the scaling decision from the worker code, because there is a varying number of workers that you would need to synchronize. It's much easier to have one overseer that controls how many workers there should be. Because the overseer is not on the critical path to task processing, you don't need to care that much about its uptime. It's OK if it takes a few minutes before it recovers after a failure - the workers are still there, processing at least at some capacity.
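As a rough sketch of what the overseer's decision could look like, here is the backlog-per-worker idea in Python. The `desired_worker_count` function and its parameters (`acceptable_backlog_per_worker`, `min_workers`, `max_workers`) are hypothetical names, and the limits are example values, not recommendations.

```python
import math


def desired_worker_count(queue_length, acceptable_backlog_per_worker,
                         min_workers=1, max_workers=20):
    """Size the worker pool so each worker has at most the acceptable backlog."""
    if acceptable_backlog_per_worker <= 0:
        raise ValueError("acceptable_backlog_per_worker must be positive")
    needed = math.ceil(queue_length / acceptable_backlog_per_worker)
    # Keep a warm minimum to avoid cold starts and cap the maximum to bound cost.
    return max(min_workers, min(max_workers, needed))


# Example decisions the overseer would make on each polling cycle.
idle_target = desired_worker_count(queue_length=0, acceptable_backlog_per_worker=1)
burst_target = desired_worker_count(queue_length=7, acceptable_backlog_per_worker=2)
capped_target = desired_worker_count(queue_length=45, acceptable_backlog_per_worker=2)
```

The overseer would read the queue length periodically, compute this number and set the service's desired task count to it. Because only one process makes the decision, the over-provisioning race between workers described in the question does not arise.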

How to fix CloudRun error 'The request was aborted because there was no available instance'

I'm using managed CloudRun to deploy a container with concurrency=1. Once deployed, I'm firing four long-running requests in parallel.
Most of the time, all works fine -- But occasionally, I'm facing 500's from one of the nodes within a few seconds; logs only provide the error message provided in the subject.
Using retry with exponential back-off did not improve the situation; the retries also end up with 500s. StackDriver logs also do not provide further information.
Potentially relevant gcloud beta run deploy arguments:
--memory 2Gi --concurrency 1 --timeout 8m --platform managed
What does the error message mean exactly -- and how can I solve the issue?
This error message can appear when the infrastructure didn't scale fast enough to catch up with the traffic spike. Infrastructure only keeps a request in the queue for a certain amount of time (about 10s) then aborts it.
This usually happens when:
traffic suddenly increases sharply
cold start time is long
request time is long
We also faced this issue when traffic suddenly increased during business hours. It is usually caused by a sudden increase in traffic combined with instances that take a long time to start and accommodate the incoming requests. One way to handle this is to keep warm instances always running, i.e. to set the --min-instances parameter in the gcloud run deploy command. Another, recommended way is to reduce the service's cold start time (which is difficult to achieve in some languages like Java and Python).
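For picking a value for --min-instances, a back-of-the-envelope estimate can help. The sketch below assumes steady traffic and uses Little's law (requests in flight = arrival rate x request duration); it ignores burstiness and cold start time, so treat the result as a lower bound. The function name and the sample numbers are made up for illustration.

```python
import math


def min_instances_for(requests_per_second, request_seconds, concurrency=1):
    """Estimate how many instances are busy at once under steady traffic."""
    # Little's law: average requests in flight = arrival rate * time per request.
    in_flight = requests_per_second * request_seconds
    # Each instance serves up to `concurrency` requests at the same time.
    return math.ceil(in_flight / concurrency)


# With concurrency=1, every in-flight request needs its own instance.
steady = min_instances_for(requests_per_second=2, request_seconds=6, concurrency=1)
light = min_instances_for(requests_per_second=0.5, request_seconds=6, concurrency=1)
shared = min_instances_for(requests_per_second=1, request_seconds=6, concurrency=4)
```

Long requests combined with concurrency=1 push the required instance count up quickly, which matches the conditions listed above for when this error tends to appear.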
I also experienced the problem. It is easy to reproduce: I have a Fibonacci container that computes fibo(45) in 6s, I use hey to send 200 requests, and I set my Cloud Run concurrency to 1.
Out of 200 requests I get 8 similar errors. In my case the cause is a sudden traffic spike and a long processing time. (The cold start is short for me, it's in Go.)
I was able to resolve this on my service by raising the max autoscaling container count from 2 to 10. There really should be no reason that 2 would be even close to too low for the traffic, but I suspect something in the Cloud Run internals was tying up to 2 containers somehow.
Setting the Max Retry Attempts to anything but zero will remedy this, as it did for me.

Akka design: How to add/remove routee from cluster aware router dynamically

I have the following use case and I am not sure if the Akka toolkit provides this out of the box:
I have a number of nodes (instances/machines) that can each run a finite number of long-running tasks in the background and cannot accept more work while at max capacity.
Each instance can only process 50 tasks.
All instances are behind a load balancer.
Each task can respond to messages from the client who initiated the task, since the client sends the messages via the load balancer the instances need to route it to the correct instance that handles the task.
I initially tried cluster sharding, but there doesn't seem to be a way to cap the maximum number of shard regions/actors per node (= #tasks).
Then I tried a cluster-aware router, which acts as a guard for accepting or rejecting work. This seems to work reasonably well; one problem is that once a node reaches capacity I need to remove it as a routee and add it back once it has capacity again.
Is there something out of the box that supports this use case or should I carry on with the routing option and if so how can I achieve this?
I'll update the description if you have further questions or something is unclear.
Your scenario sounds like a good fit for the work pulling pattern. The gist of this pattern is:
A master actor coordinates units of work among a number of worker actors.
Workers register themselves to the master, meaning that workers can be added or removed dynamically.
When the master receives work to be done, the master notifies the workers that work is available. Workers pull units of work when they're ready, do what needs to be done with their respective units of work, then ask the master for more work when they're finished.
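The three points above can be sketched without Akka at all. The following is a single-threaded Python toy, not Akka code: `Master`, `Worker`, the `capacity` parameter and the task names are all hypothetical, and real actors would exchange these calls as asynchronous messages. The point it shows is that a worker at capacity simply does not pull, so there is no need to remove it from a router and add it back.

```python
from collections import deque


class Master:
    """Holds pending work and tells registered workers when work is available."""

    def __init__(self):
        self.pending = deque()
        self.workers = []

    def register(self, worker):
        self.workers.append(worker)
        if self.pending:
            worker.work_available()

    def unregister(self, worker):
        self.workers.remove(worker)

    def submit(self, work):
        self.pending.append(work)
        for worker in list(self.workers):
            worker.work_available()

    def request_work(self, worker):
        # Work is handed out only when a registered worker asks for it.
        if worker in self.workers and self.pending:
            return self.pending.popleft()
        return None


class Worker:
    """Pulls work only while it has spare capacity."""

    def __init__(self, name, master, capacity):
        self.name = name
        self.master = master
        self.capacity = capacity
        self.running = []

    def work_available(self):
        while len(self.running) < self.capacity:
            work = self.master.request_work(self)
            if work is None:
                break
            self.running.append(work)

    def finish(self, work):
        self.running.remove(work)
        self.work_available()  # a slot is free again, so ask for more


master = Master()
a = Worker("a", master, capacity=2)
b = Worker("b", master, capacity=2)
master.register(a)
master.register(b)
for i in range(5):
    master.submit(f"task-{i}")  # the fifth task waits because both workers are full
a.finish("task-0")              # worker a frees a slot and pulls the waiting task
```

In the question's terms, each node would be a worker with a capacity of 50, and routing client messages to the node that owns a task would still need a separate lookup.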
To learn more about this pattern, read the following (the first two links are listed in the Akka documentation):
The original post (by Derek Wyatt): http://letitcrash.com/post/29044669086/balancing-workload-across-nodes-with-akka-2
A follow-on post (by Michael Pollmeier): http://www.michaelpollmeier.com/akka-work-pulling-pattern
An application of the pattern in a clustered environment with a cluster-aware router (by Ryan Tanner): https://www.conspire.com/blog/2013/10/akka-at-conspire-part-5-the-importance-of/

Amazon Elasticache Failover

We have been using AWS Elasticache for about 6 months now without any issues. Every night we have a Java app that runs which will flush DB 0 of our redis cache and then repopulate it with updated data. However we had 3 instances between July 31 and August 5 where our DB was successfully flushed and then we were not able to write the new data to the database.
We were getting the following exception in our application:
redis.clients.jedis.exceptions.JedisDataException:
redis.clients.jedis.exceptions.JedisDataException: READONLY You can't
write against a read only slave.
When we look at the cache events in Elasticache we can see
Failover from master node prod-redis-001 to replica node
prod-redis-002 completed
We have not been able to diagnose the issue and since the app was running fine for the past 6 months I am wondering if it is something related to a recent Elasticache release that was done on the 30th of June.
https://aws.amazon.com/releasenotes/Amazon-ElastiCache
We have always been writing to our master node and we only have 1 replica node.
If someone could offer any insight it would be much appreciated.
EDIT: This seems to be an intermittent problem. Some days it will fail other days it runs fine.
We have been in contact with AWS support for the past few weeks and this is what we have found.
Most Redis requests are synchronous, including the flush, so it blocks all other requests. In our case we are flushing 19m keys and it takes more than 30 seconds.
ElastiCache performs a health check periodically, and since the flush is running the health check is blocked, thus causing a failover.
We have been asking the support team how often the health check is performed so we can get an idea of why our flush only causes a failover 3-4 times a week. The best answer we can get is "We think it's every 30 seconds". However, our flush consistently takes more than 30 seconds and doesn't consistently fail.
They said that they may implement the ability to configure the timing of the health check however they said this would not be done anytime soon.
The best advice they could give us is:
1) Create a completely new cluster for loading the new data on, and
instead of flushing the previous cluster, re-point your application(s)
to the new cluster, and remove the old one.
2) If the data that you are flushing is an update version of the data,
consider not flushing, but updating and overwriting new keys?
3) Instead of flushing the data, set the expiry of the items to be
when you would normally flush, and let the keys be reclaimed (possibly
with a random time to avoid thundering herd issues), and then reload
the data.
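Option 3 could look roughly like the sketch below. The asker's app uses Jedis in Java; this uses Python only for brevity and assumes a client exposing `expire(key, seconds)`, as redis-py does. The helper names, the `FakeRedis` stand-in, the key names and the TTL values are all hypothetical.

```python
import random


def jittered_ttl(base_seconds, max_jitter_seconds, rng=random):
    """Base TTL plus a random offset so keys do not all expire at the same moment."""
    return base_seconds + rng.randint(0, max_jitter_seconds)


def expire_instead_of_flush(redis_client, keys, base_seconds, max_jitter_seconds, rng=random):
    """Give every key its own expiry instead of issuing one blocking flush."""
    for key in keys:
        redis_client.expire(key, jittered_ttl(base_seconds, max_jitter_seconds, rng))


class FakeRedis:
    """Hypothetical stand-in for a Redis client; it only records the TTLs it is given."""

    def __init__(self):
        self.ttls = {}

    def expire(self, name, time):
        self.ttls[name] = time
        return True


fake = FakeRedis()
expire_instead_of_flush(
    fake,
    ["item:1", "item:2", "item:3"],
    base_seconds=3600,       # expire roughly when the nightly flush would have run
    max_jitter_seconds=300,  # spread expiries over 5 minutes
    rng=random.Random(42),   # seeded only to make the example repeatable
)
```

With millions of keys you would set the expiry when writing each key during the reload rather than looping over every key afterwards.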
Hope this helps :)
For Redis versions from 6.2 onward, AWS ElastiCache has a new thread monitoring feature, so the health check no longer happens in the same thread as all other Redis actions. Redis can keep processing a long command or Lua script and will still be considered healthy. Because of this feature, failovers should happen less often.