AWS SQS + autoscale - amazon-web-services

Assuming that I have a queue and multiple instances in autoscaling group.
The scaling-up case is quite easy to handle: if the length of the queue grows, the autoscaling group will spawn new instances.
The scaling-down case is a bit tricky. If the length of the queue shrinks, the autoscaling group will terminate instances. That sounds obvious, but the question is: what happens if instances that are still processing messages get terminated?
Of course we could check metrics like CPU utilisation, disk read/write, etc., but I don't think that's a good idea. I'm thinking of a central place where instances register whether or not they are currently processing, so that the idle ones can be identified and terminated safely.
Any thoughts for this? Thanks.

The accepted answer on this thread:
Amazon Auto Scaling API for Job Servers
gives you two possibilities for handling your situation; one of them should work for you. Also keep in mind that you don't necessarily want to kill an instance as soon as there is no work: when instances spin up, you pay for the whole hour whether you use 59 minutes or 1 minute of it, so you may want to build that into your solution - spin up instances fast, turn them off slowly.
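One way to build the "central registry of busy instances" the question asks about is to let each worker toggle its own scale-in protection while it is processing a message, so the Auto Scaling group only ever terminates idle instances. A minimal boto3 sketch, assuming a hypothetical Auto Scaling group name of worker-asg and IMDSv1 access to instance metadata:

```python
import urllib.request
import boto3

ASG_NAME = "worker-asg"  # hypothetical Auto Scaling group name

autoscaling = boto3.client("autoscaling")

def my_instance_id() -> str:
    # Look up this instance's ID from the EC2 instance metadata endpoint.
    url = "http://169.254.169.254/latest/meta-data/instance-id"
    with urllib.request.urlopen(url, timeout=2) as resp:
        return resp.read().decode()

def set_busy(busy: bool) -> None:
    # Protect this instance from scale-in while it is working on a message,
    # and release the protection once it goes idle again.
    autoscaling.set_instance_protection(
        InstanceIds=[my_instance_id()],
        AutoScalingGroupName=ASG_NAME,
        ProtectedFromScaleIn=busy,
    )

# Typical worker loop: set_busy(True); process(message); set_busy(False)
```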

Related

Problems with Memory and CPU limits in AWS ECS cluster running on reserved EC2 instance

I am running an ECS cluster that currently has 3 services on a t3.medium instance. Each of those services runs only one task, which has a soft memory limit of 1 GB; the hard limit is different for each (but that should not be the problem). I will always have enough memory to run one newly deployed task (the new one will also take 1 GB, and the t3.medium can handle it since it has 4 GB in total). After the new task is up and running, the old one is stopped and I again have 1 GB free for the next deployment. I did something similar with CPU (2048 CPU units, each task gets 512, leaving 512 free for new deployments).
So everything runs fine now, but I am not completely satisfied with this setup for the future. What happens if I need to add another service with another task? I would have to modify the task definitions of all existing services to use less CPU and memory, and redeploy them, just to make room for the new task (and its deployments). I am also planning to buy a reserved EC2 instance, so it will not be easy to swap the current instance for a larger one.
Is there a way to spin up another EC2 instance for the same ECS cluster to handle bursts in my tasks? The same goes for deployments: it's not ideal to be able to deploy only one task at a time and then wait for the old one to be killed before deploying the next, just to avoid downtime.
And my biggest concern: if I need a new service and task, I again have to adjust all the others and redeploy them in order to fit the new one, which is not very maintainable. What if I can't lower CPU and memory any further because I've already reached the minimum needed to run the tasks smoothly?
I was thinking about adding another EC2 instance to the same cluster to handle bursts, deployments, and new services/tasks, but I'm not sure whether that's possible or whether it's the best way of doing this. I also considered Fargate, but it is much more expensive and I cannot afford it for now. What do you think? Any ideas, suggestions, and hints would be helpful, since I am desperate to find the best way to avoid the problems mentioned above.
Thanks in advance!
Unfortunately, there is no out-of-the-box solution to ensure that all your tasks run on the minimum possible number of instances (i.e. one). You can use our new feature called Capacity Providers (CP), which lets you keep just the minimum number of EC2 instances required to run all your tasks. The major difference between CP and a plain ASG is that CP drives scaling from task placement (whereas an ASG scales in/out based on resource utilization, which isn't ideal in your case).
However, it's not an ideal solution. Just as you said in your comment, when the service needs to scale out during a deployment, CP will spin up another instance, the new task will be placed on it, and once it reaches the RUNNING state the old task will be stopped.
But now you have an "extra" EC2 instance, because there is no way to relocate a running task. The only workaround I can think of is a Lambda function that drains the new instance, which moves all the service tasks back onto the other instance. After about 15 minutes, CP will terminate the drained instance, since no tasks are running on it.
A couple of caveats:
CPs are new and a little rough around the edges, and you can't delete or modify them; you can only create or deactivate them.
A CP needs an underlying ASG, and they must have a 1-1 relationship.
Make sure to enable managed scaling when creating the CP.
Choose a 100% capacity target.
Don't forget to add a default capacity provider strategy for the cluster.
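For reference, a rough boto3 sketch of those last three caveats (the provider, cluster and ASG identifiers are placeholders, not anything from the question):

```python
import boto3

ecs = boto3.client("ecs")

# Create a capacity provider backed by an existing Auto Scaling group,
# with managed scaling enabled and a 100% target capacity.
ecs.create_capacity_provider(
    name="my-capacity-provider",  # hypothetical name
    autoScalingGroupProvider={
        "autoScalingGroupArn": "arn:aws:autoscaling:REGION:ACCOUNT:autoScalingGroup:...",  # your ASG ARN
        "managedScaling": {
            "status": "ENABLED",
            "targetCapacity": 100,
        },
        "managedTerminationProtection": "DISABLED",
    },
)

# Attach it to the cluster and make it the default capacity provider strategy.
ecs.put_cluster_capacity_providers(
    cluster="my-cluster",  # hypothetical cluster name
    capacityProviders=["my-capacity-provider"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "my-capacity-provider", "weight": 1},
    ],
)
```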
Minimizing EC2 instances used:
If you're using a capacity provider, the 'binpack' placement strategy minimises the number of EC2 hosts that are used.
However, there are some scale-in scenarios where you can end up with a single task running on its own EC2 instance. As Ali mentions in their answer, ECS will not replace this running task, but depending on your setup it may be fairly easy for you to replace it yourself by configuring your task to voluntarily 'quit'.
In my case, I always have at least 2 tasks running per service, so I just added some logic to my tasks' healthchecks so that they report as unhealthy after ~6 hours. ECS will spot the 'unhealthy' task, remove it from the load balancer, and spin up a replacement (placed according to the binpack strategy).
Note: if you take this approach, add some variation to your timeout so you're less likely to have all of your tasks expire at the same time. Something like: expiry = now + timedelta(hours=random.uniform(5.5,6.5))
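Fleshed out a little, the expiry logic described above might look like this (a sketch; the health endpoint itself is whatever your task already exposes):

```python
import random
from datetime import datetime, timedelta

# Pick a randomised expiry at startup so tasks don't all recycle at once.
EXPIRY = datetime.utcnow() + timedelta(hours=random.uniform(5.5, 6.5))

def healthy() -> bool:
    # Report unhealthy once the task is past its expiry; ECS and the load
    # balancer will then drain it and start a replacement (placed via binpack).
    return datetime.utcnow() < EXPIRY

# e.g. have your health endpoint return 200 when healthy() is True, 503 otherwise.
```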
Sharing memory 'headspace' with soft-limits:
If you set both soft and hard memory limits, ECS places your tasks based on the soft limit. If your tasks' memory usage varies with load, it's fairly easy for your EC2 instance to start swapping.
For example: say you have a task defined with a soft limit of 900 MB and a hard limit of 1800 MB. You spin up a service with 4 running tasks, and ECS places all 4 of them on a single t3.medium. Notice that each task thinks it can safely use up to 1800 MB, when in fact there's very little free memory on the host server. When you hit your service with some traffic, each task tries to use some more memory, and your t3.medium is incapacitated as it starts swapping memory to disk. ECS does not recover from this type of failure very well. It notices that the task instances are no longer available and will attempt to provision replacements, but the capacity provider is very slow to replace the swapping t3.medium.
My suggestion:
Configure your service to auto-scale based on memory usage (this will be a percentage of your soft-limit), for example: a target memory usage of 70%
Configure your tasks' healthchecks so that they report as unhealthy when they are nearing their soft-limit. This way, your tasks still have some headroom for quick spikes of memory usage, while giving your load balancer a chance to drain and gracefully replace tasks that are getting greedy. This is fairly easy to do by reading the value within /sys/fs/cgroup/memory/memory.usage_in_bytes.
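As a small sketch of that healthcheck (assuming the cgroup v1 path mentioned above; the soft limit and threshold are just the example numbers, supply your own):

```python
SOFT_LIMIT_BYTES = 900 * 1024 * 1024  # the task's soft limit (900 MB in the example above)
THRESHOLD = 0.90                      # report unhealthy when within 10% of the soft limit

def memory_healthy() -> bool:
    # Current memory usage of this task's cgroup (cgroup v1 path).
    with open("/sys/fs/cgroup/memory/memory.usage_in_bytes") as f:
        used = int(f.read().strip())
    return used < SOFT_LIMIT_BYTES * THRESHOLD
```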

I want AWS Spot pricing for a long-running job. Is a spot request of one instance the best way to achieve this?

I have a multi-day analysis problem that I am running on a 72 cpu c5n EC2 instance. To get spot pricing, I made my code interruption-resilient and am launching a spot request of one instance. It works great, but this seems like overkill given that Spot can handle thousands of instances. Is this the correct way to solve my problem or am I using a sledgehammer to squash a fly?
I've tried normal EC2 launching, which works great, except that it is four times the price. I don't know of any other way to approach this besides these two. I thought about Fargate or containers or something, but I am running a 72-CPU c5n node, and those other options won't let me use that kind of horsepower (as far as I know, hence my question).
Thanks!
Amazon EC2 Spot Instances are an excellent way to get cheaper compute (up to 90% discount). The only downside is that the instances might be stopped/terminated (your choice) if there is insufficient capacity.
Some strategies to improve your chance of obtaining spot instances:
Use instances across different Instance Types and Availability Zones because they each have different availability pools (EC2 Spot Fleet can assist with this)
Use resources on weekends and in evenings (even in different regions!) because these tend to be times of lower usage
Use Spot Instances with a specified duration (also known as Spot blocks), but this is at a higher price and a maximum duration of 6 hours
If your software permits it, you could split your load between multiple instances to get the job done faster and to be more resilient against any stoppages of your Spot instances.
Hopefully your application is taking advantage of all the CPUs; otherwise you'd be better off with smaller instances.
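For what it's worth, the "single persistent spot request" approach you are already using is straightforward to express with boto3; something like the following sketch (AMI, subnet and key name are placeholders, and c5n.18xlarge is assumed as the 72-vCPU c5n size):

```python
import boto3

ec2 = boto3.client("ec2")

# One persistent Spot request for a single large instance; "stop" keeps the
# EBS volumes on interruption so an interruption-resilient job can resume.
response = ec2.request_spot_instances(
    InstanceCount=1,
    Type="persistent",
    InstanceInterruptionBehavior="stop",
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",      # placeholder AMI
        "InstanceType": "c5n.18xlarge",
        "KeyName": "my-key",                     # placeholder key pair
        "SubnetId": "subnet-0123456789abcdef0",  # placeholder subnet
    },
)
print(response["SpotInstanceRequests"][0]["SpotInstanceRequestId"])
```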

Reduce price on AWS (EC2 and spot instances)

I have a queue of jobs and AWS EC2 instances running that process those jobs. We have an Auto Scaling group for each c4.* instance type, in both spot and on-demand versions.
Each instance has a 'power' value equal to its number of CPUs (for example, c4.large has power=2 since it has 2 CPUs).
The exact total power we need is calculated simply from the number of jobs in the queue.
I would like to implement an algorithm that periodically checks the number of jobs in the queue and, via the AWS SDK, changes the desired capacity of the individual Auto Scaling groups, so as to save as much money as possible while keeping enough total power to process the jobs.
In particular:
I prefer spot instances to on-demand since they are cheaper.
EC2 instances are charged per hour, so we would like to turn an instance off only in the very last minutes of its current billing hour.
We would like to replace on-demand instances with spot instances when possible. So, at minute 55 increase the spot group, at minute 58 check that the new spot instance is running, and if so, decrease the on-demand group.
We would like to replace spot instances with on-demand ones if the spot bid gets too high: just turn off the spot one and turn on an on-demand one.
The problem seems really difficult to handle. Does anybody have experience with this, or a similar solution implemented?
You could certainly write your own code to do this, effectively telling your Auto Scaling groups when to add/remove instances.
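A very rough sketch of that loop, assuming boto3, an SQS-backed job queue, and hypothetical group names and sizing rules (a real version would also account for spot prices and each instance's position in its billing hour):

```python
import boto3

sqs = boto3.client("sqs")
autoscaling = boto3.client("autoscaling")

QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/jobs"  # placeholder queue
JOBS_PER_CPU = 10  # assumption: how many queued jobs one CPU should cover
GROUP_POWER = {"spot-c4-xlarge-asg": 4, "spot-c4-large-asg": 2}  # hypothetical ASGs -> CPUs per instance

def required_power() -> int:
    # Derive the total 'power' needed from the queue backlog.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    return -(-backlog // JOBS_PER_CPU)  # ceiling division

def scale(power: int) -> None:
    # Naive allocation: fill the biggest group first; round the remainder up
    # on the smallest group so we never under-provision.
    groups = list(GROUP_POWER.items())
    for i, (group, cpus) in enumerate(groups):
        if i == len(groups) - 1:
            desired = -(-power // cpus)  # ceiling for the last (smallest) group
        else:
            desired = power // cpus
        power -= desired * cpus
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=group,
            DesiredCapacity=desired,
            HonorCooldown=False,
        )

scale(required_power())
```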
Also, please note that a good strategy for lowering costs with Spot Instances is to appreciate that the price for a spot instance varies by:
Region
Availability Zone
Instance Type
So, if the spot price for a c4.xlarge goes up in one AZ, it might still be the same cost in another AZ. Also, the price of a c4.2xlarge might then be lower than a c4.xlarge, with twice the power.
Therefore, you should aim to diversify your spot instances across multiple AZs and multiple instance types. This means that spot price changes will impact only a small portion of your fleet rather than all of it at once.
You could use Spot Fleet to assist with this, or even third-party products such as SpotInst.
It's also worth looking at AWS Batch (not currently available in every region), which is designed to intelligently provide capacity for batch jobs.
Autoscaling groups allow you to use alarms and metrics that are defined outside of the autoscaling group.
If you are using SQS, you should be able to set up an alarm on your queue's length and use that to scale your scaling group out and in.
If you are using a custom queue system, you can push metrics to CloudWatch to create a similar alarm.
You can control how often scaling actions occur, but it may be difficult to align terminations exactly with the one-hour mark.
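If you do run a custom queue, publishing its depth as a custom CloudWatch metric might look roughly like this (the namespace and metric names are made up):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_backlog(queue_depth: int) -> None:
    # Publish the current queue depth; a CloudWatch alarm on this metric can
    # then drive the Auto Scaling group's scale-out/scale-in policies.
    cloudwatch.put_metric_data(
        Namespace="MyApp/Jobs",  # hypothetical custom namespace
        MetricData=[{
            "MetricName": "QueueDepth",
            "Value": float(queue_depth),
            "Unit": "Count",
        }],
    )
```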

How to use AWS Autoscaling Effectively

Please let me know the answer to the question below:
While reviewing the Auto Scaling events for your application, you notice that the application is scaling up and down multiple times in the same hour. What design change would you make to optimize for cost while preserving elasticity?
A. Modify the Auto Scaling group termination policy to terminate the oldest instance first.
B. Modify the Auto Scaling group termination policy to terminate the newest instance first.
C. Modify the CloudWatch alarm period that triggers your Auto Scaling scale-down policy.
D. Modify the Auto Scaling group cooldown timers.
E. Modify the Auto Scaling policy to use scheduled scaling actions.
I'm guessing D & E. Please suggest!
This is a question from one of the many "Become AWS certified!" websites. The purpose of such questions is to determine whether you understand AWS well enough to be officially recognised via certification. If you merely ask people for the correct answer, then you are only learning the answer... not the actual knowledge!
If you truly research Auto Scaling and think about it, here are some of the things you should be considering. I present this information hoping that you'll actually learn about AWS rather than just memorising answers (which won't help you in the real world).
Scaling In/Out vs Up/Down
Auto Scaling is all about launching additional Amazon EC2 instances when they are required (eg during times of peak load) and terminating them when they are no longer needed, thereby saving money.
Since instances are being added and removed, this is referred to as Scaling Out and Scaling In. Try to avoid using terms such as Scaling Up and Scaling Down, since they suggest that the instances themselves are being made bigger or smaller (which is not the case).
Scaling Out & In multiple times per hour
The assumption in this statement is that such scaling is not desired, which is true. Amazon EC2 is charged per hour, so adding instances and then removing them within a short period of time wastes money. This is known as thrashing.
In general, it is a good idea to Scale Out quickly and Scale In slowly. When a system needs extra capacity (Scale Out), it will want it fairly quickly to satisfy demand. When it no longer needs as much capacity, it might be worthwhile waiting before Scaling In because demand might increase again very soon thereafter.
Therefore, it is important to get the right alarm to trigger a scaling action and to wait a while before trying to scale again.
Optimize for cost while preserving elasticity
When an exam question makes a statement about optimizing, it's giving you a hint that the primary goal should be cost minimization, even if other choices might make more sense. Therefore, you want the solution to Scale In when possible, while avoiding thrashing.
Termination Policies
When an Auto Scaling Policy is triggered to remove instances, Auto Scaling uses the termination policy to determine which instance(s) to remove. This is, therefore, irrelevant to the question because optimizing for cost while preserving elasticity is only impacted by the number of instances, not which instances are actually terminated.
CloudWatch Alarms
Auto Scaling actions can be triggered by CloudWatch alarms, such as "average CPU < 70% for 15 minutes". A rule with a longer time period means that it will react to longer-term changes rather than temporary changes, which certainly helps avoid thrashing. However, it also means that Auto Scaling will take longer to respond to changes in demand.
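As an illustration of that kind of alarm (the group name and policy ARN are placeholders), a scale-in alarm that only fires after CPU stays low for three consecutive 5-minute periods might look like:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Only trigger the scale-in policy after a sustained 15-minute dip in CPU,
# which helps avoid thrashing on short-lived drops in load.
cloudwatch.put_metric_alarm(
    AlarmName="scale-in-low-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "my-asg"}],  # placeholder group
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=70.0,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:autoscaling:REGION:ACCOUNT:scalingPolicy:..."],  # scale-in policy ARN
)
```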
Cooldowns
From the Auto Scaling documentation:
The Auto Scaling cooldown period is a configurable setting for your Auto Scaling group that helps to ensure that Auto Scaling doesn't launch or terminate additional instances before the previous scaling activity takes effect. After the Auto Scaling group dynamically scales using a simple scaling policy, Auto Scaling waits for the cooldown period to complete before resuming scaling activities.
This is very useful, because newly-launched instances take some time (eg for booting, configuring) before they can take some of the application workload. If the cooldown is too short, then Auto Scaling might launch additional instances before the first one is ready. The result is that too many instances will be launched, meaning that some will need to Scale In soon after, leading to more thrashing.
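Adjusting the cooldown itself is a single call against the group (the group name and 5-minute value here are just examples):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Give newly launched instances time to boot and start taking load before
# Auto Scaling is allowed to act on the next simple-scaling alarm.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-asg",  # placeholder
    DefaultCooldown=300,            # seconds
)
```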
Scheduled Actions
Instead of triggering Scale In and Scale Out actions based upon a metric, Auto Scaling can be configured to use Scheduled actions. For example, increasing the minimum number of instances at 8am in the morning before an expected rush, and decreasing the minimum number at 6pm when usage starts to drop off.
Scheduled Actions are unlikely to cause thrashing, since scaling is based on a schedule rather than metrics that frequently change.
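As a sketch, the 8am/6pm example above could be expressed as two recurring scheduled actions (group name, action names and sizes are made up; Recurrence is a cron expression in UTC):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Raise the minimum size before the expected morning rush...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="my-asg",
    ScheduledActionName="morning-scale-out",
    Recurrence="0 8 * * *",   # every day at 08:00 UTC
    MinSize=4,
)

# ...and lower it again in the evening when usage drops off.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="my-asg",
    ScheduledActionName="evening-scale-in",
    Recurrence="0 18 * * *",  # every day at 18:00 UTC
    MinSize=1,
)
```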
The Correct Answer
The correct answer is... I'm not going to tell you! However, by reading the above information and trying to grok how Auto Scaling works, you will hopefully come to a better understanding of the question and arrive at a suitable answer.
This way, you will have learned something rather than merely memorizing the answers.

Does ELB connection draining apply when spot instances are terminated?

A new AWS ELB feature, connection draining, was recently announced.
http://aws.amazon.com/about-aws/whats-new/2014/03/20/elastic-load-balancing-supports-connection-draining/
Apparently this works with Auto Scaling Groups - instances are drained before being removed, but does that also apply to spot instances that are being terminated by AWS due to a rising spot price?
I couldn't find anything definitive, but from my reading on this, I think the answer is almost certainly no. Spot instances are different animals than regular instances, and with the way connection draining works you can specify up to a 60-minute delay before a draining-enabled instance gets terminated when it becomes unhealthy. If AWS were to extend this added layer of safety to spot instances, it would completely up-end the way spot instances are used and how they are positioned.
The trade-off for using spot instances has always been "you can pay a fraction of the cost, but you risk being terminated at any instant without warning". If they added an up-to-60-minute 'warning' to spot instances, while it would be fantastic from the end user's point of view, I think it would severely eat into AWS's on-demand and reserved-instance pricing model, so they probably won't support this anytime soon (unless forced to by competitive pressure).
EDIT 1/6/2015: now, almost a year later, AWS has indeed added a 'two minute warning' for EC2 spot instance termination. https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/
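Code running on a spot instance can watch for that notice by polling the instance metadata endpoint described in the linked post; it returns 404 until a termination is scheduled. A small standard-library sketch:

```python
import time
import urllib.error
import urllib.request

TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

def termination_scheduled() -> bool:
    # The endpoint 404s until AWS schedules this instance for termination,
    # at which point it returns the planned termination timestamp.
    try:
        urllib.request.urlopen(TERMINATION_URL, timeout=2)
        return True
    except urllib.error.URLError:
        return False

while not termination_scheduled():
    time.sleep(5)
# ~2 minutes remain: checkpoint work, deregister from the load balancer, etc.
```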