CodeDeploy taking a long time - amazon-web-services

We trigger autoscaling on a TargetResponseTime threshold. A new EC2 instance takes close to 20 minutes to launch and turn healthy. When we check CodeDeploy we see two kinds of time: the Deployment history shows a start time of Aug 22, 2019 3:10 PM and an end time of Aug 22, 2019 3:28 PM, but going into that particular deployment, we see a duration of 2 minutes 21 seconds, from ApplicationStop to AfterAllowTraffic. Where is the rest of the time spent? Why does the Deployment history show 18 minutes whereas the deployment time is 2 minutes 21 seconds?
How can we reduce this time?
Background: to launch EC2 instances, Auto Scaling uses a launch configuration that installs the CodeDeploy agent. Each new instance sits in the Pending:Wait state of the Auto Scaling group's instance lifecycle, held by the hook CodeDeploy-managed-automatic-launch-deployment-hook-DGENSVPC1b-f51a955c-194e-4a51-ad9b-1489101325ba
autoscaling:EC2_INSTANCE_LAUNCHING,ABANDON,600
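The hook line above packs three fields into one string: the lifecycle transition, the default result, and the heartbeat timeout in seconds. A minimal boto3 sketch (the ASG name is a placeholder) to split that summary and confirm the actual HeartbeatTimeout configured on the hook:

```python
def parse_hook_summary(summary):
    """Split the console's 'transition,default-result,timeout' hook summary."""
    transition, default_result, timeout = summary.split(",")
    return {
        "transition": transition,
        "default_result": default_result,
        "heartbeat_timeout_s": int(timeout),
    }

def describe_hooks(asg_name):
    """List the lifecycle hooks on an ASG (requires AWS credentials)."""
    import boto3  # imported lazily so the sketch stays importable without AWS
    asg = boto3.client("autoscaling")
    return asg.describe_lifecycle_hooks(AutoScalingGroupName=asg_name)["LifecycleHooks"]
```

Here `ABANDON,600` means the instance is abandoned if CodeDeploy does not complete the hook within 600 seconds, which puts an upper bound on how long a launch can hang in Pending:Wait.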

Using Amazon's AMI instead of a custom AMI helped reduce this time from the previous ~20 minutes to ~5-6 minutes.

It's hard to say without more visibility into your system. The cause could range from tasks in your EC2 user data to your ELB health check settings. Could you take another look at the different CodeDeploy lifecycle events and see where the time is going? E.g. if you view the specific CodeDeploy action, you can "view events" to see a list of deployment lifecycle events and the time it took to complete each of them. Once you find out what's taking the longest, you can begin to narrow down the root cause.
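Each lifecycle event returned by CodeDeploy's get_deployment_instance API carries a start and end time, so the per-event durations can be tallied with a short boto3 script. This is only a sketch, assuming the documented response shape; the deployment and instance IDs would be your own:

```python
from datetime import datetime, timedelta

def event_durations(lifecycle_events):
    """Map each lifecycle event name to its wall-clock duration."""
    out = {}
    for ev in lifecycle_events:
        if "startTime" in ev and "endTime" in ev:
            out[ev["lifecycleEventName"]] = ev["endTime"] - ev["startTime"]
    return out

def instance_event_durations(deployment_id, instance_id):
    """Fetch the lifecycle events for one target instance (needs AWS creds)."""
    import boto3
    cd = boto3.client("codedeploy")
    resp = cd.get_deployment_instance(deploymentId=deployment_id,
                                      instanceId=instance_id)
    return event_durations(resp["instanceSummary"]["lifecycleEvents"])
```

If all the per-event durations add up to ~2 minutes, the remaining ~16 minutes were spent outside the deployment itself (instance boot, user data, agent startup, health checks).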

Related

Stop AWS OpenSearch service temporarily for cost savings

I am looking at doing some cost savings in AWS and want to know if we can stop and then start the AWS OpenSearch service for a couple of days.
My scenario is that the application which uses the OpenSearch service (Elasticsearch) remains down for 2 days every week. During this time OpenSearch remains active and incurs costs.
I know one option to save the costs is to downgrade the node type and reduce the number of nodes during the application downtime.
But let me know if there are any other options where I can entirely "Switch Off" and "Switch On" the service just like we can do with EC2 instances.
There is no way to stop the cluster today. What I have seen done to reduce the bill is editing the cluster down to a t2.small (or smaller) instance type, which is significantly cheaper than the previous instance.
Then, when testing needed to resume, the team changed the instance type back to what they required.
One other option is to take a snapshot of your domain, then delete the OpenSearch domain for the weekend, and finally restore it on Monday from the snapshot you took.
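The resize-down/resize-up cycle can be scripted with boto3's update_domain_config. A minimal sketch, assuming hypothetical instance types and a placeholder domain name (note the change itself triggers a blue/green configuration deployment on the AWS side, which takes time):

```python
def cluster_config(off_hours,
                   peak=("r6g.large.search", 3),
                   idle=("t3.small.search", 1)):
    """Pick the instance type/count for the current window (values are examples)."""
    instance_type, count = idle if off_hours else peak
    return {"InstanceType": instance_type, "InstanceCount": count}

def resize_domain(domain_name, off_hours):
    """Apply the chosen size to an OpenSearch domain (requires AWS credentials)."""
    import boto3  # imported lazily so the sketch stays importable without AWS
    client = boto3.client("opensearch")
    client.update_domain_config(DomainName=domain_name,
                                ClusterConfig=cluster_config(off_hours))
```

A pair of scheduled invocations (e.g. from EventBridge) calling `resize_domain(..., off_hours=True)` on Friday and `off_hours=False` on Monday would automate the pattern described above.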
AWS launched OpenSearch Serverless during re:Invent 2022, but it was still in preview.
This is the ideal solution, as it would not require you to manage clusters.
https://aws.amazon.com/opensearch-service/features/serverless/
Until then, go for downsizing the instances during non-peak hours.
[Edit]: It is GA now as on 25th Jan 2023
https://aws.amazon.com/blogs/big-data/amazon-opensearch-serverless-is-now-generally-available/

AWS ASG target tracking an ECS took 15 minutes to scale-in after the desired tasks of ECS is 0

I have an ECS cluster on AWS which uses a capacity provider. The ASG associated with the capacity provider is responsible for scaling EC2 instances out and in based on the ECS desired task count. It is worth mentioning that the desired task count is managed by a Lambda function and updated based on some metrics (it calculates the depth of an SQS queue and changes the ECS desired task count accordingly).
Scaling out happens almost immediately (not counting provisioning and pending time), but when the Lambda sets the desired task count to zero, it takes at least 15 minutes for the ASG to turn off the instances. Since we are using high-performance EC2 types in large numbers, this scale-in delay costs us a lot of money. Is there any way to reduce this cooldown to a few minutes?
P.S.: I have set the default cooldown to 120 but it didn't change anything.
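One possible workaround (an assumption, not documented capacity-provider behavior): since a Lambda already controls the ECS desired count, the same Lambda could set the ASG desired capacity directly instead of waiting for managed scaling to react. A hedged boto3 sketch with placeholder names; note this may conflict with managed termination protection if that is enabled on the capacity provider:

```python
def instances_needed(task_count, tasks_per_instance):
    """Ceiling division: how many container instances the tasks require."""
    return -(-task_count // tasks_per_instance)

def sync_asg_to_tasks(asg_name, task_count, tasks_per_instance):
    """Push the computed capacity straight to the ASG (requires AWS creds)."""
    import boto3  # imported lazily so the sketch stays importable without AWS
    asg = boto3.client("autoscaling")
    asg.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=instances_needed(task_count, tasks_per_instance),
        HonorCooldown=False)  # skip the default cooldown for immediate scale-in
```

Calling `sync_asg_to_tasks("my-asg", 0, 4)` right after zeroing the ECS desired count would request an immediate scale-in rather than waiting for the managed scaling alarm.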

EC2 persistent instance retirement scheduled

On the personal health dashboard in the AWS Console, I've got this notification
EC2 persistent instance retirement scheduled
yesterday which says that one of my ec2 instances is scheduled to retire on 13th March 2019. The status was 'upcoming' while the start and end times both were set to 14-Mar-2019.
The content of the notification starts with:
Hello,
EC2 has detected degradation of the underlying hardware hosting your Amazon EC2 instance (instance-ID: i-xxxxxxxxxx) associated with your AWS account (AWS Account ID: xxxxxxxxxx) in the xxxx region. Due to this degradation your instance could already be unreachable. We will stop your instance after 2019-03-13 00:00 UTC.
....
I've got yet another notification today for the same instance and with the same subject line but the status has been changed to 'ongoing' and the start time is 27-Feb-2019 while the end time is 14-Mar-2019.
I was planning to do a stop/start of the instance next week, but does the second notification tell me to do it ASAP?
Yes, it is better to do the stop/start ASAP. Your notification even says:
Due to this degradation your instance could already be unreachable
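The stop/start itself can be scripted so it happens in one supervised step (a stop/start, unlike a reboot, moves the instance to new underlying hardware). A minimal boto3 sketch; the instance ID is of course a placeholder:

```python
import re

def looks_like_instance_id(s):
    """EC2 instance IDs are 'i-' followed by 8 or 17 hex characters."""
    return re.fullmatch(r"i-[0-9a-f]{8}(?:[0-9a-f]{9})?", s) is not None

def stop_start(instance_id):
    """Stop the instance, wait, then start it so it lands on healthy hardware."""
    assert looks_like_instance_id(instance_id)
    import boto3  # imported lazily so the sketch stays importable without AWS
    ec2 = boto3.client("ec2")
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

Be aware that instance-store data does not survive a stop/start, and a public IP (without an Elastic IP) will change.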

CloudFormation, CodeDeploy, ELB & Auto-Scaling Group

I am trying to build a stack with an ELB, an Auto-Scaling Group and a Pipeline (with CodeBuild and CodeDeploy).
I can't understand how it is supposed to work:
the Auto Scaling group starts two instances and waits X minutes before it begins checking instance state
the CodeDeploy application deployment group is waiting for the Auto-Scaling group to be created and ready
the pipeline takes about 10 minutes to start deploying the application
My issue is that when I create the stack, there appears to be a loop: the AG requires an application from CodeDeploy, and CodeDeploy requires a stabilized AG. To be clear, by the time the application is ready to deploy, my Auto Scaling group is already terminating instances and starting new ones, so CodeDeploy tries to deploy to instances that are already terminated or terminating.
I don't really want to configure HealthCheckGracePeriod and PauseTime to be ~10-15 minutes... it is way too long.
Are there any best practices for CloudFormation + ELB + AG + CodeDeploy via a Pipeline?
What should be the steps to achieve that?
Thank you!
This stopping/starting of the instances is most probably linked to the Deployment Type: in-place vs. blue/green.
I have tried both in my setup, and I will try to summarize how they work.
Let's say that for this example, you have an Autoscaling group which at the time of deploying the application has 2 running instances and the deployment configuration is OneAtATime. Traffic is controlled by the Elastic Load Balancer. Then:
In-place deployment:
CodeDeploy gets notified of a new revision available.
It tells the ELB to stop directing traffic to 1st instance.
Once traffic to one instance is stopped, it starts the deployment process: Stop the application, download bundle etc.
If the deployment is successful (validate service hook returned 0), it tells ELB to resume traffic to that instance.
At this point, 1 instance is running the old code and 1 is running the new code.
Right after, the ELB stops traffic to the 2nd instance and the deployment process repeats there.
Important note:
With the ELB enabled, the time it takes to block traffic to an instance before deployment, and the time it takes to allow traffic after it, depend directly on your health check: time = Healthy threshold * Interval.
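The formula above is trivial but worth making concrete, since it is where much of the "missing" deployment time hides:

```python
def traffic_switch_seconds(healthy_threshold, interval_seconds):
    """Worst-case time for the ELB to flip an instance's health status,
    per the rule of thumb: Healthy threshold * Interval."""
    return healthy_threshold * interval_seconds
```

For example, with a healthy threshold of 5 checks at a 30-second interval, each traffic switch costs up to 150 seconds, and an in-place OneAtATime deployment pays that cost twice per instance (once to drain, once to re-admit).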
Blue/green deployment:
CodeDeploy gets notified of a new revision available.
It copies your Auto Scaling group: the same configuration of the group (including scaling policies, scheduled actions etc.) and the same number of instances (using the same AMI as your original AG) that were there at the start of the deployment - in our case 2.
At this point, there is no traffic going to the new AG.
CodeDeploy performs all the usual installation steps to one machine.
If successful, it deploys to the second machine too.
It directs traffic from the instances in your old AG to the new AG.
Once traffic is completely re-routed, it deletes the old AG and terminates all its instances (after a period specified in Deployment Settings - this option is only available if you select Blue/Green)
Now ELB is serving only the new AG.
From experience:
Blue/green deployment is a bit slower, since you need to wait for the machines to boot up, but you get a much safer and more fail-proof deployment.
In general I would stick with blue/green, with the load balancer enabled and the deployment configuration AllAtOnce: if it fails, customers won't be affected since the instances won't be receiving traffic, and it will be 2x as fast since you deploy in parallel rather than sequentially.
If your health checks and validate-service hooks are thorough enough, you can probably delete the original AG with minimal waiting time (5 minutes at the time of writing this post).

AWS Is it possible to automatically terminate and recreate new instances for an auto scaling group periodically?

We have an AWS Auto Scaling group that has 10-20 servers behind a load balancer. After running for a couple of weeks, some of these servers go bad. We have no idea why the servers go bad, and it will take some time for us to get to a stage where we can debug this issue.
In the interim is there a way to tell AWS to terminate all the instances in the scaling group in a controlled fashion (one by one) until all the instances are replaced by new ones every week or so?
You can achieve this very effectively using Data Pipeline.
See the developer guide: How do I stop and start Amazon EC2 instances at scheduled intervals with AWS Data Pipeline?
There is no function in Auto Scaling to tell it to automatically terminate and replace instances. However, you could script such functionality.
Assumptions:
Terminate instances that are older than a certain number of hours old
Do them one-at-a-time to avoid impacting available capacity
You wish to replace them immediately
A suitable script would do the following:
Loop through all instances in a given Auto-Scaling Group using describe-auto-scaling-instances
If the instance belongs to the desired Auto Scaling group, retrieve its launch time via describe-instances
If the instance is older than the desired number of hours, terminate it using terminate-instance-in-auto-scaling-group with --no-should-decrement-desired-capacity so that it is automatically replaced
Then, wait a few minutes to allow it to be replaced and continue the loop
The script could be created by using the AWS Command-Line Interface (CLI) or a programming language such as Python.
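The steps above can be sketched in Python with boto3. Everything here is a sketch under stated assumptions: the ASG name is a placeholder, the max age is arbitrary, and the loop deliberately terminates at most one instance per run so capacity is never dented by more than one:

```python
from datetime import datetime, timedelta, timezone

def is_older_than(launch_time, now, max_age_hours):
    """True once the instance has been running longer than the cutoff."""
    return now - launch_time > timedelta(hours=max_age_hours)

def recycle_one_old_instance(asg_name, max_age_hours=336):
    """Terminate (and auto-replace) the first instance older than the cutoff."""
    import boto3  # imported lazily so the sketch stays importable without AWS
    asg = boto3.client("autoscaling")
    ec2 = boto3.client("ec2")
    now = datetime.now(timezone.utc)
    for page in asg.get_paginator("describe_auto_scaling_instances").paginate():
        for inst in page["AutoScalingInstances"]:
            if inst["AutoScalingGroupName"] != asg_name:
                continue
            desc = ec2.describe_instances(InstanceIds=[inst["InstanceId"]])
            launch = desc["Reservations"][0]["Instances"][0]["LaunchTime"]
            if is_older_than(launch, now, max_age_hours):
                asg.terminate_instance_in_auto_scaling_group(
                    InstanceId=inst["InstanceId"],
                    ShouldDecrementDesiredCapacity=False)  # ASG launches a replacement
                return inst["InstanceId"]  # one at a time; re-run once replaced
    return None
```

Run on a schedule (cron, EventBridge), it drains the fleet of old instances one by one over successive invocations.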
Alternatively, you could program the instances to self-destruct after a given period of time (e.g. 72 hours) by simply having the operating system shut down the instance. This would cause Auto Scaling to terminate the instance and replace it.
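The self-destruct variant could look like the sketch below, intended to run from cron on the instance itself. This assumes a Linux instance (it reads /proc/uptime) and that the ASG health check is configured so a stopped instance is replaced:

```python
import subprocess

def should_self_destruct(uptime_seconds, max_hours=72):
    """True once the OS has been up longer than the allowed lifetime."""
    return uptime_seconds >= max_hours * 3600

def check_and_shutdown(max_hours=72):
    """Cron entry point: shut the box down once it exceeds its lifetime."""
    with open("/proc/uptime") as f:
        uptime = float(f.read().split()[0])
    if should_self_destruct(uptime, max_hours):
        subprocess.run(["sudo", "shutdown", "-h", "now"])
```

One caveat: if every instance was launched at roughly the same time, they will all hit the limit together, so stagger `max_hours` (or add jitter) to avoid a simultaneous fleet-wide shutdown.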
There are two ways to achieve what you are looking for: Scheduled Auto Scaling actions, or taking instances out of the ASG one at a time via Standby.
Scheduled Scaling
Scaling based on a schedule allows you to scale your application in response to predictable load changes. For example, every week the traffic to your web application starts to increase on Wednesday, remains high on Thursday, and starts to decrease on Friday. You can plan your scaling activities based on the predictable traffic patterns of your web application.
https://docs.aws.amazon.com/autoscaling/latest/userguide/schedule_time.html
You most likely want this.
Auto Scaling enables you to put an instance that is in the InService state into the Standby state, update or troubleshoot the instance, and then return the instance to service. Instances that are on standby are still part of the Auto Scaling group, but they do not actively handle application traffic.
https://docs.aws.amazon.com/autoscaling/latest/userguide/as-enter-exit-standby.html
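The standby cycle described in that doc can be driven from boto3 as well. A minimal sketch with placeholder names; `ShouldDecrementDesiredCapacity=True` prevents the ASG from launching a replacement while you work on the instance:

```python
def standby_request(asg_name, instance_ids):
    """Build the arguments shared by enter_standby / exit_standby."""
    return {"AutoScalingGroupName": asg_name,
            "InstanceIds": list(instance_ids)}

def troubleshoot_in_standby(asg_name, instance_id):
    """Pull one instance out of service, then return it (requires AWS creds)."""
    import boto3  # imported lazily so the sketch stays importable without AWS
    asg = boto3.client("autoscaling")
    req = standby_request(asg_name, [instance_id])
    asg.enter_standby(ShouldDecrementDesiredCapacity=True, **req)
    # ... inspect, patch, or restart the instance here ...
    asg.exit_standby(**req)
```
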
As of Nov 20, 2019, EC2 AutoScaling supports Max Instance Lifetime: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-ec2-auto-scaling-supports-max-instance-lifetime/
From the announcement:
The maximum instance lifetime specifies the maximum amount of time (in seconds) that an instance can be in service. The maximum duration applies to all current and future instances in the group. As an instance approaches its maximum duration, it is terminated and replaced, and cannot be used again.
When configuring the maximum instance lifetime for your Auto Scaling group, you must specify a value of at least 86,400 seconds (1 day). To clear a previously set value, specify a new value of 0.
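Setting the limit is a one-call change via boto3's update_auto_scaling_group. A small sketch (ASG name is a placeholder) that also encodes the 1-day minimum and the 0-clears-it rule quoted above:

```python
def lifetime_seconds(days):
    """MaxInstanceLifetime must be >= 86,400 s (1 day); 0 clears the setting."""
    if days != 0 and days < 1:
        raise ValueError("minimum is 1 day; use 0 to clear the setting")
    return int(days * 86400)

def set_max_lifetime(asg_name, days):
    """Cap how long instances in the group may live (requires AWS creds)."""
    import boto3  # imported lazily so the sketch stays importable without AWS
    asg = boto3.client("autoscaling")
    asg.update_auto_scaling_group(AutoScalingGroupName=asg_name,
                                  MaxInstanceLifetime=lifetime_seconds(days))
```

For the weekly-recycle use case in the question, `set_max_lifetime("my-asg", 7)` would have Auto Scaling rotate every instance out after 7 days with no custom scripting.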