Just in the past hour, our AWS CodeDeploy deployments have started hanging before even having a Start Time when looking at the Deployment Details page. The Status stays at In Progress indefinitely. We have not changed any of our deployment lifecycle details, so that leads me to believe that this is either some kind of CodeDeploy outage, or some kind of fluke that I'm not sure how to reset (Stopping the deployment and starting another ends up in the same place).
Has anyone else experienced this problem? Any ideas how to fix it?
Check the host agent on your instances. It's possible that it stopped running.
It looks like there is currently degraded performance on the EC2 API in the Virginia region. I'm also having issues with CodeDeploy not working, and I assume it's caused by the increased error rates on DescribeInstances in that region. See the AWS status page.
I had this happen and I kept restarting the pipeline to no avail. I finally pushed a minor change and it magically started working again.
I've been experiencing this with my ECS service for a few months now. Previously, when we updated the service with a new task definition, it would perform the rolling update correctly, deregistering the old tasks from the target group and draining all HTTP connections to them before eventually stopping them. Lately, however, ECS is going straight to stopping the old tasks before draining connections or removing them from the target group. This results in 8-12 seconds of API downtime for us while new HTTP requests continue to be routed to the now-stopped tasks that are still in the target group. This happens whether we trigger the service update via the CLI or the console - same behaviour. Shown here are a screenshot of a sample sequence of Events from ECS demonstrating the issue and the corresponding ECS agent logs for the same instance.
Of particular note when reviewing these ECS agent logs against the sequence of events is that the logs do not have an entry at 21:04:50 when the task was stopped. This feels like a clue to me, but I'm not sure where to go from here with it. Has anyone experienced something like this, or have any insights as to why the tasks wouldn't drain and be removed from the target group before being stopped?
For reference, the service is behind an AWS Application Load Balancer. Happy to provide additional details if someone thinks of anything else that may be relevant.
It turns out that ECS changed the timing of when the events would be logged in the UI in the screenshot. In fact, the targets were actually being drained before being stopped. The "stopped n running task(s)" message is now logged at the beginning of the task shutdown lifecycle steps (before deregistration) instead of at the end (after deregistration) like it used to.
That said, we were still getting brief downtime spikes on our service at the load balancer level during deployments. Ultimately this turned out to be caused by the high startup overhead of the new task versions: as they spun up, they briefly pegged the CPU of the instances in the cluster at 100%, and when there was also sufficient traffic during the deployment, some requests got dropped.
A good-enough for now solution was to adjust our minimum healthy deployment percentage up to 100% and set the maximum deployment percentage to 150% (as opposed to the old 200% setting), which forces the deployments to "slow down", only launching 50% of the intended new tasks at a time and waiting until they are stable before launching the rest. This spreads out the high task startup overhead to two smaller CPU spikes rather than one large one and has so far successfully prevented any more downtime during deployments. We'll also be looking into reducing the startup overhead itself. Figured I'd update this in case it helps anyone else out there.
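To make the arithmetic behind those two percentages concrete, here is a rough sketch. It assumes the rounding I understand ECS to apply (lower bound rounded up, upper bound rounded down), and the desired count of 4 is just an illustrative number, not our real service size.

```python
def deployment_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (min_running, max_running, surge) for a rolling update.

    Assumption: the minimum healthy bound is rounded up and the
    maximum bound is rounded down to whole tasks.
    """
    # Ceiling division for the lower bound, using integers only.
    min_running = -(-desired_count * minimum_healthy_percent // 100)
    # Floor division for the upper bound.
    max_running = desired_count * maximum_percent // 100
    # Extra tasks that may start before any old task has to stop.
    surge = max_running - desired_count
    return min_running, max_running, surge


# Illustrative service with 4 desired tasks (hypothetical number).
print(deployment_bounds(4, 100, 200))  # (4, 8, 4): all 4 new tasks start at once
print(deployment_bounds(4, 100, 150))  # (4, 6, 2): only 2 new tasks start at a time
```

With the old 200% setting every replacement task can start in one batch, while 150% caps each batch at half the service, which is what splits the CPU spike in two.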
I enabled Amazon Inspector (v2) for a single EC2 instance that I have. It's an Ubuntu machine running PHP and Apache, nothing special, and the status has shown Scanning for the last 3 hours.
I looked at htop on this machine, and I see that /snap/amazon-ssm-agent/####/amazon-ssm-agent is running and that several /snap/amazon-ssm-agent/####/ssm-agent-worker processes are running. Still, 3 hours have passed and I have no results.
Is it working? Isn't it working? Is there a more verbose status?
Also, if someone has experience with this, can you share the average time you waited for results?
I've been in a similar situation - I run Inspector scans on EC2 as well as ECR. ECR scans were pretty quick, but for EC2 it took about 4.5 hours to get to the INITIAL_SCAN_COMPLETE state. It's concerning that it takes this long, but I noticed it was doing about 470 vulnerability checks.
Here is the document that contains the status information:
https://docs.aws.amazon.com/inspector/latest/user/assessing-coverage.html
Scanning – Amazon Inspector is continuously monitoring and scanning the instance.
It won't just scan once and stop; it continuously monitors the instance for future vulnerabilities too. Hence the status shows Scanning.
You need to go to the Findings tab to see what's going on with the vulnerabilities. Findings -> By instance -> select your instance to see the findings related to it. Hope that helps.
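If you'd rather check the status outside the console, something along these lines might help. It's only a sketch: it assumes the response shape of the Inspector `list_coverage` API as I understand it (a `coveredResources` list whose items carry `resourceId` and a `scanStatus` with `statusCode` and `reason`), and the sample data below is made up.

```python
def summarize_coverage(response):
    """Map each resource ID to its (statusCode, reason) pair.

    `response` is assumed to look like the result of
    boto3.client("inspector2").list_coverage().
    """
    summary = {}
    for resource in response.get("coveredResources", []):
        status = resource.get("scanStatus", {})
        summary[resource["resourceId"]] = (
            status.get("statusCode"),
            status.get("reason"),
        )
    return summary


# Hypothetical response for a single instance (not real output).
sample = {
    "coveredResources": [
        {
            "resourceId": "i-0123456789abcdef0",
            "resourceType": "AWS_EC2_INSTANCE",
            "scanStatus": {"statusCode": "ACTIVE", "reason": "PENDING_INITIAL_SCAN"},
        }
    ]
}

print(summarize_coverage(sample))
```

The `reason` value is the more verbose part: it should tell you whether the first scan is still pending or whether something is blocking it.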
One of my ECS fargate tasks is stopping and restarting in what seems to be a somewhat random fashion. I started the task in Dec 2019 and it has stopped/restarted three times since then. I've found that the task stopped and restarted from its 'Events' log (image below) but there's no info provided as to why it stopped..
So what I've tried to do to date to debug this is
Checked the 'Stopped' tasks inside the cluster for info as to why it might have stopped. No luck here as it appears 'Stopped' tasks are only held there for a short period of time.
Checked CloudWatch logs for any log messages that could be pertinent to this issue, nothing found
Checked CloudTrail event logs for any event pertinent to this issue, nothing found
Confirmed the memory and CPU allocation is sufficient for the task; in fact the task never reaches 30% of its limits
Read multiple AWS threads about similar issues where solutions mainly seem to be connected to using an ELB which I'm not..
Does anyone have any further debugging advice or ideas about what might be going on here?
I ran into the same issue and found this in the AWS docs:
https://docs.aws.amazon.com/AmazonECS/latest/userguide/task-maintenance.html
When AWS determines that a security or infrastructure update is needed
for an Amazon ECS task hosted on AWS Fargate, the tasks need to be
stopped and new tasks launched to replace them.
Also, here is a GitHub issue on storing stopped task info in CloudWatch Logs:
https://github.com/aws/amazon-ecs-agent/issues/368
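One way to keep the stop reason around after the console forgets the task is to log it yourself. Below is a minimal sketch of a Lambda-style handler, assuming you have an EventBridge rule that forwards "ECS Task State Change" events to it. The handler wiring and the sample event are my own illustration, not something from the linked issue.

```python
import json


def extract_stop_info(event):
    """Return the stop details from an ECS task state change event, or None."""
    if event.get("detail-type") != "ECS Task State Change":
        return None
    detail = event.get("detail", {})
    if detail.get("lastStatus") != "STOPPED":
        return None
    return {
        "taskArn": detail.get("taskArn"),
        "stopCode": detail.get("stopCode"),
        "stoppedReason": detail.get("stoppedReason"),
        "stoppedAt": detail.get("stoppedAt"),
    }


def handler(event, context):
    """Hypothetical Lambda entry point; anything printed lands in CloudWatch Logs."""
    info = extract_stop_info(event)
    if info is not None:
        print(json.dumps(info))
    return info


# Made-up event for illustration only.
sample_event = {
    "source": "aws.ecs",
    "detail-type": "ECS Task State Change",
    "detail": {
        "taskArn": "arn:aws:ecs:us-east-1:111122223333:task/example/abc123",
        "lastStatus": "STOPPED",
        "stopCode": "TaskFailedToStart",
        "stoppedReason": "Example reason text",
    },
}

handler(sample_event, None)
```

Since the log group has its own retention setting, the reason stays searchable long after the task disappears from the Stopped list.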
(NOTE: I've modified my original post since I've now collected a little more data.)
Something just started happening today after several weeks of no issues and I can't think of anything I changed that would've caused this.
I have a Spring Boot app living behind an NGINX proxy (all Dockerized), all of which is in an AWS ECS Fargate cluster.
After deployment, I'm noticing that--randomly (as in, sometimes this doesn't happen)--a call to the services being served up by Spring Boot will 503 (behind the NGINX proxy). It seems to do this on every second deployment, with a subsequent deployment fixing the matter, i.e. calls to the server will succeed for awhile (maybe a few seconds; maybe a few minutes), but then stop.
I looked at the "HealthyHostCount" metric and noticed that when I get the 503, my main target group says it has no registered/healthy hosts in either AZ. I'm not sure what would cause the TG to deregister a target, especially since a subsequent deployment seems to "fix" the issue.
Any insight/help would be greatly appreciated.
Thanks in advance.
UPDATE
It seems to happen right after I "terminate original task set" in the Blue/Green deployment from CodeDeploy. I'm wondering if it's an AZ issue, i.e. I haven't specified enough tasks to run in both of them.
I think it's likely your services are failing the health check on the TargetGroup.
When the TargetGroup determines that an existing target is unhealthy, it will unregister it, which will cause ECS to then launch a task to replace it. In the meantime you'll typically see Gateway errors like this.
On the ECS page, click your service, then click on the Events tab... if I am correct, here you'd see messages about why the task was stopped like 'unhealthy' or 'health check timeout'.
The cause can be myriad. An overloaded cluster (probably not the case with Fargate), not enough resources, a runaway request that eats all the memory or CPU.
Your case smells like a slow startup issue. ECS web services have a 'Health Check Grace Period'... that is, a time to wait before health checks start counting. If your container takes too long to get started when first launched, the first couple of health checks may fail and cause the tasks to 'cycle'. A new deployment may be slower if your images are particularly large, because the ECS host has to download the image. Fargate, in general, is a bit slower than EC2 hosts because of overhead in how it sets up networking. If slow startup is your problem, you can try increasing this grace period, but you should also investigate how to speed up your container startup time (decrease image size, other nginx tweaks to speed up initialization, etc).
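For what it's worth, here is roughly how I'd size the grace period. It's a sketch, not a recipe: the cluster and service names, the 90-second startup time, and the 30-second margin are all placeholders; the only real name is the `healthCheckGracePeriodSeconds` parameter of the ECS `update_service` call.

```python
def grace_period_update(cluster, service, startup_seconds, margin_seconds=30):
    """Build keyword arguments for an ECS update_service call.

    The grace period is the observed startup time plus a safety margin,
    so early health check failures don't get the task replaced.
    """
    return {
        "cluster": cluster,
        "service": service,
        "healthCheckGracePeriodSeconds": startup_seconds + margin_seconds,
    }


# Placeholder names and timings; measure your own container startup.
kwargs = grace_period_update("my-cluster", "my-service", startup_seconds=90)
print(kwargs)

# To apply it (needs boto3 and credentials, so left commented out here):
# import boto3
# boto3.client("ecs").update_service(**kwargs)
```

If the tasks stop cycling with a longer grace period, that confirms the slow-startup theory, and you can then work on the startup time itself.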
It turns out it was an issue with how I'd configured one of my target groups and/or the ALB, although what the exact issue was, I couldn't determine as I re-created the TGs and the service from scratch.
If I had to guess, I'd guess my TG wasn't configured with the right port, or the ALB had a bad rule/listener.
If anyone else has this issue after creating a new environment for an existing app, make sure you configure your security groups. My environment was timing out (and reporting bad health) because the database's inbound rules did not allow access from my web servers.
You can check whether this is your issue by trying URLs in your app that do not depend on the same web services/databases; if those load while the others time out, the security group is a likely culprit.
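A quick way to double-check the rule is to look at the database's security group as returned by the EC2 `describe_security_groups` API. The sketch below only looks at rules that reference another security group (not CIDR ranges), and the group IDs and port are made-up examples.

```python
def allows_inbound_from(db_group, source_group_id, port):
    """Check whether db_group lets source_group_id reach the given TCP port.

    `db_group` is assumed to be one entry of the "SecurityGroups" list
    from describe_security_groups. CIDR-based rules are ignored here.
    """
    for perm in db_group.get("IpPermissions", []):
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports.
            port_ok = True
        else:
            from_port = perm.get("FromPort")
            to_port = perm.get("ToPort")
            port_ok = (
                protocol == "tcp"
                and from_port is not None
                and to_port is not None
                and from_port <= port <= to_port
            )
        if not port_ok:
            continue
        for pair in perm.get("UserIdGroupPairs", []):
            if pair.get("GroupId") == source_group_id:
                return True
    return False


# Hypothetical database security group allowing MySQL from the web tier.
db_group = {
    "GroupId": "sg-0db00000000000000",
    "IpPermissions": [
        {
            "IpProtocol": "tcp",
            "FromPort": 3306,
            "ToPort": 3306,
            "UserIdGroupPairs": [{"GroupId": "sg-0web0000000000000"}],
        }
    ],
}

print(allows_inbound_from(db_group, "sg-0web0000000000000", 3306))  # True
print(allows_inbound_from(db_group, "sg-0new0000000000000", 3306))  # False
```

The second call is the situation I was in: the new environment got a different security group, and nothing on the database side allowed it in.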
Beanstalk is throwing the errors below during deployment:
- The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own): [i-0a2f60975b252747d].
- Command execution completed on all instances. Summary: [Successful: 0, TimedOut: 1].
- Unsuccessful command execution on instance id(s) 'i-0a2f60975b252747d'. Aborting the operation.
This happens every second day, even though I don't change anything at all. The deployment takes a long time and times out at the end. I have been unable to find a permanent solution for this.
Even "restart appserver" from the console doesn't work. The EC2 instance seems healthy enough: the application works fine and I can SSH in, etc. However, rebooting the EC2 instance brings it back to normal.
Has anyone else faced this problem, and what can be done to find a permanent solution? Also, how will this behave with Auto Scaling? Could it cause major problems in production?
I had a semi-similar issue with Elastic Beanstalk. It turns out I was using an old t1.micro instance, which is no longer recommended. I upgraded the instance to a t2.small and I no longer get timeouts, and my deployments are much faster and more stable.