My AWS Elastic Beanstalk Environment is stuck in Health: Grey.
My application is working, I can access it fine. However, I am unable to change configuration or deploy new versions because I get a message saying that
Environment named ______ is in an invalid state for this operation. Must be Ready.
If I run eb health on my console, I get the following output:
Status: Ready Health Grey
And
ELB State: InService
Is there anything I can try to revive my environment? I have contacted AWS Support, but they are really slow. Another option I can think of is terminating the environment and creating a new one, but I really would prefer to avoid that.
EB can be fairly tough to troubleshoot when you have full access to EB, the instances, the ELBs, etc... never mind trying to proxy this through SO.
I'd do the following:
Bring up a new environment under the same application
When it comes up green, use the EB application "Swap" functionality to swap the environments
More details on this process are here: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.CNAMESwap.html
This performs a DNS switch, so you should have no downtime. You'll still have the old environment running if you want to troubleshoot it later with your friendly AWS support staff.
The only negatives are:
You'll continue paying for both environment stacks while waiting to troubleshoot the other.
The DNS is a little tough as you can't guarantee clients respect the short time-outs EB DNS entries have. They should, but someone may decide to keep using a local cached version. As with anything relying on trusting client-side features, it's a bit out of your control.
If you deploy an RDS DB via EB, you can't use the swap as the DB is tied to the environment (NEVER deploy an RDS DB in a production EB environment via EB!!!!)
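The swap step described above can be sketched in Python. This is only an illustration, assuming `client` behaves like `boto3.client("elasticbeanstalk")`; the helper name `swap_when_green` is made up for this example. It refuses to swap unless the replacement environment reports Ready and Green.

```python
def swap_when_green(client, old_env, new_env):
    """Swap CNAMEs only if the replacement environment is Ready and Green.

    `client` is assumed to behave like boto3.client("elasticbeanstalk").
    Returns True if the swap was requested, False otherwise.
    """
    resp = client.describe_environments(EnvironmentNames=[new_env])
    envs = resp.get("Environments", [])
    if not envs:
        return False  # the new environment does not exist (yet)
    env = envs[0]
    if env.get("Status") != "Ready" or env.get("Health") != "Green":
        return False  # not safe to swap yet
    client.swap_environment_cnames(
        SourceEnvironmentName=old_env,
        DestinationEnvironmentName=new_env,
    )
    return True
```

With boto3 installed and credentials configured, it could be called as `swap_when_green(boto3.client("elasticbeanstalk"), "my-env-old", "my-env-new")`, where both environment names are placeholders.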
I know the question's already been answered, but I think the cause of the problem is important, instead of recommending a complete rebuild of OP's environment.
Elastic Beanstalk has 4 different colors - green, yellow, red, and grey. However, each color can mean multiple different things that vary wildly. Here are the potential statuses behind the grey color:
Grey (Suspended) - Your application has had such severe health issues Elastic Beanstalk is no longer monitoring it
Grey (Unknown) - The health agent has not reported enough data on an instance yet
Grey (Pending) - An operation is in progress on an instance within the command timeout (for example bootstrapping the environment)
Notice the incredible disparity between "Pending" and "Suspended". In Pending, it just needs a little more time, or perhaps you can shut down a single resource and have it respawn. In Suspended, all monitoring is shot, and you ought to rebuild the environment ASAP. Big difference in impact to customers during the solution.
Baked into Beanstalk are the vanilla colors. To get the additional statuses, you have to enable Enhanced Monitoring. You can do it in a couple minutes, and the cost is nominal.
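As a rough illustration of how differently the grey statuses should be handled, here is a small Python sketch. The mapping simply paraphrases the descriptions above, and the function name `advise` is hypothetical. With enhanced health enabled, the color and status could come from boto3's `describe_environment_health` (requesting the `Color` and `HealthStatus` attributes), but this sketch takes them as plain strings.

```python
# Suggested next step for each grey status, paraphrasing the list above.
GREY_ADVICE = {
    "Suspended": "rebuild the environment as soon as possible",
    "Unknown": "wait for the health agent to report more data",
    "Pending": "wait for the in-progress operation to finish or time out",
}


def advise(color, health_status):
    """Return a suggested action for an environment's color and status."""
    if color != "Grey":
        return "not a grey status"
    return GREY_ADVICE.get(
        health_status,
        "unrecognized grey status; check the environment events",
    )
```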
To read more about the statuses and common problems with Beanstalk, I'd recommend a blog my colleague wrote: Health Monitoring in AWS Beanstalk
A more difficult situation occurs when the environment state is unknown. Even the Abort current operation option doesn't work. To solve this I had to apply these steps:
copy the instance id associated with Beanstalk environment
find the instance using that instance id in EC2 dashboard
stop the EC2 instance
start the EC2 instance
The environment should come to the previous state after restarting the instance.
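The stop/start steps above could be scripted roughly as follows. This is a sketch, assuming `ec2` behaves like `boto3.client("ec2")`; the function name `restart_instance` is made up for illustration, and the instance id is whatever you copied from the Beanstalk environment.

```python
def restart_instance(ec2, instance_id):
    """Stop an EC2 instance, wait until it is stopped, then start it again.

    `ec2` is assumed to behave like boto3.client("ec2").
    """
    ec2.stop_instances(InstanceIds=[instance_id])
    # Block until the instance has fully stopped before starting it.
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.start_instances(InstanceIds=[instance_id])
    # Block until the instance reports the running state again.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```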
(NOTE: I've modified my original post since I've now collected a little more data.)
Something just started happening today after several weeks of no issues and I can't think of anything I changed that would've caused this.
I have a Spring Boot app living behind an NGINX proxy (all Dockerized), all of which is in an AWS ECS Fargate cluster.
After deployment, I'm noticing that--randomly (as in, sometimes this doesn't happen)--a call to the services being served up by Spring Boot will 503 (behind the NGINX proxy). It seems to do this on every second deployment, with a subsequent deployment fixing the matter, i.e. calls to the server will succeed for a while (maybe a few seconds; maybe a few minutes), but then stop.
I looked at the "HealthyHostCount" and noticed that when I get the 503, my main target group says it has no registered/healthy hosts in either AZ. I'm not sure what would cause the TG to deregister a target, especially since a subsequent deployment seems to "fix" the issue.
Any insight/help would be greatly appreciated.
Thanks in advance.
UPDATE
It seems to happen right after I "terminate original task set" from the Blue/Green deployment from CodeDeploy. I'm wondering if it's an AZ issue, i.e. I haven't specified enough tasks to run on them both.
I think it's likely that your services are failing the health check on the TargetGroup.
When the TargetGroup determines that an existing target is unhealthy, it will unregister it, which will cause ECS to then launch a task to replace it. In the meantime you'll typically see Gateway errors like this.
On the ECS page, click your service, then click on the Events tab... if I am correct, here you'd see messages about why the task was stopped like 'unhealthy' or 'health check timeout'.
The cause can be myriad. An overloaded cluster (probably not the case with Fargate), not enough resources, a runaway request that eats all the memory or CPU.
Your case smells like a slow startup issue. ECS web services have a 'Health Check Grace Period'... that is, a time to wait before health checks start. If your container takes too long to get started when first launched, the first couple of health checks may fail and cause the tasks to 'cycle'. A new deployment may be slower if your images are particularly large, because the ECS host has to download the image. Fargate, in general, is a bit slower than EC2 hosts because of overhead in how it sets up networking. If the slow startup is your problem, you can try increasing this grace period, but you should also investigate how to speed up your container startup time (decrease image size, other nginx tweaks to speed initialization, etc).
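To make the grace-period reasoning concrete, here is a deliberately simplified back-of-the-envelope model in Python. It assumes the task needs its full startup time plus enough consecutive passing checks to be considered healthy; the function names and the sample numbers are illustrative, not measured values. The corresponding ECS service setting is `healthCheckGracePeriodSeconds`.

```python
def recommended_grace_period(startup_s, interval_s, healthy_threshold, margin_s=0):
    """Conservative grace period: startup time plus the time needed to pass
    `healthy_threshold` consecutive health checks, plus an optional margin.

    This is a simplified model, not an official AWS formula.
    """
    return startup_s + interval_s * healthy_threshold + margin_s


def at_risk_of_cycling(grace_s, startup_s, interval_s, healthy_threshold):
    """True if the configured grace period is shorter than the estimate."""
    return grace_s < recommended_grace_period(startup_s, interval_s, healthy_threshold)
```

For example, a container that takes 90 seconds to start, checked every 30 seconds with a healthy threshold of 5, gives an estimate of 90 + 30 * 5 = 240 seconds under this model.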
It turns out it was an issue with how I'd configured one of my target groups and/or the ALB, although what the exact issue was, I couldn't determine as I re-created the TGs and the service from scratch.
If I had to guess, I'd guess my TG wasn't configured with the right port, or the ALB had a bad rule/listener.
If anyone else had this issue after creating a new environment for an existing app, make sure you configure your security groups. My environment was timing out (and had bad health) because my web pages did not have access to the database inbound rule on AWS.
You can see if this is your issue if you are able to connect to urls in your app that do not connect to the same web services/databases.
Newbie to Amazon Web Services here. I launched an instance from a Public AMI and found that I could not ssh into the instance - I received the error "Connection timed out." I checked the security groups to verify that Port 22 was associated with 0.0.0.0/0. Additionally, I checked the route tables to verify that 0.0.0.0/0 is associated with target gateway attached to the VPC.
I find that only 1/2 status checks have passed - the instance status check failed. I have tried stopping and starting the instance as well as terminating and launching a new instance, both to no avail. The error that I see in the system log is:
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,1).
From this previous question ("Ec2 1/2 checks passed"), it appears that this could be a virtualization issue, but I'm not sure if that was due to something I did on my end when launching the instance or something that occurred from the creators of the AMI?
Any help would be appreciated!
Can you share any more details about how you deployed the instance? Did you use the AWS Management Console, or one of the command line tools or SDKs to deploy it? Which public AMI did you use? Was it one of the ones provided by Amazon?
Depending on your needs, I would make sure that you use one of the AMIs provided by Amazon, such as Ubuntu, Amazon Linux, CentOS, etc. Here are the links to the docs on AMIs, but you can learn quite a bit by just searching for images. Since you mentioned virtualization types though, I'd suggest reading up briefly on the HVM vs. Paravirtual virtualization types on AWS. Each of the instance types / families uses a certain virtualization type, which is indicated in the chart on this page.
Instance Status Checks
This documentation page covers the instance status checks, which you'll probably want to familiarize yourself with. It's entirely possible that shutting down (not restart, but shutdown) and then starting the instance back up might resolve the instance status check.
Spot Instances - cost savings!
By the way, I'll just mention this since you indicated that you're new to AWS ... if you're just playing around right now, you can save a ton of cost by deploying EC2 Spot Instances, instead of paying the normal, on-demand rates. Depending on current rates, you can save more than 50%, and per-second billing still applies. Although there's the possibility that your EC2 instance could get "interrupted" based on market demand, you can configure your Spot Instance to just "Hibernate" or "Stop" instead of terminating and relaunching. That way, your instance state is saved for when it relaunches.
Hope this helps!
1) Use well-known images or contact the image developer. Perhaps it requires more than one drive or tricky partitioning.
2) Make sure you selected the proper HVM/PV image according to the instance type.
3) (After the checks have passed) make sure the instance has a public IP.
In other words I would like to temporarily turn off an environment (and its associated billing costs) but not delete it entirely.
It seems if I set the [Configuration > Web Tier > Scaling > Minimum instance count] to 0 along with the related "Maximum instance count", AWS rejects those settings as invalid. Ditto for questionable values like 0.1.
Any ideas for temporarily taking an Elastic Beanstalk environment out of service?
You can achieve this by defining time periods in your environment.
From your Elastic Beanstalk environment dashboard, navigate through: Configuration -> Capacity -> Time-based scaling -> Scheduled actions.
You may choose to use multiple scheduled actions to start and stop your environment.
i.e. One to start your environment and another to stop it, or scale out during high load periods and back in after.
In the background, Elastic Beanstalk will terminate and recreate the EC2 instances. This means you'll lose any data on the instances' local storage.
I have used this to turn my environments off when they are no longer needed. For example, I have a workload that crawls looking for new content, and I pause this crawling once it has completed. I also use this feature to spin down my nonproduction environments outside of working hours.
I prefer this method to others as it is supported by both single instance and load-balanced environments.
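The same scheduled actions can also be expressed as Elastic Beanstalk option settings in the `aws:autoscaling:scheduledaction` namespace. The Python sketch below only builds the settings list; the action names, the cron times (in UTC), and the helper names are assumptions chosen for illustration.

```python
def scheduled_action(name, min_size, max_size, recurrence):
    """Build option settings for one time-based scaling action."""
    namespace = "aws:autoscaling:scheduledaction"
    values = {"MinSize": min_size, "MaxSize": max_size, "Recurrence": recurrence}
    return [
        {
            "ResourceName": name,
            "Namespace": namespace,
            "OptionName": option,
            "Value": str(value),
        }
        for option, value in values.items()
    ]


def office_hours_schedule():
    """Scale to zero on weekday evenings and back to one on weekday mornings.

    The action names and times are hypothetical examples.
    """
    stop = scheduled_action("StopEvenings", 0, 0, "0 19 * * 1-5")
    start = scheduled_action("StartMornings", 1, 1, "0 7 * * 1-5")
    return stop + start
```

The resulting list has the shape expected by the `OptionSettings` parameter of boto3's `update_environment` call for Elastic Beanstalk.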
How about eb scale 0 using the Elastic Beanstalk CLI tools?
I don't believe this is possible. In any case, you would still have a load balancer in front for which you pay, so you're not really reducing the costs for this environment.
It makes more sense to delete the environment completely and create a script that will deploy a version from your application versions to a new environment.
I now use the ParkMyCloud service to suspend Elastic Beanstalk environments (and also RDS database instances) that I'm not using, both manually and in scheduled fashion.
Although ParkMyCloud is non-free, it's saved me a lot more money than I've spent using it. It's worked quite well over the last few years so I'm comfortable recommending it.
I am trying to achieve zero down time redeploys on AWS elastic beanstalk.
I basically have two instances on my environment coupled with Jenkins for CI (Using Tomcat).
What I am trying to achieve is each time I trigger a redeploy from Jenkins, only one instance of the environment is redeployed, then have a timeout to allow the new instance to load the application and then redeploy the second instance.
In order to achieve that timeout I am setting both the "Pause time" and "Command timeout", but unfortunately it's as though this limit is not honored.
The first instance is redeployed but after around 1 minute the second instance is redeployed regardless of the timeout value I set.
Has anyone achieved this? Any insights on how to achieve it?
"Pause Time" relates to environmental configuration made to instances. "Command timeout" relates to commands executed while building the environment (for example if you've customised the container). Neither has anything to do with rolling application updates or zero downtime deployments. The documentation around this stuff is confusing and fragmented.
For zero downtime application deployments, AWS EB gives you two options:
Batch application updates
Running two environments and cutting over
Option 1 feels like a lot less work but in my testing hasn't been truly zero downtime. There is a HARDCODED timeout value where traffic will be routed to instances after 1 minute regardless of whether the load balancer healthcheck passes or not.
But if you still want to go ahead, then running two instances and setting a batch size of 50% or 1 should give you what you want.
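As a sketch of the batch settings mentioned above, the following Python helper builds the relevant option settings in the `aws:elasticbeanstalk:command` namespace. The function name is hypothetical; `DeploymentPolicy`, `BatchSizeType`, and `BatchSize` are the Elastic Beanstalk option names for rolling deployments.

```python
def rolling_deploy_settings(batch_size, batch_type="Percentage"):
    """Build option settings for a rolling deployment.

    batch_type is "Percentage" (e.g. 50) or "Fixed" (e.g. 1 instance).
    """
    if batch_type not in ("Percentage", "Fixed"):
        raise ValueError("batch_type must be 'Percentage' or 'Fixed'")
    namespace = "aws:elasticbeanstalk:command"
    return [
        {"Namespace": namespace, "OptionName": "DeploymentPolicy", "Value": "Rolling"},
        {"Namespace": namespace, "OptionName": "BatchSizeType", "Value": batch_type},
        {"Namespace": namespace, "OptionName": "BatchSize", "Value": str(batch_size)},
    ]
```

For a two-instance environment, `rolling_deploy_settings(50)` and `rolling_deploy_settings(1, "Fixed")` both describe updating one instance at a time.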
I have a PHP application deployed to Amazon Elastic Beanstalk. But I noticed a problem: every time I push my code changes via git aws.push to Elastic Beanstalk, the deployed application doesn't pick up the changes. I checked the events log on my application's Beanstalk environment and noticed that every time Beanstalk issues:
Deploying new version to instance(s)
it's always followed by:
The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own):
[i-d5xxxxx]
The same thing happens when I try to request snapshot logs. The Beanstalk issues:
requestEnvironmentInfo is starting
then after a few minutes it's again followed by:
The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own): [i-d5xxxxx].
I had this problem a few times. It seems to affect only particular instances. So it can be solved by terminating the EC2 instance (done via the EC2 page on the Management Console). Thereafter, Elastic Beanstalk will detect that there are 0 healthy instances and automatically launch a new one.
If this is a production environment with only 1 instance and you want minimal downtime:
Configure minimum instances to 2, and Beanstalk will launch another instance for you.
Terminate the problematic instance via the EC2 tab; Beanstalk will launch another instance for you because the minimum is 2.
Configure minimum instances back to 1, and Beanstalk will remove one of your two instances.
By default Elastic Beanstalk "throws a timeout exception" after 8 minutes (480 seconds defined in settings) if your commands did not complete in time.
You can set a higher timeout, up to 30 minutes (1800 seconds).
{
    "Namespace": "aws:elasticbeanstalk:command",
    "OptionName": "Timeout",
    "Value": "1800"
}
Read here: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/command-options.html
Had the same issue here (single t1.micro instance).
I solved the problem by rebooting the EC2 instance via the EC2 page on the Management Console (and not from the EB page).
Beanstalk deployment (and other features like Get Logs) works by sending SQS commands to instances. An SQS client is deployed to the instances and checks the queue about every 20 secs (see /var/log/cfn-hup.log):
2018-05-30 10:42:38,605 [DEBUG] Receiving messages for queue https://sqs.us-east-2.amazonaws.com/124386531466/93b60687a33e19...
If SQS Client crashes or has network problems on t1/t2 instances then it will not be able to receive commands from Beanstalk, and deployment would time out. Rebooting instance restarts SQS Client, and it can receive commands again.
An easier way to fix SQS Client is to restart cfn-hup service:
sudo service cfn-hup restart
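To check whether the SQS client has stopped polling before restarting anything, one could scan the log for the most recent "Receiving messages" line. The Python sketch below assumes the log line format shown above and that the log timestamps and `now` are in the same timezone; the function names and the 60-second threshold are arbitrary choices for illustration.

```python
import re
from datetime import datetime

# Matches the timestamp of lines like:
# 2018-05-30 10:42:38,605 [DEBUG] Receiving messages for queue https://...
POLL_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ \[DEBUG\] Receiving messages for queue",
    re.MULTILINE,
)


def last_poll_time(log_text):
    """Return the datetime of the last queue poll in the log, or None."""
    matches = POLL_RE.findall(log_text)
    if not matches:
        return None
    return datetime.strptime(matches[-1], "%Y-%m-%d %H:%M:%S")


def poller_looks_stalled(log_text, now, max_gap_s=60):
    """True if no poll was logged within max_gap_s seconds before `now`."""
    last = last_poll_time(log_text)
    return last is None or (now - last).total_seconds() > max_gap_s
```

On an instance, `log_text` would be the contents of /var/log/cfn-hup.log and `now` would be `datetime.now()`.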
In the case of deployment, an alternative to shutting down the EC2 instances and waiting for Elastic Beanstalk to react, or messing about with minimum and maximum instances, is to simply perform a Rebuild environment on the target environment.
If a previous deployment failed due to timeout then the new version will still be registered against the environment, but due to the timeout it will not appear to be operational (in my experience the instance appears to still be running the old version).
Rebuilding the environment seems to reset things with the new version being used.
Obviously there's the downside with that of a period of downtime.
I think the correct way to deal with this is to figure out the cause of the timeout by doing what this answer suggests.
chongzixin's answer is what needs to be done if you need this fixed ASAP before investigating the reason for a timeout.
However, if you do need to increase timeout, see the following:
Add configuration files to your source code in a folder named .ebextensions and deploy it in your application source bundle.
Example:
option_settings:
  "aws:elasticbeanstalk:command":
    Timeout: 2400
*The Timeout value is the length of time before timeout, in seconds.
Reference: https://serverfault.com/a/747800/496353
"Restart App Server(s)" from the "Actions" menu in Elastic Beanstalk management dashboard followed by eb deploy fixes it for me.
After two days of checking random issues, I restarted both EC2 instances one after another to make sure there is no downtime. Site worked fine but after a while, website started throwing error 504.
When I checked the HTTP server, nginx was down and an "Out of HDD space" error was being thrown. I increased the HDD size, Elastic Beanstalk created new instances, and the issue was fixed.
For me, the problem was my VPC security group rules. According to the docs, you need to allow outbound traffic on port 123 for NTP to work. I had the port closed, so the clock was drifting, and so the EC2 instances were becoming unresponsive to commands from the Elastic Beanstalk environment: taking forever to deploy (only to time out), failing to get logs, etc.
Thank you @Logan Pickup for the hint in your comment.
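A quick way to audit for the NTP problem described above is to inspect a security group's egress rules. The Python sketch below assumes the rules are in the shape returned under `IpPermissionsEgress` by boto3's `describe_security_groups`; the function name is hypothetical, and for simplicity it ignores which destination CIDRs each rule covers.

```python
def allows_ntp_egress(ip_permissions_egress):
    """True if any egress rule permits UDP port 123 (NTP).

    Simplification: destination ranges are not checked, so a rule that
    only allows UDP 123 to one specific address still counts.
    """
    for rule in ip_permissions_egress:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":  # "-1" means all protocols and all ports
            return True
        if protocol == "udp":
            if rule.get("FromPort", 0) <= 123 <= rule.get("ToPort", 65535):
                return True
    return False
```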