ECS container hangs when calling ssm API endpoint - amazon-web-services

It seems the API call hangs when calling ssm.ap-southeast-2.amazonaws.com from an ECS container. Below is the debug output where it hangs:
2020-06-11 22:47:10,831 - MainThread - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (2): ssm.ap-southeast-2.amazonaws.com:443
This works fine on the EC2 instance itself; it's only inside the ECS task container that it doesn't work and the connection times out.
What could be the reason behind this?

Works fine on EC2 instance
Hmm... I think your container is a victim of IMDSv2. Please allow me to explain.
Instance metadata is data about your instance that you can use to configure or manage the running instance. Instance metadata is divided into categories, for example, host name, events, and security groups. You can query instance metadata by calling the following URL:
http://169.254.169.254/latest/meta-data/
On Nov 19, 2019, v2 of the Instance Metadata Service was released. One of the features introduced with EC2 Instance Metadata Service version 2 (IMDSv2) is "Protecting against open layer 3 firewalls and NATs" 1, which sets a TTL (or hop limit 2) of 1 on the low-level IP packets containing the secret token, so the packet can only cross one host. A TTL of 1 means that the instance is not able to forward the packet to a Docker container running on an ECS container instance, as that would count as another hop.
From 1:
With IMDSv2, setting the TTL value to “1” means that requests from the EC2 instance itself will work because they’re returned to the caller (on the instance) before the subtraction occurs. But if the EC2 instance has been misconfigured as an open router, layer 3 firewall, VPN, tunnel, or NAT device, the response containing the token will have its TTL reduced to zero before leaving the instance, and the packet containing the response will be discarded on its way out of the instance, preventing transport to the attacker. The information simply won’t make it further than the EC2 instance itself, which means that an attacker won’t get the response back with the token, and with it the ability to access instance metadata, even if they’ve been successful at getting past all other defenses.
A consequence of this change is that Docker containers running on ECS container instances in bridge or awsvpc network mode can no longer query the metadata endpoint. The following request will time out:
$ curl -X PUT -H "x-aws-ec2-metadata-token-ttl-seconds: 120" "http://169.254.169.254/latest/api/token"
If you are using the AWS CLI, it has a fallback mechanism to IMDSv1, but only after a long delay (5 seconds), which makes it rather unusable.
From: https://github.com/aws/aws-sdk-js/issues/3024#issuecomment-589135606 :
From v2.575.0, the SDK is configured to default to the IMDSv2 workflow and, by default, will try three times (with a timeout of one second between attempts) to obtain the required token. If all three attempts fail, the SDK will then fall back to the IMDSv1 workflow.
Option 1 (Use with caution)
It is possible to use the 'modify-instance-metadata-options' 3 AWS CLI call on the Container Instance to change the TTL to a higher value by specifying a value for the --http-put-response-hop-limit flag.
The following AWS CLI command modifies the value to '2' when run on the EC2 instance:
$ aws ec2 modify-instance-metadata-options --instance-id $(curl 169.254.169.254/latest/meta-data/instance-id) --http-put-response-hop-limit 2 --http-endpoint enabled
... after which the curl command against token endpoint was successful from the Docker container.
A Lambda function can be invoked from an Auto Scaling lifecycle hook to configure the value '2' on any launching instance with the ModifyInstanceMetadataOptions API call. Another option is to place this command in the EC2 instance's user data so every instance can 'self-configure' itself with the updated hop limit. Note that in this case the instance profile needs an associated policy with the 'ec2:ModifyInstanceMetadataOptions' permission for this call to succeed.
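For the lifecycle-hook route, the Lambda only needs a couple of boto3 calls. A minimal sketch, assuming the standard "EC2 Instance-launch Lifecycle Action" event shape delivered by EventBridge (adapt if your trigger differs):

import boto3

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

def handler(event, context):
    detail = event["detail"]
    instance_id = detail["EC2InstanceId"]

    # Allow one extra network hop so containers on the instance can reach IMDSv2.
    ec2.modify_instance_metadata_options(
        InstanceId=instance_id,
        HttpPutResponseHopLimit=2,
        HttpEndpoint="enabled",
    )

    # Let the Auto Scaling group continue launching the instance.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
    )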
Option 2 (Recommended)
With regard to ECS, accessing the instance credentials from a container is not considered a best practice. Instead, the recommendation is to set a task role and use the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable to retrieve container-specific credentials from the ECS agent, for example with the command "curl 169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI". Up-to-date versions of the AWS CLI use this by default.
You can read more about the task role credentials here 4. A similar endpoint for task metadata is also available 5.
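For illustration only, this is roughly what the SDK does under the hood when that variable is set (recent SDK/CLI versions handle it automatically, so you normally never need to do this yourself):

import json
import os
import urllib.request

# Fetch temporary task-role credentials from the ECS agent credentials endpoint.
relative_uri = os.environ["AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"]
with urllib.request.urlopen("http://169.254.170.2" + relative_uri) as resp:
    creds = json.load(resp)

print(creds["AccessKeyId"], creds["Expiration"])  # task-role credentials, not instance credentials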
More discussion can be found here:
https://github.com/aws/aws-sdk-ruby/issues/2177
https://github.com/aws/containers-roadmap/issues/670
https://github.com/aws/aws-sdk-js/issues/3024

Related

Lambda function to start or stop ec2 based on application usage

I would like to investigate whether it is possible that, if someone hits the application URL, the instance is re-enabled and remains active as long as there is active use. If the resources are inactive for 10 to 20 minutes they should automatically disable themselves, i.e. the instance should be stopped.
The application is a multi-host application deployed on EC2 instances, with DNS record sets configured in Route 53.
Please suggest an approach.
Create an EC2-start Lambda function that gets called when you hit a URL hosted on API Gateway backed by this Lambda. Once the instance is up, redirect to the actual EC2 instance URL (so the Lambda will have to keep checking the status of the EC2 instance and, once it is running, redirect to this URL).
When the EC2 instance starts, trigger another Lambda on an event pattern matching the instance state 'running'; this Lambda attaches a CloudWatch alarm to the instance (a sketch follows these steps).
The CloudWatch alarm checks the CPU usage and, if it stays below 10% for 3 consecutive periods, stops the instance.
The Lambda should have a role with a policy granting full access to EC2 (later on, trim it down to the required permissions).
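A rough sketch of that second Lambda, assuming it is triggered by the standard EC2 instance state-change event and with the region hard-coded purely for illustration:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def handler(event, context):
    instance_id = event["detail"]["instance-id"]  # EC2 state-change event detail

    # Stop the instance when average CPU stays below 10% for 3 consecutive 5-minute periods.
    cloudwatch.put_metric_alarm(
        AlarmName=f"stop-idle-{instance_id}",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=3,
        Threshold=10.0,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:automate:us-east-1:ec2:stop"],  # built-in EC2 stop action
    )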
blog on stop/start ec2 instance
aws knowledge centre
aws Instance Scheduler
Create a CloudWatch alarm to monitor usage or activity.
Use an SNS topic to trigger a Lambda function based on the alarm.
Turn off the EC2 instance using Python in the Lambda function.
This should help with the code:
https://medium.com/geekculture/terraform-setup-for-automatically-turning-off-ec2-instances-upon-inactivity-d7f414390800
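As a minimal sketch of the last step (the INSTANCE_ID environment variable is just a stand-in for however you identify the instance; you could also parse it out of the alarm message delivered via SNS):

import os
import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    # Stop the instance identified by configuration (hypothetical environment variable).
    instance_id = os.environ["INSTANCE_ID"]
    ec2.stop_instances(InstanceIds=[instance_id])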

monitoring aws ec2 instance ports

I have an application running on EC2 that listens on many ports; some external devices connect to those ports to send data to my application. This is fine, but my client has a requirement that I must monitor those ports, and if one of them stops listening, the instance must be terminated and a new one started.
I was reading about CloudWatch, but I didn't find an alarm that I can customize like this (making requests to ports). Is it possible to do this using CloudWatch? I'm looking for a direction to create this monitoring, either using internal AWS services or developing a new solution (maybe a shell script).
Thanks!
I'm not aware of any AWS-provided EC2 health-check monitoring system for custom checks.
You could write an AWS Lambda function which sends requests to the ports on the EC2 instance you require, and schedule that Lambda to run periodically with whatever frequency you want using CloudWatch Events. The Lambda function can publish the result as a custom metric to CloudWatch, which then makes it possible to use it in an alarm and take action (such as spinning up a replacement instance) when it crosses whatever threshold you deem reasonable.
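A minimal sketch of such a Lambda (HOST, PORTS, INSTANCE_ID, and the metric namespace are placeholders; the InstanceId dimension is included so a later alarm can use the built-in EC2 actions):

import socket
import boto3

HOST = "10.0.0.10"                   # private IP of the instance (placeholder)
PORTS = [8080, 9000]                 # ports to check (placeholder)
INSTANCE_ID = "i-0123456789abcdef0"  # instance being monitored (placeholder)

cloudwatch = boto3.client("cloudwatch")

def port_open(host, port):
    try:
        with socket.create_connection((host, port), timeout=3):
            return True
    except OSError:
        return False

def handler(event, context):
    # 1 if every port is listening, 0 if at least one is down.
    value = 1 if all(port_open(HOST, p) for p in PORTS) else 0
    cloudwatch.put_metric_data(
        Namespace="Custom/PortChecks",
        MetricData=[{
            "MetricName": "PortsListening",
            "Dimensions": [{"Name": "InstanceId", "Value": INSTANCE_ID}],
            "Value": value,
            "Unit": "Count",
        }],
    )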
One part of AWS that does have basically what you are looking for built in, though, is ECS. Instead of a bare EC2 instance, you'd have a Docker container (running on an EC2 instance or Fargate) which can have health checks defined.
There are many ways to do what you are asking for.
Simplest solution: write a boto3/shell script to monitor the port and call the TerminateInstances API (or use the AWS CLI) to terminate the current instance. Needless to say, you need to pass AWS credentials or attach an instance profile with sufficient privileges to terminate the instance.
Using CloudWatch: have a script check the port status and send 1 or 0 (as a Count metric) to CloudWatch. Set a threshold in CloudWatch so that consecutive 0s (or NoData) terminate the instance. Alternatively, do not send any data to CloudWatch when the port is unavailable, and let NoData in CloudWatch trigger the terminate action. See: Cloudwatch - AddingTerminateActions
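A hedged sketch of such an alarm using boto3 (metric names, instance ID, and region are placeholders; the built-in terminate action expects the metric to carry an InstanceId dimension):

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Terminate the instance after three consecutive 0s; missing data also counts as breaching.
cloudwatch.put_metric_alarm(
    AlarmName="ports-down-terminate-i-0123456789abcdef0",
    Namespace="Custom/PortChecks",
    MetricName="PortsListening",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=1.0,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=["arn:aws:automate:us-east-1:ec2:terminate"],  # built-in EC2 terminate action
)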

Amazon ECS: how to log in to the EC2 instance it is associated with?

I initiated an Amazon ECS cluster following their tutorial (but I don't recall any step that asked me for key-pair information).
After I set it up, I found that there is an extra EC2 instance in my EC2 instance list that has started to charge me money. I wonder what that EC2 instance is doing.
Is it the EC2 instance that is associated with ECS, which I can start to build my own server on?
If so, how can I log into it? (There is no key-pair information for me to log in with. It says I need to log in via a valid username/password pair, but I don't even know my username.)
If not, how can I kill it? (Directly terminating it in the EC2 service does not help, since ECS seems to just start another one.)
The username will be ec2-user. ECS creates a launch configuration in which you can set a key pair.
ECS also creates an Auto Scaling group; you can find it under EC2 > Auto Scaling Groups. You can edit this group and set the minimum and desired capacity to 0, which will shut down the instance automatically.
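The same change can be scripted; a sketch with a placeholder group name:

import boto3

autoscaling = boto3.client("autoscaling")

# Scale the ECS cluster's Auto Scaling group down to zero so the container instance shuts down.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="EC2ContainerService-mycluster-EcsInstanceAsg",  # placeholder name
    MinSize=0,
    DesiredCapacity=0,
)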

How to make an HTTP call reach all instances behind an Amazon AWS load balancer?

I have a web app which runs behind an Amazon AWS Elastic Load Balancer with 3 instances attached. The app has a /refresh endpoint to reload reference data. It needs to be run whenever new data is available, which happens several times a week.
What I have been doing is assigning a public address to all instances and doing the refresh on each independently (using ec2-url/refresh). I agree with Michael's answer on a different topic that EC2 instances behind an ELB shouldn't allow direct public access. Now my problem is: how can I make an elb-url/refresh call reach all instances behind the load balancer?
And it would be nice if I can collect HTTP responses from multiple instances. But I don't mind doing the refresh blindly for now.
One of the ways I'd solve this problem is by:
writing the data to an AWS S3 bucket
triggering an AWS Lambda function automatically from the S3 write
using the AWS SDK to identify the instances attached to the ELB from the Lambda function, e.g. using boto3 from Python or the AWS Java SDK
calling /refresh on the individual instances from the Lambda (see the sketch after this list)
ensuring that when a new instance is created (due to autoscaling or deployment), it fetches the data from the S3 bucket during startup
ensuring that the private subnets the instances are in allow traffic from the subnets attached to the Lambda
ensuring that the security groups attached to the instances allow traffic from the security group attached to the Lambda
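A minimal sketch of steps 3 and 4, assuming an Application Load Balancer; the target group ARN and application port are placeholders:

import urllib.request
import boto3

TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:ap-southeast-2:123456789012:targetgroup/my-app/0123456789abcdef"  # placeholder
APP_PORT = 8080  # placeholder

elbv2 = boto3.client("elbv2")
ec2 = boto3.client("ec2")

def handler(event, context):
    # Find the instances currently registered with the load balancer's target group.
    health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
    instance_ids = [t["Target"]["Id"] for t in health["TargetHealthDescriptions"]]

    # Call /refresh on each instance over its private IP.
    reservations = ec2.describe_instances(InstanceIds=instance_ids)["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            ip = instance["PrivateIpAddress"]
            with urllib.request.urlopen(f"http://{ip}:{APP_PORT}/refresh", timeout=30) as resp:
                print(ip, resp.status)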
The key wins of this solution are:
the process is fully automated from the instant the data is written to s3,
avoids data inconsistency due to autoscaling/deployment,
simple to maintain (you don't have to hardcode instance ip addresses anywhere),
you don't have to expose instances outside the VPC
highly available (AWS ensures the Lambda is invoked on s3 write, you don't worry about running a script in an instance and ensuring the instance is up and running)
hope this is useful.
While this may not be possible given the constraints of your application and circumstances, it's worth noting that best-practice application architecture for instances running behind an AWS ELB (particularly if they are part of an Auto Scaling group) is to ensure that the instances are not stateful.
The idea is to make it so that you can scale out by adding new instances, or scale-in by removing instances, without compromising data integrity or performance.
One option would be to change the application to store the results of the reference data reload into an off-instance data store, such as a cache or database (e.g. Elasticache or RDS), instead of in-memory.
If the application was able to do that, then you would only need to hit the refresh endpoint on a single server - it would reload the reference data, do whatever analysis and manipulation is required to store it efficiently in a fit-for-purpose way for the application, store it to the data store, and then all instances would have access to the refreshed data via the shared data store.
While there is a latency increase in adding a round trip to a data store, it is often well worth it for the consistency of the application: under your current model, if one server lags behind the others in refreshing the reference data and the ELB is not using sticky sessions, requests via the ELB will return inconsistent data depending on which server they are allocated to.
You can't make these requests through the load balancer, so you will have to open up the security group of the instances to allow incoming traffic from sources other than the ELB. That doesn't mean you need to open it to all direct traffic, though. You could simply whitelist an IP address in the security group to allow requests from your specific computer.
If you don't want to add public IP addresses to these servers then you will need to run something like a curl command on an EC2 instance inside the VPC. In that case you would only need to open the security group to allow traffic from some server (or group of servers) that exist in the VPC.
I solved it differently, without opening up new traffic in security groups or resorting to external resources like S3. It's flexible in that it will dynamically notify instances added through ECS or ASG.
The ELB's target group offers periodic health checks to ensure the instances behind it are live. The health check is a URL that your server responds on, and the endpoint can include a timestamp parameter of the most recent configuration. Every server in the target group will receive the health-check ping within the configured interval threshold. If the parameter of the ping changes, it signals a refresh.
A URL may look like:
/is-alive?last-configuration=2019-08-27T23%3A50%3A23Z
Above I passed a UTC timestamp of 2019-08-27T23:50:23Z
A service receiving the request will check if the in-memory state is at least as recent as the timestamp parameter. If not, it will refresh its state and update the timestamp. The next health-check will result in a no-op since your state was refreshed.
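A sketch of such a handler, using Flask purely as a stand-in for whatever framework the service uses:

from datetime import datetime, timezone
from flask import Flask, request

app = Flask(__name__)
state = {"loaded_at": datetime.min.replace(tzinfo=timezone.utc)}

def reload_reference_data():
    # Placeholder for the real refresh logic.
    state["loaded_at"] = datetime.now(timezone.utc)

@app.route("/is-alive")
def is_alive():
    param = request.args.get("last-configuration")
    if param:
        wanted = datetime.strptime(param, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if state["loaded_at"] < wanted:
            reload_reference_data()  # ideally offloaded to a background thread (see notes below)
    return "OK", 200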
Implementation notes
If refreshing the state can take more time than the interval window or the target group's health timeout, you need to offload it to another thread to prevent concurrent updates or outright service disruption, as the health checks need to return promptly. Otherwise the node will be considered offline.
If you are using traffic port for this purpose, make sure the URL is secured by making it impossible to guess. Anything publicly exposed can be subject to a DoS attack.
As you are using S3 you can automate your task by using the ObjectCreated notification for S3.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
https://docs.aws.amazon.com/cli/latest/reference/s3api/put-bucket-notification.html
You can install the AWS CLI and write a simple Bash script that will monitor that ObjectCreated notification. Start a cron job that will look for the S3 notification of a new object being created.
Set up a condition in that script to curl "http://127.0.0.1/refresh" when it detects a new object created in S3. With that done, you don't have to do the refresh manually each time.
I personally like the answer by @redoc, but wanted to give another alternative for anyone who is interested, which is a combination of his and the accepted answer. Using S3 object-creation events, you can trigger a Lambda, but instead of discovering the instances and calling them (which requires the Lambda to be in the VPC), you could have the Lambda use SSM (aka Systems Manager) to execute commands via a PowerShell or Bash document on EC2 instances that are targeted via tags. The document would then call 127.0.0.1/refresh like the accepted answer does.
The benefit of this is that your Lambda doesn't have to be in the VPC, and your EC2 instances don't need inbound rules to allow the traffic from the Lambda. The downside is that it requires the instances to have the SSM agent installed, which sounds like more work than it really is. There are AWS AMIs already optimized with the SSM agent, and installing it yourself in the user data is very simple.
Another potential downside, depending on your use case, is that it uses an exponential ramp-up for simultaneous executions: if you're targeting 20 instances, it runs 1, then 2 at once, then 4 at once, then 8, until they are all done or it reaches whatever you set as the maximum. This is because of the error-recovery behaviour it has built in; it doesn't want to destroy all your instances if something is wrong, like slowly putting your weight on ice.
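A hedged sketch of the SSM call from the Lambda (the tag key/value are placeholders; AWS-RunShellScript is the built-in document):

import boto3

ssm = boto3.client("ssm")

# Run a shell command on every instance carrying the tag, with no inbound access required.
ssm.send_command(
    Targets=[{"Key": "tag:app", "Values": ["my-web-app"]}],   # placeholder tag
    DocumentName="AWS-RunShellScript",                         # built-in SSM document
    Parameters={"commands": ["curl -s http://127.0.0.1/refresh"]},
    MaxConcurrency="25%",  # rate control related to the ramp-up behaviour mentioned above
    MaxErrors="1",
)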
You could make the call multiple times in rapid succession to call all the instances behind the Load Balancer. This would work because the AWS Load Balancers use round-robin without sticky sessions by default, meaning that each call handled by the Load Balancer is dispatched to the next EC2 Instance in the list of available instances. So if you're making rapid calls, you're likely to hit all the instances.
Another option is that if your EC2 instances are fairly stable, you can create a Target Group for each EC2 Instance, and then create a listener rule on your Load Balancer to target those single instance groups based on some criteria, such as a query argument, URL or header.

AWS: None of the Instances are sending data

I'm trying to set up an Elastic Beanstalk application with Amazon Web Services however I'm receiving a load of errors with the message None of the instances are sending data. I've tried deleting the Elastic Beanstalk Application and the EC2 instance several times with the sample application and trying again but I get the same error.
I also tried uploading a flask application with AWS Elastic Beanstalk command line tools but then I received the error below:
Environment health has transitioned from Pending to Severe. 100.0 % of the requests to the ELB are failing with HTTP 5xx. Insufficient request rate (0.5 requests/min) to determine application health (7 minutes ago). ELB health is failing or not available for all instances. None of the instances are sending data
Why do I get this error and how do I fix it? Thanks.
You are using Enhanced Health Monitoring.
With enhanced health monitoring an agent installed on your EC2 instance monitors vital system and application level health metrics and sends them directly to Elastic Beanstalk.
When you see an error message like "None of the instances are sending data", it means either the agent on the instance has crashed or it is unable to post data to Elastic Beanstalk due to networking error or some other error.
For debugging this, I would recommend downloading "Full logs" from the AWS console. You can follow the instructions for getting logs in the section "Downloading Bundle Logs from Elastic Beanstalk Console" here.
If you are unable to download logs using the console for any reason you can also ssh to the instance and look at the logs in /var/log.
You will find logs for the health agent in /var/log/healthd/daemon.log.
Additional logs useful for this situation are /var/log/cfn-init.log, /var/log/eb-cfn-init.log and /var/log/eb-activity.log. Can you look at the logs and give more details of the errors you see?
This should hopefully give you more details regarding the error "None of the instances are sending data".
Regarding other health "causes" you are seeing:
Environment health has transitioned from Pending to Severe - This is because initially your environment health status is Pending. If the instances do not go healthy within the grace period, the health status transitions to Severe. In your case, since none of the instances is healthy / sending data, the health transitioned to Severe.
100.0 % of the requests to the ELB are failing with HTTP 5xx. Insufficient request rate (0.5 requests/min) to determine application health (7 minutes ago).
Elastic Beanstalk monitors other resources in addition to your EC2 instances when using enhanced health monitoring. For example, it monitors CloudWatch metrics for your ELB. This error means that all requests sent to your environment CNAME/load balancer are failing with HTTP 5xx errors. At the same time the request rate is very low, only 0.5 requests per minute, so even though all requests are failing, there are not many of them. "7 minutes ago" means that the information about ELB metrics is slightly old: because Elastic Beanstalk polls CloudWatch metrics every few minutes, the data can be slightly stale. This is as opposed to health data we get directly from the EC2 instances, which is near real time. In your case, since the instances are not sending data, the only available source for health is the ELB metrics, which are delayed by about 7 minutes.
ELB health is failing or not available for all instances
Elastic Beanstalk is looking at the health of your ELB, i.e. it is checking how many instances are in service behind ELB. In your case either all instances behind ELB are out of service or the health is not available for some other reason. You should double check that your service role is correctly configured. You can read how to configure service role correctly here or in the documentation. It is possible that your application failed to start.
In your case I would suggest focusing on the first error "None of the instances are sending data". For this you need to look at the logs as outlined above. Let me know what you see in the logs. The agent is started fairly early in the bootstrap process on the instance. So if you see an error like "None of the instances are sending data", it is very likely that bootstrap failed or the agent failed to start for some reason. The logs should tell you more.
Also make sure you are using an instance profile with your environment. Instance profile allows the health agent running on your EC2 instance to authenticate with Elastic Beanstalk. If instance profile is not associated with your environment then the agent will not be able to send data to Elastic Beanstalk. Read more about Instance Profiles with Elastic Beanstalk here.
Update
One common reason for the health cause "None of the instances are sending data" can be that your instance is in a VPC and your VPC does not allow NTP access. Typical indicator of this problem is the following message in /var/log/messages: ntpdate: Synchronizing with time server: [FAILED]. When this happens the clock on your EC2 instance can get out of sync and the data is considered invalid. You should also see a health cause on the instances on the health page on the AWS web console that tells you that instance clock is out-of-sync. The fix is to make sure that your VPC allows access to NTP.
There can be many reasons why the health agent is not able to send any data, so this may not be the answer to your problem, but it was to mine and hopefully can help somebody else:
I got the same error and looking into /var/log/healthd/daemon.log the following was repeatedly reported:
sending message(s) failed: (Aws::Healthd::Errors::GroupNotFoundException) Group 97c30ca2-5eb5-40af-8f9a-eb3074622172 does not exist
This was caused by me making and using an AMI image from an EC2 instance inside an Elastic Beanstalk environment. That is, I created a temporary environment with one instance with the same configuration as my production environment, went into the EC2 console and created an image of the instance, terminated the temporary environment, and then created yet another environment using the new custom AMI.
Of course (in hindsight) this meant some settings of the temporary environment were still being used. In this case specifically /etc/healthd/config.yaml, resulting in the health agent trying to send messages to a no longer existing health group.
To fix this and make sure there was no other stale configuration around, I instead started a new EC2 instance by hand from the default AMI used in the production environment (find it under the 'Instances' configuration page of your environment), provision that, then create a new image from that and use that image in my new EB environment.
Check if your instance type's RAM is enough for the app + OS + Amazon tooling. We suffered from this for a long time before discovering that t2.micro was barely enough for our use case. The problem went away right after switching to t2.small (2 GB).
I solved this by adding another security group (the default one for my Elastic Beanstalk).
It appears my problem was that I didn't associate a public IP address with my instance; after I set it, it worked just fine.
I was running an app in an Elastic Beanstalk environment with Docker as the platform. I got the same error that none of the instances are sending data, and I was unable to fetch logs as well.
Rebuilding the environment worked for me.
I just set the health-check path on the load balancer to a URL that responds with status code 200; this was only for a study environment.
For my real app, I use Actuator.
If you see something like this where you don't get any enhanced metrics, check that you haven't accidentally removed the conf.d/elasticbeanstalk/healthd.conf include from your nginx config. This conf adds a machine-readable log format that is responsible for reporting that data in EB (see Enhanced health log format - AWS).
My instance profile's IAM Role was lacking elasticbeanstalk:PutInstanceStatistics permission.
I found this by looking at /var/log/healthd/daemon.log as suggested in one of the other answers.
I had to SSH into the machine directly to discover this, as the Get Logs function itself was failing due to missing S3 Write permissions.
If you're running a Worker Tier EB environment, you need to add this policy:
arn:aws:iam::aws:policy/AWSElasticBeanstalkWorkerTier
For anyone arriving here in 2022…
After launching a new environment that was identical to a current healthy environment and seeing no data, I raised an AWS Support ticket. I was informed:
Here, I would like to inform you that recently Elastic Beanstalk introduced a new feature called EnhancedHealthAuthEnabled to increase the security of your environment and help prevent health data spoofing on your behalf, and this option is enabled by default when you create a new environment.
If you use managed policies for your instance profile, this feature is available for your new environment without any further configuration, as the Elastic Beanstalk instance profile managed policies contain permissions for the elasticbeanstalk:PutInstanceStatistics action. However, if you use a custom instance profile instead of a managed policy, your environment might display a No Data health status. This happens because a custom instance profile doesn't have the PutInstanceStatistics permission by default, and instances aren't authorised for the action that communicates enhanced health data to the service. Hence, your environment health shows an Unknown/No Data status.
The policy that I needed to attach to my existing EC2 role (as advised by AWS Support) looked like:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ElasticBeanstalkHealthAccess",
            "Action": [
                "elasticbeanstalk:PutInstanceStatistics"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:elasticbeanstalk:*:*:application/*",
                "arn:aws:elasticbeanstalk:*:*:environment/*"
            ]
        }
    ]
}
Adding this policy to my EC2 role solved the issue for me.
In my case, the issue was resolved when I increased the RAM by changing the instance type (t2.micro to c5.xlarge).