I am facing a challenge with my EKS cluster and IRSA.
I have created a Kubernetes service account associated with an AWS IAM role, via the eksctl create iamserviceaccount command. After that, I modified the maximum session duration of the role in the AWS console, setting it to 12 hours.
I then used said service account in my deployment. Everything looks good once the workload runs in the cluster and my pod is able to assume the associated AWS role.
Whenever the pod executes an action that requires IAM auth, boto3 issues an AssumeRoleWithWebIdentity call and AWS STS returns a set of temporary credentials (steps 4 and 5 in the diagram at https://aws.amazon.com/blogs/containers/diving-into-iam-roles-for-service-accounts/). Looking in CloudTrail, these credentials last 1 hour, which is in line with the default DurationSeconds for that call (see https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRoleWithWebIdentity.html).
I want those credentials to last longer than 1 hour, ideally 12 hours. The only way to achieve this, as far as I understand, is to specify duration_seconds in the $HOME/.aws/config file within the pod. After implementing that and testing again, I saw that the temporary credentials still last 1 hour. At this point I have no idea how to change this behaviour. Has anyone ever encountered this situation? If so, how did you solve it?
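For reference, the $HOME/.aws/config I used in the pod looked roughly like this (the profile name and role ARN are placeholders; the token path is the one projected by EKS):

```ini
[profile irsa]
role_arn = arn:aws:iam::123456789012:role/my-irsa-role
web_identity_token_file = /var/run/secrets/eks.amazonaws.com/serviceaccount/token
duration_seconds = 43200
```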
Note: The OIDC JWT token lasts 24 hours, as I specified in the annotation eks.amazonaws.com/token-expiration: 86400
Thanks!
Alessio
I created my AWS account and got the 12-month free plan. Then I went to the Tag Editor to check all my running services, and there were 165 running services I never created. Has anyone had the same problem? Is this OK, and will I have to pay for them?
Just because they exist does not mean they are running.
Those look like the default VPCs that AWS creates in every region for every AWS account.
If you didn't create them, don't worry - you aren't being charged for them.
AWS does not provide any default resources that charge you money.
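If you want to double-check, you can list the default VPCs in a given region with the AWS CLI, for example:

```
$ aws ec2 describe-vpcs --filters Name=isDefault,Values=true --region us-east-1
```

Repeating that per region (or looping over the output of aws ec2 describe-regions) should account for most of what you saw in the Tag Editor.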
It seems an AWS API call to ssm.ap-southeast-2.amazonaws.com hangs from inside my ECS task. Below is the debug output at the point where it hangs:
2020-06-11 22:47:10,831 - MainThread - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (2): ssm.ap-southeast-2.amazonaws.com:443
This works fine on an EC2 instance; it is only inside the ECS task container that the connection times out.
What could be the reason behind this?
Hmm... I think your container is a victim of IMDSv2. Please allow me to explain.
Instance metadata is data about your instance that you can use to configure or manage the running instance. Instance metadata is divided into categories, for example, host name, events, and security groups. You can query instance metadata by calling the following URL:
http://169.254.169.254/latest/meta-data/
On Nov 19, 2019, v2 of the Instance Metadata Service was released. One of the features introduced with EC2 Instance Metadata Service version 2 (IMDSv2) is "Protecting against open layer 3 firewalls and NATs" 1 which sets a TTL (or hop limit 2) of 1 on low level IP packets containing the secret token so the packet can only cross one host. The TTL of 1 means that the instance is not able to forward the packet to a Docker container running on an ECS Container instance as that would be counted as another hop.
From 1:
With IMDSv2, setting the TTL value to “1” means that requests from the EC2 instance itself will work because they’re returned to the caller (on the instance) before the subtraction occurs. But if the EC2 instance has been misconfigured as an open router, layer 3 firewall, VPN, tunnel, or NAT device, the response containing the token will have its TTL reduced to zero before leaving the instance, and the packet containing the response will be discarded on its way out of the instance, preventing transport to the attacker. The information simply won’t make it further than the EC2 instance itself, which means that an attacker won’t get the response back with the token, and with it the ability to access instance metadata, even if they’ve been successful at getting past all other defenses.
A consequence of this change is Docker containers running on ECS instances in Bridge or AWSVPC mode can no longer query the metadata endpoint. The following request will timeout:
$ curl -X PUT -H "x-aws-ec2-metadata-token-ttl-seconds: 120" "http://169.254.169.254/latest/api/token"
If using the AWS CLI, it has a fallback mechanism to IMDSv1, but only after a long delay (about 5 seconds), which makes it rather unusable.
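As a side note, that delay comes from the metadata-service timeout and retry settings. If you cannot fix the hop limit, the AWS CLI/SDK fallback can reportedly be made faster by tuning those settings in the config file (values here are illustrative):

```ini
[default]
metadata_service_timeout = 1
metadata_service_num_attempts = 1
```

The equivalent environment variables are AWS_METADATA_SERVICE_TIMEOUT and AWS_METADATA_SERVICE_NUM_ATTEMPTS.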
From: https://github.com/aws/aws-sdk-js/issues/3024#issuecomment-589135606 :
From v2.575.0, the SDK is configured to default to the IMDSv2 workflow and, by default, will try three times (with a timeout of one second between attempts) to obtain the required token. If all three attempts fail, the SDK will then fall back to the IMDSv1 workflow.
Option 1 (Use with caution)
It is possible to use the 'modify-instance-metadata-options' 3 AWS CLI call on the Container Instance to change the TTL to a higher value by specifying a value for the --http-put-response-hop-limit flag.
The following AWS CLI command modifies the value to '2' when run on the EC2 instance:
$ aws ec2 modify-instance-metadata-options --instance-id $(curl 169.254.169.254/latest/meta-data/instance-id) --http-put-response-hop-limit 2 --http-endpoint enabled
... after which the curl command against token endpoint was successful from the Docker container.
A Lambda function can be invoked from an Auto Scaling lifecycle hook to configure the value '2' on any launching instance via the ModifyInstanceMetadataOptions API call. Another option is to place this command in the EC2 instance's UserData so every instance can 'self-configure' itself with the updated hop limit. Note that in this case the instance profile must have a policy attached with the 'ec2:ModifyInstanceMetadataOptions' permission for the call to succeed.
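A minimal IAM policy statement granting that permission might look like the following sketch (in practice you would likely scope Resource down to the relevant instance ARNs rather than '*'):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:ModifyInstanceMetadataOptions",
      "Resource": "*"
    }
  ]
}
```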
Option 2 (Recommended)
With regards to ECS, accessing the instance credentials from a container is not considered a best practice. Instead, the recommendation is to set a task role and use the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable to retrieve container-specific credentials from the ECS agent, for example with the "curl 169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI" command. Up-to-date versions of the AWS CLI use this by default.
You can read more about the task role credentials here 4. A similar endpoint for task metadata is also available 5.
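For illustration, the endpoint the SDKs hit can be sketched in Python like this (the helper name is mine; actually fetching the credentials only works from inside a task that has a task role, where the ECS agent injects the environment variable):

```python
import os

# Link-local address served by the ECS agent on container instances
ECS_CREDENTIALS_HOST = "http://169.254.170.2"

def ecs_credentials_url():
    """Build the ECS task-credentials URL from the environment variable
    that the ECS agent injects into containers running with a task role."""
    relative_uri = os.environ["AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"]
    return ECS_CREDENTIALS_HOST + relative_uri

# Inside a task, fetching this URL (e.g. with urllib.request.urlopen)
# returns a JSON document with AccessKeyId, SecretAccessKey, and Token.
```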
More discussion can be found here:
https://github.com/aws/aws-sdk-ruby/issues/2177
https://github.com/aws/containers-roadmap/issues/670
https://github.com/aws/aws-sdk-js/issues/3024
I am trying to create a sandbox playground in AWS where users can practice with some resources for 30 minutes; after that, all resources should be destroyed and the temporary account deleted as well.
I have heard that CloudFormation, Lambda, and IAM combined can be used, or alternatively AWS Control Tower, but I have no idea where to begin.
You would need:
A separate AWS Account, so that anything created or deleted in the account will not impact your normal environment (this account can be reused; there is no reason to use a new AWS Account each time you want a Sandbox)
A means of deleting resources from the account when the time period is reached
Some example tools that can do this are:
AWS Nuke
Cloud Nuke
You would also need to write some code that ties everything together:
Vending the account
Tracking usage (e.g. when to clean up)
Triggering the cleanup script when the time limit has been reached
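The tracking step ultimately boils down to a time check. A minimal sketch of the expiry logic, assuming the 30-minute limit from the question (the function name and trigger wiring, e.g. an EventBridge schedule invoking it, are my assumptions):

```python
from datetime import datetime, timedelta, timezone

# 30-minute sandbox lifetime, per the requirement above
SANDBOX_TTL = timedelta(minutes=30)

def sandbox_expired(created_at, now=None):
    """Return True once the sandbox's 30-minute window has elapsed.

    `created_at` is the (timezone-aware) time the sandbox was vended,
    e.g. recorded in DynamoDB when the account is handed out.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - created_at >= SANDBOX_TTL
```

A scheduled job would call this for each vended sandbox and, when it returns True, kick off the cleanup tool (AWS Nuke, Cloud Nuke, etc.).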
Bottom line: It will take some work to create such a Sandbox.
I see an error in AWS CloudFormation when I create an MS SQL RDS instance via a CloudFormation template.
The stack hangs in the "CREATE_IN_PROGRESS" phase, even though the RDS instance was successfully created and is in the "AVAILABLE" status.
But some 5 to 6 hours later, the stack rolls back and deletes the RDS instance, saying "The DBInstance did not stabilize".
From AWS Documentation:
Operations for these resources (which include RDS) might take longer than the default timeout period. The timeout period depends on the resource and credentials that you use. To extend the timeout period, specify a service role when you perform the stack operation.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/troubleshooting.html#troubleshooting-resource-did-not-stabilize
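In practice that means passing a service role when creating the stack, e.g. with the AWS CLI (the stack name, template file, and role ARN below are placeholders):

```
$ aws cloudformation create-stack \
    --stack-name my-mssql-stack \
    --template-body file://rds-mssql.yaml \
    --role-arn arn:aws:iam::123456789012:role/cfn-service-role
```

With a service role, CloudFormation uses the role's (longer) session to track the resource instead of your user credentials, which is what extends the stabilization timeout.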
I created an Elasticsearch cluster and uploaded 45 GB of data. Then I tried to change the access policy of the domain. The domain status has been showing "Processing" for the last 24 hours. Is there any way to reset the access policy, and why is the domain status still "Processing"?
Part of the response from AWS support to a similar question:
For any configuration change, we spin up another set of instances and copy the data across to a new search domain before we terminate the previous set of instances.
So it might be that it's taking them a while to copy that data over and perform any other indexing operations on the new ES instance.