How do I get round the hanging bug in kubectl when AWS credentials are expired?

If the AWS credentials are expired, aws ec2 ... and similar commands exit immediately with An error occurred (RequestExpired) when calling the DescribeInstances operation: Request has expired.
However, kubectl exec will hang for 2 minutes before exiting with
Unable to connect to the server: dial tcp <some ip address>:443: i/o timeout.
Is there a workaround to get kubectl to exit immediately instead of hanging for 2 minutes?

As I mentioned in the comment, when you want to exit a kubectl command you can just press Ctrl+C, the same as in the console.
If you would like a more Kubernetes-native way, you can run kubectl with the --request-timeout flag. You can find more details in the kubectl docs.
--request-timeout string Default: "0"
The length of time to wait before giving up on a single server request. Non-zero values should contain a corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means don't timeout requests.
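For example, a minimal sketch (the pod name, command, and timeout value here are placeholders):
kubectl exec --request-timeout=10s my-pod -- date
With this flag set, the command gives up after roughly the configured timeout instead of hanging for the two minutes described in the question.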

Related

When running an AWS Batch job, received a Resourceinitializationerror: failed to validate logger args: no such host

I am running an AWS Batch job, and it fails to start with this status reason:
Resourceinitializationerror: failed to validate logger args: create stream has been retried 7 times: failed to create Cloudwatch log stream: RequestError: send request failed caused by: Post "https://logs.eu-central-1.amazonaws.com/": dial tcp: lookup logs.eu-central-1.amazonaws.com on 172.31.0.2:53: no such host : exit status 1
The job runs in a compute environment on eu-central-1 that uses a VPC, and that VPC has an endpoint to com.amazonaws.eu-central-1.logs
What can be done to fix it?
OK, I found what the problem was: I forgot to associate the subnet with the com.amazonaws.eu-central-1.logs endpoint.
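For reference, the association can also be done from the CLI; a sketch with placeholder endpoint and subnet IDs:
aws ec2 modify-vpc-endpoint --vpc-endpoint-id vpce-0123456789abcdef0 --add-subnet-ids subnet-0123456789abcdef0
aws ec2 describe-vpc-endpoints shows which subnets are currently associated with an interface endpoint.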

AWS CLI ecs run-task CannotPullContainerError: inspect image has been retried 5 time(s): failed to resolve ref

I'm trying to move from the Console to the CLI.
I have an ECS Cluster and a Task Definition. From the console, I can run a task WITHOUT any issue. The task comes up green and I can use the public IP to access my service.
Now, I'd like to do the same but instead of creating the task using the Console, I'd like to use AWS cli.
I thought this was enough:
aws ecs run-task --cluster my-cluster \
--task-definition ecs-task-def:9 \
--launch-type FARGATE \
--network-configuration '{ "awsvpcConfiguration": { "subnets": ["subnet-XX1","subnet-XX2"], "securityGroups": ["sg-XXX"],"assignPublicIp": "ENABLED" }}'
However, the task gets stuck in PENDING state and after a while is STOPPED with the following error message:
CannotPullContainerError: inspect image has been retried 5 time(s): failed to resolve ref "docker.io/username/container:latest": failed to do request: Head https://registry-1.docker.io/v2/username/container/manifests/latest: dial tcp x.x.x.x:443: i/o timeout
What concerns me is that I can run tasks from the Console using the same arguments (VPC, subnets, security group, etc.), but I cannot make it work using the CLI.
If the issue were missing or wrong rules, neither the Console nor the CLI would work.
Does anyone know why?
Looks like ECS cannot pull the image from the registry:
CannotPullContainerError: inspect image has been retried 5 time(s): failed to resolve ref "docker.io/username/container:latest": failed to do request: Head https://registry-1.docker.io/v2/username/container/manifests/latest: dial tcp x.x.x.x:443: i/o timeout
This suggests that outbound traffic on port 443 is being blocked, hence the image cannot be pulled. Have you tried allowing all inbound and outbound traffic on the attached security group, as well as checking network connectivity from within the attached subnet?
You can create a simple Lambda function with the same subnets and security groups attached, then make a telnet/curl-style request against the registry endpoint to check connectivity.
Example:
import urllib3

def lambda_handler(event, context):
    # Simple outbound connectivity check: issue an HTTPS request and log the response.
    http = urllib3.PoolManager()
    url = 'https://your-endpoint-here'
    headers = {
        "Accept": "application/json"
    }
    r = http.request(method='GET', url=url, headers=headers)
    print(f'response_status: {r.status}\nresponse_headers: {r.headers}\nresponse_data: {r.data}')

Unable to create AWS EKS cluster with eksctl

Unable to create an AWS EKS cluster with eksctl from a Windows 10 PC. Here is the command I'm executing:
eksctl create cluster --name revit --version 1.17 --region ap-southeast-2 --fargate
Version of eksctl: 0.25.0
AWS CLI Version: aws-cli/2.0.38 Python/3.7.7 Windows/10 exe/AMD64
Error when executing the create cluster command:
2020-08-08T19:05:35+10:00 [ℹ] eksctl version 0.25.0
2020-08-08T19:05:35+10:00 [ℹ] using region ap-southeast-2
2020-08-08T19:05:35+10:00 [!] retryable error (RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connectex: A socket operation was attempted to an unreachable network.) from ec2metadata/GetToken - will retry after delay of 54.121635ms
2020-08-08T19:05:35+10:00 [!] retryable error (RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connectex: A socket operation was attempted to an unreachable network.) from ec2metadata/GetToken - will retry after delay of 86.006168ms
I had the same error; I got rid of it by providing my AWS credentials for programmatic access (AWS Access Key ID, AWS Secret Access Key):
$ aws configure
The next time I used eksctl it just didn't try to authenticate on its own and the command passed.
I suspect this is related to this: https://aws.amazon.com/blogs/security/defense-in-depth-open-firewalls-reverse-proxies-ssrf-vulnerabilities-ec2-instance-metadata-service/
Specifically:
Protecting against open layer 3 firewalls and NATs
Last, there is a final layer of defense in IMDSv2 that is designed to protect EC2 instances that have been misconfigured as open routers, layer 3 firewalls, VPNs, tunnels, or NAT devices. With IMDSv2, the PUT response containing the secret token will, by default, not be able to travel outside the instance. This is accomplished by having the default Time To Live (TTL) on the low-level IP packets containing the secret token set to “1,” much lower than a typical value, such as “64.” Hardware and software that handle packets, including EC2 instances, subtract 1 from each packet’s TTL field whenever they pass it on. If the TTL gets to 0, the packet is discarded, and an error message is sent back to the sender. A packet with a TTL of “64” can therefore make sixty-four “hops” in a network before giving up, while a packet with a TTL of “1” can exist in just one. This feature allows legitimate traffic to get to an intended destination, but is designed to stop packets from endlessly running around in circles if there’s a loop in a network.
Are you by any chance running the command above from within a container launched in bridge mode? I had a similar problem. If that is the case, you could run it using --network host or by passing the credentials as environment variables.
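For illustration, assuming the command really is run from a container (the image name my-eksctl-image is a placeholder), passing the credentials through as environment variables would look roughly like this:
docker run --rm \
  -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN \
  my-eksctl-image \
  eksctl create cluster --name revit --version 1.17 --region ap-southeast-2 --fargate
With credentials available in the environment, eksctl no longer needs to reach the instance metadata endpoint at 169.254.169.254; alternatively, launch the container with --network host as suggested above.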

Amazon ECS Service configuration: expected to return exactly 1 result, but got '0'

I am trying to update an ECS service with bamboo and get the following error:
Failed to fetch resource from AWS!
java.lang.RuntimeException: Expected DescribeServiceRequest for service 'my-service' to return exactly 1 result, but got '0'
    at net.utoolity.atlassian.bamboo.taws.aws.ECS.getSingleService(ECS.java:674)
    at net.utoolity.atlassian.bamboo.taws.ECSServiceTask.executeUpdate(ECSServiceTask.java:311)
    at net.utoolity.atlassian.bamboo.taws.ECSServiceTask.execute(ECSServiceTask.java:133)
    at net.utoolity.atlassian.bamboo.taws.AWSTask.execute(AWSTask.java:164)
    at com.atlassian.bamboo.task.TaskExecutorImpl.lambda$executeTasks$3(TaskExecutorImpl.java:319)
    at com.atlassian.bamboo.task.TaskExecutorImpl.executeTaskWithPrePostActions(TaskExecutorImpl.java:252)
    at com.atlassian.bamboo.task.TaskExecutorImpl.executeTasks(TaskExecutorImpl.java:319)
    at com.atlassian.bamboo.task.TaskExecutorImpl.execute(TaskExecutorImpl.java:112)
    at com.atlassian.bamboo.build.pipeline.tasks.ExecuteBuildTask.call(ExecuteBuildTask.java:73)
    at com.atlassian.bamboo.v2.build.agent.DefaultBuildAgent.executeBuildPhase(DefaultBuildAgent.java:203)
    at com.atlassian.bamboo.v2.build.agent.DefaultBuildAgent.build(DefaultBuildAgent.java:175)
    at com.atlassian.bamboo.v2.build.agent.BuildAgentControllerImpl.lambda$waitAndPerformBuild$0(BuildAgentControllerImpl.java:129)
    at com.atlassian.bamboo.variable.CustomVariableContextImpl.withVariableSubstitutor(CustomVariableContextImpl.java:185)
    at com.atlassian.bamboo.v2.build.agent.BuildAgentControllerImpl.waitAndPerformBuild(BuildAgentControllerImpl.java:123)
    at com.atlassian.bamboo.v2.build.agent.DefaultBuildAgent$1.run(DefaultBuildAgent.java:126)
    at com.atlassian.bamboo.utils.BambooRunnables$1.run(BambooRunnables.java:48)
    at com.atlassian.bamboo.security.ImpersonationHelper.runWith(ImpersonationHelper.java:26)
    at com.atlassian.bamboo.security.ImpersonationHelper.runWithSystemAuthority(ImpersonationHelper.java:17)
    at com.atlassian.bamboo.security.ImpersonationHelper$1.run(ImpersonationHelper.java:41)
    at java.lang.Thread.run(Thread.java:745)
I am using the Force new deployment setting.
Any ideas what the issue is?
We have not been able to identify a bug in our code base right away; here's what's seemingly happening:
In order to append progress messages to the Bamboo build log, we need to call the DescribeServices API action before the call to the actual UpdateService API action, and the exception is thrown if and only if the targeted service cannot be found.
So at first glance there may be a subtle configuration issue, which happens to me every now and then when using Bamboo variables to reference resources from a preceding task, where it is easy to accidentally copy and paste the wrong variable name for example.
An incorrect reference in any of the following parameters of the Amazon ECS Service task's Update Service action would cause the respective task action to fail with the error message at hand, because the DescribeServices API call in itself would succeed, yet fail to identify the target service:
Connector
Region
Service Name
For example, I've just reproduced the problem by using a non-existent service name:
24-Oct-2019 17:37:05 Starting task 'Update sample ECS service (w/ ELB) - 2 instances' of type 'net.utoolity.atlassian.bamboo.tasks-for-aws:aws.ecs.service'
24-Oct-2019 17:37:05 Setting maxErrorRetry=7 and awaitTransitionInterval=15000
24-Oct-2019 17:37:05 Using session credentials provided by Identity Federation for AWS app (connector variable: 6f6fc85d-4ea5-43ce-8e70-25aba33a5fda).
24-Oct-2019 17:37:05 Selecting region eu-west-1
24-Oct-2019 17:37:05 Updating service 'NOT-A-SERVICE' on cluster 'TAWS-IT270-100-ubot':
24-Oct-2019 17:37:06 Failed to fetch resource from AWS!
24-Oct-2019 17:37:06 java.lang.RuntimeException: Expected DescribeServiceRequest for service 'NOT-A-SERVICE' to return exactly 1 result, but got '0'
...
Granted, the error message is not exactly helpful here, and we need to think about how to better handle this log pattern across our various tasks - the actual UpdateService API action would yield the much more appropriate ServiceNotFoundException in this scenario.
So assuming 'my-service' has been up and running before calling the 'Update Service' task action, can you please check whether the log from your failing Bamboo build may indicate this particular problem, for example by targeting another region by chance?
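One quick way to rule this out outside of Bamboo (cluster name, service name, and region here are placeholders) is to ask ECS directly:
aws ecs describe-services --cluster my-cluster --services my-service --region eu-west-1
If that returns an empty services list together with a MISSING entry under failures, the task is simply pointed at a cluster/region combination that does not contain the service.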
I was able to solve the issue by using a Shell Script Task and writing an aws-cli command after exporting the keys. This workaround did the trick:
aws ecs update-service --cluster my-cluster --service my-service --task-definition my-task-definition
So AWS ECS itself is working fine, and it must be a bug or misconfiguration in the Bamboo module.
But as mentioned in the other answer, the best approach is to check whether the configuration is correct.

Kubespray: send request failed caused by: Post https://ec2.us-east-1.amazonaws.com/

I'm trying to install Kubernetes with Kubespray using AWS as the cloud provider. The installation fails with
FAILED - RETRYING: Master | wait for the apiserver to be running
When I check the logs of the kubelet docker container on the master I see
Flag --enable-cri has been deprecated, The non-CRI implementation will be deprecated and removed in a future version.
I0824 16:30:03.413509 13279 feature_gate.go:144] feature gates: map[Accelerators:true]
I0824 16:30:03.413727 13279 aws.go:762] Building AWS cloudprovider
I0824 16:30:03.413878 13279 aws.go:725] Zone not specified in configuration file; querying AWS metadata service
Error: failed to run Kubelet: could not init cloud provider "aws": error finding instance i-0cb81504d85c14b90: error listing AWS instances: RequestError: send request failed
caused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp 54.239.28.168:443: i/o timeout
Error: failed to run Kubelet: could not init cloud provider "aws": error finding instance i-0cb81504d85c14b90: error listing AWS instances: RequestError: send request failed
caused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp 54.239.28.168:443: i/o timeout
Flag --enable-cri has been deprecated, The non-CRI implementation will be deprecated and removed in a future version.
I0824 16:32:04.169558 13517 feature_gate.go:144] feature gates: map[Accelerators:true]
I0824 16:32:04.169808 13517 aws.go:762] Building AWS cloudprovider
I0824 16:32:04.169852 13517 aws.go:725] Zone not specified in configuration file; querying AWS metadata service
I'm positive this is a firewall issue. I have an IAM role with the proper permissions. When I set the https_proxy variable I am able to
curl https://ec2.us-east-1.amazonaws.com/
When the proxy variable is not set, the curl fails. I tried setting the https_proxy variable inside the hyperkube container. However, this causes a cert error when the apiserver tries to handshake with the etcd nodes.
Is there a way to get kubelet to only use the proxy when calling out to https://ec2.us-east-1.amazonaws.com/?
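One common approach (a sketch, not something confirmed for this setup; the proxy address and exclusion list are placeholders) is to give the kubelet's environment the proxy settings together with a NO_PROXY list that keeps cluster-internal traffic direct, for example:
HTTPS_PROXY=http://proxy.example.com:3128
NO_PROXY=127.0.0.1,localhost,169.254.169.254,<node, etcd and cluster CIDRs/addresses>
Go's default HTTP transport honors both variables, so calls to ec2.us-east-1.amazonaws.com should go through the proxy while the etcd/apiserver handshakes stay direct, avoiding the cert error described above.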