Amazon Elastic Load Balancer throwing connection timeout - amazon-web-services

I have added a Route 53 entry, A.B.com, that points to my ELB.
I set up a script that makes HTTP calls to this endpoint: http://A.B.com/xxx
However, roughly every 70th or 80th request fails with a "connection timeout" (not a read timeout).
Can anyone help me debug this issue?
Here is my script:
#!/bin/bash
# Note: n is never incremented, so this loop runs until interrupted.
n=0
while [ $n -le 5 ]
do
    curl -s -D - "http://A.B.com/xxx"
done
Note: I do not use a VPC.
CloudWatch metrics:
HTTPCode_Backend_5XX ~ 200 (mean)
SurgeQueueLength ~ 1000 (mean)
Do these CloudWatch values mean my backend is unhealthy?
I have switched to a different ELB for the time being, and things are OK for now, but I am not sure I have found the root cause.
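One way to narrow this down is to time only the TCP connect phase against the load balancer, so that connection timeouts are counted separately from slow responses. The following is a minimal sketch rather than a definitive diagnostic: the function names `probe_connect` and `summarize` and the 3-second timeout are assumptions made for illustration, and `A.B.com` is the placeholder host from the question.

```python
import socket
import time


def probe_connect(host, port=80, timeout=3.0):
    """Attempt one TCP connection and report how it ended.

    Returns (outcome, seconds) where outcome is "ok", "connect_timeout",
    or "error:<ExceptionName>". Only the TCP handshake is timed; no HTTP
    request is sent, so a read timeout cannot be mistaken for this.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "ok"
    except socket.timeout:
        outcome = "connect_timeout"
    except OSError as exc:
        # Covers refused connections, unreachable networks, DNS failures.
        outcome = "error:" + type(exc).__name__
    return outcome, time.monotonic() - start


def summarize(results):
    """Count outcomes from a list of (outcome, seconds) tuples."""
    counts = {}
    for outcome, _ in results:
        counts[outcome] = counts.get(outcome, 0) + 1
    return counts
```

For example, `summarize([probe_connect("A.B.com") for _ in range(200)])` would give a count per outcome, which can then be compared with the CloudWatch metrics for the same time window.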

Related

AWS CLI ecs run-task CannotPullContainerError: inspect image has been retried 5 time(s): failed to resolve ref

I'm trying to move from the Console to the CLI.
I have an ECS Cluster and a Task Definition. From the console, I can run a task WITHOUT any issue. The task turns green and I can use the public IP to access my service.
Now, I'd like to do the same but instead of creating the task using the Console, I'd like to use AWS cli.
I thought this was enough:
aws ecs run-task --cluster my-cluster \
--task-definition ecs-task-def:9 \
--launch-type FARGATE \
--network-configuration '{ "awsvpcConfiguration": { "subnets": ["subnet-XX1","subnet-XX2"], "securityGroups": ["sg-XXX"],"assignPublicIp": "ENABLED" }}'
However, the task gets stuck in PENDING state and after a while is STOPPED with the following error message:
CannotPullContainerError: inspect image has been retried 5 time(s): failed to resolve ref "docker.io/username/container:latest": failed to do request: Head https://registry-1.docker.io/v2/username/container/manifests/latest: dial tcp x.x.x.x:443: i/o timeout
What concerns me is that I can run tasks from the Console using the same arguments (VPC, subnets, security group, etc.) but I cannot make it work using the CLI.
If the issue were missing or wrong rules, neither the Console nor the CLI should work.
Does anyone know why?
It looks like ECS cannot pull the image from the registry. The error:
CannotPullContainerError: inspect image has been retried 5 time(s): failed to resolve ref "docker.io/username/container:latest": failed to do request: Head https://registry-1.docker.io/v2/username/container/manifests/latest: dial tcp x.x.x.x:443: i/o timeout
suggests that traffic on port 443 is blocked, so the image cannot be pulled. Have you tried allowing all inbound and outbound traffic on the attached security group, and checking network connectivity from within the attached subnets?
You can create a simple Lambda function attached to the same subnets and security groups, then make a telnet/curl-style request to the registry endpoint from it to check connectivity.
Example:
import urllib3

def test_book():
    # Send a single GET to the endpoint and print what comes back.
    http = urllib3.PoolManager()
    url = 'https://your-endpoint-here'
    headers = {
        "Accept": "application/json"
    }
    r = http.request(method='GET', url=url, headers=headers)
    print(f'response_status: {r.status}\nresponse_headers: {r.headers}\nresponse_data: {r.data}')

How do I get round the hanging bug in kubectl when aws credentials are expired?

If the AWS credentials are expired, aws ec2 ... etc will exit immediately with: An error occurred (RequestExpired) when calling the DescribeInstances operation: Request has expired.
However, kubectl exec will hang for 2 minutes before exiting with
Unable to connect to the server: dial tcp <some ip address>:443: i/o timeout.
Is there a workaround to get kubectl to exit immediately instead of hanging for 2 minutes?
As I mentioned in the comment, when you want to exit from a kubectl command you can just press Ctrl+C, the same as with any other console command.
If you would prefer a more Kubernetes-native way, you can run kubectl with the --request-timeout flag. You can find more details in the kubectl docs:
--request-timeout string Default: "0"
The length of time to wait before giving up on a single server request. Non-zero values should contain a corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means don't timeout requests.
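Both suggestions can be combined in a small wrapper: pass `--request-timeout` to kubectl and also enforce an outer deadline, in case the flag does not shorten the connection phase that is hanging here (the source does not confirm whether it does). This is only a sketch; the helper names `kubectl_command` and `run_with_deadline` and the 5-second and 10-second values are assumptions, not part of kubectl.

```python
import subprocess


def kubectl_command(args, request_timeout="5s"):
    """Build a kubectl argv list with the --request-timeout flag set."""
    return ["kubectl", "--request-timeout=" + request_timeout] + list(args)


def run_with_deadline(cmd, deadline_seconds):
    """Run cmd, killing it if it exceeds deadline_seconds.

    Returns the exit code, or None if the deadline was hit.
    """
    try:
        completed = subprocess.run(cmd, timeout=deadline_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return None
    return completed.returncode
```

A call such as `run_with_deadline(kubectl_command(["get", "pods"]), 10)` would then return within about ten seconds even if kubectl itself keeps waiting on the connection.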

Unable to create AWS EKS cluster with eksctl

Unable to create AWS EKS cluster with eksctl from Windows 10 PC. Here is the command which I'm executing
eksctl create cluster --name revit --version 1.17 --region ap-southeast-2 --fargate
Version of eksctl: 0.25.0
AWS CLI Version: aws-cli/2.0.38 Python/3.7.7 Windows/10 exe/AMD64
Error on executing create cluster command
2020-08-08T19:05:35+10:00 [ℹ] eksctl version 0.25.0
2020-08-08T19:05:35+10:00 [ℹ] using region ap-southeast-2
2020-08-08T19:05:35+10:00 [!] retryable error (RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connectex: A socket operation was attempted to an unreachable network.) from ec2metadata/GetToken - will retry after delay of 54.121635ms
2020-08-08T19:05:35+10:00 [!] retryable error (RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connectex: A socket operation was attempted to an unreachable network.) from ec2metadata/GetToken - will retry after delay of 86.006168ms
I had the same error and got rid of it by providing my AWS credentials for programmatic access (AWS Access Key ID, AWS Secret Access Key):
$ aws configure
The next time I ran eksctl, it no longer tried to obtain credentials from the instance metadata endpoint on its own, and the command succeeded.
I suspect this is related to this: https://aws.amazon.com/blogs/security/defense-in-depth-open-firewalls-reverse-proxies-ssrf-vulnerabilities-ec2-instance-metadata-service/
Specifically:
Protecting against open layer 3 firewalls and NATs
Last, there is a final layer of defense in IMDSv2 that is designed to protect EC2 instances that have been misconfigured as open routers, layer 3 firewalls, VPNs, tunnels, or NAT devices. With IMDSv2, the PUT response containing the secret token will, by default, not be able to travel outside the instance. This is accomplished by having the default Time To Live (TTL) on the low-level IP packets containing the secret token set to “1,” much lower than a typical value, such as “64.” Hardware and software that handle packets, including EC2 instances, subtract 1 from each packet’s TTL field whenever they pass it on. If the TTL gets to 0, the packet is discarded, and an error message is sent back to the sender. A packet with a TTL of “64” can therefore make sixty-four “hops” in a network before giving up, while a packet with a TTL of “1” can exist in just one. This feature allows legitimate traffic to get to an intended destination, but is designed to stop packets from endlessly running around in circles if there’s a loop in a network.
Are you by any chance running the command above from within a container launched in bridge mode? I had a similar problem. If so, you could run the container with --network host or pass the credentials as environment variables.
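Since the answers above come down to giving the tool static credentials so it does not fall back to the metadata endpoint at 169.254.169.254, a quick pre-flight check can confirm that a key pair is actually visible before running eksctl. The sketch below is deliberately partial: it looks only at the `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables and at one profile in the shared credentials file, and ignores other credential sources such as `AWS_PROFILE` or SSO. The function name `has_static_credentials` is hypothetical.

```python
import configparser
import os


def has_static_credentials(env=None, credentials_path=None, profile="default"):
    """Return True if an access key pair is set in env or in the shared credentials file."""
    env = os.environ if env is None else env
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return True
    if credentials_path is None:
        credentials_path = env.get(
            "AWS_SHARED_CREDENTIALS_FILE",
            os.path.join(os.path.expanduser("~"), ".aws", "credentials"),
        )
    parser = configparser.ConfigParser()
    parser.read(credentials_path)  # a missing file simply yields no sections
    return (
        parser.has_section(profile)
        and parser.has_option(profile, "aws_access_key_id")
        and parser.has_option(profile, "aws_secret_access_key")
    )
```

If this returns False on the machine or container where eksctl runs, that would be consistent with the GetToken retries in the log above.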

AWS CodeDeploy stuck in AllowTraffic step

I'm using AWS CodeDeploy to deploy my project (triggered by CodePipeline) to an autoscaling group (EC2 instances behind an ALB). This is my appSpec file:
version: 0.0
os: linux
files:
  - source: /
    destination: /var/www/html/test-deploy
    overwrite: true
permissions:
  - object: /var/www/html/test-deploy/codedeploy
    pattern: "*.sh"
    owner: root
    group: root
    mode: 755
    type:
      - file
hooks:
  BeforeInstall:
    - location: codedeploy/before_install.sh
      timeout: 180
  AfterInstall:
    - location: codedeploy/after_install.sh
      runas: centos
      timeout: 180
The files get deployed successfully to the EC2 instance, but for some reason nothing happens after the "BeforeAllowTraffic" step: I waited 15 minutes and the next step was still "pending".
The two .sh files do nothing fancy (and CodeDeploy passed those steps, so I don't think that's the problem).
Can anyone point me in the right direction? I don't get any error messages, so I don't even know how to debug it.
Thanks
I had the same issue. After investigating, I found that my target group was "unhealthy". I added a health check path/file (i.e. "/robots.txt") and rebooted the EC2 server, and that fixed the problem.
We also had an unhealthy target instance. The problem was hosting two applications on the same instance, where one (application A) was responsible for health checks and talking to the load balancer, and the other one (application B without any open network ports) was being deployed. One instance was always getting stuck in AllowTraffic during app B deployments. I found the root cause when I looked at the target group for app A and saw that same instance in the "unhealthy" status, so of course deploying app B wasn't going to fix that. After I re-deployed app A and restored the instance back to health, app B deployments were able to progress.
Check your logs on your target group instances. It may be caused by one of the following:
the application startup command did not finish successfully
the application is not running due to an error
your target group's health check is NOT configured with the endpoint you expect
your application is NOT responding at the endpoint you expect
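The last two points can be checked directly from a machine that can reach the instance, by requesting the health check path and comparing the status code with what the target group expects. This is a rough sketch under stated assumptions: the names `probe_health` and `looks_healthy` are hypothetical, the expected code of 200 is only a common setting and should be replaced with whatever the target group is configured to accept, and `urllib` follows redirects, so a path that redirects may look healthier here than it does to the load balancer.

```python
import urllib.error
import urllib.request


def probe_health(url, timeout=5.0):
    """Fetch url once and return the HTTP status code, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return None  # refused, timed out, or DNS failure


def looks_healthy(status, expected=(200,)):
    """Compare a status code with the codes the health check is configured to accept."""
    return status in expected
```

A result of None suggests the application is not listening on that port at all, while a 404 suggests the health check path does not match an endpoint the application serves.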

Autoclustering does not work on AWS with RabbitMQ

We are using the latest version of RabbitMQ, v3.7.2, on a few EC2 instances on AWS. We want to use the auto clustering that ships with the product by default (Cluster Formation and Peer Discovery).
After we start RabbitMQ, it either fails to do this or ignores the configuration. The only message we see in the log file is:
[info] <0.229.0> Peer discovery backend rabbit_peer_discovery_aws does not support registration, skipping registration.
An IAM role with the correct policy is attached to our RabbitMQ EC2 instance. The RabbitMQ config is:
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_aws
cluster_formation.aws.region = eu-west-1
cluster_formation.aws.use_autoscaling_group = true
cluster_formation.aws.use_private_ip = true
Did anyone face this issue?
Add the following to your rabbitmq.conf and restart rabbitmq-server:
log.file.level = debug
This lets you see the discovery request to AWS in the logs.
Then run this on any RabbitMQ node:
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl start_app
This runs the discovery again. Search the RabbitMQ logs for 'AWS Request'; you will see the corresponding response, so you can check whether your EC2 instances were found by the specified tags. If not, something is wrong with your tags.
Not an answer (not enough reputation points to comment) but I'm dealing with the same thing. I've double-checked that the security groups are correct, they allow ports 4369, 5672 and 15672 (confirmed via telnet/netcat), and the IAM policies are correct. Debug logging shows nothing else. I'm at a loss how to figure this one out.