The ECS service with awsvpc cannot start due to ENI issues

I have one service, running on an ECS cluster with 2 t3.small instances.
I cannot start the ECS task. The task has 2 containers (NGINX and PHP-FPM); NGINX exposes port 80 and PHP-FPM exposes ports 9000, 9001, and 9002.
The error I can see:
dev-cluster/ecs-agents i-12345678901234567 2019-09-15T13:20:48Z [ERROR] Task engine [arn:aws:ecs:us-east-1:123456789012:task/ea1d6e4b-ff9f-4e0a-b77a-1698721faa5c]: unable to configure pause container namespace: cni setup: invoke bridge plugin failed: bridge ipam ADD: failed to execute plugin: ecs-ipam: getIPV4AddressFromDB commands: failed to get available ip from the db: getAvailableIP ipstore: failed to find available ip addresses in the subnet
ECS agent: 1.29.
Do you know how I can figure out what is wrong?
Here is a log snippet: https://pastebin.com/my620Kip
Task definition: https://pastebin.com/C5khX9Zy
UPDATE: My observations
Edited because my post below was deleted...
I recreated the cluster and the problem disappeared.
Then I removed the application image from ECR and saw this error in the AWS web console:
CannotPullContainerError: Error response from daemon: manifest for 123456789123.dkr.ecr.us-east-1.amazonaws.com/application123:development-716b4e55dd3235f6548d645af9e463e744d3785f not found
Then I waited a few hours until the original issue happened again.
Then I restarted an instance manually with systemctl reboot and the problem disappeared again, but only for the restarted instance.
This issue appears when there are hundreds of awsvpc tasks on the cluster that cannot start.
I think this is a bug in the ECS agent: when we try to create too many containers that require an ENI, it tries to use all the free IPs in the subnet (255). I think restarting/recreating the EC2 instance clears some cache and the problem goes away.
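One way to check whether the subnet is really running out of addresses (a sketch; subnet-0abc1234 is a placeholder for the subnet the awsvpc tasks use):
aws ec2 describe-subnets --subnet-ids subnet-0abc1234 \
  --query 'Subnets[0].AvailableIpAddressCount'        # free IPs AWS still sees in the subnet
aws ec2 describe-network-interfaces \
  --filters Name=subnet-id,Values=subnet-0abc1234 \
  --query 'length(NetworkInterfaces)'                 # ENIs currently allocated in that subnet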
Here is a similar solution I found today: https://github.com/aws/amazon-ecs-cni-plugins/issues/93#issuecomment-502099642
What do you think about it?
I am open to suggestions.

This is probably just a wild guess, but could it be that you simply don't have enough ENIs?
ENIs are quite limited (depending on the instance type):
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html
For instance, a t3.medium only has 3 ENIs, one of which is used for the primary network interface, which leaves you with only 2. So I can imagine that ECS tasks fail to start due to insufficient ENIs.
As a mitigation, try ENI trunking:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/container-instance-eni.html
This multiplies the number of ENIs available per instance.
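A minimal sketch of turning it on at the account level (note that trunking only applies to newly registered container instances and only to supported instance types; check the ECS docs for the list):
# opt the account into awsvpc trunking (can also be done per IAM user/role with put-account-setting)
aws ecs put-account-setting-default --name awsvpcTrunking --value enabled
# verify the effective setting
aws ecs list-account-settings --effective-settings --name awsvpcTrunking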

Related

Failing to deploy a simple HTTP server to Elastic Beanstalk when using an Application Load Balancer

I'm unable to deploy the simplest docker-compose file to an Elastic Beanstalk environment configured with an Application Load Balancer for high availability.
This is the docker-compose file:
version: "3.9"
services:
demo:
image: nginxdemos/hello
ports:
- "80:80"
restart: always
This is the ALB configuration:
EB Chain of events:
1. Creating CloudWatch alarms and log groups
2. Creating security groups
   - for the load balancer: allow incoming traffic from the internet to my two listeners on ports 80/443
   - for the EC2 machines: allow incoming traffic to the process port from the first security group created
3. Create auto scaling groups
4. Create Application Load Balancer
5. Create EC2 instance
Approx. 10 minutes after creating the EC2 instance (#5), I get the following log:
Environment health has transitioned from Pending to Severe. ELB processes are not healthy on all instances. Initialization in progress (running for 12 minutes). None of the instances are sending data. 50.0 % of the requests to the ELB are failing with HTTP 5xx. Insufficient request rate (2.0 requests/min) to determine application health (6 minutes ago). ELB health is failing or not available for all instances.
Looking at the target group, it indicates 0 healthy instances (based on the default health checks).
When SSH'ing into the instance, I see that the Docker service is not even started and my application is not running, which explains why the instance is unhealthy.
However, what am I supposed to do differently? Based on my understanding, it looks like a bug in the flow initiated by Elastic Beanstalk, as if the flow were waiting for the instances to become healthy before starting my application (otherwise, why wasn't the application started in the 10 minutes after the EC2 instance was created?).
It doesn't seem like an application issue, because the Docker service was not even started.
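For completeness, these are the checks I ran after SSH'ing in (a sketch, assuming the Amazon Linux 2 Docker platform; log locations may differ on other platforms):
sudo systemctl status docker              # is the Docker daemon running at all?
sudo docker ps -a                         # were any containers ever created?
sudo tail -n 100 /var/log/eb-engine.log   # Elastic Beanstalk engine log; shows where the deployment stalled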
Appreciate your help.
I tried to replicate your issue using your docker-compose.yml with Docker running on the 64bit Amazon Linux 2/3.4.12 platform. For the test I created a zip file containing only the docker-compose.yml.
Everything worked as expected and no issues were found.
The only thing I can suggest is to double-check your files. Also, there is no reason to use 443, as you don't have HTTPS at all.
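For reference, my test looked roughly like this (a sketch; the bucket, application, and environment names are placeholders):
docker compose -f docker-compose.yml config > /dev/null   # local syntax check of the compose file
zip deploy.zip docker-compose.yml                          # bundle only the compose file
aws s3 cp deploy.zip s3://my-eb-artifacts/deploy.zip
aws elasticbeanstalk create-application-version \
  --application-name my-app --version-label v1 \
  --source-bundle S3Bucket=my-eb-artifacts,S3Key=deploy.zip
aws elasticbeanstalk update-environment \
  --environment-name my-app-dev --version-label v1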

"ERR_EMPTY_RESPONSE" - ShinyApp hosted over AWS (EC2 / EKS / ShinyProxy) does not work

Update #2:
I have checked the health status of my instances within the auto scaling group; the instances are listed as "healthy" there. (Screenshot added)
I followed this troubleshooting tutorial from AWS, without success:
Solution: Use the ELB health check for your Auto Scaling group. When you use the ELB health check, Auto Scaling determines the health status of your instances by checking the results of both the instance status check and the ELB health check. For more information, see Adding health checks to your Auto Scaling group in the Amazon EC2 Auto Scaling User Guide.
Update #1:
I found out that the two node instances are "OutOfService" (as seen in the screenshots below) because they are failing the health check from the load balancer. Could this be the problem? And how do I solve it?
Thanks!
I am currently on the home stretch to host my ShinyApp on AWS.
To make the hosting scalable, I decided to use AWS - more precisely an EKS cluster.
For the creation I followed this tutorial: https://github.com/z0ph/ShinyProxyOnEKS
So far everything worked, except for the last step: "When accessing the load balancer address and port, the login interface of ShinyProxy can be displayed normally."
The load balancer gives me the following error message as soon as I try to call it with the corresponding port: ERR_EMPTY_RESPONSE.
I have to admit that I am currently a bit lost and lack a starting point for where the error could be.
I was already able to host the Shiny sample application in the cluster (step 3.2 in the tutorial), so it must be somehow due to ShinyProxy, the Kubernetes proxy, or the load balancer itself.
I link you to the following information below:
Overview EC2 Instances (Workspace + Cluster Nodes)
Overview Loadbalancer
Overview Repositories
Dockerfile ShinyProxy
Dockerfile Kubernetes Proxy
Dockerfile ShinyApp (sample application)
I have painted over some of the information to be on the safe side - if there is anything important, please let me know.
If you need anything else I haven't thought of, just give me a hint!
And please excuse the confusing question and formatting - I just don't know how to word / present it better. sorry!
Many thanks and best regards
Overview EC2 Instances (Workspace + Cluster Nodes)
Overview Loadbalancer
Overview Repositories
Dockerfile ShinyProxy (source https://github.com/openanalytics/shinyproxy-config-examples/tree/master/03-containerized-kubernetes)
Dockerfile Kubernetes Proxy (source https://github.com/openanalytics/shinyproxy-config-examples/tree/master/03-containerized-kubernetes - Fork)
Dockerfile ShinyApp (sample application)
The following files are 1:1 from the tutorial:
application.yaml (shinyproxy)
sp-authorization.yaml
sp-deployment.yaml
sp-service.yaml
Health-Status in the AutoScaling-Group
Unfortunately, there is a known issue in AWS:
externalTrafficPolicy: Local with Type: LoadBalancer AWS NLB health checks failing · Issue #80579 · kubernetes/kubernetes
Closing this for now since it's a known issue
As per k8s manual:
.spec.externalTrafficPolicy - denotes if this Service desires to route external traffic to node-local or cluster-wide endpoints. There are two available options: Cluster (default) and Local. Cluster obscures the client source IP and may cause a second hop to another node, but should have good overall load-spreading. Local preserves the client source IP and avoids a second hop for LoadBalancer and NodePort type Services, but risks potentially imbalanced traffic spreading.
But you may try to fix the Local traffic policy as in this answer.
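One blunt option (a sketch; the Service name and namespace are assumptions based on the tutorial) is to check the policy and, if losing the client source IP is acceptable, switch it back to Cluster as described in the k8s docs quoted above:
# see which policy the ShinyProxy Service currently uses
kubectl get svc shinyproxy -o jsonpath='{.spec.externalTrafficPolicy}'
# switching to Cluster (the default) trades the client source IP for working NLB health checks
kubectl patch svc shinyproxy -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'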
Update:
This is actually a known limitation where the AWS cloud provider does not allow for --hostname-override, see #54482 for more details.
Update 2: There is a workaround via patching kube-proxy:
As per the AWS KB:
A Network Load Balancer with externalTrafficPolicy set to Local (from the Kubernetes website), with a custom Amazon VPC DNS in the DHCP options set. To resolve this issue, patch kube-proxy with the hostname override flag.
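Roughly, the patch looks like this (a sketch; on EKS kube-proxy runs as a DaemonSet in kube-system):
# inspect the current kube-proxy arguments
kubectl -n kube-system get ds kube-proxy -o yaml | grep -A8 'command:'
# the KB workaround adds a --hostname-override=$(NODE_NAME) argument, with NODE_NAME
# injected as an env var via fieldRef: spec.nodeName; edit the DaemonSet to add both:
kubectl -n kube-system edit ds kube-proxy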

Service not responding to ping command

My service (myservice.com), which is hosted on EC2, is up and running. I can see the Java process running on the machine, but I am not able to reach the service from external machines. I tried the following:
dig +short myservice.com
ping myservice.com
(1) resolves and gives me an IP address. ping results in 100% packet loss. I am not able to reach the service.
Not sure where to look. Some help debugging would be appreciated.
EDIT:
I had an issue with a previous deployment due to which the service was not starting. I've fixed that and tried to update, but the update was blocked by the ongoing deployment (which might take ~3 hrs to stabilize), so I tried enabling the Force deployment option from the console.
I also tried reducing the "Number of Tasks" count to 0 and reverting it back to 1 (reference: How do I deploy updated Docker images to Amazon ECS tasks?) to stop the ongoing deployment.
Could that be an issue?
You probably need to allow the ICMP protocol in the security group.
See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-rules-reference.html#sg-rules-ping
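A sketch of the security group change (the group ID is a placeholder), plus a direct check of the service port, since ping only exercises ICMP (the port is an assumption):
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol icmp --port -1 --cidr 0.0.0.0/0   # allow all ICMP (echo request/reply included)
nc -zv myservice.com 443                        # is the actual service port reachable?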

Istio: failed calling admission webhook Address is not allowed

I am getting the following error while creating a gateway for the sample bookinfo application:
Internal error occurred: failed calling admission webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: Address is not allowed
I have created an EKS PoC cluster using two node groups (each with two instances), one with t2.medium and the other with t2.large instances, in my dev AWS account, using two /26 subnets with the default VPC CNI provided by EKS.
But as the cluster grew, with multiple services running, I started facing issues with IPs not being available (as per the docs, the default vpc-cni driver treats each pod like an EC2 instance, assigning it an IP from the VPC).
To avoid this, I followed the post below to change the networking from the default to Weave:
https://medium.com/codeops/installing-weave-cni-on-aws-eks-51c2e6b7abc8
This resolved the IP unavailability issue.
Now, after the network reconfiguration from vpc-cni to Weave,
I have started getting the above issue (as per the subject line) for my service mesh configured using Istio.
There are a couple of services running inside the mesh, and I have also integrated Kiali, Prometheus, and Jaeger with it.
I tried having a look at GitHub (https://github.com/istio/istio/issues/9998) and the docs
(https://istio.io/docs/ops/setup/validation/), but could not get a proper answer.
Let me know if anyone has faced this issue and has a partial/full solution for it.
This 'appears' to be related to the switch from the AWS CNI to Weave. The AWS CNI uses the IP range of your VPC, while Weave uses its own address range for pods, so there may be leftover iptables rules from the AWS CNI, for example.
Internal error occurred: failed calling admission webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: Address is not allowed
The message above implies that whatever address istio-galley.istio-system.svc resolves to, internally in your K8s cluster, is not a valid IP address. So I would also try to see what that resolves to (it may be related to CoreDNS).
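For example (a sketch; busybox:1.28 is just a small image that ships nslookup):
# what does the webhook Service look like inside the cluster?
kubectl -n istio-system get svc istio-galley -o wide
# resolve the name from a throwaway pod to rule out CoreDNS problems
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- \
  nslookup istio-galley.istio-system.svc.cluster.local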
You can also try the following steps.
Basically (quoted):
kubectl delete ds aws-node -n kube-system
delete /etc/cni/net.d/10-aws.conflist on each of the nodes
edit the instance security group to allow TCP and UDP on ports 6783 and 6784
flush the iptables filter, nat and mangle tables (see the sketch after this list)
restart the kube-proxy pods
apply the weave-net daemonset
delete the existing pods so they get recreated in Weave's pod CIDR address space.
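A rough sketch of the iptables/kube-proxy steps above (run the iptables part on each node, and only during the migration, since it wipes all rules):
sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X
# recreate the kube-proxy pods so they rebuild their rules
kubectl -n kube-system delete pod -l k8s-app=kube-proxy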
Furthermore, you can try reinstalling everything from the beginning using weave.
Hope it helps!

AWS CodeDeploy: stuck on install step

I'm running through this tutorial to create a deployment pipeline with my custom .NET-based Docker image.
But when I start a deployment, it gets stuck on the Install phase, so I have to stop it manually:
After that I get a couple of running tasks with different task definitions (note :1 and :4, because I've tried to run the deployment 4 times by now):
They also keep changing their state (RUNNING -> PROVISIONING -> PENDING) all the time, and the list of stopped tasks grows:
Q:
So, how do I hunt down the issue with CodeDeploy? Why is it running forever?
UPDATE:
It is connected to health checks.
UPDATE:
I'm getting this:
(service dataapi-dev-service, taskSet ecs-svc/9223370487815385540) (port 80) is unhealthy in target-group dataapi-dev-tg1 due to (reason Health checks failed with these codes: [404]).
I don't quite understand why it's failing for the newly created container, because the original one passes the health check.
While the ECS task is running, the ELB (Elastic Load Balancer) constantly health-checks the container, as configured in the target group, to check whether the container is still responding.
From your debug message, the container (api) responded to the health check path with a 404.
I suggest you configure the health check path in the target group dataapi-dev-tg1.
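Something along these lines (a sketch; the health check path and the target group ARN are placeholders):
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/dataapi-dev-tg1/0123456789abcdef \
  --health-check-path /health \
  --matcher HttpCode=200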
For those who are still hitting this issue: in my case the ECS cluster had no outbound connectivity.
Possible solutions to this problem:
make sure the security groups you use with your VPC allow outbound traffic
make sure the route table you use with the VPC has subnet associations with the subnets you use with your load balancer (examine the route tables)
I was able to figure it out because I enabled CloudWatch during ECS cluster creation and got a CannotPullContainerError. For more information on solving this problem, look into Cannot Pull Container Image Error.
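A quick way to verify both points (a sketch; the subnet and security group IDs are placeholders):
# does the task subnet route 0.0.0.0/0 to an internet/NAT gateway?
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-0abc1234" \
  --query 'RouteTables[].Routes[]'
# does the task security group allow any outbound traffic?
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[0].IpPermissionsEgress'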
Make sure your Internet Gateway is attached to your subnets through the route table (routes) if your load balancer is internet-facing.
The error is due to a health check that detected an unhealthy target.
Make sure to check your configuration in the target group settings.
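To see exactly which targets are failing and why (a sketch; the ARN is a placeholder):
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/dataapi-dev-tg1/0123456789abcdef \
  --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason]'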