Can't access running EC2 Dockerized image from outside - amazon-web-services

The problem
I can't access a running Docker container from outside of the EC2 instance.
What I've tried
I created a cluster in ECS, a service with a related task definition and an Application Load Balancer.
When the task gets executed I can see the logs from the Docker container in the task:
I also see the related EC2 instance running. When I SSH into the instance I can see the Docker container running, as expected:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
af20230498fb <ecr-id>.dkr.ecr.eu-central-1.amazonaws.com/app-be:latest "docker-entrypoint.s…" 11 minutes ago Up 11 minutes 0.0.0.0:32805->5001/tcp ecs-app-task-definition-26-app-be-fcf9ffc3f9dadf80d401
5d59c2b2bcaa amazon/amazon-ecs-agent:latest "/agent" 2 hours ago Up 2 hours (healthy) ecs-agent
And when I do:
> curl 0.0.0.0:32805/status
{"message":"OK","timestamp":1598871064086}
Also my load balancer seems to be set up correctly:
But when trying to access the same endpoint from outside, both from the public IP of the EC2 instance and the Application Load Balancer DNS, it times out.
Also: the Application Load Balancer health checks for /status are failing as well, so the task restarts every 15 minutes.
Any help is appreciated, and sorry in advance if I'm making a rookie mistake, as I don't have much experience with AWS.

Did you configure the Security Group of your EC2 instance or the NACL of the VPC where the EC2 is launched?
I see that you expose port 5001 in your task, so you should open that port in the SG.
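If it helps, a minimal AWS CLI sketch (security group IDs are placeholders): since the container port 5001 is published on a dynamically assigned host port (32805 above), the instance's security group typically needs to allow the ephemeral host port range (commonly 32768-65535) from the load balancer's security group:

# allow the ALB's security group to reach the dynamic host ports on the container instance
aws ec2 authorize-security-group-ingress \
  --group-id sg-EC2-INSTANCE-EXAMPLE \
  --ip-permissions 'IpProtocol=tcp,FromPort=32768,ToPort=65535,UserIdGroupPairs=[{GroupId=sg-ALB-EXAMPLE}]'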

Related

Health Check keeps failing for ECS container

I am currently trying to deploy 2 ECS services on a single EC2 instance for a test environment.
Here is what I have done so far:
Successfully created 2 Security Groups for Load Balancer and EC2 instance.
My EC2 Security Group
My ALB Security Group
Successfully created 2 different Task Definitions for my 2 applications, both Spring Boot applications. The first application runs on port 8080, and the Container Port in its Task Definition is also 8080. The second application runs on port 8081, and the Container Port in its Task Definition is also 8081.
Successfully created an ECS cluster with an Auto-Scaling Group as Capacity Provider. The cluster also recognizes the Container Instance created by the Auto-Scaling Group (I am using t2.micro since it is in the free-tier package). Attached the created Security Group to the EC2 instance.
My EC2 Security Group
Successfully created an ALB with 2 forward listeners, on ports 8080 and 8081, configured to 2 different Target Groups, one for each service. Attached the created Security Group to the ALB. (A rough CLI sketch of this setup is below.)
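For reference, the second listener/target group pair looks roughly like this from the CLI (names, the VPC ID and the ARNs are placeholders; the health check path is just an example):

# target group for the second service, checked on its own port
aws elbv2 create-target-group \
  --name service-2-tg \
  --protocol HTTP \
  --port 8081 \
  --target-type instance \
  --vpc-id vpc-EXAMPLE \
  --health-check-path /

# listener on port 8081 forwarding to that target group
aws elbv2 create-listener \
  --load-balancer-arn arn:aws:elasticloadbalancing:REGION:ACCOUNT:loadbalancer/app/my-alb/EXAMPLE \
  --protocol HTTP \
  --port 8081 \
  --default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/service-2-tg/EXAMPLE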
Here is how the ECS behaves with my services:
I attempted to create 2 new services. The first service is mapped to port 8080 on the ALB, the second one to port 8081. Each of them has a different Target Group, but the Health Check configurations are the same:
Health Check Configuration for Service 1
Health Check Configuration for Service 2
The first service was deployed pretty smoothly; the health check returned success on the first try.
However, for the second service I used the exact same configuration as the first one, just a different listener port on the ALB and the application container running on a different port number as well (which I believe should not be a problem). The service attempted 10 times before failing the deployment, and I kept getting this repeated error message: service <service_name> instance <instance_id> port <port_number> is unhealthy in target-group <target_group_name> due to (reason Health checks failed).
This did not happen with my first service with the same configuration. The weird thing is that when I send a request to the ALB domain name on port 8081, the application behind the second service seems to be working fine without any error. It is just that the failing health check keeps throwing my service off.
I went over a bunch of posts and nothing really helps with the current situation. It is also frustrating that I cannot dig up any further details other than the info in the image below.
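The only extra detail I can pull programmatically is from the target health API, something like this (the target group ARN is a placeholder):

# shows State, Reason and Description for each registered target
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/service-2-tg/EXAMPLE \
  --query 'TargetHealthDescriptions[].TargetHealth'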
Does anyone have any suggestion on how to resolve this problem? I would really appreciate it.

Fail deploying simple HTTP server to ElasticBeanstalk when using Application Load Balancer

I'm unable to deploy the simplest docker-compose file to an ElasticBeanstalk environment configured with an Application Load Balancer for high availability.
This is the docker-compose file:
version: "3.9"
services:
demo:
image: nginxdemos/hello
ports:
- "80:80"
restart: always
This is the ALB configuration:
EB Chain of events:
1. Creating CloudWatch alarms and log groups
2. Creating security groups:
   - For the load balancer: allow incoming traffic from the internet to my two listeners on ports 80/443
   - For the EC2 machines: allow incoming traffic to the process port from the first security group created
3. Create auto scaling groups
4. Create Application Load Balancer
5. Create EC2 instance
Approx. 10 minutes after creating the EC2 instance (#5), I get the following log:
Environment health has transitioned from Pending to Severe. ELB processes are not healthy on all instances. Initialization in progress (running for 12 minutes). None of the instances are sending data. 50.0 % of the requests to the ELB are failing with HTTP 5xx. Insufficient request rate (2.0 requests/min) to determine application health (6 minutes ago). ELB health is failing or not available for all instances.
Looking at the Target Group, it indicates 0 healthy instances (based on the default health checks).
When SSH'ing into the instance, I see that the Docker service has not even started and my application is not running, which explains why the instance is unhealthy.
However, what am I supposed to do differently? Based on my understanding, it looks like a bug in the flow initiated by ElasticBeanstalk, as the flow seems to wait for the instances to become healthy before starting my application (otherwise, why wasn't the application started in the 10 minutes after the EC2 instance was created?).
It doesn't seem like an application issue, because the Docker service was not even started.
Appreciate your help.
I tried to replicate your issue using your docker-compose.yml and Docker running on 64bit Amazon Linux 2/3.4.12 platform. For the test I created a zip file containing only the docker-compose.yml.
Everything works as expected and no issues were found.
The only thing I can suggest is to double-check your files. Also, there is no reason to use 443 as you don't have HTTPS at all.
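In case it helps, this is roughly how such a zip can be deployed from the CLI (bucket, application and environment names are placeholders):

# zip containing only the compose file
zip deploy.zip docker-compose.yml

# upload it and register it as an application version
aws s3 cp deploy.zip s3://MY-EB-BUCKET/deploy.zip
aws elasticbeanstalk create-application-version \
  --application-name my-app \
  --version-label test-1 \
  --source-bundle S3Bucket=MY-EB-BUCKET,S3Key=deploy.zip

# point the environment at the new version
aws elasticbeanstalk update-environment \
  --environment-name my-env \
  --version-label test-1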

Wordpress running on EC2 t3.small becomes unavailable (ELB Error 504) after X amount of time, needs rebooting

I have a problem with my Amazon EC2 instances (something that did not happen when I was using DigitalOcean).
I manage several EC2 instances. My personal EC2 has about 5 Wordpress sites running on a t2.micro instance; the traffic is not high, so load speed is fine.
I also have another 2 instances for one of my clients: a t2.micro (running only one Wordpress site) and a t3a.micro (running 4 Wordpress sites). The issue occurs on all 3 instances (mine and both of my client's).
I have a CloudWatch alarm to notify me by email when Error 504 happens. Once I get the alarm, the website becomes unavailable (Cloudflare shows me Error 504), but I can still get in via SSH or Webmin. I run service nginx status and all seems to be fine, same for service php7.2-fpm. I run pkill nginx && pkill php* and then service nginx start && service php7.2-fpm start, which complete correctly, but when I try to reach the site, the Error 504 is still there.
To test, I installed and configured Apache with and without PHP-FPM enabled: same problem. The instance runs well and the websites are fast, but after X hours it becomes inaccessible via the web and the only solution is rebooting...
What's the only thing that solves the issue? Rebooting the instance... After it boots, the websites are available again. Please note that I moved from DigitalOcean to AWS because it is more useful for me, but I can't understand why the problem happens here and not there, since I had a very similarly configured instance...
In all of the instances I've a setup with:
OS: Ubuntu 18.04
Types: Two t2.micro and one t3a.micro
ELB: Enabled
Security Groups: only allow ports 80 and 443 from all sources.
Database: in RDS, not on the same instance.
I can provide any logs you might ask for, but I reviewed all the Nginx and PHP-FPM logs and can't see any anomalies. The same goes for syslog and kern.log; I can provide them if it helps.
Hope you can give me a hand. Thanks for your advice!
EDIT:
I found the origin of the issue. The problem wasn't in the EC2 instances. All my headache was because the RDS had only one Security Group attached, allowing access from my IP (for remote management of the databases) and from the public IPs of the EC2 instances that run Wordpress; it turned out I also needed to whitelist the private IPs of those EC2s... A really noob mistake, but that was the solution.
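For anyone hitting the same thing, the fix amounts to allowing the Wordpress instances' security group (or their private IPs) in the RDS security group. A minimal sketch, assuming MySQL/MariaDB on port 3306 and placeholder group IDs:

# let the Wordpress instances' security group reach the RDS instance on the DB port
aws ec2 authorize-security-group-ingress \
  --group-id sg-RDS-EXAMPLE \
  --ip-permissions 'IpProtocol=tcp,FromPort=3306,ToPort=3306,UserIdGroupPairs=[{GroupId=sg-WORDPRESS-EC2-EXAMPLE}]'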

The ECS service with awsvpc cannot start due to ENI issues

I have got one service and an ECS cluster with 2 t3.small instances.
I cannot start the ECS task. The task has 2 containers (NGINX and PHP-FPM). NGINX exposes port 80 and PHP-FPM exposes ports 9000, 9001 and 9002.
Error I can see:
dev-cluster/ecs-agents i-12345678901234567 2019-09-15T13:20:48Z [ERROR] Task engine [arn:aws:ecs:us-east-1:123456789012:task/ea1d6e4b-ff9f-4e0a-b77a-1698721faa5c]: unable to configure pause container namespace: cni setup: invoke bridge plugin failed: bridge ipam ADD: failed to execute plugin: ecs-ipam: getIPV4AddressFromDB commands: failed to get available ip from the db: getAvailableIP ipstore: failed to find available ip addresses in the subnet
ECS agent: 1.29.
Do you know how I can figure out what is wrong?
Here is logs snippet: https://pastebin.com/my620Kip
Task definition: https://pastebin.com/C5khX9Zy
UPDATE: My observations
Edited because my post below was deleted...
I recreated the cluster and the problem disappeared.
Then I removed the application image from ECR and saw this error in the AWS web console:
CannotPullContainerError: Error response from daemon: manifest for 123456789123.dkr.ecr.us-east-1.amazonaws.com/application123:development-716b4e55dd3235f6548d645af9e463e744d3785f not found
Then I waited a few hours until the original issue happened again.
Then I restarted the instance manually with systemctl reboot and the problem disappeared again, but only for the restarted instance.
This issue appears when there are hundred(s) of awsvpc tasks on the cluster which cannot start.
I think this is a bug in the ECS agent: when we try to create too many containers that require an ENI, it tries to use all free IPs in the subnet (255). I think that after restarting/recreating the EC2 instance some cache is cleared and the problem is solved.
Here is a similar solution I found today: https://github.com/aws/amazon-ecs-cni-plugins/issues/93#issuecomment-502099642
What do you think about it?
I am open to suggestions.
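To sanity-check the "subnet is running out of IPs" theory, something like this can be run (the subnet ID is a placeholder):

# how many free IPs the task subnet still has
aws ec2 describe-subnets \
  --subnet-ids subnet-EXAMPLE \
  --query 'Subnets[].AvailableIpAddressCount'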
This is probably just a wild guess, but can it be that you simply don't have enough ENIs?
ENIs are quite limited (depending on the instance type):
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html
For instance, a t3.medium only has 3 ENIs, one of which is used for the primary network interface, which leaves you with only 2 ENIs. So I can imagine that ECS tasks fail to start due to insufficient ENIs.
As mitigation, try ENI trunking:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/container-instance-eni.html
This will multiply available ENIs per instance.
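If it helps, both the per-type ENI limit and the trunking setting can be inspected/changed from the CLI (a rough sketch; note that, as far as I know, trunking is only supported on certain instance types, so it may not apply to t3.small):

# how many ENIs the instance type supports without trunking
aws ec2 describe-instance-types \
  --instance-types t3.small \
  --query 'InstanceTypes[].NetworkInfo.MaximumNetworkInterfaces'

# opt the account into awsvpcTrunking (applies to container instances registered afterwards)
aws ecs put-account-setting-default \
  --name awsvpcTrunking \
  --value enabled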

AWS ECS Fargate ALB Error (Request Timed Out)

I have set up a Docker container running on port 5566 with a small Django application. The Docker image is uploaded to ECR and later used by the Fargate container(s).
I have set up an ECS cluster with a VPC.
After creating the Task Definition and Service, the Service starts up 2 tasks (as it is supposed to):
Here's the Service's Network Access (with a health check grace period of 300s):
I also set up an Application Load Balancer (with DNS) with a target group for the service, but the health checks seem to be failing:
Here's the health check configuration:
Because the health checks are failing, the tasks are terminated and new ones are started roughly every 5 minutes.
Here's the container's port mapping:
As one cannot access the Fargate container (via SSH for example) and the logs are empty, how should I troubleshoot the issue?
I have tried to follow every step in the Troubleshoot Your Application Load Balancer guide.
Feel free to ask additional information.
Can you confirm that your application is working on port 5566 inside Docker?
You can check the logs in CloudWatch; you'll find the link under cluster -> service -> tasks -> your task.
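If the console route is awkward, and assuming the task definition uses the awslogs log driver, the same logs can be tailed with AWS CLI v2 (the log group name is a placeholder):

aws logs tail /ecs/my-django-task --follow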
Can you post your ALB configuration and your Target Group port?
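Two things also worth checking from the CLI (the ARN and security group ID are placeholders): that the target group really points at port 5566 with the expected health check path, and that the tasks' security group allows port 5566 from the ALB's security group.

# target group port and health check settings
aws elbv2 describe-target-groups \
  --target-group-arns arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/django-tg/EXAMPLE \
  --query 'TargetGroups[].[Port,HealthCheckPath,HealthCheckPort]'

# confirm the tasks' security group allows 5566 from the ALB's security group
aws ec2 describe-security-groups \
  --group-ids sg-FARGATE-TASKS-EXAMPLE \
  --query 'SecurityGroups[].IpPermissions'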