Service not responding to ping command - amazon-web-services

My service (myservice.com) which is hosted in EC2 is up and running. I could see java process running within the machine but not able to reach the service from external machines. Tried the following option,
dns +short myservice.com
ping myservice.com
(1) is resolving and giving me ip address. ping is causing 100% packet loss. Not able to reach the service.
Not sure where to look at. Some help to debug would be helpful.
EDIT:
I had an issue with previous deployment due to which service was not starting - which I've fixed and tried to update - but the deployment was blocked due to ongoing deployment (which might take ~3hrs to stabilize). So I tried enabling Force deployment option from the console
Also tried minimising the "Number of Tasks" count to 0 and reverted it back to 1 (Reference: How do I deploy updated Docker images to Amazon ECS tasks?) to stop the ongoing deployment.
Can that be an issue?

You probably need to allow ICMP protocol in the security group.
See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-rules-reference.html#sg-rules-ping

Related

EC2 server losses internet connection and application fails to send email, sms and even yum updates

I have 5 EC2 servers in the same VPC and all of a sudden yesterday, all of my applications started failing to send email and sms. So I tried doing git pull of my project it also timed out. Then tried to install telnet using yum that to failed with Time out. I have checked almost everything including Network ACLs, Security Groups, Subnets, Iptables, etc and everything is correct. I am not sure why is this happening.
The weird thing is if I reboot the server once the internet comes for a brief amount of time and again it disconnects.
Attaching below are the errors I am facing:
Error while Generating the Tiny URL. Error: {"errno":-110,"code":"ETIMEDOUT","syscall":"connect","address":"XXX.XX.XXX.XX","port":443}
Error SendEmail UnknownEndpoint: Inaccessible host: `email.ap-south-1.amazonaws.com'. This service may not be available in the `ap-south-1' region.
Attaching screenshots of my Network ACLs, Security Groups, Subnets, and iptables:
Please help with what am I doing wrong or if is this an issue with AWS EC2? My goal is to make sure my application works without timeout and git and yum starts working.
Did you try terminating and reprovisioning the instances, rather than rebooting them? There may be some problem with the underlying hardware. When you terminate and recreate an instance, it will likely end up in a different rack in the datacenter, which may solve the problem.
If the above helps, you should consider setting up an application load balancer with an auto scaling group, with health checks enabled for both, so that the auto scaling group terminates unhealthy instances and replaces then with the new ones automatically.
You may also consider using Simple Notification Service and stop worrying about underlying compute for e-mail and sms distribution altogether!

AWS ECS Task can't connect to RDS Database

I'm a newer AWS user and today I got stuck while working on a sample project. I successfully created a docker container that runs a simple R script that connects to my AWS RDS MySQL Database and creates & writes some basic files to it. I built a public ECR repository, pushed my docker image there, and built a ECS cluster & task choosing Fargate and using the container image from my repository. My task ran and I could see the R code being executed when I went through the logs, but it was never able to connect to the SQL Database and exited afterwards.
I've had to whitelist my own IP address in the security group for the RDS Database so that I can connect to it, so I'm aware I probably have to do that for my ECS task to establish that connection too. But won't that IP address constantly change because I won't have a static IP for the Fargate Server that is executing my task? I'm trying to stay on the free tier so I'm not sure I want to setup an elastic IP address for this server.
These 2 articles seem close if not the same issue I'm having but I can't figure out a solution. I haven't found any other info.
https://aws.amazon.com/premiumsupport/knowledge-center/ecs-fargate-task-database-connection/
https://aws.amazon.com/premiumsupport/knowledge-center/ecs-fargate-static-elastic-ip-address/
The end goal is to get this sample project successfully running on a scheduled fixed interval, and then running actual scripts on there to help automate things and make my life easier, so this sample project is a first step towards that. Any help or info on the questions I'm having would be appreciated !
Yes, your task is ephemeral (whether you launch it manually or as part of an ECS service) and its private/public ip address may change over time if it gets replaced. The way you'd make the connectivity rules to stick is to assign a security group to the task (that may have inbound access on a specific port you need I assume and outbound to everything) and assign another security group to the RDS db that has inbound access on port 3306 for the security group you assigned to the task (this is the trick, the SG will not change and you are telling RDS to allow access to ALL traffic coming from that SG). I see the first article you posted doesn't talk about this part (it should).

Unable to find the server at www.googleapis.com only within GCP

I know there have been a few questions similar to this issue. But in my case this issue is only happening on GCP. We have been running our services within AKS (Azure) for almost one year with not a single occurrence. Right after we moved to GCP GKE, a few requests of our Python application are falling into the error: Unable to find the server at www.googleapis.com. In most cases, the request works, so it seems to be random. I already tried to increase TCP timeouts and also the minimum Minimum ports per VM instance in my Cloud Nat. We are running the services with GKE and we have Cloud Nat Gateway setup for the Network.
Is there any exclusive setting on GCP that could be causing the issue?
I figured out what was the issue. The kube-dns service was being scheduled to nodes suffering from high memory pressure, causing kube-dns to be evicted and restarted. During the time it was out some requests would not be resolved. In order to fix the issue I created a nodepool exclusive to the kube-system services, then edited the kube-system deployments and set a nodeSelector so they always get scheduled to safe Nodes. After that, the issue has ceased.

Wordpress running on EC2 t3.small becomes unavailable (ELB Error 504) after X amount of time, needs rebooting

I have a problem with my Amazon EC2 instance (that did not happened when I was using DigitalOcean).
I've several EC2 instances that are managed by me. My personal EC2 has about 5 Wordpress sites running on a t2.micro instance and the traffic is not high so it is working well in load speed.
Also I have another 2 instances for one of my clients, one t2.micro (running only one Wordpress site) and a t3a.micro (running 4 Wordpress sites). The issue is with all 3 instances (mine and all the 2 of my client).
I have a CloudWatch alarm to notify me by email when Error 504 happen. Since I get the alarm, the website becomes unavailable (Cloudflare shows me Error 504), but I can get into SSH or Webmin. I do service nginx status and all seems to be fine, same to service php7.2-fpm. I do pkill nginx && pkill php* and then service nginx start && service php7.2-fpm start correctly but when I try to enter to the site, the Error 504 is still there.
To test, I decided to install and configure Apache with and without PHP-FPM enabled, same problem. Instance going well and websites running fast but after X amount of hours, it becomes unaccessible via web and the only solution is rebooting...
What's the only thing that solve the issue? Well, rebooting the instance.... After it boots, the websites are available again. Please note that I moved from DigitalOcean to AWS because it is more useful but I can't understand why the problem is happening here and not there since I've a similar instance configured very similar...
In all of the instances I've a setup with:
OS: Ubuntu 18.04
Types: Two t2.micro and one t3a.micro
ELB: Enabled
Security Groups: only allow ports 80, 443 from all the sources.
Database: In a RDS, not on the same instance.
I can provide the logs of everything that you probably can ask but I review all the Nginx and PHP-fpm logs and I can't see any anomalies. Also with syslog and kern.log, but I can provide if it can helps.
Hope you can give me a hand. Thanks for your advice!
EDIT:
I already found the origin of the issue. The problem wasn't in the EC2, all my headache was because I have the RDS set with only one Security Group attached to allow access from my IP to remote management of the databases and the public IPs of the EC2 that runs Wordpress, but I figured that I also need to whitelist the private IPs of those EC2s... Really noob mistake but that was the solution.

amazon aws ec2 instance status is unhealthy. Website loading time very high

I am using aws ec2 m1.medium as my webserver. From last two days website loading very slow. Health check status in amazon route53 shows Unhealthy. The following status shows in health checkers
Failure: Resolved IP: [my ip]. The endpoint did not respond to the health checker request within the timeout limit.
When i check in mxToolBox
Failure - response over threshold (12.21s/10s)
Can anybody help please.
The AWS network in itself never takes this much time to reach to the servers and it must be your application responding slow. to make sure ping the node first and then telnet/nc on the specific port that your application is using.
telnet <ip> <port>
netcat -u <ip> <port>
if you find there is a significant difference, then you need to troubleshoot your application which might be having any kind of issue.
If not, then you may have faulty ELB sitting in between that is causing such high latency and you might wanna restart/replace that.