EC2 server loses internet connection and application fails to send email, SMS and even run yum updates - amazon-web-services

I have 5 EC2 servers in the same VPC, and yesterday all of my applications suddenly started failing to send email and SMS. I tried a git pull of my project and it also timed out. Then I tried to install telnet using yum, and that too failed with a timeout. I have checked almost everything, including network ACLs, security groups, subnets, and iptables, and everything looks correct. I am not sure why this is happening.
The weird thing is that if I reboot the server, internet access comes back for a brief time and then drops again.
Below are the errors I am facing:
Error while Generating the Tiny URL. Error: {"errno":-110,"code":"ETIMEDOUT","syscall":"connect","address":"XXX.XX.XXX.XX","port":443}
Error SendEmail UnknownEndpoint: Inaccessible host: `email.ap-south-1.amazonaws.com'. This service may not be available in the `ap-south-1' region.
Attaching screenshots of my Network ACLs, Security Groups, Subnets, and iptables:
Please help me understand what I am doing wrong, or whether this is an issue with AWS EC2. My goal is for my application to work without timeouts and for git and yum to start working again.
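One way to narrow down symptoms like these is to probe outbound TCP connectivity from the affected instance and see whether the failure is a timeout (packets silently dropped somewhere on the path) or a refusal (the host was reached but nothing is listening). The following is a minimal sketch using only the Python standard library. The function name `probe_tcp` and its return strings are hypothetical, not part of the original post.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try one TCP connection and classify the outcome as a short string."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer at all: packets are probably being dropped on the path.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing accepts connections on this port.
        return "refused"
    except socket.gaierror:
        # The host name could not be resolved.
        return "dns-failure"
    except OSError as exc:
        return "error: %s" % exc
```

Running something like `probe_tcp("email.ap-south-1.amazonaws.com", 443)` right after a reboot and again a few minutes later would show whether the result changes from "open" to "timeout", which matches the behaviour described above.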

Did you try terminating and reprovisioning the instances, rather than rebooting them? There may be some problem with the underlying hardware. When you terminate and recreate an instance, it will likely end up in a different rack in the datacenter, which may solve the problem.
If the above helps, you should consider setting up an Application Load Balancer with an Auto Scaling group, with health checks enabled for both, so that the Auto Scaling group terminates unhealthy instances and replaces them with new ones automatically.
You may also consider using Simple Notification Service and stop worrying about the underlying compute for email and SMS distribution altogether!

Related

HTTP server on EC2 instance unreachable after a few minutes

I have a running instance on the Linux 2 AMI.
I have a default VPC and network interface.
Security groups taken care of, even opened all traffic and still got nothing.
There is an Internet Gateway
Routes are open on the VPC
The server is running
nginx is running
Once the instance is initiated and installed, all of this is ready
I can reach the HTTP website for the first 2-3 minutes, then it is unreachable.
No idea why. Everything else is still running and I can still ssh into the server, but HTTP on port 80 does not respond.
I opened everything in iptables, still nothing.
If I reboot the server, I get a minute where I can reach the server via HTTP, but a minute later it's the same again.
I can reach http if I use $ wget http://localhost
So I think it is probably something from the EC2 control panel, not the instance itself.
I tried on new instances too.
Does anyone have an idea?
The reason behind this weird behavior was that the AWS abuse team had blocked some of my ports. I had to upgrade to the Developer support plan to find this out, and I am contacting them at the moment.

AWS ECS Task can't connect to RDS Database

I'm a newer AWS user and today I got stuck while working on a sample project. I successfully created a Docker container that runs a simple R script that connects to my AWS RDS MySQL database and creates and writes some basic files to it. I built a public ECR repository, pushed my Docker image there, and built an ECS cluster and task, choosing Fargate and using the container image from my repository. My task ran and I could see the R code being executed when I went through the logs, but it was never able to connect to the SQL database and exited afterwards.
I've had to whitelist my own IP address in the security group for the RDS database so that I can connect to it, so I'm aware I probably have to do the same for my ECS task to establish that connection. But won't that IP address constantly change, because I won't have a static IP for the Fargate server that is executing my task? I'm trying to stay on the free tier, so I'm not sure I want to set up an Elastic IP address for this server.
These 2 articles seem close if not the same issue I'm having but I can't figure out a solution. I haven't found any other info.
https://aws.amazon.com/premiumsupport/knowledge-center/ecs-fargate-task-database-connection/
https://aws.amazon.com/premiumsupport/knowledge-center/ecs-fargate-static-elastic-ip-address/
The end goal is to get this sample project running successfully on a fixed schedule, and then to run actual scripts there to help automate things and make my life easier, so this sample project is a first step towards that. Any help or info on the questions I'm having would be appreciated!
Yes, your task is ephemeral (whether you launch it manually or as part of an ECS service), and its private/public IP address may change whenever it gets replaced. The way to make the connectivity rules stick is to assign a security group to the task (with outbound access to everything, plus inbound access on any specific port you need) and to assign another security group to the RDS database that allows inbound access on port 3306 from the security group you assigned to the task. This is the trick: the security group will not change, and you are telling RDS to allow traffic coming from anything that carries that group. The first article you posted doesn't talk about this part (it should).
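The rule described above, where the database's security group admits traffic from the task's security group rather than from an IP address, could be sketched as follows. This assumes boto3's EC2 client and its `authorize_security_group_ingress` call. The helper names and the group IDs in the usage note are hypothetical placeholders.

```python
def mysql_ingress_from_group(task_sg_id, port=3306):
    """Build an ingress permission that admits TCP traffic from another security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # Referencing a group ID instead of a CIDR keeps the rule valid
        # even when the Fargate task's IP address changes.
        "UserIdGroupPairs": [{"GroupId": task_sg_id}],
    }


def allow_task_to_reach_db(ec2_client, db_sg_id, task_sg_id):
    """Attach the rule to the database's security group (expects a boto3 EC2 client)."""
    return ec2_client.authorize_security_group_ingress(
        GroupId=db_sg_id,
        IpPermissions=[mysql_ingress_from_group(task_sg_id)],
    )
```

With boto3 installed and credentials configured, a call might look like `allow_task_to_reach_db(boto3.client("ec2"), "sg-DATABASE", "sg-TASK")`, with the two placeholder IDs replaced by the real ones. The same rule can be created by hand in the console by choosing the task's security group as the source.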

Service not responding to ping command

My service (myservice.com), which is hosted on EC2, is up and running. I can see the Java process running on the machine, but I am not able to reach the service from external machines. I tried the following:
1. dig +short myservice.com
2. ping myservice.com
(1) resolves and gives me an IP address. (2) shows 100% packet loss, so I am not able to reach the service.
I'm not sure where to look. Some help debugging would be appreciated.
EDIT:
I had an issue with a previous deployment that kept the service from starting. I fixed it and tried to update, but the deployment was blocked by an ongoing deployment (which might take ~3hrs to stabilize). So I tried enabling the Force deployment option from the console.
I also tried reducing the "Number of Tasks" count to 0 and then setting it back to 1 (Reference: How do I deploy updated Docker images to Amazon ECS tasks?) to stop the ongoing deployment.
Can that be an issue?
You probably need to allow the ICMP protocol in the security group.
See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-rules-reference.html#sg-rules-ping
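As a rough illustration of what such a rule looks like in the EC2 API, the sketch below builds an inbound permission for ICMP echo requests. For ICMP, the `FromPort` and `ToPort` fields carry the ICMP type and code instead of port numbers. The function name and the default CIDR are assumptions made for this example.

```python
def icmp_echo_ingress(cidr="0.0.0.0/0"):
    """Build an ingress permission that lets ping (ICMP echo request) through."""
    return {
        "IpProtocol": "icmp",
        "FromPort": 8,   # ICMP type 8 = echo request
        "ToPort": -1,    # -1 = any code for that type
        "IpRanges": [{"CidrIp": cidr}],
    }
```

The resulting dictionary is in the shape expected by the `IpPermissions` parameter of boto3's `authorize_security_group_ingress`. Note that a failing ping only shows that ICMP is blocked. The service itself may still be reachable on its own TCP port.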

Wordpress running on EC2 t3.small becomes unavailable (ELB Error 504) after X amount of time, needs rebooting

I have a problem with my Amazon EC2 instances (one that did not happen when I was using DigitalOcean).
I manage several EC2 instances. My personal one has about 5 Wordpress sites running on a t2.micro instance; traffic is not high, so load speed is fine.
I also have 2 more instances for one of my clients: a t2.micro (running only one Wordpress site) and a t3a.micro (running 4 Wordpress sites). The issue affects all 3 instances (mine and both of my client's).
I have a CloudWatch alarm that notifies me by email when Error 504 happens. Once I get the alarm, the website becomes unavailable (Cloudflare shows me Error 504), but I can still get in via SSH or Webmin. I run service nginx status and all seems to be fine, and the same goes for service php7.2-fpm. I run pkill nginx && pkill php* and then service nginx start && service php7.2-fpm start, which succeed, but when I try to open the site, the Error 504 is still there.
To test, I installed and configured Apache with and without PHP-FPM enabled: same problem. The instance runs well and the websites are fast, but after X hours they become inaccessible via web and the only solution is rebooting...
What's the only thing that solves the issue? Well, rebooting the instance... After it boots, the websites are available again. Please note that I moved from DigitalOcean to AWS because it is more useful, but I can't understand why the problem happens here and not there, since I had a very similar instance configured there...
All of the instances have this setup:
OS: Ubuntu 18.04
Types: Two t2.micro and one t3a.micro
ELB: Enabled
Security Groups: only allow ports 80, 443 from all the sources.
Database: In a RDS, not on the same instance.
I can provide any logs you ask for, but I reviewed all the Nginx and PHP-FPM logs and can't see any anomalies. The same goes for syslog and kern.log, which I can also provide if it helps.
Hope you can give me a hand. Thanks for your advice!
EDIT:
I found the origin of the issue. The problem wasn't in EC2. All my headache was because the RDS instance had only one security group attached, which allowed access from my IP (for remote management of the databases) and from the public IPs of the EC2 instances that run Wordpress. I figured out that I also need to whitelist the private IPs of those EC2 instances... A real noob mistake, but that was the solution.
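The mismatch described in this edit can be checked mechanically: given the CIDR ranges listed in a security group's inbound rules, test whether the address the traffic actually arrives from is covered. The following is a small sketch using Python's standard `ipaddress` module. The function name and all addresses are made-up examples, not values from the original setup.

```python
import ipaddress


def is_allowed(source_ip, allowed_cidrs):
    """Return True if source_ip falls inside any CIDR listed in an inbound rule."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Hypothetical values: the rule lists only the instance's public IP ...
rules = ["203.0.113.10/32"]
# ... but inside the VPC the connection arrives from the private IP.
print(is_allowed("10.0.1.25", rules))                     # False
print(is_allowed("10.0.1.25", rules + ["10.0.1.25/32"]))  # True
```

Referencing the EC2 instances' security group as the source, instead of listing individual IPs, would avoid having to keep such a list up to date.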

I can ping my EC2 instance, but I cannot connect through ssh

A while back I created a RHEL EC2 instance, set it up correctly, and was able to connect to it through PuTTY and WinSCP. It went unused for a while, but recently it needed to be accessed again. I tried to log in but wasn't able to. So I rebooted the instance and tried to reconnect, but I still cannot. I get the error "Network error: Connection refused."
I tried recreating the ppk from the pem, and also opened all ports to all IPs. What could have caused this unreachability, and are there any troubleshooting tips for connecting to it again?
There are a few things to check here:
Did you have anything running on the box that might have caused it to become unresponsive over time? This is somewhat unlikely since you said you rebooted the machine.
Check your security group settings to ensure that the firewall is not blocking your SSH port. The instance has no way of knowing whether connections will actually be accepted by the Amazon network on the SSH listening port.
Amazon hardware can fail and cause your instance to become unresponsive. Go to the Instances page on your EC2 console and see if 2/2 of the status checks are passing. If fewer than 2 are passing, this is probably a failed instance situation.
As a last resort, try right-clicking the instance and checking the system log for anything that might have caused the instance to not listen for SSH connections.
Hopefully you have your data on an EBS volume such that you can simply stop and start the instance and have it come up on different hardware. While it would be nice if Amazon provided console level access to the box, unfortunately they do not presently (as far as I know).
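The status-check step from the list above can also be done programmatically. The sketch below counts passing checks per instance from a response shaped like the one returned by boto3's `describe_instance_status`. The function name is hypothetical, and the sample response is hand-written for illustration, not real output.

```python
def passing_status_checks(response):
    """Map instance ID to the number of passing checks (0 to 2)."""
    result = {}
    for item in response.get("InstanceStatuses", []):
        # One system-level check and one instance-level check per instance.
        checks = (item["SystemStatus"]["Status"], item["InstanceStatus"]["Status"])
        result[item["InstanceId"]] = sum(1 for status in checks if status == "ok")
    return result


# Hand-written sample in the shape of a describe_instance_status response.
sample = {
    "InstanceStatuses": [
        {
            "InstanceId": "i-0123456789abcdef0",
            "SystemStatus": {"Status": "ok"},
            "InstanceStatus": {"Status": "impaired"},
        }
    ]
}
print(passing_status_checks(sample))  # {'i-0123456789abcdef0': 1}
```

With boto3 available, the input would come from something like `boto3.client("ec2").describe_instance_status(InstanceIds=[...], IncludeAllInstances=True)`. An instance that reports fewer than 2 passing checks is a candidate for the stop-and-start approach mentioned above.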