How to determine the root cause of AWS Elastic Beanstalk shutdown errors - Django

I have a Django app hosted on AWS Elastic Beanstalk.
Users upload documents to the site. Sometimes a user uploads a document and the server completely shuts down: it instantly starts returning 500s, goes offline for about 4 minutes, and then the app is magically back up and running.
Obviously, something is happening to the app where it gets overwhelmed.
The only thing I get from Elastic Beanstalk is this message:
Environment health has transitioned from Ok to Severe. 100.0 % of the requests are failing with HTTP 5xx. ELB processes are not healthy on all instances. ELB health is failing or not available for all instances.
Then about 4 minutes later:
Environment health has transitioned from Severe to Ok.
I have one t2.medium EC2 instance. The environment is set up as load balanced, but with Min 1 / Max 1, so I don't actually take advantage of the load balancing features.
Here's a screenshot of my health tab:
My app shut off on 7/10 as can be seen in picture 1. My CPU spiked at this time, but I can't imagine 20% CPU was enough to overwhelm my server.
How can I determine what might be causing these short 500 errors? Is there somewhere else I can look to discover the source of this? I don't see anything helpful in my access_log or error_log. I don't know where to start looking.

I was having similar problems with Elastic Beanstalk without a load balancer. When I faced this problem, my application would simply crash and I had to rebuild the environment from scratch. Further searching revealed that the cause was sometimes the EC2 instance running out of memory, which caused Elastic Beanstalk to shut the app down. The solution was to add a swap area (I preferred 2048 MB of swap) to prevent these sudden memory spikes from exhausting RAM.
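If you want to confirm that memory pressure really is the culprit before adding swap, one quick check (just a sketch, assuming an Amazon Linux instance where you can read the kernel and system logs) is to look for OOM killer activity around the time of an outage:
# Kernel messages about the OOM killer, with human-readable timestamps
sudo dmesg -T | grep -i "out of memory"
# The same events usually also appear in the system log on Amazon Linux
sudo grep -i "killed process" /var/log/messages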
Here is how to add a swap area to the Elastic Beanstalk instance:
.ebextensions/swap-area.sh:
#!/usr/bin/env bash
# Create a 2048 MB swap file so sudden memory spikes don't exhaust RAM
SWAPFILE=/var/swapfile
SWAP_MEGABYTES=2048
# Skip if the swap file already exists (e.g. on redeploys)
if [ -f "$SWAPFILE" ]; then
  echo "$SWAPFILE found, ignoring swap setup..."
  exit 0
fi
/bin/dd if=/dev/zero of="$SWAPFILE" bs=1M count=$SWAP_MEGABYTES
/bin/chmod 600 "$SWAPFILE"
/sbin/mkswap "$SWAPFILE"
/sbin/swapon "$SWAPFILE"
.ebextensions/00-swap-area.config:
container_commands:
  00_swap_area:
    command: "bash .ebextensions/swap-area.sh"
Then, after the deployment, you can check the swap area on your EC2 instance with commands such as top or free.
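For example, a quick sanity check right after deployment (just a sketch of the idea):
# The Swap row should show roughly 2048 MB total
free -m
# Lists active swap devices/files; /var/swapfile should appear here
swapon -s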

Related

Service not responding to ping command

My service (myservice.com), which is hosted on EC2, is up and running. I can see the Java process running on the machine, but I am not able to reach the service from external machines. I tried the following:
dig +short myservice.com
ping myservice.com
dig is resolving and giving me an IP address; ping results in 100% packet loss. I am not able to reach the service.
I'm not sure where to look. Any help debugging this would be appreciated.
EDIT:
I had an issue with a previous deployment that prevented the service from starting. I've fixed it and tried to deploy the update, but it was blocked by the ongoing deployment (which might take ~3 hrs to stabilize), so I tried enabling the Force deployment option from the console.
I also tried reducing the "Number of Tasks" count to 0 and then reverting it back to 1 (reference: How do I deploy updated Docker images to Amazon ECS tasks?) to stop the ongoing deployment.
Can that be an issue?
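For reference, the CLI equivalents of those console actions would look roughly like this (only a sketch; the cluster and service names are placeholders, not from my setup):
# Scale the service down to 0 tasks and back to 1
aws ecs update-service --cluster my-cluster --service my-service --desired-count 0
aws ecs update-service --cluster my-cluster --service my-service --desired-count 1
# Or force a new deployment of the same task definition
aws ecs update-service --cluster my-cluster --service my-service --force-new-deployment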
You probably need to allow ICMP protocol in the security group.
See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-rules-reference.html#sg-rules-ping
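If so, something along these lines should open ICMP echo to the instance (a rough sketch; the security group ID and CIDR are placeholders to replace with your own):
# Allow all ICMP (including echo request/reply) from anywhere
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol icmp --port -1 --cidr 0.0.0.0/0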

Why AWS Classic Load Balancer Instances are OutOfService, but Auto Scale Group Shows InService and Healthy?

I tried to avoid asking this question, but after struggling for hours (or days) and reading all related materials, I'm desperately turning to SO.
So, I am trying to deploy my Node/React project to AWS with a (classic) load balancer and an auto scaling group. I got all the individual pieces working. Somehow, the instances in the load balancer always show OutOfService, although those same instances are InService and Healthy in the auto scaling group. Why this disconnect?
Then I added an Elastic IP to one of the instances, SSH'd into it, and ran "npm start" manually. Now this instance shows InService and Healthy in the load balancer.
It appears to me that it's not a security group issue, but that the startup script didn't get executed. This is my script:
#!/bin/bash
cd /home/ec2-user/projectname
npm start
Why not?
Some Update:
I enabled access logging for this balancer, and I am getting a lot of the same error over and over. Here is one of them:
<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>BC0FA4BB97BA1557</RequestId>
<HostId>r3wBXZLxJkTzm/SqcQnxEO+f9DhbtCxTLcVAn1vmllj6Dwa0xlO2psP3eEKOiuvNWY/Yb+Gt4C0=</HostId>
</Error>
This is not very helpful to me to figure out where the problem is.
More confusing is that I also get this kind of error log when the instance has been started manually and is running, and the instance status in the LB is healthy.
What is denied? The health checker? Who is the health checker? These are my health check settings in the balancer:
Ping Target HTTP:3000/
Timeout 5 seconds
Interval 30 seconds
Unhealthy threshold 2
Healthy threshold 10
The listener has HTTP 80 as the load balancer port and 3000 as the instance port.
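For what it's worth, the check the ELB performs can be reproduced from the instance itself; a quick sketch (assuming the app really is listening on port 3000):
# Should return HTTP 200 within the 5-second timeout for the instance to count as healthy
curl -i http://localhost:3000/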
UPDATE again:
It appears to me that the real cause of the problem is that the startup script didn't run. I found a few suggestions around this problem, like clearing the /var/lib/cloud folder or adding #cloud-boothook to the top of the startup script, but nothing worked for me.
UPDATE (3):
I couldn't make it work properly after a few days of struggle and am giving up for now. Here is a summary of what I learnt.
First off, I managed to follow Ryan Lewis' PluralSight video (Deploying to AWS with load balance and auto scaling) and get it working as expected. My project is very close to his "pizza-luvrs" project, except that I'm using a React front end and MongoDB. However, for some reason, I can't make the same setup work for my own project.
My goal was to have the load balancer work together with the auto scaling group using a pre-created AMI (with Node, pm2 and my project installed). Using the startup script below with pm2, I got the server running on port 3000.
#!/bin/bash
echo "starting xxx..."
# restart pm2 and thus node app on reboot
crontab -l | { cat; echo "@reboot sudo pm2 start /home/ec2-user/xxx/server.js -i 0"; } | crontab -
# start the server on port 3000
pm2 start /home/ec2-user/xxx/server.js -i 0
echo "xxx started."
However, the instances in the load balancer keep saying "OutOfService", although the instances in the auto scaling group always show InService. The strangest thing is that after I attach an Elastic IP (because my auto scaling instance is private) and SSH into the instance, without doing anything else, it eventually becomes InService (though not always). I can then disassociate the Elastic IP and it keeps its InService status. It sounds like the security group might be the cause of this problem, so I compared mine with that "pizza-luvrs" project a thousand times and made sure they have exactly the same setup. Still, it works for his project but not for mine.
By the way, in the EC2 instances view, if you select an instance and then choose "Instance Settings" > "Get System Log", you can see how the instance booted. This is how I can tell whether my startup script in "user data" was executed.
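On Amazon Linux AMIs the output of the user data script is also captured by cloud-init, so it can be read directly on the instance (a quick check, assuming cloud-init is what runs user data there):
# stdout/stderr of the user data script ends up here
sudo cat /var/log/cloud-init-output.log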
What you put in user data is executed once, on first boot, and that's it. It's intended for setting up the instance the first time it's started; on subsequent startups it will not run. If you want to schedule a script to run on every boot, look up how to do it on whatever OS you're using.
Assuming you run some form of Linux that uses systemd, which is the most popular choice, have a look here.
You can also execute user data on every boot, but it's really not worth the trouble.
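To make the systemd suggestion concrete, a minimal sketch might look like the following (the unit name, user, paths and start command are illustrative placeholders, not taken from the question):
#!/bin/bash
# One-time setup, e.g. baked into the AMI: register a unit so the app starts on every boot
cat <<'EOF' | sudo tee /etc/systemd/system/myapp.service
[Unit]
Description=Node app
After=network.target

[Service]
Type=simple
User=ec2-user
WorkingDirectory=/home/ec2-user/projectname
ExecStart=/usr/bin/npm start
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now myapp.service
Once the unit is enabled, freshly launched auto scaling instances no longer depend on user data running again.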

AWS Elastic Beanstalk restart docker if CPU is 100% for longer time

We have Elastic Beanstalk set up with load balancing. When our app consumes 100% CPU for a longer period (e.g. after some downtime, when we receive tons of webhooks), the load balancer restarts Docker inside the instance. Our app takes approximately 2 minutes to start, so we can never recover from the downtime.
Is there any way to extend this restart period or even disable it?
Scaling on a CPU threshold is not an option for us, as our app consumes lots of CPU during higher load.
This seems like a case of a failing health check.
You can go to your EC2 Dashboard => Load Balancers.
Check the load balancer that targets your EB environment; under the Health Check tab, you can see and edit the threshold of failed ping requests to your instance before it is considered unhealthy and terminated.
More information on health checks here and here
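For example, with a classic ELB the same values can be adjusted from the CLI as well (only a sketch; the load balancer name and numbers are placeholders):
# Allow more failed checks / a longer window before the instance is marked unhealthy
aws elb configure-health-check --load-balancer-name my-eb-elb --health-check Target=HTTP:80/,Interval=30,Timeout=5,UnhealthyThreshold=10,HealthyThreshold=3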
Increasing the instance size from small to medium actually solved my problem. It seems that the app could not handle this amount of load with the limited resources of the small instance type.

Elastic Beanstalk 502 errors during autoscaling

I have an Elastic Beanstalk app running on Docker set up with autoscaling. When another instance is added to my environment as a result of autoscaling, it will 502 while the instance goes through the deployment process. If I ssh into the relevant box, I can see (via docker ps) that docker is in the process of setting itself up.
How can I prevent my load balancer from directing traffic to the instance until the instance deployment has actually completed? I found this potentially related question on SuperUser, but I think my health check URL is set up properly: I have it pointing at the root of the domain, which definitely 502s when I navigate to it in my browser, so I suspect that's not the cause of my problem.

AWS ELB 502 at the same time every day

First, some insight into my setup:
1 ELB
4 EC2 instances
2 web servers
1 to run the migrations, queue (beanstalkd) and scheduler
1 'services' server (socket.io instance etc etc)
MySQL on RDS
Redis on Elasticache
S3 for user assets
Every day at 10:55 PM, users report getting white screens and 502 Bad Gateway errors. The ELB reports that both EC2 instances are OutOfService, yet I can SSH into them and am fully able to use the site by bypassing the ELB. The RDS and ElastiCache maintenance windows aren't during this period, and the two instances aren't under load either. I can't find anything in the ELB access logs, nothing in the nginx logs on the instance side, nothing in the Laravel app logs. There's nothing in the Laravel scheduler that runs at this time either.
The only thing I've found is that in my CloudWatch metrics, the ELB latency spikes right up to about 5-10 seconds. All of this results in downtime of about 5-15 minutes at the same time every day. I can't seem to find anything that is causing the issue.
I'm 100% stumped as to what could be causing this to happen. Any help is appreciated.
What probably happens is that your web servers run out of connections, so the ELB cannot perform its health checks and takes them out of service. It's actually enough for one of the machines to experience this and be taken out of service; the other one then goes down as a cascading effect.
How many connections can the web servers hold at the same time?
Do you process a particularly "heavy request" at that point in time when this happens?
Does adding more web servers solve your problem?
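If it helps to narrow this down, a quick way to see whether connections are the bottleneck (just a diagnostic sketch, assuming nginx on Linux) is to compare the live socket counts against the configured limit right around 10:55 PM:
# Summary of sockets, including currently established TCP connections
ss -s
# nginx's per-worker connection limit (multiply by worker_processes for the effective total)
grep -R "worker_connections" /etc/nginx/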