I am running a python Django app on an AWS EC2 instance. It uses gunicorn and nginx to serve the app, the EC2 is behind an application load balancer. Occasionally I get 504 error where the entire EC2 instance becomes unreachable for everyone (including via SSH which I use all the time otherwise). I then need to restart everything which takes time.
I can replicate the error by overloading the app (e.g. uploading and processing a very large image), in that case, gunicorn worker times out (I see the timeout message in logs), 504 error appears and the instance becomes unreachable. I set my gunicorn to time out in 5 minutes (300 seconds) but it falls down quicker than that. There is nothing really useful in CloudWatch logs.
I am looking for ways to resolve this for all current and future cases. I.e., I want to have the situation where, if the site gets overloaded, it returns an error message as opposed to becoming completely unreachable for everyone. Is there a way to do that?
There are many things to consider and test here in order to get what is a reason for this, but I think it is OOM(out of memory) mainly because you have to restart even to login in SSH.
Nginx uses "event‑driven" approach to handle requests so a single worker of nginx can handle 1000s of req simultaneously. But Gunicorn on the other hand mostly(by default) uses sync worker which means a request will remain with a worker till it is processed.
When you put a large request your machine tries to process that request until an overflow occurs, mostly it will not get detected by any service running inside a machine. Just try to monitor memory by any monitoring tool in AWS or just SSH and use htop before calling the API.
In most cases with Django/gunicorn the culprit is oom.
Edit:
AFAIK You cannot capture(cache) an oom, the only thing you can do is aftermath i.e after system restart sees/var/logs/syslogs ... As I said monitor in AWS memory monitor(I don't have much experience with AWS).
And regarding the solution,
you first increase the memory of your EC2 until you don't get an error to see how big the problem is.
Then you optimise your application by profiling which part is actually taking this much memory. I haven't used any memory profiling so maybe you can tell me after which is better.
The only thing you can do is optimise your application see common gotchas, best practices, Query optimisations etc.
https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/
https://www.pluralsight.com/blog/tutorials/how-to-profile-memory-usage-in-python
Related
Backgound:
We are running a single page application being served via nginx with a node js (v12.10) backend running express. It runs as containers via ECS and currently we are running three t3a mediums as our container instances with the api and web services each running 6 replicas across these. We use an ALB to handle our load balancing / routing of requests. We run three subnets across 3 AZ's with the load balancer associated with all three and the instances spread across the 3 AZ's as well.
Problem:
We are trying to get to the root cause of some intermittent 502 errors that are appearing for both front and back end. I have downloaded the ALB access logs and the interesting thing about all of these requests is that they all show the following.
- request_processing_time: 0.000
- target_processing_time: 0.000 (sometimes this will be 0.001 or at most 0.004)
- response_processing_time: -1
At the time of these errors I can see that there were healthy targets available.
Now I know that some people have had issues like this with keepAlive times that were shorter on the server side than on the ALB side, therefore connections were being forceably closed that the ALB then tries to reuse (which is in line with the guidelines for troubleshooting on AWS). However when looking at the keepAlive times for our back end they are set higher than our ALB currently by double. Also the requests themselves can be replayed via chrome dev tools and they succeed (im not sure if this is a valid way to check a malformed request, it seemed reasonable).
I am very new to this area and if anyone has some suggestions as to where to look or what sort of tests to run that might help me pinpoint this issue it would be greatly appreciated. I have run some load tests on certain endpoints and duplicated the 502 errors, however the errors under heavy load differ from the intermittent ones I have seen on our logs in that the target_processing_time is quite high so to my mind this is another issue altogether. At this stage I would like to understand the errors that show a target_processing_time of basically zero to start with.
I wrote a blog post about this a bit over a year ago that's probably worth taking a look at (caused due to a behavior change in NodeJS 8+):
https://adamcrowder.net/posts/node-express-api-and-aws-alb-502/
TL;DR is you need to set the nodejs http.Server keepAliveTimeout (which is in ms) to be higher than the load balancer's idle timeout (which is in seconds).
Please also note that there is also something called an http-keepalive which sets an http header, which has absolutely nothing to do with this problem. Make sure you're setting the right thing.
Also note that there is currently a regression in nodejs where setting the keepAliveTimeout may not work properly. That bug is being tracked here: https://github.com/nodejs/node/issues/27363 and is worth looking through if you're still having this problem (you may need to also set headersTimeout as well).
I am using Coldfusion MX8 server and one of the scheduled task was running from 2 years but now suddenly from 01/12/2014 scheduled tasks are not running. When i browsed the file in browser then the file is running successfully without error.
I am not sure is there any updatation or license expiration problem. I am aware that mid of this year Adobe closed the support for coldfusion 8.
The first most common problem of this problem is external to the server. When you say you browsed to the file and it worked in a browser, it is very important to know if that test was performed on the server desktop. Knowing that you can browse to the file from your desktop or laptop is of small value.
The most common source of issues like this is a change in the DNS or network stack that is interfereing with resolution. For example, if the internal DNS serving your DMZ suddenly starts serving the "external" address - suddenly your server can't browse to your domain. Or if the IP served by the server for the domain in question goes from being 127.0.0.1 to some other IP that the server can't acces correctly due to reverse proxy or LB or some other rule. Finally, sometimes the Apache or IIS is altered so that an IP that previously was serviced (127.0.0.1 being the most common example) now does not respond.
If it is something intrinsic to the scheduler service then Frank's advice is pretty good - especially look for "proxy schduler" entries in the log - they can give you good clues. I would also log results of a scheduled task to a file. Then check the file. If it exists then your scheduled tasks ARE running - they are just not succeeding. Good luck!
I've seen the cf scheduling service crash in CF8. The rest of CF is unaffected.
Have you tried restarting the server?
Here are your concerns:
Your File (works since you tested it manually).
Your Scheduled Task (failed).
Your Coldfusion Application (Service) (any changes here)?
Your Server (what about here).
To test your problem create a duplicate task and schedule it. Leave the other one in place (maybe set your new one to run earlier). Use the same file too. See if it completes.
If it doesn't then you have a larger problem. Since the Coldfusion Server sits atop of the JVM there could be something happening there. Things just don't stop working unless something got corrupted or you got compromised. If you hardened your server by rearranging/renaming the file structure to make it more secure...It would break your task.
So going back: if your test schedule works then determine what is different between the two. Note you have logging capabilities. Logging abilities for CF8
If you are not directly incharge of maintaining this server, then I would recommend asking around and see if there was recent maintenance, if so, what was done to the server?
I am currently using Jetty 9.1.4 on Windows.
When I deploy the war file without hot deployment config, and then restart the Jetty service. During that 5-10 seconds starting process, all client connections to my Jetty server are waiting for the server to finish loading. Then clients will be able to view the contents.
Now with hot deployment config on, the default Jetty 404 error page shows within that 5-10 second loading interval.
Is there anyway I can make the hot deployment has the same behavior as the complete restart - clients connections will wait instead seeing the 404 error page ?
Unfortunately this does not seem to be possible currently after talking with the Jetty developers on IRC #jetty.
One solution I will try to use are two Jetty instances with a loadbalancing reverse proxy (e.g. nginx) before them and taking one instance down for deployment.
Of course this will instantly lead to new requirements (session persistence/sharing) which need to be handled. So in conclusion: much work to do in the Java world for zero downtime on deployments.
Edit: I will try this, seems like a simple enough solution http://rafaelsteil.com/zero-downtime-deploy-script-for-jetty/ Github: https://github.com/rafaelsteil/jetty-zero-downtime-deploy
To begin, I am running a Wordpress site on an AWS EC2 Ubuntu Micro instance. I have already confirmed that this is NOT an error with Wordpress/mysql.
Seemingly at random the site will go down and I'll get the "Error establishing database connection" message. The server says that it is running just fine, and rebooting usually fixes the issue, however I'd like to figure out the cause and resolve the issue so this can stop happening (it's been the past 2 weeks now that it goes down almost every other day.)
It's not a spike in traffic, or at least Google Analytics hasn't shown the site as having any spikes in traffic (it averages about 300 visits per day.)
What's the cause, and how can this be fixed?
Sounds like you might be running into the throttling that is a limitation on t1.micro. If you use too much CPU cycles you will be throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes
The next time this happens I would check some general stats on the health of the instance. You can get a feel for the high-level health of the instance using the 'top' command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look for CPU and memory usage. You may find a process (pid) that is consuming a lot of resources and starving your app.
More likely, something within your application (how did you come to the conclusion that this is not a Wordpress/MySQL issue?) is going out of control. Possibly there is a database connection not being released? To see what your app is doing, find the process id (pid) for your app:
ps aux | grep "php"
and get a thread dump for that process: kill -3 to get java thread dump. This will help you see where your application's threads are stuck (if they are).
Typically it's good practice to execute two thread dumps a few seconds apart and compare trends in both. If there is an issue in the application, you should see a lot of threads stuck at the same point.
You might also want to checkout what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html).
mysql> SHOW FULL PROCESSLIST
Hope this helps, let us know what you find!
I have an Apache2 and Django (mod_wsgi) setup that provides a RESTful API. I have a set of automated tests for this, that executes ~1000 API requests (pure http GET/POST/PUT/DELETE) in sequential order.
The problem is, for every 80 requests or so, I get a strange lag/timeout for exactly 5s or 10s. See timestamp examples here:
Request 1: 2013-08-30T03:49:20.915
Response 1: 2013-08-30T03:49:30.940
Request 2: 2013-08-30T03:50:32.559
Response 2: 2013-08-30T03:50:37.597
I can't figure out why this happens. I have an apache config with KeepAlive Off (recommended setup setting for Django) but otherwise standard install for Ubuntu 12.04 LTS.
I'm running the tests from the same server where the webserver is, I first thought this was some kind of DNS cache thing, but I've added the hostname I'm requesting to /etc/hosts but the problem persists.
The system is idle and have lots of cpu and mem when this lag/timeouts happens.
The lag is not specific to a certain request (URL), it seems kinda random.
Considering that it's always exactly to the millisecond 5s or 10s, it feels like this is some specific setting somewhere causing this.
In case it provides some insight, watch my talk from PyCon US.
http://lanyrd.com/2013/pycon/scdyzk/
The talk deals with things like process churn and startup costs. One thing you shouldn't do is set maximum requests if you don't really need it.
Also consider trying New Relic to help diagnose where the issue is. That will save a lot of guessing if it is a web application of backend service infrastructure issue.
As far as seeing how such monitoring can help, watch another one of my PyCon talks.
http://lanyrd.com/2012/pycon/spcdg/
This was a DNS issue, adding the domainname I used locally to /etc/hosts actually solved the problem. I just hadn't reboot the server for the changes to take effect, thought restarting networking would take care of that, but apparently not.