Sudden 503s on OpenShift Django. Need help debugging - django

Today my Django 1.4 app on OpenShift started throwing 503 errors about 99% of the time (yeah, roughly 1% of the time it still loads fine). htop doesn't show any huge workload and the logs don't show any errors.
Any recommendations on how to debug this?
./manage.py shell works fine on the server and even the PostgreSQL 9.2 db is fine.

I know you mentioned that there's nothing in the logs, but I would still try tailing all of them with rhc tail <yourApp> and watching in real time for any clues while the 503s are being returned.
I'd also recommend checking whether your gear is restarting due to insufficient memory.
Having your SSH connection closed unexpectedly may be another indicator of gear restarts.
Note that htop displays only your own tasks, which take up only a small share of the whole node's resources; using, say, 3% of a node's 16 GB of memory may already be nearing a small gear's limit (512 MB).
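Concretely, something along these lines is what I'd run (the app name is a placeholder; adapt it to your setup):
rhc tail myapp    # stream all the gear's logs while you reproduce a 503
rhc ssh myapp     # then, inside the gear, look for memory pressure:
free -m           # shows the whole node, so treat it only as a rough hint
ps -u "$(whoami)" -o pid,rss,cmd --sort=-rss | head    # your own processes by resident memory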

Related

504 timeout on AWS with nginx and gunicorn

I am running a Python Django app on an AWS EC2 instance. It uses gunicorn and nginx to serve the app, and the EC2 instance is behind an application load balancer. Occasionally I get a 504 error where the entire EC2 instance becomes unreachable for everyone (including via SSH, which I use all the time otherwise). I then need to restart everything, which takes time.
I can replicate the error by overloading the app (e.g. uploading and processing a very large image). In that case, the gunicorn worker times out (I see the timeout message in the logs), the 504 error appears, and the instance becomes unreachable. I set my gunicorn timeout to 5 minutes (300 seconds), but it falls over sooner than that. There is nothing really useful in the CloudWatch logs.
I am looking for a way to resolve this for all current and future cases. That is, if the site gets overloaded, I want it to return an error message rather than becoming completely unreachable for everyone. Is there a way to do that?
There are many things to consider and test here to find the actual cause, but I think it is OOM (out of memory), mainly because you have to restart the instance even to log in over SSH.
Nginx uses an event-driven approach to handle requests, so a single nginx worker can handle thousands of requests simultaneously. Gunicorn, on the other hand, mostly (by default) uses sync workers, which means a request stays with a worker until it has been fully processed.
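As a rough sketch, bounding and recycling those sync workers looks something like this (the module path and numbers are placeholders to adapt):
# Memory is per worker, so keep the count small on a small instance, give workers a
# timeout so a stuck one is killed instead of hanging, and recycle them periodically
# to cap slow memory growth.
gunicorn myproject.wsgi:application --workers 3 --timeout 120 --max-requests 500 --max-requests-jitter 50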
When you send a large request, the machine keeps trying to process it until memory overflows, and mostly that will not get detected by any service running inside the machine. Just try to monitor memory with any monitoring tool in AWS, or simply SSH in and use htop before calling the API.
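A low-tech way to do that monitoring from a second SSH session while you reproduce the big upload (plain Linux tools, nothing AWS-specific):
watch -n 2 free -m               # live view of memory and swap
vmstat 5 | tee /tmp/vmstat.log   # or log it so the data survives if the instance locks up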
In most cases with Django/gunicorn the culprit is OOM.
Edit:
AFAIK you cannot catch an OOM as it happens; the only thing you can do is look at the aftermath, i.e. after the system restarts, check /var/log/syslog. As I said, use the memory monitoring in AWS (I don't have much experience with AWS).
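For example, this is roughly what I'd grep for (if the instance stayed up, dmesg still has the OOM killer's messages; after a reboot, the persisted syslog is the place to look, and the path differs slightly between distros):
dmesg -T | grep -i -E 'out of memory|killed process'
grep -i 'oom' /var/log/syslog    # Ubuntu; on Amazon Linux check /var/log/messages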
And regarding the solution:
First, increase the memory of your EC2 instance until the error stops, to see how big the problem is.
Then optimise your application by profiling which part is actually taking that much memory (see the sketch after the links below). I haven't used any memory profiler myself, so maybe you can tell me afterwards which one is better.
Beyond that, the only thing you can do is optimise your application: see the common gotchas, best practices, query optimisations, etc.
https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/
https://www.pluralsight.com/blog/tutorials/how-to-profile-memory-usage-in-python
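As one concrete (hypothetical) way to do that profiling, the third-party memory_profiler package ships an mprof command that samples a process's memory over time; roughly:
pip install memory-profiler matplotlib
mprof run gunicorn myproject.wsgi:application --workers 1    # module path is a placeholder
# ...reproduce the heavy upload, stop gunicorn, then:
mprof plot    # plots memory over time so you can see where it climbs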

'Server Connection Error' on GCP (AI Platform Notebook)

I am facing some issues with GCP and the AI Platform (JupyterLab).
It seems that I am unable to maintain a stable connection with the server for very long. I keep getting the 'server connection error' message. From there, two possibilities:
either nothing happens and my cell keeps running
or
the cells have stopped running and I can see the status 'No Kernel!' at the top right of the notebook. Whenever I select a kernel (Python 3) again, depending on my luck I can either keep working, or the cell will display the running status (with the * on the left of it) but the kernel status at the bottom left will stay on 'connected' (instead of 'busy'). In the latter case, I need to restart the kernel and run all the cells again, which can take very long.
Sometimes this happens as soon as I run the first cell after (re)starting the instance, sometimes a bit later. The longest stable period I was able to work on the notebook without any issue was about 20-30 minutes, which is quite annoying.
Configuration of my main instance :
- 16 CPUs
- 60 GB of RAM
- A P100 NVIDIA GPU
I have tried different instance types and I keep having the same problem; my network at home is stable.
[screenshot of the error message]
What operating system and browser are you using at work?
I had the same problem as you did on Ubuntu 18 with the Firefox browser.
When I switched to Windows with Chrome the error did not reoccur, even though it was the same network.
I had a similar issue today: according to the Google docs, the cause for this is that the Docker/Jupyter service is not starting.
In our specific case, the reason these services couldn't start was a full disk.
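If you can still SSH into the notebook VM, checking disk space and the Jupyter service status is quick (the service name below is what the Deep Learning VM images typically use; verify it on your instance):
df -h                                                         # a full boot disk is the giveaway
sudo systemctl status jupyter --no-pager                      # is the Jupyter service running?
sudo journalctl -u jupyter --since "1 hour ago" --no-pager    # recent service logs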

A timeout occur while attempting to local fusebox

I have a Fusebox application set up on ColdFusion 9 and it is in production mode. When the application starts, it gives me the following error:
A timeout occur while attempting to local fusebox.
I have increased the timeout but that did not help. I also tried other solutions but nothing helped.
There is one more thing: when I increased the JVM heap size from 1-2 GB to 2-3 GB and restarted the ColdFusion service, the error remained for 1-2 hours and then the site started working.
It also stops working after running correctly for more than half a day. Then I get a 504 gateway error and have to restart the service again.
Can anyone guide me on how to solve this problem?
Which FB version are you running? Is it set to production or development mode? If set to development, try setting it to production mode.

Very slow: ActiveRecord::QueryCache#call

I have an app on heroku, running on Puma:
workers 2
threads_count 3
pool 5
It looks like some requests get stuck in the middleware, and it makes the app very slow (VERY!).
I have seen other people's threads about this problem but no solution so far.
Please let me know if you have any hint.
I work for Heroku support, and Middleware/Rack/ActiveRecord::QueryCache#call is commonly reported as a problem by New Relic. Unfortunately, it's usually a red herring, as each time the source of the problem lies elsewhere.
QueryCache is where Rails first tries to check out a connection for use, so any problems with a connection will show up here as a request getting 'stuck' waiting. This doesn't mean the database server is out of connections necessarily (if you have Librato charts for Postgres they will show this). It likely means something is causing certain database connections to enter a bad state, and new requests for a connection are waiting. This can occur in older versions of Puma where multiple threads are used and the reaping_frequency is set - if some connections get into a bad state and the others are reaped this will cause problems.
Some high-level suggestions are as follows:
- Upgrade Ruby & Puma
- If using the rack-timeout gem, upgrade that too
These upgrades often help. If not, there are other options to look into, such as switching from threads to worker-based processes or using a Postgres connection pooler such as PgBouncer. We have more suggestions on configuring concurrent web servers for use with Postgres here: https://devcenter.heroku.com/articles/concurrency-and-database-connections
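If you do go the PgBouncer route, Heroku's buildpack is one documented option; a rough sketch (check the article above and the buildpack README for the current setup, and treat the start-script name as an assumption):
heroku buildpacks:add https://github.com/heroku/heroku-buildpack-pgbouncer -a your-app
# then prefix the web process in your Procfile with the buildpack's wrapper, e.g.:
# web: bin/start-pgbouncer bundle exec puma -C config/puma.rb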
I will answer my own question:
I simply had to check all the queries to my DB. One of them was taking a VERY long time, and even though it was not requested often, it would slow down the whole server for quite some time afterwards (even after the process was done, there was a sort of "traffic jam" on the server).
Solution:
Check all the queries to your database and fix the slowest ones (that might simply mean breaking a query down into a few steps, or running it at night when there is no traffic, etc.).
Once these queries are fixed, everything should go back to normal.
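A couple of ways to actually find those slow queries on Heroku Postgres (pg:outliers comes from the heroku-pg-extras plugin and reads pg_stat_statements, so install the plugin first):
heroku pg:diagnose -a your-app     # built-in report; flags long-running and idle queries
heroku plugins:install heroku-pg-extras
heroku pg:outliers -a your-app     # queries ranked by total execution time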
I recently started seeing a spike in time spent in ActiveRecord::QueryCache#call. After looking at the source, I decided to try clearing said cache using ActiveRecord::Base.connection.clear_query_cache from a Rails console attached to the production environment. The error I got back was PG::ConnectionBad: could not fork new process for connection: Cannot allocate memory, which led me to this other SO question: Heroku Rails could not fork new process for connection: Cannot allocate memory

What could be causing seemingly random AWS EC2 server to Crash? (Error couldn't establish database connection)

To begin, I am running a WordPress site on an AWS EC2 Ubuntu Micro instance. I have already confirmed that this is NOT an error with WordPress/MySQL.
Seemingly at random, the site will go down and I'll get the "Error establishing database connection" message. The server says that it is running just fine, and rebooting usually fixes the issue; however, I'd like to figure out the cause and resolve it so this stops happening (for the past 2 weeks it has gone down almost every other day).
It's not a spike in traffic, or at least Google Analytics hasn't shown the site as having any spikes in traffic (it averages about 300 visits per day.)
What's the cause, and how can this be fixed?
Sounds like you might be running into the CPU throttling that is a limitation of t1.micro instances. If you use too many CPU cycles, you will be throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes
The next time this happens, I would check some general stats on the health of the instance. You can get a feel for its high-level health using the 'top' command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look at CPU and memory usage. You may find a process (pid) that is consuming a lot of resources and starving your app.
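A quick snapshot to grab next time it happens (on a throttled t1.micro the "st" (steal) column from vmstat is also worth a look):
ps aux --sort=-%mem | head -15    # biggest memory consumers
free -m                           # overall memory and swap
vmstat 5 5                        # watch the st column for CPU steal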
More likely, something within your application (how did you come to the conclusion that this is not a Wordpress/MySQL issue?) is going out of control. Possibly there is a database connection not being released? To see what your app is doing, find the process id (pid) for your app:
ps aux | grep "php"
and get a thread dump for that process (kill -3 <pid> produces a thread dump for Java processes). This will help you see where your application's threads are stuck (if they are).
Typically it's good practice to execute two thread dumps a few seconds apart and compare trends in both. If there is an issue in the application, you should see a lot of threads stuck at the same point.
You might also want to check out what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html).
mysql> SHOW FULL PROCESSLIST;
Hope this helps, let us know what you find!