PostgreSQL Database Server Unresponsive - django

How do you diagnose problems with PostgreSQL performance?
I have a Django-based webapp using PostgreSQL as a database backend on Ubuntu 12, and under heavy load, the database seems to just disappear, causing the Django-interface to be unreachable and resulting in errors like:
django.db.utils.DatabaseError: error with no message from the libpq
django.db.utils.DatabaseError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
What's odd is that the logs in /var/log/postgresql show nothing unusual. The only thing the log /var/log/postgresql/postgresql-9.1-main.log shows is lots of lines like:
2012-09-01 12:24:01 EDT LOG: unexpected EOF on client connection
Running top shows that PostgreSQL doesn't seem to be consuming any CPU, even though service postgresql status indicates it's still running.
Doing a `service postgresql restart` temporarily fixes the problem, but the problem returns as soon as there's a lot of load on the database.
I've checked the dmesg and syslog, but I don't see anything that would explain what's wrong. What other logs should I check? How do I determine what's wrong with my PostgreSQL server?
Edit: My max_connections is set to 100, but I am doing a lot of manual transactions. Reading up on Django's ORM behavior with PostgreSQL in manual transaction mode, it looks like I may have to explicitly call connection.close(), which I'm not doing.

I found this was due to Django's buggy Postgres backend in combination with multiprocessing. Essentially, Django doesn't properly close its connections automatically, causing weird behavior like tons of "idle in transaction" connections. I fixed it by adding connection.close() to the end of my multiprocessing-launched functions and before certain queries that were throwing this error.
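A minimal sketch of that fix, assuming a worker function launched via multiprocessing; do_heavy_work and run_queries are placeholder names, not code from my project:
from multiprocessing import Process
from django.db import connection

def run_queries(batch):
    pass  # your actual ORM work goes here

def do_heavy_work(batch):
    try:
        run_queries(batch)
    finally:
        # The forked child inherits the parent's connection object; closing it here
        # forces Django to open a fresh connection and keeps "idle in transaction"
        # sessions from piling up on the server.
        connection.close()

if __name__ == '__main__':
    workers = [Process(target=do_heavy_work, args=(i,)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()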

2012-09-01 12:24:01 EDT LOG: unexpected EOF on client connection
This message indicates the issue is on the client side - maybe an exception from libpq? There can be related issues: when clients hang without a clean disconnect, you end up with lots of idle connections, and you start hitting other errors sooner.
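If you want to confirm that, here is a minimal sketch for listing idle-in-transaction sessions with psycopg2, assuming PostgreSQL 9.1 (where the state is reported in current_query; on 9.2+ query the state column instead) and superuser access:
import psycopg2

# Connect as a superuser (e.g. the postgres OS user) so pg_stat_activity shows all sessions.
conn = psycopg2.connect(dbname='postgres', user='postgres')
cur = conn.cursor()
cur.execute("""
    SELECT procpid, usename, query_start, current_query
    FROM pg_stat_activity
    WHERE current_query = '<IDLE> in transaction'
""")
for row in cur.fetchall():
    print(row)
conn.close()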

The program pg_ctl has some options that might help. (man pg_ctl)
-c
Attempt to allow server crashes to produce core files, on platforms
where this is possible, by lifting any soft resource limit placed
on core files. This is useful in debugging or diagnosing problems
by allowing a stack trace to be obtained from a failed server
process.
-l filename
Append the server log output to filename. If the file does not
exist, it is created. The umask is set to 077, so access to the log
file is disallowed to other users by default.
The program postgres also has some debug options. (man postgres)
-d debug-level
Sets the debug level. The higher this value is set, the more
debugging output is written to the server log. Values are from 1 to
5. It is also possible to pass -d 0 for a specific session, which
will prevent the server log level of the parent postgres process
from being propagated to this session.
In the section "Semi-internal Options" . . .
-n
This option is for debugging problems that cause a server process
to die abnormally. The ordinary strategy in this situation is to
notify all other server processes that they must terminate and then
reinitialize the shared memory and semaphores. This is because an
errant server process could have corrupted some shared state before
terminating. This option specifies that postgres will not
reinitialize shared data structures. A knowledgeable system
programmer can then use a debugger to examine shared memory and
semaphore state.
-T
This option is for debugging problems that cause a server process
to die abnormally. The ordinary strategy in this situation is to
notify all other server processes that they must terminate and then
reinitialize the shared memory and semaphores. This is because an
errant server process could have corrupted some shared state before
terminating. This option specifies that postgres will stop all
other server processes by sending the signal SIGSTOP, but will not
cause them to terminate. This permits system programmers to collect
core dumps from all server processes by hand.
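As an illustration only, here is a sketch of starting the server with core dumps enabled, a log file, and a higher debug level via pg_ctl, driven from Python; the binary and data-directory paths are assumptions for a stock Ubuntu/PostgreSQL 9.1 layout, and this must be run as the postgres OS user:
import subprocess

PG_CTL = '/usr/lib/postgresql/9.1/bin/pg_ctl'   # assumed install path
DATA_DIR = '/var/lib/postgresql/9.1/main'       # assumed data directory

# -c lifts the core-file limit, -l appends server output to a log file,
# and -o forwards "-d 2" to postgres to raise the debug level.
subprocess.check_call([
    PG_CTL, 'start',
    '-D', DATA_DIR,
    '-c',
    '-l', '/tmp/postgresql-debug.log',
    '-o', '-d 2',
])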

Related

Where to even begin investigating issue causing database crash: remaining connection slots are reserved for non-replication superuser connections

Occasionally our Postgres database crashes and it can only be solved by restarting the server. We have tried increasing max_connections and Django's CONN_MAX_AGE. I am also trying to learn how to set up PgBouncer. However, I am convinced the underlying issue must be something else that is fixable.
I am trying to find what that issue is. The problem is I wouldn't know where or what to begin to look at. Here are some pieces of information:
The errors are always OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser connections and OperationalError: could not write to hash-join temporary file: No space left on device. I think this is caused by opening too many database connections, but I have never managed to catch this going down live so that I could inspect pg_stat_activity and see what actual connections were active.
Looking at the error log, the same URL shows up for the most part. I've checked the nginx log and it appears on many different lines, meaning the request is being made multiple times at once rather than Django logging the same error multiple times. All of these requests get a 499 Client Closed Request response. In addition to this same URL, there are of course scattered requests from other users trying to access our site.
I should mention that the logic the server processes when the URL in question is requested is pretty simple and I see nothing suspicious that could cause a database crash. However, for some reason, the page loads slowly in production.
I know this is very vague and very little to work with, but I am not used to sysadmin work; I only studied this kind of thing in college and so far I've only worked as a developer.
Those two problems are mostly independent.
Running out of connection slots won't crash the database. It is just a sign that you either don't use a connection pool or you have a connection leak, i.e. you forget to close transactions in your code.
Running out of space will crash your database if the condition persists.
I assume that the following happens in your system:
Because someone forgot a couple of join conditions or for some other reason, some of your queries take a very long time.
They also produce a lot of (perhaps intermediate) results that are cached in temporary files that eventually fill up the disk. This out-of-space condition is cleared as soon as the query fails, but it can crash the database.
Because these queries take long and block a database session, your application keeps starting new sessions until it reaches the limit.
Solution:
Find and fix those runaway queries. As a stop-gap, you can set statement_timeout to terminate all statements that take too long (see the sketch below).
Use a connection pool with an upper limit so that you don't run out of connections.
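A minimal sketch of the stop-gaps on the Django side; the 30-second timeout and 60-second connection age are illustrative values, and the database name and user are placeholders:
# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',
        'USER': 'myuser',
        # Reuse connections for up to 60 seconds instead of opening one per request.
        'CONN_MAX_AGE': 60,
        'OPTIONS': {
            # libpq startup option: abort any statement that runs longer than 30 seconds.
            'options': '-c statement_timeout=30000',
        },
    }
}
Note that CONN_MAX_AGE only reuses connections, it does not cap them; for a hard upper limit you still want a pooler such as PgBouncer in front of the database.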

Using Python Cassandra driver for multiple connections errors out

I am using the Python Cassandra driver offered by Datastax to connect to a single node Cassandra instance. My Python code spawns multiple processes (using the multiprocessing module), each of which opens a connection to this node, and shuts it down during exit.
Here's the behavior I observe: when the number of processes spawned is small (say ~30), my code runs flawlessly. But with a higher number I see errors like these (probably not surprising):
File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 755, in connect
self.control_connection.connect()
File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 1868, in connect
self._set_new_connection(self._reconnect_internal())
File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 1903, in _reconnect_internal
raise NoHostAvailable("Unable to connect to any servers", errors)
NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1': error(99, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Cannot assign requested address")})
Apparently, the host is unable to accept new connections. This looks like something that should be handled by the driver or by Cassandra - queuing new connection requests and granting them as capacity frees up.
How do I impose this behavior?
"Cannot assign requested address" can indicate that you're running out of local ports. This is not up to the driver -- it is a system configuration issue. Here is a good article about the problem (it refers to MySQL, but the issue is the same). Note that connections in TIME_WAIT state occupy local ports, and can linger beyond individual program runs.
The article discusses multiple solutions, including expanding the local port range, listening on multiple IP addresses, and changing application connection behavior. I would look at application behavior, and recommend running fewer processes. Depending on what you're trying to overcome with multiprocessing, you'd probably be best served using (process count) <= (machine cores), which is the default behavior of multiprocessing.Pool.
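A minimal sketch of that bounded-pool approach, assuming the cassandra-driver package; the keyspace, table, and column names are placeholders:
import multiprocessing
from cassandra.cluster import Cluster

_session = None  # one Cluster/Session per worker process, created after the fork

def init_worker():
    global _session
    cluster = Cluster(['127.0.0.1'])
    _session = cluster.connect('my_keyspace')

def fetch_item(item_id):
    rows = _session.execute('SELECT name FROM items WHERE id = %s', (item_id,))
    return [row.name for row in rows]

if __name__ == '__main__':
    # Pool defaults to one worker per CPU core, which bounds the number of
    # simultaneous connections (and therefore local ports) in use.
    pool = multiprocessing.Pool(initializer=init_worker)
    results = pool.map(fetch_item, range(100))
    pool.close()
    pool.join()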

What could be causing seemingly random AWS EC2 server to Crash? (Error couldn't establish database connection)

To begin, I am running a Wordpress site on an AWS EC2 Ubuntu Micro instance. I have already confirmed that this is NOT an error with Wordpress/mysql.
Seemingly at random the site will go down and I'll get the "Error establishing database connection" message. The server says that it is running just fine, and rebooting usually fixes the issue, but I'd like to figure out the cause and resolve it so this can stop happening (for the past two weeks it has gone down almost every other day).
It's not a spike in traffic, or at least Google Analytics hasn't shown the site as having any spikes in traffic (it averages about 300 visits per day.)
What's the cause, and how can this be fixed?
Sounds like you might be running into the throttling that is a limitation of t1.micro instances. If you use too many CPU cycles, you will be throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes
The next time this happens I would check some general stats on the health of the instance. You can get a feel for the high-level health of the instance using the 'top' command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look for CPU and memory usage. You may find a process (pid) that is consuming a lot of resources and starving your app.
More likely, something within your application (how did you come to the conclusion that this is not a Wordpress/MySQL issue?) is going out of control. Possibly there is a database connection not being released? To see what your app is doing, find the process id (pid) for your app:
ps aux | grep "php"
and get a thread dump for that process (for a Java process, kill -3 will produce one). This will help you see where your application's threads are stuck (if they are).
Typically it's good practice to execute two thread dumps a few seconds apart and compare trends in both. If there is an issue in the application, you should see a lot of threads stuck at the same point.
You might also want to checkout what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html).
mysql> SHOW FULL PROCESSLIST;
Hope this helps, let us know what you find!

Django + mod_wsgi + apache2 - child process XXX still did not exit, sending a SIGTERM

I am getting intermittent errors -
child process XXX still did not exit, sending a SIGTERM... and then a SIGKILL. It occurs intermittently and the web page hangs.
I was not using a daemon process, but now I am, and the problem still exists.
I am also seeing some "Error opening file for reading: Permission Denied" errors.
Please can someone help?
I am new to this forum, so sorry if that has been answered before.
If you were not using daemon mode of mod_wsgi, that would imply that Apache must have been restarted at the time that initial message was displayed.
What is occurring is that in trying to do a restart, Apache sends a SIGTERM to its child processes. If they do not exit of their own accord, it will send SIGTERM again at 1 second intervals and finally send a SIGKILL after 3 seconds. The message is warning you of the latter, i.e. that it force-killed the process.
The issue now is what is causing the process to not shutdown promptly. There could be various reasons for this.
Using an extension module for Python which doesn't work properly in sub-interpreters, deadlocking and hanging the process and preventing it from shutting down. http://code.google.com/p/modwsgi/wiki/ApplicationIssues#Python_Simplified_GIL_State_API
Use of background threads in the Python web application which have not been properly marked as daemon threads, with the result that they block process shutdown (see the sketch after this list).
Your web application is simply trying to do too much on process shutdown somehow and not completing within the time limit.
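For the second cause, a minimal sketch of marking a background thread as a daemon so it cannot block shutdown; refresh_cache and the 60-second interval are placeholders for whatever the thread actually does:
import threading
import time

def refresh_cache():
    pass  # placeholder for the real periodic work

def background_worker():
    while True:
        refresh_cache()
        time.sleep(60)

worker = threading.Thread(target=background_worker)
worker.daemon = True   # without this, process shutdown waits for the thread to finish
worker.start()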
Even if using daemon mode you will likely see this behaviour as it implements a similar shutdown timeout, albeit that the timeout is configurable for daemon mode.
Anyway, force use of the main Python interpreter as explained in the documentation link above.
As to the permissions issue, read:
http://code.google.com/p/modwsgi/wiki/ApplicationIssues#Access_Rights_Of_Apache_User
http://code.google.com/p/modwsgi/wiki/ApplicationIssues#Application_Working_Directory
In short, ensure access permissions are correct of files/directories you need to access and ensure you are always using absolute path names when accessing the file system.

Target IIS Worker Processes on Request

Ok, strange setup, strange question. We've got a Client and an Admin web application for our SaaS app, running on asp.net-2.0/iis-6. The Admin application can change options displayed on the Client application. When those options are saved in the Admin we call a Webservice on the Client, from the Admin, to flush our cache of the options for that specific account.
Recently we started giving our Client application more than one Worker Process, which causes the cache of options to be cleared on only one of the currently running Worker Processes.
So, I obviously have other avenues of fixing this problem (however input is appreciated), but my question is: is there any way to target/iterate through each Worker Processes via a web request?
I'm making some assumptions here for this answer....
I'm assuming the client app is using one of the .NET caching classes to store your application's options?
When you say 'flush' do you mean flush them back to a configuration file or db table?
Because the cache objects and data won't be shared between processes, you need a mechanism to signal to the code running on the other worker process that it needs to re-read its options into its cache, or force the process to restart (which is not exactly convenient and most likely undesirable).
If you don't have access to the client source to modify to either watch the options config file or DB table (say using a SqlCacheDependency) I think you're kinda stuck with this behaviour.
I have full access to both admin and client. By cache, I mean .NET's Cache object; by flush, I mean removing the item from the Cache object.
I'm aware that both worker processes don't share the cache data. That's sort of my conundrum.
The system is the way it is to remove the need to hit SQL on every new session that comes in. So I'm trying to find a solution that can just tell each worker process that the cache needs to be cleared without getting SQL involved.