pgbouncer - closing because: unclean server on every connection - django

I'm running Django 1.3 with PostgreSQL 9.1/PostGIS 1.5, psycopg2 2.4.2 and pgbouncer 1.4.2.
On every single connection to the database I get a log entry in pgbouncer.log:
2011-11-20 02:15:25.027 29538 LOG S-0x96c2200: app_db/postgres#192.168.171.185:5432 closing because: unclean server (age=0).
I can't find any solution to this problem - anybody have an idea why? I've tried reconfiguring pgbouncer (session/transaction mode, different timeouts etc), but to no avail.

Ok, I think I've figured this out. The problem lies with a long-standing issue with Django and Psycopg2. Basically, Psycopg2 will automatically issue a BEGIN statement to the DB. However, if Django thinks no data-modification has occurred, it won't issue a COMMIT at the end of a transaction.
There are a few solutions to this problem; see http://www.slideshare.net/OReillyOSCON/unbreaking-your-django-application for more details. Ideally you turn off Django's managed transactions by setting autocommit = True in your DB settings (an awkward naming convention). This stops transactions being opened for read-only functions, but also for write functions, so you need to manually wrap those functions in the @commit_on_success decorator.
Alternatively, just add django.middleware.transaction.TransactionMiddleware to your MIDDLEWARE_CLASSES. This wraps every request in a transaction, which means unnecessarily wrapping read-only requests too, but it's a quick-and-dirty solution.
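A minimal settings sketch of both approaches for Django 1.3 (the engine, database name and credentials below are assumptions, not taken from the question):
# settings.py

# Option 1: database-level autocommit. Read-only views stop opening
# transactions; write views must opt back in with @transaction.commit_on_success.
DATABASES = {
    'default': {
        'ENGINE': 'django.contrib.gis.db.backends.postgis',  # assumed backend
        'NAME': 'app_db',
        'USER': 'postgres',
        'OPTIONS': {'autocommit': True},
    }
}

# Option 2: wrap every request in a transaction instead.
MIDDLEWARE_CLASSES = (
    'django.middleware.transaction.TransactionMiddleware',
    # ... your other middleware ...
)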

Related

sqlite database table is locked on tests

I am trying to migrate an application from Django 1.11.1 to Django 2.0.1.
Tests are set up to run with an in-memory SQLite database, but every test now fails with sqlite3.OperationalError: database table is locked, for every table. How can I find out why it is locked? Increasing the timeout setting does not help.
I am using LiveServerTestCase, so I suppose the tests are running in a different thread than the in-memory database, which for some reason does not get shared.
I hit this, too. LiveServerTestCase has been multi-threaded since this got merged.
It becomes a problem for me when the app under test issues multiple requests. My speculation is that the live server then spawns threads to handle those requests, those requests each write to the SQLite db, and SQLite in turn does not like multiple writing threads.
Funnily enough, runserver knows about --nothreading, but such an option seems to be missing for the test server.
The following snippet gave me a single-threaded test server:
from django.core.servers.basehttp import WSGIServer
from django.test import LiveServerTestCase
from django.test.testcases import LiveServerThread, QuietWSGIRequestHandler


class LiveServerSingleThread(LiveServerThread):
    """Runs a single-threaded server rather than a multi-threaded one.
    Reverts https://github.com/django/django/pull/7832."""

    def _create_server(self):
        """
        The keep-alive fixes introduced in Django 2.1.4 (934acf1126995f6e6ccba5947ec8f7561633c27f)
        cause problems when serving static files in a stream.
        We disable the helper handle() method that calls handle_one_request() multiple times.
        """
        QuietWSGIRequestHandler.handle = QuietWSGIRequestHandler.handle_one_request
        return WSGIServer((self.host, self.port), QuietWSGIRequestHandler, allow_reuse_address=False)


class LiveServerSingleThreadedTestCase(LiveServerTestCase):
    """A thin sub-class which only swaps in the single-threaded server thread."""
    server_thread_class = LiveServerSingleThread
Then, derive your test class from LiveServerSingleThreadedTestCase instead of LiveServerTestCase.
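For reference, a minimal usage sketch (the '/' route and test name are placeholders):
from urllib.request import urlopen

class SmokeTest(LiveServerSingleThreadedTestCase):
    def test_homepage_loads(self):
        # The request goes through the (now single-threaded) live server.
        with urlopen(self.live_server_url + '/') as response:
            self.assertEqual(response.status, 200)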
It was caused by this Django bug.
Using a file-based database during testing fixes the "table is locked" error. To make Django use a file-based database, specify its filename as the test database name:
DATABASES = {
    'default': {
        ...
        'TEST': {
            'NAME': os.path.join(BASE_DIR, 'db.sqlite3.test'),
        },
    }
}
I suppose that the timeout setting is ignored in the case of an in-memory database; see this comment for additional info.
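For completeness, this is where that timeout setting lives with the standard sqlite3 backend (the 20-second value is arbitrary); for a file-based database it is passed through to sqlite3.connect(), whereas an in-memory database has no separate writer process to wait on:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
        'OPTIONS': {
            # Seconds to wait on a locked database before raising OperationalError.
            'timeout': 20,
        },
    }
}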

Intermittent Deadlock with Django LiveServerTestCase, Selenium, and Postgres

In testing a Django/Postgres app using LiveServerTestCase and Selenium I'm seeing intermittent deadlock problems. LiveServerTestCase inherits from TransactionTestCase, so all DB tables are truncated after each test runs. But sometimes that truncation causes deadlock because one of the tables is locked by an unresolved Postgres transaction. I can see that because this query returns a row:
select * from pg_stat_activity
where datname='test' and current_query='<IDLE> in transaction';
So some activity in my application must be creating an unresolved transaction. I've combed the tests to make sure they wait for any updates to complete before exiting and am convinced that's not it.
Looking at the Postgres logs I see these two lines frequently, without a corresponding COMMIT or ROLLBACK:
SHOW default_transaction_isolation
BEGIN
I suspect these are causing the deadlock. Any idea what might be issuing this SQL or how to disable it? This is Django 1.5.
The root cause of this deadlock is Django 1.5's autocommit behavior. By default Django 1.5 runs with an open transaction, which is only closed by a COMMIT if you do an UPDATE or INSERT. "Read" operations (SELECT) cause the unmatched BEGIN statements I mentioned above. It appears that deadlock happens if a SELECT occurs just before the end-of-test TRUNCATE. To avoid deadlock the test must exit only after all requests have completed, even if the requests cause no DB writes. That can be tricky if Ajax calls are updating parts of the page asynchronously after an update.
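One common way to make a Selenium test wait for in-flight Ajax before it exits is something like the following sketch (it assumes the page uses jQuery and is not specific to the Django version):
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_ajax(driver, timeout=10):
    # Block until jQuery reports no AJAX requests in flight.
    WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script('return jQuery.active') == 0
    )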
A better solution is to use Django 1.6, where atomic() is the only (non-deprecated) transaction-creating primitive. It doesn't open transactions for read operations, and doesn't leave dangling BEGIN statements. Tests can follow the common-sense approach of not exiting while "write" requests are pending.
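A minimal sketch of the Django 1.6+ style (the function, instances and fields here are made up):
from django.db import transaction

@transaction.atomic
def transfer(source, target, amount):
    # Both saves commit together or roll back together; plain reads outside
    # an atomic block no longer open (and leave dangling) transactions.
    source.balance -= amount
    source.save()
    target.balance += amount
    target.save()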
For any future travelers:
We experienced the same issue on Django 3.2 + Postgres 12. When the build server was under high load from multiple parallel builds, the live server kept receiving AJAX calls from the Selenium container, which interrupted the post-test TRUNCATE and caused a deadlock.
Our solution was to just add a 1-second sleep at the end of each test case:
import time

from django.contrib.staticfiles.testing import StaticLiveServerTestCase

class CustomLiveTestCase(StaticLiveServerTestCase):
    def tearDown(self):
        super().tearDown()
        time.sleep(1)
This gave the live server enough time to process any lingering AJAX calls after the test finished, removing the deadlocks.

Caching in Django's object model

I'm running a system with a few workers that take jobs from a message queue, all using Django's ORM.
In one case I'm actually passing a message along from one worker to another in another queue.
It works like this:
Worker1 in queue1 creates an object (MySQL INSERT) and pushes a message to queue2
Worker2 accepts the new message in queue2 and retrieves the object (MySQL SELECT), using Django's objects.get(pk=object_id)
This works for the first message. But for the second message, worker2 always fails because it can't find the object with id object_id (Django raises DoesNotExist).
This works seamlessly in my local setup with Django 1.2.3 and MySQL 5.1.66, the problem occurs only in my test environment which runs Django 1.3.1 and MySQL 5.5.29.
If I restart worker2 every time before worker1 pushes a message, it works fine. This makes me believe there's some kind of caching going on.
Is there any caching involved in Django's objects.get() that differs between these versions? If that's the case, can I clear it in some way?
The issue is likely related to the use of MySQL transactions. On the sender's side, the transaction must be committed to the database before notifying the receiver that there is an item to read. On the receiver's side, the transaction isolation level used for the session must be set such that new data becomes visible in the session after the sender's commit.
By default, MySQL uses the REPEATABLE READ isolation level. This poses problems when more than one process is reading from and writing to the database. One possible solution is to set the isolation level in the Django settings.py file using a DATABASES option like the following:
'OPTIONS': {'init_command': 'SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED'},
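In context, that option sits alongside the rest of the connection settings (the database name and credentials below are placeholders):
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'app_db',      # placeholder
        'USER': 'app_user',    # placeholder
        'PASSWORD': 'secret',  # placeholder
        'OPTIONS': {
            'init_command': 'SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED',
        },
    }
}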
Note however that changing the transaction isolation level may have other side effects, especially when using statement based replication.
The following links provide more useful information:
How do I force Django to ignore any caches and reload data?
Django ticket#13906

Django pre-shutdown hook to close hanging pymongo connection

I'm using pymongo in a Django project, and recently I've begun to run into a problem where, upon exiting the main Django process (even through a management command), the pymongo connection hangs and the process never exits. Obviously, there's something wrong somewhere in the stack, but for now the best solution seems to be to explicitly close the connection before Django exits.
So: is there a pre-shutdown signal or hook that Django provides for this?
BTW: my connection code in case you're interested.
from django.conf import settings
from pymongo import ReplicaSetConnection, ReadPreference

conn = ReplicaSetConnection(
    hosts_or_uri=settings.MONGO['HOST'],
    replicaSet=settings.MONGO['REPLICASET'],
    safe=settings.MONGO.get('SAFE', False),
    journal=settings.MONGO.get('JOURNAL', False),
    read_preference=ReadPreference.PRIMARY
)
db = getattr(conn, settings.MONGO['DB'])
(And as a point of curiosity, is this the right way to do connection pooling in pymongo?)
While this won't fix your issue, the hang was introduced in July 2012 on this commit to pymongo: https://github.com/mongodb/mongo-python-driver/commit/1fe6029c5d78eed64fcb2a6d368d9cdf8756d2f4#commitcomment-1820334.
Specifically, it only affects ReplicaSetConnections. The answer they gave is to call connection.close(), but as you correctly pointed out in your question, there is no good hook to close the connection.
I believe that you can safely close the connection at the end of every request. Django already does this for its ORM connections to the db. This is why they recommend using a connection pool like pgbouncer, so that reconnecting to Postgres is instant. Pymongo has a connection pool built in, so reconnect at will.
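There is no dedicated pre-shutdown hook, but if closing per request is acceptable, a sketch along these lines should work (conn is the ReplicaSetConnection from the question's snippet):
from django.core.signals import request_finished
from django.dispatch import receiver

@receiver(request_finished)
def close_mongo_connection(sender, **kwargs):
    # Pymongo's built-in pool makes the subsequent reconnect cheap.
    conn.close()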

Why does RabbitMQ keep breaking from a corrupt persister log file?

I'm running Celery in a Django app with RabbitMQ as the message broker. However, RabbitMQ keeps breaking down like so. First is the error I get from Django. The trace is mostly unimportant, because I know what is causing the error, as you will see.
Traceback (most recent call last):
...
File "/usr/local/lib/python2.6/dist-packages/amqplib/client_0_8/transport.py", line 85, in __init__
raise socket.error, msg
error: [Errno 111] Connection refused
I know that this is due to a corrupt rabbit_persister.log file, because after I kill all processes tied to RabbitMQ and run "sudo rabbitmq-server start", I get the following crash:
...
starting queue recovery ...done
starting persister ...BOOT ERROR: FAILED
Reason: {{badmatch,{error,{{{badmatch,eof},
[{rabbit_persister,internal_load_snapshot,2},
{rabbit_persister,init,1},
{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]},
{child,undefined,rabbit_persister,
{rabbit_persister,start_link,[]},
transient,100,worker,
[rabbit_persister]}}}},
[{rabbit_sup,start_child,2},
{rabbit,'-run_boot_step/1-lc$^1/1-1-',1},
{rabbit,run_boot_step,1},
{rabbit,'-start/2-lc$^0/1-0-',1},
{rabbit,start,2},
{application_master,start_it_old,4}]}
Erlang has closed
My current fix: every time this happens, I rename the corresponding rabbit_persister.log file to something else (rabbit_persister.log.bak) and am then able to restart RabbitMQ successfully. But the problem keeps occurring, and I can't tell why. Any ideas?
Also, as a disclaimer, I have no experience with Erlang; I'm only using RabbitMQ because it's the broker favored by Celery.
Thanks in advance, this problem is really annoying me because I keep doing the same fix over and over.
The persister is RabbitMQ's internal message database. That "log" is presumably like a database log and deleting it will cause you to lose messages. I guess it's getting corrupted by unclean broker shutdowns, but that's a bit beside the point.
It's interesting that you're getting an error in the rabbit_persister module. The last version of RabbitMQ that has that file is 2.2.0, so I'd strongly advise you to upgrade. The best version is always the latest, which you can get by using the RabbitMQ APT repository. In particular, the persister has seen a fairly large amount of fixes in the versions after 2.2.0, so there's a big chance your problem has already been resolved.
If you still see the problem after upgrading, you should report it on the RabbitMQ Discuss mailing list. The developers (of both Celery and RabbitMQ) make a point of fixing any problems reported there.
A. Because you are running an old version of RabbitMQ, earlier than 2.7.1.
B. Because RabbitMQ doesn't have enough RAM. You need to run RabbitMQ on a server all by itself and give that server enough RAM so that the RAM is 2.5 times the largest possible size of your persisted message log.
You might be able to fix this without any software changes just by adding more RAM and killing other services on the box.
Another approach to this is to build your own RabbitMQ from source and include the toke extension that persists messages using Tokyo Cabinet. Make sure you are using a local hard drive and not NFS partitions, because Tokyo Cabinet has corruption issues with NFS. And, of course, use version 2.7.1 for this. Depending on your message content, you might also benefit from Tokyo Cabinet's compression settings to reduce the read/write activity of persisted messages.