Intermittent Deadlock with Django LiveServerTestCase, Selenium, and Postgres - django

In testing a Django/Postgres app using LiveServerTestCase and Selenium I'm seeing intermittent deadlock problems. LiveServerTestCase inherits from TransactionTestCase, so all DB tables are truncated after each test runs. But sometimes that truncation causes deadlock because one of the tables is locked by an unresolved Postgres transaction. I can see that because this query returns a row:
select * from pg_stat_activity
where datname='test' and current_query='<IDLE> in transaction';
So some activity in my application must be creating an unresolved transaction. I've combed the tests to make sure they wait for any updates to complete before exiting and am convinced that's not it.
Looking at the Postgres logs I see these two lines frequently, without a corresponding COMMIT or ROLLBACK:
SHOW default_transaction_isolation
BEGIN
I suspect these are causing the deadlock. Any idea what might be issuing this SQL or how to disable it? This is Django 1.5.

The root cause of this deadlock is Django 1.5's autocommit behavior. By default Django 1.5 runs with an open transaction, which is only closed by a COMMIT if you do an UPDATE or INSERT. "Read" operations (SELECT) cause the unmatched BEGIN statements I mentioned above. It appears that deadlock happens if a SELECT occurs just before the end-of-test TRUNCATE. To avoid deadlock the test must exit only after all requests have completed, even if the requests cause no DB writes. That can be tricky if Ajax calls are updating parts of the page asynchronously after an update.
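One way to hold a test open until those asynchronous calls have finished, assuming the pages use jQuery for their Ajax requests, is to poll jQuery.active from Selenium at the end of each test. A minimal sketch:
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_ajax(driver, timeout=10):
    # jQuery.active is the number of Ajax requests jQuery still has in flight
    WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script("return jQuery.active == 0")
    )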
A better solution is to use Django 1.6, where atomic() is the only (non-deprecated) transaction-creating primitive. It doesn't open transactions for read operations, and doesn't leave dangling BEGIN statements. Tests can follow the common-sense approach of not exiting while "write" requests are pending.
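Under that model, reads open no transaction at all and writes are wrapped explicitly. A minimal sketch, using a hypothetical Poll model:
from django.db import transaction
from myapp.models import Poll  # hypothetical model

def record_vote(poll_id):
    poll = Poll.objects.get(pk=poll_id)  # plain SELECT: no BEGIN is issued in autocommit mode
    with transaction.atomic():           # BEGIN/COMMIT only around the write
        poll.votes += 1
        poll.save()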

For any future travelers:
We experienced the same issue on Django 3.2 + Postgres 12. When the build server was under high load with multiple parallel builds, the live server kept receiving AJAX calls from the Selenium container, which interrupted the post-test TRUNCATE and caused a deadlock.
Our solution was to just add a 1-second sleep at the end of each test case:
import time
from django.contrib.staticfiles.testing import StaticLiveServerTestCase

class CustomLiveTestCase(StaticLiveServerTestCase):
    def tearDown(self):
        time.sleep(1)  # let in-flight AJAX requests finish before the tables are truncated
        super().tearDown()
This gave the live server enough time to process any lingering AJAX calls after the test finished, removing the deadlocks.

Related

Race condition for Microservice architecture [CosmosDB]

We have a microservice-based architecture where the frontend and backend are completely isolated. The backend microserviceA exposes a REST endpoint which calls a thirdParty service and updates a record in CosmosDB. This microservice is deployed on a Kubernetes cluster and can therefore run with multiple replicas for load balancing. As mentioned before, the frontend is isolated and consumes the exposed endpoint.
Problem :
The frontend is written so that if a response is not received within a certain time frame, or a network failure occurs, it retries the endpoint. In some rare scenarios (it doesn't matter which) the UI makes multiple calls (usually 2) one after another, milliseconds apart. This is where the race condition appears in the backend logic.
If the first call reaches ThirdParty first and obtains a success response, the second call will get a failure (because the first one already succeeded). We cannot change the behaviour of ThirdParty.
Taking the above scenario as the base: if the second call (the failing one) updates the DB first and its response reaches the UI, the UI treats the result as a failure (even though the first call already succeeded) and takes the failure actions.
If the success call makes it to the UI first, everything works fine.
Possible solutions I can think of:
1)
Put a cache as source of truth.
apiCall : Status
If (entry not present in cache) {
    Put Entry in cache With Status NULL or Something with specific TTL
    (acquire lock on specific entry) {
        If (status is success) return successResponse.
        MAKE ThirdParty Call
        Update DB
        Update cache
        Release LOCK
    }
} else {
    (acquire lock on specific entry) {
        MAKE ThirdParty Call
        Update DB
        Update cache
        Release LOCK
    }
}
It seems the else block will never be executed, though.
2) Only in case of failure: instead of updating the DB, call Thread.sleep(10000) a couple of times in the hope that another thread will update the DB with a success response.
If there is still no success, return a failure response and update the DB.
3) Put a poller on the UI side. If the result is a failure, poll a couple more times in the hope that the status changes. If not, take the failure actions.
4) Optimistic locking for the Cosmos record.
https://cosmosdb.github.io/labs/dotnet/labs/10-concurrency-control.html
I'm not sure how this can help.
Let's say both API calls read the record when the version was 0.
The second API call updates the DB record first; since the version has not changed, the update succeeds, and the DB now holds Failure as the value.
The first API call then tries to update it and finds a version mismatch, so the update does not go through; another attempt would then be made to update the DB, since that call was a success. (In case of failure, no retry to update the DB would be made.)
But the second API call's response still reaches the UI first, so the UI again takes the failure action; the UI would require a poller in such cases.
And if the UI requires a poller anyway, why do we need the optimistic locking in the first place? :)
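For reference, the ETag-based optimistic concurrency that the linked lab describes looks roughly like this with the azure-cosmos 4.x Python SDK (a sketch; the container, ids and status values are placeholders):
from azure.core import MatchConditions
from azure.cosmos import exceptions

def update_status(container, item_id, partition_key, new_status):
    item = container.read_item(item=item_id, partition_key=partition_key)
    item["status"] = new_status
    try:
        # the replace succeeds only if the stored _etag still matches,
        # i.e. nobody else has updated the record since we read it
        container.replace_item(
            item=item["id"],
            body=item,
            etag=item["_etag"],
            match_condition=MatchConditions.IfNotModified,
        )
        return True
    except exceptions.CosmosHttpResponseError as err:
        if err.status_code == 412:  # precondition failed: someone else won the race
            return False
        raise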
I don't know CosmosDB functionality well. If Cosmos provides some functionality to handle this, please be kind enough to share.
What would be the best way to handle this kind of scenario?
It seems that in your application design you have made it necessary to wait for each execution to finish before you fire the next one. I am not debating whether this is good or bad (that's a different discussion), but it seems the only option you have in this case is to fire all your DB updates in a synchronous manner.
Optimistic locking is very good for ensuring that the document you are updating has not been updated while your code did other things, but it will not help your UI issue here.
I think you need to abstract the UI in order to make this work properly; otherwise you are stuck running things in synchronous mode.

Celery task fails midway while pulling data from a large database

I'm running a periodic Celery task in a django-rest application that pulls data from a large Postgres database with multiple tables. The task starts well and pulls data for about 50 minutes, then fails with this error:
client_idle_timeout
server closed the connection unexpectedly, This probably means the server terminated abnormally before or while processing the request.
What could be causing this, and how can I go about fixing it?
It most likely means that your PostgreSQL has a limit on how long a transaction can take (idle in transaction) or how long a session can last (session timeout).
This usually happens because of a typical, incorrect way of dealing with databases (I've seen this done even by senior developers): the process opens a database session and then runs business logic that may take a long time to finish, while the DB data has been only partially updated or inserted. Code written this way is doomed to fail because of the timeouts enforced by PostgreSQL.
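A sketch of a pattern that avoids this, assuming the Django ORM, hypothetical SourceRow/Result models, and a hypothetical expensive_processing() step: fetch the source rows in bounded batches and keep each write inside its own short transaction, so no connection sits idle in transaction while the slow business logic runs.
from celery import shared_task
from django.db import transaction

from myapp.models import SourceRow, Result  # hypothetical models


def expensive_processing(row):
    # placeholder for the long-running business logic
    return {"value": row.pk}


@shared_task
def pull_data():
    last_pk = 0
    while True:
        # fetch a bounded batch; each query is a short, self-contained statement
        batch = list(SourceRow.objects.filter(pk__gt=last_pk).order_by("pk")[:1000])
        if not batch:
            break
        for row in batch:
            payload = expensive_processing(row)  # slow work, outside any transaction
            with transaction.atomic():           # short transaction just for the write
                Result.objects.update_or_create(source_id=row.pk, defaults={"data": payload})
        last_pk = batch[-1].pk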

Celery/Django transaction

The Celery user guide suggests that the Django transaction be manually committed before calling the task:
http://celery.readthedocs.org/en/latest/userguide/tasks.html#database-transactions
I want the system to be as reliable as possible. What is the best practice for recovering from a crash between the transaction commit and calling the task (i.e. making sure the task is always called once the transaction is committed)?
BTW, right now I'm using a database-based job queue I implemented, so there is no such problem -- I can enqueue jobs within the transaction. I'm not really convinced I should switch to Celery.
From Django 1.9 onwards this has been added:
transaction.on_commit(lambda: add_task_to_the_queue())
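A sketch of how that fits together, assuming a Celery task named add_task_to_the_queue and a hypothetical Order model:
from django.db import transaction

from myapp.models import Order                  # hypothetical model
from myapp.tasks import add_task_to_the_queue   # hypothetical Celery task


def create_order(data):
    with transaction.atomic():
        order = Order.objects.create(**data)
        # the callback runs only after the surrounding transaction commits;
        # if the transaction rolls back, the task is never queued
        transaction.on_commit(lambda: add_task_to_the_queue.delay(order.pk))
    return order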

Caching in Django's object model

I'm running a system with a few workers that take jobs from a message queue, all using Django's ORM.
In one case I'm actually passing a message along from one worker to another via a second queue.
It works like this:
Worker1 in queue1 creates an object (MySQL INSERT) and pushes a message to queue2
Worker2 accepts the new message in queue2 and retrieves the object (MySQL SELECT), using Django's objects.get(pk=object_id)
This works for the first message, but for the second message worker2 always fails because it can't find the object with id object_id (Django raises DoesNotExist).
This works seamlessly in my local setup with Django 1.2.3 and MySQL 5.1.66; the problem occurs only in my test environment, which runs Django 1.3.1 and MySQL 5.5.29.
If I restart worker2 every time before worker1 pushes a message, it works fine. This makes me believe there's some kind of caching going on.
Is there any caching involved in Django's objects.get() that differs between these versions? If that's the case, can I clear it in some way?
The issue is likely related to the use of MySQL transactions. On the sender's side, the transaction must be committed to the database before notifying the receiver of an item to read. On the receiver's side, the transaction isolation level used for the session must be set so that new data becomes visible in the session after the sender's commit.
By default, MySQL uses the REPEATABLE READ isolation level. This poses problems when more than one process is reading from and writing to the database. One possible solution is to set the isolation level in the Django settings.py file using a DATABASES option like the following:
'OPTIONS': {'init_command': 'SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED'},
Note however that changing the transaction isolation level may have other side effects, especially when using statement based replication.
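For context, this is roughly where that option lives in settings.py (the database name and credentials are placeholders):
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mydb',          # placeholder
        'USER': 'myuser',        # placeholder
        'PASSWORD': 'secret',    # placeholder
        'HOST': 'localhost',
        'PORT': '3306',
        # make commits from other sessions visible to a worker's long-lived connection
        'OPTIONS': {'init_command': 'SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED'},
    },
}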
The following links provide more useful information:
How do I force Django to ignore any caches and reload data?
Django ticket#13906

pgbouncer - closing because: unclean server on every connection

I'm running Django 1.3 with PostgreSQL 9.1/PostGIS 1.5, psycopg2 2.4.2 and pgbouncer 1.4.2.
On every single connection to the database I get a log entry in pgbouncer.log:
2011-11-20 02:15:25.027 29538 LOG S-0x96c2200: app_db/postgres#192.168.171.185:5432 closing because: unclean server (age=0).
I can't find any solution to this problem - anybody have an idea why? I've tried reconfiguring pgbouncer (session/transaction mode, different timeouts etc), but to no avail.
Ok, I think I've figured this out. The problem lies with a long-standing issue with Django and Psycopg2. Basically, Psycopg2 will automatically issue a BEGIN statement to the DB. However, if Django thinks no data-modification has occurred, it won't issue a COMMIT at the end of a transaction.
There are a few solutions to this problem; see http://www.slideshare.net/OReillyOSCON/unbreaking-your-django-application for more details. Ideally you disable the automatic transactions by setting autocommit = True in your DB settings (an awkwardly named option). This prevents transactions being opened for read-only functions, but also for write functions, so you need to manually wrap those in the @commit_on_success decorator.
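A sketch of what that looks like on a Django 1.3-era project (the view is a placeholder; check the backend docs for your exact versions):
# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'app_db',
        # ... user, password, host ...
        'OPTIONS': {'autocommit': True},  # don't open a transaction for read-only queries
    },
}

# views.py
from django.db import transaction
from django.http import HttpResponse

@transaction.commit_on_success           # writes still get wrapped in a real transaction
def update_profile(request):
    request.user.get_profile().save()    # placeholder write
    return HttpResponse("ok")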
Alternatively, just add django.middleware.transaction.TransactionMiddleware to your MIDDLEWARE_CLASSES. This wraps every request in a transaction, which means read-only requests also get wrapped unnecessarily, but it's a quick-and-dirty solution.