Getting "idle in transaction" for postgresql with django - django

We are using Django 1.3.1 and Postgres 9.1
I have a view which just fires multiple selects to get data from the database.
The Django documentation says that when a request completes, a ROLLBACK is issued if only SELECT statements were fired during the view. But I am seeing a lot of "idle in transaction" entries in the log, especially when I have more than 200 requests, and I don't see any COMMIT or ROLLBACK statements in the Postgres log.
What could be the problem? How should I handle this issue?

First, I would check out the related post What does it mean when a PostgreSQL process is “idle in transaction”? which covers some related ground.
One cause of "idle in transaction" is developers or sysadmins who
have entered "BEGIN;" in psql and forgotten to "commit" or "rollback".
I've been there. :)
However, you mentioned your problem is related to having a lot of
concurrent connections. It sounds like investigating the "locks" tip
from the post above may be helpful to you.
A couple more suggestions: this problem may be secondary. The primary
problem might be that 200 connections is more than your hardware and
tuning can comfortably handle, so everything gets slow, and when things
get slow, more things are waiting for other things to finish.
If you don't have a reverse proxy like Nginx in front of your web app,
consider adding one. It can run on the same host without additional
hardware. The reverse proxy will regulate the number of connections to
the backend Django web server, and thus the number of database
connections. I've been in the too-many-database-connections situation
before, and this is how I solved it!
With Apache's prefork model, there is a 1:1 correspondence between the
number of Apache workers and the number of database connections,
assuming something like Apache::DBI is in use. Imagine someone connects
to the web server over a slow connection. The web and database server
take care of the request relatively quickly, but then the request is
held open on the web server unnecessarily long as the content is
dribbled back to the client. Meanwhile, the database connection slot is
tied up.
By adding a reverse proxy, the backend server can quickly deliver a
reply to the reverse proxy and then free the backend worker and
database slot. The reverse proxy is then responsible for getting the
content back to the client, possibly holding its own connection open
for longer. You may have 200 connections to the reverse proxy up front,
but you'll need far fewer workers and db slots on the backend.
If you graph the db slots with MRTG or similar, you'll see how many
slots you are actually using, and can tune down max_connections in
PostgreSQL, freeing those resources for other things.
You might also look at pg_top to
help monitor what your database is up to.
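If you'd rather check the numbers directly than graph them, you can query pg_stat_activity yourself. Below is a minimal sketch using psycopg2 against PostgreSQL 9.1; the connection string is a placeholder, and the current_query column is the 9.1-era name (on 9.2+ you'd filter on state = 'idle in transaction' instead).

    # Count "idle in transaction" backends per user on PostgreSQL 9.1.
    # Sketch only -- adjust the DSN for your setup.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=postgres host=127.0.0.1")
    cur = conn.cursor()
    cur.execute("""
        SELECT usename, count(*)
        FROM pg_stat_activity
        WHERE current_query = '<IDLE> in transaction'
        GROUP BY usename
    """)
    for user, idle_count in cur.fetchall():
        print("%s: %s idle in transaction" % (user, idle_count))
    cur.close()
    conn.close()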

I understand this is an older question, but this article may describe the problem of idle transactions in django.
Essentially, Django's TransactionMiddleware will not explicitly COMMIT a transaction if it is not marked dirty (which usually only happens when data is written). Yet it still BEGINs a transaction for all queries, even read-only ones, so Postgres is left waiting to see whether any more commands are coming, and you get idle transactions.
The linked article shows a small modification to the transaction middleware to always commit (basically remove the condition that checks if the transaction is_dirty). I'll be trying this fix in a production environment shortly.
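For reference, the change described boils down to dropping the is_dirty() check in process_response so that every managed transaction ends with an explicit COMMIT. Here is a rough sketch against Django 1.3's transaction API; the class name is mine, not the article's, and you would list it in MIDDLEWARE_CLASSES in place of the stock TransactionMiddleware.

    # Sketch of a TransactionMiddleware variant that always commits, modeled
    # on Django 1.3's django.middleware.transaction.TransactionMiddleware.
    from django.db import transaction


    class AlwaysCommitTransactionMiddleware(object):

        def process_request(self, request):
            transaction.enter_transaction_management()
            transaction.managed(True)

        def process_exception(self, request, exception):
            if transaction.is_managed():
                transaction.rollback()
                transaction.leave_transaction_management()

        def process_response(self, request, response):
            if transaction.is_managed():
                # The stock middleware only commits when is_dirty() is true.
                # Committing unconditionally also closes read-only
                # transactions, which is what clears the "idle in
                # transaction" backends.
                transaction.commit()
                transaction.leave_transaction_management()
            return response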

Related

Where to even begin investigating issue causing database crash: remaining connection slots are reserved for non-replication superuser connections

Occasionally our Postgres database crashes and it can only be solved by restarting the server. We have tried incrementing the max connections and Django's CONN_MAX_AGE. Also, I am trying to learn how to set up PgBouncer. However, I am convinced the underlying issue must be something else which is fixable.
I am trying to find what that issue is. The problem is I wouldn't know where or what to begin to look at. Here are some pieces of information:
The errors are always OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser connections and OperationalError: could not write to hash-join temporary file: No space left on device. I think this is caused by opening too many database connections, but I have never managed to catch this going down live so that I could inspect pg_stat_activity and see what actual connections were active.
Looking at the error log, the same URL shows up for the most part. I've checked the nginx log and it's listed in many different lines, meaning the request is being made multiple times at once rather than Django logging the same error multiple times. All these requests get a 499 Client Closed Request response. In addition to this same URL, there are of course sprinkled requests of other users trying to access our site.
I should mention that the logic the server processes when the URL in question is requested is pretty simple and I see nothing suspicious that could cause a database crash. However, for some reason, the page loads slowly in production.
I know this is very vague and very little to work with, but I am not used to sysadmin work; I only studied this kind of thing in college, and so far I've only worked as a developer.
Those two problems are mostly independent.
Running out of connection slots won't crash the database. It is just a sign that you either don't use a connection pool or you have a connection leak, i.e. you forget to close transactions in your code.
Running out of space will crash your database if the condition persists.
I assume that the following happens in your system:
Because someone forgot a couple of join conditions or for some other reason, some of your queries take a very long time.
They also produce a lot of (perhaps intermediate) results that are cached in temporary files, which eventually fill up the disk. This out-of-space condition is cleared as soon as the query fails, but it can crash the database.
Because these queries take long and block a database session, your application keeps starting new sessions until it reaches the limit.
Solution:
Find and fix those runaway queries. As a stop-gap, you can set statement_timeout to terminate all statements that take too long (see the sketch after these suggestions).
Use a connection pool with an upper limit so that you don't run out of connections.
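To make both points concrete, here is a hedged sketch of what they could look like in Django's database settings. Names and values are placeholders; CONN_MAX_AGE needs Django 1.6+, and statement_timeout can equally be set server-side with ALTER ROLE or ALTER DATABASE.

    # settings.py sketch -- database name, user and the 30s timeout are examples.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql_psycopg2",
            "NAME": "mydb",
            "USER": "myuser",
            "HOST": "127.0.0.1",
            "PORT": "5432",
            # Reuse connections for up to 60 seconds instead of opening a new
            # one per request (Django 1.6+); a pooler like PgBouncer adds a
            # hard upper limit on server connections.
            "CONN_MAX_AGE": 60,
            "OPTIONS": {
                # Abort any statement running longer than 30 seconds so a
                # runaway query cannot hold a slot (or fill the disk) forever.
                "options": "-c statement_timeout=30000",
            },
        }
    }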

What happens to a Django website when you restart Gunicorn?

In the near future, I'll be deploying a Django/Gunicorn/Nginx website to paying customers. This is my first public website. There will be times when I'll need to bring the site down temporarily for maintenance. I've learned how to configure Nginx to serve up a "503 Site Temporarily Unavailable" page while I'm doing my maintenance and I'm set to inform my customers in advance via an email.
However, if I have a critical problem and need to change a setting or view or something and I need to restart the site without any advance warning by restarting Gunicorn, what will happen from the customer's point of view? Will the site just slow down for a couple of seconds? Will their sessions remain intact or will all sessions be lost? Clearly there will be database problems if any writes are occurring. What things are going to happen and are there any precautions I can take to limit the damage?
As long as you don't restart gunicorn by physically cutting the power to the server, the OS will give it the chance to shut down normally. That means that all current views will complete and all open transactions will finish.
Assuming you are using a reverse proxy like nginx in front, and you start gunicorn again immediately, users will be unlikely to notice: the request might take a fraction of a second longer, that's all.
Sessions are stored in files or the database so will be unaffected.

Very slow: ActiveRecord::QueryCache#call

I have an app on heroku, running on Puma:
workers 2
threads_count 3
pool 5
It looks like some requests get stuck in the middleware, and it makes the app very slow (VERY!).
I have seen other people's threads about this problem but no solution so far.
Please let me know if you have any hints.
I work for Heroku support, and Middleware/Rack/ActiveRecord::QueryCache#call is commonly reported as a problem by New Relic. Unfortunately, it's usually a red herring, as each time the source of the problem has turned out to lie elsewhere.
QueryCache is where Rails first tries to check out a connection for use, so any problems with a connection will show up here as a request getting 'stuck' waiting. This doesn't mean the database server is out of connections necessarily (if you have Librato charts for Postgres they will show this). It likely means something is causing certain database connections to enter a bad state, and new requests for a connection are waiting. This can occur in older versions of Puma where multiple threads are used and the reaping_frequency is set - if some connections get into a bad state and the others are reaped this will cause problems.
Some high-level suggestions are as follows:
Upgrade Ruby & Puma
If using the rack-timeout gem, upgrade that too
These upgrades often help. If not, there are other options to look into such as switching from threads to worker based processes or using a Postgres connection pool such as PgBouncer. We have more suggestions on configuring concurrent web servers for use with Postgres here: https://devcenter.heroku.com/articles/concurrency-and-database-connections
I will answer my own question:
I simply had to check all the queries to my DB. One of them was taking a VERY long time, and even though it was not requested often, it would slow down the whole server for quite some time afterwards (even after the process was done, there was a sort of "traffic jam" on the server).
Solution:
Check all the queries to your database and fix the slowest ones (that might simply mean breaking a query down into a few steps, or making it run at night when there is no traffic, etc.).
Once these queries are fixed, everything should go back to normal.
I recently started seeing a spike in time spent in ActiveRecord::QueryCache#call. After looking at the source, I decided to try clearing said cache using ActiveRecord::Base.connection.clear_query_cache from a Rails console attached to the production environment. The error I got back was PG::ConnectionBad: could not fork new process for connection: Cannot allocate memory, which led me to this other SO question: Heroku Rails could not fork new process for connection: Cannot allocate memory

Handling long requests

I'm working on a long request to a Django app (nginx reverse proxy, MySQL db, Celery-RabbitMQ-Redis stack) and have some doubts about the solution I should apply:
Functioning: one feature of the app allows users to migrate thousands of objects from one system to another. Each migration is logged in a db, and users can download the history of the migration in CSV format: which objects have been migrated and with which status (success, errors, ...).
To get the history, a GET request is sent to a Django view, which, after serialization and rendering to CSV, returns the download response.
Problem: the serialization and rendering, for a large set of objects (e.g. 160,000), take quite long and the request times out.
Some solutions I was thinking about/found thanks to previous searches are:
Increasing the amount of time before timeout: easy, but I have read everywhere that this is a global nginx setting and would affect every request on the server.
Using an asynchronous task handled by celery : the concept would be to make an initial request to the server, which would launch the serializing and rendering task with celery, and give a special httpresponse to the client. Then the client would regularly ask the server if the job is done, and the server would deliver the history at the end of processing. I like this one but I'm not sure about how to technically implement that.
Creating and temporarily storing the csv file on the server, and give the user a way to access it & to download it. I'm not a big fan of that one.
So my question is: has anyone already faced a similar problem? Do you have advice on the technical implementation of solution #2, or a better solution to propose?
Thanks!
Clearly you should use Celery + RabbitMQ/Redis. If you look at the docs, it's not that hard to set up.
The first question is whether to use RabbitMQ or Redis. There are many SO questions about this with good information about pros/cons.
The implementation in Django is really simple. You can just wrap Django functions as Celery tasks (with the @task decorator) and they become async, so this is the easy part.
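To make option #2 from the question concrete, here is a rough sketch of the fire-and-poll pattern, assuming Celery 3.1+ wired into Django with a result backend configured. The task, view, and key names are made up for illustration; the CSV body is a placeholder for your real serialization.

    # tasks/views sketch -- build_migration_csv, start_export and
    # export_status are hypothetical names.
    import json

    from celery import shared_task
    from celery.result import AsyncResult
    from django.http import HttpResponse


    @shared_task
    def build_migration_csv(migration_id):
        """Heavy serialization/rendering runs in the Celery worker."""
        rows = [["object_id", "status"],
                [str(migration_id), "success"]]   # placeholder rows
        return "\n".join(",".join(row) for row in rows)


    def start_export(request, migration_id):
        """Kick off the export and return a task id for the client to poll."""
        task = build_migration_csv.delay(migration_id)
        return HttpResponse(json.dumps({"task_id": task.id}),
                            content_type="application/json")


    def export_status(request, task_id):
        """Polled by the client; returns the CSV once the task has finished."""
        result = AsyncResult(task_id)
        if result.ready():
            return HttpResponse(result.get(), content_type="text/csv")
        return HttpResponse(json.dumps({"state": result.state}),
                            content_type="application/json")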
The problem I see in your project is that the server handling HTTP traffic is the same server running the long process. That can affect performance and user experience even if Celery is running in the background. Of course, that depends on how much traffic you are expecting on that machine and how many migrations can run at the same time.
One of the things you setup on Celery is the number of workers (concurrent processing units) available. So the number of cores in your machine will matter.
If you need to handle HTTP calls quickly, I would suggest delegating the migration process to another machine. Celery/Redis can be configured that way. Let's say you've got 2 servers. One would handle only normal Django calls (no Celery) and trigger Celery tasks on the other server (the one that actually runs the migration process). Both servers can connect to the same database.
But this is just an infrastructure optimization and you may not need it.
I hope this answers your question. If you have specific Celery issues it would be better to create another question.

Multiple permanent connection for Django?

My website is using Django, and now I want to port part of the logic to Redis, so I need a Redis connection in my views.py code. Obviously I can't write the connect-to-Redis code in views.py because it might be called multiple times, so I need to put the connection somewhere else in Django, perhaps middleware?
But I don't want to make this complicated: in the same place where the MySQL database is connected, I want to add a global object for the Redis connection, and perhaps later for XMPP and ZeroMQ connections as well.
How can I do this?
Any idea is appreciated. Thanks in advance :)
In typical Django server settings, multiple requests will be handled by the same worker process.
You can simply put a global variable holding the connection at the top of views.py and use that connection in each view function/class; the connection will be established when the worker process starts and closed when the worker process is recycled. It's a semi-permanent connection, but good enough.
The MySQL connection works the same way in Django: it's not one db connection per request, but one per worker-process lifespan.
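A minimal sketch of that module-level approach using redis-py (host, port, and the key name are placeholders):

    # views.py sketch -- the "hits" key and connection details are examples.
    import redis
    from django.http import HttpResponse

    # Created once per worker process at import time. redis-py clients keep
    # an internal connection pool, so sharing this object between views is
    # fine.
    REDIS = redis.StrictRedis(host="127.0.0.1", port=6379, db=0)


    def hit_counter(request):
        hits = REDIS.incr("hits")
        return HttpResponse("hit number %s" % hits)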
It isn't obvious you want to do that. It isn't obvious why you would want to do that.
So why not connect in views.py? Using a single "global connection" would mean adding locking/serialization code to ensure that your connection is safe to use amongst many calls to your views. I actually create and connect right in the method in my various and sundry views.py files. Sometimes I connect to one instance or another. I've seen no performance issues and also don't have to worry about concurrency safety; I let Redis figure that out.
Another aspect of a global shared connection is degraded performance: you'll have one page view waiting on another to finish before it can run. By allowing each to have its own connection, you avoid one view slowing down another while waiting for access to a global connector.
Consider this: if your queries are so small and fast that you don't expect to see a performance hit from serializing every page that accesses Redis, then you won't see any performance degradation from a connection per page as you connect, query, and close. I highly doubt that the cost of setting up the connection is significantly more than serializing all page accesses that connect to Redis.
So my suggestion is to just try it. If and only if you see an issue should you worry about implementing something you will probably not need.
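For contrast, a minimal sketch of the connect-per-view style described above, again with redis-py and placeholder connection details:

    # views.py sketch -- "some_key" is a hypothetical key.
    import redis
    from django.http import HttpResponse


    def show_value(request):
        r = redis.StrictRedis(host="127.0.0.1", port=6379, db=0)
        value = r.get("some_key")
        return HttpResponse(value if value is not None else "not set")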
There is a great piece of code for this already. http://github.com/andymccurdy/redis-py