Very slow: ActiveRecord::QueryCache#call - ruby-on-rails-4

I have an app on heroku, running on Puma:
workers 2
threads_count 3
pool 5
It looks like some requests get stuck in the middleware, and it makes the app very slow (VERY!).
I have seen other people threads about this problem but no solution so far.
Please let me know if you have any hint.

I work for Heroku support, and Middleware/Rack/ActiveRecord::QueryCache#call is commonly reported as a problem by New Relic. Unfortunately, it's usually a red herring, as each time the source of the problem lies elsewhere.
QueryCache is where Rails first tries to check out a connection for use, so any problems with a connection will show up here as a request getting 'stuck' waiting. This doesn't mean the database server is out of connections necessarily (if you have Librato charts for Postgres they will show this). It likely means something is causing certain database connections to enter a bad state, and new requests for a connection are waiting. This can occur in older versions of Puma where multiple threads are used and the reaping_frequency is set - if some connections get into a bad state and the others are reaped this will cause problems.
Some high-level suggestions are as follows:
Upgrade Ruby & Puma
If using the rack-timeout gem, upgrade that too
These upgrades often help. If not, there are other options to look into such as switching from threads to worker based processes or using a Postgres connection pool such as PgBouncer. We have more suggestions on configuring concurrent web servers for use with Postgres here: https://devcenter.heroku.com/articles/concurrency-and-database-connections
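As a rough illustration of the sizing arithmetic behind these suggestions, the sketch below estimates how many database connections the web tier can hold at once for the Puma settings in the question. It is written in Python purely for illustration; the function names are hypothetical, and it assumes that each Puma worker process has its own ActiveRecord pool and that each thread holds at most one connection at a time.

```python
def max_db_connections(workers, threads_per_worker, pool_size):
    """Upper bound on connections held by the web tier at any moment."""
    # Each worker process owns a separate pool; a thread uses at most one
    # connection, so a worker never holds more than min(threads, pool).
    return workers * min(threads_per_worker, pool_size)


def threads_can_starve(threads_per_worker, pool_size):
    """True when a worker has more threads than pooled connections,
    so some threads may have to wait for a checkout."""
    return threads_per_worker > pool_size


# Settings from the question: workers 2, threads_count 3, pool 5.
needed = max_db_connections(workers=2, threads_per_worker=3, pool_size=5)
starving = threads_can_starve(threads_per_worker=3, pool_size=5)
```

With these numbers the pool is larger than the thread count, which fits the point above that waiting in QueryCache#call does not necessarily mean the pool itself is too small.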

I will answer my own question:
I simply had to check all queries to my DB. One of them was taking a VERY long time, and even though it was not requested often, it would slow down the whole server for quite some time afterwards (even after the process was done, there was a sort of "traffic jam" on the server).
Solution:
Check all the queries to your database and fix the slowest ones (this might simply mean breaking a query down into a few steps, or running it at night when there is no traffic, etc.).
Once these queries are fixed, everything should go back to normal.
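One way to find the slowest queries is to time each one and record those that exceed a threshold. The sketch below shows the idea in Python, with sqlite3 as a stand-in database; `timed_query`, `slow_queries`, and the threshold value are hypothetical names chosen for illustration. In a Rails app on Heroku the same information would more likely come from the database's own slow-query logging or from an APM tool.

```python
import sqlite3
import time
from contextlib import contextmanager

SLOW_QUERY_SECONDS = 0.5  # hypothetical threshold, tune for your app

slow_queries = []  # (label, elapsed_seconds) for queries over the threshold


@contextmanager
def timed_query(label, threshold=SLOW_QUERY_SECONDS):
    """Record the wrapped block in slow_queries if it runs too long."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if elapsed >= threshold:
            slow_queries.append((label, elapsed))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")

# A threshold of 0.0 records every query, which is handy for a first survey.
with timed_query("list items", threshold=0.0):
    rows = conn.execute("SELECT * FROM items").fetchall()
```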

I recently started seeing a spike in time spent in ActiveRecord::QueryCache#call. After looking at the source, I decided to try clearing said cache using ActiveRecord::Base.connection.clear_query_cache from a Rails console attached to the production environment. The error I got back was PG::ConnectionBad: could not fork new process for connection: Cannot allocate memory, which at least led me to this other SO question: Heroku Rails could not fork new process for connection: Cannot allocate memory

Related

Where to even begin investigating issue causing database crash: remaining connection slots are reserved for non-replication superuser connections

Occasionally our Postgres database crashes and it can only be solved by restarting the server. We have tried incrementing the max connections and Django's CONN_MAX_AGE. Also, I am trying to learn how to set up PgBouncer. However, I am convinced the underlying issue must be something else which is fixable.
I am trying to find what that issue is. The problem is I wouldn't know where or what to begin to look at. Here are some pieces of information:
The errors are always OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser connections and OperationalError: could not write to hash-join temporary file: No space left on device. I think this is caused by opening too many database connections, but I have never managed to catch this going down live so that I could inspect pg_stat_activity and see what actual connections were active.
Looking at the error log, the same URL shows up for the most part. I've checked the nginx log and it's listed on many different lines, meaning the request is being made multiple times at once rather than Django logging the same error multiple times. All these requests get a 499 Client Closed Request response. In addition to this URL, there are of course scattered requests from other users trying to access our site.
I should mention that the logic the server processes when the URL in question is requested is pretty simple and I see nothing suspicious that could cause a database crash. However, for some reason, the page loads slowly in production.
I know this is very vague and very little to work with, but I am not used to sysadmin work. I only studied this kind of thing in college, and so far I've only worked as a developer.
Those two problems are mostly independent.
Running out of connection slots won't crash the database. It just is a sign that you either don't use a connection pool or you have a connection leak, i.e. you forget to close transactions in your code.
Running out of space will crash your database if the condition persists.
I assume that the following happens in your system:
Because someone forgot a couple of join conditions or for some other reason, some of your queries take a very long time.
They also produce a lot of (perhaps intermediate) results that are cached in temporary files that eventually fill up the disk. This out of space condition is cleared as soon as the query fails, but it can crash the database.
Because these queries take long and block a database session, your application keeps starting new sessions until it reaches the limit.
Solution:
Find and fix these runaway queries. As a stop-gap, you can set statement_timeout to terminate all statements that take too long.
Use a connection pool with an upper limit so that you don't run out of connections.
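To illustrate what "a connection pool with an upper limit" means, here is a minimal sketch in Python. It uses sqlite3 connections as a stand-in for PostgreSQL, and `BoundedPool` and its parameters are hypothetical names; a real deployment would more likely rely on PgBouncer or the pooling built into the database driver or framework.

```python
import queue
import sqlite3
from contextlib import contextmanager


class BoundedPool:
    """Hands out at most max_connections connections at a time."""

    def __init__(self, max_connections, checkout_timeout=1.0):
        self._timeout = checkout_timeout
        self._idle = queue.Queue(maxsize=max_connections)
        for _ in range(max_connections):
            self._idle.put(sqlite3.connect(":memory:", check_same_thread=False))

    @contextmanager
    def connection(self):
        try:
            # Wait for a free connection instead of opening a new session.
            conn = self._idle.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("no free connection in the pool") from None
        try:
            yield conn
        finally:
            conn.rollback()       # never return a connection mid-transaction
            self._idle.put(conn)  # always give the slot back


pool = BoundedPool(max_connections=2, checkout_timeout=0.1)
with pool.connection() as a, pool.connection() as b:
    try:
        with pool.connection():
            exhausted = False
    except TimeoutError:
        exhausted = True  # the third checkout fails fast at the limit
```

The point of the design is that a burst of slow queries makes callers wait or fail at the pool, rather than opening new sessions until the server runs out of connection slots.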

Handling long requests

I'm working on a long-running request in a Django app (nginx reverse proxy, MySQL db, Celery-RabbitMQ-Redis setup) and have some doubts about which solution I should apply:
How it works: One feature of the app allows users to migrate thousands of objects from one system to another. Each migration is logged to a db, and users can download the migration history in CSV format: which objects were migrated, and with which status (success, errors, ...).
To get the history, a GET request is sent to a Django view, which returns the download response after serialization and rendering into CSV.
Problem: the serialization and rendering processes are quite slow for a large set of objects (e.g. 160 000), and the request times out.
Some solutions I was thinking about, or found through previous searches, are:
Increasing the timeout: easy, but I saw everywhere that this is a global nginx setting and would affect every request on the server.
Using an asynchronous task handled by Celery: the idea would be to make an initial request to the server, which would launch the serializing and rendering task with Celery and give a special HttpResponse to the client. Then the client would regularly ask the server whether the job is done, and the server would deliver the history at the end of processing. I like this one, but I'm not sure how to implement it technically.
Creating and temporarily storing the CSV file on the server, and giving the user a way to access and download it. I'm not a big fan of that one.
So my question is: has anyone already faced a similar problem? Do you have advice on the technical implementation of solution #2, or a better solution to propose?
Thanks!
Clearly you should use Celery + RabbitMQ/Redis. If you look at the docs, it's not that hard to set up.
The first question is whether to use RabbitMQ or Redis. There are many SO questions about this with good information about pros/cons.
The implementation in Django is really simple. You can just wrap Django functions as Celery tasks (with the @task decorator) so they can be called asynchronously, so this is the easy part.
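The start-then-poll flow from solution #2 can be sketched as follows. This is a stand-in, not Celery code: it uses Python's concurrent.futures so that it runs without a broker, and `start_export`, `poll_export`, `jobs`, and `render_history_csv` are hypothetical names. With Celery, the equivalent steps would be calling the task with `.delay()` and checking its state through an `AsyncResult`; the state strings below are borrowed from Celery's task states.

```python
import csv
import io
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
jobs = {}  # job_id -> Future; a real app would keep this in a result backend


def render_history_csv(rows):
    """The slow part: serialize migration rows into CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["object_id", "status"])
    writer.writerows(rows)
    return buf.getvalue()


def start_export(rows):
    """First request: launch the job and return an id immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(render_history_csv, rows)
    return job_id


def poll_export(job_id):
    """Follow-up requests: report progress, or hand over the file."""
    future = jobs[job_id]
    if not future.done():
        return {"state": "PENDING"}
    if future.exception() is not None:
        return {"state": "FAILURE", "error": str(future.exception())}
    return {"state": "SUCCESS", "csv": future.result()}


job_id = start_export([(1, "success"), (2, "error")])
jobs[job_id].result(timeout=5)  # demo only; a real client would keep polling
status = poll_export(job_id)
```

The first view returns only the job id, so the HTTP request finishes well within the nginx timeout regardless of how long the rendering takes.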
The problem I see in your project is that the server that is handling HTTP traffic is the same server running the long process. That can affect performance and user experience even if Celery is running in the background. Of course, that depends on how much traffic you are expecting on that machine and how many migrations can run at the same time.
One of the things you set up in Celery is the number of workers (concurrent processing units) available. So the number of cores in your machine will matter.
If you need to handle HTTP calls quickly, I would suggest delegating the migration process to another machine. Celery/Redis can be configured that way. Let's say you've got 2 servers. One would handle only normal Django calls (no Celery) and trigger Celery tasks on the other server (the one that actually runs the migration process). Both servers can connect to the same database.
But this is just an infrastructure optimization and you may not need it.
I hope this answers your question. If you have specific Celery issues it would be better to create another question.

What could be causing seemingly random AWS EC2 server to Crash? (Error couldn't establish database connection)

To begin, I am running a Wordpress site on an AWS EC2 Ubuntu Micro instance. I have already confirmed that this is NOT an error with Wordpress/mysql.
Seemingly at random the site will go down and I'll get the "Error establishing database connection" message. The server says that it is running just fine, and rebooting usually fixes the issue, however I'd like to figure out the cause and resolve the issue so this can stop happening (it's been the past 2 weeks now that it goes down almost every other day.)
It's not a spike in traffic, or at least Google Analytics hasn't shown the site as having any spikes in traffic (it averages about 300 visits per day.)
What's the cause, and how can this be fixed?
Sounds like you might be running into the throttling that is a limitation on t1.micro. If you use too many CPU cycles, you will be throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes
The next time this happens I would check some general stats on the health of the instance. You can get a feel for the high-level health of the instance using the 'top' command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look for CPU and memory usage. You may find a process (pid) that is consuming a lot of resources and starving your app.
More likely, something within your application (how did you come to the conclusion that this is not a Wordpress/MySQL issue?) is going out of control. Possibly there is a database connection not being released? To see what your app is doing, find the process id (pid) for your app:
ps aux | grep "php"
and get a thread dump for that process (kill -3 <pid> gives a thread dump for a Java process; it does not do so for PHP). This will help you see where your application's threads are stuck (if they are).
Typically it's good practice to execute two thread dumps a few seconds apart and compare trends in both. If there is an issue in the application, you should see a lot of threads stuck at the same point.
You might also want to check out what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html).
mysql> SHOW FULL PROCESSLIST;
Hope this helps, let us know what you find!

Strange apache lag in requests

I have an Apache2 and Django (mod_wsgi) setup that provides a RESTful API. I have a set of automated tests for this that execute ~1000 API requests (pure HTTP GET/POST/PUT/DELETE) in sequential order.
The problem is that every 80 requests or so, I get a strange lag/timeout of exactly 5s or 10s. See timestamp examples here:
Request 1: 2013-08-30T03:49:20.915
Response 1: 2013-08-30T03:49:30.940
Request 2: 2013-08-30T03:50:32.559
Response 2: 2013-08-30T03:50:37.597
I can't figure out why this happens. I have an Apache config with KeepAlive Off (the recommended setting for Django) but otherwise a standard install for Ubuntu 12.04 LTS.
I'm running the tests from the same server as the webserver. I first thought this was some kind of DNS cache thing, so I added the hostname I'm requesting to /etc/hosts, but the problem persists.
The system is idle and has lots of free CPU and memory when these lags/timeouts happen.
The lag is not specific to a certain request (URL), it seems kinda random.
Considering that it's always exactly to the millisecond 5s or 10s, it feels like this is some specific setting somewhere causing this.
In case it provides some insight, watch my talk from PyCon US.
http://lanyrd.com/2013/pycon/scdyzk/
The talk deals with things like process churn and startup costs. One thing you shouldn't do is set maximum requests if you don't really need it.
Also consider trying New Relic to help diagnose where the issue is. That will save a lot of guessing about whether it is a web application or backend service infrastructure issue.
As far as seeing how such monitoring can help, watch another one of my PyCon talks.
http://lanyrd.com/2012/pycon/spcdg/
This was a DNS issue; adding the domain name I used locally to /etc/hosts actually solved the problem. I just hadn't rebooted the server for the changes to take effect. I thought restarting networking would take care of that, but apparently not.
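A quick way to check whether name resolution is behind a delay like this is to time the lookup on its own, separately from the HTTP request. The sketch below is a minimal example; `time_lookup` is a hypothetical helper, and "localhost" is only a placeholder for whatever hostname the tests actually request. Delays of exactly 5s or 10s are consistent with the Linux resolver's default 5-second timeout and its retry, though that is an inference from the symptoms rather than something confirmed in this thread.

```python
import socket
import time


def time_lookup(hostname, port=80):
    """Return how long it takes to resolve hostname, in seconds."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, port)
    return time.perf_counter() - start


# Replace "localhost" with the hostname used by the test suite.
elapsed = time_lookup("localhost")
```

Running this in a loop alongside the tests would show whether the occasional slow request lines up with a slow lookup.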

Getting "idle in transaction" for postgresql with django

We are using Django 1.3.1 and Postgres 9.1
I have a view which just fires multiple selects to get data from the database.
The Django documentation mentions that when a request is completed, a ROLLBACK is issued if only select statements were fired during a call to a view. But I am seeing a lot of "idle in transaction" in the log, especially when I have more than 200 requests. I don't see any commit or rollback statements in the postgres log.
What could be the problem? How should I handle this issue?
First, I would check out the related post What does it mean when a PostgreSQL process is “idle in transaction”? which covers some related ground.
One cause of "Idle in transaction" can be developers or sysadmins who
have entered "BEGIN;" in psql and forgot to "commit" or "rollback".
I've been there. :)
However, you mentioned your problem is related to having a lot of
concurrent connections. It sounds like investigating the "locks" tip
from the post above may be helpful to you.
A couple more suggestions: this problem may be secondary. The primary
problem might be that 200 connections is more than your hardware and
tuning can comfortably handle, so everything gets slow, and when things
get slow, more things are waiting for other things to finish.
If you don't have a reverse proxy like Nginx in front of your web app,
consider adding one. It can run on the same host without additional
hardware. The reverse proxy will serve to regulate the number of
connections to the backend Django web server, and thus the number of
database connections-- I've been here before with having too many
database connections and this is how I solved it!
With Apache's prefork model, there is a 1:1 correspondence between the
number of Apache workers and the number of database connections,
assuming something like Apache::DBI is in use. Imagine someone connects
to the web server over a slow connection. The web and database server
take care of the request relatively quickly, but then the request is
held open on the web server unnecessarily long as the content is
dribbled back to the client. Meanwhile, the database connection slot is
tied up.
By adding a reverse proxy, the backend server can quickly deliver a
reply back to the reverse proxy and then free the backend worker and
database slot. The reverse proxy is then responsible for getting the
content back to the client, possibly holding open its own connection
for longer. You may have 200 connections to the reverse proxy up front,
but you'll need far fewer workers and db slots on the backend.
If you graph the db slots with MRTG or similar, you'll see how many
slots you are actually using, and can tune down max_connections in
PostgreSQL, freeing those resources for other things.
You might also look at pg_top to
help monitor what your database is up to.
I understand this is an older question, but this article may describe the problem of idle transactions in django.
Essentially, Django's TransactionMiddleware will not explicitly COMMIT a transaction if it is not marked dirty (usually triggered by writing data). Yet, it still BEGINs a transaction for all queries even if they're read only. So, pg is left waiting to see if any more commands are coming and you get idle transactions.
The linked article shows a small modification to the transaction middleware to always commit (basically remove the condition that checks if the transaction is_dirty). I'll be trying this fix in a production environment shortly.
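The difference between the two behaviours can be shown with a toy model. This is not Django's middleware code: `FakeConnection` and both `finish_request_*` functions are hypothetical, and the model only assumes what is described above, namely that a transaction is begun for every query and that the original middleware commits only when the transaction is dirty.

```python
class FakeConnection:
    """Toy database session that records the statements it is sent."""

    def __init__(self):
        self.in_transaction = False
        self.dirty = False
        self.log = []

    def execute(self, sql):
        if not self.in_transaction:
            # A transaction is opened for every query, even read-only ones.
            self.log.append("BEGIN")
            self.in_transaction = True
        if not sql.lstrip().upper().startswith("SELECT"):
            self.dirty = True  # only writes mark the transaction dirty
        self.log.append(sql)

    def commit(self):
        self.log.append("COMMIT")
        self.in_transaction = False
        self.dirty = False


def finish_request_if_dirty(conn):
    """Behaviour described above: commit only after a write."""
    if conn.dirty:
        conn.commit()


def finish_request_always(conn):
    """Modified behaviour: close any open transaction at request end."""
    if conn.in_transaction:
        conn.commit()


read_only = FakeConnection()
read_only.execute("SELECT 1")
finish_request_if_dirty(read_only)   # session stays "idle in transaction"

fixed = FakeConnection()
fixed.execute("SELECT 1")
finish_request_always(fixed)         # transaction is closed
```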