SQLAlchemy/Flask MSSQL query hangs indefinitely when run under Apache

My app uses a couple of different DBs on the same MSSQL server (read only). My problem is that one of the two MSSQL connections always works fine, whereas the other hangs indefinitely on the first query until Flask cuts off the connection. This only happens when the app is running under Apache, and even then only most of the time. When I run the Flask test server, everything is fine.
I've surrounded the MSSQL query with logging messages and am therefore positive that the bug is in that particular query. It is just a simple lookup by primary key, like this:
db.query(Record).get(id)
The DBs are accessed through different engines whose URIs only differ by the database name.
My problem is that I have no idea how to start debugging this. Any tips?
[EDIT] I've managed to get SQLAlchemy logging going under Apache. I've set echo=True on the engine, and it doesn't output anything at all. It just hangs.

Turns out that the problem doesn't have to do with Apache but with connection timeout. Whereas the MySQL engine gives an error message when trying to execute a query on an expired server connection, the MSSQL engine just silently stalls forever. Adding pool_recycle=3600 to create_engine() solved the issue.
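For reference, a minimal sketch of the fix, assuming a pyodbc-based MSSQL URI (the connection string below is a placeholder, not the real one):

from sqlalchemy import create_engine

# pool_recycle makes SQLAlchemy discard pooled connections older than
# the given number of seconds, so a query is never issued on a server
# connection that has silently expired.
engine = create_engine(
    "mssql+pyodbc://user:password@server/dbname",  # placeholder URI
    pool_recycle=3600,  # recycle connections after one hour
)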

Related

Where to even begin investigating an issue causing a database crash: remaining connection slots are reserved for non-replication superuser connections

Occasionally our Postgres database crashes, and it can only be fixed by restarting the server. We have tried increasing max connections and Django's CONN_MAX_AGE. I am also trying to learn how to set up PgBouncer. However, I am convinced the underlying issue is something else that is fixable.
I am trying to find what that issue is. The problem is I wouldn't know where or what to begin to look at. Here are some pieces of information:
The errors are always OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser connections and OperationalError: could not write to hash-join temporary file: No space left on device. I think this is caused by opening too many database connections, but I have never managed to catch this going down live so that I could inspect pg_stat_activity and see what connections were actually active (a sketch of one way to do that follows below).
Looking at the error log, the same URL shows up for the most part. I've checked the nginx log and it's listed on many different lines, meaning the request is being made multiple times at once rather than Django logging the same error multiple times. All these requests are answered with 499 Client Closed Request. In addition to this same URL, there are of course sprinkled requests from other users trying to access our site.
I should mention that the logic the server processes when the URL in question is requested is pretty simple and I see nothing suspicious that could cause a database crash. However, for some reason, the page loads slowly in production.
I know this is very vague and very little to work with, but I am not used to doing sysadmin work; I only studied this kind of thing in college, and so far I have only worked as a developer.
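A minimal sketch of such a watcher, assuming psycopg2 and Postgres 9.2 or later (the DSN and sampling interval are placeholders):

import time
import psycopg2

# Sample pg_stat_activity every few seconds and print it, so the set
# of active connections at crash time ends up in the captured output.
conn = psycopg2.connect("dbname=postgres user=postgres")  # placeholder DSN
conn.autocommit = True  # don't hold a transaction open ourselves

while True:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT pid, usename, state, now() - query_start AS runtime, query"
            " FROM pg_stat_activity ORDER BY query_start"
        )
        for row in cur.fetchall():
            print(row)
    time.sleep(5)

Run it with output redirected to a file, and the last sample before a crash shows which connections were active.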
Those two problems are mostly independent.
Running out of connection slots won't crash the database. It is just a sign that you either don't use a connection pool or you have a connection leak, i.e. you forget to close transactions in your code.
Running out of space will crash your database if the condition persists.
I assume that the following happens in your system:
Because someone forgot a couple of join conditions or for some other reason, some of your queries take a very long time.
They also produce a lot of (perhaps intermediate) results that are cached in temporary files, which eventually fill up the disk. The out-of-space condition is cleared as soon as the query fails, but it can crash the database.
Because these queries take long and block a database session, your application keeps starting new sessions until it reaches the limit.
Solution:
Find and fix those runaway queries. As a stop-gap, you can set statement_timeout to terminate all statements that take too long (see the sketch after this list).
Use a connection pool with an upper limit so that you don't run out of connections.
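A minimal sketch of the statement_timeout stop-gap from the Django side, assuming psycopg2; the 30-second limit is an arbitrary placeholder and the other connection settings are elided:

# settings.py
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        # ... NAME, USER, PASSWORD, HOST ...
        "OPTIONS": {
            # libpq startup option: abort any statement that runs
            # longer than 30000 ms instead of letting it fill the
            # disk with temporary files.
            "options": "-c statement_timeout=30000",
        },
    }
}

The same limit can also be set server-wide in postgresql.conf, or per role with ALTER ROLE.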

Postgres not responding with high waiting time

I am using Django with Postgres.
My site serves half a million pages without any issue and everything works fine.
However, I am using an API system, and it works like this: a first party calls my site's API, my site gets data from a third-party website using its API, extracts some of the data, and passes it back to the first party. It works perfectly. In this cycle I have to check my Postgres to see whether the data is already present or not.
Everything works fine. But if the third-party API does not respond, or there is any server issue on their side, it takes a long time to come back with a 404 error; my Postgres just dies and I have to run service postgresql restart every time to get the site working again.
What could be the issue? How do I check why Postgres is dying?
One possibility, although only a guess, is that your code is locking a database table while the third-party API call is made. This would prevent other updates from occurring while waiting.
It wouldn't explain why you need to restart the Postgres server, though; the lock should be released after the third-party API call times out.
It might help to add to your question the code that deals with checking whether the data is already present in the db, calling the remote API, and finally updating the database with new data.
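To illustrate the failure mode, here is a minimal sketch of the safer pattern, assuming Django 1.6+ with the requests library; Record, key, payload, and THIRD_PARTY_URL are hypothetical names. The point is to keep the remote call outside any transaction and give it a hard timeout:

import requests
from django.db import transaction

THIRD_PARTY_URL = "https://example.com/api"  # hypothetical endpoint

def fetch(key):
    # Record is a hypothetical model with "key" and "payload" fields.
    # Quick read first; no lock or transaction is held afterwards.
    cached = Record.objects.filter(key=key).first()
    if cached is not None:
        return cached.payload
    # The remote call happens outside any transaction, with a hard
    # timeout so a dead third party cannot tie up a worker (and its
    # database connection) indefinitely.
    resp = requests.get(THIRD_PARTY_URL, params={"key": key}, timeout=5)
    resp.raise_for_status()
    payload = resp.json()
    # Only the write itself runs inside a short transaction.
    with transaction.atomic():
        Record.objects.update_or_create(key=key, defaults={"payload": payload})
    return payload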

Suddenly scheduled tasks are not running in coldfusion 8

I am using a ColdFusion MX8 server, and one of the scheduled tasks had been running for 2 years, but suddenly, since 01/12/2014, scheduled tasks are not running. When I browse to the file in a browser, the file runs successfully without error.
I am not sure whether there is an update or license expiration problem. I am aware that Adobe ended support for ColdFusion 8 in the middle of this year.
The most common cause of this problem is external to the server. When you say you browsed to the file and it worked in a browser, it is very important to know whether that test was performed on the server desktop. Knowing that you can browse to the file from your own desktop or laptop is of small value.
The most common source of issues like this is a change in the DNS or network stack that is interfering with resolution. For example, if the internal DNS serving your DMZ suddenly starts serving the "external" address, your server can suddenly no longer browse to your domain. Or the IP served for the domain in question goes from being 127.0.0.1 to some other IP that the server can't access correctly due to a reverse proxy, load balancer, or some other rule. Finally, sometimes Apache or IIS is altered so that an IP that was previously serviced (127.0.0.1 being the most common example) no longer responds.
If it is something intrinsic to the scheduler service, then Frank's advice is pretty good - especially look for "proxy scheduler" entries in the log; they can give you good clues. I would also log the results of a scheduled task to a file, then check the file. If it exists, then your scheduled tasks ARE running - they are just not succeeding. Good luck!
I've seen the cf scheduling service crash in CF8. The rest of CF is unaffected.
Have you tried restarting the server?
Here are your concerns:
Your File (works since you tested it manually).
Your Scheduled Task (failed).
Your ColdFusion Application (Service) (any changes here?).
Your Server (what about here?).
To test your problem, create a duplicate task and schedule it. Leave the other one in place (maybe set your new one to run earlier). Use the same file, too. See if it completes.
If it doesn't, then you have a larger problem. Since the ColdFusion server sits atop the JVM, there could be something happening there. Things just don't stop working unless something got corrupted or you got compromised. If you hardened your server by rearranging/renaming the file structure to make it more secure, that would break your task.
So going back: if your test schedule works, then determine what is different between the two. Note that you have logging capabilities; see "Logging abilities for CF8".
If you are not directly in charge of maintaining this server, then I would recommend asking around to see if there was recent maintenance, and if so, what was done to the server.

Getting "idle in transaction" for postgresql with django

We are using Django 1.3.1 and Postgres 9.1
I have a view which just fires multiple selects to get data from the database.
The Django documentation mentions that when a request completes, a ROLLBACK is issued if only SELECT statements were fired during the call to the view. But I am seeing a lot of "idle in transaction" sessions in the log, especially when I have more than 200 requests. I don't see any COMMIT or ROLLBACK statements in the Postgres log.
What could be the problem? How should I handle this issue?
First, I would check out the related post What does it mean when a PostgreSQL process is “idle in transaction”? which covers some related ground.
One cause of "idle in transaction" can be developers or sysadmins who have entered "BEGIN;" in psql and forgot to "commit" or "rollback". I've been there. :)
However, you mentioned your problem is related to having a lot of concurrent connections. It sounds like investigating the "locks" tip from the post above may be helpful to you.
A couple more suggestions: this problem may be secondary. The primary problem might be that 200 connections is more than your hardware and tuning can comfortably handle, so everything gets slow, and when things get slow, more things are waiting for other things to finish.
If you don't have a reverse proxy like Nginx in front of your web app, consider adding one. It can run on the same host without additional hardware. The reverse proxy regulates the number of connections to the backend Django web server, and thus the number of database connections -- I've been here before with too many database connections, and this is how I solved it!
With Apache's prefork model, there is a 1:1 correspondence between the number of Apache workers and the number of database connections, assuming something like Apache::DBI is in use. Imagine someone connects to the web server over a slow connection. The web and database servers take care of the request relatively quickly, but then the request is held open on the web server unnecessarily long as the content is dribbled back to the client. Meanwhile, the database connection slot is tied up.
By adding a reverse proxy, the backend server can quickly deliver a reply to the reverse proxy and then free the backend worker and database slot. The reverse proxy is then responsible for getting the content back to the client, possibly holding its own connection open for longer. You may have 200 connections to the reverse proxy up front, but you'll need far fewer workers and db slots on the backend.
If you graph the db slots with MRTG or similar, you'll see how many slots you are actually using, and you can tune down max_connections in PostgreSQL, freeing those resources for other things. You might also look at pg_top to help monitor what your database is up to.
I understand this is an older question, but this article may describe the problem of idle transactions in Django.
Essentially, Django's TransactionMiddleware will not explicitly COMMIT a transaction if it is not marked dirty (usually triggered by writing data). Yet it still BEGINs a transaction for all queries, even read-only ones. So Postgres is left waiting to see if any more commands are coming, and you get idle transactions.
The linked article shows a small modification to the transaction middleware to always commit (basically removing the condition that checks if the transaction is_dirty). I'll be trying this fix in a production environment shortly.
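A minimal sketch of that kind of change, assuming old-style (pre-1.6) Django where TransactionMiddleware and manual transaction management still exist; treat it as an illustration of the article's idea rather than its exact patch:

from django.db import transaction
from django.middleware.transaction import TransactionMiddleware

class AlwaysCommitTransactionMiddleware(TransactionMiddleware):
    # Like Django 1.3's TransactionMiddleware, but ends every
    # request's transaction even when no data was written, so
    # read-only views don't leave sessions "idle in transaction".
    def process_response(self, request, response):
        if transaction.is_managed():
            # Commit unconditionally instead of only when is_dirty().
            transaction.commit()
            transaction.leave_transaction_management()
        return response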

SQL-Server Connection Fails after Network Reconnect

I am working on an update to an application that uses DAO to access an SQL Server. I know, but let's consider DAO a requirement for now.
The application runs all the time in the system tray and periodically performs SQL Server operations. Since it is running all the time, and users of the application will be on laptops transitioning between buildings, I've designed it to quietly transition between active and inactive states. When the database connection succeeds, operations resume.
I have one last issue before I release this update: when a connection is dropped and then reestablished, the SQL operations fail. This occurs only if I have specified the hostname in my connection string. If I use the IP, everything is fine (but I need to be able to use the hostname).
Here is the behavior:
1) Everything working. Good network connection, database operations are fine.
2) Lost connection. Little 'x' appears on task bar icon, and nothing else. All ok.
3) Reconnect.
At step 3, I get an 'ODBC--call failed' error when I run the first query. Interestingly, opening the database succeeds without error.
If I skip step 1, and start the application when the connection is down, everything works fine in step 3, hostname or not.
I expect this is an issue with the DAO engine caching the DNS entry after the first connection, although the destination IP does not change, so I'm not sure about that. I have tried flushing the Windows DNS cache (from a cmd prompt) to no effect. The same behavior occurs even when I'm using my local hostname with a local SQL Server I set up for development; 127.0.0.1 has no problems.
I also tried to CoUninitialize() the DAO interface between active times, but I had trouble getting this to work. If someone thinks that would help I will work harder at it.
This behavior is the same in Windows XP or 7.
Thanks for anything you've got!
Edit: I should have mentioned - I am closing the database connection between the attempts, then reopening it with
m_pDb = m_pDaoEngine->OpenDatabase()
I ended up biting the bullet and converting the application to ADO. Everything works nicely now, and database operations are much faster to boot.