Why are Google Cloud SQL connections failing: "(psycopg2.DatabaseError) server closed the connection unexpectedly"


I am working with a customer who is having issues with their GCP Cloud SQL deployment. Their questions are transcribed here:
When connecting to Cloud SQL, connections often fail intermittently. This can look like a Python error:
(psycopg2.DatabaseError) server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Or, in Node, it can look like a timeout error or a socket hang up:
TimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
We have everything configured correctly, as far as we can tell, and have followed all the instructions in the Cloud SQL troubleshooting guide. We have an instance with 20GB of memory that should support 250 connections. The timeouts should be set to refresh the connections at the right intervals (< 10 min). So we're not sure what's going on here.
I know that isn't a ton to go on, but I wanted to do my due diligence in seeing how we can help them. I realize we may not get a perfect answer on what is going on, but some additional questions I can ask them to help debug the issue would be a great start.
I found this similar question that seems to be describing the same issue but it has no answers: PostgreSQL 'Sever closed the connection unexpectedly'
Thanks for any help!

As the error suggests, it's not clear what caused the connection to be closed. I would suggest looking at the Cloud SQL error logs (in the Google Cloud Console) for detailed information on why the connection was closed, as was the case in this GitHub issue (the wrong role was assigned).
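One follow-up question for the customer is how each client's connection pool is configured. Below is a minimal sketch, assuming the Python services use SQLAlchemy on top of psycopg2 (the format of the error message suggests this, but the question does not say so). The helper names and all numeric values are illustrative assumptions, not taken from the customer's setup.

```python
def pool_settings(recycle_seconds=540, pool_size=5, max_overflow=2):
    """Keyword arguments for sqlalchemy.create_engine (illustrative values)."""
    return {
        # Test each connection with a lightweight ping before handing it out,
        # so a connection the server already closed is replaced transparently.
        "pool_pre_ping": True,
        # Retire connections older than this many seconds (kept under the
        # 10 minute refresh interval mentioned in the question).
        "pool_recycle": recycle_seconds,
        "pool_size": pool_size,
        "max_overflow": max_overflow,
        # Seconds to wait for a free connection before raising a timeout.
        "pool_timeout": 30,
    }


def peak_connections(app_instances, settings):
    """Worst-case number of server connections across all app instances."""
    return app_instances * (settings["pool_size"] + settings["max_overflow"])


def make_engine(url):
    """Build an engine with the settings above (requires SQLAlchemy)."""
    from sqlalchemy import create_engine  # imported lazily on purpose
    return create_engine(url, **pool_settings())
```

Comparing peak_connections(...) for the real number of application instances against the 250-connection limit is a quick way to check whether the pools can exhaust the instance on their own.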

Related

Amazon Redshift: Queries never finish running after period idle

I am working on a new Amazon Redshift database that I recently started.
I am experiencing an issue where, after I connect to the database, I can run queries without any problem. However, if I spend some time without running anything (about 5 minutes), the next query or command I run never finishes.
I am using DBeaver Community 21.2.2 to interact with the connection, and it stays on "Executing query" forever. The only way I can get it to work is by cancelling, disconnecting from Redshift, and connecting again, after which it executes correctly. Then, once I stop using it for a few minutes, it happens all over again.
I thought this was a DBeaver issue, as we have a Metabase instance connected to this same cluster without any issues. But today I tried working with this cluster from R using RJDBC, and the same thing happens: I can run queries until I stop, and then when I try running something else it never finishes, until I disconnect and connect again.
I'm sorry if I wasn't able to explain it clearly. I tried searching for similar issues but couldn't find any.

I suspect that the queries in question are not even being launched on the database. You can check this by reviewing svl_statementtext to see if the query is even being seen. Put a unique comment in the query to help determine if it is actually the query in question.
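The check described above can be sketched as follows. This is only an illustration: the helper names are hypothetical, and the column list assumes the usual svl_statementtext layout (starttime, pid, sequence, text).

```python
import uuid


def tag_query(sql):
    """Prefix a query with a unique comment so it can be found later."""
    tag = "probe-" + uuid.uuid4().hex
    return tag, "/* {} */ {}".format(tag, sql)


def lookup_sql(tag):
    """Build the query that searches svl_statementtext for the tag.

    Note that this lookup statement contains the tag too, so it will
    show up in the view alongside the tagged query once it has run.
    """
    return (
        "SELECT starttime, pid, sequence, text "
        "FROM svl_statementtext "
        "WHERE text LIKE '%{}%' "
        "ORDER BY starttime, sequence;".format(tag)
    )
```

Run the tagged query from the session that has been idle, then run the lookup from a fresh connection. If the only match is the lookup itself, the tagged query never reached the database.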
Since I've seen similar behavior before I'll write up a possible way this can happen. In this case the queries were not being seen by the database or the connection to the database was being dropped mid execution. The cause is network switches and their configurations.
Typical network connections are fairly quick - you ask for a web page and it is given to you. Connection is complete. When you click on a link a new connection is established and also ends quickly. These network actions are atomic from a network connection point of view. However, database connections are different. One connection is made and many back and forth transmissions of data happen while the connection is open. This is no problem, and with the right set of network configurations these connections can be open and idle for days.
The problem comes in when the operators of the network equipment decide that connections that have no data flowing are "stale" after some fixed amount of time. They do this so that the network equipment can "forget" about these connections and focus on "active" connections. ISPs drop idle connections a lot so that they can handle the load of traffic and connections that flow through their equipment. This doesn't cause any issues for web pages and APIs but database connections get clobbered.
When this happens it looks exactly like what you describe. Both sides (client and database) think that the connection is still active but the network equipment has dropped the connection. Nothing gets through but no notification is sent to either party. You will likely see corresponding open sessions on the Redshift side for these dropped connections and the database is just waiting for the client to give a command on each of them. An administrator will need to go through and close (terminate) these sessions for them to go away.
Now the thing that doesn't align with experience is the speed at which these connections are being marked as "stale". In my case my ISP was closing connections that were idle for more than 30 min. You seem to be timing out much faster than this. In some cases corporate firewalls will be configured with short idle connection timeouts for routes out of the private network to the internet. So there are cases where the timeouts can be short. The networks at AWS do not have these timeouts so if your connections are completely within AWS then this isn't your answer.
To address this there are a few ways to go. The easy way is to set up a tunnel into AWS with "keep alive" packets sent every 30 sec or so. You will need an EC2 instance at AWS so it isn't cost free. SSH tunneling is the usual tool for this and there are write-ups online for setting it up.
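To illustrate what "keep alive" means at the socket level, here is a minimal Python sketch that turns on TCP keepalive probes for a connection the client owns. It is not the SSH tunnel itself, and tools such as DBeaver or a JDBC driver would be configured through their own settings instead. The function name and the 30 second values are assumptions, and the fine-grained options are only applied on platforms where Python exposes them.

```python
import socket


def enable_keepalive(sock, idle=30, interval=30, count=3):
    """Ask the OS to send keepalive probes on an otherwise idle TCP socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket tuning options are platform dependent, so each one
    # is applied only if this Python build defines it.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

With probes flowing every 30 seconds, the network equipment in between keeps seeing traffic, so the connection is less likely to be classified as stale.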
The hard way (but likely the most correct way) is to work with network experts to understand when the timeout is happening and why. If the timeout cannot be changed then it may be possible to configure a different network topology for your use case. Network peering or a VPN could address this.
In some cases you may be able to avoid JDBC or ODBC connections altogether. These protocols are valid, but they are old and most networking doesn't work this way anymore, which is why they suffer from these issues. The Redshift Data API lets you issue SQL to Redshift in a single package and check on completion later on. These API calls are each independent connections so there is no possibility of "timing out" between them. The downside is that this process is not interactive and therefore not supported by workbenches.
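The submit-then-check pattern of the Data API can be sketched as below. This assumes a boto3 "redshift-data" client is passed in; the function name, parameter names, and polling interval are illustrative, and authentication via DbUser is only one of the options the API offers.

```python
import time


def run_statement(client, sql, cluster_id, database, db_user,
                  poll_seconds=2.0, sleep=time.sleep):
    """Submit SQL through the Redshift Data API and wait for it to end."""
    submitted = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    statement_id = submitted["Id"]
    while True:
        # Every poll is an independent API call, so there is no long-lived
        # connection for network equipment to drop.
        description = client.describe_statement(Id=statement_id)
        status = description["Status"]
        if status == "FINISHED":
            return description
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError("statement {} ended as {}: {}".format(
                statement_id, status, description.get("Error")))
        sleep(poll_seconds)
```

A caller would typically create the client with boto3.client("redshift-data") and, for SELECT statements, fetch rows afterwards with get_statement_result(Id=...).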
So does this match what you have going on?

Google App Engine logs a mess of New connection for ... and Client closed local connection on

Checking out my logs on my App Engine. I get A LOT of
New connection for "<project_id>-central1:<project_name>"
Client closed local connection on /cloudsql/<project_id>-central1:<project_name>/.s.PGSQL.5432
This happens multiple times a second and floods my logs.
I was unable to find any information relating to this and maybe this is just a non-issue.
Is there any way to prevent this? (excluding filtering)
Is all this opening and closing of connections inadvertently driving up my operating costs?
I am using Django on the app engine.

I found this post where it's mentioned that setting -verbose=false will turn off the new/closed connection logs.

I found information about the same error but it wasn't generating a lot of connections. Anyway it was related to the Cloud SQL proxy.
Have you followed the instructions in this guide to configure the PostgreSQL connection to App Engine? I am particularly interested in the ones from "Setting up your local environment".
I did not find any related field in the quotas or pricing pages, but you can check the billing in the Google Cloud Console: Billing -> Overview -> [PROJECT_ID].

I'm not a Django developer, but I guess the root of this problem is that Django opens a new connection to the database for every request by default.
Source: https://docs.djangoproject.com/en/2.1/ref/databases/
Persistent connections avoid the overhead of re-establishing a connection to the database in each request. They’re controlled by the CONN_MAX_AGE parameter which defines the maximum lifetime of a connection. It can be set independently for each database.
The default value is 0, preserving the historical behavior of closing
the database connection at the end of each request. To enable
persistent connections, set CONN_MAX_AGE to a positive number of
seconds. For unlimited persistent connections, set it to None.
You can try to increase the CONN_MAX_AGE or set it to None and the log messages should disappear.

Changing the CONN_MAX_AGE value to None can help; however, this may expose your application to bot attacks, as happened to me.
Looking up the IPs on abuseIPDB.com, I found many reports of Brute Force/Web App Attack from them.
Setting the variable to a fixed number may keep your application safe and still stop these logs.
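For reference, a fixed CONN_MAX_AGE could look like the settings fragment below. This is a sketch only: the environment variable names, the default values, and the 60 second lifetime are assumptions, and the placeholder socket path must be replaced with the instance's real connection name.

```python
import os

# Fragment of a Django settings.py; every value here is illustrative.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        # On App Engine the Cloud SQL instance is reached through a Unix
        # socket directory under /cloudsql/ (placeholder default below).
        "HOST": os.environ.get("DB_HOST", "/cloudsql/<connection-name>"),
        "NAME": os.environ.get("DB_NAME", "postgres"),
        "USER": os.environ.get("DB_USER", "postgres"),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        # Reuse each connection for up to 60 seconds instead of closing it
        # at the end of every request (the default of 0).
        "CONN_MAX_AGE": 60,
    }
}
```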

flask-cache memcache connection auto-reconnect

I've recently moved my memcache server behind an Elastic Load Balancer in AWS. I'm also using Flask-Cache with this memcache. If I'm not mistaken (and it's totally possible I am), Flask-Cache opens a connection to memcache and holds it open. It also appears that the ELB terminates these long-standing connections after some period of time (I think it's about 60 minutes). This will result in errors like:
SomeErrors: error 19 from flush_all: (0x4ff96f0) CONNECTION FAILURE, ::rec() returned zero, server has disconnected
If there was some way I could catch these errors and reconnect (or some magic setting to "try to reconnect on connection failure"), that would solve this problem.
FWIW, I'm using pylibmc, but don't see anything obvious (to me) that I could pass.
Any help would be greatly appreciated!

Being disconnected from ELB is very common and also very difficult to debug. Here are a few things that might help:
Debugging Ideas
Attempt to debug the problem in a staging environment with only one instance connected to ELB.
Make sure you have application logging with time stamps and that if you catch all exceptions in Python (which is generally not a great idea), that you log the exception. It is possible you have a subtle and hidden bug that appears to be something else if you are catching all exceptions.
Simulate the failure (i.e. manually remove "one" instance from ELB), then look at your logs and make sure you see this manifested there. If you can reproduce the same behavior then you can figure out how to fix it.
Look into a web service automated testing tool like https://loader.io/. This can be very helpful to simulate the conditions when the disconnects appear to happen.
Try the same application with a different load balancer, i.e. HAProxy (I would potentially try this last).
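On the "catch these errors and reconnect" part of the question, a generic retry wrapper is one possible shape. The class below is a hypothetical sketch, not a Flask-Cache or pylibmc feature: connect is any zero-argument factory that returns a fresh client, and retry_on is whichever exception class the client raises on disconnect.

```python
class ReconnectingClient:
    """Retry a client call once on a fresh connection after a failure."""

    def __init__(self, connect, retry_on, attempts=2):
        self._connect = connect      # factory that returns a new client
        self._retry_on = retry_on    # exception type(s) meaning "disconnected"
        self._attempts = attempts    # total tries, including the first
        self._client = connect()

    def call(self, method, *args, **kwargs):
        last_error = None
        for _ in range(self._attempts):
            try:
                return getattr(self._client, method)(*args, **kwargs)
            except self._retry_on as exc:
                # Log here in a real application so disconnects stay visible.
                last_error = exc
                self._client = self._connect()
        raise last_error
```

With pylibmc, connect could build a pylibmc.Client and retry_on could be the library's base exception (pylibmc.Error is an assumption to verify against the installed version). Flask-Cache holds its own client, so wiring this in would need a custom cache backend, which is not shown here.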

Suddenly scheduled tasks are not running in coldfusion 8

I am using a ColdFusion MX8 server, and one of the scheduled tasks had been running for 2 years, but since 01/12/2014 the scheduled tasks have suddenly stopped running. When I browse to the file in a browser, it runs successfully without error.
I am not sure whether this is an update or license expiration problem. I am aware that Adobe ended support for ColdFusion 8 in the middle of this year.

The most common cause of this problem is external to the server. When you say you browsed to the file and it worked in a browser, it is very important to know whether that test was performed on the server desktop. Knowing that you can browse to the file from your desktop or laptop is of little value.
The most common source of issues like this is a change in the DNS or network stack that is interfering with resolution. For example, if the internal DNS serving your DMZ suddenly starts serving the "external" address, your server suddenly can't browse to your domain. Or the IP served to the server for the domain in question goes from being 127.0.0.1 to some other IP that the server can't access correctly due to a reverse proxy, LB, or some other rule. Finally, sometimes the Apache or IIS configuration is altered so that an IP that previously was serviced (127.0.0.1 being the most common example) now does not respond.
If it is something intrinsic to the scheduler service then Frank's advice is pretty good - especially look for "proxy schduler" entries in the log - they can give you good clues. I would also log results of a scheduled task to a file. Then check the file. If it exists then your scheduled tasks ARE running - they are just not succeeding. Good luck!

I've seen the cf scheduling service crash in CF8. The rest of CF is unaffected.
Have you tried restarting the server?

Here are your concerns:
Your File (works since you tested it manually).
Your Scheduled Task (failed).
Your Coldfusion Application (Service) (any changes here)?
Your Server (what about here).
To test your problem create a duplicate task and schedule it. Leave the other one in place (maybe set your new one to run earlier). Use the same file too. See if it completes.
If it doesn't then you have a larger problem. Since the ColdFusion server sits atop the JVM, there could be something happening there. Things just don't stop working unless something got corrupted or you got compromised. If you hardened your server by rearranging/renaming the file structure to make it more secure, that would break your task.
So going back: if your test schedule works then determine what is different between the two. Note you have logging capabilities. Logging abilities for CF8
If you are not directly in charge of maintaining this server, then I would recommend asking around to see whether there was recent maintenance and, if so, what was done to the server.

Automatically restart Coldfusion service when it goes down [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
How to restart Coldfusion Application Server when application is timeout?
Currently I have a ColdFusion application that causes server issues. After 1-2 days the server doesn't respond until a manual restart is done.
I know that I have to find what is going wrong in my scripts, and I have spent a lot of time on that over several weeks.
But in the meantime I would like a script that automatically restarts the ColdFusion service if it is hung.
I don't have much knowledge of batch scripts etc., but I guess the test would be a request to a .cfm page, with a restart if no response is served before a timeout?
Has anyone ever come across a script like this?
Config: Win 2k8 Server R2 - Coldfusion 9(.0.0)
Thank you

Two things here
The real way is to fix the issue and you can do that with Fusion Reactor - http://www.fusion-reactor.com/fr/ It will help you monitor and restart and self heal as it needs.
You could create a batch file, and create a Scheduled Task in Windows that ran it.
Using Net Start / Net Stop Commands
net stop "Macromedia JRun CFusion Server"
net start "Macromedia JRun CFusion Server"
Though this may not always work, so I have a batch file:
c:\JRun4\uninstall\KillJRun.exe
net start "Macromedia JRun CFusion Server"
Which works for me.
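The probe-then-restart idea from the question can be combined with the net stop / net start commands above. The following Python sketch assumes Python is available on the server and that the service name matches the one shown above; the function names, the 15 second timeout, and the URL in the usage note are illustrative.

```python
import http.client
import subprocess
import urllib.request

# Service name as used in the commands above; adjust to the real install.
SERVICE = "Macromedia JRun CFusion Server"


def is_responding(url, timeout=15):
    """Return True if the page answers successfully within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # Covers timeouts, refused connections, and HTTP error responses.
        return False


def restart_service(name=SERVICE):
    """Stop and start the Windows service using the net command."""
    subprocess.run(["net", "stop", name], check=False)  # may already be down
    subprocess.run(["net", "start", name], check=True)


def check_and_restart(url, probe=is_responding, restart=restart_service):
    """Restart only when the probe fails; report whether a restart ran."""
    if probe(url):
        return False
    restart()
    return True
```

A Windows Scheduled Task could run this every few minutes with something like check_and_restart("http://localhost/health.cfm"), where the URL is a hypothetical lightweight test page. If the service hangs so badly that net stop does not return, the KillJRun.exe approach above would be needed in place of the stop step.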

Your best bet is to use Pingdom or another server monitoring tool. When the server goes down (responds with a 503 error, service unavailable) you may be able to have Pingdom send a request to a PHP script on the server that calls a batch file. I am not sure if Pingdom supports pinging another server if one is down, but you could have Pingdom email an inbox that your PHP script checks every few minutes.
This may end up being more work than figuring out what is wrong with your script though.
Edit: You may want to look at this question. This will only work if the service has stopped, whereas usually when a script crashes ColdFusion it is hanging. If you run the script that crashes the server, then look at the service, if it says stopped, then this may work for you.
The other thing that I would check is the JVM memory. Often times crashes are due to processing large amounts of data from files or the database and the JVM doesn't have the memory to do that.

Nope. It cannot be restarted automatically when your CF service/server is hanging. The only way is to restart it with a Windows scheduled task.

You could also use Nagios+Plugins to fire a restart script when the service hangs. But following the previous advice & finding out what the problem is is your best bet.
