Got an error reading communication packets in Google Cloud SQL - google-cloud-platform

Since 31 March I have been getting the following error in Google Cloud SQL:
Got an error reading communication packets.
I have been using Google Cloud SQL for 2 years and have never run into this problem before.
I'm very worried about it.
This is the detailed error message:
textPayload: "2019-04-29T17:21:26.007574Z 203385 [Note] Aborted connection 203385 to db: {db_name} user: {db_username} host: 'cloudsqlproxy~{private ip}' (Got an error reading communication packets)"

While it is true that this error message often occurs after a maintenance period, it isn't necessarily a cause for concern as this is a known behavior by MySQL.
Possible explanations for this issue are:
The large increase of connection requests to the instance, with the
number of active connections increasing over a short period of time.
The freezing / unavailability of the instance can also occur due to
the burst of connections happening in a very short time interval. It
is observed that this freezing always happens with an increase of
connection requests. This increase in connections causes the
instance to be overloaded and hence unavailable to respond to
further connection requests until the number of connections
decreases or the instance stabilizes.
The server was too busy to accept new connections.
There were high rates of previous connections that were not closed
correctly.
The client terminated it abnormally.
The readTimeout setting is too low in the MySQL driver.
In an excerpt from the documentation, it is stated that:
There are many reasons why a connection attempt might not succeed.
Network communication is never guaranteed, and the database might be
temporarily unable to respond. Make sure your application handles
broken or unsuccessful connections gracefully.
An outdated Cloud SQL Proxy version can also be the reason for such
incidents. Upgrading to the latest version (v1.23.0 at the time of
writing) is worth trying as a troubleshooting step.
The IP address you are connecting from may not have been added to the
Authorized Networks of the Cloud SQL instance.
Possible workarounds, depending on which case applies to you:
If the issue is related to high load, you could
retry the connection using exponential backoff to avoid
sending too many simultaneous connection requests. The best
practice here is to back off your connection requests exponentially
and add randomized backoffs to avoid throttling and potentially
overloading the instance. To mitigate this issue in the
future, space out connection requests to prevent overloading.
Depending on how you are connecting to Cloud SQL, exponential
backoff may already be in use by default with certain ORM packages.
If the issue could be related to an accumulation of long-running
inactive connections, you can check by running SHOW FULL PROCESSLIST
on your database and looking for connections with a high Time value
or connections where Command is Sleep.
If this is your case, you have a few options:
If you are not using a connection pool, you could update the client application logic to close connections immediately at the end of an operation, or use a connection pool to limit the lifetime of your connections. Managing the connection count with a connection pool is the ideal approach: unused connections are recycled, and the number of simultaneous connection requests can be limited through the maximum pool size parameter.
If you are using a connection pool, you could return idle connections to the pool immediately at the end of an operation and set a shorter timeout by adjusting the wait_timeout or interactive_timeout flag values. For example, set the Cloud SQL wait_timeout flag to 600 seconds to force connections to be refreshed.
To check network and port connectivity:
Step 1. Confirm TCP connectivity on port 3306 with tcptraceroute or
netcat.
Step 2. If Step 1 succeeds, connect with the mysql client and check
whether it reports any timeouts or errors.
If the client might be terminating the connection abruptly, you
could check the following:
Whether the MySQL client or the mysqld server is receiving a packet
bigger than max_allowed_packet bytes, or the client is getting a
"packet too large" message. If so, you could send smaller packets or
increase the max_allowed_packet flag value on both client
and server.
Whether there are transactions that are not being properly
committed using both "begin" and "commit". If so, update the client
application logic to commit the transaction properly.
Several utilities can help here: if you can, install mtr and tcpdump
to monitor the packets during these connection-increasing events.
It is strongly recommended to enable the general_log in the
database flags. Another suggestion is to also enable the slow_query_log
database flag and output to a file. Also have a look at this
GitHub issue comment and go through the list of additional
solutions proposed for this issue here
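
The retry-with-backoff workaround described above can be sketched in Python as follows. This is a minimal illustration, not the method used by any particular driver or ORM: `connect_with_backoff`, its parameters, and the default delays are all hypothetical, and `connect` stands for whatever zero-argument function opens your database connection. The exception types in `retry_on` are an assumption and should be replaced with the ones your driver actually raises.

```python
import random
import time


def connect_with_backoff(connect, max_attempts=5, base_delay=0.5, max_delay=30.0,
                         retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call connect() until it succeeds, sleeping with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Cap the exponential delay, then pick a random point in [0, cap]
            # so that many clients do not retry at the same instant.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

With the defaults, the wait before the second attempt is at most 0.5 s, before the third at most 1 s, and so on up to `max_delay`. The `sleep` parameter exists only so the delays can be observed or skipped in tests.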

This error message indicates a connection issue, either because your application doesn't terminate connections properly or because of a network issue.
As suggested in these troubleshooting steps for MySQL or PostgreSQL instances from the GCP docs, you can start debugging by checking that you follow best practices for managing database connections.
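
To tell a network problem apart from an application problem, a quick TCP reachability check from the client machine can help. The sketch below uses only the Python standard library; `tcp_port_open` is a hypothetical helper name, and port 3306 is assumed because it is the MySQL port mentioned in the other answer (a PostgreSQL instance would normally use 5432).

```python
import socket


def tcp_port_open(host, port=3306, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A result of False only says that the TCP handshake did not complete from this machine; it does not say whether the cause is a firewall, an unauthorized source IP, or the instance itself.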

Related

Cloud SQL Proxy connection times out occasionally

We use a single-tenant architecture for our instances. Each instance contains 3 Django apps, i.e. Django and Celery {worker, beat}, plus a few other things that don't interact with the database. We deploy cloudsql-proxy as a sidecar for these Django containers, which run as Pods in Google Kubernetes Engine. We are using Cloud SQL (Postgres 9.6) by Google, and it has a public IP address.
The problem is that we are getting Operational Errors on Django side i.e
OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
and when we check the Pod's logs at the same time when the OperationalError occurred we see the following log error from cloudsql-proxy container
couldn't connect to db_instance: dial tcp our_db_instance_public_ip:3307: connect: connection timed out
It is not that the connection to the database doesn't work. It works most of the time, but sometimes it throws the above errors, which is a pain because we run Celery tasks every other minute and they fail because of this. Sometimes this occurs while an end user is interacting with our application, and their request fails.
Our application isn't under very high load. We set the maximum number of connections for our database to 1000, and the peak number of connections is around 35 (the sum of all instances' connections). I checked the database stats and it seems pretty happy: CPU utilization almost never goes above 50%, disk is 30% used, and memory usage is around 50%.
I can provide more details if needed. Would appreciate any help!

Amazon Redshift: Queries never finish running after an idle period

I am working on a new Amazon Redshift database that I recently started.
I am experiencing an issue where, after I connect to the database, I can run queries without any problem. However, if I spend some time without running anything (around 5 minutes), the next query or command I try to run never finishes.
I am using dBeaver Community 21.2.2 to interact with the connection, and it stays on "Executing query" forever. The only way I can get it to work is by cancelling, disconnecting from Redshift, and connecting again, after which the query executes correctly. Then, once I stop using it for a few minutes, it happens all over again.
I thought this was a dBeaver issue, as we have a Metabase instance connected to this same cluster without any issues. But today I tried working with this cluster from R using RJDBC, and the same thing happens: I can run queries until I stop for a while, and then whatever I try to run next never finishes until I disconnect and connect again.
I'm sorry if I wasn't able to explain it clearly. I tried searching for similar issues but couldn't find any.
I suspect that the queries in question are not even being launched on the database. You can check this by reviewing svl_statementtext to see if the query is even being seen. Put a unique comment in the query to help determine if it is actually the query in question.
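
As a rough sketch of the tagging idea, the Python below prefixes a query with a unique comment and builds the matching lookup against svl_statementtext. The helper names `tag_query` and `lookup_sql` and the "probe-" prefix are made up for illustration, and the sketch assumes your client sends the leading comment to the server unchanged (some tools strip comments, in which case the lookup will find nothing even if the query ran).

```python
import uuid


def tag_query(sql):
    """Prefix sql with a unique comment; return (tag, tagged_sql)."""
    tag = "probe-" + uuid.uuid4().hex
    return tag, "/* " + tag + " */ " + sql


def lookup_sql(tag):
    """Build a query that searches svl_statementtext for the tagged statement."""
    # Matching on the leading comment keeps this lookup from matching itself.
    return ("SELECT starttime, text FROM svl_statementtext "
            "WHERE text LIKE '/* " + tag + " */%' ORDER BY starttime")
```

Run the tagged statement on the connection that hangs, then run the lookup from a fresh connection. No rows would suggest the statement never reached the database.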
Since I've seen similar behavior before I'll write up a possible way this can happen. In this case the queries were not being seen by the database or the connection to the database was being dropped mid execution. The cause is network switches and their configurations.
Typical network connections are fairly quick: you ask for a web page and it is given to you. Connection complete. When you click on a link, a new connection is established and also ends quickly. These network actions are atomic from a network connection point of view. Database connections are different. One connection is made, and many back-and-forth transmissions of data happen while the connection is open. This is no problem, and with the right set of network configurations these connections can stay open and idle for days.
The problem comes in when the operators of the network equipment decide that connections with no data flowing are "stale" after some fixed amount of time. They do this so that the network equipment can "forget" about these connections and focus on "active" ones. ISPs drop idle connections a lot so that they can handle the load of traffic and connections flowing through their equipment. This doesn't cause any issues for web pages and APIs, but database connections get clobbered.
When this happens, it looks exactly like what you describe. Both sides (client and database) think that the connection is still active, but the network equipment has dropped it. Nothing gets through, but no notification is sent to either party. You will likely see corresponding open sessions on the Redshift side for these dropped connections, with the database just waiting for the client to give a command on each of them. An administrator will need to go through and close (terminate) these sessions for them to go away.
Now the thing that doesn't align with experience is the speed at which these connections are being marked as "stale". In my case my ISP was closing connections that were idle for more than 30 min. You seem to be timing out much faster than this. In some cases corporate firewalls will be configured with short idle connection timeouts for routes out of the private network to the internet. So there are cases where the timeouts can be short. The networks at AWS do not have these timeouts so if your connections are completely within AWS then this isn't your answer.
To address this there are a few ways to go. The easy way is to set up a tunnel into AWS with "keep alive" packets sent every 30 sec or so. You will need an ec2 instance at AWS so it isn't cost free. Ssh tunneling is the usual tool for this and there are write-ups online for setting it up.
The hard way (but likely the most correct way) is to work with network experts to understand when the timeout is happening and why. If the timeout cannot be changed, it may be possible to configure a different network topology for your use case. Network peering or a VPN could address this.
In some cases you may be able to avoid JDBC or ODBC connections altogether. These protocols are valid, but they are old and most networking doesn't work this way anymore, which is why they suffer from these issues. The Redshift Data API lets you issue SQL to Redshift in a single package and check on completion later. These API calls are each independent connections, so there is no possibility of "timing out" between them. The downside is that this process is not interactive and therefore not supported by workbenches.
So does this match what you have going on?

How can I know whether the connection to Cassandra was lost with the C++ driver?

I'm wondering whether there is a way for me to know whether the connection to all Cassandra nodes was lost by the C++ driver.
My application has a proxy server which connects to Cassandra once and sits there forever. Other services will connect to that proxy server and send requests as required.
That works great, until all the connections to the Cassandra cluster are lost. In that special circumstance, the proxy does not seem to recover...
Is there a way for me to know/detect that all connections were lost and thus attempt a connect() again?
I have the same situation and the best solution is to check the error from the request/query you make to Cassandra.
I just tested this and if all connections are gone, the driver immediately returns:
CASS_ERROR_LIB_NO_HOSTS_AVAILABLE
That's the solution I'm going with myself, as I couldn't find a better one. It works for me because it doesn't wait for a specific timeout but returns immediately.

Auto failover multiple connections to mirror database when principal goes down

I have a principal database (server_A), mirror database (server_B), and a witness database (server_C). The databases are set up for automatic failover, that is, when server_A goes down or fails over, server_B assumes the role of the new principal database. The database quorum is set up correctly to the best of my knowledge.
I have written an application in C++ that connects to the database and gets a value to ensure a true connection. The application detects when the GetValue call fails and attempts to reconnect when the error occurs.
The issue is this:
When I have MULTIPLE connections to the database (two threads connected, once connected, it will get a value in a loop), when the failover occurs (stopping sql server on server A so server B will take over as principal), I detect the connection failure and destroy my connection and attempt to reconnect using the same connection string:
"Driver={SQL Native Client};Server=tcp:Server_A;Failover_Partner=tcp:Server_B;Database=SomeDatabase;Uid=SomeUser;Pwd=SomePassword;"
** NOTE **
I have verified that the failover has taken place by monitoring the databases.
Even though the connection to the database has been properly disposed of, I cannot reconnect to the database until I restart the application. Alternatively, if I bring server_A back online (now acting as the mirror database) and then fail over server_B (by shutting down SQL Server), making server_A the principal database again, the application can reconnect without having to close completely.
Though I could manipulate the connection string to make server_B the new principal and server_A the new Failover_Partner, this is not an ideal solution as many more connections will be utilized.
Keep in mind, this ONLY happens with multiple connections to the database. If I run the application with only one connection, all is fine and I can reconnect just fine when the failover occurs.
EDIT: If I connect in the beginning with multiple threads, all is fine. When I shutdown SQL Server, and therefore a failover occurs, I can reconnect only when I go through and delete ALL objects and re-instantiate new objects. Also, I am using SQL Native Client 11.0 (ODBC). Thoughts?
A lot of what you're describing is consistent with the issue described in KB 2605597 "Time-out error when a mirrored database connection is created by the .NET Framework data provider for SQLClient."
The KB describes problems when the connection timeout is set to 15 seconds. I have anecdotally heard of similar problems when the connection timeout is set to 0 (which isn't a good idea for other reasons; I mention it just in case).
This hotfix is applied to the application servers. If you want to rule this out as a possible cause, you could test raising the timeout (like it says in the workaround sections of the post) to make sure it's not the issue.
Later thought: The other thing I notice that is unusual is that you're specifying the TCP protocol in the connection string and the failover partner name. It's not clear to me from the documentation that it's supported in the failover partner name. You might want to try removing that and specifying the network attribute instead. (Recommended here.)
I do understand that you believe the issue isn't these things due to the single / multiple connections issue you've tested out.
However, I think you're better off simplifying the connection string so it's as consistent as possible with the published examples and making sure it's not the issues that people have commonly hit with this first. (The retry issue happens when there is latency, which can make it somewhat sporadic.)
Ok I have found the answer.
I had to modify the hosts file because my application did not reside in the same domain as the databases. Therefore when trying to fail over, I could not reach the database with the instance name (which is what the failover partner was cached as). I changed the hosts file to resolve the instance name to the ip address of the machine and it all works now.

Can Winsock connections randomly fail?

I have a blocking client/server connected locally via Winsock. The client uses firefox to retrieve data from websites, passing certain data along to the server for extra processing. The server always responds, and the processing can take anywhere from 1/10th second to a few minutes. The client has no winsock connection to anything but the server; all web data is retrieved to hard-drive via firefox.
This setup works quite well until, seemingly randomly, the client's recv returns -1 (SOCKET_ERROR) with error code 10054 (WSAECONNRESET). This means the server supposedly terminated connection, but the server is actually still waiting to recv as if nothing is wrong. The connection has failed in this way as early as 5 minutes in or after working for as long as about an hour and a half. The client sends about 10 different types of requests to the server, and failure has occurred on a variety of them. The frequency of requests is roughly constant, probably an average of 10-15 a minute. When the connection breaks, neither computer experiences internet problems and remote desktop does not disconnect.
Initially I thought memory leaks, but after extensive debugging I am reasonably certain no more exist. Firefox is engaged in considerable HTTP traffic at times, so I thought maybe that could be filling available socket bufferspace or something -- seems doubtful but at this point I'm really not sure. So, could it be more memory leaks, maybe a hidden buffer overrun, too much web traffic? What is causing my Winsock app to randomly fail?
Sounds like a firewall at work.
Many firewalls are configured to terminate idle connections (i.e. open TCP sessions on which no data is transferred for a while). This is especially common for HTTP connections, which are typically not persistent.
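
If an idle timeout like this turns out to be the cause, one common mitigation is to enable TCP keepalive so that the connection never looks idle to the firewall. The sketch below shows the idea in Python rather than Winsock C, purely for brevity; `enable_keepalive` and its default timings are illustrative assumptions, and the same `SO_KEEPALIVE` option is set through `setsockopt` in Winsock as well.

```python
import socket


def enable_keepalive(sock, idle=60, interval=30, probes=4):
    """Ask the OS to send TCP keepalive probes on an otherwise idle connection."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning constants below are platform-specific (all present on Linux);
    # where one is missing, the operating system default applies instead.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
```

The idle time has to be shorter than the firewall's timeout to be of any use. An application-level heartbeat message between client and server is an alternative when socket options cannot be changed.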