I am using the Python Cassandra driver offered by Datastax to connect to a single node Cassandra instance. My Python code spawns multiple processes (using the multiprocessing module), each of which opens a connection to this node, and shuts it down during exit.
Here's the behavior I observe: when the number of processes spawned is small (say around 30), my code runs flawlessly. But with a higher number I see errors like these (probably not surprising):
File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 755, in connect
self.control_connection.connect()
File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 1868, in connect
self._set_new_connection(self._reconnect_internal())
File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 1903, in _reconnect_internal
raise NoHostAvailable("Unable to connect to any servers", errors)
NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1': error(99, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Cannot assign requested address")})
Apparently, the host is unable to accept new connections. This looks like something the driver or Cassandra should take care of - by queuing new connection requests and granting them as capacity frees up.
How do I impose this behavior?
"Cannot assign requested address" can indicate that you're running out of local ports. This is not up to the driver -- it is a system configuration issue. Here is a good article about the problem (it refers to MySQL, but the issue is the same). Note that connections in TIME_WAIT state occupy local ports, and can linger beyond individual program runs.
The article discusses multiple solutions, including expanded port ranges, listening on multiple IP addresses, or changing application connection behavior. I would consider application behavior, and recommend running fewer processes. Depending on what you're trying to overcome with multiprocessing, you'd probably be best served using (process count) <= (machine cores) (this is the default behavior of multiprocessing.Pool).
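To illustrate that last point, here is a minimal sketch (assuming a single local node at 127.0.0.1, as in the question) that bounds concurrency with multiprocessing.Pool and gives each worker process one long-lived session, instead of opening a connection per spawned process:

# Sketch: pool sized to the CPU count; each worker builds its own
# Cluster/Session once, in the initializer, after the fork.
import multiprocessing
from cassandra.cluster import Cluster

_session = None  # per-process session, created once per worker

def init_worker():
    global _session
    cluster = Cluster(['127.0.0.1'])  # the single local node from the question
    _session = cluster.connect()

def do_work(task_id):
    # hypothetical workload; replace with your real query
    rows = list(_session.execute("SELECT release_version FROM system.local"))
    return task_id, rows[0].release_version

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count(),
                                initializer=init_worker)
    print(pool.map(do_work, range(100)))
    pool.close()
    pool.join()

With a fixed pool the number of simultaneous native connections stays bounded, so the local port range is not exhausted by short-lived processes.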
Since March 31st I have been getting the following error in Google Cloud SQL:
Got an error reading communication packets.
I have been using Google Cloud SQL for 2 years, but have never faced this problem before.
I'm very worried about it.
This is the detailed error message:
textPayload: "2019-04-29T17:21:26.007574Z 203385 [Note] Aborted connection 203385 to db: {db_name} user: {db_username} host: 'cloudsqlproxy~{private ip}' (Got an error reading communication packets)"
While it is true that this error message often occurs after a maintenance period, it isn't necessarily a cause for concern as this is a known behavior by MySQL.
Possible explanations for why this issue is happening are:
- A large increase of connection requests to the instance, with the number of active connections increasing over a short period of time.
- Freezing / unavailability of the instance, which can also occur due to a burst of connections happening in a very short time interval. It is observed that this freezing always happens with an increase of connection requests; the increase overloads the instance, leaving it unable to respond to further connection requests until the number of connections decreases or the instance stabilizes.
- The server was too busy to accept new connections.
- There were high rates of previous connections that were not closed correctly.
- The client terminated the connection abnormally.
- The readTimeout setting being set too low in the MySQL driver.
In an excerpt from the documentation, it is stated that: "There are many reasons why a connection attempt might not succeed. Network communication is never guaranteed, and the database might be temporarily unable to respond. Make sure your application handles broken or unsuccessful connections gracefully."
An outdated Cloud SQL Proxy version can also be the reason for such incidents. Upgrading to the latest version (v1.23.0 at the time of writing) can be a troubleshooting step.
The IP from which you are trying to connect may not be added to the Authorized Networks of the Cloud SQL instance.
Some possible workarounds for this issue, depending on your case, are the following:
In the case that the issue is related to a high load, you could retry the connection using an exponential backoff to avoid sending too many simultaneous connection requests. The best practice here is to exponentially back off your connection requests and add randomized backoff to avoid throttling and potentially overloading the instance (see the sketch after this paragraph). As a way to mitigate this issue in the future, it is recommended that connection requests be spaced out to prevent overloading. Note that, depending on how you are connecting to Cloud SQL, exponential backoff may already be applied by default by certain ORM packages.
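As a rough illustration (not tied to any particular driver or ORM), a retry loop with exponential backoff and randomized jitter could look like the following; connect_to_db is a hypothetical placeholder for your actual connection call:

# Sketch: retry a connection with exponential backoff plus random jitter.
import random
import time

def connect_with_backoff(connect_to_db, max_attempts=6, base_delay=0.5, max_delay=30.0):
    for attempt in range(max_attempts):
        try:
            return connect_to_db()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # "full jitter": sleep a random amount up to the exponential cap
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

The jitter spreads retries from many clients over time, so a burst of failures does not turn into a synchronized burst of reconnection attempts.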
If the issue could be related to an accumulation of long-running inactive connections, you can check whether that is your case by running SHOW FULL PROCESSLIST on your database and looking for connections with a high Time value or connections where Command is Sleep.
If this is your case you would have a few possible options:
If you are not using a connection pool, you could try to update the client application logic to properly close connections immediately at the end of an operation, or use a connection pool to limit your connections' lifetime. In particular, it is ideal to manage the connection count by using a connection pool: this way unused connections are recycled, and the number of simultaneous connection requests can be limited through the maximum pool size parameter.
If you are using a connection pool, you could return idle connections to the pool immediately at the end of an operation and set a shorter timeout by adjusting the wait_timeout or interactive_timeout flag values (a pooling sketch follows below). Set the Cloud SQL wait_timeout flag to 600 seconds to force refreshing connections.
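For example, with SQLAlchemy (assuming that is your pooling layer; other pools expose similar knobs, and the DSN below is hypothetical) you can cap the pool size and recycle connections well before the server-side timeout:

# Sketch: a bounded pool that recycles idle connections before the server
# closes them (wait_timeout of 600 s in the example above).
from sqlalchemy import create_engine, text

engine = create_engine(
    "mysql+pymysql://user:password@127.0.0.1/dbname",  # hypothetical DSN
    pool_size=5,         # steady-state connections kept in the pool
    max_overflow=2,      # extra connections allowed under burst load
    pool_recycle=280,    # recycle well before the server-side timeout
    pool_pre_ping=True,  # verify the connection is alive before using it
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # connection returns to the pool on exit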
To check network and port connectivity:
Step 1. Confirm TCP connectivity on port 3306 with tcptraceroute or netcat.
Step 2. If Step 1 succeeded, use the mysql client to check for any timeouts or errors.
When the client might be terminating the connection abruptly, you could check for the following: if the MySQL client or mysqld server receives a packet bigger than max_allowed_packet bytes, or the client receives a "packet too large" message, you could send smaller packets or increase the max_allowed_packet flag value on both client and server. If there are transactions that are not being properly committed (using both "begin" and "commit"), the client application logic needs to be updated to properly commit the transaction.
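On that last point, a minimal sketch (using PyMySQL purely as an example client, with hypothetical credentials and statement) of making sure every transaction ends in an explicit commit or rollback:

# Sketch: pair every transaction with an explicit commit/rollback so that
# no connection is left holding an open transaction.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="user", password="password",
                       database="dbname")  # hypothetical credentials
try:
    with conn.cursor() as cur:
        cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                    (10, 1))  # hypothetical statement
    conn.commit()        # make the change durable
except Exception:
    conn.rollback()      # never leave the transaction dangling
    raise
finally:
    conn.close()         # close the connection at the end of the operation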
There are several utilities that I think will be helpful here; if you can, install the mtr and tcpdump utilities to monitor the packets during these connection-increasing events. It is strongly recommended to enable general_log in the database flags. Another suggestion is to also enable the slow_query database flag and output it to a file. Also have a look at this GitHub issue comment and go through the list of additional solutions proposed for this issue here.
This error message indicates a connection issue, either because your application doesn't terminate connections properly or because of a network issue.
As suggested in these troubleshooting steps for MySQL or PostgreSQL instances from the GCP docs, you can start debugging by checking that you follow best practices for managing database connections.
I am using Datastax's c++ driver version 2.8.0 for Apache Cassandra inside a kubernetes application. Cassandra is deployed as a 3 node cluster via this Helm chart.
The chart leverages kubernetes' headless services to make the Cassandra endpoints available, so there is an entry in the kubernetes DNS for those endpoints.
I have a c++ app running in a kubernetes pod that interacts with Cassandra, and it connects using that DNS entry to resolve the endpoints. The application holds a single Cassandra connection object, following the driver usage guidelines. The connection is initialized at the beginning of the program, and a failure to initialize the connection, or to execute a query later on, will actually fail the program.
Everything is working fine, but Cassandra nodes/pods may eventually go down for some reason. When that happens, they are respawned but get reassigned a different IP. The c++ driver seems to be able to pick up the new endpoints from the DNS without any additional code. However, in such a situation the connection is not closed on the client side, and it looks like the previous endpoints remain in the connection pool at some level. This leads to a series of log events similar to the following:
1531920921.161 [WARN] (src/pool.cpp:420:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host XXX.XXX.XXX.XX because of the following error: Connection timeout
and
1531920921.894 [WARN] (src/pool.cpp:420:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host XXX.XXX.XXX.XX because of the following error: Connect error 'host is unreachable'
These pop up every [reconnect timeout]. The more IP reassignments, the more log messages, which, as you can guess, can add up to a pretty large number for long-lived applications.
Is there some feature of the driver's API that allows dealing with this? Or, more generally, a good/recommended way to handle it client side? One option, external to the driver, would be to reset the connection within the client code, but (although I may be missing something) I fail to see a way to "catch" such events: they only show up in the logs.
Can a high Aborted_clients value lead to "Host 'IP' is blocked because of many connection errors"? I want to know because this error blocks my Qt application from accessing the database server.
Error message:
QSqlDatabasePrivate::database: unable to open database: "Host 'IP' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts' QMYSQL: Unable to connect"
Also, can the Aborted_clients value increase the max_connect_errors count?
Thanks.
They are globally unrelated. max_connect_errors is one of the Server System Variables and works against a per-host counter, while Aborted_clients is one of the Server Status Variables, a global counter covering all clients/hosts.
Another reason they are not related: when a host's error counter is incrementing due to connect errors, but that host then establishes a successful connection, the error count for that host is cleared.
The per-host counter behind max_connect_errors is incremented when the host fails to complete the connection handshake with the server; once the threshold is reached, the host is blocked. If the handshake is not interrupted, it counts as a "success" and the host's counter is reset, regardless of whether the end result was a successful connection or not. So it can be considered a network health counter; note that it does not even strongly stand for security issues. You can test this by running telnet MyServer 3306 and then pressing CTRL+C instead of proceeding.
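The same telnet test can be reproduced in a few lines of Python: open a raw TCP connection to port 3306 and drop it before authenticating, which is exactly the kind of interrupted handshake described above (MyServer is the hypothetical host name from the telnet example):

# Sketch: simulate an interrupted handshake (the telnet + CTRL+C test above).
import socket

s = socket.create_connection(("MyServer", 3306), timeout=5)
s.recv(1024)  # read the server greeting packet
s.close()     # hang up without completing the handshake

Each run of this should count against the per-host error counter, while a normal, completed connection resets it.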
This counter can be cleared with mysqladmin flush-hosts, as in this post.
On the other hand, if a client successfully connects but later disconnects improperly or is terminated, the server increments the Aborted_clients counter.
This can be caused by many things: the client exited without calling mysql_close(), the client connection exceeded wait_timeout without interacting with the server, or the client connection was cut off, e.g. when the PC was turned off.
Server status variables provide information about server operation. They also include Aborted_connects, which is just a statistic for DBAs - not used by mysqld to determine server behavior.
I have a principal database (server_A), mirror database (server_B), and a witness database (server_C). The databases are set up for automatic failover, that is, when server_A goes down or fails over, server_B assumes the role of the new principal database. The database quorum is set up correctly to the best of my knowledge.
I have written an application in c++ to connect to the database and get a value to ensure a true connection. The application detects when a failure occurs on the GetValue call and attempts to reconnect when the error occurs.
The issue is this:
When I have MULTIPLE connections to the database (two threads connected; once connected, each gets a value in a loop) and the failover occurs (stopping SQL Server on server_A so server_B takes over as principal), I detect the connection failure, destroy my connection, and attempt to reconnect using the same connection string:
"Driver={SQL Native Client};Server=tcp:Server_A;Failover_Partner=tcp:Server_B;Database=SomeDatabase;Uid=SomeUser;Pwd=SomePassword;"
** NOTE **
I have verified that the failover has taken place by monitoring the databases.
Even though the connection to the database has been properly disposed of, I cannot reconnect to the database until I restart the application, OR until I bring server_A back online (now acting as the mirror database) and then fail over server_B (shutting down SQL Server), making server_A the principal database again; then the application can reconnect without having to completely close out.
Though I could manipulate the connection string to make server_B the new principal and server_A the new Failover_Partner, this is not an ideal solution as many more connections will be utilized.
Keep in mind, this ONLY happens with multiple connections to the database. If I run the application with only one connection, I can reconnect just fine when the failover occurs.
EDIT: If I connect in the beginning with multiple threads, all is fine. When I shutdown SQL Server, and therefore a failover occurs, I can reconnect only when I go through and delete ALL objects and re-instantiate new objects. Also, I am using SQL Native Client 11.0 (ODBC). Thoughts?
A lot of what you're describing is consistent with the issue described in KB 2605597 "Time-out error when a mirrored database connection is created by the .NET Framework data provider for SQLClient."
The KB describes problems when the connection timeout is set to 15 seconds; I have anecdotally heard of similar problems when the connection timeout is set to 0 (which isn't a good idea for other reasons; mentioning it just in case).
This hotfix is applied to the application servers. If you want to rule this out as a possible cause, you could test raising the timeout (like it says in the workaround sections of the post) to make sure it's not the issue.
Later thought: The other thing I notice that is unusual is that you're specifying the TCP protocol in the connection string and the failover partner name. It's not clear to me from the documentation that it's supported in the failover partner name. You might want to try removing that and specifying the network attribute instead. (Recommended here.)
I do understand that you believe the issue isn't these things due to the single / multiple connections issue you've tested out.
However, I think you're better off simplifying the connection string so it's as consistent as possible with the published examples and making sure it's not the issues that people have commonly hit with this first. (The retry issue happens when there is latency, which can make it somewhat sporadic.)
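Purely as an illustration of that simplification (the original app is C++; shown here with pyodbc, and Network=DBMSSOCN is assumed to be the usual keyword for selecting the TCP/IP network library), the connection string without the tcp: prefixes might look like this:

# Sketch: the same connection string, minus the tcp: prefixes, with the
# network library named explicitly instead.
import pyodbc

conn_str = (
    "Driver={SQL Native Client};"
    "Server=Server_A;Failover_Partner=Server_B;"
    "Network=DBMSSOCN;"  # TCP/IP network library, replacing the tcp: prefix
    "Database=SomeDatabase;Uid=SomeUser;Pwd=SomePassword;"
)
conn = pyodbc.connect(conn_str, timeout=30)  # longer login timeout, per the KB workaround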
Ok I have found the answer.
I had to modify the hosts file because my application did not reside in the same domain as the databases. Therefore when trying to fail over, I could not reach the database with the instance name (which is what the failover partner was cached as). I changed the hosts file to resolve the instance name to the ip address of the machine and it all works now.
We occasionally have a problem where we attempt to start the JRun service and it fails with the following two errors:
error JRun Naming Service unable to start on port 2902
java.net.BindException: Port in use by another service or process: 2902
info No JDBC data sources have been configured for this server (see jrun-resources.xml)
error java.net.BindException: Port in use by another service or process: 8300
We then have to reboot the machine and Jrun comes up with no problem. This is very intermittent - happens perhaps one out of every 10 times we restart Jrun services.
I saw another reference on Stack Overflow saying that if a Windows service takes longer than 30 seconds to restart, Windows shuts down the startup process. Perhaps that is the issue here? The logs indeed indicate that these errors are thrown about 37+ seconds after the restart command is issued.
We are on a 64-bit platform on Windows Server 2008.
Thanks!
We've been experiencing a similar problem on some of our servers. Unfortunately, netstat never indicated any sort of actual port conflict for us. My suspicion is that it's related to our recent deployment of a ColdFusion "cumulative hotfix" to our servers. We use the multi-server edition of CF 8.0.1 enterprise with a large number of instances on each machine -- each with its own JVM and its own distinct set of ports. Each CF instance is attached to its own IIS website and runs as its own Windows Service.
Within the past few weeks, we started getting similar "port in use" exceptions on startup, on our 32-bit machines as well as our 64-bit machines, all of which are running Windows Server 2003. I found several possible culprits and tried the following:
In jrun-jms.xml for each CF instance, there's an entry for the RMI transport layer that reads <port>0</port> -- which, according to the JRun documentation, means "choose a random port." I made that non-random and distinct per instance (in the 2600-2650 range) and restarted each instance. Things improved temporarily, perhaps coincidentally.
In the same file, under the entry for the TCPIP transport layer, every instance defaulted to <port>2522</port> -- so I changed those to distinct ports per instance in the 2500-2550 range and restarted each instance. That didn't seem to help at all.
I tried researching whether ports in the 2500-3000 range might be used for any other purpose, and I couldn't find anything obvious, and besides, netstat wasn't telling me that any of my choices were in use.
I found something online about Windows designating ports from 1024 to 5000 as the "dynamic port" range, so I added 10000 to the port numbers I had set in jrun-jms.xml and restarted each instance again. Still didn't help.
I tried changing the port in jndi.properties, also by adding 10000 to the port numbers. Unfortunately this meant wiping out all my wsconfig connections to IIS and creating them again from scratch. I had to edit wsconfig_jvm.config as well, adding -DWSConfig.PortScanStartPort=12900 to java.args, so it could detect my CF instances. (By default it only scans ports 2900-3000. See bpurcell.org for details. It's an old post but still relevant.) So far so good!
My best guess is that Adobe (or MS Windows) changed the way some of its code grabs "random" ports. But all I know for sure so far is that the steps outlined above appear to have fixed the problem.
Have you verified that the services are in fact stopping? Task manager should show no instances of jrun.exe. You can also check to see what is bound to that port by opening a command window and running
netstat -a -b
This will list all your open ports, plus what program is using them. You can also use
netstat -a -o
Which does the same thing as the above, but will list the process id instead of the program name. You can then cross-reference those with task manager. You'll need to enable showing the PIDs in task manager by going to View->Select Columns and making sure PID is checked. My guess would be that the jrun processes are not shutting down in a timely fashion.
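If the netstat output is hard to read, a quick bind test (a sketch; adjust the port list to your instances) also tells you whether a port is genuinely free:

# Sketch: try to bind the JRun ports locally; failure to bind means something
# (possibly a lingering jrun.exe) still holds the port.
import socket

for port in (2902, 8300):  # the ports from the error messages above
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("0.0.0.0", port))
        print("port %d is free" % port)
    except socket.error as err:
        print("port %d is in use: %s" % (port, err))
    finally:
        s.close()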