What does External_threads_connected mean in Aurora MySQL 5.6.10

We have a max_connections limit of 3000 on RDS db.r4.2xlarge, but when our application uses around 2000 connections we get a "Too many connections" error, which is quite surprising to me. When I run this query:
show status like '%onn%';
I get a response containing something like:
'External_threads_connected', '1102'
We are using an Aurora cluster with a reader and a writer.
Do I need to count this value among my active connections, i.e. does it contribute to the max_connections limit?
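
As a first diagnostic step, it can help to compare the server's own connection counter against the configured limit on each endpoint. A minimal JDBC sketch, assuming a placeholder Aurora endpoint and credentials:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ConnectionHeadroom {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials; run this against both the
        // writer and the reader endpoints, since each instance keeps its own counters.
        String url = "jdbc:mysql://my-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com:3306/mydb";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement st = conn.createStatement()) {
            // Threads_connected = client connections currently open on this instance.
            try (ResultSet rs = st.executeQuery("SHOW STATUS LIKE 'Threads_connected'")) {
                while (rs.next()) System.out.println(rs.getString(1) + " = " + rs.getString(2));
            }
            // max_connections = the server-side limit that raises "Too many connections".
            try (ResultSet rs = st.executeQuery("SHOW VARIABLES LIKE 'max_connections'")) {
                while (rs.next()) System.out.println(rs.getString(1) + " = " + rs.getString(2));
            }
        }
    }
}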

Related

Spring Data Neo4J - Unable to acquire connection from pool within configured maximum time

We have a Reactive REST API using Spring Data Neo4j (Spring Boot v2.7.5) deployed to Kubernetes. When running a stress test to determine the breaking point, once the volume of requests that the service can handle has been breached, the service does not auto-recover, even after the load has dropped to a level the service can handle.
After the service has fallen over, the Neo4j health indicator shows:
"org.neo4j.driver.exceptions.ClientException: Unable to acquire connection from the pool within configured maximum time of 60000ms"
With respect to connection/configuration settings, we are using the defaults configured by SDN.
Observations:
Up until the point at which the service breaks, only a small number of connections are utilised; at the breaking point, the number of connections in use jumps up to the max pool size and the above-mentioned error is observed. No matter how much time passes (even well beyond the max connection lifetime), the service is unable to acquire a connection from the pool. Upon manually shutting down and restarting the service/pod, the service returns to a healthy state.
As an interim solution we now check the Neo4j health indicator; if the mentioned error is present, the liveness state is set to down, which triggers Kubernetes to restart the service automatically. However, I'm wondering if there is an underlying issue with the connections in the pool not getting 'cleaned up'?
You can take a look at this discussion https://github.com/spring-projects/spring-data-neo4j/issues/2632
I had the same issue. The problem is that either Spring Framework or the Neo4j reactive transaction manager doesn't close connections in a complex reactive flow, e.g. when there are a lot of inner calls/mappings and an exception is thrown somewhere inside.
So as a workaround you can add @Transactional in such places to avoid multiple transactions being created.
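
As an illustration only, a sketch of where the annotation might go in a reactive service; the Event and EventRepository types here are hypothetical stand-ins for real application code:

import org.springframework.data.repository.reactive.ReactiveCrudRepository;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import reactor.core.publisher.Flux;

// Hypothetical domain type and reactive repository, for illustration only.
class Event {
    private Long id;
    private String name;
    public String getName() { return name; }
}

interface EventRepository extends ReactiveCrudRepository<Event, Long> {}

@Service
class EventService {
    private final EventRepository repository;

    EventService(EventRepository repository) {
        this.repository = repository;
    }

    // Wrapping the whole reactive flow in a single transaction, so the inner
    // calls/mappings share one connection that is released even when an inner
    // step throws, instead of each step opening its own transaction.
    @Transactional
    public Flux<String> eventNames() {
        return repository.findAll().map(Event::getName);
    }
}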

Google Cloud Composer Airflow sqlalchemy OperationalError causing DAG to hang forever

I have a bunch of tasks within a Cloud Composer Airflow DAG, one of which is a KubernetesPodOperator. This task seems to get stuck in the scheduled state forever and so the DAG runs continuously for 15 hours without finishing (it normally takes about an hour). I have to manually mark it failed for it to end.
I've set the DAG timeout to 2 hours but it does not make any difference.
The Cloud Composer logs show the following error:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server:
Connection refused
Is the server running on host "airflow-sqlproxy-service.default.svc.cluster.local" (10.7.124.107)
and accepting TCP/IP connections on port 3306?
The error log also gives me a link to this documentation about that error type: https://docs.sqlalchemy.org/en/13/errors.html#operationalerror
When the DAG is next triggered on schedule, it works fine without any fix required. This issue happens intermittently, we've not been able to reproduce it.
Does anyone know the cause of this error and how to fix it?
The reason behind the issue is that SQLAlchemy uses one session per thread and creates a callable session that can be used later in the Airflow code. If there is enough of a delay between queries on a session, MySQL may close the connection. The connection timeout is set to approximately 10 minutes.
Solutions:
Use the airflow.utils.db.provide_session decorator. This decorator provides a valid session to the Airflow database in the session parameter and closes the session at the end of the function.
Do not use a single long-running function. Instead, move all database queries to separate functions, so that there are multiple functions with the airflow.utils.db.provide_session decorator. In this case, sessions are automatically closed after retrieving query results.

How to reconnect if AWS RDS recovery happens

How I have written the code:
createPool is used at the start of the app;
then, for every request, I am using getConnection.
I am using AWS RDS, and it went into sudden recovery mode. My DB URL was unchanged, but the instance IP must have changed, as the instance was recreated in another AZ.
In such a scenario I am supposed to reinitialize my DB connection so that the new instance's DNS is picked up.
The issue is that in this scenario I did not receive any timeout or connection error. So how do I capture this type of error?
Kindly guide if possible.
Thanks
It is unclear from your description what exactly you have built, but it sounds like you've created a connection pool.
If you open a connection to the DB, the first time you call getConnection you should validate that the connection is still active: if the DB fails over, the existing connection will be closed, and you will need to either create a new connection or re-open the existing one.
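
As a rough sketch of that validation step in JDBC (the endpoint, credentials, and single cached connection here are placeholders for a real pool):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ValidatingDataSource {
    private final String url;      // placeholder RDS endpoint
    private final String user;
    private final String password;
    private Connection cached;     // stand-in for a real pool slot

    public ValidatingDataSource(String url, String user, String password) {
        this.url = url;
        this.user = user;
        this.password = password;
    }

    // Validate the cached connection before handing it out. After an RDS
    // failover the old socket is dead, so isValid() fails and we reconnect,
    // which also re-resolves the DNS name to the new instance.
    public synchronized Connection getConnection() throws SQLException {
        if (cached == null || !cached.isValid(2 /* seconds */)) {
            if (cached != null) {
                try { cached.close(); } catch (SQLException ignored) {}
            }
            cached = DriverManager.getConnection(url, user, password);
        }
        return cached;
    }
}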

mysql lost connection error

Currently, I am working on a project that integrates MySQL with an IOCP server to collect sensor data and verify the collected data from the client.
However, there are situations where MySQL drops a connection.
The queries themselves are simple ones that insert a single row of records or get the average value between date intervals.
The data from each sensor flows into the DB at the same time, every 5 seconds. When the sensors' messages arrive sporadically or overlap with the client's message, the connection is dropped with:
Lost connection to MySQL server during query
In response to this message I changed max_allowed_packet, as well as interactive_timeout, net_read_timeout, net_write_timeout, and wait_timeout.
It seems that the error occurs when there are overlapping queries.
Please let me know if you know the solution.
I had a similar issue on a MySQL server with very simple queries where the number of concurrent queries was high. I had to disable the query cache to solve the issue. You could try disabling the query cache using the following statements:
SET GLOBAL query_cache_size = 0;
SET GLOBAL query_cache_type = 0;
Please note that a server restart will enable the query cache again. Put these settings in the MySQL configuration file if you need them preserved.
Can you run the command below and check the current timeouts?
SHOW VARIABLES LIKE '%timeout';
You can change a timeout if needed:
SET GLOBAL <timeout_variable> = <value>;

web service best practice - server timeout longer than http client timeout

I am trying to build a web service on top of HBase, so the code looks roughly like:
@GET
@Path("/blabla")
@Override
public List<String> getEvents($$$params$$$) {
    ......
    // calling hbase to query the events
    ......
}
When the HBase service is down, the HBase Java API keeps retrying to connect to the HBase region server until it eventually times out and throws a RuntimeException:
NoServerForRegionException: Unable to find region for event,,99999999999999 after 10 tries.
The logic has no problem; my issue here is that the HttpClient times out well before HBase finishes its retries. Then my web service API consumer gets no response, which is ugly.
Question -
What's the best practice when the server's timeout is potentially longer than the HTTP connection itself? How can the web service respond to the client gracefully in this case?
Set the caching for your Scan object to some reasonable value. Another thing: since you are using a web service to show the results to your users, I am assuming that you must be showing only a few rows (or records) at a time. You can use HBase's PageFilter so that you get only a specified number of rows each time and don't have to wait for all the rows to come back in one shot.
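
A minimal sketch of both suggestions with the HBase Java client (the table name and page size are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.PageFilter;

public class PagedEventScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("event"))) { // placeholder table
            Scan scan = new Scan();
            scan.setCaching(100);               // rows fetched per RPC instead of one at a time
            scan.setFilter(new PageFilter(25)); // cap rows per region server for this scan;
                                                // the client may still see slightly more in total
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(result);
                }
            }
        }
    }
}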