How to reconnect if AWS RDS recovery happens

Here is how I have written the code:
createPool is called once at the start of the app,
then getConnection is called for every request.
I am using AWS RDS, and it went into sudden recovery. My DB URL was unchanged, but the instance IP must have changed because the new instance was created in another AZ.
In such a scenario I am supposed to reinitialize my DB connection so that the new instance DNS is picked up.
The issue is that in this scenario I did not receive any timeout or connection error. So how do I capture this type of failure?
Kindly guide if possible.
Thanks

It is unclear from your description what exactly you have built, but it sounds like you've created a connection pool.
If you have opened a connection to the DB, then when you call getConnection you should validate that the connection is still active. Obviously, if the DB fails over, the existing connection will get closed, and you will either need to create a new connection or re-open the existing one.
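A minimal sketch of that validate-on-checkout idea, written in Python with SQLAlchemy for illustration (the question's createPool/getConnection calls suggest a Node-style mysql pool, but the pattern is the same): pool_pre_ping issues a lightweight test query before handing a connection out, so a connection killed by the failover is discarded and a fresh one, resolved against the current DNS, is opened in its place. The URL and driver below are placeholders, not the asker's actual setup.
# Sketch: validate pooled connections on checkout so a failover is detected.
# The connection URL and driver are placeholders.
from sqlalchemy import create_engine, text

engine = create_engine(
    "mysql+pymysql://user:password@mydb.abc123.us-east-1.rds.amazonaws.com/mydb",
    pool_pre_ping=True,   # ping the connection before use; replace it if it is dead
    pool_recycle=300,     # also recycle idle connections so stale endpoints age out
)

def handle_request():
    # Each request checks a connection out of the pool; a broken connection is
    # transparently discarded and a new one opened against the current DNS entry.
    with engine.connect() as conn:
        return conn.execute(text("SELECT 1")).scalar()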

Related

Spring Data Neo4J - Unable to acquire connection from pool within configured maximum time

We have a Reactive REST API using Spring Data Neo4j (Spring Boot v2.7.5) deployed to Kubernetes. When running a stress test to determine the breaking point, once the volume of requests that the service can handle has been exceeded, the service does not auto-recover, even after the load has dropped to a level that the service can handle.
After the service has fallen over the Neo4J health indicator shows:
“org.neo4j.driver.exceptions.ClientException: Unable to acquire connection from the pool within configured maximum time of 60000ms”
With respect to connection/configuration settings we are using defaults configured by SDN.
Observations:
Up until the point at which the service breaks, only a small number of connections are utilised; at the breaking point, the connections in use jump up to the max pool size and the above-mentioned error is observed. No matter how much time passes (even well beyond the max connection lifetime), the service is unable to acquire a connection from the pool. Upon manually shutting down and restarting the service/pod, the service returns to a healthy state.
As an interim solution we now check the Neo4J health indicator, if the mentioned error is present the liveness state is set to down which triggers Kubernetes to restart the service automatically. However, I’m wondering if there is an underlying issue with the connections in the pool not getting ‘cleaned up’?
You can take a look at this discussion https://github.com/spring-projects/spring-data-neo4j/issues/2632
I had the same issue. The problem is that either Spring Framework or the Neo4j reactive transaction manager doesn't close connections in a complex reactive flow, e.g. when there are a lot of inner calls/mappings and an exception is thrown somewhere inside.
So as a workaround you can add @Transactional in such places to avoid multiple transactions being created.

Google Cloud Composer Airflow sqlalchemy OperationalError causing DAG to hang forever

I have a bunch of tasks within a Cloud Composer Airflow DAG, one of which is a KubernetesPodOperator. This task seems to get stuck in the scheduled state forever and so the DAG runs continuously for 15 hours without finishing (it normally takes about an hour). I have to manually mark it failed for it to end.
I've set the DAG timeout to 2 hours but it does not make any difference.
The Cloud Composer logs show the following error:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server:
Connection refused
Is the server running on host "airflow-sqlproxy-service.default.svc.cluster.local" (10.7.124.107)
and accepting TCP/IP connections on port 3306?
The error log also gives me a link to this documentation about that error type: https://docs.sqlalchemy.org/en/13/errors.html#operationalerror
When the DAG is next triggered on schedule, it works fine without any fix required. This issue happens intermittently; we've not been able to reproduce it.
Does anyone know the cause of this error and how to fix it?
The issue is related to SQLAlchemy using a session per thread and creating a callable session that can be used later in the Airflow code. If there is a significant delay between queries on a session, MySQL may close the connection; the connection timeout is set to approximately 10 minutes.
Solutions:
Use the airflow.utils.db.provide_session decorator. This decorator provides a valid session to the Airflow database in the session parameter and closes the session at the end of the function (see the sketch after these suggestions).
Do not use a single long-running function. Instead, move all database queries to separate functions decorated with airflow.utils.db.provide_session, so that sessions are automatically closed after retrieving query results.
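A minimal sketch of the provide_session pattern described above. The import path matches the answer (Airflow 1.10-era; in newer Airflow versions the decorator lives in airflow.utils.session), and the query itself is only an illustrative example.
# Sketch: let Airflow inject and close the DB session for each short-lived function.
from airflow.models import TaskInstance
from airflow.utils.db import provide_session

@provide_session
def count_running_tasks(dag_id, session=None):
    # 'session' is injected by the decorator and closed when the function
    # returns, so no database connection is held across long idle periods.
    return (
        session.query(TaskInstance)
        .filter(TaskInstance.dag_id == dag_id, TaskInstance.state == "running")
        .count()
    )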

Getting ec2 instance ID suddenly stopped working

I've been getting an amazon instance ID from within the instance itself for over a year now by hitting this local web address http://169.254.169.254/latest/meta-data/instance-id. This is the appropriate method according to the AWS documentation. For some reason though, just this week that same call started throwing an error.
I tried pinging the 169.254.169.254 address from the command line and that fails, so it seems like something pretty basic has changed with the EC2 instances. I don't see any changes to the documentation on AWS. One thing I do notice is that I used to see the instance name in the upper right hand corner when loading up the instance and logging in remotely. That information doesn't appear anymore.
Here is the code I've been using to get the ID:
retID = New StreamReader(HttpWebRequest.Create("http://169.254.169.254/latest/meta-data/instance-id").GetResponse().GetResponseStream()).ReadToEnd()
Here is the full error stack:
at System.Net.HttpWebRequest.GetResponse()
at RunControllerInterface.NewRunControlCommunicate.getInstanceIDFromAmazon()
The error message itself says: Unable to connect to the remote server
Any help would be appreciated.
So I think I have at least a partial answer to this problem. When I made this image, I was using a t3a.medium instance. As long as I use that same instance type, I am able to pull down the instance ID.
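For what it's worth, a minimal Python sketch of the same metadata call (the requests library is assumed to be available); the explicit timeout makes the "unable to connect" case fail fast instead of hanging:
# Sketch: fetch the EC2 instance ID from the instance metadata service.
import requests

resp = requests.get(
    "http://169.254.169.254/latest/meta-data/instance-id",
    timeout=2,  # the metadata service is link-local, so a slow response indicates a problem
)
resp.raise_for_status()
print(resp.text)  # e.g. i-0123456789abcdef0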

[Amazon](500150) Error setting/closing connection: Connection refused

I have a Glue script which is supposed to write its result in a Redshift table in a for loop.
After many hours of processing it raises this exception:
Py4JJavaError: An error occurred while calling o11362.pyWriteDynamicFrame.
: java.sql.SQLException: [Amazon](500150) Error setting/closing connection: Connection refused.
Why am I getting this exception?
It turns out that Redshift clusters have a maintenance window in which they are rebooted. This event, of course, causes the Glue job to fail when it attempts to write to a table on that cluster.
It may be useful to reschedule the maintenance window: https://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-console.html
This error can occur for many reasons. I'm sure after a few google searches you've found that the most common cause of this is improper security group settings for your cluster (make sure your inbound settings are correct).
I would suggest that you make sure you're able to create a connection for even a short period of time before you try this longer process. If you are able to do so, then I bet the issue is that your connection is closing out after a timeout (since your process is so long). To solve this, you should look into connection pooling, which involves creating an instance of a connection and constantly checking to ensure it is still alive, thus allowing a process to continuously use the cluster connection.
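A minimal sketch of that keep-alive idea in Python, assuming a direct psycopg2 connection to the cluster (Redshift speaks the PostgreSQL protocol; the endpoint, credentials and table name are placeholders): before each write the connection is probed with a trivial query and reopened if the cluster has dropped it.
# Sketch: check that the Redshift connection is still alive before each write,
# and reconnect if the cluster closed it (e.g. during a maintenance reboot).
# The endpoint, credentials and table name are placeholders.
import psycopg2

def connect():
    return psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="mydb", user="user", password="password",
    )

def ensure_alive(conn):
    """Return a usable connection, reopening it if the old one was dropped."""
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")  # cheap liveness probe
        return conn
    except (psycopg2.OperationalError, psycopg2.InterfaceError):
        return connect()

conn = connect()
for batch_id in range(100):
    conn = ensure_alive(conn)
    with conn.cursor() as cur:
        cur.execute("INSERT INTO results_table (batch_id) VALUES (%s)", (batch_id,))
    conn.commit()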

Modify mirroring_partner_instance name in sys.database_mirroring

I have an issue with correctly failing over to the mirror database. When I am connected to the principal database (dbx) (mirroring is enabled and set up) and I fail over the principal database (shutting down SQL Server to simulate a crash), I can no longer send queries without a failure. This is expected since the previous connection is now lost.
I would like to simply close out my connections and handles and re-establish a new connection, using the same connection string, and re-connect to the mirror database (dby, which is now the principal database).
My connection string is as follows:
Driver={SQL Native Client};Server=dbx;Failover_Partner=dby;Database=db;Uid=uid;Pwd=pwd;Network=DBMSSOCN;
From doing research, I have learned that the Failover_Partner parameter in the connection string is almost worthless. It is only used when the principal server is down and a new connection is being made for the first time. For some reason, the Failover_Partner is overwritten internally when a connection is established to the principal, and the mirroring_partner_instance found in the sys.database_mirroring table is used instead. So when I specify the Failover_Partner to be dby and then establish a connection, querying for what it thinks the failover partner is returns the INSTANCE name of the failover partner and not the DNS name (dby).
Here is the issue, I cannot use the INSTANCE name as the failover partner. I am required to use the DNS name as the failover partner.
So my question(s) is/are this:
Is there a way to modify the sys.database_mirroring entry and change the mirroring_partner_instance?
Where does this field get its value from?
Is there any other way to force SQL Server to use the DNS name and NOT the INSTANCE name?
I found the answer to this question in case anyone has the same or a similar issue.
I had to modify the @@SERVERNAME property in SQL Server. It was internally set to the computer's WIN-... instance name, and I was able to drop it and add the server name I wanted with the following commands:
-- Get the current server name (returns something like WIN-NAME)
SELECT @@SERVERNAME
-- Drop the WIN-NAME entry and register the server name you want
sp_dropserver 'WIN-NAME'
GO
sp_addserver 'SERVER_NAME', 'local'
GO
Restart SQL Server for the changes to take effect.