Auto failover of multiple connections to the mirror database when the principal goes down - C++

I have a principal database (server_A), mirror database (server_B), and a witness database (server_C). The databases are set up for automatic failover, that is, when server_A goes down or fails over, server_B assumes the role of the new principal database. The database quorum is set up correctly to the best of my knowledge.
I have written an application in C++ that connects to the database and gets a value to verify a true connection. The application detects when a failure occurs on the GetValue call and attempts to reconnect when the error occurs.
The issue is this:
When I have MULTIPLE connections to the database (two threads, each of which gets a value in a loop once connected) and the failover occurs (I stop SQL Server on server_A so server_B takes over as principal), I detect the connection failure, destroy my connection, and attempt to reconnect using the same connection string:
"Driver={SQL Native Client};Server=tcp:Server_A;Failover_Partner=tcp:Server_B;Database=SomeDatabase;Uid=SomeUser;Pwd=SomePassword;"
** NOTE **
I have verified that the failover has taken place by monitoring the databases.
Even though the connection to the database has been properly disposed of, I cannot reconnect to the database until I restart the application, OR until I bring server_A back online (now acting as the mirror database) and then fail over server_B (by shutting down SQL Server) so that server_A becomes the principal database again, at which point the application can reconnect without having to completely close out.
Though I could manipulate the connection string to make server_B the new principal and server_A the new Failover_Partner, this is not an ideal solution as many more connections will be utilized.
Keep in mind, this ONLY happens with multiple connections to the database. If I run the application with only one connection, all is fine and I can reconnect just fine when the failover occurs.
EDIT: If I connect in the beginning with multiple threads, all is fine. When I shut down SQL Server, and therefore a failover occurs, I can reconnect only after I go through and delete ALL objects and re-instantiate new objects. Also, I am using SQL Native Client 11.0 (ODBC). Thoughts?
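For illustration, a rough sketch of the teardown-and-reconnect pattern described above (plain ODBC calls with simplified error handling; the global handles and connection string are placeholders, not the actual application code):

    #include <windows.h>
    #include <sql.h>
    #include <sqlext.h>

    SQLHENV hEnv = SQL_NULL_HENV;
    SQLHDBC hDbc = SQL_NULL_HDBC;

    // Tear everything down and rebuild the connection from scratch,
    // so no stale handle state survives the failover.
    bool Reconnect(const char* connStr)
    {
        if (hDbc != SQL_NULL_HDBC) {
            SQLDisconnect(hDbc);
            SQLFreeHandle(SQL_HANDLE_DBC, hDbc);
            hDbc = SQL_NULL_HDBC;
        }
        if (hEnv != SQL_NULL_HENV) {
            SQLFreeHandle(SQL_HANDLE_ENV, hEnv);
            hEnv = SQL_NULL_HENV;
        }

        // Re-allocate the environment and connection handles.
        if (!SQL_SUCCEEDED(SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &hEnv)))
            return false;
        SQLSetEnvAttr(hEnv, SQL_ATTR_ODBC_VERSION, (SQLPOINTER)SQL_OV_ODBC3, 0);
        if (!SQL_SUCCEEDED(SQLAllocHandle(SQL_HANDLE_DBC, hEnv, &hDbc)))
            return false;

        // Reconnect with the same connection string shown above.
        SQLRETURN rc = SQLDriverConnect(hDbc, NULL, (SQLCHAR*)connStr, SQL_NTS,
                                        NULL, 0, NULL, SQL_DRIVER_NOPROMPT);
        return SQL_SUCCEEDED(rc);
    }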

A lot of what you're describing is consistent with the issue described in KB 2605597 "Time-out error when a mirrored database connection is created by the .NET Framework data provider for SQLClient."
The KB describes problems when the connection timeout is set to 15 seconds; I have anecdotally heard of similar problems when the connection timeout is set to 0 (which isn't a good idea for other reasons; mentioning it just in case).
The hotfix is applied to the application servers. If you want to rule this out as a possible cause, you could test raising the timeout (as described in the workaround section of the KB) to make sure it's not the issue.
Later thought: the other thing I notice that is unusual is that you're specifying the TCP protocol both in the server name and in the failover partner name of the connection string. It's not clear to me from the documentation that this is supported in the failover partner name. You might want to try removing it and specifying the Network attribute instead. (Recommended here.)
I do understand that you believe the issue isn't these things due to the single / multiple connections issue you've tested out.
However, I think you're better off simplifying the connection string so it's as consistent as possible with the published examples and making sure it's not the issues that people have commonly hit with this first. (The retry issue happens when there is latency, which can make it somewhat sporadic.)
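For example, a simplified string along these lines might be worth testing (same placeholder names as in the question; the Network=DBMSSOCN attribute requests the TCP/IP socket library -- shown only as a hedged illustration, adjust for your environment):
"Driver={SQL Native Client};Server=Server_A;Failover_Partner=Server_B;Network=DBMSSOCN;Database=SomeDatabase;Uid=SomeUser;Pwd=SomePassword;"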

OK, I have found the answer.
I had to modify the hosts file because my application did not reside in the same domain as the databases. Therefore, when trying to fail over, I could not reach the database by the instance name (which is what the failover partner was cached as). I changed the hosts file to resolve the instance name to the IP address of the machine, and it all works now.
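For anyone hitting the same thing, the hosts file entry (on Windows, %SystemRoot%\System32\drivers\etc\hosts) is just an IP-to-name mapping, something like the following (hypothetical address and name):

    10.0.0.42    Server_B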

Related

Amazon Redshift: Queries never finish running after an idle period

I am working on a new Amazon Redshift database that I recently started.
I am experiencing an issue where, after I connect to the database, I can run queries without any issue. However, if I spend some time without running anything (say, 5 minutes), when I try running another query or command, it never finishes.
I am using DBeaver Community 21.2.2 to interact with the connection, and it stays "Executing query" forever. The only way I can get it to work is by cancelling, disconnecting from Redshift, and connecting again; then it executes correctly. Until I stop using it for a few minutes, and then it happens all over again.
I thought this was a DBeaver issue, as we have a Metabase instance connected to this same cluster without any issues. But today I tried manipulating this cluster with R using RJDBC, and the same thing happens: I can run queries until I stop, and then when I try running something else it never finishes, until I disconnect and connect again.
I'm sorry if I wasn't able to explain it clearly; I tried searching for similar issues but couldn't find any.
I suspect that the queries in question are not even being launched on the database. You can check this by reviewing svl_statementtext to see if the query is even being seen. Put a unique comment in the query to help determine if it is actually the query in question.
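For example, something along these lines (the comment tag is an arbitrary marker you add to the query you submit; svl_statementtext stores the submitted SQL text):

    -- Tag the query on the client side, e.g.:
    --   /* probe-check-1 */ select count(*) from my_table;
    -- Then look for that tag on the database side:
    select starttime, pid, text
    from svl_statementtext
    where text like '%probe-check-1%'
    order by starttime desc;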
Since I've seen similar behavior before, I'll write up a possible way this can happen. In that case the queries were not being seen by the database, or the connection to the database was being dropped mid-execution. The cause was network switches and their configurations.
Typical network connections are fairly quick - you ask for a web page and it is given to you. The connection is complete. When you click on a link, a new connection is established and also ends quickly. These network actions are atomic from a network connection point of view. However, database connections are different. One connection is made and many back-and-forth transmissions of data happen while the connection is open. No problem - with the right set of network configurations these connections can be open and idle for days.
The problem comes in when the operators of the network equipment decide that connections with no data flowing are "stale" after some fixed amount of time. They do this so that the network equipment can "forget" about these connections and focus on "active" connections. ISPs drop idle connections a lot so that they can handle the load of traffic and connections that flow through their equipment. This doesn't cause any issues for web pages and APIs, but database connections get clobbered.
When this happens, it looks exactly like what you describe. Both sides (client and database) think that the connection is still active, but the network equipment has dropped the connection. Nothing gets through, but no notification is sent to either party. You will likely see corresponding open sessions on the Redshift side for these dropped connections, and the database is just waiting for the client to give a command on each of them. An administrator will need to go through and close (terminate) these sessions for them to go away.
Now, the thing that doesn't align with my experience is the speed at which these connections are being marked as "stale". In my case my ISP was closing connections that were idle for more than 30 min. You seem to be timing out much faster than this. In some cases corporate firewalls are configured with short idle-connection timeouts for routes out of the private network to the internet, so there are cases where the timeouts can be short. The networks at AWS do not have these timeouts, so if your connections are completely within AWS then this isn't your answer.
To address this there are a few ways to go. The easy way is to set up a tunnel into AWS with "keep alive" packets sent every 30 seconds or so. You will need an EC2 instance at AWS, so it isn't cost free. SSH tunneling is the usual tool for this, and there are write-ups online for setting it up.
The hard way (but likely the most correct way) is to work with network experts to understand when the timeout is happening and why. If the timeout cannot be changed, then it may be possible to configure a different network topology for your use case. Network peering or a VPN could address this.
In some cases you may be able to avoid JDBC or ODBC connections altogether. These protocols are valid, but they are old, and most networking doesn't work this way anymore, which is why they suffer from these issues. The Redshift Data API lets you issue SQL to Redshift in a single package and check on completion later on. These API calls are each independent connections, so there is no possibility of "timing out" between them. The downside is that this process is not interactive and is therefore not supported by workbenches.
So does this match what you have going on?

Got an error reading communication packets in Google Cloud SQL

Since 31st March I have been getting the following error in Google Cloud SQL:
Got an error reading communication packets.
I have been using Google Cloud SQL for 2 years, but have never faced such a problem.
I'm very worried about it.
This is the detailed error message:
textPayload: "2019-04-29T17:21:26.007574Z 203385 [Note] Aborted connection 203385 to db: {db_name} user: {db_username} host: 'cloudsqlproxy~{private ip}' (Got an error reading communication packets)"
While it is true that this error message often occurs after a maintenance period, it isn't necessarily a cause for concern as this is a known behavior by MySQL.
Possible explanations for why this issue is happening are:
A large increase in connection requests to the instance, with the number of active connections increasing over a short period of time.
The freezing / unavailability of the instance can also occur due to a burst of connections happening in a very short time interval. It is observed that this freezing always happens with an increase of connection requests. This increase in connections causes the instance to be overloaded and hence unavailable to respond to further connection requests until the number of connections decreases or the instance stabilizes.
The server was too busy to accept new connections.
There were high rates of previous connections that were not closed correctly.
The client terminated the connection abnormally.
The readTimeout setting being set too low in the MySQL driver.
In an excerpt from the documentation, it is stated that:
There are many reasons why a connection attempt might not succeed. Network communication is never guaranteed, and the database might be temporarily unable to respond. Make sure your application handles broken or unsuccessful connections gracefully.
A low Cloud SQL Proxy version can also be the reason for such incidents. Upgrading to the latest version (v1.23.0) can be a troubleshooting step.
The IP from which you are trying to connect may not be added to the Authorized Networks of the Cloud SQL instance.
Some possible workarounds for this issue, depending on your case, could be one of the following:
In the case that the issue is related to a high load, you could retry the connection, using an exponential backoff to prevent sending too many simultaneous connection requests. The best practice here is to exponentially back off your connection requests and add randomized backoffs to avoid throttling and potentially overloading the instance (see the sketch after this list). As a way to mitigate this issue in the future, it is recommended that connection requests be spaced out to prevent overloading. Depending on how you are connecting to Cloud SQL, exponential backoffs may already be in use by default with certain ORM packages.
If the issue could be related to an accumulation of long-running inactive connections, you can check whether that is your case by running show full processlist on your database and looking for connections with a high Time or connections where Command is Sleep.
If this is your case you have a few possible options:
If you are not using a connection pool, you could update the client application logic to properly close connections immediately at the end of an operation, or use a connection pool to limit your connections' lifetime. In particular, it is ideal to manage the connection count by using a connection pool. This way unused connections are recycled, and the number of simultaneous connection requests can be limited through the maximum pool size parameter.
If you are using a connection pool, you could return idle connections to the pool immediately at the end of an operation and set a shorter timeout by adjusting the wait_timeout or interactive_timeout flag values. Setting the Cloud SQL wait_timeout flag to 600 seconds forces connections to be refreshed.
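As a rough illustration of the retry-with-backoff idea (a generic C++ sketch, not specific to any driver; TryConnect is a placeholder for whatever opens the connection in your client):

    #include <algorithm>
    #include <chrono>
    #include <random>
    #include <thread>

    // Placeholder: returns true when a connection has been established.
    bool TryConnect();

    bool ConnectWithBackoff(int maxAttempts = 6)
    {
        std::mt19937 rng(std::random_device{}());
        double delayMs = 250.0;  // initial backoff window
        for (int attempt = 0; attempt < maxAttempts; ++attempt) {
            if (TryConnect()) return true;
            // Randomized (jittered) delay to avoid synchronized retry storms.
            std::uniform_real_distribution<double> jitter(0.0, delayMs);
            std::this_thread::sleep_for(
                std::chrono::milliseconds(static_cast<long long>(jitter(rng))));
            delayMs = std::min(delayMs * 2.0, 30000.0);  // double, capped at 30 s
        }
        return false;
    }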
To check the network and port connectivity:
Step 1. Confirm TCP connectivity on port 3306 with tcptraceroute or netcat.
Step 2. If Step 1 succeeded, then try using the mysql client to check for timeout errors.
When the client might be terminating the connection abruptly, you could check for the following:
If the MySQL client or mysqld server is receiving a packet bigger than max_allowed_packet bytes, or the client is receiving a "packet too large" message, you could send smaller packets or increase the max_allowed_packet flag value on both client and server.
If there are transactions that are not being properly committed using both "begin" and "commit", the client application logic needs to be updated to properly commit the transaction.
There are several utilities that I think will be helpful here; if you can, install the mtr and tcpdump utilities to monitor the packets during these connection-increasing events.
It is strongly recommended to enable the general_log in the database flags. Another suggestion is to also enable the slow_query database flag and output it to a file. Also have a look at this GitHub issue comment and go through the list of additional solutions proposed for this issue here.
This error message indicates a connection issue, either because your application doesn't terminate connections properly or because of a network issue.
As suggested in these troubleshooting steps for MySQL or PostgreSQL instances from the GCP docs, you can start debugging by checking that you follow best practices for managing database connections.

Scaling from one EC2 instance to two when your application and database are on the same instance

If I have one EC2 instance that hosts my web application and my MariaDB database, and I want to scale out at some point by separating the web application and database into separate instances, what is the standard practice for doing so without incurring any downtime? It seems like a complicated problem to me, but all the posts I've seen discussing the benefits of keeping the web and data tiers separate from the get-go mostly talk about security benefits and don't emphasize the scalability benefits, which makes me think that it's not as complex a problem as it seems.
Also, in this same scenario, if scaling up and keeping the application and database coupled would be less complex, how would that work? -- keeping in mind the zero-downtime requirement.
Familiarize yourself with the way replication works in MariaDB and the solution becomes intuitively obvious.
You create a replica database server by copying the existing database to a new server using mysqldump with particular attention to the options --master-data and --single-transaction to make a backup. Loading the results onto your new database server creates a replica of the original database as it existed at the moment you started making the backup. InnoDB MVCC assures that the version of each row in each table, as it existed at the beginning of the backup, is what appears on the new server as a result of loading this backup. (Yes, you have to be using InnoDB, as you should be doing anyway.)
You then connect the new database (as a slave) to the old database (as master), directing it to begin replicating from that same point in time -- the point in time identified by the master log coordinates contained in the backup -- the time the backup was started.
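Roughly, those two steps look like this (host names, credentials, and log coordinates below are placeholders; the real coordinates are recorded in the dump itself by --master-data):

    # Take a consistent backup from the existing (master) server:
    # --single-transaction gives a consistent InnoDB snapshot,
    # --master-data=2 records the matching binlog coordinates in the dump.
    mysqldump -h old-server -u backup_user -p \
        --single-transaction --master-data=2 mydb > mydb.sql

    # Load the dump on the new server, then point it at the old one:
    mysql -h new-server -u root -p mydb < mydb.sql

    -- On the new server (values are placeholders; use the coordinates from the dump):
    CHANGE MASTER TO
        MASTER_HOST='old-server',
        MASTER_USER='repl_user',
        MASTER_PASSWORD='...',
        MASTER_LOG_FILE='mysql-bin.000123',
        MASTER_LOG_POS=4567;
    START SLAVE;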
You wait for the slave to be in sync with the master.
Monitoring the replication status using SHOW MASTER STATUS; on the master and SHOW SLAVE STATUS; on the slave, it is trivial to determine when the slave is indeed "current" with the master. MariaDB replication is "asynchronous" in the sense that changes on the master are made before changes on the slave, but with a slave server of appropriate capacity, the typical replication lag is on the order of milliseconds... and, again, is easily determined. In the time it takes to stop/start your application, any lingering data can be confirmed to have finished replicating across.
Make the slave writable (typically a slave is set to read-only mode, with the only source of changes being the replication SQL thread, which can of course still write to it) ... then monitor replication to verify sync, stop app, point app to new database, verify replication still in sync, start app... done. Now, disconnect the slave from the master database and abandon the old master.
Of course, truly zero downtime is effectively impossible, since at some point the application must be reconfigured to connect to a different database... but the total downtime is essentially determined by how fast you can type, or automate the necessary steps to poll both database servers and compare replication coordinates, and make the transition.
At the risk of stating the obvious, never put anything other than the database on the database server, and never collocate it with the application. No exceptions in production should even be open to discussion. A problem that comes up all too often, as seen here, here, here, and here is more often than not attributable to people disregarding this principle, running the application and its database on the same server. Performance and stability are not only at risk, but the symptoms that arise also give the (incorrect) impression that MySQL (or MariaDB or Percona Server) is at fault, "crashing," when in fact the application is at fault, prompting the OS to force-crash the database in an effort to try to preserve overall machine stability in the face of inevitable memory exhaustion.
One possible solution:
Put a Load Balancer in front of your EC2 instance, initially just directing traffic to the single instance you have.
Spin up a second instance that will run a copy of your website, get it all configured and pointing at the DB on the first instance, and then add it into the load balancer so it starts to receive traffic.
Optional: Add a third instance configured the same as the second, also running a copy of the website only.
Take the original instance out of the LB pool so that web traffic now only goes to #2 and #3.
Uninstall the website from the #1 instance, so it is left running only the DB server.

SQL-Server Connection Fails after Network Reconnect

I am working on an update to an application that uses DAO to access an SQL Server. I know, but let's consider DAO a requirement for now.
The application runs all the time in the system tray and periodically performs SQL Server operations. Since it is running all the time, and users of the application will be on laptops transitioning between buildings, I've designed it to quietly transition between active and inactive states. When the database connection is successful, operations resume.
I have one last issue before I release this update: When a connection is dropped, then reestablished, the SQL operations fail. This occurs only if I have specified the hostname in my connection string. If I use the IP, everything is fine (but I need to be able to use hostname).
Here is the behavior:
1) Everything working. Good network connection, database operations are fine.
2) Lost connection. Little 'x' appears on task bar icon, and nothing else. All ok.
3) Reconnect.
At step 3, I get an 'ODBC--call failed' error when I run the first query. Interestingly, the database is first opened without error.
If I skip step 1, and start the application when the connection is down, everything works fine in step 3, hostname or not.
I expect this is an issue with the DAO engine caching the DNS entry after the first connection, although the destination IP does not change, so I'm not sure about that. I have tried flushing the Windows DNS cache (from the command prompt) to no effect. The same behavior occurs even when I'm using my local hostname with a local SQL Server I set up for development. 127.0.0.1 has no problems.
I also tried to CoUninitialize() the DAO interface between active times, but I had trouble getting this to work. If someone thinks that would help I will work harder at it.
This behavior is the same in Windows XP or 7.
Thanks for anything you've got!
Edit: I should have mentioned - I am closing the database connection between the attempts, then reopening it with
m_pDb = m_pDaoEngine->OpenDatabase()
I ended up biting the bullet and converting the application to ADO. Everything works nicely now, and database operations are much faster to boot.

C++ MySQL C API Connection Question

I'm building an application which uses MySQL, and I was wondering what would be the best way to manage the connection to the actual MySQL server.
I'm still in the design phase, but currently I have it connecting (or aborting on error) before every query and disconnecting afterwards. This is just for testing, as right now I'm only running one query to see if the code I've set up so far works.
My app might be performing a few queries every 5/10/20/30 minutes depending on settings and doesn't really need to do anything with SQL until then.
So I'm wondering if it's more beneficial to use a continuous connection that exists for the lifetime of the application (if possible), or to simply connect to SQL before I intend to use it, do what the app needs to do, and then disconnect?
Connecting once and performing many queries will naturally be more efficient.
However, if performance isn't a major concern for your project, maybe aiming for simplicity in your code might be a better option (especially if you are the only connection to the database).
If you want to get clever, then maybe connect as and when you need to, then keep the connection alive until you stop making queries - e.g., drop the connection if there have been no queries for 30 seconds or something like that.
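A rough sketch of that idea with the MySQL C API (the 30-second idle cutoff and the bookkeeping are illustrative, not a prescribed implementation):

    #include <mysql/mysql.h>
    #include <ctime>

    // Reuse an existing connection only if it was used recently and the
    // server still answers a ping; otherwise the caller should reconnect.
    bool ConnectionStillUsable(MYSQL* conn, std::time_t lastUsed, int idleLimitSecs = 30)
    {
        if (conn == nullptr) return false;
        if (std::time(nullptr) - lastUsed > idleLimitSecs) return false;  // treat as stale
        return mysql_ping(conn) == 0;  // 0 means the server responded
    }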
How many instances of this app will be connecting to MySQL? If it's just one, keeping a MySQL connection open for convenience shouldn't cause any problems, but remember there's a (configurable) limit to the number of MySQL connections you can have open to the server. In this case, I would recommend opening a connection, running whatever queries you need to run, and then closing it. Connecting per query adds more overhead as you add queries to your application.
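For reference, a minimal connect / query / disconnect cycle with the MySQL C API (host, credentials, database name, and the query are placeholders):

    #include <mysql/mysql.h>
    #include <cstdio>

    // Open a connection, run one batch of work, and close it again.
    bool RunBatch()
    {
        MYSQL* conn = mysql_init(nullptr);
        if (conn == nullptr) return false;

        if (mysql_real_connect(conn, "db-host", "user", "password",
                               "mydb", 0, nullptr, 0) == nullptr) {
            std::fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
            mysql_close(conn);
            return false;
        }

        if (mysql_query(conn, "SELECT 1") != 0) {
            std::fprintf(stderr, "query failed: %s\n", mysql_error(conn));
            mysql_close(conn);
            return false;
        }

        MYSQL_RES* result = mysql_store_result(conn);
        if (result != nullptr) {
            // ... consume rows with mysql_fetch_row(result) ...
            mysql_free_result(result);
        }

        mysql_close(conn);
        return true;
    }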