Multi-AZ RDS test failover and connection monitoring - amazon-web-services

My question has two parts:
What is the best way to initiate an RDS failover for testing purposes?
How can I monitor the connection during failover in order to observe the time that it takes for AWS to reconnect the user to the standby instance?
With respect to part (1): If I understand correctly, all instance modifications are made on the standby and then AWS fails over by flipping the CNAME over to the standby as the primary is updated, so if I were to make any kind of instance modification and select "apply immediately," it should cause a failover, correct?
With respect to part (2): I am looking specifically for a way of monitoring the failover of an Oracle RDS instance, whether through a Lambda function, a bash script, or some other means. As far as I can tell, it is not possible to use ping with RDS, even when I allow all ICMP traffic via the security group. I can connect without trouble using telnet or a SQL client. What I would like is a way to do something like periodically pinging the database during a failover, to see when the IP associated with the connection string switches over and how long that takes. Any suggestions?

Correct: RDS makes your modifications on the standby instance and then fails over to it. Per their documentation:
The availability benefits of Multi-AZ deployments also extend to planned maintenance and backups. In the case of system upgrades like OS patching or DB Instance scaling, these operations are applied first on the standby, prior to the automatic failover. As a result, your availability impact is, again, only the time required for automatic failover to complete.
To simulate a failover, simply choose Reboot with failover when rebooting, instead of a plain reboot. From the linked documentation:
Reboot with failover is beneficial when you want to simulate a failure of a DB instance for testing, or restore operations to the original AZ after a failover occurs.
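If you prefer to script the failover test rather than use the console, a minimal sketch with boto3 (the instance identifier is a placeholder; credentials and region are assumed to come from your environment):

import boto3

rds = boto3.client("rds")

# ForceFailover=True is what turns a plain reboot into "reboot with failover".
rds.reboot_db_instance(
    DBInstanceIdentifier="my-multi-az-instance",  # placeholder name
    ForceFailover=True,
)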
Write a script that, on a regular interval, connects with a SQL client and performs a quick select on a table of your preference. You can use this to measure true downtime during the failover; we have a very similar tool that we use to estimate the impact of modifications on a test RDS before we apply them to our production RDS. Our tool simply writes to the console, every few seconds, a timestamp and whether the query failed or succeeded. The tool will write success before the reboot, failure during it, and success again after the cutover completes.
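As a rough sketch of such a polling tool, assuming the python-oracledb driver and placeholder credentials/DSN (any DB-API driver would work the same way):

import datetime
import time

import oracledb  # assumption: python-oracledb driver for Oracle RDS

DSN = "DBNAME.REGION.rds.amazonaws.com:1521/ORCL"  # placeholder

while True:
    stamp = datetime.datetime.utcnow().isoformat()
    try:
        # Open a fresh connection each time so cached connections don't mask the outage.
        with oracledb.connect(user="admin", password="secret", dsn=DSN) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT 1 FROM dual")
                cur.fetchone()
        print(f"{stamp} success")
    except Exception as exc:
        print(f"{stamp} failure: {exc}")
    time.sleep(2)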
Additional Resources:
Modifying an Amazon RDS DB Instance and Using the Apply Immediately Parameter
Modifying a DB Instance Running the Oracle Database Engine

Update on this:
I ended up using a simple bash script:
while true; do date; nc -vz DBNAME.REGION.rds.amazonaws.com PORT; sleep 1; done
Note: the above is for netcat-openbsd. If using netcat-traditional, you'll need to modify this.
This polls the database each second to see if it's still possible to connect. Typically when I ran this and then initiated reboot with failover, the connection would simply dangle during the failover then display a timeout error when the failover was complete and connectivity resumed, presumably because the failover usually takes longer than the reboot. If the reboot happens to take longer than the failover though, there may be a period of time during which the connection is refused as the reboot completes. In any case, using this method, I was able to get a consistent failover time of 2:08.
It seems, however, that contrary to what I originally thought, most instance modifications do not involve a failover at all. I have tested resizing the instance as well as changing the option groups and parameter groups and did not experience any downtime.
Changing the database engine does result in a failover.

Related

Which method to use for updating CA certificates for AWS RDS

I currently need to update the CA certificates for my AWS RDS instance, and as far as I am aware there are two ways to do this: by modifying my DB instance or by applying DB instance maintenance (source: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.SSL-certificate-rotation.html).
Does it matter which method I choose? Is one way particularly better than the other/better in some circumstances?
In both methods, the RDS instance needs a reboot (read: outage!).
In our case, the RDS client application (Java-based) had trouble re-establishing its JDBC/SSL connection with the rebooted RDS instance (after the CA upgrade), so we had to manually restart the RDS client application to restore normal operation. We therefore needed to know exactly when the RDS CA upgrade was complete.
The workflow, then, looks like this:
1/ Add CA (2019) to your client application's trust store first!
2/ On the RDS side, use the 'Apply Immediately' option in lower environments (in production we also used 'Apply Immediately', but executed it during the approved maintenance window); see the sketch below.
3/ Wait a few minutes for AWS to apply the CA and reboot the RDS instance.
4/ Perform post-actions such as restarting your client application (if needed) and running regression tests.
In this way, we were able to limit the outage to a couple of minutes.
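Step 2/ above can also be scripted; a minimal sketch with boto3, where the instance identifier is a placeholder and 'rds-ca-2019' assumes that is the CA you are rotating to:

import boto3

rds = boto3.client("rds")

# ApplyImmediately=True triggers the change (and reboot) right away instead of
# waiting for the next maintenance window.
rds.modify_db_instance(
    DBInstanceIdentifier="my-db-instance",    # placeholder
    CACertificateIdentifier="rds-ca-2019",    # the CA being rotated to
    ApplyImmediately=True,
)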
Alert: If we had chosen the 'Apply during maintenance window' option, we would not have been in control of when AWS upgraded the RDS CA, because AWS may perform the upgrade at any point during the specified maintenance window; it is not guaranteed to happen at the start of the window.
Hope this helps!
I like to test the update manually by modifying the DB instance in a test environment. Then I verify any dependent software and make sure that everything is working.
Then in production I let the change apply during the maintenance window. Since this change requires a reboot, I let it run in my 3 a.m. Sunday maintenance window.
So both methods are handy depending on your needs. The end result is identical.

Does AWS DBInstance maintenance keep data intact

We are using a CloudFormation template to create a MySQL AWS::RDS::DBInstance.
My questions are: when maintenance is in progress to apply OS upgrades or software/security patches,
Will the database instance be unavailable for the duration of the maintenance?
Does it wipe out data from the database instance during maintenance?
If the answer to the first is yes, will using a DBCluster with more than one instance help avoid that short downtime?
From the documentation I did not see any indication that data loss is a possibility.
Will the database instance be unavailable for the duration of the maintenance?
They may reboot the server to apply the maintenance. I've personally never seen anything more than a reboot, but I suppose it's possible they may have to shut it down for a few minutes.
Does it wipe out data from the database instance during maintenance?
Definitely not.
If the answer to the first is yes, will using a DBCluster with more than one instance help avoid that short downtime?
Yes, a database in cluster mode would fail over to another node while patches are being applied to one node.
I have been actively working on RDS database systems for the last 5 years. Based on my experience, my answers to your questions are as follows in BOLD.
Will the database instance be unavailable for the duration of the maintenance?
[Yes, your RDS instance will be unavailable during database maintenance]
Does it wipe out data from the database instance during maintenance?
[Definitely a BIG NO]
If the answer to the first is yes, will using a DBCluster with more than one instance help avoid that short downtime?
[Yes. In cluster mode or a Multi-AZ deployment, AWS essentially applies the patches on the standby node or replica first and then fails over to the patched instance. There would still be some brief downtime during the switchover process]

How to make an HTTP call reach all instances behind an Amazon AWS load balancer?

I have a web app which runs behind an Amazon AWS Elastic Load Balancer with 3 instances attached. The app has a /refresh endpoint to reload reference data. It needs to be run whenever new data is available, which happens several times a week.
What I have been doing is assigning a public address to each instance and doing the refresh on each of them independently (using ec2-url/refresh). I agree with Michael's answer on a different topic that EC2 instances behind an ELB shouldn't allow direct public access. Now my problem is: how can I make an elb-url/refresh call reach all instances behind the load balancer?
And it would be nice if I could collect the HTTP responses from the multiple instances. But I don't mind doing the refresh blindly for now.
One of the ways I'd solve this problem is by:
writing the data to an AWS S3 bucket
triggering an AWS Lambda function automatically from the S3 write
using the AWS SDK to identify the instances attached to the ELB from the Lambda function, e.g. using boto3 from Python or the AWS Java SDK (see the sketch after this list)
calling /refresh on the individual instances from the Lambda
ensuring that when a new instance is created (due to autoscaling or deployment), it fetches the data from the S3 bucket during startup
ensuring that the private subnets the instances are in allow traffic from the subnets attached to the Lambda
ensuring that the security groups attached to the instances allow traffic from the security group attached to the Lambda
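A rough sketch of the Lambda body, assuming a Classic Load Balancer and placeholder names; for an ALB you would look up targets through the elbv2 API instead:

import urllib.request

import boto3

ELB_NAME = "my-load-balancer"  # placeholder

def handler(event, context):
    elb = boto3.client("elb")
    ec2 = boto3.client("ec2")

    # Instance IDs currently registered with the Classic ELB.
    desc = elb.describe_load_balancers(LoadBalancerNames=[ELB_NAME])
    instance_ids = [i["InstanceId"]
                    for i in desc["LoadBalancerDescriptions"][0]["Instances"]]

    # Resolve their private IPs (the Lambda is assumed to run inside the VPC).
    reservations = ec2.describe_instances(InstanceIds=instance_ids)["Reservations"]
    ips = [inst["PrivateIpAddress"]
           for r in reservations for inst in r["Instances"]]

    # Hit /refresh on each instance and collect the outcomes.
    results = {}
    for ip in ips:
        try:
            with urllib.request.urlopen(f"http://{ip}/refresh", timeout=10) as resp:
                results[ip] = resp.status
        except Exception as exc:
            results[ip] = str(exc)
    return results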
the key wins of this solution are
the process is fully automated from the instant the data is written to s3,
avoids data inconsistency due to autoscaling/deployment,
simple to maintain (you don't have to hardcode instance IP addresses anywhere),
you don't have to expose instances outside the VPC
highly available (AWS ensures the Lambda is invoked on s3 write, you don't worry about running a script in an instance and ensuring the instance is up and running)
hope this is useful.
While this may not be possible given the constraints of your application and circumstances, it's worth noting that the best-practice application architecture for instances running behind an AWS ELB (particularly if they are part of an Auto Scaling group) is to ensure that the instances are not stateful.
The idea is to make it so that you can scale out by adding new instances, or scale-in by removing instances, without compromising data integrity or performance.
One option would be to change the application to store the results of the reference data reload in an off-instance data store, such as a cache or database (e.g. ElastiCache or RDS), instead of in memory.
If the application was able to do that, then you would only need to hit the refresh endpoint on a single server - it would reload the reference data, do whatever analysis and manipulation is required to store it efficiently in a fit-for-purpose way for the application, store it to the data store, and then all instances would have access to the refreshed data via the shared data store.
While adding a round-trip to a data store increases latency, it is often well worth it for the consistency of the application. Under your current model, if one server lags behind the others in refreshing the reference data and the ELB is not using sticky sessions, requests via the ELB will return inconsistent data depending on which server they are allocated to.
You can't make these requests through the load balancer, so you will have to open up the security group of the instances to allow incoming traffic from a source other than the ELB. That doesn't mean you need to open it to all direct traffic, though. You could simply whitelist an IP address in the security group to allow requests from your specific computer.
If you don't want to add public IP addresses to these servers then you will need to run something like a curl command on an EC2 instance inside the VPC. In that case you would only need to open the security group to allow traffic from some server (or group of servers) that exist in the VPC.
I solved it differently, without opening up new traffic in security groups or resorting to external resources like S3. It's flexible in that it will dynamically notify instances added through ECS or ASG.
The ELB's Target Group offers a periodic health check to ensure the instances behind it are live. This is a URL that your server responds on. The endpoint can include a timestamp parameter of the most recent configuration. Every server in the TG will receive the health-check ping within the configured interval threshold. If the parameter in the ping changes, it signals a refresh.
A URL may look like:
/is-alive?last-configuration=2019-08-27T23%3A50%3A23Z
Above I passed a UTC timestamp of 2019-08-27T23:50:23Z
A service receiving the request will check if the in-memory state is at least as recent as the timestamp parameter. If not, it will refresh its state and update the timestamp. The next health-check will result in a no-op since your state was refreshed.
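A minimal sketch of such a health-check handler, assuming a Flask-based service and an in-memory timestamp (the route, parameter name, and reload function are illustrative):

from datetime import datetime, timezone
from threading import Thread

from flask import Flask, request  # assumption: a Flask-based service

app = Flask(__name__)
last_loaded = datetime.min.replace(tzinfo=timezone.utc)  # when we last refreshed

def reload_reference_data():
    global last_loaded
    # ... fetch and rebuild the in-memory state here ...
    last_loaded = datetime.now(timezone.utc)

@app.route("/is-alive")
def is_alive():
    wanted = request.args.get("last-configuration")
    if wanted:
        wanted_ts = datetime.fromisoformat(wanted.replace("Z", "+00:00"))
        if wanted_ts > last_loaded:
            # Refresh in the background so the health check still returns promptly.
            Thread(target=reload_reference_data, daemon=True).start()
    return "OK", 200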
Implementation notes
If refreshing the state can take more time than the interval window or the TG health timeout, you need to offload it to another thread to prevent concurrent updates or outright service disruption as the health-checks need to return promptly. Otherwise the node will be considered off-line.
If you are using traffic port for this purpose, make sure the URL is secured by making it impossible to guess. Anything publicly exposed can be subject to a DoS attack.
As you are using S3 you can automate your task by using the ObjectCreated notification for S3.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
https://docs.aws.amazon.com/cli/latest/reference/s3api/put-bucket-notification.html
You can install the AWS CLI and write a simple Bash script that monitors for that ObjectCreated notification. Start a cron job that looks for the S3 notification of a newly created object.
Set up a condition in that script to curl "http://127.0.0.1/refresh" when it detects a new object created in S3; once it does, it curls 127.0.0.1/refresh and you're done, you don't have to do that manually each time.
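The answer above describes a Bash/CLI script; as a hedged sketch, the same cron check could be written in Python with boto3, comparing the object's LastModified time rather than consuming the notification directly (bucket, key, and state-file path are placeholders):

import json
import pathlib
import urllib.request

import boto3

BUCKET, KEY = "my-data-bucket", "reference/data.csv"      # placeholders
STATE_FILE = pathlib.Path("/var/tmp/last_refresh.json")   # remembers what we last saw

s3 = boto3.client("s3")
head = s3.head_object(Bucket=BUCKET, Key=KEY)
latest = head["LastModified"].isoformat()

seen = json.loads(STATE_FILE.read_text())["seen"] if STATE_FILE.exists() else None
if latest != seen:
    # New object version detected: refresh the local app and remember it.
    urllib.request.urlopen("http://127.0.0.1/refresh", timeout=30).read()
    STATE_FILE.write_text(json.dumps({"seen": latest}))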
I personally like the answer by #redoc, but wanted to give another alternative for anyone who is interested, which is a combination of his and the accepted answer. Using S3 object-creation events, you can trigger a Lambda, but instead of discovering the instances and calling them, which requires the Lambda to be in the VPC, you could have the Lambda use SSM (aka Systems Manager) to execute commands via a PowerShell or Bash document on EC2 instances that are targeted via tags. The document would then call 127.0.0.1/refresh like the accepted answer does. The benefit of this is that your Lambda doesn't have to be in the VPC, and your EC2s don't need inbound rules to allow the traffic from the Lambda. The downside is that it requires the instances to have the SSM agent installed, which sounds like more work than it really is. There are AWS AMIs already optimized with the SSM agent, but installing it yourself in the user data is very simple. Another potential downside, depending on your use case, is that it uses an exponential ramp-up for simultaneous executions, which means if you're targeting 20 instances, it runs 1, then 2 at once, then 4 at once, then 8, until they are all done or it reaches what you set for the max. This is because of the error-recovery behaviour it has built in. It doesn't want to destroy all your stuff if something is wrong, like slowly putting your weight on some ice.
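A minimal sketch of that SSM call from the Lambda; the tag key/value are placeholders and the command mirrors the accepted answer's local curl:

import boto3

ssm = boto3.client("ssm")

# Runs the curl on every instance carrying the tag; SSM handles the ramp-up
# and retries, so the Lambda does not need VPC access to the instances.
ssm.send_command(
    Targets=[{"Key": "tag:Role", "Values": ["web"]}],   # placeholder tag
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["curl -s http://127.0.0.1/refresh"]},
    MaxConcurrency="50%",   # optional: cap how many instances run at once
)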
You could make the call multiple times in rapid succession to call all the instances behind the Load Balancer. This would work because the AWS Load Balancers use round-robin without sticky sessions by default, meaning that each call handled by the Load Balancer is dispatched to the next EC2 Instance in the list of available instances. So if you're making rapid calls, you're likely to hit all the instances.
Another option is that if your EC2 instances are fairly stable, you can create a Target Group for each EC2 Instance, and then create a listener rule on your Load Balancer to target those single instance groups based on some criteria, such as a query argument, URL or header.

AWS architecture help for running database dumps

I have MySQL running on one EC2 instance and Tableau uses this database. mysqldump runs from the production servers every 4 hours, during which the system is down for probably 10-15 minutes due to the dump. I am planning to have another EC2 instance with MySQL running and an ELB on top of these two instances so that the system won't be down during the dump. For this I might have to de-register the instances from the ELB during the dump and register them back after the dump. Is this the right way to do it in situations like this?
You can't use an ELB with MySQL servers. The ELB wouldn't know which server was master and which was slave, so it wouldn't know which to send updates to.
Is there any reason you aren't using Amazon's RDS service for your database servers? It provides automated snapshots that don't cause any down-time. It also makes it easy to create a read-replica against which you could perform mysqldumps without affecting the main server.
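For completeness, a hedged sketch of creating such a read replica with boto3 (identifiers are placeholders); you would then point the mysqldump job at the replica's endpoint:

import boto3

rds = boto3.client("rds")

# The replica gets its own endpoint; run mysqldump against that endpoint
# so the dump load never touches the primary.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="mydb-replica",          # placeholder
    SourceDBInstanceIdentifier="mydb-primary",    # placeholder
)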
Currently you are taking logical backups of your system every 4 hours. Logical backups in most cases should only be used in a worst-case scenario. In the event of a restore, logical backups are very slow compared to alternatives such as snapshots and binary backups. If snapshotting using Amazon RDS, or any of the multitude of other alternatives out there, is not an option in your environment, I would look into Xtrabackup. This is a free, standalone, hot online binary backup tool that can be used with a vanilla install of MySQL. It should not bring down your production server, assuming you are using InnoDB and not an alternative storage engine such as MyISAM. I personally used it for hot online binary backups and to automate building slaves in my previous work environment. A binary backup's bottleneck is your network speed during the restore process, and it is far faster than a logical backup.
If setting up another MySQL instance is your only option, look into GTID replication and/or a master-passive HA setup so that you can run the mysqldump against the secondary, non-active server and your production environment does not go down.
The bottom line is that you should not be taking production down to do a logical backup every 4 hours. This is definitely not ideal for any production environment.
Have a look at Amazon Database Migration Service (https://aws.amazon.com/dms/). It allows you to do zero-downtime database migration or just synchronization.

Achieving read and write query availability in AWS Multi-AZ RDS

I have configured a Multi-AZ RDS MySQL instance with no read replicas in a development environment, and I am testing Multi-AZ RDS failover by rebooting the DB instance.
Below is my observation: during RDS failover, the client application does not lose the connection, but at the same time it is not able to access the database; once the failover completes, the client is able to access the database again.
Update 1: The above observation is wrong. What I observed just now is that after failover completion I get the error below, and it results in connection termination.
ERROR 2003 (HY000): Can't connect to MySQL server on 'rds-test.czswqpewzqas.---------.amazonaws.com' (110)
So, in short, my queries fail during the reboot of the Multi-AZ MySQL instance.
Does anyone have any idea what I am missing here?
Update - Achieving read availability: I have now created a read replica for the Multi-AZ MySQL instance, and on getting the above-mentioned error, I redirect SELECT queries to the read replica instance.
So, using the read replica, I am able to achieve read availability. Is this the right way? I would like to know if there is any other way to do it.
Also, how can I achieve write availability in Multi-AZ RDS?
Your observations are correct. During the failover, TCP connections are lost for the time it takes to fail over to the secondary database and to switch the IP address over in DNS.
It is up to the application to
a/ try to reconnect using exponential backoff; reconnection will be possible within minutes (a sketch follows after this list).
b/ decide how to behave during the failover.
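For (a) above, a minimal reconnect loop with exponential backoff, sketched in Python (the connect argument stands in for whatever driver call your application uses):

import random
import time

def connect_with_backoff(connect, max_wait=60):
    """Retry connect() with exponential backoff plus jitter until it succeeds."""
    wait = 1
    while True:
        try:
            return connect()  # e.g. your MySQL driver's connect call
        except Exception as exc:
            print(f"connection failed ({exc}); retrying in {wait:.1f}s")
            time.sleep(wait + random.uniform(0, 1))  # jitter avoids a thundering herd
            wait = min(wait * 2, max_wait)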
Read transactions (SELECT) can be handed off to a read replica. Modern JDBC and ODBC drivers are able to handle read replicas by themselves; just give them the list of IP addresses / DNS names of your replicas. The driver will apply the load balancing automatically. No code change is required.
Write transactions are more complex to handle, and there is no single answer for all applications. The correct answer will depend on your application and business requirements.
Some customers decide to block all write operations, return an error message to end users asking them to try again a few minutes later.
Some customers queue write transactions in an SQS queue. They develop a queue reader application to flush pending transactions when the master database is available again (depending on workload, S3 or DynamoDB can be used for this as well). Of course, your data will not be consistent during the failover and for a short period right after it, the time required to flush all pending writes.
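A rough sketch of that queuing pattern with boto3 and SQS; the queue URL and message shape are placeholders, and a real implementation would also need to consider ordering and idempotency:

import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pending-writes"  # placeholder

def queue_write(statement, params):
    """Park a write while the master is unavailable."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"sql": statement, "params": params}),
    )

def flush_pending(execute):
    """Replay queued writes once the master is reachable again."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=10, WaitTimeSeconds=5)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            body = json.loads(msg["Body"])
            execute(body["sql"], body["params"])  # apply the write to the database
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])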
Please feel free to comment about other strategies used in real world scenarios.