Rebooting an AWS RDS Aurora master/writer also reboots the readers? - amazon-web-services

I'm trying to evaluate AWS RDS Aurora as a future replacement for our local MySQL databases, but I'm noticing some strange behaviors.
I have a basic cluster with a DB master (writer) and a replica (reader). My idea was to use the reader as an always-available data source, even when the writer is unavailable. But when I reboot the master, it takes down the reader as well, making the setup quite worthless.
Looking at the reader replica log, this is what happens when it notices that the writer is down:
Does anyone know how to have an Aurora read entry point that never goes down even if the writer is offline or busy for a brief time?
Or does the write/read "out of sync" always take down the reader entry points no matter the size of the cluster?

The only way to have a replica that remains available during a reboot of the master would be to have an asynchronous replica using conventional MySQL replication -- which Aurora does support.
Aurora replication is very different from MySQL (or Galera) replication. A loss of the master necessarily triggers a reorganization of the cluster, because the individual instances don't have their own copies of the data; they share a 6-way replicated storage volume -- that's how replication can remain in the 10-20 ms time range. What's actually replicated from the master is the transaction log LSN. Replacing a master requires one replica to be promoted and to verify that the on-disk data structures are clean after taking over, after which all of the other replicas start following it.
If the DB cluster has one or more Aurora Replicas, then an Aurora Replica is promoted to the primary instance during a failure event. A failure event results in a brief interruption, during which read and write operations fail with an exception.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora.Managing.html#Aurora.Managing.FaultTolerance
When an Aurora replica stops seeing updates from the master, it doesn't matter where the actual fault lies -- whether with the actual master or elsewhere in the infrastructure -- the replica stops serving queries because, best case, it no longer has access to authoritative data.
Where possible, zero-downtime patching appears to avoid a master restart during upgrades. Other than upgrades, there should not be a need to restart the master.

Related

AWS RDS Read Replica act as Failover Standby

I am currently assessing whether to use RDS MySQL Multi-AZ or Single AZ with Read Replica.
Considerations are budget and performance. As Multi-AZ costs twice as much as Single AZ and has no ability to offload read operations, Single AZ with a Read Replica seems to be the logical choice.
However, I saw a way to manually 'promote' the Read Replica to master in the event of master's failure, but is there a way to automate this?
Note: There was a similar question but it did not address my question:
Read replicas in RDS AWS
I think the problem is that you are a bit confused about these features. Let me help: you can launch AWS RDS in Multi-AZ deployment mode. In this case, AWS will do the following:
It will allocate a DNS record for you. This DNS record represents a single entry point to your master database, which is, let's assume, currently active and able to serve connections.
In the case of master failure for any reason, AWS will simply switch the address hidden behind the DNS record (quite fast, within 1-2 minutes) so that it points to your standby, which is located in another AZ.
When the master becomes available again, your standby, which has been serving writes, needs to synchronize everything with the master. You do not need to take care of this; AWS will manage it for you.
In case of read replica:
AWS will allocate you 2 different DNS records: one for the master, another for the read replica. The read replica can be in the same AZ as the master, or even in another Region.
You can, and must, choose in your application which DNS name to use in different scenarios. That is, you will most probably have 2 different connection pools: one for the master, another for the read replica. Replication itself will be asynchronous.
In the case of a read replica, AWS handles replication on its own; you do not need to worry about it. But AWS does not, by nature, solve the problem of synchronizing from the read replica back to the master, because the replica is meant to be read-only and should not accept any write traffic.
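As a rough illustration of the "two connection pools" point above, here is a minimal Python sketch of choosing between the two DNS names per statement. The `Endpoints` class, the `choose_endpoint` function, and the host names are all hypothetical, and the keyword check is a simplification rather than a real SQL parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoints:
    """The two DNS names RDS gives you (placeholder values are assumptions)."""
    master: str
    replica: str


# Statement keywords treated as read-only in this sketch.
READ_PREFIXES = ("select", "show", "describe", "explain")


def choose_endpoint(sql: str, endpoints: Endpoints, in_transaction: bool = False) -> str:
    """Return the DNS name a statement should be sent to."""
    statement = sql.strip().lower()
    is_read = statement.startswith(READ_PREFIXES)
    # Locking reads must run on the master even though they start with SELECT.
    takes_lock = "for update" in statement
    # Reads inside a transaction stay on the master: the replica is
    # asynchronous and may not have the transaction's own writes yet.
    if is_read and not takes_lock and not in_transaction:
        return endpoints.replica
    return endpoints.master
```

An application would then hand the statement to whichever pool is bound to the returned name. Because replication is asynchronous, a read sent to the replica can be slightly stale.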
Addressing your question directly:
Technically, you can try to make your read replica serve as a failover, but in this case you will have to implement a custom solution for synchronization with the master, because during the time the master was down, your read replica, acting as the failover target, will certainly have received some number of writes. AWS does not solve this synchronization problem in this case.
In regard to Multi-AZ: you cannot use your Multi-AZ standby as a read replica, since that is not supported in AWS. I highly recommend checking out this documentation. I think it will help you sort things out. Have a nice day!

Difference between "Multi-AZ Deployment" and "Read Replica Version Multi-AZ Deployment"

Summary
Amazon RDS has two main types of replicas, the Multi-AZ replica and the Read Replica, and it is easy to find the differences between them.
However, Read Replicas have supported Multi-AZ deployment since January 2018.
What is the main difference between "Multi-AZ Deployment" and "Read Replica Version Multi-AZ Deployment"?
The two ways to add a Multi-AZ deployment to an existing database are as follows:
Situation 1: (Original, Multi-AZ Deployment)
Instance Action
→ Modify
→ specify the "Multi-AZ deployment" option
Situation 2: (Read Replica Version Multi-AZ Deployment)
Instance Action
→ Create read replica
→ specify the "Multi-AZ deployment" option
An RDS read replica instance is an asynchronous read-only replica of an upstream primary ("master") database instance. It can be used by your application for any query that does not require changing data, thus relieving load from the master. If the replica crashes or fails, it has no impact on the master but the replica itself can no longer handle any traffic.
Multi-AZ means the database instance has a standby spare server machine and spare hard drive in a different availability zone of the same region. This is a synchronous replica, but cannot be accessed by you. If the active server fails, the spare server takes over and starts handling traffic more quickly than would be possible without the spare.
Multi-AZ is a deployment strategy for higher reliability.
It reduces the downtime required for version upgrades, and reduces the impact of backup snapshots and creation of replicas, since snapshots can be done from the spare (by the service). It doubles the cost of the instance because of the hot standby capacity it provides.
Multi-AZ is typically used only on the master instance, for fast recovery.
Historically, this was the only variant of Multi-AZ, but a Multi-AZ read replica is now possible, and is what it sounds like: a replica with Multi-AZ. It will recover more quickly from faults and failures because it has spare hardware. The active and spare are synchronous replicas of each other but are still asynchronous replicas of the master, as all non-Aurora replicas are in RDS/MySQL.
Combining Read Replicas with Multi-AZ enables you to build a resilient disaster recovery strategy and simplify your database engine upgrade process.
Amazon RDS Read Replicas enable you to create one or more read-only copies of your database instance within the same AWS Region or in a different AWS Region. Updates made to the source database are then asynchronously copied to your Read Replicas. In addition to providing scalability for read-heavy workloads, Read Replicas can be promoted to become a standalone database instance when needed.
https://aws.amazon.com/about-aws/whats-new/2018/01/amazon-rds-read-replicas-now-support-multi-az-deployments/
In summary, Multi-AZ on the master gets you one server with an invisible hot spare that is used for failure recovery but is not a usable database replica. It is a good strategy for resiliency.
Multi-AZ on a replica is an expensive way of speeding recovery time on a crashed instance. It is a separate server, so can be accessed by you, but so can a non-Multi-AZ read replica.
A multi-AZ deployment has a Master database in one AZ and a Standby (or Secondary) database in another AZ. Only the Master database serves traffic. If the Master fails, then the Secondary takes over.
A Read Replica is a read-only copy of the database. It is actively running and apps can use it for read-only queries. A Read Replica can be in a different AZ or even in a different region.
In terms of high availability, Multi-AZ offers more than a Read Replica. Because Multi-AZ provides a backup writer in another AZ, neither reads nor writes are affected when a single AZ fails.

Does Amazon Aurora create a new replica if an existing one gets promoted to the primary?

If a primary Aurora DB instance dies for some reason, and an existing replica gets promoted to the new primary, does a new replica instance get created so that I end up with the same number of read replicas?
If so, how long does it take for a new replica to be spun up on average?
There are two types of read replicas:
Backup replica (also known as slave), made by AWS when you deploy a Multi-AZ RDS instance. This is a synchronous replica, but you cannot use it.
Read replica created by you. Those are asynchronous replicas that you can use to offload some work.
A backup replica will be promoted to master automatically; this usually takes less than a minute. And yes, AWS will create a new slave for the RDS instance that's now the master. That can take from several minutes to several hours, depending on your workload and DB size.
Read replicas created by you will be just switched to the new master.
AWS Aurora is AWS's database with an architecture designed for cloud computing technologies. One of its differences is that data is stored in a storage architecture similar to S3, in a cluster volume, which is a single virtual volume that uses solid state disk (SSD) drives and consists of copies of the data across multiple Availability Zones in a single region. That has a few advantages, such as durability and the fact that the data is distributed across an entire region, not just one AZ, which helps with consistency between replicas and with performance.
If you have read replicas and your Master fails, one of them will become Master with only a brief interruption.
If you don't have a read replica, a new Master instance will be created. Since the data is stored in the cluster volume across the region, not on the server's disk, the process is fast, but there is downtime.
As AWS says:
To increase availability, you can use Aurora Replicas as failover
targets. That is, if the primary instance fails, an Aurora Replica is
promoted to the primary instance with only a brief interruption during
which read and write requests made to the primary instance fail with
an exception. If your Aurora DB cluster does not include any Aurora
Replicas, then the primary instance is recreated during a failure
event. However, promoting an Aurora Replica is much faster than
recreating the primary instance. For high-availability scenarios, we
recommend that you create one or more Aurora Replicas, of the same DB
instance class as the primary instance, in different Availability
Zones for your Aurora DB cluster. For more information on Aurora
Replicas as failover targets, see Fault Tolerance for an Aurora DB
Cluster.
You can read more at: http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora.Replication.html

Amazon Aurora Replica

I have a big database (~250GB) in Aurora getting lots of inserts. There's only one instance, so I'd like to create a replica for redundancy. While we are doing nightly snapshots, we would prefer a more fault tolerant system, and it appears that using aurora replicas would provide automatic failover.
My question: What exactly happens when I use the console and create a replica? Will a new instance come up and begin pulling data from the master instance? Could that affect database performance? I'm sure that it will take some time before the replica "catches up" and loads the 250GB; how will I know when it's "finished"?
Don't want to have any downtime, so I'm a bit afraid to push the "create replica" button without knowing what it does...
What exactly happens when I use the console and create a replica?
A new instance is started as part of the cluster, and it has access to the master's data -- or, perhaps more precisely, the cluster's data. All Aurora instances are members of a "cluster," even if it's only a cluster of one master server. Aurora replication, within the same region, is starkly different than MySQL native replication.
Will a new instance come up and begin pulling data from the master instance?
Not really. As described above, the new instance will come up and be able to read from the master's backing store -- it doesn't have its own separate storage.
Aurora runs on 3 sets of 2 copies of the working data, mirrored and replicated across the availability zones in the region. This logical entity is called the Cluster Volume.
The cluster volume spans multiple Availability Zones in a single region, and each Availability Zone contains a copy of the cluster volume data.
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora.Managing.html
(The docs say each AZ contains "a copy," which is true, but it's mirrored.)
Aurora replicas read from this data -- for all practical purposes, synchronously.
Q: How far behind the primary will my replicas be?
Since Amazon Aurora Replicas share the same data volume as the primary, there is virtually no replication lag. We typically observe lag times in the 10s of milliseconds.
— https://aws.amazon.com/rds/aurora/faqs/
Could that affect database performance?
It shouldn't.
I'm sure that it will take some time before the replica "catches up" and loads the 250GB; how will I know when it's "finished"?
No, it really shouldn't. Once the replica instance becomes accessible, it should be up-to-date, because it's reading the same data from the same place that the master is writing. Metrics related to Aurora replica lag are accessible in the console.

Achieving read and write query availability in AWS Multi-AZ RDS

I have configured a Multi-AZ RDS MySQL instance with no read replicas in a development environment, and I am testing Multi-AZ RDS failover by rebooting the DB instance.
Below is my observation: during RDS failover, the client application does not lose its connection, but it cannot access the database either; once failover completes, the client is able to access the database again.
Update 1: The above observation is wrong. What I observed just now is that after failover completes I get the error below, and it results in connection termination.
ERROR 2003 (HY000): Can't connect to MySQL server on 'rds-test.czswqpewzqas.---------.amazonaws.com' (110)
So in short, my queries are failing during the reboot of a Multi-AZ MySQL instance.
Does anyone have any idea what I am missing here?
Update - Achieving read availability: I have now created a Read Replica for the Multi-AZ MySQL instance and, on getting the above-mentioned error, I redirect SELECT queries to the Read Replica instance.
So, using a Read Replica I am able to achieve read availability. Is this the right way? I would like to know if there is any other way to do it.
Also, how can I achieve write availability in Multi-AZ RDS?
Your observations are correct. During the failover, TCP connections are lost for the time it takes to fail over to the secondary database and to switch the IP address in DNS.
It is up to the application to
a/ try to reconnect using exponential backoff. Reconnection will be possible within minutes.
b/ decide how to behave during the failover.
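Point a/ could look roughly like the following Python sketch. It is only an illustration: `connect_with_backoff` and its parameters are made-up names, `connect` is any zero-argument callable that opens a connection, and `ConnectionError` stands in for whatever exception your MySQL driver actually raises on a failed connect.

```python
import random
import time


def connect_with_backoff(connect, max_attempts=8, base_delay=0.5, max_delay=30.0,
                         retry_on=(ConnectionError,), sleep=time.sleep):
    """Call connect() until it succeeds, waiting longer after each failure."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except retry_on:
            # Out of attempts: let the caller see the last error.
            if attempt == max_attempts - 1:
                raise
            # Double the delay ceiling on every failure, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from reconnecting in lockstep.
            sleep(random.uniform(0, ceiling))
```

With the defaults, the delay ceilings before the seven retries are 0.5, 1, 2, 4, 8, 16, and 30 seconds, which add up to about a minute of waiting in the worst case. Tune `max_attempts` and `max_delay` to the failover time you actually observe.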
Read transactions (SELECT) can be handed off to a read replica. Modern JDBC and ODBC drivers are able to handle read replicas by themselves; just give them the list of IP addresses / DNS names of your replicas. The driver will apply the load balancing automatically. No code change is required.
Write transactions are more complex to handle and there is no single answer for all applications. The correct answer will depend on your application and business requirements.
Some customers decide to block all write operations and return an error message to end users asking them to try again a few minutes later.
Some customers queue write transactions in an SQS queue. They develop a queue reader application to flush pending transactions when the master database is available again. (Depending on the workload, S3 or DynamoDB can be used for this as well.) Of course, your data will not be consistent during the failover and for a short period right after it, which is the time required to flush all pending writes.
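The queue-and-flush idea can be sketched in a few lines of Python. This is an assumption-laden illustration, not a production design: `WriteBuffer` is a hypothetical class, an in-memory `deque` stands in for a durable queue such as SQS, and `ConnectionError` stands in for your driver's own connection exception.

```python
from collections import deque


class WriteBuffer:
    """Hold write statements while the master is down and replay them in order."""

    def __init__(self, execute):
        self._execute = execute  # callable that runs one statement on the master
        self._pending = deque()

    def submit(self, statement):
        """Run the statement now if possible; return False if it was queued."""
        # Keep ordering: never overtake statements that are already waiting.
        if self._pending:
            self._pending.append(statement)
            return False
        try:
            self._execute(statement)
            return True
        except ConnectionError:
            self._pending.append(statement)
            return False

    def flush(self):
        """Replay queued statements in order; stop at the first failure."""
        flushed = 0
        while self._pending:
            try:
                self._execute(self._pending[0])
            except ConnectionError:
                break
            self._pending.popleft()
            flushed += 1
        return flushed
```

A background task would call `flush()` periodically until the queue is empty. An in-memory queue loses its contents if the application restarts, which is one reason to use a durable queue in practice.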
Please feel free to comment about other strategies used in real world scenarios.