Need help regarding Aurora DB scaling - amazon-web-services

I need to scale the Aurora DB up and then back down again at a later time, but I cannot afford any downtime, not even a little. I thought of doing it by creating an Aurora replica, promoting it, scaling the old primary, and then promoting that back to primary, but this involves downtime. Please suggest an alternative approach.

I would first ask in what sense you need to scale up: writing or reading? If you currently run a server with a high-read/low-write ratio, I'd suggest another read replica (or more) and then configuring your application to use the read replicas specifically in those cases where you're only making queries and not modifying the data. In that sense, you're offloading your reads to separate instances of the same data and letting the primary deal mostly with writes. If you do this, I'd also suggest spreading the read replicas across other Availability Zones, so that if the AZ of your primary goes down, another read replica will be auto-promoted to primary.
Armed with that info, I would suggest you scale up with read replicas that your application is configured to use; then you can bring those replicas down as you scale down, without ever bringing down the Aurora primary. It's essentially a configuration problem of dealing with multiple RDS endpoints.
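To make the scale-up/scale-down concrete, here is a minimal boto3 sketch of adding a reader to an existing Aurora cluster and removing it again later; the cluster identifier, reader name, engine, and instance class are all placeholder assumptions you would replace with your own.

    import boto3

    rds = boto3.client("rds")

    CLUSTER_ID = "my-aurora-cluster"               # placeholder cluster identifier
    READER_ID = "my-aurora-cluster-extra-reader"   # placeholder name for the new reader

    # Scale up: add another reader instance to the existing cluster.
    rds.create_db_instance(
        DBInstanceIdentifier=READER_ID,
        DBClusterIdentifier=CLUSTER_ID,
        Engine="aurora-mysql",           # must match the cluster's engine
        DBInstanceClass="db.r6g.large",  # placeholder instance class
    )
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=READER_ID)

    # Point read-only traffic at the cluster's reader endpoint, which spreads
    # connections across all readers, so adding/removing replicas needs no app change.
    cluster = rds.describe_db_clusters(DBClusterIdentifier=CLUSTER_ID)["DBClusters"][0]
    print("reader endpoint:", cluster["ReaderEndpoint"])

    # Scale down: delete the extra reader; the writer is never touched.
    rds.delete_db_instance(DBInstanceIdentifier=READER_ID)

Using the cluster reader endpoint rather than individual instance endpoints is what keeps the whole thing transparent to the application.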
HTH.
One caveat is that there is some delay in replication between the primary and the read replicas (my instances show about 10-20 milliseconds), so you have to keep this in mind if you perform a write and then a read in quick succession -- if the first read after the initial write happens 'too fast', the replica might not have seen the change yet and may return either no data (if you're creating) or stale data (if you're updating).
In general this isn't an issue until you're under heavy load, the writes on the primary are backed up, and you start reading from the read replica before the changes have been replicated to it.
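If you want to guard against that in application code, one common pattern is to pin reads that closely follow a write back to the writer for a short window. A rough sketch of the idea, assuming pymysql as the driver and placeholder endpoints and lag window:

    import time
    import pymysql

    WRITER_HOST = "writer.example.rds.amazonaws.com"  # placeholder cluster endpoint
    READER_HOST = "reader.example.rds.amazonaws.com"  # placeholder reader endpoint
    LAG_WINDOW_S = 0.5  # assumed upper bound on replica lag for this application

    def connect(host):
        return pymysql.connect(host=host, user="app", password="secret", database="appdb")

    writer = connect(WRITER_HOST)
    reader = connect(READER_HOST)
    last_write_at = 0.0

    def execute_write(sql, params=()):
        global last_write_at
        with writer.cursor() as cur:
            cur.execute(sql, params)
        writer.commit()
        last_write_at = time.monotonic()

    def execute_read(sql, params=()):
        # Reads issued right after a write go to the writer to avoid stale results;
        # everything else is offloaded to the reader endpoint.
        conn = writer if time.monotonic() - last_write_at < LAG_WINDOW_S else reader
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()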
HTH.

Related

HA Cloud SQL writes low latency despite synchronous replication across multiple DCs

I have just read Google's Cloud SQL's high availability documentation.
From what I understood, in order for Google to:
Guarantee no data loss in case of primary node failure.
Allow clients to use standby node as read replica with strong consistency.
Google has to replicate writes in a synchronous way across multiple zones.
This seems like a very costly operation that should affect write transaction latency. However, I personally have not observed any significant latency difference between the HA and non-HA versions of GCP's Postgres.
How is it possible?
Not a definitive answer, but I hope it helps anyway.
The technology used here is the same as that used for regional persistent disks, which are highly optimized for this kind of multi-zone write scenario. This basically means that the operation is costly, but not as costly as you seem to expect. However, even Google itself acknowledges that there will be increased write latency. From a page directly linked from the documentation you shared:
Regional persistent disks are an option when write performance is less critical than data redundancy across multiple zones.
Documentation
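If you want to see the effect on your own instances, here is a rough sketch that times commit latency for small writes; psycopg2, the table, and all connection details are placeholder assumptions. Run it once against an HA (regional disk) instance and once against a non-HA (zonal disk) one and compare the medians:

    import time
    import statistics
    import psycopg2

    # Placeholder connection details; point this at the instance you want to measure.
    conn = psycopg2.connect(host="10.0.0.5", dbname="bench", user="bench", password="secret")

    with conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS latency_test (id serial PRIMARY KEY, payload text)")
        conn.commit()

        samples = []
        for i in range(200):
            start = time.perf_counter()
            cur.execute("INSERT INTO latency_test (payload) VALUES (%s)", (f"row-{i}",))
            conn.commit()  # the commit is what has to hit the (regional or zonal) disk
            samples.append(time.perf_counter() - start)

    print(f"median commit latency: {statistics.median(samples) * 1000:.2f} ms")
    conn.close()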
You're confusing Read Replicas, HA and Legacy HA.
Some time ago Cloud SQL used Legacy HA, which relies on an explicit instance that replicates from the primary instance (in a sense, it is a read replica). In that particular case there could be replication lag and, as you mention, because the writes are synchronous, performance could be impacted.
Legacy HA is only available for MySQL.
In the case of Postgres, it is expected that you do not see a difference between HA and non-HA, because Postgres does not use this Legacy HA; it uses the current HA scheme, which relies on a single regional disk. Non-HA also uses a single disk; the difference is that the disk is zonal.
In other words, both HA and non-HA use a single disk; what changes is the scope of the disk (regional or zonal).
Finally, since it is only a disk, there is no need to replicate as with a Read Replica or Legacy HA.

Differences between AWS Read Replicas and the Standby instances

Can anyone elaborate on the difference between AWS Read Replicas and the readable Standby instances which AWS has offered recently?
I assume you're talking about the Readable Standby Instances available in Preview at the time of writing this.
Compared to the traditional read replicas, the main difference is the kind of replication involved. Replication to read replicas happens asynchronously. That means read replicas aren't necessarily up to date with the main database. This is something your workload needs to be able to deal with if you want to use that.
Readable standby instances on the other hand use synchronous replication. When you read from one of those instances your data will be up to date.
There are also a couple of other differences between the capabilities, but some things aren't finalised yet. The main difference is the kind of replication.
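One practical consequence of the asynchronous replication is that RDS exposes a per-replica ReplicaLag CloudWatch metric you can monitor to see how far behind a traditional read replica is. A small boto3 sketch, with a placeholder replica identifier:

    from datetime import datetime, timedelta, timezone
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)

    # Average replication lag (in seconds) over the last hour, in 5-minute buckets.
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="ReplicaLag",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-read-replica"}],  # placeholder
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], f"{point['Average']:.1f}s behind")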

AWS RDS Read Replica acting as a Failover Standby

I am currently assessing whether to use RDS MySQL Multi-AZ or Single AZ with Read Replica.
Considerations are budget and performance: since Multi-AZ costs twice as much as Single-AZ and has no ability to offload read operations, Single-AZ with a Read Replica seems to be the logical choice.
However, I saw that there is a way to manually 'promote' the Read Replica to master in the event of the master's failure. Is there a way to automate this?
Note: There was a similar question but it did not address my question:
Read replicas in RDS AWS
I think the problem is that you are a bit confused by these features. Let me help. You can launch AWS RDS in Multi-AZ deployment mode. In this case, AWS will do the following:
It will allocate a DNS record for you. This DNS record represents a single entry point to your master database, which, let's assume, is currently active and able to serve connections.
In the case of a master failure for any reason, AWS will simply repoint the address hidden behind the DNS record (quite fast, within 1-2 minutes) to your standby, which is located in another AZ.
When the old master becomes available again, your standby, which has been serving the writes, needs to synchronize everything back with it. You do not need to take care of this - AWS will manage it for you.
In the case of a read replica:
AWS will allocate you 2 different DNS records - one for the master, another for the read replica. The read replica can be in the same AZ as the master, or even in another Region.
You can, and must, choose in your application which DNS name to use in different scenarios. Most probably you will have 2 different connection pools - one for the master, another for the read replica. Replication itself is asynchronous.
With a read replica, AWS handles the replication on its own - you do not need to worry about it. But since the replica is read-only, AWS does not, by nature, solve the problem of synchronizing the replica back to the master, because the replica is meant to be read-only and should not accept any write traffic.
Addressing your question directly:
Technically, you can try to make your read replica serve as a failover target, but in that case you will have to implement a custom solution for synchronizing it back with the master, because during the time the master was down your promoted read replica will certainly have received some writes. AWS does not solve this synchronization problem for you.
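For what it's worth, the promotion itself is a single API call, so the usual 'automation' is a small Lambda triggered by whatever health check you trust. A hedged sketch with placeholder identifiers; it deliberately leaves the resynchronization problem described above to you:

    import boto3

    rds = boto3.client("rds")

    def handler(event, context):
        """Hypothetical Lambda handler invoked by your own failure-detection alarm."""
        # Promotion detaches the replica from the old master and makes it writable.
        rds.promote_read_replica(DBInstanceIdentifier="my-read-replica")  # placeholder name
        # You still have to repoint the application (e.g. update a DNS record) and
        # rebuild replication from the old master yourself once it comes back.
        return {"promoted": "my-read-replica"}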
Regarding Multi-AZ: you cannot use your Multi-AZ standby as a read replica, since that is not supported in AWS. I highly recommend checking out this documentation. I think it will help you sort things out. Have a nice day!

Does AWS take down each availability zone (AZ) or the whole region for maintenance?

AWS has a maintenance window for each region: https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/maintenance-window.html, but I could not find any documentation about how it works with multiple AZs in the same region.
I have a Redis cache configured with a replica in a different AZ in the same region. The whole purpose of configuring the replica in a different AZ is that if one AZ is not available, traffic is served from the other AZ.
When they do maintenance, do they take down the whole region or individual availability zones?
You should read the FAQ on ElastiCache maintenance https://aws.amazon.com/elasticache/elasticache-maintenance/
This says that if you have a multi-AZ deployment, it will take down the instances one at a time, triggering a failover to the read replica, and then create new instances before taking down the rest, so you should not experience any interruption in your service.
Thanks @morras for the above link, which explains how ElastiCache handles the maintenance window. Below are three questions I have taken from that link, with an explanation of each.
1. How long does a node replacement take?
A replacement typically completes within a few minutes. The replacement may take longer in certain instance configurations and traffic patterns. For example, Redis primary nodes may not have enough free memory and may be experiencing high write traffic. When an empty replica syncs from this primary, the primary node may run out of memory trying to handle the incoming writes as well as sync the replica. In that case, the master disconnects the replica and restarts the sync process. It may take multiple attempts for the replica to sync successfully. It is also possible that the replica may never sync if the incoming write traffic remains high.
Memcached nodes do not need to sync during replacement and are always replaced fast irrespective of node sizes.
2. How does a node replacement impact my application?
For Redis nodes, the replacement process is designed to make a best effort to retain your existing data and requires successful Redis replication. For single node Redis clusters, ElastiCache dynamically spins up a replica, replicates the data, and then fails over to it. For replication groups consisting of multiple nodes, ElastiCache replaces the existing replicas and syncs data from the primary to the new replicas. If Multi-AZ or Cluster Mode is enabled, replacing the primary triggers a failover to a read replica. If Multi-AZ is disabled, ElastiCache replaces the primary and then syncs the data from a read replica. The primary will be unavailable during this time.
For Memcached nodes, the replacement process brings up an empty new node and terminates the current node. The new node will be unavailable for a short period during the switch. Once switched, your application may see performance degradation while the empty new node is populated with cache data.
3. What best practices should I follow for a smooth replacement experience and minimize data loss?
For Redis nodes, the replacement process is designed to make a best effort to retain your existing data and requires successful Redis replication. We try to replace just enough nodes from the same cluster at a time to keep the cluster stable. You can provision primary and read replicas in different availability zones. In this case, when a node is replaced, the data will be synced from a peer node in a different availability zone. For single node Redis clusters, we recommend that sufficient memory is available to Redis, as described here. For Redis replication groups with multiple nodes, we also recommend scheduling the replacement during a period with low incoming write traffic.
For Memcached nodes, schedule your maintenance window during a period with low incoming write traffic, test your application for failover and use the ElastiCache provided "smarter" client. You cannot avoid data loss as Memcached has data purely in memory.
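To follow the 'low incoming write traffic' advice in point 3, the maintenance window is just a setting you can change. A hedged boto3 sketch with a placeholder replication group and window:

    import boto3

    elasticache = boto3.client("elasticache")

    # Confirm the group is Multi-AZ with automatic failover, so node replacement
    # fails over to a replica instead of taking the primary offline.
    group = elasticache.describe_replication_groups(ReplicationGroupId="my-redis")["ReplicationGroups"][0]
    print("multi-AZ:", group.get("MultiAZ"), "| automatic failover:", group.get("AutomaticFailover"))

    # Move the maintenance window to a low-traffic slot (times are UTC).
    elasticache.modify_replication_group(
        ReplicationGroupId="my-redis",                     # placeholder
        PreferredMaintenanceWindow="sun:03:00-sun:04:00",  # placeholder quiet slot
    )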

Scaling Up an Elasticache Instance?

I'm currently running a site which uses Redis through Elasticache. We want to move to a larger instance with more RAM since we're getting to around 70% full on our current instance type.
Is there a way to scale up an Elasticache instance in the same way a RDS instance can be scaled?
Alternatively, I wanted to create a replication group and add a bigger instance to it. Then, once it's replicated and running, promote the new instance to be the master. This doesn't seem possible through the AWS console, as the replicas are created with the same instance type as the primary node.
Am I missing something, or is it simply a use case which can't be achieved? I understand that I can start a bigger instance and manually deal with replication, then move the web servers over to use the new server, but this would require some downtime due to DNS migration, etc.
Thanks!
Alan
Elasticache feels more like a cache solution in the memcached sense of the word, meaning that to scale up, you would indeed fire up a new cluster and switch your application over to it. Performance will degrade for a moment because the cache would have to be rebuilt, but nothing more.
For many people (I suspect you included), however, Redis is more of a NoSQL database solution in which data loss is unacceptable. Amazon offers the read replicas as a "solution" to that problem, but it's still a bit iffy. Of course, it offers replication to reduce the risk of data loss, but it's still nowhere near as production-safe (or mature) as RDS for a Redis database (as opposed to a cache, for which it's quite perfect), which offers backup and restore procedures, as well as well-structured change management to support scaling up. To my knowledge, ElastiCache does not support changing the instance type for a running cluster. This suggests that it's merely an in-memory solution that would lose all its data on reboot.
I'd go as far as saying that if data loss concerns you, you should look at a self-rolled Redis solution instead of simply using ElastiCache. Not only is it marginally cheaper to run, it would enable you to change the instance type like you would on any other EC2 instance (after stopping it, of course). It would also enable you to use RDB or AOF persistence.
You can now scale up to a larger node type while ElastiCache preserves your existing data:
https://aws.amazon.com/blogs/aws/elasticache-for-redis-update-upgrade-engines-and-scale-up/
Yes, you can scale a running ElastiCache instance up to a larger node type in place. I've tested it and experienced very little actual downtime (a few seconds at first, and it was back online very quickly, even though the console showed the process taking roughly a few minutes to finish). I went from a t2.micro to an m3.medium with no problem.
You can scale up or down:
Go to the ElastiCache service
Select the cluster
From the Actions menu at the top, choose Modify
Modify the Node Type
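The same modification can be scripted; a minimal boto3 sketch, roughly equivalent to the console steps above, with placeholder identifiers and node type:

    import boto3

    elasticache = boto3.client("elasticache")

    # Change the node type of every node in the replication group in place.
    elasticache.modify_replication_group(
        ReplicationGroupId="my-redis",    # placeholder
        CacheNodeType="cache.m6g.large",  # placeholder target node type
        ApplyImmediately=True,            # apply now rather than in the maintenance window
    )

    # For a standalone node (no replication group) the equivalent call is
    # modify_cache_cluster(CacheClusterId=..., CacheNodeType=..., ApplyImmediately=True).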
If you have a cluster (cluster mode enabled), you can add more shards, decrease the number of shards, rebalance slot distributions, or add more read replicas. Just click on the cluster itself and you will see those options.
Be aware that when you delete shards, ElastiCache automatically redistributes the data to the remaining shards, so it will affect traffic and can overload the other shards; when you try to delete a shard you will get a warning about this.
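Shard changes can be scripted as well; a hedged sketch of online resharding with boto3, with a placeholder group id and target shard count:

    import boto3

    elasticache = boto3.client("elasticache")

    # Grow the cluster to 4 shards; slots are rebalanced online across the shards.
    elasticache.modify_replication_group_shard_configuration(
        ReplicationGroupId="my-redis-cluster",  # placeholder
        NodeGroupCount=4,                       # target number of shards
        ApplyImmediately=True,
    )

    # Shrinking works the same way, but you must also pass NodeGroupsToRemove (or
    # NodeGroupsToRetain) so ElastiCache knows which shards to drain and remove.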
If you still need more help, please feel free to leave a comment and I would be more than happy to help.