Can I upgrade Elasticache Redis Engine Version without downtime? - amazon-web-services

I cannot find any information in the AWS documentation about whether modifying the Redis engine version will cause downtime. It does not explain how the upgrade occurs, other than that it is performed in the maintenance window.
Is it safe to upgrade a production Elasticache Redis instance via the AWS console without loss of data or downtime?
Note: The client library we use is compatible with all versions of Redis so the application should not notice the upgrade.

Changing a cache engine version is a disruptive process which clears all cache data in the cluster. **
I don't know the requirements of your particular application, but if you can't lose your data and you need to do a major version upgrade, it would be advisable to migrate to a new cluster rather than upgrading the current setup.
** http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/VersionManagement.html

I am not sure whether the answers are still relevant given that the question was asked nearly 7 years ago, but there are a few things to add.
Changing the node type or engine version is a Modify action, and your data remains intact on your Elasticache clusters. I believe there was a doc that described how Elasticache modifications take place (if I find it, I will link it here).
Essentially, Elasticache launches a new node on the backend with the modifications you've made and copies your data to it. Suppose the modification you make is a change in the engine version from 5.x to 6.x -
Elasticache will launch new Redis nodes on the backend with Engine 6.x.
As the node comes up, Elasticache will read keys from the 5.x node and copy data to 6.x
When the copy is complete, Elasticache will make a change in the DNS records for your cluster's endpoints.
So there will be some downtime, depending on your application's DNS cache TTL config. For example, if your application holds the DNS cache for 300 seconds, it can take up to 300 seconds to refresh the DNS cache on your application/client machine, and during that time the application might show some errors.
From the Elasticache side, I do not think they provide any official SLA for this, but this doc [1] mentions it will only take a "few seconds" for the DNS to propagate (depending on engine versions).
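If you want a rough idea of how long clients might hang on to the old record, you can check the TTL that the cluster endpoint's DNS record is served with (the hostname below is just a placeholder for your own endpoint):
# Look at the TTL (second column) returned for the cluster endpoint
dig +noall +answer my-cluster.abc123.ng.0001.use1.cache.amazonaws.com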
Still, you can always take a snapshot of your cluster as a backup. If anything goes south, you can use the snapshot to launch a new cluster with the same data.
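For example, taking a manual backup before the change is a single call (the identifiers are placeholders for your own cluster):
# Manual backup of a single-node cluster; use --replication-group-id instead for a group with replicas
aws elasticache create-snapshot \
    --cache-cluster-id my-redis-cluster \
    --snapshot-name pre-upgrade-backup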
Also, one more thing - AWS will never upgrade your engine versions by themselves. The maintenance window for your Elasticache cluster is for security patches and small optimizations on the engines; these do not affect the engine version.
Cheers!
[1] https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/AutoFailover.html

As mentioned by Will above, the AWS answer has changed, and in theory you can do it without downtime. See:
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/VersionManagement.html
The key points, in terms of downtime and impact on existing use, are:
The Amazon ElastiCache for Redis engine upgrade process is designed to make a best effort to retain your existing data and requires successful Redis replication.
...
For single Redis clusters and clusters with Multi-AZ disabled, we recommend that sufficient memory be made available to Redis as described in Ensuring That You Have Enough Memory to Create a Redis Snapshot. In these cases, the primary is unavailable to service requests during the upgrade process.
...
For Redis clusters with Multi-AZ enabled, we also recommend that you schedule engine upgrades during periods of low incoming write traffic. When upgrading to Redis 5.0.5 or above, the primary cluster continues to be available to service requests during the upgrade process. When upgrading to Redis 5.0.4 or below, you may notice a brief interruption of a few seconds associated with the DNS update.
There are no guarantees here, so you will have to make your own decision about the risk of losing data if it fails.
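For reference, the upgrade itself is just a modify call against the cluster or replication group; a sketch (the identifiers, engine version, and parameter group below are placeholders you would adjust for your setup):
# Upgrade the engine version in place; --apply-immediately skips waiting for the maintenance window
aws elasticache modify-replication-group \
    --replication-group-id my-redis-cluster \
    --engine-version 6.2 \
    --cache-parameter-group-name default.redis6.x \
    --apply-immediately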

It depends on your current version:
When upgrading to Redis 5.0.6 or above, the primary cluster continues to be available to service requests during the upgrade process (source).
Starting with Redis engine version 5.0.5, you can upgrade your cluster version with minimal downtime. The cluster is available for reads during the entire upgrade and is available for writes for most of the upgrade duration, except during the failover operation, which lasts a few seconds (source); the cluster is available for reads during engine upgrades, and writes are interrupted only for < 1 sec with version 5.0.5 (source).
You can also upgrade ElastiCache clusters with versions earlier than 5.0.5. The process involved is the same but may incur a longer failover time during DNS propagation (30s-1m) (source, source).

Related

How Does Container Optimized OS Handle Security Updates?

If there is a security patch for Google's Container Optimized OS itself, how does the update get applied?
Google's information on the subject is vague
https://cloud.google.com/container-optimized-os/docs/concepts/security#automatic_updates
Google claims the updates are automatic, but how?
Do I have to set a config option to update automatically?
Does the node need to have access to the internet, where is the update coming from? Or is Google Cloud smart enough to let Container Optimized OS update itself when it is in a private VPC?
Do I have to set a config option to update automatically?
The automatic update behavior for Compute Engine (GCE) Container-Optimized OS (COS) VMs (i.e. those instances you created directly from GCE) is controlled via the "cos-update-strategy" GCE metadata. See the documentation here.
The current documented default behavior is: "If not set all updates from the current channel are automatically downloaded and installed."
The download will happen in the background, and the update will take effect when the VM reboots.
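So by default nothing needs to be configured, but if I read the linked docs correctly you can control (or disable) the behavior by setting that metadata key yourself; a sketch, with the instance name and zone as placeholders:
# Turn automatic COS updates off for one instance
gcloud compute instances add-metadata my-cos-vm \
    --zone us-central1-a \
    --metadata cos-update-strategy=update_disabled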
Does the node need to have access to the internet, where is the update coming from? Or is Google Cloud smart enough to let Container Optimized OS update itself when it is in a private VPC?
Yes, the VM needs access to the internet. If you disable all egress network traffic, COS VMs won't be able to update themselves.
When operated as part of Kubernetes Engine, the auto-upgrade functionality of Container Optimized OS (cos) is disabled. Updates to cos are applied by upgrading the image version of the nodes using the GKE upgrade functionality – upgrade the master, followed by the node pool, or use the GKE auto-upgrade features.
The guidance on upgrading a Kubernetes Engine cluster describes the upgrade process used for manual and automatic upgrades: https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster.
In summary, the following process is followed:
Nodes have scheduling disabled (so they will not be considered for scheduling new pods admitted to the cluster).
Pods assigned to the node under upgrade are drained. They may be recreated elsewhere if attached to a replication controller or equivalent manager which reschedules a replacement, and there is cluster capacity to schedule the replacement on another node.
The node's Compute Engine instance is upgraded with the new cos image, using the same name.
The node is started, re-added to the cluster, and scheduling is re-enabled. (Barring some conditions, most pods will not automatically move back.)
This process is repeated for subsequent nodes in the cluster.
When you run an upgrade, Kubernetes Engine stops scheduling, drains, and deletes all of the cluster's nodes and their Pods one at a time. Replacement nodes are recreated with the same name as their predecessors. Each node must be recreated successfully for the upgrade to complete. When the new nodes register with the master, Kubernetes Engine marks the nodes as schedulable.
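If you want to trigger that rolling process manually rather than wait for auto-upgrade, it can be started from the CLI; roughly (cluster, node pool, and zone names are placeholders):
# Upgrade the control plane first
gcloud container clusters upgrade my-cluster --master --zone us-central1-a
# Then roll the node pool, which recreates each node with the newer COS image
gcloud container clusters upgrade my-cluster --node-pool default-pool --zone us-central1-a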

Migrating Redis to AWS Elasticache with minimal downtime

Let's start by listing some facts:
Elasticache can't be a slave of my existing Redis setup. Real shame, that would be so much more efficient.
I have only one Redis server to migrate, with roughly 3gb of data.
Downtime must be less than 10 mins. I assume the usual "stop the site, stop redis, provision cluster with snapshot" will take longer than this.
Similar to this question: How do I set an elasticache redis cluster as a slave?
One idea on how this might work:
Set Redis to use an AOF and trigger BGSAVE at the same time.
When BGSAVE finishes, provision the Elasticache cluster with RDB seed.
Stop the site and shut down my local Redis instance.
Use an aof-replay tool to replay the AOF into Elasticache.
Start the site again, pointed at the Elasticache cluster.
My questions:
How can I guarantee that my AOF file begins at exactly the point the RDB file ends, and that no data will be written in between?
Is there an AOF tool supported by the maintainers of Redis, or are they all third-party solutions, and therefore (potentially) of questionable reliability?*
* No offence intended to any authors of such tools, I'm sure they're great, I just feel much more confident using a tool written by the same team as the product to avoid potential compatibility bugs.
I have only one Redis server to migrate, with roughly 3gb of data
I would halt, save the Redis dump to S3, and then load it into a new cluster.
I'm guessing 10 mins to save the file and get it into s3.
10 minutes to just launch an elasticache cluster from that data.
Leaves you ten extra minutes to configure and test.
But there is a simple way of knowing EXACTLY how long.
Do a test migration of it.
DON'T stop your live system.
Run BGSAVE and get a dump of your Redis (leave everything running as normal).
Move the dump to S3.
Launch an Elasticache cluster from it.
Take DETAILED notes, TIME each step, copy the commands to a notepad window.
Put it all in a Word/Excel document so you have a migration runbook. That way you know how long it takes and there are no surprises. Let us know how it goes.
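A minimal sketch of that dry run, assuming shell access to the source box and an S3 bucket you control (all names are placeholders, and the bucket must grant ElastiCache read access to the object):
# On the source server: fork a background dump without stopping Redis
redis-cli BGSAVE
# Check when the dump finished, then ship it to S3
redis-cli LASTSAVE
aws s3 cp /var/lib/redis/dump.rdb s3://my-migration-bucket/dump.rdb
# Seed a brand-new ElastiCache cluster from the RDB file in S3
aws elasticache create-cache-cluster \
    --cache-cluster-id redis-migration-test \
    --engine redis \
    --cache-node-type cache.m5.large \
    --num-cache-nodes 1 \
    --snapshot-arns arn:aws:s3:::my-migration-bucket/dump.rdb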
ElastiCache has online migration support. You can use the start-migration API to start migration from self managed cluster to ElastiCache cluster.
aws elasticache start-migration --replication-group-id <ElastiCache Replication Group Id> --customer-node-endpoint-list "Address='<IP Address>',Port=<Port>"
The input to the API is your ElastiCache replication group id and the IP and port of the master of your self-managed cluster. You need to ensure that the IP address is accessible from the ElastiCache node. (An example IP address would be the private IP address of the master of your self-managed cluster.) This API will make the master node of the ElastiCache cluster call 'SLAVEOF' on the master of your self-managed cluster. This establishes a replication stream and starts migrating data from the self-managed cluster to the ElastiCache cluster. During migration, the master of the ElastiCache cluster will stop accepting writes sent to it directly. You can start using the ElastiCache cluster from your application for reads.
Once you have all your data in the ElastiCache cluster, you can use the complete-migration API to stop the migration. This API will stop the replication from the self-managed cluster to the ElastiCache cluster.
aws elasticache complete-migration --replication-group-id <ElastiCache Replication Group Id>
After this, the master of the ElastiCache cluster will start accepting writes. You can start using ElastiCache cluster from your application for both read and write.
There are a few limitations to be aware of with this migration method:
An existing or newly created ElastiCache deployment should meet the following requirements for migration:
It has cluster mode disabled and uses Redis engine version 5.0.5 or higher.
It doesn't have either encryption in-transit or encryption at-rest enabled.
It has Multi-AZ with Auto-Failover enabled.
It has sufficient memory available to fit the data from your Redis on EC2 instance. To configure the right reserved memory settings, see Managing Reserved Memory.
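Before calling complete-migration, it is worth sanity-checking that the replication link is established and caught up; for example (host names and IDs are placeholders):
# On the self-managed master, the ElastiCache primary should show up as a connected replica
redis-cli -h 10.0.1.25 -p 6379 INFO replication
# On the AWS side, check the replication group's status field
aws elasticache describe-replication-groups --replication-group-id my-redis-cluster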
There are a few ways to migrate the data without downtime. They are harder to achieve though.
You could have your app write to two Redis instances simultaneously - one of which would be on EC. Once the caches are both 'warm', you could just restart your app and read from the EC cache.
You could initially migrate to EC2 instead of EC. Not really what you were hoping to hear, I imagine, but this is easy to do because you can set EC2 as a slave of your Redis instance (see the example after this list). Also, migrating from EC2 to EC is somewhat easier (the data is already on AWS), so there's a benefit for users with huge sets of data.
You could, in theory, intercept the commands from the client and send them to EC, thus effectively "replicating". But this requires some programming (I don't believe a tool like this exists at the moment) and would be hard with multiple, ephemeral clients.
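For the second option, making an EC2-hosted Redis a replica of the existing server is a one-liner (host names are placeholders; newer Redis versions also accept REPLICAOF):
# Run against the new EC2 Redis instance, pointing it at the current master
redis-cli -h ec2-redis.internal -p 6379 SLAVEOF 10.0.0.10 6379
# Later, promote it to a standalone master before cutting over
redis-cli -h ec2-redis.internal -p 6379 SLAVEOF NO ONE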

Cassandra Datastax AMI on EC2 - Recover from "Stop"/"Start"

We're looking for the best way to deploy a small production Cassandra cluster (community) on EC2. For performance reasons, all recommendations are to avoid EBS.
But when deploying the Datastax-provided AMI with ephemeral storage, the instance dies permanently whenever the ephemeral storage is wiped out. A stop + start (done manually, or sometimes triggered by AWS for maintenance) will render the instance unusable.
OpsCenter fails to fix the instance after a reboot and the instance does not recover on its own.
I'd expect the instance to launch itself back up, run some script to detect that the ephemeral storage is wiped, and sync with the cluster. Since it does not, the AMI looks appropriate only for dev tasks.
Can anyone please help us understand what the alternative is? We can live with a momentary loss of a node thanks to replication, but if the node never recovers and a new cluster is required, this looks like a dead end for a production environment.
Is there a way to install Cassandra on EC2 so that it will recover from an ephemeral storage loss?
If we buy a license for an enterprise edition will this problem go away?
Does this mean that in spite of poor performance, EBS (optimized) with PIOPS is the best way to run Cassandra on AWS?
Is the recommendation to just avoid stopping + starting the instance and hope that AWS will not retire or reallocate their host machine? What is the recommendation in this case?
What about AWS rolling updates? Upgrading one machine (killing it) and starting it again, then proceeding to the next machine, will erase all cluster data, since the machines will be responsive (unlike Cassandra on them). That way it can destroy a small (e.g. 3-node) cluster.
Has anyone had good experience with paid services such as Instacluster?
New docs from Datastax actually indicate that EBS-optimized GP2 SSD-backed instances can be used for production workloads. With EBS-backed storage, you can easily do snapshots, which virtually eliminate the chance of data loss on a node, and nodes can easily be migrated to a new host by a simple stop/start.
With ephemeral, you basically have to plan around failure, consider if your entire cluster is in a single region (SimpleSnitch) and that region goes down.
http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningEC2.html
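For what it's worth, snapshotting the data volume of an EBS-backed node is a single call (the volume ID is a placeholder):
# Point-in-time snapshot of the node's data volume
aws ec2 create-snapshot \
    --volume-id vol-0abc1234def567890 \
    --description "cassandra data volume backup"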

No downtime upgrade Elasticache from Redis 2.6 to 2.8

I'm trying to determine whether updating my Elasticache cluster to use Redis 2.8 instead of 2.6 will cause Elasticache downtime. Ideally the upgrade would occur during the cluster's scheduled downtime, but I can't seem to find any documentation on what will happen when I tell the cluster to upgrade.
Has anyone gone through this yet?
Yes, we have done an upgrade to our Redis Elasticache. We made a snapshot of the current cache, then spawned a new instance from that snapshot. We pointed the servers to use the new cache, then proceeded with the upgrade on the old cache. It did not take much time and the data was still intact, but I cannot guarantee it, because the prompt before the upgrade stated that data might be lost during the upgrade.
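That snapshot-and-spawn approach boils down to two CLI calls; a sketch (the identifiers, node type, and exact 2.8 version are placeholders):
# Snapshot the existing 2.6 cluster
aws elasticache create-snapshot --cache-cluster-id redis-26 --snapshot-name redis-26-backup
# Launch a new 2.8 cluster seeded from that snapshot, then repoint the application at it
aws elasticache create-cache-cluster \
    --cache-cluster-id redis-28 \
    --engine redis \
    --engine-version 2.8.24 \
    --cache-node-type cache.m3.medium \
    --num-cache-nodes 1 \
    --snapshot-name redis-26-backup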

Downgrade Amazon RDS Multi AZ deployment to Standard Deployment

What might happen if I downgrade my Multi-AZ deployment to a standard deployment? Is there any possibility of an I/O freeze or data loss? If yes, what might be the proper way to minimize downtime of data availability?
I have tried downgrading from Multi AZ deployment to a standard deployment.
The entire process took around 2-3 minutes (The transition time should depend upon your database size). The transition was seamless. We did not experience any downtime. Our website was working as expected during this period.
Just to ensure that nothing gets affected, I took a snapshot and a manual database dump before downgrading.
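For reference, the downgrade itself (with a safety snapshot first) can be done from the CLI like this; the instance and snapshot identifiers are placeholders:
# Take a manual snapshot first, just in case
aws rds create-db-snapshot --db-instance-identifier mydb --db-snapshot-identifier mydb-pre-downgrade
# Convert the instance from Multi-AZ to a standard single-AZ deployment
aws rds modify-db-instance --db-instance-identifier mydb --no-multi-az --apply-immediately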
Hope this helps.