Cassandra Datastax AMI on EC2 - Recover from "Stop"/"Start" - amazon-web-services

We're looking for the best way to deploy a small production Cassandra cluster (community) on EC2. For performance reasons, all recommendations are to avoid EBS.
But when deploying the Datastax-provided AMI with ephemeral storage, the instance dies permanently whenever the ephemeral storage is wiped. A stop and start, whether done manually or triggered by AWS for maintenance, renders the instance unusable.
OpsCenter fails to fix the instance after a reboot and the instance does not recover on its own.
I'd expect the instance to launch itself back up, run some script to detect that the ephemeral storage is wiped, and sync with the cluster. Since it does not, the AMI looks appropriate only for dev tasks.
Can anyone please help us understand what is the alternative? We can live with a momentary loss of a node due to replication but if the node never recovers and a new cluster is required this looks like a dead end for a production environment.
Is there a way to install Cassandra on EC2 so that it will recover from an ephemeral storage loss?
If we buy a license for an enterprise edition will this problem go away?
Does this mean that, in spite of poor performance, EBS (optimized) with PIOPS is the best way to run Cassandra on AWS?
Is the recommendation to just avoid stopping + starting the instance and hope that AWS will not retire or reallocate their host machine? What is the recommendation in this case?
What about an AWS rolling update? Upgrading one machine (killing it), starting it again, and then proceeding to the next machine will erase all cluster data, because each machine comes back responsive even though Cassandra on it does not. That way it can destroy a small (e.g. 3-node) cluster.
Has anyone had good experience with paid services such as Instaclustr?

New docs from Datastax actually indicate that EBS-optimized instances backed by GP2 SSD volumes can be used for production workloads. With EBS-backed storage you can easily take snapshots, which virtually eliminates the chance of data loss on a node, and a node can be migrated to a new host with a simple stop/start.
With ephemeral storage, you basically have to plan around failure: consider what happens if your entire cluster is in a single region (SimpleSnitch) and that region goes down.
http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningEC2.html
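As a rough sketch of the snapshot step mentioned above, the following flushes memtables and then snapshots the EBS data volume of one node. The volume ID is a hypothetical placeholder, and the script only prints the commands unless DRY_RUN=0 is set. Treat it as an illustration under those assumptions, not a complete backup procedure.

```shell
#!/bin/sh
# Sketch: snapshot the EBS data volume of a single Cassandra node.
# VOLUME_ID is a hypothetical placeholder; substitute your own volume.
VOLUME_ID="${VOLUME_ID:-vol-0123456789abcdef0}"

# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

snapshot_node() {
  # Flush memtables to SSTables so the on-disk state is as current as possible.
  run nodetool flush
  # Start a point-in-time snapshot of the data volume.
  run aws ec2 create-snapshot --volume-id "$VOLUME_ID" --description "cassandra-data-volume"
}

snapshot_node
```

A snapshot taken this way captures one node at one moment, so it complements replication rather than replacing it.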

Related

How to increase RAM size and database storage capacity in AWS

I have an AWS Linux-based server running one project, and now I want to deploy another project on the same server. I want to know whether my existing memory is enough or whether I have to increase the memory limit, and if so, how to increase it.
Please refer to the images below for the available memory space.
There are two approaches to using a database in AWS.
You can install the database on the Amazon EC2 instance. You will then be responsible for configuring and maintaining the database and doing backups. The up-side is that it can run on the same EC2 instance as your application.
Or, you can use Amazon RDS to provide a database. Amazon RDS can install, configure and operate the database for you, including taking backups. It runs on a separate computer so there are additional costs involved, but there are many benefits to keeping a database separate from the application, such as allowing you to scale your application separately to the database. Large applications often run across multiple computers and they can all connect to the one database on Amazon RDS.
From your description, it looks like you are going with the first option. You can increase the disk capacity of the Amazon EC2 instance by increasing the size of the Amazon EBS disk volume (and then do a reboot). If you desire more RAM, then Stop the instance, change the Instance Type to something larger, then Start the instance again.
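The stop / change type / start sequence and the volume resize described above might look roughly like this with the AWS CLI. The instance ID, volume ID, instance type, and size are placeholders chosen for illustration, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: more RAM via a larger instance type, more disk via a larger EBS volume.
# All IDs and sizes below are hypothetical placeholders.
INSTANCE_ID="${INSTANCE_ID:-i-0123456789abcdef0}"
VOLUME_ID="${VOLUME_ID:-vol-0123456789abcdef0}"
NEW_TYPE="${NEW_TYPE:-t3.large}"
NEW_SIZE_GB="${NEW_SIZE_GB:-100}"

# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

resize_instance() {
  # The instance type can only be changed while the instance is stopped.
  run aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
  run aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID"
  run aws ec2 modify-instance-attribute --instance-id "$INSTANCE_ID" --instance-type "Value=$NEW_TYPE"
  run aws ec2 start-instances --instance-ids "$INSTANCE_ID"
}

grow_disk() {
  # Request a larger EBS volume; the filesystem still has to be extended afterwards.
  run aws ec2 modify-volume --volume-id "$VOLUME_ID" --size "$NEW_SIZE_GB"
}

resize_instance
grow_disk
```

Note that a stop/start discards any instance store (ephemeral) data and changes the public IP address unless an Elastic IP is attached. After the volume grows, the partition and filesystem usually need to be extended from inside the instance before the extra space is usable.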

How much outage time is involved in resizing an AWS RDS Multi-AZ instance?

We're implementing new SQL Server databases in AWS. Our cloud engineer recommended RDS, despite the known downsides (inability to restore a single database or copy out a single backup, inability to resize the instance or reconfigure storage without downtime). Meanwhile, if we implement on EC2 we could get the benefit of zero-downtime upgrades.
In further reading, it seems like Multi-AZ may avoid downtime when resizing (see samples below) but the documentation is vague.
"Running a DB instance with high availability can enhance availability during planned system maintenance"
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZ.html
"There is minimal downtime when you are scaling up on a Multi-AZ environment because the standby database gets upgraded first, then a failover will occur to the newly sized database."
https://aws.amazon.com/blogs/database/scaling-your-amazon-rds-instance-vertically-and-horizontally/
My question: does using Multi-AZ in RDS allow zero downtime when adding storage? If not, how much outage time would we experience when reconfiguring a Multi-AZ instance?
Multi-AZ doesn't do zero downtime, but we generally see less than a minute (with MySQL).
I would just create a new multi-AZ db from a snapshot, and test it to see. It shouldn't cost more than a buck or two to find out.
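A test along those lines could be sketched as follows. The snapshot name, test instance name, and storage size are placeholders, and the script prints the commands instead of running them unless DRY_RUN=0 is set. It assumes the engine and instance class of your snapshot support the `--multi-az` option on restore.

```shell
#!/bin/sh
# Sketch: restore a throwaway Multi-AZ instance from a snapshot and resize its storage.
# SNAPSHOT_ID, TEST_DB and NEW_STORAGE_GB are hypothetical placeholders.
SNAPSHOT_ID="${SNAPSHOT_ID:-my-prod-snapshot}"
TEST_DB="${TEST_DB:-resize-test}"
NEW_STORAGE_GB="${NEW_STORAGE_GB:-200}"

# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

resize_test() {
  run aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier "$TEST_DB" --db-snapshot-identifier "$SNAPSHOT_ID" --multi-az
  run aws rds wait db-instance-available --db-instance-identifier "$TEST_DB"
  # Apply the storage change now rather than in the next maintenance window.
  run aws rds modify-db-instance --db-instance-identifier "$TEST_DB" \
    --allocated-storage "$NEW_STORAGE_GB" --apply-immediately
  # List the last two hours of events (with timestamps) for the test instance.
  run aws rds describe-events --source-type db-instance \
    --source-identifier "$TEST_DB" --duration 120
}

cleanup_test() {
  run aws rds delete-db-instance --db-instance-identifier "$TEST_DB" --skip-final-snapshot
}

resize_test
```

The event timestamps give only a coarse picture. To measure the outage as the application would see it, run a simple connection probe against the test endpoint while the modification is in progress, then call `cleanup_test` to stop paying for the instance.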

Migrating Redis to AWS Elasticache with minimal downtime

Let's start by listing some facts:
Elasticache can't be a slave of my existing Redis setup. Real shame, that would be so much more efficient.
I have only one Redis server to migrate, with roughly 3 GB of data.
Downtime must be less than 10 mins. I assume the usual "stop the site, stop redis, provision cluster with snapshot" will take longer than this.
Similar to this question: How do I set an elasticache redis cluster as a slave?
One idea on how this might work:
Set Redis to use an AOF and trigger BGSAVE at the same time.
When BGSAVE finishes, provision the Elasticache cluster with RDB seed.
Stop the site and shut down my local Redis instance.
Use an aof-replay tool to replay the AOF into Elasticache.
Start the site again, pointed at the Elasticache cluster.
My questions:
How can I guarantee that my AOF file begins at exactly the point the RDB file ends, and that no data will be written in between?
Is there an AOF tool supported by the maintainers of Redis, or are they all third-party solutions, and therefore (potentially) of questionable reliability?*
* No offence intended to any authors of such tools, I'm sure they're great, I just feel much more confident using a tool written by the same team as the product to avoid potential compatibility bugs.
I have only one Redis server to migrate, with roughly 3gb of data
I would halt the site, save the Redis dump to S3, and then load it into a new cluster.
I'm guessing 10 mins to save the file and get it into s3.
10 minutes to just launch an elasticache cluster from that data.
Leaves you ten extra minutes to configure and test.
But there is a simple way of knowing EXACTLY how long.
Do a test migration of it.
DON'T stop your live system.
Run BGSAVE and get a dump of your Redis (leave everything running as normal).
Move the dump to S3.
Launch an Elasticache cluster from it.
Take DETAILED notes, TIME each step, and copy the commands to a notepad window.
Put it all in a Word/Excel document so you have a migration document. That way you know how long it takes and there are no surprises. Let us know how it goes.
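The rehearsal steps above could be timed with a small script like this one. The RDB path, bucket, cluster ID, and node type are placeholders, and the script prints the commands instead of running them unless DRY_RUN=0 is set. It also assumes ElastiCache has been granted read access to the uploaded object, which seeding from S3 requires.

```shell
#!/bin/sh
# Sketch of a timed migration rehearsal; the live Redis keeps running throughout.
# RDB_PATH, BUCKET and CLUSTER_ID are hypothetical placeholders.
RDB_PATH="${RDB_PATH:-/var/lib/redis/dump.rdb}"
BUCKET="${BUCKET:-my-migration-bucket}"
CLUSTER_ID="${CLUSTER_ID:-redis-rehearsal}"

# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Run one step and report how many seconds it took.
timed() {
  label="$1"; shift
  start=$(date +%s)
  run "$@"
  echo "$label: $(( $(date +%s) - start ))s"
}

# Block until the background save started by BGSAVE has finished.
wait_for_bgsave() {
  until redis-cli INFO persistence | grep -q 'rdb_bgsave_in_progress:0'; do
    sleep 1
  done
}

rehearse() {
  timed "bgsave-start" redis-cli BGSAVE
  timed "bgsave-wait" wait_for_bgsave
  timed "upload" aws s3 cp "$RDB_PATH" "s3://$BUCKET/dump.rdb"
  timed "create" aws elasticache create-cache-cluster \
    --cache-cluster-id "$CLUSTER_ID" --engine redis \
    --cache-node-type cache.m5.large --num-cache-nodes 1 \
    --snapshot-arns "arn:aws:s3:::$BUCKET/dump.rdb"
  timed "available" aws elasticache wait cache-cluster-available --cache-cluster-id "$CLUSTER_ID"
}

rehearse
```

With DRY_RUN=0 the per-step timings printed at the end of each step are the numbers to copy into the migration document.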
ElastiCache has online migration support. You can use the start-migration API to start migration from self managed cluster to ElastiCache cluster.
aws elasticache start-migration --replication-group-id <ElastiCache Replication Group Id> --customer-node-endpoint-list "Address='<IP Address>',Port=<Port>"
The input to the API is your ElastiCache replication group id and the IP and port of the master of your self managed cluster. You need to ensure that the IP address is accessible from ElastiCache node. (An example IP address would be the private IP address of the master of your self managed cluster). This API will make the master node of the ElastiCache cluster call 'SLAVEOF' on the master of your self managed cluster. This will establish a replication stream and will start migrating data from self-managed cluster to ElastiCache cluster. During migration, the master of the ElastiCache cluster will stop accepting writes sent to it directly. You can start using ElastiCache cluster from your application for reads.
Once you have all your data in ElastiCache cluster, you can use the complete-migration API to stop the migration. This API will stop the replication from self managed cluster to ElastiCache cluster.
aws elasticache complete-migration --replication-group-id <ElastiCache Replication Group Id>
After this, the master of the ElastiCache cluster will start accepting writes. You can start using ElastiCache cluster from your application for both read and write.
Be aware of the following limitations of this migration method:
An existing or newly created ElastiCache deployment should meet the following requirements for migration:
It's cluster-mode disabled using Redis engine version 5.0.5 or higher.
It doesn't have either encryption in-transit or encryption at-rest enabled.
It has Multi-AZ with Auto-Failover enabled.
It has sufficient memory available to fit the data from your Redis on EC2 instance. To configure the right reserved memory settings, see Managing Reserved Memory.
There are a few ways to migrate the data without downtime. They are harder to achieve though.
You could have your app write to two Redis instances simultaneously, one of which would be on ElastiCache. Once both caches are 'warm', you could just restart your app and read from the ElastiCache one.
You could initially migrate to EC2 instead of ElastiCache. That's not really what you were hoping to hear, I imagine, but it is easy to do because you can set a Redis instance on EC2 as a slave of your existing Redis instance. Also, migrating from EC2 to ElastiCache is somewhat easier (the data is already on AWS), so there's a benefit for users with huge data sets.
You could, in theory, intercept the commands from the client and send them to ElastiCache, thus effectively "replicating". But this requires some programming (I don't believe a tool like this exists at the moment) and would be hard with multiple, ephemeral clients.

AWS architecture help for running database dumps

I have MySQL running on one EC2 instance, and Tableau uses this database. mysqldump runs from production servers every 4 hours, during which the system is down for probably 10-15 minutes due to the dump. I am planning to run MySQL on a second EC2 instance and put an ELB in front of the two instances so that the system won't be down during the dump. For this I might have to de-register an instance from the ELB during the dump and register it back afterwards. Is this the right way to handle a situation like this?
You can't use an ELB with MySQL servers. The ELB wouldn't know which server was master and which was slave, so it wouldn't know which to send updates to.
Is there any reason you aren't using Amazon's RDS service for your database servers? It provides automated snapshots that don't cause any down-time. It also makes it easy to create a read-replica against which you could perform mysqldumps without affecting the main server.
Currently you are taking logical backups of your system every 4 hours. Logical backups in most cases should only be used in a worst-case scenario. In the event of a restore, logical backups are very slow compared to alternatives such as snapshots and binary backups. If snapshotting with Amazon RDS, or any of the many other alternatives out there, is not an option in your environment, I would look into Xtrabackup. This is a free, standalone, hot online binary backup tool that can be used with a vanilla install of MySQL. It should not bring down your production server, assuming you are using InnoDB and not an alternative storage engine such as MyISAM. I personally used it for hot online binary backups and to automate building slaves in my previous work environment. When restoring a binary backup the bottleneck is your network speed, and the restore is much faster than restoring a logical backup.
If setting up another MySQL instance is your only option look into GTID replication and/or Master-Passive HA environment in order to take the mysqldump off of the secondary non-active production server so that your production environment does not go down.
The bottom line is that you should not be taking production down to do a logical backup every 4 hours. This is definitely not ideal in any production environment.
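The two options described in this answer might look roughly like this on the command line. The backup directory and replica hostname are placeholders, credentials are assumed to come from an option file such as ~/.my.cnf, and the script prints the commands instead of running them unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: hot binary backup with Percona XtraBackup, or a logical dump from a replica.
# BACKUP_DIR and REPLICA_HOST are hypothetical placeholders.
BACKUP_DIR="${BACKUP_DIR:-/backups/full}"
REPLICA_HOST="${REPLICA_HOST:-mysql-replica.internal}"

# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

hot_backup() {
  # Copy the data files of a running InnoDB server without stopping it.
  run xtrabackup --backup --target-dir="$BACKUP_DIR"
  # Apply the redo log so the copied files are consistent and restorable.
  run xtrabackup --prepare --target-dir="$BACKUP_DIR"
}

replica_dump() {
  # Logical dump taken from the replica, so production is not touched.
  # --single-transaction gives a consistent InnoDB dump without locking tables.
  run mysqldump --host="$REPLICA_HOST" --single-transaction --all-databases \
    --result-file="$BACKUP_DIR/dump.sql"
}

hot_backup
replica_dump
```

Either function can be scheduled on its own; the dump from the replica keeps the existing mysqldump workflow while moving its load off the production server.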
Have a look at Amazon Database Migration Service (https://aws.amazon.com/dms/). It allows you to do zero-downtime database migration or just synchronization.

Cassandra on AWS

I'm new to AWS and also to Cassandra. I just read about the EBS and S3 storage available in AWS. I was trying to figure out which storage Cassandra would use if we install it on EC2: EBS or S3? Or is there other storage? I'm a little confused by this. Please help me understand.
Thanks
Aravind
You shouldn't run Cassandra on EBS, per the recommendation of Datastax itself:
"EBS volumes are not recommended for Cassandra data volumes for the following reasons:
EBS volumes contend directly for network throughput with standard packets. This means that EBS throughput is likely to fail if you saturate a network link.
EBS volumes have unreliable performance. I/O performance can be exceptionally slow, causing the system to back load reads and writes until the entire cluster becomes unresponsive.
Adding capacity by increasing the number of EBS volumes per host does not scale. You can easily surpass the ability of the system to keep effective buffer caches and concurrently serve requests for all of the data it is responsible for managing."
http://docs.datastax.com/en/cassandra/1.2/cassandra/architecture/architecturePlanningEC2_c.html
The answer above comes from Cassandra 1.2, a relatively old version. Documentation for newer versions of Cassandra indicates that EBS Optimized instances using GP2 SSD can be used for production workloads.
http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningEC2.html
Things that changed since then were the creation of EBS Optimized instances, which reduces and/or eliminates noisy neighbor throughput problems, and using GP2 SSD for EBS storage.
If you are just getting started, I would recommend EBS Optimized. The performance should be pretty good, and you gain a critical ability: creating snapshots. This reduces the risk of your instance becoming unstable, because you would have S3-backed volume snapshots for AWS to rebuild data from if a drive died.
This reduces the need to set up your Cassandra cluster across regions. One of the concerns that you have to build around when using ephemeral storage is a whole region potentially going down, which could wipe out your entire cluster if you didn't build a multi-region cluster. With EBS, this isn't really a concern.
For Cassandra you need to use EBS. S3 is an object store with an API to store and retrieve objects, but it has no easy querying mechanisms. Its use cases include backup and archiving, disaster recovery, static website hosting, etc.
However, you can use S3 for Cassandra backup.
You can also consider ephemeral disks (as Jeff mentions) and storage which comes with AWS instance.
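As a rough illustration of using S3 for Cassandra backups, the sketch below takes a `nodetool` snapshot and copies only the snapshot files to a bucket. The data directory, bucket, node name, and tag are assumptions for illustration, and the script prints the commands instead of running them unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: back up one Cassandra node's snapshot files to S3.
# DATA_DIR, BUCKET, NODE and TAG are hypothetical placeholders.
DATA_DIR="${DATA_DIR:-/var/lib/cassandra/data}"
BUCKET="${BUCKET:-my-cassandra-backups}"
NODE="${NODE:-node1}"
TAG="${TAG:-manual-backup}"

# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

backup_to_s3() {
  # Create hard-linked snapshot directories under each table's data directory.
  run nodetool snapshot -t "$TAG"
  # Upload only files that live in snapshots/<TAG>/, skipping live SSTables.
  run aws s3 sync "$DATA_DIR" "s3://$BUCKET/$NODE/" \
    --exclude "*" --include "*/snapshots/$TAG/*"
  # Remove the local snapshot so it does not keep consuming disk space.
  run nodetool clearsnapshot -t "$TAG"
}

backup_to_s3
```

Each node holds only part of the data, so a full cluster backup means running this on every node.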