70 TB Cassandra migration to AWS

We have a 70 TB cluster with around 200 keyspaces, and we are planning to move it to AWS. A few approaches we are considering:
Replace each node in the cluster with a node in AWS, one node at a time.
Create a new cluster in AWS, bulk-copy each keyspace, dual-write to both clusters, and cut over during a downtime window.
Are there better ways to do this? Could we use AWS as a new DC and migrate one keyspace at a time?

Yes, for a live migration you can use a hybrid cloud model and create a new DC in AWS. This is probably the best approach if you want to migrate the data without downtime, and you can do it keyspace-by-keyspace to manage the streaming I/O.
This blog article by Alain Rodriguez on Cassandra Data Center Switch provides a detailed walkthrough of how to do this.
Using AWS Snowball is a faster and cheaper approach if downtime is an option.

You can use AWS as a new cluster, but you need to be careful. SSTable formats are not compatible across all Cassandra versions, so verify SSTable compatibility between the versions on the two clusters. Another issue is that the migration can put a high load on your "old" cluster.
So I highly recommend that you start with these parameters set very low, to test what both your existing cluster and the AWS cluster can handle (they can also be adjusted live with nodetool; see the sketch after these parameters):
compaction_throughput_mb_per_sec (default: 16)
stream_throughput_outbound_megabits_per_sec (default: 200)
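If you would rather not edit cassandra.yaml and restart each node, both throttles can be changed live with nodetool. A minimal sketch, assuming placeholder host names and that nodetool is on the PATH (the settings revert to the cassandra.yaml values on the next restart):

import subprocess

# Placeholder node list; nodetool applies these settings per node, live.
nodes = ["onprem-node-1", "onprem-node-2"]
for host in nodes:
    # Throttle compaction to 8 MB/s and streaming to 50 Mb/s while testing.
    subprocess.run(["nodetool", "-h", host, "setcompactionthroughput", "8"], check=True)
    subprocess.run(["nodetool", "-h", host, "setstreamthroughput", "50"], check=True)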
Bootstrapping a new AWS node into your current cluster is not a good idea, because every bootstrap makes Cassandra redistribute the keys across the cluster, and you are left without a "plan B" if anything goes wrong.
Another good solution is to build a separate cluster in AWS (without connecting the two) and move the data with Spark. Simply moving the data without transforming it is very straightforward, and you stay in control of the process.
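For the Spark route, here is a minimal PySpark sketch of a straight table copy between the two unconnected clusters. The connector version, host names, and keyspace/table names are placeholders, and the target table must already exist with the same schema:

from pyspark.sql import SparkSession

# Straight copy of one table, no transformation.
spark = (SparkSession.builder
         .appName("cassandra-table-copy")
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
         .getOrCreate())

# Read from the on-premise cluster.
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .option("spark.cassandra.connection.host", "onprem-node-1")
      .options(keyspace="my_keyspace", table="my_table")
      .load())

# Write to the (pre-created) table on the AWS cluster.
(df.write.format("org.apache.spark.sql.cassandra")
 .option("spark.cassandra.connection.host", "aws-node-1")
 .options(keyspace="my_keyspace", table="my_table")
 .mode("append")
 .save())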

How to setup AWS RDS standalone instance without traffic from actual RDS cluster

We need to know the best options for setting up an AWS RDS instance (Aurora MySQL) that is standalone and does not receive traffic from the actual RDS cluster.
The requirement is for our data team to run analytical queries, but we do not want that to impact the actual application and DB performance. Hence we need a DB that always has near-live data, but which live traffic and the application do not connect to.
We need to know which fits better: a DB clone, AWS pilot light, AWS warm standby, AWS hot standby, or a multi-AZ configuration.
Kindly let us know which one would fit our requirement better.
We have so far read about the three options below:
1. Amazon Aurora DB clone: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Managing.Clone.html
2. AWS pilot light, warm standby, or hot standby: https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iii-pilot-light-and-warm-standby/
3. A multi-AZ configuration: we can create a new instance in a new AZ, so that this instance has a different host (kind of a failover strategy), where traffic to this instance comes from our queries and not from the live prod application, unless there is a failover.
Option 1, Aurora cloning, says:
Run workload-intensive operations, such as exporting data or running analytical queries on the clone.
...which seems to be your use case here.
Just be aware that the clone will not see any changes made to the original data after the clone is created. So you will need to periodically delete and re-clone to pick up the updated data; a scripted sketch of that follows.
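Because a clone is just a point-in-time restore with copy-on-write, the periodic delete-and-re-clone is easy to automate. A minimal sketch with boto3, where all identifiers and the instance class are placeholders:

import boto3

rds = boto3.client("rds")

# Aurora cloning is exposed as a point-in-time restore with
# RestoreType="copy-on-write".
rds.restore_db_cluster_to_point_in_time(
    DBClusterIdentifier="analytics-clone",
    SourceDBClusterIdentifier="prod-aurora-cluster",
    RestoreType="copy-on-write",
    UseLatestRestorableTime=True,
)

# A cloned cluster has no instances until you add one to query against.
rds.create_db_instance(
    DBInstanceIdentifier="analytics-clone-instance-1",
    DBClusterIdentifier="analytics-clone",
    DBInstanceClass="db.r6g.large",
    Engine="aurora-mysql",
)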
Regarding option 2: I wrote those blog posts, and I do not think that approach suits your use case. That approach is for disaster recovery.
Option 3 may work, with a modification: the concept here is to create an Aurora Replica, which as you say is a separate instance. The problem is the reader endpoint for your production workload; it may hit that instance (which is not what you want).
EDIT: Adding new option 4
Option 4. Check out Amazon Aurora zero-ETL integration with Amazon Redshift. This zero-ETL integration also enables you to analyze data from multiple Aurora database clusters in an Amazon Redshift cluster.
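If you explore option 4, the integration itself is created with a single API call. A hedged sketch with boto3, assuming a recent boto3 release that exposes the RDS CreateIntegration API; both ARNs are placeholders, and the Redshift namespace must separately authorize the integration:

import boto3

rds = boto3.client("rds")

# Zero-ETL integration from an Aurora cluster into a Redshift namespace.
rds.create_integration(
    IntegrationName="aurora-to-redshift-analytics",
    SourceArn="arn:aws:rds:us-east-1:123456789012:cluster:prod-aurora-cluster",
    TargetArn="arn:aws:redshift:us-east-1:123456789012:namespace:11111111-2222-3333-4444-555555555555",
)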

AWS containerised apps and database on same Redshift cluster

I have a simple question for someone with experience with AWS, but I am getting a little confused with the terminology and don't know which node type to purchase.
At my company we currently have a Postgres DB that we insert into continuously.
We probably insert ~600M rows a year at the moment, but we would like to be able to scale up.
Each row is basically a timestamp, two floats, one int, and one enum value.
So the workload is write-intensive, but with constant small reads as well.
(There will be the occasional large read.)
There are also two services that need to run (both Rust-based):
1. A Rust application that abstracts the DB data, allowing clients to access it through a RESTful interface.
2. A Rust app that collects the data to import from thousands of individual devices over Modbus.
These devices are on a private mobile network. Can I set up AWS cluster nodes to access a private network through a VPN?
We would like to move to Amazon Redshift, but I am confused by the node types.
Amazon recommends choosing RA3 or DC2.
If we chose ra3.4xlarge, that means we get one cluster of nodes, right?
Can I run our Rust services on that cluster along with a number of Redshift database instances?
I believe AWS uses Docker, and I think I could containerise my services easily.
Or am I misunderstanding things: when you purchase a Redshift cluster, can you only run Redshift on it, so that containerised applications need something separate, possibly an EC2 cluster?
Can anyone recommend a better fit for scaling this workload ?
Thanks
I would not recommend Redshift for this application, and I'm a Redshift guy. Redshift is designed for analytic workloads (lots of reads and few, large writes). A constant stream of small writes is not what it is designed for.
I would point you to Postgres on RDS as the best fit. This will be more of the transactional database you are looking for, and your existing Rust REST layer should carry over with little migration change.
When your data gets really large (TB+), you can add Redshift to the mix to quickly perform the analytics you need.
Just my $.02
Redshift is a managed service; you don't get any access to it for installing things, nor is there any possibility of installing/running custom software of your own.
"When you purchase a Redshift cluster, can you only run Redshift on it?"
Yes. You don't run your own software there; AWS manages the cluster and you run your analytics/queries etc.
"Do containerised applications need something separate, possibly an EC2 cluster?"
Yes. You could make use of EC2 and run the orchestration yourself, or make use of ECS/Fargate/EKS, depending on your budget, how skilled your team members are, etc. A sketch of the Fargate route follows.
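For example, each Rust service could be packaged as a container image and registered as a Fargate task. A minimal boto3 sketch, where the image URI, names, and sizes are placeholders (in practice you also attach an execution role so the task can pull from ECR and ship logs):

import boto3

ecs = boto3.client("ecs")

# Register the REST-facing Rust service as a Fargate task definition.
ecs.register_task_definition(
    family="rust-rest-api",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",     # 0.25 vCPU
    memory="512",  # MiB
    containerDefinitions=[{
        "name": "rust-rest-api",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/rust-rest-api:latest",
        "essential": True,
        "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
    }],
)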

Move production env cassandra cluster to AWS cassandra without downtime

I have a 4-node Cassandra cluster running in a production environment in an on-premise DC, and I have to move it to Cassandra on AWS. I don't want to move from Cassandra to DynamoDB, for various reasons.
The Cassandra version used is pretty old, i.e. 1.2.9.
How do I move Cassandra from the on-premise DC to AWS with no data loss and zero downtime?
Regards,
Vivek
Create a new DC in AWS, configure inter-DC replication between the two DCs, and then decommission the old DC.
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAddDCToCluster.html
I've done this before.
As Alex Tbk said, you'll add nodes at AWS as a new data center.
Add new, empty nodes with a new logical data center name. You'll need to use the GossipingPropertyFileSnitch (if you're not already) and specify the DC in the cassandra-rackdc.properties file. You can also specify a logical rack in that file, and it's usually a good idea to put the AWS availability zone there.
After you get one AWS node built, build the rest using the first node's IP as a seed; you won't want them trying to hit your on-prem DC on a restart. Afterward, you will also want to set the first node to use one of the others as its seed node.
Once you get your nodes built, you'll need to modify your keyspaces and specify a replication factor for your new AWS DC.
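A minimal sketch of that alteration with the DataStax Python driver; the contact point, keyspace, and DC names are placeholders, and RF 3 per DC is an assumption (match your on-prem replication factor). On a 1.2-era cluster you could equally run the same CQL through cqlsh:

from cassandra.cluster import Cluster

# protocol_version=1 for an old 1.2-era cluster (requires the native
# transport to be enabled); the contact point is a placeholder.
session = Cluster(["10.0.0.1"], protocol_version=1).connect()
session.execute("""
    ALTER KEYSPACE my_keyspace
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'onprem-dc': 3,
        'aws-dc': 3
    }
""")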
Run a nodetool rebuild on each AWS node, using your existing DC as the source.
nodetool rebuild -- sourceDCName
Definitely consider upgrading. 1.2 was a solid version, but you're missing out on so many new features/fixes.
Note: Some folks recommend using the AWS-specific snitches (Ec2Snitch, Ec2MultiRegionSnitch), but you will want all nodes in your cluster running the same snitch. So for a hybrid-cloud deployment (before you have a chance to decommission your on-premise nodes), you'll want to stick with the GossipingPropertyFileSnitch. Honestly, that's the only snitch I use, regardless of provider, and you should be fine with it, too.

How much does storage increase when I add an instance to an Amazon Elasticsearch cluster?

When you're running out of space on an Amazon Elasticsearch cluster the documentation recommends: "If you are not using EBS, add additional nodes to your cluster configuration."
source: https://aws.amazon.com/premiumsupport/knowledge-center/add-storage-elasticsearch/
But I'm not able to find any explanation of how much that increases the storage. Does it literally double the storage when going from one instance to two?
Tangential follow-up: when you add another instance to a cluster, does it automatically rebalance the existing indexes, or do you have to rebuild them?
If you go from one instance to two (of the same type), you double the storage, indeed. Try that and see if it solves your storage space issue.
Regarding your follow-up question, when new nodes join the cluster, ES will automatically rebalance the shards to the new nodes. Automatic rebalancing is one of ES' nicest features.
Be aware that if the default Elasticsearch index configuration of 1 primary plus 1 replica shard is in effect, the cluster will automatically replicate all of your shards and consume the disk space you added. Check the AES docs, and your index configs.
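You can verify both points directly against the domain endpoint. A small sketch with Python's requests, where the endpoint URL is a placeholder (add authentication or request signing if your access policy requires it):

import requests

# Placeholder Amazon ES domain endpoint.
base = "https://my-domain.us-east-1.es.amazonaws.com"

# Disk used/available per node: shows the effect of adding instances.
print(requests.get(f"{base}/_cat/allocation?v").text)

# Primary ("pri") and replica ("rep") counts per index.
print(requests.get(f"{base}/_cat/indices?v").text)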

RIAK Cluster across AWS regions

I have an application hosted in AWS whose data is stored in a Riak KV cluster; there are 5 nodes forming this cluster.
To meet high demand and availability constraints, I would like to replicate the complete setup in another AWS region (as active-active), where yet another Riak KV cluster will be created with up to 5 nodes.
Now the question is: how do I sync the data between these 2 Riak clusters running in 2 different AWS regions?
Since the open source version of Riak KV does not provide multi-region clustering capability, how do I sync data between these clusters?
The Enterprise version of Riak KV has multi-cluster/datacenter replication built in (as you note in your question). This form of replication does some pretty clever things to ensure that data copied to both clusters stays in sync when updated, as well as recovering from things like data center failure and split-brain conditions.
If you want to roll your own replication, there are quite a few ways you might approach it, including:
Dual write - have your application send writes to both clusters in parallel (see the sketch after this list);
Post-commit hooks (http://docs.basho.com/riak/kv/latest/developing/usage/commit-hooks/) - after data gets written to one cluster successfully, use a post-commit hook to replicate the write to the other cluster.
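A naive dual-write sketch, assuming the legacy Basho Python client ("riak" on PyPI) and placeholder host names; real code needs a story for queuing and repairing failed writes, which is exactly the hard part:

from riak import RiakClient

# One client per region; hosts are placeholders.
us_east = RiakClient(protocol="pbc", host="riak-us-east-1.internal", pb_port=8087)
eu_west = RiakClient(protocol="pbc", host="riak-eu-west-1.internal", pb_port=8087)

def dual_write(bucket_name, key, value):
    """Write to both clusters; return the regions that failed so the
    caller can retry or queue a repair."""
    failures = []
    for region, client in (("us-east", us_east), ("eu-west", eu_west)):
        try:
            client.bucket(bucket_name).new(key, data=value).store()
        except Exception as exc:
            failures.append((region, exc))
    return failures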
The primary weakness of these solutions is that you still need to figure out how to keep data in sync across the clusters under failure conditions.
I know that there are more than a handful of Riak KV open source users who have rolled various in-house replication mechanisms, so hopefully one of them will chime in with what they have done.