I am currently looking into how backup and restore can be done in Cassandra. We've set up a three-node cluster in AWS. I understand that we can take a snapshot with nodetool snapshot, but it's a somewhat cumbersome process.
My idea is :
Make use of EBS snapshots, because they're more durable and easy to set up. The one problem I see with EBS snapshots is that the backup may be inconsistent. Hence, my plan is to run a script before taking the EBS snapshot which runs the flush command to write all memtable data to disk (as SSTables) and then creates hard links to the flushed SSTables.
Once that's done, initiate the EBS snapshot. This way we can address the inconsistency issue we might face if we only used EBS snapshots.
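A minimal sketch of such a script, assuming nodetool and the AWS CLI are installed on the node and that the Cassandra data directory lives on the EBS volume being snapshotted. The keyspace name, volume ID and tag are hypothetical placeholders, and dry_run defaults to True so nothing is executed by accident.

```python
import subprocess


def backup_commands(keyspace, volume_id, tag):
    """Return the commands for one node's backup, in the order they should run."""
    return [
        # Flush memtables to SSTables on disk.
        ["nodetool", "flush", keyspace],
        # Hard-link the current SSTables under a snapshot named <tag>.
        ["nodetool", "snapshot", "-t", tag, keyspace],
        # Start the block-level EBS snapshot of the data volume.
        ["aws", "ec2", "create-snapshot", "--volume-id", volume_id,
         "--description", f"cassandra {keyspace} {tag}"],
    ]


def run_backup(keyspace, volume_id, tag, dry_run=True):
    """Run the steps in order, stopping at the first failure.

    With dry_run=True the commands are only returned, not executed.
    """
    commands = backup_commands(keyspace, volume_id, tag)
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical values; replace them with your own keyspace and volume ID.
    for command in run_backup("my_keyspace", "vol-0123456789abcdef0", "nightly"):
        print(" ".join(command))
```

Writes that arrive between the flush and the EBS snapshot are not covered by this sketch, so it narrows the inconsistency window rather than closing it.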
Please let me know if you see any issue with this approach or share your suggestions.
Being immutable, SSTables do help a lot when it comes to backups, indeed.
Your idea sounds OK for situations where everything on your cluster is healthy. Cassandra's consistency is configurable (if I say eventually consistent, some people may be offended here, hehe), and since the system itself may not be fully consistent at a given time, you cannot assume your backup will be either. On the other hand, one of the beauties of Cassandra (and NoSQL models) is that it tends to recover pretty well in most situations (quite the opposite of relational databases, which are very sensitive to data loss). It's very unlikely that you'll end up with a bunch of useless data if you have at least fully preserved the SSTable files.
Be aware that EBS snapshots are block-level, so the filesystem on top of the volume may be a concern as well. Fortunately, any modern filesystem has journaling and is pretty reliable, so that shouldn't be a problem. Still, keeping your data on a separate partition is good practice, because it lowers the chance of something else writing to it right after a full flush.
You may have some lost replicas when you eventually need to restore your cluster, which requires running nodetool repair. If you have done that before, you know it is a bit painful and takes very long for large amounts of data. (But running repair regularly is recommended anyway, especially if you delete a lot.)
Another thing to consider is hinted handoffs (writes whose row owners are missing, but which are kept by other nodes until the owners come back). I don't know what happens to them when you flush, but I guess they're kept in memory and in the commit logs only.
And, of course, do a full restore before you assume this will work in the future.
I don't have much experience with Cassandra, but the backup solutions I have heard about for it are whole-cluster replicas in another region or datacenter, rather than cold backups like snapshots. That is probably more expensive, but also more reliable than raw disk snapshots like the ones you are trying to take.
I am not sure how backup of a node will help, because in C* data is already backed up in the replica nodes.
If a node is dead and has to be replaced, the new node will learn from the other nodes which data it needs to own and get it from them, so you might not need to restore from a disk backup.
Would a replication scenario like the following help ?
Use two data centers (DC:A with three nodes, DC:B with one node) with an RF of A:2 and B:1. Let clients interact with the nodes in DC:A at a read/write consistency of LOCAL_QUORUM. Since the quorum there is 2, all reads and writes will succeed and the data will also be replicated to DC:B. Now you could back up DC:B.
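The quorum arithmetic behind that suggestion can be sketched as follows. This assumes the usual definition of a quorum as a majority of the replicas in one data center, floor(RF / 2) + 1; the function names are made up for illustration.

```python
def quorum(replication_factor):
    """Replicas that must respond for a quorum: a majority of the RF."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor):
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


# Replication factors from the scenario above, plus RF 3 for comparison.
for name, rf in [("DC:A", 2), ("DC:B", 1), ("RF 3 for comparison", 3)]:
    print(name, "quorum:", quorum(rf),
          "tolerated failures:", tolerated_replica_failures(rf))
```

Under this definition a replication factor of 2 gives a quorum of 2, so LOCAL_QUORUM in DC:A needs both replicas of a row to be up; a replication factor of 3 would tolerate one replica being down.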
Related
We installed our own MySQL on GCE and are thinking of using GCE snapshots as a backup solution. As our MySQL database is quite busy, we would like to know: if we take a snapshot while it is still in production, will the data in the snapshot be uncorrupted and keep its integrity? Thanks.
As described in the Best Practices for Persistent Disk Snapshots documentation, if your database is in use during the snapshot, you may have some data loss.
If you don't have too many writes but lots of reads, that could do the trick, as the chance of losing new data will be smaller, but it's still not a 100% sure thing.
On one of my AWS instances running Ubuntu 16.04, I have a MySQL replica database on a 1TB ext4 EBS volume. I plan to increase it to 2TB. Before I increase the size of the volume and extend the filesystem using the resize2fs command, do I need to take any precautions? Is there any possibility of data corruption? If so, would it be sane to create an EBS snapshot of this volume?
Do I need to take any precautions?
You shouldn't need to take any unusual precautions -- just standard best practices, like maintaining backups and having a tested recovery plan. Anything can go wrong at any time, even when you're sitting, doing nothing.
Important
Before modifying a volume that contains valuable data, it is a best practice to create a snapshot of the volume in case you need to roll back your changes.
https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ebs-modify-volume.html
But this is not indicative of the operation being especially risky. Anecdotally, I've never experienced complications, and have occasionally resized an EBS volume and then its filesystem under a live, master, production database.
Is there any possibility of data corruption?
The possibility of data corruption is always there, no matter what you are doing... but this seems to be a safe operation. The additional space becomes available immediately, and there is no I/O freeze or disruption.
If so, would it be sane to create an EBS snapshot of this volume?
As noted above, yes.
Concerns about errors creeping in later are valid, but EBS maintains internal consistency checks and will disable a volume if a check fails. This helps avoid further scrambling of data, so that you can do a controlled recovery and repair operation.
This would not help if EBS is perfectly storing data that was corrupted by something on the instance, such as might be caused by a defect in resize2fs, but it seems to be a solid utility. It doesn't move your existing data -- it just fleshes out the filesystem structures as needed so that the filesystem can use the entire free space that has become available.
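For reference, the whole sequence (snapshot first, then grow the volume, then grow the filesystem) might look like the sketch below. The volume ID, the device name /dev/xvdf and the size of 2048 GiB are assumptions for illustration, and the function only builds the commands rather than running them.

```python
def resize_plan(volume_id, new_size_gib, device, partition=None):
    """Return the commands to grow an ext4 EBS volume, in order.

    Assumes xvd-style device names, where partition 1 of /dev/xvdf is
    /dev/xvdf1. Pass partition=None if the filesystem sits directly on
    the device.
    """
    fs_device = device if partition is None else f"{device}{partition}"
    plan = [
        # Safety net: snapshot the volume before touching it.
        ["aws", "ec2", "create-snapshot", "--volume-id", volume_id,
         "--description", "before resize"],
        # Grow the EBS volume itself; let the change take effect
        # before running the steps that follow.
        ["aws", "ec2", "modify-volume", "--volume-id", volume_id,
         "--size", str(new_size_gib)],
    ]
    if partition is not None:
        # Grow the partition so it covers the new space.
        plan.append(["sudo", "growpart", device, str(partition)])
    # Grow the ext4 filesystem to fill the device or partition.
    plan.append(["sudo", "resize2fs", fs_device])
    return plan


if __name__ == "__main__":
    for command in resize_plan("vol-0123456789abcdef0", 2048, "/dev/xvdf"):
        print(" ".join(command))
```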
I intend to set up a Spark cluster on EC2. How many resources does the Spark master instance actually need? Since the master is not involved in processing any of the tasks, can it be the smallest EC2 instance?
This obviously depends on what kinds of jobs you're planning to run, how big is the cluster etc, so in that sense the advice to simply try different configurations is good. However, in my purely personal experience the driver instance should be at least at the level of the slave instances. This is mainly due to two reasons.
First of all, there are times when you need the result of the job in a single place. Maybe you just don't want to spend time combining files, maybe you need the results in some specific order which would be hard to achieve in a distributed way etc. but this means the driver should be able to hold all the data (as rdd.collect gathers the results to the driver instance).
Second of all, many of the shuffle-based operations seem to require a lot of memory from the driver. I'm not exactly sure about the details of why this happens (if anyone knows, please do share) but I can't count the number of times I've seen reduceByKey causing an out of memory error from the driver.
Edit: I have assumed you were using Spark's spark-ec2 script, which I believe does install the NameNode on the master instance. If the NameNode is not installed on the master instance, however, my answer has no validity, as correctly pointed out by #DemetriKots in the comments.
Although the master instance is not involved in data processing, it plays a major role during the management of the workload and resource allocation, e.g (all info is taken from the sources):
NameNode
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
This (look for Hardware Recommendations for Hadoop on the left index) Hortonworks document specifies some recommendations for the master instance in a Hadoop cluster. While it might not be adequate for the slave instances (due to Spark's memory usage), I would say it can be useful in the case of the master instance in a Spark cluster.
This question has a conceptual part and a practical part.
Conceptually I'd like to know if using the autoscaling functionality is equivalent to simply increasing the compute power by a factor of the number of added instances?
Practically ... how does this work? I have one running instance, its database sitting on an LVM composed of multiple EBS volumes, similarly with all website data. Judging from the load on the instance I either need to upgrade to a more powerful instance or introduce this autoscaling. Is it a copy of the running server? If so, how is the database (etc) kept consistent?
I've read through the AWS documentation, and still haven't got the picture yet - I could set one autoscaling group up which would probably clear my doubts, but I am very leery to do this with a production server.
Any nudges in the right direction would be welcome.
Normally, if you have a solution that uses a database and several machines, the database is not on any of those machines but is instead hosted separately, with each worker machine pointing to the same database. If you are on the AWS platform already, then DynamoDB or RDS are both good solutions for this.
In theory, for some applications, upgrading the size of the single machine will give you the same power as adding several smaller machines, but increasing the size of the single machine, while usually the easiest thing to do at first, should not be considered autoscaling and has its own drawbacks. Here are some things to consider:
Using multiple machines instead of one big one gives you some fault tolerance. One or more machines can go down and if your solution is properly designed new machines will spin up to replace them.
Increasing the size of a single-machine solution means you are probably paying too much. If you size that single machine big enough to handle peak workloads, that means at other times (maybe most of the time), you are paying for a bigger machine than you need. If you set up your autoscaling solution properly, more machines come online in response to increasing demand, and then they terminate when that demand decreases - you only pay for the power you need when you need it.
When your solution is designed in this manner, you need to think of all of the worker machines as ephemeral - likely to disappear at any time, so you need to build your solution differently. Besides using a hosted database (like on DynamoDB or AWS RDS), you also should not store any data on the machines in your auto-scaling group that doesn't also live somewhere else. For example, if part of your app allows users to upload images, you don't store them on the instances, you store them in S3. The same would apply to any other new data that comes in.
You need to be able to figuratively 'pull the plug' at any instant on any of the machines in your ASG without losing data.
Ultimately, a properly set up auto-scaling solution will likely serve you better, but without doubt it is simpler to just 'buy a bigger machine', and the extra money you spend on running that bigger machine may be more than offset by the time and effort you don't have to spend re-architecting your solution to properly run in an autoscaling environment. The unique requirements of your solution will ultimately decide which approach is better.
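To make the cost argument concrete, here is a small back-of-the-envelope calculation. All numbers are hypothetical: a small instance at 10 cents per hour, a big instance equal to four small ones at 40 cents per hour, and a day with 16 quiet hours followed by 8 peak hours.

```python
def fixed_cost(hours, big_price_cents):
    """Cost of one big machine that is sized for peak and always on."""
    return hours * big_price_cents


def autoscaled_cost(instances_per_hour, small_price_cents):
    """Cost of running only as many small machines as each hour needs."""
    return sum(instances_per_hour) * small_price_cents


# Hypothetical demand: 1 small instance for 16 hours, 4 for 8 peak hours.
demand = [1] * 16 + [4] * 8

print(fixed_cost(len(demand), 40))   # 960 cents per day
print(autoscaled_cost(demand, 10))   # 480 cents per day
```

With these made-up numbers the autoscaled setup costs half as much per day; the flatter your real demand curve is, the smaller that gap becomes.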
Most NoSQL solutions only offer eventual consistency, and given that DynamoDB replicates the data across three datacenters, how is read-after-write consistency maintained?
What would be generic approach to this kind of problem? I think it is interesting since even in MySQL replication data is replicated asynchronously.
I'll use MySQL to illustrate the answer, since you mentioned it, though, obviously, neither of us is implying that DynamoDB runs on MySQL.
In a single network with one MySQL master and any number of slaves, the answer seems extremely straightforward -- for eventual consistency, fetch the answer from a randomly-selected slave; for read-after-write consistency, always fetch the answer from the master.
even in MySQL replication data is replicated asynchronously
There's an important exception to that statement, and I suspect there's a good chance that it's closer to the reality of DynamoDB than any other alternative here: In a MySQL-compatible Galera cluster, replication among the masters is synchronous, because the masters collaborate on each transaction at commit-time and a transaction that can't be committed to all of the masters will also throw an error on the master where it originated. A cluster like this technically can operate with only 2 nodes, but should not have less than three, because when there is a split in the cluster, any node that finds itself alone or in a group smaller than half of the original cluster size will roll itself up into a harmless little ball and refuse to service queries, because it knows it's in an isolated minority and its data can no longer be trusted. So three is something of a magic number in a distributed environment like this, to avoid a catastrophic split-brain condition.
If we assume the "three geographically-distributed replicas" in DynamoDB are all "master" copies, they might operate with logic along the same lines as the synchronous masters you'd find with Galera, so the solution would be essentially the same, since that setup also allows any or all of the masters to still have conventional subtended asynchronous slaves using MySQL native replication. The difference there is that you could fetch from any of the masters that is currently connected to the cluster if you wanted read-after-write consistency, since all of them are in sync; otherwise fetch from a slave.
The third scenario I can think of would be analogous to three geographically-dispersed MySQL masters in a circular replication configuration, which, again, supports subtended slaves off of each master, but has the additional problems that the masters are not synchronous and there is no conflict resolution capability -- not at all viable for this application, but for purposes of discussion, the objective could still be achieved if each "object" had some kind of highly-precise timestamp. When read-after-write consistency is needed, the solution here might be for the system serving the response to poll all of the masters to find the newest version, not returning an answer until all masters had been polled, or to read from a slave for eventual consistency.
Essentially, if there's more than one "write master" then it would seem like the masters have no choice but to either collaborate at commit-time, or collaborate at consistent-read-time.
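As a toy illustration of collaborating at consistent-read time, the sketch below polls every master and returns the version with the newest timestamp. The Version type and the dict-based "masters" are stand-ins invented for this example, not anything from MySQL or DynamoDB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: float  # assumed precise enough to order writes


def consistent_read(masters, key):
    """Poll all masters and return the newest value stored for key.

    Each master is modelled as a dict of key -> Version. Returns None if
    no master has the key.
    """
    versions = [master[key] for master in masters if key in master]
    if not versions:
        return None
    return max(versions, key=lambda version: version.timestamp).value


# Three masters that have not finished replicating to each other yet.
masters = [
    {"user:1": Version("old address", 1.0)},
    {"user:1": Version("new address", 2.0)},
    {},
]
print(consistent_read(masters, "user:1"))  # new address
```

The read cannot return until every master has answered, which is exactly why this style of read is slower and harder to scale than reading from any one replica.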
Interestingly, I think, in spite of some whining you can find in online opinion pieces about the disparity in pricing between the two read-consistency levels in DynamoDB, this analysis -- even as divorced from the reality of DynamoDB's internals as it is -- does seem to justify that discrepancy.
Eventually-consistent read replicas are essentially infinitely scalable (even with MySQL, where a master can easily serve several slaves, each of which can also easily serve several slaves of its own, each of which can serve several... ad infinitum) but read-after-write is not infinitely scalable, since by definition it would seem to require the involvement of a "more-authoritative" server, whatever that specifically means, thus justifying a higher price for reads where that level of consistency is required.
I'll tell you exactly how DynamoDB does this. No guessing.
In order for a write request to be acknowledged to the client, the write must be durable on two of the three storage nodes for that partition. One of the two storage nodes MUST be the leader node for that partition. The third storage node is probably updated as well, but on the off chance something happened, it may not be. DynamoDB will get that one updated as soon as it can.
When you request a strongly consistent read, that read comes from the leader storage node for the partition the item(s) are stored in.
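The acknowledgement rule described above can be written down as a tiny model. This is only a sketch of the stated rule (durable on two of three storage nodes, one of which must be the leader); the node names are hypothetical and it says nothing about how DynamoDB is actually implemented.

```python
def write_acknowledged(durable_nodes, leader, required=2):
    """True if a write may be acknowledged to the client.

    durable_nodes: set of storage nodes where the write is already durable.
    leader: the leader storage node for the partition.
    """
    return leader in durable_nodes and len(durable_nodes) >= required


# Hypothetical partition with storage nodes a, b and c, where a is the leader.
print(write_acknowledged({"a", "b"}, leader="a"))  # True
print(write_acknowledged({"b", "c"}, leader="a"))  # False: leader missing
print(write_acknowledged({"a"}, leader="a"))       # False: only one node
```

Because every acknowledged write is durable on the leader, a read served by the leader always sees it, which matches the strongly consistent read path described above.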
I know I'm answering this question long after it was asked, but I thought I could contribute some helpful information...
In a distributed database the concept of a "master" is not particularly relevant anymore (at least for reads/writes). Each node should be able to perform reads and writes, so that read/write performance increases as the number of machines increases. If you want reads to be correct immediately after a write, the number of machines you write to plus the number of machines you read from must be greater than the total number of machines in the system.
Example: if you only write to 1 machine, then you must read from all of them to ensure that your data is not stale. Or if you write to 2 machines (in this case, quorum) you can perform reads at quorum and guarantee that your data is recent.
NOTE: these assumptions change when a subset of nodes in the system crash.
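The overlap condition is usually written as W + R > N, where N is the number of machines holding the data, W the number written to and R the number read from. A minimal check, assuming a healthy three-machine system as in the example:

```python
def read_sees_latest_write(n, w, r):
    """True if every read set must overlap every write set (W + R > N)."""
    return w + r > n


N = 3  # assumed number of machines holding a copy of the data

print(read_sees_latest_write(N, w=1, r=3))  # True: write one, read all
print(read_sees_latest_write(N, w=2, r=2))  # True: quorum write and read
print(read_sees_latest_write(N, w=1, r=1))  # False: read may be stale
```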