Concurrent writes in Cassandra: Are conflicts possible?

Does Cassandra guarantee consistency of replicas in case of concurrent writes? For example, if N=3, W=3 and there are 3 concurrent writers, is it possible to end up with 3 different values on each replica?
Is it a Cassandra-specific problem, or does the canonical Dynamo design also have this problem, despite its use of vector clocks?

Cassandra uses client-provided timestamps in this case, to ensure each replica keeps the 'latest' value. In your example, where you write to each replica, even when replicas receive the writes in different order, they will use the timestamp provided with the writes to decide which one to keep. Writing the same key with an older timestamp to a replica will just be ignored.
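A minimal sketch with the DataStax Python driver, assuming a hypothetical keyspace ks and table kv(key text PRIMARY KEY, value text), just to make the timestamp mechanism visible:

    from cassandra.cluster import Cluster

    # Hypothetical keyspace "ks" and table kv(key text PRIMARY KEY, value text).
    session = Cluster(["127.0.0.1"]).connect("ks")

    # Two writers supply their own (client-side) timestamps, in microseconds.
    session.execute("INSERT INTO kv (key, value) VALUES ('k', 'A') USING TIMESTAMP 1000")
    session.execute("INSERT INTO kv (key, value) VALUES ('k', 'B') USING TIMESTAMP 2000")

    # Even if a replica receives these writes in the opposite order, the cell with
    # the higher timestamp ('B') wins; the older write is simply ignored.
    row = session.execute("SELECT value, WRITETIME(value) FROM kv WHERE key = 'k'").one()
    print(row.value)  # 'B'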
This mechanism isn't just needed to cope with concurrent writes - Cassandra can receive writes out of order over long periods of time (i.e. replaying hints to a recently down node). To cope with this, when Cassandra compacts SSTables and encounters two keys that are the same, it will use the timestamps to decide which one is kept.
Similarly, Cassandra has a feature called read repair. On read, Cassandra will compare the timestamp given by each replica and return the value associated with the latest timestamp to the client. It will then write this value back to any replicas which were out of date (this can have a performance impact, so the chance of it doing the subsequent write is tuneable).

Just to add to tom.wilkie's answer:
If you want to guarantee that the latest value is the one read back, always read AND write at LOCAL_QUORUM or QUORUM consistency.
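For example, with the DataStax Python driver the consistency level can be set per statement (reusing the hypothetical ks.kv table from the sketch above):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("ks")

    # With RF=3, a QUORUM write (2 replicas) and a QUORUM read (2 replicas) always
    # overlap on at least one replica, so the read sees the latest acknowledged write.
    write = SimpleStatement("INSERT INTO kv (key, value) VALUES ('k', 'C')",
                            consistency_level=ConsistencyLevel.QUORUM)
    session.execute(write)

    read = SimpleStatement("SELECT value FROM kv WHERE key = 'k'",
                           consistency_level=ConsistencyLevel.QUORUM)
    print(session.execute(read).one().value)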

Storing a very large array of strings in AWS

I want to store a large array of strings in AWS to be used from my application. The requirements are as follows:
During normal operations, string elements will be added to the array and the array size will continue to grow
I need to enforce uniqueness - i.e. the same string cannot be stored twice
I will have to retrieve the entire array periodically - most probably to put it in a file and use it from the application
I need to backup the data (or at least be convinced that there is a good built-in backup system as part of the features)
I looked at the following:
RDS (MySQL) - this may be overkill and also may become uncomfortably large for a single table (millions of records).
DynamoDB - This is intended for key/value pairs, but I have only a single value per record. Also, and more importantly, retrieving a large number of records seems to be an issue in DynamoDB as the scan operation needs paging and also can be expensive in terms of capacity units, etc.
Single S3 file - This could be a practical solution, except that I may need to write to (append to) the file concurrently, and that is not a feature available in S3. Also, it would be hard to enforce element uniqueness
DocumentDB - This seems to be too expensive and overkill for this purpose
ElastiCache - I don't have a lot of experience with this and wonder if it would be a good fit for my requirement, and whether it's practical to have it backed up periodically. It also uses key/value pairs, and it is not advisable to read millions of records (the entire data set) at once
Any insights or recommendations would be helpful.
Update:
I don't know why people are voting to close this. It is definitely a programming related question and I have already gotten extremely useful answers and comments that will help me and hopefully others in the future. Why is there such an obsession with opinionated closure of useful posts on SO?
DynamoDB might be a good fit.
It doesn't matter that you don't have any "value" to your "key". Just use the string as the primary key. That will also enforce uniqueness.
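A rough sketch with boto3, assuming a hypothetical table named strings whose partition key is the attribute value; a conditional put inserts and enforces uniqueness in a single call:

    import boto3
    from botocore.exceptions import ClientError

    table = boto3.resource("dynamodb").Table("strings")  # hypothetical table name

    def add_string(s):
        try:
            table.put_item(
                Item={"value": s},
                # "value" is a DynamoDB reserved word, hence the #v placeholder.
                ConditionExpression="attribute_not_exists(#v)",
                ExpressionAttributeNames={"#v": "value"},
            )
            return True  # newly stored
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return False  # duplicate; uniqueness preserved
            raise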
You get on-demand and continuous backups. I don't have experience with these so I can only point you to the documentation.
The full retrieval of the data might be the biggest hassle. A full-table Scan is not recommended with DynamoDB; it can get expensive. There's a way to use Data Pipeline to do an export (I haven't used it either). Alternatively, you could put together a pipeline yourself using DynamoDB Streams, e.g. push the stream to Kinesis and then to S3.
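A minimal sketch of the export path using boto3's scan paginator (same hypothetical strings table); it is simple, but it reads the whole table and consumes read capacity accordingly:

    import boto3

    client = boto3.client("dynamodb")
    paginator = client.get_paginator("scan")

    with open("strings.txt", "w") as out:
        for page in paginator.paginate(
            TableName="strings",                       # hypothetical table name
            ProjectionExpression="#v",
            ExpressionAttributeNames={"#v": "value"},  # "value" is a reserved word
        ):
            for item in page["Items"]:
                out.write(item["value"]["S"] + "\n")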

DynamoDB Eventually consistent reads vs Strongly consistent reads

I recently came to know about two read modes of DynamoDB. But I am not clear about when to choose what. Can anyone explain the trade-offs?
Basically, if you NEED to have the latest values, use a strongly consistent read. You'll get the guaranteed current value.
If your app is okay with potentially outdated information (mere seconds or less out of date), then use eventually consistent reads.
Examples where strongly consistent reads are needed:
Bank balance (Want to know the latest amount)
Location of a locomotive on a train network (Need absolute certainty to guarantee safety)
Stock trading (Need to know the latest price)
Use-cases for eventually consistent reads:
Number of Facebook friends (Does it matter if another was added in the last few seconds?)
Number of commuters who used a particular turnstile in the past 5 minutes (Not important if it is out by a few people)
Stock research (Doesn't matter if it's out by a few seconds)
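In boto3 this is a single flag on the read; a small sketch, assuming a hypothetical accounts table keyed by account_id:

    import boto3

    table = boto3.resource("dynamodb").Table("accounts")  # hypothetical table/key names

    # Eventually consistent (the default): cheaper, may lag recent writes briefly.
    maybe_stale = table.get_item(Key={"account_id": "42"})

    # Strongly consistent: reflects all writes acknowledged before the read,
    # at twice the read-capacity cost, and not available on global secondary indexes.
    latest = table.get_item(Key={"account_id": "42"}, ConsistentRead=True)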
To add to the other answers, briefly, the reason for these two read modes is:
Let's say you have a User table in the eu-west-1 region. Behind the scenes, AWS keeps multiple copies of your table across Availability Zones so that your data survives failures. When you write an item, several of these replicas need to be updated.
When you then read, there is a chance that you hit a replica that has not yet received the update, without being aware of it. It usually takes well under a second for DynamoDB to propagate a write. This is why it's called eventually consistent: it will become consistent after a short amount of time :)
Knowing this reasoning helps me understand the trade-off and design my use cases accordingly.

Which Key value, Nosql database can ensure no data loss in case of a power failure?

At present, we are using Redis as an in-memory, fast cache. It is working well. The problem is, once Redis is restarted, we need to re-populate it by fetching data from our persistent store. This overloads our persistent store beyond its capacity and hence the recovery takes a long time.
We looked at Redis persistence options. The best option (without compromising performance) is to use AOF with 'appendfsync everysec'. But with this option, we can lose the last second of data. That is not acceptable. Using AOF with 'appendfsync always' has a considerable performance penalty.
So we are evaluating single-node Aerospike. Does it guarantee no data loss in case of power failures? I.e., in response to a write operation, once Aerospike sends success to the client, the data should never be lost, even if I pull the power cable of the server machine. As I mentioned above, I believe Redis can give this guarantee with the 'appendfsync always' option, but we are not considering it as it has a considerable performance penalty.
If Aerospike can do it, I would want to understand in detail how persistence works in Aerospike. Please share some resources explaining the same.
We are not looking for a distributed system as strong consistency is a must for us. The data should not be lost in node failures or split brain scenarios.
If not aerospike, can you point me to another tool that can help achieve this?
This is not a database problem, it's a hardware and risk problem.
All databases (that have persistence) work the same way: some write the data directly to the physical disk, while others tell the operating system to write it. The only way to ensure that every write is safe is to wait until the disk confirms the data is written.
There is no way around this and, as you've seen, it greatly decreases throughput. This is why databases use a memory buffer and write batches of data from the buffer to disk in short intervals. However, this means that there's a small risk that a machine issue (power, disk failure, etc) happening after the data is written to the buffer but before it's written to the disk will cause data loss.
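To make the trade-off concrete, here is a small, database-agnostic Python sketch (purely illustrative) contrasting an OS-buffered write with one that waits for the disk to acknowledge:

    import os

    def write_buffered(path, data):
        # Returns as soon as the data is in the OS page cache; a power failure
        # before the cache is flushed loses the write.
        with open(path, "ab") as f:
            f.write(data)

    def write_durable(path, data):
        # Blocks until the disk confirms the write; safe, but far slower per write,
        # which is why databases batch writes from a buffer instead.
        fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT)
        try:
            os.write(fd, data)
            os.fsync(fd)
        finally:
            os.close(fd)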
On a single server, you can buy protection through multiple power supplies, battery backup, and other safeguards, but this gets tricky and expensive very quickly. This is why distributed architectures are so common today for both availability and redundancy. Distributed systems do not mean you lose consistency, rather they can help to ensure it by protecting your data.
The easiest way to solve your problem is to use a database that allows for replication so that every write goes to at least 2 different machines. This way, one machine losing power won't affect the write going to the other machine and your data is still safe.
You will still need to protect against a power outage at a higher level that can affect all the servers (like your entire data center losing power) but you can solve this by distributing across more boundaries. It all depends on what amount of risk is acceptable to you.
Between tweaking the disk-write intervals in your database and using a proper distributed architecture, you can get the consistency and performance requirements you need.
I work for Aerospike. You can choose to have your namespace stored in memory, on disk or in memory with disk persistence. In all of these scenarios we perform favourably in comparison to Redis in real world benchmarks.
Considering storage on disk: when a write happens, it hits a buffer before being flushed to disk. The ack does not go back to the client until that buffer has been successfully written to. It is plausible that if you yank the power cable before the buffer flushes, in a single-node cluster the write might have been acked to the client and subsequently lost.
The answer is to have more than one node in the cluster and a replication-factor >= 2. The write then goes to the buffer on the master and the replica, and has to succeed on both before being acked to the client as successful. If the power is pulled from one node, a copy would still exist on the other node and no data would be lost.
So, yes, it is possible to make Aerospike as resilient as it is reasonably possible to be at low cost with minimal latencies. The best thing to do is to download the community edition and see what you think. I suspect you will like it.
I believe Aerospike would serve your purpose. You can configure it for hybrid storage at the namespace (i.e. DB) level in aerospike.conf, which is present at /etc/aerospike/aerospike.conf.
For details please refer to the official documentation here: http://www.aerospike.com/docs/operations/configure/namespace/storage/
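As an illustrative sketch only (the namespace name, file path, and sizes are invented; check the documentation above for the options your version supports), a hybrid namespace in aerospike.conf looks roughly like this:

    namespace mydata {
        replication-factor 2
        memory-size 2G
        storage-engine device {
            file /opt/aerospike/data/mydata.dat
            filesize 8G
            data-in-memory true    # keep a full copy in RAM, persist every write to disk
        }
    }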
If you must have a guarantee that the data won't be lost, I believe you're going to be at the mercy of the latency of whatever the storage medium is, or of the network fabric in the case of a cluster, regardless of what DBMS technology you use. (N.B. Ben Bates' solution won't work if there is a possibility that the whole physical plant loses power, i.e. both nodes lose power. But I would think an inexpensive UPS would substantially, if not completely, mitigate that concern.) And those latencies are going to cause a dramatic insert/update/delete performance drop compared to a standalone in-memory database instance.
Another option to consider is to use NVDIMM storage for either the in-memory database or for the write-ahead transaction log used for recovery. It will have the absolute lowest latency (comparable to conventional DRAM). And, if your in-memory database fits in the available NVDIMM memory, you'll have the fastest recovery possible (no need to replay a transaction log) and performance comparable to the original IMDB, because you're back to a single write versus the 2+ writes needed for a write-ahead log and/or replication to another node in a cluster. But your in-memory database system has to be able to support direct recovery of an in-memory database (not just recovery from a transaction log). So, again, two requirements for this to be an option:
1. The entire database must fit in the NVDIMM memory
2. The database system has to be able to support recovery of the database directly after system restart, without a transaction log.
More in this white paper http://www.odbms.org/wp-content/uploads/2014/06/IMDS-NVDIMM-paper.pdf

When a ConcurrencyException occurs, what gets written?

We use RavenDB in our production environment. It stores millions of documents, and gets updated pretty much constantly during the day.
We have two boxes load-balanced using a round-robin strategy which replicate to one another.
Every week or so, we get a ConcurrencyException from Raven. I understand that this basically means that one of the servers was told to insert or update the same document within a short timeframe - it's kind of like a conflict exception, except it occurs on the same server instead of across two replicating servers.
What happens when this error occurs? Can I assume that at least one of the writes succeeded? Can I predict which one? Is there anything I can do to make these exceptions less likely?
ConcurrencyException means that on a single server, you have two writes to the same document at the same instant.
That leads to:
One write is accepted.
One write is rejected (with concurrency exception).

Google File System Consistency Model

I was reading about GFS and its consistency model but I'm failing to grasp some of it.
In particular, can someone provide me with a specific example scenario (or an explanation of why it cannot happen) of:
concurrent record append that could result in record duplication
concurrent record append that could result in undefined regions
concurrent writes (on a single chunk) that could result in undefined regions
I'm quoting from http://research.google.com/archive/gfs.html. Check out Table 1, which is a summary of the possible outcomes for writes/appends:
"If a record append fails at any replica, the client retries the
operation. As a result, replicas of the same chunk may contain
different data possibly including duplicates of the same
record in whole or in part." So any failure on a replica (e.g. timeout) will cause a duplicate record at least on the other replicas. This can happen without concurrent writes.
The same situation that causes a duplicate record also causes an inconsistent (and hence undefined) region. If a replica failed to acknowledge the mutation, it may not have performed it. In that case, when the client retries the append, this replica will have to add padding in place of the missing data so that the record can be written at the right offset. So one replica will have padding while the other will have the previously written record in this region.
A failed write can cause an inconsistent (hence undefined) region as well. More interestingly, successful concurrent writes can cause consistent but undefined regions. "If a write by the application is large or straddles a chunk boundary, GFS client code breaks it down into multiple write operations. They [...] may be interleaved with and overwritten by concurrent operations from other clients. Therefore, the shared file region may end up containing fragments from different clients, although the replicas will be identical because the individual operations are completed successfully in the same order on all replicas. This leaves the file region in consistent but undefined state [...]."
I don't think it really has to do with concurrent appends, but rather with the at-least-once semantics of their system.
Failure is a fundamental problem of large distributed systems. In the presence of failure a sender may not know if the computer on the other end of the network fully received its message.
For such occasions, distributed systems guarantee that a message is delivered either at most once or at least once.
In this case, it appears GFS decided upon at-least-once delivery to the storage nodes.