Consider a distributed system with three nodes: n1, n2, and n3. The nodes share a data item, x, and run Paxos. Initially, x is equal to 4.
A client sends an update request to n1 to change the value of x to 5. n1 and n2 reach consensus on the new value by running Paxos but some link failures occur for n3, so n3 does not have the newest value of x.
We know that Paxos provides strong consistency. On the other hand, if a client sends a read request to n1 and another read request to n3, the returned values are not the same (one is 5, the other is 4). Therefore, after running Paxos, the system does not appear to be strongly consistent.
My question is: how can we resolve this contradiction? Did I misunderstand something?
In multi-Paxos, peers can lag behind, as you noticed. If you read the values from a quorum, though, you're guaranteed to see the most recent value; the trick is figuring out which one that is. Not all applications need this, but if yours does, a very simple augmentation is sufficient. Use a tuple instead of the raw value, where the first item is an update counter and the second is the raw value. Each time a peer tries to update the value, it also increments the counter. When you read from a quorum, the tuple with the highest update counter is guaranteed to be the most recent value.
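As a minimal sketch of the quorum read described above, assuming each replica answers with an `(update_counter, value)` tuple: the names `quorum_size`, `read_latest`, and `replicas` are hypothetical, and the Paxos rounds that produce these tuples are not modelled.

```python
from typing import List, Tuple

Versioned = Tuple[int, int]  # (update_counter, value)


def quorum_size(n_nodes: int) -> int:
    """Smallest majority of n_nodes."""
    return n_nodes // 2 + 1


def read_latest(replies: List[Versioned], n_nodes: int) -> int:
    """Return the value carried by the highest update counter among the replies."""
    if len(replies) < quorum_size(n_nodes):
        raise RuntimeError("not enough replies to form a quorum")
    _counter, value = max(replies, key=lambda reply: reply[0])
    return value


# State from the question: n1 and n2 accepted the update, n3 lags behind.
replicas = {"n1": (1, 5), "n2": (1, 5), "n3": (0, 4)}
```

Any two of the three replicas form a quorum, and every such pair contains at least one node that accepted the update, so a read that includes the stale n3 still resolves to 5.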
What I understand about blockchain is that:
Blocks are secured by the hash.
Transactions are secured by the Merkle tree.
Does this mean that the Merkle tree is not involved at all in securing the blocks?
If so, what prevents us from changing the transactions if we know the hash of older blocks in the chain?
Please note that I'm assuming a blockchain with only one node, and I want to know how hard it is to hack the blockchain on that one node. As far as I understand, the hashing alone is very secure, but distributing the blockchain across multiple nodes makes it even more secure.
Blocks are secured by proof of work. The proof of work is a measure of how many hashes (on average) it would take to get a block hash equal to or lower than the network target value. The lower the target value, the more work was done on the block, and the harder it is to change or "hack" the data in the block and still have a valid block (because you must do the work again).
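A toy illustration of the target rule, not Bitcoin's real header layout or difficulty encoding: `hash_value`, `mine`, `expected_hashes`, and `TOY_TARGET` are hypothetical names, and the target is deliberately easy so the loop finishes quickly.

```python
import hashlib


def hash_value(header: bytes, nonce: int) -> int:
    """Double SHA-256 of header plus nonce, read as a 256-bit integer."""
    data = header + nonce.to_bytes(8, "little")
    digest = hashlib.sha256(hashlib.sha256(data).digest()).digest()
    return int.from_bytes(digest, "big")


def mine(header: bytes, target: int) -> int:
    """Try nonces in order until the hash is equal to or lower than the target."""
    nonce = 0
    while hash_value(header, nonce) > target:
        nonce += 1
    return nonce


def expected_hashes(target: int) -> float:
    """Average number of attempts needed to land at or below the target."""
    return 2**256 / (target + 1)


TOY_TARGET = 1 << 244  # roughly one hash in 4096 qualifies
```

Halving the target doubles `expected_hashes`, which is the sense in which a lower target means more work.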
The merkle root is just a way to represent all of the transactions in the block in a single hash value, which is part of the data that is hashed to produce the block hash. If you change any of the transaction data, it will produce a different merkle root, and that will make the resulting block hash different too, and now the proof of work must be done again before the block would be considered valid.
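The dependency of the block hash on the transactions can be sketched as follows. This is a simplified model: `merkle_root` and `block_hash` are hypothetical helpers, the header here holds only a previous hash, the root, and a nonce, and duplicating the last hash on odd levels is an assumption borrowed from Bitcoin's scheme.

```python
import hashlib
from typing import List


def sha256d(data: bytes) -> bytes:
    """Double SHA-256."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def merkle_root(transactions: List[bytes]) -> bytes:
    """Fold the transaction hashes pairwise until a single hash remains."""
    if not transactions:
        raise ValueError("a block needs at least one transaction")
    level = [sha256d(tx) for tx in transactions]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [sha256d(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def block_hash(prev_hash: bytes, root: bytes, nonce: int) -> bytes:
    """Hash of a simplified header: previous hash, Merkle root, nonce."""
    return sha256d(prev_hash + root + nonce.to_bytes(8, "little"))
```

Changing a single byte in any transaction yields a different root, and therefore a different block hash for the same nonce, so the old proof of work no longer applies.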
Now, with only one node, it does not matter. If you are able to change the data in a block and rehash that block with a new valid hash (one that is equal to or lower than the network target value), you have a new block, but the node will reject it because it already has that block. You must mine the next block also before anyone else because one of the consensus rules is that the longest valid chain always wins.
Having only one node running means that the node can be changed by the person who is running it, possibly without anyone else knowing. This might remove certain rules that you thought were being followed, which could reverse one of your transactions, so it is good to run your own node to make sure the rules are being followed.
AFAIK, after validating a block, a node runs all the transactions in the block, changing the state (the list of UTXOs).
Let's imagine that at some point the node realizes that it was on the wrong chain and that a longer chain is available, which forked some blocks earlier.
How does it make the switch? I imagine that it should run all transactions in reverse back to the fork point to restore the state, and then replay all transactions in the blocks from the longer chain?
Thanks!
Each node receives individual transactions as well as individual blocks from the network.
It also keeps the most updated blockchain locally.
For every new transaction it receives, the node validates it, and if valid, propagates to its peers.
For every block the node receives, it validates it. The validation includes several steps, among which:
1. checking that the block points to the most recent block in the blockchain (its preceding block)
2. checking that all transactions included in the block are valid.
A fork is a temporary situation that can occur when two (or more) valid blocks arrive at a node at nearly the same time, so the node doesn't know which is the right one. It keeps the first one, added to its local blockchain, as the main chain, and keeps the second one as a fork chain (also locally) until the next block arrives and is added to one of the two. When that happens, the longer chain is chosen to be the main blockchain (at that node!), and the other is kept as a side chain.
All such side chains are kept in the node's memory for some time, until it can be sure they are not relevant anymore (since they are shorter than the main blockchain by several blocks), and then removed.
I don't know why you have the picture that anything has to be "rolled back". Yes, it's rolling back, but no calculations with transactions have to be done at all. Here's why:
When node A has N+5 blocks and node B has N+2 blocks, all node B has to do is drop its two additional blocks and take the 5 new blocks from A.
That's all! Yes, it effectively is rolling back, but nothing has to be run in reverse, because dropping blocks is effectively equivalent to reversing transactions.
Remember that transactions are directed, so they happen in only one direction in time. That means that, for valid blocks, every non-coinbase transaction in block number N must have some history in the previous blocks, so block number N depends on that history, but the opposite is not true. Previous blocks don't depend on block number N, so dropping N won't invalidate them.
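The drop-and-take step described in this answer can be sketched on plain lists of block identifiers. `reorganize` is a hypothetical helper; block validation and whatever bookkeeping a real node does for its UTXO set are not modelled here.

```python
from typing import List, Tuple


def reorganize(local: List[str], longer: List[str]) -> Tuple[List[str], List[str]]:
    """Switch to the longer chain: keep the shared prefix, drop the rest, take the new blocks.

    Returns (new_chain, dropped_blocks). If the other chain is not longer, nothing changes.
    """
    if len(longer) <= len(local):
        return list(local), []
    fork = 0
    while fork < len(local) and local[fork] == longer[fork]:
        fork += 1  # walk forward while both chains agree
    dropped = local[fork:]
    return local[:fork] + longer[fork:], dropped


# Node B has N+2 blocks, node A has N+5; they share the first N blocks.
node_b = ["n1", "n2", "b1", "b2"]
node_a = ["n1", "n2", "a1", "a2", "a3", "a4", "a5"]
```

Here node B drops `b1` and `b2` and ends up with the same chain as node A; the shared blocks are untouched because they never depended on the dropped ones.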
I have a distributed system that processes sessions (the definition of a session is not important for this problem, except to note that it's a process with a duration longer than a second, usually much longer), and I want to identify the largest number of sessions processed concurrently during a given time period.
The basic setup is a Redis database where I increment a counter for every session start and decrement it for every session end. The counter value thereby represents the current concurrency at any given point in time.
My problem is how to generate an accurate metric of the peak (max) concurrency for given time slices (say, the max concurrency on a given day).
I would like to hear how other people would solve this problem, but my current approach is as follows:
Session start
INCR counter-name to increase the current value of the counter
The result of the increment command is the current value of the counter
ZADD collector-name NX <counterval> <uniqueid> to store the currently known concurrency value in an ordered set. Flake-id can be used for fast id generations, but if the session already has a unique ID - which often is the case - we can just use that.
Session end
DECR counter-name to reduce the current concurrency value
Each report time period
RENAME collector-name tempkey to take a snapshot of the status and allow workers to start a new collector.
ZREVRANGEBYSCORE tempkey +inf -inf WITHSCORES LIMIT 0 1 is run, returning the peak value of the counter since the last check (and the unique id of the session that caused the peak, if it is of any relevance).
DEL tempkey as we don't need it anymore.
Notes:
The final max calculation is done offline from the counter, and it's also only O(log(n)).
The data entry is also O(log(n)), which might be a problem under high load, but n here is the number of entries in the current period, so we can just increase our reporting frequency to improve performance (nice side effect: we improve performance by generating more data!)
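The flow above can be traced with a small in-memory stand-in. This is not Redis client code: `PeakTracker` is a hypothetical class whose attributes merely play the roles of `counter-name` and `collector-name`, with each method annotated with the command it imitates.

```python
from typing import Dict, Optional, Tuple


class PeakTracker:
    """In-memory model of the counter plus sorted-set collector."""

    def __init__(self) -> None:
        self.counter = 0                      # role of counter-name
        self.collector: Dict[str, int] = {}   # role of collector-name: member -> score

    def session_start(self, session_id: str) -> int:
        self.counter += 1                                    # INCR counter-name
        self.collector.setdefault(session_id, self.counter)  # ZADD collector-name NX score member
        return self.counter

    def session_end(self) -> None:
        self.counter -= 1                                    # DECR counter-name

    def report(self) -> Optional[Tuple[str, int]]:
        snapshot, self.collector = self.collector, {}        # RENAME collector-name tempkey
        if not snapshot:
            return None                                      # no session started in this period
        member = max(snapshot, key=lambda m: snapshot[m])    # ZREVRANGEBYSCORE ... LIMIT 0 1
        return member, snapshot[member]                      # snapshot is then discarded (DEL tempkey)
```

One case the model makes visible: a period in which no session starts produces an empty collector even if long-running sessions are still active, and in Redis the `RENAME` would return an error for the missing key, so that case needs explicit handling.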
Are there any flaws in this setup that I've missed?
I'm not detecting any major flaws in the flow, but the choice of data structure could be improved.
Sorted Sets are comparatively expensive in terms of space and time, and your scenario doesn't utilize their special ability (i.e. ordering). More optimal structures would be a Hash of counters, or the highly-compressed BITFIELD.
Hi, imagine I have code like this:
0. void someFunction()
1. {
2. ...
3. if(x>5)
4. doSmth();
5.
6. writeDataToCard(handle, data1);
7.
8. writeDataToCard(handle, data2);
9.
10. incrementDataOnCard(handle, data);
11. }
The problem is the following. If steps 6 and 8 are executed and then someone, say, removes the card, operation 10 will not be completed successfully. That would be a bug in my system: if 6 and 8 are executed, then 10 MUST also be executed. How do I deal with such situations?
Quick Summary: What I mean is say after step 8 someone may remove my physical card, which means that step 10 will never be reached, and that will cause a problem in my system. Namely card will be initialized with incomplete data.
You will have to create some kind of protocol. For instance, you write to the card a list of operations to complete:
Step6, Step8, Step10
and as you complete the tasks you remove the entry from the list.
When you reread the data from the disk, you check the list if any entry remains. If it does, the operation did not successfully complete before and you restore a previous state.
Unless you can somehow physically prevent the user from removing the card, there is no other way.
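A minimal sketch of that protocol, with the card modelled as a plain dictionary. Everything here is hypothetical: `run_transaction`, `recover`, the `pull_after` switch that simulates removal, and the choice to keep the previous state as a backup on the card itself.

```python
from typing import Callable, Dict, List, Optional, Tuple


class CardRemoved(Exception):
    """Simulates the card being pulled out mid-transaction."""


def new_card() -> Dict:
    return {"data": {"count": 0}, "pending": [], "backup": None}


def run_transaction(card: Dict, steps: List[Tuple[str, Callable]],
                    pull_after: Optional[int] = None) -> None:
    card["backup"] = dict(card["data"])              # previous state, kept on the card
    card["pending"] = [name for name, _ in steps]    # list of operations to complete
    for done, (name, apply) in enumerate(steps):
        if pull_after is not None and done == pull_after:
            raise CardRemoved(name)                  # simulated removal before this step
        apply(card["data"])
        card["pending"].remove(name)                 # tick the completed step off the list
    card["backup"] = None


def recover(card: Dict) -> bool:
    """On re-read: if any entry remains, restore the previous state."""
    if card["pending"]:
        card["data"] = card["backup"]
        card["pending"] = []
        card["backup"] = None
        return True
    return False
```

Calling `recover` every time the card is read back is what turns a half-finished sequence into either the old state or the fully updated one.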
If the transaction is interrupted then the card is in the fault state. You have three options:
Do nothing. The card is in fault state, and it will remain there. Advise users not to play with the card. Card can be eligible for complete clean or format.
Roll back the transaction the next time the card becomes available. You need enough information on the card and/or some central repository to perform the rollback.
Complete the transaction the next time the card becomes available. You need enough information on the card and/or some central repository to perform the completion.
In all three cases you need to have a flag on the card denoting a transaction in progress.
More details are required in order to answer this.
However, making some assumptions, I will suggest two possible solutions (more are possible...).
I assume the write operations are persistent - hence data written to the card is still there after card is removed-reinserted, and that you are referring to the coherency of the data on the card - not the state of the program performing the function calls.
Also assumed is that the increment method, increments the data already written, and the system must have this operation done in order to guarantee consistency:
For each record written, maintain another data element (on the card) that indicates the record's state. This state will be initialized to something (say "WRITING" state) before performing the writeData operation. This state is then set to "WRITTEN" after the incrementData operation is (successfully!) performed.
When reading from the card, you first check this state and ignore (or delete) the record if it's not WRITTEN.
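The first option might look like the sketch below, assuming the card can be treated as a list of records. The field names and the `pull_before_commit` switch, which simulates the card being removed before the increment, are hypothetical.

```python
from typing import Dict, List

WRITING, WRITTEN = "WRITING", "WRITTEN"


def write_record(card: List[Dict], payload: str, pull_before_commit: bool = False) -> bool:
    record = {"state": WRITING, "payload": None, "count": 0}
    card.append(record)              # state is set to WRITING before any data is written
    record["payload"] = payload      # stands in for writeDataToCard
    if pull_before_commit:
        return False                 # simulated removal: the record stays in WRITING
    record["count"] += 1             # stands in for incrementDataOnCard
    record["state"] = WRITTEN        # set only after the increment succeeded
    return True


def valid_records(card: List[Dict]) -> List[str]:
    """Readers ignore every record that never reached WRITTEN."""
    return [r["payload"] for r in card if r["state"] == WRITTEN]
```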
Another option will be to maintain two (persistent) counters on the card: one counting the number of records that began writing, the other counts the number of records that ended writing.
You increment the first before performing the write, and then increment the second after (successfully) performing the incrementData call.
When later reading from the card, you can easily check whether a record is indeed valid or needs to be discarded.
This option is valid if the written records are somehow ordered or indexed, so you can see which and how many records are valid just by checking the counter. It has the advantage of requiring only two counters for any number of records (compared to 1 state for EACH record in option 1.)
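A sketch of the two-counter variant under the same kind of simplification: the card is a dictionary with an ordered record list, and `started`, `finished`, `tidy_up`, and `pull_before_commit` are hypothetical names.

```python
from typing import Dict, List


def new_card() -> Dict:
    return {"started": 0, "finished": 0, "records": []}


def write_record(card: Dict, payload: str, pull_before_commit: bool = False) -> bool:
    card["started"] += 1               # first counter: a write has begun
    card["records"].append(payload)    # stands in for writeDataToCard
    if pull_before_commit:
        return False                   # simulated removal before incrementDataOnCard
    card["finished"] += 1              # second counter: write and increment both done
    return True


def valid_records(card: Dict) -> List[str]:
    """Records are ordered, so the first `finished` entries are the valid ones."""
    return card["records"][:card["finished"]]


def tidy_up(card: Dict) -> None:
    """Discard unfinished records and bring the counters back in step."""
    del card["records"][card["finished"]:]
    card["started"] = card["finished"]
```

In this model `tidy_up` has to run before the next write; otherwise a later successful write would push `finished` past the stale record and make it look valid.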
On the host (software) side you then need to check that the card is available before beginning the write (don't write if it's not there). If, after the incrementData op, you detect that the card was removed, you need to be sure to tidy things up (remove unfinished records, update the counters) either once you detect that the card is reinserted, or before doing another write. For this you'll need to maintain state information on the software side.
Again, the type of solution (out of many more) depends on the exact system and requirements.
Isn't that just:
Copy data to temporary_data.
Write to temporary_data.
Increment temporary_data.
Rename data to old_data.
Rename temporary_data to data.
Delete the old_data.
You will still have a race condition (if a lucky user removes the card) at the two rename steps, but you might restore the data or temporary_data.
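If the card happens to be mounted as a filesystem on which replacing a file by rename is atomic (an assumption, not something the question states), the copy-then-rename idea can be sketched as below. Note that this variant swaps the two renames and the delete for a single `os.replace`; `update_atomically` and the JSON layout are hypothetical.

```python
import json
import os
from typing import Callable, Dict


def update_atomically(path: str, update: Callable[[Dict], None]) -> None:
    """Apply `update` to a copy of the data, then swap the copy into place."""
    with open(path) as f:
        data = json.load(f)          # copy data to a temporary working copy
    update(data)                     # write to and increment the copy
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(data, f)
        f.flush()
        os.fsync(f.fileno())         # make sure the copy is on the medium first
    os.replace(tmp_path, path)       # single rename over the old file
```

If anything fails before the final rename, the original file is left as it was, and a leftover `.tmp` file is the only trace of the interrupted update.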
You haven't said what you're incrementing (or why), or how your data is structured (presumably there is some relationship between whatever you're writing with writeDataToCard and whatever you're incrementing).
So, while there may be clever techniques specific to your data, we don't have enough to go on. Here are the obvious general-purpose techniques instead:
the simplest thing that could possibly work - full-card commit-or-rollback
Keep two copies of all the data, the good one and the dirty one. A single byte at the lowest address is sufficient to say which is the current good one (it's essentially an index into an array of size 2).
Write your new data into the dirty area, and when that's done, update the index byte (so swapping clean & dirty).
Either the index is updated and your new data is all good, or the card is pulled out and the previous clean copy is still active.
Pro - it's very simple
Con - you're wasting exactly half your storage space, and you need to write a complete new copy to the dirty area when you change anything. You haven't given enough information to decide whether this is a problem for you.
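A byte-level sketch of the full-card commit-or-rollback scheme, with the card modelled as a `bytearray`: byte 0 is the index and two fixed-size slots follow. `SLOT_SIZE`, `read_good`, `write_new`, and the `pull_before_commit` switch are hypothetical, and the single-byte index write is assumed to be atomic on the real card.

```python
SLOT_SIZE = 4  # toy size of one complete copy of the data


def new_card(initial: bytes) -> bytearray:
    card = bytearray(1 + 2 * SLOT_SIZE)      # index byte + two slots
    card[1:1 + SLOT_SIZE] = initial          # slot 0 starts as the good copy
    return card


def read_good(card: bytearray) -> bytes:
    start = 1 + card[0] * SLOT_SIZE          # index byte selects the good slot
    return bytes(card[start:start + SLOT_SIZE])


def write_new(card: bytearray, data: bytes, pull_before_commit: bool = False) -> bool:
    if len(data) != SLOT_SIZE:
        raise ValueError("data must fill exactly one slot")
    dirty = 1 - card[0]                      # the slot that is not currently good
    start = 1 + dirty * SLOT_SIZE
    card[start:start + SLOT_SIZE] = data     # write the complete new copy
    if pull_before_commit:
        return False                         # simulated removal: index still points at the old copy
    card[0] = dirty                          # commit by flipping the index byte
    return True
```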
... now using less space ... - commit-or-rollback smaller subsets
If you can't waste 50% of your storage, split your data into independent chunks and version each of those independently. Now you only need enough space to duplicate your largest single chunk, but instead of a simple index you need an offset or pointer for each chunk.
Pro - still fairly simple
Con - you can't handle dependencies between chunks, they have to be isolated
journalling
As per RedX's answer, this is used by a lot of filesystems to maintain integrity.
Pro - it's a solid technique, and you can find documentation and reference implementations for existing filesystems
Con - you just wrote a modern filesystem. Is this really what you wanted?
I've written a 'server' program that writes to shared memory, and a client program that reads from the memory. The server has different 'channels' that it can write to, which are just different linked lists that it appends items to. The client is interested in some of the linked lists, and wants to read every node that's added to those lists as it comes in, with the minimum latency possible.
I have 2 approaches for the client:
For each linked list, the client keeps a 'bookmark' pointer to keep its place within the linked list. It round robins the linked lists, iterating through all of them over and over (it loops forever), moving each bookmark one node forward each time if it can. Whether it can is determined by the value of a 'next' member of the node. If it's non-null, then jumping to the next node is safe (the server switches it from null to non-null atomically). This approach works OK, but if there are a lot of lists to iterate over, and only a few of them are receiving updates, the latency gets bad.
The server gives each list a unique ID. Each time the server appends an item to a list, it also appends the ID number of the list to a master 'update list'. The client only keeps one bookmark, a bookmark into the update list. It endlessly checks if the bookmark's next pointer is non-null ( while(node->next_ == NULL) {} ), if so moves ahead, reads the ID given, and then processes the new node on the linked list that has that ID. This, in theory, should handle large numbers of lists much better, because the client doesn't have to iterate over all of them each time.
When I benchmarked the latency of both approaches (using gettimeofday), to my surprise #2 was terrible. The first approach, for a small number of linked lists, would often be under 20us of latency. The second approach would have small spats of low latencies but often be between 4,000-7,000us!
Through inserting gettimeofday's here and there, I've determined that all of the added latency in approach #2 is spent in the loop repeatedly checking if the next pointer is non-null. This is puzzling to me; it's as if the change in one process is taking longer to 'publish' to the second process with the second approach. I assume there's some sort of cache interaction going on I don't understand. What's going on?
Update: Originally, approach #2 used a condition variable, so that if node->next_ == NULL it would wait on the condition, and the server would notify on the condition every time it issued an update. The latency was the same, and in trying to figure out why, I reduced the code down to the approach above. I'm running on a multicore machine, so one process spinlocking shouldn't affect the other.
Update 2: node->next_ is volatile.
Since it sounds like reads and writes are occurring on separate CPUs, perhaps a memory barrier would help? Your writes may not be occurring when you expect them to be.
You are doing a Spin Lock in #2, which is generally not such a great idea, and is chewing up cycles.
Have you tried adding a yield after each failed polling-attempt in your second approach? Just a guess, but it may reduce the power-looping.
With Boost.Thread this would look like this:
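#include <boost/thread/thread.hpp>  // declares boost::this_thread::yield

// Give up the time slice after each failed poll instead of spinning flat out.
while (node->next_ == NULL) {
    boost::this_thread::yield();
}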