I still have some difficulty understanding how the PBFT consensus algorithm works in Hyperledger Fabric 0.6. Are there any papers that describe the PBFT algorithm in a blockchain environment?
Thank you very much for your answers!
While Hyperledger Fabric v0.6 has been deprecated for quite some time (we are working towards the release of v1.1 shortly, as I write this), we have preserved the archived repository, and the protocol specification contains all that you might want to know about how the system works.
It is really too long a description to add here.
Practical Byzantine Fault Tolerance (PBFT) is a protocol developed to
provide consensus in the presence of Byzantine faults.
The PBFT algorithm has three phases, which run in sequence to
achieve consensus: pre-prepare, prepare, and commit.
The protocol runs in rounds where, in each round, an elected leader
node, called the primary node, handles the communication with the
client. In each round, the protocol progresses through the three
previously mentioned phases. The participants in the PBFT protocol are
called replicas; one of the replicas acts as the primary (the leader)
in each round, and the rest of the nodes act as backups.
Pre-prepare:
This is the first phase in the protocol, where the primary node, or
primary, receives a request from the client. The primary node assigns
a sequence number to the request. It then sends the pre-prepare
message with the request to all backup replicas. When a backup replica
receives the pre-prepare message, it checks a number of
things to ensure the validity of the message:
First, whether the digital signature is valid.
Then, whether the current view number is valid.
Next, whether the sequence number of the operation's request message is valid.
Finally, whether the digest/hash of the operation's request message is valid.
If all of these elements are valid, then the backup replica accepts
the message. After accepting the message, it updates its local state
and progresses toward the preparation phase.
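As a rough sketch of those checks in Python (the message fields, the watermark window, and the helpers are illustrative assumptions, not Fabric v0.6 APIs):

    import hashlib
    from dataclasses import dataclass

    # Illustrative sketch only: field names are assumptions, not Fabric's actual types.
    @dataclass
    class PrePrepare:
        view: int           # view number the primary believes it is in
        seq: int            # sequence number assigned by the primary
        digest: str         # hash of the client request
        request: bytes      # the client request itself
        signature_ok: bool  # stand-in for a real digital-signature check

    def valid_pre_prepare(msg: PrePrepare, current_view: int, low: int, high: int) -> bool:
        """Run the four validity checks described above."""
        if not msg.signature_ok:                                    # 1. digital signature
            return False
        if msg.view != current_view:                                # 2. current view number
            return False
        if not (low < msg.seq <= high):                             # 3. sequence number in the accepted window
            return False
        if msg.digest != hashlib.sha256(msg.request).hexdigest():   # 4. digest of the request
            return False
        return True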
Prepare:
A prepare message is sent by each backup to all other replicas in the
system. Each backup waits to receive at least 2F + 1 prepare messages
from other replicas (where F is the maximum number of faulty replicas
the system can tolerate). Each backup also checks whether the prepare
messages contain the same view
number, sequence number, and message digest values. If all these
checks pass, then the replica updates its local state and progresses
toward the commit phase.
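A minimal sketch of that 2F + 1 matching-prepare check (the message tuples and names are illustrative):

    # Sketch of the prepare-phase quorum check; message tuples are illustrative.
    def prepared(prepare_msgs, view, seq, digest, f):
        """prepare_msgs: iterable of (sender, view, seq, digest) tuples received so far."""
        matching = {sender for sender, v, s, d in prepare_msgs
                    if (v, s, d) == (view, seq, digest)}
        return len(matching) >= 2 * f + 1   # enough matching prepares from distinct replicas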
Commit:
Each replica sends a commit message to all other replicas in the
network. As in the prepare phase, replicas wait for 2F + 1
commit messages to arrive from other replicas. The replicas also check
the view number, sequence number, and message digest values. If they
are valid for 2F + 1 commit messages received from other replicas, then
the replica executes the request, produces a result, and finally
updates its state to reflect a commit. If there are already some
requests queued up with lower sequence numbers, the replica executes
those first before processing the latest sequence numbers. Finally, the replica
sends the result to the client in a reply message. The client accepts
the result only after receiving F + 1 reply messages containing the
same result.
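A small sketch of that client-side acceptance rule (purely illustrative, not taken from the Fabric code):

    # Sketch of the client-side rule: accept a result once F + 1 distinct replicas
    # have replied with the same value, so at least one reply comes from a correct node.
    def accepted_result(replies, f):
        """replies: iterable of (replica_id, result) pairs received for one request."""
        votes = {}
        for replica_id, result in replies:
            votes.setdefault(result, set()).add(replica_id)
        for result, senders in votes.items():
            if len(senders) >= f + 1:
                return result    # enough matching replies; accept this result
        return None              # not enough matching replies yet; keep waiting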
I am trying to perform some parallelization on work which is computationally expensive to process on AWS via lambda functions. Specifically, the current architecture consists of a coordinator lambda which invokes several copies of a worker lambda via SNS with metadata specific to each invocation. These workers take the event from SNS to decide which partition of the data to work on, but I need something a bit more dynamic.
I need each worker to be ready to ingest new messages which affect the state of the worker. The one constraint is that these messages are indexed by a key. Initially it does not matter which worker ingests a message with a particular key. What is important is that once a worker accepts a message with that key (maybe through an acknowledgment?), it can only accept future messages with that specific key.
The number of possible keys is far, far smaller than the number of workers, but the keys themselves are not known in advance. Usually they are determined after fanning out. The number of messages for a given key is around 2-8, each separated by some time. If it is important to the question, maybe we should distinguish case 1 - where exactly one worker can commit to a key - from case 2 - where multiple workers can commit to the same key.
Desired example of case #1:
Coordinator creates W1, W2, W3.
Coordinator calculates key K1. Begins sending messages indexed by K1.
W1 receives M1:K1. W1 acknowledges message M1:K1 and commits to only seeing key K1.
W2 receives M2:K1 and attempts to acknowledge it. The acknowledgement fails since W1 already committed to K1.
W1 processes M2:K1.
Coordinator begins sending messages indexed by K2.
W1 doesn't see (or ignores?) M4:K2.
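In rough Python, the per-worker behaviour I'm after looks something like this (everything here is illustrative; the shared claim store is just a dict standing in for whatever mechanism would actually enforce the commitment):

    # Pure-Python illustration of the desired case #1 semantics. In reality CLAIMS
    # would have to live in some shared store with an atomic "claim if unclaimed"
    # operation; here it is an in-process dict so the sketch runs.
    CLAIMS = {}             # key -> worker_id that committed to it
    committed_key = None    # this worker's key, once committed

    def handle_message(worker_id, msg):
        global committed_key
        key = msg["key"]
        if committed_key is None:
            owner = CLAIMS.setdefault(key, worker_id)   # stand-in for an atomic claim
            if owner != worker_id:
                return "reject"      # another worker already committed to this key
            committed_key = key
        if key != committed_key:
            return "ignore"          # committed workers skip messages for other keys
        # ... process the message for our key here ...
        return "ack"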
I'm not really sure how to go about designing this change. For example, in case #1 it would be ideal to have a dedicated SQS queue for the lambda that received and acknowledged the first message of a given key. The problem is that the coordinator would need to create the resource on the fly, get the lambda to read from it, etc., which seems very expensive. Maybe I'm misunderstanding SQS, but it doesn't seem to support routing messages with different keys within a single queue. SNS probably won't do since no data is persisted. I'm not sure about EventBridge. Another concern is that there will be a lot of herding, where lambdas that haven't committed to a particular message key send acknowledgements to the coordinator which eventually fail. They will fail on basically all keys, since there are so many workers compared to keys.
What I'm not looking for
A long-lived system such as EKS. There are usually only 2-3 messages for any given key, and processing each message is fairly cheap.
Preferably, once a worker has committed to a key, it does not need to see messages for different keys. This maybe isn't a problem now, but will probably be one if the # of messages is far greater than 2-10.
I am curious for feedback. Thanks.
I am looking into ways to order a list of messages from Google Cloud Pub/Sub. The documentation says:
Have a way to determine from all messages it has currently received whether or not there are messages it has not yet received that it needs to process first.
...is possible by using Cloud Monitoring to keep track of the pubsub.googleapis.com/subscription/oldest_unacked_message_age metric. A subscriber would temporarily put all messages in some persistent storage and ack the messages. It would periodically check the oldest unacked message age and check against the publish timestamps of the messages in storage. All messages published before the oldest unacked message are guaranteed to have been received, so those messages can be removed from persistent storage and processed in order.
I tested it locally and this approach seems to be working fine.
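For reference, a minimal sketch of the buffer-and-release loop the documentation describes (the oldest-unacked age and the processing callback are passed in as assumptions; this is not a real Cloud Monitoring call):

    import time

    # Sketch of the documented approach: buffer already-acked messages, then release
    # everything published before (now - oldest_unacked_message_age), in publish order.
    def release_ready(buffer, oldest_unacked_age_seconds, process):
        """buffer: list of (publish_time_epoch_seconds, message) pairs, already acked."""
        cutoff = time.time() - oldest_unacked_age_seconds
        ready = sorted((item for item in buffer if item[0] <= cutoff),
                       key=lambda item: item[0])
        for publish_ts, message in ready:
            process(message)                     # handle in publish-time order
            buffer.remove((publish_ts, message))
        return buffer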
I have one gripe with it, however, and it is not something I can easily test myself.
This solution relies on server-side assigned (by google) publish_time attribute. How does Google avoid the issues of skewed clocks?
If my producer publishes messages A and then immediately B, how can I be sure that A.publish_time < B.publish_time is true? Especially considering that the same documentation page mentions internal load-balancers in the architecture of the solution. Is Google Pub/Sub using atomic clocks to synchronize time on the very first machines which see messages and enrich those messages with the current time?
There is an implicit assumption in the recommended solution that the clocks on all the servers are synchronized. But the documentation never explains if that is true or how it is achieved so I feel a bit uneasy about the solution. Does it work under very high load?
Notice I am only interested in relative order of confirmed messages published after each other. If two messages are published simultaneously, I don't care about the order of them between each other. It can be A, B or B, A. I only want to make sure that if B is published after A is published, then I can sort them in that order on retrieval.
Is the aforementioned solution only "best-effort" or are there actual guarantees about this behavior?
There are two sides to ordered message delivery: establishing an order of messages on the publish side and having an established order of processing messages on the subscribe side. The document to which you refer is mostly concerned with the latter, particularly when it comes to using oldest_unacked_message_age. When using this method, one can know that if message A has a publish timestamp that is less than the publish timestamp for message B, then a subscriber will always process message A before processing message B. Essentially, once the order is established (via publish timestamps), it will be consistent. This works if it is okay for the Cloud Pub/Sub service itself to establish the ordering of messages.
Publish timestamps are not synchronized across servers, so if it is necessary for the order to be established by the publishers, the publishers will need to provide a timestamp (or sequence number) as an attribute that is used for ordering in the subscriber (and synchronized across publishers). The subscriber would then sort messages by this user-provided timestamp instead of by the publish timestamp. The oldest_unacked_message_age will no longer be exact because it is tied to the publish timestamp. One could be more conservative and only consider messages ordered that are older than oldest_unacked_message_age minus some delta to account for this discrepancy.
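A short sketch of that publisher-side approach with the standard Python client (the project, topic, and "seq" attribute name are placeholders):

    from google.cloud import pubsub_v1

    # Sketch: the publisher establishes the order with its own sequence-number
    # attribute ("seq" is an arbitrary name); the subscriber sorts by it.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "my-topic")   # placeholder ids

    for seq, payload in enumerate([b"first", b"second", b"third"]):
        # Attributes must be strings; .result() waits for the publish to complete.
        publisher.publish(topic_path, data=payload, seq=str(seq)).result()

    # Subscriber side: buffer received messages, then process them in publisher order.
    def in_publisher_order(messages):
        return sorted(messages, key=lambda m: int(m.attributes["seq"]))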
Google Cloud Pub/Sub does not guarantee that events are delivered to consumers in the order they were produced. The reason is that Google Cloud Pub/Sub itself runs on a cluster of nodes, so it is possible for an event B to reach the consumer before event A. To ensure ordering, you have to make changes on both the producer and the consumer so they can identify the order of events. Here is the relevant section from the docs.
I am trying to understand how PBFT (Practical Byzantine Fault Tolerance) is applied in blockchain. After reading the paper, I found that the process for PBFT to reach consensus is as follows:
A client sends a request to invoke a service operation to the primary
The primary multicasts the request to the backups
Replicas execute the request and send a reply to the client
The client waits for f + 1 replies from different replicas with the same result; this is the result of the operation.
This is how I understand it being applied in blockchain:
First, the elected primary node wants to write transaction A to the chain, so it broadcasts transaction A to the other nodes.
Any node that receives the transaction checks whether it is legal. If the transaction is considered legal, the node broadcasts a "legal" signal to all nodes in this round of consensus.
Any node that receives f + 1 or more such responses writes the transaction to its own chain.
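As a rough illustration of the last step in my understanding (just a sketch, not from any real implementation):

    # Sketch: a node appends the transaction once it has seen f + 1 or more
    # "legal" signals from distinct nodes for this round of consensus.
    def maybe_append(chain, transaction, legal_signals, f):
        """legal_signals: set of node ids that broadcast 'legal' for this transaction."""
        if len(legal_signals) >= f + 1:
            chain.append(transaction)
            return True
        return False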
Here are my questions:
For malfunctioning nodes: if they keep failing to write blocks to their chains, they will end up holding chains that differ from those of the healthy nodes. In the next round of consensus, the existing chain is picked up first. How do nodes know which one is the correct chain?
In step 1, the elected node sends the transaction to other nodes. Does "other nodes" mean all nodes in the network? How can we make sure that all nodes are included in the consensus, given that there is no centralized agency?
How do nodes know which one is the correct chain?
To tolerate Byzantine faulty nodes, the network needs at least 3f + 1 nodes. PBFT is one of the algorithms that can tolerate Byzantine failure, so PBFT can tolerate up to f Byzantine nodes.
If there are f malicious nodes that keep failing to write blocks to their chains, resulting in inconsistency with the correct nodes, then one can conclude that the identical chains held by the remaining 2f + 1 nodes are correct. (Correct nodes always output exactly the same data for the same requests, in the same order.)
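As a quick sanity check of that arithmetic (illustrative numbers only):

    # With f faulty nodes, PBFT needs n = 3f + 1 replicas in total, and a quorum of
    # 2f + 1 matching answers is enough to outvote the f possibly-faulty ones.
    for f in (1, 2, 3):
        n = 3 * f + 1
        quorum = 2 * f + 1
        print(f"f={f}: network size n={n}, matching quorum={quorum}")
    # prints: f=1 -> n=4, quorum=3; f=2 -> n=7, quorum=5; f=3 -> n=10, quorum=7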
Does "other nodes" means all nodes in the network? How to make sure if all nodes included in the consensus because there is not a centralized agency.
In a PBFT setup, the identities of all nodes must be established. To do that, there has to be a central authority that determines whether a node can join the network or not. (Important: the central authority only engages in identity management, not in the algorithm itself.)
Why is this needed? Because PBFT works through a voting mechanism, and voting is not secure when anyone (including a malicious node) can join the network. For example, a value proposed by the primary can only be recorded on all nodes by way of state machine replication, which means that at least 2f + 1 matching messages are needed for the value to be accepted by the correct nodes.
Without trusted identity management, a Sybil attack is possible. This is the main reason why PBFT is not suited to an open blockchain that allows any node to freely join or leave the network.
I am having trouble understanding the cluster algorithm used in Akka.
In the description of the Akka Gossip Protocol it says that:
The recipient of the gossip state or the gossip status can use the gossip version (vector clock) to determine whether:
1. it has a newer version of the gossip state, in which case it sends that back to the gossiper
2. it has an outdated version of the state, in which case the recipient requests the current state from the gossiper by sending back its version of the gossip state
3. it has conflicting gossip versions, in which case the different versions are merged and sent back
Step two seems a waste of communication, as the gossiper sends its state twice: once when it is noticed that the recipient does not have the newest version, and again when the recipient requests the newest version by sending its own outdated version back.
I think I am misunderstanding this because my understanding of vector clocks and CRDTs is limited, and the description given in the Akka documentation is short, while the Wikipedia article is too advanced. As far as I interpret it, a vector clock is an implementation of a CRDT, but that might be incorrect.
But in the end I don't understand why the gossip node needs to communicate its state twice. Please clarify.
As stated in the Akka Cluster documentation, Akka implements:
A variation of push-pull gossip is used to reduce the amount of gossip
information sent around the cluster. In push-pull gossip a digest is
sent representing current versions but not actual values; the
recipient of the gossip can then send back any values for which it has
newer versions and also request values for which it has outdated
versions.
Thus, simplifying: in case 2, when a recipient of a gossip message sees that it has an outdated version of the cluster state, it asks the gossiper for the latest state.
So the first message from the gossiper to the recipient carries only the versions (the digest), and the second message from the gossiper to the recipient carries the actual states of the nodes.
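As an aside, a toy vector-clock comparison (plain dictionaries keyed by node; this is not Akka's implementation) shows how the recipient can tell which of the three cases it is in:

    # Toy vector-clock comparison (not Akka's implementation): decide whether the
    # local gossip version is newer than, older than, or in conflict with the remote one.
    def compare(local, remote):
        """local/remote: dicts mapping node -> counter."""
        nodes = set(local) | set(remote)
        local_ahead = any(local.get(n, 0) > remote.get(n, 0) for n in nodes)
        remote_ahead = any(remote.get(n, 0) > local.get(n, 0) for n in nodes)
        if local_ahead and remote_ahead:
            return "conflicting"   # case 3: merge the versions and send the result back
        if local_ahead:
            return "newer"         # case 1: send our state back to the gossiper
        if remote_ahead:
            return "older"         # case 2: request the current state from the gossiper
        return "same"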
Hope this helps.
I am new to Amazon Web Services and am currently trying to get my head around how Simple Queue Service (SQS) works.
In the link ReceiveMessage the following is mentioned:
Short poll is the default behavior where a weighted random set of
machines is sampled on a ReceiveMessage call. This means only the
messages on the sampled machines are returned. If the number of
messages in the queue is small (less than 1000), it is likely you will
get fewer messages than you requested per ReceiveMessage call. If the
number of messages in the queue is extremely small, you might not
receive any messages in a particular ReceiveMessage response; in which
case you should repeat the request.
What I understand is that there is one queue and many machines/instances can read its messages. What is not clear to me is what "weighted random set of machines" means. Is there more than one queue spread across a number of machines? Clearly I am lacking some knowledge of how SQS works.
I believe what this means is that because SQS is distributed across multiple servers, not all of the machines (Amazon's servers that hold your queue) will have exactly the same queue content at all times, because they won't always be in sync with each other at every instant.
You don't know or control which of Amazon's servers your messages will be served from; SQS uses an algorithm to figure out which messages are sent to you when you request some. That is why you don't always get messages when you ask for them, and why occasionally the same message will be served up more than once; you need to make sure that whatever your processing entails, it can deal with the possibility that it is processing something that has already been processed by another of your worker machines.
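A minimal sketch of that kind of defensive, idempotent consumption (the queue URL is a placeholder, and the de-duplication store is an in-memory set that a real fleet of workers would have to replace with something shared):

    import boto3

    # Sketch of handling SQS at-least-once delivery defensively. The de-duplication
    # store is an in-memory set here; real workers would need a shared store.
    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder
    already_processed = set()

    def handle(body):
        ...   # placeholder for the real work

    def poll_once():
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        for msg in resp.get("Messages", []):
            if msg["MessageId"] not in already_processed:
                handle(msg["Body"])
                already_processed.add(msg["MessageId"])
            # delete even if it was a duplicate, so it is not redelivered again
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])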