How does the Google File System deal with write failures at replicas?

I was reading the Google File System paper and am not sure how it deals with write (not atomic record append) failures at replicas. If it returns success, then it will wait until the next heartbeat for the master to get the updated state of the world, detect the corrupted/stale chunk version, and delete the chunk. I guess that the master could verify the validity of all the replicas whenever clients ask for replica locations, to prevent clients from ever getting stale/corrupted data. Is this how it deals with replica write failures?

Invalid chunks fall into two categories: stale chunks and chunks with corrupted checksums. They are detected in two different ways.
Stale chunk: the chunk's version number is not up to date. The master detects stale chunks during the regular heartbeats with the chunkservers.
Corrupted checksum: since divergent replicas may be legal, replicas across GFS are not guaranteed to be identical. In addition, for performance reasons, checksumming is performed independently on the chunkservers themselves rather than on the master.
Checksums are verified in two situations:
When clients or other chunkservers request the chunk
When a chunkserver is idle, it scans and verifies inactive chunks so that corrupted chunks are not counted as valid replicas.
If a checksum mismatch is found, the chunkserver reports the problem to the master. The master clones the replica from a chunkserver holding a healthy replica, and then instructs the chunkserver that reported the problem to delete the corrupted chunk.
Back to your question: how does GFS deal with a replica write failure?
If any error is encountered at any replica, the failure of the mutation is reported to the client. The client must handle the error and retry the mutation. The inconsistent chunks will be garbage collected during the regular scans on the chunkservers.
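For intuition, here is a minimal sketch of the client-side behaviour described above. The types and method names are hypothetical (GFS has no public client API); it only illustrates that a failed mutation is reported to the client and simply retried, while bad replicas are cleaned up later by the version check / checksum scan and garbage collection.

```java
import java.io.IOException;

// Hypothetical sketch of a GFS-style client retrying a failed write.
// None of these types exist in a public GFS API; they only illustrate the flow.
public final class GfsWriteExample {

    interface ChunkClient {
        /** Pushes data and asks the primary to apply the mutation on all replicas. */
        boolean tryWrite(long chunkHandle, long offset, byte[] data) throws IOException;
    }

    static void writeWithRetry(ChunkClient client, long chunkHandle,
                               long offset, byte[] data, int maxAttempts) throws IOException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // The primary reports failure if any replica could not apply the mutation.
            if (client.tryWrite(chunkHandle, offset, data)) {
                return; // success on all replicas
            }
            // On failure the client simply retries; the affected region stays
            // inconsistent until a retry succeeds, and bad replicas are later
            // detected (version check / checksum scan) and garbage collected.
        }
        throw new IOException("mutation failed after " + maxAttempts + " attempts");
    }
}
```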

Your question is not very clear, but I'll take an educated guess at what you are asking. It seems related to how the chunk version number helps in stale replica detection:
The client asks to write to a file.
The master sends a metadata response to the client. That metadata response also includes a chunk version number and indicates which chunkserver holds the lease.
The master then increments the chunk version number and asks all the chunkservers holding replicas of that chunk to do the same.
All of this happens before any data is written to the chunk.
Say a chunkserver crashes.
Once the chunkserver restarts, the master and the chunkserver communicate via heartbeat messages and compare the chunk version number. If the write had been accomplished, the chunk version number on every replica would be the same as the master's chunk version number. If it is not the same, a failure occurred during the write.
Those failed (stale) replicas are then removed during garbage collection.
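A minimal sketch of that version comparison, with hypothetical names (the real master logic is not public):

```java
// Hypothetical sketch (not the real GFS master code) of stale-replica detection
// via chunk version numbers during a chunkserver heartbeat.
final class StaleReplicaCheck {

    /**
     * Called when a restarted chunkserver reports a replica in its heartbeat.
     * masterVersion is the version number the master recorded when it last
     * granted a lease for this chunk.
     */
    static boolean isStale(long reportedVersion, long masterVersion) {
        // If the replica missed the version bump (e.g. it was down while the
        // mutation was applied), its version is behind the master's: stale.
        return reportedVersion < masterVersion;
    }

    public static void main(String[] args) {
        long masterVersion = 7;
        System.out.println(isStale(7, masterVersion)); // false: up to date
        System.out.println(isStale(6, masterVersion)); // true: missed a write, will be GC'd
    }
}
```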

Related

flink readCSV fails with "org.apache.flink.fs.s3base.shaded.com.amazonaws.SdkClientException: Timeout waiting for connection from pool"

We are using the Flink 1.9.0 DataSet API to read CSV files from an Amazon S3 bucket and are facing connection pool timeouts most of the time.
Following is the setup at the Flink level:
We read 19,708 objects from S3 in a single go, as we need to apply our logic on top of the whole data set. For example, imagine 20 source folders (AAA, BBB, CCC, ...) each with multiple subfolders (AAA/4May2020/../../1.csv, AAA/4May2020/../../2.csv, AAA/3May2020/../../1.csv, AAA/3May2020/../../2.csv, ...). Before calling readCSV, the logic scans the folders, picks only the one with the latest date folder, and passes that in for reading. For the read operation we use a parallelism of 5, but when the execution graph is formed all 20 sources come up together.
We run on Kube-AWS with around 10 Task Managers hosted on m5.4xlarge machines. Each Task Manager docker container is allocated 8 cores and 50 GB of memory.
The following have been tried to address the issue, but with no luck so far. We really need some pointers and help to address this:
Enabled the Flink retry mechanism with failover strategy "region"; sometimes it gets through with retries, but even with retries it fails intermittently.
Revisited core-site.xml as per the AWS site:
fs.s3a.threads.max: 3000, fs.s3a.connection.maximum: 4500
Could anyone also help with the following questions:
Is there any way to check whether the HTTP connections opened by readCSV are closed?
Any pointers to understand how the DataSet readCSV operates would help (a minimal sketch of our read setup follows below).
Is there any way to introduce a wait mechanism before the read?
Is there any better way to address this issue?
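For reference, here is a minimal sketch of how one of our per-folder sources is set up with the DataSet API; the bucket path and the two column types are placeholders, not our real schema, and the parallelism value is just the one mentioned above.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class ReadCsvSketch {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // One source per "latest date" folder selected by the scanning logic.
        // The bucket/path and the column types below are placeholders.
        DataSet<Tuple2<String, Integer>> latest = env
                .readCsvFile("s3://my-bucket/AAA/4May2020/")
                .types(String.class, Integer.class)
                .setParallelism(5); // per-source read parallelism

        latest.first(10).print(); // print() triggers execution of this sketch
    }
}
```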

PBFT Consensus Algorithm

I still have some difficulties understanding how the PBFT consensus algorithm works in Hyperledger Fabric 0.6. Are there any papers which describe the PBFT algorithm in a blockchain environment?
Thank you very much for your answers!
While Hyperledger Fabric v0.6 has been deprecated for quite some time (we are working towards the release of v1.1 shortly, as I write this), we have preserved the archived repository, and the protocol specification contains all that you might want to know about how the system works.
It is really too long a description to add here.
Practical Byzantine Fault Tolerance (PBFT) is a protocol developed to
provide consensus in the presence of Byzantine faults.
The PBFT algorithm has three phases. These phases run in a sequence to
achieve consensus: pre-prepare, prepare, and commit.
The protocol runs in rounds where, in each round, an elected leader
node, called the primary node, handles the communication with the
client. In each round, the protocol progresses through the three
previously mentioned phases. The participants in the PBFT protocol are
called replicas, where one of the replicas becomes primary as a
leader in each round, and the rest of the nodes act as backups.
Pre-prepare:
This is the first phase in the protocol, where the primary node, or
primary, receives a request from the client. The primary node assigns
a sequence number to the request. It then sends the pre-prepare
message with the request to all backup replicas. When the pre-prepare
message is received by the backup replicas, each of them checks a
number of things to ensure the validity of the message:
First, whether the digital signature is valid.
After this, whether the current view number is valid.
Then, that the sequence number of the operation's request message is valid.
Finally, if the digest/hash of the operation's request message is valid.
If all of these elements are valid, then the backup replica accepts
the message. After accepting the message, it updates its local state
and progresses toward the preparation phase.
Prepare:
A prepare message is sent by each backup to all other replicas in the
system. Each backup waits for at least 2F + 1 prepare messages to be
received from other replicas (F is the number of faulty nodes that can
be tolerated). Each backup also checks whether the prepare messages
contain the same view number, sequence number, and message digest
values. If all these
checks pass, then the replica updates its local state and progresses
toward the commit phase.
Commit:
Each replica sends a commit message to all other replicas in the
network. As in the prepare phase, replicas wait for 2F + 1
commit messages to arrive from other replicas. The replicas also check
the view number, sequence number, and message digest values. If they
are valid for 2F + 1 commit messages received from other replicas, then
the replica executes the request, produces a result, and finally,
updates its state to reflect a commit. If there are already some
messages queued up, the replica will execute those requests first
before processing the latest sequence numbers. Finally, the replica
sends the result to the client in a reply message. The client accepts
the result only after receiving 2F + 1 reply messages containing the
same result.
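To make the 2F + 1 rule concrete, here is a small illustrative sketch of the quorum check the prepare and commit phases rely on (not taken from any particular PBFT implementation):

```java
// Illustrative PBFT quorum check: with n = 3F + 1 replicas, a replica proceeds
// to the next phase once it has matching messages from 2F + 1 replicas.
final class PbftQuorum {

    static int quorumSize(int f) {
        // f is the maximum number of Byzantine (faulty) replicas tolerated.
        return 2 * f + 1;
    }

    /** True if enough matching (view, sequence, digest) messages were collected. */
    static boolean hasQuorum(int matchingMessages, int f) {
        return matchingMessages >= quorumSize(f);
    }

    public static void main(String[] args) {
        int f = 1;                           // tolerate one faulty replica (n = 4)
        System.out.println(quorumSize(f));   // 3
        System.out.println(hasQuorum(2, f)); // false: keep waiting
        System.out.println(hasQuorum(3, f)); // true: move to the next phase
    }
}
```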
Reference

Elasticsearch percolation dead slow on AWS EC2

Recently we switched our cluster to EC2 and everything is working great... except percolation :(
We use Elasticsearch 2.2.0.
To reindex (and percolate) our data we use a separate EC2 c3.8xlarge instance (32 cores, 60GB, 2 x 160 GB SSD) and tell our index to include only this node in allocation.
Because we'll distribute it amongst the rest of the nodes later, we use 10 shards, no replicas (just for indexing and percolation).
There are about 22 million documents in the index and 15,000 percolators. The index is a tad smaller than 11GB (and so easily fits into memory).
About 16 PHP processes talk to the REST API doing multi percolate requests with 200 requests in each (we made it smaller for performance reasons; it was 1,000 per request before).
One percolation request (a real one, tapped off of the running PHP processes) takes around 2m20s under load (from the 16 PHP processes). That would have been OK if one of the resources on the EC2 instance were maxed out, but that's the strange thing (see the stats output here, but also seen in htop, iotop and iostat): load, CPU, memory, heap, IO; everything is well (very well) within limits. There doesn't seem to be a shortage of resources, but still, percolation performance is bad.
When we back off the PHP processes and try the percolate request again, it comes in at around 15s. Just to be clear: I don't have a problem with a 2min+ multi percolate request, as long as I know that one of the resources is fully utilized (and I can act upon it by giving it more of what it wants).
So, ok, it's not the usual suspects, let's try different stuff:
To rule out network, coordination, etc. issues, we also did the same request from the node itself (enabling the client) with the same pressure from the PHP processes: no change.
We upped the processors configuration in elasticsearch.yml and restarted the node to fake our way to a higher usage of resources: no change.
We tried tweaking the percolate and get pool size and queue size: no change.
When we looked at the hot threads, we saw UsageTrackingQueryCachingPolicy coming up a lot, so we did as suggested in this issue: no change.
Maybe it's the number of replicas, seeing that Elasticsearch uses those to do searches as well? We upped it to 3 and used more EC2 instances to spread them out: no change.
To determine if we could actually use all resources on EC2, we did stress tests and everything seemed fine, getting it to loads of over 40. Also, IO, memory, etc. showed no issues under high strain.
It could still be the batch size. Under load we tried a batch of just one percolator in a multi percolate request, directly on the data & client node (dedicated to this index), and found that it took 1m50s. When we tried a batch of 200 percolators (still in one multi percolate request) it took 2m02s (which roughly matches the earlier 15s result without pressure).
This last point might be interesting! It seems that it's stuck somewhere for a loooong time and then goes through the percolate phase quite smoothly.
Can anyone make anything out of this? Anything we have missed? We can provide more data if needed.
Have a look at the thread on the Elastic Discuss forum to see the solution.
TLDR;
Use multiple nodes on one big server to get better resource utilization.

Checksum process in HDFS /Hadoop

I could not understand how checksums work in HDFS to identify corrupt blocks while writing/reading a file. Can someone explain it to me in detail?
Have a look at Apache documentation regarding HDFS Architecture.
Data Integrity
It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software.
It works in the following way.
The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace.
When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file.
If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
If the checksum of another DataNode's block matches the checksum in the hidden file, the system will serve those data blocks.
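As an illustration, this verification is transparent to client code: a plain read through the FileSystem API already triggers it, and setVerifyChecksum lets you toggle the default behaviour. The path below is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsChecksumReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt"); // placeholder path
        fs.setVerifyChecksum(true); // default: verify block checksums on read

        // If a block's data does not match its stored checksum, the client library
        // reports the corrupt replica and fetches the block from another DataNode.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) > 0) {
                // process `read` bytes from `buffer`
            }
        }
    }
}
```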
Have a look at the Robustness section too. The explanation would be incomplete without a look at the data replication mechanism.
Each DataNode sends a Heartbeat message to the NameNode periodically.
A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message.
The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more.
DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary.
The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
Example of a replication scenario:
It depends on the configuration of the cluster:
dfs.replication (assume that it is 3)
dfs.namenode.replication.min (assume that it is 1)
In case one DataNode is lost, the NameNode will recognize that a block is under-replicated. The NameNode will then replicate the data blocks until dfs.replication is met.
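As a small illustration (placeholder path; the values are only examples), the per-file replication factor can be inspected and raised through the FileSystem API, and raising it is one of the events that triggers re-replication:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationFactorSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.txt"); // placeholder path

        // Per-file replication factor currently recorded by the NameNode.
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("replication = " + current);

        // Raising it (here to 3) makes the NameNode schedule additional replicas.
        fs.setReplication(file, (short) 3);
    }
}
```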

How to keep system running while replacing the current engine

We have two systems, A and B. System B sends write and read requests, and A returns a response for every read request using the existing engine E_current in A. Each write request causes a modification to the existing engine E_current.
Periodically, E_current will be replaced by E_new. While the renewal process is running, E_new cannot be used yet. Some of the read requests that come in during this renewal process depend on write requests that arrived after the beginning of the renewal process. The new engine, E_new, should also apply to itself every write request that arrived during the renewal process and was already processed by E_current.
After completion of the renewal process, E_current will be evicted and E_new becomes E_current.
Requirements:
Requests are completely concurrent. For example, a write request can come in while a read request is being processed.
Multiple modifications on any engine E could cause an inconsistent state; state consistency must be preserved.
Diagrams:
https://dl.dropbox.com/u/3482709/stack1.jpg
https://dl.dropbox.com/u/3482709/stack2.jpg
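To make the requirement concrete, here is a rough illustrative sketch (hypothetical interfaces and names, not a proposed design): reads keep going to E_current, writes applied during the renewal are recorded, and E_new replays them before it takes over.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the requirement only: E_current keeps serving requests,
// while writes seen during the renewal are recorded so E_new can catch up.
final class EngineSwapSketch {

    interface Engine {
        void applyWrite(String write);
        String read(String query);
    }

    private Engine current;                       // E_current
    private Engine candidate;                     // E_new, being built
    private final List<String> pendingWrites = new ArrayList<>();
    private boolean renewing;

    synchronized void startRenewal(Engine newEngine) {
        candidate = newEngine;
        renewing = true;
    }

    synchronized void handleWrite(String write) {
        current.applyWrite(write);
        if (renewing) {
            pendingWrites.add(write);             // must also reach E_new later
        }
    }

    synchronized String handleRead(String query) {
        return current.read(query);               // reads are served by E_current throughout
    }

    synchronized void finishRenewal() {
        for (String w : pendingWrites) {
            candidate.applyWrite(w);              // replay writes seen during renewal
        }
        pendingWrites.clear();
        current = candidate;                      // E_new becomes E_current
        renewing = false;
    }
}
```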