Hadoop 3.0 erasure coding - determining the number of acceptable node failures? - hdfs

In hadoop 2.0 the default replication factor is 3. And the number of node failures acceptable was 3-1=2.
So on a 100 node cluster if a file was divided in to say 10 parts (blocks), with replication factor of 3 the total storage blocks required are 30. And if any 3 nodes containing a block X and it's replicas failed then the file is not recoverable. Even if the cluster had 1000 nodes or the file was split in to 20 parts, failure of 3 nodes on the cluster can still be disastrous for the file.
Now stepping into hadoop 3.0.
With erasure coding, as Hadoop says it provides the same durability with 50% efficient storage. And based on how Reed-Solomon method works (that is for k data blocks and n parity blocks, at least k of the (k+n) blocks should be accessible for the file to be recoverable/readable)
So for the same file above - there are 10 data blocks and to keep the data efficiency to 50%, 5 parity blocks can be added. So from the 10+5 blocks, at least any 10 blocks should be available for the file to be accessible. And on the 100 node cluster if each of the 15 blocks are stored on a separate node, then as you can see, a total of 5 node failures is acceptable. Now storing the same file (ie 15 blocks) on a 1000 node cluster would not make any difference w.r.t the number of acceptable node failures - it's still 5.
But the interesting part here is - if the same file (or another file) was divided into 20 blocks and then 10 parity block were added, then for the total of 30 blocks to be saved on the 100 node cluster, the acceptable number of node failures is 10.
The point I want to make here is -
in hadoop 2 the number of acceptable node failures is ReplicationFactor-1 and is clearly based on the replication factor. And this is a cluster wide property.
but in hadoop 3, say if the storage efficiency was fixed to 50%, then the number of acceptable node failures seems to be different for different files based on the number of blocks it is divided in to.
So can anyone comment if the above inference is correct? And how any clusters acceptable node failures is determined?
(And I did not want to make it complex above, so did not discuss the edge case of a file with one block only. But the algorithm, I guess, will be smart enough to replicate it as is or with parity data so that the data durability settings are guaranteed.)
Edit:
This question is part of a series of questions I have on EC - Others as below -
Hadoop 3.0 erasure coding: impact on MR jobs performance?

Using your numbers for Hadoop 2.0, each block of data is stored on 3 different nodes. As long as any one of the 3 nodes has not failed to read a specific block, that block of data is recoverable.
Again using your numbers, for Hadoop 3.0, every set of 10 blocks of data and 5 blocks of parities are stored on 15 different nodes. So the data space requirement is reduced to 50% overhead, but the number of nodes the data and parities are written to have increased by a factor of 5, from 3 nodes for Hadoop 2.0 to 15 nodes for Hadoop 3.0. Since the redundancy is based on Reed Solomon erasure correction, then as long as any 10 of the 15 nodes have not failed to read a specific set of blocks, that set of blocks is recoverable (maximum allowed failure for a set of blocks is 5 nodes). If it's 20 blocks of data and 10 blocks of parities, then the data and parity blocks are distributed on 30 different nodes (maximum allowed failure for a set of blocks is 10 nodes).
For a cluster-wide view, failure can occur if more than n-k nodes fail, regardless of the number of nodes, since there's some chance that a set of data and parity blocks will happen to include all of the failing nodes. To avoid this, n should be increased along with the number of nodes in a cluster. For 100 nodes, each set could be 80 blocks of data, 20 blocks of parity (25% redundancy). Note 100 nodes would be unusually large. The example from this web page is 14 nodes RS(14,10) (for each set: 10 blocks of data, 4 blocks of parity).
https://hadoop.apache.org/docs/r3.0.0
with your numbers, the cluster size would be 15 (10+5) or 30 (20+10) nodes.
For a file with 1 block or less than k blocks, n-k parity blocks would still be needed to ensure that it takes more than n-k nodes to fail before a failure occurs. For Reed Solomon encoding, this could be done by emulating leading blocks of zeroes for the "missing" blocks.
I thought I'd add some probability versus number of nodes in a cluster.
Assume node failure rate is 1%.
15 nodes, 10 for data, 5 for parities, using comb(a,b) for combinations of a things b at a time:
Probability of exactly x node failures is:
6 => ((.01)^6) ((.99)^9) (comb(15,6)) ~= 4.572 × 10^-9
7 => ((.01)^7) ((.99)^8) (comb(15,7)) ~= 5.938 × 10^-11
8 => ((.01)^8) ((.99)^7) (comb(15,8)) ~= 5.998 × 10^-13
...
Probability of 6 or more failures ~= 4.632 × 10^-9
30 nodes, 20 for data, 10 for parities
Probability of exactly x node failures is:
11 => ((.01)^11) ((.99)^19) (comb(30,11)) ~= 4.513 × 10^-15
12 => ((.01)^12) ((.99)^18) (comb(30,12)) ~= 7.218 × 10^-17
13 => ((.01)^13) ((.99)^17) (comb(30,13)) ~= 1.010 × 10^-18
14 => ((.01)^14) ((.99)^16) (comb(30,14)) ~= 1.238 × 10^-20
Probability of 11 or more failures ~= 4.586 × 10^-15
To show that the need for parity overhead decreases with number of nodes, consider the extreme case of 100 nodes, 80 for data, 20 for parties (25% redundancy):
Probability of exactly x node failures is:
21 => ((.01)^21) ((.99)^79) (comb(100,21)) ~= 9.230 × 10^-22
22 => ((.01)^22) ((.99)^78) (comb(100,22)) ~= 3.348 × 10^-23
23 => ((.01)^23) ((.99)^77) (comb(100,23)) ~= 1.147 × 10^-24
Probability of 21 or more failures ~= 9.577 × 10^-22

Related

Learning about multithreading. Tried to make a prime number finder

I'm studying for a uni project and one of the requirements is to include multithreading. I decided to make a prime number finder and - while it works - it's rather slow. My best guess is that this has to do with the amount of threads I'm creating and destroying.
My approach was to take the range of primes that are below N, and distribute these evenly across M threads (where M = number of cores (in my case 8)), however these threads are being created and destroyed every time N increases.
Pseudocode looks like this:
for each core
# new thread
for i in (range / numberOfCores) * currentCore
if !possiblePrimeIsntActuallyPrime
if possiblePrime % i == 0
possiblePrimeIsntActuallyPrime = true
return
else
return
Which does work, but 8 threads being created for every possible prime seems to be slowing the system down.
Any suggestions on how to optimise this further?
Use thread pooling.
Create 8 threads and store them in an array. Feed it new data each time one ends and start it again. This will prevent them from having to be created and destroyed each time.
Also, when calculating your range of numbers to check, only check up to ceil(sqrt(N)) as anything after that is guaranteed to either not go into it or the other corresponding factor has already been checked. i.e. ceil(sqrt(24)) is 5.
Once you check 5 you don't need to check anything else because 6 goes into 24 4 times and 4 has been checked, 8 goes into it 3 times and 3 has been checked, etc.

What's the meaning of the ethereum Parity console output lines?

Parity doesn't seem to have any documentation on what it's console output means. At least none that I've found which admittedly doesn't mean a whole lot. Can anyone give me a breakdown of the meaning of the following line?
2018-03-09 00:05:12 UTC Syncing #4896969 61ee…bdad 2 blk/s 508 tx/s 16 Mgas/s 645+ 1 Qed #4897616 17/25 peers 4 MiB chain 135 MiB db 42 MiB queue 5 MiB sync RPC: 0 conn, 0 req/s, 182 µs
Thanks.
Why document when you can just read code? (bleh)
2018-03-09 00:05:12 UTC(1) Syncing #4896969(2) 61ee…bdad(3) 2 blk/s(4) 508 tx/s(5) 16 Mgas/s(6) 645+(7) 1(8) Qed #4897616(9) 17/25 peers(10) 4 MiB chain(11) 135 MiB db(12) 42 MiB queue(13) 5 MiB sync(14) RPC: 0 conn(15), 0 req/s(16), 182 µs(17)
Timestamp
Best block number (latest verified block number)
Best block hash
Blocks downloaded per second
Transactions downloaded per second
Millions of gas processed per second
Unverified queue size
Verified queue size
Latest block number
Number of active peer nodes/number of total peer nodes
Blockchain header cache size
Blockchain state cache size
Queue cache size
Node sync metadata cache size
Number of open RPC sessions to your node
RPC requests per second
Approximate roundtrip ping
Now the answer to this question is also included in Parity's FAQ. It provides a comprehensive explanation of different command line output:
What does Parity's command line output mean?

Aerospike error: All batch queues are full

I am running an Aerospike cluster in Google Cloud. Following the recommendation on this post, I updated to the last version (3.11.1.1) and re-created all servers. In fact, this change cause my 5 servers to operate in a much lower CPU load (it was around 75% load before, now it is on 20%, as show in the graph bellow:
Because of this low load, I decided to reduce the cluster size to 4 servers. When I did this, my application started to receive the following error:
All batch queues are full
I found this discussion about the topic, recommending to change the parameters batch-index-threads and batch-max-unused-buffers with the command
asadm -e "asinfo -v 'set-config:context=service;batch-index-threads=NEW_VALUE'"
I tried many combinations of values (batch-index-threads with 2,4,8,16) and none of them solved the problem, and also changing the batch-index-threads param. Nothing solves my problem. I keep receiving the All batch queues are full error.
Here is my aerospace.conf relevant information:
service {
user root
group root
paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
paxos-recovery-policy auto-reset-master
pidfile /var/run/aerospike/asd.pid
service-threads 32
transaction-queues 32
transaction-threads-per-queue 4
batch-index-threads 40
proto-fd-max 15000
batch-max-requests 30000
replication-fire-and-forget true
}
I use 300GB SSD disks on these servers.
A quick note which may or may not pertain to you:
A common mistake we have seen in the past is that developers decide to use 'batch get' as a general purpose 'get' for single and multiple record requests. The single record get will perform better for single record requests.
It's possible that you are being constrained by the network between the clients and servers. Reducing from 5 to 4 nodes reduced the aggregate pipe. In addition, removing a node will start cluster migrations which adds additional network load.
I would look at the batch-max-buffer-per-queue config parameter.
Maximum number of 128KB response buffers allowed in each batch index
queue. If all batch index queues are full, new batch requests are
rejected.
In conjunction with raising this value from the default of 255 you will want to also raise the batch-max-unused-buffers to batch-index-threads x batch-max-buffer-per-queue + 1 (at least). If you do not do that new buffers will be created and destroyed constantly, as the amount of free (unused) buffers is smaller than the ones you're using. The moment the batch response is served the system will strive to trim the buffers down to the max unused number. You will see this reflected in the batch_index_created_buffers metric constantly rising.
Be aware that you need to have enough DRAM for this. For example if you raise the batch-max-buffer-per-queue to 320 you will consume
40 (`batch-index-threads`) x 320 (`batch-max-buffer-per-queue`) x 128K = 1600MB
For the sake of performance the batch-max-unused-buffers should be set to 13000 which will have a max memory consumption of 1625MB (1.59GB) per-node.

In Hadoop HDFS, how many data nodes a 1GB file uses to be stored?

I have a file of 1GB size to be stored on HDFS file system. I am having a cluster setup of 10 data nodes and a namenode. Is there any calculation that the Namenode uses (not for replicas) a particular no of data nodes for the storage of the file? Or Is there any parameter that we can configure to use for a file storage? If so, what is the default no of datanodes that Hadoop uses to store the file if it is not specifically configured?
I want to know if it uses all the datanodes of the cluster or only specific no of datanodes.
Let's consider the HDFS block size is 64MB and free space is also existing on all the datanodes.
Thanks in advance.
If the configured block size is 64 MB, and you have a 1 GB file which means the file size is 1024 MB.
So the blocks needed will be 1024/64 = 16 blocks, which means 1 Datanode will consume 16 blocks to store your 1 GB file.
Now, let's say that you have a 10 nodes cluster then the default replica is 3, that means your 1 GB file will be stored on 3 different nodes. So, the blocks acquired by your 1 GB file is -> *16 * 3 =48 blocks*.
If your one block is of 64 MB, then total size your 1 GB file consumed is ->
*64 * 48 = 3072 MB*.
Hope that clears your doubt.
In Second(2nd) Generation of Hadoop
If the configured block size is 128 MB, and you have a 1 GB file which means the file size is 1024 MB.
So the blocks needed will be 1024/128 = 8 blocks, which means 1 Datanode will contain 8 blocks to store your 1 GB file.
Now, let's say that you have a 10 nodes cluster then the default replica is 3, that means your 1 GB file will be stored on 3 different nodes. So, the blocks acquired by your
1 GB file is -> *8 * 3 =24 blocks*.
If your one block is of 128 MB, then total size your 1 GB file consumed is -
*128 * 24 = 3072 MB*.

Is a process can send to himself data? Using MPICH2

I have an upper triangular matrix and the result vector b.
My program need to solve the linear system:
Ax = b
using the pipeline method.
And one of the constraints is that the number of process is smaller than the number of
the equations (let's say it can be from 2 to numberOfEquations-1).
I don't have the code right now, I'm thinking about the pseudo code..
My Idea was that one of the processes will create the random upper triangular matrix (A)
the vector b.
lets say this is the random matrix:
1 2 3 4 5 6
0 1 7 8 9 10
0 0 1 12 13 14
0 0 0 1 16 17
0 0 0 0 1 18
0 0 0 0 0 1
and the vector b is [10 5 8 9 10 5]
and I have a smaller amount of processes than the number of equations (lets say 2 processes)
so what I thought is that some process will send to each process line from the matrix and the relevant number from vector b.
so the last line of the matrix and the last number in vector b will be send to
process[numProcs-1] (here i mean to the last process (process 1) )
than he compute the X and sends the result to process 0.
Now process 0 need to compute the 5 line of the matrix and here i'm stuck..
I have the X that was computed by process 1, but how can the process can send to himself
the next line of the matrix and the relevant number from vector b that need to be computed?
Is it possible? I don't think it's right to send to "myself"
Yes, MPI allows a process to send data to itself but one has to be extra careful about possible deadlocks when blocking operations are used. In that case one usually pairs a non-blocking send with blocking receive or vice versa, or one uses calls like MPI_Sendrecv. Sending a message to self usually ends up with the message simply being memory-copied from the source buffer to the destination one with no networking or other heavy machinery involved.
And no, communicating with self is not necessary a bad thing. The most obvious benefit is that it makes the code more symmetric as it removes/reduces the special logic needed to handle self-interaction. Sending to/receiving from self also happens in most collective communication calls. For example, MPI_Scatter also sends part of the data to the root process. To prevent some send-to-self cases that unnecessarily replicate data and decrease performance, MPI allows in-place mode (MPI_IN_PLACE) for most communication-related collectives.
Is it possible? I don't think it's right to send to "myself"
Sure, it is possible to communicate with oneself. There is even a communicator for it: MPI_COMM_SELF. Talking to yourself is not too uncommon.
Your setup sounds like you would rather use MPI collectives. Have a look at MPI_Scatter and MPI_Gather and see if they don't provide you with the functionality, you are looking for.