In Hadoop HDFS, how many data nodes does a 1 GB file use for storage?

I have a file of 1 GB to be stored on the HDFS file system. I have a cluster of 10 data nodes and a namenode. Is there any calculation the Namenode uses (not counting replicas) to pick a particular number of data nodes for storing the file? Or is there a parameter we can configure for file storage? If so, what is the default number of datanodes Hadoop uses to store the file if it is not specifically configured?
I want to know whether it uses all the datanodes in the cluster or only a specific number of them.
Let's assume the HDFS block size is 64 MB and that free space exists on all the datanodes.
Thanks in advance.

If the configured block size is 64 MB and you have a 1 GB file, the file size is 1024 MB.
So the blocks needed will be 1024 / 64 = 16 blocks, which means your 1 GB file is split into 16 blocks.
Now, let's say you have a 10 node cluster and the default replication factor of 3. Each block is then stored on 3 different nodes, and the replicas are spread across the cluster, so with 10 datanodes most or all of them will typically end up holding some of the blocks. The total number of block replicas for your 1 GB file is 16 * 3 = 48 blocks.
If one block is 64 MB, then the total space your 1 GB file consumes is
64 * 48 = 3072 MB.
Hope that clears your doubt.

In the second generation of Hadoop (Hadoop 2):
If the configured block size is 128 MB and you have a 1 GB file, the file size is 1024 MB.
So the blocks needed will be 1024 / 128 = 8 blocks, which means your 1 GB file is split into 8 blocks.
Now, let's say you have a 10 node cluster and the default replication factor of 3. Each block is stored on 3 different nodes, so the total number of block replicas for your 1 GB file is 8 * 3 = 24 blocks.
If one block is 128 MB, then the total space your 1 GB file consumes is
128 * 24 = 3072 MB.
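The same arithmetic as a quick sketch in plain Scala (just the calculation above, not an HDFS API call):
// 1 GB file, default replication factor 3; try both 64 MB and 128 MB block sizes.
val fileSizeMB  = 1024
val replication = 3
for (blockSizeMB <- Seq(64, 128)) {
  val blocks        = math.ceil(fileSizeMB.toDouble / blockSizeMB).toInt
  val blockReplicas = blocks * replication
  val totalMB       = blockReplicas * blockSizeMB
  println(s"$blockSizeMB MB blocks -> $blocks blocks, $blockReplicas replicas, $totalMB MB stored")
}
// 64 MB blocks  -> 16 blocks, 48 replicas, 3072 MB stored
// 128 MB blocks -> 8 blocks, 24 replicas, 3072 MB stored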

Related

What's the meaning of the Ethereum Parity console output lines?

Parity doesn't seem to have any documentation on what its console output means. At least none that I've found, which admittedly doesn't mean a whole lot. Can anyone give me a breakdown of the meaning of the following line?
2018-03-09 00:05:12 UTC Syncing #4896969 61ee…bdad 2 blk/s 508 tx/s 16 Mgas/s 645+ 1 Qed #4897616 17/25 peers 4 MiB chain 135 MiB db 42 MiB queue 5 MiB sync RPC: 0 conn, 0 req/s, 182 µs
Thanks.
Why document when you can just read code? (bleh)
2018-03-09 00:05:12 UTC(1) Syncing #4896969(2) 61ee…bdad(3) 2 blk/s(4) 508 tx/s(5) 16 Mgas/s(6) 645+(7) 1(8) Qed #4897616(9) 17/25 peers(10) 4 MiB chain(11) 135 MiB db(12) 42 MiB queue(13) 5 MiB sync(14) RPC: 0 conn(15), 0 req/s(16), 182 µs(17)
1. Timestamp
2. Best block number (latest verified block number)
3. Best block hash
4. Blocks downloaded per second
5. Transactions downloaded per second
6. Millions of gas processed per second
7. Unverified queue size
8. Verified queue size
9. Latest block number
10. Number of active peer nodes/number of total peer nodes
11. Blockchain header cache size
12. Blockchain state cache size
13. Queue cache size
14. Node sync metadata cache size
15. Number of open RPC sessions to your node
16. RPC requests per second
17. Approximate roundtrip ping
Now the answer to this question is also included in Parity's FAQ, which provides a comprehensive explanation of the different command line output:
What does Parity's command line output mean?
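If you want to pull a few of those counters out programmatically, here is a minimal, unofficial sketch (the regex is tuned to the sample line above and the layout can vary between Parity versions):
// Unofficial sketch: parse a Parity sync line shaped like the example above.
val Line = """.*Syncing #(\d+) \S+\s+(\d+) blk/s (\d+) tx/s (\d+) Mgas/s.*?(\d+)/(\d+) peers.*""".r
val sample = "2018-03-09 00:05:12 UTC Syncing #4896969 61ee…bdad 2 blk/s 508 tx/s 16 Mgas/s 645+ 1 Qed #4897616 17/25 peers 4 MiB chain 135 MiB db 42 MiB queue 5 MiB sync RPC: 0 conn, 0 req/s, 182 µs"
sample match {
  case Line(best, blk, tx, mgas, active, total) =>
    println(s"best block #$best, $blk blk/s, $tx tx/s, $mgas Mgas/s, $active/$total peers")
  case _ =>
    println("line did not match")
}
// best block #4896969, 2 blk/s, 508 tx/s, 16 Mgas/s, 17/25 peers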

Spark writing/reading to/from S3 - Partition Size and Compression

I am doing an experiment to understand which file size behaves best with S3 and [EMR + Spark].
Input data :
Incompressible data: Random Bytes in files
Total Data Size: 20GB
Each folder has varying input file size: From 2MB To 4GB file size.
Cluster Specifications :
1 master + 4 nodes : C3.8xls
--driver-memory 5G \
--executor-memory 3G \
--executor-cores 2 \
--num-executors 60 \
Code :
scala> def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
time: [R](block: => R)R
scala> val inputFiles = time{sc.textFile("s3://bucket/folder/2mb-10240files-20gb/*/*")};
scala> val outputFiles = time {inputFiles.saveAsTextFile("s3://bucket/folder-out/2mb-10240files-20gb/")};
Observations
2 MB - 32 MB: Most of the time is spent opening file handles [not efficient].
64 MB to 1 GB: Spark launches 320 tasks for all of these file sizes; the task count is no longer tied to the number of files making up the 20 GB of data. For example, the 512 MB case had 40 files forming the 20 GB, so 40 tasks would have been enough, but instead there were 320 tasks, each dealing with 64 MB of data.
4 GB file size: 0 bytes output [not able to handle in memory / data not even splittable???]
Questions
Is there any default setting that forces the input to be processed in 64 MB chunks?
Since the data I am using is random bytes and is already compressed, how is Spark splitting it further? If it can split this data, why is it not able to split a 4 GB object?
Why does the compressed file size increase after uploading via Spark? The 2 MB compressed input file becomes 3.6 MB in the output bucket.
Since it is not specified, I'm assuming gzip and Spark 2.2 in my answer.
Is there any default setting that forces the input to be processed in 64 MB chunks?
Yes, there is. Spark builds on the Hadoop filesystem API, and therefore treats S3 as a block-based file system even though it is an object store.
So the real question here is: which S3 file system implementation are you using (s3a, s3n, etc.)? A similar question can be found here.
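As a rough, untested sketch (assuming the s3a or s3n connector; these are standard Hadoop properties, but defaults vary by Hadoop version), the split size Spark sees can be nudged by changing the block size the S3 filesystem reports before reading:
// Sketch: raise the block size the S3 connector reports, which drives the split size.
// fs.s3n.block.size commonly defaults to 64 MB, which would line up with the
// 320 tasks of 64 MB each observed above.
sc.hadoopConfiguration.set("fs.s3a.block.size", (256 * 1024 * 1024).toString)  // s3a
sc.hadoopConfiguration.set("fs.s3n.block.size", (256 * 1024 * 1024).toString)  // s3n
// Hypothetical path for the 512 MB / 40 file data set mentioned in the question.
val bigFiles = sc.textFile("s3://bucket/folder/512mb-40files-20gb/*/*")
println(bigFiles.partitions.length)  // should drop toward ~80 (20 GB / 256 MB splits)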
Since the data I am using is random bytes and is already compressed, how is Spark splitting it further? If it can split this data, why is it not able to split a 4 GB object?
Spark docs indicate that it is capable of reading compressed files:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
This means that your files were read quite easily and converted to a plaintext string for each line.
However, you are using compressed files. Assuming it is a non-splittable format such as gzip, the entire file is needed for decompression. You are running with 3 GB executors, which can satisfy the needs of the 4 MB - 1 GB files quite well, but cannot handle a file larger than 3 GB at once (probably less after accounting for overhead).
Some further info can be found in this question. Details of splittable compression types can be found in this answer.
Why does the compressed file size increase after uploading via Spark? The 2 MB compressed input file becomes 3.6 MB in the output bucket.
As a corollary to the previous point, this means that Spark has decompressed the data while reading it as plaintext. When it is re-uploaded, it is no longer compressed. To compress, pass a compression codec as a parameter when saving the RDD:
inputFiles.saveAsTextFile("s3://path", classOf[org.apache.hadoop.io.compress.GzipCodec])
There are other compression formats available.
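For example, bzip2 is one of the few splittable compression codecs in Hadoop, so a sketch like the following (output path is hypothetical) keeps the data compressed while still letting later jobs split it:
import org.apache.hadoop.io.compress.BZip2Codec
// Write compressed but splittable output instead of gzip.
inputFiles.saveAsTextFile("s3://bucket/folder-out/bzip2-20gb/", classOf[BZip2Codec])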

AWS volume larger than specified

I created an AWS instance with an attached volume of size 500 GB. Everything looks good in the AWS console, which shows the volume as 500 GB (it is /dev/xvdf). But when I ssh into the instance and look at the drive, I see it is actually 540 GB instead of 500 GB. Why is this? Where did the extra 40 GB come from?
fdisk output:
Disk /dev/xvdf: 536.9 GB, 536870912000 bytes
255 heads, 63 sectors/track, 65270 cylinders, total 1048576000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
df -h (uses 1024):
/dev/xvdf 493G 110G 358G 24% /data0
df -H (uses 1000):
/dev/xvdf 529G 118G 384G 24% /data0
Your volume size is correct.
536,870,912,000 ÷ 1,024 ÷ 1,024 ÷ 1,024 = 500 GiB.
1 GiB ("gibibyte," or giga-binary byte) is 2^30 bytes. EBS volume sizes are in GiB, while fdisk divides by powers of 1,000, which is why it reports 536.9 GB.
I might be wrong, but "/dev/xvdf" shows that AWS is using some form of Xen, be that XenServer or some other flavor.
What happens is:
Xen calculates how much space is actually needed so that after you format the volume to "ext4" or any other FS you will have 500 GB, or as close to it as possible.
Anyway, this is in my experience.

What happens to the HDFS block size in case of a smaller file size?

I have read about this but am still confused:
what happens if a file's size is less than the block size?
If the file size is 1 MB, will it consume 64 MB or only 1 MB?
It will consume only 1 MB. The remaining space can be used to store some other file's blocks.
Example: consider a datanode with 128 MB of total space and a block size of 64 MB.
Then HDFS can store two 64 MB blocks,
or 128 blocks of 1 MB each,
or any mix of blocks that adds up to the datanode's 128 MB.
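The same bookkeeping as a quick sketch, using the hypothetical numbers from the example above:
// One datanode with 128 MB of space and a 64 MB block size.
val datanodeMB  = 128
val blockSizeMB = 64
val smallFileMB = 1
// A 1 MB file occupies one HDFS block logically but only 1 MB physically,
// so the node can hold 128 such files rather than just 2.
println(datanodeMB / blockSizeMB)   // 2   full 64 MB blocks fit
println(datanodeMB / smallFileMB)   // 128 one-megabyte files fit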

HDFS block size and replication

I have a file called employee.txt. The size of that file is 66 MB.
We know that the default HDFS block size is 64 MB.
So there are going to be two blocks for employee.txt, as its size is greater than 64 MB:
employee.txt = block A + block B
So there are two blocks.
For block A it is 64 MB, so there is no problem at all.
For block B it is 2 MB, so what will happen to the remaining 62 MB of block B? Is it kept empty?
I would like to know what happens to that unoccupied space for block B.
HDFS blocks are logical blocks that sit on top of the physical file system. So when your second block holds just 2 MB, it only consumes that much space on the underlying file system; the rest of the disk space is left free.
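A quick sketch of the numbers for employee.txt (ignoring replication):
// employee.txt: 66 MB with a 64 MB block size.
val fileSizeMB  = 66
val blockSizeMB = 64
val blocks      = math.ceil(fileSizeMB.toDouble / blockSizeMB).toInt  // 2 logical blocks
val lastBlockMB = fileSizeMB % blockSizeMB                            // 2 MB actually stored in block B
// Per replica, disk usage is 64 MB + 2 MB = 66 MB; the other 62 MB of block B
// is never allocated on the underlying file system.
println(s"$blocks blocks, block B holds $lastBlockMB MB, disk used per replica = $fileSizeMB MB")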