how HDFS will divide the file in block - hdfs

i know that hdfs divide file in 128 mb block
now suppose a file size of 1034mb is been divided in 128*8 blocks.what will happen 10 mb will it be allocated the whole 128 mb or just 10mb block.
if yes. then rest 118mb gets wasted.
how this issue can be resolved.
I have just started learning hadoop.
please pardon me if my question is naive.
Thanks

Related

Spark writing/reading to/from S3 - Partition Size and Compression

I am doing an experiment to understand which file size behaves best with s3 and [EMR + Spark]
Input data :
Incompressible data: Random Bytes in files
Total Data Size: 20GB
Each folder has varying input file size: From 2MB To 4GB file size.
Cluster Specifications :
1 master + 4 nodes : C3.8xls
--driver-memory 5G \
--executor-memory 3G \
--executor-cores 2 \
--num-executors 60 \
Code :
scala> def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
time: [R](block: => R)R
scala> val inputFiles = time{sc.textFile("s3://bucket/folder/2mb-10240files-20gb/*/*")};
scala> val outputFiles = time {inputFiles.saveAsTextFile("s3://bucket/folder-out/2mb-10240files-20gb/")};
Observations
2MB - 32MB: Most of the time is spent in opening file handles [Not efficient]
64MB till 1GB: Spark itself is launching 320 tasks for all these file sizes, it's no longer the no of files in that bucket with 20GB
data e.g. 512 MB files had 40 files to make 20gb data and could
just have 40 tasks to be completed but instead, there were 320
tasks each dealing with 64MB data.
4GB file size : 0 Bytes outputted [Not able to handle in-memory /Data not even splittable ???]
Questions
Any default setting that forces input size to be dealt with to be 64MB ??
Since the data I am using is random bytes and is already compressed how is it splitting this data further? If it can split this data why is it not able to split file size of 4gb object file
size?
Why is compressed file size increased after uploading via spark? The 2MB compressed input file becomes 3.6 MB in the output bucket.
Since it is not specified, I'm assuming usage of gzip and Spark 2.2 in my answer.
Any default setting that forces input size to be dealt with to be 64MB ??
Yes, there is. Spark is a Hadoop project, and therefore treats S3 to be a block based file system even though it is an object based file system.
So the real question here is: which implementation of S3 file system are you using(s3a, s3n) etc. A similar question can be found here.
Since the data I am using is random bytes and is already compressed how is it splitting this data further? If it can split this data why is it not able to split file size of 4gb object file size?
Spark docs indicate that it is capable of reading compressed files:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/.txt"), and textFile("/my/directory/.gz").
This means that your files were read quite easily and converted to a plaintext string for each line.
However, you are using compressed files. Assuming it is a non-splittable format such as gzip, the entire file is needed for de-compression. You are running with 3gb executors which can satisfy the needs of 4mb-1gb files quite well, but can't handle a file larger than 3gb at once (probably lesser after accounting for overhead).
Some further info can be found in this question. Details of splittable compression types can be found in this answer.
Why is compressed file size increased after uploading via spark?The 2MB compressed input file becomes 3.6 MB in output bucket.
As a corollary to the previous point, this means that spark has de-compressed the RDD while reading as plaintext. While re-uploading, it is no longer compressed. To compress, you can pass a compression codec as a parameter:
sc.saveAsTextFile("s3://path", classOf[org.apache.hadoop.io.compress.GzipCodec])
There are other compression formats available.

Is there a maximum concurrency for AWS s3 multipart uploads?

Referring to the docs, you can specify the number of concurrent connection when pushing large files to Amazon Web Services s3 using the multipart uploader. While it does say the concurrency defaults to 5, it does not specify a maximum, or whether or not the size of each chunk is derived from the total filesize / concurrency.
I trolled the source code and the comment is pretty much the same as the docs:
Set the concurrency level to use when uploading parts. This affects
how many parts are uploaded in parallel. You must use a local file as
your data source when using a concurrency greater than 1
So my functional build looks like this (the vars are defined by the way, this is just condensed for example):
use Aws\Common\Exception\MultipartUploadException;
use Aws\S3\Model\MultipartUpload\UploadBuilder;
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource($file)
->setBucket($bucket)
->setKey($file)
->setConcurrency(30)
->setOption('CacheControl', 'max-age=3600')
->build();
Works great except a 200mb file takes 9 minutes to upload... with 30 concurrent connections? Seems suspicious to me, so I upped concurrency to 100 and the upload time was 8.5 minutes. Such a small difference could just be connection and not code.
So my question is whether or not there's a concurrency maximum, what it is, and if you can specify the size of the chunks or if chunk size is automatically calculated. My goal is to try to get a 500mb file to transfer to AWS s3 within 5 minutes, however I have to optimize that if possible.
Looking through the source code, it looks like 10,000 is the maximum concurrent connections. There is no automatic calculations of chunk sizes based on concurrent connections but you could set those yourself if needed for whatever reason.
I set the chunk size to 10 megs, 20 concurrent connections and it seems to work fine. On a real server I got a 100 meg file to transfer in 23 seconds. Much better than the 3 1/2 to 4 minute it was getting in the dev environments. Interesting, but thems the stats, should anyone else come across this same issue.
This is what my builder ended up being:
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource($file)
->setBucket($bicket)
->setKey($file)
->setConcurrency(20)
->setMinPartSize(10485760)
->setOption('CacheControl', 'max-age=3600')
->build();
I may need to up that max cache but as of yet this works acceptably. The key was moving the processor code to the server and not relying on the weakness of my dev environments, no matter how powerful the machine is or high class the internet connection is.
We can abort the process during upload and can halt all the operations and abort the upload at any instance. We can set Concurrency and minimum part size.
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource('/path/to/large/file.mov')
->setBucket('mybucket')
->setKey('my-object-key')
->setConcurrency(3)
->setMinPartSize(10485760)
->setOption('CacheControl', 'max-age=3600')
->build();
try {
$uploader->upload();
echo "Upload complete.\n";
} catch (MultipartUploadException $e) {
$uploader->abort();
echo "Upload failed.\n";
}

Not enough space to cache rdd in memory warning

I am running a spark job, and I got Not enough space to cache rdd_128_17000 in memory warning. However, in the attached file, it obviously saying only 90.8 G out of 719.3 G is used. Why is that? Thanks!
15/10/16 02:19:41 WARN storage.MemoryStore: Not enough space to cache rdd_128_17000 in memory! (computed 21.4 GB so far)
15/10/16 02:19:41 INFO storage.MemoryStore: Memory use = 4.1 GB (blocks) + 21.2 GB (scratch space shared across 1 thread(s)) = 25.2 GB. Storage limit = 36.0 GB.
15/10/16 02:19:44 WARN storage.MemoryStore: Not enough space to cache rdd_129_17000 in memory! (computed 9.4 GB so far)
15/10/16 02:19:44 INFO storage.MemoryStore: Memory use = 4.1 GB (blocks) + 30.6 GB (scratch space shared across 1 thread(s)) = 34.6 GB. Storage limit = 36.0 GB.
15/10/16 02:25:37 INFO metrics.MetricsSaver: 1001 MetricsLockFreeSaver 339 comitted 11 matured S3WriteBytes values
15/10/16 02:29:00 INFO s3n.MultipartUploadOutputStream: uploadPart /mnt1/var/lib/hadoop/s3/959a772f-d03a-41fd-bc9d-6d5c5b9812a1-0000 134217728 bytes md5: qkQ8nlvC8COVftXkknPE3A== md5hex: aa443c9e5bc2f023957ed5e49273c4dc
15/10/16 02:38:15 INFO s3n.MultipartUploadOutputStream: uploadPart /mnt/var/lib/hadoop/s3/959a772f-d03a-41fd-bc9d-6d5c5b9812a1-0001 134217728 bytes md5: RgoGg/yJpqzjIvD5DqjCig== md5hex: 460a0683fc89a6ace322f0f90ea8c28a
15/10/16 02:42:20 INFO metrics.MetricsSaver: 2001 MetricsLockFreeSaver 339 comitted 10 matured S3WriteBytes values
This is likely to be caused by the configuration of spark.storage.memoryFraction being too low. Spark will only use this fraction of the allocated memory to cache RDDs.
Try either:
increasing the storage fraction
rdd.persist(StorageLevel.MEMORY_ONLY_SER) to reduce memory usage by serializing the RDD data
rdd.persist(StorageLevel.MEMORY_AND_DISK) to partially persist onto disk if memory limits are reached.
This could be due to the following issue if you're loading lots of avro files:
https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCANx3uAiJqO4qcTXePrUofKhO3N9UbQDJgNQXPYGZ14PWgfG5Aw#mail.gmail.com%3E
With a PR in progress at:
https://github.com/databricks/spark-avro/pull/95
I have a Spark-based batch application (a JAR with main() method, not written by me, I'm not a Spark expert) that I run in local mode without spark-submit, spark-shell, or spark-defaults.conf. When I tried to use IBM JRE (like one of my customers) instead of Oracle JRE (same machine and same data), I started getting those warnings.
Since the memory store is a fraction of the heap (see the page that Jacob suggested in his comment), I checked the heap size: IBM JRE uses a different strategy to decide default heap size and it was too small, so I simply added appropriate -Xms and -Xmx params and the problem disappeared: now the batch works fine both with IBM and Oracle JRE.
My usage scenario is not typical, I know, however I hope this can help someone.

HDFS block size and replication

I have a file called employee.txt.The size of that file is 66 MB.
We know that default HDFS block size is 64 MB.
So, There is going to be two blocks for employee.txt as the size of it is greater than 64 MB
employee.txt = block A + block B
So, There are two blocks.
For block A it is 64 MB,So there is no problem at all ..
For block B it is 2 MB, So what will happen to the reamining 62 MB of block B.Does it kept as empty?
I would like to know what happens to that unoccupied space for block B.
HDFS blocks are logical blocks that sits on Physical file system. So when your second block uses just 8 MB, it only consumes that much space from the underlying file system, rest of the disk space will be left free.

Reading binary files, Linux Buffer Cache

I am busy writing something to test the read speeds for disk IO on Linux.
At the moment I have something like this to read the files:
Edited to change code to this:
const int segsize = 1048576;
char buffer[segsize];
ifstream file;
file.open(sFile.c_str());
while(file.readsome(buffer,segsize)) {}
For foo.dat, which is 150GB, the first time I read it in, it takes around 2 minutes.
However if I run it within 60 seconds of the first run, it will then take around 3 seconds to run. How is that possible? Surely the only place that could be read from that fast is the buffer cache in RAM, and the file is too big to fit in RAM.
The machine has 50GB of ram, and the drive is a NFS mount with all the default settings. Please let me know where I could look to confirm that this file is actually being read at this speed? Is my code wrong? It appears to take a correct amount of time the first time the file is read.
Edited to Add:
Found out that my files were only reading up to a random point. I've managed to fix this by changing segsize down to 1024 from 1048576. I have no idea why changing this allows the ifstream to read the whole file instead of stopping at a random point.
Thanks for the answers.
On Linux, you can do this for a quick troughput test:
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.863904 s, 243 MB/s
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.0748273 s, 2.8 GB/s
$ sync && echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.919688 s, 228 MB/s
echo 3 > /proc/sys/vm/drop_caches will flush the cache properly
in_avail doesn't give the length of the file, but a lower bound of what is available (especially if the buffer has already been used, it return the size available in the buffer). Its goal is to know what can be read without blocking.
unsigned int is most probably unable to hold a length of more than 4GB, so what is read can very well be in the cache.
C++0x Stream Positioning may be interesting to you if you are using large files
in_avail returns the lower bound of how much is available to read in the streams read buffer, not the size of the file. To read the whole file via the stream, just keep
calling the stream's readsome() method and checking how much was read with the gcount() method - when that returns zero, you have read everthing.
It appears to take a correct amount of time the first time the file is read.
On that first read, you're reading 150GB in about 2 minutes. That works out to about 10 gigabits per second. Is that what you're expecting (based on the network to your NFS mount)?
One possibility is that the file could be at least in part sparse. A sparse file has regions that are truly empty - they don't even have disk space allocated to them. These sparse regions also don't consume much cache space, and so reading the sparse regions will essentially only require time to zero out the userspace pages they're being read into.
You can check with ls -lsh. The first column will be the on-disk size - if it's less than the file size, the file is indeed sparse. To de-sparse the file, just write to every page of it.
If you would like to test for true disk speeds, one option would be to use the O_DIRECT flag to open(2) to bypass the cache. Note that all IO using O_DIRECT must be page-aligned, and some filesystems do not support it (in particular, it won't work over NFS). Also, it's a bad idea for anything other than benchmarking. See some of Linus's rants in this thread.
Finally, to drop all caches on a linux system for testing, you can do:
echo 3 > /proc/sys/vm/drop_caches
If you do this on both client and server, you will force the file out of memory. Of course, this will have a negative performance impact on anything else running at the time.