Problem: What is an input split?
How is the input split calculated in MapReduce v1?
Is the input split the same as the HDFS block size?
Each input split is generally equal in size to an HDFS block. For example, for a file of 1 GB there will be 16 input splits if the block size is 64 MB. However, the split size can be configured to be smaller or larger than the HDFS block size. In the general case, input splits are calculated by FileInputFormat.
The input split size is calculated in FileInputFormat as:
splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
where minSize is mapred.min.split.size and maxSize is mapred.max.split.size.
Some examples:
mapred.min.split.size | mapred.max.split.size    | dfs.block.size  | Split size
1 (default)           | Long.MAX_VALUE (default) | 64 MB (default) | 64 MB
1 (default)           | Long.MAX_VALUE (default) | 128 MB          | 128 MB
128 MB                | Long.MAX_VALUE (default) | 64 MB           | 128 MB
1 (default)           | 32 MB                    | 64 MB           | 32 MB
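As a quick check, here is a minimal Python sketch of that formula, reproducing the rows of the table above (the property values are simply passed in as numbers; this is an illustration, not Hadoop code):

MB = 1024 * 1024
LONG_MAX = 2**63 - 1

def split_size(min_split=1, max_split=LONG_MAX, block_size=64 * MB):
    # Mirrors Math.max(minSize, Math.min(maxSize, blockSize))
    return max(min_split, min(max_split, block_size))

print(split_size() // MB)                      # 64  (all defaults)
print(split_size(block_size=128 * MB) // MB)   # 128 (larger block size)
print(split_size(min_split=128 * MB) // MB)    # 128 (min split pushes above the 64 MB block)
print(split_size(max_split=32 * MB) // MB)     # 32  (max split caps below the 64 MB block)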
For a detailed explanation, you can look here.
I am using AWS Comprehend for PII redaction. The idea is to detect entities and then redact the PII from the text.
Now the problem is that this API has an input text size limit. How can I increase the limit? Maybe to 1 MB? Or is there any other way to detect entities in large text?
ERROR: botocore.errorfactory.TextSizeLimitExceededException: An error occurred (TextSizeLimitExceededException) when calling the DetectPiiEntities operation: Input text size exceeds limit. Max length of request text allowed is 5000 bytes while in this request the text size is 7776 bytes
There's no way to increase this limit.
For input text greater than 5000 bytes, you can split the text into multiple chunks of at most 5000 bytes each and then aggregate the results back.
Do make sure you keep some overlap between chunks, to carry over some context from the previous chunk.
For reference, you can use the similar solution exposed by the Comprehend team itself: https://github.com/aws-samples/amazon-comprehend-s3-object-lambda-functions/blob/main/src/processors.py#L172
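A minimal boto3 sketch of that approach, assuming English text and a simple fixed-size window (the chunk and overlap sizes are illustrative, de-duplication of entities found twice in the overlap region is left out, and for non-ASCII text you would need to chunk by bytes rather than characters):

import boto3

comprehend = boto3.client("comprehend")

CHUNK_SIZE = 4500   # stay safely under the 5000-byte limit
OVERLAP = 200       # carry some context across chunk boundaries

def detect_pii_chunked(text, language_code="en"):
    """Run DetectPiiEntities over overlapping chunks, re-basing offsets to the full text."""
    entities = []
    start = 0
    while start < len(text):
        chunk = text[start:start + CHUNK_SIZE]
        response = comprehend.detect_pii_entities(Text=chunk, LanguageCode=language_code)
        for entity in response["Entities"]:
            entities.append({
                **entity,
                "BeginOffset": entity["BeginOffset"] + start,
                "EndOffset": entity["EndOffset"] + start,
            })
        start += CHUNK_SIZE - OVERLAP
    return entities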
I want to understand how a MapReduce job processes the data blocks. As per my understanding, one mapper is invoked for each data block.
Let me illustrate my query with an example.
Suppose I have a long text file stored in HDFS in 4 blocks (64 MB each) on 4 nodes (let's forget about replication here).
In this case, 4 map tasks will be invoked, one on each machine (all 4 data nodes).
Here is the question: this splitting may result in a partial record being stored across two blocks. For example, the last record may be stored partially at the end of block 1 and partially at the start of block 2.
In this case, how does the MapReduce framework ensure that the complete record gets processed?
I hope I have been able to put my query clearly.
Please read this article at http://www.hadoopinrealworld.com/inputsplit-vs-block/
This is an excerpt from the same article I posted above.
A block is a hard division of data at the block size. So if the block size in the cluster is 128 MB, each block for the dataset will be 128 MB, except for the last block, which could be smaller if the file size is not evenly divisible by the block size. So a block is a hard cut at the block size, and blocks can end even before a logical record ends.
Consider that the block size in your cluster is 128 MB and each logical record in your file is about 100 MB. (Yes, huge records.)
The first record will fit perfectly in the block, no problem, since the record size of 100 MB is well within the block size of 128 MB. However, the 2nd record cannot fit in the block, so record 2 will start in block 1 and end in block 2.
If you assign a mapper to block 1, in this case, the mapper cannot process record 2 because block 1 does not have the complete record 2. That is exactly the problem an InputSplit solves. In this case, InputSplit 1 will have both record 1 and record 2. InputSplit 2 does not start with record 2, since record 2 is already included in InputSplit 1. So InputSplit 2 will have only record 3. As you can see, record 3 is divided between blocks 2 and 3, but InputSplit 2 will still have the whole of record 3.
Blocks are physical chunks of data stored on disk, whereas an InputSplit is not a physical chunk of data. An InputSplit is a Java class with pointers to the start and end locations in blocks. So when a mapper tries to read the data, it knows exactly where to start reading and where to stop reading. The start of an InputSplit can be in one block and its end in another block.
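A toy Python sketch of that idea (this is not Hadoop's actual LineRecordReader, just an illustration): every reader except the first skips the partial record at the start of its split, and every reader is allowed to read past the end of its split to finish the record it started, so each newline-terminated record is processed exactly once.

def read_records(data: bytes, split_start: int, split_end: int):
    """Yield the complete newline-terminated records belonging to one input split."""
    pos = split_start
    if split_start != 0:
        # Skip the partial record; the previous split's reader finishes it.
        pos = data.find(b"\n", split_start) + 1
    while pos < split_end:
        end = data.find(b"\n", pos)
        if end == -1:
            end = len(data)
        yield data[pos:end]          # may read past split_end to complete a record
        pos = end + 1

data = b"record1\nrecord2\nrecord3\n"
# Two "splits" whose boundary cuts record2 in half; each record still appears exactly once.
for start, end in [(0, 12), (12, len(data))]:
    print(list(read_records(data, start, end)))
# [b'record1', b'record2']
# [b'record3']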
Happy Hadooping
I am running Pig in local mode on a large file (54 GB). I observe it spawning a lot of map tasks sequentially. My expectation is that each map task reads roughly 64 MB worth of lines. If I want to optimize this so that each map task reads, say, 1 GB worth of lines:
a.) Is it possible?(Maybe by increasing split size)
b.) How?
c.) Is there any other, more optimal approach?
thanks
You can set the split size in your Pig script with:
SET mapred.max.split.size <bytes>
Note that mapred.max.split.size only caps the split size; to get splits larger than the HDFS block size you also need to raise mapred.min.split.size above the block size, or increase the block size itself. By default the block size is 64 MB.
Try this to increase the block size:
Open the hdfs-site.xml file, usually found in the conf/ folder of the Hadoop installation directory, and set the following property:
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <description>Block size</description>
</property>
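As a rough sanity check, here is a minimal Python sketch (54 GB is the file size from the question; the two split sizes are the 64 MB default and the desired 1 GB):

import math

FILE_SIZE = 54 * 1024**3                      # 54 GB input file from the question

for split in (64 * 1024**2, 1024**3):         # default 64 MB splits vs desired 1 GB splits
    maps = math.ceil(FILE_SIZE / split)
    print(f"{split // 1024**2} MB splits -> ~{maps} map tasks")
# 64 MB splits -> ~864 map tasks
# 1024 MB splits -> ~54 map tasks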
The number of map() tasks spawned is equal to the number of 64 MB blocks of input data. Suppose we have 2 input files of 1 MB each; both files will be stored in a single block. But when I run my MR program with 1 namenode and 2 job nodes, I see 2 map() tasks spawned, one for each file. So is this because the system tried to split the job between the 2 nodes, i.e.,
Number of map() tasks spawned = number of 64 MB blocks of input data * number of job nodes?
Also, in the MapReduce tutorial it is written that for a 10 TB file with a block size of 128 MB, 82,000 maps will be spawned. However, according to the logic that the number of maps depends only on block size, 78,125 jobs should be spawned (10 TB / 128 MB). I do not understand how the few extra jobs get spawned. It would be great if anyone could share their thoughts on this. Thanks. :)
By default one mapper is spawned per input file, and if the size of an input file is greater than the split size (which is normally kept the same as the block size), then the number of mappers for that file will be ceil(filesize / splitsize).
Now say you have 5 input files and the split size is 64 MB (the sketch after the list reproduces these counts):
file1 - 10 MB
file2 - 30 MB
file3 - 50 MB
file4 - 100 MB
file5 - 1500 MB
number of mapper launched
file1 - 1
file2 - 1
file3 - 1
file4 - 2
file5 - 24
total mappers - 29
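A small Python sketch of that calculation, using the file sizes from the list above:

import math

SPLIT_SIZE_MB = 64
file_sizes_mb = {"file1": 10, "file2": 30, "file3": 50, "file4": 100, "file5": 1500}

mappers = {name: math.ceil(size / SPLIT_SIZE_MB) for name, size in file_sizes_mb.items()}
print(mappers)                 # {'file1': 1, 'file2': 1, 'file3': 1, 'file4': 2, 'file5': 24}
print(sum(mappers.values()))   # 29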
Additionally, the input split size and block size are not always honored. If an input file is gzip-compressed, it is not splittable, so if one of the gzip files is 1500 MB, it will not be split. It is better to use block compression with Snappy or LZO along with the SequenceFile format.
Also, the input split size is not used if the input is an HBase table. For an HBase table, the only way to control the splits is to maintain a proper region size for the table. If the table is not well distributed, manually split it into multiple regions.
The number of mappers depends on just one thing: the number of InputSplits created by the InputFormat you are using (the default is TextInputFormat, which creates splits of lines delimited by \n). It does not depend on the number of nodes, the number of files, or the block size (64 MB or whatever). It is very good if a split is equal to a block, but this is just the ideal situation and cannot always be guaranteed. The MapReduce framework tries its best to optimize the process, and in doing so things happen like creating just 1 mapper for an entire file (if the file size is less than the block size). Another optimization is to create fewer mappers than the number of splits. For example, if your file has 20 lines and you are using TextInputFormat, you might think you will get 20 mappers (since the number of mappers equals the number of splits, and TextInputFormat creates splits based on \n). But this does not happen; there would be unwanted overhead in creating 20 mappers for such a small file.
And if the size of a split is greater than the block size, the remaining data is pulled in from the remote block on a different machine so that it can be processed.
About the MapReduce tutorial :
If you have 10TB data, then -
(10 * 1024 * 1024 MB) / 128 MB = 81,920 mappers, which is approximately 82,000.
Hope this clears some things up.
I would like to ask a question about compression performance
as it relates to the chunk size of HDF5 files.
I have 2 hdf5 files on hand, which have the following properties.
They both only contain one dataset, called "data".
File A's "data":
Type: HDF5 Scalar Dataset
No. of Dimensions: 2
Dimension Size: 5094125 x 6
Max. dimension size: Unlimited x Unlimited
Data type: 64-bit floating point
Chunking: 10000 x 6
Compression: GZIP level = 7
File B's "data":
Type: HDF5 Scalar Dataset
No. of Dimensions: 2
Dimension Size: 6720 x 1000
Max. dimension size: Unlimited x Unlimited
Data type: 64-bit floating point
Chunking: 6000 x 1
Compression: GZIP level = 7
File A's size:
HDF5----19 MB
CSV-----165 MB
File B's size:
HDF5----60 MB
CSV-----165 MB
Both of them show great compression of the stored data when compared to the CSV files.
However, compressed file A is about 10% of the size of the original CSV,
while file B is about 30% of the original CSV.
I have tried different chunk sizes to make file B as small as possible, but it seems that 30% is the best compression I can get. I would like to ask why file A can achieve greater compression while file B cannot.
If file B can also achieve it, what should the chunk size be?
Is there any rule for determining the optimum HDF5 chunk size for compression purposes?
Thanks!
Chunking doesn't really affect the compression ratio per se, except in the manner @Ümit describes. What chunking does affect is I/O performance. When compression is applied to an HDF5 dataset, it is applied to whole chunks, individually. This means that when reading data from a single chunk in a dataset, the entire chunk must be decompressed, possibly involving a whole lot more I/O, depending on the size of the chunk cache, the shape of the chunk, etc.
What you should do is make sure that the chunk shape matches how you read/write your data. If you generally read a column at a time, make your chunks columns, for example. This is a good tutorial on chunking.
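A small h5py sketch of that advice (illustrative only: the file name, shape, and chunk size are made up, and random data is used just to show the API, so it will not compress much):

import numpy as np
import h5py

data = np.random.rand(100000, 6)              # a smaller stand-in for file A's shape

with h5py.File("example.h5", "w") as f:
    # Compression is applied per chunk, so pick a chunk shape that matches your reads.
    f.create_dataset("data", data=data,
                     chunks=(10000, 6),        # whole-row chunks: good for reading row ranges
                     compression="gzip", compression_opts=7)

with h5py.File("example.h5", "r") as f:
    block = f["data"][0:10000, :]              # touches (and decompresses) exactly one chunk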