Terasort error: Requested more partitions than input keys (1 > 0) - mapreduce

I am benchmarking Hadoop and am working with the TeraGen and TeraSort tools for this.
TeraGen works fine; I am running it with the following command:
hadoop jar /Users/karan.verma/Documents/backups/h/hadoop-2.6.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen -Dmapreduce.job.maps=100 1t random-data1
It gives the following output on the console:
17/10/03 17:19:21 INFO mapreduce.Job: Job job_1507026170114_0005 completed successfully
17/10/03 17:19:21 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=10661490
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=8594
HDFS: Number of bytes written=0
HDFS: Number of read operations=400
HDFS: Number of large read operations=0
HDFS: Number of write operations=200
Job Counters
Launched map tasks=100
Other local map tasks=100
Total time spent by all maps in occupied slots (ms)=1089472
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=1089472
Total vcore-milliseconds taken by all map tasks=1089472
Total megabyte-milliseconds taken by all map tasks=1115619328
Map-Reduce Framework
Map input records=0
Map output records=0
Input split bytes=8594
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=9690
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=11115954176
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
Following this, when I execute TeraSort with the command:
hadoop jar /Users/karan.verma/Documents/backups/h/hadoop-2.6.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort random-data1 sorted-data
I get the following error:
17/10/03 17:20:10 INFO terasort.TeraSort: starting
17/10/03 17:20:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/03 17:20:11 INFO input.FileInputFormat: Total input paths to process : 100
Spent 168ms computing base-splits.
Spent 2ms computing TeraScheduler splits.
Computing input splits took 172ms
Sampling 10 splits of 100
Making 1 from 0 sampled records
17/10/03 17:20:11 ERROR terasort.TeraSort: Requested more partitions than input keys (1 > 0)
Why is this happening? Is there anything I am missing in the configuration?

Check the output of the teragen command, because the generated data set may be empty; TeraSort reports this error when the input data size is 0. In your case the TeraGen counters above already show Map output records=0 and HDFS: Number of bytes written=0, so nothing was actually written to random-data1. You can confirm this with hdfs dfs -du -s -h random-data1, and re-run teragen so that it actually produces data before you sort.

Related

Regex for Sqoop daemon logs

I'm trying to create a regex for Sqoop logs.
Below is the log:
> Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-
1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/12/06 07:03:04 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
18/12/06 07:03:05 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/12/06 07:03:05 WARN sqoop.ConnFactory: Parameter --driver is set to an explicit driver however appropriate connection manager is not being set (via --connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.GenericJdbcManager. Please specify explicitly which connection manager should be used next time.
18/12/06 07:03:05 INFO manager.SqlManager: Using default fetchSize of 1000
18/12/06 07:03:05 INFO tool.CodeGenTool: Beginning code generation
18/12/06 07:03:06 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM ROH319P4 AS t WHERE 1=0
18/12/06 07:03:06 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-root/compile/92b93a5009481a238e86271708bb80e0/ROH319P4.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
18/12/06 07:03:10 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/92b93a5009481a238e86271708bb80e0/ROH319P4.jar
18/12/06 07:03:10 INFO mapreduce.ImportJobBase: Beginning import of ROH319P4
18/12/06 07:03:10 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
18/12/06 07:03:11 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
18/12/06 07:03:11 INFO client.RMProxy: Connecting to ResourceManager at ip-172-27-88-6.ap-south-1.compute.internal/172.27.88.6:8032
18/12/06 07:03:14 INFO db.DBInputFormat: Using read commited transaction isolation
18/12/06 07:03:14 INFO mapreduce.JobSubmitter: number of splits:1
18/12/06 07:03:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1536239303820_0582
18/12/06 07:03:15 INFO impl.YarnClientImpl: Submitted application application_1536239303820_0582
18/12/06 07:03:15 INFO mapreduce.Job: The url to track the job: http://ip-172-27-88-6.ap-south-1.compute.internal:20888/proxy/application_1536239303820_0582/
18/12/06 07:03:15 INFO mapreduce.Job: Running job: job_1536239303820_0582
18/12/06 07:03:22 INFO mapreduce.Job: Job job_1536239303820_0582 running in uber mode : false
18/12/06 07:03:22 INFO mapreduce.Job: map 0% reduce 0%
18/12/06 07:03:28 INFO mapreduce.Job: map 100% reduce 0%
18/12/06 07:03:28 INFO mapreduce.Job: Job job_1536239303820_0582 completed successfully
18/12/06 07:03:28 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=189523
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=4997
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=742431
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=4057
Total vcore-milliseconds taken by all map tasks=4057
Total megabyte-milliseconds taken by all map tasks=23757792
Map-Reduce Framework
Map input records=38
Map output records=38
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=62
CPU time spent (ms)=1260
Physical memory (bytes) snapshot=295165952
Virtual memory (bytes) snapshot=7060725760
Total committed heap usage (bytes)=340262912
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=4997
18/12/06 07:03:28 INFO mapreduce.ImportJobBase: Transferred 4.8799 KB in 17.1259 seconds (291.781 bytes/sec)
18/12/06 07:03:28 INFO mapreduce.ImportJobBase: Retrieved 38 records.
The regex that I have tried creating:
^(\d{4}/\d{2}/\d{2})\s+(\d{2}.\d{2}.\d{2})\s+(\S+)\s+(\S+)\s+(.*)$
My aim is to match lines with a format like:
18/12/06 07:03:06 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Can anyone help me figure out the regex?
Your regex was almost correct, except for the following:
The first \d{4} should be \d{2}, since the year has only two digits. If you think it could have either two or four digits, you could use \d{2,4}.
The / character only needs to be escaped as \/ in flavors where the pattern is delimited by slashes (for example a JavaScript /.../ literal); in most engines a bare / is fine, but the escaped form is harmless.
Instead of . you should use :, since the time is colon-separated. A . will match any character, including a colon, but it is better to be precise and use :.
So your updated regex becomes the following, which successfully matches the lines in your log file:
^(\d{2}\/\d{2}\/\d{2})\s+(\d{2}:\d{2}:\d{2})\s+(\S+)\s+(\S+)\s+(.*)$
And if you don't need the capture groups, you can use non-capturing groups instead:
^(?:\d{2}\/\d{2}\/\d{2})\s+(?:\d{2}:\d{2}:\d{2})\s+(?:\S+)\s+(?:\S+)\s+(?:.*)$
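For illustration, here is a minimal sketch of how the capturing version could be applied to one of the sample log lines. The use of C++ and std::regex here is my own assumption for the example, not part of the original answer; the pattern itself behaves the same in any ECMAScript-compatible engine:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Groups: 1 date, 2 time, 3 log level, 4 logger class, 5 message.
    const std::regex logLine(
        R"(^(\d{2}/\d{2}/\d{2})\s+(\d{2}:\d{2}:\d{2})\s+(\S+)\s+(\S+)\s+(.*)$)");

    const std::string line =
        "18/12/06 07:03:06 INFO orm.CompilationManager: "
        "HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce";

    std::smatch m;
    if (std::regex_match(line, m, logLine)) {
        std::cout << "date="    << m[1] << "\n"
                  << "time="    << m[2] << "\n"
                  << "level="   << m[3] << "\n"
                  << "logger="  << m[4] << "\n"
                  << "message=" << m[5] << "\n";
    }
    return 0;
}

Note that inside the C++ raw string the forward slashes are left unescaped, which std::regex accepts.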

EMR cluster running slow

I was running a MapReduce Hadoop job on Amazon EMR 5.5.2, which uses Hadoop 2.7.3.
I recently upgraded EMR to 5.12.1, which uses Hadoop 2.8.0.
For the same input load, my new cluster runs comparatively very slowly.
I am not able to find out why; maybe I need to tweak some performance parameters.
Below are the MapReduce job counters. Looking at these, can anybody offer insight into which performance parameters are wrong?
File System Counters
FILE: Number of bytes read=1087
FILE: Number of bytes written=24787084
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=15840
HDFS: Number of bytes written=0
HDFS: Number of read operations=132
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
S3N: Number of bytes read=0
S3N: Number of bytes written=4315
S3N: Number of read operations=0
S3N: Number of large read operations=0
S3N: Number of write operations=0
Job Counters
Launched map tasks=132
Launched reduce tasks=7
Other local map tasks=132
Total time spent by all maps in occupied slots (ms)=1576936320
Total time spent by all reduces in occupied slots (ms)=26894720
Total time spent by all map tasks (ms)=2463963
Total time spent by all reduce tasks (ms)=42023
Total vcore-milliseconds taken by all map tasks=2463963
Total vcore-milliseconds taken by all reduce tasks=42023
Total megabyte-milliseconds taken by all map tasks=50461962240
Total megabyte-milliseconds taken by all reduce tasks=860631040
Map-Reduce Framework
Map input records=12523
Map output records=2
Map output bytes=3236
Map output materialized bytes=15935
Input split bytes=15840
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=15935
Reduce input records=2
Reduce output records=8
Spilled Records=4
Shuffled Maps =924
Failed Shuffles=0
Merged Map outputs=924
GC time elapsed (ms)=64327
CPU time spent (ms)=2737480
Physical memory (bytes) snapshot=166237839360
Virtual memory (bytes) snapshot=2760473792512
Total committed heap usage (bytes)=187218526208

Spark writing/reading to/from S3 - Partition Size and Compression

I am doing an experiment to understand which file size behaves best with S3 and [EMR + Spark].
Input data:
Incompressible data: random bytes in files
Total data size: 20 GB
Each folder uses a different input file size, ranging from 2 MB to 4 GB.
Cluster specifications:
1 master + 4 nodes: c3.8xlarge instances
--driver-memory 5G \
--executor-memory 3G \
--executor-cores 2 \
--num-executors 60 \
Code :
scala> def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
time: [R](block: => R)R
scala> val inputFiles = time{sc.textFile("s3://bucket/folder/2mb-10240files-20gb/*/*")};
scala> val outputFiles = time {inputFiles.saveAsTextFile("s3://bucket/folder-out/2mb-10240files-20gb/")};
Observations
2 MB-32 MB: most of the time is spent opening file handles [not efficient].
64 MB-1 GB: Spark launches 320 tasks for all of these file sizes; the task count no longer matches the number of files making up the 20 GB. For example, at 512 MB per file there were 40 files making up the 20 GB, so 40 tasks would have been enough, but instead there were 320 tasks, each dealing with 64 MB of data.
4 GB file size: 0 bytes output [not able to handle in memory / data not even splittable???]
Questions
Is there any default setting that forces the input to be processed in 64 MB chunks?
Since the data I am using is random bytes and is already compressed, how is Spark splitting it further? If it can split this data, why is it not able to split the 4 GB files?
Why does the compressed file size increase after uploading via Spark? The 2 MB compressed input file becomes 3.6 MB in the output bucket.
Since it is not specified, I'm assuming the use of gzip and Spark 2.2 in my answer.
Is there any default setting that forces the input to be processed in 64 MB chunks?
Yes, there is. Spark uses Hadoop's file-system layer, and therefore treats S3 as a block-based file system even though it is really an object store.
So the real question here is which S3 file-system implementation you are using (s3a, s3n, etc.). A similar question can be found here.
Since the data I am using is random bytes and is already compressed, how is Spark splitting it further? If it can split this data, why is it not able to split the 4 GB files?
Spark docs indicate that it is capable of reading compressed files:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
This means that your files were read quite easily and converted to a plaintext string for each line.
However, you are using compressed files. Assuming the format is non-splittable, such as gzip, the entire file is needed for decompression. You are running 3 GB executors, which can handle the 4 MB-1 GB files quite well, but cannot handle a file larger than 3 GB at once (probably somewhat less after accounting for overhead).
Some further info can be found in this question. Details of splittable compression types can be found in this answer.
Why does the compressed file size increase after uploading via Spark? The 2 MB compressed input file becomes 3.6 MB in the output bucket.
As a corollary to the previous point, Spark decompressed the data to plaintext while reading it, so when it is written back out it is no longer compressed. To compress the output, you can pass a compression codec as a parameter:
inputFiles.saveAsTextFile("s3://path", classOf[org.apache.hadoop.io.compress.GzipCodec])
There are other compression formats available.

Calculating Disk Read/Write in Linux with C++

My requirement is to profile the current process's disk read/write operations against the total disk read/write operations (or the amount of data read/written). I need to take samples every second and plot a graph of the two. I need to do this on Linux (Ubuntu 12.10) in C++.
Are there any APIs or tools available for this task? I found the iotop tool, but I am not sure how to use it to compare current-process usage with system-wide usage.
Thank You
You can read the file /proc/diskstats every second. Each line represents one device.
From the kernel's Documentation/iostat.txt:
Field 1 -- # of reads completed
This is the total number of reads completed successfully.
Field 2 -- # of reads merged, field 6 -- # of writes merged
Reads and writes which are adjacent to each other may be merged for
efficiency. Thus two 4K reads may become one 8K read before it is
ultimately handed to the disk, and so it will be counted (and queued)
as only one I/O. This field lets you know how often this was done.
Field 3 -- # of sectors read
This is the total number of sectors read successfully.
Field 4 -- # of milliseconds spent reading
This is the total number of milliseconds spent by all reads (as
measured from __make_request() to end_that_request_last()).
Field 5 -- # of writes completed
This is the total number of writes completed successfully.
Field 7 -- # of sectors written
This is the total number of sectors written successfully.
Field 8 -- # of milliseconds spent writing
This is the total number of milliseconds spent by all writes (as
measured from __make_request() to end_that_request_last()).
Field 9 -- # of I/Os currently in progress
The only field that should go to zero. Incremented as requests are
given to appropriate struct request_queue and decremented as they finish.
Field 10 -- # of milliseconds spent doing I/Os
This field increases so long as field 9 is nonzero.
Field 11 -- weighted # of milliseconds spent doing I/Os
This field is incremented at each I/O start, I/O completion, I/O
merge, or read of these stats by the number of I/Os in progress
(field 9) times the number of milliseconds spent doing I/O since the
last update of this field. This can provide an easy measure of both
I/O completion time and the backlog that may be accumulating.
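As an illustration (a minimal sketch of mine, not part of the original answer), the system-wide totals can be sampled by parsing fields 3 and 7 of each line. Note that sectors are 512-byte units and that partitions appear alongside their parent disks, so in practice you would filter the device names to avoid double counting:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

struct DiskTotals {
    unsigned long long sectorsRead = 0;
    unsigned long long sectorsWritten = 0;
};

// Sum sectors read/written over every device line in /proc/diskstats.
DiskTotals readDiskTotals() {
    DiskTotals totals;
    std::ifstream stats("/proc/diskstats");
    std::string line;
    while (std::getline(stats, line)) {
        std::istringstream in(line);
        unsigned int devMajor = 0, devMinor = 0;
        std::string device;
        unsigned long long f[11] = {0};       // fields 1..11 from iostat.txt
        in >> devMajor >> devMinor >> device;
        for (int i = 0; i < 11 && in; ++i) in >> f[i];
        totals.sectorsRead    += f[2];        // field 3: sectors read
        totals.sectorsWritten += f[6];        // field 7: sectors written
    }
    return totals;
}

int main() {
    DiskTotals t = readDiskTotals();
    std::cout << "system read bytes:  " << t.sectorsRead * 512ULL << "\n"
              << "system write bytes: " << t.sectorsWritten * 512ULL << "\n";
    return 0;
}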
For each process, you can use /proc/<pid>/io, which produces something like this:
rchar: 2012
wchar: 0
syscr: 7
syscw: 0
read_bytes: 0
write_bytes: 0
cancelled_write_bytes: 0
rchar, wchar: number of bytes read/written.
syscr, syscw: number of read/write system calls.
read_bytes, write_bytes: number of bytes read/written to storage media.
cancelled_write_bytes: to the best of my understanding, caused by calls to ftruncate that cancel pending writes to the same file. Probably most often 0.
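Tying the two together, here is another small sketch of mine (same assumptions as above) that samples the current process's counters once per second; the per-second deltas of read_bytes and write_bytes are what you would plot against the system-wide totals from /proc/diskstats:

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

struct ProcIo {
    unsigned long long readBytes = 0;
    unsigned long long writeBytes = 0;
};

// Parse read_bytes/write_bytes for the calling process from /proc/self/io.
ProcIo readProcIo() {
    ProcIo io;
    std::ifstream f("/proc/self/io");
    std::string key;
    unsigned long long value = 0;
    while (f >> key >> value) {
        if (key == "read_bytes:")  io.readBytes = value;
        if (key == "write_bytes:") io.writeBytes = value;
    }
    return io;
}

int main() {
    ProcIo prev = readProcIo();
    for (;;) {
        std::this_thread::sleep_for(std::chrono::seconds(1));
        ProcIo cur = readProcIo();
        std::cout << "process read B/s: " << (cur.readBytes - prev.readBytes)
                  << "  write B/s: "      << (cur.writeBytes - prev.writeBytes)
                  << std::endl;
        prev = cur;
    }
}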

Reading binary files, Linux Buffer Cache

I am writing something to test disk I/O read speeds on Linux.
At the moment I have something like this to read the files:
Edited to change code to this:
const int segsize = 1048576;
char buffer[segsize];
ifstream file;
file.open(sFile.c_str());
while(file.readsome(buffer,segsize)) {}
For foo.dat, which is 150GB, the first time I read it in, it takes around 2 minutes.
However, if I run it again within 60 seconds of the first run, it takes around 3 seconds. How is that possible? Surely the only place it could be read from that fast is the buffer cache in RAM, and the file is too big to fit in RAM.
The machine has 50 GB of RAM, and the drive is an NFS mount with all the default settings. Where could I look to confirm that this file is actually being read at this speed? Is my code wrong? It appears to take the correct amount of time the first time the file is read.
Edited to add:
I found out that my files were only being read up to a random point. I've managed to fix this by changing segsize from 1048576 down to 1024. I have no idea why this change allows the ifstream to read the whole file instead of stopping at a random point.
Thanks for the answers.
On Linux, you can do this for a quick throughput test:
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.863904 s, 243 MB/s
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.0748273 s, 2.8 GB/s
$ sync && echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.919688 s, 228 MB/s
echo 3 > /proc/sys/vm/drop_caches will flush the cache properly
in_avail doesn't give the length of the file, but a lower bound on what is available (in particular, once the buffer has been filled, it returns the amount available in the buffer). Its purpose is to tell you what can be read without blocking.
An unsigned int is most probably unable to hold a length of more than 4 GB, so what is actually read may very well fit in the cache.
C++0x stream positioning may be of interest to you if you are working with large files.
in_avail returns a lower bound on how much is available to read in the stream's read buffer, not the size of the file. To read the whole file via the stream, keep calling the stream's readsome() method and checking how much was read with the gcount() method; when that returns zero, you have read everything.
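As a concrete sketch (my own illustration, not the answerer's code), here is a chunked read loop; it uses istream::read() together with gcount() rather than readsome(), since read() keeps pulling from the file until the requested amount or EOF is reached, whereas readsome() can return zero whenever the stream's internal buffer happens to be empty:

#include <fstream>
#include <iostream>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: readspeed <file>\n"; return 1; }

    const std::size_t segsize = 1048576;            // 1 MiB chunks
    std::vector<char> buffer(segsize);              // heap buffer instead of a 1 MiB stack array

    std::ifstream file(argv[1], std::ios::binary);  // binary: no newline translation
    unsigned long long total = 0;

    while (file) {
        file.read(buffer.data(), buffer.size());
        total += file.gcount();                     // bytes actually read this pass
    }
    std::cout << "read " << total << " bytes\n";
    return 0;
}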
It appears to take a correct amount of time the first time the file is read.
On that first read, you're reading 150GB in about 2 minutes. That works out to about 10 gigabits per second. Is that what you're expecting (based on the network to your NFS mount)?
One possibility is that the file could be at least in part sparse. A sparse file has regions that are truly empty - they don't even have disk space allocated to them. These sparse regions also don't consume much cache space, and so reading the sparse regions will essentially only require time to zero out the userspace pages they're being read into.
You can check with ls -lsh. The first column will be the on-disk size - if it's less than the file size, the file is indeed sparse. To de-sparse the file, just write to every page of it.
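The same check can be done from C++ with stat(2); this is a small sketch of mine, not part of the original answer:

#include <cstdio>
#include <iostream>
#include <sys/stat.h>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: sparsecheck <file>\n"; return 1; }

    struct stat st;
    if (stat(argv[1], &st) != 0) { perror("stat"); return 1; }

    // On Linux, st_blocks counts 512-byte units actually allocated on disk.
    long long ondisk  = static_cast<long long>(st.st_blocks) * 512;
    long long logical = static_cast<long long>(st.st_size);

    std::cout << "on-disk bytes: " << ondisk
              << ", logical bytes: " << logical << "\n";
    if (ondisk < logical)
        std::cout << "the file appears to be sparse\n";
    return 0;
}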
If you would like to test for true disk speeds, one option would be to use the O_DIRECT flag to open(2) to bypass the cache. Note that all IO using O_DIRECT must be page-aligned, and some filesystems do not support it (in particular, it won't work over NFS). Also, it's a bad idea for anything other than benchmarking. See some of Linus's rants in this thread.
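For completeness, a rough O_DIRECT benchmarking sketch (mine, with the caveats above: aligned buffers and offsets, no NFS, benchmarking only) might look like this; g++ on Linux defines _GNU_SOURCE by default, which is needed for O_DIRECT in fcntl.h:

#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <iostream>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: directread <file>\n"; return 1; }

    const size_t alignment = 4096;                 // safe alignment on most filesystems
    const size_t bufsize   = 1 << 20;              // 1 MiB, a multiple of 4096

    void* buf = nullptr;
    if (posix_memalign(&buf, alignment, bufsize) != 0) return 1;

    int fd = open(argv[1], O_RDONLY | O_DIRECT);   // bypass the page cache
    if (fd < 0) { perror("open"); return 1; }

    unsigned long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, bufsize)) > 0)
        total += static_cast<unsigned long long>(n);

    close(fd);
    free(buf);
    std::cout << "read " << total << " bytes with O_DIRECT\n";
    return 0;
}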
Finally, to drop all caches on a Linux system for testing, you can do:
echo 3 > /proc/sys/vm/drop_caches
If you do this on both client and server, you will force the file out of memory. Of course, this will have a negative performance impact on anything else running at the time.