hadoop copying the result from hdfs to S3 - amazon-web-services

I've successfully completed my job on Amazon EMR. Now I want to copy the results from HDFS to S3, but I'm having a problem.
This is the step definition (--steps):
{
  "Name": "AAAAA",
  "Type": "CUSTOM_JAR",
  "Jar": "command-runner.jar",
  "ActionOnFailure": "CONTINUE",
  "Args": [
    "s3-dist-cp",
    "--src", "hdfs:///seqaddid_output",
    "--dest", "s3://wuda-notebook/seqaddid"
  ]
}
These are the logs:
2019-04-12 03:01:23,571 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Running with args: -libjars /usr/share/aws/emr/s3-dist-cp/lib/commons-httpclient-3.1.jar,/usr/share/aws/emr/s3-dist-cp/lib/commons-logging-1.0.4.jar,/usr/share/aws/emr/s3-dist-cp/lib/guava-18.0.jar,/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp-2.10.0.jar,/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar --src hdfs:///seqaddid_output/ --dest s3://wuda-notebook/seqaddid
2019-04-12 03:01:24,196 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): S3DistCp args: --src hdfs:///seqaddid_output/ --dest s3://wuda-notebook/seqaddid
2019-04-12 03:01:24,203 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Using output path 'hdfs:/tmp/4f93d497-fade-4c78-86b9-59fc3da35b4e/output'
2019-04-12 03:01:24,263 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): GET http://169.254.169.254/latest/meta-data/placement/availability-zone result: us-east-1f
2019-04-12 03:01:24,664 FATAL com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Failed to get source file system
java.io.FileNotFoundException: File does not exist: hdfs:/seqaddid_output
at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1444)
at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1437)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1452)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:795)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:705)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)

It looks like the bug is caused by a race condition when CopyFilesReducer uses multiple CopyFilesRunable instances to download the files from S3. The problem is that it uses the same temp directory across multiple threads, and each thread deletes the temp directory when it is done. So when one thread completes before another, it deletes the temp directory that the other thread is still using.
I've reported the problem to AWS, but in the meantime you can work around the bug by forcing the reducer to use a single thread: set s3DistCp.copyfiles.mapper.numWorkers to 1 in your job config.
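For example, one way to set that property is to pass it as a generic Hadoop option on the s3-dist-cp invocation. This is only a sketch: it assumes s3-dist-cp accepts -D options via ToolRunner (which the stack trace above suggests it uses) and reuses the paths from the question.
# Force a single worker thread so the shared temp directory isn't deleted
# while another thread is still using it (the workaround described above).
s3-dist-cp -D s3DistCp.copyfiles.mapper.numWorkers=1 \
  --src hdfs:///seqaddid_output \
  --dest s3://wuda-notebook/seqaddid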

Well, I had the same problem with the copy, and the issue was that I was not using the full path for the source.
I was doing the same as you:
s3-dist-cp --src hdfs:///my_path/ --dest s3://my_mucket --srcPattern .*\.parquet
and I got exactly the same error.
The solution is to use the full HDFS path:
s3-dist-cp --src hdfs:///user/hadoop/my_path/ --dest s3://my_mucket --srcPattern .*\.parquet
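Applied to the step definition in the original question, the fixed step would look roughly like the following. The /user/hadoop prefix is an assumption about where the output directory actually lives; verify it first with hadoop fs -ls /.
{
  "Name": "AAAAA",
  "Type": "CUSTOM_JAR",
  "Jar": "command-runner.jar",
  "ActionOnFailure": "CONTINUE",
  "Args": [
    "s3-dist-cp",
    "--src", "hdfs:///user/hadoop/seqaddid_output",
    "--dest", "s3://wuda-notebook/seqaddid"
  ]
}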

Related

Jenkins reported an "SSLException: Read timed out" error when installing the AWS SDK plugin

I installed Jenkins with Docker:
docker run -u root --rm -d -p 8080:8080 -p 50000:50000 -v jenkins-data:/var/jenkins_home -v /var/run/docker.sock:/var/run/docker.sock jenkinsci/blueocean
When I tried to download the AWS plugin, it reported a Java HTTPS error, as follows:
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:457)
at sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:237)
at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:190)
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:109)
Caused: javax.net.ssl.SSLException: Read timed out
at sun.security.ssl.Alert.createSSLException(Alert.java:127)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:324)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:267)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:262)
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:138)
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1386)
at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1354)
at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:948)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.MeteredStream.read(MeteredStream.java:134)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3454)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3447)
at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:81)
at hudson.model.UpdateCenter$UpdateCenterConfiguration.download(UpdateCenter.java:1283)
Caused: java.io.IOException: Failed to load https://updates.jenkins.io/download/plugins/aws-java-sdk/1.11.995/aws-java-sdk.hpi to /var/jenkins_home/plugins/aws-java-sdk.jpi.tmp
at hudson.model.UpdateCenter$UpdateCenterConfiguration.download(UpdateCenter.java:1288)
Caused: java.io.IOException: Failed to download from https://updates.jenkins.io/download/plugins/aws-java-sdk/1.11.995/aws-java-sdk.hpi (redirected to: https://mirrors.tuna.tsinghua.edu.cn/jenkins/plugins/aws-java-sdk/1.11.995/aws-java-sdk.hpi)
at hudson.model.UpdateCenter$UpdateCenterConfiguration.download(UpdateCenter.java:1322)
at hudson.model.UpdateCenter$DownloadJob._run(UpdateCenter.java:1870)
at hudson.model.UpdateCenter$InstallationJob._run(UpdateCenter.java:2162)
at hudson.model.UpdateCenter$DownloadJob.run(UpdateCenter.java:1844)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:118)
at java.lang.Thread.run(Thread.java:748)
Has anyone encountered the same problem? Can you give me some help? Thanks in advance.
Download the aws-java-sdk.hpi file from this URL:
https://updates.jenkins.io/download/plugins/aws-java-sdk/1.11.995/aws-java-sdk.hpi
Copy that file into your Jenkins plugins folder and restart Jenkins; the plugin will be installed automatically.
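As a rough sketch of those steps for the Docker setup in the question (<container-id> is a placeholder for whatever docker ps reports):
# Download the plugin manually (curl -L follows the mirror redirect).
curl -L -o aws-java-sdk.hpi https://updates.jenkins.io/download/plugins/aws-java-sdk/1.11.995/aws-java-sdk.hpi
# Copy it into the Jenkins plugins directory inside the running container.
docker cp aws-java-sdk.hpi <container-id>:/var/jenkins_home/plugins/
# Restart Jenkins (for example from Manage Jenkins, or by restarting the
# container) so it picks up the new plugin.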

Spark Scala EMR Job fails to download file from S3

I have a Spark Scala job that keeps failing in AWS EMR production, and one of the first errors I keep seeing in my executors is this "Download failed" error. I've looked at these files in S3; I even copied a file to a lower environment and ran the same job against it, and everything worked as expected. The lower environment has less data to process, but other than that I'm not really sure why I'm running into this issue. The production folder does have a Glue job that runs every hour and writes new data, but I tried running the EMR job with the Glue job paused and I still ran into this error. Other than that, nothing new is being written, and some of the files being accessed are months old and have data present.
2021-05-20 00:40:15 ERROR S3FSInputStream:295 - Unable to recover reading from stream
2021-05-20 00:40:15 ERROR AsyncFileDownloader:91 - TID: 3497 - Download failed for file path: s3://bucket/folder/part-000.snappy.parquet, range: 0-20427, partition values: [empty row], isDataPresent: false
java.io.IOException: Unexpected end of stream pos=4, contentLength=20427
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:296)
at org.apache.commons.io.IOUtils.read(IOUtils.java:2454)
at org.apache.commons.io.IOUtils.readFully(IOUtils.java:2537)
at org.apache.hadoop.util.ByteBufferIOUtils.readFullyHeapBuffer(ByteBufferIOUtils.java:89)
at org.apache.hadoop.util.ByteBufferIOUtils.readFully(ByteBufferIOUtils.java:53)
at com.amazon.ws.emr.hadoop.fs.s3.AbstractS3FSInputStream.readFullyIntoBuffers(AbstractS3FSInputStream.java:97)
at org.apache.hadoop.fs.BufferedFSInputStream.readFullyIntoBuffers(BufferedFSInputStream.java:137)
at org.apache.hadoop.fs.FSDataInputStream.readFullyIntoBuffers(FSDataInputStream.java:270)
at org.apache.parquet.hadoop.util.H1SeekableInputStream.readFullyIntoBuffers(H1SeekableInputStream.java:64)
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1181)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:806)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:634)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:576)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.org$apache$spark$sql$execution$datasources$AsyncFileDownloader$$downloadFile(AsyncFileDownloader.scala:93)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:73)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.AbortedException:
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleInterruptedException(AmazonHttpClient.java:868)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:746)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5140)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5086)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1490)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectCall.perform(GetObjectCall.java:24)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectCall.perform(GetObjectCall.java:8)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:114)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:191)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.getObject(AmazonS3LiteClient.java:102)
at com.amazon.ws.emr.hadoop.fs.s3.GetObjectInputStreamWithInfoFactory.create(GetObjectInputStreamWithInfoFactory.java:63)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.open(S3FSInputStream.java:199)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.retrieveInputStreamWithInfo(S3FSInputStream.java:390)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.reopenStream(S3FSInputStream.java:377)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:259)
... 19 more
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.timers.client.SdkInterruptedException
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.checkInterrupted(AmazonHttpClient.java:923)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.afterAttempt(AmazonHttpClient.java:1073)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1196)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
... 36 more

Dependent jar file not found in mapreduce job

I have 2 almost identical CDH 5.8 clusters, namely Lab and Production. I have a MapReduce job that runs fine in Lab but fails in the Production cluster. I have already spent over 10 hours on this. I made sure I am running exactly the same code and also compared the configurations between the clusters; I couldn't find any difference.
The only difference I could see is that when I run in Production, I see the warnings below.
Also note that the path of the cached file starts with "file://null/":
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/commons-httpclient-3.1.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop/client/commons-httpclient-3.1.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop/client/hadoop-yarn-server-common.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-yarn/hadoop-yarn-server-common.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-yarn/lib/stax-api-1.0-2.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop/client/stax-api-1.0-2.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hbase/lib/snappy-java-1.0.4.1.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/snappy-java-1.0.4.1.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 INFO impl.YarnClientImpl: Submitted application application_1502835801144_0005
17/08/16 10:13:14 INFO mapreduce.Job: The url to track the job: http://myserver.com:8088/proxy/application_1502835801144_0005/
17/08/16 10:13:14 INFO mapreduce.Job: Running job: job_1502835801144_0005
17/08/16 10:13:15 INFO mapreduce.Job: Job job_1502835801144_0005 running in uber mode : false
17/08/16 10:13:15 INFO mapreduce.Job: map 0% reduce 0%
17/08/16 10:13:15 INFO mapreduce.Job: Job job_1502835801144_0005 failed with state FAILED due to: Application application_1502835801144_0005 failed 2 times due to AM Container for appattempt_1502835801144_0005_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://myserver.com:8088/proxy/application_1502835801144_0005/Then, click on links to logs of each attempt.
Diagnostics: java.io.FileNotFoundException: File file:/var/cdr-ingest-mapreduce/lib/mail-1.4.7.jar does not exist
Failing this attempt. Failing the application.
17/08/16 10:13:15 INFO mapreduce.Job: Counters: 0
17/08/16 10:13:16 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x25ba0c30a33ea46
17/08/16 10:13:16 INFO zookeeper.ZooKeeper: Session: 0x25ba0c30a33ea46 closed
17/08/16 10:13:16 INFO zookeeper.ClientCnxn: EventThread shut down
As we can see, the job tries to start but fails, saying that a jar file was not found. I made sure the jar file exists on the local filesystem with ample permissions. I suspect the issue happens when it tries to copy the jar files into the distributed cache and fails somehow.
Here is my shell script that starts the MR job:
#!/bin/bash
# Build a comma-separated -libjars list from the local and CDH dependency directories.
LIBJARS=`ls -m /var/cdr-ingest-mapreduce/lib/*.jar |tr -d ' '|tr -d '\n'`
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hbase/lib/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop/client/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop-yarn/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop-yarn/lib/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/*.jar |tr -d ' '|tr -d '\n'`"
job_start_timestamp=''
if [ -n "$1" ]; then
job_start_timestamp="-overridedJobStartTimestamp $1"
fi
export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
yarn jar `ls /var/cdr-ingest-mapreduce/cdr-ingest-mapreduce-core*.jar` com.blah.CdrIngestor \
-libjars ${LIBJARS} \
-zookeeper 44.88.111.216,44.88.111.220,44.88.111.211 \
-yarnResourceManagerHost 44.88.111.220 \
-yarnResourceManagerPort 8032 \
-yarnResourceManagerSchedulerHost 44.88.111.220 \
-yarnResourceManagerSchedulerPort 8030 \
-mrClientSubmitFileReplication 6 \
-logFile '/var/log/cdr_ingest_mapreduce/cdr_ingest_mapreduce' \
-hdfsTempOutputDirectory '/cdr/temp_cdr_ingest' \
-versions '3' \
-jobConfigDir '/etc/cdr-ingest-mapreduce' \
${job_start_timestamp}
Node Manager Log:
2017-08-16 18:34:28,438 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
2017-08-16 18:34:28,551 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2017-08-16 18:34:31,638 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: The Auxilurary Service named 'mapreduce_shuffle' in the configuration is for class class org.apache.hadoop.mapred.ShuffleHandler which has a name of 'httpshuffle'. Because these are not the same tools trying to send ServiceData and read Service Meta Data may have issues unless the refer to the name in the config.
2017-08-16 18:34:31,851 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: container_1502835801144_0006_01_000001 has no corresponding application!
2017-08-16 18:36:08,221 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2017-08-16 18:36:08,364 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1502933671610_0001 CONTAINERID=container_1502933671610_0001_01_000001
More logs from the Node Manager showing that the jars were not copied to the cache (I am not sure what the fourth parameter, "null", in the message is):
2017-08-15 15:20:09,876 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1502835577753_0001_01_000001
2017-08-15 15:20:09,876 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1502835577753_0001_01_000001 transitioned from LOCALIZING to LOCALIZATION_FAILED
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/mail-1.4.7.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/commons-lang3-3.4.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/cdr-ingest-mapreduce-core-1.0.3-SNAPSHOT.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/opencsv-3.8.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download resource { { file:/var/cdr-ingest-mapreduce/lib/dataplatform-common-1.0.7.jar, 1502740240000, FILE, null },pending,[(container_1502835577753_0001_01_000001)],31900834426583787,DOWNLOADING}
Any help is appreciated.
Basically, the mapper/reducer was trying to read the dependent jar file(s) from the node manager's local filesystem. I confirmed that by comparing the configurations between the 2 clusters: the value of "fs.defaultFS" was set to "file:///" on the cluster that wasn't working. That value comes from the file /etc/hadoop/conf/core-site.xml on the server (edge server) where my MapReduce job was started. This file had no configuration because I had no service/role deployed on that edge server. I deployed HDFS/HttpFS on the edge server and redeployed the client configurations across the cluster. Alternatively, one could deploy a gateway role on the server to pull the configurations without having to run any role. Thanks to #tk421 for the tip. This created the contents of /etc/hadoop/conf/core-site.xml and fixed my problem.
For those who don't want to deploy any service/role on the edge server, you could copy the file contents from one of your data nodes.
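As a quick sanity check, something like the following can confirm which value the edge server actually resolves and pull a known-good core-site.xml over; the data node hostname is a placeholder, not something from the original post.
# On the edge server: print the value the client configuration resolves.
# "file:///" means -libjars files will be looked up on the node managers'
# local filesystems instead of HDFS, which is the failure described above.
hdfs getconf -confKey fs.defaultFS
# Copy the client configuration from a node that has the correct value.
scp <datanode-host>:/etc/hadoop/conf/core-site.xml /etc/hadoop/conf/core-site.xml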
I added this little code snippet before starting the job to print the configuration values:
// Print every key/value in the effective Hadoop Configuration so that
// mis-set keys such as fs.defaultFS are easy to spot.
for (Entry<String, String> entry : config) {
    System.out.println(entry.getKey() + "-->" + entry.getValue());
}

// Start and wait for the job to finish
jobStatus = job.waitForCompletion(true);

Deduce the HDFS path at runtime on EMR

I have spawned an EMR cluster with an EMR step to copy a file from S3 to HDFS and vice versa using s3-dist-cp.
This cluster is an on-demand cluster, so we are not keeping track of the IP.
The first EMR step is:
hadoop fs -mkdir /input
This step completed successfully.
The second EMR step is the following command:
s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://<bucket-name>/<folder-name>/sample.txt --dest=hdfs:///input
This step FAILED with the following error:
Error: java.lang.IllegalArgumentException: java.net.UnknownHostException: sample.txt
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2717)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2751)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2733)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:377)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.reduce(CopyFilesReducer.java:213)
at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.reduce(CopyFilesReducer.java:28)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:635)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.net.UnknownHostException: sample.txt
But this file does exist in S3, and I can read it through my Spark application on EMR.
The solution: when using s3-dist-cp, the filename should not be included in the source or the destination.
If you want to filter files in the src directory, you can use the --srcPattern option, e.g.:
s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://<bucket-name>/<folder-name>/ --dest=hdfs:///input/ --srcPattern=sample.txt.*

Hadoop C++, error running wordcount example

I am trying to run the wordcount example in C++ on Hadoop 1.0.4, on Ubuntu 12.04, but I am getting the following error.
Command:
hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input bin/input.txt -output bin/output.txt -program bin/wordcount
Error message:
13/06/14 13:50:11 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/06/14 13:50:11 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/06/14 13:50:11 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/14 13:50:11 INFO mapred.FileInputFormat: Total input paths to process : 1
13/06/14 13:50:11 INFO mapred.JobClient: Running job: job_201306141334_0003
13/06/14 13:50:12 INFO mapred.JobClient: map 0% reduce 0%
13/06/14 13:50:24 INFO mapred.JobClient: Task Id : attempt_201306141334_0003_m_000000_0, Status : FAILED
java.io.IOException
at org.apache.hadoop.mapred.pipes.OutputHandler.waitForAuthentication(OutputHandler.java:188)
at org.apache.hadoop.mapred.pipes.Application.waitForAuthentication(Application.java:194)
at org.apache.hadoop.mapred.pipes.Application.<init>(Application.java:149)
at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:71)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
attempt_201306141334_0003_m_000000_0: Server failed to authenticate. Exiting
13/06/14 13:50:24 INFO mapred.JobClient: Task Id : attempt_201306141334_0003_m_000001_0, Status : FAILED
I didn't find any solution, and I've been trying for quite a while to make it work.
I'd appreciate your help. Thanks.
I found this SO question (hadoop not running in the multinode cluster) where that user got similar errors, and it ended up being that they did not "set a class", according to the top answer. That was Java, however.
I found this tutorial about running the C++ wordcount example in Hadoop. Hopefully this helps you out.
http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_2.2_--_Running_C%2B%2B_Programs_on_Hadoop
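For reference, a minimal sketch of the workflow such tutorials walk through is shown below. The library paths, the source file name (wordcount.cpp) and the extra OpenSSL link flags are assumptions about a typical Hadoop 1.0.x tarball install, not something taken from the question.
# Compile against the Hadoop Pipes C++ libraries (paths assume a standard
# Hadoop 1.0.x layout; -lcrypto/-lssl are commonly needed because the Pipes
# client authenticates with the task via an HMAC).
g++ wordcount.cpp -o wordcount \
  -I$HADOOP_HOME/c++/Linux-amd64-64/include \
  -L$HADOOP_HOME/c++/Linux-amd64-64/lib \
  -lhadooppipes -lhadooputils -lpthread -lcrypto -lssl
# The -program argument must point at the binary in HDFS, not on the local disk.
hadoop fs -put wordcount bin/wordcount
hadoop fs -put input.txt bin/input.txt
# Run the Pipes job with the same flags as in the question.
hadoop pipes \
  -D hadoop.pipes.java.recordreader=true \
  -D hadoop.pipes.java.recordwriter=true \
  -input bin/input.txt -output bin/output.txt \
  -program bin/wordcount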