I am copying 800 Avro files (around 136 MB in size) from HDFS to S3 on an EMR cluster, but I'm getting this exception:
18/06/26 10:53:14 INFO mapreduce.Job: map 100% reduce 91%
18/06/26 10:53:14 INFO mapreduce.Job: Task Id : attempt_1529995855123_0003_r_000006_0, Status : FAILED
Error: java.lang.RuntimeException: Reducer task failed to copy 1 files: hdfs://url-to-aws-emr/user/hadoop/output/part-00258-3a28110a-9270-4639-b389-3e1f7f386ed6-c000.avro etc
at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.cleanup(CopyFilesReducer.java:67)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:635)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
The configuration for the EMR cluster is:
core-site fs.trash.checkpoint.interval 60
core-site fs.trash.interval 60
hadoop-env.export HADOOP_CLIENT_OPTS -Xmx10g
hdfs-site dfs.replication 3
Any help will be appreciated.
Edit:
Running the hdfs dfsadmin -report command gives the following result:
[hadoop@~]$ hdfs dfsadmin -report
Configured Capacity: 79056308744192 (71.90 TB)
Present Capacity: 78112126204492 (71.04 TB)
DFS Remaining: 74356972374604 (67.63 TB)
DFS Used: 3755153829888 (3.42 TB)
DFS Used%: 4.81%
Under replicated blocks: 126
Blocks with corrupt replicas: 0
Missing blocks: 63
Missing blocks (with replication factor 1): 0
This suggests that blocks are missing. Does that mean I have to re-run the program? The report also shows 126 under-replicated blocks, which I take to mean that 126 blocks will be replicated. How can I tell whether the missing blocks will be replicated as well?
Also, the under-replicated block count has been stuck at 126 for the last 30 minutes. Is there any way to force replication to happen more quickly?
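For what it's worth, a quick way to see which files the missing and under-replicated blocks actually belong to is hdfs fsck; this is only a diagnostic sketch, and the /user/hadoop/output path is simply the copy source from the error above:

# List files whose blocks have no surviving replica (these cannot be fixed by re-replication)
hdfs fsck /user/hadoop/output -list-corruptfileblocks

# Show per-file block and replica details, filtered to problem blocks
hdfs fsck /user/hadoop/output -files -blocks -locations | grep -iE "under replicated|missing"

A block reported as missing has no live replica left, so re-replication alone cannot recover it; the files it belongs to generally have to be regenerated or restored before the copy is retried.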
I got the same "Reducer task failed to copy 1 files" error, and I found logs in HDFS under /var/log/hadoop-yarn/apps/hadoop/logs related to the MapReduce job that s3-dist-cp kicks off.
hadoop fs -ls /var/log/hadoop-yarn/apps/hadoop/logs
I copied them out to the local filesystem:
hadoop fs -get /var/log/hadoop-yarn/apps/hadoop/logs/application_nnnnnnnnnnnnn_nnnn/ip-nnn-nn-nn-nnn.ec2.internal_nnnn
I then examined them in a text editor to find more detailed diagnostics on the reducer phase. In my case the failure was an error returned by the S3 service; you might find a different one.
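As an alternative to pulling the raw log files out of HDFS, the YARN CLI can dump the same aggregated logs directly; a small sketch, with the application ID left as a placeholder:

# Find the ID of the MapReduce job that s3-dist-cp launched
yarn application -list -appStates FINISHED,FAILED,KILLED

# Dump all aggregated container logs for that application (ID is a placeholder)
yarn logs -applicationId application_nnnnnnnnnnnnn_nnnn > s3distcp-job.log

# Then search the dump for the reducer-side failure details
grep -iA5 "failed to copy" s3distcp-job.log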
The problem:
I have an HDFS source, hdfs:///data, that contains 500 GB of JSON files.
My executor node memory limit is 64 GB, my Spark version is 3.3.0, and I am running on AWS EMR 6.8.
val df = spark.read.option("recursiveFileLookup","true").json("hdfs:///data")
The issue
I get an exception when I run this Spark command. How do I solve this?
I have only run this read command and never executed an action such as "save", yet I still get the memory exception.
#
# java.lang.OutOfMemoryError: GC overhead limit exceeded
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 4314"...
Stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 9571 tasks (1024.0 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2863)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2799)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2798)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2798)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1239)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1239)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1239)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3051)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoo
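The read is not as side-effect free as it looks: with no schema supplied, spark.read.json runs a schema-inference job over all 500 GB and collects the per-partition results on the driver, which is likely what trips spark.driver.maxResultSize here. A hedged sketch of two common mitigations, launched via spark-shell; the 4g limit and 1% sampling ratio are arbitrary example values, not recommendations:

# Option 1: raise the driver-side result limit so the inference job can complete
spark-shell --conf spark.driver.maxResultSize=4g

# Option 2: infer the schema from a sample of the input instead of every file
# (run inside the shell)
#   spark.read.option("recursiveFileLookup","true").option("samplingRatio","0.01").json("hdfs:///data")

Supplying an explicit schema with .schema(...) skips the inference pass entirely and is usually the most robust option for inputs of this size.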
I have a Spark Scala job that keeps failing in AWS EMR production, and one of the first errors I keep seeing in my executors is this "Download failed" error. I've looked at the files in S3, and I even copied one of them to a lower environment and ran the same job against it; everything worked as expected. The lower environment has less data to process, but other than that I'm not sure why I'm running into this issue. The production folder does have a Glue job that runs every hour and writes new data, but I tried running the EMR job with the Glue job paused and still hit this error. Otherwise nothing new is being written, and some of the files being accessed are months old and contain data.
2021-05-20 00:40:15 ERROR S3FSInputStream:295 - Unable to recover reading from stream
2021-05-20 00:40:15 ERROR AsyncFileDownloader:91 - TID: 3497 - Download failed for file path: s3://bucket/folder/part-000.snappy.parquet, range: 0-20427, partition values: [empty row], isDataPresent: false
java.io.IOException: Unexpected end of stream pos=4, contentLength=20427
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:296)
at org.apache.commons.io.IOUtils.read(IOUtils.java:2454)
at org.apache.commons.io.IOUtils.readFully(IOUtils.java:2537)
at org.apache.hadoop.util.ByteBufferIOUtils.readFullyHeapBuffer(ByteBufferIOUtils.java:89)
at org.apache.hadoop.util.ByteBufferIOUtils.readFully(ByteBufferIOUtils.java:53)
at com.amazon.ws.emr.hadoop.fs.s3.AbstractS3FSInputStream.readFullyIntoBuffers(AbstractS3FSInputStream.java:97)
at org.apache.hadoop.fs.BufferedFSInputStream.readFullyIntoBuffers(BufferedFSInputStream.java:137)
at org.apache.hadoop.fs.FSDataInputStream.readFullyIntoBuffers(FSDataInputStream.java:270)
at org.apache.parquet.hadoop.util.H1SeekableInputStream.readFullyIntoBuffers(H1SeekableInputStream.java:64)
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1181)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:806)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:634)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:576)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.org$apache$spark$sql$execution$datasources$AsyncFileDownloader$$downloadFile(AsyncFileDownloader.scala:93)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:73)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.AbortedException:
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleInterruptedException(AmazonHttpClient.java:868)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:746)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5140)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5086)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1490)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectCall.perform(GetObjectCall.java:24)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectCall.perform(GetObjectCall.java:8)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:114)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:191)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.getObject(AmazonS3LiteClient.java:102)
at com.amazon.ws.emr.hadoop.fs.s3.GetObjectInputStreamWithInfoFactory.create(GetObjectInputStreamWithInfoFactory.java:63)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.open(S3FSInputStream.java:199)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.retrieveInputStreamWithInfo(S3FSInputStream.java:390)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.reopenStream(S3FSInputStream.java:377)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:259)
... 19 more
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.timers.client.SdkInterruptedException
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.checkInterrupted(AmazonHttpClient.java:923)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.afterAttempt(AmazonHttpClient.java:1073)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1196)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
... 36 more
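Two quick checks can help narrow this down. The reader expected contentLength=20427, and the inner cause chain ends in SdkInterruptedException, which means the request thread was interrupted; the download failure may therefore be a symptom of the task being killed rather than an S3 problem in itself. A diagnostic sketch, with the bucket and key taken verbatim from the error message as placeholders:

# Compare the object's current size and ETag with the contentLength in the error
aws s3api head-object --bucket bucket --key folder/part-000.snappy.parquet

# If versioning is enabled, check whether the object changed around the failure time
aws s3api list-object-versions --bucket bucket --prefix folder/part-000.snappy.parquet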
I am trying to limit IOPS on a particular container in my docker-compose stack. To do this I am using the following config:
blkio_config:
device_write_iops:
- path: "/dev/xvda1"
rate: 20
device_read_iops:
- path: "/dev/xvda1"
rate: 20
I cannot provide the rest of the file for security reasons; however, the problem is isolated to this block. I confirmed that this is the correct path for my EBS volume using the df -h command.
When I then run docker-compose up -d I get the following error:
Recreating e1c25c41b612_drone ... error
ERROR: for e1c25c41b612_drone Cannot start service drone: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:396: setting cgroup config for procHooks process caused \\\"failed to write 202:1 20 to blkio.throttle.read_iops_device: write /sys/fs/cgroup/blkio/docker/a674e86d50111afa576d5fd4e16a131070c100b7db3ac22f95986904a47ae82a/blkio.throttle.read_iops_device: invalid argument\\\"\"": unknown
ERROR: for drone Cannot start service drone: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:396: setting cgroup config for procHooks process caused \\\"failed to write 202:1 20 to blkio.throttle.read_iops_device: write /sys/fs/cgroup/blkio/docker/a674e86d50111afa576d5fd4e16a131070c100b7db3ac22f95986904a47ae82a/blkio.throttle.read_iops_device: invalid argument\\\"\"": unknown
The IOPS limit on my EBS volume is 120, so I tested a variety of different rate values, to no avail.
Any help is massively appreciated.
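One guess worth testing, based on the "invalid argument" when writing to blkio.throttle.read_iops_device: the cgroup v1 blkio throttle keys generally accept whole block devices rather than partitions, and /dev/xvda1 is a partition (the 202:1 in the error is its major:minor pair). A small sketch for finding the parent device to reference in blkio_config instead:

# Print the parent (whole-disk) device of the partition used in blkio_config
lsblk -no PKNAME /dev/xvda1        # typically prints "xvda"

# Compare major:minor numbers with the "202:1" from the error message
lsblk -o NAME,MAJ:MIN,TYPE /dev/xvda /dev/xvda1

If the parent turns out to be /dev/xvda, pointing path at that whole device while keeping the same rate values would be the first thing to try.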
I have an Amazon EMR cluster running, to which I submit jobs using the spark-submit shell command.
The way I call it:
spark-submit --master yarn --driver-memory 10g convert.py
The convert.py script is running using PySpark with Python 3.4.
After reading a text file into an RDD, calling any action such as .take(5), .first(), or .collect(), or creating a DataFrame from the RDD, leads to the following error:
18/03/26 20:17:53 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3, ip-xx-xx-xx-xxx.ec2.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks)
Reason: Container marked as failed: container_0000000000001_0001_01_000001 on host: ip-xx-xx-xx-xxx.ec2.internal. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_0000000000001_0001_01_000001
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
It only happens for one specific file (~900 MB in size). I managed to reproduce the issue using the pyspark shell as well.
Interestingly, doing the same steps in Scala using spark-shell works perfectly.
Could this be a problem with YARN? Also, memory shouldn't be an issue, since I was able to convert an 18 GB file with the same code.
Any guidance will be greatly appreciated.
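Exit status 52 corresponds to Spark's executor exit code for an OutOfMemoryError, so the container is dying on memory even though only the driver memory was raised on the command line. A hedged sketch of the spark-submit knobs to try first; the sizes and the older spark.yarn.executor.memoryOverhead property name (a Spark 2.x-era setting, value in MB) are assumptions, not verified settings:

# Give each executor more heap plus extra off-heap overhead (values are placeholders)
spark-submit --master yarn \
  --driver-memory 10g \
  --executor-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  convert.py

Since the same file goes through spark-shell in Scala without trouble, it may also be worth checking whether it contains a single extremely long line, which PySpark has to materialize as one record in a Python worker.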
I have spawned an EMR cluster with an EMR step to copy a file from S3 to HDFS and vice-versa using s3-dist-cp.
This is an on-demand cluster, so we are not keeping track of the IP addresses.
The first EMR step is:
hadoop fs -mkdir /input - This step completed successfully.
The second EMR step is the following command:
s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://<bucket-name>/<folder-name>/sample.txt --dest=hdfs:///input - This step FAILED
I get the following exception:
Error: java.lang.IllegalArgumentException: java.net.UnknownHostException: sample.txt
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2717)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2751)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2733)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:377)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.reduce(CopyFilesReducer.java:213)
at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.reduce(CopyFilesReducer.java:28)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:635)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.net.UnknownHostException: sample.txt
But this file does exist on S3, and I can read it through my Spark application on EMR.
The solution: when using s3-dist-cp, a filename should not be included in the source or destination; both should point to directories.
If you want to filter files in the source directory, you can use the --srcPattern option, e.g.:
s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://<bucket-name>/<folder-name>/ --dest=hdfs:///input/ --srcPattern=sample.txt.*
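For completeness, a minimal corrected version of the failing step (reusing the placeholder bucket and folder names from the question) points both --src and --dest at directories:

# src and dest are directories; s3-dist-cp copies the folder's contents into HDFS
s3-dist-cp --s3Endpoint=s3.amazonaws.com \
  --src=s3://<bucket-name>/<folder-name>/ \
  --dest=hdfs:///input/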