PySpark fails with exit code 52

I have an Amazon EMR cluster running, to which I submit jobs using the spark-submit shell command.
The way I call it:
spark-submit --master yarn --driver-memory 10g convert.py
The convert.py script uses PySpark with Python 3.4.
After reading a text file into an RDD, calling any action such as .take(5), .first(), or .collect(), or creating a DataFrame from the RDD, leads to the following error:
18/03/26 20:17:53 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3, ip-xx-xx-xx-xxx.ec2.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container marked as failed: container_0000000000001_0001_01_000001 on host: ip-xx-xx-xx-xxx.ec2.internal. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_0000000000001_0001_01_000001
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
It only happens for one specific file (~900 MB in size). I was able to reproduce the issue in the pyspark shell as well.
Interestingly enough, performing the same steps in Scala with spark-shell works perfectly.
Could this be a problem with YARN? Also, memory shouldn't be an issue since I was able to convert an 18GB file with the same code.
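For reference, a minimal sketch of the failing pattern (the S3 path below is a placeholder, not the actual file):
# Minimal sketch of the failing pattern described above; the input path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("convert").getOrCreate()
rdd = spark.sparkContext.textFile("s3://my-bucket/path/to/input.txt")  # hypothetical path

rdd.take(5)  # any of these actions triggers the ExecutorLostFailure shown above
# rdd.first()
# rdd.collect()
# spark.createDataFrame(rdd.map(lambda line: (line,)), ["value"])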
Any guidance will be greatly appreciated.

Related

Spark Scala EMR Job fails to download file from S3

I have a Spark Scala job that keeps failing in AWS EMR production, and one of the first errors I keep seeing in my executors is this Download failed error. I've looked at these files in S3; I even copied one of them to a lower environment and ran the same job against it, and everything worked as expected. The lower environment has less data to process, but other than that I'm not sure why I'm running into this issue. The production folder does have a Glue job that runs every hour and writes new data, but I tried running the EMR job with the Glue job paused and still hit this error. Other than that, nothing new is being written, and some of the files being accessed are months old and contain data.
2021-05-20 00:40:15 ERROR S3FSInputStream:295 - Unable to recover reading from stream
2021-05-20 00:40:15 ERROR AsyncFileDownloader:91 - TID: 3497 - Download failed for file path: s3://bucket/folder/part-000.snappy.parquet, range: 0-20427, partition values: [empty row], isDataPresent: false
java.io.IOException: Unexpected end of stream pos=4, contentLength=20427
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:296)
at org.apache.commons.io.IOUtils.read(IOUtils.java:2454)
at org.apache.commons.io.IOUtils.readFully(IOUtils.java:2537)
at org.apache.hadoop.util.ByteBufferIOUtils.readFullyHeapBuffer(ByteBufferIOUtils.java:89)
at org.apache.hadoop.util.ByteBufferIOUtils.readFully(ByteBufferIOUtils.java:53)
at com.amazon.ws.emr.hadoop.fs.s3.AbstractS3FSInputStream.readFullyIntoBuffers(AbstractS3FSInputStream.java:97)
at org.apache.hadoop.fs.BufferedFSInputStream.readFullyIntoBuffers(BufferedFSInputStream.java:137)
at org.apache.hadoop.fs.FSDataInputStream.readFullyIntoBuffers(FSDataInputStream.java:270)
at org.apache.parquet.hadoop.util.H1SeekableInputStream.readFullyIntoBuffers(H1SeekableInputStream.java:64)
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1181)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:806)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:634)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:576)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.org$apache$spark$sql$execution$datasources$AsyncFileDownloader$$downloadFile(AsyncFileDownloader.scala:93)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:73)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.AbortedException:
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleInterruptedException(AmazonHttpClient.java:868)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:746)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5140)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5086)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1490)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectCall.perform(GetObjectCall.java:24)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectCall.perform(GetObjectCall.java:8)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:114)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:191)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.getObject(AmazonS3LiteClient.java:102)
at com.amazon.ws.emr.hadoop.fs.s3.GetObjectInputStreamWithInfoFactory.create(GetObjectInputStreamWithInfoFactory.java:63)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.open(S3FSInputStream.java:199)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.retrieveInputStreamWithInfo(S3FSInputStream.java:390)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.reopenStream(S3FSInputStream.java:377)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:259)
... 19 more
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.timers.client.SdkInterruptedException
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.checkInterrupted(AmazonHttpClient.java:923)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.afterAttempt(AmazonHttpClient.java:1073)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1196)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
... 36 more
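For reference, a hedged isolation sketch (in PySpark form, although the failing job is Scala): re-read just the Parquet prefix named in the error, optionally with a larger EMRFS retry budget. The retry value is an assumption to experiment with, not a known fix.
# Hedged isolation sketch (PySpark, though the failing job is Scala): re-read the
# Parquet prefix from the error message with a larger EMRFS retry budget.
# The retry value is an assumption, not a known fix.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-parquet-read-check")
    .config("spark.hadoop.fs.s3.maxRetries", "20")  # EMRFS retry count (assumed value)
    .getOrCreate()
)

df = spark.read.parquet("s3://bucket/folder/")  # prefix taken from the error above
df.show(5)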

FileAlreadyExistsException: File already exists:s3

I have an AWS Glue job (PySpark) that needs to load data from a centralized data lake (350 GB+), prepare it, and write it to an S3 bucket partitioned by two columns (date and geohash). Mind you, this is PROD data and the environment is PROD. Whenever the job runs, it crashes with the error below.
My Glue job has 60 G.1X workers.
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://<bucket-name>/etl/<directory>/.spark-staging-f163a945-f93c-44b9-bac3-923ec9315275/p_date=2020-11-02/part-00249-f163a945-f93c-44b9-bac3-923ec9315275.c000.snappy.parquet
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 249 in stage 15.0 failed 4 times, most recent failure: Lost task 249.3 in stage 15.0 (TID 16001, 172.36.172.237, executor 21): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I have no idea why this is happening; the Spark staging file seems to be the cause. This wasn't an issue in the DEV environment. The data volume is certainly larger in PROD, but the code is the same. My SparkConf looks something like this:
import pyspark

conf = pyspark.SparkConf().setAll([
    ("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2"),
    ("spark.speculation", "false"),
    ("spark.sql.parquet.enableVectorizedReader", "false"),
    ("spark.sql.parquet.mergeSchema", "true"),
    ("spark.sql.crossJoin.enabled", "true"),
    ("spark.sql.sources.partitionOverwriteMode", "dynamic"),
    ("spark.hadoop.fs.s3.maxRetries", "20"),
    ("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false")
])
Here is my write-to-S3 code:
finalDF.write.partitionBy('p_date').save(
    "s3://{bucket}/{basepath}/{table}/".format(
        bucket=args['DataBucket'], basepath='etl', table='sessions'),
    format='parquet',
    mode="overwrite")
I tried removing the second partition while writing, but still the same issue.
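For reference, the original two-column variant described above would look roughly like this (the geohash column name is assumed from the prose; finalDF and args come from the script above):
# Rough sketch of the two-column partitioned write described in the question.
# 'geohash' is assumed from the prose; finalDF and args come from the Glue script above.
finalDF.write.partitionBy('p_date', 'geohash').save(
    "s3://{bucket}/{basepath}/{table}/".format(
        bucket=args['DataBucket'], basepath='etl', table='sessions'),
    format='parquet',
    mode="overwrite")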
Any help would be appreciated.

Docker-compose blkio: device_write_iops not working for AWS/EBS instance

I am trying to limit IOPS on a particular container in my docker-compose stack. To do this, I am using the following config:
blkio_config:
  device_write_iops:
    - path: "/dev/xvda1"
      rate: 20
  device_read_iops:
    - path: "/dev/xvda1"
      rate: 20
I cannot provide the rest of the file for security reasons; however, the problem is isolated to this block. I confirmed that this is the correct path for my EBS volume using the df -h command.
When I then run docker-compose up -d I get the following error:
Recreating e1c25c41b612_drone ... error
ERROR: for e1c25c41b612_drone Cannot start service drone: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:396: setting cgroup config for procHooks process caused \\\"failed to write 202:1 20 to blkio.throttle.read_iops_device: write /sys/fs/cgroup/blkio/docker/a674e86d50111afa576d5fd4e16a131070c100b7db3ac22f95986904a47ae82a/blkio.throttle.read_iops_device: invalid argument\\\"\"": unknown
ERROR: for drone Cannot start service drone: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:396: setting cgroup config for procHooks process caused \\\"failed to write 202:1 20 to blkio.throttle.read_iops_device: write /sys/fs/cgroup/blkio/docker/a674e86d50111afa576d5fd4e16a131070c100b7db3ac22f95986904a47ae82a/blkio.throttle.read_iops_device: invalid argument\\\"\"": unknown
The IOPS limit on my EBS volume is 120, so I tested a variety of different values, to no avail.
Any help is massively appreciated.

Spark EMR Cluster is removing executors when run because they are idle

I have a Spark application that was running fine in standalone mode. I'm now trying to get the same application to run on an AWS EMR cluster, but currently it's failing.
The message is one I've not seen before and implies that the executors are not receiving any work and are being shut down.
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 7)
16/11/30 14:45:00 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 2
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 2 because it has been idle for 60 seconds (new desired total will be 6)
16/11/30 14:45:00 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 4
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 4 because it has been idle for 60 seconds (new desired total will be 5)
16/11/30 14:45:01 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 7
16/11/30 14:45:01 INFO ExecutorAllocationManager: Removing executor 7 because it has been idle for 60 seconds (new desired total will be 4)
The DAG shows the workers initialising, then a collect (a relatively small one), and then shortly afterwards they all fail.
Dynamic allocation was enabled, so one thought was that the driver wasn't sending the executors any tasks and they were timing out. To test that theory, I spun up another cluster with dynamic allocation disabled, and the same thing happened.
The master is set to yarn.
Any help is massively appreciated, thanks.
16/11/30 14:49:16 INFO BlockManagerMaster: Removal of executor 21 requested
16/11/30 14:49:16 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 21
16/11/30 14:49:16 INFO BlockManagerMasterEndpoint: Trying to remove executor 21 from BlockManagerMaster.
16/11/30 14:49:24 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1480517110174_0001_01_000049 on host: ip-10-138-114-125.ec2.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1480517110174_0001_01_000049
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My step is quite simple - spark-submit --deploy-mode client --master yarn --class Run app.jar
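For reference, the 60-second threshold in the log corresponds to Spark's spark.dynamicAllocation.executorIdleTimeout setting (60s by default). Below is a hedged sketch of the relevant settings, shown in PySpark conf form for consistency with the rest of this page; the application above is actually a Scala jar, and the values are illustrative only.
# Hedged sketch of the dynamic-allocation settings behind the "idle for 60 seconds"
# messages. Shown in PySpark conf form; the asker's app is a Scala jar.
# Values are illustrative, not a recommendation.
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setMaster("yarn")
    .setAppName("emr-app")
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")  # external shuffle service is required for dynamic allocation on YARN
    .set("spark.dynamicAllocation.executorIdleTimeout", "300s")  # default is 60s
)
sc = SparkContext(conf=conf)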

Error while running map-reduce job on hadoop-2.6.0

I am running a MapReduce job locally with Apache Hadoop 2.6.0 and getting this error:
15/04/02 13:00:26 INFO mapreduce.Job: map 0% reduce 0%
15/04/02 13:00:26 INFO mapreduce.Job: Job job_1427959779168_0001 failed with state FAILED due to: Application application_1427959779168_0001 failed 2 times due to AM Container for appattempt_1427959779168_0001_000002 exited with exitCode: 127
For more detailed output, check application tracking page:http://Ha-Ha-Ha.local:8088/proxy/application_1427959779168_0001/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1427959779168_0001_02_000001
Exit code: 127
Stack trace: ExitCodeException exitCode=127:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Can anybody suggest what might have caused this?
Could you check your classpath settings? Please try setting the variables HADOOP_COMMON_HOME, HADOOP_HDFS_HOME, HADOOP_YARN_HOME, and HADOOP_MAPRED_HOME in hadoop-env.sh to point to the appropriate directories under /usr/lib. Hope it helps!