Spark Dataframe hanging on save - amazon-web-services

I've been struggling to find out what is wrong with my spark job that indefinitely hangs where I try to write it out to either S3 or HDFS (~100G of data in parquet format).
The line that causes the hang:
spark_df.write.save(MY_PATH,format='parquet',mode='append')
I have tried this in overwrite as well as append mode, and tried saving to HDFS and S3, but the job will hang no matter what.
In the Hadoop Resource Manager GUI, it shows the state of the spark application as "RUNNING", but looking it seems nothing is actually being done by Spark and when I look at the Spark UI there are no jobs running.
The one thing that has gotten it to work is to increase the size of the cluster while it is in this hung state (I'm on AWS). This, however, doesn't matter if I start the cluster with 6 workers and increase to 7, or if I start with 7 and increase to 8 which seems somewhat odd to me. The cluster is using all of the memory available in both cases, but I am not getting memory errors.
Any ideas on what could be going wrong?

Thanks for the help all. I ended up figuring out the problem was actually a few separate issues. Here's how I understand them:
When I was saving directly to S3, it was related to the issue that Steve Loughran mentioned where the renames on S3 were just incredibly slow (so it looked like my cluster was doing nothing). On writes to S3, all the data is copied to temporary files and then "renamed" on S3 -- the problem is that renames don't happen like they do on a filesystem and actually take O(n) time. So all of my data was copied to S3 and then all of the time was spent renaming the files.
The other problem I faced was with saving my data to HDFS and then moving it to S3 via s3-dist-cp. All of my clusters resources were being used by Spark, and so when the Application Master tried giving resources to move the data to via s3-dist-cp it was unable to. The moving of data couldn't happen because of Spark, and Spark wouldn't shut down because my program was still trying to copy data to S3 (so they were locked).
Hope this can help someone else!

Related

Pyspark job freezes with too many vcpus

TLDR: I have a pyspark job that finishes in 10 minutes when I run it in a ec2 instance with 16 vcpus but freezes out (it doesn't fail, just never finishes) if I use an instance with over 20 vcpus. I have tried everything I could think of and I just don't know why this happens.
Full story:
I have around 200 small pyspark jobs that for a matter of costs and flexibility I execute using aws batch with spark dockers instead of EMR. Recently I decided to experiment around the best configuration for those jobs and I realized something weird: a job that finished quickly (around 10 minutes) with 16 vcpus or less would just never end with 20 or more (I waited for 3 hours). First thing I thought is that it could be a problem with batch or the way ecs-agents manage the task, so I tried running the docker in an ec2 directly and had the same problem. Then I thought the problem was with the docker image, so I tried creating a new one:
First one used with spark installed as per AWS glue compatible version (https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz)
New one was ubuntu 20 based with spark installed from the apache mirror (https://apache.mirror.digitalpacific.com.au/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz)
Same thing happened. Then I decided the problem was with using docker at all, so I installed everything directly in the ec2, had the same result. Tried changing spark version, also the same thing happened. Thought it could be a problem with hardware blocking too many threads, so I switched to an instance with AMD, nothing changed. Tried modifying some configurations, memory amount used by driver, but it always has the same result: 16 vcpus it work, more than it, it stops.
Other details:
According to the logs it seems to always stop at the same point: a parquet read operation on s3, but the parquet file is super small (> 1mb) so I don't think that is the actual problem.
After that it still has logs sometimes but nothing really useful, just "INFO ContextCleaner: Cleaned accumulator".
I use s3a to read the files from s3.
I don't get any errors or spark logs.
I appreciate any help on the matter!
Stop using the Hadoop 2.7 binaries. They are woefully obsolete, especially for S3 connectivity. replace all the hadoop 2.7 artifacts with Hadoop 2.8 ones, or, preferably, Hadoop 3.2 or later, with the consistent dependencies.
set `spark.hadoop.fs.s3a.experimental.fadvise" to random.
If you still see problems, see if you can replicate them on hadoop 3.3.x, and if so: file a bug.
(advice correct of 2021-03-9; the longer it stays in SO unedited, the less it should be believed)

Apache Spark/AWS EMR and tracking of processed files

I have AWS S3 folder where the big number of JSON files is stored. I need to ETL these files with AWS EMR over Spark and store the transformation into AWS RDS.
I have implemented the Spark job for this purpose on Scala and everything is working fine. I plan to execute this job once a week.
From time to time the external logic can add a new files to AWS S3 folder so the next time when my Spark job is starting I'd like to process only the new(unprocessed) JSON files.
Right now I don't know where to store the information about the processed JSON files so the Spark job can decide what files/folders to process. Could you please advise me what is the best practice(and how) to track this changes with Spark/AWS?
If it is spark streaming job, checkpointing is what you are looking for, it is discussed here.
Checkpointing stores the state information (ie offsets etc) in hdfs/s3 bucket, so when the job is started again, spark picks up only the un-processed files. Checkpointing offers better fault tolerance in case of failures as well, as state is handled automatically by spark itself.
Again checkpointing only works in the streaming mode of spark job.

Is there any way to stop/resume file download from AWS S3 while using s3cmd?

I'm downloading a huge file from S3 (around 20 GB) using s3cmd and I wish to pause the download right now and resume it again tomorrow.
I've read about the --continue flag but I don't know about its usage. As in, should the download be ended in a specific way for the --continue flag to be able to resume it later on? Or will the --continue flag be able to resume download no matter how the process was stopped, regardless if it was a keyboard interrupt, accidental shutdown or network error.
Can somebody give an example?
The intended usage is with the get command and version of s3cmd is 2.0.2.
You can use the s3api cli to get the object in parts.
Look at the --range and --part-number.
While researching your question I came across something interesting. Ability to download S3 objects over bittorrent. However there are few caveats, the object should be less than 5GB and that it should be public.
Apparently, simply using the continue flag does the job splendidly. I ended the download by disconnecting the network. And then started again the next day using the --continue flag and it resumed the download from where it had left.

s3distcp copy from S3 to EMR HDFS data replica always on one node

I am using s3distcp to copy a 500GB dataset into my EMR cluster. It's a 12 node r4.4xlarge cluster each with 750GB disk. It's using the EMR release label emr-5.13.0 and I'm adding Hadoop: Amazon 2.8.3, Ganglia: 3.7.2 and Spark 2.3.0. I'm using the following command to copy the data into the cluster:
s3-dist-cp --src=s3://bucket/prefix/ --dest=hdfs:///local/path/ --groupBy=.*(part_).* --targetSize=128 --outputCodec=none
When I look at the disk usage in either Ganglia or the namenode UI (port 50070 on the EMR cluster) then I can see that one node has most of it's disk filled and the others have a similar percentage used. Clicking through a lot of the files (~50) I can see that a replicate of the file always appears on the full node.
I'm using Spark to transform this data, write it to HDFS and then copy back to S3. I'm having trouble with this dataset as my tasks are being killed. I'm not certain this is the cause of the problem. I don't need to copy the data locally, nor decompress it. Initially I thought the BZIP2 codec was not splitable and decompressing would help gain parallelism in my Spark jobs but I was wrong, it is splitable. I have also discovered the hdfs balancer command which I'm using to redistribute the replicas and see if this solves my Spark problems.
However, now I've seen what I think is odd behaviour I would like to understand if this is normal for s3distcp/HDFS to create a replica of the files always on one node?
s3distcp is closed source; I can't comment in detail about its internals.
When HDFS creates replicas of data, it tries to save one block to the local machine, then 2 more elsewhere (Assuming replication==3). Whichever host is running the distcp worker processes will end up having a copy of the entire file. So if only one host is used for the copy, that fills up.
FWIW, I don't believe you need to do that distcp, not if you can do a read and filter of the data straight off S3, saving that result to hdfs. Your spark workers will do the filtering, and write their blocks back to the machines running these workers and other hosts in the chain. And for short-lived clusters, you could also try lowering the hdfs replication factor (2?), so save on HDFS data across the cluster, at the cost of having one less place for spark to schedule work adjacent to the data

Spark application stops when reading file from s3

I have an application which runs on EMR and reads a csv file from s3.
However, the whole thing seems to stop (I've let it run for about an hour) when I try to read in that file from s3. Nothing happens and nothing is written to the logs any more except that the application is still running. The step in which this application is running does not fail!
I've tried copying the file to the cluster via the flag --files of spark-submit and reading it directly within the application with sc.textFile(filename).
Is there anything I am missing?
After a while I finally got back to that problem again and could "solve" it myself (I don't really know what the problem was, though...)
It seems like spark was failing to allocate worker nodes. After setting spark.dynamicAllocation.enabled to true everything is working as expected now.