Spark application stops when reading file from s3 - amazon-web-services

I have an application which runs on EMR and reads a csv file from s3.
However, the whole thing seems to stop (I've let it run for about an hour) when I try to read in that file from s3. Nothing happens and nothing is written to the logs any more except that the application is still running. The step in which this application is running does not fail!
I've tried copying the file to the cluster via the flag --files of spark-submit and reading it directly within the application with sc.textFile(filename).
Is there anything I am missing?

After a while I finally got back to that problem again and could "solve" it myself (I don't really know what the problem was, though...)
It seems Spark was failing to allocate worker nodes. After setting spark.dynamicAllocation.enabled to true, everything works as expected.
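For anyone wanting to try the same setting, here is a minimal sketch of passing it on the spark-submit command line. The helper `spark_submit_args` and the script name `my_app.py` are hypothetical. The second setting, spark.shuffle.service.enabled, is an assumption on my part: dynamic allocation on YARN has traditionally also needed the external shuffle service, so check whether your cluster already enables it.

```python
def spark_submit_args(script, conf):
    """Build a spark-submit argument list with one --conf flag per setting."""
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return args


conf = {
    # The setting named in the answer above.
    "spark.dynamicAllocation.enabled": "true",
    # Assumption: may already be enabled on your cluster.
    "spark.shuffle.service.enabled": "true",
}

# "my_app.py" is a placeholder for the real application script.
print(" ".join(spark_submit_args("my_app.py", conf)))
```

The resulting list can be passed to subprocess.run, or the printed line can be pasted into an EMR step definition.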

Related

EMR not generating step logs

For some reason, I do not see step logs for my jobs in EMR. It used to work fine a while back, but it just stopped logging.
I checked HDFS under the path /mnt/var/log/hadoop/steps/, but there are no logs there. The steps do complete successfully; there are just no logs.
Is there anything I can do to find the issue and get logging working again?
Thanks in advance for taking time to read and respond.
All the best.

Pyspark job freezes with too many vcpus

TLDR: I have a PySpark job that finishes in 10 minutes when I run it on an EC2 instance with 16 vCPUs, but it freezes (it doesn't fail, it just never finishes) if I use an instance with 20 or more vCPUs. I have tried everything I could think of and I just don't know why this happens.
Full story:
I have around 200 small PySpark jobs that, for reasons of cost and flexibility, I execute using AWS Batch with Spark Docker images instead of EMR. Recently I decided to experiment to find the best configuration for those jobs and noticed something weird: a job that finished quickly (around 10 minutes) with 16 vCPUs or fewer would just never end with 20 or more (I waited for 3 hours). My first thought was that it could be a problem with Batch or the way the ECS agent manages the task, so I tried running the Docker image directly on an EC2 instance and had the same problem. Then I thought the problem was with the Docker image, so I tried creating a new one:
First one used with spark installed as per AWS glue compatible version (https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz)
New one was ubuntu 20 based with spark installed from the apache mirror (https://apache.mirror.digitalpacific.com.au/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz)
The same thing happened. Then I decided the problem was with using Docker at all, so I installed everything directly on the EC2 instance and had the same result. I tried changing the Spark version, and the same thing happened. I thought it could be a hardware problem blocking too many threads, so I switched to an AMD instance; nothing changed. I tried modifying some configurations, such as the amount of memory used by the driver, but the result is always the same: with 16 vCPUs it works, with more than that it stops.
Other details:
According to the logs, it always seems to stop at the same point: a parquet read operation on S3. But the parquet file is super small (< 1 MB), so I don't think that is the actual problem.
After that it still has logs sometimes but nothing really useful, just "INFO ContextCleaner: Cleaned accumulator".
I use s3a to read the files from s3.
I don't get any errors or spark logs.
I appreciate any help on the matter!
Stop using the Hadoop 2.7 binaries. They are woefully obsolete, especially for S3 connectivity. Replace all the Hadoop 2.7 artifacts with Hadoop 2.8 ones or, preferably, Hadoop 3.2 or later, with consistent dependencies.
Set `spark.hadoop.fs.s3a.experimental.input.fadvise` to `random`.
If you still see problems, see if you can replicate them on Hadoop 3.3.x, and if so, file a bug.
(Advice correct as of 2021-03-09; the longer it stays on SO unedited, the less it should be believed.)
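A rough sketch of applying that read-policy setting programmatically rather than on the command line. The helper `apply_settings` and the app name are hypothetical. The property name used here, with `input.` in it, is the one I know from the Hadoop S3A documentation; treat it as an assumption and check it against the docs for your Hadoop version.

```python
# Assumption: property name as documented for the S3A connector, prefixed
# with "spark.hadoop." so Spark forwards it to the Hadoop configuration.
S3A_READ_SETTINGS = {
    "spark.hadoop.fs.s3a.experimental.input.fadvise": "random",
}


def apply_settings(builder, settings):
    """Apply each setting to any object exposing .config(key, value)."""
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder


# With PySpark installed, usage would look like this (not run here):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("s3a-read")
# spark = apply_settings(builder, S3A_READ_SETTINGS).getOrCreate()
```

The setting has to be in place before the session is created; changing it on an already running session may not affect filesystem instances that are already cached.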

Spark Dataframe hanging on save

I've been struggling to find out what is wrong with my Spark job, which hangs indefinitely when I try to write its output to either S3 or HDFS (~100 GB of data in parquet format).
The line that causes the hang:
spark_df.write.save(MY_PATH,format='parquet',mode='append')
I have tried this in overwrite as well as append mode, and tried saving to HDFS and S3, but the job will hang no matter what.
In the Hadoop Resource Manager GUI, the state of the Spark application is shown as "RUNNING", but nothing actually seems to be done by Spark, and when I look at the Spark UI there are no jobs running.
The one thing that has gotten it to work is increasing the size of the cluster while it is in this hung state (I'm on AWS). It doesn't matter whether I start the cluster with 6 workers and increase to 7, or start with 7 and increase to 8, which seems somewhat odd to me. The cluster is using all of the available memory in both cases, but I am not getting memory errors.
Any ideas on what could be going wrong?
Thanks for the help all. I ended up figuring out the problem was actually a few separate issues. Here's how I understand them:
When I was saving directly to S3, it was related to the issue that Steve Loughran mentioned, where the renames on S3 were just incredibly slow (so it looked like my cluster was doing nothing). On writes to S3, all the data is copied to temporary files and then "renamed" on S3 -- the problem is that renames don't happen like they do on a filesystem and actually take O(n) time in the amount of data. So all of my data was copied to S3 and then all of the time was spent renaming the files.
The other problem I faced was with saving my data to HDFS and then moving it to S3 via s3-dist-cp. All of my cluster's resources were being used by Spark, so when the Application Master tried to allocate resources for moving the data via s3-dist-cp, it was unable to. The data could not be moved because Spark held the resources, and Spark wouldn't shut down because my program was still waiting to copy data to S3 (so they were deadlocked).
Hope this can help someone else!
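To illustrate why the "rename" is slow, here is a toy in-memory model, not real S3 code. It assumes what the answer describes: an object store has flat keys and no directories, so renaming a directory means copying every object under the prefix and deleting the original. `ToyObjectStore` and the key names are made up for the example.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store: flat key -> bytes, no directories."""

    def __init__(self):
        self.objects = {}
        self.bytes_copied = 0

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, src, dst):
        """'Rename' all keys under src: copy each object, then delete the original."""
        for key in [k for k in self.objects if k.startswith(src)]:
            data = self.objects[key]
            self.objects[dst + key[len(src):]] = data  # copy to the new key
            self.bytes_copied += len(data)             # work grows with data size
            del self.objects[key]                      # delete the old key


store = ToyObjectStore()
for i in range(3):
    store.put(f"out/_temporary/part-{i}", b"x" * 1000)

store.rename_prefix("out/_temporary/", "out/")
print(sorted(store.objects))  # ['out/part-0', 'out/part-1', 'out/part-2']
print(store.bytes_copied)     # 3000
```

On a real filesystem the same rename is a metadata update that does not touch the file contents, which is why the job looks idle on S3 while the commit is still copying data.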

Delaying system shutdown during json DB update in python

So I have a rather large JSON database that I'm maintaining with Python. It's basically scraping data from a website on an hourly basis, and I'm running daily restarts on the system (Linux Mint) via crontab. My issue is that if the system happens to restart during the database update, I get corrupted JSON files.
My question is whether there is any way to delay the system restart from my script, to ensure the system shuts down at a safe time. I could issue the restart command inside the script itself, but if I decide to run multiple similar scripts in the future, I'll obviously have a problem.
Any help here would be greatly appreciated. Thanks
Edit: Just to clarify I'm not using the python jsondb package. I am doing all file handling myself
So my solution to this was quite simple (Just protect data integrity):
Before write - backup the file
On successful write - delete the backup (Avoids doubling the size of the DB)
Wherever a corrupted file is encountered - revert to backup
The idea is that if the system kills the script during the file backup, it doesn't matter, because we still have the original; and if the system kills the script during the write to the original file, the backup never gets deleted and we can just use that instead. All in all it was just an extra 6 lines of code and appears to have solved the issue.
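The three steps above could be sketched roughly like this. The function names `save_json` and `load_json` and the `.bak` suffix are my own choices, not from the original script, and the sketch assumes the file is loaded (and reverted if corrupted) before the next save, as an update loop normally would.

```python
import json
import os
import shutil


def save_json(path, data):
    """Back up the current file, write the new data, then drop the backup."""
    backup = path + ".bak"
    if os.path.exists(path):
        shutil.copy2(path, backup)      # before write: back up the file
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
        f.flush()
        os.fsync(f.fileno())            # push the data to disk before cleanup
    if os.path.exists(backup):
        os.remove(backup)               # successful write: delete the backup


def load_json(path):
    """Load the file; if it is missing or corrupted, revert to the backup."""
    backup = path + ".bak"
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        if not os.path.exists(backup):
            raise                       # nothing to fall back on
        shutil.copy2(backup, path)      # corrupted file: revert to backup
        with open(path, encoding="utf-8") as f:
            return json.load(f)
```

A leftover `.bak` file after a restart is the sign that the last write did not finish, which is exactly the case `load_json` recovers from.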

Spark step on EMR just hangs as "Running" after done writing to S3

Running PySpark 2 job on EMR 5.1.0 as a step. Even after the script is done with a _SUCCESS file written to S3 and Spark UI showing the job as completed, EMR still shows the step as "Running". I've waited for over an hour to see if Spark was just trying to clean itself up but the step never shows as "Completed". The last thing written in the logs is:
INFO MultipartUploadOutputStream: close closed:false s3://mybucket/some/path/_SUCCESS
INFO DefaultWriterContainer: Job job_201611181653_0000 committed.
INFO ContextCleaner: Cleaned accumulator 0
I didn't have this problem with Spark 1.6. I've tried a bunch of different hadoop-aws and aws-java-sdk jars to no avail.
I'm using the default Spark 2.0 configurations so I don't think anything else like metadata is being written. Also the size of the data doesn't seem to have an impact on this problem.
If you aren't already, you should close your spark context.
sc.stop()
Also, if you are watching the Spark Web UI via a browser, you should close that as it sometimes keeps the spark context alive. I recall seeing this on the spark dev mailing list, but can't find the jira for it.
We experienced this problem and resolved it by running the job in cluster deploy mode using the following spark-submit option:
spark-submit --deploy-mode cluster
It had something to do with client mode: the driver runs on the master instance, and the spark-submit process gets stuck even though the Spark context has closed. This caused the instance controller to keep polling for the process, as it never received the completion signal. Running the driver on one of the instance nodes using the above option doesn't seem to have this problem. Hope this helps.
I experienced the same issue with Spark on AWS EMR and solved it by calling sys.exit(0) at the end of my Python script. The same worked for a Scala program with System.exit(0).
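Combining the suggestions from these answers (stop the context, then exit explicitly), a script could be laid out roughly as below. `run_job` is a hypothetical wrapper, not part of Spark; it only assumes the context object has a stop() method, which SparkContext does.

```python
import sys


def run_job(sc, job):
    """Run job(sc), always stop the context, and return a process exit code."""
    try:
        job(sc)
        return 0
    except Exception as exc:
        print(f"job failed: {exc}", file=sys.stderr)
        return 1
    finally:
        sc.stop()  # release the Spark context whether or not the job succeeded


# In a real script (requires pyspark; my_job is your own function):
# if __name__ == "__main__":
#     from pyspark import SparkContext
#     sc = SparkContext(appName="example")
#     sys.exit(run_job(sc, my_job))  # explicit exit so the step can complete
```

Returning a nonzero code on failure should also let EMR mark the step as failed instead of leaving it in "Running".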