Pyspark job freezes with too many vcpus - amazon-web-services

TLDR: I have a pyspark job that finishes in 10 minutes when I run it in a ec2 instance with 16 vcpus but freezes out (it doesn't fail, just never finishes) if I use an instance with over 20 vcpus. I have tried everything I could think of and I just don't know why this happens.
Full story:
I have around 200 small pyspark jobs that for a matter of costs and flexibility I execute using aws batch with spark dockers instead of EMR. Recently I decided to experiment around the best configuration for those jobs and I realized something weird: a job that finished quickly (around 10 minutes) with 16 vcpus or less would just never end with 20 or more (I waited for 3 hours). First thing I thought is that it could be a problem with batch or the way ecs-agents manage the task, so I tried running the docker in an ec2 directly and had the same problem. Then I thought the problem was with the docker image, so I tried creating a new one:
First one used with spark installed as per AWS glue compatible version (https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz)
New one was ubuntu 20 based with spark installed from the apache mirror (https://apache.mirror.digitalpacific.com.au/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz)
Same thing happened. Then I decided the problem was with using docker at all, so I installed everything directly in the ec2, had the same result. Tried changing spark version, also the same thing happened. Thought it could be a problem with hardware blocking too many threads, so I switched to an instance with AMD, nothing changed. Tried modifying some configurations, memory amount used by driver, but it always has the same result: 16 vcpus it work, more than it, it stops.
Other details:
According to the logs it seems to always stop at the same point: a parquet read operation on s3, but the parquet file is super small (> 1mb) so I don't think that is the actual problem.
After that it still has logs sometimes but nothing really useful, just "INFO ContextCleaner: Cleaned accumulator".
I use s3a to read the files from s3.
I don't get any errors or spark logs.
I appreciate any help on the matter!

Stop using the Hadoop 2.7 binaries. They are woefully obsolete, especially for S3 connectivity. replace all the hadoop 2.7 artifacts with Hadoop 2.8 ones, or, preferably, Hadoop 3.2 or later, with the consistent dependencies.
set `spark.hadoop.fs.s3a.experimental.fadvise" to random.
If you still see problems, see if you can replicate them on hadoop 3.3.x, and if so: file a bug.
(advice correct of 2021-03-9; the longer it stays in SO unedited, the less it should be believed)

Related

AWS Sagemaker: Jupyter Notebook kernel keeps dying

I get disconnect every now and then when running a piece of code in Jupyter Notebooks on Sagemaker. I usually just restart my notebook and run all the cells again. However, I want to know if there is a way to reconnect to my instance without having to lose my progress. At the minute, it shows that there is "No Kernel" at the bottom bar, but my file seems active in the kernel sessions tab. Can I recover my notebook's variables and contents? Also, is there a way to prevent future kernel disconnections?
Note that I reverted back to tornado = 5.1.1, which seems to decrease the number of disconnections, but it still happens every now and then.
Often, disconnections will be caused by inactivity because a job is running for a long time with no user input. If it's pre-processing that's taking a long time, you could increase the instance size of the processing job so that it executes faster, or increase the instance count. If you're using EMR, you can now run an EMR Spark query directly on the EMR cluster since December 2021:
https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sagemaker-studio-data-notebook-integration-emr/
There's a useful blog here https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/ which is helpful in getting you up and running.
Please let me know if you need more information, or vote for the answer if it's useful. :-)
For me a quick solution was to open a Terminal instead, save the notebook file as a Pytohn file, and run it from the terminal within Sagemaker.

Not able to run airflow in Cloud run getting error disk I/o error

I am trying to run airflow in google cloud run.
Getting error Disk I/O error, I guess the disk write permission is missing.
can someone please help me with this how to give write permission inside cloud run.
I also have to write file and later delete it.
Only the directory /tmp is writable in Cloud Run. So, change the default write location to write into this directory.
However, you have to be aware of 2 things:
Cloud Run is stateless, that means when a new instance is created, the container start from scratch, with an empty /tmp directory
/tmp directory is an in-memory file system. The maximum allowed memory on Cloud Run is 2Gb, your app memory footprint included. In addition of your file and Airflow, not sure that you will have a lot of space.
A final remark. Cloud Run is active only when it process request, and a request has a maximum timeout of 15 minutes. When no request, the allowed cpu is close to 0%. I'm not sure of what you want to achieve with Airflow on Cloud Run, but my feeling tells me that your design is strange. And I prefer to warn you before you spend too much effort on this.
EDIT 1:
Cloud Run service has evolved in the right way. In 2022,
/tmp is no longer the only writable directory (you can write everywhere, but it's still in memory)
the timeout is no longer limited to 15 minutes, but to 60 minutes
The 2nd gen runtime execution (still in preview) allows you to mount NFS (Filestore) or Cloud Storage (GCSFuse) volume to have services "more stateful".
You can also execute jobs now. So, a lot of very great evolution!
My impression is that you have a write i/o error because you are using SQLite. Is that possible.
If you want to run Airflow using cointainers, I would recommend to use Postgres or MySQL as backend databases.
You can also mount the plugins and dag folder in some external volume.

Slow performance after syncing storage bucket with gcsfuse

I've run into a bug that seemingly has no explanation as to why its happening.
Running scripts that are in my mounted drive(mounted using gcsfuse) takes forever to run(almost a couple of minutes per command). However, any scripts that I run from outside the mount folder seems to work fine. I can also notice a significant lag in the cursor movement too. Using GCS is a must for me as my dataset is already uploaded there and trying to rsync it to the VM takes forever.
I would appreciate any help regarding this issue.
Here is the cloud monitoring dashboard for some metrics I measured.

Selecting google cloud tool for executing demanding python script

Where should I execute a python script that process ~7giga of data that is available on GCS. The output will be writen to GCS as well.
The script was debugged on datalab notebook with small dataset. I would like to scale up the processing. Should I allocate a big machine? I have no idea what size (resources) of machine is needed.
Many thanks,
Eila
Just in case,
Dataflow can’t work for that kind of data processing
From what I have read about HDF5, it seems that it is not easily parallelizable (See Parallel HDF5 and h5py multiprocessing_example) so I'll assume that reading that ~7GB must me done by one worker.
If there is no workaround to it, and you do not encounter memory issues while processing it on the machine you are already using, I do not see a need to upgrade your datalab instance.

Spark Dataframe hanging on save

I've been struggling to find out what is wrong with my spark job that indefinitely hangs where I try to write it out to either S3 or HDFS (~100G of data in parquet format).
The line that causes the hang:
spark_df.write.save(MY_PATH,format='parquet',mode='append')
I have tried this in overwrite as well as append mode, and tried saving to HDFS and S3, but the job will hang no matter what.
In the Hadoop Resource Manager GUI, it shows the state of the spark application as "RUNNING", but looking it seems nothing is actually being done by Spark and when I look at the Spark UI there are no jobs running.
The one thing that has gotten it to work is to increase the size of the cluster while it is in this hung state (I'm on AWS). This, however, doesn't matter if I start the cluster with 6 workers and increase to 7, or if I start with 7 and increase to 8 which seems somewhat odd to me. The cluster is using all of the memory available in both cases, but I am not getting memory errors.
Any ideas on what could be going wrong?
Thanks for the help all. I ended up figuring out the problem was actually a few separate issues. Here's how I understand them:
When I was saving directly to S3, it was related to the issue that Steve Loughran mentioned where the renames on S3 were just incredibly slow (so it looked like my cluster was doing nothing). On writes to S3, all the data is copied to temporary files and then "renamed" on S3 -- the problem is that renames don't happen like they do on a filesystem and actually take O(n) time. So all of my data was copied to S3 and then all of the time was spent renaming the files.
The other problem I faced was with saving my data to HDFS and then moving it to S3 via s3-dist-cp. All of my clusters resources were being used by Spark, and so when the Application Master tried giving resources to move the data to via s3-dist-cp it was unable to. The moving of data couldn't happen because of Spark, and Spark wouldn't shut down because my program was still trying to copy data to S3 (so they were locked).
Hope this can help someone else!