Slow performance after syncing storage bucket with gcsfuse - google-cloud-platform

I've run into a bug that seemingly has no explanation as to why it's happening.
Running scripts that live in my mounted drive (mounted using gcsfuse) takes forever (almost a couple of minutes per command). However, any script that I run from outside the mount folder seems to work fine. I can also notice a significant lag in cursor movement. Using GCS is a must for me, as my dataset is already uploaded there and rsyncing it to the VM takes forever.
I would appreciate any help regarding this issue.
Here is the cloud monitoring dashboard for some metrics I measured.
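If the slowness comes from repeated metadata lookups (each `stat` on a gcsfuse mount can be a round-trip to GCS), raising the cache TTLs at mount time often helps. A minimal sketch, assuming a hypothetical bucket name and mount point; flag names have changed across gcsfuse versions, so check `gcsfuse --help` for your release:

```shell
# my-dataset-bucket and /mnt/gcs are placeholders; adjust to your setup.
# Longer cache TTLs cut down on repeated GCS metadata round-trips, which
# is usually what makes scripts living inside the mount feel slow.
gcsfuse \
  --implicit-dirs \
  --stat-cache-ttl 60s \
  --type-cache-ttl 60s \
  my-dataset-bucket /mnt/gcs
```

If the workload re-reads the same files, copying the hot subset to local disk once and running from there will always beat reading through the FUSE layer.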

Related

Getting "[Errno 28] No space left on device" on AWS SageMaker with plenty of storage left

I am running a notebook instance on Amazon SageMaker, and my understanding is that notebooks have 5 GB of storage by default. I am running into [Errno 28] No space left on device on a notebook that worked just fine the last time I tried it. I checked, and I'm using approximately 1.5 GB out of 5 GB. I'm trying to download a bunch of files from my S3 bucket, but I get the error before even one file is downloaded. Additionally, the notebook no longer autosaves.
Has anyone run into this and figured out a way to fix it? I've already tried clearing all outputs.
Thanks in advance!
Open a terminal and run df -kh to see which filesystem is running out of disk space.
There's a root filesystem, which is 100GB, and a user filesystem whose size you can customize (default 5GB) (doc).
Guess: I saw that, especially when using Docker, the root FS can run out of space.
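The check above might look like this in practice (the `du` target assumes a SageMaker notebook instance, where the resizable user volume is mounted at /home/ec2-user/SageMaker):

```shell
# Show free space per filesystem; look for a volume at or near 100% use.
df -kh

# If the user volume is the culprit, list the biggest offenders under it.
# The path is the SageMaker notebook-instance user volume mount point.
du -sh /home/ec2-user/SageMaker/* 2>/dev/null | sort -rh | head
```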
You will want to try restarting or killing the kernel. Sometimes cells are left running while you are trying to execute another operation. Log files, if you have any, can also eat up the space, so remove any temporary files you are not using.
I work for AWS & my opinions are my own
Often this is seen in cases when there are certain unused resources running.
The default FS size is 100GB.
If you are using SageMaker Studio, you can use [this][1] JupyterLab extension to automatically shut down kernels, terminals, and apps in SageMaker Studio when they have been idle for a stipulated period of time. You can configure the idle time limit through the user interface this extension provides.
[1]: https://github.com/aws-samples/sagemaker-studio-auto-shutdown-extension
You can resize the persistent storage of your notebook up to 16 TB by editing the notebook details in the AWS console. This volume is mounted under /home/ec2-user/SageMaker. Download your files under this folder and you'll see all the storage you allocated.
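To illustrate, a hedged sketch of pulling S3 data onto that volume (the bucket name and prefix are placeholders):

```shell
# Download into the resizable user volume, not the root filesystem.
# s3://my-bucket/my-dataset/ is a placeholder for your actual bucket.
mkdir -p /home/ec2-user/SageMaker/data
aws s3 cp s3://my-bucket/my-dataset/ /home/ec2-user/SageMaker/data/ --recursive
```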

Pyspark job freezes with too many vcpus

TLDR: I have a PySpark job that finishes in 10 minutes when I run it on an EC2 instance with 16 vCPUs, but freezes (it doesn't fail, it just never finishes) if I use an instance with over 20 vCPUs. I have tried everything I could think of and I just don't know why this happens.
Full story:
I have around 200 small PySpark jobs that, for cost and flexibility reasons, I execute using AWS Batch with Spark Docker images instead of EMR. Recently I decided to experiment to find the best configuration for those jobs and I realized something weird: a job that finished quickly (around 10 minutes) with 16 vCPUs or less would just never end with 20 or more (I waited for 3 hours). First I thought it could be a problem with Batch or the way the ECS agents manage the task, so I tried running the Docker image on an EC2 instance directly and had the same problem. Then I thought the problem was with the Docker image, so I tried creating a new one:
First one used with spark installed as per AWS glue compatible version (https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz)
New one was ubuntu 20 based with spark installed from the apache mirror (https://apache.mirror.digitalpacific.com.au/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz)
Same thing happened. Then I decided the problem might be with using Docker at all, so I installed everything directly on the EC2 instance; same result. Tried changing the Spark version; same thing happened. Thought it could be a hardware issue with too many threads being blocked, so I switched to an AMD instance; nothing changed. Tried modifying some configurations, such as the amount of memory used by the driver, but it always ends the same way: with 16 vCPUs it works; with more than that, it stalls.
Other details:
According to the logs it always seems to stop at the same point: a Parquet read operation on S3. But the Parquet file is tiny (< 1 MB), so I don't think that is the actual problem.
After that it still has logs sometimes but nothing really useful, just "INFO ContextCleaner: Cleaned accumulator".
I use s3a to read the files from s3.
I don't get any errors or spark logs.
I appreciate any help on the matter!
Stop using the Hadoop 2.7 binaries. They are woefully obsolete, especially for S3 connectivity. Replace all the Hadoop 2.7 artifacts with Hadoop 2.8 ones, or, preferably, Hadoop 3.2 or later, with consistent dependencies.
Set `spark.hadoop.fs.s3a.experimental.fadvise` to `random`.
If you still see problems, see if you can replicate them on hadoop 3.3.x, and if so: file a bug.
(Advice correct as of 2021-03-09; the longer it sits on SO unedited, the less it should be believed.)
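Applied at submit time, the two suggestions above might look like this (the artifact version is an assumption; match `hadoop-aws` to the Hadoop version your Spark build bundles):

```shell
# Pull modern S3A artifacts and set the fadvise policy at submit time.
# hadoop-aws:3.2.2 is an example version; my_job.py is a placeholder.
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.2.2 \
  --conf spark.hadoop.fs.s3a.experimental.fadvise=random \
  my_job.py
```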

AWS Sagemaker: Jupyter Notebook kernel keeps dying

I get disconnect every now and then when running a piece of code in Jupyter Notebooks on Sagemaker. I usually just restart my notebook and run all the cells again. However, I want to know if there is a way to reconnect to my instance without having to lose my progress. At the minute, it shows that there is "No Kernel" at the bottom bar, but my file seems active in the kernel sessions tab. Can I recover my notebook's variables and contents? Also, is there a way to prevent future kernel disconnections?
Note that I reverted back to tornado = 5.1.1, which seems to decrease the number of disconnections, but it still happens every now and then.
Often, disconnections are caused by inactivity when a job runs for a long time with no user input. If pre-processing is what takes so long, you could increase the instance size of the processing job so that it executes faster, or increase the instance count. If you're using EMR, since December 2021 you can run an EMR Spark query directly on the EMR cluster:
https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sagemaker-studio-data-notebook-integration-emr/
There's a useful blog here https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/ which is helpful in getting you up and running.
Please let me know if you need more information, or vote for the answer if it's useful. :-)
For me, a quick solution was to open a terminal instead, save the notebook as a Python file, and run it from the terminal within SageMaker.

How to back-up local SSD in Google Cloud Platform

Local SSDs are the fastest storage available in Google Cloud Platform, which makes it clear why people would want to use it. But it comes with some severe drawbacks, most notably that all data on the SSDs will be lost if the VM is shut down manually.
It also seems to be impossible to take an image of a local SSD and restore it to a new disk after a restart.
What then is the recommended way to back-up local SSD data if I have to shut down the VM (if I want to change the machine specs for example)?
So far I can only think of manually copying my files to a bucket or a persistent disk.
If you have a solution to this, would it work if multiple local SSDs were mounted into a single logical volume?
I'm guessing that you want to create a backup of data stored on local SSD every time you shut down the machine (or reboot).
To achieve this (as @John Hanley commented) you have to copy the data, either manually or with a script, to other storage (persistent disk, bucket, etc.).
If you're running Linux:
Here's a great answer on how to run a script at reboot / shutdown. You can then create a script that copies all the data to some more persistent storage.
If I were you, I'd just use rsync and cron. Run it every hour or day (depending on your use case). Here's another great example of how to use rsync to synchronize folders.
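A minimal sketch of that approach, assuming hypothetical mount points for the local SSD and a persistent-disk backup target:

```shell
#!/bin/sh
# backup.sh -- sync the local-SSD mount to a persistent-disk mount.
# /mnt/localssd and /mnt/pd-backup are hypothetical mount points.
# -a preserves permissions/timestamps; --delete keeps the copy exact.
rsync -a --delete /mnt/localssd/ /mnt/pd-backup/
```

Schedule it with a crontab entry such as `0 * * * * /usr/local/bin/backup.sh` for hourly runs. If the target is a bucket rather than a persistent disk, `gsutil rsync -r` is the equivalent tool. This also works when multiple local SSDs are combined into one logical volume, since rsync only sees the mounted filesystem, not the underlying devices.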
If you're running Windows:
It is also possible to run a command at Windows shutdown, and here's how.

Not able to run Airflow in Cloud Run, getting a disk I/O error

I am trying to run Airflow in Google Cloud Run.
I am getting a disk I/O error; I guess a disk write permission is missing.
Can someone please help me with how to grant write permission inside Cloud Run?
I also have to write a file and later delete it.
Only the directory /tmp is writable in Cloud Run. So, change the default write location to write into this directory.
However, you have to be aware of 2 things:
Cloud Run is stateless, which means that when a new instance is created, the container starts from scratch with an empty /tmp directory.
The /tmp directory is an in-memory file system. The maximum allowed memory on Cloud Run is 2 GB, your app's memory footprint included. Between your files and Airflow itself, I'm not sure you will have a lot of space.
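The write-then-delete pattern the question asks about is then just ordinary file handling against /tmp, for example:

```shell
# Write, use, and remove a scratch file under /tmp. On Cloud Run /tmp is
# in-memory, so every byte written here counts against the memory limit.
tmpfile=$(mktemp /tmp/airflow-scratch.XXXXXX)
echo "intermediate data" > "$tmpfile"
cat "$tmpfile"
rm -f "$tmpfile"
```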
A final remark: Cloud Run is active only while it processes a request, and a request has a maximum timeout of 15 minutes. When there is no request, the allowed CPU is close to 0%. I'm not sure what you want to achieve with Airflow on Cloud Run, but my feeling is that your design is strange. I prefer to warn you before you spend too much effort on this.
EDIT 1:
Cloud Run service has evolved in the right way. In 2022,
/tmp is no longer the only writable directory (you can write everywhere, but it's still in memory)
the timeout is no longer limited to 15 minutes, but to 60 minutes
The 2nd gen runtime execution (still in preview) allows you to mount NFS (Filestore) or Cloud Storage (GCSFuse) volume to have services "more stateful".
You can also execute jobs now. So, a lot of very great evolution!
My impression is that you are getting a disk I/O error because you are using SQLite. Is that possible?
If you want to run Airflow using containers, I would recommend using Postgres or MySQL as the backend database.
You can also mount the plugins and DAGs folders on some external volume.
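Switching the backend is a matter of pointing Airflow's connection string at the database. A sketch with placeholder credentials; the variable name below is the Airflow 2.x form (older 1.10.x releases used AIRFLOW__CORE__SQL_ALCHEMY_CONN instead):

```shell
# Point Airflow at a Postgres backend instead of the default SQLite file.
# Host, user, password, and database name are all placeholders.
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:secret@db-host:5432/airflow"
```

After setting this, initialize the schema with `airflow db init` (or `airflow db migrate` on newer releases) before starting the scheduler and webserver.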