Python code killed automatically on AWS EC2 G4 instance - amazon-web-services

I have an AWS EC2 g4dn.xlarge instance and I have to run a Python script on it that uses Keras and TensorFlow GPU for image processing.
I have a large number of images, divided into different sets, that I need to process with this script. I'm using Python 3.6 and TensorFlow 2.4, and the code runs recursively to process each set of images, but after running for some time it gets killed automatically without throwing any error.
The instance has a single 16 GB GPU and almost all of it is used by the code, so I thought it might be an OOM issue. I looked up the other instance types available on AWS, but they are too large for my requirement.
90% of the RAM and CPU is free; I'm only utilizing the GPU.
Is there a way to resolve this issue on the current instance?
Thank you!
P.S. I can only use AWS and not GCP or Azure.

I figured it out. By default, TensorFlow allocates all of the GPU memory available on the system whether it needs it or not, which in this case was 16 GB. Because the GPU was over-allocated, the Python script was getting killed after some time.
To make your script use only the memory it actually needs, enable the "Limiting GPU memory growth" feature described in the TensorFlow documentation:
https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Allocate GPU memory on demand instead of grabbing it all up front
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
Add the above snippet at the start of your main function, or, if you don't have a main function, right after all the library imports.
After adding this, only about 50% of the GPU memory is utilised (which was the actual requirement) and the code runs fine without any error or being killed.
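If on-demand growth is not enough and you want a hard ceiling instead, the same TensorFlow guide also describes capping the process to a fixed amount of GPU memory via a logical device configuration. A minimal sketch, assuming a recent TF 2.x API; the 8192 MB limit is only an illustrative value, not something from the question:
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Cap TensorFlow to a fixed slice of the first GPU
        # (8192 MB here is just an example value)
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=8192)])
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # The limit must be set before the GPUs have been initialized
        print(e)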

Related

Getting "[Errno 28] No space left on device" on AWS SageMaker with plenty of storage left

I am running a notebook instance on Amazon SageMaker, and my understanding is that notebooks have 5 GB of storage by default. I am running into [Errno 28] No space left on device on a notebook that worked just fine the last time I tried it. I checked and I'm using approximately 1.5 GB out of 5 GB. I'm trying to download a bunch of files from my S3 bucket, but I get the error before even one file is downloaded. Additionally, the notebook no longer autosaves.
Has anyone run into this and figured out a way to fix it? I've already tried clearing all outputs.
Thanks in advance!
Open a terminal and run `df -kh` to see which filesystem is running out of disk space (or check from a notebook cell as sketched below).
There is a root filesystem of 100 GB, and a user filesystem whose size you can customize (default 5 GB) (doc).
My guess: I have seen the root FS run out of space, especially when using Docker.
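As an illustration, the same check can be done from a notebook cell with the Python standard library; this assumes the usual notebook-instance mount point /home/ec2-user/SageMaker for the user volume:
import shutil

# Compare free space on the root filesystem and the SageMaker user volume
for path in ("/", "/home/ec2-user/SageMaker"):
    usage = shutil.disk_usage(path)
    print(f"{path}: {usage.free / 1e9:.1f} GB free of {usage.total / 1e9:.1f} GB")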
You may want to restart or kill the kernel: sometimes cells are left running while you try to execute another operation. Log files, if you have any, can also eat up space, so remove any auxiliary files you are not using.
I work for AWS & my opinions are my own
This is often seen when unused resources are left running.
The default FS size is 100 GB.
If you are using SageMaker Studio, you can use [this][1] JupyterLab extension to automatically shut down kernels, terminals and apps in SageMaker Studio when they have been idle for a stipulated period of time. You can configure the idle time limit through the user interface the extension provides.
[1]: https://github.com/aws-samples/sagemaker-studio-auto-shutdown-extension
You can resize the persistent storage of your notebook up to 16 TB by editing the notebook details in the AWS console. This volume, however, is mounted under /home/ec2-user/SageMaker. Download your files into this folder and you'll see all the storage you allocated.

Pyspark job freezes with too many vcpus

TLDR: I have a PySpark job that finishes in 10 minutes when I run it on an EC2 instance with 16 vCPUs, but freezes (it doesn't fail, it just never finishes) if I use an instance with more than 20 vCPUs. I have tried everything I could think of and I just don't know why this happens.
Full story:
I have around 200 small PySpark jobs that, for cost and flexibility reasons, I execute using AWS Batch with Spark Docker images instead of EMR. Recently I decided to experiment to find the best configuration for those jobs and I noticed something weird: a job that finished quickly (around 10 minutes) with 16 vCPUs or fewer would just never end with 20 or more (I waited for 3 hours). First I thought it could be a problem with Batch or with the way the ECS agent manages the task, so I ran the Docker container on an EC2 instance directly and had the same problem. Then I thought the problem was the Docker image, so I tried creating a new one:
The first one had Spark installed from the AWS Glue compatible build (https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz)
The new one was Ubuntu 20 based, with Spark installed from the Apache mirror (https://apache.mirror.digitalpacific.com.au/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz)
The same thing happened. Then I suspected Docker itself, so I installed everything directly on the EC2 instance: same result. I tried changing the Spark version: same thing. I thought the hardware might be blocking too many threads, so I switched to an AMD-based instance: nothing changed. I tried modifying some configurations, such as the amount of memory used by the driver, but it always ends the same way: with 16 vCPUs it works, with more than that it stops.
Other details:
According to the logs it always seems to stop at the same point: a Parquet read operation on S3. The Parquet file is super small (< 1 MB), so I don't think that is the actual problem.
After that it sometimes still produces logs, but nothing really useful, just "INFO ContextCleaner: Cleaned accumulator".
I use s3a to read the files from S3.
I don't get any errors or Spark logs.
I appreciate any help on the matter!
Stop using the Hadoop 2.7 binaries. They are woefully obsolete, especially for S3 connectivity. Replace all the Hadoop 2.7 artifacts with Hadoop 2.8 ones, or, preferably, Hadoop 3.2 or later, with consistent dependencies.
Set `spark.hadoop.fs.s3a.experimental.fadvise` to `random` (see the sketch below).
If you still see problems, check whether you can replicate them on Hadoop 3.3.x, and if so, file a bug.
(Advice correct as of 2021-03-09; the longer it stays on SO unedited, the less it should be believed.)
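As an illustration of where that setting would go, here is a minimal sketch of applying it when building the Spark session; the app name and S3 path are placeholders, not values from the question:
from pyspark.sql import SparkSession

# Illustrative session setup; only the fadvise setting comes from the advice above
spark = (
    SparkSession.builder
    .appName("s3a-parquet-read")  # hypothetical app name
    .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
    .getOrCreate()
)

# Example read of a small Parquet file from S3 via the s3a connector
df = spark.read.parquet("s3a://your-bucket/path/to/file.parquet")  # placeholder path
df.show(5)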

How do I fix a memory allocation error in SageMaker without increasing instance size?

How can I resolve memory issues when training a CNN on SageMaker by increasing the number of instances, rather than changing the amount of memory each instance has?
Using a larger instance does work, but I want to solve my problem by distributing across more instances. Using more instances ends up giving me a memory allocation error instead.
Here is the code I am running in a Jupyter notebook cell:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='train_aws.py',
                       role=role,
                       framework_version='1.12.0',
                       training_steps=100,
                       evaluation_steps=100,
                       hyperparameters={'learning_rate': 0.01},
                       train_instance_count=2,
                       train_instance_type='ml.c4.xlarge')
estimator.fit(inputs)
I thought that adding more instances would increase the amount of memory, but it just gave me an allocation error instead.
Adding more instances increases the overall memory available, but not the maximum memory that each individual training instance can use.
Most likely, reducing the batch size in your code will help you recover from the error.
When you create a training job in SageMaker, your code is installed on multiple instances and the data from S3 is copied to those instances as well. Your code accesses the data from the instance's local volume (usually on EBS), in a similar way to how it would train locally.
On each instance, the training job does the following steps:
starts a Docker container optimized for TensorFlow
downloads the dataset
sets up training-related environment variables
sets up the distributed training environment, if configured to use a parameter server
starts asynchronous training
To benefit from the distribution, you should enable TensorFlow's distribution options (see: https://sagemaker.readthedocs.io/en/stable/using_tf.html#distributed-training):
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='train_aws.py',
                          role=role,
                          train_instance_count=2,
                          train_instance_type='ml.c4.xlarge',
                          framework_version='1.12.0',
                          py_version='py3',
                          training_steps=100,
                          evaluation_steps=100,
                          hyperparameters={'learning_rate': 0.01},
                          distributions={'parameter_server': {'enabled': True}})
tf_estimator.fit('s3://bucket/path/to/training/data')
Also note that you can distribute your data from S3 evenly across the training instances, which can make your training faster because each instance sees only part of the data (see the sketch below).
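A minimal sketch of requesting that sharding with the v1 SageMaker Python SDK used above, via the s3_input channel definition; the bucket path is a placeholder:
from sagemaker.session import s3_input

# Shard the S3 objects across the training instances instead of
# replicating the full dataset to every instance
train_input = s3_input('s3://bucket/path/to/training/data',
                       distribution='ShardedByS3Key')
tf_estimator.fit(train_input)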

Selecting google cloud tool for executing demanding python script

Where should I execute a Python script that processes ~7 GB of data available on GCS? The output will be written to GCS as well.
The script was debugged in a Datalab notebook with a small dataset. I would like to scale up the processing. Should I allocate a big machine? I have no idea what machine size (resources) is needed.
Many thanks,
Eila
Just in case: Dataflow can't work for that kind of data processing.
From what I have read about HDF5, it does not seem easily parallelizable (see Parallel HDF5 and the h5py multiprocessing_example), so I'll assume that reading that ~7 GB must be done by a single worker.
If there is no workaround for that, and you do not run into memory issues while processing it on the machine you are already using, I do not see a need to upgrade your Datalab instance (see the chunked-read sketch below for keeping memory usage bounded).
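If memory does become a concern on a single worker, one common pattern is to read the HDF5 dataset in slices rather than all at once. A minimal sketch, assuming the data is a single h5py-readable dataset; the file name, dataset key, and process() function are hypothetical placeholders:
import h5py

def process(block):
    # Placeholder for the real per-slice computation
    print(block.shape)

# Read a large HDF5 dataset in fixed-size slices so only one slice
# is held in memory at a time (names below are placeholders)
with h5py.File('data.h5', 'r') as f:
    dset = f['my_dataset']
    chunk_rows = 10_000
    for start in range(0, dset.shape[0], chunk_rows):
        block = dset[start:start + chunk_rows]  # loads only this slice
        process(block)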

Tasks not executed by the Compute Nodes in Ubuntu CfnCluster image

I'm trying to use CfnCluster 1.2.1 for GPU computing and I'm using a custom AMI based on the Ubuntu 14.04 CfnCluster AMI.
Everything is created correctly in the CloudFormation console, but when I submit a new test task to Oracle Grid Engine with qsub from the master server, it never gets executed from the queue according to qstat. It always stays in status "qw" and never enters state "r".
It seems to work fine with the Amazon Linux AMI (using user ec2-user instead of ubuntu) and the exact same configuration. Also, the master instance announces the number of remaining tasks to the cluster as a metric, and new compute instances are auto-scaled as a result.
What mechanisms does CfnCluster or Oracle Grid Engine provide to further debug this? I took a look at the log files, but didn't find anything relevant. What could be the cause for this behavior?
Thank you,
Diego
Similar to https://stackoverflow.com/a/37324418/704265
From your qhost output, it looks like your machine "ip-10-0-0-47" is properly configured in SGE. However, on "ip-10-0-0-47" sge_execd is either not running or not configured properly. If it were, qhost would report statistics for "ip-10-0-0-47".
I think I found the solution. It seems to be the same issue as the one described in https://github.com/awslabs/cfncluster/issues/86#issuecomment-196966385
I fixed it by adding the following line to the CfnCluster configuration file:
base_os = ubuntu1404
If a custom_ami is specified but no base_os is specified, it defaults to Amazon Linux, which uses a different method to configure SGE. There may be problems in the SGE configuration performed by CfnCluster if base_os and the custom AMI's OS differ.