Heyho!
I'm trying to deploy a machine learning model on Google Cloud Run. On my machine it needs around 2.5 GB, but when I run it on Cloud Run I always get an out-of-memory error, even though I set a 4 GB limit.
I also tried it in a virtual machine limited to 4 GB and everything works perfectly.
Is there a big overhead I don't know about or is it a bug?
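In case it helps narrow things down, here is a minimal sketch of how I could log the peak memory usage inside the container and compare it against the limit (load_model() is just a placeholder for my actual loading code):

import resource

def load_model():
    # Placeholder: replace with the actual model-loading code.
    return None

model = load_model()

# On Linux, ru_maxrss is reported in kilobytes.
peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6
print(f"Peak RSS after loading the model: {peak_gb:.2f} GB")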
Thanks in advance!
Related
I'm very new to cloud and wanted to try using it for machine learning. I set up the notebook with 8 vCPUs, 30 GB of RAM, and a T4 GPU. Somehow it runs slower than when I work on Colab, even when just loading the libraries. I haven't gotten as far as training on the actual dataset (~200k images) because the site eventually just returns error 524. Is there a way to resolve this?
I have an AWS EC2 g4dn.xlarge instance, and I have to run Python code on it that uses Keras and TensorFlow GPU for image processing.
I have a large number of images, divided into different sets, that I need to process using the Python code. I'm using Python 3.6 and TensorFlow 2.4, and the code runs recursively to process each set of images, but after running for some time it gets killed automatically without throwing any error.
The instance I'm using has a single 16 GB GPU, and almost all of it is getting used up by the code, so I thought maybe it's an OOM problem. I looked up the configurations of the other available AWS instances, but they are too large for my requirements.
90% of the RAM and CPU is free; I'm only utilizing the GPU.
Is there a way to resolve this issue on the current instance?
Thank you!
P.S. I can only use AWS and not GCP or Azure.
I figured it out. Apparently, TensorFlow allocates all of the GPU memory available on the system whether it needs it or not, which in this case was 16 GB. Due to that over-allocation of the GPU, the Python script was getting killed automatically after some time.
To make your script use only the memory it actually needs, enable the "Limiting GPU memory growth" option described in the TensorFlow documentation:
https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Ask TensorFlow to grow GPU memory as needed instead of
        # grabbing all of it up front.
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before the GPUs have been initialized
        print(e)
You need to add the above snippet to your main function, or if you don't have a main function, add it right after you have imported all the libraries.
After adding this to the Python code, only 50% of the GPU memory is being utilised (which was the actual requirement) and the code runs fine without any errors or getting killed.
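If you would rather pin TensorFlow to a fixed slice of the GPU instead of letting it grow, the same guide also documents a hard memory cap. A minimal sketch, using the tf.config API names from recent versions of the guide; the 8192 MB limit is just an example value to adjust:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Expose a single logical GPU capped at roughly 8 GB of the physical card.
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=8192)])
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)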
TL;DR: I have a PySpark job that finishes in 10 minutes when I run it on an EC2 instance with 16 vCPUs, but freezes (it doesn't fail, it just never finishes) if I use an instance with 20 or more vCPUs. I have tried everything I could think of and I just don't know why this happens.
Full story:
I have around 200 small PySpark jobs that, for cost and flexibility reasons, I execute using AWS Batch with Spark Docker images instead of EMR. Recently I decided to experiment to find the best configuration for those jobs and I noticed something weird: a job that finished quickly (around 10 minutes) with 16 vCPUs or fewer would just never end with 20 or more (I waited for 3 hours). The first thing I thought was that it could be a problem with Batch or the way the ECS agent manages the task, so I tried running the Docker container on an EC2 instance directly and had the same problem. Then I thought the problem was with the Docker image, so I tried creating a new one:
The first one used Spark installed from the AWS Glue compatible build (https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz)
The new one was Ubuntu 20 based, with Spark installed from the Apache mirror (https://apache.mirror.digitalpacific.com.au/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz)
The same thing happened. Then I decided the problem was with using Docker at all, so I installed everything directly on the EC2 instance and got the same result. I tried changing the Spark version; the same thing happened. I thought it could be a hardware issue with blocking too many threads, so I switched to an AMD-based instance; nothing changed. I tried modifying some configurations, such as the amount of memory used by the driver, but it always ends the same way: with 16 vCPUs it works, with more than that it stops.
Other details:
According to the logs it seems to always stop at the same point: a Parquet read operation on S3, but the Parquet file is tiny (under 1 MB), so I don't think that is the actual problem.
After that there are sometimes still logs, but nothing really useful, just "INFO ContextCleaner: Cleaned accumulator".
I use s3a to read the files from s3.
I don't get any errors or spark logs.
I appreciate any help on the matter!
Stop using the Hadoop 2.7 binaries. They are woefully obsolete, especially for S3 connectivity. Replace all the Hadoop 2.7 artifacts with Hadoop 2.8 ones or, preferably, Hadoop 3.2 or later, with consistent dependencies.
Set `spark.hadoop.fs.s3a.experimental.fadvise` to `random`.
If you still see problems, see if you can replicate them on hadoop 3.3.x, and if so: file a bug.
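For reference, a minimal PySpark sketch of both suggestions, assuming the job builds its own SparkSession; the hadoop-aws version and the s3a path are placeholders to adjust:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-random-fadvise")
    # Pull in a hadoop-aws build that matches the Hadoop libraries on the
    # classpath, instead of the old 2.7 artifacts.
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
    # Random-access read policy for S3A, as suggested above.
    .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
    .getOrCreate()
)

# Placeholder path for the small Parquet file mentioned in the question.
df = spark.read.parquet("s3a://my-bucket/path/to/small.parquet")
df.show()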
(Advice correct as of 2021-03-09; the longer it stays on SO unedited, the less it should be believed.)
I have a relatively small VPS that I use as a remote dev environment:
1 vCore
2 MB of RAM
I plan to have up to 3 dev environments on the VPS. I don't need to run two of them simultaneously, however.
The biggest project is roughly the same size as a small Magento eShop. It actually runs on Python and Django.
The environment runs on Ubuntu + Nginx + uWSGI, but this could be changed.
I can code remotely on the VPS using Eclipse RSE or Codeanywhere.
However, Eclipse Che offers very interesting functionality for this type of remote environment.
The main risk is that the VPS configuration is very small. It is exactly the minimal configuration stated in the doc. I don't know if I can use it this way without making things really slow...
My instinct is "no": I don't think 2 MB of RAM is sufficient, given that the Che workspace server is itself a Java application that needs about 750 MB of RAM. If you are running the workspace server somewhere else and just using the VPS as a compute node for container workspaces, I suspect the answer is still no, as your container OS and language runtime will need more than 2 MB of RAM. If you meant 2 GB of RAM, it's still difficult, but maybe feasible, to run a workspace with a full Django environment on there, using a workspace server running on a separate host.
It sure would be nice to see if you can make it work, though - and I would love to hear if you make it work!
I've been experimenting with Google Colab to work on Python notebooks with team members. However, the VMs that Colab runs on appear to only have ~13GB of RAM. The datasets we're working with require more (64 GB of RAM would be sufficient).
Is there a way to increase the RAM available to Colab notebooks, for example by integrating with other services in the Google Cloud Platform?
Not at the moment, unfortunately.
Unfortunately, it isn't possible to create a swap file either (the Jupyter notebook doesn't have that kind of permission).
It is now possible to get 25 GB of RAM on Colab: after your session crashes from running out of memory, Colab asks whether you want to activate a high-RAM runtime.
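If you want to check how much RAM the current runtime actually has (the ~13 GB vs. 25 GB figures above), a minimal sketch, assuming psutil is available in the runtime (it ships with Colab by default):

import psutil

total_gb = psutil.virtual_memory().total / 1e9
print(f"Total RAM available to this runtime: {total_gb:.1f} GB")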
You can edit (add or remove) the RAM or vCPUs of your instance by powering it off first.