Keras model not using GPU on ai platform training - google-cloud-platform

I have a simple Keras model that I am submitting to Google Cloud AI Platform training, and would like to make use of a GPU for processing.
The job submits and completes successfully.
Looking at the usage statistics, the GPU never goes beyond 0% utilization. However, CPU usage increases as training progresses.
Any idea what might be preventing my model from using the GPU?
Are there any ways to troubleshoot a situation like this?
config.yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
I am using runtime version 1.13, which comes with tensorflow already installed. My additional required packages in my setup.py include:
REQUIRED_PACKAGES = [
    'google-api-core==1.14.2',
    'google-cloud-core==1.0.3',
    'google-cloud-logging==1.12.1',
    'google-cloud-storage==1.18.0',
    'gcsfs==0.2.3',
    'h5py==2.9.0',
    'joblib==0.13.2',
    'numpy==1.16.4',
    'pandas==0.24.2',
    'protobuf==3.8.0',
    'scikit-learn==0.21.2',
    'scipy==1.3.0',
    'Keras==2.2.4',
    'Keras-Preprocessing==1.1.0',
]
Looking at the logs, it looks like the GPU is found:
master-replica-0  Found device 0 with properties:
master-replica-0  name: Tesla K80  major: 3  minor: 7  memoryClockRate(GHz): 0.8235
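To double-check from inside the training code, device placement can be logged and the visible devices listed. A minimal sketch, assuming TensorFlow 1.13 with the standalone Keras 2.2.4 from setup.py (the API differs in TF 2.x):

import tensorflow as tf
from tensorflow.python.client import device_lib
from keras import backend as K  # standalone Keras 2.2.4 with the TF backend

# Log every op's device placement so the AI Platform job logs show whether
# ops actually land on /device:GPU:0.
K.set_session(tf.Session(config=tf.ConfigProto(log_device_placement=True)))

# List the devices TensorFlow can see; a healthy GPU setup should include a
# device of type 'GPU' (the Tesla K80 from the log excerpt above).
print(device_lib.list_local_devices())
print('GPU available:', tf.test.is_gpu_available())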
Update:
The model is using a GPU, but is under-utilized.
Within AI Platform, the utilization graphs in the Job overview page are about 5 minutes behind the activity displayed in the logs.
As a result, your logs could show an epoch being processed, but the utilization graphs can still show 0% utilization.
How I resolved it:
I am using the fit_generator function.
I set use_multiprocessing=True, max_queue_size=10, and workers=5 in fit_generator. I am currently tweaking these parameters to determine what works best, but I now see ~30% utilization on my GPU; a sketch of the call is below.
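For reference, a minimal sketch of that kind of fit_generator call, assuming Keras 2.2.4; the model, generator, and step counts are placeholders, not the actual training code:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def train_generator(batch_size=32):
    # Placeholder generator; yields random batches purely to illustrate the call.
    # With use_multiprocessing=True, Keras recommends a keras.utils.Sequence
    # instead of a plain generator to avoid duplicated batches across workers.
    while True:
        yield (np.random.rand(batch_size, 100),
               np.random.randint(0, 2, (batch_size, 1)))

model = Sequential([Dense(64, activation='relu', input_shape=(100,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

model.fit_generator(
    train_generator(),
    steps_per_epoch=100,        # placeholder
    epochs=10,                  # placeholder
    workers=5,                  # parallel workers producing batches
    use_multiprocessing=True,   # run the workers as processes rather than threads
    max_queue_size=10)          # batches pre-fetched ahead of the GPU

The idea is to keep the GPU fed: the data pipeline, not the model, was the bottleneck, so parallelizing the generator raises utilization.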

Related

Training job runtime exceeded MaxRuntimeInSeconds provided

I would like to train my model for 30 days using an AWS SageMaker training job, but its maximum runtime is 5 days. How can I resume from the earlier job to continue training?
Follow these steps:
Open a support ticket to increase "Longest run time for a training job" to 2419200 seconds (28 days). (This can't be adjusted via Service Quotas in the AWS web console.)
Using the SageMaker Python SDK, when creating an Estimator, set max_run=2419200.
Implement resume-from-checkpoint in your training script (see the sketch after these steps).
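A minimal sketch of steps 2 and 3 on the SDK side, assuming the SageMaker Python SDK v2; the image URI, role, instance type, and S3 paths are placeholders:

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri='<your-training-image-uri>',    # placeholder
    role='<your-sagemaker-execution-role>',   # placeholder
    instance_count=1,
    instance_type='ml.p3.2xlarge',            # placeholder instance type
    max_run=2419200,                          # 28 days, once the quota is raised
    # SageMaker keeps this S3 prefix in sync with checkpoint_local_path inside
    # the container, so the training script can save checkpoints there and
    # reload them when a new job resumes the work.
    checkpoint_s3_uri='s3://<your-bucket>/checkpoints/',
    checkpoint_local_path='/opt/ml/checkpoints',
)

estimator.fit({'training': 's3://<your-bucket>/training-data/'})

The training script itself still has to look in checkpoint_local_path at startup and resume from the latest checkpoint if one exists.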
Also, the questions in #rok's answer are very relevant to consider.
According to the documentation here, the maximum allowed runtime is 28 days, not 5; please check your configuration. You are right, according to the documentation here the maximum runtime for a training job is 5 days. There are multiple things you can do: use more powerful (or multiple) GPUs to reduce the training time, or save checkpoints and restart training from them. Anyway, 30 days looks like a very long training time (with the associated cost); are you sure you need that?
Actually, you could ask for a service quota increase from here, but as you can see, "Longest run time for a training job" is not adjustable there. So I don't think you have any choice other than using checkpoints or more powerful GPUs.

How does TensorBoard get GPU utilization?

We use TensorBoard to analyse GPU details after a training task or something else, and it is always based on log files. I'm just wondering how TensorBoard (or the log file) gets (or records) the GPU utilization. There is no explanation of this online. Does it just record or poll nvidia-smi over and over in a loop and finally calculate the mean value?
GPU utilization in tensorboard
Or, if someone could point me to the relevant documents, please.
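For illustration only, a minimal sketch of the kind of sampling loop the question speculates about, assuming nvidia-smi is on the PATH; this is not a claim about how TensorBoard actually collects its GPU data:

import subprocess
import time

def mean_gpu_utilization(samples=10, interval_s=1.0):
    """Poll nvidia-smi a few times and return the mean GPU utilization in %."""
    readings = []
    for _ in range(samples):
        out = subprocess.check_output(
            ['nvidia-smi',
             '--query-gpu=utilization.gpu',
             '--format=csv,noheader,nounits'])
        # nvidia-smi prints one line per GPU; take the first GPU for simplicity.
        readings.append(float(out.decode().splitlines()[0]))
        time.sleep(interval_s)
    return sum(readings) / len(readings)

if __name__ == '__main__':
    print('Mean GPU utilization over 10 samples: %.1f%%' % mean_gpu_utilization())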

Scheduler node always above 100% utilization

We're running Cloud Composer on a 5 node, n1-standard-2 cluster running composer-1.11.3-airflow-1.10.9. Private IP is enabled. Python 3 is selected. We currently have around 30 DAGs, some containing over 100 tasks. Most DAGs run once a day.
The node running the airflow scheduler workload is consistently running at around 150% CPU utilisation regardless of the number of running tasks. The only way to lower the CPU usage is to remove DAGs until only 5 or 6 remain (obviously not an option). What we've tried:
We have followed this Medium article detailing how to run the scheduler service on a dedicated node; however, we cannot find a configuration that reduces the CPU usage. We've tried a node as powerful as an e2-highcpu-32 running on a 100 GB SSD. Usage remained at 150%.
We've tried to update the airflow.cfg variables to reduce the frequency the dags directory is parsed via settings such as store_serialized_dags and max_threads. Again, this did not have any impact on the CPU usage.
For reference the other nodes all run at 30-70% CPU for the majority of the time, spiking to over 100% for short periods when big DAGs are running. No node has any issue with memory usage, with between 2 GB and 4 GB used.
We plan on adding more DAGs in the future and are concerned the scheduler may become a bottleneck with this current setup. Are there any other configuration options available to reduce the CPU usage to allow for a future increase in the number of DAGs?
Edit in response to Ines' answer:
I'm seeing the CPU usage as a percentage in the Monitoring tab, the node running the scheduler service is coloured orange:
Additionally, when I look at the pod running airflow scheduler this is the CPU usage, pretty much always 100%:
Please have a look at the official documentation, which describes the CPU usage per node metric. Can you elaborate on where you see the percentage values, since the documentation mentions a core time usage ratio:
A chart showing the usage of CPU cores aggregated over all running Pods in the node, measured as a core time usage ratio. This does not include CPU usage of the App Engine instance used for the Airflow UI or Cloud SQL instance. High CPU usage is often the root cause of Worker Pod evictions. If you see very high usage, consider scaling out your Composer environment or changing the schedule of your DAG runs.
In the meantime, there is an ongoing workaround that would be worth trying. You should follow these steps to limit the CPU usage of the syncing pod:
Go to environment configuration page and click view cluster workloads
Click airflow-scheduler, then edit
Find name: gcs-syncd and add:
resources:
  limits:
    cpu: 300m   # some value; you can try with 300m
  requests:
    cpu: 10m
then click save (at the bottom).
Repeat the procedure for airflow-worker.
We also have to edit the airflow-scheduler section of the airflow-scheduler workload. Click to edit the YAML file and, for the airflow-scheduler section, add:
resources:
  limits:
    cpu: 750m
  requests:
    cpu: 300m
It would be great if you could try the aforementioned steps and see if it improves the performance.
Sometimes the bucket's /logs folder can contain a lot of files, which causes gcs-syncd to use a lot of CPU while doing an internal synchronization of the logs. You can try removing some of the oldest logs from the bucket gs://<composer-env-name>/logs. As an example, if you would like to remove all logs from May, use the following command:
gsutil -m rm -r gs://europe-west1-td2-composter-a438b8eb-bucket/logs/*/*/2020-05*
Ideally, the GCE instances shouldn't be running over 70% CPU at all times, or the Composer environment may become unstable under heavy resource usage.

Google cloud ML engine Batch Predict on GPU

If I train with standard_p_100 GPUs and then run batch prediction jobs with the trained models, is there a way for me to specify or request that the batch predictions be performed on GPUs? For comparison, batch prediction over an epoch's worth of training data seems to take 8-10x longer than it does during training, which leads me to suspect that it is not taking advantage of GPUs, and I'm wondering if I have any control over speeding this up.

How can I check system usage when using google cloudml?

I ran a Cloud ML job with BASIC_GPU.
I would like to check the worker's CPU, GPU, and memory usage; is that possible?
The reason for this is that I applied for one GPU, but I want to see the change in GPU usage when I run two jobs (scaleTier: BASIC_GPU).
Thanks.
The CPU and memory utilization charts are available on the ML Engine job page in your Google Cloud Console, and the GPU utilization metrics are under development.