GPU utilization in TensorBoard
We use TensorBoard to analyse GPU details after a training task (or something else), and it is always based on log files. I'm wondering how TensorBoard (or the log file) gets (or records) the GPU utilization. I can't find an explanation of this online. Does it just record or watch nvidia-smi repeatedly in a loop and finally calculate the mean value?
Or, if someone could point me to the relevant documentation, please.
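One common approach (not necessarily what TensorBoard itself does) is exactly what you describe: poll NVML or the nvidia-smi CLI in a loop, record each sample, and average them. A rough sketch of that idea; the sampling interval and sample count are arbitrary assumptions:

# Rough sketch of the "poll nvidia-smi in a loop and average" idea from the
# question. Interval and sample count are arbitrary assumptions.
import subprocess
import time

def mean_gpu_utilization(samples=10, interval_sec=1.0, gpu_index=0):
    readings = []
    for _ in range(samples):
        out = subprocess.check_output([
            "nvidia-smi",
            "--query-gpu=utilization.gpu",
            "--format=csv,noheader,nounits",
            "-i", str(gpu_index),
        ])
        readings.append(float(out.decode().strip()))  # e.g. "37" -> 37.0
        time.sleep(interval_sec)
    return sum(readings) / len(readings)

print("mean GPU utilization: %.1f%%" % mean_gpu_utilization())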
Related
We have a cluster set up with HDP, and we use it to run a process that takes ~40 h, going through different tasks and stages. I would like to know the highest HDFS disk usage during this period, and at what time it occurred. The Ambari Dashboard (v2.7.4.0) and the NameNode UI show the current HDFS disk usage, but I can't find an option to show it over time (even though CPU and memory usage have such an option with nice graphs). Does anyone know whether it is possible to gather such statistics?
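If the Ambari Metrics Service (AMS) is running on the cluster, one option is to query its collector REST API for the NameNode capacity metric over the job's time range and look for the peak. A rough sketch; the collector host, port, appId, metric name, and response shape below are assumptions and may differ on your cluster:

# Rough sketch: query the Ambari Metrics collector for HDFS capacity over time.
# The host, port (6188 is a common AMS default), metric name and appId are
# assumptions -- check your own AMS configuration.
import time
import requests

COLLECTOR = "http://ams-collector.example.com:6188"  # hypothetical host
METRIC = "dfs.FSNamesystem.CapacityUsed"             # assumed metric name
APP_ID = "namenode"

end_ms = int(time.time() * 1000)
start_ms = end_ms - 40 * 3600 * 1000                 # last ~40 hours

resp = requests.get(COLLECTOR + "/ws/v1/timeline/metrics",
                    params={"metricNames": METRIC, "appId": APP_ID,
                            "startTime": start_ms, "endTime": end_ms})
resp.raise_for_status()

for series in resp.json().get("metrics", []):
    points = series.get("metrics", {})               # timestamp(ms) -> value
    if points:
        peak_ts, peak_val = max(points.items(), key=lambda kv: kv[1])
        print("peak %s = %s at timestamp %s" % (METRIC, peak_val, peak_ts))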
I have a simple Keras model that I am submitting to Google Cloud AI Platform training, and would like to make use of a GPU for processing.
The job submits and completes successfully.
Looking at the usage statistics, the GPU never goes beyond 0% utilization. However, CPU usage increases as training progresses.
Any idea on what might be wrong in making my model work with a GPU?
Are there any ways that I might be able to troubleshoot a situation like this?
config.yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
I am using runtime version 1.13, which comes with tensorflow already installed. My additional required packages in my setup.py include:
REQUIRED_PACKAGES = [
    'google-api-core==1.14.2',
    'google-cloud-core==1.0.3',
    'google-cloud-logging==1.12.1',
    'google-cloud-storage==1.18.0',
    'gcsfs==0.2.3',
    'h5py==2.9.0',
    'joblib==0.13.2',
    'numpy==1.16.4',
    'pandas==0.24.2',
    'protobuf==3.8.0',
    'scikit-learn==0.21.2',
    'scipy==1.3.0',
    'Keras==2.2.4',
    'Keras-Preprocessing==1.1.0',
]
Looking at the logs, it looks like the GPU is found:
master-replica-0 Found device 0 with properties:
master-replica-0 name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
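One generic TF 1.x check (not specific to AI Platform) is to enable device-placement logging, so the job logs show whether ops are actually placed on the GPU that was found:

# Generic TF 1.x troubleshooting sketch: log where each op is placed, to
# confirm the graph really runs on the GPU reported in the logs.
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto(log_device_placement=True)
K.set_session(tf.Session(config=config))
# ...build and fit the Keras model as usual; the logs will then contain
# per-op placement lines ending in device:GPU:0 (or device:CPU:0).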
Update:
The model is using a GPU, but is under-utilized.
Within AI Platform, the utilization graphs in the Job overview page are about 5 minutes behind the activity displayed in the logs.
As a result, your logs could show an epoch being processed, but the utilization graphs can still show 0% utilization.
How I resolved it:
I am using the fit_generator function.
I set use_multiprocessing=True, max_queue_size=10, and workers=5 (the Keras argument names for multiprocessing, queue length, and worker count). I am still tweaking these parameters to determine what works best, but I now see ~30% utilization on my GPU.
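For reference, a minimal sketch of those fit_generator settings; the generator and the step/epoch counts below are placeholders:

# Minimal sketch of the settings described above; `train_generator`,
# steps_per_epoch and epochs are placeholders.
model.fit_generator(
    train_generator,
    steps_per_epoch=1000,
    epochs=10,
    workers=5,                 # number of parallel generator workers
    use_multiprocessing=True,  # run the workers as separate processes
    max_queue_size=10,         # batches prefetched ahead of the GPU
)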
I tried running my job with the BASIC_GPU scale tier, but I got an out-of-memory error. I then tried running it with a custom configuration, but I can't find a way to use just one Nvidia K80 with additional memory. All the examples and predefined options use a number of GPUs, CPUs, and workers, and my code is not optimized for that. I just want one GPU and additional memory. How can I do that?
GPU memory is not extensible currently (until something like NVIDIA's Pascal becomes accessible).
Reducing the batch size solves some of the out-of-memory issues; see the sketch below.
Adding GPUs to workers doesn't help either, as the model is deployed on each worker separately (there is no memory pooling between workers).
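As a concrete illustration of the batch-size point, an out-of-memory error can often be avoided simply by lowering batch_size in the fit call; the numbers below are placeholders, not a recommendation:

# Illustration of the batch-size suggestion: a smaller batch_size lowers the
# peak GPU memory needed per training step. Values are placeholders.
model.fit(
    x_train, y_train,
    batch_size=32,   # e.g. halve it (128 -> 64 -> 32) until the OOM disappears
    epochs=10,
)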
I ran a Cloud ML job with BASIC_GPU.
I would like to check the worker's CPU, GPU, and memory usage; is that possible?
The reason is that I applied for one GPU, but I want to see how GPU usage changes when I run two jobs (scaleTier: BASIC_GPU).
Thanks.
The CPU and memory utilization charts are available on the ML Engine job page on your Google Cloud Console, and the GPU utilization metrics are under development.
I am using Google Cloud ML for training jobs. I observe a peculiar behavior: the time taken for the training job to complete varies for the same data. I analyzed the CPU and memory utilization in the Cloud ML console and see very similar utilization in both cases (7 min and 14 min).
Can anyone tell me what could cause the service to take an inconsistent amount of time to complete the job?
I have the same parameters and data in both cases, and I verified that the time spent in the PREPARING phase is pretty much the same in both cases.
Also, does it matter that I schedule multiple independent training jobs simultaneously on the same project? If so, I would like to know the rationale behind it.
Any help would be greatly appreciated.
The easiest way is to add more logging to inspect where the time was spent. You can also inspect training progress using TensorBoard. There is no VM sharing between multiple jobs, so it's unlikely to be caused by simultaneous jobs.
Also, the running time should be measured from the point when the job enters the RUNNING state. Job startup latency varies depending on whether it's a cold or warm start (i.e., we keep the VMs from a previous job running for a while).
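One simple way to add that logging in Keras is a callback that records per-epoch wall time, which helps pin a slow run on training itself rather than startup. A generic sketch, not specific to Cloud ML:

# Generic sketch: log per-epoch wall time so slow runs can be localized.
import time
import keras

class EpochTimer(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        print("epoch %d took %.1f s" % (epoch, time.time() - self._start))

# model.fit(..., callbacks=[EpochTimer()])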