I ran a Cloud ML job with BASIC_GPU.
I would like to check the worker's CPU, GPU, and memory usage. Is that possible?
The reason is that I requested one GPU, but I want to see how GPU usage changes when I run two jobs at once (scaleTier: BASIC_GPU).
Thanks.
The CPU and memory utilization charts are available on the ML Engine job page in the Google Cloud Console; GPU utilization metrics are still under development.
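If you want to check programmatically, here is a minimal sketch (Python, using the google-cloud-monitoring client) that pulls a job's CPU utilization time series from Stackdriver Monitoring. The metric type and the job_id label are assumptions based on the ml.googleapis.com metric namespace, and YOUR_PROJECT_ID / YOUR_JOB_ID are placeholders; verify the exact names in the Monitoring metrics list.

```python
import time

from google.cloud import monitoring_v3

project_name = "projects/YOUR_PROJECT_ID"  # placeholder project ID
job_id = "YOUR_JOB_ID"                     # placeholder training job ID

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

# Assumed metric type; verify it in Monitoring > Metrics explorer.
metric_filter = (
    'metric.type = "ml.googleapis.com/training/cpu/utilization" '
    f'AND resource.labels.job_id = "{job_id}"'
)

for series in client.list_time_series(
    request={
        "name": project_name,
        "filter": metric_filter,
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
):
    for point in series.points:
        print(point.interval.end_time, point.value.double_value)
```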
I have a Compute Engine instance in Google Cloud Platform with the monitoring agent installed on it. The memory detail chart shows a component labeled Disk Data (cached) that is using more than 50% of my total available memory.
What is this Disk Data (cached), and why is it taking that much memory?
I assume it's a Linux VM? That's just normal Linux behavior: the kernel uses spare memory for disk caching and releases it whenever applications need it, so the cache is unobtrusive.
See https://www.linuxatemyram.com/ for an overview.
This post on how to check and interpret memory usage with the free utility might also be useful: https://andythemoron.com/blog/2017-04-23/Understanding-Linux-Memory-Usage
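To see the same thing from inside the VM, here is a minimal sketch that reads /proc/meminfo on Linux and separates the page cache from memory that is genuinely unavailable. The field names are standard /proc/meminfo fields, reported in kB:

```python
# Read /proc/meminfo and show that "cached" memory is reclaimable,
# so it still counts toward what applications can actually use.
meminfo = {}
with open("/proc/meminfo") as f:
    for line in f:
        key, rest = line.split(":", 1)
        meminfo[key] = int(rest.split()[0])  # values are in kB

total = meminfo["MemTotal"]
cached = meminfo["Cached"]
# MemAvailable exists on kernels >= 3.14; fall back to an estimate.
available = meminfo.get("MemAvailable", meminfo["MemFree"] + cached)

print(f"disk cache: {cached / total:.0%} of RAM (reclaimed on demand)")
print(f"available to applications: {available / total:.0%} of RAM")
```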
I was running a load test but saw only ~20% CPU on GCE. That was surprising, so I decided to SSH into my machine and run top, which showed 99.7% utilization.
I found a question that was very similar: Google Cloud Compute engine CPU usage shows 100% but dashboard only shows 10% usage
However, I am certain I only have one core (1 vCPU, 3.75 GB memory).
Here is top running, showing 99.7% utilization:
What could be the reason for this?
Even in the single-core case, where the workload is not split between several cores, the shape and the Y-values of the CPU Utilization chart depend on the aggregation and alignment settings you use: max versus mean, a 1m versus a 1h alignment period, and so on. For a short peak load, a wide time window acts as a large denominator for the mean aligner, so the chart shows far lower values than top does; see the toy calculation after the links below.
For more details please see:
Google Monitoring > Documentation > Alignment
Google Cloud Blog > Stackdriver tips and tricks: Understanding metrics and building charts
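Here is the toy calculation mentioned above. With made-up per-minute samples, a one-minute burst at 100% inside a ten-minute alignment window almost disappears under the mean aligner but survives under the max aligner:

```python
# Ten per-minute CPU% samples: one busy minute, nine nearly idle ones.
samples = [100.0] + [2.0] * 9

mean_aligned = sum(samples) / len(samples)  # 10m window, mean aligner
max_aligned = max(samples)                  # 10m window, max aligner

print(f"mean aligner: {mean_aligned:.1f}%")  # 11.8% -> chart looks idle
print(f"max aligner:  {max_aligned:.1f}%")   # 100.0% -> matches top
```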
I tried running my job with the BASIC_GPU scale tier, but I got an out-of-memory error. I then tried a custom configuration, but I can't find a way to use just one NVIDIA K80 with additional memory. All the examples and predefined options use several GPUs, CPUs, and workers, and my code is not optimized for that. I just want one GPU and additional memory. How can I do that?
GPU memory is not currently extensible (until hardware like NVIDIA's Pascal generation becomes accessible).
Reducing the batch size solves some of the out-of-memory issues; see the sketch below.
Adding GPUs to workers doesn't help either, as the model is deployed on each worker separately (there is no memory pooling between workers).
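As a rough illustration of why a smaller batch helps, here is some back-of-the-envelope arithmetic. The per-example activation footprint is a made-up number, and real usage also includes weights, gradients, and framework overhead:

```python
K80_MEMORY_GB = 12                    # one K80 die exposes roughly 12 GB
BYTES_PER_FLOAT = 4
ACTIVATIONS_PER_EXAMPLE = 50_000_000  # hypothetical model footprint

def activation_gb(batch_size):
    """Approximate activation memory for one forward pass, in GB."""
    return batch_size * ACTIVATIONS_PER_EXAMPLE * BYTES_PER_FLOAT / 1e9

for batch in (128, 64, 32):
    need = activation_gb(batch)
    verdict = "fits" if need < K80_MEMORY_GB else "out of memory"
    print(f"batch {batch}: ~{need:.1f} GB of activations -> {verdict}")
```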
I am using Google Cloud ML for training jobs, and I observe peculiar behavior: the time taken for a training job to complete varies for the same data. I analyzed the CPU and memory utilization in the Cloud ML console and see very similar utilization in both cases (7 min and 14 min).
Can anyone tell me why the service takes inconsistent amounts of time to complete the same job?
I use the same parameters and data in both cases, and I have verified that the time spent in the PREPARING phase is pretty much the same in both.
Also, does it matter that I schedule multiple independent training jobs simultaneously on the same project? If so, I would like to know the rationale behind it.
Any help would be greatly appreciated.
The easiest way is to add more logging to inspect where the time was spent, as sketched below. You can also inspect training progress using TensorBoard. There is no VM sharing between multiple jobs, so it's unlikely to be caused by simultaneous jobs.
Also, the running time should be measured from the point when the job enters the RUNNING state. Job startup latency varies depending on whether it's a cold or warm start (i.e., we keep the VMs from a previous job running for a while).
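Here is a minimal sketch of the "add more logging" suggestion, using only the standard library: wrap each phase of the training script in a timer so the job logs show where the extra minutes go. The phase names and sleeps are stand-ins for real work:

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)

@contextmanager
def timed(phase):
    """Log how long the wrapped block took."""
    start = time.time()
    yield
    logging.info("%s took %.1fs", phase, time.time() - start)

with timed("data loading"):
    time.sleep(0.1)   # stand-in for reading training data
with timed("training loop"):
    time.sleep(0.2)   # stand-in for the actual training steps
```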
Google Compute Engine often recommends upgrading the CPU or memory of the VMs I am using. I can see the CPU graph of the instance, so I can imagine where it gets the idea that I should upgrade the CPU, but there is no such graph for RAM. How does it know when to recommend upgrading the RAM?
You can use the companion Google Stackdriver app, which by default collects more metrics for all your instances. The URL for the metrics of a single instance is
https://app.google.stackdriver.com/instances/<INSTANCE_ID>?project=<PROJECT_ID>
The hypervisor gives us some idea of the number of RAM pages that have been used, and we can make a recommendation based on that. However, since Google does not inspect the RAM in any case, we don't actually know what those pages are used for, merely how many are backed by physical RAM. Installing the Stackdriver agent lets us get better information from the guest OS about what it considers to be in use, and in that case we can make better recommendations, including cost-saving downsizes. The docs page [1] talks about this a little, although it could probably go into more detail about the memory usage signals.
[1] https://cloud.google.com/compute/docs/instances/viewing-sizing-recommendations-for-instances
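If the agent is installed, you can also pull the memory metric it reports yourself, following the same list_time_series pattern as the earlier sketch. The metric type agent.googleapis.com/memory/percent_used and its state label are assumptions to verify against the agent metrics list; YOUR_PROJECT_ID is a placeholder:

```python
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

for series in client.list_time_series(
    request={
        "name": "projects/YOUR_PROJECT_ID",  # placeholder
        # Assumed agent metric; "state" distinguishes used/cached/free.
        "filter": 'metric.type = "agent.googleapis.com/memory/percent_used"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
):
    state = series.metric.labels.get("state", "?")
    latest = series.points[0].value.double_value  # newest point first
    print(f"{state}: {latest:.1f}%")
```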