What exactly does the NVPROF power profile measure?

I have used nvprof to get the power profile of a Kepler-architecture NVIDIA GPU. My question is: what exactly are we seeing? If I understand correctly, there is a 12V and a 3.3V rail feeding the GPU, and the GPU can also draw power from the PCIe bus. Are the nvprof power samples a sum of the three? Or something else?
Thanks,

It's the sum of all power consumed by the GPU from all of its power rails/sources.
It is intended to be a measurement comparable to what nvidia-smi would report, or to what is available through NVML (upon which nvidia-smi is built).
It should be comparable to NVML's power-usage query.
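For reference, here is a minimal sketch of reading that same figure through NVML's Python bindings; it assumes the nvidia-ml-py package (the pynvml module) is installed and that the board reports total power draw via nvmlDeviceGetPowerUsage, which is the call nvidia-smi's power column is generally based on.

```python
# pip install nvidia-ml-py   (provides the pynvml module)
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU in the system
    # nvmlDeviceGetPowerUsage returns milliwatts for the whole board,
    # i.e. all rails/sources combined -- the same figure nvidia-smi shows.
    power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)
    print("Board power draw: %.1f W" % (power_mw / 1000.0))
finally:
    pynvml.nvmlShutdown()
```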

Related

Which GPU should I use on Google Cloud Platform (GCP)

Right now I'm working on my master's thesis, and I need to train a huge Transformer model on GCP. The fastest way to train deep learning models is to use a GPU, so I was wondering which GPU I should use among the ones provided by GCP. The ones available at the moment are:
NVIDIA® A100
NVIDIA® T4
NVIDIA® V100
NVIDIA® P100
NVIDIA® P4
NVIDIA® K80
It all depends on what characteristics you're looking for.
First, let's collect some information about these different GPU models and see which one suits you best. You can google each model's name and see its characteristics. I did that and I created the following table:
Model             | FP32 (TFLOPS) | Price ($/hr) | TFLOPS/dollar
------------------|---------------|--------------|--------------
Nvidia A100       | 19.5          | 2.933908     | 6.65
Nvidia Tesla T4   | 8.1           | 0.35         | 23.14
Nvidia Tesla P4   | 5.5           | 0.6          | 9.17
Nvidia Tesla V100 | 14            | 2.48         | 5.65
Nvidia Tesla P100 | 9.3           | 1.46         | 6.37
Nvidia Tesla K80  | 8.73          | 0.45         | 19.4
In the table above, you can see:
FP32: 32-bit (single-precision) floating point, a measure of how fast the card is at single-precision floating-point operations. It's measured in TFLOPS (tera floating-point operations per second). The higher, the better.
Price: the hourly price on GCP.
TFLOPS/dollar: simply how much compute you get for one dollar.
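If you want to recompute or extend the TFLOPS-per-dollar column yourself (for example when GCP prices change), here is a quick sketch; the figures are just the ones from the table above and will go stale.

```python
# FP32 throughput (TFLOPS) and hourly GCP price (USD) taken from the table above.
gpus = {
    "Nvidia A100":       (19.5, 2.933908),
    "Nvidia Tesla T4":   (8.1,  0.35),
    "Nvidia Tesla P4":   (5.5,  0.6),
    "Nvidia Tesla V100": (14.0, 2.48),
    "Nvidia Tesla P100": (9.3,  1.46),
    "Nvidia Tesla K80":  (8.73, 0.45),
}

# Sort by TFLOPS per dollar, best value first.
for name, (tflops, price) in sorted(gpus.items(),
                                    key=lambda kv: kv[1][0] / kv[1][1],
                                    reverse=True):
    print(f"{name:<18} {tflops:>5.2f} TFLOPS  ${price:.2f}/h  {tflops / price:6.2f} TFLOPS/$")
```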
From this table, you can see:
Nvidia A100 is the fastest.
Nvidia Tesla P4 is the slowest.
Nvidia A100 is the most expensive.
Nvidia Tesla T4 is the cheapest.
Nvidia Tesla T4 has the highest operations per dollar.
Nvidia Tesla V100 has the lowest operations per dollar.
And you can observe that clearly in the following figure:
I hope that was helpful!
Nvidia says that using the most modern, powerful GPUs is not only faster, it also ends up being cheaper: https://developer.nvidia.com/blog/saving-time-and-money-in-the-cloud-with-the-latest-nvidia-powered-instances/
Google came to a similar conclusion (this was a couple of years ago before the A100 was available): https://cloud.google.com/blog/products/ai-machine-learning/your-ml-workloads-cheaper-and-faster-with-the-latest-gpus
I guess you could make an argument that both Nvidia and Google could be a little biased in making that judgement, but they are also well placed to answer the question and I see no reason not to trust them.

How to use Google Cloud Compute Engine All-Core Turbo Frequency?

I have a c2-standard-4 (4 vCPUs, 16 GB RAM) Compute Engine instance running, which supports a 3.1 GHz base CPU frequency and a 3.8 GHz all-core turbo frequency. I cannot reach the all-core turbo frequency even when I am only using 3 of the 4 cores. Is there something I need to change in the VM settings?
Thanks in advance for the help!
AFAIK, the turbo frequency is based on the Turbo Boost technology of Intel CPUs. The core frequency is set by the hardware according to the type of computation being performed.
So it's not a VM setting to change, but maybe a code update so that threads are dedicated to high-performance computing (float and integer vector multiplication).
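One way to check whether the cores actually clock up under load is to watch the reported per-core frequency while a few busy loops run. A rough sketch using psutil follows; frequency reporting inside a VM can be limited or missing, so treat the numbers as indicative only.

```python
# pip install psutil
import multiprocessing
import time

import psutil


def burn():
    # Simple floating-point busy loop to give the cores something to boost on.
    x = 1.0001
    while True:
        x = x * x % 1e9


if __name__ == "__main__":
    # Load 3 of the 4 vCPUs, as in the question.
    workers = [multiprocessing.Process(target=burn, daemon=True) for _ in range(3)]
    for w in workers:
        w.start()

    for _ in range(5):
        time.sleep(1)
        freqs = psutil.cpu_freq(percpu=True)  # may be empty inside some VMs
        if freqs:
            print(["%.0f MHz" % f.current for f in freqs])
        else:
            print("Per-core frequency is not reported in this environment.")
```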

Tensorflow running at 35% GPU utilization, profiler shows odd cpu activity

I'm running a typical 5-layer convolutional network on the GPU in TensorFlow. When I run on a fast 1080 Ti GPU I get about 35% GPU utilization. On a slower M40 I get 80% utilization, and I get 97% utilization on a 970M mobile GPU.
I've implemented the tf.StagingArea GPU queue and have confirmed with a warning message that the StagingArea is not empty before each training step, i.e. it is being fed asynchronously.
I've run the tensorflow profiler seen below. Notably, the main operations on the GPU appear to complete in 15ms, but then there's a gap between 15ms and 40ms where nothing is registered by the profiler. At 40ms three small CPU operations occur that appear related to the optimizer (global step update).
This behavior is consistent at every step.
Any idea why there's such a delay here?
There is a way to determine what is happening on the CPU inside that interval with the help of Intel VTune Amplifier (the tool is not free, but there are free, fully functional academic and trial versions). You can use the recipe from this article to import timeline data into Intel VTune Amplifier and analyze it there. You will need the Frame Domain / Source Function grouping. Expand the [No frame domain - Outside any frame] row and you will get the list of hotspots happening in the interval you are interested in.
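For context, that import recipe starts from the Chrome-trace timeline that TensorFlow itself can dump. A rough sketch of producing one, assuming TF 1.x session-style code (the tiny matmul graph is just a stand-in for your own training op):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Tiny stand-in graph; replace with your own model / training op.
a = tf.random_normal([1000, 1000])
b = tf.random_normal([1000, 1000])
train_op = tf.matmul(a, b)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(train_op, options=run_options, run_metadata=run_metadata)

# Write a Chrome-trace JSON; this is the timeline format that the VTune
# import recipe consumes (it can also be inspected in chrome://tracing).
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline_step.json", "w") as f:
    f.write(tl.generate_chrome_trace_format())
```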

Same Direct2D application performs better on a "slower" machine

I wrote a Direct2D application that displays a certain number of graphics.
When I run this application it takes about 4 seconds to display 700,000 graphic elements on my notebook:
Intel Core i7 CPU Q 720 1.6 GHz
NVIDIA Quadro FX 880M
According to the Direct2D MSDN page:
Direct2D is a user-mode library that is built using the Direct3D 10.1
API. This means that Direct2D applications benefit from
hardware-accelerated rendering on modern mainstream GPUs.
I was expecting that the same application (without any modification) should perform better on a different machine with better specs. So I tried it on a desktop computer:
Intel Xeon(R) CPU 2.27 GHz
NVIDIA GeForce GTX 960
But it took 5 seconds (1 second more) to display the same graphics (same number and type of elements).
I would like to know how can it be possible and what are the causes.
It's impossible to say for sure without measuring. However, my gut tells me that melak47 is correct: it's not a lack of GPU acceleration, it's a lack of bandwidth. Integrated GPUs have access to the same memory as the CPU, so they can skip the step of transferring bitmaps and drawing commands across the bus to the dedicated graphics memory of a discrete GPU.
With a primarily 2D workload, any GPU will spend most of its time waiting on memory. In your case, the integrated GPU has an advantage. I suspect that the extra second you see is your GeForce waiting on graphics coming across the motherboard bus.
But, you could profile and enlighten us.
Some good points in the comments and other replies (I can't add a comment yet).
Your results don't surprise me, as there are some differences between your two setups.
Let's have a look here: http://ark.intel.com/fr/compare/47640,43122
A shame we can't see the SSE version supported by your Xeon CPU; those are often used for code optimization. Is the model I chose for the comparison even the right one?
There is no integrated GPU in that Core i7, but it has 4 cores + Hyper-Threading = 8 threads, against 2 cores with no Hyper-Threading for the Xeon.
Quadro cards shine when it comes to real-time rendering. As your scene seems to be quite simple, the Quadro could be well optimized for it, but that's just a "maybe" - I'm guessing here... could someone with experience comment on that? :-)
So it's not so simple. What appears to be a better graphics card doesn't necessarily mean better performance. If you have a bottleneck somewhere else, you're screwed!
The difference is small; you should compare every single element of your two setups: CPU, RAM, HDD, GPU, motherboard (PCIe version and chipset).
So again, a lot of guessing - some tests are needed :)
Have fun and good luck ;-)

How can I read the bandwidth in use over the PCIe bus?

I'm working on a streaming-media application that pushes a lot of data to the graphics card at startup. The CPU is doing very little while the data is being pushed; it idles along at close to zero percent usage.
I'd like to monitor which machines struggle to push the initial data and which ones cope, so that I can arrive at a minimum recommended spec for our customers' hardware.
I've found that PCs with PCIe 1.1 x16 slots struggle with the initial data being pushed to the graphics card.
My development PC has a PCIe 2.0 x16 slot and has no problem coping with the large amount of data initially pushed to the graphics card.
I need numbers to prove (or disprove) my point.
What I'd like is to be able to determine:
Which slot type is the graphics card on?
What is the speed of that slot?
Gfx card name
Gfx card driver version
But most importantly, the data flow over the PCIe slot - e.g. if I can show that the PCIe bus is being maxed out with data, I can point to that as the bottleneck.
I know that system memory speed is also a factor here, e.g. the data is being transferred from RAM, over the PCIe bus to the graphics card, so is there a way to determine the system memory speed also?
Finally, I write in unmanaged C++, so accessing .NET libraries is not an option.
For Nvidia GPUs, you can try using NvAPI_GPU_GetDynamicPstatesInfoEx:
Nvidia, through its GeForce driver, exposes a programming interface
("NVAPI") that, among other things, allows for collecting performance
measurements. For the technically inclined, here is the relevant
section in the nvapi.h header file:
FUNCTION NAME: NvAPI_GPU_GetDynamicPstatesInfoEx
DESCRIPTION: This API retrieves the NV_GPU_DYNAMIC_PSTATES_INFO_EX
structure for the specified physical GPU. Each domain's info is
indexed in the array. For example:
pDynamicPstatesInfo->utilization[NVAPI_GPU_UTILIZATION_DOMAIN_GPU] holds the info for the GPU domain. There are currently four domains
for which GPU utilization and dynamic P-state thresholds can be
retrieved: graphic engine (GPU), frame buffer (FB), video engine
(VID), and bus interface (BUS).
Beyond this header commentary, the API's specific functionality isn't
documented. The information below is our best interpretation of its
workings, though it relies on a lot of conjecture.
The graphics engine ("GPU") metric is expected to be your bottleneck in most games. If you don't see this at or close to 100%, something
else (like your CPU or memory subsystem) is limiting performance.
The frame buffer ("FB") metric is interesting, if it works as intended. From the name, you'd expect it to measure graphics memory
utilization (the percentage of memory used). That is not what this is,
though. It appears, rather, to be the memory controller's utilization
in percent. If that's correct, it would measure actual bandwidth being
used by the controller, which is not otherwise available as a
measurement any other way.
We're not as interested in the video engine ("VID"); it's not generally used in gaming, and registers a flat 0% typically. You'd
only see the dial move if you're encoding video through ShadowPlay or
streaming to a Shield.
The bus interface ("BUS") metric refers to utilization of the PCIe controller, again, as a percentage. The corresponding measurement,
which you can trace in EVGA PrecisionX and MSI Afterburner, is called
"GPU BUS Usage".
We asked Nvidia to shed some light on the inner workings of NVAPI. Its
response confirmed that the FB metric measures graphics memory
bandwidth usage, but Nvidia dismissed the BUS metric as "considered
to be unreliable and thus not used internally".
We asked AMD if it had any API or function that allowed for similar
measurements. After internal verification, company representatives
confirmed that they did not. As much as we would like to, we are
unable to conduct similar tests on AMD hardware.
Do you get errors pushing your massive amounts of data, or are you "simply" concerned with slow speed?
I doubt there's any easy way to monitor PCI-e bandwidth usage, if it's possible at all. But it should be possible to query the bus type the video adapter is connected to via WMI and/or SetupAPI - I have no personal experience or helpful links for either, sorry.
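On Nvidia hardware specifically, another route besides the NVAPI call quoted above is NVML, which ships with the driver as a plain C library (nvml.h) and is therefore usable from unmanaged C++; the same entry points are exposed in Python via the nvidia-ml-py (pynvml) bindings. The sketch below shows the relevant queries - card name, driver version, current PCIe link generation/width, and a rough PCIe throughput counter - but support for the throughput counter depends on the GPU generation and driver, so treat that part as an assumption to verify on your hardware.

```python
# pip install nvidia-ml-py   (the pynvml module; equivalent calls exist in nvml.h)
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    name = pynvml.nvmlDeviceGetName(handle)                    # graphics card name
    if isinstance(name, bytes):                                # bytes on older bindings
        name = name.decode()
    driver = pynvml.nvmlSystemGetDriverVersion()               # driver version
    gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)   # e.g. 2 for PCIe 2.0
    width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)      # e.g. 16 for x16

    # Rough PCIe traffic counters in KB/s over a short sampling window;
    # not available on every GPU generation.
    tx = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_TX_BYTES)
    rx = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_RX_BYTES)

    print(f"{name}, driver {driver}, PCIe gen {gen} x{width}")
    print(f"PCIe TX ~{tx} KB/s, RX ~{rx} KB/s")
finally:
    pynvml.nvmlShutdown()
```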