TensorFlow running at 35% GPU utilization, profiler shows odd CPU activity - profiling

I'm running a typical 5-layer convolutional network on the GPU in TensorFlow. When I run on a fast 1080 Ti GPU I get about 35% GPU utilization. On a slower M40 I get 80% utilization, and on a 970M mobile GPU I get 97% utilization.
I've implemented the tf.StagingArea GPU queue and have confirmed with a warning message that the StagingArea is not empty before each training step; it's being fed asynchronously.
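For reference, the staging pattern I'm using looks roughly like this (a simplified sketch; the random tensors stand in for my real input pipeline and model, and StagingArea lives under tf.contrib.staging in TF 1.x):

import tensorflow as tf

# Stand-ins for my real input pipeline (shapes are illustrative)
images = tf.random_uniform([64, 224, 224, 3])
labels = tf.random_uniform([64], maxval=10, dtype=tf.int32)

with tf.device('/gpu:0'):
    # One-slot GPU-resident buffer holding the next batch
    stage = tf.contrib.staging.StagingArea(
        dtypes=[images.dtype, labels.dtype],
        shapes=[images.get_shape(), labels.get_shape()])
    stage_put = stage.put([images, labels])    # refills the buffer
    next_images, next_labels = stage.get()     # consumed by the model
    # train_op = build_model(next_images, next_labels)  # my real model/optimizer

# Each step trains on the staged batch and stages the following one:
# sess.run([train_op, stage_put])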
I've run the TensorFlow profiler, seen below. Notably, the main operations on the GPU appear to complete in 15 ms, but then there's a gap between 15 ms and 40 ms where nothing is registered by the profiler. At 40 ms, three small CPU operations occur that appear related to the optimizer (the global-step update).
This behavior is consistent at every step.
Any idea why there's such a delay here?

There is a way to determine what is happening on the CPU inside that interval with the help of Intel VTune Amplifier (the tool is not free, but there are free fully functional academic and trial versions). You can use the recipe from this article to import timeline data into Intel VTune Amplifier and analyze it there. You will need the Frame Domain / Source Function grouping. Expand the [No frame domain - Outside any frame] row and you will get the list of hotspots happening in the interval you are interested in.
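For reference, the timeline data that recipe imports can be captured with TensorFlow's built-in tracing, roughly like this (a sketch against the TF 1.x session API; sess and train_op come from your existing training code):

import tensorflow as tf
from tensorflow.python.client import timeline

def trace_one_step(sess, train_op, out_path='timeline_step.json'):
    # Run a single training step with full tracing enabled
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(train_op, options=run_options, run_metadata=run_metadata)
    # Write the step as Chrome-trace-format JSON, which the article's recipe imports
    tl = timeline.Timeline(step_stats=run_metadata.step_stats)
    with open(out_path, 'w') as f:
        f.write(tl.generate_chrome_trace_format())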

Related

Same Direct2D application performs better on a "slower" machine

I wrote a Direct2D application that displays a certain number of graphics.
When I run this application it takes about 4 seconds to display 700,000 graphic elements on my notebook:
Intel Core i7 CPU Q 720 1.6 GHz
NVIDIA Quadro FX 880M
According to the Direct2D MSDN page:
Direct2D is a user-mode library that is built using the Direct3D 10.1 API. This means that Direct2D applications benefit from hardware-accelerated rendering on modern mainstream GPUs.
I was expecting that the same application (without any modification) should perform better on a different machine with better specs. So I tried it on a desktop computer:
Intel Xeon(R) CPU 2.27 GHz
NVIDIA GeForce GTX 960
But it took 5 seconds (1 second more) to display the same graphics (same number and type of elements).
I would like to know how this is possible and what the causes are.
It's impossible to say for sure without measuring. However, my gut tells me that melak47 is correct. There is no lack of GPU acceleration; it's a lack of bandwidth. Integrated GPUs have access to the same memory as the CPU. They can skip the step of having to transfer bitmaps and drawing commands across the bus to dedicated graphics memory for the GPU.
With a primarily 2D workload, any GPU will be spending most of its time waiting on memory. In your case, the integrated GPU has an advantage. I suspect the extra second you feel is your GeForce waiting on graphics data coming across the motherboard bus.
But, you could profile and enlighten us.
Some good points in the comments and other replies (I can't add a comment yet).
Your results don't surprise me, as there are some differences between your two setups.
Let's have a look there: http://ark.intel.com/fr/compare/47640,43122
A shame we can't see the SSE version supported by your Xeon CPU; those are often used for code optimization. Is the model I chose for the comparison even the right one?
No integrated GPU in that Core i7, but 4 cores + hyperthreading = 8 threads, against 2 cores with no hyperthreading for the Xeon.
Quadro stuff rocks when it comes to realtime rendering. As your scene seems to be quite simple, it could be well optimized for that, but just "maybe" - I'm guessing here... could someone with experience comment on that? :-)
So it's not so simple. What appears to be a better graphics card doesn't guarantee better performance. If you have a bottleneck somewhere else, you're screwed!
The difference is small, so you must compare every single element of your two setups: CPU, RAM, HDD, GPU, and motherboard with its type of PCI-e and chipset.
So again, a lot of guessing, some tests are needed :)
Have fun and good luck ;-)

CUDA Profiler: Calculate memory and compute utilization

I am trying to establish two overall measurements, memory bandwidth utilization and compute throughput utilization, for my GPU-accelerated application using the CUDA Nsight profiler on Ubuntu. The application runs on a Tesla K20c GPU.
The two measurements I want are to some extent comparable to the ones given in this graph:
The problems are that no exact numbers are given here and, more importantly, that I do not know how these percentages are calculated.
Memory Bandwidth Utilization
The Profiler tells me that my GPU has a Max Global Memory Bandwidth of 208 GB/s.
Does this refer to the Device Memory BW or the Global Memory BW? It says Global, but the first one makes more sense to me.
For my kernel the profiler tells me that the Device Memory Bandwidth is 98.069 GB/s.
Assuming that the max of 208 GB/s refers to the device memory, could I then simply calculate the memory BW utilization as 98.069/208 ≈ 47%? Note that this kernel is executed multiple times without additional CPU-GPU data transfers, so the system BW is not relevant here.
Compute Throughput Utilization
I am not exactly sure what the best way is to put Compute Throughput Utilization into a number. My best guess is to use the Instructions per Cycle to max Instructions per cycle ratio. The profiler tells me that the max IPC is 7 (see picture above).
First of all, what does that actually mean? Each multiprocessor has 192 cores and therefore a maximum of 6 active warps. Wouldn't that mean that the max IPC should be 6?
The profiler tells me that my kernel has issued IPC = 1.144 and executed IPC = 0.907. Should I calculate the compute utilization as 1.144/7 = 16% or 0.907/7 = 13% or none of these?
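To make the arithmetic explicit, here are the candidate calculations with the numbers quoted above (this only reproduces the ratios, it doesn't tell me which metric is right):

# Profiler values quoted in the question (Tesla K20c)
peak_dram_bw = 208.0        # GB/s, max global/device memory bandwidth
measured_dram_bw = 98.069   # GB/s, device memory bandwidth of the kernel
print("Memory BW utilization: %.0f%%" % (100.0 * measured_dram_bw / peak_dram_bw))  # ~47%

max_ipc = 7.0               # reported by the profiler
issued_ipc = 1.144
executed_ipc = 0.907
print("Issued IPC ratio:   %.0f%%" % (100.0 * issued_ipc / max_ipc))    # ~16%
print("Executed IPC ratio: %.0f%%" % (100.0 * executed_ipc / max_ipc))  # ~13%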
Are these two measurements (Memory and compute utilization) giving an adequate first impression of how efficiently my kernel is using the resources? Or are there other important metrics that should be included?
Additional Graph
NOTE: I will try to update this answer with additional details in the future. I do not think all of the individual components of these calculations are easily visible in the Visual Profiler reports.
Compute Utilization
This is the pipeline utilization of the logical pipes: memory, control flow, and arithmetic. The SMs have numerous execution pipes that are not documented. If you look at the instruction throughput charts you can determine at a high level how to calculate utilization. You can read the Kepler or Maxwell architecture documents for more information on the pipelines. A CUDA core is a marketing term for an integer/single-precision floating-point math pipeline.
This calculation is not based upon IPC. It is based upon pipeline utilization and issue cycles. For example, you can be at 100% utilization if you issue 1 instruction/cycle (never dual-issuing). You can also be at 100% if you issue a double-precision instruction at the maximum rate (which depends on the GPU).
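As a toy illustration of that distinction (made-up issue rates, not the profiler's actual formula), IPC can be well below its theoretical maximum while one pipe is completely busy:

# Illustrative numbers only -- not the profiler's internal calculation.
elapsed_cycles = 1000000

# Case A: one single-precision instruction issued every cycle, never dual-issued.
fp32_issued = elapsed_cycles
print("Case A IPC:", fp32_issued / float(elapsed_cycles))                    # 1.0, far below 7
print("Case A FP32 pipe utilization:", fp32_issued / float(elapsed_cycles))  # 1.0 -> 100%

# Case B: double precision issued at its maximum rate on a part whose FP64 pipe
# can only accept an instruction every 3rd cycle (the rate is GPU dependent).
fp64_issue_slots = elapsed_cycles // 3
fp64_issued = fp64_issue_slots
print("Case B IPC:", fp64_issued / float(elapsed_cycles))                    # ~0.33
print("Case B FP64 pipe utilization:", fp64_issued / float(fp64_issue_slots))  # 1.0 -> 100%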
Memory Bandwidth Utilization
The profiler calculates the utilization of L1, TEX, L2, and device memory. The highest value is shown. It is very possible to have very high data path utilization but very low bandwidth utilization.
A memory latency boundedness metric should also be calculated. It is very easy for a program to be bound by memory latency but not by compute utilization or memory bandwidth.

Rotating hundreds of JPEGs in seconds rather than hours

We have hundreds of images which our computer gets at a time and we need to rotate and resize them as fast as possible.
Rotation is done by 90, 180 or 270 degrees.
Currently we are using the command-line tool GraphicsMagick to rotate the images. Rotating one image (5760*3840, ~22 MP) takes around 4 to 7 seconds.
The following Python code sadly gives us comparable results:
import cv2

img = cv2.imread("image.jpg")
# rotate 90 degrees counter-clockwise: transpose, then flip around the x-axis
timg = cv2.flip(cv2.transpose(img), 0)
cv2.imwrite("rotated_counter_clockwise.jpg", timg)
Is there a faster way to rotate the images using the power of the graphics card? OpenCL and OpenGL come to mind, but we are wondering whether the performance increase would be noticeable.
The hardware we are using is fairly limited as the device should be as small as possible.
Intel Atom D525 (1.8 GHz)
Mobility Radeon HD 5430 Series
4 GB of RAM
SSD Vertility 3
The software is Debian 6 with the official (closed-source) Radeon drivers.
You can perform a lossless rotation that will just modify the EXIF section. This will rotate your pictures much faster.
Also have a look at the jpegtran utility, which performs lossless JPEG modifications.
https://linux.die.net/man/1/jpegtran
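A minimal sketch of driving jpegtran from Python over a directory of JPEGs (this assumes jpegtran is on the PATH and only covers the rotation; resizing still needs a decode):

import glob
import subprocess
from multiprocessing import Pool

def rotate_lossless(path):
    # -rotate accepts 90/180/270; -copy all keeps EXIF and other markers
    out = path.rsplit('.', 1)[0] + '_rot90.jpg'
    subprocess.check_call(['jpegtran', '-rotate', '90', '-copy', 'all',
                           '-outfile', out, path])

if __name__ == '__main__':
    pool = Pool(2)  # two workers to match the dual-core Atom
    pool.map(rotate_lossless, glob.glob('*.jpg'))
    pool.close()
    pool.join()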
There is a JPEG no-recompression plugin for IrfanView which, IIRC, can rotate and resize images (in simple ways) without recompressing; it can also run on a directory of images. This should be a lot faster.
The GPU probably wouldn't help; you are almost certainly I/O limited in OpenCV, which isn't really optimised for high-speed file access.
I'm not an expert in JPEG and compression topics, but as your problem is pretty much as I/O limited as it gets (assuming that you can rotate without heavy de/encoding-related computation), you might not be able to accelerate it very much on the GPU you have. (Un)luckily, your reference is a pretty slow Atom CPU.
I assume that the Radeon has separate main memory. This means that data needs to be communicated through PCI-E, which adds extra latency compared to CPU execution, and without latency hiding you can be sure that it is the bottleneck. This is the most probable reason why your code that uses OpenCV on the GPU is slow (besides the fact that you do two memory-bound operations, transpose & flip, instead of a single one).
The key thing is to hide as much of the PCI-E transfer times with computation as possible by using multiple-buffering. Overlapping transfers both to and from the GPU with computation by making use of the full-duplex capability of PCI-E will only work if the card in question has dual-DMA engines like high-end Radeons or the NVIDIA Quadro/Tesla cards -- which I highly doubt.
If your GPU compute time (the time it takes the GPU to do the rotation) is lower than the time the transfer takes, you won't be able to fully overlap. The HD 5430 has a pretty slow memory interface with only 12.8 GB/s peak, and the rotation kernel should be quite memory bound. However, I can only guesstimate, but I would say that if you reach a peak PCI-E transfer rate of ~1.5 GB/s (4x PCI-E AFAIK), the compute kernel will be a few times faster than the transfer and you'll be able to overlap very little.
You can simply time the parts separately without requiring elaborate asynchronous code and you can estimate how fast can you get things with an optimum overlap.
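A back-of-the-envelope version of that estimate, using the rough, assumed figures above:

# All numbers are rough assumptions taken from the discussion above.
image_bytes = 5760 * 3840 * 3        # ~66 MB decoded RGB frame
pcie_bw = 1.5e9                      # ~1.5 GB/s assumed effective PCI-E rate
gpu_mem_bw = 12.8e9                  # HD 5430 peak memory bandwidth

transfer_s = 2.0 * image_bytes / pcie_bw     # upload + download
compute_s = 2.0 * image_bytes / gpu_mem_bw   # rotation = one read + one write
print("PCI-E transfers: %.0f ms" % (transfer_s * 1000))  # ~88 ms
print("GPU rotation:    %.0f ms" % (compute_s * 1000))   # ~10 ms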
One thing you might want to consider is getting hardware where PCI-E is not a bottleneck, e.g.:
AMD APU-based system. On these platforms you will be able to page-lock the memory and use it directly from the GPU;
integrated GPUs which share main memory with the host;
a fast low-power CPU like a mobile Intel Ivy Bridge e.g. i5-3427U which consumes almost as little as the Atom D525 but has AVX support and should be several times faster.

Should Display Lists be CPU intensive?

My application is rendering about 100 display lists per second. While I do expect this to be intensive for the GPU, I don't see why it brings my CPU up to 80-90%. Aren't display lists stored on the graphics card rather than in system memory? What would I have to do to reduce this crazy CPU usage? My objects never change, which is why I'm using display lists instead of VBOs. But really my main concern is CPU usage and how I can reduce it. I'm rendering (or trying to render) ~60 frames per second.
Thanks
If you are referring to these, then I suspect the bottleneck is going to be CPU related. All the decoding of such files is done on the CPU. Sure, each individual command might result in several commands to your graphics card, which will execute quickly, but the CPU is stuck doing decoding duty.
You probably have VSYNC disabled, in which case your CPU will generate as many frames per second as possible. Of course most of them will be wasted, because your monitor can't update hundreds of times per second.
Enable VSYNC and check your CPU usage (and frame rate) again.
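How you enable it depends on your windowing setup; with GLFW, for example, it is a single call after creating the context (sketch using the Python glfw bindings, purely for illustration):

import glfw

glfw.init()
window = glfw.create_window(640, 480, "demo", None, None)
glfw.make_context_current(window)
glfw.swap_interval(1)   # 1 = wait for the display refresh before swapping (VSYNC on)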
While display lists are compiled and stored on the GPU, that does not mean there isn't some work required on the CPU (if not directly in your code, then possibly in the driver) to actually issue the display-list call to the GPU.
If you want to find out where the CPU time is being spent, grab a profiler and run a call-graph sampling session. You'll know in no time where the time is going.

How can I read the bandwidth in use over the PCIe bus?

I'm working on a streaming media application that pushes a lot of data to the graphics card at startup. The CPU is doing very little at the point when the data is being pushed, it idles along at close to zero percent usage.
I'd like to monitor which machines struggle at pushing the initial data, and which ones can cope, in order that I can get to a minimum recommended spec for our customers hardware.
I've found that PCs with PCIe 1.1 x16 slots struggle with the initial data being pushed to the graphics card.
My development PC has a PCIe 2.0 x16 slot, and it has no problems with coping with the large amount of data being initially pushed to the graphics card.
I need numbers to prove (or disprove) my point.
What I'd like is to be able to determine:
Which slot type is the graphics card on?
What is the speed of that slot?
Gfx card name
Gfx card driver version
But most importantly, the data flow over the PCIe slot - e.g. if I can show that the PCIe bus is being maxed out with data, I can point to that as the bottleneck.
I know that system memory speed is also a factor here, e.g. the data is being transferred from RAM, over the PCIe bus to the graphics card, so is there a way to determine the system memory speed also?
Finally, I write in unmanaged C++, so accessing .NET libraries is not an option.
For Nvidia GPUs, you can try using NvAPI_GPU_GetDynamicPstatesInfoEx:
Nvidia, through its GeForce driver, exposes a programming interface ("NVAPI") that, among other things, allows for collecting performance measurements. For the technically inclined, here is the relevant section in the nvapi.h header file:
FUNCTION NAME: NvAPI_GPU_GetDynamicPstatesInfoEx
DESCRIPTION: This API retrieves the NV_GPU_DYNAMIC_PSTATES_INFO_EX structure for the specified physical GPU. Each domain's info is indexed in the array. For example: pDynamicPstatesInfo->utilization[NVAPI_GPU_UTILIZATION_DOMAIN_GPU] holds the info for the GPU domain. There are currently four domains for which GPU utilization and dynamic P-state thresholds can be retrieved: graphic engine (GPU), frame buffer (FB), video engine (VID), and bus interface (BUS).
Beyond this header commentary, the API's specific functionality isn't documented. The information below is our best interpretation of its workings, though it relies on a lot of conjecture.
The graphics engine ("GPU") metric is expected to be your bottleneck in most games. If you don't see this at or close to 100%, something else (like your CPU or memory subsystem) is limiting performance.
The frame buffer ("FB") metric is interesting, if it works as intended. From the name, you'd expect it to measure graphics memory utilization (the percentage of memory used). That is not what this is, though. It appears, rather, to be the memory controller's utilization in percent. If that's correct, it would measure actual bandwidth being used by the controller, which is not otherwise available as a measurement any other way.
We're not as interested in the video engine ("VID"); it's not generally used in gaming, and typically registers a flat 0%. You'd only see the dial move if you're encoding video through ShadowPlay or streaming to a Shield.
The bus interface ("BUS") metric refers to utilization of the PCIe controller, again as a percentage. The corresponding measurement, which you can trace in EVGA PrecisionX and MSI Afterburner, is called "GPU BUS Usage".
We asked Nvidia to shed some light on the inner workings of NVAPI. Its response confirmed that the FB metric measures graphics memory bandwidth usage, but Nvidia dismissed the BUS metric as "considered to be unreliable and thus not used internally".
We asked AMD if it had any API or function that allowed for similar measurements. After internal verification, company representatives confirmed that they did not. As much as we would like to, we are unable to conduct similar tests on AMD hardware.
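As a side note (not from the quoted article), NVIDIA's NVML library exposes comparable counters and covers most of the wish list in the question; a sketch using the pynvml bindings, purely as an illustration of what is queryable (NVML is also callable from C/C++):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

print("GPU:", pynvml.nvmlDeviceGetName(handle))
print("Driver:", pynvml.nvmlSystemGetDriverVersion())
print("PCIe link: gen %d x%d" % (pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle),
                                 pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)))

# GPU and memory-controller utilization (roughly the "GPU" and "FB" domains above)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print("GPU %d%%, memory controller %d%%" % (util.gpu, util.memory))

# PCIe throughput counters, reported in KB/s over a short sampling interval
tx = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_TX_BYTES)
rx = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_RX_BYTES)
print("PCIe TX %.1f MB/s, RX %.1f MB/s" % (tx / 1024.0, rx / 1024.0))

pynvml.nvmlShutdown()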
Do you get errors pushing your massive amounts of data, or are you "simply" concerned with slow speed?
I doubt there's any easy way to monitor PCI-e bandwidth usage, if it's possible at all. But it should be possible to query the bus type the video adapter is connected to via WMI and/or SetupAPI - I have no personal experience or helpful links for either, sorry.
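For the card name and driver version items, a WMI query is straightforward; a minimal sketch with the third-party Python wmi package, shown only to illustrate what Win32_VideoController exposes (it does not report live bus bandwidth):

import wmi  # pip install wmi (Windows only)

for adapter in wmi.WMI().Win32_VideoController():
    print(adapter.Name, adapter.DriverVersion)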