How can I read the bandwidth in use over the PCIe bus? - c++

I'm working on a streaming media application that pushes a lot of data to the graphics card at startup. The CPU is doing very little at the point when the data is being pushed, it idles along at close to zero percent usage.
I'd like to monitor which machines struggle at pushing the initial data, and which ones can cope, in order that I can get to a minimum recommended spec for our customers hardware.
I've found that PCs with PCIe 1.1 x16 slots struggle with the initial data being pushed over the graphics card.
My development PC has a PCIe 2.0 x16 slot, and it has no problems with coping with the large amount of data being initially pushed to the graphics card.
I need numbers to prove (or disprove) my point.
What I'd like is to be able to determine:
Which slot type is the graphics card on?
What is the speed of that slot?
Gfx card name
Gfx card driver version
But most importantly, the data flow over the PCIe slot - e.g. if I can show that the PCIe bus is being maxed out with data, I can point to that as the bottle neck.
I know that system memory speed is also a factor here, e.g. the data is being transferred from RAM, over the PCIe bus to the graphics card, so is there a way to determine the system memory speed also?
Finally, I write in unmanaged C++, so accessing .NET libraries is not an option.

For Nvidia GPUs, you can try using NvAPI_GPU_GetDynamicPstatesInfoEx:
Nvidia, through its GeForce driver, exposes a programming interface
("NVAPI") that, among other things, allows for collecting performance
measurements. For the technically inclined, here is the relevant
section in the nvapi.h header file:
FUNCTION NAME: NvAPI_GPU_GetDynamicPstatesInfoEx
DESCRIPTION: This API retrieves the NV_GPU_DYNAMIC_PSTATES_INFO_EX
structure for the specified physical GPU. Each domain's info is
indexed in the array. For example:
pDynamicPstatesInfo->utilization[NVAPI_GPU_UTILIZATION_DOMAIN_GPU] holds the info for the GPU domain. There are currently four domains
for which GPU utilization and dynamic P-state thresholds can be
retrieved: graphic engine (GPU), frame buffer (FB), video engine
(VID), and bus interface (BUS).
Beyond this header commentary, the API's specific functionality isn't
documented. The information below is our best interpretation of its
workings, though it relies on a lot of conjecture.
The graphics engine ("GPU") metric is expected to be your bottleneck in most games. If you don't see this at or close to 100%, something
else (like your CPU or memory subsystem) is limiting performance.
The frame buffer ("FB") metric is interesting, if it works as intended. From the name, you'd expect it to measure graphics memory
utilization (the percentage of memory used). That is not what this is,
though. It appears, rather, to be the memory controller's utilization
in percent. If that's correct, it would measure actual bandwidth being
used by the controller, which is not otherwise available as a
measurement any other way.
We're not as interested in the video engine ("VID"); it's not generally used in gaming, and registers a flat 0% typically. You'd
only see the dial move if you're encoding video through ShadowPlay or
streaming to a Shield.
The bus interface ("BUS") metric refers to utilization of the PCIe controller, again, as a percentage. The corresponding measurement,
which you can trace in EVGA PrecisionX and MSI Afterburner, is called
"GPU BUS Usage".
We asked Nvidia to shed some light on the inner workings of NVAPI. Its
response confirmed that the FB metric measures graphics memory
bandwidth usage, but Nvidia dismissed the BUS metric as "considered
to be unreliable and thus not used internally".
We asked AMD if it had any API or function that allowed for similar
measurements. After internal verification, company representatives
confirmed that they did not. As much as we would like to, we are
unable to conduct similar tests on AMD hardware.

Do you get errors pushing your massive amounts of data, or are you "simply" concerned with slow speed?
I doubt there's any easy way to monitor PCI-e bandwidth usage, if it's possible at all. But it should be possible to query the bus type the video adapter is connected to via WMI and/or SetupAPI - I have no personal experience or helpful links for either, sorry.

Related

Tensorflow running at 35% GPU utilization, profiler shows odd cpu activity

I'm running a typical 5 layer convolutional network on the GPU in tensorflow. When I run on a fast 1080 TI GPU I'm getting about 35% GPU utilization. On a slower M40 I get 80% utilization and get 97% utilization on a 970m mobile GPU.
I've implemented the tf.StagingArea GPU queue and have confirmed with a warning message that StagingArea is not empty before each training step, it's being feed asynchronously.
I've run the tensorflow profiler seen below. Notably, the main operations on the GPU appear to complete in 15ms, but then there's a gap between 15ms and 40ms where nothing is registered by the profiler. At 40ms three small CPU operations occur that appear related to the optimizer (global step update).
This behavior is consistent at every step.
Any idea why there's such a delay here?
There is a way how you can determine what is happening on CPU inside of that interval with help of Intel VTune Amplifier (the tool is not free, but there are free fully functional academic and trial versions). You can use a recipe from this article to import timeline data to Intel VTune Amplifier and analyze it there. You will need Frame Domain / Source Function grouping. Expand [No frame domain - Outside any frame] row and you will get the list of hotspots happening in the interval you are interested in.

Pre-loading audio buffers - what is reasonable and reliable?

I am converting an audio signal processing application from Win XP to Win 7 (at least). You can imagine it is a sonar application - a signal is generated and sent out, and a related/modified signal is read back in. The application wants exclusive use of the audio hardware, and cannot afford glitches - we don't want to read headlines like "Windows beep causes missile launch".
Looking at the Windows SDK audio samples, the most relevant one to my case is the RenderExclusiveEventDriven example. Outside the audio engine, it prepares 10 seconds of audio to play, which provides it in 10ms chunks to the rendering engine via an IAudioRenderClient object's GetBuffer() and ReleaseBuffer(). It first uses these functions to pre-load a single 10ms chunk of audio, then relies on regular 10ms events to load subsequent chunks.
Hopefully this means there will always be 10-20ms of audio data buffered. How reliable (i.e. glitch-free) should we expect this to be on reasonably modern hardware (less than 18months old)?
Previously, one readily could pre-load at least half a second worth of audio into via the waveXXX() API, so that if Windows got busy elsewhere, audio continuity was less likely to be affected. 500ms seems like a lot safer margin than 10-20ms... but if you want both event-driven and exclusive-mode, the IAudioRenderClient documentation doesn't exactly make it clear if it is or is not possible to pre-load more than a single IAudioRenderClient buffer worth.
Can anyone confirm if more extensive pre-loading is still possible? Is it recommended, discouraged or neither?
If you are worried about launching missiles, I don't think you should be using Windows or any other non Real-Time operating system.
That said, we are working on another application that consumes a much higher bandwidth of data (400 MB/s continuously for hours or more). We have seen glitches where the operating system becomes unresponsive for up to 5 seconds, so we have large buffers on the data acquisition hardware.
Like with everything else in computing, the wider you go you:
increase throughput
increase latency
I'd say 512 samples buffer is the minimum typically used for non-demanding latency wise applications. I've seen up to 4k buffers. Memory use wise that's still pretty much nothing for contemporary devices - a mere 8 kilobytes of memory per channel for 16 bit audio. You have better playback stability and lower waste of CPU cycles. For audio applications that means you can process more tracks with more DSP before audio begins skipping.
On the other end - I've seen only a few professional audio interfaces, which could handle 32 sample buffers. Most are able to achieve 128 samples, naturally you are still limited to lower channel and effect count, even with professional hardware you increase buffering as your project gets larger, lower it back and disable tracks or effects when you need "real time" to capture a performance. In terms of lowest possible latency actually the same box is capable of achieving lower latency with Linux and a custom real time kernel than on Windows where you are not allowed to do such things. Keep in mind a 64 sample buffer might sound like 8 msec of latency in theory, but in reality it is more like double - because you have both input and output latency plus the processing latency.
For a music player where latency is not an issue you are perfectly fine with a larger buffer, for stuff like games you need to keep it lower for the sake of still having a degree of synchronization between what's going on visually and the sound - you simply cannot have your sound lag half a second behind the action, for music performance capturing together with already recorded material you need to have latency low. You should never go above what your use case requires, because a small buffer will needlessly add to CPU use and the odds of getting audio drop outs. 4k buffering for an audio player is just fine if you can live with half a second of latency between the moment you hit play and the moment you hear the song starting.
I've done a "kind of a hybrid" solution in my DAW project - since I wanted to employ GPGPU for its tremendous performance relative to the CPU I've split the work internally with two processing paths - 64 samples buffer for real time audio which is processed on the CPU, and another considerably wider buffer size for the data which is processed by the GPU. Naturally, they both come out through the "CPU buffer" for the sake of being synchronized perfectly, but the GPU path is "processed in advance" thus allowing higher throughput for already recorded data, and keeping CPU use lower so the real time audio is more reliable. I am honestly surprised professional DAW software hasn't taken this path yet, but not too much, knowing how much money the big fishes of the industry make on hardware that is much less powerful than a modern midrange GPU. They've been claiming that "latency is too much with GPUs" ever since Cuda and OpenCL came out, but with pre-buffering and pre-processing that is really not an issue for data which is already recorded, and increases the size of a project which the DAW can handle tremendously.
The short answer is yes, you can preload a larger amount of data.
This example uses a call to GetDevicePeriod to return the minimum service interval for the device (in nano seconds) and then passes that value along to Initialize. You can pass a larger value if you wish.
The down side to increasing the period is that you're increasing the latency. If you are just playing a waveform back and aren't planning on making changes on the fly then this is not a problem. But if you had a sine generator for example, then the increased latency means that it would take longer for you to hear a change in frequency or amplitude.
Whether or not you have drop outs depends on a number of things. Are you setting thread priorities appropriately? How small is the buffer? How much CPU are you using preparing your samples? In general though, a modern CPU can handle a pretty low-latency. For comparison, ASIO audio devices run perfectly fine at 96kHz with a 2048 sample buffer (20 milliseconds) with multiple channels - no problem. ASIO uses a similar double buffering scheme.
This is too long to be a comment, so it may as well be an answer (with qualifications).
Although it was edited out of final form of the question I submitted, what I had intended by "more extensive pre-loading" was not about the size of buffers used, so much as the number of buffers used. The (somewhat unexpected) answers that resulted all helped widen my understanding.
But I was curious. In the old waveXXX() world, it was possible to "pre-load" multiple buffers via waveOutPrepareHeader() and waveOutWrite() calls, the first waveOutWrite() of which would start playback. My old app "pre-loaded" 60 buffers out of a set of 64 in one burst, each with 512 samples played at 48kHz, creating over 600ms of buffering in a system with a cycle of 10.66ms.
Using multiple IAudioRenderClient::GetBuffer() and IAudioRenderClient::ReleaseBuffer() calls prior to IAudioCient::Start() in the WASAPI world, it appears that the same is still possible... at least on my (very ordinary) hardware, and without extensive testing (yet). This is despite the documentation strongly suggesting that exclusive, event-driven audio is strictly a double-buffering system.
I don't know that anyone should set out to exploit this by design, but I thought I'd point out that it may be supported.

graphics libraries and GPUs [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How does OpenGL work at the lowest level?
When we make a program that uses the OpenGL library, for example for the Windows platform and have a graphics card that supports OpenGL, what happens is this:
We developed our program in a programming language linking the graphics with OpenGL (eg Visual C++).
Compile and link the program for the target platform (eg Windows)
When you run the program, as we have a graphics card that supports OpenGL, the driver installed on the same Windows will be responsible for managing the same graphics. To do this, when the CPU will send the required data to the chip on the graphics card (eg NVIDIA GPU) sketch the results.
In this context, we talk about graphics acceleration and downloaded to the CPU that the work of calculating the framebuffer end of our graphic representation.
In this environment, when the driver of the GPU receives data, how leverages the capabilities of the GPU to accelerate the drawing? Translates instructions and data received CUDA language type to exploit parallelism capabilities? Or just copy the data received from the CPU in specific areas of the device memory? Do not quite understand this part.
Finally, if I had a card that supports OpenGL, does the driver installed in Windows detect the problem? Would get a CPU error or would you calculate our framebuffer?
You'd better work into computer gaming sites. They frequently give articles on how 3D graphics works and how "artefacts" present themselves in case of errors in games or drivers.
You can also read article on architecture of 3D libraries like Mesa or Gallium.
Overall drivers have a set of methods for implementing this or that functionality of Direct 3D or OpenGL or another standard API. When they are loading, they check the hardware. You can have cheap videocard or expensive one, recent one or one released 3 years ago... that is different hardware. So drivers are trying to map each API feature to an implementation that can be used on given computer, accelerated by GPU, accelerated by CPU like SSE4, or even some more basic implementation.
Then driver try to estimate GPU load. Sometimes function can be accelerated, yet the GPU (especially low-end ones) is alreay overloaded by other task, then it maybe would try to calculate on CPU instead of waiting for GPU time slot.
When you make mistake there is always several variants, depending on intelligence and quality of driver.
Maybe driver would fix the error for you, ignoring your commands and running its own set of them instead.
Maybe the driver would return to your program some error code
Maybe the driver would execute the command as is. If you issued painting wit hred colour instead of green - that is an error, but the kind that driver can not know about. Search for "3d artefacts" on PC gaming related sites.
In worst case your eror would interfere with error in driver and your computer would crash and reboot.
Of course all those adaptive strategies are rather complex and indeterminate, that causes 3D drivers be closed and know-how of their internals closely guarded.
Search sites dedicated to 3D gaming and perhaps also to 3D modelling - they should rate videocards "which are better to buy" and sometimes when they review new chip families they compose rather detailed essays about technical internals of all this.
To question 5.
Some of the things that a driver does: It compiles your GPU programs (vertex,fragment, etc. shaders) to the machine instructions of the specific card, uploads the compiled programs to the appropriate area of the device memory and arranges the programs to be executed in parallel on the many many graphics cores on the card.
It uploads the graphics data (vertex coordinates, textures, etc.) to the appropriate type of graphics card memory, using various hints from the programmer, for example whether the date is frequently, infrequently, or not at all updated.
It may utilize special units in the graphics card for transfer of data to/from host memory, for example some nVidia card have a DMA unit (some Quadro card may have two or more), which can upload, for example, textures in parallel with the usual driver operation (other transfers, drawing, etc).

Rotating hundreds of JPEGs in seconds rather than hours

We have hundreds of images which our computer gets at a time and we need to rotate and resize them as fast as possible.
Rotation is done by 90, 180 or 270 degrees.
Currently we are using the command line tool GraphicsMagick to rotate the image. Rotating the images (5760*3840 ~ 22MP) takes around 4 to 7 seconds.
The following python code sadly gives us equal results
import cv
img = cv.LoadImage("image.jpg")
timg = cv.CreateImage((img.height,img.width), img.depth, img.channels) # transposed image
# rotate counter-clockwise
cv.Transpose(img,timg)
cv.Flip(timg,timg,flipMode=0)
cv.SaveImage("rotated_counter_clockwise.jpg", timg)
Is there a faster way to rotate the images using the power of the graphics card? OpenCL and OpenGL come to mind but we are wondering whether a performance increase would be noticable.
The hardware we are using is fairly limited as the device should be as small as possible.
Intel Atom D525 (1,8 Ghz)
Mobility Radeon HD 5430 Series
4 GB of RAM
SSD Vertility 3
The software is debian 6 with official (closed source) radeon drivers.
you can perform a lossless rotation that will just modify the EXIF section. This will rotate your pictures faster.
and have a look at jpegtran utility which performs lossless jpeg modifications.
https://linux.die.net/man/1/jpegtran
There is a jpeg no-recompression plugin for irfanview which IIRC can rotate and resize images (in simple ways) without recompressing, it can also run an a directory of images - this should be a lot faster
The GPU probably wouldn't help, you are almost certainly I/O limited in opencv, it's not really optomised for high speed file access
I'm not an expert in jpeg and compression topics, but as your problem is pretty much as I/O limited as it gets (assuming that you can rotate without heavy de/encoding-related computation), you you might not be able to accelerate it very much on the GPU you have. (Un)Luckily your reference is a pretty slow Atom CPU.
I assume that the Radeon has separate main memory. This means that data needs to be communicated through PCI-E which is the extra latency compared to CPU execution and without hiding you can be sure that it is the bottleneck. This is the most probable reason why your code that uses OpenCV on the GPU is slow (besides the fact that you do two memory-bound operations, transpose & flip, instead of a single one).
The key thing is to hide as much of the PCI-E transfer times with computation as possible by using multiple-buffering. Overlapping transfers both to and from the GPU with computation by making use of the full-duplex capability of PCI-E will only work if the card in question has dual-DMA engines like high-end Radeons or the NVIDIA Quadro/Tesla cards -- which I highly doubt.
If your GPU compute-time (the time it takes the GPU to do the rotation) is lower than the time the transfer takes, you won't be able to fully overlap. The HD 4530 has a pretty slow memory interface with only 12.8 Gb/s peak, and the rotation kernel should be quite memory bound. However, I can only guesstimate, but I would say that if you reach peak PCI-E transfer rate of ~1.5 Gb/s (4x PCI-E AFAIK), the compute kernel will be a few times faster than the transfer and you'll be able to overlap very little.
You can simply time the parts separately without requiring elaborate asynchronous code and you can estimate how fast can you get things with an optimum overlap.
One thing you might want to consider is getting hardware which doesn't exhibit PCI-E as a bottleneck, e.g:
AMD APU-based system. On these platforms you will be able to page-lock the memory and use it directly from the GPU;
integrated GPUs which share main memory with the host;
a fast low-power CPU like a mobile Intel Ivy Bridge e.g. i5-3427U which consumes almost as little as the Atom D525 but has AVX support and should be several times faster.

Graphics Profiling

Ive got an application which drops to around 10fps. I profiled it with xperf which showed my app was using just 20% of the CPU, with none of my methods using a larger than expected amount of that 20%.
This seems to indicate that the vast drop in fps is because the graphics card isnt able to keep up with rendering the frame, resulting in my program stopping while it catches up...
Is there some way to profile what the graphics card is up to and work out what my program is telling it to do thats slowing it down, so that I can try to improve the frame rate?
For debugging / profiling graphics, try Nvidia PerfHUD
NVIDIA PerfHUD is a powerful real-time performance analysis tool for Direct3D applications.
There is also an ATI solution, called 'GPU PerfStudio'
GPU PerfStudio is a real-time performance analysis tool which has been designed to help tune the graphics performance of your DirectX 9, DirectX 10, and OpenGL applications. GPU PerfStudio displays real-time API, driver and hardware data which can be visualized using extremely flexible plotting and bar chart mechanisms. The application being profiled maybe executed locally or remotely over the network. GPU PerfStudio allows the developer to override key rendering states in real-time for rapid bottleneck detection. An auto-analysis window can be used for identifying performance issues at various stages of the graphics pipeline. No special drivers or code modifications are needed to use GPU PerfStudio.
You can find more information and download links here:
http://developer.nvidia.com/object/nvperfhud_home.html
http://developer.amd.com/tools-and-sdks/graphics-development/gpu-perfstudio/
Also, check out this article on FPS:
FPS vs Frame Time
Basically it talks about the fact that a drop from 200fps to 190fps is negligible, whereas a drop from 30fps to 20fps is a MUCH bigger deal. For better performance measuring, you should be calculating frame time rather than FPS.
You never told us what your fps is or what the program is doing at all, so your "vast drop" might not be a big deal at all.
For DirectX, there is PIX for profiling the CPU and GPU operations. It can give very detailed info, and might be worth looking into.
Hope that helps!
You can try using dxprof (search in google). It's lightweight app that draws real-time bars, each bar corresponding to one DirectX event (such as draw-call or resource copy). You can freeze the bars and check calls stack to find out where the draw-call originates from.
Are you developing for Windows? If so avoid using Video for Windows as this will limit you in the manner that you describe. Use DirectX instead.
No need to guess. Just pause it a few times under the IDE, and it will show you exactly what it's waiting for.