Are there any OpenGL operations that cost significant GPU resources? - opengl

The situation is that I run several OpenGL programs concurrently on a single server with only one GPU card. They work fine at 60 FPS.
The problem is that when I restart one of them, the FPS of the others drops a lot, sometimes to the 30s or even the teens.
If I restart two or more at the same time it can be worse; the FPS can drop to single digits.
I'm wondering whether there are any OpenGL initialization operations (creating a context, setting up textures) that consume a lot of GPU resources.
The environment: Linux (Ubuntu 14.04), NVIDIA GTX 770, X11 window system.

There are indeed some operations that most OpenGL implementations carry out on the CPU. Most notably, when sending images to the GPU, any necessary format conversions (i.e. if the image data is not yet in a format the GPU can process directly) are done on the CPU. Also, every OpenGL object that carries bulk data (textures, renderbuffers, buffer objects) usually requires a shadow copy in system memory (the GPU memory mostly acts like a cache), and initializing that copy is a costly operation as well.
Sending bulk data to the GPU also consumes bus bandwidth. While the transfer itself is carried out by DMA and therefore does not load the CPU or GPU much, it consumes memory bandwidth on both ends, and that can slow down memory-intensive operations running there.
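For illustration only (my own sketch, not part of the answer above): one way to avoid the CPU-side conversion is to hand the driver pixel data in a layout it can copy or DMA directly. On many desktop GPUs the BGRA + UNSIGNED_INT_8_8_8_8_REV combination with a sized internal format hits that fast path; the function below assumes a current OpenGL 1.2+ context and pixels already stored in that layout.
#include <GL/gl.h>
#include <GL/glext.h>

// Sketch: upload pixel data in a driver-friendly layout so the driver can
// copy/DMA it straight to the GPU instead of converting it on the CPU.
// Assumes a current OpenGL context (the enums need GL 1.2 or glext.h).
GLuint createTextureNoConversion(const void *pixels, GLsizei width, GLsizei height)
{
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    // Sized internal format + matching client format/type = no conversion pass.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                 GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
    return tex;
}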

Related

Can textures once uploaded from CPU/Main Memory/RAM to GPU be deallocated on CPU?

For a fixed 3D scene with animations (which cannot be partially loaded into memory), with say 1000 objects and various zoom levels: once all the textures have been uploaded to the GPU, does their data still need to be held on the CPU? The application appears to work even if I deallocate all the texture data from CPU/main RAM, but is this totally safe, or should the data still be kept in main memory in addition to GPU memory?
After handing the data to OpenGL, all CPU-side copies can be deleted without any problems. This is true for textures as well as for buffers.
If the implementation does not upload the data to the GPU immediately (AFAIK all desktop GL implementations delay that), then the OpenGL implementation has to make sure the CPU-side data is kept as a backup until it is needed.
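A minimal sketch of what this means in practice (my own code, names are made up): the client-memory buffer handed to glTexImage2D can be released as soon as the call returns, because the GL has taken its own copy by then.
#include <GL/gl.h>
#include <vector>

// Sketch: upload a texture and let the CPU-side pixel buffer die right away.
// Assumes a current OpenGL context and tightly packed RGBA8 pixel data.
GLuint uploadAndForget(std::vector<unsigned char> pixels, GLsizei width, GLsizei height)
{
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    // glTexImage2D copies out of client memory before it returns, so the
    // vector may safely be destroyed afterwards.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels.data());
    return tex;
}   // `pixels` is destroyed here; the texture keeps working.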

What is "GPU Cache" from a OpenGL/DirectX programmer prespective?

A Maya promo video explains how the GPU cache makes the application run faster for the user. In frameworks like Cinder we re-send all the geometry we want in the scene to the video card on each frame update. So I wonder what is behind GPU caching from a programmer's perspective. What OpenGL/DirectX APIs are behind such technology? How do I "cache" my mesh in GPU memory?
There is, to my knowledge, no way in OpenGL or DirectX to directly specify what is to be, and not to be, stored and tracked on the GPU cache. There are however methodologies that should be followed and maintained in order to make best use of the cache. Some of these include:
Batch, batch, batch.
Upload data directly to the GPU
Order indices to maximize vertex locality across the mesh.
Keep state changes to a minimum.
Keep shader changes to a minimum.
Keep texture changes to a minimum.
Use maximum texture compression whenever possible.
Use mipmapping whenever possible (to maximize texel sampling locality)
It is also important to keep in mind that there is no single GPU cache. There are multiple (vertex, texture, etc.) independent caches.
Sources:
OpenGL SuperBible - Memory Bandwidth and Vertices
GPU Gems - Graphics Pipeline Performance
GDC 2012 - Optimizing DirectX Graphics
First off, the "GPU cache" terminology that Maya uses probably refers to graphics data that is simply stored on the card, i.e. a mesh optimized for device-independent storage and rendering in Maya. For card manufacturers the notion of a "GPU cache" is different (there it means something more like the L1 or L2 caches of a CPU).
To answer your final question: using OpenGL terminology, you generally create vertex buffer objects (VBOs). These store the data on the card. Then, when you want to draw, you simply instruct the card to use those buffers.
This avoids the overhead of copying the mesh data from main (CPU) memory into graphics (GPU) memory on every draw. If you need to draw the mesh many times without changing the mesh data, it performs much better.
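As a rough illustration of that workflow (my own sketch, not the answerer's code; it assumes a current legacy/compatibility OpenGL context and that a loader such as GLEW provides the GL 1.5 buffer functions):
#include <GL/glew.h>

// Sketch: copy vertex data to the card once...
GLuint createMeshVBO(const float *vertices, GLsizei vertexCount)
{
    GLuint vbo = 0;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // One-time copy into GPU-managed memory; GL_STATIC_DRAW hints that the
    // data will be drawn many times but rarely modified.
    glBufferData(GL_ARRAY_BUFFER, vertexCount * 3 * sizeof(float),
                 vertices, GL_STATIC_DRAW);
    return vbo;
}

// ...then reuse it every frame without touching main memory again.
void drawMesh(GLuint vbo, GLsizei vertexCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, 0);   // 0 = offset into the bound VBO
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
    glDisableClientState(GL_VERTEX_ARRAY);
}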

Rotating hundreds of JPEGs in seconds rather than hours

We have hundreds of images which our computer receives at a time, and we need to rotate and resize them as fast as possible.
Rotation is done by 90, 180 or 270 degrees.
Currently we are using the command line tool GraphicsMagick to rotate the images. Rotating an image (5760*3840, ~22 MP) takes around 4 to 7 seconds.
The following Python code sadly gives us comparable results:
import cv
img = cv.LoadImage("image.jpg")
timg = cv.CreateImage((img.height,img.width), img.depth, img.channels) # transposed image
# rotate counter-clockwise
cv.Transpose(img,timg)
cv.Flip(timg,timg,flipMode=0)
cv.SaveImage("rotated_counter_clockwise.jpg", timg)
Is there a faster way to rotate the images using the power of the graphics card? OpenCL and OpenGL come to mind, but we are wondering whether a performance increase would be noticeable.
The hardware we are using is fairly limited, as the device should be as small as possible.
Intel Atom D525 (1.8 GHz)
Mobility Radeon HD 5430 Series
4 GB of RAM
SSD Vertility 3
The software is Debian 6 with the official (closed source) Radeon drivers.
You can perform a lossless rotation that just modifies the EXIF orientation section. This will rotate your pictures faster.
Also have a look at the jpegtran utility, which performs lossless JPEG modifications:
https://linux.die.net/man/1/jpegtran
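For example, a lossless 90-degree rotation that also keeps the EXIF data looks roughly like this (file names are placeholders):
jpegtran -rotate 90 -copy all -outfile rotated.jpg original.jpg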
There is a JPEG no-recompression plugin for IrfanView which, IIRC, can rotate and resize images (in simple ways) without recompressing. It can also run on a directory of images - this should be a lot faster.
The GPU probably wouldn't help; you are almost certainly I/O limited, and OpenCV isn't really optimised for high-speed file access.
I'm not an expert in JPEG and compression topics, but as your problem is pretty much as I/O-limited as it gets (assuming that you can rotate without heavy de/encoding-related computation), you might not be able to accelerate it very much on the GPU you have. (Un)luckily your reference is a pretty slow Atom CPU.
I assume that the Radeon has separate main memory. This means that data needs to travel over PCI-E, which adds latency compared to CPU execution; without hiding that latency you can be sure it will be the bottleneck. This is the most probable reason why your code that uses OpenCV on the GPU is slow (besides the fact that you do two memory-bound operations, transpose and flip, instead of a single one).
The key thing is to hide as much of the PCI-E transfer time behind computation as possible by using multiple buffering. Overlapping transfers both to and from the GPU with computation, making use of the full-duplex capability of PCI-E, will only work if the card in question has dual DMA engines like high-end Radeons or the NVIDIA Quadro/Tesla cards -- which I highly doubt yours does.
If your GPU compute time (the time it takes the GPU to do the rotation) is lower than the time the transfer takes, you won't be able to fully overlap. The HD 5430 has a pretty slow memory interface with only 12.8 GB/s peak, and the rotation kernel should be quite memory bound. However, I can only guesstimate, but I would say that if you reach a peak PCI-E transfer rate of ~1.5 GB/s (4x PCI-E AFAIK), the compute kernel will be a few times faster than the transfer and you'll be able to overlap very little.
You can simply time the parts separately without writing elaborate asynchronous code, and from that estimate how fast you could get with optimal overlap; a rough sketch follows below.
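A rough way to do that timing (my own sketch, not the answerer's code: rotateOnGpu stands in for whatever GPU rotation you implement, and a current OpenGL context with an already allocated texture of the right size is assumed):
#include <GL/glew.h>
#include <chrono>
#include <cstdio>
#include <functional>
#include <vector>

// Sketch: measure the PCI-E upload and the GPU work separately, using
// glFinish() as a crude barrier between the two phases.
void timeUploadAndRotate(GLuint tex, int w, int h, const std::function<void()> &rotateOnGpu)
{
    std::vector<unsigned char> pixels(static_cast<size_t>(w) * h * 4, 0);

    auto t0 = std::chrono::steady_clock::now();
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_BGRA, GL_UNSIGNED_BYTE, pixels.data());
    glFinish();                                   // wait for the transfer to complete
    auto t1 = std::chrono::steady_clock::now();

    rotateOnGpu();                                // placeholder for the rotation pass
    glFinish();                                   // wait for the GPU work to complete
    auto t2 = std::chrono::steady_clock::now();

    std::printf("upload: %.3f ms, rotate: %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count(),
                std::chrono::duration<double, std::milli>(t2 - t1).count());
}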
One thing you might want to consider is getting hardware which doesn't have PCI-E as a bottleneck, e.g.:
an AMD APU-based system. On these platforms you will be able to page-lock the memory and use it directly from the GPU;
integrated GPUs which share main memory with the host;
a fast low-power CPU like a mobile Intel Ivy Bridge (e.g. the i5-3427U) which consumes almost as little power as the Atom D525, but has AVX support and should be several times faster.

Rendering 1.2 GB of textures smoothly, how does a 1 GB GPU do this?

My goal is to see what happens when using more texture data than fits in physical GPU memory. My first attempt was to load up to 40 DDS textures, resulting in a memory footprint way higher than the available GPU memory. However, my scene would still render at 200+ fps on a 9500 GT.
My conclusion: the GPU/OpenGL is being smart and only keeps certain parts of the mipmaps in memory. I thought that should not be possible on a standard config, but whatever.
Second attempt: disable mip mapping, such that the GPU will always have to sample from the high res textures. Once again, I loaded about 40 DDS textures in memory. I verified the texture memory usage with gDEBugger: 1.2 GB. Still, my scene was rendering at 200+ fps.
The only thing I noticed was that when looking away with the camera and then centering it once again on the scene, a serious lag would occur. As if only then it would transfer textures from main memory to the GPU. (I have some basic frustum culling enabled)
My question: what is going on? How does this 1 GB GPU manage to sample from 1.2 GB of texture data at 200+ fps?
OpenGL can page complete textures in and out of texture memory between draw calls (not just between frames). Only those needed for the current draw call actually need to be resident in graphics memory; the others can reside in system RAM. It likely only does this with a very small subset of your texture data. It's pretty much the same as any cache - how can you run algorithms on GBs of data when you only have MBs of cache on your CPU?
Also, PCI-E buses have a very high throughput, so you don't really notice that the driver does the paging.
If you want to verify this, glAreTexturesResident might or might not help, depending on how well the driver is implemented.
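If you do want to poke at it, the query looks like this (a sketch of mine; glAreTexturesResident is legacy/compatibility-profile OpenGL and, as said, drivers are free to answer loosely):
#include <GL/gl.h>
#include <cstdio>
#include <vector>

// Sketch: ask the driver which of the given texture objects are currently
// resident in video memory. Assumes a current (compatibility) OpenGL context.
void reportResidency(const GLuint *textures, GLsizei count)
{
    std::vector<GLboolean> resident(count, GL_FALSE);
    // Returns GL_TRUE only if *all* textures are resident; per-texture
    // answers land in the `resident` array.
    GLboolean all = glAreTexturesResident(count, textures, resident.data());
    for (GLsizei i = 0; i < count; ++i)
        std::printf("texture %u resident: %s\n", textures[i], resident[i] ? "yes" : "no");
    if (all)
        std::printf("all listed textures are resident\n");
}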
Even if you were forcing texture thrashing in your test (discarding and re-uploading some textures from system memory to GPU memory every frame), which I'm not sure you are, modern GPUs and PCI-E have such huge bandwidth that some thrashing doesn't impact performance that much. One of the 9500 GT models is quoted to have a memory bandwidth of 25.6 GB/s, and 16x PCI-E slots (500 MB/s x 16 = 8 GB/s) are the norm.
As for the lag, I would assume the GPU and CPU throttle down their power usage when you aren't drawing visible textures, and when you suddenly overload them they need a brief instant to power up. In real-life apps and games these sudden 0%-100% workload changes never happen, so a slight lag is totally understandable and expected, I guess.

What can cause a reduction in frame rate when upgrading a graphics card?

We have a two-screen DirectX application that previously ran at a consistent 60 FPS (the monitors' sync rate) using a NVIDIA 8400GS (256MB). However, when we swapped out the card for one with 512 MB of RAM the frame rate struggles to get above 40 FPS. (It only gets this high because we're using triple-buffering.) The two cards are from the same manufacturer (PNY). All other things are equal, this is a Windows XP Embedded application and we started from a fresh image for each card. The driver version number is 169.21.
The application is all 2D, i.e. just a bunch of textured quads and a whole lot of pre-rendered graphics (hence the need to upgrade the card's memory). We also have compressed animations which the CPU decodes on the fly - this involves a texture lock. The locks take forever, but I've also tried having a separate system-memory texture for the CPU to update and then updating the rendered texture using the device's UpdateTexture method. No overall difference in performance.
Although I've read through every FAQ I can find on the internet about DirectX performance, this is still the first time I've worked on a DirectX project so any arcane bits of knowledge you have would be useful. :)
One other thing whilst I'm on the subject; when calling Present on the swap chains it seems DirectX waits for the present to complete regardless of the fact that I'm using D3DPRESENT_DONOTWAIT in both present parameters (PresentationInterval) and the flags of the call itself. Because this is a two-screen application this is a problem as the two monitors do not appear to be genlocked, I'm working around it by running the Present calls through a threadpool. What could the underlying cause of this be?
Are the cards exactly the same (both GeForce 8400 GS), with only the memory size differing? Quite often different memory sizes come with slightly different clock rates (i.e. your card with more memory might use slower memory!).
So the first thing to check would be GPU core & memory clock rates, using something like GPU-Z.
It's an easy test to see if the surface lock is the problem: just comment out the texture update and see if the frame rate returns to 60 Hz. Unfortunately, writing to a locked surface and updating the resource kills performance; it always has. Are you using mipmaps with the textures? I know DX9 added automatic generation of mipmaps; it could be taking a lot of time to generate those. If you're constantly locking the same resource each frame, you could also try creating a pool of textures, kind of like triple-buffering except with textures. You would let the renderer use one texture, and on the next update you pick the next available texture in the pool that isn't being used for rendering (unless of course you're memory-constrained or you're only making diffs to the animated texture). A rough sketch of the pool idea follows below.
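To illustrate the pool idea (my own sketch, untested against your setup: it assumes a D3D9 device, D3DUSAGE_DYNAMIC textures so D3DLOCK_DISCARD is cheap, and that three textures are enough to stay ahead of the GPU):
#include <d3d9.h>
#include <cstring>

// Sketch: a small round-robin pool of dynamic textures. Each update the CPU
// writes into the next texture in the ring while the GPU can still read from
// the one submitted on the previous frame.
struct TexturePool
{
    static const int kCount = 3;
    IDirect3DTexture9 *textures[kCount];
    int next;
};

bool createPool(IDirect3DDevice9 *device, UINT width, UINT height, TexturePool *pool)
{
    pool->next = 0;
    for (int i = 0; i < TexturePool::kCount; ++i)
    {
        // Dynamic usage + default pool so LockRect(..., D3DLOCK_DISCARD) works.
        if (FAILED(device->CreateTexture(width, height, 1, D3DUSAGE_DYNAMIC,
                                         D3DFMT_A8R8G8B8, D3DPOOL_DEFAULT,
                                         &pool->textures[i], NULL)))
            return false;
    }
    return true;
}

IDirect3DTexture9 *updateNextTexture(TexturePool *pool, const void *frame,
                                     UINT height, UINT rowBytes)
{
    IDirect3DTexture9 *tex = pool->textures[pool->next];
    pool->next = (pool->next + 1) % TexturePool::kCount;

    D3DLOCKED_RECT lr;
    if (SUCCEEDED(tex->LockRect(0, &lr, NULL, D3DLOCK_DISCARD)))
    {
        const BYTE *src = static_cast<const BYTE *>(frame);
        BYTE *dst = static_cast<BYTE *>(lr.pBits);
        for (UINT y = 0; y < height; ++y)          // copy row by row; pitches may differ
            std::memcpy(dst + y * lr.Pitch, src + y * rowBytes, rowBytes);
        tex->UnlockRect(0);
    }
    return tex;   // hand this texture to SetTexture() for the next draw
}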