My application details:
Running on: MacBook Pro with 4 GB RAM, ATI Radeon X1600 with 128 MB VRAM, OpenGL version: 2.1 ATI-7.0.52
Using vertical sync (via CVDisplay): YES
Programming language: Lisp (LispWorks) with an FFI to OpenGL
Pixel Format information
ns-open-gl-pfa-depth-size 32
ns-open-gl-pfa-sample-buffers 1
ns-open-gl-pfa-samples 6
ns-open-gl-pfa-accelerated 1
ns-open-gl-pfa-no-recovery 1
ns-open-gl-pfa-backing-store 0
ns-open-gl-pfa-virtual-screen-count 1
[1 = YES, 0 = NO] for boolean attribs
I have in my application the following meshes:
14 static meshes (which do not change). I have defined a VBO for each of these, with usage GL_STATIC_DRAW.
2 dynamic meshes (which change every frame). I have defined a VBO for each of these, with usage GL_STREAM_DRAW.
For the dynamic meshes, each frame I call glBufferData with a NULL pointer (to orphan the buffer), then map the buffer, update the mapped memory, and unmap the buffer, as sketched below.
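In C-style pseudocode (my real code goes through the LispWorks FFI; meshVBO, vertexData and vertexBytes are placeholders for my own handles and data), the per-frame update of one dynamic VBO is roughly:

    // Orphan the old storage, then map, fill and unmap (GL 2.1 style).
    glBindBuffer(GL_ARRAY_BUFFER, meshVBO);
    glBufferData(GL_ARRAY_BUFFER, vertexBytes, NULL, GL_STREAM_DRAW); // orphan
    void *ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (ptr) {
        memcpy(ptr, vertexData, vertexBytes);  // write this frame's vertices
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
    glBindBuffer(GL_ARRAY_BUFFER, 0);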
When I run the app and check with OpenGL Profiler, it shows the following (Statistics view) for:
CGLFlushDrawable:
Average Time (in micro sec): 52990.63 = 52.990 ms
% GL Time: 98.55
% App Time: 43.96
No wonder I get a very poor frame rate of around 6-7 FPS.
What is the way to optimize CGLFlushDrawable? I just invoke flushBuffer, which I believe in turn invokes CGLFlushDrawable.
Well, it turns out that there is a problem with my ATI Radeon X1600 graphics card.
Without any change, when I test the same code on a newer 13" MacBook Pro, which has Intel HD Graphics 3000 with 384 MB of DDR3 SDRAM, the application works fine at around 30 FPS, which is what I expect given the dynamic meshes that I have.
Also, there is no bottleneck whatsoever in CGLFlushDrawable, as there was on my old MBP. Furthermore, the amount of VRAM available after VBO allocation remains the same (again, what I was expecting). Neither was the case on my old MBP.
And finally, my MBP's display has crashed occasionally (though not reproducibly) and an external LCD display does not work properly either, which points to a problem with my graphics card.
@Brad, thanks for all your input.
Related
I am working on a 2D graphics application with OpenGL (similar to QGIS). Recently, while running some benchmarks, I saw a weird performance difference between my two graphics cards. So I made a simple test and drew just 1 million squares using a VBO. That is 4 million vertices of 20 bytes each, so my total VBO size is 80 MB, and I draw the whole thing with a single glDrawElements call (see the sketch below). When I measure the render time on my laptop, which has two graphics cards, it takes about 43 ms on the GeForce and about 1 ms on the integrated Intel card. I expected it to be faster on the GeForce. Why is that? Should I disable some OpenGL options?
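Roughly, the setup and the draw look like this (a sketch, not my exact code; vbo, ibo, vertices, indices and the counts stand in for my own data):

    // One-time setup: 4M vertices (20 bytes each, 80 MB) plus an index buffer.
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, vertexCount * 20, vertices, GL_STATIC_DRAW);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, indexCount * sizeof(GLuint), indices, GL_STATIC_DRAW);

    // Per frame: all 1 million squares in a single call (6 indices per square).
    glDrawElements(GL_TRIANGLES, 6 * 1000000, GL_UNSIGNED_INT, 0);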
My System specification is:
ASUS N53m with an integrated Intel graphics card and a GeForce GT 610M
EDIT:
I also tested on another system with an AMD Radeon HD 5450; it was about 44 ms again. I also switched to single precision, which reduced it to 30 ms. But the integrated GPU is still faster!
It is definitely not a measurement issue, because I can see the lag when zooming in/out.
The run-time behavior of different OpenGL implementations differs vastly, as I found out in my experiments on low-latency rendering techniques for VR. In general, the only truly reliable timing interval to measure, the one that gives consistent results, is the inter-frame time between the very same step in your drawing. I.e. measure the time from buffer swap to buffer swap (if you want to measure raw drawing performance, disable V-Sync), or between the same glClear calls, as sketched below.
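For illustration, a minimal sketch of that swap-to-swap measurement (using GLFW here only as an example; window and drawFrame() are assumed to exist):

    // Measure the interval between two buffer swaps; this is the only number
    // that is comparable across drivers and vendors.
    double previous = glfwGetTime();
    while (!glfwWindowShouldClose(window)) {
        drawFrame();                      // whatever you render
        glfwSwapBuffers(window);          // the very same step in every frame
        double now = glfwGetTime();
        printf("frame interval: %.3f ms\n", (now - previous) * 1000.0);
        previous = now;
        glfwPollEvents();
    }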
Everything else is only consistent within a certain implementation, but not between vendors (at the time of testing this I had no AMD GPU around, so I lack data on that). A few notable corner cases I discovered:
SwapBuffers
NVidia: returns only after the swapped buffer has been presented. That means it either waits for V-Sync or it returns only after the buffers have been swapped.
Intel/Linux/X11: always returns immediately. V-Sync affects the next OpenGL call that would affect pixels in the not-yet-presented buffer and that does not fit into the command queue. Hence "clearing" the viewport with a large quad, a skybox or the depth ping-pong method (found only in very old applications) gives very inconsistent frame intervals. glClear will reliably block until V-Sync after a swap.
glFinish
NVidia: actually finishes the rendering, as expected
Intel/Linux/X11: when drawing to the back buffer it acts like a no-op; when drawing to the front buffer it acts like a finish followed by a copy from an auxiliary back buffer to the front buffer (weird). Essentially this means you can't make the drawing process "visible".
I have yet to test what the Intel driver does when bypassing X11 (using KMS). Note that the OpenGL specification leaves it up to the implementation how and when it does certain things, as long as the outcome is consistent and conforms to the specification. And all of the observed behavior is perfectly conformant.
Until now I used DDS (DXT5) for fast loading of texture data. Now I read that since OpenGL 4.3 (and for ES 2) the standard compressed format is KTX (ETC1/ETC2). I integrated the Khronos libktx SDK and benchmarked.
Updating the texture with glCompressedTexSubImage2D 3000 times gives the following results:
DDS: 1450 milliseconds
KTX: forever...
Actually, a loop of only 300 KTX updates already takes a total of 24 seconds!
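The benchmark loop is essentially this sketch (tex, width, height, imageSize and data come from the DDS/KTX loaders; format is GL_COMPRESSED_RGBA_S3TC_DXT5_EXT in the DDS case and GL_COMPRESSED_RGBA8_ETC2_EAC in the KTX case):

    glBindTexture(GL_TEXTURE_2D, tex);
    for (int i = 0; i < 3000; ++i) {
        // Re-upload the same compressed image each iteration.
        glCompressedTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0,
                                  width, height, format, imageSize, data);
    }
    glFinish();  // make sure the uploads have completed before stopping the timer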
Now I have 2 questions:
Is this the expected speed of KTX?
If the answer to the first question is "yes", then what is the advantage of ETC other than a smaller file size than DDS?
I use OpenGL 4.3 with a Quadro 4000 GPU.
I asked this question on the Khronos KTX forum. Here is the answer I got from the forum moderator:
I have been told by the NVIDIA OpenGL driver team that the Quadro 4000 does not support ETC in hardware while it does support DXTC. This means the ETC-compressed images will be decompressed by the OpenGL driver in software then loaded into GPU memory, while the DXTC-compressed images will simply be loaded into GPU memory. I believe that is the source of the performance difference you are observing.
So it seems like my card's hardware doesn't support ETC.
So I have two NVIDIA GPUs:
Card A: GeForce GTX 560 Ti - Wired to Monitor A (Dell P2210)
Card B: GeForce 9800 GTX+ - Wired to Monitor B (ViewSonic VP20)
Setup: an ASUS motherboard with an Intel Core i7 that supports SLI
In the NVIDIA Control Panel, I disabled Monitor A, so I only have Monitor B for all my display purposes.
I ran my program, which:
simulates 10,000 particles in OpenGL and renders them (properly shown on Monitor B), and
uses cudaSetDevice() to target Card A for the computationally intensive CUDA kernel (see the sketch below).
The idea is simple: use Card B for all the OpenGL rendering work and Card A for all the CUDA kernel computation.
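The device selection is roughly this sketch (picking Card A by matching the device name reported by the CUDA runtime):

    // Select the GTX 560 Ti (Card A) for the CUDA kernels.
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        if (strstr(prop.name, "GTX 560 Ti") != NULL) {  // Card A
            cudaSetDevice(i);
            break;
        }
    }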
My Question is this:
After using GPU-Z to monitor both cards, I can see that:
Card A's GPU load increased immediately to over 60%, as expected.
However, Card B's GPU load increased only to about 2%. For 10,000 particles rendered in 3D in OpenGL, I am not sure whether that is what I should expect.
So how can I find out if the OpenGL rendering was indeed using Card B (whose connected Monitor B is the only one that is enabled), and had nothing to do with Card A?
And an extension to the question is:
Is there a way to 'force' the OpenGL rendering logic to use a particular GPU Card?
You can tell which GPU an OpenGL context is using with glGetString(GL_RENDERER);
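For example, with the rendering context current (a minimal sketch):

    // These strings identify the GPU/driver actually backing the current context.
    const GLubyte *vendor   = glGetString(GL_VENDOR);
    const GLubyte *renderer = glGetString(GL_RENDERER);
    printf("GL_VENDOR:   %s\n", vendor);
    printf("GL_RENDERER: %s\n", renderer);  // should mention "GeForce 9800 GTX+" if Card B is used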
Is there a way to 'force' the OpenGL rendering logic to use a particular GPU Card?
Given the functions of the context creation APIs available at the moment: No.
I have code which basically draws parallel coordinates using the OpenGL fixed-function pipeline.
The plot has 7 axes and draws 64k lines, so the output is cluttered, but when I run the code on my laptop, which has an Intel i5 processor and 8 GB of DDR3 RAM, it runs fine. A friend of mine ran the same code on two different systems, both with an Intel i7, 8 GB of DDR3 RAM and an NVIDIA GPU. On those systems the code runs with stuttering and sometimes the mouse pointer becomes unresponsive. If you can give me some idea why this is happening, it would be of great help. Initially I thought it would run even faster on those systems, since they have a dedicated GPU. My laptop runs Ubuntu 12.04 and both of the other systems run Ubuntu 10.x.
The fixed-function pipeline is implemented on top of the GPU's programmable features in modern OpenGL drivers. This means most of the work is still done by the GPU. Fixed-function OpenGL shouldn't be any slower than using GLSL to do the same things, just really inflexible.
What do you mean by the coordinates having 7 axes? Do you have screenshots of your application?
Mouse stuttering sounds like you are seriously taxing your display driver, which suggests you are making too many OpenGL calls. Are you using immediate mode (glBegin, glVertex, ...)? Some OpenGL drivers might not have the best implementation of immediate mode. You should use vertex buffer objects for your data, roughly as sketched below.
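A rough sketch of the difference (fixed-function, no shaders; lineVertices and vertexCount stand in for your data):

    // One-time setup: put all line vertices into a VBO instead of issuing
    // hundreds of thousands of glVertex calls per frame.
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, vertexCount * 2 * sizeof(float),
                 lineVertices, GL_STATIC_DRAW);

    // Per frame: a single draw call for all the lines.
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(2, GL_FLOAT, 0, (void *)0);
    glDrawArrays(GL_LINES, 0, vertexCount);
    glDisableClientState(GL_VERTEX_ARRAY);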
Maybe I've misunderstood you, but here I go.
There are API calls such as glBegin and glEnd which issue commands to the GPU, so they use GPU horsepower, but there are also calls to arrays and other functions that have no relation to the API; those use the CPU.
Now, it's good practice to preload your models outside the OpenGL onDraw loop by saving the data in buffers (glGenBuffers etc.) and then using these buffers (VBO/IBO) inside your onDraw loop.
If managed correctly this can decrease the load on your GPU/CPU. Hope this helps.
Oleg
We have a two-screen DirectX application that previously ran at a consistent 60 FPS (the monitors' sync rate) using an NVIDIA 8400 GS (256 MB). However, when we swapped out the card for one with 512 MB of RAM, the frame rate struggles to get above 40 FPS. (It only gets this high because we're using triple buffering.) The two cards are from the same manufacturer (PNY). All other things are equal; this is a Windows XP Embedded application and we started from a fresh image for each card. The driver version number is 169.21.
The application is all 2D, i.e. just a bunch of textured quads and a whole lot of pre-rendered graphics (hence the need to upgrade the card's memory). We also have compressed animations which the CPU decodes on the fly; this involves a texture lock. The locks take forever, but I've also tried having a separate system-memory texture for the CPU to update and then updating the rendered texture using the device's UpdateTexture method (roughly as sketched below). No overall difference in performance.
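For reference, the UpdateTexture path looks roughly like this (a sketch with error handling omitted; DecodeAnimationFrame stands in for our CPU decoder):

    // sysTex is created once in D3DPOOL_SYSTEMMEM, gpuTex once in D3DPOOL_DEFAULT.
    D3DLOCKED_RECT rect;
    sysTex->LockRect(0, &rect, NULL, 0);
    DecodeAnimationFrame((BYTE *)rect.pBits, rect.Pitch);  // CPU-side decode
    sysTex->UnlockRect(0);
    device->UpdateTexture(sysTex, gpuTex);  // schedule the copy into the rendered texture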
Although I've read through every FAQ I can find on the internet about DirectX performance, this is still the first time I've worked on a DirectX project so any arcane bits of knowledge you have would be useful. :)
One other thing while I'm on the subject: when calling Present on the swap chains, DirectX seems to wait for the present to complete regardless of the fact that I'm using D3DPRESENT_DONOTWAIT both in the present parameters (PresentationInterval) and in the flags of the call itself (see the sketch below). Because this is a two-screen application this is a problem, as the two monitors do not appear to be genlocked; I'm working around it by running the Present calls through a thread pool. What could the underlying cause of this be?
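The per-monitor present is essentially this sketch:

    // Present one swap chain without blocking. With D3DPRESENT_DONOTWAIT,
    // D3DERR_WASSTILLDRAWING means the previous present is still in flight.
    HRESULT hr = swapChain->Present(NULL, NULL, NULL, NULL, D3DPRESENT_DONOTWAIT);
    if (hr == D3DERR_WASSTILLDRAWING) {
        // skip or retry later instead of blocking this thread
    }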
Are the cards exactly the same (both GeForce 8400 GS), with only the memory size differing? Quite often different memory sizes come with slightly different clock rates (i.e. your card with more memory might use slower memory!).
So the first thing to check would be GPU core & memory clock rates, using something like GPU-Z.
It's an easy test to see if the surface lock is the problem: just comment out the texture update and see if the frame rate returns to 60 Hz. Unfortunately, writing to a locked surface and then updating the resource kills performance; it always has. Are you using mipmaps with the textures? I know DX9 added automatic generation of mipmaps; generating those could be taking a lot of time. If you're constantly locking the same resource each frame, you could also try creating a pool of textures, kind of like triple buffering except with textures: let the renderer use one texture, and on the next update pick the next available texture in the pool that is not currently being used to render (see the sketch below). Unless, of course, you're memory constrained or you're only making diffs to the animated texture.
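A sketch of that texture pool idea (assuming the pool textures were created so they can be locked, e.g. with D3DUSAGE_DYNAMIC; the names are placeholders):

    // Rotate through a small pool so the CPU never locks the texture the GPU
    // is still rendering from (a pool of 3 mirrors triple buffering).
    IDirect3DTexture9 *pool[3];   // created once at startup
    int current = 0;

    // Each animation update:
    current = (current + 1) % 3;
    D3DLOCKED_RECT rect;
    pool[current]->LockRect(0, &rect, NULL, D3DLOCK_DISCARD);
    // ... write this frame's decoded pixels into rect.pBits (pitch = rect.Pitch) ...
    pool[current]->UnlockRect(0);
    device->SetTexture(0, pool[current]);   // render from this one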