I am developing a rendering engine with OpenGL as the base renderer.
The renderer starts at around 150 fps, and after 30 seconds or so the fps increases to 500.
I have timed each part of the engine separately, and the only part that increases in speed is the drawMesh function, which binds the [static] VBOs and calls glDrawArrays.
I have also commented out the glPush and glGet functions, with the same behavior as a result.
This happens every time I run the engine, even when the camera is not moved and it keeps rendering the exact same scene.
Does anyone have any idea how this can be happening?
The problem
The problem arises from the VBO being mapped to a buffer after being created. The model class does this once to update its boundaries, and in the case of particles to fill the buffer with the required data.
It seems the video card (or at least mine, a GeForce GTS 450) does not copy the data back into VRAM directly after the VBO is unmapped, especially when the buffer was mapped with the GL_READ_WRITE_ARB flag. It keeps the data in system RAM for a few seconds before copying it back into VRAM.
The solution
By using the GL_READ_ONLY_ARB flag for mappings that only read the data, the buffer gets copied back into VRAM almost immediately. In my case, however, it would be even more efficient to calculate the boundaries during the mesh conversion and not touch the data at all once the VBO has been created.
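A minimal sketch of the read-only mapping, assuming a current GL context and an extension loader are already set up; computeBounds() is a hypothetical helper standing in for the boundary calculation:

void updateBounds(GLuint vbo, size_t vertexCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    // GL_READ_ONLY tells the driver the mapping will not modify the store,
    // so it never has to schedule a copy back to VRAM on unmap.
    const float* verts = (const float*)glMapBuffer(GL_ARRAY_BUFFER, GL_READ_ONLY);
    if (verts)
    {
        computeBounds(verts, vertexCount);   // read-only pass over the vertex data
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}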
Maybe it's because shaders are compiled just-in-time before their first use.
Take a look at GL_ARB_get_program_binary.
Also, try rendering a triangle (you can probably do this off screen) with your shaders as you load them during the initialization phase.
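A rough sketch of the binary-caching idea, assuming a driver that exposes ARB_get_program_binary (core in GL 4.1) and an already linked program object called program; the disk I/O and error handling are left out:

// Before linking, ask the driver to keep a retrievable binary:
//   glProgramParameteri(program, GL_PROGRAM_BINARY_RETRIEVABLE_HINT, GL_TRUE);

GLint length = 0;
glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);

std::vector<char> binary(length);
GLenum format = 0;
glGetProgramBinary(program, length, NULL, &format, binary.data());
// ...write "binary" and "format" to disk for the next run...

// On the next start-up, load the cached blob instead of compiling from source:
glProgramBinary(program, format, binary.data(), (GLsizei)binary.size());

GLint ok = 0;
glGetProgramiv(program, GL_LINK_STATUS, &ok);   // fall back to source if the driver rejects the blob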
Related
This article is commonly referenced when anyone asks about streaming textures (e.g. video) in OpenGL.
It says:
To maximize the streaming transfer performance, you may use multiple pixel buffer objects. The diagram shows that 2 PBOs are used simultaneously; glTexSubImage2D() copies the pixel data from a PBO while the texture source is being written to the other PBO.
For nth frame, PBO 1 is used for glTexSubImage2D() and PBO 2 is used to get new texture source. For n+1th frame, 2 pixel buffers are switching the roles and continue to update the texture. Because of asynchronous DMA transfer, the update and copy processes can be performed simultaneously. CPU updates the texture source to a PBO while GPU copies texture from the other PBO.
They provide a simple benchmark program which lets you cycle between texture updates with no PBO, with a single PBO, and with two PBOs used as described above.
I see a slight performance improvement when enabling one PBO.
But the second PBO makes no real difference.
Right before the code maps the PBO with glMapBuffer, it calls glBufferData with the pointer set to NULL. It does this to avoid a sync stall.
// map the buffer object into client's memory
// Note that glMapBufferARB() causes sync issue.
// If GPU is working with this buffer, glMapBufferARB() will wait(stall)
// for GPU to finish its job. To avoid waiting (stall), you can call
// first glBufferDataARB() with NULL pointer before glMapBufferARB().
// If you do that, the previous data in PBO will be discarded and
// glMapBufferARB() returns a new allocated pointer immediately
// even if GPU is still working with the previous data.
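For reference, the per-frame update the article describes boils down to roughly this (my paraphrase, not the benchmark's exact code; pbo[], textureId, WIDTH, HEIGHT, DATA_SIZE and updatePixels() follow the benchmark's naming):

static int index = 0;
int nextIndex = (index + 1) % 2;

// Start the DMA copy from the PBO that was filled last frame into the texture.
glBindTexture(GL_TEXTURE_2D, textureId);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[index]);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, WIDTH, HEIGHT, GL_BGRA, GL_UNSIGNED_BYTE, 0);

// Meanwhile, fill the other PBO with the next frame's pixels on the CPU.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[nextIndex]);
glBufferData(GL_PIXEL_UNPACK_BUFFER, DATA_SIZE, NULL, GL_STREAM_DRAW);   // orphan the old store
GLubyte* ptr = (GLubyte*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
if (ptr)
{
    updatePixels(ptr, DATA_SIZE);   // CPU-side write into the fresh store
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
}
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

index = nextIndex;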
So, here is my question...
Doesn't this make the second PBO completely useless? Just a waste of memory!?
With two PBOs, the texture data is stored three times: once in the texture, and once in each PBO.
With a single PBO, there are two copies of the data, and temporarily a third only in the event that glMapBuffer creates a new buffer because the existing one is currently being DMA'ed to the texture.
The comments seem to suggest that OpenGL drivers are internally capable of creating that second buffer if and only when it is required to avoid stalling the pipeline: the in-use buffer keeps being DMA'ed, and my call to map yields a new buffer for me to write to.
The author of that article appears to be more knowledgeable in this area than I am. Have I completely misunderstood the point?
Answering my own question... but I won't accept it as an answer (yet).
There are many problems with the benchmark program linked in the question. It uses immediate mode. It uses GLUT!
The program was spending most of its time doing things we are not interested in profiling, mainly rendering text via GLUT and writing pretty stripes to the texture, so I removed those functions.
I cranked the texture resolution up to 8K and added more PBO modes:
No PBO (yields 6 fps)
1 PBO, orphaning the previous buffer (yields 12.2 fps)
2 PBOs, orphaning the previous buffer (yields 12.2 fps)
1 PBO, NOT orphaning the previous buffer (possible stall; added by myself; yields 12.4 fps)
2 PBOs, NOT orphaning the previous buffer (possible stall; added by myself; yields 12.4 fps)
If anyone else would like to examine my code, it is available here.
I have experimented with different texture sizes and different updatePixels functions... I cannot, despite my best efforts, get the double-PBO implementation to perform any better than the single-PBO implementation.
Furthermore, NOT orphaning the previous buffer actually yields better performance, exactly the opposite of what the article claims.
Perhaps modern drivers/hardware do not suffer from the problem this design is attempting to fix...
Perhaps my graphics hardware/driver is buggy and not taking advantage of the double PBO...
Perhaps the commonly referenced article is completely wrong?
Who knows...
My test hardware is Intel(R) HD Graphics 5500 (Broadwell GT2).
I have no experience with Direct3D, so I may just be looking in the wrong places. However, I would like to convert a program I have written in OpenGL (using FreeGLUT) into a Windows IoT compatible UWP app (running Direct3D 12, 'cause it's cool). I'm trying to port my program to a Raspberry Pi 3 and I don't want to convert to Linux.
Through the examples provided by Microsoft I have figured out most of what I believe I need to know to get started, but I can't figure out how to share a dynamic data buffer between the CPU and GPU.
What I want to know how to do:
Create a CPU/GPU shared circular buffer
Read and Draw with the GPU
Write / Replace sections with the CPU
Quick semi-pseudo code:
while (!buffer.inUse()) {                        // only touch the buffer while the GPU isn't using it
    updateBuffer(buffer.id, data, start, end);   // insert data into buffer
    drawToScreen(buffer.id);                     // draw using vertex data in buffer
}
This was previously done in OpenGL by simply using glBegin()/glEnd() and glVertex3f() for each value in an array when it wasn't being written to.
Update: I basically want a Direct3D 12 equivalent of OpenGL's glBufferSubData() for editing a VBO, if that makes more sense.
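Something like this sketch is what I'm after, assuming vertexBuffer is an ID3D12Resource created on an upload heap (D3D12_HEAP_TYPE_UPLOAD) and that fencing elsewhere guarantees the GPU is not reading the range being overwritten (unlike glBufferSubData, the driver will not synchronize this for you):

#include <d3d12.h>
#include <cstring>

void updateVertices(ID3D12Resource* vertexBuffer, const void* data, size_t offset, size_t bytes)
{
    D3D12_RANGE readRange = { 0, 0 };   // we will not read the existing contents
    void* mapped = nullptr;
    if (SUCCEEDED(vertexBuffer->Map(0, &readRange, &mapped)))
    {
        memcpy(static_cast<char*>(mapped) + offset, data, bytes);
        D3D12_RANGE written = { offset, offset + bytes };
        vertexBuffer->Unmap(0, &written);   // tell the driver which bytes changed
    }
}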
Update 2: I found that I can get away with discarding the vertex buffer every frame and re-uploading a new buffer to the GPU. There's a fair amount of overhead, as one would expect when transferring 10,000-200,000 doubles every frame. So I'm trying to find a way to use constant buffers to pass the 5-10 updated vertices into the shader, so I can copy from the constant buffer into the vertex buffer with the shader and not have to map/unmap every frame. This way my circular buffer on the CPU is independent of the buffer being used on the GPU, but they will both share the same information through periodic updates. I'll do some more looking and post another, more specific question on shaders if I don't find a solution.
I'm writing a 3D application for the PlayBook, which has a PowerVR SGX 540. I noticed that if I stuff enough data into a VBO through OpenGL, I can cause the device to crash (not just the application, but the entire device, requiring a hard reboot). To cause the crash, I sent data for a model with ~300k triangles and ~150k vertices. I sent normal data for the vertices as well.
I found that the problem doesn't occur if I send less data (I tried another model with half the triangles and vertices). Also, the issue doesn't occur if I use vertex arrays (though they are incredibly slow).
I'd like to know:
Is what I'm seeing a common result for mobile hardware? That is, is a 300k-triangle model with 150k vertices and normals overkill?
Can I check how much memory I have available for VBO usage, other than by testing a bunch of different model sizes (it takes a good five minutes to recover the device from a crash)?
Could anything else be causing this issue? I've provided some additional information:
I'm using Qt for my GUI, and I draw the 3D scene to an FBO before it's painted to the GUI (I haven't checked yet whether redoing all this without a UI, by creating an EGL window and drawing to that, reproduces the problem -- that'll take a while).
To verify it wasn't me using OpenGL poorly, I tried both using raw OpenGL calls for all the 3D work and doing everything with OpenSceneGraph. Both methods fail in exactly the same way (the VBO works with less data, vertex arrays work, and more VBO data causes a crash).
The program works fine on my desktop. Unfortunately, I don't have any other mobile devices I can test my application out on.
OpenGL ES 2.0 only supports unsigned byte and unsigned short (16-bit) index types, so if you're using an index array with ~150k vertices, you're well over the 65,535 limit (unless the GL_OES_element_index_uint extension is available).
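A sketch of how you might handle the limit at draw time; indexCount and mesh are placeholders, drawInShortIndexedBatches() is a hypothetical helper, and the element array buffer is assumed to be bound:

// On ES 2.0, 32-bit indices need the GL_OES_element_index_uint extension.
const char* ext = (const char*)glGetString(GL_EXTENSIONS);
bool hasUintIndices = (ext != NULL) && strstr(ext, "GL_OES_element_index_uint") != NULL;

if (hasUintIndices)
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);
else
    // otherwise split the mesh into batches that each reference <= 65535 vertices
    drawInShortIndexedBatches(mesh);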
How can we run an OpenGL application (say a game) at a higher frame rate, like 500-800 FPS?
For example, AoE 2 runs at more than 700 FPS (I know it uses DirectX). Even though I just clear buffers and swap buffers within the game loop, I can only get about 200 FPS at most. I know that FPS isn't a good measurement (and it also depends on the hardware), but I feel I've missed some concepts in OpenGL. Have I? Can anyone please give me a hint?
I'm getting roughly 5,600 FPS with an empty display loop (GeForce GTX 260, 1920x1080). Adding glClear lowers it to 4,000 FPS, which is still way over 200...
A simple graphics engine (AoE2 style) should run at about 100-200 FPS (GeForce 8 or similar). Probably more if it's multi-threaded and fully optimized.
I don't know exactly what you do in your loop or what hardware it is running on, but 200 FPS sounds like you are doing something else besides drawing nothing (sleep? game logic? a greedy framework? Aero?). The swap-buffers call should not take 5 ms even if both framebuffers have to be copied. You can use a profiler to check where most of the CPU time is spent (though timing results from gl* functions are mostly useless).
If you are doing something with OpenGL (drawing stuff, creating textures, etc.), there is a nice extension for measuring times called GL_EXT_timer_query.
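A fragment showing the idea, written with the later core/ARB names (GL 3.3 timer queries); reading the result immediately stalls, so real code would pick it up a frame or two later. drawScene() is a placeholder:

GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
drawScene();                                    // the GL work you want to time
glEndQuery(GL_TIME_ELAPSED);

GLuint64 elapsedNs = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);   // waits for the GPU to finish
printf("GPU time: %.3f ms\n", elapsedNs / 1.0e6);
glDeleteQueries(1, &query);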
Some general optimization tips:
don't use immediate mode (glBegin/glEnd); use VBOs and/or display lists + vertex arrays instead
use some culling technique to remove objects outside your view (otherwise OpenGL has to clip every polygon separately)
try to minimize state changes, especially changes to the bound texture or vertex buffer (see the sketch below)
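As an illustration of the last point, a sketch of sorting draw calls by texture so each texture is bound once per frame rather than once per object; DrawItem is a made-up struct, and a compatibility-profile context with an extension loader is assumed:

#include <algorithm>
#include <vector>

struct DrawItem { GLuint texture; GLuint vbo; GLsizei count; };

void drawSorted(std::vector<DrawItem>& items)
{
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) { return a.texture < b.texture; });

    glEnableClientState(GL_VERTEX_ARRAY);
    GLuint boundTexture = 0;
    for (const DrawItem& item : items)
    {
        if (item.texture != boundTexture)   // bind only when the texture actually changes
        {
            glBindTexture(GL_TEXTURE_2D, item.texture);
            boundTexture = item.texture;
        }
        glBindBuffer(GL_ARRAY_BUFFER, item.vbo);
        glVertexPointer(3, GL_FLOAT, 0, 0);
        glDrawArrays(GL_TRIANGLES, 0, item.count);
    }
    glDisableClientState(GL_VERTEX_ARRAY);
}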
AOE 2 is a DirectDraw application, not Direct3D. There is no way to compare OpenGL and DirectDraw.
Also, check the method you're using for swapping buffers. In Direct3D there are the flip, copy, and discard methods. The best one is discard, which means that you don't care about the previous contents of the buffer and allow the driver to manage it efficiently.
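For example, in D3D9 terms the discard method is requested through the present parameters (sketch only; the rest of the device setup is omitted):

D3DPRESENT_PARAMETERS pp = {};
pp.Windowed             = TRUE;
pp.SwapEffect           = D3DSWAPEFFECT_DISCARD;          // don't preserve back-buffer contents
pp.BackBufferFormat     = D3DFMT_UNKNOWN;                  // use the current display format (windowed mode)
pp.PresentationInterval = D3DPRESENT_INTERVAL_IMMEDIATE;   // don't wait for vsync when measuring raw FPS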
One of the things you seem to miss (judging from your answer/comments, so correct me if I'm wrong) is that you need to determine what to render.
For example, as you said, you have multiple layers. The first thing you need to do is not render anything that is off screen (which is possible and is sometimes done). You should also not render things you are certain are not visible; for example, if some area of the top layer is not transparent (or is filled), you should not render the layers below it.
In general, what I'm trying to say is that in most cases it is better to eliminate invisible things in the logic than to render everything and just let the things on top end up in the final image.
If your textures are small, try combining them into one bigger texture and addressing them via texture coordinates. That will save you a lot of state changes. If your textures are, say, 128x128, you can put 16 of them in one 512x512 texture, bringing your texture-related state changes down by a factor of 16.
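For example, with 128x128 tiles packed into a 512x512 atlas, the sub-texture is selected purely by texture coordinates:

const int ATLAS_SIZE = 512, TILE_SIZE = 128;
const int TILES_PER_ROW = ATLAS_SIZE / TILE_SIZE;   // 4 columns x 4 rows = 16 tiles

void tileUV(int tileIndex, float* u0, float* v0, float* u1, float* v1)
{
    int   col  = tileIndex % TILES_PER_ROW;
    int   row  = tileIndex / TILES_PER_ROW;
    float step = (float)TILE_SIZE / ATLAS_SIZE;     // 0.25 of the atlas per tile
    *u0 = col * step;  *v0 = row * step;            // one corner of the tile
    *u1 = *u0 + step;  *v1 = *v0 + step;            // the opposite corner
}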
We have a two-screen DirectX application that previously ran at a consistent 60 FPS (the monitors' sync rate) using an NVIDIA 8400 GS (256 MB). However, when we swapped out the card for one with 512 MB of RAM, the frame rate struggles to get above 40 FPS. (It only gets this high because we're using triple buffering.) The two cards are from the same manufacturer (PNY). All other things are equal: this is a Windows XP Embedded application and we started from a fresh image for each card. The driver version number is 169.21.
The application is all 2D, i.e. just a bunch of textured quads and a whole lot of pre-rendered graphics (hence the need to upgrade the card's memory). We also have compressed animations which the CPU decodes on the fly; this involves a texture lock. The locks take forever, but I've also tried having a separate system-memory texture for the CPU to update and then updating the rendered texture using the device's UpdateTexture method. No overall difference in performance.
Although I've read through every FAQ I can find on the internet about DirectX performance, this is still the first time I've worked on a DirectX project so any arcane bits of knowledge you have would be useful. :)
One other thing whilst I'm on the subject: when calling Present on the swap chains, it seems DirectX waits for the present to complete regardless of the fact that I'm using D3DPRESENT_DONOTWAIT in both the present parameters (PresentationInterval) and the flags of the call itself. Because this is a two-screen application this is a problem, as the two monitors do not appear to be genlocked; I'm working around it by running the Present calls through a thread pool. What could the underlying cause of this be?
Are the cards exactly the same (both GeForce 8400 GS), with only the memory size differing? Quite often different memory sizes come with slightly different clock rates (i.e. your card with more memory might use slower memory!).
So the first thing to check would be the GPU core and memory clock rates, using something like GPU-Z.
It's an easy test to see if the surface lock is the problem: just comment out the texture update and see if the frame rate returns to 60 Hz. Unfortunately, writing to a locked surface and updating the resource kills performance; it always has.
Are you using mipmaps with the textures? I know DX9 added automatic generation of mipmaps; that could be taking a lot of time.
If you're constantly locking the same resource each frame, you could also try creating a pool of textures, kind of like triple buffering except with textures. You let the renderer use one texture, and on the next update you pick the next available texture in the pool that isn't currently being used for rendering -- unless of course you're memory constrained or you're only making diffs to the animated texture.
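A rough sketch of that texture-pool idea, assuming D3D9 (to match the lock/UpdateTexture calls in the question) and textures created with D3DUSAGE_DYNAMIC so D3DLOCK_DISCARD is legal; decodeAnimationFrameInto() is a placeholder for the CPU-side decode:

#include <d3d9.h>

const int POOL_SIZE = 3;
IDirect3DTexture9* pool[POOL_SIZE];   // assumed created elsewhere with D3DUSAGE_DYNAMIC
int next = 0;

IDirect3DTexture9* acquireUpdatedTexture()
{
    IDirect3DTexture9* tex = pool[next];
    next = (next + 1) % POOL_SIZE;    // round-robin: the GPU keeps rendering from the others

    D3DLOCKED_RECT rect;
    if (SUCCEEDED(tex->LockRect(0, &rect, NULL, D3DLOCK_DISCARD)))
    {
        decodeAnimationFrameInto(rect.pBits, rect.Pitch);   // hypothetical CPU-side decode
        tex->UnlockRect(0);
    }
    return tex;   // set this one on the device for the next draw
}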