OpenGL rendering/updating loop issues - opengl

I'm wondering how e.g. graphic (/game) engines do their job with lot's of heterogeneous data while a customized simple rendering loop turns into a nightmare when you have some small changes.
Example:
First, let's say we have some blocks in our scene.
Graphic-Engine: create cubes and move them
Customized: create cube template for vertices, normals, etc. copy and translate them to the position and copy e.g. in a vbo. One glDraw* call does the job.
Second, some weird logic. We want block 1, 4, 7, ... to rotate on x-axis, 2, 5, 8, ... on y-axis and 3, 6, 9 on z-axis with a rotation speed linear to the camera distance.
Graphic-Engine: manipulating object's matrix and it works
Customized: (I think) per object glDraw* call with changing model-matrix uniform is not a good idea, so a translation matrix should be something like an attribute? I have to update them every frame.
Third, a block should disappear if the distance to the camera is lower than any const value Q.
Graphic-Engine: if (object.distance(camera) < Q) scene.drop(object);
Customized: (I think) our vbo is invalid and we have to recreate it?
Again to the very first sentence: it feels like engines do those manipulations for free, while we have to rethink how to provide and update data. And while we do so, the engine (might, but I actually don't know) say: 'update whatever you want, at least I'm going to send all matrizes'.
Another Example: What about a voxel-based world (e.g. Minecraft) where we only draw the visible surface, and we are able to throw a bomb and destroy many voxels. If the world's view data is in one huge buffer we only have one glDraw*-call but have to recreate the buffer every time then. If there are smaller chunks, we have many glDraw*-calls and also have to manipulate buffers, which are smaller.
So is it a good deal to send let's say 10MB of buffer update data instead of 2 gl*-calls with 1MB? How many updates are okay? Should a rendering loop deal with lazy updates?
I'm searching for a guide what a 60fps application should be able to update/draw per frame to get a feeling of what is possible. For my tests, every optimization try is another bottleneck.
And I don't want those tutorials which says: hey there is a new cool gl*Instance call which is super-fast, buuuuut you have to check if your gpu supports it. Well, I also rather consider this an optimization than a meaningful implementation at first.
Do you have any ideas, sources, best practices or rule of thumb how a rendering/updating routine best play together?
My questions are all nearly the same:
How many updates per frame are okay on today's hardware?
Can I lazy-load data to have it after a few frames, but without freezing my application
Do I have to do small updates and profile my loop if there are some microseconds left till next rendering?
Maybe I should implement a real-time profiler which gets a feeling over time, how expensive updates are and can determine the amount of updates per frame?
Thank you.

It's unclear how any of your questions relate to your "graphics engines" vs "customized" examples. All the updates you do with a "graphics engines" are translated to those OpenGL calls in the end.
In brief:
How many updates per frame are okay on today's hardware?
Today's PCIe bandwidth is huge (can go as high as 30 GB/s). However, to utilize it in its entirety you have to reduce the number transactions via consolidating OpenGL calls. The exact number of updates entirely depends on the hardware, drivers, and the way you use them, and graphics hardware is diverse.
This is the kind of answer you didn't want to hear, but unfortunately you have to face the truth: to reduce the number of OpenGL calls you have to use the newer version APIs. E.g. instead of setting each uniform individually you are better to submit a bunch of them through uniform shader buffer objects. Instead of submitting each MVP of each model individually, it's better to use instanced rendering. And so on.
An even more radical approach would be to move to a lower-level (and newer) API, i.e. Vulkan, which aims to solve exactly this problem: the cost of submitting work to the GPU.
Can I lazy-load data to have it after a few frames, but without freezing my application
Yes, you can upload buffer objects asynchronously. See Buffer Object Streaming for details.
Do I have to do small updates and profile my loop if there are some microseconds left till next rendering?
Maybe I should implement a real-time profiler which gets a feeling over time, how expensive updates are and can determine the amount of updates per frame?
You don't need any of these if you do it asynchronously.

Related

drawing time series of millions of vertices using OpenGL

I'm working on a data visualisation application where I need to draw about 20 different time series overlayed in 2D, each consisting of a few million data points. I need to be able to zoom and pan the data left and right to look at it and need to be able to place cursors on the data for measuring time intervals and to inspect data points. It's very important that when zoomed out all the way, I can easily spot outliers in the data and zoom in to look at them. So averaging the data can be problematic.
I have a naive implementation using a standard GUI framework on linux which is way too slow to be practical. I'm thinking of using OpenGL instead (testing on a Radeon RX480 GPU), with orthogonal projection. I searched around and it seems VBOs to draw line strips might work, but I have no idea if this is the best solution (would give me the best frame rate).
What is the best way to send data sets consisting of millions of vertices to the GPU, assuming the data does not change, and will be redrawn each time the user interacts with it (pan/zoom/click on it)?
In modern OpenGL (versions 3/4 core profile) VBOs are the standard way to transfer geometric / non-image data to the GPU, so yes you will almost certainly end up using VBOs.
Alternatives would be uniform buffers, or texture buffer objects, but for the application you're describing I can't see any performance advantage in using them - might even be worse - and it would complicate the code.
The single biggest speedup will come from having all the data points stored on the GPU instead of being copied over each frame as a 2D GUI might be doing. So do the simplest thing that works and then worry about speed.
If you are new to OpenGL, I recommend the book "OpenGL SuperBible" 6th or 7th edition. There are many good online OpenGL guides and references, just make sure you avoid the older ones written for OpenGL 1/2.
Hope this helps.

How to render multiple different items in an efficient way with OpenGL

I am making a simple STG engine with OpenGL (To be exact, with LWJGL3).In this game, there can be several different types of items(called bullet) in one frame, and each type can have 10-20 instances.I hope to find an efficient way to render it.
I have read some books about modern OpenGL and find a method called "Instanced Rendering", but it seems only to work with same instances.Should I use for-loop to draw all items directly for my case?
Another question is about memory.Should I create an VBO for each frame, since the number of items is always changing?
Not the easiest question to answer but I'll try my best anyways.
An important property of OpenGL is that the OpenGL context is always bound to a single thread. So every OpenGL-method has to be called within that thread. A common way of dealing with this is using Queuing.
Example:
We are using Model-View-Controller architecture.
We have 3 threads; One to read input, one to handle received messages and one to render the scene.
Here OpenGL context is bound to rendering thread.
The first thread receives a message "Add model to position x". First thread has no time to handle the message, because there might be another message coming right after and we don't want to delay it. So we just give this message for the second thread to handle by adding it to second thread's queue.
Second thread reads the message and performs the required tasks as far as it can before OpenGL context is required. Like reads the Wavefront (.obj)-file from the memory and creates arrays from the received data.
Our second thread then queues this data to our OpenGL thread to handle. OpenGL thread generates VBOs and VAO and stores the data in there.
Back to your question
OpenGL generated Objects stay in the context memory until they are manually deleted or the context is destroyed. So it works kind of like C, where you have to manually allocate memory and free it after it's no more used. So you should not create new Objects for each frame, but reuse the data that stays unchanged. Also when you have multiple objects that use the same model or texture, you should just load that model once and apply all object specific differences on shaders.
Example:
You have an environment with 10 rocks that all share the same rock model.
You load the data, store it in VBOs and attach those VBOs into a VAO. So now you have a VAO defining a rock.
You generate 10 rock entities that all have position, rotation and scale. When rendering, you first bind the shader, then bind the model and texture, then loop through the stone entities and for each stone entity you bind that entity's position, rotation and scale (usually stored in a transformationMatrix) and render.
bind shader
load values to shader's uniform variables that don't change between entities.
bind model and texture (as those stay the same for each rock)
for(each rock in rocks){
load values to shader's uniform variables that do change between each rock, like the transformation.
render
}
unbind shader
Note: You don't need to unbind/bind shader each frame if you only use one shader. Same goes for VAO's and every other OpenGL object as well. So the binding will also stay over each rendering cycle.
Hope this will help you when getting started. Altho I would recommend some tutorial that might have a bit more context to it.
I have read some books about modern OpenGL and find a method called
"Instanced Rendering", but it seems only to work with same
instances.Should I use for-loop to draw all items directly for my
case?
Another question is about memory.Should I create an VBO for each
frame, since the number of items is always changing?
These both depend on the amount of bullets you plan on having. If you think you will have less than a thousand bullets, you can almost certainly push all of them to a VBO each frame and upload and your end users will not notice. If you plan on some obscene amount, then don't do this.
I would say that you should write everything each frame because it's the simplest to do right now, and if you start noticing performance issues then you need to look into instancing or some other method. When you get to "later" you should be more comfortable with OpenGL and find out ways to optimize it that won't be over your head (not saying it is over your head right now, but more experience can only help make it less complex later on).
Culling bullets not on the screen either should be on your radar.
If you plan on having a ridiculous amount of bullets on screen, then you should say so and we can talk about more advanced methods, however my guess is that if you ever reach that limit on today's hardware then you have a large ambitious game with a zoomed out camera and a significant amount of entities on screen, or you are zoomed up and likely have a mess on your screen anyways.
20 objects is nothing. Your program will be plenty fast no matter how you draw them.
When you have 10000 objects, then you'll want to ask for an efficient way.
Until then, draw them whichever way is most convenient. This probably means a separate draw call per object.

Infinite cube world engine (like Minecraft) optimization suggestions?

Voxel engine (like Minecraft) optimization suggestions?
As a fun project (and to get my Minecraft-adict son excited for programming) I am building a 3D Minecraft-like voxel engine using C# .NET4.5.1, OpenGL and GLSL 4.x.
Right now my world is built using chunks. Chunks are stored in a dictionary, where I can select them based on a 64bit X | Z<<32 key. This allows to create an 'infinite' world that can cache-in and cache-out chunks.
Every chunk consists of an array of 16x16x16 block segments. Starting from level 0, bedrock, it can go as high as you want (unlike minecraft where the limit is 256, I think).
Chunks are queued for generation on a separate thread when they come in view and need to be rendered. This means that chunks might not show right away. In practice you will not notice this. NOTE: I am not waiting for them to be generated, they will just not be visible immediately.
When a chunk needs to be rendered for the first time a VBO (glGenBuffer, GL_STREAM_DRAW, etc.) for that chunk is generated containing the possibly visible/outside faces (neighboring chunks are checked as well). [This means that a chunk potentially needs to be re-tesselated when a neighbor has been modified]. When tesselating first the opaque faces are tesselated for every segment and then the transparent ones. Every segment knows where it starts within that vertex array and how many vertices it has, both for opaque faces and transparent faces.
Textures are taken from an array texture.
When rendering;
I first take the bounding box of the frustum and map that onto the chunk grid. Using that knowledge I pick every chunk that is within the frustum and within a certain distance of the camera.
Now I do a distance sort on the chunks.
After that I determine the ranges (index, length) of the chunks-segments that are actually visible. NOW I know exactly what segments (and what vertex ranges) are 'at least partially' in view. The only excess segments that I have are the ones that are hidden behind mountains or 'sometimes' deep underground.
Then I start rendering ... first I render the opaque faces [culling and depth test enabled, alpha test and blend disabled] front to back using the known vertex ranges. Then I render the transparent faces back to front [blend enabled]
Now... does anyone know a way of improving this and still allow dynamic generation of an infinite world? I am currently reaching ~80fps#1920x1080, ~120fps#1024x768 (screenshots: http://i.stack.imgur.com/t4k30.jpg, http://i.stack.imgur.com/prV8X.jpg) on an average 2.2Ghz i7 laptop with a ATI HD8600M gfx card. I think it must be possible to increase the number of frames. And I think I have to, as I want to add entity AI, sound and do bump and specular mapping. Could using Occlusion Queries help me out? ... which I can't really imagine based on the nature of the segments. I already minimized the creation of objects, so there is no 'new Object' all over the place. Also as the performance doesn't really change when using Debug or Release mode, I don't think it's the code but more the approach to the problem.
edit: I have been thinking of using GL_SAMPLE_ALPHA_TO_COVERAGE but it doesn't seem to be working?
gl.Enable(GL.DEPTH_TEST);
gl.Enable(GL.BLEND); // gl.Disable(GL.BLEND);
gl.Enable(GL.MULTI_SAMPLE);
gl.Enable(GL.SAMPLE_ALPHA_TO_COVERAGE);
To render a lot of similar objects, I strongly suggest you take a look into instanced draw : glDrawArraysInstanced and/or glDrawElementsInstanced.
It made a huge difference for me. I'm talking from 2 fps to over 60 fps to render 100000 similar icosahedrons.
You can parametrize your cubes by using Attribs ( glVertexAttribDivisor and friends ) to make them differents. Hope this helps.
It's on ~200fps currently, should be OK. The 3 main things that I've done are:
1) generation of both chunks on a separate thread.
2) tessellation the chunks on a separate thread.
3) using a Deferred Rendering Pipeline.
Don't really think the last one contributed much to the overall performance but had to start using it because of some of the shaders. Now the CPU is sort of falling asleep # ~11%.
This question is pretty old, but I'm working on a similar project. I approached it almost exactly the same way as you, however I added in one additional optimization that helped out a lot.
For each chunk, I determine which sides are completely opaque. I then use that information to do a flood fill through the chunks to cull out the ones that are underground. Note, I'm not checking individual blocks when I do the flood fill, only a precomputed bitmask for each chunk.
When I'm computing the bitmask, I also check to see if the chunk is entirely empty, since empty chunks can obviously be ignored.

Per-line texture processing accelerated with OpenGL/OpenCL

I have a rendering step which I would like to perform on a dynamically-generated texture.
The algorithm can operate on rows independently in parallel. For each row, the algorithm will visit each pixel in left-to-right order and modify it in situ (no distinct output buffer is needed, if that helps). Each pass uses state variables which must be reset at the beginning of each row and persist as we traverse the columns.
Can I set up OpenGL shaders, or OpenCL, or whatever, to do this? Please provide a minimal example with code.
If you have access to GL 4.x-class hardware that implements EXT_shader_image_load_store or ARB_shader_image_load_store, I imagine you could pull it off. Otherwise, in-situ read/write of an image is generally not possible (though there are ways with NV_texture_barrier).
That being said, once you start wanting pixels to share state the way you do, you kill off most of your potential gains from parallelism. If the value you compute for a pixel is dependent on the computations of the pixel to its left, then you cannot actually execute each pixel in parallel. Which means that the only parallelism your algorithm actually has is per-row.
That's not going to buy you much.
If you really want to do this, use OpenCL. It's much friendlier to this kind of thing.
Yes, you can do it. No, you don't need 4.X hardware for that, you need fragment shaders (with flow control), framebuffer objects and floating point texture support.
You need to encode your data into 2D texture.
Store "state variable" in 1st pixel for each row, and encode the rest of the data into the rest of the pixels. It goes without saying that it is recommended to use floating point texture format.
Use two framebuffers, and render them onto each other in a loop using fragment shader that updates "state variable" at the first column, and performs whatever operation you need on another column, which is "current". To reduce amount of wasted resources you can limit rendering to columns you want to process. NVidia OpenGL SDK examples had "game of life", "GDGPU fluid", "GPU partciles" demos that work in similar fashion - by encoding data into texture and then using shaders to update it.
However, because you can do it, it doesn't mean you should do it and it doesn't mean that it is guaranteed to be fast. Some GPUs might have a very high memory texture memory read speed, but relatively slow computation speed (and vice versa) and not all GPUs have many conveyors for processing things in parallel.
Also, depending on your app, CUDA or OpenCL might be more suitable.

How much does glCallList boost the performance?

I'm new to OpenGL. I have written a programm before getting to know OpenGL display lists. By now, it's pretty difficult to exploit them as my code contains many not 'gl' lines everywhere in between. So I'm curious how much I've lost in performance..
It depends, for static geometry, it can help quite a lot, but display lists are being deprecated in favor of VBOs (vertex buffer objects) and similar techniques that allow you to upload geometry data to the card once then just re-issue draw calls that use that data.
Display-lists (which was a nice idea in OpenGL 1.0) have turned out to be difficult to implement correctly in all corner cases, and are being phased out for more straightforward approaches.
However, if all you cache are geometry data, then by all means, lists are ok. But VBOs are the future so I'd suggest to learn those.
Call lists are boosting performance. If your users graphics card has enough memory, and supports Call lists (most of them):
There are several "performance bottlenecks", of which one of them is the pushing speed of data to the graphics card. When you call a OpenGL function, it will either calculate on the CPU or on the GPU (graphics card's processor). The GPU is optimized for graphics operations, but, the CPU must send it's data over to the GPU. This requires a lot of time.
If you use call lists; the data is sent (prematurely) to the GPU, if possible. Then when you call a call list, the CPU won't have to push data to the GPU.
Huge performance boost if you use them.
Corner case:
If you plan to run your app remotely over GLX then display lists can be a huge win because "pushing to the card" involves a costly libgl->network->xserver->libgl trip.