Sprite Batching: advanced technique - opengl

I'm making an OpenGL 2 based application that renders over 200 sprites each frame. I would like to use fewer draw calls, since I often render multiple sprites with the same texture. Unfortunately, the regular batching technique does not work for me because of Z-sorting: the draw order of all elements matters, so I can't group them by texture and draw group by group.
I was wondering whether there is another batching technique I could use in this situation. For example, I could modify the shader to work with multiple textures at the same time (though that sounds like a bad decision). Share your knowledge.
UPD 09.10.13: I also thought that atlas textures would reduce draw calls, because they significantly reduce the number of materials.

I found that instanced rendering can speed things up A LOT (rendering 100000 icosahedrons went from 2 FPS with normal rendering to over 60 FPS with instanced rendering). There is a good section, "Instanced Rendering", in the Red Book on this subject. Hope this can be applied to your problem.
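To make the atlas idea from the question a bit more concrete, here is a minimal sketch of order-preserving batching: sort the sprites back to front, then merge only consecutive sprites that share a texture. The `Sprite` type and the `build_batches` helper are hypothetical names used for illustration, and the actual draw calls are left out. It is not the instancing approach from the answer, just one way to keep Z-order intact.

```python
from itertools import groupby
from typing import NamedTuple


class Sprite(NamedTuple):
    z: float       # draw order key, smaller is drawn first (assumption)
    texture: str   # texture (or atlas) the sprite samples from


def build_batches(sprites):
    """Sort by z, then merge consecutive sprites that share a texture.

    Each returned (texture, run) pair would become one draw call.
    Draw order is preserved because only neighbours are merged.
    """
    ordered = sorted(sprites, key=lambda s: s.z)  # sorted() is stable
    return [(tex, list(run)) for tex, run in groupby(ordered, key=lambda s: s.texture)]


sprites = [
    Sprite(0, "grass"), Sprite(1, "tree"), Sprite(2, "grass"),
    Sprite(3, "grass"), Sprite(4, "tree"),
]
batches = build_batches(sprites)

# The same sprites after packing both textures into one atlas.
atlased = [Sprite(s.z, "atlas") for s in sprites]
atlas_batches = build_batches(atlased)
```

With separate textures the runs are grass, tree, grass+grass, tree, so four draw calls instead of five. Once everything lives in a single atlas, every neighbour shares the texture and the whole sorted list collapses into one batch.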

Related

drawing time series of millions of vertices using OpenGL

I'm working on a data visualisation application where I need to draw about 20 different time series overlaid in 2D, each consisting of a few million data points. I need to be able to zoom and pan the data left and right to look at it, and to place cursors on the data for measuring time intervals and inspecting data points. It's very important that when zoomed out all the way, I can easily spot outliers in the data and zoom in to look at them. So averaging the data can be problematic.
I have a naive implementation using a standard GUI framework on Linux, which is way too slow to be practical. I'm thinking of using OpenGL instead (testing on a Radeon RX480 GPU), with an orthographic projection. I searched around and it seems that drawing line strips from VBOs might work, but I have no idea if this is the best solution (the one that gives me the best frame rate).
What is the best way to send data sets consisting of millions of vertices to the GPU, assuming the data does not change, and will be redrawn each time the user interacts with it (pan/zoom/click on it)?
In modern OpenGL (versions 3/4 core profile) VBOs are the standard way to transfer geometric / non-image data to the GPU, so yes you will almost certainly end up using VBOs.
Alternatives would be uniform buffers, or texture buffer objects, but for the application you're describing I can't see any performance advantage in using them - might even be worse - and it would complicate the code.
The single biggest speedup will come from having all the data points stored on the GPU instead of being copied over each frame as a 2D GUI might be doing. So do the simplest thing that works and then worry about speed.
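As a rough sketch of "upload once, then only change the view": the data is packed a single time into an interleaved float buffer (what would be handed to a static VBO), and pan/zoom only recomputes a small orthographic matrix for the visible window. The function names `pack_series`, `ortho_2d` and `to_clip` are made up for this example, and the actual OpenGL upload and draw calls are deliberately omitted.

```python
from array import array


def pack_series(times, values):
    """Interleave (t, y) pairs as C floats, ready for one static buffer upload."""
    flat = array("f")
    for t, y in zip(times, values):
        flat.append(t)
        flat.append(y)
    return flat.tobytes()


def ortho_2d(left, right, bottom, top):
    """Row-major 4x4 orthographic matrix mapping the visible window to [-1, 1]."""
    sx = 2.0 / (right - left)
    sy = 2.0 / (top - bottom)
    tx = -(right + left) / (right - left)
    ty = -(top + bottom) / (top - bottom)
    return [
        [sx, 0.0, 0.0, tx],
        [0.0, sy, 0.0, ty],
        [0.0, 0.0, -1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]


def to_clip(m, x, y):
    """Apply the matrix to a 2D point (z = 0, w = 1), as the vertex shader would."""
    return (m[0][0] * x + m[0][3], m[1][1] * y + m[1][3])


buffer_bytes = pack_series([0.0, 1.0, 2.0], [5.0, 6.0, 7.0])

# Panning or zooming only changes this matrix; the vertex buffer is untouched.
view = ortho_2d(left=10.0, right=20.0, bottom=-1.0, top=1.0)
```

The point of the split is that user interaction costs one small uniform update per frame rather than re-sending millions of vertices.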
If you are new to OpenGL, I recommend the book "OpenGL SuperBible" 6th or 7th edition. There are many good online OpenGL guides and references, just make sure you avoid the older ones written for OpenGL 1/2.
Hope this helps.

Implementing environment mapping and IBL for deferred shading

Our deferred renderer has reached a point where we need to apply environment maps and IBL to push the quality to a higher level (as you can see, the cubemaps are clearly missing):
After a couple of hours of research on this topic I still have not found a solution that makes me really happy.
This is what I found so far:
An additional forward pass whose result is added to the LA-Buffer. This looks like a bad idea somehow: we use deferred shading to avoid multiple passes, and then we render everything again for cubemapping and IBL. But at least we could use the existing GBuffer data (including depth) to render the objects, and the selection of the cubemap to use can easily be done on the CPU.
Render a fullscreen plane and perform some crazy selection of the right cubemap in the shader (doing in the shader what would otherwise be done on the CPU). This seems even worse than rendering everything again, because the GLSL shader is going to be huge. Even if I implement tiled deferred rendering (not done yet), it still seems to be a bad idea.
Also storing the cubemap information directly into the GBuffer of the first pass is not applicable in my renderer. I've used all components of my 3 buffers ( 3 to stay compatible with ES 3.0 ), and already used compression on the color values ( YCoCg ) and the normals ( Spheremap Transform ).
Last but not least, the very simple and not really good solution: use a single cubemap and apply it to the whole scene. This is not really an option, because it would have a huge impact on the quality.
I want to know if another approach exists for environment cubemapping. If not, which of these is the best approach? My personal favorite is the second one so far even if this requires rendering the whole scene again (at least on devices which only support 4 render targets).
After testing different things I found out that the best idea is to use the second solution; however, the implementation is tough. The best approach is to use compute shaders, but these are not supported on most mobile devices these days.
So you need to use a single cubemap on mobile devices or get the data into the buffer in the first render pass. If this is not possible you need to render it tiled with some frustum culling for each tile ( to reduce number of operations on each pixel ).

Average triangles in DirectX 10

So I basically want to check some information for my project.
I have a GTX 460 video card. I wrote a DX10 program that draws 20k triangles on the screen, and I get 28 FPS in a Release build. Each of those triangles calls DrawIndexed itself, so of course there is overhead from issuing so many draw calls.
But anyway, I would like to know: how many triangles could I draw on the screen with this hardware, and at what FPS? I think 20k triangles is not nearly enough to load some good models into a game scene.
Sorry for my terrible English.
Sounds like you are creating a single draw call per triangle primitive. This is very bad, hence the horrid FPS. You should aim to draw as many triangles as possible per draw call. This can be done in a few ways:
Profile your code. Both nVidia and AMD have free tools to help you find why your code is slow, allowing you to focus where it really matters, so use them.
Index buffers & triangle strips to reduce bandwidth
Grouping of verts by material type/state/texture to improve batching
Instancing of primitive groups: draw multiple models/meshs in one call
Remove as much redundant state change (setting of shaders, textures, buffers, parameters) as possible; this goes hand-in-hand with the grouping mentioned earlier
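The grouping and state-change points above can be sketched without any graphics API. In this hypothetical example, `DrawItem`, `count_state_changes` and `sort_by_state` are names invented for illustration; a "state change" is assumed to happen whenever the shader/texture pair differs from the previous item.

```python
from typing import NamedTuple


class DrawItem(NamedTuple):
    shader: str
    texture: str
    mesh: str


def count_state_changes(items):
    """Count how often the (shader, texture) pair changes while walking the list."""
    changes = 0
    current = None
    for item in items:
        state = (item.shader, item.texture)
        if state != current:
            changes += 1
            current = state
    return changes


def sort_by_state(items):
    """Group items so that those sharing shader and texture are adjacent."""
    return sorted(items, key=lambda i: (i.shader, i.texture))


items = [
    DrawItem("lit", "brick", "wall_a"),
    DrawItem("lit", "wood", "crate_a"),
    DrawItem("lit", "brick", "wall_b"),
    DrawItem("unlit", "sky", "dome"),
    DrawItem("lit", "wood", "crate_b"),
]

unsorted_changes = count_state_changes(items)
sorted_changes = count_state_changes(sort_by_state(items))
```

In submission order every item triggers a state change (5 in total); after sorting, only 3 remain. Items that end up adjacent with identical state are also the natural candidates for merging into a single draw call or an instanced call.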
The DX SDK has examples of implementing each of these. The exact number of triangles you can draw at a decent FPS (either 30 or 60 if you want vsync) varies greatly depending on the complexity of shading the triangles. However, if they are drawn as simply as possible, you should be able to push a few million with ease.
I would recommend taking a good look at the innards of an open source DX11 engine (not many DX10 projects exist, but the API is almost identical), such as Hieroglyph 3, and going through the SDK tutorials.
There are also quite a few presentations on increasing performance with DX10, but profile your code before diving into the suggestions full-on, here are a few from the hardware vendors themselves (color coded hints for nVidia vs AMD hardware):
GDC '08
GDC '09

OpenGL Picking from a large set

I'm trying to, in JOGL, pick from a large set of rendered quads (several thousands). Does anyone have any recommendations?
To give you more detail, I'm plotting a large set of data as billboards with procedurally created textures.
I've seen this post OpenGL GL_SELECT or manual collision detection? and have found it helpful. However it can take my program up to several minutes to complete a rendering of the full set, so I don't think drawing 2x (for color picking) is an option.
I'm currently drawing with calls to glBegin/glVertex.../glEnd. If I made the switch to batch rendering on the GPU with VAOs and VBOs, do you think I would get a speedup large enough to make color picking feasible?
If not, given all of the recommendations against using GL_SELECT, do you think it would be worth using it?
I've investigated multithreaded CPU approaches to picking these quads that sidestep OpenGL altogether. Do you think an OpenGL-less CPU solution is the way to go?
Sorry for all the questions. My main question remains: what's a good way to pick from a large set of quads using OpenGL (JOGL)?
The best way to pick from a large number of quads cannot be easily defined. I don't like color picking or similar techniques very much, because they seem too impractical for most situations. I never understood why so many tutorials aimed at people who are new to OpenGL, or even to programming, focus on a kind of picking that is useless for nearly everything. For example: try to get the pixel you clicked on in a heightmap: not possible. Try to locate the exact mesh in a model you clicked on: impractical.
If you have a large number of quads you will probably need good spatial partitioning or at least (better: also) a scene graph. OK, you don't strictly need this, but it helps A LOT. Look at some scene graph tutorials for further information; it's a good thing to know when you start with 3D programming, because you get to know a lot of concepts and not only OpenGL code.
So what to do now to start with some picking? Unproject the position of your mouse cursor using the inverse of your modelview matrix (IIRC with gluUnProject(...)). With the orientation of your camera you can now cast a ray into your spatial structure (or your scene graph that holds a spatial structure). Now check for collisions with your quads. I currently have no link, but if you search for "inverse modelview matrix" you should find some pages that explain this better and in more detail than would be practical here.
With this raycasting based technique you will be able to find your quad in O(log n), where n is the number of quads you have. With some heuristics based on the exact layout of your application (your question is too generic to be more specific) you can improve this a lot for most cases.
An easy spatial structure for this is, for example, a quadtree. However, you should start with the raycasting first to fully understand this technique.
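To show what the "check for collisions with your quads" step might look like, here is a minimal ray/quad test in plain Python. It assumes the ray origin and direction have already been obtained by unprojecting the mouse position, and that each quad is a rectangle given by one corner and two perpendicular edge vectors (reasonable for billboards, but an assumption). The helper names are hypothetical. This version tests every quad, so it is O(n); a quadtree or similar structure would be used to prune candidates first.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def ray_quad(origin, direction, corner, edge_u, edge_v):
    """Return the distance along the ray to the quad, or None on a miss.

    Assumes edge_u and edge_v are perpendicular (a rectangular quad).
    """
    normal = cross(edge_u, edge_v)
    denom = dot(normal, direction)
    if abs(denom) < 1e-9:
        return None  # ray is parallel to the quad's plane
    t = dot(normal, sub(corner, origin)) / denom
    if t < 0.0:
        return None  # plane is behind the ray origin
    hit = tuple(origin[i] + t * direction[i] for i in range(3))
    rel = sub(hit, corner)
    u = dot(rel, edge_u) / dot(edge_u, edge_u)
    v = dot(rel, edge_v) / dot(edge_v, edge_v)
    return t if 0.0 <= u <= 1.0 and 0.0 <= v <= 1.0 else None


def pick(origin, direction, quads):
    """Return the index of the nearest quad hit by the ray, or None."""
    best = None
    for index, (corner, edge_u, edge_v) in enumerate(quads):
        t = ray_quad(origin, direction, corner, edge_u, edge_v)
        if t is not None and (best is None or t < best[0]):
            best = (t, index)
    return None if best is None else best[1]


quads = [
    ((0.0, 0.0, -5.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),  # far quad under the cursor
    ((0.0, 0.0, -2.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),  # near quad under the cursor
    ((3.0, 3.0, -1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),  # closest plane, but off to the side
]
picked = pick((0.25, 0.25, 0.0), (0.0, 0.0, -1.0), quads)
```

Here the ray passes through the first two quads and the nearer one (index 1) is returned; the third quad's plane is closer, but the hit point falls outside its bounds.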
Never faced such problem, but in my opinion, I think the CPU based picking is the best way to try.
If you have a large set of quads, maybe you can group quads by space to avoid testing all quads. For example, you can group the quads in two boxes and firstly test which box you
I just implemented color picking, but glReadPixels is slow here (I've read somewhere that it might be bad for the asynchronous behaviour between GL and the CPU).
Another possibility seems to me using transform feedback and a geometry shader that does the scissor test. The GS can then discard all faces that do not contain the mouse position. The transform feedback buffer contains then exactly the information about hovered meshes.
You probably want to write the depth to the transform feedback buffer too, so that you can find the topmost hovered mesh.
This approach also works nicely with instancing (additionally write the instance id to the buffer).
I haven't tried it yet, but I guess it will be a lot faster than using glReadPixels.
I only found this reference for this approach.
I'm using a solution that I borrowed from the DirectX SDK; there's a nice example of how to detect the selected polygon in a vertex buffer object.
The same algorithm works nice with OpenGL.

How to get games' FPS (with OpenGL) to like 800 FPS

How can we run an OpenGL application (say, a game) at a higher frame rate, like 500 - 800 FPS?
For example, AOE 2 runs at more than 700 FPS (I know it uses DirectX). Even though I just clear buffers and swap buffers within the game loop, I can only get about 200 FPS (max). I know that FPS isn't a good measurement (and it also depends on the hardware), but I feel I missed some concepts in OpenGL. Did I? Can anyone please give me a hint?
I'm getting roughly 5,600 FPS with an empty display loop (GeForce 260 GTX, 1920x1080). Adding glClear lowers it to 4,000 FPS, which is still way over 200...
A simple graphics engine (AoE2 style) should run at about 100-200 FPS (GeForce 8 or similar). Probably more if it's multi-threaded and fully optimized.
I don't know what exactly you do in your loop or what hardware it is running on, but 200 FPS sounds like you are doing something else besides drawing nothing (sleep? game logic stuff? greedy framework? Aero?). The swap-buffers function should not take 5 ms even if both framebuffers have to be copied. You can use a profiler to check where the most CPU time is spent (timing results from gl* functions are mostly useless though).
If you are doing something with OpenGL (drawing stuff, creating textures, etc.) there is a nice extension to measure times called GL_EXT_timer_query.
Some general optimization tips:
don't use immediate mode (glBegin/glEnd), use VBO and/or display lists+vertex arrays instead
use some culling technique to remove objects outside your view (opengl would have to cull every polygon separately)
try minimizing state changes, especially changing the bound texture or vertex buffer
AOE 2 is a DirectDraw application, not Direct3D. There is no way to compare OpenGL and DirectDraw.
Also, check the method you're using for swapping buffers. In Direct3D there are flip method, copy method, and discard method. The best one is discard, which means that you don't care about previous contents in the buffer, and allow the driver to manage them efficiently.
One of the things you seem to miss (judging from your answer/comments, so correct me if I'm wrong) is that you need to determine what to render.
For example as you said you have multiple layers and such, well the first thing you need to do is not render anything that is off screen (which is possible and is sometimes done). What you should also do is not render things that you are certain are not visible, for example if some area of the top layer is not transparent (or filled up) you should not render the layers below it.
In general what I'm trying to say is that it is in most cases better to eliminate invisible things in the logic than to render all things and just let the things on top end up in the rendered image.
If your textures are small, try to combine them in one bigger texture and address them via texture coordinates. That will save you a lot of state changes. If your textures are e.g. 128x128, you can put 16 of them in one 512x512 texture, bringing your texture related state changes down by a factor of 16.
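As a small worked example of addressing sub-textures via texture coordinates, the sketch below computes the UV rectangle of one 128x128 tile inside a 512x512 atlas. The `atlas_uv` function and its row-major tile numbering are assumptions made for illustration; a real atlas may use a different layout or add padding between tiles.

```python
def atlas_uv(index, tile_size=128, atlas_size=512):
    """Return (u0, v0, u1, v1) for a tile, numbering tiles row by row."""
    tiles_per_row = atlas_size // tile_size
    if not 0 <= index < tiles_per_row * tiles_per_row:
        raise IndexError("tile index outside the atlas")
    col = index % tiles_per_row
    row = index // tiles_per_row
    scale = tile_size / atlas_size  # size of one tile in UV space
    return (col * scale, row * scale, (col + 1) * scale, (row + 1) * scale)


tile_count = (512 // 128) ** 2
first = atlas_uv(0)
middle = atlas_uv(5)
last = atlas_uv(15)
```

A 512x512 atlas holds 4 x 4 = 16 such tiles, which is where the factor of 16 comes from. Each sprite then uses its tile's rectangle instead of the full 0..1 range, and the texture only has to be bound once.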