Can't understand the concept of merge-instancing - opengl

I was reading slides from a presentation that talked about "merge-instancing" (the presentation is by Emil Persson; see slide 19 onward).
I can't understand what's going on. I only know instancing from OpenGL, and I thought it could only draw the same mesh multiple times. Can somebody explain? Does it work differently in DirectX?

Instancing: You upload a mesh to the GPU once and activate its buffers whenever you want to render it. Data is not duplicated.
Merging: You want to create a mesh from multiple smaller meshes (such as the building complex in the example), so you either:
Draw each complex using instancing, which means multiple draw calls per complex, or
Merge the instances into a single mesh, which replicates the vertices and other data for each complex but lets you render the whole complex with a single draw call.
Merge-instancing: You create the complex by referencing the vertices of the instances that take part in it, then use those vertex references to know where to fetch the data for each instance. This way you get the advantage of instancing (each mesh is uploaded to the GPU once) and the benefit of merging (you draw the whole complex with a single draw call).
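To make the lookup concrete, here is a minimal CPU-side sketch in C++ of what a vertex shader might do with a flat vertex ID. It is an illustration under assumptions, not the exact scheme from the slides: the names `Part`, `fetchMergedVertex` and `mergedVertexCount` are hypothetical, and the per-instance data is reduced to a translation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// One placed copy of a shared mesh inside the merged "complex" (hypothetical layout).
struct Part {
    std::size_t firstVertex;  // where the shared mesh starts in the vertex pool
    std::size_t vertexCount;  // how many vertices the shared mesh has
    Vec3 offset;              // per-instance data (here: just a translation)
};

// Emulates the shader-side lookup: find the part the flat vertex ID falls
// into, fetch the shared vertex, then apply that part's instance data.
Vec3 fetchMergedVertex(const std::vector<Vec3>& pool,
                       const std::vector<Part>& parts,
                       std::size_t vertexId) {
    for (const Part& p : parts) {
        if (vertexId < p.vertexCount) {
            assert(p.firstVertex + vertexId < pool.size());
            const Vec3& v = pool[p.firstVertex + vertexId];
            return {v.x + p.offset.x, v.y + p.offset.y, v.z + p.offset.z};
        }
        vertexId -= p.vertexCount;  // skip this part, keep searching
    }
    assert(false && "vertexId out of range");
    return {0.0f, 0.0f, 0.0f};
}

// Number of vertices a single draw call would cover for the whole complex.
std::size_t mergedVertexCount(const std::vector<Part>& parts) {
    std::size_t n = 0;
    for (const Part& p : parts) n += p.vertexCount;
    return n;
}
```

The vertex pool holds each mesh once, while the list of parts plays the role of the merged mesh: one draw over `mergedVertexCount(parts)` vertices covers every part. A real implementation would probably replace the linear search with a precomputed lookup table.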


How should you efficiently batch complex meshes?

What is the best way to render complex meshes? I describe several solutions below and would like your opinion on them.
Let's take an example: how should one render the 'Crytek-Sponza' mesh?
PS: I do not use an ubershader, only separate shaders.
If you download the mesh from the following link:
and load it in Blender, you'll see that the whole mesh is composed of about 400 sub-meshes, each with its own materials/textures.
A naive renderer (version 1) renders each of the 400 sub-meshes separately! That means (to simplify the situation) 400 draw calls, each with its own material/texture binding. Very bad for performance. Very slow!
pseudo-code version_1:
foreach mesh in meshList //400 iterations :(!
    Material material = mesh->GetMaterial();
    Shader bsdf = ShaderManager::GetBSDFByMaterial(material);
    bsdf->SetTexture(material->GetTexture()); //Bind texture
Now, if we look at the materials being loaded, we notice that the Sponza actually uses ONLY (if I remember correctly :)) 25 different materials!
So a smarter solution (version 2) is to gather all the vertex/index data into batches (25 in our example) and to store the VBO/IBO not in the sub-mesh classes but in a new class called Batch.
pseudo-code version_2:
foreach batch in batchList //25 iterations :)!
    Material material = batch->GetMaterial();
    Shader bsdf = ShaderManager::GetBSDFByMaterial(material);
    bsdf->SetTexture(material->GetTexture()); //Bind texture
In this case each VBO contains only data that shares exactly the same texture/material settings!
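As an illustration of the version 2 idea, here is a small C++ sketch that groups sub-meshes by material on the CPU before anything is uploaded. The `SubMesh` and `Batch` layouts, the integer material handle and the `buildBatches` function are assumptions made for this example, not part of any real loader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical CPU-side sub-mesh as it might come out of a model loader.
struct SubMesh {
    int materialId;                      // assumed integer material handle
    std::vector<float> vertices;         // tightly packed vertex data
    std::vector<std::uint32_t> indices;  // indices local to this sub-mesh
};

// All geometry that shares one material, ready to be uploaded as one VBO/IBO.
struct Batch {
    std::vector<float> vertices;
    std::vector<std::uint32_t> indices;
};

std::map<int, Batch> buildBatches(const std::vector<SubMesh>& subMeshes,
                                  std::size_t floatsPerVertex) {
    assert(floatsPerVertex > 0);
    std::map<int, Batch> batches;
    for (const SubMesh& sm : subMeshes) {
        Batch& b = batches[sm.materialId];
        // Indices must be rebased because the vertices are appended
        // after whatever the batch already contains.
        const auto base =
            static_cast<std::uint32_t>(b.vertices.size() / floatsPerVertex);
        b.vertices.insert(b.vertices.end(), sm.vertices.begin(), sm.vertices.end());
        for (std::uint32_t i : sm.indices) b.indices.push_back(base + i);
    }
    return batches;
}
```

With 400 sub-meshes spread over 25 materials this yields 25 batches, so the render loop binds each material once and issues one draw per batch.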
That's so much better! But I think 25 VBOs to render the Sponza is still too many! The problem is the number of buffer bindings needed to render it. I think a good solution would be to allocate a new VBO only when the current one is 'full' (for example, let's assume that the maximum size of a VBO, defined as an attribute of the VBO class, is 4MB or 8MB).
pseudo-code version_3:
foreach vbo in vboList //for example 5 VBOs (depends on the maxVBOSize)
    BatchList batchList = vbo->GetBatchList();
    foreach batch in batchList
        Material material = batch->GetMaterial();
        Shader bsdf = ShaderManager::GetBSDFByMaterial(material);
        bsdf->SetTexture(material->GetTexture()); //Bind texture
In this case each VBO does not necessarily contain data that shares exactly the same texture/material settings! It depends on the sub-mesh loading order!
So OK, there are fewer VBO/IBO bindings but not necessarily fewer draw calls! (Do you agree with this statement?) But in general I think version 3 is better than the previous one! What do you think?
Another optimization would be to store all the textures (or groups of textures) of the Sponza model in texture array(s)! But if you download the Sponza package you will see that the textures all have different sizes! So I think they can't be bound together because of their format differences.
But if it is possible, version 4 of the renderer would need far fewer texture bindings than the 25 bindings currently used for the whole mesh! Do you think it's possible?
So, in your opinion, what is the best way to render the Sponza mesh? Do you have another suggestion?
You are focused on the wrong things, in two ways.
First, there's no reason you can't stick all of the mesh's vertex data into a single buffer object. Note that this has nothing to do with batching. Remember: batching is about the number of draw calls, not the number of buffers you use. You can render 400 draw calls out of the same buffer.
This "maximum size" that you seem to want to have is a fiction, based on nothing from the real world. You can have it if you want. Just don't expect it to make your code faster.
So when rendering this mesh, there is no reason to be switching buffers at all.
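As a rough sketch of what "one buffer, many draw calls" can look like on the CPU side, the C++ below concatenates several meshes and records where each one lives. The types `MeshData`, `DrawRange` and `PackedMeshes` and the function `packMeshes` are hypothetical names for this example, and the tightly packed float vertex layout is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU-side copy of one mesh (assumed layout: tightly packed floats).
struct MeshData {
    std::vector<float> vertices;
    std::vector<std::uint32_t> indices;
};

// Where one mesh lives inside the shared vertex and index arrays.
struct DrawRange {
    std::uint32_t firstIndex;
    std::uint32_t indexCount;
    std::int32_t baseVertex;
};

struct PackedMeshes {
    std::vector<float> vertices;         // upload once as the single vertex buffer
    std::vector<std::uint32_t> indices;  // upload once as the single index buffer
    std::vector<DrawRange> ranges;       // one entry per original mesh
};

PackedMeshes packMeshes(const std::vector<MeshData>& meshes,
                        std::size_t floatsPerVertex) {
    assert(floatsPerVertex > 0);
    PackedMeshes out;
    for (const MeshData& m : meshes) {
        assert(m.vertices.size() % floatsPerVertex == 0);
        DrawRange r;
        r.firstIndex = static_cast<std::uint32_t>(out.indices.size());
        r.indexCount = static_cast<std::uint32_t>(m.indices.size());
        r.baseVertex = static_cast<std::int32_t>(out.vertices.size() / floatsPerVertex);
        out.ranges.push_back(r);
        // Indices stay local to each mesh; baseVertex shifts them at draw time.
        out.vertices.insert(out.vertices.end(), m.vertices.begin(), m.vertices.end());
        out.indices.insert(out.indices.end(), m.indices.begin(), m.indices.end());
    }
    return out;
}
```

Each recorded range can then be drawn from the same pair of buffers with a base-vertex draw call (for instance glDrawElementsBaseVertex in OpenGL, or the BaseVertexLocation argument of DrawIndexed in Direct3D 11), so no buffer switch is needed between sub-meshes.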
Second, batching is not really about the number of draw calls (in OpenGL). It's really about the cost of the state changes between draw calls.
This video clearly spells out (about 31 minutes in) the relative cost of different state changes. Issuing two draw calls with no state changes between them is cheap (relatively speaking). But different kinds of state changes have different costs.
The cost of changing buffer bindings is quite small (assuming you're using the separate vertex format API, ARB_vertex_attrib_binding, so that changing buffers doesn't mean changing vertex formats). The cost of changing programs and even texture bindings is far greater. So even if you had to make multiple buffer objects (which, again, you don't have to), that's not going to be the primary bottleneck.
So if performance is your goal, you'd be better off focusing on the expensive state changes, not the cheap ones. Make a single shader that can handle all of the material settings for the entire mesh, so that you only need to change uniforms between them. Use array textures so that you only have one texture binding call. This will turn a texture bind into a uniform setting, which is a much cheaper state change.
There are even fancier things you can do, involving base instance counts and the like. But that's overkill for a trivial example like this.
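To illustrate the point about ordering work around the expensive state changes, here is a small C++ sketch that sorts draw items so that program switches are grouped first, then texture binds, then buffer binds, and counts how many binds each ordering needs. `DrawItem`, `StateChanges` and both functions are hypothetical, and the cost ranking simply follows the ordering described above rather than any measurement.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical description of one draw call by the state it needs.
struct DrawItem {
    int program;
    int texture;
    int buffer;
};

struct StateChanges {
    int programs = 0;
    int textures = 0;
    int buffers = 0;
};

// Counts how many binds of each kind a renderer would issue if it drew
// the items in order and only rebound state that actually changed.
StateChanges countStateChanges(const std::vector<DrawItem>& items) {
    StateChanges c;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i == 0 || items[i].program != items[i - 1].program) ++c.programs;
        if (i == 0 || items[i].texture != items[i - 1].texture) ++c.textures;
        if (i == 0 || items[i].buffer != items[i - 1].buffer) ++c.buffers;
    }
    return c;
}

// Sort key: most expensive state first, cheapest last.
void sortByStateCost(std::vector<DrawItem>& items) {
    const int programsBefore = countStateChanges(items).programs;
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return std::tie(a.program, a.texture, a.buffer) <
                         std::tie(b.program, b.texture, b.buffer);
              });
    // Grouping by program can never increase the number of program switches.
    assert(countStateChanges(items).programs <= programsBefore);
    (void)programsBefore;
}
```

The draw count is the same before and after sorting; only the number of state changes between the draws goes down, which is the quantity this answer argues matters.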

DirectX Adding Multiple Meshes to a Single Vertex Buffer

I'm fairly new to DirectX. I have what I think should be a pretty simple question, but I can't seem to find an answer to it anywhere.
Basically, I'd like to know how to add vertices from multiple meshes to a single vertex buffer. This would only happen once per mesh as the program is initialized, so I believe I want DEFAULT usage.
Is it possible to add each mesh to the buffer individually, or do I need to collect them all in a single array and pass them all at once? Default or Dynamic? Map/Unmap or UpdateSubresource? Thanks
For now I am using an index buffer and drawing once per object (horrible, I know), but I am planning on switching to instancing as soon as I figure this out.

Best way to wrap OpenGL models

In short: What is the "preferred" way to wrap OpenGL's buffers, shaders and/or matrices required for a more high level "model" object?
I am trying to write a tiny graphics engine in C++ built on core OpenGL 3.3, and I would like to implement as clean a solution as possible for wrapping a higher-level "model" object, which would contain its vertex buffer, global position/rotation, textures (and maybe also a shader?) and potentially other information.
I have looked into this open source engine, called GamePlay3D and don't quite agree with many aspects of its solution to this problem. Is there any good resource that discusses this topic for modern OpenGL? Or is there some simple and clean way to do this?
That depends a lot on what you want to be able to do with your engine. Also note that these concepts are the same with DirectX (or any other graphics API), so don't focus your search too much on OpenGL. Here are a few points that are very common in a 3D engine (names can differ):
A mesh contains submeshes, each submesh contains a vertex buffer and an index buffer. The idea being that each submesh will use a different material (for example, in the mesh of a character, there could be a submesh for the body and one for the clothes.)
An instance (or mesh instance) references a mesh, a list of materials (one for each submesh in the mesh), and contains the "per instance" shader uniforms (world matrix etc.), usually grouped in a uniform buffer.
Material: (This part changes a lot depending on the complexity of the engine). A basic version would contain some textures, some render states (blend state, depth state), a shader program, and some shader uniforms that are common to all instances (for example a color, but that could also be in the instance depending on what you want to do.)
More complex versions usually separate the materials into passes (or sometimes techniques that contain passes), each of which contains everything in the previous paragraph. You can check the Ogre3D documentation for more information and to take a look at one possible implementation. There's also a very good article called Designing a Data-Driven Renderer in GPU PRO 3 that describes an even more flexible (but also more complex) system based on the same idea.
Scene: (I call it a scene here, but it could really be called anything.) It provides the shader parameters and textures from the environment (lighting values, environment maps, that kind of thing).
And I think that's it for the basics. With that in mind, you should be able to find your way around the code of any open-source 3D engine if you want the implementation details.
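The concepts above could be sketched in C++ roughly as follows. This is only one possible shape under assumptions: `GpuHandle` stands in for whatever object names your API uses, the field choices are illustrative, and `makeInstance` is a hypothetical helper, not code taken from any engine.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

using GpuHandle = std::uint32_t;  // stand-in for a GL/D3D object name

struct SubMesh {
    GpuHandle vertexBuffer;
    GpuHandle indexBuffer;
    std::uint32_t indexCount;
};

struct Mesh {
    std::vector<SubMesh> subMeshes;  // one per material slot
};

struct Material {
    GpuHandle shaderProgram;
    std::vector<GpuHandle> textures;
    bool blendingEnabled;  // simplified render state
    bool depthTestEnabled;
    float color[4];        // uniform shared by all instances
};

struct MeshInstance {
    std::shared_ptr<const Mesh> mesh;
    std::vector<std::shared_ptr<const Material>> materials;  // one per sub-mesh
    float worldMatrix[16];                                   // per-instance uniform
};

struct Scene {
    float lightDirection[3];  // environment parameters
    GpuHandle environmentMap;
    std::vector<MeshInstance> instances;
};

// Builds an instance with an identity world matrix and checks that
// every sub-mesh has a material assigned.
MeshInstance makeInstance(std::shared_ptr<const Mesh> mesh,
                          std::vector<std::shared_ptr<const Material>> materials) {
    assert(mesh && materials.size() == mesh->subMeshes.size());
    MeshInstance inst{std::move(mesh), std::move(materials), {}};
    for (int i = 0; i < 4; ++i) inst.worldMatrix[i * 5] = 1.0f;  // diagonal entries
    return inst;
}
```

Because instances only hold shared pointers, many instances can reference the same mesh while swapping materials per sub-mesh, as in the body/clothes example.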
This is in addition to Jerem's excellent answer.
At a low level, there is no such thing as a "model", there is only buffer data and the code used to process it. At a high level, the concept of a "model" will differ from application to application. A chess game would have a static mesh for each chess piece, with shared textures and materials, but a first-person shooter could have complicated models with multiple parts, swappable skins, hit boxes, rigging, animations, et cetera.
Case study: chess
For chess, there are six pieces and two colors. Let's over-engineer the graphics engine to show how it could be done if you needed to draw, say, thousands of simultaneous chess games on the same screen, instead of just one game. Here is how you might do it.
Store all models in one big buffer. This buffer has all of the vertex and index data for all six models clumped together. This means that you never have to switch buffers / VAOs when you're drawing pieces. Also, this buffer never changes, except when the user goes into settings and chooses a different style for the chess pieces.
Create another buffer containing the current location of each piece in the game, the color of each piece, and a reference to the model for that piece. This buffer is updated every frame.
Load the necessary textures. Maybe the normals would be in one texture, and the diffuse map would be an array texture with one layer for white and another for black. The textures are designed so you don't have to change them while you're drawing chess pieces.
To draw all the pieces, you just have to update one buffer, and then call glMultiDrawElementsIndirect()... once per frame, and it draws all of the chess pieces. If that's not available, you can fall back to glDrawElements() or something else.
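A hedged sketch of the per-frame bookkeeping for those steps: the C++ below builds one indirect draw command per chess model from a list of pieces. The command struct mirrors the field layout OpenGL documents for glMultiDrawElementsIndirect; `ModelRange`, `Piece`, `kModelCount` and `buildCommands` are hypothetical names, and no GL calls are made here.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Same field order as the command structure read by glMultiDrawElementsIndirect.
struct DrawElementsIndirectCommand {
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t baseVertex;
    std::uint32_t baseInstance;
};

// Where one piece model lives inside the big shared buffer.
struct ModelRange {
    std::uint32_t firstIndex;
    std::uint32_t indexCount;
    std::int32_t baseVertex;
};

// Per-instance data that would be uploaded every frame.
struct Piece {
    int model;  // 0..5: which of the six piece models
    float x, y;
    int color;  // 0 = white, 1 = black (texture array layer)
};

constexpr int kModelCount = 6;

// Reorders pieces so that instances of the same model are contiguous,
// then emits one command per model that is actually on the board.
std::vector<DrawElementsIndirectCommand> buildCommands(
        const std::array<ModelRange, kModelCount>& models,
        std::vector<Piece>& pieces) {
    std::stable_sort(pieces.begin(), pieces.end(),
                     [](const Piece& a, const Piece& b) { return a.model < b.model; });
    std::vector<DrawElementsIndirectCommand> commands;
    std::uint32_t i = 0;
    while (i < pieces.size()) {
        const int model = pieces[i].model;
        assert(model >= 0 && model < kModelCount);
        std::uint32_t j = i;
        while (j < pieces.size() && pieces[j].model == model) ++j;
        const ModelRange& r = models[static_cast<std::size_t>(model)];
        // baseInstance points at the first piece of this model in the sorted array.
        commands.push_back({r.indexCount, j - i, r.firstIndex, r.baseVertex, i});
        i = j;
    }
    return commands;
}
```

The sorted `pieces` array doubles as the per-instance buffer, and the returned commands would be uploaded to the indirect buffer before the single multi-draw call.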
You can see how this kind of design won't work for everything.
What if you have to stream new models into memory, and remove old ones?
What if the models have different size textures?
What if the models are more complex, with animations or forward kinematics?
What about translucent models?
What about hit boxes and physics data?
What about different LODs?
The problem here is that your solution, and even the very concept of what a "model" is, will be very different depending on what your needs are.

OpenGL height-map painting using CUDA VBO

I've asked several questions about VBOs here previously, and from the comments I received I decided that a new approach was needed.
To put it simply: I'm trying to draw the Mandelbrot set, which is defined on a large float array of around 512x512 points. The purpose of my program is to let the user control the zoom and the world's orientation (it's a 3D model).
So far I've painted the entire thing using GL_TRIANGLE_STRIP, which turned out to be a bad choice because it paints slowly, and also because my painting style (the order of the glVertex calls) became impossible to code for VBOs.
So I've got several questions.
Even after this description I'm not sure whether a VBO is the best choice, because it's up to the user to control the calculations. For each calculation the user triggers, I have to recompute the Mandelbrot set (~60ms) and recopy the points to the buffer, a process which takes some time (?ms).
The program also allows the user to move around the world; no calculations are done in that case, so a VBO is an excellent choice there.
1. What's the best way to paint a height map (where each cell in the array holds only the height)? Can I apply it to a VBO and transfer it to CUDA (cudaRegisterBuffer or something like that)? Is there a way to distinguish between the modes and decide when VBOs are needed (in no-calculation mode) and when they aren't (in calculation mode)?
You don't need to copy the CUDA data each frame if you bind the CUDA array/VBO to the DirectX/OpenGL vertex buffer (refer to the CUDA Programming Guide for details). One way to render data as a height-field is to use the geometry shader to emit the triangles based on the height-field. Another way is to use the height-field as a parallax map (see the DirectX SDK). My personal favorite would be to make your height-field an array of positions (X/Y/Z) and use CUDA to modify only the Y values, then use an index buffer to define the polygons that compose the surface. Note that you'll also need to update the vertex normals, and you may also want to use XYZ/UV if you want to texture the surface. If 512x512 is too big, use raster ops (texture sampling) to populate a lower-resolution height-field of the region of interest. You can do this stage in CUDA or OpenGL/DirectX (I'd recommend doing it in CUDA, where you can easily write your own sampling kernel to look up pixels when down-sampling).
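As a sketch of the "positions plus index buffer" layout, the C++ below builds the grid vertices and the triangle indices once, and then rewrites only the Y component when the heights change. All names are hypothetical, and `applyHeights` is a plain CPU loop standing in for the CUDA kernel; the interop calls themselves are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vertex { float x, y, z; };

// Grid positions in the XZ plane; Y is the height and is filled in later.
std::vector<Vertex> makeGridVertices(std::uint32_t width, std::uint32_t depth) {
    std::vector<Vertex> verts;
    verts.reserve(static_cast<std::size_t>(width) * depth);
    for (std::uint32_t z = 0; z < depth; ++z)
        for (std::uint32_t x = 0; x < width; ++x)
            verts.push_back({static_cast<float>(x), 0.0f, static_cast<float>(z)});
    return verts;
}

// Two triangles per grid cell; built once because the topology never changes.
std::vector<std::uint32_t> makeGridIndices(std::uint32_t width, std::uint32_t depth) {
    assert(width >= 2 && depth >= 2);
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(width - 1) * (depth - 1) * 6);
    for (std::uint32_t z = 0; z + 1 < depth; ++z) {
        for (std::uint32_t x = 0; x + 1 < width; ++x) {
            const std::uint32_t i0 = z * width + x;  // this row, this column
            const std::uint32_t i1 = i0 + 1;         // this row, next column
            const std::uint32_t i2 = i0 + width;     // next row, this column
            const std::uint32_t i3 = i2 + 1;         // next row, next column
            indices.insert(indices.end(), {i0, i2, i1, i1, i2, i3});
        }
    }
    return indices;
}

// Stand-in for the CUDA kernel: only the Y component is rewritten.
void applyHeights(std::vector<Vertex>& verts, const std::vector<float>& heights) {
    assert(verts.size() == heights.size());
    for (std::size_t i = 0; i < verts.size(); ++i) verts[i].y = heights[i];
}
```

For a 512x512 grid this gives 262,144 vertices and 511 * 511 * 6 indices; only the heights need to be recomputed when the user zooms, while camera movement touches neither buffer.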

Renderer Efficiency

Ok, I have a renderer class which has all kinds of special functions called by the rest of the program:
About 30 more...
Each of these functions calls glBegin/glEnd separately, which I know can be very inefficient (it's even deprecated). So anyway, I am planning a total rewrite of the renderer, and I need to know the most efficient way to set up the functions so that when something calls one, it draws everything at once, or does whatever else it needs to do to run most efficiently. Thanks in advance :)
The efficient way to render is generally to use VBOs (vertex buffer objects) to store your vertex data, but that is only really meaningful if you are rendering (mostly) static data.
Without knowing more about what your application is supposed to render, it's hard to say how you should structure it. But ideally, you should never draw individual primitives, but rather draw the contents (or a subset) of a vertex buffer.
The most efficient way is not to expose such low-level methods at all. Instead, what you want to do is build a scene graph, which is a data structure that contains a representation of the entire scene. You update the scene graph in your "update" method, then render the whole thing in one go in your "render" method.
Another, slightly different approach is to re-build the entire scene graph each frame. This has the advantage that once the scene graph is composed, it doesn't change. So you can call your "render" method on another thread while your "update" method is going through and constructing the scene for the next frame at the same time.
Many of the more advanced effects are simply not possible without a complete scene graph. You can't do shadow mapping, for instance (which requires you to render the scene multiple times from different angles), and you can't do deferred rendering. Anything that relies on sorted draw order (e.g. alpha-blending) also becomes very difficult.
From your method names, it looks like you're working in 2D, so while shadow mapping is probably not high on your feature list, alpha-blending and deferred rendering might be.
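To give a feel for the "compose the scene, then render it in one go" idea, here is a deliberately small C++ sketch. It uses a flat per-frame list rather than a full graph, and `RenderItem` and `FrameScene` are hypothetical names; it only shows how collecting everything first makes a sorted draw order possible.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical description of one thing to draw this frame.
struct RenderItem {
    int spriteId;
    float depth;       // larger = farther from the viewer
    bool translucent;  // needs alpha-blending
};

// A flat per-frame scene: filled during "update", frozen, then drawn.
class FrameScene {
public:
    void add(const RenderItem& item) {
        assert(!frozen_ && "scene is immutable once handed to the renderer");
        items_.push_back(item);
    }

    // Opaque items first (submission order), then translucent items
    // back to front so that blending composites correctly.
    void freeze() {
        auto firstTranslucent = std::stable_partition(
            items_.begin(), items_.end(),
            [](const RenderItem& i) { return !i.translucent; });
        std::stable_sort(firstTranslucent, items_.end(),
                         [](const RenderItem& a, const RenderItem& b) {
                             return a.depth > b.depth;
                         });
        frozen_ = true;
    }

    // The order in which the "render" method would issue draws.
    const std::vector<RenderItem>& drawOrder() const {
        assert(frozen_);
        return items_;
    }

private:
    std::vector<RenderItem> items_;
    bool frozen_ = false;
};
```

Because a frozen scene no longer changes, a renderer could in principle consume it on another thread while a second `FrameScene` is being filled for the next frame.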