OpenGL 3.3 vertex buffer deletion before frame finished - opengl

This is an advanced OpenGL question and, to be honest, it seems more like a driver bug. I know the standard explicitly states that deleting an object only deletes its name, so a generator function can return the same name again. However, it's not clear how to deal with this...
The situation is the following: I have a so called "transient" (C++) object (TO from now on), which generates GL objects, enqueues commands using them, then deletes them.
Now consider that I use more than one of this kind before I call SwapBuffers(). The following happens:
TO 1. generates a vertex buffer named VBO1, along with a VAO1 and other things
TO 1. calls some mapping/drawing commands with VBO1
TO 1. deletes the VAO1 and VBO1 (therefore the name VBO1 is freed)
TO 2. generates a vertex buffer object, now of course with the same name (VBO1) as the name 1 is deleted and available, along with another VAO (probably 1)
TO 2. calls some other mapping/drawing commands with this new VBO1 (different vertex positions, etc.)
TO 2. deletes the new VBO1
SwapBuffers()
And the result is: only the modifications performed by TO 1. are in effect. In a nutshell: I wanted to render a triangle, then a square, but I only got the triangle.
Workaround: not deleting the VBO, so I get a new name in TO 2. (VBO2)
I would like to ask for your help in this matter. I'm aware that I shouldn't delete/generate objects mid-frame, but aside from that, this "buggy" mechanism really disturbs me (I mean, how can I trust GL then? Short answer: I can't...)
(side note: I've been programming 3D graphics for 12 years, but this thing really gave me the creeps...)

I have similar problems with my multithreaded rendering code. I use a double buffering system for the render commands, so when I delete an object, it might be used in the next frame.
The short of it is that TO shouldn't directly delete the GL objects. It needs to submit the handle to a manager to queue for deletion between frames. With my double buffering, I add a small timer to count down 2 frames before releasing.
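The deferred deletion described above could be sketched roughly like this. This is only an illustration, not the poster's actual manager: `DeferredDeleter`, `enqueue` and `onFrameEnd` are made-up names, and the delete callback stands in for whatever really frees the object (for example a call to glDeleteBuffers).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: collects GL object names and releases them only after
// a fixed number of frame boundaries have passed.
class DeferredDeleter {
public:
    using Handle = std::uint32_t;
    using DeleteFn = std::function<void(Handle)>;

    DeferredDeleter(int framesToWait, DeleteFn deleteFn)
        : framesToWait_(framesToWait), deleteFn_(std::move(deleteFn)) {
        assert(framesToWait_ >= 0);
    }

    // Called by the transient object instead of deleting the object directly.
    void enqueue(Handle h) { pending_.push_back({h, framesToWait_}); }

    // Called once per frame, between frames (for example right after the swap).
    void onFrameEnd() {
        std::vector<Entry> keep;
        for (Entry& e : pending_) {
            if (--e.framesLeft <= 0) {
                deleteFn_(e.handle);  // in a real renderer: glDeleteBuffers(1, &handle)
            } else {
                keep.push_back(e);
            }
        }
        pending_.swap(keep);
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    struct Entry { Handle handle; int framesLeft; };
    int framesToWait_;
    DeleteFn deleteFn_;
    std::vector<Entry> pending_;
};
```

With `framesToWait` set to 2, a handle queued during frame N is released at the end of frame N+1, which matches the two-frame countdown mentioned above.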
For my transient verts, I have a large chunk of memory that I write to for storage, and skip the VBO submission. I don't know what your setup is or how many vertices you are pushing, but you may not benefit from VBOs if you 1) regenerate every frame or 2) push small sets of verts. Definitely perf test with and without VBOs.

I found the cause of the problem, and I think it's worth mentioning (so that other developers won't fall into the same hole). The actual problem is the VAO, or more precisely the caching of the VAO.
In Metal and Vulkan the input layout is completely independent of the actual buffers used: you only specify the binding point (location) where the buffer is going to be.
But not in OpenGL... the VAO actually holds a strong reference to the vertex buffer that was bound when its attribute pointers were set up. Therefore the following happened:
VBO1 was created, VAO1 was created
VAO1 was cached in the pipeline cache
VBO1 was deleted, but only the name was freed, not the object
glGenBuffers() returns 1 again as the name is available
but VAO1 in the cache still references the old VBO1
the driver gets confused and doesn't let me modify the new VBO1
And the solution...well... For now when a vertex buffer gets deleted I delete any cached pipelines that reference that buffer.
In the long term, though, I'm going to maintain a separate cache for input layouts (even if it's part of the pipeline state), and move the transient object further up, so that it becomes less transient.
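For anyone who wants to picture the short-term fix, here is a rough sketch of a cache that drops every entry referencing a buffer that is about to be deleted. The names (`PipelineCache`, `CachedPipeline`, `invalidateBuffer`) are invented for the example and the key type is an assumption; the real cache in my renderer looks different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical cache entry: a VAO name plus the buffer name it was built against.
struct CachedPipeline {
    std::uint32_t vao;
    std::uint32_t vertexBuffer;
};

class PipelineCache {
public:
    void insert(std::uint64_t key, CachedPipeline p) {
        assert(p.vao != 0);  // 0 is never a generated object name
        entries_[key] = p;
    }

    const CachedPipeline* find(std::uint64_t key) const {
        auto it = entries_.find(key);
        return it == entries_.end() ? nullptr : &it->second;
    }

    // Call this right before the buffer is deleted. Returns the number of
    // evicted entries. A real renderer would also delete each evicted VAO here.
    std::size_t invalidateBuffer(std::uint32_t buffer) {
        std::size_t evicted = 0;
        for (auto it = entries_.begin(); it != entries_.end();) {
            if (it->second.vertexBuffer == buffer) {
                it = entries_.erase(it);
                ++evicted;
            } else {
                ++it;
            }
        }
        return evicted;
    }

private:
    std::unordered_map<std::uint64_t, CachedPipeline> entries_;
};
```

Once the stale entry is gone, a recycled buffer name can no longer be matched against a VAO that still points at the old buffer object.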
Welcome to the world of OpenGL...

Related

How to update vertex buffer data frequently in DirectX 11?

I am trying to update my vertex buffer data with the Map function in DirectX. It does update the data once, but if I do it repeatedly the model disappears. I am trying to manipulate vertices in real time based on user input, and to do so I have to update the vertex buffer every frame while a vertex is selected.
Perhaps this happens because the Map function disables GPU access to the vertices until the Unmap function is called. If access is blocked every frame, it kind of makes sense that the mesh cannot be rendered. However, when I update the vertex every frame and then stop after some time, theoretically the mesh should show up again, but it doesn't.
I know that the proper way to update data every frame is to use constant buffers, but manipulating vertices with constant buffers might not be a good idea, and I don't think there is any other way to update the vertex data. I expect dynamic vertex buffers to be able to handle being updated every frame.
D3D11_MAPPED_SUBRESOURCE mappedResource;
ZeroMemory(&mappedResource, sizeof(D3D11_MAPPED_SUBRESOURCE));
// Disable GPU access to the vertex buffer data.
pRenderer->GetDeviceContext()->Map(pVBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedResource);
// Update the vertex buffer here.
memcpy((Vertex*)mappedResource.pData + index, pData, sizeof(Vertex));
// Reenable GPU access to the vertex buffer data.
pRenderer->GetDeviceContext()->Unmap(pVBuffer, 0);
As has already been answered, the key issue is that you are using Discard (which means you won't be able to retrieve the previous contents from the GPU). I thought I would add a little in terms of options.
The question I have is whether you require performance or the convenience of having the data in one location?
There are a few configurations you can try.
1. Set up your buffer to have both CPU read and write access. This does mean you will be pushing and pulling your buffer up and down the bus. In the end, it also causes performance issues on the GPU such as blocking (waiting for the data to be moved back onto the GPU). I personally don't use this in my editor.
2. If memory is not an issue, set up a copy of your buffer on the CPU side, and each frame map with Discard and block copy the data across. This is performant, but also memory intensive. You obviously have to manage the data partitioning and indexing into this space. I don't use this, but I toyed with it; too much effort!
3. You bite the bullet, map the buffer as per 2, and write each vertex object into the mapped buffer. I do this, and unless the buffer is freaking huge, I haven't had an issue with it in my own editor.
4. Use a compute shader to update the buffer: create a resource view and an access view and pass the updates via a constant buffer. A bit of a sledgehammer to crack a walnut. And it still doesn't change the fact that you may need to pull the data back off the GPU as per item 1.
There are some variations on managing the buffer, such as interleaving you can play with also (2 copies, one on GPU while the other is being written to) which you can try also. There are some rather ornate mechanisms such as building the content of the buffer in another thread and then flagging the update.
At the end of the day, DX 11 doesn't offer the ability (someone might know better) to edit the data in GPU memory directly; there is a lot of shifting between CPU and GPU.
Good luck on which ever technique you choose.
Mapping buffer with D3D11_MAP_WRITE_DISCARD flag will cause entire buffer content to become invalid. You can not use it to update just a single vertex. Keep buffer on the CPU side instead and then update entire buffer on GPU side once per frame.
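A minimal sketch of that idea, kept free of D3D11 headers so it can be compiled anywhere: the CPU copy is edited one vertex at a time, and the whole array is copied into the mapped memory in one go. `VertexShadow` and its members are made-up names; in a real application the `mapped` pointer would be `mappedResource.pData` obtained from Map with D3D11_MAP_WRITE_DISCARD, followed by Unmap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

struct Vertex { float x, y, z; };

// CPU-side shadow copy that stays valid across frames.
class VertexShadow {
public:
    explicit VertexShadow(std::size_t count) : vertices_(count) {}

    // Edit one vertex on the CPU; nothing is sent to the GPU yet.
    void set(std::size_t index, const Vertex& v) {
        assert(index < vertices_.size());
        vertices_[index] = v;
        dirty_ = true;
    }

    bool dirty() const { return dirty_; }
    std::size_t byteSize() const { return vertices_.size() * sizeof(Vertex); }

    // Copy the WHOLE array into freshly mapped memory. After a discard map the
    // old GPU contents are gone, so a partial write is not enough.
    void uploadAll(void* mapped) {
        assert(mapped != nullptr);
        if (!vertices_.empty()) {
            std::memcpy(mapped, vertices_.data(), byteSize());
        }
        dirty_ = false;
    }

private:
    std::vector<Vertex> vertices_;
    bool dirty_ = true;  // the first frame needs a full upload
};
```

Checking `dirty()` before mapping lets a frame with no edits skip the Map/Unmap pair entirely.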
If you develop for UWP - use of map/unmap may result in sync problems. ID3D11DeviceContext methods are not thread safe: https://learn.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-render-multi-thread-intro.
If you update buffer from one thread and render from another - you may get different errors. In this case you must use some synchronization mechanism, such as critical sections. Example is here https://developernote.com/2015/11/synchronization-mechanism-in-directx-11-and-xaml-winrt-application/

How to render multiple different items in an efficient way with OpenGL

I am making a simple STG engine with OpenGL (to be exact, with LWJGL3). In this game, there can be several different types of items (called bullets) in one frame, and each type can have 10-20 instances. I hope to find an efficient way to render them.
I have read some books about modern OpenGL and found a method called "Instanced Rendering", but it seems to work only with identical instances. Should I use a for-loop to draw all items directly in my case?
Another question is about memory. Should I create a VBO for each frame, since the number of items is always changing?
Not the easiest question to answer but I'll try my best anyways.
An important property of OpenGL is that the OpenGL context is always bound to a single thread. So every OpenGL-method has to be called within that thread. A common way of dealing with this is using Queuing.
Example:
We are using Model-View-Controller architecture.
We have 3 threads; One to read input, one to handle received messages and one to render the scene.
Here OpenGL context is bound to rendering thread.
The first thread receives a message "Add model to position x". First thread has no time to handle the message, because there might be another message coming right after and we don't want to delay it. So we just give this message for the second thread to handle by adding it to second thread's queue.
Second thread reads the message and performs the required tasks as far as it can before OpenGL context is required. Like reads the Wavefront (.obj)-file from the memory and creates arrays from the received data.
Our second thread then queues this data to our OpenGL thread to handle. OpenGL thread generates VBOs and VAO and stores the data in there.
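The hand-off to the OpenGL thread in the last step could look roughly like the sketch below. `RenderQueue`, `post` and `drain` are names made up for this example; the only assumption is that the thread owning the context calls `drain()` once per frame.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical queue: any thread may post work, but only the thread that owns
// the OpenGL context calls drain().
class RenderQueue {
public:
    using Task = std::function<void()>;

    void post(Task task) {
        assert(task);
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(std::move(task));
    }

    // Runs every queued task in posting order and returns how many ran.
    std::size_t drain() {
        std::vector<Task> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(tasks_);
        }
        // The lock is released here, so a task may safely post() more work.
        for (Task& t : batch) t();
        return batch.size();
    }

private:
    std::mutex mutex_;
    std::vector<Task> tasks_;
};
```

A loader thread would post a lambda that captures the parsed arrays, and the lambda body (run later on the render thread) would generate the VBOs and VAO.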
Back to your question
OpenGL generated Objects stay in the context memory until they are manually deleted or the context is destroyed. So it works kind of like C, where you have to manually allocate memory and free it after it's no more used. So you should not create new Objects for each frame, but reuse the data that stays unchanged. Also when you have multiple objects that use the same model or texture, you should just load that model once and apply all object specific differences on shaders.
Example:
You have an environment with 10 rocks that all share the same rock model.
You load the data, store it in VBOs and attach those VBOs into a VAO. So now you have a VAO defining a rock.
You generate 10 rock entities that all have position, rotation and scale. When rendering, you first bind the shader, then bind the model and texture, then loop through the stone entities and for each stone entity you bind that entity's position, rotation and scale (usually stored in a transformationMatrix) and render.
bind shader
load values to shader's uniform variables that don't change between entities.
bind model and texture (as those stay the same for each rock)
for(each rock in rocks){
    load values to shader's uniform variables that do change between each rock, like the transformation.
    render
}
unbind shader
Note: You don't need to unbind/bind shader each frame if you only use one shader. Same goes for VAO's and every other OpenGL object as well. So the binding will also stay over each rendering cycle.
Hope this will help you when getting started. Although I would recommend a tutorial that might have a bit more context to it.
"I have read some books about modern OpenGL and found a method called "Instanced Rendering", but it seems to work only with identical instances. Should I use a for-loop to draw all items directly in my case?"
"Another question is about memory. Should I create a VBO for each frame, since the number of items is always changing?"
These both depend on the amount of bullets you plan on having. If you think you will have less than a thousand bullets, you can almost certainly push all of them to a VBO each frame and upload and your end users will not notice. If you plan on some obscene amount, then don't do this.
I would say that you should write everything each frame because it's the simplest to do right now, and if you start noticing performance issues then you need to look into instancing or some other method. When you get to "later" you should be more comfortable with OpenGL and find out ways to optimize it that won't be over your head (not saying it is over your head right now, but more experience can only help make it less complex later on).
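As a rough idea of what "write everything each frame" means in practice, here is a sketch that rebuilds one CPU-side vertex list for all bullets. The struct layouts and the name `buildBulletVertices` are assumptions for the example; the resulting array is what you would upload to a single VBO and draw with one call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Bullet { float x, y; int type; };  // `type` selects the sprite (illustrative)
struct SpriteVertex { float x, y, u, v; int type; };

// Rebuilds the whole vertex list every frame: two triangles (6 vertices) per
// bullet. `out` is reused between frames so its allocation is kept.
void buildBulletVertices(const std::vector<Bullet>& bullets, float halfSize,
                         std::vector<SpriteVertex>& out) {
    assert(halfSize > 0.0f);
    out.clear();
    out.reserve(bullets.size() * 6);
    for (const Bullet& b : bullets) {
        const float left = b.x - halfSize, right = b.x + halfSize;
        const float bottom = b.y - halfSize, top = b.y + halfSize;
        const SpriteVertex quad[6] = {
            {left, bottom, 0.0f, 0.0f, b.type},
            {right, bottom, 1.0f, 0.0f, b.type},
            {right, top, 1.0f, 1.0f, b.type},
            {left, bottom, 0.0f, 0.0f, b.type},
            {right, top, 1.0f, 1.0f, b.type},
            {left, top, 0.0f, 1.0f, b.type},
        };
        out.insert(out.end(), quad, quad + 6);
    }
}
```

Because the bullet type travels with each vertex, different bullet kinds can share the same draw call as long as the shader can pick the right sprite from it (for example from an atlas).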
Culling bullets that are not on the screen should also be on your radar.
If you plan on having a ridiculous amount of bullets on screen, then you should say so and we can talk about more advanced methods, however my guess is that if you ever reach that limit on today's hardware then you have a large ambitious game with a zoomed out camera and a significant amount of entities on screen, or you are zoomed up and likely have a mess on your screen anyways.
20 objects is nothing. Your program will be plenty fast no matter how you draw them.
When you have 10000 objects, then you'll want to ask for an efficient way.
Until then, draw them whichever way is most convenient. This probably means a separate draw call per object.

Why is using multiple Pixel buffer Objects advised. Surely it is redundant?

This article is commonly referenced when anyone asks about video streaming textures in OpenGL.
It says:
To maximize the streaming transfer performance, you may use multiple pixel buffer objects. The diagram shows that 2 PBOs are used simultaneously; glTexSubImage2D() copies the pixel data from a PBO while the texture source is being written to the other PBO.
For nth frame, PBO 1 is used for glTexSubImage2D() and PBO 2 is used to get new texture source. For n+1th frame, 2 pixel buffers are switching the roles and continue to update the texture. Because of asynchronous DMA transfer, the update and copy processes can be performed simultaneously. CPU updates the texture source to a PBO while GPU copies texture from the other PBO.
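The role swap the article describes boils down to simple index arithmetic. The sketch below is my own illustration of it, not code from the article; `PboRoles` and `pboRolesForFrame` are made-up names.

```cpp
#include <cassert>
#include <cstddef>

struct PboRoles {
    std::size_t uploadIndex;  // PBO used as the source for glTexSubImage2D()
    std::size_t writeIndex;   // PBO the CPU fills with the next frame's pixels
};

// Roles for a given frame when `pboCount` buffers are used round-robin.
// With pboCount == 1 both roles fall on the same buffer.
inline PboRoles pboRolesForFrame(std::size_t frame, std::size_t pboCount) {
    assert(pboCount >= 1);
    const std::size_t upload = frame % pboCount;
    const std::size_t write = (frame + 1) % pboCount;
    return PboRoles{upload, write};
}
```

The buffer written during frame n is the one uploaded from during frame n+1, which is the overlap the article is counting on.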
They provide a simple bench-mark program which allows you to cycle between texture updates without PBO's, with a single PBO, and with two PBO's used as described above.
I see a slight performance improvement when enabling one PBO.
But the second PBO makes no real difference.
Right before the code glMapBuffer's the PBO, it calls glBufferData with the pointer set to NULL. It does this to avoid a sync-stall.
// map the buffer object into client's memory
// Note that glMapBufferARB() causes sync issue.
// If GPU is working with this buffer, glMapBufferARB() will wait(stall)
// for GPU to finish its job. To avoid waiting (stall), you can call
// first glBufferDataARB() with NULL pointer before glMapBufferARB().
// If you do that, the previous data in PBO will be discarded and
// glMapBufferARB() returns a new allocated pointer immediately
// even if GPU is still working with the previous data.
So, Here is my question...
Doesn't this make the second PBO completely useless? Just a waste of memory !?
With two PBO's the texture data is stored 3 times. 1 in the texture, and one in each PBO.
With a single PBO, there are two copies of the data, plus a temporary third only when glMapBuffer creates a new buffer because the existing one is currently being DMA'ed to the texture?
The comments seem to suggest that OpenGL drivers are internally capable of creating the second buffer if, and only when, it is required to avoid stalling the pipeline. The in-use buffer is being DMA'ed, and my call to map yields a new buffer for me to write to.
The author of that article appears to be more knowledgeable in this area than I am. Have I completely misunderstood the point?
Answering my own question... But I won't accept it as an answer... (YET).
There are many problems with the benchmark program linked to in the question. It uses immediate mode. It uses GLUT!
The program was spending most of its time doing things we are not interested in profiling. Mainly rendering text via GLUT, and writing pretty stripes to the texture. So I have removed those functions.
I cranked the texture resolution up to 8K, and added more PBO modes.
No PBO (yields 6 fps).
1 PBO. Orphan previous buffer (yields 12.2 fps).
2 PBOs. Orphan previous buffer (yields 12.2 fps).
1 PBO. DON'T orphan previous PBO (possible stall; added by myself; yields 12.4 fps).
2 PBOs. DON'T orphan previous PBO (possible stall; added by myself; yields 12.4 fps).
If anyone else would like to examine my code, it is available here
I have experimented with different texture sizes... and different updatePixels functions... I cannot, despite my best efforts get the double PBO implementation to perform any better than the single-PBO implementation.
Furthermore... NOT orphaning the previous buffer actually yields better performance. Exactly the opposite of what the article claims.
Perhaps modern drivers / hardware do not suffer from the problem that this design is attempting to fix...
Perhaps my graphics hardware / driver is buggy, and not taking advantage of the double-PBO...
Perhaps the commonly referenced article is completely wrong?
Who knows. . . .
My test hardware is Intel(R) HD Graphics 5500 (Broadwell GT2).

How should you efficiently batch complex meshes?

What is the best way to render complex meshes? I wrote different solutions below and wonder what is your opinion about them.
Let's take an example: how to render the 'Crytek-Sponza' mesh?
PS: I do not use Ubershader but only separate shaders
If you download the mesh on the following link:
http://graphics.cs.williams.edu/data/meshes.xml
and load it in Blender, you'll see that the whole mesh is composed of about 400 sub-meshes, each with its own materials/textures.
A naive renderer (version 1) will render each of the 400 sub-meshes separately! That means (to simplify the situation) 400 draw calls, each with its own material/texture binding. Very bad for performance. Very slow!
pseudo-code version_1:
foreach mesh in meshList //400 iterations :(!
    mesh->BindVBO();
    Material material = mesh->GetMaterial();
    Shader bsdf = ShaderManager::GetBSDFByMaterial(material);
    bsdf->Bind();
    bsdf->SetMaterial(material);
    bsdf->SetTexture(material->GetTexture()); //Bind texture
    mesh->Render();
Now, if we take care of the materials being loaded we can notice that the Sponza is composed in reality of ONLY (if I have a good memory :)) 25 different materials!
So a smarter solution (version 2) should be to gather all the vertex/index data in batches (25 in our example) and not store VBO/IBO into sub-meshes classes but into a new class called Batch.
pseudo-code version_2:
foreach batch in batchList //25 iterations :)!
    batch->BindVBO();
    Material material = batch->GetMaterial();
    Shader bsdf = ShaderManager::GetBSDFByMaterial(material);
    bsdf->Bind();
    bsdf->SetMaterial(material);
    bsdf->SetTexture(material->GetTexture()); //Bind texture
    batch->Render();
In this case each VBO contains data that share exactly the same texture/material settings!
It's so much better! Now I think 25 VBOs to render the Sponza is too many! The problem is the number of buffer bindings needed to render the Sponza! I think a good solution would be to allocate a new VBO only when the current one is 'full' (for example, let's assume that the maximum size of a VBO (a value defined as an attribute in the VBO class) is 4MB or 8MB).
pseudo-code version_3:
foreach vbo in vboList //for example 5 VBOs (depends on the maxVBOSize)
    vbo->Bind();
    BatchList batchList = vbo->GetBatchList();
    foreach batch in batchList
        Material material = batch->GetMaterial();
        Shader bsdf = ShaderManager::GetBSDFByMaterial(material);
        bsdf->Bind();
        bsdf->SetMaterial(material);
        bsdf->SetTexture(material->GetTexture()); //Bind texture
        batch->Render();
In this case a VBO does not necessarily contain only data that shares exactly the same texture/material settings! It depends on the sub-mesh loading order!
So OK, there are fewer VBO/IBO bindings but not necessarily fewer draw calls! (Do you agree with this statement?) But in general I think this version 3 is better than the previous one! What do you think about this?
Another optimization would be to store all the textures (or groups of textures) of the Sponza model in texture array(s)! But if you download the Sponza package you will see that the textures all have different sizes! So I think they can't be bound together because of their format differences.
But if it's possible, version 4 of the renderer should need far fewer than 25 texture bindings for the whole mesh! Do you think it's possible?
So, in your opinion, what is the best way to render the Sponza mesh? Do you have another suggestion?
You are focused on the wrong things. In two ways.
First, there's no reason you can't stick all of the mesh's vertex data into a single buffer object. Note that this has nothing to do with batching. Remember: batching is about the number of draw calls, not the number of buffers you use. You can render 400 draw calls out of the same buffer.
This "maximum size" that you seem to want to have is a fiction, based on nothing from the real world. You can have it if you want. Just don't expect it to make your code faster.
So when rendering this mesh, there is no reason to be switching buffers at all.
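To make the "one buffer, many draw calls" point concrete, here is a sketch that packs all sub-meshes into one vertex array and one index array while remembering where each sub-mesh lives. The type and function names are invented for the example. Each recorded range maps onto a call such as glDrawElementsBaseVertex (core since OpenGL 3.2), with `firstIndex` converted to a byte offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct MeshData {
    std::vector<float> vertices;         // tightly packed vertex attributes
    std::vector<std::uint32_t> indices;  // relative to this mesh's own vertices
};

struct DrawRange {
    std::uint32_t firstIndex;  // offset into the shared index array, in indices
    std::uint32_t indexCount;
    std::int32_t baseVertex;   // added to every index by the draw call
};

struct PackedMesh {
    std::vector<float> vertices;
    std::vector<std::uint32_t> indices;
    std::vector<DrawRange> draws;
};

// Concatenates all sub-meshes; indices are copied unmodified because the
// base vertex of each range takes care of the shift at draw time.
PackedMesh packIntoOneBuffer(const std::vector<MeshData>& meshes,
                             std::size_t floatsPerVertex) {
    assert(floatsPerVertex > 0);
    PackedMesh out;
    for (const MeshData& m : meshes) {
        DrawRange r;
        r.firstIndex = static_cast<std::uint32_t>(out.indices.size());
        r.indexCount = static_cast<std::uint32_t>(m.indices.size());
        r.baseVertex = static_cast<std::int32_t>(out.vertices.size() / floatsPerVertex);
        out.vertices.insert(out.vertices.end(), m.vertices.begin(), m.vertices.end());
        out.indices.insert(out.indices.end(), m.indices.begin(), m.indices.end());
        out.draws.push_back(r);
    }
    return out;
}
```

The number of draw ranges still equals the number of sub-meshes, but none of them requires a buffer rebind.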
Second, batching is not really about the number of draw calls (in OpenGL). It's really about the cost of the state changes between draw calls.
This video clearly spells out (about 31 minutes in), the relative cost of different state changes. Issuing two draw calls with no state changes between them is cheap (relatively speaking). But different kinds of state changes have different costs.
The cost of changing buffer bindings is quite small (assuming you're using separate vertex formats, so that changing buffers doesn't mean changing vertex formats). The cost of changing programs and even texture bindings is far greater. So even if you had to make multiple buffer objects (which again, you don't have to), that's not going to be the primary bottleneck.
So if performance is your goal, you'd be better off focusing on the expensive state changes, not the cheap ones. Making a single shader that can handle all of the material settings for the entire mesh, so that you only need to change uniforms between them. Use array textures so that you only have one texture binding call. This will turn a texture bind into a uniform setting, which is a much cheaper state change.
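One simple way to act on that cost ordering is to sort the draw list so that the expensive state changes least often. The sketch below assumes each draw is described by three integer handles; `DrawItem` and both function names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

struct DrawItem { int program; int texture; int buffer; };

// Order draws so the most expensive state (program) changes least often,
// then texture, then buffer.
void sortByStateCost(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(), [](const DrawItem& a, const DrawItem& b) {
        return std::tie(a.program, a.texture, a.buffer) <
               std::tie(b.program, b.texture, b.buffer);
    });
}

// Counts how often the program has to be switched when drawing in the given order.
std::size_t countProgramSwitches(const std::vector<DrawItem>& items) {
    std::size_t switches = 0;
    for (std::size_t i = 1; i < items.size(); ++i) {
        if (items[i].program != items[i - 1].program) ++switches;
    }
    return switches;
}
```

This only reorders existing draws; collapsing materials into one shader or using array textures, as described above, reduces the number of distinct states in the first place.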
There are even fancier things you can do, involving base instance counts and the like. But that's overkill for a trivial example like this.

Can't release textures created by a shared context

I met a problem using shared contexts:
I have two threads and each has one context, say Thr1(thread1) with Ctx1(Context1) and Thr2 and Ctx2. Ctx2 was created sharing with Ctx1.
Then, in Thr2, I create some textures with Ctx2 as current context, and do some rendering. After that, I destroy the Ctx2 and finish Thr2.
Now the problem arose: after I destroy Ctx2, the textures created under Ctx2 are not released (some of them, not all). I used gDebugger to profile my program and saw that these textures are not released and are listed under Ctx1.
As I repeatedly create Thr2/Ctx2, create textures and destroy Thr2/Ctx2, the number of textures keeps growing, and so does the memory usage.
What I have tried:
Delete the textures in Thr2 before destroying Ctx2;
In Thr2, make Ctx1 current and try to delete the textures before Ctx2 is destroyed;
This sounds like expected behavior.
To explain the lifetime of objects with multiple contexts, I'm going to use the word "pool" to describe a collection of textures. I don't think there's universal terminology for the concept, so this is as good as anything.
While you may normally picture textures being owned by a context, they are in fact owned by a pool. As long as you just have a single context, that's an academic difference. The context owns the pool, the pool owns all the textures that were created in the context. When the context is destroyed, the pool goes away with it, which in turn destroys all textures in the pool.
Now, with two sharing contexts, things get more interesting. You still have one pool, which both contexts have shared ownership for. When you create a texture in any one of the two contexts, the texture is owned by the shared pool. When a context is deleted, it gives up its shared ownership of the pool. The pool (including all textures in the pool) stays around as long as at least one of the contexts is alive.
In your scenario, context 2 creates a texture. This texture is added to the pool shared by context 1 and context 2. Then you delete context 2. The created texture remains in the pool. The pool itself remains alive because context 1 (which still exists) has shared ownership of the pool. This means that the texture also remains alive. It is irrelevant that context 2 created the texture, since the texture is owned by the pool, not by context 2.
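If it helps, the ownership rules can be modelled in a few lines of C++ with a shared pointer. This is a toy model only: "pool" is my informal term rather than an OpenGL object, and `Context`, `TexturePool` and their members are invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <set>
#include <utility>

// Toy model of the shared object space described above.
struct TexturePool {
    std::set<std::uint32_t> textures;
    std::uint32_t nextName = 1;
};

class Context {
public:
    Context() : pool_(std::make_shared<TexturePool>()) {}

    // A sharing context joins the other context's pool instead of creating its own.
    static Context sharedWith(const Context& other) { return Context(other.pool_); }

    std::uint32_t createTexture() {
        const std::uint32_t name = pool_->nextName++;
        pool_->textures.insert(name);
        return name;
    }

    // Models an explicit texture delete: works from any context sharing the pool.
    void deleteTexture(std::uint32_t name) { pool_->textures.erase(name); }

    std::size_t liveTextures() const { return pool_->textures.size(); }

private:
    explicit Context(std::shared_ptr<TexturePool> pool) : pool_(std::move(pool)) {
        assert(pool_ != nullptr);
    }
    std::shared_ptr<TexturePool> pool_;
};
```

Destroying the second context only drops one reference to the pool, so any texture it created is still counted by the first context until someone deletes it explicitly.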
Therefore, if you really want to delete the texture, you'll have to make a glDeleteTextures() call. It does not matter if you make this call in context 1 or context 2.
There are some subtle aspects when shared textures are deleted, related for example to textures being FBO attachments, or textures being deleted in one context while being bound in another context. But since this is not at the core of this question, and it's somewhat complicated, I'll refer to the spec for the details (see for example section D.1.2 on page 337 of the OpenGL 3.3 spec).