c++, directx 12: Color Picking Questions

c++, directx 12: Color Picking Questions - c++

I'd like to implement color picking in DirectX 12. So basically what I am trying to do is render to two render targets simultaneously. The first render target should contain the normal rendering, while the second should contain the objectID.
To render to two render targets, I think all you need to do is set them with OMSetRenderTargets.
Question 1: How do you specify which shader or pipeline state object should be used for a specific render target? Like how do you say render_target_0 should be rendered with shader_0, render_target_1 should be rendered with shader_1?
Question 2: How do you read a pixel from the frame buffer after it has been rendered? Is it like in DirectX 11 by using CopySubresourceRegion then Map? Do you need to use a readback heap? Do you need to use a resource barrier or fence or some sort synchronisation primitive to avoid the CPU and GPU using the frame buffer resource at the same time?
I tried googling for the answers but didn't get very far because DirectX 12 is pretty new and there aren't a lot of examples, tutorials, or open source projects for DirectX 12 yet.
Thanks for your help in advance.
Extra special bonus points for code examples.

Since I didn't get any response on Stack Overflow, I cross-posted on gamedev.net and got a good response there: http://www.gamedev.net/topic/674529-d3d12-color-picking-questions/
For anyone finding this in the future, I'll just copy red75prime's answer from the GameDev forums here.
red75prime's answer:
Question 1 is not specific to D3D12. Use one pixel shader with multiple outputs. Rendering to multiple textures with one pass in directx 11
Questions 2. Yes to all.
Pseudocode:
ID3D12GraphicsCommandList *gl = ...;
ID3D12CommandQueue *gqueue = ...;
ID3D12Resource *render_target, *read_back_texture;
...
// Draw scene
gl->DrawInstanced(...);
// Make ready for copy
gl->ResourceBarrier(render_target, RENDER_TARGET, COPY_SOURCE);
//gl->ResourceBarrier(read_back_texture, GENERIC_READ, COPY_DEST);
// Copy
gl->CopyTextureRegion(...);
// Make render_target ready for Present(), read_back_texture for Map()
gl->ResourceBarrier(render_target, COPY_SOURCE, PRESENT);
//gl->ResourceBarrier(read_back_texture, COPY_DEST, GENERIC_READ);
gl->Close(); // It's easy to forget
gqueue->ExecuteCommandLists(gl);
// Instruct GPU to signal when command list is done.
gqueue->Signal(fence, ...);
// Wait until GPU completes drawing
// It's inefficient. It doesn't allow GPU and CPU work in parallel.
// It's here just to make example simple.
wait_for_fence(fence, ...);
// Also, you can map texture once and store pointer to mapped data.
read_back_texture->Map(...);
// Read texture data
...
read_back_texture->Unmap();
EDIT: I added "gl->Close()" to the code.
EDIT2: State transitions for read_back_texture are unnecessary. Resources in read-back heap must always have state COPY_DEST.

Related

How to render multiple different items in an efficient way with OpenGL

I am making a simple STG engine with OpenGL (To be exact, with LWJGL3).In this game, there can be several different types of items(called bullet) in one frame, and each type can have 10-20 instances.I hope to find an efficient way to render it.
I have read some books about modern OpenGL and find a method called "Instanced Rendering", but it seems only to work with same instances.Should I use for-loop to draw all items directly for my case?
Another question is about memory.Should I create an VBO for each frame, since the number of items is always changing?

Not the easiest question to answer but I'll try my best anyways.
An important property of OpenGL is that the OpenGL context is always bound to a single thread. So every OpenGL-method has to be called within that thread. A common way of dealing with this is using Queuing.
Example:
We are using Model-View-Controller architecture.
We have 3 threads; One to read input, one to handle received messages and one to render the scene.
Here OpenGL context is bound to rendering thread.
The first thread receives a message "Add model to position x". First thread has no time to handle the message, because there might be another message coming right after and we don't want to delay it. So we just give this message for the second thread to handle by adding it to second thread's queue.
Second thread reads the message and performs the required tasks as far as it can before OpenGL context is required. Like reads the Wavefront (.obj)-file from the memory and creates arrays from the received data.
Our second thread then queues this data to our OpenGL thread to handle. OpenGL thread generates VBOs and VAO and stores the data in there.
Back to your question
OpenGL generated Objects stay in the context memory until they are manually deleted or the context is destroyed. So it works kind of like C, where you have to manually allocate memory and free it after it's no more used. So you should not create new Objects for each frame, but reuse the data that stays unchanged. Also when you have multiple objects that use the same model or texture, you should just load that model once and apply all object specific differences on shaders.
Example:
You have an environment with 10 rocks that all share the same rock model.
You load the data, store it in VBOs and attach those VBOs into a VAO. So now you have a VAO defining a rock.
You generate 10 rock entities that all have position, rotation and scale. When rendering, you first bind the shader, then bind the model and texture, then loop through the stone entities and for each stone entity you bind that entity's position, rotation and scale (usually stored in a transformationMatrix) and render.
bind shader
load values to shader's uniform variables that don't change between entities.
bind model and texture (as those stay the same for each rock)
for(each rock in rocks){
load values to shader's uniform variables that do change between each rock, like the transformation.
render
}
unbind shader
Note: You don't need to unbind/bind shader each frame if you only use one shader. Same goes for VAO's and every other OpenGL object as well. So the binding will also stay over each rendering cycle.
Hope this will help you when getting started. Altho I would recommend some tutorial that might have a bit more context to it.

I have read some books about modern OpenGL and find a method called
"Instanced Rendering", but it seems only to work with same
instances.Should I use for-loop to draw all items directly for my
case?
Another question is about memory.Should I create an VBO for each
frame, since the number of items is always changing?
These both depend on the amount of bullets you plan on having. If you think you will have less than a thousand bullets, you can almost certainly push all of them to a VBO each frame and upload and your end users will not notice. If you plan on some obscene amount, then don't do this.
I would say that you should write everything each frame because it's the simplest to do right now, and if you start noticing performance issues then you need to look into instancing or some other method. When you get to "later" you should be more comfortable with OpenGL and find out ways to optimize it that won't be over your head (not saying it is over your head right now, but more experience can only help make it less complex later on).
Culling bullets not on the screen either should be on your radar.
If you plan on having a ridiculous amount of bullets on screen, then you should say so and we can talk about more advanced methods, however my guess is that if you ever reach that limit on today's hardware then you have a large ambitious game with a zoomed out camera and a significant amount of entities on screen, or you are zoomed up and likely have a mess on your screen anyways.

20 objects is nothing. Your program will be plenty fast no matter how you draw them.
When you have 10000 objects, then you'll want to ask for an efficient way.
Until then, draw them whichever way is most convenient. This probably means a separate draw call per object.

CPU/GPU Shared Buffer in Direct3D12

I have no experience with Direct3D, so I may just be looking in the wrong places. However, I would like to convert a program I have written in OpenGL (using FreeGLUT) to a Windows IoT compatible UWP (running Direct3D, 12 'caus it's cool). I'm trying to port my program to a Raspberry Pi 3 and I don't want to convert to Linux.
Through the examples provided by Microsoft I have figured out most of what I believe I need to know to get started, but I can't figure out how to share a dynamic data buffer between the CPU and GPU.
What I want to know how to do:
Create a CPU/GPU shared circular buffer
Read and Draw with the GPU
Write / Replace sections with the CPU
Quick semi-pseudo code:
while (!buffer.inUse()){ //wait until buffer is not in use
updateBuffer(buffer.id, data, start, end); //insert data into buffer
drawToScreen(buffer.id); //draw using vertex data in buffer
}
This was previously done in OpenGL by simply using glBegin()/glEnd() and glVertex3f() for each value in an array when it wasn't being written to.
Update: I basically want a Direct3D12 equivalent of OpenGLs VBO editing using glBufferSubData(). If that makes more sense.
Update 2: I found that I can get away with discarding the vertex buffer every frame and re-uploading a new buffer to the GPU. There's a fair amount of overhead, as one would expect with transferring 10,000 - 200,000 doubles every frame. So I'm trying to find a way to use constant buffers to port the 5-10 updated vertexes into the shader, so I can copy from the constant buffer into the vertex buffer using the shader and not have to use map/unmap every frame. This way my circular buffer on the CPU is independent of the buffer being used on the GPU, but they will both share the same information through periodic updates. I'll do some more looking and post another more specific question on shaders if I don't find a solution.

How should you efficiently batch complex meshes?

What is the best way to render complex meshes? I wrote different solutions below and wonder what is your opinion about them.
Let's take an example: how to render the 'Crytek-Sponza' mesh?
PS: I do not use Ubershader but only separate shaders
If you download the mesh on the following link:
http://graphics.cs.williams.edu/data/meshes.xml
and load it in Blender you'll see that the whole mesh is composed by about 400 sub-meshes with their own materials/textures respectively.
A dummy renderer (version 1) will render each of the 400 sub-mesh separately! It means (to simplify the situation) 400 draw calls with for each of them a binding to a material/texture. Very bad for performance. Very slow!
pseudo-code version_1:
foreach mesh in meshList //400 iterations :(!
mesh->BindVBO();
Material material = mesh->GetMaterial();
Shader bsdf = ShaderManager::GetBSDFByMaterial(material);
bsdf->Bind();
bsdf->SetMaterial(material);
bsdf->SetTexture(material->GetTexture()); //Bind texture
mesh->Render();
Now, if we take care of the materials being loaded we can notice that the Sponza is composed in reality of ONLY (if I have a good memory :)) 25 different materials!
So a smarter solution (version 2) should be to gather all the vertex/index data in batches (25 in our example) and not store VBO/IBO into sub-meshes classes but into a new class called Batch.
pseudo-code version_2:
foreach batch in batchList //25 iterations :)!
batch->BindVBO();
Material material = batch->GetMaterial();
Shader bsdf = ShaderManager::GetBSDFByMaterial(material);
bsdf->Bind();
bsdf->SetMaterial(material);
bsdf->SetTexture(material->GetTexture()); //Bind texture
batch->Render();
In this case each VBO contains data that share exactly the same texture/material settings!
It's so much better! Now I think 25 VBO for render the sponza is too much! The problem is the number of Buffer bindings to render the sponza! I think a good solution should be to allocate a new VBO if the first one if 'full' (for example let's assume that the maximum size of a VBO (value defined in the VBO class as attribute) is 4MB or 8MB).
pseudo-code version_3:
foreach vbo in vboList //for example 5 VBOs (depends on the maxVBOSize)
vbo->Bind();
BatchList batchList = vbo->GetBatchList();
foreach batch in batchList
Material material = batch->GetMaterial();
Shader bsdf = ShaderManager::GetBSDFByMaterial(material);
bsdf->Bind();
bsdf->SetMaterial(material);
bsdf->SetTexture(material->GetTexture()); //Bind texture
batch->Render();
In this case each VBO does not contain necessary data that share exactly the same texture/material settings! It depends of the sub-mesh loading order!
So OK, there are less VBO/IBO bindings but not necessary less draw calls! (are you OK by this affirmation ?). But in a general manner I think this version 3 is better than the previous one! What do you think about this ?
Another optimization should be to store all the textures (or group of textures) of the sponza model in array(s) of textures! But if you download the sponza package you will see that all texture has different sizes! So I think they can't be bound together because of their format differences.
But if it's possible, the version 4 of the renderer should use only less texture bindings rather than 25 bindings for the whole mesh! Do you think it's possible ?
So, according to you, what is the best way to render the sponza mesh ? Have you another suggestion ?

You are focused on the wrong things. In two ways.
First, there's no reason you can't stick all of the mesh's vertex data into a single buffer object. Note that this has nothing to do with batching. Remember: batching is about the number of draw calls, not the number of buffers you use. You can render 400 draw calls out of the same buffer.
This "maximum size" that you seem to want to have is a fiction, based on nothing from the real world. You can have it if you want. Just don't expect it to make your code faster.
So when rendering this mesh, there is no reason to be switching buffers at all.
Second, batching is not really about the number of draw calls (in OpenGL). It's really about the cost of the state changes between draw calls.
This video clearly spells out (about 31 minutes in), the relative cost of different state changes. Issuing two draw calls with no state changes between them is cheap (relatively speaking). But different kinds of state changes have different costs.
The cost of changing buffer bindings is quite small (assuming you're using separate vertex formats, so that changing buffers doesn't mean changing vertex formats). The cost of changing programs and even texture bindings is far greater. So even if you had to make multiple buffer objects (which again, you don't have to), that's not going to be the primary bottleneck.
So if performance is your goal, you'd be better off focusing on the expensive state changes, not the cheap ones. Making a single shader that can handle all of the material settings for the entire mesh, so that you only need to change uniforms between them. Use array textures so that you only have one texture binding call. This will turn a texture bind into a uniform setting, which is a much cheaper state change.
There are even fancier things you can do, involving base instance counts and the like. But that's overkill for a trivial example like this.

DirectX Adding Multiple Meshes to a Single Vertex Buffer

I'm fairly new to DirectX. I have what I think should be a pretty simple question, but I can't seem to find an answer to it anywhere.
Basically, I'd like to know how to add vertices from multiple meshes to a single vertex buffer. This would only happen once per mesh as the program is initialized, so I believe I want DEFAULT usage.
Is It possible to add each mesh to the buffer individually? or do I need to collect them all in a single array and pass them all at once? Default or Dynamic? Map/Unmap or updateSubresource? Thanks
For now I am using an index buffer and drawing once per object (horrible I know) but I am planning on switching to instancing as soon as I figure this out.

OpenGL height-map painting using CUDA VBO

I've asked several questions regarding VBO previously here and from the comments i had received i decided that a new approach must be taken.
To put it simply - I'm trying to draw the Mandelbrot set which is defined on a large FLOAT array, around 512X512 Points. the purpose of my program is to let the user control the zooming and world's orientation (it's a 3d model).
so far I've painted the entire thing using GL_TRIANGLE_STRIP which turned to be a bad choice because of its slow painting process. also because implementing my painting style (order of calling the glVertex) became impossible for coding for VBOs.
so I've got several questions.
even after this description i'm not sure either the VBO is the best choice because it's up the user to control the calculations.for each calculation that he causes by the program, i have to recompute the mandelbrot set(~60ms),and recopy the points to the buffer : a process which takes some time(?ms).
the program allows the user also to move in the world so no calculations are done here therefore VBO is an excellent choice here.
1.what's the best way to paint height map(when each cell in the array holds only the height)
2.how can i apply it on VBO and transfer it to cuda (cudaRegisterBuffer or something like that)
3.is there a way to distinguish between the mode and decide when VBOs are needed(in a no calculations mode) and when they aren't(calculations mode).

You don't need to copy the CUDA data each frame if you bind the CUDA array/VBO to the DirectX/OpenGL VB (refer to the CUDA Programming Guide for details). One way to render data as a height-field is to use the Geometry Shader to emit the tris based on the height-field. Another way is to use the height field as a parallax-map (ref DirectX SDK). My personal fave would be to make your height-field an array of positions (X/Y/Z) and use CUDA to modify only the Y-Values, then use an index buffer to define the polygons that compose the surface. Note that you'll also need to update the vertex normals, and you may also want to use XYZ/UV if you want to texture the surface. If 512x512 is too big, use raster-ops (texture sampling) to populate a lower-resolution height-field of the region of interest. You can do this stage in CUDA or OpenGL/DirectX (I'd recommend doing it in CUDA where you can easily write your own sampling kernel to lookup pixels when down-sampling).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js