The fastest way to copy to and from a Texture2D - C++

For some reason I have to copy from a texture to a buffer and then load the data back into a texture.
The source texture comes from the decoder; the target texture is the one that will be rendered. The easiest way to do that (as I understand it) is the following:
Take the decoder texture (ID3D11Texture2D)
Use a temp texture (Usage = D3D11_USAGE_STAGING)
CopyResource to the temp texture
Map
memcpy_s to the buffer
Unmap
On the other side it goes in reverse:
Use temp texture (Usage = D3D11_USAGE_STAGING)
Map
memcpy_s from buffer
Unmap
CopyResource to renderer texture
This works fine, but I have a feeling I'm not doing it as efficiently as possible (aside from the fact that I'm copying data back and forth).
Do I have to use staging textures? Can I tweak the decoder/renderer texture flags (BindFlags?) or the D3D11_MAP value passed to Map to skip the copy to a staging texture?
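One detail in the Map / memcpy_s steps above is easy to get wrong: the RowPitch reported in D3D11_MAPPED_SUBRESOURCE can be larger than width times bytes per pixel, so a single memcpy_s of the whole image is only correct when the two happen to match. The following is a minimal sketch of a pitch-aware copy in both directions. It is plain C++ with no D3D calls; MappedRows, packRows and unpackRows are hypothetical names, and MappedRows only stands in for the pData and RowPitch fields. It also assumes a single-plane format; multi-planar decoder surfaces would need the same treatment per plane.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the two D3D11_MAPPED_SUBRESOURCE fields used here.
struct MappedRows {
    std::uint8_t* data;     // plays the role of pData
    std::size_t   rowPitch; // plays the role of RowPitch: bytes from one row start to the next
};

// Readback side: gather `rows` rows of `rowBytes` payload into a tightly packed buffer.
std::vector<std::uint8_t> packRows(const MappedRows& src, std::size_t rowBytes, std::size_t rows) {
    assert(src.rowPitch >= rowBytes);
    std::vector<std::uint8_t> packed(rowBytes * rows);
    if (rowBytes == 0) return packed;
    for (std::size_t y = 0; y < rows; ++y) {
        // Skip the padding at the end of each source row.
        std::memcpy(packed.data() + y * rowBytes, src.data + y * src.rowPitch, rowBytes);
    }
    return packed;
}

// Upload side: scatter the packed rows back out at the destination's own pitch.
void unpackRows(const std::vector<std::uint8_t>& packed, const MappedRows& dst,
                std::size_t rowBytes, std::size_t rows) {
    assert(dst.rowPitch >= rowBytes);
    assert(packed.size() >= rowBytes * rows);
    if (rowBytes == 0) return;
    for (std::size_t y = 0; y < rows; ++y) {
        std::memcpy(dst.data + y * dst.rowPitch, packed.data() + y * rowBytes, rowBytes);
    }
}
```

In real code the MappedRows values would be filled from the structure that Map returns for each staging texture; the two staging textures are not guaranteed to report the same pitch.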
EDIT001:
OK, here is the case with technical details. There is a decoder, essentially the Intel Media SDK decoder, which decodes (pun intended) data provided from outside the decoding class. It receives a buffer, does its magic (asynchronously) and returns (by means of SyncOperation, if I recall the method name right) a surface, which under the hood is a DX texture managed by the Intel allocator. I receive and copy the texture synchronously, but I guess that with a little effort I could do it asynchronously. The surface originates from a pool, so working with the texture does not stop the decoder from carrying on with its work. The copied data resides in a struct kept in a ring buffer, from which the video renderer is fed. That's it; to my understanding (a limited one, I have to admit) this does no harm to GPU parallelism.

If you have to read back and then write again, there is no fast way: you break GPU/CPU parallelism by forcing a sync point, and you will create many idle bubbles on both the CPU and the GPU.
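Only staging resources can be read by the CPU, so yes, the temporary resource is necessary for the readback. For the upload direction, a D3D11_USAGE_DYNAMIC texture mapped with D3D11_MAP_WRITE_DISCARD, or UpdateSubresource on a default-usage texture, may let you skip the staging copy.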
For performance, you should consider:
Adapting your technique to be GPU only
If readback is the only way, limiting it to the portions that are dirty or necessary
Working with a couple of textures across frames, so that the CPU works on a version that is a few frames behind, to protect parallelism.
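The last suggestion comes down to index bookkeeping over a small ring of staging textures: each frame issues its CopyResource into one slot and maps the oldest slot, whose copy was issued several frames earlier and is therefore the least likely to make Map wait. A minimal sketch of that bookkeeping follows; StagingRing and its method names are hypothetical, and the actual CopyResource / Map calls are left out.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: slot arithmetic for a ring of `slots` staging textures.
class StagingRing {
public:
    explicit StagingRing(std::size_t slots) : slots_(slots) { assert(slots >= 2); }

    // Slot that receives this frame's CopyResource.
    std::size_t writeSlot(std::size_t frame) const { return frame % slots_; }

    // False during the first frames, while no slot is old enough to read yet.
    bool hasReadSlot(std::size_t frame) const { return frame + 1 >= slots_; }

    // Oldest slot on this frame: the one written (slots - 1) frames ago.
    std::size_t readSlot(std::size_t frame) const {
        assert(hasReadSlot(frame));
        return (frame + 1) % slots_;
    }

    // Number of frames the CPU lags behind the GPU.
    std::size_t latency() const { return slots_ - 1; }

private:
    std::size_t slots_;
};
```

With three slots, frame 2 copies into slot 2 and maps slot 0 (written on frame 0), so the CPU always sees data that is two frames old. Passing D3D11_MAP_FLAG_DO_NOT_WAIT to Map is one way to check whether even the oldest slot is still busy instead of blocking on it.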

Related

How to sample Renderbuffer depth information and process it in CPU code, without causing an impact on performance?

I am trying to sample the depth data of a few fragments that I need in my client code (which runs on the CPU).
I tried glReadPixels() on my framebuffer object, but it turns out that it stalls the render pipeline while it transfers data from video memory to main memory through the CPU, which causes unbearable lag (please correct me if I am wrong).
I read about pixel buffer objects: we can use them as copies of other buffers and, very importantly, perform glReadPixels() without stalling, at the cost of using outdated information. (That's OK for me.)
But I cannot work out how to use pixel buffers.
What I've learnt is that we need to sample data from a texture to store it in a pixel buffer. But I am trying to sample from a renderbuffer, which I've read is not possible.
So here's my problem: I want to sample the depth information stored in my renderbuffer, store it in RAM, process it and do other stuff, without causing any issues for the rendering pipeline. If I use a depth texture instead of a renderbuffer, I don't know how to use it for depth testing.
Is it possible to copy the entire Renderbuffer to the Pixelbuffer and perform read operations on it?
Is there any other way to achieve what I am trying to do?
Thanks!
glReadPixels can also transfer from a framebuffer to a standard GPU side buffer object. If you generate a buffer and bind it to the GL_PIXEL_PACK_BUFFER target, the data pointer argument to glReadPixels is instead an offset into the buffer object. (So probably should be 0 unless you are doing something clever.)
Once you've copied the pixels you need into a buffer object, you can transfer or map or whatever back to the CPU at a time convenient for you.

How to update vertex buffer data frequently in DirectX 11?

I am trying to update my vertex buffer data with the Map function in DirectX. It does update the data once, but if I do it repeatedly the model disappears. I am trying to manipulate vertices in real time from user input, and to do so I have to update the vertex buffer every frame while a vertex is selected.
Perhaps this happens because Map disables GPU access to the vertices until Unmap is called. So if access is blocked every frame, it kind of makes sense that the mesh cannot be rendered. However, when I update the vertex every frame and then stop after some time, theoretically the mesh should show up again, but it doesn't.
I know that the proper way to update data every frame is to use constant buffers, but manipulating vertices with constant buffers might not be a good idea, and I don't think there is any other way to update the vertex data. I expect dynamic vertex buffers to be able to handle being updated every frame.
D3D11_MAPPED_SUBRESOURCE mappedResource;
ZeroMemory(&mappedResource, sizeof(D3D11_MAPPED_SUBRESOURCE));
// Disable GPU access to the vertex buffer data.
pRenderer->GetDeviceContext()->Map(pVBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedResource);
// Update the vertex buffer here.
memcpy((Vertex*)mappedResource.pData + index, pData, sizeof(Vertex));
// Reenable GPU access to the vertex buffer data.
pRenderer->GetDeviceContext()->Unmap(pVBuffer, 0);
As has already been answered, the key issue is that you are using Discard (which throws away the previous contents of the buffer, so you cannot retrieve them from the GPU). I thought I would add a little in terms of options.
The question I have is whether you require performance or the convenience of having the data in one location?
There are a few configurations you can try.
1. Set up your buffer to have both CPU read and write access. This does mean you will be pushing and pulling your buffer up and down the bus. In the end it also causes performance issues on the GPU, such as blocking (waiting for the data to be moved back onto the GPU). I personally don't use this in my editor.
2. If memory is not an issue, keep a copy of your buffer on the CPU side, and each frame map with Discard and block-copy the data across. This is performant, but also memory intensive. You obviously have to manage the data partitioning and indexing into this space. I don't use this, but I toyed with it; too much effort!
3. Bite the bullet: map the buffer as in 2 and write each vertex object into the mapped buffer. I do this, and unless the buffer is freaking huge, I haven't had an issue with it in my own editor.
4. Use a compute shader to update the buffer: create a shader resource view and an unordered access view, and pass the updates via a constant buffer. A bit of a sledgehammer to crack a walnut, and it still doesn't change the fact that you may need to pull the data back off the GPU, as in item 1.
There are some variations on managing the buffer that you can also play with, such as interleaving (two copies, one in use on the GPU while the other is being written to). There are also some rather ornate mechanisms, such as building the content of the buffer in another thread and then flagging the update.
At the end of the day, DX 11 doesn't offer the ability (someone might know better) to edit the data in GPU memory directly; there is a lot of shifting between CPU and GPU.
Good luck with whichever technique you choose.
Mapping a buffer with the D3D11_MAP_WRITE_DISCARD flag invalidates the entire buffer content, so you cannot use it to update just a single vertex. Keep a copy of the buffer on the CPU side instead, and update the entire GPU buffer from it once per frame.
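A minimal sketch of that CPU-side copy, assuming a trivially copyable Vertex layout: edits go to the shadow array, and after each discard map the whole array is written, never a single element. ShadowedVertices and its members are hypothetical names, the Vertex struct is only an illustration, and the `mapped` pointer stands in for mappedResource.pData; the Map / Unmap calls themselves are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Illustrative vertex layout; the real one comes from the application.
struct Vertex { float x, y, z; };

// Hypothetical helper that owns the CPU-side copy of the vertex data.
class ShadowedVertices {
public:
    explicit ShadowedVertices(std::vector<Vertex> initial)
        : cpuCopy_(std::move(initial)), dirty_(true) {}

    // Edit a single vertex in the CPU copy only.
    void set(std::size_t index, const Vertex& v) {
        assert(index < cpuCopy_.size());
        cpuCopy_[index] = v;
        dirty_ = true;
    }

    bool dirty() const { return dirty_; }
    std::size_t byteSize() const { return cpuCopy_.size() * sizeof(Vertex); }

    // `mapped` plays the role of the pointer returned by a WRITE_DISCARD map.
    // Its previous contents are treated as undefined, so everything is rewritten.
    void writeAll(void* mapped, std::size_t mappedBytes) {
        assert(mappedBytes >= byteSize());
        if (!cpuCopy_.empty()) {
            std::memcpy(mapped, cpuCopy_.data(), byteSize());
        }
        dirty_ = false;
    }

private:
    std::vector<Vertex> cpuCopy_;
    bool dirty_;
};
```

In the render loop this would look like: if dirty() is true, Map with D3D11_MAP_WRITE_DISCARD, call writeAll with the mapped pointer, then Unmap. The dirty flag avoids re-uploading on frames where nothing changed.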
If you develop for UWP, the use of Map/Unmap may result in synchronization problems, because ID3D11DeviceContext methods are not thread safe: https://learn.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-render-multi-thread-intro
If you update the buffer from one thread and render from another, you may get a variety of errors. In this case you must use some synchronization mechanism, such as critical sections. An example is here: https://developernote.com/2015/11/synchronization-mechanism-in-directx-11-and-xaml-winrt-application/

Why is using multiple Pixel Buffer Objects advised? Surely it is redundant?

This article is commonly referenced when anyone asks about video streaming textures in OpenGL.
It says:
To maximize the streaming transfer performance, you may use multiple pixel buffer objects. The diagram shows that 2 PBOs are used simultaneously; glTexSubImage2D() copies the pixel data from a PBO while the texture source is being written to the other PBO.
For nth frame, PBO 1 is used for glTexSubImage2D() and PBO 2 is used to get new texture source. For n+1th frame, 2 pixel buffers are switching the roles and continue to update the texture. Because of asynchronous DMA transfer, the update and copy processes can be performed simultaneously. CPU updates the texture source to a PBO while GPU copies texture from the other PBO.
They provide a simple benchmark program which allows you to cycle between texture updates without PBOs, with a single PBO, and with two PBOs used as described above.
I see a slight performance improvement when enabling one PBO.
But the second PBO makes no real difference.
Right before the code maps the PBO with glMapBuffer, it calls glBufferData with the pointer set to NULL. It does this to avoid a sync stall.
// map the buffer object into client's memory
// Note that glMapBufferARB() causes sync issue.
// If GPU is working with this buffer, glMapBufferARB() will wait(stall)
// for GPU to finish its job. To avoid waiting (stall), you can call
// first glBufferDataARB() with NULL pointer before glMapBufferARB().
// If you do that, the previous data in PBO will be discarded and
// glMapBufferARB() returns a new allocated pointer immediately
// even if GPU is still working with the previous data.
So, here is my question...
Doesn't this make the second PBO completely useless? Just a waste of memory!?
With two PBOs, the texture data is stored three times: once in the texture and once in each PBO.
With a single PBO, there are two copies of the data, and only temporarily a third, in the event that glMapBuffer creates a new buffer because the existing one is currently being DMA'ed to the texture?
The comments seem to suggest that OpenGL drivers are internally capable of creating the second buffer if, and only when, it is required to avoid stalling the pipeline. The in-use buffer is being DMA'ed, and my call to map yields a new buffer for me to write to.
The author of that article appears to be more knowledgeable in this area than I am. Have I completely misunderstood the point?
Answering my own question... but I won't accept it as an answer (yet).
There are many problems with the benchmark program linked to in the question. It uses immediate mode. It uses GLUT!
The program was spending most of its time doing things we are not interested in profiling. Mainly rendering text via GLUT, and writing pretty stripes to the texture. So I have removed those functions.
I cranked the texture resolution up to 8K, and added more PBO modes:
No PBO (yields 6 fps).
1 PBO. Orphan previous buffer (yields 12.2 fps).
2 PBOs. Orphan previous buffer (yields 12.2 fps).
1 PBO. Don't orphan previous PBO (possible stall; added by myself; yields 12.4 fps).
2 PBOs. Don't orphan previous PBO (possible stall; added by myself; yields 12.4 fps).
If anyone else would like to examine my code, it is available here
I have experimented with different texture sizes... and different updatePixels functions... I cannot, despite my best efforts get the double PBO implementation to perform any better than the single-PBO implementation.
Furthermore, NOT orphaning the previous buffer actually yields better performance, exactly the opposite of what the article claims.
Perhaps modern drivers / hardware do not suffer from the problem that this design is attempting to fix...
Perhaps my graphics hardware / driver is buggy, and not taking advantage of the double-PBO...
Perhaps the commonly referenced article is completely wrong?
Who knows. . . .
My test hardware is Intel(R) HD Graphics 5500 (Broadwell GT2).

OpenGL read pixels faster than glReadPixels

Is there a way to increase the speed of glReadPixels? Currently I do:
Gdx.gl.glReadPixels(0, 0, Gdx.graphics.getWidth(), Gdx.graphics.getHeight(), GL20.GL_RGBA, GL20.GL_UNSIGNED_BYTE, pixels);
The problem is that it blocks the rendering and is slow.
I have heard of Pixel Buffer Objects, but I am quite unsure on how to wire it up and whether it is faster or not.
Also, is there any other solution than glReadPixels?
Basically, I want to take a screenshot as fast as possible, without blocking the drawing of the next scene.
Is there a way to increase the speed of glReadPixels?
Well, the speed of that operation is actually not the main issue. It has to transfer a certain amount of bytes from the framebuffer to your system memory. In your typical desktop system with a discrete GPU, that involves sending the data over PCI-Express, and there is no way around that.
But as you already stated, the implicit synchronization is a big issue. If you need that pixel data as soon as possible, you can't really do much better than that synchronous readback. But if you can live with getting that data later, asynchronous readback via pixel buffer objects (PBOs) is the way to go.
The pseudo code for that is:
1. Create a PBO.
2. Bind the PBO as GL_PIXEL_PACK_BUFFER.
3. Do the glReadPixels.
4. Do something else. Ideally this is both work on the CPU and issuing new commands for the GPU.
5. Read back the data from the PBO, either with glGetBufferSubData or by mapping the PBO for reading.
The crucial point is the timing of step 5. If you do that too early, you will still block the client side, as it will wait for the data to become available. For screenshots, it should not be hard to delay that step by one or two frames. That way, it will have only a slight impact on the overall render performance, and it will stall neither the GPU nor the CPU.

Asynchronous readback from opengl front buffer using multiple PBO's

I am developing an application that needs to read back the whole frame from the front buffer of an OpenGL application. I can hijack the application's OpenGL library and insert my code on swapbuffers. At the moment I am successfully using a simple but excruciatingly slow glReadPixels command without PBOs.
Now I have read about using multiple PBOs to speed things up. While I think I've found enough resources to actually program that (it isn't that hard), I have some operational questions left. I would do something like this:
1. Create a series (e.g. 3) of PBOs.
2. Use glReadPixels in my swapBuffers override to read data from the front buffer into a PBO (should be fast and non-blocking, right?).
3. Create a separate thread to call glMapBufferARB, once per PBO after a glReadPixels, because this will block until the pixels are in client memory.
4. Process the data from step 3.
Now my main concern is of course steps 2 and 3. I read that glReadPixels into a PBO is non-blocking; will it be an issue if I issue new OpenGL commands very soon after it? Will those commands block? Or will they continue (my guess)? If so, I guess only swapbuffers can be a problem: will it stall, will glReadPixels from the front buffer be many times faster than swapping (about every 15 to 30 ms), or, worst case, will swapbuffers be executed while glReadPixels is still reading data into the PBO? My current guess is that the logic does something like this: copy FRONT_BUFFER -> generic place in VRAM, then copy VRAM -> RAM. But I have no idea which of those two is the real bottleneck, or what the influence on the normal OpenGL command stream is.
Then step 3. Is it wise to do this asynchronously in a thread separate from the normal OpenGL logic? At the moment I think not: it seems you have to restore buffer operations to normal after doing this, and I can't install synchronization objects in the original code to temporarily block those. So I think my best option is to define a certain swapbuffer delay before reading them out, e.g. calling glReadPixels on PBO i%3 and glMapBufferARB on PBO (i+2)%3 in the same thread, resulting in a delay of 2 frames. Also, when I call glMapBufferARB to use the data in client memory, will this be the bottleneck, or will the (asynchronous) glReadPixels be the bottleneck?
And finally, if you have some better ideas to speed up frame readback from GPU in opengl, please tell me, because this is a painful bottleneck in my current system.
I hope my question is clear enough. I know the answer is probably somewhere on the internet, but I mostly came up with results that used PBOs to keep buffers in video memory and do the processing there. I really need to read back the front buffer to RAM, and I cannot find any clear explanation of performance in that case (which I need: I cannot rely on "it's faster", I need to explain why it's faster).
Thank you
Are you sure you want to read from the front buffer? You do not own this buffer, and depending on your OS it might be destroyed, e.g., by another window on top of it.
For your use case, people typically do
draw N
start PBO read N from back buffer
draw N+1
start PBO read N+1
sync PBO read N
process N
...
from a single thread.
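The ordering above can be written down as a small schedule generator, which makes the lag between starting a read and synchronising on it explicit. This is only a sketch of the ordering, with no OpenGL calls; readbackSchedule and its `lag` parameter are hypothetical, and a lag of 1 reproduces the sequence listed above. In a real implementation each in-flight read needs its own PBO, so lag + 1 PBOs would be required.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: lists the operations issued from a single thread when the
// read of frame N is synchronised only after frame N + lag has been submitted.
std::vector<std::string> readbackSchedule(int frames, int lag) {
    assert(frames >= 0 && lag >= 1);
    std::vector<std::string> ops;
    for (int n = 0; n < frames; ++n) {
        ops.push_back("draw " + std::to_string(n));
        ops.push_back("start read " + std::to_string(n));
        if (n >= lag) {
            // This read was started `lag` frames ago, so it has had time to finish.
            ops.push_back("sync read " + std::to_string(n - lag));
            ops.push_back("process " + std::to_string(n - lag));
        }
    }
    // Drain the reads that are still in flight after the last frame.
    const int first = frames > lag ? frames - lag : 0;
    for (int n = first; n < frames; ++n) {
        ops.push_back("sync read " + std::to_string(n));
        ops.push_back("process " + std::to_string(n));
    }
    return ops;
}
```

A larger lag gives the transfer more time to complete before the sync, at the cost of more buffers and older data.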