I'm trying to optimize the upload of textures from CPU memory to OpenGL.
Initially I was using glTexImage2D and that works, but obviously isn't using DMA, so I'm trying to use PBOs.
Unfortunately, the API I have to use that provides the texture data allocates the CPU memory for them and returns a pointer. I can't control where in memory it places the data.
If I create a PBO then map it, I have to either 'manually' move my data into the PBO-allocated memory, or use glBufferData to initialize it with the texture data before calling glTexImage2D. This appears significantly slower than not using a PBO at all.
Are there any other techniques I could try, or is this just a limitation of how PBOs work?
I need to access large amounts of data from a GLSL compute shader (read and write).
For reference, I work with an NVIDIA A6000 GPU with 50GB of memory, and the driver is up to date.
Here is what I've tried so far:
1. Using an SSBO: glBufferData() can allocate arbitrarily large buffers but the shader will only be able to access 2GB of memory (according to my tests).
2. Using a texture: glTextureStorage3D() can allocate very large textures but the shader will only be able to access 4GB of memory (according to my tests).
3. Using multiple textures: I break the data into multiple bindless 3D textures (GL_ARB_bindless_texture extension) that work like pages of memory. I store the texture handles in a UBO. It effectively does what I want but there are several downsides:
   - The texture/image handle pairs take space in the UBO, which could be needed by something else.
   - The textures are not accessed in uniform control flow: two invocations of the same subgroup can fetch data from different pages. This is allowed on NVIDIA GPUs with the GL_NV_gpu_shader5 extension, but I could not find a similar extension on AMD.
4. Using NV_shader_buffer_load and NV_shader_buffer_store to get a pointer to GPU memory. I haven't tested this method yet, but I suspect it could be more efficient than solution 3 since there is no need to dereference the texture/image handles, which introduces an indirection.
Thus, I would like to know: Would solution 4 work for my use case? Is there a better/faster way? Is there a cross-platform way?
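To make the paging scheme of the multiple-texture approach concrete, here is a minimal CPU-side sketch of the address arithmetic. It assumes cubic pages of dim x dim x dim texels with x varying fastest; the names PageAddress and locate are hypothetical, and the same integer math would have to be mirrored in the GLSL shader before indexing the array of handles.

```cpp
#include <cstdint>

// Hypothetical page layout: each page is one 3D texture of dim x dim x dim texels.
struct PageAddress {
    std::uint32_t page;  // index into the array of bindless texture handles
    std::uint32_t x;     // texel coordinate inside that page
    std::uint32_t y;
    std::uint32_t z;
};

// Split a linear element index into (page, x, y, z), with x varying fastest.
inline PageAddress locate(std::uint64_t index, std::uint32_t dim) {
    const std::uint64_t d = dim;
    const std::uint64_t perPage = d * d * d;     // texels per page
    const std::uint64_t local = index % perPage; // position inside the page
    PageAddress a;
    a.page = static_cast<std::uint32_t>(index / perPage);
    a.x = static_cast<std::uint32_t>(local % d);
    a.y = static_cast<std::uint32_t>((local / d) % d);
    a.z = static_cast<std::uint32_t>(local / (d * d));
    return a;
}
```

With dim = 4, for example, element 70 lands in page 1 at texel (2, 1, 0). Two neighbouring invocations can end up in different pages whenever their indices straddle a multiple of dim^3, which is exactly the non-uniform access described above.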
Not sure what the DX parlance is for these, but I'm sure they have a similar notion.
As far as I'm aware the advantage of VBOs is that they allocate memory that's directly accessible by the GPU. We can then upload data to this buffer and keep it there for an extended number of frames, avoiding the overhead of uploading the data every frame. Additionally, we're able to alter this data on a per-datum basis, if we choose to.
Therefore, I can see the advantage of using VBOs for static geometry, but I don't see any benefit at all for dynamic objects, since you pretty much have to update all the data every frame anyway?
There are several methods of updating buffers in OpenGL. If you have dynamic data, you can simply reinitialize the buffer storage every frame with the new data (e.g. with glBufferData). You can also use client vertex buffer pointers in compatibility contexts. However, these methods can cause 'churn' in the driver's memory allocation. The new data essentially has to sit in system memory until the GPU driver handles it, and it's not possible to get feedback on this process.
In later OpenGL versions (4.4, and via extensions in earlier versions), some functionality was introduced to reduce the overhead of updating dynamic buffers, allowing GPU-allocated memory to be written without direct driver synchronization. This requires that you have glBufferStorage and glMapBufferRange available. You create the buffer storage with GL_MAP_PERSISTENT_BIT plus the access flags you need (GL_MAP_WRITE_BIT and/or GL_MAP_READ_BIT), and then map it with GL_MAP_PERSISTENT_BIT as well; GL_DYNAMIC_STORAGE_BIT is only needed if you also want to update the buffer with glBufferSubData. However, this technique also requires that you use GPU fencing to ensure you are not overwriting the data while the GPU is reading it. Using this method makes updating VBOs much more efficient than reinitializing the data store or using client buffers.
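The fencing requirement is easier to see with the bookkeeping written out. The following is only a CPU-side sketch under stated assumptions: the persistently mapped buffer is split into a ring of equal regions, one per frame in flight, and the class RegionRing and its members are hypothetical names. Frame counters stand in for real sync objects; in an actual renderer markSubmitted would correspond to calling glFenceSync after the draw calls, and safeToWrite to checking that fence with glClientWaitSync.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Marker for a region that no frame has used yet.
const std::uint64_t kNever = ~std::uint64_t(0);

// Hypothetical bookkeeping for a persistently mapped buffer split into regions.
class RegionRing {
public:
    RegionRing(std::size_t regionBytes, std::size_t regionCount)
        : regionBytes_(regionBytes), lastUse_(regionCount, kNever) {}

    // Byte offset inside the mapped pointer that `frame` should write to.
    std::size_t offsetFor(std::uint64_t frame) const {
        return slot(frame) * regionBytes_;
    }

    // True if the GPU has finished the frame that last read this region.
    // `gpuFramesCompleted` is the number of frames the GPU has fully processed.
    bool safeToWrite(std::uint64_t frame, std::uint64_t gpuFramesCompleted) const {
        const std::uint64_t last = lastUse_[slot(frame)];
        return last == kNever || last < gpuFramesCompleted;
    }

    // Record that `frame` issued commands that read from its region.
    void markSubmitted(std::uint64_t frame) {
        lastUse_[slot(frame)] = frame;
    }

private:
    std::size_t slot(std::uint64_t frame) const {
        return static_cast<std::size_t>(frame % lastUse_.size());
    }

    std::size_t regionBytes_;
    std::vector<std::uint64_t> lastUse_;  // frame that last used each region
};
```

With three regions the CPU can usually stay two frames ahead of the GPU before safeToWrite returns false and a real wait on the fence becomes necessary.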
There is a good presentation on GDC Vault about this technique (skip to the DynamicStreaming heading).
AFAIK, by creating a dynamic vertex buffer you give the graphics driver a hint to place the buffer in memory that is fast for the CPU to write and still reasonably fast for the GPU to read. The driver will usually try to minimize GPU stalls by handing out non-overlapping memory regions, so that the CPU can write to one region while the GPU reads from another.
If you do not give such a hint, the buffer is assumed to be a static resource, so it is placed in memory that is fast for the GPU to read and write but very slow for the CPU to write.
I want to render multiple video streams using OpenGL. Currently I upload the frames with glTexImage2D, as provided by JOGL, and render into a Swing window.
To update the texture content for each video frame I call glTexImage2D. I want to know whether there is a faster method to update a texture without calling glTexImage2D for each frame.
You will always be using glTexImage2D, but with the difference that the data comes from a buffer object rather than from a pointer.
What's slow in updating a texture is not updating the texture, but synchronizing (blocking) with the current draw operation, and the PCIe transfer. When you call glTexImage, OpenGL must wait until it is done drawing the last frame during which it is still reading from the texture. During that time, your application is blocked and does nothing (this is necessary because otherwise you could modify or free the memory pointed to before OpenGL can copy it!). Then it must copy the data and transfer it to the graphics card, and only then your application continues to run.
While one can't make that process much faster, one can make it run asynchronously, so this latency pretty much disappears.
The easiest way of doing this for video frames is to create a buffer name, bind it, and reserve-initialize it once.
Then, on each subsequent frame, discard-initialize it by calling glBufferData with a null data pointer, and fill it either with a non-reserving call, or by mapping the buffer's complete range.
The reason why you want to do this strange dance instead of simply overwriting the buffer is that this will not block. OpenGL synchronizes access to buffer objects so that you do not overwrite data while it is still reading from it. glBufferData with a null data pointer is a way of telling OpenGL that you don't really care about the buffer's contents and that you don't necessarily want the same buffer. So it will just allocate another one and give you that one, keep reading from the old one, and secretly swap them when it's done.
Since the word "synchronization" was used already, I shall explain my choice of glMapBufferRange in the link above, when in fact you want to map the whole buffer, not some range. Why would one want that?
Even if OpenGL can mostly avoid synchronizing when using the discard technique above, it may still have to, sometimes.
Also, it still has to run some kind of memory allocation algorithm to manage the buffers, which takes driver time. glMapBufferRange lets you specify additional flags, in particular (in later OpenGL versions) a flag that says "don't synchronize". This allows for a more complicated but still faster approach in which you create a single buffer twice the size you need once, and then keep mapping/writing either the lower or upper half, telling OpenGL not to synchronize at all. It is then your responsibility to know when it's safe (presumably by using a fence object), but you avoid all overhead as much as possible.
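For the two-halves variant, the only per-frame arithmetic is picking which half to map. A minimal sketch, assuming the buffer was created with 2 * frameBytes bytes; MapRange and halfForFrame are hypothetical names, and the returned values are what one would pass as offset and length to glMapBufferRange together with the "don't synchronize" flag:

```cpp
#include <cstddef>
#include <cstdint>

// Offset and length of the sub-range to map for one frame.
struct MapRange {
    std::size_t offset;
    std::size_t length;
};

// Even frames write the lower half, odd frames the upper half.
inline MapRange halfForFrame(std::uint64_t frame, std::size_t frameBytes) {
    MapRange r;
    r.offset = (frame & 1u) ? frameBytes : std::size_t(0);
    r.length = frameBytes;
    return r;
}
```

Because OpenGL no longer synchronizes for you, the half returned for frame n must not be written until the fence placed after frame n - 2 has signalled.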
You can't update the texture without updating the texture.
Also, I don't think that one call to glTexImage per frame can be a real performance problem. If you are that concerned about it, though, create two textures and update one of them while the other is being used for drawing, then swap (just like double-buffering works).
If you could move processing to GPU you wouldn't have to call the function at all, which is about 100% speedup.
I just read the following presentation about AMD_pinned_memory.
However, I have a question regarding syncing the transfers.
When copying the data from a buffer to a texture they show the following example (on page 12):
Copy data from a buffer into a texture
// Bind buffer as unpack buffer to copy data into a texture object
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, m_pBuffer[m_uiBufferIdx]);
// Copy pinned memory to texture
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, m_uiTexWidth, m_uiTexHeight, m_nExtFormat, m_nType, NULL);
// Insert Sync object to check for completion
m_UnPackFence= glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
When and how do I wait for the m_UnPackFence? Do I need to call glClientWaitSync or glWaitSync just before using the texture or something else?
I'm aware of your other question about keeping data written to a mapped buffer around. I think you completely misunderstood the whole idea of mapping buffers in OpenGL. Pinned memory isn't going to help you either, because even with the memory pinned you need to synchronize with OpenGL (read the last sentence on page 11 of the presentation you linked). Last but not least, pinned memory will only perform well on CPU/GPU combinations like AMD Fusion. On regular systems you've got the PCI-E bottleneck in between.
Regarding your original problem: I think you misunderstand what glMapBuffer does. It maps some part of GPU memory into your application's address space. This is not like regular system memory, so it's a good idea to keep a copy of the original data around. Reading from a mapped buffer will have quite bad performance unless the OpenGL driver makes a copy of the data for you to read. Think about it: every time you map that buffer, the data has to be copied from the GPU.
The solution to your problem is simple: just keep a copy of your data. This is not a bottleneck. glBufferSubData may be even better suited for you.
I've got a multithreaded OpenGL application using PBOs for data transfers between cpu and gpu.
I have pooled the allocation of PBOs, however, when the pools are empty my non-opengl threads have to block for a while until the OpenGL thread reaches a point where it can allocate the buffers (i.e. finished rendering the current frame). This waiting is causing some lag spikes in certain situations which I'd like to avoid.
Is it possible to allocate PBOs on another thread which are then used by the "main" OpenGL thread?
Yes, you can create objects on one thread which can be used on another. You'll need a new GL context to do this.
That being said, there are two concerns.
First, it depends on what you mean by "allocation of PBOs". You should never be allocating buffer objects in the middle of a frame. You should allocate all the buffers you need up front. When the time comes to use them, then you can simply use what you have.
By "allocate", I mean call glBufferData on a previously allocated buffer using a different size or driver hint than was used before. Or by using glGenBuffers and glDeleteBuffers in any way. Neither of these should happen within a frame.
Second, invalidating the buffer should never cause "lag spikes". By "invalidate", I mean reallocating the buffer with glBufferData using the same size and usage hint, or using glMapBufferRange with the GL_MAP_INVALIDATE_BUFFER_BIT flag. You should look at this page on how to stream buffer object data for details. If you're getting issues, then you're probably on NVIDIA hardware and using the wrong buffer object hint (i.e. use a STREAM usage hint).
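The "allocate everything up front" advice can be sketched as a fixed pool that hands out buffer names and never creates new ones. This is an illustrative sketch, not code from the question: BufferPool and BufferId are hypothetical names, the ids stand in for GLuint buffer names that the OpenGL thread would create with glGenBuffers and glBufferData at startup, and the mutex is there because the question involves several threads.

```cpp
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Stand-in for a GL buffer name (GLuint) created once at startup.
using BufferId = unsigned int;

class BufferPool {
public:
    // Takes ownership of buffers that were all allocated before rendering began.
    explicit BufferPool(std::vector<BufferId> preallocated)
        : free_(std::move(preallocated)) {}

    // Hands out a free buffer, or returns false when the pool is empty.
    // It never allocates, so no caller has to wait for the GL thread.
    bool tryAcquire(BufferId& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return false;
        out = free_.back();
        free_.pop_back();
        return true;
    }

    // Returns a buffer once the transfer that used it has completed.
    void release(BufferId id) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(id);
    }

    std::size_t available() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<BufferId> free_;
};
```

If tryAcquire fails regularly, that suggests the pool was sized too small at startup rather than that buffers should be created mid-frame.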