AMD_pinned_memory - Syncing transfers into a texture - C++

I just read the following presentation about AMD_pinned_memory.
However, I have a question regarding syncing the transfers.
When copying data from a buffer into a texture, they show the following example (on page 12):
Copy data from a buffer into a texture
// Bind buffer as unpack buffer to copy data into a texture object
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, m_pBuffer[m_uiBufferIdx]);
// Copy pinned memory to texture
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, m_uiTexWidth, m_uiTexHeight, m_nExtFormat, m_nType, NULL);
// Insert Sync object to check for completion
m_UnPackFence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
When and how do I wait for m_UnPackFence? Do I need to call glClientWaitSync or glWaitSync just before using the texture, or something else?

I'm aware of your other question about keeping data written to a mapped buffer around. I think you have completely misunderstood the whole idea of mapping buffers in OpenGL. Pinned memory isn't going to help you either, because even with the memory pinned you need to synchronize with OpenGL (read the last sentence on page 11 of the presentation you linked). Last but not least, pinned memory will only perform well on CPU/GPU combinations like AMD Fusion; on regular systems you've got the PCIe bottleneck in between.
Regarding your original problem: I think you completely misunderstand what glMapBuffer does. It maps some part of GPU memory into your application's address space. This is not like regular system memory, so it's a good idea to keep a copy of the original data around. Reading from a mapped buffer will have quite bad performance unless the OpenGL driver makes a copy of the data for you to read. Think about it: every time you map that buffer, the data has to be copied from the GPU.
The solution to your problem is simple: just keep a copy of your data. This is not a bottleneck, and glBufferSubData may be even better suited for you.
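A minimal sketch of that approach, assuming a simple vertex buffer and a GLEW-style loader; the names (vertices, vbo, createBuffer, updateBuffer) are illustrative, not from the question:
#include <vector>
#include <GL/glew.h>

std::vector<float> vertices;   // CPU-side copy, always readable for free
GLuint vbo = 0;

void createBuffer()
{
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // Initial upload; GL_DYNAMIC_DRAW hints that we will update it later
    glBufferData(GL_ARRAY_BUFFER, vertices.size() * sizeof(float), vertices.data(), GL_DYNAMIC_DRAW);
}

void updateBuffer()
{
    // Modify the CPU copy first, then push the changes to the GPU;
    // there is no need to ever map the buffer back for reading
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferSubData(GL_ARRAY_BUFFER, 0, vertices.size() * sizeof(float), vertices.data());
}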

Related

Using PBOs when memory address is already fixed?

I'm trying to optimize the upload of textures from CPU memory to OpenGL.
Initially I was using glTexImage2D and that works, but obviously isn't using DMA, so I'm trying to use PBOs.
Unfortunately, the API I have to use that provides the texture data allocates the CPU memory for them and returns a pointer. I can't control where in memory it places the data.
If I create a PBO then map it, I have to either 'manually' move my data into the PBO-allocated memory, or use glBufferData to initialize it with the texture data before calling glTexImage2D. This appears significantly slower than not using a PBO at all.
Are there any other techniques I could try, or is this just a limitation of how PBOs work?

Why use Vertex Buffer Objects for dynamic objects?

Not sure what the DX parlance is for these, but I'm sure they have a similar notion.
As far as I'm aware, the advantage of VBOs is that they allocate memory that's directly available to the GPU. We can then upload data to this buffer and keep it there for an extended number of frames, avoiding the overhead of uploading the data every frame. Additionally, we're able to alter this data on a per-datum basis if we choose to.
Therefore, I can see the advantage of using VBOs for static geometry, but I don't see any benefit at all for dynamic objects, since you pretty much have to update all the data every frame anyway.
There are several methods of updating buffers in OpenGL. If you have dynamic data, you can simply reinitialize the buffer storage every frame with the new data (e.g. with glBufferData). You can also use client vertex buffer pointers, in compatibility contexts. However, these methods can cause 'churn' in the driver's memory allocation. The new data storage essentially has to sit in system memory until the GPU driver handles it, and it's not possible to get feedback on this process.
In later OpenGL versions (4.4, and via extensions in earlier versions), some functionality was introduced to reduce the overhead of updating dynamic buffers, allowing GPU-allocated memory to be written without direct driver synchronization. This essentially requires that you have glBufferStorage and glMapBufferRange available. You create the buffer storage with GL_DYNAMIC_STORAGE_BIT (plus the relevant map flags, at least GL_MAP_WRITE_BIT and GL_MAP_PERSISTENT_BIT), and then map it with GL_MAP_PERSISTENT_BIT (you may require other flags, depending on whether you are reading and/or writing the data). However, this technique also requires GPU fencing to ensure you are not overwriting the data while the GPU is still reading it. Used this way, updating VBOs is much more efficient than reinitializing the data store or using client buffers.
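As a rough illustration of this persistent-mapping technique (a sketch under the assumptions stated here, not production code: the buffer size, names and the single-fence scheme are made up, and real streaming setups usually multi-buffer the storage):
#include <cstring>
#include <GL/glew.h>

GLuint vbo = 0;
void*  mappedPtr = nullptr;
GLsync fence = nullptr;
const GLsizeiptr kSize = 1 << 20;

void createPersistentBuffer()
{
    const GLbitfield storageFlags = GL_DYNAMIC_STORAGE_BIT | GL_MAP_WRITE_BIT |
                                    GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    const GLbitfield mapFlags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferStorage(GL_ARRAY_BUFFER, kSize, nullptr, storageFlags);    // immutable storage
    mappedPtr = glMapBufferRange(GL_ARRAY_BUFFER, 0, kSize, mapFlags); // stays mapped forever
}

void writeVertexData(const void* src, GLsizeiptr bytes)
{
    // Our responsibility: don't overwrite data the GPU may still be reading
    if (fence) {
        glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(1000000000));
        glDeleteSync(fence);
        fence = nullptr;
    }
    std::memcpy(mappedPtr, src, bytes);
}

void afterDraw()
{
    // Issued right after the draw call that reads from this buffer
    fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}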
There is a good presentation on GDC Vault about this technique (skip to the DynamicStreaming heading).
AFAIK, by creating a dynamic vertex buffer you are giving the graphics driver a hint to place the vertex buffer in memory that is fast for the CPU to write to but still reasonably fast for the GPU to read. The driver will usually manage it to minimize GPU stalls, for example by handing out non-overlapping memory areas so that the CPU can write while the GPU reads another area.
If you do not give a hint, the buffer is assumed to be a static resource, so it will be placed in memory that is fast for the GPU to read and write but very slow for the CPU to write to.
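A tiny illustration of these usage hints; the buffer names, size and data pointer are placeholders:
void createBuffers(GLuint staticVbo, GLuint dynamicVbo, GLsizeiptr size, const void* data)
{
    glBindBuffer(GL_ARRAY_BUFFER, staticVbo);
    glBufferData(GL_ARRAY_BUFFER, size, data, GL_STATIC_DRAW);     // written once, read often by the GPU

    glBindBuffer(GL_ARRAY_BUFFER, dynamicVbo);
    glBufferData(GL_ARRAY_BUFFER, size, nullptr, GL_DYNAMIC_DRAW); // re-specified frequently by the CPU
}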

Efficient way of updating texture in OpenGL

I want to render multiple video streams using OpenGL. Currently I am doing this with glTexImage2D provided by JOGL, rendering into a Swing window.
To update the texture contents for each video frame I call glTexImage2D. I want to know whether there is a faster method to update the texture than calling glTexImage2D for every frame.
You will always be using glTexImage2D (or glTexSubImage2D), but with the difference that the data comes from a buffer object rather than from a client memory pointer.
What's slow in updating a texture is not updating the texture, but synchronizing (blocking) with the current draw operation, and the PCIe transfer. When you call glTexImage2D, OpenGL must wait until it is done drawing the last frame during which it still reads from the texture. During that time, your application is blocked and does nothing (this is necessary because otherwise you could modify or free the memory pointed to before OpenGL has copied it!). Then it must copy the data and transfer it to the graphics card, and only then does your application continue to run.
While one can't make that process much faster, one can make it run asynchronously, so this latency pretty much disappears.
The easiest way of doing this for video frames is to create a buffer name, bind it, and reserve-initialize it once.
Then, on each subsequent frame, discard-initialize ("orphan") it by calling glBufferData with a null data pointer, and fill it either with a non-reserving call (glBufferSubData) or by mapping the buffer's complete range.
The reason why you want to do this strange dance instead of simply overwriting the buffer is that this will not block. OpenGL synchronizes access to buffer objects so you do not overwrite data while it is still reading from it. glBufferData with a null data pointer is a way of telling OpenGL that you don't really care about the old contents and that you don't necessarily want the same buffer. So it will just allocate another one and give you that, keep reading from the old one, and secretly swap them when it's done.
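A hedged sketch of this reserve-once / orphan-per-frame pattern for streaming video frames into a texture; the dimensions, pixel format and names (pbo, tex, uploadFrame) are illustrative assumptions, and the texture is assumed to already have storage of matching size and format:
#include <cstring>
#include <GL/glew.h>

GLuint pbo = 0, tex = 0;
const int kWidth = 1920, kHeight = 1080;
const GLsizeiptr kFrameBytes = GLsizeiptr(kWidth) * kHeight * 4;

void initStreaming()
{
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    // Reserve-initialize once: allocate storage, no data yet
    glBufferData(GL_PIXEL_UNPACK_BUFFER, kFrameBytes, nullptr, GL_STREAM_DRAW);
}

void uploadFrame(const unsigned char* frameData)   // one decoded video frame
{
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    // Discard-initialize ("orphan") so OpenGL need not wait for pending reads
    glBufferData(GL_PIXEL_UNPACK_BUFFER, kFrameBytes, nullptr, GL_STREAM_DRAW);
    void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
    std::memcpy(dst, frameData, kFrameBytes);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

    glBindTexture(GL_TEXTURE_2D, tex);
    // With an unpack buffer bound, the last argument is a byte offset into
    // that buffer, not a client memory pointer
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, kWidth, kHeight, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}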
Since the word "synchronization" was used already, I shall explain my choice of glMapBufferRange in the link above, when in fact you want to map the whole buffer, not some range. Why would one want that?
Even if OpenGL can mostly avoid synchronizing when using the discard technique above, it may still have to, sometimes.
Also, it still has to run some kind of memory allocation algorithm to manage the buffers, which takes driver time. glMapBufferRange lets you specify additional flags, in particular (in later OpenGL versions) a flag that says "don't synchronize" (GL_MAP_UNSYNCHRONIZED_BIT). This allows for a more complicated but still faster approach in which you create, once, a single buffer twice the size you need, and then keep mapping and writing either the lower or upper half, telling OpenGL not to synchronize at all. It is then your responsibility to know when it's safe (presumably by using a fence object), but you avoid as much overhead as possible.
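A hedged sketch of this double-size, unsynchronized-map variant; the sizes, format and names are again illustrative, and the fence handling is deliberately minimal:
#include <cstring>
#include <GL/glew.h>

const int        kW = 1920, kH = 1080;
const GLsizeiptr kHalf = GLsizeiptr(kW) * kH * 4;
GLuint doublePbo = 0, videoTex = 0;
GLsync fences[2] = { nullptr, nullptr };
int    half = 0;                              // which half we write this frame

void initDoubleSizeBuffer()
{
    glGenBuffers(1, &doublePbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, doublePbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, 2 * kHalf, nullptr, GL_STREAM_DRAW);
}

void uploadFrameUnsynchronized(const unsigned char* frameData)
{
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, doublePbo);
    // Our responsibility: wait until the GPU has finished with this half
    if (fences[half]) {
        glClientWaitSync(fences[half], GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(1000000000));
        glDeleteSync(fences[half]);
        fences[half] = nullptr;
    }
    void* dst = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, half * kHalf, kHalf,
                                 GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    std::memcpy(dst, frameData, kHalf);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

    glBindTexture(GL_TEXTURE_2D, videoTex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, kW, kH, GL_RGBA, GL_UNSIGNED_BYTE,
                    reinterpret_cast<const void*>(half * kHalf));   // byte offset into the PBO
    fences[half] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    half = 1 - half;
}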
You can't update the texture without updating the texture.
Also, I don't think that one call to glTexImage2D can be a real performance problem. If you are so concerned about it though, create two textures and update one of them while using the other for drawing, then swap (just like double-buffering works); a sketch follows below.
If you could move the processing to the GPU you wouldn't have to call the function at all, which is about a 100% speedup.
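For completeness, a minimal sketch of the double-buffered texture idea from this answer; all names are illustrative, and allocation of the texture storage is omitted:
GLuint textures[2];   // both assumed allocated with matching size and format
int writeIdx = 0;     // texture currently being updated

void frame(const unsigned char* pixels, int w, int h)
{
    // Update the "back" texture with this frame's data
    glBindTexture(GL_TEXTURE_2D, textures[writeIdx]);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, pixels);

    // Draw with the "front" texture that was uploaded last frame
    glBindTexture(GL_TEXTURE_2D, textures[1 - writeIdx]);
    // ... issue draw calls here ...

    writeIdx = 1 - writeIdx;   // swap roles, just like double-buffering
}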

Is it possible to in-place resize VBOs?

The title says everything, but just to be clear I'll add some extra words.
In this case, resize means:
getting more storage space at the end of the old VBO
keeping the old data at the front
(hopefully without copying, or at least not on the CPU side, meaning the driver should handle this)
EDIT
To explain some more details and justify my question:
I will store data of beforehand-unknown size in the VBO, and I only know an upper limit that is a very rough estimate (10-100x as much, or even more under unusual conditions).
Of course I know how much data I have stored once I am done with it, so it would be nice to store data until I find my VBO too small, resize it, and then go on storing.
Here is why I don't want to copy (especially not on the CPU side):
I am doing all this on the GPU to get interactive frame rates. When I have to copy, it is very slow or even impossible because there is not enough space. Worst of all would be to copy the data via the CPU, i.e. passing everything over the bus into a new memory region of sufficient size and then calling glBufferData on the VBO with the new size and the new memory region as source. That would be the performance killer.
Circumvented
I circumvented the problem with an exact estimate of the needed space, but I will leave this question open for a week to see if someone has another hint, as I am not very happy with this solution.
I think you won't get around doing a copy, because the only way to resize a buffer is to call glBufferData and, as far as I know, there is no way to tell the driver to keep the old data.
What you can probably do, to at least avoid copying the data to the CPU and back again, is create some kind of auxiliary VBO for this purpose, copy directly from your VBO into the auxiliary VBO (using the ARB_copy_buffer extension), resize the original VBO, and then copy its contents back.
But I think the best way is just to allocate a larger buffer beforehand, so that the resize is not necessary; of course in this case you need to know approximately how much extra storage you will need.
Revisiting this question after some years, the landscape has changed a bit with newer versions and extensions.
GPU Side Copying
The extension mentioned in Christian Rau's answer has been core since 3.1, which allows us to copy contents from one buffer object to another (via glCopyBufferSubData). Hopefully the driver does this on the GPU side!
Using this function we can create a larger buffer and copy the leading data over. This has the disadvantage of temporarily doubling the memory requirements, because we need both buffers at once.
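A hedged sketch of that grow-and-copy step, assuming OpenGL 3.1 or ARB_copy_buffer; the helper name and the usage hint are illustrative:
GLuint growBuffer(GLuint oldVbo, GLsizeiptr usedBytes, GLsizeiptr newSize)
{
    GLuint newVbo = 0;
    glGenBuffers(1, &newVbo);

    glBindBuffer(GL_COPY_READ_BUFFER, oldVbo);
    glBindBuffer(GL_COPY_WRITE_BUFFER, newVbo);
    glBufferData(GL_COPY_WRITE_BUFFER, newSize, nullptr, GL_DYNAMIC_DRAW);

    // Copy the leading data; ideally this stays entirely on the GPU
    glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0, usedBytes);

    glDeleteBuffers(1, &oldVbo);   // note: both buffers exist during the copy
    return newVbo;
}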
True resizing
The good news is: With sparse buffers an even better solution is on the horizon.
Given this extension we can allocate a virtual buffer with more than enough space for our data without ever paying for the unneeded space. We only have to "commit" the regions of memory we physically want to store data in. This means we can "grow" the VBO by committing new pages at the end of it.
The bad news is: as of the current OpenGL version (4.5) this is still an extension and not yet core, so adopting it might not be possible. You should also note that there are some details in the specification that are not yet worked out; for example, mapping of sparse buffers is disallowed in the current extension, but support might be added in future versions.
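A heavily hedged sketch of how committing pages at the end of a sparse buffer might look; the token and entry-point names (GL_SPARSE_STORAGE_BIT_ARB, GL_SPARSE_BUFFER_PAGE_SIZE_ARB, glBufferPageCommitmentARB) come from the ARB_sparse_buffer spec and must be checked for availability at runtime, and the sizes are illustrative:
GLint      pageSize = 0;
GLsizeiptr committed = 0;   // how many bytes are physically backed so far

void createSparseVbo(GLuint vbo)
{
    glGetIntegerv(GL_SPARSE_BUFFER_PAGE_SIZE_ARB, &pageSize);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // Reserve a generous virtual range; no physical pages are committed yet
    const GLsizeiptr virtualSize = GLsizeiptr(1) << 30;   // 1 GiB of address space
    glBufferStorage(GL_ARRAY_BUFFER, virtualSize, nullptr,
                    GL_SPARSE_STORAGE_BIT_ARB | GL_DYNAMIC_STORAGE_BIT);
}

void growBy(GLsizeiptr bytes)
{
    // "Grow" the VBO by committing whole pages at its end
    GLsizeiptr add = ((bytes + pageSize - 1) / pageSize) * GLsizeiptr(pageSize);
    glBufferPageCommitmentARB(GL_ARRAY_BUFFER, committed, add, GL_TRUE);
    committed += add;
}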
I would be keen to hear about the availability of this extension if you have any data on that.
Assuming you have support for a recent OpenGL standard, an alternative to VBOs might be to store your data in textures (again, assuming you have enough memory on your card). Copying data between the old and new textures would take place on the card and would not require transferring data over the bus.
Exactly how you achieve this depends on exactly what your code is doing. But in principle, you use texture data to overwrite dummy vertex data in your drawing calls, or maybe use instancing. It would require a lot of thought and rework.

OpenGL Buffer Object internal workings?

I've started to use Pixel Buffer Objects and while I understand how to use them and the gist of what they're doing, I really don't know what's going on under the hood. I'm aware that the OpenGL spec allows for leeway in regards to the exact implementation, but that's still beyond me.
So far as I understand, a buffer object typically resides server-side in GRAM, though this apparently may vary depending on target and usage. This makes perfect sense, as it would explain why OpenGL calls on BOs operate so fast. But in which cases would it reside in AGP or system memory? (Side question: does PCIe have an equivalent of AGP memory?)
Also, glMapBuffer() returns a pointer to a block of memory of the BO so the data may be read/written/changed. But how is this done? The manipulations take place client-side, so the data still has to travel from server to client somehow. If it does, how is it better than glReadPixels()?
PBOs clearly outperform glReadPixels(), judging by the performance difference; I just don't understand how.
I haven't used FBOs yet, but I've heard they're better to use. Is this true? If so, why?
I can't tell you in what memory the buffer object will be allocated. Actually you mostly answered that question yourself, so you can hope that a good driver will actually do it this way.
glMapBuffer can be implemented the same way as memory-mapped files. Remember the difference between physical memory and the virtual address space: when you write to a memory location, the address is translated through a page table to a physical location. If the required page is marked as swapped out, an interrupt occurs and the system loads the required page from swap into RAM. This mechanism can be used to map files and other resources (like GPU memory) into your process's virtual address space. When you call glMapBuffer, the system allocates some address range (not memory, just addresses) and prepares the relevant page table entries. When you read from or write to those addresses, the system loads data from or sends it to the GPU. Of course this would be slow, so some buffering is done along the way.
If you constantly transfer data between the CPU and GPU, I doubt that PBOs will be faster. They are faster when you do many manipulations on the GPU (like reading from the framebuffer, changing a few texels with the CPU, and using the result as a texture again on the GPU). They can also be faster in the case of an integrated graphics processor or AGP memory, because then glMapBuffer can map the addresses directly to physical memory, effectively eliminating one copy operation.
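A hedged sketch of that read-modify-reuse pattern; the buffer, texture and dimensions are assumed to already exist and match:
void readModifyReuse(GLuint pbo, GLuint tex, int w, int h)
{
    // 1. Asynchronous readback into the buffer object (no client pointer)
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    // 2. Map it and change a few texels with the CPU
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    unsigned char* pixels = static_cast<unsigned char*>(glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_READ_WRITE));
    pixels[0] = 255;   // e.g. poke the first red channel
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

    // 3. Use the buffer as the source of a texture update, staying on the GPU path
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}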
Are FBOs better? For what? They are better when you need to render to texture. That's again because they eliminate one data copy operation.