OpenGL Unsynchronized/Non-blocking Map

I just found the following OpenGL specification for ARB_map_buffer_range.
I'm wondering if it is possible to do non-blocking map calls using this extension?
Currently in my application I'm rendering to an FBO, whose contents I then read back into a PBO that I map on the host:
glMapBuffer(target_, GL_READ_ONLY);
However, the problem with this is that it blocks the rendering thread while transferring the data.
I could reduce this issue by pipelining the rendering, but latency is a big issue in my application.
My question is whether I can use map_buffer_range with MAP_UNSYNCHRONIZED_BIT and wait for the map operation to finish on another thread, or defer the map operation on the same thread, while the rendering thread renders the next frame.
e.g.
thread 1:
    map();
    render_next_frame();
thread 2:
    wait_for_map();
or
thread 1:
    map();
    while (!is_map_ready())
        do_some_rendering_for_next_frame();
What I'm unsure of is how I know when the map operation is ready; the specification only mentions "other synchronization techniques to ensure correct operation".
Any ideas?

If you map a buffer with GL_MAP_UNSYNCHRONIZED_BIT, the driver will not wait until OpenGL is done with that memory before mapping it for you. So you will get more or less immediate access to it.
The problem is that this does not mean that you can just read/write that memory willy-nilly. If OpenGL is reading from or writing to that buffer and you change it... welcome to undefined behavior. Which can include crashing.
Therefore, in order to actually use unsynchronized mapping, you must synchronize your behavior to OpenGL's access of that buffer. This will involve the use of ARB_sync objects (or NV_fence if you're only on NVIDIA and haven't updated your drivers recently).
That being said, if you're using a fence object to synchronize access to the buffer, then you really don't need GL_MAP_UNSYNCHRONIZED_BIT at all. Once you finish the fence, or detect that it has completed, you can map the buffer normally and it should complete immediately (unless some other operation is reading/writing too).
In general, unsynchronized access is best used for when you need fine-grained write access to the buffer. In this case, good use of sync objects will get you what you really need (the ability to tell when the map operation is finished).
Addendum: The above is now outdated (depending on your hardware). Thanks to OpenGL 4.4/ARB_buffer_storage, you can now not only map unsynchronized, you can keep a buffer mapped indefinitely. Yes, you can have a buffer mapped while it is in use.
This is done by creating immutable storage and providing that storage with (among other things) the GL_MAP_PERSISTENT_BIT. Then you glMapBufferRange, also providing the same bit.
Now technically, that changes pretty much nothing. You still need to synchronize your actions with OpenGL. If you write stuff to a region of the buffer, you'll need to either issue a barrier or flush that region of the buffer explicitly. And if you're reading, you still need to use a fence sync object to make sure that the data is actually there before reading it (and unless you use GL_MAP_COHERENT_BIT too, you'll need to issue a barrier before reading).
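As a rough illustration of how the storage flags and the map flags have to line up, here is a GL-free sketch of the relevant error rules as I understand them from the OpenGL 4.4 specification. The helper names storage_flags_ok and map_matches_storage are made up for this example, and only the mapping-related rules are modelled; the constants use the same values as the GL headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Same values as in the OpenGL headers; repeated here only so the sketch
   compiles without a GL installation. */
#define GL_MAP_READ_BIT 0x0001
#define GL_MAP_WRITE_BIT 0x0002
#define GL_MAP_PERSISTENT_BIT 0x0040
#define GL_MAP_COHERENT_BIT 0x0080

/* Hypothetical helper: would glBufferStorage accept these mapping flags? */
bool storage_flags_ok(unsigned flags)
{
    if ((flags & GL_MAP_PERSISTENT_BIT) &&
        !(flags & (GL_MAP_READ_BIT | GL_MAP_WRITE_BIT)))
        return false;   /* persistent requires read or write access */
    if ((flags & GL_MAP_COHERENT_BIT) && !(flags & GL_MAP_PERSISTENT_BIT))
        return false;   /* coherent requires persistent */
    return true;
}

/* Hypothetical helper: may a buffer created with `storage` flags be mapped
   with `access`? Each of these four bits requested at map time must also
   have been given when the storage was created. */
bool map_matches_storage(unsigned storage, unsigned access)
{
    const unsigned checked = GL_MAP_READ_BIT | GL_MAP_WRITE_BIT |
                             GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    return ((access & checked) & ~storage) == 0;
}
```

For example, map_matches_storage(GL_MAP_WRITE_BIT, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT) is false, which matches the remark above that the persistent bit has to be provided both when creating the storage and when mapping.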

In general, it is not possible to do a "nonblocking map", but you can map without blocking.
The reason why there can be no "nonblocking map" is that the moment the function call returns, you could access the data, so the driver must make sure it is there, positively. If the data has not been transferred, what else can the driver do but block?
Threads don't make this any better, and possibly make it worse (adding synchronisation and context sharing issues). Threads cannot magically remove the need to transfer data.
And this leads to how to not block on mapping: Only map when you are sure that the transfer is finished. One safe way to do this is to map the buffer after flipping buffers or after glFinish or after waiting on a query/fence object. Using a fence is the preferable way if you can't wait until buffers have been swapped. A fence won't stall the pipeline, but will tell you whether or not your transfer is done (glFinish may or may not, but will probably stall).
Reading after swapping buffers is also 100% safe, but may not be acceptable if you need the data within the same frame (works perfectly for screenshots or for calculating a histogram for tonemapping, though).
A less safe way is to insert "some other stuff" and hope that in the meantime the transfer has completed.
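The "only map when the transfer is finished" rule can be modelled without any GL calls. In the sketch below, each readback slot (think: one PBO) remembers the frame in which its transfer was issued, and the counter gpu_done_frame stands in for what a fence query (for example glClientWaitSync with a zero timeout) would tell you. All type and function names, and the slot count of 3, are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define SLOT_COUNT 3   /* assumed number of readback buffers in flight */

typedef struct {
    bool pending;      /* a transfer into this slot has been issued */
    long frame;        /* frame in which it was issued */
} ReadbackSlot;

typedef struct {
    ReadbackSlot slots[SLOT_COUNT];
    long gpu_done_frame;   /* stand-in for fence state: last frame the GPU finished */
} ReadbackRing;

void ring_init(ReadbackRing *r)
{
    for (int i = 0; i < SLOT_COUNT; ++i) {
        r->slots[i].pending = false;
        r->slots[i].frame = -1;
    }
    r->gpu_done_frame = -1;
}

/* This is where glReadPixels into a PBO, followed by glFenceSync, would go.
   Returns the slot used, or -1 if that slot still holds unread data. */
int ring_issue(ReadbackRing *r, long frame)
{
    assert(frame >= 0);
    int i = (int)(frame % SLOT_COUNT);
    if (r->slots[i].pending)
        return -1;
    r->slots[i].pending = true;
    r->slots[i].frame = frame;
    return i;
}

/* Returns the oldest slot whose transfer has completed, i.e. the one that
   could be mapped without stalling, or -1 if none is ready yet.
   The returned slot is marked free again. */
int ring_take_ready(ReadbackRing *r)
{
    int best = -1;
    for (int i = 0; i < SLOT_COUNT; ++i) {
        const ReadbackSlot *s = &r->slots[i];
        if (s->pending && s->frame <= r->gpu_done_frame &&
            (best < 0 || s->frame < r->slots[best].frame))
            best = i;
    }
    if (best >= 0)
        r->slots[best].pending = false;
    return best;
}
```

The render loop would call ring_issue once per frame and ring_take_ready whenever it has a spare moment; a result of -1 just means "keep rendering, nothing to map yet". The price is latency: the data you get is from an earlier frame.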
Regarding the comment below:
This answer is not incorrect. It isn't possible to do any better than access data after it's available (this should be obvious). Which means that you must sync/block, one way or the other, there is no choice.
Although, from a very pedantic point of view, you can of course use GL_MAP_UNSYNCHRONIZED_BIT to get a non-blocking map operation, this is entirely irrelevant, as it does not work unless you explicitly reproduce the implicit sync as described above. A mapping that you can't safely access is good for nothing.
Mapping and accessing a buffer that OpenGL is transferring data to without synchronizing/blocking (implicitly or explicitly) means "undefined behavior", which is only a nicer wording for "probably garbage results, maybe crash".
If, on the other hand, you explicitly synchronize (say, with a fence as described above), then it's irrelevant whether or not you use the unsynchronized flag, since no more implicit sync needs to happen anyway.

Related

Why don't I need memory barriers when starting to draw on an acquired swapchain image?

I'm learning Vulkan and my experience with memory barriers was quite good until I had to deal with memory visibility.
I feel like I have to use a memory barrier each time I start reading from a resource that I was previously writing to, and vice versa. It is a bit as if there were a state on the memory that says whether it is used for writing or for reading. I know that the rationale for this is related to cache management, but at a higher level that's how I see it.
Bad things start when I don't see memory barriers, where according to my (very likely wrong) understanding they should be.
For example, if I want to draw something and present it on the screen, there is no memory barrier to make a transition from a swapchain image used for presentation (and thus for reading) to an image used for drawing (and thus for writing). And when I finish drawing, there is no barrier in the reverse direction either.
I have seen the same thing happen when copying a host-visible staging buffer to a device-local buffer. You write something to the mapped memory, flush it, and then start recording the copy in a command buffer without any barrier to transition from host-writable memory to transfer-read memory. So I'd like to know what I misunderstand, or what implicit things make everything work out of the box.
Having no barrier around presentation is illegal. The swapchain image must be in VK_IMAGE_LAYOUT_PRESENT_SRC_KHR for presentation, and it must be in a different layout when your app writes to the image. The only way to achieve this is with a barrier-like primitive.
Writes to mapped memory are one rare exception. They are automatically visible to any subsequent vkQueueSubmit. See the Host Write Ordering Guarantees chapter of the specification.
The tutorial has no barriers there because it covers synchronization in the next chapter, which you presumably have not reached yet. It does so with subpass dependencies. The layout transitions that are part of that are shown in the earlier chapter about render passes.

How to use GL_MAP_UNSYNCHRONIZED_BIT with GL_MAP_PERSISTENT_BIT?

I have been working with GL_MAP_PERSISTENT_BIT and glBufferStorage/glMapBufferRange. I am curious if there is an improvement in performance possible using GL_MAP_UNSYNCHRONIZED_BIT.
I already found the question "OpenGL Unsynchronized/Non-blocking Map".
But the answer seems to be a bit contradictory to me. It's said there that you need to sync or block when using this flag. What is the point of setting it unsynchronized if I have to sync it later then anyway? Also I tried this combination and was not able to see any performance difference. Does it even make sense together with persistent mapped buffers? I found literally no examples about such a usage.
The mentioned topic also says that you can "issue a barrier or flush that region of the buffer explicitly".
But every attempt I made so far using these only resulted in garbage.
I am currently using triple buffering. However, I sometimes have to deal with very small chunks of data that I can hardly batch, and I found that glBufferData is often faster in these cases; persistent buffers are only of (huge) benefit if I can batch and also reduce the number of draw calls. Using GL_MAP_UNSYNCHRONIZED_BIT could be the key here.
Can anyone give me a working example, in case it even makes sense in this combination?
"What is the point of setting it unsynchronized if I have to sync it later then anyway?"
The point, as stated by that answer, is that OpenGL isn't doing the synchronization for you. You control when the synchronization happens. This means that you can ensure that it doesn't happen at an inappropriate time. By using your own synchronization, you can also ask the question, "are you finished using the buffer?" which is not a question you could ask without your own sync system.
By using unsynchronized mapping, you stop the implementation from having to check its own internal sync in addition to your synchronization.
However, that answer you linked to applies primarily to non-persistent mapping (since that's what the question was about). Unsynchronized mapping only applies to the map call itself. It prevents GL from issuing internal synchronization due to you calling glMapBufferRange.
But unsynchronized mapping doesn't really affect persistent mapping because... well, it's persistent. The whole point of the feature is that you keep the buffer mapped, so you're only going to call glMapBufferRange once. And the unsynchronized bit only applies at the moment you call glMapBufferRange.
So whether you use unsynchronized or not with your persistent mapping call is essentially irrelevant.

With persistently mapped buffer storage in OpenGL, touched only before draw and after swap, is any further synchronization really needed?

I whipped up a simple C program (on github) that uses OpenGL to draw a bunch of triangles from a buffer that was allocated with glBufferStorage like so:
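glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
/* Persistent + coherent: the mapping stays valid while the GPU uses the buffer. */
GLbitfield bufferStorageFlags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
/* NULL data pointer: allocate immutable storage without initial contents. */
glBufferStorage(GL_ARRAY_BUFFER, vboSize, NULL, bufferStorageFlags);
/* Map once and keep the pointer for the lifetime of the buffer. */
vert *triData = glMapBufferRange(GL_ARRAY_BUFFER, 0, vboSize, bufferStorageFlags);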
I am aware that synchronization is my responsibility when using glBufferStorage with MAP_PERSISTENT_BIT, but I'm not sure exactly what I need to protect against.
The only time I touch triData is before calling glDrawArrays on it, and after calling SDL_GL_SwapWindow, so I know drawing of the last frame is done, and I haven't called for drawing of this frame to begin yet.
This appears to work perfectly, even with vsync disabled.
The wiki says:
"Swapping the back and front buffers on the Default Framebuffer may cause some form of synchronization ... if there are still commands affecting the default framebuffer that have not yet completed. Swapping buffers only technically needs to sync to the last command that affects the default framebuffer, but it may perform a full glFinish."
But every article I've read on the subject makes extensive use of GLsync pointers, though maybe they were just assuming I might want to use the buffer in more complex ways?
For now, am I right to believe SDL_GL_SwapWindow is providing sufficient synchronization?
The previous answers are correct in saying that you do need synchronization even after you use a swap. But I wanted to make even clearer that this is more than just a theoretical concern.
Swap operations are typically not synchronous. It's very common to let the rendering get 1-2 frames ahead of the display. This is done to reduce "bubbles" where the GPU temporarily goes into an idle state. If your swap call were synchronous, the GPU would unavoidably be idle at the time it returns, since all previously submitted work would have completed. Even if you immediately started rendering again, it would take a little time for that work to actually reach the GPU for execution. So you have times where the GPU does nothing, which hurts performance at least as long as your rendering is entirely GPU limited.
Now, you obviously don't want the rendering to get too far ahead of the display. Undesired side effects of that would be increased latency in responding to user input (which is a big deal for games), and excessive memory usage for queued up rendering commands. Therefore, there needs to be throttling before this happens. This throttling is often applied as part of swap operations, but it can potentially happen almost anywhere.
So if you measure the wall clock time taken for a swap call to return, it's fairly common for it to be long enough to suggest that it's blocking. But this does not suggest that the call itself is synchronous. It may just be blocking until a previous frame completes, to prevent the rendering from getting too far ahead of the display.
Here's my favorite advice about any multithreaded/asynchronous code:
If multithreaded code isn't immediately, obviously, provably correct then it is almost certainly wrong.
You cannot prove that OpenGL will not read from a value you are writing to. Therefore, it is wrong, even if no problems are apparent.
Yes, you need to do explicit synchronization. Even though you coherently mapped the buffer, you still cannot change the values in it while OpenGL might be reading from them. You must wait until after the last call that reads from that data has completed before writing to it again. And the only ways you have to wait for OpenGL to finish are glFinish or glClientWaitSync.
I am aware that synchronization is my responsibility when using glBufferStorage,
No, not necessarily. A buffer created with glBufferStorage is no different from a buffer created with glBufferData, except for the fact that you can't re-specify it.
You only need to do manual synchronization when mapping with the MAP_PERSISTENT_BIT (which was included in the same extension that glBufferStorage was, ARB_buffer_storage).
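To make the "wait until the last read has finished" rule concrete, here is a GL-free model of a persistently mapped buffer split into several regions. Each region remembers the last draw that read from it, and gpu_done_draw stands in for what a fence (glFenceSync placed after the draw, later checked with glClientWaitSync) would report. The names, the region count of 3, and the draw ids are assumptions for this sketch, not anything from the question's program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REGION_COUNT 3   /* assumed: triple buffering */

typedef struct {
    size_t region_size;                 /* bytes per region */
    long   last_read_by[REGION_COUNT];  /* last draw reading each region, -1 if none */
    long   gpu_done_draw;               /* stand-in for fence state: last completed draw */
} PersistentRing;

void pring_init(PersistentRing *p, size_t region_size)
{
    p->region_size = region_size;
    for (int i = 0; i < REGION_COUNT; ++i)
        p->last_read_by[i] = -1;
    p->gpu_done_draw = -1;
}

/* Byte offset into the mapped pointer of the region that `frame` writes to. */
size_t pring_offset(const PersistentRing *p, long frame)
{
    assert(frame >= 0);
    return (size_t)(frame % REGION_COUNT) * p->region_size;
}

/* True if the GPU has finished every draw that reads this frame's region,
   so the CPU may overwrite it. In real code this is the fence check. */
bool pring_can_write(const PersistentRing *p, long frame)
{
    assert(frame >= 0);
    long last = p->last_read_by[frame % REGION_COUNT];
    return last < 0 || last <= p->gpu_done_draw;
}

/* Call right after issuing the draw; this is where glFenceSync would go. */
void pring_mark_drawn(PersistentRing *p, long frame, long draw_id)
{
    assert(frame >= 0);
    p->last_read_by[frame % REGION_COUNT] = draw_id;
}
```

With a single region, as in the question, every frame has to wait for the previous frame's draw to complete before touching the data. With several regions, a frame only waits if the GPU is a full ring behind. Note that returning from the swap call does not appear anywhere in this model; only the completion of the draw matters.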

Write-only `glMapBuffer`, what if I don't write it all?

Say I've got a buffer object with some data in it.
I use glMapBuffer with GL_WRITE_ONLY and write to every second byte (think interleaved vertex attributes).
Then I glUnmapBuffer the buffer.
Are the bytes I didn't write to preserved or are they now undefined?
I'm wondering because the main purpose of GL_WRITE_ONLY seems to be to avoid transferring the previous content of the buffer from the card's memory to main memory. The driver, however, has no way of knowing to which bytes I've actually written something in order to update the buffer only partially.
So either the driver transfers the content to main memory first, rendering GL_WRITE_ONLY pointless on pretty much every platform I could think of. Or it is assumed that I write the complete mapped area. Yet no such obligation is mentioned in the man pages.
Short answer: The data is preserved.
"I'm wondering because the main purpose of GL_WRITE_ONLY seems to be to avoid transferring the previous content of the buffer from the card's memory to main memory."
Well, the implementation has many potential ways to fulfill that request, and the access flags may help in the decision of which path to go. For example, the driver may decide to do some direct I/O mapping of the buffer in VRAM instead of using system RAM for the mapping.
The issues you see with this are actually addressed by the more modern glMapBufferRange() API introduced in the GL_ARB_map_buffer_range extension. Although the name might suggest that this is for mapping parts of the buffers, it actually supersedes the glMapBuffer() function completely and allows for much finer control. For example, the GL_MAP_INVALIDATE_RANGE_BIT or GL_MAP_INVALIDATE_BUFFER_BIT flags mark the data as invalid and enable the optimizations you had in mind for the general GL_WRITE_ONLY case. But without these, the data is to be preserved, and how this is done is the implementation's problem.
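As a small aid for picking glMapBufferRange access flags, the following GL-free sketch encodes the combinations that, to my reading of the specification, make glMapBufferRange fail with GL_INVALID_OPERATION. The function name map_access_ok is hypothetical; the constants use the same values as the GL headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Same values as in the OpenGL headers; repeated here only so the sketch
   compiles without a GL installation. */
#define GL_MAP_READ_BIT 0x0001
#define GL_MAP_WRITE_BIT 0x0002
#define GL_MAP_INVALIDATE_RANGE_BIT 0x0004
#define GL_MAP_INVALIDATE_BUFFER_BIT 0x0008
#define GL_MAP_FLUSH_EXPLICIT_BIT 0x0010
#define GL_MAP_UNSYNCHRONIZED_BIT 0x0020

/* Hypothetical helper: is this a legal `access` argument for glMapBufferRange?
   (Rules that depend on the buffer's state or storage flags are not modelled.) */
bool map_access_ok(unsigned access)
{
    /* At least one of read or write must be requested. */
    if (!(access & (GL_MAP_READ_BIT | GL_MAP_WRITE_BIT)))
        return false;
    /* Invalidation and unsynchronized mapping cannot be combined with reading. */
    if ((access & GL_MAP_READ_BIT) &&
        (access & (GL_MAP_INVALIDATE_RANGE_BIT |
                   GL_MAP_INVALIDATE_BUFFER_BIT |
                   GL_MAP_UNSYNCHRONIZED_BIT)))
        return false;
    /* Explicit flushing only makes sense for a writable mapping. */
    if ((access & GL_MAP_FLUSH_EXPLICIT_BIT) && !(access & GL_MAP_WRITE_BIT))
        return false;
    return true;
}
```

So GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT is the explicit way of saying "I will overwrite this range, do not bother preserving it", while a plain GL_MAP_WRITE_BIT mapping keeps whatever you do not touch.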

Making glReadPixels() run faster

I want a really fast way to capture the content of the OpenGL framebuffer for my application. Generally, glReadPixels() is used for reading the content of the framebuffer into a buffer. But this is slow.
I was trying to parallelise the process of reading the framebuffer content by creating 4 threads that read 4 different regions of the framebuffer using glReadPixels(). But the application exits with a segmentation fault. If I remove the glReadPixels() call from the threads, the application runs properly.
Threads do not work, abstain from that approach.
Creating several threads fails, as you have noticed, because only one thread has a current OpenGL context. In principle, you could make the context current in each worker thread before calling glReadPixels, but this will require extra synchronization from your side (otherwise, a thread could be preempted in between making the context current and reading back!), and (wgl|glx)MakeCurrent is a terribly slow function that will seriously stall OpenGL. In the end, you'll be doing more work to get something much slower.
There is no way to make glReadPixels any faster [1], but you can decouple the time it takes from your application (i.e. run the readback asynchronously), so that it does not block and effectively appears to run "faster".
You want to use a pixel buffer object (PBO) for that. Be sure to get the buffer's usage flags right (for a readback buffer, GL_STREAM_READ is the usual hint).
Note that mapping the buffer to access its contents will still block if the transfer has not finished, so it will still not be any faster. To account for that, you either have to read the previous frame's data, or use a fence object that you can query to be sure the transfer is done.
Or, simpler but less reliable, you can insert "some other work" in between glReadPixels and accessing the data. This will not guarantee that the transfer has finished by the time you access the data, so it may still block. However, it may just work, and it will likely block for a shorter time (thus run "faster").
[1] There are plenty of ways of making it slower, e.g. if you ask OpenGL to do some weird conversions or if you use wrong buffer flags. However, generally, there's no way to make it faster since its speed depends on all previous draw commands having finished before the transfer can even start, and the data being transferred over the PCIe bus (which has a fixed time overhead plus a finite bandwidth).
The only viable way of making readbacks "faster" is hiding this latency. It's of course still not faster, but you don't get to feel it.
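To put rough numbers on "fixed overhead plus finite bandwidth", here is a small arithmetic sketch. It computes how many bytes a glReadPixels call delivers, assuming byte-sized components, the default pack state apart from GL_PACK_ALIGNMENT, and rows padded to that alignment, and then applies a simple overhead + size / bandwidth model. The function names are made up, and the overhead and bandwidth you plug in are your own guesses or measurements, not properties of any particular hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes per row as written by glReadPixels: each row is padded up to a
   multiple of GL_PACK_ALIGNMENT (1, 2, 4 or 8; the default is 4). */
size_t readback_row_stride(size_t width, size_t bytes_per_pixel,
                           size_t pack_alignment)
{
    assert(pack_alignment == 1 || pack_alignment == 2 ||
           pack_alignment == 4 || pack_alignment == 8);
    size_t row = width * bytes_per_pixel;
    return (row + pack_alignment - 1) / pack_alignment * pack_alignment;
}

/* A safe (slightly generous) size for the destination buffer or PBO:
   one padded row per scanline. */
size_t readback_buffer_size(size_t width, size_t height,
                            size_t bytes_per_pixel, size_t pack_alignment)
{
    return readback_row_stride(width, bytes_per_pixel, pack_alignment) * height;
}

/* Simple latency model: fixed overhead plus size divided by bandwidth.
   Both parameters are assumptions supplied by the caller. */
double readback_seconds(size_t bytes, double fixed_overhead_s,
                        double bytes_per_second)
{
    return fixed_overhead_s + (double)bytes / bytes_per_second;
}
```

For a 1920x1080 RGBA8 frame this gives 8294400 bytes. With placeholder values of 20 microseconds overhead and 8e9 bytes per second, the model lands at roughly one millisecond for the transfer alone, on top of waiting for all earlier draw commands. That is the time you want to overlap with other work rather than sit through.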