The example I was reading comes from the OpenGL Red Book.
Source code is here: https://github.com/openglredbook/examples/blob/master/src/11-oit/11-oit.cpp
I read that image load/store is an incoherent memory access and does not guarantee ordering between two rendering commands: https://www.khronos.org/opengl/wiki/Memory_Model
When I read the source code for this algorithm, I see no mention of a memory barrier.
So do I actually need to issue a memory barrier between the rendering command that sorts the fragments and stores them, and the rendering command that renders the quad?
For your general question, yes, you need an explicit memory barrier between the two operations.
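For example, a minimal sketch of where such a barrier would go, assuming the two passes are separate draw calls (the program and VAO names here are placeholders, not taken from the Red Book code):

// Pass 1: build the per-pixel fragment lists via image load/store.
glUseProgram(buildListsProgram);
glBindVertexArray(sceneVao);
glDrawArrays(GL_TRIANGLES, 0, sceneVertexCount);

// Make pass 1's image and atomic-counter writes visible to the shader
// reads performed by the resolve pass.
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT |
                GL_ATOMIC_COUNTER_BARRIER_BIT);

// Pass 2: full-screen quad that sorts and blends the stored fragments.
glUseProgram(resolveProgram);
glBindVertexArray(quadVao);
glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);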
On a more personal note, please stop looking at that code. I'm seeing many dubious things beyond just the lack of a barrier: the mapping of a buffer for the sole purpose of writing a single integer, a call to glTexSubImage2D that's sure to give an error because NULL is not a valid pointer parameter, etc.
I'm learning Vulkan, and my experience with memory barriers was quite good until I had to deal with memory visibility.
I feel like I have to use a memory barrier each time I start reading from a resource I was previously writing to, and vice versa. It's a bit as if the memory had a state saying whether it is currently used for writing or for reading. I know the rationale for this is related to cache management, but at a higher level that's how I see it.
Things get confusing when I don't see memory barriers where, according to my (very likely wrong) understanding, they should be.
For example, if I want to draw something and present it on the screen, there is no memory barrier making the transition from a swapchain image used for presentation (and thus for reading) to an image used for drawing (and thus for writing). And when I finish drawing, there is no barrier in the reverse direction either.
I have seen the same thing happen when copying a staging host-visible buffer to a device-local buffer. You write something into the mapped memory, flush it, and then start recording the copy in a command buffer without putting any barrier in place to transition from host-writable memory to transfer-read memory. So I'd like to know what I am misunderstanding, or what implicit mechanisms make everything work out of the box.
Having no barrier around presentation is actually illegal. The swapchain image must be in VK_IMAGE_LAYOUT_PRESENT_SRC_KHR for presentation, and it must be in a different layout when your app writes to the image. The only way to achieve this is with a barrier-like primitive.
Writes to mapped memory are one rare exception: they are automatically made visible to any subsequent vkQueueSubmit. See the Host Write Ordering Guarantees chapter of the specification.
The reason the tutorial does not have barriers there is that it covers synchronization in the next chapter, which you presumably have not reached yet. It does so with subpass dependencies; the layout transitions that are part of that are shown in the earlier chapter about render passes.
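For reference, a hedged sketch of the kind of subpass dependency the tutorial adds later (the field values follow the common vulkan-tutorial pattern and are not taken from your code):

// External -> subpass 0 dependency: color-attachment writes wait until the
// swapchain image is available; the render pass performs the layout
// transition to COLOR_ATTACHMENT_OPTIMAL as part of this dependency.
VkSubpassDependency dependency{};
dependency.srcSubpass    = VK_SUBPASS_EXTERNAL;
dependency.dstSubpass    = 0;
dependency.srcStageMask  = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dependency.srcAccessMask = 0;
dependency.dstStageMask  = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dependency.dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;

VkRenderPassCreateInfo renderPassInfo{};
renderPassInfo.sType           = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO;
// ... attachments and subpass descriptions as before ...
renderPassInfo.dependencyCount = 1;
renderPassInfo.pDependencies   = &dependency;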
Say I've got a buffer object with some data in it.
I use glMapBuffer with GL_WRITE_ONLY and write to every second byte (think interleaved vertex attributes).
Then I glUnmapBuffer the buffer.
Are the bytes I didn't write to preserved or are they now undefined?
I'm wondering because the main purpose of GL_WRITE_ONLY seems to be to avoid transferring the previous content of the buffer from the card's memory to main memory. The driver, however, has no way of knowing to which bytes I've actually written something in order to update the buffer only partially.
So either the driver transfers the content to main memory first, rendering GL_WRITE_ONLY pointless on pretty much every platform I could think of. Or it is assumed that I write the complete mapped area. Yet no such obligation is mentioned in the man pages.
Short answer: The data is preserved.
I'm wondering because the main purpose of GL_WRITE_ONLY seems to be to avoid transferring the previous content of the buffer from the card's memory to main memory.
Well, the implementation has many potential ways to fulfill that request, and the access flags may help it decide which path to take. For example, the driver may decide to do some direct I/O mapping of the buffer in VRAM instead of using system RAM for the mapping.
The issues you see with this are actually addressed by the more modern glMapBufferRange() API introduced in the GL_ARB_map_buffer_range extension. Although the name might suggest that it is only for mapping parts of a buffer, it actually supersedes the glMapBuffer() function completely and allows much finer control. For example, the GL_MAP_INVALIDATE_RANGE_BIT or GL_MAP_INVALIDATE_BUFFER_BIT flags mark the data as invalid and enable the optimizations you had in mind for the general GL_WRITE_ONLY case. But without these flags, the data is to be preserved, and how this is done is the implementation's problem.
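As a hedged sketch of that write-only, invalidating path (the buffer object, offset, size, and source pointer are placeholders):

// Invalidate the range we are about to overwrite so the driver does not
// need to read back the old contents before handing us a pointer.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
void* ptr = glMapBufferRange(GL_ARRAY_BUFFER, offset, size,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT);
if (ptr) {
    memcpy(ptr, newVertexData, size);   // write every byte of the mapped range
    glUnmapBuffer(GL_ARRAY_BUFFER);
}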
I'm asking this question because I don't want to spend time writing some code that duplicates functionalities of the OpenGL drivers.
Can the OpenGL driver/server hold more data than the video card? Say, I have enough video RAM to hold 10 textures. Can I ask OpenGL to allocate 15 textures without getting a GL_OUT_OF_MEMORY error?
If I can rely on the driver to cleverly send the textures/buffers/objects from the 'normal' RAM to the video RAM when needed then I don't really need to Gen/Delete these objects myself. I become limited by the 'normal' RAM which is often plentiful when compared to the video RAM.
The approach "memory is abundant so I don't need to delete" is bad, and the approach "memory is abundant, so I'll never get out of memory errors" is flawed.
OpenGL memory management is obscure, both for technical reasons (see t.niese's comment above) and for ideological reasons ("you don't need to know, you don't want to know"). Though there exist vendor extensions (such as ATI_meminfo) that let you query some non-authoritative numbers (non-authoritative insofar as they could change the next millisecond, and they do not take effects like fragmentation into account).
Generally, for the most part, your assumption that you can use more memory than there is GPU memory is correct.
However, you are usually not able to use all available memory. More likely, there is a limit well below "all available RAM" due to constraints on which memory regions (and how large a region) the driver can allocate, lock, and DMA to/from. And even though you can normally use more memory than will fit on the GPU (even if you used it exclusively), this does not mean careless allocations can't and won't eventually fail.
Usually, but not necessarily, you consume about as much system memory as GPU memory, too (the driver does this secretly, without you knowing). Since the driver swaps resources in and out as needed, it needs to maintain a copy. Sometimes, it is necessary to keep 2 or 3 copies (e.g. when streaming or for ARB_copy_buffer operations). Sometimes, mapping a buffer object means yet another copy in a specially allocated block, and sometimes you're allowed to write straight into the driver's memory.
On the other hand, PCIe 2.0 (and PCIe 3.0 even more so) is fast enough to stream vertices from main memory, so you do not even strictly need GPU memory (other than a small buffer). Some drivers will stream dynamic geometry right away from system memory.
Some GPUs do not even have separate system and GPU memory (Intel Sandy Bridge or AMD Fusion).
Also, you should note that deleting objects does not necessarily delete them (at least not immediately). Usually, with very few exceptions, deleting an OpenGL object is merely a tentative delete which prevents you from further referencing the object. The driver will keep the object valid for as long as it needs to.
On the other hand, you really should delete what you do not need any more, and you should delete early. For example, you should delete a shader immediately after attaching it to the program object. This ensures that you do not leak resources, and it is guaranteed to work. Deleting and re-specifying the in-use vertex or pixel buffer when streaming (by calling glBufferData(..., NULL)) is a well-known idiom. This only affects your view of the object, and it allows the driver to continue using the old object in parallel for as long as it needs to.
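Two hedged sketches of those idioms (the object names are placeholders):

// Delete shaders as soon as they are attached and the program is linked;
// the program object keeps the compiled code alive as long as it needs it.
glAttachShader(program, vertexShader);
glAttachShader(program, fragmentShader);
glLinkProgram(program);
glDeleteShader(vertexShader);
glDeleteShader(fragmentShader);

// Orphan a streamed buffer: re-specify its data store with NULL so the
// driver can hand out fresh storage while the old contents are still in use.
glBindBuffer(GL_ARRAY_BUFFER, streamVbo);
glBufferData(GL_ARRAY_BUFFER, streamSize, NULL, GL_STREAM_DRAW);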
Some additional information to my comment that did not fit in there.
There are different reasons why this is not part of OpenGL.
It isn't an easy task for the system/driver to guess which resources are and will be required. The driver could certainly build an internal heuristic for whether a resource is used often or rarely (much as a CPU does with branch prediction and speculative execution). But the GPU cannot know, without knowledge of the application code, which resource will be required next. It does not even know where the geometry is placed in the scene, because you do that yourself with the model and view matrices you pass to your shader.
If, for example, you have a game where you can walk through a scene, you normally won't render the parts that are out of view. So the GPU could conclude that those resources are no longer required, but if you turn around, all those textures and that geometry are required again and need to be moved from system memory to GPU memory, which could result in really bad performance. The game engine itself, however, has in-depth knowledge about the scene, thanks to octrees (or similar techniques) and the possible paths that can be walked: it knows which resources can be removed from the GPU, which ones should be moved to the GPU while playing, and where it would be necessary to display a loading screen.
If you look at the evolution of OpenGL and which features have become deprecated, you will see that it is moving in the direction of removing everything except the features that really can be done best by the graphics card, driver, and system. Everything else is up to the user to implement on their own to get the best performance. (For example, you create your projection matrix yourself and pass it to the shader, so OpenGL does not even know where an object is placed in the scene.)
Here's my TL;DR answer, I recommend reading Daemon's and t.niese's answers as well:
Can the OpenGL driver/server hold more data than the video card?
Yes
Say, I have enough video RAM to hold 10 textures. Can I ask OpenGL to allocate 15 textures without getting a GL_OUT_OF_MEMORY error?
Yes. Depending on the driver/GPU combination it might even be possible to allocate a single texture that exceeds the GPU's memory and actually use it for rendering. At my current occupation I exploit that fact to extract slices of arbitrary orientation and geometry from large volumetric datasets, using shaders to apply filters to the voxel data in situ. Works well, but not at interactive frame rates.
So from looking around at examples and tutorials, it seems the most common method of placing buffers in the pipeline is that every model object gets its own vertex buffer; after the buffers are filled, they lock, set the buffers, unlock, set shaders, draw, and rinse/repeat for every model's individual buffer. It seems to me that all that locking and unlocking would slow things down a bit.
So I'm wondering if the model objects could instead aggregate all their vertices into one big array and all their indices into another, create one large buffer, lock once, set the buffers once, unlock, and then switch shaders and draw as many polygons as required with each shader, working along the buffer drawing and switching shaders as before, instead of having to lock and drop more vertices into the pipeline every time before drawing.
Would this be any more efficient, or do you think the overhead from all the bookkeeping involved (for example, "from index a to index b, use this shader") would just make this more work than it's worth?
Also, if I have missed a Direct3D concept here, please inform me (I'm new).
EDIT
Due to a massive misunderstanding: anywhere I referred to locking and unlocking was actually supposed to mean just calling IASetVertexBuffer/IASetIndexBuffer. The "revised" question is more or less:
Does stuffing the vertices for all the models in the scene into one single buffer, and simply calling IASetVertexBuffer once, improve performance at all?
So from looking around at examples and tutorials
Stop. Most "examples and tutorials" for anything are not intended to show best performance practices, unless they are specifically about best performance practices. They're trying to show, in the clearest and cleanest way, how to perform task X. Optimization is an entirely different issue. Optimized code is a lot less clear and clean than unoptimized code; thus, many optimizations would get in the way of the tutorial's stated purpose.
So never assume that just because a tutorial does it some way, that's the fastest way to do something. It is simply one way to do it.
then after the buffers are filled, they lock, set the buffers, unlock, set shaders, draw, and rinse/repeat for every models individual buffer.
Locking and unlocking is for modifying the buffer. If you're not modifying it... why are you locking it? And if you are modifying it, then you're doing some form of buffer streaming, which requires special handling in order to make it efficient.
If you're doing streaming, then that's a different question you should ask (ie: how to do high-performance vertex streaming).
That isn't to say that putting the data for multiple objects in one buffer isn't a good idea. But if it is, the reason for it has less to do with locking and unlocking and more to do with the possibility of drawing multiple objects with a single draw call.
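For illustration, a hedged D3D11-style sketch of the one-big-buffer layout the question describes, binding once and then drawing each model from its own region (the Model fields, shader pointers, and buffer names are placeholders, and this assumes all models share one vertex format):

// Bind the shared vertex/index buffers once.
UINT stride = sizeof(Vertex), offset = 0;
context->IASetVertexBuffers(0, 1, &bigVertexBuffer, &stride, &offset);
context->IASetIndexBuffer(bigIndexBuffer, DXGI_FORMAT_R32_UINT, 0);

// Draw each model from its region of the buffers, switching shaders as needed.
for (const Model& m : models) {
    context->VSSetShader(m.vs, nullptr, 0);
    context->PSSetShader(m.ps, nullptr, 0);
    context->DrawIndexed(m.indexCount, m.firstIndex, m.baseVertex);
}

Note that this only removes the per-model buffer binds; collapsing the draws themselves into fewer calls additionally requires the objects to share shaders and state.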
In general, the fewer locks the better: every lock has to be an in-sync transfer between system memory and graphics-card memory that stalls your GPU. The more you can batch these transfers together, the better.
An even better improvement however is to leave buffers that don't change alone. You won't always need to reload bench #1221 every. single. frame. It never changes (*). So load your static art at the beginning and just draw it as needed. And before you think of culling half the bench away in preprocessing, think twice about the cost of locking a buffer just to get rid of a few vertices when your GPU already knows how to do basic culling at lightning speeds.
(*) assuming it doesn't change of course :)
I just found the following OpenGL specification for ARB_map_buffer_range.
I'm wondering if it is possible to do non-blocking map calls using this extension?
Currently in my application I'm rendering to an FBO, which I then read back through a host PBO that I map:
glMapBuffer(target_, GL_READ_ONLY);
However, the problem with this is that it blocks the rendering thread while transferring the data.
I could reduce this issue by pipelining the rendering, but latency is a big issue in my application.
My question is whether I can use map_buffer_range with MAP_UNSYNCHRONIZED_BIT and wait for the map operation to finish on another thread, or defer the map operation on the same thread, while the rendering thread renders the next frame.
e.g.
thread 1:
map();
render_next_frame();
thread 2:
wait_for_map
or
thread 1:
map();
while(!is_map_ready())
do_some_rendering_for_next_frame();
What I'm unsure of is how I would know when the map operation is ready; the specification only mentions "other synchronization techniques to ensure correct operation".
Any ideas?
If you map a buffer with GL_MAP_UNSYNCHRONIZED_BIT, the driver will not wait until OpenGL is done with that memory before mapping it for you. So you will get more or less immediate access to it.
The problem is that this does not mean that you can just read/write that memory willy-nilly. If OpenGL is reading from or writing to that buffer and you change it... welcome to undefined behavior. Which can include crashing.
Therefore, in order to actually use unsynchronized mapping, you must synchronize your behavior to OpenGL's access of that buffer. This will involve the use of ARB_sync objects (or NV_fence if you're only on NVIDIA and haven't updated your drivers recently).
That being said, if you're using a fence object to synchronize access to the buffer, then you really don't need GL_MAP_UNSYNCHRONIZED_BIT at all. Once you have waited on the fence, or detected that it has completed, you can map the buffer normally and it should complete immediately (unless some other operation is reading from or writing to it).
In general, unsynchronized access is best used for when you need fine-grained write access to the buffer. In this case, good use of sync objects will get you what you really need (the ability to tell when the map operation is finished).
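A hedged sketch of that fence-based approach for the readback case (the PBO name is a placeholder, and the fence is simply polled once here; a real application would keep polling on later frames):

// After issuing the commands that write into the PBO (e.g. glReadPixels
// with the PBO bound), drop a fence into the command stream.
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// ... render the next frame here ...

// Map only once the fence has signaled; the map then returns promptly.
GLenum status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
if (status == GL_ALREADY_SIGNALED || status == GL_CONDITION_SATISFIED) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    void* data = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
    // ... read the pixel data ...
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    glDeleteSync(fence);
}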
Addendum: The above is now outdated (depending on your hardware). Thanks to OpenGL 4.4/ARB_buffer_storage, you can now not only map unsynchronized, you can keep a buffer mapped indefinitely. Yes, you can have a buffer mapped while it is in use.
This is done by creating immutable storage and providing that storage with (among other things) the GL_MAP_PERSISTENT_BIT. Then you glMapBufferRange, also providing the same bit.
Now technically, that changes pretty much nothing. You still need to synchronize your actions with OpenGL. If you write stuff to a region of the buffer, you'll need to either issue a barrier or flush that region of the buffer explicitly. And if you're reading, you still need to use a fence sync object to make sure that the data is actually there before reading it (and unless you use GL_MAP_COHERENT_BIT too, you'll need to issue a barrier before reading).
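A hedged sketch of that persistent-mapping path (GL 4.4 / ARB_buffer_storage; the buffer name, sizes, and data pointer are placeholders, and GL_MAP_COHERENT_BIT is used so no explicit flush or barrier is shown):

// Immutable storage that can stay mapped for the buffer's whole lifetime.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferStorage(GL_ARRAY_BUFFER, size, NULL,
                GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
void* ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                             GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                             GL_MAP_COHERENT_BIT);

// You may write through 'ptr' at any time, but still fence so you never
// overwrite a region the GPU is currently reading from.
memcpy(ptr, frameData, frameSize);
// ... issue the draw calls that consume this region ...
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);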
In general, it is not possible to do a "nonblocking map", but you can map without blocking.
The reason why there can be no "nonblocking map" is that the moment the function call returns, you could access the data, so the driver must make sure it is there, positively. If the data has not been transferred, what else can the driver do but block.
Threads don't make this any better, and possibly make it worse (adding synchronisation and context sharing issues). Threads cannot magically remove the need to transfer data.
And this leads to how not to block on mapping: only map when you are sure that the transfer is finished. One safe way to do this is to map the buffer after swapping buffers, after glFinish, or after waiting on a query/fence object. Using a fence is the preferable way if you can't wait until the buffers have been swapped. A fence won't stall the pipeline, but it will tell you whether or not your transfer is done (glFinish may or may not, but will probably stall).
Reading after swapping buffers is also 100% safe, but may not be acceptable if you need the data within the same frame (works perfectly for screenshots or for calculating a histogram for tonemapping, though).
A less safe way is to insert "some other stuff" and hope that in the mean time the transfer has completed.
In respect of below comment:
This answer is not incorrect. It isn't possible to do any better than access data after it's available (this should be obvious). Which means that you must sync/block, one way or the other, there is no choice.
Although, from a very pedantic point of view, you can of course use GL_MAP_UNSYNCHRONIZED_BIT to get a non-blocking map operation, this is entirely irrelevant, as it does not work unless you explicitly reproduce the implicit sync as described above. A mapping that you can't safely access is good for nothing.
Mapping and accessing a buffer that OpenGL is transferring data to without synchronizing/blocking (implicitly or explicitly) means "undefined behavior", which is only a nicer wording for "probably garbage results, maybe crash".
If, on the other hand, you explicitly synchronize (say, with a fence as described above), then it's irrelevant whether or not you use the unsynchronized flag, since no more implicit sync needs to happen anyway.