How to use GL_MAP_UNSYNCHRONIZED_BIT with GL_MAP_PERSISTENT_BIT?

I have been working with GL_MAP_PERSISTENT_BIT and glBufferStorage/glMapBufferRange. I am curious if there is an improvement in performance possible using GL_MAP_UNSYNCHRONIZED_BIT.
I already found Opengl Unsynchronized/Non-blocking Map
But the answer seems a bit contradictory to me. It says that you need to sync or block when using this flag. What is the point of setting it unsynchronized if I have to sync later anyway? Also, I tried this combination and was not able to see any performance difference. Does it even make sense together with persistently mapped buffers? I found literally no examples of such usage.
The mentioned topic also says that you can
issue a barrier or flush that region of the buffer explicitly
But every attempt I made so far using these only resulted in garbage.
I am currently using triple buffering, but since I sometimes have to deal with very small chunks of data that I can hardly batch, I found that glBufferData is often faster in those cases, and persistent buffers are only of (huge) benefit if I can batch and also reduce the number of draw calls. Using GL_MAP_UNSYNCHRONIZED_BIT could be the key here.
Can anyone give me a working example, in case it even makes sense in this combination?

What is the point of setting it unsynchronized if I have to sync it later then anyway?
The point, as stated by that answer, is that OpenGL isn't doing the synchronization for you. You control when the synchronization happens. This means that you can ensure that it doesn't happen at an inappropriate time. By using your own synchronization, you can also ask the question, "are you finished using the buffer?" which is not a question you could ask without your own sync system.
By using unsynchronized mapping, you stop the implementation from having to check its own internal sync in addition to your synchronization.
However, that answer you linked to applies primarily to non-persistent mapping (since that's what the question was about). Unsynchronized mapping only applies to the map call itself. It prevents GL from issuing internal synchronization due to you calling glMapBufferRange.
But unsynchronized mapping doesn't really affect persistent mapping because... well, it's persistent. The whole point of the feature is that you keep the buffer mapped, so you're only going to call glMapBufferRange once. And the unsynchronized bit only applies at the moment you call glMapBufferRange.
So whether you use unsynchronized or not with your persistent mapping call is essentially irrelevant.
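To make that concrete, here is a minimal sketch of a common persistent-mapping pattern (not the asker's code): the storage flags, the 1 MiB region size and the triple-buffered ring are illustrative assumptions, and the synchronization is done with fence objects around each region rather than with the unsynchronized flag.
// Minimal sketch of a triple-buffered, persistently mapped vertex buffer.
// Assumes a GL 4.4 context and a loader header (glad, GLEW, ...) already included.
#include <cstring>
constexpr GLsizeiptr kRegionSize = 1 << 20;   // arbitrary 1 MiB per region
constexpr int        kRegions    = 3;         // triple buffering
GLuint buf = 0;
void*  mappedPtr = nullptr;
GLsync regionFence[kRegions] = {};
void createPersistentBuffer()
{
    const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    glGenBuffers(1, &buf);
    glBindBuffer(GL_ARRAY_BUFFER, buf);
    glBufferStorage(GL_ARRAY_BUFFER, kRegionSize * kRegions, nullptr, flags);
    // Mapped exactly once and kept mapped; adding GL_MAP_UNSYNCHRONIZED_BIT here
    // would change nothing, since it only affects this single call.
    mappedPtr = glMapBufferRange(GL_ARRAY_BUFFER, 0, kRegionSize * kRegions, flags);
}
void writeRegion(int region, const void* data, GLsizeiptr bytes)
{
    // The synchronization lives here: wait until the GPU has finished the
    // draw call that last read from this region before overwriting it.
    if (regionFence[region]) {
        while (glClientWaitSync(regionFence[region], GL_SYNC_FLUSH_COMMANDS_BIT, 1000000) == GL_TIMEOUT_EXPIRED) {}
        glDeleteSync(regionFence[region]);
        regionFence[region] = nullptr;
    }
    std::memcpy(static_cast<char*>(mappedPtr) + region * kRegionSize, data, bytes);
    // With GL_MAP_COHERENT_BIT the write becomes visible to the GL automatically;
    // without it you would need GL_MAP_FLUSH_EXPLICIT_BIT plus glFlushMappedBufferRange,
    // or glMemoryBarrier(GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT).
}
void afterDrawThatReadsRegion(int region)
{
    regionFence[region] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}
In this sketch the unsynchronized flag plays no role at all; the fences per region are what keep the CPU writes from colliding with GPU reads.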

Related

Distributed Array Access Communication Cost

I'm finishing up implementing a sort of "Terasort Lite" program in Chapel, based on a distributed bucket sort, and I'm noticing what seems to be significant performance bottlenecks around access to a block-distributed array. A rough benchmark shows Chapel takes ~7 seconds to do the whole sort of 3.5MB with 5 locales, while the original MPI+C program does it in around 8.2ms with 5 processes. My local machine has 16 cores, so I don't need to oversubscribe MPI to get 5 processes working.
The data to be sorted are loaded into a block-distributed array so that each locale has an even (and contiguous) share of the unsorted records. In an MPI+C bucket sort, each process would have its records in memory and sort those local records. To that end, I've written a locale-aware implementation of qsort (based on the C stdlib implementation), and this is where I see extreme performance bottlenecks. The overall bucket sort procedure takes a reference to the block-distributed array, and qsort is called on the local subdomain, qsort(records[records.localSubdomain()]), from within the coforall block's on loc do clause.
My main question is how Chapel maintains coherence on distributed arrays, and whether any type of coherence actions across locales are what's obliterating my performance. I've checked, and each locale's qsort call is only ever accessing array indices within its local subdomain; I would expect that this means that no communication is required, since each locale accesses only the portion of the domain that it owns. Is this a reasonable expectation, or is this simultaneous access to private portions of a distributed array causing communication overhead?
For context, I am running locally on one physical machine using the UDP GASNET communication substrate, and the documentation notes that this is not expected to give good performance. Nonetheless, I do plan to move the code to a cluster with InfiniBand, so I'd still like to know if I should approach the problem a different way to get better performance. Please let me know if there's any other information that would help this question be answered. Thank you!
Thanks for your question. I can answer some of the questions here.
First, I'd like to point out some other distributed sort implementations in Chapel:
The distributed sort in Arkouda
distributedPartitioningSortWithScratchSpace: I have not worked on this in a while, but if I recall correctly it is a distributed sample sort backed by a not-in-place radix sort. I think it is tested, so it should run.
I have other partial work trying to port over ips4o, but that algorithm is pretty complicated; I didn't get all the way through, and it's not running yet.
Generally I would expect a radix sort to outperform a quick sort for the local problems unless they are very small.
Now to your questions:
My main question is how Chapel maintains coherence on distributed arrays, and whether any type of coherence actions across locales are what's obliterating my performance.
There is a cache for remote data that is on by default. It can be disabled with --no-cache-remote when compiling, but I suspect it is not the problem here. In particular, it mainly does its coherence activities on some sort of memory fence (which includes on statements, task end, and use of sync/atomic variables). But you can turn it off and see if that changes things.
Distributed arrays and domains currently use an eager privatization strategy. That means that once they are created, some elements of the data structure are replicated across all locales. Since this involves all locales, it can cause performance problems when running multi-locale.
You can check for communication within your kernel with the CommDiagnostics module or with the local block. The CommDiagnostics module will allow you to count or trace communication events while the local block will halt your program if communication is attempted within it.
Another possibility is that the compiler is not generating communication but it is running slower because it has trouble optimizing when the data might be remote. The indicator that this is the problem would be that the performance you get when compiling with CHPL_COMM=none is significantly faster than when running with 1 locale with gasnet and UDP. (You could alternatively use --local and --no-local flags to compare). Some ways to potentially help that:
instead of records[records.localSubdomain()], you could try records.localSlice(records.localSubdomain()) but that uses an undocumented feature. I do not know why it is undocumented, though.
using a local block within your computation should solve this as well but note that we generally try to solve the problem in other ways since the local block is a big hammer.
Slicing has more overhead than we would like. See e.g. https://github.com/chapel-lang/chapel/issues/13317 . As I said, there might also be privatization costs (I don't remember what the current situation of slicing and privatization is, off-hand). In my experience, for local sorting code, you are better off passing a start and end argument as ints, or maybe a range; but slicing to get the local part of the array is certainly more important in the distributed setting.
Lastly, you mentioned that you're trying to analyze the performance when running oversubscribed. If you haven't already seen it, check out this documentation about oversubscription that recommends a setting.

Thread-safe read-only alternative to vtkUnstructuredGrid->GetPoint()

I have started working on multithreading and point cloud processing. The problem is that I have to add multithreading to an existing implementation, and there are so many read and write operations that using a mutex does not give me enough speed-up, due to the sheer number of read operations from the grid.
In the end I modified the code so that I have one vtkSmartPointer<vtkUnstructuredGrid> which holds my point cloud. The only operation the threads have to do is access points using the GetPoint method. However, that is not thread-safe even for read-only operations, due to the smart pointers.
Because of that, I had to copy my main point cloud for each thread, which in the end causes memory issues if I have too many threads and big clouds.
I tried to cut the point cloud into chunks, but that gets too complicated again when I have many threads; I cannot guarantee an optimal number of points to process for each thread. Also, I do a neighbour search for each point, so cutting the point cloud into chunks gets even more complicated, because I need overlaps between chunks in order to get a proper neighbourhood search.
Since vtkUnstructuredGrid is memory-optimized, I could not replace it with STL containers. I would be happy if you can recommend data structures for point cloud processing that are thread-safe to read, or any other solution I could use.
Thanks in advance
I am not familiar with VTK or how it works.
In general, there are various techniques and methods to improve performance in a multithreading environment. The question is vague, so I can only provide a general, vague answer.
Easy: in case there are many reads and few writes, use std::shared_mutex, as it allows multiple readers simultaneously (see the sketch after these options).
Moderate: if the threads work with distinct data most of the time (they access the same data array, but at distinct locations), then you can implement a handler that ensures the threads concurrently work over distinct pieces of data without intersections; if a thread asks to work on a piece of data that is currently being processed, tell it to work on something else or wait.
Hard: there are methods that allow efficient concurrency via std::atomic by utilizing various memory instructions. I am not too familiar with them, and they are definitely not simple, but you can find tutorials on the internet. As far as I know, certain parts of such methods are still in research and development, and best practices haven't yet been established.
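For the easy option, here is a minimal self-contained sketch of the std::shared_mutex idea; the PointCloud type and its accessors are placeholders standing in for whatever actually holds the data.
// Many concurrent readers, exclusive writers; PointCloud is a placeholder type.
#include <array>
#include <cstddef>
#include <shared_mutex>
#include <vector>
struct PointCloud {
    std::vector<std::array<double, 3>> pts;
    std::array<double, 3> getPoint(std::size_t id) const { return pts[id]; }
    void setPoint(std::size_t id, const std::array<double, 3>& p) { pts[id] = p; }
};
class SharedPointCloud {
public:
    std::array<double, 3> read(std::size_t id) const {
        std::shared_lock lock(mutex_);   // many readers may hold this simultaneously
        return cloud_.getPoint(id);
    }
    void write(std::size_t id, const std::array<double, 3>& p) {
        std::unique_lock lock(mutex_);   // writers get exclusive access
        cloud_.setPoint(id, p);
    }
private:
    mutable std::shared_mutex mutex_;
    PointCloud cloud_;
};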
P.S. If there are many reads/writes over the same data... is the implementation even aware of the fact that the data is shared over several threads? Does it even perform correctly? You might end up needing to rewrite the whole implementation.
I just thought I'd post the solution, because it was actually my own stupidity. I realized that in one part of my code I was using the double* vtkDataSet::GetPoint(vtkIdType ptId) version of GetPoint(), which is not thread-safe.
For multithreaded code, void vtkDataSet::GetPoint(vtkIdType id, double x[3]) should be used instead.
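As a rough illustration of that fix, the sketch below splits the point ids across a few reader threads and uses the two-argument overload with a per-thread output buffer; it assumes nothing modifies the grid while the readers run, and the function names are made up.
// Read-only traversal of a vtkUnstructuredGrid from several threads using the
// two-argument GetPoint overload (per-thread scratch buffer, no shared writes).
#include <thread>
#include <vector>
#include <vtkSmartPointer.h>
#include <vtkUnstructuredGrid.h>
void processRange(vtkUnstructuredGrid* grid, vtkIdType begin, vtkIdType end)
{
    double x[3];                 // per-thread scratch, nothing shared between threads
    for (vtkIdType id = begin; id < end; ++id) {
        grid->GetPoint(id, x);   // fills x instead of returning an internal pointer
        // ... process x[0], x[1], x[2] ...
    }
}
void processInParallel(vtkSmartPointer<vtkUnstructuredGrid> grid, unsigned numThreads)
{
    const vtkIdType n = grid->GetNumberOfPoints();
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        const vtkIdType begin = n * t / numThreads;
        const vtkIdType end   = n * (t + 1) / numThreads;
        workers.emplace_back(processRange, grid.GetPointer(), begin, end);
    }
    for (auto& w : workers) w.join();
}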

How should SetDescriptorHeaps be used?

I'm experimenting a bit with the new features in DirectX12. So far I really like some of the changes, for example, the pipeline states. At the same time, some other changes are a bit confusing, for example, the descriptor heaps.
Let's start with a quick background so you better understand what I'm asking for.
In DirectX11, we created objects of different shaders and then we had to bind each of them separately during actual runtime when setting up our draw call. Here's a pseudo-example:
deviceContext->VSSetShader(...);
deviceContext->HSSetShader(...);
deviceContext->DSSetShader(...);
deviceContext->PSSetShader(...);
In DirectX12, they've implemented this so much smarter, because now we can configure the pipeline state during initialization instead, and then set all of the above with a single API call:
commandList->SetPipelineState(...);
Very simple, elegant and quicker. And on top of that, very logical. Now let's take a look at the descriptor heaps instead. I kind of expected this to follow the same elegant pattern, and this is basically what my question is about.
In DirectX11, we created objects of different descriptors (views) and then we had to bind each of them separately for each shader during actual runtime when setting up our draw call. Once again, a pseudo-example:
deviceContext->PSSetConstantBuffers(0, n, ...);
deviceContext->PSSetShaderResources(0, n, ...);
deviceContext->PSSetSamplers(0, n, ...);
In DirectX12, they've implemented something called descriptor heaps. Basically they're chunks of memory that contain all of the descriptors that we want to bind, and we can also set it up during initialization. So far, it looks equally elegant as the pipeline state, since we can set everything with a single API call:
commandList->SetDescriptorHeaps(n, ...);
Or can we? This is where the confusion arises, because after a search I found this question that states:
Swapping descriptor heaps is a costly operation you want to avoid at all cost.
Meanwhile, the MSDN documentation for SetDescriptorHeaps doesn't say anything about this method being particularly expensive.
Considering how elegantly they've designed the pipeline state, I was kind of expecting to be able to do this:
commandList->SetPipelineState(...);
commandList->SetDescriptorHeaps(n, ...);
commandList->DrawInstanced(...);
commandList->SetPipelineState(...);
commandList->SetDescriptorHeaps(n, ...);
commandList->DrawInstanced(...);
commandList->SetPipelineState(...);
commandList->SetDescriptorHeaps(n, ...);
commandList->DrawInstanced(...);
But if SetDescriptorHeaps is actually that expensive, this will probably result in very bad performance. Or will it? As said, I can't find any statement on MSDN that this is actually a bad idea.
So my questions are:
If the above is considered bad practice, how should SetDescriptorHeaps be used?
If this is an Nvidia-only performance problem, how come they don't fix their drivers?
Basically, what I want to do is have two descriptor heaps (CBV/SRV/UAV + sampler) for each pipeline state. And judging from how cheap it is to change the pipeline state, it would be logical for changing the descriptor heap to be equally cheap. The pipeline state and the descriptor heap are quite closely related, i.e. changing the pipeline state will most likely require a different set of descriptors.
I'm aware of the strategy of using one massive descriptor heap for each type of descriptor. But that approach feels overly complicated considering all the work required to keep track of each individual descriptor's index. And on top of that, the descriptors in a descriptor table need to be contiguous in the heap.
Descriptor heaps are independent of pipelines; they don't have to be bound per draw/dispatch. You can also just have one big descriptor heap and bind that instead. The root signature then takes care of addressing, by pointing at the correct offset within this descriptor heap. This means you could have all your unique textures in one heap and point your root signature at the correct descriptors. You could also suballocate your current heaps within one giant heap. A rough sketch of this pattern follows.
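The sketch assumes a root signature whose parameter 0 is a descriptor table; the DrawItem struct, the heap contents, and the draw arguments are invented for illustration.
// Bind the heaps once per command list; per draw, only the pipeline state and
// the root descriptor table (an offset into the big heap) change.
#include <d3d12.h>
#include <vector>
struct DrawItem {
    ID3D12PipelineState* pso;
    UINT                 firstDescriptor;  // index of this draw's first descriptor in the heap
};
void recordDraws(ID3D12GraphicsCommandList* cmdList,
                 ID3D12Device* device,
                 ID3D12DescriptorHeap* cbvSrvUavHeap,
                 ID3D12DescriptorHeap* samplerHeap,
                 const std::vector<DrawItem>& items)
{
    ID3D12DescriptorHeap* heaps[] = { cbvSrvUavHeap, samplerHeap };
    cmdList->SetDescriptorHeaps(2, heaps);               // once, not per draw
    const UINT increment =
        device->GetDescriptorHandleIncrementSize(D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
    const D3D12_GPU_DESCRIPTOR_HANDLE base = cbvSrvUavHeap->GetGPUDescriptorHandleForHeapStart();
    for (const DrawItem& item : items) {
        cmdList->SetPipelineState(item.pso);
        D3D12_GPU_DESCRIPTOR_HANDLE table = base;
        table.ptr += static_cast<UINT64>(item.firstDescriptor) * increment;
        cmdList->SetGraphicsRootDescriptorTable(0, table); // root parameter 0 = the table
        cmdList->DrawInstanced(3, 1, 0, 0);                // placeholder draw
    }
}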
MSDN documentation has now addressed the performance hit on switching heaps:
On some hardware, this can be an expensive operation, requiring a GPU stall to flush all work that depends on the currently bound descriptor heap.
Source: Descriptor Heaps Overview - Switching Heaps
The reason this may happen is that, for some hardware, switching between hardware descriptor heaps during execution requires a GPU wait for idle (to ensure that GPU references to the previously bound descriptor heap are finished).
To avoid being impacted by this possible wait for idle on the descriptor heap switch, applications can take advantage of breaks in rendering that would cause the GPU to idle for other reasons as the time to do descriptor heap switches, since a wait for idle is happening anyway.
Source: Shader Visible Descriptor Heaps - Overview

Write-only `glMapBuffer`, what if I don't write it all?

Say I've got a buffer object with some data in it.
I use glMapBuffer with GL_WRITE_ONLY and write to every second byte (think interleaved vertex attributes).
Then I glUnmapBuffer the buffer.
Are the bytes I didn't write to preserved or are they now undefined?
I'm wondering because the main purpose of GL_WRITE_ONLY seems to be to avoid transferring the previous content of the buffer from the card's memory to main memory. The driver, however, has no way of knowing to which bytes I've actually written something in order to update the buffer only partially.
So either the driver transfers the content to main memory first, rendering GL_WRITE_ONLY pointless on pretty much every platform I could think of. Or it is assumed that I write the complete mapped area. Yet no such obligation is mentioned in the man pages.
Short answer: The data is preserved.
I'm wondering because the main purpose of GL_WRITE_ONLY seems to be to avoid transferring the previous content of the buffer from the card's memory to main memory.
Well, the implementation has many potential ways to fulfill that request, and the access flags may help in the decision of which path to take. For example, the driver may decide to do some direct I/O mapping of the buffer in VRAM instead of using system RAM for the mapping.
The issues you see with this are actually addressed by the more modern glMapBufferRange() API introduced in the GL_ARB_map_buffer_range extension. Although the name might suggest that this is only for mapping parts of buffers, it actually supersedes the glMapBuffer() function completely and allows much finer control. For example, the GL_MAP_INVALIDATE_RANGE_BIT or GL_MAP_INVALIDATE_BUFFER_BIT flags mark the data as invalid and enable the optimizations you had in mind for the general GL_WRITE_ONLY case. But without these, the data is to be preserved, and how this is done is the implementation's problem.
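As a small sketch of the difference (buffer id, offsets and sizes are hypothetical): mapping with only GL_MAP_WRITE_BIT keeps the untouched bytes, while adding an invalidate flag tells the driver it may discard them.
// Partial update that preserves unwritten bytes vs. one that discards them.
#include <cstring>
void updatePreservingRest(GLuint vbo, GLintptr offset, GLsizeiptr length, const void* src)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // No invalidate bit: any byte in the range you do not write keeps its old value.
    void* dst = glMapBufferRange(GL_ARRAY_BUFFER, offset, length, GL_MAP_WRITE_BIT);
    std::memcpy(dst, src, length);
    glUnmapBuffer(GL_ARRAY_BUFFER);
}
void overwriteRange(GLuint vbo, GLintptr offset, GLsizeiptr length, const void* src)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // GL_MAP_INVALIDATE_RANGE_BIT: the old contents of the range may be thrown away,
    // so the driver can skip any read-back; bytes left unwritten become undefined.
    void* dst = glMapBufferRange(GL_ARRAY_BUFFER, offset, length,
                                 GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT);
    std::memcpy(dst, src, length);
    glUnmapBuffer(GL_ARRAY_BUFFER);
}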

Opengl Unsynchronized/Non-blocking Map

I just found the following OpenGL specification for ARB_map_buffer_range.
I'm wondering if it is possible to do non-blocking map calls using this extension?
Currently in my application I'm rendering to an FBO, which I then map to a host PBO buffer.
glMapBuffer(target_, GL_READ_ONLY);
However, the problem with this is that it blocks the rendering thread while transferring the data.
I could reduce this issue by pipelining the rendering, but latency is a big issue in my application.
My question is whether I can use map_buffer_range with MAP_UNSYNCHRONIZED_BIT and wait for the map operation to finish on another thread, or defer the map operation on the same thread, while the rendering thread renders the next frame.
e.g.
thread 1:
map();
render_next_frame();
thread 2:
wait_for_map
or
thread 1:
map();
while(!is_map_ready())
do_some_rendering_for_next_frame();
What I'm unsure of is how I know when the map operation is ready; the specification only mentions "other synchronization techniques to ensure correct operation".
Any ideas?
If you map a buffer with GL_MAP_UNSYNCHRONIZED_BIT, the driver will not wait until OpenGL is done with that memory before mapping it for you. So you will get more or less immediate access to it.
The problem is that this does not mean that you can just read/write that memory willy-nilly. If OpenGL is reading from or writing to that buffer and you change it... welcome to undefined behavior. Which can include crashing.
Therefore, in order to actually use unsynchronized mapping, you must synchronize your behavior to OpenGL's access of that buffer. This will involve the use of ARB_sync objects (or NV_fence if you're only on NVIDIA and haven't updated your drivers recently).
That being said, if you're using a fence object to synchronize access to the buffer, then you really don't need GL_MAP_UNSYNCHRONIZED_BIT at all. Once you finish the fence, or detect that it has completed, you can map the buffer normally and it should complete immediately (unless some other operation is reading/writing too).
In general, unsynchronized access is best used for when you need fine-grained write access to the buffer. In this case, good use of sync objects will get you what you really need (the ability to tell when the map operation is finished).
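As a rough sketch of that fence-based approach for the asker's FBO-to-PBO read-back (names and parameters are invented): drop a fence right behind the transfer and only map once it has signaled, so the map call itself no longer blocks.
// Asynchronous read-back: glReadPixels into a PBO, then poll a fence before mapping.
void startReadback(GLuint pbo, int width, int height, GLsync* fence)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr); // async copy into the PBO
    *fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush();                          // make sure the commands are actually submitted
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}
// Returns the mapped pointer once the transfer is done, or nullptr if it is still
// in flight, so the caller can keep rendering the next frame in the meantime.
void* tryMapReadback(GLuint pbo, GLsizeiptr length, GLsync* fence)
{
    GLint status = GL_UNSIGNALED;
    glGetSynciv(*fence, GL_SYNC_STATUS, 1, nullptr, &status);
    if (status != GL_SIGNALED)
        return nullptr;                 // not ready yet: do other work
    glDeleteSync(*fence);
    *fence = nullptr;
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    return glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, length, GL_MAP_READ_BIT);
}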
Addendum: The above is now outdated (depending on your hardware). Thanks to OpenGL 4.4/ARB_buffer_storage, you can now not only map unsynchronized, but also keep a buffer mapped indefinitely. Yes, you can have a buffer mapped while it is in use.
This is done by creating immutable storage and providing that storage with (among other things) the GL_MAP_PERSISTENT_BIT. Then you glMapBufferRange, also providing the same bit.
Now technically, that changes pretty much nothing. You still need to synchronize your actions with OpenGL. If you write stuff to a region of the buffer, you'll need to either issue a barrier or flush that region of the buffer explicitly. And if you're reading, you still need to use a fence sync object to make sure that the data is actually there before reading it (and unless you use GL_MAP_COHERENT_BIT too, you'll need to issue a barrier before reading).
In general, it is not possible to do a "nonblocking map", but you can map without blocking.
The reason why there can be no "nonblocking map" is that the moment the function call returns, you could access the data, so the driver must make sure it is there, positively. If the data has not been transferred, what else can the driver do but block.
Threads don't make this any better, and possibly make it worse (adding synchronisation and context sharing issues). Threads cannot magically remove the need to transfer data.
And this leads to how to not block on mapping: only map when you are sure that the transfer is finished. One safe way to do this is to map the buffer after swapping buffers, after glFinish, or after waiting on a query/fence object. Using a fence is the preferable way if you can't wait until buffers have been swapped. A fence won't stall the pipeline, but will tell you whether or not your transfer is done (glFinish may or may not stall, but probably will).
Reading after swapping buffers is also 100% safe, but may not be acceptable if you need the data within the same frame (works perfectly for screenshots or for calculating a histogram for tonemapping, though).
A less safe way is to insert "some other stuff" and hope that in the mean time the transfer has completed.
In respect of the comment below:
This answer is not incorrect. It isn't possible to do any better than to access the data after it's available (this should be obvious), which means that you must sync/block, one way or the other; there is no choice.
Although, from a very pedantic point of view, you can of course use GL_MAP_UNSYNCHRONIZED_BIT to get a non-blocking map operation, this is entirely irrelevant, as it does not work unless you explicitly reproduce the implicit sync as described above. A mapping that you can't safely access is good for nothing.
Mapping and accessing a buffer that OpenGL is transferring data to without synchronizing/blocking (implicitly or explicitly) means "undefined behavior", which is only a nicer wording for "probably garbage results, maybe crash".
If, on the other hand, you explicitly synchronize (say, with a fence as described above), then it's irrelevant whether or not you use the unsynchronized flag, since no more implicit sync needs to happen anyway.