DirectX12 Upload Synchronization D3D12_HEAP_TYPE_UPLOAD - directx-12

I want to ensure that my D3D12_HEAP_TYPE_UPLOAD resource has been uploaded before I use it.
Apparently, to do this you call ID3D12Resource::Unmap, ID3D12GraphicsCommandList::Close, ID3D12CommandQueue::ExecuteCommandLists and then ID3D12CommandQueue::Signal.
However, this confuses me. The call to ID3D12Resource::Unmap is completely unconnected to the command list and queue, except through the device the resource was created on. But I have multiple command queues per device. So how does it choose which command queue to upload the resource on?
Is this documented anywhere? The only help I can find are comments in the samples.

Once you have copied your data to a mapped pointer, it is immediately available to be consumed by commands; for upload resources there is no need to Unmap at all (you can unmap on Release or at application shutdown).
However, it is important to note (especially reading your comments) that the commands will be executed later on the GPU, so if you plan to reuse that memory you need some synchronization mechanism.
Let's make a simple pseudocode example:
You have a buffer called buffer1 (that you have already created and mapped), and you now have access to its memory via mappedPtr1.
copy data1 to mappedPtr1
call compute shader in commandList
execute CommandList
Now everything will execute properly (for one frame, assuming you have synchronization).
Now if you do the following:
copy data1 to mappedPtr1
call compute shader in commandList (1)
copy data2 to mappedPtr1
call compute shader in commandList (1)
execute CommandList
In that case, since you copied data2 to the same place as data1,
the first compute shader call will also use data2 (as it is the latest available data when you call ExecuteCommandList).
Now let's take a slightly different example:
copy data1 to mappedPtr1
call compute shader in commandList1
execute CommandList1
copy data2 to mappedPtr1
call compute shader in commandList2
execute CommandList2
What will now happen is undefined, since you do not know when CommandList1 and CommandList2 will be effectively processed.
In case CommandList1 is processed (fast enough) before
copy data2 to mappedPtr1
then data1 will be the current memory contents and will be used.
However, if your command list is a bit heavier and CommandList1 has not yet been processed by the time you finish your call to
copy data2 to mappedPtr1
which is likely to happen, then both compute calls will again use data2 when they execute on the GPU.
This is because ExecuteCommandList is a non-blocking function: when it returns, it only means that your commands have been submitted for execution, not that they have been processed.
In order to guarantee that you use the correct data at the correct time, you have several options in that case:
1/Use a fence and wait for completion
copy data1 to mappedPtr1
call compute shader in commandList1
execute CommandList1 on commandQueue
attachSignal (1) to commandQueue
add a waitevent for value (1)
copy data2 to mappedPtr1
call compute shader in commandList2
execute CommandList2 on commandQueue
attachSignal (2) to commandQueue
add a waitevent for value (2)
This is simple but vastly inefficient, since you now wait for the GPU to finish executing the command list before continuing any CPU work.
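A rough C++ sketch of this option (names such as fence, fenceEvent, fenceValue, dataSize and the command list/queue variables are illustrative; the fence is assumed to have been created with ID3D12Device::CreateFence and the event with CreateEvent):
memcpy(mappedPtr1, data1, dataSize);                 // copy data1 to mappedPtr1
commandList1->Close();
ID3D12CommandList* lists[] = { commandList1 };
commandQueue->ExecuteCommandLists(1, lists);         // execute CommandList1 on commandQueue
commandQueue->Signal(fence, ++fenceValue);           // attach signal (1) to commandQueue

if (fence->GetCompletedValue() < fenceValue)         // wait for value (1) on the CPU
{
    fence->SetEventOnCompletion(fenceValue, fenceEvent);
    WaitForSingleObject(fenceEvent, INFINITE);
}

memcpy(mappedPtr1, data2, dataSize);                 // only now is it safe to overwrite the memory
// ... record and execute CommandList2, signal (2) and wait for (2) in the same way ...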
2/Use different resources:
Since you now copy to two different locations, you of course guarantee that the data is different across both calls.
3/Use a single resource with offsets.
You can also create a larger resource that can hold the data for all your calls, and copy each call's data to a different offset.
I'll assume your data is 64 bytes here (so you would create a 128-byte buffer):
copy data1 to mappedPtr1 (offset 0)
bind address from mappedPtr1 (offset 0) to compute
call compute shader in commandList1
execute CommandList1 on commandQueue
copy data2 to mappedPtr1 (offset 64)
bind address from mappedPtr1 (offset 64) to compute
call compute shader in commandList2
execute CommandList2 on commandQueue
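A minimal sketch of this scheme (uploadBuffer, rootParamIndex, groupCountX, data1 and data2 are illustrative; the buffer is assumed to be read through a root SRV, e.g. a StructuredBuffer, so the 64-byte offsets are legal, whereas a root CBV address would need 256-byte alignment):
uint8_t* mappedBytes = nullptr;
uploadBuffer->Map(0, nullptr, reinterpret_cast<void**>(&mappedBytes));
D3D12_GPU_VIRTUAL_ADDRESS baseAddr = uploadBuffer->GetGPUVirtualAddress();

memcpy(mappedBytes + 0, &data1, 64);                                    // offset 0
commandList1->SetComputeRootShaderResourceView(rootParamIndex, baseAddr + 0);
commandList1->Dispatch(groupCountX, 1, 1);
// close and execute CommandList1 on the command queue...

memcpy(mappedBytes + 64, &data2, 64);                                   // offset 64
commandList2->SetComputeRootShaderResourceView(rootParamIndex, baseAddr + 64);
commandList2->Dispatch(groupCountX, 1, 1);
// close and execute CommandList2 on the command queue...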
Please note that you should still have fences to indicate when a frame has finished processing; this is the only way to guarantee that the upload memory can finally be reused.
If you want to copy the data to a default heap (especially if you do it on a separate copy queue), you will also need a fence on the copy queue and a wait on the main queue to ensure the copy queue has finished processing and that the data is available (in that case you also need, as per the other answer, to set up resource barriers on the default-heap resource).
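A short sketch of that cross-queue synchronization (copyQueue, directQueue, copyFence, copyFenceValue and the resources are illustrative):
copyCommandList->CopyBufferRegion(defaultBuffer, 0, uploadBuffer, 0, dataSize);
copyCommandList->Close();
ID3D12CommandList* copyLists[] = { copyCommandList };
copyQueue->ExecuteCommandLists(1, copyLists);
copyQueue->Signal(copyFence, ++copyFenceValue);      // copy queue signals when the copy is done

// GPU-side wait: work submitted to the direct queue after this call will not start
// until the copy queue has reached copyFenceValue; the CPU is not blocked.
directQueue->Wait(copyFence, copyFenceValue);
ID3D12CommandList* drawLists[] = { drawCommandList }; // assumed to contain any needed resource
directQueue->ExecuteCommandLists(1, drawLists);       // barriers on defaultBuffer before reading it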
Hope it makes sense.

Per Microsoft Docs, all that Map and Unmap do is deal with the virtual memory address mapping on the CPU. You can safely leave a resource mapped (i.e. keep it mapped into virtual memory) for a long time, unlike in Direct3D 11 where you had to Unmap it.
Almost all the samples use the UpdateSubresources helper in the D3DX12.H utility header. There are a few overloads of this, but they all do the same basic thing:
Create/Map an 'intermediate' resource (i.e. something on an upload heap).
Take data from the CPU and copy it into the 'intermediate' resource (unmapping it when complete since there's no need to keep the virtual memory address assignment around).
Then call CopyBufferRegion or CopyTextureRegion on a command-list (which can be a graphics queue command-list, a copy queue command-list, or a compute-queue command-list).
You can post as many of these into a command-list as you want, but the 'intermediate' resource must remain valid until it completes.
As with most things in Direct3D 12, you do this with a fence. When that fence is complete, you know you can release the 'intermediate' resources. Also, none of the copies will actually start until after you close and submit the command-list for execution.
You also need to transition the final resource from a copy state to a state you can use for rendering. Typically you post these on the same command-list, although there are limitations if you are using copy-queue or compute-queue command-lists.
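A condensed sketch of that pattern for a single-subresource texture (texture, uploadHeap, pixels, width and height are illustrative; error handling omitted):
D3D12_SUBRESOURCE_DATA initData = {};
initData.pData = pixels;                          // CPU-side source data
initData.RowPitch = width * 4;                    // bytes per row (RGBA8 assumed)
initData.SlicePitch = initData.RowPitch * height;

UpdateSubresources(commandList, texture, uploadHeap, 0, 0, 1, &initData);

// Transition the final resource from the copy state to a readable state.
CD3DX12_RESOURCE_BARRIER barrier = CD3DX12_RESOURCE_BARRIER::Transition(
    texture, D3D12_RESOURCE_STATE_COPY_DEST, D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE);
commandList->ResourceBarrier(1, &barrier);

// After closing and executing the command list, signal a fence; uploadHeap must stay
// alive until that fence value has been reached on the GPU.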
For a full implementation of this, see DirectX Tool Kit for DX12
Note that it is possible to render a texture or use vertex/index buffers directly from the upload heap. It's not as efficient as copying it into a default heap, but is akin to the Direct3D 11 USAGE_DYNAMIC. In this case, it would make sense to keep the upload heap "mapped" and re-use the same address once you know it's no longer in use. Otherwise, corruption or other bad things can happen.

According to this article from NVIDIA, an upload buffer is not copied until the GPU needs it: right before a draw (or copy) call is executed, any upload buffers used by the call are uploaded to GPU RAM.
This means three things:
It is rather simple to know when you can execute the draw call. Just ensure that the memcpy call has returned before executing the command list.
It is a bit more complicated to know when the draw call has uploaded the buffer, i.e. when you can change the buffer for the next frame. Here a fence is needed to get that info back from the GPU.
Since the upload is done for every draw call, only use an upload buffer if the data changes between every draw call. Otherwise, optimize the rendering process by copying the upload buffer into a buffer in a default (GPU-local) heap.

Just summarizing the mental model:
D3D12_HEAP_TYPE_UPLOAD and D3D12_HEAP_TYPE_READBACK resources have no (stateful) GPU backing memory, only CPU memory. The upload/readback happens every time they are used, usually via CopyResource/CopyBufferRegion/CopyTextureRegion, and (in the upload case) whatever state the mapped CPU memory is in when that operation occurs is what you get on the GPU.
The upload and copy are simultaneous and a new upload occurs for each copy.
However, as GPU operations are asynchronous, you have to use synchronization primitives to ensure that the mapped CPU memory is in the right state when the GPU upload-copy operation occurs.
In my case, this means making sure I don't overwrite the current data with future data before the GPU upload-copy operation completes.
The typical usage pattern is to have a ring buffer of D3D12_HEAP_TYPE_UPLOAD resources. For each iteration of the render loop, the next resource in the ring buffer gets copied into the same D3D12_HEAP_TYPE_DEFAULT resource. Edit: this is unsafe when buffering frames, and I believe it was the original bug I had. #mrvux described a very real problem, just not the one I was having.
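For reference, a minimal sketch of that ring-buffer pattern with per-frame fences (kFrameCount, slotFenceValue, uploadMappedPtr and the other names are illustrative):
const UINT kFrameCount = 3;                       // number of in-flight frames
UINT slot = frameIndex % kFrameCount;

// Block (rarely) if the GPU is still consuming this slot's previous contents.
if (fence->GetCompletedValue() < slotFenceValue[slot])
{
    fence->SetEventOnCompletion(slotFenceValue[slot], fenceEvent);
    WaitForSingleObject(fenceEvent, INFINITE);
}

memcpy(uploadMappedPtr[slot], frameData, frameDataSize);
commandList->CopyBufferRegion(defaultBuffer, 0, uploadBuffer[slot], 0, frameDataSize);
// ... record the rest of the frame, close and execute the command list ...

commandQueue->Signal(fence, ++nextFenceValue);
slotFenceValue[slot] = nextFenceValue;            // remember when this slot becomes reusable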

Related

concurrent data transfer cuda kernel and host [duplicate]

I have some questions.
Recently I have been writing a program using CUDA.
In my program, there is one large data set on the host, stored in a std::map<string, vector<int>>.
Using this data, some vector<int>s are copied to the GPU's global memory and processed on the GPU.
After processing, some results are generated on the GPU and these results are copied back to the CPU.
This is my program's schedule:
cudaMemcpy( ... , cudaMemcpyHostToDevice)
kernel function (the kernel can only run once the necessary data has been copied to GPU global memory)
cudaMemcpy( ... , cudaMemcpyDeviceToHost)
repeat steps 1~3 1000 times (for the next piece of data (vector))
But I want to reduce the processing time.
So I decided to use the cudaMemcpyAsync function in my program.
After searching some documents and web pages, I realized that to use cudaMemcpyAsync, the host memory holding the data to be copied to the GPU's global memory must be allocated as pinned memory.
But my program uses std::map, so I couldn't make the std::map data itself pinned memory.
So instead, I made a pinned-memory buffer array that is always large enough to hold any vector being copied.
Finally, my program worked like this:
memcpy (copy data from the std::map to the buffer in a loop until all the data is in the buffer)
cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
kernel (the kernel can only execute once all the data has been copied to GPU global memory)
cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
repeat steps 1~4 1000 times (for the next piece of data (vector))
And my program became much faster than the previous case.
But my problem (my curiosity, really) arises at this point.
I tried to make another program in a similar way.
memcpy (copy data from the std::map to the buffer for only one vector)
cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
loop steps 1~2 until all the data has been copied to GPU global memory
kernel (the kernel can only execute once the necessary data has been copied to GPU global memory)
cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
repeat steps 1~5 1000 times (for the next piece of data (vector))
This method came out to be about 10% faster than the method discussed above.
But I don't know why.
I thought cudaMemcpyAsync could only be overlapped with a kernel function.
But in my case, I think it is not. Rather, it looks like cudaMemcpyAsync calls can be overlapped with each other.
Sorry for my long question, but I really want to know why.
Can someone teach or explain to me what exactly cudaMemcpyAsync does, and which functions can be overlapped with it?
The copying activity of cudaMemcpyAsync (as well as kernel activity) can be overlapped with any host code. Furthermore, data copy to and from the device (via cudaMemcpyAsync) can be overlapped with kernel activity. All 3 activities: host activity, data copy activity, and kernel activity, can be done asynchronously to each other, and can overlap each other.
As you have seen and demonstrated, host activity and data copy or kernel activity can be overlapped with each other in a relatively straightforward fashion: kernel launches return immediately to the host, as does cudaMemcpyAsync. However, to get best overlap opportunities between data copy and kernel activity, it's necessary to use some additional concepts. For best overlap opportunities, we need:
Host memory buffers that are pinned, e.g. via cudaHostAlloc()
Usage of cuda streams to separate various types of activity (data copy and kernel computation)
Usage of cudaMemcpyAsync (instead of cudaMemcpy)
Naturally your work also needs to be broken up in a separable way. This normally means that if your kernel is performing a specific function, you may need multiple invocations of this kernel so that each invocation can be working on a separate piece of data. This allows us to copy data block B to the device while the first kernel invocation is working on data block A, for example. In so doing we have the opportunity to overlap the copy of data block B with the kernel processing of data block A.
The main differences with cudaMemcpyAsync (as compared to cudaMemcpy) are that:
It can be issued in any stream (it takes a stream parameter)
Normally, it returns control to the host immediately (just like a kernel call does) rather than waiting for the data copy to be completed.
Item 1 is a necessary feature so that data copy can be overlapped with kernel computation. Item 2 is a necessary feature so that data copy can be overlapped with host activity.
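Here is a minimal, self-contained sketch of the copy/compute overlap pattern described above (the kernel, chunk count and sizes are placeholders, not from the question):
#include <cuda_runtime.h>

// Placeholder kernel: processes one chunk of data in place.
__global__ void myKernel(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

int main() {
    const int nChunks = 4;
    const int chunkElems = 1 << 20;
    const size_t chunkBytes = chunkElems * sizeof(int);

    int *h_buf = nullptr, *d_buf = nullptr;
    cudaHostAlloc((void**)&h_buf, nChunks * chunkBytes, cudaHostAllocDefault); // pinned host memory
    cudaMalloc(&d_buf, nChunks * chunkBytes);

    cudaStream_t streams[nChunks];
    for (int i = 0; i < nChunks; ++i) cudaStreamCreate(&streams[i]);

    for (int i = 0; i < nChunks; ++i) {
        int offset = i * chunkElems;
        // The H2D copy of chunk i can overlap with the kernel working on an earlier chunk.
        cudaMemcpyAsync(d_buf + offset, h_buf + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[i]);
        myKernel<<<(chunkElems + 255) / 256, 256, 0, streams[i]>>>(d_buf + offset, chunkElems);
        cudaMemcpyAsync(h_buf + offset, d_buf + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();   // wait for all streams before using the results

    for (int i = 0; i < nChunks; ++i) cudaStreamDestroy(streams[i]);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}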
Although the concepts of copy/compute overlap are pretty straightforward, in practice the implementation requires some work. For additional references, please refer to:
Overlap copy/compute section of the CUDA best practices guide.
Sample code showing a basic implementation of copy/compute overlap.
Sample code showing a full multi/concurrent kernel copy/compute overlap scenario.
Note that some of the above discussion is predicated on having a device of compute capability 2.0 or greater (e.g. for concurrent kernels). Also, different devices may have one or two copy engines, meaning simultaneous copy to the device and copy from the device is only possible on certain devices.
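These capabilities can be queried at runtime; a small sketch:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("compute capability: %d.%d\n", prop.major, prop.minor);
    std::printf("concurrent kernels: %d\n", prop.concurrentKernels);
    std::printf("copy engines:       %d\n", prop.asyncEngineCount);
    return 0;
}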

Vulkan: `vkFlushMappedMemoryRanges` threading layer error?

If an entire vkDeviceMemory is mapped (via vkMapMemory) and it wasn't allocated with VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, vkFlushMappedMemoryRanges must be made after any modifications to the buffer are made, for the device to see the writes (per the documentation).
I am only modifying small sections of a large buffer, and thus only want to flush the affected regions. So I create multiple VkMappedMemoryRange structures with varying offset and size fields, but pointing to the same VkDeviceMemory. This all seems to work as I expect. However, if I enable VK_LAYER_LUNARG_threading, I get an error:
THREADING ERROR : object of type VkDeviceMemory is recursively used in thread 24812
If I instead just call vkFlushMappedMemoryRanges multiple times with only a single flush range, instead of an array, I don't get an error. Is flushing multiple sub-ranges of the same buffer not a valid use case?
That is a false error report from the layer. A single function call can safely refer to the same Vulkan object several times. Newer versions of the thread-checking layer don't report that false conflict. (That layer is renamed to VK_LAYER_GOOGLE_threading in recent versions.)
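For reference, the flagged pattern looks roughly like this (device, memory, offsets and sizes are illustrative; offsets and sizes must respect VkPhysicalDeviceLimits::nonCoherentAtomSize):
VkMappedMemoryRange ranges[2] = {};
ranges[0].sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
ranges[0].memory = memory;
ranges[0].offset = 0;
ranges[0].size   = 256;
ranges[1].sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
ranges[1].memory = memory;       // same VkDeviceMemory, different region
ranges[1].offset = 4096;
ranges[1].size   = 256;
vkFlushMappedMemoryRanges(device, 2, ranges);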

How do UBOs/SSBOs differ from Vulkan's shader memory bindings?

In the article on Imagination's website, I've read the following paragraph:
For example, there are no glUniform*() equivalent entry points in Vulkan; instead, writing to GPU memory is the only way to pass data to shaders.
When you call glUniform*(), the OpenGL ES driver typically needs to allocate a driver managed buffer and copy data to it, the management of which incurs CPU overhead. In Vulkan, you simply map the memory address and write to that memory location directly.
Is there any difference between that and using Uniform Buffers? They are also allocated explicitly and can carry arbitrary data. Since Uniform Buffers are quite limited in size, perhaps Shader Storage Buffers are a better analogy.
From what I understand, this is not glUniform*() specific: glUniform*() is merely an example used by the author of the article to illustrate the way Vulkan works with regards to communication between the host and the GPU.
When you call glUniform*(), the OpenGL ES driver typically needs to allocate a driver managed buffer and copy data to it, the management of which incurs CPU overhead.
In this scenario, when a user calls glUniform*() with some data, that data is first copied to a buffer owned by the OpenGL implementation. This buffer is probably pinned, and can then be used by the driver to transfer the data through DMA to the device. That's two steps:
Copy user data to driver buffer;
Transfer buffer contents to GPU through DMA.
In Vulkan, you simply map the memory address and write to that memory location directly.
In this scenario, there is no intermediate copy of the user data. You ask Vulkan to map a region into the host's virtual address space, which you directly write to. The data gets to the device through DMA in a completely transparent way for the user.
From a performance standpoint, the benefits are obvious: zero copy. It also means the Vulkan implementation can be simpler, as it does not need to manage an intermediate buffer.
As the specs have not been released yet, here's a fictitious example of what it could look like:
// Assume Lights is some kind of handle to your buffer/data
float4* lights = vkMap(Lights);
for (int i = 0; i < light_count; ++i) {
    // Goes directly to the device
    lights[i] = make_light(/* stuff */);
}
vkUnmap(lights);
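For comparison, with the released API the same idea looks roughly like this (a sketch; Light, makeLight, lightCount and bufferSize are placeholders, and memory is assumed to be a host-visible, host-coherent allocation backing the buffer):
void* ptr = nullptr;
vkMapMemory(device, memory, /*offset*/ 0, /*size*/ bufferSize, /*flags*/ 0, &ptr);
Light* lights = static_cast<Light*>(ptr);
for (uint32_t i = 0; i < lightCount; ++i) {
    lights[i] = makeLight(/* stuff */);   // writes go straight to device-visible memory
}
vkUnmapMemory(device, memory);            // or keep it persistently mapped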

OpenGL Texture and Object Streaming

I have a need to stream a texture (essentially a camera feed).
With object streaming, the following scenarios seem to arise:
Is the new object's data store larger, smaller or same size as the old one?
Subset of or whole texture being updated?
Are we streaming a buffer object or texture object (any difference?)
Here are the following approaches I have come across:
Allocate object data store (either BufferData for buffers or TexImage2D for textures) and then each frame, update subset of data with BufferSubData or TexSubImage2D
Nullify/invalidate the object after the last call (e.g. draw) that uses the object, either with:
Nullify: glTexSubImage2D(..., NULL), glBufferSubData(..., NULL)
Invalidate: glBufferInvalidate(), glMapBufferRange with GL_MAP_INVALIDATE_BUFFER_BIT, glDeleteTextures?
Simply reinvoke BufferData or TexImage2D with the new data.
Manually implement object multi-buffering / buffer ping-ponging.
Most immediately, my problem scenario is: the entire texture being replaced with a new one of the same size. How do I implement this? Will (1) implicitly synchronize? Does (2) avoid the synchronization? Will (3) synchronize, or will a new data store for the object be allocated, so that our update can be uploaded without waiting for all drawing using the old object state to finish? This passage from the Red Book V4.3 makes me believe so:
Data can also be copied between buffer objects using the glCopyBufferSubData() function. Rather than assembling chunks of data in one large buffer object using glBufferSubData(), it is possible to upload the data into separate buffers using glBufferData() and then copy from those buffers into the larger buffer using glCopyBufferSubData(). Depending on the OpenGL implementation, it may be able to overlap these copies because each time you call glBufferData() on a buffer object, it invalidates whatever contents may have been there before. Therefore, OpenGL can sometimes just allocate a whole new data store for your data, even though a copy operation from the previous store has not completed yet. It will then release the old storage at a later opportunity.
But if so, why the need for (2) [nullify/invalidate]?
Also, please discuss the above approaches, and others, and their effectiveness for the various scenarios, while keeping in mind at least the following issues:
Whether implicit synchronization on the object (i.e. synchronizing our update with OpenGL's usage) occurs
Memory usage
Speed
I've read http://www.opengl.org/wiki/Buffer_Object_Streaming but it doesn't offer conclusive information.
Let me try to answer at least a few of the questions you raised.
The scenarios you talk about can have a great impact on the performance of the different approaches, especially considering the first point about the dynamic size of the buffer. In your video streaming scenario, the size will rarely change, so a more expensive "re-configuration" of the data structures you use might be possible. If the size changes every frame or every few frames, this is typically not feasible. However, if a reasonable maximum size limit can be enforced, just using buffers/textures with the maximum size might be a good strategy. With neither buffers nor textures do you have to use all the space there is (although there are some smaller issues when you do this with textures, like wrap modes).
3. Are we streaming a buffer object or texture object (any difference?)
Well, the only way to efficiently stream image data to or from the GL is to use pixel buffer objects (PBOs). So you always have to deal with buffer objects in the first place, no matter whether vertex data, image data or whatever else is to be transferred. The buffer is just the source for some glTex*Image() call in the texture case, and of course you'll need a texture object for that.
Let's come to your approaches:
In approach (1), you use the "Sub" variant of the update commands. In that case, (part of or the whole) storage of the existing object is updated. This is likely to trigger an implicit synchronization if the old data is still in use. The GL has basically only two options: wait for all operations (potentially) depending on that data to complete, or make an intermediate copy of the new data and let the client go on. Both options are bad from a performance point of view.
In approach (2), you have a misconception. The "Sub" variants of the update commands will never invalidate/orphan your buffers. The "non-Sub" glBufferData() will create a completely new storage for the object, and using it with NULL as the data pointer will leave that storage uninitialized. Internally, the GL implementation might re-use some memory which was in use for earlier buffer storage. So if you follow this scheme, there is some probability that you effectively end up using a ring buffer of the same memory areas if you always use the same buffer size.
The other invalidation methods you mentioned also allow you to invalidate only parts of the buffer, and give more fine-grained control over what is happening.
Approach (3) is basically the same as (2) with the glBufferData() orphaning, but you just specify the new data directly at this stage.
Approach (4) is the one I would actually recommend, as it gives the application the most control over what is happening, without having to rely on the GL implementation's specific internal workings.
Without taking synchronization into account, the "Sub" variant of the update commands is more efficient, even if the whole data store is to be changed, not just some part. That is because the "non-Sub" variants of the commands basically recreate the storage and introduce some overhead with that. By manually managing the ring buffers, you can avoid all of that overhead, and you don't have to rely on the GL being clever, by just using the "Sub" variants of the update functions. At the same time, you can avoid implicit synchronization by only updating buffers which aren't in use by the GL any more. This scheme can also nicely be extended to a multi-threaded scenario. You can have one (or several) extra threads with separate (but shared) GL contexts filling the buffers for you, and just pass the buffer handles to the draw thread as soon as an update is complete. You can also just map the buffers in the draw thread and let them be filled by worker threads (without the need for additional GL contexts at all).
OpenGL 4.4 introduced GL_ARB_buffer_storage, and with it the GL_MAP_PERSISTENT_BIT for glMapBufferRange. That allows you to keep all of the buffers mapped while they are used by the GL, so you avoid the overhead of mapping the buffers into the address space again and again. You will then have no implicit synchronization at all, but you have to synchronize the operations manually. OpenGL's synchronization objects (see GL_ARB_sync) might help you with that, but the main burden of synchronization is on your application logic itself. When streaming video to the GL, just avoid re-using the buffer which was the source for the glTexSubImage() call immediately, and try to delay its re-use as long as possible. You are of course also trading throughput for latency. If you need to minimize latency, you might have to tweak this logic a bit.
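A condensed sketch of approach (4) combined with persistent mapping (buffer count, frame size, texture and frame-index bookkeeping are illustrative):
// Triple-buffered pixel unpack buffers, persistently mapped (GL 4.4 / ARB_buffer_storage).
const int kBuffers = 3;
const GLsizeiptr frameBytes = width * height * 4;    // RGBA8 frames assumed

GLuint pbo[kBuffers];
void*  mapped[kBuffers];
GLsync fences[kBuffers] = {};

glGenBuffers(kBuffers, pbo);
for (int i = 0; i < kBuffers; ++i) {
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[i]);
    glBufferStorage(GL_PIXEL_UNPACK_BUFFER, frameBytes, nullptr,
                    GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
    mapped[i] = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, frameBytes,
                                 GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
}

// Per frame:
int slot = frameIndex % kBuffers;
if (fences[slot]) {                                  // wait until the GL is done with this slot
    glClientWaitSync(fences[slot], GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000); // 1 s timeout
    glDeleteSync(fences[slot]);
}
memcpy(mapped[slot], cameraFrame, frameBytes);       // fill the slot with the new video frame

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[slot]);
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, nullptr); // nullptr = read from the bound PBO
fences[slot] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);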
Comparing the approaches for "memory usage" is really hard. There are a lot of implementation-specific details to consider here. A GL implementation might keep some old buffer memory around for some time to fulfill re-creation requests of the same size. Also, a GL implementation might make shadow copies of any data at any time. The approaches which don't orphan and recreate storage all the time in principle expose more control over the memory which is in use.
"Speed" itself is also not a very useful metric. You basically have to balance throughput and latency here, according to the requirements of your application.

Why does barrier synchronize shared memory when memoryBarrier doesn't?

The following GLSL compute shader simply copies inImage to outImage. It is derived from a more complex post-processing pass.
In the first several lines of main(), a single thread loads 64 pixels of data into the shared array. Then, after synchronizing, each of the 64 threads writes one pixel to the output image.
Depending on how I synchronize, I get different results. I originally thought memoryBarrierShared() would be the correct call, but it produces a striped result, which is the same result as having no synchronization or using memoryBarrier() instead.
If I use barrier(), I get the desired (correct) result.
The striping is 32 pixels wide, and if I change the workgroup size to anything less than or equal to 32, I get correct results.
What's going on here? Am I misunderstanding the purpose of memoryBarrierShared()? Why should barrier() work?
#version 430
#define SIZE 64

layout (local_size_x = SIZE, local_size_y = 1, local_size_z = 1) in;
layout (rgba32f) uniform readonly image2D inImage;
uniform writeonly image2D outImage;

shared vec4 shared_data[SIZE];

void main() {
    ivec2 base = ivec2(gl_WorkGroupID.xy * gl_WorkGroupSize.xy);
    ivec2 my_index = base + ivec2(gl_LocalInvocationID.x, 0);

    if (gl_LocalInvocationID.x == 0) {
        for (int i = 0; i < SIZE; i++) {
            shared_data[i] = imageLoad(inImage, base + ivec2(i, 0));
        }
    }

    // with no synchronization: stripes
    // memoryBarrier();       // stripes
    // memoryBarrierShared(); // stripes
    // barrier();             // works

    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);
}
The problem with image load/store and friends is that the implementation cannot be sure anymore that a shader only changes the data of its dedicated output values (e.g. the framebuffer after a fragment shader). This applies even more to compute shaders, which don't have a dedicated output but only output things by writing data into writable stores, like images, storage buffers or atomic counters. This may require manual synchronization between individual passes, as otherwise a fragment shader trying to access a texture might not see the most recent data written into that texture with image store operations by a preceding pass, like your compute shader.
So it may be that your compute shader works perfectly, but it is the synchronization with the following display (or whatever) pass (that needs to read this image data somehow) that fails. For this purpose there exists the glMemoryBarrier function. Depending on how you read that image data in the display pass (or more precisely the pass that reads the image after the compute shader pass), you need to give a different flag to this function. If you read it using a texture, use GL_TEXTURE_FETCH_BARRIER_BIT; if you use an image load again, use GL_SHADER_IMAGE_ACCESS_BARRIER_BIT; if using glBlitFramebuffer for display, use GL_FRAMEBUFFER_BARRIER_BIT...
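In host code, that pass boundary would then look roughly like this (a sketch; the program names and dispatch dimensions are illustrative, and it assumes the result is sampled as a texture afterwards):
glUseProgram(computeProgram);
glDispatchCompute(imageWidth / 64, imageHeight, 1);

// Make the image stores from the compute pass visible to subsequent texture fetches.
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);

glUseProgram(displayProgram);
// ... draw the fullscreen quad that samples outImage as a regular texture ...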
Though I don't have much experience with image load/store and manual memory synchronization, this is only what I came up with theoretically. So if anyone knows better, or you already use a proper glMemoryBarrier, feel free to correct me. Likewise, this need not be your only error (if any). But the last two points from the linked Wiki article actually address your use case and IMHO make it clear that you need some kind of glMemoryBarrier:
Data written to image variables in one rendering pass and read by the shader in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the SHADER_IMAGE_ACCESS_BARRIER_BIT set in barriers between passes is necessary.
Data written by the shader in one rendering pass and read by another mechanism (e.g., vertex or index buffer pulling) in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the appropriate bits set in barriers between passes is necessary.
EDIT: Actually the Wiki article on compute shaders says
Shared variable access uses the rules for incoherent memory access. This means that the user must perform certain synchronization in order to ensure that shared variables are visible.
Shared variables are all implicitly declared coherent, so you don't need to (and can't use) that qualifier. However, you still need to provide an appropriate memory barrier.
The usual set of memory barriers is available to compute shaders, but they also have access to memoryBarrierShared(); this barrier is specifically for shared variable ordering. groupMemoryBarrier() acts like memoryBarrier(), ordering memory writes for all kinds of variables, but it only orders read/writes for the current work group.
While all invocations within a work group are said to execute "in parallel", that doesn't mean that you can assume that all of them are executing in lock-step. If you need to ensure that an invocation has written to some variable so that you can read it, you need to synchronize execution with the invocations, not just issue a memory barrier (you still need the memory barrier though).
To synchronize reads and writes between invocations within a work group, you must employ the barrier() function. This forces an explicit synchronization between all invocations in the work group. Execution within the work group will not proceed until all other invocations have reached this barrier. Once past the barrier(), all shared variables previously written across all invocations in the group will be visible.
So this actually sounds like you need the barrier there, and memoryBarrierShared() alone is not enough (though you don't need both, as the last sentence says). The memory barrier will just synchronize the memory, but it doesn't stop the threads from executing past it. Thus the threads won't read any old cached data from the shared memory if the first thread has already written something, but they can very well reach the point of reading before the first thread has tried to write anything at all.
This actually fits perfectly with the fact that it works for workgroup sizes of 32 and below, and that the first 32 pixels work. At least on NVIDIA hardware, 32 is the warp size, and thus the number of threads that operate in perfect lock-step. So the first 32 threads (well, every block of 32 threads) always work exactly in parallel (well, conceptually, that is) and thus they cannot introduce any race conditions. This is also why you don't actually need any synchronization if you know you work inside a single warp, a common optimization.