Incremental by kernel thread - c++

Let’s say that I want to increment a property var incremental:Int32 every time a kernel thread is executed:
//SWIFT
var incremental:Int32 = 0
var incrementalBuffer:MTLBuffer!
var incrementalPointer: UnsafeMutablePointer<Int32>!
init(metalView: MTKView) {
...
incrementalBuffer = Renderer.device.makeBuffer(bytes: &incremental, length: MemoryLayout<Int32>.stride)
incrementalPointer = incrementalBuffer.contents().bindMemory(to: Int32.self, capacity: 1)
}
func draw(in view: MTKView) {
...
computeCommandEncoder.setComputePipelineState(computePipelineState)
let width = computePipelineState.threadExecutionWidth
let threadsPerGroup = MTLSizeMake(width, 1, 1)
let threadsPerGrid = MTLSizeMake(10, 1, 1)
computeCommandEncoder.setBuffer(incrementalBuffer, offset: 0, index: 0)
computeCommandEncoder.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerGroup)
computeCommandEncoder.endEncoding()
commandBufferCompute.commit()
commandBufferCompute.waitUntilCompleted()
print(incrementalPointer.pointee)
}
//METAL
kernel void compute_shader (device int& incremental [[buffer(0)]]){
incremental++;
}
So I expect outputs:
10
20
30
...
but I get:
1
2
3
...
EDIT:
After some work based on the answer of #JustSomeGuy, Caroline from raywenderlich and one Apple Engineer I get:
[[kernel]] void compute_shader (device atomic_int& incremental [[buffer(0)]],
                                ushort lid [[thread_position_in_threadgroup]] ){
    threadgroup atomic_int local_atomic;
    if (lid==0) atomic_store_explicit(&local_atomic, 0, memory_order_relaxed);
    // make sure the zero store has happened before any thread adds to local_atomic
    threadgroup_barrier(mem_flags::mem_threadgroup);
    atomic_fetch_add_explicit(&local_atomic, 1, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_threadgroup);
    if(lid == 0) {
        int local_non_atomic = atomic_load_explicit(&local_atomic, memory_order_relaxed);
        atomic_fetch_add_explicit(&incremental, local_non_atomic, memory_order_relaxed);
    }
}
and it works as expected.

You are seeing this problem because ++ is not atomic. It basically comes down to code like this:
auto temp = incremental;
incremental = temp + 1;
temp;
which means that, because the threads execute in "parallel" (not strictly true, since a number of threads form a SIMD-group that executes in lock-step, but that is not important here), several threads can read the same old value of incremental and all write back the same temp + 1, so increments get lost.
Since the access is not atomic, the result is basically undefined, because there's no way to tell which thread observed which value.
A quick fix is to declare incremental as device atomic_int& and use atomic_fetch_add_explicit(&incremental, 1, memory_order_relaxed). This makes all accesses to incremental atomic. memory_order_relaxed here means that the guarantees on the order of operations are relaxed, so this will work only if you are just adding to or just subtracting from the value. memory_order_relaxed is the only memory_order supported in MSL. You can read more on this in the Metal Shading Language Specification, section 6.13.
But this quick fix is pretty bad because it is going to be slow: access to incremental has to be synchronized across all the threads. The other way is to use a common pattern where all threads in a threadgroup update a value in threadgroup memory and then one or more of the threads atomically update the device memory. So the kernel will look something like
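kernel void compute_shader (device atomic_int& incremental [[buffer(0)]],
                            threadgroup atomic_int& local [[threadgroup(0)]],
                            ushort lid [[thread_position_in_threadgroup]] ){
    // zero the threadgroup counter once before any thread adds to it
    if(lid == 0) atomic_store_explicit(&local, 0, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_threadgroup);
    atomic_fetch_add_explicit(&local, 1, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_threadgroup);
    if(lid == 0) {
        int total = atomic_load_explicit(&local, memory_order_relaxed);
        atomic_fetch_add_explicit(&incremental, total, memory_order_relaxed);
    }
}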
Which basically means: every thread in the threadgroup atomically adds 1 to local, waits until every thread is done (threadgroup_barrier), and then exactly one thread atomically adds the total in local to incremental.
atomic_fetch_add_explicit on a threadgroup variable will use threadgroup atomics instead of global atomics, which should be faster.
You can read the specification mentioned above to learn more; these patterns are mentioned in the samples there.
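The two-level pattern described above can be sketched on the CPU with C++11 threads. This is only a host-side analogy, not Metal code: the function name count_with_local_aggregation and its parameters are hypothetical, an inner std::atomic<int> stands in for threadgroup memory, and joining the member threads stands in for threadgroup_barrier.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical host-side analogue of the kernel: `groups` plays the role of
// threadgroups and `threadsPerGroup` the threads inside one threadgroup.
int count_with_local_aggregation(int groups, int threadsPerGroup) {
    std::atomic<int> incremental{0};  // stands in for the device buffer
    std::vector<std::thread> groupLeaders;
    for (int g = 0; g < groups; ++g) {
        groupLeaders.emplace_back([&incremental, threadsPerGroup] {
            std::atomic<int> local{0};  // stands in for threadgroup memory
            std::vector<std::thread> members;
            for (int t = 0; t < threadsPerGroup; ++t) {
                members.emplace_back([&local] {
                    // cheap, group-local atomic increment
                    local.fetch_add(1, std::memory_order_relaxed);
                });
            }
            // Joining the members plays the role of threadgroup_barrier.
            for (auto& m : members) m.join();
            // One contended update per group instead of one per thread.
            incremental.fetch_add(local.load(std::memory_order_relaxed),
                                  std::memory_order_relaxed);
        });
    }
    for (auto& leader : groupLeaders) leader.join();
    return incremental.load(std::memory_order_relaxed);
}
```

Calling count_with_local_aggregation(4, 8) performs 32 local increments but only 4 updates of the shared counter, which is the saving the pattern is after.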

Related

C++ amp atomics

I am rewriting an algorithm in C++ AMP and just ran into an issue with atomic writes, more specifically atomic_fetch_add, which apparently is only for integers?
I need to add a double_4 (or if I have to, a float_4) in an atomic fashion. How do I accomplish that with C++ AMP's atomics?
Is the best/only solution really to have a lock variable which my code can use to control the writes? I actually need to do atomic writes for a long list of output doubles, so I would essentially need a lock for every output.
I have already considered tiling this for better performance, but right now I am just in the first iteration.
EDIT:
Thanks for the quick answers already given.
I have a quick update to my question though.
I made the following lock attempt, but it seems that when one thread in a warp gets past the lock, all the other threads in the same warp just tag along. I was expecting only the first warp thread to get the lock, but I must be missing something (note that it has been quite a few years since my CUDA days, so I have just gotten dumb).
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
.....
for (int j = 0; j < attracted.extent.size(); j++)
{
...
int lock = 0; //the expected lock value
while (!atomic_compare_exchange(&locks[j], &lock, 1));
//when one warp thread gets the lock, ALL threads continue on
...
acceleration[j] += ...; //locked write
locks[j] = 0; //leaving the lock again
}
});
It is as such not a big problem, since I should write into a shared variable at first and only write it to global memory after all threads in a tile have completed, but I just don't understand this behavior.
All the atomic add ops are only for integer types. You can do what you want without locks using 128-bit CAS (compare-and-swap) operations for float_4 (I'm assuming this is 4 floats), but there are no 256-bit CAS ops, which is what you would need for double_4. What you have to do is write a loop that atomically reads the float_4 from memory, performs the float add in the regular way, and then uses CAS to swap in the new value if memory still holds the original (and loops if not, i.e. if some other thread changed the value between the read and the write). Note that 128-bit CAS is only available on 64-bit architectures and that your data needs to be properly aligned.
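The read / add / CAS retry loop can be sketched for a single float with the C++11 standard library. This is a host-side illustration of the loop shape only, not C++ AMP code and not a 128-bit float_4 version; atomic_add_float and sum_from_threads are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Adds `value` to `target` with a read / add / compare-exchange retry loop.
void atomic_add_float(std::atomic<float>& target, float value) {
    float expected = target.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak stores the current value of `target`
    // into `expected`, so the sum is recomputed from fresh data on each retry.
    while (!target.compare_exchange_weak(expected, expected + value,
                                         std::memory_order_relaxed)) {
    }
}

// Several threads add 1.0f concurrently; no addition may be lost.
float sum_from_threads(int threads, int addsPerThread) {
    std::atomic<float> total{0.0f};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&total, addsPerThread] {
            for (int i = 0; i < addsPerThread; ++i) {
                atomic_add_float(total, 1.0f);
            }
        });
    }
    for (auto& w : workers) w.join();
    return total.load(std::memory_order_relaxed);
}
```

The same shape applies to a wider value as long as the platform offers a CAS of that width; without one, the loop cannot be made atomic this way.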
If the critical code is short, you can create your own lock using atomic operations:
int lock = 1;
while(__sync_lock_test_and_set(&lock, 0) == 0) // trying to acquire lock
{
//yield the thread or go to sleep
}
//critical section, do the work
// release lock
lock = 1;
The advantage is that you save the overhead of the OS locks.
The question has as such been answered by others and the answer is that you need to handle double atomics yourself. There is no function for it in the library.
I would also like to elaborate on my own edit, in case others come here with the same failing lock.
In the following example, my error was in not realizing that when the exchange fails, it actually changes the expected value! The first thread expects the lock to be zero and writes a 1 into it. The next thread also expects 0 and fails to write a 1, but the failed exchange then writes a 1 into the variable holding the expected value. This means that the next time that thread attempts the exchange, it expects a 1 in the lock. It finds one, and so it thinks it has acquired the lock.
I was absolutely not aware that the &lock would receive a 1 on failed exchange match!
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
.....
for (int j = 0; j < attracted.extent.size(); j++)
{
...
int lock = 0; //the expected lock value
//note that, if locks[j]!=lock then lock=1
//meaning that ACE will be true the next time if locks[j]==1
//meaning the while will terminate even though someone else has the lock
while (!atomic_compare_exchange(&locks[j], &lock, 1));
//when one warp thread gets the lock, ALL threads continue on
...
acceleration[j] += ...; //locked write
locks[j] = 0; //leaving the lock again
}
});
It seems that a fix is to do
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
.....
for (int j = 0; j < attracted.extent.size(); j++)
{
...
int lock = 0; //the expected lock value
while (!atomic_compare_exchange(&locks[j], &lock, 1))
{
lock=0; //reset the expected value
};
//when one warp thread gets the lock, ALL threads continue on
...
acceleration[j] += ...; //locked write
locks[j] = 0; //leaving the lock again
}
});
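The behaviour described above can be reproduced with std::atomic from the C++11 standard library, whose compare_exchange_strong also overwrites the expected value on failure. This is a host-side sketch, not C++ AMP code; the function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Returns what is left in `expected` after a failed exchange on a held lock.
int expected_after_failed_exchange() {
    std::atomic<int> lock{1};  // 1 means somebody else holds the lock
    int expected = 0;
    bool acquired = lock.compare_exchange_strong(expected, 1);
    assert(!acquired);         // the lock was taken, so the exchange fails
    return expected;           // now holds the observed value, not 0
}

// Spin lock acquire that resets the expected value on every failed attempt.
void acquire(std::atomic<int>& lock) {
    int expected = 0;
    while (!lock.compare_exchange_weak(expected, 1, std::memory_order_acquire)) {
        expected = 0;  // without this reset the next attempt would compare against 1
    }
}

// Releases the lock taken by acquire().
void release(std::atomic<int>& lock) {
    lock.store(0, std::memory_order_release);
}
```

expected_after_failed_exchange() returns 1, which is exactly the value that made the original loop terminate while another thread still held the lock.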

CUDA, mutex and atomicCAS()

I recently started developing with CUDA and ran into a problem with atomicCAS().
To do some manipulations with memory in device code I have to create a mutex, so that only one thread at a time can work with the memory in the critical section of the code.
The device code below runs on 1 block with several threads.
__global__ void cudaKernelGenerateRandomGraph(..., int* mutex)
{
int i = threadIdx.x;
...
do
{
atomicCAS(mutex, 0, 1 + i);
}
while (*mutex != i + 1);
//critical section
//do some manipulations with objects in device memory
*mutex = 0;
...
}
When the first thread executes
atomicCAS(mutex, 0, 1 + i);
mutex becomes 1. After that the first thread changes its status from Active to Inactive, and the line
*mutex = 0;
is never executed. The other threads stay in the loop forever. I have tried many variants of this loop, like while(){};, do{}while();, with a temp variable = *mutex inside the loop, even a variant with if(){} and goto. But the result is the same.
The host part of code:
...
int verticlesCount = 5;
int *mutex;
cudaMalloc((void **)&mutex, sizeof(int));
cudaMemset(mutex, 0, sizeof(int));
cudaKernelGenerateRandomGraph<<<1, verticlesCount>>>(..., mutex);
I use Visual Studio 2012 with CUDA 5.5.
The device is NVidia GeForce GT 240 with compute capability 1.2.
Thanks in advance.
UPD:
After some time working on my diploma project this spring, I found a solution for a critical section in CUDA.
It is a combination of lock-free and mutex mechanisms.
Here is the working code. I used it to implement an atomic, dynamically resizable array.
// *mutex should be 0 before calling this function
__global__ void kernelFunction(..., unsigned long long* mutex)
{
    bool isSet = false;
    do
    {
        // the thread that swaps 0 -> 1 owns the mutex
        if (isSet = atomicCAS(mutex, 0, 1) == 0)
        {
            // critical section goes here
        }
        if (isSet)
        {
            // release the mutex: write through the pointer, do not reassign it
            *mutex = 0;
        }
    }
    while (!isSet);
}
The loop in question
do
{
atomicCAS(mutex, 0, 1 + i);
}
while (*mutex != i + 1);
would work fine if it were running on the host (CPU) side; once thread 0 sets *mutex to 1, the other threads would wait exactly until thread 0 sets *mutex back to 0.
However, GPU threads are not as independent as their CPU counterparts. GPU threads are grouped into groups of 32, commonly referred to as warps. Threads in the same warp execute instructions in complete lock-step. If a control statement such as if or while causes some of the 32 threads to diverge from the rest, the remaining threads wait (i.e. sleep) for the divergent threads to finish. [1]
Going back to the loop in question, thread 0 becomes inactive because threads 1, 2, ..., 31 are still stuck in the while loop. So thread 0 never reaches the line *mutex = 0, and the other 31 threads loop forever.
A potential solution is to make a local copy of the shared resource in question, let 32 threads modify the copy, and then pick one thread to 'push' the change back to the shared resource. A __shared__ variable is ideal in this situation: it will be shared by the threads belonging to the same block but not other blocks. We can use __syncthreads() to fine-control the access of this variable by the member threads.
[1] CUDA Best Practices Guide - Branching and Divergence
Avoid different execution paths within the same warp.
Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.

Implementation of Long Atomic Int

I would like to use an atomic counter (multi-threaded computation) that typically counts up to 2^40, so I cannot use a 32-bit atomic int counter directly. I do not have C++11 yet (I will migrate to it, but not yet, as this has a cost for me) and I have to compile on 32-bit and 64-bit platforms.
I currently use Qt, so I can use QAtomicInt.
Here is what I'm thinking of:
(initialization...)
QAtomicInt counterLo = 0;
QAtomicInt counterHi = 0;
void increment()
{
int before = counterLo.fetchAndAddOrdered(1);
if(before==INT_MAX)
{
counterHi.fetchAndAddOrdered(1); //Increment high word
counterLo.fetchAndAddOrdered(INT_MAX); //Increments low word to -1
counterLo.fetchAndAddOrdered(1); //Increments low word to 0
}
}
uint64_t value()
{
//Wait until the low word is non-negative
int lo = counterLo;
while(lo<0)
lo = counterLo;
return (uint64_t)counterHi * ((uint64_t)INT_MAX+1) + (uint64_t)lo;
}
Is this correct? I already tried to implement the counter with a mutex, but I'm losing around 10% performance. This is called about 1 million times a second, shared between 8 threads (sample counter for a Monte-Carlo simulation).
Thanks!
This is not overall atomic, see the following example:
hi=0,lo=INT_MAX
T1 calls value(), gets lo=INT_MAX, is interrupted
T2 calls increment(), which increments hi to 1
T1 resumes and reads counterHi, gets 1, returns a value of 2^31 + INT_MAX (the multiplier in value() is INT_MAX+1 = 2^31)
This is likely not what you want. Can't you just split your sample space and let each thread calculate n/8 items without contending for a lock?
Of course this is not atomic. A sequence of atomic operations can be interrupted. I recommend using protection (a mutex or a critical section).
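The suggestion above to split the sample space so that threads do not contend at all can be sketched as follows. Note the assumptions: the sketch uses C++11 std::thread, which the question says is not available yet (QThread or any other thread API would follow the same structure), and count_samples is a hypothetical name whose loop body is only a placeholder for one Monte-Carlo sample.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Each worker counts into its own 64-bit variable; the totals are combined
// only after the workers have been joined, so no atomics or locks are needed.
std::uint64_t count_samples(int threads, std::uint64_t samplesPerThread) {
    std::vector<std::uint64_t> partial(static_cast<std::size_t>(threads), 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&partial, t, samplesPerThread] {
            std::uint64_t local = 0;  // private to this thread
            for (std::uint64_t i = 0; i < samplesPerThread; ++i) {
                ++local;              // placeholder for one Monte-Carlo sample
            }
            partial[t] = local;       // single write to this thread's own slot
        });
    }
    std::uint64_t total = 0;
    for (int t = 0; t < threads; ++t) {
        workers[t].join();            // join before reading the slot
        total += partial[t];
    }
    return total;
}
```

With this layout the counter is a plain 64-bit integer on both 32-bit and 64-bit platforms, and a progress display would read the per-thread slots only at well-defined synchronization points.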

Fast counting semaphore on Windows?

First of all, I know that it can be implemented with a mutex and condition variable, but I want the most efficient implementation possible.
I would like a semaphore with a fast-path when there's no contention. On Linux this is easy with a futex; for example, here's a wait:
if (AtomicDecrementIfPositive(_counter) > 0) return; // Uncontended
AtomicAdd(&_waiters, 1);
do
{
if (syscall(SYS_futex, &_counter, FUTEX_WAIT_PRIVATE, 0, nullptr, nullptr, 0) == -1) // Sleep
{
AtomicAdd(&_waiters, -1);
throw std::runtime_error("Failed to wait for futex");
}
}
while (AtomicDecrementIfPositive(_counter) <= 0);
AtomicAdd(&_waiters, -1);
and post:
AtomicAdd(&_counter, 1);
if (Load(_waiters) > 0 && syscall(SYS_futex, &_counter, FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0) == -1) throw std::runtime_error("Failed to wake futex"); // Wake one
At first I thought of simply using NtWaitForKeyedEvent() on Windows. The problem is that it's not a direct substitute, because it doesn't atomically check the value at _counter before going into the kernel, and so it can miss the wake from NtReleaseKeyedEvent(). Worse, NtReleaseKeyedEvent() would then block.
What's the best solution?
Windows has native semaphores with CreateSemaphore. Until and unless you have some kind of documented performance problem doing it the normal way, you shouldn't even consider optimizations that are fragile or hardware-specific.
I think something like this should work:
// bottom 16 bits: post count
// top 16 bits: wait count
struct Semaphore { unsigned val; }
wait(struct Semaphore *s)
{
retry:
do
old = s->val;
if old had posts (bottom 16 bits != 0)
new = old - 1
wait = false
else
new = old + 65536
wait = true
until successful CAS of &s->val from old to new
if wait == true
wait on keyed event
goto retry;
}
post(struct Semaphore *s)
{
do
old = s->val;
if old had waiters (top 16 bits != 0)
// perhaps new = old - 65536 and remove the "goto retry" above?
// not sure, but this is safer...
new = old - 65536 + 1
release = true
else
new = old + 1
release = false
until successful CAS of &s->val from old to new
if release == true
release keyed event
}
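The user-space half of the pseudocode above (the packed state word and its CAS loops) can be sketched with C++11 atomics. This is only an illustration of the state transitions: the keyed-event wait and release calls are deliberately left out, and SemaphoreState, begin_wait and begin_post are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// bottom 16 bits: post count, top 16 bits: waiter count
struct SemaphoreState {
    std::atomic<std::uint32_t> val{0};
};

// Returns true if the caller has registered as a waiter and must now block
// on the kernel object; false if it consumed a post on the fast path.
bool begin_wait(SemaphoreState& s) {
    std::uint32_t old = s.val.load(std::memory_order_relaxed);
    for (;;) {
        bool mustBlock = (old & 0xFFFFu) == 0;  // no posts available
        std::uint32_t desired = mustBlock ? old + 65536u : old - 1u;
        // On failure `old` is refreshed with the current value and we retry.
        if (s.val.compare_exchange_weak(old, desired,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed)) {
            return mustBlock;
        }
    }
}

// Returns true if the caller must release one waiter through the kernel.
bool begin_post(SemaphoreState& s) {
    std::uint32_t old = s.val.load(std::memory_order_relaxed);
    for (;;) {
        bool mustRelease = (old >> 16) != 0;    // somebody is waiting
        std::uint32_t desired = mustRelease ? old - 65536u + 1u : old + 1u;
        if (s.val.compare_exchange_weak(old, desired,
                                        std::memory_order_release,
                                        std::memory_order_relaxed)) {
            return mustRelease;
        }
    }
}
```

A full wait would loop: when begin_wait returns true, block on the keyed event and then call begin_wait again, matching the "goto retry" in the pseudocode.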
edit: that said, I'm not sure this would help you a lot. Your thread pool usually should be big enough that a thread is always ready to process your request. This means that not only waits, but also posts will always take the slow path and go to the kernel. So, counting semaphores are probably the one primitive where you do not really care about a userspace-only fastpath. Stock Win32 semaphores should be good enough. That said, I'm happy to be proven wrong!
I vote for your first idea, i.e. a critical section and a condition variable. A critical section is fast enough, and it does use an interlocked operation before it goes to sleep. Or you can experiment with SRWLocks instead of a critical section. Condition variables (and SRWLocks) are very fast. Their only problem is that there are no condition variables on XP, but maybe you do not need to target that platform.
Qt has all kinds of things like QMutex, QSemaphore which are implemented in spirit like what you presented in your question.
Actually, I would suggest replacing the futex stuff with the usual OS-provided synchronization primitives; it should not matter much since that is the slow path anyway.

Mutual exclusion problem [duplicate]

Please take a look at the following pseudo-code:
boolean blocked[2];
int turn;
void P(int id) {
while(true) {
blocked[id] = true;
while(turn != id) {
while(blocked[1-id])
/* do nothing */;
turn = id;
}
/* critical section */
blocked[id] = false;
/* remainder */
}
}
void main() {
blocked[0] = false;
blocked[1] = false;
turn = 0;
parbegin(P(0), P(1)); //RUN P0 and P1 parallel
}
I thought that I could implement a simple mutual exclusion solution using the code above, but it's not working. Does anyone have an idea why?
Any help would really be appreciated!
Mutual exclusion is not guaranteed in this example because of the following:
We begin with the following situation:
blocked = {false, false};
turn = 0;
P1 now executes, but it has not yet reached
blocked[id] = false; // Not yet executed.
The situation is now:
blocked = {false, true};
turn = 0;
Now P0 executes. It passes the second while loop, ready to execute the critical section. And when P1 executes, it sets turn to 1, and is also ready to execute the critical section.
Btw, this method was originally invented by Hyman. He sent it to Communications of the ACM in 1966.
Mutual exclusion is not guaranteed in this example because of the following:
We begin with the following situation:
turn= 1;
blocked = {false, false};
The execution runs as follows:
P0: while (true) {
P0: blocked[0] = true;
P0: while (turn != 0) {
P0: while (blocked[1]) {
P0: }
P1: while (true) {
P1: blocked[1] = true;
P1: while (turn != 1) {
P1: }
P1: criticalSection(P1);
P0: turn = 0;
P0: while (turn != 0)
P0: }
P0: criticalSection(P0);
Is this homework, or some embedded platform? Is there any reason why you can't use pthreads or Win32 (as relevant) synchronisation primitives?
Maybe you need to declare blocked and turn as volatile, but without specifying the programming language there is no way to know.
Concurrency can not be implemented like this, especially in a multi-processor (or multi-core) environment: different cores/processors have different caches. Those caches may not be coherent. The pseudo-code below could execute in the order shown, with the results shown:
get blocked[0] -> false // cpu 0
set blocked[0] = true // cpu 1 (stored in CPU 1's L1 cache)
get blocked[0] -> false // cpu 0 (retrieved from CPU 0's L1 cache)
get blocked[0] -> false // cpu 2 (retrieved from main memory)
You need hardware knowledge to implement concurrency.
The compiler might have optimized out the "empty" while loop. Declaring the variables as volatile might help, but is not guaranteed to be sufficient on multiprocessor systems.
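The failing interleaving traced in the answers above (starting from turn = 1 and blocked = {false, false}) can be replayed deterministically in a single thread. This sketch does not run P0 and P1 concurrently; it only executes their steps in the problematic order, and the names Shared and both_enter_critical_section are hypothetical.

```cpp
#include <cassert>

struct Shared {
    bool blocked[2];
    int turn;
};

// Replays the interleaving step by step and reports whether both processes
// end up inside the critical section at the same time.
bool both_enter_critical_section() {
    Shared s = {{false, false}, 1};
    bool inCritical[2] = {false, false};

    // P0: blocked[0] = true; turn != 0, so it enters the outer loop;
    // blocked[1] is still false, so the inner loop exits immediately.
    s.blocked[0] = true;
    bool p0InOuterLoop = (s.turn != 0);
    bool p0SpinsOnP1 = s.blocked[1];
    assert(p0InOuterLoop && !p0SpinsOnP1);
    // P0 is preempted here, just before executing "turn = 0".

    // P1: blocked[1] = true; turn is already 1, so it skips the outer loop
    // and enters the critical section.
    s.blocked[1] = true;
    if (s.turn == 1) inCritical[1] = true;

    // P0 resumes: turn = 0, the outer loop condition is now false,
    // so P0 enters the critical section while P1 is still inside.
    s.turn = 0;
    if (s.turn == 0) inCritical[0] = true;

    return inCritical[0] && inCritical[1];
}
```

The replay shows that the flaw is in the algorithm itself: P0 checks blocked[1] and only later sets turn, and nothing forces it to re-check blocked[1] in between. volatile or memory barriers do not change that.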