I am wondering about the execution order of threads in OpenGL.
Say I have a mobile GPU that often has n_cores between 8 and 32 (e.g. ARM Mali). That means those cores are organized differently from Nvidia warps (or AMD wavefronts).
The reason I am asking is the following toy example:
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
shared float a[16];
void main() {
    uint tid = gl_GlobalInvocationID.x; // <-- thread id
    // set all a to 0
    if (tid < 16) {
        a[tid] = 0;
    }
    barrier();
    memoryBarrierShared();
    a[tid % 16] += 1;
    barrier();
    memoryBarrierShared();
    float b = 0;
    b = REDUCE(a); // <-- reduction of the array a
}
It happens that b is different from execution to execution (glDispatchCompute(1, 100, 1)) as if there is some race condition.
I am not sure whether threads within a work group are really concurrent (like warps in a streaming multiprocessor).
Also how many cores are mapped to work groups/shaders?
What are your thoughts about that? Thanks
It happens that b is different from execution to execution (glDispatchCompute(1, 100, 1)) as if there is some race condition.
That's because there is one:
a[tid % 16] += 1;
For a workgroup with a local size of 256, there will be at least two invocations in that workgroup that have the same value of tid % 16. Therefore, those invocations will attempt to manipulate the same index of a.
Since there are no barriers or any other mechanism to prevent this, this is a race condition on the elements of a, and therefore you get undefined behavior.
Now, you could manipulate a through atomic operations:
atomicAdd(a[tid % 16], 1);
That is well-defined behavior. (Note that core GLSL only defines atomicAdd for int and uint shared variables, so a would need to be declared as an integer array, or you would need a vendor extension for float atomics.)
I am not sure whether threads within a work group are really concurrent (like warps in a streaming multiprocessor).
This is irrelevant. You must treat them as if they are executed concurrently.
Also how many cores are mapped to work groups/shaders?
Again, essentially irrelevant. It matters for performance, mainly in terms of how big to make your local group size, but it does not affect whether your code is correct.
Related
I am wondering whether, within the same wave / subgroup (warp?), we need to call memoryBarrierShared and barrier to synchronize shared variables. On NVIDIA I think it is not necessary, but I do not know about other IHVs.
EDIT : ballot
Since I am talking about waves / subgroups, I am referring to the ARB_shader_ballot extension.
Let's say we have the following code (1):
shared uint s_data[128];
uint tid = gl_GlobalInvocationID.x;
// initialization of some s_data
memoryBarrierShared();
barrier();
if(tid < gl_SubGroupSizeARB) {
    for(uint i = gl_SubGroupSizeARB; i > 0; i>>=1)
        s_data[tid] += s_data[tid + i];
}
In my opinion, this code is not correct. The correct one, according to the spec, would be (2):
if(tid < gl_SubGroupSizeARB) {
    for(uint i = gl_SubGroupSizeARB; i > 0; i>>=1) {
        s_data[tid] += s_data[tid + i];
        memoryBarrierShared();
        barrier();
    }
}
However, since invocations run in parallel within a wave/subgroup, the barrier function seems to be useless: this one should be correct as well, and faster than the second (3):
if(tid < gl_SubGroupSizeARB) {
    for(uint i = gl_SubGroupSizeARB; i > 0; i>>=1) {
        s_data[tid] += s_data[tid + i];
        memoryBarrierShared();
    }
}
However, since we would not need the barrier function, I wonder whether (1) is correct (even if that seems unlikely to me), and if not, whether (3) is correct (which would mean that my understanding is right).
EDIT : int to uint, and change = to +=
The execution model shared by OpenGL and Vulkan with regard to compute shaders does not really recognize the concept of a "wave". It has the concept of a work group, but that is not the same thing. A work group can be much bigger than a GPU "wave", and for small work groups, multiple work groups could be executing on the same GPU "wave".
As such, these specifications make no statements about the behavior of any of its functions with regard to a "wave" (with the exception of shader ballot functions). So if you want synchronization that the standard says will work on all conforming implementations, you must call both functions as dictated by the standard.
Even with ARB_shader_ballot, its behavior does not modify the execution model of shaders. It only allows cross-communication between invocations within a subgroup, and only via the explicit mechanisms that it provides.
The execution model and memory model of shader invocations is that they are unordered with respect to each other, unless you explicitly order them with barriers.
Recently I started developing with CUDA and ran into a problem with atomicCAS().
To do some manipulations with memory in device code, I have to create a mutex so that only one thread can work with the memory in a critical section of the code.
The device code below runs on 1 block with several threads.
__global__ void cudaKernelGenerateRandomGraph(..., int* mutex)
{
    int i = threadIdx.x;
    ...
    do
    {
        atomicCAS(mutex, 0, 1 + i);
    }
    while (*mutex != i + 1);
    //critical section
    //do some manipulations with objects in device memory
    *mutex = 0;
    ...
}
When the first thread executes
atomicCAS(mutex, 0, 1 + i);
*mutex becomes 1. After that, the first thread changes its status from Active to Inactive, and the line
*mutex = 0;
is never executed. The other threads stay in the loop forever. I have tried many variants of this loop, such as while(){};, do{}while();, a temporary variable = *mutex inside the loop, and even a variant with if(){} and goto, but the result is the same.
The host part of code:
...
int verticlesCount = 5;
int *mutex;
cudaMalloc((void **)&mutex, sizeof(int));
cudaMemset(mutex, 0, sizeof(int));
cudaKernelGenerateRandomGraph<<<1, verticlesCount>>>(..., mutex);
I use Visual Studio 2012 with CUDA 5.5.
The device is NVidia GeForce GT 240 with compute capability 1.2.
Thanks in advance.
UPD:
After some time working on my diploma project this spring, I found a solution for a critical section on CUDA.
This is a combination of lock-free and mutex mechanisms.
Here is the working code. I used it to implement an atomic dynamically resizable array.
// *mutex should be 0 before calling this function
__global__ void kernelFunction(..., unsigned long long* mutex)
{
    bool isSet = false;
    do
    {
        if (isSet = (atomicCAS(mutex, 0, 1) == 0))  // try to take the lock; only one thread sees 0
        {
            // critical section goes here
        }
        if (isSet)
        {
            *mutex = 0;   // release the lock so the remaining threads can take their turn
        }
    }
    while (!isSet);
}
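For completeness, a minimal host-side sketch of how this kernel could be driven, mirroring the earlier host code (threadsPerBlock and the elided kernel arguments are placeholders):
unsigned long long *mutex;
cudaMalloc((void **)&mutex, sizeof(unsigned long long));
cudaMemset(mutex, 0, sizeof(unsigned long long));   // mutex must start at 0
kernelFunction<<<1, threadsPerBlock>>>(..., mutex);
cudaDeviceSynchronize();                            // wait for the kernel to finish
cudaFree(mutex);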
The loop in question
do
{
    atomicCAS(mutex, 0, 1 + i);
}
while (*mutex != i + 1);
would work fine if it were running on the host (CPU) side; once thread 0 sets *mutex to 1, the other threads would wait exactly until thread 0 sets *mutex back to 0.
However, GPU threads are not as independent as their CPU counterparts. GPU threads are grouped into groups of 32, commonly referred to as warps. Threads in the same warp execute instructions in complete lock-step. If a control statement such as if or while causes some of the 32 threads to diverge from the rest, the remaining threads will wait (i.e. sleep) until the divergent threads finish. [1]
Going back to the loop in question, thread 0 becomes inactive because threads 1, 2, ..., 31 are still stuck in the while loop. So thread 0 never reaches the line *mutex = 0, and the other 31 threads loop forever.
A potential solution is to make a local copy of the shared resource in question, let 32 threads modify the copy, and then pick one thread to 'push' the change back to the shared resource. A __shared__ variable is ideal in this situation: it will be shared by the threads belonging to the same block but not other blocks. We can use __syncthreads() to fine-control the access of this variable by the member threads.
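As a minimal sketch of that pattern (the kernel name, blockSum, g_total and the summation task are illustrative, not taken from the question), each thread updates a block-local __shared__ copy with a shared-memory atomic (supported from compute capability 1.2, which the GT 240 has), the block synchronizes with __syncthreads(), and a single thread pushes the result back to global memory:
__global__ void blockAccumulate(const int *values, int n, int *g_total)
{
    __shared__ int blockSum;                 // block-local copy of the shared resource
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x == 0)
        blockSum = 0;                        // one thread initializes the copy
    __syncthreads();                         // every thread sees the initialized value
    if (i < n)
        atomicAdd(&blockSum, values[i]);     // all threads modify the block-local copy
    __syncthreads();                         // wait until everyone has contributed
    if (threadIdx.x == 0)
        atomicAdd(g_total, blockSum);        // one thread pushes the change back
}
This avoids the per-thread global mutex entirely, so no warp ever waits on a lock held by one of its own diverged lanes.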
[1] CUDA Best Practices Guide - Branching and Divergence
Avoid different execution paths within the same warp.
Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.
"64-bit NoBarrier_Store() not implemented on this platform"
I use tcmalloc on win7 with vs2005.
There are two threads in my app: one does malloc(), the other one does free(). tcmalloc prints this message when my app starts. After debugging, I found that the following function can't work on _WIN32:
// Return a suggested delay in nanoseconds for iteration number "loop"
static int SuggestedDelayNS(int loop) {
  // Weak pseudo-random number generator to get some spread between threads
  // when many are spinning.
  static base::subtle::Atomic64 rand;
  uint64 r = base::subtle::NoBarrier_Load(&rand);
  r = 0x5deece66dLL * r + 0xb;   // numbers from nrand48()
  base::subtle::NoBarrier_Store(&rand, r);
  r <<= 16;   // 48-bit random number now in top 48-bits.
  if (loop < 0 || loop > 32) {   // limit loop to 0..32
    loop = 32;
  }
  // loop>>3 cannot exceed 4 because loop cannot exceed 32.
  // Select top 20..24 bits of lower 48 bits,
  // giving approximately 0ms to 16ms.
  // Mean is exponential in loop for first 32 iterations, then 8ms.
  // The futex path multiplies this by 16, since we expect explicit wakeups
  // almost always on that path.
  return r >> (44 - (loop >> 3));
}
I want to know how to avoid this on Win32. Thanks very much.
It seems to be using atomic loads and stores without memory barriers, which might make this work a bit faster on some multi-CPU systems.
On x86, we don't have those types of operations: loads and stores are always visible to the other cores in the system. Cache synchronization is implemented in the hardware and can't be controlled by the program.
Perhaps the Atomic library used has Load and Store operations without the NoBarrier prefix? Use those instead.
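A minimal sketch of that suggestion, assuming the atomicops header in use also declares Acquire_Load / Release_Store for Atomic64 (the gperftools/Chromium base/atomicops.h does declare them; whether they are actually implemented for 64-bit values on 32-bit Windows builds is exactly what would need to be verified):
static int SuggestedDelayNS(int loop) {
  static base::subtle::Atomic64 rand;
  // Hypothetical change: use the barrier-carrying variants instead of the
  // NoBarrier_* ones that abort with "not implemented on this platform".
  uint64 r = base::subtle::Acquire_Load(&rand);
  r = 0x5deece66dLL * r + 0xb;            // numbers from nrand48()
  base::subtle::Release_Store(&rand, r);
  r <<= 16;                               // 48-bit random number now in top 48 bits
  if (loop < 0 || loop > 32) loop = 32;   // limit loop to 0..32
  return r >> (44 - (loop >> 3));
}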
Okay, I have this question regarding threads.
There are two unsynchronized threads running simultaneously and using a global resource, "int num":
1st:
void Thread()
{
    int i;
    for ( i=0 ; i < 100000000; i++ )
    {
        num++;
        num--;
    }
}
2nd:
void Thread2()
{
    int j;
    for ( j=0 ; j < 100000000; j++ )
    {
        num++;
        num--;
    }
}
The question asks: what are the possible values of the variable "num" at the end of the program?
Now, I would say that 0 will be the value of num at the end of the program, but if you try and run this code you will find that the result is quite random,
and I can't understand why.
The full code:
#include <windows.h>
#include <process.h>
#include <stdio.h>

static int num = 0;

void Thread()
{
    int i;
    for ( i=0 ; i < 100000000; i++ )
    {
        num++;
        num--;
    }
}

void Thread2()
{
    int j;
    for ( j=0 ; j < 100000000; j++ )
    {
        num++;
        num--;
    }
}

int main()
{
    long handle,handle2,code,code2;
    handle=_beginthread( Thread, 0, NULL );
    handle2=_beginthread( Thread2, 0, NULL );
    while( (GetExitCodeThread(handle,&code)||GetExitCodeThread(handle2,&code2))!=0 );
    TerminateThread(handle, code );
    TerminateThread(handle2, code2 );
    printf("%d ",num);
    system("pause");
}
num++ and num-- don't have to be atomic operations. To take num++ as an example, this is probably implemented like:
int tmp = num;
tmp = tmp + 1;
num = tmp;
where tmp is held in a CPU register.
Now let's say that num == 0, both threads try to execute num++, and the operations are interleaved as follows:
Thread A              Thread B
int tmp = num;
tmp = tmp + 1;
                      int tmp = num;
                      tmp = tmp + 1;
                      num = tmp;
num = tmp;
The result at the end will be num == 1 even though it should have been incremented twice. Here, one increment is lost; in the same way, a decrement could be lost as well.
In pathological cases, all increments of one thread could be lost, resulting in num == -100000000, or all decrements of one thread could be lost, resulting in num == +100000000. There may even be more extreme scenarios lurking out there.
Then there's also other business going on, because num isn't declared as volatile. Both threads will therefore assume that the value of num doesn't change unless they are the ones changing it. This allows the compiler to optimize away the entire for loop, if it feels so inclined!
The possible values for num include all possible int values, plus floating point values, strings, and jpegs of nasal demons. Once you invoke undefined behavior, all bets are off.
More specifically, modifying the same object from multiple threads without synchronization results in undefined behavior. On most real-world systems, the worst effects you see will probably be missing or double increments or decrements, but it could be much worse (memory corruption, crashing, file corruption, etc.). So just don't do it.
The next upcoming C and C++ standards will include atomic types which can be safely accessed from multiple threads without any synchronization API.
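For illustration, a sketch of the same thread body using the std::atomic type that those standards ended up providing (this is a rewrite of the example, not code from the question):
#include <atomic>
std::atomic<int> num(0);        // atomic counter shared by both threads
void Thread()
{
    int i;
    for ( i = 0; i < 100000000; i++ )
    {
        num++;                  // atomic read-modify-write: no increment can be lost
        num--;
    }
}
With both threads written this way, the only possible final value is 0.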
You speak of threads running simultaneously which actually might not be the case if you only have one core in your system. Let's assume that you have more than one.
In the case of multiple devices having access to main memory either in the form of CPUs or bus-mastering or DMA they must be synchronized. This is handled by the lock prefix (implicit for the instruction xchg). It accesses a physical wire on the system bus which essentially signals all devices present to stay away. It is, for example, part of the Win32 function EnterCriticalSection.
So in the case of two cores on the same chip accessing the same position, the result would be undefined, which may seem strange considering that some synchronization should occur since they share the same L3 cache (if there is one). Seems logical, but it doesn't work that way. Why? Because a similar case occurs when you have the two cores on different chips (i.e. without a shared L3 cache). You can't expect them to be synchronized. Well, you can, but consider all the other devices having access to main memory. If you plan to synchronize between two CPU chips you can't stop there - you have to perform a full-blown synchronization that blocks out all devices with access, and to ensure a successful synchronization all the other devices need time to recognize that a synchronization has been requested, and that takes a long time, especially if a device has been granted access and is performing a bus-mastering operation which must be allowed to complete. The PCI bus will perform an operation every 0.125 us (8 MHz), and considering that your CPUs run at 400 times that frequency, you're looking at A LOT of wait states. Then consider that several PCI clock cycles might be required.
You could argue that a medium type (memory bus only) lock should exist but this means an additional pin on every processor and additional logic in every chipset just to handle a case which is really a misunderstanding on the programmer's part. So it's not implemented.
To sum it up: a generic synchronization that would handle your situation would render your PC useless due to it always having to wait for the last device to check in and ok the synchronization. It is a better solution to let it be optional and only insert wait states when the developer has determined that it is absolutely necessary.
This was so much fun that I played a little with the example code and added spinlocks to see what would happen. The spinlock components were
// prototypes
char spinlock_failed (spinlock *);
void spinlock_leave (spinlock *);
// application code
while (spinlock_failed (&sl)) ++n;
++num;
spinlock_leave (&sl);
while (spinlock_failed (&sl)) ++n;
--num;
spinlock_leave (&sl);
spinlock_failed was constructed around the "xchg mem,eax" instruction. Once it stopped failing (i.e. succeeded at setting the spinlock), spinlock_leave would simply release it by assigning to it with "mov mem,0". The "++n" counts the total number of retries.
I changed the loop count to 2.5 million (because with two threads and two spinlocks per loop I get 10 million spinlocks, nice and easy to round with) and timed the sequences with the "rdtsc" counter on a dual-core Athlon II M300 @ 2 GHz, and this is what I found:
Running one thread without timing (except for the main loop) and without locks (as in the original example): 33748884 cycles <=> 16.9 ms => 13.5 cycles/loop.
Running one thread, i.e. with no other core trying for the locks, took 210917969 cycles <=> 105.5 ms => 84.4 cycles/loop <=> 0.042 us/loop. The spinlocks required 112581340 cycles <=> 22.5 cycles per spinlocked sequence. Still, the slowest spinlock required 1334208 cycles: that's 667 us, or only 1500 every second.
So, the addition of spinlocks unaffected by another CPU added several hundred percent to the total execution time. The final value in num was 0.
Running two threads without spinlocks took 171157957 cycles <=> 85.6 ms => 68.5 cycles/loop. num contained 10176.
Two threads with spinlocks took 4099370103 cycles <=> 2049 ms => 1640 cycles/loop <=> 0.82 us/loop. The spinlocks required 3930091465 cycles => 786 cycles per spinlocked sequence. The slowest spinlock required 27038623 cycles: that's 13.52 ms, or only 74 every second. num contained 0.
Incidentally the 171157957 cycles for two threads without spinlocks compares very favorably to two threads with spinlocks where the spinlock time has been removed: 4099370103-3930091465 = 169278638 cycles.
For my sequence the spinlock competition caused 21-29 million retries per thread which comes out to 4.2-5.8 retries per spinlock or 5.2-6.8 tries per spinlock. Addition of spinlocks caused an execution time penalty of 1927% (1500/74-1). The slowest spinlock required 5-8% of all tries.
As Thomas said, the results are unpredictable because your increment and decrement are non-atomic. You can use InterlockedIncrement and InterlockedDecrement -- which are atomic -- to see a predictable result.
Interlocked Variable Access (MSDN)
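As a sketch of that suggestion (a hypothetical rewrite of the loop body, not code from the question), the interlocked functions operate on a volatile LONG:
#include <windows.h>
static volatile LONG num = 0;
void Thread()
{
    int i;
    for ( i = 0; i < 100000000; i++ )
    {
        InterlockedIncrement(&num);   // atomic ++num
        InterlockedDecrement(&num);   // atomic --num
    }
}
With both thread functions changed this way, the printed result is always 0 (at a noticeable cost in speed, as the spinlock experiment above suggests).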
I am tackling the challenge of using both the capabilities of an 8-core machine and a high-end GPU (Tesla 10).
I have one big input file, one thread for each core, and one for the GPU handling.
The GPU thread, to be efficient, needs a large number of lines from the input, while
a CPU thread needs only one line to proceed (storing multiple lines in a temp buffer was slower). The file doesn't need to be read sequentially. I am using boost.
My strategy is to have a mutex on the input stream and have each thread lock and unlock it.
This is not optimal because the GPU thread should have a higher priority when locking the mutex, being the fastest and the most demanding one.
I can come up with different solutions but before rush into implementation I would like to have some guidelines.
What approach do you use / recommend ?
You may not need to lock at all if "1 line per thread" is not a strict requirement and you can sometimes go up to two or three lines. Then you can split the file equally, based on a formula. Suppose you want to read the file in 1024 KB blocks in total (this could be gigabytes too): you split it up among the cores with prioritization. So:
#define BLOCK_SIZE (1024 * 1024)
#define REGULAR_THREAD_BLOCK_SIZE (BLOCK_SIZE/(2 * NUM_CORES)) // 64kb
#define GPU_THREAD_BLOCK_SIZE (BLOCK_SIZE/2)
Each core gets a 64 KB chunk:
Core 1: offset 0 , size = REGULAR_THREAD_BLOCK_SIZE
Core 2: offset 65536 , size = REGULAR_THREAD_BLOCK_SIZE
Core 3: offset 131072 , size = REGULAR_THREAD_BLOCK_SIZE
Core n: offset ((n-1) * REGULAR_THREAD_BLOCK_SIZE), size = REGULAR_THREAD_BLOCK_SIZE
GPU gets 512 KB, offset = (NUM_CORES * REGULAR_THREAD_BLOCK_SIZE), size = GPU_THREAD_BLOCK_SIZE
So ideally they don't overlap. There are cases where they can overlap, though: since you're reading a text file, a line might straddle the boundary into the next core's block. To avoid overlapping, you always skip the first line for every core except the first, and always complete the last line, assuming the next thread will skip it anyway. Here is pseudo code:
void threadProcess(const char *buf, int startOffset, int blockSize, int coreNum)
{
    int offset = startOffset;
    int endOffset = startOffset + blockSize;
    if(coreNum > 0) {
        // skip to the next line
        while(buf[offset] != '\n' && offset < endOffset) offset++;
    }
    if(offset >= endOffset) return; // nothing left to process
    // read the lines contained in the buffer
    char *currentLine = allocLineBuffer(); // opening door to security exploits :)
    int strPos = 0;
    while(offset < endOffset) {
        if(buf[offset] == '\n') {
            currentLine[strPos] = 0;
            processLine(currentLine); // do line processing here
            strPos = 0; // fresh start
            offset++;
            continue;
        }
        currentLine[strPos] = buf[offset];
        offset++;
        strPos++;
    }
    // read the remainder past the end of the block
    strPos = 0;
    while(buf[offset] != '\n') {
        currentLine[strPos++] = buf[offset++];
    }
    currentLine[strPos] = 0;
    processLine(currentLine); // process the carryover line
}
As you can see, this parallelizes the processing of the read block, not the reads themselves. How do you parallelize reads? The best, most awesome way would be memory mapping the whole block into memory. That would give the best I/O performance, as it's the lowest level.
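A minimal sketch of that idea using Boost, which the question already depends on; boost::iostreams::mapped_file_source maps the file read-only, and each worker is handed an offset and size computed as above. The wrapper function, the thread setup, and the reuse of the macros and threadProcess from earlier are illustrative assumptions:
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/thread.hpp>
#include <boost/bind.hpp>
#include <string>
void processOneBlock(const std::string &path)
{
    boost::iostreams::mapped_file_source file(path);   // map the whole file read-only
    const char *buf = file.data();
    boost::thread_group workers;
    for (int core = 0; core < NUM_CORES; ++core)        // one worker per core
        workers.create_thread(boost::bind(&threadProcess, buf,
            core * REGULAR_THREAD_BLOCK_SIZE, REGULAR_THREAD_BLOCK_SIZE, core));
    // the GPU thread would take its 512 KB chunk starting at
    // NUM_CORES * REGULAR_THREAD_BLOCK_SIZE in the same way
    workers.join_all();
}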
Some ideas:
1) Because the bottleneck is not in the I/O, the file should be kept almost entirely in RAM for easier access.
2) The implementation should not allow threads to block. It's better to have a slightly non-optimal solution if that reduces blocking.
Assuming we have a big data file, threads can employ "shot in the dark" tactics (sketched after the outcomes below). This means that once a thread acquires the lock, it just increments fpos and unlocks the memory. It then grants itself the privilege to process the part of the memory it just claimed; for example, the thread could process all the lines whose beginnings lie in that fragment.
Outcomes:
1) It's almost impossible for a thread to block. The lock times are very short (in the range of several instructions plus the time to flush caches).
2) Flexibility. A thread can take as much data as it wants to.
Of course, there should be some mechanisms to adapt to the length of line in the data file to avoid worst case scenario.
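A minimal sketch of that tactic, assuming the file is already sitting in memory; an atomic file position replaces the mutex entirely, so the "lock time" is a single fetch-and-add (g_filePos, CHUNK and processLinesStartingIn are illustrative names, not from the answer):
#include <atomic>
#include <cstddef>
static std::atomic<size_t> g_filePos(0);           // shared cursor into the in-memory file
static const size_t CHUNK = 64 * 1024;             // how much a thread claims per grab
void worker(const char *buf, size_t fileSize)
{
    for (;;) {
        size_t start = g_filePos.fetch_add(CHUNK); // the entire "critical section"
        if (start >= fileSize) return;             // nothing left to claim
        size_t end = start + CHUNK < fileSize ? start + CHUNK : fileSize;
        // process every line whose beginning falls inside [start, end)
        processLinesStartingIn(buf, start, end, fileSize);
    }
}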
I would use a buffer. Have a single thread filling that buffer from disk. Each thread would lock the buffer, read data into its own buffer, then release the lock on the mutex before processing the data; a rough sketch of this producer/consumer arrangement follows.
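This sketch assumes Boost (as in the question); a single reader thread fills a shared line buffer, and consumers copy out what they need under the lock before processing it. The buffer layout and names are illustrative:
#include <boost/thread.hpp>
#include <deque>
#include <string>
#include <vector>
boost::mutex bufMutex;
boost::condition_variable bufNotEmpty;
std::deque<std::string> lineBuffer;                // filled by the single reader thread
bool readerDone = false;
// Consumer side: a CPU thread asks for 1 line, the GPU thread asks for a big batch.
size_t takeLines(std::vector<std::string> &out, size_t wanted)
{
    boost::unique_lock<boost::mutex> lock(bufMutex);
    while (lineBuffer.empty() && !readerDone)
        bufNotEmpty.wait(lock);                    // block only while the buffer is empty
    size_t n = 0;
    while (n < wanted && !lineBuffer.empty()) {
        out.push_back(lineBuffer.front());         // copy data out while holding the lock...
        lineBuffer.pop_front();
        ++n;
    }
    return n;                                      // ...and process it after the lock is released
}
// The reader thread pushes parsed lines into lineBuffer under the same mutex
// and calls bufNotEmpty.notify_all() after each refill.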