Pass condition as argument to an OpenCL kernel - if-statement

I'm trying to write an OpenCL kernel that takes a condition, i.e., a boolean value, as an argument. My objective with this is to write a kernel that behaves in the following way:
__kernel void operation(
    __global double *vec,
    const bool condition, // is this possible?
    const int n)
{
    // ...
    if (condition) {
        // do this ...
    }
    else {
        // do that ...
    }
}
I've found this thread that says it is not possible to do what I want, but the discussion is a bit outdated and I was wondering if anything has changed.
So this is basically it. I'm completely open to suggestions.
Thanks in advance!

Basically you can do exactly that. However, bool is not allowed as a kernel parameter, because its size on the device side is not defined by the OpenCL specification. You have to use char (8-bit) or int (32-bit) as the data type for the condition variable instead. If you set the condition value to 0 (false) or 1 (true), you can plug it into the if(condition) as is.
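For illustration, a minimal host-side sketch (using the OpenCL C++ bindings; the kernel, vecBuffer, and someRuntimeTest names are placeholders), assuming the kernel signature was changed from const bool to const char:

cl_char condition = someRuntimeTest ? 1 : 0; // 0 = false, 1 = true
kernel.setArg(0, vecBuffer);                 // __global double *vec
kernel.setArg(1, condition);                 // const char condition
kernel.setArg(2, static_cast<cl_int>(n));    // const int n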
I assume condition in your case is the same value for all threads. For performance reasons, I would advise making two separate kernels, one for the if branch and one for the else branch, and moving the condition to the C++ side to enqueue either of the two kernels.
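A minimal sketch of that suggestion, assuming the if and else bodies have been split into two kernels (the operation_if and operation_else names are invented here):

// Host-side dispatch: branch once on the CPU instead of in every work-item.
cl::Kernel opIf(program, "operation_if");     // kernel holding the "if" body
cl::Kernel opElse(program, "operation_else"); // kernel holding the "else" body

cl::Kernel& op = condition ? opIf : opElse;
op.setArg(0, vecBuffer);
op.setArg(1, static_cast<cl_int>(n));
queue.enqueueNDRangeKernel(op, cl::NullRange, cl::NDRange(n), cl::NullRange);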

Related

Implementing custom atomic_add() which works with floats

I'm trying to follow section B.12 of https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html for an atomic add that works with floats. Simply copying and pasting the code from there and changing the types to float does not work, because I can't perform the pointer casting from GLOBAL to PRIVATE that is required for the atomicCAS operation. To overcome this I decided to use atomic_xchg(), because it works with floats, with an additional if statement to achieve the same functionality as atomicCAS. However, this returns a different answer every time I run the program when I perform the addition on a large float vector.
I've tried figuring out how to overcome the explicit conversion from GLOBAL to PRIVATE, but I honestly don't know how to do it so that when I perform the addition, the address argument is changed instead of some temp variable.
kernel void atomicAdd_2(volatile global float* address, float value)
{
    float old = *address, assumed;
    do {
        assumed = old;
        if (*address == assumed) {
            old = atomic_xchg(address, value + assumed);
        }
        else {
            old = *address;
        }
        // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
}
This is my implementation of atomicAdd for floats.
kernel void reduce_add(global const float* input, global float* output) {
    float temp = 242.23f;
    atomicAdd_floats(&output[0], temp);
    printf(" %f ", output[0]);
}
This is the function where I supply the arguments to atomicAdd_floats. Note that my input argument contains a vector of floats, and the output argument is simply where I want to store the result, specifically in the first element of the output vector, output[0]. But instead, when I printf(" %f ", output[0]); it shows my default initialisation value, 0.
First of all, I'd suggest removing the "kernel" keyword on the atomicAdd_2 function; "kernel" should be used only on functions you intend to enqueue from the host. Second, there are OpenCL implementations of atomic-add-float on the net.
Then, I'm a bit confused as to what you're trying to do. Are you trying to sum a vector while keeping the sum in a private variable? If so, it makes no sense to use atomicAdd. Private memory is always atomic, because it's private. Atomicity is only required for global and local memory, because they're shared.
Otherwise, I'm not sure why you mention changing the address or the GLOBAL-to-PRIVATE conversion.
Anyway, the code from the link should work, though it's relatively slow. If the vector to sum is large, you might be better off using a different algorithm (with partial sums). Try googling "opencl parallel sum" or such.
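For reference, here is a sketch of the partial-sums idea that answer alludes to. It is only an illustration (the kernel and argument names are invented, and it assumes a power-of-two local size), written as a C++ source string the way it would be handed to the OpenCL runtime:

// Classic work-group reduction: each group writes one partial sum, and the
// host (or a second pass of this kernel) adds up the per-group results.
static const char* reduceSource = R"CLC(
kernel void reduce_partial(global const float* input,
                           global float* partial, // one slot per work-group
                           local float* scratch,  // local_size floats
                           const int n)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    // Load one element per work-item; pad with 0 past the end of the vector.
    scratch[lid] = (gid < n) ? input[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction in local memory: halve the active work-items each step.
    for (int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s) scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Work-item 0 publishes this group's partial sum.
    if (lid == 0) partial[get_group_id(0)] = scratch[0];
}
)CLC";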

Fastest (or most elegant) way of passing constant arguments to a CUDA kernel

Let's say I want a CUDA kernel that needs to do lots of stuff, but there are some parameters that are constant across all the launches. These arguments are passed to the main program as input, so they cannot be defined in a #define.
The kernel will run multiple times (around 65K launches) and it needs those parameters (and some other inputs) to do its maths.
My question is: what's the fastest (or else, the most elegant) way of passing these constants to the kernels?
The constants are float or int arrays of 2 or 3 elements, and there will be around 5-10 of them.
Toy example with two constants, const1 and const2:
__global__ void kernelToyExample(int inputdata, ?????) {
    value = inputdata * const1[0] + const2[1] / const1[2];
}
Is it better to do this:
__global__ void kernelToyExample(int inputdata, float* const1, float* const2) {
    value = inputdata * const1[0] + const2[1] / const1[2];
}
or
__global__ void kernelToyExample(int inputdata, float const1x, float const1y, float const1z, float const2x, float const2y) {
    value = inputdata * const1x + const2y / const1z;
}
or maybe declare them in some global read-only memory and let the kernels read from there? If so, L1, L2, global? Which one?
Is there a better way I don't know of?
Running on a Tesla K40.
Just pass them by value. The compiler will automagically put them in the optimal place to facilitate cached broadcast to all threads in each block - either shared memory in compute capability 1.x devices, or constant memory/constant cache in compute capability >= 2.0 devices.
For example, if you had a long list of arguments to pass to the kernel, a struct passed by value is a clean way to go:
struct arglist {
    float magicfloat_1;
    float magicfloat_2;
    //......
    float magicfloat_19;
    int magicint1;
    //......
};

__global__ void kernel(...., const arglist args)
{
    // you get the idea
}
[standard disclaimer: written in browser, not real code, caveat emptor]
If it turned out one of your magicint actually only took one of a small number of values which you know beforehand, then templating is an extremely powerful tool:
template<int magicconstant1>
__global__ void kernel(....)
{
    for (int i = 0; i < magicconstant1; ++i) {
        // .....
    }
}

template __global__ void kernel<3>(....);
template __global__ void kernel<4>(....);
template __global__ void kernel<5>(....);
The compiler is smart enough to recognise that magicconstant1 makes the loop trip count known at compile time and will automatically unroll the loop for you. Templating is a very powerful technique for building fast, flexible codebases, and you would be well advised to familiarise yourself with it if you haven't already done so.
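One detail left implicit is how a runtime value reaches the compile-time template parameter. A hedged sketch of the usual host-side dispatch (assuming, as above, that the value is known to be 3, 4, or 5):

// Hypothetical launcher: branch once on the host, then launch the matching
// instantiation so the device code sees a compile-time constant.
void launchKernel(int magicconstant1, dim3 grid, dim3 block /*, other args */)
{
    switch (magicconstant1) {
        case 3: kernel<3><<<grid, block>>>(/*....*/); break;
        case 4: kernel<4><<<grid, block>>>(/*....*/); break;
        case 5: kernel<5><<<grid, block>>>(/*....*/); break;
        default: /* handle an unexpected value */     break;
    }
}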

Is std::atomic redundant if you have to check for overflow or act conditionally?

You can safely increment and decrement std::atomic_int, for example. But if you need to check for overflow, or execute some routine conditionally based on the value, then a lock is needed anyway, since you must compare the value first, and the thread might be swapped out just after the comparison succeeds; another thread then modifies the value, and you have a bug.
But if you need a lock anyway, then you can just use a plain integer instead of an atomic. Am I right?
No, you can still use a std::atomic even conditionally.
Firstly, if you use std::atomic<unsigned int> then overflow behaviour is well defined: the value wraps around (although that is possibly not what you want). If you use a signed integer, overflow isn't well defined, but as long as you don't hit it this doesn't matter.
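For instance (a tiny illustration of the wrap-around, not part of the original answer):

#include <atomic>
#include <climits>
#include <cstdio>

int main()
{
    std::atomic<unsigned int> u{UINT_MAX};
    u.fetch_add(1);                // wraps to 0: well defined, no UB
    std::printf("%u\n", u.load()); // prints 0
}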
If you absolutely must check for overflow, or otherwise act conditionally, you can use compare-exchange. This lets you read the value, decide whether you want to do work on it, and then atomically write the updated value back only if it hasn't changed in the meantime. The key part is that the system tells you whether the atomic update failed, in which case you can go back to the start, read the new value, and make the decision again.
As an example, if we only wanted to allow an atomic integer to reach a maximum value of 4 (in some kind of refcounting, for instance), we could do:
#include <atomic>

static std::atomic<int> refcount{0}; // note: copy-initialisation from int is ill-formed before C++17

int val = refcount; // this doesn't need to be in the loop, as on failure compare_exchange_strong updates it
while (true)
{
    if (val == 4)
    {
        // there's already 4 refs here, maybe come back later?
        break;
    }
    int toChangeTo = val + 1;
    if (refcount.compare_exchange_strong(val, toChangeTo))
    {
        // we successfully took a ref!
        break;
    }
    // if we fail here, another thread updated the value whilst we were running; just loop back and try again
}
In the above code you can use compare_exchange_weak instead. It can sometimes spuriously fail, so you need to call it in a loop. However, we have a loop anyway (and in general you always will, as you need to handle genuine failures), so compare_exchange_weak makes a lot of sense here.

Is passing a flag to a function as an "int" or as a "bool" better in terms of performance?

Say I have functions like those shown below:
void caller()
{
    int flag = _getFlagFromConfig();
    // this is a flag, which according to the implementation
    // is supposed to have only two values, 0 and 1 (as of now)
    callee_1(flag);
    callee_2(1 == flag);
}

void callee_1(int flag)
{
    if (1 == flag)
    {
        //do operation X
    }
}

void callee_2(bool flag)
{
    if (flag)
    {
        //do operation X
    }
}
Which of the callee functions will be a better implementation?
I have gone through this link, and I'm pretty convinced that there is not much of a performance impact from using bool for the comparison in an if-condition. But in my case, I have the flag as an integer. In this case, is it worth going for the second callee?
It won't make any difference in terms of performance; however, in terms of readability, if there are only two values then a bool makes more sense. Especially if you name your flag something sensible like isFlagSet.
In terms of efficiency, they should be the same.
Note however that they don't do the same thing: you can pass a value other than 1 to the first function, and its condition will evaluate to false even though the argument itself is not "false". The extra comparison could in principle add some overhead, but probably won't.
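A hypothetical call site makes the difference concrete:

callee_1(2); // 1 == 2 is false, so operation X does NOT run
callee_2(2); // the int 2 converts to bool true, so operation X DOES run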
So let's assume the following case:
void callee_1(int flag)
{
    if (flag)
    {
        //do operation X
    }
}

void callee_2(bool flag)
{
    if (flag)
    {
        //do operation X
    }
}
In this case, technically the first variant would be faster, since bool values aren't checked directly for true or false but are promoted to a word-sized type and then checked against 0. Although the generated assembly might be the same, the processor theoretically does more work for the bool option.
If the value or argument is being used as a boolean, declare it bool. The probability of it making any difference in performance is almost 0, and the use of bool documents your intent, both to the reader and to the compiler.
Also, if you have an int which is being used as a flag (due to an existing interface): either use the implicit conversion (if the interface documents it as a boolean), or compare it with 0 (not with 1). This conforms to the older definition of how int served as a boolean (before the days when C++ had bool).
One case where the difference between bool and int results in different (optimized) asm is the negation operator ("!").
For "!b", If b is a bool, the compiler can assume that the integer value is either 1 or 0, and the negation can thus be a simple "b XOR 1". OTOH, if b is an integer, the compiler, barring data-flow analysis, must assume that the variable may contain any integer value, and thus to implement the negation it must generate code such as
(b != 0) ? 0: 1
That being said, code where negation is a performance-critical operation is quite rare, I'm sure.

Multithreading: do I need to protect my variable in a read-only method?

I have a few questions about using locks to protect my shared data structure. I am using C/C++/ObjC/ObjC++.
For example, I have a counter class that is used in a multi-threaded environment:
class MyCounter {
private:
    int counter;
    std::mutex m;
public:
    int getCount() const {
        return counter;
    }
    void increase() {
        std::lock_guard<std::mutex> lk(m);
        counter++;
    }
};
Do I need to use std::lock_guard<std::mutex> lk(m); in the getCount() method to make it thread-safe?
What happens if there are only two threads, a reader thread and a writer thread: do I have to protect it at all? Since only one thread modifies the variable, I think no lost update will happen.
If there are multiple writers/readers of a shared primitive-type variable (e.g. int), what disaster may happen if I only lock in the write method but not in the read method? Does an 8-bit type make any difference compared to a 64-bit type?
Are any primitive types atomic by default? For example, is a write to a char always atomic? (I know this is true in Java, but I don't know about C++; I am using the LLVM compiler on a Mac, if the platform matters.)
Yes, unless you can guarantee that changes to the underlying variable counter are atomic, you need the mutex.
Classic example: say counter is a two-byte value that's incremented in (non-atomic) stages:

(a) add 1 to the lower byte
if the lower byte is 0:
(b) add 1 to the upper byte

and the initial value is 255.
If another thread comes in anywhere between the lower-byte change (a) and the upper-byte change (b), it will read 0 rather than the correct 255 (pre-increment) or 256 (post-increment).
In terms of what data types are atomic, the latest C++ standard defines them in the <atomic> header.
If you don't have C++11 capabilities, then it's down to the implementation what types are atomic.
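A minimal sketch of the locked read this answer calls for; note that the member would have to be declared mutable std::mutex m; so that the const method can lock it:

int getCount() const {
    std::lock_guard<std::mutex> lk(m); // requires: mutable std::mutex m;
    return counter;
}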
Yes, you would need to lock the read as well in this case.
There are several alternatives; a lock is quite heavyweight here. Atomic operations are the most obvious alternative (lock-free). There are also other locking approaches that fit this design; a read-write lock is one example.
Yes, I believe that you do need to lock the read as well. But since you are using C++11 features, why don't you use std::atomic<int> counter; instead?
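Following that suggestion, the counter could look like this (a sketch, not code from the answer):

#include <atomic>

class MyCounter {
private:
    std::atomic<int> counter{0};
public:
    int getCount() const {
        return counter.load(); // atomic read, no mutex needed
    }
    void increase() {
        counter.fetch_add(1);  // atomic increment, no mutex needed
    }
};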
As a rule of thumb, you should lock the read too.
Reads and writes to an int are atomic on most architectures (and since int typically matches the machine's word size, you should almost never see a corrupted int).
Yet the answer from @paxdiablo is correct, and the problem will happen if someone does this:
#pragma pack(push, 1)
struct MyObj
{
    char a;
    MyCounter cnt;
};
#pragma pack(pop)
In that specific case, cnt will not be aligned to a word boundary, and the int MyCounter::counter will (or might) be accessed in multiple operations on CPUs that support unaligned access (like x86). Thus, you could get this sequence of operations:
Thread A: [...] sets counter to 255 (counter is 0x000000FF)
Thread A: getCount() => CPU reads low byte: lo = 255
<interrupted here>
Thread B: increase() => counter is incremented (counter = 256 = 0x00000100)
<interrupted here>
Thread A: CPU reads high bytes: 0x000001, concatenates: 0x000001FF, returns 511!
Now, let's say you never use unaligned access. Yet, if you are doing something like this:
ThreadA.cpp:
int g = clientCounter.getCount();
while (g > 0)
{
    processFirstClient();
    g = clientCounter.getCount();
}
ThreadB.cpp:
if (acceptClient()) clientCounter.increase();
The compiler is completely allowed to replace the loop in Thread A by this:
if (clientCounter.getCount())
    while (true) processFirstClient();
Why? Because for each expression the compiler evaluates its side effects. getCount() is so simple that the compiler will deduce: it's a read of a single variable, and that variable is not modified anywhere in ThreadA.cpp, therefore it's constant. And because it's constant, the loop condition can be simplified away.
If you add a mutex, the mutex code inserts a memory barrier, which tells the compiler "hey, don't assume anything about memory after this barrier is crossed".
Thus the "optimization" above cannot happen, since the value getCount() reads might have been modified.
Sure, you could have declared the counter as volatile int instead, and the compiler would have avoided this optimization too.
In the end, if you have to write a ton of code just to avoid a mutex, you're doing it wrong (and probably will get wrong results).
You can't guarantee that multiple threads won't modify your variable at the same time, and if such a situation occurs, your variable may be garbled or the program might crash. In order to avoid such cases, it's always better and safer to make the program thread-safe.
You can use the synchronization techniques available, such as mutexes, locks, and synchronization attributes (available in MS C++).