bit test and set (BTS) on a tbb atomic variable - c++

I want to do bitTestAndSet on a tbb atomic variable.
atomic.h from tbb does not seem to have any bit operations.
If I treat the tbb atomic variable as a normal pointer and call __sync_or_and_fetch on it, the gcc compiler doesn't allow that.
Is there a workaround for this?
Related question:
assembly intrinsic for bit test and set (BTS)

A compare_and_swap loop can be used, like this:
// Atomically perform i|=j. Return previous value of i.
int bitTestAndSet( tbb::atomic<int>& i, int j ) {
    int o = i; // Atomic read (o = "old value")
    while( (o|j)!=o ) { // Loop exits if another thread sets the bits
        int k = o;
        o = i.compare_and_swap(k|j,k);
        if( o==k ) break; // Successful swap
    }
    return o;
}
Note that if the bits are already set when the while condition is first tested, the loop body never runs and only the initial atomic read (an acquire fence, not a full fence) occurs. Whether that matters depends on context.
If there is risk of high contention, then some sort of backoff scheme should be used in the loop. TBB uses a class atomic_backoff for contention management internally, but it's not currently part of the public TBB API.
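If you do need backoff, a minimal exponential-backoff variant of the loop might look like this sketch (the _mm_pause intrinsic is x86-specific, and the pause counts and the cap of 16 are illustrative choices, not TBB's actual tuning):

#include <immintrin.h> // _mm_pause (x86)
#include <tbb/tbb.h>

// Hypothetical variant of bitTestAndSet that backs off under contention.
int bitTestAndSetWithBackoff( tbb::atomic<int>& i, int j ) {
    int delay = 1;
    int o = i; // Atomic read
    while( (o|j)!=o ) {
        int k = o;
        o = i.compare_and_swap(k|j,k);
        if( o==k ) break; // Successful swap
        for( int p=0; p<delay; ++p )
            _mm_pause();          // briefly yield the core's resources
        if( delay<16 ) delay*=2;  // exponential growth, capped
    }
    return o;
}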
There is a second way, if portability is not a concern and you are willing to exploit the undocumented fact that the layout of a tbb::atomic<T> and a T are the same on x86 platforms. In that case, just operate on the tbb::atomic using assembly code. The program below demonstrates this technique:
#include <tbb/tbb.h>
#include <cstdio>

// Sets the given bit and returns its previous value (0 or 1).
// The lock prefix makes the bit-test-and-set atomic across threads.
inline int SetBit(int array[], int bit) {
    int x = 1, y = 0;
    asm("lock bts %2,%0\n\tcmovc %3,%1" : "+m" (*array), "+r" (y) : "r" (bit), "r" (x));
    return y;
}

tbb::atomic<int> Flags;
volatile int Result;

int main() {
    for( int i=0; i<16; ++i ) {
        int k = i*i%32;
        std::printf("bit at %2d was %d. Flags=%8x\n", k, SetBit((int*)&Flags,k), +Flags);
    }
}

Related

Adding double within a parallel loop - std::atomic<double>

I have parallel code that does some computation and then adds a double to a double variable outside the loop. I tried using std::atomic, but it does not have support for arithmetic operations on std::atomic<double> variables.
double dResCross = 0.0;
std::atomic<double> dResCrossAT = 0.0;

Concurrency::parallel_for(0, iExperimentalVectorLength, [&](size_t m)
{
    double value;
    //some computation of the double value
    atomic_fetch_add(&dResCrossAT, value);
});
dResCross += dResCrossAT;
Simply writing
dResCross += value;
obviously outputs nonsense. My question is: how can I solve this problem without making the code serial?
A typical way to atomically perform arithmetic operations on a floating-point type is with a compare-and-swap (CAS) loop.
double value;
//some computation of the double value
double expected = atomic_load(&dResCrossAT);
while (!atomic_compare_exchange_weak(&dResCrossAT, &expected, expected + value));
A detailed explanation can be found in Jeff Preshing's article about this class of operation.
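Packaged as a reusable helper, the same loop might look like this sketch (the name atomic_fetch_add_double is made up for illustration; since C++20, std::atomic<double> supports fetch_add directly):

#include <atomic>

// Illustrative helper: atomically adds 'value' to 'target' and returns
// the value the atomic held before the addition.
double atomic_fetch_add_double(std::atomic<double>& target, double value)
{
    double expected = target.load();
    // On failure, compare_exchange_weak reloads 'expected' with the
    // current value, so each retry uses fresh data.
    while (!target.compare_exchange_weak(expected, expected + value))
        ;
    return expected;
}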
I believe preventing partial (torn) writes to a non-atomic variable requires mutexing. I am not certain that is the only way to ensure there is no write conflict, but it is accomplished like this:
#include <mutex>
#include <thread>

std::mutex mtx;

void threadFunction(double* d) {
    while (true) {
        mtx.lock();
        if (*d >= 100) { // read *d under the lock too
            mtx.unlock();
            break;
        }
        *d += 1.0;
        mtx.unlock();
    }
}

int main() {
    double* d = new double(0);
    std::thread thread(threadFunction, d);
    thread.join(); // wait for the worker instead of spinning on an unsynchronized read
    delete d;
}
This will add 1.0 to d 100 times in a thread-safe way. The mutex locking and unlocking ensure that only one thread is accessing d at a given time. However, this is significantly slower than an atomic equivalent because locking and unlocking are expensive. Reported figures vary with the operating system, the specific processor, and what is being locked, but it is in the neighborhood of 50 clock cycles for a case like this, and a contended lock can require a system call costing more like 2000 clock cycles.
Moral: use with caution.
If your vector has many elements per thread, you should consider implementing a reduction rather than using an atomic operation for every element. Atomic operations are much more expensive than normal stores.
double global_value{0.0};
std::vector<double> private_values(num_threads, 0.0);

// Phase 1: each thread accumulates into its own slot (pseudocode).
parallel_for(size_t k=0; k<n; ++k) {
    private_values[my_thread] += ...;
}

// Phase 2: one thread combines the partial sums.
if (my_thread==0) {
    for (int t=0; t<num_threads; ++t) {
        global_value += private_values[t];
    }
}
This algorithm requires no atomic operations and will be faster in many cases. You can replace the second phase with a tree or atomics if the thread count is very high (e.g. on a GPU).
Concurrency libraries like TBB and Kokkos both provide parallel reduce templates that do the right thing internally.
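With TBB, for instance, the whole pattern above collapses to one call. A minimal sketch, assuming the per-element values have been materialized in a std::vector<double>:

#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <functional>
#include <vector>

double parallel_sum(const std::vector<double>& v)
{
    return tbb::parallel_reduce(
        tbb::blocked_range<size_t>(0, v.size()),
        0.0,
        // Sum one subrange, starting from a running partial sum.
        [&](const tbb::blocked_range<size_t>& r, double partial) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                partial += v[i];
            return partial;
        },
        std::plus<double>()); // combine partial sums from different threads
}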

Thread safety while looping with OpenMP

I'm working on a small Collatz conjecture calculator using C++ and GMP, and I'm trying to implement parallelism on it using OpenMP, but I'm coming across issues regarding thread safety. As it stands, attempting to run the code will yield this:
*** Error in `./collatz': double free or corruption (fasttop): 0x0000000001140c40 ***
*** Error in `./collatz': double free or corruption (fasttop): 0x00007f4d200008c0 ***
[1] 28163 abort (core dumped) ./collatz
This is the code to reproduce the behaviour.
#include <iostream>
#include <gmpxx.h>

mpz_class collatz(mpz_class n) {
    if (mpz_odd_p(n.get_mpz_t())) {
        n *= 3;
        n += 1;
    } else {
        n /= 2;
    }
    return n;
}

int main() {
    mpz_class x = 1;
    #pragma omp parallel
    while (true) {
        //std::cout << x.get_str(10);
        while (true) {
            if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
            x = collatz(x);
        }
        x++;
        //std::cout << " OK" << std::endl;
    }
}
Given that I did not get this error when I uncommented the (slow) outputs to screen, I assume the issue at hand has to do with thread safety, and in particular with concurrent threads trying to increment x at the same time.
Am I correct in my assumptions? How can I fix this and make it safe to run?
I assume what you want to do is to check whether the Collatz conjecture holds for all numbers. The program you posted is wrong on many levels, both serially and in parallel.
if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
means that it will break when x != 1. If you replace it with the correct 0 == mpz_cmp_ui, the code will just continue to test 2 over and over again. You need two variables anyway: one for the outer loop representing the value you want to check, and one for the inner loop performing the check. It's easier to get this right if you make a function for it:
void check_collatz(mpz_class n) {
    while (n != 1) {
        n = collatz(n);
    }
}

int main() {
    mpz_class x = 1;
    while (true) {
        std::cout << x.get_str(10);
        check_collatz(x);
        x++;
    }
}
The while (true) loop is hard to reason about and parallelize, so let's make an equivalent for loop:
for (mpz_class x = 1;; x++) {
    check_collatz(x);
}
Now we can talk about parallelizing the code. The basis for OpenMP parallelization is a worksharing construct; you cannot just slap #pragma omp parallel on a while loop. Fortunately, you can easily mark certain canonical for loops with #pragma omp parallel for. For that, however, you cannot use mpz_class as the loop variable, and you must specify an end for the loop:
#pragma omp parallel for
for (long check = 1; check <= std::numeric_limits<long>::max(); check++)
{
    check_collatz(check);
}
Note that check is implicitly private: there is a copy for each thread working on it. Also, OpenMP will take care of distributing the work [1 ... 2^63] among threads. When a thread calls check_collatz, a new, private mpz_class object will be created for it.
Now, you might notice that repeatedly creating a new mpz_class object in each loop iteration is costly (memory allocation). You can avoid that (by breaking check_collatz apart again) and create a thread-private mpz_class working object. For this, you split the combined parallel for into separate parallel and for pragmas:
#include <gmpxx.h>
#include <iostream>
#include <limits>

// Avoid copying objects by taking and modifying a reference
void collatz(mpz_class& n)
{
    if (mpz_odd_p(n.get_mpz_t()))
    {
        n *= 3;
        n += 1;
    }
    else
    {
        n /= 2;
    }
}

int main()
{
    #pragma omp parallel
    {
        mpz_class x;
        #pragma omp for
        for (long check = 1; check <= std::numeric_limits<long>::max(); check++)
        {
            // Note: The structure of this fits perfectly in a for loop.
            for (x = check; x != 1; collatz(x));
        }
    }
}
Note that declaring x inside the parallel region makes sure it is implicitly private and properly initialized. You should prefer that to declaring it outside and marking it private. The latter often leads to confusion, because explicitly private variables from an outside scope are uninitialized.
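A hypothetical snippet to illustrate the pitfall:

mpz_class x = 42;
#pragma omp parallel private(x)
{
    // Each thread's private x is a fresh, default-constructed mpz_class
    // (value 0), NOT a copy of the outer x. Use firstprivate(x) if the
    // outer value should be copied in.
}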
You might complain that this only checks the first 2^63 numbers. Just let it run. This gives you enough time to master OpenMP to expert level and write your own custom worksharing for GMP objects.
You were concerned about having extra objects for each thread. This is essential for good performance. You cannot solve this efficiently with locks/critical sections/atomics. You would have to protect each and every read and write to your only relevant variable. There would be no parallelism left.
Note: The huge for loop will likely have a load imbalance. So some threads will probably finish a few centuries earlier than the others. You could fix that with dynamic scheduling, or smaller static chunks.
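A minimal sketch of the dynamic-scheduling fix (the chunk size of 10000 is an arbitrary choice):

// Threads grab chunks of 10000 iterations on demand, so a thread that
// happens to get short Collatz sequences simply takes more chunks.
#pragma omp parallel for schedule(dynamic, 10000)
for (long check = 1; check <= std::numeric_limits<long>::max(); check++)
{
    check_collatz(check);
}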
Edit: For academic interest, here is one idea of how to implement the worksharing directly on GMP objects:
#pragma omp parallel
{
    // Note this is not a "parallel" loop;
    // these are just separate loops over distinct, strided ranges.
    int nthreads = omp_get_num_threads();
    mpz_class check = 1;
    // we already checked those in the other program
    check += std::numeric_limits<long>::max();
    check += omp_get_thread_num();
    mpz_class x;
    for (; ; check += nthreads)
    {
        // Note: The structure of this fits perfectly in a for loop.
        for (x = check; x != 1; collatz(x));
    }
}
You could well be right about collisions with x. You can mark x as private by:
#pragma omp parallel private(x)
This way each thread gets its own "version" of the variable x, which should make this thread-safe. By default, variables declared before a #pragma omp parallel are shared, so there is one instance shared between all of the threads.
You might want to touch x only with atomic instructions.
#pragma omp atomic
x++;
This ensures that all threads see the same value of x without requiring mutexes or other synchronization techniques.

atomic compare and conditionally subtract if less

I manage some memory that is used by concurrent threads, and I have a variable
unsigned int freeBytes
When I request some memory from a task
unsigned int bytesNeeded
I must check if
bytesNeeded<=freeBytes
and if so, keep the old value of freeBytes and atomically subtract bytesNeeded from freeBytes.
Does the C++ atomic library or x86 offer such a possibility?
Use an atomic compare-and-swap operation. In pseudo-code:
do {
    unsigned int n = load(freeBytes);
    if (n < bytesNeeded) { return NOT_ENOUGH_MEMORY; }
    unsigned int new_n = n - bytesNeeded;
} while (!compare_and_swap(&freeBytes, n, new_n));
With real C++ <atomic> variables the actual code looks pretty similar:
#include <atomic>

// Global counter for the amount of available bytes
std::atomic<unsigned int> freeBytes; // global

// Attempt to decrement the counter by bytesNeeded; returns whether
// decrementing succeeded.
bool allocate(unsigned int bytesNeeded)
{
    for (unsigned int n = freeBytes.load(); ; )
    {
        if (n < bytesNeeded) { return false; }
        unsigned int new_n = n - bytesNeeded;
        if (freeBytes.compare_exchange_weak(n, new_n)) { return true; }
    }
}
(Note that the final compare_exchange_weak takes the first argument by reference and updates it with the current value of the atomic variable in the event that the exchange fails.)
By contrast, incrementing the value ("deallocate"?) can be done with a simple atomic addition (unless you want to check for overflow). This is to some extent symptomatic of lock-free containers: creating something is relatively easy, assuming infinite resources, but removing requires trying in a loop.
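A matching increment needs no loop. A sketch (the function name deallocate is illustrative):

// Returning memory only ever increases the counter, so a plain atomic
// addition suffices; there is no precondition to re-check.
void deallocate(unsigned int bytesReturned)
{
    freeBytes.fetch_add(bytesReturned);
}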

how do i write assembly in C++ for adding (time test)

I want to time how much slower it is if I were to do a simple operation like 1+1 versus plus(int l, int r), which does l+r and throws an exception on overflow. Here's some example code with _C and _V being carry and overflow. The exception code can be written differently if you like.
How do I write it so I can quickly test for carry/overflow and throw an exception if it is set?
I never did jumps (or even some of the basics) in assembly, so I'm a bit clueless even after googling.
This should work on 32-bit machines. Currently I am running on an Intel Core Duo, which has the x86 and x86-64 instruction sets.
unsigned int plus(unsigned int l, unsigned int r){
    unsigned int v = l+r;
    if (!_C) return v;
    throw 1;
}

int plus(int l, int r){
    int v = l+r;
    if (!_V) return v;
    throw 1;
}
Do you want C/C++ code to perform these operations, or do you want to know how to do it in x86 assembly language?
In C/C++, determining the carry is easy:
int _C = (v < l); // "v < r" works too
The overflow is a bit more complicated. Normally overflow is flagged when the two operands have the same sign yet the result has a different sign. On two's complement architectures such as x86, this can be written as:
int _V = ((l ^ r) >= 0) && ((l ^ v) < 0);
The MSB (sign bit) of l ^ r will be 0 if and only if the signs agree, and similarly l ^ v will have its sign bit set (i.e., be less than zero) if and only if l and v have opposite signs.
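Folded back into the question's functions, a sketch that replaces the _C/_V placeholders with the expressions above:

unsigned int plus(unsigned int l, unsigned int r) {
    unsigned int v = l + r;
    if (v < l) throw 1; // carry: the sum wrapped around
    return v;
}

int plus(int l, int r) {
    // Add in unsigned arithmetic to avoid signed-overflow UB, then
    // convert back; this assumes two's complement, as the text above does.
    int v = (int)((unsigned)l + (unsigned)r);
    if (((l ^ r) >= 0) && ((l ^ v) < 0)) throw 1; // overflow
    return v;
}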
If you want to write it in assembly, you just do the add and use a jc or jo respectively to jump to the carry/overflow handler. However, you can't easily throw C++ exceptions from assembly code. The easiest way to do this is probably to write a simple one-line function in C++ that throws the exception and call that from your assembly code. The final asm code will look something like this:
    ; Assuming eax=l, ebx=r
    add eax, ebx
    jc .have_carry
    ; Continue computation here...

.have_carry:
    call throw_overflow_exception
with the following C++ helper function defined somewhere:
extern "C" void throw_overflow_exception()
{
throw 1; // or some other exception
}
You need the extern "C" to disable C++ name mangling for this function. There are other conventions too (e.g. some compilers add an underscore before or after C function names); this depends on the compiler and the architecture.
Shift both inputs right by one, add them, then test the left-most bit:
l >>= 1;
r >>= 1;
int result = l + r;
if (result >> ((sizeof(int)*8) - 1)) { /* handle overflow */ }
Here is the test code I ended up using.
-edit- I tested on VS 2010 outside of the IDE and in GCC. gcc is a lot faster, as VC optimizes the variables poorly. (VS C++ moves the register back into v when assembly is used; it doesn't need to do that.) gcc shows little difference between the three tests; the results varied so much from run to run that they all looked like the same test, and I couldn't tell them apart. VS showed checking the C flag without asm to be about 3x slower, and with asm 6x slower.
#include <stdio.h>
#include <intrin.h>

int main(int argc, char* argv[])
{
    try {
        for(int n=0; n<10; n++){
            volatile unsigned int v, r;
            sscanf("0", "%d", &v);
            sscanf("0", "%d", &r);
            __int64 time = 0xFFFFFFFF;
            __int64 start = __rdtsc();
            for(int i=0; i<100000000; i++)
            {
                v=v+v;
#if 1
                __asm jc e
                continue;
e:
                throw 1;
#endif
            }
            __int64 end = __rdtsc();
            time = end - start;
            printf("time: %I64d\n", time/10000000);
        }
    }
    catch(int v)
    {
        printf("exception");
    }
    return 0;
}

Intel Inspector reports a data race in my spinlock implementation

I made a very simple spinlock using the Interlocked functions in Windows and tested it on a dual-core CPU (two threads incrementing a variable).
The program seems to work OK (it gives the same result every time, which is not the case when no synchronization is used), but Intel Parallel Inspector says that there is a race condition at value += j (see the code below). The warning disappears when using Critical Sections instead of my SpinLock.
Is my implementation of SpinLock correct or not? It's really strange, because all the operations used are atomic and have the proper memory barriers, so it shouldn't lead to race conditions.
class SpinLock
{
    int *lockValue;

public:
    SpinLock(int *value) : lockValue(value) { }

    void Lock() {
        while(InterlockedCompareExchange((volatile LONG*)lockValue, 1, 0) != 0) {
            WaitABit();
        }
    }

    void Unlock() { InterlockedExchange((volatile LONG*)lockValue, 0); }
};
The test program:
#include <windows.h>
#include <iostream>

static const int THREADS = 2;

HANDLE completedEvents[THREADS];
int value = 0;
int lock = 0; // Global.

DWORD WINAPI TestThread(void *param) {
    HANDLE completed = (HANDLE)param;
    SpinLock testLock(&lock);
    for(int i = 0; i < 1000*20; i++) {
        for(int j = 0; j < 10*10; j++) {
            // Add something to the variable.
            testLock.Lock();
            value += j;
            testLock.Unlock();
        }
    }
    SetEvent(completed);
    return 0; // the thread routine must return a DWORD
}

int main() {
    for(int i = 0; i < THREADS; i++) {
        completedEvents[i] = CreateEvent(NULL, true, false, NULL);
    }
    for(int i = 0; i < THREADS; i++) {
        DWORD id;
        CreateThread(NULL, 0, TestThread, completedEvents[i], 0, &id);
    }
    WaitForMultipleObjects(THREADS, completedEvents, true, INFINITE);
    std::cout << value;
}
Parallel Inspector's documentation for data races suggests using a critical section or a mutex to fix races on Windows. There's nothing in it which suggests that Parallel Inspector knows how to recognise any other locking mechanism you might invent.
Tools for analysing novel locking mechanisms tend to be static tools which look at every possible path through the code; Parallel Inspector's documentation implies that it executes the code once.
If you want to experiment with novel locking mechanisms, the most common tool I've seen used in academic literature is the Spin model checker. There's also ESP, which might reduce the state space, but I don't know if it's been applied to concurrent problems, and also the mobility workbench, which would give an analysis if you can couch your problem in pi-calculus. Intel Parallel Inspector doesn't seem anything like as complicated as these tools; rather, it is designed to check for commonly occurring issues using heuristics.
For other poor folks in a similar situation to me: Intel DOES provide a set of includes and libraries for doing exactly this sort of thing. Check the Inspector installation directory (you'll see \include, \lib32 and \lib64 there) for those materials. Documentation on how to use them (as of June 2018, though Intel cares nothing about keeping links consistent):
https://software.intel.com/en-us/inspector-user-guide-windows-apis-for-custom-synchronization
There are 3 functions:
void __itt_sync_acquired(void *addr)
void __itt_sync_releasing(void *addr)
void __itt_sync_destroy(void *addr)
I'm pretty sure it should be implemented as follows:
class SpinLock
{
    long lockValue;

public:
    SpinLock(long value) : lockValue(value) { }

    void Lock() {
        while(InterlockedCompareExchange(&lockValue, 1, 0) != 0) {
            WaitABit();
        }
    }

    void Unlock() { InterlockedExchange(&lockValue, 0); }
};
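For illustration, those annotations might be woven into the lock like this. This is a sketch, assuming ittnotify.h from the Inspector \include directory and linking against the matching library; the class name and the WaitABit stand-in are inventions here:

#include <windows.h>
#include <ittnotify.h>

// Stand-in for the question's WaitABit().
inline void WaitABit() { Sleep(0); }

class AnnotatedSpinLock
{
    volatile LONG lockValue;

public:
    AnnotatedSpinLock() : lockValue(0) { }
    ~AnnotatedSpinLock() { __itt_sync_destroy((void*)&lockValue); }

    void Lock() {
        while(InterlockedCompareExchange(&lockValue, 1, 0) != 0) {
            WaitABit();
        }
        // Tell Inspector this address was just acquired as a lock.
        __itt_sync_acquired((void*)&lockValue);
    }

    void Unlock() {
        // Tell Inspector this address is about to be released.
        __itt_sync_releasing((void*)&lockValue);
        InterlockedExchange(&lockValue, 0);
    }
};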