Atomic counter in gcc - c++

I must be just having a moment, because this should be easy but I can't seem to get it working right.
Whats the correct way to implement an atomic counter in GCC?
i.e. I want a counter that runs from zero to 4 and is thread safe.
I was doing this (which is further wrapped in a class, but not here)
static volatile int _count = 0;
const int limit = 4;
int get_count(){
// Create a local copy of diskid
int save_count = __sync_fetch_and_add(&_count, 1);
if (save_count >= limit){
__sync_fetch_and_and(&_count, 0); // Set it back to zero
}
return save_count;
}
But it's running from 1 through from 1 - 4 inclusive then around to zero.
It should go from 0 - 3. Normally I'd do a counter with a mod operator but I don't
know how to do that safely.
Perhaps this version is better. Can you see any problems with it, or offer
a better solution.
int get_count(){
// Create a local copy of diskid
int save_count = _count;
if (save_count >= limit){
__sync_fetch_and_and(&_count, 0); // Set it back to zero
return 0;
}
return save_count;
}
Actually, I should point out that it's not absolutely critical that each thread get a different value. If two threads happened to read the same value at the same time that wouldn't be a problem. But they can't exceed limit at any time.

Your code isn't atomic (and your second get_count doesn't even increment the counter value)!
Say count is 3 at the start and two threads simultaneously call get_count. One of them will get his atomic add done first and increments count to 4. If the second thread is fast enough, it can increment it to 5 before the first thread resets it to zero.
Also, in your wraparound processing, you reset count to 0 but not save_count. This is clearly not what's intended.
This is easiest if limit is a power of 2. Don't ever do the reduction yourself, just use
return (unsigned) __sync_fetch_and_add(&count, 1) % (unsigned) limit;
or alternatively
return __sync_fetch_and_add(&count, 1) & (limit - 1);
This only does one atomic operation per invocation, is safe and very cheap. For generic limits, you can still use %, but that will break the sequence if the counter ever overflows. You can try using a 64-bit value (if your platform supports 64-bit atomics) and just hope it never overflows; this is a bad idea though. The proper way to do this is using an atomic compare-exchange operation. You do this:
int old_count, new_count;
do {
old_count = count;
new_count = old_count + 1;
if (new_count >= limit) new_count = 0; // or use %
} while (!__sync_bool_compare_and_swap(&count, old_count, new_count));
This approach generalizes to more complicated sequences and update operations too.
That said, this type of lockless operation is tricky to get right, relies on undefined behavior to some degree (all current compilers get this right, but no C/C++ standard before C++0x actually has a well-defined memory model) and is easy to break. I recommend using a simple mutex/lock unless you've profiled it and found it to be a bottleneck.

You're in luck, because the range you want happens to fit into exactly 2 bits.
Easy solution: Let the volatile variable count up forever. But after you read it, use just the lowest two bits (val & 3). Presto, atomic counter from 0-3.

It's impossible to create anything atomic in pure C, even with volatile. You need asm. C1x will have special atomic types, but until then you're stuck with asm.

You have two problems.
__sync_fetch_and_add will return the previous value (i.e., before adding one). So at the step where _count becomes 3, your local save_count variable is getting back 2. So you actually have to increment _count up to 4 before it'll come back as a 3.
But even on top of that, you're specifically looking for it to be >= 4 before you reset it back to 0. That's just a question of using the wrong limit if you're only looking for it to get as high as three.

Related

Question regarding race conditions while multithreading

So I'm reading through a book an in this chapter that goes over multithreading and concurrency they gave me a question that does not really make sense to me.
I'm suppose to create 3 functions with param x that simply calculates x * x; one using mutex, one using atomic types, and one using neither. And create 3 global variables holding the values.
The first two functions will prevent race conditions but the third might not.
After that I create N threads and then loop through and tell each thread to calculate it's x function (3 separate loops, one for each function. So I'm creating N threads 3 times)
Now the book tells me that using function 1 & 2 I should always get the correct answer but using function 3 I won't always get the right answer. However, I am always getting the right answer for all of them. I assume this is because I am just calculating x * x which is all it does.
As an example, when N=3, the correct value is 0 * 0 + 1 * 1 + 2 * 2 = 5.
this is the atomic function:
void squareAtomic(atomic<int> x)
{
accumAtomic += x * x;
}
And this is how I call the function
thread threadsAtomic[N]
for (int i = 0; i < N; i++) //i will be the current thread that represents x
{
threadsAtomic[i] = thread(squareAtomic, i);
}
for (int i = 0; i < N; i++)
{
threadsAtomic[i].join();
}
This is the function that should sometimes create race conditions:
void squareNormal(int x)
{
accumNormal += x * x;
}
Heres how I call that:
thread threadsNormal[N];
for (int i = 0; i < N; i++) //i will be the current thread that represents x
{
threadsNormal[i] = thread(squareNormal, i);
}
for (int i = 0; i < N; i++)
{
threadsNormal[i].join();
}
This is all my own code so I might not be doing this question correctly, and in that case I apologize.
One problem with race conditions (and with undefined behavior in general) is that their presence doesn't guarantee that your program will behave incorrectly. Rather, undefined behavior only voids the guarantee that your program will behave according to rules of the C++ language spec. That can make undefined behavior very difficult to detect via empirical testing. (Every multithreading-programmer's worst nightmare is the bug that was never seen once during the program's intensive three-month testing period, and only appears in the form of a mysterious crash during the big on-stage demo in front of a live audience)
In this case your racy program's race condition comes in the form of multiple threads reading and writing accumNormal simultaneously; in particular, you might get an incorrect result if thread A reads the value of accumNormal, and then thread B writes a new value to accumNormal, and then thread A writes a new value to accumNormal, overwriting thread B's value.
If you want to be able to demonstrate to yourself that race conditions really can cause incorrect results, you'd want to write a program where multiple threads hammer on the same shared variable for a long time. For example, you might have half the threads increment the variable 1 million times, while the other half decrement the variable 1 million times, and then check afterwards (i.e. after joining all the threads) to see if the final value is zero (which is what you would expect it to be), and if not, run the test again, and let that test run all night if necessary. (and even that might not be enough to detect incorrect behavior, e.g. if you are running on hardware where increments and decrements are implemented in such a way that they "just happen to work" for this use case)

Are there any more efficient ways for atomically adding two floats?

I have a bundle of floats which get updated by various threads. Size of the array is much larger than the number of threads. Therefore simultaneous access on particular floats is rather rare. I need a solution for C++03.
The following code atomically adds a value to one of the floats (live demo). Assuming it works it might be the best solution.
The only alternative I can think of is dividing the array into bunches and protecting each bunch by a mutex. But I don't expect the latter to be more efficient.
My questions are as follows. Are there any alternative solutions for adding floats atomically? Can anyone anticipate which is the most efficient? Yes, I am willing to do some benchmarks. Maybe the solution below can be improved by relaxing the memorder constraints, i.e. exchanging __ATOMIC_SEQ_CST by something else. I have no experience with that.
void atomic_add_float( float *x, float add )
{
int *ip_x= reinterpret_cast<int*>( x ); //1
int expected= __atomic_load_n( ip_x, __ATOMIC_SEQ_CST ); //2
int desired;
do {
float sum= *reinterpret_cast<float*>( &expected ) + add; //3
desired= *reinterpret_cast<int*>( &sum );
} while( ! __atomic_compare_exchange_n( ip_x, &expected, desired, //4
/* weak = */ true,
__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST ) );
}
This works as follows. At //1 the bit-pattern of x is interpreted as an int, i.e. I assume that float and int have the same size (32 bits). At //2 the value to be increased is loaded atomically. At //3 the bit-pattern of the int is interpreted as float and the summand is added. (Remember that expected contains a value found at ip_x == x.) This doesn't change the value under ip_x == x. At //4 the result of the summation is stored only at ip_x == x if no other thread changed the value, i.e. if expected == *ip_x (docu). If this is not the case the do-loop continues and expected contains the updated value found ad ip_x == x.
GCC's functions for atomic access (__atomic_load_n and __atomic_compare_exchange_n) can easily be exchanged by other compiler's implementations.
Are there any alternative solutions for adding floats atomically? Can anyone anticipate which is the most efficient?
Sure, there are at least few that come to mind:
Use synchronization primitives, i.e. spinlocks. Will be a bit slower than compare-exchange.
Transactional extension (see Wikipedia). Will be faster, but this solution might limit the portability.
Overall, your solution is quire reasonable: it is fast and yet will work on any platform.
In my opinion the needed memory orders are:
__ATOMIC_ACQUIRE -- when we read the value in __atomic_load_n()
__ATOMIC_RELEASE -- when __atomic_compare_exchange_n() is success
__ATOMIC_ACQUIRE -- when __atomic_compare_exchange_n() is failed
To make this function more efficient you may like to use __ATOMIC_ACQUIRE for __atomic_load_n and __ATOMIC_RELEASE and __ATOMIC_RELAXED for __atomic_compare_exchange_n success_memorder and failure_memorder respectively.
On x86-64 though that does not change the generated assembly because its memory model is relatively strong. Unlike for ARM with its weaker memory model.

Is std::atomic redundant if you have to check for overflow or act conditionally?

You can safely increment and decrement std::atomic_int for example. But if you need to check for overflow or execute some routine conditinoally based on the value, then a lock is needed anyway. Since you must compare the value and the thread might be swapped off just after the comparison succeeded, another thread modifies, ... bug.
But if you need a lock then you can just use a plain integer instead of atomic. Am I right?
No, you can still use a std::atomic even conditionally.
Firstly, if you use std::atomic<unsigned int> then overflow behavoir is well defined (although possibly not what you want). If you use a signed integer overflow isn't well defined but as long as you don't hit it then this doesn't matter.
If you absolutely must check for overflow, or otherwise act conditionally, you can use compare-exchange. This lets you read the value, decide whether you want to do work on it and then atomically update the value back if it hasn't changed. And the key part here is the system will tell you if the atomic update failed, in which case you can go back to the start and read the new value and make the decision again.
As an example, if we only wanted to set the max value of an atomic integer to 4 (in some kind of refcounting, for instance), we could do:
#include <atomic>
static std::atomic<int> refcount = 0;
int val = refcount; // this doesn't need to be in the loop as on failure compare-exchange-strong updates it
while(true)
{
if(val == 4)
{
// there's already 4 refs here, maybe come back later?
break;
}
int toChangeTo = val + 1;
if(refcount.compare_exchange_strong(val, toChangeTo))
{
// we successfully took a ref!
break;
}
// if we fail here, another thread updated the value whilst we were running, just loop back and try again
}
In the above code you can use compare_exchange_weak instead. This can sometimes spuriously fail and so you need to do it in a loop. However, we have a loop anyway (and in general you always will as you need to handle real failures) and so compare_exchange_weak makes a lot of sense here.

C++ for-loop condition

I want to know why this loops runs even when result.bad_matches.size()=0
for (int i = 1; i <= result.badmatches.size() - 1; i++)
{
...
}
Also, is there any other way I could stop it from running when badmatches size is 0 without using an if condition?
This depends on the type size() returns. It is probably a standard container and thus will be an unsigned type and those types wrap around on overflow. That means it the result of subtracting one will be the maximum value of that type.
Either use a comparison that doesn't require you to subtract from the size (<, !=) or just use iterators or a for-auto loop. Under any circumstance you should at least use the same type for iterating as the nested size_type of the container and not int.
for(auto& x : result.badmatches) {
// ...
}
use while(result.badmatches.size()) to NOT execute it.
result.badmatches.size()-1 this will be converted to -1. If its an unsigned integer, then -1 is interpreted as 0xFFFFFFFF(on a 32 bit machine). This will make the loop run for 2^32 or 2^64 times. To avoid this, use while() as before IF you're certain that result.badmatches.size() will return 0.
size must be returning an unsigned so 0-1 is getting upgraded to unsigned and so is the left value.
So for int size of 4 bytes, -1 will be represented as 2^32 -1 in unsigned int.
If you don't want this behavior then just cast it like this : static_cast <signed int > (result.badmatches.size());
PS: I've not touched C++ for past 4 years pl. excuse little mistakes.
The right way is:
for (int i=0;i< result.badmatches.size() ;++i)
{
}
If you specifically don't want this loop to enter when the sise of the collection is zero then you could check for ! badmatches.empty() assuming that badmatches is an STL container. However, if you structure your code slightly differently, you'll probably overcome this issue without having to do that:
for (size_t i=0; i < result.badmatches.size(); i++)
{
}
I've changed the int to size_t which is the same type that size() returns (an unsigned integer), changed the initial value to 0 and the comparison so that it will exit if i >= result.badmatches.size() Generally, I'd say that this is the clearest way of presenting an indexed approach as it matches the natural indexing of collections and if you need 1, 2, 3 ... rather than 0, 1, 2 in your loop, then you can address that within it.
If you're still having problems, two questions:
Is there anything in your loop that might alter the value of result.badmatches.size()?
Is your code multithreaded with a possibility that result.badmatches.size() could change by actions on another thread?
After understanding the problem explained by #Prototype Stark #Aga , i came to a more simpler solution , using which i can keep my initial index to 1 .
for(int i=1;i+1<=result.badmatches.size();i++)
Thanks for all the help , it's much clearer now .

Binary Addition without overflow wrap-around in C/C++

I know that when overflow occurs in C/C++, normal behavior is to wrap-around. For example, INT_MAX+1 is an overflow.
Is possible to modify this behavior, so binary addition takes place as normal addition and there is no wraparound at the end of addition operation ?
Some Code so this would make sense. Basically, this is one bit (full) added, it adds bit by bit in 32
int adder(int x, int y)
{
int sum;
for (int i = 0; i < 31; i++)
{
sum = x ^ y;
int carry = x & y;
x = sum;
y = carry << 1;
}
return sum;
}
If I try to adder(INT_MAX, 1); it actually overflows, even though, I amn't using + operator.
Thanks !
Overflow means that the result of an addition would exceed std::numeric_limits<int>::max() (back in C days, we used INT_MAX). Performing such an addition results in undefined behavior. The machine could crash and still comply with the C++ standard. Although you're more likely to get INT_MIN as a result, there's really no advantage to depending on any result at all.
The solution is to perform subtraction instead of addition, to prevent overflow and take a special case:
if ( number > std::numeric_limits< int >::max() - 1 ) { // ie number + 1 > max
// fix things so "normal" math happens, in this case saturation.
} else {
++ number;
}
Without knowing the desired result, I can't be more specific about the it. The performance impact should be minimal, as a rarely-taken branch can usually be retired in parallel with subsequent instructions without delaying them.
Edit: To simply do math without worrying about overflow or handling it yourself, use a bignum library such as GMP. It's quite portable, and usually the best on any given platform. It has C and C++ interfaces. Do not write your own assembly. The result would be unportable, suboptimal, and the interface would be your responsibility!
No, you have to add them manually to check for overflow.
What do you want the result of INT_MAX + 1 to be? You can only fit INT_MAX into an int, so if you add one to it, the result is not going to be one greater. (Edit: On common platforms such as x86 it is going to wrap to the largest negative number: -(INT_MAX+1). The only way to get bigger numbers is to use a larger variable.
Assuming int is 4-bytes (as is typical on x86 compilers) and you are executing an add instruction (in 32-bit mode), the destination register simply does overflow -- it is out of bits and can't hold a larger value. It is a limitation of the hardware.
To get around this, you can hand-code, or use an aribitrarily-sized integer library that does the following:
First perform a normal add instruction on the lowest-order words. If overflow occurs, the Carry flag is set.
For each increasingly-higher-order word, use the adc instruction, which adds the two operands as usual, but takes into account the value of the Carry flag (as a value of 1.)
You can see this for a 64-bit value here.