Lock-Free Data Structures in C++ Compare and Swap Routine

Lock-Free Data Structures in C++ Compare and Swap Routine - c++

In this paper: Lock-Free Data Structures (pdf) the following "Compare and Swap" fundamental is shown:
template <class T>
bool CAS(T* addr, T exp, T val)
{
if (*addr == exp)
{
*addr = val;
return true;
}
return false;
}
And then says
The entire procedure is atomic
But how is that so? Is it not possible that some other actor could change the value of addr between the if and the assignment? In which case, assuming all code is using this CAS fundamental, it would be found the next time something "expected" it to be a certain way, and it wasn't. However, that doesn't change the fact that it could happen, in which case, is it still atomic? What about the other actor returning true, even when it's changes were overwritten by this actor? If that can't possibly happen, then why?
I want to believe the author, so what am I missing here? I am thinking it must be obvious. My apologies in advance if this seems trivial.

He is describing an atomic operation which is given by the implementation, "somehow." That is pseudo-code for something implemented in hardware.

Related

Difference between Interlocked, InterlockedAcquire, and InterlockedRelease if single thread reordering is impossible

In all likelihood, a lockless implementation is already overkill for the purposes of my application, but I wanted to look into memory barriers and lockless-ness anyways in case I ever actually need to use these concepts in the future.
From what I can tell:
an "InterlockedAcquire" function performs an atomic operation while preventing the compiler from moving code statements after the InterlockedAcquire to before the InterlockedAcquire.
an "InterlockedRelease" function performs an atomic operation while preventing the compiler from moving code statements before the InterlockedRelease to after the InterlockedRelease.
a vanilla "Interlocked" function performs an atomic operation while preventing the compiler from moving code statements in either direction across the Interlocked call.
My question is, if a function is structured such that the compiler can't reorder any of the code anyways because doing so would affect single-threaded behavior, is there a difference between any of the variants of an Interlocked function, or all they all effectively the same? Is the only difference between them how they interact with code reordering?
For a more concrete example, here's my current application - the produce() function as part of what will eventually be a multiple producer, single consumer queue built using a circular buffer:
template <typename T>
class Queue {
private:
long headIndex;
long tailIndex;
T* array[MAXQUEUESIZE];
public:
Queue() {
headIndex = 0;
tailIndex = 0;
memset(array, 0, MAXQUEUESIZE*sizeof(void*);
}
~Queue() {
}
bool produce(T value) {
//1) prevents concurrent calls to produce() from causing corruption:
long indexRetVal;
long reservedIndex;
do {
reservedIndex = tailIndex;
indexRetVal = InterlockedCompareExchange64(&tailIndex, (reservedIndex + 1) % MAXQUEUESIZE, reservedIndex);
} while (indexRetVal != reservedIndex);
//2) allocates the node.
T* newValPtr = (T*) malloc(sizeof(T));
if (newValPtr == null) {
OutputDebugString("Queue: malloc returned null");
return false;
}
*newValPtr = value;
//3) prevents a concurrent call to consume from causing corruption by atomically replacing the old pointer:
T* valPtrRetVal = InterlockedCompareExchangePointer(array + reservedIndex, newValPtr, null);
//if the previous value wasn't null, then our circular buffer overflowed:
if (valPtrRetVal != null) {
OutputDebugString("Queue: circular buffer overflowed");
free(newValPtr); //as pointed out by RbMm
return false;
}
//otherwise, everything worked fine
return true;
}
};
As I understand it, 3) will occur after 1) and 2) regardless of what I do anyways, but I should change 1) to an InterlockedRelease because I don't care whether it occurs before or after 2) and I should let the compiler decide.

My question is, if a function is structured such that the compiler can't reorder any of the code anyways because doing so would affect single-threaded behavior, is there a difference between any of the variants of an Interlocked function, or all they all effectively the same? Is the only difference between them how they interact with code reordering?
You may be confusing C++ statements with instructions. Your question isn't CPU specific, so you have to pretend you have no idea what the CPU instructions look like.
Consider this code:
if (a == 2)
{
b = 5;
}
Now, here's an example of a re-ordering of this code that doesn't affect a single thread:
int c = b;
b = 5;
if (a != 2)
b = c;
This performs the same operations but in a different order. It has no effect on single-threaded code. But, of course, if another thread was accessing b, it could see a value of 5 from this code even if a was never 2.
Thus it could also see a value of 5 from the original code even if a is never 2!
Why, because the two bits of code perform the same from the point of view of a single thread. And unless you use operations with guaranteed threading semantics, that's all the compiler, CPU, caches, and other platform components need to preserve.
So most likely, your belief that reordering any of the code would affect single-threaded behavior is probably incorrect. There's lots of ways to reorder and optimize code that doesn't affect single-threaded behavior.

There is an document on the msdn Explained the difference: Acquire and Release Semantics.
For the sample:
a++;
b++;
c++;
If we use acquire semantics to increment a, other processors would always see the increment of a before the increments of b and c;
If we use release semantics to increment c, other processors would always see the increments of a and b before the increment of c;
the InterlockedXxx routines perform, have both acquire and release semantics by default.
More specific, for 4 values:
a++;
b++;
c++;
d++;
If we use acquire semantics to increment b, other processors would always see the increment of b before the increments of c and d;
The order may be a->b->c,d or b->a,c,d.
If we use release semantics to increment c, other processors would always see the increments of a and b before the increment of c;
The order may be a,b->c->d or a,b,d->c.
To quote from this answer of #antiduh:
Acquire says "only worry about stuff after me". Release says "only
worry about stuff before me". Combining those both is a full memory
barrier.

All three versions prevent the compiler from moving code across the function call, but the compiler is not the only place that reordering takes place.
Modern CPUs have "out-of-order execution" and even "speculative execution". Acquire and release semantics cause the code to compiler to instructions with flags or prefixes controlling reordering within the CPU.

Order of operations for bitwise operators (a & (a=1) strange behavior)?

I was messing around with some c++ code after learning that you cannot increment booleans in the standard. I thought incrementing booleans would be useful because I like compacting functions if I can and
unsigned char b = 0; //type doesn't really matter assuming no overflow
while (...) {
if (b++) //do something
...}
is sometimes useful to have, but it would be nice to not have to worry about integer overflow. Since booleans cannot take on any value besides 0 or 1, I thought that perhaps this would work. Of course, you could just have a boolean assigned to 1 after you do your operation but that takes an extra line of code.
Anyway, I thought about how I might achieve this a different way and I came up with this.
bool b = 0;
while (...)
if (b & (b=1)) ...
This turned out to always evaluate to true, even on the first pass through. I thought - sure, its just doing the bitwise operator from right to left, except that when I swapped the order, it did the same thing.
So my question is, how is this expression being evaluated so that it simplifies to always being true?
As an aside, the way to do what I wanted is like this I guess:
bool b = 0;
while (...) if (!b && !(b=1)) //code executes every time except the first iteration
There's a way to get it to only execute once, also.
Or just do the actually readable and obvious solution.
bool b = 0;
while (...) {
if (b) {} else {b=1;}
}

This is what std::exchange is for:
if (std::exchange(b, true)) {
// b was already true
}
As mentioned, sequencing rules mean that b & (b=1) is undefined behaviour in both C and C++. You can't modify a variable and read from it "without proper separation" (e.g., as separate statements).

There is no "strange behavior" here. The & operator, in your case is a bitwise arithmetic operator. These operator are commutative, so both operand must be evaluated before computing the resulting value.
As mentioned earlier, this is considered as undefined behavior, as the standard does not prevent computing + loading into a register the left operand, before doing the same to the second, or doing so in reverse order.
If you are using C++14 or above, you could use std::exchange do achieve this, or re-implement it quite easily using the following :
bool exchange_bool(bool &var, bool &&val) {
bool ret = var;
var = val;
return ret;
}
If you end up re-implementing it, consider using template instead of hard-coded types.
As a proper answer to your question :
So my question is, how is this expression being evaluated so that it simplifies to always being true?
It does not simplify, if we strictly follow the standard, but I'd guess you are always using the same compiler, which end up producing the same output.
Side point: You should not "compact" functions. Most of the time the price of this is less readability. This might be fine if you're the only dev working on a project, but on bigger project, or if you come back to your code later, that could be an issue. Last but not least, "compact" function are more error prone, because of their lack of readability.
If you really want shorter function, you should consider using sub-function that will handle part of the function computation. This would enable you to write short function, to quickly fix or update behavior, without impacting clarity.

I don't see what meaning you can assign to a "bitwise increment" operator. If I am right, what you achieve is just setting the low-order bit, which is done by b|= 1 (following the C conventions, this could have been defined as a unary b||, but the || was assigned another purpose). But I don't feel that this is a useful bitwise operation, and it can increment only once. IMO the standard increment operator is more bitwise.
If your goal was in fact to "increment" a boolean variable, i.e. set it to true (prefix or postfix), b= b || true is a way. (There is no "logical OR assignement" b||= true, and even less a "logical increment" b||||.)

Can I check a small array of bools in one go?

There was a similar question here, but the user in that question seemed to have a much larger array, or vector. If I have:
bool boolArray[4];
And I want to check if all elements are false, I can check [ 0 ], [ 1 ] , [ 2 ] and [ 3 ] either separately, or I can loop through it. Since (as far as I know) false should have value 0 and anything other than 0 is true, I thought about simply doing:
if ( *(int*) boolArray) { }
This works, but I realize that it relies on bool being one byte and int being four bytes. If I cast to (std::uint32_t) would it be OK, or is it still a bad idea? I just happen to have 3 or 4 bools in an array and was wondering if this is safe, and if not if there is a better way to do it.
Also, in the case I end up with more than 4 bools but less than 8 can I do the same thing with a std::uint64_t or unsigned long long or something?

As πάντα ῥεῖ noticed in comments, std::bitset is probably the best way to deal with that in UB-free manner.
std::bitset<4> boolArray {};
if(boolArray.any()) {
//do the thing
}
If you want to stick to arrays, you could use std::any_of, but this requires (possibly peculiar to the readers) usage of functor which just returns its argument:
bool boolArray[4];
if(std::any_of(std::begin(boolArray), std::end(boolArray), [](bool b){return b;}) {
//do the thing
}
Type-punning 4 bools to int might be a bad idea - you cannot be sure of the size of each of the types. It probably will work on most architectures, but std::bitset is guaranteed to work everywhere, under any circumstances.

Several answers have already explained good alternatives, particularly std::bitset and std::any_of(). I am writing separately to point out that, unless you know something we don't, it is not safe to type pun between bool and int in this fashion, for several reasons:
int might not be four bytes, as multiple answers have pointed out.
M.M points out in the comments that bool might not be one byte. I'm not aware of any real-world architectures in which this has ever been the case, but it is nevertheless spec-legal. It (probably) can't be smaller than a byte unless the compiler is doing some very elaborate hide-the-ball chicanery with its memory model, and a multi-byte bool seems rather useless. Note however that a byte need not be 8 bits in the first place.
int can have trap representations. That is, it is legal for certain bit patterns to cause undefined behavior when they are cast to int. This is rare on modern architectures, but might arise on (for example) ia64, or any system with signed zeros.
Regardless of whether you have to worry about any of the above, your code violates the strict aliasing rule, so compilers are free to "optimize" it under the assumption that the bools and the int are entirely separate objects with non-overlapping lifetimes. For example, the compiler might decide that the code which initializes the bool array is a dead store and eliminate it, because the bools "must have" ceased to exist* at some point before you dereferenced the pointer. More complicated situations can also arise relating to register reuse and load/store reordering. All of these infelicities are expressly permitted by the C++ standard, which says the behavior is undefined when you engage in this kind of type punning.
You should use one of the alternative solutions provided by the other answers.
* It is legal (with some qualifications, particularly regarding alignment) to reuse the memory pointed to by boolArray by casting it to int and storing an integer, although if you actually want to do this, you must then pass boolArray through std::launder if you want to read the resulting int later. Regardless, the compiler is entitled to assume that you have done this once it sees the read, even if you don't call launder.

You can use std::bitset<N>::any:
Any returns true if any of the bits are set to true, otherwise false.
#include <iostream>
#include <bitset>
int main ()
{
std::bitset<4> foo;
// modify foo here
if (foo.any())
std::cout << foo << " has " << foo.count() << " bits set.\n";
else
std::cout << foo << " has no bits set.\n";
return 0;
}
Live
If you want to return true if all or none of the bits set to on, you can use std::bitset<N>::all or std::bitset<N>::none respectively.

The standard library has what you need in the form of the std::all_of, std::any_of, std::none_of algorithms.

...And for the obligatory "roll your own" answer, we can provide a simple "or"-like function for any array bool[N], like so:
template<size_t N>
constexpr bool or_all(const bool (&bs)[N]) {
for (bool b : bs) {
if (b) { return b; }
}
return false;
}
Or more concisely,
template<size_t N>
constexpr bool or_all(const bool (&bs)[N]) {
for (bool b : bs) { if (b) { return b; } }
return false;
}
This also has the benefit of both short-circuiting like ||, and being optimised out entirely if calculable at compile time.
Apart from that, if you want to examine the original idea of type-punning bool[N] to some other type to simplify observation, I would very much recommend that you don't do that view it as char[N2] instead, where N2 == (sizeof(bool) * N). This would allow you to provide a simple representation viewer that can automatically scale to the viewed object's actual size, allow iteration over its individual bytes, and allow you to more easily determine whether the representation matches specific values (such as, e.g., zero or non-zero). I'm not entirely sure off the top of my head whether such examination would invoke any UB, but I can say for certain that any such type's construction cannot be a viable constant-expression, due to requiring a reinterpret cast to char* or unsigned char* or similar (either explicitly, or in std::memcpy()), and thus couldn't as easily be optimised out.

Should I protect operations on primitive types with mutexes for being thread-safe in C++?

What is the best approach to achieve thread-safety for rather simple operations?
Consider a pair of functions:
void setVal(int val)
{
this->_val = val;
}
int getVal() {
return this->_val;
}
Since even assignments of primitive types aren't guaranteed to be atomic, should I modify every getter and setter in the program in the following way to be thread-safe?
void setVal(int val)
{
this->_mutex.lock();
this->_val = val;
this->_mutex.unlock();
}
int getVal() {
this->_mutex.lock();
int result = this->_val;
this->_mutex.unlock();
return result;
}

Are you using _val in multiple threads? If not, then no, you don't need to synchronize access to it.
If it is used from multiple threads, then yes, you need to synchronize access, either using a mutex or by using an atomic type (like std::atomic<T> in C++0x, though other threading libraries have nonstandard atomic types as well).

Mutexes are very costly, as they are able to be shared across processes. If the state that you're limiting access to is only to be constrained to threads within your current process then go for something much less heavy, such as a Critical Section or Semaphore.

On 32-bit x86 platforms, reads and writes of 32-bit values aligned on 4-byte boundary are atomic. On 64-bit platforms you can also rely on 64-bit loads and stores of 8-byte aligned values to be atomic as well. SPARC and POWER CPUs also work like that.
C++ doesn't make any guarantees like that, but in practice no compiler is going to mess with it, since every non-trivial multi-threaded program relies on this behaviour.

int getVal() {
this->_mutex.lock();
int result = this->_val;
this->_mutex.unlock();
return result;
}
What exactly are you hoping to accomplish with this? Sure, you've stopped this->_val from changing before you saved into result but it still may change before result is returned, -- or between the return and the assignment to whatever you assigned it -- or a microsecond later. Regardless of what you do, you are just going to get a snapshot of a moving target. Deal with it.
void setVal(int val)
{
this->_mutex.lock();
this->_val = val;
this->_mutex.unlock();
}
Similarly, what is this buying you? If you call setVal(-5) and setVal(17) from separate threads at the same time, what value should be there after both complete? You've gone to some trouble to make sure that the first to start is also the first to finish, but how is that help to get the "right" value set?

How far to go with a strongly typed language?

Let's say I am writing an API, and one of my functions take a parameter that represents a channel, and will only ever be between the values 0 and 15. I could write it like this:
void Func(unsigned char channel)
{
if(channel < 0 || channel > 15)
{ // throw some exception }
// do something
}
Or do I take advantage of C++ being a strongly typed language, and make myself a type:
class CChannel
{
public:
CChannel(unsigned char value) : m_Value(value)
{
if(channel < 0 || channel > 15)
{ // throw some exception }
}
operator unsigned char() { return m_Value; }
private:
unsigned char m_Value;
}
My function now becomes this:
void Func(const CChannel &channel)
{
// No input checking required
// do something
}
But is this total overkill? I like the self-documentation and the guarantee it is what it says it is, but is it worth paying the construction and destruction of such an object, let alone all the additional typing? Please let me know your comments and alternatives.

If you wanted this simpler approach generalize it so you can get more use out of it, instead of tailor it to a specific thing. Then the question is not "should I make a entire new class for this specific thing?" but "should I use my utilities?"; the latter is always yes. And utilities are always helpful.
So make something like:
template <typename T>
void check_range(const T& pX, const T& pMin, const T& pMax)
{
if (pX < pMin || pX > pMax)
throw std::out_of_range("check_range failed"); // or something else
}
Now you've already got this nice utility for checking ranges. Your code, even without the channel type, can already be made cleaner by using it. You can go further:
template <typename T, T Min, T Max>
class ranged_value
{
public:
typedef T value_type;
static const value_type minimum = Min;
static const value_type maximum = Max;
ranged_value(const value_type& pValue = value_type()) :
mValue(pValue)
{
check_range(mValue, minimum, maximum);
}
const value_type& value(void) const
{
return mValue;
}
// arguably dangerous
operator const value_type&(void) const
{
return mValue;
}
private:
value_type mValue;
};
Now you've got a nice utility, and can just do:
typedef ranged_value<unsigned char, 0, 15> channel;
void foo(const channel& pChannel);
And it's re-usable in other scenarios. Just stick it all in a "checked_ranges.hpp" file and use it whenever you need. It's never bad to make abstractions, and having utilities around isn't harmful.
Also, never worry about overhead. Creating a class simply consists of running the same code you would do anyway. Additionally, clean code is to be preferred over anything else; performance is a last concern. Once you're done, then you can get a profiler to measure (not guess) where the slow parts are.

Yes, the idea is worthwhile, but (IMO) writing a complete, separate class for each range of integers is kind of pointless. I've run into enough situations that call for limited range integers that I've written a template for the purpose:
template <class T, T lower, T upper>
class bounded {
T val;
void assure_range(T v) {
if ( v < lower || upper <= v)
throw std::range_error("Value out of range");
}
public:
bounded &operator=(T v) {
assure_range(v);
val = v;
return *this;
}
bounded(T const &v=T()) {
assure_range(v);
val = v;
}
operator T() { return val; }
};
Using it would be something like:
bounded<unsigned, 0, 16> channel;
Of course, you can get more elaborate than this, but this simple one still handles about 90% of situations pretty well.

No, it is not overkill - you should always try to represent abstractions as classes. There are a zillion reasons for doing this and the overhead is minimal. I would call the class Channel though, not CChannel.

Can't believe nobody mentioned enum's so far. Won't give you a bulletproof protection, but still better than a plain integer datatype.

Looks like overkill, especially the operator unsigned char() accessor. You're not encapsulating data, you're making evident things more complicated and, probably, more error-prone.
Data types like your Channel are usually a part of something more abstracted.
So, if you use that type in your ChannelSwitcher class, you could use commented typedef right in the ChannelSwitcher's body (and, probably, your typedef is going to be public).
// Currently used channel type
typedef unsigned char Channel;

Whether you throw an exception when constructing your "CChannel" object or at the entrance to the method that requires the constraint makes little difference. In either case you're making runtime assertions, which means the type system really isn't doing you any good, is it?
If you want to know how far you can go with a strongly typed language, the answer is "very far, but not with C++." The kind of power you need to statically enforce a constraint like, "this method may only be invoked with a number between 0 and 15" requires something called dependent types--that is, types which depend on values.
To put the concept into pseudo-C++ syntax (pretending C++ had dependent types), you might write this:
void Func(unsigned char channel, IsBetween<0, channel, 15> proof) {
...
}
Note that IsBetween is parameterized by values rather than types. In order to call this function in your program now, you must provide to the compiler the second argument, proof, which must have the type IsBetween<0, channel, 15>. Which is to say, you have to prove at compile-time that channel is between 0 and 15! This idea of types which represent propositions, whose values are proofs of those propositions, is called the Curry-Howard Correspondence.
Of course, proving such things can be difficult. Depending on your problem domain, the cost/benefit ratio can easily tip in favor of just slapping runtime checks on your code.

Whether something is overkill or not often depends on lots of different factors. What might be overkill in one situation might not in another.
This case might not be overkill if you had lots of different functions that all accepted channels and all had to do the same range checking. The Channel class would avoid code duplication, and also improve readability of the functions (as would naming the class Channel instead of CChannel - Neil B. is right).
Sometimes when the range is small enough I will instead define an enum for the input.

If you add constants for the 16 different channels, and also a static method that fetches the channel for a given value (or throws an exception if out of range) then this can work without any additional overhead of object creation per method call.
Without knowing how this code is going to be used, it's hard to say if it's overkill or not or pleasant to use. Try it out yourself - write a few test cases using both approaches of a char and a typesafe class - and see which you like. If you get sick of it after writing a few test cases, then it's probably best avoided, but if you find yourself liking the approach, then it might be a keeper.
If this is an API that's going to be used by many, then perhaps opening it up to some review might give you valuable feedback, since they presumably know the API domain quite well.

In my opinion, I don't think what you are proposing is a big overhead, but for me, I prefer to save the typing and just put in the documentation that anything outside of 0..15 is undefined and use an assert() in the function to catch errors for debug builds. I don't think the added complexity offers much more protection for programmers who are already used to C++ language programming which contains alot of undefined behaviours in its specs.

You have to make a choice. There is no silver bullet here.
Performance
From the performance perspective, the overhead isn't going to be much if at all. (unless you've got to counting cpu cycles) So most likely this shouldn't be the determining factor.
Simplicity/ease of use etc
Make the API simple and easy to understand/learn.
You should know/decide whether numbers/enums/class would be easier for the api user
Maintainability
If you are very sure the channel
type is going to be an integer in
the foreseeable future , I would go
without the abstraction (consider
using enums)
If you have a lot of use cases for a
bounded values, consider using the
templates (Jerry)
If you think, Channel can
potentially have methods make it a
class right now.
Coding effort
Its a one time thing. So always think maintenance.

The channel example is a tough one:
At first it looks like a simple limited-range integer type, like you find in Pascal and Ada. C++ gives you no way to say this, but an enum is good enough.
If you look closer, could it be one of those design decisions that are likely to change? Could you start referring to "channel" by frequency? By call letters (WGBH, come in)? By network?
A lot depends on your plans. What's the main goal of the API? What's the cost model? Will channels be created very frequently (I suspect not)?
To get a slightly different perspective, let's look at the cost of screwing up:
You expose the rep as int. Clients write a lot of code, the interface is either respected or your library halts with an assertion failure. Creating channels is dirt cheap. But if you need to change the way you're doing things, you lose "backward bug-compatibility" and annoy authors of sloppy clients.
You keep it abstract. Everybody has to use the abstraction (not so bad), and everybody is futureproofed against changes in the API. Maintaining backwards compatibility is a piece of cake. But creating channels is more costly, and worse, the API has to state carefully when it is safe to destroy a channel and who is responsible for the decision and the destruction. Worse case scenario is that creating/destroying channels leads to a big memory leak or other performance failure—in which case you fall back to the enum.
I'm a sloppy programmer, and if it were for my own work, I'd go with the enum and eat the cost if the design decision changes down the line. But if this API were to go out to a lot of other programmers as clients, I'd use the abstraction.
Evidently I'm a moral relativist.

An integer with values only ever between 0 and 15 is an unsigned 4-bit integer (or half-byte, nibble. I imagine if this channel switching logic would be implemented in hardware, then the channel number might be represented as that, a 4-bit register).
If C++ had that as a type you would be done right there:
void Func(unsigned nibble channel)
{
// do something
}
Alas, unfortunately it doesn't. You could relax the API specification to express that the channel number is given as an unsigned char, with the actual channel being computed using a modulo 16 operation:
void Func(unsigned char channel)
{
channel &= 0x0f; // truncate
// do something
}
Or, use a bitfield:
#include <iostream>
struct Channel {
// 4-bit unsigned field
unsigned int n : 4;
};
void Func(Channel channel)
{
// do something with channel.n
}
int main()
{
Channel channel = {9};
std::cout << "channel is" << channel.n << '\n';
Func (channel);
}
The latter might be less efficient.

I vote for your first approach, because it's simpler and easier to understand, maintain, and extend, and because it is more likely to map directly to other languages should your API have to be reimplemented/translated/ported/etc.

This is abstraction my friend! It's always neater to work with objects

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js