atomic_compare_exchange with greater-than instead of equals? - c++

C++11 has a 'compare and exchange' operation for atomic variables.
The semantics are:
Atomically compares the value pointed to by obj with the value pointed to by expected, and if those are equal, replaces the former with desired (performs read-modify-write operation). Otherwise, loads the actual value pointed to by obj into *expected (performs load operation).
I want to do the same, but instead of setting *obj when the values are equal, I want it to be set when one is greater-than the other (assume we're talking about an ordered type).
Is this supported somehow? Achievable by some hack perhaps?
Note: A CAS loop will not do for me, since both the values I'm comparing might change between non-atomic operations.

I think you misunderstand how compare and swap/exchange works: the basic idea is that having looked at the current value you can work out some corresponding new value - and you attempt that update. If it succeeds - great - continue with whatever you need to, but if it fails then start all over again: looking at the new value that some other thread's put in there and thinking about the value that you'd consequently now need.
I want it to be set when one is greater-than the other (assume we're talking about an ordered type).
So say you want to store 11 but only if the existing value's still atomically less than 11. You won't find an instruction to do that directly, but you can easily do it with the existing compare and swap:
int target_value = 11;
int snapped_x;
do {
    snapped_x = x;
    if (snapped_x >= target_value)
        break; // x is already >= 11: nothing to store, do whatever you want instead
} while (!compare_and_swap(x, snapped_x, target_value));
// ...or whatever your exact calling convention is...
You still get the behaviour you want, just with a potentially higher failure/spin rate....
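For what it's worth, here is roughly how that loop might look with C++11's std::atomic. This is only a sketch: the name store_if_greater is made up for illustration, and it uses the default (sequentially consistent) ordering for the exchange, which may be stronger than you need.

```cpp
#include <atomic>

// Hypothetical helper (not a standard function): store `desired` into `obj`
// only if it is greater than the value currently held.
// Returns true if this call performed the store.
template <typename T>
bool store_if_greater(std::atomic<T>& obj, T desired) {
    T current = obj.load(std::memory_order_relaxed);
    while (current < desired) {
        // On failure, compare_exchange_weak writes the value it actually
        // found into `current`, so the loop condition is re-checked against
        // whatever another thread stored in the meantime.
        if (obj.compare_exchange_weak(current, desired)) {
            return true;
        }
    }
    return false;  // the stored value was already >= desired
}
```

The comparison and the store are not one instruction, but the store only happens if the value is still the one that was compared, which gives the same observable result.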

As requested, here's my comment as an answer:
I, too, wish this existed, but it does not, as far as I know (certainly not for x86/x64), apart from conceptually, of course, and workarounds that (potentially) use more than a single atomic instruction (which work but are not wait-free).

This may be an old question, but I think many people will want this kind of feature.
I came up with an idea; here is the pseudocode (I work on the Linux kernel, so it uses some kernel functions).
update(new)
{
    old = atomic_read(&pvalue);
    while (old < new) {
        v = atomic_cmpxchg(&pvalue, old, new);
        if (v == old)
            break;      /* swap succeeded */
        old = v;        /* lost a race: retry against the value now stored */
    }
}
The code only attempts cmpxchg while the stored value is less than the new value.
If there are concurrency issues, please tell me. Thanks :)


Is it bad practice to operate on a structure and assign the result to the same structure? Why?

I don't recall seeing examples of code like this hypothetical snippet:
cpu->dev.bus->uevent = (cpu->dev.bus->uevent) >> 16; //or the equivalent using a macro
in which a member in a large structure gets dereferenced using pointers, operated on, and the result assigned back to the same field of the structure.
The kernel seems to be a place where such large structures are frequent but I haven't seen examples of it and became interested as to the reason why.
Is there a performance reason for this, maybe related to the time required to follow the pointers? Is it simply not good style and if so, what is the preferred way?
There's nothing wrong with the statement syntactically, but it's easier to code it like this:
cpu->dev.bus->uevent >>= 16;
It's much more a matter of history: the kernel is mostly written in C (not C++), and C was originally (in the K&R era) conceived as a "high-level assembler" whose statements and expressions should correspond literally to assembly. In that environment, ++i, i+=1 and i=i+1 were completely different things that translated into completely different CPU instructions.
Compiler optimizations at that time were not so advanced or widespread, so following the pointer chain twice was often avoided by first storing the destination address in a local temporary variable (most likely a register) and then doing the assignment
(like int* p = &a->b->c->d; *p = a + *p;)
or by using a compound assignment (like a->b->c >>= 16;).
On today's computers (multicore processors, multilevel caches and pipelining), code that executes inside registers can be ten times faster than memory access, and following three pointers can be faster than storing an address in memory, thus reversing the priorities of that "business model".
Compiler optimization can then freely change the generated code to tune it for size or speed, depending on what is considered more important and on what kind of processor you are targeting.
So nowadays it doesn't really matter whether you write ++i, i+=1 or i=i+1: the compiler will most likely produce the same code, attempting to access i only once. Following the pointer chain twice will most likely be rewritten as the equivalent of (cpu->dev.bus->uevent) >>= 16, since >>= corresponds to a single machine instruction on x86-derived processors.
That said ("it doesn't really matter"), it is also true that code style tends to reflect the styles and fashions of the age in which the code was first written (since later developers tend to maintain consistency).
Your code is not "bad" in itself; it just looks "odd" in the place where it would be written.
Just to give you an idea of what pipelining and branch prediction mean, consider the comparison of two vectors:
bool equal(size_t n, int* a, int *b)
{
for(size_t i=0; i<n; ++i)
if(a[i]!=b[i]) return false;
return true;
}
Here, as soon as we find a difference we short-circuit and report that they are different.
Now consider this:
bool equal(size_t n, int* a, int *b)
{
register size_t c=0;
for(register size_t i=0; i<n; ++i)
c+=(a[i]==b[i]);
return c==n;
}
There is no shortcut: even if we find a difference, we continue to loop and count.
But having removed the if from inside the loop, if n isn't that big (say, less than 20) this can be 4 or 5 times faster!
An optimizing compiler can even recognize this situation and, once it has proven there are no differing side effects, rework the first version into the second!
I see nothing wrong with something like that, it appears as innocuous as:
i = i + 42;
If you're accessing the data items a lot, you could consider something like:
tSomething *cdb = cpu->dev.bus;
cdb->uevent = cdb->uevent >> 16;
// and many more accesses to cdb here
but, even then, I'd tend to leave it to the optimiser, which tends to do a better job than most humans anyway :-)
There's nothing inherently wrong with doing
cpu->dev.bus->uevent = (cpu->dev.bus->uevent) >> 16;
but depending on the type of uevent, you need to be careful when shifting right like that, so you don't accidentally shift in unexpected bits into your value. For instance, if it's a 64-bit value
uint64_t uevent = 0xDEADBEEF00000000;
uevent = uevent >> 16; // now uevent is 0x0000DEADBEEF0000;
if you thought you shifted a 32-bit value and then pass the new uevent to a function taking a 64-bit value, you're not passing 0xBEEF0000, as you might have expected. Since the sizes fit (64-bit value passed as 64-bit parameter), you won't get any compiler warnings here (which you would have if you passed a 64-bit value as a 32-bit parameter).
Also interesting to note is that the above operation, while similar to
i = ++i;
which is undefined behavior in C and in C++ before C++11 (see http://josephmansfield.uk/articles/c++-sequenced-before-graphs.html for details), is still well defined, since there are no side effects in the right-hand side expression.
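Coming back to the 64-bit shift caveat: if what you actually want is the 32-bit result, one way to make that explicit is to narrow the value yourself. A small sketch, with helper names invented purely for illustration:

```cpp
#include <cstdint>

// Same shift as in the snippet above, keeping all 64 bits.
inline std::uint64_t shift_only(std::uint64_t uevent) {
    return uevent >> 16;
}

// Hypothetical helper: shift, then deliberately keep only the low 32 bits,
// so the narrowing is visible in the code instead of being accidental.
inline std::uint32_t shift_and_narrow(std::uint64_t uevent) {
    return static_cast<std::uint32_t>(uevent >> 16);
}
```

With the value from the example, shift_only(0xDEADBEEF00000000) gives 0x0000DEADBEEF0000, while shift_and_narrow gives 0xBEEF0000.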

Programming Myth? Pre vs. post increment and decrement speed [duplicate]

Is there a performance difference between i++ and ++i if the resulting value is not used?
Executive summary: No.
i++ could potentially be slower than ++i, since the old value of i
might need to be saved for later use, but in practice all modern
compilers will optimize this away.
We can demonstrate this by looking at the code for this function,
both with ++i and i++.
$ cat i++.c
extern void g(int i);
void f()
{
int i;
for (i = 0; i < 100; i++)
g(i);
}
The files are the same, except for ++i and i++:
$ diff i++.c ++i.c
6c6
< for (i = 0; i < 100; i++)
---
> for (i = 0; i < 100; ++i)
We'll compile them, and also get the generated assembler:
$ gcc -c i++.c ++i.c
$ gcc -S i++.c ++i.c
And we can see that both the generated object and assembler files are the same.
$ md5 i++.s ++i.s
MD5 (i++.s) = 90f620dda862cd0205cd5db1f2c8c06e
MD5 (++i.s) = 90f620dda862cd0205cd5db1f2c8c06e
$ md5 *.o
MD5 (++i.o) = dd3ef1408d3a9e4287facccec53f7d22
MD5 (i++.o) = dd3ef1408d3a9e4287facccec53f7d22
From Efficiency versus intent by Andrew Koenig :
First, it is far from obvious that ++i is more efficient than i++, at least where integer variables are concerned.
And :
So the question one should be asking is not which of these two operations is faster, it is which of these two operations expresses more accurately what you are trying to accomplish. I submit that if you are not using the value of the expression, there is never a reason to use i++ instead of ++i, because there is never a reason to copy the value of a variable, increment the variable, and then throw the copy away.
So, if the resulting value is not used, I would use ++i. But not because it is more efficient: because it correctly states my intent.
A better answer is that ++i will sometimes be faster but never slower.
Everyone seems to be assuming that i is a regular built-in type such as int. In this case there will be no measurable difference.
However if i is complex type then you may well find a measurable difference. For i++ you must make a copy of your class before incrementing it. Depending on what's involved in a copy it could indeed be slower since with ++i you can just return the final value.
Foo Foo::operator++(int) // postfix form
{
Foo oldFoo = *this; // copy existing value - could be slow
// yadda yadda, do increment
return oldFoo;
}
Another difference is that with ++i you have the option of returning a reference instead of a value, whereas i++ has to return by value. Again, depending on what's involved in making a copy of your object, returning by value could be slower.
A real-world example of where this can occur would be the use of iterators. Copying an iterator is unlikely to be a bottle-neck in your application, but it's still good practice to get into the habit of using ++i instead of i++ where the outcome is not affected.
Short answer:
There is never any difference between i++ and ++i in terms of speed. A good compiler should not generate different code in the two cases.
Long answer:
What every other answer fails to mention is that the difference between ++i versus i++ only makes sense within the expression it is found.
In the case of for(i=0; i<n; i++), the i++ is alone in its own expression: there is a sequence point before the i++ and there is one after it. Thus the only machine code generated is "increase i by 1" and it is well-defined how this is sequenced in relation to the rest of the program. So if you would change it to prefix ++, it wouldn't matter in the slightest, you would still just get the machine code "increase i by 1".
The differences between ++i and i++ only matters in expressions such as array[i++] = x; versus array[++i] = x;. Some may argue and say that the postfix will be slower in such operations because the register where i resides has to be reloaded later. But then note that the compiler is free to order your instructions in any way it pleases, as long as it doesn't "break the behavior of the abstract machine" as the C standard calls it.
So while you may assume that array[i++] = x; gets translated to machine code as:
Store value of i in register A.
Store address of array in register B.
Add A and B, store results in A.
At this new address represented by A, store the value of x.
Store value of i in register A // inefficient because extra instruction here, we already did this once.
Increment register A.
Store register A in i.
the compiler might as well produce the code more efficiently, such as:
Store value of i in register A.
Store address of array in register B.
Add A and B, store results in B.
At this new address represented by B, store the value of x.
Increment register A.
Store register A in i.
... // rest of the code.
Just because you as a C programmer are trained to think that the postfix ++ happens at the end, the machine code doesn't have to be ordered in that way.
So there is no difference between prefix and postfix ++ in C. Now what you as a C programmer should be wary of is people who inconsistently use prefix in some cases and postfix in other cases, without any rationale why. This suggests that they are uncertain about how C works or that they have incorrect knowledge of the language. This is always a bad sign; it in turn suggests that they are making other questionable decisions in their program, based on superstition or "religious dogmas".
"Prefix ++ is always faster" is indeed one such false dogma that is common among would-be C programmers.
Taking a leaf from Scott Meyers, More Effective C++, Item 6: Distinguish between prefix and postfix forms of increment and decrement operations.
The prefix version is always preferred over the postfix in regards to objects, especially in regards to iterators.
The reason for this becomes clear if you look at the call pattern of the operators.
// Prefix
Integer& Integer::operator++()
{
*this += 1;
return *this;
}
// Postfix
const Integer Integer::operator++(int)
{
Integer oldValue = *this;
++(*this);
return oldValue;
}
Looking at this example, it is easy to see why the prefix operator will generally be more efficient than the postfix one: the postfix form needs a temporary object.
This is why examples using iterators almost always use the prefix version.
But as you point out, for ints there is effectively no difference because of the compiler optimisation that can take place.
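To make the temporary visible, here is a rough sketch of the same Integer idea with a copy counter bolted on. The counter and the exact class layout are assumptions added for illustration, not part of Meyers' item.

```cpp
// Hypothetical Integer wrapper that counts how often it is copy-constructed.
struct Integer {
    int value = 0;
    static int copies;  // incremented by every copy construction

    Integer() = default;
    Integer(const Integer& other) : value(other.value) { ++copies; }
    Integer& operator=(const Integer&) = default;

    // Prefix: increment in place and return a reference, no copy needed.
    Integer& operator++() {
        ++value;
        return *this;
    }

    // Postfix: the old value has to be kept, so at least one copy is made.
    Integer operator++(int) {
        Integer old = *this;
        ++(*this);
        return old;
    }
};

int Integer::copies = 0;
```

Calling ++a leaves Integer::copies untouched, while a++ bumps it at least once even when the result is thrown away, because the copy constructor has a visible side effect here. With a trivial copy and an inlined operator the optimiser is free to remove the difference again.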
Here's an additional observation if you're worried about micro optimisation. Decrementing loops can 'possibly' be more efficient than incrementing loops (depending on instruction set architecture e.g. ARM), given:
for (i = 0; i < 100; i++)
On each iteration you will have one instruction each for:
Adding 1 to i.
Comparing whether i is less than 100.
A conditional branch if i is less than 100.
Whereas a decrementing loop:
for (i = 100; i != 0; i--)
The loop will have an instruction for each of:
Decrement i, setting the CPU register status flag.
A conditional branch depending on CPU register status (Z==0).
Of course this works only when decrementing to zero!
Remembered from the ARM System Developer's Guide.
First of all: The difference between i++ and ++i is negligible in C.
To the details.
1. The well known C++ issue: ++i is faster
In C++, ++i is more efficient iff i is some kind of an object with an overloaded increment operator.
Why?
In ++i, the object is first incremented, and can subsequently be passed as a const reference to any other function. This is not possible if the expression is foo(i++), because now the increment needs to be done before foo() is called, but the old value needs to be passed to foo(). Consequently, the compiler is forced to make a copy of i before it executes the increment operator on the original. The additional constructor/destructor calls are the bad part.
As noted above, this does not apply to fundamental types.
2. The little known fact: i++ may be faster
If no constructor/destructor needs to be called, which is always the case in C, ++i and i++ should be equally fast, right? No. They are virtually equally fast, but there may be small differences, which most other answerers got the wrong way around.
How can i++ be faster?
The point is data dependencies. If the value needs to be loaded from memory, two subsequent operations need to be done with it: incrementing it, and using it. With ++i, the increment needs to be done before the value can be used. With i++, the use does not depend on the increment, and the CPU may perform the use operation in parallel with the increment operation. The difference is at most one CPU cycle, so it is really negligible, but it is there. And it is the other way round than many would expect.
Please don't let the question of "which one is faster" be the deciding factor of which to use. Chances are you're never going to care that much, and besides, programmer reading time is far more expensive than machine time.
Use whichever makes most sense to the human reading the code.
@Mark
Just because the compiler is allowed to optimize away the (stack based) temporary copy of the variable, and gcc (in recent versions) does so,
doesn't mean all compilers will always do so.
I just tested it with the compilers we use in our current project and 3 out of 4 do not optimize it.
Never assume the compiler gets it right, especially if the possibly faster, but never slower, code is just as easy to read.
If you don't have a really stupid implementation of one of the operators in your code:
Always prefer ++i over i++.
I have been reading through most of the answers here and many of the comments, and I didn't see any reference to the one instance that I could think of where i++ is more efficient than ++i (and perhaps surprisingly --i was more efficient than i--). That is for C compilers for the DEC PDP-11!
The PDP-11 had assembly instructions for pre-decrement of a register and post-increment, but not the other way around. The instructions allowed any "general-purpose" register to be used as a stack pointer. So if you used something like *(i++) it could be compiled into a single assembly instruction, while *(++i) could not.
This is obviously a very esoteric example, but it does provide the exception where post-increment is more efficient (or I should say was, since there isn't much demand for PDP-11 C code these days).
In C, the compiler can generally optimize them to be the same if the result is unused.
However, in C++ if using other types that provide their own ++ operators, the prefix version is likely to be faster than the postfix version. So, if you don't need the postfix semantics, it is better to use the prefix operator.
I can think of a situation where postfix is slower than prefix increment:
Imagine a processor in which register A is used as the accumulator and is the only register usable in many instructions (some small microcontrollers are actually like this).
Now imagine the following statements and their translation into a hypothetical assembly:
Prefix increment:
a = ++b + c;
; increment b
LD A, [&b]
INC A
ST A, [&b]
; add with c
ADD A, [&c]
; store in a
ST A, [&a]
Postfix increment:
a = b++ + c;
; load b
LD A, [&b]
; add with c
ADD A, [&c]
; store in a
ST A, [&a]
; increment b
LD A, [&b]
INC A
ST A, [&b]
Note how the value of b was forced to be reloaded. With prefix increment, the compiler can just increment the value and go ahead with using it, possibly avoiding a reload since the desired value is already in the register after the increment. However, with postfix increment, the compiler has to deal with two values, the old one and the incremented one, which, as I show above, results in one more memory access.
Of course, if the value of the increment is not used, such as a single i++; statement, the compiler can (and does) simply generate an increment instruction regardless of postfix or prefix usage.
As a side note, I'd like to mention that an expression in which there is a b++ cannot simply be converted to one with ++b without any additional effort (for example by adding a - 1). So comparing the two if they are part of some expression is not really valid. Often, where you use b++ inside an expression you cannot use ++b, so even if ++b were potentially more efficient, it would simply be wrong. Exception is of course if the expression is begging for it (for example a = b++ + 1; which can be changed to a = ++b;).
I always prefer pre-increment, however ...
I wanted to point out that even in the case of calling the operator++ function, the compiler will be able to optimize away the temporary if the function gets inlined. Since the operator++ is usually short and often implemented in the header, it is likely to get inlined.
So, for practical purposes, there likely isn't much of a difference between the performance of the two forms. However, I always prefer pre-increment since it seems better to directly express what I'm trying to say, rather than relying on the optimizer to figure it out.
Also, giving the optimizer less to do likely means the compiler runs faster.
My C is a little rusty, so I apologize in advance. Speedwise, I can understand the results. But, I am confused as to how both files came out to the same MD5 hash. Maybe a for loop runs the same, but wouldn't the following 2 lines of code generate different assembly?
myArray[i++] = "hello";
vs
myArray[++i] = "hello";
The first one writes the value to the array, then increments i. The second increments i then writes to the array. I'm no assembly expert, but I just don't see how the same executable would be generated by these 2 different lines of code.
Just my two cents.
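For what it's worth, those two lines do behave differently, because here the value of the increment expression is used as the index; the identical MD5 hashes only apply to the for loop, where the value is discarded. A small sketch (the helper names and the 3-slot std::array are just for illustration):

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper: write "hello" using postfix increment, starting at i == 0.
inline std::array<std::string, 3> write_postfix() {
    std::array<std::string, 3> arr{};
    std::size_t i = 0;
    arr[i++] = "hello";  // indexes with the old i (0), then i becomes 1
    return arr;
}

// Hypothetical helper: write "hello" using prefix increment, starting at i == 0.
inline std::array<std::string, 3> write_prefix() {
    std::array<std::string, 3> arr{};
    std::size_t i = 0;
    arr[++i] = "hello";  // i becomes 1 first, then slot 1 is written
    return arr;
}
```

The postfix version fills slot 0 and the prefix version fills slot 1, so the compiler has to generate different code for them.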

Concurrent/Asynchronous access to shared data

I searched around a bit, but could not find anything useful. Could someone help me with this concurrency/synchronization problem?
Given five instances of the program below running asynchronously, with s being a shared data with an initial value of 0 and i a local variable, which values are obtainable by s?
for (i = 0; i < 5; i ++) {
s = s + 1;
}
(1) 2
(2) 1
(3) 6
I would like to know which values, and why exactly.
The non-answering answer is: Uaaaagh, don't do such a thing.
An answer more in the sense of your question is: In principle, any value is possible, because it is totally undefined. You have no strict guarantee that concurrent writes are atomic in any way and don't result in complete garbage.
In practice, writes no larger than a machine word are atomic on common hardware (as far as I know, at least), but they do not have a defined order. Also, you usually don't know in which order threads/processes are scheduled. So you will never see a "random garbage" value, but also you cannot know what it will be. It will be anything from 2 up to 25.
Since no atomic increment is used, there is a race between reading the value, incrementing it, and writing it back. If the value is being written by another instance before the result is written back, the write (and thus increment) that finished earlier has no effect. If that does not happen, both increments are effective.
It is tempting to say that each instance increments the value 5 times, so nothing below 5 can result, but a stale write can wipe out many increments at once. For example: instance A reads 0; instance B completes four iterations; A writes 1; B reads 1 for its last iteration; A and the other three instances run to completion; finally B writes 2. On the other hand, the very last write always comes from some instance's final iteration, which read a value of at least 1, so the result cannot be below 2. Therefore, apart from the theoretical "total garbage" possibility, a final value of 1 (option 2) is not possible, but 2 (option 1) and 6 (option 3) are.
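The lost-increment mechanism can be replayed deterministically. This sketch does not use real threads; it assumes each s = s + 1 is a separate load and store, and the Instance type is made up to stand in for one running copy of the program.

```cpp
// Hypothetical model: each instance performs "load s into a register" and
// "store register + 1 back to s" as two separate steps that may interleave.
struct Instance {
    int reg = 0;
    void load(const int& s) { reg = s; }
    void store(int& s) const { s = reg + 1; }
};

// Both instances load before either stores: one increment is lost.
inline int interleaved_increment() {
    int s = 0;
    Instance a, b;
    a.load(s);   // a sees 0
    b.load(s);   // b also sees 0
    a.store(s);  // s becomes 1
    b.store(s);  // s becomes 1 again: a's increment is overwritten
    return s;
}

// Each instance finishes its load/store pair before the other starts.
inline int serialized_increment() {
    int s = 0;
    Instance a, b;
    a.load(s);
    a.store(s);  // s becomes 1
    b.load(s);
    b.store(s);  // s becomes 2
    return s;
}
```

Two increments give 1 in the interleaved schedule and 2 in the serialized one; which schedule you get in the real program depends entirely on timing.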

What does `std::kill_dependency` do, and why would I want to use it?

I've been reading about the new C++11 memory model and I've come upon the std::kill_dependency function (§29.3/14-15). I'm struggling to understand why I would ever want to use it.
I found an example in the N2664 proposal but it didn't help much.
It starts by showing code without std::kill_dependency. Here, the first line carries a dependency into the second, which carries a dependency into the indexing operation, and then carries a dependency into the do_something_with function.
r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(a[r2]);
There is a further example that uses std::kill_dependency to break the dependency between the second line and the indexing.
r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(a[std::kill_dependency(r2)]);
As far as I can tell, this means that the indexing and the call to do_something_with are not dependency ordered before the second line. According to N2664:
This allows the compiler to reorder the call to do_something_with, for example, by performing speculative optimizations that predict the value of a[r2].
In order to make the call to do_something_with the value a[r2] is needed. If, hypothetically, the compiler "knows" that the array is filled with zeros, it can optimize that call to do_something_with(0); and reorder this call relative to the other two instructions as it pleases. It could produce any of:
// 1
r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(0);
// 2
r1 = x.load(memory_order_consume);
do_something_with(0);
r2 = r1->index;
// 3
do_something_with(0);
r1 = x.load(memory_order_consume);
r2 = r1->index;
Is my understanding correct?
If do_something_with synchronizes with another thread by some other means, what does this mean with respect to the ordering of the x.load call and this other thread?
Assuming my understanding is correct, there's still one thing that bugs me: when I'm writing code, what reasons would lead me to choose to kill a dependency?
The purpose of memory_order_consume is to ensure the compiler does not do certain unfortunate optimizations that may break lockless algorithms. For example, consider this code:
int t;
volatile int a, b;
t = *x;
a = t;
b = t;
A conforming compiler may transform this into:
a = *x;
b = *x;
Thus, a may not equal b. It may also do:
t2 = *x;
// use t2 somewhere
// later
t = *x;
a = t2;
b = t;
By using load(memory_order_consume), we require that uses of the value being loaded not be moved prior to the point of use. In other words,
t = x.load(memory_order_consume);
a = t;
b = t;
assert(a == b); // always true
The standard document considers a case where you may only be interested in ordering certain fields of a structure. The example is:
r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(a[std::kill_dependency(r2)]);
This instructs the compiler that it is allowed to, effectively, do this:
predicted_r2 = x->index; // unordered load
r1 = x; // ordered load
r2 = r1->index;
do_something_with(a[predicted_r2]); // may be faster than waiting for r2's value to be available
Or even this:
predicted_r2 = x->index; // unordered load
predicted_a = a[predicted_r2]; // get the CPU loading it early on
r1 = x; // ordered load
r2 = r1->index; // ordered load
do_something_with(predicted_a);
If the compiler knows that do_something_with won't change the result of the loads for r1 or r2, then it can even hoist it all the way up:
do_something_with(a[x->index]); // completely unordered
r1 = x; // ordered
r2 = r1->index; // ordered
This allows the compiler a little more freedom in its optimization.
In addition to the other answer, I will point out that Scott Meyers, one of the definitive leaders in the C++ community, bashed memory_order_consume pretty strongly. He basically said that he believed it had no place in the standard. He said there are two cases where memory_order_consume has any effect:
Exotic architectures designed to support 1024+ core shared memory machines.
The DEC Alpha
Yes, once again, the DEC Alpha finds its way into infamy by using an optimization not seen in any other chip until many years later on absurdly specialized machines.
The particular optimization is that those processors allow one to dereference a field before actually getting the address of that field (i.e. it can look up x->y BEFORE it even looks up x, using a predicted value of x). It then goes back and determines whether x was the value it expected it to be. On success, it saved time. On failure, it has to go back and get x->y again.
memory_order_consume tells the compiler/architecture that these operations have to happen in order. However, in the most useful case, one will end up wanting to do (x->y.z), where z doesn't change. memory_order_consume would force the compiler to keep x, y and z in order. kill_dependency(x->y).z tells the compiler/architecture that it may resume doing such nefarious reorderings.
99.999% of developers will probably never work on a platform where this feature is required (or has any effect at all).
The usual use case of kill_dependency arises from the following. Suppose you want to do atomic updates to a nontrivial shared data structure. A typical way to do this is to nonatomically create some new data and to atomically swing a pointer from the data structure to the new data. Once you do this, you are not going to change the new data until you have swung the pointer away from it to something else (and waited for all readers to vacate). This paradigm is widely used, e.g. read-copy-update in the Linux kernel.
Now, suppose the reader reads the pointer, reads the new data, and comes back later and reads the pointer again, finding that the pointer hasn't changed. The hardware can't tell that the pointer hasn't been updated again, so by consume semantics it can't use a cached copy of the data but has to read it again from memory. (Or to think of it another way, the hardware and compiler can't speculatively move the read of the data up before the read of the pointer.)
This is where kill_dependency comes to the rescue. By wrapping the pointer in a kill_dependency, you create a value that will no longer propagate dependency, allowing accesses through the pointer to use the cached copy of the new data.
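As a compilable version of the N2664-style reader, here is a minimal sketch. Node, read_entry and the table parameter are hypothetical names standing in for the x, r1, r2 and a of the example; only std::atomic, memory_order_consume and std::kill_dependency come from the standard library.

```cpp
#include <atomic>

// Hypothetical payload published through an atomic pointer.
struct Node {
    int index;
};

// Reader side: `table` stands in for the array `a` of the example.
inline int read_entry(const std::atomic<Node*>& x, const int* table) {
    Node* r1 = x.load(std::memory_order_consume);
    int r2 = r1->index;  // carries a dependency from the consume load
    // kill_dependency returns its argument unchanged but ends the dependency
    // chain, so the table lookup is no longer ordered by the consume load.
    return table[std::kill_dependency(r2)];
}
```

The returned value is the same with or without kill_dependency; the call only changes which reorderings the implementation is allowed to perform.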

C++ test to verify equality operator is kept consistent with struct over time

I voted up @TomalakGeretkal for a good note about by-contract; I haven't accepted an answer, as my question is how to programmatically check the equals function.
I have a POD struct & an equality operator, a (very) small part of a system with >100 engineers.
Over time I expect the struct to be modified (members added/removed/reordered) and I want to write a test to verify that the equality op is testing every member of the struct (eg is kept up to date as the struct changes).
As Tomalak pointed out - comments & "by contract" is often the best/only way to enforce this; however in my situation I expect issues and want to explore whether there are any ways to proactively catch (at least many) of the modifications.
I'm not coming up with a satisfactory answer - this is the best I've thought of:
- new up two instances of the struct (x, y), fill each with identical non-zero data.
- check x==y
- modify x "byte by byte":
  - take ptr to be (unsigned char*)&x
  - iterate over ptr (for sizeof(x))
  - increment the current byte
  - check !(x==y)
  - decrement the current byte
  - check x==y
The test passes if the equality operator caught every byte (NOTE: there is a caveat to this: not all bytes are used in the compiler's representation of x, therefore the test would have to 'skip' these bytes, e.g. hard-code the bytes to ignore)
My proposed test has significant problems: (at least) the 'don't care' bytes, and the fact that incrementing one byte of the types in x may not result in a valid value for the variable at that memory location.
Any better solutions?
(This shouldn't matter, but I'm using VS2008, rtti is off, googletest suite)
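For concreteness, the byte-by-byte idea could be sketched roughly as below. Everything here is hypothetical: the Packet struct is chosen so that it has no padding and only integer members (which sidesteps both caveats rather than solving them), and the code assumes a C++11 compiler, which is newer than VS2008.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical POD with no padding bytes (4 + 2 + 2 bytes on common ABIs).
struct Packet {
    std::uint32_t id;
    std::uint16_t kind;
    std::uint16_t flags;
};
static_assert(sizeof(Packet) == 8, "example assumes no padding");

// Up-to-date comparison: looks at every member.
inline bool operator==(const Packet& l, const Packet& r) {
    return l.id == r.id && l.kind == r.kind && l.flags == r.flags;
}

// Deliberately stale comparison that forgets `flags`.
inline bool stale_equal(const Packet& l, const Packet& r) {
    return l.id == r.id && l.kind == r.kind;
}

// Perturb each byte of x in turn; true only if `eq` notices every change.
template <typename T, typename Eq>
bool equality_sees_every_byte(const T& proto, Eq eq) {
    T x, y;
    std::memcpy(&x, &proto, sizeof(T));
    std::memcpy(&y, &proto, sizeof(T));
    if (!eq(x, y)) return false;
    unsigned char* bytes = reinterpret_cast<unsigned char*>(&x);
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        ++bytes[i];                // change one byte
        bool noticed = !eq(x, y);
        --bytes[i];                // restore it
        if (!noticed || !eq(x, y)) return false;
    }
    return true;
}
```

The check passes for the real operator== and fails for stale_equal, since changes to the bytes of flags go unnoticed there.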
Though tempting to make code 'fool-proof' with self-checks like this, it's my experience that keeping the self-checks themselves fool-proof is, well, a fool's errand.
Keep it simple and localise the effect of any changes. Write a comment in the struct definition making it clear that the equality operator must also be updated if the struct is; then, if this fails, it's just the programmer's fault.
I know that this will not seem optimal to you as it leaves the potential for user error in the future, but in reality you can't get around this (at least without making your code horrendously complicated), and often it's most practical just not to bother.
I agree with (and upvoted) Tomalak's answer. It's unlikely that you'll find a foolproof solution. Nonetheless, one simple semi-automated approach could be to validate the expected size within the equality operator:
bool MyStruct::operator==(const MyStruct &rhs) const
{
    assert(sizeof(MyStruct) == 42); // reminder to update as new members added
    // actual functionality here ...
}
This way, if any new members are added, the assert will fire until someone updates the equality operator. This isn't foolproof, of course. (Member vars might be replaced with something of same size, etc.) Nonetheless, it's a relatively simple (one line assert) that has a good shot of detecting the error case.
I'm sure I'm going to get downvoted for this but...
How about a template equality function that takes a reference to an int parameter, and the two objects being tested. The equality function will return bool, but will increment the size reference (int) by the sizeof(T).
Then have a large test function that calls the template for each object and sums the total size --> compare this sum with the sizeof the object. The existence of virtual functions/inheritance, etc could kill this idea.
it's actually a difficult problem to solve correctly in a self-test.
the easiest solution i can think of is to take a few template functions which operate on multiple types, perform the necessary conversions, promotions, and comparisons, then verify the result in an external unit test. when a breaking change is introduced, at least you'll know.
some of these challenges are more easily maintained/verified using approaches such as composition, rather than extension/subclassing.
Agree with Tomalak and Eric. I have used this for very similar problems.
assert does nothing when NDEBUG is defined (as it typically is in release builds), so potentially you can release code that is wrong. These tests will not always work reliably either. If the structure contains bit fields, or items are inserted that take up slack space caused by the compiler aligning to word boundaries, the size won't change. For this reason they offer limited value, e.g.
struct MyStruct {
    char a;
    ulong l;
};
changed to
struct MyStruct {
    char a;
    char b;
    ulong l;
};
Both structures are 8 bytes (on 32bit Linux x86)
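The same effect can be checked at compile time. A sketch, using the hypothetical names Before and After for the two versions and unsigned long in place of ulong; the exact sizes depend on the ABI, but on typical platforms the two come out equal:

```cpp
#include <cstddef>

// Hypothetical "before" version of the struct.
struct Before {
    char a;
    unsigned long l;
};

// Hypothetical "after" version: the new member lands in what used to be padding.
struct After {
    char a;
    char b;
    unsigned long l;
};

// True wherever unsigned long is aligned to at least 2 bytes, which covers
// the usual 32-bit and 64-bit ABIs; a sizeof check cannot tell the two apart.
constexpr bool same_size = sizeof(Before) == sizeof(After);
```

So a sizeof assertion inside operator== would stay silent after this change.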