Standardized ways to pack multiple values into one atomic - c++

Assuming I have two atomic variables of type int32, I could instead choose to represent them as a single std::atomic<int64> both and reserve the first 32 bits for my first int and the last 32 for my second int.
This seems like quite a space & time saver on x64 architectures, not to mention it allows for all sorts of black magic since one can abstract over various operations and make them atomic:
first == a && second ==b
becomes
both == ( int64(a) + (int64(b) << 32) )
//Or some such... I'm not 100% sure this is correct but you get the idea
The one problem with this trick that I see is that I'm not particularly fond of operating at the bit level, and C++ is not very kind when it comes to operations at the bit level, especially once you try to accomplish more complex operations or pack more than two variables (e.g. two numbers and several bools) into the same atomic.
So I'm wondering if there is a standardized way to apply this kind of trick. A pattern or even std functionality that is easily recognizable by other coders when seen and easier to work with for the implementer? Likewise, is this pattern useful enough to warrant such a standardization, or does its usefulness quickly become obsolete when compared to the possible annoyances and UB it can bring?
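For concreteness, here is a minimal sketch of the packing described above. The helper names (pack, first_of, second_of, both_equal) are hypothetical, and the layout (first value in the low 32 bits, second in the high 32 bits) is an assumption. Adding the two halves as signed 64-bit values goes wrong when the low value is negative, because its sign extension spills into the upper half, so the sketch converts through uint32_t first. Converting an out-of-range uint32_t back to int32_t is implementation-defined before C++20 and wraps modulo 2^32 from C++20 on.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout: 'first' in the low 32 bits, 'second' in the high 32 bits.
constexpr std::uint64_t pack(std::int32_t first, std::int32_t second)
{
    // Converting to uint32_t first keeps a negative 'first' from
    // sign-extending into the half that belongs to 'second'.
    return std::uint64_t{static_cast<std::uint32_t>(first)}
         | (std::uint64_t{static_cast<std::uint32_t>(second)} << 32);
}

constexpr std::int32_t first_of(std::uint64_t both)
{
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(both));
}

constexpr std::int32_t second_of(std::uint64_t both)
{
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(both >> 32));
}

// A single atomic load answers "first == a && second == b" for one instant.
inline bool both_equal(const std::atomic<std::uint64_t>& both,
                       std::int32_t a, std::int32_t b)
{
    return both.load() == pack(a, b);
}
```

Using an unsigned 64-bit carrier avoids shifting into the sign bit, which is one of the easier ways to run into UB with this trick.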

The way to get around Read-Then-Write with atomics is using a loop:
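void setBit(std::atomic<int64_t>& bitset, int bit)
{
    int64_t val = 1LL << bit;
    int64_t prev = bitset.load();
    // Retry until the bit is already set or our exchange succeeds;
    // a failed compare_exchange_weak refreshes prev with the current value.
    while (!(prev & val) &&
           !bitset.compare_exchange_weak(prev, prev | val))
        ;
}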
You can extend this method to create generic bitwise operation functions
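One way that generalization might look is sketched below. The names atomicUpdate and setField are hypothetical, and the 3-bit field at bits 4..6 is only an illustrative layout. The helper applies an arbitrary pure function to the current value inside a compare-exchange loop, so any read-modify-write on the packed word becomes atomic.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: atomically replaces the stored value with op(old value)
// and returns the old value. 'op' may run more than once, so it should be
// free of side effects.
template <typename Op>
std::int64_t atomicUpdate(std::atomic<std::int64_t>& bits, Op op)
{
    std::int64_t prev = bits.load();
    // On failure compare_exchange_weak reloads 'prev', so op is re-applied
    // to the current value on the next iteration.
    while (!bits.compare_exchange_weak(prev, op(prev)))
        ;
    return prev;
}

// Example use: store a 3-bit field at bits 4..6 without disturbing other bits.
inline void setField(std::atomic<std::int64_t>& bits, std::int64_t value)
{
    const std::int64_t mask = std::int64_t{7} << 4;
    atomicUpdate(bits, [=](std::int64_t old) {
        return (old & ~mask) | ((value << 4) & mask);
    });
}
```

When the update is a single OR, AND or XOR with a constant, the fetch_or, fetch_and and fetch_xor members of std::atomic do the same job without a hand-written loop.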

Related

How to effectively apply bitwise operation to (large) packed bit vectors?

I want to implement
void bitwise_and(
char* __restrict__ result,
const char* __restrict__ lhs,
const char* __restrict__ rhs,
size_t length);
or maybe a bitwise_or(), bitwise_xor() or any other bitwise operation. Obviously it's not about the algorithm, just the implementation details - alignment, loading the largest possible element from memory, cache-awareness, using SIMD instructions etc.
I'm sure this has (more than one) fast existing implementations, but I would guess most library implementations would require some fancy container, e.g. std::bitset or boost::dynamic_bitset - but I don't want to spend the time constructing one of those.
So do I... Copy-paste from an existing library? Find a library which can 'wrap' a raw packed bits array in memory with a nice object? Roll my own implementation anyway?
Notes:
I'm mostly interested in C++ code, but I certainly don't mind a plain C approach.
Obviously, making copies of the input arrays is out of the question - that would probably nearly-double the execution time.
I intentionally did not template the bitwise operator, in case there's some specific optimization for OR, or for AND etc.
Bonus points for discussing operations on multiple vectors at once, e.g. V_out = V_1 bitwise-and V_2 bitwise-and V_3 etc.
I noted this article comparing library implementations, but it's from 5 years ago. I can't ask which library to use since that would violate SO policy I guess...
If it helps you any, assume it's uint64_ts rather than chars (that doesn't really matter - if the char array is unaligned we can just treat the heading and trailing chars separately).
This answer is going to assume you want the fastest possible way and are happy to use platform-specific things. Your optimising compiler may be able to produce similar code to the below from normal C, but in my experience across a few compilers something as specific as this is still best hand-written.
Obviously like all optimisation tasks, never assume anything is better/worse and measure, measure, measure.
If you could lock down your architecture to x86 with at least SSE2 (the intrinsics below are all SSE2) you would do:
#include <emmintrin.h> // SSE2 intrinsics
#include <cstddef>     // size_t

void bitwise_and(
    char* result,
    const char* lhs,
    const char* rhs,
    size_t length)
{
    while (length >= 16)
    {
        // Load in 16-byte registers (unaligned loads)
        auto lhsReg = _mm_loadu_si128((const __m128i*)lhs);
        auto rhsReg = _mm_loadu_si128((const __m128i*)rhs);
        // do the op
        auto res = _mm_and_si128(lhsReg, rhsReg);
        // save off again
        _mm_storeu_si128((__m128i*)result, res);
        // book keeping
        length -= 16;
        result += 16;
        lhs += 16;
        rhs += 16;
    }
    // do the tail end. Assuming that the array is large, the
    // most that the following code can be run is 15 times so I'm not
    // bothering to optimise. You could do it in 64 bit then 32 bit
    // then 16 bit then char chunks if you wanted...
    while (length)
    {
        *result = *lhs & *rhs;
        length -= 1;
        result += 1;
        lhs += 1;
        rhs += 1;
    }
}
This compiles to ~10 asm instructions per 16 bytes (plus change for the leftover and a little overhead).
The great thing about using intrinsics like this (over hand-rolled asm) is that the compiler is still free to do additional optimisations (such as loop unrolling) on top of what you write. It also handles register allocation.
If you could guarantee aligned data you could save an asm instruction (use _mm_load_si128 instead, and the compiler will be clever enough to avoid a second load and use it as a direct memory operand to the 'pand').
If you could guarantee AVX2+ then you could use the 256-bit version and handle 32 bytes in ~10 asm instructions.
On ARM there are similar NEON instructions.
If you wanted to do multiple ops, just add the relevant intrinsic in the middle and it'll add 1 asm instruction per 16 bytes.
I'm pretty sure that with a decent processor you don't need any additional cache control.
Don't do it this way. The individual operations will look great, sleek asm, nice performance .. but a composition of them will be terrible. You cannot make this abstraction, nice as it looks. The arithmetic intensity of those kernels is almost the worst possible (the only worse one is doing no arithmetic, such as a straight up copy), and composing them at a high level will retain that awful property. In a sequence of operations each using the result of the previous one, the results are written and read again a lot later (in the next kernel), even though the high level flow could be transposed so that the result the "next operation" needs is right there in a register. Also, if the same argument appears twice in an expression tree (and not both as operands to one operation), they will be streamed in twice, instead of reusing the data for two operations.
It doesn't have that nice warm fuzzy feeling of "look at all this lovely abstraction" about it, but what you should do is find out at a high level how you're combining your vectors, and then try to chop that in pieces that make sense from a performance perspective. In some cases that may mean making big ugly messy loops that will make people get an extra coffee before diving in, that's just too bad then. If you want performance, you often have to sacrifice something else. Usually it's not so bad, it probably just means you have a loop that has an expression consisting of intrinsics in it, instead of an expression of vector-operations that each individually have a loop.
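As a rough illustration of that last point, here is a sketch of a fused loop for one hypothetical expression, out = (a & b) | c, written with plain uint64_t words rather than intrinsics. The function name and the choice of expression are assumptions made for the example. Each input word is loaded once and the intermediate a & b never touches memory, whereas composing a separate bitwise_and() and bitwise_or() would write the intermediate vector out and stream it back in.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical fused kernel: out = (a & b) | c in a single pass.
// The intermediate (a & b) stays in a register for each word.
inline void and_then_or(std::uint64_t* out,
                        const std::uint64_t* a,
                        const std::uint64_t* b,
                        const std::uint64_t* c,
                        std::size_t words)
{
    for (std::size_t i = 0; i < words; ++i)
        out[i] = (a[i] & b[i]) | c[i];
}
```

A loop of this shape is also a straightforward candidate for the compiler's auto-vectorizer, though whether it actually vectorizes should be checked on the target compiler.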

What C++ type use for fastest "for cycles"?

I think this is not answered on this site yet.
I wrote code that goes through many combinations of 4 numbers. The values range from 0 to 51, so they fit in 6 bits and therefore in 1 byte, am I right? I use these 4 numbers as counters in nested for loops and then use them in the innermost loop. So which C++ type, among those that can store at least 52 values, is the fastest for iterating through 4 nested for loops?
The code looks like:
for(type first = 0; first != 49; ++first)
    for(type second = first+1; second != 50; ++second)
        for(type third = second+1; third != 51; ++third)
            for(type fourth = third+1; fourth != 52; ++fourth) {
                //using those values for about 1 billion bit operations made in other for loops
            }
That code is very simplified, and maybe there is also a better way to do this kind of iteration; you can help me with that too.
Use the typedef std::uint_fast8_t from the header <cstdint>. It is supposed to be the "fastest" unsigned integer type with at least 8 bits.
The fastest is whatever the underlying processor ALU can natively work with. Now registers may be addressable in multiple formats. In that case all those formats are equally fast.
So this becomes very processor architecture specific rather than C++ specific.
If you are working on a modern day PC processor then an int is as fast as anything else for your for loops.
On an embedded system there are more things to consider, e.g. whether the variable is stored in an aligned location or not.
On most machines, int is the fastest integer type. On all of the computers I work with, int is faster than unsigned, significantly faster than signed char.
Another issue, perhaps a bigger one, is what you are doing with those numbers. You didn't show the code, so there's no way of telling. Use int if you expect first*second to produce the expected integral value.
Yet another issue is how widely portable you expect this code to be. There's a huge distinction between code that will be ported to a number of different architectures, different compilers versus code that will be used in a limited and controlled setting. If it's the latter, write some benchmarks, and use the type under which the benchmarks perform best. The problem is a bit tougher if you are writing something for wide consumption.
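To follow the advice about benchmarking, the loop nest can be turned into a template so the counter type is the only thing that changes between runs. The sketch below is a hypothetical harness body (the name countCombinations is made up); it only counts the combinations visited so that the loops have an observable result. Timing it, for example with std::chrono::steady_clock around each call, is left to the reader.

```cpp
#include <cstdint>

// Hypothetical benchmark body: visits every combination first < second <
// third < fourth drawn from 0..51 and counts them. Swap T to compare types.
template <typename T>
std::uint64_t countCombinations()
{
    std::uint64_t count = 0;
    for (T first = 0; first != 49; ++first)
        for (T second = first + 1; second != 50; ++second)
            for (T third = second + 1; third != 51; ++third)
                for (T fourth = third + 1; fourth != 52; ++fourth)
                    ++count;
    return count;
}
```

Both countCombinations<int>() and countCombinations<std::uint_fast8_t>() should return 270725, the number of ways to choose 4 values out of 52. With only about 270 thousand iterations of the loop nest itself, the counter type is unlikely to matter next to the work done in the innermost body.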

Why use the '+' operator when '|' is perfectly good?

This is more of a philosophical question, but I've seen this a bunch of times in codebases here and there and do not really understand how this programming method came to be.
Suppose you have to set bits 2 and 3 to some value x without changing the other values in the uint. Doing so is pretty trivial and a common task, and I would be inclined to do it this way:
uint8_t someval = 0xFF; //some random previous value
uint8_t x = 0x2; //some random value to assign.
someval = (someval & ~0xC) | (x << 2); //Set the value to 0x2 for bits 2-3
I've seen code that instead of using '|' uses '+':
uint8_t someval = 0xFF; //some random previous value
uint8_t x = 0x2; //some random value to assign.
someval = (someval & ~0xC) + (x << 2); //Set the value to 0x2 for bits 2-3
Are they equivalent?
Yes.
Is one better than the other?
Only if your hardware doesn't have a bitwise OR instruction, but I have never ever ever seen a processor that didn't have a bitwise OR (even small PIC10 processors have an OR instruction).
So why would some programmers be inclined to use '+' instead of '|'? Am I missing some really obvious, uber powerful optimization here?
If you want to perform bitwise operations, use bitwise operators.
If you want to perform arithmetic operations, use arithmetic operators.
It's true that for some values some arithmetic operations can be implemented as simple bitwise operations, but that's essentially an implementation detail to which you should never expose your readers. First and foremost the logic of the code should be clear and if possible self-explanatory. The compiler will choose appropriate low-level operations for you to implement your desire.
That's being philanthropic.
Are they equivalent?
Yes, as long as the bitfield being written to is clear beforehand. Otherwise, they'll go wrong in slightly different ways.
Is one better than the other?
No, although some would say that bitwise operations express the intent more clearly.
So why would some programmers be inclined to use '+' instead of '|'?
Because they're equivalent, and neither is particularly better than the other.
Am I missing some really obvious, uber powerful optimization here?
No.
So why would some programmers be inclined to use '+' instead of '|'?
+ could bring out logical bugs faster. a | a would appear to work, whereas a simple a + a definitely wouldn't (of course, depends on the logic, but the + version is more error-prone).
Of course you should stick to the standard way of doing things (use bitwise operations when you want a bitwise operation, and arithmetic operations when you want to do math).
It's just a question of style. Any modern CPU will complete both operations in the same number of cycles (typically 1). Personally I prefer | in these cases since it more explicitly states to the code reader that you're doing bit twiddling instead of arithmetic.
If you have a bug in your code, then using + could lead to strange behavior, whereas using | would tend to mask the bug. For example, if you accidentally include the same bit twice, ORing it again is a no-op, but adding it will clear the bit and carry up into the next bit (and possibly farther, if more bits are set). So that would usually lead to fail-fast behavior instead of failure-masking behavior, which is generally preferable.
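The difference the answers describe can be seen in a small sketch. The four function names are hypothetical. The first two clear bits 2-3 before writing and agree with each other; the last two model the bug of forgetting to clear the field, where OR silently leaves the old bits in place while addition carries into the neighbouring bit.

```cpp
#include <cstdint>

// Writes the 2-bit value x into bits 2-3 using OR (field cleared first).
constexpr std::uint8_t setWithOr(std::uint8_t v, std::uint8_t x)
{
    return static_cast<std::uint8_t>((v & ~0xC) | (x << 2));
}

// Same operation using addition; identical result because the field is clear.
constexpr std::uint8_t setWithAdd(std::uint8_t v, std::uint8_t x)
{
    return static_cast<std::uint8_t>((v & ~0xC) + (x << 2));
}

// Buggy variants that forget to clear the field before writing.
constexpr std::uint8_t orWithoutClear(std::uint8_t v, std::uint8_t x)
{
    return static_cast<std::uint8_t>(v | (x << 2));
}

constexpr std::uint8_t addWithoutClear(std::uint8_t v, std::uint8_t x)
{
    return static_cast<std::uint8_t>(v + (x << 2));
}
```

With v = 0x08 (bit 3 already set) and x = 2, orWithoutClear leaves 0x08 untouched, which looks like success, while addWithoutClear yields 0x10: bit 3 is cleared and the carry lands in bit 4, outside the field.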

make an integer even

Sometimes I need to be sure that some integer is even. As such I could use the following code:
int number = /* magic initialization here */;
// make sure the number is even
if ( number % 2 != 0 ) {
    number--;
}
but that does not seem to be the most efficient way to do it, so I could do the following:
int number = /* magic initialization here */;
// make sure the number is even
number &= ~1;
but (besides not being readable) I am not sure that solution is completely portable.
Which solution do you think is best?
Is the second solution completely portable?
Is the second solution considerably faster than the first?
What other solutions do you know for this problem?
What if I do this inside an inline method? It should (theoretically) be as fast as these solutions and readability should no longer be an issue, does that make the second solution more viable?
note: This code is supposed to only work with positive integers but having a solution that also works with negative numbers would be a plus.
Personally, I'd go with an inline helper function.
inline int make_even(int n)
{
    return n - n % 2;
}
// ....
int m = make_even(n);
Before accepting an answer I will make my own that tries to summarize and
complete some of the information found here:
Four possible methods were described (and some small variations of these).
1. if (number % 2 != 0) { number--; }
2. number &= ~1;
3. number = number - (number % 2);
4. number = (number / 2) * 2;
Before proceeding any further let me clarify something:
The expected gain for using any of these methods is minimal, even if we could
prove that one method is 200% faster than the others the worst one is so fast
that the only way to have visible gain in speed would be if this method was
called many times in a CPU bound application. As such this is more of an
exercise for fun than a real optimization.
Analysis
Readability
As far as readability goes I would rank method 1 as the most readable,
method 4 as the second best and method 2 as the worst.
People are free to disagree but I ranked them like this because:
In method 1 it is as explicit as possible that if the number is odd you
want to subtract from it making it even.
Method 4 is also very much explicit but I ranked it second because at
first glance you might think it is doing nothing, and only a fraction of a
second later you're like "Oh... Integer division.".
Method 2 and 3 are almost equivalent in terms of readability, but many
people are not used to bitwise operations and as such I ranked method 2 as
the worst.
With that in mind I would add that it is generally accepted that the best way
to implement this is using an inline function, and none of the options is
that unreadable, readability is not really an issue (direct uses in the code
are explicit and clear and reading the method will never be that hard).
If you don't want to use an inline method I would recommend that you only use
method 1 or method 4.
Compatibility issues
Underflow
It has been mentioned that method 1 may underflow, depending on the way the
processor represents integers. Just to be sure you can add the following
STATIC_ASSERT when using method 1.
STATIC_ASSERT(INT_MIN % 2 == 0, make_even_may_underflow);
As for method 3, even when INT_MIN is not even it may not underflow
depending on whether the result has the same sign of the divisor or the
dividend. Having the same sign of the divisor never underflows because
INT_MIN - (-1) is closer to 0.
Add the following STATIC_ASSERT just to be sure:
STATIC_ASSERT(INT_MIN % 2 == 0 || -1 % 2 < 0, make_even_may_underflow);
Of course you can still use these methods when the STATIC_ASSERT fails since
it would only be a problem when you pass INT_MIN to your make_even method,
but I would STRONGLY advise against it.
(Un)supported bit representations
When using method 2 you should make sure your compiler bit representation
behaves as expected:
STATIC_ASSERT( (1 & ~1) == 0, unsupported_bit_representation);
// two's complement OR sign-and-magnitude.
STATIC_ASSERT( (-3 & ~1) == -4 || (-3 & ~1) == -2 , unsupported_bit_representation);
Speed
I also did some naive speed tests using the Unix time utility. I ran every
different method (and its variations) 4 times and recorded the results,
since the results didn't vary much I didn't find it necessary to run more tests.
The obtained results show method 4 and method 2 as the fastest of them
all.
Conclusion
According to the provided information, I would recommend using method 4. It's
readable, I am not aware of any compatibility issues, and it performs great.
I hope you enjoy this answer and use the information contained here to make
your own informed choice. :)
The source code is available if you want to check my results. Please note
that the tests were compiled using g++ and run on Mac OS X. Different
platforms and compilers may give different results.
int even_number = (number / 2) * 2;
This should work regardless of architecture, as long as the optimizer does not get in the way (it shouldn't, but who knows).
I would use the second solution. In any binary representation, regardless of the number of bits, big-endian vs. little-endian, or other architecture differences, that operation will have the effect of setting the lowest bit to zero. It's fast and completely portable. The intent of the code can be explained via comments, if you meet any poor C programmers who can't figure out what it means.
The &= solution looks best to me. If you want to make it more portable and more readable:
const int MakeEven = -2;
int number = /* magic initialization here */;
// Make sure number is even
number &= MakeEven;
The second solution should be considerably faster than the first. Is it completely portable? Most likely, although there's probably some computer somewhere that does math differently.
This should work for positive and negative integers.
Use your second solution as inline function and put static assert into implementation of it to document and test that it works on platform that it is compiled on.
BOOST_STATIC_ASSERT( (1 & ~1) == 0 );
BOOST_STATIC_ASSERT( (-1 & ~1) == -2 );
Your second solution only works if your sign representation is "two's complement" or "sign and magnitude". To do it in place I'd go with suszterpatt's variant, which should (almost) always work
number -= (number % 2);
You don't know for sure in which direction this will "round" for negative values, so in extreme cases you might have an underflow.
even_integer = (any_integer >> 1) << 1;
This solution achieves the goal in the most performant way compared to the other suggested solutions.
In general, bitwise shift is the cheapest possible operation. Some compilers generate the same assembly for "number = (number / 2) * 2" as well but that is not guaranteed on all target platforms and programming languages.
The following approach is simple and requires no multiplication or division.
number = number & ~1;
or
number = (number + 1) & ~1;
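The variants suggested in this thread do not all agree on odd negative inputs, which a short sketch makes visible. The function names are hypothetical, and the masking versions assume a two's complement representation (which C++20 requires; earlier standards left it to the implementation).

```cpp
// Clears the lowest bit. On two's complement this rounds toward negative
// infinity, so -3 becomes -4.
constexpr int evenByMask(int n) { return n & ~1; }

// Since C++11 the result of % takes the sign of the dividend, so this
// rounds toward zero: -3 becomes -2.
constexpr int evenByRemainder(int n) { return n - n % 2; }

// Integer division truncates toward zero, so this also maps -3 to -2.
constexpr int evenByDivision(int n) { return (n / 2) * 2; }

// Rounds an odd number up to the next even number instead of down: 3 becomes 4.
constexpr int evenRoundUp(int n) { return (n + 1) & ~1; }
```

For positive inputs the first three functions return the same value, so the choice only matters if negative numbers can occur; evenRoundUp is a different operation altogether and can overflow for INT_MAX.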

Is there any advantage to using '<< 1' instead of '* 2'?

I've seen this a couple of times, but it seems to me that using the bitwise shift left hinders readability. Why is it used? Is it faster than just multiplying by 2?
You should use * when you are multiplying, and << when you are bit shifting. They are mathematically equivalent, but have different semantic meanings. If you are building a flag field, for example, use bit shifting. If you are calculating a total, use multiplication.
It is faster on old compilers that don't optimize the * 2 calls by emitting a left shift instruction. That optimization is really easy to detect and any decent compiler already does.
If it affects readability, then don't use it. Always write your code in the most clear and concise fashion first, then if you have speed problems go back and profile and do hand optimizations.
It's used when you're concerned with the individual bits of the data you're working with. For example, if you want to set the upper byte of a word to 0x9A, you would not write
n |= 0x9A * 256
You'd write:
n |= 0x9A << 8
This makes it clearer that you're working with bits, rather than the data they represent.
For some architectures, bit shifting is faster than multiplying. However, any compiler worth its salt will optimize *2 (or any multiplication by a power of 2) to a left bit shift (when a bit shift would be faster).
For readability of values used as bitfields:
enum Flags { UP      = (1<<0),
             DOWN    = (1<<1),
             STRANGE = (1<<2),
             CHARM   = (1<<3),
             ...
which I think is preferable to either '=1,...,=2,...=4' or '=1,...=2, =2*2,...=2*3' especially if you have 8+ flags.
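A minimal sketch of how such shift-defined flags are typically used follows. It keeps only the four flags named above (the elided ones are not guessed), and the helper names setFlag, clearFlag and hasFlag are hypothetical.

```cpp
// Each flag occupies its own bit, so the shift amount doubles as the bit index.
enum Flags : unsigned {
    UP      = 1u << 0,
    DOWN    = 1u << 1,
    STRANGE = 1u << 2,
    CHARM   = 1u << 3
};

// Hypothetical helpers operating on a plain unsigned bit set.
constexpr unsigned setFlag(unsigned flags, Flags f)
{
    return flags | f;
}

constexpr unsigned clearFlag(unsigned flags, Flags f)
{
    return flags & ~static_cast<unsigned>(f);
}

constexpr bool hasFlag(unsigned flags, Flags f)
{
    return (flags & f) != 0;
}
```

Written this way, adding a ninth or tenth flag is a matter of using the next shift count, with no power of two to work out by hand.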
If you are using an old C compiler, it is preferable to use the bitwise shift. For readability you can comment your code, though.