This question already has answers here:
Swapping two variable value without using third variable
(31 answers)
Closed 9 years ago.
I found this common solution:
int a=10, b=20;
a=a+b;
b=a-b;
a=a-b;
But what if a = 2147483647, the largest value of an int? Then a = a + b will overflow.
How about using the std libs? ;)
std::swap(a,b);
You may also use the XOR algorithm, but don't use it unless you really have to.
The reason is well explained here:
On modern CPU architectures, the XOR technique is considerably slower
than using a temporary variable to do swapping. One reason is that
modern CPUs strive to execute instructions in parallel via instruction
pipelines. In the XOR technique, the inputs to each operation depend
on the results of the previous operation, so they must be executed in
strictly sequential order. If efficiency is of tremendous concern, it
is advised to test the speeds of both the XOR technique and temporary
variable swapping on the target architecture.
Although it's a bit late, since you have not mentioned whether built-in functions are allowed, the swap method above is the easiest one.
However, you may also try the XOR method (though check the reference above about its performance) like this:
a ^= b;
b ^= a;
a ^= b;
The solution is:
a ^= b;
b ^= a;
a ^= b;
It works because x ^ x equals 0 for any value of x, and thus, x ^ y ^ x (in any order) equals y for any values of x and y. It's unlikely to be faster than just using a temporary though (unless you're programming for a CPU with high register contention and no pipelining ability).
Try XORing as below.
a ^= b;
b ^= a;
a ^= b;
Closed 3 years ago.
I have 2 numbers A and B. I want to find C = A - (A % B), but there are some problems. First, C and D = A / B should have the same parity ((even and even) or (odd and odd)); otherwise C should be incremented (++C). Second, I do this calculation constantly, so I want its cost to be as small as possible. Right now my solution looks like this:
uint32_t D = A / B;
C = D * B;
if ((C ^ D) & 0x1) ++C;
Is there a better way to do this? Maybe (C % 2) != (D % 2) is faster because of compiler optimizations, but I can't prove it. I would also like to know if it can be done with some specific Intel intrinsics (registers).
I assume the inputs A and B are also uint32_t?
The cost of the division dwarfs everything else, unless B is known at compile time after inlining. (Even if it's not a power of 2). The actual div instruction is very expensive compared to anything else, and can't vectorize with SIMD. (The only SIMD division available on x86 is FP, or of course integer shifts for division by 2).
By far the most useful thing you could do is arrange for B's value to be visible to the compiler at compile time, or at least with link-time optimization for cross-file inlining. (Why does GCC use multiplication by a strange number in implementing integer division?)
If B isn't a compile-time constant, x86 division produces the remainder for free along with the quotient. sub is cheaper than imul, so express C via the remainder and let the compiler optimize:
uint32_t D = A / B;
uint32_t C = A - A % B;
And if B is a compile-time constant, the compiler will optimize the division to a multiply and shift (by a multiplicative inverse) anyway, and (hopefully) optimize this down to as good as you'd get with your original.
And no, (C ^ D) & 1 should be a more efficient way to check that the low bits differ than (C % 2) != (D % 2). Doing something separate to each input before combining them would cost more instructions, so it's better to lead the compiler toward the more efficient asm implementation. (Obviously it's a good idea to look at the asm output for both versions.)
Possibly useful would be to use + instead of ^. XOR is addition without carry, and since you only care about the low bit, the low bit of ^ and + is always the same. This gives the compiler the option of using an lea instruction to copy-and-add. (Probably not helpful in this case: it's fine if the compiler destroys the value in the register holding D, assuming it's dead after this. But if you also use D directly afterwards, lea lets the compiler compute C + D without clobbering D.)
Of course, you don't actually want to branch with if(...) so you should write it as:
C += (C+D) & 1; // +1 if low bits differ
I have a bundle of floats which get updated by various threads. Size of the array is much larger than the number of threads. Therefore simultaneous access on particular floats is rather rare. I need a solution for C++03.
The following code atomically adds a value to one of the floats (live demo). Assuming it works it might be the best solution.
The only alternative I can think of is dividing the array into bunches and protecting each bunch by a mutex. But I don't expect the latter to be more efficient.
My questions are as follows. Are there any alternative solutions for adding floats atomically? Can anyone anticipate which is the most efficient? Yes, I am willing to do some benchmarks. Maybe the solution below can be improved by relaxing the memory-order constraints, i.e. replacing __ATOMIC_SEQ_CST with something weaker. I have no experience with that.
void atomic_add_float( float *x, float add )
{
    int *ip_x = reinterpret_cast<int*>( x );                   //1
    int expected = __atomic_load_n( ip_x, __ATOMIC_SEQ_CST );  //2
    int desired;
    do {
        float sum = *reinterpret_cast<float*>( &expected ) + add;  //3
        desired = *reinterpret_cast<int*>( &sum );
    } while( ! __atomic_compare_exchange_n( ip_x, &expected, desired,  //4
                                            /* weak = */ true,
                                            __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST ) );
}
This works as follows. At //1 the bit pattern of x is reinterpreted as an int, i.e. I assume that float and int have the same size (32 bits). At //2 the value to be increased is loaded atomically. At //3 the bit pattern of the int is reinterpreted as a float and the summand is added. (Remember that expected contains the value found at ip_x == x.) This doesn't change the value under ip_x == x. At //4 the result of the summation is stored at ip_x == x only if no other thread changed the value in the meantime, i.e. if expected == *ip_x (docu). If this is not the case, the do-loop continues and expected contains the updated value found at ip_x == x.
GCC's functions for atomic access (__atomic_load_n and __atomic_compare_exchange_n) can easily be exchanged by other compiler's implementations.
Are there any alternative solutions for adding floats atomically? Can anyone anticipate which is the most efficient?
Sure, there are at least a few that come to mind:
Use synchronization primitives, e.g. spinlocks. Will be a bit slower than compare-exchange.
Transactional extensions (see Wikipedia). Will be faster, but this solution might limit portability.
Overall, your solution is quite reasonable: it is fast and yet will work on any platform.
In my opinion the needed memory orders are:
__ATOMIC_ACQUIRE -- when we read the value in __atomic_load_n()
__ATOMIC_RELEASE -- when __atomic_compare_exchange_n() succeeds
__ATOMIC_ACQUIRE -- when __atomic_compare_exchange_n() fails
To make this function more efficient you may like to use __ATOMIC_ACQUIRE for __atomic_load_n and __ATOMIC_RELEASE and __ATOMIC_RELAXED for __atomic_compare_exchange_n success_memorder and failure_memorder respectively.
On x86-64, though, that does not change the generated assembly, because its memory model is relatively strong, unlike ARM with its weaker memory model.
In the case of the overflow flag, it would seem that access to this flag would be a great boon to cross-architecture programming. It would provide a safe alternative to relying on undefined behaviour to check for signed integer overflow such as:
if(a < a + 100) //detect overflow
I do understand that there are safe alternatives such as:
if(a > (INT_MAX - 100)) // detect overflow
However, it would seem that access to the status register or the individual flags within it is missing from both the C and C++ languages. Why was this feature not included or what language design decisions were made that prohibited this feature from being included?
Because C and C++ are designed to be platform independent; the status register is not.
These days, two's complement is universally used to implement signed integer arithmetic, but it was not always the case. One's complement or sign and absolute value used to be quite common. And when C was first designed, such CPUs were still in common use. E.g. COBOL distinguishes negative and positive 0, which existed on those architectures. Obviously overflow behaviour on these architectures is completely different!
By the way, you can't rely on undefined behaviour for detecting overflow, because reasonable compilers upon seeing
if(a < a + 100)
will write a warning and compile
if(true)
... (provided optimizations are turned on and the particular optimization is not turned off).
And note that you can't rely on the warning: the compiler will only emit it when the condition ends up plain true or false after equivalent transformations, but there are many cases where the condition is modified in the presence of overflow without ending up as plain true/false.
Because C++ is designed as a portable language, i.e. one that compiles on many CPUs (e.g. x86, ARM, LSI-11/2, with devices like Game Boys, Mobile Phones, Freezers, Airplanes, Human Manipulation Chips and Laser Swords).
the available flags across CPUs may largely differ
even within the same CPU, flags may differ (take x86 scalar vs. vector instructions)
some CPUs may not even have the flag you desire at all
The question would have to be answered: should the compiler always deliver/enable that flag when it can't determine whether it is used at all? That does not conform to the unwritten but holy law of both C and C++: pay only for what you use.
Because compilers would have to be forbidden to optimize and e.g. reorder code to keep those flags valid
Example for the latter:
int x = 7;
x += z;
int y = 2;
y += z;
The optimizer may transform this to that pseudo assembly code:
alloc_stack_frame 2*sizeof(int)
load_int 7, $0
load_int 2, $1
add z, $0
add z, $1
which in turn would be more similar to
int x = 7;
int y = 2;
x += z;
y += z;
Now if you query the registers in between
int x = 7;
x += z;
if (check_overflow($0)) {...}
int y = 2;
y += z;
then after optimizing and disassembling you might end up with this:
int x = 7;
int y = 2;
x += z;
y += z;
if (check_overflow($0)) {...}
which is then incorrect.
More examples could be constructed, like what happens with a constant-folding-compile-time-overflow.
Sidenote: I remember an old Borland C++ compiler having a small API to read the current CPU registers. However, the argument above about optimization still applies.
On another sidenote: To check for overflow:
// desired expression: int z = x + y
would_overflow = x > MAX-y;
more concrete
auto would_overflow = x > std::numeric_limits<int>::max()-y;
or better, less concrete:
auto would_overflow = x > std::numeric_limits<decltype(x+y)>::max()-y;
I can think of the following reasons.
By allowing access to the register flags, portability of the language across platforms is severely limited.
The optimizer can change expressions drastically, and render your flags useless.
It would make the language more complex
Most compilers have a big set of intrinsic functions, to do most common operations (e.g. addition with carry) without resorting to flags.
Most expressions can be rewritten in a safe way to avoid overflows.
You can always fall back to inline assembly if you have very specific needs
Access to status registers does not seem needed enough to go through a standardization effort.
I know that when overflow occurs in C/C++, the usual observed behavior is to wrap around. For example, INT_MAX + 1 is an overflow.
Is it possible to modify this behavior, so that binary addition takes place as normal addition and there is no wraparound at the end of the addition operation?
Some code so this makes sense. Basically, this is a full adder: it adds bit by bit across the 32 bits.
int adder(int x, int y)
{
    int sum;
    for (int i = 0; i < 32; i++)   // 32 iterations to propagate carries across all 32 bits
    {
        sum = x ^ y;               // add each bit pair, ignoring carries
        int carry = x & y;         // positions that generate a carry
        x = sum;
        y = carry << 1;            // feed carries into the next bit (shifting into the
                                   // sign bit is itself UB for signed int)
    }
    return sum;
}
If I try adder(INT_MAX, 1); it actually overflows, even though I'm not using the + operator.
Thanks!
Overflow means that the result of an addition would exceed std::numeric_limits<int>::max() (back in C days, we used INT_MAX). Performing such an addition results in undefined behavior. The machine could crash and still comply with the C++ standard. Although you're more likely to get INT_MIN as a result, there's really no advantage to depending on any result at all.
The solution is to perform subtraction instead of addition, to prevent overflow and take a special case:
if ( number > std::numeric_limits< int >::max() - 1 ) { // i.e. number + 1 > max
    // fix things so "normal" math happens, in this case saturation.
} else {
    ++number;
}
Without knowing the desired result, I can't be more specific about it. The performance impact should be minimal, as a rarely-taken branch can usually be retired in parallel with subsequent instructions without delaying them.
Edit: To simply do math without worrying about overflow or handling it yourself, use a bignum library such as GMP. It's quite portable, and usually the best on any given platform. It has C and C++ interfaces. Do not write your own assembly. The result would be unportable, suboptimal, and the interface would be your responsibility!
No, you have to add them manually to check for overflow.
What do you want the result of INT_MAX + 1 to be? You can only fit INT_MAX into an int, so if you add one to it, the result is not going to be one greater. (Edit: on common platforms such as x86 it is going to wrap to the largest negative number: -(INT_MAX+1).) The only way to get bigger numbers is to use a larger variable.
Assuming int is 4 bytes (as is typical with x86 compilers) and you are executing an add instruction (in 32-bit mode), the destination register simply overflows -- it is out of bits and can't hold a larger value. It is a limitation of the hardware.
To get around this, you can hand-code, or use an arbitrarily-sized integer library that does the following:
First perform a normal add instruction on the lowest-order words. If overflow occurs, the Carry flag is set.
For each increasingly-higher-order word, use the adc instruction, which adds the two operands as usual, but takes into account the value of the Carry flag (as a value of 1.)
You can see this for a 64-bit value here.
This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Potential Problem in “Swapping values of two variables without using a third variable”
I recently read in a community that we can easily swap two numbers without using a third, using an XOR trick.
The trick m ^= n ^= m ^= n; was mentioned.
What do you guys think? Is this trick always useful?
The way you have written it, it is undefined behavior, because you are modifying a variable more than once between sequence points. However, if you rewrite it as follows:
m ^= n;
n ^= m;
m ^= n;
then it is safe. However, "useful" is another question: it is rarely useful, and sometimes it is actually slower than using a temp!
Also you need to be careful with aliasing (pointers/references) because if you try to swap something with itself, then you end up accidentally zeroing your value. For example:
#define SWAP(m, n) { m ^= n; n ^= m; m ^= n; }
int x[] = { 1, 2, 3, 4 };
int i = 0;
int j = 0;
SWAP(x[i], x[j]); // whoops, x[0] == 0 now, not 1!
A more traditional swap implementation doesn't have this issue.
No, it is undefined behaviour in both C and C++. It may work sometimes, but you should not rely on it.
Also even the "fixed" variation doesn't always work:
m ^= n;
n ^= m;
m ^= n;
This fails if m and n are references to the same variable. In this case it sets the value to zero.
C doesn't have references but even in C there are still dangers lurking if you try to use this trick:
You may try to put the "working" version into a macro SWAP, but that can fail if called with SWAP(x, x), setting x always to zero.
You may try to extend the trick to swapping two values in an array, but again this can fail if you use the same index:
a[m] ^= a[n];
a[n] ^= a[m];
a[m] ^= a[n];
Now if m == n again the value of a[m] is set to zero.
Please don't use "clever" tricks like this. Use a temporary variable to swap two values.
With that trick, you save the extra memory location needed for a temporary value.
It may be efficient for integers, but is it readable? Your latest-generation optimizer may well generate the same code for a plain temporary swap.
It's often more of a danger (in your case it's actually undefined behavior; and if both operands are the same object, things go wrong). Unless you're cripplingly low on memory, it's better to just use a temp variable (or std::swap). This is clearer about what you're doing and easier to maintain.