Can someone explain to me how XOR swapping of two variables with no temp variable works?
void xorSwap (int *x, int *y)
{
    if (x != y) {
        *x ^= *y;
        *y ^= *x;
        *x ^= *y;
    }
}
I understand WHAT it does, but can someone walk me through the logic of how it works?
You can see how it works by doing the substitution:
x1 = x0 xor y0
y2 = x1 xor y0
x2 = x1 xor y2
Substituting,
x1 = x0 xor y0
y2 = (x0 xor y0) xor y0
x2 = (x0 xor y0) xor ((x0 xor y0) xor y0)
Because xor is fully associative and commutative:
y2 = x0 xor (y0 xor y0)
x2 = (x0 xor x0) xor (y0 xor y0) xor y0
Since x xor x == 0 for any x,
y2 = x0 xor 0
x2 = 0 xor 0 xor y0
And since x xor 0 == x for any x,
y2 = x0
x2 = y0
And the swap is done.
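The substitution can also be checked mechanically. Here is a small C++ sketch (names are mine) that runs the three steps over all 8-bit pairs and confirms the algebra above:

```cpp
#include <cassert>
#include <utility>

// Run the three XOR steps from the derivation above and return (x2, y2).
std::pair<unsigned, unsigned> xorSwapSteps(unsigned x0, unsigned y0) {
    unsigned x1 = x0 ^ y0;  // x1 = x0 xor y0
    unsigned y2 = x1 ^ y0;  // y2 = x1 xor y0  -> reduces to x0
    unsigned x2 = x1 ^ y2;  // x2 = x1 xor y2  -> reduces to y0
    return {x2, y2};
}

// Exhaustive check over all 8-bit pairs: the result is always (y0, x0).
bool swapAlwaysWorks() {
    for (unsigned x0 = 0; x0 < 256; ++x0)
        for (unsigned y0 = 0; y0 < 256; ++y0) {
            auto [x2, y2] = xorSwapSteps(x0, y0);
            if (x2 != y0 || y2 != x0) return false;
        }
    return true;
}
```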
Other people have explained how it works; now I want to explain why it was a good idea, but no longer is.
Back in the day when we had simple single-cycle or multi-cycle CPUs, this trick was a cheap way to avoid costly memory dereferences or spilling registers to the stack. Now, however, we have CPUs with massive pipelines: the P4's ranged from roughly 20 to 31 stages, and any dependence between reading and writing a register can stall the whole thing. The XOR swap chains three instructions where each one needs the result of the previous, a serial dependency between x and y that doesn't buy you anything but stalls the pipeline in practice. A stalled pipeline causes a slow code path, and if this swap's in your inner loop, you're going to be moving very slowly.
In general practice, your compiler can figure out what you really want when you do a swap with a temp variable and can compile it to a single XCHG instruction. The XOR swap makes it much harder for the compiler to guess your intent and therefore much less likely to optimize it well. Not to mention code maintenance, etc.
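One more practical hazard worth adding: the `if (x != y)` guard in the question's code is load-bearing. A minimal sketch of what goes wrong without it, when both pointers alias the same object:

```cpp
#include <cassert>

// XOR swap WITHOUT the aliasing guard from the question's code.
void xorSwapUnguarded(int *x, int *y) {
    *x ^= *y;  // if x == y, this sets the value to 0...
    *y ^= *x;  // ...and every following step keeps it 0
    *x ^= *y;
}
```

Called as `xorSwapUnguarded(&a, &a)`, the first XOR zeroes the value and it never recovers; a temp-variable swap is harmless in the same situation.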
I like to think of it graphically rather than numerically.
Let's say you start with x = 11 and y = 5
In binary (and I'm going to use a hypothetical 4 bit machine), here's x and y
x: |1|0|1|1| -> 8 + 2 + 1
y: |0|1|0|1| -> 4 + 1
Now to me, XOR is an invert operation and doing it twice is a mirror:
x^y: |1|1|1|0|
(x^y)^y: |1|0|1|1| <- ooh! Check it out - x came back
(x^y)^x: |0|1|0|1| <- ooh! y came back too!
Here's one that should be slightly easier to grok:
int x = 10, y = 7;
y = x + y; //x = 10, y = 17
x = y - x; //x = 7, y = 17
y = y - x; //x = 7, y = 10
Now, one can understand the XOR trick a little more easily by understanding that ^ can be thought of as + or -. Just as:
x + y - ((x + y) - x) == x
, so:
x ^ y ^ ((x ^ y) ^ x) == x
The reason WHY it works is because XOR doesn't lose information. You could do the same thing with ordinary addition and subtraction if you could ignore overflow. For example, if the variable pair A,B originally contains the values 1,2, you could swap them like this:
// A,B = 1,2
A = A+B // 3,2
B = A-B // 3,1
A = A-B // 2,1
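In C the overflow worry can actually be sidestepped for the add/subtract version too, because unsigned arithmetic is defined to wrap modulo 2^n. A sketch (mirroring the xorSwap from the question, with the same aliasing guard):

```cpp
#include <cassert>

// Swap via addition/subtraction; unsigned wraparound makes overflow harmless.
void addSwap(unsigned *a, unsigned *b) {
    if (a != b) {
        *a = *a + *b;  // may wrap modulo 2^32; that's fine for unsigned
        *b = *a - *b;  // wraps back: yields the original *a
        *a = *a - *b;  // yields the original *b
    }
}
```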
BTW there's an old trick for encoding a 2-way linked list in a single "pointer".
Suppose you have a list of memory blocks at addresses A, B, and C. The first word of each block holds, respectively:
// first word of each block is sum of addresses of prior and next block
0 + &B // first word of block A
&A + &C // first word of block B
&B + 0 // first word of block C
If you have access to block A, it gives you the address of B. To get to C, you take the "pointer" in B and subtract A, and so on. It works just as well backwards. To run along the list, you need to keep pointers to two consecutive blocks. Of course you would use XOR in place of addition/subtraction, so you wouldn't have to worry about overflow.
You could extend this to a "linked web" if you wanted to have some fun.
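Here is a minimal C++ sketch of that list, using XOR instead of addition as suggested. Strictly speaking, XOR-ing pointer bits is outside what the language guarantees, but it works on mainstream platforms and shows the encoding (names are mine):

```cpp
#include <cassert>
#include <cstdint>

// Each node stores prev XOR next in a single field; the ends use 0 for the
// missing neighbor, exactly like the A, B, C blocks above.
struct Node {
    int value;
    uintptr_t link;  // addr(prev) ^ addr(next)
};

uintptr_t addr(Node *n) { return n ? reinterpret_cast<uintptr_t>(n) : 0; }

// Walk k steps from head (whose "previous" block is null) and return the
// value found there. The same function traverses from either end.
int nth(Node *head, int k) {
    Node *prev = nullptr, *cur = head;
    while (k-- > 0) {
        Node *next = reinterpret_cast<Node *>(cur->link ^ addr(prev));
        prev = cur;
        cur = next;
    }
    return cur->value;
}
```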
Most people would swap two variables x and y using a temporary variable, like this:
tmp = x
x = y
y = tmp
Here’s a neat programming trick to swap two values without needing a temp:
x = x xor y
y = x xor y
x = x xor y
More details in Swap two variables using XOR
On line 1 we combine x and y (using XOR) into a "hybrid" and store it back in x. XOR is a great way to save information, because you can remove it by XOR-ing again.
On line 2, we XOR the hybrid with y, which cancels out all the y information, leaving only x. We save this result back into y, so y now holds x's original value.
On the last line, x still holds the hybrid. We XOR it with y (which now holds x's original value) to remove all traces of x from the hybrid. That leaves y's original value, and the swap is complete!
The computer actually has an implicit “temp” variable that stores intermediate results before writing them back to a register. For example, if you add 3 to a register (in machine-language pseudocode):
ADD 3 A // add 3 to register A
The ALU (Arithmetic Logic Unit) is actually what executes the instruction 3+A. It takes the inputs (3,A) and creates a result (3 + A), which the CPU then stores back into A’s original register. So, we used the ALU as temporary scratch space before we had the final answer.
We take the ALU’s implicit temporary data for granted, but it’s always there. In a similar way, the ALU can return the intermediate result of the XOR in the case of x = x xor y, at which point the CPU stores it into x’s original register.
Because we aren’t used to thinking about the poor, neglected ALU, the XOR swap seems magical because it doesn’t have an explicit temporary variable. Some machines have a 1-step exchange XCHG instruction to swap two registers.
#VonC has it right, it's a neat mathematical trick. Imagine 4 bit words and see if this helps.
word1 ^= word2;
word2 ^= word1;
word1 ^= word2;
word1 word2
0101 1111
after 1st xor
1010 1111
after 2nd xor
1010 0101
after 3rd xor
1111 0101
Basically there are 3 steps in the XOR approach:
a’ = a XOR b (1)
b’ = a’ XOR b (2)
a” = a’ XOR b’ (3)
To understand why this works, first note that:
1. XOR produces a 1 bit only where exactly one of its operands has a 1 and the other has a 0;
2. XOR is commutative, so a XOR b = b XOR a;
3. XOR is associative, so (a XOR b) XOR c = a XOR (b XOR c); and
4. a XOR a = 0 (this should be obvious from the definition in 1 above).
After Step (1), the binary representation of a' has 1-bits exactly in the positions where a and b have opposing bits, that is, either (a_k = 1, b_k = 0) or (a_k = 0, b_k = 1). Now when we do the substitution in Step (2) we get:
b’ = (a XOR b) XOR b
= a XOR (b XOR b) because XOR is associative
= a XOR 0 because of [4] above
= a due to definition of XOR (see 1 above)
Now we can substitute into Step (3):
a” = (a XOR b) XOR a
= (b XOR a) XOR a because XOR is commutative
= b XOR (a XOR a) because XOR is associative
= b XOR 0 because of [4] above
= b due to definition of XOR (see 1 above)
More detailed information here:
Necessary and Sufficient
As a side note I reinvented this wheel independently several years ago in the form of swapping integers by doing:
a = a + b
b = a - b ( = a + b - b once expanded)
a = a - b ( = a + b - a once expanded).
(This add/subtract version is mentioned above, though in a hard-to-read way.)
The exact same reasoning applies to XOR swaps: a ^ b ^ b = a and a ^ b ^ a = b. Since XOR is commutative, x ^ x = 0, and x ^ 0 = x, this is quite easy to see:
= a ^ b ^ b
= a ^ 0
= a
and
= a ^ b ^ a
= a ^ a ^ b
= 0 ^ b
= b
Hope this helps.
This explanation has already been given... but not very clearly, imo. I just want to add a mathematical explanation to make the answer more complete. In group theory terms, bit strings under XOR form an abelian group, also called a commutative group. That means XOR satisfies five requirements: closure, associativity, identity element, inverse element, commutativity.
XOR swap formula:
a = a XOR b
b = a XOR b
a = a XOR b
Expand the formulas, substituting a and b from the previous steps:
a = a XOR b
b = a XOR b = (a XOR b) XOR b
a = a XOR b = (a XOR b) XOR (a XOR b) XOR b
Commutativity means "a XOR b" equals "b XOR a":
a = a XOR b
b = a XOR b = (a XOR b) XOR b
a = a XOR b = (a XOR b) XOR (a XOR b) XOR b
= (b XOR a) XOR (a XOR b) XOR b
Associativity means "(a XOR b) XOR c" equals "a XOR (b XOR c)":
a = a XOR b
b = a XOR b = (a XOR b) XOR b
= a XOR (b XOR b)
a = a XOR b = (a XOR b) XOR (a XOR b) XOR b
= (b XOR a) XOR (a XOR b) XOR b
= b XOR (a XOR a) XOR (b XOR b)
Under XOR each element is its own inverse: any value XORed with itself gives zero:
a = a XOR b
b = a XOR b = (a XOR b) XOR b
= a XOR (b XOR b)
= a XOR 0
a = a XOR b = (a XOR b) XOR (a XOR b) XOR b
= (b XOR a) XOR (a XOR b) XOR b
= b XOR (a XOR a) XOR (b XOR b)
= b XOR 0 XOR 0
The identity element of XOR is zero: any value XORed with zero is left unchanged:
a = a XOR b
b = a XOR b = (a XOR b) XOR b
= a XOR (b XOR b)
= a XOR 0
= a
a = a XOR b = (a XOR b) XOR (a XOR b) XOR b
= (b XOR a) XOR (a XOR b) XOR b
= b XOR (a XOR a) XOR (b XOR b)
= b XOR 0 XOR 0
= b XOR 0
= b
You can dig into group theory for further background.
Others have posted explanations, but I think it's better understood when accompanied by a good example.
XOR truth table:
A B | A XOR B
0 0 |    0
0 1 |    1
1 0 |    1
1 1 |    0
If we consider the above truth table and take the values A = 1100 and B = 0101 we are able to swap the values as such:
A = 1100
B = 0101
A ^= B; => A = 1100 XOR 0101
(A = 1001)
B ^= A; => B = 0101 XOR 1001
(B = 1100)
A ^= B; => A = 1001 XOR 1100
(A = 0101)
A = 0101
B = 1100
Related
Find X such that (A ^ X) * (B ^ X) is maximum
Given A, B, and N (X < 2^N)
Return the maximum product modulo 10^9+7.
Example:
A = 4
B = 6
N = 3
We can choose X = 3 and (A ^ X) = 7 and (B ^ X) = 5.
The product will be 35 which is the maximum.
Here is my code:
int limit = (1 << n) - 1;
int MOD = 1_000_000_007;
long maxProd = 0;
for (int i = 0; i <= limit; i++) {   // X = 0 must be considered too
    long x1 = (A ^ i);
    long x2 = (B ^ i);
    maxProd = max(maxProd, x1 * x2); // compare the true products, not modded ones
}
return (int) (maxProd % MOD);        // reduce only at the end
1. For bits at position N and above, X is zero, so A^X and B^X keep A's and B's original bits there.
2. Look at bits 0 to N-1 where A and B agree. Where both have a 1, X gets a 0 there (both results keep the 1); where both have a 0, X gets a 1 (both results gain a 1).
3. For bits where A and B differ, X can be either 0 or 1, so the 1 lands in exactly one of the two results.
4. From steps 1 and 2 we get the fixed parts of the two results; call them a and b. They are known constants.
5. From step 3 we get a collection of powers of two (say 2^3, 2^1, ...); call their total sum tot. Also a known constant.
6. The question becomes: maximize (a + tot - sth) * (b + sth), where sth is a subset sum of the powers of two from step 3, while a, b, and tot are constants.
7. The product is maximized when (a + tot - sth) and (b + sth) are as close as possible:
   if a == b, give the most significant bit from step 3 to one of them and the rest to the other;
   if a != b, give all the bits from step 3 to the smaller one.
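The steps above can be sketched in C++. This is an illustration of the greedy described in the answer (shared bits are forced to 1 in both results; each differing bit, from the top down, is handed to whichever running value is currently smaller), not a hardened solution:

```cpp
#include <cassert>

// Greedy construction of max (A^X)*(B^X) over X < 2^N, per the steps above.
// Returns the product itself; apply % (10^9+7) afterwards if required.
long long maxXorProduct(long long A, long long B, int N) {
    long long a = A, b = B;
    for (int i = N - 1; i >= 0; --i) {
        long long bit = 1LL << i;
        bool abit = (A >> i) & 1, bbit = (B >> i) & 1;
        if (abit == bbit) {
            a |= bit; b |= bit;   // X can turn both bits into 1
        } else if (a <= b) {
            a |= bit; b &= ~bit;  // give the 1 to the smaller side
        } else {
            b |= bit; a &= ~bit;
        }
    }
    return a * b;
}
```

For the example above, maxXorProduct(4, 6, 3) corresponds to the X = 3 assignment and returns 35.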
I have an input uint64_t X and a count N of its least significant bits that I want to write into two target uint64_t values Y and Z, starting at bit index M of Z (overflowing into the low bits of Y). Unaffected bits of Y and Z should not be changed. How can I implement this efficiently in C++ for the latest Intel CPUs?
It should be efficient when executed in loops. I assume that requires avoiding branches: the number of instructions used should be constant and as small as possible.
M and N are not fixed at compile time. M can take any value from 0 to 63 (target offset in Z), and N is in the range 0 to 64 (number of bits to copy).
There's at least a four-instruction sequence available on reasonably modern IA processors.
X &= (1ull << N) - 1; // mask off all but the low N bits (bzhi also copes with N == 64)
// bzhi rax, rdi, rdx
Z = X << M;
// shlx rax, rax, rsi
Y = X >> (64 - M);
// neg sil
// shrx rax, rax, rsi
The value M=0 causes a bit of pain, as no bits should spill into Y in that case, and the expression X >> (64 - M) would need sanitation (x86 shifts take the count mod 64).
One possibility to overcome this is
x = bzhi(x, n);
y = rol(x,m);
y = bzhi(y, m); // y &= ~(~0ull << m);
z = shlx(x, m); // z = x << m;
As OP actually wants to update the bits, one obvious solution would be to replicate the logic for masks:
xm = bzhi(~0ull, n);
ym = rol(xm, m);
ym = bzhi(ym, m);
zm = shlx(xm, m);
However, clang seems to produce something like 24 instructions total with the masks applied:
Y = (Y & ~xm) | y; // |,+,^ all possible
Z = (Z & ~zm) | z;
It is likely then better to change the approach:
x2 = x << (64-N); // align xm to left
y2 = y >> y_shift; // align y to right
y = shld(y2,x2, y_shift); // y fixed
Here y_shift = max(0, M+N-64)
Fixing Z is slightly more involved, as Z can be combined of three parts:
zzzzz.....zzzzXXXXXXXzzzzzz, where m=6, n=7
That should be doable with two double shifts as above.
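For reference, here is a portable C++ sketch of the whole update, using plain shifts and masks rather than BMI2 intrinsics, and with an explicit branch to dodge the undefined 64-bit shift (function and variable names are mine, not from the question):

```cpp
#include <cstdint>

// Write the low n bits of x into z starting at bit m (0 <= m <= 63,
// 0 <= n <= 64); bits that don't fit spill into the low bits of y.
// All other bits of y and z are preserved.
void writeBits(uint64_t x, unsigned n, unsigned m, uint64_t &y, uint64_t &z) {
    uint64_t xmask = (n >= 64) ? ~0ull : ((1ull << n) - 1);
    x &= xmask;                               // keep only the n source bits
    z = (z & ~(xmask << m)) | (x << m);       // low part goes into z
    if (m + n > 64) {                         // some bits overflow into y
        unsigned spill = m + n - 64;          // spill <= 63, and m > 0 here
        uint64_t ymask = (1ull << spill) - 1;
        y = (y & ~ymask) | (x >> (64 - m));
    }
}
```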
In my C++ code I have three uint64_t variables:
uint64_t a = 7940678747;
uint64_t b = 59182917008;
uint64_t c = 73624982323;
I need to find (a * b) % c. If I directly multiply a and b, it will cause overflow. I also can't usefully apply the formula (a * b) % c = ((a % c) * (b % c)) % c, because c > a and c > b, so a % c = a and b % c = b, and I end up multiplying a and b again, which again results in overflow.
How can I compute (a * b) % c for these values (and such cases in general) of the variables without overflow?
A simple solution is to define x = 2^32 ≈ 4.29 × 10^9, and then represent a and b as:
a = ka * x + a1 with ka, a1 < x
b = kb * x + b1 with kb, b1 < x
Then
a*b = (ka * x + a1) * (kb * x + b1)
    = ((ka * kb) * x) * x + x * (b1 * ka) + x * (a1 * kb) + a1 * b1
All these operations can be performed without a larger type, assuming the % c reduction is performed after each operation (* or +). One caveat: a value reduced mod c can still be almost as large as c, so the multiplications by x themselves overflow unless x * c fits in 64 bits; when it doesn't, those multiplications must be reduced step by step as well (e.g. by repeated doubling mod c).
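Alternatively, a shift-and-add ("Russian peasant") multiplication keeps every intermediate value below 2*c, so it works for any c < 2^63 with no splitting at all. A sketch:

```cpp
#include <cstdint>

// (a * b) % c without overflow, for any nonzero c < 2^63: add a into the
// result for each set bit of b, doubling a mod c as we go.
uint64_t mulmod(uint64_t a, uint64_t b, uint64_t c) {
    uint64_t result = 0;
    a %= c;
    while (b > 0) {
        if (b & 1) {
            result += a;                 // result, a < c, so the sum < 2c
            if (result >= c) result -= c;
        }
        a += a;                          // a = (2a) mod c
        if (a >= c) a -= c;
        b >>= 1;
    }
    return result;
}
```

With the values from the question, mulmod(7940678747, 59182917008, 73624982323) gives the answer directly; on compilers that support __int128 (GCC, Clang), the same thing is a one-liner: (unsigned __int128)a * b % c.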
There are more elegant solutions than this, but an easy one would be looking into a library that deals with larger numbers. It will handle numbers that are too large for the largest of normal types for you. Check this one out: https://gmplib.org/
Create a class or struct to deal with numbers in parts.
Example pseudocode
// operation enum to know how to construct a large number
enum operation {
case add;
case sub;
case mult;
case divide;
}
class bigNumber {
//the two parts of the number
int partA;
int partB;
bigNumber(int numA, int numB, operation op) {
if(op == operation.mult) {
// place each digit of numA into an integer array
// place each digit of numB into an integer array
// Iteratively place the first half of digits into the partA member
// Iteratively place the second half of digits into the partB member
} else if //cases for construction from other operations
}
// Create operator functions so you can perform arithmetic with this class
}
uint64_t a = 7940678747;
uint64_t b = 59182917008;
uint64_t c = 73624982323;
bigNumber bigNum = bigNumber(a, b, .mult);
uint64_t result = bigNum % c;
print(result);
Keep in mind that you may want to make result of type bigNumber if the value of c is very small. Basically this was just an outline; if you use a built-in type, make sure it won't overflow.
This is my assignment.
I've converted the code to assembly, but is there any way to make it run faster?
Thanks in advance for any help ;D
//Convert this nested for loop to assembly instructions
for (a = 0; a < y; a++)
    for (b = 0; b < y; b++)
        for (c = 0; c < y; c++)
            if ((a + 2 * b - 8 * c) == y)
                count++;
Converted:
_asm {
    mov ecx, 0
    mov ax, 0
    mov bx, 0
    mov cx, 0
Back:
    push cx
    push bx
    push ax
    add bx, bx
    mov dx, 8
    mul dx
    add cx, bx
    sub cx, ax
    pop ax
    pop bx
    cmp cx, y
    jne increase
    inc count
increase:
    pop cx
    inc ax
    cmp ax, y
    jl Back
    inc bx
    mov ax, 0
    cmp bx, y
    jl Back
    inc cx
    mov ax, 0
    mov bx, 0
    cmp cx, y
    jl Back
}
Some generic tricks:
Make your loop counters count down instead of up. You eliminate a compare that way.
Learn the magic of LEA to compute expressions that include addition and scaling by certain powers of 2. You won't need a MUL in there anywhere.
Hoist loop-invariant work outside the inner loop. a + 2*b is constant for every iteration of the c loop.
Use SI, DI to hold values. That should help you avoid all those push and pop instructions.
If your values fit in 8 bits, use AH, AL, etc. to make more effective use of your registers.
Oh, and you don't need that mov ax, 0 after inc cx, because AX is already 0 there.
Specific to this algorithm: If y is odd, skip iterations where a is even, and vice versa. Nearly 2x speedup awaits... (Work out with pencil and paper if you wonder why.) Hint: You don't need to test every iteration, either. You can simply step by 2s, if you're clever enough.
Or better still, work out a closed form that allows you to calculate the answer directly. ;-)
When you are optimizing, always start high and go low, i.e. start at the algorithm level, and when everything is exhausted, go to the assembly conversion.
First, observe that:
8 * c = a + 2 * b - y
has at most one solution c for each triplet (a, b, y).
What does this mean? Your 3 loops can be collapsed into 2. This is a huge reduction, from Θ(y^3) runtime to Θ(y^2).
Rewrite the code:
for (a = 0; a < y; a++)
    for (b = 0; b < y; b++) {
        c = a + 2*b - y;   // note: this c is 8 times the original loop variable
        if (((c % 8) == 0) && (c >= 0)) count++;
    }
(The original bound c < y comes for free: a + 2*b - y is at most 2y - 3, so dividing by 8 always lands below y.)
Next observe that c>=0 means:
a+2*b-y >= 0
a+2*b >= y
a >= y-2b
Note that the two loops can be interchanged, which gives:
for (b = 0; b < y; b++) {
    for (a = max(y - 2*b, 0); a < y; a++) {
        if (((a + 2*b - y) % 8) == 0) count++;
    }
}
Which we can split into two:
for (b = 0; b < y/2; b++) {
    for (a = y - 2*b; a < y; a++) {
        if (((a + 2*b - y) % 8) == 0) count++;
    }
}
for (b = y/2; b < y; b++) {
    for (a = 0; a < y; a++) {
        if (((a + 2*b - y) % 8) == 0) count++;
    }
}
Now we have entirely eliminated c. We can't eliminate a or b altogether without coming up with a closed form formula (or at least partial closed form formula), why?
So here are several exercises that will get you "there":
1. How can we get rid of the % 8? Can we eliminate a or b now?
2. Observe that for each y, the count is roughly Θ(y^2). Why is there no single closed-form quadratic (i.e. a*y^2 + b*y + c) that gives the correct count for every y?
3. Given 2, how would one go about coming up with a closed-form formula?
And now conversion to assembly language will give you a small improvement in the grand scheme of things :p
(I hope all the details are right. Please correct if you see a mistake)
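To double-check the details, the interchanged-loop version from the answer can be compared against the original triple loop (a direct C++ transcription of both, with the max(y - 2b, 0) lower bound written out):

```cpp
#include <cassert>

// Original O(y^3) triple loop from the question.
int countOriginal(int y) {
    int count = 0;
    for (int a = 0; a < y; a++)
        for (int b = 0; b < y; b++)
            for (int c = 0; c < y; c++)
                if (a + 2*b - 8*c == y) count++;
    return count;
}

// Collapsed O(y^2) version from the answer above.
int countCollapsed(int y) {
    int count = 0;
    for (int b = 0; b < y; b++) {
        int start = (y - 2*b > 0) ? y - 2*b : 0;   // max(y - 2b, 0)
        for (int a = start; a < y; a++)
            if ((a + 2*b - y) % 8 == 0) count++;
    }
    return count;
}
```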
In Assembly Language Step-by-Step, Jeff Duntemann writes on page 230:
Now, speed optimization is a very slippery business in the x86 world. Having instructions in the CPU cache versus having to pull them from memory is a speed difference that swamps most speed differences among the instructions themselves. Other factors come into play in the most recent Pentium-class CPUs that make generalizations about instruction speed almost impossible, and certainly impossible to state with any precision.
Assuming you're on an x86 machine, my advice would be soak up all that Math in the other answers the best you can for optimizations.
If I had the sum of products like z*a + z*b + z*c + ... + z*y, it would be possible to move the z factor, which is the same, out before brackets: z(a + b + c + ... y).
I'd like to know how it is possible (if it is) to do the same trick if bitwise XOR is used instead of multiplication.
z^a + z^b + ... + z^y -> z^(a + b + ... + y)
Perhaps a, b, c ... should be preprocessed, such as logically negated or something else, before adding? z could change, so preprocessing, if it's needed, shouldn't depend on particular z value.
From Wikipedia:
Distributivity: with no binary function, not even with itself
So, no, unfortunately, you can't do anything like that with XOR.
To prove that a general formula does not hold, you only need a counterexample in one specific case.
We can reduce it to show that this does not hold:
(a^b) * c = (a^c) * (b^c)
It is trivial to show that one base case fails as such:
a = 3
b = 1
c = 1
(a^b) * c = (3^1) * 1 = 2
(a^c) * (b^c) = 2 * 0 = 0
Using the same case you can show that (a*b) ^ c = (a^c) * (b^c) and (a + b) ^ c = (a^c) + (b^c) do not hold either.
Hence, equality does not hold in a general case.
Equality can hold in special cases though, which is an entirely different subject.
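The counterexample is easy to machine-check. A small sketch evaluating all three proposed identities at a = 3, b = 1, c = 1:

```cpp
#include <cassert>

// Verify the counterexample from the answer: with a=3, b=1, c=1 the two
// sides of each proposed identity disagree.
bool identitiesFail() {
    int a = 3, b = 1, c = 1;
    bool xorOverMul = ((a ^ b) * c) == ((a ^ c) * (b ^ c));   // 2 vs 0
    bool mulThenXor = ((a * b) ^ c) == ((a ^ c) * (b ^ c));   // 2 vs 0
    bool addThenXor = ((a + b) ^ c) == ((a ^ c) + (b ^ c));   // 5 vs 2
    return !xorOverMul && !mulThenXor && !addThenXor;
}
```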