How much will it affect the performance if I use:
n>>1 instead of n/2
n&1 instead of n%2!=0
n<<3 instead of n*8
n++ instead of n+=1
and so on...
and if it does increase the performance then please explain why.
Any half decent compiler will optimize the two versions into the same thing. For example, GCC compiles this:
unsigned int half1(unsigned int n) { return n / 2; }
unsigned int half2(unsigned int n) { return n >> 1; }
bool parity1(int n) { return n % 2; }
bool parity2(int n) { return n & 1; }
int mult1(int n) { return n * 8; }
int mult2(int n) { return n << 3; }
void inc1(int& n) { n += 1; }
void inc2(int& n) { n++; }
to
half1(unsigned int):
mov eax, edi
shr eax
ret
half2(unsigned int):
mov eax, edi
shr eax
ret
parity1(int):
mov eax, edi
and eax, 1
ret
parity2(int):
mov eax, edi
and eax, 1
ret
mult1(int):
lea eax, [0+rdi*8]
ret
mult2(int):
lea eax, [0+rdi*8]
ret
inc1(int&):
add DWORD PTR [rdi], 1
ret
inc2(int&):
add DWORD PTR [rdi], 1
ret
One small caveat is that in the first example, if n could be negative (in case that it is signed and the compiler can't prove that it's nonnegative), then the division and the bitshift are not equivalent and the division needs some extra instructions. Other than that, compilers are smart and they'll optimize operations with constant operands, so use whichever version makes more sense logically and is more readable.
Strictly speaking, in most cases, yes.
This is because bit manipulation is a simpler operation to perform for CPUs due to the circuitry in the APU being much simpler and requiring less discrete steps (clock cycles) to perform fully.
As others have mentioned, any compiler worth a damn will automatically detect constant operands to certain arithmetic operations with bitwise analogs (like those in your examples) and will convert them to the appropriate bitwise operations under the hood.
Keep in mind, if the operands are runtime values, such optimizations cannot occur.
Related
I've been struggling trying to convert this assembly code to C++ code.
It's a function from an old game that takes pixel data Stmp, and I believe it places it to destination void* dest
void Function(int x, int y, int yl, void* Stmp, void* dest)
{
unsigned long size = 1280 * 2;
unsigned long j = yl;
void* Dtmp = (void*)((char*)dest + y * size + (x * 2));
_asm
{
push es;
push ds;
pop es;
mov edx,Dtmp;
mov esi,Stmp;
mov ebx,j;
xor eax,eax;
xor ecx,ecx;
loop_1:
or bx,bx;
jz exit_1;
mov edi,edx;
loop_2:
cmp word ptr[esi],0xffff;
jz exit_2;
mov ax,[esi];
add edi,eax;
mov cx,[esi+2];
add esi,4;
shr ecx,2;
jnc Next2;
movsw;
Next2:
rep movsd;
jmp loop_2;
exit_2:
add esi,2;
add edx,size;
dec bx;
jmp loop_1;
exit_1:
pop es;
};
}
That's where I've gotten as far to: (Not sure if it's even correct)
while (j > 0)
{
if (*stmp != 0xffff)
{
}
++stmp;
dtmp += size;
--j;
}
Any help is greatly appreciated. Thank you.
It saves / restores ES around setting it equal to DS so rep movsd will use the same addresses for load and store. That instruction is basically memcpy(edi, esi, ecx) but incrementing the pointers in EDI and ESI (by 4 * ecx). https://www.felixcloutier.com/x86/movs:movsb:movsw:movsd:movsq
In a flat memory model, you can totally ignore that. This code looks like it might have been written to run in 16-bit unreal mode, or possibly even real mode, hence the use of 16-bit registers all over the place.
Look like it's loading some kind of records that tell it how many bytes to copy, and reading until the end of the record, at which point it looks for the next record there. There's an outer loop around that, looping through records.
The records look like this I think:
struct sprite_line {
uint16_t skip_dstbytes, src_bytes;
uint16_t src_data[]; // flexible array member, actual size unlimited but assumed to be a multiple of 2.
};
The inner loop is this:
;; char *dstp; // in EDI
;; struct spriteline *p // in ESI
loop_2:
cmp word ptr[esi],0xffff ; while( p->skip_dstbytes != (uint16_t)-1 ) {
jz exit_2;
mov ax,[esi]; ; EAX was xor-zeroed earlier; some old CPUs maybe had slow movzx loads
add edi,eax; ; dstp += p->skip_dstbytes;
mov cx,[esi+2]; ; bytelen = p->src_len;
add esi,4; ; p->data
shr ecx,2; ; length in dwords = bytelen >> 2
jnc Next2;
movsw; ; one 16-bit (word) copy if bytelen >> 1 is odd, i.e. if last bit shifted out was a 1.
; The first bit shifted out isn't checked, so size is assumed to be a multiple of 2.
Next2:
rep movsd; ; copy in 4-byte chunks
Old CPUs (before IvyBridge) had rep movsd faster than rep movsb, otherwise this code could just have done that.
or bx,bx;
jz exit_1;
That's an obsolete idiom that comes from 8080 for test bx,bx / jnz, i.e. jump if BX was zero. So it's a while( bx != 0 ) {} loop. With dec bx in it. It's an inefficient way to write a while (--bx) loop; a compiler would put a dec/jnz .top_of_loop at the bottom, with a test once outside the loop in case it needs to run zero times. Why are loops always compiled into "do...while" style (tail jump)?
Some people would say that's what a while loop looks like in asm, if they're picturing totally naive translation from C to asm.
The arithmetic mean of two unsigned integers is defined as:
mean = (a+b)/2
Directly implementing this in C/C++ may overflow and produce a wrong result. A correct implementation would avoid this. One way of coding it could be:
mean = a/2 + b/2 + (a%2 + b%2)/2
But this produces rather a lot of code with typical compilers. In assembler, this usually can be done much more efficiently. For example, the x86 can do this in the following way (assembler pseudo code, I hope you get the point):
ADD a,b ; addition, leaving the overflow condition in the carry bit
RCR a,1 ; rotate right through carry, effectively a division by 2
After those two instructions, the result is in a, and the remainder of the division is in the carry bit. If correct rounding is desired, a third ADC instruction would have to add the carry into the result.
Note that the RCR instruction is used, which rotates a register through the carry. In our case, it is a rotate by one position, so that the previous carry becomes the most significant bit in the register, and the new carry holds the previous LSB from the register. It seems that MSVC doesn't even offer an intrinsic for this instruction.
Is there a known C/C++ pattern that can be expected to be recognized by an optimizing compiler so that it produces such efficient code? Or, more generally, is there a rational way how to program in C/C++ source level so that the carry bit is being used by the compiler to optimize the generated code?
EDIT:
A 1-hour lecture about std::midpoint: https://www.youtube.com/watch?v=sBtAGxBh-XI
Wow!
EDIT2: Great discussion on Microsoft blog
The following method avoids overflow and should result in fairly efficient assembly (example) without depending on non-standard features:
mean = (a&b) + (a^b)/2;
There are three typical methods to compute average without overflow, one of which is limited to uint32_t (on 64-bit architectures).
// average "SWAR" / Montgomery
uint32_t avg(uint32_t a, uint32_t b) {
return (a & b) + ((a ^ b) >> 1);
}
// in case the relative magnitudes are known
uint32_t avg2(uint32_t min, uint32_t max) {
return min + (max - min) / 2;
}
// in case the relative magnitudes are not known
uint32_t avg2_constrained(uint32_t a, uint32_t b) {
return a + (int32_t)(b - a) / 2;
}
// average increase width (not applicable to uint64_t)
uint32_t avg3(uint32_t a, uint32_t b) {
return ((uint64_t)a + b) >> 1;
}
The corresponding assembler sequences from clang in two architectures are
avg(unsigned int, unsigned int)
mov eax, esi
and eax, edi
xor esi, edi
shr esi
add eax, esi
avg2(unsigned int, unsigned int)
sub esi, edi
shr esi
lea eax, [rsi + rdi]
avg3(unsigned int, unsigned int)
mov ecx, edi
mov eax, esi
add rax, rcx
shr rax
vs.
avg(unsigned int, unsigned int)
and w8, w1, w0
eor w9, w1, w0
add w0, w8, w9, lsr #1
ret
avg2(unsigned int, unsigned int)
sub w8, w1, w0
add w0, w0, w8, lsr #1
ret
avg3(unsigned int, unsigned int):
mov w8, w1
add x8, x8, w0, uxtw
lsr x0, x8, #1
ret
Out of these three versions, avg2 would perform in ARM64 as well, as the optimal sequence using carry flag -- and also it's likely that avg3 would perform as well, noticing that the mov w8,w1 is used to clear the top 32-bits, which may be unnecessary given that the compiler knows they are cleared by any previous instruction that is used to produce the value.
Similar statement can be made of the Intel version for avg3, which would in optimal case compiled to just the two meaningful instructions:
add rax, rcx
shr rax
See https://godbolt.org/z/5TMd3zr81 for online comparison.
The "SWAR"/Montgomery version is typically only justified, when trying to compute multiple averages packed to a single (large) integer in which case the full formula contains masking with the bit positions of the highest bits: return (a & b) + (((a ^ b) >> 1) & ~kH;.
I have to check whether K-th bit of a number is set or not.
Example what I mean:
Input: N = 4, K = 0
Output: false
Explanation: Binary representation of 4 is b'100, in which the 0th bit from LSB is not set and therefore it returns false.
This code does not work with the not equal to the statement:
bool checkKthBit(int n, int k)
{
if( n & (1 << k) != 0)
return true;
else
return false;
}
However, after removing the not equal operator the code works perfectly fine:
bool checkKthBit(int n, int k)
{
if (n & (1 << k))
return true;
else
return false;
}
How is this happening?
If you put the equation into brackets it does what you want:
https://godbolt.org/z/enTcjhrzT
bool checkA(int n, int k) {
if(n&(1<<k)!=0) return true; else return false;
}
bool checkB(int n, int k) {
if((n&(1<<k))!=0) return true; else return false;
}
bool checkC(int n, int k) {
if((n&(1<<k))) return true; else return false;
}
Will produce this assembly, notice that checkB and checkC are identical (correct brackets means correct precedence and the != operator can stay). While the checkA does give precedence to the equality and then the result is and-ed and that result is used for the return.
checkA(int, int):
mov eax, 1
mov ecx, esi
sal eax, cl
test eax, eax
setne al
movzx eax, al
and eax, edi
ret
checkB(int, int):
mov ecx, esi
sar edi, cl
mov eax, edi
and eax, 1
ret
checkC(int, int):
mov ecx, esi
sar edi, cl
mov eax, edi
and eax, 1
ret
See the C precedence table where the & is precedence 2 (right to left) and != is precedence 7 (left to right)
Reference from: https://en.cppreference.com/w/c/language/operator_precedence
BUT if you are using C++ notice how the precedence table changes, the & is 11 and the != is 10:
Reference from: https://en.cppreference.com/w/cpp/language/operator_precedence
If I would assume GCC as the toolchain, if you do not provide any dialect then the default will be used:
The default, if no C language dialect options are given, is
-std=gnu17.
Or for C++:
The default, if no C++ language dialect options are given, is
-std=gnu++17.
GCC reference: https://gcc.gnu.org/onlinedocs/gcc/Standards.html
Broadly speaking it's better to use more parenthesis than less, instead of depending on the precedence order and making sure you and everybody who reads it understand the caveats of the order, it's considered safer to make the order explicit:
https://softwareengineering.stackexchange.com/questions/201175/should-i-use-parentheses-in-logical-statements-even-where-not-necessary
I have the compressed sparse column (csc) representation of the n x n lower-triangular matrix A with zeros on the main diagonal, and would like to solve for b in
(A + I)' * x = b
This is the routine I have for computing this:
void backsolve(const int*__restrict__ Lp,
const int*__restrict__ Li,
const double*__restrict__ Lx,
const int n,
double*__restrict__ x) {
for (int i=n-1; i>=0; --i) {
for (int j=Lp[i]; j<Lp[i+1]; ++j) {
x[i] -= Lx[j] * x[Li[j]];
}
}
}
Thus, b is passed in via the argument x, and is overwritten by the solution. Lp, Li, Lx are respectively the row, indices, and data pointers in the standard csc representation of sparse matrices. This function is the top hotspot in the program, with the line
x[i] -= Lx[j] * x[Li[j]];
being the bulk of the time spent. Compiling with gcc-8.3 -O3 -mfma -mavx -mavx512f gives
backsolve(int const*, int const*, double const*, int, double*):
lea eax, [rcx-1]
movsx r11, eax
lea r9, [r8+r11*8]
test eax, eax
js .L9
.L5:
movsx rax, DWORD PTR [rdi+r11*4]
mov r10d, DWORD PTR [rdi+4+r11*4]
cmp eax, r10d
jge .L6
vmovsd xmm0, QWORD PTR [r9]
.L7:
movsx rcx, DWORD PTR [rsi+rax*4]
vmovsd xmm1, QWORD PTR [rdx+rax*8]
add rax, 1
vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8]
vmovsd QWORD PTR [r9], xmm0
cmp r10d, eax
jg .L7
.L6:
sub r11, 1
sub r9, 8
test r11d, r11d
jns .L5
ret
.L9:
ret
According to vtune,
vmovsd QWORD PTR [r9], xmm0
is the slowest part. I have almost no experience with assembly, and am at a loss as to how to further diagnose or optimize this operation. I have tried compiling with different flags to enable/disable SSE, FMA, etc, but nothing has worked.
Processor: Xeon Skylake
Question What can I do to optimize this function?
This should depend quite a bit on the exact sparsity pattern of the matrix and the platform being used. I tested a few things with gcc 8.3.0 and compiler flags -O3 -march=native (which is -march=skylake on my CPU) on the lower triangle of this matrix of dimension 3006 with 19554 nonzero entries. Hopefully this is somewhat close to your setup, but in any case I hope these can give you an idea of where to start.
For timing I used google/benchmark with this source file. It defines benchBacksolveBaseline which benchmarks the implementation given in the question and benchBacksolveOptimized which benchmarks the proposed "optimized" implementations. There is also benchFillRhs which separately benchmarks the function that is used in both to generate some not completely trivial values for the right hand side. To get the time of the "pure" backsolves, the time that benchFillRhs takes should be subtracted.
1. Iterating strictly backwards
The outer loop in your implementation iterates through the columns backwards, while the inner loop iterates through the current column forwards. Seems like it would be more consistent to iterate through each column backwards as well:
for (int i=n-1; i>=0; --i) {
for (int j=Lp[i+1]-1; j>=Lp[i]; --j) {
x[i] -= Lx[j] * x[Li[j]];
}
}
This barely changes the assembly (https://godbolt.org/z/CBZAT5), but the benchmark timings show a measureable improvement:
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2737 ns 2734 ns 5120000
benchBacksolveBaseline 17412 ns 17421 ns 829630
benchBacksolveOptimized 16046 ns 16040 ns 853333
I assume this is caused by somehow more predictable cache access, but I did not look into it much further.
2. Less loads/stores in inner loop
As A is lower triangular, we have i < Li[j]. Therefore we know that x[Li[j]] will not change due to the changes to x[i] in the inner loop. We can put this knowledge into our implementation by using a temporary variable:
for (int i=n-1; i>=0; --i) {
double xi_temp = x[i];
for (int j=Lp[i+1]-1; j>=Lp[i]; --j) {
xi_temp -= Lx[j] * x[Li[j]];
}
x[i] = xi_temp;
}
This makes gcc 8.3.0 move the store to memory from inside the inner loop to directly after its end (https://godbolt.org/z/vM4gPD). The benchmark for the test matrix on my system shows a small improvement:
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2737 ns 2740 ns 5120000
benchBacksolveBaseline 17410 ns 17418 ns 814545
benchBacksolveOptimized 15155 ns 15147 ns 887129
3. Unroll the loop
While clang already starts unrolling the loop after the first suggested code change, gcc 8.3.0 still has not. So let's give that a try by additionally passing -funroll-loops.
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2733 ns 2734 ns 5120000
benchBacksolveBaseline 15079 ns 15081 ns 953191
benchBacksolveOptimized 14392 ns 14385 ns 963441
Note that the baseline also improves, as the loop in that implementation is also unrolled. Our optimized version also benefits a bit from loop unrolling, but maybe not as much as we may have liked. Looking into the generated assembly (https://godbolt.org/z/_LJC5f), it seems like gcc might have gone a little far with 8 unrolls. For my setup, I can in fact do a little better with just one simple manual unroll. So drop the flag -funroll-loops again and implement the unrolling with something like this:
for (int i=n-1; i>=0; --i) {
const int col_begin = Lp[i];
const int col_end = Lp[i+1];
const bool is_col_nnz_odd = (col_end - col_begin) & 1;
double xi_temp = x[i];
int j = col_end - 1;
if (is_col_nnz_odd) {
xi_temp -= Lx[j] * x[Li[j]];
--j;
}
for (; j >= col_begin; j -= 2) {
xi_temp -= Lx[j - 0] * x[Li[j - 0]] +
Lx[j - 1] * x[Li[j - 1]];
}
x[i] = xi_temp;
}
With that I measure:
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2728 ns 2729 ns 5090909
benchBacksolveBaseline 17451 ns 17449 ns 822018
benchBacksolveOptimized 13440 ns 13443 ns 1018182
Other algorithms
All of these versions still use the same simple implementation of the backward solve on the sparse matrix structure. Inherently, operating on sparse matrix structures like these can have significant problems with memory traffic. At least for matrix factorizations, there are more sophisticated methods, that operate on dense submatrices that are assembled from the sparse structure. Examples are supernodal and multifrontal methods. I am a bit fuzzy on this, but I think that such methods will also apply this idea to layout and use dense matrix operations for lower triangular backwards solves (for example for Cholesky-type factorizations). So it might be worth to look into those kind of methods, if you are not forced to stick to the simple method that works on the sparse structure directly. See for example this survey by Davis.
You might shave a few cycles by using unsigned instead of int for the index types, which must be >= 0 anyway:
void backsolve(const unsigned * __restrict__ Lp,
const unsigned * __restrict__ Li,
const double * __restrict__ Lx,
const unsigned n,
double * __restrict__ x) {
for (unsigned i = n; i-- > 0; ) {
for (unsigned j = Lp[i]; j < Lp[i + 1]; ++j) {
x[i] -= Lx[j] * x[Li[j]];
}
}
}
Compiling with Godbolt's compiler explorer shows slightly different code for the innerloop, potentially making better use of the CPU pipeline. I cannot test, but you could try.
Here is the generated code for the inner loop:
.L8:
mov rax, rcx
.L5:
mov ecx, DWORD PTR [r10+rax*4]
vmovsd xmm1, QWORD PTR [r11+rax*8]
vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8]
lea rcx, [rax+1]
vmovsd QWORD PTR [r9], xmm0
cmp rdi, rax
jne .L8
I saw the chosen answer to this post.
I was suprised that (x & 255) == (x % 256) if x is an unsigned integer, I wondered if it makes sense to always replace % with & in x % n for n = 2^a (a = [1, ...]) and x being a positive integer.
Since this is a special case in which I as a human can decide because I know with which values the program will deal with and the compiler does not. Can I gain a significant performance boost if my program uses a lot of modulo operations?
Sure, I could just compile and look at the dissassembly. But this would only answer my question for one compiler/architecture. I would like to know if this is in principle faster.
If your integral type is unsigned, the compiler will optimize it, and the result will be the same. If it's signed, something is different...
This program:
int mod_signed(int i) {
return i % 256;
}
int and_signed(int i) {
return i & 255;
}
unsigned mod_unsigned(unsigned int i) {
return i % 256;
}
unsigned and_unsigned(unsigned int i) {
return i & 255;
}
will be compiled (by GCC 6.2 with -O3; Clang 3.9 produces very similar code) into:
mod_signed(int):
mov edx, edi
sar edx, 31
shr edx, 24
lea eax, [rdi+rdx]
movzx eax, al
sub eax, edx
ret
and_signed(int):
movzx eax, dil
ret
mod_unsigned(unsigned int):
movzx eax, dil
ret
and_unsigned(unsigned int):
movzx eax, dil
ret
The result assembly of mod_signed is different because
If both operands to a multiplication, division, or modulus expression have the same sign, the result is positive. Otherwise, the result is negative. The result of a modulus operation's sign is implementation-defined.
and AFAICT, most of implementation decided that the result of a modulus expression is always the same as the sign of the first operand. See this documentation.
Hence, mod_signed is optimized to (from nwellnhof's comment):
int d = i < 0 ? 255 : 0;
return ((i + d) & 255) - d;
Logically, we can prove that i % 256 == i & 255 for all unsigned integers, hence, we can trust the compiler to do its job.
I did some measurements with gcc, and
if the argument of a / or % is a compiled time constant that's a power of 2, gcc can turn it into the corresponding bit operation.
Here are some of my benchmarks for divisions
What has a better performance: multiplication or division? and as you can see, the running times with divisors that are statically known powers of two are noticably lower than with other statically known divisors.
So if / and % with statically known power-of-two arguments describe your algorithm better than bit ops, feel free to prefer / and %.
You shouldn't lose any performance with a decent compiler.