Built in mod operator vs arithmetic while loop in C++ - c++

I am relatively new to programming, and after going through some of the questions asked on this subject, I still could not find an answer.
As I understand it, the % operator in C++ can be rewritten in a number of ways, some of which seem to perform better as indicated by some users here:
Built-in mod ('%') vs custom mod function: improve the performance of modulus operation
Now if I understand correctly, arithmetic operations on integers such as + and - are faster than * and /, if we think of multiplication as repeated addition, for instance (this is commonly mentioned in threads I have read).
So, why wouldn't something like the following function be faster than %:
int mod(int a, int b)
{
    int s = a;
    while (s < b)
    {
        s += a;
    }
    return a - (s - b);
}
As I am quite new to programming and have little knowledge of how things are implemented by compilers, I am not sure what way would be appropriate to test this, and whether results might vary greatly depending on the compiler and implementation.

Modulo is usually implemented through one or a couple of CPU operations, so in general it's a single operation, not counting auxiliary instructions.
But by definition it should be quite simple:
int mult = a / b;
return a - mult * b;
E.g. on x86-64 clang compiles
int mod(int a, int b)
{
    return a % b;
}
into the following code. The actual operation is an integer division (idiv), which leaves the remainder in EDX:
mod(int, int): # #mod(int, int)
push rbp
mov rbp, rsp
mov dword ptr [rbp - 4], edi
mov dword ptr [rbp - 8], esi
mov eax, dword ptr [rbp - 4]
cdq
idiv dword ptr [rbp - 8]
mov eax, edx
pop rbp
ret
Your version:
mod(int, int): # #mod(int, int)
push rbp
mov rbp, rsp
mov dword ptr [rbp - 4], edi
mov dword ptr [rbp - 8], esi
mov eax, dword ptr [rbp - 4]
mov dword ptr [rbp - 12], eax
.LBB0_1: # =>This Inner Loop Header: Depth=1
mov eax, dword ptr [rbp - 12]
cmp eax, dword ptr [rbp - 8]
jge .LBB0_3
mov eax, dword ptr [rbp - 4]
add eax, dword ptr [rbp - 12]
mov dword ptr [rbp - 12], eax
jmp .LBB0_1
.LBB0_3:
mov eax, dword ptr [rbp - 4]
mov ecx, dword ptr [rbp - 12]
sub ecx, dword ptr [rbp - 8]
sub eax, ecx
pop rbp
ret
That's quite a bunch of operations performed in a loop; it's doubtful that it can be faster.

Related

Modulus in Assembly x64 linux question C++ [duplicate]

This question already has answers here:
Why does GCC use multiplication by a strange number in implementing integer division?
(5 answers)
Divide Signed Integer By 2 compiles to complex assembly output, not just a shift
(1 answer)
Closed 1 year ago.
I have these functions in C++
int f1(int a)
{
    int x = a / 2;
}
int f2(int a)
{
    int y = a % 2;
}
int f3(int a)
{
    int z = a % 7;
}
int f4(int a, int b)
{
    int xy = a % b;
}
And I saw their assembly code but couldn't understand what they are doing. I couldn't even find a good reference or an explained example for this. Here is the assembly:
f1(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov eax, DWORD PTR [rbp-20]
mov edx, eax
shr edx, 31
add eax, edx
sar eax
mov DWORD PTR [rbp-4], eax
nop
pop rbp
ret
f2(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov eax, DWORD PTR [rbp-20]
cdq
shr edx, 31
add eax, edx
and eax, 1
sub eax, edx
mov DWORD PTR [rbp-4], eax
nop
pop rbp
ret
f3(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov eax, DWORD PTR [rbp-20]
movsx rdx, eax
imul rdx, rdx, -1840700269
shr rdx, 32
add edx, eax
sar edx, 2
mov esi, eax
sar esi, 31
mov ecx, edx
sub ecx, esi
mov edx, ecx
sal edx, 3
sub edx, ecx
sub eax, edx
mov DWORD PTR [rbp-4], eax
nop
pop rbp
ret
f4(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov eax, DWORD PTR [rbp-20]
cdq
idiv DWORD PTR [rbp-24]
mov DWORD PTR [rbp-4], edx
nop
pop rbp
ret
Can you please explain, by example, what steps it is following to calculate the answers in these cases, and why they work just as well as a normal divide?

Compiler Explorer Assembly Output for C, C++ and D (dlang) [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 1 year ago.
When using Compiler Explorer (https://godbolt.org/) to compare the assembly output of simple programs, why is the D language assembly output so long compared to the C or C++ output? The simple square function output is the same for C, C++, and D, but the D output has additional lines that are not highlighted when hovering over the square function in the source code.
What are these additional lines?
How can I prevent these lines from being generated?
Let's say I have https://godbolt.org/z/64EsWo5Ke a template function both in C++ and D; the Intel asm output for D is 29309 lines long, while the C++ Intel asm output is only 73 lines.
These are the codes in question:
For D:
int example.square(int):
push rbp
mov rbp, rsp
mov dword ptr [rbp - 4], edi
mov eax, dword ptr [rbp - 4]
imul eax, dword ptr [rbp - 4]
pop rbp
ret
ldc.register_dso:
sub rsp, 40
mov qword ptr [rsp + 8], 1
lea rax, [rip + ldc.dso_slot]
mov qword ptr [rsp + 16], rax
lea rax, [rip + __start___minfo]
mov qword ptr [rsp + 24], rax
lea rax, [rip + __stop___minfo]
mov qword ptr [rsp + 32], rax
lea rax, [rsp + 8]
mov rdi, rax
call _d_dso_registry#PLT
add rsp, 40
ret
example.__ModuleInfo:
.long 2147483652
.long 0
.asciz "example"
example.__moduleRef:
.quad example.__ModuleInfo
ldc.dso_slot:
.quad 0
C/C++:
square(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
imul eax, eax
pop rbp
ret
As you can see the actual implementation in assembly is very similar (almost identical). The program constructs the stack frame:
push rbp
mov rbp, rsp
It then takes the argument and multiplies it by itself, leaving the result in the return value (the eax register):
mov dword ptr [rbp - 4], edi
mov eax, dword ptr [rbp - 4]
imul eax, dword ptr [rbp - 4]
in D and
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
imul eax, eax
in C++/C, and then deconstructs the stack frame and returns:
pop rbp
ret
Now I don't claim to know what the D compiler is doing, but I assume the rest of the code is there so that this piece of compiled code can work well with other D code: basically metadata and other fun stuff. I assume this because our function nowhere uses any of the defined symbols, nor do the other functions call square. This code is therefore probably there for inclusion into other D programs, or the like, and therefore you might not be able to (or should not) remove it.
In the case of your second example, most of the code is the library implementation being emitted in the output. Counting only the function you defined, it is actually 66 lines long. While still longer than the equivalent 22 lines of C++-generated assembly, it is not several thousand.
Edit:
As I explained in a comment, I would recommend analysing the output binaries with something like Cutter or Ghidra, which gives you a more complete picture of what is actually produced in a binary, because even in the 'shorter' C++ code you will find a lot of function calls such as _entry before getting to main.

Finding max number between two, which implementation to choose

I am trying to figure out which implementation has an edge over the other when finding the max of two numbers. As an example, let's examine two implementations:
Implementation 1:
int findMax (int a, int b)
{
return (a > b) ? a : b;
}
// Assembly output: (gcc 11.1)
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov eax, DWORD PTR [rbp-4]
cmp eax, DWORD PTR [rbp-8]
jle .L2
mov eax, DWORD PTR [rbp-4]
jmp .L4
.L2:
mov eax, DWORD PTR [rbp-8]
.L4:
pop rbp
ret
Implementation 2:
int findMax(int a, int b)
{
int diff, s, max;
diff = a - b;
s = (diff >> 31) & 1;
max = a - (s * diff);
return max;
}
// Assembly output: (gcc 11.1)
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov eax, DWORD PTR [rbp-20]
sub eax, DWORD PTR [rbp-24]
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
shr eax, 31
mov DWORD PTR [rbp-8], eax
mov eax, DWORD PTR [rbp-8]
imul eax, DWORD PTR [rbp-4]
mov edx, eax
mov eax, DWORD PTR [rbp-20]
sub eax, edx
mov DWORD PTR [rbp-12], eax
mov eax, DWORD PTR [rbp-12]
pop rbp
ret
The second one produced more assembly instructions, but the first one has a conditional jump. I am just trying to understand whether both are equally good.
First you need to turn on compiler optimizations (I used -O2 for the following). And you should compare to std::max. Then this:
#include <algorithm>

int findMax(int a, int b)
{
    return (a > b) ? a : b;
}
int findMax2(int a, int b)
{
    int diff, s, max;
    diff = a - b;
    s = (diff >> 31) & 1;
    max = a - (s * diff);
    return max;
}
int findMax3(int a, int b)
{
    return std::max(a, b);
}
results in:
findMax(int, int):
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
findMax2(int, int):
mov ecx, edi
mov eax, edi
sub ecx, esi
mov edx, ecx
shr edx, 31
imul edx, ecx
sub eax, edx
ret
findMax3(int, int):
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
Your first version results in assembly identical to std::max, while your second variant is doing more. Actually, when trying to optimize, you need to specify what you are optimizing for. There are several options that typically require a trade-off: runtime, memory usage, size of the executable, readability of the code, etc. Typically you cannot get it all at once.
When in doubt, do not reinvent the wheel; use the existing, already optimized std::max. And do not forget that the code you write is not instructions for your CPU; rather, it is a high-level abstract description of what the program should do. It's the compiler's job to figure out how that can best be achieved.
Last but not least, your second variant is actually broken. See example here compiled with -O2 -fsanitize=signed-integer-overflow, results in:
/app/example.cpp:13:10: runtime error: signed integer overflow: -2147483648 - 2147483647 cannot be represented in type 'int'
You should favor correctness over speed. The fastest code is not worth a thing when it is wrong. And because of that, readability is next on the list. Code that is difficult to read and understand is also difficult to prove correct. I was only able to spot the problem in your code with the help of the compiler, while std::max(a,b) is unlikely to cause undefined behavior (and even if it does, at least it isn't your fault ;).
For two ints, you can compute max(a, b) without branching using a technique you probably learnt at school:
a ^ ((a ^ b) & -(a < b));
But no sane person would write this in their code. Always use std::max and trust the compiler to pick the best way. You may well find it adopts the above for int arguments with optimisations set appropriately. Although I conjecture that a compare and jump is probably the best way on the whole, even at the expense of a pipeline flush.
Using std::max gives the compiler the best optimisation hint.
Implementation 1 performs well on a CISC CPU like a modern x64 AMD/Intel CPU.
Implementation 2 (the branchless one) performs well on a GPU, such as those from NVIDIA or AMD.
The term "performs well" is only significant in a tight loop.

Union with Insidious bug

I'm having a problem grasping why, when we want to traverse a whole array and compare each value of the array with a number present in the array, say arr[0], it is advised to initialize an int with arr[0], like int acomp = arr[0], and compare acomp with every integer in the array, rather than comparing every integer in the array with arr[0] directly.
For example, in the following union code it was pointed out to me that Code 2 is better than Code 1, but I am not quite sure why.
int unionarr(int p, int q) { // Code 1
    for (int i = 0; i < size; i++)
        if (arr[i] == arr[p])
            arr[i] = arr[q];
}
int unionarr(int p, int q) { // Code 2
    int pid = arr[p];
    int qid = arr[q];
    for (int i = 0; i < size; i++)
        if (arr[i] == pid)
            arr[i] = qid;
}
It's a correctness issue. The assignment inside the for loop can modify array values. You might modify the very elements that are being used in the comparison or right-hand side of the assignment. That's why you must save them before entering the loop.
Making local copies pid and qid of values which would otherwise have to be repeatedly looked up in the array is something of a performance optimisation.
However, I would be surprised if any modern compiler would fail to pick that up and do that optimisation implicitly.
Using https://godbolt.org/ you can compare the two. What you care about are the instructions inside the loop.
With Clang 4.0 the assembly is:
Code 1
movsxd rax, dword ptr [rbp - 16]
mov ecx, dword ptr [4*rax + arr]
movsxd rax, dword ptr [rbp - 8]
cmp ecx, dword ptr [4*rax + arr]
jne .LBB0_4
movsxd rax, dword ptr [rbp - 12]
mov ecx, dword ptr [4*rax + arr]
movsxd rax, dword ptr [rbp - 16]
mov dword ptr [4*rax + arr], ecx
Code 2
movsxd rax, dword ptr [rbp - 24]
mov ecx, dword ptr [4*rax + arr]
cmp ecx, dword ptr [rbp - 16]
jne .LBB0_4
mov eax, dword ptr [rbp - 20]
movsxd rcx, dword ptr [rbp - 24]
mov dword ptr [4*rcx + arr], eax

How does assembly do parameter passing: by value, reference, pointer for different types/arrays?

In an attempt to look at this, I wrote this simple code where I just created variables of different types and passed them into a function by value, by reference, and by pointer:
int i = 1;
char c = 'a';
int* p = &i;
float f = 1.1;
TestClass tc; // has 2 private data members: int i = 1 and int j = 2
The function bodies were left blank because I am just looking at how parameters are passed in.
passByValue(i, c, p, f, tc);
passByReference(i, c, p, f, tc);
passByPointer(&i, &c, &p, &f, &tc);
I also wanted to see how this is different for an array, and how the parameters are then accessed.
int numbers[] = {1, 2, 3};
passArray(numbers);
assembly:
passByValue(i, c, p, f, tc)
mov EAX, DWORD PTR [EBP - 16]
mov DL, BYTE PTR [EBP - 17]
mov ECX, DWORD PTR [EBP - 24]
movss XMM0, DWORD PTR [EBP - 28]
mov ESI, DWORD PTR [EBP - 40]
mov DWORD PTR [EBP - 48], ESI
mov ESI, DWORD PTR [EBP - 36]
mov DWORD PTR [EBP - 44], ESI
lea ESI, DWORD PTR [EBP - 48]
mov DWORD PTR [ESP], EAX
movsx EAX, DL
mov DWORD PTR [ESP + 4], EAX
mov DWORD PTR [ESP + 8], ECX
movss DWORD PTR [ESP + 12], XMM0
mov EAX, DWORD PTR [ESI]
mov DWORD PTR [ESP + 16], EAX
mov EAX, DWORD PTR [ESI + 4]
mov DWORD PTR [ESP + 20], EAX
call _Z11passByValueicPif9TestClass
passByReference(i, c, p, f, tc)
lea EAX, DWORD PTR [EBP - 16]
lea ECX, DWORD PTR [EBP - 17]
lea ESI, DWORD PTR [EBP - 24]
lea EDI, DWORD PTR [EBP - 28]
lea EBX, DWORD PTR [EBP - 40]
mov DWORD PTR [ESP], EAX
mov DWORD PTR [ESP + 4], ECX
mov DWORD PTR [ESP + 8], ESI
mov DWORD PTR [ESP + 12], EDI
mov DWORD PTR [ESP + 16], EBX
call _Z15passByReferenceRiRcRPiRfR9TestClass
passByPointer(&i, &c, &p, &f, &tc)
lea EAX, DWORD PTR [EBP - 16]
lea ECX, DWORD PTR [EBP - 17]
lea ESI, DWORD PTR [EBP - 24]
lea EDI, DWORD PTR [EBP - 28]
lea EBX, DWORD PTR [EBP - 40]
mov DWORD PTR [ESP], EAX
mov DWORD PTR [ESP + 4], ECX
mov DWORD PTR [ESP + 8], ESI
mov DWORD PTR [ESP + 12], EDI
mov DWORD PTR [ESP + 16], EBX
call _Z13passByPointerPiPcPS_PfP9TestClass
passArray(numbers)
mov EAX, .L_ZZ4mainE7numbers
mov DWORD PTR [EBP - 60], EAX
mov EAX, .L_ZZ4mainE7numbers+4
mov DWORD PTR [EBP - 56], EAX
mov EAX, .L_ZZ4mainE7numbers+8
mov DWORD PTR [EBP - 52], EAX
lea EAX, DWORD PTR [EBP - 60]
mov DWORD PTR [ESP], EAX
call _Z9passArrayPi
// parameter access
push EAX
mov EAX, DWORD PTR [ESP + 8]
mov DWORD PTR [ESP], EAX
pop EAX
I'm assuming I'm looking at the right assembly pertaining to the parameter passing because there are calls at the end of each!
But due to my very limited knowledge of assembly, I can't tell what's going on here. I learned about the cdecl calling convention, so I'm assuming something is going on that has to do with preserving the caller-saved registers and then pushing the parameters onto the stack. Because of this, I'm expecting to see things loaded into registers and "push" everywhere, but I have no idea what's going on with the movs and leas. Also, I don't know what DWORD PTR is.
I've only learned about registers: eax, ebx, ecx, edx, esi, edi, esp and ebp, so seeing something like XMM0 or DL just confuses me as well. I guess it makes sense to see lea when it comes to passing by reference/pointer because they use memory addresses, but I can't actually tell what is going on. When it comes to passing by value, it seems like there are many instructions, so this could have to do with copying the value into registers. No idea when it comes to how arrays are passed and accessed as parameters.
If someone could explain the general idea of what's going on with each block of assembly to me, I would highly appreciate it.
Using CPU registers for passing arguments is faster than using memory, i.e. the stack. However, there is a limited number of registers in a CPU (especially in x86-compatible CPUs), so when a function has many parameters the stack is used instead. In your case there are 5 function arguments, so the compiler uses the stack for the arguments instead of registers.
In principle compilers can use push instructions to place arguments on the stack before the actual call to the function, but many compilers (incl. gnu c++) use mov to store arguments on the stack. This way is convenient as it does not change the ESP register (top of the stack) in the part of the code which calls the function.
In the case of passByValue(i, c, p, f, tc), the values of the arguments are placed on the stack. You can see many mov instructions from a memory location to a register and from the register to the appropriate location on the stack. The reason for this is that x86 assembly forbids moving directly from one memory location to another (the exception is movs, which moves values from one array (or string, as you wish) to another).
In the case of passByReference(i, c, p, f, tc) you can see 5 lea instructions which copy the addresses of the arguments into CPU registers, and these register values are then moved onto the stack.
The case of passByPointer(&i, &c, &p, &f, &tc) is, in fact, identical to passByReference(i, c, p, f, tc): internally, on the assembly level, pass by reference uses pointers, while on the higher C++ level the programmer does not need to use the & and * operators explicitly on references.
After the parameters are moved onto the stack, call is issued, which pushes the instruction pointer EIP onto the stack before transferring program execution to the subroutine. Because the arguments were stored with mov relative to ESP rather than pushed, the return address lands directly below them after the call instruction.
There's too much in your example above to dissect all of it. Instead I'll just go over passByValue, since that seems to be the most interesting. Afterwards, you should be able to figure out the rest.
First some important points to keep in mind while studying the disassembly so you don't get completely lost in the sea of code:
There are no instructions to directly copy data from one mem location to another mem location. eg. mov [ebp - 44], [ebp - 36] is not a legal instruction. An intermediate register is needed to store the data first and then subsequently copied into the memory destination.
Bracket operator [] in conjunction with a mov means to access data from a computed memory address. This is analogous to derefing a pointer in C/C++.
When you see lea x, [y] that usually means compute address of y and save into x. This is analogous to taking the address of a variable in C/C++.
Data and objects that needs to be copied but are too big to fit into a register are copied onto the stack in a piece-meal fashion. IOW, it'll copy a native machine word at a time until all the bytes representing the object/data is copied. Usually that means either 4 or 8 bytes on modern processors.
The compiler will typically interleave instructions together to keep the processor pipeline busy and to minimize stalls. Good for code efficiency but bad if you're trying to understand the disassembly.
With the above in mind here's the call to passByValue function rearranged a bit to make it more understandable:
.define arg1 esp
.define arg2 esp + 4
.define arg3 esp + 8
.define arg4 esp + 12
.define arg5.1 esp + 16
.define arg5.2 esp + 20
; copy first parameter
mov EAX, [EBP - 16]
mov [arg1], EAX
; copy second parameter
mov DL, [EBP - 17]
movsx EAX, DL
mov [arg2], EAX
; copy third
mov ECX, [EBP - 24]
mov [arg3], ECX
; copy fourth
movss XMM0, DWORD PTR [EBP - 28]
movss DWORD PTR [arg4], XMM0
; intermediate copy of TestClass?
mov ESI, [EBP - 40]
mov [EBP - 48], ESI
mov ESI, [EBP - 36]
mov [EBP - 44], ESI
;copy fifth
lea ESI, [EBP - 48]
mov EAX, [ESI]
mov [arg5.1], EAX
mov EAX, [ESI + 4]
mov [arg5.2], EAX
call passByValue(int, char, int*, float, TestClass)
The code above is unmangled and the instruction interleaving undone to make it clear what is actually happening, but some of it still needs explaining. First, the char is signed and it is a single byte in size. The instructions here:
; copy second parameter
mov DL, [EBP - 17]
movsx EAX, DL
mov [arg2], EAX
read a byte from [ebp - 17] (somewhere on the stack) and store it into DL, the low byte of edx. That byte is then copied into eax using a sign-extending move. The full 32-bit value in eax is finally copied onto the stack where passByValue can access it. See a register layout diagram if you need more detail.
The fourth argument:
movss XMM0, DWORD PTR [EBP - 28]
movss DWORD PTR [arg4], XMM0
Uses the SSE movss instruction to copy the floating-point value from the stack into the xmm0 register. In brief, SSE instructions let you perform the same operation on multiple pieces of data simultaneously, but here the compiler is simply using the register as intermediate storage for copying a floating-point value onto the stack.
The last argument:
; copy intermediate copy of TestClass?
mov ESI, [EBP - 40]
mov [EBP - 48], ESI
mov ESI, [EBP - 36]
mov [EBP - 44], ESI
corresponds to the TestClass. Apparently this class is 8 bytes in size, located on the stack from [ebp - 40] to [ebp - 33]. The class is being copied 4 bytes at a time, since the object cannot fit into a single register.
Here's what the stack approximately looks like prior to call passByValue:
lower addr esp => int:arg1 <--.
esp + 4 char:arg2 |
esp + 8 int*:arg3 | copies passed
esp + 12 float:arg4 | to 'passByValue'
esp + 16 TestClass:arg5.1 |
esp + 20 TestClass:arg5.2 <--.
...
...
ebp - 48 TestClass:arg5.1 <-- intermediate copy of
ebp - 44 TestClass:arg5.2 <-- TestClass?
ebp - 40 original TestClass:arg5.1
ebp - 36 original TestClass:arg5.2
...
ebp - 28 original arg4 <--.
ebp - 24 original arg3 | original (local?) variables
ebp - 20 original arg2 | from calling function
ebp - 16 original arg1 <--.
...
higher addr ebp prev frame
What you're looking for are ABI calling conventions. Different platforms have different conventions. e.g. Windows on x86-64 has different conventions than Unix/Linux on x86-64.
http://www.agner.org/optimize/ has a calling-conventions doc detailing the various ones for x86 / amd64.
You can write code in ASM that does whatever you want, but if you want to call other functions, and be called by them, then pass parameters / return values according to the ABI.
It could be useful to make an internal-use-only helper function that doesn't use the standard ABI, but instead uses values in the registers that the calling function allocates them in. This is esp. likely if you're writing the main program in something other than ASM, with just a small part in ASM. Then the asm part only needs to care about being portable to systems with different ABIs for being called from the main program, not for its own internals.