I found two ways to get a whole number from a division in C++.
The question is which way is more efficient (faster).
first way:
Quotient = value1 / value2; // ordinary division, result may have a fractional part
floor(Quotient); // rounding the number down to the nearest integer
second way:
Rest = value1 % value2; // getting the Rest with the modulus operator %
Quotient = (value1-Rest) / value2; // subtracting the Rest so the division comes out even
Also, please demonstrate how to find out which method is faster.
If you're dealing with integers, then the usual way is
Quotient = value1 / value2;
That's it. The result is already an integer, so the floor(Quotient); statement is not needed. As written it has no effect anyway, because its return value is discarded; if rounding were needed, you would write Quotient = floor(Quotient);.
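To make the point concrete, here is a minimal sketch, assuming int operands and a nonzero divisor. The function names are made up for illustration. It shows that both integer methods give the same quotient, and that std::floor only matters for floating-point operands, where its result has to be used.

```cpp
#include <cassert>
#include <cmath>

// Method 1: plain integer division. Since C++11 the result truncates toward zero.
inline int quotient_plain(int a, int b) {
    assert(b != 0);  // division by zero is undefined behavior
    return a / b;
}

// Method 2: subtract the remainder first, then divide.
// Because a == (a / b) * b + a % b, this always equals a / b.
inline int quotient_via_remainder(int a, int b) {
    assert(b != 0);
    return (a - a % b) / b;
}

// Floating-point case: the result of std::floor must be used, not discarded.
// Note that floor rounds toward negative infinity, unlike integer division.
inline double floored_quotient(double a, double b) {
    return std::floor(a / b);
}
```

For example, quotient_plain(-7, 2) and quotient_via_remainder(-7, 2) both give -3, while floored_quotient(-7.0, 2.0) gives -4.0.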
If you have floating point numbers, then the second method won't work at all, as % is only defined for integers. But what does it mean to get a whole number from a division of real numbers? What integer do you get when you divide 8.5 by 3.2? Does it ever make sense to ask this question?
As a side note, the thing you call 'Rest' is normally called the 'remainder'.
Use this program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#ifdef DIV_BY_DIV
#define DIV(a, b) ((a) / (b))
#else
#define DIV(a, b) (((a) - ((a) % (b))) / (b))
#endif
#ifndef ITERS
#define ITERS 1000
#endif
int main()
{
int i, a, b;
srand(time(NULL));
a = rand();
b = rand();
for (i = 0; i < ITERS; i++)
a = DIV(a, b);
return 0;
}
You can time the execution:
mihai#keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 -DDIV_BY_DIV 1.c && time ./a.out
real 0m0.010s
user 0m0.012s
sys 0m0.000s
mihai#keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c && time ./a.out
real 0m0.019s
user 0m0.020s
sys 0m0.000s
Or you can look at the assembly output:
mihai#keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 -DDIV_BY_DIV 1.c -S; mv 1.s 1_div.s
mihai#keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c -S; mv 1.s 1_modulus.s
mihai#keldon:/tmp$ diff 1_div.s 1_modulus.s
24a25,32
> movl %edx, %eax
> movl 24(%esp), %edx
> movl %edx, %ecx
> subl %eax, %ecx
> movl %ecx, %eax
> movl %eax, %edx
> sarl $31, %edx
> idivl 20(%esp)
As you see, doing only the division is faster.
Edited to fix error in code, formatting and wrong diff.
More edit (explaining the assembly diff): In the second case, when doing the modulus first, the assembly shows that two idivl operations are needed: one to get the result of % and one for the actual division. The above diff shows the subtraction and the second division, as the first one is exactly the same in both codes.
Edit: more relevant timing information:
mihai#keldon:/tmp$ gcc -Wall -Wextra -DITERS=42000000 -DDIV_BY_DIV 1.c && time ./a.out
real 0m0.384s
user 0m0.360s
sys 0m0.004s
mihai#keldon:/tmp$ gcc -Wall -Wextra -DITERS=42000000 1.c && time ./a.out
real 0m0.706s
user 0m0.696s
sys 0m0.004s
Hope it helps.
Edit: diff between assembly with -O0 and without.
mihai#keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c -S -O0; mv 1.s O0.s
mihai#keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c -S; mv 1.s noO.s
mihai#keldon:/tmp$ diff noO.s O0.s
Since the default optimization level of gcc is -O0 (see this article explaining optimization levels in gcc), the empty diff was expected.
Edit: if you compile with -O3, as one of the comments suggested, you'll get the same assembly for both alternatives; at that level of optimization they are equivalent.
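If you would rather measure from inside the program than with time, something along these lines could work. It is only a sketch: the function names, the fixed divisor 7, and the use of std::chrono::steady_clock are my assumptions, not part of the program above. Unlike the loop above, the divisions here are independent of each other, so the absolute numbers will differ.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

inline int div_only(int a, int b) { return a / b; }
inline int div_after_mod(int a, int b) { return (a - a % b) / b; }

// Runs fn on `iters` inputs and returns the elapsed time in nanoseconds.
// The sum of all results is written to `checksum` so the work is observable
// and the two variants can be compared for equality.
template <typename Fn>
std::int64_t time_ns(Fn fn, int iters, long long& checksum) {
    assert(iters >= 0);
    volatile int divisor = 7;  // volatile: keeps the compiler from constant-folding the division
    long long sum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        sum += fn(i, divisor);
    }
    const auto stop = std::chrono::steady_clock::now();
    checksum = sum;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Call time_ns(div_only, n, s1) and time_ns(div_after_mod, n, s2) with the same n, check that s1 == s2, and compare the two returned times. As the -O3 note above says, expect the difference to vanish once the optimizer is on.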
I was trying to use AVX in a Mandelbrot program and it's not working right.
I tried debugging it, but GDB refuses to show me the floating-point values in the YMM registers. Here's a minimal example:
t.c
#include <stdio.h>
extern void loadnum(void);
extern double input[4];
extern double output[4];
int main(void)
{
/*
input[0] = 1.1;
input[1] = 2.2;
input[2] = 3.3;
input[3] = 3.14159;
*/
printf("%f %f %f %f\n",input[0],input[1],input[2],input[3]);
loadnum();
printf("%f %f %f %f\n",output[0],output[1],output[2],output[3]);
return 0;
}
l.asm
section .data
global input
global output
align 64
input dq 1.1,2.2,3.3,3.14159
output dq 0,0,0,0
section .text
global loadnum
loadnum:
vmovapd ymm0, [input]
vmovapd [output],ymm0
ret
how it's compiled
OBJECTS = t.o l.o
CFLAGS = -c -O2 -g -no-pie -mavx -Wall
t: $(OBJECTS)
gcc -g -no-pie $(OBJECTS) -o t
t.o: t.c
gcc $(CFLAGS) t.c
l.o: l.asm
nasm -felf64 -gdwarf l.asm
The output is
> 1.100000 2.200000 3.300000 3.141590
> 1.100000 2.200000 3.300000 3.141590
which shows it's loading and storing these doubles as expected, but in gdb it shows
> gdb t (followed by some boilerplate)
> Reading symbols from t...
> (gdb) b loadnum
> Breakpoint 1 at 0x4011b0: file l.asm, line 15.
> (gdb) run
> Starting program: /somedir/t
> 1.100000 2.200000 3.300000 3.141590
> Breakpoint 1, loadnum () at l.asm:15
> 15 vmovapd ymm0, [input]
> (gdb) n
> 16 vmovapd [output],ymm0
> (gdb)
then I say
> (gdb) info all-registers
and this shows up.
> ymm0 (blah blah) v4_double = {0x1, 0x2, 0x3, 0x3}
when I expected it to show
> ymm0 (blah blah) v4_double = {1.100000 2.200000 3.300000 3.141590}
None of the other fields show anything like that, unless you want to parse the floating point bits
> v4_int64 = {0x3ff199999999999a, 0x400199999999999a, 0x400a666666666666, 0x400921f9f01b866e}
How can I fix this?
p $ymm0.v4_double (the print command) defaults to decimal formatting.
Use p /whatever for other formats, like p /x $ymm0.v4_int64 to see hex for the bit-patterns. help p for more.
display $ymm0.v4_double can stand in for layout reg + tui reg vec, which is buggy or broken in some versions and is always an unusable mess of different formats for registers as wide and numerous as ymm0-15. display takes the same options as print, and prints before every prompt. (Use undisplay 1 to disable one of the expressions you've set up, or undisplay to disable all of them.)
It can get cluttered in TUI mode (layout asm or layout reg + layout next to see integer regs and disassembly) if you want to track more than a couple of registers, so you might prefer non-TUI mode: either don't use layout in the first place, or run tui dis.
(When debugging hand-written asm, I almost always want to look at disassembly, not source; but maybe for a complicated algorithm I'd sometimes want to see source with comments as a reminder of what the values should be/mean at a certain point.)
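As a cross-check on what GDB prints, the v4_int64 values in the question are just the raw bit patterns of the doubles. A minimal sketch, assuming a 64-bit IEEE-754 double (the helper name is made up), that turns one of those patterns back into a value:

```cpp
#include <cstdint>
#include <cstring>

// Reinterprets a 64-bit pattern as a double without violating aliasing rules.
inline double bits_to_double(std::uint64_t bits) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch assumes a 64-bit double");
    double value;
    std::memcpy(&value, &bits, sizeof value);  // copy the bytes, no numeric conversion
    return value;
}
```

bits_to_double(0x3ff199999999999a) gives 1.1, matching the first element of ymm0 in the question.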
Please note: this question is neither about code quality, and ways to improve the code, nor about the (in)significance of the runtime differences. It is about GCC and why which compiler optimisation costs performance.
The program
The following code counts the number of Fibonacci primes up to m:
int main() {
unsigned int m = 500000000u;
unsigned int i = 0u;
unsigned int a = 1u;
unsigned int b = 1u;
unsigned int c = 1u;
unsigned int count = 0u;
while (a + b <= m) {
for (i = 2u; i < a + b; ++i) {
c = (a + b) % i;
if (c == 0u) {
i = a + b;
// break;
}
}
if (c != 0u) {
count = count + 1u;
}
a = a + b;
b = a - b;
}
return count; // Just to "output" (and thus use) count
}
When compiled with g++.exe (Rev2, Built by MSYS2 project) 9.2.0 and no optimisations (-O0), the resulting binary executes (on my machine) in 1.9s. With -O1 and -O3 it takes 3.3s and 1.7s, respectively.
I've tried to make sense of the resulting binaries by looking at the assembly code (godbolt.org) and the corresponding control-flow graph (hex-rays.com/products/ida), but my assembler skills don't suffice.
Additional observations
An explicit break in the innermost if makes the -O1 code fast again:
if (c == 0u) {
i = a + b; // Not actually needed any more
break;
}
As does "inlining" the loop's progress expression:
for (i = 2u; i < a + b; ) { // No ++i any more
c = (a + b) % i;
if (c == 0u) {
i = a + b;
++i;
} else {
++i;
}
}
Questions
Which optimisation does/could explain the performance drop?
Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?
Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?
The important thing at play here is the loop-carried data dependency.
Look at machine code of the slow variant of the innermost loop. I'm showing -O2 assembly here, -O1 is less optimized, but has similar data dependencies overall:
.L4:
xorl %edx, %edx
movl %esi, %eax
divl %ecx
testl %edx, %edx
cmove %esi, %ecx
addl $1, %ecx
cmpl %ecx, %esi
ja .L4
See how the increment of the loop counter in %ecx depends on the previous instruction (the cmov), which in turn depends on the result of the division, which in turn depends on the previous value of loop counter.
Effectively there is a chain of data dependencies on computing the value in %ecx that spans the entire loop, and since the time to execute the loop dominates, the time to compute that chain decides the execution time of the program.
Adjusting the program to compute the number of divisions reveals that it executes 434044698 div instructions. Dividing the number of machine cycles taken by the program by this number gives 26 cycles in my case, which corresponds closely to latency of the div instruction plus about 3 or 4 cycles from the other instructions in the chain (the chain is div-test-cmov-add).
In contrast, the -O3 code does not have this chain of dependencies, making it throughput-bound rather than latency-bound: the time to execute the -O3 variant is determined by the time to compute 434044698 independent div instructions.
Finally, to give specific answers to your questions:
1. Which optimisation does/could explain the performance drop?
As another answer mentioned, this is if-conversion creating a loop-carried data dependency where originally there was a control dependency. Control dependencies may be costly too, when they correspond to unpredictable branches, but in this case the branch is easy to predict.
2. Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?
Perhaps you can imagine the optimization transforming the code to
for (i = 2u; i < a + b; ++i) {
c = (a + b) % i;
i = (c != 0) ? i : a + b;
}
where the ternary operator is evaluated on the CPU in such a way that the new value of i is not known until c has been computed.
3. Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?
In those variants the code is not eligible for if-conversion, so the problematic data dependency is not introduced.
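To see the two shapes side by side, here is a sketch of the inner loop written both ways, with n standing in for a + b. The function names are mine, and whether a given compiler emits cmov or a branch for either form is not guaranteed. Both return the last remainder computed, so they can be checked against each other.

```cpp
// If-converted shape: the next value of i is selected from c, so every
// iteration has to wait for the previous division to finish.
// Assumes n < UINT_MAX so that n + 1 does not wrap around.
inline unsigned last_remainder_select(unsigned n) {
    unsigned c = 1u;
    for (unsigned i = 2u; i < n; ++i) {
        c = n % i;
        i = (c != 0u) ? i : n;  // data dependency: i depends on c
    }
    return c;
}

// Branchy shape: leaving the loop is a control dependency, which the
// branch predictor handles well here, so divisions can overlap.
inline unsigned last_remainder_branch(unsigned n) {
    unsigned c = 1u;
    for (unsigned i = 2u; i < n; ++i) {
        c = n % i;
        if (c == 0u) {
            break;
        }
    }
    return c;
}
```

A zero result means a divisor was found; a nonzero result means n is prime.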
I think the problem is the -fif-conversion flag, which instructs the compiler to use CMOV instead of TEST/JZ for some comparisons. And CMOV is known for being not so great in the general case.
There are two points in the disassembly, that I know of, affected by this flag:
First, the if (c == 0u) { i = a + b; } in line 13 is compiled to:
test edx,edx //edx is c
cmove ecx,esi //esi is (a + b), ecx is i
Second, the if (c != 0u) { count = count + 1u; } is compiled to
cmp eax,0x1 //eax is c
sbb r8d,0xffffffff //r8d is count, but what???
Nice trick! It is subtracting -1 from count, but with borrow, and the carry flag is only set if c is less than 1, which for an unsigned value means 0. Thus, if eax is 0, it subtracts -1 from count but then subtracts 1 again, so count does not change. If eax is not 0, it only subtracts -1, which increments the variable.
Now, this avoids branches, but at the cost of missing the obvious optimization that if c == 0u you could jump directly to the next while iteration. This one is so easy that it is even done in -O0.
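The cmp/sbb pair described above can be modelled in plain C++ to check the arithmetic. This is only an emulation under the assumption of 32-bit wrap-around arithmetic, and the function name is made up:

```cpp
#include <cstdint>

// Models: cmp eax, 1 ; sbb r8d, 0xffffffff
// cmp sets the carry flag when c < 1 (unsigned), i.e. only when c == 0.
// sbb computes dest - src - CF, wrapping modulo 2^32.
inline std::uint32_t sbb_count_update(std::uint32_t count, std::uint32_t c) {
    const std::uint32_t carry = (c < 1u) ? 1u : 0u;
    return count - 0xffffffffu - carry;
}
```

With c == 0 the count comes back unchanged, and with any other c it comes back incremented by one, which is exactly what if (c != 0u) { count = count + 1u; } does.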
I believe this is caused by the "conditional move" instruction (CMOVcc) that the compiler generates to replace branching when using -O1 and -O2.
When using -O0, the statement if (c == 0u) is compiled to a jump:
cmp DWORD PTR [rbp-16], 0
jne .L4
With -O1 and -O2:
test edx, edx
cmove ecx, esi
while -O3 produces a jump (similar to -O0):
test edx, edx
je .L5
There is a known bug in gcc where "using conditional moves instead of compare and branch result in almost 2x slower code"
As rodrigo suggested in his comment, using the flag -fno-if-conversion tells gcc not to replace branching with conditional moves, hence preventing this performance issue.
This does not look too friendly:
__asm("command 1"
"command 2"
"command 3");
Do I really have to put a doublequote around every line?
Also... since multiline string literals do not work in GCC, I could not cheat with that either.
I always find examples on the Internet where the author manually inserts a tab and a newline instead of \t and \n, but that doesn't work for me. I'm not sure your example even compiles, but this is how I do it:
asm volatile( // note the backslash line-continuation; with operands present, registers are written %%reg
"xor %%eax,%%eax \n\t\
mov $0x7c802446, %%ebx \n\t\
mov $1000, %%ax \n\t\
push %%eax \n\t\
call *%%ebx \n\t\
add $4, %%esp \n\t\
"
: "=a"(retval) // output in EAX: function return value
:
: "ecx", "edx", "ebx" // tell compiler about clobbers
// Also x87 and XMM regs should be listed.
);
Or put double quotes around each line instead of using \ line-continuation. C string literals separated only by whitespace (including a newline) are concatenated into one long string literal. (That is why you need the \n inside each one, so the assembler sees separate lines.)
This is less ugly and makes it possible to put C comments on each line.
asm volatile( // registers are written %%reg because this statement has operands
"xor %%eax,%%eax \n\t"
"mov $0x7c802446, %%ebx \n\t"
"mov $1000, %%ax \n\t"
"push %%eax \n\t" // function arg
"call *%%ebx \n\t"
"add $4, %%esp \n\t" // rebalance the stack: necessary for asm statements
: "=a"(retval)
:
: "ecx", "edx", "ebx" // clobbers. Function calls themselves kill EAX,ECX,EDX
// function calls also clobber all x87 and all XMM registers, omitted here
);
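The concatenation rule can be checked without any inline asm at all. A minimal sketch (the function names are made up) comparing the quoted-per-line form against one long literal:

```cpp
#include <cstring>

// One literal per line; the compiler joins adjacent literals at translation time.
inline const char* asm_text_adjacent() {
    return "xor %%eax,%%eax \n\t"  // a C comment can sit between the pieces
           "push %%eax \n\t";
}

// The same text written as a single literal.
inline const char* asm_text_single() {
    return "xor %%eax,%%eax \n\tpush %%eax \n\t";
}
```

std::strcmp(asm_text_adjacent(), asm_text_single()) returns 0, so the assembler sees the same text either way.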
C++ multiline string literals
Interesting how this question pointed me to the answer:
main.cpp
#include <cassert>
#include <cinttypes>
int main() {
uint64_t io = 0;
__asm__ (
R"(
incq %0
incq %0
)"
: "+m" (io)
:
:
);
assert(io == 2);
}
Compile and run:
g++ -o main -pedantic -std=c++11 -Wall -Wextra main.cpp
./main
See also: C++ multiline string literal
GCC also supports the same syntax as a C extension; you just have to use -std=gnu99 instead of -std=c99:
main.c
#include <assert.h>
#include <inttypes.h>
int main(void) {
uint64_t io = 0;
__asm__ (
R"(
incq %0
incq %0
)"
: "+m" (io)
:
:
);
assert(io == 2);
}
Compile and run:
gcc -o main -pedantic -std=gnu99 -Wall -Wextra main.c
./main
See also: How to split a string literal across multiple lines in C / Objective-C?
One downside of this method is that I don't see how to add C preprocessor macros in the assembly, since they are not expanded inside of strings, see also: Multi line inline assembly macro with strings
Tested on Ubuntu 16.04, GCC 6.4.0, binutils 2.26.1.
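On the macro limitation mentioned above: one possible way around it, if you go back to ordinary quoted literals instead of a raw string, is the usual two-level stringification idiom. The sketch below only builds the text and does not execute any asm. STEP, STR, STR_IMPL and add_step_text are hypothetical names for illustration.

```cpp
#include <cstring>

#define STR_IMPL(x) #x          // turns its argument into a string literal
#define STR(x) STR_IMPL(x)      // extra level so that x is macro-expanded first
#define STEP 4                  // hypothetical constant to splice into the asm text

// Adjacent literals are concatenated, so the expanded macro value
// ends up inside the instruction text.
inline const char* add_step_text() {
    return "addq $" STR(STEP) ", %0 \n\t";
}
```

The resulting string is "addq $4, %0 \n\t", which could then be used as the template of an asm statement.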
.incbin
This GNU GAS directive is another thing that should be on your radar if you are going to use large chunks of assembly: Embedding resources in executable using GCC
The assembly will be in a separate file, so it is not a direct answer, but it is still worth knowing about.
I'm building Botan on Solaris 11.3 with the SunCC compiler that comes with Developer Studio 12.5. I'm not too familiar with the library or Solaris, and it takes me some effort to track down issues.
The compile is dying on a relatively benign file called divide.cpp. I've got it reduced to the following test case. According to Oracle's GCC-style asm inlining support in Sun Studio 12 compilers, the ASM is well formed. Clang, GCC and ICC happily consume the code.
$ /opt/developerstudio12.5/bin/CC -m64 -std=c++11 test.cxx -c
"test.cxx", [main]:ube: error: Invalid reference to argument '1' in GASM Inlining
CC: ube failed for test.cxx
$ cat test.cxx
#include <iostream>
#include <stdint.h>
typedef uint64_t word;
inline word multadd(word a, word b, word* c)
{
asm(
"mulq %[b] \n\t"
"addq %[c],%[a] \n\t"
"adcq $0,%[carry] \n\t"
: [a]"=a"(a), [b]"=rm"(b), [carry]"=&d"(*c)
: "0"(a), "1"(b), [c]"g"(*c) : "cc");
return a;
}
int main(int argc, char* argv[])
{
word a, b, c, d;
std::cin >> a >> b >> c;
d = multadd(a, b, &c);
return 0;
}
I can't find useful information on the error string Invalid reference to argument 'N' in GASM Inlining. I found "sunCC chokes on inline assembler" on the Oracle boards, but the answer there is that UBE is buggy and that you should buy a support contract to learn more.
I have three questions:
What does the error message indicate?
How can I get SunCC to provide a source file and line number?
How can I work around the issue?
If I change the b parameter to just =m, then the same error is produced. If I change the b parameter to just =r, then a different error is generated:
asm(
"mulq %[b] \n\t"
"addq %[c],%[a] \n\t"
"adcq $0,%[carry] \n\t"
: [a]"=a"(a), [b]"=r"(b), [carry]"=&d"(*c)
: "0"(a), "1"(b), [c]"g"(*c) : "cc");
And the result:
$ /opt/developerstudio12.5/bin/CC -m64 -std=c++11 test.cxx -c
Assembler: test.cxx
"<null>", line 205 : Invalid instruction argument
Near line: "mulq %rcx "
"<null>", line 206 : Invalid instruction argument
Near line: " addq %rbx,%rax "
"<null>", line 207 : Invalid instruction argument
Near line: " adcq $0,%rdx "
CC: ube failed for test.cxx
What does the error message indicate?
Unfortunately, no idea.
If someone buys a support contract and has the time, then please solicit Oracle for an answer.
How can I get SunCC to provide a source file and line number?
Unfortunately, no idea.
How can I work around the issue?
David Wohlferd suspected the [b]"=rm"(b) output operand. It looks like the one ASM block needs to be split into two blocks. It's an awful hack, but we have not figured out another way to do it.
inline word multadd(word a, word b, word* c)
{
asm(
"mulq %[b] \n\t"
: [a]"+a"(a), [b]"=&d"(b)
: "0"(a), "1"(b));
asm(
"addq %[c],%[a] \n\t"
"adcq $0,%[carry] \n\t"
: [a]"=a"(a), [carry]"=&d"(*c)
: "a"(a), "d"(b), [c]"g"(*c) : "cc");
return a;
}
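This is not the workaround above, but when fiddling with the constraints it can help to have a plain C++ reference to compare the asm against. A sketch, assuming the intended semantics are "return the low 64 bits of a * b + *c and store the high 64 bits in *c". The name multadd_portable is made up and this is not Botan's code.

```cpp
#include <cstdint>

typedef std::uint64_t word;

// Portable 64x64 -> 128 bit multiply-add built from 32-bit halves.
inline word multadd_portable(word a, word b, word* c) {
    const word mask = 0xFFFFFFFFu;
    const word a_lo = a & mask, a_hi = a >> 32;
    const word b_lo = b & mask, b_hi = b >> 32;

    const word p0 = a_lo * b_lo;  // contributes to bits 0..63
    const word p1 = a_lo * b_hi;  // contributes to bits 32..95
    const word p2 = a_hi * b_lo;  // contributes to bits 32..95
    const word p3 = a_hi * b_hi;  // contributes to bits 64..127

    // Sum of the three pieces that land in bits 32..63; cannot overflow 64 bits.
    const word mid = (p0 >> 32) + (p1 & mask) + (p2 & mask);

    word lo = (p0 & mask) | (mid << 32);
    word hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);

    lo += *c;                  // corresponds to addq %[c],%[a]
    hi += (lo < *c) ? 1 : 0;   // corresponds to adcq $0,%[carry]
    *c = hi;
    return lo;
}
```

Feeding the same inputs to multadd and multadd_portable and comparing both the return value and *c gives a quick sanity check on whichever constraint variant finally assembles.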
And again about STL std::bitset - its documentation says that functions set/reset/test do boundary checks, and operator[] doesn't. My timing experiments show that functions set/test typically perform 2-3% faster than the operator[]. The code I'm working with is:
typedef unsigned long long U64;
const U64 MAX = 800000000ULL;
struct Bitmap1
{
void insert(U64 N) {this->s[N % MAX] = 1;}
bool find(U64 N) const {return this->s[N % MAX];}
private:
std::bitset<MAX> s; // <---- takes MAX/8 memory (in bytes)
};
struct Bitmap2
{
void insert(U64 N) {this->s.set(N % MAX);}
bool find(U64 N) const {return this->s.test(N % MAX);}
private:
std::bitset<MAX> s; // <---- takes MAX/8 memory (in bytes)
};
int main()
{
Bitmap2* s = new Bitmap2();
// --------------------------- storing
const size_t t0 = time(0);
for (unsigned k = 0; k < LOOPS; ++k)
{
for (U64 i = 0; i < MAX; ++i) s->insert(i);
}
cout << "storing: " << time(0) - t0 << endl;
// -------------------------------------- search
const size_t t1 = time(0);
U64 count = 0;
for (unsigned k = 0; k < LOOPS; ++k)
{
for (U64 i = 0; i < MAX; ++i) if (s->find(i)) ++count;
}
cout << "search: " << time(0) - t1 << endl;
cout << count << endl;
}
How to explain this? Absence of boundary checks should save us some cycles, right?
Compiler: g++ 4.8.1 (options -g -O4)
VMware VM: Ubuntu 3.11.0-15
Host: MacBook Pro
When I remove rand, division, output, and the memory cache from the timings:
bool bracket_test() {
std::bitset<MAX> s;
for(int j=0; j<num_iterations; ++j) {
for(int i=0; i<MAX; ++i)
s[i] = !s[MAX-1-i];
}
return s[0];
}
bool set_test() {
std::bitset<MAX> s;
for(int j=0; j<num_iterations; ++j) {
for(int i=0; i<MAX; ++i)
s.set(i, !s.test(MAX-1-i));
}
return s.test(0);
}
bool no_test() {
bool s = false;
for(int j=0; j<num_iterations; ++j) {
for(int i=0; i<MAX; ++i)
s = !s;
}
return s;
}
I get these results with Clang at http://coliru.stacked-crooked.com/a/cdc832bfcc7e32be. (I do 10000 iterations, 20 times, and measure the lowest time, which mitigates timing errors.)
clang++ -std=c++11 -O0 -Wall -Wextra -pedantic -pthread main.cpp && ./a.out
bracket_test took 178663845 ticks to find result 1
set_test took 117336632 ticks to find result 1
no_test took 9214297 ticks to find result 0
clang++ -std=c++11 -O1 -Wall -Wextra -pedantic -pthread main.cpp && ./a.out
bracket_test took 798184780 ticks to find result 1
set_test took 565999680 ticks to find result 1
no_test took 41693575 ticks to find result 0
clang++ -std=c++11 -O2 -Wall -Wextra -pedantic -pthread main.cpp && ./a.out
bracket_test took 81240369 ticks to find result 1
set_test took 72172912 ticks to find result 1
no_test took 41907685 ticks to find result 0
clang++ -std=c++11 -O3 -Wall -Wextra -pedantic -pthread main.cpp && ./a.out
bracket_test took 77688054 ticks to find result 1
set_test took 72433185 ticks to find result 1
no_test took 41433010 ticks to find result 0
Previous versions of this test found that brackets were slightly faster, but now that I've improved the accuracy of the timings, it appears that my margin of error for timing is approximately 3%. At O1 Set is 35-54% faster, at O2 it's 13-49% faster, and at O3 it's 2-34% faster. This seems pretty conclusive to me, aside from looking at the assembly output.
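For reference, the "run it several times and keep the lowest time" approach can be sketched like this. It is not the harness from the Coliru link: the helper name is made up, and it uses std::chrono::steady_clock and nanoseconds rather than ticks.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Calls fn `repeats` times and returns the shortest observed run in nanoseconds.
// Taking the minimum filters out runs disturbed by scheduling or cache noise.
template <typename Fn>
std::int64_t best_of_n_ns(Fn fn, int repeats) {
    assert(repeats > 0);
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (ns < best) {
            best = ns;
        }
    }
    return best;
}
```

You would pass something like a lambda that calls bracket_test() or set_test() and stores the result somewhere visible, so the optimizer cannot drop the call.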
So here's assembly (at GCC -O) via http://assembly.ynh.io/:
std::bitset<MAX> s;
s[1000000] = true;
return s;
0000 4889F8 movq %rdi, %rax
0003 4889FA movq %rdi, %rdx
0006 488D8F00 leaq 100000000(%rdi), %rcx
E1F505
000d 48C70200 movq $0, (%rdx)
000000
0014 4883C208 addq $8, %rdx
0018 4839CA cmpq %rcx, %rdx
001b 75F0 jne .L2
001d 48838848 orq $1, 125000(%rax)
E8010001
0025 C3 ret
and
std::bitset<MAX> s;
s.set(1000000);
return s;
0026 4889F8 movq %rdi, %rax
0029 4889FA movq %rdi, %rdx
002c 488D8F00 leaq 100000000(%rdi), %rcx
E1F505
0033 48C70200 movq $0, (%rdx)
000000
003a 4883C208 addq $8, %rdx
003e 4839CA cmpq %rcx, %rdx
0041 75F0 jne .L6
0043 48838848 orq $1, 125000(%rax)
E8010001
004b C3 ret
I can't really read assembly so well, but these are completely identical, so analysis of this case is easy. If the compiler knows they're both in range, it optimizes out the range check. When I replace the fixed index with a variable index, Set adds 5 operations to check for the boundary case.
As for the reasons: Set is sometimes faster because operator[] has to do a ton of work for the reference proxy that Set doesn't have to do. Set is sometimes slower because the proxy is trivially inlined, in which case the only difference is that Set has to do the boundary check. On the other hand, Set only has to do the boundary check if the compiler cannot prove that indexes are always in range. So it depends a lot on the surrounding code. Your results may differ.
http://en.cppreference.com/w/cpp/utility/bitset/set says:
Sets the bit at position pos to the value value.
Throws std::out_of_range if pos does not correspond to a valid position within the bitset.
http://en.cppreference.com/w/cpp/utility/bitset/operator_at says:
Accesses the bit at position pos. Returns an object of type std::bitset::reference that allows modification of the value.
Unlike test(), does not throw exceptions: the behavior is undefined if pos is out of bounds.
and http://en.cppreference.com/w/cpp/utility/bitset/reference says:
The std::bitset class includes std::bitset::reference as a publicly-accessible nested class. This class is used as a proxy object to allow users to interact with individual bits of a bitset, since standard C++ types (like references and pointers) are not built with enough precision to specify individual bits. The primary use of std::bitset::reference is to provide an l-value that can be returned from operator[]. Any reads or writes to a bitset that happen via a std::bitset::reference potentially read or write to the entire underlying bitset.
It should be clear that operator[] actually has a lot more to it than is intuitive.
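The documented difference is easy to see on a small bitset. A minimal sketch (the 8-bit size and the function names are just for illustration):

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// set() checks the position and throws std::out_of_range when it is invalid.
inline bool set_in_range(std::size_t pos) {
    std::bitset<8> bits;
    try {
        bits.set(pos);
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}

// operator[] does no check; it hands back a proxy object, and the write
// to the underlying bit goes through that proxy.
inline bool write_through_proxy(std::size_t pos) {
    assert(pos < 8);  // out-of-range access via operator[] is undefined behavior
    std::bitset<8> bits;
    std::bitset<8>::reference bit = bits[pos];
    bit = true;
    return bits.test(pos);
}
```

set_in_range(8) returns false because of the thrown exception, while write_through_proxy(5) returns true after going through the proxy.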