I am writing a modulo function and want to optimize the number of instructions called. Currently it looks like this
#include <cstdlib>
constexpr long long mod = 1e9 + 7;
static __attribute__((always_inline)) long long modulo(long long x) noexcept
{
return lldiv(x, mod).rem;
}
and even though I am compiling with -static there is a call to lldiv that isn't being inlined. I want to get around this by adding inline assembly to my function to call the idiv instruction directly. How can I do this?
I am using g++ 9.4.0 and I'm targeting x86.
Since this is related to a competetive programming contest, the code will be compiled with g++ -std=c++17 -O3 -static.
Related
PROBLEM:
I'm doing a fixed point c++ class to perform some closed loop control system on an 8b microcontroller.
I wrote a C++ class to encapsulate the PID and tested the algorithm on an X86 desktop with a modern gcc compiler. All good.
When I compiled the same code on an 8b microcontroller with a modern avr-g++ compiler, I had weird artefacts. After some debugging, the problem was that the 16b*16b multiplication was truncated to 16b. Below some minimal code to show what I'm trying to do.
I used -O2 optimization on the desktop system and -OS optimization on the embedded system, with no other compiler flag.
#include <cstdio>
#include <stdint.h>
#define TEST_16B true
#define TEST_32B true
int main( void )
{
if (TEST_16B)
{
int16_t op1 = 9000;
int16_t op2 = 9;
int32_t res;
//This operation gives the correct result on X86 gcc (81000)
//This operation gives the wrong result on AVR avr-g++ (15464)
res = (int32_t)0 +op1 *op2;
printf("op1: %d | op2: %d | res: %d\n", op1, op2, res );
}
if (TEST_32B)
{
int16_t op1 = 9000;
int16_t op2 = 9;
int32_t res;
//Promote first operand
int32_t promoted_op1 = op1;
//This operation gives the correct result on X86 gcc (81000)
//This operation gives the correct result on AVR avr-g++ (81000)
res = promoted_op1 *op2;
printf("op1: %d | op2: %d | res: %d\n", promoted_op1, op2, res );
}
return 0;
}
SOLUTION:
Just promoting one operand to 32b with a local variable is enough to solve the problem.
My expectation was that C++ would garantuee that a math operation would be performed at the same width as the first operand, so in my mind res = (int32_t)0 +... should have told the compiler that whatever came after should be performed at int32_t resolution.
This is not what happened. The (int16_t)*(int16_t) operation got truncated to (int16_t).
gcc has an internal word width of at least 32b in an X86 machine, so that might be the reason I didn't see artefacts on my desktop.
AVR Command Line
E:\Programs\AVR\7.0\toolchain\avr8\avr8-gnu-toolchain\bin\avr-g++.exe$(QUOTE) -funsigned-char -funsigned-bitfields -DNDEBUG -I"E:\Programs\AVR\7.0\Packs\atmel\ATmega_DFP\1.3.300\include" -Os -ffunction-sections -fdata-sections -fpack-struct -fshort-enums -Wall -pedantic -mmcu=atmega4809 -B "E:\Programs\AVR\7.0\Packs\atmel\ATmega_DFP\1.3.300\gcc\dev\atmega4809" -c -std=c++11 -fno-threadsafe-statics -fkeep-inline-functions -v -MD -MP -MF "$(#:%.o=%.d)" -MT"$(#:%.o=%.d)" -MT"$(#:%.o=%.o)" -o "$#" "$<"
QUESTION:
Is this the actual expected behaviour of a compliant C++ compiler, meaning I did it wrong, or is this a quirk of the avr-g++ compiler?
UPDATE:
Debugger output of various solutions
This is expected behavior of the compiler.
When you write A + B * C, that is equivalent to A + (B * C) because of operator precedence. The B * C term is evaluated on its own, without regard to how it is going to be used later. (Otherwise, it would be really hard to look at C/C++ code and understand what is actually going to happen.)
There are integer promotion rules in the C/C++ standards that sometimes help you out by promoting B and C to be of type int or maybe unsigned int before performing the multiplication. That is why you get the expected result on x86 gcc, where an int has 32 bits. However, since an int in avr-gcc only has 16 bits, the integer promotion is not good enough for you. So you need to cast either B or C to an int32_t to ensure the result of the multiplication will be an int32_t as well. For example, you can do:
A + (int32_t)B * C
I have some code using large integer literals as follows:
if(nanoseconds < 1'000'000'000'000)
This gives the compiler warning integer constant is too large for 'long' type [-Wlong-long]. However, if I change it to:
if(nanoseconds < 1'000'000'000'000ll)
...I instead get the warning use of C++11 long long integer constant [-Wlong-long].
I would like to disable this warning just for this line, but without disabling -Wlong-long or using -Wno-long-long for the entire project. I have tried surrounding it with:
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wlong-long"
...
#pragma GCC diagnostic pop
but that does not seem to work here with this warning. Is there something else I can try?
I am building with -std=gnu++1z.
Edit: minimal example for the comments:
#include <iostream>
auto main()->int {
double nanoseconds = 10.0;
if(nanoseconds < 1'000'000'000'000ll) {
std::cout << "hello" << std::endl;
}
return EXIT_SUCCESS;
}
Building with g++ -std=gnu++1z -Wlong-long test.cpp gives test.cpp:6:20: warning: use of C++11 long long integer constant [-Wlong-long]
I should mention this is on a 32bit platform, Windows with MinGW-w64 (gcc 5.1.0) - the first warning does not seem to appear on my 64bit Linux systems, but the second (with the ll suffix) appears on both.
It seems that the C++11 warning when using the ll suffix may be a gcc bug. (Thanks #praetorian)
A workaround (inspired by #nate-eldredge's comment) is to avoid using the literal and have it produced at compile time with constexpr:
int64_t constexpr const trillion = int64_t(1'000'000) * int64_t(1'000'000);
if(nanoseconds < trillion) ...
I've been reading again Scott Meyers' Effective C++ and more specifically Item 30 about inlining.
So I wrote the following, trying to induce that optimization with gcc 4.6.3
// test.h
class test {
public:
inline int max(int i) { return i > 5 ? 1 : -1; }
int foo(int);
private:
int d;
};
// test.cpp
int test::foo(int i) { return max(i); }
// main.cpp
#include "test.h"
int main(int argc, const char *argv[]) {
test t;
return t.foo(argc);
}
and produced the relevant assembly using alternatively the following:
g++ -S -I. test.cpp main.cpp
g++ -finline-functions -S -I. test.cpp main.cpp
Both commands produced the same assembly as far as the inline method is concerned;
I can see both the max() method body (also having a cmpl statement and the relevant jumps) and its call from foo().
Am I missing something terribly obvious? I can't say that I combed through the gcc man page, but couldn't find anything relevant standing out.
So, I just increased the optimization level to -O3 which has the inline optimizations on by default, according to:
g++ -c -Q -O3 --help=optimizers | grep inline
-finline-functions [enabled]
-finline-functions-called-once [enabled]
-finline-small-functions [enabled]
unfortunately, this optimized (as expected) the above code fragment almost out of existence.
max() is no longer there (at least as an explicitly tagged assembly block) and foo() has been reduced to:
_ZN4test3fooEi:
.LFB7:
.cfi_startproc
rep
ret
.cfi_endproc
which I cannot clearly understand at the moment (and is out of research scope).
Ideally, what I would like to see, would have been the assembly code for max() inside the foo() block.
Is there a way (either through cmd-line options or using a different (non-trivial?) code fragment) to produce such an output?
The compiler is entirely free to inline functiones even if you don't ask it to - both when you use inline keyword or not, or whether you use -finline-functions or not (although probably not if you use -fnoinline-functions - that would be contrary to what you asked for, and although the C++ standard doesn't say so, the flag becomes pretty pointless if it doesn't do something like what it says).
Next, the compiler is also not always certain that your function won't be used "somewhere else", so it will produce an out-of-line copy of most inline functions, unless it's entirely clear that it "can not possibly be called from somewhere else [for example the class is declared such that it can't be reached elsewhere].
And if you don't use the result of a function, and the function doesn't have side-effects (e.g. writing to a global variable, performing I/O or calling a function the compiler "doesn't know what it does"), then the compiler will eliminate that code as "dead" - because you don't really want unnecessary code, do you? Adding a return in front of max(i) in your foo function should help.
Consider following scheme. We have 3 files:
main.cpp:
int main() {
clock_t begin = clock();
int a = 0;
for (int i = 0; i < 1000000000; ++i) {
a += i;
}
clock_t end = clock();
printf("Number: %d, Elapsed time: %f\n",
a, double(end - begin) / CLOCKS_PER_SEC);
begin = clock();
C b(0);
for (int i = 0; i < 1000000000; ++i) {
b += C(i);
}
end = clock();
printf("Number: %d, Elapsed time: %f\n",
a, double(end - begin) / CLOCKS_PER_SEC);
return 0;
}
class.h:
#include <iostream>
struct C {
public:
int m_number;
C(int number);
void operator+=(const C & rhs);
};
class.cpp
C::C(int number)
: m_number(number)
{
}
void
C::operator+=(const C & rhs) {
m_number += rhs.m_number;
}
Files are compiled using clang++ with flags -std=c++11 -O3.
What I expected were very similar performance results, since I thought that compiler will optimize the operators not to be called as functions. The reality though was a bit different, here is the result:
Number: -1243309312, Elapsed time: 0.000003
Number: -1243309312, Elapsed time: 5.375751
I played around a bit and found out, that if I paste all of the code from class.* into the main.cpp the speed dramatically improves and results are very similar.
Number: -1243309312, Elapsed time: 0.000003
Number: -1243309312, Elapsed time: 0.000003
Than I realized that this behavior is probably caused by the fact, that compilation of main.cpp and class.cpp is completely separated and therefore compiler is unable to perform adequate optimizations.
My question: Is there any way of keeping the 3-file scheme and still achieve the optimization level as if the files were merged into one and than compiled? I have read something about 'unity builds' but that seems like an overkill.
Short answer
What you want is link time optimization. Try the answer from this question. I.e., try:
clang++ -O4 -emit-llvm main.cpp -c -o main.bc
clang++ -O4 -emit-llvm class.cpp -c -o class.bc
llvm-link main.bc class.bc -o all.bc
opt -std-compile-opts -std-link-opts -O3 all.bc -o optimized.bc
clang++ optimized.bc -o yourExecutable
You should see that your performance reaches the one that you had when pasting everything into main.cpp.
Long answer
The problem is that the compiler cannot inline your overloaded operator during linking, because it no longer has its definition in a form which it can use to inline it (it cannot inline bare machine code). Thus, the operator call in main.cpp will stay a real function call to the function declared in class.cpp. A function call is very expensive in comparison to a simple inlined addition which can be optimized further (e.g., vectorized).
When you enable link time optimization, the compiler is able to do this. As you see above, you first create llvm intermediate representation byte code (the .bc files, which I will simply call llvm code hereinafter) instead of machine code.
You then link these files to a new .bc file which still contains llvm code instead of machine code. In contrast to machine code, the compiler is able to perform inlining on llvm code. opt is the llvm optimizer (be sure to install llvm), which performs the inlining and further link time optimizations. Then, we call clang++ a final time to generate executable machine code from the optimized llvm code.
For People with GCC
The answer above is only for clang. GCC (g++) users must use the -flto flag during compilation and during linking to enable link time optimization. It is simpler than with clang, simply add -flto everywhere:
g++ -c -O2 -flto main.cpp
g++ -c -O2 -flto class.cpp
g++ -o myprog -flto -O2 main.o class.o
The technique what you are looking for is called Link Time Optimization.
From the timing data, it is obvious that the compiler doesn't just generate better code for the trivial case, but that it doesn't perform any code at all to sum up a billion number. That doesn't happen in real life. You are not performing a useful benchmark. You want to test code that is at least complicated enough to avoid stupid/clever things like this.
I'd re-run the test, but change the loop to
for (int i = 0; i < 1000000000; ++i) if (i != 1000000) {
// ...
}
so that the compiler is forced to actually add up the numbers.
I was writing some templated code to benchmark a numeric algorithm using both floats and doubles, in order to compare against a GPU implementation.
I discovered that my floating point code was slower and after investigating using Vtune Amplifier from Intel I discovered that g++ was generating extra x86 instructions (cvtps2pd/cvtpd2ps and unpcklps/unpcklpd) to convert some intermediate results from float to double and then back again. The performance degradation is almost 10% for this application.
After compiling with the flag -Wdouble-promotion (which BTW is not included with -Wall or -Wextra), sure enough g++ warned me that the results were being promoted.
I reduced this to a simple test case shown below. Note that the ordering of the c++ code affects the generated code. The compound statement (T d1 = log(r)/r;) produces a warning, whilst the separated version does not (T d = log(r); d/=r;).
The following was compiled with both g++-4.6.3-1ubuntu5 and g++-4.7.3-2ubuntu1~12.04 with the same results.
Compile flags are:
g++-4.7 -O2 -Wdouble-promotion -Wextra -Wall -pedantic -Werror -std=c++0x test.cpp -o test
#include <cstdlib>
#include <iostream>
#include <cmath>
template <typename T>
T f()
{
T r = static_cast<T>(0.001);
// Gives no double promotion warning
T d = log(r);
d/=r;
// Promotes to double
T d1 = log(r)/r;
return d+d1;
}
int main()
{
float f1 = f<float>();
std::cout << f1 << std::endl;
}
I realise that the c++11 standard allows the compiler discretion here. But why does the order matter?
Can I explicitly instruct g++ to use floats only for this calculation?
EDIT: SOLVED by Mike Seymour. Needed to use std::log to ensure picking up the overloaded version of log instead of calling the C double log(double). The warning was not generated for the separated statement because this is a conversion and not a promotion.
The problem is
log(r)
In this implementation, it seems that the only log in the global namespace is the C library function, double log(double). Remember that it's not specified whether or not the C-library headers in the C++ library dump their definitions into the global namespace as well as namespace std.
You want
std::log(r)
to ensure that the extra overloads defined by the C++ library are available.