I have an O(N^4) scaling algorithm of the form
...
...
...
for (unsigned i = 0; i < nI; ++i) {
    for (unsigned j = 0; j < nJ; ++j) {
        for (unsigned k = 0; k < nK; ++k) {
            for (unsigned l = 0; l < nL; ++l) {
                *calculate value*
                *do something with value*
            }
        }
    }
}
I need this code in a couple of places, so I put it in a looper function as part of a class. This looper function is templated so that it can accept a lambda which takes care of *do something with value*.
Some tests have shown that this approach is not optimal performance-wise, but I do not see any way around explicitly writing out this code every time I need it. Do you see a way of doing this?
Using a templated function to call the lambda should generate code that can be optimized by modern optimizing compilers. This is actually the case for the latest versions of GCC, Clang and MSVC. You can check that on Godbolt with this code:
extern int unknown1();
extern int unknown2(int);

template <typename LambdaType>
int compute(LambdaType lambda, int nI, int nJ, int nK, int nL)
{
    int sum = 0;
    for (int i = 0; i < nI; ++i) {
        for (int j = 0; j < nJ; ++j) {
            for (int k = 0; k < nK; ++k) {
                for (int l = 0; l < nL; ++l) {
                    sum += lambda(i, j, k, l);
                }
            }
        }
    }
    return sum;
}

int caller(int nI, int nJ, int nK, int nL)
{
    int context = unknown1();
    auto lambda = [&](int i, int j, int k, int l) -> int {
        return unknown2(context + i + j + k + l);
    };
    return compute(lambda, nI, nJ, nK, nL);
}
Using optimization flags, GCC, Clang and MSVC are all capable of generating an efficient implementation of compute, eliding the lambda calls in the 4 nested loops (unknown2 is called directly in the generated assembly). This is the case even if compute is not inlined. Note that the fact that the lambda captures its context does not actually prevent optimizations (although this case is much harder for the compiler to optimize).
Note that it is important to use the actual lambda type and not a wrapper like std::function, as a wrapper will likely prevent optimizations (or at least make them much more difficult to apply), resulting in indirect function calls. Indeed, the concrete type helps the compiler to inline the function and then apply further optimizations like vectorization and constant propagation.
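For contrast, here is a hedged sketch of the type-erased variant this paragraph warns against (compute_erased is a hypothetical name, not from the original code); the call through std::function in the innermost loop is an indirect call that optimizers usually cannot inline:

#include <functional>

int compute_erased(const std::function<int(int, int, int, int)>& f,
                   int nI, int nJ, int nK, int nL)
{
    int sum = 0;
    for (int i = 0; i < nI; ++i)
        for (int j = 0; j < nJ; ++j)
            for (int k = 0; k < nK; ++k)
                for (int l = 0; l < nL; ++l)
                    sum += f(i, j, k, l);   // indirect call: usually not inlined
    return sum;
}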
Note that the body of the lambda should be kept small; otherwise it may not be inlined, resulting in a function call. A direct function call is not that slow on modern processors if the function body is fairly big, thanks to good branch prediction units and relatively fast, large caches. However, the cost of losing the further optimizations that lambda inlining makes possible can be huge. One way to mitigate this cost is to move at least one loop into the lambda (see data-oriented design for more information). Another solution is to use OpenMP to help the compiler vectorize the lambda through #pragma omp declare simd [...] directives (assuming your compiler supports them). You can also play with your compiler's inlining command-line parameters to tell it to actually inline the lambda in such a case.
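As an illustration of the first mitigation (moving a loop into the lambda), here is a minimal sketch; compute_rows is a hypothetical name, not part of the original code. The looper hands the innermost range to the lambda, so the hot path makes one call per (i, j, k) instead of one per (i, j, k, l):

template <typename LambdaType>
int compute_rows(LambdaType lambda, int nI, int nJ, int nK, int nL)
{
    int sum = 0;
    for (int i = 0; i < nI; ++i)
        for (int j = 0; j < nJ; ++j)
            for (int k = 0; k < nK; ++k)
                sum += lambda(i, j, k, nL);   // the lambda iterates over l itself
    return sum;
}

Even if the lambda body is too big to inline, its per-call overhead is now amortized over nL iterations.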
Related
According to this question, it is impossible to leave variables uninitialized inside a constexpr function. Sometimes, though, we do not want to initialize variables for performance reasons. Is it possible to "overload" the function somehow, so that it allows for a constexpr version and a higher-performance non-constexpr version?
As an example, consider the following add function in a custom class vec:
// desired non-constexpr version: ret is left uninitialized
auto add(vec that) const {
    vec ret;
    for (int i = 0; i < n; i++)
        ret[i] = (*this)[i] + that[i];
    return ret;
}

// constexpr version: ret must be initialized
constexpr auto add(vec that) const {
    vec ret = {};
    for (int i = 0; i < n; i++)
        ret[i] = (*this)[i] + that[i];
    return ret;
}
The C++ compiler is very good at optimization, especially inside constexpr functions. The initialization will very likely be optimized away and have no additional cost; and in your case it may not even matter, since default-constructing the vector already initializes it.
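For concreteness, here is a minimal sketch (this vec is a hypothetical stand-in for the asker's class, with n fixed at 4) showing that the single constexpr version with vec ret = {}; serves both contexts; at run time the zero-initialization is typically elided because every element is overwritten:

#include <cstddef>

constexpr std::size_t n = 4;

struct vec {
    double data[n];

    constexpr double& operator[](std::size_t i) { return data[i]; }
    constexpr double operator[](std::size_t i) const { return data[i]; }

    constexpr vec add(vec that) const {
        vec ret = {};                     // initialization required for constexpr
        for (std::size_t i = 0; i < n; i++)
            ret[i] = (*this)[i] + that[i];
        return ret;
    }
};

constexpr vec a{{1, 2, 3, 4}};
constexpr vec b{{4, 3, 2, 1}};
constexpr vec c = a.add(b);               // evaluated at compile time
static_assert(c[0] == 5.0, "compile-time add works");

vec add_at_runtime(const vec& x, const vec& y) {
    return x.add(y);                      // initialization typically optimized away
}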
A few answers, discussions, and even the source code of boost::irange mention that there should be a performance penalty for using these ranges over raw for loops.
However, for example for the following code
#include <boost/range/irange.hpp>

int sum(int* v, int n) {
    int result{};
    for (auto i : boost::irange(0, n)) {
        result += v[i];
    }
    return result;
}

int sum2(int* v, int n) {
    int result{};
    for (int i = 0; i < n; ++i) {
        result += v[i];
    }
    return result;
}
I see no differences in the generated (-O3 optimized) code (Compiler Explorer). Does anyone see an example where using such an integer range could lead to worse code generation in modern compilers?
EDIT: Clearly, debug performance might be impacted, but that's not my aim here.
Concerning the strided (step size > 1) example, I think it might be possible to modify the irange code to more closely match the code of a strided raw for-loop.
Does anyone see an example where using such an integer range could lead to worse code generation in modern compilers?
Yes. That is not to say that your particular case is affected, but changing the step to anything other than 1 does:
#include <boost/range/irange.hpp>

int sum(int* v, int n) {
    int result{};
    for (auto i : boost::irange(0, n, 8)) {  // ^^^ different step
        result += v[i];
    }
    return result;
}

int sum2(int* v, int n) {
    int result{};
    for (int i = 0; i < n; i += 8) {         // ^^^ different step
        result += v[i];
    }
    return result;
}
Live.
While sum now looks worse (the loop did not get unrolled), sum2 still benefits from loop unrolling and SIMD optimization.
Edit:
To comment on your edit: it's true that it might be possible to modify the irange code to more closely match a raw loop. But:
To fit how range-based for loops are expanded, boost::irange(0, n, 8) must create some sort of temporary implementing begin/end iterators and a prefix operator++ (which is clearly not as trivial as an int += operation).
Compilers use pattern matching for optimization, and that matching is tuned to work with standard C++ and the standard libraries. Thus, if the result of irange differs even slightly from a pattern the compiler knows how to optimize, optimization won't kick in.
These, I think, are the reasons why the author of the library mentions performance penalties.
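To make the first point concrete, here is a simplified sketch (not Boost's actual implementation) of the machinery a strided range needs. The loop condition the compiler sees is an operator!= call on a struct rather than the plain i < n / i += 8 pattern it knows how to vectorize:

struct strided_range {
    int first, last, step;

    struct iterator {
        int value, step;
        int operator*() const { return value; }
        iterator& operator++() { value += step; return *this; }
        bool operator!=(const iterator& other) const {
            // must compare with <, not ==, so a step that overshoots
            // a non-multiple end still terminates the loop
            return value < other.value;
        }
    };

    iterator begin() const { return {first, step}; }
    iterator end() const { return {last, step}; }
};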
Say you have a function like:
double do_it(int m)
{
    double result = 0;
    for (int i = 0; i < m; i++)
        result += i;
    return result;
}
If you know m at compile time you can do:
template<size_t t_m>
double do_it()
{
    double result = 0;
    for (int i = 0; i < t_m; i++)
        result += i;
    return result;
}
This gives a possibility for things like loop unrolling when optimizing. But, sometimes you might know some cases at compile-time and some at run-time. Or, perhaps you have defaults which a user could change...but it would be nice to optimize the default case.
I'm wondering if there is any way to provide both versions without basically duplicating the code or using a macro?
Note that the above is a toy example to illustrate the point.
In terms of the language specification, there's no general way to have a function that works in the way you desire. But that doesn't mean compilers can't do it for you.
This gives a possibility for things like loop unrolling when optimizing.
You say this as though the compiler cannot unroll the loop otherwise.
The reason the compiler can unroll the template loop is because of the confluence of the following:
The compiler has the definition of the function. In this case, the function definition is provided (it's a template function, so its definition has to be provided).
The compiler has the compile-time value of the loop counter. In this case, through the template parameter.
But neither of these factors explicitly requires a template. If the compiler has the definition of a function, and it can determine the compile-time value of the loop counter, then it has 100% of the information needed to unroll that loop.
How it gets this information is irrelevant. It could be an inline function (you have to provide the definition) which you call given a compile-time constant as an argument. It could be a constexpr function (again, you have to provide the definition) which you call given a compile-time constant as an argument.
This is a matter of quality of implementation, not of language. If compile-time parameters are to ever be a thing, it would be to support things you cannot do otherwise, not to support optimization (or at least, not compiler optimizations). For example, you can't have a function which returns a std::array whose length is specified by a regular function parameter rather than a template parameter.
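As a hedged illustration of this point, reusing the question's do_it: the function below is a plain non-template function, yet because its definition is visible and the argument is a compile-time constant, the optimizer can inline it and fully unroll or fold the loop (the folded constant is what GCC and Clang typically produce at -O2, not a language guarantee):

inline double do_it(int m)
{
    double result = 0;
    for (int i = 0; i < m; i++)
        result += i;
    return result;
}

double caller()
{
    return do_it(16);   // typically folded to the constant 120.0 at -O2
}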
Yes you can, with std::integral_constant. Specifically, the following function will work with an int, as well as specializations of std::integral_constant.
template<class Num>
constexpr double do_it(Num m_unconverted) {
    double result = 0.;
    int m_converted = static_cast<int>(m_unconverted);
    for (int i = 0; i < m_converted; i++) { result += i; }
    return result;
}
If you want to call do_it with a compile-time constant, then you can use
constexpr double result = do_it(std::integral_constant<int, 5>{});
Otherwise, it's just
double result = do_it(some_number);
Use constexpr (needs at least C++14 to allow loops in constexpr functions):
#include <iostream>

constexpr double do_it(int m)
{
    double result = 0;
    for (int i = 0; i < m; i++)
        result += i;
    return result;
}

constexpr double it_result = do_it(10) + 1; // compile time `do_it`, possibly runtime `+ 1`

int main() {
    int x;
    std::cin >> x;
    do_it(x); // runtime
}
If you want to force a constexpr value to be inlined as part of a runtime expression, you can use the FORCE_CT_EVAL macro from this comment:
#include <utility>
#define FORCE_CT_EVAL(func) [](){constexpr auto ___expr = func; return std::move(___expr);}()
double it_result = FORCE_CT_EVAL(do_it(10)); // compile time
Suppose I have the following C++ function:
// Returns a set containing {1!, 2!, ..., n!}.
set<int> GetFactorials(int n) {
    set<int> ret;
    int curr = 1;
    for (int i = 1; i <= n; i++) {
        curr *= i;
        ret.insert(curr);
    }
    return ret;
}
set<int> fs = GetFactorials(5);
(This is just a dummy example. The key is that the function creates the set itself and returns it.)
One of my friends tells me that instead of writing the function the way I did, I should write it so that the function takes in a pointer to a set, in order to avoid copying the set on return. I'm guessing he meant something like:
void GetFactorials2(int n, set<int>* fs) {
    int curr = 1;
    for (int i = 1; i <= n; i++) {
        curr *= i;
        fs->insert(curr);
    }
}
set<int> fs;
GetFactorials2(5, &fs);
My question: is this second way really a big advantage? It seems pretty weird to me. I'm new to C++, and don't know much about compilers, but I would assume that through some compiler magic, my original function wouldn't be that much more expensive. (And I'd get to avoid having to initialize the set myself.) Am I wrong? What should I know about pointers and copying-on-return to understand this?
No, it is generally not advantageous at all. Just about any reasonable compiler these days will utilize named return value optimization (see here). This effectively removes any performance penalty from the former example.
If you really want to get into the nitty gritty, read this article by Dave Abrahams (one of the big contributors to boost). Long story short, however, just return the value. It's probably faster.
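To see the elision in action, here is a small sketch with an instrumented type (Tracer is a hypothetical stand-in for the set). With NRVO applied, which mainstream compilers do here, only "construct" is printed, because ret is built directly in the caller's storage:

#include <iostream>

struct Tracer {
    Tracer() { std::cout << "construct\n"; }
    Tracer(const Tracer&) { std::cout << "copy\n"; }
};

Tracer make() {
    Tracer ret;          // named local returned by value: NRVO candidate
    return ret;
}

int main() {
    Tracer t = make();   // with NRVO, prints only "construct"
}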
Yes, it can be expensive, especially as the set gets bigger. There is no reason not to use a pointer or reference here; it will save you a lot, and you don't sacrifice much in terms of readability.
And why rely on compiler optimizations when you can optimize it yourself? The compiler knows your code but does not always understand your algorithm.
I would do this
void GetFactorials2(int n, set<int>& fs) {
    //                            ^^
    int curr = 1;
    for (int i = 1; i <= n; i++) {
        curr *= i;
        fs.insert(curr);
    }
}
and the call will stay normal.
set<int> fs;
GetFactorials2(5, fs);
//                ^^
Is it worth writing code like the following to copy array elements:
#include <iostream>
using namespace std;

template<int START, int N>
struct Repeat {
    static void copy(int* x, int* y) {
        x[START + N - 1] = y[START + N - 1];
        Repeat<START, N - 1>::copy(x, y);
    }
};

// base case: no elements left to copy
template<int START>
struct Repeat<START, 0> {
    static void copy(int*, int*) {}
};

int main() {
    int a[10];
    int b[10];

    // initialize
    for (int i = 0; i <= 9; i++) {
        b[i] = 113 + i;
        a[i] = 0;
    }

    // do the copy (starting at 2, 4 elements)
    Repeat<2, 4>::copy(a, b);

    // show
    for (int i = 0; i <= 9; i++) {
        cout << a[i] << endl;
    }
} // ()
or is it better to use an inlined function?
A first drawback is that you can't use runtime variables as the template arguments.
That's not better. First of all, it's not really compile time, since you are making function calls here. If you are lucky, the compiler will inline these and you will end up with a loop you could have written yourself with far less code (or just by using std::copy).
General rule: use templates for things known at compile time, use inlining for things known at run time. If you don't know the size of your array at compile time, then don't use templates for it.
You shouldn't do this. Templates were invented for a different purpose, not for calculations, even though you can use them that way. First, you can't use runtime variables as template arguments; second, templates will produce a vast number of unused instantiations at compilation; and third, just use for (int i = start; i <= end; i++) a[i] = b[i];
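For reference, a sketch of the std::copy alternative mentioned above, applied to the question's example (4 elements starting at index 2); optimizing compilers typically lower this to the same code as the unrolled template or the plain loop:

#include <algorithm>

void copy_with_std(int* a, int* b) {
    // equivalent to Repeat<2, 4>::copy(a, b): copies b[2..5] into a[2..5]
    std::copy(b + 2, b + 6, a + 2);
}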
That's better because you control and enforce the loop unrolling yourself.
Whether a loop gets unrolled by the compiler depends on the optimization options...
The claim that copying with std::copy is almost always best is not a good general answer, because loop unrolling can be applied whatever computation is done inside the loop...