C++ template meta-programming member function loop unrolling - c++

I have just started to use template meta-programming in my code. I have a class with a member that is a vector of multi-dimensional Cartesian points. Here is the basic setup of the class:
template<size_t N>
class TGrid{
public:
void round_points_3(){
for(std::size_t i = 0; i < Xp.size();i++){
Xp[i][0] = min[0] + (std::floor((Xp[i][0] - min[0]) * nbins[0] / (max[0] - min[0])) * bin_w[0]) + bin_w[0]/2.0;
Xp[i][1] = min[1] + (std::floor((Xp[i][1] - min[1]) * nbins[1] / (max[1] - min[1])) * bin_w[1]) + bin_w[1]/2.0;
Xp[i][2] = min[2] + (std::floor((Xp[i][2] - min[2]) * nbins[2] / (max[2] - min[2])) * bin_w[2]) + bin_w[2]/2.0;
}
}
void round_points_2(){
for(std::size_t i = 0; i < Xp.size();i++){
Xp[i][0] = min[0] + (std::floor((Xp[i][0] - min[0]) * nbins[0] / (max[0] - min[0])) * bin_w[0]) + bin_w[0]/2.0;
Xp[i][1] = min[1] + (std::floor((Xp[i][1] - min[1]) * nbins[1] / (max[1] - min[1])) * bin_w[1]) + bin_w[1]/2.0;
}
}
void round_points_1(){
for(std::size_t i = 0; i < Xp.size();i++){
Xp[i][0] = min[0] + (std::floor((Xp[i][0] - min[0]) * nbins[0] / (max[0] - min[0])) * bin_w[0]) + bin_w[0]/2.0;
}
}
public:
std::vector<std::array<double,N> > Xp;
std::vector<double> min, max, nbins, bin_w;
};
This class represents a multidimensional grid whose dimension is given by the template parameter N. Many of its operations can be made more efficient with member functions tailored to a specific dimension, for example through loop unrolling.
In the class TGrid, I have three functions specific to the dimensions D=1, D=2 and D=3, indicated by the suffixes _1, _2 and _3.
I am looking for a template meta-programming approach to write
these three functions more compactly.
I have seen examples of loop unrolling, but none of them deal with member functions of a template class.

Putting to one side the question of whether this is an appropriate optimisation, or whether other optimisations should be considered first, this is how I would do it. (But I do agree that it is sometimes demonstrably better to unroll loops explicitly; the compiler isn't always the best judge.)
One can't partially specialize a member function, and one can't specialize a nested struct without specializing the outer struct, so the only solution is to use a separate templated struct for the unrolling mechanism. Feel free to put this in some other namespace :)
The unrolling implementation:
template <int N>
struct sequence {
template <typename F,typename... Args>
static void run(F&& f,Args&&... args) {
sequence<N-1>::run(std::forward<F>(f),std::forward<Args>(args)...);
f(args...,N-1);
}
};
template <>
struct sequence<0> {
template <typename F,typename... Args>
static void run(F&& f,Args&&... args) {}
};
This takes an arbitrary functional object and a list of arguments, and then calls the object with the arguments and an additional final argument N times, where the final argument ranges from 0 to N-1. The universal references and variadic templates are not necessary; the same idea can be employed in C++98 with less generality.
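To make the call order concrete, here is a minimal, self-contained sketch that repeats the helper and records the trailing index passed on each call. The function collect_indices is a hypothetical name introduced only for this illustration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Same recursive helper as in the answer: calls f(args..., 0), ..., f(args..., N-1).
template <int N>
struct sequence {
    template <typename F, typename... Args>
    static void run(F&& f, Args&&... args) {
        sequence<N - 1>::run(std::forward<F>(f), std::forward<Args>(args)...);
        f(args..., N - 1);
    }
};

template <>
struct sequence<0> {
    template <typename F, typename... Args>
    static void run(F&&, Args&&...) {}
};

// Hypothetical helper: records the trailing index of each call.
inline std::vector<int> collect_indices() {
    std::vector<int> seen;
    sequence<3>::run([](std::vector<int>& out, int j) { out.push_back(j); }, seen);
    assert(seen.size() == 3);  // one call per index
    return seen;
}
```

The recursion bottoms out first, so the calls happen in ascending order of the trailing index.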
round_points<K> then calls sequence<K>::run with a helper static member function:
template <size_t N>
class TGrid {
public:
template <size_t K>
void round_points(){
for (std::size_t i = 0; i < Xp.size();i++) {
sequence<K>::run(TGrid<N>::round_item,*this,i);
}
}
static void round_item(TGrid &G,int i,int j) {
G.Xp[i][j] = G.min[j] + (std::floor((G.Xp[i][j] - G.min[j]) * G.nbins[j] / (G.max[j] - G.min[j])) * G.bin_w[j]) + G.bin_w[j]/2.0;
}
// ...
};
Edit: Addendum
Doing the equivalent with a pointer-to-member function appears to be hard for compilers to inline. As an alternative, to avoid the use of a static round_item, you can use a lambda, e.g.:
template <size_t N>
class TGrid {
public:
template <size_t K>
void round_points(){
for (std::size_t i = 0; i < Xp.size();i++) {
sequence<K>::run([&](int j) {round_item(i,j);});
}
}
void round_item(int i,int j) {
Xp[i][j] = min[j] + (std::floor((Xp[i][j] - min[j]) * nbins[j] / (max[j] - min[j])) * bin_w[j]) + bin_w[j]/2.0;
}
// ...
};
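If C++17 is available, a different technique from the recursive sequence struct above may be worth considering: std::index_sequence combined with a fold expression. The following is only a sketch under assumptions: Grid is a hypothetical stand-in for TGrid, and min, max, nbins and bin_w are stored as std::array rather than std::vector to keep the example short.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

template <std::size_t N>
class Grid {  // hypothetical stand-in for TGrid
public:
    void round_points() {
        for (std::size_t i = 0; i < Xp.size(); ++i)
            round_point(i, std::make_index_sequence<N>{});
    }

    std::vector<std::array<double, N>> Xp;
    std::array<double, N> min{}, max{}, nbins{}, bin_w{};  // assumption: arrays, not vectors

private:
    // The fold over the comma operator expands to
    // round_item(i, 0), round_item(i, 1), ..., round_item(i, N-1).
    template <std::size_t... J>
    void round_point(std::size_t i, std::index_sequence<J...>) {
        (round_item(i, J), ...);
    }

    void round_item(std::size_t i, std::size_t j) {
        assert(max[j] > min[j]);  // precondition: non-empty range in each dimension
        Xp[i][j] = min[j]
                 + std::floor((Xp[i][j] - min[j]) * nbins[j] / (max[j] - min[j])) * bin_w[j]
                 + bin_w[j] / 2.0;
    }
};

// Hypothetical usage: snap one 2-D point to the centre of its bin.
inline std::array<double, 2> demo_round() {
    Grid<2> g;
    g.min = {0.0, 0.0};
    g.max = {1.0, 2.0};
    g.nbins = {4.0, 4.0};
    g.bin_w = {0.25, 0.5};
    g.Xp.push_back(std::array<double, 2>{0.3, 1.6});
    g.round_points();
    return g.Xp[0];
}
```

Here the dimension comes from N itself, so no separate K parameter is needed. Whether this is faster than a plain inner loop is something to measure rather than assume.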

Related

How might I build a templated functions that compiles differently if a template int is odd or even?

class Whatever {
public:
// doThing overloads:
template <typename T>
inline static T doThing(T t, float n) {
/* It's a SmoothStartN function in my code,
but don't worry about the specifics.
Includes a for loop up to n times
(result gets interpolated between non-integer ns). */
return whatever;
}
template <unsigned int n, typename T>
inline static T doThing(T t) {
/* Same as the other one, except now the compiler can
unroll the for loop if appropriate.
Or so I assume, anyway; I might be wrong. */
return whatever;
}
// doMoreComplexThing overloads:
template <unsigned int n, typename T>
inline static T doMoreComplexThing(T t1, T t2) {
float halfN = ((float)n) * 0.5f;
return (doThing(t1, halfN) * doThing(t2, halfN));
}
};
My problem: doMoreComplexThing() currently has to use the presumably-less-well-optimised version of doThing() in all cases. However, in half of all cases, where n is even, it can be evenly divided into integers and thus the more efficient template-uint version is viable.
How could I set this up so that, at compile time, doMoreComplexThing() detects whether n is even and uses the appropriate overload? Is such a thing possible? For that matter, is it likely any more performant to bother with this, or should I just stick with the float overload?
Answer: Thanks to Quentin's suggestion, I believe a good solution looks something like this:
template <unsigned int n, typename T>
inline static T doMoreComplexThing(T t1, T t2) {
if constexpr((n % 2u) == 0u) {
constexpr unsigned int halfN = n / 2u; // must be constexpr to be usable as a template argument
return (doThing<halfN>(t1) * doThing<halfN>(t2));
}
else {
float halfN = ((float)n) * 0.5f;
return (doThing(t1, halfN) * doThing(t2, halfN));
}
}
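A compilable sketch of this idea follows. It needs C++17 for if constexpr, and the value passed as a template argument has to be a constant expression, which is why halfN is declared constexpr in the even branch. The bodies of doThing are hypothetical stand-ins (plain integer powers), not the SmoothStartN code from the question.

```cpp
struct Whatever {
    // Hypothetical stand-in: t multiplied by itself floor(n) times.
    // The real code would interpolate for non-integer n.
    template <typename T>
    static constexpr T doThing(T t, float n) {
        T result = T(1);
        for (int i = 0; i < static_cast<int>(n); ++i) result *= t;
        return result;
    }

    // Hypothetical stand-in: the loop bound is a compile-time constant.
    template <unsigned int n, typename T>
    static constexpr T doThing(T t) {
        T result = T(1);
        for (unsigned int i = 0; i < n; ++i) result *= t;
        return result;
    }

    template <unsigned int n, typename T>
    static constexpr T doMoreComplexThing(T t1, T t2) {
        if constexpr (n % 2u == 0u) {
            // Even n: the half is an integer known at compile time.
            constexpr unsigned int halfN = n / 2u;
            return doThing<halfN>(t1) * doThing<halfN>(t2);
        } else {
            // Odd n: fall back to the float overload.
            const float halfN = static_cast<float>(n) * 0.5f;
            return doThing(t1, halfN) * doThing(t2, halfN);
        }
    }
};
```

Only the branch that matches n is instantiated, so the choice of overload costs nothing at run time.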

C++ function design

I have come across the following question on multiple occasions this year and still don't have a clear answer to it. Say I want to design a function that accumulates or, more generally, builds something. I need to declare the initial accumulator value (or, in general, an empty object). Should I initialize this value or object as a function parameter with a default value, or inside the function body?
An example is the following function, which splits a sequential container into N equal-sized pieces (precondition: the container can be split evenly).
Is it okay to write it in the following form
template <typename T, std::size_t N>
array<T, N> equal_split(const T& x, array<T, N> result = {}) {
for (int i = 0; i < N; ++i)
std::copy(begin(x) + i * size(x) / N, begin(x) + (i + 1) * size(x) / N, std::back_inserter(result[i]));
return result;
}
or is it better to write it as
template <typename T, std::size_t N>
array<T, N> equal_split(const T& x) {
array<T, N> result = {};
for (int i = 0; i < N; ++i)
std::copy(begin(x) + i * size(x) / N, begin(x) + (i + 1) * size(x) / N, std::back_inserter(result[i]));
return result;
}
"I would need to declare the initial accumulator value"
If it is just an implementation detail, hide it from the interface.
If different initial values make sense, you might add it to the interface.
In your example, signature would be:
template <std::size_t N, typename Container>
array<Container, N> equal_split(const Container&);
Rename T to the more meaningful Container.
Put size_t N first, so that the deducible Container does not have to be provided explicitly.
No default parameter, as the initial value was just an implementation detail.
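One possible body for that signature is sketched below. It assumes that Container has size(), begin() and an iterator-pair constructor (as the standard sequence containers do), and it takes the piece size to be size() / N.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

template <std::size_t N, typename Container>
std::array<Container, N> equal_split(const Container& x) {
    static_assert(N > 0, "cannot split into zero pieces");
    assert(x.size() % N == 0);  // precondition: the container splits evenly

    std::array<Container, N> result{};  // the accumulator stays an implementation detail
    const auto piece = static_cast<std::ptrdiff_t>(x.size() / N);
    auto first = x.begin();
    for (std::size_t i = 0; i < N; ++i) {
        auto last = std::next(first, piece);
        result[i] = Container(first, last);  // copy one piece
        first = last;
    }
    return result;
}
```

With N as the first template parameter, a call only has to name the number of pieces, for example equal_split<3>(v) for a std::vector<int> v.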

Simplify variadic template: Remove some specializations

I found a template to calculate the binomial coefficient, which I happily used for function generation. The advantage is that I use this template for compile time Bernstein polynomial generation instead of using the derived polynomials (just 5 very simple ones).
I initially thought the code would become easier by doing so, because the generation of the five random functions is now obvious. Unfortunately, the code below is hard to read for someone not used to templates. Is there a way to get rid of at least some of the template specializations?
// Template to compute the binomial coefficient
template<uint8_t n, uint8_t k>
struct binomial {
static constexpr int value = (binomial<n - 1, k - 1>::value + binomial<n - 1, k>::value);
};
template<>
struct binomial<0, 0> {
static constexpr int value = 1;
};
template<uint8_t n>
struct binomial<n, 0> {
static constexpr int value = 1;
};
template<uint8_t n>
struct binomial<n, n> {
static constexpr int value = 1;
};
You could use constexpr functions instead. Here is a C++11-friendly version:
constexpr int factorial(int n)
{
return (n <= 1) ? 1 : n * factorial(n - 1);
}
constexpr int bin_coeff(int n, int k)
{
return factorial(n) / factorial(n - k) / factorial(k);
}
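The factorial form overflows int quickly (13! no longer fits in 32 bits). As a hedged alternative, still a single-return C++11 constexpr function, the identity C(n, k) = C(n-1, k-1) * n / k keeps intermediate values much smaller. The name choose is hypothetical and introduced only for this sketch.

```cpp
// Binomial coefficient via C(n, k) = C(n-1, k-1) * n / k.
// The division is exact because C(n-1, k-1) * n == k * C(n, k).
constexpr long long choose(int n, int k) {
    return (k < 0 || k > n) ? 0
         : (k == 0 || k == n) ? 1
         : choose(n - 1, k - 1) * n / k;
}

static_assert(choose(4, 2) == 6, "C(4,2)");
static_assert(choose(5, 2) == 10, "C(5,2)");
static_assert(choose(10, 3) == 120, "C(10,3)");
```

As with the template version, the results are constant expressions, so they can be used wherever binomial<n, k>::value was used.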

Is there a better way to fill array with precalculated values by templates (for using in runtime)?

So, assume I have a template structure-function fib<i>::value. I want to get the nth Fibonacci number at runtime. For this I create an array fibs[] = { fib<0>::value, ... , fib<maxN>::value }. Unfortunately, for some functions maxN can be very large, and I can't fill the array by hand. So I wrote some preprocessor directives to make the task easier.
#define fib(x) (fib<(x)>::value)
#define fibLine_level_0(x) fib(5*(x) + 0), fib(5*(x) + 1), fib(5*(x) + 2), fib(5*(x) + 3), fib(5*(x) + 4)
#define fibLine_level_1(x) fibLine_level_0(2*(x) + 0), fibLine_level_0(2*(x) + 1)
#define fibLine_level_2(x) fibLine_level_1(2*(x) + 0), fibLine_level_1(2*(x) + 1)
#define fibLine_level_3(x) fibLine_level_2(2*(x) + 0), fibLine_level_2(2*(x) + 1)
#define cAarrSize(x) (sizeof(x) / sizeof(x[0]))
And I use it like this:
int fibs[] = { fibLine_level_3(0) };
for (int i = 0; i < cAarrSize(fibs); i++)
cout << "fib(" << i << ") = " << fibs[i] << endl;
The code that you may need:
template<int i>
struct fibPair{
static const int fst = fibPair<i-1>::snd;
static const int snd = fibPair<i-1>::fst + fibPair<i-1>::snd;
};
template<>
struct fibPair<0> {
static const int fst = 0;
static const int snd = 1;
};
template<int i>
struct fib {
static const int value = fibPair<i>::fst;
};
But this code is really ugly. How can I make it more elegant?
Constraints: this code must be usable in competitive programming. That means no third-party libraries and sometimes no C++11 (though it may be available).
The fib structure can be rewritten as follows:
template <size_t i>
struct fib
{
static const size_t value = fib<i - 1>::value + fib<i - 2>::value;
};
template <>
struct fib<0>
{
static const size_t value = 0;
};
template <>
struct fib<1>
{
static const size_t value = 1;
};
Compile-time array of the Fibonacci numbers can be calculated using C++11.
Edit 1 (changed the type of fib values).
Edit 2:
Compile-time generation of Fibonacci numbers array (based on this answer).
template<unsigned... args> struct ArrayHolder
{
static const unsigned data[sizeof...(args)];
};
template<unsigned... args>
const unsigned ArrayHolder<args...>::data[sizeof...(args)] = { args... };
template<size_t N, template<size_t> class F, unsigned... args>
struct generate_array_impl
{
typedef typename generate_array_impl<N-1, F, F<N>::value, args...>::result result;
};
template<template<size_t> class F, unsigned... args>
struct generate_array_impl<0, F, args...>
{
typedef ArrayHolder<F<0>::value, args...> result;
};
template<size_t N, template<size_t> class F>
struct generate_array
{
typedef typename generate_array_impl<N-1, F>::result result;
};
int main()
{
const size_t count = 10;
typedef generate_array<count, fib>::result fibs;
for(size_t i = 0; i < count; ++i)
std::cout << fibs::data[i] << std::endl;
}
All you need is to provide generate_array with the generation «function» (our fib struct).
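Where C++14 is available, the same table can be built with less machinery using std::index_sequence. This is a different technique from the recursive generate_array_impl above, shown only as a sketch. The names make_table and make_table_impl are hypothetical, and fib is repeated with unsigned long long values so that the block is self-contained.

```cpp
#include <array>
#include <cstddef>
#include <utility>

template <std::size_t I>
struct fib {
    static constexpr unsigned long long value = fib<I - 1>::value + fib<I - 2>::value;
};
template <> struct fib<0> { static constexpr unsigned long long value = 0; };
template <> struct fib<1> { static constexpr unsigned long long value = 1; };

// Expands F<0>::value, F<1>::value, ..., one element per index in the pack.
template <template <std::size_t> class F, std::size_t... I>
constexpr std::array<unsigned long long, sizeof...(I)>
make_table_impl(std::index_sequence<I...>) {
    return {{F<I>::value...}};
}

// Builds a table with Count entries from the generation "function" F.
template <std::size_t Count, template <std::size_t> class F>
constexpr std::array<unsigned long long, Count> make_table() {
    return make_table_impl<F>(std::make_index_sequence<Count>{});
}

constexpr auto fibs = make_table<10, fib>();
static_assert(fibs[9] == 34, "fib(9)");
```

The table is an ordinary std::array, so fibs.size() and range-based for loops work at run time.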
Thanks to @nameless for the link to the question where I found the answer by @MichaelAnderson for plain C++ (without new features). I used it and expanded it for my own needs.
The concept is simple but a bit strange: we must produce a recursive templated structure whose first field is the same templated structure with a different argument.
template<size_t N>
struct FibList {
FibList<N-1> previous;
size_t value;
FibList() : value(fib<N>::value) {}
};
Let's try to expand it a bit (just to see what the compiler will produce):
template<size_t N>
struct FibList {
FibList<N-3> previous;
size_t value_N_minus_2;
size_t value_N_minus_1;
size_t value_N;
};
So we can treat FibList as an array and just cast it (this is the weak point of my solution; I can't prove that it is valid):
static const size_t maxN = 2000;
FibList<maxN> fibList;
size_t *fibArray = &fibList.value - maxN;
Or in another way:
size_t *fibArray = reinterpret_cast<size_t*>(&fibList);
Important: the size of the array is maxN+1, but the standard way of getting an array size (sizeof(array) / sizeof(array[0])) will fail. Be careful with that.
Now we must stop the recursion:
// start point
template<>
struct FibList<0> {
size_t value;
FibList() : value(0) {}
};
// start point
template<>
struct FibList<1> {
FibList<0> previous;
size_t value;
FibList() : value(1) {}
};
Note that swapping the order of FibList<1> and FibList<0> will cause a stack overflow in the compiler.
We must also solve another problem: template recursion has a limited depth (which depends on the compiler and/or its options). Fortunately, the compiler limits only the recursion depth, not the number of instantiations (well, there is a memory limit, but it is much larger than the depth limit). So there is an obvious, if ugly, solution: instantiate fib<N> in a series of steps no larger than the depth limit, so that no single instantiation of fib<N> exceeds it. But we can't just write fib<500>::value outside of runtime code. So the solution is a macro that specializes FibList<N> using fib<N>::value:
#define SetOptimizePointForFib(N) template<>\
struct FibList<N> {\
FibList<(N)-1> previous;\
size_t value;\
FibList() : value(fib<N>::value) {}\
};
And then we write something like this:
SetOptimizePointForFib(500);
SetOptimizePointForFib(1000);
SetOptimizePointForFib(1500);
SetOptimizePointForFib(2300);
So we get genuine compile-time precalculation and can fill static arrays of enormous length.
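When C++14 is allowed, a constexpr function with an ordinary loop avoids both the template depth limit and the cast from a struct to an array. This is a different technique from the FibList approach, given only as a sketch. FibTable and make_fib_table are hypothetical names, N is assumed to be at least 1, and entries beyond fib(93) wrap around because unsigned arithmetic is modular.

```cpp
#include <cstddef>

// Plain aggregate holding the table; a real array, so sizeof works as usual.
template <std::size_t N>
struct FibTable {
    unsigned long long v[N];
};

// Fills the table with a loop evaluated at compile time (C++14 constexpr).
template <std::size_t N>
constexpr FibTable<N> make_fib_table() {
    FibTable<N> t{};
    for (std::size_t i = 0; i < N; ++i)
        t.v[i] = (i < 2) ? i : t.v[i - 1] + t.v[i - 2];
    return t;
}

// 2001 entries, mirroring maxN = 2000 from the answer above.
constexpr auto fib_table = make_fib_table<2001>();
static_assert(fib_table.v[10] == 55, "fib(10)");
```

Here sizeof(fib_table.v) / sizeof(fib_table.v[0]) gives the number of entries, since v is a genuine array.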

C++ enum template partial specialization

I have a matrix class closely tailored to the algorithm I need to implement. I know about Eigen, but it doesn't fit the bill, so I had to write my own. I have been working with column-major ordering all along and now have a strong use case for row-major as well. I would therefore like to specialize my matrix class template with an extra template parameter that defines the ordering, without breaking the existing code.
The concrete effect would be to use partial template specialization to generate two or three key class methods differently, e.g. operator()(int i, int j), which define the different orderings. Something similar can be done with a pre-processor #define, but that is not very elegant and only works when everything is compiled in one mode or the other. This is a sketch of what I'm trying to accomplish:
enum tmatrix_order {
COLUMN_MAJOR, ROW_MAJOR
};
/**
* Concrete definition of matrix in major column ordering that delegates most
* operations to LAPACK and BLAS functions. This implementation provides support
* for QR decomposition and fast updates. The following sequence of QR updates
* are supported:
*
* 1) [addcol -> addcol] integrates nicely with the LAPACK compact form.
* 2) [addcol -> delcol] delcols holds additional Q, and most to date R
* 3) [delcol -> delcol] delcols holds additional Q, and most to date R
* 4) [delcol -> addcol] delcols Q must also be applied to the new column
* 5) [addcol -> addrow] addrows holds additional Q, R is updated in original QR
* 6) [delcol -> addrow] addrows holds additional Q, R is updated in original QR
*/
template<typename T, tmatrix_order O = COLUMN_MAJOR>
class tmatrix {
private:
// basic matrix structure
T* __restrict m_data;
int m_rows, m_cols;
// ...
};
template <typename T>
inline T& tmatrix<T, COLUMN_MAJOR>::operator()(int i, int j) {
return m_data[j*m_rows + i];
}
template <typename T>
inline const T& tmatrix<T, COLUMN_MAJOR>::operator()(int i, int j) const {
return m_data[j*m_rows + i];
}
template <typename T>
inline T& tmatrix<T, ROW_MAJOR>::operator()(int i, int j) {
return m_data[i*m_cols + j];
}
template <typename T>
inline const T& tmatrix<T, ROW_MAJOR>::operator()(int i, int j) const {
return m_data[i*m_cols + j];
}
but the compiler complains about the partial specialization:
/Users/bravegag/code/fastcode_project/code/src/matrix.h:227:59: error: invalid use of incomplete type 'class tmatrix<T, (tmatrix_order)0u>'
/Users/bravegag/code/fastcode_project/code/src/matrix.h:45:7: error: declaration of 'class tmatrix<T, (tmatrix_order)0u>'
However, if I fully specialize these functions as shown below, it works, but this is very inflexible:
inline double& tmatrix<double, COLUMN_MAJOR>::elem(int i, int j) {
return m_data[j*m_rows + i];
}
Is this a limitation of the language's support for partial template specialization, or am I using the wrong syntax?
A possible solution:
enum tmatrix_order {
COLUMN_MAJOR, ROW_MAJOR
};
template<typename T>
class tmatrix_base {
protected:
// basic matrix structure
T* __restrict m_data;
int m_rows, m_cols;
};
template<typename T, tmatrix_order O = COLUMN_MAJOR>
class tmatrix : public tmatrix_base<T>{
public:
tmatrix() {this->m_data = new T[5];}
T& operator()(int i, int j) {
return this->m_data[j*this->m_rows + i];
}
const T& operator()(int i, int j) const {
return this->m_data[j*this->m_rows + i];
}
};
template<typename T>
class tmatrix<T, ROW_MAJOR> : public tmatrix_base<T>{
public:
tmatrix() {this->m_data = new T[5];}
T& operator()(int i, int j) {
return this->m_data[i*this->m_cols + j];
}
const T& operator()(int i, int j) const {
return this->m_data[i*this->m_cols + j];
}
};
int main()
{
tmatrix<double, COLUMN_MAJOR> m1;
m1(0, 0);
tmatrix<double, ROW_MAJOR> m2;
m2(0, 0);
}
The idea is that you provide the full class definition, instead of providing only the functions in the specialized class. I think the basic problem is that the compiler does not know what that class's definition is otherwise.
Note: you need the this-> to be able to access the templated base members, otherwise lookup will fail.
Note: the constructor is just there so I could test the main function without blowing up. You will need your own (which I'm sure you already have)
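If duplicating the whole class per ordering is undesirable, one alternative is to specialize only a small helper that computes the linear index, and keep a single class template. This is a sketch under assumptions: linear_index is a hypothetical helper, and std::vector replaces the raw T* of the question so that the example manages its own memory.

```cpp
#include <cassert>
#include <vector>

enum tmatrix_order { COLUMN_MAJOR, ROW_MAJOR };

// Hypothetical helper: only the index computation is specialized per ordering.
template <tmatrix_order O> struct linear_index;

template <> struct linear_index<COLUMN_MAJOR> {
    static int at(int i, int j, int rows, int /*cols*/) { return j * rows + i; }
};
template <> struct linear_index<ROW_MAJOR> {
    static int at(int i, int j, int /*rows*/, int cols) { return i * cols + j; }
};

template <typename T, tmatrix_order O = COLUMN_MAJOR>
class tmatrix {
public:
    tmatrix(int rows, int cols) : m_data(rows * cols), m_rows(rows), m_cols(cols) {}

    T& operator()(int i, int j) {
        assert(i >= 0 && i < m_rows && j >= 0 && j < m_cols);
        return m_data[linear_index<O>::at(i, j, m_rows, m_cols)];
    }
    const T& operator()(int i, int j) const {
        assert(i >= 0 && i < m_rows && j >= 0 && j < m_cols);
        return m_data[linear_index<O>::at(i, j, m_rows, m_cols)];
    }
    const T* data() const { return m_data.data(); }

private:
    std::vector<T> m_data;  // assumption: vector storage instead of T* __restrict
    int m_rows, m_cols;
};
```

Existing code that writes tmatrix<double> keeps compiling because the ordering parameter defaults to COLUMN_MAJOR.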
I'd keep it simple, and write it like this:
template <typename T, tmatrix_order O>
inline T& tmatrix<T, O>::operator()(int i, int j) {
if (O == COLUMN_MAJOR) {
return m_data[j*m_rows + i];
} else {
return m_data[i*m_cols + j];
}
}
While not guaranteed by the language specification, I bet your compiler will optimize out that comparison with a compile-time constant.
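With C++17, the same single-function approach can use if constexpr, which guarantees that only the matching branch is compiled for a given ordering. The following is a sketch rather than the original class: offset is a hypothetical helper, and std::vector stands in for the raw pointer storage.

```cpp
#include <cassert>
#include <vector>

enum tmatrix_order { COLUMN_MAJOR, ROW_MAJOR };

template <typename T, tmatrix_order O = COLUMN_MAJOR>
class tmatrix {
public:
    tmatrix(int rows, int cols) : m_data(rows * cols), m_rows(rows), m_cols(cols) {}

    T& operator()(int i, int j) { return m_data[offset(i, j)]; }
    const T& operator()(int i, int j) const { return m_data[offset(i, j)]; }

    // Hypothetical helper: the branch is selected at compile time from O.
    int offset(int i, int j) const {
        assert(i >= 0 && i < m_rows && j >= 0 && j < m_cols);
        if constexpr (O == COLUMN_MAJOR) {
            return j * m_rows + i;
        } else {
            return i * m_cols + j;
        }
    }

private:
    std::vector<T> m_data;  // assumption: vector storage instead of T* __restrict
    int m_rows, m_cols;
};
```

Both accessors share one index computation, so the const and non-const versions cannot drift apart.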