how does c++ meta-programming unroll loops?


I am trying to optimize my math code base and found the piece of code below, which performs a 4x4 matrix multiplication. However, I don't understand how an enum can be used for calculation here. Cnt is a template parameter declared in
template <int I=0, int J=0, int K=0, int Cnt=0>
and somehow we can still do
Cnt = Cnt + 1
Could anyone give me a quick explanation of how this works?
Thanks
template <int I=0, int J=0, int K=0, int Cnt=0> class MatMult
{
private :
enum
{
Cnt = Cnt + 1,
Nextk = Cnt % 4,
Nextj = (Cnt / 4) % 4,
Nexti = (Cnt / 16) % 4,
go = Cnt < 64
};
public :
static inline void GetValue(D3DMATRIX& ret, const D3DMATRIX& a, const D3DMATRIX& b)
{
ret(I, J) += a(K, J) * b(I, K);
MatMult<Nexti, Nextj, Nextk, Cnt>::GetValue(ret, a, b);
}
};
// specialization to terminate the loop
template <> class MatMult<0, 0, 0, 64>
{
public :
static inline void GetValue(D3DMATRIX& ret, const D3DMATRIX& a, const D3DMATRIX& b) { }
};
Or, to ask more specifically: how do Nexti, Nextj, Nextk and Cnt get propagated to the next level when the loop is unrolled?
thanks
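To make the propagation concrete, here is a minimal sketch of the same recursion in a form that current compilers accept. It is not the original code: Mat4 is a hypothetical stand-in for D3DMATRIX, the enumerator is renamed to Next because standard C++ does not allow a member to redeclare the template parameter name Cnt, and the multiplication uses the usual ret(I, J) += a(I, K) * b(K, J) convention rather than the index order in the snippet above. Each instantiation computes the next indices as compile-time constants in its enum and names the next instantiation with them. The compiler generates 64 such functions, and the full specialization for Cnt == 64 ends the chain.

```cpp
// Hypothetical stand-in for D3DMATRIX: a zero-initialized 4x4 float matrix.
struct Mat4 {
    float m[4][4] = {};
    float& operator()(int r, int c) { return m[r][c]; }
    const float& operator()(int r, int c) const { return m[r][c]; }
};

template <int I = 0, int J = 0, int K = 0, int Cnt = 0>
struct MatMult {
    enum {
        Next  = Cnt + 1,          // compile-time constant: index of the next step
        NextK = Next % 4,         // innermost "loop variable"
        NextJ = (Next / 4) % 4,
        NextI = (Next / 16) % 4   // outermost "loop variable"
    };

    static void GetValue(Mat4& ret, const Mat4& a, const Mat4& b) {
        ret(I, J) += a(I, K) * b(K, J);
        // The enumerators become the template arguments of the next
        // instantiation; this is how the state moves one level down.
        MatMult<NextI, NextJ, NextK, Next>::GetValue(ret, a, b);
    }
};

// Full specialization: after 64 steps all three indices wrap to 0 and
// Cnt is 64, so this empty function terminates the recursion.
template <>
struct MatMult<0, 0, 0, 64> {
    static void GetValue(Mat4&, const Mat4&, const Mat4&) {}
};
```

Calling MatMult<>::GetValue(ret, a, b) with a zero-initialized ret starts the chain at MatMult<0, 0, 0, 0>. Whether this beats a plain triple loop depends on the compiler, since optimizers typically unroll small fixed-count loops on their own.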

Related

Infinite loop inside recursive template function

I am writing my own library for a university project, containing the template classes Vector and Matrix. In addition to these template classes, there are related template functions for vectors and matrices. The professor explicitly told us to define the matrix as a one-dimensional array in which the elements are stored by column (for reasons of efficiency and optimization). The Matrix template class has 3 template parameters: the element type, the number of rows, and the number of columns.
template <class T, unsigned int M, unsigned int N>
class Matrix
Having said that, I will get straight to the problem. I'm writing a function that calculates the determinant of any square matrix larger than 3 x 3, using the Laplace rule for columns (expanding along the first column).
I also wrote a function for 2 x 2 matrices (called D2MatrixDet) and a function for 3 x 3 matrices (called D3MatrixDet), both tested and working:
template <class T>
double D2MatrixDet(const Matrix<T, 2, 2>& _m)
template <class T>
double D3MatrixDet(const Matrix<T, 3, 3>& _m)
The template function that I have to write has two template parameters: data type of the input matrix, dimension of the matrix (since the determinant is calculated for square matrices, only one dimension is enough). It is a recursive function; the variable "result" is the one that keeps the determinant in memory at each step. Below, the code I wrote.
template <class T, unsigned int D>
void DNMatrixDet(Matrix<T, D, D> _m, double result) //LaPlace Rule respect to the first column
{
const unsigned int new_D = D - 1;
Matrix<T, new_D, new_D> temp;
if (D > 3)
{
for (unsigned int i = 0; i < _m.row; ++i)
//Indicate the element to multiply
{
for (unsigned int j = _m.row, l = 0; j < _m.row * _m.column && l < pow(new_D, 2); ++j)
//Manage the element to be inserted in temp
{
bool invalid_row = false;
for (unsigned int k = 1; k < _m.row && invalid_row == false; ++k) //Slide over row
{
if (j == (i + k * _m.row))
{
invalid_row = true;
}
}
if (invalid_row == false)
{
temp.components[l] = _m.components[j];
++l;
}
}
DNMatrixDet(temp, result);
result += pow((-1), i) * _m.components[i] * result;
}
}
else if (D == 3)
{
result += D3MatrixDet(_m);
}
}
In main, I test the function using a 5 x 5 matrix.
When I try to compile, several errors come out, all very similar and that have to do with the size of the matrix which is decreased by one at each step. This is when the initial matrix size is 5 (LA is the name of the library and Test.cpp is the file that contains the main):
LA.h: In instantiation of 'void LA::DNMatrixDet(LA::Matrix<T, M, M>, double) [with T = double;
unsigned int D = 5]':
Test.cpp:437:33: required from here
LA.h:668:34: error: no matching function for call to 'D3MatrixDet(LA::Matrix<double, 5, 5>&)'
result += D3MatrixDet(_m);
~~~~~~~~~~~^~~~
In file included from Test.cpp:1:
LA.h:619:12: note: candidate: 'template<class T> double LA::D3MatrixDet(const LA::Matrix<T, 3, 3>&)'
double D3MatrixDet(const Matrix<T, 3, 3>& _m)
^~~~~~~~~~~
LA.h:619:12: note: template argument deduction/substitution failed:
In file included from Test.cpp:1:
LA.h:668:34: note: template argument '5' does not match '3'
result += D3MatrixDet(_m);
~~~~~~~~~~~^~~~
This is when the size becomes 4:
LA.h: In instantiation of 'void LA::DNMatrixDet(LA::Matrix<T, M, M>, double) [with T = double;
unsigned int D = 4]':
LA.h:662:28: required from 'void LA::DNMatrixDet(LA::Matrix<T, M, M>, double) [with T = double;
unsigned int D = 5]'
Test.cpp:437:33: required from here
LA.h:668:34: error: no matching function for call to 'D3MatrixDet(LA::Matrix<double, 4, 4>&)'
In file included from Test.cpp:1:
LA.h:619:12: note: candidate: 'template<class T> double LA::D3MatrixDet(const LA::Matrix<T, 3, 3>&)'
double D3MatrixDet(const Matrix<T, 3, 3>& _m)
^~~~~~~~~~~
LA.h:619:12: note: template argument deduction/substitution failed:
In file included from Test.cpp:1:
LA.h:668:34: note: template argument '4' does not match '3'
result += D3MatrixDet(_m);
~~~~~~~~~~~^~~~
And so on. It keeps going down until starting over at 4294967295 (which I found to be the upper limit of a 32 bit "unsigned int") and continuing to go down until I reach the maximum number of template instances (= 900).
At each iteration, the compiler always checks the function for calculating the determinant of a 3 x 3 matrix, even though that function is only executed when the input matrix is 3 x 3. So why does it check something that in theory should never happen?
I double-checked the mathematical logic of what I wrote several times, even with the help of a matrix written on paper and slowly carrying out the first steps. I believe and hope it is right. I'm pretty sure the problem has to do with using templates and recursive function.
I apologize for the very long question, I tried to explain it in the best possible way. I hope I have well explained the problem.
EDIT:
Fixed the problem by using "if constexpr" for the first branch of the DNMatrixDet function. The compilation is now successful. I still need to fix the algorithm, but that is beyond the scope of this post. Below is the reprex with the changes made:
#include <iostream>
template <class T, unsigned int M, unsigned int N>
class Matrix
{
public:
T components[M * N];
unsigned int row = M;
unsigned int column = N;
Matrix()
{
for (unsigned int i = 0; i < M * N; ++i)
{
components[i] = 1;
}
}
Matrix(T* _c)
{
for (unsigned int i = 0; i < M * N; ++i, ++_c)
{
components[i] = *_c;
}
}
friend std::ostream& operator<<(std::ostream& output, const Matrix& _m)
{
output << _m.row << " x " << _m.column << " matrix:" << std::endl;
for (unsigned int i = 0; i < _m.row; ++i)
{
for (unsigned int j = 0; j < _m.column; ++j)
{
if (j == _m.column -1)
{
output << _m.components[i + j*_m.row];
}
else
{
output << _m.components[i + j*_m.row] << "\t";
}
}
output << std::endl;
}
return output;
}
};
template <class T>
double D3MatrixDet(const Matrix<T, 3, 3>& _m)
{
double result = _m.components[0] * _m.components[4] * _m.components[8] +
_m.components[3] * _m.components[7] * _m.components[2] +
_m.components[6] * _m.components[1] * _m.components[5] -
(_m.components[6] * _m.components[4] * _m.components[2] +
_m.components[3] * _m.components[1] * _m.components[8] +
_m.components[0] * _m.components[7] * _m.components[5]);
return result;
}
template <class T, unsigned int D>
void DNMatrixDet(Matrix<T, D, D> _m, double result)
{
Matrix<T, D - 1, D - 1> temp;
if constexpr (D > 3)
{
for (unsigned int i = 0; i < D; ++i)
{
for (unsigned int j = D, l = 0; j < D * D && l < (D - 1) * (D - 1); ++j)
{
bool invalid_row = false;
for (unsigned int k = 1; k < D && invalid_row == false; ++k)
{
if (j == (i + k * D))
{
invalid_row = true;
}
}
if (invalid_row == false)
{
temp.components[l] = _m.components[j];
++l;
}
}
DNMatrixDet(temp, result);
result += (i & 1 ? -1 : 1) * _m.components[i] * result; // parentheses needed: ?: binds more loosely than *
}
}
else if constexpr (D == 3)
{
result += D3MatrixDet(_m);
}
}
int main()
{
double m_start[25] = {4, 9, 3, 20, 7, 10, 9, 50, 81, 7, 20, 1, 36, 98, 4, 20, 1, 8, 5, 93, 47, 21, 49, 36, 92};
Matrix<double, 5, 5> m = Matrix<double, 5, 5> (m_start);
double m_det = 0;
DNMatrixDet(m, m_det);
std::cout << "m is " << m << std::endl;
std::cout << "Det of m is " << m_det << std::endl;
return 0;
}

When you pass _m with the type Matrix<T, 5, 5>, the trailing else branch still contains the code result += D3MatrixDet(_m);. With a plain if, the compiler has to compile both branches for every D, so it reports that no D3MatrixDet overload accepts a Matrix<T, 5, 5>. Instantiating DNMatrixDet for D also instantiates it for D - 1, and nothing stops that recursion at compile time, which is why the errors continue past 0 and wrap around to 4294967295.
Since we know at compile time which branch to take, we can tell the compiler by using if constexpr instead. Because we are inside a template, a discarded branch is not instantiated, so the compiler no longer checks it.
So let's change if (D > 3) to if constexpr (D > 3).
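As an illustration of how if constexpr stops the recursion, here is a small self-contained sketch of a Laplace expansion along the first column. It is not the asker's code: SquareMatrix and determinant are hypothetical names, the storage is column-major as in the question, and the function returns the determinant instead of accumulating into a by-value parameter. It assumes C++17.

```cpp
#include <array>
#include <cstddef>

// Hypothetical minimal square matrix with column-major storage.
template <class T, std::size_t N>
struct SquareMatrix {
    std::array<T, N * N> components{};
    T at(std::size_t r, std::size_t c) const { return components[r + c * N]; }
    T& at(std::size_t r, std::size_t c) { return components[r + c * N]; }
};

template <class T, std::size_t N>
double determinant(const SquareMatrix<T, N>& m) {
    if constexpr (N == 1) {
        return m.at(0, 0);
    } else if constexpr (N == 2) {
        return m.at(0, 0) * m.at(1, 1) - m.at(0, 1) * m.at(1, 0);
    } else {
        // Only instantiated for N >= 3, so determinant<T, N - 1> is never
        // requested below N == 2 and the template recursion terminates.
        double result = 0.0;
        for (std::size_t i = 0; i < N; ++i) {
            // Build the submatrix that drops row i and column 0.
            SquareMatrix<T, N - 1> sub;
            for (std::size_t c = 1; c < N; ++c) {
                std::size_t dst = 0;
                for (std::size_t r = 0; r < N; ++r) {
                    if (r == i) continue;
                    sub.at(dst++, c - 1) = m.at(r, c);
                }
            }
            const double sign = (i % 2 == 0) ? 1.0 : -1.0;
            result += sign * m.at(i, 0) * determinant(sub);
        }
        return result;
    }
}
```

The cost of this expansion grows factorially with N, so it is only reasonable for small matrices such as the 5 x 5 case in the question.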

Is there a way to improve the speed of these lines?

I am trying to optimise some code which runs unreasonably slowly for what is required. The top answer here describes the method I am trying (although I am not 100% sure I am implementing it correctly).
Only a few lines show up repeatedly at the top of the call stack when I pause the program at random; however, I do not know how I could improve the code's performance given these lines.
The essential function of the code is updating a lattice of points repeatedly using the values of the points surrounding a given point. The relevant code for the first line that comes up:
The class definition:
template<typename T> class lattice{
private:
const unsigned int N; //size
std::vector<std::vector<T>> lattice_points =
std::vector<std::vector<T>>(N,std::vector<T>(N)); //array of points
protected:
static double mod(double, double) ;
public:
lattice(unsigned int);
lattice(const lattice&);
lattice& operator=(const lattice&);
~lattice() {};
T point(int, int) const;
void set(int, int, T);
unsigned int size() const;
};
These lines show up quite often:
template <typename T>
T lattice<T>::point(int x, int y) const {
return (*this).lattice_points[x % N][y % N]; //mod for periodic boundaries
};
template <typename T>
void lattice<T>::set(int x, int y, T val) {
this->lattice_points[x % N][y % N] = val; //mod for periodic boundaries
};
They are used here:
angle_lattice update_lattice(const angle_lattice& lat, const parameters& par, double dt) {
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_real_distribution<> dis(-0.5,0.5);
double sqrtdt = sqrt(dt);
angle_lattice new_lat(lat.size());
int N = lat.size();
for(int i=0; i < N; i++) {
for(int j=0; j < N; j++) {
double val = lat.point(i,j)+
dt*(-par.Dx*( sin_points(lat, i, j, i+1, j) + sin_points(lat, i, j, i-1, j) )
-par.Dy*( sin_points(lat, i, j, i, j+1) + sin_points(lat, i, j, i, j-1) )
-par.Lx/2*( cos_points(lat, i, j, i+1, j) + cos_points(lat, i, j, i-1, j) -2)
-par.Ly/2*( cos_points(lat, i, j, i, j+1) + cos_points(lat, i, j, i, j-1) -2))
+sqrtdt*2*M_PI*par.Cl*dis(gen);
new_lat.set(i,j,val);
}
}
return new_lat;
};
double sin_points(const angle_lattice& lat, int i1, int j1, int i2, int j2) {
return sin(lat.point(i1, j1) - lat.point(i2, j2));
};
double cos_points(const angle_lattice& lat, int i1, int j1, int i2, int j2) {
return cos(lat.point(i1, j1) - lat.point(i2, j2));
};
Here angle_lattice is just a lattice where the template parameter is an angle. The set function is overloaded so that the angle is taken mod 2pi. The only other functions that appear in the call stack are cos_points and sin_points, as well as the random number generation, but I assume the latter cannot be helped.
Is there anything that can be done? Help would be appreciated.
Edit: I changed the code following some of the suggestions and now the cosine and sine calculation are the highest. I am not sure what
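One detail in point and set is worth checking before tuning anything: in x % N the int x is converted to unsigned because N is an unsigned int, so x = -1 (from i-1 at the boundary) becomes 4294967295 on a platform with 32-bit unsigned, and the remainder equals N - 1 only when N is a power of two. The sketch below shows one way to wrap negative indices explicitly while also keeping the points in a single contiguous vector. wrap and FlatLattice are hypothetical names, not part of the asker's code, and whether the contiguous layout is actually faster here would have to be measured.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: maps any int, including negatives, into [0, n).
inline std::size_t wrap(int i, int n) {
    const int r = i % n;  // negative when i is negative
    return static_cast<std::size_t>(r < 0 ? r + n : r);
}

// Hypothetical lattice with periodic boundaries and flat row-major storage.
template <typename T>
class FlatLattice {
public:
    explicit FlatLattice(int n)
        : n_(n), points_(static_cast<std::size_t>(n) * static_cast<std::size_t>(n)) {}

    T point(int x, int y) const { return points_[index(x, y)]; }
    void set(int x, int y, T val) { points_[index(x, y)] = val; }
    int size() const { return n_; }

private:
    std::size_t index(int x, int y) const {
        return wrap(x, n_) * static_cast<std::size_t>(n_) + wrap(y, n_);
    }

    int n_;
    std::vector<T> points_;
};
```

Separately from the storage, update_lattice constructs a std::random_device and a std::mt19937 on every call. Constructing the generator once and passing it in by reference is a common way to avoid that repeated setup.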

Can I generate a constant array of size n

I want to generate a constant array power[501] = {1, p % MODER, p*p % MODER, p*p*p % MODER, ..., p^500 % MODER}, where p is a constant.
I know I could generate p^n % MODER by using the following code:
template<int a, int n> struct pow
{
static const int value = a * pow<a, n-1>::value % MODER;
};
template<int a> struct pow<a, 0>
{
static const int value = 1;
};
And it does work!
My question is whether I can generate the array that I want.

You can use BOOST_PP_ENUM as:
#include <iostream>
#include <boost/preprocessor/repetition/enum.hpp>
#define MODER 10
template<int a, int n> struct pow
{
static const int value = a * pow<a, n-1>::value % MODER;
};
template<int a> struct pow<a, 0>
{
static const int value = 1;
};
#define ORDER(count, i, data) pow<data,i>::value
int main() {
const int p = 3;
int const a[] = { BOOST_PP_ENUM(10, ORDER, p) };
std::size_t const n = sizeof(a)/sizeof(int);
for(std::size_t i = 0 ; i != n ; ++i )
std::cout << a[i] << "\n";
return 0;
}
Output:
1
3
9
7
1
3
9
7
1
3
The line:
int const a[] = { BOOST_PP_ENUM(10, ORDER, p) };
expands to this:
int const a[] = { pow<p,0>::value, pow<p,1>::value, ...., pow<p,9>::value};

Unless n has an upper bound, I would presume that that is impossible. Check this question out. There are ways to make the preprocessor look like a Turing-complete machine but only if you accept the fact that your code size should increase in the order of n, which is not better than placing a precomputed array by hand.
Important Update: You should see this question too. It seems that not the preprocessor but the template engine is indeed Turing-complete (at least can do recursion). So, now I suspect that the answer is yes.
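With C++17, one way to get the whole table at compile time without the preprocessor is a constexpr function that fills a std::array. This is a sketch under assumptions: make_power_table is a hypothetical name, and p = 3 with modulus 10 are taken from the Boost example above rather than from the question, which leaves both unspecified.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: returns {1, p, p^2, ..., p^(N-1)}, each reduced
// modulo `mod`. Needs C++17, where writing through std::array's
// operator[] is allowed inside a constant expression.
template <std::size_t N>
constexpr std::array<int, N> make_power_table(int p, int mod) {
    std::array<int, N> table{};
    long long current = 1 % mod;  // p^0, reduced in case mod == 1
    for (std::size_t i = 0; i < N; ++i) {
        table[i] = static_cast<int>(current);
        current = current * p % mod;  // long long avoids int overflow
    }
    return table;
}

// Evaluated entirely by the compiler; 501 entries as in the question.
constexpr auto power = make_power_table<501>(3, 10);

static_assert(power[0] == 1 && power[1] == 3 && power[2] == 9 && power[3] == 7,
              "first entries match the output listed above");
```

Because power is a constexpr object, its entries can be used wherever a constant expression is required, for example as array bounds or template arguments.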

Windows Threads: Parallel Mergesort

I have what is hopefully a very easy question, but I can't find the answer online. I wrote a merge sort function (which I'm sure has inefficiencies), but I'm here to ask about the threads. I'm using Windows' CreateThread function to spawn threads that sort intervals of a given array. Once all the threads have finished, I will merge the segments together for the final result. I haven't implemented the final merge yet because I'm getting strange errors, which I'm sure come from a dumb mistake in the threads. I'll post my code; please look at paraMSort in particular. I'll post the whole MergeSort.h file so you can see the helper functions as well. Sometimes the code compiles and runs perfectly. Sometimes the console closes abruptly with no errors or exceptions. There shouldn't be mutex issues because I'm operating on different segments of the array (different memory locations altogether). Does anyone see something wrong? Thanks so much.
PS. Are threads created with Windows' CreateThread kernel-level threads? In other words, if I make 2 threads on a dual-core computer, may they run simultaneously on separate cores? I'm thinking yes, since on this computer I can do the same work in 1/2 the time with 2 threads (with another test example).
PPS. I also saw some parallelism answers solved using Boost.Thread. Should I just use boost threads instead of windows threads? I don't have experience with Boost.
#include "Windows.h"
#include <iostream>
using namespace std;
void insert_sort(int* A, int sA, int eA, int* B, int sB, int eB)
{
int value;
int iterator;
for(int i = sA + 1; i < eA; i++)
{
value = A[i]; // Grab the next value in the array
iterator = i - 1;
// Move this value left up the list until its in the right spot
while(iterator >= sA && value < A[iterator])
A[iterator + 1] = A[iterator--];
A[iterator + 1] = value; // Put value in its correct spot
}
for(int i = sA; i < eB; i++)
{
B[i] = A[i]; // Put results in B
}
}
void merge_func(int* a, int sa, int ea, int* b, int sb, int eb, int* c, int sc)
{
int i = sa, j = sb, k = sc;
while(i < ea && j < eb)
c[k++] = a[i] < b[j] ? a[i++] : b[j++];
while(i < ea)
c[k++] = a[i++];
while(j < eb)
c[k++] = b[j++];
}
void msort_big(int* a, int* b, int s, int e, bool inA)
{
if(e-s < 4)
{
insert_sort(a, s, e, b, s, e);
return; // We sorted (A,s,e) into (B,s,e).
}
int m = (s + e)/2;
msort_big(a, b, s, m, !inA);
msort_big(a, b, m, e, !inA);
// If we want to merge in A, do it. Otherwise, merge in B
inA ? merge_func(b, s, m, b, m, e, a, s) : merge_func(a, s, m, a, m, e, b, s);
}
void msort(int* toBeSorted, int s, int e)
// Sorts toBeSorted from [s, e+1), so just enter [s, e] and
// the call to msort_big adds one.
{
int* b = new int[e - s + 1];
msort_big(toBeSorted, b, s, e+1, true);
delete [] b;
}
template <class T>
struct SortData_Send
{
T* data;
int start;
int end;
};
DWORD WINAPI msort_para_callback(LPVOID lpParam)
{
SortData_Send<int> dat = *(SortData_Send<int>*)lpParam;
msort(dat.data, dat.start, dat.end);
cout << "done! " << endl;
}
int ceiling_func(double num)
{
int temp = (int)num;
if(num > (double)temp)
{
return temp + 1;
}
else
{
return temp;
}
}
void paraMSort(int* toBeSorted, int s, int e, int numThreads)
{
HANDLE threads[numThreads];
DWORD threadIDs[numThreads];
SortData_Send<int>* sent[numThreads];
for(int i = 0; i < numThreads; i++)
{
// So for each thread, make an interval and pass the pointer to the array of ints.
// So for numThreads = 3 and array size of 0 to 99 (100), we have 0-32, 33-65, 66-100.
// 100 because sort function takes [start, end).
sent[i] = new SortData_Send<int>;
sent[i]->data = toBeSorted;
sent[i]->start = s + ceiling_func(double(i)*(double)e/double(numThreads));
sent[i]->end = ceiling_func(double(i+1)*double(e)/double(numThreads)) + ((i == numThreads-1) ? 1 : -1);
threads[i] = CreateThread(NULL, 0, msort_para_callback, sent[i], 0, &threadIDs[i]);
}
WaitForMultipleObjects(numThreads, threads, true, INFINITE);
cout << "Done waiting!" <<endl;
}

Assuming 's' is your starting point and 'e' is your ending point, shouldn't your code be something like
sent[i]->start = s + ceiling_func(double(i)*(double)(e-s)/double(numThreads));
sent[i]->end = (i == numThreads-1) ? e : (s - 1 + ceiling_func(double(i+1)*(double)(e-s)/double(numThreads)));
This covers the case where your function void paraMSort(int* toBeSorted, int s, int e, int numThreads) is called with a value of 's' not equal to 0. Otherwise the threads could end up reading the wrong sections of memory.
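The interval arithmetic can also be done with integers only, which avoids the ceiling helper and the special case for the last thread. The sketch below is an illustration under assumptions, not the asker's code: split_range is a hypothetical helper, and it works on half-open ranges [lo, hi), so every chunk starts exactly where the previous one ends and no element is skipped or shared.

```cpp
#include <utility>
#include <vector>

// Hypothetical helper: splits the half-open range [s, e) into `parts`
// contiguous half-open chunks whose sizes differ by at most one.
std::vector<std::pair<int, int>> split_range(int s, int e, int parts) {
    std::vector<std::pair<int, int>> chunks;
    const long long len = static_cast<long long>(e) - s;
    for (int i = 0; i < parts; ++i) {
        // Integer division rounds down; chunk i ends where chunk i + 1 begins.
        const int lo = s + static_cast<int>(len * i / parts);
        const int hi = s + static_cast<int>(len * (i + 1) / parts);
        chunks.emplace_back(lo, hi);
    }
    return chunks;
}
```

Because msort in the question takes an inclusive end index, a caller would pass chunk.second - 1 as its last argument for each chunk.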

c++, how randomly with given probabilities choose numbers

I have N numbers n_1, n_2, ..., n_N and associated probabilities p_1, p_2, ..., p_N.
The function should return the number n_i with probability p_i, where i = 1, ..., N.
How do I model this in C++?
I know it is not a hard problem, but I am new to C++ and want to know which functions you would use.
Would you generate a uniform random number between 0 and 1 like this:
((double) rand() / (RAND_MAX+1))

This is very similar to the answer I gave for this question:
changing probability of getting a random number
You can do it like this:
double val = (double)rand() / RAND_MAX;
int random;
if (val < p_1)
random = n_1;
else if (val < p_1 + p_2)
random = n_2;
else if (val < p_1 + p_2 + p_3)
random = n_3;
else
random = n_4;
Of course, this approach only makes sense if p_1 + p_2 + p_3 + p_4 == 1.0.
This can easily be generalized to a variable number of outputs and probabilities with a couple of arrays and a simple loop.
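The generalization with arrays and a loop could look roughly like this. pick_weighted is a hypothetical name, and the uniform draw u is passed in as a parameter (an assumption made here so the selection logic can be checked with fixed inputs); values must not be empty and probs is assumed to sum to 1.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns values[i] when u falls into the i-th
// cumulative probability bucket. u is a uniform draw in [0, 1).
int pick_weighted(const std::vector<int>& values,
                  const std::vector<double>& probs,
                  double u) {
    double cumulative = 0.0;
    for (std::size_t i = 0; i + 1 < values.size(); ++i) {
        cumulative += probs[i];
        if (u < cumulative) {
            return values[i];
        }
    }
    // The last bucket also absorbs any floating-point rounding error.
    return values.back();
}
```

The draw itself could come from rand() / (RAND_MAX + 1.0) or from std::uniform_real_distribution. The standard library also provides std::discrete_distribution in <random>, which takes a list of weights and returns an index directly.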

If you know the probabilities at compile time, you can use this variadic template version I decided to create. In practice, though, I don't recommend using it because of how incomprehensible the source is :P.
Usage
NumChooser <
Entry<2, 10>, // Value of 2 and relative probability of 10
Entry<5, 50>,
Entry<6, 80>,
Entry<20, 01>
> chooser;
chooser.choose(); // Returns the number 2 on average 10/141 times, etc.
Efficiency
Generally, the template based implementation is very similar to a basic one. However, there are a few differences:
With -O2 optimizations or no optimizations, the template version can be ~1-5% slower
With -O3 optimizations, the template version was actually ~1% faster when generating numbers for 1 - 10,000 times consecutively.
Notes
This uses rand() for choosing numbers. If being statistically accurate is important to you or you would like to use C++11's <random>, you can use the slightly modified version below the first source.
Source
#include <cstdlib> // rand
#include <type_traits> // std::enable_if
#define onlyAtEnd(a) typename std::enable_if<sizeof...(a) == 0 > ::type
template<int a, int b>
class Entry
{
public:
static constexpr int VAL = a;
static constexpr int PROB = b;
};
template<typename... EntryTypes>
class NumChooser
{
private:
const int SUM;
static constexpr int NUM_VALS = sizeof...(EntryTypes);
public:
static constexpr int size()
{
return NUM_VALS;
}
template<typename T, typename... args>
constexpr int calcSum()
{
return T::PROB + calcSum < args...>();
}
template <typename... Ts, typename = onlyAtEnd(Ts) >
constexpr int calcSum()
{
return 0;
}
NumChooser() : SUM(calcSum < EntryTypes... >()) { }
template<typename T, typename... args>
constexpr int find(int left, int previous = 0)
{
return left < 0 ? previous : find < args... >(left - T::PROB, T::VAL);
}
template <typename... Ts, typename = onlyAtEnd(Ts) >
constexpr int find(int left, int previous)
{
return previous;
}
int choose() // not constexpr: rand() is a run-time call
{
return find < EntryTypes... >(rand() % SUM);
}
};
C++11 <random> version
#include <random>
#include <type_traits> // std::enable_if
#define onlyAtEnd(a) typename std::enable_if<sizeof...(a) == 0 > ::type
template<int a, int b>
class Entry
{
public:
static constexpr int VAL = a;
static constexpr int PROB = b;
};
template<typename... EntryTypes>
class NumChooser
{
private:
const int SUM;
static constexpr int NUM_VALS = sizeof...(EntryTypes);
std::mt19937 gen;
std::uniform_int_distribution<> dist;
public:
static constexpr int size()
{
return NUM_VALS;
}
template<typename T, typename... args>
constexpr int calcSum()
{
return T::PROB + calcSum < args...>();
}
template <typename... Ts, typename = onlyAtEnd(Ts) >
constexpr int calcSum()
{
return 0;
}
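NumChooser() : SUM(calcSum < EntryTypes... >()), gen(std::random_device{}()), dist(0, SUM - 1) { } // same range as rand() % SUM; (1, SUM) would shift one unit of weight from the first entry to the last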
template<typename T, typename... args>
constexpr int find(int left, int previous = 0)
{
return left < 0 ? previous : find < args... >(left - T::PROB, T::VAL);
}
template <typename... Ts, typename = onlyAtEnd(Ts) >
constexpr int find(int left, int previous)
{
return previous;
}
int choose()
{
return find < EntryTypes... >(dist(gen));
}
};
// Same usage as example above

Perhaps something like (untested code!)
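#include <stdlib.h> /* drand48 (POSIX) */
/* n is the size of tables, numtab[i] the number of index i,
probtab[i] its probability; the sum of all probtab should be 1.0 */
int random_inside(int n, int numtab[], double probtab[])
{
double r = drand48(); /* uniform in [0.0, 1.0) */
double p = 0.0;
for (int i=0; i<n; i++) {
p += probtab[i]; /* cumulative probability up to index i */
if (r < p) return numtab[i];
}
return numtab[n-1]; /* reached only if rounding leaves the sum slightly below 1.0 */
}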

Here you have a correct answer in my last comment:
how-to-select-a-value-from-a-list-with-non-uniform-probabilities
