Auto Parallelization with VS - c++

I am trying to understand how the auto-parallelization works to speed up the execution of a program I am writing. I have created a simpler example:
#include <iostream>
#include <vector>
#include <chrono>
using namespace std;
using namespace std::chrono;
class matrix
{
public:
matrix(int size, double value)
{
A.resize(size, vector<double>(size, value));
B.resize(size, vector<double>(size, value));
};
void prodScal(double valore)
{
for (int m = 0; m < A.size(); m++)
for (int n = 0; n < A.size(); n++)
{
B[m][n] = A[m][n] * valore;
};
};
double elemento(int riga, int column) { return B[riga][column]; }
protected:
vector<vector<double>> A, B;
};
void main()
{
matrix* M;
M = new matrix(1000, 174.9);
high_resolution_clock::time_point t1 = high_resolution_clock::now();
#pragma loop(hint_parallel(4))
for (int i = 0; i < 1000; i++)
M->prodScal(567.3);
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(t2 - t1).count();
cout << "execution time [ms]: " << duration << endl;
}
When I try to compile this code using cl main.cpp /O2 /Qpar /Qpar-report:2, I get the following message:
c:\users\utente\documents\visual studio 2017\projects\parallel\parallel\main.cpp(39) : info C5012: ciclo non parallelizzato a causa del motivo '500'
c:\users\utente\documents\visual studio 2017\projects\parallel\parallel\main.cpp(39) : info C5012: ciclo non parallelizzato a causa del motivo '500'
c:\users\utente\documents\visual studio 2017\projects\parallel\parallel\main.cpp(38) : info C5012: ciclo non parallelizzato a causa del motivo '1000'
Can you help me with the correct way to parallelize this loop?
Thanks.

Auto-parallelisation efforts ( or Beliefs? ) are a rather dual-edge sword :
A machine can "guess" an intent only to a certain degree ( and can give-up, whenever such intent was not clear to a pre-wired set of transformation strategies ), so rather do not expect any bright tricks on a large scale of different approaches possible. Marketing people will beat all their drums and blow all their whistles to sell auto-"thinking"-PRODUCTs, but the reality is different. Even the best of the bests admit, that best performance comes from instruction-level profiling and sometimes they even avoid superscalar pipelined processor-knit tricks, so as to gain the last few nanoseconds, lost in parallelised-code performance at the very last level of CPU uop instruction flow. So, better never expect such expertise to happen just by using a #pragma code-section in a belief, the "machine"-will-invent a smartest way ahead.
So, test it ( always and thoroughly ):
An attempt to "parallelise" an outermost for(){...} is not the best step to start with. Both performance-wise and resources-wise. Let's tackle the case from a different side, the calculation itself:
#include <iostream> // https://stackoverflow.com/questions/48033769/auto-parallelization-with-vs
#include <vector>
#include <chrono> // g++ FLAGS.ADD: -std=c++11
#include <omp.h> // g++ FLAGS.ADD: -fopenmp -lm
#define OMP_NUM_OF_THREADS 4
using namespace std;
using namespace std::chrono;
class matrix {
public:
matrix( int size, double value ) {
A.resize( size, vector<double>( size, value ) );
B.resize( size, vector<double>( size, value ) );
}
void prodScal( double aScalarVALORE ) {
// #pragma loop( hint_parallel(4) ) // matrix_(hint_parallel(4)).cpp:18:0: warning: ignoring #pragma loop [-Wunknown-pragmas]
#pragma omp parallel num_threads( OMP_NUM_OF_THREADS ) // _____ YET, AGNOSTIC TO ANY BETTER CACHE-LINE RE-USE POLICY
for ( unsigned int m = 0; m < A.size(); m++ )
for ( unsigned int n = 0; n < A.size(); n++ )
B[m][n] = A[m][n] * aScalarVALORE;
}
double elemento( int riga, int column ) { return B[riga][column]; }
protected:
vector<vector<double>> A, B;
};
int main() { // matrix_(hint_parallel(4)).cpp:31:11: error: ‘::main’ must return ‘int’
matrix* M;
M = new matrix( 1000, 174.9 );
high_resolution_clock::time_point t1 = high_resolution_clock::now();
// *******************
// DEFINITELY NOT HERE
// *******************
// #pragma loop(hint_parallel(4)) // JUST A TEST EXECUTION, NOT ANY PARALLELISATION BENEFIT FOR A PROCESS-PER-SE PERFORMANCE
for ( int i = 0; i < 1000; i++ )
M->prodScal( 567.3 );
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>( t2 - t1 ).count();
cout << "execution time [ms]: " << duration << endl;
/*
* execution time [ms]: 21601
------------------
(program exited with code: 0)
* */
return 0;
}
Once having a working code, the performance tweaking, to gain max, is the next hurdle.
A better stepping through a for(){...} can dramatically improve the sum of costs of all the MEM-fetches ( paying ~ +100 [ns] for each non-cached reference ) v/s CACHE-re-use ( paying just ~ +1.5 [ns] for any cache-re-use ).
It depends on the global size of the matrices, on the L3, L2 and L1 cache-sizes and on the cache-lines lengths / associativity, not to mention an additional performance skew(s) if the code is to be run on a virtual device.
The static sizing and an approximate NUMA-topology could be depicted using lstopo ( lscpu in an absence of the smart hwloc service ).
Here, you can read the cache capacities, that can hold the matrix cells for any potential speedup from a smart re-use ( obeying the cache-line striding of the for(){...} indexes ).
Best performance could be gained from tuning the for()-loop stepping, best near the CPU-hardware available ILP-level ( using another degree of parallelism possible from CPU Instruction-Level-Parallelism, permitted for co-executed micro-instruction chains ( ref. Intel CPU publications on these details ) and best tested on the target platform ( cross-compilation would fail to allow such optimisation without performance benchmarks on the target CPU-architecture, best on the target platform in-vivo ).
Details are way beyond the limited scope of this media format here, on StackOverflow, but definitely if interested in performance tuning, you will find both sources and your own experimentation hands-on experience will govern your further steps. To just somehow sense the powers, 've made a large-matrix linear algebra project to finish a few [TB] matrix processing from ~ 126 hours down to a few minutes ( not counting the loading phase, to get the matrix data into the RAM ), right by the very careful parallel code-design, so indeed worth doing the design "right".
For even higher performance, one will have to also avoid O/S from evicting the expensively pre-fetched data, so even more efforts are needed for ultimate performance, than just to rely on an automated "auto-parallelisation"-toy.
Epilogue:
If still in doubts, if that were indeed possible, why would HPC centers still take care for and nourish the HPC-experts for designing ultimately performant code, if the "auto-parallelisation"-toys would do it better or at least the same as these expert nerdy geeks?? They would not, if they indeed could.

Related

Why is thrust reduce_by_key almost 75x slower than for_each with atomicAdd()?

I was not satisfied with the performance of the below thrust::reduce_by_key, so I rewrote it in a variety of ways with little gained benefit (including removing the permutation iterator). However, it wasn't until after replacing it with a thrust::for_each() (see below) that capitalizes on atomicAdd(), that I gained almost a 75x speedup! The two versions produce the exact same results. What could be the biggest cause for the dramatic performance differences?
Complete code for comparison between the two approaches:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <ctime>
#include <iostream>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/sort.h>
constexpr int NumberOfOscillators = 100;
int SeedRange = 500;
struct GetProduct
{
template<typename Tuple>
__host__ __device__
int operator()(const Tuple & t)
{
return thrust::get<0>(t) * thrust::get<1>(t);
}
};
int main()
{
using namespace std;
using namespace thrust::placeholders;
/* BEGIN INITIALIZATION */
thrust::device_vector<int> dv_OscillatorsVelocity(NumberOfOscillators);
thrust::device_vector<int> dv_outputCompare(NumberOfOscillators);
thrust::device_vector<int> dv_Connections_Strength((NumberOfOscillators - 1) * NumberOfOscillators);
thrust::device_vector<int> dv_Connections_Active((NumberOfOscillators - 1) * NumberOfOscillators);
thrust::device_vector<int> dv_Connections_TerminalOscillatorID_Map(0);
thrust::device_vector<int> dv_Permutation_Connections_To_TerminalOscillators((NumberOfOscillators - 1) * NumberOfOscillators);
thrust::device_vector<int> dv_Connection_Keys((NumberOfOscillators - 1) * NumberOfOscillators);
srand((unsigned int)time(NULL));
thrust::fill(dv_OscillatorsVelocity.begin(), dv_OscillatorsVelocity.end(), 0);
for (int c = 0; c < NumberOfOscillators * (NumberOfOscillators - 1); c++)
{
dv_Connections_Strength[c] = (rand() % SeedRange) - (SeedRange / 2);
dv_Connections_Active[c] = 0;
}
int curOscillatorIndx = -1;
for (int c = 0; c < NumberOfOscillators * NumberOfOscillators; c++)
{
if (c % NumberOfOscillators == 0)
{
curOscillatorIndx++;
}
if (c % NumberOfOscillators != curOscillatorIndx)
{
dv_Connections_TerminalOscillatorID_Map.push_back(c % NumberOfOscillators);
}
}
for (int n = 0; n < NumberOfOscillators; n++)
{
for (int p = 0; p < NumberOfOscillators - 1; p++)
{
thrust::copy_if(
thrust::device,
thrust::make_counting_iterator<int>(0),
thrust::make_counting_iterator<int>(dv_Connections_TerminalOscillatorID_Map.size()), // indices from 0 to N
dv_Connections_TerminalOscillatorID_Map.begin(), // array data
dv_Permutation_Connections_To_TerminalOscillators.begin() + (n * (NumberOfOscillators - 1)), // result will be written here
_1 == n);
}
}
for (int c = 0; c < NumberOfOscillators * (NumberOfOscillators - 1); c++)
{
dv_Connection_Keys[c] = c / (NumberOfOscillators - 1);
}
/* END INITIALIZATION */
/* BEGIN COMPARISON */
auto t = clock();
for (int x = 0; x < 5000; ++x) //Set x maximum to a reasonable number while testing performance.
{
thrust::reduce_by_key(
thrust::device,
//dv_Connection_Keys = 0,0,0,...1,1,1,...2,2,2,...3,3,3...
dv_Connection_Keys.begin(), //keys_first The beginning of the input key range.
dv_Connection_Keys.end(), //keys_last The end of the input key range.
thrust::make_permutation_iterator(
thrust::make_transform_iterator(
thrust::make_zip_iterator(
thrust::make_tuple(
dv_Connections_Strength.begin(),
dv_Connections_Active.begin()
)
),
GetProduct()
),
dv_Permutation_Connections_To_TerminalOscillators.begin()
), //values_first The beginning of the input value range.
thrust::make_discard_iterator(), //keys_output The beginning of the output key range.
dv_OscillatorsVelocity.begin() //values_output The beginning of the output value range.
);
}
std::cout << "iterations time for original: " << (clock() - t) * (1000.0 / CLOCKS_PER_SEC) << "ms\n" << endl << endl;
thrust::copy(dv_OscillatorsVelocity.begin(), dv_OscillatorsVelocity.end(), dv_outputCompare.begin());
t = clock();
for (int x = 0; x < 5000; ++x) //Set x maximum to a reasonable number while testing performance.
{
thrust::for_each(
thrust::device,
thrust::make_counting_iterator(0),
thrust::make_counting_iterator(0) + dv_Connections_Active.size(),
[
s = dv_OscillatorsVelocity.size() - 1,
dv_b = thrust::raw_pointer_cast(dv_OscillatorsVelocity.data()),
dv_c = thrust::raw_pointer_cast(dv_Permutation_Connections_To_TerminalOscillators.data()), //3,6,9,0,7,10,1,4,11,2,5,8
dv_ppa = thrust::raw_pointer_cast(dv_Connections_Active.data()),
dv_pps = thrust::raw_pointer_cast(dv_Connections_Strength.data())
] __device__(int i) {
const int readIndex = i / s;
atomicAdd(
dv_b + readIndex,
(dv_ppa[dv_c[i]] * dv_pps[dv_c[i]])
);
}
);
}
std::cout << "iterations time for new: " << (clock() - t) * (1000.0 / CLOCKS_PER_SEC) << "ms\n" << endl << endl;
std::cout << "***" << (dv_OscillatorsVelocity == dv_outputCompare ? "success" : "fail") << "***\n";
/* END COMPARISON */
return 0;
}
Extra info.:
My results are using a single GTX 980 TI.
There are 100 * (100 - 1) = 9,900 elements in all of the "Connection" vectors.
Each of the 100 unique keys found in dv_Connection_Keys has 99 elements each.
Use this compiler option: --expt-extended-lambda
What could be the biggest cause for the dramatic performance differences?
You are evidently building a debug project, that is your compilation settings include the -G switch. Although you were asked for your compilation settings in the comments, you didn't mention this.
It's important.
CUDA device code can have dramatically different performance characteristics when compiled with -G.
Don't evaluate performance of a debug project, or code compiled with -G.
When I compile and run your code without -G, I get:
iterations time for original: 210ms
iterations time for new: 70ms
***success***
When I compile your code with the debug switch -G, and run, I get:
iterations time for original: 12330ms
iterations time for new: 320ms
***success***
returning to your question, that accounts for the biggest factor of the difference.
The following answer tries to explain or at least motivate the remaining difference in performance after going from a debug build to a release build as explained in Robert Crovella's answer.
Coalescing
As the accesses in both kernels are not coalesced due to the permutation_iterator/indirection through dv_c, going by the the plain number of accesses will overestimate the performance in this case. thrust::reduce_by_key (or pretty much all Thrust algorithms) is not and can not be optimized for general permutations of the input as the performance of these bandwidth-bound kernels depends strongly on coalesced memory access. Naturally the algorithms are written such that accesses are coalesced for normal continuous input. So if you need to access the permuted state order of the data more than once (which might happen in a single reduction algorithm), it could be faster to actually permute the data in memory using thrust::gather or thrust::scatter once so at least all following accesses are efficient. I would not expect the for_each solution to beat reduce_by_key without that permutation.
Atomics
Newer versions of nvcc will try to use automatically use warp-aggregated-atomics to reduce the number of actual atomic instructions on the same address. As neighboring threads (same warp) tend to atomically write to the same address, this optimization is crucial for the performance of your custom reduction. Another important detail is that s = NumberOfOscillators is relatively small (100) in your code compared to typical thread-block sizes (256, 512, 1024; locality of atomic writes) and the amount of parallelism in the for_each (~NumberOfOscillators^2). So for smaller NumberOfOscillators I expect your custom reduction to get worse than reduce_by_key due to the vanishing amount of parallelism, while for bigger NumberOfOscillators you get both much more parallelism and more thread blocks/warps writing to the same location, so it is not quite clear which one will win without benchmarking it for given hardware and compiler.

Why my inversions of matrices are such slow with LAPACKE in C++ : MAGMA Alternative and set up

I am using LAPACK to inverse a matrix: I did a reference passing, i.e by working on the address. Here below the function with an input matrix and an output matrix referenced by their address.
The issue is that I am obliged to convert the F_matrix into 1D array and I think this is a waste of performances on the runtime level : which way could I find to get rid of this supplementary task which is time consuming I think if I call a lot of times the
function matrix_inverse_lapack.
Below the function concerned :
// Passing Matrixes by Reference
void matrix_inverse_lapack(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {
// Index for loop and arrays
int i, j, ip, idx;
// Size of F_matrix
int N = F_matrix.size();
int *IPIV = new int[N];
// Statement of main array to inverse
double *arr = new double[N*N];
// Output Diagonal block
double *diag = new double[N];
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
arr[idx] = F_matrix[i][j];
}
}
// LAPACKE routines
int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, arr, N, IPIV);
int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, arr, N, IPIV);
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
F_output[i][j] = arr[idx];
}
}
delete[] IPIV;
delete[] arr;
}
For example, I call it this way :
vector<vector<double>> CO_CL(lsize*(2*Dim_x+Dim_y), vector<double>(lsize*(2*Dim_x+Dim_y), 0));
... some code
matrix_inverse_lapack(CO_CL, CO_CL);
The performances on inversion are not which are expected, I think this is due to this conversion 2D -> 1D that I described in the function matrix_inverse_lapack.
Update
I was advised to install MAGMA on my MacOS Big Sur 11.3 but I have a lot of difficulties to set up it.
I have a AMD Radeon Pro 5600M graphic card. I have already installed by default Big Sur version all the Framework OpenCL (maybe I am wrong by saying that). Anyone could tell the procedure to follow for the installation of MAGMA. I saw that on a MAGMA software exists on http://magma.maths.usyd.edu.au/magma/ but it is really expensive and doesn't correspond to what I want : I just need all the SDK (headers and libraries) , if possible built with my GPU card. I have already installed all the Intel OpenAPI SDK on my MacOS. Maybe, I could link it to a MAGMA installation.
I saw another link https://icl.utk.edu/magma/software/index.html where MAGMA seems to be public : there is none link with the non-free version above, isn't there ?
First of all let me complain that OP did not provide all necessary data. The program is almost complete, but it is not a minimal, reproducible example. This is important because (a) it wastes time and (b) it hides potentially relevant information, eg. about the matrix initialization. Second, OP did not provide any details on the compilation, which, again may be relevant.
Last, but not least, OP didn't check the status code for possible errors from Lapack functions, and this could also be important for correct interpretation of the results.
Let's start from a minimal reproducible example:
#include <lapacke.h>
#include <vector>
#include <chrono>
#include <iostream>
using Matrix = std::vector<std::vector<double>>;
std::ostream &operator<<(std::ostream &out, Matrix const &v)
{
const auto size = std::min<int>(10, v.size());
for (int i = 0; i < size; i++)
{
for (int j = 0; j < size; j++)
{
out << v[i][j] << "\t";
}
if (size < std::ssize(v)) out << "...";
out << "\n";
}
return out;
}
void matrix_inverse_lapack(Matrix const &F_matrix, Matrix &F_output, std::vector<int> &IPIV_buffer,
std::vector<double> &matrix_buffer)
{
// std::cout << F_matrix << "\n";
auto t0 = std::chrono::steady_clock::now();
const int N = F_matrix.size();
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
{
auto idx = i * N + j;
matrix_buffer[idx] = F_matrix[i][j];
}
}
auto t1 = std::chrono::steady_clock::now();
// LAPACKE routines
int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, matrix_buffer.data(), N, IPIV_buffer.data());
int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, matrix_buffer.data(), N, IPIV_buffer.data());
auto t2 = std::chrono::steady_clock::now();
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
{
auto idx = i * N + j;
F_output[i][j] = matrix_buffer[idx];
}
}
auto t3 = std::chrono::steady_clock::now();
auto whole_fun_time = std::chrono::duration<double>(t3 - t0).count();
auto lapack_time = std::chrono::duration<double>(t2 - t1).count();
// std::cout << F_output << "\n";
std::cout << "status: " << info1 << "\t" << info2 << "\t" << (info1 == 0 && info2 == 0 ? "Success" : "Failure")
<< "\n";
std::cout << "whole function: " << whole_fun_time << "\n";
std::cout << "LAPACKE matrix operations: " << lapack_time << "\n";
std::cout << "conversion: " << (whole_fun_time - lapack_time) / whole_fun_time * 100.0 << "%\n";
}
int main(int argc, const char *argv[])
{
const int M = 5; // numer of test repetitions
const int N = (argc > 1) ? std::stoi(argv[1]) : 10;
std::cout << "Matrix size = " << N << "\n";
std::vector<int> IPIV_buffer(N);
std::vector<double> matrix_buffer(N * N);
// Test matrix_inverse_lapack M times
for (int i = 0; i < M; i++)
{
Matrix CO_CL(N);
for (auto &v : CO_CL) v.resize(N);
int idx = 1;
for (auto &v : CO_CL)
{
for (auto &x : v)
{
x = idx + 1.0 / idx;
idx++;
}
}
matrix_inverse_lapack(CO_CL, CO_CL, IPIV_buffer, matrix_buffer);
}
}
Here, operator<< is an overkill, but may be useful for anyone wanting to verify half-manually that the code works (by uncommenting lines 26 and 58), and ensuring that the code is correct is more important that measuring its performance.
The code can be compiled with
g++ -std=c++20 -O3 main.cpp -llapacke
The program relies on an external library, lapacke, which needs to be installed, headers + binaries, for the code to compile and run.
My code differs a bit from OP's: it is closer to "modern C++" in that it refrains from using naked pointers; I also added external buffers to matrix_inverse_lapack to suppress continual launching of memory allocator and deallocator, a small improvement that reduces the 2D-1D-2D conversion overhead in a measurable way. I also had to initialize the matrix and find a way to read in OP's mind what the value of N could be. I also added some timer readings for benchmarking. Apart from this, the logic of the code is unchanged.
Now a benchmark carried out on a decent workstation. It lists the percentage of time the conversion takes relative to the total time taken by matrix_inverse_lapack. In other words, I measure the conversion overhead:
N = 10, 3.5%
N = 30, 1.5%
N = 100, 1%
N = 300, 0.5%
N = 1000, 0.35%
N = 3000, 0.1%
The time taken by Lapack nicely scales as N3, as expected (data not shown). The time to invert a matrix is about 16 seconds for N = 3000, and about 5-6 s (5 microseconds) for N = 10.
I assume the overhead of even 3% is completely acceptable. I believe OP uses matrices of size larger then 100, in which case the overhead at or below 1% is certainly acceptable.
So what OP (or anyone having a similar problem) could have done wrong to obtain "unacceptable overhead conversion values"? Here's my short list
Improper compilation
Improper matrix initialization (for tests)
Improper benchmarking
1. Improper compilation
If one forgets to compile in Release mode, one ends up with optimized Lapacke competing with unoptimized conversion. On my machine this peaks at an 33% overhead for N = 20.
2. Improper matrix initialization (for tests)
If one initializes the matrix like this:
for (auto &v : CO_CL)
{
for (auto &x : v)
{
x = idx; // rather than, eg., idx + 1.0/idx
idx++;
}
}
then the matrix is singular, lapack returns quite quickly with the status different from 0. This increases the relative importance of the conversion part. But singular matrices are not what one wants to invert (it's impossible to do).
3. Improper benchmarking
Here's an example of the program output for N = 10:
./a.out 10
Matrix size = 10
status: 0 0 Success
whole function: 0.000127658
LAPACKE matrix operations: 0.000126783
conversion: 0.685425%
status: 0 0 Success
whole function: 1.2497e-05
LAPACKE matrix operations: 1.2095e-05
conversion: 3.21677%
status: 0 0 Success
whole function: 1.0535e-05
LAPACKE matrix operations: 1.0197e-05
conversion: 3.20835%
status: 0 0 Success
whole function: 9.741e-06
LAPACKE matrix operations: 9.422e-06
conversion: 3.27482%
status: 0 0 Success
whole function: 9.939e-06
LAPACKE matrix operations: 9.618e-06
conversion: 3.2297%
One can see that the first call to lapack functions can take 10 times more time than the subsequent calls. This is quite a stable pattern, as if Lapack needed some time for self-initialization. It can affect the measurements for small N badly.
4. What else can be done?
OP apperas to believe that his approach to 2D arrays is good and Lapack is strange and old-fashionable in its packing a 2D array into a 1D array. No. It is Lapack who is right.
If one defines a 2D array as vector<vector<double>>, one obtains one advantage: code simplicity. This comes at a price. Each row of such a matrix is allocated separateley from the others. Thus, a matrix 100 by 100 may be stored in 100 completely different memory blocks. This has a bad impact on the cache (and prefetcher) utilization. Lapck (and other linear algebra packages) enforces compactification of the data in a single, continuous array. This is so to minimize cache and prefetcher misses. If OP had used such an approach from the very beginning, he would probably have gained more than 1-3% that they pay now for the conversion.
This compactification can be achieved in at least three ways.
Write a custom class for a 2D matrix, with the internal data stored in a 1D array and convenient access member funnctions (e.g.: operator ()), or find a library that does just that
Write a custom allocator for std::vector (or find a library). This allocator should allocate the memory from a preallocated 1D vector exactly matching the data storage pattern used by Lapack
Use std::vector<double*> and initailze the pointers with the addresses pointing at the appropriate elements of a preallocated 1D array.
Each of the above solutions forces some changes to the surrounding code, which OP might not want to do. All depends on the code complexity and expected performance gains.
EDIT: Alternative libraries
An alternative approach is to use a library that is known for being a highly optimzed one. Lapack by itself can be regardered as a standard interface with many implementations and it may happen that OP uses an unoptimized one. Which library to choose may depend on the hardware/software platform OP is interested in and may vary in time.
As for now (mid-2021) a decent suggestions are:
Lapack https://www.netlib.org/lapack/
Atlas https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Algebra_Software http://math-atlas.sourceforge.net/
OpenBlas https://www.openblas.net/
Magma https://developer.nvidia.com/magma
Plasma https://bitbucket.org/icl/plasma/src/main/
If OP uses martices of sizes at least 100, then GPU-oriented MAGMA might be worth trying.
An easier (installation, running) way might with a parallel CPU library, e.g. Plasma. Plsama is Lapack-compliant, it has been being developed by a large team of people, including Jack Dongarra, it also should be rather easy to compile it locally as it is provided with a CMake script.
An example how much a parallel CPU-based, multicore implementation can outperform a single-threaded implementation of the LU-decomposition can be found for example here: https://cse.buffalo.edu/faculty/miller/Courses/CSE633/Tummala-Spring-2014-CSE633.pdf (short answer: 5 to 15 times for matrices of size 1000).

Doesn't see any significant improvement while using parallel block in OpenMP C++

I am receiving an array of Eigen::MatrixXf and Eigen::Matrix4f in realtime. Both of these arrays are having an equal number of elements. All I am trying to do is just multiply elements of both the arrays together and storing the result in another array at the same index.
Please see the code snippet below-
#define COUNT 4
while (all_ok())
{
Eigen::Matrix4f trans[COUNT];
Eigen::MatrixXf in_data[COUNT];
Eigen::MatrixXf out_data[COUNT];
// at each iteration, new data is filled
// in 'trans' and 'in_data' variables
#pragma omp parallel num_threads(COUNT)
{
#pragma omp for
for (int i = 0; i < COUNT; i++)
out_data[i] = trans[i] * in_clouds[i];
}
}
Please note that COUNT is a constant. The size of trans and in_data is (4 x 4) and (4 x n) respectively, where n is approximately 500,000. In order to parallelize the for loop, I gave OpenMP a try as shown above. However, I don't see any significant improvement in the elapsed time of for loop.
Any suggestions? Any alternatives to perform the same operation, please?
Edit: My idea is to define 4 (=COUNT) threads wherein each of them is taking care of multiplication. In this way, we don't need to create threads every time, I guess!
Works for me using the following self-contained example, that is, I get a x4 speed up when enabling openmp:
#include <iostream>
#include <bench/BenchTimer.h>
using namespace Eigen;
const int COUNT = 4;
EIGEN_DONT_INLINE
void foo(const Matrix4f *trans, const MatrixXf *in_data, MatrixXf *out_data)
{
#pragma omp parallel for num_threads(COUNT)
for (int i = 0; i < COUNT; i++)
out_data[i] = trans[i] * in_data[i];
}
int main()
{
Eigen::Matrix4f trans[COUNT];
Eigen::MatrixXf in_data[COUNT];
Eigen::MatrixXf out_data[COUNT];
int n = 500000;
for (int i = 0; i < COUNT; i++)
{
trans[i].setRandom();
in_data[i].setRandom(4,n);
out_data[i].setRandom(4,n);
}
int tries = 3;
int rep = 1;
BenchTimer t;
BENCH(t, tries, rep, foo(trans, in_data, out_data));
std::cout << " " << t.best(Eigen::REAL_TIMER) << " (" << double(n)*4.*4.*4.*2.e-9/t.best() << " GFlops)\n";
return 0;
}
So 1) make sure you measure the wallclock time and not the CPU time, and 2) make sure that the products is the bottleneck and not filling in_data.
Finally, for maximal performance don't forget to enable AVX/FMA (e.g., with -march=native), and of course make sure to benchmark with compiler's optimization ON.
For the record, on my computer the above example takes 0.25s without openmp, and 0.065s with.
You need to specify -fopenmp during compilation and linking. But you will quickly hit the limit, where RAM access is stopping further speeding up. You really should have a look at vector intrinsics. Dependent on you CPU you could accelerate your operations to the size of your register divided by the size of your variable (float = 4). So if your processor supports say AVX, you'd be dealing with 8 floats at a time. If you need some inspiration, you're welcome to steal code from my medical image reconstruction library here:
https://github.com/kvahed/codeare/blob/master/src/matrix/SIMDTraits.hpp
The code does the whole shebang for float/double real and complex.

C++ Branch prediction algorithm comparison?

There are 2 possible techniques shown below which do the same task.
I would like to know if there will be any performance difference between the two.
I think the first technique will suffer due to branch prediction as contents of A are random.
Technique 1:
#include <iostream>
#include <cstdlib>
#include <ctime>
#define SIZE 1000000
using namespace std;
class MyClass
{
private:
bool flag;
public:
void setFlag(bool f) {flag = f;}
};
int main()
{
MyClass obj;
int *A = new int[SIZE];
for(int i = 0; i < SIZE; i++)
A[i] = (unsigned int)rand();
time_t mytime1;
time_t mytime2;
time(&mytime1);
for(int test = 0; test < 5000; test++)
{
for(int i = 0; i < SIZE; i++)
{
if(A[i] > 100)
obj.setFlag(true);
else
obj.setFlag(false);
}
}
time(&mytime2);
cout << asctime(localtime(&mytime1)) << endl;
cout << asctime(localtime(&mytime2)) << endl;
}
Result:
Sat May 03 20:08:07 2014
Sat May 03 20:08:32 2014
i.e. Time taken = 25sec
Technique 2:
#include <iostream>
#include <cstdlib>
#include <ctime>
#define SIZE 1000000
using namespace std;
class MyClass
{
private:
bool flag;
public:
void setFlag(bool f) {flag = f;}
};
int main()
{
MyClass obj;
int *A = new int[SIZE];
for(int i = 0; i < SIZE; i++)
A[i] = (unsigned int)rand();
time_t mytime1;
time_t mytime2;
time(&mytime1);
for(int test = 0; test < 5000; test++)
{
for(int i = 0; i < SIZE; i++)
{
obj.setFlag(A[i] > 100);
}
}
time(&mytime2);
cout << asctime(localtime(&mytime1)) << endl;
cout << asctime(localtime(&mytime2)) << endl;
}
Result:
Sat May 03 20:08:42 2014
Sat May 03 20:09:10 2014
i.e. Time taken = 28sec
The compilation is done using MinGW 64 bt compiler with no flags.
From the results it looks like the opposite is happening.
EDIT:
After making the check for RAND_MAX / 2 instead of 100, I am getting the following results:
Technique 1: 70sec
Technique 2: 28sec
So it becomes clear now that Technique 2 is better than technique 1 and can be explained on the basis of branch prediction failure phenomenon.
With optimisations enabled the binaries are exactly the same, in GCC 4.8 at least: demo.
They're different with optimisations disabled, though: demo.
This very poor attempt at a measurement suggests that the second is actually slower, though both programs run in the same duration in real terms: demo
real 0m0.052s
user 0m0.036s
sys 0m0.012s
real 0m0.052s
user 0m0.044s
sys 0m0.004s
To find out how they really differ in performance with optimisations disabled, you can benchmark properly with more runs.
Frankly, though, since it's irrelevant for your production code I wouldn't even bother.
I agree with the fact that this isn't very interesting for practical code (especially when it dissapears with -O3), but for the sake of academic interest: In some conditions it may be better to rely on the branch predictor.
On one hand, in this particular case the branch is almost always going to be not-taken (as RAND_MAX >> 100), which is easy to predict both interms of branch resolution as well as the next IP address. Try converting the prediciton to a 50% chance and then benchmark this.
On the other hand, the second operation turns the stores done to the obj flag into being data-dependent with the loads from A[i]. These loads are going to be slow as your dataset is 1000000*sizeof(A) bytes at least (almost 4MB), meaning that it could be either in the L3 cache or the memory - either way that's quiet a few cycles per each new line (once every few accesses) - when the writes to the flag were independent, they could queue in parallel, now you have to stall them until you get the data. In Theory, the CPU should be able to "pipeline" this, since stores are performed much later than loads along the pipeline on most CPUs, but in practice you're limited by the size of the execution window, in most machines that would be ~100 I believe), so if the store of the current iteration is stalled, you won't be able to launch too far ahead the loads required for the future iterations.
In other words - you may be losing due to the fact that CPUs have a fairly decent branch prediction, but no (or hardly no) data prediction.

Testing parallel_for_ performance in OpenCV

I tested parallel_for_ in OpenCV by comparing with the normal operation for just simple array summation and multiplication.
I have array of 100 integers and split into 10 each and run using parallel_for_.
Then I also have normal 0 to 99 operation for summation and multiuplication.
Then I measured the elapsed time and normal operation is faster than parallel_for_ operation.
My CPU is Intel(R) Core(TM) i7-2600 Quard Core CPU.
parallel_for_ operation took 0.002sec (took 2 clock cycles) for summation and 0.003sec (took 3 clock cycles) for multiplication.
But normal operation took 0.0000sec (less than one click cycle) for both summation and multiplication. What am I missing? My code is as follow.
TEST Class
#include <opencv2\core\internal.hpp>
#include <opencv2\core\core.hpp>
#include <tbb\tbb.h>
using namespace tbb;
using namespace cv;
template <class type>
class Parallel_clipBufferValues:public cv::ParallelLoopBody
{
private:
type *buffertoClip;
type maxSegment;
char typeOperation;//m = mul, s = summation
static double total;
public:
Parallel_clipBufferValues(){ParallelLoopBody::ParallelLoopBody();};
Parallel_clipBufferValues(type *buffertoprocess, const type max, const char op): buffertoClip(buffertoprocess), maxSegment(max), typeOperation(op){
if(typeOperation == 's')
total = 0;
else if(typeOperation == 'm')
total = 1;
}
~Parallel_clipBufferValues(){ParallelLoopBody::~ParallelLoopBody();};
virtual void operator()(const cv::Range &r) const{
double tot = 0;
type *inputOutputBufferPTR = buffertoClip+(r.start*maxSegment);
for(int i = 0; i < 10; ++i)
{
if(typeOperation == 's')
total += *(inputOutputBufferPTR+i);
else if(typeOperation == 'm')
total *= *(inputOutputBufferPTR+i);
}
}
static double getTotal(){return total;}
void normalOperation(){
//int iteration = sizeof(buffertoClip)/sizeof(type);
if(typeOperation == 'm')
{
for(int i = 0; i < 100; ++i)
{
total *= buffertoClip[i];
}
}
else if(typeOperation == 's')
{
for(int i = 0; i < 100; ++i)
{
total += buffertoClip[i];
}
}
}
};
MAIN
#include "stdafx.h"
#include "TestClass.h"
#include <ctime>
double Parallel_clipBufferValues<int>::total;
int _tmain(int argc, _TCHAR* argv[])
{
const int SIZE=100;
int myTab[SIZE];
double totalSum_by_parallel;
double totalSun_by_normaloperation;
double elapsed_secs_parallel;
double elapsed_secs_normal;
for(int i = 1; i <= SIZE; i++)
{
myTab[i-1] = i;
}
int maxSeg =10;
clock_t begin_parallel = clock();
cv::parallel_for_(cv::Range(0,maxSeg), Parallel_clipBufferValues<int>(myTab, maxSeg, 'm'));
totalSum_by_parallel = Parallel_clipBufferValues<int>::getTotal();
clock_t end_parallel = clock();
elapsed_secs_parallel = double(end_parallel - begin_parallel) / CLOCKS_PER_SEC;
clock_t begin_normal = clock();
Parallel_clipBufferValues<int> norm_op(myTab, maxSeg, 'm');
norm_op.normalOperation();
totalSun_by_normaloperation = norm_op.getTotal();
clock_t end_normal = clock();
elapsed_secs_normal = double(end_normal - begin_normal) / CLOCKS_PER_SEC;
return 0;
}
Let me do some considerations:
Accuracy
clock() function is not accurate at all. Its tick is roughly 1 / CLOCKS_PER_SEC but how often it's updated and if it's uniform or not it's system and implementation dependent. See this post for more details about that.
Better alternatives to measure time:
This post for Windows.
This article for *nix.
Trials and Test Environment
Measures are always affected by errors. Performance measurement for your code is affected (short list, there is much more than that) by other programs, cache, operating system jobs, scheduling and user activity. To have a better measure you have to repeat it many times (let's say 1000 or more) then calculate average. Moreover you should prepare your test environment to be as clean as possible.
More details about tests on these posts:
How do I write a correct micro-benchmark in Java?
NAS Parallel Benchmarks
Visual C++ 11 Beta Benchmark of Parallel Loops (for code examples)
Great articles from our Eric Lippert about benchmarking (it's about C# but most of them applies directly to any bechmark): C# Performance Benchmark Mistakes (part II).
Overhead and Scalability
In your case overhead for parallel execution (and your test code structure) is much higher that loop body itself. In this case it's not productive to make an algorithm parallel. Parallel execution must always be evaluated in a specific scenario, measured and compared. It's not kind of magic medicine to speed up everything. Take a look to this article about How to Quantify Scalability.
Just for example if you have to sum/multiply 100 numbers it's better to use SIMD instructions (even better within an unrolled loop).
Measure It!
Try to make your loop body empty (or to execute a single NOP operation or volatile write so it won't be optimized away). You'll roughly measure overhead. Now compare it with your results.
Notes About This Test
IMO this kind of test is pretty useless. You can't compare, in a generic way, serial or parallel execution. It's something you should always check against a specific situation (in real world many things will play, synchronization for example).
Imagine: you make your loop body really "heavy" and you'll see a big speed up with parallel execution. Now you make your real program parallel and you see performance is worse. Why? Because parallel execution is slowed down by locks, by cache problems or serial access to a shared resource.
Test itself is meaningless unless you're testing your specific code in your specific situation (because too many factors will play and you just can't ignore them). What it means? Well that you can compare only what you tested...if your program performs total *= buffertoClip[i]; then your results are reliable. If your real program does something else then you have to repeat tests with that something else.