Windows threading synchronization performance issue - c++

I have a threading issue under windows.
I am developing a program that runs complex physical simulations for different conditions. Say a condition per hour of the year, would be 8760 simulations. I am grouping those simulations per thread such that each thread runs a for loop of 273 simulations (on average)
I bought an AMD ryzen 9 5950x with 16 cores (32 threads) for this task. On Linux, all the threads seem to be between 98% to 100% usage, while under windows I get this:
(The first bar is the I/O thread reading data, the smaller bars are the process threads. Red: synchronization, green: process, purple: I/O)
This is from Visual Studio's concurrency visualizer, which tells me that 63% of the time was spent on thread synchronization. As far as I can tell, my code is the same for both the Linux and windows executions.
I made my best to make the objects immutable to avoid issues and that provided a big gain with my old 8-thread intel i7. However with many more threads, this issue arises.
For threading, I have tried a custom parallel for, and the taskflow library. Both perform identically for what I want to do.
Is there something fundamental about windows threads that produces this behaviour?
The custom parallel for code:
/**
* parallel for
* #tparam Index integer type
* #tparam Callable function type
* #param start start index of the loop
* #param end final +1 index of the loop
* #param func function to evaluate
* #param nb_threads number of threads, if zero, it is determined automatically
*/
template<typename Index, typename Callable>
static void ParallelFor(Index start, Index end, Callable func, unsigned nb_threads=0) {
// Estimate number of threads in the pool
if (nb_threads == 0) nb_threads = getThreadNumber();
// Size of a slice for the range functions
Index n = end - start + 1;
Index slice = (Index) std::round(n / static_cast<double> (nb_threads));
slice = std::max(slice, Index(1));
// [Helper] Inner loop
auto launchRange = [&func] (int k1, int k2) {
for (Index k = k1; k < k2; k++) {
func(k);
}
};
// Create pool and launch jobs
std::vector<std::thread> pool;
pool.reserve(nb_threads);
Index i1 = start;
Index i2 = std::min(start + slice, end);
for (unsigned i = 0; i + 1 < nb_threads && i1 < end; ++i) {
pool.emplace_back(launchRange, i1, i2);
i1 = i2;
i2 = std::min(i2 + slice, end);
}
if (i1 < end) {
pool.emplace_back(launchRange, i1, end);
}
// Wait for jobs to finish
for (std::thread &t : pool) {
if (t.joinable()) {
t.join();
}
}
}
A complete C++ project illustrating the issue is uploaded here
Main.cpp:
//
// Created by santi on 26/08/2022.
//
#include "input_data.h"
#include "output_data.h"
#include "random.h"
#include "par_for.h"
void fillA(Matrix& A){
Random rnd;
rnd.setTimeBasedSeed();
for(int i=0; i < A.getRows(); ++i)
for(int j=0; j < A.getRows(); ++j)
A(i, j) = (int) rnd.randInt(0, 1000);
}
void worker(const InputData& input_data,
OutputData& output_data,
const std::vector<int>& time_indices,
int thread_index){
std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";
for(const int& t: time_indices){
Matrix b = input_data.getAt(t);
Matrix A(input_data.getDim(), input_data.getDim());
fillA(A);
Matrix x = A * b;
output_data.setAt(t, x);
}
}
void process(int time_steps, int dim, int n_threads){
InputData input_data(time_steps, dim);
OutputData output_data(time_steps, dim);
// correct the number of threads
if ( n_threads < 1 ) { n_threads = ( int )getThreadNumber( ); }
// generate indices
std::vector<int> time_indices = arrange<int>(time_steps);
// compute the split of indices per core
std::vector<ParallelChunkData<int>> chunks = prepareParallelChunks(time_indices, n_threads );
// run in parallel
ParallelFor( 0, ( int )chunks.size( ), [ & ]( int k ) {
// run chunk
worker(input_data, output_data, chunks[k].indices, k );
} );
}
int main(){
process(8760, 5000, 0);
return 0;
}

The performance problem you see is definitely caused by the many memory allocations, as already suspected by Matt in his answer. To expand on this: Here is a screenshot from Intel VTune running on an AMD Ryzen Threadripper 3990X with 64 cores (128 threads):
As you can see, almost all of the time is spent in malloc or free, which get called from the various Matrix operations. The bottom part of the image shows the timeline of the activity of a small selection of the threads: Green means that the thread is inactive, i.e. waiting. Usually only one or two threads are actually active. Allocations and freeing memory accesses a shared resource, causing the threads to wait for each other.
I think you have only two real options:
Option 1: No dynamic allocations anymore
The most efficient thing to do would be to rewrite the code to preallocate everything and get rid of all the temporaries. To adapt it to your example code, you could replace the b = input_data.getAt(t); and x = A * b; like this:
void MatrixVectorProduct(Matrix const & A, Matrix const & b, Matrix & x)
{
for (int i = 0; i < x.getRows(); ++i) {
for (int j = 0; j < x.getCols(); ++j) {
x(i, j) = 0.0;
for (int k = 0; k < A.getCols(); ++k) {
x(i,j) += (A(i,k) * b(k,j));
}
}
}
}
void getAt(int t, Matrix const & input_data, Matrix & b) {
for (int i = 0; i < input_data.getRows(); ++i)
b(i, 0) = input_data(i, t);
}
void worker(const InputData& input_data,
OutputData& output_data,
const std::vector<int>& time_indices,
int thread_index){
std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";
Matrix A(input_data.getDim(), input_data.getDim());
Matrix b(input_data.getDim(), 1);
Matrix x(input_data.getDim(), 1);
for (const int & t: time_indices) {
getAt(t, input_data.getMat(), b);
fillA(A);
MatrixVectorProduct(A, b, x);
output_data.setAt(t, x);
}
std::cout << "Thread " << thread_index << ": Finished" << std::endl;
}
This fixes the performance problems.
Here is a screenshot from VTune, where you can see a much better utilization:
Option 2: Using a special allocator
The alternative is to use a different allocator that handles allocating and freeing memory more efficiently in multithreaded scenarios. One that I had very good experience with is mimalloc (there are others such as hoard or the one from TBB). You do not need to modify your source code, you just need to link with a specific library as described in the documentation.
I tried mimalloc with your source code, and it gave near 100% CPU utilization without any code changes.
I also found a post on the Intel forums with a similar problem, and the solution there was the same (using a special allocator).
Additional notes
Matrix::allocSpace() allocates the memory by using pointers to arrays. It is better to use one contiguous array for the whole matrix instead of multiple independent arrays. That way, all elements are located behind each other in memory, allowing more efficient access.
But in general I suggest to use a dedicated linear algebra library such as Eigen instead of the hand rolled matrix implementation to exploit vectorization (SSE2, AVX,...) and to get the benefits of a highly optimized library.
Ensure that you compile your code with optimizations enabled.
Disable various cross-checks if you do not need them: assert() (i.e. define NDEBUG in the preprocessor), and for MSVC possibly /GS-.
Ensure that you actually have enough memory installed.

You said that all your memory was pre-allocated, but in the worker function I see this...
Matrix b = input_data.getAt(t);
which allocates and fills a new matrix b, and this...
Matrix A(input_data.getDim(), input_data.getDim());
which allocates and fills a new matrix A, and this...
Matrix x = A * b;
which allocates and fills a new matrix x.
The heap is a global data structure, so the thread synchronization time you're seeing is probably contention in the memory allocate/free functions.
These are in a tight loop. You should fix this loop to access b by reference, and reuse the other 2 matrices for every iteration.

Related

Why is thrust reduce_by_key almost 75x slower than for_each with atomicAdd()?

I was not satisfied with the performance of the below thrust::reduce_by_key, so I rewrote it in a variety of ways with little gained benefit (including removing the permutation iterator). However, it wasn't until after replacing it with a thrust::for_each() (see below) that capitalizes on atomicAdd(), that I gained almost a 75x speedup! The two versions produce the exact same results. What could be the biggest cause for the dramatic performance differences?
Complete code for comparison between the two approaches:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <ctime>
#include <iostream>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/sort.h>
constexpr int NumberOfOscillators = 100;
int SeedRange = 500;
struct GetProduct
{
template<typename Tuple>
__host__ __device__
int operator()(const Tuple & t)
{
return thrust::get<0>(t) * thrust::get<1>(t);
}
};
int main()
{
using namespace std;
using namespace thrust::placeholders;
/* BEGIN INITIALIZATION */
thrust::device_vector<int> dv_OscillatorsVelocity(NumberOfOscillators);
thrust::device_vector<int> dv_outputCompare(NumberOfOscillators);
thrust::device_vector<int> dv_Connections_Strength((NumberOfOscillators - 1) * NumberOfOscillators);
thrust::device_vector<int> dv_Connections_Active((NumberOfOscillators - 1) * NumberOfOscillators);
thrust::device_vector<int> dv_Connections_TerminalOscillatorID_Map(0);
thrust::device_vector<int> dv_Permutation_Connections_To_TerminalOscillators((NumberOfOscillators - 1) * NumberOfOscillators);
thrust::device_vector<int> dv_Connection_Keys((NumberOfOscillators - 1) * NumberOfOscillators);
srand((unsigned int)time(NULL));
thrust::fill(dv_OscillatorsVelocity.begin(), dv_OscillatorsVelocity.end(), 0);
for (int c = 0; c < NumberOfOscillators * (NumberOfOscillators - 1); c++)
{
dv_Connections_Strength[c] = (rand() % SeedRange) - (SeedRange / 2);
dv_Connections_Active[c] = 0;
}
int curOscillatorIndx = -1;
for (int c = 0; c < NumberOfOscillators * NumberOfOscillators; c++)
{
if (c % NumberOfOscillators == 0)
{
curOscillatorIndx++;
}
if (c % NumberOfOscillators != curOscillatorIndx)
{
dv_Connections_TerminalOscillatorID_Map.push_back(c % NumberOfOscillators);
}
}
for (int n = 0; n < NumberOfOscillators; n++)
{
for (int p = 0; p < NumberOfOscillators - 1; p++)
{
thrust::copy_if(
thrust::device,
thrust::make_counting_iterator<int>(0),
thrust::make_counting_iterator<int>(dv_Connections_TerminalOscillatorID_Map.size()), // indices from 0 to N
dv_Connections_TerminalOscillatorID_Map.begin(), // array data
dv_Permutation_Connections_To_TerminalOscillators.begin() + (n * (NumberOfOscillators - 1)), // result will be written here
_1 == n);
}
}
for (int c = 0; c < NumberOfOscillators * (NumberOfOscillators - 1); c++)
{
dv_Connection_Keys[c] = c / (NumberOfOscillators - 1);
}
/* END INITIALIZATION */
/* BEGIN COMPARISON */
auto t = clock();
for (int x = 0; x < 5000; ++x) //Set x maximum to a reasonable number while testing performance.
{
thrust::reduce_by_key(
thrust::device,
//dv_Connection_Keys = 0,0,0,...1,1,1,...2,2,2,...3,3,3...
dv_Connection_Keys.begin(), //keys_first The beginning of the input key range.
dv_Connection_Keys.end(), //keys_last The end of the input key range.
thrust::make_permutation_iterator(
thrust::make_transform_iterator(
thrust::make_zip_iterator(
thrust::make_tuple(
dv_Connections_Strength.begin(),
dv_Connections_Active.begin()
)
),
GetProduct()
),
dv_Permutation_Connections_To_TerminalOscillators.begin()
), //values_first The beginning of the input value range.
thrust::make_discard_iterator(), //keys_output The beginning of the output key range.
dv_OscillatorsVelocity.begin() //values_output The beginning of the output value range.
);
}
std::cout << "iterations time for original: " << (clock() - t) * (1000.0 / CLOCKS_PER_SEC) << "ms\n" << endl << endl;
thrust::copy(dv_OscillatorsVelocity.begin(), dv_OscillatorsVelocity.end(), dv_outputCompare.begin());
t = clock();
for (int x = 0; x < 5000; ++x) //Set x maximum to a reasonable number while testing performance.
{
thrust::for_each(
thrust::device,
thrust::make_counting_iterator(0),
thrust::make_counting_iterator(0) + dv_Connections_Active.size(),
[
s = dv_OscillatorsVelocity.size() - 1,
dv_b = thrust::raw_pointer_cast(dv_OscillatorsVelocity.data()),
dv_c = thrust::raw_pointer_cast(dv_Permutation_Connections_To_TerminalOscillators.data()), //3,6,9,0,7,10,1,4,11,2,5,8
dv_ppa = thrust::raw_pointer_cast(dv_Connections_Active.data()),
dv_pps = thrust::raw_pointer_cast(dv_Connections_Strength.data())
] __device__(int i) {
const int readIndex = i / s;
atomicAdd(
dv_b + readIndex,
(dv_ppa[dv_c[i]] * dv_pps[dv_c[i]])
);
}
);
}
std::cout << "iterations time for new: " << (clock() - t) * (1000.0 / CLOCKS_PER_SEC) << "ms\n" << endl << endl;
std::cout << "***" << (dv_OscillatorsVelocity == dv_outputCompare ? "success" : "fail") << "***\n";
/* END COMPARISON */
return 0;
}
Extra info.:
My results are using a single GTX 980 TI.
There are 100 * (100 - 1) = 9,900 elements in all of the "Connection" vectors.
Each of the 100 unique keys found in dv_Connection_Keys has 99 elements each.
Use this compiler option: --expt-extended-lambda
What could be the biggest cause for the dramatic performance differences?
You are evidently building a debug project, that is your compilation settings include the -G switch. Although you were asked for your compilation settings in the comments, you didn't mention this.
It's important.
CUDA device code can have dramatically different performance characteristics when compiled with -G.
Don't evaluate performance of a debug project, or code compiled with -G.
When I compile and run your code without -G, I get:
iterations time for original: 210ms
iterations time for new: 70ms
***success***
When I compile your code with the debug switch -G, and run, I get:
iterations time for original: 12330ms
iterations time for new: 320ms
***success***
returning to your question, that accounts for the biggest factor of the difference.
The following answer tries to explain or at least motivate the remaining difference in performance after going from a debug build to a release build as explained in Robert Crovella's answer.
Coalescing
As the accesses in both kernels are not coalesced due to the permutation_iterator/indirection through dv_c, going by the the plain number of accesses will overestimate the performance in this case. thrust::reduce_by_key (or pretty much all Thrust algorithms) is not and can not be optimized for general permutations of the input as the performance of these bandwidth-bound kernels depends strongly on coalesced memory access. Naturally the algorithms are written such that accesses are coalesced for normal continuous input. So if you need to access the permuted state order of the data more than once (which might happen in a single reduction algorithm), it could be faster to actually permute the data in memory using thrust::gather or thrust::scatter once so at least all following accesses are efficient. I would not expect the for_each solution to beat reduce_by_key without that permutation.
Atomics
Newer versions of nvcc will try to use automatically use warp-aggregated-atomics to reduce the number of actual atomic instructions on the same address. As neighboring threads (same warp) tend to atomically write to the same address, this optimization is crucial for the performance of your custom reduction. Another important detail is that s = NumberOfOscillators is relatively small (100) in your code compared to typical thread-block sizes (256, 512, 1024; locality of atomic writes) and the amount of parallelism in the for_each (~NumberOfOscillators^2). So for smaller NumberOfOscillators I expect your custom reduction to get worse than reduce_by_key due to the vanishing amount of parallelism, while for bigger NumberOfOscillators you get both much more parallelism and more thread blocks/warps writing to the same location, so it is not quite clear which one will win without benchmarking it for given hardware and compiler.

Performance of C++ program is 2x slower with the use of TBB

I'm trying to optimize the performance of a C++ program by using the TBB library.
My program only contains a couple of small for loop, so I know it can be a challenge to optimze time complexity in this case, but I have to use TBB.
As such, I tried to use a partitionner which made the program 2 time faster with TBB than without the partitionner, but it's still slower than the original program without the use of parallelism.
In my code, I print when a loop start and end with the id to see if there is parallelism. The output show that the loop is in fact execute sequentially, for example : start 1 end 1, start 2 end 2 , etc(it's a list of size 200). The output of the ids isn't random like you would expect from a parallelized program.
Here is an example of how I used the library:
tbb::global_control c(tbb::global_control::max_allowed_parallelism, 1000);
size_t grainsize = 1000;
size_t changes = 0;
tbb::parallel_for(
tbb::blocked_range<std::size_t>(0, list.size(), grainsize),
[&](const tbb::blocked_range<std::size_t> r) {
for (size_t id = r.begin(); id < r.end(); ++id) {
std::cout << "start:" << point_id << std::endl;
double disto = std::numeric_limits<double>::max();
size_t cluster_id = 0;
const Point& point = points.at(id);
for (size_t i = 0; i < short_list.size(); i++) {
const Point& origin = originss[i];
double disto2 = point.dist(origin);
if (disto2 < min) {
min = disto2;
clus = i;
}
}
if (m[id] != m_id) {
m[id] = m_id;
modif++;
}
disto_list[id] = min;
std::cout << "end:" << point_id << std::endl;
}
}
);
Is there a way to improve the performance of a C++ program composed of multiple small for loops with the use of the TBB library? And why are the loop not parallized?
If you are using task_scheduler_init in your program, then TBB uses the same thread throughout the program until task_scheduler_init objects are destroyed.
As you are passing max_allowed_parallelism as a parameter for global_control, if it is set to 1 then it will make your application run in a sequential way.
You can refer to the below link:
https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler/scheduling_controls/global_control_cls.html
It will be helpful if you provide the complete reproducer to figure out where exactly the issue took place.

Why my inversions of matrices are such slow with LAPACKE in C++ : MAGMA Alternative and set up

I am using LAPACK to inverse a matrix: I did a reference passing, i.e by working on the address. Here below the function with an input matrix and an output matrix referenced by their address.
The issue is that I am obliged to convert the F_matrix into 1D array and I think this is a waste of performances on the runtime level : which way could I find to get rid of this supplementary task which is time consuming I think if I call a lot of times the
function matrix_inverse_lapack.
Below the function concerned :
// Passing Matrixes by Reference
void matrix_inverse_lapack(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {
// Index for loop and arrays
int i, j, ip, idx;
// Size of F_matrix
int N = F_matrix.size();
int *IPIV = new int[N];
// Statement of main array to inverse
double *arr = new double[N*N];
// Output Diagonal block
double *diag = new double[N];
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
arr[idx] = F_matrix[i][j];
}
}
// LAPACKE routines
int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, arr, N, IPIV);
int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, arr, N, IPIV);
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
F_output[i][j] = arr[idx];
}
}
delete[] IPIV;
delete[] arr;
}
For example, I call it this way :
vector<vector<double>> CO_CL(lsize*(2*Dim_x+Dim_y), vector<double>(lsize*(2*Dim_x+Dim_y), 0));
... some code
matrix_inverse_lapack(CO_CL, CO_CL);
The performances on inversion are not which are expected, I think this is due to this conversion 2D -> 1D that I described in the function matrix_inverse_lapack.
Update
I was advised to install MAGMA on my MacOS Big Sur 11.3 but I have a lot of difficulties to set up it.
I have a AMD Radeon Pro 5600M graphic card. I have already installed by default Big Sur version all the Framework OpenCL (maybe I am wrong by saying that). Anyone could tell the procedure to follow for the installation of MAGMA. I saw that on a MAGMA software exists on http://magma.maths.usyd.edu.au/magma/ but it is really expensive and doesn't correspond to what I want : I just need all the SDK (headers and libraries) , if possible built with my GPU card. I have already installed all the Intel OpenAPI SDK on my MacOS. Maybe, I could link it to a MAGMA installation.
I saw another link https://icl.utk.edu/magma/software/index.html where MAGMA seems to be public : there is none link with the non-free version above, isn't there ?
First of all let me complain that OP did not provide all necessary data. The program is almost complete, but it is not a minimal, reproducible example. This is important because (a) it wastes time and (b) it hides potentially relevant information, eg. about the matrix initialization. Second, OP did not provide any details on the compilation, which, again may be relevant.
Last, but not least, OP didn't check the status code for possible errors from Lapack functions, and this could also be important for correct interpretation of the results.
Let's start from a minimal reproducible example:
#include <lapacke.h>
#include <vector>
#include <chrono>
#include <iostream>
using Matrix = std::vector<std::vector<double>>;
std::ostream &operator<<(std::ostream &out, Matrix const &v)
{
const auto size = std::min<int>(10, v.size());
for (int i = 0; i < size; i++)
{
for (int j = 0; j < size; j++)
{
out << v[i][j] << "\t";
}
if (size < std::ssize(v)) out << "...";
out << "\n";
}
return out;
}
void matrix_inverse_lapack(Matrix const &F_matrix, Matrix &F_output, std::vector<int> &IPIV_buffer,
std::vector<double> &matrix_buffer)
{
// std::cout << F_matrix << "\n";
auto t0 = std::chrono::steady_clock::now();
const int N = F_matrix.size();
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
{
auto idx = i * N + j;
matrix_buffer[idx] = F_matrix[i][j];
}
}
auto t1 = std::chrono::steady_clock::now();
// LAPACKE routines
int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, matrix_buffer.data(), N, IPIV_buffer.data());
int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, matrix_buffer.data(), N, IPIV_buffer.data());
auto t2 = std::chrono::steady_clock::now();
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
{
auto idx = i * N + j;
F_output[i][j] = matrix_buffer[idx];
}
}
auto t3 = std::chrono::steady_clock::now();
auto whole_fun_time = std::chrono::duration<double>(t3 - t0).count();
auto lapack_time = std::chrono::duration<double>(t2 - t1).count();
// std::cout << F_output << "\n";
std::cout << "status: " << info1 << "\t" << info2 << "\t" << (info1 == 0 && info2 == 0 ? "Success" : "Failure")
<< "\n";
std::cout << "whole function: " << whole_fun_time << "\n";
std::cout << "LAPACKE matrix operations: " << lapack_time << "\n";
std::cout << "conversion: " << (whole_fun_time - lapack_time) / whole_fun_time * 100.0 << "%\n";
}
int main(int argc, const char *argv[])
{
const int M = 5; // numer of test repetitions
const int N = (argc > 1) ? std::stoi(argv[1]) : 10;
std::cout << "Matrix size = " << N << "\n";
std::vector<int> IPIV_buffer(N);
std::vector<double> matrix_buffer(N * N);
// Test matrix_inverse_lapack M times
for (int i = 0; i < M; i++)
{
Matrix CO_CL(N);
for (auto &v : CO_CL) v.resize(N);
int idx = 1;
for (auto &v : CO_CL)
{
for (auto &x : v)
{
x = idx + 1.0 / idx;
idx++;
}
}
matrix_inverse_lapack(CO_CL, CO_CL, IPIV_buffer, matrix_buffer);
}
}
Here, operator<< is an overkill, but may be useful for anyone wanting to verify half-manually that the code works (by uncommenting lines 26 and 58), and ensuring that the code is correct is more important that measuring its performance.
The code can be compiled with
g++ -std=c++20 -O3 main.cpp -llapacke
The program relies on an external library, lapacke, which needs to be installed, headers + binaries, for the code to compile and run.
My code differs a bit from OP's: it is closer to "modern C++" in that it refrains from using naked pointers; I also added external buffers to matrix_inverse_lapack to suppress continual launching of memory allocator and deallocator, a small improvement that reduces the 2D-1D-2D conversion overhead in a measurable way. I also had to initialize the matrix and find a way to read in OP's mind what the value of N could be. I also added some timer readings for benchmarking. Apart from this, the logic of the code is unchanged.
Now a benchmark carried out on a decent workstation. It lists the percentage of time the conversion takes relative to the total time taken by matrix_inverse_lapack. In other words, I measure the conversion overhead:
N = 10, 3.5%
N = 30, 1.5%
N = 100, 1%
N = 300, 0.5%
N = 1000, 0.35%
N = 3000, 0.1%
The time taken by Lapack nicely scales as N3, as expected (data not shown). The time to invert a matrix is about 16 seconds for N = 3000, and about 5-6 s (5 microseconds) for N = 10.
I assume the overhead of even 3% is completely acceptable. I believe OP uses matrices of size larger then 100, in which case the overhead at or below 1% is certainly acceptable.
So what OP (or anyone having a similar problem) could have done wrong to obtain "unacceptable overhead conversion values"? Here's my short list
Improper compilation
Improper matrix initialization (for tests)
Improper benchmarking
1. Improper compilation
If one forgets to compile in Release mode, one ends up with optimized Lapacke competing with unoptimized conversion. On my machine this peaks at an 33% overhead for N = 20.
2. Improper matrix initialization (for tests)
If one initializes the matrix like this:
for (auto &v : CO_CL)
{
for (auto &x : v)
{
x = idx; // rather than, eg., idx + 1.0/idx
idx++;
}
}
then the matrix is singular, lapack returns quite quickly with the status different from 0. This increases the relative importance of the conversion part. But singular matrices are not what one wants to invert (it's impossible to do).
3. Improper benchmarking
Here's an example of the program output for N = 10:
./a.out 10
Matrix size = 10
status: 0 0 Success
whole function: 0.000127658
LAPACKE matrix operations: 0.000126783
conversion: 0.685425%
status: 0 0 Success
whole function: 1.2497e-05
LAPACKE matrix operations: 1.2095e-05
conversion: 3.21677%
status: 0 0 Success
whole function: 1.0535e-05
LAPACKE matrix operations: 1.0197e-05
conversion: 3.20835%
status: 0 0 Success
whole function: 9.741e-06
LAPACKE matrix operations: 9.422e-06
conversion: 3.27482%
status: 0 0 Success
whole function: 9.939e-06
LAPACKE matrix operations: 9.618e-06
conversion: 3.2297%
One can see that the first call to lapack functions can take 10 times more time than the subsequent calls. This is quite a stable pattern, as if Lapack needed some time for self-initialization. It can affect the measurements for small N badly.
4. What else can be done?
OP apperas to believe that his approach to 2D arrays is good and Lapack is strange and old-fashionable in its packing a 2D array into a 1D array. No. It is Lapack who is right.
If one defines a 2D array as vector<vector<double>>, one obtains one advantage: code simplicity. This comes at a price. Each row of such a matrix is allocated separateley from the others. Thus, a matrix 100 by 100 may be stored in 100 completely different memory blocks. This has a bad impact on the cache (and prefetcher) utilization. Lapck (and other linear algebra packages) enforces compactification of the data in a single, continuous array. This is so to minimize cache and prefetcher misses. If OP had used such an approach from the very beginning, he would probably have gained more than 1-3% that they pay now for the conversion.
This compactification can be achieved in at least three ways.
Write a custom class for a 2D matrix, with the internal data stored in a 1D array and convenient access member funnctions (e.g.: operator ()), or find a library that does just that
Write a custom allocator for std::vector (or find a library). This allocator should allocate the memory from a preallocated 1D vector exactly matching the data storage pattern used by Lapack
Use std::vector<double*> and initailze the pointers with the addresses pointing at the appropriate elements of a preallocated 1D array.
Each of the above solutions forces some changes to the surrounding code, which OP might not want to do. All depends on the code complexity and expected performance gains.
EDIT: Alternative libraries
An alternative approach is to use a library that is known for being a highly optimzed one. Lapack by itself can be regardered as a standard interface with many implementations and it may happen that OP uses an unoptimized one. Which library to choose may depend on the hardware/software platform OP is interested in and may vary in time.
As for now (mid-2021) a decent suggestions are:
Lapack https://www.netlib.org/lapack/
Atlas https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Algebra_Software http://math-atlas.sourceforge.net/
OpenBlas https://www.openblas.net/
Magma https://developer.nvidia.com/magma
Plasma https://bitbucket.org/icl/plasma/src/main/
If OP uses martices of sizes at least 100, then GPU-oriented MAGMA might be worth trying.
An easier (installation, running) way might with a parallel CPU library, e.g. Plasma. Plsama is Lapack-compliant, it has been being developed by a large team of people, including Jack Dongarra, it also should be rather easy to compile it locally as it is provided with a CMake script.
An example how much a parallel CPU-based, multicore implementation can outperform a single-threaded implementation of the LU-decomposition can be found for example here: https://cse.buffalo.edu/faculty/miller/Courses/CSE633/Tummala-Spring-2014-CSE633.pdf (short answer: 5 to 15 times for matrices of size 1000).

Doesn't see any significant improvement while using parallel block in OpenMP C++

I am receiving an array of Eigen::MatrixXf and Eigen::Matrix4f in realtime. Both of these arrays are having an equal number of elements. All I am trying to do is just multiply elements of both the arrays together and storing the result in another array at the same index.
Please see the code snippet below-
#define COUNT 4
while (all_ok())
{
Eigen::Matrix4f trans[COUNT];
Eigen::MatrixXf in_data[COUNT];
Eigen::MatrixXf out_data[COUNT];
// at each iteration, new data is filled
// in 'trans' and 'in_data' variables
#pragma omp parallel num_threads(COUNT)
{
#pragma omp for
for (int i = 0; i < COUNT; i++)
out_data[i] = trans[i] * in_clouds[i];
}
}
Please note that COUNT is a constant. The size of trans and in_data is (4 x 4) and (4 x n) respectively, where n is approximately 500,000. In order to parallelize the for loop, I gave OpenMP a try as shown above. However, I don't see any significant improvement in the elapsed time of for loop.
Any suggestions? Any alternatives to perform the same operation, please?
Edit: My idea is to define 4 (=COUNT) threads wherein each of them is taking care of multiplication. In this way, we don't need to create threads every time, I guess!
Works for me using the following self-contained example, that is, I get a x4 speed up when enabling openmp:
#include <iostream>
#include <bench/BenchTimer.h>
using namespace Eigen;
const int COUNT = 4;
EIGEN_DONT_INLINE
void foo(const Matrix4f *trans, const MatrixXf *in_data, MatrixXf *out_data)
{
#pragma omp parallel for num_threads(COUNT)
for (int i = 0; i < COUNT; i++)
out_data[i] = trans[i] * in_data[i];
}
int main()
{
Eigen::Matrix4f trans[COUNT];
Eigen::MatrixXf in_data[COUNT];
Eigen::MatrixXf out_data[COUNT];
int n = 500000;
for (int i = 0; i < COUNT; i++)
{
trans[i].setRandom();
in_data[i].setRandom(4,n);
out_data[i].setRandom(4,n);
}
int tries = 3;
int rep = 1;
BenchTimer t;
BENCH(t, tries, rep, foo(trans, in_data, out_data));
std::cout << " " << t.best(Eigen::REAL_TIMER) << " (" << double(n)*4.*4.*4.*2.e-9/t.best() << " GFlops)\n";
return 0;
}
So 1) make sure you measure the wallclock time and not the CPU time, and 2) make sure that the products is the bottleneck and not filling in_data.
Finally, for maximal performance don't forget to enable AVX/FMA (e.g., with -march=native), and of course make sure to benchmark with compiler's optimization ON.
For the record, on my computer the above example takes 0.25s without openmp, and 0.065s with.
You need to specify -fopenmp during compilation and linking. But you will quickly hit the limit, where RAM access is stopping further speeding up. You really should have a look at vector intrinsics. Dependent on you CPU you could accelerate your operations to the size of your register divided by the size of your variable (float = 4). So if your processor supports say AVX, you'd be dealing with 8 floats at a time. If you need some inspiration, you're welcome to steal code from my medical image reconstruction library here:
https://github.com/kvahed/codeare/blob/master/src/matrix/SIMDTraits.hpp
The code does the whole shebang for float/double real and complex.

For loop performance and multithreaded performance questions

I was kind of bored so I wanted to try using std::thread and eventually measure performance of single and multithreaded console application. This is a two part question. So I started with a single threaded sum of a massive vector of ints (800000 of ints).
int sum = 0;
auto start = chrono::high_resolution_clock::now();
for (int i = 0; i < 800000; ++i)
sum += ints[i];
auto end = chrono::high_resolution_clock::now();
auto diff = end - start;
Then I added range based and iterator based for loop and measured the same way with chrono::high_resolution_clock.
for (auto& val : ints)
sum += val;
for (auto it = ints.begin(); it != ints.end(); ++it)
sum += *it;
At this point console output looked like:
index loop: 30.0017ms
range loop: 221.013ms
iterator loop: 442.025ms
This was a debug version, so I changed to release and the difference was ~1ms in favor of index based for. No big deal, but just out of curiosity: should there be a difference this big in debug mode between these three for loops? Or even a difference in 1ms in release mode?
I moved on to the thread creation, and tried to do a parallel sum of the array with this lambda (captured everything by reference so I could use vector of ints and a mutex previously declared) using index based for.
auto func = [&](int start, int total, int index)
{
int partial_sum = 0;
auto s = chrono::high_resolution_clock::now();
for (int i = start; i < start + total; ++i)
partial_sum += ints[i];
auto e = chrono::high_resolution_clock::now();
auto d = e - s;
m.lock();
cout << "thread " + to_string(index) + ": " << chrono::duration<double, milli>(d).count() << "ms" << endl;
sum += partial_sum;
m.unlock();
};
for (int i = 0; i < 8; ++i)
threads.push_back(thread(func, i * 100000, 100000, i));
Basically every thread was summing 1/8 of the total array, and the final console output was:
thread 0: 6.0004ms
thread 3: 6.0004ms
thread 2: 6.0004ms
thread 5: 7.0004ms
thread 4: 7.0004ms
thread 1: 7.0004ms
thread 6: 7.0004ms
thread 7: 7.0004ms
8 threads total: 53.0032ms
So I guess the second part of this question is what's going on here? Solution with 2 threads ended with ~30ms as well. Cache ping pong? Something else? If I'm doing something wrong, what would be the correct way to do it? Also if It's relevant, I was trying this on an i7 with 8 threads, so yes I know I didn't count the main thread, but tried it with 7 separate threads and pretty much got the same result.
EDIT: Sorry forgot the mention this was on Windows 7 with Visual Studio 2013 and Visual Studio's v120 compiler or whatever it's called.
EDIT2: Here's the whole main function:
http://pastebin.com/HyZUYxSY
With optimisation not turned on, all the method calls that are performed behind the scenes are likely real method calls. Inline functions are likely not inlined but really called. For template code, you really need to turn on optimisation to avoid that all the code is taken literally. For example, it's likely that your iterator code will call iter.end () 800,000 times, and operator!= for the comparison 800,000 times, which calls operator== and so on and so on.
For the multithreaded code, processors are complicated. Operating systems are complicated. Your code isn't alone on the computer. Your computer can change its clock speed, change into turbo mode, change into heat protection mode. And rounding the times to milliseconds isn't really helpful. Could be one thread to 6.49 milliseconds and another too 6.51 and it got rounded differently.
should there be a difference this big in debug mode between these three for loops?
Yes. If allowed, a decent compiler can produce identical output for each of the 3 different loops, but if optimizations are not enabled, the iterator version has more function calls and function calls have certain overhead.
Or even a difference in 1ms in release mode?
Your test code:
start = ...
for (auto& val : ints)
sum += val;
end = ...
diff = end - start;
sum = 0;
Doesn't use the result of the loop at all so when optimized, the compiler should simply choose to throw away the code resulting in something like:
start = ...
// do nothing...
end = ...
diff = end - start;
For all your loops.
The difference of 1ms may be produced by high granularity of the "high_resolution_clock" in the used implementation of the standard library and by differences in process scheduling during the execution. I measured the index based for being 0.04 ms slower, but that result is meaningless.
Aside from how std::thread is implemented on Windows I would to point your attention to your available execution units and context switching.
An i7 does not have 8 real execution units. It's a quad-core processor with hyper-threading. And HT does not magically double the available number of threads, no matter how it's advertised. It's a really clever system which tries to fit in instructions from an extra pipeline whenever possible. But in the end all instructions go through only four execution units.
So running 8 (or 7) threads is still more than your CPU can really handle simultaneously. That means your CPU has to switch a lot between 8 hot threads clamouring for calculation time. Top that off with several hundred more threads from the OS, admittedly most of which are asleep, that need time and you're left with a high degree of uncertainty in your measurements.
With a single threaded for-loop the OS can dedicate a single core to that task and spread the half-sleeping threads across the other three. This is why you're seeing such a difference between 1 thread and 8 threads.
As for your debugging questions: you should check if Visual Studio has Iterator checking enabled in debugging. When it's enabled every time an iterator is used it is bounds-checked and such. See: https://msdn.microsoft.com/en-us/library/aa985965.aspx
Lastly: have a look at the -openmp switch. If you enable that and apply the OpenMP #pragmas to your for-loops you can do away with all the manual thread creation. I toyed around with similar threading tests (because it's cool. :) ) and OpenMPs performance is pretty damn good.
For the first question, regarding the difference in performance between the range, iterator and index implementations, others have pointed out that in a non-optimized build, much which would normally be inlined may not be.
However there is an additional wrinkle: by default, in Debug builds, Visual Studio will use checked iterators. Access through a checked iterator is checked for safety (does the iterator refer to a valid element?), and consequently operations which use them, including the range-based iteration, are heavily penalized.
For the second part, I have to say that those durations seem abnormally long. When I run the code locally, compiled with g++ -O3 on a core i7-4770 (Linux), I get sub-millisecond timings for each method, less in fact than the jitter between runs. Altering the code to iterate each test 1000 times gives more stable results, with the per test times being 0.33 ms for the index and range loops with no extra tweaking, and about 0.15 ms for the parallel test.
The parallel threads are doing in total the same number of operations, and what's more, using all four cores limits the CPU's ability to dynamically increase its clock speed. So how can it take less total time?
I'd wager that the gains result from better utilization of the per-core L2 caches, four in total. Indeed, using four threads instead of eight threads reduces the total parallel time to 0.11 ms, consistent with better L2 cache use.
Browsing the Intel processor documentation, all the Core i7 processors, including the mobile ones, have at least 4 MB of L3 cache, which will happily accommodate 800 thousand 4-byte ints. So I'm surprised both by the raw times being 100 times larger than I'm seeing, and the 8-thread time totals being so much greater, which as you surmise, is a strong hint that they are thrashing the cache. I'm presuming this is demonstrating just how suboptimal the Debug build code is. Could you post results from an optimised build?
Not knowing how those std::thread classes are implemented, one possible explanation for the 53ms could be:
The threads are started right away when they get instantiated. (I see no thread.start() or threads.StartAll() or alike). So, during the time the first thread instance gets active, the main thread might (or might not) be preempted. There is no guarantee that the threads are getting spawned on individual cores, after all (thread affinity).
If you have a closer look at POSIX APIs, there is the notion of "application context" and "system context", which basically implies, that there might be an OS policy in place which would not use all cores for 1 application.
On Windows (this is where you were testing), maybe the threads are not being spawned directly but via a thread pool, maybe with some extra std::thread functionality, which could produce overhead/delay. (Such as completion ports etc.).
Unfortunately my machine is pretty fast so I had to increase the amount of data processed to yield significant times. But on the upside, this reminded me to point out, that typically, it starts to pay off to go parallel when the computation times are way beyond the time of a time slice (rule of thumb).
Here my "native" Windows implementation, which - for a large enough array finally makes the threads win over a single threaded computation.
#include <stdafx.h>
#include <nativethreadTest.h>
#include <vector>
#include <cstdint>
#include <Windows.h>
#include <chrono>
#include <iostream>
#include <thread>
struct Range
{
Range( const int32_t *p, size_t l)
: data(p)
, length(l)
, result(0)
{}
const int32_t *data;
size_t length;
int32_t result;
};
static int32_t Sum(const int32_t * data, size_t length)
{
int32_t sum = 0;
const int32_t *end = data + length;
for (; data != end; data++)
{
sum += *data;
}
return sum;
}
static int32_t TestSingleThreaded(const Range& range)
{
return Sum(range.data, range.length);
}
DWORD
WINAPI
CalcThread
(_In_ LPVOID lpParameter
)
{
Range * myRange = reinterpret_cast<Range*>(lpParameter);
myRange->result = Sum(myRange->data, myRange->length);
return 0;
}
static int32_t TestWithNCores(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<HANDLE> threadHandles;
threadHandles.reserve(ncores);
for (size_t i = 0; i < ncores; ++i)
{
threadHandles.push_back(::CreateThread(NULL, 0, CalcThread, &ranges[i], 0, NULL));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
DWORD waitResult = ::WaitForMultipleObjects((DWORD)threadHandles.size(), &threadHandles[0], TRUE, INFINITE);
if (WAIT_OBJECT_0 == waitResult)
{
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
}
else
{
throw std::runtime_error("Something went horribly - HORRIBLY wrong!");
}
for (auto& h : threadHandles)
{
::CloseHandle(h);
}
return result;
}
static int32_t TestWithSTLThreads(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<std::thread> threads;
for (size_t i = 0; i < ncores; ++i)
{
threads.push_back(std::thread([](Range* range){ range->result = Sum(range->data, range->length); }, &ranges[i]));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
for (auto& t : threads)
{
t.join();
}
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
return result;
}
void TestNativeThreads()
{
const size_t DATA_SIZE = 800000000ULL;
typedef std::vector<int32_t> DataVector;
DataVector data;
data.reserve(DATA_SIZE);
for (size_t i = 0; i < DATA_SIZE; ++i)
{
data.push_back(static_cast<int32_t>(i));
}
Range r = { data.data(), data.size() };
std::chrono::system_clock::time_point singleThreadedStart = std::chrono::high_resolution_clock::now();
int32_t result = TestSingleThreaded(r);
std::chrono::system_clock::time_point singleThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Single threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(singleThreadedEnd - singleThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point multiThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithNCores(r, 8);
std::chrono::system_clock::time_point multiThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Multi threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(multiThreadedEnd - multiThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point stdThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithSTLThreads(r, 8);
std::chrono::system_clock::time_point stdThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "std::thread sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(stdThreadedEnd - stdThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
}
Here the output on my machine of this code:
Single threaded sum: 382ms. Result = -532120576
Multi threaded sum: 234ms. Result = -532120576
std::thread sum: 245ms. Result = -532120576
Press any key to continue . . ..
Last not least, I feel urged to mention that the way this code is written it is rather a memory IO performance benchmark than a core CPU computation benchmark.
Better computation benchmarks would use small amounts of data which is local, fits into CPU caches etc.
Maybe it would be interesting to experiment with the splitting of the data into ranges. What if each thread were "jumping" over the data from the start to an end with a gap of ncores? Thread 1: 0 8 16... Thread 2: 1 9 17 ... etc.? Maybe then the "locality" of the memory could gain extra speed.