OpenMP optimisation of dynamic array access - C++

I am trying to measure the speedup of a parallel section using one or four threads. As my parallel section is relatively simple, I expect a near-fourfold speedup. (This is a follow-up to my question:
openMp: severe perfomance loss when calling shared references of dynamic arrays)
Since my parallel section runs only twice as fast on four cores compared to one, I believe I still have not found the reason for the performance loss.
I want to parallelise my function iter as well as possible. The function uses entries of dynamic arrays and private quantities to change the entries of other dynamic arrays. Because every iteration step only uses the array entries of the respective loop step, different threads never access the same array entry. Furthermore, I put some thought into false sharing caused by accessing entries in the same cache line. My guess is that this is a minor effect: my double arrays are 5*10^6 entries long, and by choosing a reasonable chunk size for the schedule(dynamic,chunk) clause, I don't expect the few entries sharing a given cache line to be accessed at the same time by different threads. In my simulation I have about 80 such arrays, so allocating them on the stack is not practical, and making private copies for every thread is out of the question too.
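For concreteness, here is a minimal sketch of the kind of chunked schedule I have in mind (assuming 64-byte cache lines; the chunk value is arbitrary and would need tuning):

const int doublesPerCacheLine = 64 / sizeof(double); // 8 doubles per 64-byte line (assumption)
const int chunk = 1024 * doublesPerCacheLine;        // arbitrary multiple of the cache-line width

#pragma omp parallel for schedule(dynamic, chunk)
for (int i = 0; i < length; i++) {
    double alpha = 15 * A[i];
    D[i] = C[i] * alpha + B[i] * alpha * alpha;
}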
Does anybody have an idea how to improve this? I want to fully understand why this is so slow before turning to compiler optimisations.
What also surprised me: calling iter(parallel) with parallel = false is slower than calling it with parallel = true and omp_set_num_threads(1).
main.cpp:
#include <cstdio>
#include "mathClass.h"

int main(){
    mathClass m;
    m.fillArrays();
    double timeCount = 0.0;
    for(int j = 0; j<1000; j++){
        timeCount += m.iter(true);
    }
    printf("mean time difference = %fms\n", timeCount);
    return 0;
}
mathClass.h:
#pragma once

class mathClass{
private:
    double* A;
    double* B;
    double* C;
    int length;
public:
    double* D;
    mathClass();
    double iter(bool parallel);
    void fillArrays();
};
mathClass.cpp:
#include <cstdlib>
#include <omp.h>
#include "mathClass.h"

mathClass::mathClass(){
    length = 5000000;
    A = new double[length];
    B = new double[length];
    C = new double[length];
    D = new double[length];
}

void mathClass::fillArrays(){
    int temp;
    for ( int i=0; i<length; i++){
        temp = rand() % 100;
        A[i] = double(temp);
        temp = rand() % 100;
        B[i] = double(temp);
        temp = rand() % 100;
        C[i] = double(temp);
    }
}
double mathClass::iter(bool parallel){
    double startTime;
    double endTime;
    omp_set_num_threads(4);

    startTime = omp_get_wtime();
    #pragma omp parallel if(parallel)
    {
        int alpha; // private in all threads
        #pragma omp for schedule(static)
        for (int i=0; i<length; i++){
            alpha = 15*A[i];
            D[i] = C[i]*alpha + B[i]*alpha*alpha;
        }
    }
    endTime = omp_get_wtime();

    return endTime - startTime;
}

Related

Need Help Understanding OpenMP Matrix Multiplication C++ code

Here is my matrix multiplication C++ OpenMP code that I have written. I am trying to use OpenMP to optimize the program. The sequential code took 7 seconds, but when I added the OpenMP statements it only got 3 seconds faster. I thought it was going to get much faster and don't understand if I'm doing it right.
The OpenMP statements are in the fill_random function and in the matrix multiplication triple for loop section in main.
I would appreciate any help or advice you can give to understand this!
#include <iostream>
#include <cassert>
#include <cstdlib>
#include <omp.h>
#include <chrono>

using namespace std::chrono;

double** fill_random(int rows, int cols )
{
    double** mat = new double* [rows]; //Allocate rows.
    #pragma omp parallell collapse(2)
    for (int i = 0; i < rows; ++i)
    {
        mat[i] = new double[cols]; // added
        for( int j = 0; j < cols; ++j)
        {
            mat[i][j] = rand() % 10;
        }
    }
    return mat;
}
double** create_matrix(int rows, int cols)
{
    double** mat = new double* [rows]; //Allocate rows.
    for (int i = 0; i < rows; ++i)
    {
        mat[i] = new double[cols](); //Allocate each row and zero initialize..
    }
    return mat;
}

void destroy_matrix(double** &mat, int rows)
{
    if (mat)
    {
        for (int i = 0; i < rows; ++i)
        {
            delete[] mat[i]; //delete each row..
        }
        delete[] mat; //delete the rows..
        mat = nullptr;
    }
}

int main()
{
    int rowsA = 1000; // number of rows
    int colsA = 1000; // number of columns
    double** matA = fill_random(rowsA, colsA);

    int rowsB = 1000; // number of rows
    int colsB = 1000; // number of columns
    double** matB = fill_random(rowsB, colsB);

    //Checking matrix multiplication qualification
    assert(colsA == rowsB);

    double** matC = create_matrix(rowsA, colsB);

    //measure the multiply only
    const auto start = high_resolution_clock::now();

    //Multiplication
    #pragma omp parallel for
    for(int i = 0; i < rowsA; ++i)
    {
        for(int j = 0; j < colsB; ++j)
        {
            for(int k = 0; k < colsA; ++k) //ColsA..
            {
                matC[i][j] += matA[i][k] * matB[k][j];
            }
        }
    }

    const auto stop = high_resolution_clock::now();
    const auto duration = duration_cast<seconds>(stop - start);
    std::cout << "Time taken by function: " << duration.count() << " seconds" << std::endl;

    //Clean up..
    destroy_matrix(matA, rowsA);
    destroy_matrix(matB, rowsB);
    destroy_matrix(matC, rowsA);
    return 0;
}
Your problem is rather small.
The collapse in the matrix creation does nothing because the loops are not perfectly nested. On the other hand, in the multiplication routine you should add a collapse(2) directive.
Creating a matrix with an array of pointers means that the expression matB[k][j] dances all over memory. Allocate your matrices as a single array and then use i*N+j as an indexing expression. (Of course I would put that in a macro or so.)
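A minimal sketch of what that could look like (fill_random_flat and the explicit index arithmetic are illustrative, not from the question):

// Contiguous allocation; "i * cols + j" plays the role of the i*N+j macro.
double* fill_random_flat(int rows, int cols)
{
    double* mat = new double[(size_t)rows * cols];
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            mat[(size_t)i * cols + j] = rand() % 10;
    return mat;
}

// Element access then becomes matB[k * colsB + j] instead of matB[k][j].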
A 1000x1000 matrix of double (64-bit) elements requires 8 MB of data. When you multiply two matrices, you read 16 MB of data; writing the third matrix brings the total accessed to 24 MB.
If the L3 cache is smaller than 24 MB, then RAM is the bottleneck. A single thread may not have used the full bandwidth, but once OpenMP is used, the RAM bandwidth is saturated. In your case there was only about 50% headroom in bandwidth.
The naive version does not use the cache well. You need to swap the order of the two inner loops to get better caching:

loop i
    loop k
        loop j
            C[i][j] += A[i][k] * B[k][j]

Although the accumulation into C no longer reuses a register in this reordered version, it reuses cache, which matters more in this case. If you do this, you should get roughly 100-200 milliseconds of computation time even single-threaded.
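In terms of the question's variables, a rough sketch of that reordering (not a tuned implementation) could look like this:

// i-k-j order: matA[i][k] is loop-invariant in the innermost loop,
// and matB[k][...] and matC[i][...] are both walked sequentially.
#pragma omp parallel for
for (int i = 0; i < rowsA; ++i)
    for (int k = 0; k < colsA; ++k)
    {
        const double aik = matA[i][k];
        for (int j = 0; j < colsB; ++j)
            matC[i][j] += aik * matB[k][j];
    }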
Also, if you need performance, don't do this:
//Allocate each row and zero initialize..
Allocate the whole matrix at once so that your matrix is not scattered in memory.
To use more threads efficiently, you can compute the full matrix multiplication from sub-matrix multiplications. Scan-line multiplication is not good for load balancing between threads. Multiplying sub-matrices gives a better load distribution, thanks to caching and more floating-point operations per element fetched from memory.
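Combining both ideas (flat matrices plus sub-matrix blocking), a hedged sketch of the tiled loop nest might look like the following; TILE is a made-up tuning parameter and std::min needs <algorithm>:

const int TILE = 64; // illustrative tile size, tune for your cache
#pragma omp parallel for collapse(2)
for (int ii = 0; ii < rowsA; ii += TILE)
    for (int jj = 0; jj < colsB; jj += TILE)
        for (int kk = 0; kk < colsA; kk += TILE)
            for (int i = ii; i < std::min(ii + TILE, rowsA); ++i)
                for (int k = kk; k < std::min(kk + TILE, colsA); ++k)
                {
                    const double aik = matA[i * colsA + k];
                    for (int j = jj; j < std::min(jj + TILE, colsB); ++j)
                        matC[i * colsB + j] += aik * matB[k * colsB + j];
                }

Each thread owns whole tiles of matC (the two collapsed outer loops), so no two threads write the same output element.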
Edit:
Swapping the order of the loops also lets the compiler vectorize the innermost loop, because one of the input matrix elements becomes a constant inside that loop.

Faster access to random elements in a C++ array

What is the fastest way to access random (non-sequential) elements of an array if the access pattern is known beforehand? The access is random for different needs at every step, so rearranging the elements is an expensive option. The code below represents an important sample of the whole application.
#include <iostream>
#include <chrono>
#include <cstdlib>

#define NN 1000000

struct Astr{
    double x[3], v[3];
    int i, j, k;
    long rank, p, q, r;
};

int main ()
{
    struct Astr *key;
    key = new Astr[NN];
    int ii, *sequence;
    sequence = new int[NN]; // access pattern is stored here
    float frac;

    // create array of structs
    // create array for random numbers between 0 to NN to access 'key'
    for(int i=0; i < NN; i++){
        key[i].x[1] = static_cast<double>(i);
        key[i].p = static_cast<long>(i);
        frac = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
        sequence[i] = static_cast<int>(frac * static_cast<float>(NN));
    }

    // part to check and improve
    // =========================================Random=======================================================
    std::chrono::high_resolution_clock::time_point TstartMain = std::chrono::high_resolution_clock::now();
    double tmp;
    long rnk;
    for(int j=0; j < 1000; j++)
        for(int i=0; i < NN; i++){
            ii = sequence[i];
            tmp = key[ii].x[1];
            rnk = key[ii].p;
            key[ii].x[1] = tmp * 1.01;
            key[ii].p = rnk * 1.01;
        }
    std::chrono::high_resolution_clock::time_point TendMain = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>( TendMain - TstartMain );
    double time_uni = static_cast<double>(duration.count()) / 1000000;
    std::cout << "\n Random array access " << time_uni << "s \n";

    // ==========================================Sequential======================================================
    TstartMain = std::chrono::high_resolution_clock::now();
    for(int j=0; j < 1000; j++)
        for(int i=0; i < NN; i++){
            tmp = key[i].x[1];
            rnk = key[i].p;
            key[i].x[1] = tmp * 1.01;
            key[i].p = rnk * 1.01;
        }
    TendMain = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>( TendMain - TstartMain );
    time_uni = static_cast<double>(duration.count()) / 1000000;
    std::cout << " Sequential array access " << time_uni << "s \n";
    // ================================================================================================

    delete [] key;
    delete [] sequence;
}
As expected, sequential access is faster; the results on my machine are:
Random array access 21.3763s
Sequential array access 8.7755s
The main question is whether random access could be made any faster.
The improvement could be in the container itself (e.g. a list/vector rather than a raw array). Could software prefetching be implemented?
In theory it is possible to help guide the prefetcher to speed up random access (well, on those CPUs that support it, e.g. _mm_prefetch for Intel/AMD). In practice, however, this is often a complete waste of time and will more often than not slow down your code.
The general theory is that you pass a pointer to the _mm_prefetch intrinsic a loop iteration or two before using the value. There are, however, problems with this:
It is likely that you'll end up tuning the code for your CPU. When running that same code on other platforms, you'll probably find that different CPU cache layouts/sizes mean your prefetch optimisations now actually slow performance down.
The additional prefetch instructions will use up more of your instruction cache, and most likely your uop cache as well. You may find that this alone slows the code down.
This assumes the CPU actually pays attention to the _mm_prefetch instruction. It is only a hint, so there is no guarantee it will be respected by the CPU.
If you want to speed up random memory access, there are better methods than prefetching imho:
Reduce the size of the data (i.e. use shorts/float16s in place of int/float, eliminate any erroneous padding in your structs, etc). By reducing the size of the structs, you have less memory to read, so it will go quicker! (Simple compression schemes aren't a bad idea either!)
Sort your data so that, instead of doing random access, you process the data sequentially (a minimal sketch follows at the end of this answer).
Other than those two options, the best bet is to leave prefetching well alone and let the compiler do its thing with your random access (the only exception: you are optimising code for a ~2001 Pentium 4, where prefetching was basically required).
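The sorting idea, sketched minimally (names taken from the question's code; this only applies if the processing order of the known pattern may be permuted):

#include <algorithm>

// Sort the known access pattern once, so the later traversal of 'key'
// walks memory (mostly) monotonically instead of randomly.
void sort_access_pattern(int *sequence, int n) {
    std::sort(sequence, sequence + n);
}

// Usage, once before the timed loops:
//   sort_access_pattern(sequence, NN);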
To give an example of what #robthebloke says, the following code gives a ~15% improvement on my machine:
#include <immintrin.h>

void do_it(struct Astr *key, const int *sequence) {
    for(int i = 0; i < NN-8; ++i) {
        // prefetch the element we will need 8 iterations from now
        _mm_prefetch((const char*)(key + sequence[i+8]), _MM_HINT_NTA);
        struct Astr *ki = key + sequence[i];
        ki->x[1] *= 1.01;
        ki->p *= 1.01;
    }
    for(int i = NN-8; i < NN; ++i) {
        struct Astr *ki = key + sequence[i];
        ki->x[1] *= 1.01;
        ki->p *= 1.01;
    }
}

C++ OpenMP: Writing to a matrix inside of for loop slows down the for loop significantly

I have the following code. The bitCount function simply counts the number of set bits in a 64-bit integer. The test function is a simplified example of something I am doing in a more complicated piece of code; it replicates how writing to a matrix slows down the for loop significantly, and I am trying to figure out why it does so and whether there are any solutions to it.
#include <vector>
#include <cmath>
#include <cstdint>
#include <omp.h>

// Count the number of set bits
inline int bitCount(uint64_t n){
    int count = 0;
    while(n){
        n &= (n-1);
        count++;
    }
    return count;
}

void test(){
    int nthreads = omp_get_max_threads();
    omp_set_dynamic(0);
    omp_set_num_threads(nthreads);
    // I need a priority queue per thread
    std::vector<std::vector<double> > mat(nthreads, std::vector<double>(1000,-INFINITY));
    std::vector<uint64_t> vals(100,1);

    #pragma omp parallel for shared(mat,vals)
    for(int i = 0; i < 100000000; i++){
        std::vector<double> &tid_vec = mat[omp_get_thread_num()];
        int total_count = 0;
        for(unsigned int j = 0; j < vals.size(); j++){
            total_count += bitCount(vals[j]);
            tid_vec[j] = total_count; // if I comment out this line, performance increases drastically
        }
    }
}
This code runs in about 11 seconds. If I comment out the following line:
tid_vec[j] = total_count;
the code runs in about 2 seconds. Is there a reason why writing to a matrix in my case costs so much in performance?
Since you said nothing about your compiler/system specs, I'm assuming you are compiling with GCC and flags -O2 -fopenmp.
If you comment the line:
tid_vec[j] = total_count;
The compiler will optimize away all the computations whose result is not used. Therefore:
total_count += bitCount(vals[j]);
is optimized away too. If your application's main kernel is removed, it makes sense that the program runs much faster.
On the other hand, I would not implement a bit-count function myself but rather rely on functionality that is already provided to you. For example, GCC's builtin functions include __builtin_popcount, which does exactly what you are trying to do.
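For the 64-bit values used here, the long-long variant is the one to reach for; a minimal sketch (GCC/Clang builtin, assumed available):

#include <cstdint>

// Typically compiles to a single popcnt instruction when the target supports it.
inline int bitCount(uint64_t n) {
    return __builtin_popcountll(n);
}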
As a bonus: it is much better to work on private data than to have different threads work on different elements of a common array. It improves locality (especially important when access to memory is not uniform, aka NUMA) and may reduce access contention.
#pragma omp parallel shared(mat,vals)
{
    std::vector<double> local_vec(1000,-INFINITY);
    #pragma omp for
    for(int i = 0; i < 100000000; i++) {
        int total_count = 0;
        for(unsigned int j = 0; j < vals.size(); j++){
            total_count += bitCount(vals[j]);
            local_vec[j] = total_count;
        }
    }
    // Copy local_vec to mat[omp_get_thread_num()]
}

OpenMP C++ code is slower than sequential C++

I have the following piece of code. I run it on a sample of N=3000, and the sequential C++ code is faster by 3 seconds, which is not good at all.
This code fills the array jsd[N] with calculated values, and I want to locate the maximum value and its position.
So:
1- Is this OpenMP conversion correct, and is there any better suggestion to make it more professional?
2- Why is it slower than the equivalent C++ code? Also, the more threads I create, the slower it gets.
Thanks in advance
double maxval = 0;
int pos = -1;
double jsd[N];
#pragma omp parallel for num_threads(4)
for (int i = 0; i < N; i++) {
    double Hl = obj.function1(sequenceVctr, i, LEFT);
    double Hr = obj.function1(sequenceVctr, i, RIGHT);
    jsd[i] = obj.function2(H, i + 1, N, Hl, Hr);
    if (jsd[i] >= maxval) {
        #pragma omp critical
        {
            maxval = jsd[i];
            pos = i;
        }
    }
} // for
Update:
Here is the new code. It is still slow and gets even slower with more threads.
double maxval = 0;
int pos = -1;
double jsd[N];
#pragma omp parallel num_threads(50)
for (int i = 0; i < N; i++) {
    double Hl = obj.function1(sequenceVctr, i, LEFT);
    double Hr = obj.function1(sequenceVctr, i, RIGHT);
    jsd[i] = obj.function2(H, i + 1, N, Hl, Hr);
} // for
#pragma omp master
{
    vector<double> jsd2 (jsd, jsd+N);
    vector<double>::iterator jsditer;
    jsditer = std::max_element(jsd2.begin(), jsd2.end());
    maxval = *jsditer;
    pos = std::distance(jsd2.begin(), jsditer);
    // cout<<"pos"<<pos<<endl;
}
#pragma omp barrier
The first optimization I would suggest is to compute all the jsd values in the loop first, and only then find the maximum element via std::max_element().
This way you are not forcing the threads to synchronise.
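A rough sketch of that suggestion, reusing the question's names (obj, sequenceVctr, H, LEFT and RIGHT are assumed to exist exactly as in the original code):

#include <algorithm>

double maxval = 0;
int pos = -1;
double jsd[N];

// Fill jsd in parallel; nothing shared is written besides distinct jsd[i],
// so no critical section is needed.
#pragma omp parallel for num_threads(4)
for (int i = 0; i < N; i++) {
    double Hl = obj.function1(sequenceVctr, i, LEFT);
    double Hr = obj.function1(sequenceVctr, i, RIGHT);
    jsd[i] = obj.function2(H, i + 1, N, Hl, Hr);
}

// A sequential reduction over N = 3000 elements is cheap.
double *it = std::max_element(jsd, jsd + N);
maxval = *it;
pos = static_cast<int>(it - jsd);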
The second thing I would do is move over to Intel TBB instead of OpenMP and use parallel_reduce().
But the biggest question is how expensive the objective functions you are evaluating are.

misusing OpenMP?

I have a program that uses OpenMP to parallelize a for loop. Inside the loop, the threads write to shared variables, so I need to synchronize them. However, I sometimes get either a segmentation fault or a "double free or corruption" error. Does anyone know what is happening? Thanks and regards! Here is the code:
void KNNClassifier::classify_various_k(int dim, double *feature, int label, int *ks, double *errors, int nb_ks, int k_max) {
    ANNpoint queryPt = 0;
    ANNidxArray nnIdx = 0;
    ANNdistArray dists = 0;
    queryPt = feature;
    nnIdx = new ANNidx[k_max];
    dists = new ANNdist[k_max];

    if(strcmp(_search_neighbors, "brutal") == 0) { // search
        _search_struct->annkSearch(queryPt, k_max, nnIdx, dists, _eps);
    }else if(strcmp(_search_neighbors, "kdtree") == 0) {
        _search_struct->annkSearch(queryPt, k_max, nnIdx, dists, _eps); // double free or corruption
    }

    for (int j = 0; j < nb_ks; j++)
    {
        scalar_t result = 0.0;
        for (int i = 0; i < ks[j]; i++) {
            result += _labels[ nnIdx[i] ]; // Segmentation fault
        }
        if (result*label<0)
        {
            #pragma omp critical
            {
                errors[j]++;
            }
        }
    }

    delete [] nnIdx;
    delete [] dists;
}
void KNNClassifier::tune_complexity(int nb_examples, int dim, double **features, int *labels, int fold, char *method, int nb_examples_test, double **features_test, int *labels_test) {
    int nb_try = (_k_max - _k_min) / scalar_t(_k_step);
    scalar_t *error_validation = new scalar_t [nb_try];
    int *ks = new int [nb_try];
    for(int i=0; i < nb_try; i ++){
        ks[i] = _k_min + _k_step * i;
    }

    if (strcmp(method, "ct")==0)
    {
        train(nb_examples, dim, features, labels); // train once for all nb of nbs in ks
        for(int i=0; i < nb_try; i ++){
            if (ks[i] > nb_examples){ nb_try = i; break; }
            error_validation[i] = 0;
        }
        int i = 0;
        #pragma omp parallel shared(nb_examples_test, error_validation, features_test, labels_test, nb_try, ks) private(i)
        {
            #pragma omp for schedule(dynamic) nowait
            for (i=0; i < nb_examples_test; i++)
            {
                classify_various_k(dim, features_test[i], labels_test[i], ks, error_validation, nb_try, ks[nb_try - 1]); // where error occurs
            }
        }
        for (i=0; i < nb_try; i++)
        {
            error_validation[i] /= nb_examples_test;
        }
    }
    ......
}
UPDATE:
As in my last post, double free or corruption, the code runs fine single-threaded but gives runtime errors when multi-threaded. The error changes from time to time: if I run it twice, one run will segfault and the other will give double free or corruption.
Let's take a look at your segmentation fault line:
result+=_labels[ nnIdx[i] ];
result is local -- OK.
nnIdx is local -- also OK.
i is local -- still OK.
_labels ... what is it?
Is it global? Did you define access to it via #pragma shared?
Same goes for the former:
_search_struct->annkSearch(queryPt, k_max, nnIdx, dists, _eps);
It seems we have a problem here that is not easily solvable -- _search_struct is not thread-safe -- its internal values are probably modified by several threads at once. You need a dedicated _search_struct per thread, probably by allocating it in classify_various_k.
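Structurally, the per-thread idea would look roughly like the sketch below. SearchStruct, build_search_struct() and classify_one() are hypothetical stand-ins for however the tree is actually built and queried, and, as noted next, ANN's shared internal storage may defeat this anyway:

#pragma omp parallel
{
    // Hypothetical: each thread builds (or clones) its own search structure,
    // so concurrent annkSearch-style calls never touch the same object.
    SearchStruct *local_search = build_search_struct(/* training data */);

    #pragma omp for schedule(dynamic)
    for (int i = 0; i < nb_examples_test; i++) {
        classify_one(local_search, features_test[i], labels_test[i] /* ... */);
    }

    delete local_search;
}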
The really bad news, however, is that ANN is probably completely non-threadable:
The library allocates a small amount of storage, which is shared by all search structures built during the program's lifetime. Because the data is shared, it is not deallocated, even when all the individual structures are deleted.
As seen above there'll always be problems with parallel data modification, because the library itself has some shared data -- hence it's not thread-safe itself :/.