C++ Multithreading optimization - c++

I am new to threading in C++, but I've done enough reading to at least get what I'm working on to compile. So far it hasn't improved performance at all. Right now it creates as many threads as there are loop iterations, and I can imagine that quickly causing the system to thrash. Is there a better alternative than brute-force control of the number of threads? I also intend to run this on the WestGrid computing system, where I can specify the number of processors to use. What is the best way to set the number of threads to optimize for the number of processors?
void ExecuteCRTProcess(const long &numberOfRows, ZZ* rij, const ZZ &powRoh, const int &rowLength, long* PublicKey, ZZ* rQ0, const ZZ &Q0, ZZ* primes, const ZZ &productOfPrimes, ZZ* resultsArray, const bool IsItPrimeArray, long* caseStudy, long* caseStudyTracker, const ZZ &X0, const long &Roh){
int rc;
pthread_t threads[numberOfRows];
struct parameters ThreadParameters[numberOfRows];
for(int i = 0; i< numberOfRows ; i++){
FillR(rij,powRoh,rowLength); // fill up the vector rij with random numbers between 0 powRoh
MultiplyVectorByTwo(rij, rowLength, i, IsItPrimeArray); //Multiply rij vector by 2. If calculating Xi' also add 1 to appropriate values.
ThreadParameters[i].rij = rij;
ThreadParameters[i].rQ0 = rQ0;
ThreadParameters[i].primes = primes;
ThreadParameters[i].rowLength = rowLength;
ThreadParameters[i].Q0 = Q0;
ThreadParameters[i].i = i;
ThreadParameters[i].X0 = X0;
rc = pthread_create(&threads[i],NULL,CRTNew,(void *)&ThreadParameters[i]);
if(rc){
cout << "Error: unable to create thread, " << rc << endl;
exit(-1);
}
/*for(long j = 0; j< rowLength; j++){
cout << (resultsArray[i] % primes[j]) << " ";
}
cout << endl;*/
}
for(int i = 0; i< numberOfRows; i++){
pthread_join(threads[i], NULL);
resultsArray[i] = ThreadParameters[i].result;
}
}
The threads created run this function:
void* CRTNew(void *threadArg){
struct parameters *local_data;
local_data = (struct parameters *) threadArg;
ZZ a, p, A, P, crt;
long Z, Public;
a = local_data->rQ0[local_data->i];
p = local_data->Q0;
A = local_data->rij[0];
P = local_data->primes[0];
for(int i = 1; i<=local_data->rowLength; i++){
A = A%P;
Z = CRT(a, p, A, P);
A = local_data->rij[i]; P = local_data->primes[i];
if(i == local_data->rowLength) Public = Z;
}
if(a < 0) crt = a+p;
else crt = a%p;
local_data->result = crt%local_data->X0;
pthread_exit(NULL);
}

"What is the best way to set the number of threads to optimize for the number of processors?"
1) Create [number of cores] threads (or maybe a few more) at app startup.
2) Never create any more threads.
3) Never let the threads terminate.
4) Have them wait for work tasks on a producer-consumer queue, in the manner of a pool.
Alternatively, use a thread pool class or an equivalent parallel language feature that already works.
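The four steps above can be sketched roughly as follows. This is a minimal illustration, not a production pool: `ThreadPool`, `submit`, and `defaultWorkerCount` are hypothetical names chosen for the example, and it assumes C++11 `std::thread` rather than raw pthreads. On a cluster where the processor count is specified for the job, that number could be passed to the constructor instead of the detected core count.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: a fixed set of workers pulling tasks from one queue.
class ThreadPool {
public:
    explicit ThreadPool(unsigned workerCount) {
        assert(workerCount > 0);
        for (unsigned i = 0; i < workerCount; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    // Lets queued tasks finish, then joins every worker.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread &w : workers_) w.join();
    }

    // Producer side: enqueue a task and wake one sleeping worker.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

private:
    // Consumer side: sleep until there is work, run it, repeat.
    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to run
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can dequeue
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool stopping_ = false;
};

// hardware_concurrency() may return 0 when the count is unknown; fall back to 1.
unsigned defaultWorkerCount() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}
```

In the question's setting, each row would become one submitted task instead of one new thread, so the number of running threads stays fixed no matter how large numberOfRows gets.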

Related

Multiple threads taking more time than single process [duplicate]

This question already has answers here:
C: using clock() to measure time in multi-threaded programs
I am implementing a pattern-matching algorithm that slides the template's gradient information over the target's entire gradient image, at every rotation from -60 to 60 degrees. The template information for each rotation is already saved, i.e. 121 templates are preprocessed and stored.
The issue is that this takes a lot of time (approx. 110 ms), so I decided to split the matching by rotation range (-60 to -30, -30 to 0, 0 to 30 and 30 to 60) across 4 threads, but the threaded version takes more time than the single-threaded one (approx. 115 ms to 120 ms).
A snippet of the code:
#define MAXTARGETNUM 64
MatchResultA totalResultsTemp[MAXTARGETNUM];
void CShapeMatch::match(ShapeInfo *ShapeInfoVec, search_region SearchRegion, float MinScore, float Greediness, int width,int height, int16_t *pBufGradX ,int16_t *pBufGradY,float *pBufMag, bool corr)
{
MatchResultA resultsPerDeg[MAXTARGETNUM];
....
....
int startX = SearchRegion.StartX;
int startY = SearchRegion.StartY;
int endX = SearchRegion.EndX;
int endY = SearchRegion.EndY;
float AngleStep = SearchRegion.AngleStep;
float AngleStart = SearchRegion.AngleStart;
float AngleStop = SearchRegion.AngleStop;
int startIndex = (int)(ShapeInfoVec[0].AngleNum/2) + ShapeInfoVec[0].AngleNum%2+(int)AngleStart/AngleStep;
int stopIndex = (int)(ShapeInfoVec[0].AngleNum/2) + ShapeInfoVec[0].AngleNum%2+(int)AngleStop/AngleStep;
for (int k = startIndex; k < stopIndex ; k++){
....
for(int j = startY; j < endY; j++){
for(int i = startX; i < endX; i++){
for(int m = 0; m < ShapeInfoVec[k].NoOfCordinates; m++)
{
curX = i + (ShapeInfoVec[k].Coordinates + m)->x; // template X coordinate
curY = j + (ShapeInfoVec[k].Coordinates + m)->y ; // template Y coordinate
iTx = *(ShapeInfoVec[k].EdgeDerivativeX + m); // template X derivative
iTy = *(ShapeInfoVec[k].EdgeDerivativeY + m); // template Y derivative
iTm = *(ShapeInfoVec[k].EdgeMagnitude + m); // template gradients magnitude
if(curX < 0 ||curY < 0||curX > width-1 ||curY > height-1)
continue;
offSet = curY*width + curX;
iSx = *(pBufGradX + offSet); // get corresponding X derivative from source image
iSy = *(pBufGradY + offSet); // get corresponding Y derivative from source image
iSm = *(pBufMag + offSet);
if (PartialScore > MinScore)
{
float Angle = ShapeInfoVec[k].Angel;
bool hasFlag = false;
for(int n = 0; n < resultsNumPerDegree; n++)
{
if(abs(resultsPerDeg[n].CenterLocX - i) < 5 && abs(resultsPerDeg[n].CenterLocY - j) < 5)
{
hasFlag = true;
if(resultsPerDeg[n].ResultScore < PartialScore)
{
resultsPerDeg[n].Angel = Angle;
resultsPerDeg[n].CenterLocX = i;
resultsPerDeg[n].CenterLocY = j;
resultsPerDeg[n].ResultScore = PartialScore;
break;
}
}
}
if(!hasFlag)
{
resultsPerDeg[resultsNumPerDegree].Angel = Angle;
resultsPerDeg[resultsNumPerDegree].CenterLocX = i;
resultsPerDeg[resultsNumPerDegree].CenterLocY = j;
resultsPerDeg[resultsNumPerDegree].ResultScore = PartialScore;
resultsNumPerDegree ++;
}
minScoreTemp = minScoreTemp < PartialScore ? PartialScore : minScoreTemp;
}
}
}
for(int i = 0; i < resultsNumPerDegree; i++)
{
mtx.lock();
totalResultsTemp[totalResultsNum] = resultsPerDeg[i];
totalResultsNum++;
mtx.unlock();
}
n++;
}
void CallerFunction(){
int16_t *pBufGradX = (int16_t *) malloc(bufferSize * sizeof(int16_t));
int16_t *pBufGradY = (int16_t *) malloc(bufferSize * sizeof(int16_t));
float *pBufMag = (float *) malloc(bufferSize * sizeof(float));
clock_t start = clock();
float temp_stop = SearchRegion->AngleStop;
SearchRegion->AngleStop = -30;
thread t1(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness, width, height, pBufGradX ,pBufGradY,pBufMag, corr);
SearchRegion->AngleStart = -30;
SearchRegion->AngleStop=0;
thread t2(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness, width, height, pBufGradX ,pBufGradY,pBufMag, corr);
SearchRegion->AngleStart = 0;
SearchRegion->AngleStop=30;
thread t3(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness,width, height, pBufGradX ,pBufGradY,pBufMag, corr);
SearchRegion->AngleStart = 30;
SearchRegion->AngleStop=temp_stop;
thread t4(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness,width, height, pBufGradX ,pBufGradY,pBufMag, corr);
t1.join();
t2.join();
t3.join();
t4.join();
clock_t end = clock();
cout << 1000*(double)(end-start)/CLOCKS_PER_SEC << endl;
}
As you can see, there are plenty of heap accesses, but they are all read-only. Only totalResultsTemp and totalResultsNum are shared global resources that are written to.
My PC configuration is:
i5-7200U CPU @ 2.50GHz, 4 cores
4 GB RAM
Ubuntu 18
for(int i = 0; i < resultsNumPerDegree; i++)
{
mtx.lock();
totalResultsTemp[totalResultsNum] = resultsPerDeg[i];
totalResultsNum++;
mtx.unlock();
}
You are writing into a static array under a mutex, and mutexes are time-consuming. Instead of locking, try using a std::atomic_int for the index or, even better in my opinion, pass the function the exact place where it should store its results, so that synchronization is no longer your problem.
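The second suggestion, giving each thread its own place to store results, could look roughly like the sketch below. `Match`, `scanRange`, and `matchAllAngles` are hypothetical stand-ins for the question's MatchResultA and CShapeMatch::match, and the score computation is only a placeholder. The point is the structure: one output vector per thread, merged once after all joins.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for MatchResultA.
struct Match {
    int angle;
    float score;
};

// Hypothetical worker: scans [angleBegin, angleEnd) and appends only to its own
// output vector, so no lock is needed while it runs.
void scanRange(int angleBegin, int angleEnd, float minScore, std::vector<Match> &out) {
    for (int a = angleBegin; a < angleEnd; ++a) {
        float score = 1.0f / (1.0f + static_cast<float>(a < 0 ? -a : a));  // placeholder score
        if (score > minScore) out.push_back({a, score});
    }
}

std::vector<Match> matchAllAngles(int angleStart, int angleStop, int threadCount, float minScore) {
    assert(threadCount > 0 && angleStart <= angleStop);
    // One result vector per thread, sized up front so the references stay valid.
    std::vector<std::vector<Match>> perThread(threadCount);
    std::vector<std::thread> threads;
    const int total = angleStop - angleStart;
    for (int t = 0; t < threadCount; ++t) {
        int begin = angleStart + total * t / threadCount;
        int end = angleStart + total * (t + 1) / threadCount;
        threads.emplace_back(scanRange, begin, end, minScore, std::ref(perThread[t]));
    }
    for (std::thread &th : threads) th.join();

    // Single-threaded merge after all workers are done: no mutex required.
    std::vector<Match> merged;
    for (const std::vector<Match> &part : perThread)
        merged.insert(merged.end(), part.begin(), part.end());
    return merged;
}
```

Because the merge happens after the joins, the shared array and its counter are touched by one thread only.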
POSIX threads in C/C++ are not concurrent, since the time the operating system assigns to each parent process must be split among the threads it has. Thus, your algorithm is executing on only one core. To leverage multiple cores, you must use OpenMP. This library lets you split your algorithm across different physical cores. This is a good OpenMP tutorial
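Whichever threading API is used, the measurement itself deserves a second look, as the linked duplicate about clock() suggests. On Linux, clock() reports processor time consumed by the whole process, summed over all of its threads, so four busy threads can show about the same or a larger figure even when the elapsed wall-clock time has dropped. Threads created with std::thread or pthreads there are ordinary kernel-scheduled threads and can run on different cores. A minimal sketch that records both figures side by side, with `spin`, `Timing`, and `timeThreads` as hypothetical names:

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>
#include <vector>

// Hypothetical helper: burn CPU for roughly the given wall-clock duration.
void spin(std::chrono::milliseconds duration) {
    auto until = std::chrono::steady_clock::now() + duration;
    volatile unsigned long sink = 0;
    while (std::chrono::steady_clock::now() < until) sink = sink + 1;
}

struct Timing {
    double wallMs;  // elapsed real time from steady_clock
    double cpuMs;   // processor time reported by clock()
};

Timing timeThreads(int threadCount, std::chrono::milliseconds perThread) {
    assert(threadCount > 0);
    auto wallStart = std::chrono::steady_clock::now();
    std::clock_t cpuStart = std::clock();

    std::vector<std::thread> threads;
    for (int i = 0; i < threadCount; ++i) threads.emplace_back(spin, perThread);
    for (std::thread &t : threads) t.join();

    std::clock_t cpuEnd = std::clock();
    auto wallEnd = std::chrono::steady_clock::now();
    return {std::chrono::duration<double, std::milli>(wallEnd - wallStart).count(),
            1000.0 * static_cast<double>(cpuEnd - cpuStart) / CLOCKS_PER_SEC};
}
```

On a multi-core machine, the CPU figure for several spinning threads would be expected to exceed the wall-clock figure. For judging a speed-up, the wall-clock number is the one to compare.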

Concurrent Threads Slower than Single Thread

I've been comparing two ways of finding the highest value in a matrix (if the highest value is duplicated, one of the duplicates is chosen at random): single-threaded vs. multi-threaded. Typically, the multi-threaded version should be faster, assuming I coded it properly. It is not; it is far slower, so I can only assume I did something wrong. Can someone pinpoint what I did wrong?
Note: I know I shouldn't use rand(), but for this purpose I feel there aren't that many problems with doing so, I'll replace it with a mt19937_64 after this is working properly.
Thanks in advance!
double* RLPolicy::GetActionWithMaxQ(std::tuple<int, double*, int*, int*, int, double*>* state, long* selectedActionIndex, bool& isActionQZero)
{
const bool useMultithreading = true;
double* qIterator = Utilities::DiscretizeStateActionPairToStart(_world->GetCurrentStatePointer(), (long*)&(std::get<0>(*state)));
// Represents the action-pointer for which Q-values are duplicated
// Note: A shared_ptr is used instead of a unique_ptr since C++11 wont support unique_ptrs for pointers to pointers **
static std::shared_ptr<double*> duplicatedQValues(new double*[*_world->GetActionsNumber()], std::default_delete<double*>());
/*[](double** obj) {
delete[] obj;
});*/
static double* const defaultAction = _actionsListing.get();// [0];
double* actionOut = defaultAction; //default action
static double** const duplicatedQsDefault = duplicatedQValues.get();
if (!useMultithreading)
{
const double* const qSectionEnd = qIterator + *_world->GetActionsNumber() - 1;
double* largestValue = qIterator;
int currentActionIterator = 0;
long duplicatedIndex = -1;
do {
if (*qIterator > *largestValue)
{
largestValue = qIterator;
actionOut = defaultAction + currentActionIterator;
*selectedActionIndex = currentActionIterator;
duplicatedIndex = -1;
}
// duplicated value, map it
else if (*qIterator == *largestValue)
{
++duplicatedIndex;
*(duplicatedQsDefault + duplicatedIndex) = defaultAction + currentActionIterator;
}
++currentActionIterator;
++qIterator;
} while (qIterator != qSectionEnd);
// If duped (equal) values are found, select among them randomly with equal probability
if (duplicatedIndex >= 0)
{
*selectedActionIndex = (std::rand() % duplicatedIndex);
actionOut = *(duplicatedQsDefault + *selectedActionIndex);
}
isActionQZero = *largestValue == 0;
return actionOut;
}
else
{
static const long numberOfSections = 6;
unsigned int actionsPerSection = *_world->GetActionsNumber() / numberOfSections;
unsigned long currentSectionStart = 0;
static double* actionsListing = _actionsListing.get();
long currentFoundResult = FindActionWithMaxQInMatrixSection(qIterator, 0, actionsPerSection, duplicatedQsDefault, actionsListing);
static std::vector<std::future<long>> maxActions;
for (int i(0); i < numberOfSections - 1; ++i)
{
currentSectionStart += actionsPerSection;
maxActions.push_back(std::async(&RLPolicy::FindActionWithMaxQInMatrixSection, std::ref(qIterator), currentSectionStart, std::ref(actionsPerSection), std::ref(duplicatedQsDefault), actionsListing));
}
long foundActionIndex;
actionOut = actionsListing + currentFoundResult;
for (auto &f : maxActions)
{
f.wait();
foundActionIndex = f.get();
if (actionOut == nullptr)
actionOut = defaultAction;
else if (*(actionsListing + foundActionIndex) > *actionOut)
actionOut = actionsListing + foundActionIndex;
}
maxActions.clear();
return actionOut;
}
}
/*
Deploy a thread to find the action with the highest Q-value for the provided Q-Matrix section.
@return - The index of the action (on _actionListing) which contains the highest Q-value.
*/
long RLPolicy::FindActionWithMaxQInMatrixSection(double* qMatrix, long sectionStart, long sectionLength, double** dupListing, double* actionListing)
{
double* const matrixSectionStart = qMatrix + sectionStart;
double* const matrixSectionEnd = matrixSectionStart + sectionLength;
double** duplicatedSectionStart = dupListing + sectionLength;
static double* const defaultAction = actionListing;
long maxValue = sectionLength;
long maxActionIndex = 0;
double* qIterator = matrixSectionStart;
double* largestValue = matrixSectionStart;
long currentActionIterator = 0;
long duplicatedIndex = -1;
do {
if (*qIterator > *largestValue)
{
largestValue = qIterator;
maxActionIndex = currentActionIterator;
duplicatedIndex = -1;
}
// duplicated value, map it
else if (*qIterator == *largestValue)
{
++duplicatedIndex;
*(duplicatedSectionStart + duplicatedIndex) = defaultAction + currentActionIterator;
}
++currentActionIterator;
++qIterator;
} while (qIterator != matrixSectionEnd);
// If duped (equal) values are found, select among them randomly with equal probability
if (duplicatedIndex >= 0)
{
maxActionIndex = (std::rand() % duplicatedIndex);
}
return maxActionIndex;
}
Parallel programs are not necessarily faster than serial programs; there are both fixed and variable time costs to setting up parallel algorithms, and for small and/or simple problems, this parallel overhead can be greater than the cost of the serial algorithm as a whole. Examples of parallel overhead include thread spawning and synchronization, extra memory copying, and memory bus pressure. At around 2 microseconds for your serial program and around 500 microseconds for your parallel program, it's likely your matrix is small enough that the work of setting up the parallel algorithm overshadows the work of solving the matrix problem.
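One common way to act on this is to stay serial below some size and only split the work above it. The sketch below illustrates that idea under stated assumptions: `indexOfMax` and `kParallelThreshold` are hypothetical, the cutoff value is a guess that would have to be measured on the target machine, and the random tie-breaking from the question is left out (the first maximum wins).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical cutoff; the right value has to be measured on the target machine.
constexpr std::size_t kParallelThreshold = 1 << 20;

// Returns the index of the first largest element of q.
std::size_t indexOfMax(const std::vector<double> &q, std::size_t sections = 4) {
    assert(!q.empty() && sections > 0);
    // Small inputs: a plain serial scan avoids all thread start-up cost.
    if (q.size() < kParallelThreshold || sections == 1) {
        return static_cast<std::size_t>(std::max_element(q.begin(), q.end()) - q.begin());
    }
    // Large inputs: one asynchronous task per contiguous section.
    const std::size_t chunk = q.size() / sections;
    std::vector<std::future<std::size_t>> parts;
    for (std::size_t s = 0; s < sections; ++s) {
        std::size_t begin = s * chunk;
        std::size_t end = (s + 1 == sections) ? q.size() : begin + chunk;
        parts.push_back(std::async(std::launch::async, [&q, begin, end] {
            return static_cast<std::size_t>(
                std::max_element(q.begin() + begin, q.begin() + end) - q.begin());
        }));
    }
    // Combine the per-section winners; strict > keeps the earliest index on ties.
    std::size_t best = parts[0].get();
    for (std::size_t s = 1; s < parts.size(); ++s) {
        std::size_t candidate = parts[s].get();
        if (q[candidate] > q[best]) best = candidate;
    }
    return best;
}
```

Passing std::launch::async explicitly matters here: without a launch policy the implementation is allowed to defer the task and run it on the calling thread when get() is called.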

parallel for with omp gets stuck

I have a problem with the following code:
int *chosen_pts = new int[k];
std::pair<float, int> *dist2 = new std::pair<float, int>[x.n];
// initialize dist2
for (int i = 0; i < x.n; ++i) {
dist2[i].first = std::numeric_limits<float>::max();
dist2[i].second = i;
}
// choose the first point randomly
int ndx = 1;
chosen_pts[ndx - 1] = rand() % x.n;
double begin, end;
double elapsed_secs;
while (ndx < k) {
float sum_distribution = 0.0;
// look for the point that is furthest from any center
begin = omp_get_wtime();
#pragma omp parallel for reduction(+:sum_distribution)
for (int i = 0; i < x.n; ++i) {
int example = dist2[i].second;
float d2 = 0.0, diff;
for (int j = 0; j < x.d; ++j) {
diff = x(example,j) - x(chosen_pts[ndx - 1],j);
d2 += diff * diff;
}
if (d2 < dist2[i].first) {
dist2[i].first = d2;
}
sum_distribution += dist2[i].first;
}
end = omp_get_wtime() - begin;
std::cout << "center assigning -- "
<< ndx << " of " << k << " = "
<< (float)ndx / k * 100
<< "% is done. Elasped time: "<< (float)end <<"\n";
/**/
bool unique = true;
do {
// choose a random interval according to the new distribution
float r = sum_distribution * (float)rand() / (float)RAND_MAX;
float sum_cdf = dist2[0].first;
int cdf_ndx = 0;
while (sum_cdf < r) {
sum_cdf += dist2[++cdf_ndx].first;
}
chosen_pts[ndx] = cdf_ndx;
for (int i = 0; i < ndx; ++i) {
unique = unique && (chosen_pts[ndx] != chosen_pts[i]);
}
} while (! unique);
++ndx;
}
As you can see, I use OpenMP to parallelize the for loop. It works fine and I achieve a significant speed-up. However, if I increase the value of x.n above 20000000, the function stops working after 8-10 iterations:
It doesn't produce any output (std::cout)
Only one core works
No error whatsoever
If I comment out the do-while loop, it works again as expected: all cores are busy, there is output after each iteration, and I can increase x.n to over 100 million, just as I need.
It's not the OpenMP parallel for that gets stuck; it's obviously your serial do-while loop.
One particular issue I see is that there are no array bounds checks in the inner while loop that accesses dist2. In theory, out-of-bounds access should never happen, but in practice it may; see below for why. So first of all I would rewrite the calculation of cdf_ndx to guarantee that the loop ends once all elements have been inspected:
float sum_cdf = 0;
int cdf_ndx = 0;
while (sum_cdf < r && cdf_ndx < x.n ) {
sum_cdf += dist2[cdf_ndx].first;
++cdf_ndx;
}
Now, how can it happen that sum_cdf does not reach r? It is due to the specifics of floating-point arithmetic and the fact that sum_distribution was computed in parallel, while sum_cdf is computed serially. The problem is that the contribution of one element to the sum can be below the precision of a float; in other words, when you add two float values that differ by more than ~8 orders of magnitude, the smaller one does not affect the sum.
So, with 20M floats, after some point the next value to add may be so small compared to the accumulated sum_cdf that adding it does not change the sum at all! On the other hand, sum_distribution was essentially computed as several independent partial sums (one per thread) that were then combined. Thus it is more accurate, and possibly bigger than sum_cdf can ever reach.
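The effect is easy to reproduce in isolation. The sketch below assumes IEEE 754 single precision with round-to-nearest and no fast-math reordering; `naiveSum` and `blockedSum` are hypothetical helpers. Summing 20 million values of 1.0f one by one stalls at 2^24 = 16777216, because 16777217 is not representable as a float, while summing in blocks keeps every intermediate value exactly representable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive running sum in float: small terms are lost once the sum is large.
float naiveSum(const std::vector<float> &v) {
    float sum = 0.0f;
    for (float x : v) sum += x;
    return sum;
}

// Blocked sum: sum each block separately, then add the block totals, which is
// roughly what a parallel reduction with per-thread partial sums does.
float blockedSum(const std::vector<float> &v, std::size_t blockSize) {
    assert(blockSize > 0);
    float total = 0.0f;
    for (std::size_t start = 0; start < v.size(); start += blockSize) {
        std::size_t end = start + blockSize < v.size() ? start + blockSize : v.size();
        float block = 0.0f;
        for (std::size_t i = start; i < end; ++i) block += v[i];
        total += block;
    }
    return total;
}
```

This mirrors the situation in the question: the serial running sum can fall short of a total that was accumulated as separate partial sums.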
A solution can be to compute sum_cdf in portions, having two nested loops. For example:
float sum_cdf = 0;
int cdf_ndx = 0;
while (sum_cdf < r && cdf_ndx < x.n ) {
float block_sum = 0;
int block_end = min(cdf_ndx+10000, x.n); // 10000 is arbitrary selected block size
for (int i=cdf_ndx; i<block_end; ++i ) {
block_sum += dist2[i].first;
if( sum_cdf+block_sum >=r ) {
block_end = i; // adjust to correctly compute cdf_ndx
break;
}
}
sum_cdf += block_sum;
cdf_ndx = block_end;
}
And after the loop you need to check that cdf_ndx < x.n, otherwise repeat with a new random interval.

C++ use SSE instructions for comparing huge vectors of ints

I have a huge vector<vector<int>> (18M x 128). Frequently I want to take 2 rows of this vector and compare them by this function:
int getDiff(int indx1, int indx2) {
int result = 0;
int pplus, pminus, tmp;
for (int k = 0; k < 128; k += 2) {
pplus = nodeL[indx2][k] - nodeL[indx1][k];
pminus = nodeL[indx1][k + 1] - nodeL[indx2][k + 1];
tmp = max(pplus, pminus);
if (tmp > result) {
result = tmp;
}
}
return result;
}
As you can see, the function loops through the two row vectors, does some subtraction, and at the end returns a maximum. This function will be used a million times, so I was wondering if it can be accelerated with SSE instructions. I use Ubuntu 12.04 and gcc.
Of course this is a micro-optimization, but it would be helpful if you could provide some help, since I know nothing about SSE. Thanks in advance.
Benchmark:
int nofTestCases = 10000000;
vector<int> nodeIds(nofTestCases);
vector<int> goalNodeIds(nofTestCases);
vector<int> results(nofTestCases);
for (int l = 0; l < nofTestCases; l++) {
nodeIds[l] = randomNodeID(18000000);
goalNodeIds[l] = randomNodeID(18000000);
}
double time, result;
time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
results[l] = getDiff2(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;
time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
results[l] = getDiff(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;
where
int randomNodeID(int n) {
return (int) (rand() / (double) (RAND_MAX + 1.0) * n);
}
/** Returns a timestamp ('now') in seconds (incl. a fractional part). */
inline double timestamp() {
struct timeval tp;
gettimeofday(&tp, NULL);
return double(tp.tv_sec) + tp.tv_usec / 1000000.;
}
FWIW I put together a pure SSE version (SSE4.1) which seems to run around 20% faster than the original scalar code on a Core i7:
#include <smmintrin.h>
int getDiff_SSE(int indx1, int indx2)
{
int result[4] __attribute__ ((aligned(16))) = { 0 };
const int * const p1 = &nodeL[indx1][0];
const int * const p2 = &nodeL[indx2][0];
const __m128i vke = _mm_set_epi32(0, -1, 0, -1);
const __m128i vko = _mm_set_epi32(-1, 0, -1, 0);
__m128i vresult = _mm_set1_epi32(0);
for (int k = 0; k < 128; k += 4)
{
__m128i v1, v2, vmax;
v1 = _mm_loadu_si128((__m128i *)&p1[k]);
v2 = _mm_loadu_si128((__m128i *)&p2[k]);
v1 = _mm_xor_si128(v1, vke);
v2 = _mm_xor_si128(v2, vko);
v1 = _mm_sub_epi32(v1, vke);
v2 = _mm_sub_epi32(v2, vko);
vmax = _mm_add_epi32(v1, v2);
vresult = _mm_max_epi32(vresult, vmax);
}
_mm_store_si128((__m128i *)result, vresult);
return max(max(max(result[0], result[1]), result[2]), result[3]);
}
You can probably get the compiler to use SSE for this. Will it make the code quicker? Probably not. The reason is that there is a lot of memory access compared to computation. The CPU is much faster than the memory, and a trivial implementation of the above will already have the CPU stalling while it waits for data to arrive over the system bus. Making the CPU faster will just increase the amount of waiting it does.
The declaration of nodeL can affect performance, so it's important to choose an efficient container for your data.
There is a threshold where optimising does have a benefit, and that's when you're doing more computation between memory reads, i.e. the time between memory reads is much greater. The point at which this occurs depends a lot on your hardware.
It can be helpful, however, to optimise the code if you've got non-memory-constrained tasks that can run in parallel, so that the CPU is kept busy whilst waiting for the data.
This will be faster. Double dereference of vector of vectors is expensive. Caching one of the dereferences will help. I know it's not answering the posted question but I think it will be a more helpful answer.
int getDiff(int indx1, int indx2) {
int result = 0;
int pplus, pminus, tmp;
const vector<int>& nodetemp1 = nodeL[indx1];
const vector<int>& nodetemp2 = nodeL[indx2];
for (int k = 0; k < 128; k += 2) {
pplus = nodetemp2[k] - nodetemp1[k];
pminus = nodetemp1[k + 1] - nodetemp2[k + 1];
tmp = max(pplus, pminus);
if (tmp > result) {
result = tmp;
}
}
return result;
}
A couple of things to look at. One is the amount of data you are passing around. That will cause a bigger issue than the trivial calculation.
I've tried to rewrite it using SSE instructions (AVX) using library here
The original code on my system ran in 11.5s
With Neil Kirk's optimisation, it went down to 10.5s
EDIT: Tested the code with a debugger rather than in my head!
int getDiff(std::vector<std::vector<int>>& nodeL,int row1, int row2) {
Vec4i result(0);
const std::vector<int>& nodetemp1 = nodeL[row1];
const std::vector<int>& nodetemp2 = nodeL[row2];
Vec8i mask(-1,0,-1,0,-1,0,-1,0);
for (int k = 0; k < 128; k += 8) {
Vec8i nodeA(nodetemp1[k],nodetemp1[k+1],nodetemp1[k+2],nodetemp1[k+3],nodetemp1[k+4],nodetemp1[k+5],nodetemp1[k+6],nodetemp1[k+7]);
Vec8i nodeB(nodetemp2[k],nodetemp2[k+1],nodetemp2[k+2],nodetemp2[k+3],nodetemp2[k+4],nodetemp2[k+5],nodetemp2[k+6],nodetemp2[k+7]);
Vec8i tmp = select(mask,nodeB-nodeA,nodeA-nodeB);
Vec4i tmp_a(tmp[0],tmp[2],tmp[4],tmp[6]);
Vec4i tmp_b(tmp[1],tmp[3],tmp[5],tmp[7]);
Vec4i max_tmp = max(tmp_a,tmp_b);
result = select(max_tmp > result,max_tmp,result);
}
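// The scalar version returns the largest difference, so reduce the four lanes with max instead of summing them (needs <algorithm>).
return std::max(std::max(result[0], result[1]), std::max(result[2], result[3]));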
}
The lack of branching speeds it up to 9.5s, but data access still has the biggest impact.
If you want to speed it up more, try changing the data structure to a single array/vector rather than a 2D one (à la std::vector<std::vector<int>>), as that will reduce cache pressure.
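A minimal sketch of that single-array layout, assuming fixed rows of 128 ints as in the question. `FlatRows` and `getDiffFlat` are hypothetical names, not part of the original code. All rows live in one contiguous buffer, so fetching a row is a multiplication and an add rather than a second pointer dereference.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat replacement for vector<vector<int>>: all rows in one buffer.
struct FlatRows {
    static constexpr std::size_t kRowLength = 128;
    std::vector<int> data;  // row r occupies [r * kRowLength, (r + 1) * kRowLength)

    explicit FlatRows(std::size_t rowCount) : data(rowCount * kRowLength, 0) {}

    int *row(std::size_t r) { return data.data() + r * kRowLength; }
    const int *row(std::size_t r) const { return data.data() + r * kRowLength; }
    std::size_t rowCount() const { return data.size() / kRowLength; }
};

// Same computation as the question's getDiff, on the flat layout.
int getDiffFlat(const FlatRows &nodes, std::size_t indx1, std::size_t indx2) {
    assert(indx1 < nodes.rowCount() && indx2 < nodes.rowCount());
    const int *a = nodes.row(indx1);
    const int *b = nodes.row(indx2);
    int result = 0;
    for (std::size_t k = 0; k < FlatRows::kRowLength; k += 2) {
        int pplus = b[k] - a[k];
        int pminus = a[k + 1] - b[k + 1];
        result = std::max(result, std::max(pplus, pminus));
    }
    return result;
}
```

With 18M rows this is a single allocation of roughly 9.2 GB of ints, so it only makes sense where the original vector of vectors already fit in memory. Random row pairs will still miss the cache, as the answers above point out.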
EDIT
I thought of something - you could add a custom allocator to ensure you allocate the 2*18M vectors in a contiguous block of memory which allows you to keep the data structure and still go through it quickly. But you'd need to profile it to be sure
EDIT 2: Tested the code with a debugger rather than in my head!
Sorry Alex, this should be better. Not sure it will be faster than what the compiler can do. I still maintain that it's memory access that's the issue, so I would still try the single array approach. Give this a go though.

Red-Black Gauss Seidel and OpenMP

I was trying to prove a point about OpenMP compared to MPICH, and I cooked up the following example to demonstrate how easy it is to get high performance with OpenMP.
The Gauss-Seidel iteration is split into two separate sweeps, such that within each sweep every operation can be performed in any order and there should be no dependency between tasks. So in theory no processor should ever have to wait for another one for any kind of synchronization.
The problem I am encountering is that, independent of problem size, I find only a weak speed-up with 2 processors, and with more than 2 processors it might even be slower.
With many other linear parallelized routines I can obtain very good scaling, but this one is tricky.
My fear is that I am unable to "explain" to the compiler that the operation I perform on the array is thread-safe, so that it is unable to be really effective.
See the example below.
Does anyone have a clue how to make this more efficient with OpenMP?
void redBlackSmooth(std::vector<double> const & b,
std::vector<double> & x,
double h)
{
// Setup relevant constants.
double const invh2 = 1.0/(h*h);
double const h2 = (h*h);
int const N = static_cast<int>(x.size());
double sigma = 0;
// Setup some boundary conditions.
x[0] = 0.0;
x[N-1] = 0.0;
// Red sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 1; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2/2.0)*(b[i] - sigma);
}
// Black sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 2; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2/2.0)*(b[i] - sigma);
}
}
Addition:
I have now also tried a raw pointer implementation, and it behaves the same as the STL container version, so it can be ruled out that this is some pseudo-critical behavior coming from the STL.
First of all, make sure that the x vector is aligned to cache-line boundaries. I did some tests, and I get something like a 100% improvement with your code on my machine (Core Duo) if I force the alignment of memory:
double * x;
const size_t CACHE_LINE_SIZE = 256;
posix_memalign( reinterpret_cast<void**>(&x), CACHE_LINE_SIZE, sizeof(double) * N);
Second, you can try assigning more computation to each thread (this way you can keep cache lines separate), but I suspect that OpenMP already does something like this under the hood, so it may not be worthwhile for large N.
In my case this implementation is much faster when x is not cache-aligned.
const int workGroupSize = CACHE_LINE_SIZE / sizeof(double);
assert(N % workGroupSize == 0); //Need to tweak the code a bit to let it work with any N
const int workgroups = N / workGroupSize;
int j, base , k, i;
#pragma omp parallel for shared(b, x) private(sigma, j, base, k, i)
for ( j = 0; j < workgroups; j++ ) {
base = j * workGroupSize;
for (int k = 0; k < workGroupSize; k+=2)
{
i = base + k + (redSweep ? 1 : 0);
if ( i == 0 || i+1 == N) continue;
sigma = -invh2* ( x[i-1] + x[i+1] );
x[i] = ( h2/2.0 ) * ( b[i] - sigma );
}
}
In conclusion, you definitely have a cache-fighting problem, but given the way OpenMP works (sadly, I am not familiar with it), it should be enough to work with properly allocated buffers.
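For the "properly allocated buffers" part, one possible sketch wraps posix_memalign so the memory is freed automatically. It assumes a POSIX system. `FreeDeleter`, `AlignedDoubles`, `allocateAligned`, and `isAligned` are hypothetical names, and the default of 64 bytes is an assumption about the cache line size on common x86 CPUs (the snippet above used 256).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <stdlib.h>  // posix_memalign (POSIX, not standard C++)

// Memory obtained from posix_memalign must be released with free().
struct FreeDeleter {
    void operator()(void *p) const { std::free(p); }
};

using AlignedDoubles = std::unique_ptr<double[], FreeDeleter>;

// Allocates `count` doubles aligned to `alignment` bytes; the alignment must be a
// power of two and a multiple of sizeof(void*). Returns an empty pointer on failure.
AlignedDoubles allocateAligned(std::size_t count, std::size_t alignment = 64) {
    void *raw = nullptr;
    if (posix_memalign(&raw, alignment, count * sizeof(double)) != 0) return AlignedDoubles();
    return AlignedDoubles(static_cast<double *>(raw));
}

// True when the address is a multiple of `alignment`.
bool isAligned(const void *p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

The raw pointer from get() can be handed to the sweep loops in place of the std::vector storage. Whether alignment alone recovers the scaling is something to measure rather than assume.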
I think the main problem is the type of array structure you are using. Let's compare results for vectors and arrays (arrays = C arrays allocated with the new operator).
The vector and array sizes are N = 10000000. I force the smoothing function to repeat in order to keep the runtime > 0.1 s.
Vector Time: 0.121007 Repeat: 1 MLUPS: 82.6399
Array Time: 0.164009 Repeat: 2 MLUPS: 121.945
MLUPS = ((N-2)*repeat/runtime)/1000000 (million lattice-point updates per second)
MFLOPS are misleading when it comes to grid calculations. A few changes to the basic equation can make the reported performance look higher for the same runtime.
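As a quick check of the MLUPS formula against the two result lines above, here is a small helper. `mlups` is a hypothetical name; N, repeat, and runtime are as defined in the formula.

```cpp
#include <cassert>
#include <cmath>

// Million lattice-point updates per second: (N - 2) interior points are updated
// `repeat` times in `runtimeSeconds`.
double mlups(long long n, long long repeat, double runtimeSeconds) {
    assert(n > 2 && repeat > 0 && runtimeSeconds > 0.0);
    return (static_cast<double>(n - 2) * static_cast<double>(repeat) / runtimeSeconds) / 1.0e6;
}
```

With N = 10000000, the vector run (repeat 1, 0.121007 s) comes out at about 82.64 and the array run (repeat 2, 0.164009 s) at about 121.94, matching the reported figures up to rounding of the printed runtimes.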
The modified code:
double my_redBlackSmooth(double *b, double* x, double h, int N)
{
// Setup relevant constants.
double const invh2 = 1.0/(h*h);
double const h2 = (h*h);
double sigma = 0;
// Setup some boundary conditions.
x[0] = 0.0;
x[N-1] = 0.0;
double runtime(0.0), wcs, wce;
int repeat = 1;
timing(&wcs);
for(; runtime < 0.1; repeat*=2)
{
for(int r = 0; r < repeat; ++r)
{
// Red sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 1; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2*0.5)*(b[i] - sigma);
}
// Black sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 2; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2*0.5)*(b[i] - sigma);
}
// cout << "In Array: " << r << endl;
}
if(x[0] != 0) dummy(x[0]);
timing(&wce);
runtime = (wce-wcs);
}
// cout << "Before division: " << repeat << endl;
repeat /= 2;
cout << "Array Time:\t" << runtime << "\t" << "Repeat:\t" << repeat
<< "\tMLUPS:\t" << ((N-2)*repeat/runtime)/1000000.0 << endl;
return runtime;
}
I didn't change anything in the code except the array type. For better cache access and blocking you should look into data alignment (_mm_malloc).