OpenMP problems - c++

How to parallelize this algorithm using OpenMP? I tried different options, but the execution time only increases.
void gammaEncoding(string& input, string& gamma, string& result)
{
int j = 0;
int i = 0;
int Ti, Gi;
char BUFF;
for (i = 0; i < ITERATION_COUNT; i++)
{
if(j == gamma.length() - 1)
j = 0;
Ti = input[i] - FIRST_SYMBOL;
Gi = gamma[j] - FIRST_SYMBOL;
BUFF = FIRST_SYMBOL + (Ti + Gi) % SYMBOL_NUMBER;
result += BUFF;
j++;
}
}

void gammaEncoding(string const &input, string const &gamma, string &result)
{
auto const old_length = result.size();
result.resize(old_length + ITERATION_COUNT);
#pragma omp parallel for default(none) \
shared(input, gamma, result, old_length)
for (int i = 0; i < ITERATION_COUNT; ++i)
{
int const Ti = input[i] - FIRST_SYMBOL;
int const Gi = gamma[i % (gamma.length() - 1)] - FIRST_SYMBOL;
result[old_length + i] = FIRST_SYMBOL + (Ti + Gi) % SYMBOL_NUMBER;
}
}
This should do the same thing as your code does and be relatively fast (depending on problem size and number of processors). It does more work (one more modulo operator and in the worst case thread creation) which has to be amortized by the parallelism.
That being said, I have no idea if your original code does the right thing. For example I don't understand why one would not use all of gamma. Could it be that you think that gamma[gamma.length() - 1] == '\0'? Because that is not the case in C++ (assuming you are using std::string). Also I don't know if you actually plan to give non-empty result strings to the function. Because if you don't and if this function is not called in a loop such that you can reuse the result string instead of reallocating it every time, you might want to just create the result string inside the function. Due to RVO (Return Value Optimization), this would not create a copy operation at the end of the function, but instead put the string at the right position in the stack of the caller from the start. In the case that you want all of gamma and you wont reuse result:
string gammaEncoding(string const &input, string const &gamma)
{
string result(ITERATION_COUNT, ' ');
#pragma omp parallel for default(none) shared(input, gamma, result)
for (int i = 0; i < ITERATION_COUNT; ++i)
{
int const Ti = input[i] - FIRST_SYMBOL;
int const Gi = gamma[i % gamma.length()] - FIRST_SYMBOL;
result[i] = FIRST_SYMBOL + (Ti + Gi) % SYMBOL_NUMBER;
}
return result;
}
One could further optimize this in such way that one doesn't have to do the additional modulo operation for every i, but then one could not use pragma omp for. In that case one would have to distribute the range of i in between the threads manually. Then one only has to calculate j = i % gamma.length() once and can then use the original if statement and ++j to access gamma. In that case one would be dependent on the hardware to correctly predict the regular pattern in j such that the if statement is not holding you back. I could also add this solution, but I don't think that it is as good for learning to use OpenMP. Also I wouldn't guarantee that it actually is faster.

Related

How to properly access array with specific pointer arithmetic using SSE in convolution algorithm? [duplicate]

This question already exists:
How to implement convolution algorithm with SSE?
Closed 1 year ago.
My goal is to implement exactly that algorithm using only CPU and using SSE:
My array's sizes a multiple of 4 and they are aligned:
const int INPUT_SIGNAL_ARRAY_SIZE = 256896;
const int IMPULSE_RESPONSE_ARRAY_SIZE = 318264;
const int OUTPUT_SIGNAL_ARRAY_SIZE = INPUT_SIGNAL_ARRAY_SIZE + IMPULSE_RESPONSE_ARRAY_SIZE;
__declspec(align(16)) float inputSignal_dArray[INPUT_SIGNAL_ARRAY_SIZE];
__declspec(align(16)) float impulseResponse_dArray[IMPULSE_RESPONSE_ARRAY_SIZE];
__declspec(align(16)) float outputSignal_dArray[OUTPUT_SIGNAL_ARRAY_SIZE];
I have written CPU "method" and it works correctly:
//#pragma optimize( "", off )
void computeConvolutionOutputCPU(float* inputSignal, float* impulseResponse, float* outputSignal) {
float* pInputSignal = inputSignal;
float* pImpulseResponse = impulseResponse;
float* pOutputSignal = outputSignal;
#pragma loop(no_vector)
for (int i = 0; i < OUTPUT_SIGNAL_ARRAY_SIZE; i++)
{
*(pOutputSignal + i) = 0;
#pragma loop(no_vector)
for (int j = 0; j < IMPULSE_RESPONSE_ARRAY_SIZE; j++)
{
if (i - j >= 0 && i - j < INPUT_SIGNAL_ARRAY_SIZE)
{
*(pOutputSignal + i) = *(pOutputSignal + i) + *(pImpulseResponse + j) * (*(pInputSignal + i - j));
}
}
}
}
//#pragma optimize( "", on )
On the other hand I should use function with SSE. I tried the following code:
void computeConvolutionOutputSSE(float* inputSignal, float* impulseResponse, float* outputSignal) {
__m128* pInputSignal = (__m128*) inputSignal;
__m128* pImpulseResponse = (__m128*) impulseResponse;
__m128* pOutputSignal = (__m128*) outputSignal;
int nOuterLoop = OUTPUT_SIGNAL_ARRAY_SIZE / 4;
int nInnerLoop = IMPULSE_RESPONSE_ARRAY_SIZE / 4;
int quarterOfInputSignal = INPUT_SIGNAL_ARRAY_SIZE / 4;
__m128 m0 = _mm_set_ps1(0);
for (int i = 0; i < nOuterLoop; i++)
{
*(pOutputSignal + i) = m0;
for (int j = 0; j < nInnerLoop; j++)
{
if ((i - j) >= 0 && (i - j) < quarterOfInputSignal)
{
*(pOutputSignal + i) = _mm_add_ps(
*(pOutputSignal + i),
_mm_mul_ps(*(pImpulseResponse + j), *(pInputSignal + i - j))
);
}
}
}
}
And function above works not correct and produces not the same values like CPU.
The problem was specified on stackoverflow with following comment :
*(pInputSignal + i - j) is incorrect in case of SSE, because it's not an i-j offset away from current value, it's (i-j) * 4 . THe thing is,
as I remember it, the idea of using pointer that way is incorrect
unless intrinsics had changed since then - in my time one had to
"load" values into an instance of __m128 in this case, as H(J) and
X(I-J) are in unaligned location (and sequence breaks).
and
Since you care about individual floats and their order, probably best
to use const float*, with _mm_loadu_ps instead of just dereferencing
(which is like _mm_load_ps). That way you can easily do unaligned
loads that get the floats you want into the vector element positions
you want, and the pointer math works the same as for scalar. You just
have to take into account that load(ptr) actually gets you a vector of
elements from ptr+0..3.
But I can't use this information because have no idea how to properly access array with SSE in this case.
you need 128-bit float32 value , not msvc float.
see _mm_broadcast_ss

I would like to improve the performance of this code using AVX

I profiled my code and the most expensive part of the code is the loop included in the post. I want to improve the performance of this loop using AVX. I have tried manually unrolling the loop and, while that does improve performance, the improvements are not satisfactory.
int N = 100000000;
int8_t* data = new int8_t[N];
for(int i = 0; i< N; i++) { data[i] = 1 ;}
std::array<float, 10> f = {1,2,3,4,5,6,7,8,9,10};
std::vector<float> output(N, 0);
int k = 0;
for (int i = k; i < N; i = i + 2) {
for (int j = 0; j < 10; j++, k = j + 1) {
output[i] += f[j] * data[i - k];
output[i + 1] += f[j] * data[i - k + 1];
}
}
Could I have some guidance on how to approach this.
I would assume that data is a large input array of signed bytes, and f is a small array of floats of length 10, and output is the large output array of floats. Your code goes out of bounds for the first 10 iterations by i, so I will start i from 10 instead. Here is a clean version of the original code:
int s = 10;
for (int i = s; i < N; i += 2) {
for (int j = 0; j < 10; j++) {
output[i] += f[j] * data[i-j-1];
output[i+1] += f[j] * data[i-j];
}
}
As it turns out, processing two iterations by i does not change anything, so we simplify it further to:
for (int i = s; i < N; i++)
for (int j = 0; j < 10; j++)
output[i] += f[j] * data[i-j-1];
This version of code (along with declarations of input/output data) should have been present in the question itself, without others having to clean/simplify the mess.
Now it is obvious that this code applies one-dimensional convolution filter, which is a very common thing in signal processing. For instance, it can by computed in Python using numpy.convolve function. The kernel has very small length 10, so Fast Fourier Transform won't provide any benefits compared to bruteforce approach. Given that the problem is well-known, you can read a lot of articles on vectorizing small-kernel convolution. I will follow the article by hgomersall.
First, let's get rid of reverse indexing. Obviously, we can reverse the kernel before running the main algorithm. After that, we have to compute the so-called cross-correlation instead of convolution. In simple words, we move the kernel array along the input array, and compute the dot product between them for every possible offset.
std::reverse(f.data(), f.data() + 10);
for (int i = s; i < N; i++) {
int b = i-10;
float res = 0.0;
for (int j = 0; j < 10; j++)
res += f[j] * data[b+j];
output[i] = res;
}
In order to vectorize it, let's compute 8 consecutive dot products at once. Recall that we can pack eight 32-bit float numbers into one 256-bit AVX register. We will vectorize the outer loop by i, which means that:
The loop by i will be advanced by 8 every iteration.
Every value inside the outer loop turns into a 8-element pack, such that k-th element of the pack holds this value for (i+k)-th iteration of the outer loop from the scalar version.
Here is the resulting code:
//reverse the kernel
__m256 revKernel[10];
for (size_t i = 0; i < 10; i++)
revKernel[i] = _mm256_set1_ps(f[9-i]); //every component will have same value
//note: you have to compute the last 16 values separately!
for (size_t i = s; i + 16 <= N; i += 8) {
int b = i-10;
__m256 res = _mm256_setzero_ps();
for (size_t j = 0; j < 10; j++) {
//load: data[b+j], data[b+j+1], data[b+j+2], ..., data[b+j+15]
__m128i bytes = _mm_loadu_si128((__m128i*)&data[b+j]);
//convert first 8 bytes of loaded 16-byte pack into 8 floats
__m256 floats = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(bytes));
//compute res = res + floats * revKernel[j] elementwise
res = _mm256_fmadd_ps(revKernel[j], floats, res);
}
//store 8 values packed in res into: output[i], output[i+1], ..., output[i+7]
_mm256_storeu_ps(&output[i], res);
}
For 100 millions of elements, this code takes about 120 ms on my machine, while the original scalar implementation took 850 ms. Beware: I have Ryzen 1600 CPU, so results on Intel CPUs may be somewhat different.
Now if you really want to unroll something, the inner loop by 10 kernel elements is the perfect place. Here is how it is done:
__m256 revKernel[10];
for (size_t i = 0; i < 10; i++)
revKernel[i] = _mm256_set1_ps(f[9-i]);
for (size_t i = s; i + 16 <= N; i += 8) {
size_t b = i-10;
__m256 res = _mm256_setzero_ps();
#define DOIT(j) {\
__m128i bytes = _mm_loadu_si128((__m128i*)&data[b+j]); \
__m256 floats = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(bytes)); \
res = _mm256_fmadd_ps(revKernel[j], floats, res); \
}
DOIT(0);
DOIT(1);
DOIT(2);
DOIT(3);
DOIT(4);
DOIT(5);
DOIT(6);
DOIT(7);
DOIT(8);
DOIT(9);
_mm256_storeu_ps(&output[i], res);
}
It takes 110 ms on my machine (slightly better that the first vectorized version).
The simple copy of all elements (with conversion from bytes to floats) takes 40 ms for me, which means that this code is not memory-bound yet, and there is still some room for improvement left.

openmp bug with radial computation

#pragma omp parallel for schedule(static) default(none)
for(int row = 0; row < m_height; row++)
{
for(int col = 0; col < m_width; col++)
{
int RySqr, RxSqr;
SettingSigmaN(eta, m_RxInitial + col, m_RyInitial + row , RxSqr, RySqr);
FunctionUsing(RySqr,RxSqr);
}
}
void CImagePro::SettingSigmaN(int Eta, int x, int y, int &RxSqr, int &RySqr, int &returnValue)
{
int rSqr = GetRadius(x,y,RxSqr,RySqr);
returnValue = GetNumberFromTable(rsqr);
}
int CImagePro::GetRadius(int x, int y, int &RxSqr, int &RySqr)
{
if (x == m_RxInitial)
{
RxSqr = m_RxSqrInitial;
if (y == m_RyInitial)
{
RySqr = m_RySqrInitial;
}
else if ( abs(y) % 2 == abs(m_RyInitial) % 2)
{
RySqr = RySqr + (y<<2) + 4; //(y+2)^2
}
}
else
{
RxSqr = RxSqr + ( x << 1) + 1; //(x+1)^2
}
return clamp(((RxSqr+RySqr)>>RAD_RES_REDUCTION),0,(1<<(RAD_RES-RAD_RES_REDUCTION))-1);
}
ok so here is my code and my problem is inside GetRadius Function.
since i have many threads each threads starts at a different place of x,y. however i don't really understand where is the bug inside GetRadius().
i thought maybe is it the RySqr computation. can you suggest a way to debug? or can you see my problem?
UPDATE:
this has fixed most of my code:
i still don't really understand, why the there are jumps between the different threads.
int CImagePro::GetRadius(int x, int y, int &RxSqr, int &RySqr)
{
if (x == m_RxInitial)
{
RxSqr = m_RxSqrInitial;
}
else
{
RxSqr = x * x;
}
if (y == m_RyInitial)
{
RySqr = m_RySqrInitial;
}
else if (abs(y) % 2 == abs(m_RyInitial) % 2)
{
RySqr = y * y;
}
return clamp(( (RxSqr + RySqr) >> RAD_RES_REDUCTION), 0, ( 1 << (RAD_RES - RAD_RES_REDUCTION) ) - 1);
}
I really wonder if this thing compiles? You specify default(none), but consistently use data members of your class. Are they all static?
What you could do is either i) leave default(none) away, which means default(shared), ii) have a shared access to the values by explicitly sharing them, or iii) initialise the variables you use inside the parallel region so that each thread has it's own private copy of, say, m_RxInitial called p_RxInitial etc. The first option is almost guaranteed to get you into trouble.
Following illustrates option ii):
1) Make a helper class containing everything you need to pass, for you this could be
struct ShareData{
int s_RxInitial
/* ... */
}
2) In the member function containing parallel section, before parallel loop define
ShareData SD;
SD.s_RxInitial = m_RxInitial;
/* ... */
3) Give it to the parallel section
#pragma omp parallel for schedule(static), default(none), shared(SD)
4) Use the SD datamembers in function calls.
I hope this was clear enough. I would appreciate it if someone had a more elegant solution to offer.
If you wanted private variables of option iii), you could say firstprivate(SD) instead of shared(SD). This would give each thread initialized (to the original values) private copy of SD. It may or may not give some performance advantage by avoiding serial access. I had a similar problem few days ago and there was no difference.
You cannot guarantee the order in which the threads execute, if you need to guarantee that either make if statements as you did or simply don't parallelize it because it is a critical section.
http://bisqwit.iki.fi/story/howto/openmp/

C++ use SSE instructions for comparing huge vectors of ints

I have a huge vector<vector<int>> (18M x 128). Frequently I want to take 2 rows of this vector and compare them by this function:
int getDiff(int indx1, int indx2) {
int result = 0;
int pplus, pminus, tmp;
for (int k = 0; k < 128; k += 2) {
pplus = nodeL[indx2][k] - nodeL[indx1][k];
pminus = nodeL[indx1][k + 1] - nodeL[indx2][k + 1];
tmp = max(pplus, pminus);
if (tmp > result) {
result = tmp;
}
}
return result;
}
As you see, the function, loops through the two row vectors does some subtraction and at the end returns a maximum. This function will be used a million times, so I was wondering if it can be accelerated through SSE instructions. I use Ubuntu 12.04 and gcc.
Of course it is microoptimization but it would helpful if you could provide some help, since I know nothing about SSE. Thanks in advance
Benchmark:
int nofTestCases = 10000000;
vector<int> nodeIds(nofTestCases);
vector<int> goalNodeIds(nofTestCases);
vector<int> results(nofTestCases);
for (int l = 0; l < nofTestCases; l++) {
nodeIds[l] = randomNodeID(18000000);
goalNodeIds[l] = randomNodeID(18000000);
}
double time, result;
time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
results[l] = getDiff2(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;
time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
results[l] = getDiff(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;
where
int randomNodeID(int n) {
return (int) (rand() / (double) (RAND_MAX + 1.0) * n);
}
/** Returns a timestamp ('now') in seconds (incl. a fractional part). */
inline double timestamp() {
struct timeval tp;
gettimeofday(&tp, NULL);
return double(tp.tv_sec) + tp.tv_usec / 1000000.;
}
FWIW I put together a pure SSE version (SSE4.1) which seems to run around 20% faster than the original scalar code on a Core i7:
#include <smmintrin.h>
int getDiff_SSE(int indx1, int indx2)
{
int result[4] __attribute__ ((aligned(16))) = { 0 };
const int * const p1 = &nodeL[indx1][0];
const int * const p2 = &nodeL[indx2][0];
const __m128i vke = _mm_set_epi32(0, -1, 0, -1);
const __m128i vko = _mm_set_epi32(-1, 0, -1, 0);
__m128i vresult = _mm_set1_epi32(0);
for (int k = 0; k < 128; k += 4)
{
__m128i v1, v2, vmax;
v1 = _mm_loadu_si128((__m128i *)&p1[k]);
v2 = _mm_loadu_si128((__m128i *)&p2[k]);
v1 = _mm_xor_si128(v1, vke);
v2 = _mm_xor_si128(v2, vko);
v1 = _mm_sub_epi32(v1, vke);
v2 = _mm_sub_epi32(v2, vko);
vmax = _mm_add_epi32(v1, v2);
vresult = _mm_max_epi32(vresult, vmax);
}
_mm_store_si128((__m128i *)result, vresult);
return max(max(max(result[0], result[1]), result[2]), result[3]);
}
You probably can get the compiler to use SSE for this. Will it make the code quicker? Probably not. The reason being is that there is a lot of memory access compared to computation. The CPU is much faster than the memory and a trivial implementation of the above will already have the CPU stalling when it's waiting for data to arrive over the system bus. Making the CPU faster will just increase the amount of waiting it does.
The declaration of nodeL can have an effect on the performance so it's important to choose an efficient container for your data.
There is a threshold where optimising does have a benfit, and that's when you're doing more computation between memory reads - i.e. the time between memory reads is much greater. The point at which this occurs depends a lot on your hardware.
It can be helpful, however, to optimise the code if you've got non-memory constrained tasks that can run in prarallel so that the CPU is kept busy whilst waiting for the data.
This will be faster. Double dereference of vector of vectors is expensive. Caching one of the dereferences will help. I know it's not answering the posted question but I think it will be a more helpful answer.
int getDiff(int indx1, int indx2) {
int result = 0;
int pplus, pminus, tmp;
const vector<int>& nodetemp1 = nodeL[indx1];
const vector<int>& nodetemp2 = nodeL[indx2];
for (int k = 0; k < 128; k += 2) {
pplus = nodetemp2[k] - nodetemp1[k];
pminus = nodetemp1[k + 1] - nodetemp2[k + 1];
tmp = max(pplus, pminus);
if (tmp > result) {
result = tmp;
}
}
return result;
}
A couple of things to look at. One is the amount of data you are passing around. That will cause a bigger issue than the trivial calculation.
I've tried to rewrite it using SSE instructions (AVX) using library here
The original code on my system ran in 11.5s
With Neil Kirk's optimisation, it went down to 10.5s
EDIT: Tested the code with a debugger rather than in my head!
int getDiff(std::vector<std::vector<int>>& nodeL,int row1, int row2) {
Vec4i result(0);
const std::vector<int>& nodetemp1 = nodeL[row1];
const std::vector<int>& nodetemp2 = nodeL[row2];
Vec8i mask(-1,0,-1,0,-1,0,-1,0);
for (int k = 0; k < 128; k += 8) {
Vec8i nodeA(nodetemp1[k],nodetemp1[k+1],nodetemp1[k+2],nodetemp1[k+3],nodetemp1[k+4],nodetemp1[k+5],nodetemp1[k+6],nodetemp1[k+7]);
Vec8i nodeB(nodetemp2[k],nodetemp2[k+1],nodetemp2[k+2],nodetemp2[k+3],nodetemp2[k+4],nodetemp2[k+5],nodetemp2[k+6],nodetemp2[k+7]);
Vec8i tmp = select(mask,nodeB-nodeA,nodeA-nodeB);
Vec4i tmp_a(tmp[0],tmp[2],tmp[4],tmp[6]);
Vec4i tmp_b(tmp[1],tmp[3],tmp[5],tmp[7]);
Vec4i max_tmp = max(tmp_a,tmp_b);
result = select(max_tmp > result,max_tmp,result);
}
return horizontal_add(result);
}
The lack of branching speeds it up to 9.5s but still data is the biggest impact.
If you want to speed it up more, try to change the data structure to a single array/vector rather than a 2D one (a.l.a. std::vector) as that will reduce cache pressure.
EDIT
I thought of something - you could add a custom allocator to ensure you allocate the 2*18M vectors in a contiguous block of memory which allows you to keep the data structure and still go through it quickly. But you'd need to profile it to be sure
EDIT 2: Tested the code with a debugger rather than in my head!
Sorry Alex, this should be better. Not sure it will be faster than what the compiler can do. I still maintain that it's memory access that's the issue, so I would still try the single array approach. Give this a go though.

Red-Black Gauss Seidel and OpenMP

I was trying to prove a point with OpenMP compared to MPICH, and I cooked up the following example to demonstrate how easy it was to do some high performance in OpenMP.
The Gauss-Seidel iteration is split into two separate runs, such that in each sweep every operation can be performed in any order, and there should be no dependency between each task. So in theory each processor should never have to wait for another process to perform any kind of synchronization.
The problem I am encountering, is that I, independent of problem size, find there is only a weak speed-up of 2 processors and with more than 2 processors it might even be slower.
Many other linear paralleled routine I can obtain very good scaling, but this one is tricky.
My fear is that I am unable to "explain" to the compiler that operation that I perform on the array, is thread-safe, such that it is unable to be really effective.
See the example below.
Anyone has any clue on how to make this more effective with OpenMP?
void redBlackSmooth(std::vector<double> const & b,
std::vector<double> & x,
double h)
{
// Setup relevant constants.
double const invh2 = 1.0/(h*h);
double const h2 = (h*h);
int const N = static_cast<int>(x.size());
double sigma = 0;
// Setup some boundary conditions.
x[0] = 0.0;
x[N-1] = 0.0;
// Red sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 1; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2/2.0)*(b[i] - sigma);
}
// Black sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 2; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2/2.0)*(b[i] - sigma);
}
}
Addition:
I have now also tried with a raw pointer implementation and it has the same behavior as using STL container, so it can be ruled out that it is some pseudo-critical behavior comming from STL.
First of all, make sure that the x vector is aligned to cache boundaries. I did some test, and I get something like a 100% improvement with your code on my machine (core duo) if I force the alignment of memory:
double * x;
const size_t CACHE_LINE_SIZE = 256;
posix_memalign( reinterpret_cast<void**>(&x), CACHE_LINE_SIZE, sizeof(double) * N);
Second, you can try to assign more computation to each thread (in this way you can keep cache-lines separated), but I suspect that openmp already does something like this under the hood, so it may be worthless with large N.
In my case this implementation is much faster when x is not cache-aligned.
const int workGroupSize = CACHE_LINE_SIZE / sizeof(double);
assert(N % workGroupSize == 0); //Need to tweak the code a bit to let it work with any N
const int workgroups = N / workGroupSize;
int j, base , k, i;
#pragma omp parallel for shared(b, x) private(sigma, j, base, k, i)
for ( j = 0; j < workgroups; j++ ) {
base = j * workGroupSize;
for (int k = 0; k < workGroupSize; k+=2)
{
i = base + k + (redSweep ? 1 : 0);
if ( i == 0 || i+1 == N) continue;
sigma = -invh2* ( x[i-1] + x[i+1] );
x[i] = ( h2/2.0 ) * ( b[i] - sigma );
}
}
In conclusion, you definitely have a problem of cache-fighting, but given the way openmp works (sadly I am not familiar with it) it should be enough to work with properly allocated buffers.
I think the main problem is about type of array structure you are using. Lets try comparing results with vectors and arrays. (Arrays = c-arrays using new operator).
Vector and array sizes are N = 10000000. I force the smoothing function to repeat in order to maintain runtime > 0.1secs.
Vector Time: 0.121007 Repeat: 1 MLUPS: 82.6399
Array Time: 0.164009 Repeat: 2 MLUPS: 121.945
MLUPS = ((N-2)*repeat/runtime)/1000000 (Million Lattice Points Update per second)
MFLOPS are misleading when it comes to grid calculation. A few changes in the basic equation can lead to consider high performance for the same runtime.
The modified code:
double my_redBlackSmooth(double *b, double* x, double h, int N)
{
// Setup relevant constants.
double const invh2 = 1.0/(h*h);
double const h2 = (h*h);
double sigma = 0;
// Setup some boundary conditions.
x[0] = 0.0;
x[N-1] = 0.0;
double runtime(0.0), wcs, wce;
int repeat = 1;
timing(&wcs);
for(; runtime < 0.1; repeat*=2)
{
for(int r = 0; r < repeat; ++r)
{
// Red sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 1; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2*0.5)*(b[i] - sigma);
}
// Black sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 2; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2*0.5)*(b[i] - sigma);
}
// cout << "In Array: " << r << endl;
}
if(x[0] != 0) dummy(x[0]);
timing(&wce);
runtime = (wce-wcs);
}
// cout << "Before division: " << repeat << endl;
repeat /= 2;
cout << "Array Time:\t" << runtime << "\t" << "Repeat:\t" << repeat
<< "\tMLUPS:\t" << ((N-2)*repeat/runtime)/1000000.0 << endl;
return runtime;
}
I didn't change anything in the code except than array type. For better cache access and blocking you should look into data alignment (_mm_malloc).