#pragma omp parallel for schedule(static) default(none)
for(int row = 0; row < m_height; row++)
{
for(int col = 0; col < m_width; col++)
{
int RySqr, RxSqr;
SettingSigmaN(eta, m_RxInitial + col, m_RyInitial + row , RxSqr, RySqr);
FunctionUsing(RySqr,RxSqr);
}
}
void CImagePro::SettingSigmaN(int Eta, int x, int y, int &RxSqr, int &RySqr, int &returnValue)
{
int rSqr = GetRadius(x,y,RxSqr,RySqr);
returnValue = GetNumberFromTable(rSqr);
}
int CImagePro::GetRadius(int x, int y, int &RxSqr, int &RySqr)
{
if (x == m_RxInitial)
{
RxSqr = m_RxSqrInitial;
if (y == m_RyInitial)
{
RySqr = m_RySqrInitial;
}
else if ( abs(y) % 2 == abs(m_RyInitial) % 2)
{
RySqr = RySqr + (y<<2) + 4; //(y+2)^2
}
}
else
{
RxSqr = RxSqr + ( x << 1) + 1; //(x+1)^2
}
return clamp(((RxSqr+RySqr)>>RAD_RES_REDUCTION),0,(1<<(RAD_RES-RAD_RES_REDUCTION))-1);
}
OK, so here is my code, and my problem is inside the GetRadius function.
Since I have many threads, each thread starts at a different (x, y) position; however, I don't really understand where the bug inside GetRadius() is.
I thought maybe it is the RySqr computation. Can you suggest a way to debug this, or can you see my problem?
UPDATE:
The following has fixed most of my code, but I still don't really understand why there are jumps between the different threads.
int CImagePro::GetRadius(int x, int y, int &RxSqr, int &RySqr)
{
if (x == m_RxInitial)
{
RxSqr = m_RxSqrInitial;
}
else
{
RxSqr = x * x;
}
if (y == m_RyInitial)
{
RySqr = m_RySqrInitial;
}
else if (abs(y) % 2 == abs(m_RyInitial) % 2)
{
RySqr = y * y;
}
return clamp(( (RxSqr + RySqr) >> RAD_RES_REDUCTION), 0, ( 1 << (RAD_RES - RAD_RES_REDUCTION) ) - 1);
}
I really wonder whether this even compiles: you specify default(none), but you consistently use data members of your class. Are they all static?
What you could do is either i) leave default(none) out, which means default(shared); ii) give the threads shared access to the values by explicitly sharing them; or iii) initialise the variables you use inside the parallel region so that each thread has its own private copy of, say, m_RxInitial, called p_RxInitial, etc. The first option is almost guaranteed to get you into trouble.
Following illustrates option ii):
1) Make a helper class containing everything you need to pass, for you this could be
struct ShareData{
int s_RxInitial;
/* ... */
};
2) In the member function containing parallel section, before parallel loop define
ShareData SD;
SD.s_RxInitial = m_RxInitial;
/* ... */
3) Give it to the parallel section
#pragma omp parallel for schedule(static), default(none), shared(SD)
4) Use the SD data members in the function calls.
I hope this was clear enough. I would appreciate it if someone had a more elegant solution to offer.
If you wanted the private variables of option iii), you could say firstprivate(SD) instead of shared(SD). This would give each thread a private copy of SD, initialized to the original values. It may or may not give some performance advantage by avoiding serialized access. I had a similar problem a few days ago and there was no difference.
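For illustration, a minimal sketch of option iii) (it reuses the hypothetical ShareData struct from above and copies the loop bounds into plain locals; the names are illustrative only):
// Option iii) sketch: firstprivate(SD) gives every thread its own copy of SD,
// initialized from the copy filled in before the parallel region.
ShareData SD;
SD.s_RxInitial = m_RxInitial;
/* ... copy the other members the loop body needs ... */
int height = m_height;   // plain locals so they can appear in the clause
int width  = m_width;
#pragma omp parallel for schedule(static) default(none) firstprivate(SD, height, width)
for (int row = 0; row < height; row++)
{
    for (int col = 0; col < width; col++)
    {
        // use SD.s_RxInitial etc. instead of the m_* data members
    }
}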
You cannot guarantee the order in which the threads execute. If you need to guarantee that, either add if statements as you did, or simply don't parallelize the loop, because it is effectively a critical section.
http://bisqwit.iki.fi/story/howto/openmp/
Related
How to parallelize this algorithm using OpenMP? I tried different options, but the execution time only increases.
void gammaEncoding(string& input, string& gamma, string& result)
{
int j = 0;
int i = 0;
int Ti, Gi;
char BUFF;
for (i = 0; i < ITERATION_COUNT; i++)
{
if(j == gamma.length() - 1)
j = 0;
Ti = input[i] - FIRST_SYMBOL;
Gi = gamma[j] - FIRST_SYMBOL;
BUFF = FIRST_SYMBOL + (Ti + Gi) % SYMBOL_NUMBER;
result += BUFF;
j++;
}
}
void gammaEncoding(string const &input, string const &gamma, string &result)
{
auto const old_length = result.size();
result.resize(old_length + ITERATION_COUNT);
#pragma omp parallel for default(none) \
shared(input, gamma, result, old_length)
for (int i = 0; i < ITERATION_COUNT; ++i)
{
int const Ti = input[i] - FIRST_SYMBOL;
int const Gi = gamma[i % (gamma.length() - 1)] - FIRST_SYMBOL;
result[old_length + i] = FIRST_SYMBOL + (Ti + Gi) % SYMBOL_NUMBER;
}
}
This should do the same thing as your code and be relatively fast (depending on problem size and number of processors). It does more work (one extra modulo operation and, in the worst case, thread creation), which has to be amortized by the parallelism.
That being said, I have no idea whether your original code does the right thing. For example, I don't understand why one would not use all of gamma. Could it be that you think that gamma[gamma.length() - 1] == '\0'? Because that is not the case in C++ (assuming you are using std::string).
Also, I don't know whether you actually plan to pass non-empty result strings to the function. If you don't, and if this function is not called in a loop such that you could reuse the result string instead of reallocating it every time, you might want to just create the result string inside the function. Due to RVO (Return Value Optimization), this would not create a copy operation at the end of the function, but instead construct the string in the right position on the caller's stack from the start. For the case that you want all of gamma and won't reuse result:
string gammaEncoding(string const &input, string const &gamma)
{
string result(ITERATION_COUNT, ' ');
#pragma omp parallel for default(none) shared(input, gamma, result)
for (int i = 0; i < ITERATION_COUNT; ++i)
{
int const Ti = input[i] - FIRST_SYMBOL;
int const Gi = gamma[i % gamma.length()] - FIRST_SYMBOL;
result[i] = FIRST_SYMBOL + (Ti + Gi) % SYMBOL_NUMBER;
}
return result;
}
One could further optimize this so that one doesn't have to do the additional modulo operation for every i, but then one could not use pragma omp for. In that case one would have to distribute the range of i between the threads manually. Then one only has to calculate j = i % gamma.length() once per thread and can then use the original if statement and ++j to access gamma. In that case one depends on the hardware correctly predicting the regular pattern in j, so that the if statement does not hold you back. I could also add this solution, but I don't think it is as good for learning to use OpenMP. Also, I wouldn't guarantee that it is actually faster.
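For completeness, a rough sketch of that manual distribution (not tuned, and with the caveats above; it assumes ITERATION_COUNT, FIRST_SYMBOL and SYMBOL_NUMBER are integer compile-time constants as in the question):
#include <algorithm>
#include <string>
#include <omp.h>
using std::string;

string gammaEncodingChunked(string const &input, string const &gamma)
{
    string result(ITERATION_COUNT, ' ');
    #pragma omp parallel shared(input, gamma, result)
    {
        // Split [0, ITERATION_COUNT) into one contiguous chunk per thread.
        int const nthreads = omp_get_num_threads();
        int const tid      = omp_get_thread_num();
        int const chunk    = (ITERATION_COUNT + nthreads - 1) / nthreads;
        int const begin    = tid * chunk;
        int const end      = std::min(ITERATION_COUNT, begin + chunk);

        std::size_t j = begin % gamma.length();   // one modulo per thread
        for (int i = begin; i < end; ++i)
        {
            int const Ti = input[i] - FIRST_SYMBOL;
            int const Gi = gamma[j] - FIRST_SYMBOL;
            result[i] = FIRST_SYMBOL + (Ti + Gi) % SYMBOL_NUMBER;
            if (++j == gamma.length())            // wrap without a modulo
                j = 0;
        }
    }
    return result;
}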
I'm trying to understand why the following runs much faster on 1 thread than on 4 threads with OpenMP. The code is based on a similar question: OpenMP recursive tasks. When trying to implement one of the suggested answers, I don't get the intended speedup, which suggests I've done something wrong (and I'm not sure what it is). Do people get better speed when running the code below on 4 threads than on 1 thread? I'm getting a 10 times slowdown when running on 4 cores (I should be getting a moderate speedup rather than a significant slowdown).
int fib(int n)
{
if(n == 0 || n == 1)
return n;
if (n < 20) //EDITED CODE TO INCLUDE CUTOFF
return fib(n-1)+fib(n-2);
int res, a, b;
#pragma omp task shared(a)
a = fib(n-1);
#pragma omp task shared(b)
b = fib(n-2);
#pragma omp taskwait
res = a+b;
return res;
}
int main(){
omp_set_nested(1);
omp_set_num_threads(4);
double start_time = omp_get_wtime();
#pragma omp parallel
{
#pragma omp single
{
cout << fib(25) << endl;
}
}
double time = omp_get_wtime() - start_time;
std::cout << "Time(ms): " << time*1000 << std::endl;
return 0;
}
Have you tried it with a larger number?
In multi-threading, it takes some time to set up the work on the CPU cores. For smaller jobs, which finish very fast on a single core, threading slows the job down because of this overhead.
Multi-threading pays off when the job normally takes longer than a second, not milliseconds.
There is also another bottleneck for threading. If your code tries to create too many threads (mostly through recursive methods), this can delay all running threads and cause a massive setback.
In this OpenMP/Tasks wiki page this is mentioned and a manual cutoff is suggested: you need two versions of the function, and when the recursion goes too deep, it continues with the single-threaded version.
EDIT: the cutoff variable needs to be increased before entering the OMP zone.
The following code is for the OP to test:
#define CUTOFF 5
int fib_s(int n)
{
if (n == 0 || n == 1)
return n;
int res, a, b;
a = fib_s(n - 1);
b = fib_s(n - 2);
res = a + b;
return res;
}
int fib_m(int n,int co)
{
if (co >= CUTOFF) return fib_s(n);
if (n == 0 || n == 1)
return n;
int res, a, b;
co++;
#pragma omp task shared(a)
a = fib_m(n - 1,co);
#pragma omp task shared(b)
b = fib_m(n - 2,co);
#pragma omp taskwait
res = a + b;
return res;
}
int main()
{
omp_set_nested(1);
omp_set_num_threads(4);
double start_time = omp_get_wtime();
#pragma omp parallel
{
#pragma omp single
{
cout << fib_m(25,1) << endl;
}
}
double time = omp_get_wtime() - start_time;
std::cout << "Time(ms): " << time * 1000 << std::endl;
return 0;
}
RESULT:
With the CUTOFF value set to 10, it took under 8 seconds to calculate the 45th term.
co=1 14.5s
co=2 9.5s
co=3 6.4s
co=10 7.5s
co=15 7.0s
co=20 8.5s
co=21 >18.0s
co=22 >40.0s
I believe I do not know how to tell the compiler not to create parallel tasks after a certain depth: omp_set_max_active_levels seems to have no effect, and omp_set_nested is deprecated (though it also has no effect).
So I have to manually specify after which level not to create more tasks, which IMHO is sad. I still believe there should be a way to do this (if somebody knows, kindly let me know). Here is how I attempted it; with an input size above 20, the parallel version runs a bit faster than the serial one (taking roughly 70-80% of the time).
Ref: Code taken from an assignment from course (solution was not provided, so I don't know how to do it efficiently): https://www.cs.iastate.edu/courses/2018/fall/com-s-527x
#include <stdio.h>
#include <omp.h>
#include <math.h>
int fib(int n, int rec_height)
{
int x = 1, y = 1;
if (n < 2)
return n;
int tCount = 0;
if (rec_height > 0) //Surprisingly, without this check the parallel code is slower than the serial one (I believe it should not be needed; I just don't know how to use OpenMP)
{
rec_height -= 1;
#pragma omp task shared(x)
x = fib(n - 1, rec_height);
#pragma omp task shared(y)
y = fib(n - 2, rec_height);
#pragma omp taskwait
}
else{
x = fib(n - 1, rec_height);
y = fib(n - 2, rec_height);
}
return x+y;
}
int main()
{
int tot_thread = 16;
int recDepth = (int)log2f(tot_thread);
if( ((int)pow(2, recDepth)) < tot_thread) recDepth += 1;
printf("\nrecDepth: %d\n",recDepth);
omp_set_max_active_levels(recDepth);
omp_set_nested(recDepth-1);
int n,fibonacci;
double starttime;
printf("\nPlease insert n, to calculate fib(n): %d\n",n);
scanf("%d",&n);
omp_set_num_threads(tot_thread);
starttime=omp_get_wtime();
#pragma omp parallel
{
#pragma omp single
{
fibonacci=fib(n, recDepth);
}
}
printf("\n\nfib(%d)=%d \n",n,fibonacci);
printf("calculation took %lf sec\n",omp_get_wtime()-starttime);
return 0;
}
I just wrote a thread pool using C++11/14 std::thread objects, with tasks in the worker queue. I encountered some weird behaviour when calling recursive functions in lambda expressions. The following code crashes with a segfault if fac() is implemented recursively (with both clang 3.5 and gcc 4.9):
#include <functional>
#include <vector>
std::size_t fac(std::size_t x) {
// This will crash (segfault).
// if (x == 1) return 1;
// else return fac(x-1)*x;
// This, however, works fine.
auto res = 1;
for (auto i = 2; i < x; ++i) {
res *= x;
}
return res;
}
int main() {
std::vector<std::function<void()> > functions;
for (auto i = 0; i < 10; ++i) {
functions.emplace_back([i]() { fac(i); });
}
for (auto& fn : functions) {
fn();
}
return 0;
}
It does, however, work fine with the iterative version above. What am I missing?
for (auto i = 0; i < 10; ++i) {
functions.emplace_back([i]() { fac(i); });
The first time through that loop, i is going to be set to zero, so you're executing:
fac(0);
Doing so with the recursive definition:
if (x == 1) return 1;
else return fac(x-1)*x;
means that the else block will execute, and hence x will wrap around to whatever the maximum size_t value is (as it's unsigned).
Then it's going to run from there down to 1, consuming one stack frame each time. At a minimum, that's going to consume 65,000 or so stack frames (based on the smallest maximum value of size_t the standard allows), but probably far more.
That's what's causing your crash. The fix is relatively simple: since 0! is defined as 1, you can simply change your statement to:
if (x <= 1)
return 1;
return fac (x-1) * x;
But you should keep in mind that recursive functions are best suited to cases where the solution space shrinks quickly, a classic example being binary search, where the solution space is halved on every recursive call.
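As a side illustration (hypothetical code, not from the question), a minimal recursive binary search only needs about log2(n) stack frames because the range is halved on every call:
// Recursive binary search over a sorted array; the search range [lo, hi]
// is halved on every recursive call, so recursion depth stays ~log2(n).
long long binarySearch(const int *a, long long lo, long long hi, int key)
{
    if (lo > hi) return -1;                  // not found
    long long mid = lo + (hi - lo) / 2;
    if (a[mid] == key) return mid;
    if (a[mid] < key)  return binarySearch(a, mid + 1, hi, key);
    return binarySearch(a, lo, mid - 1, key);
}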
Functions that don't reduce the solution space quickly are usually prone to stack overflow problems (unless the optimiser can optimise away the recursion). You may still run into problems if you pass in a big enough number, and it is not really different from adding two unsigned numbers with the bizarre method below (though I actually saw it put forward as a recursive example many moons ago):
def addu (unsigned a, b):
if b == 0:
return a
return addu (a + 1, b - 1)
So, in your case, I'd stick with the iterative solution, albeit making it bug-free:
auto res = 1;
for (auto i = 2; i <= x; ++i) // include the limit with <=.
res *= i; // multiply by i, not x.
Both definitions have different behavior for x=0. The loop will be fine as it uses the less-than operator:
auto res = 1;
for (auto i = 2; i < x; ++i) {
res *= x;
}
However,
if (x == 1) return 1;
else return fac(x-1)*x;
Results in quasi-infinite recursion, since x == 1 is false and x-1 wraps around to the largest possible value of std::size_t (typically 2^64 - 1).
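A tiny standalone snippet (an illustration, not from the original post) shows that wrap-around:
#include <cstddef>
#include <iostream>

int main()
{
    std::size_t x = 0;
    // Unsigned arithmetic wraps around: on a typical 64-bit platform this
    // prints 18446744073709551615, i.e. 2^64 - 1.
    std::cout << x - 1 << '\n';
}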
The recursive version does not take care of the case for x == 0.
You need:
std::size_t fac(std::size_t x) {
if (x == 1 || x == 0 ) return 1;
return fac(x-1)*x;
}
or
std::size_t fac(std::size_t x) {
if (x == 0 ) return 1;
return fac(x-1)*x;
}
Please see the below results and let me know where I can optimise my code further to get a better speedup.
Result
Machine used: MacBook Pro, Processor: 2.5 GHz Intel Core i5 (at least 4 logical cores)
Memory: 4GB 1600 MHz
Compiler: Mac OSX Compiler
Sequential Time:0.016466
Using two threads:0.0120111
Using four threads:0.0109911(Speed Up ~ 1.5)
Using 8 threads: 0.0111289
Machine II:
OS: Linux
Hardware: Intel(R) Core™ i5-3550 CPU @ 3.30GHz × 4
Memory: 7.7 GiB
Compiler: G++ Version 4.6
Sequential Time:0.0128901
Using two threads:0.00838804
Using four threads:0.00612688(Speed up = 2)
Using 8 threads: 0.0101049
Please let me know what overhead in my code prevents a linear speedup. There is not much in the code. I am calling the function "findParallelUCHWOUP" in the main function like this:
#pragma omp parallel for private(th_id)
for (th_id = 0; th_id < nthreads; th_id++)
findParallelUCHWOUP(points, th_id + 1, nthreads, inp_size, first[th_id], last[th_id]);
Code:
class Point {
double i, j;
public:
Point() {
i = 0;
j = 0;
}
Point(double x, double y) {
i = x;
j = y;
}
double x() const {
return i;
}
double y() const {
return j;
}
void setValue(double x, double y) {
i = x;
j = y;
}
};
typedef std::vector<Point> Vector;
int second(std::stack<int> &s);
double crossProduct(Point v[], int a, int b, int c);
bool myfunction(Point a, Point b) {
return ((a.x() < b.x()) || (a.x() == b.x() && a.y() < b.y()));
}
class CTPoint {
int i, j;
public:
CTPoint() {
i = 0;
j = 0;
}
CTPoint(int x, int y) {
i = x;
j = y;
}
double getI() const {
return i;
}
double getJ() const {
return j;
}
};
const int nthreads = 4;
const int inp_size = 1000000;
Point output[inp_size];
int numElems = inp_size / nthreads;
int sizes[nthreads];
CTPoint ct[nthreads][nthreads];
//function that is called from different threads
int findParallelUCHWOUP(Point* iv, int id, int thread_num, int inp_size, int first, int last) {
output[first] = iv[first];
std::stack<int> s;
s.push(first);
int i = first + 1;
while (i < last) {
if (crossProduct(iv, i, first, last) > 0) {
s.push(i);
i++;
break;
} else {
i++;
}
}
if (i == last) {
s.push(last);
return 0;
}
for (; i <= last; i++) {
if (crossProduct(iv, i, first, last) >= 0) {
while (s.size() > 1 && crossProduct(iv, s.top(), second(s), i) <= 0) {
s.pop();
}
s.push(i);
}
}
int count = s.size();
sizes[id - 1] = count;
while (!s.empty()) {
output[first + count - 1] = iv[s.top()];
s.pop();
count--;
}
return 0;
}
double crossProduct(Point* v, int a, int b, int c) {
return (v[c].x() - v[b].x()) * (v[a].y() - v[b].y())
- (v[a].x() - v[b].x()) * (v[c].y() - v[b].y());
}
int second(std::stack<int> &s) {
int temp = s.top();
s.pop();
int sec = s.top();
s.push(temp);
return sec;
}
//reads points from a file and divides the array of points to different threads
int main(int argc, char *argv[]) {
// read points from a file into a std::vector<Point> named Points, then copy them to the input array (reading code omitted)
Point *points = new Point[inp_size];
unsigned i = 0;
while (i < Points.size()) {
points[i] = Points[i];
i++;
}
numElems = inp_size / nthreads;
int first[nthreads];
int last[nthreads];
for(int i=1;i<=nthreads;i++){
first[i-1] = (i - 1) * numElems;
if (i == nthreads) {
last[i-1] = inp_size - 1;
} else {
last[i-1] = i * numElems - 1;
}
}
/* Parallel Code starts here*/
int th_id;
omp_set_num_threads(nthreads);
double start = omp_get_wtime();
#pragma omp parallel for private(th_id)
for (th_id = 0; th_id < nthreads; th_id++)
findParallelUCHWOUP(points, th_id + 1, nthreads, inp_size, first[th_id], last[th_id]);
/* Parallel Code ends here*/
double end = omp_get_wtime();
double diff = end - start;
std::cout << "Time Elapsed in seconds:" << diff << '\n';
return 0;
}
Threading in general, and in your particular case OpenMP, does introduce a certain amount of overhead that essentially prevents you from getting "real" linear speedup. You have to account for that.
Second, the runtime of your test is extremely short (I assume the times measured are seconds?). At that level you are also running into issues with the precision of timing the functions, since a very small amount of overhead has a large impact on the measured result.
Last, you are also dealing with memory access here, and if both the chunks you are processing and the stack you are creating don't fit into the processor cache, you also have to account for the overhead of fetching data from memory. The latter gets worse if you have multiple threads reading and possibly writing to the same area of memory. That results in invalidated cache lines, which means your cores will be waiting for data to be fetched into the cache and/or written back to main memory.
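As a hedged illustration of that last point: the sizes array in the question is written by all threads, and neighbouring int elements can sit in the same cache line. One way to sketch around that (assuming a 64-byte cache line; the real size is hardware-dependent):
// Pad each per-thread counter to its own cache line so threads writing their
// own element do not invalidate each other's cache lines (false sharing).
struct alignas(64) PaddedCount {
    int value;
};

PaddedCount sizes[nthreads];     // instead of: int sizes[nthreads];

// inside findParallelUCHWOUP, write to the padded slot:
// sizes[id - 1].value = count;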
I would massively increase the size of your data so that you get runtimes in the seconds, for starters, then measure again. The longer your test code runs, the better, because the startup and general overhead of the threading will play less of a role the more processing you do.
Once you established a better baseline, you'll probably need a good profiler that gives you deeper insight into threading to see where the hotspots are in your code. It's not unusual that you might have to roll custom data structures for your parallelized part to improve the performance.
I was trying to prove a point with OpenMP compared to MPICH, and I cooked up the following example to demonstrate how easy it is to get high performance with OpenMP.
The Gauss-Seidel iteration is split into two separate sweeps, such that within each sweep every operation can be performed in any order and there should be no dependency between the tasks. So in theory no processor should ever have to wait for another process to perform any kind of synchronization.
The problem I am encountering is that, independent of problem size, I find only a weak speed-up with 2 processors, and with more than 2 processors it might even be slower.
For many other parallelized linear routines I can obtain very good scaling, but this one is tricky.
My fear is that I am unable to "explain" to the compiler that the operation I perform on the array is thread-safe, so that it cannot be really effective.
See the example below.
Anyone has any clue on how to make this more effective with OpenMP?
void redBlackSmooth(std::vector<double> const & b,
std::vector<double> & x,
double h)
{
// Setup relevant constants.
double const invh2 = 1.0/(h*h);
double const h2 = (h*h);
int const N = static_cast<int>(x.size());
double sigma = 0;
// Setup some boundary conditions.
x[0] = 0.0;
x[N-1] = 0.0;
// Red sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 1; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2/2.0)*(b[i] - sigma);
}
// Black sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 2; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2/2.0)*(b[i] - sigma);
}
}
Addition:
I have now also tried a raw-pointer implementation, and it has the same behavior as using the STL container, so it can be ruled out that this is some pseudo-critical behavior coming from the STL.
First of all, make sure that the x vector is aligned to cache boundaries. I did some tests, and I get something like a 100% improvement with your code on my machine (Core Duo) if I force the alignment of memory:
#include <stdlib.h>   // posix_memalign / free (POSIX)
double * x;
const size_t CACHE_LINE_SIZE = 256;
posix_memalign( reinterpret_cast<void**>(&x), CACHE_LINE_SIZE, sizeof(double) * N);
// ... use x, then release it with free(x);
Second, you can try to assign more computation to each thread (in this way you can keep cache lines separated), but I suspect that OpenMP already does something like this under the hood, so it may be worthless for large N.
In my case this implementation is much faster when x is not cache-aligned.
const int workGroupSize = CACHE_LINE_SIZE / sizeof(double);
assert(N % workGroupSize == 0); //Need to tweak the code a bit to let it work with any N
const int workgroups = N / workGroupSize;
int j, base , k, i;
#pragma omp parallel for shared(b, x) private(sigma, j, base, k, i)
for ( j = 0; j < workgroups; j++ ) {
base = j * workGroupSize;
for (int k = 0; k < workGroupSize; k+=2)
{
i = base + k + (redSweep ? 1 : 0);
if ( i == 0 || i+1 == N) continue;
sigma = -invh2* ( x[i-1] + x[i+1] );
x[i] = ( h2/2.0 ) * ( b[i] - sigma );
}
}
In conclusion, you definitely have a cache-fighting problem (threads invalidating each other's cache lines), but given the way OpenMP works (sadly I am not familiar with it), it should be enough to work with properly allocated buffers.
I think the main problem is the type of array structure you are using. Let's try comparing results with vectors and arrays (arrays = C arrays allocated with the new operator).
Vector and array sizes are N = 10000000. I force the smoothing function to repeat in order to keep the runtime above 0.1 seconds.
Vector Time: 0.121007 Repeat: 1 MLUPS: 82.6399
Array Time: 0.164009 Repeat: 2 MLUPS: 121.945
MLUPS = ((N-2)*repeat/runtime)/1000000 (Million Lattice Points Update per second)
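For example, plugging the vector run into that formula gives ((10000000 - 2) * 1 / 0.121007) / 1000000 ≈ 82.6 MLUPS, which matches the reported value.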
MFLOPS are misleading when it comes to grid calculations: a few changes in the basic equation can make the performance look high for the same runtime.
The modified code:
double my_redBlackSmooth(double *b, double* x, double h, int N)
{
// Setup relevant constants.
double const invh2 = 1.0/(h*h);
double const h2 = (h*h);
double sigma = 0;
// Setup some boundary conditions.
x[0] = 0.0;
x[N-1] = 0.0;
double runtime(0.0), wcs, wce;
int repeat = 1;
timing(&wcs);
for(; runtime < 0.1; repeat*=2)
{
for(int r = 0; r < repeat; ++r)
{
// Red sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 1; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2*0.5)*(b[i] - sigma);
}
// Black sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 2; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2*0.5)*(b[i] - sigma);
}
// cout << "In Array: " << r << endl;
}
if(x[0] != 0) dummy(x[0]);
timing(&wce);
runtime = (wce-wcs);
}
// cout << "Before division: " << repeat << endl;
repeat /= 2;
cout << "Array Time:\t" << runtime << "\t" << "Repeat:\t" << repeat
<< "\tMLUPS:\t" << ((N-2)*repeat/runtime)/1000000.0 << endl;
return runtime;
}
I didn't change anything in the code except the array type. For better cache access and blocking you should look into data alignment (_mm_malloc).
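For example, a minimal sketch of an aligned allocation with _mm_malloc (the 64-byte alignment here is an assumption; use your platform's cache-line size):
#include <xmmintrin.h>   // _mm_malloc / _mm_free

// Allocate a 64-byte-aligned array of N doubles and release it when done.
double *x = static_cast<double *>(_mm_malloc(N * sizeof(double), 64));
// ... run the smoother on x ...
_mm_free(x);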