I want to optimize my application using vectorization. More specifically, I want to vectorize the mathematical operations on the std::complex<double> type. However, this seems to be quite difficult. Consider the following example:
#define TEST_LEN 100
#include <algorithm>
#include <complex>
typedef std::complex<double> cmplx;
using namespace std::complex_literals;
#pragma omp declare simd
cmplx add(cmplx a, cmplx b)
{
return a + b;
}
#pragma omp declare simd
cmplx mult(cmplx a, cmplx b)
{
return a * b;
}
void k(cmplx *x, cmplx *&y, int i0, int N)
{
#pragma omp for simd
for (int i = i0; i < N; i++)
y[i] = add(mult(-(1i + 1.0), x[i]), 1i);
}
int main(int argc, char **argv)
{
cmplx *x = new cmplx[TEST_LEN];
cmplx *y = new cmplx[TEST_LEN];
for (int i = 0; i < TEST_LEN; i++)
x[i] = 0;
for (int i = 0; i < TEST_LEN; i++)
{
int N = std::min(4, TEST_LEN - i);
k(x, y, i, N);
}
delete[] x;
delete[] y;
return 1;
}
I am using the g++ compiler. For this code the compiler gives the following warning:
warning: unsupported return type 'cmplx' {aka 'std::complex'} for simd
for the lines containing the mult and add function.
It seems like it is not possible to vectorize the std::complex<double> type like this.
Is there a different way how this can be archieved?
Not easily. SIMD works quite well when you have values in the next N steps that behave the same way. So consider for example an array of 2D vectors:
X Y X Y X Y X Y
If we were to do a vector addition operation here,
X Y X Y X Y X Y
+ + + + + + + +
X Y X Y X Y X Y
The compiler will nicely vectorise that operation. If however we were to want to do something different for the X and Y values, the memory layout becomes problematic for SIMD:
X Y X Y X Y X Y
+ / + / + / + /
X Y X Y X Y X Y
If you consider for example the multiplication case:
(a + bi) (c + di) = (ac - bd) (ad + bc)i
Suddenly the operations are jumping between SIMD lanes, which is pretty much going to kill any decent vectorization.
Take a quick look at this godbolt: https://godbolt.org/z/rnVVgl
Addition boils down to some vaddps instructions (working on 8 floats at a time).
Multiply ends up using vfmadd231ss and vmulss (which both work on 1 float at a time).
The only easy way to automatically vectorise your complex code would be to seperate out the real and imaginary parts into 2 arrays:
struct ComplexArray {
float* real;
float* imaginary;
};
Within this godbolt you can see that the compiler is now using vfmadd213ps instructions (so again back to working on 8 floats at a time).
https://godbolt.org/z/Ostaax
I have a program that solves generally for 1D brownian motion using an Euler's Method.
Being a stochastic process, I want to average it over many particles. But I find that as I ramp up the number of particles, it overloads and i get the std::badalloc error, which I understand is a memory error.
Here is my full code
#include <iostream>
#include <vector>
#include <fstream>
#include <cmath>
#include <cstdlib>
#include <limits>
#include <ctime>
using namespace std;
// Box-Muller Method to generate gaussian numbers
double generateGaussianNoise(double mu, double sigma) {
const double epsilon = std::numeric_limits<double>::min();
const double tau = 2.0 * 3.14159265358979323846;
static double z0, z1;
static bool generate;
generate = !generate;
if (!generate) return z1 * sigma + mu;
double u1, u2;
do {
u1 = rand() * (1.0 / RAND_MAX);
u2 = rand() * (1.0 / RAND_MAX);
} while (u1 <= epsilon);
z0 = sqrt(-2.0 * log(u1)) * cos(tau * u2);
z1 = sqrt(-2.0 * log(u1)) * sin(tau * u2);
return z0 * sigma + mu;
}
int main() {
// Initialize Variables
double gg; // Gaussian Number Picked from distribution
// Integrator
double t0 = 0; // Setting the Time Window
double tf = 10;
double n = 5000; // Number of Steps
double h = (tf - t0) / n; // Time Step Size
// Set Constants
const double pii = atan(1) * 4; // pi
const double eta = 1; // viscous constant
const double m = 1; // mass
const double aa = 1; // radius
const double Temp = 30; // Temperature in Kelvins
const double KB = 1; // Boltzmann Constant
const double alpha = (6 * pii * eta * aa);
// More Constants
const double mu = 0; // Gaussian Mean
const double sigma = 1; // Gaussian Std Deviation
const double ng = n; // No. of pts to generate for Gauss distribution
const double npart = 1000; // No. of Particles
// Initial Conditions
double x0 = 0;
double y0 = 0;
double t = t0;
// Vectors
vector<double> storX; // Vector that keeps displacement values
vector<double> storY; // Vector that keeps velocity values
vector<double> storT; // Vector to store time
vector<double> storeGaussian; // Vector to store Gaussian numbers generated
vector<double> holder; // Placeholder Vector for calculation operations
vector<double> mainstore; // Vector that holds the final value desired
storT.push_back(t0);
// Prepares mainstore
for (int z = 0; z < (n+1); z++) {
mainstore.push_back(0);
}
for (int NN = 0; NN < npart; NN++) {
holder.clear();
storX.clear();
storY.clear();
storT.clear();
storT.push_back(0);
// Prepares holder
for (int z = 0; z < (n+1); z++) {
holder.push_back(0);
storX.push_back(0);
storY.push_back(0);
}
// Gaussian Generator
srand(time(NULL));
for (double iiii = 0; iiii < ng; iiii++) {
gg = generateGaussianNoise(0, 1); // generateGaussianNoise(mu,sigma)
storeGaussian.push_back(gg);
}
// Solver
for (int ii = 0; ii < n; ii++) {
storY[ii + 1] =
storY[ii] - (alpha / m) * storY[ii] * h +
(sqrt(2 * alpha * KB * Temp) / m) * sqrt(h) * storeGaussian[ii];
storX[ii + 1] = storX[ii] + storY[ii] * h;
holder[ii + 1] =
pow(storX[ii + 1], 2); // Finds the displacement squared
t = t + h;
storT.push_back(t);
}
// Updates the Main Storage
for (int z = 0; z < storX.size(); z++) {
mainstore[z] = mainstore[z] + holder[z];
}
}
// Average over the number of particles
for (int z = 0; z < storX.size(); z++) {
mainstore[z] = mainstore[z] / (npart);
}
// Outputs the data
ofstream fout("LangevinEulerTest.txt");
for (int jj = 0; jj < storX.size(); jj++) {
fout << storT[jj] << '\t' << mainstore[jj] << '\t' << storX[jj] << endl;
}
return 0;
}
As you can see, npart is the variable that I change to vary the number of particles. But after each iteration, I do clear my storage vectors like storX,storY... So on paper, the number of particles should not affect memory? I am only just calling the compiler to repeat many more times, and add onto the main storage vector mainstore. I am running my code on a computer with 4GB ram.
Would greatly appreciate it if anyone could point out my errors in logic or suggest improvements.
Edit: Currently the number of particles is set to npart = 1000.
So when I try to ramp it up to like npart = 20000 or npart = 50000, it gives me memory errors.
Edit2 I've edited the code to allocate an extra index to each of the storage vectors. But it does not seem to fix the memory overflow
There is an out of bounds exception in the solver part. storY has size n and you access ii+1 where i goes up to n-1. So for your code provided. storY has size 5000. It is allowed to access with indices between 0 and 4999 (including) but you try to access with index 5000. The same for storX, holder and mainstore.
Also, storeGaussian does not get cleared before adding new variables. It grows by n for each npart loop. You access only the first n values of it in the solver part anyway.
Please note, that vector::clear removes all elements from the vector, but does not necessarily change the vector's capacity (i.e. it's storage array), see the documentation.
This won't cause the problem here, because you'll reuse the same array in the next runs, but it's something to be aware when using vectors.
I managed to get my sqrt function to run perfectly, but I'm second guessing if I wrote this code correctly based on the pseudo code I was given.
Here is the pseudo code:
x = 1
repeat 10 times: x = (x + n / x) / 2
return x.
The code I wrote,
#include <iostream>
#include <math.h>
using namespace std;
double my_sqrt_1(double n)
{
double x= 1; x<10; ++x;
return (x+n/x)/2;
}
No, your code is not following your pseudo-code. For example, you're not repeating anything in your code. You need to add a loop to do that:
#include <iostream>
#include <math.h>
using namespace std;
double my_sqrt_1(double n)
{
double x = 1;
for(int i = 0; i < 10; ++i) // repeat 10 times
x = (x+n/x)/2;
return x;
}
Let's analyze your code:
double x = 1;
// Ok, x set to 1
x < 10;
// This is true, as 1 is less than 10, but it is not used anywhere
++x;
// Increment x - now x == 2
return (x + n / x) / 2
// return value is always (2 + n / 2) / 2
As you don't have any loop, function will always exit in the first "iteration" with the return value (2 + n / 2) / 2.
Just as another approach that you can use binary search or the another pretty elegant solution is to use the Newton's method.
Newton's method is a method for finding roots of a function, making use of a function's derivative. At each step, a value is calculated as: x(step) = x(step-1) - f(x(step-1))/f'(x(step-1)) Newton's_method
This might be faster than binary search.My implementation in C++:
double NewtonMethod(double x) {
double eps = 0.0001; //the precision
double x0 = 10;
while( fabs(x-x0) > eps) {
double a = x0*x0-n;
double r = a/(2*x0);
x = x0 - r;
x0 = x;
}
return x;
}
Since people are showing different approaches to calculating the square root, I couldn't resist ;)...
Below is the exact copy (with the original comments, but without preprocessor directives) of the inverse square root implementation from Quake III Arena:
float Q_rsqrt( float number )
{
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y; // evil floating point bit level hacking
i = 0x5f3759df - ( i >> 1 ); // what the...?
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
return y;
}
I am trying to calculate complex numbers for a 2D array in C++. The code is running very slowly and I have narrowed down the main cause to be the exp function (the program runs quickly when I comment out that line, even though I have 4 nested loops).
int main() {
typedef vector< complex<double> > complexVect;
typedef vector<double> doubleVect;
const int SIZE = 256;
vector<doubleVect> phi_w(SIZE, doubleVect(SIZE));
vector<complexVect> phi_k(SIZE, complexVect(SIZE));
complex<double> i (0, 1), cmplx (0, 0);
complex<double> temp;
int x, y, t, k, w;
double dk = 2.0*M_PI / (SIZE-1);
double dt = M_PI / (SIZE-1);
int xPos, yPos;
double arg, arg2, arg4;
complex<double> arg3;
double angle;
vector<complexVect> newImg(SIZE, complexVect(SIZE));
for (x = 0; x < SIZE; ++x) {
xPos = -127 + x;
for (y = 0; y < SIZE; ++y) {
yPos = -127 + y;
for (t = 0; t < SIZE; ++t) {
temp = cmplx;
angle = dt * t;
arg = xPos * cos(angle) + yPos * sin(angle);
for (k = 0; k < SIZE; ++k) {
arg2 = -M_PI + dk*k;
arg3 = exp(-i * arg * arg2);
arg4 = abs(arg) * M_PI / (abs(arg) + M_PI);
temp = temp + arg4 * arg3 * phi_k[k][t];
}
}
newImg[y][x] = temp;
}
}
}
Is there a way I can improve computation time? I have tried using the following helper function but it doesn't noticeably help.
complex<double> complexexp(double arg) {
complex<double> temp (sin(arg), cos(arg));
return temp;
}
I am using clang++ to compile my code
edit: I think the problem is the fact that I'm trying to calculate complex numbers. Would it be faster if I just used Euler's formula to calculate the real and imaginary parts in separate arrays and not have to deal with the complex class?
maybe this will work for you:
http://martin.ankerl.com/2007/02/11/optimized-exponential-functions-for-java/
I've had a look with callgrind. The only marginal improvement (~1.3% with size = 50) I could find was to change:
temp = temp + arg4 * arg3 * phi_k[k][t];
to
temp += arg4 * arg3 * phi_k[k][t];
The most costly function calls were sin()/cos(). I suspect that calling exp() with a complex number argument calls those functions in the background.
To retain precision, the function will compute very slowly and there doesn't seem to be a way around it. However, you could trade precision for accuracy, which seems to be what game developers would do: sin and cos are slow, is there an alternatve?
You can define number e as a constant and use std::pow() function
I'm writing a sparse matrix solver using the Gauss-Seidel method. By profiling, I've determined that about half of my program's time is spent inside the solver. The performance-critical part is as follows:
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
All arrays involved are of float type. Actually, they are not arrays but objects with an overloaded [] operator, which (I think) should be optimized away, but is defined as follows:
inline float &operator[](size_t i) { return d_cells[i]; }
inline float const &operator[](size_t i) const { return d_cells[i]; }
For d_nx = d_ny = 128, this can be run about 3500 times per second on an Intel i7 920. This means that the inner loop body runs 3500 * 128 * 128 = 57 million times per second. Since only some simple arithmetic is involved, that strikes me as a low number for a 2.66 GHz processor.
Maybe it's not limited by CPU power, but by memory bandwidth? Well, one 128 * 128 float array eats 65 kB, so all 6 arrays should easily fit into the CPU's L3 cache (which is 8 MB). Assuming that nothing is cached in registers, I count 15 memory accesses in the inner loop body. On a 64-bits system this is 120 bytes per iteration, so 57 million * 120 bytes = 6.8 GB/s. The L3 cache runs at 2.66 GHz, so it's the same order of magnitude. My guess is that memory is indeed the bottleneck.
To speed this up, I've attempted the following:
Compile with g++ -O3. (Well, I'd been doing this from the beginning.)
Parallelizing over 4 cores using OpenMP pragmas. I have to change to the Jacobi algorithm to avoid reads from and writes to the same array. This requires that I do twice as many iterations, leading to a net result of about the same speed.
Fiddling with implementation details of the loop body, such as using pointers instead of indices. No effect.
What's the best approach to speed this guy up? Would it help to rewrite the inner body in assembly (I'd have to learn that first)? Should I run this on the GPU instead (which I know how to do, but it's such a hassle)? Any other bright ideas?
(N.B. I do take "no" for an answer, as in: "it can't be done significantly faster, because...")
Update: as requested, here's a full program:
#include <iostream>
#include <cstdlib>
#include <cstring>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;
void step() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void solve(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step();
}
}
void clear(float *a) {
memset(a, 0, d_nx * d_ny * sizeof(float));
}
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d_b = new float[n]; clear(d_b);
d_w = new float[n]; clear(d_w);
d_e = new float[n]; clear(d_e);
d_s = new float[n]; clear(d_s);
d_n = new float[n]; clear(d_n);
solve(atoi(argv[1]));
cout << d_x[0] << endl; // prevent the thing from being optimized away
}
I compile and run it as follows:
$ g++ -o gstest -O3 gstest.cpp
$ time ./gstest 8000
0
real 0m1.052s
user 0m1.050s
sys 0m0.010s
(It does 8000 instead of 3500 iterations per second because my "real" program does a lot of other stuff too. But it's representative.)
Update 2: I've been told that unititialized values may not be representative because NaN and Inf values may slow things down. Now clearing the memory in the example code. It makes no difference for me in execution speed, though.
Couple of ideas:
Use SIMD. You could load 4 floats at a time from each array into a SIMD register (e.g. SSE on Intel, VMX on PowerPC). The disadvantage of this is that some of the d_x values will be "stale" so your convergence rate will suffer (but not as bad as a jacobi iteration); it's hard to say whether the speedup offsets it.
Use SOR. It's simple, doesn't add much computation, and can improve your convergence rate quite well, even for a relatively conservative relaxation value (say 1.5).
Use conjugate gradient. If this is for the projection step of a fluid simulation (i.e. enforcing non-compressability), you should be able to apply CG and get a much better convergence rate. A good preconditioner helps even more.
Use a specialized solver. If the linear system arises from the Poisson equation, you can do even better than conjugate gradient using an FFT-based methods.
If you can explain more about what the system you're trying to solve looks like, I can probably give some more advice on #3 and #4.
I think I've managed to optimize it, here's a code, create a new project in VC++, add this code and simply compile under "Release".
#include <iostream>
#include <cstdlib>
#include <cstring>
#define _WIN32_WINNT 0x0400
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <conio.h>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;
void step_original() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void step_new() {
//size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
float
*d_b_ic,
*d_w_ic,
*d_e_ic,
*d_x_ic,
*d_x_iw,
*d_x_ie,
*d_x_is,
*d_x_in,
*d_n_ic,
*d_s_ic;
d_b_ic = d_b;
d_w_ic = d_w;
d_e_ic = d_e;
d_x_ic = d_x;
d_x_iw = d_x;
d_x_ie = d_x;
d_x_is = d_x;
d_x_in = d_x;
d_n_ic = d_n;
d_s_ic = d_s;
for (size_t y = 1; y < d_ny - 1; ++y)
{
for (size_t x = 1; x < d_nx - 1; ++x)
{
/*d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];*/
*d_x_ic = *d_b_ic
- *d_w_ic * *d_x_iw - *d_e_ic * *d_x_ie
- *d_s_ic * *d_x_is - *d_n_ic * *d_x_in;
//++ic; ++iw; ++ie; ++is; ++in;
d_b_ic++;
d_w_ic++;
d_e_ic++;
d_x_ic++;
d_x_iw++;
d_x_ie++;
d_x_is++;
d_x_in++;
d_n_ic++;
d_s_ic++;
}
//ic += 2; iw += 2; ie += 2; is += 2; in += 2;
d_b_ic += 2;
d_w_ic += 2;
d_e_ic += 2;
d_x_ic += 2;
d_x_iw += 2;
d_x_ie += 2;
d_x_is += 2;
d_x_in += 2;
d_n_ic += 2;
d_s_ic += 2;
}
}
void solve_original(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step_original();
}
}
void solve_new(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step_new();
}
}
void clear(float *a) {
memset(a, 0, d_nx * d_ny * sizeof(float));
}
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d_b = new float[n]; clear(d_b);
d_w = new float[n]; clear(d_w);
d_e = new float[n]; clear(d_e);
d_s = new float[n]; clear(d_s);
d_n = new float[n]; clear(d_n);
if(argc < 3)
printf("app.exe (x)iters (o/n)algo\n");
bool bOriginalStep = (argv[2][0] == 'o');
size_t iters = atoi(argv[1]);
/*printf("Press any key to start!");
_getch();
printf(" Running speed test..\n");*/
__int64 freq, start, end, diff;
if(!::QueryPerformanceFrequency((LARGE_INTEGER*)&freq))
throw "Not supported!";
freq /= 1000000; // microseconds!
{
::QueryPerformanceCounter((LARGE_INTEGER*)&start);
if(bOriginalStep)
solve_original(iters);
else
solve_new(iters);
::QueryPerformanceCounter((LARGE_INTEGER*)&end);
diff = (end - start) / freq;
}
printf("Speed (%s)\t\t: %u\n", (bOriginalStep ? "original" : "new"), diff);
//_getch();
//cout << d_x[0] << endl; // prevent the thing from being optimized away
}
Run it like this:
app.exe 10000 o
app.exe 10000 n
"o" means old code, yours.
"n" is mine, the new one.
My results:
Speed (original):
1515028
1523171
1495988
Speed (new):
966012
984110
1006045
Improvement of about 30%.
The logic behind:
You've been using index counters to access/manipulate.
I use pointers.
While running, breakpoint at a certain calculation code line in VC++'s debugger, and press F8. You'll get the disassembler window.
The you'll see the produced opcodes (assembly code).
Anyway, look:
int *x = ...;
x[3] = 123;
This tells the PC to put the pointer x at a register (say EAX).
The add it (3 * sizeof(int)).
Only then, set the value to 123.
The pointers approach is much better as you can understand, because we cut the adding process, actually we handle it ourselves, thus able to optimize as needed.
I hope this helps.
Sidenote to stackoverflow.com's staff:
Great website, I hope I've heard of it long ago!
For one thing, there seems to be a pipelining issue here. The loop reads from the value in d_x that has just been written to, but apparently it has to wait for that write to complete. Just rearranging the order of the computation, doing something useful while it's waiting, makes it almost twice as fast:
d_x[ic] = d_b[ic]
- d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in]
- d_w[ic] * d_x[iw] /* d_x[iw] has just been written to, process this last */;
It was Eamon Nerbonne who figured this out. Many upvotes to him! I would never have guessed.
Poni's answer looks like the right one to me.
I just want to point out that in this type of problem, you often gain benefits from memory locality. Right now, the b,w,e,s,n arrays are all at separate locations in memory. If you could not fit the problem in L3 cache (mostly in L2), then this would be bad, and a solution of this sort would be helpful:
size_t d_nx = 128, d_ny = 128;
float *d_x;
struct D { float b,w,e,s,n; };
D *d;
void step() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d[ic].b
- d[ic].w * d_x[iw] - d[ic].e * d_x[ie]
- d[ic].s * d_x[is] - d[ic].n * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void solve(size_t iters) { for (size_t i = 0; i < iters; ++i) step(); }
void clear(float *a) { memset(a, 0, d_nx * d_ny * sizeof(float)); }
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d = new D[n]; memset(d,0,n * sizeof(D));
solve(atoi(argv[1]));
cout << d_x[0] << endl; // prevent the thing from being optimized away
}
For example, this solution at 1280x1280 is a little less than 2x faster than Poni's solution (13s vs 23s in my test--your original implementation is then 22s), while at 128x128 it's 30% slower (7s vs. 10s--your original is 10s).
(Iterations were scaled up to 80000 for the base case, and 800 for the 100x larger case of 1280x1280.)
I think you're right about memory being a bottleneck. It's a pretty simple loop with just some simple arithmetic per iteration. the ic, iw, ie, is, and in indices seem to be on opposite sides of the matrix so i'm guessing that there's a bunch of cache misses there.
I'm no expert on the subject, but I've seen that there are several academic papers on improving the cache usage of the Gauss-Seidel method.
Another possible optimization is the use of the red-black variant, where points are updated in two sweeps in a chessboard-like pattern. In this way, all updates in a sweep are independent and can be parallelized.
I suggest putting in some prefetch statements and also researching "data oriented design":
void step_original() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
float dw_ic, dx_ic, db_ic, de_ic, dn_ic, ds_ic;
float dx_iw, dx_is, dx_ie, dx_in, de_ic, db_ic;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
// Perform the prefetch
// Sorting these statements by array may increase speed;
// although sorting by index name may increase speed too.
db_ic = d_b[ic];
dw_ic = d_w[ic];
dx_iw = d_x[iw];
de_ic = d_e[ic];
dx_ie = d_x[ie];
ds_ic = d_s[ic];
dx_is = d_x[is];
dn_ic = d_n[ic];
dx_in = d_x[in];
// Calculate
d_x[ic] = db_ic
- dw_ic * dx_iw - de_ic * dx_ie
- ds_ic * dx_is - dn_ic * dx_in;
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
This differs from your second method since the values are copied to local temporary variables before the calculation is performed.