Problems with Rcpp precision - c++

Consider the following C++ function in R using Rcpp:
cppFunction('long double statZn_cpp(NumericVector dat, double kn) {
  double n = dat.size();
  // Get total sum and sum of squares; this will be the "upper sum"
  // (i.e. the sum above k)
  long double s_upper, s_square_upper;
  // The "lower sums" (i.e. those below k)
  long double s_lower, s_square_lower;
  // Get lower sums
  // Go to kn - 1 to prevent double-counting in main loop
  for (int i = 0; i < kn - 1; ++i) {
    s_lower += dat[i];
    s_square_lower += dat[i] * dat[i];
  }
  // Get upper sum
  for (int i = kn - 1; i < n; ++i) {
    s_upper += dat[i];
    s_square_upper += dat[i] * dat[i];
  }
  // The maximum, which will be returned
  long double M = 0;
  // A candidate for the new maximum, used in a loop
  long double M_candidate;
  // Compute the test statistic
  for (int k = kn; k <= (n - kn); ++k) {
    // Update s and s_square for both lower and upper
    s_lower += dat[k-1];
    s_square_lower += dat[k-1] * dat[k-1];
    s_upper -= dat[k-1];
    s_square_upper -= dat[k-1] * dat[k-1];
    // Get estimate of sd for this k
    long double sdk = sqrt((s_square_lower - pow(s_lower, 2.0) / k +
                            s_square_upper -
                            pow(s_upper, 2.0) / (n - k))/n);
    M_candidate = abs(s_lower / k - s_upper / (n - k)) / sdk;
    // Choose new maximum
    if (M_candidate > M) {
      M = M_candidate;
    }
  }
  return M * sqrt(kn);
}')
Try the command statZn_cpp(1:20,4), and you will get 6.963106, which is the correct answer. Scaling should not matter; statZn_cpp(1:20*10,4) will also yield the correct answer of 6.963106. But statZn_cpp(1:20/10,4) yields the wrong answer of 6.575959, and statZn_cpp(1:20/100,4) again gives you the obviously wrong answer of 0. More to the point (and relevant to my research, which involves simulation studies), when I try statZn_cpp(rnorm(20),4), the answer is almost always 0, which is wrong.
Clearly the problem has to do with rounding errors, but I don't know where they are or how to fix them (I am brand new to C++). I've tried to expand precision as much as possible. Is there a way to fix the rounding problem? (An R wrapper function is permissible if I should be attempting what amounts to a preprocessing step, but it needs to be robust, working for general levels of precision.)
EDIT: Here is some "equivalent" R code:
statZn <- function(dat, kn = function(n) {floor(sqrt(n))}) {
  n = length(dat)
  return(sqrt(kn(n)) * max(sapply(
    floor(kn(n)):(n - floor(kn(n))), function(k)
      abs(1/k * sum(dat[1:k]) -
            1/(n-k) * sum(dat[(k+1):n])) /
        sqrt((sum((dat[1:k] - mean(dat[1:k]))^2) +
                sum((dat[(k+1):n] - mean(dat[(k+1):n]))^2)) / n))))
}
Also, the R code below basically replicates the method that should be used by the C++ code. It is capable of reaching the correct answer.
n = length(dat)
s_lower = 0
s_square_lower = 0
s_upper = 0
s_square_upper = 0
for (i in 1:(kn-1)) {
  s_lower = s_lower + dat[i]
  s_square_lower = s_square_lower + dat[i] * dat[i]
}
for (i in kn:n) {
  s_upper = s_upper + dat[i]
  s_square_upper = s_square_upper + dat[i] * dat[i]
}
M = 0
for (k in kn:(n-kn)) {
  s_lower = s_lower + dat[k]
  s_square_lower = s_square_lower + dat[k] * dat[k]
  s_upper = s_upper - dat[k]
  s_square_upper = s_square_upper - dat[k] * dat[k]
  sdk = sqrt((s_square_lower - (s_lower)^2/k +
                s_square_upper -
                (s_upper)^2/(n-k))/n)
  M_candidate = sqrt(kn) * abs(s_lower / k - s_upper / (n - k)) / sdk
  cat('k', k, '\n',
      "s_lower", s_lower, '\n',
      's_square_lower', s_square_lower, '\n',
      's_upper', s_upper, '\n',
      's_square_upper', s_square_upper, '\n',
      'sdk', sdk, '\n',
      'M_candidate', M_candidate, '\n\n')
  if (M_candidate > M) {
    M = M_candidate
  }
}

1: You should not be using long double, since R represents all numeric values in the double type. Using a more precise type for intermediate calculations is extremely unlikely to provide any benefit, and is more likely to result in strange inconsistencies between platforms.
2: You're not initializing s_upper, s_square_upper, s_lower, and s_square_lower. (You actually are initializing them in the R implementation, but you forgot in the C++ implementation.)
3: Minor point, but I would replace the pow(x, 2.0) calls with x*x, although this doesn't really matter much.
4: This is what fixed it for me: you need to qualify calls to C++ standard library functions with their containing namespace. In other words, use std::sqrt() instead of just sqrt(), std::abs() instead of just abs(), and std::pow() instead of just pow() if you continue to use it.
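To see why point 4 matters here, consider the following small standalone check (an illustrative sketch, not part of the original code): depending on which overloads happen to be visible, an unqualified abs() can bind to the C function int abs(int), so a floating-point argument is truncated to an integer first. That would explain why statistic values below 1 collapse to 0.
#include <cstdlib>   // C-style int abs(int)
#include <cmath>     // std::abs(double)
#include <iostream>

int main() {
    double d = -0.66;
    // What an unqualified abs() can silently do when it binds to the C
    // function: the argument is converted to int, losing the fraction.
    std::cout << std::abs((int) d) << "\n";  // prints 0
    // The namespace-qualified call selects the floating-point overload.
    std::cout << std::abs(d) << "\n";        // prints 0.66
    return 0;
}
With those changes applied, the full fixed function looks like this: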
cppFunction('double statZn_cpp(NumericVector dat, double kn) {
  int n = dat.size();
  double s_upper = 0, s_square_upper = 0; // Get total sum and sum of squares; this will be the "upper sum" (i.e. the sum above k)
  double s_lower = 0, s_square_lower = 0; // The "lower sums" (i.e. those below k)
  for (int i = 0; i < kn - 1; ++i) { s_lower += dat[i]; s_square_lower += dat[i] * dat[i]; } // Get lower sums; go to kn - 1 to prevent double-counting in main
  for (int i = kn - 1; i < n; ++i) { s_upper += dat[i]; s_square_upper += dat[i] * dat[i]; } // Get upper sum
  double M = 0; // The maximum, which will be returned
  double M_candidate; // A candidate for the new maximum, used in a loop
  // Compute the test statistic
  for (int k = kn; k <= (n - kn); ++k) {
    // Update s and s_square for both lower and upper
    s_lower += dat[k-1];
    s_square_lower += dat[k-1] * dat[k-1];
    s_upper -= dat[k-1];
    s_square_upper -= dat[k-1] * dat[k-1];
    // Get estimate of sd for this k
    double sdk = std::sqrt((s_square_lower - s_lower*s_lower / k + s_square_upper - s_upper*s_upper / (n - k))/n);
    M_candidate = std::abs(s_lower / k - s_upper / (n - k)) / sdk;
    if (M_candidate > M) M = M_candidate; // Choose new maximum
  }
  return std::sqrt(kn) * M;
}');
statZn_cpp(1:20,4); ## you will get 6.963106, which is the correct answer
## [1] 6.963106
statZn_cpp(1:20*10,4); ## Scaling should not matter; will also yield the correct answer of 6.963106
## [1] 6.963106
statZn_cpp(1:20/10,4); ## previously yielded the wrong answer of 6.575959; now correct
## [1] 6.963106
statZn_cpp(1:20/100,4); ## previously gave the obviously wrong answer of 0; now correct
## [1] 6.963106
set.seed(1L); statZn_cpp(rnorm(20),4); ## previously almost always 0 (the case relevant to the simulation studies); now a sensible nonzero value
## [1] 1.270117

Related

C++ compound assignment and type conversion issue

I am calculating combination(15, 7) in C++.
I first used the following code and got the wrong answer due to a type promotion error.
#include <iostream>
int main()
{
    int a = 15;
    double ans = 1;
    for(int i = 1; i <= 7; i++)
        ans *= (a + 1 - i) / i;
    std::cout << (int) ans;
    return 0;
}
Output: 2520
So I changed ans *= (a + 1 - i) / i; to ans *= (double)(a + 1 - i) / i; and still get the wrong answer.
#include <iostream>
int main()
{
    int a = 15;
    double ans = 1;
    for(int i = 1; i <= 7; i++)
        ans *= (double) (a + 1 - i) / i;
    std::cout << (int) ans;
    return 0;
}
Output: 6434
Finally, I tried ans = ans * (a + 1 - i) / i, which gives the right answer.
#include <iostream>
int main()
{
    int a = 15;
    double ans = 1;
    for(int i = 1; i <= 7; i++)
        ans = ans * (a + 1 - i) / i;
    std::cout << (int) ans;
    return 0;
}
Output: 6435
Could someone tell me why the second one did not work?
If you print out ans without casting it to (int) you'll see the second result is 6434.9999999999990905052982270717620849609375. That's pretty darn close to the right answer of 6435, so it's clearly not a type promotion error any more.
No, this is classic floating point inaccuracy. When you write ans *= (double) (a + 1 - i) / i you are doing the equivalent of:
ans = ans * ((double) (a + 1 - i) / i);
Compare this to the third version:
ans = ans * (a + 1 - i) / i;
The former performs division first followed by multiplication. The latter operates left to right and so the multiplication precedes the division. This change in order of operations causes the results of the two to be slightly different. Floating point calculations are extremely sensitive to order of operations.
Quick fix: Don't truncate the result; round it.
Better fix: Don't use floating point for integral arithmetic. Save the divisions until after all the multiplications are done. Use long, long long, or even a big number library.
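As a rough illustration of that better fix (a sketch, not from the original answer): if you multiply before dividing, every intermediate value is itself a binomial coefficient, so plain integer arithmetic stays exact.
#include <iostream>

int main()
{
    long long n = 15, k = 7, ans = 1;
    for (long long i = 1; i <= k; i++)
        // Before this step ans == C(n-k+i-1, i-1), so ans * (n-k+i) is
        // divisible by i and the integer division is exact.
        ans = ans * (n - k + i) / i;
    std::cout << ans;   // prints 6435
    return 0;
}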
First one did not work because you have integer division there.
The difference between the second one and the third one is this:
ans = ans * (double(a + 1 - i) / i); // second is equal to this
vs:
ans = (ans * (a + 1 - i)) / i; // third is equal to this
So the difference is in the order of multiplication and division. If you round the double to an integer instead of simply dropping the fractional part, you will get the same result:
std::cout << int( ans + 0.5 ) << std::endl;

Manual Implementation of Sobel Operator in OpenCV

I am trying to apply a sobel operator by iterating through an image and applying a mask to surrounding pixels.
For now, I am trying to apply the vertical portion of the mask, which is:
-1 0 1
-2 0 2
-1 0 1
In my implementation, I am iterating through the rows and columns as follows:
for (int i = 1; i < image.rows-1; i++){
    for (int j = 1; j < image.cols-1; j++){
        int pixel1 = image.at<Vec3b>(i-1,j-1)[0] * -1;
        int pixel2 = image.at<Vec3b>(i,j-1)[0] * 0;
        int pixel3 = image.at<Vec3b>(i+1,j-1)[0] * 1;
        int pixel4 = image.at<Vec3b>(i-1,j)[0] * -2;
        int pixel5 = image.at<Vec3b>(i,j)[0] * 0;
        int pixel6 = image.at<Vec3b>(i+1,j)[0] * 2;
        int pixel7 = image.at<Vec3b>(i-1,j+1)[0] * -1;
        int pixel8 = image.at<Vec3b>(i,j+1)[0] * 0;
        int pixel9 = image.at<Vec3b>(i+1,j+1)[0] * 1;
        int sum = pixel1 + pixel2 + pixel3 + pixel4 + pixel5 + pixel6 + pixel7 + pixel8 + pixel9;
        verticalSobel.at<Vec3b>(i,j)[0] = sum;
        verticalSobel.at<Vec3b>(i,j)[1] = sum;
        verticalSobel.at<Vec3b>(i,j)[2] = sum;
    }
}
Where the pixels are labeled as:
1 2 3
4 5 6
7 8 9
However, the resulting image is far off of what it should look like.
The guide I am using is: https://www.tutorialspoint.com/dip/sobel_operator.htm
I am not sure if I am simply implementing the operator incorrectly, or just iterating through the image incorrectly.
Any help would be greatly appreciated. Thanks!
You seem to have problems where the sum is negative. Take the absolute value of sum and clamp it to 255 (or, instead of the absolute value, clamp it to 0, depending on what you want to achieve; a "full" Sobel operator usually uses the 2D distance formula, so a horizontal/vertical-only variant should use the absolute value).
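A minimal sketch of what that could look like inside the question's inner loop (verticalSobel, sum, i, and j are the names from the question; saturate_cast is OpenCV's clamping helper):
// After computing sum, clamp the response into the valid 0..255 range
// before storing it (std::abs needs <cstdlib> or <cmath>).
int magnitude = std::abs(sum);                 // or std::max(sum, 0) to discard negative responses
uchar value = saturate_cast<uchar>(magnitude); // clamps anything above 255
verticalSobel.at<Vec3b>(i, j)[0] = value;
verticalSobel.at<Vec3b>(i, j)[1] = value;
verticalSobel.at<Vec3b>(i, j)[2] = value;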

How to find maximum area among given triangles on C++

The input should be n - the number of triangles (1 <= n <= 20) and afterwards n rows of three doubles each (corresponding to each of the triangles' sides). The output should be the "n" which has the maximum triangle area.
#include <iostream>
#include <math.h>
using namespace std;
const int MAX_SIZE = 20;
int main()
{
    int n, s, p;
    double max = 0;
    cin >> n;
    int x[MAX_SIZE];
    for (int i = 0; i < n; i++)
    {
        double y[2];
        for (int j = 0; j < 3; j++)
            cin >> y[j];
        p = (y[0] + y[1] + y[2]) / 2;
        s = sqrt(p * (p - y[0]) * (p - y[1]) * (p - y[3]));
        if (s >= max) max = s;
    }
    cout << max;
    return 0;
}
That's what I've done so far. "p" stands for the semiperimeter, by the way; I'm using Heron's formula. I haven't even gotten it to "cout" the n at which the area is maximal, but rather the maximum area itself, yet it doesn't work and instead gives me a massively wrong result. Any ideas?
You've got a few problems:
You need to change s and p from ints to doubles (otherwise you'll get unwanted truncation of your results).
You need to change double y[2]; to double y[3]; (since you need three side lengths, not two).
Change s = sqrt(p * (p - y[0]) * (p - y[1]) * (p - y[3])); to s = sqrt(p * (p - y[0]) * (p - y[1]) * (p - y[2])); (since y[3] is out of bounds of your array).
Note also that you can get rid of your array x, since you don't seem to actually use it anywhere.
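Putting those fixes together, a minimal corrected sketch (still printing the maximum area itself, as the original attempt does, rather than the index n):
#include <iostream>
#include <math.h>
using namespace std;

int main()
{
    int n;
    double maxArea = 0;
    cin >> n;
    for (int i = 0; i < n; i++)
    {
        double y[3];                          // three side lengths, not two
        for (int j = 0; j < 3; j++)
            cin >> y[j];
        double p = (y[0] + y[1] + y[2]) / 2;  // semiperimeter, kept as a double
        double s = sqrt(p * (p - y[0]) * (p - y[1]) * (p - y[2]));
        if (s >= maxArea) maxArea = s;
    }
    cout << maxArea;
    return 0;
}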
You are allocating only 2 doubles. You need 3; try double y[3].

Calculate the running standard deviation

I am converting equations to C++. Is this correct for a running standard deviation?
this->runningStandardDeviation = (this->sumOfProcessedSquaredSamples - sumSquaredDividedBySampleCount) / (sampleCount - 1);
Here is the full function:
void BM_Functions::standardDeviationForRunningSamples (float samples [], int sampleCount)
{
    // update the running process samples count
    this->totalSamplesProcessed += sampleCount;
    // get the mean of the samples
    double mean = meanForSamples(samples, sampleCount);
    // sum the deviations
    // sum the squared deviations
    for (int i = 0; i < sampleCount; i++)
    {
        // update the deviation sum of processed samples
        double deviation = samples[i] - mean;
        this->sumOfProcessedSamples += deviation;
        // update the squared deviations sum
        double deviationSquared = deviation * deviation;
        this->sumOfProcessedSquaredSamples += deviationSquared;
    }
    // get the sum squared
    double sumSquared = this->sumOfProcessedSamples * this->sumOfProcessedSamples;
    // get the sum/N
    double sumSquaredDividedBySampleCount = sumSquared / this->totalSamplesProcessed;
    this->runningStandardDeviation = sqrt((this->sumOfProcessedSquaredSamples - sumSquaredDividedBySampleCount) / (sampleCount - 1));
}
A numerically stable and efficient algorithm for computing the running mean and variance/SD is Welford's algorithm.
One C++ implementation would be:
std::pair<double,double> getMeanVariance(const std::vector<double>& vec) {
    double mean = 0, M2 = 0, variance = 0;
    size_t n = vec.size();
    for(size_t i = 0; i < n; ++i) {
        double delta = vec[i] - mean;
        mean += delta / (i + 1);
        M2 += delta * (vec[i] - mean);
        variance = M2 / (i + 1);
        if (i >= 2) {
            // <-- You can use the running mean and variance here
        }
    }
    return std::make_pair(mean, variance);
}
Note: to get the SD, just take sqrt(variance)
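For example, a small usage sketch (assuming the getMeanVariance() function above is in scope; note that it returns the population variance, M2 / n):
#include <cmath>
#include <iostream>
#include <utility>
#include <vector>

int main() {
    std::vector<double> samples = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
    std::pair<double, double> mv = getMeanVariance(samples);
    // The standard deviation is the square root of the returned variance.
    std::cout << "mean = " << mv.first << ", sd = " << std::sqrt(mv.second) << "\n";
    return 0;
}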
You may check for a sufficient sampleCount (1 would cause division by zero)
Make sure that the variables have a suitable data type (floating point)
Otherwise this looks correct...

How to speed up my sparse matrix solver?

I'm writing a sparse matrix solver using the Gauss-Seidel method. By profiling, I've determined that about half of my program's time is spent inside the solver. The performance-critical part is as follows:
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
    for (size_t x = 1; x < d_nx - 1; ++x) {
        d_x[ic] = d_b[ic]
            - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
            - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
        ++ic; ++iw; ++ie; ++is; ++in;
    }
    ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
All arrays involved are of float type. Actually, they are not arrays but objects with an overloaded [] operator, which (I think) should be optimized away, but is defined as follows:
inline float &operator[](size_t i) { return d_cells[i]; }
inline float const &operator[](size_t i) const { return d_cells[i]; }
For d_nx = d_ny = 128, this can be run about 3500 times per second on an Intel i7 920. This means that the inner loop body runs 3500 * 128 * 128 = 57 million times per second. Since only some simple arithmetic is involved, that strikes me as a low number for a 2.66 GHz processor.
Maybe it's not limited by CPU power, but by memory bandwidth? Well, one 128 * 128 float array eats 65 kB, so all 6 arrays should easily fit into the CPU's L3 cache (which is 8 MB). Assuming that nothing is cached in registers, I count 15 memory accesses in the inner loop body. On a 64-bit system this is 120 bytes per iteration, so 57 million * 120 bytes = 6.8 GB/s. The L3 cache runs at 2.66 GHz, so it's the same order of magnitude. My guess is that memory is indeed the bottleneck.
To speed this up, I've attempted the following:
Compile with g++ -O3. (Well, I'd been doing this from the beginning.)
Parallelizing over 4 cores using OpenMP pragmas. I have to change to the Jacobi algorithm to avoid reads from and writes to the same array. This requires that I do twice as many iterations, leading to a net result of about the same speed.
Fiddling with implementation details of the loop body, such as using pointers instead of indices. No effect.
What's the best approach to speed this guy up? Would it help to rewrite the inner body in assembly (I'd have to learn that first)? Should I run this on the GPU instead (which I know how to do, but it's such a hassle)? Any other bright ideas?
(N.B. I do take "no" for an answer, as in: "it can't be done significantly faster, because...")
Update: as requested, here's a full program:
#include <iostream>
#include <cstdlib>
#include <cstring>
using namespace std;

size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;

void step() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            d_x[ic] = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}

void solve(size_t iters) {
    for (size_t i = 0; i < iters; ++i) {
        step();
    }
}

void clear(float *a) {
    memset(a, 0, d_nx * d_ny * sizeof(float));
}

int main(int argc, char **argv) {
    size_t n = d_nx * d_ny;
    d_x = new float[n]; clear(d_x);
    d_b = new float[n]; clear(d_b);
    d_w = new float[n]; clear(d_w);
    d_e = new float[n]; clear(d_e);
    d_s = new float[n]; clear(d_s);
    d_n = new float[n]; clear(d_n);
    solve(atoi(argv[1]));
    cout << d_x[0] << endl; // prevent the thing from being optimized away
}
I compile and run it as follows:
$ g++ -o gstest -O3 gstest.cpp
$ time ./gstest 8000
0
real 0m1.052s
user 0m1.050s
sys 0m0.010s
(It does 8000 instead of 3500 iterations per second because my "real" program does a lot of other stuff too. But it's representative.)
Update 2: I've been told that uninitialized values may not be representative because NaN and Inf values may slow things down. The example code now clears the memory. It makes no difference in execution speed for me, though.
Couple of ideas:
Use SIMD. You could load 4 floats at a time from each array into a SIMD register (e.g. SSE on Intel, VMX on PowerPC). The disadvantage of this is that some of the d_x values will be "stale" so your convergence rate will suffer (but not as bad as a jacobi iteration); it's hard to say whether the speedup offsets it.
Use SOR. It's simple, doesn't add much computation, and can improve your convergence rate quite well, even for a relatively conservative relaxation value (say 1.5); a sketch follows after this list.
Use conjugate gradient. If this is for the projection step of a fluid simulation (i.e. enforcing incompressibility), you should be able to apply CG and get a much better convergence rate. A good preconditioner helps even more.
Use a specialized solver. If the linear system arises from the Poisson equation, you can do even better than conjugate gradient using an FFT-based method.
If you can explain more about what the system you're trying to solve looks like, I can probably give some more advice on #3 and #4.
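For the SOR idea, here is a minimal sketch of how the question's step() could be over-relaxed (a sketch under the assumption that the globals d_x, d_b, d_w, d_e, d_s, d_n, d_nx, d_ny are the ones from the question's program; omega is a hypothetical relaxation factor):
// Gauss-Seidel step with successive over-relaxation (SOR).
// omega = 1.0f reproduces the plain Gauss-Seidel update; ~1.5f is a
// conservative choice for faster convergence.
void step_sor(float omega) {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            float gs = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
            // Blend the old value with the Gauss-Seidel update.
            d_x[ic] = (1.0f - omega) * d_x[ic] + omega * gs;
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}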
I think I've managed to optimize it. Here's the code: create a new project in VC++, add this code, and simply compile under "Release".
#include <iostream>
#include <cstdlib>
#include <cstring>
#define _WIN32_WINNT 0x0400
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <conio.h>
using namespace std;

size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;

void step_original() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            d_x[ic] = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}

void step_new() {
    //size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    float
        *d_b_ic,
        *d_w_ic,
        *d_e_ic,
        *d_x_ic,
        *d_x_iw,
        *d_x_ie,
        *d_x_is,
        *d_x_in,
        *d_n_ic,
        *d_s_ic;
    // Start each pointer at the same offset the index-based version uses.
    d_b_ic = d_b + d_ny + 1;
    d_w_ic = d_w + d_ny + 1;
    d_e_ic = d_e + d_ny + 1;
    d_x_ic = d_x + d_ny + 1;
    d_x_iw = d_x + d_ny;
    d_x_ie = d_x + d_ny + 2;
    d_x_is = d_x + 1;
    d_x_in = d_x + 2 * d_ny + 1;
    d_n_ic = d_n + d_ny + 1;
    d_s_ic = d_s + d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y)
    {
        for (size_t x = 1; x < d_nx - 1; ++x)
        {
            /*d_x[ic] = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];*/
            *d_x_ic = *d_b_ic
                - *d_w_ic * *d_x_iw - *d_e_ic * *d_x_ie
                - *d_s_ic * *d_x_is - *d_n_ic * *d_x_in;
            //++ic; ++iw; ++ie; ++is; ++in;
            d_b_ic++;
            d_w_ic++;
            d_e_ic++;
            d_x_ic++;
            d_x_iw++;
            d_x_ie++;
            d_x_is++;
            d_x_in++;
            d_n_ic++;
            d_s_ic++;
        }
        //ic += 2; iw += 2; ie += 2; is += 2; in += 2;
        d_b_ic += 2;
        d_w_ic += 2;
        d_e_ic += 2;
        d_x_ic += 2;
        d_x_iw += 2;
        d_x_ie += 2;
        d_x_is += 2;
        d_x_in += 2;
        d_n_ic += 2;
        d_s_ic += 2;
    }
}

void solve_original(size_t iters) {
    for (size_t i = 0; i < iters; ++i) {
        step_original();
    }
}

void solve_new(size_t iters) {
    for (size_t i = 0; i < iters; ++i) {
        step_new();
    }
}

void clear(float *a) {
    memset(a, 0, d_nx * d_ny * sizeof(float));
}

int main(int argc, char **argv) {
    size_t n = d_nx * d_ny;
    d_x = new float[n]; clear(d_x);
    d_b = new float[n]; clear(d_b);
    d_w = new float[n]; clear(d_w);
    d_e = new float[n]; clear(d_e);
    d_s = new float[n]; clear(d_s);
    d_n = new float[n]; clear(d_n);
    if(argc < 3) {
        printf("app.exe (x)iters (o/n)algo\n");
        return 0;
    }
    bool bOriginalStep = (argv[2][0] == 'o');
    size_t iters = atoi(argv[1]);
    /*printf("Press any key to start!");
    _getch();
    printf(" Running speed test..\n");*/
    __int64 freq, start, end, diff;
    if(!::QueryPerformanceFrequency((LARGE_INTEGER*)&freq))
        throw "Not supported!";
    freq /= 1000000; // microseconds!
    {
        ::QueryPerformanceCounter((LARGE_INTEGER*)&start);
        if(bOriginalStep)
            solve_original(iters);
        else
            solve_new(iters);
        ::QueryPerformanceCounter((LARGE_INTEGER*)&end);
        diff = (end - start) / freq;
    }
    printf("Speed (%s)\t\t: %u\n", (bOriginalStep ? "original" : "new"), (unsigned)diff);
    //_getch();
    //cout << d_x[0] << endl; // prevent the thing from being optimized away
}
Run it like this:
app.exe 10000 o
app.exe 10000 n
"o" means old code, yours.
"n" is mine, the new one.
My results:
Speed (original):
1515028
1523171
1495988
Speed (new):
966012
984110
1006045
Improvement of about 30%.
The logic behind it: you've been using index counters to access and manipulate the arrays, while I use pointers.
While running, set a breakpoint at one of the calculation lines in VC++'s debugger and press F8; you'll get the disassembly window.
There you'll see the produced opcodes (assembly code).
Anyway, look:
int *x = ...;
x[3] = 123;
This tells the CPU to put the pointer x in a register (say EAX),
then add 3 * sizeof(int) to it,
and only then store the value 123.
The pointer approach is better because it cuts out that repeated address arithmetic: we advance the pointers ourselves and can optimize as needed.
I hope this helps.
Sidenote to stackoverflow.com's staff:
Great website; I wish I'd heard of it long ago!
For one thing, there seems to be a pipelining issue here. The loop reads from the value in d_x that has just been written to, but apparently it has to wait for that write to complete. Just rearranging the order of the computation, doing something useful while it's waiting, makes it almost twice as fast:
d_x[ic] = d_b[ic]
    - d_e[ic] * d_x[ie]
    - d_s[ic] * d_x[is] - d_n[ic] * d_x[in]
    - d_w[ic] * d_x[iw] /* d_x[iw] has just been written to, process this last */;
It was Eamon Nerbonne who figured this out. Many upvotes to him! I would never have guessed.
Poni's answer looks like the right one to me.
I just want to point out that in this type of problem, you often gain benefits from memory locality. Right now, the b, w, e, s, n arrays are all at separate locations in memory. If the problem did not fit in the L3 cache (or mostly in L2), that separation would hurt, and a solution of this sort would be helpful:
size_t d_nx = 128, d_ny = 128;
float *d_x;
struct D { float b,w,e,s,n; };
D *d;

void step() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            d_x[ic] = d[ic].b
                - d[ic].w * d_x[iw] - d[ic].e * d_x[ie]
                - d[ic].s * d_x[is] - d[ic].n * d_x[in];
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}

void solve(size_t iters) { for (size_t i = 0; i < iters; ++i) step(); }
void clear(float *a) { memset(a, 0, d_nx * d_ny * sizeof(float)); }

int main(int argc, char **argv) {
    size_t n = d_nx * d_ny;
    d_x = new float[n]; clear(d_x);
    d = new D[n]; memset(d, 0, n * sizeof(D));
    solve(atoi(argv[1]));
    cout << d_x[0] << endl; // prevent the thing from being optimized away
}
For example, this solution at 1280x1280 is a little less than 2x faster than Poni's solution (13s vs 23s in my test--your original implementation is then 22s), while at 128x128 it's 30% slower (7s vs. 10s--your original is 10s).
(Iterations were scaled up to 80000 for the base case, and 800 for the 100x larger case of 1280x1280.)
I think you're right about memory being a bottleneck. It's a pretty simple loop with just some simple arithmetic per iteration. The ic, iw, ie, is, and in indices seem to be on opposite sides of the matrix, so I'm guessing there's a bunch of cache misses there.
I'm no expert on the subject, but I've seen that there are several academic papers on improving the cache usage of the Gauss-Seidel method.
Another possible optimization is the use of the red-black variant, where points are updated in two sweeps in a chessboard-like pattern. In this way, all updates in a sweep are independent and can be parallelized.
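A minimal sketch of such a red-black sweep, reusing the question's globals and its square 128 x 128 layout (row stride d_nx, with d_nx == d_ny as in the example):
// Two half-sweeps over cells of one parity of (x + y) at a time; every
// neighbour of a cell has the other parity, so all updates within one
// half-sweep are independent and can be parallelized.
void step_red_black() {
    for (int colour = 0; colour < 2; ++colour) {
        for (size_t y = 1; y < d_ny - 1; ++y) {
            // Start x on the parity that belongs to this colour.
            for (size_t x = 1 + ((y + colour) & 1); x < d_nx - 1; x += 2) {
                size_t ic = y * d_nx + x;
                d_x[ic] = d_b[ic]
                    - d_w[ic] * d_x[ic - 1]    - d_e[ic] * d_x[ic + 1]
                    - d_s[ic] * d_x[ic - d_nx] - d_n[ic] * d_x[ic + d_nx];
            }
        }
    }
}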
I suggest putting in some prefetch statements and also researching "data oriented design":
void step_original() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    // Local copies of the values used in the update (each declared once).
    float db_ic, dw_ic, de_ic, ds_ic, dn_ic;
    float dx_iw, dx_ie, dx_is, dx_in;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            // Perform the prefetch
            // Sorting these statements by array may increase speed;
            // although sorting by index name may increase speed too.
            db_ic = d_b[ic];
            dw_ic = d_w[ic];
            dx_iw = d_x[iw];
            de_ic = d_e[ic];
            dx_ie = d_x[ie];
            ds_ic = d_s[ic];
            dx_is = d_x[is];
            dn_ic = d_n[ic];
            dx_in = d_x[in];
            // Calculate
            d_x[ic] = db_ic
                - dw_ic * dx_iw - de_ic * dx_ie
                - ds_ic * dx_is - dn_ic * dx_in;
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}
This differs from your second method since the values are copied to local temporary variables before the calculation is performed.