I have a longer program that I cut down as much as possible while keeping the issue alive. My code runs an MCMC computation for different parameter values. For some combinations of values, the code runs much longer, about 100x slower than in typical cases. However, it shouldn't, because the number of operations does not depend on the parameter values.
I am running this on an AMD64 Linux box with glibc-2.17, compiled with GCC 4.8.1 on Gentoo. Compiler flags do not appear to matter. I also tested it on a different Gentoo box with an older AMD64 processor and the results were the same.
Here are the tests I did:
I tried debugging with Valgrind; it found no memory issues or other nasty things.
I tried running the code with the problematic parameter values fixed for every iteration, but the slowness did not appear.
I tried putting sleep(4) between iterations, but nothing changed.
The problem shows up when the iteration hits k = 1, i = j = 0, which translates into mu[0] = -0.05, mu[1] = -0.05 and mu[2] = 0.05. As I said, using this fixed value for all iterations eliminates the issue I am seeing.
Here are things that eliminate the problem:
Changing the limits.
Fixing the mu[] coefficients.
Removing dW3 from the computation.
Removing rand().
Removing computation of q.
Removing the update of s[j].
I have read quite a bit about slowpow, so I tried eliminating exp by writing my own version of it. This solves the problem in this MWE, but not when the reimplemented exp is placed in the production code.
Question: what is causing the semi-random slowness?
The code of the MWE follows. All help and suggestions as to how to proceed would be greatly appreciated.
Note: this code is compiled with g++, although it is essentially C. Changing the compiler doesn't change anything.
Regarding branch prediction: Removing one of the if statements by using
q = exp(dW);
q = q / (1.0 + q);
regardless of the value of dW does not change the code's behavior; if this is indeed due to branch prediction, it would have to be due to the second if.
#include <cstdio>
#include <cstdlib>
#include <cmath>

inline int index(int const i, int const j, int const n)
{
    return (i + n) % n + ((j + n) % n) * n;
}

void get_sample(int* s, int n, double* mu)
{
    for (int i = 0; i < 10 * n * n; i++)
    {
        int j = i % (n * n);
        int x = j % n;
        int y = (j - x) / n;
        double dW1 = mu[0] * (s[index(x - 1, y, n)] + s[index(x + 1, y, n)] + s[index(x, y - 1, n)] + s[index(x, y + 1, n)]);
        double dW2 = mu[1] * (s[index(x - 1, y - 1, n)] + s[index(x + 1, y - 1, n)] + s[index(x + 1, y + 1, n)] + s[index(x - 1, y + 1, n)]);
        double dW3 = mu[2] * (s[index(x - 1, y, n)] * s[index(x - 1, y - 1, n)] * s[index(x, y - 1, n)] + s[index(x - 1, y, n)] * s[index(x - 1, y + 1, n)] * s[index(x, y + 1, n)]
                            + s[index(x, y + 1, n)] * s[index(x + 1, y + 1, n)] * s[index(x + 1, y, n)] + s[index(x + 1, y, n)] * s[index(x + 1, y - 1, n)] * s[index(x, y - 1, n)]);
        double dW = 2.0 * (dW1 + dW2 + dW3);
        double q;
        if (dW < 0.0)
        {
            q = exp(dW);
            q = q / (1.0 + q);
        }
        else
        {
            q = exp(-dW);
            q = 1.0 / (1.0 + q);
        }
        double p = ((double) rand()) / ((double) RAND_MAX);
        if (p < q)
        {
            s[j] = 1;
        }
        else
        {
            s[j] = -1;
        }
    }
}

int main(int argc, char** argv)
{
    double mu[3];
    double limits[6] = {-0.05, 0.8, -0.05, 0.45, -0.45, 0.05};
    int s[16];
    for (int i = 0; i < 16; i++)
    {
        s[i] = -1;
    }
    for (int k = 0; k < 2; k++)
    {
        for (int j = 0; j < 2; j++)
        {
            for (int i = 0; i < 2; i++)
            {
                mu[0] = limits[0] + ((limits[1] - limits[0]) * i);
                mu[1] = limits[2] + ((limits[3] - limits[2]) * j);
                mu[2] = limits[4] + ((limits[5] - limits[4]) * k);
                printf(" Computing (% .6lf, % .6lf, % .6lf)...\n", mu[0], mu[1], mu[2]);
                for (int sample = 0; sample < 1000; sample++)
                {
                    get_sample(s, 4, mu);
                }
            }
        }
    }
    return 0;
}
However, it shouldn't, because the number of operations does not depend on parameter values.
The speed of floating-point operations does depend on parameter values, though. If you introduce NaN or other exceptional values into your computation (which I did not review the code for), it will drastically degrade your floating-point performance.
EDIT: I manually profiled (with simple rdtsc counting) around the exp() call and it was easy to bin "good" and "bad" cases. When I printed the bad cases, they were all where dW ≈ 0. If you split that case out, you get even performance:
double q;
if (dW < -0.1e-15)
{
    q = exp(dW);
    q = q / (1.0 + q);
}
else if (dW > 0.1e-15)
{
    q = exp(-dW);
    q = 1.0 / (1.0 + q);
}
else
{
    q = 0.5;
}
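For reference, here is a minimal sketch of the kind of per-call binning described above. It is my own illustration rather than the exact instrumentation used: the rdtsc-based helper, its name, and the cycle threshold are all assumptions to be tuned for your machine.

#include <cstdio>
#include <cmath>
#include <x86intrin.h>   // __rdtsc(); assumes x86-64 with GCC or Clang

// Hypothetical helper: time each exp() call with rdtsc and report the slow ones.
static long long good_calls = 0, bad_calls = 0;

double timed_exp(double x)
{
    unsigned long long t0 = __rdtsc();
    double r = exp(x);
    unsigned long long t1 = __rdtsc();
    if (t1 - t0 > 1000)   // threshold is a guess; adjust it for your CPU
    {
        bad_calls++;
        printf("slow exp: x = %.17g, cycles = %llu\n", x, t1 - t0);
    }
    else
    {
        good_calls++;
    }
    return r;
}

Replacing the exp() calls inside get_sample() with timed_exp() should make it easy to see whether the slow calls cluster around dW ≈ 0.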
If I am right and branch prediction is the problem, you should try
void get_sample(int* s, int n, double* mu)
{
    for (int i = 0; i < 10 * n * n; i++)
    {
        int j = i % (n * n);
        int x = j % n;
        int y = (j - x) / n;
        double dW1 = mu[0] * (s[index(x - 1, y, n)] + s[index(x + 1, y, n)] + s[index(x, y - 1, n)] + s[index(x, y + 1, n)]);
        double dW2 = mu[1] * (s[index(x - 1, y - 1, n)] + s[index(x + 1, y - 1, n)] + s[index(x + 1, y + 1, n)] + s[index(x - 1, y + 1, n)]);
        double dW3 = mu[2] * (s[index(x - 1, y, n)] * s[index(x - 1, y - 1, n)] * s[index(x, y - 1, n)] + s[index(x - 1, y, n)] * s[index(x - 1, y + 1, n)] * s[index(x, y + 1, n)]
                            + s[index(x, y + 1, n)] * s[index(x + 1, y + 1, n)] * s[index(x + 1, y, n)] + s[index(x + 1, y, n)] * s[index(x + 1, y - 1, n)] * s[index(x, y - 1, n)]);
        double dW = 2.0 * (dW1 + dW2 + dW3);
        double q;
        q = exp(dW * ((dW > 0) * 2 - 1));
        q = ((dW > 0) * q + (dW <= 0)) / (1.0 + q);
        double p = ((double) rand()) / ((double) RAND_MAX);
        s[j] = (p < q) * 2 - 1;
    }
}
I am also wondering if a good compiler shouldn't make such transformations anyway...
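For completeness, here is a variant of the same branch-free idea that keeps the argument to exp() non-positive, as the original if/else did, so exp() cannot overflow for large positive dW. This is my own sketch, not part of the transformation above:

double e = exp(-fabs(dW));                         // argument to exp() is never positive
double q = ((dW > 0) + (dW <= 0) * e) / (1.0 + e); // dW > 0: 1/(1+exp(-dW)); dW <= 0: exp(dW)/(1+exp(dW))

Both forms compute the same logistic function; the only difference is which sign of argument is handed to exp().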
#include <iostream>
#include <chrono>
#include <random>
#include <time.h>

using namespace std;
typedef pair<double,double> pd;
#define x first
#define y second
#define cell(i,j,w) ((i)*(w) + (j))

class MyTimer
{
private:
    std::chrono::time_point<std::chrono::steady_clock> starter;
    std::chrono::time_point<std::chrono::steady_clock> ender;

public:
    void startCounter() {
        starter = std::chrono::steady_clock::now();
    }

    long long getCounter() {
        ender = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(ender - starter).count();
    }
};

int main()
{
    const int n = 5000;
    int* value1 = new int[(n + 1) * (n + 1)];
    int* value2 = new int[(n + 1) * (n + 1)];
    double* a = new double[(n + 1) * (n + 1)];
    double* b = new double[(n + 1) * (n + 1)];
    pd* packed = new pd[(n + 1) * (n + 1)];
    MyTimer timer;

    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++) {
            value1[cell(i, j, n + 1)] = rand() % 5000;
            value2[cell(i, j, n + 1)] = rand() % 5000;
        }

    for (int i = 1; i <= n; i++) {
        a[cell(i, 0, n + 1)] = 0;
        a[cell(0, i, n + 1)] = 0;
        b[cell(i, 0, n + 1)] = 0;
        b[cell(0, i, n + 1)] = 0;
        packed[cell(i, 0, n + 1)] = pd(0, 0);
        packed[cell(0, i, n + 1)] = pd(0, 0);
    }
    // also initialise the [0][0] corner, which the first loop iteration reads
    a[cell(0, 0, n + 1)] = 0;
    b[cell(0, 0, n + 1)] = 0;
    packed[cell(0, 0, n + 1)] = pd(0, 0);

    for (int tt = 1; tt <= 5; tt++)
    {
        timer.startCounter();
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                // packed[i][j] = packed[i-1][j] + packed[i][j-1] - packed[i-1][j-1] + value1[i][j]
                packed[cell(i, j, n + 1)].x = packed[cell(i - 1, j, n + 1)].x + packed[cell(i, j - 1, n + 1)].x - packed[cell(i - 1, j - 1, n + 1)].x + value1[cell(i, j, n + 1)];
                packed[cell(i, j, n + 1)].y = packed[cell(i - 1, j, n + 1)].y + packed[cell(i, j - 1, n + 1)].y - packed[cell(i - 1, j - 1, n + 1)].y + value1[cell(i, j, n + 1)] * value1[cell(i, j, n + 1)];
            }
        cout << "Time packed = " << timer.getCounter() << "\n";

        timer.startCounter();
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                // a[i][j] = a[i-1][j] + a[i][j-1] - a[i-1][j-1] + value2[i][j];
                // b[i][j] = b[i-1][j] + b[i][j-1] - b[i-1][j-1] + value2[i][j] * value2[i][j];
                a[cell(i, j, n + 1)] = a[cell(i - 1, j, n + 1)] + a[cell(i, j - 1, n + 1)] - a[cell(i - 1, j - 1, n + 1)] + value2[cell(i, j, n + 1)];
                b[cell(i, j, n + 1)] = b[cell(i - 1, j, n + 1)] + b[cell(i, j - 1, n + 1)] - b[cell(i - 1, j - 1, n + 1)] + value2[cell(i, j, n + 1)] * value2[cell(i, j, n + 1)];
            }
        cout << "Time separate = " << timer.getCounter() << "\n\n";
    }

    delete[] value1;
    delete[] value2;
    delete[] a;
    delete[] b;
    delete[] packed;
}
So I'm computing a 2D prefix table (Summed Area Table). And I notice the property in the title.
When compiling with the CUDA nvcc compiler (with -O2), either from the command line or in Visual Studio Release mode, the packed version is 2x faster on the first run (separate takes 200 ms, packed takes 100 ms), but only 25% faster on subsequent runs (because value2[] is cached after the first loop). In my actual program with more steps of calculation (computing the SAT is just step 1), it's always 2x faster, since value1[] and value2[] have definitely been evicted from cache.
I know the packed array is faster because modern Intel CPUs read 32-64 bytes into cache at once. So by packing both arrays together, the code can read both values in one main-memory (RAM) access instead of two. But why is the speedup so high? Along with the memory access, the CPU still has to perform 6 additions, 2 subtractions, and 1 multiplication per loop iteration. A 2x speedup from halving the memory accesses is 100% improvement efficiency (Amdahl's law), the same as if those add/multiply operations didn't exist. How is that possible?
I'm certain it has something to do with CPU pipelining, but I can't explain it more thoroughly. Can anyone explain this further in terms of instruction latency/memory access latency/assembly? Thank you.
The code doesn't use any GPU, so any other good compiler should give the same 2x speedup as nvcc. With g++ 9.3.0 (g++ file.cpp -O2 -std=c++11 -o file.exe), it's also a 2x speedup. The CPU is an Intel i7-7700.
I've run this program here and here2 with the command-line arguments -O2 -std=c++11, and it also shows a 1.5-2x speedup. Use n = 3000; anything bigger won't run (it's a free VM service, after all). So it's not just my computer.
The answer is in the access latency of the different levels of memory, from L1 cache down to main memory (RAM).
Data in L1 cache takes ~5 cycles to access, while data from RAM takes 50-100 cycles. Meanwhile, add/sub/mult operations take 3-5 cycles.
Therefore, the dominant limiter of performance is main-memory access. By cutting the number of main-memory requests in half, performance almost doubles.
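As a rough back-of-envelope illustration (my own numbers, based on the latencies quoted above rather than on measurements): if each iteration has to wait on one or two cache lines coming from RAM at roughly 100 cycles apiece, while the six additions, two subtractions and one multiplication cost a few cycles each and can be overlapped with the outstanding loads by the out-of-order core, then the time per iteration is approximately (number of RAM accesses) × (RAM latency), and the arithmetic is effectively hidden. Going from two RAM accesses per iteration to one therefore cuts the time roughly in half, which is consistent with the observed 200 ms vs. 100 ms.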
I'm trying to understand the FFT algorithm.
Here's the code:
void fft(double *a, double *b, double *w, int m, int l)
{
    int i, i0, i1, i2, i3, j;
    double u, v, wi, wr;
    for (j = 0; j < l; j++) {
        wr = w[j << 1];
        wi = w[(j << 1) + 1]; // parentheses added: '+' binds tighter than '<<'
        for (i = 0; i < m; i++) {
            i0 = (i << 1) + (j * m << 1);
            i1 = i0 + (m * l << 1);
            i2 = (i << 1) + (j * m << 2);
            i3 = i2 + (m << 1);
            u = a[i0] - a[i1];
            v = a[i0 + 1] - a[i1 + 1];
            b[i2] = a[i0] + a[i1];
            b[i2 + 1] = a[i0 + 1] + a[i1 + 1];
            b[i3] = wr * u - wi * v;
            b[i3 + 1] = wr * v + wi * u;
        }
    }
}
If I understand it right, array w is the input, where every odd element is a real part and every even element is an imaginary part, and a and b are the imaginary and real parts of the complex result.
Also I found that l = 2^m.
But when I'm trying to do this:
double a[4] = { 0, 0, 0, 0 };
double b[4] = { 0, 0, 0, 0 };
double w[8] = { 1, 0, 0, 0, 0, 0, 0, 0 };
int m = 3;
int l = 8;
fft(a, b, w, m, l);
There's an error.
This code is only part of an FFT. a is input. b is output. w contains precomputed weights. l is a number of subdivisions at the current point in the FFT. m is the number of elements per division. The data in a, b, and w is interleaved complex data—each pair of double elements from the array consists of the real part and the imaginary part of one complex number.
The code performs one radix-two butterfly pass over the data. To use it to compute an FFT, it must be called multiple times with specific values for l, m, and the weights in w. Since, for each call, the input is in a and the output is in b, the caller must use at least two buffers and alternate between them for successive calls to the routine.
From the indexing performed in i0 and i2, it appears the data is being rearranged slightly. This may be intended to produce the final results of the FFT in “natural” order instead of the bit-reversed order that occurs in a simple implementation.
But when I'm trying to do this:
double a[4] = { 0, 0, 0, 0 };
double b[4] = { 0, 0, 0, 0 };
double w[8] = { 1, 0, 0, 0, 0, 0, 0, 0 };
int m = 3;
int l = 8;
fft(a, b, w, m, l);
There's an error.
From for (j = 0; j < l; j++), we see the maximum value of j in the loop is l-1. From for (i = 0; i < m; i++), we see the maximum value of i is m-1. Then in i0 = (i << 1) + (j * m << 1), we have i0 = ((m-1) << 1) + ((l-1) * m << 1) = (m-1)*2 + (l-1) * m * 2 = 2*m - 2 + l*m*2 - m*2 = 2*m*l - 2. And in i1 = i0 + (m * l << 1), we have i1 = 2*m*l - 2 + (m * l * 2) = 4*m*l - 2. When the code uses a[i1 + 1], the index is i1 + 1 = 4*m*l - 2 + 1 = 4*m*l - 1.
Therefore a must have an element with index 4*m*l - 1, so it must have at least 4*m*l elements. The required size for b can be computed similarly and is the same.
When you call fft with m set to 3 and l set to 8, a must have 4•3•8 = 96 elements. Your sample code shows four elements. Thus, the array is overrun, and the code fails.
I do not believe it is correct that l should equal 2^m. More likely, 4*m*l should not vary between calls to fft in the same complete FFT computation, and, since a and b contain two double elements for every complex number, 4*m*l should be twice the number of complex elements in the signal being transformed.
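To make the alternating-buffer idea concrete, here is a minimal, hypothetical driver sketch. It assumes a Stockham-style radix-2 pass in which m is halved and l is doubled on every call (so 2*m*l stays equal to the number of complex points N), and it assumes the caller precomputes a suitable weight table for each pass; neither assumption comes from the original code, so treat this purely as an illustration of how the two buffers alternate.

#include <vector>
#include <utility>

void fft(double *a, double *b, double *w, int m, int l);   // the routine from the question

// Hypothetical driver: data holds N complex points as interleaved (re, im)
// doubles, so data.size() == 2 * N, and N is assumed to be a power of two.
// wtab[p] is assumed to hold the weights pass p needs, in the layout the
// routine reads them (w[2*j], w[2*j + 1]); computing them is not shown here.
void fft_driver(std::vector<double>& data, std::vector<std::vector<double>>& wtab)
{
    const int N = (int)data.size() / 2;
    std::vector<double> scratch(2 * N);

    double* src = data.data();
    double* dst = scratch.data();

    int pass = 0;
    for (int l = 1, m = N / 2; m >= 1; l *= 2, m /= 2, ++pass)
    {
        fft(src, dst, wtab[pass].data(), m, l);   // output of this pass...
        std::swap(src, dst);                      // ...becomes input of the next
    }
    // After log2(N) passes the result is back in data if log2(N) is even,
    // otherwise it is in scratch and still needs to be copied out.
}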
My code for Gaussian elimination of a matrix does not work. The core code seems okay, but it is missing some final touch that I honestly can't pin down. It would be great if someone could point out the mistake.
Basically, when I input a square 3x3 matrix filled with 3s, I get back (3, 3, 3, 0, -3, -3, 0, 0, 3), but it should be (3, 3, 3, 0, 0, 0, 0, 0, 0).
n is the number of rows of the matrix and m is the number of columns.
All elements of the matrix are stored in a single-dimension array called entries[].
My GaussElim code below basically starts by placing the row with the largest first element at the top. After that, I eliminate the elements directly below the pivot.
Matrix Matrix::GaussElim() const {
    double maxEle;
    int maxRow;
    for (int i = 1; i <= m; i++) {
        maxEle = fabs(entries[i-1]);
        maxRow = i;
        for (int k = i+1; k <= m; k++) {
            if (fabs(entries[(k - 1) * n + i - 1]) > maxEle) {
                maxEle = entries[(k - 1) * n + i - 1];
                maxRow = k;
            }
        }
        for (int a = 1; a <= m; a++) {
            swap(entries[(i - 1) * m + a - 1], entries[(maxRow - 1) * m + a - 1]);
        }
        for (int b = i + 1; b <= n; b++) {
            double c = -(entries[(b - 1) * m + i - 1]) / entries[(i - 1) * m + i - 1];
            for (int d = i; d <= n; d++) {
                if (i == d) {
                    entries[(b - 1) * m + d - 1] = 0;
                }
                else {
                    entries[(b - 1) * m + d - 1] = c * entries[(i - 1) * m + d - 1];
                }
            }
        }
    }
    Matrix Result(n, m, entries);
    return Result;
}
For starters, I'd suggest dropping the habit of starting the loops at 1 instead of the more idiomatic 0; it would simplify all of the formulas.
That said, this statement
else {
    entries[(b - 1) * m + d - 1] = c * entries[(i - 1) * m + d - 1];
    //                           ^^^
}
looks suspicious. There should be a += (or a -=, depending on how you choose the sign of the pivot).
Another source of unexpected results is the way chosen to calculate the constant c:
double c = -(entries[(b - 1) * m + i - 1]) / entries[(i - 1) * m + i - 1];
//                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Even with partial pivoting, that value could be zero (or too small), due to the nature of the starting matrix, as in the posted example, or due to numerical errors. In those cases, it would be preferable to just zero out all the remaining elements of the matrix.
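Putting those points together, the elimination step could look like the following sketch (my own illustration of the above suggestions, written 0-based as a free function rather than as the class method; entries, n and m are assumed to have the same meaning as in the question):

#include <cmath>
#include <algorithm>

// entries: row-major n x m matrix, as in the question.
void gauss_eliminate(double* entries, int n, int m)
{
    for (int i = 0; i < n && i < m; i++) {
        // partial pivoting: move the row with the largest |entry| in column i up to row i
        int maxRow = i;
        for (int k = i + 1; k < n; k++)
            if (std::fabs(entries[k * m + i]) > std::fabs(entries[maxRow * m + i]))
                maxRow = k;
        for (int a = 0; a < m; a++)
            std::swap(entries[i * m + a], entries[maxRow * m + a]);

        double pivot = entries[i * m + i];
        if (std::fabs(pivot) < 1e-12)   // tolerance chosen arbitrarily for this sketch
            continue;                   // nothing left to eliminate in this column

        for (int b = i + 1; b < n; b++) {
            double c = -entries[b * m + i] / pivot;
            entries[b * m + i] = 0.0;
            for (int d = i + 1; d < m; d++)
                entries[b * m + d] += c * entries[i * m + d];   // note +=, not =
        }
    }
}

For the 3x3 matrix filled with 3s, this produces (3, 3, 3, 0, 0, 0, 0, 0, 0), as expected.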
I tried a quick and dirty translation of the code here.
However, my version outputs noise comparable to grey t-shirt material, or heather if it please you:
#include <fstream>
#include "perlin.h"

double Perlin::cos_Interp(double a, double b, double x)
{
    ft = x * 3.1415927;
    f = (1 - cos(ft)) * .5;
    return a * (1 - f) + b * f;
}

double Perlin::noise_2D(double x, double y)
{
    /*
    int n = (int)x + (int)y * 57;
    n = (n << 13) ^ n;
    int nn = (n * (n * n * 60493 + 19990303) + 1376312589) & 0x7fffffff;
    return 1.0 - ((double)nn / 1073741824.0);
    */
    int n = (int)x + (int)y * 57;
    n = (n << 13) ^ n;
    return (1.0 - ((n * (n * n * 15731 + 789221) + 1376312589) & 0x7fffffff) / 1073741824.0);
}

double Perlin::smooth_2D(double x, double y)
{
    corners = (noise_2D(x - 1, y - 1) + noise_2D(x + 1, y - 1) + noise_2D(x - 1, y + 1) + noise_2D(x + 1, y + 1)) / 16;
    sides = (noise_2D(x - 1, y) + noise_2D(x + 1, y) + noise_2D(x, y - 1) + noise_2D(x, y + 1)) / 8;
    center = noise_2D(x, y) / 4;
    return corners + sides + center;
}

double Perlin::interp(double x, double y)
{
    int x_i = int(x);
    double x_left = x - x_i;
    int y_i = int(y);
    double y_left = y - y_i;
    double v1 = smooth_2D(x_i, y_i);
    double v2 = smooth_2D(x_i + 1, y_i);
    double v3 = smooth_2D(x_i, y_i + 1);
    double v4 = smooth_2D(x_i + 1, y_i + 1);
    double i1 = cos_Interp(v1, v2, x_left);
    double i2 = cos_Interp(v3, v4, x_left);
    return cos_Interp(i1, i2, y_left);
}

double Perlin::perlin_2D(double x, double y)
{
    double total = 0;
    double p = .25;
    int n = 1;
    for (int i = 0; i < n; ++i)
    {
        double freq = pow(2, i);
        double amp = pow(p, i);
        total = total + interp(x * freq, y * freq) * amp;
    }
    return total;
}

int main()
{
    Perlin perl;
    ofstream ofs("./noise2D.ppm", ios_base::binary);
    ofs << "P6\n" << 512 << " " << 512 << "\n255\n";
    for (int i = 0; i < 512; ++i)
    {
        for (int j = 0; j < 512; ++j)
        {
            double n = perl.perlin_2D(i, j);
            n = floor((n + 1.0) / 2.0 * 255);
            unsigned char c = n;
            ofs << c << c << c;
        }
    }
    ofs.close();
    return 0;
}
I don't believe that I strayed too far from the aforementioned site's directions aside from adding in the ppm image generation code, but then again I'll admit to not fully grasping what is going on in the code.
As you'll see by the commented section, I tried two (similar) ways of generating pseudorandom numbers for noise. I also tried different ways of scaling the numbers returned by perlin_2D to RGB color values. These two ways of editing the code have just yielded different looking t-shirt material. So, I'm forced to believe that there's something bigger going on that I am unable to recognize.
Also, I'm compiling with g++ and the c++11 standard.
EDIT: Here's an example: http://imgur.com/Sh17QjK
To convert a double in the range of [-1.0, 1.0] to an integer in range [0, 255]:
n = floor((n + 1.0) / 2.0 * 255.99);
To write it as a binary value to the PPM file:
ofstream ofs("./noise2D.ppm", ios_base::binary);
...
unsigned char c = n;
ofs << c << c << c;
Is this a direct copy of your code? You assigned an integer to what should be the Y fractional value - it's a typo, and it will throw the entire noise algorithm off if you don't fix it:
double Perlin::interp(double x, double y)
{
    int x_i = int(x);
    double x_left = x - x_i;
    int y_i = int(y);
    double y_left = y = y_i; // This should have a minus, not an "=" like the line above
    .....
}
My guess is if you're successfully generating the bitmap with the proper color computation, you're getting vertical bars or something along those lines?
You also need to remember that the Perlin generator usually spits out numbers in the range of -1 to 1 and you need to multiply the resultant value as such:
value * 127 + 128 = {R, G, B}
to get a good grayscale image.
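For what it's worth, here is a small sketch of the inner pixel write with the scaling both answers describe; the clamp is my own addition, in case an octave sum drifts slightly outside [-1, 1]:

// inside the i/j loops of main(): map n from [-1, 1] to [0, 255] and clamp
double n = perl.perlin_2D(i, j);
int v = (int)floor((n + 1.0) / 2.0 * 255.99);
if (v < 0)   v = 0;
if (v > 255) v = 255;
unsigned char c = (unsigned char)v;
ofs << c << c << c;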
I am trying to implement the PASCAL code given in this paper in C++, and this is my attempt:
#include <iostream>
using namespace std;

int GenFact(int a, int b)
{   // calculates the generalised factorial
    // (a)(a-1)...(a-b+1)
    int gf = 1;
    for (int jj = (a - b + 1); jj < a + 1; jj++)
    {
        gf = gf * jj;
    }
    return (gf);
}   // end of GenFact function

double GramPoly(int i, int m, int k, int s)
{   // Calculates the Gram Polynomial ( s = 0 ), or its s'th
    // derivative evaluated at i, order k, over 2m + 1 points
    double gp_val;
    if (k > 0)
    {
        gp_val = (4.0 * k - 2.0) / (k * (2.0 * m - k + 1.0)) *
                     (i * GramPoly(i, m, k - 1, s) +
                      s * GramPoly(i, m, k - 1.0, s - 1.0)) -
                 ((k - 1.0) * (2.0 * m + k)) /
                     (k * (2.0 * m - k + 1.0)) *
                     GramPoly(i, m, k - 2.0, s);
    }
    else
    {
        if ((k == 0) && (s == 0))
        {
            gp_val = 1.0;
        }
        else
        {
            gp_val = 0.0;
        }   // end of if k = 0 & s = 0
    }   // end of if k > 0
    return (gp_val);
}   // end of GramPoly function

double Weight(int i, int t, int m, int n, int s)
{   // calculates the weight of the i'th data point for the t'th Least-square
    // point of the s'th derivative, over 2m + 1 points, order n
    double sum = 0.0;
    for (int k = 0; k < n + 1; k++)
    {
        sum += (2.0 * k + 1.0) *
               GenFact(2.0 * m + k + 1.0, k + 1.0) *
               GramPoly(i, m, k, 0) * GramPoly(t, m, k, s);
    }   // end of for loop
    return (sum);
}   // end of Weight function

int main()
{
    double z;
    z = Weight(-2, -2, 2, 2, 0);
    cout << "The result is " << z;
    return 0;
}
However, when I run the code, the output is 1145, whilst I'm expecting 31/35 = 0.88571 as per equation 12 and the tables given in the paper. Where is my error?
Your Weight function is wrong - there is a term missing... try this one:
double Weight( int i , int t , int m , int n , int s )
{ // calculates the weight of the i'th data point for the t'th Least-square
  // point of the s'th derivative, over 2m + 1 points, order n
    double sum = 0.0 ;
    for ( int k = 0 ; k <= n ; k++ )
    {
        sum += (2*k+1) *
               (
                   GenFact(2*m,k) /        //<-- here
                   GenFact(2*m+k+1,k+1)
               ) * GramPoly(i,m,k,0) * GramPoly(t,m,k,s) ;
    } // end of for loop
    return ( sum ) ;
} // end of Weight function
First, the function GenFact should return a float or double instead of an int. Therefore gf should be a floating-point type too.
Second, your function Weight is not the same as the one in the paper. I think you missed the GenFact(2 * m, k) part.
In addition to the previous answer - you should divide by GenFact(2.0 * m + k + 1.0, k + 1.0), not multiply by it (at least, the paper says so).
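Putting the three corrections together (the missing GenFact(2*m, k) term, dividing rather than multiplying by GenFact(2*m + k + 1, k + 1), and making GenFact return a floating-point value so the ratio of the two generalised factorials is not truncated by integer division), a combined sketch might look like this; GramPoly is assumed to be the one from the question:

// Generalised factorial (a)(a-1)...(a-b+1), returned as a double so that the
// ratio GenFact(2m, k) / GenFact(2m + k + 1, k + 1) below is a real division.
double GenFact(int a, int b)
{
    double gf = 1.0;
    for (int jj = a - b + 1; jj <= a; jj++)
    {
        gf *= jj;
    }
    return gf;
}

double Weight(int i, int t, int m, int n, int s)
{   // weight of the i'th data point for the t'th least-squares point
    // of the s'th derivative, over 2m + 1 points, order n
    double sum = 0.0;
    for (int k = 0; k <= n; k++)
    {
        sum += (2.0 * k + 1.0) *
               (GenFact(2 * m, k) / GenFact(2 * m + k + 1, k + 1)) *
               GramPoly(i, m, k, 0) * GramPoly(t, m, k, s);
    }
    return sum;
}

With these changes, Weight(-2, -2, 2, 2, 0) evaluates to 31/35 ≈ 0.88571, matching the tables in the paper.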