Did I used the pre-fetch instruction correctly to reduce memory latency?
Can I do better than this?
When I compile the code with -O3, g++ seems to unroll the inner loop (code at godbolt.org).
The architecture of the CPU is Broadwell.
Thanks.
Backward loop over an array and read/write elements.
Each calculation depends on the previous calculation.
#include <stdlib.h>
#include <iostream>
int main() {
const int N = 25000000;
float* x = reinterpret_cast<float*>(
aligned_alloc(16, 4*N)
); // 0.1 GB
x[N - 1] = 1.0f;
// fetch last cache line of the array
__builtin_prefetch(&x[N - 16], 0, 3);
// Backward loop over the i^th cache line.
for (int i = N - 16; i >= 0; i -= 16) {
for (int j = 15; j >= 1; --j) {
x[i + j - 1] += x[i + j];
}
__builtin_prefetch(&x[i - 16], 0, 3);
x[i - 1] = x[i];
}
std::cout << x[0] << "\n";
free(x);
}
Related
I am trying to implement the FFT algorithm on C. I wrote a code based on the function "four1" from the book "Numerical Recipes in C". I know that using external libraries such as FFTW would be more efficient, but I just wanted to try this as a first approach. But I am getting an error at runtime.
After trying to debug for a while, I have decided to copy the exact same function provided in the book, but I still have the same problem. The problem seems to be in the following commands:
tempr = wr * data[j] - wi * data[j + 1];
tempi = wr * data[j + 1] + wi * data[j];
and
data[j + 1] = data[i + 1] - tempi;
the j is sometimes as high as the last index of the array, so you cannot add one when indexing.
As I said, I didn´t change anything from the code, so I am very surprised that it is not working for me; it is a well-known reference for numerical methods in C, and I doubt there are errors in it. Also, I have found some questions regarding the same code example but none of them seemed to have the same issue (see C: Numerical Recipies (FFT), for example). What am I doing wrong?
Here is the code:
#include <iostream>
#include <stdio.h>
using namespace std;
#define SWAP(a,b) tempr=(a);(a)=(b);(b)=tempr
void four1(double* data, unsigned long nn, int isign)
{
unsigned long n, mmax, m, j, istep, i;
double wtemp, wr, wpr, wpi, wi, theta;
double tempr, tempi;
n = nn << 1;
j = 1;
for (i = 1; i < n; i += 2) {
if (j > i) {
SWAP(data[j], data[i]);
SWAP(data[j + 1], data[i + 1]);
}
m = n >> 1;
while (m >= 2 && j > m) {
j -= m;
m >>= 1;
}
j += m;
}
mmax = 2;
while (n > mmax) {
istep = mmax << 1;
theta = isign * (6.28318530717959 / mmax);
wtemp = sin(0.5 * theta);
wpr = -2.0 * wtemp * wtemp;
wpi = sin(theta);
wr = 1.0;
wi = 0.0;
for (m = 1; m < mmax; m += 2) {
for (i = m; i <= n; i += istep) {
j = i + mmax;
tempr = wr * data[j] - wi * data[j + 1];
tempi = wr * data[j + 1] + wi * data[j];
data[j] = data[i] - tempr;
data[j + 1] = data[i + 1] - tempi;
data[i] += tempr;
data[i + 1] += tempi;
}
wr = (wtemp = wr) * wpr - wi * wpi + wr;
wi = wi * wpr + wtemp * wpi + wi;
}
mmax = istep;
}
}
#undef SWAP
int main()
{
// Testing with random data
double data[] = {1, 1, 2, 0, 1, 3, 4, 0};
four1(data, 4, 1);
for (int i = 0; i < 7; i++) {
cout << data[i] << " ";
}
}
The first 2 editions of Numerical Recipes in C use the unusual (for C) convention that arrays are 1-based. (This was probably because the Fortran (1-based) version came first and the translation to C was done without regard to conventions.)
You should read section 1.2 Some C Conventions for Scientific
Computing, specifically the paragraphs on Vectors and One-Dimensional Arrays. As well as trying to justify their 1-based decision, this section does explain how to adapt pointers appropriately to match their code.
In your case, this should work -
int main()
{
// Testing with random data
double data[] = {1, 1, 2, 0, 1, 3, 4, 0};
double *data1based = data - 1;
four1(data1based, 4, 1);
for (int i = 0; i < 7; i++) {
cout << data[i] << " ";
}
}
However, as #Some programmer dude mentions in the comments the workaround advocated by the book is undefined behaviour as data1based points outside the bounds of the data array.
Whilst this way well work in practice, an alternative and non-UB workaround would be to change your interpretation to match their conventions -
int main()
{
// Testing with random data
double data[] = { -1 /*dummy value*/, 1, 1, 2, 0, 1, 3, 4, 0};
four1(data, 4, 1);
for (int i = 1; i < 8; i++) {
cout << data[i] << " ";
}
}
I'd be very wary of this becoming contagious though and infecting your code too widely.
The third edition tacitly recognised this 'mistake' and, as part of supporting C++ and standard library collections, switched to use the C & C++ conventions of zero-based arrays.
#include <iostream>
#include <chrono>
#include <random>
#include <time.h>
using namespace std;
typedef pair<double,double> pd;
#define x first
#define y second
#define cell(i,j,w) ((i)*(w) + (j))
class MyTimer
{
private:
std::chrono::time_point<std::chrono::steady_clock> starter;
std::chrono::time_point<std::chrono::steady_clock> ender;
public:
void startCounter() {
starter = std::chrono::steady_clock::now();
}
long long getCounter() {
ender = std::chrono::steady_clock::now();
return std::chrono::duration_cast<std::chrono::milliseconds>(ender - starter).count();
}
};
int main()
{
const int n = 5000;
int* value1 = new int[(n + 1) * (n + 1)];
int* value2 = new int[(n + 1) * (n + 1)];
double* a = new double[(n + 1) * (n + 1)];
double* b = new double[(n + 1) * (n + 1)];
pd* packed = new pd[(n + 1) * (n + 1)];
MyTimer timer;
for (int i = 1; i <= n; i++)
for (int j = 1; j <= n; j++) {
value1[cell(i, j, n + 1)] = rand() % 5000;
value2[cell(i, j, n + 1)] = rand() % 5000;
}
for (int i = 1; i <= n; i++) {
a[cell(i, 0, n + 1)] = 0;
a[cell(0, i, n + 1)] = 0;
b[cell(i, 0, n + 1)] = 0;
b[cell(0, i, n + 1)] = 0;
packed[cell(i, 0, n + 1)] = pd(0, 0);
packed[cell(0, i, n + 1)] = pd(0, 0);
}
for (int tt=1; tt<=5; tt++)
{
timer.startCounter();
for (int i=1; i<=n; i++)
for (int j = 1; j <= n; j++) {
// packed[i][j] = packed[i-1][j] + packed[i][j-1] - packed[i-1][j-1] + value1[i][j]
packed[cell(i, j, n + 1)].x = packed[cell(i - 1, j, n + 1)].x + packed[cell(i, j - 1, n + 1)].x - packed[cell(i - 1, j - 1, n + 1)].x + value1[cell(i, j, n + 1)];
packed[cell(i, j, n + 1)].y = packed[cell(i - 1, j, n + 1)].y + packed[cell(i, j - 1, n + 1)].y - packed[cell(i - 1, j - 1, n + 1)].y + value1[cell(i, j, n + 1)] * value1[cell(i, j, n + 1)];
}
cout << "Time packed = " << timer.getCounter() << "\n";
timer.startCounter();
for (int i=1; i<=n; i++)
for (int j = 1; j <= n; j++) {
// a[i][j] = a[i-1][j] + a[i][j-1] - a[i-1][j-1] + value2[i][j];
// b[i][j] = b[i-1][j] + b[i][j-1] - b[i-1][j-1] + value2[i][j] * value2[i][j];
a[cell(i, j, n + 1)] = a[cell(i - 1, j, n + 1)] + a[cell(i, j - 1, n + 1)] - a[cell(i - 1, j - 1, n + 1)] + value2[cell(i, j, n + 1)];
b[cell(i, j, n + 1)] = b[cell(i - 1, j, n + 1)] + b[cell(i, j - 1, n + 1)] - b[cell(i - 1, j - 1, n + 1)] + value2[cell(i, j, n + 1)] * value2[cell(i, j, n + 1)];
}
cout << "Time separate = " << timer.getCounter() << "\n\n";
}
delete[] value1;
delete[] value2;
delete[] a;
delete[] b;
delete[] packed;
}
So I'm computing a 2D prefix table (Summed Area Table). And I notice the property in the title.
When using CUDA nvcc compiler (with -O2) using the command line or Visual Studio Release mode , the result is 2x faster (separate takes 200ms, packed takes 100ms) the first run, but only 25% faster in subsequent run (this is because value2[] is cached after the first loop). In my actual program with more steps of calculation (computing SAT is just step 1), it's always 2x faster since value1[] and value2[] have definitely been evicted from cache.
I know packed array is faster because modern Intel CPU read 32-64 bytes into cache at once. So by packing both array together, it can read both data in 1 main memory (RAM) access instead of 2. But why is the speedup so high? Along with memory access, the CPU still has to perform 6 additions, 2 subtractions, and 1 multiply per loop. 2x speedup from halving memory access is 100% improvement efficiency (Amdahl Law), the same as if those add/mult operations didn't exist. How is it possible?
I'm certain it has something to do with CPU pipelining, but can't explain more thoroughly. Can anyone explain this further in terms of instruction latency/memory access latency/assembly? Thank you.
The code doesn't use any GPU, so any other good compiler should give the same 2x speedup as nvcc. On g++ 9.3.0 (g++ file.cpp -O2 -std=c++11 -o file.exe), it's also 2x speedup. CPU is Intel i7-7700
I've run this program here and here2 with command line arguments -O2 -std=c++11, it also shows 1.5-2x speedup. Use n = 3000, bigger and it won't run (free VM service, afterall). So it's not just my computer
The answer is in the access latency of different level of memory, from L1 cache -> main memory (RAM).
Data in L1 cache takes ~~5 cycle to access, while data from RAM takes 50-100cycle. Meanwhile, add/sub/mult operations takes 3-5 cycles.
Therefore, the dominating limiter of performance is main memory access. So by reducing the number of main memory request by half, performance almost doubles
It's a follow-up question to this one
Now I have the code:
#include <iostream>
#include <cmath>
#include <omp.h>
#define max(a, b) (a)>(b)?(a):(b)
const int m = 2001;
const int n = 2000;
const int p = 4;
double v[m + 2][m + 2];
double x[m + 2];
double y[m + 2];
double _new[m + 2][m + 2];
double maxdiffA[p + 1];
int icol, jrow;
int main() {
omp_set_num_threads(p);
double h = 1.0 / (n + 1);
double start = omp_get_wtime();
#pragma omp parallel for private(icol) shared(x, y, v, _new)
for (icol = 0; icol <= n + 1; ++icol) {
x[icol] = y[icol] = icol * h;
_new[icol][0] = v[icol][0] = 6 - 2 * x[icol];
_new[n + 1][icol] = v[n + 1][icol] = 4 - 2 * y[icol];
_new[icol][n + 1] = v[icol][n + 1] = 3 - x[icol];
_new[0][icol] = v[0][icol] = 6 - 3 * y[icol];
}
const double eps = 0.01;
#pragma omp parallel private(icol, jrow) shared(_new, v, maxdiffA)
{
while (true) { //for [iters=1 to maxiters by 2]
#pragma omp single
for (int i = 0; i < p; i++) maxdiffA[i] = 0;
#pragma omp for
for (icol = 1; icol <= n; icol++)
for (jrow = 1; jrow <= n; jrow++)
_new[icol][jrow] =
(v[icol - 1][jrow] + v[icol + 1][jrow] + v[icol][jrow - 1] + v[icol][jrow + 1]) / 4;
#pragma omp for
for (icol = 1; icol <= n; icol++)
for (jrow = 1; jrow <= n; jrow++)
v[icol][jrow] = (_new[icol - 1][jrow] + _new[icol + 1][jrow] + _new[icol][jrow - 1] +
_new[icol][jrow + 1]) / 4;
#pragma omp for
for (icol = 1; icol <= n; icol++)
for (jrow = 1; jrow <= n; jrow++)
maxdiffA[omp_get_thread_num()] = max(maxdiffA[omp_get_thread_num()],
fabs(_new[icol][jrow] - v[icol][jrow]));
#pragma omp barrier
double maxdiff = 0.0;
for (int k = 0; k < p; ++k) {
maxdiff = max(maxdiff, maxdiffA[k]);
}
if (maxdiff < eps)
break;
#pragma omp barrier
//#pragma omp single
//std::cout << maxdiff << std::endl;
}
}
double end = omp_get_wtime();
printf("start = %.16lf\nend = %.16lf\ndiff = %.16lf\n", start, end, end - start);
return 0;
}
But why it works 2-3 times slower (32sec vs 18sec) than serial analog:
#include <iostream>
#include <cmath>
#include <omp.h>
#define max(a,b) (a)>(b)?(a):(b)
const int m = 2001;
const int n = 2000;
double v[m + 2][m + 2];
double x[m + 2];
double y[m + 2];
double _new[m + 2][m + 2];
int main() {
double h = 1.0 / (n + 1);
double start = omp_get_wtime();
for (int i = 0; i <= n + 1; ++i) {
x[i] = y[i] = i * h;
_new[i][0]=v[i][0] = 6 - 2 * x[i];
_new[n + 1][i]=v[n + 1][i] = 4 - 2 * y[i];
_new[i][n + 1]=v[i][n + 1] = 3 - x[i];
_new[0][i]=v[0][i] = 6 - 3 * y[i];
}
const double eps=0.01;
while(true){ //for [iters=1 to maxiters by 2]
double maxdiff=0.0;
for (int i=1;i<=n;i++)
for (int j=1;j<=n;j++)
_new[i][j]=(v[i-1][j]+v[i+1][j]+v[i][j-1]+v[i][j+1])/4;
for (int i=1;i<=n;i++)
for (int j=1;j<=n;j++)
v[i][j]=(_new[i-1][j]+_new[i+1][j]+_new[i][j-1]+_new[i][j+1])/4;
for (int i=1;i<=n;i++)
for (int j=1;j<=n;j++)
maxdiff=max(maxdiff, fabs(_new[i][j]-v[i][j]));
if(maxdiff<eps) break;
std::cout << maxdiff<<std::endl;
}
double end = omp_get_wtime();
printf("start = %.16lf\nend = %.16lf\ndiff = %.16lf\n", start, end, end - start);
return 0;
}
Also interesting that it works SAME TIME as version (I can post it here if you say so) which looks like so
while(true){ //106 iteratins here!!!
#pragma omp paralell for
for(...)
#pragma omp paralell for
for(...)
#pragma omp paralell for
for(...)
}
But I thought that what making omp code slow is spawning threads inside while loop 106 times... But no! Then probably threads simultaneously write to the same array cells.. But where does it happen? I don't see it could you show me please?
Maybe it's because too much barriers? But Lecturer told me to implement the code like so and "analyse it" Maybe the answer is "Jacobi algorithm isn't meant to run well in parallel"? Or it's just my lame coding?
So the root of evel was
max(maxdiffA[w],fabs(_new[icol][jrow] - v[icol][jrow]))
because it's
#define max(a, b) (a)>(b)?(a):(b)
It's probably creating TOO much branching ('if's ) Without this thing parallel version works 8 times faster and loading CPU 68% instead of 99%..
The starange thing: same "max" doesn't affect serioal version
I am writing to make you aware of a few situations. It is not short to write in a comment, so I decided to write as an answer.
every time a thread is made, it takes some time for its creation. if your program's running time in a single core is short, then the creation of threads will make this time longer for multi-core.
plus using a barrier makes all your threads wait for others, which could somehow be slowed down in cpu. this way, even if all threads finish the job very fast, that last one will make the total run time longer.
try to run your program with bigger sized arrays where time is around 2 minutes for single threading. then make your way to multi-core.
then try to wrap your main code in a normal loop to run it a few times and prints the timings for each. the first run of the loop might be slow because of loading libraries, but the next runs should be faster to prove the increasing speed.
if above suggestions do not give a result, then it means your coding needs more editing.
EDIT:
To downvoters, If you don't like a post, please at least be polite and leave a comment. Or better, give your own answer so be helpful to community.
I have never worked with OpenMP or optimization of C++, so all help is welcome. I'm probably doing some very stupid things that slow down the process drastically. It doesn't need to be the fastest, but I think some easy tricks will significantly speed it up. Anyone? Thanks a lot!
This function calculates the standard deviation of a patch, given a kernel size and greyscale OpenCV image. The middle pixel of the patch is kept if it is below the given threshold, else it is rejected. This is done for each pixel except the border.
#include "stdafx.h"
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
#include "opencv2/photo/photo.hpp"
#include <stdlib.h>
#include <stdio.h>
#include "utils.h"
#include <windows.h>
#include <string.h>
#include <math.h>
#include <numeric>
using namespace cv;
using namespace std;
Mat low_pass_filter(Mat img, int threshold, int kernelSize)
{
unsigned char *input = (unsigned char*)(img.data);
Mat output = Mat::zeros(img.size(), CV_8UC1);
unsigned char *output_ptr = (unsigned char*)(output.data);
#pragma omp parallel for
for (int i = (kernelSize - 1) / 2; i < img.rows - (kernelSize - 1) / 2; i++){
for (int j = (kernelSize - 1) / 2; j < img.cols - (kernelSize - 1) / 2; j++){
double sum, m, accum, stdev;
vector<double> v;
// Kernel Patch
for (int kx = i - (kernelSize - 1) / 2; kx <= i + (kernelSize - 1) / 2; kx++){
for (int ky = j - (kernelSize - 1) / 2; ky <= j + (kernelSize - 1) / 2; ky++){
v.push_back((double)input[img.step * kx + ky]);//.at<uchar>(kx, ky));
}
}
sum = std::accumulate(std::begin(v), std::end(v), 0.0);
m = sum / v.size();
accum = 0.0;
std::for_each(std::begin(v), std::end(v), [&](const double d) {
accum += (d - m) * (d - m);
});
stdev = sqrt(accum / (v.size() - 1));
if (stdev < threshold){
output_ptr[img.step * i + j] = input[img.step * i + j];
}
}
}
return output;
}
Vector v is not required. Instead of adding items to it, maintain accumulators of d and d*d, and then use variance = E(v²) / E(v)² so that your inner code becomes:
double sum = 0;
double sum2 = 0;
int n = kernelSize * kernelSize;
// Kernel Patch
for (int kx = ...) {
for (int ky = ...) {
sum += d;
sum2 += d*d;
}
}
double mean = sum/n;
double stddev = sqrt(sum2/n - mean*mean);
if (stddev < threshold) {
...;
}
After that, consider that the sum of elements centred around (x+1,y) can be found from the result for (x,y) simply by subtracting all the elements in the previous left-hand column, and adding all the elements in the new right-hand column. An analogous operation works vertically.
Also, check your compiler options - are you auto-vectorizing loops, and using SIMD instructions (if available)?
I've been using the openCV to do some block matching and I've noticed it's sum of squared differences code is very fast compared to a straight forward for loop like this:
int SSD = 0;
for(int i =0; i < arraySize; i++)
SSD += (array1[i] - array2[i] )*(array1[i] - array2[i]);
If I look at the source code to see where the heavy lifting happens, the
OpenCV folks have their for loops do 4 squared difference calculations at a time in each iteration of the loop. The function to do the block matching looks like this.
int64
icvCmpBlocksL2_8u_C1( const uchar * vec1, const uchar * vec2, int len )
{
int i, s = 0;
int64 sum = 0;
for( i = 0; i <= len - 4; i += 4 )
{
int v = vec1[i] - vec2[i];
int e = v * v;
v = vec1[i + 1] - vec2[i + 1];
e += v * v;
v = vec1[i + 2] - vec2[i + 2];
e += v * v;
v = vec1[i + 3] - vec2[i + 3];
e += v * v;
sum += e;
}
for( ; i < len; i++ )
{
int v = vec1[i] - vec2[i];
s += v * v;
}
return sum + s;
}
This calculation is for unsigned 8 bit integers. They perform a similar calculation for 32-bit floats in this function:
double
icvCmpBlocksL2_32f_C1( const float *vec1, const float *vec2, int len )
{
double sum = 0;
int i;
for( i = 0; i <= len - 4; i += 4 )
{
double v0 = vec1[i] - vec2[i];
double v1 = vec1[i + 1] - vec2[i + 1];
double v2 = vec1[i + 2] - vec2[i + 2];
double v3 = vec1[i + 3] - vec2[i + 3];
sum += v0 * v0 + v1 * v1 + v2 * v2 + v3 * v3;
}
for( ; i < len; i++ )
{
double v = vec1[i] - vec2[i];
sum += v * v;
}
return sum;
}
I was wondering if anyone had any idea if breaking a loop up into chunks of 4 like this might speed up code? I should add that there is no multithreading occuring in this code.
My guess is that this is just a simple implementation of unrolling the loop - it saves 3 additions and 3 compares on each pass of the loop, which can be a great savings if, for example, checking len involves a cache miss. The downside is that this optimization adds code complexity (e.g. the additional for loop at the end to finish the loop for the len % 4 items left if the length is not evenly divisible by 4) and, of course, it's an architecture-dependent optimization whose magnitude of improvement will vary by hardware/compiler/etc...
Still, it's straightforward to follow compared to most optimizations and will probably result in some sort of performance increase regardless of the architecture, so it's low risk to just throw it in there and hope for the best. Since OpenCV is such a well-supported chunk of code, I'm sure that someone instrumented these chunks of code and found them to be well worth it - as you yourself have done.
There is one obvious optimisation of your code, viz:
int SSD = 0;
for(int i = 0; i < arraySize; i++)
{
int v = array1[i] - array2[i];
SSD += v * v;
}