How to speed up my sparse matrix solver? - c++

I'm writing a sparse matrix solver using the Gauss-Seidel method. By profiling, I've determined that about half of my program's time is spent inside the solver. The performance-critical part is as follows:
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
    for (size_t x = 1; x < d_nx - 1; ++x) {
        d_x[ic] = d_b[ic]
            - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
            - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
        ++ic; ++iw; ++ie; ++is; ++in;
    }
    ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
All arrays involved are of float type. Actually, they are not arrays but objects with an overloaded [] operator, which (I think) should be optimized away, but is defined as follows:
inline float &operator[](size_t i) { return d_cells[i]; }
inline float const &operator[](size_t i) const { return d_cells[i]; }
For d_nx = d_ny = 128, this can be run about 3500 times per second on an Intel i7 920. This means that the inner loop body runs 3500 * 128 * 128 = 57 million times per second. Since only some simple arithmetic is involved, that strikes me as a low number for a 2.66 GHz processor.
Maybe it's not limited by CPU power, but by memory bandwidth? Well, one 128 * 128 float array eats 65 kB, so all 6 arrays should easily fit into the CPU's L3 cache (which is 8 MB). Assuming that nothing is cached in registers, I count 15 memory accesses in the inner loop body. Floats are 4 bytes, so that's 60 bytes per iteration, and 57 million * 60 bytes = 3.4 GB/s. The L3 cache runs at 2.66 GHz, so it's the same order of magnitude. My guess is that memory is indeed the bottleneck.
To speed this up, I've attempted the following:
Compile with g++ -O3. (Well, I'd been doing this from the beginning.)
Parallelizing over 4 cores using OpenMP pragmas. I have to change to the Jacobi algorithm to avoid reads from and writes to the same array. This requires that I do twice as many iterations, leading to a net result of about the same speed.
Fiddling with implementation details of the loop body, such as using pointers instead of indices. No effect.
What's the best approach to speed this guy up? Would it help to rewrite the inner body in assembly (I'd have to learn that first)? Should I run this on the GPU instead (which I know how to do, but it's such a hassle)? Any other bright ideas?
(N.B. I do take "no" for an answer, as in: "it can't be done significantly faster, because...")
Update: as requested, here's a full program:
#include <iostream>
#include <cstdlib>
#include <cstring>

using namespace std;

size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;

void step() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            d_x[ic] = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}

void solve(size_t iters) {
    for (size_t i = 0; i < iters; ++i) {
        step();
    }
}

void clear(float *a) {
    memset(a, 0, d_nx * d_ny * sizeof(float));
}

int main(int argc, char **argv) {
    size_t n = d_nx * d_ny;
    d_x = new float[n]; clear(d_x);
    d_b = new float[n]; clear(d_b);
    d_w = new float[n]; clear(d_w);
    d_e = new float[n]; clear(d_e);
    d_s = new float[n]; clear(d_s);
    d_n = new float[n]; clear(d_n);
    solve(atoi(argv[1]));
    cout << d_x[0] << endl; // prevent the thing from being optimized away
}
I compile and run it as follows:
$ g++ -o gstest -O3 gstest.cpp
$ time ./gstest 8000
0
real 0m1.052s
user 0m1.050s
sys 0m0.010s
(This standalone test manages about 8000 instead of 3500 iterations per second because my "real" program does a lot of other stuff too. But it's representative.)
Update 2: I've been told that uninitialized values may not be representative because NaN and Inf values may slow things down. I'm now clearing the memory in the example code. It makes no difference in execution speed for me, though.

Couple of ideas:

1. Use SIMD. You could load 4 floats at a time from each array into a SIMD register (e.g. SSE on Intel, VMX on PowerPC). The disadvantage of this is that some of the d_x values will be "stale", so your convergence rate will suffer (but not as badly as with a Jacobi iteration); it's hard to say whether the speedup offsets it.
2. Use SOR. It's simple, doesn't add much computation, and can improve your convergence rate quite well, even for a relatively conservative relaxation value (say 1.5). See the sketch after this list.
3. Use conjugate gradient. If this is for the projection step of a fluid simulation (i.e. enforcing incompressibility), you should be able to apply CG and get a much better convergence rate. A good preconditioner helps even more.
4. Use a specialized solver. If the linear system arises from the Poisson equation, you can do even better than conjugate gradient using FFT-based methods.

If you can explain more about what the system you're trying to solve looks like, I can probably give some more advice on #3 and #4.
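For #2, a minimal SOR sketch, reusing the index bookkeeping from your step(); omega is the relaxation factor, and omega = 1 reduces to your current update:

const float omega = 1.5f; // conservative over-relaxation value
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
    for (size_t x = 1; x < d_nx - 1; ++x) {
        float gs = d_b[ic]
            - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
            - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
        d_x[ic] += omega * (gs - d_x[ic]); // blend the old value with the GS update
        ++ic; ++iw; ++ie; ++is; ++in;
    }
    ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}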

I think I've managed to optimize it. Here's the code; create a new project in VC++, add this code, and simply compile under "Release".
#include <iostream>
#include <cstdlib>
#include <cstring>
#define _WIN32_WINNT 0x0400
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <conio.h>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;
void step_original() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            d_x[ic] = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}
void step_new() {
    //size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    float
        *d_b_ic,
        *d_w_ic,
        *d_e_ic,
        *d_x_ic,
        *d_x_iw,
        *d_x_ie,
        *d_x_is,
        *d_x_in,
        *d_n_ic,
        *d_s_ic;
    // start the pointers at the same offsets the index version uses
    d_b_ic = d_b + d_ny + 1;     // ic
    d_w_ic = d_w + d_ny + 1;     // ic
    d_e_ic = d_e + d_ny + 1;     // ic
    d_x_ic = d_x + d_ny + 1;     // ic
    d_x_iw = d_x + d_ny;         // iw
    d_x_ie = d_x + d_ny + 2;     // ie
    d_x_is = d_x + 1;            // is
    d_x_in = d_x + 2 * d_ny + 1; // in
    d_n_ic = d_n + d_ny + 1;     // ic
    d_s_ic = d_s + d_ny + 1;     // ic
    for (size_t y = 1; y < d_ny - 1; ++y)
    {
        for (size_t x = 1; x < d_nx - 1; ++x)
        {
            /*d_x[ic] = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];*/
            *d_x_ic = *d_b_ic
                - *d_w_ic * *d_x_iw - *d_e_ic * *d_x_ie
                - *d_s_ic * *d_x_is - *d_n_ic * *d_x_in;
            //++ic; ++iw; ++ie; ++is; ++in;
            d_b_ic++;
            d_w_ic++;
            d_e_ic++;
            d_x_ic++;
            d_x_iw++;
            d_x_ie++;
            d_x_is++;
            d_x_in++;
            d_n_ic++;
            d_s_ic++;
        }
        //ic += 2; iw += 2; ie += 2; is += 2; in += 2;
        d_b_ic += 2;
        d_w_ic += 2;
        d_e_ic += 2;
        d_x_ic += 2;
        d_x_iw += 2;
        d_x_ie += 2;
        d_x_is += 2;
        d_x_in += 2;
        d_n_ic += 2;
        d_s_ic += 2;
    }
}
void solve_original(size_t iters) {
    for (size_t i = 0; i < iters; ++i) {
        step_original();
    }
}

void solve_new(size_t iters) {
    for (size_t i = 0; i < iters; ++i) {
        step_new();
    }
}

void clear(float *a) {
    memset(a, 0, d_nx * d_ny * sizeof(float));
}
int main(int argc, char **argv) {
    size_t n = d_nx * d_ny;
    d_x = new float[n]; clear(d_x);
    d_b = new float[n]; clear(d_b);
    d_w = new float[n]; clear(d_w);
    d_e = new float[n]; clear(d_e);
    d_s = new float[n]; clear(d_s);
    d_n = new float[n]; clear(d_n);
    if (argc < 3) {
        printf("app.exe (x)iters (o/n)algo\n");
        return 1;
    }
    bool bOriginalStep = (argv[2][0] == 'o');
    size_t iters = atoi(argv[1]);
    /*printf("Press any key to start!");
    _getch();
    printf(" Running speed test..\n");*/
    __int64 freq, start, end, diff;
    if (!::QueryPerformanceFrequency((LARGE_INTEGER*)&freq))
        throw "Not supported!";
    freq /= 1000000; // microseconds!
    {
        ::QueryPerformanceCounter((LARGE_INTEGER*)&start);
        if (bOriginalStep)
            solve_original(iters);
        else
            solve_new(iters);
        ::QueryPerformanceCounter((LARGE_INTEGER*)&end);
        diff = (end - start) / freq;
    }
    printf("Speed (%s)\t\t: %lld\n", (bOriginalStep ? "original" : "new"), (long long)diff);
    //_getch();
    //cout << d_x[0] << endl; // prevent the thing from being optimized away
}
Run it like this:
app.exe 10000 o
app.exe 10000 n
"o" means old code, yours.
"n" is mine, the new one.
My results:
Speed (original):
1515028
1523171
1495988
Speed (new):
966012
984110
1006045
Improvement of about 30%.
The logic behind it:
You've been using index counters to access the arrays; I use pointers.
While running, set a breakpoint at one of the calculation lines in VC++'s debugger and press F8. You'll get the disassembly window, where you can see the generated opcodes (assembly code).
Anyway, look:
int *x = ...;
x[3] = 123;
This tells the CPU to put the pointer x into a register (say EAX), then add 3 * sizeof(int) to it, and only then store the value 123.
The pointer approach is better because we cut out that per-access address computation: we advance the pointers ourselves, once per iteration, and can optimize that as needed.
I hope this helps.
Sidenote to stackoverflow.com's staff:
Great website, I wish I'd heard of it long ago!

For one thing, there seems to be a pipelining issue here. The loop reads the value in d_x that has just been written, but apparently it has to wait for that write to complete. Just rearranging the order of the computation, doing something useful while it waits, makes it almost twice as fast:
d_x[ic] = d_b[ic]
    - d_e[ic] * d_x[ie]
    - d_s[ic] * d_x[is] - d_n[ic] * d_x[in]
    - d_w[ic] * d_x[iw] /* d_x[iw] has just been written to, process this last */;
It was Eamon Nerbonne who figured this out. Many upvotes to him! I would never have guessed.

Poni's answer looks like the right one to me.
I just want to point out that in this type of problem, you often gain benefits from memory locality. Right now, the b, w, e, s, n arrays are all at separate locations in memory. If the problem did not fit in the L3 cache (mostly in the L2), that would hurt, and a solution of this sort would be helpful:
#include <iostream>
#include <cstdlib>
#include <cstring>
using namespace std;

size_t d_nx = 128, d_ny = 128;
float *d_x;
struct D { float b, w, e, s, n; };
D *d;

void step() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            d_x[ic] = d[ic].b
                - d[ic].w * d_x[iw] - d[ic].e * d_x[ie]
                - d[ic].s * d_x[is] - d[ic].n * d_x[in];
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}

void solve(size_t iters) { for (size_t i = 0; i < iters; ++i) step(); }

void clear(float *a) { memset(a, 0, d_nx * d_ny * sizeof(float)); }

int main(int argc, char **argv) {
    size_t n = d_nx * d_ny;
    d_x = new float[n]; clear(d_x);
    d = new D[n]; memset(d, 0, n * sizeof(D));
    solve(atoi(argv[1]));
    cout << d_x[0] << endl; // prevent the thing from being optimized away
}
For example, this solution at 1280x1280 is a little less than 2x faster than Poni's solution (13s vs 23s in my test--your original implementation is then 22s), while at 128x128 it's 30% slower (7s vs. 10s--your original is 10s).
(Iterations were scaled up to 80000 for the base case, and 800 for the 100x larger case of 1280x1280.)

I think you're right about memory being the bottleneck. It's a pretty simple loop with just some simple arithmetic per iteration. The ic, iw, ie, is, and in indices refer to three different rows of the matrix, so I'm guessing there are a bunch of cache misses there.

I'm no expert on the subject, but I've seen that there are several academic papers on improving the cache usage of the Gauss-Seidel method.
Another possible optimization is the use of the red-black variant, where points are updated in two sweeps in a chessboard-like pattern. In this way, all updates within a sweep are independent and can be parallelized; a sketch follows.
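A minimal red-black sketch based on the question's step(), assuming the row stride d_ny from that code; within one color, no update reads a value written in the same sweep, so the loops can be parallelized (e.g. with an OpenMP pragma on the y loop):

void step_color(int color) { // color: 0 = red, 1 = black
    for (size_t y = 1; y < d_ny - 1; ++y) {
        size_t x = 1 + ((y + color + 1) & 1); // first x with (x + y) % 2 == color
        for (; x < d_nx - 1; x += 2) {
            size_t ic = y * d_ny + x; // same indexing as the question's code
            d_x[ic] = d_b[ic]
                - d_w[ic] * d_x[ic - 1]    - d_e[ic] * d_x[ic + 1]
                - d_s[ic] * d_x[ic - d_ny] - d_n[ic] * d_x[ic + d_ny];
        }
    }
}
// one full sweep: step_color(0); step_color(1);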

I suggest copying the operands into local temporaries before the calculation (a manual "prefetch" of sorts) and also researching "data oriented design":
void step_prefetch() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    float db_ic, dw_ic, de_ic, ds_ic, dn_ic;
    float dx_iw, dx_ie, dx_is, dx_in;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            // Load all operands into locals first.
            // Sorting these statements by array may increase speed;
            // sorting by index name may increase speed too.
            db_ic = d_b[ic];
            dw_ic = d_w[ic];
            dx_iw = d_x[iw];
            de_ic = d_e[ic];
            dx_ie = d_x[ie];
            ds_ic = d_s[ic];
            dx_is = d_x[is];
            dn_ic = d_n[ic];
            dx_in = d_x[in];
            // Calculate
            d_x[ic] = db_ic
                - dw_ic * dx_iw - de_ic * dx_ie
                - ds_ic * dx_is - dn_ic * dx_in;
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}
This differs from your second method since the values are copied to local temporary variables before the calculation is performed.

Related

Multiple threads taking more time than single process [duplicate]

This question already has answers here:
C: using clock() to measure time in multi-threaded programs
(2 answers)
Closed 2 years ago.
I am implementing a pattern matching algorithm by moving template gradient info over the entire target's gradient image, at each rotation (-60 to 60). I have already saved the template info for each rotation, i.e. 121 templates are preprocessed and saved.
But the issue is that this is consuming a lot of time (approx 110 ms), so I decided to split the matching at sets of rotations (-60 to -30, -30 to 0, 0 to 30 and 30 to 60) into 4 threads, but threading is taking more time than a single thread (approx 115 ms to 120 ms).
Snippet of code is...
#define MAXTARGETNUM 64

MatchResultA totalResultsTemp[MAXTARGETNUM];

void CShapeMatch::match(ShapeInfo *ShapeInfoVec, search_region SearchRegion, float MinScore, float Greediness, int width, int height, int16_t *pBufGradX, int16_t *pBufGradY, float *pBufMag, bool corr)
{
    MatchResultA resultsPerDeg[MAXTARGETNUM];
    ....
    ....
    int startX = SearchRegion.StartX;
    int startY = SearchRegion.StartY;
    int endX = SearchRegion.EndX;
    int endY = SearchRegion.EndY;
    float AngleStep = SearchRegion.AngleStep;
    float AngleStart = SearchRegion.AngleStart;
    float AngleStop = SearchRegion.AngleStop;
    int startIndex = (int)(ShapeInfoVec[0].AngleNum/2) + ShapeInfoVec[0].AngleNum%2 + (int)AngleStart/AngleStep;
    int stopIndex = (int)(ShapeInfoVec[0].AngleNum/2) + ShapeInfoVec[0].AngleNum%2 + (int)AngleStop/AngleStep;
    for (int k = startIndex; k < stopIndex; k++) {
        ....
        for (int j = startY; j < endY; j++) {
            for (int i = startX; i < endX; i++) {
                for (int m = 0; m < ShapeInfoVec[k].NoOfCordinates; m++)
                {
                    curX = i + (ShapeInfoVec[k].Coordinates + m)->x;  // template X coordinate
                    curY = j + (ShapeInfoVec[k].Coordinates + m)->y;  // template Y coordinate
                    iTx = *(ShapeInfoVec[k].EdgeDerivativeX + m);     // template X derivative
                    iTy = *(ShapeInfoVec[k].EdgeDerivativeY + m);     // template Y derivative
                    iTm = *(ShapeInfoVec[k].EdgeMagnitude + m);       // template gradients magnitude
                    if (curX < 0 || curY < 0 || curX > width - 1 || curY > height - 1)
                        continue;
                    offSet = curY * width + curX;
                    iSx = *(pBufGradX + offSet);  // get corresponding X derivative from source image
                    iSy = *(pBufGradY + offSet);  // get corresponding Y derivative from source image
                    iSm = *(pBufMag + offSet);
                    if (PartialScore > MinScore)
                    {
                        float Angle = ShapeInfoVec[k].Angel;
                        bool hasFlag = false;
                        for (int n = 0; n < resultsNumPerDegree; n++)
                        {
                            if (abs(resultsPerDeg[n].CenterLocX - i) < 5 && abs(resultsPerDeg[n].CenterLocY - j) < 5)
                            {
                                hasFlag = true;
                                if (resultsPerDeg[n].ResultScore < PartialScore)
                                {
                                    resultsPerDeg[n].Angel = Angle;
                                    resultsPerDeg[n].CenterLocX = i;
                                    resultsPerDeg[n].CenterLocY = j;
                                    resultsPerDeg[n].ResultScore = PartialScore;
                                    break;
                                }
                            }
                        }
                        if (!hasFlag)
                        {
                            resultsPerDeg[resultsNumPerDegree].Angel = Angle;
                            resultsPerDeg[resultsNumPerDegree].CenterLocX = i;
                            resultsPerDeg[resultsNumPerDegree].CenterLocY = j;
                            resultsPerDeg[resultsNumPerDegree].ResultScore = PartialScore;
                            resultsNumPerDegree++;
                        }
                        minScoreTemp = minScoreTemp < PartialScore ? PartialScore : minScoreTemp;
                    }
                }
            }
        }
        for (int i = 0; i < resultsNumPerDegree; i++)
        {
            mtx.lock();
            totalResultsTemp[totalResultsNum] = resultsPerDeg[i];
            totalResultsNum++;
            mtx.unlock();
        }
        n++;
    }
}
void CallerFunction() {
    int16_t *pBufGradX = (int16_t *) malloc(bufferSize * sizeof(int16_t));
    int16_t *pBufGradY = (int16_t *) malloc(bufferSize * sizeof(int16_t));
    float *pBufMag = (float *) malloc(bufferSize * sizeof(float));
    clock_t start = clock();
    float temp_stop = SearchRegion->AngleStop;
    SearchRegion->AngleStop = -30;
    thread t1(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness, width, height, pBufGradX, pBufGradY, pBufMag, corr);
    SearchRegion->AngleStart = -30;
    SearchRegion->AngleStop = 0;
    thread t2(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness, width, height, pBufGradX, pBufGradY, pBufMag, corr);
    SearchRegion->AngleStart = 0;
    SearchRegion->AngleStop = 30;
    thread t3(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness, width, height, pBufGradX, pBufGradY, pBufMag, corr);
    SearchRegion->AngleStart = 30;
    SearchRegion->AngleStop = temp_stop;
    thread t4(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness, width, height, pBufGradX, pBufGradY, pBufMag, corr);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    clock_t end = clock();
    cout << 1000*(double)(end-start)/CLOCKS_PER_SEC << endl;
}
As we can see, there are plenty of heap accesses, but they are read-only. Only totalResultsTemp and totalResultsNum are shared global resources that are written to.
My PC configuration is,
i5-7200U CPU @ 2.50GHz, 4 cores
4 Gig RAM
Ubuntu 18
for (int i = 0; i < resultsNumPerDegree; i++)
{
    mtx.lock();
    totalResultsTemp[totalResultsNum] = resultsPerDeg[i];
    totalResultsNum++;
    mtx.unlock();
}
You are writing into a static array, and mutexes are really time-consuming. Instead of creating locks, try std::atomic_int, or, in my opinion even better, just pass each thread the exact place where it should store its results; then the synchronization problem is not your problem anymore. A sketch follows.
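A minimal sketch of that idea, with hypothetical names (MatchResultA is reduced to a stub, matchRange stands in for the real matching work); the point is only the threading pattern: each thread appends to its own private vector, so no mutex is needed, and the buffers are merged after join():

#include <thread>
#include <vector>

struct MatchResultA { int x, y; float score; }; // stub for the real struct

// stands in for CShapeMatch::match over one angle range
void matchRange(int angleStart, int angleStop, std::vector<MatchResultA> &out) {
    // ... do the matching for [angleStart, angleStop) and push results ...
    out.push_back({angleStart, angleStop, 0.0f}); // placeholder work
}

int main() {
    const int ranges[5] = { -60, -30, 0, 30, 60 };
    std::vector<MatchResultA> perThread[4]; // one private buffer per thread
    std::thread t[4];
    for (int i = 0; i < 4; ++i)
        t[i] = std::thread(matchRange, ranges[i], ranges[i + 1],
                           std::ref(perThread[i]));
    for (int i = 0; i < 4; ++i) t[i].join();
    std::vector<MatchResultA> total; // merge single-threaded after the joins
    for (int i = 0; i < 4; ++i)
        total.insert(total.end(), perThread[i].begin(), perThread[i].end());
}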
Also note that clock(), as used in CallerFunction, measures CPU time summed over all of the process's threads, not wall-clock time, so a multithreaded run can never look faster by that metric; this is exactly what the duplicate linked above covers. Measure wall-clock time instead. If you want a higher-level way to split the algorithm across physical cores, OpenMP is convenient; this is a good OpenMP tutorial.
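A minimal wall-clock sketch; wallClockMs is a hypothetical helper that times any callable with std::chrono instead of clock():

#include <chrono>

template <typename F>
double wallClockMs(F &&work) {
    auto start = std::chrono::steady_clock::now(); // wall clock, not CPU time
    work();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
// usage: double ms = wallClockMs([&]{ /* spawn and join the four threads */ });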

how to optimize and speed up the multiplication of matrix in c++?

This is an optimized implementation of matrix multiplication; the routine performs a matrix multiplication operation.
C := C + A * B (where A, B, and C are n-by-n matrices stored in column-major format)
On exit, A and B maintain their input values.
void matmul_optimized(int n, int *A, int *B, int *C)
{
    // The matrices are treated bitwise: & is the element multiplication
    // and ^ is the addition.
    int i, j, k;
    int cij;
    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            cij = C[i + j * n]; // read C once into a local variable
            for (k = 0; k < n; ++k) {
                cij ^= A[i + k * n] & B[k + j * n]; // multiply with &, add with ^
            }
            C[i + j * n] = cij; // write the final result back into C
        }
    }
}
How do I speed up the multiplication further, based on the above function/method?
The function is tested up to 2048 by 2048 matrices.
The function matmul_optimized is compared against matmul_reference, using the harness below.
#include <stdio.h>
#include <stdlib.h>
#include "cpucycles.c"
#include "helper_functions.c"
#include "matmul_reference.c"
#include "matmul_optimized.c"

int main()
{
    int i, j;
    int n = 1024; // Number of rows or columns in the square matrices
    int *A, *B;   // Input matrices
    int *C1, *C2; // Output matrices from the reference and optimized implementations

    // Performance and correctness measurement declarations
    long int CLOCK_start, CLOCK_end, CLOCK_total, CLOCK_ref, CLOCK_opt;
    long int COUNTER, REPEAT = 5;
    int difference;
    float speedup;

    // Allocate memory for the matrices
    A = malloc(n * n * sizeof(int));
    B = malloc(n * n * sizeof(int));
    C1 = malloc(n * n * sizeof(int));
    C2 = malloc(n * n * sizeof(int));

    // Fill bits in A, B, C1
    fill(A, n * n);
    fill(B, n * n);
    fill(C1, n * n);

    // Initialize C2 = C1
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            C2[i * n + j] = C1[i * n + j];

    // Measure performance of the reference implementation
    CLOCK_total = 0;
    for (COUNTER = 0; COUNTER < REPEAT; COUNTER++)
    {
        CLOCK_start = cpucycles();
        matmul_reference(n, A, B, C1);
        CLOCK_end = cpucycles();
        CLOCK_total = CLOCK_total + CLOCK_end - CLOCK_start;
    }
    CLOCK_ref = CLOCK_total / REPEAT;
    printf("n=%d Avg cycle count for reference implementation = %ld\n", n, CLOCK_ref);

    // Measure performance of the optimized implementation
    CLOCK_total = 0;
    for (COUNTER = 0; COUNTER < REPEAT; COUNTER++)
    {
        CLOCK_start = cpucycles();
        matmul_optimized(n, A, B, C2);
        CLOCK_end = cpucycles();
        CLOCK_total = CLOCK_total + CLOCK_end - CLOCK_start;
    }
    CLOCK_opt = CLOCK_total / REPEAT;
    printf("n=%d Avg cycle count for optimized implementation = %ld\n", n, CLOCK_opt);
    speedup = (float)CLOCK_ref / (float)CLOCK_opt;

    // Check correctness by comparing C1 and C2
    difference = 0;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            difference = difference + C1[i * n + j] - C2[i * n + j];
    if (difference == 0)
        printf("Speedup factor = %.2f\n", speedup);
    if (difference != 0)
        printf("Reference and optimized implementations do not match\n");
    //print(C2, n);

    free(A);
    free(B);
    free(C1);
    free(C2);
    return 0;
}
You can try an algorithm like Strassen or Coppersmith-Winograd, and here is also a good example.
Or maybe try parallel computing, e.g. std::async/std::future or std::thread; a sketch follows.
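A hedged sketch of the threading suggestion (function names are mine): split the j loop over columns of C across std::async tasks; each task owns a disjoint column range of C, so no locking is needed.

#include <algorithm>
#include <future>
#include <vector>

// computes the bitwise product for columns [j0, j1) of C
void matmul_cols(int n, int j0, int j1, const int *A, const int *B, int *C) {
    for (int j = j0; j < j1; ++j)
        for (int k = 0; k < n; ++k) {
            int b = B[k + j * n];
            for (int i = 0; i < n; ++i)
                C[i + j * n] ^= A[i + k * n] & b;
        }
}

void matmul_parallel(int n, const int *A, const int *B, int *C, int nthreads) {
    std::vector<std::future<void>> tasks;
    int chunk = (n + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        int j0 = t * chunk, j1 = std::min(n, j0 + chunk);
        if (j0 < j1)
            tasks.push_back(std::async(std::launch::async,
                                       matmul_cols, n, j0, j1, A, B, C));
    }
    for (auto &f : tasks) f.get(); // wait for every column block
}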
Optimizing matrix-matrix multiplication requires careful attention to be paid to a number of issues:
First, you need to be able to use vector instructions. Only vector instructions can access parallelism inherent in the architecture. So, either your compiler needs to be able to automatically map to vector instructions, or you have to do so by hand, for example by calling the vector intrinsic library for AVX-2 instructions (for x86 architectures).
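As a concrete illustration of the first point, here is a hedged AVX2 intrinsics sketch of the bitwise product above, vectorized over i (the contiguous direction in column-major storage); the function name and the assumption that n is a multiple of 8 are mine:

#include <immintrin.h>

void matmul_avx2(int n, const int *A, const int *B, int *C) {
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k) {
            __m256i b = _mm256_set1_epi32(B[k + j * n]); // broadcast one element of B
            for (int i = 0; i < n; i += 8) {             // 8 ints per AVX2 register
                __m256i a = _mm256_loadu_si256((const __m256i *)&A[i + k * n]);
                __m256i c = _mm256_loadu_si256((const __m256i *)&C[i + j * n]);
                c = _mm256_xor_si256(c, _mm256_and_si256(a, b)); // ^= a & b, 8 lanes at once
                _mm256_storeu_si256((__m256i *)&C[i + j * n], c);
            }
        }
}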
Next, you need to pay careful attention to the memory hierarchy. Your performance can easily drop to less than 5% of peak if you don't do this.
Once you do this right, you will hopefully have broken the computation up into small enough computational chunks that you can also parallelize via OpenMP or pthreads.
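A minimal loop-tiling sketch for the memory-hierarchy point, under the same column-major layout; the tile size T is a placeholder to tune so a tile's working set fits in L1/L2:

#include <algorithm>

void matmul_tiled(int n, const int *A, const int *B, int *C) {
    const int T = 64; // tile size; tune for your cache
    for (int jj = 0; jj < n; jj += T)
        for (int kk = 0; kk < n; kk += T)
            for (int j = jj; j < std::min(jj + T, n); ++j)
                for (int k = kk; k < std::min(kk + T, n); ++k) {
                    int b = B[k + j * n];       // reused across the whole i loop
                    for (int i = 0; i < n; ++i) // contiguous in both A and C
                        C[i + j * n] ^= A[i + k * n] & b;
                }
}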
A document that carefully steps through what is required can be found at http://www.cs.utexas.edu/users/flame/laff/pfhp/LAFF-On-PfHP.html. (This is very much a work in progress.) At the end of it all, you will have an implementation that gets close to the performance attained by high-performance libraries like Intel's Math Kernel Library (MKL) or the BLAS-like Library Instantiation Software (BLIS).
(And, actually, you CAN then also effectively incorporate Strassen's algorithm. But that is another story, told in Unit 3.5.3 of these notes.)
You may find the following thread relevant: How does BLAS get such extreme performance?

HOG optimization with using SIMD

There are several attempts to optimize the calculation of the HOG descriptor using SIMD instructions: OpenCV, Dlib, and Simd. All of them use scalar code to add the resulting magnitude to the HOG histogram:
float histogram[height/8][width/8][18];
float ky[height], kx[width];
int idx[size];
float val[size];
for (size_t i = 0; i < size; ++i)
{
    histogram[y/8][x/8][idx[i]] += val[i]*ky[y]*kx[x];
    histogram[y/8][x/8 + 1][idx[i]] += val[i]*ky[y]*kx[x + 1];
    histogram[y/8 + 1][x/8][idx[i]] += val[i]*ky[y + 1]*kx[x];
    histogram[y/8 + 1][x/8 + 1][idx[i]] += val[i]*ky[y + 1]*kx[x + 1];
}
The value of size there depends on the implementation, but in general the meaning is the same.
I know that the problem of histogram calculation with SIMD does not have a simple and effective solution. But in this case we have a small histogram (18 bins). Can that help in SIMD optimizations?
I have found a solution. It is a temporary buffer. First we accumulate into the temporary buffer (and this operation can be vectorized). Then we add the sums from the buffer to the output histogram (and this operation can also be vectorized):
float histogram[height/8][width/8][18];
float ky[height], kx[width];
int idx[size];
float val[size];
float buf[18][4] = {{0}}; // the accumulation buffer must start zeroed
for (size_t i = 0; i < size; ++i)
{
    buf[idx[i]][0] += val[i]*ky[y]*kx[x];
    buf[idx[i]][1] += val[i]*ky[y]*kx[x + 1];
    buf[idx[i]][2] += val[i]*ky[y + 1]*kx[x];
    buf[idx[i]][3] += val[i]*ky[y + 1]*kx[x + 1];
}
for (size_t i = 0; i < 18; ++i)
{
    histogram[y/8][x/8][i] += buf[i][0];
    histogram[y/8][x/8 + 1][i] += buf[i][1];
    histogram[y/8 + 1][x/8][i] += buf[i][2];
    histogram[y/8 + 1][x/8 + 1][i] += buf[i][3];
}
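A hedged SSE sketch of the first loop, assuming (as in the snippet above) that x, y, idx, val, size and buf come from the surrounding code, and that x and y are fixed for the cell being processed: each buf[idx[i]] row is 4 consecutive floats, so all four weighted adds fit in one SSE operation.

#include <xmmintrin.h>

// the four bilinear weights, element 0 = ky[y]*kx[x] ... element 3 = ky[y+1]*kx[x+1]
const __m128 w = _mm_set_ps(ky[y + 1] * kx[x + 1], ky[y + 1] * kx[x],
                            ky[y] * kx[x + 1],     ky[y] * kx[x]);
for (size_t i = 0; i < size; ++i)
{
    float *p = buf[idx[i]];                           // one 4-float histogram row
    __m128 v = _mm_mul_ps(_mm_set1_ps(val[i]), w);    // val[i] * all four weights
    _mm_storeu_ps(p, _mm_add_ps(_mm_loadu_ps(p), v)); // buf[idx[i]][0..3] += v
}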
You can do a partial optimisation by using SIMD to calculate all the (flattened) histogram indices and the bin increments. Then process these in a scalar loop afterwards. You probably also want to strip-mine this such that you process one row at a time, in order to keep the temporary bin indices and increments in cache. It might appear that this would be inefficient, due to the use of temporary intermediate buffers, but in practice I have seen a useful overall gain in similar scenarios.
uint32_t i = 0;
for (y = 0; y < height; ++y) // for each row
{
    uint32_t indices[width * 4]; // flattened histogram indices for this row
    float vals[width * 4];       // histogram bin increments for this row
    // SIMD loop for this row - calculate flattened histogram indices and bin
    // increments (scalar code shown for reference - converting this loop to
    // SIMD is left as an exercise for the reader...)
    for (x = 0; x < width; ++x, ++i)
    {
        indices[4*x]   = (y/8)*(width/8)*18 + (x/8)*18 + idx[i];
        indices[4*x+1] = (y/8)*(width/8)*18 + (x/8 + 1)*18 + idx[i];
        indices[4*x+2] = (y/8+1)*(width/8)*18 + (x/8)*18 + idx[i];
        indices[4*x+3] = (y/8+1)*(width/8)*18 + (x/8 + 1)*18 + idx[i];
        vals[4*x]   = val[i]*ky[y]*kx[x];
        vals[4*x+1] = val[i]*ky[y]*kx[x+1];
        vals[4*x+2] = val[i]*ky[y+1]*kx[x];
        vals[4*x+3] = val[i]*ky[y+1]*kx[x+1];
    }
    // scalar loop for this row
    float * const histogram_base = &histogram[0][0][0]; // pointer to flattened histogram
    for (x = 0; x < width * 4; ++x) // for each set of 4 indices/increments in this row
    {
        histogram_base[indices[x]] += vals[x]; // update the (flattened) histogram
    }
}

C++ use SSE instructions for comparing huge vectors of ints

I have a huge vector<vector<int>> (18M x 128). Frequently I want to take 2 rows of this vector and compare them by this function:
int getDiff(int indx1, int indx2) {
    int result = 0;
    int pplus, pminus, tmp;
    for (int k = 0; k < 128; k += 2) {
        pplus = nodeL[indx2][k] - nodeL[indx1][k];
        pminus = nodeL[indx1][k + 1] - nodeL[indx2][k + 1];
        tmp = max(pplus, pminus);
        if (tmp > result) {
            result = tmp;
        }
    }
    return result;
}
As you see, the function loops through the two row vectors, does some subtraction, and at the end returns the maximum. This function will be used a million times, so I was wondering if it can be accelerated through SSE instructions. I use Ubuntu 12.04 and gcc.
Of course it is a micro-optimization, but it would be helpful if you could provide some help, since I know nothing about SSE. Thanks in advance
Benchmark:
int nofTestCases = 10000000;
vector<int> nodeIds(nofTestCases);
vector<int> goalNodeIds(nofTestCases);
vector<int> results(nofTestCases);
for (int l = 0; l < nofTestCases; l++) {
    nodeIds[l] = randomNodeID(18000000);
    goalNodeIds[l] = randomNodeID(18000000);
}
double time, result;
time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
    results[l] = getDiff2(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;
time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
    results[l] = getDiff(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;
where
int randomNodeID(int n) {
    return (int) (rand() / (double) (RAND_MAX + 1.0) * n);
}

/** Returns a timestamp ('now') in seconds (incl. a fractional part). */
inline double timestamp() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return double(tp.tv_sec) + tp.tv_usec / 1000000.;
}
FWIW I put together a pure SSE version (SSE4.1) which seems to run around 20% faster than the original scalar code on a Core i7:
#include <smmintrin.h>

int getDiff_SSE(int indx1, int indx2)
{
    int result[4] __attribute__ ((aligned(16))) = { 0 };
    const int * const p1 = &nodeL[indx1][0];
    const int * const p2 = &nodeL[indx2][0];
    const __m128i vke = _mm_set_epi32(0, -1, 0, -1); // -1 in the even lanes
    const __m128i vko = _mm_set_epi32(-1, 0, -1, 0); // -1 in the odd lanes
    __m128i vresult = _mm_set1_epi32(0);
    for (int k = 0; k < 128; k += 4)
    {
        __m128i v1, v2, vmax;
        v1 = _mm_loadu_si128((__m128i *)&p1[k]);
        v2 = _mm_loadu_si128((__m128i *)&p2[k]);
        // (x ^ -1) - (-1) == -x, so these four ops negate the even lanes of v1
        // and the odd lanes of v2, leaving the other lanes unchanged
        v1 = _mm_xor_si128(v1, vke);
        v2 = _mm_xor_si128(v2, vko);
        v1 = _mm_sub_epi32(v1, vke);
        v2 = _mm_sub_epi32(v2, vko);
        // even lanes of the sum hold pplus, odd lanes hold pminus
        vmax = _mm_add_epi32(v1, v2);
        vresult = _mm_max_epi32(vresult, vmax); // running per-lane maximum
    }
    _mm_store_si128((__m128i *)result, vresult);
    return max(max(max(result[0], result[1]), result[2]), result[3]);
}
You probably can get the compiler to use SSE for this. Will it make the code quicker? Probably not. The reason is that there is a lot of memory access compared to computation. The CPU is much faster than the memory, and a trivial implementation of the above will already have the CPU stalling while it waits for data to arrive over the system bus. Making the CPU faster will just increase the amount of waiting it does.
The declaration of nodeL can have an effect on the performance, so it's important to choose an efficient container for your data.
There is a threshold where optimising does have a benefit, and that's when you're doing more computation between memory reads - i.e. the time between memory reads is much greater. The point at which this occurs depends a lot on your hardware.
It can be helpful, however, to optimise the code if you've got non-memory-constrained tasks that can run in parallel so that the CPU is kept busy whilst waiting for the data.
This will be faster. Double dereference of a vector of vectors is expensive. Caching one of the dereferences will help. I know it's not answering the posted question, but I think it will be a more helpful answer.
int getDiff(int indx1, int indx2) {
    int result = 0;
    int pplus, pminus, tmp;
    const vector<int>& nodetemp1 = nodeL[indx1];
    const vector<int>& nodetemp2 = nodeL[indx2];
    for (int k = 0; k < 128; k += 2) {
        pplus = nodetemp2[k] - nodetemp1[k];
        pminus = nodetemp1[k + 1] - nodetemp2[k + 1];
        tmp = max(pplus, pminus);
        if (tmp > result) {
            result = tmp;
        }
    }
    return result;
}
A couple of things to look at. One is the amount of data you are passing around. That will cause a bigger issue than the trivial calculation.
I've tried to rewrite it using SSE instructions (AVX), using the library here.
The original code on my system ran in 11.5s
With Neil Kirk's optimisation, it went down to 10.5s
EDIT: Tested the code with a debugger rather than in my head!
int getDiff(std::vector<std::vector<int>>& nodeL, int row1, int row2) {
    Vec4i result(0);
    const std::vector<int>& nodetemp1 = nodeL[row1];
    const std::vector<int>& nodetemp2 = nodeL[row2];
    Vec8i mask(-1,0,-1,0,-1,0,-1,0);
    for (int k = 0; k < 128; k += 8) {
        Vec8i nodeA(nodetemp1[k],nodetemp1[k+1],nodetemp1[k+2],nodetemp1[k+3],nodetemp1[k+4],nodetemp1[k+5],nodetemp1[k+6],nodetemp1[k+7]);
        Vec8i nodeB(nodetemp2[k],nodetemp2[k+1],nodetemp2[k+2],nodetemp2[k+3],nodetemp2[k+4],nodetemp2[k+5],nodetemp2[k+6],nodetemp2[k+7]);
        Vec8i tmp = select(mask, nodeB - nodeA, nodeA - nodeB);
        Vec4i tmp_a(tmp[0], tmp[2], tmp[4], tmp[6]);
        Vec4i tmp_b(tmp[1], tmp[3], tmp[5], tmp[7]);
        Vec4i max_tmp = max(tmp_a, tmp_b);
        result = select(max_tmp > result, max_tmp, result);
    }
    // result holds four per-lane maxima; reduce them with max, not a horizontal add
    int r[4];
    result.store(r);
    return std::max(std::max(r[0], r[1]), std::max(r[2], r[3]));
}
The lack of branching speeds it up to 9.5s, but the data is still the biggest impact.
If you want to speed it up more, try changing the data structure to a single array/vector rather than a 2D one (à la std::vector), as that will reduce cache pressure; a sketch follows.
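A minimal sketch of that flattened layout, sized as in the question (18M rows of 128 ints); row r simply lives at offset r * 128, so both rows of a comparison stream through the cache predictably:

#include <algorithm>
#include <vector>

std::vector<int> nodeFlat(18000000ull * 128); // replaces vector<vector<int>>

inline const int *row(int idx) { return &nodeFlat[(size_t)idx * 128]; }

int getDiffFlat(int indx1, int indx2) {
    const int *p1 = row(indx1), *p2 = row(indx2);
    int result = 0;
    for (int k = 0; k < 128; k += 2) {
        int pplus  = p2[k] - p1[k];
        int pminus = p1[k + 1] - p2[k + 1];
        result = std::max(result, std::max(pplus, pminus));
    }
    return result;
}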
EDIT
I thought of something - you could add a custom allocator to ensure you allocate the 2*18M vectors in a contiguous block of memory which allows you to keep the data structure and still go through it quickly. But you'd need to profile it to be sure
EDIT 2: Tested the code with a debugger rather than in my head!
Sorry Alex, this should be better. Not sure it will be faster than what the compiler can do. I still maintain that it's memory access that's the issue, so I would still try the single array approach. Give this a go though.

How to optimize simple gaussian filter for performance?

I am trying to write an Android app which needs to calculate Gaussian and Laplacian pyramids for multiple full-resolution images. I wrote it in C++ with the NDK. The most critical part of the code is applying the Gaussian filter to the images, which I do horizontally and vertically.
The filter is (0.0625, 0.25, 0.375, 0.25, 0.0625).
Since I am working on integers, I calculate (1, 4, 6, 4, 1)/16:
dst[index] = ( src[index-2] + src[index-1]*4 + src[index]*6 + src[index+1]*4 + src[index+2])/16;
I have made a few simple optimizations, however it is still working slower than expected, and I was wondering if there are any other optimization options I am missing.
PS: I should mention that I tried to write this filter part with inline ARM assembly, however it gave 2x slower results.
//horizontal filter
for (unsigned y = 0; y < height; y++) {
    for (unsigned x = 2; x < width-2; x++) {
        int index = y*width + x;
        dst[index].r = (src[index-2].r + src[index+2].r + (src[index-1].r + src[index+1].r)*4 + src[index].r*6) >> 4;
        dst[index].g = (src[index-2].g + src[index+2].g + (src[index-1].g + src[index+1].g)*4 + src[index].g*6) >> 4;
        dst[index].b = (src[index-2].b + src[index+2].b + (src[index-1].b + src[index+1].b)*4 + src[index].b*6) >> 4;
    }
}

//vertical filter
for (unsigned y = 2; y < height-2; y++) {
    for (unsigned x = 0; x < width; x++) {
        int index = y*width + x;
        dst[index].r = (src[index-2*width].r + src[index+2*width].r + (src[index-width].r + src[index+width].r)*4 + src[index].r*6) >> 4;
        dst[index].g = (src[index-2*width].g + src[index+2*width].g + (src[index-width].g + src[index+width].g)*4 + src[index].g*6) >> 4;
        dst[index].b = (src[index-2*width].b + src[index+2*width].b + (src[index-width].b + src[index+width].b)*4 + src[index].b*6) >> 4;
    }
}
The index multiplication can be factored out of the inner loop, since the multiplication only occurs when y changes:
for (unsigned y ...
{
    int index = y * width;
    for (unsigned int x ...
You may gain some speed by loading variables before you use them. This would make the processor load them in the cache:
for (unsigned x = ...
{
    register YOUR_DATA_TYPE a, b, c, d, e;
    a = src[index - 2].r;
    b = src[index - 1].r;
    c = src[index + 0].r; // The " + 0" is to show a pattern.
    d = src[index + 1].r;
    e = src[index + 2].r;
    dest[index].r = (a + e + (b + d) * 4 + c * 6) >> 4;
    // ...
Another trick is to "cache" the values of src so that only one new value is loaded each time, because the value at src[index+2] may be used up to 5 times.
So here is a example of the concepts:
//horizontal filter
for (unsigned y = 0; y < height; y++)
{
    int index = y*width + 2;
    register YOUR_DATA_TYPE a, b, c, d, e;
    a = src[index - 2].r;
    b = src[index - 1].r;
    c = src[index + 0].r; // The " + 0" is to show a pattern.
    d = src[index + 1].r;
    e = src[index + 2].r;
    for (unsigned x = 2; x < width - 2; x++)
    {
        dest[index - 2 + x].r = (a + e + (b + d) * 4 + c * 6) >> 4;
        a = b;
        b = c;
        c = d;
        d = e;
        e = src[index + x + 1].r; // load the single new value for the next pixel
    }
}
I'm not sure how your compiler would optimize all this, but I tend to work in pointers. Assuming your struct is 3 bytes... You can start with pointers in the right places (the edge of the filter for source, and the destination for target), and just move them through using constant array offsets. I've also put in an optional OpenMP directive on the outer loop, as this can also improve things.
#pragma omp parallel for
for (unsigned y = 0; y < height; y++) {
    const int rowindex = y * width;
    unsigned char * dpos = (unsigned char*)&dest[rowindex + 2];
    unsigned char * spos = (unsigned char*)&src[rowindex];
    const unsigned char * const end = (unsigned char*)&src[rowindex + width - 4];
    for ( ; spos != end; spos++, dpos++) {
        // offsets step by 3 bytes = one pixel; walking byte-wise covers
        // the r, g, b channels automatically
        *dpos = (spos[0] + spos[12] + ((spos[3] + spos[9]) << 2) + spos[6] * 6) >> 4;
    }
}
Similarly for the vertical loop.
const int scanwidth = width * 3;
const int row1 = scanwidth;
const int row2 = row1 + scanwidth;
const int row3 = row2 + scanwidth;
const int row4 = row3 + scanwidth;
#pragma omp parallel for
for (unsigned y = 2; y < height - 2; y++) {
    const int rowindex = y * width;
    unsigned char * dpos = (unsigned char*)&dest[rowindex];
    unsigned char * spos = (unsigned char*)&src[rowindex] - row2; // two rows up, in bytes
    const unsigned char * const end = spos + scanwidth;
    for ( ; spos != end; spos++, dpos++) {
        *dpos = (spos[0] + spos[row4] + ((spos[row1] + spos[row3]) << 2) + spos[row2] * 6) >> 4;
    }
}
This is how I do convolutions, anyway. It sacrifices readability a little, and I've never tried measuring the difference. I just tend to write them that way from the outset. See if that gives you a speed-up. The OpenMP definitely will if you have a multicore machine, and the pointer stuff might.
I like the comment about using SSE for these operations.
Some of the more obvious optimizations are exploiting the symmetry of the kernel:
a=*src++; b=*src++; c=*src++; d=*src++; e=*src++; // init
LOOP (n/5) times:
    z=(a+e)+((b+d)<<2)+c*6; *dst++=z>>4; // then reuse the local variables
    a=*src++;
    z=(b+a)+((c+e)<<2)+d*6; *dst++=z>>4; // registers have been read only once...
    b=*src++;
    z=(c+b)+((d+a)<<2)+e*6; *dst++=z>>4;
    c=*src++;
    // ... and so on, until all five variables have rotated through
The second thing is that one can perform multiple additions using a single integer. When the values to be filtered are unsigned, one can fit two channels in a single 32-bit integer (or 4 channels in a 64-bit integer); it's the poor man's SIMD.
a =   0x[0011][0034]   <-- two values packed in one integer
b =   0x[0031][008a]
---------------------
a+b   0x[0042][00be]   <-- one 32-bit addition sums both fields
>>4   0x[0004][200b]   <-- the shift leaks bits from the high field
mask  0x[00ff][00ff]   <-- so mask them off
---------------------
      0x[0004][000b]   <-- result: both fields divided by 16
(The simulated SIMD here shows one addition followed by a shift by 4.)
Here's a kernel that calculates 3 rgb operations in parallel (easy to modify for 6 rgb operations in 64-bit architectures...)
#define MASK (255+(255<<10)+(255<<20))
#define KERNEL(a,b,c,d,e) { \
    a=((a+e+(c<<1))>>2) & MASK; a=(a+b+c+d)>>2 & MASK; *DATA++ = a; a=DATA[4]; }

void calc_5_rgbs(unsigned int *DATA)
{
    register unsigned int a = DATA[0], b = DATA[1], c = DATA[2], d = DATA[3], e = DATA[4];
    KERNEL(a,b,c,d,e);
    KERNEL(b,c,d,e,a);
    KERNEL(c,d,e,a,b);
    KERNEL(d,e,a,b,c);
    KERNEL(e,a,b,c,d);
}
Works best on ARM and on 64-bit IA with 16 registers... It needs heavy assembler optimization to overcome the register shortage on 32-bit IA (e.g. using ebp as a GPR), and just because of that it's an in-place algorithm.
There are just 2 guard bits between every 8 bits of data, which is just enough to get exactly the same result as with plain integer calculation.
And BTW: it's faster to just run through the array byte by byte than by r,g,b elements:
unsigned char *s = (unsigned char *) source_array;
unsigned char *d = (unsigned char *) dest_array;
for (j = 0; j < 3*N - 12; j++)
    d[j + 6] = (s[j] + s[j + 12] + (s[j + 3] + s[j + 9])*4 + s[j + 6]*6) >> 4; // offsets step by 3 bytes = one pixel