Related
Given two integers X and Y, whats the most efficient way of converting them into X.Y float value in C++?
E.g.
X = 3, Y = 1415 -> 3.1415
X = 2, Y = 12 -> 2.12
Here are some cocktail-napkin benchmark results, on my machine, for all solutions converting two ints to a float, as of the time of writing.
Caveat: I've now added a solution of my own, which seems to do well, and am therefore biased! Please double-check my results.
Test
Iterations
ns / iteration
#aliberro's conversion v2
79,113,375
13
#3Dave's conversion
84,091,005
12
#einpoklum's conversion
1,966,008,981
0
#Ripi2's conversion
47,374,058
21
#TarekDakhran's conversion
1,960,763,847
0
CPU: Quad Core Intel Core i5-7600K speed/min/max: 4000/800/4200 MHz
Devuan GNU/Linux 3
Kernel: 5.2.0-3-amd64 x86_64
GCC 9.2.1, with flags: -O3 -march=native -mtune=native
Benchmark code (Github Gist).
float sum = x + y / pow(10,floor(log10(y)+1));
log10 returns log (base 10) of its argument. For 1234, that'll be 3 point something.
Breaking this down:
log10(1234) = 3.091315159697223
floor(log10(1234)+1) = 4
pow(10,4) = 10000.0
3 + 1234 / 10000.0 = 3.1234.
But, as #einpoklum pointed out, log(0) is NaN, so you have to check for that.
#include <iostream>
#include <cmath>
#include <vector>
using namespace std;
float foo(int x, unsigned int y)
{
if (0==y)
return x;
float den = pow(10,-1 * floor(log10(y)+1));
return x + y * den;
}
int main()
{
vector<vector<int>> tests
{
{3,1234},
{1,1000},
{2,12},
{0,0},
{9,1}
};
for(auto& test: tests)
{
cout << "Test: " << test[0] << "," << test[1] << ": " << foo(test[0],test[1]) << endl;
}
return 0;
}
See runnable version at:
https://onlinegdb.com/rkaYiDcPI
With test output:
Test: 3,1234: 3.1234
Test: 1,1000: 1.1
Test: 2,12: 2.12
Test: 0,0: 0
Test: 9,1: 9.1
Edit
Small modification to remove division operation.
(reworked solution)
Initially, my thoughts were improving on the performance of power-of-10 and division-by-power-of-10 by writing specialized versions of these functions, for integers. Then there was #TarekDakhran's comment about doing the same for counting the number of digits. And then I realized: That's essentially doing the same thing twice... so let's just integrate everything. This will, specifically, allow us to completely avoid any divisions or inversions at runtime:
inline float convert(int x, int y) {
float fy (y);
if (y == 0) { return float(x); }
if (y >= 1e9) { return float(x + fy * 1e-10f); }
if (y >= 1e8) { return float(x + fy * 1e-9f); }
if (y >= 1e7) { return float(x + fy * 1e-8f); }
if (y >= 1e6) { return float(x + fy * 1e-7f); }
if (y >= 1e5) { return float(x + fy * 1e-6f); }
if (y >= 1e4) { return float(x + fy * 1e-5f); }
if (y >= 1e3) { return float(x + fy * 1e-4f); }
if (y >= 1e2) { return float(x + fy * 1e-3f); }
if (y >= 1e1) { return float(x + fy * 1e-2f); }
return float(x + fy * 1e-1f);
}
Additional notes:
This will work for y == 0; but - not for negative x or y values. Adapting it for negative value is pretty easy and not very expensive though.
Not sure if this is absolutely optimal. Perhaps a binary-search for the number of digits of y would work better?
A loop would make the code look nicer; but the compiler would need to unroll it. Would it unroll the loop and compute all those floats beforehand? I'm not sure.
I put some effort into optimizing my previous answer and ended up with this.
inline uint32_t digits_10(uint32_t x) {
return 1u
+ (x >= 10u)
+ (x >= 100u)
+ (x >= 1000u)
+ (x >= 10000u)
+ (x >= 100000u)
+ (x >= 1000000u)
+ (x >= 10000000u)
+ (x >= 100000000u)
+ (x >= 1000000000u)
;
}
inline uint64_t pow_10(uint32_t exp) {
uint64_t res = 1;
while(exp--) {
res *= 10u;
}
return res;
}
inline double fast_zip(uint32_t x, uint32_t y) {
return x + static_cast<double>(y) / pow_10(digits_10(y));
}
double IntsToDbl(int ipart, int decpart)
{
//The decimal part:
double dp = (double) decpart;
while (dp > 1)
{
dp /= 10;
}
//Joint boths parts
return ipart + dp;
}
Simple and very fast solution is converting both values x and y to string, then concatenate them, then casting the result into a floating number as following:
#include <string>
#include <iostream>
std::string x_string = std::to_string(x);
std::string y_string = std::to_string(y);
std::cout << x_string +"."+ y_string ; // the result, cast it to float if needed
(Answer based on the fact that OP has not indicated what they want to use the float for.)
The fastest (most efficient) way is to do it implicitly, but not actually do anything (after compiler optimizations).
That is, write a "pseudo-float" class, whose members are integers of x and y's types before and after the decimal point; and have operators for doing whatever it is you were going to do with the float: operator+, operator*, operator/, operator- and maybe even implementations of pow(), log2(), log10() and so on.
Unless what you were planning to do is literally save a 4-byte float somewhere for later use, it would almost certainly be faster if you had the next operand you need to work with then to really create a float from just x and y, already losing precision and wasting time.
Try this
#include <iostream>
#include <math.h>
using namespace std;
float int2Float(int integer,int decimal)
{
float sign = integer/abs(integer);
float tm = abs(integer), tm2 = abs(decimal);
int base = decimal == 0 ? -1 : log10(decimal);
tm2/=pow(10,base+1);
return (tm+tm2)*sign;
}
int main()
{
int x,y;
cin >>x >>y;
cout << int2Float(x,y);
return 0;
}
version 2, try this out
#include <iostream>
#include <cmath>
using namespace std;
float getPlaces(int x)
{
unsigned char p=0;
while(x!=0)
{
x/=10;
p++;
}
float pow10[] = {1.0f,10.0f,100.0f,1000.0f,10000.0f,100000.0f};//don't need more
return pow10[p];
}
float int2Float(int x,int y)
{
if(y == 0) return x;
float sign = x != 0 ? x/abs(x) : 1;
float tm = abs(x), tm2 = abs(y);
tm2/=getPlaces(y);
return (tm+tm2)*sign;
}
int main()
{
int x,y;
cin >>x >>y;
cout << int2Float(x,y);
return 0;
}
If you want something that is simple to read and follow, you could try something like this:
float convertToDecimal(int x)
{
float y = (float) x;
while( y > 1 ){
y = y / 10;
}
return y;
}
float convertToDecimal(int x, int y)
{
return (float) x + convertToDecimal(y);
}
This simply reduces one integer to the first floating point less than 1 and adds it to the other one.
This does become a problem if you ever want to use a number like 1.0012 to be represented as 2 integers. But that isn't part of the question. To solve it, I would use a third integer representation to be the negative power of 10 for multiplying the second number. IE 1.0012 would be 1, 12, 4. This would then be coded as follows:
float convertToDecimal(int num, int e)
{
return ((float) num) / pow(10, e);
}
float convertToDecimal(int x, int y, int e)
{
return = (float) x + convertToDecimal(y, e);
}
It a little more concise with this answer, but it doesn't help to answer your question. It might help show a problem with using only 2 integers if you stick with that data model.
When I declare an array to store the Y values of each coordinate, define its values then use each of the element values to send into a rounding function, i obtain the error 'Run-Time Check Failure #2 - Stack around the variable 'Yarray; was corrupted. The output is mostly what is expected although i'm wondering why this is happening and if i can mitigate it, cheers.
void EquationElement::getPolynomial(int * values)
{
//Takes in coefficients to calculate Y values for a polynomial//
double size = 40;
double step = 1;
int Yarray[40];
int third = *values;
int second = *(values + 1);
int first = *(values + 2);
int constant = *(values + 3);
double x, Yvalue;
for (int i = 0; i < size + size + 1; ++i) {
x = (i - (size));
x = x * step;
double Y = (third *(x*x*x)) + (second *(x*x)) + (first * (x))
Yvalue = Y / step;
Yarray[i] = int(round(Yvalue)); //<-MAIN ISSUE HERE?//
cout << Yarray[i] << endl;
}
}
double EquationElement::round(double number)
{
return number < 0.0 ? ceil(number - 0.5) : floor(number + 0.5);
// if n<0 then ceil(n-0.5) else if >0 floor(n+0.5) ceil to round up floor to round down
}
// values could be null, you should check that
// if instead of int* values, you took std::vector<int>& values
// You know besides the values, the quantity of them
void EquationElement::getPolynomial(const int* values)
{
//Takes in coefficients to calculate Y values for a polynomial//
static const int size = 40; // No reason for size to be double
static const int step = 1; // No reason for step to be double
int Yarray[2*size+1]{}; // 40 will not do {} makes them initialized to zero with C++11 onwards
int third = values[0];
int second = values[1]; // avoid pointer arithmetic
int first = values[2]; // [] will work with std::vector and is clearer
int constant = values[3]; // Values should point at least to 4 numbers; responsability goes to caller
for (int i = 0; i < 2*size + 1; ++i) {
double x = (i - (size)) * step; // x goes from -40 to 40
double Y = (third *(x*x*x)) + (second *(x*x)) + (first * (x)) + constant;
// Seems unnatural that x^1 is values and x^3 is values+2, being constant at values+3
double Yvalue= Y / step; // as x and Yvalue will not be used outside the loop, no need to declare them there
Yarray[i] = int(round(Yvalue)); //<-MAIN ISSUE HERE?//
// Yep, big issue, i goes from 0 to size*2; you need size+size+1 elements
cout << Yarray[i] << endl;
}
}
Instead of
void EquationElement::getPolynomial(const int* values)
You could also declare
void EquationElement::getPolynomial(const int (&values)[4])
Which means that now you need to call it with a pointer to 4 elements; no more and no less.
Also, with std::vector:
void EquationElement::getPolynomial(const std::vector<int>& values)
{
//Takes in coefficients to calculate Y values for a polynomial//
static const int size = 40; // No reason for size to be double
static const int step = 1; // No reason for step to be double
std::vector<int> Yarray;
Yarray.reserve(2*size+1); // This is just optimization. Yarran *Can* grow above this limit.
int third = values[0];
int second = values[1]; // avoid pointer arithmetic
int first = values[2]; // [] will work with std::vector and is clearer
int constant = values[3]; // Values should point at least to 4 numbers; responsability goes to caller
for (int i = 0; i < 2*size + 1; ++i) {
double x = (i - (size)) * step; // x goes from -40 to 40
double Y = (third *(x*x*x)) + (second *(x*x)) + (first * (x)) + constant;
// Seems unnatural that x^1 is values and x^3 is values+2, being constant at values+3
double Yvalue= Y / step; // as x and Yvalue will not be used outside the loop, no need to declare them there
Yarray.push_back(int(round(Yvalue)));
cout << Yarray.back() << endl;
}
}
I have a 20 x 20 array and need to iterate over it by reading a 4 x 4 array. I thought I could do this with pointed assignment, but it does not do much except force close
const char SOURCE[20][20];
const char **pointer;
for(int x = 0; x < 20; x+=4)
{
for(int y = 0; y < 20; y+=4)
{
pointer = (const char **)&SOURCE[x][y];
printGrid(pointer);
}
}
void printGrid(const char **grid)
{
// do something usefull
}
Just casting a pointer to a different type doesn't change the
type of what it points to (and will usually lead to undefined
behavior, unless you really know what you're doing). If you
cannot change printGrid, you'll have to create an array of
pointers on the fly:
for ( int x = 0; x < 20; x += 4 ) {
for ( int y = 0; y < 20; y += 4 ) {
char const* p4[4] =
{
source[x] + y,
source[x + 1] + y,
source[x + 2] + y,
source[x + 3] + y
};
printGrid( p4 );
}
}
A pointer to a pointer is not the same as an array of arrays.
You can however use a pointer to an array instead:
const char (*pointer)[20];
You of course need to update the printGrid function to match the type.
As for the reason why a pointer-to-pointer and an array-of-array (also often called a matrix) see e.g. this old answer of mine that shows the memory layout of the two.
Your 2D-array is of type char:
const char SOURCE[20][20];
When you are iterating through it, you can either look at the char or reference the address with a char*:
for(int x = 0; x < 20; x+=4)
{
for(int y = 0; y < 20; y+=4)
{
printGrid(SOURCE[x][y]); // do this unless you need to do something with pointer
}
}
Then you can make printGrid with either of the following signatures:
void printGrid(const char& grid)
{
// do something usefull
}
or
void printGrid(const char* grid)
{
// do something usefull
}
Extending James's Answer, you may change your code as below as it sees that the it passes pointer to an array of 4 char rather than just array of char.
for(int x = 0; x < 20; x+=4)
{
for(int y = 0; y < 20; y+=4)
{
char const (*p4[4])[4] =
{
(const char(*)[4])(SOURCE[x] + y),
(const char(*)[4])(SOURCE[x + 1] + y),
(const char(*)[4])(SOURCE[x + 2] + y),
(const char(*)[4])(SOURCE[x + 3] + y)
};
}
}
I am trying to write an android app which needs to calculate gaussian and laplacian pyramids for multiple full resolution images, i wrote this it on C++ with NDK, the most critical part of the code is applying gaussian filter to images abd i am applying this filter with horizontally and vertically.
The filter is (0.0625, 0.25, 0.375, 0.25, 0.0625)
Since i am working on integers i am calculating (1, 4, 6, 4, 1)/16
dst[index] = ( src[index-2] + src[index-1]*4 + src[index]*6+src[index+1]*4+src[index+2])/16;
I have made a few simple optimization however it still is working slow than expected and i was wondering if there are any other optimization options that i am missing.
PS: I should mention that i have tried to write this filter part with inline arm assembly however it give 2x slower results.
//horizontal filter
for(unsigned y = 0; y < height; y++) {
for(unsigned x = 2; x < width-2; x++) {
int index = y*width+x;
dst[index].r = (src[index-2].r+ src[index+2].r + (src[index-1].r + src[index+1].r)*4 + src[index].r*6)>>4;
dst[index].g = (src[index-2].g+ src[index+2].g + (src[index-1].g + src[index+1].g)*4 + src[index].g*6)>>4;
dst[index].b = (src[index-2].b+ src[index+2].b + (src[index-1].b + src[index+1].b)*4 + src[index].b*6)>>4;
}
}
//vertical filter
for(unsigned y = 2; y < height-2; y++) {
for(unsigned x = 0; x < width; x++) {
int index = y*width+x;
dst[index].r = (src[index-2*width].r + src[index+2*width].r + (src[index-width].r + src[index+width].r)*4 + src[index].r*6)>>4;
dst[index].g = (src[index-2*width].g + src[index+2*width].g + (src[index-width].g + src[index+width].g)*4 + src[index].g*6)>>4;
dst[index].b = (src[index-2*width].b + src[index+2*width].b + (src[index-width].b + src[index+width].b)*4 + src[index].b*6)>>4;
}
}
The index multiplication can be factored out of the inner loop since the mulitplicatation only occurs when y is changed:
for (unsigned y ...
{
int index = y * width;
for (unsigned int x...
You may gain some speed by loading variables before you use them. This would make the processor load them in the cache:
for (unsigned x = ...
{
register YOUR_DATA_TYPE a, b, c, d, e;
a = src[index - 2].r;
b = src[index - 1].r;
c = src[index + 0].r; // The " + 0" is to show a pattern.
d = src[index + 1].r;
e = src[index + 2].r;
dest[index].r = (a + e + (b + d) * 4 + c * 6) >> 4;
// ...
Another trick would be to "cache" the values of the src so that only a new one is added each time because the value in src[index+2] may be used up to 5 times.
So here is a example of the concepts:
//horizontal filter
for(unsigned y = 0; y < height; y++)
{
int index = y*width + 2;
register YOUR_DATA_TYPE a, b, c, d, e;
a = src[index - 2].r;
b = src[index - 1].r;
c = src[index + 0].r; // The " + 0" is to show a pattern.
d = src[index + 1].r;
e = src[index + 2].r;
for(unsigned x = 2; x < width-2; x++)
{
dest[index - 2 + x].r = (a + e + (b + d) * 4 + c * 6) >> 4;
a = b;
b = c;
c = d;
d = e;
e = src[index + x].r;
I'm not sure how your compiler would optimize all this, but I tend to work in pointers. Assuming your struct is 3 bytes... You can start with pointers in the right places (the edge of the filter for source, and the destination for target), and just move them through using constant array offsets. I've also put in an optional OpenMP directive on the outer loop, as this can also improve things.
#pragma omp parallel for
for(unsigned y = 0; y < height; y++) {
const int rowindex = y * width;
char * dpos = (char*)&dest[rowindex+2];
char * spos = (char*)&src[rowindex];
const char *end = (char*)&src[rowindex+width-2];
for( ; spos != end; spos++, dpos++) {
*dpos = (spos[0] + spos[4] + ((spos[1] + src[3])<<2) + spos[2]*6) >> 4;
}
}
Similarly for the vertical loop.
const int scanwidth = width * 3;
const int row1 = scanwidth;
const int row2 = row1+scanwidth;
const int row3 = row2+scanwidth;
const int row4 = row3+scanwidth;
#pragma omp parallel for
for(unsigned y = 2; y < height-2; y++) {
const int rowindex = y * width;
char * dpos = (char*)&dest[rowindex];
char * spos = (char*)&src[rowindex-row2];
const char *end = spos + scanwidth;
for( ; spos != end; spos++, dpos++) {
*dpos = (spos[0] + spos[row4] + ((spos[row1] + src[row3])<<2) + spos[row2]*6) >> 4;
}
}
This is how I do convolutions, anyway. It sacrifices readability a little, and I've never tried measuring the difference. I just tend to write them that way from the outset. See if that gives you a speed-up. The OpenMP definitely will if you have a multicore machine, and the pointer stuff might.
I like the comment about using SSE for these operations.
Some of the more obvious optimizations are exploiting the symmetry of the kernel:
a=*src++; b=*src++; c=*src++; d=*src++; e=*src++; // init
LOOP (n/5) times:
z=(a+e)+(b+d)<<2+c*6; *dst++=z>>4; // then reuse the local variables
a=*src++;
z=(b+a)+(c+e)<<2+d*6; *dst++=z>>4; // registers have been read only once...
b=*src++;
z=(c+b)+(d+a)<<2+e*6; *dst++=z>>4;
e=*src++;
The second thing is that one can perform multiple additions using a single integer. When the values to be filtered are unsigned, one can fit two channels in a single 32-bit integer (or 4 channels in a 64-bit integer); it's the poor mans SIMD.
a= 0x[0011][0034] <-- split to two
b= 0x[0031][008a]
----------------------
sum 0042 00b0
>>4 0004 200b0 <-- mask off
mask 00ff 00ff
-------------------
0004 000b <-- result
(The Simulated SIMD shows one addition followed by a shift by 4)
Here's a kernel that calculates 3 rgb operations in parallel (easy to modify for 6 rgb operations in 64-bit architectures...)
#define MASK (255+(255<<10)+(255<<20))
#define KERNEL(a,b,c,d,e) { \
a=((a+e+(c<<1))>>2) & MASK; a=(a+b+c+d)>>2 & MASK; *DATA++ = a; a=DATA[4]; }
void calc_5_rgbs(unsigned int *DATA)
{
register unsigned int a = DATA[0], b=DATA[1], c=DATA[2], d=DATA[3], e=DATA[4];
KERNEL(a,b,c,d,e);
KERNEL(b,c,d,e,a);
KERNEL(c,d,e,a,b);
KERNEL(d,e,a,b,c);
KERNEL(e,a,b,c,d);
}
Works best on ARM and on 64-bit IA with 16 registers... Needs heavy assembler optimizations to overcome register shortage in 32-bit IA (e.g. use ebp as GPR). And just because of that it's an inplace algorithm...
There are just 2 guardian bits between every 8 bits of data, which is just enough to get exactly the same result as in integer calculation.
And BTW: it's faster to just run through the array byte per byte than by r,g,b elements
unsigned char *s=(unsigned char *) source_array;
unsigned char *d=(unsigned char *) dest_array;
for (j=0;j<3*N;j++) d[j]=(s[j]+s[j+16]+s[j+8]*6+s[j+4]*4+s[j+12]*4)>>4;
I'm writing a sparse matrix solver using the Gauss-Seidel method. By profiling, I've determined that about half of my program's time is spent inside the solver. The performance-critical part is as follows:
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
All arrays involved are of float type. Actually, they are not arrays but objects with an overloaded [] operator, which (I think) should be optimized away, but is defined as follows:
inline float &operator[](size_t i) { return d_cells[i]; }
inline float const &operator[](size_t i) const { return d_cells[i]; }
For d_nx = d_ny = 128, this can be run about 3500 times per second on an Intel i7 920. This means that the inner loop body runs 3500 * 128 * 128 = 57 million times per second. Since only some simple arithmetic is involved, that strikes me as a low number for a 2.66 GHz processor.
Maybe it's not limited by CPU power, but by memory bandwidth? Well, one 128 * 128 float array eats 65 kB, so all 6 arrays should easily fit into the CPU's L3 cache (which is 8 MB). Assuming that nothing is cached in registers, I count 15 memory accesses in the inner loop body. On a 64-bits system this is 120 bytes per iteration, so 57 million * 120 bytes = 6.8 GB/s. The L3 cache runs at 2.66 GHz, so it's the same order of magnitude. My guess is that memory is indeed the bottleneck.
To speed this up, I've attempted the following:
Compile with g++ -O3. (Well, I'd been doing this from the beginning.)
Parallelizing over 4 cores using OpenMP pragmas. I have to change to the Jacobi algorithm to avoid reads from and writes to the same array. This requires that I do twice as many iterations, leading to a net result of about the same speed.
Fiddling with implementation details of the loop body, such as using pointers instead of indices. No effect.
What's the best approach to speed this guy up? Would it help to rewrite the inner body in assembly (I'd have to learn that first)? Should I run this on the GPU instead (which I know how to do, but it's such a hassle)? Any other bright ideas?
(N.B. I do take "no" for an answer, as in: "it can't be done significantly faster, because...")
Update: as requested, here's a full program:
#include <iostream>
#include <cstdlib>
#include <cstring>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;
void step() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void solve(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step();
}
}
void clear(float *a) {
memset(a, 0, d_nx * d_ny * sizeof(float));
}
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d_b = new float[n]; clear(d_b);
d_w = new float[n]; clear(d_w);
d_e = new float[n]; clear(d_e);
d_s = new float[n]; clear(d_s);
d_n = new float[n]; clear(d_n);
solve(atoi(argv[1]));
cout << d_x[0] << endl; // prevent the thing from being optimized away
}
I compile and run it as follows:
$ g++ -o gstest -O3 gstest.cpp
$ time ./gstest 8000
0
real 0m1.052s
user 0m1.050s
sys 0m0.010s
(It does 8000 instead of 3500 iterations per second because my "real" program does a lot of other stuff too. But it's representative.)
Update 2: I've been told that unititialized values may not be representative because NaN and Inf values may slow things down. Now clearing the memory in the example code. It makes no difference for me in execution speed, though.
Couple of ideas:
Use SIMD. You could load 4 floats at a time from each array into a SIMD register (e.g. SSE on Intel, VMX on PowerPC). The disadvantage of this is that some of the d_x values will be "stale" so your convergence rate will suffer (but not as bad as a jacobi iteration); it's hard to say whether the speedup offsets it.
Use SOR. It's simple, doesn't add much computation, and can improve your convergence rate quite well, even for a relatively conservative relaxation value (say 1.5).
Use conjugate gradient. If this is for the projection step of a fluid simulation (i.e. enforcing non-compressability), you should be able to apply CG and get a much better convergence rate. A good preconditioner helps even more.
Use a specialized solver. If the linear system arises from the Poisson equation, you can do even better than conjugate gradient using an FFT-based methods.
If you can explain more about what the system you're trying to solve looks like, I can probably give some more advice on #3 and #4.
I think I've managed to optimize it, here's a code, create a new project in VC++, add this code and simply compile under "Release".
#include <iostream>
#include <cstdlib>
#include <cstring>
#define _WIN32_WINNT 0x0400
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <conio.h>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;
void step_original() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void step_new() {
//size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
float
*d_b_ic,
*d_w_ic,
*d_e_ic,
*d_x_ic,
*d_x_iw,
*d_x_ie,
*d_x_is,
*d_x_in,
*d_n_ic,
*d_s_ic;
d_b_ic = d_b;
d_w_ic = d_w;
d_e_ic = d_e;
d_x_ic = d_x;
d_x_iw = d_x;
d_x_ie = d_x;
d_x_is = d_x;
d_x_in = d_x;
d_n_ic = d_n;
d_s_ic = d_s;
for (size_t y = 1; y < d_ny - 1; ++y)
{
for (size_t x = 1; x < d_nx - 1; ++x)
{
/*d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];*/
*d_x_ic = *d_b_ic
- *d_w_ic * *d_x_iw - *d_e_ic * *d_x_ie
- *d_s_ic * *d_x_is - *d_n_ic * *d_x_in;
//++ic; ++iw; ++ie; ++is; ++in;
d_b_ic++;
d_w_ic++;
d_e_ic++;
d_x_ic++;
d_x_iw++;
d_x_ie++;
d_x_is++;
d_x_in++;
d_n_ic++;
d_s_ic++;
}
//ic += 2; iw += 2; ie += 2; is += 2; in += 2;
d_b_ic += 2;
d_w_ic += 2;
d_e_ic += 2;
d_x_ic += 2;
d_x_iw += 2;
d_x_ie += 2;
d_x_is += 2;
d_x_in += 2;
d_n_ic += 2;
d_s_ic += 2;
}
}
void solve_original(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step_original();
}
}
void solve_new(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step_new();
}
}
void clear(float *a) {
memset(a, 0, d_nx * d_ny * sizeof(float));
}
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d_b = new float[n]; clear(d_b);
d_w = new float[n]; clear(d_w);
d_e = new float[n]; clear(d_e);
d_s = new float[n]; clear(d_s);
d_n = new float[n]; clear(d_n);
if(argc < 3)
printf("app.exe (x)iters (o/n)algo\n");
bool bOriginalStep = (argv[2][0] == 'o');
size_t iters = atoi(argv[1]);
/*printf("Press any key to start!");
_getch();
printf(" Running speed test..\n");*/
__int64 freq, start, end, diff;
if(!::QueryPerformanceFrequency((LARGE_INTEGER*)&freq))
throw "Not supported!";
freq /= 1000000; // microseconds!
{
::QueryPerformanceCounter((LARGE_INTEGER*)&start);
if(bOriginalStep)
solve_original(iters);
else
solve_new(iters);
::QueryPerformanceCounter((LARGE_INTEGER*)&end);
diff = (end - start) / freq;
}
printf("Speed (%s)\t\t: %u\n", (bOriginalStep ? "original" : "new"), diff);
//_getch();
//cout << d_x[0] << endl; // prevent the thing from being optimized away
}
Run it like this:
app.exe 10000 o
app.exe 10000 n
"o" means old code, yours.
"n" is mine, the new one.
My results:
Speed (original):
1515028
1523171
1495988
Speed (new):
966012
984110
1006045
Improvement of about 30%.
The logic behind:
You've been using index counters to access/manipulate.
I use pointers.
While running, breakpoint at a certain calculation code line in VC++'s debugger, and press F8. You'll get the disassembler window.
The you'll see the produced opcodes (assembly code).
Anyway, look:
int *x = ...;
x[3] = 123;
This tells the PC to put the pointer x at a register (say EAX).
The add it (3 * sizeof(int)).
Only then, set the value to 123.
The pointers approach is much better as you can understand, because we cut the adding process, actually we handle it ourselves, thus able to optimize as needed.
I hope this helps.
Sidenote to stackoverflow.com's staff:
Great website, I hope I've heard of it long ago!
For one thing, there seems to be a pipelining issue here. The loop reads from the value in d_x that has just been written to, but apparently it has to wait for that write to complete. Just rearranging the order of the computation, doing something useful while it's waiting, makes it almost twice as fast:
d_x[ic] = d_b[ic]
- d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in]
- d_w[ic] * d_x[iw] /* d_x[iw] has just been written to, process this last */;
It was Eamon Nerbonne who figured this out. Many upvotes to him! I would never have guessed.
Poni's answer looks like the right one to me.
I just want to point out that in this type of problem, you often gain benefits from memory locality. Right now, the b,w,e,s,n arrays are all at separate locations in memory. If you could not fit the problem in L3 cache (mostly in L2), then this would be bad, and a solution of this sort would be helpful:
size_t d_nx = 128, d_ny = 128;
float *d_x;
struct D { float b,w,e,s,n; };
D *d;
void step() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d[ic].b
- d[ic].w * d_x[iw] - d[ic].e * d_x[ie]
- d[ic].s * d_x[is] - d[ic].n * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void solve(size_t iters) { for (size_t i = 0; i < iters; ++i) step(); }
void clear(float *a) { memset(a, 0, d_nx * d_ny * sizeof(float)); }
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d = new D[n]; memset(d,0,n * sizeof(D));
solve(atoi(argv[1]));
cout << d_x[0] << endl; // prevent the thing from being optimized away
}
For example, this solution at 1280x1280 is a little less than 2x faster than Poni's solution (13s vs 23s in my test--your original implementation is then 22s), while at 128x128 it's 30% slower (7s vs. 10s--your original is 10s).
(Iterations were scaled up to 80000 for the base case, and 800 for the 100x larger case of 1280x1280.)
I think you're right about memory being a bottleneck. It's a pretty simple loop with just some simple arithmetic per iteration. the ic, iw, ie, is, and in indices seem to be on opposite sides of the matrix so i'm guessing that there's a bunch of cache misses there.
I'm no expert on the subject, but I've seen that there are several academic papers on improving the cache usage of the Gauss-Seidel method.
Another possible optimization is the use of the red-black variant, where points are updated in two sweeps in a chessboard-like pattern. In this way, all updates in a sweep are independent and can be parallelized.
I suggest putting in some prefetch statements and also researching "data oriented design":
void step_original() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
float dw_ic, dx_ic, db_ic, de_ic, dn_ic, ds_ic;
float dx_iw, dx_is, dx_ie, dx_in, de_ic, db_ic;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
// Perform the prefetch
// Sorting these statements by array may increase speed;
// although sorting by index name may increase speed too.
db_ic = d_b[ic];
dw_ic = d_w[ic];
dx_iw = d_x[iw];
de_ic = d_e[ic];
dx_ie = d_x[ie];
ds_ic = d_s[ic];
dx_is = d_x[is];
dn_ic = d_n[ic];
dx_in = d_x[in];
// Calculate
d_x[ic] = db_ic
- dw_ic * dx_iw - de_ic * dx_ie
- ds_ic * dx_is - dn_ic * dx_in;
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
This differs from your second method since the values are copied to local temporary variables before the calculation is performed.