Optimizing C++ code for performance - c++

Can you think of some way to optimize this piece of code? It's meant to execute on an ARMv7 processor (iPhone 3GS):
 4.0%  inline float BoxIntegral(IplImage *img, int row, int col, int rows, int cols)
       {
 0.7%      float *data = (float *) img->imageData;
 1.4%      int step = img->widthStep/sizeof(float);
           // The subtraction by one for row/col is because row/col is inclusive.
 1.1%      int r1 = std::min(row, img->height) - 1;
 1.0%      int c1 = std::min(col, img->width) - 1;
 2.7%      int r2 = std::min(row + rows, img->height) - 1;
 3.7%      int c2 = std::min(col + cols, img->width) - 1;
           float A(0.0f), B(0.0f), C(0.0f), D(0.0f);
 8.5%      if (r1 >= 0 && c1 >= 0) A = data[r1 * step + c1];
11.7%      if (r1 >= 0 && c2 >= 0) B = data[r1 * step + c2];
 7.6%      if (r2 >= 0 && c1 >= 0) C = data[r2 * step + c1];
 9.2%      if (r2 >= 0 && c2 >= 0) D = data[r2 * step + c2];
21.9%      return std::max(0.f, A - B - C + D);
 3.8%  }
All this code is taken from the OpenSURF library. Here's the context of the function (some people were asking for the context):
//! Calculate DoH responses for supplied layer
void FastHessian::buildResponseLayer(ResponseLayer *rl)
{
    float *responses = rl->responses;         // response storage
    unsigned char *laplacian = rl->laplacian; // laplacian sign storage
    int step = rl->step;                      // step size for this filter
    int b = (rl->filter - 1) * 0.5 + 1;       // border for this filter
    int l = rl->filter / 3;                   // lobe for this filter (filter size / 3)
    int w = rl->filter;                       // filter size
    float inverse_area = 1.f/(w*w);           // normalisation factor
    float Dxx, Dyy, Dxy;
    for(int r, c, ar = 0, index = 0; ar < rl->height; ++ar)
    {
        for(int ac = 0; ac < rl->width; ++ac, index++)
        {
            // get the image coordinates
            r = ar * step;
            c = ac * step;

            // Compute response components
            Dxx = BoxIntegral(img, r - l + 1, c - b, 2*l - 1, w)
                  - BoxIntegral(img, r - l + 1, c - l * 0.5, 2*l - 1, l)*3;
            Dyy = BoxIntegral(img, r - b, c - l + 1, w, 2*l - 1)
                  - BoxIntegral(img, r - l * 0.5, c - l + 1, l, 2*l - 1)*3;
            Dxy = + BoxIntegral(img, r - l, c + 1, l, l)
                  + BoxIntegral(img, r + 1, c - l, l, l)
                  - BoxIntegral(img, r - l, c - l, l, l)
                  - BoxIntegral(img, r + 1, c + 1, l, l);

            // Normalise the filter responses with respect to their size
            Dxx *= inverse_area;
            Dyy *= inverse_area;
            Dxy *= inverse_area;

            // Get the determinant of hessian response & laplacian sign
            responses[index] = (Dxx * Dyy - 0.81f * Dxy * Dxy);
            laplacian[index] = (Dxx + Dyy >= 0 ? 1 : 0);

#ifdef RL_DEBUG
            // create list of the image coords for each response
            rl->coords.push_back(std::make_pair<int,int>(r,c));
#endif
        }
    }
}
Some questions:
Is it a good idea that the function is inline?
Would using inline assembly provide a significant speedup?

Specialize for the edges so that you don't need to check for them in every row and column. I assume that this call is in a nested loop and is called a lot. This function would become:
inline float BoxIntegralNonEdge(IplImage *img, int row, int col, int rows, int cols)
{
    float *data = (float *) img->imageData;
    int step = img->widthStep/sizeof(float);

    // The subtraction by one for row/col is because row/col is inclusive.
    int r1 = row - 1;
    int c1 = col - 1;
    int r2 = row + rows - 1;
    int c2 = col + cols - 1;

    float A(data[r1 * step + c1]), B(data[r1 * step + c2]), C(data[r2 * step + c1]), D(data[r2 * step + c2]);
    return std::max(0.f, A - B - C + D);
}
You get rid of a conditional and branch for each min and two conditionals and a branch for each if. You can only call this function if you already meet the conditions -- check that in the caller for the whole row once instead of each pixel.
I wrote up some tips for optimizing image processing when you have to do work on each pixel:
http://www.atalasoft.com/cs/blogs/loufranco/archive/2006/04/28/9985.aspx
Other things from the blog:
1. You are recalculating a position in the image data with 2 multiplies (indexing is multiplication) -- you should be incrementing a pointer.
2. Instead of passing in img, row, rows, col and cols, pass in pointers to the exact pixels to process -- which you get from incrementing pointers, not indexing.
3. If you don't do the above, step is the same for all pixels; calculate it in the caller and pass it in. If you do 1 and 2, you won't need step at all.
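A minimal sketch of the indexing-vs-pointer difference (hypothetical helper names, not from OpenSURF): the indexed version forms row * step + col per pixel, while the pointer version walks the row with one increment per pixel.

```cpp
// Indexed version: the address computation row * step + col is written out
// per pixel, and the compiler has to strength-reduce it itself.
float sumRowIndexed(const float* data, int row, int step, int width)
{
    float sum = 0.0f;
    for (int col = 0; col < width; ++col)
        sum += data[row * step + col];
    return sum;
}

// Pointer version: the caller passes a pointer to the first pixel of the
// row; the loop is one load and one pointer increment per pixel.
float sumRowPointer(const float* rowPtr, int width)
{
    float sum = 0.0f;
    for (const float* end = rowPtr + width; rowPtr != end; ++rowPtr)
        sum += *rowPtr;
    return sum;
}
```

Both compute the same sum; the caller of the pointer version supplies `data + row * step` once per row instead of once per pixel.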

There are a few places to reuse temporary variables, but whether it would improve performance would have to be measured as dirkgently stated:
Change
if (r1 >= 0 && c1 >= 0) A = data[r1 * step + c1];
if (r1 >= 0 && c2 >= 0) B = data[r1 * step + c2];
if (r2 >= 0 && c1 >= 0) C = data[r2 * step + c1];
if (r2 >= 0 && c2 >= 0) D = data[r2 * step + c2];
to
if (r1 >= 0) {
    int r1Step = r1 * step;
    if (c1 >= 0) A = data[r1Step + c1];
    if (c2 >= 0) B = data[r1Step + c2];
}
if (r2 >= 0) {
    int r2Step = r2 * step;
    if (c1 >= 0) C = data[r2Step + c1];
    if (c2 >= 0) D = data[r2Step + c2];
}
You may actually end up doing the temporary multiplications too often if your if conditions rarely evaluate to true.

You aren't interested in four variables A, B, C, D, but only the combination A - B - C + D.
Try
float result(0.0f);
if (r1 >= 0 && c1 >= 0) result += data[r1 * step + c1];
if (r1 >= 0 && c2 >= 0) result -= data[r1 * step + c2];
if (r2 >= 0 && c1 >= 0) result -= data[r2 * step + c1];
if (r2 >= 0 && c2 >= 0) result += data[r2 * step + c2];
if (result > 0.0f) return result;
return 0.0f;

The compiler probably handles inlining automatically where appropriate.
Without any knowledge of the context: is the if (r1 >= 0 && c1 >= 0) check necessary?
Isn't it required that the row and col parameters are > 0?
float BoxIntegral(IplImage *img, int row, int col, int rows, int cols)
{
    assert(row > 0 && col > 0);

    float *data = (float*)img->imageData; // Don't use C-style casts
    int step = img->widthStep/sizeof(float);

    // Is the min check really necessary?
    int r1 = std::min(row, img->height) - 1;
    int c1 = std::min(col, img->width) - 1;
    int r2 = std::min(row + rows, img->height) - 1;
    int c2 = std::min(col + cols, img->width) - 1;

    int r1_step = r1 * step;
    int r2_step = r2 * step;

    float A = data[r1_step + c1];
    float B = data[r1_step + c2];
    float C = data[r2_step + c1];
    float D = data[r2_step + c2];

    return std::max(0.0f, A - B - C + D);
}

Some of the examples say to initialize A, B, C and D directly and skip the initialization with 0, but this is functionally different from your original code in some ways. I would do this, however:
inline float BoxIntegral(IplImage *img, int row, int col, int rows, int cols) {
    const float *data = (float *) img->imageData;
    const int step = img->widthStep/sizeof(float);

    // The subtraction by one for row/col is because row/col is inclusive.
    const int r1 = std::min(row, img->height) - 1;
    const int r2 = std::min(row + rows, img->height) - 1;
    const int c1 = std::min(col, img->width) - 1;
    const int c2 = std::min(col + cols, img->width) - 1;

    const float A = (r1 >= 0 && c1 >= 0) ? data[r1 * step + c1] : 0.0f;
    const float B = (r1 >= 0 && c2 >= 0) ? data[r1 * step + c2] : 0.0f;
    const float C = (r2 >= 0 && c1 >= 0) ? data[r2 * step + c1] : 0.0f;
    const float D = (r2 >= 0 && c2 >= 0) ? data[r2 * step + c2] : 0.0f;

    return std::max(0.f, A - B - C + D);
}
Like your original code, this will make A, B, C and D have a value either from data[] if the condition is true or 0.0f if the condition is false. Also, I would (as I have shown) use const wherever it is appropriate. Many compilers aren't able to improve code much based on const-ness, but it certainly can't hurt to give the compiler more information about the data it is operating on. Finally, I have reordered the r1/r2/c1/c2 variables to encourage reuse of the fetched width and height.
Obviously you would need to profile to determine if any of this is actually an improvement.

I am not sure if your problem lends itself to SIMD but this could potentially allow you to perform multiple operations on your image at once and give you a good performance improvement. I am assuming you are inlining and optimizing because you are performing the operation multiple times. Take a look at:
http://blogs.arm.com/software-enablement/coding-for-neon-part-1-load-and-stores/
http://blogs.arm.com/software-enablement/coding-for-neon-part-2-dealing-with-leftovers/
http://blogs.arm.com/software-enablement/coding-for-neon-part-3-matrix-multiplication/
http://blogs.arm.com/software-enablement/coding-for-neon-part-4-shifting-left-and-right/
Compilers do have some support for NEON if the correct flags are enabled, but you will probably need to roll your own.
Edit
To get compiler support for NEON you will need to use the compiler flag -mfpu=neon
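For GCC-style toolchains, a compile line might look like the following. Only -mfpu=neon comes from the answer above; the other flags are common companions on ARMv7 targets and may vary by toolchain, so treat this as an illustrative config fragment.

```shell
# Illustrative: enable NEON code generation for an ARMv7 target (e.g. the
# iPhone 3GS's Cortex-A8). -mfpu=neon is from the answer above; -march and
# -mfloat-abi are typical companions and depend on your toolchain.
g++ -O3 -march=armv7-a -mfpu=neon -mfloat-abi=softfp -c boxintegral.cpp -o boxintegral.o
```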

Related

Using float type for intermediate variable makes program run slower than int type, why?

I'm currently writing a program for YUV420SP => RGB/BGR color space conversion, following the floating-point formula calculation, without any SIMD or multi-threading optimization.
The function's input data is of unsigned char type, and the final result is also of unsigned char type. For the intermediate variables, the formula itself requires float (the expressions on the right of the =), but for the float => unsigned char conversion there are two choices: one is using float r, g, b, the other is int r, g, b:
unsigned char y = 223; // mock for getting y value
unsigned char u = 200; // mock for getting u value
unsigned char v = 200; // mock for getting v value
unsigned char* rgb0 = (unsigned char*)malloc(MAXN); // for saving the final result

// the YUV => RGB color conversion
float r, g, b; // [!! choice1 !!] if using this line, code runs slower
int r, g, b;   // [!! choice2 !!] if using this line, code runs much faster

y = std::max(16, (int)y_ptr0[0]);
r = 1.164 * (y - 16) + 1.596 * (v - 128);
g = 1.164 * (y - 16) - 0.813 * (v - 128) - 0.391 * (u - 128);
b = 1.164 * (y - 16) + 2.018 * (u - 128);

rgb0[2-b_idx] = saturate_uchar(r);
rgb0[1] = saturate_uchar(g);
rgb0[b_idx] = saturate_uchar(b);
rgb0 += 3;
What confuses me is that in the actual test (converting a 7680x4320 image), the float r, g, b version is much slower than int r, g, b, on both Linux x86 and Android ARMv8 platforms.
The full code for the color conversion is:
#include <limits.h>

inline uchar saturate_uchar(int v)
{
    return (uchar)((unsigned int)v <= UCHAR_MAX ? v : v > 0 ? UCHAR_MAX : 0);
}

inline uchar saturate_uchar(float v)
{
    int iv = round(v);
    return saturate_uchar(iv);
}

template<int u_idx, int b_idx>
void yuv420sp2rgb_naive(
    const uchar* y_plane, int height, int width, int y_linebytes,
    const uchar* uv_plane, int uv_linebytes,
    uchar* rgb, int rgb_linebytes,
    const Option& opt
)
{
    /// param checking
    assert (y_plane!=NULL && uv_plane!=NULL && rgb!=NULL);

    /// neon-specific param checking
    assert (width>=2 && height>=2);

    int w = width;
    int h = height;
    for (int i=0; i <= h-2; i+=2)
    {
        const unsigned char* y_ptr0 = y_plane + i * y_linebytes;
        const unsigned char* y_ptr1 = y_ptr0 + y_linebytes;
        unsigned char* rgb0 = rgb + i * rgb_linebytes;
        unsigned char* rgb1 = rgb0 + rgb_linebytes;
        const unsigned char* uv_ptr = uv_plane + (i/2) * uv_linebytes;
        for (size_t j=0; j <= width-2; j += 2)
        {
            int y;
            float r, g, b; // choice1
            //int r, g, b; // choice2

            // R = 1.164(Y - 16) + 1.596(V - 128)
            // G = 1.164(Y - 16) - 0.813(V - 128) - 0.391(U - 128)
            // B = 1.164(Y - 16) + 2.018(U - 128)
            int u = uv_ptr[u_idx];
            int v = uv_ptr[1 - u_idx];

            // y00
            y = std::max(16, (int)y_ptr0[0]);
            r = 1.164 * (y - 16) + 1.596 * (v - 128);
            g = 1.164 * (y - 16) - 0.813 * (v - 128) - 0.391 * (u - 128);
            b = 1.164 * (y - 16) + 2.018 * (u - 128);
            rgb0[2-b_idx] = saturate_uchar(r);
            rgb0[1] = saturate_uchar(g);
            rgb0[b_idx] = saturate_uchar(b);
            rgb0 += 3;

            // y01
            y = std::max(16, (int)y_ptr0[1]);
            r = 1.164 * (y - 16) + 1.596 * (v - 128);
            g = 1.164 * (y - 16) - 0.813 * (v - 128) - 0.391 * (u - 128);
            b = 1.164 * (y - 16) + 2.018 * (u - 128);
            rgb0[2-b_idx] = saturate_uchar(r);
            rgb0[1] = saturate_uchar(g);
            rgb0[b_idx] = saturate_uchar(b);
            rgb0 += 3;

            // y10
            y = std::max(16, (int)y_ptr1[0]);
            r = 1.164 * (y - 16) + 1.596 * (v - 128);
            g = 1.164 * (y - 16) - 0.813 * (v - 128) - 0.391 * (u - 128);
            b = 1.164 * (y - 16) + 2.018 * (u - 128);
            rgb1[2-b_idx] = saturate_uchar(r);
            rgb1[1] = saturate_uchar(g);
            rgb1[b_idx] = saturate_uchar(b);
            rgb1 += 3;

            // y11
            y = std::max(16, (int)y_ptr1[1]);
            r = 1.164 * (y - 16) + 1.596 * (v - 128);
            g = 1.164 * (y - 16) - 0.813 * (v - 128) - 0.391 * (u - 128);
            b = 1.164 * (y - 16) + 2.018 * (u - 128);
            rgb1[2-b_idx] = saturate_uchar(r);
            rgb1[1] = saturate_uchar(g);
            rgb1[b_idx] = saturate_uchar(b);
            rgb1 += 3;

            y_ptr0 += 2;
            y_ptr1 += 2;
            uv_ptr += 2;
        }
    }
}
platform   | choice        | time cost
-----------|---------------|----------
linux x64  | float r, g, b | 140 ms
linux x64  | int r, g, b   | 107 ms
armv8      | float r, g, b | 152 ms
armv8      | int r, g, b   | 111 ms
Question: why does changing the type of r, g, b from float to int boost the speed so much?

Compare roots of quadratic functions

I need a function to quickly compare a root of a quadratic function with a given value, and a function to quickly compare two roots of two quadratic functions.
I wrote the first function:
bool isRootLessThanValue (bool sqrtDeltaSign, int a, int b, int c, int value) {
    bool ret;
    if(sqrtDeltaSign){
        if(a < 0){
            ret = (2*a*value + b < 0) || (a*value*value + b*value + c > 0);
        }else{
            ret = (2*a*value + b > 0) && (a*value*value + b*value + c > 0);
        }
    }else{
        if(a < 0){
            ret = (2*a*value + b < 0) && (a*value*value + b*value + c < 0);
        }else{
            ret = (2*a*value + b > 0) || (a*value*value + b*value + c < 0);
        }
    }
    return ret;
};
When I try to write this for the second function, it grows very big and complicated...
bool isRoot1LessThanRoot2 (bool sqrtDeltaSign1, int a1, int b1, int c1, bool sqrtDeltaSign2, int a2, int b2, int c2) {
    //...
}
Do you have any suggestions on how I can simplify this function?
If you think this is a stupid idea for an optimization, please tell me why :)
Here is a simplified version of the first part of your code, comparing the greater root of the quadratic function with a given value:
#include <iostream>
#include <cmath> // for main testing

int isRootLessThanValue (int a, int b, int c, int value)
{
    if (a<0){ b *= -1; c *= -1; a *= -1; }

    int xt, delta;
    xt = 2 * a * value + b;
    if (xt < 0) return false; // value is left of the reflection point

    delta = b*b - 4*a*c;
    // compare square distance between value and the root
    return (xt * xt) > delta;
}
In the test main() program, the roots are first calculated for clarity:
int main()
{
    int a, b, c, v;
    a = -2;
    b = 4;
    c = 3;

    double r1, r2, r, dt;
    dt = std::sqrt(b*b-4.0*a*c);
    r1 = (-b + dt) / (2.0*a);
    r2 = (-b - dt) / (2.0*a);
    r = (r1>r2)? r1 : r2;

    while (1)
    {
        std::cout << "Input the try value = ";
        std::cin >> v;
        if (isRootLessThanValue(a,b,c,v)) std::cout << v << " > " << r << std::endl;
        else std::cout << v << " < " << r << std::endl;
    }
    return 0;
}
The following assumes that both quadratics have real, mutually distinct roots, and a1 = a2 = 1. This keeps the notations simpler, though similar logic can be used in the general case.
Suppose f(x) = x^2 + b1 x + c1 has the real roots u1 < u2, and g(x) = x^2 + b2 x + c2 has the real roots v1 < v2. Then there are 6 possible sort orders.
(1)   u1 < u2 < v1 < v2
(2)   u1 < v1 < u2 < v2
(3)   u1 < v1 < v2 < u2
(4)   v1 < u1 < u2 < v2
(5)   v1 < u1 < v2 < u2
(6)   v1 < v2 < u1 < u2
Let v be a root of g so that g(v) = v^2 + b2 v + c2 = 0 then v^2 = -b2 v - c2 and therefore f(v) = (b1 - b2) v + c1 - c2 = b12 v + c12 where b12 = b1 - b2 and c12 = c1 - c2.
It follows that Sf = f(v1) + f(v2) = b12(v1 + v2) + 2 c12 and Pf = f(v1) f(v2) = b12^2 v1 v2 + b12 c12 (v1 + v2) + c12^2. Using Vieta's relations v1 v2 = c2 and v1 + v2 = -b2, in the end Sf = f(v1) + f(v2) = -b12 b2 + 2 c12 and Pf = f(v1) f(v2) = b12^2 c2 - b12 c12 b2 + c12^2. Similar expressions can be calculated for Sg = g(u1) + g(u2) and Pg = g(u1) g(u2).
(It should be noted that Sf, Pf, Sg, Pg above are arithmetic expressions in the coefficients, not involving square roots. There is, however, the potential for integer overflow. If that is an actual concern, then the calculations would have to be done in floating point instead of integers.)
If Pf = f(v1) f(v2) < 0 then exactly one root of f is between the roots v1, v2 of g.
If the axis of f is to the left of the g one, meaning -b1 < -b2, then it is the larger root u2 of f which is between v1, v2 i.e. case (2).
Otherwise, if -b1 > -b2, then it is the smaller root u1 i.e. case (5).
If Pf = f(v1) f(v2) > 0 then either both or none of the roots of f are between the roots of g. In this case f(v1) and f(v2) must have the same sign, and they will either be both negative if Sf = f(v1) + f(v2) < 0 or both positive if Sf > 0.
If f(v1) < 0 and f(v2) < 0 then both roots v1, v2 of g are between the roots of f i.e. case (3).
By symmetry, if Pg > 0 and Sg < 0 then g(u1) < 0 and g(u2) < 0, so both roots u1, u2 of f are between the roots of g i.e. case (4).
Otherwise the last combination left is f(v1), f(v2) > 0 and g(u1), g(u2) > 0 where the intervals (u1, u2) and (v1, v2) do not overlap. If -b1 < -b2 the axis of f is to the left of the g one i.e. case (1) else it's case (6).
Once the sort order between all roots is determined, comparing any particular pair of roots follows.
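The case analysis above can be sketched in code for the monic setting (a hypothetical helper; it assumes both quadratics have two distinct real roots and no root in common, and uses only additions and multiplications of the coefficients):

```cpp
// Hypothetical sketch: classify the sort order of the roots of
// f(x) = x^2 + b1*x + c1 (roots u1 < u2) and g(x) = x^2 + b2*x + c2
// (roots v1 < v2) into cases (1)..(6), assuming distinct real roots and
// no common root. No square roots are taken.
int rootOrderCase(double b1, double c1, double b2, double c2)
{
    const double b12 = b1 - b2, c12 = c1 - c2;
    // Via Vieta (v1 + v2 = -b2, v1*v2 = c2), and symmetrically for g:
    const double Sf = -b12 * b2 + 2.0 * c12;                       // f(v1) + f(v2)
    const double Pf = b12 * b12 * c2 - b12 * c12 * b2 + c12 * c12; // f(v1) * f(v2)
    const double Sg = b12 * b1 - 2.0 * c12;                        // g(u1) + g(u2)

    if (Pf < 0.0)                   // exactly one root of f lies in (v1, v2):
        return (-b1 < -b2) ? 2 : 5; // axis of f left of g's -> it is u2, case (2)
    if (Sf < 0.0)                   // f(v1), f(v2) both negative:
        return 3;                   // v1, v2 both inside (u1, u2), case (3)
    if (Sg < 0.0)                   // g(u1), g(u2) both negative:
        return 4;                   // u1, u2 both inside (v1, v2), case (4)
    return (-b1 < -b2) ? 1 : 6;     // disjoint intervals: leftmost axis first
}
```

For example, f with roots {0, 2} (b1 = -2, c1 = 0) and g with roots {1, 3} (b2 = -4, c2 = 3) interleave as u1 < v1 < u2 < v2, case (2).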
We are definitely talking micro-optimization here, but consider making calculations before performing the comparison:
bool isRootLessThanValue (bool sqrtDeltaSign, int a, int b, int c, int value)
{
    const int a_value = a * value;
    const int two_a_b_value = 2 * a_value + b;
    const int a_squared_b = a_value * value + b * value + c;
    const bool two_ab_less_zero = (two_a_b_value < 0);

    bool ret = false;
    if(sqrtDeltaSign)
    {
        const bool a_squared_b_greater_zero = (a_squared_b > 0);
        if (a < 0)
        {
            ret = two_ab_less_zero || a_squared_b_greater_zero;
        }
        else
        {
            ret = !two_ab_less_zero && a_squared_b_greater_zero; // (edited)
        }
    }
    else
    {
        const bool a_squared_b_less_zero = (a_squared_b < 0);
        if (a < 0)
        {
            ret = two_ab_less_zero && a_squared_b_less_zero;
        }
        else
        {
            ret = !two_ab_less_zero || a_squared_b_less_zero; // (edited)
        }
    }
    return ret;
};
Another note: each boolean expression is calculated and stored in a variable, so it may compile down to a simple data-processing instruction (depending on the compiler and processor).
Compare the assembly language of this function to yours. Also benchmark. As I said, I'm not expecting much time savings here, but I don't know how many times this function is called in your code.
I'm reorganising my code and have found some simplifications :)
When calculating a, b and c, I can keep the structure such that a > 0 always holds :)
and I know in advance whether I want the small or the big root :)
so the function to compare a root to a value reduces to the form below:
bool isRootMinLessThanValue (int a, int b, int c, int value) {
    const int a_value = a * value;
    const int u = 2*a_value + b;
    const int v = a_value*value + b*value + c;
    return u > 0 || v < 0;
};

bool isRootMaxLessThanValue (int a, int b, int c, int value) {
    const int a_value = a*value;
    const int u = 2*a_value + b;
    const int v = a_value*value + b*value + c;
    return u > 0 && v > 0;
}
When I benchmark it, it's faster than calculating the roots traditionally (because of the assumptions I cannot say exactly how much).
Below is code for the fast (and the slow, traditional) root-to-value comparison without those assumptions:
bool isRootLessThanValue (bool sqrtDeltaSign, int a, int b, int c, int value) {
    const int a_value = a*value;
    const int u = 2*a_value + b;
    const int v = a_value*value + b*value + c;
    const bool s = sqrtDeltaSign;
    return ( a < 0  &&  s && u < 0          ) ||
           ( a < 0  &&  s && v > 0          ) ||
           ( a < 0  && !s && u < 0 && v < 0 ) ||
           (!(a < 0) && !s && u > 0         ) ||
           (!(a < 0) && !s && v < 0         ) ||
           (!(a < 0) &&  s && u > 0 && v > 0);
};

bool isRootLessThanValueTraditional (bool sqrtDeltaSign, int a, int b, int c, int value) {
    double delta = b*b - 4.0*a*c;
    double calculatedRoot = sqrtDeltaSign ? (-b + sqrt(delta))/(2.0*a) : (-b - sqrt(delta))/(2.0*a);
    return calculatedRoot < value;
};
benchmark results below:
isRootLessThanValue (optimized): 10000000000 compares in 152.922s
isRootLessThanValueTraditional : 10000000000 compares in 196.168s
Any suggestions on how I can simplify the isRootLessThanValue function even more? :)
I will try to prepare a function to compare two roots of different equations.
Edited 2020-11-30:
bool isRootLessThanValue (bool sqrtDeltaSign, int a, int b, int c, int value) {
    const int a_value = a*value;
    const int u = 2*a_value + b;
    const int v = a_value*value + b*value + c;
    return sqrtDeltaSign ?
        (( a < 0 && (u < 0 || v > 0) ) || (u > 0 && v > 0)) :
        (( a > 0 && (u > 0 || v < 0) ) || (u < 0 && v < 0));
};
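As a sanity check, the edited predicate can be cross-checked against the sqrt-based version by brute force over small coefficients (a verification sketch with hypothetical helper names; it skips degenerate discriminants and exact ties, where the strict < is numerically ambiguous):

```cpp
#include <cmath>

// The branchless predicate from the 2020-11-30 edit above, renamed only.
bool rootLessThanValueFast(bool sqrtDeltaSign, int a, int b, int c, int value) {
    const int a_value = a * value;
    const int u = 2 * a_value + b;
    const int v = a_value * value + b * value + c;
    return sqrtDeltaSign ?
        ((a < 0 && (u < 0 || v > 0)) || (u > 0 && v > 0)) :
        ((a > 0 && (u > 0 || v < 0)) || (u < 0 && v < 0));
}

// Reference: compute the root with sqrt and compare directly.
bool rootLessThanValueRef(bool sqrtDeltaSign, int a, int b, int c, int value) {
    const double delta = (double)b * b - 4.0 * a * c;
    const double root = sqrtDeltaSign ? (-b + std::sqrt(delta)) / (2.0 * a)
                                      : (-b - std::sqrt(delta)) / (2.0 * a);
    return root < value;
}

// Count disagreements over an exhaustive sweep of small coefficients,
// skipping a == 0, non-positive discriminants, and near-ties.
int countMismatches() {
    int mismatches = 0;
    for (int a = -4; a <= 4; ++a) {
        if (a == 0) continue;
        for (int b = -4; b <= 4; ++b)
            for (int c = -4; c <= 4; ++c) {
                const double delta = (double)b * b - 4.0 * a * c;
                if (delta <= 0) continue;            // need two distinct real roots
                for (int value = -5; value <= 5; ++value)
                    for (int s = 0; s <= 1; ++s) {
                        const double root =
                            (-b + (s ? 1 : -1) * std::sqrt(delta)) / (2.0 * a);
                        if (std::fabs(root - value) < 1e-9) continue; // skip ties
                        if (rootLessThanValueFast(s, a, b, c, value) !=
                            rootLessThanValueRef(s, a, b, c, value))
                            ++mismatches;
                    }
            }
    }
    return mismatches;
}
```

Over this range the two versions agree on every non-tie input, which is a cheap regression guard to keep around while simplifying the expression further.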

What could the parameters of this FFT function mean

I'm trying to understand the FFT algorithm.
Here's a code
void fft(double *a, double *b, double *w, int m, int l)
{
    int i, i0, i1, i2, i3, j;
    double u, v, wi, wr;

    for (j = 0; j < l; j++) {
        wr = w[j << 1];
        wi = w[j << 1 + 1];
        for (i = 0; i < m; i++) {
            i0 = (i << 1) + (j * m << 1);
            i1 = i0 + (m * l << 1);
            i2 = (i << 1) + (j * m << 2);
            i3 = i2 + (m << 1);
            u = a[i0] - a[i1];
            v = a[i0 + 1] - a[i1 + 1];
            b[i2] = a[i0] + a[i1];
            b[i2 + 1] = a[i0 + 1] + a[i1 + 1];
            b[i3] = wr * u - wi * v;
            b[i3 + 1] = wr * v + wi * u;
        }
    }
}
If I get it right, array W is input, where every odd number is real and even is imag. A and B are imag and real parts of complex result
Also I found that l = 2**m
But when I try this:
double a[4] = { 0, 0, 0, 0 };
double b[4] = { 0, 0, 0, 0 };
double w[8] = { 1, 0, 0, 0, 0, 0, 0, 0 };
int m = 3;
int l = 8;
fft(a, b, w, m, l);
There's an error.
This code is only part of an FFT. a is input. b is output. w contains precomputed weights. l is a number of subdivisions at the current point in the FFT. m is the number of elements per division. The data in a, b, and w is interleaved complex data—each pair of double elements from the array consists of the real part and the imaginary part of one complex number.
The code performs one radix-two butterfly pass over the data. To use it to compute an FFT, it must be called multiple times with specific values for l, m, and the weights in w. Since, for each call, the input is in a and the output is in b, the caller must use at least two buffers and alternate between them for successive calls to the routine.
From the indexing performed in i0 and i2, it appears the data is being rearranged slightly. This may be intended to produce the final results of the FFT in “natural” order instead of the bit-reversed order that occurs in a simple implementation.
But when I try this:
double a[4] = { 0, 0, 0, 0 };
double b[4] = { 0, 0, 0, 0 };
double w[8] = { 1, 0, 0, 0, 0, 0, 0, 0 };
int m = 3;
int l = 8;
 
fft(a, b, w, m, l);
There's an error.
From for (j = 0; j < l; j++), we see the maximum value of j in the loop is l-1. From for (i = 0; i < m; i++), we see the maximum value of i is m-1. Then in i0 = (i << 1) + (j * m << 1), we have i0 = ((m-1) << 1) + ((l-1) * m << 1) = (m-1)*2 + (l-1) * m * 2 = 2*m - 2 + l*m*2 - m*2 = 2*m*l - 2. And in i1 = i0 + (m * l << 1), we have i1 = 2*m*l - 2 + (m * l * 2) = 4*m*l - 2. When the code uses a[i1 + 1], the index is i1 + 1 = 4*m*l - 2 + 1 = 4*m*l - 1.
Therefore a must have an element with index 4*m*l - 1, so it must have at least 4*m*l elements. The required size for b can be computed similarly and is the same.
When you call fft with m set to 3 and l set to 8, a must have 4•3•8 = 96 elements. Your sample code shows four elements. Thus, the array is overrun, and the code fails.
I do not believe it is correct that l should equal 2**m. More likely, 4*m*l should not vary between calls to fft in the same complete FFT computation, and, since a and b contain two double elements for every complex number, 4*m*l should be twice the number of complex elements in the signal being transformed.
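Following that index analysis, a tiny hypothetical helper makes the size requirement explicit:

```cpp
// Minimum number of double elements each of a and b must hold for a call
// to fft(a, b, w, m, l): the highest index read is 4*m*l - 1, so the
// buffers need 4*m*l doubles (i.e. 2*m*l interleaved complex values).
// Hypothetical helper, derived from the index analysis above.
int minFftBufferDoubles(int m, int l)
{
    return 4 * m * l;
}
```

For the failing call (m = 3, l = 8) this gives 96 doubles, while the sample arrays held only 4 each.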

How to implement the deconv layer in caffe in the 3D filter manner?

I need to implement the forward computation of a deconv layer in the 3D filter manner.
Here, by '3D filter manner', I mean convolution like the Gaussian filter in CV. In contrast, caffe implements deconv in the gemm + col2im manner.
I found a similar question here. The author wrote the code according to the introduction in transposed conv.
He/she did not publish the source code, so I finished my own:
template <typename DataType> int deconv_cpu(
    DataType *src, DataType *dst, DataType *para, DataType *bias,
    int in_width, int in_height, int in_channel,
    int out_width, int out_height, int out_channel,
    int ks, int padding = 0, int step = 1) { // step indicates the stride

    int col, row, ch_o, ch_i, x, y;
    int r = (ks - 1) / 2; // radius
    DataType result;
    DataType *output;
    DataType *filter;
    DataType *input;

    int sim_width, sim_height, sim_pad, width_border, height_border;
    sim_width = in_width * step - step + 1;
    sim_height = in_height * step - step + 1;
    sim_pad = ks - padding - 1;
    width_border = sim_pad == 0 ? r : 0;
    height_border = sim_pad == 0 ? r : 0;

    for (row = height_border; row < (sim_height - height_border); row++)
        for (col = width_border; col < (sim_width - width_border); col++)
        {
            for (ch_o = 0; ch_o < out_channel; ch_o++)
            {
                output = dst + ch_o * out_width * out_height;
                result = 0;
                for (ch_i = 0; ch_i < in_channel; ch_i++)
                {
                    filter = para + ks * ks * (in_channel * ch_o + ch_i);
                    //filter = para + ks*ks * (out_channel * ch_i + ch_o);
                    input = src + ch_i * in_width * in_height;
                    for (x = -r; x <= r; x++)
                    {
                        for (y = -r; y <= r; y++)
                        {
                            if ((row + x) >= 0 && (col + y) >= 0 && (row + x) < sim_height && (col + y) < sim_width)
                            {
                                if ((row + x) % step != 0 || (col + y) % step != 0) continue;
                                result += input[(row + x) / step * in_width + (col + y) / step] * filter[(x + r) * ks + (y + r)];
                            }
                        }
                    }
                }
                if (bias != NULL) result = result + bias[ch_o];
                output[(row - height_border) * out_width + (col - width_border)] = result;
            }
        }
    return 0;
}
I compared the result with caffe's:
const caffe::vector<caffe::shared_ptr<caffe::Blob<float> > > blobs = layers[i]->blobs();
float *filter = blobs[0]->mutable_cpu_data();
float *bias = blobs[1]->mutable_cpu_data();

caffe::shared_ptr<caffe::Blob<float> > blob;
blob = caffe_net->blob_by_name(np.bottom(0));
deconv_cpu(blob->mutable_cpu_data(), dst, filter, bias, width1,
           height1, c1, width2, height2, c2, ks, pad, stride);

blob = caffe_net->blob_by_name(np.top(0));
if (compare(dst, blob->mutable_cpu_data()) == 0) printf("match\n");
else printf("do not match\n");
However, the code does not give the same result as caffe's implementation.
Does anyone know what is wrong? Any advice or comments on the code?
This issue was finally fixed by changing the filter index to:
filter[(r-x) * ks + (r-y)]

My Perlin noise looks wrong, almost like grey t-shirt material (heather). Why?

I tried a quick and dirty translation of the code here.
However, my version outputs noise comparable to grey t-shirt material, or heather if it please you:
#include <fstream>
#include "perlin.h"
double Perlin::cos_Interp(double a, double b, double x)
{
    ft = x * 3.1415927;
    f = (1 - cos(ft)) * .5;
    return a * (1 - f) + b * f;
}

double Perlin::noise_2D(double x, double y)
{
    /*
    int n = (int)x + (int)y * 57;
    n = (n << 13) ^ n;
    int nn = (n * (n * n * 60493 + 19990303) + 1376312589) & 0x7fffffff;
    return 1.0 - ((double)nn / 1073741824.0);
    */
    int n = (int)x + (int)y * 57;
    n = (n<<13) ^ n;
    return ( 1.0 - ( (n * (n * n * 15731 + 789221) + 1376312589) & 0x7fffffff) / 1073741824.0);
}

double Perlin::smooth_2D(double x, double y)
{
    corners = ( noise_2D(x - 1, y - 1) + noise_2D(x + 1, y - 1) + noise_2D(x - 1, y + 1) + noise_2D(x + 1, y + 1) ) / 16;
    sides = ( noise_2D(x - 1, y) + noise_2D(x + 1, y) + noise_2D(x, y - 1) + noise_2D(x, y + 1) ) / 8;
    center = noise_2D(x, y) / 4;
    return corners + sides + center;
}

double Perlin::interp(double x, double y)
{
    int x_i = int(x);
    double x_left = x - x_i;
    int y_i = int(y);
    double y_left = y - y_i;

    double v1 = smooth_2D(x_i, y_i);
    double v2 = smooth_2D(x_i + 1, y_i);
    double v3 = smooth_2D(x_i, y_i + 1);
    double v4 = smooth_2D(x_i + 1, y_i + 1);

    double i1 = cos_Interp(v1, v2, x_left);
    double i2 = cos_Interp(v3, v4, x_left);

    return cos_Interp(i1, i2, y_left);
}

double Perlin::perlin_2D(double x, double y)
{
    double total = 0;
    double p = .25;
    int n = 1;

    for(int i = 0; i < n; ++i)
    {
        double freq = pow(2, i);
        double amp = pow(p, i);
        total = total + interp(x * freq, y * freq) * amp;
    }
    return total;
}

int main()
{
    Perlin perl;
    ofstream ofs("./noise2D.ppm", ios_base::binary);
    ofs << "P6\n" << 512 << " " << 512 << "\n255\n";

    for(int i = 0; i < 512; ++i)
    {
        for(int j = 0; j < 512; ++j)
        {
            double n = perl.perlin_2D(i, j);
            n = floor((n + 1.0) / 2.0 * 255);
            unsigned char c = n;
            ofs << c << c << c;
        }
    }
    ofs.close();
    return 0;
}
I don't believe that I strayed too far from the aforementioned site's directions aside from adding in the ppm image generation code, but then again I'll admit to not fully grasping what is going on in the code.
As you'll see by the commented section, I tried two (similar) ways of generating pseudorandom numbers for noise. I also tried different ways of scaling the numbers returned by perlin_2D to RGB color values. These two ways of editing the code have just yielded different looking t-shirt material. So, I'm forced to believe that there's something bigger going on that I am unable to recognize.
Also, I'm compiling with g++ and the c++11 standard.
EDIT: Here's an example: http://imgur.com/Sh17QjK
To convert a double in the range of [-1.0, 1.0] to an integer in range [0, 255]:
n = floor((n + 1.0) / 2.0 * 255.99);
To write it as a binary value to the PPM file:
ofstream ofs("./noise2D.ppm", ios_base::binary);
...
unsigned char c = n;
ofs << c << c << c;
Is this a direct copy of your code? You assigned an integer to what should be the Y fractional value -- it's a typo, and it will throw the entire noise algorithm off if you don't fix it:
double Perlin::interp(double x, double y)
{
    int x_i = int(x);
    double x_left = x - x_i;
    int y_i = int(y);
    double y_left = y = y_i; // This should have a minus, not an "=" like the line above
    .....
}
My guess is if you're successfully generating the bitmap with the proper color computation, you're getting vertical bars or something along those lines?
You also need to remember that the Perlin generator usually spits out numbers in the range of -1 to 1 and you need to multiply the resultant value as such:
value * 127 + 128 = {R, G, B}
to get a good grayscale image.
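That mapping can be written as a small helper (hypothetical name; the truncation maps -1 to byte 1 and +1 to byte 255, and the clamp guards against samples slightly outside [-1, 1]):

```cpp
// Map a noise sample in [-1, 1] to a grayscale byte via value * 127 + 128,
// clamping to [0, 255] in case the generator overshoots the nominal range.
unsigned char noiseToGray(double value)
{
    const int g = (int)(value * 127.0 + 128.0);
    return (unsigned char)(g < 0 ? 0 : (g > 255 ? 255 : g));
}
```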