DFT (discrete Fourier transform) with C++ code

I wrote this code in Visual Studio C++.
I handle complex numbers with <complex>, but otherwise the code follows C grammar rather than C++.
a[] is an array of 4001 elements, so I store the computed values in b[4001] and finally copy them back into a[].
NXm is defined as 4001 in main.
When I compare the output with the result of MATLAB's fft, differences appear from the 169th to the 4000th value.
Could you check whether there is an error, or tell me what the cause is?
Thank you for reading the question.
I have tried running the "for" loops in reverse, from NXm down to 0.
I have tried changing double ak = (double)k * (double)n * (2.0 * M_PI / (double)NXm); to double ak = k * n * (2.0 * M_PI / (double)NXm);
When I tried a shorter input, the function worked well.
This is the code.
void fft(complex<double> a[], int NXm)
{
complex<double> sum = 0.0 + 0.0*I;
complex<double> c = 1.0*I;
complex<double> b[4001] = { 0 };
for (int k = 0; k < NXm; k++)
{
sum = 0.0 + 0.0*I;
for (int n = 0; n < NXm; n++)
{
double ak = (double)k * (double)n * (2.0 * M_PI / (double)NXm);
sum = sum + a[n] * exp(-c * ak);
}
b[k] = sum;
}
for (int i = 0; i < NXm; i++)
{
a[i] = b[i];
}
}
I expect almost the same result as MATLAB's fft.
A slight error at epsilon level is OK.

Complex exponential is:
exp(x + I*y) = exp(x) ( cos(y) + I*sin(y)) = exp(x) * cis(y)
The problem may be caused by limited precision in the FPU part of the processor: if y is rather small, the built-in intrinsic for sin(y) may have an error larger than |y - sin(y)|. Naive implementations of Euler's formula may err for small arguments, and sloppy ones may assume that sin(y) is zero.
It may be better to replace sin() with a cosecant intrinsic, i.e. the reciprocal of the sine (1/sin(x)).
Another, less common issue arises when trig functions are given arguments greater than 2*PI*n, where n is a natural number greater than 1. Precision drops as n increases, and in floating point sin(x) != sin(2*PI*n + x).
Third, depending on the compiler, there may be optimization flags that control how the compiler treats math functions and floating-point math in general. They may default to fast but imprecise calculations. This may also be affected by using a C header in C++ code.
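One way to address the second point (large arguments to the trig functions) can be sketched as follows. Since exp(-2*pi*i*k*n/N) depends only on (k*n) mod N, the product can be reduced before the angle is formed, so the argument never exceeds one full turn. This is only a sketch and not a confirmed fix for the discrepancy in the question; the function name dft and the use of std::vector instead of the fixed b[4001] buffer are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Direct O(N^2) DFT. Hypothetical helper, not the asker's fft().
std::vector<std::complex<double>> dft(const std::vector<std::complex<double>>& a)
{
    assert(!a.empty());
    const std::size_t N = a.size();
    // Angle increment -2*pi/N; pi is obtained portably from acos(-1).
    const double step = -2.0 * std::acos(-1.0) / static_cast<double>(N);

    std::vector<std::complex<double>> b(N);
    for (std::size_t k = 0; k < N; ++k)
    {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n)
        {
            // Reduce k*n modulo N in exact integer arithmetic, so the
            // angle passed to sin/cos stays within (-2*pi, 0].
            const std::size_t r = (k * n) % N;
            sum += a[n] * std::polar(1.0, step * static_cast<double>(r));
        }
        b[k] = sum;
    }
    return b;
}
```

The integer product k*n must fit in std::size_t; for N = 4001 it is at most about 16 million, so this holds on any common platform.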

MATLAB uses the FFT (Fast Fourier Transform) algorithm to compute the DFT. It is both much faster (O(n log n) as opposed to O(n^2) for the direct method) and numerically more stable. So don't expect exactly the same results as MATLAB produces.

Related

Hyperbolic sine without math.h

I'm new to coding and C++. For a homework assignment I have to write code for sinh without the math header. I understand the math behind sinh, but I have no idea how to code it. Any help would be highly appreciated.

According to Wikipedia, there is a Taylor series for sinh:
sinh(x) = x + (pow(x, 3) / 3!) + (pow(x, 5) / 5!) + pow(x, 7) / 7! + ...
One challenge is that you are not allowed to use the pow function. The other is calculating the factorial.
The series is a sum of terms, so you'll need a loop:
double sum = 0.0;
for (unsigned int i = 0; i < NUMBER_OF_TERMS; ++i)
{
sum += Term(i);
}
You could implement Term as a separate function, but you may want to take advantage of declaring and using variables in the loop (that the function may not have access to).
Consider that pow(x, N) expands to x * x * x...
This means that in each iteration the previous value is multiplied by x. (This will come in handy later.)
Consider that N! expands to 1 * 2 * 3 * 4 * 5 * ...
This means that in each iteration, the previous value is multiplied by the iteration number.
Let's revisit the loop:
double sum = 0.0;
double power = 1.0;
double factorial = 1.0;
for (unsigned int i = 1; i <= NUMBER_OF_TERMS; ++i)
{
// Calculate pow(x, i)
power = power * x;
// Calculate i!
factorial = factorial * i;
}
One issue with the above loop is that the power and factorial need to be updated in every iteration, but the Taylor series only uses the odd iterations. This is solved by calculating the terms only for odd iterations:
for (unsigned int i = 1; i <= NUMBER_OF_TERMS; ++i)
{
// Calculate pow(x, i)
power = power * x;
// Calculate i!
factorial = factorial * i;
// Calculate sum for odd iterations
if ((i % 2) == 1)
{
// Calculate the term.
sum += //...
}
}
In summary, the pow and factorial functions are broken down into iterative pieces. The iterative pieces are placed into a loop. Since the Taylor Series terms are calculated with odd iteration values, a check is placed into the loop.
The actual calculation of the Taylor Series term is left as an exercise for the OP or reader.
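For readers who want to check their own solution, here is one possible way to complete the idea, sketched under a slightly different bookkeeping: instead of tracking the power and factorial separately, each term is derived from the previous one by multiplying with x*x / ((2k) * (2k+1)). The function name sinh_taylor and the default of 20 terms are assumptions for illustration, not part of the answer above.

```cpp
#include <cassert>

// Sums the first `terms` terms of x + x^3/3! + x^5/5! + ...
// Hypothetical helper; uses no math header.
double sinh_taylor(double x, unsigned int terms = 20)
{
    assert(terms > 0);
    double term = x;    // first term: x^1 / 1!
    double sum = term;
    for (unsigned int k = 1; k < terms; ++k)
    {
        // x^(2k+1)/(2k+1)! = previous term * x^2 / ((2k) * (2k+1))
        term *= (x * x) / ((2.0 * k) * (2.0 * k + 1.0));
        sum += term;
    }
    return sum;
}
```

Updating the term incrementally also keeps the intermediate values small, whereas a separately accumulated factorial grows very quickly.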

Is Eigen library matrix/vector manipulation faster than .net ones if the matrix is dense and unsymmetrical?

I have some matrix operations, mostly running over all the rows and columns of the matrix and performing a multiplication of the form a*mat[i,j]*mat[ii,j]:
public double[] MaxSumFunction()
{
var maxSum= new double[vector.GetLength(1)];
for (int j = 0; j < matrix.GetLength(1); j++)
{
for (int i = 0; i < matrix.GetLength(0); i++)
{
for (int ii = 0; ii < matrix.GetLength(0); ii++)
{
double wi= Math.Sqrt(vector[i]);
double wii= Math.Sqrt(vector[ii]);
maxSum[j] += SomePowerFunctions(wi, wii) * matrix[i, j]*matrix[ii, j];
}
}
}
return maxSum;
}
private double SomePowerFunctions(double wi, double wj)
{
var betaij = wi/ wj;
var numerator = 8 * Math.Sqrt(wi* wj) * Math.Pow(betaij, 3.0 / 2)
* (wi+ betaij * wj);
var dominator = Math.Pow(1 - betaij * betaij, 2) +
4 * wi* wj* betaij * (1 + Math.Pow(betaij, 2)) +
4 * (wi* wi+ wj* wj) * Math.Pow(betaij, 2);
if (wi== 0 && wj== 0)
{
if (Math.Abs(betaij - 1) < 1.0e-8)
return 1;
else
return 0;
}
return numerator / dominator;
}
I found such loops to be particularly slow when the matrix is big.
I want this to run fast, so I am thinking about re-implementing these algorithms using the Eigen library.
My matrix is not symmetrical, not sparse and contains no regularity that any solver can exploit reliably.
I read that Eigen solver can be fast because of:
Compiler optimization
Vectorization
Multi-thread support
But I wonder whether those advantages really apply given my matrix characteristics.
Note: I could have just run a sample or two to find out, but I believe that asking the question here and have it documented on the Internet is going to help others as well.

Before thinking about low-level optimizations, look at your code and observe that many quantities are recomputed many times. For instance, f(wi,wii) does not depend on j, so the values could either be precomputed once (see below), or you can rewrite your loops to make the loop on j the innermost one. Then the innermost loop is simply a coefficient-wise product between a constant scalar and two columns of your matrix (I don't know .NET and assume j is indexing columns). If the storage is column-major, then this operation should be fully vectorized by your compiler (again, I don't know .NET, but any C++ compiler will do, and if you use Eigen, it will be vectorized explicitly). This should be enough to get a huge performance boost.
Depending on the sizes of the matrix, you might also try to leverage an optimized matrix-matrix product by precomputing f(wi,wii) into a MatrixXd F (using Eigen's language), and then observing that the whole computation amounts to:
VectorXd v = your_vector;
MatrixXd F = MatrixXd::NullaryExpr(n,n,[&](Index i,Index j) {
return SomePowerFunctions(sqrt(v(i)), sqrt(v(j)));
});
MatrixXd M = your_matrix;
MatrixXd FM = F * M;
VectorXd maxSum = (M.array() * FM.array()).colwise().sum();
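The same restructuring can be sketched without Eigen, which may help when porting the idea back to another language. The sketch below assumes a row-major matrix stored as nested std::vector, a weight vector w that already holds the square-rooted values, and a caller-supplied function f standing in for SomePowerFunctions; all of these names are hypothetical. The point it illustrates is that f is evaluated once per (i, ii) pair rather than once per (i, ii, j) triple, and that the innermost loop runs over contiguous memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // hypothetical layout: m[i][j]

// result[j] = sum over i, ii of f(w[i], w[ii]) * m[i][j] * m[ii][j]
std::vector<double> max_sum(const Matrix& m,
                            const std::vector<double>& w,
                            const std::function<double(double, double)>& f)
{
    const std::size_t rows = m.size();
    assert(rows > 0 && w.size() == rows);
    const std::size_t cols = m[0].size();

    std::vector<double> result(cols, 0.0);
    std::vector<double> fm_row(cols);  // row i of the product F * M
    for (std::size_t i = 0; i < rows; ++i)
    {
        std::fill(fm_row.begin(), fm_row.end(), 0.0);
        for (std::size_t ii = 0; ii < rows; ++ii)
        {
            const double fi = f(w[i], w[ii]);  // hoisted out of the j loop
            for (std::size_t j = 0; j < cols; ++j)
                fm_row[j] += fi * m[ii][j];
        }
        // Coefficient-wise product of row i of M with row i of F * M.
        for (std::size_t j = 0; j < cols; ++j)
            result[j] += m[i][j] * fm_row[j];
    }
    return result;
}
```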

Multidimensional Integration - Coupled Limits

I need to calculate the value of a high dimensional integral in C++. I have found numerous libraries capable of solving this task for fixed limit integrals,
\int_{0}^{L} \int_{0}^{L} dx dy f(x,y) .
However the integrals which I am looking at have variable limits,
\int_{0}^{L} \int_{x}^{L} dx dy f(x,y) .
To clarify what I mean, here is a naive 2D Riemann sum implementation, which returns the desired result,
int steps = 100;
double integral = 0;
double dl = L/((double) steps);
double x[2] = {0};
for(int i = 0; i < steps; i ++){
x[0] = dl*i;
for(int j = i; j < steps; j ++){
x[1] = dl*j;
double val = f(x);
integral += val*val*dl*dl;
}
}
where f is some arbitrary function and L the common upper integration limit. While this implementation works, it's slow and thus impractical for higher dimensions.
Effective algorithms for higher dimensions exist, but to my knowledge, library implementations (e.g. Cuba) take a fixed value vector as the limit argument which renders them useless for my problem.
Is there any reason for this and/or is there any trick to circumvent the problem?

Your integration order is wrong; it should be dy dx.
You are integrating over the triangle
0 <= x <= y <= L
inside the square [0,L]x[0,L]. This can be simulated by integrating over the full square where the integrand f is defined as 0 outside of the triangle. In many cases, when f is defined on the full square, this can be accomplished by taking the product of f with the indicator function of the triangle as new integrand.

When integrating over a triangular region such as 0<=x<=y<=L one can take advantage of symmetry: integrate f(min(x,y),max(x,y)) over the square 0<=x,y<=L and divide the result by 2. This has an advantage over extending f by zero (the method mentioned by LutzL) in that the extended function is continuous, which improves the performance of the integration routine.
I compared these on the example of the integral of 2x+y over 0<=x<=y<=1. The true value of the integral is 2/3. Let's compare the performance; for demonstration purposes I use a MATLAB routine, but this is not specific to the language or library used.
Extending by zero
f = @(x,y) (2*x+y).*(x<=y);
result = integral2(f, 0, 1, 0, 1);
fprintf('%.9f\n',result);
Output:
Warning: Reached the maximum number of function evaluations
(10000). The result fails the global error test.
0.666727294
Extending by symmetry
g = @(x,y) (2*min(x,y)+max(x,y));
result2 = integral2(g, 0, 1, 0, 1)/2;
fprintf('%.9f\n',result2);
Output:
0.666666776
The second result is 500 times more accurate than the first.
Unfortunately, this symmetry trick is not available for general domains; but integration over a triangle comes up often enough so it's useful to keep it in mind.
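Since the question asks about C++, the two extensions can also be sketched there. The code below uses a plain midpoint rule over the square as a stand-in for a real fixed-limit library routine (it is not the Cuba API); the names integrate_square, triangle_by_zero_extension and triangle_by_symmetry are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>

using Integrand = std::function<double(double, double)>;

// Midpoint rule over the square [0,L] x [0,L] with n x n cells.
// Stand-in for any routine that only accepts fixed limits.
double integrate_square(const Integrand& f, double L, int n)
{
    assert(n > 0 && L > 0.0);
    const double h = L / n;
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            sum += f((i + 0.5) * h, (j + 0.5) * h);
    return sum * h * h;
}

// Triangle 0 <= x <= y <= L: multiply f by the indicator of the triangle.
double triangle_by_zero_extension(const Integrand& f, double L, int n)
{
    return integrate_square(
        [&f](double x, double y) { return x <= y ? f(x, y) : 0.0; }, L, n);
}

// Same triangle: integrate f(min(x,y), max(x,y)) over the square, then halve.
double triangle_by_symmetry(const Integrand& f, double L, int n)
{
    return 0.5 * integrate_square(
        [&f](double x, double y) { return f(std::min(x, y), std::max(x, y)); },
        L, n);
}
```

For f(x,y) = 2x+y on the unit triangle, both functions should approach 2/3 as n grows, with the symmetric version converging noticeably faster because its integrand has no jump along the diagonal.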

I was a bit confused by your integral definition, but from your code I read it like this:
I just did some testing, so here is your code:
//---------------------------------------------------------------------------
double f(double *x) { return (x[0]+x[1]); }
void integral0()
{
double L=10.0;
int steps = 10000;
double integral = 0;
double dl = L/((double) steps);
double x[2] = {0};
for(int i = 0; i < steps; i ++){
x[0] = dl*i;
for(int j = i; j < steps; j ++){
x[1] = dl*j;
double val = f(x);
integral += val*val*dl*dl;
}
}
}
//---------------------------------------------------------------------------
Here is optimized code:
//---------------------------------------------------------------------------
void integral1()
{
double L=10.0;
int i0,i1,steps = 10000;
double x[2]={0.0,0.0};
double integral,val,dl=L/((double)steps);
#define f(x) (x[0]+x[1])
integral=0.0;
for(x[0]= 0.0,i0= 0;i0<steps;i0++,x[0]+=dl)
for(x[1]=x[0],i1=i0;i1<steps;i1++,x[1]+=dl)
{
val=f(x);
integral+=val*val;
}
integral*=dl*dl;
#undef f
}
//---------------------------------------------------------------------------
results:
[ 452.639 ms] integral0
[ 336.268 ms] integral1
so the speed-up is ~1.3 times (32-bit app on WOW64, AMD 3.2GHz)
for higher dimensions it will multiply
but I still think this approach is slow
The only way to reduce complexity that I can think of is to simplify things algebraically,
either by integration tables or by Laplace or Z transforms,
but for that f(*x) must be known ...
constant-factor time reduction can of course be achieved
by the use of multi-threading
and/or GPU usage
this can give you an N times speed increase
because this is all directly parallelisable

Faster computation of (approximate) variance needed

I can see with the CPU profiler, that the compute_variances() is the bottleneck of my project.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
75.63 5.43 5.43 40 135.75 135.75 compute_variances(unsigned int, std::vector<Point, std::allocator<Point> > const&, float*, float*, unsigned int*)
19.08 6.80 1.37 readDivisionSpace(Division_Euclidean_space&, char*)
...
Here is the body of the function:
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims) {
for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
float delta, n;
for (size_t i = 0; i < points.size(); ++i) {
n = 1.0 + i;
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, points[0].dim(), t, split_dims);
}
where kthLargest() doesn't seem to be a problem, since I see that:
0.00 7.18 0.00 40 0.00 0.00 kthLargest(float*, int, int, unsigned int*)
compute_variances() takes a vector of vectors of floats (i.e. a vector of Points, where Point is a class I have implemented) and computes their variance in each dimension (following Knuth's algorithm).
Here is how I call the function:
float avg[(*points)[0].dim()];
float var[(*points)[0].dim()];
size_t split_dims[t];
compute_variances(t, *points, avg, var, split_dims);
The question is, can I do better? I would be really happy to trade some accuracy of the variances for speed. Or maybe I could make the code more cache friendly or something?
I compiled like this:
g++ main_noTime.cpp -std=c++0x -p -pg -O3 -o eg
Note that before the edit I had used -o3, with a lowercase 'o'. Thanks to ypnos, I now compile with the optimization flag -O3. I am sure there is a difference between them, since I performed time measurements with one of these methods in my pseudo-site.
Note that now, compute_variances is dominating the overall project's time!
[EDIT]
compute_variances() is called 40 times.
Per 10 calls, the following hold true:
points.size() = 1000 and points[0].dim = 10000
points.size() = 10000 and points[0].dim = 100
points.size() = 10000 and points[0].dim = 10000
points.size() = 100000 and points[0].dim = 100
Each call handles different data.
Q: How fast is access to points[i][d]?
A: points[i] is just the i-th element of the std::vector, while the second [] is implemented as follows in the Point class.
const FT& operator [](const int i) const {
if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning
}
where coords is a std::vector of float values. This seems a bit heavy, but shouldn't the compiler be smart enough to predict correctly that the branch is always true? (I mean after the cold start). Moreover, the std::vector.at() is supposed to be constant time (as said in the ref). I changed this to have only .at() in the body of the function and the time measurements remained, pretty much, the same.
The division in compute_variances() is for sure something heavy! However, Knuth's algorithm is numerically stable, and I was not able to find another algorithm that is both numerically stable and free of division.
Note that I am not interested in parallelism right now.
[EDIT.2]
Minimal example of Point class (I think I didn't forget to show something):
class Point {
public:
typedef float FT;
...
/**
* Get dimension of point.
*/
size_t dim() const {
return coords.size();
}
/**
* Operator that returns the coordinate at the given index.
* @param i - index of the coordinate
* @return the coordinate at index i
*/
FT& operator [](const int i) {
return coords.at(i);
//it's the same if I have the commented code below
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
/**
* Operator that returns the coordinate at the given index. (constant)
* @param i - index of the coordinate
* @return the coordinate at index i
*/
const FT& operator [](const int i) const {
return coords.at(i);
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
private:
std::vector<FT> coords;
};

1. SIMD
One easy speedup for this is to use vector instructions (SIMD) for the computation. On x86 that means SSE, AVX instructions. Based on your word length and processor you can get speedups of about x4 or even more. This code here:
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
can be sped up by doing the computation for four elements at once with SSE. As your code really only processes one single element in each loop iteration, there is no bottleneck. If you go down to 16-bit short instead of 32-bit float (an approximation then), you can fit eight elements in one instruction. With AVX it would be even more, but you need a recent processor for that.
It is not the complete solution to your performance problem, just one of several measures that can be combined.
2. Micro-parallelism
The second easy speedup when you have that many loops is to use parallel processing. I typically use Intel TBB, others might suggest OpenMP instead. For this you would probably have to change the loop order. So parallelize over d in the outer loop, not over i.
You can combine both techniques, and if you do it right, on a quadcore with HT you might get a speed-up of 25-30 for the combination without any loss in accuracy.
3. Compiler optimization
First of all maybe it is just a typo here on SO, but it needs to be -O3, not -o3!
As a general note, it might be easier for the compiler to optimize your code if you declare the variables delta, n within the scope where you actually use them. You should also try the -funroll-loops compiler option as well as -march. The value for the latter depends on your CPU, but nowadays -march=core2 is typically fine (also for recent AMDs), and includes SSE optimizations (but I would not trust the compiler just yet to do that for your loop).

The big problem with your data structure is that it's essentially a vector<vector<float> >. That's a pointer to an array of pointers to arrays of float with some bells and whistles attached. In particular, accessing consecutive Points in the vector doesn't correspond to accessing consecutive memory locations. I bet you see tons and tons of cache misses when you profile this code.
Fix this before horsing around with anything else.
Lower-order concerns include the floating-point division in the inner loop (compute 1/n in the outer loop instead) and the big load-store chain that is your inner loop. You can compute the means and variances of slices of your array using SIMD and combine them at the end, for instance.
The bounds-checking once per access probably doesn't help, either. Get rid of that too, or at least hoist it out of the inner loop; don't assume the compiler knows how to fix that on its own.
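A minimal sketch of what these suggestions could look like together, assuming the points are copied once into a single row-major buffer (point i occupies data[i*dim] to data[i*dim + dim - 1]). The function name compute_mean_m2 and the flat layout are assumptions for illustration; the update rule is the same one used in the question, with the division hoisted into the outer loop and no per-access bounds check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// avg[d] receives the mean of dimension d, m2[d] the sum of squared
// deviations (what the question accumulates in var[d]).
void compute_mean_m2(const std::vector<float>& data, std::size_t dim,
                     std::vector<float>& avg, std::vector<float>& m2)
{
    assert(dim > 0 && data.size() % dim == 0);
    const std::size_t n = data.size() / dim;
    avg.assign(dim, 0.0f);
    m2.assign(dim, 0.0f);

    for (std::size_t i = 0; i < n; ++i)
    {
        // One division per point instead of one per coordinate.
        const float inv_n = 1.0f / static_cast<float>(i + 1);
        const float* p = &data[i * dim];  // contiguous coordinates of point i
        for (std::size_t d = 0; d < dim; ++d)
        {
            const float delta = p[d] - avg[d];
            avg[d] += delta * inv_n;
            m2[d] += delta * (p[d] - avg[d]);
        }
    }
}
```

Dividing m2[d] by the number of points gives the variance; for ranking dimensions, as kthLargest does, the unscaled sums are sufficient.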

Here's what I would do, in guesstimated order of importance:
Return the floating-point from the Point::operator[] by value, not by reference.
Use coords[i] instead of coords.at(i), since you already assert that it's within bounds. The at member checks the bounds. You only need to check it once.
Replace the home-baked error indication/checking in the Point::operator[] with an assert. That's what asserts are for. They are nominally no-ops in release mode - I doubt that you need to check it in release code.
Replace the repeated division with a single division and repeated multiplication.
Remove the need for wasted initialization by unrolling the first two iterations of the outer loop.
To lessen the impact of cache misses, run the inner loop alternately forwards and backwards. This at least gives you a chance at using some cached avg and var. It may in fact remove all cache misses on avg and var if prefetch works on reverse order of iteration, as it well should.
On modern C++ compilers, the std::fill and std::copy can leverage type alignment and have a chance at being faster than the C library memset and memcpy.
The Point::operator[] will have a chance of getting inlined in the release build and can reduce to two machine instructions (effective address computation and floating point load). That's what you want. Of course it must be defined in the header file, otherwise the inlining will only be performed if you enable link-time code generation (a.k.a. LTO).
Note that the Point::operator[]'s body is only equivalent to the single-line return coords.at(i) in a debug build. In a release build the entire body is equivalent to return coords[i], not return coords.at(i).
FT Point::operator[](int i) const {
assert(i >= 0 && i < (int)coords.size());
return coords[i];
}
const FT * Point::constData() const {
return &coords[0];
}
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims)
{
assert(points.size() > 0);
const int D = points[0].dim();
// i = 0, i_n = 1
assert(D > 0);
#if __cplusplus >= 201103L
std::copy_n(points[0].constData(), D, avg);
#else
std::copy(points[0].constData(), points[0].constData() + D, avg);
#endif
// i = 1, i_n = 0.5
if (points.size() >= 2) {
assert(points[1].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[1][d] - avg[d];
avg[d] += delta * 0.5f;
var[d] = delta * (points[1][d] - avg[d]);
}
} else {
std::fill_n(var, D, 0.0f);
}
// i = 2, ...
for (size_t i = 2; i < points.size(); ) {
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = 0; d < D; ++d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
if (i >= points.size()) break;
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, D, t, split_dims);
}

for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
This code could be optimized by simply using memset. The IEEE 754 representation of 0.0 in 32 bits is 0x00000000. If the dimension is big, it is worth it.
Something like:
memset((void*)avg, 0, points[0].dim() * sizeof(float));
In your code, you have a lot of calls to points[0].dim(). It would be better to call once at the beginning of the function and store in a variable. Likely, the compiler already does this (since you are using -O3).
The division operations are a lot more expensive (from clock-cycle POV) than other operations (addition, subtraction).
avg[d] += delta / n;
It could make sense to try to reduce the number of divisions: use a partial, non-cumulative average calculation, which would need only Dim division operations for N elements (instead of N x Dim); N < points.size().
Huge speedup could be achieved, using Cuda or OpenCL, since the calculation of avg and var could be done simultaneously for each dimension (consider using a GPU).

Another optimization is cache optimization including both data cache and instruction cache.
High level optimization techniques
Data Cache optimizations
Example of data cache optimization & unrolling
for (size_t d = 0; d < points[0].dim(); d += 4)
{
// Perform loading all at once.
register const float p1 = points[i][d + 0];
register const float p2 = points[i][d + 1];
register const float p3 = points[i][d + 2];
register const float p4 = points[i][d + 3];
register const float delta1 = p1 - avg[d+0];
register const float delta2 = p2 - avg[d+1];
register const float delta3 = p3 - avg[d+2];
register const float delta4 = p4 - avg[d+3];
// Perform calculations
avg[d + 0] += delta1 / n;
var[d + 0] += delta1 * ((p1) - avg[d + 0]);
avg[d + 1] += delta2 / n;
var[d + 1] += delta2 * ((p2) - avg[d + 1]);
avg[d + 2] += delta3 / n;
var[d + 2] += delta3 * ((p3) - avg[d + 2]);
avg[d + 3] += delta4 / n;
var[d + 3] += delta4 * ((p4) - avg[d + 3]);
}
This differs from classic loop unrolling in that loading from the matrix is performed as a group at the top of the loop.
Edit 1:
A subtle data optimization is to place avg and var into a structure. This will ensure that the two arrays are next to each other in memory, sans padding. The data fetching mechanism in processors favors data that are very close to each other: there is less chance of a data cache miss and a better chance of loading all of the data into the cache.

You could use Fixed Point math instead of floating point math as an optimization.
Optimization via Fixed Point
Processors love to manipulate integers (signed or unsigned). Floating point may take extra computing power due to extracting the parts, performing the math, then reassembling the parts. One mitigation is to use Fixed Point math.
Simple Example: meters
Given the unit of meters, one could express lengths smaller than a meter by using floating point, such as 3.14159 m. However, the same length can be expressed in a unit of finer detail like millimeters, e.g. 3141.59 mm. For finer resolution, a smaller unit is chosen and the value multiplied, e.g. 3,141,590 um (micrometers). The point is choosing a small enough unit to represent the floating point accuracy as an integer.
The floating point value is converted at input into Fixed Point. All data processing occurs in Fixed Point. The Fixed Point value is converted back to Floating Point before output.
Power of 2 Fixed Point Base
As with converting from floating point meters to fixed point millimeters, using 1000, one could use a power of 2 instead of 1000. Selecting a power of 2 allows the processor to use bit shifting instead of multiplication or division. Bit shifting by a power of 2 is usually faster than multiplication or division.
Keeping with the theme and accuracy of millimeters, we could use 1024 as the base instead of 1000. Similarly, for higher accuracy, use 65536 or 131072.
Summary
Changing the design or implementation to use Fixed Point math allows the processor to use more integral data processing instructions than floating point ones. Floating point operations consume more processing power than integral operations in all but specialized processors. Using powers of 2 as the base (or denominator) allows code to use bit shifting instead of multiplication or division. Division and multiplication take more operations than shifting, and thus shifting is faster. So rather than optimizing code for execution (such as loop unrolling), one could try using Fixed Point notation rather than floating point.
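As a rough illustration of the power-of-2 idea, here is a sketch of a signed 16.16 format (base 65536) held in a 32-bit integer. The names fixed_t, to_fixed, to_double and fixed_mul are hypothetical, and whether this actually beats float on a given CPU has to be measured.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kFracBits = 16;   // assumed base: 2^16 = 65536
using fixed_t = std::int32_t;   // 16 integer bits, 16 fractional bits

// Convert at input: scale by the base and round to nearest.
fixed_t to_fixed(double v)
{
    assert(v > -32768.0 && v < 32767.0);  // must fit in the integer part
    return static_cast<fixed_t>(v * (1 << kFracBits) + (v >= 0 ? 0.5 : -0.5));
}

// Convert back before output.
double to_double(fixed_t f)
{
    return static_cast<double>(f) / (1 << kFracBits);
}

// Addition and subtraction are plain integer operations. Multiplication
// needs a wider intermediate, followed by a shift instead of a division.
// (Right-shifting a negative value is an arithmetic shift on mainstream
// compilers, and is guaranteed to be one from C++20 onwards.)
fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    return static_cast<fixed_t>((static_cast<std::int64_t>(a) * b) >> kFracBits);
}
```

With this format the resolution is 1/65536, so values that need finer steps or a range beyond roughly +/-32767 would require a different split of the bits.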

Point 1.
You're computing the average and the variance at the same time.
Is that right?
Don't you have to calculate the average first, then once you know it, calculate the sum of squared differences from the average?
In addition to being right, it's more likely to help performance than hurt it.
Trying to do two things in one loop is not necessarily faster than two consecutive simple loops.
Point 2.
Are you aware that there is a way to calculate average and variance at the same time, like this:
double sumsq = 0, sum = 0;
for (i = 0; i < n; i++){
double xi = x[i];
sum += xi;
sumsq += xi * xi;
}
double avg = sum / n;
double avgsq = sumsq / n;
double variance = avgsq - avg*avg;
Point 3.
The inner loops are doing repetitive indexing.
The compiler might be able to optimize that to something minimal, but I wouldn't bet my socks on it.
Point 4.
You're using gprof or something like it.
The only reasonably reliable number to come out of it is self-time by function.
It won't tell you very well how time is spent inside the function.
I and many others rely on this method, which takes you straight to the heart of what takes time.

Create sine lookup table in C++

How can I rewrite the following pseudocode in C++?
real array sine_table[-1000..1000]
for x from -1000 to 1000
sine_table[x] := sine(pi * x / 1000)
I need to create a sine_table lookup table.

You can reduce the size of your table to 25% of the original by only storing values for the first quadrant, i.e. for x in [0,pi/2].
To do that your lookup routine just needs to map all values of x to the first quadrant using simple trig identities:
sin(x) = - sin(-x), to map from quadrant IV to I
sin(x) = sin(pi - x), to map from quadrant II to I
To map from quadrant III to I, apply both identities, i.e. sin(x) = - sin (pi + x)
Whether this strategy helps depends on how much memory usage matters in your case. But it seems wasteful to store four times as many values as you need just to avoid a comparison and subtraction or two during lookup.
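A sketch of such a reduced table for the index range used in the question (x from -1000 to 1000, angle pi * x / 1000): only the entries for 0..500 are stored, and the two identities above map every other index onto them. The names make_quarter_table and sine_lookup are made up for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cmath>

constexpr int kQuarter = 500;  // table covers pi * i / 1000 for i = 0..500

std::array<double, kQuarter + 1> make_quarter_table()
{
    const double pi = std::acos(-1.0);
    std::array<double, kQuarter + 1> t{};
    for (int i = 0; i <= kQuarter; ++i)
        t[i] = std::sin(pi * i / 1000.0);
    return t;
}

// Returns sin(pi * i / 1000) for i in [-1000, 1000].
double sine_lookup(const std::array<double, kQuarter + 1>& t, int i)
{
    assert(i >= -1000 && i <= 1000);
    if (i < 0)
        return -sine_lookup(t, -i);  // sin(x) = -sin(-x)
    if (i > kQuarter)
        i = 1000 - i;                // sin(x) = sin(pi - x)
    return t[i];
}
```

The lookup costs at most two comparisons and one subtraction on top of the array access, in exchange for a table of 501 entries instead of 2001.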
I second Jeremy's recommendation to measure whether building a table is better than just using std::sin(). Even with the original large table, you'll have to spend cycles during each table lookup to convert the argument to the closest increment of pi/1000, and you'll lose some accuracy in the process.
If you're really trying to trade accuracy for speed, you might try approximating the sin() function using just the first few terms of the Taylor series expansion.
sin(x) = x - x^3/3! + x^5/5! ..., where ^ represents raising to a power and ! represents the factorial.
Of course, for efficiency, you should precompute the factorials and make use of the lower powers of x to compute higher ones, e.g. use x^3 when computing x^5.
One final point: the truncated Taylor series above is more accurate for values closer to zero, so it's still worthwhile to map to the first or fourth quadrant before computing the approximate sine.
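A sketch of the truncated series, assuming the argument has already been mapped into [-pi/2, pi/2] as described. The factorial reciprocals are written as constants and the polynomial is evaluated in nested (Horner) form so that each higher power reuses the lower ones. The function name sin_taylor and the choice of five terms are assumptions; with five terms the truncation error at pi/2 is on the order of 1e-6.

```cpp
#include <cassert>

// x - x^3/3! + x^5/5! - x^7/7! + x^9/9!, for x in roughly [-pi/2, pi/2].
double sin_taylor(double x)
{
    assert(x >= -1.6 && x <= 1.6);  // caller is expected to reduce the range
    const double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0
                  + x2 * (1.0 / 120.0
                  + x2 * (-1.0 / 5040.0
                  + x2 * (1.0 / 362880.0)))));
}
```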
Addendum:
Yet one more potential improvement based on two observations:
1. You can compute any trig function if you can compute both the sine and cosine in the first octant [0,pi/4]
2. The Taylor series expansion centered at zero is more accurate near zero
So if you decide to use a truncated Taylor series, then you can improve accuracy (or use fewer terms for similar accuracy) by mapping to either the sine or cosine to get the angle in the range [0,pi/4] using identities like sin(x) = cos(pi/2-x) and cos(x) = sin(pi/2-x) in addition to the ones above (for example, if x > pi/4 once you've mapped to the first quadrant.)
Or if you decide to use a table lookup for both the sine and cosine, you could get by with two smaller tables that only covered the range [0,pi/4] at the expense of another possible comparison and subtraction on lookup to map to the smaller range. Then you could either use less memory for the tables, or use the same memory but provide finer granularity and accuracy.

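const long double PI = std::acos(-1.0L); // needs <cmath>
long double sine_table[2001];
for (int index = 0; index < 2001; index++)
{
// sine_table[index] holds sine(pi * x / 1000) for x = index - 1000
sine_table[index] = std::sin(PI * (index - 1000) / 1000.0);
}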

One more point: calling trigonometric functions is pricey. If you want to prepare the lookup table for sine with a constant step, you may save calculation time at the expense of some potential precision loss.
Consider your minimal step is "a". That is, you need sin(a), sin(2a), sin(3a), ...
Then you may do the following trick: First calculate sin(a) and cos(a). Then for every consecutive step use the following trigonometric equalities:
sin([n+1] * a) = sin(n*a) * cos(a) + cos(n*a) * sin(a)
cos([n+1] * a) = cos(n*a) * cos(a) - sin(n*a) * sin(a)
The drawback of this method is that during this procedure the round-off error is accumulated.
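The recurrence can be sketched as follows; the function name sine_table_by_recurrence and the choice to start the table at sin(0) are assumptions for illustration. Only two trig calls are made, and every further entry costs four multiplications and two additions.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Returns {sin(0), sin(a), sin(2a), ..., sin((count-1)*a)}.
std::vector<double> sine_table_by_recurrence(double a, int count)
{
    assert(count > 0);
    const double sa = std::sin(a);  // the only two trig calls
    const double ca = std::cos(a);

    std::vector<double> table(count);
    double s = 0.0;  // sin(0)
    double c = 1.0;  // cos(0)
    for (int n = 0; n < count; ++n)
    {
        table[n] = s;  // sin(n*a)
        // Angle-addition formulas advance both values by one step.
        const double s_next = s * ca + c * sa;
        const double c_next = c * ca - s * sa;
        s = s_next;
        c = c_next;
    }
    return table;
}
```

Because the round-off error grows with the number of steps, a long table could be re-seeded with a direct std::sin and std::cos call every few hundred entries.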

double sine_table[1000] = {0};
for (int i = 1; i <= 1000; i++)
{
// sine_table[i-1] holds sin(pi * i / 1000)
sine_table[i-1] = std::sin(PI * i/ 1000.0);
}
double getSineValue(int multipleOfPi){
if(multipleOfPi == 0) return 0.0;
int sign = 1;
if(multipleOfPi < 0){
sign = -1;
}
return sign * sine_table[sign * multipleOfPi - 1];
}
You can reduce the array length to 500 with the identity sin(pi/2 +/- angle) = cos(angle).
So store sin and cos from 0 to pi/4.
I don't remember the details off the top of my head, but it increased the speed of my program.

You'll want the std::sin() function from <cmath>.

another approximation from a book or something
streamin ramp;
streamout sine;
float x,rect,k,i,j;
x = ramp -0.5;
rect = x * (1 - x < 0 & 2);
k = (rect + 0.42493299) *(rect -0.5) * (rect - 0.92493302) ;
i = 0.436501 + (rect * (rect + 1.05802));
j = 1.21551 + (rect * (rect - 2.0580201));
sine = i*j*k*60.252201*x;
full discussion here:
http://synthmaker.co.uk/forum/viewtopic.php?f=4&t=6457&st=0&sk=t&sd=a
I presume you know that division is a lot slower than multiplying by a decimal number: /5 is generally slower than *0.2
it's just an approximation.
also:
streamin ramp;
streamin x; // 1.5 = Saw 3.142 = Sin 4.5 = SawSin
streamout sine;
float saw,saw2;
saw = (ramp * 2 - 1) * x;
saw2 = saw * saw;
sine = -0.166667 + saw2 * (0.00833333 + saw2 * (-0.000198409 + saw2 * (2.7526e-006+saw2 * -2.39e-008)));
sine = saw * (1+ saw2 * sine);
