Multiply Large Complex Number Vector by Scalar efficiently C++

I'm currently trying to most efficiently do an in-place multiplication of an array of complex numbers (memory aligned the same way the std::complex would be but currently using our own ADT) by an array of scalar values that is the same size as the complex number array.
The algorithm is already parallelized, i.e. the calling object splits the work up into threads. This calculation is done on arrays in the 100s of millions - so, it can take some time to complete. CUDA is not a solution for this product, although I wish it was. I do have access to boost and thus have some potential to use BLAS/uBLAS.
I'm thinking, however, that SIMD might yield much better results, but I'm not familiar enough with how to do this with complex numbers. The code I have now is as follows (remember this is chunked up into threads which correspond to the number of cores on the target machine). The target machine is also unknown. So, a generic approach is probably best.
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
for (register int idx = start; idx < end; ++idx)
{
values[idx].real *= scalar[idx];
values[idx].imag *= scalar[idx];
}
}
fcomplex is defined as follows:
struct fcomplex
{
float real;
float imag;
};
I've tried manually unrolling the loop, as my final loop count will always be a power of 2, but the compiler is already doing that for me (I've unrolled as far as 32). I've tried a const float reference to the scalar, thinking I'd save one access, and that proved to be equal to what the compiler was already doing. I've tried STL and transform, which gave close results, but still worse. I've also tried casting to std::complex and letting it use the overloaded operator for scalar * complex for the multiplication, but this ultimately produced the same results.
So, anyone with any ideas? Much appreciation is given for your time in considering this! Target platform is Windows. I'm using Visual Studio 2008. Product cannot contain GPL code as well! Thanks so much.

You can do this fairly easily with SSE, e.g.
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
for (int idx = start; idx < end; idx += 2)
{
__m128 vc = _mm_load_ps((float *)&values[idx]);
__m128 vk = _mm_set_ps(scalar[idx + 1], scalar[idx + 1], scalar[idx], scalar[idx]);
vc = _mm_mul_ps(vc, vk);
_mm_store_ps((float *)&values[idx], vc);
}
}
Note that values and scalar need to be 16 byte aligned.
Or you could just use the Intel ICC compiler and let it do the hard work for you.
UPDATE
Here is an improved version which unrolls the loop by a factor of 2 and uses a single load instruction to get 4 scalar values which are then unpacked into two vectors:
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
for (int idx = start; idx < end; idx += 4)
{
__m128 vc0 = _mm_load_ps((float *)&values[idx]);
__m128 vc1 = _mm_load_ps((float *)&values[idx + 2]);
__m128 vk = _mm_load_ps(&scalar[idx]);
__m128 vk0 = _mm_shuffle_ps(vk, vk, 0x50);
__m128 vk1 = _mm_shuffle_ps(vk, vk, 0xfa);
vc0 = _mm_mul_ps(vc0, vk0);
vc1 = _mm_mul_ps(vc1, vk1);
_mm_store_ps((float *)&values[idx], vc0);
_mm_store_ps((float *)&values[idx + 2], vc1);
}
}
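If the 16-byte alignment or the even element count cannot be guaranteed (for example when a thread's chunk starts at an odd index), a variant with unaligned loads and a scalar tail is one option. The following is only a sketch under that assumption: the function name cmult_scalar_inplace_any is made up for illustration, and the preprocessor check simply falls back to the plain loop when SSE is not available.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define CMULT_HAVE_SSE 1
#else
#define CMULT_HAVE_SSE 0
#endif

struct fcomplex
{
    float real;
    float imag;
};

// Hypothetical variant: no alignment requirement, any element count.
void cmult_scalar_inplace_any(fcomplex *values, int start, int end, const float *scalar)
{
    int idx = start;
#if CMULT_HAVE_SSE
    // Main loop: two complex values (four floats) per iteration.
    for (; idx + 2 <= end; idx += 2)
    {
        float *p = reinterpret_cast<float *>(&values[idx]);
        __m128 vc = _mm_loadu_ps(p); // unaligned load
        // Lanes (low to high): s[idx], s[idx], s[idx+1], s[idx+1]
        __m128 vk = _mm_set_ps(scalar[idx + 1], scalar[idx + 1], scalar[idx], scalar[idx]);
        _mm_storeu_ps(p, _mm_mul_ps(vc, vk)); // unaligned store
    }
#endif
    // Scalar tail (or the whole range when SSE is unavailable).
    for (; idx < end; ++idx)
    {
        values[idx].real *= scalar[idx];
        values[idx].imag *= scalar[idx];
    }
}
```

Unaligned loads may be slower than aligned ones on older processors, so it is worth measuring both versions on the target machine.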

Your best bet will be to use an optimised BLAS which will take advantage of whatever is available on your target platform.

One problem I see is that inside the function it's hard for the compiler to prove that the scalar pointer does not point into the middle of the complex array (scalar could in theory point to the real or imaginary part of one of the complex values).
This actually forces the order of evaluation.
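One way to tell the compiler that the two arrays do not overlap is a restrict qualifier. This is a sketch, not a guaranteed speed-up: __restrict is a compiler extension (accepted by MSVC, GCC and Clang) rather than standard C++, and the function name is made up for illustration.

```cpp
struct fcomplex
{
    float real;
    float imag;
};

// __restrict promises the compiler that values and scalar never alias.
// Breaking that promise is undefined behaviour, so only use it when it is true.
void cmult_scalar_inplace_restrict(fcomplex *__restrict values, int start, int end,
                                   const float *__restrict scalar)
{
    for (int idx = start; idx < end; ++idx)
    {
        const float s = scalar[idx]; // read once; the stores below cannot change it
        values[idx].real *= s;
        values[idx].imag *= s;
    }
}
```

Copying scalar[idx] into a local before the two stores has a similar effect for a single iteration even without the qualifier.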
Another problem I see is that here the computation is so simple that other factors will influence the raw speed, therefore if you really care about performance the only solution is in my opinion to implement several variations and test them at runtime on the user machine to discover what is the fastest.
What I'd consider is using different unrolling sizes, and also playing with the alignment of scalar and values (the memory access pattern can have a big influence on caching effects).
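A rough sketch of the "implement several variations and test them at runtime" idea follows. All names here (cmult_fn, cmult_plain, cmult_unroll2, pick_fastest) are hypothetical, the two variants are only placeholders for whatever implementations you actually want to compare, and it assumes a C++11 compiler for std::chrono.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

struct fcomplex
{
    float real;
    float imag;
};

typedef void (*cmult_fn)(fcomplex *, int, int, const float *);

// Variant 1: the straightforward loop.
void cmult_plain(fcomplex *v, int start, int end, const float *s)
{
    for (int i = start; i < end; ++i) { v[i].real *= s[i]; v[i].imag *= s[i]; }
}

// Variant 2: unrolled by two, with a scalar tail for odd counts.
void cmult_unroll2(fcomplex *v, int start, int end, const float *s)
{
    int i = start;
    for (; i + 2 <= end; i += 2)
    {
        const float s0 = s[i], s1 = s[i + 1];
        v[i].real *= s0;     v[i].imag *= s0;
        v[i + 1].real *= s1; v[i + 1].imag *= s1;
    }
    for (; i < end; ++i) { v[i].real *= s[i]; v[i].imag *= s[i]; }
}

// Times every candidate on scratch data and returns the index of the fastest one.
std::size_t pick_fastest(const std::vector<cmult_fn> &candidates, int n, int repeats)
{
    std::vector<fcomplex> data(n, fcomplex{1.0f, 1.0f});
    std::vector<float> scalars(n, 1.0f); // multiplying by 1 keeps the values stable
    std::size_t best = 0;
    std::chrono::steady_clock::duration best_time = std::chrono::steady_clock::duration::max();
    for (std::size_t c = 0; c < candidates.size(); ++c)
    {
        const std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < repeats; ++r)
            candidates[c](data.data(), 0, n, scalars.data());
        const std::chrono::steady_clock::duration elapsed = std::chrono::steady_clock::now() - t0;
        if (elapsed < best_time) { best_time = elapsed; best = c; }
    }
    return best;
}
```

The program would call pick_fastest once at startup, with n close to the real chunk size so the cache behaviour is representative, and then use the selected function pointer for the actual work.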
For the problem of the unwanted serialization an option is to see what is the generated code for something like
float r0 = values[i].real, i0 = values[i].imag, s0 = scalar[i];
float r1 = values[i+1].real, i1 = values[i+1].imag, s1 = scalar[i+1];
float r2 = values[i+2].real, i2 = values[i+2].imag, s2 = scalar[i+2];
values[i].real = r0*s0; values[i].imag = i0*s0;
values[i+1].real = r1*s1; values[i+1].imag = i1*s1;
values[i+2].real = r2*s2; values[i+2].imag = i2*s2;
because here the optimizer has in theory a little bit more freedom.

Do you have access to Intel's Integrated Performance Primitives?
Intel's Integrated Performance Primitives (IPP) has a number of functions that handle cases like this with pretty decent performance. You might have some success with your particular problem, but I would not be surprised if your compiler already does a decent job of optimizing the code.

Related

Can OpenMP's SIMD directive vectorize indexing operations?

Say I have an MxN matrix (SIG) and a list of Nx1 fractional indices (idxt). Each fractional index in idxt uniquely corresponds to the same position column in SIG. I would like to index to the appropriate value in SIG using the indices stored in idxt, take that value and save it in another Nx1 vector. Since the indices in idxt are fractional, I need to interpolate in SIG. Here is an implementation that uses linear interpolation:
void calcPoint(const Eigen::Ref<const Eigen::VectorXd>& idxt,
const Eigen::Ref<const Eigen::Matrix<short int, -1, -1>>& SIG,
double& bfVal) {
Eigen::VectorXd bfPTVec(idxt.size());
#pragma omp simd
for (int j = 0; j < idxt.size(); j++) {
int vIDX = static_cast<int>(idxt(j));
double interp1 = vIDX + 1 - idxt(j);
double interp2 = idxt(j) - vIDX;
bfPTVec(j) = (SIG(vIDX,j)*interp1 + SIG(vIDX+1,j)*interp2);
}
bfVal = ((idxt.array() > 0.0).select(bfPTVec,0.0)).sum();
}
I suspect there is a better way to implement the body of the loop here that would help the compiler better exploit SIMD operations. For example, as I understand it, forcing the compiler to cast between types, both explicitly as the first line does and implicitly as some of the mathematical operations do is not a vectorizable operation.
Additionally, by making the access to SIG dependent on values in idxt which are calculated at runtime I'm not clear if the type of memory read-write I'm performing here is vectorizable, or how it could be vectorized. Looking at the big picture description of my problem where each idxt corresponds to the same "position" column as SIG, I get a sense that it should be a vectorizable operation, but I'm not sure how to translate that into good code.
Clarification
Thanks to the comments, I realized I hadn't specified that certain values that I don't want contributing to the final summation in idxt are set to zero when idxt is initialized outside of this method. Hence the last line in the example given above.
Theoretically, it should be possible, assuming the processor supports this operation. In practice, however, this is often not the case, for several reasons.
First of all, mainstream x86-64 processors supporting the AVX2 (or AVX-512) instruction set do have instructions for this: the SIMD gather instructions. Unfortunately, they are quite limited: you can only fetch 32-bit/64-bit values from memory based on 32-bit/64-bit indices. Moreover, these instructions are not very efficiently implemented on mainstream processors yet. Indeed, they fetch every item separately, which is not faster than scalar code. They can still be useful if the rest of the code is vectorized, since reading many scalar values to fill a SIMD register manually tends to be a bit less efficient (although the manual approach was surprisingly faster on old processors due to a quite inefficient early implementation of the gather instructions). Note that if the SIG matrix is big, then cache misses will significantly slow down the code.
Additionally, AVX2 is not enabled by default by mainstream compilers because not all x86-64 processors support it. Thus, you need to enable AVX2 (e.g. using -mavx2) so that compilers can vectorize the loop efficiently. Unfortunately, this is not enough. Indeed, most compilers currently fail to automatically detect when these instructions can/should be used. Even if they could, the fact that IEEE-754 floating-point operations are not associative and that values can be infinity or NaN generally does not help them generate efficient code (although it should be fine here). Note that you can tell your compiler that operations can be assumed to be associative and that you use only finite real numbers (e.g. using -ffast-math, which can be unsafe). The same applies to Eigen types/operators if the compiler fails to completely inline all the functions (which is the case for ICC).
To speed up the code, you can try changing the type of the SIG variable to a matrix reference containing int32_t items, since the gather instructions cannot fetch 16-bit values. Another possible optimization is to split the loop into small fixed-size chunks (e.g. 32 items) and then split each chunk into several loops, so that the indirection is computed in a separate loop and compilers can vectorize at least some of the loops. Some compilers, like Clang, are able to do that automatically for you: they generate a fast SIMD implementation for part of the loop and perform the indirections using scalar instructions. If this is not enough (which appears to be the case so far), then you certainly need to vectorize the loop yourself using SIMD intrinsics (or possibly use a SIMD library that does that for you).
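The chunk-and-split idea could look roughly like the sketch below. It deliberately uses raw pointers instead of Eigen types to stay self-contained, and it makes several assumptions that are not in the question: SIG is stored column major with rows entries per column, every index satisfies 0 <= idxt[j] < rows - 1, and the name interpolate_sum is made up. Entries with idxt[j] == 0 are skipped, mirroring the select in the original code.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed layout: element (r, j) of SIG lives at sig[j * rows + r] (column major).
double interpolate_sum(const double *idxt, const std::int16_t *sig,
                       std::size_t rows, std::size_t n)
{
    const std::size_t kChunk = 32;
    double lo[kChunk], hi[kChunk], frac[kChunk];
    double total = 0.0;
    for (std::size_t base = 0; base < n; base += kChunk)
    {
        const std::size_t len = (n - base < kChunk) ? (n - base) : kChunk;
        // Pass 1: the data-dependent loads (indirection), done with scalar code.
        for (std::size_t k = 0; k < len; ++k)
        {
            const std::size_t j = base + k;
            const std::size_t v = static_cast<std::size_t>(idxt[j]); // truncation
            const std::int16_t *col = sig + j * rows;
            const bool keep = idxt[j] > 0.0; // zero indices do not contribute
            lo[k] = keep ? static_cast<double>(col[v]) : 0.0;
            hi[k] = keep ? static_cast<double>(col[v + 1]) : 0.0;
            frac[k] = idxt[j] - static_cast<double>(v);
        }
        // Pass 2: pure arithmetic on contiguous arrays, a candidate for auto-vectorization.
        for (std::size_t k = 0; k < len; ++k)
            total += lo[k] * (1.0 - frac[k]) + hi[k] * frac[k];
    }
    return total;
}
```

Whether the second pass is actually vectorized depends on the compiler and flags (the reduction into total may need -ffast-math or an OpenMP simd reduction), so checking the generated assembly is advisable.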
Probably not, but I would expect a manually vectorized version to be faster.
Below is an example of that inner loop, untested. It doesn't use AVX, only SSE up to 4.1, and should be compatible with the Eigen matrices you have there.
The pIndex input pointer should point to the j-th element of your idxt vector, and pSignsColumn should point to the start of the j-th column of the SIG matrix. It assumes your SIG matrix is column major. That is normally the default memory layout in Eigen, but it is possible to override it with template arguments, and probably with macros as well.
inline double computePoint( const double* pIndex, const int16_t* pSignsColumn )
{
// Load the index value into both lanes of the vector
__m128d idx = _mm_loaddup_pd( pIndex );
// Convert into int32 with truncation; this assumes the number there ain't negative.
const int iFloor = _mm_cvttsd_si32( idx );
// Compute fractional part
idx = _mm_sub_pd( idx, _mm_floor_pd( idx ) );
// Compute interpolation coefficients, they are [ 1.0 - idx, idx ]
idx = _mm_addsub_pd( _mm_set_sd( 1.0 ), idx );
// Load two int16_t values from sequential addresses
const __m128i signsInt = _mm_loadu_si32( pSignsColumn + iFloor );
// Upcast them to int32, then to fp64
const __m128d signs = _mm_cvtepi32_pd( _mm_cvtepi16_epi32( signsInt ) );
// Compute the result
__m128d res = _mm_mul_pd( idx, signs );
res = _mm_add_sd( res, _mm_unpackhi_pd( res, res ) );
// The above 2 lines (3 instructions) can be replaced with the following one:
// const __m128d res = _mm_dp_pd( idx, signs, 0b110001 );
// It may or may not be better, the dppd instruction is not particularly fast.
return _mm_cvtsd_f64( res );
}

Eigen: Efficiently storing the output of a matrix evaluation in a raw pointer

I am using some legacy C code that passes around lots of raw pointers. To interface with the code, I have to pass a function of the form:
const int N = ...;
T * func(T * x) {
// TODO Put N elements in x
return x + N;
}
where this function should write the result into x, and then return a pointer just past the last element written (x + N).
Internally, in this function, I am using Eigen extensively to perform some calculations. Then I write the result back to the raw pointer using the Map class. A simple example which mimics what I am doing is this:
const int N = 5;
T * func(T * x) {
// Do a lot of operations that result in some matrices like
Eigen::Matrix<T, N, 1 > A = ...
Eigen::Matrix<T, N, 1 > B = ...
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
return x + N;
}
Obviously, there is much more complicated stuff going on internally, but that is the gist of it... Do some calculations with Eigen, then use the Map class to write the result back to the raw pointer.
Now the problem is that when I profile this code with Callgrind, and then view the results with KCachegrind, the lines
constraint = A - B;
are almost always the bottleneck. This is sort of understandable, because such lines could/are potentially doing three things:
Constructing the Map object
Performing the calculation
Writing the result to the pointer
So it is understandable that this line might have the longest runtime. But I am a little bit worried that perhaps I am somehow doing an extra copy in that line before the data gets written to the raw pointer.
So is there a better way of writing the result to the raw pointer? Or is that the idiom I should be using?
In the back of my mind, I am wondering if using the placement new syntax would buy me anything here.
Note: This code is mission critical and should run in realtime, so I really need to squeeze every ounce of speed out of it. For instance, getting this call from a runtime of 0.12 seconds to 0.1 seconds would be huge for us. But code legibility is also a huge concern since we are constantly tweaking the model used in the internal calculations.
These two lines of code:
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
are essentially compiled by Eigen as:
for(int i=0; i<N; ++i)
x[i] = A[i] - B[i];
The reality is a bit more complicated because of explicit unrolling, and explicit vectorization (both depends on T), but that's essentially it. So the construction of the Map object is essentially a no-op (it is optimized away by any compiler) and no, there is no extra copy going on here.
Actually, if your profiler is able to tell you that the bottleneck lies in this simple expression, then that very likely means that this piece of code has not been inlined, meaning that you did not enable compiler optimization flags (like -O3 with gcc/clang).

Calculate average using SSE with STL vectors

I'm trying to learn about vectorisation, and rather than reinvent the wheel I'm using Agner Fog's vector library.
Here's my original C++/STL code
#include <algorithm>
#include <iterator>
#include <vector>
#include <boost/assign/std/vector.hpp>
#include <vectorclass.h>
template<typename T>
double mean_v1(T begin,T end) {
float mean = 0;
std::for_each(begin,end,[&mean](const double& d) { mean+=d; });
return mean / std::distance(begin,end);
}
template<typename T>
double mean_v2(T begin,T end) {
float mean = 0;
const int distance = std::distance(begin,end); // This is expensive
const int loop = ( distance >> 2)+1; // divide by 4
const int partial = distance & 3; // remainder 4
Vec4f vec; // four floats, matching the std::vector<float> input
for(int i = 0; i < loop;++i) {
if(i == (loop-1)) {
vec.load_partial(partial,&*begin); // load only the leftover elements
mean += horizontal_add(vec);
}
else {
vec.load(&*begin);
mean += horizontal_add(vec); // accumulate the running sum
begin+=4; // This is expensive
}
}
return mean / distance;
}
int main(int argc,char**argv) {
using namespace boost::assign;
std::vector<float> numbers;
// Note 13 numbers, which won't fit into a sse register perfectly
numbers+=39.57,39.57,39.604,39.58,39.61,31.669,31.669,31.669,31.65,32.09,33.54,32.46,33.45;
const float mean1 = mean_v1(numbers.begin(),numbers.end());
const float mean2 = mean_v2(numbers.begin(),numbers.end());
return 0;
}
Both v1 and v2 work correctly and they both take about the same time. However, profiling shows that std::distance() and moving the iterator along take almost 45% of the total time. The vector adds take just 0.8%, which is significantly faster than v1.
Searching the web, all the examples seem to deal with a number of values that fits precisely into the SSE registers. How do people deal with odd numbers of values, e.g. in this example, where setting up the loop takes a lot longer than the calculation?
I'm thinking there must be best practices or ideas on how to deal with this scenario.
Assume I can't change the interface of mean() to take float[], but must use iterators.
You're mixing float and double unnecessarily. In particular, because your accumulator is not a double, your precision is destroyed and won't be close to satisfactory for larger series.
As the arithmetic is super lightweight, what's destroying your performance here is most likely memory access; read up on memory cache lines and how they work. Basically, what you need to do here is probe ahead: some processors have explicit instructions for pulling data into the cache, otherwise you can perform a load at a memory location ahead of time. Create another level of nesting in your loop and at regular intervals prime the cache with data you know you will get to in a few iterations.
What people do to maximize performance is spend a lot of time designing their data layout. You shouldn't need to do an intermediate transformation on your data. So people allocate aligned memory (most SIMD instruction sets either require aligned access or impose grave penalties for reading/writing unaligned memory), and then they try to aggregate data in such a way that it fits the instruction set. In fact, it's often a win to pad your data up to whatever register size the instruction set supports. So if, let's say, you're going to process 3-dimensional vectors, padding with an extra unused element will almost always be a big win.
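For the "odd number of values" part of the question, a common pattern is a main loop over full blocks plus a scalar tail, with the element count computed once. The sketch below is plain C++ rather than the vector class library, so it is only an illustration of the loop structure, not of that library's API. It assumes random-access iterators, the name mean_blocked is made up, and it uses double accumulators as suggested above.

```cpp
#include <cstddef>
#include <iterator>

// Assumes random-access iterators (e.g. std::vector<float>::iterator or a raw pointer).
template <typename It>
double mean_blocked(It begin, It end)
{
    const std::ptrdiff_t n = std::distance(begin, end); // O(1) for random access
    if (n <= 0) return 0.0;
    double acc[4] = {0.0, 0.0, 0.0, 0.0}; // four independent double accumulators
    const std::ptrdiff_t full = n - (n % 4); // elements covered by full blocks
    std::ptrdiff_t i = 0;
    for (; i < full; i += 4) // main loop: no branch on "is this the last block?"
    {
        acc[0] += begin[i];
        acc[1] += begin[i + 1];
        acc[2] += begin[i + 2];
        acc[3] += begin[i + 3];
    }
    double tail = 0.0;
    for (; i < n; ++i) // at most three leftover elements
        tail += begin[i];
    return ((acc[0] + acc[1]) + (acc[2] + acc[3]) + tail) / static_cast<double>(n);
}
```

Keeping the remainder handling outside the main loop removes the per-iteration branch, and the four independent accumulators give the compiler a straightforward mapping onto SIMD lanes.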

How to use std::accumulate to neatly sum values in a vector pointed by separately defined indices (replacing loops)

I was wondering if there's a neater (or better yet, more efficient), method of summing values of a vector/(asymmetric) matrix (a matrix having structure like symmetry, could of course be exploited in looping, but not that pertinent to my question) pointed by a collection of indices. Basically this code could be used to calculate, say, a cost of a route through a 2D matrix. I'm looking for a way to utilize CPU, not GPU.
Here's some relevant code; the one I'm more interested in is the first case. I was thinking it's possible to use std::accumulate with a lambda to capture the indices vector, but then I got wondering if there's already a neater way, perhaps with some other operator. Not a "real problem", as looping is quite clear for my tastes too, but I'm on the hunt for the super-neat or more efficient one-liner...
template<typename out_type>
out_type sum(std::vector<float> const& matrix, std::vector<int> const& indices)
{
out_type cost = 0;
for(decltype(indices.size()) i = 0; i < indices.size() - 1; ++i)
{
const int index = indices.size() * indices[i] + indices[i + 1];
cost += matrix[index];
}
const int index = indices.size() * indices[indices.size() - 1] + indices[0];
cost += matrix[index];
return cost;
}
template<typename out_type>
out_type sum(std::vector<std::vector<float>> const& matrix, std::vector<int> const& indices)
{
out_type cost = 0;
for(decltype(indices.size()) i = 0; i < indices.size() - 1; i++)
{
cost += matrix[indices[i]][indices[i + 1]];
}
cost += matrix[indices[indices.size() - 1]][indices[0]];
return cost;
}
Oh, and PPL/TBB are fair game too.
Edit
As an afterthought and as commented to John, would there be a place to employ std::common_type in the calculation as the input and output types may differ? This is a bit of hand-waving and more like learning techniques and libraries. A form of code kata, if you will.
Edit 2
Now, there's one option to make the loops faster, explained in blog writing How to process a STL vector using SSE code by a blogger theowl84. The code uses __m128 directly, but I wonder if there's something in DirectXMath library too.
Edit 3
Now, after writing some concrete code, I found std::accumulate wouldn't get me far. Or at least I couldn't find a way to do the [indices[i + 1] part in matrix[indices[i]][indices[i + 1]]; in a neat way, as std::accumulate itself gives access to only the current value and the sum. In that light, it looks like novelocrat's approach would be the most fruitful one.
DeadMG proposed using parallel_reduce with associativity caveats, further commented on by novelocrat. I didn't check whether I could use parallel_reduce, as the interface looked somewhat cumbersome for a quick try. Other than that, even though my code executes serially, it would suffer from the same floating-point issues as the parallel reduction version, though I think the parallel version could be (much) more unpredictable than the serial version.
This is somewhat tangential, but those who have stumbled here and read this far may be (very) interested in the article Wandering Precision in The NAG blog, which details some intricacies even introduced by hardware instruction re-ordering! Then there are some ruminations about this very issue in a distributed setting in the #AltDevBlogADay article Synchronous RTS Engines and a Tale of Desyncs. Also, ACCU (the general mailing list is excellent, by the way, and it's free to join) features several articles on floating point accuracy. As a tangent to the tangent, I found Fernando Cacciola's Robustness issues in geometric computing to be a good article to read, originally from the ACCU mailing list.
And then there is std::common_type. I couldn't find a use for it. If I had two different types as parameters, then the return value could/should be decided by std::common_type. Perhaps more pertinent is std::is_convertible with static_assert to make sure the desired result type is convertible from the argument types (with a clean error message). Other than that, I can only make up a check that the accuracy of the return value/intermediate calculation value is sufficient to represent the result of the summation without overflows and things like that, but I haven't come across a standard facility for that.
That's about it, I think, ladies and gentlemen. I enjoyed myself, and I hope those reading this got something out of it too.
You could produce an iterator that takes matrix and indices and yields the appropriate values.
class route_iterator
{
vector<vector<float>> const& matrix;
vector<int> const& indices;
int i;
public:
route_iterator(vector<vector<float>> const& matrix_, vector<int> const& indices_,
int begin = 0)
: matrix(matrix_), indices(indices_), i(begin)
{ }
float operator*() {
return matrix[indices[i]][indices[(i + 1) % indices.size()]];
}
route_iterator& operator++() {
++i;
return *this;
}
};
Then your accumulate runs from route_iterator(matrix, indices) to route_iterator(matrix, indices, indices.size()).
Admittedly, though, this sequentializes without a smart compiler turning it into something parallel. What you really want are parallel map and fold (accumulate) operations.
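To actually pass such an iterator to std::accumulate it also needs comparison operators and the usual iterator typedefs. A fuller sketch might look like this; it stores pointers instead of references so the iterator stays copy-assignable, and the helper route_cost is a made-up name for illustration.

```cpp
#include <cstddef>
#include <iterator>
#include <numeric>
#include <vector>

class route_iterator
{
    const std::vector<std::vector<float> > *matrix;
    const std::vector<int> *indices;
    std::size_t i;
public:
    // Typedefs expected by standard algorithms.
    typedef std::input_iterator_tag iterator_category;
    typedef float value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const float *pointer;
    typedef float reference;

    route_iterator(const std::vector<std::vector<float> > &matrix_,
                   const std::vector<int> &indices_, std::size_t begin = 0)
        : matrix(&matrix_), indices(&indices_), i(begin) {}

    // Cost of the edge from stop i to stop i + 1, wrapping around at the end.
    float operator*() const
    {
        const std::vector<int> &ix = *indices;
        return (*matrix)[ix[i]][ix[(i + 1) % ix.size()]];
    }
    route_iterator &operator++() { ++i; return *this; }
    route_iterator operator++(int) { route_iterator old(*this); ++i; return old; }
    bool operator==(const route_iterator &other) const { return i == other.i; }
    bool operator!=(const route_iterator &other) const { return i != other.i; }
};

// Sums the closed route, including the edge from the last stop back to the first.
float route_cost(const std::vector<std::vector<float> > &matrix,
                 const std::vector<int> &indices)
{
    return std::accumulate(route_iterator(matrix, indices),
                           route_iterator(matrix, indices, indices.size()), 0.0f);
}
```

Note that the comparison only looks at the position, so it is only meaningful for two iterators over the same matrix and indices.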
out_type cost = 0;
for(decltype(indices.size()) i = 0; i < indices.size() - 1; i++)
{
cost += matrix[indices[i]][indices[i + 1]];
}
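This is basically std::accumulate. PPL provides (and so does TBB, if I recall) parallel_reduce. This requires associativity but not commutativity. + over the reals and integers is associative; floating-point + is only approximately associative, so a parallel reduction may differ from the serial sum in the last few bits.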
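Staying serial, one standard-library alternative to std::accumulate that does give access to consecutive pairs is std::inner_product run over indices and the same range shifted by one. This is a different technique from parallel_reduce, shown only as a sketch; the name route_sum is made up and a C++11 compiler is assumed for the lambda.

```cpp
#include <functional>
#include <numeric>
#include <vector>

template <typename out_type>
out_type route_sum(const std::vector<std::vector<float> > &matrix,
                   const std::vector<int> &indices)
{
    if (indices.empty()) return out_type();
    // Pairs (indices[i], indices[i + 1]) are mapped to a matrix entry by the
    // second functor and added up by the first one.
    const out_type open_path = std::inner_product(
        indices.begin(), indices.end() - 1, indices.begin() + 1, out_type(),
        std::plus<out_type>(),
        [&matrix](int from, int to) { return static_cast<out_type>(matrix[from][to]); });
    // Close the route: last stop back to the first.
    return open_path + static_cast<out_type>(matrix[indices.back()][indices.front()]);
}
```

A call such as route_sum<double>(matrix, indices) accumulates in double even though the matrix holds floats, which also addresses the question of differing input and output types.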

How much can this c/c++ loop be optimized? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I am a newbie on optimization. I have been reading some reference on how to optimize c++ code but I have a hard time applying it to real code. Therefore I just want to gather some real world optimization technique on how to squeeze as much juice from the CPU/Memory as possible from the loop below
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
for(int t = 0; t < T; ++t){
sum += fun(a,b,c,d,e,f,sum);
*(array+t) = sum;
}
where a,b,c,d,e,f are double and T is int. Anything including but not limited to memory alignment, parallelism, OpenMP/MPI, and SSE instructions is welcome. The compiler is standard gcc, Microsoft, or any commonly available compiler. If the solution is compiler specific, please specify the compiler and any option flags associated with your solution.
Thanks!
PS: Forgot to mention the properties of fun. Please assume that it is a simple function with no loop inside, composed of only basic arithmetic operations. Simply think of it as an inline function.
EDIT2: since the details of fun are important, please forget the parameters c, d, e, f and assume fun is defined as
inline double fun(double a, double b, double sum){
return sum + a* ( b - sum);
}
Since sum depends on its previous values in a non-trivial way, it is impossible to parallelize the code (so OpenMP and MPI are out).
Memory alignment and SSE should be enforced/used automatically with appropriate compiler settings.
Besides inlining fun and unrolling the loop (by compiling in -O3) there is not much we can do if fun is completely general.
Since fun(a,b,sum) = sum + a*(b-sum), we have the closed form
$$\mathrm{array}[t] = \frac{ab}{a-1}\left(1 - (2-a)^{t+1}\right)$$
which can be vectorized and parallelized, but the division and exponentiation can be very expensive.
Nevertheless, with the closed form we can start the loop from any index, e.g. create 2 threads, one from t = 0 to T/2-1 and another from t = T/2 to T-1, which perform the original loop, but with the initial sum computed using the above closed-form solution. Also, if only a few values from the array are needed, they can be computed lazily.
And for SSE, you could first fill the array with (2-a)^(t+1), then apply x :-> x - 1 to the whole array, and then apply x :-> x * c to the whole array where c = a*b/(1-a), but there may be automatic vectorization already.
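The "start the loop from any index" idea can be sketched as follows. The names closed_form and fill_chunk are hypothetical, the closed form is only valid for a != 1, and the threading itself is left out: each thread would simply call fill_chunk on its own [begin, end) range.

```cpp
#include <cmath>
#include <cstddef>

// Closed form for array[t]; assumes a != 1 (otherwise this divides by zero).
double closed_form(double a, double b, std::size_t t)
{
    return a * b / (a - 1.0) * (1.0 - std::pow(2.0 - a, static_cast<double>(t + 1)));
}

// Fills array[begin, end) without needing any earlier entries.
void fill_chunk(double *array, std::size_t begin, std::size_t end, double a, double b)
{
    if (begin >= end) return;
    double sum = closed_form(a, b, begin); // seed from the closed form
    array[begin] = sum;
    for (std::size_t t = begin + 1; t < end; ++t)
    {
        sum += sum + a * (b - sum); // same update as sum += fun(a, b, sum)
        array[t] = sum;
    }
}
```

Because pow and the recurrence round differently, values at chunk boundaries may differ from the purely sequential result in the last few bits.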
Unless fun() is very trivial - in which case consider inline, it is likely to dominate anything else you can do to the loop.
You probably want to look at the algorithm inside fun()
one (very) minor optimization that can be done is:
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
double* pStart = array;
double* pEnd = array + T;
while(pStart < pEnd)
{
sum += fun(a,b,c,d,e,f,sum);
*pStart++ = sum;
}
This eliminates the addition of t to array for every iteration of the loop; the incrementation of t is replaced by the incrementation of pStart. For small numbers of iterations (think fewer than 3, in which case the loop should be unrolled), there is no real gain. The compiler should do this automatically, but sometimes it needs a little encouragement.
Also, depending on the size range of T, it might be possible to gain performance by using a variable-sized array (which would be stack allocated) or aligned _alloca.
The code you have above is about as fast as you can make it.
Memory alignment is normally handled well enough by malloc anyway.
The code cannot be parallelized because f is a function of previous sums (so you can't break up the computation into chunks).
The computations are not specified so it's unclear whether SSE or CUDA or anything like that would be applicable.
Likewise, you can't perform any useful loop-unrolling based on the properties of f because we don't know what it does.
(Stylistically, I'd use array[t] since it's clearer what's going on and it is no slower.)
Edit: now that we have f(a,b,sum) = sum + a*(b-sum), we can try loop unrolling by hand to see if there's some pattern. Like so (where I'm using ** to mean "to the power of"):
sum(n) = sum(n-1) + sum(n-1) + a*(b-sum(n-1)) = (2-a)*sum(n-1) + a*b
sum(n) = (2-a)*( (2-a)*sum(n-2) + a*b ) + a*b
. . .
sum(n) = a*b*(2-a)**n + a*b*(2-a)**(n-1) + ... + a*b
sum(n) = a*b*( (2-a)**0 + (2-a)**1 + ... + (2-a)**n )
Well, now, isn't that interesting! We've converted from a recurrent formula to a geometric series! And, you may recall that the geometric series
SUM( x**k , k = 0..n ) = (x**(n+1) - 1) / (x - 1)
so that
sum(n) = a*b*( (pow(2-a,n+1) - 1) / (1-a) )
Now that you've done that math, you can start in on the sum anywhere you want (using the somewhat expensive pow computation). If you have M free processors, and your array is long, you can split it into M equal pieces, use the above computation to find the first sum, and then use the recurrence formula you were using before (with the function) to fill the rest in.
At the very least, you could calculate a*b and 2-a separately and use those instead of the existing function:
sum = ab + twonega*sum
That cuts the math in your inner loop in half, approximately.
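Spelled out as a loop, that simplification might look like the sketch below (fill_recurrence is a made-up name; ab and twonega are the two precomputed constants mentioned above, i.e. a*b and 2-a).

```cpp
#include <cstddef>

void fill_recurrence(double *array, std::size_t T, double a, double b)
{
    const double ab = a * b;        // loop-invariant product
    const double twonega = 2.0 - a; // loop-invariant factor
    double sum = 0.0;
    for (std::size_t t = 0; t < T; ++t)
    {
        sum = ab + twonega * sum; // one multiply and one add per element
        array[t] = sum;
    }
}
```

Since the operations are reassociated compared with sum += fun(a, b, sum), the results can differ from the original loop by a rounding error or two.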
Accept @KennyTM's answer. He is wrong to state that the computation is not parallelisable, as he goes on to show. In showing that you can rewrite your recurrence relation in closed form he illustrates a very general principle of optimising programs: choose the best algorithm you can find and implement that. None of the micro-optimisations that the other answers suggest will come close to computing the closed form and spreading the computation across many processors in parallel.
And, lest anyone offer the suggestion that this is just an example for learning parallelisation, I contend that @KennyTM's answer still holds good: don't learn to optimise fragments of code, learn to optimise computations. Choose the best algorithm for your purposes, implement it well and only then worry about performance.
Have a look at callgrind, part of the valgrind toolset. Run your code through that and see if anything sticks out as taking an unusually large amount of time. Then you'll know what needs optimizing. Otherwise, you're just guessing, and you (as well as the rest of us) will likely guess wrong.
Just a couple of suggestions that haven't come up yet. I'm a bit out of date when it comes to modern PC-style processors, so they may make no significant difference.
Using float might be faster than double if you can tolerate the lower precision. Integer-based fixed point might be faster still, depending on how well floating point operations are pipelined.
Counting down from T to zero (and incrementing array each iteration) might be slightly faster - certainly, on an ARM processor this would save one cycle per loop.
Another very minor optimization would be to turn the for() into
while (T--)
as comparing with zero is usually faster than comparing two arbitrary integers.
I would enable vector processing in the compiler. You could rewrite the code to unroll the loops, but the compiler will do it for you if it is a recent version.
You could use t+array as the for loop increment, which means your array index won't use a multiply. Again, the optimizer might already do this.
You could use the compiler switch that dumps the generated assembler code, and use that to see what you can change in the code to make it run faster.
Following the excellent answer by @KennyTM, I'd say the fastest way to do it sequentially should be:
double acc = 1, *array;
array = (double*) malloc(T * sizeof(double));
// the compiler should be able to derive those constant expressions, but let's do it explicitly anyway
const double k = a*b/(a-1);
const double twominusa = 2 - a;
for(int t = 0; t < T; ++t){
acc *= twominusa;
array[t] = k*(1-acc);
}
Rather than having the compiler unroll the loop, you could unroll the loop yourself and add some data prefetching. Search the web for data driven design c++. Here is an example of loop unrolling and prefetching data:
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
// Calculate the number iterations and the
// remaining iterations.
unsigned int iterations = T / 4;
unsigned int remaining_iterations = T % 4;
double sum1;
double sum2;
double sum3;
double sum4;
double * p_array = array;
for (unsigned int i = 0; i < iterations; ++i)
{
// Do some data precalculation
sum += fun(a,b,c,d,e,f,sum);
sum1 = sum;
sum += fun(a,b,c,d,e,f,sum);
sum2 = sum;
sum += fun(a,b,c,d,e,f,sum);
sum3 = sum;
sum += fun(a,b,c,d,e,f,sum);
sum4 = sum;
// Do a "block" transfer to the array.
p_array[0] = sum1;
p_array[1] = sum2;
p_array[2] = sum3;
p_array[3] = sum4;
p_array += 4;
}
// Handle the few remaining calculations
for (unsigned int t = 0; t < remaining_iterations; ++t)
{
sum += fun(a,b,c,d,e,f,sum);
p_array[t] = sum;
}
The big hit here is the call to the fun function. There are hidden setup and restore instructions involved when executing a function. Also, the call forces a branch which will cause instruction pipelines to be flushed and reloaded (or cause to processor to waste time in branch prediction).
Another time performance hit is the number of variables passed to the function. These variables have to be placed on the stack and copied into the function, which takes time.
Many computers have a dedicated multiply-accumulate unit implemented in the processor's hardware. Depending on your ultimate algorithm and target platform, you may be able to use this if the compiler isn't already using it when it optimizes.
The following may not be worth it, but ....
The routine fun() takes seven (7) parameters.
Changing the order of the parameters to fun (sum, a, b, c, d, e, f) could help, IF the compiler can take advantage of the following scenario. Parameters a through f appear to be invariant, and only sum appears to be changing at this level in the code. As parameters are pushed onto the stack in C/C++ from right to left, if parameters a through f truly are invariant, then the compiler could in theory optimize the pushing of the stack variables. In other words, a through f would only need to be pushed onto the stack once, and sum could in theory be the only parameter pushed and popped while in the loop.
I do not know if the compiler would take advantage of such a scenario, but I am tossing it out there as a possibility. Disassembling could verify it as true or false, and profiling would indicate how much of a benefit that may be if true.