Multithread C++ program to speed up a summation loop - c++

I have a loop that iterates from 1 to N and accumulates a modular sum. However, N is very large, so I am wondering if there is a way to modify it to take advantage of multithreading.
To give a sample program:
for (long long i = 1; i < N; ++i)
total = (total + f(i)) % modulus;
f(i) in my case isn't an actual function, but a long expression that would take up too much room here; I'm putting it there to illustrate the purpose.

Yes, try this:
long long total = 0;
#pragma omp parallel for reduction(+:total)
for (long long i = 1; i < N; ++i)
    total = (total + f(i)) % modulus;
total %= modulus; // the sum of the per-thread partials can exceed modulus
Compile with:
g++ -fopenmp your_program.cpp
It's that simple! No headers are required. The #pragma line automatically spins up a team of threads, divides the iterations of the loop evenly among them, and then recombines everything after the loop. Note, though, that you must know the number of iterations beforehand.
This code uses OpenMP, which provides easy-to-use parallelism that's quite suitable to your case. OpenMP is even built into the GCC and MSVC compilers.
This page shows some of the other reduction operations that are possible.
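For reference, here is a complete, compilable version of the above (compile with g++ -fopenmp as shown); since the real f(i) isn't given in the question, a stand-in expression is used in its place:
#include <iostream>

int main() {
    const long long N = 100000000, modulus = 1000000007;
    long long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (long long i = 1; i < N; ++i) {
        long long fi = (i * i + 3 * i) % modulus; // stand-in for f(i)
        total = (total + fi) % modulus;
    }
    total %= modulus; // the combined per-thread partials can exceed modulus
    std::cout << total << std::endl;
    return 0;
}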
If you need nested for loops, you can just write
long long total = 0;
#pragma omp parallel for reduction(+:total)
for (long long i = 1; i < N; ++i)
    for (long long j = 1; j < N; ++j)
        total = (total + f(i)*j) % modulus;
And the outer loop will be parallelised, with each thread running its own copy of the inner loop.
But you could also use the collapse clause:
#pragma omp parallel for reduction(+:total) collapse(2)
and then the iterations of both loops will be automagically divvied up.
If each thread needs its own copy of a variable defined prior to the loop, use the private clause:
long long total = 0, cheese = 4;
#pragma omp parallel for reduction(+:total) private(cheese)
for (long long i = 1; i < N; ++i)
    total = (total + f(i)) % modulus;
Note that the private copies start out uninitialized; use firstprivate(cheese) instead if each copy should start with the value 4. Also note that you don't need private(total), because this is implied by reduction.

As the f(i) are presumably independent and take roughly the same time to run, you could create four threads, get each to sum up a quarter of the range, have each return its partial sum as a value, and join them. This isn't a very flexible method, especially if the times the individual f(i) take can vary.
You might also want to consider a thread pool, where each thread calculates f(i) and then fetches the next i to sum, as sketched below.
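Here is a minimal sketch of the fixed-partition idea using C++11 std::thread (assuming f is callable and N and modulus are defined as in the question; the function name is mine). Each thread handles an interleaved quarter of the range, and the partial sums are combined after joining:
#include <thread>
#include <vector>

long long parallel_sum(long long N, long long modulus) {
    const unsigned nthreads = 4;                 // fixed partition into 4 parts
    std::vector<long long> partial(nthreads, 0); // one accumulator per thread
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            // thread t sums every nthreads-th element of [1, N)
            for (long long i = 1 + t; i < N; i += nthreads)
                partial[t] = (partial[t] + f(i)) % modulus;
        });
    }
    for (auto &w : workers) w.join();
    long long total = 0;
    for (long long p : partial) total = (total + p) % modulus;
    return total;
}
(For peak throughput you would pad the partial sums to separate cache lines to avoid false sharing, but this shows the structure.)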

Try OpenMP's parallel for with the reduction clause for your total: http://bisqwit.iki.fi/story/howto/openmp/#ReductionClause

If f(long long int) is a function that relies solely on its input and no global state, and the abelian properties of addition hold, you can gain a significant advantage like this:
long long total1 = 0, total2 = 0;
for (long long i = 0, j = 1; i < N; i += 2, j += 2)
{
    // assumes an even iteration count; handle a leftover element separately otherwise
    total1 = (total1 + f(i)) % modulus;
    total2 = (total2 + f(j)) % modulus;
}
total = (total1 + total2) % modulus;
Breaking the sum out into two independent accumulators like that should help by allowing the compiler to improve code generation and the CPU to use more of its resources (the two dependency chains can be handled in parallel), pumping more data through and reducing stalls. [I am assuming an x86 architecture here]
Of course, without knowing what f really looks like, it's hard to be sure if this is possible or if it will really help or make a measurable difference.
There may be other similar tricks that you can exploit special knowledge of your input and your platform - for example, SSE instructions could allow you to do even more. Platform-specific functionality might also be useful. For example, a modulo operation may not be required at all and your compiler may provide a special intrinsic function to perform addition modulo N.
I must ask, have you profiled your code and found this to be a hotspot?

You could use Threading Building Blocks. Note that updating a shared total from inside tbb::parallel_for would be a data race, so tbb::parallel_reduce is the right tool here: it accumulates per-thread partial sums and combines them safely:
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

long long total = tbb::parallel_reduce(
    tbb::blocked_range<long long>(1, N), 0LL,
    [&](const tbb::blocked_range<long long> &r, long long acc) {
        for (long long i = r.begin(); i != r.end(); ++i)
            acc = (acc + f(i)) % modulus; // keep each partial sum reduced
        return acc;
    },
    [](long long a, long long b) { return (a + b) % modulus; });
If overflow of the intermediate sums is not a concern, you can drop the % modulus inside the loop and take it only in the combiner and once at the end.

Related

Example of C++ code optimization for parallel computing

I'm trying to understand optimization routines. I'm focusing on the most critical part of my code (the code has some cycles of length "nc" and one cycle of length "np", where "np" is much larger than "nc"). I present that part of the code here. The rest of the code is not very significant in terms of computational time, so I prefer to keep the rest of the algorithm clean. However, the critical cycle of length "np" is a pretty simple piece of code and it can be parallelized. So it won't hurt to rewrite this part in some more effective and less clear form (maybe even SSE instructions). I'm using the gcc compiler, C++ code, and OpenMP parallelization.
This code is part of the well-known particle-in-cell algorithm (and a basic variant at that). I'm trying to learn code optimization on this version (so my goal is not just an effective PIC algorithm, which has already been written in a thousand variants, but a demonstrative example for code optimization). I've done some work on it, but I am not very sure whether I have handled all the optimization concerns correctly.
const int NT = ...; // number of threads (in two versions: about 6 or about 30)
const int np = 10000000; // np is about 1000-10000 times larger than nc commonly
const int nc = 10000;
const int step = 1000;
float u[np], x[np];
float a[nc], a_lin[nc], rho_full[NT][nc], rho_diff[NT][nc], weight[nc];
int i, k, p, num;
for ( i = 0 ; i < step ; i++) {
    // ***
    // *** some not very time consuming code for calculating
    // *** a, a_lin from the values of rho_full and rho_diff
    #pragma omp parallel for private(p,num)
    for ( k = np-1 ; k > 0 ; k-- ) { // canonical form for OpenMP; element 0 is skipped as in the original
        num = omp_get_thread_num();
        p = (int) x[k];
        u[k] += a[p] + a_lin[p] * (x[k] - p);
        x[k] += u[k];
        if (x[k] < 0 ) { x[k] += nc; } else
        if (x[k] > nc) { x[k] -= nc; }
        p = (int) x[k];
        rho_full[num][p] += weight[k];
        rho_diff[num][p] += weight[k] * (x[k] - p);
    }
}
I realize this has problems:
1) (main question) I use a set of arrays rho_full[num][p] where num is an index for each thread. After the computation I just sum these arrays (rho_full[0][p] + rho_full[1][p] + rho_full[2][p] + ...). The reason is to avoid two different threads writing into the same part of the array. I am not very sure whether this is an effective solution (note that "nc" is relatively small, so the operations over "np" probably still dominate).
2) (also important question) I need to read x[k] many times and it is also changed many times. Maybe it's better to read this value into a register and then forget the whole x array, or fix some pointer here. After all the calculations I can access x[k] again and store the obtained value. I believe the compiler does this work for me, but I am not very sure, because I modify x[k] in the middle of the algorithm. The compiler probably does some effective work on its own, but maybe in this version it loads and stores the value more times than necessary, because I switch between reading and writing it more than once.
3) (probably not relevant) The code works with the integer part and the fractional part below the decimal point. It needs both of these values. I obtain the integer part as p = (int) x and the remainder as x - p. I calculate this both at the beginning and at the end of the cycle body. One can see that this splitting could be stored somewhere and reused at the next step (I mean the step in the i index). Do you think the following version is better? I store the integer and remainder parts in arrays instead of the whole value x.
int x_int[np];
float x_rem[np];
//...
#pragma omp parallel for private(p,num)
for ( k = np-1 ; k > 0 ; k-- ) {
    num = omp_get_thread_num();
    u[k] += a[x_int[k]] + a_lin[x_int[k]] * x_rem[k];
    x_rem[k] += u[k];
    p = (int) x_rem[k]; // *** This part is added to the code to simplify the rest.
    x_int[k] += p;      // *** And maybe there is a better way to realize
    x_rem[k] -= p;      // *** this "pushing correction".
    if (x_int[k] < 0 ) { x_int[k] += nc; } else
    if (x_int[k] > nc) { x_int[k] -= nc; }
    rho_full[num][x_int[k]] += weight[k];
    rho_diff[num][x_int[k]] += weight[k] * x_rem[k];
}
} // end of the outer i loop (its opening is elided above)
You can use an OMP reduction for your for loop:
float result = 0;
#pragma omp parallel for reduction(+:result) private(p)
for ( k = np-1 ; k > 0 ; k-- ) {
    p = (int) x[k];
    u[k] += a[p] + a_lin[p] * (x[k] - p);
    x[k] += u[k];
    if (x[k] < 0 ) { x[k] += nc; } else
    if (x[k] > nc) { x[k] -= nc; }
    p = (int) x[k];
    result += weight[k] + weight[k] * (x[k] - p);
}
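On point 1 of the question, the per-thread copies are a standard approach, and the final combine is cheap because nc is small compared to np. As a minimal sketch (assuming the declarations from the question; the combined array, here called rho, is mine), the combine can itself be parallelised over the cell index:
float rho[nc];
#pragma omp parallel for
for (int p = 0; p < nc; ++p) {
    rho[p] = 0.0f;
    for (int t = 0; t < NT; ++t) // sum the per-thread copies for cell p
        rho[p] += rho_full[t][p];
}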

C/C++: Multiply, or bitshift then divide? [duplicate]

This question already has answers here:
Is multiplication and division using shift operators in C actually faster?
(19 answers)
Where it's possible to do so, I'm wondering if it's faster to replace a single multiplication with a bitshift followed by an integer division. Say I've got an int k and I want to multiply it by 2.25.
What's faster?
int k = 5;
k *= 2.25;
std::cout << k << std::endl;
or
int k = 5;
k = (k<<1) + (k/4);
std::cout << k << std::endl;
Output
11
11
Both give the same result, you can check this full example.
The first attempt
I defined functions regularmultiply() and bitwisemultiply() as follows:
int regularmultiply(int j)
{
return j * 2.25;
}
int bitwisemultiply(int k)
{
return (k << 1) + (k >> 2);
}
Upon profiling with Instruments (in Xcode on a 2009 MacBook, OS X 10.9.2), it seemed that bitwisemultiply executed about 2x faster than regularmultiply.
The assembly output seemed to confirm this, with bitwisemultiply spending most of its time on register shuffling and function returns, while regularmultiply spent most of its time on the multiplying.
(The Instruments assembly listings for regularmultiply and bitwisemultiply are not reproduced here.)
But the length of my trials was too short.
The second attempt
Next, I tried executing both functions with 10 million multiplications, and this time putting the loops in the functions so that all the function entry and leaving wouldn't obscure the numbers. And this time, the results were that each method took about 52 milliseconds of time. So at least for a relatively large but not gigantic number of calculations, the two functions take about the same time. This surprised me, so I decided to calculate for longer and with larger numbers.
The third attempt
This time, I multiplied each of the numbers from 100 million through 500 million by 2.25, but bitwisemultiply actually came out slightly slower than regularmultiply.
The final attempt
Finally, I switched the order of the two functions, just to see if the growing CPU graph in Instruments was perhaps slowing the second function down. But still, regularmultiply performed slightly better.
Here is what the final program looked like:
#include <stdio.h>
int main(void)
{
void regularmultiplyloop(int j);
void bitwisemultiplyloop(int k);
int i, j, k;
j = k = 4;
bitwisemultiplyloop(k);
regularmultiplyloop(j);
return 0;
}
void regularmultiplyloop(int j)
{
for(int m = 0; m < 10; m++)
{
for(int i = 100000000; i < 500000000; i++)
{
j = i;
j *= 2.25;
}
printf("j: %d\n", j);
}
}
void bitwisemultiplyloop(int k)
{
for(int m = 0; m < 10; m++)
{
for(int i = 100000000; i < 500000000; i++)
{
k = i;
k = (k << 1) + (k >> 2);
}
printf("k: %d\n", k);
}
}
Conclusion
So what can we say about all this? One thing we can say for certain is that optimizing compilers are better at this than most people. And furthermore, those optimizations show themselves even more when there are a lot of computations, which is the only time you'd really want to optimize anyway. So unless you're coding your optimizations in assembly, changing multiplication to bit shifting probably won't help much.
It's always good to think about efficiency in your applications, but the gains of micro-efficiency are usually not enough to warrant making your code less readable.
Indeed it depends on a variety of factors. So I have just checked it by running and measuring the time. The statement we are interested in takes only a few CPU instructions, which is very fast, so I wrapped it in a loop - multiplying the execution time of one pass by a big number - and I found that k *= 2.25; is about 1.5 times slower than k = (k<<1) + (k/4);.
Here are my two codes to compare:
prog1:
#include <iostream>
using namespace std;
int main() {
int k = 5;
for (unsigned long i = 0; i <= 0x2fffffff;i++)
k = (k<<1) + (k/4);
cout << k << endl;
return 0;
}
prog 2:
#include <iostream>
using namespace std;
int main() {
int k = 5;
for (unsigned long i = 0; i <= 0x2fffffff;i++)
k *= 2.25;
cout << k << endl;
return 0;
}
Prog1 takes 8 secs and Prog2 takes 14 secs. So by running this test on your architecture and with your compiler you can get the result that is correct for your particular environment.
That depends heavily on the CPU architecture: Floating point arithmetic, including multiplications, has become quite cheap on many CPUs. But the necessary float->int conversion can bite you: on POWER-CPUs, for instance, the regular multiplication will crawl along due to the pipeline flushes that are generated when a value is moved from the floating point unit to the integer unit.
On some CPUs (including mine, which is an AMD CPU), this version is actually the fastest:
k *= 9;
k >>= 2;
(that is, multiply by 9 and divide by 4, since 2.25 = 9/4), because these CPUs can do a 64 bit integer multiplication in a single cycle. Other CPUs are definitely slower with my version than with your bitshift version, because their integer multiplication is not as heavily optimized. Most CPUs aren't as bad at multiplications as they used to be, but a multiplication can still take more than four cycles.
So, if you know which CPU your program will run on, measure which is fastest. If you don't know, your bitshift version won't perform badly on any architecture (unlike both the regular version and mine), which makes it a really safe bet.
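As a tiny illustration (hedged: the helper name is mine, and the arithmetic right shift assumes k * 9 stays non-negative and does not overflow):
int times_2_25(int k) {
    // 2.25 = 9/4: multiply by 9, then divide by 4 via an arithmetic shift
    return (k * 9) >> 2;
}
// times_2_25(5) == 11, matching the two versions in the question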
It depends heavily on what hardware you are using. On modern hardware floating point multiplications may run way faster than integer ones, so you might want to change the entire algorithm and start using doubles instead of integers. If you're writing for modern hardware and you have a lot of operations like multiplying by 2.25, I'd suggest using double rather than integers, if nothing else prevents you from doing that.
And be data driven - measure performance, because it's affected by compiler, hardware and your way of implementing your algorithm.

call a function and loops in parallel

I don't have any experience with OpenMP, so I want to know how to do the following:
for (int i = 1; i <= NumImages; i++) {
    // call a function
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        for (int l = 0; l < ElNum; l++) {
            // do 2 summing up calculations inside a while loop
        } // end l loop
    } // end k loop
} // end i loop
Now, I have 40 cores at my disposal.
NumImages will be from 50 to 150, most usually 150.
SumNumber will be around 200.
ElNum will be around 5000.
So, is the best way of dealing with this to assign every thread a function call and also execute the l loop in parallel?
And if yes, will it be like:
#pragma omp parallel for num_threads(40)
for (int i = 1; i <= NumImages; i++) {
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        #pragma omp for
        for (int l = 0; l < ElNum; l++) {
And does the above mean (for NumImages = 150) that myfunction will be executed 40 times in parallel, and the l loop too, and then, when the l loop and k loop finish, the next 40 threads will call the function again, and then the next 40, so 3*40 = 120, and then the next 30?
Generally the best way is the way that splits the work evenly, to maintain efficiency (no cores are waiting). E.g. in your case static scheduling is probably not a good idea, because 40 does not divide 150 evenly, so in the last round you would lose 25% of your computing power. It might therefore turn out that it is better to put the parallel clause before the second loop. It all depends on the mode you choose and on how the work is really distributed within the loops. E.g., if myfunction does 99% of the work, it's a bad idea; if 99% of the work is within the two inner loops, it might be good.
Not really. There are 3 scheduling modes, but none of them works in a way that blocks other threads. There is a pool of tasks (iterations) that is distributed among the threads. The scheduling mode describes the strategy of assigning tasks to threads. When one thread finishes, it just gets the next task, no waiting. The strategies are described in more detail here: http://en.wikipedia.org/wiki/OpenMP#Scheduling_clauses (I am not sure if a blatant copy-paste from wiki is a good idea, so I'll leave a link. It's good material.)
What is perhaps not written there is that the modes are presented in order of the amount of overhead they introduce: static is fastest, then dynamic, then guided. My advice on when to use which - not the exact best, but a good rule of thumb IMO:
static if you know the tasks will be divided evenly among the threads and take about the same amount of time
dynamic if you know the tasks will not be divided evenly or their execution times are uneven
guided for rather long tasks about which you can pretty much not tell anything in advance
If your tasks are rather small you can see an overhead even for static scheduling (e.g. why my OpenMP C++ code is slower than a serial code?), but I think in your case dynamic should be fine and is the best choice, as sketched below.
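As a minimal sketch of that advice (hedged: it assumes, as the numbers above suggest, that most of the work is per-image and uneven), parallelise only the outer loop and let dynamic scheduling hand a finished thread the next image immediately:
#pragma omp parallel for schedule(dynamic) num_threads(40)
for (int i = 1; i <= NumImages; i++) {
    myfunction(/* ... */);
    for (int k = 0; k < SumNumber; k++)
        for (int l = 0; l < ElNum; l++) {
            // the two summing calculations run serially within each image
        }
}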

Optimize this function (in C++)

I have CPU-consuming code in which some function with a loop is executed many times. Every optimization of this loop brings a noticeable performance gain. Question: how would you optimize this loop (there is not much more to optimize though...)?
void theloop(int64_t in[], int64_t out[], size_t N)
{
    for(uint32_t i = 0; i < N; i++) {
        int64_t v = in[i];
        max += v;              // max is a class member (see the edit below)
        if (v > max) max = v;
        out[i] = max;
    }
}
I tried a few things, e.g. I replaced the arrays with pointers that were incremented in every iteration, but (surprisingly) I lost some performance instead of gaining...
Edit:
changed the name of one variable (itsMaximums, an error)
the function is a method of a class
in and out are int64_t, so they hold negative as well as positive values
(v > max) can evaluate to true: consider the situation when the actual max is negative
the code runs on a 32-bit pc (development) and a 64-bit one (production)
N is unknown at compile time
I tried some SIMD, but failed to increase performance... (the overhead of moving the variables into __m128i, executing, and storing back was higher than the SSE speed gain. Yet I am not an expert on SSE, so maybe I had poor code)
Results:
I added some loop unfolding, and a nice hack from Alex's post. Below I paste some results:
1) original: 14.0s
2) unfolded loop (4 iterations): 10.44s
3) Alex's trick: 10.89s
4) 2) and 3) at once: 11.71s
Strange that 4) is not faster than 2) or 3) alone. Below is the code for 4):
for(size_t i = 1; i < N; i+=CHUNK) {
int64_t t_in0 = in[i+0];
int64_t t_in1 = in[i+1];
int64_t t_in2 = in[i+2];
int64_t t_in3 = in[i+3];
max &= -max >> 63;
max += t_in0;
out[i+0] = max;
max &= -max >> 63;
max += t_in1;
out[i+1] = max;
max &= -max >> 63;
max += t_in2;
out[i+2] = max;
max &= -max >> 63;
max += t_in3;
out[i+3] = max;
}
First, you need to look at the generated assembly. Otherwise you have no way of knowing what actually happens when this loop is executed.
Now: is this code running on a 64-bit machine? If not, those 64-bit additions might hurt a bit.
This loop seems an obvious candidate for using SIMD instructions. SSE2 supports a number of SIMD instructions for integer arithmetics, including some that work on two 64-bit values.
Other than that, see if the compiler properly unrolls the loop, and if not, do so yourself. Unroll a couple of iterations of the loop, and then reorder the hell out of it. Put all the memory loads at the top of the loop, so they can be started as early as possible.
For the if line, check that the compiler is generating a conditional move, rather than a branch.
Finally, see if your compiler supports something like the restrict/__restrict keyword. It's not standard in C++, but it is very useful for indicating to the compiler that in and out do not point to the same addresses (a sketch follows below).
Is the size (N) known at compile-time? If so, make it a template parameter (and then try passing in and out as references to properly-sized arrays, as this may also help the compiler with aliasing analysis)
Just some thoughts off the top of my head. But again, study the disassembly. You need to know what the compiler does for you, and especially, what it doesn't do for you.
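A hedged sketch of the restrict suggestion (__restrict is a widely supported extension in GCC, Clang, and MSVC, not standard C++; max is made a local here so the snippet is self-contained):
void theloop(const int64_t* __restrict in, int64_t* __restrict out, size_t N)
{
    int64_t max = 0; // local stand-in for the class member
    for (size_t i = 0; i < N; ++i) {
        int64_t v = in[i];
        max += v;
        if (v > max) max = v; // a good candidate for a conditional move
        out[i] = max;
    }
}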
Edit
with your edit:
max &= -max >> 63;
max += t_in0;
out[i+0] = max;
what strikes me is that you added a huge dependency chain.
Before the result can be computed, max must be negated, the result must be shifted, the result of that must be and'ed together with its original value, and the result of that must be added to another variable.
In other words, all these operations have to be serialized. You can't start one of them before the previous has finished. That's not necessarily a speedup. Modern pipelined out-of-order CPUs like to execute lots of things in parallel. Tying them up in a single long chain of dependent instructions is one of the most crippling things you can do. (Of course, if it can be interleaved with other iterations, it might work out better. But my gut feeling is that a simple conditional move instruction would be preferable.)
> **Announcement**: see [chat](https://chat.stackoverflow.com/rooms/5056/discussion-between-sehe-and-jakub-m)
> > _Hi Jakub, what would you say if I told you I have found a version that uses a heuristic optimization that, for uniformly distributed random data, results in a ~3.2x speed increase for `int64_t` (10.56x effective using `float`s)?_
>
I have yet to find the time to update the post, but the explanation and code can be found through the chat.
> I used the same test-bed code (below) to verify that the results are correct and exactly match the original implementation from your OP
**Edit**: ironically... that testbed had a fatal flaw, which rendered the results invalid: the heuristic version was in fact skipping parts of the input, but because existing output wasn't being cleared, it appeared to have the correct output... (still editing...)
Ok, I have published a benchmark based on your code versions, and also my proposed use of partial_sum.
Find all the code here https://gist.github.com/1368992#file_test.cpp
Features
For a default config of
#define MAGNITUDE 20
#define ITERATIONS 1024
#define VERIFICATION 1
#define VERBOSE 0
#define LIMITED_RANGE 0 // hide difference in output due to absence of overflows
#define USE_FLOATS 0
It will (see output fragment here):
run 100 x 1024 iterations (i.e. 100 different random seeds)
for data length 1048576 (2^20).
The random input data is uniformly distributed over the full range of the element data type (int64_t)
Verify output by generating a hash digest of the output array and comparing it to the reference implementation from the OP.
Results
There are a number of (surprising or unsurprising) results:
there is no significant performance difference between any of the algorithms whatsoever (for integer data), provided you are compiling with optimizations enabled. (See the Makefile; my arch is 64-bit, Intel Core Q9550 with gcc-4.6.1)
The algorithms are not equivalent (you'll see the hash sums differ): notably the bit fiddle proposed by Alex doesn't handle integer overflow in quite the same way (this can be hidden by defining
#define LIMITED_RANGE 1
which limits the input data so overflows won't occur; note that the partial_sum_incorrect version shows the equivalent C++ non-bitwise arithmetic operations that yield the same, differing, results:
return max<0 ? v : max + v;
Perhaps that is ok for your purpose?)
Surprisingly, it is not more expensive to calculate both definitions of the max algorithm at once. You can see this being done inside partial_sum_correct: it calculates both 'formulations' of max in the same loop; this is really no more than a trivia item here, because neither of the two methods is significantly faster...
Even more surprisingly, a big performance boost can be had when you are able to use float instead of int64_t. A quick and dirty hack can be applied to the benchmark,
#define USE_FLOATS 1
showing that the STL-based algorithm (partial_sum_incorrect) runs approximately 2.5x faster when using float instead of int64_t (!!!). Note:
the naming of partial_sum_incorrect only relates to integer overflow, which doesn't apply to floats; this can be seen from the fact that the hashes match up, so in fact it is partial_sum_float_correct :)
the current implementation of partial_sum_correct does double work that causes it to perform badly in floating point mode. See bullet 3.
(And there was that off-by-1 bug in the loop-unrolled version from the OP that I mentioned before.)
Partial sum
For your interest, the partial sum application looks like this in C++11:
std::partial_sum(data.begin(), data.end(), output.begin(),
[](int64_t max, int64_t v) -> int64_t
{
max += v;
if (v > max) max = v;
return max;
});
Sometimes, you need to step backward and look over it again. The first question is obviously, do you need this ? Could there be an alternative algorithm that would perform better ?
That being said, and supposing for the sake of this question that you already settled on this algorithm, we can try and reason about what we actually have.
Disclaimer: the method I am describing is inspired by the successful method Tim Peters used to improve the traditional introsort implementation, leading to TimSort. So please bear with me ;)
1. Extracting Properties
The main issue I can see is the dependency between iterations, which will prevent much of the possible optimizations and thwart many attempts at parallelizing.
int64_t v = in[i];
max += v;
if (v > max) max = v;
out[i] = max;
Let us rework this code in a functional fashion:
max = calc(in[i], max);
out[i] = max;
Where:
int64_t calc(int64_t const in, int64_t const max) {
int64_t const bumped = max + in;
return in > bumped ? in : bumped;
}
Or rather, a simplified version (barring overflow, since it's undefined):
int64_t calc(int64_t const in, int64_t const max) {
    return 0 > max ? in : max + in;
}
(The two are equivalent because in > max + in holds exactly when 0 > max.) Do you notice the tipping point? The behavior changes depending on whether the ill-named(*) max is positive or negative.
This tipping point makes it interesting to watch the values in in more closely, especially according to the effect they might have on max:
max < 0 and in[i] < 0 then out[i] = in[i] < 0
max < 0 and in[i] > 0 then out[i] = in[i] > 0
max > 0 and in[i] < 0 then out[i] = (max + in[i]) ?? 0 (the sign cannot be determined in advance)
max > 0 and in[i] > 0 then out[i] = (max + in[i]) > 0
(*) ill-named because it is also an accumulator, which the name hides. I have no better suggestion though.
2. Optimizing operations
This leads us to discover interesting cases:
if we have a slice [i, j) of the array containing only negative values (which we call negative slice), then we could do a std::copy(in + i, in + j, out + i) and max = out[j-1]
if we have a slice [i, j) of the array containing only positive values, then it's a pure accumulation code (which can easily be unrolled)
max gets positive as soon as in[i] is positive
Therefore, it could be interesting (but maybe not, I make no promise) to establish a profile of the input before actually working with it. Note that the profile could be made chunk by chunk for large inputs, for example tuning the chunk size based on the cache line size.
For references, the 3 routines:
void copy(int64_t const in[], int64_t out[],
size_t const begin, size_t const end)
{
std::copy(in + begin, in + end, out + begin);
} // copy
void accumulate(int64_t const in[], int64_t out[],
size_t const begin, size_t const end)
{
assert(begin != 0);
int64_t max = out[begin-1];
for (size_t i = begin; i != end; ++i) {
max += in[i];
out[i] = max;
}
} // accumulate
void regular(int64_t const in[], int64_t out[],
size_t const begin, size_t const end)
{
assert(begin != 0);
int64_t max = out[begin - 1];
for (size_t i = begin; i != end; ++i)
{
max = 0 > max ? in[i] : max + in[i];
out[i] = max;
}
}
Now, supposing that we can somehow characterize the input using a simple structure:
struct Slice {
enum class Type { Negative, Neutral, Positive };
Type type;
size_t begin;
size_t end;
};
typedef void (*Func)(int64_t const[], int64_t[], size_t, size_t);
Func select(Slice::Type t) {
    switch(t) {
    case Slice::Type::Negative: return &copy;
    case Slice::Type::Neutral: return &regular;
    case Slice::Type::Positive: return &accumulate;
    }
}
void theLoop(std::vector<Slice> const& slices, int64_t const in[], int64_t out[]) {
for (Slice const& slice: slices) {
Func const f = select(slice.type);
(*f)(in, out, slice.begin, slice.end);
}
}
Now, unlike introsort, the work in the loop is minimal, so computing the characteristics might be too costly as is... however it lends itself well to parallelization.
3. Simple parallelization
Note that the characterization is a pure function of the input. Therefore, supposing that you work in a chunk per chunk fashion, it could be possible to have, in parallel:
Slice Producer: a characterizer thread, which computes the Slice::Type value
Slice Consumer: a worker thread, which actually executes the code
Even if the input is essentially random, provided the chunks are small enough (for example, a CPU L1 cache line) there might be chunks for which it does work. Synchronization between the two threads can be done with a simple thread-safe queue of Slice (producer/consumer), either adding a bool last attribute to stop consumption, or creating the Slices in a vector with an Unknown type and having the consumer block until the type is known (using atomics).
Note: because the characterization is pure, it's embarrassingly parallel (see the sketch below).
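A minimal sketch of that characterization step (hedged: the function name and chunking scheme are mine; Slice::Type is as defined above, and zeros are lumped in with the positives for simplicity):
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<Slice::Type> classify(int64_t const in[], size_t n, size_t chunk) {
    size_t const nchunks = (n + chunk - 1) / chunk;
    std::vector<Slice::Type> types(nchunks);
    #pragma omp parallel for // pure function of the input: no dependencies
    for (size_t c = 0; c < nchunks; ++c) {
        size_t const b = c * chunk, e = std::min(n, b + chunk);
        bool neg = false, pos = false;
        for (size_t i = b; i < e; ++i) {
            if (in[i] < 0) neg = true; else pos = true;
        }
        types[c] = neg && pos ? Slice::Type::Neutral
                 : neg        ? Slice::Type::Negative
                              : Slice::Type::Positive;
    }
    return types;
}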
4. More Parallelization: Speculative work
Remember this innocent remark: max gets positive as soon as in[i] is positive.
Suppose that we can guess (reliably) that Slice[j-1] will produce a max value that is negative; then the computation on Slice[j] is independent of what preceded it, and we can start the work right now!
Of course, it's a guess, so we might be wrong... but once we have fully characterized all the Slices, we have idle cores, so we might as well use them for speculative work! And if we're wrong ? Well, the Consumer thread will simply gently erase our mistake and replace it with the correct value.
The heuristic to speculatively compute a Slice should be simple, and it will have to be tuned. It may be adaptive as well... but that may be more difficult!
Conclusion
Analyze your dataset and try to find if it's possible to break dependencies. If it is you can probably take advantage of it, even without going multi-thread.
If the values of max and in[] are far away from the 64-bit min/max (say, they are always between -2^61 and +2^61), you may try a loop without the conditional branch, which may be causing some perf degradation:
for(uint32_t i = 1; i < N; i++) {
    max &= -max >> 63; // assuming >> does an arithmetic shift with sign extension
    max += in[i];
    out[i] = max;
}
(The mask works because -max >> 63 is all ones when max is positive, leaving max unchanged, and zero when max is zero or negative, resetting max to 0 before the addition - which is exactly the max = max < 0 ? in[i] : max + in[i] recurrence.)
In theory the compiler may do a similar trick as well, but without seeing the disassembly, it's hard to tell if it does it.
The code appears already pretty fast. Depending on the nature of the in array, you could try special-casing: for instance, if you happen to know that in a particular invocation all the input numbers are positive, out[i] will be equal to the cumulative sum, with no need for an if branch.
Ensuring the method isn't virtual, plus inline, __attribute__((always_inline)) and -funroll-loops, seem like good options to explore.
Only by you benchmarking them can we determine if they were worthwhile optimizations in your bigger program.
The only thing that comes to mind that might help a little is to use pointers rather than array indices within your loop, something like
void theloop(int64_t in[], int64_t out[], size_t N)
{
    int64_t max = in[0];
    out[0] = max;
    int64_t *ip = in + 1, *op = out + 1;
    for(uint32_t i = 1; i < N; i++) {
        int64_t v = *ip;
        ip++;
        max += v;
        if (v > max) max = v;
        *op = max;
        op++;
    }
}
The thinking here is that an index into an array is liable to compile as taking the base address of the array, multiplying the element size by the index, and adding the result to get the element address. Keeping running pointers avoids this. I'm guessing a good optimizing compiler will do this already, so you'd need to study the current assembler output.
The computation must run front to back (each out[i] depends on the previous max), so the loop can be unrolled but not reversed:
int64_t max = 0;
size_t i;
for (i = 0; i + 1 < N; i += 2) /* Unrolled by 2: halves the loop-condition checks */
{
    max = max < 0 ? in[i] : max + in[i]; /* Same as: max += in[i]; if (in[i] > max) max = in[i]; */
    out[i] = max;
    max = max < 0 ? in[i+1] : max + in[i+1];
    out[i+1] = max;
}
if (i < N) /* When N is odd */
{
    max = max < 0 ? in[i] : max + in[i];
    out[i] = max;
}

g++ -O optimizations

I have the following code:
#include <iostream>
int main()
{
int n = 100;
long a = 0;
int *x = new int[n];
for(int i = 0; i < n; ++i)
for(int j = 0; j < n; ++j)
for(int k = 0; k < n; ++k)
for(int l = 0; l < n; ++l)
{
a += x[(j + k + l) % 100];
}
std::cout << a << std::endl;
delete[] x;
return 0;
}
If I compile without optimizations (g++ test.cc) and then run time ./a.out, it displays 0.7s. However, when I compile with -O, the running time decreases by a factor of 2.
Compiler used
g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2
My question
How can I rewrite the code so when compiling without -O I can obtain same time ( or close to it ) ? In other words, how to optimize nested loops manually ?
Why I ask
I have a similar code which runs about 4 times faster if I use -O optimization.
PS: I've looked in the manual of the compiler but there are way too many flags there to test which one really makes the difference.
Most of the things the compiler optimizes with -O are things on a level below C++. For example, all variables live in memory, including your loop variables. Therefore, without optimization, the compiler will most likely, on each iteration of the inner loop, first read the loop variable to compare it with the bound, load it again inside the loop to use it as the index, and then at the end of the loop read the value once more, increment it, and write it back. With optimization, it will notice that the loop variable is not changed in the loop body and therefore does not need to be re-read from memory each time. Moreover, it will also note that the address of the variable is never taken, so no other code will ever access it, and therefore writing it to memory can also be omitted. That is, the loop variable can live entirely in a register. This optimization alone saves three hundred million memory reads and one hundred million memory writes during the execution of your function. But since things like processor registers and memory reads/writes are not exposed at the language level, there's no way to force this optimization at the language level.
Furthermore, it doesn't make sense to hand-optimize things which the compiler optimizes anyway. Better spend your time at optimizing things which the compiler cannot optimize.
This code has undefined behaviour, so the optimizer can actually do whatever it wants. The indexing in the original version of the question was
a += x[j + k + l % 100];
which, because % binds tighter than +, reads out of bounds; it should be:
a += x[(j + k + l) % 100];
If you fix this, I still don't get why you want to optimize something that actually doesn't do anything useful...
Personally I would optimize this to: :)
std::cout << 0 << std::endl;
(assuming the array were zero-initialized; note that new int[n] leaves the elements uninitialized, which is another source of undefined behaviour when they are read).
Note: you could also remove the loop over i and do std::cout << a * n << std::endl; instead.
You could observe that your loops form a set of coefficients coeff[i] such that a single loop summing x[i] * coeff[i] produces the same total as the nested loops. If we assume your index was meant to be (i + j + k) % 100, then coeff[i] = n * n * n for all i, so your whole program simplifies to
long coeff = (long)n * n * n;
for (int i = 0; i < n; ++i)
    a += x[i] * coeff;
which I'm sure runs in far less than 0.7 seconds.
You're using a polynomial-time algorithm, O(n^4), so it stands to reason that it'll be slow, especially for larger n. Perhaps an algorithm with lower complexity would help you?
If you want your code optimized you can either ask the compiler to optimize for you, or just write it in assembly language and be done with it. Don't try to second-guess the C++ compiler.
How can I rewrite the code so when compiling without -O I can obtain same time ( or close to it ) ? In other words, how to optimize nested loops manually ?
You write them in assembly. Good luck with that, BTW.
The purpose of the optimization switches in compilers is for you to say, "Compiler, I want you to try to generate fast assembly." By default, compilers do not optimize, because those optimizations might inhibit the debuggability of the resulting code. So you have to specifically ask for them.
The things the compiler does to optimize code are, generally speaking, not something you can do manually. Some can be (like loop unrolling), but others can't be.