C/C++ optimization: negate doubles fast

I need to negate a very large number of doubles quickly. If bit_generator generates 0, then the sign must be changed. If bit_generator generates 1, then nothing happens. The loop runs many times over and bit_generator is extremely fast. On my platform case 2 is noticeably faster than case 1, so it looks like my CPU doesn't like branching. Is there a faster, portable way to do it? What do you think about case 3?
// generates 0 and 1
int bit_generator();

// big vector (C++)
vector<double> v;

// case 1
for (size_t i=0; i<v.size(); ++i)
    if (bit_generator()==0)
        v[i] = -v[i];

// case 2
const int sign[] = {-1, 1};
for (size_t i=0; i<v.size(); ++i)
    v[i] *= sign[bit_generator()];

// case 3
const double sign[] = {-1, 1};
for (size_t i=0; i<v.size(); ++i)
    v[i] *= sign[bit_generator()];

// case 4 uses a C array
double a[N];
double number_generator(); // generates doubles
double z[2];               // used as a buffer
for (size_t i=0; i<N; ++i) {
    z[0] = number_generator();
    z[1] = -z[0];
    a[i] = z[bit_generator()];
}
EDIT: Added case 4 and the C tag, because the vector can be a plain array. Since I can control how the doubles are generated, I redesigned the code as shown in case 4. It avoids the extra multiplication and the branching at the same time. I presume it should be quite fast on all platforms.

Unless you want to resize the vector inside the loop, lift the v.size() call out of the for expression, i.e.
const size_t SZ = v.size();
for (size_t i=0; i<SZ; ++i)
    if (bit_generator()==0)
        v[i] = -v[i];
If the compiler can't see what happens in bit_generator(), then it might be very hard for it to prove that v.size() does not change, which makes loop unrolling or vectorization impossible.
UPDATE: I've made some tests and on my machine method 2 seems to be the fastest. However, it seems to be even faster to use a pattern which I call "group action" :-). Basically, you group multiple decisions into one value and switch over it:
const size_t SZ = v.size();
for (size_t i=0; i<SZ; i+=2) // manual loop unrolling
{
    int val = 2*bit_generator() + bit_generator();
    switch (val) // only one conditional
    {
    case 0:
        break; // nothing happens
    case 1:
        v[i+1] = -v[i+1];
        break;
    case 2:
        v[i] = -v[i];
        break;
    case 3:
        v[i] = -v[i];
        v[i+1] = -v[i+1];
    }
}
// not shown: wrap up the loop if SZ%2==1
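One way that odd-element wrap-up could look (a sketch; this part is not from the original answer):
if (SZ % 2 == 1 && bit_generator() == 0)
    v[SZ - 1] = -v[SZ - 1];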

If you can assume that the sign is represented by one specific bit, like in x86 implementations, you can simply do:
v[i] ^= (uint64_t)!bit_generator() << SIGN_BIT_POSITION; // negate the output of bit_generator()
                                                         // because 0 means negate and 1 means
                                                         // leave unchanged
(the cast matters: left-shifting a plain int by 63 is undefined). In x86 the sign bit is the MSB, so for doubles that's bit 63:
#define SIGN_BIT_POSITION 63
will do the trick.
Edit:
Based on comments, I should add that you might need to do some extra work to get this to compile, since v is an array of double, while bit_generator() returns int. You could do it like this:
union int_double {
    double d;        // assumption: double is 64 bits wide
    long long int i; // assumption: long long is 64 bits wide
};
(The syntax might be a bit different for C, because you might need a typedef.)
Then define v as a vector of int_double and use:
v[i].i ^= (long long)!bit_generator() << SIGN_BIT_POSITION;
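As a side note, reading a union member other than the one last written is well-defined in C but formally undefined in C++. Here is a sketch of a variant that is well-defined in both languages, using memcpy (which compilers typically optimize away); the helper name is hypothetical:
#include <cstdint>
#include <cstring>

// flips the sign of *x when bit == 0, leaves it alone when bit == 1
inline void conditional_negate(double *x, int bit)
{
    std::uint64_t u;
    std::memcpy(&u, x, sizeof u);    // well-defined type punning
    u ^= (std::uint64_t)!bit << 63;  // 0 -> flip sign bit, 1 -> no-op
    std::memcpy(x, &u, sizeof u);
}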

Generally, if you have an if() inside a loop, that loop cannot be vectorized or unrolled, and the code has to execute once per pass, maximizing the loop overhead. Case 3 should perform very well, especially if the compiler can use SSE instructions.
For fun, if you're using GCC, use the -S -o foo.S -c foo.c flags instead of the usual -o foo.o -c foo.c flags. This will give you the assembly code, and you can see what is getting compiled for your three cases.

You don't need the lookup table, a simple formula suffices:
const size_t size = v.size();
for (size_t i=0; i<size; ++i)
    v[i] *= 2*bit_generator() - 1;

Assuming that the actual negation is fast (a good assumption on a modern compiler and CPU), you could use a conditional assignment, which is also fast on modern CPUs, to choose between the two possibilities:
v[i] = bit_generator() ? v[i] : -v[i];
This avoids branches and allows the compiler to vectorize the loop, making it faster.

Are you able to rewrite bit_generator so it returns 1 and -1 instead? That removes an indirection from the equation at the possible cost of some clarity.

Premature optimization is the root of insipid SO questions
On my machine, running at 5333.24 BogoMIPS, the timings for 1'000 iterations across an array of 1'000'000 doubles yield the following times per expression:
p->d = -p->d          7.33 ns
p->MSB(d) ^= 0x80     6.94 ns
where MSB(d) is pseudo-code for grabbing the most significant byte of d. This means that the naive d = -d takes 5.32% longer to execute than the obfuscated approach. For a billion such negations that's the difference between 7.3 and 6.9 seconds.
Someone must have an awfully big pile of doubles to care about that optimization.
Incidentally, I had to print out the content of the array when completed or my compiler optimized the whole test into zero op codes.
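For reference, a sketch of the byte-level trick being timed (assumptions: little-endian x86 and a 64-bit IEEE 754 double, so the sign bit lives in byte 7):
double d = 3.14;
reinterpret_cast<unsigned char *>(&d)[7] ^= 0x80; // d is now -3.14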

Related

Initializing large 2-dimensional array to all one value in C++

I want to initialize a large 2-dimensional array (say 1000x1000, though I'd like to go even larger) to all -1 in C++.
If my array were 1-dimensional, I know I could do:
int my_array[1000];
memset(my_array, -1, sizeof(my_array));
However, memset does not provide a way to initialize each element of an array to be another array. I know I could just make a 1-dimensional array of length 1000000, but for readability's sake I would prefer a 2-dimensional array. I could also just loop through the 2-dimensional array to set the values after initializing it to all 0, but this bit of code will be run many times in my program and I'm not sure how fast that would be. What's the best way of achieving this?
Edited to add minimal reproducible example:
int my_array[1000][1000];
// I want my_array[i][j] to be -1 for each i, j
I am a little bit surprised.
And I know it is C++, and I would never use plain C-style arrays,
and therefore the accepted answer is maybe the best.
But if we come back to the question
int my_array[1000];
memset(my_array, -1, sizeof(my_array));
and
int my_array[1000][1000];
// I want my_array[i][j] to be -1 for each i, j
then the easiest and fastest solution is the same as the original assumption:
int my_array[1000][1000];
memset(my_array, -1, sizeof(my_array));
There is no difference. The compiler will even optimize this away and use fast assembler loop instructions. sizeof is smart enough; it will do the trick. And the memory is contiguous, so it will work. Fast. Easy. (Filling every byte with 0xFF yields -1 for a two's-complement int, which is why memset works here at all.)
(A good compiler will do the same optimizations for the other solutions.)
Please consider.
With GNU GCC you can:
int my_array[1000][1000] = { [0 ... 999] = { [0 ... 999] = -1 } };
With any other compiler you need to:
int my_array[1000][1000] = { { -1, -1, -1, .. repeat -1 1000 times }, ... repeat { } 1000 times ... };
Side note: The following is doing assignment, not initialization:
int my_array[1000][1000];
for (auto&& i : my_array)
    for (auto&& j : i)
        j = -1;
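As a further side note, a sketch of the same assignment using the standard library (this relies on a built-in 2D array being one contiguous block of elements):
#include <algorithm>
std::fill_n(&my_array[0][0], 1000 * 1000, -1);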
Is there any real difference between doing what you wrote and doing for(int i=0; i<1000; i++){ for(int j=0; j<1000; j++){ my_array[i][j]=-1; } }?
It depends. If you have a bad compiler, or you compile without optimization, etc., then yes. Most probably, no. Anyway, don't use indices. I believe the range-based for loop in this case roughly translates to something like this:
for (int (*i)[1000] = my_array; i < my_array + 1000; ++i)
    for (int *j = *i; j < *i + 1000; ++j)
        *j = -1;
Side note: Ach! It hurts to calculate my_array + 1000 and *i + 1000 on each iteration. That's like 3 extra operations per loop. CPU time wasted! It can easily be optimized to:
for (int (*i)[1000] = my_array, (*max_i)[1000] = my_array + 1000; i < max_i; ++i)
    for (int *j = *i, *max_j = *i + 1000; j < max_j; ++j)
        *j = -1;
The my_array[i][j] used in your loop translates into *(*(my_array + i) + j) (see the array subscript operator). In terms of raw pointer arithmetic that is a multiplication by sizeof(int[1000]), an addition, a dereference, then a multiplication by sizeof(int), an addition, and a dereference: about six operations. With a bad or non-optimizing compiler, your version could be way slower.
That said, a good compiler should optimize each version to the same code, as shown here.
And are either of these significantly slower than just initializing it explicitly by typing a million -1's?
I believe assigning each array element (in this particular case of elements having the easy-to-optimize type int) will be as fast as or slower than initialization. It really depends on your particular compiler and on your architecture. A bad compiler can generate a very slow version of iterating over array elements, so it would take forever. On the other hand, a static initialization embeds the values in your program, so your program size will increase by sizeof(int) * 1000 * 1000, and during program startup it will do a plain memcpy when initializing the static regions of your program. So, compared to a properly optimized loop with assignment, you gain nothing in terms of speed and lose tons of read-only memory.
If the array is static, it's placed in contiguous memory (check this question). So char [1000][1000] is equivalent to char [1000000] (if your stack can hold that much).
If the array has been created with a multidimensional new (say char (*x)[5] = new char[5][5]) then it's also contiguous.
If it's not (if you create it with separate dynamic allocations), then you can use the solutions found in my question to map an n-dimensional array onto a single one after you have memset it.
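For the multidimensional-new case, a sketch of the memset approach (it works because the allocation is one contiguous block; for int elements the all-0xFF byte pattern again yields -1):
#include <cstring>
char (*x)[5] = new char[5][5];
std::memset(x, -1, sizeof(char) * 5 * 5);
// ... use x ...
delete[] x;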

Vectorization & #pragma omp simd

Since I got lost reading about SIMD and OpenMP and how they relate to vectorization, I would like to ask if somebody can clarify the above for me.
Specifically, I have a part of a C++ code I want to parallelize, but I am pretty stuck at the moment and can't figure it out on my own.
Any help clarifying what exactly vectorization is and how I can use it in the following part of code would be greatly appreciated!
for(unsigned short i=1; i<=N_a; i++) {
    for(unsigned short j=1; j<=N_b; j++) {
        temp[0] = H[i-1][j-1] + similarity_score(seq_a[i-1], seq_b[j-1]);
        temp[1] = H[i-1][j] - delta;
        temp[2] = H[i][j-1] - delta;
        temp[3] = 0.;
        H[i][j] = find_array_max(temp, 4);
        switch(ind) { // ind: presumably the index of the maximum, set elsewhere
        case 0: // score in (i,j) stems from a match/mismatch
            I_i[i][j] = i-1;
            I_j[i][j] = j-1;
            break;
        case 1: // score in (i,j) stems from a deletion in sequence A
            I_i[i][j] = i-1;
            I_j[i][j] = j;
            break;
        case 2: // score in (i,j) stems from a deletion in sequence B
            I_i[i][j] = i;
            I_j[i][j] = j-1;
            break;
        case 3: // (i,j) is the beginning of a subsequence
            I_i[i][j] = i;
            I_j[i][j] = j;
            break;
        }
    }
}
Regards!
So ind is constant for both nested loops?
You might get a compiler to auto-vectorize this for you with OpenMP. (Put the line #pragma omp simd right before either of your for loops, and see if that affects the asm when you compile with -O3. I don't know OpenMP that well, so IDK if you need other options.)
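For example, a sketch of the pragma placement on the inner loop, with a plausible compile line (assumption: GCC, where -fopenmp-simd enables just the OpenMP simd directives without the full OpenMP runtime):
#pragma omp simd
for (unsigned short j = 1; j <= N_b; j++) {
    // ... loop body as above ...
}
// compile with something like: g++ -O3 -fopenmp-simd code.cpp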
Wrap it in a function that actually compiles, so I can see what happens. (e.g. by putting the code on http://gcc.godbolt.org/ to get nicely formatted asm output).
If it doesn't auto-vectorize, it's probably not too hard to manually vectorize with Intel intrinsics for x86, since you're just initializing some arrays with the array index. Keep a vector of loop counters, starting with a vector of __m128i jvec = _mm_set_epi32(3, 2, 1, 0);, and increment it with _mm_add_epi32() by a vector of [ 4 4 4 4 ] (_mm_set1_epi32(4)) to advance every element by 4.
Keep a separate vector of i values, which you only modify in the outer loop, but still store in the inner loop.
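A rough sketch of that manual approach for one row's index stores (assumptions: the I_i/I_j rows are contiguous int arrays, N_b is a multiple of 4, SSE2 only, and this covers just the case-0 store pattern):
#include <emmintrin.h>

// hypothetical helper: fills I_i[i][1..N_b] with i-1 and I_j[i][1..N_b] with j-1
void store_case0_row(int *Ii_row, int *Ij_row, int i, int N_b)
{
    const __m128i ivec = _mm_set1_epi32(i - 1);  // i-1, broadcast to all lanes
    const __m128i four = _mm_set1_epi32(4);
    __m128i jvec = _mm_set_epi32(3, 2, 1, 0);    // j-1 for j = 1..4
    for (int j = 1; j <= N_b; j += 4) {
        _mm_storeu_si128((__m128i *)&Ii_row[j], ivec);
        _mm_storeu_si128((__m128i *)&Ij_row[j], jvec);
        jvec = _mm_add_epi32(jvec, four);        // advance the j counters by 4
    }
}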
See the x86 tag wiki for instruction-set stuff.
See the sse tag wiki for some SIMD guides, including this nice intro to SIMD and what it's all about.

Digital filter and std::inner_product optimization

In a digital filtering C++ application, I use std::inner_product (with std::vector<double> and std::deque<double>) to compute the dot product between the filter coefficients and the input data, for each data sample. After profiling my application, I figured out that no less than 85% of the execution time is spent in std::inner_product!
To what extent is std::inner_product optimized, in GCC for example?
Does it use SIMD instructions? Does it perform loop unrolling? How can I make sure of that?
Based on this, would it be worth it to implement custom dot product function(s), especially if the number of coefficients is low? (But I would like to keep the function as generic as possible.)
More specifically, this is the piece of code I use to apply a filter:
std::deque<double> in(filterNum.size(), 0.0);
std::deque<double> out(filterDenom.size() - 1, 0.0);
const double gain = filterDenom.back();
for (unsigned int s = 0, size = data.size(); s < size; ++s) {
    in.pop_front();
    in.push_back(data[s] / gain);
    data[s] = inner_product(in.begin(), in.end(), filterNum.begin(),
                            -inner_product(out.begin(), out.end(), filterDenom.begin(), 0.0));
    out.pop_front();
    out.push_back(data[s]);
}
Typically, I use second order bandpass IIR filters, which means that the size of filterNum and filterDenom (numerator and denominator coefficients of the filter) is 5. data is the vector containing the input samples.
Getting an additional factor of 2 out of this shouldn't be hard if you just write the code directly. Part of it might come from removing some of the generality of inner_product, but some would also come from eliminating the use of deques: if you just keep a pointer into your input array you can index off it and off the filter array in the inner loop, and increment the pointer to the input array in the outer loop.
Each of those inner_product calls has to iterate through deques, which are slower to traverse than plain arrays.
Most of the (coding) effort then becomes handling the edge conditions.
And take that division out of there: it should be a multiplication by a constant calculated outside the loop.
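To make that concrete, here is a sketch of a deque-free second-order section (biquad) in direct form II transposed; the coefficient names b0..b2 and a1..a2 are assumptions (normalized so that a0 == 1), and the per-sample division is gone entirely:
#include <cstddef>
#include <vector>

void biquad_inplace(std::vector<double> &data,
                    double b0, double b1, double b2,
                    double a1, double a2)
{
    double z1 = 0.0, z2 = 0.0;            // filter state
    for (std::size_t s = 0; s < data.size(); ++s) {
        const double x = data[s];
        const double y = b0 * x + z1;     // output sample
        z1 = b1 * x - a1 * y + z2;        // update state
        z2 = b2 * x - a2 * y;
        data[s] = y;
    }
}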
Inner product itself is pretty efficient (there's not much to do there), but it needs to increment two iterators on each pass through the inner loop. There is no explicit loop unrolling, but a good compiler can unroll a loop that simple. And a compiler is more likely to know how far to unroll a loop before running into instruction cache issues.
Deque iterators are not nearly as efficient as ++ on a pure pointer. There is at least a test on every ++, and there may be more than one assignment.
This is what a simple (FIR) filter can look like, without including the code for the edge conditions (which goes outside of the loop):
double norm = 1.0/sum;
double *p = data.values();   // start of input data
double *q = output.values(); // start of output buffer
int width = data.size() - filter.size();
for( int i = 0; i < width; ++i )
{
    double *f = filter.values();
    double accumulator = ( f[0] * p[i] );
    for( int j = 1; j < filter.size(); ++j )
    {
        accumulator += ( f[j] * p[i+j] );
    }
    *q++ = accumulator * norm;
}
Note that there are messy details left out, and this is not the same as your filter, but it gives the idea. What's inside the outer loop easily fits in a modern instruction cache. The inner loop may be unrolled by the compiler. Most modern architectures can do the add and multiply in parallel.
You can ask GCC to compute most of the algorithms in <algorithm> and <numeric> in parallel mode; it may give a performance boost if your data set is very large (I think it really only uses OpenMP inside).
However, on small datasets it may give a performance hit.
A comparison with the other solution would be more than welcome!
http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html
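For instance, per that manual, switching a whole translation unit over to parallel mode is just a matter of flags:
// compile with: g++ -O2 -fopenmp -D_GLIBCXX_PARALLEL filter.cpp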

Optimize this function (in C++)

I have CPU-consuming code, where some function with a loop is executed many times. Every optimization in this loop brings a noticeable performance gain. Question: how would you optimize this loop (there is not much more to optimize though...)?
void theloop(int64_t in[], int64_t out[], size_t N)
{
    // note: max is a member of the enclosing class (see the edit below)
    for(uint32_t i = 0; i < N; i++) {
        int64_t v = in[i];
        max += v;
        if (v > max) max = v;
        out[i] = max;
    }
}
I tried a few things, e.g. I replaced arrays with pointers that were incremented on every loop iteration, but (surprisingly) I lost some performance instead of gaining...
Edit:
changed the name of one variable (itsMaximums was an error)
the function is a method of a class
in and out are int64_t, so they can be negative as well as positive
(v > max) can evaluate to true: consider the situation when the actual max is negative
the code runs on a 32-bit PC (development) and a 64-bit one (production)
N is unknown at compile time
I tried some SIMD, but failed to increase performance... (the overhead of moving the variables to _m128i, executing, and storing back was higher than the SSE speed gain. I am not an expert on SSE, though, so maybe I had poor code)
Results:
I added some loop unfolding, and a nice hack from Alex's post. Below I paste some results:
original: 14.0s
unfolded loop (4 iterations): 10.44s
Alex's trick: 10.89s
2) and 3) at once: 11.71s
Strange that 4) is not faster than 2) or 3) alone. Below is the code for 4):
for(size_t i = 1; i < N; i+=CHUNK) { // CHUNK == 4 in this version
    int64_t t_in0 = in[i+0];
    int64_t t_in1 = in[i+1];
    int64_t t_in2 = in[i+2];
    int64_t t_in3 = in[i+3];
    max &= -max >> 63;
    max += t_in0;
    out[i+0] = max;
    max &= -max >> 63;
    max += t_in1;
    out[i+1] = max;
    max &= -max >> 63;
    max += t_in2;
    out[i+2] = max;
    max &= -max >> 63;
    max += t_in3;
    out[i+3] = max;
}
First, you need to look at the generated assembly. Otherwise you have no way of knowing what actually happens when this loop is executed.
Now: is this code running on a 64-bit machine? If not, those 64-bit additions might hurt a bit.
This loop seems an obvious candidate for using SIMD instructions. SSE2 supports a number of SIMD instructions for integer arithmetics, including some that work on two 64-bit values.
Other than that, see if the compiler properly unrolls the loop, and if not, do so yourself. Unroll a couple of iterations of the loop, and then reorder the hell out of it. Put all the memory loads at the top of the loop, so they can be started as early as possible.
For the if line, check that the compiler is generating a conditional move, rather than a branch.
Finally, see if your compiler supports something like the restrict/__restrict keyword. It's not standard in C++, but it is very useful for indicating to the compiler that in and out do not point to the same addresses.
Is the size (N) known at compile-time? If so, make it a template parameter (and then try passing in and out as references to properly-sized arrays, as this may also help the compiler with aliasing analysis)
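A minimal sketch of that suggestion (the signature is hypothetical; it only applies where N is a compile-time constant at the call site):
#include <cstddef>
#include <cstdint>

template <std::size_t N>
void theloop(const std::int64_t (&in)[N], std::int64_t (&out)[N])
{
    std::int64_t max = 0; // assumption: the real code would start from the class member's value
    for (std::size_t i = 0; i < N; i++) {
        std::int64_t v = in[i];
        max += v;
        if (v > max) max = v;
        out[i] = max;
    }
}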
Just some thoughts off the top of my head. But again, study the disassembly. You need to know what the compiler does for you, and especially, what it doesn't do for you.
Edit
with your edit:
max &= -max >> 63;
max += t_in0;
out[i+0] = max;
what strikes me is that you added a huge dependency chain.
Before the result can be computed, max must be negated, the result must be shifted, the result of that must be and'ed together with its original value, and the result of that must be added to another variable.
In other words, all these operations have to be serialized. You can't start one of them before the previous has finished. That's not necessarily a speedup. Modern pipelined out-of-order CPUs like to execute lots of things in parallel. Tying it up with a single long chain of dependent instructions is one of the most crippling things you can do. (Of course, if it can be interleaved with other iterations, it might work out better. But my gut feeling is that a simple conditional move instruction would be preferable.)
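For comparison, a sketch of a loop that keeps the dependency chain short (assuming the compiler turns the ternary into a conditional move, which is worth verifying in the disassembly):
#include <cstddef>
#include <cstdint>

void theloop_cmov(const int64_t in[], int64_t out[], size_t N)
{
    int64_t max = 0; // assumption: starting value
    for (size_t i = 0; i < N; ++i) {
        int64_t v = in[i];
        int64_t sum = max + v;
        max = v > sum ? v : sum; // critical path per element: one add plus one cmov
        out[i] = max;
    }
}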
Announcement: see the chat (https://chat.stackoverflow.com/rooms/5056/discussion-between-sehe-and-jakub-m)
Hi Jakub, what would you say if I have found a version that uses a heuristic optimization that, for random data distributed uniformly, will result in a ~3.2x speed increase for int64_t (10.56x effective using floats)?
I have yet to find the time to update the post, but the explanation and code can be found through the chat.
I used the same test-bed code (below) to verify that the results are correct and exactly match the original implementation from your OP.
Edit: ironically... that testbed had a fatal flaw, which rendered the results invalid: the heuristic version was in fact skipping parts of the input, but because the existing output wasn't being cleared, it appeared to have the correct output... (still editing...)
Ok, I have published a benchmark based on your code versions, and also my proposed use of partial_sum.
Find all the code here https://gist.github.com/1368992#file_test.cpp
Features
For a default config of
#define MAGNITUDE 20
#define ITERATIONS 1024
#define VERIFICATION 1
#define VERBOSE 0
#define LIMITED_RANGE 0 // hide difference in output due to absence of overflows
#define USE_FLOATS 0
It will (see output fragment here):
run 100 x 1024 iterations (i.e. 100 different random seeds)
for data length 1048576 (2^20).
The random input data is uniformly distributed over the full range of the element data type (int64_t)
Verify output by generating a hash digest of the output array and comparing it to the reference implementation from the OP.
Results
There are a number of (surprising or unsurprising) results:
there is no significant performance difference between any of the algorithms whatsoever (for integer data), provided you are compiling with optimizations enabled. (See Makefile; my arch is 64bit, Intel Core Q9550 with gcc-4.6.1)
The algorithms are not equivalent (you'll see the hash sums differ): notably, the bit fiddle proposed by Alex doesn't handle integer overflow in quite the same way. This can be hidden by defining
#define LIMITED_RANGE 1
which limits the input data so overflows won't occur. Note that the partial_sum_incorrect version shows equivalent C++ non-bitwise arithmetic operations that yield the same (different) results:
return max<0 ? v : max + v;
Perhaps it is ok for your purpose?
Surprisingly, it is not more expensive to calculate both definitions of the max algorithm at once. You can see this being done inside partial_sum_correct: it calculates both 'formulations' of max in the same loop. This is really no more than a trivia item here, because neither of the two methods is significantly faster...
Even more surprisingly, a big performance boost can be had when you are able to use float instead of int64_t. A quick and dirty hack can be applied to the benchmark:
#define USE_FLOATS 1
showing that the STL-based algorithm (partial_sum_incorrect) runs approximately 2.5x faster when using float instead of int64_t (!!!). Note:
that the naming of partial_sum_incorrect only relates to integer overflow, which doesn't apply to floats; this can be seen from the fact that the hashes match up, so in fact it is partial_sum_float_correct :)
that the current implementation of partial_sum_correct is doing double work that causes it to perform badly in floating-point mode. See bullet 3.
(And there was that off-by-1 bug in the loop-unrolled version from the OP I mentioned before)
Partial sum
For your interest, the partial sum application looks like this in C++11:
std::partial_sum(data.begin(), data.end(), output.begin(),
    [](int64_t max, int64_t v) -> int64_t
    {
        max += v;
        if (v > max) max = v;
        return max;
    });
Sometimes, you need to step back and look at it again. The first question is obviously: do you need this? Could there be an alternative algorithm that would perform better?
That being said, and supposing for the sake of this question that you already settled on this algorithm, we can try and reason about what we actually have.
Disclaimer: the method I am describing is inspired by the successful method Tim Peters used to improve the traditional introsort implementation, leading to TimSort. So please bear with me ;)
1. Extracting Properties
The main issue I can see is the dependency between iterations, which will prevent much of the possible optimizations and thwart many attempts at parallelizing.
int64_t v = in[i];
max += v;
if (v > max) max = v;
out[i] = max;
Let us rework this code in a functional fashion:
max = calc(in[i], max);
out[i] = max;
Where:
int64_t calc(int64_t const in, int64_t const max) {
    int64_t const bumped = max + in;
    return in > bumped ? in : bumped;
}
Or rather, a simplified version (barring overflow, since it's undefined):
int64_t calc(int64_t const in, int64_t const max) {
    return 0 > max ? in : max + in;
}
Do you notice the tipping point? The behavior changes depending on whether the ill-named(*) max is positive or negative.
This tipping point makes it interesting to watch the values in in more closely, especially according to the effect they might have on max:
max < 0 and in[i] < 0 then out[i] = in[i] < 0
max < 0 and in[i] > 0 then out[i] = in[i] > 0
max > 0 and in[i] < 0 then out[i] = (max + in[i]), sign unknown
max > 0 and in[i] > 0 then out[i] = (max + in[i]) > 0
(*) ill-named because it is also an accumulator, which the name hides. I have no better suggestion though.
2. Optimizing operations
This leads us to discover interesting cases:
if we have a slice [i, j) of the array containing only negative values (which we call negative slice), then we could do a std::copy(in + i, in + j, out + i) and max = out[j-1]
if we have a slice [i, j) of the array containing only positive values, then it's a pure accumulation code (which can easily be unrolled)
max gets positive as soon as in[i] is positive
Therefore, it could be interesting (but maybe not, I make no promise) to establish a profile of the input before actually working with it. Note that the profile could be made chunk by chunk for large inputs, for example tuning the chunk size based on the cache line size.
For references, the 3 routines:
void copy(int64_t const in[], int64_t out[],
          size_t const begin, size_t const end)
{
    std::copy(in + begin, in + end, out + begin);
} // copy

void accumulate(int64_t const in[], int64_t out[],
                size_t const begin, size_t const end)
{
    assert(begin != 0);
    int64_t max = out[begin - 1];
    for (size_t i = begin; i != end; ++i) {
        max += in[i];
        out[i] = max;
    }
} // accumulate

void regular(int64_t const in[], int64_t out[],
             size_t const begin, size_t const end)
{
    assert(begin != 0);
    int64_t max = out[begin - 1];
    for (size_t i = begin; i != end; ++i) {
        max = 0 > max ? in[i] : max + in[i];
        out[i] = max;
    }
} // regular
Now, supposing that we can somehow characterize the input using a simple structure:
struct Slice {
    enum class Type { Negative, Neutral, Positive };
    Type type;
    size_t begin;
    size_t end;
};

typedef void (*Func)(int64_t const[], int64_t[], size_t, size_t);

Func select(Slice::Type t) {
    switch (t) {
    case Slice::Type::Negative: return &copy;
    case Slice::Type::Neutral:  return &regular;
    case Slice::Type::Positive: return &accumulate;
    }
    return nullptr; // unreachable; keeps the compiler happy
}

void theLoop(std::vector<Slice> const& slices, int64_t const in[], int64_t out[]) {
    for (Slice const& slice : slices) {
        Func const f = select(slice.type);
        (*f)(in, out, slice.begin, slice.end);
    }
}
Now, unlike introsort, the work in the loop is minimal, so computing the characteristics might be too costly as is... however it lends itself well to parallelization.
3. Simple parallelization
Note that the characterization is a pure function of the input. Therefore, supposing that you work in a chunk per chunk fashion, it could be possible to have, in parallel:
Slice Producer: a characterizer thread, which computes the Slice::Type value
Slice Consumer: a worker thread, which actually executes the code
Even if the input is essentially random, provided the chunk is small enough (for example, a CPU L1 cache line) there might be chunks for which it does work. Synchronization between the two threads can be done with a simple thread-safe queue of Slice (producer/consumer), adding a bool last attribute to stop consumption, or by creating the Slices in a vector with an Unknown type and having the consumer block until it's known (using atomics).
Note: because characterization is pure, it's embarrassingly parallel.
4. More Parallelization: Speculative work
Remember this innocent remark: max gets positive as soon as in[i] is positive.
Suppose that we can guess (reliably) that Slice[j-1] will produce a max value that is negative; then the computations on Slice[j] are independent of what preceded them, and we can start the work right now!
Of course, it's a guess, so we might be wrong... but once we have fully characterized all the Slices, we have idle cores, so we might as well use them for speculative work! And if we're wrong ? Well, the Consumer thread will simply gently erase our mistake and replace it with the correct value.
The heuristic to speculatively compute a Slice should be simple, and it will have to be tuned. It may be adaptive as well... but that may be more difficult!
Conclusion
Analyze your dataset and try to find if it's possible to break dependencies. If it is you can probably take advantage of it, even without going multi-thread.
If the values of max and in[] are far away from the 64-bit min/max (say, they are always between -2^61 and +2^61), you may try a loop without the conditional branch, which may be causing some perf degradation:
for(uint32_t i = 1; i < N; i++) {
    max &= -max >> 63; // assuming >> does an arithmetic shift with sign extension
    max += in[i];
    out[i] = max;
}
(The trick: -max >> 63 is all ones when max is positive and all zeroes when max is zero or negative, so the and either keeps max or clears it before the addition, reproducing max < 0 ? in[i] : max + in[i].) In theory the compiler may do a similar trick as well, but without seeing the disassembly, it's hard to tell if it does.
The code appears already pretty fast. Depending on the nature of the in array, you could try special-casing: for instance, if you happen to know that in a particular invocation all the input numbers are positive, out[i] will be equal to the cumulative sum, with no need for an if branch.
Ensuring the method isn't virtual, marking it inline or __attribute__((always_inline)), and -funroll-loops seem like good options to explore.
Only by benchmarking them can you determine if they are worthwhile optimizations in your bigger program.
The only thing that comes to mind that might help a small bit is to use pointers rather than array indices within your loop, something like
void theloop(int64_t in[], int64_t out[], size_t N)
{
    int64_t max = in[0];
    out[0] = max;
    int64_t *ip = in + 1, *op = out + 1;
    for(uint32_t i = 1; i < N; i++) {
        int64_t v = *ip;
        ip++;
        max += v;
        if (v > max) max = v;
        *op = max;
        op++;
    }
}
The thinking here is that an index into an array is liable to compile into taking the base address of the array, multiplying the element size by the index, and adding the result to get the element address. Keeping running pointers avoids this. I'm guessing a good optimizing compiler will do this already, so you'd need to study the current assembler output.
int64_t max = 0;
size_t i = 0;
for (size_t left = N / 2; left > 0; --left) /* comparing with 0 is faster */
{
    max = max < 0 ? in[i] : max + in[i]; /* avoids v = in[i]; max += v */
    out[i] = max;
    ++i;
    max = max < 0 ? in[i] : max + in[i]; /* unrolled once: halves the loop checks */
    out[i] = max;
    ++i;
}
if (i < N) /* when N is odd */
{
    max = max < 0 ? in[i] : max + in[i];
    out[i] = max;
}

g++ -O optimizations

I have the following code:
#include <iostream>

int main()
{
    int n = 100;
    long a = 0;
    int *x = new int[n];
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                for (int l = 0; l < n; ++l)
                {
                    a += x[(j + k + l) % 100];
                }
    std::cout << a << std::endl;
    delete[] x;
    return 0;
}
If I compile without optimizations, g++ test.cc, and then run time ./a.out, it reports 0.7s. However, when I compile with -O, the time decreases by a factor of 2.
Compiler used
g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2
My question
How can I rewrite the code so that, when compiling without -O, I obtain the same time (or close to it)? In other words, how do I optimize nested loops manually?
Why I ask
I have similar code which runs about 4 times faster with -O optimization.
PS: I've looked in the compiler manual, but there are way too many flags there to test which one really makes the difference.
Most of the things the compiler optimizes with -O are at a level below C++. For example, without optimization all variables live in memory, including your loop variables. Therefore the compiler will most likely, on each iteration of the inner loop, first read the loop variable to compare it with the limit, load it again inside the loop to use it as the index, and then at the end of the loop read the value once more, increment it, and write it back. With optimization, it will notice that the loop variable is not changed in the loop body and therefore does not need to be re-read from memory each time. Moreover, it will also note that the address of the variable is never taken, so no other code will ever access it, and therefore writing it to memory can also be omitted. That is, the loop variable can live in a register only. This optimization alone saves three hundred million memory reads and one hundred million memory writes during execution of your function. But since things like processor registers and memory reads/writes are not exposed at the language level, there's no way to express this optimization at the language level.
Furthermore, it doesn't make sense to hand-optimize things which the compiler optimizes anyway. Better to spend your time optimizing things which the compiler cannot.
This code has undefined behaviour, so the optimizer is actually allowed to do whatever it wants:
a += x[j + k + l % 100];
should be:
a += x[(j + k + l) % 100];
If you fix this, I still don't get why you want to optimize something that actually doesn't do anything...
Personally I would optimize this to: :)
std::cout << 0 << std::endl;
Note: remove the loop over i and do std::cout << a * n << std::endl; instead.
You could observe that your loops form a set of coefficients coeff[i], such that a single loop summing x[i] * coeff[i] produces the same total as the nested loops. If we assume your index was meant to be (i + j + k) % 100, then coeff[i] = n * n * n for all i, so your whole program simplifies to
long coeff = n * n * n;
for (int i = 0; i < n; ++i)
    a += x[i] * coeff;
which I'm sure runs in far less than 0.7 seconds.
You're using a polynomial-time algorithm (O(n^4)), so it stands to reason that it'll be slow, especially for larger n. Perhaps an algorithm with lower complexity would help you?
If you want your code optimized you can either ask the compiler to optimize for you, or just write it in assembly language and be done with it. Don't try to second-guess the C++ compiler.
How can I rewrite the code so when compiling without -O I can obtain same time ( or close to it ) ? In other words, how to optimize nested loops manually ?
You write them in assembly. Good luck with that, BTW.
The purpose of the optimization switches in compilers is for you to say, "Compiler, I want you to try to generate fast assembly." By default, compilers do not optimize, because those optimizations might inhibit the debuggability of the resulting code. So you have to ask for them specifically.
The things the compiler does to optimize code are, generally speaking, not something you can do manually. Some can be (like loop unrolling), but others can't be.