How to make this code faster (learning best practices)? [closed] - c++

I have this little loop here, and I was wondering if I'm making some big mistake, performance-wise.
For example, is there a way to rewrite parts of it differently to make vectorization possible (assuming GCC 4.8.1 and all vectorization-friendly flags enabled)?
Is this the best way to pass a list of numbers (const float name_of_var[])?
The idea of the code is to take a vector (in the mathematical sense, not necessarily a std::vector) of unsorted numbers y and two bound values (ox[0] <= ox[1]) and to store in a vector of integers rdx the indices i of the entries of y satisfying ox[0] <= y[i] <= ox[1].
rdx can hold m elements, y has capacity n, and n > m. If there are more than m values of y[i] satisfying ox[0] <= y[i] <= ox[1], the code should return the first m.
Thanks in advance,
void foo(const int n, const int m, const float y[], const float ox[], int rdx[]){
    int d0, j = 0, i = 0;
    for(;;){
        i++;
        d0 = ((y[i] >= ox[0]) + (y[i] <= ox[1])) / 2;
        if(d0 == 1){
            rdx[j] = i;
            j++;
        }
        if(j == m) break;
        if(i == n - 1) break;
    }
}

d0=((y[i]>=ox[0])+(y[i]<=ox[1]))/2;
if(d0==1)
I believe the use of an intermediate variable is useless and costs a few extra cycles.
This is the most optimized version I could think of, but it's totally unreadable...
void foo(int n, int m, float y[], const float ox[], int rdx[])
{
    for(int i = 0; i < n && m != 0; i++)
    {
        if(*y >= *ox && *y <= ox[1])
        {
            *rdx = i;
            rdx++;
            m--;
        }
        y++;
    }
}
I think the following version with a decent optimisation flag should do the job
void foo(int n, int m, const float y[], const float ox[], int rdx[])
{
    for(int j = 0, i = 0; j < m && i < n; i++) // Reorder to put the condition with the highest probability to fail first
    {
        if(y[i] >= ox[0] && y[i] <= ox[1])
        {
            rdx[j++] = i;
        }
    }
}
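A tiny usage example for sanity checking this version (the values below are made up, not from the question):
#include <iostream>

// foo as defined just above
int main()
{
    const float y[]  = {0.5f, 3.0f, 1.5f, 2.0f, 9.0f, 1.0f};
    const float ox[] = {1.0f, 2.0f};
    int rdx[3] = {-1, -1, -1};
    foo(6, 3, y, ox, rdx);
    for (int r : rdx) std::cout << r << ' ';   // expected output: 2 3 5
    std::cout << '\n';
}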

Just to make sure I'm correct: you're trying to find the first m+1 (if it's actually m, do j == m-1) values that are in the range of [ ox[0], ox[1] ]?
If so, wouldn't it be better to do:
for (int i=0, j=0;;++i) {
    if (y[i] < ox[0]) continue;
    if (y[i] > ox[1]) continue;
    rdx[j] = i;
    j++;
    if (j == m || i == n-1) break;
}
If y[i] is indeed in the range, you must perform both comparisons, as we both do.
If y[i] is under ox[0], no need to perform the second comparison.
I avoid the use of division.

A. Yes, passing the float array as float[] is not only efficient, it is the only way (and is identical to a float * argument).
A1. But in C++ you can use better types without performance loss. Accessing a vector or array (the standard library container) should not be slower than accessing a plain C style array. I would strongly advise you to use those. In modern C++ there is also the possibility to use iterators and functors; I am no expert there but if you can express the independence of operations on different elements by being more abstract you may give the compiler the chance to generate code that is more suitable for vectorization.
B. You should replace the division by a logical AND, operator&&. The first advantage is that the second condition is not evaluated at all if the first one is false -- this could be your most important performance gain here. The second advantage is expressiveness and thus readability.
C. The intermediate variable d0 will probably disappear when you compile with -O3, but it's unnecessary nonetheless.
The rest is OK performance-wise. Idiomatically there is room for improvement, as has been shown already.
D. I am not sure about a chance for vectorization with the code as presented here. The compiler will probably do some loop unrolling at -O3; try to let it emit SSE code (cf. http://gcc.gnu.org/onlinedocs/, specifically http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/i386-and-x86-64-Options.html#i386-and-x86-64-Options). Who knows.
Oh, I just realized that your original code passes the constant interval boundaries as an array with 2 elements, ox[]. Since array access is an unnecessary indirection and as such may carry an overhead, using two normal float parameters would be preferred here. Keep them const like your array. You could also name them nicely.
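To pull A1, B, and that last remark together, here is a minimal sketch (mine, not from the question) of the same loop written with std::vector, operator&&, and the two bounds passed as separate named floats; the function name and signature are my own invention:
#include <vector>

// Sketch only: writes at most rdx.size() matching indices and returns how many were written.
int find_in_range(const std::vector<float>& y, float lo, float hi, std::vector<int>& rdx)
{
    int j = 0;
    const int m = static_cast<int>(rdx.size());
    for (int i = 0; i < static_cast<int>(y.size()) && j < m; ++i) {
        if (y[i] >= lo && y[i] <= hi)   // && short-circuits the second test
            rdx[j++] = i;
    }
    return j;
}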


Lower time complexity of two for loop and optimize this to become 1 for loop [closed]

I want to optimize this loop. Its time complexity is O(n²). I want something like O(n) or O(log n).
for (int i = 1; i <= n; i++) {
for (int j = i+1; j <= n; j++) {
if (a[i] != a[j] && a[a[i]] == a[a[j]]) {
x = 1;
break;
}
}
}
The a[i] satisfy 1 <= a[i] <= n.
This is what I will try:
Let us call B the image of a[], i.e. the set {a[i]}: B = {b[k], k = 1..K, such that there exists an i with a[i] = b[k]}.
For each value b[k], k = 1..K, determine the set Ck = {i; a[i] = b[k]}.
Determining B and the Ck can be done in linear time.
Then let us examine the sets Ck one by one.
If Card(Ck) = 1: k++
If Card(Ck) > 1: if two elements of Ck are elements of B, then x = 1; else k++
I will use a table (std::vector<bool>) to memorize whether an element of 1..N belongs to B or not.
I hope I haven't made a mistake. No time to write a program just now. I could do it later on, but I guess you will be able to do it easily.
Note: I discovered after sending this answer that @Mike Borkland proposed something similar already in a comment...
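Since the answer above stops short of code, here is a minimal sketch of that approach (the names has_pair, inB and C are mine, and a[] is stored 1-based in a vector of size n+1, as in the problem statement):
#include <vector>

// Group the indices by the value a[i] they map to, then look for a group that
// contains two indices which themselves occur as values in a[] (i.e. lie in B).
bool has_pair(const std::vector<int>& a, int n)     // a[1..n], values in [1, n]
{
    std::vector<bool> inB(n + 1, false);            // inB[v]: v appears somewhere in a[]
    std::vector<std::vector<int>> C(n + 1);         // C[v]: indices i with a[i] == v
    for (int i = 1; i <= n; ++i) {
        inB[a[i]] = true;
        C[a[i]].push_back(i);
    }
    for (int v = 1; v <= n; ++v) {
        int hits = 0;
        for (int i : C[v])
            if (inB[i] && ++hits == 2)              // two elements of C[v] belong to B
                return true;                        // i.e. x = 1 in the original loop
    }
    return false;
}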
Since sometimes you need to see a solution to learn, I'm providing you with a small function that does the job you want. I hope it helps.
#define MIN 1
#define MAX 100000 // 10^5
int seek (int *arr, int arr_size)
{
    if(arr_size > MAX || arr_size < MIN)
        return 0;
    // The values are used directly as indices, so the helper arrays need
    // arr_size + 1 entries (values run from 1 to arr_size); the same assumption
    // applies to arr itself when reading arr[arr[i]].
    // Note: variable-length arrays are a compiler extension, not standard C++.
    unsigned char seen[arr_size + 1];    // seen[t]: some earlier value v had arr[v] == t
    unsigned char indices[arr_size + 1]; // indices[v]: the value v already occurred in arr[]
    memset(seen, 0, arr_size + 1);
    memset(indices, 0, arr_size + 1);
    for(int i = 0; i < arr_size; i++)
    {
        if (arr[i] <= arr_size && arr[i] >= MIN && !indices[arr[i]] && seen[arr[arr[i]]])
            return 1;
        else
        {
            seen[arr[arr[i]]] = 1;
            indices[arr[i]] = 1;
        }
    }
    return 0;
}
Ok, how and why does this work? First, let's take a look at the problem the original algorithm is trying to solve; they say a well-stated problem is half of the solution. The problem is to find whether, in a given integer array A of size n whose elements are bounded between one and n ([1, n]), there exist two elements of A, x and y, such that x != y and A[x] = A[y] (the array evaluated at the indices x and y, respectively). Furthermore, we are seeking an algorithm with good time complexity, so that for n = 10000 the implementation runs within one second.
To begin with, let's analyze the problem. In the worst case the array needs to be scanned completely at least once to decide whether such a pair of elements exists, so we can't do better than O(n). But how would you do that? One possible way is to scan the array and record which values a[i] have already appeared; this can be done in another array B (of size n). Likewise, record which targets a[a[i]] have already appeared; this can be done in a third array C. If, while scanning, the current value a[i] has not appeared before but its target a[a[i]] has, then return yes. I have to say that this is a "classical trick" of using hash-table-like data structures.
The original tasks were: i) to reduce the time complexity (from O(n^2)), and ii) to make sure the implementation runs within a second for an array of size 10000. The proposed algorithm runs in O(n) time and space complexity. I tested with random arrays and it seems the implementation does its job much faster than required.
Edit: My original answer wasn't very useful, thanks for pointing that out. After checking the comments, I figured the code could help a bit.
Edit 2: I also added the explanation on how it works so it might be useful. I hope it helps :)
I want to optimize this loop. Its time complexity is O(n²). I want something like O(n) or O(log n).
Well, the easiest thing is to sort the array first. That's O(n log(n)), and then a linear scan looking for two adjacent equal elements is O(n), so the dominant complexity is unchanged at O(n log(n)).
You know how to use std::sort, right? And you know the complexity is O(n log(n))?
And you can figure out how to call std::adjacent_find, and you can see that the complexity must be linear?
The best possible complexity is linear time. This only allows us to make a constant number of linear traversals of the array. That means that if we need a lookup to determine, for each element, whether we have seen that value before, it needs to be constant time.
Do you know any data structures with constant time insertion and lookups? If so, can you write a simple one-pass loop?
Hint: std::unordered_set is the general solution for constant-time membership tests, and Damien's suggestion of std::vector<bool> is potentially more efficient for your particular case.
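Following that hint, a minimal one-pass sketch (mine, not from the answer) using std::unordered_set; it assumes, as in the question, that a[] is 1-based with values in [1, n]:
#include <cstddef>
#include <unordered_set>
#include <vector>

bool has_pair(const std::vector<int>& a)      // a[1..n] stored with a[0] unused
{
    std::unordered_set<int> seen_values;      // values a[i] met so far
    std::unordered_set<int> seen_targets;     // values a[a[i]] met so far
    for (std::size_t i = 1; i < a.size(); ++i) {
        // a brand-new value whose target was already produced by a *different* value
        if (!seen_values.count(a[i]) && seen_targets.count(a[a[i]]))
            return true;                      // x = 1 in the original loop
        seen_values.insert(a[i]);
        seen_targets.insert(a[a[i]]);
    }
    return false;
}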

Which has better memory access ? (C++) [duplicate]

This question already has answers here: Which ordering of nested loops for iterating over a 2D array is more efficient (10 answers)
Which version is more efficient and why?
It seems that both perform the same computations. The only thing I can think of is whether the compiler recognizes that in (a) j does not change value inside the inner loop, so it doesn't have to be recomputed over and over again.
Any input would be great!
#define M /* some mildly large number */
double a[M*M], x[M], c[M];
int i, j;
(a) First version
for (j = 0; j < M; j++)
    for (i = 0; i < M; i++)
        c[j] += a[i+j*M]*x[i];
(b) Second version
for (i = 0; i < M; i++)
    for (j = 0; j < M; j++)
        c[j] += a[i+j*M]*x[i];
This is about memory-access patterns rather than computational efficiency. In general (a) is faster because it accesses memory with unit stride, which is much more cache-efficient than (b), which has a stride of M. In case (a) each cache line is fully utilised, whereas with (b) it is possible that only one array element will be used from each cache line before it is evicted.
Having said that, some compilers can perform loop reordering optimisations, so in practice you may not see any difference if that happens. As always, you should benchmark/profile your code, rather than just guessing.
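As a rough illustration of how such a benchmark might look (the value of M and the timing code are my own arbitrary choices, not from the question):
#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    const int M = 2000;
    std::vector<double> a(M * M, 1.0), x(M, 1.0), c(M, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    for (int j = 0; j < M; j++)          // (a): unit stride over a[]
        for (int i = 0; i < M; i++)
            c[j] += a[i + j * M] * x[i];
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < M; i++)          // (b): stride-M over a[]
        for (int j = 0; j < M; j++)
            c[j] += a[i + j * M] * x[i];
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "version (a): " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "version (b): " << std::chrono::duration<double>(t2 - t1).count() << " s\n"
              << "checksum: " << c[0] << "\n";   // keep c live so the loops aren't optimized away
}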

How to check if two n-sized vectors are linearly dependant on C++? [closed]

A program should be made which determines whether the two vectors a = (a0, a1, ..., an-1) and b = (b0, b1, ..., bn-1) (1 ≤ n ≤ 20) are linearly dependent. The input should be n and the coordinates of the two vectors, and the output should be 1 if the vectors are linearly dependent, else 0.
I've been struggling for hours over this now and I've got absolutely nothing. I know only basic C++ stuff and my geometry sucks way too much. I'd be really thankful if someone would write me a solution or at least give me some hint. Thanks in advance!
#include <iostream>
using namespace std;

int main()
{
    int n;
    double a[20], b[20];
    cin >> n;
    int counter = n;
    bool flag = false;
    for (int i = 0; i < n; i++)
    {
        cin >> a[i];
        cin >> b[i];
    }
    double k;
    for (int i = 0; i < n; i++)
    {
        for (k = 0; k < 1000; k = k + 0.01)
        {
            if (a[i] == b[i])
            {
                counter--;
            }
        }
    }
    if (counter == 0 && k != 0)
        flag = true;
    cout << flag;
    return 0;
}
Apparently that was all I could possibly come up with. The "for" cycle is wrong on so many levels but I don't know how to fix it. I'm open to suggestions.
There are 4 parts to the problem:
1. Math and algorithms
Vectors a and b are linearly dependent if ∃k such that a = k·b. Written out component-wise, that is ∃k such that ai = k·bi for all i = 1..n, and any one of these equations can be solved for k.
So you calculate k as b0 / a0 and check that the same k works for the other dimensions.
Don't forget to handle a0 = 0 (or small, see below). I'd probably swap the vectors so the larger absolute value is denominator.
2. Limited precision numeric calculations
Since the precision is limited, calculations involve rounding error. You need to check for approximate equality, not exact, because most likely you won't get exact results even when you expect them.
Approximate equality comes in two forms, absolute (|x - y| < ε) and relative (1 - ε < |x / y| < 1 + ε). Obviously the relative makes more sense here (you want to ignore the last significant digit only), but again you have to handle the case where the values are too small.
3. C++
Don't use plain arrays, use std::vector. That way you won't have arbitrary limits.
Iterate with iterator, not indices. Iterators work for all container types, indices only work for the few with continuous integral indices and random access. Which is basically just vector and plain array. And note that iterators are designed so that pointer is iterator, so you can iterate with iterator over plain arrays too.
4. Plain old bugs
You have the loop over k, but you don't use the value inside the loop.
The logic with counter does not seem to make any sense. I don't even see what you wanted to achieve with that.
You're right, that code bears no relationship to the problem at all.
It's easier than you think (at least conceptually). Divide each element in one vector by the corresponding element in the other vector. If all those divisions result in the same number, then the vectors are linearly dependent. So {1, 2, 4} and {3, 6, 12} are linearly dependent because 1/3 == 2/6 == 4/12.
However, there are two technical problems. First, you have to consider what happens when your elements are zero; you don't want to divide by zero.
Secondly, because you are dealing with floating point numbers, it's not sufficient to test whether two numbers are equal. Because of rounding errors they often won't be. So you have to come up with some test to see if two numbers are nearly equal.
I'll leave you to think about both those problems.
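For what it's worth, here is a minimal sketch of the approach both answers describe; the helper name nearly_equal and the tolerance are my own choices, and it simply takes the first non-zero coordinate of a as the pivot rather than the larger of the two values:
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

bool nearly_equal(double x, double y, double eps = 1e-9)
{
    // mixed absolute/relative comparison so values near zero are handled too
    return std::fabs(x - y) <= eps * std::max({1.0, std::fabs(x), std::fabs(y)});
}

bool linearly_dependent(const std::vector<double>& a, const std::vector<double>& b)
{
    std::size_t p = a.size();                  // pivot: first coordinate of a that isn't (numerically) zero
    for (std::size_t i = 0; i < a.size(); ++i)
        if (!nearly_equal(a[i], 0.0)) { p = i; break; }
    if (p == a.size())                         // a is the zero vector: always dependent
        return true;
    const double k = b[p] / a[p];              // candidate factor with b = k * a
    for (std::size_t i = 0; i < a.size(); ++i)
        if (!nearly_equal(b[i], k * a[i]))
            return false;
    return true;
}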

C++ sieve of Eratosthenes with array [closed]

I'd like to code the famous Sieve of Eratosthenes in C++ using just an array, treating it as a set from which I can delete some elements along the way to find the prime numbers.
I don't want to use the STL (vector, set)... just an array! How can I do it?
Let me explain why I don't want to use an STL set: I'm learning C++ from the very beginning, and while the STL is of course useful, it is a library built on top of the core language, so for now I'd like to stick to the basic operators and constructs. I know that everything would be easier with the STL.
The key to the sieve of Eratosthenes's efficiency is that it does not, repeat not, delete / remove / throw away / etc. the composites as it enumerates them, but instead just marks them as such.
Keeping all the numbers preserves our ability to use a number's value as its address in this array and thus directly address it: array[n]. This is what makes the sieve's enumeration and marking off of each prime's multiples efficient, when implemented on modern random-access memory computers (just as with the integer sorting algorithms).
To make that array simulate a set, we give each entry two possible values, flags: on and off, prime or composite, 1 or 0. Yes, we actually only need one bit, not byte, to represent each number in the sieve array, provided we do not remove any of them while working on it.
And btw, vector<bool> is automatically packed, representing bools by bits. Very convenient.
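To illustrate the one-bit-per-number idea without any STL containers, here is a minimal sketch (mine, not from the answer) that packs the flags into a plain array of bytes:
#include <cstring>
#include <iostream>

// Sieve up to `limit` using one bit per number, stored in a plain byte array.
// Sketch: assumes a moderate limit so m * m does not overflow.
void sieve_bits(unsigned limit)
{
    unsigned const nbytes = limit / 8 + 1;
    unsigned char *composite = new unsigned char[nbytes];
    std::memset(composite, 0, nbytes);
    for (unsigned m = 2; m * m <= limit; ++m)
        if (!(composite[m / 8] & (1u << (m % 8))))        // m is still marked prime
            for (unsigned k = m * m; k <= limit; k += m)  // mark its multiples composite
                composite[k / 8] |= 1u << (k % 8);
    for (unsigned m = 2; m <= limit; ++m)
        if (!(composite[m / 8] & (1u << (m % 8))))
            std::cout << m << ' ';
    std::cout << '\n';
    delete[] composite;
}
Calling sieve_bits(1000) from main() prints the primes up to 1000, matching the array version below.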
From Algorithms and Data Structures
#include <iostream>
#include <cmath>
#include <cstring>
using namespace std;

void runEratosthenesSieve(int upperBound) {
    int upperBoundSquareRoot = (int)sqrt((double)upperBound);
    bool *isComposite = new bool[upperBound + 1];
    memset(isComposite, 0, sizeof(bool) * (upperBound + 1));
    for (int m = 2; m <= upperBoundSquareRoot; m++) {
        if (!isComposite[m]) {
            cout << m << " ";
            for (int k = m * m; k <= upperBound; k += m)
                isComposite[k] = true;
        }
    }
    // start just past the square root so that value is not printed twice
    for (int m = upperBoundSquareRoot + 1; m <= upperBound; m++)
        if (!isComposite[m])
            cout << m << " ";
    delete [] isComposite;
}

int main()
{
    runEratosthenesSieve(1000);
}
You say you don't want to use the STL, but that's not a good idea:
the STL makes life much simpler.
Still, consider this implementation using std::set:
// needs: #include <set> and #include <iostream>
typedef std::set<int> S;

int max = 100;
S sieve;
for(int it = 2; it < max; ++it)
    sieve.insert(it);
for(S::iterator it = sieve.begin(); it != sieve.end(); ++it)
{
    int prime = *it;
    S::iterator x = it;
    ++x;
    while(x != sieve.end())
        if (((*x) % prime) == 0)
            sieve.erase(x++);   // erase the current element and advance; safe for std::set
        else
            ++x;
}
for(S::iterator it = sieve.begin(); it != sieve.end(); ++it)
    std::cout << *it << std::endl;

Optimize this function (in C++)

I have a CPU-consuming piece of code in which some function with a loop is executed many times. Every optimization in this loop brings a noticeable performance gain. Question: how would you optimize this loop (there is not much more to optimize though...)?
void theloop(int64_t in[], int64_t out[], size_t N)
{
    // note: max is a member of the enclosing class (see the edit below)
    for(uint32_t i = 0; i < N; i++) {
        int64_t v = in[i];
        max += v;
        if (v > max) max = v;
        out[i] = max;
    }
}
I tried a few things, e.g. I replaced the arrays with pointers that were incremented in every loop iteration, but (surprisingly) I lost some performance instead of gaining...
Edit:
changed the name of one variable (itsMaximums, an error)
the function is a method of a class
in and out are int64_t, so can be negative or positive
(v > max) can evaluate to true: consider the situation when the actual max is negative
the code runs on a 32-bit PC (development) and a 64-bit one (production)
N is unknown at compile time
I tried some SIMD, but failed to increase performance... (the overhead of moving the variables to __m128i, executing and storing back was higher than the SSE speed gain. Yet I am not an expert on SSE, so maybe I had poor code)
Results:
I added some loop unrolling, and a nice hack from Alex's post. Below I paste some results:
1) original: 14.0s
2) unrolled loop (4 iterations): 10.44s
3) Alex's trick: 10.89s
4) 2) and 3) at once: 11.71s
Strange that 4) is not faster than 2) or 3). Below is the code for 4):
for(size_t i = 1; i < N; i += CHUNK) {   // CHUNK == 4 here; note the loop starts at i = 1 (the off-by-one mentioned below)
    int64_t t_in0 = in[i+0];
    int64_t t_in1 = in[i+1];
    int64_t t_in2 = in[i+2];
    int64_t t_in3 = in[i+3];
    max &= -max >> 63;
    max += t_in0;
    out[i+0] = max;
    max &= -max >> 63;
    max += t_in1;
    out[i+1] = max;
    max &= -max >> 63;
    max += t_in2;
    out[i+2] = max;
    max &= -max >> 63;
    max += t_in3;
    out[i+3] = max;
}
First, you need to look at the generated assembly. Otherwise you have no way of knowing what actually happens when this loop is executed.
Now: is this code running on a 64-bit machine? If not, those 64-bit additions might hurt a bit.
This loop seems an obvious candidate for using SIMD instructions. SSE2 supports a number of SIMD instructions for integer arithmetics, including some that work on two 64-bit values.
Other than that, see if the compiler properly unrolls the loop, and if not, do so yourself. Unroll a couple of iterations of the loop, and then reorder the hell out of it. Put all the memory loads at the top of the loop, so they can be started as early as possible.
For the if line, check that the compiler is generating a conditional move, rather than a branch.
Finally, see if your compiler supports something like the restrict/__restrict keyword. It's not standard in C++, but it is very useful for indicating to the compiler that in and out do not point to the same addresses.
Is the size (N) known at compile-time? If so, make it a template parameter (and then try passing in and out as references to properly-sized arrays, as this may also help the compiler with aliasing analysis)
Just some thoughts off the top of my head. But again, study the disassembly. You need to know what the compiler does for you, and especially, what it doesn't do for you.
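As a minimal sketch of two of these suggestions (the non-standard __restrict qualifier supported by GCC/Clang, and a ternary that tends to compile to a conditional move), written by me rather than taken from the question:
#include <cstddef>
#include <cstdint>

// Sketch only: __restrict promises the compiler that in and out do not alias.
void theloop_sketch(const int64_t* __restrict in, int64_t* __restrict out,
                    std::size_t N, int64_t max)
{
    for (std::size_t i = 0; i < N; ++i) {
        const int64_t v = in[i];
        const int64_t sum = max + v;
        max = (v > sum) ? v : sum;   // same logic as: max += v; if (v > max) max = v;
        out[i] = max;
    }
}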
Edit
with your edit:
max &= -max >> 63;
max += t_in0;
out[i+0] = max;
what strikes me is that you added a huge dependency chain.
Before the result can be computed, max must be negated, the result must be shifted, the result of that must be and'ed together with its original value, and the result of that must be added to another variable.
In other words, all these operations have to be serialized. You can't start one of them before the previous one has finished. That's not necessarily a speedup. Modern pipelined out-of-order CPUs like to execute lots of things in parallel. Tying it up with a single long chain of dependent instructions is one of the most crippling things you can do. (Of course, if it can be interleaved with other iterations, it might work out better. But my gut feeling is that a simple conditional move instruction would be preferable.)
> #**Announcement** see [chat](https://chat.stackoverflow.com/rooms/5056/discussion-between-sehe-and-jakub-m)
> > _Hi Jakub, what would you say if I told you I have found a version that uses a heuristic optimization that, for random data distributed uniformly, results in a ~3.2x speed increase for `int64_t` (10.56x effective using `float`s)?_
>
I have yet to find the time to update the post, but the explanation and code can be found through the chat.
> I used the same test-bed code (below) to verify that the results are correct and exactly match the original implementation from your OP
**Edit**: ironically... that testbed had a fatal flaw, which rendered the results invalid: the heuristic version was in fact skipping parts of the input, but because existing output wasn't being cleared, it appeared to have the correct output... (still editing...)
Ok, I have published a benchmark based on your code versions, and also my proposed use of partial_sum.
Find all the code here https://gist.github.com/1368992#file_test.cpp
Features
For a default config of
#define MAGNITUDE 20
#define ITERATIONS 1024
#define VERIFICATION 1
#define VERBOSE 0
#define LIMITED_RANGE 0 // hide difference in output due to absence of overflows
#define USE_FLOATS 0
It will (see output fragment here):
run 100 x 1024 iterations (i.e. 100 different random seeds)
for data length 1048576 (2^20).
The random input data is uniformly distributed over the full range of the element data type (int64_t)
Verify output by generating a hash digest of the output array and comparing it to the reference implementation from the OP.
Results
There are a number of (surprising or unsurprising) results:
there is no significant performance difference between any of the algorithms whatsoever (for integer data), provided you are compiling with optimizations enabled. (See Makefile; my arch is 64bit, Intel Core Q9550 with gcc-4.6.1)
The algorithms are not equivalent (you'll see the hash sums differ): notably, the bit fiddle proposed by Alex doesn't handle integer overflow in quite the same way (this can be hidden by defining
#define LIMITED_RANGE 1
which limits the input data so overflows won't occur; note that the partial_sum_incorrect version shows the equivalent C++ non-bitwise arithmetic operation that yields the same (different) results:
return max<0 ? v : max + v;
Perhaps it is ok for your purpose?)
Surprisingly, it is not more expensive to calculate both definitions of the max algorithm at once. You can see this being done inside partial_sum_correct: it calculates both 'formulations' of max in the same loop; this is really no more than a trivia item here, because neither of the two methods is significantly faster...
Even more surprisingly, a big performance boost can be had when you are able to use float instead of int64_t. A quick and dirty hack can be applied to the benchmark
#define USE_FLOATS 1
showing that the STL based algorithm (partial_sum_incorrect) runs approximately 2.5x faster when using float instead of int64_t (!!!). Note:
that the naming of partial_sum_incorrect only relates to integer overflow, which doesn't apply to floats; this can be seen from the fact that the hashes match up, so in fact it is partial_sum_float_correct :)
that the current implementation of partial_sum_correct is doing double work that causes it to perform badly in floating point mode. See bullet 3.
(And there was that off-by-1 bug in the loop-unrolled version from the OP I mentioned before)
Partial sum
For your interest, the partial sum application looks like this in C++11:
std::partial_sum(data.begin(), data.end(), output.begin(),
    [](int64_t max, int64_t v) -> int64_t
    {
        max += v;
        if (v > max) max = v;
        return max;
    });
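A self-contained usage sketch of that call (the vector names and sample values are mine):
#include <cstdint>
#include <iostream>
#include <numeric>   // std::partial_sum
#include <vector>

int main()
{
    std::vector<int64_t> data{3, -7, 2, 5, -1};
    std::vector<int64_t> output(data.size());
    std::partial_sum(data.begin(), data.end(), output.begin(),
                     [](int64_t max, int64_t v) -> int64_t
                     {
                         max += v;
                         if (v > max) max = v;
                         return max;
                     });
    for (int64_t m : output) std::cout << m << ' ';   // prints: 3 -4 2 7 6
    std::cout << '\n';
}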
Sometimes, you need to step backward and look over it again. The first question is obviously, do you need this ? Could there be an alternative algorithm that would perform better ?
That being said, and supposing for the sake of this question that you already settled on this algorithm, we can try and reason about what we actually have.
Disclaimer: the method I am describing is inspired by the successful method Tim Peters used to improve the traditional introsort implementation, leading to TimSort. So please bear with me ;)
1. Extracting Properties
The main issue I can see is the dependency between iterations, which will prevent much of the possible optimizations and thwart many attempts at parallelizing.
int64_t v = in[i];
max += v;
if (v > max) max = v;
out[i] = max;
Let us rework this code in a functional fashion:
max = calc(in[i], max);
out[i] = max;
Where:
int64_t calc(int64_t const in, int64_t const max) {
    int64_t const bumped = max + in;
    return in > bumped ? in : bumped;
}
Or rather, a simplified version (barring overflow, since it's undefined):
int64_t calc(int64_t const in, int64_t const max) {
    return 0 > max ? in : max + in;
}
Do you notice the tipping point? The behavior changes depending on whether the ill-named(*) max is positive or negative.
This tipping point makes it interesting to watch the values in in[] more closely, especially according to the effect they might have on max:
if max < 0 and in[i] < 0 then out[i] = in[i] < 0
if max < 0 and in[i] > 0 then out[i] = in[i] > 0
if max > 0 and in[i] < 0 then out[i] = max + in[i] (sign unknown)
if max > 0 and in[i] > 0 then out[i] = max + in[i] > 0
(*) ill-named because it is also an accumulator, which the name hides. I have no better suggestion though.
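A quick sanity check (mine, not part of the answer) that the simplified calc matches the original formulation, overflow aside:
#include <cassert>
#include <cstdint>

int64_t calc_orig(int64_t in, int64_t max)
{
    int64_t bumped = max + in;
    return in > bumped ? in : bumped;
}

int64_t calc_simple(int64_t in, int64_t max)
{
    return 0 > max ? in : max + in;
}

int main()
{
    const int64_t samples[] = {-5, -1, 0, 1, 7};
    for (int64_t in : samples)
        for (int64_t max : samples)
            assert(calc_orig(in, max) == calc_simple(in, max));   // both agree on small values
}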
2. Optimizing operations
This leads us to discover interesting cases:
if we have a slice [i, j) of the array containing only negative values (which we call negative slice), then we could do a std::copy(in + i, in + j, out + i) and max = out[j-1]
if we have a slice [i, j) of the array containing only positive values, then it's a pure accumulation code (which can easily be unrolled)
max gets positive as soon as in[i] is positive
Therefore, it could be interesting (but maybe not, I make no promise) to establish a profile of the input before actually working with it. Note that the profile could be made chunk by chunk for large inputs, for example tuning the chunk size based on the cache line size.
For reference, the 3 routines:
void copy(int64_t const in[], int64_t out[],
          size_t const begin, size_t const end)
{
    std::copy(in + begin, in + end, out + begin);
} // copy

void accumulate(int64_t const in[], int64_t out[],
                size_t const begin, size_t const end)
{
    assert(begin != 0);
    int64_t max = out[begin-1];
    for (size_t i = begin; i != end; ++i) {
        max += in[i];
        out[i] = max;
    }
} // accumulate

void regular(int64_t const in[], int64_t out[],
             size_t const begin, size_t const end)
{
    assert(begin != 0);
    int64_t max = out[begin - 1];
    for (size_t i = begin; i != end; ++i)
    {
        max = 0 > max ? in[i] : max + in[i];
        out[i] = max;
    }
}
Now, supposing that we can somehow characterize the input using a simple structure:
struct Slice {
    enum class Type { Negative, Neutral, Positive };
    Type type;
    size_t begin;
    size_t end;
};

typedef void (*Func)(int64_t const[], int64_t[], size_t, size_t);

Func select(Slice::Type t) {
    switch(t) {
    case Slice::Type::Negative: return &copy;
    case Slice::Type::Neutral:  return &regular;
    case Slice::Type::Positive: return &accumulate;
    }
    return nullptr;   // unreachable, keeps the compiler quiet
}

void theLoop(std::vector<Slice> const& slices, int64_t const in[], int64_t out[]) {
    for (Slice const& slice: slices) {
        Func const f = select(slice.type);
        (*f)(in, out, slice.begin, slice.end);
    }
}
Now, unlike introsort, the work in the loop here is minimal, so computing the characteristics might be too costly as is... however it lends itself well to parallelization.
3. Simple parallelization
Note that the characterization is a pure function of the input. Therefore, supposing that you work in a chunk per chunk fashion, it could be possible to have, in parallel:
Slice Producer: a characterizer thread, which computes the Slice::Type value
Slice Consumer: a worker thread, which actually executes the code
Even if the input is essentially random, providing the chunk is small enough (for example, a CPU L1 cache line) there might be chunks for which it does work. Synchronization between the two threads can be done with a simple thread-safe queue of Slice (producer/consumer), adding a bool last attribute to stop consumption, or by creating the Slices in a vector with an Unknown type and having the consumer block until the type is known (using atomics).
Note: because characterization is pure, it's embarrassingly parallel.
4. More Parallelization: Speculative work
Remember this innocent remark: max gets positive as soon as in[i] is positive.
Suppose that we can guess (reliably) that Slice[j-1] will produce a max value that is negative; then the computations on Slice[j] are independent of what preceded them, and we can start the work right now!
Of course, it's a guess, so we might be wrong... but once we have fully characterized all the Slices, we have idle cores, so we might as well use them for speculative work! And if we're wrong ? Well, the Consumer thread will simply gently erase our mistake and replace it with the correct value.
The heuristic to speculatively compute a Slice should be simple, and it will have to be tuned. It may be adaptative as well... but that may be more difficult!
Conclusion
Analyze your dataset and try to find if it's possible to break dependencies. If it is you can probably take advantage of it, even without going multi-thread.
If values of max and in[] are far away from the 64-bit min/max (say, they are always between -2^61 and +2^61), you may try a loop without the conditional branch, which may be causing some perf degradation:
for(uint32_t i = 1; i < N; i++) {
    max &= -max >> 63; // assuming >> would do arithmetic shift with sign extension
    max += in[i];
    out[i] = max;
}
In theory the compiler may do a similar trick as well, but without seeing the disassembly, it's hard to tell if it does it.
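For readers puzzled by the mask: with an arithmetic shift, -max >> 63 is all ones when max is positive and zero otherwise, so the AND clamps a non-positive running max to zero before the addition. A more readable equivalent (my restatement, not from the answer) is:
// same effect as the mask trick; the compiler decides whether a cmov or a branch is emitted
max = (max > 0 ? max : 0) + in[i];
out[i] = max;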
The code already appears pretty fast. Depending on the nature of the in array, you could try special casing; for instance, if you happen to know that in a particular invocation all the input numbers are positive, out[i] will be equal to the cumulative sum, with no need for an if branch.
Ensuring the method isn't virtual, marking it inline or __attribute__((always_inline)), and using -funroll-loops seem like good options to explore.
Only by you benchmarking them can we determine if they were worthwhile optimizations in your bigger program.
The only thing that comes to mind that might help a small bit is to use pointers rather than array indices within your loop, something like
void theloop(int64_t in[], int64_t out[], size_t N)
{
    int64_t max = in[0];
    out[0] = max;
    int64_t *ip = in + 1, *op = out + 1;
    for(uint32_t i = 1; i < N; i++) {
        int64_t v = *ip;
        ip++;
        max += v;
        if (v > max) max = v;
        *op = max;
        op++;
    }
}
The thinking here is that an index into an array is liable to compile as taking the base address of the array, multiplying the size of element by the index, and adding the result to get the element address. Keeping running pointers avoids this. I'm guessing a good optimizing compiler will do this already, so you'd need to study the current assembler output.
int64_t max = 0, i;
for(i = N-1; i > 0; --i) /* Comparing with 0 is faster */
{
    // note: this walks the array back to front, so out[] differs from the
    // original forward scan unless the processing order does not matter
    max = max > 0 ? max + in[i] : in[i];
    out[i] = max;
    --i; /* Will reduce checking of i >= 0 by N/2 times */
    max = max > 0 ? max + in[i] : in[i]; /* Reduce operations v = in[i], max += v by N times */
    out[i] = max;
}
if(0 == i) /* When N is odd */
{
    max = max > 0 ? max + in[i] : in[i];
    out[i] = max;
}