Optimizing sum of elements in array [closed] - c++

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 8 years ago.
My code is as follows:
double a, b;          // these variables are inputs to the function
double *inArr;        // also an input to the function; its size is numElements
double *arr = new double[numElements]; // numElements is ~10^6
double sum = 0.0;
for (unsigned int i = 0; i < numElements; ++i)
{
    double k = a*inArr[i] + b; // this doesn't take any time
    double el = arr[i];        // this doesn't take any time
    el *= k;                   // this doesn't take any time
    sum += el;                 // this takes a long time!!!
}
This code iterates over the elements of an array, calculating a value k for each element and adding k times that element to sum. I separated the code into this many steps so that when my profiler tells me which line takes a long time, I will know exactly which calculation is the culprit. My profiler tells me that adding el to sum is what's slowing down my program (it might seem strange that a simple addition would be slow, but I call this function hundreds of times, and each call performs millions of these additions). My only theory is that because sum is in a different scope, operations using it take longer. So I edited the code to be:
double a, b;          // these variables are inputs to the function
double *inArr;        // also an input to the function; its size is numElements
double *arr = new double[numElements]; // numElements is ~10^6
double sum = 0.0;
for (unsigned int i = 0; i < numElements; ++i)
{
    double k = a*inArr[i] + b; // this doesn't take any time
    double el = arr[i];        // this doesn't take any time
    el *= k;                   // this doesn't take any time
    double temp = sum + el;    // this doesn't take any time
    sum = el;                  // this takes a long time!!!
}
And now the sum operation takes very little time, even though it accesses the sum variable; the assignment takes a long time now. Is my theory correct that the reason this happens is that it takes longer to assign to variables that aren't in the current scope? If so, why is that true? Is there any way to make this assignment work quickly? I know I can optimize this using parallelization; I want to know if I can do any better serially. I am using VS 2012 in release mode, and I am using the VS performance analyzer as my profiler.
Edit:
Once I removed the optimization, it turned out that the access to inArr is what takes the most time.

Is my theory correct that the reason this happens is that it takes longer to assign to variables that aren't in the current scope?
No.
Your profiler is lying to you, and pinpointing the wrong source for the delay. Short of parallelisation, this code cannot be optimised meaningfully: all the operations are quite elementary.

There are limits to what a profiler can do. If you've compiled with optimization, the compiler has probably rearranged a fair bit of code, so the profiler can't necessarily tell which line number is associated with any one particular instruction. And both the compiler and the hardware will allow for a good deal of overlapping; in many cases, the hardware will just go on to the next instruction, even if the preceding one hasn't finished, leaving a number of operations in the pipeline (and the compiler will arrange the code so that the hardware can do this most effectively). Thus, for example, the sub-expression inArr[i] involves a memory access, which is probably significantly slower than anything else. But the execution doesn't wait for it; the execution doesn't wait until it actually needs the results. (If the compiler is really clever, it may remark that arr[i] accesses uninitialized memory, which is undefined behavior, so it can skip the access, and give you any old random value.)

In your case, the compiler is probably only doing full optimization within the loop, so the execution is only stalling for the pipelined operations to finish when you write to a variable outside the loop. And the profiler thus attributes all of the time to this write.

(I've simplified greatly: for more details, I'd have to know more about the actual processor, and look at the generated code with and without optimization.)
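That said, if you want to try to do better serially, one avenue is the loop-carried dependency on sum: each addition has to wait for the previous one to complete. Below is a minimal sketch of splitting the accumulator, assuming the same a, b, inArr, arr and numElements as in the question. Note that this reorders the floating-point additions, so the result can differ in the last bits, and compilers won't do this on their own without flags like /fp:fast.

double sum0 = 0.0, sum1 = 0.0;
unsigned int i = 0;
for (; i + 1 < numElements; i += 2)
{
    // two independent accumulators let consecutive additions overlap in the pipeline
    sum0 += (a*inArr[i]   + b) * arr[i];
    sum1 += (a*inArr[i+1] + b) * arr[i+1];
}
for (; i < numElements; ++i)   // tail element when numElements is odd
    sum0 += (a*inArr[i] + b) * arr[i];
double sum = sum0 + sum1;

Whether this helps at all depends on the compiler and the CPU; measure before keeping it.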

Related

How to generate computation intensive code in C++ that will not be removed by compiler? [duplicate]

This question already has an answer here:
How to prevent optimization of busy-wait
(1 answer)
Closed 7 years ago.
I am doing some experiments on CPU performance. I wonder if anyone knows a formal way, or a tool, to generate simple code that runs for a period of time (several seconds) and consumes a significant amount of a CPU's computation resources.
I know there are a lot of CPU benchmarks, but their code is pretty complicated. What I want is a more straightforward program.
As the compiler is very smart, writing redundant code like the following will not work:
for (int i = 0; i < 100; i++) {
    int a = i * 200 + 100;
}
Put the benchmark code in a function in a separate translation unit from the code that calls it. This prevents the code from being inlined, which can lead to aggressive optimizations.
Use parameters for the fixed values (e.g., the number of iterations to run) and return the resulting value. This prevents the optimizer from doing too much constant folding and it keeps it from eliminating calculations for a variable that it determines you never use.
Building on the example from the question:
int TheTest(int iterations) {
    int a = 0;   // initialised so the function is well-defined even when iterations == 0
    for (int i = 0; i < iterations; i++) {
        a = i * 200 + 100;
    }
    return a;
}
Even in this example, there's still a chance that the compiler might realize that only the last iteration matters and completely omit the loop and just return 200*(iterations - 1) + 100, but I wouldn't expect that to happen in many real-life cases. Examine the generated code to be certain.
Other ideas, like using volatile on certain variables, can inhibit some reasonable optimizations, which might make your benchmark perform worse than actual code.
There are also frameworks, like this one, for writing benchmarks like these.
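If you're on GCC or Clang, an empty inline-asm barrier is a common middle ground between volatile and a full framework. A minimal sketch, assuming one of those compilers (the escape helper is just an illustrative name for the trick benchmark libraries expose as DoNotOptimize):

// Force the compiler to assume 'value' is read, without emitting any code.
template <typename T>
inline void escape(T& value) {
    asm volatile("" : : "g"(&value) : "memory");
}

int TheTest(int iterations) {
    int a = 0;
    for (int i = 0; i < iterations; i++) {
        a = i * 200 + 100;
        escape(a);   // the store to 'a' can no longer be hoisted or deleted
    }
    return a;
}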
It's not necessarily your optimiser that removes the code. CPUs these days are very powerful, and you need to increase the challenge level. However, note that your original code is not a good general benchmark: you only use a very small subset of a CPU's instruction set. A good benchmark will try to challenge the CPU on different kinds of operations, to predict the performance in real-world scenarios. Very good benchmarks will even put load on various components of your computer, to test their interplay.
Therefore, just stick to a well-known published benchmark for your problem. There is a very good reason why they are more involved. However, if you really just want to benchmark your own setup and code, then this time, just go for higher counter values:
double j = 10000;
for (double i = 0; i < j*j*j; i++)  // keep the bound below 2^53, or ++i stops changing i
{
}
This should work better for now; note that there are simply more iterations (an optimising build may still delete the empty loop entirely). Change the bound according to your needs.

Code reordering due to optimization

I've heard so many times that an optimizer may reorder your code that I'm starting to believe it.
Are there any examples or typical cases where this might happen, and how can I avoid such a thing (e.g. I want a benchmark to be impervious to this)?
There are LOTS of different kinds of "code motion" (moving code around), and it's caused by lots of different parts of the optimisation process:
Move individual instructions around, because it's a waste of time to wait for a memory read to complete without putting at least one or two instructions between the read and the operation that uses the value fetched from memory.
Move things out of loops, because they only need to happen once. (If you call x = sin(y) once or 1000 times without changing y, x will have the same value, so there is no point in doing that inside a loop; the compiler moves it out.)
Move code around based on "the compiler expects this code to be hit more often than the other bit, so we get a better cache-hit ratio if we do it this way" - for example, error handling being moved away from the source of the error, because it's unlikely that you get an error [compilers often understand commonly used functions and know that they typically succeed].
Inlining - code is moved from the called function into the calling function. This often leads to OTHER effects, such as fewer registers being pushed to/popped from the stack, and arguments can be kept where they are rather than having to be moved to the "right place for arguments".
I'm sure I've missed some cases in the above list, but these are certainly some of the most common.
The compiler is perfectly within its rights to do this, as long as it doesn't make any "observable difference" (other than the time it takes to run and the number of instructions used - those "don't count" as observable differences when it comes to compilers).
There is very little you can do to prevent the compiler from reordering your code, but you can write code that ensures the order to some degree. For example, we can have code like this:
{
    int sum = 0;
    for(i = 0; i < large_number; i++)
        sum += i;
}
Now, since sum isn't being used, the compiler can remove it. Adding some code that prints the sum would ensure that it's "used" as far as the compiler is concerned.
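A minimal sketch of that fix, assuming <cstdio> is available:

int sum = 0;
for (int i = 0; i < large_number; i++)
    sum += i;
printf("%d\n", sum);   // the result is now observable, so the loop must stay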
Likewise:
for(i = 0; i < large_number; i++)
{
    do_stuff();
}
If the compiler can figure out that do_stuff doesn't actually change any global value, or anything similar, it will move code around to form this:
do_stuff();
for(i = 0; i < large_number; i++)
{
}
The compiler may also remove (in fact, almost certainly will remove) the now-empty loop, so that it doesn't exist at all. [As mentioned in the comments: if do_stuff doesn't actually change anything outside itself, it may also be removed, but the example I had in mind is where do_stuff produces a result, but the result is the same each time.]
(The above happens if you remove the printout of results in the Dhrystone benchmark, for example, since some of the loops calculate values that are never used except in the printout. This can lead to benchmark results that exceed the highest theoretical throughput of the processor by a factor of 10 or so, because the benchmark assumes the loop instructions were actually executed and credits each iteration with its nominal number of operations.)
There is no easy way to ensure this doesn't happen, aside from ensuring that do_stuff either updates some variable outside the function, or returns a value that is "used" (e.g. by summing it up, or something similar).
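A sketch of the second option, assuming do_stuff() is changed to return an int:

long long total = 0;
for (int i = 0; i < large_number; i++)
    total += do_stuff();   // every call's result now feeds an observable output
printf("%lld\n", total);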
Another example of removing/omitting code is where you store values repeatedly to the same variable:
int x;
for(i = 0; i < large_number; i++)
    x = i * i;
can be replaced with:
x = (large_number-1) * (large_number-1);
Sometimes, you can use volatile to ensure that something REALLY happens, but in a benchmark that CAN be detrimental, since the compiler then also can't optimise code that it SHOULD optimise (if you are not careful with how you use volatile).
If you have some SPECIFIC code that you care particularly about, it would be best to post it (and compile it with several state-of-the-art compilers to see what they actually do with it).
[Note that moving code around is definitely not a BAD thing in general - I do want my compiler (whether it is the one I'm writing myself, or one written by someone else) to optimise by moving code, because, as long as it does so correctly, it will produce faster/better code that way!]
Most of the time, reordering is only allowed in situations where the observable effects of the program are the same - this means you shouldn't be able to tell.
Counterexamples do exist; for example, the order of evaluation of operands is unspecified, and an optimizer is free to rearrange things. You can't predict the order of these two function calls, for example:
int a = foo() + bar();
Read up on sequence points to see what guarantees are made.
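A minimal sketch of why this matters, with hypothetical foo and bar that have visible side effects (either line of output may appear first, depending on the compiler):

#include <cstdio>

int foo() { std::puts("foo"); return 1; }
int bar() { std::puts("bar"); return 2; }

int main() {
    int a = foo() + bar();   // unspecified which call happens first
    std::printf("a = %d\n", a);
}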

Is adding 1 to a number repeatedly slower than adding everything at the same time in C++? [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 8 years ago.
If I have a number a, would it be slower to add 1 to it b times rather than simply adding a + b?
a += b;
or
for (int i = 0; i < b; i++) {
    a += 1;
}
I realize that the second example seems kind of silly, but I have a situation where coding would actually be easier that way, and I am wondering if that would impact performance.
EDIT: Thank you for all your answers. It looks like some posters would like to know what situation I have. I am trying to write a function that shifts an input character a certain number of characters over (i.e. a cipher) if it is a letter. So I want to say that one char += the number of shifts, but I also need to account for the jumps between the lowercase and uppercase characters in the ASCII table, and for wrapping from 'z' back to 'A'. So, while it is doable another way, I thought it would be easiest to keep adding one until I get to the end of a block of letter characters, then jump to the next one and keep going.
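For reference, the wrap-around can also be done in constant time with modular arithmetic. A minimal sketch, assuming the shift should wrap within the letter's own case (shift_letter is just an illustrative name, and n is assumed non-negative):

#include <cctype>

char shift_letter(char c, int n) {
    // Map into 0..25, add the shift, wrap with %, then map back.
    if (std::isupper(static_cast<unsigned char>(c)))
        return static_cast<char>('A' + (c - 'A' + n) % 26);
    if (std::islower(static_cast<unsigned char>(c)))
        return static_cast<char>('a' + (c - 'a' + n) % 26);
    return c;   // non-letters pass through unchanged
}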
If your loop is really that simple, I don't see any reason why a compiler couldn't optimize it. I have no idea whether any actually would, though. If your compiler doesn't, the single addition will be much faster than the loop.
The C++ language does not describe how long either of those operations takes. Compilers are free to turn your first statement into the second, and that is a legal way to compile it.
In practice, many compilers would treat those two snippets as the same expression, assuming everything is of type int. The second, however, would be fragile, in that seemingly innocuous changes could cause massive performance degradation: small changes in type that 'should not matter', extra statements nearby, etc.
It would be extremely rare for the first to be slower than the second, but if the type of a were such that += b was a much slower operation than calling += 1 a bunch of times, it could be. For example:
struct A {
    std::vector<int> v;
    void operator+=( int x ) {
        // fast path: amortised doubling when elements are added one at a time
        if (x == 1 && v.size() == v.capacity())
            v.reserve( v.size()*2 );
        // slow path: grow the buffer one element at a time with exact capacity,
        // forcing a reallocation (and a copy) on nearly every step
        for (int i = 0; i < x; ++i) {
            v.reserve( v.size()+1 );
            v.resize( v.size()+1 );
        }
    }
};
then A a; int b = 100000; a+=b; would take much longer than the loop construct.
But I had to work at it.
The overhead (CPU instructions) of incrementing a variable in a loop is likely to be insignificant compared to the total number of instructions in that loop (unless the only thing you do in the loop is increment). Loop variables are likely to stay in the low levels of the CPU cache (if not in CPU registers) and are very fast to increment, since they don't need to be read from RAM. Anyway, if in doubt, just make a quick profile and you'll know whether it makes sense to sacrifice code readability for speed.
Yes, absolutely slower. The second example is beyond silly. I highly doubt you have a situation where it would make sense to do it that way.
Let's say b is 500,000: most computers can add that in a single operation, so why do 500,000 operations (not including the loop overhead)?
If the processor has an increment instruction, the compiler will usually translate the "add one" operation into an increment instruction.
Some processors may have an optimized increment instruction to help speed up things like loops. Other processors can combine an increment operation with a load or store instruction.
There is a possibility that a small loop containing only an increment instruction could be replaced by a multiply and add. The compiler is allowed to do so, if and only if the functionality is the same.
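A sketch of the transformation the compiler is allowed to make here, using a and b from the question (same observable result, assuming no overflow):

// original loop of increments
for (int i = 0; i < b; i++)
    a += 1;

// what the optimiser may emit instead: one multiply-and-add (1 * b, added once)
a += b;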
The difference is generally negligible. However, for large data sets in performance-critical applications, this kind of micro-optimisation can matter, and the time gained can be significant.
Edit 1:
For adding values other than 1, the compiler would emit processor instructions to use the best addition operations.
The add operation is implemented in hardware quite differently from repeated incrementing. Arithmetic logic units (ALUs) have been around for a long time, and the basic addition operation is very well optimized: a lot faster than incrementing in a loop.

Removing code portions does not match the profiler's data

I am doing a little proof-of-concept profile-and-optimize example. However, I ran into something that I can't quite explain, and I'm hoping that someone here can clear it up.
I wrote a very short snippet of code:
#include <stdio.h>

int a, b, c;   // globals shared by main and the callbacks

void callbackWasterOne(void);
void callbackWasterTwo(void);

int main (void)
{
    for (int j = 0; j < 1000; j++)
    {
        a = 1;
        b = 2;
        c = 3;
        for (int i = 0; i < 100000; i++)
        {
            callbackWasterOne();
            callbackWasterTwo();
        }
        printf("Two a: %d, b: %d, c: %d.", a, b, c);
    }
    return 0;
}

void callbackWasterOne(void)
{
    a = b * c;
}

void callbackWasterTwo(void)
{
    b = a * c;
}
All it does is call two very basic functions that just multiply numbers together. Since the code is identical, I expected the profiler (OProfile) to return roughly the same numbers for both.
I run this code 10 times per profile, and I got the following values for how much time is spent on each function:
main: average = 5.60%, stdev = 0.10%
callbackWasterOne: average = 43.78%, stdev = 1.04%
callbackWasterTwo: average = 50.24%, stdev = 0.98%
The rest is in miscellaneous things like printf and no-vmlinux.
The difference between the time for callbackWasterOne and callbackWasterTwo is significant enough (to me at least) given that they have the same code, that I switched their order in my code and reran the profiler with the following results now:
main: average = 5.45%, stdev = 0.40%
callbackWasterOne: average = 50.69%, stdev = 0.49%
callbackWasterTwo: average = 43.54%, stdev = 0.18%
The rest is in miscellaneous things like printf and no-vmlinux.
So evidently the profiler samples one more than the other based on the execution order. Not good. Disregarding this, I decided to see the effects of removing some code and I got this for execution times (averages):
Nothing removed: 0.5295s
call to callbackWasterOne() removed from for loop: 0.2075s
call to callbackWasterTwo() removed from for loop: 0.2042s
remove both calls from for loop: 0.1903s
remove both calls and the for loop: 0.0025s
remove contents of callbackWasterOne: 0.379s
remove contents of callbackWasterTwo: 0.378s
remove contents of both: 0.382s
So here is what I'm having trouble understanding:
When I remove just one of the calls from the for loop, the execution time drops by ~60%, which is greater than the time spent by that one function + the main in the first place! How is this possible?
Why is the effect of removing both calls from the loop so small compared to removing just one? I can't figure out this non-linearity. I understand that the for loop is expensive, but in that case (if most of the remaining time can be attributed to the for loop that performs the function calls), why would removing one of the calls cause such a large improvement in the first place?
I looked at the disassembly and the two functions are the same in code. The calls to them are the same, and removing the call simply deletes the one call line.
Other info that might be relevant
I'm using Ubuntu 14.04LTS
The code is compiled by Eclipse with no optimization (-O0)
I time the code by running it in terminal using "time"
I use OProfile with count = 10000 and 10 repetitions.
Here are the results from when I do this with -O1 optimization:
main: avg = 5.89%, stdev = 0.14%
callbackWasterOne: avg = 44.28%, stdev = 2.64%
callbackWasterTwo: avg = 49.66%, stdev = 2.54% (greater than before)
Rest is miscellaneous
Results of removing various bits (execution time averages):
Nothing removed: 0.522s
Remove callbackWasterOne call: 0.149s (71.47% decrease)
Remove callbackWasterTwo call: 0.123s (76.45% decrease)
Remove both calls: 0.0365s (93.01% decrease) (what I would expect given the profile data just above)
So removing one call now is much better than before, and removing both still carries a benefit (probably because the optimizer understands that nothing happens in the loop). Still, removing one is much more beneficial than I would have anticipated.
Results of the two functions using different variables:
I defined 3 more variables for callbackWasterTwo() to use instead of reusing same ones. Now the results are what I would have expected.
main: avg = 10.87%, stdev = 0.19% (average is greater, but maybe due to those new variables)
callbackWasterOne: avg = 46.08%, stdev = 0.53%
callbackWasterTwo: avg = 42.82%, stdev = 0.62%
Rest is miscellaneous
Results of removing various bits (execution time averages):
Nothing removed: 0.520s
Remove callbackWasterOne call: 0.292s (43.83% decrease)
Remove callbackWasterTwo call: 0.291s (44.07% decrease)
Remove both calls: 0.065s (87.55% decrease)
So now removing both calls is pretty much equivalent (within stdev) to removing one call + the other.
Since the result of removing either function is pretty much the same (43.83% vs 44.07%), I am going to go out on a limb and say that perhaps the profiler data (46% vs 42%) is still skewed. Perhaps it is the way it samples (going to vary the counter value next and see what happens).
It appears that the success of optimization relates pretty strongly to the code reuse fraction. The only way to achieve "exactly" (you know what I mean) the speedup noted by the profiler is to optimize on completely independent code. Anyway this is all interesting.
I am still looking for some explanation ideas for the 70% decrease in the -O1 case though...
I did this with 10 functions (different formulas in each, but using some combination of 6 different variables, 3 at a time, all multiplication):
These results are disappointing, to say the least. I know the functions are identical, and yet the profiler indicates that some take significantly longer. No matter which one I remove (a "fast" or a "slow" one), the results are the same ;) So this leaves me to wonder: how many people are relying on the profiler and being pointed at the wrong areas of code to fix? If I unknowingly saw these results, what could possibly tell me to go fix the 5% function rather than the 20% one (even though they are exactly the same)? What if the 5% one was much easier to fix, with a large potential benefit? And of course, this profiler might just not be very good, but it is popular! People use it!
Here is a screenshot of the results (I don't feel like typing them in again). [Screenshot not reproduced.]
My conclusion: I am overall rather disappointed with OProfile. I decided to try out Callgrind (Valgrind) through the command line on the same program, and it gave me far more reasonable results; in fact, the results were very reasonable (all functions spent roughly the same amount of time executing). I think this is because Callgrind instruments the code rather than sampling it the way OProfile does.
Callgrind still will not explain the difference in improvement when a function is removed, but at least it gives the correct baseline information...
Ah, I see you did look at the assembly. This question is indeed interesting in its own right, but in general there's no point profiling unoptimized code, since there's so much boilerplate that could easily be eliminated even at -O1.
If it's really only the call that's missing, then that could explain the timing differences. There's lots of boilerplate in the -O0 stack-manipulation code (any caller-saved registers have to be pushed onto the stack, and any arguments too; afterwards, any return value has to be handled and the opposite stack manipulation has to be done) which contributes to the time it takes to call the functions, but is not necessarily attributed to the functions themselves by OProfile, since that code is executed before/after the function is actually called.
I suspect the reason the second function seems to always take less time is that there's less (or no) stack juggling to be done: the parameter values are already on the stack thanks to the previous function call, and so, as you've seen, only the call to the function has to be executed, without any other extra work.
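If you do want to profile an optimised build while keeping the callbacks visible as separate functions, a minimal sketch is to forbid inlining explicitly. The attribute below is GCC/Clang syntax (MSVC spells it __declspec(noinline)), and the globals a, b, c are assumed from the question's code:

// Prevent inlining so the profiler can still attribute time to each
// callback even at -O1/-O2.
__attribute__((noinline)) void callbackWasterOne(void) { a = b * c; }
__attribute__((noinline)) void callbackWasterTwo(void) { b = a * c; }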

Same program works on macos but fails on windows [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 8 years ago.
OK, so I am working on an interface in Qt, and I am using Qt Creator as an IDE. The thing is that the algorithm works normally on Mac, but on Windows the same program gets an error.
The only difference is the compiler. The compiler I am using on Windows is Visual C++, and on Mac it is Clang (I think).
Is it possible that the same algorithm works on Mac but doesn't on Windows? If so, what could the problem be?
EDIT: I see that I got downvoted, and I don't know exactly why. I already know what the error means: vector subscript out of range. The thing is that I don't want to waste time trying to find the error, because the code actually works fine on Mac. Also, the PC is better than the Mac.
EDIT 2: Indeed, it looks like the same code works differently on Windows than on Mac. Tomorrow I will test it on the Mac to try to understand this, but the code whose behaviour changes is this one:
vector<double> create_timeVector(double simulationTime, double step) {
    vector<double> time;
    time.push_back(0);
    double i = 0;
    do {
        ++i;
        time.push_back(time[i-1] + step);
    } while (time[i] < simulationTime);
    return time;
}
The size of the vector that is returned is one bigger on Windows than on Mac. The thing is that I didn't make any changes to the code.
The probable reason why it works differently is that you're using a floating-point calculation to determine when the loop stops (or keeps going, depending on how you look at it):
time.push_back(time[i-1] + step);
} while (time[i] < simulationTime);
You have step as a double, simulationTime as a double, and a vector<double> called time being used. That is a recipe for loops running inconsistently across compilers, compiler optimizations, etc.
Floating point is not exact. The way to make the loops consistent is to avoid floating-point calculations in the loop condition.
In other words, by hook or by crook, compute the number of iterations you need using integer arithmetic. If you need to step 100 times, then it's 100, not a value computed from floating-point arithmetic:
For example:
for (float i = 0.01F; i <= 1.0F; i += 0.01F)
{
    // use i in some sort of calculation
}
That loop can execute 99 times or 100 times, depending on the compiler and any floating-point optimizations that may apply. To fix this:
for (int i = 1; i <= 100; ++i)
{
    float dI = static_cast<float>(i) / 100.0F;
    // use dI instead of i in some sort of calculation
}
As long as i isn't changed in the loop, the loop is guaranteed to always do 100 iterations, regardless of the hardware, compiler optimizations, etc.
See this: Any risk of using float variables as loop counters and their fractional increment/decrement for non "==" conditions?
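Applied to the function from the question, a minimal sketch of the integer-count approach (the rounding policy, std::ceil here, is an assumption; choose whichever endpoint behaviour you need):

#include <cmath>
#include <vector>

std::vector<double> create_timeVector(double simulationTime, double step) {
    std::vector<double> time;
    // Compute the number of steps once, in integer arithmetic.
    const long long n = static_cast<long long>(std::ceil(simulationTime / step));
    time.reserve(n + 1);
    for (long long i = 0; i <= n; ++i)
        time.push_back(i * step);   // i * step avoids accumulating rounding error
    return time;
}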
"vector subscript out of range" means that you used [n] on a vector, and n was less than 0 or greater than or equal to the number of elements in the vector.
This causes undefined behaviour, so different compilers may react in different ways.
To get reliable behaviour in this case, one way is to use .at(n) instead of [n] and make sure you catch exceptions. Another way is to check your index values before applying them, so that you never access out of bounds in the first place.
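A minimal sketch of the .at() approach (the out-of-range index 12 is just for illustration):

#include <cstdio>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<double> v(10);
    try {
        double x = v.at(12);   // throws std::out_of_range instead of undefined behaviour
        std::printf("%f\n", x);
    } catch (const std::out_of_range& e) {
        std::printf("bad index: %s\n", e.what());
    }
}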