I have a piece of code that is really dirty.
I want to optimize it a little. Does it make any difference which of the following structures I use, or are they identical from a performance point of view in C++?
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
for end
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
for end
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
for end
....
or
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
if
if ... else ...
if
if ... else ...
....
for end
Thanks in Advance!
Both are O(n). As we do not know the guts of the various for loops, it is impossible to say which is faster.
BTW - Mark it as pseudo code and not C++
The 1st one may spend less time incrementing/testing i and conditionally branching (assuming the compiler's optimiser doesn't reduce it to the equivalent of the second one anyway), but with loop unrolling the time taken for the i loop may be insignificant compared to the time spent within the loop anyway.
Countering that, it's easily possible that the choice of separate versus combined loops will affect the ratio of cache hits, and that could significantly impact either version: it really depends on the code. For example, if each of the three if/else statements accessed different arrays at index i, then they'll be competing for CPU cache and could slow each other down. On the other hand, if they accessed the same array at index i, doing different steps in some calculation, then it's probably better to do all three steps while those cache lines are still in cache.
There are potential impacts other than caches: register allocation, the speed of I/O devices (e.g. if each loop operates on lines/records from a distinct file on different physical drives, it's very probably faster to process some of each file in a loop, rather than sequentially process each file), etc.
If you care, benchmark your actual application with representative data.
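As a starting point for such a benchmark, here is a minimal sketch. The three per-element steps are invented purely for illustration (the question does not say what the if/else bodies do), and the names split_loops, fused_loop and time_ms are hypothetical. Any numbers it produces only describe this toy workload, not your application.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical workload: three independent per-element steps.
// Variant 1: one loop per step, i.e. three passes over the data.
void split_loops(std::vector<int>& entity) {
    for (std::size_t i = 1; i < entity.size(); ++i) {
        if (entity[i] < 0) entity[i] = 0;
    }
    for (std::size_t i = 1; i < entity.size(); ++i) {
        if (entity[i] % 2 == 0) entity[i] += 1; else entity[i] -= 1;
    }
    for (std::size_t i = 1; i < entity.size(); ++i) {
        if (entity[i] > 100) entity[i] = 100;
    }
}

// Variant 2: the same three steps in a single pass.
void fused_loop(std::vector<int>& entity) {
    for (std::size_t i = 1; i < entity.size(); ++i) {
        if (entity[i] < 0) entity[i] = 0;
        if (entity[i] % 2 == 0) entity[i] += 1; else entity[i] -= 1;
        if (entity[i] > 100) entity[i] = 100;
    }
}

// Returns the wall-clock time of one call to f, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

You would call something like time_ms([&] { split_loops(data); }) and time_ms([&] { fused_loop(copy); }) on equal copies of a large vector, repeat each measurement several times, and compile with optimizations enabled before drawing any conclusion.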
Just from the structure of the loop it is not possible to say which approach will be faster.
Algorithmically, both have the same complexity, O(n). However, they might have different performance numbers depending on the kind of operation you are performing on the elements and the size of the container.
The size of the container may have an impact on locality and hence on performance. Generally speaking, you want to do as much work on the data as you can once it is in the cache, so I would prefer the second approach. To get a clear picture, you should actually measure the performance of your approach.
The second is only slightly more efficient than the first. You save:
Initialization of loop index
Calling size()
Comparing the loop index with size()
Incrementing the loop index
These are very minor optimizations. Do it if it doesn't impact readability.
I would expect the second approach to be at least marginally more optimal in most cases as it can leverage the locality of reference with respect to access to elements of the entity collection/set. Note that in the first approach, each for loop would need to start accessing elements from the beginning; depending on the size of the cache, the size of the list and the extent to which compiler can infer and optimize, this may lead to cache misses when a new for loop attempts to read an element even though that element would have been read already by a preceding loop.
fibs is a std::vector. Using g++, I was advised to take fibs.size() out of the loop, to save computing it each time (because the vector could change)
int sum = 0;
for(int i = 0; i < fibs.size(); ++i){
if(fibs[i] % 2 == 0){
sum += fibs[i];
}
}
Surely there is some dataflow analysis in the compiler that would tell us that fibs won't change size. Is there? Or should I set some other variable to be fibs.size() and use that in the loop condition?
The compiler will likely determine that it won't change. Even if it did, size() for vectors is an O(1) operation.
Unless you know it's a problem, leave it as it is. First make it correct, then make it clear, then make it fast (if necessary).
vector::size is extremely fast anyway. It seems to me likely that the compiler will optimise this case, since it is fairly obvious that the vector is not modified and all the functions called will be inlined so the compiler can tell.
You could always look at the generated code to see if this has happened.
If you do want to change it, you need to be able to measure the time it takes before and after. That's a fair amount of work - you probably have better things to do.
size() is a constant-time operation, so there's no penalty for calling it this way. If you are concerned about performance and want a more general way to go through the collection, use iterators:
int sum = 0;
for(auto it = fibs.cbegin(); it != fibs.cend(); ++it) {
if((*it) % 2 == 0){
sum += *it;
}
}
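A range-based for loop is another way to sidestep the question, since no explicit size() call appears in the loop header at all. This is only a sketch, assuming C++11 or later; the function name sum_even is made up for illustration.

```cpp
#include <vector>

// Sums the even elements of fibs.
// The range-based for obtains begin() and end() once, before the first iteration.
int sum_even(const std::vector<int>& fibs) {
    int sum = 0;
    for (int value : fibs) {
        if (value % 2 == 0) {
            sum += value;
        }
    }
    return sum;
}
```

Note that, like the iterator version, this is only valid if the loop body does not add or remove elements.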
I think you are missing another, more important point here: Is this loop causing a slow-down of your application? If you do not know for sure (i.e. if you haven't profiled), you risk focusing on the wrong parts of your application.
You already have to keep thousands of things in your head when writing programs (coding guidelines, the architecture (bigger picture) of your application, variable names, function names, class names, readability, etc.). You can ignore the speed of the code during your initial implementation (at least 95% of the time). This will allow you to focus on things that are more important and far more valuable (like correctness, readability and maintainability).
In your example the compiler can easily analyze the flow and determine that it doesn't change. In more complicated code it cannot:
for(int i = 0; i < fibs.size(); ++i){
complicated_function();
}
complicated_function may change fibs. Because the compiler usually cannot see through such a function call, it cannot keep fibs.size() in a register across the call, so the size has to be reloaded from memory on each iteration.
Let us consider the following code snippet in C++ to print the first 10 positive integers:
for (int i = 1; i<11;i++)
{
cout<< i ;
}
Will this be faster or slower than sequentially printing each integer one by one as follows:
int x = 1;
cout<< x;
x++;
cout<< x;
And so on ..
Is there any reason why it should be faster or slower? Does it vary from one language to another?
This question is similar to this one; I've copied an excerpt of my answer to that question below... (the numbers are different; 11 vs. 50; the analysis is the same)
What you're considering doing is a manual form of loop unrolling. Loop unrolling is an optimization that compilers sometimes use for reducing the overhead involved in a loop. Compilers can do it only if the number of iterations of the loop can be known at compile time (i.e. the number of iterations is a constant, even if the constant involves computation based on other constants). In some cases, the compiler may determine that it is worthwhile to unroll the loop, but often it won't unroll it completely. For instance, in your example, the compiler may determine that it would be a speed advantage to unroll the loop from 50 iterations out to only 10 iterations with 5 copies of the loop body. The loop variable would still be there, but instead of doing 50 comparisons of the loop counter, now the code only has to do the comparison 10 times. It's a tradeoff, because the 5 copies of the loop body eat up 5 times as much space in the cache, which means that loading those extra copies of the same instructions forces the cache to evict (throw out) that many instructions that are already in the cache and which you might have wanted to stay in the cache. Also, loading those 4 extra copies of the loop body instructions from main memory takes much, much longer than simply grabbing the already-loaded instructions from the cache in the case where the loop isn't unrolled at all.
So all in all, it's often more advantageous to just use only one copy of the loop body and go ahead and leave the loop logic in place. (I.e. don't do any loop unrolling at all.)
In a loop, the same machine-level instructions are executed on every iteration, so they sit at the same addresses. With explicit statements, each copy of the instructions has a different address. So for loops, the CPU's instruction cache may provide a performance boost that might not happen in the latter case.
For a really small range (10), the difference will most likely be negligible. For a significantly longer loop it could show up more clearly.
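To make the two variants concrete, here is a sketch of both. They write to a std::ostream instead of cout so the output can be compared; the names print_loop, print_unrolled and capture are hypothetical. It only shows that the two forms are equivalent in behavior. It says nothing about which is faster, and an optimizing compiler is free to turn either form into the other.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// The loop version from the question.
void print_loop(std::ostream& out) {
    for (int i = 1; i < 11; i++) {
        out << i;
    }
}

// The manually unrolled version: ten copies of the loop body.
void print_unrolled(std::ostream& out) {
    int x = 1;
    out << x; x++;
    out << x; x++;
    out << x; x++;
    out << x; x++;
    out << x; x++;
    out << x; x++;
    out << x; x++;
    out << x; x++;
    out << x; x++;
    out << x;
}

// Helper: runs one of the printers and returns what it wrote.
std::string capture(void (*print)(std::ostream&)) {
    std::ostringstream buffer;
    print(buffer);
    return buffer.str();
}
```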
I'm working on a C++ project where I need to search through a vector ignoring those that have already been visited. If one has been visited I set its corresponding visited to 1 and ignore it. Which solution is faster?
Solution 1:
vector<string> stringsToVisit;
vector<int> stringVisited;
for (int i = 0; i < stringsToVisit.size(); ++i) {
if (stringVisited[i] == 0) {
string current = stringsToVisit[i];
...Do Stuff...
stringVisited[i] = 1;
}
}
or
Solution 2:
struct StringInfo {
string myString;
int visited = 0;
};
vector<StringInfo> stringsToVisit;
for (int i = 0; i < stringsToVisit.size(); ++i) {
if (stringsToVisit[i].visited == 0) {
string current = stringsToVisit[i].myString;
...Do Stuff...
stringsToVisit[i].visited = 1;
}
}
As Bernard notes, the time and memory complexity of both proposed solutions is identical, and the slightly more complex addressing required by the second solution isn't going to slow things down on modern processors. But I disagree with his suggestion that "Solution 2 is likely to be faster." We really don't know enough to even say that it should theoretically be faster, and except perhaps in a few degenerate situations, the difference in actual performance is likely to be unmeasurable.
The first iteration of the loop would indeed likely be slower. The cache is cold, and the first solution requires two cache lines to store the first elements, while the second solution only requires one. But after that, both solutions are doing a forward linear traversal. The CPU is going to have no problem prefetching additional cache lines, so in most situations, that initial extra overhead is unlikely to actually matter too much.
On the other hand, you are writing data while in this loop, so some of the cache lines you access also get marked dirty (meaning their data needs to be written back to a shared cache or main memory eventually, and they get purged from any other cores' caches). In solution 1, depending on sizeof(string) and sizeof(int), only 5-25% of the cache lines are getting marked dirty. Solution 2, however, dirties every single one, so it may actually use more memory bandwidth.
So, some things that might make solution 2 faster are:
The list of strings being processed is extremely short
...Do Stuff... is very complicated (enough so that the cache lines holding the data get purged from L1 cache)
Some things that might make solution 1 equivalent to or faster than solution 2:
The list of strings being processed is moderate to large
...Do Stuff... is not very complex, so the cache stays warm
The program is multithreaded, and another thread wants to read data from stringsToVisit at the same time.
The bottom line is, it probably doesn't matter.
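For anyone who wants to measure it anyway, here is a compilable sketch of both layouts. The "...Do Stuff..." step is replaced by a placeholder (summing string lengths) that is purely an assumption, and visit_parallel, visit_struct and flag_byte_fraction are hypothetical names. The last function gives a rough idea of what share of solution 1's bytes are flags, which is where the estimate of dirtied cache lines comes from; the exact value depends on your standard library's sizeof(std::string).

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct StringInfo {
    std::string myString;
    int visited = 0;
};

// Solution 1 layout: two parallel vectors. Only stringVisited is written to.
std::size_t visit_parallel(const std::vector<std::string>& stringsToVisit,
                           std::vector<int>& stringVisited) {
    std::size_t total = 0;  // placeholder standing in for "...Do Stuff..."
    for (std::size_t i = 0; i < stringsToVisit.size(); ++i) {
        if (stringVisited[i] == 0) {
            total += stringsToVisit[i].size();
            stringVisited[i] = 1;
        }
    }
    return total;
}

// Solution 2 layout: one vector of structs. Every element is written to.
std::size_t visit_struct(std::vector<StringInfo>& stringsToVisit) {
    std::size_t total = 0;  // same placeholder work as above
    for (StringInfo& info : stringsToVisit) {
        if (info.visited == 0) {
            total += info.myString.size();
            info.visited = 1;
        }
    }
    return total;
}

// Rough share of solution 1's per-element bytes that belong to the flag array.
constexpr double flag_byte_fraction() {
    return static_cast<double>(sizeof(int)) /
           static_cast<double>(sizeof(std::string) + sizeof(int));
}
```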
First of all, you should profile your code to check if this piece of code is really the bottleneck, and accurately measure the amount of time each solution takes to run. That would give the best results.
Nevertheless, here's my answer:
The time complexity of both solutions is O(n), so we are talking only about constant-factor optimizations here.
Solution 1 requires the lookup of two different memory blocks, stringsToVisit[i] and stringVisited[i], in each iteration. This isn't good for CPU caches. In Solution 2, by contrast, each iteration of the loop accesses a single struct stored in contiguous locations in memory. As such, Solution 2 would perform better.
Solution 2 would need a more complicated indirect memory lookup than Solution 1 to access the visited property of the struct: (base address of stringsToVisit) + (index) * (struct size) + (displacement in struct). Nevertheless, this kind of lookup fits well into most processors' SIB (scale-index-base) addressing, so it will compile to one assembly instruction only, so there would not be much slowness, if any at all. It is worth noting that an optimizing compiler might notice that you're accessing memory sequentially and do optimizations to avoid using SIB addressing totally.
Hence, Solution 2 is likely to be faster.
Does the C++ compiler take care of cases like the following, where buildings is a vector:
for (int i = 0; i < buildings.size(); i++) {}
That is, does it notice whether buildings is modified in the loop, and based on that avoid evaluating size() on each iteration? Or should I do this myself? It's not that pretty, but:
int n = buildings.size();
for (int i = 0; i < n; i++) {}
buildings.size() will likely be inlined by the compiler to directly access the private size field on the vector<T> class. So you shouldn't separate the call to size. This kind of micro-optimization is something you don't want to worry about anyway (unless you're in some really tight loop identified as a bottleneck by profiling).
Don't decide whether to go for one or the other by thinking in terms of performance; your compiler may or may not inline the call - and std::vector::size() has constant complexity, too.
What you should really consider is correctness, because the two versions will behave very differently if you add or remove elements while iterating.
If you don't modify the vector in any way in the loop, stick with the former version to avoid a little bit of state (the n variable).
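To illustrate the difference in behavior, here is a small sketch in which the loop body appends to the vector. The body (append value + 1 after every odd value) is invented just for the demonstration, and visits_live and visits_cached are hypothetical names.

```cpp
#include <vector>

// size() is re-evaluated every iteration, so appended elements are visited too.
int visits_live(std::vector<int> buildings) {
    int visits = 0;
    for (int i = 0; i < static_cast<int>(buildings.size()); i++) {
        if (buildings[i] % 2 != 0) {
            buildings.push_back(buildings[i] + 1);  // appended values are even
        }
        ++visits;
    }
    return visits;
}

// size() is read once up front, so appended elements are never visited.
int visits_cached(std::vector<int> buildings) {
    int visits = 0;
    int n = static_cast<int>(buildings.size());
    for (int i = 0; i < n; i++) {
        if (buildings[i] % 2 != 0) {
            buildings.push_back(buildings[i] + 1);
        }
        ++visits;
    }
    return visits;
}
```

Starting from {1, 2, 3}, the first version visits five elements (the original three plus the two appended ones), while the second visits only three.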
If the compiler can determine that buildings isn't mutated within the loop (for example, if it's a simple loop with no function calls that could have side effects), it will probably optimize the computation away. But computing the size of a vector is a single subtraction anyway, which should be pretty cheap as well.
Write the code in the obvious way (size inside the loop) and only if profiling shows you that it's too slow should you consider an alternative mechanism.
I write loops like this:
for (int i = 0, maxI = buildings.size(); i < maxI; ++i)
This takes care of many issues at once: it signals that the maximum is fixed up front, removes any worry about lost performance, and consolidates the types. If the evaluation is in the middle expression, it suggests that the loop changes the collection size.
Too bad the language does not allow a sensible use of const here, or else it would be const maxI.
OTOH, in more and more cases I would rather use an algorithm; a lambda even lets it look almost like traditional code.
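As a rough sketch of the algorithm-plus-lambda style, assuming buildings holds heights as int (the question does not say what it holds) and using a made-up function name count_tall:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Counts the buildings taller than threshold without writing an index loop.
std::ptrdiff_t count_tall(const std::vector<int>& buildings, int threshold) {
    return std::count_if(buildings.begin(), buildings.end(),
                         [threshold](int height) { return height > threshold; });
}
```

There is no loop index and no size() call to think about here; the iteration range is fixed by the begin/end pair passed to the algorithm.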
Assuming the size() function is an inline function for the base-template, one can also assume that it's very little overhead. It is far different from, say, strlen() in C, which can have major overhead.
It is possible that it's still faster to use int n = buildings.size(); - because the compiler can see that n is not changing inside the loop, so load it into a register and not indirectly fetch the vector size. But it's very marginal, and only really tight, highly optimized loops would need this treatment (and only after analyzing and finding that it's a benefit), since it's not ALWAYS that things work as well as you expect in that sort of regard.
Only start to manually optimize stuff like that if it's really a performance problem. Then measure the difference. Otherwise you'll end up with lots of unmaintainable, ugly code that's harder to debug and less productive to work with. Most leading compilers will probably optimize it away if the size doesn't change within the loop.
But even if it's not optimized away, the call will probably be inlined (member functions defined inside a class template are implicitly inline, and their definitions are visible to the compiler) and cost almost nothing.