Check palindrome through recursion in C++

bool palindrome(char arr[],int size){
if(size<=1){
return true;
}
if(*(arr)==*(arr+size-1)){
bool small_ans=palindrome(arr+1,size-2);
return small_ans;
}
return false;
}
How efficient is this code for checking whether a string is a palindrome?

There is a compiler optimization called tail recursion elimination (tail-call optimization).
In your fairly simple case the compiler spotted that it could apply this optimization. As a result it silently turned your code into an iterative version:
https://godbolt.org/z/rsjaYhde6
palindrome(char*, int):
cmp esi, 1
jle .L4
movsx rax, esi
sub esi, 2
shr esi
lea rax, [rdi-1+rax]
lea edx, [rsi+1]
add rdx, rdi
jmp .L3
.L8:
add rdi, 1
sub rax, 1
cmp rdi, rdx
je .L4
.L3:
movzx ecx, BYTE PTR [rax]
cmp BYTE PTR [rdi], cl
je .L8
xor eax, eax
ret
.L4:
mov eax, 1
ret
Note:
there is no call instruction, which code that actually recurses would need
label .L8 marks the loop that replaced the recursion
Remember that the "as-if rule" allows the compiler to transform your code in many ways to make it faster, as long as the observable behavior stays the same.
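To make the transformation easier to follow, here is a hand-written sketch of roughly what the generated loop does, expressed back in C++. The function name and the local variable names are my own (hypothetical), and the register comments are my reading of the listing above, not compiler output.

```cpp
// Hypothetical hand-written equivalent of the compiler-generated loop.
bool palindrome_two_pointers(const char* arr, int size) {
    if (size <= 1) {
        return true;                      // the early jump to .L4
    }
    const char* front = arr;              // rdi: walks forward
    const char* back = arr + size - 1;    // rax: walks backward
    const char* stop = arr + size / 2;    // rdx: where front stops
    while (front != stop) {
        if (*front != *back) {
            return false;                 // xor eax, eax; ret
        }
        ++front;                          // add rdi, 1
        --back;                           // sub rax, 1
    }
    return true;                          // .L4: mov eax, 1; ret
}
```

Calling palindrome_two_pointers("racecar", 7) walks three character pairs and never touches the stack for a recursive call.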

In general, a recursive solution is often more elegant than an iterative one, but it usually needs more CPU time and memory. Unless the compiler eliminates the recursion, the CPU has to put data on the stack for every recursive call.
Especially in this case, iteration seems more efficient in both time and memory.
Try something like this:
bool palindrome(char arr[], int size)
{
for (int i = 0; i < size / 2; ++i) { // each pair only needs to be compared once
if (arr[i] != arr[size-1-i])
return false;
}
return true;
}
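If you would rather not write the loop by hand, the same check can be sketched with the standard library. This is only an illustration: the name palindrome_std is hypothetical, and it assumes the three-iterator overload of std::equal together with std::reverse_iterator over a plain pointer.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical variant: compare the first half against the reversed second half.
bool palindrome_std(const char* arr, int size) {
    if (size <= 1) {
        return true;
    }
    // reverse_iterator(arr + size) dereferences to arr[size - 1] and moves backward.
    return std::equal(arr, arr + size / 2,
                      std::reverse_iterator<const char*>(arr + size));
}
```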

Related

Finding max number between two, which implementation to choose

I am trying to figure out which implementation has the edge over the other when finding the maximum of two numbers. As an example, let's examine two implementations:
Implementation 1:
int findMax (int a, int b)
{
return (a > b) ? a : b;
}
// Assembly output: (gcc 11.1)
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov eax, DWORD PTR [rbp-4]
cmp eax, DWORD PTR [rbp-8]
jle .L2
mov eax, DWORD PTR [rbp-4]
jmp .L4
.L2:
mov eax, DWORD PTR [rbp-8]
.L4:
pop rbp
ret
Implementation 2:
int findMax(int a, int b)
{
int diff, s, max;
diff = a - b;
s = (diff >> 31) & 1;
max = a - (s * diff);
return max;
}
// Assembly output: (gcc 11.1)
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov eax, DWORD PTR [rbp-20]
sub eax, DWORD PTR [rbp-24]
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
shr eax, 31
mov DWORD PTR [rbp-8], eax
mov eax, DWORD PTR [rbp-8]
imul eax, DWORD PTR [rbp-4]
mov edx, eax
mov eax, DWORD PTR [rbp-20]
sub eax, edx
mov DWORD PTR [rbp-12], eax
mov eax, DWORD PTR [rbp-12]
pop rbp
ret
The second one produces more assembly instructions, but the first one has a conditional jump. I am just trying to understand whether both are equally good.
First you need to turn on compiler optimizations (I used -O2 for the following). And you should compare to std::max. Then this:
#include <algorithm>
int findMax (int a, int b)
{
return (a > b) ? a : b;
}
int findMax2(int a, int b)
{
int diff, s, max;
diff = a - b;
s = (diff >> 31) & 1;
max = a - (s * diff);
return max;
}
int findMax3(int a,int b){
return std::max(a,b);
}
results in:
findMax(int, int):
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
findMax2(int, int):
mov ecx, edi
mov eax, edi
sub ecx, esi
mov edx, ecx
shr edx, 31
imul edx, ecx
sub eax, edx
ret
findMax3(int, int):
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
Your first version results in assembly identical to that of std::max, while your second variant does more work. When trying to optimize, you need to specify what you are optimizing for. There are several options that typically require a trade-off: runtime, memory usage, size of the executable, readability of the code, etc. Typically you cannot get them all at once.
When in doubt, do not reinvent the wheel; use the existing, already optimized std::max. And do not forget that the code you write is not a list of instructions for your CPU; rather, it is a high-level abstract description of what the program should do. It's the compiler's job to figure out how that can best be achieved.
Last but not least, your second variant is actually broken. See example here compiled with -O2 -fsanitize=signed-integer-overflow, results in:
/app/example.cpp:13:10: runtime error: signed integer overflow: -2147483648 - 2147483647 cannot be represented in type 'int'
You should favor correctness over speed. The fastest code is not worth a thing when it is wrong. And because of that, readability is next on the list. Code that is difficult to read and understand is also difficult to prove correct. I was only able to spot the problem in your code with the help of the compiler, while std::max(a,b) is unlikely to cause undefined behavior (and even if it does, at least it isn't your fault ;).
For two ints, you can compute max(a, b) without branching using a technique you probably learnt at school:
a ^ ((a ^ b) & -(a < b));
But no sane person would write this in their code. Always use std::max and trust the compiler to pick the best way. You may well find it adopts the above for int arguments with optimisations set appropriately, although I conjecture that a compare and jump is probably the best way on the whole, even at the expense of a pipeline dump.
Using std::max gives the compiler the best optimisation hint.
Implementation 1 performs well on a CISC CPU like a modern x64 AMD/Intel CPU.
Implementation 2 performs well on a RISC GPU like from nVIDIA or AMD Graphics.
The term "performs well" is only significant in a tight loop.

Understanding what clang is doing in assembly, decrementing for a loop that is incrementing

Consider the following code, in C++:
#include <cstdlib>
std::size_t count(std::size_t n)
{
std::size_t i = 0;
while (i < n) {
asm volatile("": : :"memory");
++i;
}
return i;
}
int main(int argc, char* argv[])
{
return count(argc > 1 ? std::atoll(argv[1]) : 1);
}
It is just a loop that is incrementing its value, and returns it at the end. The asm volatile prevents the loop from being optimized away. We compile it under g++ 8.1 and clang++ 5.0 with the arguments -Wall -Wextra -std=c++11 -g -O3.
Now, if we look at what compiler explorer is producing, we have, for g++:
count(unsigned long):
mov rax, rdi
test rdi, rdi
je .L2
xor edx, edx
.L3:
add rdx, 1
cmp rax, rdx
jne .L3
.L2:
ret
main:
mov eax, 1
xor edx, edx
cmp edi, 1
jg .L25
.L21:
add rdx, 1
cmp rdx, rax
jb .L21
mov eax, edx
ret
.L25:
push rcx
mov rdi, QWORD PTR [rsi+8]
mov edx, 10
xor esi, esi
call strtoll
mov rdx, rax
test rax, rax
je .L11
xor edx, edx
.L12:
add rdx, 1
cmp rdx, rax
jb .L12
.L11:
mov eax, edx
pop rdx
ret
and for clang++:
count(unsigned long): # #count(unsigned long)
test rdi, rdi
je .LBB0_1
mov rax, rdi
.LBB0_3: # =>This Inner Loop Header: Depth=1
dec rax
jne .LBB0_3
mov rax, rdi
ret
.LBB0_1:
xor edi, edi
mov rax, rdi
ret
main: # #main
push rbx
cmp edi, 2
jl .LBB1_1
mov rdi, qword ptr [rsi + 8]
xor ebx, ebx
xor esi, esi
mov edx, 10
call strtoll
test rax, rax
jne .LBB1_3
mov eax, ebx
pop rbx
ret
.LBB1_1:
mov eax, 1
.LBB1_3:
mov rcx, rax
.LBB1_4: # =>This Inner Loop Header: Depth=1
dec rcx
jne .LBB1_4
mov rbx, rax
mov eax, ebx
pop rbx
ret
Understanding the code generated by g++ is not that complicated, the loop being:
.L3:
add rdx, 1
cmp rax, rdx
jne .L3
every iteration increments rdx, and compares it to rax that stores the size of the loop.
Now, I have no idea of what clang++ is doing. Apparently it uses dec, which is weird to me, and I don't even understand where the actual loop is. My question is the following: what is clang doing?
(I am looking for comments about the clang assembly code to describe what is done at each step and how it actually works).
The effect of the function is to return n, either by counting up to n and returning the result, or by simply returning the passed-in value of n. The clang code does the latter. The counting loop is here:
mov rax, rdi
.LBB0_3: # =>This Inner Loop Header: Depth=1
dec rax
jne .LBB0_3
mov rax, rdi
ret
It begins by copying the value of n into rax. It decrements the value in rax, and if the result is not 0, it jumps back to .LBB0_3. If the value is 0 it falls through to the next instruction, which copies the original value of n into rax and returns.
There is no i stored, but the code does the loop the prescribed number of times, and returns the value that i would have had, namely, n.
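As a rough illustration, the clang output corresponds to something like the following C++. The function name and the variable remaining are hypothetical, and the empty asm statement from the question is left out, so a real compiler is free to simplify this sketch further.

```cpp
#include <cstddef>

// Hypothetical C++ rendering of the clang-generated code for count().
std::size_t count_like_clang(std::size_t n) {
    if (n == 0) {
        return 0;                    // .LBB0_1: nothing to count
    }
    std::size_t remaining = n;       // mov rax, rdi
    do {
        --remaining;                 // dec rax
    } while (remaining != 0);        // jne .LBB0_3
    return n;                        // mov rax, rdi; ret
}
```

The loop body runs exactly n times, and the value returned is the one i would have reached, without i ever being stored.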

why is std::equal much slower than a hand rolled loop for two small std::array?

I was profiling a small piece of code that is part of a larger simulation, and to my surprise, the STL function equal (std::equal) is much slower than a simple for-loop, comparing the two arrays element by element. I wrote a small test case, which I believe to be a fair comparison between the two, and the difference, using g++ 6.1.1 from the Debian archives is not insignificant. I am comparing two, four-element arrays of signed integers. I tested std::equal, operator==, and a small for loop. I didn't use std::chrono for an exact timing, but the difference can be seen explicitly with time ./a.out.
My question is, given the sample code below, why do operator== and the overloaded function std::equal (which calls operator==, I believe) take approximately 40s to complete, while the hand-written loop takes only 8s? I'm using a very recent Intel-based laptop. The for-loop is faster on all optimization levels: -O1, -O2, -O3, and -Ofast. I compiled the code with
g++ -std=c++14 -Ofast -march=native -mtune=native
Run the code
The loop runs a huge number of times, just to make the difference clear to the naked eye. The modulo operators represent a cheap operation on one of the array elements, and serve to keep the compiler from optimizing out of the loop.
#include<iostream>
#include<algorithm>
#include<array>
#include<cstdint> // int32_t, uint32_t
#include<cstdlib> // EXIT_SUCCESS
using namespace std;
using T = array<int32_t, 4>;
bool
are_equal_manual(const T& L, const T& R)
noexcept {
bool test{ true };
for(uint32_t i{0}; i < 4; ++i) { test = test && (L[i] == R[i]); }
return test;
}
bool
are_equal_alg(const T& L, const T& R)
noexcept {
bool test{ equal(cbegin(L),cend(L),cbegin(R)) };
return test;
}
int main(int argc, char** argv) {
T left{ {0,1,2,3} };
T right{ {0,1,2,3} };
cout << boolalpha << are_equal_manual(left,right) << endl;
cout << boolalpha << are_equal_alg(left,right) << endl;
cout << boolalpha << (left == right) << endl;
bool t{};
const size_t N{ 5000000000 };
for(size_t i{}; i < N; ++i) {
//t = left == right; // SLOW
//t = are_equal_manual(left,right); // FAST
t = are_equal_alg(left,right); // SLOW
left[0] = i % 10;
right[2] = i % 8;
}
cout<< boolalpha << t << endl;
return(EXIT_SUCCESS);
}
Here's the generated assembly of the for loop in main() when the are_equal_manual(left,right) function is used:
.L21:
xor esi, esi
test eax, eax
jne .L20
cmp edx, 2
sete sil
.L20:
mov rax, rcx
movzx esi, sil
mul r8
shr rdx, 3
lea rax, [rdx+rdx*4]
mov edx, ecx
add rax, rax
sub edx, eax
mov eax, edx
mov edx, ecx
add rcx, 1
and edx, 7
cmp rcx, rdi
And here's what's generated when the are_equal_alg(left,right) function is used:
.L20:
lea rsi, [rsp+16]
mov edx, 16
mov rdi, rsp
call memcmp
mov ecx, eax
mov rax, rbx
mov rdi, rbx
mul r12
shr rdx, 3
lea rax, [rdx+rdx*4]
add rax, rax
sub rdi, rax
mov eax, ebx
add rbx, 1
and eax, 7
cmp rbx, rbp
mov DWORD PTR [rsp], edi
mov DWORD PTR [rsp+24], eax
jne .L20
I'm not exactly sure what's happening in the generated code for the first case, but it's clearly not calling memcmp(). It doesn't appear to be comparing the contents of the arrays at all. While the loop still iterates 5000000000 times, it has been optimized down to doing very little. However, the loop that uses are_equal_alg(left,right) is still performing the comparison. Basically, the compiler is able to optimize the manual comparison much better than the std::equal template.
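If you want numbers rather than eyeballing time ./a.out, a small std::chrono harness along these lines can help. This is a sketch under assumptions: time_ms is a hypothetical helper, the two comparison functions are restated from the question, and the measured times depend heavily on compiler, flags, and how much of the loop the optimizer removes.

```cpp
#include <algorithm>
#include <array>
#include <chrono>
#include <cstddef>
#include <cstdint>

using T = std::array<std::int32_t, 4>;

bool are_equal_manual(const T& l, const T& r) noexcept {
    bool same = true;
    for (std::size_t i = 0; i < l.size(); ++i) { same = same && (l[i] == r[i]); }
    return same;
}

bool are_equal_alg(const T& l, const T& r) noexcept {
    return std::equal(l.cbegin(), l.cend(), r.cbegin());
}

// Hypothetical helper: runs `compare` in the same loop shape as the question
// and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F compare, std::size_t iterations, bool& last) {
    T left{{0, 1, 2, 3}};
    T right{{0, 1, 2, 3}};
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        last = compare(left, right);
        left[0] = static_cast<std::int32_t>(i % 10);   // cheap mutation, as in the question
        right[2] = static_cast<std::int32_t>(i % 8);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_ms(are_equal_manual, N, t) and time_ms(are_equal_alg, N, t) with the same N gives two figures to compare; looking at the generated assembly, as above, remains the more reliable way to see what is actually being executed.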

Successfully enabling -fno-finite-math-only on NaN removal method

In finding a bug which made everything turn into NaNs when running the optimized version of my code (compiling in g++ 4.8.2 and 4.9.3), I identified that the problem was the -Ofast option, specifically, the -ffinite-math-only flag it includes.
One part of the code involves reading floats from a FILE* using fscanf, and then replacing all NaNs with a numeric value. As could be expected, however, -ffinite-math-only kicks in, and removes these checks, thus leaving the NaNs.
In trying to solve this problem, I stumbled upon this, which suggested adding -fno-finite-math-only as a function attribute to disable the optimization for that specific function. The following illustrates the problem and the attempted fix (which doesn't actually fix it):
#include <cstdio>
#include <cmath>
__attribute__((optimize("-fno-finite-math-only")))
void replaceNaN(float * arr, int size, float newValue){
for(int i = 0; i < size; i++) if (std::isnan(arr[i])) arr[i] = newValue;
}
int main(void){
const size_t cnt = 10;
float val[cnt];
for(int i = 0; i < cnt; i++) scanf("%f", val + i);
replaceNaN(val, cnt, -1.0f);
for(int i = 0; i < cnt; i++) printf("%f ", val[i]);
return 0;
}
The code does not act as desired if compiled/run using echo 1 2 3 4 5 6 7 8 nan 10 | (g++ -ffinite-math-only test.cpp -o test && ./test). Specifically, it outputs a nan (which should have been replaced by -1.0f); it behaves fine if the -ffinite-math-only flag is omitted. Shouldn't this work? Am I missing something with the syntax for attributes in gcc, or is this one of the aforementioned cases of "there being some trouble with some version of GCC related to this" (from the linked SO question)?
A few solutions I'm aware of, but would rather something a bit cleaner/more portable:
Compile the code with -fno-finite-math-only (my interim solution): I suspect that this optimization may be rather useful in my context in the remainder of the program;
Manually look for the string "nan" in the input stream, and then replace the value there (the input reader is in an unrelated part of the library, yielding poor design to include this test there).
Assume a specific floating point architecture and make my own isNaN: I may do this, but it's a bit hackish and non-portable.
Prefilter the data using a separately compiled program without the -ffinite-math-only flag, and then feed that into the main program: The added complexity of maintaining two binaries and getting them to talk to each other just isn't worth it.
Edit: As put in the accepted answer, it would seem this is a compiler "bug" in older versions of g++, such as 4.8.2 and 4.9.3, that is fixed in newer versions, such as 5.1 and 6.1.1.
If for some reason updating the compiler is not a reasonably easy option (e.g.: no root access), or adding this attribute to a single function still doesn't completely solve the NaN check problem, an alternate solution, if you can be certain that the code will always run in an IEEE754 floating point environment, is to manually check the bits of the float for a NaN signature.
The accepted answer suggests doing this using a bit field, however, the order in which the compiler places the elements in a bit field is non-standard, and in fact, changes between the older and newer versions of g++, even refusing to adhere to the desired positioning in older versions (4.8.2 and 4.9.3, always placing the mantissa first), regardless of the order in which they appear in the code.
A solution using bit manipulation, however, is guaranteed to work on all IEEE754 compliant compilers. Below is my such implementation, which I ultimately used to solve my problem. It checks for IEEE754 compliance, and I've extended it to allow for doubles, as well as other more routine floating point bit manipulations.
#include <cstddef> // size_t
#include <cstdint> // uint32_t, uint64_t
#include <limits> // IEEE754 compliance test
#include <type_traits> // enable_if
template<
typename T,
typename = typename std::enable_if<std::is_floating_point<T>::value>::type,
typename = typename std::enable_if<std::numeric_limits<T>::is_iec559>::type,
typename u_t = typename std::conditional<std::is_same<T, float>::value, uint32_t, uint64_t>::type
>
struct IEEE754 {
enum class WIDTH : size_t {
SIGN = 1,
EXPONENT = std::is_same<T, float>::value ? 8 : 11,
MANTISSA = std::is_same<T, float>::value ? 23 : 52
};
enum class MASK : u_t {
SIGN = (u_t)1 << (sizeof(u_t) * 8 - 1),
EXPONENT = ((~(u_t)0) << (size_t)WIDTH::MANTISSA) ^ (u_t)MASK::SIGN,
MANTISSA = (~(u_t)0) >> ((size_t)WIDTH::SIGN + (size_t)WIDTH::EXPONENT)
};
union {
T f;
u_t u;
};
IEEE754(T f) : f(f) {}
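// Parenthesize the mask: >> binds tighter than &, so without the parentheses the mask itself would be shifted.
inline u_t sign() const { return (u & (u_t)MASK::SIGN) >> ((size_t)WIDTH::EXPONENT + (size_t)WIDTH::MANTISSA); }
inline u_t exponent() const { return (u & (u_t)MASK::EXPONENT) >> (size_t)WIDTH::MANTISSA; }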
inline u_t mantissa() const { return u & (u_t)MASK::MANTISSA; }
inline bool isNan() const {
return (mantissa() != 0) && ((u & ((u_t)MASK::EXPONENT)) == (u_t)MASK::EXPONENT);
}
};
template<typename T>
inline IEEE754<T> toIEEE754(T val) { return IEEE754<T>(val); }
And the replaceNaN function now becomes:
void replaceNaN(float * arr, int size, float newValue){
for(int i = 0; i < size; i++)
if (toIEEE754(arr[i]).isNan()) arr[i] = newValue;
}
An inspection of the assembly of these functions reveals that, as expected, all masks become compile-time constants, leading to the following (seemingly) efficient code:
# In loop of replaceNaN
movl (%rcx), %eax # eax = arr[i]
testl $8388607, %eax # Check if mantissa is empty
je .L3 # If it is, it's not a nan (it's inf), continue loop
andl $2139095040, %eax # Mask leaves only exponent
cmpl $2139095040, %eax # Test if exponent is all 1s
jne .L3 # If it isn't, it's not a nan, so continue loop
This is one instruction less than with a working bit field solution (no shift), and the same number of registers are used (although it's tempting to say this alone makes it more efficient, there are other concerns such as pipelining which may make one solution more or less efficient than the other one).
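For comparison, the same mask test can be sketched without a union or bit field by copying the bits out with std::memcpy, which avoids relying on union type punning or on bit-field ordering. This is a minimal single-precision sketch, not the code I used above; the name isNanBits and the constant names are hypothetical, and it assumes float is a 32-bit IEEE 754 type (checked with static_assert).

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical memcpy-based NaN test for 32-bit IEEE 754 floats.
bool isNanBits(float value) {
    static_assert(std::numeric_limits<float>::is_iec559, "IEEE 754 float required");
    static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float required");

    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);          // copy the object representation

    const std::uint32_t exponentMask = 0x7F800000u;   // 2139095040 in the assembly above
    const std::uint32_t mantissaMask = 0x007FFFFFu;   // 8388607 in the assembly above

    // NaN: exponent all ones and a non-zero mantissa (a zero mantissa means infinity).
    return (bits & exponentMask) == exponentMask && (bits & mantissaMask) != 0;
}
```

It can be dropped into replaceNaN in place of toIEEE754(arr[i]).isNan() for the float case.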
Looks like a compiler bug to me. Up through GCC 4.9.2, the attribute is completely ignored. GCC 5.1 and later pay attention to it. Perhaps it's time to upgrade your compiler?
__attribute__((optimize("-fno-finite-math-only")))
void replaceNaN(float * arr, int size, float newValue){
for(int i = 0; i < size; i++) if (std::isnan(arr[i])) arr[i] = newValue;
}
Compiled with -ffinite-math-only on GCC 4.9.2:
replaceNaN(float*, int, float):
rep ret
But with the exact same settings on GCC 5.1:
replaceNaN(float*, int, float):
test esi, esi
jle .L26
sub rsp, 8
call std::isnan(float) [clone .isra.0]
test al, al
je .L2
mov rax, rdi
and eax, 15
shr rax, 2
neg rax
and eax, 3
cmp eax, esi
cmova eax, esi
cmp esi, 6
jg .L28
mov eax, esi
.L5:
cmp eax, 1
movss DWORD PTR [rdi], xmm0
je .L16
cmp eax, 2
movss DWORD PTR [rdi+4], xmm0
je .L17
cmp eax, 3
movss DWORD PTR [rdi+8], xmm0
je .L18
cmp eax, 4
movss DWORD PTR [rdi+12], xmm0
je .L19
cmp eax, 5
movss DWORD PTR [rdi+16], xmm0
je .L20
movss DWORD PTR [rdi+20], xmm0
mov edx, 6
.L7:
cmp esi, eax
je .L2
.L6:
mov r9d, esi
lea r8d, [rsi-1]
mov r11d, eax
sub r9d, eax
lea ecx, [r9-4]
sub r8d, eax
shr ecx, 2
add ecx, 1
cmp r8d, 2
lea r10d, [0+rcx*4]
jbe .L9
movaps xmm1, xmm0
lea r8, [rdi+r11*4]
xor eax, eax
shufps xmm1, xmm1, 0
.L11:
add eax, 1
add r8, 16
movaps XMMWORD PTR [r8-16], xmm1
cmp ecx, eax
ja .L11
add edx, r10d
cmp r9d, r10d
je .L2
.L9:
movsx rax, edx
movss DWORD PTR [rdi+rax*4], xmm0
lea eax, [rdx+1]
cmp eax, esi
jge .L2
add edx, 2
cdqe
cmp esi, edx
movss DWORD PTR [rdi+rax*4], xmm0
jle .L2
movsx rdx, edx
movss DWORD PTR [rdi+rdx*4], xmm0
.L2:
add rsp, 8
.L26:
rep ret
.L28:
test eax, eax
jne .L5
xor edx, edx
jmp .L6
.L20:
mov edx, 5
jmp .L7
.L19:
mov edx, 4
jmp .L7
.L18:
mov edx, 3
jmp .L7
.L17:
mov edx, 2
jmp .L7
.L16:
mov edx, 1
jmp .L7
The output is similar, although not quite identical, on GCC 6.1.
Replacing the attribute with
#pragma GCC push_options
#pragma GCC optimize ("-fno-finite-math-only")
void replaceNaN(float * arr, int size, float newValue){
for(int i = 0; i < size; i++) if (std::isnan(arr[i])) arr[i] = newValue;
}
#pragma GCC pop_options
makes absolutely no difference, so it is not simply a matter of the attribute being ignored. These older versions of the compiler clearly do not support controlling the floating-point optimization behavior at function-level granularity.
Note, however, that the generated code on GCC 5.1 and later is still significantly worse than compiling the function without the -ffinite-math-only switch:
replaceNaN(float*, int, float):
test esi, esi
jle .L1
lea eax, [rsi-1]
lea rax, [rdi+4+rax*4]
.L5:
movss xmm1, DWORD PTR [rdi]
ucomiss xmm1, xmm1
jnp .L6
movss DWORD PTR [rdi], xmm0
.L6:
add rdi, 4
cmp rdi, rax
jne .L5
rep ret
.L1:
rep ret
I have no idea why there is such a discrepancy. Something is badly throwing the compiler off its game; this is even worse code than you get with optimizations completely disabled. If I had to guess, I'd speculate it was the implementation of std::isnan. If this replaceNaN method is not speed-critical, then it probably doesn't matter. If you need to repeatedly parse values from a file, you might prefer to have a reasonably efficient implementation.
Personally, I would write my own non-portable implementation of std::isnan. The IEEE 754 formats are all quite well-documented, and assuming you thoroughly test and comment the code, I can't see the harm in this, unless you absolutely need the code to be portable to all different architectures. It will drive purists up the wall, but so should using non-standard options like -ffinite-math-only. For a single-precision float, something like:
bool my_isnan(float value)
{
union IEEE754_Single
{
float f;
struct
{
#if BIG_ENDIAN
uint32_t sign : 1;
uint32_t exponent : 8;
uint32_t mantissa : 23;
#else
uint32_t mantissa : 23;
uint32_t exponent : 8;
uint32_t sign : 1;
#endif
} bits;
} u = { value };
// In the IEEE 754 representation, a float is NaN when
// the mantissa is non-zero, and the exponent is all ones
// (2^8 - 1 == 255).
return (u.bits.mantissa != 0) && (u.bits.exponent == 255);
}
Now there is no need for annotations; just use my_isnan instead of std::isnan. This produces the following object code when compiled with -ffinite-math-only enabled:
replaceNaN(float*, int, float):
test esi, esi
jle .L6
lea eax, [rsi-1]
lea rdx, [rdi+4+rax*4]
.L13:
mov eax, DWORD PTR [rdi] ; get original floating-point value
test eax, 8388607 ; test if mantissa != 0
je .L9
shr eax, 16 ; test if exponent has all bits set
and ax, 32640
cmp ax, 32640
jne .L9
movss DWORD PTR [rdi], xmm0 ; set newValue if original was NaN
.L9:
add rdi, 4
cmp rdx, rdi
jne .L13
rep ret
.L6:
rep ret
The NaN check is slightly more complicated than a simple ucomiss followed by a test of the parity flag, but is guaranteed to be correct as long as your compiler adheres to the IEEE 754 standard. This works on all versions of GCC, and any other compiler.

Would the compiler optimize this expression into a temporary constant rather than resolve it every iteration?

I have the following loop:
for (unique_ptr<Surface>& child : Children)
{
child->Gather(hit);
if (hit && HitTestMode == HitTestMode::Content && child->MouseOver && !mouseOver)
{
mouseOver = true;
}
}
I wonder if the compiler (I use Visual Studio 2013, targeting x64 on Win7 upwards) would optimize the expression
hit && HitTestMode == HitTestMode::Content
into a temporary constant and use that rather than resolve the expression every iteration, similar to me doing something like this:
bool contentMode = hit && HitTestMode == HitTestMode::Content;
for (unique_ptr<Surface>& child : Children)
{
child->Gather(hit);
if (contentMode && child->MouseOver && !mouseOver)
{
mouseOver = true;
}
}
Bonus question:
Is checking for !mouseOver worth it (in order to skip the conditional mouseOver = true; if it has already been set)? Or is it faster to simply set it again regardless?
The answer to whether that optimization could even take place would depend on what hit, HitTestMode and HitTestMode::Content are and whether it's possible that they could be changed by the call to child->Gather().
If those identifiers are constants or local variables that the compiler can prove aren't modified, then it's entirely possible that the sub-expression hit && HitTestMode == HitTestMode::Content will be hoisted.
For example, consider the following compilable version of your example:
#include <memory>
#include <vector>
using namespace std;
class Surface
{
public:
void Gather(bool hit);
bool MouseOver;
};
enum class HitTestMode
{
Content = 1,
Foo = 3,
Bar = 4,
};
extern HitTestMode hittestmode;
bool anyMiceOver( vector<unique_ptr<Surface> > & Children, bool hit)
{
bool mouseOver = false;
for (unique_ptr<Surface>& child : Children)
{
child->Gather(hit);
if (hit && hittestmode == HitTestMode::Content && child->MouseOver && !mouseOver)
{
mouseOver = true;
}
}
return mouseOver;
}
When compiled using g++ 4.8.1 (mingw) with the -O3 optimization option, you get the following snippet of code for the loop (annotations added):
mov rbx, QWORD PTR [rcx] ; Children.begin()
mov rsi, QWORD PTR 8[rcx] ; Children.end()
cmp rbx, rsi
je .L8 ; early exit if Children is empty
test dl, dl ; hit == 0?
movzx edi, dl
je .L5 ; then goto loop L5
xor ebp, ebp
mov r12d, 1
jmp .L7
.p2align 4,,10
.L6:
add rbx, 8
cmp rsi, rbx ; check for end of Children
je .L2
.L7:
mov rcx, QWORD PTR [rbx]
mov edx, edi
call _ZN7Surface6GatherEb ; call child->Gather(hit)
cmp DWORD PTR hittestmode[rip], 1 ; check hittestmode
jne .L6
mov rax, QWORD PTR [rbx] ; check child->MouseOver
cmp BYTE PTR [rax], 0
cmovne ebp, r12d ; set mouseOver appropriately
jmp .L6
.p2align 4,,10
.L5: ; loop L5 is run only when hit == 0
mov rcx, QWORD PTR [rbx] ; get next element in Children
mov edx, edi
add rbx, 8
call _ZN7Surface6GatherEb ; call child->Gather(hit)
cmp rsi, rbx
jne .L5
.L8:
xor ebp, ebp
.L2:
mov eax, ebp
add rsp, 32
pop rbx
pop rsi
pop rdi
pop rbp
pop r12
ret
You'll note that the check for hit has been hoisted out of the loop, and if it's false then a loop that does nothing but call child->Gather() is run.
If hittestmode is changed to be a variable that's passed as a function argument, so it's no longer subject to possibly being changed by the call to child->Gather(hit), then the compiler will also hoist the check for the value of hittestmode out of the loop and jump to the loop that does nothing but call child->Gather().
With a local hittestmode, using -O2 will calculate hit && hittestmode == HitTestMode::Content prior to the loop and stash that result in a register, but it will still test the register in each loop iteration instead of optimizing to separate loops that don't even bother with the test.
Since you specifically asked about the VS2013 compiler (using the /Ox and /Ot options): it doesn't seem to hoist or optimize either of the checks (for hit or hittestmode) out of the loop; all it seems to do is keep the values of those variables in registers.
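If you don't want to depend on the optimizer, the transformation g++ performs at -O3 can be written by hand. The following is a sketch under assumptions: the function name anyMiceOverUnswitched is hypothetical, Surface is a simplified stand-in with a trivial Gather (the gathered counter exists only for illustration), and the mode is passed by value so that Gather cannot change it.

```cpp
#include <memory>
#include <vector>

// Simplified stand-in for the real Surface class (assumption for illustration).
struct Surface {
    bool MouseOver = false;
    int gathered = 0;
    void Gather(bool /*hit*/) { ++gathered; }
};

enum class HitTestMode { Content = 1, Foo = 3, Bar = 4 };

// Hypothetical hand-unswitched version of the loop.
bool anyMiceOverUnswitched(std::vector<std::unique_ptr<Surface>>& children,
                           bool hit, HitTestMode mode) {
    // Evaluated once; neither operand can change inside the loops below.
    const bool contentMode = hit && mode == HitTestMode::Content;

    if (!contentMode) {
        // Loop that does nothing but call Gather(), like loop .L5 above.
        for (std::unique_ptr<Surface>& child : children) {
            child->Gather(hit);
        }
        return false;
    }

    bool mouseOver = false;
    for (std::unique_ptr<Surface>& child : children) {
        child->Gather(hit);
        if (child->MouseOver) {
            mouseOver = true;
        }
    }
    return mouseOver;
}
```

Whether the duplicated loop is worth it for readability is a judgment call; the single-loop version with a precomputed contentMode is usually the easier one to maintain.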