Is this a compiler error in Visual Studio 2010?

Is this a compiler error in Visual Studio 2010? - c++

I have a bug in this conditional:
while(CurrentObserverPathPointDisplacement > lengthToNextPoint && CurrentObserverPathPointIndex < (PathSize - 1) )
{
CurrentObserverPathPointIndex = CurrentObserverPathPointIndex + 1;
CurrentObserverPathPointDisplacement -= lengthToNextPoint;
lengthToNextPoint = (CurrentObserverPath->pathPoints[min((PathSize - 1),CurrentObserverPathPointIndex + 1)] - CurrentObserverPath->pathPoints[CurrentObserverPathPointIndex]).length();
}
It seems to get stuck in an infinite loop while in Release mode. Works fine in debug mode, or more interstingly when I put a debug print on the last line
OutputInDebug("Here");
Here is the generated assembly for the conditional itself:
while(CurrentObserverPathPointDisplacement > lengthToNextPoint && CurrentObserverPathPointIndex < (PathSize - 1) )
00F074CF fcom qword ptr [dist]
00F074D2 fnstsw ax
00F074D4 test ah,5
00F074D7 jp ModelViewData::moveCameraAndCenterOnXYPlaneForwardBackward+27Eh (0F0753Eh)
00F074D9 mov eax,dword ptr [dontRotate]
00F074DC cmp eax,ebx
00F074DE jge ModelViewData::moveCameraAndCenterOnXYPlaneForwardBackward+27Eh (0F0753Eh)
{
You can see that for the second condition, it seems to move the value of 'dontRotate', a function parameter of type bool, into eax, and then compare against it, yet dontRotate is used nowhere near that bit of code.
I understand that this may be a bit little data, but it seems like an obvious compiler error personally. But sadly, i'm not sure how to distill it down to a self contained enough problem to actually produce a bug report.
Edit:
Not the actual decelerations, but the types:
double CurrentObserverPathPointDisplacement;
double lengthToNextPoint;
int CurrentObserverPathPointIndex;
int PathSize;
vector<vector3<double>> CurrentObserverPath::pathPoints;
Edit2:
Once I add in the debug print statement to the end of the while, this is the assembly that gets generated, which no longer expresses the bug:
while(CurrentObserverPathPointDisplacement > lengthToNextPoint && CurrentObserverPathPointIndex < (PathSize - 1) )
00B1751E fcom qword ptr [esi+208h]
00B17524 fnstsw ax
00B17526 test ah,5
00B17529 jp ModelViewData::moveCameraAndCenterOnXYPlaneForwardBackward+2D6h (0B175A6h)
00B1752B mov eax,dword ptr [esi+200h]
00B17531 cmp eax,ebx
00B17533 jge ModelViewData::moveCameraAndCenterOnXYPlaneForwardBackward+2D6h (0B175A6h)
{

Here:
while(/* foo */ && CurrentObserverPathPointIndex < (PathSize - 1) )
{
CurrentObserverPathPointIndex = CurrentObserverPathPointIndex + 1;
Since this is the only point (unless min does something really nasty) in the loop where CurrentObserverPathPointIndex is changed and both CurrentObserverPathPointIndex and PathSize are signed integers of the same size (and PathSize is small enough to rule out integer promotion issues), the remaining floating point fiddling is irrelevant. The loop must terminate eventually (it may take quite a long time if the initial value of CurrentOvserverPathPointIndex is small compared to PathSize, though).
This allows only one conclusion; If the compiler generates code that does not terminate (ever), the compiler is wrong.

It looks like PathSize doesn't change in the loop, so the compiler can compute PathSize - 1 before looping and coincidentally use the same memory location as dontRotate, whatever that is.
More importantly, how many elements are there in CurrentObserverPath->pathPoints?
Your loop condition includes this test:
CurrentObserverPathPointIndex < (PathSize - 1)
Inside your loop is this assignment:
CurrentObserverPathPointIndex = CurrentObserverPathPointIndex + 1;
followed by this further incremented subscript:
[min((PathSize - 1),CurrentObserverPathPointIndex + 1)]
Maybe your code appeared to work in debug mode because random undefined behaviour appeared to work?

Related

why does visual studio have i++ post increment as a default for autocompleting for loop?

so this is what visual studio gives when auto completing a for loop.
I have been told that ++i is more effecient than i++ as the pre increment does not require the comiler to create a temporary variable to store i. and that I should use ++i unless I actually require i++. is there any reason that visual studio does this as a default or am I just overthinking it?
for (size_t i = 0; i < length; i++)
{
}

Some of this answer is based on my experience rather than on any hard facts, so please take it with a grain of salt.
Reasons for i++ as the default loop incrementor
Postfix vs Prefix
The postfix incrementor i++ produces a temporary pre-incremented value that's passed 'upwards' in the expression, then variable i is subsequently incremented.
The prefix incrementor ++i does not produce a temporary object, but directly increments i and passes the incremented result to the expression.
Performance
Looking at two simple loops, one using i++, and the other ++i, let us examine the generated assembly
#include <stdio.h>
int main()
{
for (int i = 0; i < 5; i++) {
//
}
for (int j = 0; j < 5; ++j) {
//
}
return 0;
}
Generates the following assembly using GCC:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
.L3:
cmp DWORD PTR [rbp-4], 4
jg .L2
add DWORD PTR [rbp-4], 1
jmp .L3
.L2:
mov DWORD PTR [rbp-8], 0
.L5:
cmp DWORD PTR [rbp-8], 4
jg .L4
add DWORD PTR [rbp-8], 1
jmp .L5
.L4:
mov eax, 0
pop rbp
ret
.L3 corresponds to the first loop, and .L5 corresponds to the second loop. The compiler optimizes away any differences in this case resulting in identical code, so no performance difference.
Appearance
Looking at the following three loops, which one appeals to you most aesthetically?
for (int i = 0; i < 5; i++) {
//
}
for (int i = 0; i < 5; ++i) {
//
}
for (int i = 0; i < 5; i+=1) {
//
}
For me i+=1 is right out, and ++i just feels sort of... backwards
Ergonomics
i++) - Your hand is already in the correct position to type ) after typing ++
++i) - Requires moving from upper row to homerow and back
History
Older C programmers, and current C programmers who write systems level code make frequent use of i++ to allow for writing very compact code such as:
// print some char * s
while(*s)
putc(*s++);
// strcpy from src to dest
while (*dest++=*src++);
Because of this i++ became the incrementor C programmers reach for first, eventually becoming a sort of standard in cases where ++i vs i++ have the same functionally.

Choosing ++i instead there would merely be a style choice.
There are no performance gains to be had in this context. That is a myth.
i++ is also more "conventional" for a simple loop over ints.
You are of course free to write your own loop in your own style!

Why does this if-statement combining assignment and an equality check return true?

I've been thinking of some beginner mistakes and I ended up with the one on the if statement. I expanded a bit the code to this:
int i = 0;
if (i = 1 && i == 0) {
std::cout << i;
}
I have seen that the if statement returns true, and it cout's i as 1. If i is assigned 1 in the if statement, why did i == 0 return true?

This has to do with operator precedence.
if (i = 1 && i == 0)
is not
if ((i = 1) && (i == 0))
because both && and == have a higher precedence than =. What it really works out to is
if (i = (1 && (i == 0)))
which assigns the result of 1 && (i == 0) to i. So, if i starts at 0 then i == 0 is true, so 1 && true is true (or 1), and then i gets set to 1. Then since 1 is true, you enter the if block and print the value you assigned to i.

Assuming your code actually looks like this:
#include <iostream>
using namespace std;
int main() {
int i = 0;
if (i = 1 && i == 0) {
cout << i;
}
}
Then this:
if (i = 1 && i == 0) {
evaluates as
if (i = (1 && i == 0)) {
and so i is set to 1.

The actual answer is:
The compiler gives precedence to "i == 0", which evaluates to true.
Then it will evaluate i=1 as TRUE or FALSE, and since compiled assignment operators never fail (otherwise they wouldn't compile), it also evaluates to true.
Since both statements evaluate as true, and TRUE && TRUE evaluates to TRUE, the if statement will evaluate to TRUE.
As proof, just look at the asm output of your compiler for the code you entered (all comments are my own):
mov dword ptr [rbp - 8], 0 ; i = 0;
cmp dword ptr [rbp - 8], 0 ; i == 0?
sete al ; TRUE (=1)
mov cl, al
and cl, 1 ; = operator always TRUE
movzx edx, cl
mov dword ptr [rbp - 8], edx ; set i=TRUE;
test al, 1 ; al never changed,
; so final ans is TRUE
The asm output above was from CLANG, but all other compilers I looked at gave similar output. This is true for all the compilers on that site, whether they are pure C or C++ compilers, all without any pragmas to change the mode of the compiler (which by default is C++ for the C++ compilers)
Note that your compiler did not actually set i=1, but i=TRUE (which means any 32-bit not zero integer value). That's because the && operator only evaluates whether a statement is TRUE or FALSE, and then sets the results according to that result. As proof, try changing i=1 to i=2 and you can observe for yourself that nothing will change. See for yourself using any online compiler at Compiler Explorer

It has to do with parsing an the right to left rules.
Eg y = x+5.
All sub-expressions are weighted in importance.
Two expressions of equal importance are evaluated right to left, . The && expression side is done first, followed by the LHS.
Makes sense to me.

Why do C++ optimizers have problems with these temporary variables or rather why `v[]` should be avoided in tight loops?

In this code snippet, I'm comparing performance of two functionally identical loops:
for (int i = 1; i < v.size()-1; ++i) {
int a = v[i-1];
int b = v[i];
int c = v[i+1];
if (a < b && b < c)
++n;
}
and
for (int i = 1; i < v.size()-1; ++i)
if (v[i-1] < v[i] && v[i] < v[i+1])
++n;
The first one runs significantly slower than the second one across a number of different C++ compilers with optimization flag set to O2:
second loop is about 330% slower now with Clang 3.7.0
second loop is about 2% slower with gcc 4.9.3
second loop is about 2% slower with Visual C++ 2015
I'm puzzled that modern C++ optimizers have problems handling this case. Any clues why? Do I have to write ugly code without using temporary variables in order to get the best performance?
Using temporary variables makes the code faster, sometimes dramatically, now. What is going on?
The full code I'm using is provided below:
#include <algorithm>
#include <chrono>
#include <random>
#include <iomanip>
#include <iostream>
#include <vector>
using namespace std;
using namespace std::chrono;
vector<int> v(1'000'000);
int f0()
{
int n = 0;
for (int i = 1; i < v.size()-1; ++i) {
int a = v[i-1];
int b = v[i];
int c = v[i+1];
if (a < b && b < c)
++n;
}
return n;
}
int f1()
{
int n = 0;
for (int i = 1; i < v.size()-1; ++i)
if (v[i-1] < v[i] && v[i] < v[i+1])
++n;
return n;
}
int main()
{
auto benchmark = [](int (*f)()) {
const int N = 100;
volatile long long result = 0;
vector<long long> timings(N);
for (int i = 0; i < N; ++i) {
auto t0 = high_resolution_clock::now();
result += f();
auto t1 = high_resolution_clock::now();
timings[i] = duration_cast<nanoseconds>(t1-t0).count();
}
sort(timings.begin(), timings.end());
cout << fixed << setprecision(6) << timings.front()/1'000'000.0 << "ms min\n";
cout << timings[timings.size()/2]/1'000'000.0 << "ms median\n" << "Result: " << result/N << "\n\n";
};
mt19937 generator (31415); // deterministic seed
uniform_int_distribution<> distribution(0, 1023);
for (auto& e: v)
e = distribution(generator);
benchmark(f0);
benchmark(f1);
cout << "\ndone\n";
return 0;
}

It seems like the compiler lacks knowledge about the relationship between std::vector<>::size() and internal vector buffer size. Consider std::vector being our custom bugged_vector vector-like object with slight bug - its ::size() can sometimes be one more than internal buffer size n, but only then v[n-2] >= v[n-1].
Then two snippets have different semantics again: first one has undefined behavior, as we access element v[v.size() - 1]. The second one, however, doesn't have: due to short-circuit nature of &&, we don't ever read v[v.size() - 1] on the last iteration.
So, if compiler can't prove that our v is not a bugged_vector, it must short-circuit, which introduce additional jump in a machine code.
By looking at assembly output from clang, we can see that it actually happens.
From the Godbolt Compiler Explorer, with clang 3.7.0 -O2, the loop in f0 is:
### f0: just the loop
.LBB1_2: # =>This Inner Loop Header: Depth=1
mov edi, ecx
cmp edx, edi
setl r10b
mov ecx, dword ptr [r8 + 4*rsi + 4]
lea rsi, [rsi + 1]
cmp edi, ecx
setl dl
and dl, r10b
movzx edx, dl
add eax, edx
cmp rsi, r9
mov edx, edi
jb .LBB1_2
And for f1:
### f1: just the loop
.LBB2_2: # =>This Inner Loop Header: Depth=1
mov esi, r10d
mov r10d, dword ptr [r9 + 4*rdi]
lea rcx, [rdi + 1]
cmp esi, r10d
jge .LBB2_4 # <== This is Extra Jump
cmp r10d, dword ptr [r9 + 4*rdi + 4]
setl dl
movzx edx, dl
add eax, edx
.LBB2_4: # %._crit_edge.3
cmp rcx, r8
mov rdi, rcx
jb .LBB2_2
I've pointed out the extra jump in f1. And as we (hopefuly) know, conditional jumps in a tight loops are bad for performance. (See the performance guides in the x86 tag wiki for details.)
GCC and Visual Studio are aware that std::vector is well-behaved, and produce almost identical assembly for both snippets.
Edit. It turns out clang does better job optimizing the code. All three compilers can't prove that it is safe to read v[i + 1] prior to comparison in the second example (or choose not to), but only clang manages to optimize the first example with the additional information that reading v[i + 1] is either valid or UB.
A performance difference of 2% is negligible can be explained by different order or choice of some instructions.

Here's additional insight to expand on #deniss' answer, which correctly diagnosed the issue.
Incidentally, this is related to the most popular C++ Q&A of all time "Why is processing a sorted array faster than an unsorted array?".
The main issue is the compiler must honor the logical AND operator (&&) and not load from v[i+1] unless the first condition is true. This is a consequence of the semantics of the Logical AND operator as well as the tightened memory model semantics introduced with C++11, the relevant clauses in the draft of the standard are
5.14 Logical AND operator [expr.log.and]
Unlike &, && guarantees left-to-right evaluation: the second
operand is not evaluated if the first operand is false.ISO C++14 Standard (draft N3797)
and for speculative reads
1.10 Multi-threaded executions and data races [intro.multithread]
23 [ Note: Transformations that introduce a speculative read of a potentially shared memory location may not preserve the semantics of the C++ program as defined in this standard, since they potentially introduce a data race. However, they are typically valid in the context of an optimizing compiler that targets a specific machine with well-defined semantics for data races. They would be invalid for a hypothetical machine that is not tolerant of races or provides hardware race detection. — end note ]ISO C++14 Standard (draft N3797)
My guess is optimizers play it safe and currently choose not to issue speculative loads to potentially shared memory rather than special case for each target processor whether the speculative load could introduce a detectable data race for that target.
In order to implement this, the compiler generates a conditional branch. Usually this isn't noticeable because modern processors have very sophisticated branch prediction, and the misprediction rate is typically very low. However the data here is random - this kills branch prediction. The cost of a misprediction is 10 to 20 CPU cycles, considering that the CPU is typically retiring 2 instructions per cycle this is equivalent to 20 to 40 instructions. If the prediction rate is 50% (random) then every iteration has a mispredict penalty equivalent to 10 to 20 instructions - HUGE.
Note: The compiler could prove that elements v[0] to v[v.size()-2] will be referenced, in that order, regardless of the values they contain. This would allow the compiler in this case to generate code that unconditionally loads all but the last element of the vector. The last element of the vector, at v[v.size()-1], may only be loaded in the last iteration of the loop and only if the first condition is true.
The compiler could therefore generate code for the loop without the short circuit branch up until the last iteration, then use different code with the short circuit branch for the last iteration - that would require the compiler knowing that the data is random and branch prediction is useless and therefore that it is worth bothering with that - compilers aren't that sophisticated - yet.
To avoid the conditional branch generated by the Logical AND (&&) and avoid loading the memory locations into local variables we can change the Logical AND operator into a Bitwise AND, code snippet here, the result is almost 4x faster when the data is random
int f2()
{
int n = 0;
for (int i = 1; i < v.size()-1; ++i)
n += (v[i-1] < v[i]) & (v[i] < v[i+1]); // Bitwise AND
return n;
}
Output
3.642443ms min
3.779982ms median
Result: 166634
3.725968ms min
3.870808ms median
Result: 166634
1.052786ms min
1.081085ms median
Result: 166634
done
The result on gcc 5.3 is 8x faster (live in Coliru here)
g++ --version
g++ -std=c++14 -O3 -Wall -Wextra -pedantic -pthread -pedantic-errors main.cpp -lm && ./a.out
g++ (GCC) 5.3.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
3.761290ms min
4.025739ms median
Result: 166634
3.823133ms min
4.050742ms median
Result: 166634
0.459393ms min
0.505011ms median
Result: 166634
done
You might wonder how the compiler can evaluate the comparison v[i-1] < v[i] without generating a conditional branch. The answer depends on the target, for x86 this is possible because of the SETcc instruction, which generates a one byte result, 0 or 1, depending on a condition in the EFLAGS register, the same condition that could be used in a conditional branch, but without branching. In the generated code given by #deniss you can see setl generated, that sets the result to 1 if the condition "less than" is met, which is evaluated by the previous compare instruction:
cmp edx, edi ; a < b ?
setl r10b ; r10b = a < b ? 1 : 0
mov ecx, dword ptr [r8 + 4*rsi + 4] ; c = v[i+1]
lea rsi, [rsi + 1] ; ++i
cmp edi, ecx ; b < c ?
setl dl ; dl = b < c ? 1 : 0
and dl, r10b ; dl &= r10b
movzx edx, dl ; edx = zero extended dl
add eax, edx ; n += edx

f0 and f1 are semantically different.
x() && y() involves a short-circuit in the case of x() being false as we know. This means that if x() is false , then y() must not be evaluated.
This prevents prefetching of the data in order to evaluate y() and (at least on clang) is causing the insertion of a conditional jump, which is resulting in branch-predictor misses.
Adding another 2 tests proves the point.
#include <algorithm>
#include <chrono>
#include <random>
#include <iomanip>
#include <iostream>
#include <vector>
using namespace std;
using namespace std::chrono;
vector<int> v(1'000'000);
int f0()
{
int n = 0;
for (int i = 1; i < v.size()-1; ++i) {
int a = v[i-1];
int b = v[i];
int c = v[i+1];
if (a < b && b < c)
++n;
}
return n;
}
int f1()
{
int n = 0;
auto s = v.size() - 1;
for (size_t i = 1; i < s; ++i)
if (v[i-1] < v[i] && v[i] < v[i+1])
++n;
return n;
}
int f2()
{
int n = 0;
auto s = v.size() - 1;
for (size_t i = 1; i < s; ++i)
{
auto t1 = v[i-1] < v[i];
auto t2 = v[i] < v[i+1];
if (t1 && t2)
++n;
}
return n;
}
int f3()
{
int n = 0;
auto s = v.size() - 1;
for (size_t i = 1; i < s; ++i)
{
n += 1 * (v[i-1] < v[i]) * (v[i] < v[i+1]);
}
return n;
}
int main()
{
auto benchmark = [](int (*f)()) {
const int N = 100;
volatile long long result = 0;
vector<long long> timings(N);
for (int i = 0; i < N; ++i) {
auto t0 = high_resolution_clock::now();
result += f();
auto t1 = high_resolution_clock::now();
timings[i] = duration_cast<nanoseconds>(t1-t0).count();
}
sort(timings.begin(), timings.end());
cout << fixed << setprecision(6) << timings.front()/1'000'000.0 << "ms min\n";
cout << timings[timings.size()/2]/1'000'000.0 << "ms median\n" << "Result: " << result/N << "\n\n";
};
mt19937 generator (31415); // deterministic seed
uniform_int_distribution<> distribution(0, 1023);
for (auto& e: v)
e = distribution(generator);
benchmark(f0);
benchmark(f1);
benchmark(f2);
benchmark(f3);
cout << "\ndone\n";
return 0;
}
results (apple clang, -O2):
1.233948ms min
1.320545ms median
Result: 166850
3.366751ms min
3.493069ms median
Result: 166850
1.261948ms min
1.361748ms median
Result: 166850
1.251434ms min
1.353653ms median
Result: 166850

None of the answers so far have given a version of f() that gcc or clang can fully optimize. They all generate asm that does both compares each iteration. See the code with asm output on the Godbolt Compiler Explorer. (Important background knowledge for predicting performance from asm output: Agner Fog's microarchitecture guide, and other links on the x86 tag wiki. As always, it usually works best to profile with performance counters to find stalls.)
v[i-1] < v[i] is work we already did last iteration, when we evaluated v[i] < v[i+1]. In theory, helping the compiler grok that would let it optimize better (see f3()). In practice, that ends up defeating auto-vectorization in some cases, and gcc emits code with partial-register stalls, even with -mtune=core2 where that's a huge problem.
Manually hoisting the v.size() - 1 out of the loop's upper bound check seems to help. The OP's f0 and f1 don't actually re-compute v.size() from the start/end pointers in v, but somehow it still optimizes less well than when computing a size_t upper = v.size() - 1 outside the loop (f2() and f4()).
A separate issue is that using an int loop counter with a size_t upper bound means the loop is potentially infinite. I'm not sure how much impact this has on other optimizations.
Bottom line: compilers are complex beasts. Predicting which version will optimize well is not at all obvious or straightforward.
Results on 64bit Ubuntu 15.10, on Core2 E6600 (Merom/Conroe microarchitecture).
clang++-3.8 -O3 -march=core2 | g++ 5.2 -O3 -march=core2 | gcc 5.2 -O2 (default -mtune=generic)
f0 1.825ms min(1.858 med) | 5.008ms min(5.048 med) | 5.000 min(5.028 med)
f1 4.637ms min(4.673 med) | 4.899ms min(4.952 med) | 4.894 min(4.931 med)
f2 1.292ms min(1.323 med) | 1.058ms min(1.088 med) (autovec) | 4.888 min(4.912 med)
f3 1.082ms min(1.117 med) | 2.426ms min(2.458 med) | 2.420 min(2.465 med)
f4 1.291ms min(1.341 med) | 1.022ms min(1.052 med) (autovec) | 2.529 min(2.560 med)
Results would be different on Intel SnB-family hardware, esp. IvyBridge and later where there would be no partial register slowdowns at all. Core2 is limited by slow unaligned loads, and only one load per cycle. The loops may be small enough that decode isn't an issue, though.
f0 and f1:
gcc 5.2: The OP's f0 and f1 both make branchy loops, and won't auto-vectorize. f0 only uses one branch, though, and uses a weird setl sil / cmp sil, 1 / sbb eax, -1 to do the second half of the short-circuit compare. So it's still doing both comparisons on every iteration.
clang 3.8: f0: only one load per iteration, but does both compares and ands them together. f1: both compares each iteration, one with a branch to preserve the C semantics. Two loads per iteration.
int f2() {
int n = 0;
size_t upper = v.size()-1; // difference from f0: hoist upper bound and use size_t loop counter
for (size_t i = 1; i < upper; ++i) {
int a = v[i-1], b = v[i], c = v[i+1];
if (a < b && b < c)
++n;
}
return n;
}
gcc 5.2 -O3: auto-vectorizes, with three loads to get the three offset vectors needed to produce one vector of 4 compare results. Also, after combining the results from two pcmpgtd instructions, compares them against a vector of all-zero and then masks that. Zero is already the identity element for addition, so that's really silly.
clang 3.8 -O3: unrolls: every iteration does two loads, three cmp/setcc, two ands, and two adds.
int f4() {
int n = 0;
size_t upper = v.size()-1;
for (size_t i = 1; i < upper; ++i) {
int a = v[i-1], b = v[i], c = v[i+1];
bool ab_lt = a < b;
bool bc_lt = b < c;
n += (ab_lt & bc_lt); // some really minor code-gen differences from f2: auto-vectorizes to better code that runs slightly faster even for this large problem size
}
return n;
}
gcc 5.2 -O3: autovectorizes like f2, but without the extra pcmpeqd.
gcc 5.2 -O2: didn't investigate why this is twice as fast as f2.
clang -O3: about the same code as f2.
Attempt at compiler hand-holding
int f3() {
int n = 0;
int a = v[0], b = v[1]; // These happen before checking v.size, defeating the loop vectorizer or something
bool ab_lt = a < b;
size_t upper = v.size()-1;
for (size_t i = 1; i < upper; ++i) {
int c = v[i+1]; // only one load and compare inside the loop
bool bc_lt = b < c;
n += (ab_lt & bc_lt);
ab_lt = bc_lt;
a = b; // unused inside the loop, only the compare result is needed
b = c;
}
return n;
}
clang 3.8 -O3: Unrolls with 4 loads inside the loop (clang typically likes to unroll by 4 when there aren't complex loop-carried dependencies).
4 cmp/setcc, 4x and/movzx, 4x add. So clang did exactly what I was hoping, and made near-optimal scalar code. This was the fastest non-vectorized version, and (on core2 where movups unaligned loads are slow) is as fast as gcc's vectorized versions.
gcc 5.2 -O3: Fails to auto-vectorize. My theory on that is that accessing the array outside the loop confuses the auto-vectorizer. Maybe because we do it before checking v.size(), or maybe just in general.
Compiles to the scalar code we'd hope for, with one load, one cmp/setcc, and one and per iteration. But gcc creates a partial-register stall, even with -mtune=core2 where it's a huge problem (2 to 3 cycle stall to insert a merging uop when reading a wide reg after writing only part of it). (setcc is only available with an 8-bit operand size, which IMO is something AMD should have changed when they designed the AMD64 ISA.) It's the main reason why gcc's code runs 2.5x slower than clang's.
## the loop in f3(), from gcc 5.2 -O3 (same code with -O2)
.L31:
add rcx, 1 # i,
mov edi, DWORD PTR [r10+rcx*4] # a, MEM[base: _19, index: i_13, step: 4, offset: 0]
cmp edi, r8d # a, a # gcc's verbose-asm comments are a bit bogus here: one of these `a`s is from the last iteration, so this is really comparing c, b
mov r8d, edi # a, a
setg sil #, tmp124
and edx, esi # D.111089, tmp124 # PARTIAL-REG STALL: reading esi after writing sil
movzx edx, dl # using movzx to widen sil to esi would have solved the problem, instead of doing it after the and
add eax, edx # n, D.111085 # n += ...
cmp r9, rcx # upper, i
mov edx, esi # ab_lt, tmp124
jne .L31 #,
ret

Maintain x*x in C++

I have the following while-loop
uint32_t x = 0;
while(x*x < STOP_CONDITION) {
if(CHECK_CONDITION) x++
// Do other stuff that modifies CHECK_CONDITION
}
The STOP_CONDITION is constant at run-time, but not at compile time. Is there are more efficient way to maintain x*x or do I really need to recompute it every time?

Note: According to the benchmark below, this code runs about 1 -- 2% slower than this option. Please read the disclaimer included at the bottom!
In addition to Tamas Ionut's answer, if you want to maintain STOP_CONDITION as the actual stop condition and avoid the square root calculation, you could update the square using the mathematical identity
(x + 1)² = x² + 2x + 1
whenever you change x:
uint32_t x = 0;
unit32_t xSquare = 0;
while(xSquare < STOP_CONDITION) {
if(CHECK_CONDITION) {
xSquare += 2 * x + 1;
x++;
}
// Do other stuff that modifies CHECK_CONDITION
}
Since the 2*x + 1 is just a bit shift and an increment, the compiler should be able to optimize this fairly well.
Disclaimer: Since you asked "how can I optimize this code" I answered with one particular way to possibly make it faster. Whether the double + increment is actually faster than a single integer multiplication should be tested in practice. Whether you should optimize the code is a different question. I assume you have already benchmarked the loop and found it to be a bottleneck, or that you have a theoretical interest in the question. If you are writing production code that you wish to optimize, first measure the performance and then optimize where needed (which is probably not the x*x in this loop).

What about:
uint32_t x = 0;
double bound= sqrt(STOP_CONDITION);
while(x < bound) {
if(CHECK_CONDITION) x++
// Do other stuff that modifies CHECK_CONDITION
}
This way, you're getting rid of that extra computation.

I made a small benchmarking for Tamas Ionut and CompuChip answers and here are the results:
Tamas Ionut: 19.7068
The code of this method:
uint32_t x = 0;
double bound= sqrt(STOP_CONDITION);
while(x < bound) {
if(CHECK_CONDITION) x++
// Do other stuff that modifies CHECK_CONDITION
}
CompuChip: 20.2056
The code of this method:
uint32_t x = 0;
unit32_t xSquare = 0;
while(xSquare < STOP_CONDITION) {
if(CHECK_CONDITION) {
xSquare += 2 * x + 1;
x++;
}
// Do other stuff that modifies CHECK_CONDITION
}
with STOP_CONDITION = 1000000 and repeating the process 1000000 times
Environment:
Compiler : MSVC 2013
OS : Windows 8.1 - X64
Processor: Core i7-4510U
#2.00 GHZ
Release Mode - Maximize Speed (/O2)

I would say, optimization in readibility is better than optimization in Performance in your case since we are talking about a very small Performance optimization
The compliter can optimize a lot for you regarding Performance but readibility lies in the responsibility of the programmer

I believe Tamas Ionut solution is better than that of CompuChip because we only have x++ inside the for loop. However, a comparison between uint32_t and double will kill the deal. It would be more efficient if we use uint32_t for bound instead of using double. This approach has less problem with numerical overflow because x cannot be greater than 2^16 = 65536 if we want to have a correct x^2 value.
If we also do a heavy work in the loop then results obtained from both approach should be very similar, however, Tamas Ionut approach is more simple and easier to read.
Below is my code and the corresponding assembly code obtained using clang version 3.8.0 with -O3 flag. It is very clear from the assembly code that the first approach is more efficient.
using T = size_t;
void test1(const T stopCondition, bool checkCondition) {
T x = 0;
while (x < stopCondition) {
if (checkCondition) {
x++;
}
// Do something heavy here
}
}
void test2(const T stopCondition, bool checkCondition) {
T x = 0;
T xSquare = 0;
const T threshold = stopCondition * stopCondition;
while (xSquare < threshold) {
if (checkCondition) {
xSquare += 2 * x + 1;
x++;
}
// Do something heavy here
}
}
(gdb) disassemble test1
Dump of assembler code for function _Z5test1mb:
0x0000000000400be0 <+0>: movzbl %sil,%eax
0x0000000000400be4 <+4>: mov %rax,%rcx
0x0000000000400be7 <+7>: neg %rcx
0x0000000000400bea <+10>: nopw 0x0(%rax,%rax,1)
0x0000000000400bf0 <+16>: add %rax,%rcx
0x0000000000400bf3 <+19>: cmp %rdi,%rcx
0x0000000000400bf6 <+22>: jb 0x400bf0 <_Z5test1mb+16>
0x0000000000400bf8 <+24>: retq
End of assembler dump.
(gdb) disassemble test2
Dump of assembler code for function _Z5test2mb:
0x0000000000400c00 <+0>: imul %rdi,%rdi
0x0000000000400c04 <+4>: test %sil,%sil
0x0000000000400c07 <+7>: je 0x400c2e <_Z5test2mb+46>
0x0000000000400c09 <+9>: xor %eax,%eax
0x0000000000400c0b <+11>: mov $0x1,%ecx
0x0000000000400c10 <+16>: test %rdi,%rdi
0x0000000000400c13 <+19>: je 0x400c42 <_Z5test2mb+66>
0x0000000000400c15 <+21>: data32 nopw %cs:0x0(%rax,%rax,1)
0x0000000000400c20 <+32>: add %rcx,%rax
0x0000000000400c23 <+35>: add $0x2,%rcx
0x0000000000400c27 <+39>: cmp %rdi,%rax
0x0000000000400c2a <+42>: jb 0x400c20 <_Z5test2mb+32>
0x0000000000400c2c <+44>: jmp 0x400c42 <_Z5test2mb+66>
0x0000000000400c2e <+46>: test %rdi,%rdi
0x0000000000400c31 <+49>: je 0x400c42 <_Z5test2mb+66>
0x0000000000400c33 <+51>: data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
0x0000000000400c40 <+64>: jmp 0x400c40 <_Z5test2mb+64>
0x0000000000400c42 <+66>: retq
End of assembler dump.

Reverse "formula" based on a string

Is possible to make the reverse process of that?
string dir = textenc.Text;
uint EDI = 0x1505;
uint EDX = 0;
byte ECX = 0;
for ( int i = 0; i < dir.Length; i++ )
{
if ( dir[ i ] != '.' && dir[ i ] != '\\' )
{
EDX = EDI;
EDX = EDX << 5;
ECX = ( byte )dir[ i ];
EDX = EDX + EDI;
EDX = EDX + ECX;
EDI = EDX;
}
};
return EDI;
The dir is a string, for example, when dir is "data\font\tahoma.ttf" the output of that function would be: 2114405758.
Is there a way to retrieve the original string giving only the output number?

No.
The hash function ignores the characters . and \. You can add as many as these as you want, and it will still calculate the same value.
Note: As mentioned by others, this is a hash function, which will create infinitely many collisions.

The short answer is NO, it is not possible.
The long answer is that it will be nearly impossible to tell if the generated hash is unique to the input given.
The only way to know this is to generate hashes for all possible string combinations until you hit a duplicate; Or you don't hit a duplicate, but you will run out of memory before the later happens. There is also the case that you don't hit duplicates for a long time which can also mean that the hash function is quite good but still does not mean it is reversible.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Is this a compiler error in Visual Studio 2010? - c++

Related

why does visual studio have i++ post increment as a default for autocompleting for loop?

Why does this if-statement combining assignment and an equality check return true?

Why do C++ optimizers have problems with these temporary variables or rather why `v[]` should be avoided in tight loops?

Maintain x*x in C++

Reverse "formula" based on a string

Categories

Resources