Why does adding or removing the const modifier make a 4x efficiency difference? - c++

Why does adding or removing the const modifier make a 4x efficiency difference? This code needs about 16 seconds to finish on my PC. But if I make a small change, such as declaring mod as const int, moving the mod declaration into the body of main, or changing i to int, the execution time drops to 4 seconds. (I compiled this code with g++ using default parameters.)
Here is the assembly code for this program; the left part was generated with the non-const int mod declaration, the other with the const int mod declaration.
The big efficiency difference occurs only when I declare i as long long and the operator in the for loop is '%'. Otherwise the performance differs by only about 10%.
// const int mod = 1000000009;
int mod = 1000000009;
int main(){
    // int mod = 1000000009;
    int rel = 0;
    for(long long i=1e9; i<2*(1e9); i++){
        rel = i % mod;
    }
    return 0;
}

Because when you add const, the compiler turns mod into a constant and writes its value directly into the assembly code. When you do not add const, the value lives in memory and must be loaded into a register, so it has to be fetched every time it is used.

The generated assembly code differs in how the value of mod is obtained.
For example, this is what you get when using the Visual Studio 2013 compiler for x64 based processor:
For int mod = 1000000009:
mov eax,dword ptr ds:[xxxxxxxxh] ; xxxxxxxxh = &mod
cdq
push edx
push eax
For const int mod = 1000000009:
push 0
push 3B9ACA09h ; 3B9ACA09h = 1000000009

A const variable may or may not take space on the stack - that's up to the compiler. But in most cases, a const variable's usage will be replaced by its constant value. Consider:
const int size = 100;
int* pModify = (int*)&size;
*pModify = 200;
int Array[size];
When you use *pModify it will yield 200, but the size of the array will still be 100 elements (ignoring compiler extensions or newer features that allow variable-length arrays). That is because the compiler has replaced [size] with [100]. Wherever you use size, it will mostly just be 100.
In that loop, % mod is simply replaced with % 1000000009. That's one read-memory (load) instruction less, which is why it performs faster.
But, it must be noted that compilers act smart, very smart, so you cannot guess which optimization technique they might have applied. They might have removed the loop altogether (since it looks like a no-op to the compiler).
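In fact, once optimization is enabled the whole question becomes moot: since rel is never read, g++ -O2 will typically delete the loop entirely. A sketch of the kind of output you would then see (illustrative, not the asker's actual listing):
main:
    xor eax, eax    ; return 0 - rel and the loop were removed entirely
    ret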

Signed overflow in C++ and undefined behaviour (UB)

I'm wondering about the use of code like the following
int result = 0;
int factor = 1;
for (...) {
    result = ...
    factor *= 10;
}
return result;
If the loop is iterated over n times, then factor is multiplied by 10 exactly n times. However, factor is only ever used after having been multiplied by 10 a total of n-1 times. If we assume that factor never overflows except on the last iteration of the loop, but may overflow on the last iteration of the loop, then should such code be acceptable? In this case, the value of factor would provably never be used after the overflow has happened.
I'm having a debate on whether code like this should be accepted. It would be possible to put the multiplication inside an if-statement and just not do the multiplication on the last iteration of the loop when it could overflow. The downside is that it clutters the code and adds an unnecessary branch that would have to be checked on all the previous loop iterations. I could also run the loop one fewer time and replicate the loop body once after the loop, but again, this complicates the code.
The actual code in question is used in a tight inner-loop that consumes a large chunk of the total CPU time in a real-time graphics application.
Compilers do assume that a valid C++ program does not contain UB. Consider for example:
if (x == nullptr) {
    *x = 3;
} else {
    *x = 5;
}
If x == nullptr then dereferencing it and assigning a value is UB. Hence the only way this could be part of a valid program is if x == nullptr never yields true, and under the as-if rule the compiler can assume the above is equivalent to:
*x = 5;
Now in your code
int result = 0;
int factor = 1;
for (...) { // Loop until factor overflows but not more
    result = ...
    factor *= 10;
}
return result;
The last multiplication of factor cannot happen in a valid program (signed overflow is undefined). Hence the assignment to result cannot happen either. As there is no way to branch out before the last iteration, the previous iterations cannot happen either. Eventually, the part of the code that is correct (i.e., in which no undefined behaviour ever happens) is:
// nothing :(
The behaviour of int overflow is undefined.
It doesn't matter if you read factor outside the loop body; if it has overflowed by then then the behaviour of your code on, after, and somewhat paradoxically before the overflow is undefined.
One issue that might arise in keeping this code is that compilers are getting more and more aggressive when it comes to optimisation. In particular, they have developed a habit of assuming that undefined behaviour never happens. Under that assumption, they may remove the for loop altogether.
Can't you use an unsigned type for factor, although then you'd need to worry about unwanted conversion of int to unsigned in expressions containing both?
It might be insightful to consider real-world optimizers. Loop unrolling is a known technique. The basic idea of loop unrolling is that
for (int i = 0; i != 3; ++i)
    foo();
might be better implemented behind the scenes as
foo();
foo();
foo();
This is the easy case, with a fixed bound. But modern compilers can also do this for variable bounds:
for (int i = 0; i != N; ++i)
    foo();
becomes
__RELATIVE_JUMP(3-N)
foo();
foo();
foo();
Obviously this only works if the compiler knows that N<=3. And that's where we get back to the original question:
int result = 0;
int factor = 1;
for (...) {
    result = ...
    factor *= 10;
}
return result;
Because the compiler knows that signed overflow does not occur, it knows that the loop can execute a maximum of 9 times on 32-bit architectures: 10^10 > INT_MAX (about 2^31), while 10^9 fits. It can therefore do a 9-iteration loop unroll. But the intended maximum was 10 iterations!
What might happen is that you get a relative jump to an assembly instruction at offset (9-N), with N==10, so an offset of -1, which is the jump instruction itself. Oops. This is a perfectly valid loop optimization for well-defined C++, but the example given turns into a tight infinite loop.
Any signed integer overflow results in undefined behaviour, regardless of whether or not the overflowed value is or might be read.
Maybe in your use-case you can lift the first iteration out of the loop, turning this
int result = 0;
int factor = 1;
for (int n = 0; n < 10; ++n) {
    result += n + factor;
    factor *= 10;
}
// factor "is" 10^10 > INT_MAX, UB
into this
int factor = 1;
int result = 0 + factor; // first iteration
for (int n = 1; n < 10; ++n) {
    factor *= 10;
    result += n + factor;
}
// factor is 10^9 < INT_MAX
With optimization enabled, the compiler might unroll the second loop above into one conditional jump.
This is UB; in ISO C++ terms the behaviour of the entire program is completely unspecified for an execution that eventually hits UB. The classic example is that, as far as the C++ standard cares, it can make demons fly out of your nose. (I recommend against using an implementation where nasal demons are a real possibility). See other answers for more details.
Compilers can "cause trouble" at compile time for paths of execution they can see leading to compile-time-visible UB, e.g. assume those basic blocks are never reached.
See also What Every C Programmer Should Know About Undefined Behavior (LLVM blog). As explained there, signed-overflow UB lets compilers prove that for(... i <= n ...) loops are not infinite loops, even for unknown n. It also lets them "promote" int loop counters to pointer width instead of redoing sign-extension. (So the consequence of UB in that case could be accessing outside the low 64k or 4G elements of an array, if you were expecting signed wrapping of i into its value range.)
In some cases compilers will emit an illegal instruction like x86 ud2 for a block that provably causes UB if ever executed. (Note that a function might not ever be called, so compilers can't in general go berserk and break other functions, or even possible paths through a function that don't hit UB. i.e. the machine code it compiles to must still work for all inputs that don't lead to UB.)
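As a hedged illustration (my example, not from the original answer), a function whose every execution dereferences a null pointer is exactly this kind of block; clang at -O2 commonly compiles its body down to a ud2 trap, though the exact output varies by compiler and version:
int always_ub()
{
    int* p = nullptr;
    return *p;   // unconditional UB if this function is ever called
}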
Probably the most efficient solution is to manually peel the last iteration so the unneeded factor*=10 can be avoided.
int result = 0;
int factor = 1;
for (... i < n-1) {   // stop 1 iteration early
    result = ...
    factor *= 10;
}
result = ...          // another copy of the loop body, using the last factor
// factor *= 10;      // and optimize away this dead operation.
return result;
Or if the loop body is large, consider simply using an unsigned type for factor. Then you can let the unsigned multiply overflow and it will just do well-defined wrapping to some power of 2 (the number of value bits in the unsigned type).
This is fine even if you use it with signed types, especially if your unsigned->signed conversion never overflows.
Conversion between unsigned and 2's complement signed is free (same bit-pattern for all values); the modulo wrapping for int -> unsigned specified by the C++ standard simplifies to just using the same bit-pattern, unlike for one's complement or sign/magnitude.
And unsigned->signed is similarly trivial, although it is implementation-defined for values larger than INT_MAX. If you aren't using the huge unsigned result from the last iteration, you have nothing to worry about. But if you are, see Is conversion from unsigned to signed undefined?. The value-doesn't-fit case is implementation-defined, which means that an implementation must pick some behaviour; sane ones just truncate (if necessary) the unsigned bit pattern and use it as signed, because that works for in-range values the same way with no extra work. And it's definitely not UB. So big unsigned values can become negative signed integers. e.g. after int x = u; gcc and clang don't optimize away x>=0 as always being true, even without -fwrapv, because they defined the behaviour.
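Putting the unsigned-factor idea into code, here is a minimal sketch reusing the 10-iteration example from earlier (the loop body is illustrative, not the asker's real code):
int compute()
{
    int result = 0;
    unsigned factor = 1;                         // unsigned overflow wraps: no UB
    for (int n = 0; n < 10; ++n) {
        result += n + static_cast<int>(factor);  // factor <= 10^9 here, fits in int
        factor *= 10u;   // wraps modulo 2^32 on the final iteration; harmless,
                         // because factor is never read after that
    }
    return result;
}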
If you can tolerate a few additional assembly instructions in the loop, instead of
int factor = 1;
for (int j = 0; j < n; ++j) {
    ...
    factor *= 10;
}
you can write:
int factor = 0;
for (...) {
    factor = 10 * factor + !factor;
    ...
}
to avoid the last multiplication. !factor will not introduce a branch:
xor ebx, ebx
L1:
xor eax, eax
test ebx, ebx
lea edx, [rbx+rbx*4]
sete al
add ebp, 1
lea ebx, [rax+rdx*2]
mov edi, ebx
call consume(int)
cmp r12d, ebp
jne .L1
This code
int factor = 0;
for (...) {
    factor = factor ? 10 * factor : 1;
    ...
}
also results in branchless assembly after optimization:
mov ebx, 1
jmp .L1
.L2:
lea ebx, [rbx+rbx*4]
add ebx, ebx
.L1:
mov edi, ebx
add ebp, 1
call consume(int)
cmp r12d, ebp
jne .L2
(Compiled with GCC 8.3.0 -O3)
You didn't show what's in the parentheses of the for statement, but I'm going to assume it's something like this:
for (int n = 0; n < 10; ++n) {
    result = ...
    factor *= 10;
}
You can simply move the counter increment and loop termination check into the body:
for (int n = 0; ; ) {
    result = ...
    if (++n >= 10) break;
    factor *= 10;
}
The number of assembly instructions in the loop will remain the same.
Inspired by Andrei Alexandrescu's presentation "Speed Is Found In The Minds of People".
Consider the function:
unsigned mul_mod_65536(unsigned short a, unsigned short b)
{
    return (a*b) & 0xFFFFu;
}
According to the published Rationale, the authors of the Standard would have expected that if this function were invoked on (e.g.) a commonplace 32-bit computer with arguments of 0xC000 and 0xC000, promoting the operands of * to signed int would cause the computation to yield -0x70000000, which when converted to unsigned would yield 0x90000000u - the same answer as if they had made unsigned short promote to unsigned. Nonetheless, gcc will sometimes optimize that function in ways that would behave nonsensically if an overflow occurs. Any code where some combination of inputs could cause an overflow must be processed with the -fwrapv option unless it would be acceptable to allow creators of deliberately-malformed input to execute arbitrary code of their choosing.
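An alternative to compiling with -fwrapv (my suggestion, not part of the original answer) is to force the multiplication into unsigned arithmetic, where wrapping is well-defined, so the integer promotions can never produce a signed overflow:
unsigned mul_mod_65536(unsigned short a, unsigned short b)
{
    // Promote explicitly to unsigned before multiplying; unsigned
    // multiplication wraps modulo 2^32 instead of being UB.
    return (static_cast<unsigned>(a) * static_cast<unsigned>(b)) & 0xFFFFu;
}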
Why not this:
int result = 0;
int factor = 10;
for (...) {
    factor *= 10;
    result = ...
}
return result;
There are many different faces of Undefined Behavior, and what's acceptable depends on the usage.
tight inner-loop that consumes a large chunk of the total CPU time in a real-time graphics application
That, by itself, is a bit of an unusual thing, but be that as it may... if this is indeed the case, then the UB is most probably within the realm "allowable, acceptable". Graphics programming is notorious for hacks and ugly stuff. As long as it "works" and it doesn't take longer than 16.6ms to produce a frame, usually, nobody cares. But still, be aware of what it means to invoke UB.
First, there is the standard. From that point of view, there's nothing to discuss and no way to justify it: your code is simply invalid. There are no ifs or whens; it just isn't valid code. You might as well shrug that off and do it anyway, and 95-99% of the time you'll be good to go regardless.
Next, there's the hardware side. There are some uncommon, weird architectures where this is a problem. I'm saying "uncommon, weird" because on the one architecture that makes up 80% of all computers (or the two architectures that together make up 95% of all computers) overflow is a "yeah, whatever, don't care" thing on the hardware level. You sure do get a garbage (although still predictable) result, but no evil things happen.
That is not the case on every architecture: you might very well get a trap on overflow (though, seeing how you speak of a graphics application, the chances of being on such an odd architecture are rather small). Is portability an issue? If it is, you may want to abstain.
Last, there is the compiler/optimizer side. One reason why overflow is undefined is that simply leaving it at that was the easiest way to cope with the hardware once upon a time. But another reason is that, e.g., x+1 is guaranteed to always be larger than x, and the compiler/optimizer can exploit this knowledge. Now, for the previously mentioned case, compilers are indeed known to act this way and simply strip out complete blocks (there was a Linux exploit some years ago which was based on the compiler having dead-stripped some validation code because of exactly this).
For your case, I would seriously doubt that the compiler does some special, odd optimization. However, what do you know, what do I know. When in doubt, try it out. If it works, you are good to go.
(And finally, there's of course code audit, you might have to waste your time discussing this with an auditor if you're unlucky.)

const vs non-const variable with no change in value once assigned

In C++, if the value of a variable never changes once assigned in the whole program, versus making that variable const: in which case is the executable code faster?
How does the compiler optimize the executable code in case 1?
A clever compiler can understand that the value of a variable is never changed, thus optimizing the related code, even without the explicit const keyword by the programmer.
As for your second question: when you mark a variable as const, the following might happen: the compiler "can optimize away this const by not providing storage for this variable, rather adding it to the symbol table. So a subsequent read just needs indirection into the symbol table rather than instructions to fetch the value from memory". Read more in What kind of optimization does const offer in C/C++? (if any).
I said might, because const does not guarantee a constant expression; for that you can use constexpr instead, as I explain below.
In general, you should think about safer code rather than faster code when it comes to using the const keyword. So unless you do it for safer and more readable code, you are likely a victim of premature optimization.
Bonus:
C++ offers the constexpr keyword, which allows the programmer to mark a variable as what the Standard calls constant expressions. A constant expression is more than merely constant.
Read more in Difference between `constexpr` and `const` and When should you use constexpr capability in C++11?
PS: Constness prevents moving, so using const too liberally may make your code slower.
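A quick illustration of that point (my example, not the answerer's): std::move on a const object silently selects the copy constructor, because a const rvalue cannot bind to the move constructor's non-const parameter.
#include <string>
#include <utility>
#include <vector>

int main()
{
    std::vector<std::string> v;
    std::string s(1000, 'a');
    v.push_back(std::move(s));   // moves: cheap pointer steal

    const std::string cs(1000, 'b');
    v.push_back(std::move(cs));  // std::move yields const std::string&&, which
                                 // binds to the copy constructor: a full copy
}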
In which case executable code is faster?
The code is faster when using const, because the compiler has more room for optimization. Consider this snippet:
int c = 5;
[...]
int x = c + 5;
If c is constant, it will simply assign 10 to x. If c is not constant, it depends on the compiler whether it can deduce from the code that c is de facto constant.
How compiler optimize executable code in case 1?
The compiler has a harder time optimizing the code if the variable is not constant. The broader the scope of the variable, the harder it is for the compiler to make sure the variable is not changing.
For simple cases, like local variables, the compiler with basic optimizations will be able to deduce that the variable is a constant, and so treat it like one.
if (...) {
    int c = 5;
    [...]
    int x = c + 5;
}
For broader scopes, like global variables, external variables, etc., if the compiler is not able to analyze the whole scope, it will treat the variable like a normal one, i.e. allocate some space, generate load and store operations, etc.
file1.c
int c = 5;
file2.c
extern int c;
[...]
int x = c + 5;
There are more aggressive optimization options, like link time optimizations, which might help in such cases. But still, performance-wise, the const keyword helps, especially for variables with wide scopes.
EDIT:
Simple example
File const.C:
const int c = 5;
volatile int x;

int main(int argc, char **argv)
{
    x = c + 5;
}
Compilation:
$ g++ const.C -O3 -g
Disassembly:
5 {
6 x = c + 5;
0x00000000004003e0 <+0>: movl $0xa,0x200c4a(%rip) # 0x601034 <x>
7 }
So we just move 10 (0xa) to x.
File nonconst.C:
int c = 5;
volatile int x;

int main(int argc, char **argv)
{
    x = c + 5;
}
Compilation:
$ g++ nonconst.C -O3 -g
Disassembly:
5 {
6 x = c + 5;
0x00000000004003e0 <+0>: mov 0x200c4a(%rip),%eax # 0x601030 <c>
0x00000000004003e6 <+6>: add $0x5,%eax
0x00000000004003e9 <+9>: mov %eax,0x200c49(%rip) # 0x601038 <x>
7 }
We load c, add 5 and store to x.
So as you can see even with quite aggressive optimization (-O3) and the shortest program you can write, the effect of const is quite obvious.
g++ version 5.4.1

How to write Linux C++ debug information when performance is critical?

I'm trying to debug a rather large program with many variables. The code is set up in this way:
while (condition1) {
    // non timing sensitive code
    while (condition2) {
        // timing sensitive code
        // many variables that change each iteration
    }
}
I have many variables on the inner loop that I want to save for viewing. I want to write them to a text file each outer loop iteration. The inner loop executes a different number of times each iteration. It can be just 2 or 3, or it can be several thousands.
I need to see all the variables values from each inner iteration, but I need to keep the inner loop as fast as possible.
Originally, I tried just storing each data variable in its own vector where I just appended a value at each inner loop iteration. Then, when the outer loop iteration came, I would read from the vectors and write the data to a debug file. This quickly got out of hand as variables were added.
I thought about using a string buffer to store the information, but I'm not sure if this is the fastest way given strings would need to be created multiple times within the loop. Also, since I don't know the number of iterations, I'm not sure how large the buffer would grow.
The information stored would be in a format such as:
"Var x: 10\n
Var y: 20\n
.
.
.
Other Text: Stuff\n"
So, is there a cleaner option for writing large amounts of debug data quickly?
If it's really time-sensitive, then don't format strings inside the critical loop.
I'd go for appending records to a log buffer of binary records inside the critical loop. The outer loop can either write that directly to a binary file (which can be processed later), or format text based on the records.
This has the advantage that the loop only needs to track a couple extra variables (pointers to the end of used and allocated space of one std::vector), rather than two pointers for a std::vector for every variable being logged. This will have much lower impact on register allocation in the critical loop.
In my testing, it looks like you just get a bit of extra loop overhead to track the vector, and a store instruction for every variable you want to log. I didn't write a big enough test loop to expose any potential problems from keeping all the variables "alive" until the emplace_back(). If the compiler does a bad job with bigger loops where it needs to spill registers, see the section below about using a simple array without any size checking. That should remove any constraint on the compiler that makes it try to do all the stores into the log buffer at the same time.
Here's an example of what I'm suggesting. It compiles and runs, writing a binary log file which you can hexdump.
See the source and asm output with nice formatting on the Godbolt compiler explorer. It can even colourize source and asm lines so you can more easily see which asm comes from which source line.
#include <vector>
#include <cstdint>
#include <cstddef>
#include <iostream>
struct loop_log {
    // Generally sort in order of size for better packing.
    // Use as narrow types as possible to reduce memory bandwidth.
    // e.g. logging an int loop counter into a short log record is fine if
    // you're sure it always (in practice) fits in a short, and it has zero
    // performance downside.
    int64_t x, y, z;
    uint64_t ux, uy, uz;
    int32_t a, b, c;
    uint16_t t, i, j;
    uint8_t c1, c2, c3;

    // isn't there a less-repetitive way to write this?
    loop_log(int64_t x, int32_t a, int outer_counter, char c1)
        : x(x), a(a), i(outer_counter), c1(c1)
        // leaves other members *uninitialized*, not zeroed.
        // note lack of gcc warning for initializing uint16_t i from an int
        // and for not mentioning every member
    {}
};
static constexpr size_t initial_reserve = 10000;
// take some args so gcc can't count the iterations at compile time
void foo(std::ostream &logfile, int outer_iterations, int inner_param) {
    std::vector<struct loop_log> log;
    log.reserve(initial_reserve);

    int outer_counter = outer_iterations;
    while (--outer_counter) {
        // non timing sensitive code
        int32_t a = inner_param - outer_counter;
        while (a != 0) {
            // timing sensitive code
            a <<= 1;
            int64_t x = outer_counter * (100LL + a);
            char c1 = x;
            // much more efficient code with gcc 5.3 -O3 than push_back( a struct literal );
            log.emplace_back(x, a, outer_counter, c1);
        }

        const auto logdata = log.data();
        const size_t bytes = log.size() * sizeof(*logdata);
        // write group size, then a group of records
        logfile.write( reinterpret_cast<const char *>(&bytes), sizeof(bytes) );
        logfile.write( reinterpret_cast<const char *>(logdata), bytes );
        // you could format the records into strings at this point if you want
        log.clear();
    }
}
#include <fstream>
int main() {
    std::ofstream logfile("dbg.log");
    foo(logfile, 100, 10);
}
gcc's output for foo() pretty much optimizes away all the vector overhead. As long as the initial reserve() is big enough, the inner loop is just:
## gcc 5.3 -masm=intel -O3 -march=haswell -std=gnu++11 -fverbose-asm
## The inner loop from the above C++:
.L59:
test rbx, rbx # log // IDK why gcc wants to check for a NULL pointer inside the hot loop, instead of doing it once after reserve() calls new()
je .L6 #,
mov QWORD PTR [rbx], rbp # log_53->x, x // emplace_back the 4 elements
mov DWORD PTR [rbx+48], r12d # log_53->a, a
mov WORD PTR [rbx+62], r15w # log_53->i, outer_counter
mov BYTE PTR [rbx+66], bpl # log_53->c1, x
.L6:
add rbx, 72 # log, // struct size is 72B
mov r8, r13 # D.44727, log
test r12d, r12d # a
je .L58 #, // a != 0
.L4:
add r12d, r12d # a // a <<= 1
movsx rbp, r12d # D.44726, a // x = ...
add rbp, 100 # D.44726, // x = ...
imul rbp, QWORD PTR [rsp+8] # x, %sfp // x = ...
cmp r14, rbx # log$D40277$_M_impl$_M_end_of_storage, log
jne .L59 #, // stay in this tight loop as long as we don't run out of reserved space in the vector
// fall through into code that allocates more space and copies.
// gcc generates pretty lame copy code, using 8B integer loads/stores, not rep movsq. Clang uses AVX to copy 32B at a time
// anyway, that code never runs as long as the reserve is big enough
// I guess std::vector doesn't try to realloc() to avoid the copy if possible (e.g. if the following virtual address region is unused) :/
An attempt to avoid repetitive constructor code:
I tried a version that uses a braced initializer list to avoid having to write a really repetitive constructor, but got much worse code from gcc:
#ifdef USE_CONSTRUCTOR
    // much more efficient code with gcc 5.3 -O3.
    log.emplace_back(x, a, outer_counter, c1);
#else
    // Put the mapping from local var names to struct member names right here in with the loop
    log.push_back( (struct loop_log) {
        .x = x, .y =0, .z=0,  // C99 designated-initializers are a GNU extension to C++,
        .ux=0, .uy=0, .uz=0,  // but gcc doesn't support leaving uninitialized elements
        .a = a, .b=0, .c=0,   // before the last initialized one: without all the ...=0, you get
                              // "sorry, unimplemented: non-trivial designated initializers not supported"
        .t=0, .i = outer_counter, .j=0,
        .c1 = (uint8_t)c1
    } );
#endif
This unfortunately stores a struct onto the stack and then copies it 8B at a time with code like:
mov rax, QWORD PTR [rsp+72]
mov QWORD PTR [rdx+8], rax // rdx points into the vector's buffer
mov rax, QWORD PTR [rsp+80]
mov QWORD PTR [rdx+16], rax
... // total of 9 loads/stores for a 72B struct
So it will have more impact on the inner loop.
There are a few ways to push_back() a struct into a vector, but using a braced-initializer-list unfortunately seems to always result in a copy that doesn't get optimized away by gcc 5.3. It would be nice to avoid writing a lot of repetitive code for a constructor. And with designated initializer lists ({.x = val}), the code inside the loop wouldn't have to care much about what order the struct actually stores things. You could just write them in easy-to-read order.
BTW, .x= val C99 designated-initializer syntax is a GNU extension to C++. Also, you can get warnings for forgetting to initialize a member in a braced-list with gcc's -Wextra (which enables -Wmissing-field-initializers).
For more on syntax for initializers, have a look at Brace-enclosed initializer list constructor and the docs for member initialization.
This was a fun but terrible idea:
// Doesn't compile. Worse: hard to read, probably easy to screw up
while (outerloop) {
    int64_t x=0, y=1;
    struct loop_log {int64_t logx=x, logy=y;};  // loop vars as default initializers
    // error: default initializers can't be local vars with automatic storage.
    while (innerloop) { x+=y; y+=x; log.emplace_back(loop_log()); }
}
Lower overhead from using a flat array instead of a std::vector
Perhaps trying to get the compiler to optimize away any kind of std::vector operation is less good than just making a big array of structs (static, local, or dynamic) and keeping a count yourself of how many records are valid. std::vector checks to see if you've used up the reserved space on every iteration, but you don't need anything like that if there is a fixed upper bound you can use to allocate enough space to never overflow. (Depending on the platform and how you allocate the space, a big chunk of memory that's allocated but never written isn't really a problem. e.g. on Linux, malloc uses mmap(MAP_ANONYMOUS) for big allocations, and that gives you pages that are all copy-on-write mapped to a zeroed physical page. The OS doesn't need to allocate physical pages until you write them. The same should apply to a large static array.)
So in your loop, you could just have code like
loop_log *current_record = logbuf;
while(inner_loop) {
    int64_t x = ...;
    current_record->x = x;
    ...
    current_record->i = (short)outer_counter;
    ...
    // or maybe
    // *current_record = { .x = x, .i = (short)outer_counter };
    // compilers will probably have an easier time avoiding any copying
    // with a braced initializer list in this case than with vector.push_back
    current_record++;
}

size_t record_bytes = (current_record - logbuf) * sizeof(logbuf[0]);
// or size_t record_bytes = reinterpret_cast<char*>(current_record) - reinterpret_cast<char*>(logbuf);
logfile.write((const char*)logbuf, record_bytes);
Scattering the stores throughout the inner loop will require the array pointer to be live all the time, but OTOH doesn't require all the loop variables to be live at the same time. IDK if gcc would optimize an emplace_back to store each variable into the vector once the variable was no longer needed, or if it might spill variables to the stack and then copy them all into the vector in one group of instructions.
Using logbuf[records++].x = ... might lead to the compiler keeping the array base and the counter tied up in two registers, since we'd use the record count in the outer loop. We want the inner loop to be fast, and can take the time to do the subtraction in the outer loop, so I wrote it with pointer increments to encourage the compiler to use only one register for that piece of state. Besides register pressure, base+index store instructions are less efficient on Intel SnB-family hardware than single-register addressing modes.
You could still use a std::vector for this, but it's hard to get std::vector not to write zeroes into memory it allocates. reserve() just allocates without zeroing, but calling .data() and using the reserved space without telling the vector about it with .resize() kind of defeats the purpose. And of course .resize() will value-initialize all the new elements. So std::vector is a bad choice for getting your hands on a large allocation without dirtying it.
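A minimal sketch of that flat-array alternative (my illustration; it assumes a trivially-constructible variant of loop_log, i.e. one without the user-defined constructor above, and a caller-chosen upper bound max_records):
#include <memory>

constexpr size_t max_records = 1 << 20;   // fixed upper bound, chosen by you
// new[] default-initializes, so trivial members stay uninitialized
// (no page-dirtying zeroing), unlike std::vector's value-initialization.
std::unique_ptr<loop_log[]> storage(new loop_log[max_records]);
loop_log *logbuf = storage.get();         // pass this to the logging loop above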
It sounds like what you really want is to look at your program from within a debugger. You haven't specified a platform, but if you build with debug information (-g using gcc or clang) you should be able to step through the loop when starting the program from within the debugger (gdb on Linux). Assuming you are on Linux, tell it to break at the beginning of the function (break <function name>) and then run. If you tell the debugger to display all the variables you want to see after each step or breakpoint hit, you'll get to the bottom of your problem in no time.
Regarding performance: unless you do something fancy like set conditional breakpoints or watch memory, running the program through the debugger will not dramatically affect perf as long as the program is not stopped. You may need to turn down the optimization level to get meaningful information though.

optimization of access to members in c++

I'm running into an inconsistent optimization behavior with different compilers for the following code:
class tester
{
public:
    tester(int* arr_, int sz_)
        : arr(arr_), sz(sz_)
    {}

    int doadd()
    {
        sm = 0;
        for (int n = 0; n < 1000; ++n)
        {
            for (int i = 0; i < sz; ++i)
            {
                sm += arr[i];
            }
        }
        return sm;
    }
protected:
    int* arr;
    int sz;
    int sm;
};
The doadd function simulates some intensive access to members (ignore the overflows in addition for this question). Compared with similar code implemented as a function:
int arradd(int* arr, int sz)
{
    int sm = 0;
    for (int n = 0; n < 1000; ++n)
    {
        for (int i = 0; i < sz; ++i)
        {
            sm += arr[i];
        }
    }
    return sm;
}
The doadd method runs about 1.5 times slower than the arradd function when compiled in Release mode with Visual C++ 2008. When I modify the doadd method to be as follows (aliasing all members with locals):
int doadd()
{
    int mysm = 0;
    int* myarr = arr;
    int mysz = sz;
    for (int n = 0; n < 1000; ++n)
    {
        for (int i = 0; i < mysz; ++i)
        {
            mysm += myarr[i];
        }
    }
    sm = mysm;
    return sm;
}
Runtimes become roughly the same. Am I right in concluding that this is a missing optimization by the Visual C++ compiler? g++ seems to do it better and run both the member function and the normal function at the same speed when compiling with -O2 or -O3.
The benchmarking is done by invoking the doadd member and the arradd function on some sufficiently large array (a few million integers).
EDIT: Some fine-grained testing shows that the main culprit is the sm member. Replacing all others by local versions still makes the runtime long, but once I replace sm by mysm the runtime becomes equal to the function version.
Resolution
Disappointed with the answers (sorry guys), I shook off my laziness and dove into the disassembly listings for this code. My answer below summarizes the findings. In short: it has nothing to do with aliasing; it has everything to do with loop unrolling, and with some strange heuristics MSVC applies when deciding which loops to unroll.
It may be an aliasing issue - the compiler cannot know that the instance variable sm will never be pointed at by arr, so it has to treat sm as if it were effectively volatile, and store it on every iteration. You could make sm a different type to test this hypothesis. Or just use a temporary local sum (which will get cached in a register) and assign it to sm at the end.
MSVC is correct, in that it is the only one that, given the code we've seen, is guaranteed to work correctly. GCC employs optimizations that are probably safe in this specific instance, but that can only be verified by seeing more of the program.
Because sm is not a local variable, MSVC apparently assumes that it might alias arr.
That's a fairly reasonable assumption: because arr is protected, a derived class might set it to point to sm, so arr could alias sm.
GCC sees that it doesn't actually alias arr, and so it doesn't write sm back to memory until after the loop, which is much faster.
It's certainly possible to instantiate the class so that arr points to sm, which MSVC would handle, but GCC wouldn't.
Assuming that sz > 1, GCC's optimization is permissible in general.
Because the function loops over arr, treating it as an array of sz elements, having arr point at the single int sm when sz > 1 would yield undefined behavior (the loop would read past it), so GCC can safely assume that they don't alias. But if sz == 1, or if the compiler can't be sure what sz's value might be, then it runs the risk that sz might be 1, in which case arr and sm could alias perfectly legally, and GCC's code would break.
So most likely, GCC simply gets away with it by inlining the whole thing, and seeing that in this case, they don't alias.
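To make the aliasing scenario concrete, here is a hedged sketch (my example, reusing the names from the question) of how a derived class could legally make arr point at sm, since both are protected members of tester:
struct evil : tester {
    evil() : tester(nullptr, 1) { arr = &sm; }  // arr[0] is now sm itself
};
// With sz == 1, doadd()'s "sm += arr[i]" reads sm through arr, so a
// compiler cannot blindly cache sm in a register unless it can prove
// that no caller sets up this kind of aliasing.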
I disassembled the code with MSVC to better understand what's going on. Turns out aliasing wasn't a problem at all, and neither was some kind of paranoid thread safety.
Here is the interesting part of the arradd function disassembled:
for (int n = 0; n < 10; ++n)
{
for (int i = 0; i < sz; ++i)
013C101C mov ecx,ebp
013C101E mov ebx,29B9270h
{
sm += arr[i];
013C1023 add eax,dword ptr [ecx-8]
013C1026 add edx,dword ptr [ecx-4]
013C1029 add esi,dword ptr [ecx]
013C102B add edi,dword ptr [ecx+4]
013C102E add ecx,10h
013C1031 sub ebx,1
013C1034 jne arradd+23h (13C1023h)
013C1036 add edi,esi
013C1038 add edi,edx
013C103A add eax,edi
013C103C sub dword ptr [esp+10h],1
013C1041 jne arradd+16h (13C1016h)
013C1043 pop edi
013C1044 pop esi
013C1045 pop ebp
013C1046 pop ebx
ecx points to the array, and we can see that the inner loop is unrolled x4 here - note the four consecutive add instructions at successive addresses, and ecx being advanced by 16 bytes (4 ints) at a time inside the loop.
For the unoptimized version of the member function, doadd:
int tester::doadd()
{
    sm = 0;
    for (int n = 0; n < 10; ++n)
    {
        for (int i = 0; i < sz; ++i)
        {
            sm += arr[i];
        }
    }
    return sm;
}
The disassembly is (it's harder to find since the compiler inlined it into main):
int tr_result = tr.doadd();
013C114A xor edi,edi
013C114C lea ecx,[edi+0Ah]
013C114F nop
013C1150 xor eax,eax
013C1152 add edi,dword ptr [esi+eax*4]
013C1155 inc eax
013C1156 cmp eax,0A6E49C0h
013C115B jl main+102h (13C1152h)
013C115D sub ecx,1
013C1160 jne main+100h (13C1150h)
Note 2 things:
The sum is stored in a register - edi. Hence, no aliasing "care" is taken here. The value of sm isn't re-read all the time; edi is initialized just once and then used as a temporary. You don't see it returned, since the compiler optimized that away and used edi directly as the return value of the inlined code.
The loop is not unrolled. Why? No good reason.
Finally, here's an "optimized" version of the member function, with mysm keeping the sum local manually:
int tester::doadd_opt()
{
    sm = 0;
    int mysm = 0;
    for (int n = 0; n < 10; ++n)
    {
        for (int i = 0; i < sz; ++i)
        {
            mysm += arr[i];
        }
    }
    sm = mysm;
    return sm;
}
The (again, inlined) disassembly is:
int tr_result_opt = tr_opt.doadd_opt();
013C11F6 xor edi,edi
013C11F8 lea ebp,[edi+0Ah]
013C11FB jmp main+1B0h (13C1200h)
013C11FD lea ecx,[ecx]
013C1200 xor ecx,ecx
013C1202 xor edx,edx
013C1204 xor eax,eax
013C1206 add ecx,dword ptr [esi+eax*4]
013C1209 add edx,dword ptr [esi+eax*4+4]
013C120D add eax,2
013C1210 cmp eax,0A6E49BFh
013C1215 jl main+1B6h (13C1206h)
013C1217 cmp eax,0A6E49C0h
013C121C jge main+1D1h (13C1221h)
013C121E add edi,dword ptr [esi+eax*4]
013C1221 add ecx,edx
013C1223 add edi,ecx
013C1225 sub ebp,1
013C1228 jne main+1B0h (13C1200h)
The loop here is unrolled, but just x2.
This explains my speed-difference observations quite well. For a 175e6 array, the function runs ~1.2 secs, the unoptimized member ~1.5 secs, and the optimized member ~1.3 secs. (Note that this may differ for you, on another machine I got closer runtimes for all 3 versions).
What about gcc? When compiled with it, all 3 versions ran at ~1.5 secs. Suspecting the lack of unrolling I looked at gcc's disassembly and indeed: gcc doesn't unroll any of the versions.
As Paul wrote, it is probably because the sm member is really updated in "real" memory every time, while the local sum in the function can be accumulated in a register (after compiler optimization).
You can get similar issues when passing in pointer arguments. If you like getting your hands dirty, you may find the restrict keyword useful in future.
http://developers.sun.com/solaris/articles/cc_restrict.html
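To illustrate, here is a hedged sketch of what that looks like (restrict is standard C99; in C++ it is a common extension, spelled __restrict__ in GCC/Clang and __restrict in MSVC):
void arradd_restrict(int* __restrict__ arr, int sz, int* __restrict__ sum)
{
    // The __restrict__ qualifiers promise the compiler that arr and sum
    // never alias, so *sum can be kept in a register for the whole loop.
    *sum = 0;
    for (int i = 0; i < sz; ++i)
        *sum += arr[i];
}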
This isn't really the same code at all. If you put the sm, arr and sz variables inside the class instead of making them local, the compiler can't (easily) guess that some other class won't inherit from the tester class and access these members, doing something like arr = &sm; doadd();. Hence, access to these variables can't be optimized away as it can when they are local to the function.
In the end the reason is basically the one Paul pointed out: sm is updated in real memory every time when it is a class member, but can be stored in a register when it is local to a function. The memory reads from the additions shouldn't change the resulting time much, as memory must be read anyway to get the values.
In this case, if tester is not exported to another module, is not aliased even indirectly by something exported, and there is no aliasing like the above, the compiler could optimize away the intermediate writes to sm. Some compilers, like gcc, seem to optimize aggressively enough to detect the above cases (would they also do so if the tester class were exported?). But these are really hard guesses. There are still much simpler optimizations that are not yet performed by compilers (like inlining across different compilation units).
The key is probably that doadd is written like this if you make the member accesses explicit with this:
int doadd()
{
    this->sm = 0;
    for (int n = 0; n < 1000; ++n)
    {
        for (int i = 0; i < this->sz; ++i)
        {
            this->sm += this->arr[i];
        }
    }
    return this->sm;
}
Therein lies the problem: all class members are accessed via the this pointer, whereas arradd has all variables on the stack. To speed it up, you have found that by moving all members onto the stack as local variables, the speed matches arradd. So this indicates that the this indirection is responsible for the performance loss.
Why might that be? As I understand it this is usually stored in a register so I don't think it's ultimately any slower than just accessing the stack (which is an offset in to the stack pointer as well). As other answers point out, it's probably the aliasing problem that generates less optimal code: the compiler can't tell if any of the memory addresses overlap. Updating sm could also in theory change the content of arr, so it decides to write out the value of sm to main memory every time rather than tracking it in a register. When variables are on the stack, the compiler can assume they're all at different memory addresses. The compiler doesn't see the program as clearly as you do: it can tell what's on the stack (because you declared it like that), but everything else is just arbitrary memory addresses that could be anything, anywhere, overlapping any other pointer.
I'm not surprised the optimisation in your question (using local variables) isn't made - not only would the compiler have to prove the memory of arr does not overlap anything pointed to by this, but also that not updating the member variables until the end of the function is equivalent to the un-optimised version updating throughout the function. That can be a lot trickier to determine than you imagine, especially if you take concurrency into account.

C and C++: Array element access pointer vs int

Is there a performance difference between using myarray[ i ] directly and storing the address of myarray[ i ] in a pointer?
Edit: The pointers are all calculated during an unimportant step in my program where performance is no criterion. During the critical parts the pointers remain static and are not modified. Now the question is whether these static pointers are faster than using myarray[ i ] all the time.
For this code:
int main() {
    int a[100], b[100];
    int * p = b;
    for ( unsigned int i = 0; i < 100; i++ ) {
        a[i] = i;
        *p++ = i;
    }
    return a[1] + b[2];
}
when built with -O3 optimisation in g++, the statement:
a[i] = i;
produced the assembly output:
mov %eax,(%ecx,%eax,4)
and this statement:
*p++ = i;
produced:
mov %eax,(%edx,%eax,4)
So in this case there was no difference between the two. However, this is not and cannot be a general rule - the optimiser might well generate completely different code for even a slightly different input.
It will probably make no difference at all. The compiler will usually be smart enough to know when you are using an expression more than once and create a temporary itself, if appropriate.
Compilers can do surprising optimizations; the only way to know is to read the generated assembly code.
With GCC, use -S, with -masm=intel for Intel syntax.
With VC++, use /FA (IIRC).
You should also enable optimizations: -O2 or -O3 with GCC, and /O2 with VC++.
I prefer using myarray[ i ] since it is clearer and the compiler has an easier time compiling it to optimized code.
When using pointers it is more complex for the compiler to optimize the code, since it's harder to know exactly what you're doing with the pointer.
There should not be much difference, but by using indexing you avoid all sorts of pitfalls that the compiler's optimizer is prone to (aliasing being the most important one), so I'd say the indexing case should be easier for the compiler to handle. This doesn't mean those issues can't be dealt with, but pointers in a loop generally just add to the complexity.
Yes. With a pointer, the address doesn't have to be calculated from the initial address of the array; it is accessed directly. So you get a small performance improvement if you save the address in a pointer.
But the compiler will usually optimize the code and use a pointer in both cases (if you have static arrays).
For dynamic arrays (created with new), the pointer will offer more performance, as the compiler cannot optimize the array accesses at compile time.
There will be no substantial difference. Premature optimization is the root of all evil - get a profiler before checking micro-optimizations like this. Also, the myarray[i] is more portable to custom types, such as a std::vector.
Okay, so your question is: what's faster?
#include <stdio.h>

int main(int argc, char **argv)
{
    int array[20];
    array[0] = 0;
    array[1] = 1;

    int *value_1 = &array[1];

    printf("%d", *value_1);
    printf("%d", array[1]);
    printf("%d", *(array + 1));
}
Like someone else already pointed out, compilers can do clever optimizations. Of course it depends on where an expression is used, but normally you shouldn't care about such subtle differences; all your assumptions can be proven wrong by the compiler. Today you shouldn't need to care about such differences.
For example, the above code produces the following (only a snippet):
mov [ebp+var_54], 1 #store 1
lea eax, [ebp+var_58] # load the address of array[0]
add eax, 4 # add 4 (size of int)
mov [ebp+var_5C], eax
mov eax, [ebp+var_5C]
mov eax, [eax]
mov [esp+88h+var_84], eax
mov [esp+88h+var_88], offset unk_403000 # points to %d
call printf
mov eax, [ebp+var_54]
mov [esp+88h+var_84], eax
mov [esp+88h+var_88], offset unk_403000
call printf
mov eax, [ebp+var_54]
mov [esp+88h+var_84], eax
mov [esp+88h+var_88], offset unk_403000
call printf
Short answer: the only way to know for sure is to code up both versions and compare performance. I would personally be surprised if there was a measurable difference unless you were doing a lot of array accesses in a really tight loop. If this is something that happens once or twice over the lifetime of the program, or depends on user input, it's not worth worrying about.
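For instance, a minimal timing harness along those lines using std::chrono (my illustration; a real benchmark needs repetitions, warm-up, and care to keep the optimizer from deleting the work):
#include <chrono>
#include <cstdio>

int myarray[1024];   // hypothetical data to sum

int main()
{
    auto t0 = std::chrono::steady_clock::now();
    long long sum = 0;
    for (int rep = 0; rep < 100000; ++rep)
        for (int i = 0; i < 1024; ++i)
            sum += myarray[i];   // indexed version; swap in a pointer
                                 // walk here to compare the two
    auto t1 = std::chrono::steady_clock::now();
    long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("sum=%lld time=%lld us\n", sum, us);
}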
Remember that the expression a[i] is evaluated as *(a+i), which is an addition plus a dereference, whereas *p is just a dereference. Depending on how the code is structured, though, it may not make a difference. Assume the following:
int a[N];  // for any arbitrary N > 1
int *p = a;
size_t i;

for (i = 0; i < N; i++)
    printf("a[%zu] = %d\n", i, a[i]);

for (i = 0; i < N; i++)
    printf("*(%p) = %d\n", (void*) p, *p++);
Now we're comparing a[i] to *p++, which is a dereference plus a postincrement (in addition to the i++ in the loop control); that may turn out to be a more expensive operation than the array subscript. Not to mention we've introduced another variable that's not strictly necessary; we're trading a little space for what may or may not be an improvement in speed. It really depends on the compiler, the structure of the code, optimization settings, OS, and CPU.
Worry about correctness first, then worry about readability/maintainability, then worry about safety/reliability, then worry about performance. Unless you're failing to meet a hard performance requirement, focus on making your intent clear and easy to understand. It doesn't matter how fast your code is if it gives you the wrong answer or performs the wrong action, or if it crashes horribly at the first hint of bad input, or if you can't fix bugs or add new features without breaking something.
Yes - storing the address of myarray[ i ] in a pointer will perform better (if used at large scale).
Why?
It saves you an addition and maybe a multiplication (or a shift).
Many compilers may optimize that for you in the case of static memory allocation.
If you are using dynamic memory allocation, the compiler will not optimize it, because that happens at runtime!