C++ catching dangling reference - c++

Suppose the following piece of code
struct S {
S(int & value): value_(value) {}
int & value_;
};
S function() {
int value = 0;
return S(value); // implicitly returning reference to local value
}
compiler does not produce warning (-Wall), this error can be hard to catch.
What tools are out there to help catch such problems

There are runtime based solutions which instrument the code to check invalid pointer accesses. I've only used mudflap so far (which is integrated in GCC since version 4.0). mudflap tries to track each pointer (and reference) in the code and checks each access if the pointer/reference actually points to an alive object of its base type. Here is an example:
#include <stdio.h>
struct S {
S(int & value): value_(value) {}
int & value_;
};
S function() {
int value = 0;
return S(value); // implicitly returning reference to local value
}
int main()
{
S x=function();
printf("%s\n",x.value_); //<-oh noes!
}
Compile this with mudflap enabled:
g++ -fmudflap s.cc -lmudflap
and running gives:
$ ./a.out
*******
mudflap violation 1 (check/read): time=1279282951.939061 ptr=0x7fff141aeb8c size=4
pc=0x7f53f4047391 location=`s.cc:14:24 (main)'
/opt/gcc-4.5.0/lib64/libmudflap.so.0(__mf_check+0x41) [0x7f53f4047391]
./a.out(main+0x7f) [0x400c06]
/lib64/libc.so.6(__libc_start_main+0xfd) [0x7f53f358aa7d]
Nearby object 1: checked region begins 332B before and ends 329B before
mudflap object 0x703430: name=`argv[]'
bounds=[0x7fff141aecd8,0x7fff141aece7] size=16 area=static check=0r/0w liveness=0
alloc time=1279282951.939012 pc=0x7f53f4046791
Nearby object 2: checked region begins 348B before and ends 345B before
mudflap object 0x708530: name=`environ[]'
bounds=[0x7fff141aece8,0x7fff141af03f] size=856 area=static check=0r/0w liveness=0
alloc time=1279282951.939049 pc=0x7f53f4046791
Nearby object 3: checked region begins 0B into and ends 3B into
mudflap dead object 0x7089e0: name=`s.cc:8:9 (function) int value'
bounds=[0x7fff141aeb8c,0x7fff141aeb8f] size=4 area=stack check=0r/0w liveness=0
alloc time=1279282951.939053 pc=0x7f53f4046791
dealloc time=1279282951.939059 pc=0x7f53f4046346
number of nearby objects: 3
Segmentation fault
A couple of points to consider:
mudflap can be fine tuned in what exactly it should check and do. read http://gcc.gnu.org/wiki/Mudflap_Pointer_Debugging for details.
The default behaviour is to raise a SIGSEGV on a violation, this means you can find the violation in your debugger.
mudflap can be a bitch, in particular when your are interacting with libraries that are not compiled with mudflap support.
It wont't bark on the place where the dangling reference is created (return S(value)), only when the reference is dereferenced. If you need this, then you'll need a static analysis tool.
P.S. one thing to consider was, to add a NON-PORTABLE check to the copy constructor of S(), which asserts that value_ is not bound to an integer with a shorter life span (for example, if *this is located on an "older" slot of the stack that the integer it is bound to). This is higly-machine specific and possibly tricky to get right of course, but should be OK as long it's only for debugging.

I think this is not possible to catch all these, although some compilers may give warnings in some cases.
It's as well to remember that references are really pointers under the hood, and many of the shoot-self-in-foot scenarios possible with pointers are still possible..
To clarify what I mean about "pointers under the hood", take the following two classes. One uses references, the other pointers.
class Ref
{
int &ref;
public:
Ref(int &r) : ref(r) {};
int get() { return ref; };
};
class Ptr
{
int *ptr;
public:
Ptr(int *p) : ptr(p) {};
int get() { return *ptr; };
};
Now, compare at the generated code for the two.
##Ref#$bctr$qri proc near // Ref::Ref(int &ref)
push ebp
mov ebp,esp
mov eax,dword ptr [ebp+8]
mov edx,dword ptr [ebp+12]
mov dword ptr [eax],edx
pop ebp
ret
##Ptr#$bctr$qpi proc near // Ptr::Ptr(int *ptr)
push ebp
mov ebp,esp
mov eax,dword ptr [ebp+8]
mov edx,dword ptr [ebp+12]
mov dword ptr [eax],edx
pop ebp
ret
##Ref#get$qv proc near // int Ref:get()
push ebp
mov ebp,esp
mov eax,dword ptr [ebp+8]
mov eax,dword ptr [eax]
mov eax,dword ptr [eax]
pop ebp
ret
##Ptr#get$qv proc near // int Ptr::get()
push ebp
mov ebp,esp
mov eax,dword ptr [ebp+8]
mov eax,dword ptr [eax]
mov eax,dword ptr [eax]
pop ebp
ret
Spot the difference? There isn't any.

You have to use an technology based on compile-time instrumentation. While valgrind could check all function calls at run-time (malloc, free), it could not check just code.
Depending your architecture, IBM PurifyPlus find some of these problem. Therefore, you should find a valid license (or use your company one) to use-it, or try-it with the trial version.

I don't think any static tool can catch that, but if you use Valgrind along with some unit tests or whatever code is crashing (seg fault), you can easily find where the memory is bring referenced and where it was allocated originally.

There is a guideline I follow after having been beaten by this exact thing:
When a class has a reference member (or a pointer to something that can have a lifetime you don't control), make the object non-copyable.
This way, you reduce the chances of escaping the scope with a dangling reference.

Your code shouldn't even compile. The compilers I know of will either fail to compile the code, or at the very least throw a warning.
If you meant return S(value) instead, then for heavens sake COPY PASTE THE CODE YOU POST HERE.
Rewriting and introducing typos just means it is impossible for us to actually guess which errors you're asking about, and which ones were accidents we're supposed to ignore.
When you post a question anywhere on the internet, if that question includes code, POST THE EXACT CODE.
Now, assuming this was actually a typo, the code is perfectly legal, and there's no reason why any tool should warn you.
As long as you don't try to dereference the dangling reference, the code is perfectly safe.
It is possible that some static analysis tools (Valgrind, or MSVC with /analyze, for example) can warn you about this, but there doesn't seem to be much point because you're not doing anything wrong. You're returning an object which happens to contain a dangling reference. You're not directly returning a reference to a local object (which compilers typically do warn about), but a higher level object with behavior that might make it perfectly safe to use, even though it contains a reference to a local object that's gone out of scope.

This is perfectly valid code.
If you call your function and bind the temporary to a const reference the scope gets prolonged.
const S& s1 = function(); // valid
S& s2 = function(); // invalid
This is explicitly allowed in the C++ standard.
See 12.2.4:
There are two contexts in which temporaries are destroyed at a different point than the end of the full-expression.
and 12.2.5:
The second context is when a reference is bound to a temporary. The temporary to which the reference is bound or the temporary that is the complete object of a subobject to which the reference is bound persists for the lifetime of the reference except: [...]

Related

Coroutine frame seems to be prematurely marked as destroyed when using address sanitizer

I am trying to write a small and simple coroutine library just to get a more solid understanding of C++20 coroutines. It seems to work fine, but when I compile with clang's adress sanitizer, it throws up on me.
I have narrowed down the issue to the following code example (available with compiler and sanitizer output at https://godbolt.org/z/WqY6Gd), but I still can't make any sense of it.
// namespace coro = std::/std::experimental;
// inlining this suppresses the error
__attribute__((noinline)) void foo(int& i) { i = 0; }
struct task {
struct promise_type {
promise_type() = default;
coro::suspend_always initial_suspend() noexcept { return {}; }
coro::suspend_always final_suspend() noexcept { return {}; }
void unhandled_exception() noexcept { std::terminate(); }
void return_value(int v) noexcept { value = v; }
task get_return_object() {
return task{coro::coroutine_handle<promise_type>::from_promise(*this)};
}
int value{};
};
void Start() { return handle_.resume(); }
int Get() {
auto& promise = handle_.promise();
return promise.value;
}
coro::coroutine_handle<promise_type> handle_;
};
task func() { co_return 3; }
int main() {
auto t = func();
t.Start();
const auto result = t.Get();
foo(t.handle_.promise().value);
// moving this one line down or separating this into a noinline
// function suppresses the error
// removing this removes the stack-use-after-scope, but (rightfully) reports a leak
t.handle_.destroy();
if (result != 3) return 1;
}
Address sanitizer reports use-after-scope (full output available at godbolt, link above).
With some help from lldb, I found out that the error is thrown in main, more precisely: the jump at line 112 in the assembly listing, jne .LBB2_15, jumps to asan's report and never returns. It seems to be inside main's prologue.
As the comments indicate, moving destroy() a line down or calling it in a separate noinline function1 changes the behavior of address sanitizer. The only two explanations to this seem to be undefined behavior and asan throwing a false positive (or -fsanitize=address itself is creating lifetime issues, which is sort of the same in a sense).
At this point I'm fairly certain that there's no UB in the code above: both task and result live on main's stack frame, the promise object lives in the coroutine frame. The frame itself is allocated (on main's stack because no suspend-points) at line 1 of main, and destroyed right before returning, past the last access to it in foo(). The coroutine frame is not destroyed automatically because control never flows off co_await final_suspend(), as per the standard. I've been staring at this code for a while, though, so please forgive me if I missed something obvious.
The assembly generated without sanitation seems to makes sense to me and all the memory access happens within [rsp, rsp+24], as allocated. Futhermore, compiling with -fsanitize=address,undefined, or just -fsanitize=undefined, or simply compiling with gcc with -fsanitize=address reports no errors, which leads me to believe the issue is hidden somewhere in the code generated by asan.
Unfortunately, I can't quite make sense of what exactly happens in the code instrumented by asan, and that's why I'm posting this. I have a general understanding of Address sanitizer's algorithm, but I can't map the assembly memory access/allocations to what's happenning in the C
++ code.
I'm hoping that an answer will help me
Understand where the lifetime issues are hidden, if there are any
Understand what exactly happens in main when compiled with asan, so that a person reading this can have a more clear way of finding what memory access in the C++ code triggered the error, and where (if anywhere) was that memory allocated and freed.
Consistently suppress this particular false positive, and elaborate a bit on what causes it, if the issue really is in asan.
Thanks in advance.
1 This initially lead me to believe that clang's optimizer is reading result from the (destroyed) coroutine's frame directly, but moving destroy() into task's destructor brings the issue back and proves that theory wrong, as far as I can tell. destroy() is not in the destructor in the listing above because it requires implementing move construction/assignment in order to avoid double free, and I wanted to keep the example as small and clear as possible.
I think I figured it out - but mostly because it's already fixed in clang12.0.
Running the smaller/cleaner example with clang-12 shows no error from asan. The difference is in the following lines:
movabs rcx, -866669180174077455
mov qword ptr [r13 + 2147450880], rcx
mov dword ptr [r13 + 2147450888], -202116109
lea rdi, [r12 + 40]
mov rcx, rdi
shr rcx, 3
cmp byte ptr [rcx + 2147450880], 0
jne .LBB2_14
lea r14, [r12 + 32]
mov qword ptr [r14 + 8], offset f() [clone .cleanup]
lea rdi, [r14 + 16]
mov byte ptr [r13 + 2147450884], 0
mov rcx, rdi
shr rcx, 3
mov dl, byte ptr [rcx + 2147450880]
test dl, dl
jne .LBB2_7
.LBB2_8:
mov qword ptr [rbx + 16], rax # 8-byte Spill
mov dword ptr [r14 + 16], 0
Which clang-11 has, and clang-12 doesn't. From the looks of it, the address sanitizer tries to check that r12+40 (which should be the promise's cleanup method) is initialized before initializing it.
Clang-12 just performs no checks for the promise, leaving the entirity of the code above out.
TL;DR: (probably) a bug in clang-11 coroutine sanitation, fixed in 12.0, perhaps in later versions of clang-11 as well.

Under what conditions does MSVC C++ Compiler sometimes write the array size directly before the pointer returned from function operator new[]?

I'm currently working on a memory tracker for work, and we are overloading the function operator new[], in its many variations. While writing some unit tests, I stumbled across the fact that MSVC C++ 2019 (using the ISO C++ 17 Standard(std:c++17) compiler setting), writes the size of the allocated array of objects directly before the pointer returned to the caller, but only sometimes. I have been unable to find any documented conditions under which this will occur. Can anyone please explain what those conditions are, how I can detect them at runtime, and or point me to any documentation?
To even determine this was happening, I had to disassemble the code. Here is the C++:
const size_t k_NumFoos = 6;
Foo* pFoo = new Foo[k_NumFoos];
And here is the disassembly:
00007FF747BB3683 call operator new[] (07FF747A00946h)
00007FF747BB3688 mov qword ptr [rbp+19E8h],rax
00007FF747BB368F cmp qword ptr [rbp+19E8h],0
00007FF747BB3697 je ____C_A_T_C_H____T_E_S_T____0+0FF7h (07FF747BB36F7h)
00007FF747BB3699 mov rax,qword ptr [rbp+19E8h]
00007FF747BB36A0 mov qword ptr [rax],6
00007FF747BB36A7 mov rax,qword ptr [rbp+19E8h]
00007FF747BB36AE add rax,8
00007FF747BB36B2 mov qword ptr [rbp+1B58h],rax
The cmp and je lines are from the Catch2 library we are using for our unit tests. The subsequent two movs, following the je, are where it's writing the array size. The next three lines (mov, add, mov) are where it's moving the pointer to after where it has written the array size. This is all well and good, mostly.
We are also using MS's VirtualAlloc as the allocator internal to the overloaded function operator new[]. The address returned from VirtualAlloc must be aligned for the function operator new[] that uses std::align_t, and when the alignment is greater than the default max alignment, the moving of the pointer in those last three lines of disassembly are messing with the aligned address being returned. Initially, I thought all allocations made with function operator new[] would have this behavior. So, I tested some other uses of function operator new[], and found it to be true in all cases I tested. I wrote the code to adjust for this behavior, and then ran into a case where it doesn't exhibit the behavior of writing the array size before the returned allocation.
Here is the C++ of where it is not writing the array size before the returned allocation:
char **utf8Argv = new char *[ argc ];
argc is equal to 1. The line comes from the Session::applyCommandLine method in the Catch2 library. The disassembly looks like so:
00007FF73E189C6A call operator new[] (07FF73E07D6D8h)
00007FF73E189C6F mov qword ptr [rbp+168h],rax
00007FF73E189C76 mov rax,qword ptr [rbp+168h]
00007FF73E189C7D mov qword ptr [utf8Argv],rax
Notice after the call to operator new[] (07FF73E07D6F8h) there is no writing of the array size. When looking at the two for differences, I can see that one writes to a pointer, while the other writes to a pointer to a pointer. However, none of that information is available internally, at runtime, to function operator new[], as far as I know.
The code here comes from a Debug | x64 build. Any ideas on how to determine when this behavior will occur?
Update (for convo below):
Class Foo:
template<size_t ArrLen>
class TFoo
{
public:
TFoo()
{
memset(m_bar, 0, ArrLen);
}
TFoo(const TFoo<ArrLen>& other)
{
strncpy_s(m_bar, other.m_bar, ArrLen);
}
TFoo(TFoo<ArrLen>&& victim)
{
strncpy_s(m_bar, victim.m_bar, ArrLen);
}
~TFoo()
{
}
TFoo<ArrLen>& operator= (const TFoo<ArrLen>& other)
{
strncpy_s(m_bar, other.m_bar, ArrLen);
}
TFoo<ArrLen>& operator= (TFoo<ArrLen>&& victim)
{
strncpy_s(m_bar, victim.m_bar, ArrLen);
}
const char* GetBar()
{
return m_bar;
}
void SetBar(const char bar[ArrLen])
{
strncpy_s(m_bar, bar, ArrLen);
}
protected:
char m_bar[ArrLen];
};
using Foo = TFoo<8>;
At a guess, I would think the compiler would write the number of objects allocated out before the pointer returned to you when it is allocating objects which have a destructor that needs to be called when you call delete []. Under those circumstances, the compiler has to emit code to destroy each of the objects allocated when you call delete [], and to do that, it needs to know how many objects are present in the array.
OTOH, for something like char *, no count is needed, and so, as a minor optimisation, none is emitted, or so it would seem.
I don't suppose you'll find this documented anywhere and the behaviour might change in future versions of the compiler. It doesn't seem to be part of the standard.

Access through reference overhead vs copy overhead

Let's say that I want to pass a POD object to function as a const argument. I know that for simple types like int and double passing by value is better than by const reference because of the reference overhead. But at what size it is worth it to pass as a reference?
struct arg
{
...
}
void foo(const arg input)
{
// read from input
}
or
void foo(const arg& input)
{
// read from input
}
i.e., at what size of struct arg should I start using the latter approach?
I should also mention that I'm not talking about copy elision here. Let's suppose that it doesn't happen.
TL;DR: This depends highly on the target architecture, the compiler and the context in which the functions are invoked. When unsure, profile and manually inspect generated code.
If the functions are inlined, a good optimizing compiler will probably emit exact same code in both cases.
If the functions are not inlined however, the ABI on most C++ implementations dictate to pass a const& argument as a pointer. That means the structure has to be stored in RAM just so one can get an address of it. This can have a significant impact on performance for small objects.
Let's take x86_64 Linux G++ 8.2 as an example...
A struct with 2 members:
struct arg
{
int a;
long b;
};
int foo1(const arg input)
{
return input.a + input.b;
}
int foo2(const arg& input)
{
return input.a + input.b;
}
Generated assembly:
foo1(arg):
lea eax, [rdi+rsi]
ret
foo2(arg const&):
mov eax, DWORD PTR [rdi]
add eax, DWORD PTR [rdi+8]
ret
First version passes the structure entirely via registers, the second one via the stack..
Now let's try 3 members:
struct arg
{
int a;
long b;
int c;
};
int foo1(const arg input)
{
return input.a + input.b + input.c;
}
int foo2(const arg& input)
{
return input.a + input.b + input.c;
}
Generated assembly:
foo1(arg):
mov eax, DWORD PTR [rsp+8]
add eax, DWORD PTR [rsp+16]
add eax, DWORD PTR [rsp+24]
ret
foo2(arg const&):
mov eax, DWORD PTR [rdi]
add eax, DWORD PTR [rdi+8]
add eax, DWORD PTR [rdi+16]
ret
Not a whole lot of difference anymore, although using the second version will still be a bit slower because it requires the address to be put in rdi.
Does it really matter that much?
Usually not. If you care about performance of a particular function, it's probably called frequently and is therefore small. As such, it will most likely be inlined.
Let's try invoking the two functions above:
int test(int x)
{
arg a {x, x};
return foo1(a) + foo2(a);
}
Generated assembly:
test(int):
lea eax, [0+rdi*4]
ret
Voilà. It's all moot now. The compiler inlined and merged both functions into a single instruction!
A reasonable rule of thumb: If the size of the class is same or less than size of a pointer, then copying may be a bit faster.
If the size of the class is slightly higher, then it may be hard to predict. The difference is often insignificant.
If the size of the class is humongous, then copying is likely slower. That said, point is moot since humongous objects can't in practice have automatic storage, since it is limited.
If the function is expanded inline, then there is probably no difference whatsoever.
To find out whether one program is faster than the other on a particular system, and whether the difference is significant in the first place, you can use a profiler.
In addition to other responses, there is also optimization concerns.
Since it's a reference, the compiler cannot know if the reference point to a mutable global variable or not. When calling any function that the source is not available to the current TU, the compiler must assume the variable may have been mutated.
For example, if you have a if depending on a data member of Foo, call a function, then use the same data member, the compiler will be force to output two sparated loads, whereas if the variable is local, it knows it cannot be mutated elsewhere. Here's an example:
struct Foo {
int data;
};
extern void use_data(int);
void bar(Foo const& foo) {
int const& data = foo.data;
// may mutate foo.data through a global Foo
use_data(data);
// must load foo.data again through the reference
use_data(data);
}
If the variable is local, the compiler will simply reuse the value already inside the registers.
Here's a compiler explorer example that shows the optimization being applied only if the variable is local.
This is why the "general advise" will give you good performance, but won't give you optimal performance. You must mesure and profile your code if you truly care about the performance of your code.

compiler memory optimization - reusing existing blocks

Say i were to allocate 2 memory blocks.
I use the first memory block to store something and use this stored data.
Then i use the second memory block to do something similar.
{
int a[10];
int b[10];
setup_0(a);
use_0(a);
setup_1(b);
use_1(b);
}
|| compiler optimizes this to this?
\/
{
int a[10];
setup_0(a);
use_0(a);
setup_1(a);
use_1(a);
}
// the setup functions overwrites all 10 words
The question is now: Do compiler optimize this, so that they reuse the existing memory blocks, instead of allocating a second one, if the compiler knows that the first block will not be referenced again?
If this is true:
Does this also work with dynamic memory allocation?
Is this also possible if the memory persists outside the scope, but is used in the same way as given in the example?
I assume this only works if setup and foo are implemented in the same c file (exist in the same object as the calling code)?
Do compiler optimize this
This question can only be answered if you ask about a particular compiler. And the answer can be found by inspecting the generated code.
so that they reuse the existing memory blocks, instead of allocating a second one, if the compiler knows that the first block will not be referenced again?
Such optimization would not change the behaviour of the program, so it would be allowed. Another matter is: Is it possible to prove that the memory will not be referenced? If it is possible, then is it easy enough to prove in reasonable time? I feel very safe in saying that it is not possible to prove in general, but it is provable in some cases.
I assume this only works if setup and foo are implemented in the same c file (exist in the same object as the calling code)?
That would usually be required to prove the untouchability of the memory. Link time optimization might lift this requirement, in theory.
Does this also work with dynamic memory allocation?
In theory, since it doesn't change the behaviour of the program. However, the dynamic memory allocation is typically performed by a library and thus the compiler may not be able to prove the lack of side-effects and therefore wouldn't be able to prove that removing an allocation wouldn't change behaviour.
Is this also possible if the memory persists outside the scope, but is used in the same way as given in the example?
If the compiler is able to prove that the memory is leaked, then perhaps.
Even though the optimization may be possible, it is not very significant. Saving a bit of stack space probably has very little effect on run time. It could be useful to prevent stack overflows if the arrays are large.
https://godbolt.org/g/5nDqoC
#include <cstdlib>
extern int a;
extern int b;
int main()
{
{
int tab[1];
tab[0] = 42;
a = tab[0];
}
{
int tab[1];
tab[0] = 42;
b = tab[0];
}
return 0;
}
Compiled with gcc 7 with -O3 compilation flag:
main:
mov DWORD PTR a[rip], 42
mov DWORD PTR b[rip], 42
xor eax, eax
ret
If you follow the link you should see the code being compiled on gcc and clang with -O3 optimisation level. The resulting asm code is pretty straight forward. As the value stored in the array is know at compilation time, the compiler can easily skip everything and straight up set the variables a and b. Your buffer is not needed.
Following a code similar to the one provided in your example:
https://godbolt.org/g/bZHSE4
#include <cstdlib>
int func1(const int (&tab)[10]);
int func2(const int (&tab)[10]);
int main()
{
int a[10];
int b[10];
func1(a);
func2(b);
return 0;
}
Compiled with gcc 7 with -O3 compilation flag:
main:
sub rsp, 104
mov rdi, rsp ; first address is rsp
call func1(int const (&) [10])
lea rdi, [rsp+48] ; second address is [rsp+48]
call func2(int const (&) [10])
xor eax, eax
add rsp, 104
ret
You can see the pointer sent to the function func1 and func2 is different as the first pointer used is rsp in the call to func1, and [rsp+48] in the call to func2.
You can see that either the compiler completely ignores your code in the case it is predictable. In the other case, at least for gcc 7 and clang 3.9.1, it is not optimized.
https://godbolt.org/g/TnV62V
#include <cstdlib>
extern int * a;
extern int * b;
inline int do_stuff(int ** to)
{
*to = (int *) malloc(sizeof(int));
(**to) = 42;
return **to;
}
int main()
{
do_stuff(&a);
free(a);
do_stuff(&b);
free(b);
return 0;
}
Compiled with gcc 7 with -O3 compilation flag:
main:
sub rsp, 8
mov edi, 4
call malloc
mov rdi, rax
mov QWORD PTR a[rip], rax
call free
mov edi, 4
call malloc
mov rdi, rax
mov QWORD PTR b[rip], rax
call free
xor eax, eax
add rsp, 8
ret
While not being fluent at reading this, it is pretty easy to tell that with the following example, malloc and free is not being optimized neither by gcc or clang (if you want to try with more compiler, suit yourself but don't forget to set the optimization flag).
You can clearly see a call to "malloc" followed by a call to "free", twice
Optimizing stack space is quite unlikely to really have an effect on the speed of your program, unless you manipulate large amount of data.
Optimizing dynamically allocated memory is more relevant. AFAIK you will have to use a third-party library or run your own system if you plan to do that and this is not a trivial task.
EDIT: Forgot to mention the obvious, this is very compiler dependent.
As the compiler sees that a is used as a parameter for a function, it will not optimize b away. It can't, because it doesn't know what happens in the function that uses a and b. Same for a: the compiler doesn't know that a isn't used anymore.
As far as the compiler is concerned, the address of a could e.g. have ben stored by setup0 in a global variable and will be used by setup1 when it is called with b.
The short answer is: No! The compiler cannot optimize this code to what you suggested, because it is not semantically equivalent.
Long explenation: The lifetime of a and b is with some simplification the complete block.
So now lets assume, that one of setup_0 or use_0 stores a pointer to a in some global variable. Now setup_1 and use_1 are allowed to use a via this global variable in combination with b (It can for example add the array elements of a and b. If the transformation you suggested of the code was done, this would result in undefined behaviour. If you really want to make a statement about the lifetime, you have to write the code in the following way:
{
{ // Lifetime block for a
char a[100];
setup_0(a);
use_0(a);
} // Lifetime of a ends here, so no one of the following called
// function is allowed to access it. If it does access it by
// accident it is undefined behaviour
char b[100];
setup_1(b); // Not allowed to access a
use_1(b); // Not allowed to access a
}
Please also note that gcc 12.x and clang 15 both do the optimization. If you comment out the curly brackets, the optimization is (correctly!) not done.
Yes, theoretically, a compiler could optimize the code as you describe, assuming that it could prove that these functions did not modify the arrays passed in as parameters.
But in practice, no, that does not happen. You can write a simple test case to verify this. I've avoided defining the helper functions so the compiler can't inline them, but passed the arrays by const-reference to ensure that the compiler knows the functions don't modify them:
void setup_0(const int (&p)[10]);
void use_0 (const int (&p)[10]);
void setup_1(const int (&p)[10]);
void use_1 (const int (&p)[10]);
void TestFxn()
{
int a[10];
int b[10];
setup_0(a);
use_0(a);
setup_1(b);
use_1(b);
}
As you can see here on Godbolt's Compiler Explorer, no compilers (GCC, Clang, ICC, nor MSVC) will optimize this to use a single stack-allocated array of 10 elements. Of course, each compiler varies in how much space it allocates on the stack. Some of that is due to different calling conventions, which may or may not require a red zone. Otherwise, it's due to the optimizer's alignment preferences.
Taking GCC's output as an example, you can immediately tell that it is not reusing the array a. The following is the disassembly, with my annotations:
; Allocate 104 bytes on the stack
; by subtracting from the stack pointer, RSP.
; (The stack always grows downward on x86.)
sub rsp, 104
; Place the address of the top of the stack in RDI,
; which is how the array is passed to setup_0().
mov rdi, rsp
call setup_0(int const (&) [10])
; Since setup_0() may have clobbered the value in RDI,
; "refresh" it with the address at the top of the stack,
; and call use_0().
mov rdi, rsp
call use_0(int const (&) [10])
; We are now finished with array 'a', so add 48 bytes
; to the top of the stack (RSP), and place the result
; in the RDI register.
lea rdi, [rsp+48]
; Now, RDI contains what is effectively the address of
; array 'b', so call setup_1().
; The parameter is passed in RDI, just like before.
call setup_1(int const (&) [10])
; Second verse, same as the first: "refresh" the address
; of array 'b' in RDI, since it might have been clobbered,
; and pass it to use_1().
lea rdi, [rsp+48]
call use_1(int const (&) [10])
; Clean up the stack by adding 104 bytes to compensate for the
; same 104 bytes that we subtracted at the top of the function.
add rsp, 104
ret
So, what gives? Are compilers just massively missing the boat here when it comes to an important optimization? No. Allocating space on the stack is extremely fast and cheap. There would be very little benefit in allocating ~50 bytes, as opposed to ~100 bytes. Might as well just play it safe and allocate enough space for both arrays separately.
There might be more of a benefit in reusing the stack space for the second array if both arrays were extremely large, but empirically, compilers don't do this, either.
Does this work with dynamic memory allocation? No. Emphatically no. I've never seen a compiler that optimizes around dynamic memory allocation like this, and I don't expect to see one. It just doesn't make sense. If you wanted to re-use the block of memory, you would have written the code to re-use it instead of allocating a separate block.
I suppose you are thinking that if you had something like the following C code:
void TestFxn()
{
int* a = malloc(sizeof(int) * 10);
setup_0(a);
use_0(a);
free(a);
int* b = malloc(sizeof(int) * 10);
setup_1(b);
use_1(b);
free(b);
}
that the optimizer could see that you were freeing a, and then immediately re-allocating a block of the same size as b? Well, the optimizer won't recognize this and elide the back-to-back calls to free and malloc, but the run-time library (and/or operating system) very likely will. free is a very cheap operation, and since a block of the appropriate size was just released, allocation will also be very cheap. (Most run-time libraries maintain a private heap for the application and won't even return the memory to the operating system, so depending on the memory-allocation strategy, it's even possible that you get the exact same block back.)

C++ variable declaration re: memory usage

Hi I know this is a very silly/basic question, but what is the difference between the code like:-
int *i;
for(j=0;j<10;j++)
{
i = static_cast<int *>(getNthCon(j));
i->xyz
}
and, some thing like this :-
for(j=0;j<10;j++)
{
int *i = static_cast<int *>(getNthCon(j));
i->xyz;
}
I mean, are these code extremely same in logic, or would there be any difference due to its local nature ?
One practical difference is the scope of i. In the first case, i continues to exist after the final iteration of the loop. In the second it does not.
There may be some case where you want to know the value of i after all of the computation is complete. In that case, use the second pattern.
A less practical difference is the nature of the = token in each case. In the first example i = ... indicates assignment. In the second example, int *i = ... indicates initialization. Some types (but not int* nor fp_ContainerObject*) might treat assignment and initialization differently.
There is very little difference between them.
In the first code sample, i is declared outside the loop, so you're re-using the same pointer variable on each iteration. In the second, i is local to the body of the loop.
Since i is not used outside the loop, and the value assigned to it in one iteration is not used in future iterations, it's better style to declare it locally, as in the second sample.
Incidentally, i is a bad name for a pointer variable; it's usually used for int variables, particularly ones used in for loops.
For any sane optimizing compiler there will be no difference in terms of memory allocation. The only difference will be the scope of i. Here is a sample program (and yes, I realize there is a leak here):
#include <iostream>
int *get_some_data(int value) {
return new int(value);
}
int main(int argc, char *argv[]){
int *p;
for(int i = 0; i < 10; ++i) {
p = get_some_data(i);
std::cout << *p;
}
return 0;
}
And the generated assembly output:
int main(int argc, char *argv[]){
01091000 push esi
01091001 push edi
int *p;
for(int i = 0; i < 10; ++i) {
01091002 mov edi,dword ptr [__imp_operator new (10920A8h)]
01091008 xor esi,esi
0109100A lea ebx,[ebx]
p = get_some_data(i);
01091010 push 4
01091012 call edi
01091014 add esp,4
01091017 test eax,eax
01091019 je main+1Fh (109101Fh)
0109101B mov dword ptr [eax],esi
0109101D jmp main+21h (1091021h)
0109101F xor eax,eax
std::cout << *p;
01091021 mov eax,dword ptr [eax]
01091023 mov ecx,dword ptr [__imp_std::cout (1092048h)]
01091029 push eax
0109102A call dword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (1092044h)]
01091030 inc esi
01091031 cmp esi,0Ah
01091034 jl main+10h (1091010h)
}
Now with the pointer declared inside of the loop:
int main(int argc, char *argv[]){
008D1000 push esi
008D1001 push edi
for(int i = 0; i < 10; ++i) {
008D1002 mov edi,dword ptr [__imp_operator new (8D20A8h)]
008D1008 xor esi,esi
008D100A lea ebx,[ebx]
int *p = get_some_data(i);
008D1010 push 4
008D1012 call edi
008D1014 add esp,4
008D1017 test eax,eax
008D1019 je main+1Fh (8D101Fh)
008D101B mov dword ptr [eax],esi
008D101D jmp main+21h (8D1021h)
008D101F xor eax,eax
std::cout << *p;
008D1021 mov eax,dword ptr [eax]
008D1023 mov ecx,dword ptr [__imp_std::cout (8D2048h)]
008D1029 push eax
008D102A call dword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (8D2044h)]
008D1030 inc esi
008D1031 cmp esi,0Ah
008D1034 jl main+10h (8D1010h)
}
As you can see, the output is identical. Note that, even in a debug build, the assembly remains identical.
Ed S. shows that most compilers will generate the same code for both cases. But, as Mahesh points out, they're not actually identical (even beyond the obvious fact that it would be legal to use i outside the loop scope in version 1 but not version 2). Let me try to explain how these can both be true, in a way that isn't misleading.
First, where does the storage for i come from?
The standard is silent on this—as long as storage is available for the entire lifetime of the scope of i, it can be anywhere the compiler likes. But the typical way to deal with local variables (technically, variables with automatic storage duration) is to expand the stack frame of the appropriate scope by sizeof(i) bytes, and store i as an offset into that stack frame.
A "teaching compiler" might always create a stack frame for each scope. But a real compiler usually won't bother, especially if nothing happens on entering and exiting the loop scope. (There's no way you can tell the difference, except by looking at the assembly or breaking in with a debugger, so of course it's allowed to do this.) So, both versions will probably end up with i referring to the exact same offset from the function's stack frame. (Actually, it's quite plausible i will end up in a register, but that doesn't change anything important here.)
Now let's look at the lifecycle.
In the first case, the compiler has to default-initialize i where it's declared at the function scope, copy-assign into it each time through the loop, and destroy it at the end of the function scope. In the second case, the compiler has to copy-initialize i at the start of each loop, and destroy it at the end of each loop. Like this:
If i were of class type, this would be a very significant difference. (See below if it's not obvious why.) But it's not, it's a pointer. This means default initialization and destruction are both no-ops, and copy-initialization and copy-assignment are identical.
So, the lifecycle-management code will be identical in both cases: it's a copy once each time through the loop, and that's it.
In other words, the storage is allowed to be, and probably will be, the same; the lifecycle management is required to be the same.
I promised to come back to why these would be different if i were of class type.
Compare this pseudocode:
i.IType();
for(j=0;j<10;j++) {
i.operator=(static_cast<IType>(getNthCon(j));
}
i.~IType();
to this:
for(j=0;j<10;j++) {
i.IType(static_cast<IType>(getNthCon(j));
i.~IType();
}
At first glance, the first version looks "better", because it's 1 IType(), 10 operator=(IType&), and 1 ~IType(), while the second is 10 IType(IType&) and 10 ~IType(). And for some classes, this might be true. But if you think about how operator= works, it usually has to do at least the equivalent of a copy construction and a destruction.
So the real difference here is that the first version requires a default constructor and a copy-assignment operator, while the second doesn't. And if you take out that static_cast bit (so we're talking about a conversion constructor and assignment instead of copy), what you're looking at is equivalent to this:
for(j=0;j<10;j++) {
std::ifstream i(filenames[j]);
}
Clearly, you would try to pull i out of the loop in that case.
But again, this is only true for "most" classes; you can easily design a class for which version 2 is ridiculously bad and version 1 makes more sense.
For every iteration, in second case, a new pointer variable is created on the stack. While in the first case, the pointer variable is created only once(i.e., before entering the loop )