Do compilers optimize memset in destructor?

Given a struct like:
#include <cstring>
#include <vector>

struct CryptoKey {
    std::vector<unsigned char> key;
    ~CryptoKey() { std::memset(key.data(), 0, key.size()); }
};
The compiler is entitled to eliminate the call to memset because this will save time, and no program with defined behaviour can tell the difference. (Given that the variable key will cease to exist once the destructor returns.)
Nevertheless, code like this is useful in cryptographic applications, because the less time that a secret is stored in memory, the less chance an attacker has to extract it. (The memset does not provide security, but it does provide "defence in depth".)
My question is, which real compilers actually do eliminate such memset calls (obviously, with optimization turned on)?

Perhaps it is better to say that a good compiler would attempt to eliminate the memset call and a developer should not rely on differences of compiler implementation to avoid this optimisation. These compilers typically have secure alternatives that will not be optimised.
Secure version of memset
C11 introduces memset_s (in the optional Annex K), one of whose characteristics is that it will not be optimised out.
Unlike memset, any call to the memset_s function shall be evaluated strictly according to the rules of the abstract machine as described in (5.1.2.3). That is, any call to the memset_s function shall assume that the memory indicated by s and n may be accessible in the future and thus must contain the values indicated by c.
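Since memset_s lives in an optional annex, many C and C++ standard libraries do not ship it. A fallback that is often used is to write the zeros through a volatile-qualified pointer. The following is only a minimal sketch under that assumption: secure_zero is a hypothetical helper name, not something from the question, and whether the standard strictly guarantees stores through a volatile lvalue to a non-volatile object has been debated, even though mainstream compilers keep them in practice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the original question): zero a buffer
// through a volatile-qualified pointer so that every store is performed
// as a volatile access rather than as a plain, removable dead store.
inline void secure_zero(void* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    volatile unsigned char* vp = static_cast<volatile unsigned char*>(p);
    for (std::size_t i = 0; i < n; ++i) {
        vp[i] = 0;
    }
}

// The struct from the question, with the memset replaced by the helper.
struct CryptoKey {
    std::vector<unsigned char> key;
    ~CryptoKey() { secure_zero(key.data(), key.size()); }
};
```

This wipes only the vector's current buffer; any earlier buffers left behind by reallocation, and any copies the compiler made, are not touched.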
Windows specific
On Windows there are other choices: SecureZeroMemory, or a #pragma optimize directive to turn off optimisation.
Common sub-expression optimisations
There is a broader issue with cryptographic safety: compilers are within their rights to copy buffers for optimisation reasons. Zeroing may not remove all copies, because the compiler may have applied optimisations that copy data from the heap to the stack or into registers, for example to eliminate common sub-expressions. So besides making sure the zeroing is not optimised out, care should be taken that the compiler isn't inserting additional copies.

The problem for optimizers here is that your memset isn't writing to a member at all. Yes, key will cease to exist, but not so key.data. That memory will be returned to std::allocator. And std::allocator will very likely read adjacent memory to determine the memory block from which key.data came. Typical implementations store such data in the header of allocated blocks, i.e. at negative offsets. It's not unlikely that the header will be updated to reflect the block is free, or to coalesce the free block with other free blocks.
This may even be inlined, so the optimizer sees one function doing a memset and then the header access. It would be unreasonable to expect that the optimizer can figure out the memset is harmless. For all it knows, the allocator may be keeping a pool of zeroed blocks.

Related

Is a fundamental type volatile initialization an observable behavior?

Consider this function:
#include <new>

void f(void* loc)
{
    auto p = new(loc) volatile int{42};
    *p = 0;
}
I have checked the code generated by clang, gcc and CL; none of them elides the initialization. (The write may be seen by the hardware.)
Is it an extension provided by compilers to the standard? Does the standard allow compilers not to perform the write 42?
Actually, for objects of class type, it is specified that the constructor of an object is executed without consideration for the volatile qualifier [class.ctor]:
A constructor can be invoked for a const, volatile or const volatile object. const and volatile
semantics (10.1.7.1) are not applied on an object under construction. They come into effect when the
constructor for the most derived object (4.5) ends.

[intro.execution]/8 lists the minimum requirements for a conforming implementation; these are also known as “observable behavior”. The first requirement is that “Access to volatile objects are evaluated strictly according to the rules of the abstract machine.” The compiler is required to produce all observable behavior. In particular, it is not allowed to remove accesses to volatile objects. And note that “object” here is used in the compiler-writer’s sense: it includes built-in types.
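To see the difference for yourself, it may help to compile a volatile and a non-volatile variant side by side and compare the assembly. The sketch below is a variation on the question's f: the names f_volatile and f_plain are made up here, and the functions return the pointer only so that the result can be checked. In f_plain the store of 42 is a dead store that an optimizer is free to drop; in f_volatile the assignment of 0 is a volatile access that must be kept, and the compilers mentioned in the question reportedly keep the initialization as well.

```cpp
#include <cassert>
#include <new>

// Variant of the question's f: the object is a volatile int, so the
// assignment of 0 is an access to a volatile object.
inline volatile int* f_volatile(void* loc) {
    auto p = new (loc) volatile int{42};
    *p = 0;
    return p;
}

// Non-volatile counterpart: the store of 42 is overwritten before anything
// can read it, so the compiler may remove it under the as-if rule.
inline int* f_plain(void* loc) {
    auto p = new (loc) int{42};
    *p = 0;
    return p;
}
```

Both functions leave 0 in the storage; the only question is which intermediate stores appear in the generated code.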

This is not a coherent question because what it means for a compiler to perform a write is platform-specific. There is no platform-independent notion of performing a write other than perhaps seeing the effects of a write in a subsequent read.
As you see, typical compilers on x86 will emit a write instruction but no memory barrier. The CPU may reorder the write, coalesce it, or even avoid doing any write to main memory because of the way the platform's cache coherence works.
The reason they made this implementation choice is that it makes volatile work for a broad range of applications, including those where the standard requires it to work, and because it has acceptable performance consequences. The standard, being platform-neutral, doesn't dictate platform-specific decisions like this and compiler writers do not understand it to do that.
They could have forced every volatile access to be uncoalescable, un-reorderable, and pushed through the cache subsystem to main memory. But that would provide terrible performance and, on this platform, no significant benefits. So they don't do it, and they don't understand the C++ standard to suggest that there's some mythical observer on the memory bus who must see specific things. The very existence of a memory bus is platform-specific. The standard is not platform-specific.
You will sometimes see people argue, for example, that the standard somehow requires the compiler to issue instructions to do volatile writes in order but that it doesn't matter if the CPU coalesces or re-orders the writes. This is, frankly, silly. The C++ standard doesn't impose requirements on the instructions compilers generate but rather on what those instructions must actually do when executed. It doesn't distinguish between optimizations done by a CPU and optimizations done by a compiler and any such distinctions would be platform-specific anyway.
If the standard allows a CPU to re-order two writes, then it allows the compiler to re-order them. It does not, and cannot, make that kind of distinction. Of course, compiler writers may still decide that they will issue the writes in order even though the CPU can re-order them, because that may make the most sense on their platform.

as-if rule and removal of allocation

The "as-if rule" gives the compiler the right to optimize out or reorder expressions that would not make a difference to the output and correctness of a program under certain rules, such as:
§1.9.5
A conforming implementation executing a well-formed program shall
produce the same observable behavior as one of the possible executions
of the corresponding instance of the abstract machine with the same
program and the same input.
The cppreference url I linked above specifically mentions special rules for the values of volatile objects, as well as for "new expressions", under C++14:
New-expression has another exception from the as-if rule: the compiler
may remove calls to the replaceable allocation functions even if a
user-defined replacement is provided and has observable side-effects.
I assume "replaceable" here is what is talked about for example in
§18.6.1.1.2
Replaceable: a C++ program may define a function with this function
signature that displaces the default version defined by the C++
standard library.
Is it correct that mem below can be removed or reordered under the as-if rule?
{
... some conformant code // upper block of code
auto mem = std::make_unique<std::array<double, 5000000>>();
... more conformant code, not using mem // lower block of code
}
Is there a way to ensure it's not removed, and stays between the upper and lower blocks of code? A well placed volatile (either/or volatile std::array or left of auto) comes to mind, but as there is no reading of mem, I think even that would not help under the as-if rule.
Side note: I've not been able to get Visual Studio 2015 to optimize out mem and the allocation at all.
Clarification: The way to observe this would be that the allocation call to the OS comes between any i/o from the two blocks. The point of this is for test cases and/or trying to get objects to be allocated at new locations.

Yes; No. Not within C++.
The abstract machine of C++ does not talk about system allocation calls at all. Only the side effects of such a call that impact the behavior of the abstract machine are fixed by C++, and even then the compiler is free to do something else, so long as the program behaves as if it had, producing the same observable behavior in the abstract machine.
In the abstract machine, auto mem = std::make_unique<std::array<double, 5000000>>(); creates a variable mem. It, if used, gives you access to a large amount of doubles packed into an array. The abstract machine is free to throw an exception, or provide you with that large amount of doubles; either is fine.
Note that a conforming C++ compiler may replace all allocations through new with an unconditional throw of an allocation failure (or return nullptr for the nothrow versions), but that would be a poor quality of implementation.
In the case where it is allocated, the C++ standard doesn't really say where it comes from. The compiler is free to use a static array, for example, and make the delete call a no-op (note it may have to prove it catches all ways to call delete on the buffer).
Next, if you have a static array, if nobody reads or writes to it (and the construction cannot be observed), the compiler is free to eliminate it.
That being said, much of the above relies on the compiler knowing what is going on.
So an approach is to make it impossible for the compiler to know. Have your code load a DLL, then pass a pointer to the unique_ptr to that DLL at the points where you want its state to be known.
Because the compiler cannot optimize over run-time DLL calls, the state of the variable has to basically be what you'd expect it to be.
Sadly, there is no standard way to dynamically load code like that in C++, so you'll have to rely upon your current system.
Said DLL can be separately written to be a noop; or, even, you can examine some external state, and conditionally load and pass the data to the DLL based on the external state. So long as the compiler cannot prove said external state will occur, it cannot optimize around the calls not being made. Then, never set that external state.
Declare the variable at the top of the block. Pass a pointer to it to the fake-external-DLL while uninitialized. Repeat just before initializing it, then after. Then finally, do it at the end of the block before destroying it, .reset() it, then do it again.
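If loading a real DLL is too much machinery for a test, a rough stand-in is to call through a volatile function pointer: the compiler has to load the pointer at every call and cannot assume which function it points to. The sketch below follows the sequence of steps described above under that assumption. The names g_escape, count_escape, g_escape_calls and pin_allocation are all invented for illustration, and this still does not make the OS-level allocation observable behaviour in the standard's sense; it only takes away the knowledge the optimizer would need.

```cpp
#include <array>
#include <cassert>
#include <memory>

using Buffer = std::array<double, 5000000>;  // about 40 MB, allocated on the heap

// Counts calls so the effect can be checked; a real no-op would also do.
static int g_escape_calls = 0;
static void count_escape(const void*) { ++g_escape_calls; }

// Stand-in for the "fake external DLL": reading a volatile pointer is an
// observable access, so the compiler cannot assume the call target.
static void (*volatile g_escape)(const void*) = count_escape;

inline void pin_allocation() {
    std::unique_ptr<Buffer> mem;       // declared at the top of the block
    g_escape(&mem);                    // before it owns anything
    // ... upper block of code would go here ...
    mem = std::make_unique<Buffer>();
    g_escape(mem.get());               // just after the allocation
    // ... lower block of code would go here ...
    g_escape(&mem);                    // at the end, before destruction
    mem.reset();
    g_escape(&mem);                    // and once more after the reset
}
```

Calling pin_allocation() once results in four calls through g_escape, and the buffer pointer has escaped to code the optimizer cannot see between the two blocks.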

Efficiency penalty of initializing a struct/class within a loop

I've done my best to find an answer to this with no luck. Also, I've tested it and don't see any difference whatsoever in an optimized release build (there is a difference in debug)... still, I can't imagine why there is no difference, or how the optimizer is able to remove the penalty, and maybe someone knows what is happening internally.
If I create new instances of a simple class/struct within a loop, is there any penalty in efficiency for creating the class/struct on every loop iteration?
i.e.
struct mystruct
{
    inline mystruct(const double &initial) : myvalue(initial) {}
    double myvalue;
};
why does...
for(int i=0; i<big_int; ++i)
{
    mystruct a = mystruct(1.1);
}
take the same amount of real time as
for(int i=0; i<big_int; ++i)
{
    double s = 1.1;
}
?? Shouldn't there be some time required for the constructor/initialization?

This is easy-peasy work for a modern optimizer to handle.
As a programmer you might look at that constructor and struct and think it has to cost something. "The constructor code involves branching, passing arguments through registers/stack, popping from the stack, etc. The struct is a user-defined type, it must add more data somewhere. There's aliasing/indirection overhead for the const reference, etc."
Except the optimizer then has a go at your code, and it notices that the struct has no virtual functions, it has no objects that require a non-trivial constructor. The whole thing fits into a general-purpose register. And then it notices that your constructor is doing little more than assigning one variable to another. And it'll probably even notice that you're just calling it with a literal constant, which translates to a single move/store instruction to a register which doesn't even require any additional memory beyond the instruction.
It's all very magical, and compilers are sophisticated beasts, but they usually do this in multiple passes, and from your original code to intermediate representations, and from intermediate representations to machine code. To really appreciate and understand what they do, it's worth having a peek at the disassembly from time to time.
It's worth noting that C++ has been around for decades. As a successor to C, it originally was pushed mostly as an object-oriented language with hot concepts like encapsulation and information hiding. To promote a language where people start replacing public data members and manual initialization/destruction and things like that for simple accessor functions, constructors, destructors, it would have been very difficult to popularize the language if there was a measurable overhead in even a simple function call. So as magical as this all sounds, C++ optimizers have been doing this now for decades, squashing all that overhead you add to make things easier to maintain down to the same assembly as something which wouldn't be so easy to maintain.
So it's generally worth thinking of things like function calls and small structures as being basically free, since if it's worth inlining and squashing away all the overhead to zilch, optimizers will generally do it. Exceptions arise with indirect function calls: virtual methods, calls through function pointers, etc. But the code you posted is easy stuff for a modern optimizer to squash down.

C++ philosophy is that you should not "pay" (in CPU cycles or in memory bytes) for anything that you do not use. The struct in your example is nothing more than a double with a constructor tied to it. Moreover, the constructor can be inlined, bringing the overhead all the way down to zero.
If your struct had other parts to initialize, such as other fields or a table of virtual functions, there would be some overhead. The way your example is set up, however, the compiler can optimize out the constructor, producing an assembly output that boils down to a single assignment of a double.

Neither of your loops does anything. Dead code may be removed. Furthermore, there is no representational difference between a struct containing a single double and a primitive double. The compiler should be able to easily "see through" an inline constructor. C++ relies on optimisations of these things to allow its abstractions to compete with hand-written versions.
There is no reason for the performance to be different, and if it were, I would consider it a bug (apart from debug builds, where debug information could change the performance cost).
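The "no representational difference" point can be checked with a few type traits. This is a minimal sketch using the struct from the question with its member name fixed; same_size_as_double, sum_struct and sum_double are hypothetical helper names. The two static_asserts follow from the language rules, while equal size is what mainstream ABIs give you rather than something the standard promises.

```cpp
#include <cassert>
#include <type_traits>

struct mystruct {
    mystruct(const double& initial) : myvalue(initial) {}
    double myvalue;
};

// A user-provided converting constructor does not make the type any less
// trivially copyable or standard-layout.
static_assert(std::is_trivially_copyable<mystruct>::value,
              "mystruct can be copied like a plain double");
static_assert(std::is_standard_layout<mystruct>::value,
              "mystruct has the layout of a C struct");

// True on mainstream ABIs; treated here as an assumption, not a guarantee.
inline bool same_size_as_double() { return sizeof(mystruct) == sizeof(double); }

// Both loops perform exactly the same sequence of double additions.
inline double sum_struct(int n) {
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        mystruct a = mystruct(1.1);
        total += a.myvalue;
    }
    return total;
}

inline double sum_double(int n) {
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        double s = 1.1;
        total += s;
    }
    return total;
}
```

Comparing the optimized assembly of sum_struct and sum_double is a quick way to confirm whether your compiler treats them identically.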

These quotes from the C++ Standard may help to understand what optimization is permitted:
The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.
and also:
The least requirements on a conforming implementation are:
Access to volatile objects are evaluated strictly according to the rules of the abstract machine.
At program termination, all data written into files shall be identical to one of the possible results that execution of the program according to the abstract semantics would have produced.
The input and output dynamics of interactive devices shall take place in such a fashion that prompting output is actually delivered before a program waits for input. What constitutes an interactive device is implementation-defined.
These collectively are referred to as the observable behavior of the program.
To summarize: the compiler can generate whatever executable it likes so long as that executable performs the same I/O and access to volatile variables as the unoptimized version would. In particular, there are no requirements about timing or memory allocation.
In your code sample, the entire thing could be optimized out as it produces no observable behaviour. However, real-world compilers sometimes decide to leave in things that could be optimized out, if they think the programmer really wanted those operations to happen for some reason.
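If the goal is to time such a loop, one common trick is to make its result observable by storing it to a volatile object. The sketch below assumes that approach; g_sink and loop_with_observable_result are names invented for illustration. The volatile store has to happen, but the compiler remains free to compute the stored value in whatever way it likes, so this keeps the result alive rather than pinning down the exact instructions.

```cpp
#include <cassert>

// Hypothetical sink: a store to a volatile object is observable behaviour.
static volatile double g_sink = 0.0;

inline void loop_with_observable_result(int n) {
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        double s = 1.1;
        total += s;
    }
    g_sink = total;  // this store cannot be removed under the as-if rule
}
```

Without the final store, the whole function body produces no observable behaviour and may be compiled to nothing.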

@Ike's answer is exactly what I was getting at. However, if you are curious about this question, I very much recommend reading the answers of @dasblinkenlight, @Mankarse, and @Matt McNabb and the discussions below them, which get at the details of the situation. Thanks all.

Is writing to memory an observable behaviour?

I've looked at the standard but couldn't find any indication that simply writing to memory would be considered observable behaviour. If not, that would mean the compiled code need not actually write to that memory. If a compiler chose to optimize away such accesses, anything involving mapped memory, or shared memory, might not work.
1.9-8 seems to define a very limited observable behaviour but indicates an implementation may define more. Can one assume that any quality compiler would treat modifying memory as an observable behaviour? That is, it may not guarantee atomicity or ordering, but does guarantee that data will eventually be written.
So, have I overlooked something in the standard, or is the writing to memory merely something the compiler decides to do?
Statements from the current or C++0x standard are good. Please note I'm not talking about accessing memory through a function, I mean direct access, such as writing data to a pointer (perhaps retrieved via mmap or another library function).

This kind of thing is what volatile exists for. Else, writing to memory and never apparently reading from it is not observable behaviour. However, in the general case, it would be quite impossible for the optimizer to prove that you never read it back except in relatively trivial examples, so it's not usually an issue.

Can one assume that any quality compiler would treat modifying memory as an observable behaviour?
No. Volatile is meant for marking that. However, you cannot fully trust the compiler even after adding the volatile qualifier, at least as told by a 2008 paper: http://www.cs.utah.edu/~regehr/papers/emsoft08-preprint.pdf
EDIT:
From C standard (not C++) http://c0x.coding-guidelines.com/5.1.2.3.html
An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).

My reading of C99 is that unless you specify volatile, how and when the variable is actually accessed is implementation defined. If you specify volatile qualifier then code must work according to the rules of an abstract machine.
Relevant parts in the standard are: 6.7.3 Type qualifiers (volatile description) and 5.1.2.3 Program execution (the abstract machine definition).
I have known for some time that many compilers have heuristics to detect cases when a variable should be reread and when it is okay to use a cached copy. Volatile makes it clear to the compiler that every access to the variable should actually be an access to memory. Without volatile, it seems the compiler is free to never reread the variable.
And BTW wrapping the access in a function doesn't change that since a function even without inline might be still inlined by the compiler within the current compilation unit.
From your question below:
Assume I use an array on the heap (unspecified where it is allocated),
and I use that array to perform a calculation (temp space). The
optimizer sees that it doesn't actually need any of that space as it
can use strictly registers. Does the compiler nonetheless write the
temp values to the memory?
Per MSalters below:
It's not guaranteed, and unlikely. Consider a Static Single
Assignment optimizer. This figures out each possible write/read
dependency, and then assigns registers to optimize these dependencies.
As a side effect, any write that's not followed by a (possible) read
creates no dependencies at all, and is eliminated. In your example
("use strictly registers") the optimizer has satisfied all write/read
dependencies with registers, so it won't write to memory at all. All
reads produce the correct values, so it's a correct optimization.

zero Functor construction and overhead even with new and delete?

If I have a functor class with no state, but I create it from the heap with new, are typical compilers smart enough to optimize away the creation overhead entirely?
This question has come up when making a bunch of stateless functors. If they're allocated on the stack, does their 0 state class body mean that the stack really isn't changed at all? It seems it must in case you later take an address of the functor instance.
Same for heap allocation.
In that case, functors are always adding a (trivial, but non-zero) overhead in their creation. But maybe compilers can see whether the address is used and if not it can eliminate that stack allocation. (Or, can it even eliminate a heap allocation?)
But how about a functor that's created as a temporary?
#include <iostream>

struct GTfunctor
{
    inline bool operator()(int a, int b) { return a > b; }
};

int main()
{
    GTfunctor* f = new GTfunctor;
    GTfunctor g;
    std::cout << (*f)(2,1) << std::endl;
    std::cout << g(2,1) << std::endl;
    std::cout << GTfunctor()(2,1) << std::endl;
    delete f;
}
So in the concrete example above, the three lines each call the same functor in three different ways. In this example, is there any efficiency difference between the ways? Or is the compiler able to optimize each line all the way down to being a compute-less print statement?
Edit:
Most answers say that the compiler could never inline/eliminate the heap allocated functor. But is this really true as well? Most compilers (GCC, MS, Intel) have linktime optimization as well which could indeed do this optimization. (but does it?)

are typical compilers smart enough to optimize away the creation overhead entirely?
When you're creating them on the heap, I doubt whether the compiler is allowed to. IMO:
Invoking new implies invoking operator new.
operator new is a non-trivial function defined in the run-time library.
The compiler isn't allowed to decide that you didn't really mean to invoke such a function and to decide that as an optimization it will silently not invoke it.
When you're creating them on the stack, and not taking their address, then maybe ... or maybe not: my guess is that every object has a non-zero size, in order to occupy some memory, in order to have an identity, even when the object has no state apart from its identity.

Obviously, it depends on your compiler.
I would say
No compiler will optimize away the object on the heap. (This is because, as ChrisW says, compilers will never optimize away a call to new, which is almost certainly defined in another translation unit.)
Some compilers will optimize away a named object on the stack. I've known gcc to do this optimization quite often.
Most compilers will optimize away an unnamed object on the stack. This is one of the "standard" C++ optimizations, especially as more advanced C++ users tend to create lots of unnamed temporary variables.
Unfortunately, these are only rules of thumb. Optimizers are notoriously unpredictable; really the only way to know what your compiler is doing is to read the assembly output.

I highly doubt this type of optimization is allowed, but if your functor has no state, why would you want to initialize it on the heap? It should be just as easy to use it as a temporary.

A C++ object is always non-zero in size. "Empty base class optimization" allows empty base class to have zero size but that doesn't apply here.
I have not worked on any C++ optimizer, so whatever I say is just speculation. I think the 2nd and 3rd will be expanded inline easily and there will be no overhead, and no GTfunctor is created. The functor pointer, however, is a different story. In your example, it may seem simple enough and any optimizer should be able to eliminate the heap allocation, but in a non-trivial program, you may be creating the functors in one translation unit and using them in another. Or even in a different library for which the compiler/linker/loader/runtime system doesn't have the source code, where it is almost impossible to optimize. Given that optimizing it is not easy, the potential gain in performance is not great, and the number of cases where an empty functor is allocated on the heap is probably small, I think most optimizer programmers will probably not put this optimization high on their to-do list.

The compiler cannot optimize out a call to new or delete. It may however optimize out the variable created on the stack since it has no state.

Simple way to answer the heap question:
GTfunctor *f = new GTfunctor;
The value of f must not be null, so what should it be? And you also had:
GTfunctor *g = new GTfunctor;
Now the value of g must not equal the value of f, so what should each be?
Furthermore, neither f nor g may be equal to any other pointer obtained from new, unless some pointer elsewhere is somehow initialised to be equal to f or g, which (depending on the code that comes after) may involve examining what the entire rest of the program does.
Yes, if by local examination of the code the compiler can see that you never rely on any of these requirements, then it could perform a rewrite such that no heap allocation occurs. The problem is, if your code was that simple, you could probably do that rewrite yourself and end up with a more readable program anyway, e.g. your test program would look like your stack-based g example. So real programs would not benefit from such an optimisation in the compiler.
Presumably the reason you're doing this is because sometimes the functor does have data, depending on which type is chosen at runtime. So compile-time analysis cannot usefully work its magic here.
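The address requirements described in this answer can be written down as a small check. This is just a sketch: heap_functors_are_distinct is a hypothetical helper name, and operator() is marked const here, which the question's version is not. Note that an optimizer that can see the whole function is allowed to fold the comparisons and skip the allocations entirely, since the result is the same either way.

```cpp
#include <cassert>

struct GTfunctor {
    bool operator()(int a, int b) const { return a > b; }
};

// Even a stateless object occupies at least one byte.
static_assert(sizeof(GTfunctor) >= 1, "objects have non-zero size");

// Two live objects obtained from new must be non-null and distinct.
inline bool heap_functors_are_distinct() {
    GTfunctor* f = new GTfunctor;
    GTfunctor* g = new GTfunctor;
    const bool ok = (f != nullptr) && (g != nullptr) && (f != g)
                    && (*f)(2, 1) && (*g)(2, 1);
    delete f;
    delete g;
    return ok;
}
```

As long as the code only relies on these properties locally, the compiler can satisfy them without touching the heap.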

The C++ Standard states that each object (imho on the heap) must have a size of at least one byte, so it can be uniquely addressed.
Generating functors with new can lead to two problems:
The construction generally cannot be optimized away. New is a function with complex side effects (bad_alloc).
Because you address the functor indirectly the compiler may not be able to inline the function.
Chances are good that you will not see a sign of the functor, if you generate it on the stack.
Side note: The inline keyword is not necessary. Every function defined inside a class definition is implicitly inline.

The compiler can probably figure out that operator() doesn't use any member variables, and optimize it to the max. I wouldn't make any assumptions about the local or heap allocated variables, though.
Edit: When in doubt, turn on the assembly output option on your compiler and see what it's actually doing. No sense in listening to a bunch of idiots on the web when you can see the real answer for yourself.

The answer to your question has two aspects.
Does the compiler optimize away the heap allocation? I strongly doubt it, but I'm not a standard guy, so I would have to look it up.
Can the compiler optimize by inlining the object's operator()? Yes. As long as you don't declare the call operator virtual, even the pointer dereference isn't actually performed.
