Can creation of composite objects from temporaries be optimised away? - C++

I've asked a few questions that have touched on this issue, but I've been getting differing responses, so I thought it best to ask it directly.
Let's say we have the following code:
#include <utility>  // for std::move

// Silly examples of A and B, don't take so seriously,
// just keep in mind they're big and not dynamically allocated.
struct A { int x[1000]; A() { for (int i = 0; i != 1000; ++i) { x[i] = i * 2; } } };
struct B { int y[1000]; B() { for (int i = 0; i != 1000; ++i) { y[i] = i * 3; } } };
struct C
{
A a;
B b;
};
A create_a() { return A(); }
B create_b() { return B(); }
C create_c(A&& a, B&& b)
{
C c;
c.a = std::move(a);
c.b = std::move(b);
return c;
}
int main()
{
C x = create_c(create_a(), create_b());
}
Now ideally create_c(A&&, B&&) should be a no-op. Instead of the calling convention being that A and B are created and references to them passed on the stack, A and B should be created directly in the place of the return value, c. With NRVO, this would mean creating them directly into x, with no further work for create_c to do.
This would avoid the need to create copies of A and B.
Is there any way to allow/encourage/force this behavior from a compiler, or do optimizing compilers generally do this anyway? And will this only work when the compiler inlines the functions, or will it work across compilation units?
(How I think this could work across compilation units...)
If create_a() and create_b() took a hidden parameter of where to place the return value, they could place the results into x directly, which is then passed by reference to create_c() which needs to do nothing and immediately returns.
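To make that idea concrete, here is a hand-written sketch of the out-parameter style. The populate_* names are purely illustrative (not a real ABI), and for simplicity A and B are assumed here to be plain aggregates populated by free functions rather than by their constructors:
// Sketch only: a hand-written "return slot" style.
struct A { int x[1000]; };
struct B { int y[1000]; };
struct C { A a; B b; };

void populate_a(A& slot) { for (int i = 0; i != 1000; ++i) slot.x[i] = i * 2; }
void populate_b(B& slot) { for (int i = 0; i != 1000; ++i) slot.y[i] = i * 3; }

void populate_c(C& slot)
{
    populate_a(slot.a);   // write straight into the caller's object
    populate_b(slot.b);   // no temporaries, no copies, no moves
}

int main()
{
    C x;
    populate_c(x);
}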

There are different ways of optimizing the code that you have, but rvalue references are not one of them. The problem is that neither A nor B can be moved at no cost, since you cannot steal the contents of the object. Consider the following example:
template <typename T>
class simple_vector {
typedef T element_type;
typedef element_type* pointer_type;
pointer_type first, last, end_storage;
public:
simple_vector() : first(), last(), end_storage() {}
simple_vector( simple_vector const & rhs ) // not production ready, memory can leak from here!
: first( new element_type[ rhs.last - rhs.first ] ),
last( first + rhs.last-rhs.first ),
end_storage( last )
{
std::copy( rhs.first, rhs.last, first );
}
simple_vector( simple_vector && rhs ) // we can move!
: first( rhs.first ), last( rhs.last ), end_storage( rhs.end_storage )
{
rhs.first = rhs.last = rhs.end_storage = 0;
}
~simple_vector() {
delete [] first;
}
// rest of operations
};
In this example, as the resources are held through pointers, there is a simple way of moving the object (i.e. stealing the contents of the old object into the new one and leaving the old object in a destroyable but useless state): simply copy the pointers and reset them to null in the old object so that its destructor will not free the memory.
The problem with both A and B is that the actual memory is held in the object through an array, and that array cannot be moved to a different memory location for the new C object.
Of course, since you are using stack-allocated objects in the code, the old (N)RVO can be used by the compiler. When you write C c = { create_a(), create_b() }; the compiler can perform that optimization: it places c.a at the address of the object returned from create_a, and when compiling create_a it constructs the returned temporary directly at that same address. Effectively c.a, the object returned from create_a, and the temporary constructed inside create_a (the implicit this of the constructor) are the same object, avoiding two copies. The same can be done with c.b, avoiding that copying cost. If the compiler does inline your code, it will remove create_c and replace it with a construct similar to C c = { create_a(), create_b() };, so it can potentially optimize all copies away.
Note, on the other hand, that this optimization cannot be fully used when the C object is allocated dynamically, as in C* p = new C; p->a = create_a();. Since the destination is not on the stack, the compiler can only merge the temporary inside create_a with its return value; it cannot make that coincide with p->a, so a copy will still be needed. This is the advantage of rvalue references over (N)RVO, but as mentioned before you cannot effectively use rvalue references in your code example directly.
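For illustration, dropping create_c and aggregate-initializing C (reusing the question's A, B, and C) looks like this. Note this answer predates C++17; under C++17 rules the elision of the returned prvalues into the members is guaranteed rather than merely permitted:
A create_a() { return A(); }
B create_b() { return B(); }

int main()
{
    // Each member is initialized directly from the prvalue returned by its factory,
    // so the compiler can construct it in place inside x (mandatory in C++17).
    C x = { create_a(), create_b() };
}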

There are two kinds of optimization which can apply in your case:
Function Inlining (In the case of A, B, and C (and the A and B it contains))
Copy elision (C (and the A and B it contains) only, because you returned C by value)
For a function this small, it's probably going to be inlined. Most any compiler will do it if the function exists in the same translation unit, and good compilers like MSVC++ and G++ (and I think LLVM, but I'm not sure on that one) have whole-program-optimization settings which will do it even across translation units. If the function is inlined, then yes, the function call (and the copy that comes with it) isn't going to occur at all.
If for some reason the function doesn't get inlined (e.g. you used __declspec(noinline) on MSVC++), then you're still going to be eligible for the Named Return Value Optimization (NRVO), which good C++ compilers (again, MSVC++, G++, and I think LLVM) all implement. Basically, the standard says that compilers are allowed to skip the copy on return if they can avoid it, and they will usually emit code that does. There are some things you can do to deactivate NRVO, but for the most part it's a pretty safe optimization to rely on.
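One classic pattern that tends to defeat NRVO, for example, is returning one of several named objects (sketch, reusing the question's C):
// Two candidate objects for the return value: the compiler cannot construct both
// in the return slot, so it typically falls back to a real copy (or move) here.
C make(bool flag)
{
    C c1, c2;
    return flag ? c1 : c2;
}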
Finally, profile. If you see a performance problem, then figure out where it actually is. Otherwise I'd write things in the idiomatic way and replace them with more performant constructs if and only if you need to.

Isn't the obvious thing to do to give C a constructor and then say:
C create_c(const A & a, const B & b)
{
return C( a, b );
}
which has lots of possibilities for optimisation. Or indeed get rid of the create function. I don't think this is a very good motivating example.
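A sketch of what that might look like, reusing the question's A and B; the constructor here is an addition to the question's C, and note that the A and B arguments are still copied into the members, the point being that the returned C itself is not copied:
struct C
{
    A a;
    B b;
    C(const A& a_, const B& b_) : a(a_), b(b_) {}
};

C create_c(const A& a, const B& b)
{
    return C(a, b);   // RVO: the temporary C is constructed directly in the caller's storage
}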

Related

I want to have a good understanding of the use of return *this in c++

I want to have a good understanding of the use of return *this. Take, for instance:
coord& coord::operator=(const coord& other)
{
if (this == &other)
}
My concern here is the use of this and the use of return *this
this is a pointer to the object itself. Before we copy the other object, we should know whether the two objects are the same; if they are, we don't need to do anything.
It's quite simple:
coord& coord::operator=(const coord& other)
{
// the test on addresses below is to make sure we're not doing useless work.
// comparing pointers is fast.
// making sure the addresses are different, we know we are not copying
// the object onto itself.
// It is both an optimization, and can, for certain memory operations, avoid bugs
// although modern compilers do avoid the memcpy/memmove bug, which was a real pain a
// while back. Some CPUs _will_ puke when asked to copy when source == destination.
if (this != &other)
{
// usually member-by member copy goes here.
}
// returning a reference is standard for copy operations.
// it allows for such syntaxes as:
// coord a, b, c;
//
// a = b = c;
//
// or
// if ((a = b) == c) {}
//
// etc...
//
return *this;
}
Note: this answer only refers to the question about comparing addresses and the need to return a reference. If your class requires memory allocation, or if copying any of the members may throw, you should check this post, as pointed out by Lightness Races in Orbit: What is the copy-and-swap idiom?
Assuming a and b are of type coord, and a simple assignment a = b is performed, the return *this in the body of coord &coord::operator=(const coord &) returns a reference to a.
The effect of this is to permit chaining of assignments of coord objects. For example,
a = b = c;
is equivalent to (since assignment is right to left associative)
a = (b = c);
The expression b = c is therefore turned by the compiler into a call of b.operator=(c), which returns a reference to b. Similarly the assignment to a is turned into a call of operator=(), so the above is equivalent to
a.operator=(b.operator=(c));
which (assuming operator=() is correctly implemented to perform a member-wise assignment) assigns the value of c to both a and b. The reference returned by a.operator=() is, in this case, ignored by the compiler.
The if (this == &other) test is an old technique to handle self-assignment (e.g. a = a) as a special case (e.g. by doing nothing). Such a thing is generally discouraged for various reasons, such as behaving differently if the coord type has a unary operator&(). As noted by "Lightness Races in Orbit" in a comment to another answer, it is generally considered preferable to use the copy and swap idiom instead - refer for example to What is the copy-and-swap idiom?.
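For reference, a minimal copy-and-swap sketch for a hypothetical coord holding two ints; for such a trivial type the idiom is overkill, but it shows the shape:
#include <utility>

class coord
{
    int x, y;
public:
    coord(int x_ = 0, int y_ = 0) : x(x_), y(y_) {}
    friend void swap(coord& a, coord& b)   // member-wise, non-throwing swap
    {
        std::swap(a.x, b.x);
        std::swap(a.y, b.y);
    }
    coord& operator=(coord other)          // the parameter is already the copy
    {
        swap(*this, other);                // no self-assignment test needed
        return *this;
    }
};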

Does postblit constructor differ only by source from copy constructor?

If I understand correctly, a postblit constructor in D always starts with a bitwise copy, and then its user-defined body runs.
But when I look at the body of a postblit constructor, it is very similar to a C++ copy constructor; the only difference is that in C++ the source is some other object, while in D it is this (the object itself).
Am I correct?
Eh, close. I think you have a pretty good handle on it, but to spell it out:
a = b; (iff a and b are both the same type, a struct) translates into:
memcpy(&a, &b, b.sizeof); // bitwise copy
a.__postblit(); // call the postblit on the destination only (iff this(this) is defined on the type!)
So you don't have to explicitly assign any variables in the postblit (they are all copied automatically) and also cannot use it to implement move semantics (you don't have access to the source).
The place I most often use the postblit is when the struct is a pointer to another object, so I can increase the refcount:
struct S {
SomeObject* wrapped;
this(this) { if(wrapped) wrapped.addReference(); }
~this() { if(wrapped) wrapped.releaseReference(); }
}
This only works with references since otherwise you'd be incrementing a copy of the variable!
You can (but shouldn't) also use it to perform deep copies:
struct S {
string content;
this(this) { content = content.idup; }
}
But that's actually a bad idea since struct assignment is supposed to be universally cheap in D and deep copies aren't cheap. There's also generally no need anyway, since the garbage collector handles cases like double free where you might want this in C++.
The other case where I use it a lot in D is actually to disable it:
struct S {
@disable this(this);
}
S a, b;
a = b; // compile error, b is not copyable
That's different than just not implementing a postblit at all, which leaves you with the automatic implementation of memcpy. This makes assignment an outright compile error, which you can use to funnel the user toward another method, for move semantics for example:
struct S {
int* cool;
@disable this(this);
S release() { auto n = cool; cool = null; return S(n); }
}
Since a = b is prohibited, we can now force the user to use the .release method, which does our moving, when they want to reassign it.

Is RVO (Return Value Optimization) applicable for all objects?

Is RVO (Return Value Optimization) guaranteed or applicable for all objects and situations in C++ compilers (especially GCC)?
If the answer is "no", what are the conditions of this optimization for a class/object? How can I force or encourage the compiler to do RVO on a specific returned value?
Return Value Optimization can always be applied; what cannot be universally applied is Named Return Value Optimization. Basically, for the optimization to take place, the compiler must know what object is going to be returned at the place where the object is constructed.
In the case of RVO (where a temporary is returned) that condition is trivially met: the object is constructed in the return statement, and well, it is returned.
In the case of NRVO, you would have to analyze the code to understand whether the compiler can know or not that information. If the analysis of the function is simple, chances are that the compiler will optimize it (single return statement that does not contain a conditional, for example; multiple return statements of the same object; multiple return statements like T f() { if (condition) { T r; return r; } else { T r2; return r2; } } where the compiler knows that r or r2 will be returned...)
Note that you can only assume the optimization in simple cases; in particular, the example in Wikipedia could actually be optimized by a smart enough compiler:
std::string f( bool x ) {
std::string a("a"), b("b");
if ( x ) return a;
else return b;
}
Can be rewritten by the compiler into:
std::string f( bool x ) {
if ( x ) {
std::string a("a"), b("b");
return a;
} else {
std::string a("a"), b("b");
return b;
}
}
And the compiler can know at this time that in the first branch a is to be constructed in place of the returned object, and in the second branch the same applies to b. But I would not count on that. If the code is complex, assume that the compiler will not be able to produce the optimization.
EDIT: There is one case that I have not mentioned explicitly: the compiler is not allowed (and in most cases, even if it were allowed, it could not possibly do it) to optimize away the copy from a function argument to the return value:
T f( T value ) { return value; } // Cannot be optimized away --but can be converted into
// a move operation if available.
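If you want to see what a particular compiler actually does in a given case, one simple check is to instrument the special member functions and watch the output; a minimal sketch:
#include <iostream>

struct Probe
{
    Probe()             { std::cout << "construct\n"; }
    Probe(const Probe&) { std::cout << "copy\n"; }
    Probe(Probe&&)      { std::cout << "move\n"; }
};

Probe make()
{
    Probe p;            // named object: NRVO candidate
    return p;
}

int main()
{
    Probe q = make();   // prints only "construct" when the copy/move is elided
}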
Is RVO (Return Value Optimization) guaranteed for all objects in gcc compilers?
No optimisation is ever guaranteed (though RVO is fairly dependable, there do exist some cases that throw it off).
If answer is "no", what is the conditions of this optimization for a class/object?
An implementation detail that's quite deliberately abstracted from you.
Neither know nor care about this, please.
To Jesper: if the object to be constructed is big, avoiding the copy might be necessary (or at the very least highly desirable).
If RVO happens, the copy is avoided and you need not write any more lines of code.
If it doesn't, you'll have to do it manually, writing extra scaffolding yourself. And this will probably involve designating a buffer in advance, forcing you to write a constructor for this empty (probably invalid, you can see how this is not clean) object and a method to ‘construct’ this invalid object.
So ‘It can reduce my lines of code if it's guaranteed. Isn't it?' does not mean that Masoud is a moron. Unfortunately for him, however, RVO is not guaranteed. You have to test whether it happens and, if it doesn't, write the scaffolding and pollute your design. It can't be helped.
Move semantics (a new feature of C++11) are a solution to your problem: they allow Type(Type&& r) (the move constructor) to be used instead of Type(const Type& r) (the copy constructor).
For example:
#include <cstring>
#include <iostream>

class String {
public:
char *buffer;
String(const char *s) {
int n = strlen(s) + 1;
buffer = new char[n];
memcpy(buffer, s, n);
}
~String() { delete [] buffer; }
String(const String &r) {
// traditional copy ...
}
String(String &&r) {
buffer = r.buffer; // O(1), No copying, saves time.
r.buffer = 0;
}
};
String hello(bool world) {
if (world) {
return String("Hello, world.");
} else {
return String("Hello.");
}
}
int main() {
String foo = hello(true);
std::cout <<foo.buffer <<std::endl;
}
And this will not trigger the copy constructor.
I don't have a yes or no answer, but you say that you can write fewer lines of code if the optimization you're looking for is guaranteed.
If you write the code you need to write, the program will always work, and if the optimization is there, it will work faster. If there is indeed a case where the optimization "fills in the blank" in the logic rather than the mechanics of the code and makes it work, or outright changes the logic, that seems like a bug that I'd want fixed rather than an implementation detail I'd want to rely on or exploit.

Return value copying issue (to improve debug timing) -- What's the solution here?

The most interesting C++ question I've encountered recently goes as follows:
We determined (through profiling) that our algorithm spends a lot of time in debug mode in MS Visual Studio 2005 with functions of the following type:
MyClass f(void)
{
MyClass retval;
// some computation to populate retval
return retval;
}
As most of you probably know, the return here calls a copy constructor to pass out a copy of retval and then the destructor on retval. (Note: the reason release mode is very fast for this is because of the return value optimization. However, we want to turn this off when we debug so that we can step in and nicely see things in the debugger IDE.)
So, one of our guys came up with a cool (if slightly flawed) solution to this, which is to create a converting constructor:
MyClass::MyClass(MyClass *t)
{
// construct "*this" by transferring the contents of *t to *this
// the code goes something like this
this->m_dataPtr = t->m_dataPtr;
// then clear the pointer in *t so that its destruction still works
// but becomes 'trivial'
t->m_dataPtr = 0;
}
and also changing the function above to:
MyClass f(void)
{
MyClass retval;
// some computation to populate retval
// note the ampersand here, which invokes the converting constructor just defined
return &retval;
}
Now, before you cringe (which I am doing as I write this), let me explain the rationale. The idea is to create a converting constructor that basically does a "transfer of contents" to the newly constructed variable. The savings come from the fact that we're no longer doing a deep copy, but simply transferring the memory by its pointer. The code goes from a 10-minute debug time to a 30-second debug time, which, as you can imagine, has a huge positive impact on productivity. Granted, the return value optimization does a better job in release mode, but at the cost of not being able to step in and watch our variables.
Of course, most of you will say "but this is abuse of a converting constructor, you shouldn't be doing this kind of stuff" and I completely agree. Here's an example of why you shouldn't be doing it, either (this actually happened):
void BigFunction(void)
{
MyClass *SomeInstance = new MyClass;
// populate SomeInstance somehow
g(SomeInstance);
// some code that uses SomeInstance later
...
}
where g is defined as:
void g(MyClass &m)
{
// irrelevant what happens here.
}
Now this happened accidentally, i.e., the person who called g() should not have passed in a pointer when a reference was expected. However, there was no compiler warning (of course). The compiler knew exactly how to convert, and it did so. The problem is that the call to g() (because we've passed it a MyClass* when it was expecting a MyClass&) invoked the converting constructor, which is bad, because it set the internal pointer in SomeInstance to 0 and rendered SomeInstance useless for the code that occurred after the call to g()... and time-consuming debugging ensued.
So, my question is, how do we gain this speedup in debug mode (which has as direct debugging time benefit) with clean code that doesn't open the possibility to make such other terrible errors slip through the cracks?
I'm also going to sweeten the pot and offer my first bounty on this one once it becomes eligible (50 pts).
You need to use something called "swaptimization".
MyClass f(void)
{
MyClass retval;
// some computation to populate retval
return retval;
}
int main() {
MyClass ret;
f().swap(ret);
}
This will prevent a copy and keep the code clean in all modes.
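For this to pay off, MyClass needs a cheap swap member that just exchanges its internals. A hedged sketch, with m_dataPtr assumed from the question and a plain int* as a stand-in for the real contents:
#include <utility>

class MyClass
{
    int* m_dataPtr;                    // stand-in for the real (expensive) contents
public:
    MyClass() : m_dataPtr(0) {}
    void swap(MyClass& other)          // O(1): just exchange pointers, no deep copy
    {
        std::swap(m_dataPtr, other.m_dataPtr);
    }
};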
You can also try the same trick as auto_ptr, but that's more than a little iffy.
If your definition of g is written the same as in your code base, I'm not sure how it compiled, since the compiler isn't allowed to bind unnamed temporaries to non-const references. This may be a bug in VS2005.
If you make the converting constructor explicit then you can use it in your function(s) (you would have to say return MyClass(&retval);) but it won't be allowed to be called in your example unless the conversion was explicitly called out.
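A sketch of that suggestion, keeping the question's m_dataPtr member as a plain pointer stand-in:
class MyClass
{
    int* m_dataPtr;                    // stand-in for the real contents
public:
    MyClass() : m_dataPtr(0) {}
    explicit MyClass(MyClass* t)       // explicit: no silent conversion in calls like g(SomeInstance)
        : m_dataPtr(t->m_dataPtr)
    {
        t->m_dataPtr = 0;
    }
};

MyClass f()
{
    MyClass retval;
    // ... populate retval ...
    return MyClass(&retval);           // the "move" now has to be spelled out
}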
Alternatively, move to a C++11 compiler and use full move semantics.
(Do note that the actual optimization used is Named Return Value Optimization or NRVO).
The problem is occurring because you're using MyClass* as a magic device, sometimes but not always. Solution: use a different magic device.
struct stuff { };  // stand-in for the data actually held by MyClass (assumption for the sketch)
class MyClass;
class TempClass { //all private except destructor, no accidental copies by callees
friend MyClass;
stuff* m_dataPtr; //unfortunately requires duplicate data
//can't really be tricked due to circular dependancies.
TempClass() : m_dataPtr(NULL) {}
TempClass(stuff* p) : m_dataPtr(p) {}
TempClass(const TempClass& p) : m_dataPtr(p.m_dataPtr) {}
public:
~TempClass() {delete m_dataPtr;}
};
class MyClass {
stuff* m_dataPtr;
public:
MyClass() : m_dataPtr(new stuff()) {}   // default constructor needed once other constructors are declared
MyClass(const MyClass& b) {
m_dataPtr = new stuff();                // placeholder: a real deep copy of *b.m_dataPtr would go here
}
MyClass(TempClass& b) {
m_dataPtr = b.m_dataPtr ;
b.m_dataPtr = NULL;
}
~MyClass() {delete m_dataPtr;}
//be sure to overload operator= too.
TempClass f(void) //note: returns hack. But it's safe
{
MyClass retval;
// some computation to populate retval
return retval;
}
operator TempClass() {
TempClass r(m_dataPtr);
m_dataPtr = nullptr;
return r;
}
};
Since TempClass is almost all private (friending MyClass), other objects cannot create or copy a TempClass. This means the hack can only be created by your special functions when clearly told to, preventing accidental usage. Also, since this doesn't use pointers, memory can't be accidentally leaked.
Move semantics have been mentioned, you've agreed to look them up for education, so that's good. Here's a trick they use.
There's a function template std::move which turns an lvalue into an rvalue reference, that is to say it gives "permission" to move from an object[*]. I believe you can imitate this for your class, although I won't make it a free function:
struct MyClass;
struct MovableMyClass {
MyClass *ptr;
MovableMyClass(MyClass *ptr) : ptr(ptr) {}
};
struct MyClass {
int *m_dataPtr;                // stand-in for the real contents (assumption for the sketch)
MyClass() : m_dataPtr(0) {}    // needed: declaring other constructors suppresses the default one
MyClass(const MovableMyClass &tc) {
// unfortunate, we need const reference to bind to temporary
MovableMyClass &t = const_cast<MovableMyClass &>(tc);
this->m_dataPtr = t.ptr->m_dataPtr;
t.ptr->m_dataPtr = 0;
}
MovableMyClass move() {
return MovableMyClass(this);
}
};
MyClass f(void)
{
MyClass retval;
return retval.move();
}
I haven't tested this, but something along those lines. Note the possibility of doing something const-unsafe with a MovableMyClass object that actually is const, but it should be easier to avoid ever creating one of those than it is to avoid creating a MyClass* (which you've found out is quite difficult!)
[*] Actually I'm pretty sure I've over-simplified that to the point of being wrong, it's actually about affecting what overload gets chosen rather than "turning" anything into anything else as such. But causing a move instead of a copy is what std::move is for.
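In C++11 itself the same effect is spelled with std::move; a minimal sketch, reusing the question's MyClass (assumed here to have a C++11 move constructor), would be:
#include <utility>

MyClass f()
{
    MyClass retval;
    // ... populate retval ...
    return std::move(retval);  // forces the move constructor; note this also inhibits NRVO,
                               // so once a real move constructor exists,
                               // plain `return retval;` is usually better
}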
A different approach, given your special scenario:
Change MyClass f(void) (or operator+) to something like the following:
MyClass f(void)
{
MyClass c;
inner_f(c);
return c;
}
And let inner_f(c) hold the actual logic:
#ifdef TESTING
# pragma optimize("", off)
#endif
inline void inner_f(MyClass& c)
{
// actual logic here, setting c to whatever needed
}
#ifdef TESTING
# pragma optimize("", on)
#endif
Then, create an additional build configurations for this kind of testing, in which TESTING is included in the preprocessor definitions.
This way, you can still take advantage of RVO in f(), but the actual logic will not be optimized on your testing build. Note that the testing build can either be a release build or a debug build with optimizations turned on. Either way, the sensitive parts of the code will not be optimized (you can use the #pragma optimize in other places too, of course - in the code above it only affects inner_f itself, and not code called from it).
Possible solutions
Set higher optimization options for the compiler so it optimizes out the copy construction
Use heap allocation and return pointers or pointer wrappers, preferably with garbage collection
Use the move semantics introduced in C++11; rvalue references, std::move, move constructors
Use some swap trickery, either in the copy constructor or the way DeadMG did, but I can't recommend them in good conscience. An inappropriate copy constructor like that could cause problems, and the latter is a bit ugly and needs easily destructible default objects, which might not be true for all cases.
+1: Check and optimize your copy constructors, if they take so long then something isn't right about them.
I would prefer to simply have the calling function pass the object in by reference when MyClass is too big to copy:
void f(MyClass &retval) // <--- no worries !
{
// some computation to populate retval
}
Just simple KISS principle.
Okay, I think I have a solution to bypass the Return Value Optimization in release mode, but it depends on the compiler and is not guaranteed to work. It is based on this.
MyClass f (void)
{
MyClass retval;
MyClass dummy;
// ...
volatile bool b = true;
return b ? retval : dummy;
}
As for why the copy construction takes so long in DEBUG mode, I have no idea. The only possible way to speed it up while remaining in DEBUG mode is to use rvalue references and move semantics. You already discovered move semantics with your "move" constructor that accepts a pointer. C++11 gives a proper syntax for this kind of move semantics. Example:
// Suppose MyClass has a pointer to something that would be expensive to clone.
// With move construction we simply move this pointer to the new object.
MyClass (MyClass&& obj) :
ptr (obj.ptr)
{
// We set the source object to some trivial state so it is easy to delete.
obj.ptr = NULL;
}
MyClass& operator = (MyClass&& obj)
{
// Here we simply swap the pointer so the old object will be destroyed instead of the temporary.
std::swap(ptr, obj.ptr);
return *this;
}

Avoiding need for #define with expression templates

With the following code, "hello2" is not displayed as the temporary string created on Line 3 dies before Line 4 is executed. Using a #define as on Line 1 avoids this issue, but is there a way to avoid this issue without using #define? (C++11 code is okay)
#include <iostream>
#include <string>
class C
{
public:
C(const std::string& p_s) : s(p_s) {}
const std::string& s;
};
int main()
{
#define x1 C(std::string("hello1")) // Line 1
std::cout << x1.s << std::endl; // Line 2
const C& x2 = C(std::string("hello2")); // Line 3
std::cout << x2.s << std::endl; // Line 4
}
Clarification:
Note that I believe Boost uBLAS stores references; this is why I don't want to store a copy. If you suggest that I store by value, please explain why Boost uBLAS is wrong and why storing by value will not affect performance.
Expression templates that do store by reference typically do so for performance, but with the caveat that they only be used as temporaries.
Taken from the documentation of Boost.Proto (which can be used to create expression templates):
Note An astute reader will notice that the object y defined above will be left holding a dangling reference to a temporary int. In the sorts of high-performance applications Proto addresses, it is typical to build and evaluate an expression tree before any temporary objects go out of scope, so this dangling reference situation often doesn't arise, but it is certainly something to be aware of. Proto provides utilities for deep-copying expression trees so they can be passed around as value types without concern for dangling references.
In your initial example this means that you should do:
std::cout << C(std::string("hello2")).s << std::endl;
That way the C temporary never outlives the std::string temporary. Alternatively you could make s a non reference member as others pointed out.
Since you mention C++11, in the future I expect expression trees to store by value, using move semantics to avoid expensive copying and wrappers like std::reference_wrapper to still give the option of storing by reference. This would play nicely with auto.
A possible C++11 version of your code:
class C
{
public:
explicit
C(std::string const& s_): s { s_ } {}
explicit
C(std::string&& s_): s { std::move(s_) } {}
std::string const&
get() const& // notice lvalue *this
{ return s; }
std::string
get() && // notice rvalue *this
{ return std::move(s); }
private:
std::string s; // not const to enable moving
};
This would mean that code like C("hello").get() would only allocate memory once, but still play nice with
std::string clvalue("hello");
auto c = C(clvalue);
std::cout << c.get() << '\n'; // no problem here
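and, for the temporary case mentioned above, the rvalue-qualified overload hands the string out by moving it:
// The object expression is a prvalue, so the && overload of get() is chosen
// and the stored string is moved out rather than copied.
std::cout << C(std::string("hello")).get() << '\n';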
but is there a way to avoid this issue without using #define?
Yes.
Define your class as: (don't store the reference)
class C
{
public:
C(const std::string & p_s) : s(p_s) {}
const std::string s; //store the copy!
};
Store the copy!
Demo : http://www.ideone.com/GpSa2
The problem with your code is that std::string("hello2") creates a temporary that only lives until the end of the full expression on Line 3; after that the temporary is destroyed, but x2.s still refers to it (the dead object).
After your edit:
Storing by reference is dangerous and error prone sometimes. You should do it only when you are 100% sure that the referenced object will outlive the reference.
C++ strings are heavily optimized. In copy-on-write implementations, copies of the same string all refer to the same buffer until one of them is modified. To test it, you can overload operator new(size_t) and put a debug statement there; for multiple copies of the same string, you will see that the memory allocation happens only once.
Your class definition should not store the string by reference, but by value, as:
class C {
const std::string s; // s is NOT a reference now
};
If this question is meant in a general sense (not specific to string), then the best way is to use dynamic allocation.
class C {
MyClass *p;
public:
C() : p (new MyClass()) {} // just an example, can be allocated from outside also
~C() { delete p; }
};
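Note that the raw-pointer sketch above also needs copy control (rule of three) to be safe; a C++11 variant using a smart pointer sidesteps that, for example:
#include <memory>

struct MyClass { /* stand-in for the real class from the snippet above */ };

class C {
    std::unique_ptr<MyClass> p;   // owns the object; move-only, so no accidental
                                  // shallow copies and no hand-written destructor
public:
    C() : p(new MyClass()) {}
};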
Without looking at BLAS, expression templates typically make heavy use of temporary objects of types you aren't supposed to even know exist. If Boost is storing references like this within theirs, then they would suffer the same problem you see here. But as long as those temporary objects remain temporary, and the user doesn't store them for later, everything is fine, because the temporaries they reference remain alive for as long as the temporary objects do. The trick is that you perform a deep copy when the intermediate object is turned into the final object that the user stores. You've skipped this last step here.
In short, it's a dangerous move, which is perfectly safe as long as the user of your library doesn't do anything foolish. I wouldn't recommend making use of it unless you have a clear need, and you're well aware of the consequences. And even then, there might be a better alternative, I've never worked with expression templates in any serious capacity.
As an aside, since you tagged this C++0x, auto x = a + b; seems like it would be one of those "foolish" things users of your code can do to make your optimization dangerous.