I recently modified a program to use virtual functions (in place of a sequence of if-else conditions with static calls). The modified program runs 8% slower than the original. That seems like too high a cost for using virtual functions, so I must be doing something inefficient in the way I set up the class hierarchy and virtual functions, but I'm at a loss for how to track down the problem. (I see similar performance degradation using both clang on my Mac and gcc on Linux.)
The program is used to study different community detection algorithms. The program uses a nested loop to apply a series of user-specified objective functions to a variety of (graph, partition) pairs.
Here is a rough outline of the original code
int main(int argc, char* argv[]) {
bool use_m1;
bool use_m2;
...
bool use_m10;
// set the various "use" flags based on argv
for (Graph& g : graphsToStudy()) {
for (Partition& p : allPartitions()) {
if (use_m1) {
M1::evaluate(g, p);
}
if (use_m2) {
M2::evaluate(g,p);
}
// and so on
}
}
}
To make the code easier to maintain, I created a class structure for the different objective functions, and iterated through an array of pointers:
class ObjectiveFunction {
public:
virtual double eval(Graph& g, Partition& p) = 0;
};
class ObjFn1 : public ObjectiveFunction {
public:
virtual double eval(Graph& g, Partition& p) {
return M1::evaluate(g,p);
}
};
class ObjFn2 : public ObjectiveFunction {
public:
virtual double eval(Graph& g, Partition& p) {
return M2::evaluate(g,p);
}
};
int main(int argc, char* argv[]) {
vector<ObjectiveFunction*> funcs;
fill_funcs_based_on_opts(funcs, argc, argv);
for (Graph& g : graphsToStudy()) {
for (Partition& p : allPartitions()) {
// funcs contains one object for each function selected by user.
for (ObjectiveFunction* fp : funcs) {
fp->eval(g, p);
}
}
}
}
Given that generating graphs and partitions, as well as the objective functions themselves, are moderately computationally intensive, the addition of the virtual function call should be almost unnoticeable. Any ideas what I may have done wrong, or how to track it down? I tried using callgrind, but it has not given me any insights.
Perhaps I am just incorrectly interpreting the output of callgrind_annotate. In the example below, Neo::Context::evaluatePartition is analogous to ObjFn1::eval in the example above.
Why is this function listed four different times with different source files? This method is only ever called from function main in timeMetrics.cpp.
What does src/lib/PartitionIterator.h:main refer to? There is no main function in PartitionIterator.h.
Why does 414,219,420 appear twice in the source code listing for evaluatePartition? Isn't the first number supposed to represent the overhead of the function call?
35,139,513,913 PROGRAM TOTALS
17,029,020,600 src/lib/metrics/Neo.h:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool) [bin/timeMetrics_v]
7,168,741,865 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/vector:gvcd::Partition<unsigned int, unsigned int>::buildMembershipList()
4,418,473,884 src/lib/Partition.h:gvcd::Partition<unsigned int, unsigned int>::buildMembershipList() [bin/timeMetrics_v]
1,459,239,657 src/lib/PartitionIterator.h:main
1,288,682,640 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/vector:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool)
1,058,560,740 src/lib/Partition.h:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool)
1,012,736,608 src/perfEval/timeMetrics.cpp:main [bin/timeMetrics_v]
443,847,782 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/vector:main
368,372,912 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/memory:gvcd::Partition<unsigned int, unsigned int>::buildMembershipList()
322,170,738 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/ostream:main
92,048,760 src/lib/SmallGraph.h:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool)
84,549,144 ???:szone_free_definite_size [/usr/lib/system/libsystem_malloc.dylib]
54,212,938 ???:tiny_free_list_add_ptr [/usr/lib/system/libsystem_malloc.dylib]
. virtual double
414,219,420 evaluatePartition(const Partition <VertexID, SetBitmap> &p, bool raw = false) {
414,219,420 uint_wn_t raw_answer = Neo::evaluatePartition(*(this->g), p);
. return (double) (raw ? raw_answer : max_neo - raw_answer);
. }
. }; // end Context
Let's fix the obvious first:
In both versions you do this:
foreach (Graph g : graphsToStudy()) {
foreach (Partition p : allPartitions()) {
Unless Graph and Partition are small and cheap to copy, most of your work will be spent here. Iterate by reference instead:
foreach (Graph& g : graphsToStudy()) {
// ^
foreach (Partition& p : allPartitions()) {
// ^
My second concern: this does not seem like the right use of virtual functions. Your original code looks totally fine for this use case, where multiple versions of evaluate() are called on each (g, p) pair.
Now if you only called one of the evaluate() functions then it might be a better use case, but then you would no longer need that inner loop:
foreach (ObjectiveFunction* fp : funcs) {
It's expensive because you're actually using polymorphism, which defeats the branch predictor.
It may help the branch predictor if you replace collection iteration with an intrinsic linked list:
class ObjectiveFunction
{
ObjectiveFunction* m_next;
virtual double evaluate(Graph& g, Partition& p) = 0;
protected:
ObjectiveFunction(ObjectiveFunction* next = nullptr) : m_next(next) {}
// for gcc use __attribute__((always_inline))
// for MSVC use __forceinline
void call_next(Graph& g, Partition& p)
{
if (m_next) m_next->eval(g, p);
}
public:
virtual void eval(Graph& g, Partition& p) = 0;
};
Now, instead of one line of code inside the loop reaching many different functions, the call_next() function (which should be the last step of each individual eval override) should be inlined into each of those overrides, and at runtime, each inlined copy of that indirect-call instruction will repeatedly call just one function, resulting in 100% branch prediction.
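To make the chaining concrete, here is a minimal sketch of how two derived classes might use call_next(). Graph and Partition are empty stand-ins, the trace vector replaces the real M1/M2 evaluation, and run_chain() is a hypothetical helper; the sketch also omits the private evaluate() declaration from the class above. Whether call_next() actually gets inlined depends on the compiler and flags.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the real Graph and Partition types.
struct Graph {};
struct Partition {};

class ObjectiveFunction {
    ObjectiveFunction* m_next;
protected:
    explicit ObjectiveFunction(ObjectiveFunction* next = nullptr) : m_next(next) {}
    // Called as the last step of every override. Once inlined, each override
    // carries its own indirect call, which always reaches the same target.
    void call_next(Graph& g, Partition& p) {
        if (m_next) m_next->eval(g, p);
    }
public:
    virtual ~ObjectiveFunction() = default;
    virtual void eval(Graph& g, Partition& p) = 0;
};

// Toy objective functions that only record the order in which they ran.
struct ObjFn1 : ObjectiveFunction {
    std::vector<int>& trace;
    explicit ObjFn1(std::vector<int>& t, ObjectiveFunction* next = nullptr)
        : ObjectiveFunction(next), trace(t) {}
    void eval(Graph& g, Partition& p) override {
        trace.push_back(1);  // stands in for M1::evaluate(g, p)
        call_next(g, p);
    }
};

struct ObjFn2 : ObjectiveFunction {
    std::vector<int>& trace;
    explicit ObjFn2(std::vector<int>& t, ObjectiveFunction* next = nullptr)
        : ObjectiveFunction(next), trace(t) {}
    void eval(Graph& g, Partition& p) override {
        trace.push_back(2);  // stands in for M2::evaluate(g, p)
        call_next(g, p);
    }
};

// Builds the chain ObjFn1 -> ObjFn2 and evaluates it once.
std::vector<int> run_chain() {
    std::vector<int> trace;
    Graph g;
    Partition p;
    ObjFn2 second(trace);
    ObjFn1 first(trace, &second);
    first.eval(g, p);  // the loop body only ever calls the head of the chain
    assert(trace.size() == 2);
    return trace;
}
```

In the real program the innermost loop over funcs would disappear: the (graph, partition) loop would call eval on the head of the chain and each function would forward to the next.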
Where I can, I prefer static over dynamic dispatch -- dynamic dispatch can cost you by preventing optimizations like function inlining, and with the double-dereference involved with vtables you can suffer from poor locality (instruction cache misses).
I suspect the lion's share of the difference in performance is from losing the benefits of optimizations performed on static dispatch. It might be interesting to try disabling inlining on the original code to see how much of a benefit you were enjoying.
I have a member function with two arguments. Both are pointers to complex objects. When called, the function performs some non-trivial computation and then returns an integer. Like this:
struct Fooer {
int foo(const A* a, const B* b);
};
The returned integer is always the same if foo() is given the same two arguments. This function is pretty heavily used, so it would make sense to memoize its result. Normally, some lookup table with the key being the pair of pointers would suffice. However, I'm in the unique position where I know all the call sites and I know that any given call site will always use the same pair of parameters during execution. This could greatly speed up memoization if only I could pass in a third parameter, a unique integer that is basically the cache hint:
struct Fooer {
int foo(const A* a, const B* b, int pos) {
if (cached_[pos] > 0) return cached_[pos];
cached_[pos] = /* Heavy computation. */ + 1;
return cached_[pos];
}
std::vector<int> cached_;
};
What I'm looking for is a mechanism to easily generate this 'cache hint'. But nothing comes to mind. For now, I'm manually adding this parameter to the call sites of foo(), but it's obviously ugly and fragile. The function is really heavily used so it's worth this kind of optimization, in case you're wondering.
More generally, I'd like to have some kind of 'thunk' at each call site that performs the heavy lifting the first time it is called, then just returns the pre-computed integer.
Note that foo() is a member function so that different instances of Fooer should have different caches.
Would this approach help you?
struct Fooer {
    using CacheMap = std::map<std::pair<const A*, const B*>, int>;
    std::map<int, CacheMap> lineCache;
    int foo(const A* a, const B* b, int line) {
        const auto key = std::make_pair(a, b);
        CacheMap& cacheMap = lineCache[line]; // creates an empty map on first use
        const auto it = cacheMap.find(key);
        if (it != cacheMap.end()) return it->second;
        const int result = /* Heavy computation. */ + 1;
        cacheMap[key] = result;
        return result;
    }
};
// Calling
foo(a, b, __LINE__)
See _ReturnAddress (an MSVC intrinsic) or the equivalent for your compiler. Maybe you can use it in your project. If it works for you, you can simply keep a map from caller address to result.
Given the following:
class ReadWrite {
public:
int Read(size_t address);
void Write(size_t address, int val);
private:
std::map<size_t, int> db;
};
In the Read function, when accessing an address that has never been written to, I want to either throw an exception to signal the error, or allow the access and return 0. In other words, I would like to use either std::map<size_t, int>::operator[]() or std::map<size_t, int>::at(), depending on a bool value that the user can set. So I added the following:
class ReadWrite {
public:
int Read(size_t add) { if (allow) return db[add]; return db.at(add);}
void Write(size_t add, int val) { db[add] = val; }
void Allow() { allow = true; }
private:
bool allow = false;
std::map<size_t, int> db;
};
The problem with that is:
Usually, the program will call Allow() once or not at all at the beginning of the program, and then make many accesses. So, performance-wise, this code is bad because it performs the check if (allow) every time, even though the value is usually either always true or always false.
So how would you solve such problem?
Edit:
While the described use case (one call to Allow() at the start, or none) is very likely, it is not guaranteed, so I must allow the user to call Allow() dynamically.
Another Edit:
Solutions which use a function pointer: what about the performance overhead of a call through a function pointer, which the compiler cannot inline? If we use std::function instead, will that solve the issue?
Usually, the program will have one call of allow or none at the
beginning of the program and then afterwards many accesses. So,
performance wise, this code is bad because it every-time performs the
check if (allow) where usually it's either always true or always
false. So how would you solve such problem?
I won't. The CPU will.
Branch prediction will figure out that the answer is likely to stay the same for a long time, so it can optimize the branch very well at the hardware level. It will still incur some overhead, but a negligible amount.
If you really need to optimize your program, I think you're better off using std::unordered_map instead of std::map, or moving to some faster map implementation, like google::dense_hash_map. The branch is insignificant compared to the map lookup.
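As a rough sketch of that suggestion, the class from the question can keep its branch and simply swap the container. The read_throws() helper is hypothetical and only exists to show the two behaviours; no timings are claimed here, so measure before and after.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <unordered_map>

class ReadWrite {
public:
    int Read(std::size_t address) {
        // This branch goes the same way almost every time, so it predicts well.
        if (allow) return db[address];  // inserts and returns 0 for unknown addresses
        return db.at(address);          // throws std::out_of_range for unknown addresses
    }
    void Write(std::size_t address, int val) { db[address] = val; }
    void Allow() { allow = true; }
private:
    bool allow = false;
    std::unordered_map<std::size_t, int> db;  // hash lookup instead of a tree walk
};

// Hypothetical helper: reports whether reading `address` throws.
bool read_throws(ReadWrite& rw, std::size_t address) {
    try {
        rw.Read(address);
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```

Allow() can still be called at any point, which keeps the dynamic behaviour the question asks for.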
If you want to decrease the time-cost, you have to increase the memory-cost. Accepting that, you can do this with a function pointer. Below is my answer:
class ReadWrite {
public:
void Write(size_t add, int val) { db[add] = val; }
// when allowed, make the function pointer point to read2
void Allow() { Read = &ReadWrite::read2;}
//function pointer that points to read1 by default
int (ReadWrite::*Read)(size_t) = &ReadWrite::read1;
private:
int read1(size_t add){return db.at(add);}
int read2(size_t add) {return db[add];}
std::map<size_t, int> db;
};
Because Read is a pointer to member function, it cannot be called with ordinary member-call syntax; it has to be invoked through the .* operator. As an example:
ReadWrite rwObject;
//some code here
//...
(rwObject.*rwObject.Read)(5); //use of function pointer
//
Note that non-static data member initialization is available with c++11, so the int (ReadWrite::*Read)(size_t) = &ReadWrite::read1; may not compile with older versions. In that case, you have to explicitly declare one constructor, where the initialization of the function pointer can be done.
You can use a pointer to function.
class ReadWrite {
public:
void Write(size_t add, int val) { db[add] = val; }
int Read(size_t add) { return (this->*Rfunc)(add); }
void Allow() { Rfunc = &ReadWrite::Read2; }
private:
std::map<size_t, int> db;
int Read1(size_t add) { return db.at(add); }
int Read2(size_t add) { return db[add]; }
int (ReadWrite::*Rfunc)(size_t) = &ReadWrite::Read1;
};
If you want runtime dynamic behaviour you'll have to pay for it at runtime (at the point you want your logic to behave dynamically).
You want different behaviour at the point where you call Read depending on a runtime condition and you'll have to check that condition.
No matter whether your overhead is a function pointer call or a branch, you'll find a jump or call to different places in your program depending on allow at the point Read is called by the client code.
Note: Profile and fix real bottlenecks - not suspected ones. (You'll learn more if you profile by either having your suspicion confirmed or by finding out why your assumption about the performance was wrong.)
In a function that takes several arguments of the same type, how can we guarantee that the caller doesn't mess up the ordering?
For example
void allocate_things(int num_buffers, int pages_per_buffer, int default_value ...
and later
// uhmm.. lets see which was which uhh..
allocate_things(40,22,80,...
A typical solution is to put the parameters in a structure, with named fields.
AllocateParams p;
p.num_buffers = 1;
p.pages_per_buffer = 10;
p.default_value = 93;
allocate_things(p);
You don't have to use fields, of course. You can use member functions or whatever you like.
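A minimal sketch of the member-function variant, assuming chained setters that return *this. The setter names (buffers, pages_per, default_to) are made up for illustration, and allocate_things has a toy body that just returns the total page count.

```cpp
#include <cassert>

struct AllocateParams {
    int num_buffers = 0;
    int pages_per_buffer = 0;
    int default_value = 0;

    // Each setter returns *this so that calls can be chained at the call site.
    AllocateParams& buffers(int n)    { num_buffers = n; return *this; }
    AllocateParams& pages_per(int n)  { pages_per_buffer = n; return *this; }
    AllocateParams& default_to(int v) { default_value = v; return *this; }
};

// Toy body for illustration: returns the total number of pages requested.
int allocate_things(const AllocateParams& p) {
    return p.num_buffers * p.pages_per_buffer;
}
```

A call then reads as allocate_things(AllocateParams().buffers(40).pages_per(22).default_to(80)), and the order of the setters no longer matters.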
If you have a C++11 compiler, you could use user-defined literals in combination with user-defined types. Here is a naive approach:
struct num_buffers_t {
constexpr num_buffers_t(int n) : n(n) {} // constexpr constructors are allowed since C++11
int n;
};
struct pages_per_buffer_t {
constexpr pages_per_buffer_t(int n) : n(n) {}
int n;
};
constexpr num_buffers_t operator"" _buffers(unsigned long long int n) {
return num_buffers_t(n);
}
constexpr pages_per_buffer_t operator"" _pages_per_buffer(unsigned long long int n) {
return pages_per_buffer_t(n);
}
void allocate_things(num_buffers_t num_buffers, pages_per_buffer_t pages_per_buffer) {
// do stuff...
}
template <typename S, typename T>
void allocate_things(S, T) = delete; // forbid calling with other types, eg. integer literals
int main() {
// now we see which is which ...
allocate_things(40_buffers, 22_pages_per_buffer);
// the following does not compile (see the 'deleted' function):
// allocate_things(40, 22);
// allocate_things(40, 22_pages_per_buffer);
// allocate_things(22_pages_per_buffer, 40_buffers);
}
Two good answers so far, one more: another approach would be to try to leverage the type system wherever possible, and to create strong typedefs. For instance, using boost strong typedef (http://www.boost.org/doc/libs/1_61_0/libs/serialization/doc/strong_typedef.html).
BOOST_STRONG_TYPEDEF(int , num_buffers);
BOOST_STRONG_TYPEDEF(int , num_pages);
void func(num_buffers b, num_pages p);
Calling func with arguments in the wrong order would now be a compile error.
A couple of notes on this. First, boost's strong typedef is rather dated in its approach; you can do much nicer things with variadic CRTP and avoid macros completely. Second, obviously this introduces some overhead as you often have to explicitly convert. So generally you don't want to overuse it. It's really nice for things that come up over and over again in your library. Not so good for things that come up as a one off. So for instance, if you are writing a GPS library, you should have a strong double typedef for distances in metres, a strong int64 typedef for time past epoch in nanoseconds, and so on.
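For a macro-free flavour, here is a minimal sketch using a plain tag parameter. This is a simpler technique than the variadic CRTP design mentioned above, and the names Strong, NumBuffers, NumPages and total_pages are all hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// Minimal strong typedef: the Tag parameter makes every alias a distinct type.
template <typename T, typename Tag>
class Strong {
public:
    explicit constexpr Strong(T v) : value_(v) {}  // explicit: no silent conversion from T
    constexpr T get() const { return value_; }
private:
    T value_;
};

struct NumBuffersTag {};
struct NumPagesTag {};
using NumBuffers = Strong<int, NumBuffersTag>;
using NumPages   = Strong<int, NumPagesTag>;

// The two wrappers cannot be converted into each other, so swapped
// arguments are rejected at compile time.
static_assert(!std::is_convertible<NumBuffers, NumPages>::value,
              "distinct tags must give unrelated types");

int total_pages(NumBuffers b, NumPages p) { return b.get() * p.get(); }
```

The explicit conversions at the call site, as in total_pages(NumBuffers(40), NumPages(22)), are the overhead the answer warns about.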
(Note: post was originally tagged 'C')
C99 onwards allows an extension to @Dietrich Epp's idea: a compound literal
struct things {
int num_buffers;
int pages_per_buffer;
int default_value;
};
void allocate_things(struct things);
// Use a compound literal
allocate_things((struct things){.default_value=80, .num_buffers=40, .pages_per_buffer=22});
Could even pass the address of the structure.
void allocate_things(struct things *);
// Use a compound literal
allocate_things(&((struct things){.default_value=80,.num_buffers=40,.pages_per_buffer=22}));
You can't. That's why it is recommended to have as few function arguments as possible.
In your example you could have separate functions like set_num_buffers(int num_buffers), set_pages_per_buffer(int pages_per_buffer) etc.
You probably have noticed yourself that allocate_things is not a good name because it doesn't express what the function is actually doing. Especially I would not expect it to set a default value.
Just for completeness, you could use named arguments, in which case your call becomes
allocate_things(num_buffers=20, pages_per_buffer=40, default_value=20);
// or equivalently
allocate_things(pages_per_buffer=40, default_value=20, num_buffers=20);
However, with the current C++ this requires quite a bit of code to be implemented (in the header file declaring allocate_things(), which must also declare appropriate external objects num_buffers etc providing operator= which return a unique suitable object).
---------- working example (for sergej)
#include <iostream>
struct a_t { int x=0; a_t(int i): x(i){} };
struct b_t { int x=0; b_t(int i): x(i){} };
struct c_t { int x=0; c_t(int i): x(i){} };
// implement using all possible permutations of the arguments.
// for many more arguments, better use a variadic template.
void func(a_t a, b_t b, c_t c)
{ std::cout<<"a="<<a.x<<" b="<<b.x<<" c="<<c.x<<std::endl; }
inline void func(b_t b, c_t c, a_t a) { func(a,b,c); }
inline void func(c_t c, a_t a, b_t b) { func(a,b,c); }
inline void func(a_t a, c_t c, b_t b) { func(a,b,c); }
inline void func(c_t c, b_t b, a_t a) { func(a,b,c); }
inline void func(b_t b, a_t a, c_t c) { func(a,b,c); }
struct make_a { a_t operator=(int i) { return {i}; } } a;
struct make_b { b_t operator=(int i) { return {i}; } } b;
struct make_c { c_t operator=(int i) { return {i}; } } c;
int main()
{
func(b=2, c=10, a=42);
}
Are you really going to try to QA all the combinations of arbitrary integers? And throw in all the checks for negative/zero values etc?
Just create two enum types for minimum, medium and maximum number of buffers, and small medium and large buffer sizes. Then let the compiler do the work and let your QA folks take an afternoon off:
allocate_things(MINIMUM_BUFFER_CONFIGURATION, LARGE_BUFFER_SIZE, 42);
Then you only have to test a limited number of combinations and you'll have 100% coverage. The people working on your code 5 years from now will only need to know what they want to achieve and not have to guess the numbers they might need or which values have actually been tested in the field.
It does make the code slightly harder to extend, but it sounds like the parameters are for low-level performance tuning, so twiddling the values should not be perceived as cheap/trivial/not needing thorough testing. A code review of a change from
allocate_something(25, 25, 25);
...to
allocate_something(30, 80, 42);
...will likely get just a shrug/blown off, but a code review of a new enum value EXTRA_LARGE_BUFFERS will likely trigger all the right discussions about memory use, documentation, performance testing etc.
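A sketch of what the enum approach could look like with scoped enums. The enumerator names and the numbers they map to are invented for illustration; the real values would be whatever configurations have actually been tested.

```cpp
#include <cassert>

enum class BufferCount { Minimum, Medium, Maximum };
enum class BufferSize  { Small, Medium, Large };

// Hypothetical mapping from named configurations to concrete numbers.
constexpr int to_count(BufferCount c) {
    return c == BufferCount::Minimum ? 4 : c == BufferCount::Medium ? 16 : 64;
}
constexpr int to_pages(BufferSize s) {
    return s == BufferSize::Small ? 1 : s == BufferSize::Medium ? 8 : 32;
}

// Toy body: returns the total number of pages for the chosen configuration.
// Swapping the first two arguments does not compile, because scoped enums
// do not convert to each other or to int.
int allocate_things(BufferCount count, BufferSize size, int default_value) {
    (void)default_value;  // unused in this sketch
    return to_count(count) * to_pages(size);
}
```

With three values per enum there are only nine configurations to test, which is the coverage argument made above.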
I am programming with C++11 and was wondering if there is a way to generate some code during execution.
For example instead of writing:
void b(int i){i+1}
void c(int i){i-1}
if(true) b()
else{ c() }
would there be a more straightforward way to say if true, then replace all + with - ?
Thank you and sorry if this question is stupid..
C++ has no native facilities for runtime code generation. You could of course invoke a C++ compiler from your program, then dynamically load the resulting binary, and call code from it, but I doubt this is the best solution to your problem.
If you are worried about repeatedly checking the condition, you shouldn't be. Modern CPUs will likely deal with this very well, even in a tight loop, due to branch prediction.
Last, if you really want to more dynamically alter the code path you take, you could use function pointers and/or polymorphism and/or lambdas.
An example with functions
typedef int (*pFun)(int); // pointer to function taking int, returning int
int b(int i) { return i + 1; }
int c(int i) { return i - 1; }
...
pFun d = cond ? b : c; // based on condition, select function b or c
...
i = d(i); // calls either b or c, effectively selecting + or -
An example with polymorphism
class Operator
{
public:
    Operator() {}
    virtual ~Operator() {}
    virtual int doIt(int i) = 0;
};
class Add : public Operator
{
public:
    virtual int doIt(int i) { return i + 1; }
};
class Sub : public Operator
{
public:
    virtual int doIt(int i) { return i - 1; }
};
...
// Add* and Sub* have no common type, so cast one branch to the base pointer.
Operator *pOp = cond ? static_cast<Operator*>(new Add()) : new Sub();
...
i = pOp->doIt(i);
...
delete pOp;
Here, I have defined a base class with the doIt pure virtual function. The two child classes override the doIt() function to do different things. pOp will then point at either an Add or a Sub instance depending on cond, so when pOp->doIt() is called, the appropriate implementation of your operator is used. Under the covers, this does essentially what I outlined in the above example with function pointers, so choosing one over the other is largely a matter of style and/or other design constraints. They should both perform just as well.
An example with lambdas
This is basically the same as the first example using function pointers, but done in a more C++11 way using lambdas (and it is more concise).
auto d = cond ? [](int i) { return i+1; }
: [](int i) { return i-1; };
...
d(i);
Alternatively, you may prefer to have the condition inside the body of the lambda, for example
auto d = [&](int i) { return cond ? i+1 : i-1; };
...
d(i);
C++ does not have runtime code generation since it's a compiled language.
In this case, you could put the sign into a variable (to be used with multiple variables.)
E.g.
int sign = (true ? 1 : -1);
result2 += sign;
result1 += sign;
Not necessarily a solution for your problem, but you could use
a template, instantiated on one of the operators in <functional>:
template <typename Op>
int
func( int i )
{
return Op()( i, 1 );
}
In your calling function, you would then do something like:
int (*f)( int i ) = condition ? &func<std::plus<int>> : &func<std::minus<int>>;
// ...
i = f( i );
It's possible to use lambdas, which may be preferable, but you can't use
the conditional operator in this case. (Every lambda has a unique type,
and the second and third operands of the conditional operator must
have the same type.) So it becomes a bit more verbose:
int (*f)( int i );
if ( condition ) {
f = []( int i ) { return i + 1; };
} else {
f = []( int i ) { return i - 1; };
}
This will only work if there is no capture in the lambdas; when there is
no capture, the lambda not only generates an instance of a class with
a unique type, but also a function. Although not being able to use the
conditional operator makes this more verbose than necessary, it is still
probably simpler than having to define a function outside of the class,
unless that function can be implemented as a template, as in my first
example. (I'm assuming that your actual case may be significantly more
complicated than the example you've posted.)
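If the lambdas do need a capture, one option not covered above is std::function, which can hold any callable with a matching signature. The sketch below assumes a hypothetical make_op() helper and a captured step value; note that a call through std::function usually has some overhead compared with a plain function pointer.

```cpp
#include <cassert>
#include <functional>

// Returns an operation that adds or subtracts `step`. Both lambdas capture
// `step`, so neither can convert to a plain int (*)(int).
std::function<int(int)> make_op(bool condition, int step) {
    std::function<int(int)> f;
    if (condition) {
        f = [step](int i) { return i + step; };
    } else {
        f = [step](int i) { return i - step; };
    }
    return f;
}
```

The caller then writes auto f = make_op(condition, 3); followed by i = f(i);.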
EDIT:
Re lambdas, I tried:
auto f = c ? []( int i ) { return i + 1; } : []( int i ) { return i - 1; };
just out of curiosity. MSC++ gave me the expected error
message:
no conversion from 'someFunc::<lambda_21edbc86aa2c32f897f801ab50700d74>' to 'someFunc::<lambda_0dff34d4a518b95e95f7980e6ff211c5>'
but g++ compiled it without complaining, typeid(f) gave "PFiiI",
which I think is a pointer to a function. In this case, I'm pretty sure
that MSC++ is right: the standard says that each of the lambdas has
a unique type, and that each has a conversion operator to (in this
case) an int (*)( int ) (so both can be converted to the same
type—this is why the version with the if works). But the
specification of the conditional operator requires that either the
second operand can be converted to the type of the third, or vice versa,
but the results must be the type of one of the operands; it cannot be
a third type to which both are converted.
Suppose you have a function that you call many times, and every time it returns a big object. I've optimized this by using a functor that returns void and stores the result in a public member:
#include <vector>
const int N = 100;
std::vector<double> fun(const std::vector<double> & v, const int n)
{
std::vector<double> output = v;
output[n] *= output[n];
return output;
}
class F
{
public:
F() : output(N) {};
std::vector<double> output;
void operator()(const std::vector<double> & v, const int n)
{
output = v;
output[n] *= n;
}
};
int main()
{
std::vector<double> start(N,10.);
std::vector<double> end(N);
double a;
// first solution
for (unsigned long int i = 0; i != 10000000; ++i)
a = fun(start, 2)[3];
// second solution
F f;
for (unsigned long int i = 0; i != 10000000; ++i)
{
f(start, 2);
a = f.output[3];
}
}
Yes, I could inline this or optimize it in some other way, but the point I want to stress here is this: with the functor, the output variable is declared and constructed only once, whereas with the function that happens on every call. The second solution is twice as fast as the first with g++ -O1 or g++ -O2. What do you think about it? Is it an ugly optimization?
Edit:
To clarify my aim: I have to evaluate the function more than 10M times, but I need the output only a few random times. It's important that the input is not changed, which is why I declared it as a const reference. In this example the input is always the same, but in the real world the input changes, and it is a function of the previous output of the function.
A more common approach is to create an object with a large enough reserved size outside the function and pass it to the function by pointer or by reference. You can then reuse this object across several calls to your function, which avoids repeated memory allocation.
In both cases you are allocating new vector many many times.
What you should do is to pass both input and output objects to your class/function:
void fun(const std::vector<double> & in, const int n, std::vector<double> & out)
{
out[n] *= in[n];
}
This way you separate your logic from the algorithm. You only have to create the std::vector once, and you can pass it to the function as many times as you want. Notice that no unnecessary copy or allocation is made.
p.s. it's been awhile since I did c++. It may not compile right away.
It's not an ugly optimization. It's actually a fairly decent one.
I would, however, hide output and make an operator[] member to access its members. Why? Because you just might be able to perform a lazy evaluation optimization by moving all the math to that function, thus only doing that math when the client requests that value. Until the user asks for it, why do it if you don't need to?
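A minimal sketch of that lazy idea, assuming the functor only remembers its arguments and computes one element when operator[] is called. LazyF is a hypothetical name, and the sketch mirrors the F functor from the question (element n is multiplied by n). It stores a pointer to the input, so it only works while the input vector outlives the functor's use; that may not hold in the real program, where each input depends on the previous output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class LazyF {
public:
    // Remember the arguments only: no copy and no arithmetic happens here.
    void operator()(const std::vector<double>& v, int n) {
        input_ = &v;
        n_ = n;
    }
    // Compute a single element on demand, as if output = v; output[n] *= n.
    double operator[](std::size_t i) const {
        const double x = (*input_)[i];
        return i == static_cast<std::size_t>(n_) ? x * n_ : x;
    }
private:
    const std::vector<double>* input_ = nullptr;  // must outlive any operator[] call
    int n_ = 0;
};
```

The loop from the question would keep calling f(start, 2) on every iteration and read f[3] only on the iterations where the value is needed.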
Edit:
Just checked the standard. Behavior of the assignment operator is based on insert(). Notes for that function state that an allocation occurs if new size exceeds current capacity. Of course this does not seem to explicitly disallow an implementation from reallocating even if otherwise...I'm pretty sure you'll find none that do and I'm sure the standard says something about it somewhere else. Thus you've improved speed by removing allocation calls.
You should still hide the internal vector. You'll have more chance to change implementation if you use encapsulation. You could also return a reference (maybe const) to the vector from the function and retain the original syntax.
I played with this a bit, and came up with the code below. I keep thinking there's a better way to do this, but it's escaping me for now.
The key differences:
I'm allergic to public member variables, so I made output private, and put getters around it.
Having the operator return void isn't necessary for the optimization, so I have it return the value as a const reference so we can preserve return value semantics.
I took a stab at generalizing the approach into a templated base class, so you can then define derived classes for a particular return type, and not re-define the plumbing. This assumes the object you want to create takes a one-arg constructor, and the function you want to call takes in one additional argument. I think you'd have to define other templates if this varies.
Enjoy...
#include <vector>
template<typename T, typename ConstructArg, typename FuncArg>
class ReturnT
{
public:
ReturnT(ConstructArg arg): output(arg){}
virtual ~ReturnT() {}
const T& operator()(const T& in, FuncArg arg)
{
output = in;
this->doOp(arg);
return this->getOutput();
}
const T& getOutput() const {return output;}
protected:
T& getOutput() {return output;}
private:
virtual void doOp(FuncArg arg) = 0;
T output;
};
class F : public ReturnT<std::vector<double>, std::size_t, const int>
{
public:
F(std::size_t size) : ReturnT<std::vector<double>, std::size_t, const int>(size) {}
private:
virtual void doOp(const int n)
{
this->getOutput()[n] *= n;
}
};
int main()
{
const int N = 100;
std::vector<double> start(N,10.);
double a;
// second solution
F f(N);
for (unsigned long int i = 0; i != 10000000; ++i)
{
a = f(start, 2)[3];
}
}
It seems quite strange (I mean the need for optimization at all). I think that a decent compiler should perform return value optimization in such cases. Maybe all you need is to enable it.