I have been looking for a way to get around the slowness of dynamic_cast type checking. Before you suggest redesigning everything: the design was settled on five years ago. I can't fix the 400,000 lines of code that came after it (I wish I could), but I can make some changes. I have run this little test on type identification:
#include <iostream>
#include <typeinfo>
#include <stdint.h>
#include <ctime>
using namespace std;
#define ADD_TYPE_ID \
static intptr_t type() { return reinterpret_cast<intptr_t>(&type); }\
virtual intptr_t getType() { return type(); }
struct Base
{
ADD_TYPE_ID;
};
template <typename T>
struct Derived : public Base
{
ADD_TYPE_ID;
};
int main()
{
Base* b = new Derived<int>();
cout << "Correct Type: " << (b->getType() == Derived<int>::type()) << endl; // true
cout << "Template Type: " << (b->getType() == Derived<float>::type()) << endl; // false
cout << "Base Type: " << (b->getType() == Base::type()) << endl; // false
clock_t begin = clock();
{
for (size_t i = 0; i < 100000000; i++)
{
if (b->getType() == Derived<int>::type())
{
Derived<int>* d = static_cast<Derived<int>*>(b);
(void)d; // silence the unused-variable warning
}
}
}
clock_t end = clock();
double elapsed = double(end - begin) / CLOCKS_PER_SEC;
cout << "Type elapsed: " << elapsed << endl;
begin = clock();
{
for (size_t i = 0; i < 100000000; i++)
{
Derived<int>* d = dynamic_cast<Derived<int>*>(b);
(void)d; // note: the original "if (d);" here was an empty statement
}
}
end = clock();
elapsed = double(end - begin) / CLOCKS_PER_SEC;
cout << "Type elapsed: " << elapsed << endl;
begin = clock();
{
for (size_t i = 0; i < 100000000; i++)
{
Derived<int>* d = dynamic_cast<Derived<int>*>(b);
if ( typeid(d) == typeid(Derived<int>*) )
static_cast<Derived<int>*> (b);
}
}
end = clock();
elapsed = double(end - begin) / CLOCKS_PER_SEC;
cout << "Type elapsed: " << elapsed << endl;
return 0;
}
It seems that using the class id (the first timed solution above) would be the fastest way to do type checking at runtime.
Will this cause any problems with threading? Is there a better way to check for types at runtime (with not much re-factoring)?
Edit: I should also add that this needs to work with the TI compilers, which currently support only C++03.
First off, note that there's a big difference between dynamic_cast and RTTI: the cast tells you whether you can treat a base object as some more derived type (not necessarily the most-derived one), while RTTI (typeid) tells you the precise most-derived type. Naturally the former is more powerful, and also more expensive.
So then, there are two natural ways you can select on types if you have a polymorphic hierarchy. They're different; use the one that actually applies.
void method1(Base * p)
{
if (Derived * q = dynamic_cast<Derived *>(p))
{
// use q
}
}
void method2(Base * p)
{
if (typeid(*p) == typeid(Derived))
{
auto * q = static_cast<Derived *>(p);
// use q
}
}
Note also that method 2 is not generally available if the base class is a virtual base. Neither method applies if your classes are not polymorphic.
In a quick test I found method 2 to be significantly faster than your manual ID-based solution, which in turn is faster than the dynamic cast solution (method 1).
How about comparing the classes' virtual function tables?
Quick and dirty proof of concept:
void* instance_vtbl(void* c)
{
return *(void**)c;
}
template<typename C>
void* class_vtbl()
{
static C c;
return instance_vtbl(&c);
}
// ...
begin = clock();
{
for (size_t i = 0; i < 100000000; i++)
{
if (instance_vtbl(b) == class_vtbl<Derived<int>>())
Derived <int>* d = static_cast<Derived<int>*> (b);
}
}
end = clock();
elapsed = double(end - begin) / CLOCKS_PER_SEC;
cout << "Type elapsed: " << elapsed << endl;
With Visual C++'s /Ox switch, this appears 3x faster than the type/getType trick.
Given this type of code
class A {
};
class B : public A {
};
A * a = /* ... */;
B * b = dynamic_cast<B*>(a);
if( b != 0 ) { /* do something B specific */ }
The polymorphic (and arguably right) way to fix it is something like this:
class A {
public:
virtual void specific() { /* do nothing */ }
};
class B : public A {
public:
virtual void specific() { /* do something B specific */ }
};
A * a = /* ... */;
if( a != 0 ) a->specific();
When MSVC 2005 first came out, dynamic_cast<> for 64-bit code was much slower than for 32-bit code. We wanted a quick and easy fix. This is what our code looks like. It probably violates all kinds of good design rules, but the conversion to remove dynamic_cast<> can be automated with a script.
class dbbMsgEph {
public:
virtual dbbResultEph * CastResultEph() { return 0; }
virtual const dbbResultEph * CastResultEph() const { return 0; }
};
class dbbResultEph : public dbbMsgEph {
public:
virtual dbbResultEph * CastResultEph() { return this; }
virtual const dbbResultEph * CastResultEph() const { return this; }
static dbbResultEph * Cast( dbbMsgEph * );
static const dbbResultEph * Cast( const dbbMsgEph * );
};
dbbResultEph *
dbbResultEph::Cast( dbbMsgEph * arg )
{
if( arg == 0 ) return 0;
return arg->CastResultEph();
}
const dbbResultEph *
dbbResultEph::Cast( const dbbMsgEph * arg )
{
if( arg == 0 ) return 0;
return arg->CastResultEph();
}
When we used to have
dbbMsgEph * pMsg;
dbbResultEph * pResult = dynamic_cast<dbbResultEph *> (pMsg);
we changed it to
dbbResultEph * pResult = dbbResultEph::Cast (pMsg);
using a simple sed(1) script. And virtual function calls are pretty efficient.
//in release builds (VS2008) this is true:
cout << "Base Type: " << (b->getType() == Base::type()) << endl;
I guess this is because of optimization: identical functions get merged by the linker (COMDAT folding), so every type() ends up at the same address. So I changed the implementation of Derived::type():
template <typename T>
struct Derived : public Base
{
static intptr_t type()
{
cout << "different type()" << endl;
return reinterpret_cast<intptr_t>(&type);
}
virtual intptr_t getType() { return type(); }
};
Then it's different. So how do you deal with this if you use this method?
Related
I learned the difference between static and run-time polymorphism, and since then I've been reading a lot about it. In particular, I found out that a static polymorphic interface can be implemented in such a way that a caller can use an object through the base-class interface without knowing which "derived" class the object was instantiated with.
So I ran a little test to compare these different implementations.
Here is the code:
#include <functional>
#include <iostream>
#include <ctime>
#include <chrono>
#include <memory>
class BaseLambda
{
public:
double run(double param){return m_func(param);};
protected:
BaseLambda(std::function<double(double)> func) : m_func(func) { };
std::function<double(double)> m_func;
};
class DerivedLambda : public BaseLambda
{
public:
DerivedLambda() : BaseLambda([](double param){return DerivedLambda::runImpl(param);}) { };
private:
static double runImpl(double param) {return (param * param) / 2.5;};
};
class BaseVirtual
{
public:
double run(double param){return runImpl(param);};
protected:
virtual double runImpl(double param) = 0;
};
class DerivedVirtual : public BaseVirtual
{
protected:
double runImpl(double param) {return (param * param) / 2.5;};
};
template<class C> class BaseTemplate
{
public:
double run(double param){return static_cast<C*>(this)->runImpl(param);};
};
class DerivedTemplate : public BaseTemplate<DerivedTemplate>
{
public:
double runImpl(double param) {return (param * param) / 2.5;};
};
int main()
{
std::clock_t c_start = std::clock();
std::unique_ptr<BaseVirtual> baseVirtual = std::make_unique<DerivedVirtual>();
for(unsigned int i = 0 ; i < 1000000 ; ++i) {
auto var = baseVirtual->run(10.6);
}
baseVirtual.reset(nullptr);
std::clock_t c_end = std::clock();
std::cout << "Execution time with virtual : "
<< 1000.0 * (c_end-c_start) / CLOCKS_PER_SEC << " ms" << std::endl;
c_start = std::clock();
std::unique_ptr<BaseLambda> baseLambda = std::make_unique<DerivedLambda>();
for(unsigned int i = 0 ; i < 1000000 ; ++i) {
auto var = baseLambda->run(10.6);
}
baseLambda.reset(nullptr);
c_end = std::clock();
std::cout << "Execution time with lambda : "
<< 1000.0 * (c_end-c_start) / CLOCKS_PER_SEC << " ms" << std::endl;
std::unique_ptr<BaseTemplate<DerivedTemplate>> baseTemplate = std::make_unique<DerivedTemplate>();
for(unsigned int i = 0 ; i < 1000000 ; ++i) {
auto var = baseTemplate->run(10.6);
}
baseTemplate.reset(nullptr);
c_end = std::clock();
std::cout << "Execution time with template : "
<< 1000.0 * (c_end-c_start) / CLOCKS_PER_SEC << " ms" << std::endl;
}
The thing is, when I ran this, the virtual implementation was the fastest, while static polymorphism is supposed to be faster.
Here is an execution result example:
Execution time with virtual: 53 ms
Execution time with lambda: 94 ms
Execution time with template: 162 ms
I'm guessing that it is because of some compiler optimization. I compile with gcc 6.3.0.
My question is: why is the template implementation the slowest? Is it really because of compiler optimizations, and if so, which ones, and how can I change my test to measure the "real" running time? And if not, what have I done wrong?
I know that dynamic_cast has a serious cost, but when I try the following code, I get a bigger value almost every time from the virtual function call loop. Have I been wrong all this time?
EDIT: The problem was that my compiler had been in debug mode. When I switched to release mode, virtual function call loop runs 5 to 7 times faster than dynamic_cast loop.
struct A {
virtual void foo() {}
};
struct B : public A {
virtual void foo() override {}
};
struct C : public B {
virtual void foo() override {}
};
int main()
{
vector<A *> vec;
for (int i = 0; i < 100000; ++i)
if (i % 2)
vec.push_back(new C());
else
vec.push_back(new B());
clock_t begin = clock();
for (auto iter : vec)
if (dynamic_cast<C*>(iter))
;
clock_t end = clock();
cout << (static_cast<double>(end) - begin) / CLOCKS_PER_SEC << endl;
begin = clock();
for (auto iter : vec)
iter->foo();
end = clock();
cout << (static_cast<double>(end) - begin) / CLOCKS_PER_SEC << endl;
return 0;
}
Since you are not doing anything with the result of the dynamic_cast in the lines
for (auto iter : vec)
if (dynamic_cast<C*>(iter))
;
the compiler might be optimizing away most of that code if not all of it.
If you do something useful with the result of the dynamic_cast, you might see a difference. You could try:
for (auto iter : vec)
{
if (C* cptr = dynamic_cast<C*>(iter))
{
cptr->foo();
}
if (B* bptr = dynamic_cast<B*>(iter))
{
bptr->foo();
}
}
That will most likely make a difference.
See http://ideone.com/BvqoqU for a sample run.
Have I been wrong all this time?
We probably cannot tell from your code. The optimizer is clever, and it is sometimes quite challenging to 'defeat' or 'deceive' it.
In the following, I use assert() to try to rein in the optimizer's enthusiasm. Also note that time(0) is a fast call on Ubuntu 15.10; I believe the compiler cannot yet know what the combination will do, and thus will not remove it, which gives a more reliable/repeatable measurement.
I think I like these results better, and perhaps these indicate that dynamic cast is slower than virtual function invocation.
Environment:
on an older Dell, using Ubuntu 15.10, 64 bit, and -O3
~$ g++-5 --version
g++-5 (Ubuntu 5.2.1-23ubuntu1~15.10) 5.2.1 20151028
Results (dynamic cast followed by virtual function):
void T523_t::testStruct()
0.443445
0.184873
void T523_t::testClass()
252,495 us
184,961 us
FINI 2914399 us
Code:
#include <chrono>
// 'compressed' chrono access --------------vvvvvvv
typedef std::chrono::high_resolution_clock HRClk_t; // std-chrono-hi-res-clk
typedef HRClk_t::time_point Time_t; // std-chrono-hi-res-clk-time-point
typedef std::chrono::milliseconds MS_t; // std-chrono-milliseconds
typedef std::chrono::microseconds US_t; // std-chrono-microseconds
typedef std::chrono::nanoseconds NS_t; // std-chrono-nanoseconds
using namespace std::chrono_literals; // support suffixes like 100ms, 2s, 30us
#include <iostream>
#include <iomanip>
#include <vector>
#include <cassert>
// original ////////////////////////////////////////////////////////////////////
struct A {
virtual ~A() = default; // warning: ‘struct A’ has virtual functions and
// accessible non-virtual destructor [-Wnon-virtual-dtor]
virtual void foo() { assert(time(0)); }
};
struct B : public A {
virtual void foo() override { assert(time(0)); }
};
struct C : public B {
virtual void foo() override { assert(time(0)); }
};
// with class ////////////////////////////////////////////////////////////////////////////
// If your C++ code has no class ... why bother?
class A_t {
public:
virtual ~A_t() = default; // warning: ‘struct A’ has virtual functions and
// accessible non-virtual destructor [-Wnon-virtual-dtor]
virtual void foo() { assert(time(0)); }
};
class B_t : public A_t {
public:
virtual void foo() override { assert(time(0)); }
};
class C_t : public B_t {
public:
virtual void foo() override { assert(time(0)); }
};
class T523_t
{
public:
T523_t() = default;
~T523_t() = default;
int exec()
{
testStruct();
testClass();
return(0);
}
private: // methods
std::string digiComma(std::string s)
{ //vvvvv--sSize must be signed int of sufficient size
int32_t sSize = static_cast<int32_t>(s.size());
if (sSize > 3)
for (int32_t indx = (sSize - 3); indx > 0; indx -= 3)
s.insert(static_cast<size_t>(indx), 1, ',');
return(s);
}
void testStruct()
{
using std::vector;
using std::cout; using std::endl;
std::cout << "\n\n " << __PRETTY_FUNCTION__ << std::endl;
vector<A *> vec;
for (int i = 0; i < 10000000; ++i)
if (i % 2)
vec.push_back(new C());
else
vec.push_back(new B());
clock_t begin = clock();
int i=0;
for (auto iter : vec)
{
if(i % 2) assert(dynamic_cast<C*>(iter)); // if (dynamic_cast<C*>(iter)) {};
else assert(dynamic_cast<B*>(iter));
++i; // note: without this increment only the B branch would ever run
}
clock_t end = clock();
cout << "\n " << std::setw(8)
<< ((static_cast<double>(end) - static_cast<double>(begin))
/ CLOCKS_PER_SEC) << endl; //^^^^^^^^^^^^^^^^^^^^^^^^^^
// warning: conversion to ‘double’ from ‘clock_t {aka long int}’ may alter its value [-Wconversion]
begin = clock();
for (auto iter : vec)
iter->foo();
end = clock();
cout << "\n " << std::setw(8)
<< ((static_cast<double>(end) - static_cast<double>(begin))
/ CLOCKS_PER_SEC) << endl; //^^^^^^^^^^^^^^^^^^^^^^^^^^
// warning: conversion to ‘double’ from ‘clock_t {aka long int}’ may alter its value [-Wconversion]
}
void testClass()
{
std::cout << "\n\n " << __PRETTY_FUNCTION__ << std::endl;
std::vector<A_t *> APtrVec;
for (int i = 0; i < 10000000; ++i)
{
if (i % 2) APtrVec.push_back(new C_t());
else APtrVec.push_back(new B_t());
}
{
Time_t start_us = HRClk_t::now();
int i = 0;
for (auto Aptr : APtrVec)
{
if(i % 2) assert(dynamic_cast<C_t*>(Aptr)); // check for nullptr
else assert(dynamic_cast<B_t*>(Aptr)); // check for nullptr
++i;
}
auto duration_us = std::chrono::duration_cast<US_t>(HRClk_t::now() - start_us);
std::cout << "\n " << std::setw(8)
<< digiComma(std::to_string(duration_us.count()))
<< " us" << std::endl;
}
{
Time_t start_us = HRClk_t::now();
for (auto Aptr : APtrVec) {
Aptr->foo();
}
auto duration_us = std::chrono::duration_cast<US_t>(HRClk_t::now() - start_us);
std::cout << "\n " << std::setw(8)
<< digiComma(std::to_string(duration_us.count()))
<< " us" << std::endl;
}
}
}; // class T523_t
int main(int argc, char* argv[])
{
std::cout << "\nargc: " << argc << std::endl;
for (int i = 0; i < argc; i += 1) std::cout << argv[i] << " ";
std::cout << std::endl;
setlocale(LC_ALL, "");
std::ios::sync_with_stdio(false);
{ time_t t0 = std::time(nullptr); while(t0 == time(nullptr)) { /**/ }; }
Time_t start_us = HRClk_t::now();
int retVal = -1;
{
T523_t t523;
retVal = t523.exec();
}
auto duration_us = std::chrono::duration_cast<US_t>(HRClk_t::now() - start_us);
std::cout << "\n FINI " << (std::to_string(duration_us.count()))
<< " us" << std::endl;
return(retVal);
}
update 2017-08-31
I suspect many of you will object to performing the dynamic cast without using the result. Here is one possible approach, replacing the for-auto loop in the testClass() method:
for (auto Aptr : APtrVec)
{
if(i % 2) { C_t* c = dynamic_cast<C_t*>(Aptr); assert(c); c->foo(); }
else { B_t* b = dynamic_cast<B_t*>(Aptr); assert(b); b->foo(); }
++i;
}
With results
void T523_t::testStruct()
0.443445
0.184873
void T523_t::testClass()
322,431 us
191,285 us
FINI 4156941 us
end update
This question already has answers here:
When should static_cast, dynamic_cast, const_cast, and reinterpret_cast be used?
(11 answers)
Closed 9 years ago.
What would be a genuinely acceptable example of using a derived-class cast (downcast)? I have always thought they are only used when implementing "hacks", but if this is not the case, could someone give an acceptable example of when to use one?
#user997112
[edit at bottom]
Hello. Below we use a collection of random polymorphic pointers with a common ancestor, accessed through the common interface. Additional work is done with one particular derived class, and we need dynamic_cast or typeid to identify it.
The main function has the call, then come the class declarations, and the dynamic cast is at the end. Deleting the objects created with new is not shown.
#include <iostream>
#include <algorithm>
#include <random>
#include <vector> // needed for std::vector below
#include <exception>
using namespace std;
int dynamic_test();
int main()
{
cout << "Hello world!" << endl;
dynamic_test();
return 0;
}
............
class basex {
public:
virtual ~basex() {};
virtual void work() const = 0;
};
class next1x : public basex {
public:
void work() const override {cout << "1";/*secret*/}
};
class next2x : public basex {
public:
void work() const override {cout << "2";/*secret*/}
};
class next3x : public basex {
public:
void work() const override {cout << "3";/*secret*/}
};
std::vector<basex *> secret_class_picker()
{
//pick classes with common base at random
std::random_device rd;
std::uniform_int_distribution<int> ud(1,3);
std::mt19937 mt(rd());
std::vector<int> random_v;
for (int i = 0; i < 22; ++i)
random_v.push_back( ud(mt) );
cout << "Random" << endl;
for ( auto bq : random_v) //inspecting for human reader
cout << bq << " ";
std::vector<basex *> v;
basex * bptr;
for (auto bq : random_v) {
switch(bq)
{
default: throw std::exception(); break;
case 1: bptr = new next1x; break;
case 2: bptr = new next2x; break;
case 3: bptr = new next3x; break;
}
v.push_back(bptr);
}
cout << "Objects Created " << v.size() << endl;
return v;
}
//this function demands a more derived class
int special_work(const next3x *)
{
//elided
cout <<"[!]";
return 0;
}
int dynamic_test()
{
std::vector<basex *> v = secret_class_picker();//delete these pointer later
cout <<"Working with random polymorphic pointers"<<endl;
for (const auto bq : v)
{
bq->work();//polymorphic
next3x * ptr = dynamic_cast<next3x *>(bq);
if (nullptr != ptr) special_work(ptr); //reserved for particular type
}
return 0;
}
...................... alternative
int dynamic_static_typeid()
{
std::vector<basex *> v = secret_class_picker();
cout <<"Working with random polymorphic pointers"<<endl;
int k(0);
for (const auto bq : v)
{
bool flipflop = (k % 2) == 0;
bq->work();//polymorphic
//cout << "[*]"<< typeid(*bq).name();//dereference
if (flipflop) {
next3x * dc_ptr = dynamic_cast<next3x *>(bq);//not constant time in general
if (nullptr != dc_ptr) {
special_work(dc_ptr); //reserved for particular type
++k;
}
}
else {
if (typeid(next3x) == typeid(*bq)){//constant time
auto sc_ptr = static_cast<next3x *>(bq);//constant time
special_work(sc_ptr);
++k; cout <<"[sc]";
}
}
cout << endl;
}
return 0;
}
I have a requirement as follows.
I have to generate incrementing negative numbers from -1 to -100, used as unique ids for requests. The sequence should be: -1, -2, -3, ... -100, -1, -2, and so on. How can I do this effectively? I am not supposed to use Boost; the C++ STL is fine. I would prefer a simple function like int GetNextID() that generates the id. Could you show a sample program that does this effectively?
Thanks for your time and help
int ID = -1;
auto getnext = [=]() mutable {
if (ID == -101) ID = -1; // wrap after -100 has been handed out
return ID--;
};
Fairly basic stuff here, really. If you have to ask somebody on the Interwebs to write this program for you, you should really consider finding some educational material in C++.
I love the functor solution:
template <int limit> class NegativeNumber
{
public:
NegativeNumber() : current(0) {};
int operator()()
{
return -(1 + (current++ % limit));
};
private:
int current;
};
Then, you can define any generator with any limit and use it:
NegativeNumber<5> five;
NegativeNumber<2> two;
for (int x = 0; x < 20; ++x)
std::cout << "limit five: " << five() << "\tlimit two: " << two() << '\n';
You can also pass the generator as a parameter to another function, each functor keeping its own state:
void f5(NegativeNumber<5> &n)
{
std::cout << "limit five: " << n() << '\n';
}
void f2(NegativeNumber<2> &n)
{
std::cout << "limit two: " << n() << '\n';
}
f5(five);
f2(two);
If you don't like the template solution to declare the limit, there's also the no-template version:
class NegativeNumberNoTemplate
{
public:
NegativeNumberNoTemplate(int limit) : m_limit(limit), current(0) {};
int operator()()
{
return -(1 + (current++ % m_limit));
};
private:
const int m_limit;
int current;
};
Using it as an argument to a function works the same way, and its internal state is carried along as well:
void f(NegativeNumberNoTemplate &n)
{
std::cout << "no template: " << n() << '\n';
}
NegativeNumberNoTemplate notemplate(3);
f(notemplate);
I hope you don't want to use it with threading, they're not thread safe ;)
Here you have all the examples; hope it helps.
Something like.... (haven't compiled)
class myClass
{
public:
    int GetValue()
    {
        number = (number % 100) + 1; // cycles 1..100; note that % 101 would also yield 0
        return -number;
    }
private:
    int number = 0; // in-class initializer requires C++11
};
Even a simple problem like this can lead you to several approaches, both in the algorithmic solution and in the concrete usage of the programming language.
This was my first solution, using C++03. I preferred to switch the sign after computing the value.
#include <iostream>
int GetNextID() {
// This variable is private to this function. Be careful of not calling it
// from multiple threads!
static int current_value = 0;
const int MAX_CYCLE_VALUE = 100;
return - (current_value++ % MAX_CYCLE_VALUE) - 1;
}
int main()
{
const int TOTAL_GETS = 500;
for (int i = 0; i < TOTAL_GETS; ++i)
std::cout << GetNextID() << std::endl;
}
A different solution, taking into account that integer modulo in C++ takes the sign of the dividend (!), as noted on Wikipedia:
#include <iostream>
int GetNextID() {
// This variable is private to this function. Be careful of not calling it
// from multiple threads!
static int current_value = 0;
const int MAX_CYCLE_VALUE = 10;
return (current_value-- % MAX_CYCLE_VALUE) - 1;
}
int main()
{
const int TOTAL_GETS = 50;
for (int i = 0; i < TOTAL_GETS; ++i)
std::cout << GetNextID() << std::endl;
}
I read that using a policy class for a function that will be called in a tight loop is much faster than using a polymorphic function. However, I set up this demo and the timing indicates that it is exactly the opposite: the policy version takes 2-3x longer than the polymorphic version.
#include <iostream>
#include <boost/timer.hpp>
// Policy version
template < typename operation_policy>
class DoOperationPolicy : public operation_policy
{
using operation_policy::Operation;
public:
void Run(const float a, const float b)
{
Operation(a,b);
}
};
class OperationPolicy_Add
{
protected:
float Operation(const float a, const float b)
{
return a + b;
}
};
// Polymorphic version
class DoOperation
{
public:
virtual float Run(const float a, const float b)= 0;
};
class OperationAdd : public DoOperation
{
public:
float Run(const float a, const float b)
{
return a + b;
}
};
int main()
{
boost::timer timer;
unsigned int numberOfIterations = 1e7;
DoOperationPolicy<OperationPolicy_Add> policy_operation;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
policy_operation.Run(1,2);
}
std::cout << timer.elapsed() << " seconds." << std::endl;
timer.restart();
DoOperation* polymorphic_operation = new OperationAdd;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
polymorphic_operation->Run(1,2);
}
std::cout << timer.elapsed() << " seconds." << std::endl;
}
Is there something wrong with the demo? Or is just incorrect that the policy should be faster?
Your benchmark is meaningless (sorry).
Making real benchmarks is hard, unfortunately, as compilers are very clever.
Things to look for here:
devirtualization: the polymorphic call is expected to be slower because it is supposed to be virtual, but here the compiler can realize that polymorphic_operation is necessarily an OperationAdd and thus call OperationAdd::Run directly, without invoking runtime dispatch
inlining: since the compiler has access to the methods body, it can inline them, and avoid the function calls altogether.
"dead store removal": values that are not used need not be stored, and the computations that lead to them and do not provoke side-effects can be avoided entirely.
Indeed, your entire benchmark code can be optimized to:
int main()
{
boost::timer timer;
std::cout << timer.elapsed() << " seconds." << std::endl;
timer.restart();
DoOperation* polymorphic_operation = new OperationAdd;
std::cout << timer.elapsed() << " seconds." << std::endl;
}
Which is when you realize that you are not timing what you'd like to...
In order to make your benchmark meaningful you need to:
prevent devirtualization
force side-effects
To prevent devirtualization, just declare a DoOperation& Get() function, and then in another cpp file: DoOperation& Get() { static OperationAdd O; return O; }.
To force side-effects (only necessary if the methods are inlined): return the value and accumulate it, then display it.
In action using this program:
// test2.cpp
namespace so8746025 {
class DoOperation
{
public:
virtual float Run(const float a, const float b) = 0;
};
class OperationAdd : public DoOperation
{
public:
float Run(const float a, const float b)
{
return a + b;
}
};
class OperationAddOutOfLine: public DoOperation
{
public:
float Run(const float a, const float b);
};
float OperationAddOutOfLine::Run(const float a, const float b)
{
return a + b;
}
DoOperation& GetInline() {
static OperationAdd O;
return O;
}
DoOperation& GetOutOfLine() {
static OperationAddOutOfLine O;
return O;
}
} // namespace so8746025
// test.cpp
#include <iostream>
#include <boost/timer.hpp>
namespace so8746025 {
// Policy version
template < typename operation_policy>
struct DoOperationPolicy
{
float Run(const float a, const float b)
{
return operation_policy::Operation(a,b);
}
};
struct OperationPolicy_Add
{
static float Operation(const float a, const float b)
{
return a + b;
}
};
// Polymorphic version
class DoOperation
{
public:
virtual float Run(const float a, const float b) = 0;
};
class OperationAdd : public DoOperation
{
public:
float Run(const float a, const float b)
{
return a + b;
}
};
class OperationAddOutOfLine: public DoOperation
{
public:
float Run(const float a, const float b);
};
DoOperation& GetInline();
DoOperation& GetOutOfLine();
} // namespace so8746025
using namespace so8746025;
int main()
{
unsigned int numberOfIterations = 1e8;
DoOperationPolicy<OperationPolicy_Add> policy;
OperationAdd stackInline;
DoOperation& virtualInline = GetInline();
OperationAddOutOfLine stackOutOfLine;
DoOperation& virtualOutOfLine = GetOutOfLine();
boost::timer timer;
float result = 0;
for(unsigned int i = 0; i < numberOfIterations; ++i) {
result += policy.Run(1,2);
}
std::cout << "Policy: " << timer.elapsed() << " seconds (" << result << ")" << std::endl;
timer.restart();
result = 0;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
result += stackInline.Run(1,2);
}
std::cout << "Stack Inline: " << timer.elapsed() << " seconds (" << result << ")" << std::endl;
timer.restart();
result = 0;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
result += virtualInline.Run(1,2);
}
std::cout << "Virtual Inline: " << timer.elapsed() << " seconds (" << result << ")" << std::endl;
timer.restart();
result = 0;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
result += stackOutOfLine.Run(1,2);
}
std::cout << "Stack Out Of Line: " << timer.elapsed() << " seconds (" << result << ")" << std::endl;
timer.restart();
result = 0;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
result += virtualOutOfLine.Run(1,2);
}
std::cout << "Virtual Out Of Line: " << timer.elapsed() << " seconds (" << result << ")" << std::endl;
}
We get:
$ gcc --version
gcc (GCC) 4.3.2
$ ./testR
Policy: 0.17 seconds (6.71089e+07)
Stack Inline: 0.17 seconds (6.71089e+07)
Virtual Inline: 0.52 seconds (6.71089e+07)
Stack Out Of Line: 0.6 seconds (6.71089e+07)
Virtual Out Of Line: 0.59 seconds (6.71089e+07)
Note the subtle difference between devirtualization + inline and the absence of devirtualization.
FWIW I made it
a policy, as opposed to a mixin
return the value
use a volatile to avoid the loop being optimized away, and to prevent unrelated optimization of the loop (such as reducing loads/stores through loop unrolling and vectorization on targets that support it)
compare with a direct, static function call
use way more iterations
compile with -O3 on gcc
Timings are:
DoDirect: 3.4 seconds.
Policy: 3.41 seconds.
Polymorphic: 3.4 seconds.
Ergo: there is no difference, mainly because GCC is able to statically analyze the type behind the DoOperation* to be OperationAdd - there is no vtable lookup inside the loop :)
IMPORTANT
If you wanted to benchmark the real-life performance of this exact loop, instead of function invocation overhead, drop the volatile. The timings now become:
DoDirect: 6.71089e+07 in 1.12 seconds.
Policy: 6.71089e+07 in 1.15 seconds.
Polymorphic: 6.71089e+07 in 3.38 seconds.
As you can see, without volatile the compiler is able to optimize some load-store cycles away; I assume it might be doing loop unrolling plus register allocation there (however, I haven't inspected the machine code). The point is that the loop as a whole can be optimized much more with the 'policy' approach than with dynamic dispatch (i.e. the virtual method).
CODE
#include <iostream>
#include <boost/timer.hpp>
// Direct version
struct DoDirect {
static float Run(const float a, const float b) { return a + b; }
};
// Policy version
template <typename operation_policy>
struct DoOperationPolicy {
float Run(const float a, const float b) const {
return operation_policy::Operation(a,b);
}
};
struct OperationPolicy_Add {
static float Operation(const float a, const float b) {
return a + b;
}
};
// Polymorphic version
struct DoOperation {
virtual float Run(const float a, const float b) const = 0;
};
struct OperationAdd : public DoOperation {
float Run(const float a, const float b) const { return a + b; }
};
int main(int argc, const char *argv[])
{
boost::timer timer;
const unsigned long numberOfIterations = 1ul << 30;
volatile float result = 0;
for(unsigned long i = 0; i < numberOfIterations; ++i) {
result += DoDirect::Run(1,2);
}
std::cout << "DoDirect: " << result << " in " << timer.elapsed() << " seconds." << std::endl;
timer.restart();
DoOperationPolicy<OperationPolicy_Add> policy_operation;
for(unsigned long i = 0; i < numberOfIterations; ++i) {
result += policy_operation.Run(1,2);
}
std::cout << "Policy: " << result << " in " << timer.elapsed() << " seconds." << std::endl;
timer.restart();
result = 0;
DoOperation* polymorphic_operation = new OperationAdd;
for(unsigned long i = 0; i < numberOfIterations; ++i) {
result += polymorphic_operation->Run(1,2);
}
std::cout << "Polymorphic: " << result << " in " << timer.elapsed() << " seconds." << std::endl;
}
Turn on optimisation. The policy-based variant benefits greatly from it because most intermediate steps are optimised away completely, while the polymorphic version cannot skip, for example, the dereferencing of the object.
You have to turn on optimization, and make sure that
both code parts actually do the same thing (which they currently do not, your policy-variant does not return the result)
the result is used for something, so that the compiler does not discard the code path altogether (just sum the results and print them somewhere should be enough)
I had to change your policy code to return the computed value:
float Run(const float a, const float b)
{
return Operation(a,b);
}
Secondly, I had to store the returned value to ensure that the loop wouldn't be optimized away:
int main()
{
unsigned int numberOfIterations = 1e9;
float answer = 0.0;
boost::timer timer;
DoOperationPolicy<OperationPolicy_Add> policy_operation;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
answer += policy_operation.Run(1,2);
}
std::cout << "Policy got " << answer << " in " << timer.elapsed() << " seconds" << std::endl;
answer = 0.0;
timer.restart();
DoOperation* polymorphic_operation = new OperationAdd;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
answer += polymorphic_operation->Run(1,2);
}
std::cout << "Polymo got " << answer << " in " << timer.elapsed() << " seconds" << std::endl;
return 0;
}
Without optimizations on g++ 4.1.2:
Policy got 6.71089e+07 in 13.75 seconds
Polymo got 6.71089e+07 in 7.52 seconds
With -O3 on g++ 4.1.2:
Policy got 6.71089e+07 in 1.18 seconds
Polymo got 6.71089e+07 in 3.23 seconds
So the policy is definitely faster once optimizations are turned on.