I learned the difference between static and run-time polymorphism and since then I've been reading a lot of stuff on it.
I also found out that a static polymorphic interface can be implemented in more than one way (such that a class can use an object through the base class definition without knowing which "derived" class the object was instantiated with).
So I ran a little test to compare these different implementations.
Here is the code:
#include <functional>
#include <iostream>
#include <ctime>
#include <chrono>
#include <memory>
class BaseLambda
{
public:
double run(double param){return m_func(param);};
protected:
BaseLambda(std::function<double(double)> func) : m_func(func) { };
std::function<double(double)> m_func;
};
class DerivedLambda : public BaseLambda
{
public:
DerivedLambda() : BaseLambda([](double param){return DerivedLambda::runImpl(param);}) { };
private:
static double runImpl(double param) {return (param * param) / 2.5;};
};
class BaseVirtual
{
public:
double run(double param){return runImpl(param);};
protected:
virtual double runImpl(double param) = 0;
};
class DerivedVirtual : public BaseVirtual
{
protected:
double runImpl(double param) {return (param * param) / 2.5;};
};
template<class C> class BaseTemplate
{
public:
double run(double param){return static_cast<C*>(this)->runImpl(param);};
};
class DerivedTemplate : public BaseTemplate<DerivedTemplate>
{
public:
double runImpl(double param) {return (param * param) / 2.5;};
};
int main()
{
std::clock_t c_start = std::clock();
std::unique_ptr<BaseVirtual> baseVirtual = std::make_unique<DerivedVirtual>();
for(unsigned int i = 0 ; i < 1000000 ; ++i) {
auto var = baseVirtual->run(10.6);
}
baseVirtual.reset(nullptr);
std::clock_t c_end = std::clock();
std::cout << "Execution time with virtual : "
<< 1000.0 * (c_end-c_start) / CLOCKS_PER_SEC << " ms" << std::endl;
c_start = std::clock();
std::unique_ptr<BaseLambda> baseLambda = std::make_unique<DerivedLambda>();
for(unsigned int i = 0 ; i < 1000000 ; ++i) {
auto var = baseLambda->run(10.6);
}
baseLambda.reset(nullptr);
c_end = std::clock();
std::cout << "Execution time with lambda : "
<< 1000.0 * (c_end-c_start) / CLOCKS_PER_SEC << " ms" << std::endl;
std::unique_ptr<BaseTemplate<DerivedTemplate>> baseTemplate = std::make_unique<DerivedTemplate>();
for(unsigned int i = 0 ; i < 1000000 ; ++i) {
auto var = baseTemplate->run(10.6);
}
baseTemplate.reset(nullptr);
c_end = std::clock();
std::cout << "Execution time with template : "
<< 1000.0 * (c_end-c_start) / CLOCKS_PER_SEC << " ms" << std::endl;
}
The thing is, when I ran this, the virtual implementation was the fastest, although static polymorphism is supposed to be faster.
Here is an execution result example:
Execution time with virtual: 53 ms
Execution time with lambda: 94 ms
Execution time with template: 162 ms
I'm guessing that it is because of some compiler optimization. I compile with gcc 6.3.0.
My question is: why is the template implementation the slowest? Is it really because of compiler optimizations, and if so, what are they and how can I change my test to get the "real" running time? And if not, what have I done wrong?
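One detail in the listing above is worth checking before blaming the optimizer: c_start is not reset before the template loop, so the "template" figure also contains the whole lambda loop (162 ms is the 94 ms lambda run plus the template loop's own time). The values returned by run() are also never used, so an optimizing build is free to drop the calls. A minimal sketch of a fairer measurement, assuming a hypothetical helper named time_ms and std::chrono::steady_clock in place of std::clock:

```cpp
#include <chrono>

// Hypothetical helper (not part of the original code): runs `body`
// `iterations` times, adds every returned value into `sink` so the calls
// have an observable effect, and returns the elapsed wall time in ms.
template <class F>
double time_ms(F&& body, unsigned iterations, double& sink)
{
    const auto start = std::chrono::steady_clock::now();  // fresh start for every measurement
    for (unsigned i = 0; i < iterations; ++i) {
        sink += body();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Each variant would then get its own call, for example time_ms([&] { return baseTemplate->run(10.6); }, 1000000, sink), with sink printed at the end. The compiler may still inline or devirtualize the calls, so this only removes the two measurement errors named above.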
I am experimenting with lock-free programming using the code below and running into weird problems.
This is the dining philosophers problem, and I am trying to implement it in a lock-free way.
I reserve the size of the unordered_map up front to make sure it does not relocate objects at run time. Unfortunately, the size of Philosopher::umap ends up wrong at run time: in the example below I restrict it to 10 elements, but it changes to 30-40. Can someone help me understand what is going on here?
#include <iostream>
#include <thread>
#include <string>
#include <vector>
#include <memory>
#include <atomic>
#include <chrono>
#include <tuple>
#include <unordered_map>
class Eater {
public:
virtual void Eating() = 0;
~Eater() {
}
};
class Dreamer {
public:
virtual void Dreaming() = 0;
~Dreamer() {}
};
class Gamer {
public:
virtual void Gaming() = 0;
~Gamer() {
}
};
class Philosopher : public Dreamer, public Eater {
private:
const uint32_t id_;
public:
static std::atomic<uint32_t> gen;
static std::unordered_map<uint32_t, std::tuple<uint32_t, uint32_t>> umap;
Philosopher()
: id_(++gen) {
umap.emplace(id_, std::make_tuple(0,0));
}
Philosopher(const Philosopher& that) : id_(that.id_) {
}
Philosopher(Philosopher&& that) : id_(that.id_) {
}
virtual void Eating() override {
umap[id_] = std::make_tuple(1+std::get<0>(umap[id_]), std::get<1>(umap[id_])); // this is failing, despite separate threads working on separate memory locations
}
virtual void Dreaming() override {
umap[id_] = std::make_tuple(std::get<0>(umap.at(id_)), 1+std::get<1>(umap.at(id_))); // this is failing, despite separate threads working on separate memory locations
}
~Philosopher() {
}
};
std::unordered_map<uint32_t, std::tuple<uint32_t, uint32_t>> Philosopher::umap;
std::atomic<uint32_t> Philosopher::gen(0u);
class Engine : public Gamer {
private:
uint32_t m_count_philosophers_;
std::vector<std::shared_ptr<std::thread>> m_vthread_;
public:
Engine(uint32_t n)
: m_count_philosophers_(n) {
m_vthread_.reserve(n);
}
virtual void Gaming() override {
std::atomic<uint32_t> counter(0u);
while(counter++ < m_count_philosophers_) {
m_vthread_.emplace_back(std::make_shared<std::thread>([&]() mutable {
Philosopher philosopher;
std::chrono::duration<double> elapsed_seconds;
auto start = std::chrono::steady_clock::now();
while(elapsed_seconds.count() < 2) {
philosopher.Eating();
philosopher.Dreaming();
auto finish = std::chrono::steady_clock::now();
elapsed_seconds = finish - start;
}
}));
}
for(auto &iter : m_vthread_) {
iter->join();
}
}
};
int main() {
auto N = 10u;
Philosopher::umap.reserve(N); // there are no more than N elements in the unordered_map
Engine eng(N);
eng.Gaming();
for (auto &i : Philosopher::umap)
{
std::cout << "phi: " << i.first << " eating: " << std::get<0>(i.second) << ", dreaming: " << std::get<1>(i.second) << "\n";
}
return 0;
}
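A possible explanation, stated as an assumption rather than a verified diagnosis: every Philosopher constructor runs on its own thread and calls umap.emplace, and concurrent insertions into a std::unordered_map are a data race even when reserve has prevented rehashing, so the container's size and internal links can be corrupted. A minimal sketch that avoids this by creating all entries on one thread first and letting each worker update only its own element through at() (the names Counters and run_workers are hypothetical):

```cpp
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using Counters = std::unordered_map<std::uint32_t, std::pair<std::uint32_t, std::uint32_t>>;

// Hypothetical helper: fills the map on the calling thread, then starts one
// worker per key. Workers never insert or erase; each one only modifies the
// value stored under its own key, so no two threads touch the same element.
Counters run_workers(std::uint32_t n, std::uint32_t rounds)
{
    Counters counters;
    counters.reserve(n);
    for (std::uint32_t id = 1; id <= n; ++id) {
        counters.emplace(id, std::make_pair(0u, 0u));  // all insertions happen here
    }

    std::vector<std::thread> workers;
    workers.reserve(n);
    for (std::uint32_t id = 1; id <= n; ++id) {
        workers.emplace_back([&counters, id, rounds] {
            auto& mine = counters.at(id);  // lookup only, never inserts
            for (std::uint32_t r = 0; r < rounds; ++r) {
                ++mine.first;   // "eating"
                ++mine.second;  // "dreaming"
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counters;
}
```

The design choice is to keep the container's structure fixed while threads are running. Neighbouring elements may still share a cache line, which costs speed but not correctness.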
I know that dynamic_cast has a serious cost, but when I try the following code, the virtual function call loop takes longer almost every time. Have I been wrong all this time?
EDIT: The problem was that I had been compiling in debug mode. When I switched to release mode, the virtual function call loop ran 5 to 7 times faster than the dynamic_cast loop.
#include <ctime>
#include <iostream>
#include <vector>
using namespace std;
struct A {
virtual void foo() {}
};
struct B : public A {
virtual void foo() override {}
};
struct C : public B {
virtual void foo() override {}
};
int main()
{
vector<A *> vec;
for (int i = 0; i < 100000; ++i)
if (i % 2)
vec.push_back(new C());
else
vec.push_back(new B());
clock_t begin = clock();
for (auto iter : vec)
if (dynamic_cast<C*>(iter))
;
clock_t end = clock();
cout << (static_cast<double>(end) - begin) / CLOCKS_PER_SEC << endl;
begin = clock();
for (auto iter : vec)
iter->foo();
end = clock();
cout << (static_cast<double>(end) - begin) / CLOCKS_PER_SEC << endl;
return 0;
}
Since you are not doing anything with the result of the dynamic_cast in the lines
for (auto iter : vec)
if (dynamic_cast<C*>(iter))
;
the compiler might be optimizing away most of that code if not all of it.
If you do something useful with the result of the dynamic_cast, you might see a difference. You could try:
for (auto iter : vec)
{
if (C* cptr = dynamic_cast<C*>(iter))
{
cptr->foo();
}
if (B* bptr = dynamic_cast<B*>(iter))
{
bptr->foo();
}
}
That will most likely make a difference.
See http://ideone.com/BvqoqU for a sample run.
Have I been wrong all this time?
We probably cannot tell from your code. The optimizer is clever, and it is sometimes quite challenging to 'defeat' or 'deceive' it.
In the following, I use 'assert()' to try to control the optimizer's enthusiasm. Also note that 'time(0)' is a fast function on Ubuntu 15.10. I believe the compiler does not yet know what the combination will do, and thus will not remove it, providing a more reliable/repeatable measurement.
I think I like these results better, and perhaps these indicate that dynamic cast is slower than virtual function invocation.
Environment:
on an older Dell, using Ubuntu 15.10, 64 bit, and -O3
~$ g++-5 --version
g++-5 (Ubuntu 5.2.1-23ubuntu1~15.10) 5.2.1 20151028
Results (dynamic cast followed by virtual function):
void T523_t::testStruct()
0.443445
0.184873
void T523_t::testClass()
252,495 us
184,961 us
FINI 2914399 us
Code:
#include <chrono>
// 'compressed' chrono access --------------vvvvvvv
typedef std::chrono::high_resolution_clock HRClk_t; // std-chrono-hi-res-clk
typedef HRClk_t::time_point Time_t; // std-chrono-hi-res-clk-time-point
typedef std::chrono::milliseconds MS_t; // std-chrono-milliseconds
typedef std::chrono::microseconds US_t; // std-chrono-microseconds
typedef std::chrono::nanoseconds NS_t; // std-chrono-nanoseconds
using namespace std::chrono_literals; // support suffixes like 100ms, 2s, 30us
#include <iostream>
#include <iomanip>
#include <vector>
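#include <cassert>
#include <clocale> // setlocale
#include <ctime> // clock, time
#include <string> // std::string, std::to_string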
// original ////////////////////////////////////////////////////////////////////
struct A {
virtual ~A() = default; // warning: ‘struct A’ has virtual functions and
// accessible non-virtual destructor [-Wnon-virtual-dtor]
virtual void foo() { assert(time(0)); }
};
struct B : public A {
virtual void foo() override { assert(time(0)); }
};
struct C : public B {
virtual void foo() override { assert(time(0)); }
};
// with class ////////////////////////////////////////////////////////////////////////////
// If your C++ code has no class ... why bother?
class A_t {
public:
virtual ~A_t() = default; // warning: ‘struct A’ has virtual functions and
// accessible non-virtual destructor [-Wnon-virtual-dtor]
virtual void foo() { assert(time(0)); }
};
class B_t : public A_t {
public:
virtual void foo() override { assert(time(0)); }
};
class C_t : public B_t {
public:
virtual void foo() override { assert(time(0)); }
};
class T523_t
{
public:
T523_t() = default;
~T523_t() = default;
int exec()
{
testStruct();
testClass();
return(0);
}
private: // methods
std::string digiComma(std::string s)
{ //vvvvv--sSize must be signed int of sufficient size
int32_t sSize = static_cast<int32_t>(s.size());
if (sSize > 3)
for (int32_t indx = (sSize - 3); indx > 0; indx -= 3)
s.insert(static_cast<size_t>(indx), 1, ',');
return(s);
}
void testStruct()
{
using std::vector;
using std::cout; using std::endl;
std::cout << "\n\n " << __PRETTY_FUNCTION__ << std::endl;
vector<A *> vec;
for (int i = 0; i < 10000000; ++i)
if (i % 2)
vec.push_back(new C());
else
vec.push_back(new B());
clock_t begin = clock();
int i=0;
for (auto iter : vec)
{
if(i % 2) (assert(dynamic_cast<C*>(iter))); // if (dynamic_cast<C*>(iter)) {};
else (assert(dynamic_cast<B*>(iter)));
++i; // advance the index so both branches run, as in testClass()
}
clock_t end = clock();
cout << "\n " << std::setw(8)
<< ((static_cast<double>(end) - static_cast<double>(begin))
/ CLOCKS_PER_SEC) << endl; //^^^^^^^^^^^^^^^^^^^^^^^^^^
// warning: conversion to ‘double’ from ‘clock_t {aka long int}’ may alter its value [-Wconversion]
begin = clock();
for (auto iter : vec)
iter->foo();
end = clock();
cout << "\n " << std::setw(8)
<< ((static_cast<double>(end) - static_cast<double>(begin))
/ CLOCKS_PER_SEC) << endl; //^^^^^^^^^^^^^^^^^^^^^^^^^^
// warning: conversion to ‘double’ from ‘clock_t {aka long int}’ may alter its value [-Wconversion]
}
void testClass()
{
std::cout << "\n\n " << __PRETTY_FUNCTION__ << std::endl;
std::vector<A_t *> APtrVec;
for (int i = 0; i < 10000000; ++i)
{
if (i % 2) APtrVec.push_back(new C_t());
else APtrVec.push_back(new B_t());
}
{
Time_t start_us = HRClk_t::now();
int i = 0;
for (auto Aptr : APtrVec)
{
if(i % 2) assert(dynamic_cast<C_t*>(Aptr)); // check for nullptr
else assert(dynamic_cast<B_t*>(Aptr)); // check for nullptr
++i;
}
auto duration_us = std::chrono::duration_cast<US_t>(HRClk_t::now() - start_us);
std::cout << "\n " << std::setw(8)
<< digiComma(std::to_string(duration_us.count()))
<< " us" << std::endl;
}
{
Time_t start_us = HRClk_t::now();
for (auto Aptr : APtrVec) {
Aptr->foo();
}
auto duration_us = std::chrono::duration_cast<US_t>(HRClk_t::now() - start_us);
std::cout << "\n " << std::setw(8)
<< digiComma(std::to_string(duration_us.count()))
<< " us" << std::endl;
}
}
}; // class T523_t
int main(int argc, char* argv[])
{
std::cout << "\nargc: " << argc << std::endl;
for (int i = 0; i < argc; i += 1) std::cout << argv[i] << " ";
std::cout << std::endl;
setlocale(LC_ALL, "");
std::ios::sync_with_stdio(false);
{ time_t t0 = std::time(nullptr); while(t0 == time(nullptr)) { /**/ }; }
Time_t start_us = HRClk_t::now();
int retVal = -1;
{
T523_t t523;
retVal = t523.exec();
}
auto duration_us = std::chrono::duration_cast<US_t>(HRClk_t::now() - start_us);
std::cout << "\n FINI " << (std::to_string(duration_us.count()))
<< " us" << std::endl;
return(retVal);
}
update 2017-08-31
I suspect many of you will object to performing the dynamic cast without using the result. Here is one possible approach by replacing the for-auto loop in testClass() method:
for (auto Aptr : APtrVec)
{
if(i % 2) { C_t* c = dynamic_cast<C_t*>(Aptr); assert(c); c->foo(); }
else { B_t* b = dynamic_cast<B_t*>(Aptr); assert(b); b->foo(); }
++i;
}
With results
void T523_t::testStruct()
0.443445
0.184873
void T523_t::testClass()
322,431 us
191,285 us
FINI 4156941 us
end update
I have been looking for a way to get around the slowness of the dynamic cast type checking. Before you start saying I should redesign everything, let me inform you that the design was decided on 5 years ago. I can't fix all 400,000 lines of code that came after (I wish I could), but I can make some changes. I have run this little test on type identification:
#include <iostream>
#include <typeinfo>
#include <stdint.h>
#include <ctime>
using namespace std;
#define ADD_TYPE_ID \
static intptr_t type() { return reinterpret_cast<intptr_t>(&type); }\
virtual intptr_t getType() { return type(); }
struct Base
{
ADD_TYPE_ID;
};
template <typename T>
struct Derived : public Base
{
ADD_TYPE_ID;
};
int main()
{
Base* b = new Derived<int>();
cout << "Correct Type: " << (b->getType() == Derived<int>::type()) << endl; // true
cout << "Template Type: " << (b->getType() == Derived<float>::type()) << endl; // false
cout << "Base Type: " << (b->getType() == Base::type()) << endl; // false
clock_t begin = clock();
{
for (size_t i = 0; i < 100000000; i++)
{
if (b->getType() == Derived<int>::type())
Derived <int>* d = static_cast<Derived<int>*> (b);
}
}
clock_t end = clock();
double elapsed = double(end - begin) / CLOCKS_PER_SEC;
cout << "Type elapsed: " << elapsed << endl;
begin = clock();
{
for (size_t i = 0; i < 100000000; i++)
{
Derived<int>* d = dynamic_cast<Derived<int>*>(b);
if (d);
}
}
end = clock();
elapsed = double(end - begin) / CLOCKS_PER_SEC;
cout << "dynamic_cast elapsed: " << elapsed << endl;
begin = clock();
{
for (size_t i = 0; i < 100000000; i++)
{
Derived<int>* d = dynamic_cast<Derived<int>*>(b);
if ( typeid(d) == typeid(Derived<int>*) )
static_cast<Derived<int>*> (b);
}
}
end = clock();
elapsed = double(end - begin) / CLOCKS_PER_SEC;
cout << "typeid elapsed: " << elapsed << endl;
return 0;
}
It seems that using the class id (the first timed loop above) would be the fastest way to do type-checking at runtime.
Will this cause any problems with threading? Is there a better way to check for types at runtime (with not much re-factoring)?
Edit: Might I also add that this needs to work with the TI compilers, which currently only support up to '03
First off, note that there's a big difference between dynamic_cast and typeid (both are part of RTTI): the cast tells you whether you can treat a base object as some further derived, but not necessarily most-derived, object, whereas typeid tells you the precise most-derived type. Naturally the former is more powerful and more expensive.
So then, there are two natural ways you can select on types if you have a polymorphic hierarchy. They're different; use the one that actually applies.
void method1(Base * p)
{
if (Derived * q = dynamic_cast<Derived *>(p))
{
// use q
}
}
void method2(Base * p)
{
if (typeid(*p) == typeid(Derived))
{
auto * q = static_cast<Derived *>(p);
// use q
}
}
Note also that method 2 is not generally available if the base class is a virtual base. Neither method applies if your classes are not polymorphic.
In a quick test I found method 2 to be significantly faster than your manual ID-based solution, which in turn is faster than the dynamic cast solution (method 1).
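To make the difference between the two methods concrete, here is a small sketch with a hypothetical three-level hierarchy (Base, Middle, Leaf are illustration names, not from the question). It is written in C++03 style because of the compiler constraint mentioned in the question.

```cpp
#include <typeinfo>

// Hypothetical three-level hierarchy, used only for illustration.
struct Base   { virtual ~Base() {} };
struct Middle : Base {};
struct Leaf   : Middle {};

// Method 1: true when p points to a Middle or to anything derived from it.
bool is_a_middle(Base* p)
{
    return dynamic_cast<Middle*>(p) != 0;
}

// Method 2: true only when the most-derived type of *p is exactly Middle.
// The null check matters: typeid(*p) on a null pointer throws std::bad_typeid.
bool is_exactly_middle(Base* p)
{
    return p != 0 && typeid(*p) == typeid(Middle);
}
```

For a Leaf object, is_a_middle returns true while is_exactly_middle returns false, so swapping one method for the other changes behaviour whenever further-derived classes exist.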
How about comparing the classes' virtual function tables?
Quick and dirty proof of concept:
void* instance_vtbl(void* c)
{
return *(void**)c;
}
template<typename C>
void* class_vtbl()
{
static C c;
return instance_vtbl(&c);
}
// ...
begin = clock();
{
for (size_t i = 0; i < 100000000; i++)
{
if (instance_vtbl(b) == class_vtbl<Derived<int>>())
Derived <int>* d = static_cast<Derived<int>*> (b);
}
}
end = clock();
elapsed = double(end - begin) / CLOCKS_PER_SEC;
cout << "Type elapsed: " << elapsed << endl;
With Visual C++'s /Ox switch, this appears 3x faster than the type/getType trick.
Given this type of code
class A {
public:
virtual ~A() {} // dynamic_cast requires a polymorphic base class
};
class B : public A {
};
A * a;
B * b = dynamic_cast<B*> (a);
if( b != 0 ) // do something B specific
The polymorphic (right?) way to fix it is something like this
class A {
public:
virtual void specific() { /* do nothing */ }
};
class B : public A {
public:
virtual void specific() { /* do something B specific */ }
};
A * a;
if( a != 0 ) a->specific();
When MSVC 2005 first came out, dynamic_cast<> for 64-bit code was much slower than for 32-bit code. We wanted a quick and easy fix. This is what our code looks like. It probably violates all kinds of good design rules, but the conversion to remove dynamic_cast<> can be automated with a script.
class dbbResultEph; // forward declaration, needed by the return types below
class dbbMsgEph {
public:
virtual dbbResultEph * CastResultEph() { return 0; }
virtual const dbbResultEph * CastResultEph() const { return 0; }
};
class dbbResultEph : public dbbMsgEph {
public:
virtual dbbResultEph * CastResultEph() { return this; }
virtual const dbbResultEph * CastResultEph() const { return this; }
static dbbResultEph * Cast( dbbMsgEph * );
static const dbbResultEph * Cast( const dbbMsgEph * );
};
dbbResultEph *
dbbResultEph::Cast( dbbMsgEph * arg )
{
if( arg == 0 ) return 0;
return arg->CastResultEph();
}
const dbbResultEph *
dbbResultEph::Cast( const dbbMsgEph * arg )
{
if( arg == 0 ) return 0;
return arg->CastResultEph();
}
When we used to have
dbbMsgEph * pMsg;
dbbResultEph * pResult = dynamic_cast<dbbResultEph *> (pMsg);
we changed it to
dbbResultEph * pResult = dbbResultEph::Cast (pMsg);
using a simple sed(1) script. And virtual function calls are pretty efficient.
In release mode (VS2008) this comparison is true:
cout << "Base Type: " << (b->getType() == Base::type()) << endl;
I guess it's because of optimization, so I changed the implementation of Derived::type():
template <typename T>
struct Derived : public Base
{
static intptr_t type()
{
cout << "different type()" << endl;
return reinterpret_cast<intptr_t>(&type);
}
virtual intptr_t getType() { return type(); }
};
Then the two values differ again. So how should this be handled if this method is used?
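A plausible cause, given as an assumption since the build settings are not shown: release builds in Visual C++ can fold functions with identical machine code into one, so Base::type and Derived<T>::type may end up with the same address. One way to sidestep this is to use the address of a static variable as the ID, since distinct objects have distinct addresses. The helper name TypeId below is hypothetical, and the sketch sticks to C++03:

```cpp
#include <stdint.h>

// Hypothetical helper: one static object per T; its address is the ID.
template <typename T>
struct TypeId
{
    static intptr_t value() { return reinterpret_cast<intptr_t>(&tag); }
private:
    static char tag;
};
template <typename T> char TypeId<T>::tag = 0;

struct Base
{
    virtual ~Base() {}
    static intptr_t type() { return TypeId<Base>::value(); }
    virtual intptr_t getType() { return type(); }
};

template <typename T>
struct Derived : public Base
{
    static intptr_t type() { return TypeId<Derived<T> >::value(); }
    virtual intptr_t getType() { return type(); }
};
```

The comparison b->getType() == Derived<int>::type() then no longer depends on function addresses, so it does not need the cout call to keep the functions distinct.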
I'd like advice on a way to cache a computation that is shared by two derived classes. As an illustration, I have two types of normalized vectors L1 and L2, which each define their own normalization constant (note: against good practice I'm inheriting from std::vector here as a quick illustration-- believe it or not, my real problem is not about L1 and L2 vectors!):
#include <vector>
#include <iostream>
#include <iterator>
#include <math.h>
struct NormalizedVector : public std::vector<double> {
NormalizedVector(std::initializer_list<double> init_list):
std::vector<double>(init_list) { }
double get_value(int i) const {
return (*this)[i] / get_normalization_constant();
}
virtual double get_normalization_constant() const = 0;
};
struct L1Vector : public NormalizedVector {
L1Vector(std::initializer_list<double> init_list):
NormalizedVector(init_list) { }
double get_normalization_constant() const {
double tot = 0.0;
for (int k=0; k<size(); ++k)
tot += (*this)[k];
return tot;
}
};
struct L2Vector : public NormalizedVector {
L2Vector(std::initializer_list<double> init_list):
NormalizedVector(init_list) { }
double get_normalization_constant() const {
double tot = 0.0;
for (int k=0; k<size(); ++k) {
double val = (*this)[k];
tot += val * val;
}
return sqrt(tot);
}
};
int main() {
L1Vector vec{0.25, 0.5, 1.0};
std::cout << "L1 ";
for (int k=0; k<vec.size(); ++k)
std::cout << vec.get_value(k) << " ";
std::cout << std::endl;
std::cout << "L2 ";
L2Vector vec2{0.25, 0.5, 1.0};
for (int k=0; k<vec2.size(); ++k)
std::cout << vec2.get_value(k) << " ";
std::cout << std::endl;
return 0;
}
This code is unnecessarily slow for large vectors because it calls get_normalization_constant() repeatedly, even though it doesn't change after construction (assuming modifiers like push_back have appropriately been disabled).
If I was only considering one form of normalization, I would simply use a double value to cache this result on construction:
struct NormalizedVector : public std::vector<double> {
NormalizedVector(std::initializer_list<double> init_list):
std::vector<double>(init_list) {
normalization_constant = get_normalization_constant();
}
double get_value(int i) const {
return (*this)[i] / normalization_constant;
}
virtual double get_normalization_constant() const = 0;
double normalization_constant;
};
However, this understandably doesn't compile because the NormalizedVector constructor tries to call a pure virtual function (the derived virtual table is not available during base initialization).
Option 1:
Derived classes must manually call the normalization_constant = get_normalization_constant(); function in their constructors.
Option 2:
Objects define a virtual function for initializing the constant:
init_normalization_constant() {
normalization_constant = get_normalization_constant();
}
Objects are then constructed by a factory:
struct NormalizedVector : public std::vector<double> {
NormalizedVector(std::initializer_list<double> init_list):
std::vector<double>(init_list) {
// init_normalization_constant();
}
double get_value(int i) const {
return (*this)[i] / normalization_constant;
}
virtual double get_normalization_constant() const = 0;
virtual void init_normalization_constant() {
normalization_constant = get_normalization_constant();
}
double normalization_constant;
};
// ...
// same code for derived types here
// ...
template <typename TYPE>
struct Factory {
template <typename ...ARGTYPES>
static TYPE construct_and_init(ARGTYPES...args) {
TYPE result(args...);
result.init_normalization_constant();
return result;
}
};
int main() {
L1Vector vec = Factory<L1Vector>::construct_and_init<std::initializer_list<double> >({0.25, 0.5, 1.0});
std::cout << "L1 ";
for (int k=0; k<vec.size(); ++k)
std::cout << vec.get_value(k) << " ";
std::cout << std::endl;
return 0;
}
Option 3:
Use an actual cache: get_normalization_constant is defined as a new type, CacheFunctor; the first time CacheFunctor is called, it saves the return value.
In Python, this works as originally coded, because the virtual table is always present, even in __init__ of a base class. In C++ this is much trickier.
I'd really appreciate the help; this comes up a lot for me. I feel like I'm getting the hang of good object oriented design in C++, but not always when it comes to making very efficient code (especially in the case of this sort of simple caching).
I suggest the non-virtual interface pattern. This pattern excels when you want a method to provide both common and unique functionality. (In this case, the caching is common and the computation is unique to each derived class.)
http://en.wikibooks.org/wiki/More_C%2B%2B_Idioms/Non-Virtual_Interface
// UNTESTED
struct NormalizedVector : public std::vector<double> {
...
double normalization_constant;
bool cached = false; // nothing has been computed yet
virtual double do_get_normalization_constant() = 0;
double get_normalization_constant() {
if(!cached) {
cached = true;
normalization_constant = do_get_normalization_constant();
}
return normalization_constant;
}
};
P.s. You really ought not publicly derive from std::vector.
P.P.s. Invalidating the cache is as simple as setting cached to false.
Complete Solution
#include <vector>
#include <iostream>
#include <iterator>
#include <cmath>
#include <algorithm>
#include <numeric> // std::accumulate
struct NormalizedVector : private std::vector<double> {
private:
typedef std::vector<double> Base;
protected:
using Base::operator[];
using Base::begin;
using Base::end;
public:
using Base::size;
NormalizedVector(std::initializer_list<double> init_list):
std::vector<double>(init_list),
normalization_constant_valid(false), // cache starts out empty
normalization_constant(0.0) { }
double get_value(int i) const {
return (*this)[i] / get_normalization_constant();
}
virtual double do_get_normalization_constant() const = 0;
mutable bool normalization_constant_valid;
mutable double normalization_constant;
double get_normalization_constant() const {
if(!normalization_constant_valid) {
normalization_constant = do_get_normalization_constant();
normalization_constant_valid = true;
}
return normalization_constant;
}
void push_back(const double& value) {
normalization_constant_valid = false;
Base::push_back(value);
}
virtual ~NormalizedVector() {}
};
struct L1Vector : public NormalizedVector {
L1Vector(std::initializer_list<double> init_list):
NormalizedVector(init_list) { get_normalization_constant(); }
double do_get_normalization_constant() const {
return std::accumulate(begin(), end(), 0.0);
}
};
struct L2Vector : public NormalizedVector {
L2Vector(std::initializer_list<double> init_list):
NormalizedVector(init_list) { get_normalization_constant(); }
double do_get_normalization_constant() const {
return std::sqrt(
std::accumulate(begin(), end(), 0.0,
[](double a, double b) { return a + b * b; } ) );
}
};
std::ostream&
operator<<(std::ostream& os, NormalizedVector& vec) {
for (int k=0; k<vec.size(); ++k)
os << vec.get_value(k) << " ";
return os;
}
int main() {
L1Vector vec{0.25, 0.5, 1.0};
std::cout << "L1 " << vec << "\n";
vec.push_back(2.0);
std::cout << "L1 " << vec << "\n";
L2Vector vec2{0.25, 0.5, 1.0};
std::cout << "L2 " << vec2 << "\n";
vec2.push_back(2.0);
std::cout << "L2 " << vec2 << "\n";
return 0;
}
A quick and dirty solution is to use a static local variable.
double get_normalization_constant() const {
static double tot = 0.0;
if( tot == 0.0 )
for (int k=0; k<size(); ++k)
tot += (*this)[k];
return tot;
}
In this case it is computed only once, and every later call returns the cached value.
NOTE:
This double tot will be shared with all objects of the same type. Don't use it if you will create more than one object of type L1Vector.
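The sharing caveat is easy to demonstrate. The sketch below uses a hypothetical stand-in class, StaticCachedSum, rather than the question's L1Vector, and shows the same function-local static cache:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for L1Vector that uses the function-local static
// cache described above.
struct StaticCachedSum
{
    std::vector<double> data;
    explicit StaticCachedSum(const std::vector<double>& d) : data(d) {}

    double get_normalization_constant() const
    {
        static double tot = 0.0;  // one variable for ALL StaticCachedSum objects
        if (tot == 0.0)
            for (std::size_t k = 0; k < data.size(); ++k)
                tot += data[k];
        return tot;
    }
};
```

Calling it first on an object holding {1, 1} and then on one holding {2, 2, 2} returns 2 both times: the second object never gets its own sum of 6. A vector whose sum is exactly 0 would also be recomputed on every call.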
I read that using a policy class for a function that will be called in a tight loop is much faster than using a polymorphic function. However, I set up this demo and the timing indicates that it is exactly the opposite!? The policy version takes 2-3x longer than the polymorphic version.
#include <iostream>
#include <boost/timer.hpp>
// Policy version
template < typename operation_policy>
class DoOperationPolicy : public operation_policy
{
using operation_policy::Operation;
public:
void Run(const float a, const float b)
{
Operation(a,b);
}
};
class OperationPolicy_Add
{
protected:
float Operation(const float a, const float b)
{
return a + b;
}
};
// Polymorphic version
class DoOperation
{
public:
virtual float Run(const float a, const float b)= 0;
};
class OperationAdd : public DoOperation
{
public:
float Run(const float a, const float b)
{
return a + b;
}
};
int main()
{
boost::timer timer;
unsigned int numberOfIterations = 1e7;
DoOperationPolicy<OperationPolicy_Add> policy_operation;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
policy_operation.Run(1,2);
}
std::cout << timer.elapsed() << " seconds." << std::endl;
timer.restart();
DoOperation* polymorphic_operation = new OperationAdd;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
polymorphic_operation->Run(1,2);
}
std::cout << timer.elapsed() << " seconds." << std::endl;
}
Is there something wrong with the demo? Or is it just incorrect that the policy should be faster?
Your benchmark is meaningless (sorry).
Making real benchmarks is hard, unfortunately, as compilers are very clever.
Things to look for here:
devirtualization: the polymorphic call is expected to be slower because it is supposed to be virtual, but here the compiler can realize that polymorphic_operation is necessarily an OperationAdd and thus call OperationAdd::Run directly, without runtime dispatch
inlining: since the compiler has access to the method bodies, it can inline them and avoid the function calls altogether.
"dead store removal": values that are not used need not be stored, and the computations that lead to them and have no side effects can be skipped entirely.
Indeed, your entire benchmark code can be optimized to:
int main()
{
boost::timer timer;
std::cout << timer.elapsed() << " seconds." << std::endl;
timer.restart();
DoOperation* polymorphic_operation = new OperationAdd;
std::cout << timer.elapsed() << " seconds." << std::endl;
}
Which is when you realize that you are not timing what you'd like to...
In order to make your benchmark meaningful you need to:
prevent devirtualization
force side-effects
To prevent devirtualization, just declare a DoOperation& Get() function, and then in another cpp file: DoOperation& Get() { static OperationAdd O; return O; }.
To force side-effects (only necessary if the methods are inlined): return the value and accumulate it, then display it.
In action using this program:
// test2.cpp
namespace so8746025 {
class DoOperation
{
public:
virtual float Run(const float a, const float b) = 0;
};
class OperationAdd : public DoOperation
{
public:
float Run(const float a, const float b)
{
return a + b;
}
};
class OperationAddOutOfLine: public DoOperation
{
public:
float Run(const float a, const float b);
};
float OperationAddOutOfLine::Run(const float a, const float b)
{
return a + b;
}
DoOperation& GetInline() {
static OperationAdd O;
return O;
}
DoOperation& GetOutOfLine() {
static OperationAddOutOfLine O;
return O;
}
} // namespace so8746025
// test.cpp
#include <iostream>
#include <boost/timer.hpp>
namespace so8746025 {
// Policy version
template < typename operation_policy>
struct DoOperationPolicy
{
float Run(const float a, const float b)
{
return operation_policy::Operation(a,b);
}
};
struct OperationPolicy_Add
{
static float Operation(const float a, const float b)
{
return a + b;
}
};
// Polymorphic version
class DoOperation
{
public:
virtual float Run(const float a, const float b) = 0;
};
class OperationAdd : public DoOperation
{
public:
float Run(const float a, const float b)
{
return a + b;
}
};
class OperationAddOutOfLine: public DoOperation
{
public:
float Run(const float a, const float b);
};
DoOperation& GetInline();
DoOperation& GetOutOfLine();
} // namespace so8746025
using namespace so8746025;
int main()
{
unsigned int numberOfIterations = 1e8;
DoOperationPolicy<OperationPolicy_Add> policy;
OperationAdd stackInline;
DoOperation& virtualInline = GetInline();
OperationAddOutOfLine stackOutOfLine;
DoOperation& virtualOutOfLine = GetOutOfLine();
boost::timer timer;
float result = 0;
for(unsigned int i = 0; i < numberOfIterations; ++i) {
result += policy.Run(1,2);
}
std::cout << "Policy: " << timer.elapsed() << " seconds (" << result << ")" << std::endl;
timer.restart();
result = 0;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
result += stackInline.Run(1,2);
}
std::cout << "Stack Inline: " << timer.elapsed() << " seconds (" << result << ")" << std::endl;
timer.restart();
result = 0;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
result += virtualInline.Run(1,2);
}
std::cout << "Virtual Inline: " << timer.elapsed() << " seconds (" << result << ")" << std::endl;
timer.restart();
result = 0;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
result += stackOutOfLine.Run(1,2);
}
std::cout << "Stack Out Of Line: " << timer.elapsed() << " seconds (" << result << ")" << std::endl;
timer.restart();
result = 0;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
result += virtualOutOfLine.Run(1,2);
}
std::cout << "Virtual Out Of Line: " << timer.elapsed() << " seconds (" << result << ")" << std::endl;
}
We get:
$ gcc --version
gcc (GCC) 4.3.2
$ ./testR
Policy: 0.17 seconds (6.71089e+07)
Stack Inline: 0.17 seconds (6.71089e+07)
Virtual Inline: 0.52 seconds (6.71089e+07)
Stack Out Of Line: 0.6 seconds (6.71089e+07)
Virtual Out Of Line: 0.59 seconds (6.71089e+07)
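Note the difference between devirtualization plus inlining (Stack Inline matches Policy at 0.17 seconds) and the absence of devirtualization (Virtual Inline, 0.52 seconds). When the call cannot be inlined, devirtualization makes almost no difference (0.6 versus 0.59 seconds).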
FWIW, I made it:
a policy, as opposed to a mixin
return the value
use a volatile to keep the loop from being optimized away and to prevent unrelated optimizations of the loop (such as fewer loads/stores due to loop unrolling, or vectorization on targets that support it)
compare with a direct, static function call
use way more iterations
compile with -O3 on gcc
Timings are:
DoDirect: 3.4 seconds.
Policy: 3.41 seconds.
Polymorphic: 3.4 seconds.
Ergo: there is no difference, mainly because GCC is able to statically determine that the DoOperation* points to an OperationAdd, so there is no vtable lookup inside the loop :)
IMPORTANT
If you want to benchmark the real-life performance of this exact loop, rather than function invocation overhead, drop the volatile. The timings then become:
DoDirect: 6.71089e+07 in 1.12 seconds.
Policy: 6.71089e+07 in 1.15 seconds.
Polymorphic: 6.71089e+07 in 3.38 seconds.
As you can see, without volatile the compiler is able to optimize some load-store cycles away; I assume it is doing loop unrolling plus register allocation there (however, I haven't inspected the machine code). The point is that the loop as a whole can be optimized much more with the 'policy' approach than with dynamic dispatch (i.e. the virtual method).
CODE
#include <iostream>
#include <boost/timer.hpp>
// Direct version
struct DoDirect {
static float Run(const float a, const float b) { return a + b; }
};
// Policy version
template <typename operation_policy>
struct DoOperationPolicy {
float Run(const float a, const float b) const {
return operation_policy::Operation(a,b);
}
};
struct OperationPolicy_Add {
static float Operation(const float a, const float b) {
return a + b;
}
};
// Polymorphic version
struct DoOperation {
virtual float Run(const float a, const float b) const = 0;
};
struct OperationAdd : public DoOperation {
float Run(const float a, const float b) const { return a + b; }
};
int main(int argc, const char *argv[])
{
boost::timer timer;
const unsigned long numberOfIterations = 1ul << 30;
volatile float result = 0;
for(unsigned long i = 0; i < numberOfIterations; ++i) {
result += DoDirect::Run(1,2);
}
std::cout << "DoDirect: " << result << " in " << timer.elapsed() << " seconds." << std::endl;
timer.restart();
result = 0; // reset so every loop starts from the same value
DoOperationPolicy<OperationPolicy_Add> policy_operation;
for(unsigned long i = 0; i < numberOfIterations; ++i) {
result += policy_operation.Run(1,2);
}
std::cout << "Policy: " << result << " in " << timer.elapsed() << " seconds." << std::endl;
timer.restart();
result = 0;
DoOperation* polymorphic_operation = new OperationAdd;
for(unsigned long i = 0; i < numberOfIterations; ++i) {
result += polymorphic_operation->Run(1,2);
}
std::cout << "Polymorphic: " << result << " in " << timer.elapsed() << " seconds." << std::endl;
}
Turn on optimisation. The policy-based variant benefits greatly from it because most intermediate steps are optimised out completely, while the polymorphic version cannot skip, for example, the dereferencing of the object.
You have to turn on optimization, and make sure that:
both code parts actually do the same thing (which they currently do not: your policy variant does not return the result)
the result is used for something, so that the compiler does not discard the code path altogether (summing the results and printing them somewhere should be enough)
I had to change your policy code to return the computed value:
float Run(const float a, const float b)
{
return Operation(a,b);
}
Secondly, I had to store the returned value to ensure that the loop wouldn't be optimized away:
int main()
{
unsigned int numberOfIterations = 1e9;
float answer = 0.0;
boost::timer timer;
DoOperationPolicy<OperationPolicy_Add> policy_operation;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
answer += policy_operation.Run(1,2);
}
std::cout << "Policy got " << answer << " in " << timer.elapsed() << " seconds" << std::endl;
answer = 0.0;
timer.restart();
DoOperation* polymorphic_operation = new OperationAdd;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
answer += polymorphic_operation->Run(1,2);
}
std::cout << "Polymo got " << answer << " in " << timer.elapsed() << " seconds" << std::endl;
return 0;
}
Without optimizations on g++ 4.1.2:
Policy got 6.71089e+07 in 13.75 seconds
Polymo got 6.71089e+07 in 7.52 seconds
With -O3 on g++ 4.1.2:
Policy got 6.71089e+07 in 1.18 seconds
Polymo got 6.71089e+07 in 3.23 seconds
So the policy is definitely faster once optimizations are turned on.