Why do function pointers perform better than virtual methods - c++

I did some profiling using this code:
#include "Timer.h"
#include <iostream>
enum class BackendAPI {
B_API_NONE,
B_API_VULKAN,
B_API_DIRECTX_12,
B_API_WEB_GPU,
};
namespace Functional
{
typedef void* VertexBufferHandle;
namespace Vulkan
{
struct VulkanVertexBuffer {};
VertexBufferHandle CreateVertexBuffer(size_t size)
{
return nullptr;
}
__forceinline void Hello() {}
__forceinline void Bello() {}
__forceinline void Mello() {}
}
class RenderBackend {
public:
RenderBackend() {}
~RenderBackend() {}
void SetupBackendMethods(BackendAPI api)
{
switch (api)
{
case BackendAPI::B_API_VULKAN:
{
CreateVertexBuffer = Vulkan::CreateVertexBuffer;
Hello = Vulkan::Hello;
Bello = Vulkan::Bello;
Mello = Vulkan::Mello;
}
break;
case BackendAPI::B_API_DIRECTX_12:
break;
case BackendAPI::B_API_WEB_GPU:
break;
default:
break;
}
}
VertexBufferHandle(*CreateVertexBuffer)(size_t size) = nullptr;
void (*Hello)() = nullptr;
void (*Bello)() = nullptr;
void (*Mello)() = nullptr;
};
}
namespace ObjectOriented
{
struct VertexBuffer {};
class RenderBackend {
public:
RenderBackend() {}
virtual ~RenderBackend() {}
virtual VertexBuffer* CreateVertexBuffer(size_t size) = 0;
virtual void Hello() = 0;
virtual void Bello() = 0;
virtual void Mello() = 0;
};
class VulkanBackend final : public RenderBackend {
struct VulkanVertexBuffer : public VertexBuffer {};
public:
VulkanBackend() {}
~VulkanBackend() {}
__forceinline virtual VertexBuffer* CreateVertexBuffer(size_t size) override
{
return nullptr;
}
__forceinline virtual void Hello() override {}
__forceinline virtual void Bello() override {}
__forceinline virtual void Mello() override {}
};
RenderBackend* CreateBackend(BackendAPI api)
{
switch (api)
{
case BackendAPI::B_API_VULKAN:
return new VulkanBackend;
break;
case BackendAPI::B_API_DIRECTX_12:
break;
case BackendAPI::B_API_WEB_GPU:
break;
default:
break;
}
return nullptr;
}
}
int main()
{
constexpr int maxItr = 1000000;
for (int i = 0; i < 100; i++)
{
int counter = maxItr;
Timer t;
auto pBackend = ObjectOriented::CreateBackend(BackendAPI::B_API_VULKAN);
while (counter--)
{
pBackend->Hello();
pBackend->Bello();
pBackend->Mello();
auto pRef = pBackend->CreateVertexBuffer(100);
}
delete pBackend;
}
std::cout << "\n";
for (int i = 0; i < 100; i++)
{
int counter = maxItr;
Timer t;
{
Functional::RenderBackend backend;
backend.SetupBackendMethods(BackendAPI::B_API_VULKAN);
while (counter--)
{
backend.Hello();
backend.Bello();
backend.Mello();
auto pRef = backend.CreateVertexBuffer(100);
}
}
}
}
where Timer.h is:
#pragma once
#include <chrono>
#include <cstdio> // printf
/**
* Timer class.
* This calculates the total time taken from creation till the termination of the object.
*/
class Timer {
public:
/**
* Default constructor.
*/
Timer()
{
// Set the time point at the creation of the object.
startPoint = std::chrono::high_resolution_clock::now();
}
/**
* Default destructor.
*/
~Timer()
{
// Get the time point of the time of the object's termination.
auto endPoint = std::chrono::high_resolution_clock::now();
// Convert time points.
long long start = std::chrono::time_point_cast<std::chrono::microseconds>(startPoint).time_since_epoch().count();
long long end = std::chrono::time_point_cast<std::chrono::microseconds>(endPoint).time_since_epoch().count();
// Print the time to the console.
printf("Time taken: %15I64d\n", static_cast<__int64>(end - start));
}
private:
std::chrono::time_point<std::chrono::high_resolution_clock> startPoint; // The start time point.
};
After plotting the output in a graph (compiled using the Release configuration in Visual Studio 2019), the results are as follows:
Note: The above code was written to profile the performance difference between a functional and an object-oriented approach when building a large-scale library. The profiling was done by running the application 5 times, recompiling the source code. Each run has 100 iterations. The tests were done both ways (object oriented first, functional second, and vice versa), but the performance results are more or less the same.
I am aware that inheritance is somewhat slow because it has to resolve the function pointers from the V-Table at runtime. But the part which I don't understand is, if I'm correct, function pointers are also resolved at runtime. Which means that the program needs to fetch the function code prior to executing it.
So my questions are:
Why do function pointers perform somewhat better than virtual methods?
Why do the virtual methods have performance drops at some points while the function pointers are fairly stable?
Thank You!

Virtual method lookup tables need to be accessed (basically) every time the method is called. It adds another indirection to every call.
When you initialize a backend and then save the function pointers you essentially take out this extra indirection and pre-compute it once at the start.
It is thus not a surprise to see a small performance benefit from direct function pointers.
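The difference in indirection can be made concrete with a hand-written model. This is only a sketch of how compilers typically lay out virtual dispatch, not something the C++ standard mandates; `VTable`, `VirtualLike`, `PointerBackend`, and `run_demo` are hypothetical names introduced for illustration.

```cpp
#include <cassert>

static int g_calls = 0;
static void HelloImpl() { ++g_calls; }

// (1) Model of a virtual call: object -> table -> function.
struct VTable { void (*hello)(); };
static const VTable kVulkanTable{ &HelloImpl };
struct VirtualLike { const VTable* vptr = &kVulkanTable; };

// (2) The function-pointer backend from the question: object -> function.
struct PointerBackend { void (*hello)() = &HelloImpl; };

int run_demo() {
    VirtualLike v;
    PointerBackend p;
    v.vptr->hello();  // two dependent loads: the table pointer, then the slot
    p.hello();        // one load: the pointer stored in the object itself
    return g_calls;   // both paths reached HelloImpl
}
```

Both calls are still indirect, so the optimizer generally cannot inline either one. The measured gap is therefore expected to be small and dependent on the compiler and the cache state.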

How to use polymorphism to execute command on objects, which have no common base class?

I am receiving commands through JSON, which I insert into a pipe. For this reason they must have the same base class.
The pipe is read by a pipe handler, some commands are consumed by the pipe handler, others have to be passed down to a device, which is a member of the pipe handler. I could simply do this:
class Command{};
class HandlerCommand : public Command {
void execute(Handler* h);
};
class DeviceCommand : public Command {
void execute(Device* d);
};
Command* c = pipe.receive();
if (const auto hc = dynamic_cast<const HandlerCommand*>(c)) { hc->execute(handlerptr); }
else if (const auto dc = dynamic_cast<const DeviceCommand*>(c)) { dc->execute(deviceptr); }
Device and pipe handler should not have the same base: they have no common methods or fields, and they are conceptually different.
Is there a way to avoid using dynamic_cast here? I was thinking maybe there is some neat design pattern for this, but couldn't quite come up with a better solution.
EDIT: did not derive DeviceCommand and HandlerCommand from command, fixed this.
You cannot use polymorphism across two things that have nothing in common; you need a shared base class or interface, which in your case is Command. The base class needs a pure virtual function that the derived classes must implement. I will use a Command* clone() const prototype, which could be very useful later on. Please also give the base class a virtual destructor; otherwise, deleting a command through a Command* is undefined behavior, and the resulting memory error can be painful to track down. Note that, because your dynamic_cast yields a pointer to const, the member function execute must be const. You may try this:
#include <iostream>
#include <vector>
class Handler
{
public:
Handler(){}
};
class Device
{
public:
Device(){}
};
enum class CommandType{Handler,Devise};
class Command
{
public:
virtual ~Command(){}
virtual Command*clone()const = 0;
virtual CommandType getType()const = 0;
};
class HandlerCommand : public Command {
public:
HandlerCommand():Command(){}
void execute(Handler* h) const
{
std::cout << __FUNCTION__<<"\n";
}
HandlerCommand*clone()const { return new HandlerCommand(*this); }
CommandType getType()const { return CommandType::Handler; }
};
class DeviceCommand : public Command{
public:
DeviceCommand():Command(){}
void execute(Device* d)const
{
std::cout << __FUNCTION__<<"\n";
}
DeviceCommand*clone()const { return new DeviceCommand(*this); }
CommandType getType()const { return CommandType::Devise; }
};
int main()
{
Device dev;
Handler handler;
std::vector<Command*> pipe{ new HandlerCommand(), new DeviceCommand() };
while (!pipe.empty())
{
Command* c = pipe.back();
if (c->getType() == CommandType::Handler) { static_cast<const HandlerCommand*>(c)->execute(&handler); }
else if (c->getType() == CommandType::Devise ) { static_cast<const DeviceCommand*>(c)->execute(&dev); }
delete c;
pipe.pop_back();
}
std::cin.get();
}
outputs:
DeviceCommand::execute
HandlerCommand::execute
Version 2.0 uses std::variant, so you will need at least C++17 to compile it. In this version each pipe holds only one of the two command types from the variant, so no casting is needed, but two pipes are required. To keep the commands ordered across both pipes, I introduced a time stamp variable.
#include <iostream>
#include <vector>
#include <variant>
class Handler
{
public:
Handler() {}
};
class Device
{
public:
Device() {}
};
class HandlerCommand {
int ts;
public:
HandlerCommand(int _ts):ts(_ts) {}
void execute(Handler* h) const
{
std::cout << ts << ": "<< __FUNCTION__ << "\n";
}
int timeStamp()const { return ts; }
};
class DeviceCommand {
int ts;
public:
DeviceCommand(int _ts) :ts(_ts) {}
void execute(Device* d)const
{
std::cout << ts << ": " << __FUNCTION__ << "\n";
}
int timeStamp()const { return ts; }
};
using Command = std::variant<HandlerCommand, DeviceCommand>;
int main()
{
Device dev;
Handler handler;
std::vector<Command> hcPipe{HandlerCommand(2),HandlerCommand(5)};
std::vector<Command> dcPipe{DeviceCommand(1),DeviceCommand(4)};
Command single = DeviceCommand(0);
if (single.index() == 0)
{
std::get<HandlerCommand>(single).execute(&handler);
}
else
{
std::get<DeviceCommand>(single).execute(&dev);
}
while (!hcPipe.empty() || !dcPipe.empty())
{
if (!hcPipe.empty() && (dcPipe.empty() || std::get<HandlerCommand>(hcPipe.front()).timeStamp() < std::get<DeviceCommand>(dcPipe.front()).timeStamp()))
{
std::get<HandlerCommand>(hcPipe.front()).execute(&handler);
hcPipe.erase(hcPipe.begin());
}
else
{
std::get<DeviceCommand>(dcPipe.front()).execute(&dev);
dcPipe.erase(dcPipe.begin());
}
}
std::cin.get();
}
outputs:
0: DeviceCommand::execute
1: DeviceCommand::execute
2: HandlerCommand::execute
4: DeviceCommand::execute
5: HandlerCommand::execute
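As a possible variation on the std::variant version, a `std::vector` of variants can also hold both alternatives in one container, and `std::visit` can route each command to its target. The sketch below assumes C++17; `Dispatcher`, `drain`, and the `log` members are hypothetical names added so that the routing can be observed.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

struct Handler { std::vector<std::string> log; };
struct Device  { std::vector<std::string> log; };

struct HandlerCommand {
    int ts;
    void execute(Handler* h) const { h->log.push_back("handler@" + std::to_string(ts)); }
};
struct DeviceCommand {
    int ts;
    void execute(Device* d) const { d->log.push_back("device@" + std::to_string(ts)); }
};
using Command = std::variant<HandlerCommand, DeviceCommand>;

// Visitor that routes each alternative to its own target; no casts needed.
struct Dispatcher {
    Handler* handler;
    Device* device;
    void operator()(const HandlerCommand& c) const { c.execute(handler); }
    void operator()(const DeviceCommand& c) const { c.execute(device); }
};

// Processes one pipe that mixes both command types, in arrival order.
inline void drain(const std::vector<Command>& pipe, Handler& h, Device& d) {
    for (const Command& c : pipe) std::visit(Dispatcher{&h, &d}, c);
}
```

With a single mixed pipe the arrival order is preserved by the container itself, so the time stamp is only needed if commands must be reordered.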

How can I share expensive computations among classes?

As an example, I have this case, in which the classes A and B perform the same expensive calculation, the function expensiveFunction. This function is "pure", in that I can guarantee that it will give the same result given the same input. The client may use both classes (or more similar classes) with the same input, and I would like the expensive function to be calculated only once. However, the client may also use only one class for a given input.
Code example:
class A {
public:
A(const InputData& input) {
res = expensiveFunction(input);
}
void foo(); //Use the expensive result
private:
ExpensiveResult res;
};
class B {
public:
B(const InputData& input) {
res = expensiveFunction(input); //Same function as in A
}
double bar(); //Use the expensive result
private:
ExpensiveResult res;
};
int main() {
//Get some input
//...
A a(input);
B b(input);
//Do stuff with a and b
//More input
A a2(otherInput);
//...
}
In some languages, thanks to referential transparency and memoization, the result can safely be computed only once for a given input.
What I have thought of is using some sort of factory method/class, or giving the A and B classes a function object/functor/suspension that stores the result.
What are some good design ideas to solve this problem?
I own all of the code, so I can change the client or the service classes if necessary.
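You can memoize inside the function itself:
// Requires <map> and <utility>; CInput must be usable as a map key (operator<).
COutput expensive(CInput input) {
    static std::map<CInput, COutput> memoized_result;
    auto resit = memoized_result.find(input);
    if (resit == memoized_result.end()) {
        // Not cached yet: do the calculation once and store the result.
        COutput output = expensiveCalculation(input);
        // insert() returns a pair<iterator, bool>; keep the iterator.
        resit = memoized_result.insert(std::make_pair(input, output)).first;
    }
    return resit->second;
}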
The result of your computation is stored in the static map (memoized_result), and persisted between function calls.
If input is too expensive to use as a key in the map, you can create a separate class for handling computation result, and share it between all clients:
#include <memory>
using namespace std;
class ExpensiveResult {
public:
ExpensiveResult(int input) {
out_ = input+1;
}
int out_;
};
class BaseCompResultUser {
public:
BaseCompResultUser(const std::shared_ptr<ExpensiveResult>& res) {
res_ = res;
}
private:
std::shared_ptr<ExpensiveResult> res_;
};
class A : public BaseCompResultUser {
public:
A(const std::shared_ptr<ExpensiveResult>& r) : BaseCompResultUser(r) { }
};
class B : public BaseCompResultUser {
public:
B(const std::shared_ptr<ExpensiveResult>& r) : BaseCompResultUser(r) { }
};
int main() {
std::shared_ptr<ExpensiveResult> res(new ExpensiveResult(1));
A a(res);
B b(res);
return 0;
}
This will force sharing computation result between objects.
I think that the object-oriented way of solving it is for the expensiveFunction to be a member function of InputData (or some wrapper of InputData) and then your problem pretty much goes away. You just make ExpensiveResult a mutable cache in InputData:
class InputData {
private:
mutable std::shared_ptr<ExpensiveResult> result_;
public:
InputData() : result_(nullptr) {}
std::shared_ptr<ExpensiveResult> expensiveFunction() const {
if (!result_) {
// calculate expensive result...
result_ = std::make_shared<ExpensiveResult>();
}
return result_;
}
};
The expensive calculation is only done the first time expensiveFunction is called. You might have to add some locking if this is being called in a multi-threaded way.
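One way the locking mentioned above might look is sketched here with a `std::mutex` guarding the cache. The `seed_` member, the doubling used as a stand-in for the real calculation, and the `computations()` counter are assumptions added for illustration only.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

struct ExpensiveResult { int value; };  // placeholder payload

class InputData {
public:
    explicit InputData(int seed) : seed_(seed) {}

    // Returns the cached result, computing it under the lock on first use.
    std::shared_ptr<const ExpensiveResult> expensiveFunction() const {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!result_) {
            ++computations_;  // instrumentation for the demo only
            result_ = std::make_shared<ExpensiveResult>(ExpensiveResult{seed_ * 2});
        }
        return result_;
    }

    // How many times the expensive path actually ran.
    int computations() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return computations_;
    }

private:
    int seed_;
    mutable std::mutex mutex_;
    mutable std::shared_ptr<const ExpensiveResult> result_;
    mutable int computations_ = 0;
};
```

Holding the lock during the calculation keeps the code simple, at the cost of blocking other callers for the same input until the first computation finishes.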
If expensiveFunction does the same thing in A and B, it hardly seems like a true member of either. Why not a free function?
int main() {
//Get some input
//...
auto res = expensiveFunction(input);
A a(res);
B b(res);
//Do stuff with a and b
//...
}

Modifying priority_queue of unique_ptrs

I am trying to write an event-driven simulation in C++. Right now it's just a bare-bones priority queue of unique_ptrs to base Event class:
class Event
{
public:
double time;
Event(double time);
virtual void handle() = 0;
};
struct EventCompare
{
bool operator()(std::unique_ptr<Event> e1, std::unique_ptr<Event> e2) {
return e1->time > e2->time;
}
};
class DumpSimulationEvent : public Event
{
public:
DumpSimulationEvent(const double time);
void handle();
};
typedef std::priority_queue<std::unique_ptr<Event>, std::vector<std::unique_ptr<Event>>, EventCompare> EventQueue;
class Simulation
{
double time;
EventQueue eventQueue;
public:
Simulation();
void run();
};
Event::Event(const double t)
{
time = t;
}
DumpSimulationEvent::DumpSimulationEvent(const double t) : Event(t)
{
}
void DumpSimulationEvent::handle()
{
std::cout << "Event time: " << time;
}
Simulation::Simulation()
{
time = 0;
eventQueue = EventQueue();
std::unique_ptr<DumpSimulationEvent> dumpEvent5(new DumpSimulationEvent(5));
//eventQueue.emplace(dumpEvent5);
}
void Simulation::run()
{
while (!eventQueue.empty()) {
std::unique_ptr<Event> currentEvent = std::move(eventQueue.top());
//eventQueue.pop();
time += currentEvent->time;
currentEvent->handle();
}
}
Main function (not shown above) just creates an instance of Simulation and calls the run() method. Problem is that uncommenting either emplace() or pop() results in
error C2280: 'std::unique_ptr<Event,std::default_delete<_Ty>>::unique_ptr(const std::unique_ptr<_Ty,std::default_delete<_Ty>> &)' : attempting to reference a deleted function c:\program files (x86)\microsoft visual studio 12.0\vc\include\xutility 521 1
Research indicates that the most likely cause is an attempt to copy a unique_ptr. I am, however, at a loss as to whether that is the actual reason, and whether it actually happens at the commented lines or just becomes visible there. Adding std::move to the emplace argument doesn't seem to help.
Your problem is that you are not moving things correctly, but you are trying to make copies in several places.
Here is a diff that makes your code work, with some commentary:
struct EventCompare
{
- bool operator()(std::unique_ptr<Event> e1, std::unique_ptr<Event> e2) {
+ bool operator()(std::unique_ptr<Event> const &e1, std::unique_ptr<Event> const &e2) {
return e1->time > e2->time;
}
};
Here, as juanchopanza mentioned in his answer, you have to take std::unique_ptrs by reference, not by value, otherwise you are asking the compiler to make copies for you, which is not allowed.
time = 0;
eventQueue = EventQueue();
std::unique_ptr<DumpSimulationEvent> dumpEvent5(new DumpSimulationEvent(5));
- //eventQueue.emplace(dumpEvent5);
+ eventQueue.emplace(std::move(dumpEvent5));
}
In the above code, you have to MOVE your std::unique_ptr into the queue. Emplace doesn't magically move things, it just forwards arguments to the constructor. Without std::move here, you are asking to make a copy. You could have also just written: eventQueue.emplace(new DumpSimulationEvent(5)); and skipped the intermediate object.
while (!eventQueue.empty()) {
- std::unique_ptr<Event> currentEvent = std::move(eventQueue.top());
- //eventQueue.pop();
+ std::unique_ptr<Event> currentEvent(std::move(const_cast<std::unique_ptr<Event>&>(eventQueue.top())));
+ eventQueue.pop();
time += currentEvent->time;
currentEvent->handle();
Finally, in the above code, you are trying to move from eventQueue.top(), but you can't move from a const reference, which is what top() returns. If you want to force the move to work, you have to use both const_cast and std::move() as shown above.
Here is the complete modified code which compiles fine here with g++-4.8 -std=c++11:
#include <memory>
#include <queue>
#include <iostream>
class Event
{
public:
double time;
Event(double time);
virtual void handle() = 0;
};
struct EventCompare
{
bool operator()(std::unique_ptr<Event> const &e1, std::unique_ptr<Event> const &e2) {
return e1->time > e2->time;
}
};
class DumpSimulationEvent : public Event
{
public:
DumpSimulationEvent(const double time);
void handle();
};
typedef std::priority_queue<std::unique_ptr<Event>, std::vector<std::unique_ptr<Event>>, EventCompare> EventQueue;
class Simulation
{
double time;
EventQueue eventQueue;
public:
Simulation();
void run();
};
Event::Event(const double t)
{
time = t;
}
DumpSimulationEvent::DumpSimulationEvent(const double t) : Event(t)
{
}
void DumpSimulationEvent::handle()
{
std::cout << "Event time: " << time;
}
Simulation::Simulation()
{
time = 0;
eventQueue = EventQueue();
std::unique_ptr<DumpSimulationEvent> dumpEvent5(new DumpSimulationEvent(5));
eventQueue.emplace(std::move(dumpEvent5));
}
void Simulation::run()
{
while (!eventQueue.empty()) {
std::unique_ptr<Event> currentEvent(std::move(const_cast<std::unique_ptr<Event>&>(eventQueue.top())));
eventQueue.pop();
time += currentEvent->time;
currentEvent->handle();
}
}
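If the const_cast is a concern, one alternative is to manage the heap directly with `std::push_heap` and `std::pop_heap` on a vector, which makes moving the top element out well-defined. This is a sketch, not part of the answer above; `EventHeap` and `LaterFirst` are hypothetical names, and `Event` is reduced to its time stamp.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

struct Event {
    double time;
    explicit Event(double t) : time(t) {}
    virtual ~Event() = default;
};

// Orders the heap so that the earliest event ends up at the front.
struct LaterFirst {
    bool operator()(const std::unique_ptr<Event>& a,
                    const std::unique_ptr<Event>& b) const {
        return a->time > b->time;
    }
};

class EventHeap {
    std::vector<std::unique_ptr<Event>> items_;
public:
    bool empty() const { return items_.empty(); }

    void push(std::unique_ptr<Event> e) {
        items_.push_back(std::move(e));
        std::push_heap(items_.begin(), items_.end(), LaterFirst{});
    }

    // Moves the earliest event out; precondition: !empty().
    std::unique_ptr<Event> pop() {
        // pop_heap moves the front element to back() and restores the heap
        // over the remaining range, so back() can be moved from safely.
        std::pop_heap(items_.begin(), items_.end(), LaterFirst{});
        std::unique_ptr<Event> e = std::move(items_.back());
        items_.pop_back();
        return e;
    }
};
```

The trade-off is that the container adaptor's interface is re-implemented by hand, in exchange for never modifying an element through a const reference.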

How can I avoid a virtual call when I know the type?

Consider the following code snippet:
struct Base { virtual void func() { } };
struct Derived1 : Base { void func() override { print("1"); } };
struct Derived2 : Base { void func() override { print("2"); } };
class Manager {
std::vector<std::unique_ptr<Base>> items;
public:
template<class T> void add() { items.emplace_back(new T); }
void funcAll() { for(auto& i : items) i->func(); }
};
int main() {
Manager m;
m.add<Derived1>();
m.add<Derived2>();
m.funcAll(); // prints "1" and "2"
};
I'm using virtual dispatch in order to call the correct override method from a std::vector of polymorphic objects.
However, I know what type the polymorphic objects are, since I specify that in Manager::add<T>.
My idea was to avoid a virtual call by taking the address of the member function T::func() and directly storing it somewhere. However that's impossible, since I would need to store it as void* and cast it back in Manager::funcAll(), but I do not have type information at that moment.
My question is: it seems that in this situation I have more information than usual for polymorphism (the user specifies the derived type T in Manager::add<T>). Is there any way I can use this type information to prevent a seemingly unneeded virtual call? (A user should still be able to create their own classes that derive from Base in their code, however.)
"However, I know what type the polymorphic objects are, since I specify that in Manager::add<T>."
No you don't. Within add you know the type of the object that's being added; but you can add objects of different types, as you do in your example. There's no way for funcAll to statically determine the types of the elements unless you parametrise Manager to only handle one type.
If you did know the type, then you could call the function non-virtually:
i->T::func();
But, to reiterate, you can't determine the type statically here.
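For reference, the effect of a qualified call can be seen in a small sketch. `Special1`, `callStatic`, and `callVirtual` are hypothetical names, and `func` returns an int here only so the chosen override is observable.

```cpp
#include <cassert>

struct Base { virtual ~Base() = default; virtual int func() { return 0; } };
struct Derived1 : Base { int func() override { return 1; } };
struct Special1 : Derived1 { int func() override { return 11; } };

// Qualified call: always runs Derived1::func, with no virtual dispatch.
inline int callStatic(Derived1* p) { return p->Derived1::func(); }

// Ordinary call: dispatches on the dynamic type of *p.
inline int callVirtual(Derived1* p) { return p->func(); }
```

The qualified form ignores any further override, so it is only correct when the dynamic type is known to be exactly the named class.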
If I understand correctly, you want your add method, which knows the class of the object, to store the right function in your vector depending on that class.
Your vector would then contain just functions, with no further information about the objects.
In effect, you want to "resolve" the virtual call before it is invoked.
This may be worthwhile when the function is then called many times, because you avoid the overhead of resolving the virtual call each time.
So you may want to use a process similar to what "virtual" does, using a "virtual table" of your own.
Virtual dispatch is implemented at a low level, so it is fast compared to whatever you will come up with; again, the functions would have to be invoked a LOT of times before this pays off.
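A rough sketch of that idea: add<T> can record, next to each object, a plain function pointer whose body calls T::func with a qualified (non-virtual) call. `ThunkManager` and `Thunk` are hypothetical names, and the counters exist only to make the calls observable.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct Base { virtual ~Base() = default; virtual void func() {} };
struct Derived1 : Base { static int count; void func() override { ++count; } };
struct Derived2 : Base { static int count; void func() override { count += 10; } };
int Derived1::count = 0;
int Derived2::count = 0;

class ThunkManager {
    using Thunk = void (*)(Base*);
    std::vector<std::pair<std::unique_ptr<Base>, Thunk>> items;
public:
    template <class T> void add() {
        // A captureless lambda converts to a plain function pointer.
        // T is known here, so T::func is called without virtual dispatch.
        Thunk thunk = [](Base* b) { static_cast<T*>(b)->T::func(); };
        items.emplace_back(std::unique_ptr<Base>(new T), thunk);
    }
    void funcAll() {
        for (auto& i : items) i.second(i.first.get());
    }
};
```

Note that this replaces one indirect call (through the vtable) with another (through the stored pointer), so any gain is likely to be small and should be measured rather than assumed.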
One trick that can sometimes help in this kind of situation is to sort the vector by type (you should be able to use the knowledge of the type available in the add() function to enforce this) if the order of elements doesn't otherwise matter. If you are mostly going to be iterating over the vector in order calling a virtual function this will help the CPU's branch predictor predict the target of the call. Alternatively you can maintain separate vectors for each type in your manager and iterate over them in turn which has a similar effect.
Your compiler's optimizer can also help you with this kind of code, particularly if it supports Profile Guided Optimization (POGO). Compilers can de-virtualize calls in certain situations, or with POGO can do things in the generated assembly to help the CPU's branch predictor, like test for the most common types and perform a direct call for those with a fallback to an indirect call for the less common types.
Here are the results of a test program that illustrates the performance benefits of sorting by type. Manager is your version; Manager2 maintains a hash table of vectors indexed by typeid:
Derived1::count = 50043000, Derived2::count = 49957000
class Manager::funcAll took 714ms
Derived1::count = 50043000, Derived2::count = 49957000
class Manager2::funcAll took 274ms
Derived1::count = 50043000, Derived2::count = 49957000
class Manager2::funcAll took 273ms
Derived1::count = 50043000, Derived2::count = 49957000
class Manager::funcAll took 714ms
Test code:
#include <iostream>
#include <vector>
#include <memory>
#include <random>
#include <unordered_map>
#include <typeindex>
#include <chrono>
using namespace std;
using namespace std::chrono;
static const int instanceCount = 100000;
static const int funcAllIterations = 1000;
static const int numTypes = 2;
struct Base { virtual ~Base() = default; virtual void func() = 0; }; // virtual destructor: objects are deleted through unique_ptr<Base>
struct Derived1 : Base { static int count; void func() override { ++count; } };
int Derived1::count = 0;
struct Derived2 : Base { static int count; void func() override { ++count; } };
int Derived2::count = 0;
class Manager {
vector<unique_ptr<Base>> items;
public:
template<class T> void add() { items.emplace_back(new T); }
void funcAll() { for (auto& i : items) i->func(); }
};
class Manager2 {
unordered_map<type_index, vector<unique_ptr<Base>>> items;
public:
template<class T> void add() { items[type_index(typeid(T))].push_back(make_unique<T>()); }
void funcAll() {
for (const auto& type : items) {
for (auto& i : type.second) {
i->func();
}
}
}
};
template<typename Man>
void Test() {
mt19937 engine;
uniform_int_distribution<int> d(0, numTypes - 1);
Derived1::count = 0;
Derived2::count = 0;
Man man;
for (auto i = 0; i < instanceCount; ++i) {
switch (d(engine)) {
case 0: man.template add<Derived1>(); break; // 'template' is required because man has a dependent type
case 1: man.template add<Derived2>(); break;
}
}
auto startTime = high_resolution_clock::now();
for (auto i = 0; i < funcAllIterations; ++i) {
man.funcAll();
}
auto endTime = high_resolution_clock::now();
cout << "Derived1::count = " << Derived1::count << ", Derived2::count = " << Derived2::count << "\n"
<< typeid(Man).name() << "::funcAll took " << duration_cast<milliseconds>(endTime - startTime).count() << "ms" << endl;
}
int main() {
Test<Manager>();
Test<Manager2>();
Test<Manager2>();
Test<Manager>();
}

TBB task allocation assertion

I'm trying to traverse a tree via TBB tasks and continuations. The code is below. When I run the code it keeps aborting (frequently, although not always) with the following error:
Assertion t_next->state()==task::allocated failed on line 334 of file ../../src/tbb/custom_scheduler.h
Detailed description: if task::execute() returns task, it must be marked as allocated
What can be causing this problem?
template<class NodeVisitor>
void
traverse_tree(NodeVisitor& nv)
{
TreeTraversal<NodeVisitor>& tt = *(new(task::allocate_root()) TreeTraversal<NodeVisitor>(nv));
task::spawn_root_and_wait(tt);
}
template<class NodeVisitor>
class TreeTraversal: public task
{
public:
struct Continuation;
public:
TreeTraversal(NodeVisitor nv_):
nv(nv_) {}
task* execute()
{
nv.pre();
Continuation* c = new(allocate_continuation()) Continuation(nv);
c->set_ref_count(nv.size());
for (size_t i = 0; i < nv.size(); ++i)
{
TreeTraversal& tt = *(new(c->allocate_child()) TreeTraversal(nv.child(i)));
spawn(tt);
}
if (!nv.size())
return c;
return NULL;
}
private:
NodeVisitor nv;
};
template<class NodeVisitor>
class TreeTraversal<NodeVisitor>::Continuation: public task
{
public:
Continuation(NodeVisitor& nv_):
nv(nv_) {}
task* execute() { nv.post(); return NULL; }
private:
NodeVisitor nv;
};
I have not seen a task being allocated as a continuation and then returned from execute() before. That might be the reason for the assertion failure (update: an experiment showed it is not, see details below).
Meanwhile, you can change the code of TreeTraversal::execute() to be roughly this:
nv.pre();
if (!nv.size())
nv.post();
else {
// Do all the task manipulations
}
return NULL;
Update: the simplified test shown below worked well on my dual-core laptop. That makes me suspect memory corruption in your actual code, in which case the re-shuffling suggested above might just hide the issue rather than fix it.
#include "tbb/task.h"
using namespace tbb;
class T: public task {
public:
class Continuation: public task {
public:
Continuation() {}
task* execute() { return NULL; }
};
private:
size_t nv;
public:
T(size_t n): nv(n) {}
task* execute() {
Continuation* c = new(allocate_continuation()) Continuation();
c->set_ref_count(nv);
for (size_t i = 0; i < nv; ++i) {
T& tt = *(new(c->allocate_child()) T(nv-i-1));
spawn(tt);
}
return (nv==0)? c : NULL;
}
};
int main() {
T& t = *new( task::allocate_root() ) T(24);
task::spawn_root_and_wait(t);
return 0;
}