I'm trying to traverse a tree via TBB tasks and continuations; the code is below. When I run it, it keeps aborting (frequently, although not always) with the following error:
Assertion t_next->state()==task::allocated failed on line 334 of file ../../src/tbb/custom_scheduler.h
Detailed description: if task::execute() returns task, it must be marked as allocated
What could be causing this problem?
template<class NodeVisitor>
void
traverse_tree(NodeVisitor& nv)
{
    TreeTraversal<NodeVisitor>& tt = *(new(task::allocate_root()) TreeTraversal<NodeVisitor>(nv));
    task::spawn_root_and_wait(tt);
}
template<class NodeVisitor>
class TreeTraversal: public task
{
public:
    struct Continuation;

public:
    TreeTraversal(NodeVisitor nv_):
        nv(nv_) {}

    task* execute()
    {
        nv.pre();

        Continuation* c = new(allocate_continuation()) Continuation(nv);
        c->set_ref_count(nv.size());
        for (size_t i = 0; i < nv.size(); ++i)
        {
            TreeTraversal& tt = *(new(c->allocate_child()) TreeTraversal(nv.child(i)));
            spawn(tt);
        }

        if (!nv.size())
            return c;

        return NULL;
    }

private:
    NodeVisitor nv;
};
template<class NodeVisitor>
class TreeTraversal<NodeVisitor>::Continuation: public task
{
public:
    Continuation(NodeVisitor& nv_):
        nv(nv_) {}

    task* execute() { nv.post(); return NULL; }

private:
    NodeVisitor nv;
};
I have never before seen a task allocated as a continuation and then returned from execute(). That might be the reason for the assertion failure (update: an experiment showed it is not; see details below).
Meanwhile, you can change the code of TreeTraversal::execute() to be roughly this:
nv.pre();
if (!nv.size())
    nv.post();
else {
    // Do all the task manipulations
}
return NULL;
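Spelled out, a minimal sketch of the re-shuffled execute() might look like this (my expansion of the outline above, keeping the pre()/post()/size()/child(i) NodeVisitor interface implied by the question):

task* execute()
{
    nv.pre();
    if (!nv.size())
        nv.post();                          // leaf node: no continuation needed at all
    else
    {
        Continuation* c = new(allocate_continuation()) Continuation(nv);
        c->set_ref_count(nv.size());
        for (size_t i = 0; i < nv.size(); ++i)
            spawn(*new(c->allocate_child()) TreeTraversal(nv.child(i)));
    }
    return NULL;
}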
Update: the simplified test shown below worked well on my dual-core laptop. That makes me suspect memory corruption in your actual code, in which case the re-shuffling suggested above might just hide the issue rather than fix it.
#include "tbb/task.h"
using namespace tbb;
class T: public task {
public:
class Continuation: public task {
public:
Continuation() {}
task* execute() { return NULL; }
};
private:
size_t nv;
public:
T(size_t n): nv(n) {}
task* execute() {
Continuation* c = new(allocate_continuation()) Continuation();
c->set_ref_count(nv);
for (size_t i = 0; i < nv; ++i) {
T& tt = *(new(c->allocate_child()) T(nv-i-1));
spawn(tt);
}
return (nv==0)? c : NULL;
}
};
int main() {
T& t = *new( task::allocate_root() ) T(24);
task::spawn_root_and_wait(t);
return 0;
}
I am doing something like this with TBB:
#include <tbb/tbb.h>
#include <iostream>
#include <memory>
#include <atomic>

class base_creator
{
public:
    using node = tbb::flow::input_node<bool>;

    virtual ~base_creator() = default;

    base_creator()
    {
        m_kill = std::make_shared<std::atomic_bool>(false);
    };

    static tbb::flow::graph& g()
    {
        static tbb::flow::graph me_g;
        return me_g;
    };

    virtual std::shared_ptr<node> get_node() const = 0;

    template<typename Op>
    static tbb::flow::input_node<bool> build_node(const Op& op)
    {
        return tbb::flow::input_node<bool>(base_creator::g(), op);
    };

protected:
    mutable std::shared_ptr<std::atomic_bool> m_kill;
};

class creater : public base_creator
{
public:
    creater() = default;

public:
    virtual std::shared_ptr<node> get_node() const override
    {
        const std::shared_ptr<std::atomic_bool> flag = this->m_kill;
        auto op = [flag](tbb::flow_control& control) -> bool
        {
            if (flag->load(std::memory_order_relaxed))
                control.stop();
            return true;
        };
        node nd = base_creator::build_node(std::cref(op));
        return std::make_shared<node>(nd);
    };
};

int main(int argc, char* argv[])
{
    creater c;
    std::shared_ptr<base_creator::node> s = c.get_node();

    using my_func_node = std::shared_ptr<tbb::flow::function_node<bool, bool>>;
    my_func_node f = std::make_shared<tbb::flow::function_node<bool, bool>>(
        base_creator::g(), 1,
        [](const bool b) { std::cout << b; return b; });

    tbb::flow::make_edge(*s, *f);
};
The flag should always be false in this code, yet once I call tbb::flow::make_edge it becomes true when debugging the body of the node s. That is so weird! I have no clue; could you please help? I am starting to dig into the TBB code now and it's too complex :)
Pay attention to the following code. It creates a std::reference_wrapper over the local object op, which is then copied into the input_node by build_node. Therefore, when get_node returns, the std::reference_wrapper (which is inside the node, which is inside the shared_ptr) references an already destroyed object. Then make_edge reuses the stack and the flag pointer is overwritten with some other pointer that happens to point at a value of 72.
auto op = [flag](tbb::flow_control& control) -> bool
{
    if (flag->load(std::memory_order_relaxed))
        control.stop();
    return true;
};
node nd = base_creator::build_node(std::cref(op));
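A minimal sketch of the fix, under the assumption that the functor is cheap to copy: pass op itself rather than a reference wrapper, so build_node copies the lambda into the node, and the captured shared_ptr keeps the flag alive:

auto op = [flag](tbb::flow_control& control) -> bool
{
    if (flag->load(std::memory_order_relaxed))
        control.stop();
    return true;
};
node nd = base_creator::build_node(op);   // the lambda is copied into the node;
                                          // the shared_ptr capture outlives get_node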
When I push fifoGroundEvtEntry data into list_fifoGroundEvt from another thread using sender::GetInstance()->getDataCollector()->pushGroundEventFifo(entry); and I debug with a breakpoint inside the pushGroundEventFifo function, I can see the correct values of grdEvt.x and grdEvt.y inside list_fifoGroundEvt.
Then, when I call the test method inside the transmit method and put a breakpoint inside test, I see wrong values inside list_fifoGroundEvt: grdEvt.y = 0x00F12751 for entry.y = 2!
PS: transmit() runs on a thread that I start using sender::GetInstance()->start() (I didn't include all the functions, only those related to the problem).
The thread is started after the entries have been pushed into list_fifoGroundEvt.
#include "stdafx.h"
#include <iostream>
#include <list>
struct fifoEvtEntry {
virtual ~fifoEvtEntry() {}
int x;
};
struct fifoGroundEvtEntry : fifoEvtEntry
{
int y;
};
class collector {
public:
void pushGroundEventFifo(fifoEvtEntry& entry) {
if (fifoGroundEvtEntry* grdEvt = dynamic_cast<fifoGroundEvtEntry*>(&entry))
{
list_fifoGroundEvt.push_back(grdEvt);
}
}
void test() {
if (fifoGroundEvtEntry* grdEvt = dynamic_cast<fifoGroundEvtEntry*>(list_fifoGroundEvt.front()))
{
std::cout << grdEvt ->y << std::endl;
}
list_fifoGroundEvt.pop_front();
}
private:
std::list<fifoEvtEntry*> list_fifoGroundEvt;
};
class sender {
public:
sender(collector* data):_data(data) {};
~sender() {};
static void setInstance(collector* data) {
_instance = new sender(data);
}
static sender* GetInstance() {
return _instance;
}
void transmit() {
// this is a thread function
// ..
_data->test();
}
collector* getDataCollector(){
return _data;
}
static sender* _instance;
private:
collector* _data;
};
int main(){
return 0;
}
I did some profiling using this code
#include "Timer.h"
#include <iostream>
enum class BackendAPI {
B_API_NONE,
B_API_VULKAN,
B_API_DIRECTX_12,
B_API_WEB_GPU,
};
namespace Functional
{
typedef void* VertexBufferHandle;
namespace Vulkan
{
struct VulkanVertexBuffer {};
VertexBufferHandle CreateVertexBuffer(size_t size)
{
return nullptr;
}
__forceinline void Hello() {}
__forceinline void Bello() {}
__forceinline void Mello() {}
}
class RenderBackend {
public:
RenderBackend() {}
~RenderBackend() {}
void SetupBackendMethods(BackendAPI api)
{
switch (api)
{
case BackendAPI::B_API_VULKAN:
{
CreateVertexBuffer = Vulkan::CreateVertexBuffer;
Hello = Vulkan::Hello;
Bello = Vulkan::Bello;
Mello = Vulkan::Mello;
}
break;
case BackendAPI::B_API_DIRECTX_12:
break;
case BackendAPI::B_API_WEB_GPU:
break;
default:
break;
}
}
VertexBufferHandle(*CreateVertexBuffer)(size_t size) = nullptr;
void (*Hello)() = nullptr;
void (*Bello)() = nullptr;
void (*Mello)() = nullptr;
};
}
namespace ObjectOriented
{
    struct VertexBuffer {};

    class RenderBackend {
    public:
        RenderBackend() {}
        virtual ~RenderBackend() {}

        virtual VertexBuffer* CreateVertexBuffer(size_t size) = 0;
        virtual void Hello() = 0;
        virtual void Bello() = 0;
        virtual void Mello() = 0;
    };

    class VulkanBackend final : public RenderBackend {
        struct VulkanVertexBuffer : public VertexBuffer {};

    public:
        VulkanBackend() {}
        ~VulkanBackend() {}

        __forceinline virtual VertexBuffer* CreateVertexBuffer(size_t size) override
        {
            return nullptr;
        }

        __forceinline virtual void Hello() override {}
        __forceinline virtual void Bello() override {}
        __forceinline virtual void Mello() override {}
    };

    RenderBackend* CreateBackend(BackendAPI api)
    {
        switch (api)
        {
        case BackendAPI::B_API_VULKAN:
            return new VulkanBackend;
        case BackendAPI::B_API_DIRECTX_12:
            break;
        case BackendAPI::B_API_WEB_GPU:
            break;
        default:
            break;
        }
        return nullptr;
    }
}
int main()
{
    constexpr int maxItr = 1000000;

    for (int i = 0; i < 100; i++)
    {
        int counter = maxItr;
        Timer t;
        auto pBackend = ObjectOriented::CreateBackend(BackendAPI::B_API_VULKAN);
        while (counter--)
        {
            pBackend->Hello();
            pBackend->Bello();
            pBackend->Mello();
            auto pRef = pBackend->CreateVertexBuffer(100);
        }
        delete pBackend;
    }

    std::cout << "\n";

    for (int i = 0; i < 100; i++)
    {
        int counter = maxItr;
        Timer t;
        {
            Functional::RenderBackend backend;
            backend.SetupBackendMethods(BackendAPI::B_API_VULKAN);
            while (counter--)
            {
                backend.Hello();
                backend.Bello();
                backend.Mello();
                auto pRef = backend.CreateVertexBuffer(100);
            }
        }
    }
}
In which #include "Timer.h" is:
#pragma once

#include <chrono>
#include <cstdio>

/**
 * Timer class.
 * This calculates the total time taken from creation till the termination of the object.
 */
class Timer {
public:
    /**
     * Default constructor.
     */
    Timer()
    {
        // Set the time point at the creation of the object.
        startPoint = std::chrono::high_resolution_clock::now();
    }

    /**
     * Default destructor.
     */
    ~Timer()
    {
        // Get the time point of the object's termination.
        auto endPoint = std::chrono::high_resolution_clock::now();

        // Convert time points.
        long long start = std::chrono::time_point_cast<std::chrono::microseconds>(startPoint).time_since_epoch().count();
        long long end = std::chrono::time_point_cast<std::chrono::microseconds>(endPoint).time_since_epoch().count();

        // Print the time to the console.
        printf("Time taken: %15I64d\n", static_cast<__int64>(end - start));
    }

private:
    std::chrono::time_point<std::chrono::high_resolution_clock> startPoint; // The start time point.
};
I plotted the output in a graph (compiled using the Release configuration in Visual Studio 2019). The graph itself is not reproduced here; in short, the function-pointer version was consistently slightly faster, while the virtual-method version showed occasional performance drops.
Note: the above code is meant to profile the performance difference between a functional and an object-oriented approach when building a large-scale library. The profiling was done by running the application 5 times, recompiling the source code each time. Each run has 100 iterations. The tests were done both ways (object-oriented first, functional second, and vice versa), but the performance results are more or less the same.
I am aware that inheritance is somewhat slow because the function address has to be resolved through the v-table at runtime. But the part I don't understand is that, if I'm correct, function pointers are also resolved at runtime: the program still needs to fetch the function address before executing the call.
So my questions are:
Why do the function pointers perform somewhat better than the virtual methods?
Why do the virtual methods have performance drops at some points while the function pointers are somewhat stable?
Thank You!
Virtual method lookup tables need to be accessed (basically) every time the method is called, which adds another indirection to every call.
When you initialize a backend and then save the function pointers, you essentially take out this extra indirection and pre-compute it once at the start.
It is thus not a surprise to see a small performance benefit from direct function pointers.
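To make the extra indirection concrete, here is a rough hand-lowering of the two call paths as a sketch (my illustration, assuming a typical vtable implementation; the names like vulkan_hello are made up for the example):

#include <cstdio>

struct Backend;
using Fn = void (*)(Backend*);

struct VTable { Fn hello; };          // roughly what the compiler builds for a virtual class

struct Backend {
    const VTable* vptr;               // hidden pointer stored in every polymorphic object
};

void vulkan_hello(Backend*) { std::puts("hello"); }

static const VTable vulkan_vtable{ &vulkan_hello };

int main() {
    Backend b{ &vulkan_vtable };

    // Virtual dispatch written out by hand: load vptr, load the slot, then call.
    b.vptr->hello(&b);

    // The function-pointer style pre-computes the lookup once...
    Fn cached = b.vptr->hello;
    cached(&b);                       // ...so every later call needs one load fewer.
}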
I am receiving commands through JSON, which I insert into a pipe. For this reason they must have the same base class.
The pipe is read by a pipe handler; some commands are consumed by the pipe handler itself, others have to be passed down to a device, which is a member of the pipe handler. I could simply do this:
class Command {};

class HandlerCommand : public Command {
    void execute(Handler* h);
};

class DeviceCommand : public Command {
    void execute(Device* d);
};

Command* c = pipe.receive();
if (const auto hc = dynamic_cast<const HandlerCommand*>(c)) { hc->execute(handlerptr); }
else if (const auto dc = dynamic_cast<const DeviceCommand*>(c)) { dc->execute(deviceptr); }
Device and pipe handler should not have the same base, since they have no common methods or fields; they are conceptually different.
Is there a way to avoid using dynamic_cast here? I was thinking maybe there is some neat design pattern for this, but couldn't quite come up with a better solution.
EDIT: I originally did not derive DeviceCommand and HandlerCommand from Command; fixed this.
You cannot use polymorphism on two things which have nothing in common; you will need the same base class/interface, in your case Command. As mentioned above, your base class requires a pure virtual function that must be implemented by the derived classes. I will include a Command* clone() const prototype, which could be very useful later on. Please introduce a virtual destructor in your base class; otherwise tracking down the resulting memory errors can be a pain in the ass. Note that, regarding your dynamic_cast version, the member function execute must be const. You may try this:
#include <iostream>
#include <vector>

class Handler
{
public:
    Handler() {}
};

class Device
{
public:
    Device() {}
};

enum class CommandType { Handler, Device };

class Command
{
public:
    virtual ~Command() {}
    virtual Command* clone() const = 0;
    virtual CommandType getType() const = 0;
};

class HandlerCommand : public Command {
public:
    HandlerCommand(): Command() {}

    void execute(Handler* h) const
    {
        std::cout << __FUNCTION__ << "\n";
    }

    HandlerCommand* clone() const { return new HandlerCommand(*this); }
    CommandType getType() const { return CommandType::Handler; }
};

class DeviceCommand : public Command {
public:
    DeviceCommand(): Command() {}

    void execute(Device* d) const
    {
        std::cout << __FUNCTION__ << "\n";
    }

    DeviceCommand* clone() const { return new DeviceCommand(*this); }
    CommandType getType() const { return CommandType::Device; }
};

int main()
{
    Device dev;
    Handler handler;
    std::vector<Command*> pipe{ new HandlerCommand(), new DeviceCommand() };

    while (!pipe.empty())
    {
        Command* c = pipe.back();
        if (c->getType() == CommandType::Handler) { static_cast<const HandlerCommand*>(c)->execute(&handler); }
        else if (c->getType() == CommandType::Device) { static_cast<const DeviceCommand*>(c)->execute(&dev); }
        delete c;
        pipe.pop_back();
    }

    std::cin.get();
}
outputs:
DeviceCommand::execute
HandlerCommand::execute
Version 2.0, using std::variant. You will need at least C++17 to compile this. Note that, as written, each pipe container exclusively holds one of the two classes in the variant, so there is no casting any more, but you need two pipes. Because of that, I introduced a time-stamp variable to merge them in order.
#include <iostream>
#include <vector>
#include <variant>

class Handler
{
public:
    Handler() {}
};

class Device
{
public:
    Device() {}
};

class HandlerCommand {
    int ts;
public:
    HandlerCommand(int _ts): ts(_ts) {}

    void execute(Handler* h) const
    {
        std::cout << ts << ": " << __FUNCTION__ << "\n";
    }

    int timeStamp() const { return ts; }
};

class DeviceCommand {
    int ts;
public:
    DeviceCommand(int _ts): ts(_ts) {}

    void execute(Device* d) const
    {
        std::cout << ts << ": " << __FUNCTION__ << "\n";
    }

    int timeStamp() const { return ts; }
};

using Command = std::variant<HandlerCommand, DeviceCommand>;

int main()
{
    Device dev;
    Handler handler;

    std::vector<Command> hcPipe{ HandlerCommand(2), HandlerCommand(5) };
    std::vector<Command> dcPipe{ DeviceCommand(1), DeviceCommand(4) };

    Command single = DeviceCommand(0);
    if (single.index() == 0)
    {
        std::get<HandlerCommand>(single).execute(&handler);
    }
    else
    {
        std::get<DeviceCommand>(single).execute(&dev);
    }

    while (!hcPipe.empty() || !dcPipe.empty())
    {
        if (!hcPipe.empty() && (dcPipe.empty() ||
            std::get<HandlerCommand>(hcPipe.front()).timeStamp() < std::get<DeviceCommand>(dcPipe.front()).timeStamp()))
        {
            std::get<HandlerCommand>(hcPipe.front()).execute(&handler);
            hcPipe.erase(hcPipe.begin());
        }
        else
        {
            std::get<DeviceCommand>(dcPipe.front()).execute(&dev);
            dcPipe.erase(dcPipe.begin());
        }
    }

    std::cin.get();
}
outputs:
0: DeviceCommand::execute
1: DeviceCommand::execute
2: HandlerCommand::execute
4: DeviceCommand::execute
5: HandlerCommand::execute
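As an aside, a minimal sketch of a third option (my addition, reusing the HandlerCommand/DeviceCommand classes from version 2.0 above): with std::visit and an overload set, a single mixed pipe also works, without index() checks, std::get, or two containers.

// C++17 overload-set helper for std::visit.
template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;   // deduction guide

void drain(std::vector<Command>& pipe, Handler& handler, Device& dev)
{
    for (const Command& c : pipe)
        std::visit(overloaded{
            [&](const HandlerCommand& hc) { hc.execute(&handler); },
            [&](const DeviceCommand& dc)  { dc.execute(&dev); }
        }, c);
    pipe.clear();
}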
I have the following implementation of an interlocked singly linked list using C++11 atomics:
#include <atomic>

struct notag {};

template<class T, class Tag = notag>
struct s_list_base
{
};

template<class T, class Tag = notag>
struct s_list : s_list_base<T, Tag>
{
    s_list_base<T, Tag>* next_ptr;
};

template<bool auto_destruct, class T, class Tag = notag>
class atomic_s_list
{
    struct s_head : s_list_base<T, Tag>
    {
        std::atomic<s_list_base<T, Tag>*> next_ptr{ this };
    };

    using LinkType = s_list<T, Tag>*;

    s_head head;

public:
    atomic_s_list() = default;
    atomic_s_list(const atomic_s_list&) = delete;
    atomic_s_list& operator =(const atomic_s_list&) = delete;

    ~atomic_s_list()
    {
        clear();
    }

    void clear() noexcept
    {
        if (auto_destruct)
        {
            T* item;
            do
            {
                item = pop();
                delete item;
            } while (item);
        }
        else
            head.next_ptr = &head;
    }

    void push(T* pItem) noexcept
    {
        auto p = static_cast<LinkType>(pItem);
        auto phead = head.next_ptr.load(std::memory_order_relaxed);
        do
        {
            p->next_ptr = phead;
        } while (!head.next_ptr.compare_exchange_weak(phead, p));
    }

    T* pop() noexcept
    {
        auto result = head.next_ptr.load(std::memory_order_relaxed);
        while (!head.next_ptr.compare_exchange_weak(result, static_cast<LinkType>(result)->next_ptr))
            ;
        return result == &head ? nullptr : static_cast<T*>(result);
    }
};
The problem is that in the real program I have several concurrently running threads that take an object from this list with pop, work with it, and then put it back with push, and it seems like I have a race: sometimes two threads end up getting the same object from the list.
I have tried to reduce that program to a simple example that illustrates the race.
Here it is:
#include <chrono>
#include <thread>
#include <vector>

struct item : s_list<item>
{
    std::atomic<int> use{ 0 };
};

atomic_s_list<true, item> items;

item* allocate()
{
    auto* result = items.pop();
    if (!result)
        result = new item;
    return result;
}

void free(item* p)
{
    items.push(p);
}

int main()
{
    using namespace std::chrono_literals;

    static const int N = 20;
    std::vector<std::thread> threads;
    threads.reserve(N);
    for (int i = 0; i < N; ++i)
    {
        threads.push_back(std::thread([&]
        {
            while (true)
            {
                auto item = allocate();
                if (0 != item->use.fetch_add(1, std::memory_order_relaxed))
                    std::terminate();
                item->use.fetch_sub(1, std::memory_order_relaxed);
                free(item);
            }
        }));
    }

    std::this_thread::sleep_for(20min);
}
So the question is: is this implementation of interlocked singly-linked list correct?
After more research I can confirm that I am facing an ABA problem.
It appears that no one should trust this simple interlocked singly-linked list implementation on modern hardware (with lots of hardware threads) and highly contended lists.
After considering implementing the tricks described in the Wikipedia article, I decided to use the Boost implementation (see boost::lockfree::stack), as it seems to put good effort into fighting the ABA problem.
For now my test code does not fail, and neither does the original program.
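For reference, a minimal usage sketch of that replacement (my illustration, assuming Boost.Lockfree is available; it keeps the free-list semantics of allocate/free from the test above):

#include <boost/lockfree/stack.hpp>

struct item { int value = 0; };

boost::lockfree::stack<item*> items{ 128 };   // initial capacity of the internal node pool

item* allocate()
{
    item* p = nullptr;
    if (!items.pop(p))       // pop() returns false when the stack is empty
        p = new item;
    return p;
}

void free_item(item* p)
{
    if (!items.push(p))      // push() can fail only if internal node allocation fails
        delete p;            // fall back to dropping the node instead of leaking it
}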
This is not a correct implementation! The following interleaving shows the ABA problem:
1. Thread A and thread B both call pop; A and B read the same result from head.next_ptr.
2. Thread A wins the CAS, takes the node, and modifies the memory.
3. Thread B reads result->next_ptr and gets INCORRECT DATA.
4. Thread A calls push; head.next_ptr == result again NOW!
5. Thread B's compare_exchange_weak succeeds, and head.next_ptr is updated with the INCORRECT DATA.
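To illustrate the counter-based mitigation alluded to above (one of the tricks the self-answer found in the Wikipedia article), here is a minimal sketch, not production code: the head carries a generation tag, so a concurrent pop-then-push of the same node changes the compared value and the stale CAS fails. It assumes nodes are never returned to the allocator while the stack is in use (free-list usage, as in the question), and std::atomic over the two-word head is lock-free only on platforms with a double-width CAS (e.g. x86-64 with cmpxchg16b).

#include <atomic>
#include <cstdint>

struct node { node* next = nullptr; };

class tagged_stack {
    struct head_t { node* ptr; std::uint64_t tag; };
    std::atomic<head_t> head{ head_t{ nullptr, 0 } };

public:
    void push(node* n) noexcept {
        head_t old = head.load(std::memory_order_relaxed);
        head_t desired;
        do {
            n->next = old.ptr;
            desired = { n, old.tag };                  // tag is unchanged on push
        } while (!head.compare_exchange_weak(old, desired,
                     std::memory_order_release, std::memory_order_relaxed));
    }

    node* pop() noexcept {
        head_t old = head.load(std::memory_order_acquire);
        head_t desired;
        do {
            if (!old.ptr)
                return nullptr;
            // Safe only because nodes are never freed while the stack is live.
            desired = { old.ptr->next, old.tag + 1 };  // bump the tag on every pop
        } while (!head.compare_exchange_weak(old, desired,
                     std::memory_order_acquire, std::memory_order_relaxed));
        return old.ptr;
    }
};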