Lock-free multiple producer multiple consumer in C++

I have to program a multiple-producer/multiple-consumer system in C++, but I'm lost trying to put the parts of the model together (threads with their correct buffers). The basic flow of the model is: I have an initial thread that executes a function. Its results need to be put into an undetermined number of buffers, because each element the function processes is different and needs to be handled by its own thread. Then, with the data stored in the buffers, another n threads need to get the data from these buffers, apply another function to it, and put the results into buffers again.
At the moment I have got this buffer structure created:
template <typename T>
class buffer {
public:
    buffer(int n);
    int bufSize() const noexcept;
    bool bufEmpty() const noexcept;
    bool full() const noexcept;
    ~buffer() = default;
    void put(const T & x, bool last) noexcept;
    std::pair<bool,T> get() noexcept;
private:
    int next_pos(int p) const noexcept;

    struct item {
        bool last;
        T value;
    };
    const int size_;
    std::unique_ptr<item[]> buf_;
    alignas(64) std::atomic<int> nextRd_ {0};   // read and write cursors on separate
    alignas(64) std::atomic<int> nextWrt_ {0};  // cache lines to avoid false sharing
};
I've also created a vector structure which stores a collection of buffers, in order to satisfy the need for an undetermined number of threads.
std::vector<std::unique_ptr<locked_buffer<std::pair<int, std::vector<std::vector<unsigned char>>>>>> v1;
for (int i = 0; i < n; i++) {
    v1.push_back(std::unique_ptr<locked_buffer<std::pair<int, std::vector<std::vector<unsigned char>>>>>(
        new locked_buffer<std::pair<int, std::vector<std::vector<unsigned char>>>>(aux)));
}
Edit: (drawing of the intended thread/buffer pipeline omitted)

Without knowing more context, this looks like an application for a standard thread pool. You have different tasks that are enqueued to a synchronized queue (like the buffer class you have there). Each worker thread of the thread pool polls this queue and processes one task each time (by executing a run() method for example). They write the results back into another synchronized queue.
Each worker thread has its own thread-local pair of input and output buffers. They don't need synchronization because they are only accessed from within the owner thread itself.
Edit: Actually, I think this can be simplified a lot: Just use a thread pool and one synchronized queue. The worker threads can enqueue new tasks directly into the queue. Each of your threads in the drawing would correspond to one type of task and implement a common Task interface.
You don't need multiple buffers. You can use polymorphism and put everything in one buffer.
Edit 2 - Explanation of thread pools:
A thread pool is just a concept; forget about the pooling aspect and use a fixed number of threads. The main idea is: instead of having several threads with a specific function each, have N threads that can process any kind of task, where N is the number of CPU cores.
You can transform this (diagram of dedicated per-function threads with buffers between them omitted) into (diagram of N generic worker threads sharing one task queue omitted).
The worker thread does something like the following. Note that this is simplified, but you should get the idea.
void Thread::run(buffer<Task*>& queue) {
    while (true) {
        while (queue.isEmpty())
            waitUntilQueueHasElement();   // e.g. block on a condition variable
        Task* task = queue.get();
        if (task)
            task->execute();
    }
}
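Spawning the pool itself could look something like the following. This is only a sketch: it assumes the buffer and Task types from this answer and a default-constructible Thread whose run() loops forever.
#include <thread>
#include <vector>

// Start one worker per hardware core, all polling the same task queue.
void start_pool(buffer<Task*>& task_queue) {
    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([&task_queue] {
            Thread worker;
            worker.run(task_queue);   // loops forever processing tasks
        });
    for (std::thread& w : workers)
        w.join();                     // in this sketch the workers never return
}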
And your tasks implement a common interface so you can put Task* pointers into a single queue:
struct Task {
    virtual ~Task() {}
    virtual void execute() = 0;
};

struct Task1 : public Task {
    virtual void execute() override {
        A();
        B1();
        C();
    }
};
...
Also, do yourself a favour and use typedefs ;)
`std::vector<std::unique_ptr<locked_buffer<std::pair<int, std::vector<std::vector<unsigned char>>>>>> v1;`
becomes
typedef std::vector<std::vector<unsigned char>> vector2D_uchar;
typedef std::pair<int, vector2D_uchar> int_vec_pair;
typedef std::unique_ptr<locked_buffer<int_vec_pair>> locked_buffer_ptr;
std::vector<locked_buffer_ptr> v1;

Related

c++11 shared_ptr used across multiple threads

Recently I've been designing a high-performance event-driven multi-threaded framework in C++11. It mainly takes C++11 facilities such as std::thread, std::condition_variable, std::mutex, std::shared_ptr etc. into consideration. In general, this framework has three basic components: job, worker and streamline; it is essentially a factory. When a user builds his business model on the server side, he just needs to consider the data and its processor. Once the model is established, the user only needs to derive a data class from job and a processor class from worker.
For example:
class Data : public job {};
class Processor : public worker {};
When the server gets data, it just creates a Data object via auto data = std::make_shared<Data>() in the data-source callback thread and calls streamline.job_dispatch to hand the processor and data over to another thread. Of course the user doesn't have to think about freeing memory. streamline.job_dispatch mainly does the following:
void evd_thread_pool::job_dispatch(std::shared_ptr<evd_thread_job> job) {
    auto task = std::make_shared<evd_task_wrap>(job);
    task->worker = streamline.worker;
    // worker has been registered in streamline first of all
    {
        std::unique_lock<std::mutex> lck(streamline.mutex);
        streamline.task_list.push_back(std::move(task));
    }
    streamline.cv.notify_all();
}
The evd_task_wrap used in job_dispatch is defined as:
struct evd_task_wrap {
    std::shared_ptr<evd_thread_job> order;
    std::shared_ptr<evd_thread_processor> worker;

    evd_task_wrap(std::shared_ptr<evd_thread_job>& o)
        : order(o) {}
};
Finally the task_wrap is dispatched to the processing thread through task_list, which is a std::list object. The processing thread mainly does the following:
void evd_factory_impl::thread_proc() {
    std::shared_ptr<evd_task_wrap> wrap = nullptr;
    while (true) {
        {
            std::unique_lock<std::mutex> lck(streamline.mutex);
            if (streamline.task_list.empty())
                streamline.cv.wait(lck,
                    [&]()->bool { return !streamline.task_list.empty(); });
            wrap = std::move(streamline.task_list.front());
            streamline.task_list.pop_front();
        }
        if (-1 == wrap->order->get_type())
            break;
        wrap->worker->process_task(wrap->order);
        wrap.reset();
    }
}
But I don't know why the process often crashes in the thread_proc function. The core dump shows that sometimes wrap is an empty shared_ptr, or that a segmentation fault happens in _Sp_counted_ptr_inplace::_M_dispose, which is called from wrap.reset(). I suspect a thread-synchronization problem around the shared_ptr in this scenario, although I know the control block of shared_ptr is thread-safe, and of course the shared_ptr objects in job_dispatch and thread_proc are distinct even though they point to the same storage. Does anyone have a more specific suggestion on how to solve this problem? Or is there a similar lightweight framework with automatic memory management using C++11?
An example of process_task:
void log_handle::process_task(std::shared_ptr<crx::evd_thread_job> job) {
    auto j = std::dynamic_pointer_cast<log_job>(job);
    j->log->Printf(0, j->print_str.c_str());
    write(STDOUT_FILENO, j->print_str.c_str(), j->print_str.size());
}
class log_factory {
public:
    log_factory(const std::string& name);
    virtual ~log_factory();

    void print_ts(const char *format, ...) { // here dispatch the job
        char log_buf[4096] = {0};
        va_list args;
        va_start(args, format);
        vsprintf(log_buf, format, args);   // note: unbounded; vsnprintf would be safer
        va_end(args);
        auto job = std::make_shared<log_job>(log_buf, &m_log);
        m_log_th.job_dispatch(job);
    }

public:
    E15_Log m_log;
    std::shared_ptr<log_handle> m_log_handle;
    crx::evd_thread_pool m_log_th;
};
I detected a problem in your code, which may or may not be related:
You use notify_all on your condition variable. That wakes ALL sleeping threads. It is OK if you wrap your wait in a while loop, like:
while (streamline.task_list.empty())
    streamline.cv.wait(lck, [&]()->bool { return !streamline.task_list.empty(); });
But since you are using an if, all threads leave the wait. If you dispatch a single product while having several consumer threads, all but one of them will call wrap = std::move(streamline.task_list.front()); while the task list is empty, causing undefined behaviour.
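A minimal corrected version of the consumer's wait, with the same streamline members as above. Note that the predicate overload of wait already loops internally and re-checks the condition under the mutex, so the surrounding if/while can simply be dropped:
{
    std::unique_lock<std::mutex> lck(streamline.mutex);
    // Each woken thread re-checks the predicate while holding the mutex,
    // so only a thread that actually sees a non-empty list proceeds.
    streamline.cv.wait(lck, [&] { return !streamline.task_list.empty(); });
    wrap = std::move(streamline.task_list.front());
    streamline.task_list.pop_front();
}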

Set the limit in Queue

I want to set a limit on my Queue. You can find below the implementation of the Queue class.
So, in short, I want one thread to write into the Queue until the limit is reached and then wait for free space, while a second thread reads from the Queue and does some operations with the data it receives.
int main()
{
    // thread that keeps adding new elements to the queue
    std::thread one(buildQueue, input, std::ref(queue));
    while (true) {
        auto obj = queue.pop();
        func(obj);   // do some math
    }
}
So the problem is that the Queue grows without bound, but I want to allow only 10 elements, for example. The program should work like this:
Check whether free space in Queue is available.
If there is no space - wait.
Write in the Queue until the limit.
Class Queue
template <typename T> class Queue {
private:
    const unsigned int MAX = 5;
    std::deque<T> newQueue;
    std::mutex d_mutex;
    std::condition_variable d_condition;
public:
    void push(T const& value)
    {
        {
            std::unique_lock<std::mutex> lock(this->d_mutex);
            newQueue.push_front(value);
        }
        this->d_condition.notify_one();
    }
    T pop()
    {
        std::unique_lock<std::mutex> lock(this->d_mutex);
        this->d_condition.wait(lock, [=] { return !this->newQueue.empty(); });
        T rc(std::move(this->newQueue.back()));
        this->newQueue.pop_back();
        return rc;
    }
    unsigned int size()
    {
        return newQueue.size();
    }
    unsigned int maxQueueSize()
    {
        return this->MAX;
    }
};
I'm pretty new to threaded programming, so I may be misunderstanding the concept. That is why any hints are highly appreciated.
You should have researched the Queue class on the MSDN website; it provides extensive information about the methods Queue includes. However, to answer your question specifically: to create a queue with a specific capacity, it would be the following:
Queue(int capacity)
where capacity (System::Int32) is the initial number of elements the queue can hold. Your queue will then fill up to that limit. The issue is that the queue will not "stop" once it is filled: it will simply allocate more memory, as that is its nature. So in your threaded (or multithreaded, by the sound of it) code, you must take care of the allocation and deallocation of the queue's memory based on timing. You should be able to determine the milliseconds needed to fill your queue to the desired capacity and to read it back, clearing it in the meantime. Likewise, you can copy the queue contents to a 1D array and do a full clear using MyQueue->Clear() without having to read the queue elements one by one (if timing and code complexity are an issue).
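Alternatively, staying with the Queue class from the question: the wait-for-free-space behaviour can be had with a second condition variable that producers block on while the queue is full. A minimal sketch; d_spaceAvailable is an added std::condition_variable member, everything else is as in the question:
void push(T const& value)
{
    {
        std::unique_lock<std::mutex> lock(d_mutex);
        // block the producer until a consumer has made room
        d_spaceAvailable.wait(lock, [this] { return newQueue.size() < MAX; });
        newQueue.push_front(value);
    }
    d_condition.notify_one();        // wake a consumer waiting for data
}

T pop()
{
    std::unique_lock<std::mutex> lock(d_mutex);
    d_condition.wait(lock, [this] { return !newQueue.empty(); });
    T rc(std::move(newQueue.back()));
    newQueue.pop_back();
    d_spaceAvailable.notify_one();   // wake a producer waiting for room
    return rc;
}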

C++ thread-safe bounded queue returning objects for the original thread to delete - 1 writer - 1 reader

The goal is to have a writer thread and a reader thread, but only the writer news and deletes the action objects. There is only one reader and one writer.
something like:
template<typename T, std::size_t MAX>
class TSQ
{
public:
    // blocks if there are MAX items in queue
    // returns used object to be deleted, or 0 if none exists
    T * push(T * added);   // added will be processed by reader
    // blocks if there are no objects in queue
    // returns item pushed from writer for deletion
    T * pop(T * used);     // used will be freed by writer
private:
    // stuff here
};
Or better, if the delete and return can be encapsulated:
template<typename T, std::size_t MAX>
class TSQ
{
public:
    // blocks if there are MAX items in queue
    void push(T * added);  // added will be processed by reader
    // blocks if there are no objects in queue
    // returns item pushed from writer for deletion
    T& pop();
private:
    // stuff here
};
where the writer thread has a loop like:
my_object *action;
while (1) {
    // create action
    delete my_queue.push(action);
}
and the reader has a loop like:
my_object *action = 0;
while (1) {
    action = my_queue.pop(action);
    // do stuff with action
}
The reason for having the writer delete the action items is performance.
Is there an optimal way to do this?
Bonus points if MAX=0 is specialized to be unbounded (not required, just tidy)
I'm not looking for the full code, just the data structure and general approach
This is an instance of the producer-consumer problem. A popular way to solve it is to use a lock-free queue.
Also, the first practical change you might want to make is to add a sleep(0) into the production/consumption loops, so you will give up your time slice every iteration and won't end up using 100% of a CPU core.
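In portable C++11 the same effect is usually spelled std::this_thread::yield():
#include <thread>

// Offer the scheduler the rest of this thread's time slice each iteration;
// the standard C++ counterpart of the sleep(0) idiom.
std::this_thread::yield();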
The most common solution to this problem is to pass values, not pointers.
You can pass shared_ptrs through this queue; then your queue doesn't need to know how to free memory for you.
If you use something like Lamport's ring buffer for a single-producer/single-consumer blocking queue, it's natural to use std::vector under the hood, which will call destructors for every element automatically.
template<typename T, std::size_t MAX>
class TSQ
{
public:
    // blocks if there are MAX items in queue
    void push(T added);   // added will be processed by reader
    // blocks if there are no objects in queue
    T pop();
private:
    std::vector<T> _content;
    size_t _push_index;
    size_t _pop_index;
    ...
};
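To make that concrete, here is a minimal sketch of such a single-producer/single-consumer ring buffer using C++11 atomics. It spins when full or empty purely to keep the example short; a real implementation would block, e.g. on a condition variable:
#include <atomic>
#include <cstddef>
#include <vector>

// Lamport-style SPSC ring buffer. Safe only with exactly one producer
// thread and one consumer thread. One slot is kept empty so that a full
// buffer can be told apart from an empty one.
template <typename T, std::size_t MAX>
class spsc_ring {
public:
    spsc_ring() : buf_(MAX + 1) {}

    void push(T value) {
        std::size_t head = head_.load(std::memory_order_relaxed);
        std::size_t next = (head + 1) % (MAX + 1);
        while (next == tail_.load(std::memory_order_acquire))
            ;   // full: spin until the consumer frees a slot
        buf_[head] = std::move(value);
        head_.store(next, std::memory_order_release);
    }

    T pop() {
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        while (tail == head_.load(std::memory_order_acquire))
            ;   // empty: spin until the producer publishes an element
        T value = std::move(buf_[tail]);
        tail_.store((tail + 1) % (MAX + 1), std::memory_order_release);
        return value;
    }

private:
    std::vector<T> buf_;                 // destructors run automatically
    std::atomic<std::size_t> head_{0};   // written only by the producer
    std::atomic<std::size_t> tail_{0};   // written only by the consumer
};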

Asynchronous thread-safe logging in C++ (no mutex)

I'm actually looking for a way to do asynchronous, thread-safe logging in my C++ project.
I have already explored thread-safe logging solutions like log4cpp, log4cxx, Boost:log or rlog, but it seems that all of them use a mutex. And as far as I know, a mutex is a synchronous solution, which means that threads trying to write their messages are blocked while another one is writing.
Do you know of a solution?
I think your statement is wrong: using a mutex is not necessarily equivalent to a synchronous solution. Yes, a mutex is for synchronization control, but it can be used for many different things. We can use a mutex in, for example, a producer-consumer queue while the logging still happens asynchronously.
Honestly, I haven't looked into the implementations of these logging libraries, but it should be feasible to build an asynchronous appender (for a log4j-like lib) in case one is not provided: the logger writes to a producer-consumer queue, and another worker thread is responsible for writing to the file (or even delegating to another appender).
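For illustration, here is a rough, self-contained sketch of that pattern (not the actual API of any of the libraries mentioned): callers enqueue formatted lines under a briefly held mutex, and a single worker thread performs the slow file I/O:
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class async_logger {
public:
    explicit async_logger(const std::string& path)
        : out_(path), worker_([this] { run(); }) {}

    ~async_logger() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();   // drain remaining messages, then stop
    }

    void log(std::string line) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(line));
        }
        cv_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !q_.empty(); });
            while (!q_.empty()) {
                std::string line = std::move(q_.front());
                q_.pop();
                lock.unlock();            // do the slow I/O outside the lock
                out_ << line << '\n';
                lock.lock();
            }
            if (done_) return;
        }
    }

    std::ofstream out_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool done_ = false;
    std::thread worker_;   // declared last: starts after the other members exist
};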
Edit:
Just had a brief scan of log4cxx: it does provide an AsyncAppender, which does what I suggested: it buffers incoming logging events and delegates to the attached appender asynchronously.
I'd recommend avoiding the problem by using only one thread for logging. For passing the necessary data to log, you can use a lock-free FIFO queue (thread-safe as long as producer and consumer are strictly separated and only one thread has each role; therefore you will need one queue per producer).
An example of a fast lock-free queue is included below:
queue.h:
#ifndef QUEUE_H
#define QUEUE_H

template<typename T> class Queue
{
public:
    virtual void Enqueue(const T &element) = 0;
    virtual T Dequeue() = 0;
    virtual bool Empty() = 0;
    virtual ~Queue() {}
};

#endif // QUEUE_H
hybridqueue.h:
#ifndef HYBRIDQUEUE_H
#define HYBRIDQUEUE_H

#include "queue.h"

// Single-producer/single-consumer queue built from a linked list of
// fixed-size chunks. The producer only touches `end`, the consumer only
// `start`, with a filler between them to avoid false sharing.
template <typename T, int size> class HybridQueue : public Queue<T>
{
public:
    virtual bool Empty();
    virtual T Dequeue();
    virtual void Enqueue(const T& element);
    HybridQueue();
    virtual ~HybridQueue();
private:
    struct ItemList
    {
        int start;                        // consumer's index into list
        T list[size];
        int end;                          // producer's index into list
        ItemList volatile * volatile next;
    };
    ItemList volatile * volatile start;   // consumer side
    char filler[256];                     // keep start and end on separate cache lines
    ItemList volatile * volatile end;     // producer side
};

/**
 * Implementation
 *
 */
#include <stdio.h>

template <typename T, int size> bool HybridQueue<T, size>::Empty()
{
    return (this->start == this->end) && (this->start->start == this->start->end);
}

template <typename T, int size> T HybridQueue<T, size>::Dequeue()
{
    if (this->Empty())
    {
        return NULL;   // assumes T is a pointer-like type
    }
    if (this->start->start >= size)
    {
        // current chunk exhausted, move on to the next one
        ItemList volatile * volatile old;
        old = this->start;
        this->start = this->start->next;
        delete old;
    }
    T tmp;
    tmp = this->start->list[this->start->start];
    this->start->start++;
    return tmp;
}

template <typename T, int size> void HybridQueue<T, size>::Enqueue(const T& element)
{
    if (this->end->end >= size) {
        // current chunk full, link in a fresh one
        this->end->next = new ItemList();
        this->end->next->start = 0;
        this->end->next->list[0] = element;
        this->end->next->end = 1;
        this->end = this->end->next;
    }
    else
    {
        this->end->list[this->end->end] = element;
        this->end->end++;
    }
}

template <typename T, int size> HybridQueue<T, size>::HybridQueue()
{
    this->start = this->end = new ItemList();
    this->start->start = this->start->end = 0;
}

template <typename T, int size> HybridQueue<T, size>::~HybridQueue()
{
}

#endif // HYBRIDQUEUE_H
If I get your question right, you are concerned about doing I/O operations (probably writing to a file) inside the logger's critical section.
Boost:log lets you define a custom writer object. You can define its operator() to perform asynchronous I/O or to pass the message on to your logging thread (which does the I/O).
http://www.torjo.com/log2/doc/html/workflow.html#workflow_2b
No library will do this as far as I know - it's too complex. You'll have to roll your own. Here's an idea I just had: create a per-thread log file, ensure that the first item of each entry is a timestamp, and then merge the logs after the run and sort them (by timestamp) to get the final log file.
You could use thread-local storage (say, for a FILE handle; AFAIK it won't be possible to store a stream object in thread-local storage) and look this handle up on each log line, writing to that thread's specific file.
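(Side note: with C++11 thread_local, which unlike the older compiler-specific TLS can hold objects with constructors and destructors, even a stream object works.) A small sketch of the per-thread file idea; the naming scheme is hypothetical:
#include <fstream>
#include <sstream>
#include <thread>

// Build a per-thread file name from the thread id.
static std::string log_file_name() {
    std::ostringstream name;
    name << "log_" << std::this_thread::get_id() << ".txt";
    return name.str();
}

// Each thread writes to its own stream, so no locking is needed.
std::ofstream& thread_log() {
    thread_local std::ofstream out(log_file_name());
    return out;
}

// Usage: prefix every entry with a timestamp so the per-thread files
// can be merged and sorted afterwards, e.g.
//     thread_log() << timestamp() << ' ' << message << '\n';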
All this complexity versus locking a mutex? I don't know the performance requirements of your application, but if it is that sensitive, why are you logging (excessively) at all? Think of other ways to obtain the information you require without logging.
Also, one other thing to consider is to hold the mutex for the least amount of time possible, i.e. construct your log entry first, and acquire the lock only just before writing to the file.
In a Windows program, we use a user-defined Windows message. First, memory is allocated for the log entry on the heap. Then PostMessage is called, with the pointer as the LPARAM and the record size as the WPARAM. The receiver window extracts the record, displays it, and saves it in the log file. Then PostMessage returns, and the sender deallocates the memory. This approach is thread-safe, and you don't have to use mutexes; concurrency is handled by the message-queue mechanism of Windows. Not very elegant, but it works.
Lock-free algorithms are not necessarily the fastest ones. Define your boundaries: how many threads are there for logging, and how much will be written in a single log operation at most?
I/O-bound operations are much, much slower than the thread context switches caused by blocking/waking threads. Using a lock-free/spin-lock algorithm with 10 writing threads will put a heavy load on the CPU.
In short: block the other threads while you are writing to the file.

Store Templated Object in Container

Is it possible to store a templated class like
template <typename rtn, typename arg>
class BufferAccessor {
public:
    int ThreadID;
    virtual rtn do_work(arg) = 0;
};

BufferAccessor<void, int> access1;
BufferAccessor<int, void> access2;
in the same container, like a vector or list?
Edit:
The purpose of this is that I am trying to make a circular buffer where the objects that want to use the buffer need to register with it. The buffer will store a boost::shared_ptr to each accessor object and generate callbacks to their functions, which push or pull data to/from the buffer. The callbacks will be used in generic thread worker functions that I have created, similar to a thread pool except that the workers need to access a shared-memory object. Below is some code I have typed up that might help illustrate what I am trying to do, but it hasn't been compiled yet, and this is also my first time using bind, function, and multi-threading.
typedef boost::function<BUF_QObj (void)> CallbackT_pro;
typedef boost::function<void (BUF_QObj)> CallbackT_con;
typedef boost::shared_ptr<BufferAccessor> buf_ptr;   // note: BufferAccessor is a class template, so this needs template arguments

// Register the worker objects
int register_consumer(BufferAccessor &accessor) {
    mRegCons[mNumConsumers] = buf_ptr(&accessor);    // caution: shared_ptr must not delete a non-owned object
    return ++mNumConsumers;
}

int register_producer(BufferAccessor &accessor) {
    mRegPros[mNumProducers] = buf_ptr(&accessor);
    return ++mNumProducers;
}

// Dispatch consumer threads (consumers take the dequeued object as argument)
for (; x < mNumConsumers; ++x) {
    CallbackT_con callback_con = boost::bind(&BufferAccessor::do_work, mRegCons[x], _1);
    tw = new boost::thread(boost::bind(&RT_ActiveCircularBuffer::consumerWorker, this, callback_con));
    consumers.add(tw);
}

// Dispatch producer threads (producers take no argument and return data)
for (x = 0; x < mNumProducers; ++x) {
    CallbackT_pro callback_pro = boost::bind(&BufferAccessor::do_work, mRegPros[x]);
    tw = new boost::thread(boost::bind(&RT_ActiveCircularBuffer::producerWorker, this, callback_pro));
    producers.add(tw);
}
// Thread Template Workers - Consumer
void consumerWorker(CallbackT_con worker) {
    struct BUF_QObj *qData;
    {
        boost::mutex::scoped_lock lock(mLock);
        while (!mRun)
            cond.wait(lock);
    }
    while (!mTerminate) {
        // Set interruption point so that the thread can be interrupted
        boost::thread::interruption_point();
        { // Code Block
            boost::mutex::scoped_lock lock(mLock);
            if (mBuf.empty())
                cond.wait(lock);
            qData = mBuf.front();
            mBuf.pop_front();   // remove the front element
        } // End Code Block
        worker(qData);          // Process data
        // Sleep that thread for 1 uSec
        boost::this_thread::sleep(boost::posix_time::nanoseconds(1000));
    } // End of while loop
}
// Thread Template Workers - Producer
void producerWorker(CallbackT_pro worker) {
    struct BUF_QObj *qData;
    boost::this_thread::sleep(boost::posix_time::nanoseconds(1000));
    {
        boost::mutex::scoped_lock lock(mLock);
        while (!mRun)
            cond.wait(lock);
    }
    while (!mTerminate) {
        // Set interruption point so that the thread can be interrupted
        boost::thread::interruption_point();
        qData = worker();   // get data to be processed
        { // Code Block
            boost::mutex::scoped_lock lock(mLock);
            mBuf.push_back(qData);
            cond.notify_one();   // notify_one takes no argument
        } // End Code Block
        // Sleep that thread for 1 uSec
        boost::this_thread::sleep(boost::posix_time::nanoseconds(1000));
    } // End of while loop
}
No, it's not, because STL containers are homogeneous: access1 and access2 have completely different, unrelated types. But you could make BufferAccessor a non-template class and the do_work member a template, like this:
class BufferAccessor
{
public:
    template<class R, class A>
    R doWork(A arg) {...}
};
In this case you could store BufferAccessors in a container, but you can't make a member function template virtual.
Yes, you can use vector<BufferAccessor<void,int>> to store BufferAccessor<void,int> objects and vector<BufferAccessor<int,void>> to store BufferAccessor<int,void> objects.
What you can't do is use the same vector to store both BufferAccessor<int,void> and BufferAccessor<void,int> objects.
The reason it doesn't work is that BufferAccessor<void,int> and BufferAccessor<int,void> are two different classes.
Note: it is possible to use the same vector to store both, but you would have to store them as void pointers via shared_ptr<void>, or better yet use a boost::variant.
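For illustration, a minimal sketch of the boost::variant route. The visitor and the example argument values are my own additions, and the second instantiation takes a dummy char rather than void, because a dependent void parameter type does not actually compile:
#include <boost/variant.hpp>
#include <vector>

// The accessor template from the question, with a virtual destructor
// added so objects can be deleted through base pointers.
template <typename rtn, typename arg>
class BufferAccessor {
public:
    int ThreadID;
    virtual rtn do_work(arg) = 0;
    virtual ~BufferAccessor() {}
};

// One container holding two different instantiations side by side.
// Pointers are stored because BufferAccessor itself is abstract.
typedef boost::variant<BufferAccessor<void, int>*,
                       BufferAccessor<int, char>*> accessor_variant;

struct do_work_visitor : boost::static_visitor<void> {
    void operator()(BufferAccessor<void, int>* a) const { a->do_work(42); }
    void operator()(BufferAccessor<int, char>* a) const { a->do_work('x'); }
};

void run_all(std::vector<accessor_variant>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        boost::apply_visitor(do_work_visitor(), v[i]);
}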