I have a high-priority process that needs to pass data to a low-priority process. I've written a basic ring buffer to handle the passing of data:
class RingBuffer {
public:
    RingBuffer(int size);
    ~RingBuffer();

    int count() { return (size + end - start) % size; }

    void write(char *data, int bytes) {
        // some work that uses only buffer and end
        end = (end + bytes) % size;
    }

    void read(char *data, int bytes) {
        // some work that uses only buffer and start
        start = (start + bytes) % size;
    }

private:
    char *buffer;
    const int size;
    int start, end;
};
Here's the problem. Suppose the low-priority thread has an oracle that tells it exactly how much data needs to be read, so that count() need never be called. Then (unless I'm missing something) there are no concurrency issues. However, as soon as the low-priority thread needs to call count() (the high-priority thread might want to call it too, to check whether the buffer is getting too full), there is the possibility that the arithmetic in count() or the update to end is not atomic, introducing a bug.
I could put a mutex around the accesses to start and end but that would cause priority inversion if the high-priority thread has to wait for the lock acquired by the low-priority thread.
I might be able to work something out using atomic operations but I'm not aware of a nice, cross-platform library providing these.
Is there a standard ring-buffer design that avoids these issues?
What you have should be OK, as long as you adhere to these guidelines:
Only one thread can do writes.
Only one thread can do reads.
Updates and accesses to start and end are atomic. This might be automatic, for example Microsoft states:
Simple reads and writes to properly-aligned 32-bit variables are atomic operations. In other words, you will not end up with only one portion of the variable updated; all bits are updated in an atomic fashion.
You allow for the fact that count might be out of date even as you get the value. In the reading thread, count will return the minimum count you can rely on; for the writing thread count will return the maximum count and the true count might be lower.
Boost provides a circular buffer, but it's not thread safe. Unfortunately, I don't know of any implementation that is.
The upcoming C++ standard adds atomic operations to the standard library, so they'll be available in the future, but they aren't supported by most implementations yet.
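For illustration, once std::atomic is available, a single-producer/single-consumer version of the buffer above could look roughly like this. This is only a minimal sketch under the guidelines above (one writer thread, one reader thread), not a drop-in replacement: the member names mirror the question's code, the copy loops are elided as in the original, and the memory orderings are my own choice.

#include <atomic>

class RingBuffer {
public:
    explicit RingBuffer(int sz) : buffer(new char[sz]), size(sz) {}
    ~RingBuffer() { delete[] buffer; }

    // May be stale by the time it returns: a lower bound for the reader,
    // an upper bound for the writer, as described above.
    int count() const {
        return (size + end.load(std::memory_order_acquire)
                     - start.load(std::memory_order_acquire)) % size;
    }

    void write(const char *data, int bytes) {
        // ... copy into buffer, touching only buffer and end ...
        end.store((end.load(std::memory_order_relaxed) + bytes) % size,
                  std::memory_order_release);   // publish only after the copy
    }

    void read(char *data, int bytes) {
        // ... copy out of buffer, touching only buffer and start ...
        start.store((start.load(std::memory_order_relaxed) + bytes) % size,
                    std::memory_order_release); // free the space only after the copy
    }

private:
    char *buffer;
    const int size;
    std::atomic<int> start{0}, end{0};
};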
I don't see any cross-platform solution to keeping count sane while both threads are writing to it, unless you implement locking.
Normally, you would probably use a messaging system and force the low priority thread to request that the high priority thread make updates, or something similar. For example, if the low priority thread consumes 15 bytes, it should ask the high priority thread to reduce the count by 15.
Essentially, you would limit 'write' access to the high priority thread, and only allow the low priority thread to read. This way, you can avoid all locking, and the high priority thread won't have to worry about waiting for a write to be completed by the lower thread, making the high priority thread truly high priority.
boost::interprocess offers cross-platform atomic functions in boost/interprocess/detail/atomic.hpp
I am trying to implement the following functionality:
Atomic and lock-free write or read-modify-write of a data type with arbitrary size (in my case usually a float/int vector with up to 6 elements).
Atomic read from above data type that doesn't block the writing thread. The read operation may be blocked by the write operation.
Use case: I am trying to write the software for a CNC machine. The step pulses for the motors are generated in software in a realtime thread. This realtime thread is constantly updating a variable which holds the current position of the axis. Multiple other non-realtime threads may read that variable, e.g. to display the current position.
Question 1: Is there a standard/accepted solution or pattern for this kind of problem?
I came up with the following idea: use an std::atomic<uint64_t> to protect the data and track whether a thread is currently writing (by checking the lowest bit) or has written since the read started (by incrementing the value on a write).
#include <atomic>
#include <cstdint>

template <class DATA, class FN>
void read_modify_write(DATA& data, std::atomic<uint64_t>& protector, FN fn)
{
    auto old_protector_value = protector.load();
    do
    {
        // wait until no other thread is writing
        while (old_protector_value % 2 != 0)
            old_protector_value = protector.load();
        // try to acquire write privileges
    } while (!protector.compare_exchange_weak(old_protector_value, old_protector_value + 1));

    // write data
    data = fn(data);

    // unlock
    protector = old_protector_value + 2;
}
template <class DATA>
DATA read(const DATA& data, std::atomic<uint64_t>& protector)
{
    while (true)
    {
        uint64_t old_protector_value = protector.load();
        // wait until no thread is writing
        while (old_protector_value % 2 != 0)
            old_protector_value = protector.load();

        // read data
        auto ret = data;

        // check if data has changed in the meantime
        if (old_protector_value == protector)
            return ret;
    }
}
Question 2: Is the above code thread-safe and fulfilling above requirements?
Question 3: Can it be improved?
(The only theoretical problem I could find is if the counter wraps around, i.e. exactly 2^63 write operations are performed during 1 read operation. I would consider this weakness acceptable if there are no better solutions.)
Thank you
Strictly speaking your code is not lock-free, because you effectively use the LSB of protector to implement a spinlock.
Your solution looks very much like a sequence lock. However, the actual read operation auto ret = data; is strictly speaking a data race. To be fair, it is simply not possible to write a fully standard compliant seqlock in C++17, for that we have to wait for C++20.
It is possible to extend the seqlock to make read operations lock-free at the cost of higher memory usage. The idea is to have multiple instances of the data (let's call them slots), and the write operation always writes to the next slot in a round-robin fashion. This allows a read operation to read from the last fully written slot. Dmitry Vyukov described his approach in Improved Lockfree SeqLock. You can take a look at my seqlock implementation that is part of my xenium library. It also optionally allows lock-free read operations with a configurable number of slots (though it differs slightly from Vyukov's in how a reader finds the last fully written slot).
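For illustration, here is a minimal single-writer seqlock sketch in the spirit of the above. The class and member names are mine; the payload is stored in relaxed atomics so that reading a torn value is not a formal data race, and the fence placement follows the usual seqlock pattern. Supporting multiple writers would additionally require acquiring the odd/even counter with a compare-exchange, as in the question's read_modify_write.

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

template <std::size_t N>
class SeqLockSlot {
public:
    void write(const std::array<std::uint64_t, N>& value) {
        std::uint64_t s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);           // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);
        for (std::size_t i = 0; i < N; ++i)
            words_[i].store(value[i], std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);            // even again: write complete
    }

    std::array<std::uint64_t, N> read() const {
        std::array<std::uint64_t, N> out;
        for (;;) {
            std::uint64_t s1 = seq_.load(std::memory_order_acquire);
            if (s1 & 1)
                continue;                                         // a writer is active, retry
            for (std::size_t i = 0; i < N; ++i)
                out[i] = words_[i].load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            if (seq_.load(std::memory_order_relaxed) == s1)
                return out;                                       // no write overlapped the copy
        }
    }

private:
    std::atomic<std::uint64_t> seq_{0};
    std::array<std::atomic<std::uint64_t>, N> words_{};
};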
I have an output container similar to this:
#include <cstddef>
#include <cstring>
#include <mutex>

struct cont {
    std::mutex m;
    size_t offset;
    char* data;

    cont(size_t sizeB) : offset(0), data(new char[sizeB]) {}
    ~cont() { delete[] data; }

    void write(char* data, size_t sizeB) {
        // only the reservation of a region is protected by the mutex
        m.lock();
        size_t off = offset;
        offset += sizeB;
        m.unlock();
        // the actual copy happens outside the lock
        std::memcpy(this->data + off, data, sizeB);
    }
};
The idea is that I have many threads, each working on a dynamically sized workload and outputting data in no specific order into that container. The threads are triggered by server access and there is no telling how many are in concurrently or how much they will contribute.
The reason I'm questioning this is because as you can see, the main workload is outside the mutex lock since in theory, only the distribution of the available buffer needs to be synchronized and the threads shouldn't collide after that bit.
It's been working fine so far, but from previous experience threading problems can manifest themselves way down the road, so is this considered a thread-safe practice?
Seems OK. If you want to optimize, you could make the offset atomic and avoid the mutex altogether. So, just declare
std::atomic<size_t> offset;
and the mutex can be removed, provided the reservation is done with a single atomic fetch_add rather than a separate read and increment.
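A minimal sketch of what that write path could then look like (my code, not part of the original answer; the single fetch_add both returns the old offset and advances it, so the reservation stays atomic without a lock):

#include <atomic>
#include <cstddef>
#include <cstring>

struct cont {
    std::atomic<size_t> offset{0};
    char* data;

    explicit cont(size_t sizeB) : data(new char[sizeB]) {}
    ~cont() { delete[] data; }

    void write(char* src, size_t sizeB) {
        // reserve a region atomically, then copy outside any lock
        size_t off = offset.fetch_add(sizeB, std::memory_order_relaxed);
        std::memcpy(data + off, src, sizeB);
    }
};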
As it stands, I'm afraid this is incomplete: your solution correctly allocates space between multiple threads, but you also need a solution for threads to "commit" their writes. Imagine that one writer thread is indefinitely delayed in the midst of a memcpy (or even prior to commencing its memcpy). How does any other thread ever find out about this so that you can eventually use this buffer at all?
This seems perfectly safe. You're probably worried about trampling on "leftover" bytes concurrently when offset changes by a number which is not a multiple of 4 or 8 bytes. I wanted to alleviate your concerns by quoting the Standard, but the entry for memcpy points to the C library reference, which is scant on details. Nevertheless, the function treats the buffers as arrays of unsigned char, so an implementation cannot assume it may optimize the copy of an unaligned or incomplete trailing word, as that could constitute an out-of-bounds access.
So I'm using a boost::lockfree::spsc_queue to communicate between two boost::threads running functors of two objects in my application.
All is fine except for the fact that the spsc_queue::pop() method is non-blocking: it returns true or false even if there is nothing in the queue. However, my queue always seems to return true (problem #1). I think this is because I preallocate the queue.
typedef boost::lockfree::spsc_queue<q_pl, boost::lockfree::capacity<100000> > spsc_queue;
This means that to use the queue efficiently I have to busy-wait, constantly popping the queue and using 100% CPU. I'd rather not sleep for arbitrary amounts of time. I've used other queues in Java which block until an object is made available. Can this be done with std:: or boost:: data structures?
A lock free queue, by definition, does not have blocking operations.
How would you synchronize on the data structure? There is no internal lock, for obvious reasons, because that would mean all clients need to synchronize on it, making it your grandfather's locking concurrent queue.
So indeed, you will have to devise a waiting function yourself. How you do this depends on your use case, which is probably why the library doesn't supply one (disclaimer: I haven't checked and I don't claim to know the full documentation).
So what can you do:
As you already described, you can spin in a tight loop. Obviously, you'll do this if you know that your wait condition (queue non-empty) is always going to be satisfied very quickly.
Alternatively, poll the queue at a certain frequency (doing micro-sleeps in the meantime). Choosing a good polling frequency is an art: for some applications 100ms is optimal, for others a potential 100ms wait would destroy throughput. So, vary and measure your performance indicators (and don't forget about power consumption if your application is going to be deployed on many cores in a datacenter :)).
Lastly, you could arrive at a hybrid solution: spin for a fixed number of iterations and resort to (increasing) interval polling if nothing arrives. This would nicely support server applications where high loads occur in bursts.
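A minimal sketch of such a hybrid wait, with my own choice of spin count and back-off parameters (the queue type only needs a non-blocking pop(T&) like spsc_queue's):

#include <chrono>
#include <thread>

// Spin briefly, then fall back to polling with an increasing sleep interval.
template <class Queue, class T>
void pop_blocking(Queue& q, T& out)
{
    for (int i = 0; i < 1000; ++i)            // phase 1: tight spin
        if (q.pop(out))
            return;

    auto delay = std::chrono::microseconds(50);
    while (!q.pop(out)) {                      // phase 2: poll with back-off
        std::this_thread::sleep_for(delay);
        if (delay < std::chrono::milliseconds(1))
            delay *= 2;
    }
}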
Use a semaphore to cause the producers to sleep when the queue is full, and another semaphore to cause the consumers to sleep when the queue is empty.
When the queue is neither full nor empty, the sem_post and sem_wait operations are non-blocking (in newer kernels).
#include <cassert>
#include <cstddef>
#include <semaphore.h>

template <typename lock_free_container>
class blocking_lock_free
{
public:
    blocking_lock_free(size_t n) : container(n)
    {
        sem_init(&pop_semaphore, 0, 0);
        sem_init(&push_semaphore, 0, n);
    }

    ~blocking_lock_free()
    {
        sem_destroy(&pop_semaphore);
        sem_destroy(&push_semaphore);
    }

    bool push(const typename lock_free_container::value_type& v)
    {
        sem_wait(&push_semaphore);
        bool ret = container.bounded_push(v);
        assert(ret);
        if (ret)
            sem_post(&pop_semaphore);
        else
            sem_post(&push_semaphore); // shouldn't happen
        return ret;
    }

    bool pop(typename lock_free_container::value_type& v)
    {
        sem_wait(&pop_semaphore);
        bool ret = container.pop(v);
        assert(ret);
        if (ret)
            sem_post(&push_semaphore);
        else
            sem_post(&pop_semaphore); // shouldn't happen
        return ret;
    }

private:
    lock_free_container container;
    sem_t pop_semaphore;
    sem_t push_semaphore;
};
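A hypothetical usage sketch, assuming Boost.Lockfree's boost::lockfree::queue (which provides both bounded_push and pop with the signatures used above):

#include <boost/lockfree/queue.hpp>

blocking_lock_free<boost::lockfree::queue<int>> queue(100000);

// producer thread: sleeps on push_semaphore when the queue is full
// queue.push(42);

// consumer thread: sleeps on pop_semaphore when the queue is empty
// int v;
// queue.pop(v);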
I have this problem:
I have some C++ code that uses threads. These threads are pthreads.
In my iPhone app I use NSOperationQueue and also some C++ code.
The problem is this: the C++ pthreads always have lower priority than the NSOperationQueue's threads.
How can I fix this? I have also tried giving a low priority to the NSOperationQueue, but this fix does not work.
If you have to resort to twiddling priority (notably upwards), it's usually indicative of a design flaw in concurrent models. This should be reserved for very special cases, like a realtime thread (e.g. audio playback).
First assess how your threads and tasks operate, and make sure you have no other choice. Typically, you can do something simple, like reducing the operation queue's max operation count, reducing total thread count, or by grouping your tasks by the resource they require.
What method are you using to determine the threads' priorities?
Also note that setting an operation's priority affects the ordering of enqueued operations (not the thread itself).
I've always been able to solve this problem by tweaking how the work is distributed. You should stop reading now :)
Available, but NOT RECOMMENDED:
To lower an operation's priority, you could approach it in your operation's main:
- (void)main
{
    @autoreleasepool {
        const double priority = [NSThread threadPriority];
        const BOOL isMainThread = [NSThread isMainThread];
        if (!isMainThread) {
            [NSThread setThreadPriority:priority * 0.5];
        }

        // ... do your work here ...

        if (!isMainThread) {
            [NSThread setThreadPriority:priority];
        }
    }
}
If you really need to push the kernel after all that, this is how you can set a pthread's priority:
pthreads with real time priority
How to increase thread priority in pthreads?
I came across a problem in multithreading. The model is 1 producer - N consumers.
The producer produces data (character data, around 200 bytes each) and puts it in a fixed-size cache (i.e. 2 million entries). The data is not relevant to all the threads: the producer applies a (configured) filter and determines which threads qualify for the produced data.
The producer pushes a pointer to the data into the queue of each qualifying thread (only a pointer, to avoid copying the data). The threads dequeue it and send it over TCP/IP to their clients.
Problem: because only a pointer to the data is handed to multiple threads, when the cache becomes full and the producer wants to delete the first (oldest) item, some thread may still be referring to that data.
Feasible way: use atomic granularity. When the producer determines the number of qualifying threads, it can update a counter and a list of thread ids.
class InUseCounter
{
    int           m_count;
    set<thread_t> m_in_use_threads;
    Mutex         m_mutex;
    Condition     m_cond;

public:
    // This constructor is used by the producer
    InUseCounter(int count, set<thread_t> tlist)
    {
        m_count = count;
        m_in_use_threads = tlist;
    }

    // This function is called by each thread when it is done with the
    // data, to signal that it no longer holds a reference to it.
    void decrement(thread_t tid)
    {
        Guard<Mutex> lock(m_mutex);
        --m_count;
        m_in_use_threads.erase(tid);
    }

    int get_count() const { return m_count; }
};
master cache
map<seqnum, Data>
|
v
pair<CharData, InUseCounter>
When the producer removes an element, it checks the counter; if it is greater than 0, it sends an action to the threads in the m_in_use_threads set asking them to release their references.
Questions:
If there are 2 million records in the master cache, there will be an equal number of InUseCounters, and hence Mutex variables. Is it advisable to have 2 million mutex variables in one single process?
Having one big data structure to maintain the InUseCounters will cause more locking time to find and decrement them.
What would be the best alternative to my approach for finding out who holds references, with as little locking time as possible?
Thanks in advance for your advice.
2 million mutexes is a bit much. Even if they are lightweight locks, they still take up some overhead.
Putting the InUseCounter in a single structure would end up involving contention between threads when they release a record; if the threads do not execute in lockstep, this might be negligible. If they are frequently releasing records and the contention rate goes up, this is obviously a performance sink.
You can improve performance by having one thread responsible for maintaining the record reference counts (the producer thread) and having the other threads send back record release events over a separate queue, in effect, turning the producer into a record release event consumer. When you need to flush an entry, process all the release queues first, then run your release logic. You will have some latency to deal with, as you are now queueing up release events instead of attempting to process them immediately, but the performance should be much better.
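A rough sketch of that scheme (all names here are mine; the release queue is shown with a plain mutex since only short enqueue/drain operations touch it, but it could equally be a lock-free queue):

#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

// Consumers never touch the reference counts directly; they enqueue the
// sequence number of the record they are done with, and the producer drains
// the queue before evicting anything.
class ReleaseQueue {
public:
    void post(std::uint64_t seqnum) {            // called by consumer threads
        std::lock_guard<std::mutex> lock(m_);
        pending_.push_back(seqnum);
    }

    std::vector<std::uint64_t> drain() {         // called by the producer only
        std::lock_guard<std::mutex> lock(m_);
        std::vector<std::uint64_t> out;
        out.swap(pending_);
        return out;
    }

private:
    std::mutex m_;
    std::vector<std::uint64_t> pending_;
};

// Producer-side bookkeeping: ref_counts is owned by the producer alone,
// so no per-record mutex is needed at all.
void apply_releases(ReleaseQueue& q,
                    std::unordered_map<std::uint64_t, int>& ref_counts) {
    for (std::uint64_t seq : q.drain())
        if (--ref_counts[seq] == 0)
            ref_counts.erase(seq);               // record is now safe to evict
}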
Incidentally, this is similar to how the Disruptor framework works. It's a high-performance Java(!) concurrency framework for high-frequency trading. Yes, I did say high performance, Java, and concurrency in the same sentence. It offers a lot of valuable insight into high-performance concurrency design and implementation.
Since you already have a Producer->Consumer queue, one very simple system consists in having a "feedback" queue (Consumer->Producer).
After having consumed an item, the consumer feeds the pointer back to the producer, so that the producer can remove the item and update the "free list" of the cache.
This way, only the Producer ever touches the cache innards, and no synchronization is necessary there: only the queues need be synchronized.
Yes, 2,000,000 mutexes are overkill.
One big structure will be locked for longer, but will require far fewer lock/unlock operations.
The best approach would be to use shared_ptr smart pointers: they seem tailor-made for this. You don't check the counter yourself, you just clean up your copy of the pointer. shared_ptr is thread-safe (its reference count, not the data it points to), but for 1 producer (writer) / N consumers (readers) this should not be an issue.
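A minimal sketch of that idea with illustrative names of my own: the cache and each qualifying consumer queue hold their own shared_ptr copy, and the record's storage is freed automatically when the last copy goes away.

#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct Data { char bytes[200]; };

// Owned and modified by the producer only.
std::map<std::uint64_t, std::shared_ptr<Data>> master_cache;

// One pointer queue per consumer thread; push/pop synchronization is omitted here.
std::vector<std::vector<std::shared_ptr<Data>>> consumer_queues;

void produce(std::uint64_t seqnum, std::shared_ptr<Data> d,
             const std::vector<std::size_t>& qualifying)
{
    master_cache[seqnum] = d;                    // the cache holds one reference
    for (std::size_t idx : qualifying)
        consumer_queues[idx].push_back(d);       // each qualifying consumer holds its own
}

void evict_oldest()
{
    if (!master_cache.empty())
        master_cache.erase(master_cache.begin());
    // The Data block itself is destroyed only when the last consumer
    // drops its shared_ptr copy; no manual reference counting is needed.
}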