How does LMAX's disruptor pattern work? - concurrency

I am trying to understand the disruptor pattern. I have watched the InfoQ video and tried to read their paper. I understand there is a ring buffer involved, and that it is initialized as an extremely large array to take advantage of cache locality and to eliminate allocation of new memory.
It sounds like there are one or more atomic integers which keep track of positions. Each 'event' seems to get a unique id, and its position in the ring is found by taking that id modulo the size of the ring, etc.
Unfortunately, I don't have an intuitive sense of how it works. I have done many trading applications and studied the actor model, looked at SEDA, etc.
In their presentation they mentioned that this pattern is basically how routers work; however I haven't found any good descriptions of how routers work either.
Are there some good pointers to a better explanation?

The Google Code project does reference a technical paper on the implementation of the ring buffer, however it is a bit dry, academic and tough going for someone wanting to learn how it works. However there are some blog posts that have started to explain the internals in a more readable way. There is an explanation of ring buffer that is the core of the disruptor pattern, a description of the consumer barriers (the part related to reading from the disruptor) and some information on handling multiple producers available.
The simplest description of the Disruptor is: It is a way of sending messages between threads in the most efficient manner possible. It can be used as an alternative to a queue, but it also shares a number of features with SEDA and Actors.
Compared to Queues:
The Disruptor provides the ability to pass a message on to another thread, waking it up if required (similar to a BlockingQueue). However, there are 3 distinct differences.
1. The user of the Disruptor defines how messages are stored by extending the Entry class and providing a factory to do the preallocation. This allows for either memory reuse (copying) or the Entry could contain a reference to another object.
2. Putting messages into the Disruptor is a 2-phase process. First a slot is claimed in the ring buffer, which provides the user with the Entry that can be filled with the appropriate data. Then the entry must be committed. This 2-phase approach is necessary to allow for the flexible use of memory mentioned above. It is the commit that makes the message visible to the consumer threads.
3. It is the responsibility of the consumer to keep track of the messages that have been consumed from the ring buffer. Moving this responsibility away from the ring buffer itself helps reduce the amount of write contention, as each thread maintains its own counter.
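The claim/commit sequence and the per-consumer counter described above can be sketched as follows. This is a minimal single-producer illustration, not the Disruptor's actual API: the names `ValueEntry`, `TwoPhaseRing` and `TrackingConsumer` are hypothetical, and the sketch deliberately omits the check that stops a producer from wrapping around onto entries that have not been read yet.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Preallocated, mutable entry (hypothetical name). */
final class ValueEntry {
    long value;
}

/** Single-producer ring with a claim step and a commit step. */
final class TwoPhaseRing {
    private final ValueEntry[] entries;
    private final int mask;
    private long claimed = -1;                              // touched only by the producer
    private final AtomicLong cursor = new AtomicLong(-1);   // last committed sequence

    TwoPhaseRing(int sizePowerOfTwo) {
        entries = new ValueEntry[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
        for (int i = 0; i < entries.length; i++) {
            entries[i] = new ValueEntry();                  // allocate once, reuse forever
        }
    }

    /** Phase 1: claim the next sequence; nothing is visible to consumers yet. */
    long claim() { return ++claimed; }

    ValueEntry get(long sequence) { return entries[(int) (sequence & mask)]; }

    /** Phase 2: commit; the volatile write makes the filled entry visible. */
    void commit(long sequence) { cursor.set(sequence); }

    long cursor() { return cursor.get(); }
}

/** Each consumer keeps its own counter, so the ring has no shared read index. */
final class TrackingConsumer {
    private long consumed = -1;
    long sum;

    /** Reads everything committed so far and returns how many entries were read. */
    int drain(TwoPhaseRing ring) {
        long available = ring.cursor();
        int read = 0;
        while (consumed < available) {
            sum += ring.get(++consumed).value;
            read++;
        }
        return read;
    }
}
```

An entry that has been claimed and filled but not committed stays invisible to `drain`, which is the point of the second phase.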
Compared to Actors
The Actor model is closer to the Disruptor than most other programming models, especially if you use the BatchConsumer/BatchHandler classes that are provided. These classes hide all of the complexities of maintaining the consumed sequence numbers and provide a set of simple callbacks when important events occur. However, there are a couple of subtle differences.
1. The Disruptor uses a 1 thread - 1 consumer model, where Actors use an N:M model, i.e. you can have as many actors as you like and they will be distributed across a fixed number of threads (generally 1 per core).
2. The BatchHandler interface provides an additional (and very important) callback, onEndOfBatch(). This allows slow consumers, e.g. those doing I/O, to batch events together to improve throughput. It is possible to do batching in other Actor frameworks; however, as nearly all other frameworks don't provide a callback at the end of the batch, you need to use a timeout to determine the end of the batch, resulting in poor latency.
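To illustrate why an end-of-batch callback helps a slow consumer, here is a small sketch. The interface `BatchCallback` and the classes `BufferingWriter` and `BatchDispatcher` are hypothetical stand-ins, not the Disruptor's BatchHandler API; the list passed to `dispatch` stands in for "everything currently available in the ring".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical handler interface with an explicit end-of-batch signal. */
interface BatchCallback<T> {
    void onAvailable(T event);
    void onEndOfBatch();
}

/** Buffers events and performs one (simulated) I/O write per batch. */
final class BufferingWriter implements BatchCallback<String> {
    private final StringBuilder pending = new StringBuilder();
    final List<String> flushed = new ArrayList<>();   // stands in for the I/O device

    @Override
    public void onAvailable(String event) {
        pending.append(event).append('\n');           // cheap: no I/O yet
    }

    @Override
    public void onEndOfBatch() {
        if (pending.length() > 0) {
            flushed.add(pending.toString());          // one write for the whole batch
            pending.setLength(0);
        }
    }
}

/** Delivers every available event, then signals the end of the batch. */
final class BatchDispatcher {
    static <T> void dispatch(List<T> available, BatchCallback<T> handler) {
        for (T event : available) {
            handler.onAvailable(event);
        }
        handler.onEndOfBatch();                       // no timeout needed to detect the end
    }
}
```

Because the dispatcher knows when it has run out of available events, the handler can flush immediately instead of waiting on a timer.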
Compared to SEDA
LMAX built the Disruptor pattern to replace a SEDA based approach.
The main improvement that it provided over SEDA was the ability to do work in parallel. To do this the Disruptor supports multi-casting the same messages (in the same order) to multiple consumers. This avoids the need for fork stages in the pipeline.
We also allow consumers to wait on the results of other consumers without having to put another queuing stage between them. A consumer can simply watch the sequence number of a consumer that it is dependent on. This avoids the need for join stages in the pipeline.
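One way to picture "watching the sequence number of a consumer it depends on" is the sketch below. It assumes each participant publishes its progress in an `AtomicLong`; the class and method names are hypothetical and this is not the library's barrier implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Computes how far a dependent consumer may read (illustrative only). */
final class DependencyGate {
    /**
     * A consumer may read up to the lowest of the producer cursor and the
     * sequences of every consumer it depends on. With no dependencies the
     * limit is simply the producer cursor.
     */
    static long highestReadable(AtomicLong producerCursor, AtomicLong... upstream) {
        long limit = producerCursor.get();
        for (AtomicLong sequence : upstream) {
            limit = Math.min(limit, sequence.get());
        }
        return limit;
    }
}
```

A "join" of two upstream consumers then needs no extra queue: the downstream consumer just passes both upstream sequences to `highestReadable`.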
Compared to Memory Barriers
Another way to think about it is as a structured, ordered memory barrier, where the producer barrier forms the write barrier and the consumer barrier is the read barrier.

First we'd like to understand the programming model it offers.
There are one or more writers. There are one or more readers. There is a line of entries, totally ordered from old to new (pictured as left to right). Writers can add new entries on the right end. Every reader reads entries sequentially from left to right. Readers can't read past writers, obviously.
There is no concept of entry deletion. I use "reader" instead of "consumer" to avoid the image of entries being consumed. However we understand that entries on the left of the last reader become useless.
Generally readers can read concurrently and independently. However, we can declare dependencies among readers. Reader dependencies can form an arbitrary acyclic graph. If reader B depends on reader A, reader B can't read past reader A.
Reader dependency arises because reader A can annotate an entry, and reader B depends on that annotation. For example, A does some calculation on an entry and stores the result in field a of the entry. A then moves on, and now B can read the entry, including the value of a that A stored. If reader C does not depend on A, C should not attempt to read a.
This is indeed an interesting programming model. Regardless of the performance, the model alone can benefit lots of applications.
Of course, LMAX's main goal is performance. It uses a pre-allocated ring of entries. The ring is large enough, but it's bounded so that the system will not be loaded beyond design capacity. If the ring is full, writer(s) will wait until the slowest readers advance and make room.
Entry objects are pre-allocated and live forever, to reduce garbage collection cost. We don't insert new entry objects or delete old entry objects; instead, a writer asks for a pre-existing entry, populates its fields, and notifies readers. This apparent 2-phase action is really just a single atomic action:
setNewEntry(EntryPopulator);
interface EntryPopulator{ void populate(Entry existingEntry); }
Pre-allocating entries also means adjacent entries are very likely located in adjacent memory cells, and because readers read entries sequentially, this is important for making good use of CPU caches.
A lot of effort also goes into avoiding locks, CAS, and even memory barriers (e.g. using a non-volatile sequence variable if there's only one writer).
For developers of readers: Different annotating readers should write to different fields, to avoid write contention. (Actually they should write to different cache lines.) An annotating reader should not touch anything that other non-dependent readers may read. This is why I say these readers annotate entries, instead of modify entries.

Martin Fowler has written an article about LMAX and the disruptor pattern, The LMAX Architecture, which may clarify it further.

I actually took the time to study the actual source, out of sheer curiosity, and the idea behind it is quite simple. The most recent version at the time of writing this post is 3.2.1.
There is a buffer storing pre-allocated events that will hold the data for consumers to read.
The buffer is backed by an array of flags (an integer array) of the same length, which describes the availability of the buffer slots (see below for details). The array is accessed like a Java AtomicIntegerArray, so for the purpose of this explanation you may as well assume it to be one.
There can be any number of producers. When a producer wants to write to the buffer, a long number is generated (as in calling AtomicLong#getAndIncrement; the Disruptor actually uses its own implementation, but it works in the same manner). Let's call this generated long a producerCallId. In a similar manner, a consumerCallId is generated when a consumer ENDS reading a slot from the buffer. The most recent consumerCallId is accessed.
(If there are many consumers, the call with the lowest id is chosen.)
These ids are then compared, and if the difference between the two is less than the buffer size, the producer is allowed to write.
(If the producerCallId is greater than the recent consumerCallId + bufferSize, it means that the buffer is full, and the producer is forced to busy-wait until a spot becomes available.)
The producer is then assigned a slot in the buffer based on its callId (which is producerCallId modulo bufferSize, but since bufferSize is always a power of 2 (a limit enforced on buffer creation), the actual operation used is producerCallId & (bufferSize - 1)). It is then free to modify the event in that slot.
(The actual algorithm is a bit more complicated, involving caching recent consumerId in a separate atomic reference, for optimisation purposes.)
Once the event has been modified, the change is "published". When publishing, the respective slot in the flag array is filled with the updated flag. The flag value is the number of the loop (producerCallId divided by bufferSize; again, since bufferSize is a power of 2, the actual operation is a right shift).
In a similar manner there can be any number of consumers. Each time a consumer wants to access the buffer, a consumerCallId is generated (depending on how the consumers were added to the disruptor, the atomic used in id generation may be shared or separate for each of them). This consumerCallId is then compared to the most recent producerCallId, and if it is the lesser of the two, the reader is allowed to progress.
(Similarly, if the producerCallId is equal to the consumerCallId, it means that the buffer is empty and the consumer is forced to wait. The manner of waiting is defined by a WaitStrategy during disruptor creation.)
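The two bit operations mentioned so far, the mask that turns a call id into a slot index and the shift that turns it into a loop number, can be written out as a short sketch. The class and method names are hypothetical; only the arithmetic comes from the description.

```java
/** Slot index and lap ("loop number") arithmetic for a power-of-two buffer. */
final class SlotMath {
    /** Same result as sequence % bufferSize when bufferSize is a power of two. */
    static int indexOf(long sequence, int bufferSize) {
        return (int) (sequence & (bufferSize - 1));
    }

    /** Same result as sequence / bufferSize: shift right by log2(bufferSize). */
    static int lapOf(long sequence, int bufferSize) {
        return (int) (sequence >>> Integer.numberOfTrailingZeros(bufferSize));
    }
}
```

For example, with a buffer of 8 slots, call id 19 lands in slot 3 on lap 2, since 19 = 2 * 8 + 3.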
For individual consumers (the ones with their own id generator), the next thing checked is the ability to batch consume. The slots in the buffer are examined in order from the one respective to the consumerCallId (the index is determined in the same manner as for producers), to the one respective to the recent producerCallId.
They are examined in a loop by comparing the flag value written in the flag array against a flag value generated for the consumerCallId. If the flags match, it means that the producers filling the slots have committed their changes. If not, the loop is broken, and the highest committed changeId is returned. The slots from consumerCallId to the returned changeId can be consumed in batch.
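The flag-array scan described above can be sketched like this. It assumes the flag array is a plain `AtomicIntegerArray` initialised to -1 (meaning "no lap published yet"); the class name and method names are hypothetical and this is a simplification of what the library does.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Availability flags per slot, plus the scan used for batch consumption. */
final class AvailabilityScan {
    private final AtomicIntegerArray flags;
    private final int mask;
    private final int shift;

    AvailabilityScan(int bufferSizePowerOfTwo) {
        flags = new AtomicIntegerArray(bufferSizePowerOfTwo);
        for (int i = 0; i < bufferSizePowerOfTwo; i++) {
            flags.set(i, -1);                       // nothing published yet
        }
        mask = bufferSizePowerOfTwo - 1;
        shift = Integer.numberOfTrailingZeros(bufferSizePowerOfTwo);
    }

    /** Producer side: store the lap number in the slot's flag. */
    void publish(long sequence) {
        flags.set((int) (sequence & mask), (int) (sequence >>> shift));
    }

    /** A slot is ready for this sequence only if its flag holds the matching lap. */
    boolean isAvailable(long sequence) {
        return flags.get((int) (sequence & mask)) == (int) (sequence >>> shift);
    }

    /**
     * Consumer side: walk from 'from' to 'to' and stop at the first gap.
     * Returns the last sequence of the contiguous published run, or from - 1.
     */
    long highestPublished(long from, long to) {
        for (long sequence = from; sequence <= to; sequence++) {
            if (!isAvailable(sequence)) {
                return sequence - 1;
            }
        }
        return to;
    }
}
```

If producers have published sequences 0, 1 and 3 but not 2, the scan returns 1, so only slots 0 and 1 are handed out as a batch.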
If a group of consumers read together (the ones with shared id generator), each one only takes a single callId, and only the slot for that single callId is checked and returned.

From this article:
The disruptor pattern is a batching queue backed up by a circular
array (i.e. the ring buffer) filled with pre-allocated transfer
objects which uses memory-barriers to synchronize producers and
consumers through sequences.
Memory-barriers are kind of hard to explain and Trisha's blog has done the best attempt in my opinion with this post: http://mechanitis.blogspot.com/2011/08/dissecting-disruptor-why-its-so-fast.html
But if you don't want to dive into the low-level details you can just know that memory-barriers in Java are implemented through the volatile keyword or through java.util.concurrent.atomic.AtomicLong. The disruptor pattern sequences are AtomicLongs and are communicated back and forth among producers and consumers through memory-barriers instead of locks.
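As a minimal illustration of a sequence acting as the memory barrier, the sketch below publishes a plain (non-volatile) field through an `AtomicLong`. The class `PublishedBox` is hypothetical; the only guarantee relied on is the standard Java one that a write to an AtomicLong happens-before a later read that observes it.

```java
import java.util.concurrent.atomic.AtomicLong;

/** One plain payload field published through a sequence (illustrative only). */
final class PublishedBox {
    private long payload;                                   // deliberately not volatile
    private final AtomicLong sequence = new AtomicLong(-1);

    /** Single writer: fill the payload first, then advance the sequence. */
    void publish(long value) {
        payload = value;
        sequence.set(sequence.get() + 1);   // volatile write: acts as the write barrier
    }

    /** Reader: spin until the sequence reaches 'expected', then read the payload. */
    long awaitAndRead(long expected) {
        while (sequence.get() < expected) { // volatile read: acts as the read barrier
            Thread.yield();
        }
        return payload;                     // guaranteed to see the published value
    }
}
```

No lock is taken on either side; the ordering comes entirely from the write and read of the sequence.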
I find it easier to understand a concept through code, so the code below is a simple helloworld from CoralQueue, which is a disruptor pattern implementation done by CoralBlocks with which I am affiliated. In the code below you can see how the disruptor pattern implements batching and how the ring-buffer (i.e. circular array) allows for garbage-free communication between two threads:
package com.coralblocks.coralqueue.sample.queue;

import com.coralblocks.coralqueue.AtomicQueue;
import com.coralblocks.coralqueue.Queue;
import com.coralblocks.coralqueue.util.MutableLong;

public class Sample {

    public static void main(String[] args) throws InterruptedException {

        final Queue<MutableLong> queue = new AtomicQueue<MutableLong>(1024, MutableLong.class);

        Thread consumer = new Thread() {
            @Override
            public void run() {
                boolean running = true;
                while(running) {
                    long avail;
                    while((avail = queue.availableToPoll()) == 0); // busy spin
                    for(int i = 0; i < avail; i++) {
                        MutableLong ml = queue.poll();
                        if (ml.get() == -1) {
                            running = false;
                        } else {
                            System.out.println(ml.get());
                        }
                    }
                    queue.donePolling();
                }
            }
        };

        consumer.start();

        MutableLong ml;
        for(int i = 0; i < 10; i++) {
            while((ml = queue.nextToDispatch()) == null); // busy spin
            ml.set(System.nanoTime());
            queue.flush();
        }

        // send a message to stop consumer...
        while((ml = queue.nextToDispatch()) == null); // busy spin
        ml.set(-1);
        queue.flush();

        consumer.join(); // wait for the consumer thread to die...
    }
}

Related

Lock-free producer consumer [duplicate]

Anecdotally, I've found that a lot of programmers mistakenly believe that "lock-free" simply means "concurrent programming without mutexes". Usually, there's also a correlated misunderstanding that the purpose of writing lock-free code is for better concurrent performance. Of course, the correct definition of lock-free is actually about progress guarantees. A lock-free algorithm guarantees that at least one thread is able to make forward progress regardless of what any other threads are doing.
This means a lock-free algorithm can never have code where one thread is depending on another thread in order to proceed. E.g., lock-free code can not have a situation where Thread A sets a flag, and then Thread B keeps looping while waiting for Thread A to unset the flag. Code like that is basically implementing a lock (or what I would call a mutex in disguise).
However, other cases are more subtle and there are some cases where I honestly can't really tell if an algorithm qualifies as lock-free or not, because the notion of "making progress" sometimes appears subjective to me.
One such case is in the (well-regarded, afaik) concurrency library, liblfds. I was studying the implementation of a multi-producer/multi-consumer bounded queue in liblfds - the implementation is very straightforward, but I cannot really tell if it should qualify as lock-free.
The relevant algorithm is in lfds711_queue_bmm_enqueue.c. Liblfds uses custom atomics and memory barriers, but the algorithm is simple enough for me to describe in a paragraph or so.
The queue itself is a bounded contiguous array (ringbuffer). There is a shared read_index and write_index. Each slot in the queue contains a field for user-data, and a sequence_number value, which is basically like an epoch counter. (This avoids ABA issues).
The PUSH algorithm is as follows:
1. Atomically LOAD the write_index.
2. Attempt to reserve a slot in the queue at write_index % queue_size using a CompareAndSwap loop that attempts to set write_index to write_index + 1.
3. If the CompareAndSwap is successful, copy the user data into the reserved slot.
4. Finally, update the sequence_number on the slot by making it equal to write_index + 1.
The actual source code uses custom atomics and memory barriers, so for further clarity about this algorithm I've briefly translated it into (untested) standard C++ atomics for better readability, as follows:
bool mcmp_queue::enqueue(void* data)
{
    int write_index = m_write_index.load(std::memory_order_relaxed);
    slot* s;
    for (;;)
    {
        s = &m_slots[write_index % m_num_slots];
        int sequence_number = s->sequence_number.load(std::memory_order_acquire);
        int difference = sequence_number - write_index;
        if (difference == 0)
        {
            // On failure, compare_exchange_weak reloads write_index for the retry
            if (m_write_index.compare_exchange_weak(
                write_index,
                write_index + 1,
                std::memory_order_acq_rel
            ))
            {
                break;
            }
        }
        else if (difference < 0)
        {
            return false; // queue is full
        }
        else
        {
            // Another producer already took this slot: reload and retry
            write_index = m_write_index.load(std::memory_order_relaxed);
        }
    }
    // Copy user-data and update sequence number
    //
    s->user_data = data;
    s->sequence_number.store(write_index + 1, std::memory_order_release);
    return true;
}
Now, a thread that wants to POP an element from the slot at read_index will not be able to do so until it observes that the slot's sequence_number is equal to read_index + 1.
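For readers who want to see the POP side next to the PUSH side, here is a Java sketch of the same sequence-number scheme. It is my own untested-in-production translation of the description above, not liblfds code: the class name is hypothetical, the capacity is assumed to be a power of two, and each slot's sequence number starts at its own index.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

/** Bounded multi-producer/multi-consumer queue using per-slot sequence numbers. */
final class BoundedMpmcQueue<T> {
    private final Object[] data;
    private final AtomicLongArray sequence;
    private final int mask;
    private final AtomicLong writeIndex = new AtomicLong();
    private final AtomicLong readIndex = new AtomicLong();

    BoundedMpmcQueue(int capacityPowerOfTwo) {
        data = new Object[capacityPowerOfTwo];
        sequence = new AtomicLongArray(capacityPowerOfTwo);
        mask = capacityPowerOfTwo - 1;
        for (int i = 0; i < capacityPowerOfTwo; i++) {
            sequence.set(i, i);                    // slot i is free for write index i
        }
    }

    /** Returns false if the queue is observed to be full. */
    boolean enqueue(T item) {
        for (;;) {
            long pos = writeIndex.get();
            int slot = (int) (pos & mask);
            long diff = sequence.get(slot) - pos;
            if (diff < 0) return false;            // slot still holds an unread element
            if (diff == 0 && writeIndex.compareAndSet(pos, pos + 1)) {
                data[slot] = item;                 // slot is reserved, fill it
                sequence.set(slot, pos + 1);       // release it to readers
                return true;
            }
        }
    }

    /** Returns null if the head slot is empty or its push has not completed. */
    @SuppressWarnings("unchecked")
    T dequeue() {
        for (;;) {
            long pos = readIndex.get();
            int slot = (int) (pos & mask);
            long diff = sequence.get(slot) - (pos + 1);
            if (diff < 0) return null;             // sequence_number != read_index + 1 yet
            if (diff == 0 && readIndex.compareAndSet(pos, pos + 1)) {
                T item = (T) data[slot];
                data[slot] = null;
                sequence.set(slot, pos + mask + 1); // free the slot for the next lap
                return item;
            }
        }
    }
}
```

Note that `dequeue` cannot tell "empty" apart from "a push to the head slot is still in progress"; both cases return null.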
Okay, so there are no mutexes here, and the algorithm likely performs well (it's only a single CAS for PUSH and POP), but is this lock-free? The reason it's unclear to me is because the definition of "making progress" seems murky when there is the possibility that a PUSH or POP can always just fail if the queue is observed to be full or empty.
But what's questionable to me is that the PUSH algorithm essentially reserves a slot, meaning that the slot can never be POP'd until the push thread gets around to updating the sequence number. This means that a POP thread that wants to pop a value depends on the PUSH thread having completed the operation. Otherwise, the POP thread will always return false because it thinks the queue is EMPTY. It seems debatable to me whether this actually falls within the definition of "making progress".
Generally, truly lock-free algorithms involve a phase where a pre-empted thread actually tries to ASSIST the other thread in completing an operation. So, in order to be truly lock-free, I would think that a POP thread that observes an in-progress PUSH would actually need to try and complete the PUSH, and then only after that, perform the original POP operation. If the POP thread simply returns that the queue is EMPTY when a PUSH is in progress, the POP thread is basically blocked until the PUSH thread completes the operation. If the PUSH thread dies, or goes to sleep for 1,000 years, or otherwise gets scheduled into oblivion, the POP thread can do nothing except continuously report that the queue is EMPTY.
So does this fit the definition of lock-free? From one perspective, you can argue that the POP thread can always make progress, because it can always report that the queue is EMPTY (which is at least some form of progress I guess.) But to me, this isn't really making progress, since the only reason the queue is observed as empty is because we are blocked by a concurrent PUSH operation.
So, my question is: is this algorithm truly lock-free? Or is the index reservation system basically a mutex in disguise?
This queue data structure is not strictly lock-free by what I consider the most reasonable definition. That definition is something like:
A structure is lock-free if and only if any thread can be indefinitely
suspended at any point while still leaving the structure usable by the
remaining threads.
Of course this implies a suitable definition of usable, but for most structures this is fairly simple: the structure should continue to obey its contracts and allow elements to be inserted and removed as expected.
In this case a thread that has succeeded in incrementing m_write_index, but hasn't yet written s.sequence_number, leaves the container in what will soon be an unusable state. If such a thread is killed, the container will eventually report both "full" and "empty" to push and pop respectively, violating the contract of a fixed size queue.
There is a hidden mutex here (the combination of m_write_index and the associated s.sequence_number) - but it basically works like a per-element mutex. So the failure only becomes apparent to writers once you've looped around and a new writer tries to get the mutex, but in fact all subsequent writers have effectively failed to insert their element into the queue since no reader will ever see it.
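The failure mode can be reproduced deterministically on a single thread if the push is split into its two steps. The sketch below is a hypothetical Java rendering of the same scheme (not liblfds code) in which `reserve` is the index CAS and `publish` is the sequence number write, so a "stalled" producer is simply one that called `reserve` and has not yet called `publish`.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

/** Sequence-number queue with the push split into two visible steps. */
final class StallDemoQueue {
    private final long[] values;
    private final AtomicLongArray sequence;
    private final int mask;
    private final AtomicLong writeIndex = new AtomicLong();
    private final AtomicLong readIndex = new AtomicLong();

    StallDemoQueue(int capacityPowerOfTwo) {
        values = new long[capacityPowerOfTwo];
        sequence = new AtomicLongArray(capacityPowerOfTwo);
        mask = capacityPowerOfTwo - 1;
        for (int i = 0; i < capacityPowerOfTwo; i++) sequence.set(i, i);
    }

    /** Step 1 of a push: reserve a position, or return -1 if the queue looks full. */
    long reserve() {
        for (;;) {
            long pos = writeIndex.get();
            long diff = sequence.get((int) (pos & mask)) - pos;
            if (diff < 0) return -1;
            if (diff == 0 && writeIndex.compareAndSet(pos, pos + 1)) return pos;
        }
    }

    /** Step 2 of a push: store the value and release the slot to readers. */
    void publish(long pos, long value) {
        values[(int) (pos & mask)] = value;
        sequence.set((int) (pos & mask), pos + 1);
    }

    /** Pops the oldest value, or returns null if the head slot is not published. */
    Long pop() {
        for (;;) {
            long pos = readIndex.get();
            int slot = (int) (pos & mask);
            long diff = sequence.get(slot) - (pos + 1);
            if (diff < 0) return null;
            if (diff == 0 && readIndex.compareAndSet(pos, pos + 1)) {
                long value = values[slot];
                sequence.set(slot, pos + mask + 1);
                return value;
            }
        }
    }
}
```

With a capacity of 4, reserving position 0 without publishing it and then completing three more pushes leaves the queue reporting "full" to `reserve` and "empty" to `pop` at the same time, until the stalled producer finally calls `publish`.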
Now this doesn't mean this is a bad implementation of a concurrent queue. For some uses it may behave mostly as if it was lock free. For example, this structure may have most of the useful performance properties of a truly lock-free structure, but at the same time it lacks some of the useful correctness properties. Basically the term lock-free usually implies a whole bunch of properties, only a subset of which will usually be important for any particular use. Let's look at them one by one and see how this structure does. We'll broadly categorize them into performance and functional categories.
Performance
Uncontended Performance
The uncontended or "best case" performance is important for many structures. While you need a concurrent structure for correctness, you'll usually still try to design your application so that contention is kept to a minimum, so the uncontended cost is often important. Some lock-free structures help here, by reducing the number of expensive atomic operations in the uncontended fast-path, or avoiding a syscall.
This queue implementation does a reasonable job here: there is only a single "definitely expensive" operation: the compare_exchange_weak, and a couple of possibly expensive operations (the memory_order_acquire load and memory_order_release store) [1], and little other overhead.
This compares to something like std::mutex which would imply something like one atomic operation for lock and another for unlock, and in practice on Linux the pthread calls have non-negligible overhead as well.
So I expect this queue to perform reasonably well in the uncontended fast-path.
Contended Performance
One advantage of lock-free structures is that they often allow better scaling when a structure is heavily contended. This isn't necessarily an inherent advantage: some lock-based structures with multiple locks or read-write locks may exhibit scaling that matches or exceeds some lock-free approaches, but it is usually the case that lock-free structures exhibit better scaling than a simple one-lock-to-rule-them-all alternative.
This queue performs reasonably in this respect. The m_write_index variable is atomically updated by all writers and will be a point of contention, but the behavior should be reasonable as long as the underlying hardware CAS implementation is reasonable.
Note that a queue is generally a fairly poor concurrent structure since inserts and removals all happen at the same places (the head and the tail), so contention is inherent in the definition of the structure. Compare this to a concurrent map, where different elements have no particular ordered relationship: such a structure can offer efficient contention-free simultaneous mutation if different elements are being accessed.
Context-switch Immunity
One performance advantage of lock-free structures that is related to the core definition above (and also to the functional guarantees) is that a context switch of a thread which is mutating the structure doesn't delay all the other mutators. In a heavily loaded system (especially when runnable threads >> available cores), a thread may be switched out for hundreds of milliseconds or seconds. During this time, any concurrent mutators will block and incur additional scheduling costs (or they will spin which may also produce poor behavior). Even though such "unlucky scheduling" may be rare, when it does occur the entire system may incur a serious latency spike.
Lock-free structures avoid this since there is no "critical region" where a thread can be context switched out and subsequently block forward progress by other threads.
This structure offers partial protection in this area — the specifics of which depend on the queue size and application behavior. Even if a thread is switched out in the critical region between the m_write_index update and the sequence number write, other threads can continue to push elements to the queue as long as they don't wrap all the way around to the in-progress element from the stalled thread. Threads can also pop elements, but only up to the in-progress element.
While the push behavior may not be a problem for high-capacity queues, the pop behavior can be a problem: if the queue has a high throughput compared to the average time a thread is context switched out, and the average fullness, the queue will quickly appear empty to all consumer threads, even if there are many elements added beyond the in-progress element. This isn't affected by the queue capacity, but simply the application behavior. It means that the consumer side may completely stall when this occurs. In this respect, the queue doesn't look very lock-free at all!
Functional Aspects
Async Thread Termination
One advantage of lock-free structures is that they are safe for use by threads that may be asynchronously canceled or may otherwise terminate exceptionally in the critical region. Cancelling a thread at any point leaves the structure in a consistent state.
This is not the case for this queue, as described above.
Queue Access from Interrupt or Signal
A related advantage is that lock-free structures can usually be examined or mutated from an interrupt or signal. This is useful in many cases where an interrupt or signal shares a structure with regular process threads.
This queue mostly supports this use case. Even if the signal or interrupt occurs when another thread is in the critical region, the asynchronous code can still push an element onto the queue (which will only be seen later by consuming threads) and can still pop an element off of the queue.
The behavior isn't as complete as a true lock-free structure: imagine a signal handler with a way to tell the remaining application threads (other than the interrupted one) to quiesce and which then drains all the remaining elements of the queue. With a true lock-free structure, this would allow the signal handler to fully drain all the elements, but this queue might fail to do that in the case a thread was interrupted or switched out in the critical region.
[1] In particular, on x86, this will only use an atomic operation for the CAS as the memory model is strong enough to avoid the need for atomics or fencing for the other operations. Recent ARM can do acquire and release fairly efficiently as well.
I am the author of liblfds.
The OP is correct in his description of this queue.
It is the single data structure in the library which is not lock-free.
This is described in the documentation for the queue;
http://www.liblfds.org/mediawiki/index.php?title=r7.1.1:Queue_%28bounded,_many_producer,_many_consumer%29#Lock-free_Specific_Behaviour
"It must be understood though that this is not actually a lock-free data structure."
This queue is an implementation of an idea from Dmitry Vyukov (1024cores.net) and I only realised it was not lock-free while I was making the test code work.
By then it was working, so I included it.
I do have some thought to remove it, since it is not lock-free.
Most of the time people use lock-free when they really mean lockless. Lockless means a data structure or algorithm that does not use locks, but offers no guarantee of forward progress. Also check this question. So the queue in liblfds is lockless, but as BeeOnRope mentioned, it is not lock-free.
A thread that calls POP before the next update in sequence is complete is NOT "effectively blocked" if the POP call returns FALSE immediately. The thread can go off and do something else. I'd say that this queue qualifies as lock-free.
However, I wouldn't say that it qualifies as a "queue" -- at least not the kind of queue that you could publish as a queue in a library or something -- because it doesn't guarantee a lot of the behaviors that you can normally expect from a queue. In particular, you can PUSH an element and then try and FAIL to POP it, because some other thread is busy pushing an earlier item.
Even so, this queue could still be useful in some lock-free solutions for various problems.
For many applications, however, I would worry about the possibility for consumer threads to be starved while a producer thread is pre-empted. Maybe liblfds does something about that?
"Lock-free" is a property of the algorithm that implements some functionality. The property does not depend on how that functionality is used by a program.
When talking about the mcmp_queue::enqueue function, which returns FALSE if the underlying queue is full, its implementation (given in the question post) is lock-free.
However, implementing mcmp_queue::dequeue in a lock-free manner would be difficult. E.g., this pattern is obviously not lock-free, as it spins on a variable changed by another thread:
while(s.sequence_number.load(std::memory_order_acquire) == read_index);
data = s.user_data;
...
return data;
I did formal verification on this same code using Spin a couple years ago for a course in concurrency testing and it is definitely not lock-free.
Just because there is no explicit "locking", doesn't mean it's lock-free. When it comes to reasoning about progress conditions, think of it from an individual thread's perspective:
Blocking/locking: if another thread gets descheduled and this can block my progress, then it is blocking.
Lock-free/non-blocking: if I am able to eventually make progress in the absence of contention from other threads, then it is at most lock-free.
If no other thread can block my progress indefinitely, then it is wait-free.

Atomic operations on critical section data

Some part of shared memory that is modified in a critical section consists of a considerable amount of data; however, only a small portion of it is changed in a single pass (e.g. a bitmap of free memory pages).
How can I make sure that when the program is interrupted or killed, the data remains in a consistent state? Are there any suggestions other than having two copies
(like the copy & swap in the example below, or having some kind of rollback segment)?
struct some_data{
    int a;
    int t[100000]; // large amount of data in total, but only a few bytes change in a single pass (e.g. free entries bitmap/tree)
};
short int active=0;
some_data section_data[2];
//---------------------------------------------------
//semaphore down
int inactive=(active+1) % 2;
section_data[inactive]=section_data[active];
// now, make changes to the inactive copy (section_data[inactive])
active=inactive;
//semaphore up
You are looking for transactional consistency: a transaction occurs in whole, or not at all.
A common pattern is a journal, where you store the changes you intend to make while you apply them. Anyone accessing the shared memory and detecting the crashed process (such as noticing that they somehow acquired the semaphore with a partially present journal) takes responsibility for replaying the journal before continuing.
You still have one race case, the actual writing of a bit signalling to all processes that there is, in fact, a journal to consume. However, that is a small enough body of information that you can send it through whatever channel you please, such as another semaphore or clever use of fences.
It's best if the journal is sufficiently independent of the state of the memory such that the repairing process can just start at the start of the journal and replay the whole thing. If you have to identify which entry in the journal is "next," then you need a whole lot more synchronization.
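A journal that can be replayed from the start might look like the sketch below. It is a single-process illustration under stated assumptions: ordinary Java arrays stand in for the shared memory region, the class `JournaledBitmap` and its methods are hypothetical, and a real shared-memory version would also need fences or a semaphore around the "journal present" marker, as discussed above. Entries store absolute target values, so replaying them twice is harmless.

```java
/** Redo journal for a bitmap; arrays stand in for the shared memory region. */
final class JournaledBitmap {
    final boolean[] bits;
    private final int[] journalIndex;
    private final boolean[] journalValue;
    private int journalCount;              // 0 = no journal pending; written last

    JournaledBitmap(int size, int maxChangesPerPass) {
        bits = new boolean[size];
        journalIndex = new int[maxChangesPerPass];
        journalValue = new boolean[maxChangesPerPass];
    }

    /** Record the intended changes before touching the bitmap itself. */
    void logIntent(int[] indexes, boolean[] values) {
        for (int i = 0; i < indexes.length; i++) {
            journalIndex[i] = indexes[i];
            journalValue[i] = values[i];   // absolute value, so replay is idempotent
        }
        journalCount = indexes.length;     // the "there is a journal" marker
    }

    /** Apply every entry from the start of the journal, then clear the marker. */
    void replay() {
        for (int i = 0; i < journalCount; i++) {
            bits[journalIndex[i]] = journalValue[i];
        }
        journalCount = 0;
    }

    /** Whoever acquires the semaphore checks this first and replays if needed. */
    boolean journalPending() {
        return journalCount != 0;
    }
}
```

A writer that crashes halfway through applying its changes leaves `journalPending()` true; the next process to enter the critical section calls `replay()` and ends up with the fully applied state, without needing to know how far the crashed writer got.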

Keep Track of Reference to Data ( How Many / Who ) in Multithreading

I came across a problem in multithreading. The threading model is 1 producer, N consumers.
The producer produces the data (character data, around 200 bytes each) and puts it in a fixed-size cache (i.e. 2 million entries). The data is not relevant to all the threads. The producer applies a (configured) filter and determines which threads qualify for the produced data.
The producer pushes a pointer to the data into the queue of each qualifying thread (only the pointer, to avoid copying the data). The threads dequeue it and send it over TCP/IP to their clients.
Problem: Because only a pointer to the data is given to multiple threads, when the cache becomes full and the producer wants to delete the first (oldest) item, some thread may still be referring to that data.
Feasible way: Use atomic granularity. When the producer determines the number of qualifying threads, it can update the counter and the list of thread ids.
class InUseCounter
{
    int m_count;
    set<thread_t> m_in_use_threads;
    Mutex m_mutex;
    Condition m_cond;
public:
    // This constructor is used by the producer
    InUseCounter(int count, set<thread_t> tlist)
    {
        m_count = count;
        m_in_use_threads = tlist;
    }
    // This function is called by each thread
    // when it is done with the data,
    // informing that it no longer uses the reference to the data.
    void decrement(thread_t tid)
    {
        Guard<Mutex> lock(m_mutex);
        --m_count;
        m_in_use_threads.erase(tid);
    }
    int get_count() const { return m_count; }
};
Master cache:
map<seqnum, Data>
|
v
pair<CharData, InUseCounter>
When the producer removes an element it checks the counter; if it is more than 0, it sends an action to release the reference to the threads in the m_in_use_threads set.
Question
If there are 2 million records in the master cache, there will be an equal number of InUseCounter objects, and so of mutex variables. Is it advisable to have 2 million mutex variables in one single process?
Having one big data structure to maintain the InUseCounter objects will cause more locking time to find and decrement.
What would be the best alternative to my approach to find out how many references exist, and who holds them, with very little locking time?
Thanks in advance for your advice.
2 million mutexes is a bit much. Even if they are lightweight locks, they still take up some overhead.
Putting the InUseCounter in a single structure would end up involving contention between threads when they release a record; if the threads do not execute in lockstep, this might be negligible. If they are frequently releasing records and the contention rate goes up, this is obviously a performance sink.
You can improve performance by having one thread responsible for maintaining the record reference counts (the producer thread) and having the other threads send back record release events over a separate queue, in effect, turning the producer into a record release event consumer. When you need to flush an entry, process all the release queues first, then run your release logic. You will have some latency to deal with, as you are now queueing up release events instead of attempting to process them immediately, but the performance should be much better.
Incidentally, this is similar to how the Disruptor framework works. It's a high performance Java(!) concurrency framework for high frequency trading. Yes, I did say high performance Java and concurrency in the same sentence. There is a lot of valuable insight into high performance concurrency design and implementation.
Since you already have a Producer->Consumer queue, one very simple system consists in having a "feedback" queue (Consumer->Producer).
After having consumed an item, the consumer feeds the pointer back to the Producer so that the Producer can remove the item and update the "free-list" of the cache.
This way, only the Producer ever touches the cache innards, and no synchronization is necessary there: only the queues need be synchronized.
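A minimal sketch of that feedback arrangement, assuming records are identified by sequence number and that the feedback queue is a plain mutex-protected deque. `ReleaseQueue` and `ProducerCache` are hypothetical names, and the per-record count is the number of qualifying consumers:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Consumer -> Producer feedback queue. This is the only synchronized object.
class ReleaseQueue {
    std::mutex m_;
    std::deque<std::uint64_t> q_;
public:
    void push(std::uint64_t seq) {
        std::lock_guard<std::mutex> g(m_);
        q_.push_back(seq);
    }
    bool try_pop(std::uint64_t& seq) {
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return false;
        seq = q_.front();
        q_.pop_front();
        return true;
    }
};

// Touched by the producer thread only, so it needs no lock of its own.
class ProducerCache {
    struct Record { std::string data; int in_use; };
    std::unordered_map<std::uint64_t, Record> records_;
public:
    void add(std::uint64_t seq, std::string data, int consumers) {
        records_[seq] = Record{std::move(data), consumers};
    }
    // Apply every pending release event before deciding what can be evicted.
    void drain(ReleaseQueue& q) {
        std::uint64_t seq;
        while (q.try_pop(seq)) {
            auto it = records_.find(seq);
            if (it != records_.end()) --it->second.in_use;
        }
    }
    // Evicts the record only if no consumer still holds it.
    bool try_evict(std::uint64_t seq) {
        auto it = records_.find(seq);
        if (it == records_.end() || it->second.in_use > 0) return false;
        records_.erase(it);
        return true;
    }
    std::size_t size() const { return records_.size(); }
};
```

Each consumer calls `push(seq)` once it has finished sending a record. The producer calls `drain` and then `try_evict` when the cache is full. A single shared feedback queue is used here for brevity; one queue per consumer would reduce contention further.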
Yes, 2,000,000 mutexes is overkill.
One big structure will be locked for longer, but will require far fewer lock/unlock operations.
The best approach would be to use shared_ptr smart pointers: they seem to be tailor-made for this. You don't check the counter yourself, you just clean up your pointer. The reference count of a shared_ptr is thread-safe; the data it points to is not, but for 1 producer (writer) / N consumers (readers), this should not be an issue.
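A minimal sketch of the shared_ptr approach, assuming the cache is keyed by sequence number and the payload is an immutable string. `MasterCache` and `DataPtr` are hypothetical names. The producer simply erases its own pointer on eviction, and the data is freed when the last consumer drops its copy:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Consumers only read the payload, so it is shared as const.
using DataPtr = std::shared_ptr<const std::string>;

class MasterCache {
    std::map<std::uint64_t, DataPtr> records_;  // touched by the producer only
public:
    // Stores the record and returns a pointer that can be copied
    // into the queue of every qualifying consumer.
    DataPtr add(std::uint64_t seq, std::string data) {
        DataPtr p = std::make_shared<const std::string>(std::move(data));
        records_[seq] = p;
        return p;
    }
    // Drops the cache's own reference to the oldest record. The payload
    // itself stays alive until every consumer has released its copy.
    void evict_oldest() {
        if (!records_.empty()) records_.erase(records_.begin());
    }
    std::size_t size() const { return records_.size(); }
};
```

Copying and destroying separate `DataPtr` objects from different threads is safe because the control block's count is updated atomically. Two threads modifying the same `DataPtr` object would still need synchronization, which is why the map stays producer-only here.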

Optimal strategy to make a C++ hash table, thread safe

(I am interested in design of implementation NOT a readymade construct that will do it all.)
Suppose we have a class HashTable (not hash-map implemented as a tree but hash-table)
and say there are eight threads.
Suppose read to write ratio is about 100:1 or even better 1000:1.
Case A) Only one thread is a writer and others including writer can read from HashTable(they may simply iterate over entire hash table)
Case B) All threads are identical and all could read/write.
Can someone suggest best strategy to make the class thread safe with following consideration
1. Top priority to least lock contention
2. Second priority to least number of locks
My understanding so far is thus:
One BIG reader-writer lock (semaphore).
Specialize the semaphore so that there could be eight instances of the writer resource for case B, where each writer resource locks one row (or range, for that matter).
(So I guess 1+8 mutexes.)
Please let me know if I am thinking along the right lines, and how we could improve on this solution.
With such high read/write ratios, you should consider a lock free solution, e.g. nbds.
EDIT:
In general, lock free algorithms work as follows:
1. Arrange your data structures such that for each function you intend to support there is a point at which you are able to, in one atomic operation, determine whether its results are valid (i.e. other threads have not mutated its inputs since they have been read) and commit to them; with no changes to state visible to other threads unless you commit. This will involve leveraging platform-specific functions such as Win32's atomic compare-and-swap or Cell's cache line reservation opcodes.
2. Each supported function becomes a loop that repeatedly reads the inputs and attempts to perform the work, until the commit succeeds.
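The read, compute, commit loop can be sketched with C++11's portable `std::atomic` in place of the platform-specific functions. The example is an atomic "store the maximum" operation; `store_max` is a hypothetical helper name, not a standard function:

```cpp
#include <atomic>
#include <cassert>

// Atomically raises `target` to at least `candidate`.
// Returns the value that was in `target` before this call took effect.
int store_max(std::atomic<int>& target, int candidate) {
    int seen = target.load();  // read the input
    // Commit point: the CAS succeeds only if nobody changed `target` since
    // we read it. On failure it reloads `seen` with the current value and
    // the loop re-evaluates the whole decision.
    while (seen < candidate && !target.compare_exchange_weak(seen, candidate)) {
        // retry with the freshly observed value
    }
    return seen;
}
```

Nothing is visible to other threads until the compare-and-swap succeeds, and a thread that loses the race simply repeats the work with the new input.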
In cases of very low contention, this is a performance win over locking algorithms since functions mostly succeed the first time through without incurring the overhead of acquiring a lock. As contention increases, the gains become more dubious.
Typically the amount of data it is possible to atomically manipulate is small - 32 or 64 bits is common - so for functions involving many reads and writes, the resulting algorithms become complex and potentially very difficult to reason about. For this reason, it is preferable to look for and adopt a mature, well-tested and well-understood third party lock free solution for your problem in preference to rolling your own.
Hashtable implementation details will depend on various aspects of the hash and table design. Do we expect to be able to grow the table? If so, we need a way to copy bulk data from the old table into the new safely. Do we expect hash collisions? If so, we need some way of walking colliding data. How do we make sure another thread doesn't delete a key/value pair between a lookup returning it and the caller making use of it? Some form of reference counting, perhaps? - but who owns the reference? - or simply copying the value on lookup? - but what if values are large?
Lock-free stacks are well understood and relatively straightforward to implement (to remove an item from the stack, get the current top, attempt to replace it with its next pointer until you succeed, return it; to add an item, get the current top and set it as the item's next pointer, until you succeed in writing a pointer to the item as the new top; on architectures with reserve/conditional write semantics, this is enough, on architectures only supporting CAS you need to append a nonce or version number to the atomically manipulated data to avoid the ABA problem). They are one way of keeping track of free space for keys/data in an atomic lock free manner, allowing you to reduce a key/value pair - the data actually stored in a hashtable entry - to a pointer/offset or two, a small enough amount to be manipulated using your architecture's atomic instructions. There are others.
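A minimal sketch of such a stack used as a free list of slot indices, assuming a CAS-only architecture. The head packs a 32-bit slot index together with a 32-bit version counter into one 64-bit word, so a stale head is rejected even when the same index reappears (the ABA case). `IndexStack` is a hypothetical name, and the sketch assumes 64-bit atomics are lock-free on the target:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class IndexStack {
    static constexpr std::uint32_t kNil = 0xFFFFFFFFu;  // "empty" marker
    std::vector<std::atomic<std::uint32_t>> next_;      // next_[i]: slot below i
    std::atomic<std::uint64_t> head_;                   // high 32: version, low 32: top index

    static std::uint64_t pack(std::uint32_t idx, std::uint32_t ver) {
        return (static_cast<std::uint64_t>(ver) << 32) | idx;
    }
    static std::uint32_t index_of(std::uint64_t h) { return static_cast<std::uint32_t>(h); }
    static std::uint32_t version_of(std::uint64_t h) { return static_cast<std::uint32_t>(h >> 32); }

public:
    explicit IndexStack(std::size_t capacity) : next_(capacity), head_(pack(kNil, 0)) {}

    // Adds slot `idx` (must be < capacity and not already on the stack).
    void push(std::uint32_t idx) {
        std::uint64_t old = head_.load();
        do {
            next_[idx].store(index_of(old));  // link to the current top
        } while (!head_.compare_exchange_weak(old, pack(idx, version_of(old) + 1)));
    }

    // Removes the top slot into `idx`; returns false if the stack is empty.
    bool pop(std::uint32_t& idx) {
        std::uint64_t old = head_.load();
        for (;;) {
            idx = index_of(old);
            if (idx == kNil) return false;
            std::uint32_t below = next_[idx].load();
            // The version bump makes this fail if the head changed in between,
            // even if the same index has returned to the top.
            if (head_.compare_exchange_weak(old, pack(below, version_of(old) + 1)))
                return true;
        }
    }
};
```

Working with indices into a preallocated array rather than raw pointers keeps the atomically manipulated data down to one machine word, as the paragraph above suggests. The 32-bit version can in principle wrap around, so this narrows the ABA window rather than eliminating it outright.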
Reads then become a case of looking up the entry, checking the kvp against the requested key, doing whatever it takes to make sure the value will remain valid when we return it (taking a copy / increasing its reference count), checking the entry hasn't been modified since we began the read, returning the value if so, undoing any reference count changes and repeating the read if not.
Writes will depend on what we're doing about collisions; in the trivial case, they are simply a case of finding the correct empty slot and writing the new kvp.
The above is greatly simplified and insufficient to produce your own safe implementation, especially if you are not familiar with lock-free/wait-free techniques. Possible complications include the ABA problem, priority inversion, starvation of particular threads; I have not addressed hash collisions.
The nbds page links to an excellent presentation on a real world approach that allows growth / collisions. Others exist, a quick Google finds lots of papers.
Lock free and wait free algorithms are fascinating areas of research; I encourage the reader to Google around. That said, naive lock free implementations can easily look reasonable and behave correctly much of the time while in reality being subtly unsafe. While it is important to have a solid grasp on the principles, I strongly recommend using an existing, well-understood and proven implementation over rolling your own.
You may want to look at Java's ConcurrentHashMap implementation for one possible implementation.
The basic idea is NOT to lock for every read operation but only for writes. Since in your interview they specifically mentioned an extremely high read:write ratio it makes sense trying to stuff as much overhead as possible into writes.
The ConcurrentHashMap divides the hashtable into so called "Segments" that are themselves concurrently readable hashtables and keep every single segment in a consistent state to allow traversing without locking.
When reading you basically have the usual hashmap get() with the difference that you have to worry about reading stale values, so things like the value of the correct node, the first node of the segment table and next pointers have to be volatile (with c++'s non-existent memory model you probably can't do this portably; c++0x should help here, but haven't looked at it so far).
When putting a new element in there you get all the overhead, first of all having to lock the given segment. After locking it's basically a usual put() operation, but you have to guarantee atomic writes when updating the next pointer of a node (pointing to the newly created node whose next pointer has to be already correctly pointing to the old next node) or overwriting the value of a node.
When growing the segment, you have to rehash the existing nodes and put them into the new, larger table. The important part is to clone nodes for the new table so as not to influence the old table (by changing their next pointers too early) until the new table is complete and replaces the old one (they use some clever trick there that means they only have to clone about 1/6 of the nodes, which is nice, but I'm not really sure how they reach that number).
Note that garbage collection makes this a whole lot easier because you don't have to worry about the old nodes that weren't reused - as soon as all readers are finished they will automatically be GCed. That's solvable though, but I'm not sure what the best approach would be.
I hope the basic idea is somewhat clear - obviously there are several points that aren't trivially ported to c++, but it should give you a good idea.
No need to lock the whole table, just have a lock per bucket. That immediately gives parallelism. Inserting a new node into the table requires a lock on the bucket whose head node is about to be modified. New nodes are always added at the head of the bucket so that readers already iterating through the nodes do not have to worry about seeing new nodes.
Each node has a r/w lock; readers iterating get a read lock. Node modification requires a write lock.
An iteration that started without the bucket lock and leads to a node removal must attempt to take the bucket lock; if the attempt fails, it must release its locks and retry, to avoid deadlock, because the lock order is different.
Brief overview.
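A simplified sketch of the lock-per-bucket idea, assuming string keys, int values and a fixed bucket count. It uses one plain mutex per bucket for reads and writes alike and leaves out the per-node r/w locks and the try-lock-and-retry removal path described above. `BucketLockedTable` is a hypothetical name:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <string>
#include <utility>

class BucketLockedTable {
    static constexpr std::size_t kBuckets = 64;
    struct Bucket {
        std::mutex lock;                                // guards this bucket only
        std::list<std::pair<std::string, int>> nodes;   // collision chain
    };
    std::array<Bucket, kBuckets> buckets_;

    Bucket& bucket_for(const std::string& key) {
        return buckets_[std::hash<std::string>{}(key) % kBuckets];
    }

public:
    void put(const std::string& key, int value) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> g(b.lock);
        for (auto& n : b.nodes)
            if (n.first == key) { n.second = value; return; }
        b.nodes.emplace_front(key, value);  // new nodes go to the head of the chain
    }

    bool get(const std::string& key, int& out) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> g(b.lock);
        for (const auto& n : b.nodes)
            if (n.first == key) { out = n.second; return true; }
        return false;
    }

    bool erase(const std::string& key) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> g(b.lock);
        for (auto it = b.nodes.begin(); it != b.nodes.end(); ++it)
            if (it->first == key) { b.nodes.erase(it); return true; }
        return false;
    }
};
```

Threads working on keys that hash to different buckets never contend. With a read-heavy load, swapping the per-bucket mutex for a reader-writer lock would be the next step toward the design in the answer.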
You can try atomic_hashtable for C (https://github.com/Taymindis/atomic_hashtable). It supports read, write, and delete without locking while multiple threads access the data. Simple and stable.
The API documentation is given in the README.

Lock Free Queue -- Single Producer, Multiple Consumers

I am looking for a method to implement lock-free queue data structure that supports single producer, and multiple consumers. I have looked at the classic method by Maged Michael and Michael Scott (1996) but their version uses linked lists. I would like an implementation that makes use of bounded circular buffer. Something that uses atomic variables?
On a side note, I am not sure why these classic methods are designed for linked lists that require a lot of dynamic memory management. In a multi-threaded program, all memory management routines are serialized. Aren't we defeating the benefits of lock-free methods by using them in conjunction with dynamic data structures?
I am trying to code this in C/C++ using the pthread library on an Intel 64-bit architecture.
Thank you,
Shirish
The use of a circular buffer makes a lock necessary, since blocking is needed to prevent the head from going past the tail. But otherwise the head and tail pointers can easily be updated atomically. Or in some cases the buffer can be so large that overwriting is not an issue. (in real life you will see this in automated trading systems, with circular buffers sized to hold X minutes of market data. If you are X minutes behind, you have wayyyy worse problems than overwriting your buffer).
When I implemented the MS queue in C++, I built a lock free allocator using a stack, which is very easy to implement. If I have MSQueue<T> then at compile time I know sizeof(MSQueue<T>::node). Then I make a stack of N buffers of the required size. The N can grow, i.e. if pop() returns null, it is easy to go ask the heap for more blocks, and these are pushed onto the stack. Outside of the possibly blocking call for more memory, this is a lock free operation.
Note that the T cannot have a non-trivial dtor. I worked on a version that did allow for non-trivial dtors, that actually worked. But I found that it was easier just to make the T a pointer to the T that I wanted, where the producer released ownership, and the consumer acquired ownership. This of course requires that the T itself is allocated using lockfree methods, but the same allocator I made with the stack works here as well.
In any case the point of lock-free programming is not that the data structures themselves are faster. The points are these:
Lock free makes me independent of the scheduler. Lock-based programming depends on the scheduler to make sure that the holders of a lock are running so that they can release the lock. This is what causes "priority inversion". On Linux there are some lock attributes to make sure this happens.
If I am independent of the scheduler, the OS has a far easier time managing timeslices, and I get far less context switching.
It is easier to write correct multithreaded programs using lock free methods since I don't have to worry about deadlock, livelock, scheduling, synchronization, etc. This is especially true with shared memory implementations, where a process could die while holding a lock in shared memory, and there is no way to release the lock.
Lock free methods are far easier to scale. In fact, I have implemented lock free methods using messaging over a network. Distributed locks like this are a nightmare.
That said, there are many cases where lock-based methods are preferable and/or required:
When updating things that are expensive or impossible to copy. Most lock free methods use some sort of versioning, i.e. make a copy of the object, update it, and check if the shared version is still the same as when you copied it; if so, make your updated version the current version. Else copy it again, apply the update, and check again. Keep doing this until it works. This is fine when the objects are small, but if they are large, or contain file handles, etc., then it is not recommended.
Most types are impossible to access in a lock free way, e.g. any STL container. These have invariants that require non-atomic access, for example assert(vector.size()==vector.end()-vector.begin()). So if you are updating/reading a vector that is shared, you have to lock it.
This is an old question, but no one has provided an accepted solution. So I offer this info for others who may be searching.
This website: http://www.1024cores.net
Provides some really useful lockfree/waitfree data structures with thorough explanations.
What you are seeking is a lock-free solution to the reader/writer problem.
See: http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem
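For the bounded circular buffer the question asks about, one known approach gives every slot its own sequence number, in the style of the bounded MPMC queue that Dmitry Vyukov published on that site. The sketch below is written from memory of that technique rather than copied from it, fixes the element type to int, and assumes the capacity is a power of two. `BoundedQueue` is a hypothetical name. It works with one producer and many consumers, although it is not strictly lock-free in the formal sense, since a thread stalled between claiming a slot and publishing it delays the others:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class BoundedQueue {
    struct Cell {
        std::atomic<std::size_t> seq;  // tells whether the slot is free or filled
        int value;
    };
    std::vector<Cell> cells_;
    const std::size_t mask_;
    std::atomic<std::size_t> enqueue_pos_{0};
    std::atomic<std::size_t> dequeue_pos_{0};

public:
    // `capacity` must be a power of two (assumption of this sketch).
    explicit BoundedQueue(std::size_t capacity) : cells_(capacity), mask_(capacity - 1) {
        for (std::size_t i = 0; i < capacity; ++i)
            cells_[i].seq.store(i, std::memory_order_relaxed);
    }

    bool try_push(int v) {
        std::size_t pos = enqueue_pos_.load(std::memory_order_relaxed);
        for (;;) {
            Cell& c = cells_[pos & mask_];
            std::size_t seq = c.seq.load(std::memory_order_acquire);
            std::intptr_t diff = static_cast<std::intptr_t>(seq) - static_cast<std::intptr_t>(pos);
            if (diff == 0) {  // slot is free for this position: try to claim it
                if (enqueue_pos_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
                    c.value = v;
                    c.seq.store(pos + 1, std::memory_order_release);  // publish to consumers
                    return true;
                }
            } else if (diff < 0) {
                return false;  // queue is full
            } else {
                pos = enqueue_pos_.load(std::memory_order_relaxed);  // lost a race, reload
            }
        }
    }

    bool try_pop(int& out) {
        std::size_t pos = dequeue_pos_.load(std::memory_order_relaxed);
        for (;;) {
            Cell& c = cells_[pos & mask_];
            std::size_t seq = c.seq.load(std::memory_order_acquire);
            std::intptr_t diff = static_cast<std::intptr_t>(seq) - static_cast<std::intptr_t>(pos + 1);
            if (diff == 0) {  // slot holds the item for this position: try to claim it
                if (dequeue_pos_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
                    out = c.value;
                    c.seq.store(pos + mask_ + 1, std::memory_order_release);  // free for next lap
                    return true;
                }
            } else if (diff < 0) {
                return false;  // queue is empty
            } else {
                pos = dequeue_pos_.load(std::memory_order_relaxed);  // lost a race, reload
            }
        }
    }
};
```

The per-slot sequence number is what stops the head from running past the tail without a lock: a producer sees a stale sequence on a slot that has not been consumed yet and reports "full" instead of overwriting it.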
For a traditional one-block circular buffer I think this simply cannot be done safely with atomic operations. You need to do so much in one read. Suppose you have a structure that has this:
uint8_t* buf;
unsigned int size; // Actual max. buffer size
unsigned int length; // Actual stored data length (suppose in write prohibited from being > size)
unsigned int offset; // Start of current stored data
On a read you need to do the following (this is how I implemented it anyway, you can swap some steps like I'll discuss afterwards):
1. Check if the read length does not surpass stored length
2. Check if the offset+read length do not surpass buffer boundaries
3. Read data out
4. Increase offset, decrease length
What must you do synchronised (so atomically) to make this work? Combine steps 1 and 4 into one atomic step, or to clarify, do this synchronised:
check read_length, this can be something like read_length=min(read_length,length);
decrease length with read_length: length-=read_length
get a local copy of offset: unsigned int local_offset = offset
increase offset with read_length: offset+=read_length
Afterwards you can just do a memcpy (or whatever) starting from your local_offset, check if your read goes over circular buffer size (split in 2 memcpy's), ... . This is 'quite' threadsafe, your write method could still write over the memory you're reading, so make sure your buffer is really large enough to minimize that possibility.
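The synchronised bookkeeping described above can be sketched with a mutex. This version also keeps the copy itself under the lock, which is the fully safe variant; moving the memcpy out of the locked region, as the answer describes, shortens the critical section at the price of a writer possibly overwriting data that is still being read. The member names follow the struct above where they exist, the class and method names are hypothetical, and the buffer size is assumed to be non-zero:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

class CircularBuffer {
    std::vector<std::uint8_t> buf_;  // buf_.size() is the maximum buffer size
    std::size_t length_ = 0;         // stored data length
    std::size_t offset_ = 0;         // start of the stored data
    std::mutex m_;

public:
    explicit CircularBuffer(std::size_t size) : buf_(size) {}

    // Writes up to n bytes; returns how many fitted.
    std::size_t write(const std::uint8_t* src, std::size_t n) {
        std::lock_guard<std::mutex> g(m_);
        n = std::min(n, buf_.size() - length_);
        std::size_t woffset = (offset_ + length_) % buf_.size();
        std::size_t first = std::min(n, buf_.size() - woffset);
        std::memcpy(buf_.data() + woffset, src, first);    // up to the end of the array
        std::memcpy(buf_.data(), src + first, n - first);  // wrapped remainder
        length_ += n;
        return n;
    }

    // Reads up to n bytes; returns how many were available.
    std::size_t read(std::uint8_t* dst, std::size_t n) {
        std::lock_guard<std::mutex> g(m_);
        n = std::min(n, length_);           // clamp to the stored length
        length_ -= n;                       // decrease length
        std::size_t local_offset = offset_; // local copy of offset
        offset_ = (offset_ + n) % buf_.size();  // increase offset
        std::size_t first = std::min(n, buf_.size() - local_offset);
        std::memcpy(dst, buf_.data() + local_offset, first);  // may need two copies
        std::memcpy(dst + first, buf_.data(), n - first);     // when the read wraps
        return n;
    }
};
```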
Now, while I can imagine you can combine 3 and 4 (I guess that's what they do in the linked-list case) or even 1 and 2 in atomic operations, I cannot see you do this whole deal in one atomic operation :).
You can however try to drop 'length' checking if your consumers are very smart and will always know what to read. You'd also need a new woffset variable then, because the old method of (offset+length)%size to determine the write offset wouldn't work anymore. Note this is close to the case of a linked list, where you actually always read one element (= fixed, known size) from the list. Also here, if you make it a circular linked list, you can read too much or write to a position you're reading at that moment!
Finally, my advice: just go with locks. I use a CircularBuffer class (completely safe for reading & writing) for a realtime 720p60 video streamer and I have had no speed issues at all from locking.
This is an old question but no one has provided an answer that precisely answers it. Given that it still comes up high in search results for (nearly) the same question, there should be an answer, given that one exists.
There may be more than one solution, but here is one that has an implementation:
https://github.com/tudinfse/FFQ
The conference paper referenced in the readme details the algorithm.