I looked at other SO questions/answers about this, but none of them resolved my issue.
My threads have default scheduling policy SCHED_OTHER and it has the priority as 0, so they don't have priority level. Or in other words, you can't change the priority using (sched_param) param.sched_priority.
So, this means using system call pthread_setschedparam is ruled out.
pthread_setschedprio(std::thread::native_handle(), -1) - this doesn't affect the thread's priority. I verified using getpriority (PRIO_PROCESS, tid). Can we use pthread_setschedprio() for default schedulers?
After carefully reading this page, I understood that I need to change the thread's dynamic priority by tweaking the nice value which can be achieved by either of these:
nice(19);
I tried this and it doesn't have any effect. I believe it changes the nice value for the whole process.
I ruled this out too.
setpriority(PRIO_PROCESS, id, 19)
It always returns -1 and errno is ESRCH (No such process). Why is this not working?
According to this, it can be used to change the priority of the thread.
syscall(SYS_sched_setattr, id, &attr, flags)
struct sched_attr attr;
unsigned int flags = 0;
attr.size = sizeof(attr);
attr.sched_nice = 6;
attr.sched_policy = SCHED_OTHER;
This also didn't work, no luck.
I want to lower the priority of these threads (they consume more CPU) within a process without changing the policy. This would ensure that other threads with the same priority as these get CPU time.
Preliminary: context suggests that you're on Linux, so portions of this answer are Linux specific.
My threads have default scheduling policy SCHED_OTHER and it has the priority as 0, so they don't have priority level. Or in other words, you can't change the priority using (sched_param) param.sched_priority.
That seems to mischaracterize the situation a bit. Per sched(7),
For threads scheduled under one of the normal scheduling policies (SCHED_OTHER, SCHED_IDLE, SCHED_BATCH), sched_priority is not used in scheduling decisions (it must be specified as 0).
You appear to be interpreting that, or some alternative formulation of it, to be speaking about the behavior of certain system interfaces, such as pthread_setschedprio(). Perhaps that comes from "it must be specified as 0", but the part to focus on is rather "not used in scheduling decisions". Threads scheduled according to the SCHED_OTHER policy effectively do not have a priority property at all. Their scheduling does not take any such property into account.
So, this means using system call pthread_setschedparam is ruled out.
Well, no, not exactly. With pthread_setschedparam() you can set both the policy and the priority. If you set the policy to one that considers thread priorities, then you can also meaningfully set the priority. Not that I'm recommending that.
pthread_setschedprio(std::thread::native_handle(), -1) - this doesn't affect the thread's priority.
It shouldn't if the thread's scheduling policy is SCHED_OTHER. Again, threads scheduled according to that policy do not actually have priority, and according to the docs, the (unused) value of their priority properties should be 0.
I verified using getpriority (PRIO_PROCESS, tid).
You should verify by observing that pthread_setschedprio() fails with EINVAL.
Can we use pthread_setschedprio() for default schedulers?
If by "default schedulers" you mean SCHED_OTHER then no. See above. If you include the realtime schedulers SCHED_FIFO and SCHED_RR then yes, for threads scheduled via those schedulers.
After carefully reading this page, I understood that I need to change the thread's dynamic priority by tweaking the nice value
Traditionally and per POSIX, individual threads do not have "nice" values. These are properties of processes. On systems where that is the case, you cannot use niceness to produce different scheduling behavior for different threads of the same process.
On Linux, however, individual threads do have their own niceness. I don't particularly recommend relying on that, but if you
do not require portability beyond Linux,
must provide preferential scheduling of some threads of one process over other threads of that process, and
must use the SCHED_OTHER scheduling policy
then it's probably your best option.
setpriority(PRIO_PROCESS, id, 19)
It always returns -1 and errno is ESRCH (No such process). Why is this not working?
The system is telling you that id is not a valid process ID. I speculate that it is a pthread_t obtained via std::thread::native_handle(), but note that although on Linux, setpriority(PRIO_PROCESS, ...) can set per-thread niceness, the pid_t identifier it expects is not drawn from the same space as pthreads thread identifiers. See gettid(2).
syscall(SYS_sched_setattr, id, &attr, flags)
If you're using the same id that elicits an ESRCH from setpriority() then I would not expect the sched_setattr syscall to like it any better just because you invoke that directly.
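To make the distinction between a pthread_t and a kernel thread ID concrete, here is a minimal Linux-only sketch. The helper names (current_tid, set_current_thread_nice, get_current_thread_nice) are made up for illustration. It relies on the Linux-specific behaviour described above, where setpriority(PRIO_PROCESS, ...) given a thread ID affects only that thread.

```cpp
#include <cerrno>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

// Kernel thread ID of the calling thread (Linux-specific). This is the kind
// of ID setpriority() expects, not the pthread_t from native_handle().
inline pid_t current_tid() {
    return static_cast<pid_t>(syscall(SYS_gettid));
}

// Sets the nice value of the calling thread only. Returns true on success.
// Per-thread niceness is Linux behaviour; POSIX defines it per process.
inline bool set_current_thread_nice(int nice_value) {
    return setpriority(PRIO_PROCESS, static_cast<id_t>(current_tid()),
                       nice_value) == 0;
}

// Reads back the nice value of the calling thread. getpriority() can
// legitimately return -1, so errno is cleared first and checked afterwards.
inline bool get_current_thread_nice(int& out) {
    errno = 0;
    const int value = getpriority(PRIO_PROCESS, static_cast<id_t>(current_tid()));
    if (value == -1 && errno != 0) {
        return false;
    }
    out = value;
    return true;
}
```

A worker thread would call set_current_thread_nice(19) at the top of its thread function. Raising the nice value (lowering priority) normally needs no privileges, while lowering it again usually does.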
I'm processing items in multiple threads, and the producers may output them into buffers out of order. Some later pipeline stages are not memoryless, and I need to put the partially processed items in order, so I have a thread gathering them from buffers output by previous stage workers and putting them into a standard heap-based priority queue, pulling from the top of the heap while the item counter is the successor to the last item that was pulled.
The items are stamped with a 32-bit unsigned counter by the hardware that generates them. There are several thousand items per second, and after a few days the counter wraps around. How do I handle this without switching to 64-bit counters? The program needs to be able to run indefinitely.
[Edit]
One idea I had is that, since the heap is limited in size to a few million items, I can modify the heap comparator to check the difference between the counters being compared and set a threshold of say half the maximum value of the unsigned, which if exceeded would be taken to assume a wraparound has occurred. The downside is the overhead of an extra conditional per each item checked in heap operations, and I don't know if there's some way to reduce it to a combination of subtraction/cast/etc. with just a single comparison.
How about using a second queue? The insert operation switches queues on wraparound, and popping switches when the current queue is empty. Alternatively, just use a flag for the active queue.
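The comparator idea from the question can also be reduced to one subtraction, one cast and one comparison, a technique usually called serial number arithmetic. The sketch below assumes that any two counters alive in the heap at the same time are less than 2^31 apart, which holds for a heap of a few million items. The names seq_before, SeqAfter and ReorderHeap are illustrative only.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// True when counter a comes before counter b, assuming the two are less
// than 2^31 apart. The unsigned subtraction wraps modulo 2^32, and the cast
// reinterprets the difference as signed (guaranteed since C++20, and what
// mainstream two's-complement compilers did before that).
inline bool seq_before(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::int32_t>(a - b) < 0;
}

// std::priority_queue keeps the "largest" element on top, so the comparator
// is reversed to put the earliest counter there.
struct SeqAfter {
    bool operator()(std::uint32_t a, std::uint32_t b) const {
        return seq_before(b, a);
    }
};

using ReorderHeap =
    std::priority_queue<std::uint32_t, std::vector<std::uint32_t>, SeqAfter>;
```

With this comparator, 0xFFFFFFFF sorts ahead of 0, so a heap holding items from both sides of the wraparound still pops them in stamping order. The ordering is only consistent inside the assumed window. It is not a valid total order over all 32-bit values.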
I am trying to understand the disruptor pattern. I have watched the InfoQ video and tried to read their paper. I understand there is a ring buffer involved, and that it is initialized as an extremely large array to take advantage of cache locality and to eliminate allocation of new memory.
It sounds like there are one or more atomic integers which keep track of positions. Each 'event' seems to get a unique id, and its position in the ring is found by taking that id modulo the size of the ring, etc.
Unfortunately, I don't have an intuitive sense of how it works. I have done many trading applications and studied the actor model, looked at SEDA, etc.
In their presentation they mentioned that this pattern is basically how routers work; however I haven't found any good descriptions of how routers work either.
Are there some good pointers to a better explanation?
The Google Code project does reference a technical paper on the implementation of the ring buffer, however it is a bit dry, academic and tough going for someone wanting to learn how it works. However there are some blog posts that have started to explain the internals in a more readable way. There is an explanation of ring buffer that is the core of the disruptor pattern, a description of the consumer barriers (the part related to reading from the disruptor) and some information on handling multiple producers available.
The simplest description of the Disruptor is: It is a way of sending messages between threads in the most efficient manner possible. It can be used as an alternative to a queue, but it also shares a number of features with SEDA and Actors.
Compared to Queues:
The Disruptor provides the ability to pass a message on to another thread, waking it up if required (similar to a BlockingQueue). However, there are 3 distinct differences.
The user of the Disruptor defines how messages are stored by extending Entry class and providing a factory to do the preallocation. This allows for either memory reuse (copying) or the Entry could contain a reference to another object.
Putting messages into the Disruptor is a 2-phase process, first a slot is claimed in the ring buffer, which provides the user with the Entry that can be filled with the appropriate data. Then the entry must be committed, this 2-phase approach is necessary to allow for the flexible use of memory mentioned above. It is the commit that makes the message visible to the consumer threads.
It is the responsibility of the consumer to keep track of the messages that have been consumed from the ring buffer. Moving this responsibility away from the ring buffer itself helped reduce the amount of write contention as each thread maintains its own counter.
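The claim/commit split described above can be illustrated with a much simplified single-producer, single-consumer ring. This is a sketch for intuition only, not the Disruptor's implementation: the class name SpscRing and its methods are invented here, and the consumer's counter is kept inside the ring object for brevity.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Single-producer, single-consumer ring of preallocated entries.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");

public:
    // Phase 1 (producer): claim the next slot, or nullptr if the ring is full.
    T* claim() {
        const std::uint64_t head = published_.load(std::memory_order_relaxed);
        if (head - consumed_.load(std::memory_order_acquire) == N) {
            return nullptr;
        }
        return &slots_[head & (N - 1)];
    }

    // Phase 2 (producer): commit the claimed slot, making it visible.
    void commit() {
        published_.store(published_.load(std::memory_order_relaxed) + 1,
                         std::memory_order_release);
    }

    // Consumer: oldest committed entry not yet released, or nullptr if none.
    T* peek() {
        const std::uint64_t tail = consumed_.load(std::memory_order_relaxed);
        if (tail == published_.load(std::memory_order_acquire)) {
            return nullptr;
        }
        return &slots_[tail & (N - 1)];
    }

    // Consumer: hand the slot back to the producer for reuse.
    void release() {
        consumed_.store(consumed_.load(std::memory_order_relaxed) + 1,
                        std::memory_order_release);
    }

private:
    std::array<T, N> slots_{};
    std::atomic<std::uint64_t> published_{0};  // written only by the producer
    std::atomic<std::uint64_t> consumed_{0};   // written only by the consumer
};
```

A producer calls claim(), fills the returned entry in place, then calls commit(). Until commit() runs, peek() on the consumer side does not see the entry. Each counter has exactly one writer, which is the property that keeps write contention low.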
Compared to Actors
The Actor model is closer to the Disruptor than most other programming models, especially if you use the BatchConsumer/BatchHandler classes that are provided. These classes hide all of the complexities of maintaining the consumed sequence numbers and provide a set of simple callbacks when important events occur. However, there are a couple of subtle differences.
The Disruptor uses a 1 thread - 1 consumer model, whereas Actors use an N:M model, i.e. you can have as many actors as you like and they will be distributed across a fixed number of threads (generally 1 per core).
The BatchHandler interface provides an additional (and very important) callback onEndOfBatch(). This allows for slow consumers, e.g. those doing I/O to batch events together to improve throughput. It is possible to do batching in other Actor frameworks, however as nearly all other frameworks don't provide a callback at the end of the batch you need to use a timeout to determine the end of the batch, resulting in poor latency.
Compared to SEDA
LMAX built the Disruptor pattern to replace a SEDA based approach.
The main improvement that it provided over SEDA was the ability to do work in parallel. To do this the Disruptor supports multi-casting the same messages (in the same order) to multiple consumers. This avoids the need for fork stages in the pipeline.
We also allow consumers to wait on the results of other consumers without having to put another queuing stage between them. A consumer can simply watch the sequence number of a consumer that it is dependent on. This avoids the need for join stages in the pipeline.
Compared to Memory Barriers
Another way to think about it is as a structured, ordered memory barrier, where the producer barrier forms the write barrier and the consumer barrier is the read barrier.
First we'd like to understand the programming model it offers.
There are one or more writers. There are one or more readers. There is a line of entries, totally ordered from old to new (pictured as left to right). Writers can add new entries on the right end. Every reader reads entries sequentially from left to right. Readers can't read past writers, obviously.
There is no concept of entry deletion. I use "reader" instead of "consumer" to avoid the image of entries being consumed. However we understand that entries on the left of the last reader become useless.
Generally readers can read concurrently and independently. However, we can declare dependencies among readers. Reader dependencies can form an arbitrary acyclic graph. If reader B depends on reader A, reader B can't read past reader A.
Reader dependency arises because reader A can annotate an entry, and reader B depends on that annotation. For example, A does some calculation on an entry, and stores the result in field a in the entry. A then moves on, and now B can read the entry, and the value of a that A stored. If reader C does not depend on A, C should not attempt to read a.
This is indeed an interesting programming model. Regardless of the performance, the model alone can benefit lots of applications.
Of course, LMAX's main goal is performance. It uses a pre-allocated ring of entries. The ring is large enough, but it's bounded so that the system will not be loaded beyond design capacity. If the ring is full, writer(s) will wait until the slowest readers advance and make room.
Entry objects are pre-allocated and live forever, to reduce garbage collection cost. We don't insert new entry objects or delete old entry objects; instead, a writer asks for a pre-existing entry, populates its fields, and notifies readers. This apparent 2-phase action is really simply an atomic action:
setNewEntry(EntryPopulator);
interface EntryPopulator{ void populate(Entry existingEntry); }
Pre-allocating entries also means adjacent entries are (very likely) located in adjacent memory cells, and because readers read entries sequentially, this is important for making good use of CPU caches.
A lot of effort also goes into avoiding locks, CAS, and even memory barriers (e.g. using a non-volatile sequence variable if there's only one writer).
For developers of readers: Different annotating readers should write to different fields, to avoid write contention. (Actually they should write to different cache lines.) An annotating reader should not touch anything that other non-dependent readers may read. This is why I say these readers annotate entries, instead of modify entries.
Martin Fowler has written an article about LMAX and the disruptor pattern, The LMAX Architecture, which may clarify it further.
I actually took the time to study the actual source, out of sheer curiosity, and the idea behind it is quite simple. The most recent version at the time of writing this post is 3.2.1.
There is a buffer storing pre-allocated events that will hold the data for consumers to read.
The buffer is backed by an array of flags (an integer array) of the same length that describes the availability of the buffer slots (see further for details). The array is accessed like a java.util.concurrent.atomic.AtomicIntegerArray, so for the purpose of this explanation you may as well assume it to be one.
There can be any number of producers. When a producer wants to write to the buffer, a long number is generated (as if by calling AtomicLong#getAndIncrement; the Disruptor actually uses its own implementation, but it works in the same manner). Let's call this generated long a producerCallId. In a similar manner, a consumerCallId is generated when a consumer ENDS reading a slot from the buffer. The most recent consumerCallId is accessed.
(If there are many consumers, the call with the lowest id is chosen.)
These ids are then compared, and if the difference between the two is less than the buffer size, the producer is allowed to write.
(If the producerCallId is greater than the recent consumerCallId + bufferSize, it means that the buffer is full, and the producer is forced to busy-wait until a spot becomes available.)
The producer is then assigned a slot in the buffer based on its callId (which is producerCallId modulo bufferSize, but since the bufferSize is always a power of 2 (a limit enforced on buffer creation), the actual operation used is producerCallId & (bufferSize - 1)). It is then free to modify the event in that slot.
(The actual algorithm is a bit more complicated, involving caching recent consumerId in a separate atomic reference, for optimisation purposes.)
Once the event has been modified, the change is "published". When publishing, the respective slot in the flag array is filled with the updated flag. The flag value is the number of the loop (producerCallId divided by bufferSize; again, since bufferSize is a power of 2, the actual operation is a right shift).
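The arithmetic in the producer steps above is small enough to write out. The following sketch uses hypothetical names (RingGeometry, can_claim, slot_index, lap_flag) and only mirrors the description in this answer. It is not taken from the Disruptor source.

```cpp
#include <cstdint>

// Buffer geometry: size must be a power of two and shift must be log2(size).
struct RingGeometry {
    std::uint64_t size;
    unsigned shift;
};

// The producer may write while fewer than `size` slots are outstanding.
constexpr bool can_claim(std::uint64_t producer_id,
                         std::uint64_t slowest_consumer_id,
                         const RingGeometry& g) {
    return producer_id - slowest_consumer_id < g.size;
}

// Slot index: id modulo size, computed with a mask.
constexpr std::uint64_t slot_index(std::uint64_t id, const RingGeometry& g) {
    return id & (g.size - 1);
}

// Flag value: which lap around the ring this id belongs to (id / size).
constexpr std::uint64_t lap_flag(std::uint64_t id, const RingGeometry& g) {
    return id >> g.shift;
}
```

For a buffer of 8 slots (shift 3), id 13 lands in slot 5 with flag 1: it is the second time around the ring that slot 5 is being written.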
In a similar manner there can be any number of consumers. Each time a consumer wants to access the buffer, a consumerCallId is generated (depending on how the consumers were added to the disruptor, the atomic used in id generation may be shared or separate for each of them). This consumerCallId is then compared to the most recent producerCallId, and if it is the lesser of the two, the reader is allowed to progress.
(Similarly, if the producerCallId is equal to the consumerCallId, it means that the buffer is empty and the consumer is forced to wait. The manner of waiting is defined by a WaitStrategy during disruptor creation.)
For individual consumers (the ones with their own id generator), the next thing checked is the ability to batch consume. The slots in the buffer are examined in order from the one respective to the consumerCallId (the index is determined in the same manner as for producers), to the one respective to the recent producerCallId.
They are examined in a loop by comparing the flag value written in the flag array against a flag value generated for the consumerCallId. If the flags match, it means that the producers filling the slots have committed their changes. If not, the loop is broken, and the highest committed changeId is returned. The slots from consumerCallId to the returned changeId can be consumed in batch.
If a group of consumers read together (the ones with shared id generator), each one only takes a single callId, and only the slot for that single callId is checked and returned.
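The batch check for individual consumers can be sketched as a simple scan over the flag array. The function name highest_published and the plain std::vector<int> standing in for the atomic flag array are assumptions made to keep the example single-threaded and short. Unwritten slots are assumed to hold -1.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scans ids from `from` to `to` (inclusive) and returns the highest id for
// which every slot so far carries the flag of its own lap. Returns from - 1
// when even the first slot has not been published yet.
// flags.size() must be a power of two and shift must be log2(flags.size()).
inline std::int64_t highest_published(const std::vector<int>& flags,
                                      std::int64_t from, std::int64_t to,
                                      unsigned shift) {
    const std::int64_t mask = static_cast<std::int64_t>(flags.size()) - 1;
    for (std::int64_t id = from; id <= to; ++id) {
        const std::size_t slot = static_cast<std::size_t>(id & mask);
        if (flags[slot] != static_cast<int>(id >> shift)) {
            return id - 1;  // gap found: this producer has not committed yet
        }
    }
    return to;
}
```

The consumer can then process everything from its own id up to the returned id in one batch. Anything after the first gap has to wait, even if later slots are already published.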
From this article:
The disruptor pattern is a batching queue backed up by a circular
array (i.e. the ring buffer) filled with pre-allocated transfer
objects which uses memory-barriers to synchronize producers and
consumers through sequences.
Memory-barriers are kind of hard to explain and Trisha's blog has done the best attempt in my opinion with this post: http://mechanitis.blogspot.com/2011/08/dissecting-disruptor-why-its-so-fast.html
But if you don't want to dive into the low-level details you can just know that memory-barriers in Java are implemented through the volatile keyword or through java.util.concurrent.atomic.AtomicLong. The disruptor pattern sequences are AtomicLongs and are communicated back and forth among producers and consumers through memory-barriers instead of locks.
I find it easier to understand a concept through code, so the code below is a simple helloworld from CoralQueue, which is a disruptor pattern implementation done by CoralBlocks with which I am affiliated. In the code below you can see how the disruptor pattern implements batching and how the ring-buffer (i.e. circular array) allows for garbage-free communication between two threads:
package com.coralblocks.coralqueue.sample.queue;

import com.coralblocks.coralqueue.AtomicQueue;
import com.coralblocks.coralqueue.Queue;
import com.coralblocks.coralqueue.util.MutableLong;

public class Sample {

    public static void main(String[] args) throws InterruptedException {

        final Queue<MutableLong> queue = new AtomicQueue<MutableLong>(1024, MutableLong.class);

        Thread consumer = new Thread() {
            @Override
            public void run() {
                boolean running = true;
                while(running) {
                    long avail;
                    while((avail = queue.availableToPoll()) == 0); // busy spin
                    // drain everything that is currently available as one batch
                    for(int i = 0; i < avail; i++) {
                        MutableLong ml = queue.poll();
                        if (ml.get() == -1) {
                            running = false;
                        } else {
                            System.out.println(ml.get());
                        }
                    }
                    queue.donePolling();
                }
            }
        };

        consumer.start();

        MutableLong ml;
        for(int i = 0; i < 10; i++) {
            while((ml = queue.nextToDispatch()) == null); // busy spin
            ml.set(System.nanoTime());
            queue.flush();
        }

        // send a message to stop consumer...
        while((ml = queue.nextToDispatch()) == null); // busy spin
        ml.set(-1);
        queue.flush();

        consumer.join(); // wait for the consumer thread to die...
    }
}
I've written a 'server' program that writes to shared memory, and a client program that reads from the memory. The server has different 'channels' that it can be writing to, which are just different linked lists that it's appending items to. The client is interested in some of the linked lists, and wants to read every node that's added to those lists as it comes in, with the minimum latency possible.
I have 2 approaches for the client:
1. For each linked list, the client keeps a 'bookmark' pointer to keep its place within the linked list. It round robins the linked lists, iterating through all of them over and over (it loops forever), moving each bookmark one node forward each time if it can. Whether it can is determined by the value of a 'next' member of the node. If it's non-null, then jumping to the next node is safe (the server switches it from null to non-null atomically). This approach works OK, but if there are a lot of lists to iterate over, and only a few of them are receiving updates, the latency gets bad.
2. The server gives each list a unique ID. Each time the server appends an item to a list, it also appends the ID number of the list to a master 'update list'. The client only keeps one bookmark, a bookmark into the update list. It endlessly checks if the bookmark's next pointer is non-null ( while(node->next_ == NULL) {} ), if so moves ahead, reads the ID given, and then processes the new node on the linked list that has that ID. This, in theory, should handle large numbers of lists much better, because the client doesn't have to iterate over all of them each time.
When I benchmarked the latency of both approaches (using gettimeofday), to my surprise #2 was terrible. The first approach, for a small number of linked lists, would often be under 20us of latency. The second approach would have small spats of low latencies but often be between 4,000-7,000us!
Through inserting gettimeofday's here and there, I've determined that all of the added latency in approach #2 is spent in the loop repeatedly checking if the next pointer is non-null. This is puzzling to me; it's as if the change in one process is taking longer to 'publish' to the second process with the second approach. I assume there's some sort of cache interaction going on I don't understand. What's going on?
Update: Originally, approach #2 used a condition variable, so that if node->next_ == NULL it would wait on the condition, and the server would notify on the condition every time it issued an update. The latency was the same, and in trying to figure out why I reduced the code down to the approach above. I'm running on a multicore machine, so one process spinlocking shouldn't affect the other.
Update 2: node->next_ is volatile.
Since it sounds like reads and writes are occurring on separate CPUs, perhaps a memory barrier would help? Your writes may not be occurring when you expect them to be.
You are doing a Spin Lock in #2, which is generally not such a great idea, and is chewing up cycles.
Have you tried adding a yield after each failed polling-attempt in your second approach? Just a guess, but it may reduce the power-looping.
With Boost.Thread this would look like this:
while(node->next_ == NULL) {
boost::this_thread::yield( );
}
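The memory-barrier and yield suggestions above can be combined using standard C++11 atomics instead of a volatile pointer. This is a sketch under assumptions: the Node layout and the function names publish and wait_for_next are invented for illustration. For real cross-process use, the pointer stored in next would have to be valid in both address spaces (or be replaced by an offset), and std::atomic<Node*> would have to be lock-free on the platform.

```cpp
#include <atomic>
#include <thread>

struct Node {
    int list_id = 0;
    std::atomic<Node*> next{nullptr};
};

// Writer side: the new node must be fully filled in before this call. The
// release store makes those earlier writes visible to a reader that
// observes the non-null pointer with an acquire load.
inline void publish(Node& tail, Node& fresh) {
    tail.next.store(&fresh, std::memory_order_release);
}

// Reader side: poll with acquire loads and yield after each failed attempt
// instead of spinning flat out.
inline Node* wait_for_next(const Node& bookmark) {
    Node* n = nullptr;
    while ((n = bookmark.next.load(std::memory_order_acquire)) == nullptr) {
        std::this_thread::yield();
    }
    return n;
}
```

Unlike volatile, the release/acquire pair also orders the node's payload (here list_id) with respect to the pointer, so the reader never sees a published node with stale contents. Whether this changes the measured latency in the question is not something the sketch can establish.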
In Maurice Herlihy's paper "Wait-free synchronization" he defines wait-free:
"A wait-free implementation of a concurrent data object is one that guarantees
that any process can complete any operation in a finite number of steps, regardless
of the execution speeds of the other processes."
www.cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf
Let's take one operation op from the universe.
(1) Does the definition mean: "Every process completes a certain operation op in the same finite number n of steps."?
(2) Or does it mean: "Every process completes a certain operation op in any finite number of steps. So that a process can complete op in k steps another process in j steps, where k != j."?
Just by reading the definition I would understand meaning (2). However, this makes no sense to me, since a process executing op in k steps and another time in k + m steps meets the definition, but the m steps could be a waiting loop. If meaning (2) is right, can anybody explain to me why this describes wait-free?
In contrast to (2), meaning (1) would guarantee that op is executed in the same number of steps k. So there can't be any additional steps m that are necessary e.g. in a waiting loop.
Which meaning is right and why?
Thanks a lot,
sema
The definition means (2). Consider that the waiting loop may potentially never terminate if the process that is waited for runs indefinitely: “regardless of the execution speeds of the other processes”.
So the infinite waiting loop effectively means that a given process may not be able to complete an operation in a finite number of steps.
When an author of a theoretical paper like this writes "a finite number of steps", it means that there exists some constant k (you do not necessarily know k), so that the number of steps is smaller than k (i.e. your waiting time surely won't be infinite).
I'm not sure what 'op' means in this context, but generally, when you have a multithreaded program, threads might wait for one another to do something.
Example: a thread has a lock, and other threads wait for this lock to be freed until they can operate.
This example is not wait free, since if the thread holding the lock does not get a chance to do any ops (this is bad, since the requirement here is that other threads will continue regardless of any other thread), other threads are doomed, and will never ever make any progress.
Another example: there are several threads, each trying to CAS on the same address.
This example is wait free, because although all threads but one will fail in such an operation, there will always be progress no matter which threads are chosen to run.
It sounds like you're concerned that definition 2 would allow for an infinite wait loop, but such a loop—being infinite—would not satisfy the requirement for completion within a finite number of steps.
I take "wait-free" to mean that making progress does not require any participant to wait for another participant to finish. If such waiting were necessary, then whenever one participant hangs or operates slowly, the other participants would suffer similarly.
By contrast, with a wait-free approach, each participant tries its operation and accommodates competitive interaction with other participants. For instance, each thread may try to advance some state, and if two try "at the same" time, only one should succeed, but there's no need for any participants that "failed" to retry. They merely recognize that someone else already got the job done, and they move on.
Rather than focusing on "waiting my turn to act", a wait-free approach encourages "trying to help", acknowledging that others may also be trying to help at the same time. Each participant has to know how to detect success, when to retry, and when to give up, confident that trying only failed because someone else got in there first. As long as the job gets done, it doesn't matter which thread got it done.
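The "try once, and accept that someone else may have done it" style can be contrasted with a retry loop in a few lines. The function names try_advance and increment_with_retry are made up for this sketch, and it assumes std::atomic<int> is lock-free on the target platform.

```cpp
#include <atomic>

// One attempt to move `state` from `seen` to `seen + 1`. The call finishes
// in a bounded number of steps whether it wins or loses. A false result
// means another thread changed the state first.
inline bool try_advance(std::atomic<int>& state, int seen) {
    return state.compare_exchange_strong(seen, seen + 1);
}

// For contrast: a CAS retry loop. Some thread always makes progress, but
// one particular thread may in principle retry without bound, so this loop
// is lock-free rather than wait-free.
inline int increment_with_retry(std::atomic<int>& state) {
    int seen = state.load();
    while (!state.compare_exchange_weak(seen, seen + 1)) {
        // on failure `seen` is refreshed with the current value; try again
    }
    return seen + 1;
}
```

If two threads both call try_advance(state, 0), exactly one gets true and the other simply moves on, which matches the behaviour described above. The retry loop shows why "finite number of steps" in the definition is a per-operation guarantee rather than a system-wide one.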
Wait-free essentially means that it needs no synchronization to be used in a multi-processing environment. The 'finite number of steps' refers to not having to wait on a synchronization device (e.g. a mutex) for an unknown -- and potentially infinite (deadlock) -- length of time while another process executes a critical section.