empty std::queue pushing data to end of stale items - c++

I am using an std::queue to buffer messages on my network (a CAN bus in this case). During an interrupt I add the message to the "inbox". Then my main program checks every cycle whether the queue is empty and, if not, handles the messages. The problem is that the queue is popped until empty (it exits from while (!inbox.empty())), but the next time I push data to it, the new data works as normal BUT the old data is still hanging out at the back.
For example, first message pushes a "1" to the queue. Loop reads
1
Next message is "2". Next read is
2
1
If I were to get in TWO messages before another read, "3", "4", then next read would be
3
4
2
1
I am very confused. I am also working with an STM32F0 ARM chip and mbed online, and have no idea if this is working poorly on the hardware or what!
I was concerned about thread safety, so I added an extra buffer queue and only push to the inbox when it is "unlocked". Since running this I have not seen any conflict occur anyway!
Pusher code:
if (bInboxUnlocked) {
    while (!inboxBuffer.empty()) {
        inbox.push(inboxBuffer.front());
        inboxBuffer.pop();
    }
    inbox.push(msg);
} else {
    inboxBuffer.push(msg);
    printf("LOCKED!");
}
Main program read code
bInboxUnlocked = 0;
while (!inbox.empty()) {
    printf("%d\r\n", inbox.front().data);
    inbox.pop();
}
bInboxUnlocked = 1;
Thoughts anyone? Am I using this wrong? Any other ways to easily accomplish what I am doing? I expect the buffers to be small enough to implement a small circular array, but with queue on hand I was hoping not to have to do that.

Based on what I can figure out from a basic Google search, your CPU is a single core CPU, essentially. If so, then there should not be any memory fencing issues to deal with, here.
If, on the other hand, you had multiple CPU cores to deal with here, it will be necessary to either cram in explicit fences, in key places, or employ C++11 classes like std::mutex, that will take care of this for you.
But going with the original use case of a single CPU, and no memory fencing issues, if you can guarantee that:
A) There's some definite upper limit on the number of messages you expect to buffer by your interrupt handling code in the queue before it gets drained, and:
B) the messages you're buffering are PODs
Then a potential alternative to std::queue worth exploring here is to roll your own simple queue, using nothing more than a static std::array, or maybe a std::vector, an int head pointer, and an int tail pointer. A google search should find plenty of examples of implementing this simple algorithm:
The puller checks whether head != tail; if so, it reads the message in queue[head] and increments head. Increment means: head = (head + 1) % queuesize. The pusher checks whether incrementing tail (also modulo queuesize) would make it equal to head; if so the queue has filled up (something that shouldn't happen, according to the prerequisites of this approach). If not, it puts the message into queue[tail] and increments tail.
If all of these operations are done in the right order, the net effect would be the same as using std::queue but:
1) Without the overhead of std::queue and the heap allocation it uses. Should be a major win on an embedded platform.
2) Since the queue is a vector in contiguous memory, this should take advantage of CPU caching, which usually helps on traditional CPUs.
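For illustration, here is a minimal sketch of such a fixed-size queue, assuming a single producer (the interrupt handler), a single consumer (the main loop), POD messages, and a hypothetical CANMessage type; volatile is used on the indices only because the answer assumes a single core with no compiler-visible synchronization.

#include <array>

struct CANMessage { int data; };          // hypothetical POD message type

constexpr int kQueueSize = 16;            // assumed upper bound on buffered messages

static std::array<CANMessage, kQueueSize> ring;
static volatile int head = 0;             // next slot to read (only the consumer writes this)
static volatile int tail = 0;             // next slot to write (only the producer writes this)

// Called from the interrupt handler. Returns false if the queue is full.
bool push(const CANMessage& msg) {
    int next = (tail + 1) % kQueueSize;
    if (next == head) return false;       // full: should not happen per prerequisite A
    ring[tail] = msg;
    tail = next;                          // publish the new element last
    return true;
}

// Called from the main loop. Returns false if there is nothing to read.
bool pop(CANMessage& out) {
    if (head == tail) return false;       // empty
    out = ring[head];
    head = (head + 1) % kQueueSize;
    return true;
}

Because the producer only writes tail and the consumer only writes head, the locking flag and inboxBuffer workaround from the question are not needed on a single-core part; on a multi-core target you would need the explicit fences or std::mutex mentioned above.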

Related

How to match processing time with reception time in c++ multithreading

I'm writing a C++ application in which I'll receive 4096 bytes of data every 0.5 seconds. This is processed and the output is sent to some other application. Processing each set of data takes nearly 2 seconds.
This is exactly how I'm doing it.
In my main function, I receive the data and push it into a vector.
I've created a thread that always processes the first element and deletes it immediately after processing. Below is a simulation of the receiving part of my application.
#include <iostream>
#include <unistd.h>
#include <vector>
#include <mutex>
#include <pthread.h>
using namespace std;

struct Student {
    int id;
    int age;
};

vector<Student> dustBin;
pthread_mutex_t lock1;
bool isEven = true;

void *processData(void *arg) {
    Student st1;
    while (true) {
        if (dustBin.size()) {
            printf("front: %d\tSize: %zu\n", dustBin.front().id, dustBin.size());
            st1 = dustBin.front();
            cout << "Currently Processing ID " << st1.id << endl;
            sleep(2);  // simulates the 2-second processing time
            pthread_mutex_lock(&lock1);
            dustBin.erase(dustBin.begin());
            cout << "Deleted" << endl;
            pthread_mutex_unlock(&lock1);
        }
    }
    return NULL;
}

int main() {
    pthread_t ptid;
    Student st;
    dustBin.clear();
    pthread_mutex_init(&lock1, NULL);
    pthread_create(&ptid, NULL, &processData, NULL);
    while (true) {
        for (int i = 0; i < 4096; i++) {
            st.id = i + 1;
            st.age = i + 2;
            pthread_mutex_lock(&lock1);
            dustBin.push_back(st);
            printf("Pushed: %d\n", st.id);
            pthread_mutex_unlock(&lock1);
            usleep(500000);  // simulates data arriving every 0.5 seconds
        }
    }
    pthread_join(ptid, NULL);
    pthread_mutex_destroy(&lock1);
}
The output of this code is
Output
In the output image posted here, you can observe the exact sequence of the processing. It is processing only one item for every 4 insertions.
Note that the reception time of data <<< processing time.
Because of this, my input buffer is growing very rapidly. Also, since the main thread and the processData thread share a mutex, each depends on the other releasing the lock; because of this my incoming buffer sometimes gets locked, leading to data misses. Please suggest how to handle this, or suggest some other method to do it.
Undefined behavior
When you read data, you must lock before getting the size.
Busy waiting
You should always avoid a tight loop that does nothing. Here, if dustBin is empty, you will immediately check it again, forever, which will use 100% of that core, slow down everything else, drain a laptop battery and make the machine hotter than it should be. It is a very bad idea to write such code!
Learn multithreading first
You should read a book or 2 on multithreading. Doing multithreading right is hard and almost impossible without taking time to learn it properly. C++ Concurrency in Action is highly recommended for standard C++ multithreading.
Condition variable
Usually you will use a condition variable or some sort of event to tell the consumer thread when data is added so it does not have to wake up uselessly to check if it is the case.
Since you have a typical producer/consumer, you should be able to find a lot of information on how to do it, or special containers or other constructs that will help implement your code.
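As a rough sketch of what the consumer loop could look like with a condition variable (using std::mutex and std::condition_variable here rather than the pthread API from the question, purely for brevity; the Student layout is the one from the question):

#include <condition_variable>
#include <mutex>
#include <vector>

struct Student { int id; int age; };   // same layout as in the question

std::vector<Student> dustBin;          // shared buffer
std::mutex binMutex;
std::condition_variable binCv;

// Producer side: push under the lock, then notify the consumer.
void produce(const Student& st) {
    {
        std::lock_guard<std::mutex> guard(binMutex);
        dustBin.push_back(st);
    }
    binCv.notify_one();
}

// Consumer side: sleeps until there is something to process.
void consumeLoop() {
    while (true) {
        Student st;
        {
            std::unique_lock<std::mutex> guard(binMutex);
            binCv.wait(guard, [] { return !dustBin.empty(); });
            st = dustBin.front();
            dustBin.erase(dustBin.begin());
        }
        // process st outside the lock (the slow 2-second work)
    }
}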
Output
Your printf and cout calls will have an impact on performance, and since some are inside a lock and others are not, you will probably get improperly interleaved output. If you really need output, a third thread might be a better option. In any case, you want to minimize the time you hold a lock, so formatting into a temporary buffer might be a good idea too.
By the way, standard output is relatively slow and it is perfectly possible that it might even be the reason why you are not able to process rapidly all data.
Processing rate
Obviously if you are able to produce 4096 bytes of data every 0.5 second but need 2 seconds to process that data, you have a serious problem.
You should really think about what you want to do in such a case before asking a question here, because without that information we can only guess at possible solutions.
Here are some possibilities:
Slow down the producer. Obviously, this does not work if you get data in real time.
Optimize the consumer (better algorithms, better hardware, optimal parallelism…)
Skip some data
Obviously, for performance problems you should use a profiler to find out where the time goes. Once you know that, you will have a better idea of where to look to improve your code.
Taking 2 seconds to process the data is really slow but we cannot help you since we have no idea of what your code is doing.
For example, if you add the data into a database and it is not able to follow up, you might want to batch multiple insert into a single command to reduce the overhead of communicating with the database over the network.
Another example, would be if you append the data to a file, you might want to keep the file open and accumulate some data before doing each write.
Container
A vector is not a good choice if you remove items from the head one by one and its size becomes somewhat large (say more than 100 small items), as every remaining item has to be moved on each removal.
In addition to changing the container as suggested in a comment, another possibility would be to use 2 vectors and swap them. That way you reduce the number of times you lock the mutex and can process many items without holding a lock.
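A hedged sketch of that swap idea, assuming the same Student/dustBin setup as the question:

#include <mutex>
#include <vector>

struct Student { int id; int age; };   // as in the question

std::vector<Student> inbox;            // filled by the producer under the lock
std::mutex inboxMutex;

void consumeBatch() {
    std::vector<Student> batch;
    {
        // Take everything accumulated so far in one short critical section.
        std::lock_guard<std::mutex> guard(inboxMutex);
        batch.swap(inbox);             // O(1): just exchanges the internal buffers
    }
    for (const Student& st : batch) {
        // process st here, with no lock held
        (void)st;
    }
}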
How to optimize
You should accumulate enough data (say 30 seconds' worth), stop accumulating, and then test your processing speed on that data. If you cannot process that data in less than about half the time (15 seconds), then you clearly need to improve your processing speed one way or another. Once your consumer(s) are fast enough, you can optimize the communication from the producer to the consumer(s).
You have to know whether your bottleneck is I/O, the database or something else, and whether some part might be done in parallel.
There are probably a lot of optimizations that can be done in the code you have not shown...
If you can't handle messages fast enough, you have to drop some.
Use a circular buffer of a fixed size.
Then if the producer is faster than the consumer, older entries will be overwritten.
If you cannot skip some data and you cannot process it fast enough, you are doomed.
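A minimal single-threaded sketch of the overwrite-on-full policy described above (names and sizes are just for illustration; synchronization is deliberately omitted):

#include <array>

constexpr int kSize = 64;
std::array<int, kSize> buf;   // element type is arbitrary here
int head = 0;                 // oldest unread entry
int tail = 0;                 // next slot to write
bool full = false;

void pushOverwriting(int value) {
    buf[tail] = value;
    tail = (tail + 1) % kSize;
    if (full) {
        head = (head + 1) % kSize;   // drop the oldest entry
    }
    full = (tail == head);
}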
Create two const variables, NBUFFERS and NTHREADS, make them both 8 initially if you have 16 cores and your processing is 4x too slow. Play with these values later.
Create NBUFFERS data buffers, each big enough to hold 4096 samples. In practice, just create a single large buffer and use offsets into it to divide it up.
Start NTHREADS threads. They will each continuously wait to be told which buffer to process, then process it and wait again for another buffer.
In your main program, go into a loop, receiving data. Receive the first 4096 samples into the first buffer and notify the first thread. Receive the second 4096 samples into the second buffer and notify the second thread.
buffer = (buffer + 1) % NBUFFERS
thread = (thread + 1) % NTHREADS
Rinse and repeat. As you have 8 threads, and data only arrives every 0.5 seconds, each thread will only get a new buffer every 4 seconds but only needs 2 seconds to clear the previous buffer.
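A rough sketch of that round-robin scheme with std::thread (NBUFFERS, NTHREADS and the sample count are the values assumed above; receiveInto and process are placeholders, and the sketch assumes each worker finishes well before its slot comes around again, i.e. 4 s vs 2 s):

#include <array>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

constexpr int NBUFFERS = 8;
constexpr int NTHREADS = 8;
constexpr int SAMPLES  = 4096;

std::array<std::vector<char>, NBUFFERS> buffers;   // one slot per in-flight block
std::array<std::mutex, NTHREADS> mtx;
std::array<std::condition_variable, NTHREADS> cv;
std::array<int, NTHREADS> pending;                 // buffer index handed to each worker, -1 if none

void receiveInto(std::vector<char>& buf) { /* placeholder: fill buf with 4096 samples */ }
void process(const std::vector<char>& buf) { /* placeholder: the 2-second work */ }

void worker(int id) {
    for (;;) {
        int bufIndex;
        {
            std::unique_lock<std::mutex> lk(mtx[id]);
            cv[id].wait(lk, [&] { return pending[id] >= 0; });
            bufIndex = pending[id];
            pending[id] = -1;
        }
        process(buffers[bufIndex]);
    }
}

int main() {
    for (auto& b : buffers) b.resize(SAMPLES);
    pending.fill(-1);
    std::vector<std::thread> workers;
    for (int t = 0; t < NTHREADS; ++t) workers.emplace_back(worker, t);

    int buffer = 0, thread = 0;
    for (;;) {                                     // main receive loop, never exits in this sketch
        receiveInto(buffers[buffer]);
        {
            std::lock_guard<std::mutex> lk(mtx[thread]);
            pending[thread] = buffer;
        }
        cv[thread].notify_one();
        buffer = (buffer + 1) % NBUFFERS;
        thread = (thread + 1) % NTHREADS;
    }
}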

should i always lock the global data in multi-thread programming, why or why not?

I'm new to multi-threaded programming (actually, I'm not completely new to multi-threading, but I always use global data shared between the reading and writing threads; I think it makes my code ugly and slow, and I'm eager to improve my skills),
and I'm now developing a forwarder server in C++. To simplify the question, suppose there are only two threads, a receiving thread and a sending thread, and, the usual naive design, a global std::list for saving the data :(
The receiving thread reads raw data from the server and writes it into the global std::list.
The sending thread reads the global std::list and sends the data to several clients.
I use pthread_mutex_lock to synchronize the global std::list.
The problem is that the performance of the forwarder server is poor: the global list is locked while the receiving thread is writing, but at that moment my sending thread wants to read, so it must wait, and I think that waiting is unnecessary.
What should I do? I know globals are bad, but without a global, how can I synchronize these two threads?
I'll keep searching SO and Google.
Any suggestions, guides, technologies or books will be appreciated. Thanks!
EDIT
For any suggestion, I want to know why or why not; please give me the reason. Thanks a lot.
Notes:
Please provide more complete examples: http://sscce.org/
Answers:
Yes, you should synchronize access to shared data.
NOTE: this makes assumptions about the std::list implementation - which may or may not apply to your case - but since these assumptions hold for some implementations, you cannot assume your implementation is thread safe without an explicit guarantee.
Consider the snippet:
std::list<data> g_list;

void thread1()
{
    while( /*input ok*/ )
    {
        /*read input*/
        g_list.push_back( /*something*/ );
    }
}

void thread2()
{
    while( /*something*/ )
    {
        /*pop from list*/
        data x = g_list.front();
        g_list.pop_front();
    }
}
Say, for example, the list has 1 element in it.
std::list::push_back() must do:
allocate space (many CPU instructions)
copy data into new space (many CPU instructions)
update previous element (if it exists) to point to new element
set std::list::_size
std::list::pop_front() must do:
free space
update next element to not have previous element
set std::list::_size
Now say thread 1 calls push_back(): after checking that there is an existing element (the check on size), it goes on to update that element to point to the new one. But right after the check, before it gets a chance to update the element, thread 2 could be running pop_front() and be busy freeing the memory for that first element, which could then cause thread 1 to hit a segmentation fault, or even memory corruption. Similarly, the updates to size could result in push_back's write winning over pop_front's, and then you have size 2 when you only have 1 element.
Do not use pthread_* in C++ unless you really know what you're doing - use std::thread (C++11) or boost::thread - or wrap pthread_* in a class yourself - because if you don't account for exceptions you will end up with deadlocks.
You cannot get past some form of synchronization in this specific example - but you could optimize synchronization
Don't copy the data itself into and out of the std::list - copy a pointer to the data into and out of the list
Only lock while you're actually accessing the std::list - but don't make this mistake:
{
    // lock
    size_t i = g_list.size();
    // unlock
    if ( i )
    {
        // lock
        // work with g_list ...
        // unlock
    }
}
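The problem with the pattern above is that the list can change between the size check and the later access. A safer shape is to hold the lock across both the check and the access, sketched here with std::mutex (the same idea applies to pthread_mutex_lock):

#include <list>
#include <mutex>

std::mutex g_mutex;
std::list<int> g_list;   // int stands in for the real data type

bool try_pop(int& out) {
    std::lock_guard<std::mutex> guard(g_mutex);   // one lock covers check and access
    if (g_list.empty())
        return false;
    out = g_list.front();
    g_list.pop_front();
    return true;
}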
A more appropriate pattern here would be a message queue - you can implement one with a mutex, a list and a condition variable (a minimal sketch is given after the links below). Here are some implementations you can look at:
http://pocoproject.org/docs/Poco.Notification.html
http://gnodebian.blogspot.com.es/2013/07/a-thread-safe-asynchronous-queue-in-c11.html
http://docs.wxwidgets.org/trunk/classwx_message_queue_3_01_t_01_4.html
google for more
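For reference, a minimal sketch of such a message queue built from a mutex, a list and a condition variable (the names here are made up, and a production version would also want a shutdown mechanism):

#include <condition_variable>
#include <list>
#include <mutex>

template <typename T>
class MessageQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            items_.push_back(std::move(value));
        }
        cv_.notify_one();               // wake one waiting consumer
    }

    T pop() {                           // blocks until an item is available
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !items_.empty(); });
        T value = std::move(items_.front());
        items_.pop_front();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::list<T> items_;
};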
There is also the option of atomic containers, look at:
http://calumgrant.net/atomic/ - not sure if this is backed by actual atomic storage (as opposed to just using synchronization behind an interface)
google for more
You could also go for an asynchronous approach with boost::asio - though your case should be quite fast if done right.

thread building block combined with pthreads

I have a queue with elements which need to be processed. I want to process these elements in parallel. There will be some sections of the processing of each element which need to be synchronized. At any point in time there can be at most num_threads running threads.
I'll provide a template to give you an idea of what I want to achieve.
queue q

process_element(e)
{
    lock()
    some synchronized area
    // a matrix access performed here so a spin lock would do
    unlock()
    ...
    unsynchronized area
    ...
    if( condition )
    {
        new_element = generate_new_element()
        q.push(new_element) // synchronized access to queue
    }
}

process_queue()
{
    while( elements in q ) // algorithm is finished condition
    {
        e = get_elem_from_queue(q) // synchronized access to queue
        process_element(e)
    }
}
I can use
pthreads
openmp
intel thread building blocks
Top problems I have
Make sure that at any point in time I have max num_threads running threads
Lightweight synchronization methods to use on queue
My plan is to use the Intel TBB concurrent_queue for the queue container. But then, will I be able to use pthreads functions (mutexes, conditions)? Let's assume this works (it should). Then, how can I use pthreads to have at most num_threads at any point in time? I was thinking of creating the threads once and then, after one element is processed, accessing the queue and getting the next element. However it is more complicated than that, because an empty queue does not guarantee that the algorithm is finished.
My question
Before I start implementing I'd like to know if there is an easy way to use intel tbb or pthreads to obtain the behaviour I want? More precisely processing elements from a queue in parallel
Note: I have tried to use tasks but with no success.
First off, pthreads gives you portability which is hard to walk away from. The following appear to be true from your question - let us know if these aren't true because the answer will then change:
1) You have a multi-core processor(s) on which you're running the code
2) You want to have no more than num_threads threads because of (1)
Assuming the above to be true, the following approach might work well for you:
Create num_threads pthreads using pthread_create
Optionally, bind each thread to a different core
q.push(new_element) atomically adds new_element to a queue. pthread_mutex_lock and pthread_mutex_unlock can help you here. Examples here: http://pages.cs.wisc.edu/~travitch/pthreads_primer.html
Use pthread mutexes for dequeueing elements
Termination is tricky - one way to do this is to add a TERMINATE element to the queue which, upon dequeueing, causes the dequeuer to queue up another TERMINATE element (for the next dequeuer) and then terminate. You will end up with one extra TERMINATE element in the queue, which you can remove by having a designated thread dequeue it after all the threads are done (see the sketch after this list).
Depending on how often you add/remove elements from the queue, you may want to use something lighter weight than pthread_mutex_... to enqueue/dequeue elements. This is where you might want to use a more machine-specific construct.
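A compact sketch of the termination scheme from the list above, using the pthread primitives already mentioned (the element type and queue are simplified, and TERMINATE is just a sentinel value here; a pthread_cond_t would avoid the busy polling on an empty queue, which is kept simple for brevity):

#include <pthread.h>
#include <queue>

const int TERMINATE = -1;                 // sentinel value standing in for a real element

std::queue<int> work;
pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

void* worker(void*) {
    for (;;) {
        pthread_mutex_lock(&qlock);
        if (work.empty()) {               // nothing yet: release the lock and poll again
            pthread_mutex_unlock(&qlock);
            continue;
        }
        int item = work.front();
        work.pop();
        if (item == TERMINATE) {
            work.push(TERMINATE);         // pass the sentinel on to the next dequeuer
            pthread_mutex_unlock(&qlock);
            return NULL;
        }
        pthread_mutex_unlock(&qlock);
        // process item here, outside the lock
    }
}

Each of the num_threads workers would be started with pthread_create(&tid, NULL, worker, NULL), and TERMINATE is pushed once when no more work will be generated.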
TBB is compatible with other threading packages.
TBB also emphasizes scalability. So when you port over your program to from a dual core to a quad core you do not have to adjust your program. With data parallel programming, program performance increases (scales) as you add processors.
Cilk Plus is also another runtime that provides good results.
www.cilkplus.org
Since pthreads is a low-level threading library, you have to decide how much control you need in your application: it does offer flexibility, but at a high cost in terms of programmer effort, debugging time, and maintenance.
My recommendation is to look at tbb::parallel_do. It was designed to process elements from a container in parallel, even if the container itself is not concurrent; i.e. parallel_do works with an std::queue correctly without any user synchronization (of course you would still need to protect your matrix access inside process_element()). Moreover, with parallel_do you can add more work on the fly, which looks like what you need, as process_element() creates and adds new elements to the work queue (the only caution is that the newly added work will be processed immediately, unlike putting it in a queue, which would postpone processing until after all "older" items). Also, you don't have to worry about termination: parallel_do will complete automatically as soon as all initial queue items and new items created on the fly are processed.
However, if, besides the computation itself, the work queue can be concurrently fed from another source (e.g. from an I/O processing thread), then parallel_do is not suitable. In this case, it might make sense to look at parallel_pipeline or, better, the TBB flow graph.
Lastly, an application can control the number of active threads with TBB, though it's not a recommended approach.
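As a rough sketch of the parallel_do suggestion above (assuming classic TBB headers; in recent oneTBB releases parallel_do has been superseded by parallel_for_each with a feeder, and the Element type, condition and generate_new_element below are placeholders standing in for the question's template):

#include <deque>
#include <tbb/parallel_do.h>
#include <tbb/spin_mutex.h>

struct Element { int value; };                             // placeholder for the real element type

tbb::spin_mutex matrixMutex;                               // protects the synchronized (matrix) section

Element generate_new_element() { return Element{0}; }      // placeholder
bool condition(const Element& e) { return e.value > 0; }   // placeholder

struct ProcessElement {
    void operator()(Element& e, tbb::parallel_do_feeder<Element>& feeder) const {
        {
            tbb::spin_mutex::scoped_lock lock(matrixMutex);
            // synchronized area (matrix access) goes here
        }
        // unsynchronized area goes here
        if (condition(e)) {
            feeder.add(generate_new_element());            // new work is processed on the fly
        }
    }
};

int main() {
    std::deque<Element> work(16, Element{0});              // initial elements; parallel_do needs iterators
    tbb::parallel_do(work.begin(), work.end(), ProcessElement());
    // returns once all initial and fed elements have been processed
}

In classic TBB the worker count could be capped with tbb::task_scheduler_init(num_threads), which matches the question's max-num_threads requirement, though as noted above limiting threads explicitly is not the recommended approach.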

How does LMAX's disruptor pattern work?

I am trying to understand the disruptor pattern. I have watched the InfoQ video and tried to read their paper. I understand there is a ring buffer involved, and that it is initialized as an extremely large array to take advantage of cache locality and eliminate allocation of new memory.
It sounds like there are one or more atomic integers which keep track of positions. Each 'event' seems to get a unique id, and its position in the ring is found by taking its modulus with respect to the size of the ring, etc., etc.
Unfortunately, I don't have an intuitive sense of how it works. I have done many trading applications and studied the actor model, looked at SEDA, etc.
In their presentation they mentioned that this pattern is basically how routers work; however I haven't found any good descriptions of how routers work either.
Are there some good pointers to a better explanation?
The Google Code project does reference a technical paper on the implementation of the ring buffer; however, it is a bit dry, academic and tough going for someone wanting to learn how it works. However, there are some blog posts that have started to explain the internals in a more readable way: there is an explanation of the ring buffer that is the core of the disruptor pattern, a description of the consumer barriers (the part related to reading from the disruptor), and some information on handling multiple producers available.
The simplest description of the Disruptor is: It is a way of sending messages between threads in the most efficient manner possible. It can be used as an alternative to a queue, but it also shares a number of features with SEDA and Actors.
Compared to Queues:
The Disruptor provides the ability to pass a message on to another thread, waking it up if required (similar to a BlockingQueue). However, there are 3 distinct differences.
The user of the Disruptor defines how messages are stored by extending the Entry class and providing a factory to do the preallocation. This allows for either memory reuse (copying) or the Entry could contain a reference to another object.
Putting messages into the Disruptor is a 2-phase process: first a slot is claimed in the ring buffer, which provides the user with the Entry that can be filled with the appropriate data. Then the entry must be committed; this 2-phase approach is necessary to allow for the flexible use of memory mentioned above. It is the commit that makes the message visible to the consumer threads.
It is the responsibility of the consumer to keep track of the messages that have been consumed from the ring buffer. Moving this responsibility away from the ring buffer itself helped reduce the amount of write contention as each thread maintains its own counter.
Compared to Actors
The Actor model is closer to the Disruptor than most other programming models, especially if you use the BatchConsumer/BatchHandler classes that are provided. These classes hide all of the complexities of maintaining the consumed sequence numbers and provide a set of simple callbacks when important events occur. However, there are a couple of subtle differences.
The Disruptor uses a 1 thread - 1 consumer model, whereas Actors use an N:M model, i.e. you can have as many actors as you like and they will be distributed across a fixed number of threads (generally 1 per core).
The BatchHandler interface provides an additional (and very important) callback, onEndOfBatch(). This allows slow consumers, e.g. those doing I/O, to batch events together to improve throughput. It is possible to do batching in other Actor frameworks; however, as nearly all other frameworks don't provide a callback at the end of the batch, you need to use a timeout to determine the end of the batch, resulting in poor latency.
Compared to SEDA
LMAX built the Disruptor pattern to replace a SEDA based approach.
The main improvement that it provided over SEDA was the ability to do work in parallel. To do this the Disruptor supports multi-casting the same messages (in the same order) to multiple consumers. This avoids the need for fork stages in the pipeline.
We also allow consumers to wait on the results of other consumers without having to put another queuing stage between them. A consumer can simply watch the sequence number of a consumer that it is dependent on. This avoids the need for join stages in the pipeline.
Compared to Memory Barriers
Another way to think about it is as a structured, ordered memory barrier, where the producer barrier forms the write barrier and the consumer barrier is the read barrier.
First we'd like to understand the programming model it offers.
There are one or more writers. There are one or more readers. There is a line of entries, totally ordered from old to new (pictured as left to right). Writers can add new entries on the right end. Every reader reads entries sequentially from left to right. Readers can't read past writers, obviously.
There is no concept of entry deletion. I use "reader" instead of "consumer" to avoid the image of entries being consumed. However we understand that entries on the left of the last reader become useless.
Generally readers can read concurrently and independently. However, we can declare dependencies among readers. Reader dependencies can form an arbitrary acyclic graph. If reader B depends on reader A, reader B can't read past reader A.
Reader dependency arises because reader A can annotate an entry, and reader B depends on that annotation. For example, A does some calculation on an entry and stores the result in field a in the entry. A then moves on, and now B can read the entry and the value of a that A stored. If reader C does not depend on A, C should not attempt to read a.
This is indeed an interesting programming model. Regardless of the performance, the model alone can benefit lots of applications.
Of course, LMAX's main goal is performance. It uses a pre-allocated ring of entries. The ring is large enough, but it's bounded so that the system will not be loaded beyond design capacity. If the ring is full, writer(s) will wait until the slowest readers advance and make room.
Entry objects are pre-allocated and live forever, to reduce garbage collection cost. We don't insert new entry objects or delete old entry objects; instead, a writer asks for a pre-existing entry, populates its fields, and notifies readers. This apparent 2-phase action is really simply an atomic action:
setNewEntry(EntryPopulator);
interface EntryPopulator{ void populate(Entry existingEntry); }
Pre-allocating entries also means adjacent entries are (very likely) located in adjacent memory cells, and because readers read entries sequentially, this is important for utilizing CPU caches.
And a lot of effort goes into avoiding locks, CAS operations, and even memory barriers (e.g. using a non-volatile sequence variable if there's only one writer).
For developers of readers: Different annotating readers should write to different fields, to avoid write contention. (Actually they should write to different cache lines.) An annotating reader should not touch anything that other non-dependent readers may read. This is why I say these readers annotate entries, instead of modify entries.
Martin Fowler has written an article about LMAX and the disruptor pattern, The LMAX Architecture, which may clarify it further.
I actually took the time to study the actual source, out of sheer curiosity, and the idea behind it is quite simple. The most recent version at the time of writing this post is 3.2.1.
There is a buffer storing pre-allocated events that will hold the data for consumers to read.
The buffer is backed by an array of flags (an integer array) of the same length, which describes the availability of the buffer slots (see further for details). The array is accessed like an AtomicIntegerArray, so for the purposes of this explanation you may as well assume it to be one.
There can be any number of producers. When a producer wants to write to the buffer, a long number is generated (as in calling AtomicLong#getAndIncrement; the Disruptor actually uses its own implementation, but it works in the same manner). Let's call this generated long a producerCallId. In a similar manner, a consumerCallId is generated when a consumer FINISHES reading a slot from the buffer. The most recent consumerCallId is accessed.
(If there are many consumers, the call with the lowest id is chosen.)
These ids are then compared, and if the difference between the two is less than the buffer size, the producer is allowed to write.
(If the producerCallId is greater than the recent consumerCallId + bufferSize, it means that the buffer is full, and the producer is forced to busy-wait until a spot becomes available.)
The producer is then assigned a slot in the buffer based on its callId (which is producerCallId modulo bufferSize, but since the bufferSize is always a power of 2 (a limit enforced on buffer creation), the actual operation used is producerCallId & (bufferSize - 1)). It is then free to modify the event in that slot.
(The actual algorithm is a bit more complicated, involving caching the recent consumerCallId in a separate atomic reference for optimisation purposes.)
When the event has been modified, the change is "published". When publishing, the respective slot in the flag array is filled with the updated flag. The flag value is the number of the loop, i.e. producerCallId divided by bufferSize (again, since bufferSize is a power of 2, the actual operation is a right shift).
In a similar manner there can be any number of consumers. Each time a consumer wants to access the buffer, a consumerCallId is generated (depending on how the consumers were added to the disruptor, the atomic used in id generation may be shared between them or separate for each). This consumerCallId is then compared to the most recent producerCallId, and if it is the lesser of the two, the reader is allowed to progress.
(Similarly, if the producerCallId is equal to the consumerCallId, it means that the buffer is empty and the consumer is forced to wait. The manner of waiting is defined by a WaitStrategy during disruptor creation.)
For individual consumers (the ones with their own id generator), the next thing checked is the ability to batch consume. The slots in the buffer are examined in order, from the one corresponding to the consumerCallId (the index is determined in the same manner as for producers) to the one corresponding to the recent producerCallId.
They are examined in a loop by comparing the flag value written in the flag array against the flag value generated for the consumerCallId. If the flags match, it means that the producers filling the slots have committed their changes. If not, the loop is broken, and the highest committed changeId is returned. The slots from the consumerCallId up to the received changeId can be consumed in a batch.
If a group of consumers read together (the ones with a shared id generator), each one only takes a single callId, and only the slot for that single callId is checked and returned.
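To make the index and flag arithmetic above concrete, here is a small illustrative snippet (in C++ for consistency with the rest of the page; it is not the Disruptor's actual code, just the power-of-two trick it relies on):

#include <cassert>
#include <cstdint>

int main() {
    const int64_t bufferSize = 1024;                 // must be a power of 2
    const int64_t indexMask  = bufferSize - 1;
    const int     indexShift = 10;                   // log2(bufferSize)

    int64_t producerCallId = 5000;                   // some sequence number
    int64_t slot = producerCallId & indexMask;       // same as producerCallId % bufferSize
    int64_t flag = producerCallId >> indexShift;     // same as producerCallId / bufferSize (the "loop number")

    assert(slot == producerCallId % bufferSize);
    assert(flag == producerCallId / bufferSize);
}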
From this article:
"The disruptor pattern is a batching queue backed up by a circular array (i.e. the ring buffer) filled with pre-allocated transfer objects which uses memory-barriers to synchronize producers and consumers through sequences."
Memory-barriers are kind of hard to explain and Trisha's blog has done the best attempt in my opinion with this post: http://mechanitis.blogspot.com/2011/08/dissecting-disruptor-why-its-so-fast.html
But if you don't want to dive into the low-level details you can just know that memory-barriers in Java are implemented through the volatile keyword or through the java.util.concurrent.AtomicLong. The disruptor pattern sequences are AtomicLongs and are communicated back and forth among producers and consumers through memory-barriers instead of locks.
I find it easier to understand a concept through code, so the code below is a simple helloworld from CoralQueue, which is a disruptor pattern implementation done by CoralBlocks with which I am affiliated. In the code below you can see how the disruptor pattern implements batching and how the ring-buffer (i.e. circular array) allows for garbage-free communication between two threads:
package com.coralblocks.coralqueue.sample.queue;

import com.coralblocks.coralqueue.AtomicQueue;
import com.coralblocks.coralqueue.Queue;
import com.coralblocks.coralqueue.util.MutableLong;

public class Sample {

    public static void main(String[] args) throws InterruptedException {

        final Queue<MutableLong> queue = new AtomicQueue<MutableLong>(1024, MutableLong.class);

        Thread consumer = new Thread() {

            @Override
            public void run() {
                boolean running = true;
                while (running) {
                    long avail;
                    while ((avail = queue.availableToPoll()) == 0); // busy spin
                    for (int i = 0; i < avail; i++) {
                        MutableLong ml = queue.poll();
                        if (ml.get() == -1) {
                            running = false;
                        } else {
                            System.out.println(ml.get());
                        }
                    }
                    queue.donePolling();
                }
            }
        };

        consumer.start();

        MutableLong ml;

        for (int i = 0; i < 10; i++) {
            while ((ml = queue.nextToDispatch()) == null); // busy spin
            ml.set(System.nanoTime());
            queue.flush();
        }

        // send a message to stop consumer...
        while ((ml = queue.nextToDispatch()) == null); // busy spin
        ml.set(-1);
        queue.flush();

        consumer.join(); // wait for the consumer thread to die...
    }
}

Why does one loop take longer to detect a shared memory update than another loop?

I've written a 'server' program that writes to shared memory, and a client program that reads from the memory. The server has different 'channels' that it can be writing to, which are just different linked lists that it appends items to. The client is interested in some of the linked lists and wants to read every node that's added to those lists as it comes in, with the minimum latency possible.
I have 2 approaches for the client:
For each linked list, the client keeps a 'bookmark' pointer to keep its place within the linked list. It round robins the linked lists, iterating through all of them over and over (it loops forever), moving each bookmark one node forward each time if it can. Whether it can is determined by the value of a 'next' member of the node. If it's non-null, then jumping to the next node is safe (the server switches it from null to non-null atomically). This approach works OK, but if there are a lot of lists to iterate over, and only a few of them are receiving updates, the latency gets bad.
The server gives each list a unique ID. Each time the server appends an item to a list, it also appends the ID number of the list to a master 'update list'. The client only keeps one bookmark, a bookmark into the update list. It endlessly checks if the bookmark's next pointer is non-null ( while(node->next_ == NULL) {} ), if so moves ahead, reads the ID given, and then processes the new node on the linked list that has that ID. This, in theory, should handle large numbers of lists much better, because the client doesn't have to iterate over all of them each time.
When I benchmarked the latency of both approaches (using gettimeofday), to my surprise #2 was terrible. The first approach, for a small number of linked lists, would often be under 20us of latency. The second approach would have small stretches of low latency but would often be between 4,000 and 7,000us!
Through inserting gettimeofday's here and there, I've determined that all of the added latency in approach #2 is spent in the loop repeatedly checking if the next pointer is non-null. This is puzzling to me; it's as if the change in one process is taking longer to 'publish' to the second process with the second approach. I assume there's some sort of cache interaction going on I don't understand. What's going on?
Update: Originally, approach #2 used a condition variable, so that if node->next_ == NULL it would wait on the condition, and the server would notify on the condition every time it issued an update. The latency was the same, and in trying to figure out why, I reduced the code down to the approach above. I'm running on a multicore machine, so one process spinlocking shouldn't affect the other.
Update 2: node->next_ is volatile.
Since it sounds like reads and writes are occurring on separate CPUs, perhaps a memory barrier would help? Your writes may not be occurring when you expect them to be.
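For example (a hedged sketch, since the original code isn't shown): with C++11 you could make the link an std::atomic pointer and use release/acquire ordering instead of relying on volatile, so the publication of a new node becomes visible in a well-defined way. This assumes the nodes live in memory mapped into both processes and that std::atomic<Node*> is lock-free there.

#include <atomic>

struct Node {
    int payload;                          // stands in for the real message data
    std::atomic<Node*> next_{nullptr};    // instead of a volatile raw pointer
};

// Writer (server) side: fill the node, then publish it with a release store.
void publish(Node* tail, Node* newNode) {
    newNode->payload = 42;                // example payload
    tail->next_.store(newNode, std::memory_order_release);
}

// Reader (client) side: an acquire load pairs with the release store above,
// so once next_ is non-null the payload is guaranteed to be visible.
Node* waitForNext(Node* bookmark) {
    Node* next;
    while ((next = bookmark->next_.load(std::memory_order_acquire)) == nullptr) {
        // spin (or yield, as suggested in the answer below)
    }
    return next;
}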
You are doing a Spin Lock in #2, which is generally not such a great idea, and is chewing up cycles.
Have you tried adding a yield after each failed polling-attempt in your second approach? Just a guess, but it may reduce the power-looping.
With Boost.Thread this would look like this:
while (node->next_ == NULL) {
    boost::this_thread::yield();
}