sem_wait() failed to wake up on linux - c++

I have a real-time application that uses a shared FIFO. There are several writer processes and one reader process. Data is periodically written into the FIFO and constantly drained. Theoretically the FIFO should never overflow because the reading speed is faster than all writers combined. However, the FIFO does overflow.
I tried to reproduce the problem and finally worked out the following (simplified) code:
#include <stdint.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cassert>
#include <pthread.h>
#include <semaphore.h>
#include <sys/time.h>
#include <unistd.h>
class Fifo
{
public:
Fifo() : _deq(0), _wptr(0), _rptr(0), _lock(0)
{
memset(_data, 0, sizeof(_data));
sem_init(&_data_avail, 1, 0);
}
~Fifo()
{
sem_destroy(&_data_avail);
}
void Enqueue()
{
struct timeval tv;
gettimeofday(&tv, NULL);
uint64_t enq = tv.tv_usec + tv.tv_sec * 1000000;
while (__sync_lock_test_and_set(&_lock, 1))
sched_yield();
uint8_t wptr = _wptr;
uint8_t next_wptr = (wptr + 1) % c_entries;
int retry = 0;
while (next_wptr == _rptr) // will become full
{
printf("retry=%u enq=%lu deq=%lu count=%d\n", retry, enq, _deq, Count());
for (uint8_t i = _rptr; i != _wptr; i = (i+1)%c_entries)
printf("%u: %lu\n", i, _data[i]);
assert(retry++ < 2);
usleep(500);
}
assert(__sync_bool_compare_and_swap(&_wptr, wptr, next_wptr));
_data[wptr] = enq;
__sync_lock_release(&_lock);
sem_post(&_data_avail);
}
int Dequeue()
{
struct timeval tv;
gettimeofday(&tv, NULL);
uint64_t deq = tv.tv_usec + tv.tv_sec * 1000000;
_deq = deq;
uint8_t rptr = _rptr, wptr = _wptr;
uint8_t next_rptr = (rptr + 1) % c_entries;
bool empty = Count() == 0;
assert(!sem_wait(&_data_avail));// bug in sem_wait?
_deq = 0;
uint64_t enq = _data[rptr]; // enqueue time
assert(__sync_bool_compare_and_swap(&_rptr, rptr, next_rptr));
int latency = deq - enq; // latency from enqueue to dequeue
if (empty && latency < -500)
{
printf("before dequeue: w=%u r=%u; after dequeue: w=%u r=%u; %d\n", wptr, rptr, _wptr, _rptr, latency);
}
return latency;
}
int Count()
{
int count = 0;
assert(!sem_getvalue(&_data_avail, &count));
return count;
}
static const unsigned c_entries = 16;
private:
sem_t _data_avail;
uint64_t _data[c_entries];
volatile uint64_t _deq; // non-0 indicates when dequeue happened
volatile uint8_t _wptr, _rptr; // write, read pointers
volatile uint8_t _lock; // write lock
};
static const unsigned c_total = 10000000;
static const unsigned c_writers = 3;
static Fifo s_fifo;
// writer thread
void* Writer(void* arg)
{
for (unsigned i = 0; i < c_total; i++)
{
int t = rand() % 200 + 200; // [200, 399]
usleep(t);
s_fifo.Enqueue();
}
return NULL;
}
int main()
{
pthread_t thread[c_writers];
for (unsigned i = 0; i < c_writers; i++)
pthread_create(&thread[i], NULL, Writer, NULL);
for (unsigned total = 0; total < c_total*c_writers; total++)
s_fifo.Dequeue();
}
When Enqueue() overflows, the debug print indicates that Dequeue() is stuck (because _deq is not 0). The only place where Dequeue() can get stuck is sem_wait(). However, since the fifo is full (also confirmed by sem_getvalue()), I don't understand how that could happen. Even after several retries (each waits 500us) the fifo was still full even though Dequeue() should definitely drain while Enqueue() is completely stopped (busy retrying).
In the code example, there are 3 writers, each writing every 200-400us. On my computer (8-core i7-2860 running centOS 6.5 kernel 2.6.32-279.22.1.el6.x86_64, g++ 4.47 20120313), the code would fail in a few minutes. I also tried on several other centOS systems and it also failed the same way.
I know that making the fifo bigger can reduce overflow probability (in fact, the program still fails with c_entries=128), but in my real-time application there is hard constraint on enqueue-dequeue latency, so data must be drained quickly. If it's not a bug in sem_wait(), then what prevents it from getting the semaphore?
P.S. If I replace
assert(!sem_wait(&_data_avail));// bug in sem_wait?
with
while (sem_trywait(&_data_avail) < 0) sched_yield();
then the program runs fine. So it seems that there's something wrong in sem_wait() and/or scheduler.

You need to use a combination of sem_wait/sem_post calls to be able to manage your read and write threads.
Your enqueue thread performs a sem_post only and your dequeue performs sem_wait only call. you need to add sem_wait to the enqueue thread and a sem_post on the dequeue thread.
A long time ago, I implemented the ability to have multiple threads/process be able to read some shared memory and only one thread/process write to the shared memory. I used two semaphore, a write semaphore and a read semaphore. The read threads would wait until the write semaphore was not set and then it would set the read semaphore. The write threads would set the write semaphore and then wait until the read semaphore is not set. The read and write threads would then unset the set semaphores when they've completed their tasks. The read semaphore can have n threads lock the read semaphore at a time while the write semaphore can be lock by a single thread at a time.

If it's not a bug in sem_wait(), then what prevents it from getting
the semaphore?
Your program's impatience prevents it. There is no guarantee that the Dequeue() thread is scheduled within a given number of retries. If you change
assert(retry++ < 2);
to
retry++;
you'll see that the program happily continues the reader process sometimes after 8 or perhaps even more retries.
Why does Enqueue have to retry?
It has to retry simply because the main thread's Dequeue() hasn't been scheduled by then.
Dequeue speed is much faster than all writers combined.
Your program shows that this assumption is sometimes false. While apparently the execution time of Dequeue() is much shorter than that of the writers (due to the usleep(t)), this does not imply that Dequeue() is scheduled by the Completely Fair Scheduler more often - and for this the main reason is that you used a nondeterministic scheduling policy. man sched_yield:
sched_yield() is intended for use with read-time scheduling policies
(i.e., SCHED_FIFO or SCHED_RR). Use of sched_yield() with
nondeterministic scheduling policies such as SCHED_OTHER is
unspecified and very likely means your application design is broken.
If you insert
struct sched_param param = { .sched_priority = 1 };
if (sched_setscheduler(0, SCHED_FIFO, &param) < 0)
perror("sched_setscheduler");
at the start of main(), you'll likely see that your program performs as expected (when run with the appropriate priviledge).

Related

C++ Wait in main thread for future without while(true)

Question
I want to know if it is possible to wait in the main-Thread without any while(1)-loop.
I launch a few threads via std::async() and do calculation of numbers on each thread. After i start the threads i want to receive the results back. I do that with a std::future<>.get().
My problem
When i receive the result i call std::future.get(), which blocks the main thread until the calculation on the thread is done. This leads to some slower execution time, if one thread needs considerably more time then the following, where i could do some calculation with the finished results instead and then when the slowest thread is done i maybe have some some further calculation.
Is there a way to idle the main thread until ANY of the threads has finished running? I have thought of a callback function which wakes the main thread up, but i still don't know how to idle the main function without making it unresponsive for i.e. a second and not running a while(true) loop instead.
Current code
#include <iostream>
#include <future>
uint64_t calc_factorial(int start, int number);
int main()
{
uint64_t n = 1;
//The user entered number
uint64_t number = 0;
// get the user input
printf("Enter number (uint64_t): ");
scanf("%lu", &number);
std::future<uint64_t> results[4];
for (int i = 0; i < 4; i++)
{
// push to different cores
results[i] = std::async(std::launch::async, calc_factorial, i + 2, number);
}
for (int i = 0; i < 4; i++)
{
//retrieve result...I don't want to wait here if one threads needs more time than usual
n *= results[i].get();
}
// print n or the time needed
return 0;
}
uint64_t calc_factorial(int start, int number)
{
uint64_t n = 1;
for (int i = start; i <= number; i+=4) n *= i;
return n;
}
I prepared a code snippet which runs fine, I am using the GMP Lib for the big results, but the code runs with uint64_t instead if you enter small numbers.
Note
If you have compiled the GMP library for whatever reason on your PC already you could replace every uint64_t with mpz_class
I'd approach this somewhat differently.
Unless I have a fairly specific reason to do otherwise, I tend to approach most multithreaded code the same general way: use a (thread-safe) queue to transmit results. So create an instance of a thread-safe queue, and pass a reference to it to each of the threads that's doing to generate the data. The have whatever thread is going to collect the results grab them from the queue.
This makes it automatic (and trivial) that you create each result as it's produced, rather than getting stuck waiting for one after another has produced results.

Why would adding a delay improve data throughput in this multithreaded environment?

In my application, I have two threads, a producer (thread 1) and a consumer (thread 2). Each thread has an input and output interface (effectively a pointer to a list) that is connected to a third thread which serves as a router.
When the producer writes, it calls memcpy to copy data into a buffer and pushes the buffer into a list. Meanwhile, the router thread is round-robin searching through all the threads that are connected to it and monitoring their interfaces to see if any thread has data to send out. When it sees that thread 1's list is non-empty, it checks to determine which thread the data is intended for. The data is spliced into the destination thread's (in this case thread 2) input list, at which point thread 2 will malloc some memory, memcpy the data into it and return the pointer to this new region.
For my test, I'm measuring throughput to see how long it takes to send 100k messages of varying sizes. Thread 1 sends data of some size, thread 2 reads it and sends back a small reply message, which thread 1 reads. This would be one complete exchange. In the first test, in thread 1, I'm sending all 100k messages, and then reading 100k replies. In the second test, in thread 1, I'm alternating sending a message and waiting for the reply and repeating 100k times. In both tests, thread 2 is in a loop reading the message and sending a reply. I would expect test 1 to have higher throughput because the threads should spend less time waiting around. However, it has markedly worse throughput than test 2. I've measured how long individual function calls (to read/write) take in the two test cases and they invariably take longer in test 1 (based on the means and medians and no delay) though the numbers are of the same order of magnitude.
When I add a loop doing nothing into thread 1's sending loop in test 1, I see dramatically improved throughput for this case as opposed to not having the delay. My only guess is that adding a delay slows down the producer so the consumer can absorb the data which prevents its input list from growing very large. I'm wondering if there may be other explanations and if so, how I can test for them.
Edit
Unfortunately, my own code is just the test I described above which calls a library that actually performs the reads/writes, creates that third thread etc. It's difficult to make a minimal example out of it because the library is complex and not mine. I provide some pseudocode to illustrate the setup in more detail.
int NUM_ITERATIONS = 100000;
int msg_reply = 2; // size of the reply message in words
int msg_size = 512; // indicates 512 64 bit words
void generate(int iterations, int size, interface* out){
std::vector<long long> vec(size);
for(int i = 0; i < size; i++)
vec[i] = (long long) i;
for(int i = 0; i < iterations; i++)
out->lib_write((char*) vec.data(), size);
}
void receive(int iterations, int size, interface* in){
for(int i = 0; i < iterations; i++)
char* data = in->lib_read(size)
void producer(interface* in, interface* out){
// test 1
start = std::chrono::high_resolution_clock::now();
// write data of size msg_size, NUM_ITERATIONS times to out
generate(NUM_ITERATIONS, msg_size, out);
// read data of size msg_reply, NUM_ITERATIONS times from in
receive(NUM_ITERATIONS, msg_reply, in);
end = std::chrono::high_resolution_clock::now();
// using NUM_ITERATIONS, msg_size and time, compute and print throughput to stdout
print_throughput(end-start, "throughput_0", msg_size);
// test 2
start = std::chrono::high_resolution_clock::now();
for(int j = 0; j < NUM_ITERATIONS; j++){
generate(1, msg_size, out);
receive(1, msg_reply, in);
}
end = std::chrono::high_resolution_clock::now();
print_throughput(end-start, "throughput_1", msg_size);
}
void consumer(interface* in, interface* out){
for(int i = 0; i < 2; i++}{
for(int j = 0; j < NUM_ITERATIONS; j++){
receive(1, msg_size, in);
generate(1, msg_reply, out);
}
}
}
The calls to lib_write() and lib_read() become fairly complex. To elaborate on the description above, the data gets memcpy'd into a buffer and then moved into a list. The interface has a condition variable member and the write calls its notify_one() method. The third thread is looping through all the interface pointers it has and checking to see if their lists are non-empty. If so, the data is spliced from one output list to the destination's input list using the splice() method in std::list. Meanwhile, the consumer calls the lib_read() which waits on the condition variable while the interface is empty, and then memcpy's the data into a new region and returns it.
// note: these will not compile as is. Undefined variables are class members
char * interface::lib_read(size_t * _size){
char * ret;
{
std::unique_lock<std::mutex> lock(mutex);
// packets is an std::list containing the incoming data
while (packets.empty()) {
cv.wait(lock);
}
curr_read_it = packets.begin();
}
size_t buff_size = curr_read_it->size;
ret = (char *)malloc(buff_size);
memcpy((char *)ret, (char *)curr_read_it->data, buff_size);
{
std::unique_lock<std::mutex> lock(mutex);
packets.erase(curr_read_it);
curr_read_it = packets.end();
}
return ret;
}
void interface::lib_write(char * data, int size){
// indicates the destination thread id
long long header = 1;
// buffer is a just an array that's max packet sized
memcpy((char *)buffer.data, &header, sizeof(long long));
memcpy((char *)buffer.data + sizeof(long long), (char *)data, size * sizeof(long long));
std::lock_guard<std::mutex> guard(mutex);
packets.push_back(std::move(buffer));
cv.notify_one();
}
// this is on thread 3
void route(){
do{
// this is a vector containing all the "out" interfaces
for(int i = 0; i < out_ptrs.size(); i++){
interface <long long> * _out = out_ptrs[i];
if(!_out->empty()){
// this just returns the header id (also locks the mutex)
long long dest= _out->get_dest();
// looks up the correct interface based on the id and splices
// a packet into from _out to the appropriate one. Locks mutex
in_ptrs[dest_map[dest]]->splice(_out);
}
}
}while(!done());
I was looking for general advice on what factors may influence multithreading performance and what to test for in order to better understand what was going on.
I talked to some other people and the advice I got that was helpful was to determine if the OS scheduling was the issue (which is what I suspected but was unsure how to test). Essentially, I used taskset and sched_affinity() to force the application to run on one core or on a subset of cores and looked at how they compared to each other and to the unrestricted case.
Based on the restrictions, I got dramatically different results and could see some trends so I'm pretty confident in saying that it's an OS scheduling issue. Different ones can yield better performance under different workloads.

How to pass parameters to threads that have already started running

Hi I have an application that uses one thread to copy buffer from *src to *dst but I want to have the thread started at the beginning of the program. When I want to use the thread I want to pass *src, *dst and size to the thread so that it can start copying the buffer. How do I achieve this ? Because when I start a thread I pass the values as I instantiate the object, ThreadX when creating a thread.
Thread^ t0 = gcnew Thread(gcnew ThreadStart(gcnew ThreadX(output, input, size), &ThreadX::ThreadEntryPoint));
To summarize I want to do this way:
program starts
create a thread
wait in the thread
pass the parameters and wake up the thread to start copying
once copying is done in the thread then let the main thread know it's done
just before the program ends join the thread
The sample code is shown below.
Thanks!
#include "stdafx.h"
#include <iostream>
#if 1
using namespace System;
using namespace System::Diagnostics;
using namespace System::Runtime::InteropServices;
using namespace System::Threading;
public ref class ThreadX
{
unsigned short* destination;
unsigned short* source;
unsigned int num;
public:
ThreadX(unsigned short* dstPtr, unsigned short* srcPtr, unsigned int size)
{
destination = dstPtr;
source = srcPtr;
num = size;
}
void ThreadEntryPoint()
{
memcpy(destination, source, sizeof(unsigned short)*num);
}
};
int main()
{
int size = 5056 * 2960 * 10; //iris 15 size
unsigned short* input; //16bit
unsigned short* output;
Stopwatch^ sw = gcnew Stopwatch();
input = new unsigned short[size];
output = new unsigned short[size];
//elapsed time for each test
int sw0;
int sw1;
int sw2;
int sw3;
//initialize input
for (int i = 0; i < size; i++) { input[i] = i % 0xffff; }
//initialize output
for (int i = 0; i < size; i++) { output[i] = 0; }
// TEST 1 //////////////////////////////////////////////////////////////////////
for (int i = 0; i < size; i++) { output[i] = 0; }
//-----------------------------------------------------------------------
Thread^ t0 = gcnew Thread(gcnew ThreadStart(gcnew ThreadX(output, input, size), &ThreadX::ThreadEntryPoint));
t0->Name = "t1";
t0->Start();
t0->Join();
//-----------------------------------------------------------------------
return 0
}
Basically, you need some basic building blocks for solving this problem (I am assuming you want to perform this copy operation just once. We can easily extend the solution if you have a constant stream of inputs):
1) Shared memory - For exchanging control information. In this case, it would be source buffer pointer, destination buffer pointer and size (from main thread to worker thread). You would also want some datastructure (lets begin with a simple boolean flag) to share information in the reverse direction (from worker thread to main thread), when the work is done.
2) Conditional variable - To send signal from main thread to worker thread, and in reverse direction. So, you need 2 different conditional variables.
3) A synchronization primitive like a mutex to protect the shared memory (Since they would be simultaneously accessed by both threads)
Given these building blocks, the pseudo code of your program will look like this:
struct Control {
void* src, *dest;
int num_of_bytes = -1;
bool isDone = false;
conditional_var inputReceived;
conditional_var copyDone;
mutex m;
};
void childThread() {
m.lock();
while (num_of_bytes == -1) {
inputReceived.wait(m); // wait till you receive input.
}
// Input received. Make sure you set src and dest pointers, before setting num_of_bytes
mempcy(dest, src, num_of_bytes);
isDone = true; // mark work completion.
copyDone.notify(); // notify the main thread of work completion.
m.unlock();
}
void mainThread()
{
// Create worker thread at start;
thread_t thread = pthread_create(&childThread);
// Do other stuff...
//
//
// Input parameters received. Set control information, and notify the
// workerthread.
mutex.lock();
src = input.src;
dest = input.dest;
num_of_bytes = input.num_of_bytes;
inputReceived.notify(); // wake up worker thread.
while (!isDone) { // Wait for copy to be over.
copyDone.wait();
}
m.unlock(); // unlock the mutex.
thread.join(); // wait for thread to join. If the thread has already ended before we execute thread join, it will return immediately.
}
If you want to extend this solution to handle a stream of inputs, we can use 2 queues for requests and responses, with each element of the queue being input and output parameters.
Don't suspend the thread. That's bad design, and is highly likely to cause you problems down the road.
Instead, think of it like this: Have the thread block waiting for information on what it should do. When it gets that information, it should unblock, do the work, then block again waiting for the next thing.
A quick search for "C# blocking collection" reveals the BlockingCollection<T> class, and this guide to cancelling one of the blocking operations. Make it so that you activate the CancellationToken when it's time for your thread to exit, and have the thread sit waiting at the blocking operation when it's not doing work.

Inconsistent timings when passing data between two threads

I have a piece of code that I use to test various containers (e.g. deque and a circular buffer) when passing data from a producer (thread 1) to a consumer (thread 2). A data is represented by a struct with a pair of timestamps. First timestamp is taken before push in the producer, and the second one is taken when data is popped by the consumer.
The container is protected with a pthread spinlock.
The machine runs redhat 5.5 with 2.6.18 kernel (old!), it is a 4-core system with hyperthreading disabled. gcc 4.7 with -std=c++11 flag was used in all tests.
Producer acquires the lock, timestamps the data and pushes it into the queue, unlocks and sleeps in a busy loop for 2 microseconds (the only reliable way I found to sleep for precisely 2 micros on that system).
Consumer locks, pops the data, timestamps it and generates some statistics (running mean delay and standard deviation). The stats is printed every 5 seconds (M is the mean, M2 is the std dev) and reset. I used gettimeofday() to obtain the timestamps, which means that the mean delay number can be thought of as the percentage of delays that exceed 1 microsecond.
Most of the time the output looks like this:
CNT=2500000 M=0.00935 M2=0.910238
CNT=2500000 M=0.0204112 M2=1.57601
CNT=2500000 M=0.0045016 M2=0.372065
but sometimes (probably 1 trial out of 20) like this:
CNT=2500000 M=0.523413 M2=4.83898
CNT=2500000 M=0.558525 M2=4.98872
CNT=2500000 M=0.581157 M2=5.05889
(note the mean number is much worse than in the first case, and it never recovers as the program runs).
I would appreciate thoughts on why this could happen. Thanks.
#include <iostream>
#include <string.h>
#include <stdexcept>
#include <sys/time.h>
#include <deque>
#include <thread>
#include <cstdint>
#include <cmath>
#include <unistd.h>
#include <xmmintrin.h> // _mm_pause()
int64_t timestamp() {
struct timeval tv;
gettimeofday(&tv, 0);
return 1000000L * tv.tv_sec + tv.tv_usec;
}
//running mean and a second moment
struct StatsM2 {
StatsM2() {}
double m = 0;
double m2 = 0;
long count = 0;
inline void update(long x, long c) {
count = c;
double delta = x - m;
m += delta / count;
m2 += delta * (x - m);
}
inline void reset() {
m = m2 = 0;
count = 0;
}
inline double getM2() { // running second moment
return (count > 1) ? m2 / (count - 1) : 0.;
}
inline double getDeviation() {
return std::sqrt(getM2() );
}
inline double getM() { // running mean
return m;
}
};
// pause for usec microseconds using busy loop
int64_t busyloop_microsec_sleep(unsigned long usec) {
int64_t t, tend;
tend = t = timestamp();
tend += usec;
while (t < tend) {
t = timestamp();
}
return t;
}
struct Data {
Data() : time_produced(timestamp() ) {}
int64_t time_produced;
int64_t time_consumed;
};
int64_t sleep_interval = 2;
StatsM2 statsm2;
std::deque<Data> queue;
bool producer_running = true;
bool consumer_running = true;
pthread_spinlock_t spin;
void producer() {
producer_running = true;
while(producer_running) {
pthread_spin_lock(&spin);
queue.push_back(Data() );
pthread_spin_unlock(&spin);
busyloop_microsec_sleep(sleep_interval);
}
}
void consumer() {
int64_t count = 0;
int64_t print_at = 1000000/sleep_interval * 5;
Data data;
consumer_running = true;
while (consumer_running) {
pthread_spin_lock(&spin);
if (queue.empty() ) {
pthread_spin_unlock(&spin);
// _mm_pause();
continue;
}
data = queue.front();
queue.pop_front();
pthread_spin_unlock(&spin);
++count;
data.time_consumed = timestamp();
statsm2.update(data.time_consumed - data.time_produced, count);
if (count >= print_at) {
std::cerr << "CNT=" << count << " M=" << statsm2.getM() << " M2=" << statsm2.getDeviation() << "\n";
statsm2.reset();
count = 0;
}
}
}
int main(void) {
if (pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE) < 0)
exit(2);
std::thread consumer_thread(consumer);
std::thread producer_thread(producer);
sleep(40);
consumer_running = false;
producer_running = false;
consumer_thread.join();
producer_thread.join();
return 0;
}
EDIT:
I believe that 5 below is the only thing that can explain 1/2 second latency. When on the same core, each would run for a long time and only then switch to the other.
The rest of the things on the list are too small to cause a 1/2 second delay.
You can use pthread_setaffinity_np to pin your threads to specific cores. You can try different combinations and see how performance changes.
EDIT #2:
More things you should take care of: (who said testing was simple...)
1. Make sure the consumer is already running when the producer starts producing. Not too important in your case as the producer is not really producing in a tight loop.
2. This is very important: you divide by count every time, which is not the right thing to do for your stats. This means that the first measurement in every stats window weight a lot more than the last. To measure the median you have to collect all the values. Measuring the average and min/max, without collecting all numbers, should give you a good enough picture of the latency.
It's not surprising, really.
1. The time is taken in Data(), but then the container spends time calling malloc.
2. Are you running 64 bit or 32? In 32 bit gettimeofday is a system call while in 64 bit it's a VDSO that doesn't get into the kernel... you may want to time gettimeofday itself and record the variance. Or enroll your own using rdtsc.
The best would be to use cycles instead of micros because micros are really too big for this scenario... only the rounding to micros gets you very much skewed when dealing with such a small scale of things
3. Are you guaranteed to not get preempted between producer and consumer? I guess that not. But this should not happen very frequently on a box dedicated to testing...
4. Is it 4 cores on a single socket or 2? if it's a 2 socket box, you want to have the 2 threads on the same socket, or you pay (at least) double for data transfer.
5. Make sure the threads are not running on the same core.
6. If the Data you transfer and the additional data (container node) are sharing cache lines (kind of likely) with other Data+node, the producer would be delayed by the consumer when it writes to the consumed timestamp. This is called false sharing. You can eliminate this by padding/aligning to 64 bytes and using an intrusive container.
gettimeofday is not a good way to profile computation overhead. It is the wall clock and your computer is multiprocessing. Even you think you are not running anything else, the OS scheduler always has some other activities to keep the system running. To profile your process overhead, you have to at least raise the priority of the process you are profiling. Also use high resolution timer or cpu ticks to do the timing measure.

C++ Low-Latency Threaded Asynchronous Buffered Stream (intended for logging) – Boost

Question:
3 while loops below contain code that has been commented out. I search for ("TAG1", "TAG2", and "TAG3") for easy identification. I simply want the while loops to wait on the condition tested to become true before proceeding while minimizing CPU resources as much as possible. I first tried using Boost condition variables, but there's a race condition. Putting the thread to sleep for 'x' microseconds is inefficient because there is no way to precisely time the wakeup. Finally, boost::this_thread::yield() does not seem to do anything. Probably because I only have 2 active threads on a dual-core system. Specifically, how can I make the three tagged areas below run more efficiently while introducing as little unnecessary blocking as possible.
BACKGROUND
Objective:
I have an application that logs a lot of data. After profiling, I found that much time is consumed on the logging operations (logging text or binary to a file on the local hard disk). My objective is to reduce the latency on logData calls by replacing non-threaded direct write calls with calls to a threaded buffered stream logger.
Options Explored:
Upgrade 2005-era slow hard disk to SSD...possible. Cost is not prohibitive...but involves a lot of work... more than 200 computers would have to be upgraded...
Boost ASIO...I don't need all the proactor / networking overhead, looking for something simpler and more light-weight.
Design:
Producer and consumer thread pattern, the application writes data into a buffer and a background thread then writes it to disk sometime later. So the ultimate goal is to have the writeMessage function called by the application layer return as fast as possible while data is correctly / completely logged to the log file in a FIFO order sometime later.
Only one application thread, only one writer thread.
Based on ring buffer. The reason for this decision is to use as few locks as possible and ideally...and please correct me if I'm wrong...I don't think I need any.
Buffer is a statically-allocated character array, but could move it to the heap if needed / desired for performance reasons.
Buffer has a start pointer that points to the next character that should be written to the file. Buffer has an end pointer that points to the array index after the last character to be written to the file. The end pointer NEVER passes the start pointer. If a message comes in that is larger than the buffer, then the writer waits until the buffer is emptied and writes the new message to the file directly without putting the over-sized message in the buffer (once the buffer is emptied, the worker thread won't be writing anything so no contention).
The writer (worker thread) only updates the ring buffer's start pointer.
The main (application thread) only updates the ring buffer's end pointer, and again, it only inserts new data into the buffer when there is available space...otherwise it either waits for space in the buffer to become available or writes directly as described above.
The worker thread continuously checks to see if there is data to be written (indicated by the case when the buffer start pointer != buffer end pointer). If there is no data to be written, the worker thread should ideally go to sleep and wake up once the application thread has inserted something into the buffer (and changed the buffer's end pointer such that it no longer points to the same index as the start pointer). What I have below involves while loops continuously checking that condition. It is a very bad / inefficient way of waiting on the buffer.
Results:
On my 2009-era dual-core laptop with SSD, I see that the total write time of the threaded / buffered benchmark vs. direct write is about 1 : 6 (0.609 sec vs. 0.095 sec), but highly variable. Often the buffered write benchmark is actually slower than direct write. I believe that the variability is due to the poor implementation of waiting for space to free up in the buffer, waiting for the buffer to empty, and having the worker-thread wait for work to become available. I have measured that some of the while loops consume over 10000 cycles and I suspect that those cycles are actually competing for hardware resources that the other thread (worker or application) requires to finish the computation being waited on.
Output seems to check out. With TEST mode enabled and a small buffer size of 10 as a stress test, I diffed hundreds of MBs of output and found it to equal the input.
Compiles with current version of Boost (1.55)
Header
#ifndef BufferedLogStream_h
#define BufferedLogStream_h
#include <stdio.h>
#include <iostream>
#include <iostream>
#include <cstdlib>
#include "boost\chrono\chrono.hpp"
#include "boost\thread\thread.hpp"
#include "boost\thread\locks.hpp"
#include "boost\thread\mutex.hpp"
#include "boost\thread\condition_variable.hpp"
#include <time.h>
using namespace std;
#define BENCHMARK_STR_SIZE 128
#define NUM_BENCHMARK_WRITES 524288
#define TEST 0
#define BENCHMARK 1
#define WORKER_LOOP_WAIT_MICROSEC 20
#define MAIN_LOOP_WAIT_MICROSEC 10
#if(TEST)
#define BUFFER_SIZE 10
#else
#define BUFFER_SIZE 33554432 //4 MB
#endif
class BufferedLogStream {
public:
BufferedLogStream();
void openFile(char* filename);
void flush();
void close();
inline void writeMessage(const char* message, unsigned int length);
void writeMessage(string message);
bool operator() () { return start != end; }
private:
void threadedWriter();
inline bool hasSomethingToWrite();
inline unsigned int getFreeSpaceInBuffer();
void appendStringToBuffer(const char* message, unsigned int length);
FILE* fp;
char* start;
char* end;
char* endofringbuffer;
char ringbuffer[BUFFER_SIZE];
bool workerthreadkeepalive;
boost::mutex mtx;
boost::condition_variable waitforempty;
boost::mutex workmtx;
boost::condition_variable waitforwork;
#if(TEST)
struct testbuffer {
int length;
char message[BUFFER_SIZE * 2];
};
public:
void test();
private:
void getNextRandomTest(testbuffer &tb);
FILE* datatowrite;
#endif
#if(BENCHMARK)
public:
void runBenchmark();
private:
void initBenchmarkString();
void runDirectWriteBaseline();
void runBufferedWriteBenchmark();
char benchmarkstr[BENCHMARK_STR_SIZE];
#endif
};
#if(TEST)
int main() {
BufferedLogStream* bl = new BufferedLogStream();
bl->openFile("replicated.txt");
bl->test();
bl->close();
cout << "Done" << endl;
cin.get();
return 0;
}
#endif
#if(BENCHMARK)
int main() {
BufferedLogStream* bl = new BufferedLogStream();
bl->runBenchmark();
cout << "Done" << endl;
cin.get();
return 0;
}
#endif //for benchmark
#endif
Implementation
#include "BufferedLogStream.h"
BufferedLogStream::BufferedLogStream() {
fp = NULL;
start = ringbuffer;
end = ringbuffer;
endofringbuffer = ringbuffer + BUFFER_SIZE;
workerthreadkeepalive = true;
}
void BufferedLogStream::openFile(char* filename) {
if(fp) close();
workerthreadkeepalive = true;
boost::thread t2(&BufferedLogStream::threadedWriter, this);
fp = fopen(filename, "w+b");
}
void BufferedLogStream::flush() {
fflush(fp);
}
void BufferedLogStream::close() {
workerthreadkeepalive = false;
if(!fp) return;
while(hasSomethingToWrite()) {
boost::unique_lock<boost::mutex> u(mtx);
waitforempty.wait_for(u, boost::chrono::microseconds(MAIN_LOOP_WAIT_MICROSEC));
}
flush();
fclose(fp);
fp = NULL;
}
void BufferedLogStream::threadedWriter() {
while(true) {
if(start != end) {
char* currentend = end;
if(start < currentend) {
fwrite(start, 1, currentend - start, fp);
}
else if(start > currentend) {
if(start != endofringbuffer) fwrite(start, 1, endofringbuffer - start, fp);
fwrite(ringbuffer, 1, currentend - ringbuffer, fp);
}
start = currentend;
waitforempty.notify_one();
}
else { //start == end...no work to do
if(!workerthreadkeepalive) return;
boost::unique_lock<boost::mutex> u(workmtx);
waitforwork.wait_for(u, boost::chrono::microseconds(WORKER_LOOP_WAIT_MICROSEC));
}
}
}
bool BufferedLogStream::hasSomethingToWrite() {
return start != end;
}
void BufferedLogStream::writeMessage(string message) {
writeMessage(message.c_str(), message.length());
}
unsigned int BufferedLogStream::getFreeSpaceInBuffer() {
if(end > start) return (start - ringbuffer) + (endofringbuffer - end) - 1;
if(end == start) return BUFFER_SIZE-1;
return start - end - 1; //case where start > end
}
void BufferedLogStream::appendStringToBuffer(const char* message, unsigned int length) {
if(end + length <= endofringbuffer) { //most common case for appropriately-sized buffer
memcpy(end, message, length);
end += length;
}
else {
int lengthtoendofbuffer = endofringbuffer - end;
if(lengthtoendofbuffer > 0) memcpy(end, message, lengthtoendofbuffer);
int remainderlength = length - lengthtoendofbuffer;
memcpy(ringbuffer, message + lengthtoendofbuffer, remainderlength);
end = ringbuffer + remainderlength;
}
}
void BufferedLogStream::writeMessage(const char* message, unsigned int length) {
if(length > BUFFER_SIZE - 1) { //if string is too large for buffer, wait for buffer to empty and bypass buffer, write directly to file
while(hasSomethingToWrite()); {
boost::unique_lock<boost::mutex> u(mtx);
waitforempty.wait_for(u, boost::chrono::microseconds(MAIN_LOOP_WAIT_MICROSEC));
}
fwrite(message, 1, length, fp);
}
else {
//wait until there is enough free space to insert new string
while(getFreeSpaceInBuffer() < length) {
boost::unique_lock<boost::mutex> u(mtx);
waitforempty.wait_for(u, boost::chrono::microseconds(MAIN_LOOP_WAIT_MICROSEC));
}
appendStringToBuffer(message, length);
}
waitforwork.notify_one();
}
#if(TEST)
void BufferedLogStream::getNextRandomTest(testbuffer &tb) {
tb.length = 1 + (rand() % (int)(BUFFER_SIZE * 1.05));
for(int i = 0; i < tb.length; i++) {
tb.message[i] = rand() % 26 + 65;
}
tb.message[tb.length] = '\n';
tb.length++;
tb.message[tb.length] = '\0';
}
void BufferedLogStream::test() {
cout << "Buffer size is: " << BUFFER_SIZE << endl;
testbuffer tb;
datatowrite = fopen("orig.txt", "w+b");
for(unsigned int i = 0; i < 7000000; i++) {
if(i % 1000000 == 0) cout << i << endl;
getNextRandomTest(tb);
writeMessage(tb.message, tb.length);
fwrite(tb.message, 1, tb.length, datatowrite);
}
fflush(datatowrite);
fclose(datatowrite);
}
#endif
#if(BENCHMARK)
void BufferedLogStream::initBenchmarkString() {
for(unsigned int i = 0; i < BENCHMARK_STR_SIZE - 1; i++) {
benchmarkstr[i] = rand() % 26 + 65;
}
benchmarkstr[BENCHMARK_STR_SIZE - 1] = '\n';
}
void BufferedLogStream::runDirectWriteBaseline() {
clock_t starttime = clock();
fp = fopen("BenchMarkBaseline.txt", "w+b");
for(unsigned int i = 0; i < NUM_BENCHMARK_WRITES; i++) {
fwrite(benchmarkstr, 1, BENCHMARK_STR_SIZE, fp);
}
fflush(fp);
fclose(fp);
clock_t elapsedtime = clock() - starttime;
cout << "Direct write baseline took " << ((double) elapsedtime) / CLOCKS_PER_SEC << " seconds." << endl;
}
void BufferedLogStream::runBufferedWriteBenchmark() {
clock_t starttime = clock();
openFile("BufferedBenchmark.txt");
cout << "Opend file" << endl;
for(unsigned int i = 0; i < NUM_BENCHMARK_WRITES; i++) {
writeMessage(benchmarkstr, BENCHMARK_STR_SIZE);
}
cout << "Wrote" << endl;
close();
cout << "Close" << endl;
clock_t elapsedtime = clock() - starttime;
cout << "Buffered write took " << ((double) elapsedtime) / CLOCKS_PER_SEC << " seconds." << endl;
}
void BufferedLogStream::runBenchmark() {
cout << "Buffer size is: " << BUFFER_SIZE << endl;
initBenchmarkString();
runDirectWriteBaseline();
runBufferedWriteBenchmark();
}
#endif
Update: November 25, 2013
I updated the code below use boost::condition_variables, specifically the wait_for() method as recommended by Evgeny Panasyuk. This avoids unnecessarily checking the same condition over and over again. I am currently seeing the buffered version run in about 1/6th the time as the unbuffered / direct-write version. This is not the ideal case because both cases are limited by the hard disk (in my case a 2010 era SSD). I plan to use the code below in an environment where the hard disk will not be the bottleneck and most if not all the time, the buffer should have space available to accommodate the writeMessage requests. That brings me to my next question. How big should I make the buffer? I don't mind allocating 32 MBs or 64 MB to ensure that it never fills up. The code will be running on systems that can spare that. Intuitively, I feel that it's a bad idea to statically allocate a 32 MB character array. Is it? Anyhow, I expect that when I run the code below for my intended application, the latency of logData() calls will be greatly reduced which will yield a significant reduction in overall processing time.
If anyone sees any way to make the code below better (faster, more robust, leaner, etc), please let me know. I appreciate the feedback. Lazin, how would your approach be faster or more efficient than what I have posted below? I kinda like the idea of just having one buffer and making it large enough so that it practically never fills up. Then I don't have to worry about reading from different buffers. Evgeny Panasyuk, I like the approach of using existing code whenever possible, especially if it's an existing boost library. However, I also don't see how the spcs_queue is more efficient than what I have below. I'd rather deal with one large buffer than many smaller ones and have to worry about splitting splitting my input stream on the input and splicing it back together on the output. Your approach would allow me to offload the formatting from the main thread onto the worker thread. That is a cleaver approach. But I'm not sure yet whether it will save a lot of time and to realize the full benefit, I would have to modify code that I do not own.
//End Update
General solution.
I think you must look at the Naggle algorithm. For one producer and one consumer this would look like this:
At the beginning buffer is empty, worker thread is idle and waiting for the events.
Producer writes data to the buffer and notifies worker thread.
Worker thread woke up and start the write operation.
Producer tries to write another message, but buffer is used by worker, so producer allocates another buffer and writes message to it.
Producer tries to write another message, I/O still in progress so producer writes message to previously allocated buffer.
Worker thread done writing buffer to file and sees that there is another buffer with data so it grabs it and starts to write.
The very first buffer is used by producer to write all consecutive messages, until second write operation in progress.
This schema will help achieve low latency requirement, single message will be written to disc instantaneously, but large amount of events will be written by large batches for greather throughput.
If your log messages have levels - you can improve this schema a little bit. All error messages have high priority(level) and must be saved on disc immediately (because they are rare but very valuable) but debug and trace messages have low priority and can be buffered to save bandwidth (because they are very frequent but not as valuable as error and info messages). So when you write error message, you must wait until worker thread is done writing your message (and all messages that are in the same buffer) and then continue, but debug and trace messages can be just written to buffer.
Threading.
Spawning worker thread for each application thread is to costly. You must use single writer thread for each log file. Write buffers must be shared between threads. Each buffer must have two pointers - commit_pointer and prepare_pointer. All buffer space between beginning of the buffer and commit_pointer are available for worker thread. Buffer space between commit_pointer and prepare_pointer are currently updated by application threads. Invariant: commit_pointer <= prepare_pointer.
Write operations can be performed in two steps.
Prepare write. This operation reserves space in a buffer.
Producer calculates len(message) and atomically updates prepare_pointer;
Old prepare_pointer value and len is saved by consumer;
Commit write.
Producer writes message at the beginning of the reserved buffer space (old prepare_pointer value).
Producer busy-waits until commit_pointer is equal to old prepare_pointer value that its save in local variable.
Producer commit write operation by doing commit_pointer = commit_pointer + len atomically.
To prevent false sharing, len(message) can be rounded to cache line size and all extra space can be filled with spaces.
// pseudocode
void write(const char* message) {
int len = strlen(message); // TODO: round to cache line size
const char* old_prepare_ptr;
// Prepare step
while(1)
{
old_prepare_ptr = prepare_ptr;
if (
CAS(&prepare_ptr,
old_prepare_ptr,
prepare_ptr + len) == old_prepare_ptr
)
break;
// retry if another thread perform prepare op.
}
// Write message
memcpy((void*)old_prepare_ptr, (void*)message, len);
// Commit step
while(1)
{
const char* old_commit_ptr = commit_ptr;
if (
CAS(&commit_ptr,
old_commit_ptr,
old_commit_ptr + len) == old_commit_ptr
)
break;
// retry if another thread commits
}
notify_worker_thread();
}
concurrent_queue<T, Size>
The question that I have is how to make the worker thread go to work as soon as there is work to do and sleep when there is no work.
There is boost::lockfree::spsc_queue - wait-free single-producer single-consumer queue. It can be configured to have compile-time capacity (the size of the internal ringbuffer).
From what I understand, you want something similar to following configuration:
template<typename T, size_t N>
class concurrent_queue
{
// T can be wrapped into struct with padding in order to avoid false sharing
mutable boost::lockfree::spsc_queue<T, boost::lockfree::capacity<N>> q;
mutable mutex m;
mutable condition_variable c;
void wait() const
{
unique_lock<mutex> u(m);
c.wait_for(u, chrono::microseconds(1)); // Or whatever period you need.
// Timeout is required, because modification happens not under mutex
// and notification can be lost.
// Another option is just to use sleep/yield, without notifications.
}
void notify() const
{
c.notify_one();
}
public:
void push(const T &t)
{
while(!q.push(t))
wait();
notify();
}
void pop(T &result)
{
while(!q.pop(result))
wait();
notify();
}
};
When there are elements in queue - pop does not block. And when there is enough space in internal buffer - push does not block.
concurrent<T>
I want to reduce both formatting and write times as much as possible so I plan to reduce both.
Check out Herb Sutter talk at C++ and Beyond 2012: C++ Concurrency. At page 14 he shows example of concurrent<T>. Basically it is wrapper around object of type T which starts separate thread for performing all operations on that object. Usage is:
concurrent<ostream*> x(&cout); // starts thread internally
// ...
// x acts as function object.
// It's function call operator accepts action
// which is performed on wrapped object in separate thread.
int i = 42;
x([i](ostream *out){ *out << "i=" << i; }); // passing lambda as action
You can use similar pattern in order to offload all formatting work to consumer thread.
Small Object Optimization
Otherwise, new buffers are allocated and I want to avoid memory allocation after the buffer stream is constructed.
Above concurrent_queue<T, Size> example uses fixed-size buffer which is fully contained within queue, and does not imply additional allocations.
However, Herb's concurrent<T> example uses std::function to pass action into worker thread. That may incur costly allocation.
std::function implementations may use Small Object Optimization (and most implementations do) - small function objects are in-place copy-constructed in internal buffer, but there is no guarantee, and for function objects bigger than threshold - heap allocation would happen.
There are several options to avoid this allocation:
Implement std::function analog with internal buffer large enough to hold target function objects (for example, you can try to modify boost::function or this version).
Use your own function object which would represent all type of log messages. Basically it would contain just values required to format message. As potentially there are different types of messages, consider to use boost::variant (which is literary union coupled with type tag) to represent them.
Putting it all together, here is proof-of-concept (using second option):
LIVE DEMO
#include <boost/lockfree/spsc_queue.hpp>
#include <boost/optional.hpp>
#include <boost/variant.hpp>
#include <condition_variable>
#include <iostream>
#include <cstddef>
#include <thread>
#include <chrono>
#include <mutex>
using namespace std;
/*********************************************/
template<typename T, size_t N>
class concurrent_queue
{
mutable boost::lockfree::spsc_queue<T, boost::lockfree::capacity<N>> q;
mutable mutex m;
mutable condition_variable c;
void wait() const
{
unique_lock<mutex> u(m);
c.wait_for(u, chrono::microseconds(1));
}
void notify() const
{
c.notify_one();
}
public:
void push(const T &t)
{
while(!q.push(t))
wait();
notify();
}
void pop(T &result)
{
while(!q.pop(result))
wait();
notify();
}
};
/*********************************************/
template<typename T, typename F>
class concurrent
{
typedef boost::optional<F> Job;
mutable concurrent_queue<Job, 16> q; // use custom size
mutable T x;
thread worker;
public:
concurrent(T x)
: x{x}, worker{[this]
{
Job j;
while(true)
{
q.pop(j);
if(!j) break;
(*j)(this->x); // you may need to handle exceptions in some way
}
}}
{}
void operator()(const F &f)
{
q.push(Job{f});
}
~concurrent()
{
q.push(Job{});
worker.join();
}
};
/*********************************************/
struct LogEntry
{
struct Formatter
{
typedef void result_type;
ostream *out;
void operator()(double x) const
{
*out << "floating point: " << x << endl;
}
void operator()(int x) const
{
*out << "integer: " << x << endl;
}
};
boost::variant<int, double> data;
void operator()(ostream *out)
{
boost::apply_visitor(Formatter{out}, data);
}
};
/*********************************************/
int main()
{
concurrent<ostream*, LogEntry> log{&cout};
for(int i=0; i!=1024; ++i)
{
log({i});
log({i/10.});
}
}