Kernel copying CoW pages after child process exit - C++

In Linux, when a process is forked, the memory mappings of the parent process are cloned into the child process. In reality, for performance reasons, the pages are marked copy-on-write: initially they are shared and, as soon as one of the two processes writes to one of them, that page is cloned (MAP_PRIVATE semantics).
This is a very common mechanism for getting a snapshot of the state of a running program: you fork, and the child gets a consistent view of the process's memory at that point in time.
I did a simple benchmark where I have two components:
A parent process that has a pool of threads writing into an array
A child process that has a pool of threads making a snapshot of the array and unmapping it
Under some circumstances (machine/architecture/memory placement/number of threads/...), I can make the copy finish much earlier than the worker threads finish writing into the array.
However, even after the child process exits, htop still shows most of the CPU time being spent in the kernel, which is consistent with it being used to handle copy-on-write whenever the parent process writes to a page.
In my understanding, if an anonymous page marked as copy-on-write is mapped by a single process, it should not be copied and instead should be used directly.
How can I be sure that this is indeed time being spent copying the memory?
In case I'm right, how can I avoid this overhead?
The core of the benchmark is below, in modern C++.
Define WITH_FORK to enable the snapshot; leave undefined to disable the child process.
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <numaif.h>
#include <numa.h>
#include <algorithm>
#include <array>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <mutex>
#include <iomanip>
#include <iostream>
#include <cmath>
#include <numeric>
#include <thread>
#include <vector>

#define ARRAY_SIZE 1073741824 // 1GB
#define NUM_WORKERS 28
#define NUM_CHECKPOINTERS 4
#define BATCH_SIZE 2097152 // 2MB

using inttype = uint64_t;
using timepoint = std::chrono::time_point<std::chrono::high_resolution_clock>;

constexpr uint64_t NUM_ELEMS() {
    return ARRAY_SIZE / sizeof(inttype);
}
int main() {
    // allocate array
    std::array<inttype, NUM_ELEMS()> *arrayptr = new std::array<inttype, NUM_ELEMS()>();
    std::array<inttype, NUM_ELEMS()> &array = *arrayptr;

    // allocate checkpoint space
    std::array<inttype, NUM_ELEMS()> *cpptr = new std::array<inttype, NUM_ELEMS()>();
    std::array<inttype, NUM_ELEMS()> &cp = *cpptr;

    // initialize array
    std::fill(array.begin(), array.end(), 123);

#ifdef WITH_FORK
    // spawn the child process that takes the checkpoint
    int pid = fork();
    if (pid == -1) {
        perror("fork");
        exit(-1);
    }

    // child process -- do checkpoint
    if (pid == 0) {
        std::array<std::thread, NUM_CHECKPOINTERS> cpthreads;
        for (size_t tid = 0; tid < NUM_CHECKPOINTERS; tid++) {
            cpthreads[tid] = std::thread([&, tid] {
                // copy the array batch by batch, unmapping each copied batch
                const size_t numBatches = ARRAY_SIZE / BATCH_SIZE;
                for (size_t i = tid; i < numBatches; i += NUM_CHECKPOINTERS) {
                    void *src = reinterpret_cast<void*>(
                        reinterpret_cast<intptr_t>(array.data()) + i * BATCH_SIZE);
                    void *dst = reinterpret_cast<void*>(
                        reinterpret_cast<intptr_t>(cp.data()) + i * BATCH_SIZE);
                    memcpy(dst, src, BATCH_SIZE);
                    munmap(src, BATCH_SIZE);
                }
            });
        }
        for (std::thread &thread : cpthreads) {
            thread.join();
        }
        printf("CP finished successfully! Child exiting.\n");
        exit(0);
    }
#endif // #ifdef WITH_FORK

    // spawn worker threads
    std::array<std::thread, NUM_WORKERS> threads;
    for (size_t tid = 0; tid < NUM_WORKERS; tid++) {
        threads[tid] = std::thread([&, tid] {
            // write to the array in a strided pattern
            std::array<inttype, NUM_ELEMS()>::iterator it;
            for (it = array.begin() + tid; it < array.end(); it += NUM_WORKERS) {
                *it = tid;
            }
        });
    }

    timepoint tStart = std::chrono::high_resolution_clock::now();

#ifdef WITH_FORK
    // allow reaping the child process while the workers work
    std::thread childWaitThread = std::thread([&] {
        if (waitpid(pid, nullptr, 0) == -1) { // waitpid returns the pid on success
            perror("waitpid");
        }
        timepoint tChild = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> durationChild = tChild - tStart;
        printf("reunited with child after (s): %lf\n", durationChild.count());
    });
#endif

    // wait for workers to finish
    for (std::thread &thread : threads) {
        thread.join();
    }
    timepoint tEnd = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration = tEnd - tStart;
    printf("duration (s): %lf\n", duration.count());

#ifdef WITH_FORK
    childWaitThread.join();
#endif
}

The size of the array is 1GB, which is about 250K pages (262,144, to be exact) at 4KB per page. For this program, the number of page faults that occur due to writing to CoW pages can be easily estimated. It can also be measured using the Linux perf tool. The new expression value-initializes the array to zero, so the following line of code:
std::array<inttype, NUM_ELEMS()> *arrayptr = new std::array<inttype, NUM_ELEMS()>();
will cause about 250K page faults. Similarly, the following line of code:
std::array<inttype, NUM_ELEMS()> *cpptr = new std::array<inttype, NUM_ELEMS()>();
will cause another 250K page faults. All of these page faults are minor, i.e., they are handled without accessing the disk drive. Allocating two 1GB arrays will not cause any major faults on a system with much more than 2GB of physical memory.
At this point, about 500K page faults have already occurred (other memory accesses from the program cause additional faults, of course, but they can be neglected). The execution of std::fill will not cause any minor faults, because the virtual pages of the arrays have already been mapped to dedicated physical pages.
The execution of the program then proceeds to forking the child process and creating the parent's worker threads. The creation of the child process is by itself sufficient to make a snapshot of the array, so there is really no need to do anything in the child process: when the child is forked, the virtual pages of both arrays are marked copy-on-write in both processes. The child process reads from arrayptr and writes to cpptr, which results in an additional 250K minor faults. The parent process also writes to arrayptr, which results in another 250K minor faults. So making a copy in the child process and unmapping the pages does not improve performance. On the contrary, the number of page faults is doubled and performance is significantly degraded.
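To make this concrete, here is a minimal sketch (my simplification, not part of the benchmark) of the "fork alone is the snapshot" approach:

if (fork() == 0) {
    // Child: the fork itself is the snapshot. The child's mapping of `array`
    // is already a frozen, consistent view, so there is nothing to copy.
    inttype sum = 0;
    for (inttype v : array) sum += v; // consume the snapshot in place
    printf("snapshot checksum: %lu\n", (unsigned long)sum);
    _exit(0); // the child's mappings are torn down here
}
// Parent: keep writing. While the child is alive, each first write to a
// still-shared page copies that page once; after the child exits, no
// copying happens at all.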
You can measure the number of minor and major faults using the following command:
perf stat -r 3 -e minor-faults,major-faults ./binary
This will, by default, count minor and major faults for the whole process tree. The -r 3 option tells perf to repeat the experiment three times and report the average and standard deviation.
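The same counters can also be read from inside the program via getrusage() (a sketch; RUSAGE_CHILDREN would additionally capture the reaped child's share):

#include <sys/resource.h>
// Print the fault counters accumulated by the calling process so far.
struct rusage ru;
getrusage(RUSAGE_SELF, &ru);
printf("minor faults: %ld, major faults: %ld\n", ru.ru_minflt, ru.ru_majflt);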
I also noticed that the total number of threads is 28 + 4 = 32. The optimal number of threads is approximately equal to the number of online logical cores on your system. If the number of threads is much larger than that, performance is degraded by the overhead of creating and switching between them; if it is much smaller, cores are left idle.
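For example, the pool sizes could be derived at run time instead of hard-coding 28 + 4 (a sketch; the split between writers and checkpointers is my arbitrary assumption, and runtime sizes require std::vector<std::thread> rather than std::array):

// std::thread::hardware_concurrency() reports the number of online logical
// cores, or 0 if it cannot be determined.
const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
const unsigned checkpointers = std::max(1u, cores / 8);
const unsigned workers = cores - checkpointers;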
Another potential issue may exist in the following loop:
for (it = array.begin() + tid; it < array.end(); it += NUM_WORKERS) {
    *it = tid;
}
Different threads may try to write to the same cache line at the same time, resulting in false sharing. Whether this is significant depends on the cache line size of your processor, the number of threads, and whether all cores run at the same frequency, so it's hard to say without measuring. A better loop shape would make each thread's elements contiguous in the array, as sketched below.
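A sketch of that contiguous partitioning, as a drop-in replacement for the worker lambda body:

// Each thread owns one contiguous block, so two threads can share a cache
// line only at the two boundaries of a block, not on every element.
const size_t chunk = NUM_ELEMS() / NUM_WORKERS;
const size_t begin = tid * chunk;
const size_t end = (tid == NUM_WORKERS - 1) ? NUM_ELEMS() : begin + chunk;
for (size_t i = begin; i < end; i++) {
    array[i] = tid;
}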

Related

Recursive threading with C++ gives a "Resource temporarily unavailable"

So I'm trying to create a program that implements a function that generates a random number n and, based on n, creates n threads. The main thread is responsible for printing the minimum and maximum of the leaves. The depth of the hierarchy, including the main thread, is 3.
I have written the code below:
#include <iostream>
#include <thread>
#include <time.h>
#include <string>
#include <sstream>

using namespace std;

// a structure to keep the needed information of each thread
struct ThreadInfo
{
    long randomN;
    int level;
    bool run;
    int maxOfVals;
    double minOfVals;
};

// The start address (function) of the threads
void ChildWork(void* a) {
    ThreadInfo* info = (ThreadInfo*)a;

    // Generate random value n
    srand(time(NULL));
    double n = rand() % 6 + 1;

    // initialize the thread info with n value
    info->randomN = n;
    info->maxOfVals = n;
    info->minOfVals = n;

    // the depth of recursion should not be more than 3
    if (info->level > 3)
    {
        info->run = false;
    }

    // Create n threads and run them
    ThreadInfo* childInfo = new ThreadInfo[(int)n];
    for (int i = 0; i < n; i++)
    {
        childInfo[i].level = info->level + 1;
        childInfo[i].run = true;
        std::thread tt(ChildWork, &childInfo[i]);
        tt.detach();
    }

    // checks if any child threads are working
    bool anyRun = true;
    while (anyRun)
    {
        anyRun = false;
        for (int i = 0; i < n; i++)
        {
            anyRun = anyRun || childInfo[i].run;
        }
    }

    // once all child threads are done, we find their max and min value
    double maximum = 1, minimum = 6;
    for (int i = 0; i < n; i++)
    {
        // cout << childInfo[i].maxOfVals << endl;
        if (childInfo[i].maxOfVals >= maximum)
            maximum = childInfo[i].maxOfVals;
        if (childInfo[i].minOfVals < minimum)
            minimum = childInfo[i].minOfVals;
    }
    info->maxOfVals = maximum;
    info->minOfVals = minimum;

    // we set info->run to false, so that the parent thread of this thread will know that it is done
    info->run = false;
}

int main()
{
    ThreadInfo info;
    srand(time(NULL));
    double n = rand() % 6 + 1;
    cout << "n is: " << n << endl;

    // initializing thread info
    info.randomN = n;
    info.maxOfVals = n;
    info.minOfVals = n;
    info.level = 1;
    info.run = true;

    std::thread t(ChildWork, &info);
    t.join();
    while (info.run);

    info.maxOfVals = max<unsigned long>(info.randomN, info.maxOfVals);
    info.minOfVals = min<unsigned long>(info.randomN, info.minOfVals);
    cout << "Max is: " << info.maxOfVals << " and Min is: " << info.minOfVals;
}
The code compiles with no error, but when I execute it, it gives me this :
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: thread constructor failed: Resource temporarily unavailable
Abort trap: 6
You spawn too many threads. It looks a bit like a fork() bomb. Threads are a very heavy-weight system resource. Use them sparingly.
Within the function ChildWork I see two mistakes:
As someone already pointed out in the comments, you check the info level of a thread and then you go and create some more threads regardless of the previous check.
Within the for loop that spawns your new threads, you increment the info level right before you spawn the actual thread. However you increment a freshly created instance of ThreadInfo here ThreadInfo* childInfo = new ThreadInfo[(int)n]. All instances within childInfo hold a level of 0. Basically the level of each thread you spawn is 1.
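For the first mistake, a sketch of the missing early return (my wording of the fix, keeping the original comparison):

// Stop recursing once the depth limit is reached; setting the flag alone
// does not prevent the spawning code below from running.
if (info->level > 3)
{
    info->run = false;
    return;
}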
In general avoid using threads to achieve concurrency for I/O bound operations (*). Just use threads to achieve concurrency for independent CPU bound operations. As a rule of thumb you never need more threads than you have CPU cores in your system (**). Having more does not improve concurrency and does not improve performance.
(*) You should always use direct function calls and an event based system to run pseudo concurrent I/O operations. You do not need any threading to do so. For example a TCP server does not need any threads to serve thousands of clients.
(**) This is the ideal case. In practice your software is composed of multiple parts, developed by independent developers and maintained in different modes, so it is ok to have some threads which could be theoretically avoided.
Multithreading is still rocket science in 2019. Especially in C++. Do not do it unless you know exactly what you are doing. Here is a good series of blog posts that handle threads.

What could cause a mutex to misbehave?

I've been busy the last couple of months debugging a rare crash caused somewhere within a very large proprietary C++ image processing library, compiled with GCC 4.7.2 for an ARM Cortex-A9 Linux target. Since a common symptom was glibc complaining about heap corruption, the first step was to employ a heap corruption checker to catch oob memory writes. I used the technique described in https://stackoverflow.com/a/17850402/3779334 to divert all calls to free/malloc to my own function, padding every allocated chunk of memory with some amount of known data to catch out-of-bounds writes - but found nothing, even when padding with as much as 1 KB before and after every single allocated block (there are hundreds of thousands of allocated blocks due to intensive use of STL containers, so I can't enlarge the padding further, plus I assume any write more than 1KB out of bounds would eventually trigger a segfault anyway). This bounds checker has found other problems in the past so I don't doubt its functionality.
(Before anyone says 'Valgrind', yes, I have tried that too with no results either.)
Now, my memory bounds checker also has a feature where it prepends every allocated block with a data struct. These structs are all linked in one long linked list, to allow me to occasionally go over all allocations and test memory integrity. For some reason, even though all manipulations of this list are mutex protected, the list was getting corrupted. When investigating the issue, it began to seem like the mutex itself was occasionally failing to do its job. Here is the pseudocode:
pthread_mutex_t alloc_mutex;
static bool boolmutex; // set to false during init. volatile has no effect.

void malloc_wrapper() {
    // ...
    pthread_mutex_lock(&alloc_mutex);
    if (boolmutex) {
        printf("mutex misbehaving\n");
        __THROW_ERROR__; // this happens!
    }
    boolmutex = true;
    // manipulate linked list here
    boolmutex = false;
    pthread_mutex_unlock(&alloc_mutex);
    // ...
}
The code commented with "this happens!" is occasionally reached, even though this seems impossible. My first theory was that the mutex data structure was being overwritten. I placed the mutex within a struct, with large arrays before and after it, but when this problem occurred the arrays were untouched so nothing seems to be overwritten.
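For reference, this is roughly the canary arrangement described above (names are illustrative, not from the actual code):

// Surround the mutex with known byte patterns, filled with 0xAA at init.
// If the "mutex misbehaving" branch is hit while every guard byte still
// reads 0xAA, an out-of-bounds writer into this region can be ruled out.
struct GuardedMutex
{
    unsigned char guard_before[1024];
    pthread_mutex_t mutex;
    unsigned char guard_after[1024];
};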
So... what kind of corruption could possibly cause this to happen, and how would I find and fix the cause?
A few more notes. The test program uses 3-4 threads for processing. Running with fewer threads seems to make the corruptions less common, but they don't disappear. The test runs for about 20 seconds each time and completes successfully in the vast majority of cases (I can have 10 units repeating the test, with the first failure occurring after 5 minutes to several hours). When the problem occurs it is quite late in the test (say, 15 seconds in), so this isn't a bad-initialization issue. The memory bounds checker never catches actual out-of-bounds writes, but glibc still occasionally fails with a corrupted-heap error (can such an error be caused by something other than an oob write?). Each failure generates a core dump with plenty of trace information; there is no pattern I can see in these dumps, no particular section of code that shows up more than others. This problem seems very specific to a particular family of algorithms and does not happen in other algorithms, so I'm quite certain this isn't a sporadic hardware or memory error. I have done many more tests to check for oob heap accesses which I don't want to list, to keep this post from getting any longer.
Thanks in advance for any help!
Thanks to all commenters. I tried nearly all suggestions with no results, and finally decided to write a simple memory allocation stress test - one that runs a thread on each of the CPU cores (my unit is a Freescale i.MX6 quad-core SoC), each allocating and freeing memory in random order at high speed. The test crashed with a glibc memory corruption error within minutes, or a few hours at most.
Updating the kernel from 3.0.35 to 3.0.101 solved the problem; both the stress test and the image processing algorithm now run overnight without failing. The problem does not reproduce on Intel machines with the same kernel version, so the problem is specific either to ARM in general or perhaps to some patch Freescale included with the specific BSP version that included kernel 3.0.35.
For those curious, attached is the stress test source code. Set NUM_THREADS to the number of CPU cores and build with:
<cross-compiler-prefix>g++ -O3 test_heap.cpp -lpthread -o test_heap
I hope this information helps someone. Cheers :)
// Multithreaded heap stress test. By Itay Chamiel 20151012.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>
#include <sys/time.h>

#define NUM_THREADS 4 // set to number of CPU cores
#define ALIVE_INDICATOR NUM_THREADS

// Each thread constantly allocates and frees memory. In each iteration of the infinite loop, decide at random whether to
// allocate or free a block of memory. A list of 500-1000 allocated blocks is maintained by each thread. When memory is allocated
// it is added to this list; when freeing, a random block is selected from this list, freed and removed from the list.
void* thr(void* arg) {
    int* alive_flag = (int*)arg;
    int thread_id = *alive_flag; // this is a number between 0 and (NUM_THREADS-1) given by main()
    int cnt = 0;
    timeval t_pre, t_post;
    gettimeofday(&t_pre, NULL);

    const int ALLOCATE = 1, FREE = 0;
    const unsigned int MINSIZE = 500, MAXSIZE = 1000;
    const int MAX_ALLOC = 10000;
    char* membufs[MAXSIZE];
    unsigned int membufs_size = 0;
    int num_allocs = 0, num_frees = 0;

    while (1)
    {
        int action;

        // Decide whether to allocate or free a memory block.
        // if we have less than MINSIZE buffers, allocate.
        if (membufs_size < MINSIZE) action = ALLOCATE;
        // if we have MAXSIZE, free.
        else if (membufs_size >= MAXSIZE) action = FREE;
        // else, decide randomly.
        else {
            action = ((rand() & 0x1) ? ALLOCATE : FREE);
        }

        if (action == ALLOCATE) {
            // choose size to allocate, from 1 to MAX_ALLOC bytes
            size_t size = (rand() % MAX_ALLOC) + 1;
            // allocate and fill memory
            char* buf = (char*)malloc(size);
            memset(buf, 0x77, size);
            // add buffer to list
            membufs[membufs_size] = buf;
            membufs_size++;
            assert(membufs_size <= MAXSIZE);
            num_allocs++;
        }
        else { // action == FREE
            // choose a random buffer to free
            size_t pos = rand() % membufs_size;
            assert(pos < membufs_size);
            // free and remove from list by replacing entry with last member
            free(membufs[pos]);
            membufs[pos] = membufs[membufs_size - 1];
            membufs_size--;
            num_frees++;
        }

        // once in 10 seconds print a status update
        gettimeofday(&t_post, NULL);
        if (t_post.tv_sec - t_pre.tv_sec >= 10) {
            printf("Thread %d [%d] - %d allocs %d frees. Alloced blocks %u.\n", thread_id, cnt++, num_allocs, num_frees, membufs_size);
            gettimeofday(&t_pre, NULL);
        }

        // indicate alive to main thread
        *alive_flag = ALIVE_INDICATOR;
    }
    return NULL;
}

int main()
{
    int alive_flag[NUM_THREADS];
    printf("Memory allocation stress test running on %d threads.\n", NUM_THREADS);

    // start a thread for each core
    for (int i = 0; i < NUM_THREADS; i++) {
        alive_flag[i] = i; // tell each thread its ID.
        pthread_t th;
        int ret = pthread_create(&th, NULL, thr, &alive_flag[i]);
        assert(ret == 0);
    }

    while (1) {
        sleep(10);
        // check that all threads are alive
        bool ok = true;
        for (int i = 0; i < NUM_THREADS; i++) {
            if (alive_flag[i] != ALIVE_INDICATOR)
            {
                printf("Thread %d is not responding\n", i);
                ok = false;
            }
        }
        assert(ok);
        for (int i = 0; i < NUM_THREADS; i++)
            alive_flag[i] = 0;
    }
    return 0;
}

TBB task_arena & task_group usage for scaling parallel_for work

I am trying to use the Threading Building Blocks task_arena. There is a simple array full of '0'. The arena's thread puts '1' in the array at the even indices. The main thread puts '2' in the array at the odd indices.
/* Odd-even arenas tbb test */
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/task_arena.h>
#include <tbb/task_group.h>
#include <iostream>

using namespace std;

const int SIZE = 100;

int main()
{
    tbb::task_arena limited(1); // no more than 1 thread in this arena
    tbb::task_group tg;
    int myArray[SIZE] = {0};

    //! Main thread creates another thread, then immediately returns
    limited.enqueue([&]{
        //! Created thread continues here
        tg.run([&]{
            tbb::parallel_for(tbb::blocked_range<int>(0, SIZE),
                [&](const tbb::blocked_range<int> &r)
                {
                    for (int i = 0; i != SIZE; i++)
                        if (i % 2 == 0)
                            myArray[i] = 1;
                }
            );
        });
    });

    //! Main thread does this work
    tbb::parallel_for(tbb::blocked_range<int>(0, SIZE),
        [&](const tbb::blocked_range<int> &r)
        {
            for (int i = 0; i != SIZE; i++)
                if (i % 2 != 0)
                    myArray[i] = 2;
        }
    );

    //! Main thread waits for the 'tg' group
    //! (it does not create any threads here, does it?)
    limited.execute([&]{
        tg.wait();
    });

    for (int i = 0; i < SIZE; i++) {
        cout << myArray[i] << " ";
    }
    cout << endl;
    return 0;
}
The output is:
0 2 0 2 ... 0 2
So the limited.enqueue(tg.run(...)) block doesn't work.
What's the problem? Any ideas? Thank you.
You have created the limited arena with one thread only, and by default that single slot is reserved for the master thread. Enqueuing into such a serializing arena does temporarily boost its concurrency level to 2 (in order to satisfy the 'fire-and-forget' promise of enqueue), but enqueue() does not guarantee synchronous execution of the submitted task. So tg.wait() can start before tg.run() executes, and then the program does not wait for the worker thread to be created, join the limited arena, and fill the array with '1'. (BTW, since the loop body ignores the subrange r, the whole array is traversed by every invocation of the parallel_for body.)
So, in order to wait for tg.run() to complete, limited.execute is the right call. But it prevents the automatic boost of the arena's concurrency level, and the task is deferred until the master thread executes tg.wait().
If you want to see asynchronous execution, set the arena's concurrency to 2 manually: tbb::task_arena limited(2);
or disable the slot reservation for the master thread: tbb::task_arena limited(1,0) (but note that this implies additional overhead for dynamic balancing of the number of threads in the arena).
P.S. TBB has no points where threads are guaranteed to arrive (unlike OpenMP). Only the enqueue methods guarantee the creation of at least one worker thread, but they say nothing about when it will arrive. See the local observer feature to get a notification when threads actually join arenas.
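Putting the suggestions together, a sketch of a corrected snippet (iterating over the subrange r, and sizing the arena so the enqueued work can run concurrently with the main thread):

tbb::task_arena limited(2); // or tbb::task_arena limited(1, 0)
limited.enqueue([&]{
    tg.run([&]{
        tbb::parallel_for(tbb::blocked_range<int>(0, SIZE),
            [&](const tbb::blocked_range<int> &r)
            {
                // touch only this invocation's subrange, not the whole array
                for (int i = r.begin(); i != r.end(); i++)
                    if (i % 2 == 0)
                        myArray[i] = 1;
            });
    });
});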

sem_wait() failed to wake up on Linux

I have a real-time application that uses a shared FIFO. There are several writer processes and one reader process. Data is periodically written into the FIFO and constantly drained. Theoretically the FIFO should never overflow because the reading speed is faster than all writers combined. However, the FIFO does overflow.
I tried to reproduce the problem and finally worked out the following (simplified) code:
#include <stdint.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cassert>
#include <pthread.h>
#include <sched.h>
#include <semaphore.h>
#include <sys/time.h>
#include <unistd.h>

class Fifo
{
public:
    Fifo() : _deq(0), _wptr(0), _rptr(0), _lock(0)
    {
        memset(_data, 0, sizeof(_data));
        sem_init(&_data_avail, 1, 0);
    }

    ~Fifo()
    {
        sem_destroy(&_data_avail);
    }

    void Enqueue()
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        uint64_t enq = tv.tv_usec + tv.tv_sec * 1000000;
        while (__sync_lock_test_and_set(&_lock, 1))
            sched_yield();
        uint8_t wptr = _wptr;
        uint8_t next_wptr = (wptr + 1) % c_entries;
        int retry = 0;
        while (next_wptr == _rptr) // will become full
        {
            printf("retry=%u enq=%lu deq=%lu count=%d\n", retry, enq, _deq, Count());
            for (uint8_t i = _rptr; i != _wptr; i = (i + 1) % c_entries)
                printf("%u: %lu\n", i, _data[i]);
            assert(retry++ < 2);
            usleep(500);
        }
        assert(__sync_bool_compare_and_swap(&_wptr, wptr, next_wptr));
        _data[wptr] = enq;
        __sync_lock_release(&_lock);
        sem_post(&_data_avail);
    }

    int Dequeue()
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        uint64_t deq = tv.tv_usec + tv.tv_sec * 1000000;
        _deq = deq;
        uint8_t rptr = _rptr, wptr = _wptr;
        uint8_t next_rptr = (rptr + 1) % c_entries;
        bool empty = Count() == 0;
        assert(!sem_wait(&_data_avail)); // bug in sem_wait?
        _deq = 0;
        uint64_t enq = _data[rptr]; // enqueue time
        assert(__sync_bool_compare_and_swap(&_rptr, rptr, next_rptr));
        int latency = deq - enq; // latency from enqueue to dequeue
        if (empty && latency < -500)
        {
            printf("before dequeue: w=%u r=%u; after dequeue: w=%u r=%u; %d\n", wptr, rptr, _wptr, _rptr, latency);
        }
        return latency;
    }

    int Count()
    {
        int count = 0;
        assert(!sem_getvalue(&_data_avail, &count));
        return count;
    }

    static const unsigned c_entries = 16;

private:
    sem_t _data_avail;
    uint64_t _data[c_entries];
    volatile uint64_t _deq; // non-0 indicates when dequeue happened
    volatile uint8_t _wptr, _rptr; // write, read pointers
    volatile uint8_t _lock; // write lock
};

static const unsigned c_total = 10000000;
static const unsigned c_writers = 3;
static Fifo s_fifo;

// writer thread
void* Writer(void* arg)
{
    for (unsigned i = 0; i < c_total; i++)
    {
        int t = rand() % 200 + 200; // [200, 399]
        usleep(t);
        s_fifo.Enqueue();
    }
    return NULL;
}

int main()
{
    pthread_t thread[c_writers];
    for (unsigned i = 0; i < c_writers; i++)
        pthread_create(&thread[i], NULL, Writer, NULL);
    for (unsigned total = 0; total < c_total * c_writers; total++)
        s_fifo.Dequeue();
}
When Enqueue() overflows, the debug print indicates that Dequeue() is stuck (because _deq is not 0). The only place where Dequeue() can get stuck is sem_wait(). However, since the fifo is full (also confirmed by sem_getvalue()), I don't understand how that could happen. Even after several retries (each waits 500us) the fifo was still full even though Dequeue() should definitely drain while Enqueue() is completely stopped (busy retrying).
In the code example, there are 3 writers, each writing every 200-400us. On my computer (8-core i7-2860 running CentOS 6.5, kernel 2.6.32-279.22.1.el6.x86_64, g++ 4.4.7 20120313), the code fails within a few minutes. I also tried several other CentOS systems, and it failed the same way there.
I know that making the fifo bigger can reduce the overflow probability (in fact, the program still fails with c_entries=128), but in my real-time application there is a hard constraint on enqueue-dequeue latency, so data must be drained quickly. If it's not a bug in sem_wait(), then what prevents it from getting the semaphore?
P.S. If I replace
assert(!sem_wait(&_data_avail));// bug in sem_wait?
with
while (sem_trywait(&_data_avail) < 0) sched_yield();
then the program runs fine. So it seems that there's something wrong in sem_wait() and/or scheduler.
You need to use a combination of sem_wait/sem_post calls to be able to manage your read and write threads.
Your enqueue thread performs only a sem_post, and your dequeue performs only a sem_wait. You need to add a sem_wait to the enqueue thread and a sem_post to the dequeue thread.
A long time ago, I implemented the ability to have multiple threads/processes read some shared memory while only one thread/process writes to it. I used two semaphores, a write semaphore and a read semaphore. The read threads would wait until the write semaphore was not set and then set the read semaphore. The write threads would set the write semaphore and then wait until the read semaphore was not set. The read and write threads would then unset the semaphores when they completed their tasks. The read semaphore can be held by n threads at a time, while the write semaphore can be held by a single thread at a time.
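For illustration, that scheme is essentially the classic two-semaphore readers-writer lock; here is a sketch (structure and names are mine, not from the original code):

// Readers share the lock; a writer holds it exclusively. The first reader
// blocks writers, and the last reader releases them.
struct RwSem
{
    sem_t write_sem; // held by a writer, or on behalf of all active readers
    sem_t count_sem; // protects the reader count
    int readers = 0;

    RwSem() { sem_init(&write_sem, 0, 1); sem_init(&count_sem, 0, 1); }
    void read_lock()
    {
        sem_wait(&count_sem);
        if (++readers == 1) sem_wait(&write_sem);
        sem_post(&count_sem);
    }
    void read_unlock()
    {
        sem_wait(&count_sem);
        if (--readers == 0) sem_post(&write_sem);
        sem_post(&count_sem);
    }
    void write_lock() { sem_wait(&write_sem); }
    void write_unlock() { sem_post(&write_sem); }
};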
If it's not a bug in sem_wait(), then what prevents it from getting the semaphore?
Your program's impatience prevents it. There is no guarantee that the Dequeue() thread is scheduled within a given number of retries. If you change
assert(retry++ < 2);
to
retry++;
you'll see that the program happily continues; the reader sometimes gets scheduled only after 8 or perhaps even more retries.
Why does Enqueue have to retry?
It has to retry simply because the main thread's Dequeue() hasn't been scheduled by then.
Dequeue speed is much faster than all writers combined.
Your program shows that this assumption is sometimes false. While the execution time of Dequeue() is apparently much shorter than that of the writers (due to the usleep(t)), this does not imply that Dequeue() is scheduled by the Completely Fair Scheduler more often, and the main reason is that you used a nondeterministic scheduling policy. man sched_yield:
sched_yield() is intended for use with real-time scheduling policies (i.e., SCHED_FIFO or SCHED_RR). Use of sched_yield() with nondeterministic scheduling policies such as SCHED_OTHER is unspecified and very likely means your application design is broken.
If you insert
struct sched_param param = { .sched_priority = 1 };
if (sched_setscheduler(0, SCHED_FIFO, &param) < 0)
    perror("sched_setscheduler");
at the start of main(), you'll likely see that your program performs as expected (when run with the appropriate privilege).

Windows 7 memory management - how to prevent concurrent threads from blocking

I'm working on a program consisting of two concurrent threads. One (here "Clock") performs some computation on a regular basis (10 Hz) and is quite memory-intensive. The other one (here "hugeList") uses even more RAM but is not as time-critical as the first one, so I decided to reduce its priority to THREAD_PRIORITY_LOWEST. Yet, when that thread frees most of the memory it has used, the critical one no longer manages to keep its timing.
I was able to condense down the problem to this bit of code (make sure optimizations are turned off!):
While Clock tries to keep its 10 Hz timing, the hugeList thread allocates and frees more and more memory, not organized in any sort of chunks.
#include "stdafx.h"
#include <stdio.h>
#include <forward_list>
#include <time.h>
#include <windows.h>
#include <vector>
void wait_ms(double _ms)
{
clock_t endwait;
endwait = clock () + _ms * CLOCKS_PER_SEC/1000;
while (clock () < endwait) {} // active wait
}
void hugeList(void)
{
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_LOWEST);
unsigned int loglimit = 3;
unsigned int limit = 1000;
while(true)
{
for(signed int cnt=loglimit; cnt>0; cnt--)
{
printf(" Countdown %d...\n", cnt);
wait_ms(1000.0);
}
printf(" Filling list...\n");
std::forward_list<double> list;
for(unsigned int cnt=0; cnt<limit; cnt++)
list.push_front(42.0);
loglimit++;
limit *= 10;
printf(" Clearing list...\n");
while(!list.empty())
list.pop_front();
}
}
void Clock()
{
clock_t start = clock()-CLOCKS_PER_SEC*100/1000;
while(true)
{
std::vector<double> dummyData(100000, 42.0); // just get some memory
printf("delta: %d ms\n", (clock()-start)*1000/CLOCKS_PER_SEC);
start = clock();
wait_ms(100.0);
}
}
int main()
{
DWORD dwThreadId;
if (CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)&Clock, (LPVOID) NULL, 0, &dwThreadId) == NULL)
printf("Thread could not be created");
if (CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)&hugeList, (LPVOID) NULL, 0, &dwThreadId) == NULL)
printf("Thread could not be created");
while(true) {;}
return 0;
}
First of all I noticed that allocating memory for the linked list is way faster than freeing it.
On my machine (Windows 7), at around the 4th iteration of the hugeList method, the Clock thread gets significantly disturbed (delays of up to 200ms). The effect disappears if the dummyData vector in the Clock thread does not "ask" for memory.
So,
Is there any way of increasing the priority of memory allocation for the Clock-Thread in Win7?
Or do I have to split both operations onto two contexts (processes)?
Note that my original code uses some communication via shared variables which would require for some kind of IPC if I chose the second option.
Note that my original code gets stuck for about 1 sec when the equivalent of the "hugeList" method clears a boost::unordered_map and enters ntdll.dll!RtlInitializeCriticalSection many, many times (observed with Sysinternals Process Explorer).
Note that the effects observed are not due to swapping; I'm using 1.4GB of my 16GB (64-bit Windows 7).
edit:
Just wanted to let you know that up to now I haven't been able to solve my issue. Splitting the two parts of the code into two processes does not seem to be an option, since my time is rather limited and I've never worked with processes so far; I'm afraid I wouldn't get to a running version in time.
However, I managed to reduce the effects by reducing the number of memory deallocations made by the non-critical thread. This was achieved by using a fast pooling memory allocator (like the one provided in the boost library).
There does not seem to be a way to explicitly create certain objects (like e.g. the huge forward list in my example) on some sort of thread-private heap that would not require synchronisation.
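For example, the list from the test program can be switched to the boost pool allocator in one line (a sketch; boost::fast_pool_allocator is the variant intended for node-based containers):

#include <boost/pool/pool_alloc.hpp>
#include <forward_list>

// Nodes now come from a boost singleton pool: pop_front returns them to
// the pool instead of the process heap, avoiding most heap-lock traffic.
std::forward_list<double, boost::fast_pool_allocator<double>> list;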
For further reading:
http://bmagic.sourceforge.net/memalloc.html
Do threads have a distinct heap?
Memory Allocation/Deallocation Bottleneck?
http://software.intel.com/en-us/articles/avoiding-heap-contention-among-threads
http://www.boost.org/doc/libs/1_55_0/libs/pool/doc/html/boost_pool/pool/introduction.html
Replacing std::forward_list with std::list, I ran your code on a Core i7 machine with 4GB of RAM until 2GB was consumed. No disturbances at all (in a debug build).
P.S. Yes, the release build recreates the issue. I replaced the forward list with an array:
double* p = new double[limit];
for (unsigned int cnt = 0; cnt < limit; cnt++)
    p[cnt] = 42.0;
and
for (unsigned int cnt = 0; cnt < limit; cnt++)
    p[cnt] = -1;
delete[] p;
The issue does not reproduce then.
It seems the thread scheduler punishes the program for asking for a lot of small memory chunks.