ReleaseSemaphore does not release the semaphore - c++

(In short: main()'s WaitForSingleObject hangs in the program below).
I'm trying to write a piece of code that dispatches threads and waits for them to finish before it resumes. Instead of creating the threads every time, which is costly, I put them to sleep. The main thread creates X threads in CREATE_SUSPENDED state.
The sync is done with a semaphore with X as MaximumCount. The semaphore's counter is put down to zero and the threads are dispatched. The threads perform a silly loop and call ReleaseSemaphore before they go to sleep. Then the main thread uses WaitForSingleObject X times to be sure every thread finished its job and is sleeping. Then it loops and does it all again.
From time to time the program does not exit. When I break the program I can see that WaitForSingleObject hangs. This means that a thread's ReleaseSemaphore did not work. Nothing is printf'ed, so supposedly nothing went wrong.
Maybe two threads shouldn't call ReleaseSemaphore at the exact same time, but that would nullify the purpose of semaphores...
I just don't grok it...
Other solutions to sync threads are gratefully accepted!
#include <windows.h>
#include <stdio.h>
#include <math.h>

#define TRY 100
#define LOOP 100

HANDLE *ids;
HANDLE semaphore;

DWORD WINAPI Count(__in LPVOID lpParameter)
{
    float x = 1.0f;
    while(1)
    {
        for (int i=1 ; i<LOOP ; i++)
            x = sqrt((float)i*x);
        while (ReleaseSemaphore(semaphore,1,NULL) == FALSE)
            printf(" ReleaseSemaphore error : %d ", GetLastError());
        SuspendThread(ids[(int) lpParameter]);
    }
    return (DWORD)(int)x;
}

int main()
{
    SYSTEM_INFO sysinfo;
    GetSystemInfo( &sysinfo );
    int numCPU = sysinfo.dwNumberOfProcessors;
    semaphore = CreateSemaphore(NULL, numCPU, numCPU, NULL);
    ids = new HANDLE[numCPU];
    for (int j=0 ; j<numCPU ; j++)
        ids[j] = CreateThread(NULL, 0, Count, (LPVOID)j, CREATE_SUSPENDED, NULL);
    for (int j=0 ; j<TRY ; j++)
    {
        for (int i=0 ; i<numCPU ; i++)
        {
            if (WaitForSingleObject(semaphore,1) == WAIT_TIMEOUT)
                printf("Timed out !!!\n");
            ResumeThread(ids[i]);
        }
        for (int i=0 ; i<numCPU ; i++)
            WaitForSingleObject(semaphore,INFINITE);
        ReleaseSemaphore(semaphore,numCPU,NULL);
    }
    CloseHandle(semaphore);
    printf("Done\n");
    getc(stdin);
}

Instead of using a semaphore (at least directly) or having main explicitly wake up a thread to get some work done, I've always used a thread-safe queue. When main wants a worker thread to do something, it pushes a description of the job to be done onto the queue. The worker threads each just do a job, then try to pop another job from the queue, and end up suspended until there's a job in the queue for them to do:
The code for the queue looks like this:
#ifndef QUEUE_H_INCLUDED
#define QUEUE_H_INCLUDED

#include <windows.h>

template<class T, unsigned max = 256>
class queue {
    HANDLE space_avail;     // at least one slot empty
    HANDLE data_avail;      // at least one slot full
    CRITICAL_SECTION mutex; // protect buffer, in_pos, out_pos
    T buffer[max];
    long in_pos, out_pos;
public:
    queue() : in_pos(0), out_pos(0) {
        space_avail = CreateSemaphore(NULL, max, max, NULL);
        data_avail = CreateSemaphore(NULL, 0, max, NULL);
        InitializeCriticalSection(&mutex);
    }
    void push(T data) {
        WaitForSingleObject(space_avail, INFINITE);
        EnterCriticalSection(&mutex);
        buffer[in_pos] = data;
        in_pos = (in_pos + 1) % max;
        LeaveCriticalSection(&mutex);
        ReleaseSemaphore(data_avail, 1, NULL);
    }
    T pop() {
        WaitForSingleObject(data_avail, INFINITE);
        EnterCriticalSection(&mutex);
        T retval = buffer[out_pos];
        out_pos = (out_pos + 1) % max;
        LeaveCriticalSection(&mutex);
        ReleaseSemaphore(space_avail, 1, NULL);
        return retval;
    }
    ~queue() {
        DeleteCriticalSection(&mutex);
        CloseHandle(data_avail);
        CloseHandle(space_avail);
    }
};

#endif
A rough equivalent of your code, with the threads pulling their work from this queue, is shown below (see the revised version after the edit). I didn't sort out exactly what your thread function was doing, but it was something with summing square roots, and apparently you're more interested in the thread sync than in what the threads actually do, for the moment.
Edit: (based on comment):
If you need main() to wait for some tasks to finish, do some more work, then assign more tasks, it's generally best to handle that by putting an event (for example) into each task, and have your thread function set the events. Revised code to do that would look like this (note that the queue code isn't affected):
#include "queue.hpp"
#include <iostream>
#include <process.h>
#include <math.h>
#include <vector>
struct task {
int val;
HANDLE e;
task() : e(CreateEvent(NULL, 0, 0, NULL)) { }
task(int i) : val(i), e(CreateEvent(NULL, 0, 0, NULL)) {}
};
void process(void *p) {
queue<task> &q = *static_cast<queue<task> *>(p);
task t;
while ( -1 != (t=q.pop()).val) {
std::cout << t.val << "\n";
SetEvent(t.e);
}
}
int main() {
queue<task> jobs;
enum { thread_count = 4 };
enum { task_count = 10 };
std::vector<HANDLE> threads;
std::vector<HANDLE> events;
std::cout << "Creating thread pool" << std::endl;
for (int t=0; t<thread_count; ++t)
threads.push_back((HANDLE)_beginthread(process, 0, &jobs));
std::cout << "Thread pool Waiting" << std::endl;
std::cout << "First round of tasks" << std::endl;
for (int i=0; i<task_count; ++i) {
task t(i+1);
events.push_back(t.e);
jobs.push(t);
}
WaitForMultipleObjects(events.size(), &events[0], TRUE, INFINITE);
events.clear();
std::cout << "Second round of tasks" << std::endl;
for (int i=0; i<task_count; ++i) {
task t(i+20);
events.push_back(t.e);
jobs.push(t);
}
WaitForMultipleObjects(events.size(), &events[0], true, INFINITE);
events.clear();
for (int j=0; j<thread_count; ++j)
jobs.push(-1);
WaitForMultipleObjects(threads.size(), &threads[0], TRUE, INFINITE);
return 0;
}

The problem happens in the following case:
The main thread resumes the worker threads:
for (int i=0 ; i<numCPU ; i++)
{
    if (WaitForSingleObject(semaphore,1) == WAIT_TIMEOUT)
        printf("Timed out !!!\n");
    ResumeThread(ids[i]);
}
The worker threads do their work and release the semaphore:
for (int i=1 ; i<LOOP ; i++)
    x = sqrt((float)i*x);
while (ReleaseSemaphore(semaphore,1,NULL) == FALSE)
The main thread waits for all worker threads and resets the semaphore:
for (int i=0 ; i<numCPU ; i++)
    WaitForSingleObject(semaphore,INFINITE);
ReleaseSemaphore(semaphore,numCPU,NULL);
The main thread goes into the next round, trying to resume the worker threads (note that the worker threads haven't even suspended themselves yet! This is where the problem starts: you are trying to resume threads that aren't necessarily suspended yet):
for (int i=0 ; i<numCPU ; i++)
{
    if (WaitForSingleObject(semaphore,1) == WAIT_TIMEOUT)
        printf("Timed out !!!\n");
    ResumeThread(ids[i]);
}
Finally the worker threads suspend themselves (although they should already be starting the next round):
SuspendThread(ids[(int) lpParameter]);
And the main thread waits forever, since all workers are now suspended:
for (int i=0 ; i<numCPU ; i++)
    WaitForSingleObject(semaphore,INFINITE);
Here's a link that shows how to correctly solve producer/consumer problems:
http://en.wikipedia.org/wiki/Producer-consumer_problem
Also, I think critical sections are much faster than semaphores and mutexes. They're also easier to understand in most cases (IMO).
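For reference, here is a minimal sketch of that producer/consumer pattern built on a CRITICAL_SECTION plus Win32 condition variables (Vista and later); the buffer size, the names and the int payload are just placeholders:
#include <windows.h>

#define BUF_SIZE 16

static int buffer[BUF_SIZE];
static int count = 0, head = 0, tail = 0;
static CRITICAL_SECTION cs;
static CONDITION_VARIABLE not_full, not_empty;

void init()
{
    InitializeCriticalSection(&cs);
    InitializeConditionVariable(&not_full);
    InitializeConditionVariable(&not_empty);
}

void produce(int item)
{
    EnterCriticalSection(&cs);
    while (count == BUF_SIZE)                      // buffer full: wait
        SleepConditionVariableCS(&not_full, &cs, INFINITE);
    buffer[tail] = item;
    tail = (tail + 1) % BUF_SIZE;
    ++count;
    LeaveCriticalSection(&cs);
    WakeConditionVariable(&not_empty);             // one consumer may proceed
}

int consume()
{
    EnterCriticalSection(&cs);
    while (count == 0)                             // buffer empty: wait
        SleepConditionVariableCS(&not_empty, &cs, INFINITE);
    int item = buffer[head];
    head = (head + 1) % BUF_SIZE;
    --count;
    LeaveCriticalSection(&cs);
    WakeConditionVariable(&not_full);              // one producer may proceed
    return item;
}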

I don't understand the code, but the threading sync is definitely bad. You assume that threads will call SuspendThread() in a certain order. A successful WaitForSingleObject() call doesn't tell you which thread called ReleaseSemaphore(). You'll thus call ResumeThread() on a thread that wasn't suspended. This quickly deadlocks the program.
Another bad assumption is that a thread has already called SuspendThread by the time WaitForSingleObject returns. Usually yes, but not always. The thread could be pre-empted right after the ReleaseSemaphore call. You'll again call ResumeThread() on a thread that wasn't suspended. That one usually takes a day or so to deadlock your program.
And I think there's one ReleaseSemaphore call too many. Trying to unwedge it, no doubt.
You cannot control threading with Suspend/ResumeThread(), don't try.
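For reference, a minimal sketch of the usual alternative (my illustration, not part of the original answer): each worker blocks on its own auto-reset "go" event instead of suspending itself, and signals a "done" event when it finishes a round. The names and the worker body are only illustrative.
#include <windows.h>
#include <math.h>

#define WORKERS 4

HANDLE goEvent[WORKERS];   // auto-reset: CreateEvent(NULL, FALSE, FALSE, NULL)
HANDLE doneEvent[WORKERS]; // auto-reset as well

DWORD WINAPI Worker(LPVOID p)
{
    int id = (int)(INT_PTR)p;
    float x = 1.0f;
    while (WaitForSingleObject(goEvent[id], INFINITE) == WAIT_OBJECT_0)
    {
        for (int i = 1; i < 100; i++)
            x = sqrt((float)i * x);  // the "silly loop"
        SetEvent(doneEvent[id]);     // report completion
    }
    return (DWORD)(int)x;
}

// Per round in main():
//     for (int i = 0; i < WORKERS; i++) SetEvent(goEvent[i]);
//     WaitForMultipleObjects(WORKERS, doneEvent, TRUE, INFINITE);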

The problem is that you are waiting more often than you are signaling.
The for (int j=0 ; j<TRY ; j++) loop waits eight times for the semaphore, while the four threads will only signal once each and the loop itself signals it once. The first time through the loop, this is not an issue because the semaphore is given an initial count of four. The second and each subsequent time, you are waiting for too many signals. This is mitigated by the fact that on the first four waits you limit the time and don't retry on error. So sometimes it may work and sometimes your wait will hang.
I think the following (untested) changes will help.
Initialize the semaphore to zero count:
semaphore = CreateSemaphore(NULL, 0, numCPU, NULL);
Get rid of the wait in the thread resumption loop (i.e. remove the following):
if (WaitForSingleObject(semaphore,1) == WAIT_TIMEOUT)
    printf("Timed out !!!\n");
Remove the extraneous signal from the end of the try loop (i.e. remove the following):
ReleaseSemaphore(semaphore,numCPU,NULL);
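Putting those three changes together, the resulting main loop would look roughly like this (an untested sketch of just the changes described above; the Suspend/Resume caveats raised in the other answers still apply):
semaphore = CreateSemaphore(NULL, 0, numCPU, NULL);   // start with a count of zero

for (int j=0 ; j<TRY ; j++)
{
    for (int i=0 ; i<numCPU ; i++)
        ResumeThread(ids[i]);                         // wake every worker, no wait first
    for (int i=0 ; i<numCPU ; i++)
        WaitForSingleObject(semaphore, INFINITE);     // one release per worker, no extra ReleaseSemaphore
}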

Here is a practical solution.
I wanted my main program to use threads (and thus more than one core) to crunch jobs, and to wait for all the threads to complete before resuming and doing other stuff. I did not want to let the threads die and create new ones, because that's slow. In my question, I was trying to do that by suspending the threads, which seemed natural. But as nobugz pointed out, "Thou canst not control threading with Suspend/ResumeThread()".
The solution still involves semaphores like the one I was using to control the threads, plus one more semaphore to control the main thread. Now I have one semaphore per thread to control the threads, and one semaphore to control main.
Here is the solution:
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <process.h>

#define TRY 500000
#define LOOP 100

HANDLE *ids;
HANDLE *semaphores;
HANDLE allThreadsSemaphore;

DWORD WINAPI Count(__in LPVOID lpParameter)
{
    float x = 1.0f;
    while(1)
    {
        WaitForSingleObject(semaphores[(int)lpParameter],INFINITE);
        for (int i=1 ; i<LOOP ; i++)
            x = sqrt((float)i*x+rand());
        ReleaseSemaphore(allThreadsSemaphore,1,NULL);
    }
    return (DWORD)(int)x;
}

int main()
{
    SYSTEM_INFO sysinfo;
    GetSystemInfo( &sysinfo );
    int numCPU = sysinfo.dwNumberOfProcessors;
    ids = new HANDLE[numCPU];
    semaphores = new HANDLE[numCPU];
    for (int j=0 ; j<numCPU ; j++)
    {
        // Create the semaphore before the thread that waits on it, so the
        // thread never sees an uninitialized handle.
        // Threads are blocked until main releases them one by one.
        semaphores[j] = CreateSemaphore(NULL, 0, 1, NULL);
        ids[j] = CreateThread(NULL, 0, Count, (LPVOID)j, 0, NULL);
    }
    // Blocks main until the threads finish
    allThreadsSemaphore = CreateSemaphore(NULL, 0, numCPU, NULL);
    for (int j=0 ; j<TRY ; j++)
    {
        for (int i=0 ; i<numCPU ; i++) // Let numCPU threads do their jobs
            ReleaseSemaphore(semaphores[i],1,NULL);
        for (int i=0 ; i<numCPU ; i++) // Wait for numCPU threads to finish
            WaitForSingleObject(allThreadsSemaphore,INFINITE);
    }
    for (int j=0 ; j<numCPU ; j++)
        CloseHandle(semaphores[j]);
    CloseHandle(allThreadsSemaphore);
    printf("Done\n");
    getc(stdin);
}

Related

A semaphore implementation with Peterson's N process algorithm

I need feedback on my code for the following problem statement; am I on the right path?
Problem statement:
a. Implement a semaphore class that has a private int and three public methods: init, wait and signal. The wait and signal methods should behave as expected from a semaphore and must use Peterson's N process algorithm in their implementation.
b. Write a program that creates 5 threads that concurrently update the value of a shared integer and use an object of semaphore class created in part a) to ensure the correctness of the concurrent updates.
Here is my working program:
#include <iostream>
#include <pthread.h>
using namespace std;

pthread_mutex_t mid = PTHREAD_MUTEX_INITIALIZER; // mutex id
int shared = 0;                                  // global shared variable

class semaphore {
    int counter;
public:
    semaphore() {
    }
    void init() {
        counter = 1;                // initialise counter to 1 so the first thread gets access
    }
    void wait() {
        pthread_mutex_lock(&mid);   // lock the mutex here
        while (1) {
            if (counter > 0) {      // check the counter value
                counter--;          // decrement counter
                break;              // break the loop
            }
        }
        pthread_mutex_unlock(&mid); // unlock mutex here
    }
    void signal() {
        pthread_mutex_lock(&mid);   // lock the mutex here
        counter++;                  // increment counter
        pthread_mutex_unlock(&mid); // unlock mutex here
    }
};

semaphore sm;

void* fun(void* id)
{
    sm.wait();                      // call semaphore wait
    shared++;                       // increment shared variable
    cout << "Inside thread " << shared << endl;
    sm.signal();                    // call signal to semaphore
    return NULL;
}

int main() {
    pthread_t id[5];                // thread ids for 5 threads
    sm.init();
    int i;
    for (i = 0; i < 5; i++)         // create 5 threads
        pthread_create(&id[i], NULL, fun, NULL);
    for (i = 0; i < 5; i++)
        pthread_join(id[i], NULL);  // join 5 threads to complete their task
    cout << "Outside thread " << shared << endl; // final value of shared variable
    return 0;
}
You need to release the mutex while spinning in the wait loop.
The test happens to work because the threads very likely run their functions start to finish before there is any context switch, and hence each one finishes before the next one even starts. So you have no contention over the semaphore. If you did, they'd get stuck, with one waiter spinning while holding the mutex, preventing anyone else from accessing the counter and hence from releasing the spinner.
Here's an example that works (though it may still have an initialization race that causes it to sporadically not launch correctly). It looks more complicated, mainly because it uses the gcc built-in atomic operations. These are needed whenever you have more than a single core, since each core has its own cache. Declaring the counters 'volatile' only helps with compiler optimization - for what is effectively SMP, cache consistency requires cross-processor cache invalidation, which means special processor instructions need to be used. You can try replacing them with e.g. counter++ and counter-- (and same for 'shared') - and observe how on a multi-core CPU it won't work. (For more details on the gcc atomic ops, see https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/_005f_005fatomic-Builtins.html)
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>

class semaphore {
    pthread_mutex_t lock;
    int32_t counter;
public:
    semaphore() {
        pthread_mutex_init(&lock, NULL); // the mutex must be initialized before use
        init();
    }
    void init() {
        counter = 1; // initialise counter to 1 to get first access
    }
    void spinwait() {
        while (true) {
            // Spin, waiting until we see a positive counter
            while (__atomic_load_n(&counter, __ATOMIC_SEQ_CST) <= 0)
                ;
            pthread_mutex_lock(&lock);
            if (__atomic_load_n(&counter, __ATOMIC_SEQ_CST) <= 0) {
                // Someone else stole the count from under us or it was
                // a fluke - keep trying
                pthread_mutex_unlock(&lock);
                continue;
            }
            // It's ours
            __atomic_fetch_add(&counter, -1, __ATOMIC_SEQ_CST);
            pthread_mutex_unlock(&lock);
            return;
        }
    }
    void signal() {
        pthread_mutex_lock(&lock);   // lock the mutex here
        __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
        pthread_mutex_unlock(&lock); // unlock mutex here
    }
};

enum {
    NUM_TEST_THREADS = 5,
    NUM_BANGS = 1000
};

// Making semaphore sm volatile would be complicated, because the
// pthread_mutex library calls don't expect volatile arguments.
int shared = 0;               // Global shared variable
semaphore sm;                 // Semaphore protecting shared variable
volatile int num_workers = 0; // So we can wait until we have N threads

void* fun(void* id)
{
    usleep(100000); // 0.1s. Encourage context switch.
    const int worker = (intptr_t)id + 1;
    printf("Worker %d ready\n", worker);
    // Spin, waiting for all workers to be in a runnable state. These printouts
    // could be out of order.
    ++num_workers;
    while (num_workers < NUM_TEST_THREADS)
        ;
    // Go!
    // Bang on the semaphore. Odd workers increment, even decrement.
    if (worker & 1) {
        for (int n = 0; n < NUM_BANGS; ++n) {
            sm.spinwait();
            __atomic_fetch_add(&shared, 1, __ATOMIC_SEQ_CST);
            sm.signal();
        }
    } else {
        for (int n = 0; n < NUM_BANGS; ++n) {
            sm.spinwait();
            __atomic_fetch_add(&shared, -1, __ATOMIC_SEQ_CST);
            sm.signal();
        }
    }
    printf("Worker %d done\n", worker);
    return NULL;
}

int main() {
    pthread_t id[NUM_TEST_THREADS]; // thread ids
    // create test worker threads
    for (int i = 0; i < NUM_TEST_THREADS; i++)
        pthread_create(&id[i], NULL, fun, (void*)((intptr_t)(i)));
    // join threads to complete their task
    for (int i = 0; i < NUM_TEST_THREADS; i++)
        pthread_join(id[i], NULL);
    // final value of shared variable. For an odd number of
    // workers this is the loop count, NUM_BANGS
    printf("Test done. Final value: %d\n", shared);
    const int expected = (NUM_TEST_THREADS & 1) ? NUM_BANGS : 0;
    if (shared == expected) {
        puts("PASS");
    } else {
        printf("Value expected was: %d\nFAIL\n", expected);
    }
    return 0;
}

C++ threads: cannot unlock mutex in array after condition_variable wait

I am trying to synchronize one main thread with N child threads. After some reading, I used condition_variable and unique_lock. However, I always get the error condition_variable::wait: mutex not locked: Operation not permitted or unique_lock::unlock: not locked: Operation not permitted on OS X. On Linux, I just get Operation not permitted.
To be clearer: my goal is to get a sequence of prints:
main thread, passing to 0
thread 0, passing back to main
main thread, passing to 0
thread 0, passing back to main
...
for each of the four threads.
I adapted the code from the example in http://en.cppreference.com/w/cpp/thread/condition_variable. This example uses unlock after wait, and it works wonderfully with only one thread other than main (N=1). But when adapted to work with N>1 threads, the error above happens.
Yam Marcovic said in the comments that I should not use unlock. But then, why does the cppreference example use it? And why does it work well with one main thread and one other thread?
Here is the code:
#include <cstdio>
#include <thread>
#include <mutex>
#include <condition_variable>
using namespace std;

constexpr int N_THREADS = 4;
constexpr int N_ITER = 10;

bool in_main[N_THREADS] = {false};

void fun(mutex *const mtx, condition_variable *const cv, int tid){
    for(int i=0; i<N_ITER; i++) {
        unique_lock<mutex> lk(*mtx);
        // Wait until in_main[tid] is false
        cv->wait(lk, [=]{return !in_main[tid];});
        // After the wait we own the lock on mtx, which is in lk
        printf("thread %d, passing back to main\n", tid);
        in_main[tid] = true;
        lk.unlock(); // error here, but example uses unlock
        cv->notify_one();
    }
}

int main(int argc, char *argv[]) {
    // We are going to create N_THREADS threads. Create mutexes and
    // condition_variables for all of them.
    mutex mtx[N_THREADS];
    condition_variable cv[N_THREADS];
    thread t[N_THREADS];
    // Create N_THREADS unique_locks for using the condition_variable with each
    // thread
    unique_lock<mutex> lk[N_THREADS];
    for(int i=0; i<N_THREADS; i++) {
        lk[i] = unique_lock<mutex>(mtx[i]);
        // Create the new thread, giving it its thread id, the mutex and the
        // condition_variable,
        t[i] = thread(fun, &mtx[i], &cv[i], i);
    }
    for(int i=0; i < N_ITER*N_THREADS; i++) {
        int tid=i % N_THREADS; // Thread id
        // Wait until in_main[tid] is true
        cv[tid].wait(lk[tid], [=]{return in_main[tid];});
        // After the wait we own the lock on mtx[tid], which is in lk[tid]
        printf("main thread, passing to %d\n", tid);
        in_main[tid] = false;
        lk[tid].unlock(); // error here, but example uses unlock
        cv[tid].notify_one();
    }
    for(int i=0; i<N_THREADS; i++)
        t[i].join();
    return 0;
}
Sample output:
thread 0, passing back to main
main thread, passing to 0
thread 1, passing back to main
thread 0, passing back to main
main thread, passing to 1
thread 2, passing back to main
thread 1, passing back to main
main thread, passing to 2
thread 2, passing back to main
thread 3, passing back to main
main thread, passing to 3
main thread, passing to 0
thread 3, passing back to main
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: unique_lock::unlock: not locked: Operation not permitted
Abort trap: 6
You are trying to unlock your mutexes too many times! Look at the code carefully:
for(int i=0; i < N_ITER*N_THREADS; i++) {
    int tid=i % N_THREADS; // Thread id
Since N_ITER is 10 and N_THREADS is 4 always (they are constexpr), we get:
for(int i=0; i < 40; i++) {
    int tid=i % 4; // Thread id
So, when i = 0 the mutex in lk[0] is unlocked, and then when i = 4, tid = 4 % 4 is 0 again and you are unlocking it again! std::system_error is thrown in this case.
Plus, why all of these C pointers anyway? It's not like any of them can be null at any time. Switch to references.
Also, when dealing with array indexes the usual convention is to use size_t rather than int.
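For example, a minimal sketch of the "switch to references" suggestion (names taken from the question's code; std::thread copies its arguments, so the references have to be wrapped in std::ref):
void fun(std::mutex &mtx, std::condition_variable &cv, int tid); // signature changes accordingly

// in main():
for (int i = 0; i < N_THREADS; i++)
    t[i] = std::thread(fun, std::ref(mtx[i]), std::ref(cv[i]), i);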
I found what the problem is. This question Using std::mutex, std::condition_variable and std::unique_lock helped me.
Constructing a unique_lock also acquires the lock. So it must be done inside the loop, just before calling wait. The function fun looks the same, but main now looks like this:
int main(int argc, char *argv[]) {
    // We are going to create N_THREADS threads. Create mutexes and
    // condition_variables for all of them.
    mutex mtx[N_THREADS];
    condition_variable cv[N_THREADS];
    thread t[N_THREADS];
    // The unique_locks are no longer created up front; each one is now
    // constructed inside the wait loop below.
    for(int i=0; i<N_THREADS; i++) {
        // Create the new thread, giving it its thread id, the mutex and the
        // condition_variable,
        t[i] = thread(fun, &mtx[i], &cv[i], i);
        // DO NOT construct, and therefore acquire, a unique_lock here
    }
    for(int i=0; i < N_ITER*N_THREADS; i++) {
        int tid=i % N_THREADS; // Thread id
        // Acquire the unique_lock here
        unique_lock<mutex> lk(mtx[tid]);
        // Wait until in_main[tid] is true
        cv[tid].wait(lk, [=]{return in_main[tid];});
        // After the wait we own the lock on mtx[tid], which is in lk
        printf("main thread, passing to %d\n", tid);
        in_main[tid] = false;
        lk.unlock(); // unlock before notifying; no error this time
        cv[tid].notify_one();
    }
    for(int i=0; i<N_THREADS; i++)
        t[i].join();
    return 0;
}
The only difference is that the unique_lock is constructed inside the loop.

why does pthread_cond_signal cause deadlock

I am new to condition variables and get a deadlock if I do not use pthread_cond_broadcast().
#include <iostream>
#include <pthread.h>

pthread_mutex_t m_mut = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
bool ready = false;

void* print_id (void *ptr )
{
    pthread_mutex_lock(&m_mut);
    while (!ready) pthread_cond_wait(&cv, &m_mut);
    int id = *((int*) ptr);
    std::cout << "thread " << id << '\n';
    pthread_mutex_unlock(&m_mut);
    pthread_exit(0);
    return NULL;
}
The condition is changed here:
void go() {
    pthread_mutex_lock(&m_mut);
    ready = true;
    pthread_mutex_unlock(&m_mut);
    pthread_cond_signal(&cv);
}
It can work if I change the last line of go() to pthread_cond_broadcast(&cv);
int main ()
{
    pthread_t threads[10];
    // spawn 10 threads:
    for (int i=0; i<10; i++)
        pthread_create(&threads[i], NULL, print_id, (void *) new int(i));
    go();
    for (int i=0; i<10; i++) pthread_join(threads[i], NULL);
    pthread_mutex_destroy(&m_mut);
    pthread_cond_destroy(&cv);
    return 0;
}
The expected answer (arbitrary order) is
thread 0
....
thread 9
However, on my machine (Ubuntu), it prints nothing.
Could anyone tell me the reason? Thanks.
From the manual page (with my emphasis):
pthread_cond_signal restarts one of the threads that are waiting on the condition variable cond. If no threads are waiting on cond, nothing happens. If several threads are waiting on cond, exactly one is restarted, but it is not specified which.
pthread_cond_broadcast restarts all the threads that are waiting on the condition variable cond. Nothing happens if no threads are waiting on cond.
Each of your ten threads is waiting on the same condition. You only call go() once - that's from main(). This calls pthread_cond_signal, which will only signal one of the threads (an arbitrary one). All the others will still be waiting, and hence the pthread_join hangs as they won't terminate. When you switch it to pthread_cond_broadcast, all of the threads are triggered.
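In code, the fix is the one-line change already mentioned in the question: wake every waiter instead of just one.
void go() {
    pthread_mutex_lock(&m_mut);
    ready = true;
    pthread_mutex_unlock(&m_mut);
    pthread_cond_broadcast(&cv); // all ten waiting threads are released
}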

WinAPI's sleep doesn't work inside child thread

I'm a beginner and I'm trying to reproduce a race condition in order to familiarize myself with the issue. In order to do that, I created the following program:
#include <Windows.h>
#include <iostream>
using namespace std;

#define numThreads 1000

DWORD __stdcall addOne(LPVOID pValue)
{
    int* ipValue = (int*)pValue;
    *ipValue += 1;
    Sleep(5000ull);
    *ipValue += 1;
    return 0;
}

int main()
{
    int value = 0;
    HANDLE threads[numThreads];
    for (int i = 0; i < numThreads; ++i)
    {
        threads[i] = CreateThread(NULL, 0, addOne, &value, 0, NULL);
    }
    WaitForMultipleObjects(numThreads, threads, true, INFINITE);
    cout << "resulting value: " << value << endl;
    return 0;
}
I added the Sleep inside the thread's function in order to reproduce the race condition because, as I understand it, if I just add one as the workload, the race condition doesn't manifest itself: a thread is created, then it runs the workload, and it happens to finish before the thread created on the next iteration starts its workload. My problem is that the Sleep() inside the workload seems to be ignored. I set the parameter to 5 seconds and I expect the program to run for at least 5 seconds, but instead it finishes immediately. When I place Sleep(5000) inside the main function, the program runs as expected (> 5 secs). Why is the Sleep inside the thread function ignored?
But anyway, even if the Sleep() is ignored, the program outputs this every time it is launched:
resulting value: 1000
while the correct answer should be 2000. Can you guess why is that happening?
WaitForMultipleObjects only allows waiting for up to MAXIMUM_WAIT_OBJECTS (which is currently 64) threads at a time. If you take that into account:
#include <Windows.h>
#include <iostream>
using namespace std;

#define numThreads MAXIMUM_WAIT_OBJECTS

DWORD __stdcall addOne(LPVOID pValue) {
    int* ipValue=(int*)pValue;
    *ipValue+=1;
    Sleep(5000);
    *ipValue+=1;
    return 0;
}

int main() {
    int value=0;
    HANDLE threads[numThreads];
    for (int i=0; i < numThreads; ++i) {
        threads[i]=CreateThread(NULL, 0, addOne, &value, 0, NULL);
    }
    WaitForMultipleObjects(numThreads, threads, true, INFINITE);
    cout<<"resulting value: "<<value<<endl;
    return 0;
}
...things work much more as you'd expect. Whether you'll actually see results from the race condition is, of course, a rather different story--but on multiple runs, I do see slight variations in the resulting value (e.g., a low of around 125).
Jerry Coffin has the right answer, but just to save you typing:
#include <Windows.h>
#include <iostream>
#include <assert.h>
using namespace std;

#define numThreads 1000

DWORD __stdcall addOne(LPVOID pValue)
{
    int* ipValue = (int*)pValue;
    *ipValue += 1;
    Sleep(5000);
    *ipValue += 1;
    return 0;
}

int main()
{
    int value = 0;
    HANDLE threads[numThreads];
    for (int i = 0; i < numThreads; ++i)
    {
        threads[i] = CreateThread(NULL, 0, addOne, &value, 0, NULL);
    }
    DWORD Status = WaitForMultipleObjects(numThreads, threads, true, INFINITE);
    assert(Status != WAIT_FAILED);
    cout << "resulting value: " << value << endl;
    return 0;
}
When things go wrong, make sure you've asserted the return value of any Windows API function that can fail. If you really badly need to wait on lots of threads, it is possible to overcome the 64-thread limit by chaining. I.e., for every additional 64 threads you need to wait on, you sacrifice a thread whose sole purpose is to wait on 64 other threads, and so on. We (Windows Developer's Journal) published an article demonstrating the technique years ago, but I can't recall the author name off the top of my head.
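A rough sketch of that chaining technique (my own illustration, not the article's code): each helper thread waits on a chunk of up to MAXIMUM_WAIT_OBJECTS handles, and the caller then waits on the at-most-64 helper threads, which covers up to 64*64 = 4096 handles. Error handling is omitted.
#include <windows.h>
#include <string.h>

struct Chunk { DWORD count; HANDLE handles[MAXIMUM_WAIT_OBJECTS]; };

DWORD WINAPI WaitOnChunk(LPVOID p)
{
    Chunk *c = (Chunk *)p;
    WaitForMultipleObjects(c->count, c->handles, TRUE, INFINITE);
    return 0;
}

void WaitForAll(HANDLE *all, DWORD n)
{
    Chunk  chunks[MAXIMUM_WAIT_OBJECTS];
    HANDLE waiters[MAXIMUM_WAIT_OBJECTS];
    DWORD waiterCount = 0;
    for (DWORD i = 0; i < n; i += MAXIMUM_WAIT_OBJECTS)
    {
        Chunk &c = chunks[waiterCount];
        c.count = (n - i < MAXIMUM_WAIT_OBJECTS) ? (n - i) : MAXIMUM_WAIT_OBJECTS;
        memcpy(c.handles, all + i, c.count * sizeof(HANDLE));
        // sacrifice one thread per chunk whose only job is to wait
        waiters[waiterCount++] = CreateThread(NULL, 0, WaitOnChunk, &c, 0, NULL);
    }
    WaitForMultipleObjects(waiterCount, waiters, TRUE, INFINITE);
    for (DWORD i = 0; i < waiterCount; i++)
        CloseHandle(waiters[i]);
}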

How to overcome the MAXIMUM_WAIT_OBJECTS restriction of WaitForMultipleObjects?

Because of the MAXIMUM_WAIT_OBJECTS restriction of the WaitForMultipleObjects function, I tried to write my own "wait for threads" function, but I didn't get it to work. Can you give me a hint on how to do it?
This is my "wait for threads" function:
void WaitForThreads(std::set<HANDLE>& handles)
{
    for (int i = 0; i < SECONDSTOWAIT; i++)
    {
        // erase idiom
        for (std::set<HANDLE>::iterator it = handles.begin();
             it != handles.end();)
        {
            if (WaitForSingleObject(*it, 0) == WAIT_OBJECT_0)
                handles.erase(it++);
            else
                ++it;
        }
        if (!handles.size())
            // all threads terminated
            return;
        Sleep(1000);
    }
    // handles.size() threads still running
    handles.clear();
}
As long as the thread runs WaitForSingleObject returns WAIT_TIMEOUT but when the thread terminates the return value is WAIT_FAILED instead of WAIT_OBJECT_0. I guess the thread handle is no longer valid because GetLastError returns ERROR_INVALID_HANDLE.
The MSDN suggests following solutions:
Create a thread to wait on MAXIMUM_WAIT_OBJECTS handles, then wait on that thread plus the other handles. Use this technique to break the handles into groups of MAXIMUM_WAIT_OBJECTS.
Call RegisterWaitForSingleObject to wait on each handle. A wait thread from the thread pool waits on MAXIMUM_WAIT_OBJECTS registered objects and assigns a worker thread after the object is signaled or the time-out interval expires.
But it seems to me that both are too much effort.
Edit:
The threads are created with the MFC function AfxBeginThread. The returned CWinThread pointer is only used to get the associated handle.
CWinThread* thread = AfxBeginThread(LANAbfrage, par);
if ((*thread).m_hThread)
{
    threads.insert((*thread).m_hThread);
    helper::setStatus("%u LAN Threads active", threads.size());
}
else
    theVar->TraceN("Error: Can not create thread");
But it seems to me that both are too much effort.
If you want it to work with wait handles, that's what you'll have to do. But if all you need is something that will block until all of the threads have finished, you can use a Semaphore or perhaps a Synchronization Barrier.
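A minimal sketch of the Synchronization Barrier variant mentioned above (my illustration, requires Windows 8 or later): every worker plus the main thread enters the barrier, and nobody gets past it until all participants have arrived. Thread creation and the actual work are omitted.
#include <windows.h>

static const LONG WORKERS = 3;
SYNCHRONIZATION_BARRIER barrier;

DWORD CALLBACK Worker(void *)
{
    // ... do the work ...
    EnterSynchronizationBarrier(&barrier, 0); // "I'm done", wait for the others
    return 0;
}

int main()
{
    // WORKERS worker threads plus the main thread participate
    InitializeSynchronizationBarrier(&barrier, WORKERS + 1, -1);
    // ... create WORKERS threads running Worker ...
    EnterSynchronizationBarrier(&barrier, 0); // returns once every worker has arrived
    DeleteSynchronizationBarrier(&barrier);
    return 0;
}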
With the answer from Jim Mischel I found a solution. Semaphore Objects can solve two issues:
Waiting for all threads
Limiting the number of running threads
This is a small, self-contained example:
#include <iostream>
#include <vector>
#include <windows.h>

static const LONG SEMCOUNT = 3;

DWORD CALLBACK ThreadProc(void* vptr)
{
    HANDLE* sem = (HANDLE*)vptr;
    Sleep(10000);
    ReleaseSemaphore(*sem, 1, NULL);
    return 0;
}

int main()
{
    HANDLE semh = CreateSemaphore(NULL, SEMCOUNT, SEMCOUNT, 0);
    // create 10 threads, but only SEMCOUNT threads run at once
    for (int i = 0; i < 10; i++)
    {
        DWORD id;
        WaitForSingleObject(semh, INFINITE);
        HANDLE h = CreateThread(NULL, 0, ThreadProc, (void*)&semh, 0, &id);
        if (h)
            CloseHandle(h); // the thread handle is not needed; we wait on the semaphore
    }
    // wait until all threads have released the semaphore
    for (LONG j = 0; j < SEMCOUNT; j++)
    {
        WaitForSingleObject(semh, INFINITE);
        std::cout << "Semaphore count = " << j << std::endl;
    }
    std::cout << "All threads terminated" << std::endl;
    return 0;
}