Issue with POSIX thread synchronization and/or pthread_create() argument passing - c++

I am trying to create a number of different threads that are required to wait for all of the threads to be created before they can perform any actions. This is a smaller part of a larger program; I am just trying to take it in steps. As each thread is created, it is immediately blocked by a semaphore. After all of the threads have been created, I loop through and release all the threads. I then want each thread to print its thread number to verify that they all waited. I only allow one thread to print at a time using another semaphore.
The issue I'm having is that although I create threads #1-10, a thread prints that it is #11. Also, a few threads say they have the same number as another one. Is the error in how I pass the thread ID, or is it somehow in my synchronization?
Here is relevant code:
//Initialize semaphore to 0. Then each time a thread is spawned it will call
//semWait() making the value negative and blocking that thread. Once all of the
//threads are created, semSignal() will be called to release each of the threads
sem_init(&threadCreation, 0, 0);
sem_init(&tester, 0, 1);

//Spawn all of the opener threads, 1 for each valve
pthread_t threads[T_Valve_Numbers];
int check;

//Loop starts at 1 instead of the standard 0 so that numbering of valves
//is somewhat more logical.
for (int i = 1; i <= T_Valve_Numbers; i++)
{
    cout << "Creating thread: " << i << endl;
    check = pthread_create(&threads[i], NULL, Valve_Handler, (void*)&i);
    if (check)
    {
        cout << "Couldn't create thread " << i << " Error: " << check << endl;
        exit(-1);
    }
}

//Release all of the blocked threads now that they have all been created
for (int i = 1; i <= T_Valve_Numbers; i++)
{
    sem_post(&threadCreation);
}

//Make the main process wait for all the threads before terminating
for (int i = 1; i <= T_Valve_Numbers; i++)
{
    pthread_join(threads[i], NULL);
}
return 0;
}

void* Valve_Handler(void* threadNumArg)
{
    int threadNum = *((int *)threadNumArg);
    sem_wait(&threadCreation); //Blocks the thread until all are spawned
    sem_wait(&tester);
    cout << "I'm thread " << threadNum << endl;
    sem_post(&tester);
}
When T_Valve_Numbers = 10, some sample output is:
Creating thread: 1
Creating thread: 2
Creating thread: 3
Creating thread: 4
Creating thread: 5
Creating thread: 6
Creating thread: 7
Creating thread: 8
Creating thread: 9
Creating thread: 10
I'm thread 11 //Where is 11 coming from?
I'm thread 8
I'm thread 3
I'm thread 4
I'm thread 10
I'm thread 9
I'm thread 7
I'm thread 3
I'm thread 6
I'm thread 6 //How do I have 2 6's?
OR
Creating thread: 1
Creating thread: 2
Creating thread: 3
Creating thread: 4
Creating thread: 5
Creating thread: 6
Creating thread: 7
Creating thread: 8
Creating thread: 9
Creating thread: 10
I'm thread 11
I'm thread 8
I'm thread 8
I'm thread 4
I'm thread 4
I'm thread 8
I'm thread 10
I'm thread 3
I'm thread 9
I'm thread 8 //Now '8' showed up 3 times
"I'm thread..." is printing 10 times so it appears like my semaphore is letting all of the threads through. I'm just not sure why their thread number is messed up.

check = pthread_create(&threads[i], NULL, Valve_Handler, (void*)&i);
                                                         ^^^^^^^^^
You're passing the thread start function the address of i. i is changing all the time in the main loop, unsynchronized with the thread functions. You have no idea what the value of i will be once the thread function gets around to actually dereferencing that pointer.
Pass in an actual integer rather than a pointer to the local variable if that's the only thing you'll ever need to pass. Otherwise, create a simple struct with all the parameters, build an array of those (one for each thread) and pass each thread a pointer to its own element.
Example: (assuming your thread index never overflows an int)
#include <stdint.h> // for intptr_t
...
check = pthread_create(..., (void*)(intptr_t)i);
...
int threadNum = (intptr_t)threadNumArg;
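For completeness, here is a minimal, self-contained sketch of that pass-by-value variant (the worker and NUM_THREADS names are illustrative, not from the original code):

// Sketch: pass the loop index by value, smuggled through the void* argument.
#include <cstdint>    // intptr_t
#include <cstdio>
#include <pthread.h>

static void* worker(void* arg)
{
    // Recover the value that was packed into the pointer.
    int threadNum = static_cast<int>(reinterpret_cast<intptr_t>(arg));
    std::printf("I'm thread %d\n", threadNum);
    return nullptr;
}

int main()
{
    const int NUM_THREADS = 10;
    pthread_t threads[NUM_THREADS];

    for (int i = 1; i <= NUM_THREADS; ++i) {
        // Each thread receives a copy of i's current value, not a pointer to i.
        pthread_create(&threads[i - 1], nullptr, worker,
                       reinterpret_cast<void*>(static_cast<intptr_t>(i)));
    }
    for (int i = 0; i < NUM_THREADS; ++i) {
        pthread_join(threads[i], nullptr);
    }
    return 0;
}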
A better / more flexible example that doesn't require intptr_t (which might not exist on every platform):
struct thread_args {
    int thread_index;
    int thread_color;
    // ...
};
// ...
struct thread_args args[T_Valve_Numbers];
for (int i = 0; i < T_Valve_Numbers; i++) {
    args[i].thread_index = i;
    args[i].thread_color = ...;
}
// ...
check = pthread_create(..., &(args[i-1])); // or loop from 0, less surprising
A word of caution about this though: that array of thread arguments needs to stay alive at least as long as the threads will use it. In some situations, you might be better off with a dynamic allocation for each structure, passing that pointer (and its ownership) to the thread function (especially if you're going to detach the threads rather than joining them).
If you're going to join the threads at some point, keep those arguments around the same way you keep your pthread_t structures around. (And if you're creating and joining in the same function, the stack is usually fine for that.)
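To illustrate the dynamic-allocation variant mentioned above (a sketch only; ThreadArgs and worker are made-up names, and the threads are detached so each one frees its own argument block):

// Sketch: heap-allocate one argument block per thread and hand its ownership
// to the thread, which is useful when the threads are detached.
#include <cstdio>
#include <pthread.h>

struct ThreadArgs {
    int index;
};

static void* worker(void* p)
{
    ThreadArgs* args = static_cast<ThreadArgs*>(p);
    std::printf("I'm thread %d\n", args->index);
    delete args;               // the thread owns its argument block
    return nullptr;
}

int main()
{
    const int kThreads = 10;
    for (int i = 1; i <= kThreads; ++i) {
        pthread_t tid;
        ThreadArgs* args = new ThreadArgs{i};    // one block per thread
        pthread_create(&tid, nullptr, worker, args);
        pthread_detach(tid);                     // nobody will join these threads
    }
    // Exit the main thread only; the process ends once the detached threads finish.
    pthread_exit(nullptr);
}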

Related

thread_local in C++ 11 isn't consistent

Why isn't the output consistently 1 when I declared the Counter object as thread_local?
#include <iostream>
#include <thread>

#define MAX 10

class Counter {
    int c;
public:
    Counter() : c(0) {}
    void increment() {
        ++c;
    }
    ~Counter() {
        std::cout << "Thread " << std::this_thread::get_id() << " having counter value " << c << "\n";
    }
};

thread_local Counter c;

void doWork() {
    c.increment();
}

int main() {
    std::thread t[MAX];
    for ( int i = 0; i < MAX; i++ )
        t[i] = std::thread(doWork);
    for ( int i = 0; i < MAX; i++ )
        t[i].join();
    return 0;
}
Output :
Thread 2 having counter value 19469616
Thread 3 having counter value 1
Thread 4 having counter value 19464528
Thread 5 having counter value 19464528
Thread 7 having counter value 1
Thread 6 having counter value 1
Thread 8 having counter value 1
Thread 9 having counter value 1
Thread 10 having counter value 1
Thread 11 having counter value 1
The only interesting thing I got out of exploring this a bit is that MSVC operates differently than clang/gcc.
Here is MSVC's output on the latest compiler:
Thread 34316 having counter value 1
Thread 34312 having counter value 1
Thread 34320 having counter value 1
Thread 34328 having counter value 1
Thread 34324 having counter value 1
Thread 34332 having counter value 1
Thread 34336 having counter value 1
Thread 34340 having counter value 1
Thread 34344 having counter value 1
Thread 34348 having counter value 1
Thread 29300 having counter value 0
c:\dv\TestCpp\out\build\x64-Debug (default)\TestCpp.exe (process 27748) exited with code 0.
Notice: "Thread 29300 having counter value 0". This is the main thread. Since doWork() was never called in the main thread the counter value is 0. This is working as I would expect things to work. "c" is a global Counter object - it should be created in all threads, including the main thread.
However, when I run (and godbolt doesn't seem to like to run multithreaded programs, AFAICT) under clang and gcc using rextester.com I get:
Thread 140199006193408 having counter value 1
Thread 140198906181376 having counter value 1
Thread 140198806169344 having counter value 1
Thread 140198706157312 having counter value 1
Thread 140199106205440 having counter value 1
Thread 140199206217472 having counter value 1
Thread 140199306229504 having counter value 1
Thread 140199406241536 having counter value 1
Thread 140199506253568 having counter value 1
Thread 140199606265600 having counter value 1
Notice: no main thread with the counter value of 0. The "c" Counter object was never created and destroyed in the main thread under gcc and clang; however, it was created and destroyed in the main thread under MSVC.
I believe your particular problem is that you are using an old compiler version that has a bug in it. I couldn't reproduce your issue with the latest version of any of the compilers I tried, but I am going to file a bug with the MSVC team, because this is standard C++ and one figures it should produce the same results regardless of the compiler, and it doesn't.
It seems (examining https://en.cppreference.com/w/cpp/language/storage_duration) that MSVC may be doing it wrong, as the thread_local "Counter c" object was never explicitly accessed in the main thread and yet the object was created and destroyed.
We'll see what they say - I have a few other bugs pending with them...
Further research: The cpp reference article seems to indicate that since this is a "pure global" - i.e. it isn't a local static thread local object like:
Foo& GetFooThreadSingleton()
{
    static thread_local Foo foo;
    return foo;
}
then the way I read the cppreference article is: it's a global like any other global, in which case MSVC is doing it correctly.
Someone please correct me if I am wrong. Thanks.
Also, without running it natively in Ubuntu inside gdb and checking, I cannot be sure that the global wasn't actually created and destroyed in the main thread under gcc/clang, because it could be that we just missed that output for some reason. I'll admit this seems unlikely. I'll try this later because this issue has intrigued me a bit.
I ran it natively in Ubuntu and got what I expected: The global "Counter c" is only created and destroyed if it is accessed in the main thread. So my thought is that this is a bug in gcc/clang at this point.
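One way to see that directly (a sketch, not from the original post; it reuses the question's Counter, doWork, and MAX) is to touch the thread_local object from the main thread as well and watch for an extra destructor line:

// Same Counter/doWork as above; only main() differs.
int main() {
    std::thread t[MAX];
    for (int i = 0; i < MAX; i++)
        t[i] = std::thread(doWork);
    for (int i = 0; i < MAX; i++)
        t[i].join();

    c.increment();  // first use of "c" in the main thread: its thread-local
                    // instance must be constructed no later than this point,
                    // so an extra "having counter value 1" line appears when
                    // the main thread exits.
    return 0;
}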

C++ track thread completion

I have a vector of 25 elements and I want to process 5 elements concurrently in different threads.
But at any time I don't want more than 5 threads, i.e. at any one time 5 elements are being processed (0-4, 5-9, 10-14, 15-19, 20-24).
The code is as follows
void test(vector<string> &input){
    vector<std::thread> threads;
    for(int i = 0; i < 25; ++i){
        threads.push_back(std::thread(func, input, i)); // func is the function run by the thread; it accepts 2 arguments: vector<string> and int
        if(threads.size() == 5){
            for (auto &th : threads) {
                th.join();
            }
            threads.clear();
        }
    }
}
Here the code waits until all 5 threads have completed and only then processes the next set.
Is there a way to start a new thread as soon as any thread completes, while keeping the total count of threads <= 5?
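One common approach (a sketch only, assuming C++11; func here is a stand-in for the real worker) is to guard a running-thread counter with a mutex and condition variable, and start a new thread whenever the count drops below 5:

#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>
using namespace std;

void func(vector<string> input, int i) {               // placeholder worker
    this_thread::sleep_for(chrono::milliseconds(100));
    cout << "processed element " << i << "\n";
}

void test(vector<string> &input) {
    mutex m;
    condition_variable cv;
    int running = 0;                                    // workers currently executing func
    vector<thread> threads;

    for (int i = 0; i < 25; ++i) {
        unique_lock<mutex> lock(m);
        cv.wait(lock, [&] { return running < 5; });     // block while 5 are busy
        ++running;
        lock.unlock();

        threads.emplace_back([&, i] {
            func(input, i);
            {
                lock_guard<mutex> guard(m);
                --running;                              // free a slot
            }
            cv.notify_one();
        });
    }
    for (auto &th : threads) th.join();                 // finished threads linger until here
}

int main() {
    vector<string> input(25, "item");
    test(input);
    return 0;
}

At most 5 workers execute func at any moment; threads that have already finished simply wait to be joined at the end. With C++20 the same bookkeeping can be expressed with std::counting_semaphore.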

weird pthread_mutex_t behavior

Consider the following piece of code:
#include <iostream>
#include <pthread.h>
using namespace std;

int sharedIndex = 10;
pthread_mutex_t mutex;

void* foo(void* arg)
{
    while(sharedIndex >= 0)
    {
        pthread_mutex_lock(&mutex);
        cout << sharedIndex << endl;
        sharedIndex--;
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

int main() {
    pthread_t p1;
    pthread_t p2;
    pthread_t p3;

    pthread_create(&p1, NULL, foo, NULL);
    pthread_create(&p2, NULL, foo, NULL);
    pthread_create(&p3, NULL, foo, NULL);

    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    pthread_join(p3, NULL);

    return 0;
}
I simply created three pthreads and gave them all the same function foo, in hope that every thread, at its turn, will print and decrement the sharedIndex.
But this is the output -
10
9
8
7
6
5
4
3
2
1
0
-1
-2
I don't understand why the process doesn't stop when sharedIndex reaches 0.
sharedIndex is protected by a mutex. How come it's accessed after it became 0? Aren't the threads supposed to skip directly to return NULL;?
EDIT
In addition, it seems that only the first thread decrements sharedIndex.
Why doesn't every thread decrement the shared resource in its turn?
Here's the output after a fix -
Current thread: 140594495477504
10
Current thread: 140594495477504
9
Current thread: 140594495477504
8
Current thread: 140594495477504
7
Current thread: 140594495477504
6
Current thread: 140594495477504
5
Current thread: 140594495477504
4
Current thread: 140594495477504
3
Current thread: 140594495477504
2
Current thread: 140594495477504
1
Current thread: 140594495477504
0
Current thread: 140594495477504
Current thread: 140594478692096
Current thread: 140594487084800
I want all of the threads to decrement the shared resource - meaning that on every context switch, a different thread accesses the resource and does its thing.
This program's behaviour is undefined.
You have not initialized the mutex. You need to either call pthread_mutex_init or statically initialize it:
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
You read this variable outside the critical section:
while(sharedIndex >= 0)
That means you could read a garbage value while another thread is updating it. You should not read the shared variable until you have locked the mutex and have exclusive access to it.
Edit:
it seems that only the first thread decrements the sharedIndex
That's because of the undefined behaviour. Fix the problems above and you should see other threads run.
With your current code the compiler is allowed to assume that the sharedIndex is never updated by other threads, so it doesn't bother re-reading it, but just lets the first thread run ten times, then the other two threads run once each.
Meaning, every contex switch, a different thread will access the resource and do its thing.
There is no guarantee that pthread mutexes behave fairly. If you want to guarantee a round-robin behaviour where each thread runs in turn then you will need to impose that yourself, e.g. by having another shared variable (and maybe a condition variable) that says which thread's turn it is to run, and blocking the other threads until it is their turn.
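As a sketch of that last suggestion (illustrative only, not part of the original answer), a turn counter plus a condition variable forces strict round-robin ordering between the threads:

// Strict round-robin between N threads using a turn counter and a condition variable.
#include <iostream>
#include <pthread.h>
#include <stdint.h>

const int NUM_THREADS = 3;
int sharedIndex = 10;
int turn = 0;                                        // whose turn it is (0..NUM_THREADS-1)
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;

void* foo(void* arg)
{
    int me = (int)(intptr_t)arg;
    for (;;)
    {
        pthread_mutex_lock(&mutex);
        while (turn != me && sharedIndex >= 0)
            pthread_cond_wait(&cond, &mutex);        // sleep until it's our turn
        if (sharedIndex < 0) {                       // all done
            pthread_mutex_unlock(&mutex);
            break;
        }
        std::cout << "thread " << me << ": " << sharedIndex-- << std::endl;
        turn = (turn + 1) % NUM_THREADS;             // hand the turn to the next thread
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&mutex);
    }
    pthread_cond_broadcast(&cond);                   // make sure nobody is left waiting
    return NULL;
}

int main()
{
    pthread_t t[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_create(&t[i], NULL, foo, (void*)(intptr_t)i);
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(t[i], NULL);
    return 0;
}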
The threads will be hanging out on pthread_mutex_lock(&mutex); waiting to get the lock. Once a thread decrements the value to 0 and releases the lock, the next thread waiting at the lock will then go about its business (making the value -1), and the same for the next thread (making the value -2).
You need to alter your logic for checking the value and locking the mutex.
int sharedIndex = 10;
pthread_mutex_t mutex;

void* foo(void* arg)
{
    while(sharedIndex >= 0)
    {
        pthread_mutex_lock(&mutex);
        cout << sharedIndex << endl;
        sharedIndex--;
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}
According to this code, sharedIndex is the shared resource for all the threads.
Thus each access to it (both read and write) should be wrapped by the mutex.
Otherwise, consider the situation where all the threads sample sharedIndex simultaneously and its value is 1.
All threads then enter the while loop, and each one decrements sharedIndex by one, leaving it at -2 at the end.
EDIT
Possible fix (one option among several):
bool is_positive;
do
{
    pthread_mutex_lock(&mutex);
    is_positive = (sharedIndex >= 0);
    if (is_positive)
    {
        cout << sharedIndex << endl;
        sharedIndex--;
    }
    pthread_mutex_unlock(&mutex);
} while (is_positive);
EDIT2
Note that you must initialize the mutex:
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

pthread sleep() function stop whole execution

I am writing a multi-threaded C++ program; below is a simple function I used for tests. If I comment out the sleep() there, the program works. However, if I put the sleep() in the loop, there is no output on screen; it seems like the sleep() kills the whole execution. Please help. Does the sleep function kill execution of a pthread program? How can I sleep the threads?
void *shop(void *array){
    int second = difftime( time(0), start );
    // init shopping list with characters
    string *info = (string *)array;
    int size = stoi(info[0]);
    string character = info[1];
    string career = info[2];
    if ( career == "Auror" ) {
        int process = 3;
        while (process < size+1) {
            int shop = stoi(info[process]);
            pthread_mutex_lock(&counter);
            cout << process << endl;
            sleep(1);
            pthread_mutex_unlock(&counter);
            process++;
        }
    }
The output without sleep is:
3
4
5
6
3
4
5
6
The output with sleep() is like:
3
3
It breaks the while loop.
Let the main thread wait for the child threads to finish their work through pthread_join.
The pthread_join() function waits for the thread specified by thread to terminate. If that thread has already terminated, then pthread_join() returns immediately. The thread specified by thread must be joinable.
Yes, if the main thread dies, your app dies, including all of its worker threads.
Add a getchar(), a system("pause"), or something similar in your main thread.
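For illustration, here is a reduced sketch (not the asker's full program; the loop bounds and thread count are made up) where main() joins its workers, so a sleep() inside a worker no longer appears to kill the program:

#include <iostream>
#include <pthread.h>
#include <unistd.h>   // sleep()

pthread_mutex_t counter = PTHREAD_MUTEX_INITIALIZER;

void* shop(void* arg)
{
    for (int process = 3; process < 7; ++process) {
        pthread_mutex_lock(&counter);
        std::cout << process << std::endl;
        sleep(1);                        // simulate slow work while holding the lock
        pthread_mutex_unlock(&counter);
    }
    return NULL;
}

int main()
{
    pthread_t t[2];
    for (int i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, shop, NULL);

    // Without these joins, main() returning would terminate the whole process,
    // including threads that are still asleep.
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);

    return 0;
}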

How to make Win32/MFC threads loop in lockstep?

I'm new to multithreading in Windows, so this might be a trivial question: what's the easiest way of making sure that threads perform a loop in lockstep?
I tried passing a shared array of Events to all threads and using WaitForMultipleObjects at the end of the loop to synchronize them, but this gives me a deadlock after one, sometimes two, cycles. Here's a simplified version of my current code (with just two threads, but I'd like to make it scalable):
typedef struct
{
    int rank;
    HANDLE* step_events;
} IterationParams;

int main(int argc, char **argv)
{
    // ...
    IterationParams p[2];
    HANDLE step_events[2];
    for (int j=0; j<2; ++j)
    {
        step_events[j] = CreateEvent(NULL, FALSE, FALSE, NULL);
    }
    for (int j=0; j<2; ++j)
    {
        p[j].rank = j;
        p[j].step_events = step_events;
        AfxBeginThread(Iteration, p+j);
    }
    // ...
}

UINT Iteration(LPVOID pParam)
{
    IterationParams* p = (IterationParams*)pParam;
    int rank = p->rank;
    for (int i=0; i<100; i++)
    {
        if (rank == 0)
        {
            printf("%dth iteration\n", i);
            // do something
            SetEvent(p->step_events[0]);
            WaitForMultipleObjects(2, p->step_events, TRUE, INFINITE);
        }
        else if (rank == 1)
        {
            // do something else
            SetEvent(p->step_events[1]);
            WaitForMultipleObjects(2, p->step_events, TRUE, INFINITE);
        }
    }
    return 0;
}
(I know I'm mixing C and C++, it's actually legacy C code that I'm trying to parallelize.)
Reading the docs at MSDN, I think this should work. However, thread 0 only prints once, occasionally twice, and then the program hangs. Is this a correct way of synchronizing threads? If not, what would you recommend (is there really no built-in support for a barrier in MFC?).
EDIT: this solution is WRONG, even including Alessandro's fix. For example, consider this scenario:
Thread 0 sets its event and calls Wait, blocks
Thread 1 sets its event and calls Wait, blocks
Thread 0 returns from Wait, resets its event, and completes a cycle without Thread 1 getting control
Thread 0 sets its own event and calls Wait. Since Thread 1 had no chance to reset its event yet, Thread 0's Wait returns immediately and the threads go out of sync.
So the question remains: how does one safely ensure that the threads stay in lockstep?
Introduction
I implemented a simple C++ program for your consideration (tested in Visual Studio 2010). It is using only Win32 APIs (and standard library for console output and a bit of randomization). You should be able to drop it into a new Win32 console project (without precompiled headers), compile and run.
Solution
#include <tchar.h>
#include <windows.h>

//---------------------------------------------------------
// Defines the synchronization info structure. All threads will
// use the same instance of this struct to implement the rendezvous/
// barrier synchronization pattern.
struct SyncInfo
{
    SyncInfo(int threadsCount) : Awaiting(threadsCount), ThreadsCount(threadsCount), Semaphore(::CreateSemaphore(0, 0, 1024, 0)) {};
    ~SyncInfo() { ::CloseHandle(this->Semaphore); }

    volatile unsigned int Awaiting; // how many threads still have to complete their iteration
    const int ThreadsCount;
    const HANDLE Semaphore;
};

//---------------------------------------------------------
// Thread-specific parameters. Note that Sync is a reference
// (i.e. all threads share the same SyncInfo instance).
struct ThreadParams
{
    ThreadParams(SyncInfo &sync, int ordinal, int delay) : Sync(sync), Ordinal(ordinal), Delay(delay) {};

    SyncInfo &Sync;
    const int Ordinal;
    const int Delay;
};

//---------------------------------------------------------
// Called at the end of each iteration, it will "rendezvous"
// (meet) all the threads before returning (so that the next
// iteration can begin). In practical terms this function
// will block until all the other threads finish their iteration.
static void RandezvousOthers(SyncInfo &sync, int ordinal)
{
    if (0 == ::InterlockedDecrement(&(sync.Awaiting))) { // are we the last ones to arrive?
        // at this point, all the other threads are blocking on the semaphore
        // so we can manipulate shared structures without having to worry
        // about conflicts
        sync.Awaiting = sync.ThreadsCount;
        wprintf(L"Thread %d is the last to arrive, releasing synchronization barrier\n", ordinal);
        wprintf(L"---~~~---\n");

        // let's release the other threads from their slumber
        // by using the semaphore
        ::ReleaseSemaphore(sync.Semaphore, sync.ThreadsCount - 1, 0); // "ThreadsCount - 1" because this last thread will not block on the semaphore
    }
    else { // nope, there are other threads still working on the iteration so let's wait
        wprintf(L"Thread %d is waiting on synchronization barrier\n", ordinal);
        ::WaitForSingleObject(sync.Semaphore, INFINITE); // note that the return value should be validated at this point ;)
    }
}

//---------------------------------------------------------
// Defines the worker thread lifetime. It starts by retrieving
// thread-specific parameters, then loops through 5 iterations
// (rendezvous-ing with the other threads at the end of each),
// and then finishes (the thread can then be joined).
static DWORD WINAPI ThreadProc(void *p)
{
    ThreadParams *params = static_cast<ThreadParams *>(p);
    wprintf(L"Starting thread %d\n", params->Ordinal);

    for (int i = 1; i <= 5; ++i) {
        wprintf(L"Thread %d is executing iteration #%d (%d delay)\n", params->Ordinal, i, params->Delay);
        ::Sleep(params->Delay);
        wprintf(L"Thread %d is synchronizing end of iteration #%d\n", params->Ordinal, i);
        RandezvousOthers(params->Sync, params->Ordinal);
    }

    wprintf(L"Finishing thread %d\n", params->Ordinal);
    return 0;
}

//---------------------------------------------------------
// Program to illustrate the iteration-lockstep C++ solution.
int _tmain(int argc, _TCHAR* argv[])
{
    // prepare to run
    ::srand(::GetTickCount()); // pseudo-randomize random values :-)
    SyncInfo sync(4);
    ThreadParams p[] = {
        ThreadParams(sync, 1, ::rand() * 900 / RAND_MAX + 100), // a delay between 100 and 1000 milliseconds simulates the work an iteration would do
        ThreadParams(sync, 2, ::rand() * 900 / RAND_MAX + 100),
        ThreadParams(sync, 3, ::rand() * 900 / RAND_MAX + 100),
        ThreadParams(sync, 4, ::rand() * 900 / RAND_MAX + 100),
    };

    // let the threads rip
    HANDLE t[] = {
        ::CreateThread(0, 0, ThreadProc, p + 0, 0, 0),
        ::CreateThread(0, 0, ThreadProc, p + 1, 0, 0),
        ::CreateThread(0, 0, ThreadProc, p + 2, 0, 0),
        ::CreateThread(0, 0, ThreadProc, p + 3, 0, 0),
    };

    // wait for the threads to finish (join)
    ::WaitForMultipleObjects(4, t, true, INFINITE);
    return 0;
}
Sample Output
Running this program on my machine (dual-core) yields the following output:
Starting thread 1
Starting thread 2
Starting thread 4
Thread 1 is executing iteration #1 (712 delay)
Starting thread 3
Thread 2 is executing iteration #1 (798 delay)
Thread 4 is executing iteration #1 (477 delay)
Thread 3 is executing iteration #1 (104 delay)
Thread 3 is synchronizing end of iteration #1
Thread 3 is waiting on synchronization barrier
Thread 4 is synchronizing end of iteration #1
Thread 4 is waiting on synchronization barrier
Thread 1 is synchronizing end of iteration #1
Thread 1 is waiting on synchronization barrier
Thread 2 is synchronizing end of iteration #1
Thread 2 is the last to arrive, releasing synchronization barrier
---~~~---
Thread 2 is executing iteration #2 (798 delay)
Thread 3 is executing iteration #2 (104 delay)
Thread 1 is executing iteration #2 (712 delay)
Thread 4 is executing iteration #2 (477 delay)
Thread 3 is synchronizing end of iteration #2
Thread 3 is waiting on synchronization barrier
Thread 4 is synchronizing end of iteration #2
Thread 4 is waiting on synchronization barrier
Thread 1 is synchronizing end of iteration #2
Thread 1 is waiting on synchronization barrier
Thread 2 is synchronizing end of iteration #2
Thread 2 is the last to arrive, releasing synchronization barrier
---~~~---
Thread 4 is executing iteration #3 (477 delay)
Thread 3 is executing iteration #3 (104 delay)
Thread 1 is executing iteration #3 (712 delay)
Thread 2 is executing iteration #3 (798 delay)
Thread 3 is synchronizing end of iteration #3
Thread 3 is waiting on synchronization barrier
Thread 4 is synchronizing end of iteration #3
Thread 4 is waiting on synchronization barrier
Thread 1 is synchronizing end of iteration #3
Thread 1 is waiting on synchronization barrier
Thread 2 is synchronizing end of iteration #3
Thread 2 is the last to arrive, releasing synchronization barrier
---~~~---
Thread 2 is executing iteration #4 (798 delay)
Thread 3 is executing iteration #4 (104 delay)
Thread 1 is executing iteration #4 (712 delay)
Thread 4 is executing iteration #4 (477 delay)
Thread 3 is synchronizing end of iteration #4
Thread 3 is waiting on synchronization barrier
Thread 4 is synchronizing end of iteration #4
Thread 4 is waiting on synchronization barrier
Thread 1 is synchronizing end of iteration #4
Thread 1 is waiting on synchronization barrier
Thread 2 is synchronizing end of iteration #4
Thread 2 is the last to arrive, releasing synchronization barrier
---~~~---
Thread 3 is executing iteration #5 (104 delay)
Thread 4 is executing iteration #5 (477 delay)
Thread 1 is executing iteration #5 (712 delay)
Thread 2 is executing iteration #5 (798 delay)
Thread 3 is synchronizing end of iteration #5
Thread 3 is waiting on synchronization barrier
Thread 4 is synchronizing end of iteration #5
Thread 4 is waiting on synchronization barrier
Thread 1 is synchronizing end of iteration #5
Thread 1 is waiting on synchronization barrier
Thread 2 is synchronizing end of iteration #5
Thread 2 is the last to arrive, releasing synchronization barrier
---~~~---
Finishing thread 4
Finishing thread 3
Finishing thread 2
Finishing thread 1
Note that for simplicity each thread has a random iteration duration, but all iterations of that thread use that same random duration (i.e. it doesn't change between iterations).
How does it work?
The "core" of the solution is in the "RandezvousOthers" function. This function will either block on a shared semaphore (if the thread on which this function was called was not the last one to call the function) or reset Sync structure and unblock all the threads blocking on a shared semaphore (if the thread on which this function was called was the last one to call the function).
To make it work, set the second parameter of CreateEvent to TRUE. This makes the events "manual reset" and prevents the WaitXxx calls from resetting them.
Then place a ResetEvent at the beginning of the loop.
I found this SyncTools (download SyncTools.zip) by googling "barrier synchronization windows". It uses one CriticalSection and one Event to implement a barrier for N threads.
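For what it's worth, newer Windows SDKs also ship a built-in barrier. A minimal sketch, assuming Windows 8 or later and the synchapi.h barrier API (it may require defining _WIN32_WINNT as 0x0602 or newer):

// Sketch of the built-in Win32 synchronization barrier (Windows 8+ only).
#include <windows.h>
#include <stdio.h>

static SYNCHRONIZATION_BARRIER g_barrier;

static DWORD WINAPI Iteration(LPVOID param)
{
    int rank = (int)(INT_PTR)param;
    for (int i = 0; i < 100; ++i)
    {
        // ... per-thread work for this step ...
        printf("thread %d finished step %d\n", rank, i);

        // Blocks until every participating thread has reached this point.
        EnterSynchronizationBarrier(&g_barrier, 0);
    }
    return 0;
}

int main()
{
    const int kThreads = 2;
    InitializeSynchronizationBarrier(&g_barrier, kThreads, -1);

    HANDLE t[kThreads];
    for (int j = 0; j < kThreads; ++j)
        t[j] = CreateThread(NULL, 0, Iteration, (LPVOID)(INT_PTR)j, 0, NULL);

    WaitForMultipleObjects(kThreads, t, TRUE, INFINITE);
    DeleteSynchronizationBarrier(&g_barrier);
    return 0;
}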