WinAPI c++ multithread computation

WinAPI c++ multithread computation - c++

I have a task to compute Pi with following formula:
(i is in range from 0 to N, N = 10^8)
Computation should be completed in multiple threads with following requirement: each thread receives only a small fixed amount of computations to complete (in my case - 40 sum members at a time), and there should be a "Task pool" which gives new set of computations into a thread when it reports completion of previous set of operations given to it. Before a thread receives new task, it should wait. All of this should be done with WinAPI.
My solution is this class:
#include "ThreadManager.h"
#include <string>
HANDLE ThreadManager::mutex = (CreateMutexA(nullptr, true, "m"));
ThreadManager::ThreadManager(size_t threadCount)
{
threads.reserve(threadCount);
for (int i = 0; i < threadCount; i++)
{
threadInfo.push_back(new ThreadStruct(i * OP_COUNT));
HANDLE event = CreateEventA(nullptr, false, true, std::to_string(i).c_str());
if (event)
{
threadEvents.push_back(event);
DuplicateHandle(GetCurrentProcess(), event, GetCurrentProcess(),
&(threadInfo[i]->threadEvent), 0, false, DUPLICATE_SAME_ACCESS);
}
else std::cout << "Unknown error: " << GetLastError() << std::endl;
HANDLE thread = CreateThread(nullptr, 0,
reinterpret_cast<LPTHREAD_START_ROUTINE>(&ThreadManager::threadFunc),
threadInfo[i],
CREATE_SUSPENDED, nullptr);
if (thread) threads.push_back(thread);
else std::cout << "Unknown error: " << GetLastError() << std::endl;
}
}
double ThreadManager::run()
{
size_t operations_done = threads.size() * OP_COUNT;
for (HANDLE t : threads) ResumeThread(t);
DWORD index;
Sleep(10);
while (operations_done < ThreadManager::N)
{
ReleaseMutex(ThreadManager::mutex);
index = WaitForMultipleObjects(this->threadEvents.size(), this->threadEvents.data(), false, 10000);
WaitForSingleObject(ThreadManager::mutex, 1000);
threadInfo[index] -> operationIndex = operations_done + OP_COUNT;
SetEvent(threadEvents[index]);
//std::cout << "Operations completed: " << operations_done << "/1000" << std::endl;
operations_done += OP_COUNT;
}
long double res_pi = 0;
for (auto&& ts: this->threadInfo)
{
res_pi += ts->pi;
ts->operationIndex = N;
}
res_pi /= N;
WaitForMultipleObjects(this->threads.size(), this->threads.data(), true, 10000);
std::cout.precision(10);
std::cout << "Pi value for " << threads.size() << " threads: " << res_pi;
threads.clear();
return 0;
}
ThreadManager::~ThreadManager()
{
if (!threads.empty())
for (HANDLE t: threads)
{
TerminateThread(t, -1);
CloseHandle(t);
}
std::destroy(threadInfo.begin(), threadInfo.end());
}
long double ThreadManager::calc(size_t startIndex)
{
long double xi = 0;
long double pi = 0;
for (size_t i = startIndex; i < startIndex + OP_COUNT; i++)
{
const long double ld_i = i;
const long double half = 0.5f;
xi = (ld_i + half) * (1.0 / N);
pi += ((4.0 / (1.0 + xi * xi)));
}
return pi;
}
DWORD WINAPI ThreadManager::threadFunc(ThreadStruct *ts)
{
while (ts->operationIndex < N)
{
WaitForSingleObject(ts->threadEvent, 1000);
ts->pi += calc(ts->operationIndex);
WaitForSingleObject(ThreadManager::mutex, 1000);
SetEvent(ts->threadEvent);
ReleaseMutex(ThreadManager::mutex);
}
return 0;
}
ThreadStruct::ThreadStruct(size_t opIndex)
{
this -> pi = 0;
this -> operationIndex = opIndex;
}
My Idea was that there will be an auto-reset event for each thread, which is set to signaled when a thread finishes it's computation. Main thread is waiting on one of thread Events to signal, and after modifying some values in a shared ThreadStruct (to enable thread start another portion of computations) it sets that same event to signaled, which is received by the exact same thread and the process received. But this doesn't work for even one thread: as a result i see values which are pretty random and not close to Pi (like 0.0001776328265).
Though my GDB debugger was working poorly (not displaying some variables and sometimes even crashing), I noticed that there were too much computations happening (I scaled down N to 1000. Therefore, I should have seen threads printing out "computing" 1000/40 = 25 times, but actually it happened hundreds of times)
Then I tried adding a mutex so threads wait until main thread is not busy before signaling the event. That made computation much slower, and still inaccurate and random (example: 50.26492171 in case of 16 threads).
What can be the problem? Or, if it's completely wrong, how do I organize multithread calculation then? Was creating a class a bad idea?
If you want to reproduce the problem, here is header file content (I am using c++20, MinGW 6.0):
#ifndef MULTITHREADPI_THREADMANAGER_H
#define MULTITHREADPI_THREADMANAGER_H
#include <iostream>
#include <vector>
#include <list>
#include <windows.h>
#include <memory>
struct ThreadStruct
{
size_t operationIndex;
long double pi;
HANDLE threadEvent = nullptr;
explicit ThreadStruct(size_t opIndex);
};
class ThreadManager
{
public:
explicit ThreadManager(size_t threadCount);
double run();
~ThreadManager();
private:
std::vector<ThreadStruct*> threadInfo;
std::vector<HANDLE> threads;
std::vector<HANDLE> threadEvents;
static HANDLE mutex;
static long double calc(size_t startIndex);
static const int OP_COUNT = 40;
static const int N = 100000000;
static DWORD WINAPI threadFunc(ThreadStruct* ts);
};
#endif //MULTITHREADPI_THREADMANAGER_H
To execute code, just construct ThreadManager with desired number of threads as argument and call run() on it.

Even with all below changed, it doesn't give consistent values close to PI. There must be more stuff to fix. I think it has to do with the events. If I understand it correctly, there are two different things the mutex protects. And the event is also used for 2 different things. So both change their meaning during execution. This makes it very hard to think it through.
1. Timeouts
WaitForMultipleObjects may run into a timeout. In that case it returns WAIT_TIMEOUT, which is defined as 0x102 or 258. You access the threadInfo vector with that value without bounds checking. You can use at(n) for a bounds-checked version of [n].
You can easily run into a 10 second timeout when debugging or when setting OP_COUNT to high numbers. So, maybe you want to set it to INFINITE instead.
This leads to all sorts of misbehavior:
the threads information (operationIndex) is updated while the thread might work on it.
operations_done is updated although those operations may not be done
The mutex is probably overreleased
2. Limit the number of threads
The thread manager should also check the number of threads, since you can't set it to a number higher than MAXIMUM_WAIT_OBJECTS, otherwise WaitForMultipleObjects() won't work reliably.
3. Off by 1 error
Should be
size_t operations_done = (threads.size()-1) * OP_COUNT;
or
threadInfo[index] -> operationIndex = operations_done; // was + OP_COUNT
otherwise it'll skip one batch
4. Ending the threads
Ending the threads relies on the timeouts.
When you replace all timeouts by INFINITE, you'll notice that your threads never end. You need another ReleaseMutex(mutex); before
res_pi /= N;

struct CommonData;
struct ThreadData
{
CommonData* pData;
ULONG i, k;
ThreadData(CommonData* pData, ULONG i, ULONG k) : pData(pData), i(i), k(k) {}
static ULONG CALLBACK Work(void* p);
};
struct CommonData
{
HANDLE hEvent = 0;
LONG dwActiveThreadCount = 1;
ULONG N;
union {
double res = 0;
__int64 i64;
};
CommonData(ULONG N) : N(N) {}
~CommonData()
{
if (HANDLE h = hEvent)
{
CloseHandle(h);
}
}
void DecThread()
{
if (!InterlockedDecrement(&dwActiveThreadCount))
{
if (!SetEvent(hEvent)) __debugbreak();
}
}
BOOL AddThread(ULONG i, ULONG k)
{
InterlockedIncrementNoFence(&dwActiveThreadCount);
if (ThreadData* ptd = new ThreadData(this, i, k))
{
if (HANDLE hThread = CreateThread(0, 0, ThreadData::Work, ptd, 0, 0))
{
CloseHandle(hThread);
return TRUE;
}
delete ptd;
}
DecThread();
return FALSE;
}
BOOL Init()
{
return 0 != (hEvent = CreateEvent(0, 0, 0, 0));
}
void Wait()
{
DecThread();
if (WaitForSingleObject(hEvent, INFINITE) != WAIT_OBJECT_0) __debugbreak();
}
};
ULONG CALLBACK ThreadData::Work(void* p)
{
CommonData* pData = reinterpret_cast<ThreadData*>(p)->pData;
ULONG i = reinterpret_cast<ThreadData*>(p)->i;
ULONG k = reinterpret_cast<ThreadData*>(p)->k;
delete p;
ULONG N = pData->N;
double pi = 0;
do
{
double xi = (i++ + 0.5) / N;
pi += 4 / (1 + xi * xi);
} while (--k);
union {
double d;
__int64 i64;
};
i64 = pData->i64;
for (;;)
{
union {
double d_compare;
__int64 i64_compare;
};
i64_compare = i64;
d += pi;
if (i64_compare == (i64 = InterlockedCompareExchange64(
&pData->i64, i64, i64_compare)))
{
break;
}
}
pData->DecThread();
return 0;
}
double calc_pi(ULONG N)
{
SYSTEM_INFO si;
GetSystemInfo(&si);
if (si.dwNumberOfProcessors)
{
CommonData cd(N);
if (cd.Init())
{
ULONG k = (N + si.dwNumberOfProcessors - 1) / si.dwNumberOfProcessors, i = 0;
do
{
if (!cd.AddThread(i, k))
{
break;
}
} while (i += k, --si.dwNumberOfProcessors);
cd.Wait();
if (!si.dwNumberOfProcessors)
{
return cd.res/ N;
}
}
}
return 0;
}
when i call calc_pi(100000000) on 8 core i got 3.1415926535898153

Related

Chrono library multithreading time derivation limitations?

I am trying to solve the problem with a time derivation in a multithreaded setup. I have 3 threads, all pinned to different cores. The first two threads (reader_threads.cc) run in the infinite while loop inside the run() function. They finish their execution and send the current time window they are into the third thread.
The current time window is calculated based on the value from chrono time / Ti
The third thread is running at its own pace, and it's checking only the request when the flag has been raised, which is also sent via Message to the third thread.
I was able to get the desired behavior of all three threads in the same epoch if one epoch is at least 20000us. In the results, you can find more info.
Reader threads
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <chrono>
#include <atomic>
#include <mutex>
#include "control_thread.h"
#define INTERNAL_THREAD
#if defined INTERNAL_THREAD
#include <thread>
#include <pthread.h>
#else
#endif
using namespace std;
atomic<bool> thread_active[2];
atomic<bool> go;
pthread_barrier_t barrier;
template <typename T>
void send(Message volatile * m, unsigned int epoch, bool flag) {
for (int i = 0 ; i < sizeof(T); i++){
m->epoch = epoch;
m->flag = flag;
}
}
ControlThread * ct;
// Main run for threads
void run(unsigned int threadID){
// Put message into incoming buffer
Message volatile * m1 = &(ct->incoming_requests[threadID - 1]);
thread_active[threadID] = true;
std::atomic<bool> flag;
// this thread is done initializing stuff
thread_active[threadID] = true;
while (!go);
while(true){
using namespace std::chrono;
// Get current time with precision of microseconds
auto now = time_point_cast<microseconds>(steady_clock::now());
// sys_microseconds is type time_point<system_clock, microseconds>
using sys_microseconds = decltype(now);
// Convert time_point to signed integral type
auto duration = now.time_since_epoch();
// Convert signed integral type to time_point
sys_microseconds dt{microseconds{duration}};
// test
if (dt != now){
std::cout << "Failure." << std::endl;
}else{
// std::cout << "Success." << std::endl;
}
auto epoch = duration / Ti;
pthread_barrier_wait(&barrier);
flag = true;
// send current time to the control thread
send<int>(m1, epoch, flag);
auto current_position = duration % Ti;
std::chrono::duration<double, micro> multi_thread_sleep = chrono::microseconds(Ti) - chrono::microseconds(current_position);
if(multi_thread_sleep > chrono::microseconds::zero()){
this_thread::sleep_for(multi_thread_sleep);
}
}
}
int threads_num = 3;
void server() {
// Don't start control thread until reader threds finish init
for (int i=1; i < threads_num; i++){
while (!thread_active[i]);
}
go = true;
while (go) {
for (int i = 0; i < threads_num; i++) {
ct->current_requests(i);
}
// Arbitrary sleep to ensure that locking is accurate
std::this_thread::sleep_for(50us);
}
}
class Thread {
public:
#if defined INTERNAL_THREAD
thread execution_handle;
#endif
unsigned int id;
Thread(unsigned int i) : id(i) {}
};
void init(){
ct = new ControlThread();
}
int main (int argc, char * argv[]){
Thread * r[4];
pthread_barrier_init(&barrier, NULL, 2);
init();
/* start threads
*================*/
for (unsigned int i = 0; i < threads_num; i++) {
r[i] = new Thread(i);
#if defined INTERNAL_THREAD
if(i==0){
r[0]->execution_handle = std::thread([] {server();});
}else if(i == 1){
r[i]->execution_handle = std::thread([i] {run(i);});
}else if(i == 2){
r[i]->execution_handle = std::thread([i] {run(i);});
}
/* pin to core i */
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(i, &cpuset);
int rc = pthread_setaffinity_np(r[i]->execution_handle.native_handle(), sizeof(cpuset), &cpuset);
#endif
}
// wait for threads to end
for (unsigned int i = 0; i < threads_num + 1; i++) {
#if defined INTERNAL_THREAD
r[i]->execution_handle.join();
#endif
}
pthread_barrier_destroy(&barrier);
return 0;
}
Control Thread
#ifndef __CONTROL_THEAD_H__
#define __CONTROL_THEAD_H__
// Global vars
const auto Ti = std::chrono::microseconds(15000);
std::mutex m;
int count;
class Message{
public:
std::atomic<bool> flag;
unsigned long epoch;
};
class ControlThread {
public:
/* rw individual threads */
Message volatile incoming_requests[4];
void current_requests(unsigned long current_thread) {
using namespace std::chrono;
auto now = time_point_cast<microseconds>(steady_clock::now());
// sys_milliseconds is type time_point<system_clock, milliseconds>
using sys_microseconds = decltype(now);
// Convert time_point to signed integral type
auto time = now.time_since_epoch();
// Convert signed integral type to time_point
sys_microseconds dt{microseconds{time}};
// test
if (dt != now){
std::cout << "Failure." << std::endl;
}else{
// std::cout << "Success." << std::endl;
}
long contol_thread_epoch = time / Ti;
// Only check request when flag is raised
if(incoming_requests[current_thread].flag){
m.lock();
incoming_requests[current_thread].flag = false;
m.unlock();
// If reader thread epoch and control thread matches
if(incoming_requests[current_thread].epoch == contol_thread_epoch){
// printf("Successful desired behaviour\n");
}else{
count++;
if(count > 0){
printf("Missed %d\n", count);
}
}
}
}
};
#endif
RUN
g++ -std=c++2a -pthread -lrt -lm -lcrypt reader_threads.cc -o run
sudo ./run
Results
The following missed epochs are with one loop iteration (single Ti) equal to 1000us. Also, by increasing Ti, the less number of epochs have been skipped. Finally, if Ti is set to the 20000 us , no skipped epochs are detected. Does anyone have an idea whether I am making a mistake in casting or in communication between threads? Why the threads are not in sync if epoch is i.e. 5000us?
Missed 1
Missed 2
Missed 3
Missed 4
Missed 5
Missed 6
Missed 7
Missed 8
Missed 9
Missed 10
Missed 11
Missed 12
Missed 13
Missed 14
Missed 15
Missed 16

Shared resource single thread writing, multiple thread reading using interlock

I'm trying to implement single thread writing, multiple thread reading mechanism for shared resource management using interlock in C++, windows environment.
Q1. The result code seems to work as what I intend, but I'd like to ask for your wisdom if I am missing something.
Q2. If there is a real life or good active open source code example I can refer to, it will be really appreciated.
Following are the objectives I've taken into account.
Writing can only be executed by single thread and reading must be blocked when writing in order to avoid "invariant" break.
Reading can be executed by multiple threads.
#include <iostream>
#include <Windows.h>
char g_c = 0i8;
char g_pReadChar[3]{};
void* g_pThreads[4]{};
unsigned long g_pThreadIDs[4]{};
long long g_llLock = 0ULL; // 0 : Not locked / 1 : Locked (Writing) / 2 : Locked (Reading)
long long g_llEntryCount = 0ULL; // Thread entry count
__forceinline void Read()
{
// <- if a thread execution is here (case 0)
InterlockedIncrement64(&g_llEntryCount);
// <- if a thread execution is here (case 1)
for (unsigned long long i = 0ULL; i < 100000ULL; ++i)
{
if (InterlockedCompareExchange64(&g_llLock, 2LL, 0LL) == 1LL)
{
continue;
}
// <- if a thread execution is here (case 2)
// --------------------------------------------------
// Read data
std::cout << g_c;
// --------------------------------------------------
InterlockedExchange64(&g_llLock, 1LL); // Lock is needed in order to block case 0
if (InterlockedDecrement64(&g_llEntryCount) == 0LL)
{
InterlockedExchange64(&g_llLock, 0LL);
}
else
{
InterlockedExchange64(&g_llLock, 2LL);
}
return;
}
InterlockedDecrement64(&g_llEntryCount);
}
__forceinline unsigned long __stdcall ReadLoop(void* _pParam)
{
while (true)
{
Read();
Sleep(1);
}
}
__forceinline void Write(const unsigned long long _ullKey)
{
for (unsigned long long i = 0ULL; i < 100000ULL; ++i)
{
if (InterlockedCompareExchange64(&g_llLock, 1LL, 0LL) != 0LL)
{
continue;
}
// --------------------------------------------------
// Write data
if (_ullKey == 0ULL)
{
g_c = 'A';
}
else if (_ullKey == 1ULL)
{
g_c = 'B';
}
else
{
g_c = 'C';
}
// --------------------------------------------------
InterlockedExchange64(&g_llLock, 0LL);
return;
}
}
__forceinline unsigned long __stdcall WriteLoop(void* _pParam)
{
unsigned long long ullCount = 0ULL;
unsigned long long ullKey = 0ULL;
while (true)
{
if (ullCount > 10000ULL)
{
++ullKey;
if (ullKey >= 3ULL)
{
ullKey = 0ULL;
}
ullCount = 0ULL;
}
Write(ullKey);
++ullCount;
}
}
int main()
{
g_pThreads[0] = CreateThread(nullptr, 0ULL, ReadLoop, nullptr, 0UL, &g_pThreadIDs[0]);
g_pThreads[1] = CreateThread(nullptr, 0ULL, ReadLoop, nullptr, 0UL, &g_pThreadIDs[1]);
g_pThreads[2] = CreateThread(nullptr, 0ULL, ReadLoop, nullptr, 0UL, &g_pThreadIDs[2]);
g_pThreads[3] = CreateThread(nullptr, 0ULL, WriteLoop, nullptr, 0UL, &g_pThreadIDs[3]);
Sleep(100000);
return 0;
}

How to make 5 threads read and write array concurrently

I'm having an attempt at the famous producer-consumer problem in c++ and I have came up with an implementation like this...
#include <iostream>
#include <pthread.h>
#include <unistd.h>
#include <ctime>
void *consumeThread(void *i);
void *produceThread(void *i);
using std::cout;
using std::endl;
//Bucket size
#define Bucket_size 10
int buckets[Bucket_size];
pthread_mutex_t lock;
pthread_cond_t consume_now, produce_now;
time_t timer;
int o = 0;
int p = 0;
int main()
{
int i[5] = {1, 2, 3, 4, 5};
pthread_t consumer[5];
pthread_t producer[5];
pthread_mutex_init(&lock, nullptr);
pthread_cond_init(&consume_now, nullptr);
pthread_cond_init(&produce_now, nullptr);
timer = time(nullptr) + 10;
srand(time(nullptr));
for (int x = 0; x < 5; x++)
{
pthread_create(&producer[x], nullptr, &produceThread, &i[x]);
}
for (int x = 0; x < 5; x++)
{
pthread_create(&consumer[x], nullptr, &consumeThread, &i[x]);
}
pthread_cond_signal(&produce_now);
for (int x = 0; x < 5; x++)
{
pthread_join(producer[x], nullptr);
pthread_join(consumer[x], nullptr);
}
pthread_mutex_destroy(&lock);
pthread_cond_destroy(&consume_now);
pthread_cond_destroy(&produce_now);
return 0;
}
void *consumeThread(void *i)
{
bool quit = false;
while (!quit)
{
pthread_mutex_lock(&lock);
pthread_cond_wait(&consume_now, &lock);
printf("thread %d consuming element at array[%d] : the element is %d \n", *((int *)i), o, buckets[o]);
buckets[o] = 0;
p++;
printArray();
usleep(100000);
pthread_cond_signal(&produce_now);
pthread_mutex_unlock(&lock);
quit = time(nullptr) > timer;
}
return EXIT_SUCCESS;
}
void *produceThread(void *i)
{
int a = 0;
bool quit = false;
while (!quit)
{
o = p % 10;
buckets[o] = (rand() % 20) + 1;
printf("thread %d adding element in array[%d] : the element is %d \n", *((int *)i), o, buckets[o]);
a++;
printArray();
usleep(100000);
quit = time(nullptr) > timer;
}
return EXIT_SUCCESS;
}
currently this solution has 5 producer threads and 5 consumer threads, however it only lets 1 thread produce and 1 thread consume at a time, is there a way to make 5 of the producer and consumer threads work concurrently?
Example output from the program:
thread 1 adding element in array[0] : the element is 6
[6,0,0,0,0,0,0,0,0,0]

Your first problem is that you are treating condition variables as if they have memory. In your main(), you pthread_cond_signal(), but at that point, you have no idea if any of your threads are waiting upon that condition. Since condition variables do not have memory, your signal may well be lost.
Your second problem is that o is effectively protected by the condition; since each consumer uses it; and each producer modifies it, you can neither permit multiple producers or consumers to execute concurrently.
Your desired solution amounts to a queue which you inject o's into from the producers; and collect them in the consumers. That way, your concurrency is gated by your ability to produce o's.

How to make it run sync?

Hello I want to sync two threads one incrementing a variable and other decrementing it.
The result that I want looks like:
Thread #0 j = 1
Thread #1 j = 0
Thread #0 j = 1
Thread #1 j = 0
And so on.. but my code sometimes works like that in some cases it print really weird values. I supose that I have some undefined behavior in somewhere but I can't figured out what is really happen.
My code consist in a HANDLE ghMutex that containg the handler of my mutex:
My main function:
int main(void)
{
HANDLE aThread[THREADCOUNT];
ghMutex = CreateMutex(NULL, FALSE, NULL);
aThread[0] = (HANDLE)_beginthreadex(NULL, 0, &inc, NULL, CREATE_SUSPENDED, 0);
aThread[1] = (HANDLE)_beginthreadex(NULL, 0, &dec, NULL, CREATE_SUSPENDED, 0);
ResumeThread(aThread[0]);
ResumeThread(aThread[1]);
WaitForMultipleObjects(THREADCOUNT, aThread, TRUE, INFINITE);
printf("j = %d\n", j);
for (int i = 0; i < THREADCOUNT; i++)
CloseHandle(aThread[i]);
CloseHandle(ghMutex);
return 0;
}
Inc function:
unsigned int __stdcall inc(LPVOID)
{
for (volatile int i = 0; i < MAX; ++i)
{
WaitForSingleObject(
ghMutex, // handle to mutex
INFINITE); // no time-out interval
j++;
printf("Thread %d j = %d\n", GetCurrentThreadId(), j);
ReleaseMutex(ghMutex);
}
_endthread();
return TRUE;
}
Dec function:
unsigned int __stdcall dec(void*)
{
for (volatile int i = 0; i < MAX; ++i)
{
WaitForSingleObject(
ghMutex, // handle to mutex
INFINITE); // no time-out interval
j--;
printf("Thread %d j = %d\n", GetCurrentThreadId(), j);
ReleaseMutex(ghMutex);
}
_endthread();
return TRUE;
}
I need a win api solution in std c++98.

A mutex is not the right tool to synchronize two threads, it is there to protect a resource. You do have a resource j which is protected by your mutex, however the sequence of which thread gets the lock is undefined, so you can have the case where dec gets called several times before inc has the chance to run.
If you want to synchronize the order of the threads you will have to use another synchronization primitive, for example a semaphore. You could, for example, increment the semaphore in inc and decrement it in dec. This would be the classic producer - consumer relationship where the producer will be stalled when the semaphore reaches its maximum value and the consumer will wait for items to consume.
Sorry, no WinAPI C++98 solution from me because that would be silly, but I hope I pointed you to the right direction.

windows mutex object guarantees exclusive ownership, but does not care about the ownership order. so that the same thread can capture several times in a row while others will wait.
for your task you need signal to another thread, when your task is done, and then wait for signal from another thread. for this task can be used event pair for example. thread (i) signal event (1-i) and wait on event (i). for optimize instead 2 calls -
SetEvent(e[1-i]); WaitForSingleObject(e[i], INFINITE);
we can use single call SignalObjectAndWait
SignalObjectAndWait(e[1-i], e[i], INFINITE, FALSE)
of course start and end of loop require special care. for inc
HANDLE hObjectToSignal = _hEvent[1], hObjectToWaitOn = _hEvent[0];
for (;;)
{
_shared_value++;
if (!--n)
{
SetEvent(hObjectToSignal);
break;
}
SignalObjectAndWait(hObjectToSignal, hObjectToWaitOn, INFINITE, FALSE);
}
and for dec
HANDLE hObjectToSignal = _hEvent[0], hObjectToWaitOn = _hEvent[1];
WaitForSingleObject(hObjectToWaitOn, INFINITE);
for (;;)
{
--_shared_value;
if (!--n)
{
break;
}
SignalObjectAndWait(hObjectToSignal, hObjectToWaitOn, INFINITE, FALSE);
}
if write full test, with error checking
struct Task
{
HANDLE _hEvent[4];
ULONG _n;
LONG _iTasks;
LONG _shared_value;
Task()
{
RtlZeroMemory(this, sizeof(*this));
}
~Task()
{
ULONG n = RTL_NUMBER_OF(_hEvent);
do
{
if (HANDLE hEvent = _hEvent[--n]) CloseHandle(hEvent);
} while (n);
}
ULONG WaitTaskEnd()
{
return WaitForSingleObject(_hEvent[2], INFINITE);
}
ULONG WaitTaskReady()
{
return WaitForSingleObject(_hEvent[3], INFINITE);
}
void SetTaskReady()
{
SetEvent(_hEvent[3]);
}
void End()
{
if (!InterlockedDecrement(&_iTasks)) SetEvent(_hEvent[2]);
}
void Begin()
{
InterlockedIncrementNoFence(&_iTasks);
}
static ULONG WINAPI IncThread(PVOID p)
{
return reinterpret_cast<Task*>(p)->Inc(), 0;
}
void Inc()
{
if (WaitTaskReady() == WAIT_OBJECT_0)
{
if (ULONG n = _n)
{
HANDLE hObjectToSignal = _hEvent[1], hObjectToWaitOn = _hEvent[0];
for (;;)
{
if (_shared_value) __debugbreak();
if (n < 17) DbgPrint("Inc(%u)\n", n);
_shared_value++;
if (!--n)
{
SetEvent(hObjectToSignal);
break;
}
if (SignalObjectAndWait(hObjectToSignal, hObjectToWaitOn, INFINITE, FALSE) != WAIT_OBJECT_0)
{
break;
}
}
}
}
End();
}
static ULONG WINAPI DecThread(PVOID p)
{
return reinterpret_cast<Task*>(p)->Dec(), 0;
}
void Dec()
{
if (WaitTaskReady() == WAIT_OBJECT_0)
{
if (ULONG n = _n)
{
HANDLE hObjectToSignal = _hEvent[0], hObjectToWaitOn = _hEvent[1];
if (WaitForSingleObject(hObjectToWaitOn, INFINITE) == WAIT_OBJECT_0)
{
for (;;)
{
--_shared_value;
if (_shared_value) __debugbreak();
if (n < 17) DbgPrint("Dec(%u)\n", n);
if (!--n)
{
break;
}
if (SignalObjectAndWait(hObjectToSignal, hObjectToWaitOn, INFINITE, FALSE) != WAIT_OBJECT_0)
{
break;
}
}
}
}
}
End();
}
ULONG Create()
{
ULONG n = RTL_NUMBER_OF(_hEvent);
do
{
if (HANDLE hEvent = CreateEventW(0, n > 2, 0, 0)) _hEvent[--n] = hEvent;
else return GetLastError();
} while (n);
return NOERROR;
}
ULONG Start()
{
static PTHREAD_START_ROUTINE aa[] = { IncThread, DecThread };
ULONG n = RTL_NUMBER_OF(aa);
do
{
Begin();
if (HANDLE hThread = CreateThread(0, 0, aa[--n], this, 0, 0))
{
CloseHandle(hThread);
}
else
{
n = GetLastError();
End();
return n;
}
} while (n);
return NOERROR;
}
ULONG Start(ULONG n)
{
_iTasks = 1;
ULONG dwError = Start();
_n = dwError ? 0 : n;
SetTaskReady();
End();
return dwError;
}
};
void TaskTest(ULONG n)
{
Task task;
if (task.Create() == NOERROR)
{
task.Start(n);
task.WaitTaskEnd();
}
}
note, that no any sense declare local variable (which will be accessed only from single thread and not accessed by any interrupts, etc) as volatile
also when we write code, like:
// thread #1
write_shared_data();
SetEvent(hEvent);
// thread #2
WaitForSingleObject(hEvent, INFINITE);
read_shared_data();
inside SetEvent(hEvent); was atomic write to event state with release semantic (really stronger of course) and inside wait for event function - atomic read it state with more than acquire semantic. as result all what thread #1 write to memory before SetEvent - will be visible to thread #2 after Wait for event (if wait finished as result of call Set from thread #1)

C++ Using semaphores instead of busy waiting

I am attempting to learn about semaphores and multi-threading. The example I am working with creates 1 to t threads with each thread pointing to the next and the last thread pointing to the first thread. This program allows each thread to sequentially take a turn until all threads have taken n turns. That is when the program ends. The only problem is in the tFunc function, I am busy waiting until it is a specific thread's turn. I want to know how to use semaphores in order to make all the threads go to sleep and waking up a thread only when it is its turn to execute to improve efficiency.
int turn = 1;
int counter = 0;
int t, n;
struct tData {
int me;
int next;
};
void *tFunc(void *arg) {
struct tData *data;
data = (struct tData *) arg;
for (int i = 0; i < n; i++) {
while (turn != data->me) {
}
counter++;
turn = data->next;
}
}
int main (int argc, char *argv[]) {
t = atoi(argv[1]);
n = atoi(argv[2]);
struct tData td[t];
pthread_t threads[t];
int rc;
for (int i = 1; i <= t; i++) {
if (i == t) {
td[i].me = i;
td[i].next = 1;
}
else {
td[i].me = i;
td[i].next = i + 1;
}
rc = pthread_create(&threads[i], NULL, tFunc, (void *)&td[i]);
if (rc) {
cout << "Error: Unable to create thread, " << rc << endl;
exit(-1);
}
}
for (int i = 1; i <= t; i++) {
pthread_join(threads[i], NULL);
}
pthread_exit(NULL);
}

Uses mutexes and condition variables. Here's a working example:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
int turn = 1;
int counter = 0;
int t, n;
struct tData {
int me;
int next;
};
pthread_mutex_t mutex;
pthread_cond_t cond;
void *tFunc(void *arg)
{
struct tData *data;
data = (struct tData *) arg;
pthread_mutex_lock(&mutex);
for (int i = 0; i < n; i++)
{
while (turn != data->me)
pthread_cond_wait(&cond, &mutex);
counter++;
turn = data->next;
printf("%d goes (turn %d of %d), %d next\n", data->me, i+1, n, turn);
pthread_cond_broadcast(&cond);
}
pthread_mutex_unlock(&mutex);
}
int main (int argc, char *argv[]) {
t = atoi(argv[1]);
n = atoi(argv[2]);
struct tData td[t + 1];
pthread_t threads[t + 1];
int rc;
pthread_mutex_init(&mutex, NULL);
pthread_cond_init(&cond, NULL);
for (int i = 1; i <= t; i++)
{
td[i].me = i;
if (i == t)
td[i].next = 1;
else
td[i].next = i + 1;
rc = pthread_create(&threads[i], NULL, tFunc, (void *)&td[i]);
if (rc)
{
printf("Error: Unable to create thread: %d\n", rc);
exit(-1);
}
}
void *ret;
for (int i = 1; i <= t; i++)
pthread_join(threads[i], &ret);
}

Use N+1 semaphores. On startup, thread i waits on semaphore i. When woken up it "takes a turnand signals semaphorei + 1`.
The main thread spawns the N, threads, signals semaphore 0 and waits on semaphore N.
Pseudo code:
sem s[N+1];
thread_proc (i):
repeat N:
wait (s [i])
do_work ()
signal (s [i+1])
main():
for i in 0 .. N:
spawn (thread_proc, i)
repeat N:
signal (s [0]);
wait (s [N]);

Have one semaphore per thread. Have each thread wait on its semaphore, retrying if sem_wait returns EINTR. Once it's done with its work, have it post to the next thread's semaphore. This avoids the "thundering herd" behaviour of David's solution by waking only one thread at a time.
Also notice that, since your semaphores will never have a value larger than one, you can use a pthread_mutex_t for this.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js