A faster thread blocks a slower thread

I'm working on a point cloud viewer, and my design is based on two threads:
the first thread updates the point cloud data (about 10 fps);
the second thread is a D3D renderer that renders the point set to the screen (about 90 fps).
My code looks like this:
std::shared_ptr<PointCloud> pointcloud;
CRITICAL_SECTION updateLock;
void FirstThreadProc()
{
while(true)
{
/* some algorithm processes point cloud, takes time */
EnterCriticalSection(&updateLock);
pointcloud->Update(data,length,...); //also takes time to copy and process
LeaveCriticalSection(&updateLock);
}
}
/*...*/
std::shared_ptr<D3DRenderer> renderer;
void SecondThreadProc()
{
MSG msg = { 0 };
while (WM_QUIT != msg.message)
{
if (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
{
TranslateMessage(&msg);
DispatchMessage(&msg);
}
else
{
EnterCriticalSection(&updateLock);
renderer->Render(pointcloud);
LeaveCriticalSection(&updateLock);
}
}
}
I thought that, since the second thread is much faster than the first, the second thread would be blocked whenever the first one entered the critical section, so the renderer window should freeze now and then. What I actually observe is that the renderer window runs very smoothly (camera rotation and zooming in/out are all fine), but the first thread is very unstable: its rate ranges from 10 fps down to 1 fps.
I'm thinking about using two point cloud buffers: the first thread would update the second buffer outside the critical section and then swap the two buffers inside the critical section. Will that work?

As mentioned here, a CRITICAL_SECTION does not provide first-in, first-out (FIFO) ordering. Since the second thread is much faster than the first and its whole loop body is a critical section, it re-enters the critical section right after leaving it. It may therefore be inside the critical section almost all the time and keep the first thread out.
My solution was to move more of the second thread's work outside the critical section; after that it works fine.
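The two-buffer idea from the question can be sketched roughly as below. This is a minimal, portable sketch that assumes std::mutex in place of CRITICAL_SECTION; PointCloud is a hypothetical stand-in for the real type, and SharedCloud, publish and snapshot are illustrative names that do not appear in the original code. The slow update builds a new cloud outside the lock, and only a pointer swap happens while the lock is held.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for the question's PointCloud type.
struct PointCloud {
    std::vector<float> points;
};

class SharedCloud {
public:
    SharedCloud() : front_(std::make_shared<const PointCloud>()) {}

    // Writer thread: the cloud passed in was built outside the lock.
    void publish(PointCloud next) {
        // Declared before the guard so the old buffer is released
        // after the lock has already been dropped.
        auto fresh = std::make_shared<const PointCloud>(std::move(next));
        std::lock_guard<std::mutex> guard(lock_);
        front_.swap(fresh);  // only a pointer swap happens under the lock
    }

    // Render thread: take a reference to the current cloud, then render
    // from it without holding the lock.
    std::shared_ptr<const PointCloud> snapshot() const {
        std::lock_guard<std::mutex> guard(lock_);
        assert(front_ != nullptr);  // invariant: a cloud is always published
        return front_;
    }

private:
    mutable std::mutex lock_;
    std::shared_ptr<const PointCloud> front_;
};
```

With this shape, the renderer would call snapshot() once per frame and render from the returned pointer, so neither thread holds the lock for longer than a pointer copy.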

Related

Incomplete multi-threading RayTracer taking twice as much time as expected

I am making a multithreaded ray tracer and, as the title says, it is taking twice as long to execute as the single-threaded version. Obviously the purpose is to cut the render time in half; however, what I am doing for now is just running the ray-tracing method twice, once per thread, basically executing the same rendering twice. Nonetheless, since threads can run in parallel, there should not be a meaningful increase in execution time. Yet it roughly doubles.
This has to be related to my multithreading setup. I think it's related to the fact that I create the threads as joinable. So I am going to explain what I am doing and also post the related code to see if someone can confirm whether that's the issue.
I create two threads and set them as joinable. I create a RayTracer that allocates enough memory to store the image pixels (this is done in the constructor). I then run a two-iteration loop that passes the relevant info to each thread, such as the thread id and the address of the RayTracer instance.
Then pthread_create calls run_thread, whose purpose is to call the ray_tracer:draw method where the work is done. In the draw method, I have a
pthread_exit (NULL);
as the last statement (the only multithreading-related thing in it). Then I run another loop to join the threads. Finally I start writing the file in a small loop, close the file, and delete the pointers related to the array used to store the image in the draw method.
I may not need join now, since I am not doing "real" multithreaded ray tracing and am just rendering the image twice. But as soon as I start alternating between the image pixels (i.e., thread0 renders and stores pixel0, thread1 renders and stores pixel1, thread0 renders and stores pixel2, thread1 renders and stores pixel3, etc.), I think I will need it so that I can write the pixels to the file in the correct order.
Is that correct? Do I really need to use join here with my method (or with any other?). If I do, how can I send the threads to run concurrently, not waiting for the other to complete? Is the problem totally unrelated to join?
pthread_t threads [2];
thread_data td_array [2];
pthread_attr_t attr;
void *status;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
TGAManager tgaManager ("z.tga",true);
if (tgaManager.isFileOpen()) {
tgaManager.writeHeadersData (image);
RayTracer rt (image.getHeight() * image.getWidth());
int rc;
for (int i=0; i<2; i++) {
//cout << "main() : creating thread, " << i << endl;
td_array[i].thread_id=i;
td_array[i].rt_ptr = &rt;
td_array[i].img_ptr = &image;
td_array[i].scene_ptr = &scene;
//cout << "td_array.thread_index: " << td_array[i].thread_id << endl;
rc = pthread_create (&threads[i], NULL, RayTracer::run_thread, &td_array[i]);
}
if (rc) {
cout << "Error:unable to create thread," << rc << endl;
exit(-1);
}
pthread_attr_destroy(&attr);
for (int i=0; i<2; i++ ) {
rc = pthread_join(threads[i], &status);
if (rc) {
cout << "Error:unable to join," << rc << endl;
exit(-1);
}
}
//tgaManager.writeImage (rt,image.getSize());
for (int i=0; i<image.getWidth() * image.getHeight(); i++) {
cout << i << endl;
tgaManager.file_writer.put (rt.b[i]);
tgaManager.file_writer.put (rt.c[i]);
tgaManager.file_writer.put (rt.d[i]);
}
tgaManager.closeFile(1);
rt.deleteImgPtr ();
}

You do want to join() the threads, because if you don't, you have several problems:
How do you know when the threads have finished executing? You don't want to start writing out the resulting image only to find that it wasn't fully calculated at the moment you wrote it out.
How do you know when it is safe to tear down any data structures that the threads might be accessing? For example, your RayTracer object is on the stack, and (AFAICT) your threads are writing into its pixel-array. If your main function returns before the threads have exited, there is a very good chance that the threads will sometimes end up writing into a RayTracer object that no longer exists, which will corrupt the stack by overwriting whatever other objects might exist (by happenstance) at those same locations after your function returned.
So you definitely need to join() your threads; you don't need to explicitly declare them as PTHREAD_CREATE_JOINABLE, though, since that attribute is already set by default anyway.
Joining the threads should not cause the threads to slow down, as long as both threads are created and running before you call join() on any of them (which appears to be the case in your posted code).
As for why you are seeing a slowdown with two threads, that's hard to say since a slowdown could be coming from a number of places. Some possibilities:
Something in your ray-tracing code is locking a mutex, such that for much of the ray-tracing run, only one of the two threads is allowed to execute at a time anyway.
Both threads are writing to the same memory locations at around the same time, and that is causing cache-contention which slows down the execution of both threads.
My suggestion would be to set your threads so that thread #1 renders only the top half of the image, and thread #2 renders only the bottom half of the image; that way when they write their output they will be writing to different sections of memory.
If that doesn't help, you might temporarily replace the rendering code with something simpler (e.g. a "renderer" that just sets pixels to random values) to see if you can see a speedup with that. If so, then there might be something in your RayTracer's implementation that isn't multithreading-friendly.
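The suggested split into separate image regions can be sketched as follows. This is only an illustration under stated assumptions: it uses std::thread rather than the pthreads calls in the question, shade is a hypothetical placeholder for the real per-pixel ray-tracing work, and render_rows and render_image are made-up names. Each thread writes only its own contiguous band of rows, and the caller joins every thread before reading the pixels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical placeholder for the real per-pixel ray-tracing work.
inline std::uint8_t shade(int x, int y) {
    return static_cast<std::uint8_t>((x + y) & 0xFF);
}

// Renders rows [row_begin, row_end) into the shared pixel buffer.
void render_rows(std::vector<std::uint8_t>& pixels, int width,
                 int row_begin, int row_end) {
    for (int y = row_begin; y < row_end; ++y)
        for (int x = 0; x < width; ++x)
            pixels[static_cast<std::size_t>(y) * width + x] = shade(x, y);
}

// Splits the image into one contiguous band of rows per thread, so the
// threads never write to the same region of memory.
std::vector<std::uint8_t> render_image(int width, int height, int thread_count) {
    assert(thread_count > 0);
    std::vector<std::uint8_t> pixels(static_cast<std::size_t>(width) * height);
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        const int begin = height * t / thread_count;
        const int end = height * (t + 1) / thread_count;
        workers.emplace_back(render_rows, std::ref(pixels), width, begin, end);
    }
    // All threads are already running before the first join, so joining
    // does not serialize them; it only waits for completion.
    for (std::thread& w : workers) w.join();
    return pixels;
}
```

Because the pixel buffer is only read after every join has returned, the image can be written to the file in order regardless of which thread finished first.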

Check for a condition periodically without blocking

In my project, the function updateClips reads some facts which are set by CLIPS without the interference of my C++ code. Based on the facts it reads, updateClips calls the needed function.
void updateClips(void)
{
// read clipsAction
switch(clipsAction)
{
case ActMove:
goToPosition (0, 0, clipsActionArg);
break;
}
}
In goToPosition function, a message is sent to the vehicle to move to the specified position and then a while loop is used to wait until the vehicle reaches the position.
void goToPosition(float north, float east, float down)
{
// Prepare and send the message
do
{
// Read new location information.
} while (/* specified position not yet reached */);
}
The problem is that updateClips should be called every 500 ms and when the goToPosition function is called, the execution is blocked until the target location is reached. During this waiting period, something may happen that requires the vehicle to stop. Therefore, updateClips should be called every 500 ms no matter what, and it should be able to stop executing goToPosition if it's running.
I tried using threads as follows, but it did not work well and was difficult to debug. I think it can be done in a simpler and cleaner way.
case ActMove:
std::thread t1(goToPosition, 0, 0, clipsActionArg);
t1.detach();
break;
My question is, how can I check if the target location is reached without blocking the execution, i.e., without using while?

You probably want an event-driven model.
In an event-driven model, your main engine is a tight loop that reads events, updates state, then waits for more events.
Some events are time based, others are input based.
The only code that is permitted to block your main thread is the main loop, where it blocks until a timer hits or a new event arrives.
It might very roughly look like this:
using namespace std::literals::chrono_literals;
void main_loop( engine_state* state ) {
bool bContinue = true;
while(bContinue) {
update_ui(state);
while(bContinue && process_message(state, 10ms)) {
bContinue = update_state(state);
}
bContinue = update_state(state);
}
}
update_ui provides feedback to the user, if required.
process_message(state, duration) looks for a message to process, or waits until the given duration (10ms here) has elapsed. If it sees a message (like goToPosition), it modifies state to reflect that message (for example, it might store the desired destination). It does not block, nor does it take lots of time.
If no message is received within the duration, it returns anyhow without modifying state (I'm assuming you want things to happen even if no new input/messages occur).
update_state takes the state and evolves it. state might have a last updated time stamp; update_state would then make the "physics" reflect the time since last one. Or do any other updates.
The point is that process_message doesn't do work on the state (it encodes desires), while update_state advances "reality".
It returns false if the main loop should exit.
update_state is called once for every process_message call.
updateClips being called every 500ms can be encoded as a repeated automatic event in the queue of messages process_message reads.
bool process_message( engine_state* state, std::chrono::milliseconds ms ) {
auto start = std::chrono::high_resolution_clock::now();
bool processed = false;
while (start + ms > std::chrono::high_resolution_clock::now()) {
// engine_state::delayed is a priority_queue of timestamp/action
// ordered by timestamp (earliest first):
while (!state->delayed.empty()) {
auto stamp = state->delayed.top().stamp;
// run the action only once its due time has been reached:
if (stamp <= std::chrono::high_resolution_clock::now()) {
auto f = state->delayed.top().action;
state->delayed.pop();
f(stamp, state);
processed = true;
} else {
break;
}
}
//engine_state.queue is std::queue<std::function<void(engine_state*)>>
if (!state->queue.empty()) {
auto f = state->queue.front();
state->queue.pop();
f(state);
processed = true;
}
}
// tell the main loop whether any action ran during this time slice
return processed;
}
The repeated polling is implemented as a delayed action that, as its first operation, inserts a new delayed action due 500ms after this one. We pass in the time the action was due to run.
"Normal" events can be instead pushed into the normal action queue, which is a sequence of std::function<void(engine_state*)> and executed in order.
If there is nothing to do, the above function busy-waits for ms time and then returns. In some cases, we might want to go to sleep instead.
This is just a sketch of an event loop. There are many, many on the internet.
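The self-rescheduling delayed action described above (the 500ms updateClips poll) might look roughly like this. It is a sketch under assumptions: Engine, TimedAction, schedule_poll and run_due are hypothetical names rather than part of the answer's engine_state, and the poll body only increments a counter where the real code would call updateClips. Time is passed in explicitly so the behavior does not depend on the wall clock.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <queue>
#include <vector>

using Clock = std::chrono::steady_clock;

// A delayed action: the time it is due and what to run then.
struct TimedAction {
    Clock::time_point due;
    std::function<void(Clock::time_point)> run;
};

// Orders the priority queue so the earliest due time is on top.
struct Later {
    bool operator()(const TimedAction& a, const TimedAction& b) const {
        return a.due > b.due;
    }
};

struct Engine {
    std::priority_queue<TimedAction, std::vector<TimedAction>, Later> delayed;
    int polls = 0;  // stands in for calls to updateClips

    // Schedules a poll that, as its first step, re-schedules itself one
    // period after the time it was due (not after the time it actually ran).
    void schedule_poll(Clock::time_point due, std::chrono::milliseconds period) {
        assert(period.count() > 0);
        delayed.push({due, [this, period](Clock::time_point was_due) {
            schedule_poll(was_due + period, period);
            ++polls;
        }});
    }

    // Runs every action whose due time has passed; never blocks.
    int run_due(Clock::time_point now) {
        int ran = 0;
        while (!delayed.empty() && delayed.top().due <= now) {
            TimedAction action = delayed.top();  // copy before popping
            delayed.pop();
            action.run(action.due);
            ++ran;
        }
        return ran;
    }
};
```

A non-blocking goToPosition would fit the same pattern: it would only record the target in the state, and each poll would compare the vehicle's current position with that target instead of looping until it is reached.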

Multiple threads waiting for same semaphore

Suppose there are 5 threads waiting for a semaphore
sem_bridgempty = CreateSemaphore(NULL, 0, 1, NULL);
WaitForSingleObject(sem_bridgempty, INFINITE);
Now when sem_bridgempty is signalled, one of the 5 threads will wake up and the rest will again wait for sem_bridgempty to be signalled. Am I right here?
I am implementing the one-lane bridge problem, where vehicles can move in only one direction at a time. The capacity of the bridge is fixed at 5. What I have done so far is
unsigned WINAPI enter(void *param)
{
int direction = *((int *)param);
while (1)
{
WaitForSingleObject(sem_bridgecount, INFINITE);
WaitForSingleObject(mut_mutex, INFINITE);
if (curr_direction == -1 || direction == curr_direction)
{
curr_direction = direction;
cars_count++;
std::cout << "Car with direction " << direction << " entered " << GetCurrentThreadId() << std::endl;
ReleaseMutex(mut_mutex);
break;
}
else
{
ReleaseMutex(mut_mutex);
WaitForSingleObject(sem_bridgempty, INFINITE);
}
}
Sleep(5000);
exit1(NULL);
return 0;
}
unsigned WINAPI exit1(void *param)
{
WaitForSingleObject(mut_mutex, INFINITE);
cars_count--;
std::cout << "A Car exited " << GetCurrentThreadId() << std::endl;
ReleaseSemaphore(sem_bridgecount, 1, NULL);
if (cars_count == 0)
{
curr_direction = -1;
std::cout << "Bridge is empty " << GetCurrentThreadId() << std::endl;
ReleaseSemaphore(sem_bridgempty, 1, NULL);
}
ReleaseMutex(mut_mutex);
return 0;
}
int main()
{
sem_bridgecount = CreateSemaphore(NULL, 5, 5, NULL);
sem_bridgempty = CreateSemaphore(NULL, 0, 1, NULL);
mut_mutex = CreateMutex(NULL, false, NULL);
//create threads here
}
Consider the below portion
else
{
ReleaseMutex(mut_mutex);
WaitForSingleObject(sem_bridgempty, INFINITE);
A car is going in direction 1. Now there are three enter requests with direction 2. All three will block at WaitForSingleObject(sem_bridgempty, INFINITE);. When the bridge becomes empty, one of the three will be picked. That one makes the bridge non-empty again, so the other two keep waiting for the bridge to become empty even though they are travelling in the same direction.
So even though there is a direction=2 car on the bridge, other cars with the same direction are still waiting on sem_bridgempty.
I even thought of using sem_bridgempty as an event instead of a semaphore (SetEvent() in exit1() when cars_count == 0 and ResetEvent() in enter() when the first car enters), but still not all threads wake up.

The cleanest option would be to use a critical section and a condition variable.
The ENTER algorithm would look like this:
Claim the critical section.
Call SleepConditionVariableCS in a loop, as shown in Using Condition Variables, until either:
The traffic is going in the right direction and the bridge has capacity left, or
The bridge is empty.
Update the state to represent your car entering the bridge.
Release the critical section.
The EXIT algorithm would look like this:
Claim the critical section.
Update the state to represent your car leaving the bridge.
Release the critical section.
Call WakeConditionVariable.
The state protected by the critical section (the condition being waited on) could be a single integer whose magnitude represents the number of cars on the bridge and whose sign represents the direction of travel.
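The ENTER and EXIT steps above can be sketched as follows. This is a portable sketch that assumes std::mutex and std::condition_variable in place of CRITICAL_SECTION and SleepConditionVariableCS; the Bridge class and its member names are hypothetical. It keeps the count and the direction as two separate fields rather than one signed integer, and it wakes all waiters on exit (the analogue of WakeAllConditionVariable) because several waiting cars may be able to proceed at once.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

class Bridge {
public:
    explicit Bridge(int capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // ENTER: wait until the bridge is empty, or traffic already flows in
    // our direction and there is capacity left.
    void enter(int direction) {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [&] {
            return count_ == 0 ||
                   (direction_ == direction && count_ < capacity_);
        });
        direction_ = direction;
        ++count_;
    }

    // EXIT: update the state under the lock, then wake the waiters so
    // each one re-checks its own condition.
    void leave() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(count_ > 0);
            --count_;
        }
        changed_.notify_all();
    }

    int cars() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return count_;
    }

    int direction() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return direction_;
    }

private:
    mutable std::mutex mutex_;
    std::condition_variable changed_;
    const int capacity_;
    int count_ = 0;
    int direction_ = -1;  // meaningful only while count_ > 0
};
```

Each car thread would call enter(direction), simulate the crossing, and then call leave(). Note that this sketch makes no fairness guarantee: a steady stream of cars in one direction can starve the other direction.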
If you wanted to avoid condition variables, the simplest solution I could come up with requires one critical section and three auto-reset events: one for each direction of travel, plus one to indicate that the bridge is empty. You will also need a variable representing the number of cars on the bridge.
The ENTER algorithm would look like this:
Using WaitForMultipleObjects, claim the event corresponding to your direction of travel or the event corresponding to the bridge being empty, whichever is available first.
Enter the critical section.
Increment the count to represent your car entering the bridge.
If the count is not at capacity, set the event representing your direction of travel.
Leave the critical section.
The EXIT algorithm would look like this:
Enter the critical section.
Decrement the count to represent your car leaving the bridge.
If the count is zero, set the event indicating that the bridge is empty.
If the count is nonzero, set the event corresponding to your direction of travel.
Release the critical section.

You need to create objects that correspond most closely to the task. In this task there are two queues, one for each direction, and each queue is conceptually FIFO. We also need the ability to wake an exact number of entries in a queue, not just one or all of them. The Windows semaphore corresponds to exactly this: threads wait on it as a queue, and by calling ReleaseSemaphore we can set the exact number of threads (entries) to wake via its second parameter, lReleaseCount. With an event or a condition variable we can only wake a single waiter or all waiters.
Your mistake is not that you chose a semaphore; it is the best choice for this task. Your mistake is that you chose it for the wrong entities, sem_bridgecount and sem_bridgempty, which are not queues in any sense. You need two semaphores, one per direction: HANDLE _hSemaphore[2];. Create each as _hSemaphore[0] = CreateSemaphore(0, 0, MAXLONG, 0), with an initial count of 0 (!) and an unlimited maximum count (though any value >= 5 will do). When a car tries to enter the bridge in some direction and cannot, because the other direction is active or there is no free space on the bridge, it must wait on the semaphore _hSemaphore[direction]. When a car exits the bridge, it needs to check the current situation on the bridge and wake an exact number of cars (n) in one direction or the other (not all of them and not just one) by calling ReleaseSemaphore(_hSemaphore[direction], n, 0);
In general:
void enter(int direction)
{
EnterCriticalSection(..);
BOOL IsNeedWait = fn(direction);
LeaveCriticalSection(..);
if (IsNeedWait) WaitForSingleObject(_hSemaphore[direction], INFINITE);
}
and
void exit(int direction)
{
EnterCriticalSection(..);
direction = calc_new(direction);
if (int WakeCount = calc_wake_count(direction))
{
ReleaseSemaphore(_hSemaphore[direction], WakeCount, 0);
}
LeaveCriticalSection(..);
}
Note that in every enter, a car enters the critical section only once; after waiting on _hSemaphore[direction] it simply enters the bridge without entering the critical section again to re-check conditions. This works because in exit we can calculate the exact number of cars (not one or all) and the direction, and wake only the cars that must enter the bridge. This would be impossible with events or condition variables.
Although a solution with condition variables and a critical section is possible, I think it is not the best one because:
after waiting in SleepConditionVariableCS, the thread enters the critical section again, which is not needed at all;
we must either wake only a single car with WakeConditionVariable when several cars could really enter the bridge, or wake all of them with WakeAllConditionVariable;
in the latter case several threads concurrently try to enter the same critical section again, only one wins, and the others wait there;
the number of waiting threads can exceed the capacity of the bridge (5 in your case), so some threads will need to start waiting again in the loop.
All of this can be avoided if the semaphore is used correctly.
A full working implementation is here.

Prevent frame dropping while saving frames to disk

I am trying to write C++ code which saves incoming video frames to disk. Asynchronously arriving frames are pushed onto a queue by a producer thread. The frames are popped off the queue by a consumer thread. Mutual exclusion of producer and consumer is done using a mutex. However, I still notice frames being dropped. The dropped frames (likely) correspond to instances when the producer tries to push the current frame onto the queue but cannot do so because the consumer holds the lock. Any suggestions? I essentially do not want the producer to wait. A waiting consumer is okay for me.
EDIT-0: Alternate idea which does not involve locking. Will this work?
Producer initially enqueues n seconds worth of video. n can be some small multiple of frame-rate.
As long as queue contains >= n seconds worth of video, consumer dequeues on a frame by frame basis and saves to disk.
When the video is done, the queue is flushed to disk.
EDIT-1: The frames arrive at ~ 15 fps.
EDIT-2 : Outline of code :
Main driver code
// Main function
void LVD::DumpFrame(const IplImage *frame)
{
// Copies frame into internal buffer.
// buffer object is a wrapper around OpenCV's IplImage
Initialize(frame);
// (Producer thread) -- Pushes buffer onto queue
// Thread locks queue, pushes buffer onto queue, unlocks queue and dies
PushBufferOntoQueue();
// (Consumer thread) -- Pop off queue and save to disk
// Thread locks queue, pops it, unlocks queue,
// saves popped buffer to disk and dies
DumpQueue();
++m_frame_id;
}
void LVD::Initialize(const IplImage *frame)
{
if(NULL == m_buffer) // first iteration
m_buffer = new ImageBuffer(frame);
else
m_buffer->Copy(frame);
}
Producer
void LVD::PushBufferOntoQueue()
{
m_queingThread = ::CreateThread( NULL, 0, ThreadFuncPushImageBufferOntoQueue, this, 0, &m_dwThreadID);
}
DWORD WINAPI LVD::ThreadFuncPushImageBufferOntoQueue(void *arg)
{
LVD* videoDumper = reinterpret_cast<LVD*>(arg);
LocalLock ll( &videoDumper->m_que_lock, 60*1000 );
videoDumper->m_frameQue.push(*(videoDumper->m_buffer));
ll.Unlock();
return 0;
}
Consumer
void LVD::DumpQueue()
{
m_dumpingThread = ::CreateThread( NULL, 0, ThreadFuncDumpFrames, this, 0, &m_dwThreadID);
}
DWORD WINAPI LVD::ThreadFuncDumpFrames(void *arg)
{
LVD* videoDumper = reinterpret_cast<LVD*>(arg);
LocalLock ll( &videoDumper->m_que_lock, 60*1000 );
if(videoDumper->m_frameQue.size() > 0 )
{
videoDumper->m_save_frame=videoDumper->m_frameQue.front();
videoDumper->m_frameQue.pop();
}
ll.Unlock();
stringstream ss;
ss << videoDumper->m_saveDir.c_str() << "\\";
ss << videoDumper->m_startTime.c_str() << "\\";
ss << setfill('0') << setw(6) << videoDumper->m_frame_id;
ss << ".png";
videoDumper->m_save_frame.SaveImage(ss.str().c_str());
return 0;
}
Note:
(1) I cannot use C++11. Therefore, Herb Sutter's DDJ article is not an option.
(2) I found a reference to an unbounded single producer-consumer queue. However, the author(s) state that enqueue(adding frames) is probably not wait-free.
(3) I also found liblfds, a C-library but not sure if it will serve my purpose.

The queue cannot be the problem. Video frames arrive at 16 msec intervals, at worst. Your queue only needs to store a pointer to a frame. Adding/removing one in a thread-safe way can never take more than a microsecond.
You'll need to look for another explanation and solution. Video does forever present a fire-hose problem. Disk drives are not generally fast enough to keep up with an uncompressed video stream. So if your consumer cannot keep up with the producer then something is going to give, with a dropped frame the likely outcome when you (correctly) prevent the queue from growing without bound.
Be sure to consider encoding the video. Real-time MPEG and AVC encoders are available. After they compress the stream you should not have a problem keeping up with the disk.

Circular buffer is definitely a good alternative. If you make it use a 2^n size, you can also use this trick to update the pointers:
inline int update_index(int x)
{
return (x + 1) & (size-1);
}
That way, there is no need to use expensive compare (and consequential jumps) or divide (the single most expensive integer operation in any processor - not counting "fill/copy large chunks of memory" type operations).
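To show the index trick in context, here is a minimal single-threaded sketch of a circular buffer whose capacity is a power of two. RingBuffer and its members are illustrative names, not taken from the question's code; synchronization between producer and consumer is deliberately left out, and the static_assert and in-class initializers assume a C++11 compiler, which the question rules out, so treat it as an illustration of the masking only.

```cpp
#include <cassert>
#include <cstddef>

// Fixed-capacity ring buffer. N must be a power of two so that
// "& (N - 1)" wraps an index exactly like "% N" would.
template <typename T, std::size_t N>
class RingBuffer {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");

public:
    // Returns false when the buffer is full (one slot is always kept empty
    // so that "full" and "empty" can be told apart).
    bool push(const T& value) {
        const std::size_t next = advance(head_);
        if (next == tail_) return false;
        slots_[head_] = value;
        head_ = next;
        return true;
    }

    // Returns false when the buffer is empty.
    bool pop(T& out) {
        if (tail_ == head_) return false;
        out = slots_[tail_];
        tail_ = advance(tail_);
        return true;
    }

private:
    static std::size_t advance(std::size_t i) {
        assert(i < N);
        return (i + 1) & (N - 1);  // wrap without a compare or a divide
    }

    T slots_[N] = {};
    std::size_t head_ = 0;  // next slot to write
    std::size_t tail_ = 0;  // next slot to read
};
```

With N = 4 the buffer holds at most three elements at a time, because one slot stays empty to distinguish a full buffer from an empty one.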
When dealing with video (or graphics in general) it is essential to do "buffer management". Typically, this is a case of tracking state of the "framebuffer" and avoiding to copy content more than necessary.
The typical approach is to allocate 2 or 3 video-buffers (or frame buffers, or what you call it). A buffer can be owned by either the producer or the consumer. The transfer is ONLY the ownership. So when the video-driver signals that "this buffer is full", the ownership is now with the consumer, that will read the buffer and store it to disk [or whatever]. When the storing is finished, the buffer is given back ("freed") so that the producer can re-use it. Copying the data out of the buffer is expensive [takes time], so you don't want to do that unless it's ABSOLUTELY necessary.
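The ownership hand-off described above might be sketched like this. Frame, FramePool and the four method names are hypothetical, and std::mutex (C++11) is assumed for brevity; since the question rules out C++11, a CRITICAL_SECTION would take its place there. Only pointers move between the free list and the filled list, so the pixel data is never copied inside the lock.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical frame buffer; a real one would wrap the captured image.
struct Frame {
    std::vector<unsigned char> bytes;
    int id = -1;
};

// A fixed set of buffers whose ownership moves between producer and consumer.
class FramePool {
public:
    FramePool(std::size_t buffer_count, std::size_t frame_bytes)
        : storage_(buffer_count) {
        for (Frame& f : storage_) {
            f.bytes.resize(frame_bytes);
            free_.push_back(&f);
        }
    }

    // Producer: returns nullptr when every buffer is busy, in which case
    // the incoming frame has to be dropped (the producer never waits).
    Frame* acquire_free() { return take(free_); }
    // Producer: hand a filled buffer over to the consumer.
    void submit_filled(Frame* f) { give(filled_, f); }

    // Consumer: returns nullptr when there is nothing to save yet.
    Frame* acquire_filled() { return take(filled_); }
    // Consumer: give the buffer back once it has been written to disk.
    void release(Frame* f) { give(free_, f); }

private:
    Frame* take(std::deque<Frame*>& list) {
        std::lock_guard<std::mutex> guard(lock_);
        if (list.empty()) return nullptr;
        Frame* f = list.front();
        list.pop_front();
        return f;
    }

    void give(std::deque<Frame*>& list, Frame* f) {
        assert(f != nullptr);
        std::lock_guard<std::mutex> guard(lock_);
        list.push_back(f);
    }

    std::mutex lock_;
    std::vector<Frame> storage_;  // never resized, so the pointers stay valid
    std::deque<Frame*> free_;
    std::deque<Frame*> filled_;
};
```

The slow disk write happens between acquire_filled and release, outside the lock, so the producer is only ever delayed by a pointer push or pop.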

Win32 threads dying for no apparent reason

I have a program that spawns 3 worker threads that do some number crunching, and waits for them to finish like so:
#define THREAD_COUNT 3
volatile LONG waitCount;
HANDLE pSemaphore;
int main(int argc, char **argv)
{
// ...
HANDLE threads[THREAD_COUNT];
pSemaphore = CreateSemaphore(NULL, THREAD_COUNT, THREAD_COUNT, NULL);
waitCount = 0;
for (int j=0; j<THREAD_COUNT; ++j)
{
threads[j] = CreateThread(NULL, 0, Iteration, p+j, 0, NULL);
}
WaitForMultipleObjects(THREAD_COUNT, threads, TRUE, INFINITE);
// ...
}
The worker threads use a custom Barrier function at certain points in the code to wait until all other threads reach the Barrier:
void Barrier(volatile LONG* counter, HANDLE semaphore, int thread_count = THREAD_COUNT)
{
LONG wait_count = InterlockedIncrement(counter);
if ( wait_count == thread_count )
{
*counter = 0;
ReleaseSemaphore(semaphore, thread_count - 1, NULL);
}
else
{
WaitForSingleObject(semaphore, INFINITE);
}
}
(Implementation based on this answer)
The program occasionally deadlocks. If at that point I use VS2008 to break execution and dig around in the internals, there is only 1 worker thread waiting on the Wait... line in Barrier(). The value of waitCount is always 2.
To make things even more awkward, the faster the threads work, the more likely they are to deadlock. If I run in Release mode, the deadlock comes about 8 out of 10 times. If I run in Debug mode and put some prints in the thread function to see where they hang, they almost never hang.
So it seems that some of my worker threads are killed early, leaving the rest stuck on the Barrier. However, the threads do literally nothing except read and write memory (and call Barrier()), and I'm quite positive that no segfaults occur. It is also possible that I'm jumping to the wrong conclusions, since (as mentioned in the question linked above) I'm new to Win32 threads.
What could be going on here, and how can I debug this sort of weird behavior with VS?

How do I debug weird thread behaviour?
Not quite what you said, but the answer is almost always: understand the code really well, understand all the possible outcomes and work out which one is happening. A debugger becomes less useful here, because you can either follow one thread and miss out on what is causing other threads to fail, or follow from the parent, in which case execution is no longer sequential and you end up all over the place.
Now, onto the problem.
pSemaphore = CreateSemaphore(NULL, THREAD_COUNT, THREAD_COUNT, NULL);
From the MSDN documentation:
lInitialCount [in]: The initial count for the semaphore object. This value must be greater than or equal to zero and less than or equal to lMaximumCount. The state of a semaphore is signaled when its count is greater than zero and nonsignaled when it is zero. The count is decreased by one whenever a wait function releases a thread that was waiting for the semaphore. The count is increased by a specified amount by calling the ReleaseSemaphore function.
And here:
Before a thread attempts to perform the task, it uses the WaitForSingleObject function to determine whether the semaphore's current count permits it to do so. The wait function's time-out parameter is set to zero, so the function returns immediately if the semaphore is in the nonsignaled state. WaitForSingleObject decrements the semaphore's count by one.
So what we're saying here, is that a semaphore's count parameter tells you how many threads are allowed to perform a given task at once. When you set your count initially to THREAD_COUNT you are allowing all your threads access to the "resource" which in this case is to continue onwards.
The answer you link uses this creation method for the semaphore:
CreateSemaphore(0, 0, 1024, 0)
Which basically says none of the threads are permitted to use the resource. In your implementation, the semaphore is signaled (>0), so everything carries on merrily until one of the threads manages to decrease the count to zero, at which point some other thread waits for the semaphore to become signaled again, which probably isn't happening in sync with your counters. Remember when WaitForSingleObject returns it decreases the counter on the semaphore.
In the example you've posted, setting:
::ReleaseSemaphore(sync.Semaphore, sync.ThreadsCount - 1, 0);
Works because each of the WaitForSingleObject calls decrease the semaphore's value by 1 and there are threadcount - 1 of them to do, which happen when the threadcount - 1 WaitForSingleObjects all return, so the semaphore is back to 0 and therefore unsignaled again, so on the next pass everybody waits because nobody is allowed to access the resource at once.
So in short, set your initial value to zero and see if that fixes it.
Edit: a little explanation. To think of it a different way, a semaphore is like an n-atomic gate. What you usually do is this:
// Set the number of tickets:
HANDLE Semaphore = CreateSemaphore(0, 20, 200, 0);
// Later on in a thread somewhere...
// Get a ticket in the queue
WaitForSingleObject(Semaphore, INFINITE);
// Only 20 threads can access this area
// at once. When one thread has entered
// this area the available tickets decrease
// by one. When there are 20 threads here
// all other threads must wait.
// do stuff
ReleaseSemaphore(Semaphore, 1, 0);
// gives back one ticket.
So the use we're putting semaphores to here isn't quite the one for which they were designed.

It's a bit hard to guess exactly what you might be running into. Parallel programming is one of those places that (IMO) it pays to follow the philosophy of "keep it so simple it's obviously correct", and unfortunately I can't say that your Barrier code seems to qualify. Personally, I think I'd have something like this:
// define and initialize the array of events used for the barrier:
HANDLE barrier_[thread_count];
for (int i=0; i<thread_count; i++)
barrier_[i] = CreateEvent(NULL, true, false, NULL);
// ...
void Barrier(size_t thread_num) {
// Signal that this thread has reached the barrier:
SetEvent(barrier_[thread_num]);
// Then wait for all the threads to reach the barrier:
WaitForMultipleObjects(thread_count, barrier_, true, INFINITE);
}
Edit:
Okay, now that the intent has been clarified (need to handle multiple iterations), I'd modify the answer, but only slightly. Instead of one array of Events, have two: one for the odd iterations and one for the even iterations:
// define and initialize the arrays of events used for the barrier:
HANDLE barrier_[2][thread_count];
for (int i=0; i<thread_count; i++) {
barrier_[0][i] = CreateEvent(NULL, true, false, NULL);
barrier_[1][i] = CreateEvent(NULL, true, false, NULL);
}
// ...
void Barrier(size_t thread_num, int iteration) {
// Signal that this thread has reached the barrier:
SetEvent(barrier_[iteration & 1][thread_num]);
// Then wait for all the threads to reach the barrier:
WaitForMultipleObjects(thread_count, barrier_[iteration & 1], true, INFINITE);
ResetEvent(barrier_[iteration & 1][thread_num]);
}

In your barrier, what prevents this line:
*counter = 0;
from being executed while this other one is executed by another thread?
LONG wait_count = InterlockedIncrement(counter);
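One way to rule out that race is to update the counter only while holding a lock and to let waiters watch a generation number instead of the counter itself. The following is a sketch under assumptions, not the original poster's code: it uses std::mutex and std::condition_variable rather than the Win32 semaphore, and ReusableBarrier and run_rounds are hypothetical names. run_rounds is only a small harness that exercises the barrier over many iterations.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class ReusableBarrier {
public:
    explicit ReusableBarrier(int thread_count) : thread_count_(thread_count) {
        assert(thread_count > 0);
    }

    // Blocks until thread_count threads have arrived. Returns true for the
    // thread that arrived last and released the others.
    bool arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const unsigned long my_generation = generation_;
        if (++waiting_ == thread_count_) {
            waiting_ = 0;   // reset happens under the same lock as the increment
            ++generation_;  // releases everyone waiting on this generation
            released_.notify_all();
            return true;
        }
        // A fast thread that re-enters the barrier sees the new generation
        // and cannot be confused with waiters from the previous round.
        released_.wait(lock, [&] { return generation_ != my_generation; });
        return false;
    }

private:
    std::mutex mutex_;
    std::condition_variable released_;
    const int thread_count_;
    int waiting_ = 0;
    unsigned long generation_ = 0;
};

// Test harness: returns the number of completed rounds, or -1 if a thread
// ever passed the barrier before all threads of that round had arrived.
int run_rounds(int thread_count, int rounds) {
    ReusableBarrier barrier(thread_count);
    std::atomic<int> arrivals(0);
    std::atomic<int> releases(0);
    std::atomic<bool> ok(true);
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&] {
            for (int r = 0; r < rounds; ++r) {
                ++arrivals;
                if (barrier.arrive_and_wait()) ++releases;
                if (arrivals.load() < (r + 1) * thread_count) ok = false;
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return ok ? releases.load() : -1;
}
```

Because the reset of the waiting count and the increment are both done under the same mutex, no thread can increment the counter in the window where another thread is resetting it.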
