I'm working on a program that simulates a gas station. Each car at the station is its own thread. Each car must loop through a single bitmask to check if a pump is open, and if it is, update the bitmask, fill up, and notify other cars that the pump is now open. My current code works, but there are some issues with load balancing. Ideally all the pumps are used the same amount and all cars get equal fill-ups.
EDIT: My program basically takes a number of cars, pumps, and a length of time to run the test for. During that time, cars will check for an open pump by constantly calling this function.
int Station::fillUp()
{
// loop through the pumps using the bitmask to check if they are available
for (int i = 0; i < pumpsInStation; i++)
{
//Check bitmask to see if pump is open
stationMutex->lock();
if ((freeMask & (1 << i)) == 0 )
{
//Turning the bit on
freeMask |= (1 << i);
stationMutex->unlock();
// Sleeps thread for 30ms and increments counts
pumps[i].fillTankUp();
// Turning the bit back off
stationMutex->lock();
freeMask &= ~(1 << i);
stationCondition->notify_one();
stationMutex->unlock();
// Sleep long enough for all cars to have a chance to fill up first.
this_thread::sleep_for(std::chrono::milliseconds((((carsInStation-1) * 30) / pumpsInStation)-30));
return 1;
}
stationMutex->unlock();
}
// If no pumps are available, wait until one becomes available.
stationCondition->wait(std::unique_lock<std::mutex>(*stationMutex));
return -1;
}
I feel the issue has something to do with locking the bitmask when I read it. Do I need to have some sort of mutex or lock around the if check?
It looks like every car checks the availability of pump #0 first, and if that pump is busy it then checks pump #1, and so on. Given that, it seems expected to me that pump #0 would service the most cars, followed by pump #1 serving the second-most cars, all the way down to pump #(pumpsInStation-1) which only ever gets used in the (relatively rare) situation where all of the pumps are in use simultaneously at the time a new car pulls in.
If you'd like to get better load-balancing, you should probably have each car choose a different random ordering to iterate over the pumps, rather than having them all check the pumps' availability in the same order.
Normally I wouldn't suggest refactoring as it's kind of rude and doesn't go straight to the answer, but here I think it would help you a bit to break your logic into three parts, like so, to better show where the contention lies:
int Station::acquirePump()
{
// loop through the pumps using the bitmask to check if they are available
ScopedLocker locker(&stationMutex);
for (int i = 0; i < pumpsInStation; i++)
{
// Check bitmask to see if pump is open
if ((freeMask & (1 << i)) == 0 )
{
//Turning the bit on
freeMask |= (1 << i);
return i;
}
}
return -1;
}
void Station::releasePump(int n)
{
ScopedLocker locker(&stationMutex);
freeMask &= ~(1 << n);
stationCondition->notify_one();
}
bool Station::fillUp()
{
// If a pump is available:
int i = acquirePump();
if (i != -1)
{
// Sleeps thread for 30ms and increments counts
pumps[i].fillTankUp();
releasePump(i);
// Sleep long enough for all cars to have a chance to fill up first.
this_thread::sleep_for(std::chrono::milliseconds((((carsInStation-1) * 30) / pumpsInStation)-30));
return true;
}
// If no pumps are available, wait until one becomes available.
stationCondition->wait(std::unique_lock<std::mutex>(*stationMutex));
return false;
}
With the code in this form, the load-balancing issue is easier to see, and it is worth fixing if you don't want to "exhaust" one pump or if the pump itself might contain a lock. The issue lies in acquirePump, which checks the pumps for availability in the same order for every car. A simple tweak to balance it better looks like this:
int Station::acquirePump()
{
// loop through the pumps using the bitmask to check if they are available
ScopedLocker locker(&stationMutex);
for (int n = 0, i = startIndex; n < pumpsInStation; ++n, i = (i+1) % pumpsInStation)
{
// Check bitmask to see if pump is open
if ((freeMask & (1 << i)) == 0 )
{
// Change the starting index used to search for a free pump for
// the next car.
startIndex = (startIndex+1) % pumpsInStation;
// Turning the bit on
freeMask |= (1 << i);
return i;
}
}
return -1;
}
Another thing I have to ask is whether it's really necessary (e.g. for memory efficiency) to use bit flags to indicate whether a pump is in use. If you can use an array of bool instead, you'll be able to avoid locking completely and simply use atomic operations to acquire and release pumps, and that'll avoid creating a traffic jam of locked threads.
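To illustrate the atomic-flag idea, here is a minimal sketch that keeps one std::atomic<bool> per pump and claims a pump with a compare-and-swap. The class name PumpFlags and its members are hypothetical and not part of the original Station class; it only shows the acquire/release part, so a car that gets -1 would still have to retry, yield or wait by some other means.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical lock-free pump table: one atomic flag per pump instead of a
// mutex-protected bitmask. All names here are illustrative assumptions.
class PumpFlags {
public:
    explicit PumpFlags(std::size_t pumpCount) : busy_(pumpCount) {
        for (auto& flag : busy_) flag.store(false);
    }

    // Returns the index of a pump that was free and is now claimed, or -1
    // if every pump was busy at the moment it was checked.
    int acquire() {
        for (std::size_t i = 0; i < busy_.size(); ++i) {
            bool expected = false;
            // Succeeds only for the single thread that flips false -> true.
            if (busy_[i].compare_exchange_strong(expected, true))
                return static_cast<int>(i);
        }
        return -1;
    }

    // Marks pump i as free again.
    void release(int i) { busy_[static_cast<std::size_t>(i)].store(false); }

private:
    std::vector<std::atomic<bool>> busy_;
};
```

This sketch still scans the pumps in a fixed order, so the rotating start index shown earlier would be needed as well to even out the load.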
Imagine that the mutex has a queue associated with it, containing the waiting threads. Now, one of your threads manages to get the mutex that protects the bitmask of occupied stations, checks if one specific place is free. If it isn't, it releases the mutex again and loops, only to go back to the end of the queue of threads waiting for the mutex. Firstly, this is unfair, because the first one to wait is not guaranteed to get the next free slot, only if that slot happens to be the one on its loop counter. Secondly, it causes an extreme amount of context switches, which is bad for performance. Note that your approach should still produce correct results in that no two cars collide while accessing a single filling station, but the behaviour is suboptimal.
What you should do instead is this:
1. Lock the mutex to get exclusive access to the possible filling stations.
2. Locate the next free filling station.
3. If none of the stations are free, wait for the condition variable and restart at step 2.
4. Mark the slot as occupied and release the mutex.
5. Fill up the car (this is where the sleep in the simulation actually makes sense, the other one doesn't).
6. Lock the mutex.
7. Mark the slot as free and signal the condition variable to wake up others.
8. Release the mutex again.
Just in case that part isn't clear to you, waiting on a condition variable implicitly releases the mutex while waiting and reacquires it afterwards!
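The sequence above can be sketched roughly as follows. This is only an illustration under assumptions: the class StationSketch, its member names and the use of an unsigned bitmask are invented for the example, and the fill-up itself is left to the caller between acquirePump() and releasePump().

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative station: bit i of busyMask_ is 1 while pump i is in use.
// Class and member names are assumptions made for this sketch.
class StationSketch {
public:
    explicit StationSketch(int pumpCount) : pumpCount_(pumpCount) {}

    // Lock, look for a free pump, wait if there is none, mark it occupied.
    int acquirePump() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            for (int i = 0; i < pumpCount_; ++i) {
                if ((busyMask_ & (1u << i)) == 0) {
                    busyMask_ |= (1u << i);
                    return i;  // the mutex is released when lock goes out of scope
                }
            }
            // wait() releases the mutex while sleeping and reacquires it
            // before returning, so the scan restarts with the lock held.
            pumpFreed_.wait(lock);
        }
    }

    // Lock, mark the pump free, signal one waiter, unlock.
    void releasePump(int i) {
        std::lock_guard<std::mutex> lock(mutex_);
        busyMask_ &= ~(1u << i);
        pumpFreed_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable pumpFreed_;
    unsigned busyMask_ = 0;
    int pumpCount_;
};
```

A car thread would call acquirePump(), do its simulated fill-up without holding the mutex, and then call releasePump() with the index it was given.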
Suppose there are 5 threads waiting for a semaphore
CreateSemaphore(sem_bridgempty,0,1,INFINITE);
WaitForSingleObject(sem_bridgempty, INFINITE);
Now when sem_bridgempty is signalled, one of the 5 threads will wake up and the rest will again wait for sem_bridgempty to be signalled. Am I right here?
I am implementing the one-lane bridge problem, where vehicles can move in only one direction at a time. Also, the capacity of the bridge is fixed at 5. What I have done so far is:
unsigned WINAPI enter(void *param)
{
int direction = *((int *)param);
while (1)
{
WaitForSingleObject(sem_bridgecount, INFINITE);
WaitForSingleObject(mut_mutex, INFINITE);
if (curr_direction == -1 || direction == curr_direction)
{
curr_direction = direction;
cars_count++;
std::cout << "Car with direction " << direction << " entered " << GetCurrentThreadId() << std::endl;
ReleaseMutex(mut_mutex);
break;
}
else
{
ReleaseMutex(mut_mutex);
WaitForSingleObject(sem_bridgempty, INFINITE);
}
}
Sleep(5000);
exit1(NULL);
return 0;
}
unsigned WINAPI exit1(void *param)
{
WaitForSingleObject(mut_mutex, INFINITE);
cars_count--;
std::cout << "A Car exited " << GetCurrentThreadId() << std::endl;
ReleaseSemaphore(sem_bridgecount, 1, NULL);
if (cars_count == 0)
{
curr_direction = -1;
std::cout << "Bridge is empty " << GetCurrentThreadId() << std::endl;
ReleaseSemaphore(sem_bridgempty, 1, NULL);
}
ReleaseMutex(mut_mutex);
return 0;
}
int main()
{
sem_bridgecount = CreateSemaphore(NULL, 5, 5, NULL);
sem_bridgempty = CreateSemaphore(NULL, 0, 1, NULL);
mut_mutex = CreateMutex(NULL, false, NULL);
//create threads here
}
Consider the below portion
else
{
ReleaseMutex(mut_mutex);
WaitForSingleObject(sem_bridgempty, INFINITE);
A car is going in direction 1. Now there are three enter requests with direction 2. All 3 will be blocked at WaitForSingleObject(sem_bridgempty, INFINITE);. When the bridge becomes empty, one of the three will be picked. The one picked will make the bridge non-empty again, and the other two will still wait for the bridge to become empty even though they are travelling in the same direction.
So even though there is a direction=2 car on the bridge, other cars with the same direction are still waiting on sem_bridgempty.
I even thought of using sem_bridgempty as an event instead of a semaphore (SetEvent() in exit1() when cars_count is 0, and ResetEvent() in enter() when the first car enters). But still not all threads wake up.
The cleanest option would be to use a critical section and a condition variable.
The ENTER algorithm would look like this:
1. Claim the critical section.
2. Call SleepConditionVariableCS in a loop, as shown in Using Condition Variables, until either:
   - the traffic is going in the right direction and the bridge has capacity left, or
   - the bridge is empty.
3. Update the state to represent your car entering the bridge.
4. Release the critical section.
The EXIT algorithm would look like this:
1. Claim the critical section.
2. Update the state to represent your car leaving the bridge.
3. Release the critical section.
4. Call WakeConditionVariable.
The state guarded by the critical section (the condition that the loop tests) could be an integer whose magnitude represents the number of cars on the bridge and whose sign represents the direction of travel.
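As a rough, portable illustration of the ENTER and EXIT algorithms, the sketch below uses std::mutex and std::condition_variable in place of the Win32 critical section and condition variable. The class BridgeSketch and its members are assumptions made for the example. It also wakes all waiters on exit rather than one, which is a deliberate substitution so that several cars travelling in the same direction can re-check the condition and enter together.

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative one-lane bridge. Names and the capacity parameter are
// assumptions; std::mutex stands in for the Win32 critical section.
class BridgeSketch {
public:
    explicit BridgeSketch(int capacity) : capacity_(capacity) {}

    void enter(int direction) {
        std::unique_lock<std::mutex> lock(mutex_);
        // Wait in a loop until the bridge is empty, or traffic already flows
        // in our direction and there is capacity left.
        changed_.wait(lock, [&] {
            return carsOnBridge_ == 0 ||
                   (direction_ == direction && carsOnBridge_ < capacity_);
        });
        direction_ = direction;
        ++carsOnBridge_;
    }

    void leave() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            --carsOnBridge_;
        }
        // Wake every waiter so each one re-evaluates its own condition.
        changed_.notify_all();
    }

    int carsOnBridge() {
        std::lock_guard<std::mutex> lock(mutex_);
        return carsOnBridge_;
    }

private:
    std::mutex mutex_;
    std::condition_variable changed_;
    int capacity_;
    int carsOnBridge_ = 0;
    int direction_ = -1;  // only meaningful while carsOnBridge_ > 0
};
```

Each car thread would call enter(direction), sleep for its crossing time, and then call leave().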
If you wanted to avoid condition variables, the simplest solution I could come up with requires one critical section and three auto-reset events: one for each direction of travel, plus one to indicate that the bridge is empty. You will also need a variable representing the number of cars on the bridge.
The ENTER algorithm would look like this:
1. Using WaitForMultipleObjects, claim the event corresponding to your direction of travel or the event corresponding to the bridge being empty, whichever is available first.
2. Enter the critical section.
3. Increment the count to represent your car entering the bridge.
4. If the count is not at capacity, set the event representing your direction of travel.
5. Leave the critical section.
The EXIT algorithm would look like this:
1. Enter the critical section.
2. Decrement the count to represent your car leaving the bridge.
3. If the count is zero, set the event indicating that the bridge is empty.
4. If the count is nonzero, set the event corresponding to your direction of travel.
5. Release the critical section.
You need to create objects that correspond most closely to the task. In the current task we have 2 queues, one for each direction. Both queues are FIFO in nature, and we need the ability to wake an exact number of entries in a queue, not only one or all. The Windows semaphore corresponds exactly to this: it is a FIFO queue, and by calling ReleaseSemaphore we can set exactly the number of threads (entries) to wake, via the second parameter of the API, lReleaseCount. With an event or a condition variable we can only wake a single waiter or all waiters.
Your mistake is not that you selected a semaphore, which is the best choice for this task. Your mistake is that you selected it for the wrong entities, sem_bridgecount and sem_bridgempty, which are not queues in nature at all. You need 2 semaphores for the 2 directions, HANDLE _hSemaphore[2];, one semaphore per direction. Create each as _hSemaphore[0] = CreateSemaphore(0, 0, MAXLONG, 0): the initial count is 0 (!) and the maximum count is unlimited (but you can select any value >= 5). When a car tries to enter the bridge in some direction and cannot, because the other direction is currently active or there is no free space on the bridge, it must wait on the semaphore (in the FIFO queue) _hSemaphore[direction]. When a car exits the bridge, it needs to check the current situation on the bridge and wake one direction or the other by an exact number of cars (n), not all or a single one, so it calls ReleaseSemaphore(_hSemaphore[direction], n, 0);
in general:
void enter(int direction)
{
EnterCriticalSection(..);
BOOL IsNeedWait = fn(direction);
LeaveCriticalSection(..);
if (IsNeedWait) WaitForSingleObject(_hSemaphore[direction], INFINITE);
}
and
void exit(int direction)
{
EnterCriticalSection(..);
direction = calc_new(direction);
if (int WakeCount = calc_wake_count(direction))
{
ReleaseSemaphore(_hSemaphore[direction], WakeCount, 0);
}
LeaveCriticalSection(..);
}
Note that in every enter, a car enters the critical section only once; after waiting on _hSemaphore[direction] it simply enters the bridge without entering the critical section again to re-check conditions. This is because in exit we can calculate the exact number of cars (not a single one or all) and the direction, and wake only the cars which must enter the bridge. This would be impossible with events or condition variables.
Although a solution with condition variables and a critical section is possible, I think it is not the best because:
- after waiting in SleepConditionVariableCS, a thread enters the critical section again, which is absolutely not needed;
- we must either wake only a single car with WakeConditionVariable when multiple cars could really enter the bridge, or wake all of them with WakeAllConditionVariable;
- in the latter case several threads concurrently try to enter the same critical section again, only one wins, and the others wait there;
- the number of waiting threads can be greater than the number of places on the bridge (5 in your case), and some threads will need to begin waiting again in the loop.
All of this can be avoided if the semaphore is used correctly.
full working implementation here
I have a big file and I want to read and process all of the even lines of the file with multiple threads.
One suggestion is to read the whole file and break it into multiple files (the same count as threads), then let every thread process a specific file. As this idea reads the whole file, writes it again and then reads multiple files, it seems to be slow (3x I/O), and I think there must be better scenarios.
I thought this could be a better scenario:
One thread reads the file and puts the data in a global variable, and the other threads read the data from that variable and process it. In more detail:
One thread reads the main file by running the func1 function and puts each even line in a buffer, line1Buffer, of maximum size MAX_BUFFER_SIZE; the other threads pop their data from the buffer and process it by running the func2 function. In code:
Global variables:
#define MAX_BUFFER_SIZE 100
vector<string> line1Buffer;
bool keepReading = true; // set to false to end thread 2 through the last thread ("continue" is a reserved keyword)
string file = "reads.fq";
Function func1 : (thread 1)
void func1(){
string ReadSeq;
ifstream ifstr(file.c_str());
for (long long i = 0; i < numberOfReads; i++) { // 2 lines per read
getline(ifstr,ReadSeq);
getline(ifstr,ReadSeq);// reading even lines
while( line1Buffer.size() == MAX_BUFFER_SIZE )
; // to delay when the buffer is full
line1Buffer.push_back(ReadSeq);
}
keepReading = false;
return;
}
And function func2 : (other threads)
void func2(){
string ReadSeq;
while(keepReading){
if(line1Buffer.size() > 0 ){
ReadSeq = line1Buffer.back(); // pop_back() returns void, so copy the element first
line1Buffer.pop_back();
// do the processing....
}
}
}
About the speed:
If the reading part is slower, the total time will be equal to the time for reading the file just once (the buffer may contain just 1 line at a time, and hence just 1 other thread will be able to work alongside thread 1). If the processing part is slower, the total time will be equal to the time for the whole processing with numberOfThreads - 1 threads. Both cases are faster than reading the file and writing it to multiple files with 1 thread, and then reading and processing those files with multiple threads.
So there are 2 questions:
1- How do I call the functions from threads so that thread 1 runs func1 and the others run func2?
2- Is there any faster scenario?
3-[Deleted] Can anyone extend this idea to M threads for reading and N threads for processing? Obviously M+N==numberOfThreads holds.
Edit: the 3rd question is not right, as multiple threads can't help in reading a single file.
Thanks All
Another approach could be interleaved threads.
Reading is done by every thread, but by only one at a time.
Because of the waiting in the very first iteration, the threads will become interleaved.
But this is only a scalable option if work() is the bottleneck (otherwise any non-parallel execution would be better).
Thread:
while (!end) {
// should be fair!
lock();
read();
unlock();
work();
}
basic example: (you should probably add some error-handling)
void thread_exec(ifstream* file,std::mutex* mutex,int* global_line_counter) {
std::string line;
std::vector<std::string> data;
int i;
do {
i = 0;
// only 1 concurrent reader
mutex->lock();
// try to read the maximum number of lines
while(i < MAX_NUMBER_OF_LINES_PER_ITERATION && getline(*file,line)) {
// only the even lines we want to process
if (*global_line_counter % 2 == 0) {
data.push_back(line);
i++;
}
(*global_line_counter)++;
}
mutex->unlock();
// execute work for every line
for (int j=0; j < data.size(); j++) {
work(data[j]);
}
// free old data
data.clear();
//until EOF was not reached
} while(i == MAX_NUMBER_OF_LINES_PER_ITERATION);
}
void process_data(std::string file) {
// counter for checking if line is even
int global_line_counter = 0;
// open file
ifstream ifstr(file.c_str());
// mutex for synchronization
// maybe a fair-lock would be a better solution
std::mutex mutex;
// create threads and start them with thread_exec(&ifstr, &mutex, &global_line_counter);
std::vector<std::thread> threads(NUM_THREADS);
for (int i=0; i < NUM_THREADS; i++) {
threads[i] = std::thread(thread_exec, &ifstr, &mutex, &global_line_counter);
}
// wait until all threads have finished
for (int i=0; i < NUM_THREADS; i++) {
threads[i].join();
}
}
What is your bottleneck? Hard disk or processing time?
If it's the hard disk, then you're probably not going to get any more performance out, as you've hit the limits of the hardware. Sequential reads are by far faster than trying to jump around the file. Having multiple threads trying to read your file will almost certainly reduce the overall speed, as it will increase disk thrashing.
A single thread reading the file and a thread pool (or just 1 other thread) to deal with the contents is probably as good as you can get.
Global variables:
This is a bad habit to get into.
Assume we have #p threads and the two scenarios mentioned in the post and answers:
1) Reading with one thread and processing with the other threads. In this case #p-1 threads process while only one thread reads. Let jobTime be the time for the full operation and pTime(n) the time for processing with n threads. Then:
The worst case occurs when reading is much slower than processing, giving jobTime = pTime(1)+readTime; the best case is when processing is slower than reading, in which case jobTime equals pTime(#p-1)+readTime.
2) Reading and processing with all #p threads. In this scenario every thread needs to do two steps. The first step is to read a part of the file of size MAX_BUFFER_SIZE, which is sequential, meaning no two threads can read at the same time. The second step is processing the data that was read, which can run in parallel. This way the worst-case jobTime is pTime(1)+readTime as before (but see *), while the best case is pTime(#p)+readTime, which is better than the previous scenario.
*: In the second approach's worst case, reading is slower, but you can find an optimized MAX_BUFFER_SIZE for which (in the worst case) some reading by one thread overlaps with some processing by another thread. With this optimized MAX_BUFFER_SIZE, jobTime will be less than pTime(1)+readTime and could approach readTime.
First off, reading a file is a slow operation, so unless you are doing some super-heavy processing, the file reading will be the limiting factor.
If you do decide to go the multithreaded route, a queue is the right approach. Just make sure you push at the front and pop from the back. A std::deque should work well. You will also need to lock the queue with a mutex and synchronize it with a condition variable.
One last thing is that you will need to limit the size of the queue for the scenario where we are pushing faster than we are popping.
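A minimal sketch of such a queue, assuming one reader thread that pushes lines and several workers that pop them, might look like this. The class name BoundedLineQueue, the close() method and the bool-returning pop() are choices made for this example, not something prescribed by the answers above.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Illustrative bounded queue guarded by a mutex and two condition variables.
class BoundedLineQueue {
public:
    explicit BoundedLineQueue(std::size_t maxSize) : maxSize_(maxSize) {}

    // Reader thread: blocks while the queue is full, then pushes at the front.
    void push(std::string line) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [&] { return lines_.size() < maxSize_; });
        lines_.push_front(std::move(line));
        notEmpty_.notify_one();
    }

    // Reader thread: call once after the last line so workers can finish.
    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        notEmpty_.notify_all();
    }

    // Worker thread: pops from the back. Returns false once the queue has
    // been closed and fully drained.
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [&] { return closed_ || !lines_.empty(); });
        if (lines_.empty()) return false;
        out = std::move(lines_.back());
        lines_.pop_back();
        notFull_.notify_one();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable notFull_;
    std::condition_variable notEmpty_;
    std::deque<std::string> lines_;
    std::size_t maxSize_;
    bool closed_ = false;
};
```

The reader would call push() for each even line and close() at end of file, while each worker loops on pop() until it returns false.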
I wanted to use threading to check multiple images in a vector at the same time. Here is the code:
boost::thread_group tGroup;
for (int line = 0;line < sourceImageData.size(); line++) {
for (int pixel = 0;pixel < sourceImageData[line].size();pixel++) {
for (int im = 0;im < m_images.size();im++) {
tGroup.create_thread(boost::bind(&ClassX::ClassXFunction, this, line, pixel, im));
}
tGroup.join_all();
}
}
This creates the thread group and loops through the lines of pixel data, each pixel, and then multiple images. It's a weird project, but anyway I bind the thread to a method in the same instance of the class this code is in, so "this" is used. This runs through a population of about 20 images, binding each thread as it goes, and when it is done looping, the join_all function takes effect once the threads are done. Then it goes to the next pixel and starts over again.
I've tested running 50 threads at the same time with this simple program:
void run(int index) {
for (int i = 0;i < 100;i++) {
std::cout << "Index : " <<index<<" "<<i << std::endl;
}
}
int main() {
boost::thread_group tGroup;
for (int i = 0;i < 50;i++){
tGroup.create_thread(boost::bind(run, i));
}
tGroup.join_all();
int done;
std::cin >> done;
return 0;
}
This works very quickly. Even though the method the threads are bound to in the previous program is more complicated it shouldn't be as slow as it is. It takes like 4 seconds for one loop of sourceImageData (line) to complete. I'm new to boost threading so I don't know if something is blatantly wrong with the nested loops or otherwise. Any insight is appreciated.
The answer is simple. Don't start that many threads. Consider starting as many threads as you have logical CPU cores. Starting threads is very expensive.
Certainly never start a thread just to do one tiny job. Keep the threads and give them lots of (small) tasks using a task queue.
See here for a good example where the number of threads was similarly the issue: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
In this case I'd think you can gain a lot of performance by increasing the size of each task (don't create one per pixel, but per scan-line for example)
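As a sketch of the coarser-grained approach, the function below starts roughly one thread per logical core and gives each thread whole scan-lines to process. It uses std::thread rather than boost::thread_group, and forEachLineParallel and the processLine callback are hypothetical names; the per-pixel and per-image loops from the question would live inside the callback.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs processLine(line) for every line in [0, lineCount), spreading the
// lines over a small fixed number of threads. processLine is a hypothetical
// callback that handles all pixels and images of one scan-line.
void forEachLineParallel(std::size_t lineCount,
                         const std::function<void(std::size_t)>& processLine) {
    std::size_t threadCount = std::thread::hardware_concurrency();
    if (threadCount == 0) threadCount = 2;  // the hint may be unavailable
    threadCount = std::min(threadCount, lineCount);

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threadCount; ++t) {
        workers.emplace_back([=, &processLine] {
            // Thread t handles lines t, t + threadCount, t + 2*threadCount, ...
            for (std::size_t line = t; line < lineCount; line += threadCount)
                processLine(line);
        });
    }
    // Join once, after all lines have been handed out.
    for (auto& worker : workers) worker.join();
}
```

Because each line is handled by exactly one thread, the callback can write to per-line output without extra locking, provided different lines do not share state.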
I believe the difference here is in when you decide to join the threads.
In the first piece of code, you join the threads at every pixel of the supposed source image. In the second piece of code, you only join the threads once at the very end.
Thread synchronization is expensive and often a bottleneck for parallel programs, because you are basically pausing the launch of any new threads until ALL threads that need to be synchronized (in this case, all the threads that are active) are done running.
If the iterations of the innermost loop(the one with im) are not dependent on each other, I would suggest you join the threads after the entire outermost loop is done.
I am rewriting an algorithm in C++ AMP and just ran into an issue with atomic writes, more specifically atomic_fetch_add, which apparently is only for integers?
I need to add a double_4 (or if I have to, a float_4) in an atomic fashion. How do I accomplish that with C++ AMP's atomics?
Is the best/only solution really to have a lock variable which my code can use to control the writes? I actually need to do atomic writes for a long list of output doubles, so I would essentially need a lock for every output.
I have already considered tiling this for better performance, but right now I am just in the first iteration.
EDIT:
Thanks for the quick answers already given.
I have a quick update to my question though.
I made the following lock attempt, but it seems that when one thread in a warp gets past the lock, all the other threads in the same warp just tag along. I was expecting only the first warp thread to get the lock, but I must be missing something (note that it has been quite a few years since my CUDA days, so I have just gotten dumb).
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
.....
for (int j = 0; j < attracted.extent.size(); j++)
{
...
int lock = 0; //the expected lock value
while (!atomic_compare_exchange(&locks[j], &lock, 1));
//when one warp thread gets the lock, ALL threads continue on
...
acceleration[j] += ...; //locked write
locks[j] = 0; //leaving the lock again
}
});
It is as such not a big problem, since I should write into a shared variable at first and only write it to global memory after all threads in a tile have completed, but I just don't understand this behavior.
All the atomic add ops are only for integer types. For float_4 (I'm assuming this is 4 floats) you can do what you want without locks using 128-bit CAS (compare-and-swap) operations, but there are no 256-bit CAS ops, which is what you would need for double_4. What you have to do is write a loop that atomically reads the float_4 from memory, performs the float add in the regular way, and then uses CAS to test and swap the value if it is still the original (and loops if not, i.e. if some other thread changed the value between the read and the write). Note that the 128-bit CAS is only available on 64-bit architectures and that your data needs to be properly aligned.
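The read, add, compare-and-swap loop described above has the following general shape. This is a host-side sketch in standard C++ on a single float using std::atomic, not C++ AMP code and not a float_4; the function name atomicAddFloat is made up for the example.

```cpp
#include <atomic>

// Adds delta to target atomically with a compare-and-swap retry loop and
// returns the new value. Illustrative host-side sketch only.
float atomicAddFloat(std::atomic<float>& target, float delta) {
    float observed = target.load();
    // If another thread changed target in the meantime, the exchange fails
    // and stores the current value into observed, so the next attempt
    // recomputes the sum from fresh data.
    while (!target.compare_exchange_weak(observed, observed + delta)) {
    }
    return observed + delta;
}
```

The same structure would apply to a wider value only if the platform offers a compare-and-swap of that width.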
if the critical code is short, you can create your own lock using atomic operations:
int lock = 1;
while(__sync_lock_test_and_set(&lock, 0) == 0) // trying to acquire lock
{
//yield the thread or go to sleep
}
//critical section, do the work
// release lock
lock = 1;
the advantage is you save the overhead of the OS locks.
The question has as such been answered by others and the answer is that you need to handle double atomics yourself. There is no function for it in the library.
I would also like to elaborate on my own edit, in case others come here with the same failing lock.
In the following example, my error was in not realizing that when the exchange failed, it actually changed the expected value! Thus the first thread would expect lock being zero and write a 1 in it. The next thread would expect 0 and would fail to write a one - but then the exchange wrote a one in the variable holding the expected value. This means that the next time the thread tried to do an exchange it expects a 1 in the lock! This it gets and then it thinks it gets the lock.
I was absolutely not aware that the &lock would receive a 1 on failed exchange match!
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
.....
for (int j = 0; j < attracted.extent.size(); j++)
{
...
int lock = 0; //the expected lock value
//note that, if locks[j]!=lock then lock=1
//meaning that ACE will be true the next time if locks[j]==1
//meaning the while will terminate even though someone else has the lock
while (!atomic_compare_exchange(&locks[j], &lock, 1));
//when one warp thread gets the lock, ALL threads continue on
...
acceleration[j] += ...; //locked write
locks[j] = 0; //leaving the lock again
}
});
It seems that a fix is to do
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
.....
for (int j = 0; j < attracted.extent.size(); j++)
{
...
int lock = 0; //the expected lock value
while (!atomic_compare_exchange(&locks[j], &lock, 1))
{
lock=0; //reset the expected value
};
//when one warp thread gets the lock, ALL threads continue on
...
acceleration[j] += ...; //locked write
locks[j] = 0; //leaving the lock again
}
});
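The pitfall described above can be reproduced on the host, since std::atomic's compare_exchange_strong also writes the current value into the expected argument when the exchange fails. The sketch below is plain C++ rather than C++ AMP, and lockFlag and unlockFlag are hypothetical helper names.

```cpp
#include <atomic>

// Spins until this thread flips flag from 0 (free) to 1 (held).
void lockFlag(std::atomic<int>& flag) {
    int expected = 0;
    while (!flag.compare_exchange_strong(expected, 1)) {
        // A failed exchange overwrote expected with the current value (1),
        // so it must be reset before the next attempt.
        expected = 0;
    }
}

// Releases the flag so another thread can take it.
void unlockFlag(std::atomic<int>& flag) { flag.store(0); }
```

Without the reset inside the loop, the second attempt would compare against 1, succeed while another thread still holds the flag, and let both threads into the protected region.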
I have a program that spawns 3 worker threads that do some number crunching, and waits for them to finish like so:
#define THREAD_COUNT 3
volatile LONG waitCount;
HANDLE pSemaphore;
int main(int argc, char **argv)
{
// ...
HANDLE threads[THREAD_COUNT];
pSemaphore = CreateSemaphore(NULL, THREAD_COUNT, THREAD_COUNT, NULL);
waitCount = 0;
for (int j=0; j<THREAD_COUNT; ++j)
{
threads[j] = CreateThread(NULL, 0, Iteration, p+j, 0, NULL);
}
WaitForMultipleObjects(THREAD_COUNT, threads, TRUE, INFINITE);
// ...
}
The worker threads use a custom Barrier function at certain points in the code to wait until all other threads reach the Barrier:
void Barrier(volatile LONG* counter, HANDLE semaphore, int thread_count = THREAD_COUNT)
{
LONG wait_count = InterlockedIncrement(counter);
if ( wait_count == thread_count )
{
*counter = 0;
ReleaseSemaphore(semaphore, thread_count - 1, NULL);
}
else
{
WaitForSingleObject(semaphore, INFINITE);
}
}
(Implementation based on this answer)
The program occasionally deadlocks. If at that point I use VS2008 to break execution and dig around in the internals, there is only 1 worker thread waiting on the Wait... line in Barrier(). The value of waitCount is always 2.
To make things even more awkward, the faster the threads work, the more likely they are to deadlock. If I run in Release mode, the deadlock comes about 8 out of 10 times. If I run in Debug mode and put some prints in the thread function to see where they hang, they almost never hang.
So it seems that some of my worker threads are killed early, leaving the rest stuck on the Barrier. However, the threads do literally nothing except read and write memory (and call Barrier()), and I'm quite positive that no segfaults occur. It is also possible that I'm jumping to the wrong conclusions, since (as mentioned in the question linked above) I'm new to Win32 threads.
What could be going on here, and how can I debug this sort of weird behavior with VS?
How do I debug weird thread behaviour?
Not quite what you said, but the answer is almost always: understand the code really well, understand all the possible outcomes and work out which one is happening. A debugger becomes less useful here, because you can either follow one thread and miss out on what is causing other threads to fail, or follow from the parent, in which case execution is no longer sequential and you end up all over the place.
Now, onto the problem.
pSemaphore = CreateSemaphore(NULL, THREAD_COUNT, THREAD_COUNT, NULL);
From the MSDN documentation:
lInitialCount [in]: The initial count for the semaphore object. This value must be greater than or equal to zero and less than or equal to lMaximumCount. The state of a semaphore is signaled when its count is greater than zero and nonsignaled when it is zero. The count is decreased by one whenever a wait function releases a thread that was waiting for the semaphore. The count is increased by a specified amount by calling the ReleaseSemaphore function.
And here:
Before a thread attempts to perform the task, it uses the WaitForSingleObject function to determine whether the semaphore's current count permits it to do so. The wait function's time-out parameter is set to zero, so the function returns immediately if the semaphore is in the nonsignaled state. WaitForSingleObject decrements the semaphore's count by one.
So what we're saying here, is that a semaphore's count parameter tells you how many threads are allowed to perform a given task at once. When you set your count initially to THREAD_COUNT you are allowing all your threads access to the "resource" which in this case is to continue onwards.
The answer you link uses this creation method for the semaphore:
CreateSemaphore(0, 0, 1024, 0)
Which basically says none of the threads are permitted to use the resource. In your implementation, the semaphore is signaled (>0), so everything carries on merrily until one of the threads manages to decrease the count to zero, at which point some other thread waits for the semaphore to become signaled again, which probably isn't happening in sync with your counters. Remember when WaitForSingleObject returns it decreases the counter on the semaphore.
In the example you've posted, setting:
::ReleaseSemaphore(sync.Semaphore, sync.ThreadsCount - 1, 0);
This works because each WaitForSingleObject call decreases the semaphore's value by 1, and there are threadcount - 1 of them to do. Once all threadcount - 1 WaitForSingleObject calls have returned, the semaphore is back to 0 and therefore unsignaled again, so on the next pass everybody waits because nobody is allowed to access the resource.
So in short, set your initial value to zero and see if that fixes it.
Edit: A little explanation. To think of it a different way, a semaphore is like an n-atomic gate. What you usually do is this:
// Set the number of tickets:
HANDLE Semaphore = CreateSemaphore(0, 20, 200, 0);
// Later on in a thread somewhere...
// Get a ticket in the queue
WaitForSingleObject(Semaphore, INFINITE);
// Only 20 threads can access this area
// at once. When one thread has entered
// this area the available tickets decrease
// by one. When there are 20 threads here
// all other threads must wait.
// do stuff
ReleaseSemaphore(Semaphore, 1, 0);
// gives back one ticket.
So the use we're putting semaphores to here isn't quite the one for which they were designed.
It's a bit hard to guess exactly what you might be running into. Parallel programming is one of those places that (IMO) it pays to follow the philosophy of "keep it so simple it's obviously correct", and unfortunately I can't say that your Barrier code seems to qualify. Personally, I think I'd have something like this:
// define and initialize the array of events used for the barrier:
HANDLE barrier_[thread_count];
for (int i=0; i<thread_count; i++)
barrier_[i] = CreateEvent(NULL, true, false, NULL);
// ...
void Barrier(size_t thread_num) {
// Signal that this thread has reached the barrier:
SetEvent(barrier_[thread_num]);
// Then wait for all the threads to reach the barrier:
WaitForMultipleObjects(thread_count, barrier_, true, INFINITE);
}
Edit:
Okay, now that the intent has been clarified (need to handle multiple iterations), I'd modify the answer, but only slightly. Instead of one array of Events, have two: one for the odd iterations and one for the even iterations:
// define and initialize the arrays of events used for the barrier:
HANDLE barrier_[2][thread_count];
for (int i=0; i<thread_count; i++) {
barrier_[0][i] = CreateEvent(NULL, true, false, NULL);
barrier_[1][i] = CreateEvent(NULL, true, false, NULL);
}
// ...
void Barrier(size_t thread_num, int iteration) {
// Signal that this thread has reached the barrier:
SetEvent(barrier_[iteration & 1][thread_num]);
// Then wait for all the threads to reach the barrier:
WaitForMultipleObjects(thread_count, barrier_[iteration & 1], true, INFINITE);
ResetEvent(barrier_[iteration & 1][thread_num]);
}
In your barrier, what prevents this line:
*counter = 0;
from being executed while this other one is executed by another thread?
LONG wait_count = InterlockedIncrement(counter);
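One way to rule out that interleaving is to do the increment, the reset and the release under a single mutex and to track a generation number, so that a fast thread re-entering the barrier cannot be confused with a thread from the previous round. The sketch below uses std::mutex and std::condition_variable instead of the Win32 semaphore; BarrierSketch and its members are names invented for the example.

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative reusable barrier. All state changes happen under mutex_, so
// no increment can interleave with the reset of the counter.
class BarrierSketch {
public:
    explicit BarrierSketch(int threadCount) : threadCount_(threadCount) {}

    void arriveAndWait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const unsigned myGeneration = generation_;
        if (++waiting_ == threadCount_) {
            // Last arrival: reset the count and open the barrier for this round.
            waiting_ = 0;
            ++generation_;
            released_.notify_all();
        } else {
            // Sleep until the round this thread arrived in has been released.
            released_.wait(lock, [&] { return generation_ != myGeneration; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable released_;
    int threadCount_;
    int waiting_ = 0;
    unsigned generation_ = 0;
};
```

Each worker would call arriveAndWait() at the points where the original code calls Barrier().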