Readers-writers problem with writer priority - concurrency

Problem
In the book "Little Book of Semaphores", p. 71, there is the following problem:
Write a solution to the readers-writers problem that gives priority to writers. That is, once a writer arrives, no readers should be allowed to enter until all writers have left the system.
I arrived at a solution, but it's a bit different from the one given in the book.
My solution
Shared variables:
readSwitch = Lightswitch()
writeSwitch = Lightswitch()
noWriters = Semaphore(1)
noReaders = Semaphore(1)
Semaphore(1) means a semaphore initialized to 1, and Lightswitch is defined in the book like this:
class Lightswitch:
    def __init__(self):
        self.counter = 0
        self.mutex = Semaphore(1)

    def lock(self, semaphore):
        self.mutex.wait()
        self.counter += 1
        if self.counter == 1:
            semaphore.wait()
        self.mutex.signal()

    def unlock(self, semaphore):
        self.mutex.wait()
        self.counter -= 1
        if self.counter == 0:
            semaphore.signal()
        self.mutex.signal()
Reader logic:
noReaders.wait()
noReaders.signal()
readSwitch.lock(noWriters)
# critical section for readers
readSwitch.unlock(noWriters)
Writer logic:
writeSwitch.lock(noReaders)
noWriters.wait()
# critical section for writers
writeSwitch.unlock(noReaders)
noWriters.signal()
Book's solution
Shared variables:
readSwitch = Lightswitch()
writeSwitch = Lightswitch()
noWriters = Semaphore(1)
noReaders = Semaphore(1)
Reader logic:
noReaders.wait()
readSwitch.lock(noWriters)
noReaders.signal()
# critical section for readers
readSwitch.unlock(noWriters)
Writer logic:
writeSwitch.lock(noReaders)
noWriters.wait()
# critical section for writers
noWriters.signal()
writeSwitch.unlock(noReaders)
Questions
1) In the reader logic, in my solution, noReaders.signal() immediately follows noReaders.wait(). The idea here is that noReaders behaves as a kind of turnstile that allows the readers to pass through it, but is locked by a writer as soon as one arrives.
However, in the book's solution, noReaders.signal() is done after calling readSwitch.lock(noWriters).
Is there any reason why my ordering would produce an incorrect behavior?
2) In the writer logic, in my solution, writeSwitch.unlock(noReaders) comes before noWriters.signal(). However, the book places them in the reverse order. Is there any reason why my ordering would produce an incorrect behavior?
Edit: additional question
I have an additional question. There is something that I am probably misunderstanding about the book's solution. Let's say that the following happens:
Reader 1 acquires noReaders (noReaders becomes 0)
Reader 2 waits in noReaders (noReaders becomes -1)
Writer 1 waits in noReaders (noReaders becomes -2)
Reader 3 waits in noReaders (noReaders becomes -3)
Reader 4 waits in noReaders (noReaders becomes -4)
Reader 1 acquires noWriters (noWriters becomes 0)
Reader 1 signals noReaders (noReaders becomes -3; Reader 2 gets unblocked)
Reader 2 passes through the noWriters lightswitch; signals noReaders (noReaders becomes -2; Reader 3 gets unblocked)
In the above situation, it seems that additional readers can keep arriving and entering the critical section even though a writer is waiting.
Besides, considering that readers and writers run in a loop, a reader that finished its critical section could loop around and wait on noReaders again, even if there are writers already waiting.
What am I misunderstanding here?

With your solution, the following situation is possible:
A new writer arrives after writeSwitch.unlock(noReaders) but before noWriters.signal().
The new writer executes writeSwitch.lock(noReaders).
A reader executes readSwitch.lock(noWriters) and enters the critical section even though a writer is present.
The book's solution wakes up waiting writers before waking up the readers. Your solution wakes up readers first, and it also allows readers to queue up at readSwitch.lock(noWriters), so they can run even while a writer is waiting.
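For reference, here is how the book's solution might look in C++20, with std::binary_semaphore standing in for the book's Semaphore(1). This is a sketch of mine for illustration, not code from the book:

#include <mutex>
#include <semaphore>

struct Lightswitch {
    int counter = 0;
    std::mutex mutex;
    void lock(std::binary_semaphore& sem) {
        std::lock_guard<std::mutex> g(mutex);
        if (++counter == 1)
            sem.acquire();              // first one in locks sem
    }
    void unlock(std::binary_semaphore& sem) {
        std::lock_guard<std::mutex> g(mutex);
        if (--counter == 0)
            sem.release();              // last one out releases sem
    }
};

std::binary_semaphore noWriters{1}, noReaders{1};
Lightswitch readSwitch, writeSwitch;

void reader() {
    noReaders.acquire();                // held via writeSwitch while writers are in the system
    readSwitch.lock(noWriters);
    noReaders.release();                // book's order: release only after joining the read group
    // critical section for readers
    readSwitch.unlock(noWriters);
}

void writer() {
    writeSwitch.lock(noReaders);        // first writer bars new readers
    noWriters.acquire();
    // critical section for writers
    noWriters.release();
    writeSwitch.unlock(noReaders);      // book's order: readmit readers last
}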

Related

Does vkQueuePresentKHR prevent later commands from executing while it is waiting on the semaphore?

This is kind of a follow-up question for this question, and it is also based on the code provided by the same Vulkan tutorial.
Here is a simplified example:
// Vulkan handles defined and initialized elsewhere
VkDevice device;
VkQueue queue;
VkSemaphore semaphore;
VkSwapchainKHR swapchain;
VkCommandBuffer cmd_buffer;
// Renderer code
uint32_t image_index; // image acquisition is omitted
VkPresentInfoKHR present_info{};
present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
present_info.waitSemaphoreCount = 1;
present_info.pWaitSemaphores = &semaphore;
present_info.swapchainCount = 1;
present_info.pSwapchains = &swapchain;
present_info.pImageIndices = &image_index;
VkSubmitInfo submit_info{};
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
// ... irrelevant code omitted
submit_info.pCommandBuffers = &cmd_buffer;
vkQueuePresentKHR(queue, &present_info);
vkQueueSubmit(queue, 1, &submit_info, VK_NULL_HANDLE);
In the above example, will the commands in cmd_buffer also have to wait until semaphore is signaled?
I am asking about this because a comment below the tutorial mentioned that:
However, if the graphics and present queue do end up being the same, then the renderFinished semaphore guarantees proper execution ordering. This is because the vkQueuePresentKHR command waits on that semaphore and it must begin before later commands in the queue begin (due to implicit ordering) and that only happens after rendering from the previous frame finished.
In the above example, will the commands in cmd_buffer also have to wait until semaphore is signaled?
Only if you use the semaphore as a waitSemaphore for the later submit.
This is because the vkQueuePresentKHR command waits on that semaphore and it must begin before later commands in the queue begin (due to implicit ordering) and that only happens after rendering from the previous frame finished.
I don't believe this is true.
Commands start in implicit order with respect to other commands in the queue, but this is pipelined on a stage-to-stage basis. Also note the spec wording says "start in order", not "complete in order", which is a specification sleight of hand. Hardware is perfectly free to overlap, and to execute out of order, individual commands that are otherwise sequential in the stream, unless the stream contains synchronization primitives that stop it from doing so.
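To make "use the semaphore as a waitSemaphore for the later submit" concrete, the submit would look roughly like the sketch below (mine, not the tutorial's code). Note that a binary semaphore signal can only be consumed by a single wait operation, so this wait would have to replace the present's wait on semaphore rather than be added alongside it:

VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT; // conservative: blocks all later stages
VkSubmitInfo submit_info{};
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit_info.waitSemaphoreCount = 1;
submit_info.pWaitSemaphores = &semaphore;      // now cmd_buffer does wait on it
submit_info.pWaitDstStageMask = &wait_stage;
submit_info.commandBufferCount = 1;
submit_info.pCommandBuffers = &cmd_buffer;
vkQueueSubmit(queue, 1, &submit_info, VK_NULL_HANDLE);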

Do I need dedicated fences/semaphores per swap chain image, per frame or per command pool in Vulkan?

I've read several articles on the CPU-GPU (using fences) and GPU-GPU (using semaphores) synchronization mechanisms, but I still have trouble understanding how I should implement a simple render loop.
Please take a look at the simple render() function below. If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
However, is this really safe? All operations are asynchronous. So, is it really safe to "reuse" the image_available semaphore in a subsequent call of render() even though the signal request from the previous call hasn't fired yet? I would think it's not, but, on the other hand, we're using the same queues (I don't know if it matters whether the graphics and presentation queues are actually the same), and operations inside a queue should be consumed sequentially ... But if I got it right, they might not be consumed "as a whole" and could be reordered ...
The second thing is that (again, unless I'm missing something) I clearly should use one fence per swap chain image to ensure that the operation on the image corresponding to the image_index of the call to render() has finished. But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE,
        std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
    throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR? And do I then need dedicated image_available and rendering_finished semaphores per swap chain image? Or maybe per frame? Or maybe per command buffer/pool? I'm really confused ...
void render()
{
    std::uint32_t image_index;
    switch (vkAcquireNextImageKHR(device(), swap_chain().handle(),
        std::numeric_limits<std::uint64_t>::max(), m_image_available, VK_NULL_HANDLE, &image_index))
    {
    case VK_SUBOPTIMAL_KHR:
    case VK_SUCCESS:
        break;
    case VK_ERROR_OUT_OF_DATE_KHR:
        on_resized();
        return;
    default:
        throw std::runtime_error("vkAcquireNextImageKHR");
    }
    static VkPipelineStageFlags constexpr wait_destination_stage_mask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    VkSubmitInfo submit_info{};
    submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit_info.waitSemaphoreCount = 1;
    submit_info.pWaitSemaphores = &m_image_available;
    submit_info.signalSemaphoreCount = 1;
    submit_info.pSignalSemaphores = &m_rendering_finished;
    submit_info.pWaitDstStageMask = &wait_destination_stage_mask;
    if (vkQueueSubmit(graphics_queue().handle, 1, &submit_info, VK_NULL_HANDLE) != VK_SUCCESS)
        throw std::runtime_error("vkQueueSubmit");
    VkSwapchainKHR swap_chain_handle = swap_chain().handle(); // pSwapchains needs an lvalue to point at
    VkPresentInfoKHR present_info{};
    present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
    present_info.waitSemaphoreCount = 1;
    present_info.pWaitSemaphores = &m_rendering_finished;
    present_info.swapchainCount = 1;
    present_info.pSwapchains = &swap_chain_handle;
    present_info.pImageIndices = &image_index;
    switch (vkQueuePresentKHR(presentation_queue().handle, &present_info))
    {
    case VK_SUCCESS:
        break;
    case VK_ERROR_OUT_OF_DATE_KHR:
    case VK_SUBOPTIMAL_KHR:
        on_resized();
        return;
    default:
        throw std::runtime_error("vkQueuePresentKHR");
    }
}
EDIT: As suggested in the answers below, assume we have k "frames in flight" and hence k instances of the semaphores and the fence used in the code above, which I will denote by m_image_available[i], m_rendering_finished[i] and m_fence[i] for i = 0, ..., k - 1. Let i denote the current index of the frame in flight, which is increased by 1 after each invocation of render(), and j denote the number of invocations of render(), starting from j = 0.
Now, assume the swap chain contains three images.
If j = 0, then i = 0 and the first frame in flight is using swap chain image 0
In the same way, if j = a, then i = a and the corresponding frame in flight is using swap chain image a, for a = 1, 2
Now, if j = 3, then i = 3, but since the swap chain only has three images, the fourth frame in flight is using swap chain image 0 again. I wonder whether this is problematic or not. I guess it's not, since the wait/signal semaphores m_image_available[3]/m_rendering_finished[3], used in the calls of vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR in this invocation of render(), are dedicated to this particular frame in flight.
If we reach j = k, then i = 0 again, since there are only k frames in flight. Now we potentially wait at the beginning of render(), if the call to vkQueuePresentKHR from the first invocation (i = 0) of render() hasn't signaled m_fence[0] yet.
So, besides my doubts described in the third bullet point above, the only question which remains is why I shouldn't take k as large as possible? What I theoretically could imagine is that if we are submitting work to the GPU in a quicker fashion than the GPU is able to consume, the used queue(s) might continually grow and eventually overflow (is there some kind of "max commands in queue" limit?).
If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
Yes, you got it right. You submit the desire to get a new image to render into via vkAcquireNextImageKHR. The presentation engine will signal the m_image_available semaphore as soon as an image to render into has become available. But you have already submitted the instruction.
Next, you submit some commands to the graphics queue via submit_info. I.e. they are also already submitted to the GPU and wait there until the m_image_available semaphore receives its signal.
Furthermore, a presentation instruction is submitted to the presentation engine that expresses the dependency that it needs to wait until the submit_info-commands have completed by waiting on the m_rendering_finished semaphore.
I.e. everything has been submitted. If nothing has been signalled yet, everything just sits there in some GPU buffers and waits for signals.
Now, if your code loops right back into the render() function and re-uses the same m_image_available and m_rendering_finished semaphores, it will only work if you are very lucky, namely if all the semaphores have already been signalled before you use them again.
The specifications says the following for vkAcquireNextImageKHR:
If semaphore is not VK_NULL_HANDLE it must not have any uncompleted signal or wait operations pending
and furthermore, it says under 7.4.2. Semaphore Waiting
the act of waiting for a binary semaphore also unsignals that semaphore.
I.e. indeed, you need to wait on the CPU until you know for sure that the previous vkAcquireNextImageKHR that uses the same m_image_available semaphore has completed.
And yes, you already got it right: you need to use a fence for that, which you pass to vkQueueSubmit. If you do not synchronize on the CPU, you'll shovel ever more work onto the GPU (which is a problem) and the semaphores that you are re-using might not get properly unsignalled in time (which is also a problem).
What is often done is that the semaphores and fences are multiplied, e.g. to 3 each, and these sets of synchronization objects are used in sequence, so that more work can be parallelized on the GPU. The Vulkan Tutorial describes this quite nicely in its Rendering and presentation chapter. It is also explained with animation in this lecture starting at 7:59.
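In code, the "multiplied" synchronization objects amount to something like the following sketch. This is my illustration of the pattern, not the tutorial's code; k, the array names, and the elided record/submit/present details are assumptions, and device, swapchain and queue are assumed to be valid handles:

constexpr int k = 3;              // frames in flight
VkSemaphore image_available[k];   // one pair of semaphores ...
VkSemaphore rendering_finished[k];
VkFence in_flight[k];             // ... and one fence per frame in flight,
                                  // each created with VK_FENCE_CREATE_SIGNALED_BIT
int frame = 0;

void render_frame()
{
    // CPU-GPU sync: wait until this slot's previous submit has retired,
    // so its semaphores are unsignalled and safe to reuse.
    vkWaitForFences(device, 1, &in_flight[frame], VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &in_flight[frame]);

    uint32_t image_index;
    vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                          image_available[frame], VK_NULL_HANDLE, &image_index);

    // ... record commands for image_index, then submit, signalling this slot's
    // fence: vkQueueSubmit(queue, 1, &submit_info, in_flight[frame]);
    // ... present, waiting on rendering_finished[frame] ...

    frame = (frame + 1) % k;
}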
So first of all, as you mentioned correctly, semaphores are strictly for GPU-GPU synchronization, e.g. to make sure that one batch of commands (one submit) has finished before another one starts. This is here used to synchronize the rendering commands with the present command such that the presenting engine knows when to present the rendered image.
Fences are the main utility for CPU-GPU synchronization. You place a fence in a queue submit and then on the CPU side wait for it before you want to proceed. This is usually done here such that we do not queue any new rendering/present commands while the previous frame hasn't finished.
But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR?
Yes, you definitely need this in your code, otherwise your semaphores would not be safe and you would probably get validation errors.
In general, if you want your CPU to wait until your GPU has finished rendering of the previous frame, you would have only a single fence and a single pair of semaphores. You could also replace the fence by a waitIdle command of the queue or device.
However, in practice you do not want to stall the CPU; you want to record commands for the next frame in the meantime. This is done via frames in flight. It simply means that for every frame in flight (i.e. every frame that can be recorded in parallel to the execution on the GPU), you have one fence and one pair of semaphores which synchronize that particular frame.
So in essence, for your render loop to work properly you need a pair of semaphores plus a fence per frame in flight, independent of the number of swapchain images. However, do note that the current frame index (frame in flight) and the image index (swapchain) will generally not be the same unless you use the same number of swapchain images as frames in flight. This is because the presentation engine might give you swapchain images out of order, depending on your present mode.

The producer/consumer problem - changing semaphores order

Let's assume I have a multi-producer, single-consumer scenario.
The pseudo code for a producer is:
product = produce()
wait(empty)
wait(mutex)
array[in] = product
in = (in + 1) % n
signal(mutex)
signal(full)
The pseudo code for a consumer is:
wait(full)
product = array[out]
out = (out + 1) % n
signal(empty)
useProduct()
What would happen if I swapped the semaphores in the consumer, i.e. signal(empty) before wait(full)?
I have tried to implement this scenario in Java but I can't really see any change.
wait(full) is there to notify the consumer that there is something to consume. If you do not issue wait(full) first, the consumer can consume before a producer produced anything.
If you're testing this in Java, try starting the consumer before the producer, and let the producer wait a little bit before producing the first item to let the consumer "consume".
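As a concrete rendering of the pseudocode above, here is a C++20 sketch of mine using std::counting_semaphore (the names mirror the pseudocode; error handling and the produce()/useProduct() calls are omitted):

#include <array>
#include <mutex>
#include <semaphore>

constexpr int n = 8;                  // buffer slots
std::array<int, n> slots;
int in = 0, out = 0;
std::counting_semaphore<n> empty{n};  // counts free slots
std::counting_semaphore<n> full{0};   // counts filled slots
std::mutex mtx;                       // guards in/slots among the producers

void produce_one(int product) {
    empty.acquire();                  // wait(empty)
    {
        std::lock_guard<std::mutex> g(mtx); // wait(mutex) ... signal(mutex)
        slots[in] = product;
        in = (in + 1) % n;
    }
    full.release();                   // signal(full)
}

int consume_one() {
    full.acquire();                   // wait(full): guarantees a product exists
    int product = slots[out];         // single consumer, so no mutex needed here
    out = (out + 1) % n;
    empty.release();                  // signal(empty): the slot is free again
    return product;
}

full.acquire() is the line that keeps the consumer from running ahead of the producers; moving empty.release() in front of it inflates empty beyond the real number of free slots, so producers can wrap around and overwrite a slot that was never consumed.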

How can I use Boost condition variables in producer-consumer scenario?

EDIT: see update below.
I have one thread responsible for streaming data from a device in buffers. In addition, I have N threads doing some processing on that data. In my setup, I would like the streamer thread to fetch data from the device, and wait until the N threads are done with the processing before fetching new data or a timeout is reached. The N threads should wait until new data has been fetched before continuing to process. I believe that this framework should work if I don't want the N threads to repeat processing on a buffer and if I want all buffers to be processed without skipping any.
After careful reading, I found that condition variables is what I needed. I have followed tutorials and other stack overflow questions, and this is what I have:
global variables:
boost::condition_variable cond;
boost::mutex mut;
member variables:
std::vector<double> buffer;
std::vector<bool> data_ready; // size equal to the number of threads
data receiver loop (1 thread runs this):
while (!gotExitSignal())
{
    {
        boost::unique_lock<boost::mutex> ll(mut);
        while (any(data_ready))
            cond.wait(ll);
    }
    receive_data(buffer);
    {
        boost::lock_guard<boost::mutex> ll(mut);
        set_true(data_ready);
    }
    cond.notify_all();
}
data processing loop (N threads run this)
while (!gotExitSignal())
{
    {
        boost::unique_lock<boost::mutex> ll(mut);
        while (!data_ready[thread_id])
            cond.wait(ll);
    }
    process_data(buffer);
    {
        boost::lock_guard<boost::mutex> ll(mut);
        data_ready[thread_id] = false;
    }
    cond.notify_all();
}
These two loops are in their own member functions of the same class. The variable buffer is a member variable, so it can be shared across threads.
The receiver thread will be launched first. The data_ready variable is a vector of bools of size N. data_ready[i] is true if data is ready to be processed and false if thread i has already processed the data. The function any(data_ready) returns true if any element of data_ready is true, and false otherwise. The set_true(data_ready) function sets all elements of data_ready to true. The receiver thread checks whether any processing thread is still processing. If not, it fetches data, sets the data_ready flags, notifies the threads, and continues with the loop, which will block at the beginning until processing is done. Each processing thread checks its respective data_ready flag. Once it is true, the processing thread does some computations, sets its respective data_ready flag to false, and continues with the loop.
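(The two helpers are not shown in the question; from the description they would look something like this sketch, which assumes <vector> and <algorithm>:)

bool any(const std::vector<bool>& v)    // true if any element is true
{
    for (bool b : v)
        if (b) return true;
    return false;
}

void set_true(std::vector<bool>& v)     // mark data ready for every thread
{
    std::fill(v.begin(), v.end(), true);
}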
If I only have one processing thread, the program runs fine. Once I add more threads, I run into issues where the output of the processing is garbage. In addition, the order in which the processing threads are launched matters for some reason; in other words, the LAST thread I launch will output correct data whereas the previous threads will output garbage, no matter what the input parameters for the processing are (assuming valid parameters). I don't know if the problem is due to my threading code or if there is something wrong with my device or data processing setup. I tried using couts at the processing and receiving steps, and with N processing threads, I see the output as it should be:
receive data
process 1
process 2
...
process N
receive data
process 1
process 2
...
Is the usage of the condition variables correct? What could be the problem?
EDIT: I followed fork's suggestions and changed the code to:
data receiver loop (1 thread runs this):
while (!gotExitSignal())
{
    if (!any(data_ready))
    {
        receive_data(buffer);
        boost::lock_guard<boost::mutex> ll(mut);
        set_true(data_ready);
        cond.notify_all();
    }
}
data processing loop (N threads run this)
while (!gotExitSignal())
{
    // boost::unique_lock<boost::mutex> ll(mut);
    boost::mutex::scoped_lock ll(mut);
    cond.wait(ll);
    process_data(buffer);
    data_ready[thread_id] = false;
}
It works somewhat better. Am I using the correct locks?
I did not read your whole story, but if I look at the code quickly I see that you are using conditions wrong.
A condition is like a state: once you put a thread in a waiting condition, it gives up the CPU. So your thread will effectively stop running until some other process/thread notifies it.
In your code you have a while loop, and each time you check for data you wait. That is wrong; it should be an if instead of a while. But then again, it should not be there at all. The checking for data should be done somewhere else, and your worker thread should put itself in the waiting condition after it has done its work.
Your worker threads are the consumers. And the producers are the ones that deliver the data.
I think a better construction would be to make a thread check if there is data and notify the worker(s).
PSEUDO CODE:
//producer
while (true) {
    1. lock mutex
    2. is data available
    3. unlock mutex
    if (dataAvailableVariable) {
        4. notify a worker
        5. set waiting condition
    }
}

//consumer
while (true) {
    1. lock mutex
    2. do some work
    3. unlock mutex
    4. notify producer that work is done
    5. set wait condition
}
You should also take care that some thread always remains runnable in order to avoid a deadlock, i.e. a state in which all threads are in the waiting condition.
I hope that helps you a little.
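As an aside, Boost's condition_variable also has a predicate-taking overload of wait() that re-checks the condition under the lock on every wakeup, which makes the wait robust against spurious wakeups regardless of how the surrounding loop is structured. A sketch of the worker's wait, written against the question's variables:

boost::unique_lock<boost::mutex> ll(mut);
cond.wait(ll, [&]{ return data_ready[thread_id]; }); // loops internally until the predicate holds
process_data(buffer);          // still under the lock in this sketch
data_ready[thread_id] = false;
ll.unlock();
cond.notify_all();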

C++ Handling multiple threads in a main thread

I am a bit new to multi threading, so forgive me if these questions are too trivial.
My application needs to create multiple threads from a main thread and perform actions in each thread.
For example, I have a set of files to read, say 50, and I create a thread to read these files using the CreateThread() function.
Now this main thread creates 4 threads to access the files. The 1st thread is given file 1, the second file 2, and so on.
After the 1st thread has finished reading file 1 and given the main thread the required data, the main thread needs to invoke it with file 5 and obtain data from it. The same goes for all other threads until all 50 files are read.
After that, each thread is destroyed and finally my main thread is destroyed.
The issue I am facing is:
1) How do I stop a thread from exiting after it has finished reading a file?
2) How do I invoke the thread again with another file name?
3) How would my child thread give information to the main thread?
4) After a thread completes reading a file and returns data to the main thread, how would the main thread know which thread provided the data?
Thanks
This is a very common problem in multi-threaded programming. You can view this as a producer-consumer problem: the main thread "produces" tasks which are "consumed" by the worker threads (see e.g. http://www.mario-konrad.ch/blog/programming/multithread/tutorial-06.html). You might also want to read about "thread pools".
I would highly recommend to read into boost's Synchronization (http://www.boost.org/doc/libs/1_50_0/doc/html/thread.html) and use boost's threading functionality as it is platform independent and good to use.
To be more specific to your question: you should create a queue of operations to be done (usually the same queue for all worker threads; if you really want to ensure that thread 1 performs tasks 1, 5, 9, ... you might want one queue per worker thread). Access to this queue must be synchronized by a mutex, and waiting threads can be notified via a condition_variable when new data is added to the queue.
1.) don't exit the thread function but wait until a condition is fired and then restart using a while ([exit condition not true]) loop
2.) see 1.
3.) through any variable to which both have access and which is secured by a mutex (e.g. a result queue)
4.) by adding this information as the result written to the result queue.
One more piece of advice: it's always hard to get multi-threading correct. So try to be as careful as possible and write tests to detect deadlocks and race conditions.
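A sketch of the kind of synchronized queue described above, using the std:: primitives that mirror the recommended boost ones (the class and its interface are my own illustration):

#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <string>

class WorkQueue {
    std::queue<std::string> items;   // e.g. filenames to process
    std::mutex m;
    std::condition_variable cv;
    bool closed = false;
public:
    void push(std::string s) {
        { std::lock_guard<std::mutex> g(m); items.push(std::move(s)); }
        cv.notify_one();             // wake one waiting worker
    }
    void close() {                   // no more work will arrive
        { std::lock_guard<std::mutex> g(m); closed = true; }
        cv.notify_all();
    }
    std::optional<std::string> pop() { // blocks until an item arrives or the queue is closed
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&]{ return !items.empty() || closed; });
        if (items.empty()) return std::nullopt;
        std::string s = std::move(items.front());
        items.pop();
        return s;
    }
};

Workers loop on pop() until it returns an empty optional. That addresses questions 1), 2) and 4) at once: a thread never exits between files, it just takes the next name from the queue, and results can travel back through a second queue of the same shape.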
The typical solution for this kind of problem is using a thread pool and a queue. The main thread pushes all files/filenames to a queue, then starts a thread pool, i.e. several threads, in which each thread takes an item from the queue and processes it. When one item is processed, it goes on to the next one (if by then the queue is not yet empty). The main thread knows everything is processed when the queue is empty and all threads have exited.
So, 1) and 2) are somewhat conflicting: you don't stop the thread and invoke it again, it just keeps running as long as it finds items on the queue.
For 3) you can again use a queue in which the thread puts information, and from which the main thread reads. For 4) you could give each thread an id and put that together with the data. However normally the main thread should not need to know which thread exactly processed data.
Some very basic pseudocode to give you an idea, locking for thread safety omitted:
//main
for( all filenames )
    queue.push_back( filename );
//start some threads
threadPool.StartThreads( 4, CreateThread( queue ) );
//wait for threads to end
threadPool.Join();

//thread
class Thread
{
public:
    Thread( queue q ) : q( q ) {}
    void Start();
    bool Join();
    void ThreadFun()
    {
        while( auto nextQueueItem = q.pop_back() ) // keeps running until q is empty
            ProcessItem( nextQueueItem );
    }
};
Whether or not you use a thread pool to execute your asynchronous file reads, it boils down to a chain of functions, or groups of functions, that have to run serialized. So let's assume you find a way to execute functions in parallel (be it by starting one thread per function or by using a thread pool). To wait for the first 4 files to be read, you can use a queue into which the reading threads push their results; the fifth function then pulls 4 results out of the queue (the queue blocks when empty) and processes them. If there are more dependencies between functions, you can add more queues between them. Sketch:
void read_file( const std::string& name, queue& q )
{
    file_content f = .... // read file
    q.push( f );
}

void process4files( queue& q )
{
    std::vector< file_content > result;
    for ( int i = 0; i != 4; ++i )
        result.push_back( q.pop() );
    // now 4 files are read ...
    assert( result.size() == 4u );
}

queue q;
thread t1( &read_file, "file1", q );
thread t2( &read_file, "file2", q );
thread t3( &read_file, "file3", q );
thread t4( &read_file, "file4", q );
thread t5( &process4files, q );
t5.join();
I hope you get the idea.
Torsten