Do I need dedicated fences/semaphores per swap chain image, per frame or per command pool in Vulkan? - c++

I've read several articles on the CPU-GPU (using fences) and GPU-GPU (using semaphores) synchronization mechanisms, but still got trouble to understand how I should implement a simple render-loop.
Please take a look at the simple render() function below. If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
However, is this really safe? All operations are asynchronous. So, is it really safe to "reuse" the image_available semaphore in a subsequent call of render() again even though the signal request from the previous call hasn't fired yet? I would think it's not, but, on the other hand, we're using the same queues (don't know if it matters where the graphics and presentation queue are actually the same) and operations inside a queue should be sequentially consumed ... But if I got it right, they might not be consumed "as a whole" and could be reordered ...
The second thing is that (again, unless I'm missing something) I clearly should use one fence per swap chain image to ensure that the operation on the image corresponding to the image_index of the call to render() has finished. But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR? And do I then need dedicated image_available and rendering_finished semaphores per swap chain image? Or maybe per frame? Or maybe per command buffer/pool? I'm really confused ...
void render()
{
std::uint32_t image_index;
switch (vkAcquireNextImageKHR(device(), swap_chain().handle(),
std::numeric_limits<std::uint64_t>::max(), m_image_available, VK_NULL_HANDLE, &image_index))
{
case VK_SUBOPTIMAL_KHR:
case VK_SUCCESS:
break;
case VK_ERROR_OUT_OF_DATE_KHR:
on_resized();
return;
default:
throw std::runtime_error("vkAcquireNextImageKHR");
}
static VkPipelineStageFlags constexpr wait_destination_stage_mask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit_info{};
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit_info.waitSemaphoreCount = 1;
submit_info.pWaitSemaphores = &m_image_available;
submit_info.signalSemaphoreCount = 1;
submit_info.pSignalSemaphores = &m_rendering_finished;
submit_info.pWaitDstStageMask = &wait_destination_stage_mask;
if (vkQueueSubmit(graphics_queue().handle, 1, &submit_info, VK_NULL_HANDLE) != VK_SUCCESS)
throw std::runtime_error("vkQueueSubmit");
VkPresentInfoKHR present_info{};
present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
present_info.waitSemaphoreCount = 1;
present_info.pWaitSemaphores = &m_rendering_finished;
present_info.swapchainCount = 1;
present_info.pSwapchains = &swap_chain().handle();
present_info.pImageIndices = &image_index;
switch (vkQueuePresentKHR(presentation_queue().handle, &present_info))
{
case VK_SUCCESS:
break;
case VK_ERROR_OUT_OF_DATE_KHR:
case VK_SUBOPTIMAL_KHR:
on_resized();
return;
default:
throw std::runtime_error("vkQueuePresentKHR");
}
}
EDIT: As suggested in the answers below, assume we have k "frames in flight" and hence k instances of the semaphores and the fence used in the code above, which I will denote by m_image_available[i], m_rendering_finished[i] and m_fence[i] for i = 0, ..., k - 1. Let i denote the current index of the frame in flight, which is increased by 1 after each invocation of render(), and j denote the number of invocations of render(), starting from j = 0.
Now, assume the swap chain contains three images.
If j = 0, then i = 0 and the first frame in flight is using swap chain image 0
In the same way, if j = a, then i = a and the ath frame in flight is using swap chain image a, for a= 2, 3
Now, if j = 3, then i = 3, but since the swap chain image only has three images, the fourth frame in flight is using swap chain image 0 again. I wonder whether this is problematic or not. I guess it's not, since the wait/signal semaphores m_image_available[3]/m_rendering_finished[3], used in the calls of vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR in this invocation of render(), are dedicated to this particular frame in flight.
If we reach j = k, then i = 0 again, since there are only k frames in flight. Now we potentially wait at the beginning of render(), if the call to vkQueuePresentKHR from the first invocation (i = 0) of render() hasn't signaled m_fence[0] yet.
So, besides my doubts described in the third bullet point above, the only question which remains is why I shouldn't take k as large as possible? What I theoretically could imagine is that if we are submitting work to the GPU in a quicker fashion than the GPU is able to consume, the used queue(s) might continually grow and eventually overflow (is there some kind of "max commands in queue" limit?).

If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
Yes, you got it right. You submit the desire to get a new image to render into via vkAcquireNextImageKHR. The presentation engine will signal the m_image_available semaphore as soon as an image to render into has become available. But you have already submitted the instruction.
Next, you submit some commands to the graphics queue via submit_info. I.e. they are also already submitted to the GPU and wait there until the m_image_available semaphore receives its signal.
Furthermore, a presentation instruction is submitted to the presentation engine that expresses the dependency that it needs to wait until the submit_info-commands have completed by waiting on the m_rendering_finished semaphore.
I.e. everything has been submitted. If nothing has been signalled yet, everything just sits there in some GPU buffers and waits for signals.
Now, if your code loops right back into the render() function and re-uses the same m_image_available and m_rendering_finished semaphores, it will only work if you are very lucky, namely if all the semaphores have already been signalled before you use them again.
The specifications says the following for vkAcquireNextImageKHR:
If semaphore is not VK_NULL_HANDLE it must not have any uncompleted signal or wait operations pending
and furthermore, it says under 7.4.2. Semaphore Waiting
the act of waiting for a binary semaphore also unsignals that semaphore.
I.e. indeed, you need to wait on the CPU until you know for sure that the previous vkAcquireNextImageKHR that uses the same m_image_available semaphore has completed.
And yes, you already got it right: You need to use a fence for that which you pass to vkQueueSubmit. If you do not synchronize on the CPU, you'll shovel ever more work to the GPU (which is a problem) and the semaphores that you are re-using might not get properly unsignalled in time (which is a problem).
What is often done is that the semaphores and fences are multiplied, e.g. to 3 each, and these sets of synchronization objects are used in sequence, so that more work can be parallelized on the GPU. The Vulkan Tutorial describes this quite nicely in its Rendering and presentation chapter. It is also explained with animation in this lecture starting at 7:59.

So first of all, as you mentioned correctly, semaphores are strictly for GPU-GPU synchronization, e.g. to make sure that one batch of commands (one submit) has finished before another one starts. This is here used to synchronize the rendering commands with the present command such that the presenting engine knows when to present the rendered image.
Fences are the main utility for CPU-GPU synchronization. You place a fence in a queue submit and then on the CPU side wait for it before you want to proceed. This is usually done here such that we do not queue any new rendering/present commands while the previous frame hasn't finished.
But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR?
Yes, you definitely need this in your code, otherwise your semaphores would not be safe and you would probably get validation errors.
In general, if you want your CPU to wait until your GPU has finished rendering of the previous frame, you would have only a single fence and a single pair of semaphores. You could also replace the fence by a waitIdle command of the queue or device.
However, in practice you do not want to stall the CPU and in the meantime record commands for the next frame. This is done via frames in flight. This simply means that for every frame in flight (i.e. number of frames that can be recorded in parallel to the execution on the GPU), you have one fence and one pair of semaphores which synchronize that particular frame.
So in essence for your render loop to work properly you need a pair of semaphores + fence per frame in flight, independent of the number of swapchain images. However, do note that the current frame index (frame in flight) and image index (swapchain) will generally not be the same except you use the same amount of swapchain images as frames in flight. This is because the presenting engine might give you swapchain images out of order depending on your presenting mode.

Related

Which way to synchronize vkQueueSubmit() to use?

I have a function that copies data from one buffer to another, I need to synchronize its execution.
I have such a bad option:
void MainWindow::copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
VkCommandBuffer commandBuffer;
vkAllocateCommandBuffers(logicalDevice, &allocInfo, &commandBuffer);
//Start recording
vkBeginCommandBuffer(commandBuffer, &beginInfo);
vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
vkEndCommandBuffer(commandBuffer);
//Run command buffer
vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
//Waiting for completion
vkQueueWaitIdle(graphicsQueue);
vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
}
This option is bad because if I want to execute the copyBuffer() function several times, then all the buffers will be copied strictly one at a time.
I want to use a fence for each function call so that multiple calls can run in parallel.
So far, I have only such a solution:
void MainWindow::copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
VkCommandBuffer commandBuffer;
vkAllocateCommandBuffers(logicalDevice, &allocInfo, &commandBuffer);
//Create fence
VkFenceCreateInfo fenceInfo{};
fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
fenceInfo.flags = VK_FENCE_CREATE_SIGNALED_BIT;
VkFence executionCompleteFence = VK_NULL_HANDLE;
if (vkCreateFence(logicalDevice, &fenceInfo, VK_NULL_HANDLE, &executionCompleteFence) != VK_SUCCESS) {
throw MakeErrorInfo("Failed to create fence");
}
//Start recording
vkBeginCommandBuffer(commandBuffer, &beginInfo);
vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
vkEndCommandBuffer(commandBuffer);
//Run command buffer
vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
vkWaitForFences(logicalDevice, 1, &executionCompleteFence, VK_TRUE, UINT64_MAX);
vkResetFences(logicalDevice, 1, &executionCompleteFence);
vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
vkDestroyFence(logicalDevice, executionCompleteFence, VK_NULL_HANDLE);
}
Which of these options is better?
Is the second option written correctly?
Both functions are bad in the same way. They both block the CPU from doing anything until the transfer is done. And they both could be used to potentially submit multiple CBs to the same queue in the same frame, but with different submit commands.
Neither is desirable if performance is something you care about.
Ultimately, what you need to do is have your copyBuffer function not actually perform the copy. You should have a function which builds a command buffer to do a copy. That CB is then stored in a place to be submitted later with other copying CBs. Or better yet, you can have just one copying CB that each command adds to (the first one called in a frame will create the CB).
At some point in the future, before you've submitted the work that will use this data, you need to submit the transfer operations. And the way this works depends on if you're submitting the transfer operations on the same queue as the work that will consume them or not.
If they're on the same queue, then all you need to do is have an event in a command buffer at the end of your batch that synchronizes the transfer operations with their receivers. If you want to be more clever, each transfer operation can have its own event, which the receiving operations will wait on.
And in same-queue transfers, you also want to make sure that you submit the transfers in the same vkQueueSubmit call as the rest of your work. Or to put it another way, you should never make more than one call to vkQueueSubmit for a particular queue in a particular frame.
If you're dealing with separate queues, then things change. A bit. If timeline semaphores aren't an option, you'll need to submit your transfer work before you submit the receiving operations. This is because the transfer batch will need to signal a semaphore that the receiving operation will wait on. And a binary semaphore cannot be waited on until the operation that signals it has been submitted to a queue.
But otherwise, everything else stays the same. Of course, you don't need events since you're synchronizing by semaphore.
The two functions are semantically identical and do exactly the same blocking behavior.
The second is slightly better. vkQueueWaitIdle is kind of a debug and out-of-hotspot feature. It might incur a hidden second submit to signal the implicit fence.
You don't need to reset fence that you subsequently destroy anyway. And you are creating it presignaled, which is a bug. Also you forgot to pass it to the vkQueueSubmit.

WSI synchronization subpass dependency and link to color attachment output

I think I do understand how Vulkan synchronization works, but I do have a problem of understanding the synchronization with the WSI.
The the Synchronization Examples, we can find this code
/* Only need a dependency coming in to ensure that the first
layout transition happens at the right time.
Second external dependency is implied by having a different
finalLayout and subpass layout. */
VkSubpassDependency dependency = {
.srcSubpass = VK_SUBPASS_EXTERNAL,
.dstSubpass = 0,
// .srcStageMask needs to be a part of pWaitDstStageMask in the WSI semaphore.
.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
.dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
.srcAccessMask = 0,
.dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
.dependencyFlags = 0};
According to me, it should be something like that :
VkSubpassDependency dependency = {
.srcSubpass = VK_SUBPASS_EXTERNAL,
.dstSubpass = 0,
// .srcStageMask needs to be a part of pWaitDstStageMask in the WSI semaphore.
.srcStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE,
.dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
.srcAccessMask = 0,
.dstAccessMask = 0,
.dependencyFlags = 0};
Indeed, since we are going to WRITE into the attachment, there is no need to use a WRITE_BIT (meaning make writes available) in the dstAccessMask.
But the true issue is in the srcStageMask.
I understand why the dstStageMask is COLOR_ATTACHMENT_OUTPUT_BIT. It is because it is okay to have the prior stages working since we don't touch the attachment.
However, for the srcStageMask, I did not see anything about the link between WSI and the COLOR_ATTACHMENT_OUTPUT_BIT. To me, the layout transition must appear at the end of the presentation, and just before the beginning of the COLOR_ATTACHMENT_OUTPUT stage.
And the end of presentation, for me, should be represented by BOTTOM_OF_PIPE and not COLOR_ATTACHMENT_OUTPUT
Where am I mistaken?
Indeed, since we are going to WRITE into the attachment, there is no need to use a WRITE_BIT (meaning make writes available) in the dstAccessMask.
I am not exactly sure of your logic here. It should be WRITE exactly because we are going to write to the attachment.
It would perhaps help describe what is happening here (from krOoze/Hello_Triangle/doc):
The VkSubpassDependency chains of the the semaphore (via pWaitDstStageMask). Then it performs its layout transition. Then it syncs with the Load Op. (Then Load Op happens in the subpass. And subpass vkDraws some stuff into the swapchain image.)
Now assumably (as is typical for first use of the swapchain image) our Load Op is VK_ATTACHMENT_LOAD_OP_CLEAR. That means VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT + VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT access.
So:
Presentation is gonna read the swapchain image, and submit a semaphore signal.
We chain our VkSubpassDependency to the signal (via pWaitDstStageMask). Semaphore signal already covers all memory accesses, therefore our srcAccessMask = 0.
The Dependency performs its implicit dependency to the Layout Transition (takes your src, and invents some internal dst), then Layout Transition happens, which reads the image and writes it back, then the dependency performs another implicit Dependency (invents some src that covers the layout transition, and uses your dst).
The Load Op happens in the subpass, and you have to make sure explicitly it happens-after everything above. So your dst in your Dependency must be VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT + VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT to cover the Load Op.
Now, there are three types of memory hazards: read→write, write→write, and write→read (and read→read is a non-hazard). In WWH something writes a resource, while some other write also writes it. It is not clear which write happens last, so the memory contents becomes garbage. In RWH and WRH read and write may happen at the same time. It is not clear if the read sees the unmodified memory, or the written one (i.e. it reads garbage).
Without an explicit dependency, the Layout Transition and the subsequent Load Op form a write→write hazard. Therefore dstAccessMask must not be 0 to resolve the hazard and make sure one write happens-before the second one.
(It is perhaps worth noting we introduce the VkSubpassDependency solely for the sake of the layout transition. Otherwise the semaphore wait would already be all that is needed. The default is srcStageMask = TOP_OF_PIPE, which means without explicit Dependency the layout transition could happen before the semaphore wait, i.e. before presentation finishes reading it; that would be a read→write hazard).
To me, the layout transition must appear at the end of the presentation, and just before the beginning of the COLOR_ATTACHMENT_OUTPUT stage. And the end of presentation, for me, should be represented by BOTTOM_OF_PIPE and not COLOR_ATTACHMENT_OUTPUT
We have little bit of choice here: pWaitDstStageMask = dependency.srcStageMask = ?
Now our situation is this:
vkBeginCommandBuffer();
[possibly vkCmdDispatch()?]
vkBeginRenderPass();
vkCmdDraw(); // does vertex shading + fragment shading, then color writes
vkEndRenderPass();
vkEndCommandBuffer();
vkQueueSubmit(.pwaitDstStageMask = ?);
If we use VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, then the semaphore wait does not block starting the hypothetical vkCmdDispatch() (VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT). And the Subpass Dependency src does not force it to finish neither. Great!
The Stage used should not be any earlier stage than VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT. E.g. with VK_PIPELINE_STAGE_ALL_COMMANDS the semaphore wait would unnecesserily block the vkCmdDispatch() as well as vertex and fragment shader of the vkCmdDraw. Meanwhile we need the swapchain image only when we actually need to write it at the Load Op (which you can imagine happens when the vkCmdDraw() gets to the point it needs to perform color writes).
Now, we could choose VK_PIPELINE_STAGE_BOTTOM_OF_PIPE. The semaphore blocks nothing (dstStage\pWaitDstStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE means the same as "nothing"). Great! But the Subpass Dependency now blocks everything (srcStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE means the same as VK_PIPELINE_STAGE_ALL_COMMANDS). That means our vkCmdDispatch has to finish, before our subpass starts. Not so great...
So, the best practice is to use pWaitDstStageMask = dependency.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT.

Multithreading Implementation in C++

I am a beginner using multithreading in C++, so I'd appreciate it if you can give me some recommendations.
I have a function which receives the previous frame and current frame from a video stream (let's call this function, readFrames()). The task of that function is to compute Motion Estimation.
The idea when calling readFrames() would be:
Store the previous and current frame in a buffer.
I want to compute the value of Motion between each pair of frames from the buffer but without blocking the function readFrames(), because more frames can be received while computing that value. I suppose I have to write a function computeMotionValue() and every time I want to execute it, create a new thread and launch it. This function should return some float motionValue.
Every time the motionValue returned by any thread is over a threshold, I want to +1 a common int variable, let's call it nValidMotion.
My problem is that I don't know how to "synchronize" the threads when accessing motionValue and nValidMotion.
Can you please explain to me in some pseudocode how can I do that?
and every time I want to execute it, create a new thread and launch it
That's usually a bad idea. Threads are usually fairly heavy-weight, and spawning one is usually slower than just passing a message to an existing thread pool.
Anyway, if you fall behind, you'll end up with more threads than processor cores and then you'll fall even further behind due to context-switching overhead and memory pressure. Eventually creating a new thread will fail.
My problem is that I don't know how to "synchronize" the threads when accessing motionValue and nValidMotion.
Synchronization of access to a shared resource is usually handled with std::mutex (mutex means "mutual exclusion", because only one thread can hold the lock at once).
If you need to wait for another thread to do something, use std::condition_variable to wait/signal. You're waiting-for/signalling a change in state of some shared resource, so you need a mutex for that as well.
The usual recommendation for this kind of processing is to have at most one thread per available core, all serving a thread pool. A thread pool has a work queue (protected by a mutex, and with the empty->non-empty transition signalled by a condvar).
For combining the results, you could have a global counter protected by a mutex (but this is relatively heavy-weight for a single integer), or you could just have each task added to added to the thread pool return a bool via the promise/future mechanism, or you could just make your counter atomic.
Here is a sample pseudo code you may use:
// Following thread awaits notification from worker threads, detecting motion
nValidMotion_woker_Thread()
{
while(true) { message_recieve(msg_q); ++nValidMotion; }
}
// Worker thread, computing motion on 2 frames; if motion detected, notify uysing message Q to nValidMotion_woker_Thread
WorkerThread(frame1 ,frame2)
{
x = computeMotionValue(frame1 ,frame2);
if x > THRESHOLD
msg_q.send();
}
// main thread
main_thread()
{
// 1. create new message Q for inter-thread communication
msg_q = new msg_q();
// start listening thread
Thread a = new nValidMotion_woker_Thread();
a.start();
while(true)
{
// collect 2 frames
frame1 = readFrames();
frame2 = readFrames();
// start workre thread
Thread b = new WorkerThread(frame1 ,frame2);
b.start();
}
}

Vulkan samples: vkQueueSubmit always followed by vkWaitForFences?

In the API-Samples that come with Vulkan, it appears that there is always a call to vkWaitForFences after a call to vkQueueSubmit, either directly or via execute_queue_command_buffer (in util_init.hpp). The call to vkWaitForFences will block the CPU execution until the GPU has finished doing all the work in the previous vkQueueSubmit. This in effect doesn't allow multiple frames to be constructed simultaneously, which (in theory) is limiting performance significantly.
Are these calls required, and if so, is there another way to not require the GPU to be idle before constructing a new frame?
The way we achieved multiple frames in flight is to have a fence for each swapchain framebuffer you have. Then still use the vkWaitForFences but wait for the ((n+1)%num_fences) fence.
There is example code here https://imgtec.com/tools/powervr-early-access-program/
uint32_t current_buffer = num_swaps_ % swapchain_fences.size();
vkQueueSubmit(graphics_queue, 1, &submit_info, swapchain_fences[current_buffer]);
// Wait for a queuesubmit to finish so we can continue rendering if we are n-2 frames behind
if(num_swaps_ > swapchain_fences.size() - 1)
{
uint32_t fence_to_wait_for = (num_swaps_ + 1) % swapchain_fences.size();
vkWaitForFences(device, 1, &swapchain_fences[fence_to_wait_for], true, UINT64_MAX);
vkResetFences(device, 1, &swapchain_fences[current_buffer]);
}

Of these 3 methods for reading linked lists from shared memory, why is the 3rd fastest?

I have a 'server' program that updates many linked lists in shared memory in response to external events. I want client programs to notice an update on any of the lists as quickly as possible (lowest latency). The server marks a linked list's node's state_ as FILLED once its data is filled in and its next pointer has been set to a valid location. Until then, its state_ is NOT_FILLED_YET. I am using memory barriers to make sure that clients don't see the state_ as FILLED before the data within is actually ready (and it seems to work, I never see corrupt data). Also, state_ is volatile to be sure the compiler doesn't lift the client's checking of it out of loops.
Keeping the server code exactly the same, I've come up with 3 different methods for the client to scan the linked lists for changes. The question is: Why is the 3rd method fastest?
Method 1: Round robin over all the linked lists (called 'channels') continuously, looking to see if any nodes have changed to 'FILLED':
void method_one()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
while(true)
{
for(std::size_t i = 0; i < channel_list.size(); ++i)
{
Data* current_item = channel_cursors[i];
ACQUIRE_MEMORY_BARRIER;
if(current_item->state_ == NOT_FILLED_YET) {
continue;
}
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[i] = static_cast<Data*>(current_item->next_.get(segment));
}
}
}
Method 1 gave very low latency when then number of channels was small. But when the number of channels grew (250K+) it became very slow because of looping over all the channels. So I tried...
Method 2: Give each linked list an ID. Keep a separate 'update list' to the side. Every time one of the linked lists is updated, push its ID on to the update list. Now we just need to monitor the single update list, and check the IDs we get from it.
void method_two()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));
while(true)
{
ACQUIRE_MEMORY_BARRIER;
if(update_cursor->state_ == NOT_FILLED_YET) {
continue;
}
::uint32_t update_id = update_cursor->list_id_;
Data* current_item = channel_cursors[update_id];
if(current_item->state_ == NOT_FILLED_YET) {
std::cerr << "This should never print." << std::endl; // it doesn't
continue;
}
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[update_id] = static_cast<Data*>(current_item->next_.get(segment));
update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
}
}
Method 2 gave TERRIBLE latency. Whereas Method 1 might give under 10us latency, Method 2 would inexplicably often given 8ms latency! Using gettimeofday it appears that the change in update_cursor->state_ was very slow to propogate from the server's view to the client's (I'm on a multicore box, so I assume the delay is due to cache). So I tried a hybrid approach...
Method 3: Keep the update list. But loop over all the channels continuously, and within each iteration check if the update list has updated. If it has, go with the number pushed onto it. If it hasn't, check the channel we've currently iterated to.
void method_three()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));
while(true)
{
for(std::size_t i = 0; i < channel_list.size(); ++i)
{
std::size_t idx = i;
ACQUIRE_MEMORY_BARRIER;
if(update_cursor->state_ != NOT_FILLED_YET) {
//std::cerr << "Found via update" << std::endl;
i--;
idx = update_cursor->list_id_;
update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
}
Data* current_item = channel_cursors[idx];
ACQUIRE_MEMORY_BARRIER;
if(current_item->state_ == NOT_FILLED_YET) {
continue;
}
found_an_update = true;
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[idx] = static_cast<Data*>(current_item->next_.get(segment));
}
}
}
The latency of this method was as good as Method 1, but scaled to large numbers of channels. The problem is, I have no clue why. Just to throw a wrench in things: if I uncomment the 'found via update' part, it prints between EVERY LATENCY LOG MESSAGE. Which means things are only ever found on the update list! So I don't understand how this method can be faster than method 2.
The full, compilable code (requires GCC and boost-1.41) that generates random strings as test data is at: http://pastebin.com/0kuzm3Uf
Update: All 3 methods are effectively spinlocking until an update occurs. The difference is in how long it takes them to notice the update has occurred. They all continuously tax the processor, so that doesn't explain the speed difference. I'm testing on a 4-core machine with nothing else running, so the server and the client have nothing to compete with. I've even made a version of the code where updates signal a condition and have clients wait on the condition -- it didn't help the latency of any of the methods.
Update2: Despite there being 3 methods, I've only tried 1 at a time, so only 1 server and 1 client are competing for the state_ member.
Hypothesis: Method 2 is somehow blocking the update from getting written by the server.
One of the things you can hammer, besides the processor cores themselves, is your coherent cache. When you read a value on a given core, the L1 cache on that core has to acquire read access to that cache line, which means it needs to invalidate the write access to that line that any other cache has. And vice versa to write a value. So this means that you're continually ping-ponging the cache line back and forth between a "write" state (on the server-core's cache) and a "read" state (in the caches of all the client cores).
The intricacies of x86 cache performance are not something I am entirely familiar with, but it seems entirely plausible (at least in theory) that what you're doing by having three different threads hammering this one memory location as hard as they can with read-access requests is approximately creating a denial-of-service attack on the server preventing it from writing to that cache line for a few milliseconds on occasion.
You may be able to do an experiment to detect this by looking at how long it takes for the server to actually write the value into the update list, and see if there's a delay there corresponding to the latency.
You might also be able to try an experiment of removing cache from the equation, by running everything on a single core so the client and server threads are pulling things out of the same L1 cache.
I don't know if you have ever read the Concurrency columns from Herb Sutter. They are quite interesting, especially when you get into the cache issues.
Indeed the Method2 seems better here because the id being smaller than the data in general would mean that you don't have to do round-trips to the main memory too often (which is taxing).
However, what can actually happen is that you have such a line of cache:
Line of cache = [ID1, ID2, ID3, ID4, ...]
^ ^
client server
Which then creates contention.
Here is Herb Sutter's article: Eliminate False Sharing. The basic idea is simply to artificially inflate your ID in the list so that it occupies one line of cache entirely.
Check out the other articles in the serie while you're at it. Perhaps you'll get some ideas. There's a nice lock-free circular buffer I think that could help for your update list :)
I've noticed in both method 1 and method 3 you have a line, ACQUIRE_MEMORY_BARRIER, which I assume has something to do with multi-threading/race conditions?
Either way, method 2 doesn't have any sleeps which means the following code...
while(true)
{
if(update_cursor->state_ == NOT_FILLED_YET) {
continue;
}
is going to hammer the processor. The typical way to do this kind of producer/consumer task is to use some kind of semaphore to signal to the reader that the update list has changed. A search for producer/consumer multi threading should give you a large number of examples. The main idea here is that this allows the thread to go to sleep while it's waiting for the update_cursor->state to change. This prevents this thread from stealing all the cpu cycles.
The answer was tricky to figure out, and to be fair would be hard with the information I presented though if anyone actually compiled the source code I provided they'd have a fighting chance ;) I said that "found via update list" was printed after every latency log message, but this wasn't actually true -- it was only true for as far as I could scrollback in my terminal. At the very beginning there were a slew of updates found without using the update list.
The issue is that between the time when I set my starting point in the update list and my starting point in each of the data lists, there is going to be some lag because these operations take time. Remember, the lists are growing the whole time this is going on. Consider the simplest case where I have 2 data lists, A and B. When I set my starting point in the update list there happen to be 60 elements in it, due to 30 updates on list A and 30 updates on list B. Say they've alternated:
A
B
A
B
A // and I start looking at the list here
B
But then after I set the update list to there, there are a slew of updates to B and no updates to A. Then I set my starting places in each of the data lists. My starting points for the data lists are going to be after that surge of updates, but my starting point in the update list is before that surge, so now I'm going to check for a bunch of updates without finding them. The mixed approach above works best because by iterating over all the elements when it can't find an update, it quickly closes the temporal gap between where the update list is and where the data lists are.