React uses a single thread when processing a channel

React uses a single thread when processing a channel - concurrency

My intention is to have different threads reading simultaneously from a single channel, and processing things asynchronously. I thought this would do the trick:
my Channel $KXGA .= new;
for ^100 {
$KXGA.send( (100000..200000).pick );
}
my $sums = start react whenever $KXGA -> $number {
say "In thread ", $*THREAD.id;
say "→ ", (^$number).sum;
}
(this hangs up because I'm not closing the channel, but please don't look at that). This outputs, whatever I do:
In thread 4
→ 6995966328
In thread 4
→ 12323793510
In thread 4
→ 5473561506
So it's always using a single thread and processing things sequentially, and not in parallel. Is there some way of doing that? Even if I start the thread within the whenever block, the result will be exactly the same...

The react/whenever construct is intended for processing one message at a time and in full. The point is that state held within the react block is safe thanks to this.
It's possible to have multiple workers reading from a Channel; they'll just need setting up as follows:
my #sums;
for ^4 {
push #sums, start react whenever $KXGA -> $number {
say "In thread ", $*THREAD.id;
say "→ ", (^$number).sum;
}
}
This react approach, when used with use v6.d.PREVIEW, has the advantage that when there's less work, it won't actually occupy 4 threads at all, and they can get on with other pooled work. If instead the whole application is just processing stuff from the Channel and that's all, you'll have less overhead and better locality with just:
my #sums;
for ^4 {
push #sums, start for $KXGA.list -> $number {
say "In thread ", $*THREAD.id;
say "→ ", (^$number).sum;
}
}
Unlike with react there's no opportunity to react to different data sources in this approach (the reason Channels work with react is primarily so you can consume them and simultaneously work with async data sources too, or perhaps deal with multiple channels at once). But if you don't need to, then the for way is a tad simpler in the code and is almost certainly faster (and, as with react, the loops compete over the items in the Channel and terminate gracefully when it is closed).

Related

Hard Realtime C++ for Robot Control

I am trying to control a robot using a template-based controller class written in c++. Essentially I have a UDP connection setup with the robot to receive the state of the robot and send new torque commands to the robot. I receive new observations at a higher frequency (say 2000Hz) and my controller takes about 1ms (1000Hz) to calculate new torque commands to send to the robot. The problem I am facing is that I don't want my main code to wait to send the old torque commands while my controller is still calculating new commands to send. From what I understand I can use Ubuntu with RT-Linux kernel, multi-thread the code so that my getTorques() method runs in a different thread, set priorities for the process, and use mutexes and locks to avoid data race between the 2 threads, but I was hoping to learn what the best strategies to write hard-realtime code for such a problem are.
// main.cpp
#include "CONTROLLER.h"
#include "llapi.h"
void main{
...
CONTROLLERclass obj;
...
double new_observation;
double u;
...
while(communicating){
get_newObs(new_observation); // Get new state of the robot (2000Hz)
obj.getTorques(new_observation, u); // Takes about 1ms to calculate new torques
send_newCommands(u); // Send the new torque commands to the robot
}
...
}
Thanks in advance!

Okay, so first of all, it sounds to me like you need to deal with the fact that you receive input at 2 KHz, but can only compute results at about 1 KHz.
Based on that, you're apparently going to have to discard roughly half the inputs, or else somehow (in a way that makes sense for your application) quickly combine the inputs that have arrived since the last time you processed the inputs.
But as the code is structured right now, you're going to fetch and process older and older inputs, so even though you're producing outputs at ~1 KHz, those outputs are constantly being based on older and older data.
For the moment, let's assume you want to receive inputs as fast as you can, and when you're ready to do so, you process the most recent input you've received, produce an output based on that input, and repeat.
In that case, you'd probably end up with something on this general order (using C++ threads and atomics for the moment):
std::atomic<double> new_observation;
std::thread receiver = [&] {
double d;
get_newObs(d);
new_observation = d;
};
std::thread sender = [&] {
auto input = new_observation;
auto u = get_torques(input);
send_newCommands(u);
};
I've assumed that you'll always receive input faster than you can consume it, so the processing thread can always process whatever input is waiting, without receiving anything to indicate that the input has been updated since it was last processed. If that's wrong, things get a little more complex, but I'm not going to try to deal with that right now, since it sounds like it's unnecessary.
As far as the code itself goes, the only thing that may not be obvious is that instead of passing a reference to new_input to either of the existing functions, I've read new_input into variable local to the thread, then passed a reference to that.

Do I need dedicated fences/semaphores per swap chain image, per frame or per command pool in Vulkan?

I've read several articles on the CPU-GPU (using fences) and GPU-GPU (using semaphores) synchronization mechanisms, but still got trouble to understand how I should implement a simple render-loop.
Please take a look at the simple render() function below. If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
However, is this really safe? All operations are asynchronous. So, is it really safe to "reuse" the image_available semaphore in a subsequent call of render() again even though the signal request from the previous call hasn't fired yet? I would think it's not, but, on the other hand, we're using the same queues (don't know if it matters where the graphics and presentation queue are actually the same) and operations inside a queue should be sequentially consumed ... But if I got it right, they might not be consumed "as a whole" and could be reordered ...
The second thing is that (again, unless I'm missing something) I clearly should use one fence per swap chain image to ensure that the operation on the image corresponding to the image_index of the call to render() has finished. But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR? And do I then need dedicated image_available and rendering_finished semaphores per swap chain image? Or maybe per frame? Or maybe per command buffer/pool? I'm really confused ...
void render()
{
std::uint32_t image_index;
switch (vkAcquireNextImageKHR(device(), swap_chain().handle(),
std::numeric_limits<std::uint64_t>::max(), m_image_available, VK_NULL_HANDLE, &image_index))
{
case VK_SUBOPTIMAL_KHR:
case VK_SUCCESS:
break;
case VK_ERROR_OUT_OF_DATE_KHR:
on_resized();
return;
default:
throw std::runtime_error("vkAcquireNextImageKHR");
}
static VkPipelineStageFlags constexpr wait_destination_stage_mask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit_info{};
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit_info.waitSemaphoreCount = 1;
submit_info.pWaitSemaphores = &m_image_available;
submit_info.signalSemaphoreCount = 1;
submit_info.pSignalSemaphores = &m_rendering_finished;
submit_info.pWaitDstStageMask = &wait_destination_stage_mask;
if (vkQueueSubmit(graphics_queue().handle, 1, &submit_info, VK_NULL_HANDLE) != VK_SUCCESS)
throw std::runtime_error("vkQueueSubmit");
VkPresentInfoKHR present_info{};
present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
present_info.waitSemaphoreCount = 1;
present_info.pWaitSemaphores = &m_rendering_finished;
present_info.swapchainCount = 1;
present_info.pSwapchains = &swap_chain().handle();
present_info.pImageIndices = &image_index;
switch (vkQueuePresentKHR(presentation_queue().handle, &present_info))
{
case VK_SUCCESS:
break;
case VK_ERROR_OUT_OF_DATE_KHR:
case VK_SUBOPTIMAL_KHR:
on_resized();
return;
default:
throw std::runtime_error("vkQueuePresentKHR");
}
}
EDIT: As suggested in the answers below, assume we have k "frames in flight" and hence k instances of the semaphores and the fence used in the code above, which I will denote by m_image_available[i], m_rendering_finished[i] and m_fence[i] for i = 0, ..., k - 1. Let i denote the current index of the frame in flight, which is increased by 1 after each invocation of render(), and j denote the number of invocations of render(), starting from j = 0.
Now, assume the swap chain contains three images.
If j = 0, then i = 0 and the first frame in flight is using swap chain image 0
In the same way, if j = a, then i = a and the ath frame in flight is using swap chain image a, for a= 2, 3
Now, if j = 3, then i = 3, but since the swap chain image only has three images, the fourth frame in flight is using swap chain image 0 again. I wonder whether this is problematic or not. I guess it's not, since the wait/signal semaphores m_image_available[3]/m_rendering_finished[3], used in the calls of vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR in this invocation of render(), are dedicated to this particular frame in flight.
If we reach j = k, then i = 0 again, since there are only k frames in flight. Now we potentially wait at the beginning of render(), if the call to vkQueuePresentKHR from the first invocation (i = 0) of render() hasn't signaled m_fence[0] yet.
So, besides my doubts described in the third bullet point above, the only question which remains is why I shouldn't take k as large as possible? What I theoretically could imagine is that if we are submitting work to the GPU in a quicker fashion than the GPU is able to consume, the used queue(s) might continually grow and eventually overflow (is there some kind of "max commands in queue" limit?).

If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
Yes, you got it right. You submit the desire to get a new image to render into via vkAcquireNextImageKHR. The presentation engine will signal the m_image_available semaphore as soon as an image to render into has become available. But you have already submitted the instruction.
Next, you submit some commands to the graphics queue via submit_info. I.e. they are also already submitted to the GPU and wait there until the m_image_available semaphore receives its signal.
Furthermore, a presentation instruction is submitted to the presentation engine that expresses the dependency that it needs to wait until the submit_info-commands have completed by waiting on the m_rendering_finished semaphore.
I.e. everything has been submitted. If nothing has been signalled yet, everything just sits there in some GPU buffers and waits for signals.
Now, if your code loops right back into the render() function and re-uses the same m_image_available and m_rendering_finished semaphores, it will only work if you are very lucky, namely if all the semaphores have already been signalled before you use them again.
The specifications says the following for vkAcquireNextImageKHR:
If semaphore is not VK_NULL_HANDLE it must not have any uncompleted signal or wait operations pending
and furthermore, it says under 7.4.2. Semaphore Waiting
the act of waiting for a binary semaphore also unsignals that semaphore.
I.e. indeed, you need to wait on the CPU until you know for sure that the previous vkAcquireNextImageKHR that uses the same m_image_available semaphore has completed.
And yes, you already got it right: You need to use a fence for that which you pass to vkQueueSubmit. If you do not synchronize on the CPU, you'll shovel ever more work to the GPU (which is a problem) and the semaphores that you are re-using might not get properly unsignalled in time (which is a problem).
What is often done is that the semaphores and fences are multiplied, e.g. to 3 each, and these sets of synchronization objects are used in sequence, so that more work can be parallelized on the GPU. The Vulkan Tutorial describes this quite nicely in its Rendering and presentation chapter. It is also explained with animation in this lecture starting at 7:59.

So first of all, as you mentioned correctly, semaphores are strictly for GPU-GPU synchronization, e.g. to make sure that one batch of commands (one submit) has finished before another one starts. This is here used to synchronize the rendering commands with the present command such that the presenting engine knows when to present the rendered image.
Fences are the main utility for CPU-GPU synchronization. You place a fence in a queue submit and then on the CPU side wait for it before you want to proceed. This is usually done here such that we do not queue any new rendering/present commands while the previous frame hasn't finished.
But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR?
Yes, you definitely need this in your code, otherwise your semaphores would not be safe and you would probably get validation errors.
In general, if you want your CPU to wait until your GPU has finished rendering of the previous frame, you would have only a single fence and a single pair of semaphores. You could also replace the fence by a waitIdle command of the queue or device.
However, in practice you do not want to stall the CPU and in the meantime record commands for the next frame. This is done via frames in flight. This simply means that for every frame in flight (i.e. number of frames that can be recorded in parallel to the execution on the GPU), you have one fence and one pair of semaphores which synchronize that particular frame.
So in essence for your render loop to work properly you need a pair of semaphores + fence per frame in flight, independent of the number of swapchain images. However, do note that the current frame index (frame in flight) and image index (swapchain) will generally not be the same except you use the same amount of swapchain images as frames in flight. This is because the presenting engine might give you swapchain images out of order depending on your presenting mode.

ZeroMQ: how to reduce multithread-communication latency with inproc?

I'm using inproc and PAIR to achieve inter-thread communication and trying to solve a latency problem due to polling. Correct me if I'm wrong: Polling is inevitable, because a plain recv() call will usually block and cannot take a specific timeout.
In my current case, among N threads, each of the N-1 worker threads has a main while-loop. The N-th thread is a controller thread which will notify all the worker threads to quit at any time. However, worker threads have to use polling with a timeout to get that quit message. This introduces a latency, the latency parameter is usually 1000ms.
Here is an example
while (true) {
const std::chrono::milliseconds nTimeoutMs(1000);
std::vector<zmq::poller_event<std::size_t>> events(n);
size_t nEvents = m_poller.wait_all(events, nTimeoutMs);
bool isToQuit = false;
for (auto& evt : events) {
zmq::message_t out_recved;
try {
evt.socket.recv(out_recved, zmq::recv_flags::dontwait);
}
catch (std::exception& e) {
trace("{}: Caught exception while polling: {}. Skipped.", GetLogTitle(), e.what());
continue;
}
if (!out_recved.empty()) {
if (IsToQuit(out_recved))
isToQuit = true;
break;
}
}
if (isToQuit)
break;
//
// main business
//
...
}
To make things worse, when the main loop has nested loops, the worker threads then need to include more polling code in each layer of the nested loops. Very ugly.
The reason why I chose ZMQ for multithread communication is because of its elegance and the potential of getting rid of thread-locking. But I never realized the polling overhead.
Am I able to achieve the typical latency when using a regular mutex or an std::atomic data operation? Should I understand that the inproc is in fact a network communication pattern in disguise so that some latency is inevitable?

An above posted statement ( a hypothesis ):
"...a plain recv() call will usually block and cannot take a specific timeout."
is not correct:
a plain .recv( ZMQ_NOBLOCK )-call will never "block",
a plain .recv( ZMQ_NOBLOCK )-call can get decorated so as to mimick "a specific timeout"
An above posted statement ( a hypothesis ):
"...have to use polling with a timeout ... introduces a latency, the latency parameter is usually 1000ms."
is not correct:
- one need not use polling with a timeout
- the less one need not set 1000 ms code-"injected"-latency, spent obviously only on-no-new-message state
Q : "Am I able to achieve the typical latency when using a regular mutex or an std::atomic data operation?"
Yes.
Q : "Should I understand that the inproc is in fact a network communication pattern in disguise so that some latency is inevitable?"
No. inproc-transport-class is the fastest of all these kinds as it is principally protocol-less / stack-less and has more to do with ultimately fast pointer-mechanics, like in a dual-end ring-buffer pointer-management.
The Best Next Step:
1 )Re-factor your code, so as to always harness but the zero-wait { .poll() | .recv() }-methods, properly decorated for both { event- | no-event- }-specific looping.
2 )
If then willing to shave the last few [us] from the smart-loop-detection turn-around-time, may focus on improved Context()-instance setting it to work with larger amount of nIOthreads > N "under the hood".
optionally 3 )
For almost hard-Real-Time systems' design one may finally harness a deterministically driven Context()-threads' and socket-specific mapping of these execution-vehicles onto specific, non-overlapped CPU-cores ( using a carefully-crafted affinity-map )
Having set 1000 [ms] in code, no one is fair to complain about spending those very 1000 [ms] waiting in a timeout, coded by herself / himself. No excuse for doing this.
Do not blame ZeroMQ for behaviour, that was coded from the application side of the API.
Never.

Multi threading with third party library in C++

I'm using two functions from a third party library currently in my application. The first function, aka .SourceMeasure basically collecting data from some hardware while the second function, aka .ComputeErrors is purely calculation based on the data collected from the first function. And the measure-calculate executions will be looped for 5 times.
I'm thinking create a multi thread to move the .ComputeErrors to the worker thread to save some times.
Will there be a issue if the.SourceMeasure is in main thread and the .ComputeErrors in the worker thread and both of them coming from the same library?
//The execution is something like this..
for (int i=0; z < 5; z++)
{
Lib.SourceMeasure (data)
Lib.ComputeErorrs (data) //Want to put this in a separate thread
}

I don't know, which library you're using, but it's almost sure that you can not start Lib.ComputeErorrs() until Lib.SourceMeasure() is still running, on the same data set.
What you can do is to set up a queue and two threads:
The "measure thread":
create a data item
call Lib.SourceMeasure() on it
push data to a FIFO queue
The "compute thread":
hang on some if the queue is empty
pick a data item from the queue
call Lib.ComputeErorrs() with the data
On result, the measurement and the computation will run parallel (but not the same items, measurement will be some ahead). All you need to find is a thread-safe queue.

Of these 3 methods for reading linked lists from shared memory, why is the 3rd fastest?

I have a 'server' program that updates many linked lists in shared memory in response to external events. I want client programs to notice an update on any of the lists as quickly as possible (lowest latency). The server marks a linked list's node's state_ as FILLED once its data is filled in and its next pointer has been set to a valid location. Until then, its state_ is NOT_FILLED_YET. I am using memory barriers to make sure that clients don't see the state_ as FILLED before the data within is actually ready (and it seems to work, I never see corrupt data). Also, state_ is volatile to be sure the compiler doesn't lift the client's checking of it out of loops.
Keeping the server code exactly the same, I've come up with 3 different methods for the client to scan the linked lists for changes. The question is: Why is the 3rd method fastest?
Method 1: Round robin over all the linked lists (called 'channels') continuously, looking to see if any nodes have changed to 'FILLED':
void method_one()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
while(true)
{
for(std::size_t i = 0; i < channel_list.size(); ++i)
{
Data* current_item = channel_cursors[i];
ACQUIRE_MEMORY_BARRIER;
if(current_item->state_ == NOT_FILLED_YET) {
continue;
}
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[i] = static_cast<Data*>(current_item->next_.get(segment));
}
}
}
Method 1 gave very low latency when then number of channels was small. But when the number of channels grew (250K+) it became very slow because of looping over all the channels. So I tried...
Method 2: Give each linked list an ID. Keep a separate 'update list' to the side. Every time one of the linked lists is updated, push its ID on to the update list. Now we just need to monitor the single update list, and check the IDs we get from it.
void method_two()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));
while(true)
{
ACQUIRE_MEMORY_BARRIER;
if(update_cursor->state_ == NOT_FILLED_YET) {
continue;
}
::uint32_t update_id = update_cursor->list_id_;
Data* current_item = channel_cursors[update_id];
if(current_item->state_ == NOT_FILLED_YET) {
std::cerr << "This should never print." << std::endl; // it doesn't
continue;
}
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[update_id] = static_cast<Data*>(current_item->next_.get(segment));
update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
}
}
Method 2 gave TERRIBLE latency. Whereas Method 1 might give under 10us latency, Method 2 would inexplicably often given 8ms latency! Using gettimeofday it appears that the change in update_cursor->state_ was very slow to propogate from the server's view to the client's (I'm on a multicore box, so I assume the delay is due to cache). So I tried a hybrid approach...
Method 3: Keep the update list. But loop over all the channels continuously, and within each iteration check if the update list has updated. If it has, go with the number pushed onto it. If it hasn't, check the channel we've currently iterated to.
void method_three()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));
while(true)
{
for(std::size_t i = 0; i < channel_list.size(); ++i)
{
std::size_t idx = i;
ACQUIRE_MEMORY_BARRIER;
if(update_cursor->state_ != NOT_FILLED_YET) {
//std::cerr << "Found via update" << std::endl;
i--;
idx = update_cursor->list_id_;
update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
}
Data* current_item = channel_cursors[idx];
ACQUIRE_MEMORY_BARRIER;
if(current_item->state_ == NOT_FILLED_YET) {
continue;
}
found_an_update = true;
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[idx] = static_cast<Data*>(current_item->next_.get(segment));
}
}
}
The latency of this method was as good as Method 1, but scaled to large numbers of channels. The problem is, I have no clue why. Just to throw a wrench in things: if I uncomment the 'found via update' part, it prints between EVERY LATENCY LOG MESSAGE. Which means things are only ever found on the update list! So I don't understand how this method can be faster than method 2.
The full, compilable code (requires GCC and boost-1.41) that generates random strings as test data is at: http://pastebin.com/0kuzm3Uf
Update: All 3 methods are effectively spinlocking until an update occurs. The difference is in how long it takes them to notice the update has occurred. They all continuously tax the processor, so that doesn't explain the speed difference. I'm testing on a 4-core machine with nothing else running, so the server and the client have nothing to compete with. I've even made a version of the code where updates signal a condition and have clients wait on the condition -- it didn't help the latency of any of the methods.
Update2: Despite there being 3 methods, I've only tried 1 at a time, so only 1 server and 1 client are competing for the state_ member.

Hypothesis: Method 2 is somehow blocking the update from getting written by the server.
One of the things you can hammer, besides the processor cores themselves, is your coherent cache. When you read a value on a given core, the L1 cache on that core has to acquire read access to that cache line, which means it needs to invalidate the write access to that line that any other cache has. And vice versa to write a value. So this means that you're continually ping-ponging the cache line back and forth between a "write" state (on the server-core's cache) and a "read" state (in the caches of all the client cores).
The intricacies of x86 cache performance are not something I am entirely familiar with, but it seems entirely plausible (at least in theory) that what you're doing by having three different threads hammering this one memory location as hard as they can with read-access requests is approximately creating a denial-of-service attack on the server preventing it from writing to that cache line for a few milliseconds on occasion.
You may be able to do an experiment to detect this by looking at how long it takes for the server to actually write the value into the update list, and see if there's a delay there corresponding to the latency.
You might also be able to try an experiment of removing cache from the equation, by running everything on a single core so the client and server threads are pulling things out of the same L1 cache.

I don't know if you have ever read the Concurrency columns from Herb Sutter. They are quite interesting, especially when you get into the cache issues.
Indeed the Method2 seems better here because the id being smaller than the data in general would mean that you don't have to do round-trips to the main memory too often (which is taxing).
However, what can actually happen is that you have such a line of cache:
Line of cache = [ID1, ID2, ID3, ID4, ...]
^ ^
client server
Which then creates contention.
Here is Herb Sutter's article: Eliminate False Sharing. The basic idea is simply to artificially inflate your ID in the list so that it occupies one line of cache entirely.
Check out the other articles in the serie while you're at it. Perhaps you'll get some ideas. There's a nice lock-free circular buffer I think that could help for your update list :)

I've noticed in both method 1 and method 3 you have a line, ACQUIRE_MEMORY_BARRIER, which I assume has something to do with multi-threading/race conditions?
Either way, method 2 doesn't have any sleeps which means the following code...
while(true)
{
if(update_cursor->state_ == NOT_FILLED_YET) {
continue;
}
is going to hammer the processor. The typical way to do this kind of producer/consumer task is to use some kind of semaphore to signal to the reader that the update list has changed. A search for producer/consumer multi threading should give you a large number of examples. The main idea here is that this allows the thread to go to sleep while it's waiting for the update_cursor->state to change. This prevents this thread from stealing all the cpu cycles.

The answer was tricky to figure out, and to be fair would be hard with the information I presented though if anyone actually compiled the source code I provided they'd have a fighting chance ;) I said that "found via update list" was printed after every latency log message, but this wasn't actually true -- it was only true for as far as I could scrollback in my terminal. At the very beginning there were a slew of updates found without using the update list.
The issue is that between the time when I set my starting point in the update list and my starting point in each of the data lists, there is going to be some lag because these operations take time. Remember, the lists are growing the whole time this is going on. Consider the simplest case where I have 2 data lists, A and B. When I set my starting point in the update list there happen to be 60 elements in it, due to 30 updates on list A and 30 updates on list B. Say they've alternated:
A
B
A
B
A // and I start looking at the list here
B
But then after I set the update list to there, there are a slew of updates to B and no updates to A. Then I set my starting places in each of the data lists. My starting points for the data lists are going to be after that surge of updates, but my starting point in the update list is before that surge, so now I'm going to check for a bunch of updates without finding them. The mixed approach above works best because by iterating over all the elements when it can't find an update, it quickly closes the temporal gap between where the update list is and where the data lists are.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js