avoiding collisions when collapsing infinity lock-free buffer to circular-buffer

avoiding collisions when collapsing infinity lock-free buffer to circular-buffer - c++

I'm solving two feeds arbitrate problem of FAST protocol.
Please don't worry if you not familar with it, my question is pretty general actually. But i'm adding problem description for those who interested (you can skip it).
Data in all UDP Feeds are disseminated in two identical feeds (A and B) on two different multicast IPs. It is strongly recommended that client receive and process both feeds because of possible UDP packet loss. Processing two identical feeds allows one to statistically decrease the probability of packet loss.
It is not specified in what particular feed (A or B) the message appears for the first time. To arbitrate these feeds one should use the message sequence number found in Preamble or in tag 34-MsgSeqNum. Utilization of the Preamble allows one to determine message sequence number without decoding of FAST message.
Processing messages from feeds A and B should be performed using the following algorithm:
Listen feeds A and B
Process messages according to their sequence numbers.
Ignore a message if one with the same sequence number was already processed before.
If the gap in sequence number appears, this indicates packet loss in both feeds (A and B). Client should initiate one of the Recovery process. But first of all client should wait a reasonable time, perhaps the lost packet will come a bit later due to packet reordering. UDP protocol can’t guarantee the delivery of packets in a sequence.
// tcp recover algorithm further
I wrote such very simple class. It preallocates all required classes and then first thread that receive particular seqNum can process it. Another thread will drop it later:
class MsgQueue
{
public:
MsgQueue();
~MsgQueue(void);
bool Lock(uint32_t msgSeqNum);
Msg& Get(uint32_t msgSeqNum);
void Commit(uint32_t msgSeqNum);
private:
void Process();
static const int QUEUE_LENGTH = 1000000;
// 0 - available for use; 1 - processing; 2 - ready
std::atomic<uint16_t> status[QUEUE_LENGTH];
Msg updates[QUEUE_LENGTH];
};
Implementation:
MsgQueue::MsgQueue()
{
memset(status, 0, sizeof(status));
}
MsgQueue::~MsgQueue(void)
{
}
// For the same msgSeqNum should return true to only one thread
bool MsgQueue::Lock(uint32_t msgSeqNum)
{
uint16_t expected = 0;
return status[msgSeqNum].compare_exchange_strong(expected, 1);
}
void MsgQueue::Commit(uint32_t msgSeqNum)
{
status[msgSeqNum] = 2;
Process();
}
// this method probably should be combined with "Lock" but please ignore! :)
Msg& MsgQueue::Get(uint32_t msgSeqNum)
{
return updates[msgSeqNum];
}
void MsgQueue::Process()
{
// ready packets must be processed,
}
Usage:
if (!msgQueue.Lock(seq)) {
return;
}
Msg msg = msgQueue.Get(seq);
msg.Ticker = "HP"
msg.Bid = 100;
msg.Offer = 101;
msgQueue.Commit(seq);
This works fine if we assume that QUEUE_LENGTH is infinity. Because in this case one msgSeqNum = one updates array item.
But I have to make buffer circular because it is not possible to store entire history (many millions of packets) and there are no reason to do so. Actually I need to buffer enough packets to reconstruct the session, and once session is reconstructed i can drop them.
But having circular buffer significantly complicates algorithm. For example assume that we have circular buffer of length 1000. And at the same time we try to process seqNum = 10 000 and seqNum = 11 000 (this is VERY unlikely but still possible). Both these packets will map to the array updates at index 0 and so collision occur. In such case buffer should 'drop' old packets and process new packets.
It's trivial to implement what I want using locks but writing lock-free code on circular-buffer that used from different threads is really complicated. So I welcome any suggestions and advice how to do that. Thanks!

I don't believe you can use a ring buffer. A hashed index can be used in the status[] array. Ie, hash = seq % 1000. The issue is that the sequence number is dictated by the network and you have no control over it's ordering. You wish to lock based on this sequence number. Your array doesn't need to be infinite, just the range of the sequence number; but that is probably larger than practical.
I am not sure what is happening when the sequence number is locked. Does this mean another thread is processing it? If so, you must maintain a sub-list for hash collisions to resolve the particular sequence number.
You may also consider an array size as a power of 2. For example, 1024 will allow hash = seq & 1023; which should be quite efficient.

Related

Explicit throughput limiting on part of an akka stream

I have a flow in our system which reads some elements from SQS (using alpakka) and does some preporcessing (~ 10 stages, normally < 1 minute in total). Then, the prepared element is sent to the main processing (single stage, taking a few minutes). The whole thing runs on AWS/K8S and we’d like to scale out when the SQS queue grows above a certain threshold. The issue is, the SQS queue takes a long time to blow up, since there are a lot of elements “idling” in-process, having done their preprocessing but waiting for the main thing.
We can’t externalize the preprocessing stuff to a separate queue since their outcome can’t survive a de/serialization roundtrip. Also, this service and the “main” processor are deeply coupled (this service runs as main’s sidecar) and can’t be scaled independently.
The preprocessing stages are technically .mapAsyncUnordered, but the whole thing is already very slim (stream stages and SQS batches/buffers).
We tried lowering the interstage buffer (akka.stream.materializer.max-input-buffer-size), but that only gives some indirect benefit, no direct control (and is too internal to be mucking with, for my taste anyway).
I tried implementing a “gate” wrapper which would limit the amount of elements allowed inside some arbitrary Flow, looking something like:
class LimitingGate[T, U](originalFlow: Flow[T, U], maxInFlight: Int) {
private def in: InputGate[T] = ???
private def out: OutputGate[U] = ???
def gatedFlow: Flow[T, U, NotUsed] = Flow[T].via(in).via(originalFlow).via(out)
}
And using callbacks between the in/out gates for throttling.
The implementation partially works (stream termination is giving me a hard time), but it feels like the wrong way to go about achieving the actual goal.
Any ideas / comments / enlightening questions are appreciated
Thanks!

Try something along these lines (I'm only compiling it in my head):
def inflightLimit[A, B, M](n: Int, source: Source[T, M])(businessFlow: Flow[T, B, _])(implicit materializer: Materializer): Source[B, M] = {
require(n > 0) // alternatively, could just result in a Source.empty...
val actorSource = Source.actorRef[Unit](
completionMatcher = PartialFunction.empty,
failureMatcher = PartialFunction.empty,
bufferSize = 2 * n,
overflowStrategy = OverflowStrategy.dropHead // shouldn't matter, but if the buffer fills, the effective limit will be reduced
)
val (flowControl, unitSource) = actorSource.preMaterialize()
source.statefulMapConcat { () =>
var firstElem: Boolean = true
{ a =>
if (firstElem) {
(0 until n).foreach(_ => flowControl.tell(())) // prime the pump on stream materialization
firstElem = false
}
List(a)
}}
.zip(unitSource)
.map(_._1)
.via(businessFlow)
.wireTap { _ => flowControl.tell(()) } // wireTap is Akka Streams 2.6, but can be easily replaced by a map stage which sends () to flowControl and passes through the input
}
Basically:
actorSource will emit a Unit ((), i.e. meaningless) element for every () it receives
statefulMapConcat will cause n messages to be sent to the actorSource only when the stream first starts (thus allowing n elements from the source through)
zip will pass on a pair of the input from source and a () only when actorSource and source both have an element available
for every element which exits businessFlow, a message will be sent to the actorSource, which will allow another element from the source through
Some things to note:
this will not in any way limit buffering within source
businessFlow cannot drop elements: after n elements are dropped the stream will no longer process elements but won't fail; if dropping elements is required, you may be able to inline businessFlow and have the stages which drop elements send a message to flowControl when they drop an element; there are other things to address this which you can do as well

How to tell if SSL_read has received and processed all the records from single message

Following is the dilemma,
SSL_read, on success returns number of bytes read, SSL_pending is used to tell if the processed record has more that to be read, that means probably buffer provided is not sufficient to contain the record.
SSL_read may return n > 0, but what if this happens when first records has been processed and message effectively is multi record communication.
Question: I am using epoll to send/receive messages, which means I have to queue up event in case I expect more data. What check will ensure that all the records have been read from single message and it's time to remove this event and queue up an response event that will write the response back to client?
PS: This code hasn't been tested so it may be incorrect. Purpose of the code is to share the idea that I am trying to implement.
Following is code snippet for the read -
//read whatever is available.
while (1)
{
auto n = SSL_read(ssl_, ptr_ + tail_, sz_ - tail_);
if (n <= 0)
{
int ssle = SSL_get_error(ch->ssl_, rd);
auto old_ev = evt_.events;
if (ssle == SSL_ERROR_WANT_READ)
{
//need more data to process, wait for epoll notification again
evt_.events = EPOLLIN | EPOLLERR;
}
else if (err == SSL_ERROR_WANT_WRITE)
{
evt_.events = EPOLLOUT | EPOLLERR;
}
else
{
/* connection closed by peer, or
some irrecoverable error */
done_ = true;
tail_ = 0; //invalidate the data
break;
}
if (old_ev != evt_.events)
if (epoll_ctl(epoll_fd_, EPOLL_CTL_MOD, socket_fd_, &evt_) < 0)
{
perror("handshake failed at EPOLL_CTL_MOD");
SSL_free(ssl_);
ssl_ = nullptr;
return false;
}
}
else //some data has been read
{
tail_ = n;
if (SSL_pending(ssl_) > 0)
//buffer wasn't enough to hold the content. resize and reread
resize();
else
break;
}
}
```
enter code here

SSL_read() returns the number of decrypted bytes returned in the caller's buffer, not the number of bytes received on the connection. This mimics the return value of recv() and read().
SSL_pending() returns the number of decrypted bytes that are still in the SSL's buffer and haven't been read by the caller yet. This would be equivalent to calling ioctl(FIONREAD) on a socket.
There is no way to know how many SSL/TLS records constitute an "application message", that is for the decrypted protocol data to dictate. The protocol needs to specify where a message ends and a new message begins. For instance, by including the message length in the message data. Or delimiting messages with terminators.
Either way, the SSL/TLS layer has no concept of "messages", only an arbitrary stream of bytes that it encrypts and decrypts as needed, and transmits in "records" of its choosing. Similar to how TCP breaks up a stream of arbitrary bytes into IP frames, etc.
So, while your loop is reading arbitrary bytes from OpenSSL, it needs to process those bytes to detect separations between protocol messages, so it can then act accordingly per message.

What check will ensure that all the records have been read from single message and it's time to remove this event and queue up an response event that will write the response back to client?
I'd have hoped that your message has a header with the number of records in it. Otherwise the protocol you've got is probably unparseable.
What you'd need is to have a stateful parser that consumes all the available bytes and outputs records once they are complete. Such a parser needs to suspend its state once it reaches the last byte of decrypted input, and then must be called again when more data is available to be read. But in all cases if you can't predict ahead of time how much data is expected, you won't be able to tell when the message is finished - that is unless you're using a self-synchronizing protocol. Something like ATM headers would be a starting point. But such complication is unnecessary when all you need is just to properly delimit your data so that the packet parser can know exactly whether it's got all it needs or not.
That's the problem with sending messages: it's very easy to send stuff that can't be decoded by the receiver, since the sender is perfectly fine with losing data - it just doesn't care. But the receiver will certainly need to know how many bytes or records are expected - somehow. It can be told this a-priori by sending headers that include byte counts or fixed-size record counts (it's the same size information just in different units), or a posteriori by using unique record delimiters. For example, when sending printable text split into lines, such delimiters can be Unicode paragraph separators (U+2029).
It's very important to ensure that the record delimiters can't occur within the record data itself. Thus you need some sort of a "stuffing" mechanism, where if a delimiter sequence appears in the payload, you can alter it so that it's not a valid delimiter anymore. You also need an "unstuffing" mechanism so that such altered delimiter sequences can be detected and converted back to their original form, of course without being interpreted as a delimiter. A very simple example of such delimiting process is the octet-stuffed framing in the PPP protocol. It is a form of HDLC framing. The record separator is 0x7E. Whenever this byte is detected in the payload, it is escaped - replaced by a 0x7D 0x5E sequence. On the receiving end, the 0x7D is interpreted to mean "the following character has been XOR'd with 0x20". Thus, the receiver converts 0x7D 0x5E to 0x5E first (it removes the escape byte), and then XORs it with 0x20, yielding the original 0x7E. Such framing is easy to implement but potentially has more overhead than framing with a longer delimiter sequence, or even a dynamic delimiter sequence whose form differs for each position within the stream. This could be used to prevent denial-of-service attacks, when the attacker may maliciously provide a payload that will incur a large escaping overhead. The dynamic delimiter sequence - especially if unpredictable, e.g. by negotiating a new sequence for every connection - prevents such service degradation.

Socket Commuication with High frequency

I need to send data to another process every 0.02s.
The Server code:
//set socket, bind, listen
while(1){
sleep(0.02);
echo(newsockfd);
}
void echo (int sock)
{
int n;
char buffer[256]="abc";
n=send(sock,buffer,strlen(buffer),0);
if (n < 0) error("ERROR Sending");
}
The Client code:
//connect
while(1)
{
bzero(buffer,256);
n = read(sock,buffer,255);
printf("Recieved data:%s\n",buffer);
if (n < 0)
error("ERROR reading from socket");
}
The problem is that:
The client shows something like this:
Recieved data:abc
Recieved data:abcabcabc
Recieved data:abcabc
....
How does it happen? When I set sleep time:
...
sleep(2)
...
It would be ok:
Recieved data:abc
Recieved data:abc
Recieved data:abc
...

TCP sockets do not guarantee framing. When you send bytes over a TCP socket, those bytes will be received on the other end in the same order, but they will not necessarily be grouped the same way — they may be split up, or grouped together, or regrouped, in any way the operating system sees fit.
If you need framing, you will need to send some sort of packet header to indicate where each chunk of data starts and ends. This may take the form of either a delimiter (e.g, a \n or \0 to indicate where each chunk ends), or a length value (e.g, a number at the head of each chunk to denote how long it is).
Also, as other respondents have noted, sleep() takes an integer, so you're effectively not sleeping at all here.

sleep takes unsigned int as argument, so sleep(0.02) is actually sleep(0).
unsigned int sleep(unsigned int seconds);
Use usleep(20) instead. It will sleep in microseconds:
int usleep(useconds_t usec);

The OS is at liberty to buffer data (i.e. why not just send a full packet instead of multiple packets)
Besides sleep takes a unsigned integer.

The reason is that the OS is buffering data to be sent. It will buffer based on either size or time. In this case, you're not sending enough data, but you're sending it fast enough the OS is choosing to bulk it up before putting it on the wire.
When you add the sleep(2), that is long enough that the OS chooses to send a single "abc" before the next one comes in.
You need to understand that TCP is simply a byte stream. It has no concept of messages or sizes. You simply put bytes on the wire on one end and take them off on the other. If you want to do specific things, then you need to interpret the data special ways when you read it. Because of this, the correct solution is to create an actual protocol for this. That protocol could be as simple as "each 3 bytes is one message", or more complicated where you send a size prefix.
UDP may also be a good solution for you, depending on your other requirements.

sleep(0.02)
is effectively
sleep(0)
because argument is unsigned int, so implicit conversion does it for you. So you have no sleep at all here. You can use sleep(2) to sleep for 2 microseconds.Next, even if you had, there is no guarantee that your messages will be sent in a different frames. If you need this, you should apply some sort of delimiter, I have seen
'\0'
character in some implementation.

TCPIP stacks buffer up data until there's a decent amount of data, or until they decide that there's no more coming from the application and send what they've got anyway.
There are two things you will need to do. First, turn off Nagle's algorithm. Second, sort out some sort of framing mechanism.
Turning off Nagle's algorithm will cause the stack to "send data immediately", rather than waiting on the off chance that you'll be wanting to send more. It actually leads to less network efficiency because you're not filling up Ethernet frames, something to bare in mind on Gigabit where jumbo frames are required to get best throughput. But in your case timeliness is more important than throughput.
You can do your own framing by very simple means, eg by send an integer first that says how long the rest if the message will be. At the reader end you would read the integer, and then read that number of bytes. For the next message you'd send another integer saying how long that message is, etc.
That sort of thing is ok but not hugely robust. You could look at something like ASN.1 or Google Protocol buffers.
I've used Objective System's ASN.1 libraries and tools (they're not free) and they do a good job of looking after message integrity, framing, etc. They're good because they don't read data from a network connection one byte at a time so the efficiency and speed isn't too bad. Any extra data read is retained and included in the next message decode.
I've not used Google Protocol Buffers myself but it's possible that they have similar characteristics, and there maybe other similar serialisation mechanisms out there. I'd recommend avoiding XML serialisation for speed/efficiency reasons.

Reading packets off a fragmented byte stream with ranges?

I believe I sort of know ranges, but I have no real idea for when and where to use them, or how. I fail to "get" ranges. Consider this example:
Let's assume we have a network handler, that we have no control over, that calls our callback function in some thread whenever there's some new data for us:
void receivedData(ubyte[] data)
This stream of data contains packets with the layout
{
ushort size;
ubyte[size] body;
}
However, the network handler doesn't know about this, so a call to dataReceived() may contain one or two partial packets, one or several complete packets, or a combination. For simplicity, let's assume there can be no corrupt packets and that we'll receive a data.length == 0 when the connection is closed.
What we'd like now is some beautiful D code that turns all this chaos into appropriate calls to
void receivedPacket(ubyte[] body)
I can surely think of "brute-force" ways of accomplishing this. But here's perhaps where my confusion comes in: Can ranges play a role in this? Can we wrap up receivedData() into a nice range-thingy? How? Or is this just not the kind of problems where you'd use ranges? Why not?
(If it would make more sense for using ranges, feel free to redefine the example.)

what I'd do is
ubyte[1024] buffer=void;//temp buffer set the size as needed...
ushort filledPart;//length of the part of the buffer containing partial packet
union{ushort nextPacketLength=0;ubyte[2] packetLengtharr;}//length of the next packet
void receivedData(ubyte[] data){
if(!data.length)return;
if(nextPacketLength){
dataPart = min(nextPacketLength-filledPart.length,data.length);
buffer[filledPart..nextPacketLength] = data[0..dataPart];
filledPart += dataPart;
if(filledPart == nextPacketLength){
receivedPacket(buffer[0..nextPacketLength]);//the call
filledPart=nextPacketLength=0;//reset state
receivedData(datadataPart..$]);//recurse
}
} else{
packetLengtharr[]=data[0..2];//read length of next packet
if(nextPacketLength<data.length){//full paket in data -> avoid unnecessary copies
receivedPacket(data[2..2+nextPacketLength]);
receivedData(data[2+nextPacketLength..$]);//recurse
}else
receivedData(data[2..$]);//recurse to use the copy code above
}
}
it's recursive with 3 possible paths:
data is empty -> no action
nextPacketLength is set to a value != 0 -> copy as much data into the buffer as possible and if the packet is complete call the callback and reset filledPart and nextPacketLength and recurse with rest of data
nextPacketLength ==0 read the packet length (here with a union) if full packet available call callback then recurse with rest of data
there's only 1 bug still in there when data only hold the first byte of length but I'll let you figure that out (and I'm too lazy to do it now)

Of these 3 methods for reading linked lists from shared memory, why is the 3rd fastest?

I have a 'server' program that updates many linked lists in shared memory in response to external events. I want client programs to notice an update on any of the lists as quickly as possible (lowest latency). The server marks a linked list's node's state_ as FILLED once its data is filled in and its next pointer has been set to a valid location. Until then, its state_ is NOT_FILLED_YET. I am using memory barriers to make sure that clients don't see the state_ as FILLED before the data within is actually ready (and it seems to work, I never see corrupt data). Also, state_ is volatile to be sure the compiler doesn't lift the client's checking of it out of loops.
Keeping the server code exactly the same, I've come up with 3 different methods for the client to scan the linked lists for changes. The question is: Why is the 3rd method fastest?
Method 1: Round robin over all the linked lists (called 'channels') continuously, looking to see if any nodes have changed to 'FILLED':
void method_one()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
while(true)
{
for(std::size_t i = 0; i < channel_list.size(); ++i)
{
Data* current_item = channel_cursors[i];
ACQUIRE_MEMORY_BARRIER;
if(current_item->state_ == NOT_FILLED_YET) {
continue;
}
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[i] = static_cast<Data*>(current_item->next_.get(segment));
}
}
}
Method 1 gave very low latency when then number of channels was small. But when the number of channels grew (250K+) it became very slow because of looping over all the channels. So I tried...
Method 2: Give each linked list an ID. Keep a separate 'update list' to the side. Every time one of the linked lists is updated, push its ID on to the update list. Now we just need to monitor the single update list, and check the IDs we get from it.
void method_two()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));
while(true)
{
ACQUIRE_MEMORY_BARRIER;
if(update_cursor->state_ == NOT_FILLED_YET) {
continue;
}
::uint32_t update_id = update_cursor->list_id_;
Data* current_item = channel_cursors[update_id];
if(current_item->state_ == NOT_FILLED_YET) {
std::cerr << "This should never print." << std::endl; // it doesn't
continue;
}
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[update_id] = static_cast<Data*>(current_item->next_.get(segment));
update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
}
}
Method 2 gave TERRIBLE latency. Whereas Method 1 might give under 10us latency, Method 2 would inexplicably often given 8ms latency! Using gettimeofday it appears that the change in update_cursor->state_ was very slow to propogate from the server's view to the client's (I'm on a multicore box, so I assume the delay is due to cache). So I tried a hybrid approach...
Method 3: Keep the update list. But loop over all the channels continuously, and within each iteration check if the update list has updated. If it has, go with the number pushed onto it. If it hasn't, check the channel we've currently iterated to.
void method_three()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));
while(true)
{
for(std::size_t i = 0; i < channel_list.size(); ++i)
{
std::size_t idx = i;
ACQUIRE_MEMORY_BARRIER;
if(update_cursor->state_ != NOT_FILLED_YET) {
//std::cerr << "Found via update" << std::endl;
i--;
idx = update_cursor->list_id_;
update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
}
Data* current_item = channel_cursors[idx];
ACQUIRE_MEMORY_BARRIER;
if(current_item->state_ == NOT_FILLED_YET) {
continue;
}
found_an_update = true;
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[idx] = static_cast<Data*>(current_item->next_.get(segment));
}
}
}
The latency of this method was as good as Method 1, but scaled to large numbers of channels. The problem is, I have no clue why. Just to throw a wrench in things: if I uncomment the 'found via update' part, it prints between EVERY LATENCY LOG MESSAGE. Which means things are only ever found on the update list! So I don't understand how this method can be faster than method 2.
The full, compilable code (requires GCC and boost-1.41) that generates random strings as test data is at: http://pastebin.com/0kuzm3Uf
Update: All 3 methods are effectively spinlocking until an update occurs. The difference is in how long it takes them to notice the update has occurred. They all continuously tax the processor, so that doesn't explain the speed difference. I'm testing on a 4-core machine with nothing else running, so the server and the client have nothing to compete with. I've even made a version of the code where updates signal a condition and have clients wait on the condition -- it didn't help the latency of any of the methods.
Update2: Despite there being 3 methods, I've only tried 1 at a time, so only 1 server and 1 client are competing for the state_ member.

Hypothesis: Method 2 is somehow blocking the update from getting written by the server.
One of the things you can hammer, besides the processor cores themselves, is your coherent cache. When you read a value on a given core, the L1 cache on that core has to acquire read access to that cache line, which means it needs to invalidate the write access to that line that any other cache has. And vice versa to write a value. So this means that you're continually ping-ponging the cache line back and forth between a "write" state (on the server-core's cache) and a "read" state (in the caches of all the client cores).
The intricacies of x86 cache performance are not something I am entirely familiar with, but it seems entirely plausible (at least in theory) that what you're doing by having three different threads hammering this one memory location as hard as they can with read-access requests is approximately creating a denial-of-service attack on the server preventing it from writing to that cache line for a few milliseconds on occasion.
You may be able to do an experiment to detect this by looking at how long it takes for the server to actually write the value into the update list, and see if there's a delay there corresponding to the latency.
You might also be able to try an experiment of removing cache from the equation, by running everything on a single core so the client and server threads are pulling things out of the same L1 cache.

I don't know if you have ever read the Concurrency columns from Herb Sutter. They are quite interesting, especially when you get into the cache issues.
Indeed the Method2 seems better here because the id being smaller than the data in general would mean that you don't have to do round-trips to the main memory too often (which is taxing).
However, what can actually happen is that you have such a line of cache:
Line of cache = [ID1, ID2, ID3, ID4, ...]
^ ^
client server
Which then creates contention.
Here is Herb Sutter's article: Eliminate False Sharing. The basic idea is simply to artificially inflate your ID in the list so that it occupies one line of cache entirely.
Check out the other articles in the serie while you're at it. Perhaps you'll get some ideas. There's a nice lock-free circular buffer I think that could help for your update list :)

I've noticed in both method 1 and method 3 you have a line, ACQUIRE_MEMORY_BARRIER, which I assume has something to do with multi-threading/race conditions?
Either way, method 2 doesn't have any sleeps which means the following code...
while(true)
{
if(update_cursor->state_ == NOT_FILLED_YET) {
continue;
}
is going to hammer the processor. The typical way to do this kind of producer/consumer task is to use some kind of semaphore to signal to the reader that the update list has changed. A search for producer/consumer multi threading should give you a large number of examples. The main idea here is that this allows the thread to go to sleep while it's waiting for the update_cursor->state to change. This prevents this thread from stealing all the cpu cycles.

The answer was tricky to figure out, and to be fair would be hard with the information I presented though if anyone actually compiled the source code I provided they'd have a fighting chance ;) I said that "found via update list" was printed after every latency log message, but this wasn't actually true -- it was only true for as far as I could scrollback in my terminal. At the very beginning there were a slew of updates found without using the update list.
The issue is that between the time when I set my starting point in the update list and my starting point in each of the data lists, there is going to be some lag because these operations take time. Remember, the lists are growing the whole time this is going on. Consider the simplest case where I have 2 data lists, A and B. When I set my starting point in the update list there happen to be 60 elements in it, due to 30 updates on list A and 30 updates on list B. Say they've alternated:
A
B
A
B
A // and I start looking at the list here
B
But then after I set the update list to there, there are a slew of updates to B and no updates to A. Then I set my starting places in each of the data lists. My starting points for the data lists are going to be after that surge of updates, but my starting point in the update list is before that surge, so now I'm going to check for a bunch of updates without finding them. The mixed approach above works best because by iterating over all the elements when it can't find an update, it quickly closes the temporal gap between where the update list is and where the data lists are.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js