Building an Orderbook representation for a Bitcoin exchange - c++

I am trying to build an Orderbook representation for the Poloniex Bitcoin exchange. I am subscribing to the Push-API which sends updates of the Orderbook over Websocket. The problem is that my Orderbook gets inconsistent over time, i.e. orders which should have been removed are still in my book.
My order book display has this format:
Exchange-Name - ASK - Amount - Price | Price - Amount - BID - Exchange-Name
On the left side (ASK) are people who are selling a currency. On the right side (BID) are people who are buying a currency. BTCUSD, ETHBTC and ETHUSD describe the different markets. BTCUSD means Bitcoin is exchanged for US-Dollar, ETHBTC means Ethereum is exchanged for Bitcoin and ETHUSD means Ethereum is exchanged for US-Dollar.
Poloniex sends updates over Websocket in JSON-Format. Here is an example of such an update:
[
36,
7597659581972377,
8089731807973507,
{},
[
{"data":{"rate":"609.00000029","type":"bid"},"type":"orderBookRemove"},{"data":{"amount":"0.09514285","rate":"609.00000031","type":"bid"},"type":"orderBookModify"}
],
{
"seq":19976127
}
]
json[0] can be ignored for this question.
json[1] is the market identifier. That means I send a request like "Subscribe to market BTCUSD" and they answer "BTCUSD updates will be sent under identifier number 7597659581972377".
json[2] can be ignored for this question.
json[3] can be ignored for this question.
json[4] contains the actual update data. More about that later.
json[5] contains a sequence number. It is used to execute the updates in the correct order if they arrive out of order. So if I receive 5 updates within one second in the order 1 - 3 - 5 - 4 - 2, they have to be executed as 1 - 2 - 3 - 4 - 5. Each market has its own independent sequence of sequence numbers.
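In other words, out-of-order updates have to be buffered per market and drained in sequence order. A minimal sketch of that idea (the Update type and the apply callback are placeholders, not my actual classes; my real implementation is the executeStack code further down):

#include <functional>
#include <map>

struct Update { /* one parsed update message */ };   // placeholder type

// Buffer an update regardless of arrival order, then apply everything that
// is now contiguous, starting at the sequence number we expect next.
void bufferAndDrain(std::map<long, Update>& pending, long& nextSeq,
                    long seq, const Update& u,
                    const std::function<void(const Update&)>& apply)
{
    pending[seq] = u;
    for (auto it = pending.find(nextSeq); it != pending.end(); it = pending.find(nextSeq))
    {
        apply(it->second);
        pending.erase(it);
        ++nextSeq;
    }
}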
As I said, json[4] contains an array of updates. There are three different kinds in json[4][array-index]["type"]:
orderBookModify: The available amount for a specific price has changed.
orderBookRemove: The order is not available anymore and must be removed.
newTrade: Can be used to build a trade history. Not required for what I am trying to do so it can be ignored.
json[4][array-index]["data"] contains two values if it is an orderBookRemove and three values if it is an orderBookModify (see the small parsing sketch below):
rate: The price.
amount (only present if it is an orderBookModify): The new amount.
type: ask or bid.
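To make the field layout concrete, here is a minimal parsing sketch using the cpprestsdk web::json::value type that also appears in my code below; the Update struct itself is only for illustration:

#include <cpprest/json.h>
#include <string>

struct Update
{
    double rate;
    double amount;    // 0 for an orderBookRemove
    bool   isAsk;     // false means bid
    bool   isRemove;  // true for orderBookRemove, false for orderBookModify
};

Update parseBookUpdate(const web::json::value& v)
{
    Update u{};
    const web::json::value& data = v.at("data");
    u.isRemove = v.at("type").as_string() == "orderBookRemove";
    u.isAsk    = data.at("type").as_string() == "ask";
    u.rate     = std::stod(data.at("rate").as_string());
    u.amount   = u.isRemove ? 0.0 : std::stod(data.at("amount").as_string());
    return u;
}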
There is also one kind of special message:
[36,8932491360003688,1315671639915103,{},[],{"seq":98045310}]
It only contains a sequence number. It is kind of a heartbeat message and does not send any updates.
The Code
I use three containers:
std::map<std::uint64_t,CMarket> m_mMarkets;
std::map<CMarket, long> m_mCurrentSeq;
std::map<CMarket, std::map<long, web::json::value>> m_mStack;
m_mMarkets is used to map the market-identifier number to the Market as it is stored inside my program.
m_mCurrentSeq is used to store the current sequence number for each market.
m_mStack stores the updates by market and sequence-number (that's what the long is for) until they can be executed.
This is the part which receives the updates:
// ....
// This method can be called asynchronously, so lock the containers.
this->m_muMutex.lock();
// Check if it is a known market. This should never happen!
if(this->m_mMarkets.find(json[1].as_number().to_uint64()) == this->m_mMarkets.end())
{
    this->m_muMutex.unlock();
    throw std::runtime_error("Received Market update of unknown Market");
}
// Map the market-identifier to a CMarket object.
CMarket market = this->m_mMarkets.at(json[1].as_number().to_uint64());
// Add the update to the execution queue
this->m_mStack[market][(long)json[5]["seq"].as_integer()] = json;
// Execute the execution queue
this->executeStack();
this->m_muMutex.unlock();
// ....
Now comes the execution-queue. I think this is where my mistake is located.
Function: "executeStack":
for(auto& market : this->m_mMarkets) // For all markets
{
    if(this->m_mCurrentSeq.find(market.second) != this->m_mCurrentSeq.end()) // if market has a sequence number
    {
        long seqNum = this->m_mCurrentSeq.at(market.second);
        // erase old entries
        for(std::map<long, web::json::value>::iterator it = this->m_mStack.at(market.second).begin(); it != this->m_mStack.at(market.second).end(); )
        {
            if((*it).first < seqNum)
                it = this->m_mStack.at(market.second).erase(it);
            else
                ++it;
        }
        // This container is used to store the updates to the Orderbook temporarily.
        std::vector<Order> addOrderStack{};
        while(this->m_mStack.at(market.second).find(seqNum) != this->m_mStack.at(market.second).end()) // has entry for seqNum
        {
            web::json::value json = this->m_mStack.at(market.second).at(seqNum);
            for(auto& v : json[4].as_array())
            {
                if(v["type"].as_string().compare("orderBookModify") == 0)
                {
                    Order::Type t = v["data"]["type"].as_string().compare("ask") == 0 ? Order::Type::Ask : Order::Type::Bid;
                    Order newOrder(std::stod(v["data"]["rate"].as_string()), std::stod(v["data"]["amount"].as_string()), t, market.second, this->m_pclParent, v.serialize());
                    addOrderStack.push_back(newOrder);
                }
                else if(v["type"].as_string().compare("orderBookRemove") == 0)
                {
                    Order::Type t = v["data"]["type"].as_string().compare("ask") == 0 ? Order::Type::Ask : Order::Type::Bid;
                    Order newOrder(std::stod(v["data"]["rate"].as_string()), 0, t, market.second, this->m_pclParent, v.serialize());
                    addOrderStack.push_back(newOrder);
                }
                else if(v["type"].as_string().compare("newTrade") == 0)
                {
                    // trade history updates are ignored here
                }
                else
                {
                    throw std::runtime_error("Unknown message format");
                }
            }
            this->m_mStack.at(market.second).erase(seqNum);
            seqNum++;
        }
        // The actual OrderList gets modified here. The mistake CANNOT be inside OrderList::addOrderStack,
        // because I am running Orderbooks for other exchanges too and they use the same method to modify
        // the Orderbook, and they do not get inconsistent.
        if(addOrderStack.size() > 0)
            OrderList::addOrderStack(addOrderStack);
        this->m_mCurrentSeq.at(market.second) = seqNum;
    }
}
So if this runs for a longer period, the order book becomes inconsistent. That means orders which should have been removed are still present, and there are wrong entries in the book. I am not quite sure why this is happening. Maybe I did something wrong with the sequence numbers, because it seems the update stack does not always get executed correctly. I have tried everything that came to my mind but I could not get it to work, and now I am out of ideas as to what could be wrong. If you have any questions please feel free to ask.

tl;dr: Poloniex API is imperfect and drops messages. Some simply never arrive. I've found that this happens for all users subscribed regardless of location in the world.
Hope that answer regarding using Autobahn|cpp to connect to Poloniex's WebSocket API (here) was useful. I suspect you had already figured it out though (otherwise this question/problem couldn't exist for you). As you might have gathered, I too have a cryptocurrency bot written in C++. I've been working on it off and on now for about 3.5 years.
The problem set you're facing is something I had to overcome as well. In this case, I'd prefer not to provide my source code, as the speed at which you process this can have huge effects on your profit margins. However, I will give pseudocode that offers some very rough insight into how I'm handling WebSocket event processing for Poloniex.
// Pseudocode
void someClass::handle_poloniex_ws_event(ws_event event){
    if(event.seq_num == expected_seq_num){
        process_ws_event(event)
        update_expected_seq_num
    }
    else{
        if(in_cache(expected_seq_num)){
            process_ws_event(from_cache(expected_seq_num))
            update_expected_seq_num
        }
        else{
            cache_event(event)
        }
    }
}
Note that what I've written above is a super simplified version of what I'm actually doing. My actual solution is about 500+ lines long with "goto xxx" and "goto yyy" throughout. I recommend taking timestamps/cpu clock cycle counts and comparing to current time/cycle counts to help you make decisions at any given moment (such as, should I wait for the missing event, should I continue processing and note to the rest of the program that there may be inaccuracies, should I utilize a GET request to refill my table, etc.?). The name of the game here is speed, as I'm sure you know. Good luck! Hope to hear from ya. :-)
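To illustrate the timestamp idea in the last paragraph, here is a rough, non-authoritative sketch; the 50 ms threshold and the requestSnapshot callback are placeholders of mine, not anything from the Poloniex API:

#include <chrono>
#include <map>

using Clock = std::chrono::steady_clock;

// Drain buffered events in order; if the next sequence number has been
// missing for longer than some threshold, assume it was dropped and fall
// back to a full snapshot instead of waiting forever.
template <typename Event, typename Apply, typename Resync>
void drainWithTimeout(std::map<long, Event>& cache, long& expectedSeq,
                      Clock::time_point& lastProgress,
                      Apply apply, Resync requestSnapshot)
{
    while (true)
    {
        auto it = cache.find(expectedSeq);
        if (it != cache.end())
        {
            apply(it->second);
            cache.erase(it);
            ++expectedSeq;
            lastProgress = Clock::now();               // gap timer resets on progress
        }
        else if (Clock::now() - lastProgress > std::chrono::milliseconds(50))
        {
            requestSnapshot();                         // gap looks permanent: resynchronize
            cache.clear();
            return;
        }
        else
        {
            return;                                    // keep waiting for the missing event
        }
    }
}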

Related

Problem with programming a basic hardware

I have an animation shown on LEDs. When the button is pressed, the animation has to stop and then continue after the button is pressed again.
There is a method that processes working with the button:
void checkButton(){
    GPIO_PinState state;
    state = HAL_GPIO_ReadPin(GPIOC, GPIO_PIN_15);
    if (state == GPIO_PIN_RESET) {
        while(1){
            state = HAL_GPIO_ReadPin(GPIOC, GPIO_PIN_15);
            if (state == GPIO_PIN_SET){
                break;
            }
        }
        //while (state == GPIO_PIN_RESET) {
        //    state = HAL_GPIO_ReadPin(GPIOC, GPIO_PIN_15);
        //}
    }
}
GPIO_PIN_SET is the default button position; GPIO_PIN_RESET is the state when the button is pressed. The commented section is what I tried instead of the while(1){...} loop. The checkButton() method is called from the main loop from time to time. The program runs on an STM32 with an extension module (the type of extension module does not matter here).
The problem is that this method only stops the animation for a moment and does not work as I would like it to. Could you correct anything in this program to make it work properly?
Could you correct anything about this program to make it work properly?
My guess is that you are trying to add a 'human interaction' aspect to your design. Your current approach relies on a single (button position) sample randomly timed by a) your application and b) a human finger. This timing is simply not reliable, but the correction is possibly not too difficult.
Note 1: A 'simple' mechanical button will 'bounce' during its activation or release (yes, either way). This means that the value which the software 'sees' (in a few microseconds) is unpredictable for several (to-be-determined) milliseconds near the button push or release.
Note 2: Another way to look at this issue, is that your state value exists two places: in the physical button AND in the variable "GPIO_PinState state;". IMHO, a state value can only reside in one location. Two locations is always a mistake.
The solution, then, is to decide to keep one state 'record' and eliminate the other. IMHO, you want to keep the button, which is your human input. To be clear, you want to eliminate the variable "GPIO_PinState state;".
This line:
state = HAL_GPIO_ReadPin(GPIOC, GPIO_PIN_15);
samples the switch state one time.
HOWEVER, you already know that this design can not rely on the one read being correct. After all, your user might have just pressed or released the button, and it is simply bouncing at the time of the sample.
Before we get to accumulating samples, you should be aware that the bouncing can last much more than a few microseconds. I've seen some switches bounce for 10 milliseconds or more. If test equipment is available, I would hook it up and take a look at the characteristics of your button. If not, well, you can try adjusting the controls of the following sample accumulator.
So, how do we 'accumulate' enough samples to feel confident we can know the state of the switch?
Consider multiple samples, spaced in time by short delays (two of the controls). I think you can simply accumulate them. The first counter to reach the threshold (5? 10? 100? samples, to be determined) wins. So spin: sample, delay, and increment one of the two counters:
int stateCount[2] = {0, 0}; // state is either set or reset, init both to 0
//              vvv------- max samples
for (int i = 0; i < 100; ++i) // worst case: how long does your switch bounce?
{
    int sample = HAL_GPIO_ReadPin(GPIOC, GPIO_PIN_15); // capture 1 sample
    stateCount[sample] += 1;                           // increment based on sample
    // if 'enough' samples are the same, kick out early
    //                     v ---- how long does your switch bounce?
    if (stateCount[sample] > 5) break; // 5 or 10 or 100 samples
    // to-be-determined --------vvv --- how long does the switch bounce?
    std::this_thread::sleep_for(1ms); // 1, 3, 5 or 11 ms between samples
    // (<thread> and std::chrono_literals in C++; use whatever delay is
    //  available for your system, balanced with the needs of your app)
}
FYI - The above scheme has 3 adjustments to handle different switch-bounce durations ... You have some experimenting to do. I would start with max samples at 20. I have no recommendation for sleep_for ... you provided no other info about your system.
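To tie this back to the original goal (pause the animation on one press, resume on the next, without freezing in a loop), here is a minimal sketch; debouncedRead() stands for the accumulator above wrapped into a function, and paused/lastState are names I made up:

// Sketch only. GPIO_PIN_RESET = pressed, as in the question.
GPIO_PinState debouncedRead(void);               // assumed: wraps the sample accumulator above

static GPIO_PinState lastState = GPIO_PIN_SET;   // button released
static int paused = 0;

void checkButton(void)
{
    GPIO_PinState now = debouncedRead();         // settled state, not one raw sample
    if (lastState == GPIO_PIN_SET && now == GPIO_PIN_RESET)
    {
        paused = !paused;                        // toggle only on the press edge
    }
    lastState = now;
}

/* in the main loop:
       checkButton();
       if (!paused) advanceAnimationOneStep();   // hypothetical animation step
*/

This way checkButton() never blocks, so the main loop keeps running and simply skips the animation step while paused.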
Good luck.
It has been a long time, but I think I remember the push-buttons on a telecom infrastructure equipment bounced 5 to 15 ms.

C++ download progress report algorithm

I have an application (Qt, but that is not really important) which is downloading several files, and I want to notify the user about the progress. The C++ app runs on a different machine and progress reports are sent over the network (the protocol does not matter here). I do not want to send a message over the network for every chunk of data received, but only at defined intervals, e.g. every 5% (so 0%, 5%, 10%, ...).
Basically I have it like this right now:
void Downloader::OnUpdateDownloadProgress(int downloaded_bytes)
{
    m_files_downloaded_size += downloaded_bytes;
    int perc_download = (int) ((m_files_downloaded_size / m_files_total_size) * 100);
    if(m_percentage_buffer > LocalConfig::getDownloadReportSteps() || m_files_downloaded_size == m_files_total_size){
        emit sigDownloadProgress(DOWNLOAD_PROGRESS, perc_download);
        m_percentage_buffer = 0;
    }else{
        m_percentage_buffer += (downloaded_bytes / m_files_total_size) * 100;
    }
}
Which means that for each chunk of data triggering this slot I need to perform:
a greater-than comparison, an addition, a division, and a multiplication.
I know that I could at least skimp on the multiplication by storing a float in the settings and comparing to that. Other than that, are there any ways to make this more performant, or did I do well on my first try at implementing it?
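For comparison, a common alternative is to track the last reported percentage and emit only when the integer percentage crosses the next step boundary; a minimal sketch under that assumption (m_last_reported_percent would be a new int member, initialized to 0, and is not part of the original class):

void Downloader::OnUpdateDownloadProgress(int downloaded_bytes)
{
    m_files_downloaded_size += downloaded_bytes;
    const int step    = LocalConfig::getDownloadReportSteps();                      // e.g. 5
    const int percent = static_cast<int>(100.0 * m_files_downloaded_size / m_files_total_size);

    // report only when we reach the next multiple of 'step' (or on completion)
    if (percent >= m_last_reported_percent + step || m_files_downloaded_size == m_files_total_size)
    {
        emit sigDownloadProgress(DOWNLOAD_PROGRESS, percent);
        m_last_reported_percent = percent - (percent % step);                       // snap back onto the step grid
    }
}

The per-call cost is still only a handful of arithmetic operations, which will be negligible next to the network I/O, but it avoids the accumulating buffer and any rounding drift.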

avoiding collisions when collapsing infinity lock-free buffer to circular-buffer

I'm solving the two-feed arbitration problem of the FAST protocol.
Please don't worry if you are not familiar with it; my question is actually pretty general. But I'm adding the problem description for those who are interested (you can skip it).
Data in all UDP Feeds are disseminated in two identical feeds (A and B) on two different multicast IPs. It is strongly recommended that client receive and process both feeds because of possible UDP packet loss. Processing two identical feeds allows one to statistically decrease the probability of packet loss.
It is not specified in what particular feed (A or B) the message appears for the first time. To arbitrate these feeds one should use the message sequence number found in Preamble or in tag 34-MsgSeqNum. Utilization of the Preamble allows one to determine message sequence number without decoding of FAST message.
Processing messages from feeds A and B should be performed using the following algorithm:
Listen feeds A and B
Process messages according to their sequence numbers.
Ignore a message if one with the same sequence number was already processed before.
If the gap in sequence number appears, this indicates packet loss in both feeds (A and B). Client should initiate one of the Recovery process. But first of all client should wait a reasonable time, perhaps the lost packet will come a bit later due to packet reordering. UDP protocol can’t guarantee the delivery of packets in a sequence.
// tcp recover algorithm further
I wrote a very simple class for this. It preallocates all required objects, and the first thread that receives a particular seqNum gets to process it. Another thread receiving the same seqNum will drop it later:
class MsgQueue
{
public:
    MsgQueue();
    ~MsgQueue(void);

    bool Lock(uint32_t msgSeqNum);
    Msg& Get(uint32_t msgSeqNum);
    void Commit(uint32_t msgSeqNum);

private:
    void Process();

    static const int QUEUE_LENGTH = 1000000;

    // 0 - available for use; 1 - processing; 2 - ready
    std::atomic<uint16_t> status[QUEUE_LENGTH];
    Msg updates[QUEUE_LENGTH];
};
Implementation:
MsgQueue::MsgQueue()
{
    memset(status, 0, sizeof(status));
}

MsgQueue::~MsgQueue(void)
{
}

// For the same msgSeqNum this should return true to only one thread
bool MsgQueue::Lock(uint32_t msgSeqNum)
{
    uint16_t expected = 0;
    return status[msgSeqNum].compare_exchange_strong(expected, 1);
}

void MsgQueue::Commit(uint32_t msgSeqNum)
{
    status[msgSeqNum] = 2;
    Process();
}

// this method probably should be combined with "Lock" but please ignore! :)
Msg& MsgQueue::Get(uint32_t msgSeqNum)
{
    return updates[msgSeqNum];
}

void MsgQueue::Process()
{
    // ready packets must be processed
}
Usage:
if (!msgQueue.Lock(seq)) {
    return;
}
Msg& msg = msgQueue.Get(seq);   // take a reference so the stored Msg is actually modified
msg.Ticker = "HP";
msg.Bid = 100;
msg.Offer = 101;
msgQueue.Commit(seq);
This works fine if we assume that QUEUE_LENGTH is infinite, because in that case one msgSeqNum maps to exactly one updates array item.
But I have to make the buffer circular, because it is not possible to store the entire history (many millions of packets) and there is no reason to do so. Actually I only need to buffer enough packets to reconstruct the session, and once the session is reconstructed I can drop them.
Having a circular buffer significantly complicates the algorithm, though. For example, assume that we have a circular buffer of length 1000, and at the same time we try to process seqNum = 10 000 and seqNum = 11 000 (this is VERY unlikely, but still possible). Both of these packets map to index 0 of the updates array, so a collision occurs. In such a case the buffer should 'drop' the old packets and process the new packets.
It's trivial to implement what I want using locks, but writing lock-free code for a circular buffer that is used from different threads is really complicated. So I welcome any suggestions and advice on how to do that. Thanks!
I don't believe you can use a ring buffer. A hashed index can be used in the status[] array, i.e. hash = seq % 1000. The issue is that the sequence number is dictated by the network and you have no control over its ordering. You wish to lock based on this sequence number. Your array doesn't need to be infinite, just the range of the sequence number; but that is probably larger than practical.
I am not sure what is happening when the sequence number is locked. Does this mean another thread is processing it? If so, you must maintain a sub-list for hash collisions to resolve the particular sequence number.
You may also consider an array size that is a power of 2. For example, 1024 allows hash = seq & 1023, which should be quite efficient.
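As a rough illustration of the hashed-index idea (not the answerer's actual code), each slot can remember which sequence number currently owns it, so a collision with an older sequence number can be detected and the old entry dropped. This sketch only shows the collision handling; it deliberately ignores the processing/ready states from the original class, and it assumes real sequence numbers start at 1 so that 0 can mean "empty":

#include <atomic>
#include <cstdint>

struct Msg { /* fields from the question's Msg type */ };

class RingMsgQueue
{
public:
    static const uint32_t SIZE = 1024;                  // power of 2

    // Returns true for exactly one thread per sequence number, and drops
    // sequence numbers that are older than whatever already owns the slot.
    bool Lock(uint32_t seq)
    {
        Slot& s = slots_[seq & (SIZE - 1)];
        uint32_t owner = s.owner.load();
        for (;;)
        {
            if (owner == seq) return false;             // another thread already took this seq
            if (owner > seq)  return false;             // slot reused by a newer packet: drop this old one
            // owner < seq: try to reclaim the slot from the older sequence number
            if (s.owner.compare_exchange_weak(owner, seq))
                return true;                            // we own the slot now
            // CAS failed: 'owner' was reloaded, loop and re-check
        }
    }

    Msg& Get(uint32_t seq) { return slots_[seq & (SIZE - 1)].msg; }

private:
    struct Slot
    {
        std::atomic<uint32_t> owner{0};                 // 0 = empty; otherwise the owning seq
        Msg msg;
    };
    Slot slots_[SIZE];
};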

C++ : Avoid lot of boolean variable for multiple verification conditions in trading app

I am a junior dev on a trading app... we have an order refresh verification unit. It has to verify order confirmations from the exchange. We send a bunch of different requests in bulk (NEW, MODIFY, CANCEL) to the exchange... Verification has to happen at most N times, at intervals of T, for all orders. If verification succeeds for all orders before N retries, fine; otherwise we need to flag the verification as unsuccessful. I have hacked together something basic in a hurry, like below:
for( N times )
{
    for_each ( sent_request_order ) // SENT
    {
        1) get all the refreshed orders from DB or shared mem, i.e. REFRESHED
        2) find current sent order in REFRESHED
        if( not_found )
            not refreshed from exchange, continue to next order
        if( found )
            case NEW :    // check for new status, mark verification done
            case MODIFY : // check for modified status..
                          // if not, mark pending, go to next order,
                          // revisit the same after T time
            case CANCEL : // check for cancelled status..
                          // if not, mark pending, go to next order,
                          // revisit the same after T time
    }
    if( all_verified )
        exit from verification.
    wait ( T sec )
}
order_verification_pending, order_verification_done, order_visited, order_not_visited, all_verified, all_not_verified... a lot of boolean flags are used for indication.
Is there any better approach for doing this, e.g. splitting responsibilities across classes?
I know this is not a very general question, but all these flags are getting tedious for me to handle.
Your algorithm looks workable. Implement it.
Do not try to optimize your code before you have it working. Once you have a working version running, never mind how ugly, then you can look at ways and means to optimize. Chances are good that you will then find a better way to handle the flags that give you so much trouble.
You talk about "order_verification_pending, order_verification_done, order_visited, order_not_visited, all_verified, all_not_verified"... but that seems to double the number of booleans; for example, if you have order_visited then you don't need order_not_visited, it is just "!order_visited". When there are more than two states involved, use an enum instead of a lot of complicated, overlapping booleans. For example, if verification might be pending, done, failed etc., but these are all mutually exclusive, then store the single current state in that enum.
It's more elegant to have a set of pending operations, and remove elements from that set until the set is empty or the whole verification times out. That way, you're not checking operations you already found succeeded.
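A minimal sketch combining both suggestions (the enum states, SentOrder, and checkAgainstRefreshed are illustrative names of mine, not from the original code):

#include <chrono>
#include <string>
#include <thread>
#include <unordered_map>

// One state per order instead of overlapping booleans.
enum class VerifyState { Pending, Done, Failed };

struct SentOrder { std::string id; /* NEW / MODIFY / CANCEL details ... */ };

// Hypothetical: looks the order up in REFRESHED and checks its status.
bool checkAgainstRefreshed(const SentOrder& order);

// Orders still awaiting confirmation, keyed by order id.
using PendingSet = std::unordered_map<std::string, SentOrder>;

bool verifyAll(PendingSet pending, int maxRetries, std::chrono::seconds interval)
{
    for (int attempt = 0; attempt < maxRetries && !pending.empty(); ++attempt)
    {
        for (auto it = pending.begin(); it != pending.end(); )
        {
            if (checkAgainstRefreshed(it->second))
                it = pending.erase(it);          // verified: never re-checked
            else
                ++it;                            // still pending, retry next round
        }
        if (!pending.empty())
            std::this_thread::sleep_for(interval);
    }
    return pending.empty();                      // true = everything verified
}

The "all_verified" and "visited" flags disappear: an order is pending exactly as long as it is in the set, and the whole run is successful exactly when the set ends up empty.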

Of these 3 methods for reading linked lists from shared memory, why is the 3rd fastest?

I have a 'server' program that updates many linked lists in shared memory in response to external events. I want client programs to notice an update on any of the lists as quickly as possible (lowest latency). The server marks a linked list's node's state_ as FILLED once its data is filled in and its next pointer has been set to a valid location. Until then, its state_ is NOT_FILLED_YET. I am using memory barriers to make sure that clients don't see the state_ as FILLED before the data within is actually ready (and it seems to work, I never see corrupt data). Also, state_ is volatile to be sure the compiler doesn't lift the client's checking of it out of loops.
Keeping the server code exactly the same, I've come up with 3 different methods for the client to scan the linked lists for changes. The question is: Why is the 3rd method fastest?
Method 1: Round robin over all the linked lists (called 'channels') continuously, looking to see if any nodes have changed to 'FILLED':
void method_one()
{
    std::vector<Data*> channel_cursors;
    for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
    {
        Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
        channel_cursors.push_back(current_item);
    }

    while(true)
    {
        for(std::size_t i = 0; i < channel_list.size(); ++i)
        {
            Data* current_item = channel_cursors[i];

            ACQUIRE_MEMORY_BARRIER;
            if(current_item->state_ == NOT_FILLED_YET) {
                continue;
            }

            log_latency(current_item->tv_sec_, current_item->tv_usec_);
            channel_cursors[i] = static_cast<Data*>(current_item->next_.get(segment));
        }
    }
}
Method 1 gave very low latency when the number of channels was small. But when the number of channels grew (250K+) it became very slow because of looping over all the channels. So I tried...
Method 2: Give each linked list an ID. Keep a separate 'update list' to the side. Every time one of the linked lists is updated, push its ID on to the update list. Now we just need to monitor the single update list, and check the IDs we get from it.
void method_two()
{
    std::vector<Data*> channel_cursors;
    for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
    {
        Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
        channel_cursors.push_back(current_item);
    }

    UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));

    while(true)
    {
        ACQUIRE_MEMORY_BARRIER;
        if(update_cursor->state_ == NOT_FILLED_YET) {
            continue;
        }

        ::uint32_t update_id = update_cursor->list_id_;
        Data* current_item = channel_cursors[update_id];

        if(current_item->state_ == NOT_FILLED_YET) {
            std::cerr << "This should never print." << std::endl; // it doesn't
            continue;
        }

        log_latency(current_item->tv_sec_, current_item->tv_usec_);
        channel_cursors[update_id] = static_cast<Data*>(current_item->next_.get(segment));
        update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
    }
}
Method 2 gave TERRIBLE latency. Whereas Method 1 might give under 10us latency, Method 2 would inexplicably often give 8ms latency! Using gettimeofday it appears that the change in update_cursor->state_ was very slow to propagate from the server's view to the client's (I'm on a multicore box, so I assume the delay is due to cache). So I tried a hybrid approach...
Method 3: Keep the update list. But loop over all the channels continuously, and within each iteration check if the update list has updated. If it has, go with the number pushed onto it. If it hasn't, check the channel we've currently iterated to.
void method_three()
{
    std::vector<Data*> channel_cursors;
    for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
    {
        Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
        channel_cursors.push_back(current_item);
    }

    UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));

    while(true)
    {
        for(std::size_t i = 0; i < channel_list.size(); ++i)
        {
            std::size_t idx = i;

            ACQUIRE_MEMORY_BARRIER;
            if(update_cursor->state_ != NOT_FILLED_YET) {
                //std::cerr << "Found via update" << std::endl;
                i--;
                idx = update_cursor->list_id_;
                update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
            }

            Data* current_item = channel_cursors[idx];

            ACQUIRE_MEMORY_BARRIER;
            if(current_item->state_ == NOT_FILLED_YET) {
                continue;
            }

            found_an_update = true;
            log_latency(current_item->tv_sec_, current_item->tv_usec_);
            channel_cursors[idx] = static_cast<Data*>(current_item->next_.get(segment));
        }
    }
}
The latency of this method was as good as Method 1, but scaled to large numbers of channels. The problem is, I have no clue why. Just to throw a wrench in things: if I uncomment the 'found via update' part, it prints between EVERY LATENCY LOG MESSAGE. Which means things are only ever found on the update list! So I don't understand how this method can be faster than method 2.
The full, compilable code (requires GCC and boost-1.41) that generates random strings as test data is at: http://pastebin.com/0kuzm3Uf
Update: All 3 methods are effectively spinlocking until an update occurs. The difference is in how long it takes them to notice the update has occurred. They all continuously tax the processor, so that doesn't explain the speed difference. I'm testing on a 4-core machine with nothing else running, so the server and the client have nothing to compete with. I've even made a version of the code where updates signal a condition and have clients wait on the condition -- it didn't help the latency of any of the methods.
Update2: Despite there being 3 methods, I've only tried 1 at a time, so only 1 server and 1 client are competing for the state_ member.
Hypothesis: Method 2 is somehow blocking the update from getting written by the server.
One of the things you can hammer, besides the processor cores themselves, is your coherent cache. When you read a value on a given core, the L1 cache on that core has to acquire read access to that cache line, which means it needs to invalidate the write access to that line that any other cache has. And vice versa to write a value. So this means that you're continually ping-ponging the cache line back and forth between a "write" state (on the server-core's cache) and a "read" state (in the caches of all the client cores).
The intricacies of x86 cache performance are not something I am entirely familiar with, but it seems entirely plausible (at least in theory) that what you're doing by having three different threads hammering this one memory location as hard as they can with read-access requests is approximately creating a denial-of-service attack on the server preventing it from writing to that cache line for a few milliseconds on occasion.
You may be able to do an experiment to detect this by looking at how long it takes for the server to actually write the value into the update list, and see if there's a delay there corresponding to the latency.
You might also be able to try an experiment of removing cache from the equation, by running everything on a single core so the client and server threads are pulling things out of the same L1 cache.
I don't know if you have ever read the Concurrency columns from Herb Sutter. They are quite interesting, especially when you get into the cache issues.
Indeed, Method 2 seems better here, because the IDs are smaller than the data in general, which means you don't have to do round trips to main memory too often (which is taxing).
However, what can actually happen is that you have such a line of cache:
Line of cache = [ID1, ID2, ID3, ID4, ...]
                  ^               ^
                client          server
Which then creates contention.
Here is Herb Sutter's article: Eliminate False Sharing. The basic idea is simply to artificially inflate your ID in the list so that it occupies one line of cache entirely.
Check out the other articles in the series while you're at it. Perhaps you'll get some ideas. There's a nice lock-free circular buffer I think that could help for your update list :)
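A minimal sketch of the padding idea from that article, applied to an update-list entry; it uses C++11 alignas and assumes a typical 64-byte x86 cache line (with the older toolchain from the question you would add a manual char padding array instead):

#include <cstdint>

// Pad each update-list entry to a full cache line so the entry the server is
// writing and the entry the client is reading never share a line.
struct alignas(64) PaddedUpdateID
{
    volatile int  state_;     // NOT_FILLED_YET / FILLED, as in the question
    std::uint32_t list_id_;
    // the remaining bytes of the cache line are padding added by alignas
};

static_assert(sizeof(PaddedUpdateID) == 64, "one entry per cache line");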
I've noticed in both method 1 and method 3 you have a line, ACQUIRE_MEMORY_BARRIER, which I assume has something to do with multi-threading/race conditions?
Either way, method 2 doesn't have any sleeps which means the following code...
while(true)
{
    if(update_cursor->state_ == NOT_FILLED_YET) {
        continue;
    }
is going to hammer the processor. The typical way to do this kind of producer/consumer task is to use some kind of semaphore to signal to the reader that the update list has changed. A search for producer/consumer multithreading should give you a large number of examples. The main idea here is that this allows the thread to go to sleep while it's waiting for update_cursor->state_ to change. This prevents the thread from stealing all the CPU cycles.
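For reference, the usual in-process shape of that producer/consumer signalling looks like the sketch below, using std::condition_variable (for two separate processes over shared memory you would need a process-shared primitive, e.g. a boost.interprocess condition, instead). Note the asker mentions above that a condition-based version did not improve latency, but it does stop the busy-wait:

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>

// Sketch: the reader sleeps until the writer signals that a new update exists,
// instead of spinning on update_cursor->state_.
std::mutex                m;
std::condition_variable   cv;
std::queue<std::uint32_t> updated_ids;   // stands in for the update list

void writer_publishes(std::uint32_t list_id)
{
    {
        std::lock_guard<std::mutex> lock(m);
        updated_ids.push(list_id);
    }
    cv.notify_one();                      // wake a sleeping reader
}

std::uint32_t reader_waits_for_update()
{
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, []{ return !updated_ids.empty(); });
    std::uint32_t id = updated_ids.front();
    updated_ids.pop();
    return id;
}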
The answer was tricky to figure out, and to be fair it would be hard to find with the information I presented, though if anyone had actually compiled the source code I provided they'd have had a fighting chance ;) I said that "found via update list" was printed after every latency log message, but this wasn't actually true -- it was only true as far back as I could scroll in my terminal. At the very beginning there was a slew of updates found without using the update list.
The issue is that between the time when I set my starting point in the update list and my starting point in each of the data lists, there is going to be some lag because these operations take time. Remember, the lists are growing the whole time this is going on. Consider the simplest case where I have 2 data lists, A and B. When I set my starting point in the update list there happen to be 60 elements in it, due to 30 updates on list A and 30 updates on list B. Say they've alternated:
A
B
A
B
A // and I start looking at the list here
B
But then, after I set my starting point in the update list, there is a slew of updates to B and no updates to A. Then I set my starting places in each of the data lists. My starting points for the data lists are going to be after that surge of updates, but my starting point in the update list is before that surge, so now I'm going to check for a bunch of updates without finding them. The mixed approach above works best because, by iterating over all the elements when it can't find an update, it quickly closes the temporal gap between where the update list is and where the data lists are.