C++ download progress report algorithm

C++ download progress report algorithm - c++

I have an application (Qt but that is not really important) which is downloading several files and I want to notify the user about the progress. The c++ app runs on a different machine and progress reports are sent over network (protocoll does not matter here). I do not want to sent for each data receival a message over the network but only in defined intervalls e.g. every 5% (so 0%, 5%, 10%).
Basically I have it like this right now:
void Downloader::OnUpdateDownloadProgress(int downloaded_bytes)
{
m_files_downloaded_size += downloaded_bytes;
int perc_download = (int) ((m_files_downloaded_size / m_files_total_size)*100);
if(m_percentage_buffer > LocalConfig::getDownloadReportSteps() || m_files_downloaded_size == m_files_total_size){
emit sigDownloadProgress(DOWNLOAD_PROGRESS, perc_download);
m_percentage_buffer = 0;
}else{
m_percentage_buffer += (downloaded_bytes / m_files_total_size) * 100;
}
}
Which means that for each data receival triggering this slot I need to perform:
greater comparison, addition , division, multiplication
I know that I could at least skimp on the multiplication by storing a float in the settings and comparing to that. Other than that are there any ways to get this more performant or did I do good on my first try implementing?

Related

QCustomPlot Huge Amount of Data Plotting

I am trying to plot some serial data on my Qt Gui program using qcustomplot class. I had no trouble when I tried to plot low sampling frequency datas like 100 data/second. The graph was really cool and was plotting the data smoothly. But at high sampling rates such 1000data/second, plotter makes a bottleneck for serial read function. It slow downs serial there was a huge delay like 4-5 seconds apart from device. Straightforwardly, plotter could not reach the data stream speed. So, is there any common issue which i dont know about or any recommendation?
I thougth these scenarious,
1- to devide whole program to 2 or 3 thread. For example, serial part runs in one thread and plotting part runs in another thread and two thread communicates with a QSemaphore
2- fps of qcustom plot is limited. but there should be a solution because NI LABVIEW plots up to 2k of datas without any delay
3- to desing a new virtual serial device in usb protocol. Now, I am using ft232rl serial to usb convertor.
4- to change programming language. What is the situation and class support in C# or java for realtime plotting? (I know it is like a kid saying, but this is a pretex to be experienced in other languages)
My serial device send data funct(it is foo device for experiment there is no serious coding) is briefly that:
void progTask()
{
DelayMsec(1); //my delay function, milisecond
//read value from adc13
Adc13Read(adcValue.ui32Part);
sendData[0] = (char)'a';
sendData[1] = (char)'k';
sendData[2] = adcValue.bytes[0];
sendData[3] = (adcValue.bytes[1] & 15);
Qt Program read function is that:
//send test data
UARTSend(UART6_BASE,&sendData[0],4);
}
union{
unsigned char bytes[2];
unsigned int intPart;
unsigned char *ptr;
}serData;
void MedicalSoftware::serialReadData()
{
if(serial->bytesAvailable()<4)
{
//if the frame size is less than 4 bytes return and
//wait to full serial receive buffer
//note: serial->setReadBufferSize(4)!!!!
return;
}
QByteArray serialInData = serial->readAll();
//my algorithm
if(serialInData[0] == 'a' && serialInData[1] == 'k')
{
serData.bytes[0] = serialInData[2];
serData.bytes[1] = serialInData[3];
}else if(serialInData[2] == 'a' && serialInData[3] == 'k')
{
serData.bytes[0] = serialInData[0];
serData.bytes[1] = serialInData[1];
}
else if(serialInData[1] == 'a' && serialInData[2] == 'k')
{
serial->read(1);
return;
}else if(serialInData[0] == 'k' && serialInData[3] == 'a')
{
serData.bytes[0] = serialInData[1];
serData.bytes[1] = serialInData[2];
}
plotMainGraph(serData.intPart);
serData.intPart = 0;
}
And qcustom plot setting fuction is:
void MedicalSoftware::setGraphsProperties()
{
//MAIN PLOTTER
ui->mainPlotter->addGraph();
ui->mainPlotter->xAxis->setRange(0,2000);
ui->mainPlotter->yAxis->setRange(-0.1,3.5);
ui->mainPlotter->xAxis->setLabel("Time(s)");
ui->mainPlotter->yAxis->setLabel("Magnitude(mV)");
QSharedPointer<QCPAxisTickerTime> timeTicker(new QCPAxisTickerTime());
timeTicker->setTimeFormat("%h:%m:%s");
ui->mainPlotter->xAxis->setTicker(timeTicker);
ui->mainPlotter->axisRect()->setupFullAxesBox();
QPen pen;
pen.setColor(QColor("blue"));
ui->mainPlotter->graph(0)->setPen(pen);
dataTimer = new QTimer;
}
And the last is plot function:
void MedicalSoftware::plotMainGraph(const quint16 serData)
{
static QTime time(QTime::currentTime());
double key = time.elapsed()/1000.0;
static double lastPointKey = 0;
if(key-lastPointKey>0.005)
{
double value0 = serData*(3.3/4096);
ui->mainPlotter->graph(0)->addData(key,value0);
lastPointKey = key;
}
ui->mainPlotter->xAxis->setRange(key+0.25, 2, Qt::AlignRight);
counter++;
ui->mainPlotter->replot();
counter = 0;
}

Quick answer:
Have you tried:
ui->mainPlotter->replot(QCustomPlot::rpQueuedReplot);
according to the documentation it can improves performances when doing a lot of replots.
Longer answer:
My feeling on your code is that you are trying to replot as often as you can to get a "real time" plot. But if you are on a PC with a desktop OS there is no such thing as real time.
What you should care about is:
Ensure that the code that read/write to the serial port is not delayed too much. "Too much" is to be interpreted with respect to the connected hardware. If it gets really time critical (which seems to be your case) you have to optimize your read/write functions and eventually put them alone in a thread. This can go as far as reserving a full hardware CPU core for this thread.
Ensure that the graph plot is refreshed fast enough for the human eyes. You do not need to do a full repaint each time you receive a single data point.
In your case you receive 1000 data/s which make 1 data every ms. That is quite fast because that is beyond the default timer resolution of most desktop OS. That means you are likely to have more than a single point of data when calling your "serialReadData()" and that you could optimize it by calling it less often (e.g call it every 10ms and read 10 data points each time). Then you could call "replot()" every 30ms which would add 30 new data points each time, skip about 29 replot() calls every 30ms compared to your code and give you ~30fps.
1- to devide whole program to 2 or 3 thread. For example, serial part
runs in one thread and plotting part runs in another thread and two
thread communicates with a QSemaphore
Dividing the GUI from the serial part in 2 threads is good because you will prevent a bottleneck in GUI to block the serial communication. Also you could skip using semaphore and simply rely on Qt signal/slot connections (connected in Qt::QueuedConnection mode).
4- to change programming language. What is the situation and class
support in C# or java for realtime plotting? (I know it is like a kid
saying, but this is a pretex to be experienced in other languages)
Changing the programming language, in best case, won't change anything or could hurt your performances, especially if you go toward languages which are not compiled to native CPU instructions.
Changing the plotting library on the other hand could change the performances. You can look at Qt Charts and Qwt. I do not know how they compare to QCustomPlot though.

Building an Orderbook representation for a Bitcoin exchange

I am trying to build an Orderbook representation for the Poloniex Bitcoin exchange. I am subscribing to the Push-API which sends updates of the Orderbook over Websocket. The problem is that my Orderbook gets inconsistent over time, i.e. orders which should have been removed are still in my book.
The Orderbook on the following picture has this format:
Exchange-Name - ASK - Amount - Price | Price - Amount - BID - Exchange-Name
On the left side (ASK) are people who are selling a currency. On the right side (BID) are people who are buying a currency. BTCUSD, ETHBTC and ETHUSD describe the different markets. BTCUSD means Bitcoin is exchanged for US-Dollar, ETHBTC means Ethereum is exchanged for Bitcoin and ETHUSD means Ethereum is exchanged for US-Dollar.
Poloniex sends updates over Websocket in JSON-Format. Here is an example of such an update:
[
36,
7597659581972377,
8089731807973507,
{},
[
{"data":{"rate":"609.00000029","type":"bid"},"type":"orderBookRemove"},{"data":{"amount":"0.09514285","rate":"609.00000031","type":"bid"},"type":"orderBookModify"}
],
{
"seq":19976127
}
]
json[0] can be ignored for this question.
json[1] is the market identifier. That means I send a request like "Subscribe to market BTCUSD" and they answer "BTCUSD updates will be sent under identifier number 7597659581972377".
json[2] can be ignored for this question.
json[3] can be ignored for this question.
json[4] contains the actual update data. More about that later.
json[5] contains a sequence number. It is used to execute the updates correctly if they arrive out of order. So if I receive 5 updates within 1 second by the order 1 - 3 - 5 - 4 - 2 they have to be executed like 1 - 2 - 3 - 4 - 5. Each market gets a different "sequence-number-sequence".
As I said, json[4] contains an array of updates. There are three different kinds in json[4][array-index]["type"]:
orderBookModify: The available amount for a specific price has changed.
orderBookRemove: The order is not available anymore and must be removed.
newTrade: Can be used to build a trade history. Not required for what I am trying to do so it can be ignored.
json[4][array-index]["data"] contains two values if it is a orderBookRemove and three values if it is a orderBookModify.
rate: The price.
amount (only existant if it is a orderBookModify): The new amount.
type: ask or bid.
There is also one kind of special message:
[36,8932491360003688,1315671639915103,{},[],{"seq":98045310}]
It only contains a sequence number. It is kind of a heartbeat message and does not send any updates.
The Code
I use three containers:
std::map<std::uint64_t,CMarket> m_mMarkets;
std::map<CMarket, long> m_mCurrentSeq;
std::map<CMarket, std::map<long, web::json::value>> m_mStack;
m_mMarkets is used to map the market-identifier number to the Market as it is stored inside my program.
m_mCurrentSeq is used to store the current sequence number for each market.
m_mStack stores the updates by market and sequence-number (that's what the long is for) until they can be executed.
This is the part which receives the updates:
// ....
// This method can be called asynchronously, so lock the containers.
this->m_muMutex.lock();
// Map the market-identifier to a CMarket object.
CMarket market = this->m_mMarkets.at(json[1].as_number().to_uint64());
// Check if it is a known market. This should never happen!
if(this->m_mMarkets.find(json[1].as_number().to_uint64()) == this->m_mMarkets.end())
{
this->m_muMutex.unlock();
throw std::runtime_error("Received Market update of unknown Market");
}
// Add the update to the execution-queue
this->m_mStack[market][(long)json[5]["seq"].as_integer()] = json;
// Execute the execution-queue
this->executeStack();
this->m_muMutex.unlock();
// ....
Now comes the execution-queue. I think this is where my mistake is located.
Function: "executeStack":
for(auto& market : this->m_mMarkets) // For all markets
{
if(this->m_mCurrentSeq.find(market.second) != this->m_mCurrentSeq.end()) // if market has a sequence number
{
long seqNum = this->m_mCurrentSeq.at(market.second);
// erase old entries
for(std::map<long, web::json::value>::iterator it = this->m_mStack.at(market.second).begin(); it != this->m_mStack.at(market.second).end(); )
{
if((*it).first < seqNum)
it = this->m_mStack.at(market.second).erase(it);
else
++it;
}
// This container is used to store the updates to the Orderbook temporarily.
std::vector<Order> addOrderStack{};
while(this->m_mStack.at(market.second).find(seqNum) != this->m_mStack.at(market.second).end())// has entry for seqNum
{
web::json::value json = this->m_mStack.at(market.second).at(seqNum);
for(auto& v : json[4].as_array())
{
if(v["type"].as_string().compare("orderBookModify") == 0)
{
Order::Type t = v["data"]["type"].as_string().compare("ask") == 0 ? Order::Type::Ask : Order::Type::Bid;
Order newOrder(std::stod(v["data"]["rate"].as_string()), std::stod(v["data"]["amount"].as_string()), t, market.second, this->m_pclParent, v.serialize());
addOrderStack.push_back(newOrder);
} else if(v["type"].as_string().compare("orderBookRemove") == 0)
{
Order::Type t = v["data"]["type"].as_string().compare("ask") == 0 ? Order::Type::Ask : Order::Type::Bid;
Order newOrder(std::stod(v["data"]["rate"].as_string()), 0, t, market.second, this->m_pclParent, v.serialize());
addOrderStack.push_back(newOrder);
} else if(v["type"].as_string().compare("newTrade") == 0)
{
//
} else
{
throw std::runtime_error("Unknown message format");
}
}
this->m_mStack.at(market.second).erase(seqNum);
seqNum++;
}
// The actual OrderList gets modified here. The mistake CANNOT be inside OrderList::addOrderStack, because I am running Orderbooks for other exchanges too and they use the same method to modify the Orderbook, and they do not get inconsistent.
if(addOrderStack.size() > 0)
OrderList::addOrderStack(addOrderStack);
this->m_mCurrentSeq.at(market.second) = seqNum;
}
}
So if this runs for a longer period, the Orderbook becomes inconsistent. That means Orders which should have been removed are still available and there are wrong entrys inside the book. I am not quite sure why this is happening. Maybe I did something wrong with the sequence-numbers, because it seems that the Update-Stack does not always get executed correctly. I have tried everything that came to my mind but I could not get it to work and now I am out of ideas what could be wrong. If you have any questions please feel free to ask.

tl;dr: Poloniex API is imperfect and drops messages. Some simply never arrive. I've found that this happens for all users subscribed regardless of location in the world.
Hope that answer regarding utilization of Autobahn|cpp to connect to Poloniex' Websocket API (here) was useful. I suspect you had already figured it out though (otherwise this question/problem couldn't exist for you). As you might have gathered, I too have a Crypto Currency Bot written in C++. I've been working on it off and on now for about 3.5 years.
The problem set you're facing is something I had to overcome as well. In this case, I'd prefer not to provide my source code as the speed at which you process this can have huge effects on your profit margins. However, I will give sudo code that offers some very rough insight into how I'm handling Web Socket events processing for Poloniex.
//Sudo Code
void someClass::handle_poloniex_ws_event(ws_event event){
if(event.seq_num == expected_seq_num){
process_ws_event(event)
update_expected_seq_num
}
else{
if(in_cache(expected_seq_num){
process_ws_event(from_cache(expected_seq_num))
update_expected_seq_num
}
else{
cache_event(event)
}
}
}
Note that what I've written above is a super simplified version of what I'm actually doing. My actual solution is about 500+ lines long with "goto xxx" and "goto yyy" throughout. I recommend taking timestamps/cpu clock cycle counts and comparing to current time/cycle counts to help you make decisions at any given moment (such as, should I wait for the missing event, should I continue processing and note to the rest of the program that there may be inaccuracies, should I utilize a GET request to refill my table, etc.?). The name of the game here is speed, as I'm sure you know. Good luck! Hope to hear from ya. :-)

Loop every x seconds based on process speed

I am implementing a basic (just for kiddies) anti-cheat for my game. I've included a timestamp to each of my movement packets and do sanity checks on server side for the time difference between those packets.
I've also included a packet that sends a timestamp every 5 seconds based on process speed. But it seems like this is a problem when the PC lags.
So what should I use to check if the process time is faster due to "speed hack"?
My current loop speed check on client:
this_time = clock();
time_counter += (double)(this_time - last_time);
last_time = this_time;
if (time_counter > (double)(5 * CLOCKS_PER_SEC))
{
time_counter -= (double)(5 * CLOCKS_PER_SEC);
milliseconds ms = duration_cast<milliseconds>(system_clock::now().time_since_epoch());
uint64_t curtime = ms.count();
if (state == WALK) {
// send the CURTIME to server
}
}
// other game loop function
The code above works fine if the clients PC doesn't lag maybe because of RAM or CPU issues. They might be running too many applications.
Server side code for reference: (GoLang)
// pktData[3:] packet containing the CURTIME from client
var speed = pickUint64(pktData, 3)
var speedDiff = speed - lastSpeed
if lastSpeed == 0 {
speedDiff = 5000
}
lastSpeed = speed
if speedDiff < 5000 /* 5000 millisec or 5 sec */ {
c.hackDetect("speed hack") // hack detect when speed is faster than the 5 second send loop in client
}

Your system has a critical flaw which makes it easy to circumvent for cheaters: It relies on the timestamp provided by the client. Any data you receive from the client can be manipulated by a cheater, so it must not be trusted.
If you want to check for speed hacking on the server:
log the current position of the players avatar at irregular intervals. Store the timestamp of each log according to the server-time.
Measure the speed between two such logs-entries by calculating the distance and divide it by the timestamp-difference.
When the speed is larger than the speed limit of the player, then you might have a cheater. But keep in mind that lags can lead to sudden spikes, so it might be better to take the average speed measurement of multiple samples to detect if the player is speed-hacking. This might make the speedhack-detection less reliable, but that might actually be a good thing, because it makes it harder for hackers to know how reliable any evasion methods they use are working.
To avoid false-positives, remember to keep track of any artificial ways of moving players around which do not obey the speed limit (like teleporting to spawn after being killed). When such an event occurs, the current speed measurement is meaningless and should be discarded.

running a background process on arduino

I am trying to get my arduino mega to run a function in the background while it is also running a bunch of other functions.
The function that I am trying to run in the background is a function to determine wind speed from an anemometer. The way it processes the data is similar to that of an odometer in that it reads the number of turns that the anemometer makes during a set time period and then takes that number of turns over the time to determine the wind speed. The longer time period that i have it run over the more accurate data i receive as there is more data to average.
The problem that i have is there is a bunch of other data that i am also reading in to the arduino which i would like to be reading in once a second. This one second time interval is too short for me to get accurate wind readings as not enough revolutions are being completed by the anemometer to give high accuracy wind data.
Is there a way to have the wind sensor function run in the background and update a global variable once every 5 seconds or so while the rest of my program is running simultaneously and updating the other data every second.
Here is the code that i have for reading the data from the wind sensor. Every time the wind sensor makes a revolution there is a portion where the signal reads in as 0, otherwise the sensor reads in as a integer larger than 0.
void windmeterturns(){
startime = millis();
endtime = startime + 5000;
windturncounter = 0;
turned = false;
int terminate = startime;
while(terminate <= endtime){
terminate = millis();
windreading = analogRead(windvelocityPin);
if(windreading == 0){
if(turned == true){
windturncounter = windturncounter + 1;
turned = false;
}
}
else if(windreading >= 1){
turned = true;
}
delay(5);
}
}
The rest of the processing of takes place in another function but this is the one that I am currently struggling with. Posting the whole code would not really be reasonable here as it is close to a 1000 lines.
The rest of the functions run with a 1 second delay in the loop but as i have found through trial and error the delay along with the processing of the other functions make it so that the delay is actually longer than a second and it varies based off of what kind of data i am reading in from the other sensors so a 5 loop counter for timing i do not think will work here

Let Interrupts do the work for you.
In short, I recommend using a Timer Interrupt to generate a periodic interrupt that measures the analog reading in the background. Subsequently this can update a static volatile variable.
See my answer here as it is a similar scenario, detailing how to use the timer interrupt. Where you can replace the callback() with your above analogread and increment.

Without seeing how the rest of your code is set up, I would try having windturncounter as a global variable, and add another integer that is iterated every second your main program loops. Then:
// in the main loop
if(iteratorVariable >= 5){
iteratorVariable = 0;
// take your windreading and implement logic here
} else {
iteratorVariable++;
}
I'm not sure how your anemometer stores data or what other challenges you might be facing, so this may not be a 100% solution, but it would allow you to run the logic from your original post every five seconds.

Of these 3 methods for reading linked lists from shared memory, why is the 3rd fastest?

I have a 'server' program that updates many linked lists in shared memory in response to external events. I want client programs to notice an update on any of the lists as quickly as possible (lowest latency). The server marks a linked list's node's state_ as FILLED once its data is filled in and its next pointer has been set to a valid location. Until then, its state_ is NOT_FILLED_YET. I am using memory barriers to make sure that clients don't see the state_ as FILLED before the data within is actually ready (and it seems to work, I never see corrupt data). Also, state_ is volatile to be sure the compiler doesn't lift the client's checking of it out of loops.
Keeping the server code exactly the same, I've come up with 3 different methods for the client to scan the linked lists for changes. The question is: Why is the 3rd method fastest?
Method 1: Round robin over all the linked lists (called 'channels') continuously, looking to see if any nodes have changed to 'FILLED':
void method_one()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
while(true)
{
for(std::size_t i = 0; i < channel_list.size(); ++i)
{
Data* current_item = channel_cursors[i];
ACQUIRE_MEMORY_BARRIER;
if(current_item->state_ == NOT_FILLED_YET) {
continue;
}
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[i] = static_cast<Data*>(current_item->next_.get(segment));
}
}
}
Method 1 gave very low latency when then number of channels was small. But when the number of channels grew (250K+) it became very slow because of looping over all the channels. So I tried...
Method 2: Give each linked list an ID. Keep a separate 'update list' to the side. Every time one of the linked lists is updated, push its ID on to the update list. Now we just need to monitor the single update list, and check the IDs we get from it.
void method_two()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));
while(true)
{
ACQUIRE_MEMORY_BARRIER;
if(update_cursor->state_ == NOT_FILLED_YET) {
continue;
}
::uint32_t update_id = update_cursor->list_id_;
Data* current_item = channel_cursors[update_id];
if(current_item->state_ == NOT_FILLED_YET) {
std::cerr << "This should never print." << std::endl; // it doesn't
continue;
}
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[update_id] = static_cast<Data*>(current_item->next_.get(segment));
update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
}
}
Method 2 gave TERRIBLE latency. Whereas Method 1 might give under 10us latency, Method 2 would inexplicably often given 8ms latency! Using gettimeofday it appears that the change in update_cursor->state_ was very slow to propogate from the server's view to the client's (I'm on a multicore box, so I assume the delay is due to cache). So I tried a hybrid approach...
Method 3: Keep the update list. But loop over all the channels continuously, and within each iteration check if the update list has updated. If it has, go with the number pushed onto it. If it hasn't, check the channel we've currently iterated to.
void method_three()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));
while(true)
{
for(std::size_t i = 0; i < channel_list.size(); ++i)
{
std::size_t idx = i;
ACQUIRE_MEMORY_BARRIER;
if(update_cursor->state_ != NOT_FILLED_YET) {
//std::cerr << "Found via update" << std::endl;
i--;
idx = update_cursor->list_id_;
update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
}
Data* current_item = channel_cursors[idx];
ACQUIRE_MEMORY_BARRIER;
if(current_item->state_ == NOT_FILLED_YET) {
continue;
}
found_an_update = true;
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[idx] = static_cast<Data*>(current_item->next_.get(segment));
}
}
}
The latency of this method was as good as Method 1, but scaled to large numbers of channels. The problem is, I have no clue why. Just to throw a wrench in things: if I uncomment the 'found via update' part, it prints between EVERY LATENCY LOG MESSAGE. Which means things are only ever found on the update list! So I don't understand how this method can be faster than method 2.
The full, compilable code (requires GCC and boost-1.41) that generates random strings as test data is at: http://pastebin.com/0kuzm3Uf
Update: All 3 methods are effectively spinlocking until an update occurs. The difference is in how long it takes them to notice the update has occurred. They all continuously tax the processor, so that doesn't explain the speed difference. I'm testing on a 4-core machine with nothing else running, so the server and the client have nothing to compete with. I've even made a version of the code where updates signal a condition and have clients wait on the condition -- it didn't help the latency of any of the methods.
Update2: Despite there being 3 methods, I've only tried 1 at a time, so only 1 server and 1 client are competing for the state_ member.

Hypothesis: Method 2 is somehow blocking the update from getting written by the server.
One of the things you can hammer, besides the processor cores themselves, is your coherent cache. When you read a value on a given core, the L1 cache on that core has to acquire read access to that cache line, which means it needs to invalidate the write access to that line that any other cache has. And vice versa to write a value. So this means that you're continually ping-ponging the cache line back and forth between a "write" state (on the server-core's cache) and a "read" state (in the caches of all the client cores).
The intricacies of x86 cache performance are not something I am entirely familiar with, but it seems entirely plausible (at least in theory) that what you're doing by having three different threads hammering this one memory location as hard as they can with read-access requests is approximately creating a denial-of-service attack on the server preventing it from writing to that cache line for a few milliseconds on occasion.
You may be able to do an experiment to detect this by looking at how long it takes for the server to actually write the value into the update list, and see if there's a delay there corresponding to the latency.
You might also be able to try an experiment of removing cache from the equation, by running everything on a single core so the client and server threads are pulling things out of the same L1 cache.

I don't know if you have ever read the Concurrency columns from Herb Sutter. They are quite interesting, especially when you get into the cache issues.
Indeed the Method2 seems better here because the id being smaller than the data in general would mean that you don't have to do round-trips to the main memory too often (which is taxing).
However, what can actually happen is that you have such a line of cache:
Line of cache = [ID1, ID2, ID3, ID4, ...]
^ ^
client server
Which then creates contention.
Here is Herb Sutter's article: Eliminate False Sharing. The basic idea is simply to artificially inflate your ID in the list so that it occupies one line of cache entirely.
Check out the other articles in the serie while you're at it. Perhaps you'll get some ideas. There's a nice lock-free circular buffer I think that could help for your update list :)

I've noticed in both method 1 and method 3 you have a line, ACQUIRE_MEMORY_BARRIER, which I assume has something to do with multi-threading/race conditions?
Either way, method 2 doesn't have any sleeps which means the following code...
while(true)
{
if(update_cursor->state_ == NOT_FILLED_YET) {
continue;
}
is going to hammer the processor. The typical way to do this kind of producer/consumer task is to use some kind of semaphore to signal to the reader that the update list has changed. A search for producer/consumer multi threading should give you a large number of examples. The main idea here is that this allows the thread to go to sleep while it's waiting for the update_cursor->state to change. This prevents this thread from stealing all the cpu cycles.

The answer was tricky to figure out, and to be fair would be hard with the information I presented though if anyone actually compiled the source code I provided they'd have a fighting chance ;) I said that "found via update list" was printed after every latency log message, but this wasn't actually true -- it was only true for as far as I could scrollback in my terminal. At the very beginning there were a slew of updates found without using the update list.
The issue is that between the time when I set my starting point in the update list and my starting point in each of the data lists, there is going to be some lag because these operations take time. Remember, the lists are growing the whole time this is going on. Consider the simplest case where I have 2 data lists, A and B. When I set my starting point in the update list there happen to be 60 elements in it, due to 30 updates on list A and 30 updates on list B. Say they've alternated:
A
B
A
B
A // and I start looking at the list here
B
But then after I set the update list to there, there are a slew of updates to B and no updates to A. Then I set my starting places in each of the data lists. My starting points for the data lists are going to be after that surge of updates, but my starting point in the update list is before that surge, so now I'm going to check for a bunch of updates without finding them. The mixed approach above works best because by iterating over all the elements when it can't find an update, it quickly closes the temporal gap between where the update list is and where the data lists are.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js