Safely and unambiguously manipulating atomic variables in C++11 - c++

I have to read some data (which is coming at a blinding speed - up to 5000 messages per second) from a multicast (UDP) stream. Because the stream is multicast (and the data is quite critical) the data provider has provided two streams that send identical data (their logic being that the possibility of the same packet dropping in both streams is very close to zero). All data packets are tagged with a sequence number to keep track.
Also, the application is so time critical that I am forced to listen to both streams in parallel and pick up the next sequence number from whichever multicast stream it was received on first; when the same packet comes on the mirror stream, I simply drop it.
I am planning to implement this drop feature using a common "sequence_number" variable between the two functions - which by the way run in different threads. The sequence number is atomic as it is going to be read and updated from two different threads.
The obvious algorithm that comes to mind is
if (sequence number received from the stream > sequence_number)
{
    process packet;
    sequence_number = sequence number received from the stream;
}
(The above algorithm needs to be modified for times when sequence numbers come out of order - and they can, as it is a UDP stream - but let's forget about that for the time being.)
My question is this:
From the time I load() my sequence_number, check whether it is smaller than the sequence number received from the stream, accept the packet, and finally store() the new value back into sequence_number, the other stream may receive the same packet (with the same sequence number) and perform the same operations; if it does so before the first stream finishes its store(), I will essentially end up with the same packet twice in my system. What is a way to overcome this situation?

Don't put off worrying about handling out of order packets until later, because solving that also provides the most elegant solution to synchronizing threads.
Elements of an array are unique memory locations for the purposes of data races. If you put each packet (atomically via pointer write) into a different array element according to its sequence number, you'll get rid of most of the contention. Also use compare-exchange to detect whether the other thread (other stream) has already seen that packet.
Note that you won't have the retry loop normally associated with compare-exchange: either you have the first copy of the packet and the compare-exchange succeeds, or the packet already exists and your copy can be discarded. So this approach is not only lock-free but also wait-free :)
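A minimal sketch of that idea, assuming an illustrative Packet type and slot count (slot reuse and cleanup of old entries are not shown):
#include <atomic>
#include <cstddef>
#include <cstdint>

struct Packet { /* payload, seqnum, etc. */ };

// One slot per in-flight sequence number; sized so that two packets alive
// at the same time never share a slot (an assumption of this sketch).
constexpr std::size_t kSlots = 65536;   // power of two
std::atomic<Packet*> g_slots[kSlots];   // static storage: starts as nullptr

// Called by both stream threads. Returns true if this thread was first;
// the loser sees a non-null slot and can discard its copy of the packet.
bool try_claim(Packet* p, std::uint64_t seqnum) {
    Packet* expected = nullptr;
    return g_slots[seqnum & (kSlots - 1)].compare_exchange_strong(
        expected, p, std::memory_order_acq_rel);
}
Each thread calls try_claim() once per packet; exactly one of the two calls for a given seqnum succeeds, so no retry loop is needed.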

Here is one option, if you are using std::atomic values, using compare_exchange.
Not shown is how to initialize last_processed_seqnum, as you'll need to set it to a valid value, namely, one less than the seqnum of the next packet to arrive.
It will need to be adapted for the case in which there are sequence number gaps. You mention as part of your premise that there will be no dropped seqnums; but the example below will stop processing packets (i.e. fail catastrophically) upon any seqnum gaps.
std::atomic<int> last_processed_seqnum;

// sync last_processed_seqnum to first message(s).

int seqnum_from_stream = ...;
int putative_last_processed_seqnum = seqnum_from_stream - 1;

if (last_processed_seqnum.compare_exchange_strong(putative_last_processed_seqnum,
                                                  seqnum_from_stream))
{
    // sequence number has been updated in compare_exchange_strong
    // process packet;
}
Ideally, what we want is a compare_exchange function that uses greater than, not equals. I don't know of any way to achieve that behavior in one operation. The SO question I linked to links to an answer about iterating over all values less than a target to update.
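A "store if greater" can, however, be emulated with the standard CAS retry loop (the fetch-max idiom); this is a sketch of that idiom, not a single hardware operation:
#include <atomic>

// Returns true if seq was newer and we advanced the counter; false if
// another thread has already recorded seq or something newer (duplicate).
bool advance_if_newer(std::atomic<int>& last_seq, int seq) {
    int observed = last_seq.load(std::memory_order_relaxed);
    while (seq > observed) {
        if (last_seq.compare_exchange_weak(observed, seq,
                                           std::memory_order_acq_rel))
            return true;   // we won the race; process the packet
        // on failure, observed now holds the current value; re-check
    }
    return false;          // duplicate or stale packet; drop it
}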

You are probably implementing a price feed handler. Which exchange is it, and what protocol? Is it ITCH or FIX/FAST? I would not recommend two threads for the same feed, since you probably have to join several multicast groups for different market segments/boards.

Related

c++: streaming and rate regulation

An interesting question I was asked in an interview:
Suppose you are constantly receiving some byte-stream from a source (let's assume a client-server model) at a variable rate. You want to sort the packets on-the-fly at your end and retransmit them elsewhere at a constant rate. How would you implement such a system in C++?
I offered a basic system with a worker thread pushing packets into a heap and a dispatcher thread popping sorted packets and sending them away in sync with some internal clock in constant intervals of X.
The interviewer reasonably argued that such a system is prone to miss its retransmission deadlines due to context switching between the threads. I replied that without any control of the thread scheduling algorithms on the specific machine, I can't guarantee a constant retransmission rate. He followed with an insistence that made me think that maybe I'm wrong and this is in fact achievable. So, am I?
The description of the problem is too vague. What is meant by "sorting"? Do packets have some sort of sequence numbers in their bodies? What if we never receive a packet with some sequence number? Here are my thoughts on a generic algorithm that may be adapted to different specific situations.
Parameters
The algorithm depends on the following parameters:
1. The max number of packets we are allowed to buffer (e.g. 4096).
2. The high watermark value (HWV). It is a percentage relative to (1). If we currently have more packets in the buffer than the HWV, we send a packet even if it is not in sorted order (50%, or 2048).
3. TTL -- the max time we allow a packet to stay buffered.
Variables
First of all, we need a ring buffer of fixed length. The ring buffer consists of:
An array[4096] of buffered packets.
An array[4096] of pointers into the above array (to avoid copying the packets themselves during the sort operation).
An array[4096] of meta-information for each packet: its sequence number (parsed from the packet) and the timestamp of when the packet was received.
Pointers to head and tail.
We also need to store a global variable: the next sequence number to send.
Algorithm
When packet arrives, we add our packet into buffer to the proper position (one step of insertion sort), so our buffer is always sorted by sequence number. After that, we may want to send packet from head of buffer if at least one of the following is true:
The seq number of the head packet is less than or equal to the expected seq number. This means that the head packet is next in sorted order. Typical situation.
The number of packets in the buffer exceeds the high watermark value (2048). We send the head packet (despite the fact that it is not in sorted order) because we are afraid that a burst of incoming activity may fill the rest of our buffer and we will have to throw away further incoming packets.
The current time minus the arrival time of the head packet exceeds the TTL.
If at least one of the above is true, we send the head packet and remove it from the buffer. We also assign the expected sequence number to be the seq number of the sent packet plus one. If, after sending the head packet, the buffer is not empty, we also (re)start a timer with the TTL value. When the timer fires, we perform the very same checks above. This is to avoid keeping packets in the buffer indefinitely (in case there are no more incoming packets).
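A minimal sketch of the head-of-buffer send check described above, with illustrative names (Buffered, expected_seq, kHighWatermark, kTtl); the insertion sort and the timer plumbing are omitted:
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <deque>

struct Buffered {
    std::uint32_t seq;
    std::chrono::steady_clock::time_point arrived;
    // payload omitted
};

std::deque<Buffered> buffer;                       // kept sorted by seq on insert
std::uint32_t expected_seq = 0;                    // next sequence number to send
const std::size_t kHighWatermark = 2048;           // 50% of a 4096-entry buffer
const auto kTtl = std::chrono::milliseconds(50);   // illustrative TTL

bool should_send_head() {
    if (buffer.empty()) return false;
    const Buffered& head = buffer.front();
    return head.seq <= expected_seq                // next in sorted order
        || buffer.size() > kHighWatermark          // buffer pressure
        || std::chrono::steady_clock::now() - head.arrived > kTtl;  // TTL expired
}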

Searching for a Binary Value

I am trying to find a way to identify the start of a chunk of data sent via a TCP socket. The data chunk has the value of the integer 1192 written into it as the first four bytes, followed by the content length. How can I search the binary data (the char* received) for this value? I realize I can loop through and advance the pointer by one each time, copy out the first four bytes, and check it, but that isn't the most elegant or possibly efficient solution.
Is there also another way this could be done that I'm not thinking of?
Thanks in advance.
It sounds like linear scanning might be required, but you shouldn't really be losing your message positioning if the sending side of the connection is making its send()/write() calls in a sensible manner, you are reading in your buffers properly, and there isn't an indeterminate amount of "dead" space in the stream between messages.
If the protocol itself is sensible (there is at least a length field!), you should never lose track of message boundaries. Just read the marker/length pair, then read length payload bytes; the next message should start immediately after, so ideally a linear scan should never have to go anywhere.
Also, don't bother copying explicitly, just cast (strictly speaking, a memcpy into a uint32_t is the portable way around alignment and aliasing rules, but the cast is the common idiom):
// call ntohl() to convert from network byte order if need be...
uint32_t x = *reinterpret_cast<const uint32_t *>(charptr);
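To make the framing concrete, here is a hedged sketch of reading one marker/length-delimited message over POSIX sockets; read_fully() and the 4-byte-marker/4-byte-length layout are assumptions of this sketch, not details from the question:
#include <arpa/inet.h>    // ntohl
#include <sys/socket.h>   // recv
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

const uint32_t kMarker = 1192;   // the value from the question

// Loop over recv() until exactly n bytes have arrived (or the peer closes).
static bool read_fully(int fd, void* buf, size_t n) {
    char* p = static_cast<char*>(buf);
    while (n > 0) {
        ssize_t r = recv(fd, p, n, 0);
        if (r <= 0) return false;     // error or connection closed
        p += r;
        n -= static_cast<size_t>(r);
    }
    return true;
}

bool read_message(int fd, std::vector<char>& payload) {
    char header[8];
    if (!read_fully(fd, header, sizeof header)) return false;

    uint32_t marker, length;
    std::memcpy(&marker, header, 4);      // memcpy avoids alignment issues
    std::memcpy(&length, header + 4, 4);
    marker = ntohl(marker);
    length = ntohl(length);

    if (marker != kMarker) return false;  // stream out of sync; resync needed
    payload.resize(length);
    return read_fully(fd, payload.data(), length);
}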

Is this an appropriate use for shared_ptr?

Project: typical chat program. Server must receive text from multiple clients and fan each input out to all clients.
In the server I want to have each client to have a struct containing the socket fd and a std::queue. Each structure will be on a std::list.
As input is received from a client socket I want to iterate over the list of structs and put new input into each client struct's queue. A string is new[ed] because I don't want copies of the string multiplied over all the clients. But I also want to avoid the headache of having multiple pointers to the string spread out, and of deciding when it is time to finally delete the string.
Is this an appropriate occasion for a shared pointer? If so, is the shared_ptr's use count incremented each time I push one into a queue and decremented when I pop one from a queue?
Thanks for any help.
This is a case where a pseudo-garbage collector system will work much better than reference counting.
You need only one list of strings, because you "fan every input out to all clients". Because you will add to one end and remove from the other, a deque is an appropriate data structure.
Now, each connection needs only to keep track of the index of the last string it sent. Periodically (every 1000th message received, or every 4MB received, or something like that), you find the minimum of this index across all clients, and delete strings up to that point. This periodic check is also an opportunity to detect clients which have fallen far behind (possible broken connection) and recover. Without this check, a single stuck client will cause your program to leak memory (even under the reference counting scheme).
This scheme uses several times less data than reference counting, and it also removes one of the major points of cache contention (reference counts must be written from multiple threads, so they ruin performance). If you aren't using threads, it'll still be faster.
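A minimal sketch of this scheme, with illustrative names (Client, trim); the sending loop and the periodic trigger are omitted:
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <list>
#include <string>

struct Client {
    int fd;
    std::size_t next_index = 0;  // next message to send, counted from the
                                 // start of the stream
};

std::deque<std::string> messages;   // the single shared fan-out buffer
std::size_t trimmed = 0;            // messages already removed from the front
std::list<Client> clients;

// A client's pending message lives at messages[c.next_index - trimmed].
// Periodically drop everything that every client has already sent:
void trim() {
    std::size_t min_index = SIZE_MAX;
    for (const Client& c : clients)
        min_index = std::min(min_index, c.next_index);
    while (trimmed < min_index && !messages.empty()) {
        messages.pop_front();       // everyone has passed this message
        ++trimmed;
    }
}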
That is an appropriate use of a shared_ptr. And yes, the use count will be incremented, because a new shared_ptr will be created by each push.
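A minimal sketch of the use-count behaviour, assuming a per-client std::queue of shared pointers:
#include <memory>
#include <queue>
#include <string>

using MsgPtr = std::shared_ptr<const std::string>;

void fan_out(std::queue<MsgPtr>& client_queue, const MsgPtr& msg) {
    client_queue.push(msg);  // copies the shared_ptr: use count goes up
}
// Popping (client_queue.pop()) destroys that copy: use count goes down.
// The string itself is deleted when the last client's queue releases it.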

Buffering Incomplete High Speed Reads

I am reading data ~100 bytes at 100hz from a serial port. My buffer is 1024 bytes, so often my buffer doesn't get completely used. Sometimes however, I get hiccups from the serial port and the buffer gets filled up.
My data is organized as a [header]data[checksum]. When my buffer gets filled up, sometimes a message/data is split across two reads from the serial port.
This is a simple problem, and I'm sure there are a lot of different approaches. I am ahead of schedule, so I would like to research different approaches. Could you guys name some paradigms that cover buffering high speed data that might need to be put together across two reads? Note, the main difference I see between this problem and, say, other buffering I've done (image acquisition, TCP/IP) is that there we are guaranteed full packets/messages. Here a "packet" may be split between reads, which we will only know once we start parsing the data.
Oh yes, note that the data buffered in from the read has to be parsed, so to make things simple, the data should be contiguous when it reaches the parsing. (Plus I don't think that's the parser's responsibility)
Some Ideas I Had:
Carry over unused bytes to my original buffer, then fill it with the read after the leftover bytes from the previous read. (For example, we read 1024 bytes; 24 bytes are left at the end as a partial message; memcpy them to the beginning of read_buffer_, pass the beginning + 24 to the read, and read in 1024 - 24.)
Create my own class that just gets blocks of data. It has two pointers, read/write and a large chunk of memory (1024 * 4). When you pass in the data, the class updates the write pointer correctly, wraps around to the beginning of its buffer when it reaches the end. I guess like a ring buffer?
I was thinking maybe using a std::vector<unsigned char>. Dynamic memory allocation, guaranteed to be contiguous.
Thanks for the info guys!
Define some 'APU' application-protocol-unit class that will represent your '[header]data[checksum]'. Give it some 'add' function that takes a char parameter and returns a 'valid' bool. In your serial read thread, create an APU and read some data into your 1024-byte buffer. Iterate the data in the buffer, pushing it into the APU add() until either the APU add() function returns true or the iteration is complete. If the add() returns true, you have a complete APU - queue it off for handling, create another one and start add()-ing the remaining buffer bytes to it. If the iteration is complete, loop back round to read more serial data.
The add() method would use a state-machine, or other mechanism, to build up and check the incoming bytes, returning 'true' only in the case of a full sanity-checked set of data with the correct checksum. If some part of the checking fails, the APU is 'reset' and waits to detect a valid header.
The APU could maybe parse the data itself, either byte-by-byte during the add() data input, just before add() returns with 'true', or perhaps as a separate 'parse()' method called later, perhaps by some other APU-processing thread.
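A hedged sketch of the byte-at-a-time add() state machine described above; the concrete frame layout (a 0xAA header byte, one length byte, an XOR checksum) is an illustrative assumption, not something from the question:
#include <cstddef>
#include <cstdint>
#include <vector>

class Apu {
public:
    // Feed one byte; returns true when a checksum-valid frame is complete.
    bool add(std::uint8_t b) {
        switch (state_) {
        case State::Header:
            if (b == 0xAA) state_ = State::Length;  // wait for a valid header
            return false;
        case State::Length:
            length_ = b;
            payload_.clear();
            state_ = length_ ? State::Payload : State::Checksum;
            return false;
        case State::Payload:
            payload_.push_back(b);
            if (payload_.size() == length_) state_ = State::Checksum;
            return false;
        case State::Checksum: {
            std::uint8_t sum = 0;
            for (std::uint8_t p : payload_) sum ^= p;
            state_ = State::Header;                 // reset for the next frame
            return sum == b;                        // true only if valid
        }
        }
        return false;
    }
    const std::vector<std::uint8_t>& payload() const { return payload_; }
private:
    enum class State { Header, Length, Payload, Checksum };
    State state_ = State::Header;
    std::size_t length_ = 0;
    std::vector<std::uint8_t> payload_;
};
The reading loop then just iterates the 1024-byte buffer, calling add() on each byte and queueing the completed APU whenever it returns true; a frame split across two reads is handled for free, because the state machine carries over between reads.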
When reading from a serial port at speed, you typically need some kind of handshaking mechanism to control the flow of data. This can be hardware (e.g. RTS/CTS), software (Xon/Xoff), or controlled by a higher level protocol. If you're reading a large amount of data at speed without handshaking, your UART or serial controller needs to be able to read and buffer all the available data at that speed to ensure no data loss. On 16550 compatible UARTs that you see on Windows PCs, this buffer is just 14 bytes, hence the need for handshaking or a real time OS.

How does LMAX's disruptor pattern work?

I am trying to understand the disruptor pattern. I have watched the InfoQ video and tried to read their paper. I understand there is a ring buffer involved, and that it is initialized as an extremely large array to take advantage of cache locality and to eliminate allocation of new memory.
It sounds like there are one or more atomic integers which keep track of positions. Each 'event' seems to get a unique id, and its position in the ring is found by taking its modulus with respect to the size of the ring, etc., etc.
Unfortunately, I don't have an intuitive sense of how it works. I have done many trading applications and studied the actor model, looked at SEDA, etc.
In their presentation they mentioned that this pattern is basically how routers work; however I haven't found any good descriptions of how routers work either.
Are there some good pointers to a better explanation?
The Google Code project does reference a technical paper on the implementation of the ring buffer; however, it is a bit dry, academic, and tough going for someone wanting to learn how it works. However, there are some blog posts that have started to explain the internals in a more readable way: there is an explanation of the ring buffer that is the core of the disruptor pattern, a description of the consumer barriers (the part related to reading from the disruptor), and some information on handling multiple producers.
The simplest description of the Disruptor is: It is a way of sending messages between threads in the most efficient manner possible. It can be used as an alternative to a queue, but it also shares a number of features with SEDA and Actors.
Compared to Queues:
The Disruptor provides the ability to pass a message on to another thread, waking it up if required (similar to a BlockingQueue). However, there are 3 distinct differences.
The user of the Disruptor defines how messages are stored by extending the Entry class and providing a factory to do the preallocation. This allows for either memory reuse (copying), or the Entry could contain a reference to another object.
Putting messages into the Disruptor is a 2-phase process: first a slot is claimed in the ring buffer, which provides the user with the Entry that can be filled with the appropriate data. Then the entry must be committed; this 2-phase approach is necessary to allow for the flexible use of memory mentioned above. It is the commit that makes the message visible to the consumer threads.
It is the responsibility of the consumer to keep track of the messages that have been consumed from the ring buffer. Moving this responsibility away from the ring buffer itself helped reduce the amount of write contention, as each thread maintains its own counter.
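The 2-phase put can be sketched roughly as follows (a C++-flavored sketch for consistency with the rest of this thread; the real Disruptor is Java, its API differs, and gating when the ring is full is omitted):
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Entry { long value = 0; };

class Ring {
public:
    explicit Ring(std::size_t size) : slots_(size) {}  // size: power of two

    // Phase 1: claim a sequence number and get the pre-allocated entry.
    Entry& claim(std::uint64_t& seq) {
        seq = next_.fetch_add(1, std::memory_order_relaxed);
        return slots_[seq & (slots_.size() - 1)];
    }

    // Phase 2: commit. Entries become visible to consumers strictly in
    // claim order; a writer waits for earlier claims to be committed first.
    void commit(std::uint64_t seq) {
        while (cursor_.load(std::memory_order_acquire) != seq) { /* spin */ }
        cursor_.store(seq + 1, std::memory_order_release);
    }

    // Consumers may read every entry below this sequence.
    std::uint64_t published() const {
        return cursor_.load(std::memory_order_acquire);
    }

private:
    std::vector<Entry> slots_;
    std::atomic<std::uint64_t> next_{0};    // next sequence to claim
    std::atomic<std::uint64_t> cursor_{0};  // first uncommitted sequence
};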
Compared to Actors
The Actor model is closer to the Disruptor than most other programming models, especially if you use the BatchConsumer/BatchHandler classes that are provided. These classes hide all of the complexities of maintaining the consumed sequence numbers and provide a set of simple callbacks when important events occur. However, there are a couple of subtle differences.
The Disruptor uses a 1 thread - 1 consumer model, where Actors use an N:M model, i.e. you can have as many actors as you like and they will be distributed across a fixed number of threads (generally 1 per core).
The BatchHandler interface provides an additional (and very important) callback onEndOfBatch(). This allows slow consumers, e.g. those doing I/O, to batch events together to improve throughput. It is possible to do batching in other Actor frameworks; however, as nearly all other frameworks don't provide a callback at the end of the batch, you need to use a timeout to determine the end of the batch, resulting in poor latency.
Compared to SEDA
LMAX built the Disruptor pattern to replace a SEDA based approach.
The main improvement that it provided over SEDA was the ability to do work in parallel. To do this the Disruptor supports multi-casting the same messages (in the same order) to multiple consumers. This avoids the need for fork stages in the pipeline.
We also allow consumers to wait on the results of other consumers without having to put another queuing stage between them. A consumer can simply watch the sequence number of a consumer that it is dependent on. This avoids the need for join stages in the pipeline.
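That dependency mechanism is just a sequence watch; a minimal sketch (names are illustrative, and the busy spin stands in for whatever wait strategy is configured):
#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> seq_A{0};  // highest entry consumer A has finished

// Consumer B, which depends on A, simply waits until A has passed the
// entry B wants to process next:
void wait_for_dependency(std::uint64_t wanted) {
    while (seq_A.load(std::memory_order_acquire) < wanted) {
        // spin, yield, or block, depending on the wait strategy
    }
}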
Compared to Memory Barriers
Another way to think about it is as a structured, ordered memory barrier, where the producer barrier forms the write barrier and the consumer barrier forms the read barrier.
First we'd like to understand the programming model it offers.
There are one or more writers. There are one or more readers. There is a line of entries, totally ordered from old to new (pictured as left to right). Writers can add new entries on the right end. Every reader reads entries sequentially from left to right. Readers can't read past writers, obviously.
There is no concept of entry deletion. I use "reader" instead of "consumer" to avoid the image of entries being consumed. However we understand that entries on the left of the last reader become useless.
Generally readers can read concurrently and independently. However, we can declare dependencies among readers. Reader dependencies can form an arbitrary acyclic graph. If reader B depends on reader A, reader B can't read past reader A.
Reader dependency arises because reader A can annotate an entry, and reader B depends on that annotation. For example, A does some calculation on an entry and stores the result in field a in the entry. A then moves on, and now B can read the entry, and the value of a that A stored. If reader C does not depend on A, C should not attempt to read a.
This is indeed an interesting programming model. Regardless of the performance, the model alone can benefit lots of applications.
Of course, LMAX's main goal is performance. It uses a pre-allocated ring of entries. The ring is large enough, but it's bounded so that the system will not be loaded beyond design capacity. If the ring is full, writer(s) will wait until the slowest readers advance and make room.
Entry objects are pre-allocated and live forever, to reduce garbage collection cost. We don't insert new entry objects or delete old entry objects; instead, a writer asks for a pre-existing entry, populates its fields, and notifies readers. This apparent 2-phase action is really simply an atomic action:
setNewEntry(EntryPopulator);

interface EntryPopulator { void populate(Entry existingEntry); }
Pre-allocating entries also means adjacent entries are (very likely) located in adjacent memory cells, and because readers read entries sequentially, this is important for utilizing CPU caches.
And there are lots of efforts to avoid locks, CAS, even memory barriers (e.g. using a non-volatile sequence variable if there's only one writer).
For developers of readers: different annotating readers should write to different fields, to avoid write contention. (Actually, they should write to different cache lines.) An annotating reader should not touch anything that other non-dependent readers may read. This is why I say these readers annotate entries, instead of modifying entries.
Martin Fowler has written an article about LMAX and the disruptor pattern, The LMAX Architecture, which may clarify it further.
I actually took the time to study the actual source, out of sheer curiosity, and the idea behind it is quite simple. The most recent version at the time of writing this post is 3.2.1.
There is a buffer storing pre-allocated events that will hold the data for consumers to read.
The buffer is backed by an array of flags (an integer array) of its length that describes the availability of the buffer slots (see further for details). The array is accessed like Java's AtomicIntegerArray, so for the purposes of this explanation you may as well assume it to be one.
There can be any number of producers. When a producer wants to write to the buffer, a long number is generated (as in calling AtomicLong#getAndIncrement; the Disruptor actually uses its own implementation, but it works in the same manner). Let's call this generated long a producerCallId. In a similar manner, a consumerCallId is generated when a consumer ENDS reading a slot from the buffer. The most recent consumerCallId is accessed.
(If there are many consumers, the call with the lowest id is chosen.)
These ids are then compared, and if the difference between the two is less than the buffer size, the producer is allowed to write.
(If the producerCallId is greater than the recent consumerCallId + bufferSize, it means that the buffer is full, and the producer is forced to busy-wait until a spot becomes available.)
The producer is then assigned the slot in the buffer based on its callId (which is producerCallId modulo bufferSize, but since the bufferSize is always a power of 2 (a limit enforced on buffer creation), the actual operation used is producerCallId & (bufferSize - 1)). It is then free to modify the event in that slot.
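A quick illustration of that power-of-two index trick (in C++ for consistency with the rest of this thread; the Disruptor itself is Java):
#include <cassert>
#include <cstdint>

int main() {
    const std::uint64_t bufferSize = 1024;        // must be a power of two
    const std::uint64_t producerCallId = 123456789;
    // Masking is equivalent to modulo when the size is a power of two:
    assert(producerCallId % bufferSize == (producerCallId & (bufferSize - 1)));
    // The "loop number" used as the availability flag is the quotient,
    // computed as a right shift (1024 == 2^10):
    assert(producerCallId / bufferSize == producerCallId >> 10);
    return 0;
}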
(The actual algorithm is a bit more complicated, involving caching the recent consumerCallId in a separate atomic reference, for optimisation purposes.)
When the event has been modified, the change is "published": when publishing, the respective slot in the flag array is filled with the updated flag. The flag value is the number of the loop (producerCallId divided by bufferSize; again, since bufferSize is a power of 2, the actual operation is a right shift).
In a similar manner there can be any number of consumers. Each time a consumer wants to access the buffer, a consumerCallId is generated (depending on how the consumers were added to the disruptor, the atomic used in id generation may be shared or separate for each of them). This consumerCallId is then compared to the most recent producerCallId, and if it is the lesser of the two, the reader is allowed to progress.
(Similarly, if the producerCallId is equal to the consumerCallId, it means that the buffer is empty and the consumer is forced to wait. The manner of waiting is defined by a WaitStrategy during disruptor creation.)
For individual consumers (the ones with their own id generator), the next thing checked is the ability to batch consume. The slots in the buffer are examined in order, from the one corresponding to the consumerCallId (the index is determined in the same manner as for producers) to the one corresponding to the recent producerCallId.
They are examined in a loop by comparing the flag value written in the flag array against a flag value generated for the consumerCallId. If the flags match, it means that the producers filling the slots have committed their changes. If not, the loop is broken, and the highest committed changeId is returned. The slots from consumerCallId up to the returned changeId can be consumed in batch.
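A hedged sketch of that batch-availability scan (C++ again for consistency; sizes and types are illustrative, and real implementations initialize the flags to -1 so that loop 0 is distinguishable from an empty slot):
#include <atomic>
#include <cstdint>

constexpr std::uint64_t kBufferSize = 1024;    // power of two (2^10)
std::atomic<std::int64_t> flags[kBufferSize];  // per-slot availability flags

// Scan from our next read position up to the producers' cursor and return
// the highest sequence whose slot has already been published.
std::int64_t highest_published(std::int64_t from, std::int64_t to) {
    for (std::int64_t seq = from; seq <= to; ++seq) {
        std::int64_t expected = seq >> 10;     // the "loop number" flag
        if (flags[seq & (kBufferSize - 1)].load(std::memory_order_acquire)
                != expected)
            return seq - 1;                    // first unpublished slot found
    }
    return to;                                 // the whole range is available
}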
If a group of consumers read together (the ones with a shared id generator), each one takes only a single callId, and only the slot for that single callId is checked and returned.
From this article:
The disruptor pattern is a batching queue backed up by a circular array (i.e. the ring buffer) filled with pre-allocated transfer objects, which uses memory-barriers to synchronize producers and consumers through sequences.
Memory-barriers are kind of hard to explain and Trisha's blog has done the best attempt in my opinion with this post: http://mechanitis.blogspot.com/2011/08/dissecting-disruptor-why-its-so-fast.html
But if you don't want to dive into the low-level details, you can just know that memory-barriers in Java are implemented through the volatile keyword or through java.util.concurrent.AtomicLong. The disruptor pattern's sequences are AtomicLongs and are communicated back and forth among producers and consumers through memory-barriers instead of locks.
I find it easier to understand a concept through code, so the code below is a simple hello world from CoralQueue, a disruptor pattern implementation done by CoralBlocks, with which I am affiliated. In the code below you can see how the disruptor pattern implements batching and how the ring buffer (i.e. circular array) allows for garbage-free communication between two threads:
package com.coralblocks.coralqueue.sample.queue;

import com.coralblocks.coralqueue.AtomicQueue;
import com.coralblocks.coralqueue.Queue;
import com.coralblocks.coralqueue.util.MutableLong;

public class Sample {

    public static void main(String[] args) throws InterruptedException {

        final Queue<MutableLong> queue = new AtomicQueue<MutableLong>(1024, MutableLong.class);

        Thread consumer = new Thread() {
            @Override
            public void run() {
                boolean running = true;
                while (running) {
                    long avail;
                    while ((avail = queue.availableToPoll()) == 0); // busy spin
                    for (int i = 0; i < avail; i++) {
                        MutableLong ml = queue.poll();
                        if (ml.get() == -1) {
                            running = false;
                        } else {
                            System.out.println(ml.get());
                        }
                    }
                    queue.donePolling();
                }
            }
        };

        consumer.start();

        MutableLong ml;
        for (int i = 0; i < 10; i++) {
            while ((ml = queue.nextToDispatch()) == null); // busy spin
            ml.set(System.nanoTime());
            queue.flush();
        }

        // send a message to stop consumer...
        while ((ml = queue.nextToDispatch()) == null); // busy spin
        ml.set(-1);
        queue.flush();

        consumer.join(); // wait for the consumer thread to die...
    }
}