Packet CRC Computation Approach - C++

I am writing a class that reads in incoming packets of serial data. The packets are laid out as a header, some data, and a trailing two-byte CRC.
I have also written a class with which I can build up packets to send. This class has a GenerateCRC() method which allows the caller to compute a CRC for a packet they have built up via calls to other methods. GenerateCRC() is only meant to be called once the packet header and data have been set up properly; accordingly, the method iterates over the packet in a for loop and computes the CRC that way.
Now that I'm writing code to read in the packets, I need to verify them by computing a CRC. I'm trying to reuse the previous "builder" class as much as possible: as I read in a packet I want to store it in memory, and the best way to do so is to use the "builder" class. However, I have hit a snag with the computation of the CRC.
There are two main approaches I'm considering, and I am having trouble weighing the pros and cons to decide on one. Here are my two choices:
Option 1: Compute the CRC as I read in the bytes. The data that I'm reading in is pushed onto a queue, so I pop off the bytes one at a time. I would keep a running "total" CRC and be finished with the computation as soon as the last data byte is read in.
Option 2: Compute the CRC only once I have read in the full packet. In this case, I don't have to keep a running total, but I would have to iterate over the packet again. I should note that this would allow me to reuse my previously written code.
Currently I am leaning towards option 1 and moving any functionality common to the "builder" and the "reader" into a separate header file. However, I want to make sure that the first option really is the better one in terms of performance, since it does make my code a bit more jumbled.
Thanks in advance for the help.

I would pick Door #2. That allows simpler validation of the code by using identical code on both ends, and also permits faster CRC algorithms to be used that process four or eight bytes at a time.
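As a minimal sketch of what that shared routine could look like, here is a bitwise CRC-16/CCITT-FALSE implementation; the question doesn't name the actual polynomial or initial value, so those are placeholder assumptions, and both the builder's GenerateCRC() and the reader's verification could delegate to something like it:

    #include <cstddef>
    #include <cstdint>

    // Shared helper, usable by both the packet builder and the packet reader.
    // Polynomial 0x1021 / init 0xFFFF (CRC-16/CCITT-FALSE) is an assumption.
    std::uint16_t ComputeCrc16(const std::uint8_t* data, std::size_t len)
    {
        std::uint16_t crc = 0xFFFF;
        for (std::size_t i = 0; i < len; ++i) {
            crc ^= static_cast<std::uint16_t>(data[i]) << 8;
            for (int bit = 0; bit < 8; ++bit) {
                // Shift out the top bit; XOR in the polynomial when it was set.
                crc = (crc & 0x8000) ? static_cast<std::uint16_t>((crc << 1) ^ 0x1021)
                                     : static_cast<std::uint16_t>(crc << 1);
            }
        }
        return crc;
    }

The reader runs this over the received header and data and compares the result with the trailing two bytes; a table-driven or slice-by-4/8 variant can later replace the body without changing any call sites.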


Visual Studio (C++): How to send more than one value with TCP socket?

It's probably a pretty basic question, but I couldn't really find anything online, so I hope you can help me out here! :)
I'm currently working with C++ in Visual Studio and I'm trying to send different variables over a TCP connection at the same time. But I'm not quite sure what the best way to do that would look like.
I was thinking about sending one long string with all the variables at once, with a symbol between every variable marking its end, like:
123.45#16.45#33#true#.....
but the program has to run as fast as possible. An algorithm that sorts out every incoming variable doesn't sound very efficient, does it?
My second thought was to create a socket for every variable I have, each on a different port. But I'm not quite sure whether that would be bad practice.
So what do you think? Any ideas/experiences?
Thanks in advance!
Before answering this question, we need to align on some concepts.
TCP is a stream protocol. It's as if you're accessing a regular file, except that the socket API hides many complex tasks from the user. When you write to a socket, you always append to that socket's TCP stream. So accessing the same socket from two different threads will not give you what you want; instead, you get two separate atomic data segments added to the TCP stack (you have just made two append operations to the stream).
"sending at the same time" may have 2 different meanings:
i) sending the variables at the same TCP session.
In this case, the TCP session is no different from writing to a binary file. This is a serialization problem rather than a network protocol problem. You have many options:
a) Convert everything to a string representation and send that buffer (fairly slow and inefficient). This is the simplest method. You also need some additional protocol to structure the written data, as you mentioned (adding # in between); JSON and XML are good examples of that.
b) If you know the exact order to write and read (encode/decode), just write ints, char arrays, shorts, and longs as binary (though float and double serialization may not be that straightforward). The decoder will read the binary data in exactly the same order. Remember to convert to network byte order if you pick this solution; a sketch of this appears after point ii) below.
ii) sending the variables simultaneously
If you really want to send many variables at the same time (I mean almost simultaneously), then you'd better have many threads and many sockets in that case. Maybe you are reading integer sensor data from one sensor that produces thousands of integers per second, and double values from another sensor. The meaning of simultaneously here is that you do not have to wait for the whole batch of integers to be read before reading doubles, or vice versa. This also saves encoding/decoding time.
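To make option (b) concrete, here is a hedged sketch of a fixed-order binary encoder using network byte order. The record layout and names are invented for illustration; htonl() is declared in <arpa/inet.h> on POSIX and <winsock2.h> on Windows:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <arpa/inet.h>  // htonl(); on Windows include <winsock2.h> instead

    // Hypothetical record: a 32-bit id followed by one float reading.
    // The decoder must read the fields back in exactly this order.
    std::size_t EncodeSample(std::uint8_t* out, std::uint32_t id, float value)
    {
        std::uint32_t netId = htonl(id);
        std::memcpy(out, &netId, sizeof netId);

        // There is no htonl for float; a common approach is to send the raw
        // IEEE-754 bits in network byte order (assumes IEEE-754 on both ends).
        std::uint32_t bits;
        std::memcpy(&bits, &value, sizeof bits);
        bits = htonl(bits);
        std::memcpy(out + sizeof netId, &bits, sizeof bits);

        return sizeof netId + sizeof bits;  // total bytes written (8)
    }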

Search block in network stream as quick as possible

I want to ask about methods for searching for a pattern in a network stream.
My current method is to allocate a big cache, put data from the socket into the cache, and, when the data size exceeds a threshold, start to search for all of the sync headers (using the KMP algorithm) in the cache. It works, but feels somewhat cumbersome.
The header is a very simple flag such as "0xFFEEBBAA1290".
Is there a trick to check for the header as quickly as possible in real time, without accumulation? That is, while receiving data, check whether a complete data block has arrived just in time.
The data arrives continuously, with no interval to indicate where one data block ends and the next begins.
I used a circular buffer to find the first header and the next header, which together delimit a whole block, but the numerous modulo operations (for the circular buffer index) slow down the speed drastically. I just used memcmp to find the header.
FYI, my preferred language is C/C++.
I hope to get your advice. Any reference links are also welcome.
Thank you.
Please allow me to add some details about this problem.
The data comes from a board which is not under my control.
The device sends data starting from an arbitrary position in its source and doesn't follow any rule such as starting with a packet header at the front when the connection is established. Even worse, the block length is not fixed, so I must extract a block by finding two consecutive headers.
In my approach, I try to find the first header at the start; if it is not found, I drop bytes one at a time until the header arrives.
This way, I can at least guarantee that the first header is at the beginning of the cache (the cache is much smaller than in the KMP approach, because I don't want to search for headers after a delay), and then continue to receive data while checking for the next header simultaneously.
Once a block is found, the block data is moved off to other processing, and the second header is moved to the front of the cache.
This means the cache has to be re-aligned to accept the next data, which is why I used a circular buffer (storing the data in an array): I just update the read and write positions rather than actually moving the remaining data in the cache.
I tried std::list and std::vector but did not use them, because of the byte-chunk operations involved and performance considerations.
The problem is that I have to continuously check for the next header while data arrives.
Is there an elegant way to avoid such frequent byte scanning?
Or, if the speed is reasonable, I can accept the frequent byte scan, but the modulo operations for calculating the read and write positions in the circular buffer seem to slow down performance.
I have used different profiling tools, and all of them indicate that the frequent modulo is the performance bottleneck.
"as quick as possible in realtime" is already a contradiction. Realtime means as fast as the data arrives; no need to be faster than that. In fact, realtime often is slower than batch processing.
Realtime also requires hard figures for the time available and the time taken, neither of which is available here.
Your header appears to be less than 8 bytes, i.e. within one cache line. KMP or similar algorithms are unlikely to be needed. Checking all bytes in a cache line for 0xFF is almost certainly faster than checking a single byte against each of 0xFF, 0xEE, 0xBB, 0xAA, 0x12, and 0x90 in turn.
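As a sketch of that idea (not a drop-in solution): use memchr, which C libraries typically vectorise, to scan for the first header byte, and only run a full comparison at candidate positions. The header bytes are taken from the question:

    #include <cstddef>
    #include <cstring>

    static const unsigned char kHeader[6] = {0xFF, 0xEE, 0xBB, 0xAA, 0x12, 0x90};

    // Returns a pointer to the first complete header in [p, p + len), or
    // nullptr if none is present.
    const unsigned char* FindHeader(const unsigned char* p, std::size_t len)
    {
        while (len >= sizeof kHeader) {
            // Only scan positions where a whole header could still fit.
            const void* hit = std::memchr(p, 0xFF, len - sizeof kHeader + 1);
            if (!hit)
                return nullptr;
            const unsigned char* c = static_cast<const unsigned char*>(hit);
            if (std::memcmp(c, kHeader, sizeof kHeader) == 0)
                return c;
            len -= static_cast<std::size_t>(c - p) + 1;  // skip past this 0xFF
            p = c + 1;
        }
        return nullptr;
    }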
Now, "numerous modulo (for circular buffer index) operations slow down the speed drastically" is a realistic problem. But that does have a straightforward solution. Make sure that the buffer size is a compile time constant, and a power of two. x%(1<<N) is equal to x & ((1<<N)-1)

How to write data into a buffer and write the buffer into a binary file with a second thread?

I am getting data from a sensor (a camera) and writing it into a binary file. The problem is that it takes a lot of space on the disk.
So I used the compression from Boost (zlib) and the space used was reduced a lot! The problem is that the compression process is slow and a lot of data goes missing.
So I want to implement two threads: one getting the data from the camera and writing it into a buffer, and a second taking the data at the front of the buffer and writing it into the binary file. That way, all the data will be present.
How do I implement this buffer? It needs to grow dynamically and support pop_front. Shall I use std::deque, or does something better already exist?
First, you have to consider these four rates (or speeds):
Speed of Production (SP): The average number of bytes your sensor produces per second.
Speed of Compression (SC): The average number of bytes per second you can compress. This is the number of input bytes to the compression algorithm.
Rate of Compression (RC): The average ratio of compressed to uncompressed data your compression algorithm produces (the ratio of the output size to the input size of the compression). This is obviously somewhere between 0 and 1.
Speed of Writing (SW): The average number of bytes you can write to disk, per second.
If SC is less than SP, you are in trouble. It means you can't compress all the data you gather from your sensor in real time, which means you'll eventually run out of buffer memory. You'll have to find a faster compression algorithm or dedicate more CPU cores to compression.
If SW is less than SP times RC (which is the rate at which compressed sensor data is produced), you are again in trouble. It means you can't write out your output data as fast as you are producing and compressing it, and again you will eventually run out of buffer memory, no matter how much you have. You might be able to gain some speed by adopting a better write strategy or file system, but a real gain in SW comes from a better disk system (RAID, SSD, better hardware, etc.).
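To make that concrete with purely illustrative numbers: if the sensor produces SP = 100 MB/s and the compressor achieves RC = 0.3, the compressed stream runs at 100 × 0.3 = 30 MB/s, so the disk must sustain SW ≥ 30 MB/s or the buffers will grow without bound.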
Now, if everything is OK speed-wise, you can probably employ something like the following architecture to read, compress and write the data out:
You'll have three threads (or two, as described later), each performing one stage of the pipeline. You'll also have two thread-safe queues, one for communication from each stage of the pipeline to the next.
Assuming the two queues are named Q1 and Q2, the high-level operation of the threads will look like this:
Input Thread:
Read K bytes of sensor data
Put the whole K bytes as a unit on Q1.
Go to 1.
Compression Thread:
Wait till there is something on Q1.
Pop one buffer of data (probably K bytes) from Q1.
Compress the buffer into a hopefully smaller buffer and put it on Q2.
Go to 1.
Output Thread:
Wait till there is something on Q2.
Pop one buffer of data from Q2.
Write the buffer to the output file.
Go to 1.
The most CPU-intensive part of the work is in the second thread, while the other two probably don't consume much CPU time and can therefore probably share a CPU core. This means the above strategy may be runnable on two cores. It can also run on a single core if the workload is light, or require many, many cores; that all depends on the four rates I described up top.
Using asynchronous writes (e.g. IOCP on Windows or epoll on Linux), you can drop the third thread and the second queue altogether. Your second thread then needs to execute something like this:
Wait till there is something on Q1.
Pop one buffer of data (probably K bytes) from Q1.
Compress the buffer into a hopefully smaller buffer.
Issue an asynchronous write request to the OS to write out the compressed buffer to disk.
Go to 1.
There are four more issues worth mentioning:
K should be selected so that the time required for the various (usually constant-time) activities associated with allocating a buffer, pushing it onto and popping it from a thread-safe queue, starting a compression run, and issuing a write request to a file becomes negligible relative to doing the actual work (reading sensor data, compressing bytes, and writing to disk). This usually means that K should be as large as possible. But if K is very large (many megabytes or hundreds of megabytes), then a crash of your application will lose a lot of data, so you need to find a balance between performance and the risk of data loss. I suggest (without any knowledge of your specific needs and constraints) a value between 10 KiB and 1 MiB for K.
Implementing a thread-safe queue is easy if you have some knowledge of and experience with concurrent/parallel programming, but rather hard and error-prone if you do not. Finding good examples and implementations should not be hard. A plain std::deque or std::list or std::anything won't be usable by itself, but can be used as a good basis for writing a thread-safe queue; a sketch appears after this list.
Note that you are queuing buffers of data, not individual numbers or bytes. If you pass your data one number at a time through this pipeline, it will be painfully slow and wasteful.
Some compression algorithms are limited in how much data they can consume per invocation, or require that each call to the compression routine be matched with one call to the decompression routine later on. This might affect the choice of K, and also how you write your output file: you might have to add some metadata so that you can actually decompress and read the data later.
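As promised above, here is a minimal sketch of such a thread-safe queue built on std::deque, assuming C++11; Buffer stands in for one K-byte chunk, and all names are illustrative:

    #include <condition_variable>
    #include <deque>
    #include <mutex>
    #include <utility>
    #include <vector>

    using Buffer = std::vector<unsigned char>;  // one K-byte unit of data

    class ThreadSafeQueue {
        std::deque<Buffer> q_;
        std::mutex m_;
        std::condition_variable cv_;
    public:
        void push(Buffer b) {
            {
                std::lock_guard<std::mutex> lk(m_);
                q_.push_back(std::move(b));
            }
            cv_.notify_one();  // wake one waiting consumer
        }
        Buffer pop() {  // blocks until an item is available
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !q_.empty(); });
            Buffer b = std::move(q_.front());
            q_.pop_front();
            return b;
        }
    };

The input thread calls push() with each K-byte read; the compression thread blocks in pop(), compresses, and pushes onto the second queue (or issues the asynchronous write in the two-thread variant).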

How to implement arbitration of feed A and feed B in the FAST financial protocol?

I need to implement feed arbitration for the FAST protocol. The problem is pretty common, and there are even hardware solutions. As the problem is widely known, I think there should be at least general suggestions on how to implement it (how many queues should I use, how many ring buffers, how many readers, when to drop packets, etc.), and probably someone can point me to an implementation. For those who are not familiar with FAST, I add some description:
Data in all UDP feeds is disseminated in two identical feeds (A and B) on two different multicast IPs. It is strongly recommended that the client receive and process both feeds because of possible UDP packet loss. Processing two identical feeds allows one to statistically decrease the probability of packet loss.
It is not specified in which particular feed (A or B) a message appears first. To arbitrate these feeds one should use the message sequence number found in the Preamble or in tag 34-MsgSeqNum. Using the Preamble allows one to determine the message sequence number without decoding the FAST message.
Processing messages from feeds A and B should be performed using the following algorithm:
Listen to feeds A and B.
Process messages according to their sequence numbers.
Ignore a message if one with the same sequence number was already processed before.
// tcp recover algorithm further
So I think the solution should be something like this:
For each of the two feeds, create a dedicated thread and a dedicated buffer, and add data to the buffer as it arrives. (Should it be a ring buffer, a queue, or something else?)
Create "reader" which "spin" and checks both thread for the last available "sequence number". As soon as "sequence number" is available next packet need to be processed and both threads should drop it after that.
Any suggestions on how to implement the algorithm itself, and on which structures to use, are welcome. In particular, perhaps someone can suggest a lock-free queue / ring-buffer implementation.
The FAST protocol is usually used for market data streaming, so data arrives in UDP packets, usually multicast. These packets are sequenced, so if you need to arbitrate between two redundant feeds, all you have to do is process the next expected packet from whichever channel delivers it first. The duplicate packet coming in late you simply drop. Arbitration not only decreases your chances of losing a packet (you have to lose it on both channels), it also decreases your latency, since you always have a second option in case one of the feeds gets slow.
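A minimal sketch of that rule, tracking only the next expected sequence number; the starting value and the gap handling (which would hand off to the recovery mechanism) are assumptions:

    #include <cstdint>

    class FeedArbiter {
        std::uint64_t nextSeq_ = 1;  // first expected MsgSeqNum (assumption)
    public:
        // Call for every packet received on either feed A or feed B.
        // Returns true if the packet should be processed, false if it is a
        // late duplicate already seen on the other feed.
        bool accept(std::uint64_t seq) {
            if (seq < nextSeq_)
                return false;            // dupe: drop it
            // seq > nextSeq_ means a gap on both feeds; recovery not shown.
            nextSeq_ = seq + 1;
            return true;
        }
    };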
You should be more concerned with decoding the FAST bits, which can be time-consuming. Check CoralFIX for an example of how to generate source code for FAST decoding from an exchange XML template.
Disclaimer: I am one of the developers of CoralFIX.

C++ IPC Communication

I am in a dilemma over the decision between the scenarios below and would kindly ask for expert help.
Scenario: There is TCP/IP communication between two processes running on two boxes.
Communication Method 1: Stream-based communication over the socket, where the receiver receives the byte stream, interprets the first few fixed bytes as a header, deserializes it to learn the message length, takes a message of that length and deserializes it, then proceeds to the next message header, and so on.
Communication Method 2: Put all the messages in a vector, with the vector residing in a class object. Serialize the class object in one go and send it to the receiver. The receiver deserializes the class object and reads the vector entries one by one.
Please let me know which approach is more efficient, and if there is another approach, please point me to it.
Also, what are the pros and cons of class-based versus structure-based data transmission, and which is suitable for which scenario?
Your question lacks some key details, and mixes different concerns, frustrating any attempt to provide a good answer.
Specifically, Method 2 mysteriously "serialises" and "deserialises" the object and the contained vector without specifying any details of how that is done. In practice, those details are of the kind alluded to in Method 1. So, 1 and 2 aren't alternatives unless you're choosing between using a serialisation library and doing it from scratch (in which case I'd say use the library, as you're new to this and the library is more likely to get it right).
What I can say:
at the TCP level, it's most efficient to read into a decent-sized block (given I tend to work on PC/server hardware, I'd just use 64k, though smaller may be enough to get the same kind of throughput) and have each read() or recv() pull as much data from the socket as possible
after reading enough bytes (in however many read/recvs) to attempt some interpretation of the data, it's necessary to recognise the end of particular parts of the serialised input: sometimes that's implicit in the data type involved, other times it's communicated using some sentinel (e.g. a linefeed or NUL), and other times there's a prefixed fixed-size "expect N bytes" header. This consideration often applies hierarchically, to the stream of objects and nested sub-objects, etc.
the TCP reads/recvs may deliver more data than was sent in any single request, so you may have one or more bytes at the end of the block assembled above that logically belong to the subsequent, still-incomplete logical message
the process of reading larger blocks and then accessing various fixed- and variable-sized elements inside the buffers is already supported by C++ iostreams, but you can roll your own if you want
So, let me emphasise this: do NOT assume you will receive more than 1 byte from any given read of the socket. If you have, say, a 20-byte header, you should loop, reading until you either hit an error or have assembled all 20 bytes. Sending 20 bytes in a single write() or send() does not mean the 20 bytes will be presented to a single read()/recv(). TCP is a byte-stream protocol, and you have to take arbitrary numbers of bytes as and when they're provided, waiting until you have enough data to interpret. Similarly, be prepared to get more data in one read than the client wrote in a single write()/send().
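A minimal sketch of that read loop, assuming POSIX sockets (Winsock differs slightly in types and error reporting); ReadExact is an invented name:

    #include <cstddef>
    #include <sys/types.h>
    #include <sys/socket.h>  // recv()

    // Keeps calling recv() until exactly len bytes have been assembled.
    // Returns false on error or if the peer closes the connection early.
    bool ReadExact(int fd, void* buf, std::size_t len)
    {
        unsigned char* p = static_cast<unsigned char*>(buf);
        while (len > 0) {
            ssize_t n = recv(fd, p, len, 0);
            if (n <= 0)
                return false;  // 0: orderly shutdown; <0: error
            p += n;
            len -= static_cast<std::size_t>(n);
        }
        return true;
    }

A 20-byte header then becomes ReadExact(fd, header, 20), followed by a second ReadExact for however many body bytes the header announces.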
Also, what are the pros and cons of class-based versus structure-based data transmission, and which is suitable for which scenario?
These terms are completely bogus. Classes and structures are almost identical things in C++ - mechanisms for grouping data and related functions (they differ only in whether they expose their base classes and data members to client code by default). Either can have, or lack, member functions or support code that helps serialise and deserialise the data. For example, the simplest and most typical support is a pair of operator<< and/or operator>> streaming functions.
If you want to contrast this kind of streaming function with an ad-hoc "write a binary block, read a binary block" approach (perhaps from thinking of structs as PODs without support code), then I'd say prefer streaming functions where possible, starting with streaming to human-readable representations, as they'll make your system easier and quicker to develop, debug, and support. Once you're really comfortable with that, and if runtime performance requires it, optimise with a binary representation. If you write the serialisation code well, you won't notice much difference in performance between a cruder void*/#bytes model of the data and proper per-member serialisation, but the latter can more easily support unusual cases: portability across systems with different-sized ints/longs, different byte ordering, intentional choices about shallow vs. deep copying of pointed-to data, and so on.
I'd also recommend looking at the boost serialisation library. Even if you don't use it, it should give you a better understanding of how this kind of thing is reasonably implemented in C++.
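To illustrate the streaming-function approach with a human-readable representation, here is a minimal sketch; the Message type and its wire layout are invented for the example:

    #include <iostream>
    #include <string>

    struct Message {
        int         id;
        std::string payload;
    };

    // Human-readable form: "<id> <length> <payload bytes>". Writing the
    // length first tells the reader where the payload ends.
    std::ostream& operator<<(std::ostream& os, const Message& m)
    {
        return os << m.id << ' ' << m.payload.size() << ' ' << m.payload;
    }

    std::istream& operator>>(std::istream& is, Message& m)
    {
        std::string::size_type len = 0;
        if (is >> m.id >> len) {
            is.ignore(1);  // skip the single separator space
            m.payload.resize(len);
            if (len > 0)
                is.read(&m.payload[0], static_cast<std::streamsize>(len));
        }
        return is;
    }

The same operators work over a std::stringstream for testing and over a socket-backed stream later, which is part of why starting human-readable makes debugging quicker.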
Both methods are equivalent. In both, you must send a header with the message size and an identifier in order to deserialize. If you assume that the first option is composed of a serialized 'class', like a normal message, you must implement the same 'code'.
Another thing to keep in mind is message size, so as to fill the TCP buffers and optimize communication. If your 1st method's messages are small, try to improve the ratio with bigger messages, as in the 2nd option you describe.
Keep in mind that it's not safe to simply stream out a struct or class directly by interpreting it as a sequence of bytes, even if it's a simple POD - there are issues like endianness (which is unlikely to be a real-world problem for most of us) and structure alignment/padding (which is a potential problem).
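A small demonstration of the padding point; the exact layout is compiler- and ABI-dependent, so treat the printed size as typical rather than guaranteed:

    #include <cstdint>
    #include <iostream>

    struct Sample {
        std::uint8_t  flag;   // 1 byte
        // the compiler typically inserts 3 padding bytes here
        std::uint32_t value;  // 4 bytes, usually 4-byte aligned
    };

    int main()
    {
        // Commonly prints 8, not 5: the raw bytes of the struct are not a
        // portable wire format, even though both members are plain integers.
        std::cout << sizeof(Sample) << '\n';
    }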
C++ doesn't have any built-in serialization/deserialization, so you'll either have to roll your own or take a look at things like Boost.Serialization or Google's protobuf.
If it is not homework or a study project, there may be little point in fiddling with IPC at the TCP stream level, especially if that's not something you are familiar with.
Use a messaging library, like ØMQ, to send and receive complete messages rather than streams of bytes.