Searching for a Binary Value - C++

I am trying to find a way to identify the start of a chunk of data sent via a TCP socket. The data chunk has the value of the integer 1192 written into it as the first four bytes, followed by the content length. How can I search the binary data (the char* received) for this value? I realize I can loop through, advancing the pointer by one each time, copy out four bytes and check them, but that isn't the most elegant, or probably the most efficient, solution.
Is there another way this could be done that I'm not thinking of?
Thanks in advance.

It sounds like linear scanning might be required, but you shouldn't really be losing your message position if the sending side of the connection is making its send()/write() calls in a sensible manner, you are reading into your buffers properly, and there isn't an indeterminate amount of "dead" space in the stream between messages.
If the protocol itself is sensible (there is at least a length field!), you should never lose track of message boundaries. Just read the marker/length pair, then read length payload bytes; the next message should start immediately after that, so ideally a linear scan never has to go anywhere.
Also, don't bother copying explicitly, just cast:
// call ntohl() to convert from network byte order if need be...
uint32_t x = *reinterpret_cast<uint32_t *>(charptr);
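For completeness, here is a minimal sketch of that marker/length/payload loop on a plain POSIX socket. read_exact() and read_message() are made-up helper names, and I'm assuming both header fields are 4-byte integers in network byte order; adjust if your sender writes them differently.
#include <arpa/inet.h>   // ntohl()
#include <sys/socket.h>  // recv()
#include <sys/types.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read exactly n bytes from the socket, looping until done (or failure).
static bool read_exact(int fd, void *buf, size_t n)
{
    char *p = static_cast<char *>(buf);
    while (n > 0) {
        ssize_t got = recv(fd, p, n, 0);
        if (got <= 0) return false;        // error or peer closed the connection
        p += got;
        n -= static_cast<size_t>(got);
    }
    return true;
}

// Read one marker/length-framed message. Returns false on error or a bad marker.
static bool read_message(int fd, std::vector<char> &payload)
{
    uint32_t header[2];                    // marker (1192) followed by content length
    if (!read_exact(fd, header, sizeof header)) return false;
    if (ntohl(header[0]) != 1192) return false;   // stream is out of sync
    payload.resize(ntohl(header[1]));
    return read_exact(fd, payload.data(), payload.size());
}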

Can TCP data overlap in the buffer

If I keep sending data to a receiver, is it possible for separate sends to accumulate in the buffer so that the next read also picks up data from another send?
I'm using Qt and readAll() to receive data and parse it. This data has some structure to it, so I can tell whether it is complete and whether it is valid at all, but I'm worried that data from different sends will overlap when I call readAll() and so invalidate this supposed-to-be valid data.
If that can happen, how do I prevent/control it? Or is that something the OS/API takes care of instead? I'm worried partly because of how the method is called. lol
TCP is a stream-based connection, not a packet-based connection, so you may not assume that what is sent in one call will also be received in one read. You still need some kind of protocol to packetize your stream.
For sending strings, you could use the NUL character as a separator, or you could begin with a header that contains a magic number and a length.
According to http://qt-project.org/doc/qt-4.8/qiodevice.html#readAll this function snarfs all the data and returns it as an array. I don't see how the API raises concerns about overlapping data. The array is returned by value, and since it represents the entire stream, what would it even overlap with? Are you worried that the returned object actually has reference semantics (i.e. that it just holds pointers to storage that is re-used in other calls to the same function)?
If send and receive buffers overlap in any system, that's a bug, unless special care is taken that their use is completely serialized (i.e. a buffer is somehow used only for sending or only for receiving, without any mix-up).
Why don't you use a fixed-length header followed by a variable-length packet, with the header holding the length of the packet?
This way you can avoid worrying about packet boundaries. For example, instead of just sending the string, send the length of the string followed by the string. On the receiving end, always read the length first and then read the string based on that length.
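A minimal sketch of that idea on a plain socket, for illustration (send_all() is a made-up helper; the receiving side mirrors it by reading the 4-byte length first and then exactly that many bytes, whether via recv() or Qt's readyRead()/read()):
#include <arpa/inet.h>   // htonl()
#include <sys/socket.h>  // send()
#include <sys/types.h>
#include <cstddef>
#include <cstdint>
#include <string>

// Loop until the whole buffer has gone out; send() may accept less than asked.
static bool send_all(int fd, const void *buf, size_t n)
{
    const char *p = static_cast<const char *>(buf);
    while (n > 0) {
        ssize_t sent = send(fd, p, n, 0);
        if (sent <= 0) return false;
        p += sent;
        n -= static_cast<size_t>(sent);
    }
    return true;
}

// Length-prefixed string: a 4-byte network-order length, then the bytes.
bool send_string(int fd, const std::string &s)
{
    uint32_t len = htonl(static_cast<uint32_t>(s.size()));
    return send_all(fd, &len, sizeof len) &&
           send_all(fd, s.data(), s.size());
}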

Buffering Incomplete High Speed Reads

I am reading data ~100 bytes at 100hz from a serial port. My buffer is 1024 bytes, so often my buffer doesn't get completely used. Sometimes however, I get hiccups from the serial port and the buffer gets filled up.
My data is organized as a [header]data[checksum]. When my buffer gets filled up, sometimes a message/data is split across two reads from the serial port.
This is a simple problem, and I'm sure there are a lot of different approaches. I am ahead of schedule, so I would like to research a few of them. Could you guys name some paradigms that cover buffering high-speed data that may need to be stitched together from two reads? Note: the main difference I see between this problem and, say, other buffering I've done (image acquisition, TCP/IP) is that there we were guaranteed full packets/messages. Here a "packet" may be split between reads, which we only discover once we start parsing the data.
Oh yes, note that the buffered data from the reads has to be parsed, so to keep things simple, the data should be contiguous when it reaches the parser. (Plus, I don't think reassembly is the parser's responsibility.)
Some Ideas I Had:
Carry unused bytes over to the front of my original buffer, then fill the rest with the next read placed after the leftover bytes from the previous read. (For example: we read 1024 bytes and the last 24 bytes are a partial message; memcpy them to the beginning of read_buffer_, then pass beginning + 24 to read and read in 1024 - 24. A rough sketch of this is below the list.)
Create my own class that just accepts blocks of data. It has two pointers, read/write, and a large chunk of memory (1024 * 4). When you pass in data, the class advances the write pointer correctly and wraps around to the beginning of its buffer when it reaches the end. I guess like a ring buffer?
I was thinking maybe using a std::vector<unsigned char>. Dynamic memory allocation, guaranteed to be contiguous.
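Here is the rough sketch of idea 1, assuming a blocking read() and a hypothetical parse_messages() helper that consumes only complete messages and returns how many bytes it used:
#include <unistd.h>      // read()
#include <sys/types.h>
#include <cstddef>
#include <cstring>       // memmove()

size_t parse_messages(const char *data, size_t len);   // hypothetical parser

void read_loop(int fd)
{
    char buf[1024];
    size_t leftover = 0;                       // unparsed bytes carried over

    for (;;) {
        ssize_t got = read(fd, buf + leftover, sizeof buf - leftover);
        if (got <= 0) break;                   // error or end of stream

        size_t avail = leftover + static_cast<size_t>(got);
        size_t used = parse_messages(buf, avail);    // parses whole messages only

        leftover = avail - used;
        memmove(buf, buf + used, leftover);    // carry the partial message forward
    }
}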
Thanks for the info guys!
Define some 'APU' application-protocol-unit class that will represent your '[header]data[checksum]'. Give it some 'add' function that takes a char parameter and returns a 'valid' bool. In your serial read thread, create an APU and read some data into your 1024-byte buffer. Iterate the data in the buffer, pushing it into the APU add() until either the APU add() function returns true or the iteration is complete. If the add() returns true, you have a complete APU - queue it off for handling, create another one and start add()-ing the remaining buffer bytes to it. If the iteration is complete, loop back round to read more serial data.
The add() method would use a state-machine, or other mechanism, to build up and check the incoming bytes, returning 'true' only in the case of a full sanity-checked set of data with the correct checksum. If some part of the checking fails, the APU is 'reset' and waits to detect a valid header.
The APU could maybe parse the data itself, either byte-by-byte during the add() data input, just before add() returns with 'true', or perhaps as a separate 'parse()' method called later, perhaps by some other APU-processing thread.
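A rough sketch of what such an APU might look like; the 0xA5 header byte, single-byte length field and additive checksum are invented for illustration, so substitute your real wire format:
#include <cstddef>
#include <cstdint>
#include <vector>

class APU {
public:
    // Feed one byte in; returns true once a full [header]data[checksum] unit
    // has been accumulated and its checksum verified.
    bool add(uint8_t byte)
    {
        switch (state_) {
        case WaitHeader:
            if (byte == 0xA5) { bytes_.assign(1, byte); state_ = WaitLength; }
            return false;
        case WaitLength:
            length_ = byte;                         // payload length
            bytes_.push_back(byte);
            state_ = Payload;
            return false;
        case Payload:
            bytes_.push_back(byte);
            if (bytes_.size() == length_ + 3u) {    // header + length + payload + checksum
                state_ = WaitHeader;                // reset either way
                return checksumOk();                // true only for a sane, complete unit
            }
            return false;
        }
        return false;
    }

    const std::vector<uint8_t> &data() const { return bytes_; }

private:
    bool checksumOk() const
    {
        uint8_t sum = 0;
        for (size_t i = 0; i + 1 < bytes_.size(); ++i) sum += bytes_[i];
        return sum == bytes_.back();                // last byte is the checksum
    }

    enum State { WaitHeader, WaitLength, Payload };
    State state_ = WaitHeader;
    std::vector<uint8_t> bytes_;
    size_t length_ = 0;
};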
When reading from a serial port at speed, you typically need some kind of handshaking mechanism to control the flow of data. This can be hardware (e.g. RTS/CTS), software (Xon/Xoff), or controlled by a higher-level protocol. If you're reading a large amount of data at speed without handshaking, your UART or serial controller needs to be able to read and buffer all the available data at that speed to ensure no data loss. On the 16550-compatible UARTs you see on Windows PCs, this FIFO is only 16 bytes, hence the need for handshaking or a real-time OS.

Sending variable length arrays over a network?

In the game I'm making, I need to be able to send std::vectors of integers over a network.
A packet seems to be made up entirely of a string. Since enet, the network library I'm using, takes care of endianness, my first idea for solving this is to send a message where the first byte is the message id, as usual; the next 4 bytes would be an integer indicating the length of the array, and all subsequent bytes would be the ints in the array. On the client side I can then push these back into a vector.
Is this how it is usually done or am I missing something critical? Is there a better way to do it?
Thanks
In general, there are two approaches to solving this problem which can be combined. One is to put the length of the array before the actual array. The other involves including some framing to mark the end (or beginning) of your message. There are advantages and disadvantages to each.
Putting the length before the array is simplest. However, if there should ever be a bug where the length does not match the number of integers, there is no way to detect this or recover from it.
Using a framing byte (or bytes) to mark the end of the message has the advantage of more robustness and the ability to recover from an improperly formatted message, at the cost of complexity. The complexity comes from the fact that if your framing bytes can appear in your array of integers, you must escape them (i.e. prepend an escape character). Also, your code to read messages from the network becomes a little more complicated.
Of course, this all assumes that you are dealing with a stream and not a datagram. If your messages are clearly packetized already, then the length should be enough.
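If you go the framing route, the escaping typically looks something like HDLC-style byte stuffing; here is an illustrative sketch (the byte values are arbitrary, and the receiver reverses it by un-escaping bytes until it sees the frame byte):
#include <cstdint>
#include <vector>

// FRAME marks the end of a message; ESC escapes any FRAME/ESC byte that
// happens to occur inside the payload.
const uint8_t FRAME = 0x7E;
const uint8_t ESC   = 0x7D;

std::vector<uint8_t> frame_message(const std::vector<uint8_t> &payload)
{
    std::vector<uint8_t> out;
    for (uint8_t b : payload) {
        if (b == FRAME || b == ESC) {
            out.push_back(ESC);
            out.push_back(b ^ 0x20);   // HDLC-style transform of the escaped byte
        } else {
            out.push_back(b);
        }
    }
    out.push_back(FRAME);              // end-of-message marker
    return out;
}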
There are two ways to send variable length data on a stream: by prefixing the data with the length, or by suffixing it with a delimiter.
A suffix can have a theoretically infinite size, but it means the delimiter must not appear in the data. This approach can be used for strings, with a NUL ('\0') character as the delimiter.
When dealing with binary data, you don't have any choice but to prefix the length. The size of the data will be limited by the size of the prefix, which is rarely a problem with a 4-byte prefix (because otherwise it would mean you're sending more than 4 gigabytes of data).
So, it all depends on the data being sent.
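A rough sketch of packing/unpacking the layout you described (id byte, 4-byte count, then the ints). The names are mine and host byte order is assumed, so add conversions if your library doesn't handle that for you:
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

std::vector<uint8_t> pack(uint8_t message_id, const std::vector<int32_t> &values)
{
    std::vector<uint8_t> buf(1 + 4 + 4 * values.size());
    buf[0] = message_id;
    uint32_t count = static_cast<uint32_t>(values.size());
    std::memcpy(buf.data() + 1, &count, 4);
    if (!values.empty())
        std::memcpy(buf.data() + 5, values.data(), 4 * values.size());
    return buf;
}

std::vector<int32_t> unpack(const uint8_t *data, size_t len)
{
    if (len < 5) return std::vector<int32_t>();          // too short to be valid
    uint32_t count = 0;
    std::memcpy(&count, data + 1, 4);
    if (len < 5 + 4 * static_cast<size_t>(count))
        return std::vector<int32_t>();                   // count doesn't match length
    std::vector<int32_t> values(count);
    if (count != 0)
        std::memcpy(values.data(), data + 5, 4 * values.size());
    return values;
}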
Appending that header information to your packet is a good approach to take. Another option on the receive side, if the data is stored in some unsigned char* buffer of memory, is to create a structure like so:
typedef struct network_packet
{
    char id;
    int message_size;
    int data[];
} __attribute__((packed)) network_packet;
You can then simply "overlay" this structure on-top of your received buffer like so:
unsigned char* buffer;
//...fill the buffer and make sure it's the right size -- endian is taken care of
//via library
network_packet* packet_ptr = (network_packet*)buffer;
//access the 10th integer in the packet if packet_ptr->message_size
//says there are at least 10 elements
if (packet_ptr->message_size >= 10) {
    int tenth_int = packet_ptr->data[9];
    // ... use tenth_int ...
}
This saves you the expense of copying all the data, which already exists in a buffer, into another std::vector on the receive side.

Advantages of a client knowing how big the package sent by the server is

I'm really new to network programming, so I hope this isn't a complete newbie question.
I read a tutorial on the Qt homepage about how to build a little server, and I found this:
QByteArray block;
QDataStream out(&block, QIODevice::WriteOnly);
out << (quint16)0;
out << "..."; // just some text
out.device()->seek(0);
out << (quint16)(block.size() - sizeof(quint16));
At the start of our QByteArray, we reserve space for a 16 bit integer that will contain the total size of the data block we are sending. [We continue by streaming in a random fortune.] Then we seek back to the beginning of the QByteArray, and overwrite the reserved 16 bit integer value with the total size of the array. By doing this, we provide a way for clients to verify how much data they can expect before reading the whole packet.
So I want to know: what are the advantages of this procedure? What can happen if you don't do it? Maybe you could also add a little example.
It is standard stuff.
To the receiving program everything coming over the network is just a stream of bytes. The stream has no meaning beyond what the application imposes upon it, exactly the same way a file has no meaning beyond how its records, lines, etc., are defined by the application(s).
The only way the client and server can make sense of the stream is to establish a convention, or protocol, that they agree upon.
So some common ways to accomplish this are to:
have a delimiter that designates the end of a message (e.g. a carriage return)
pass a length field, as in your example, which tells the receiver how much data comprises the next message.
just establish a fixed convention (e.g. every message will be 20 bytes or type 'A' records will be one defined format, type 'B' records another...)
just treat it like a stream by having no convention at all (e.g. take whatever comes over the network and put it in a file w/o paying any attention to what it is)
One advantage of the length-field method is that the receiver knows exactly how much data to expect. With some added sanity checks, this can help eliminate things like buffer overflows in your application.
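For a concrete picture, the matching client-side read (along the lines of the Qt fortune client example; readBlock() and its parameters are my own naming) might look roughly like this, called from a readyRead() slot with blockSize kept between calls and initialised to 0:
#include <QDataStream>
#include <QTcpSocket>

void readBlock(QTcpSocket *socket, quint16 &blockSize)
{
    QDataStream in(socket);

    if (blockSize == 0) {
        if (socket->bytesAvailable() < (qint64)sizeof(quint16))
            return;                  // not even the length field has arrived yet
        in >> blockSize;             // the quint16 the server wrote first
    }
    if (socket->bytesAvailable() < blockSize)
        return;                      // the block is still incomplete; wait for more

    char *text = 0;
    in >> text;                      // the server streamed a C string; QDataStream allocates it
    // ... use text ...
    delete [] text;
    blockSize = 0;                   // ready for the next block
}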
Knowing the packet size before receiving it has a performance advantage.
You can then allocate exactly the needed number of bytes from the heap (or whatever buffer management you use) and receive everything with a few (ideally one) calls to the 'network receive function'.
If you don't know the size in advance, you have to call the 'network receive function' for each small portion of the message.
Since the 'network receive function' (which may be recv() or whatever Qt offers you) is a system call that also does TCP buffer handling and so on, it should be assumed to be slow, with a large per-call overhead. So you should call it as few times as possible.
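To make that concrete, here is a rough sketch with plain POSIX sockets (receive_body() is a made-up name); once the length is known, you can allocate the buffer in one go and let MSG_WAITALL gather the whole thing in a single recv() call:
#include <sys/socket.h>
#include <sys/types.h>
#include <cstdint>
#include <vector>

std::vector<char> receive_body(int fd, uint32_t length)
{
    std::vector<char> body(length);
    // MSG_WAITALL blocks until the full amount arrives, barring errors or EOF.
    ssize_t got = recv(fd, body.data(), body.size(), MSG_WAITALL);
    if (got != static_cast<ssize_t>(body.size()))
        body.clear();                // short read: error, EOF or interruption
    return body;
}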

TCP/IP Message Framing Examples

I'm trying to find concrete examples of how to break up an incoming stream of data on a TCP/IP socket, aggregate it in a buffer of some sort, find the messages in it (variable length, with header + delimiters), and extract them to reconstruct the messages for the receiving application.
Any good pointers/links/examples on an efficient way of doing this would be appreciated, as I couldn't find good examples online and I'm sure this problem has been solved efficiently by others in the past. Specifically:
Efficient memory allocation for the aggregation buffer
Quickly finding message boundaries so a message can be extracted from the buffer
Thanks
David
I've found that the simple method works pretty well.
Allocate a buffer of fixed size double the size of your biggest message. One buffer. Keep a pointer to the end of the data in the buffer.
Allocation happens once. The next part is the message loop:
If not using blocking sockets, then poll or select here.
Read data into the buffer at the end-data pointer. Only read what will fit into the buffer.
Scan the new data for your delimiters with strchr. If you found a message:
memcpy the message into its own buffer. (Note: I do this because I was using threading and you probably should too.)
memmove the remaining buffer data to the beginning of the buffer and update the end of data pointer.
Call the processing function for the message. (Send it to the thread pool.)
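A bare-bones sketch of that loop, with '\n' as the delimiter and a hypothetical handle_message() standing in for the memcpy-and-dispatch step (std::memchr rather than strchr, since the buffer isn't NUL-terminated):
#include <cstddef>
#include <cstring>       // memchr(), memmove()
#include <sys/socket.h>  // recv()
#include <sys/types.h>

const size_t MAX_MESSAGE = 4096;           // size of your biggest message

void handle_message(const char *msg, size_t len);   // copies and dispatches

void message_loop(int fd)
{
    char buf[2 * MAX_MESSAGE];             // one buffer, double the biggest message
    size_t end = 0;                        // end-of-data index

    for (;;) {
        ssize_t got = recv(fd, buf + end, sizeof buf - end, 0);
        if (got <= 0) break;               // error or connection closed
        end += static_cast<size_t>(got);

        void *hit;
        while ((hit = std::memchr(buf, '\n', end)) != nullptr) {
            size_t msglen = static_cast<size_t>(static_cast<char *>(hit) - buf) + 1;
            handle_message(buf, msglen);                    // copy happens in there
            std::memmove(buf, buf + msglen, end - msglen);  // shift the remainder down
            end -= msglen;
        }
    }
}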
There are more complicated methods. I haven't found them worth the bother in the end but you might depending on circumstances.
You could use a circular buffer with beginning and end of data pointers. Lots of hassle keeping track and computing remaining space, etc.
You could allocate a new buffer after finding each message. You wouldn't have to copy so much data around. You do still have to move the excess data into a new message buffer after finding the delimiter.
Do not think that dumb tricks like reading one byte at a time out of the socket will improve performance. Every system call round-trip makes an 8 kB memmove look cheap.