Parser for TCP buffers - c++

I want to implement a protocol to share data between a server and a client.
I don't know the correct one. With performance as the main criterion, can anyone suggest the best protocol for parsing the data?
I have one in mind. I don't know its actual name, but it looks like this:
[Header][Message][Header][Message]
The header contains the length of the message, and the header size is fixed.
I have tried this by performing a lot of concatenation and substring operations, which are costly. Can anyone suggest the best implementation for this?

The question is very broad.
On the topic of avoiding buffer/string concatenation, look at Buffer Sequences, described in Boost Asio's "Scatter-Gather" documentation.

For parsing there are two common solutions:
Small messages:
Receive the data into a buffer, e.g. 64k. Then use pointers into that buffer to parse the header and message. Since the messages are small there can be many messages in the buffer and you would call the parser again as long as there is data in the buffer. Note that the last message in the buffer might be truncated. In which case you have to keep the partial message and read more data into the buffer. If the message is near the end of the buffer then copying it to the front might be necessary.
Large messages:
With large messages it makes sense to first only read the header. Then parse the header to get the message size, allocate an appropriate buffer for the message and then read the whole message into it before parsing it.
Note: In both cases you might want to handle overly large messages by either skipping them or terminating the connection with an error. For the first case a message can not be larger than the buffer and should be a lot smaller. For the second case you don't want to allocate e.g. a gigabyte to buffer a message if they are supposed to be around 1MB only.
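The small-message approach could be sketched roughly as follows. This is only an illustration under assumptions that are not in the question: the header is taken to be a 4-byte big-endian length, and kMaxMessage, ParseStatus and parse_messages are made-up names. The parser hands out views into the receive buffer instead of copying, and reports how many bytes it consumed so the caller can move a truncated tail to the front.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Assumed wire format: 4-byte big-endian length header, then the payload.
constexpr std::size_t kHeaderSize = 4;
constexpr std::uint32_t kMaxMessage = 16 * 1024;  // hypothetical size limit

enum class ParseStatus { Ok, TooLarge };

// Parses every complete message in [data, data + size) without copying.
// `consumed` receives the number of bytes that were fully parsed; the caller
// keeps the remaining bytes (a truncated message) and reads more data.
inline ParseStatus parse_messages(const char* data, std::size_t size,
                                  std::vector<std::string_view>& out,
                                  std::size_t& consumed) {
    assert(data != nullptr || size == 0);
    consumed = 0;
    while (size - consumed >= kHeaderSize) {
        const unsigned char* h =
            reinterpret_cast<const unsigned char*>(data + consumed);
        const std::uint32_t len = (std::uint32_t{h[0]} << 24) |
                                  (std::uint32_t{h[1]} << 16) |
                                  (std::uint32_t{h[2]} << 8) |
                                  std::uint32_t{h[3]};
        if (len > kMaxMessage) return ParseStatus::TooLarge;  // skip or disconnect
        if (size - consumed - kHeaderSize < len) break;       // truncated message
        out.emplace_back(data + consumed + kHeaderSize, len);
        consumed += kHeaderSize + len;
    }
    return ParseStatus::Ok;
}
```

The views in out stay valid only until the buffer is reused, so each message has to be processed (or copied) before the next read.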
For sending messages it's best to first collect all the output. A std::vector can be sufficient. Or a rope of strings. Avoid copying the message into a larger buffer over and over. At most copy it once at the end when you have all the pieces. Using writev() to write a list of buffers instead of copying them all into one buffer can also be a solution.
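As a rough sketch of the writev() idea, assuming a POSIX system and a 4-byte big-endian length header (the header layout and the name send_frame are assumptions for illustration): the header and the payload are passed as two separate buffers, so they are never copied into one combined buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <sys/types.h>  // ssize_t
#include <sys/uio.h>    // writev (POSIX)
#include <unistd.h>     // read/close for callers

// Sends a 4-byte big-endian length header followed by the payload with a
// single writev() call. Returns true only if everything was written; a real
// sender would loop on short writes and handle EINTR/EAGAIN.
inline bool send_frame(int fd, const std::string& payload) {
    assert(payload.size() <= UINT32_MAX);
    const std::uint32_t len = static_cast<std::uint32_t>(payload.size());
    unsigned char header[4] = {
        static_cast<unsigned char>(len >> 24),
        static_cast<unsigned char>(len >> 16),
        static_cast<unsigned char>(len >> 8),
        static_cast<unsigned char>(len)};

    iovec iov[2];
    iov[0].iov_base = header;
    iov[0].iov_len = sizeof header;
    iov[1].iov_base = const_cast<char*>(payload.data());  // writev only reads it
    iov[1].iov_len = payload.size();

    const ssize_t written = writev(fd, iov, 2);
    return written == static_cast<ssize_t>(sizeof header + payload.size());
}
```

The same pattern extends to more than two pieces by growing the iovec array, one entry per piece of output.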
As for the best protocol... What's best? Simply sending the data in binary format is fastest but will break when you have different architectures or versions. Something like Google Protocol Buffers can solve that, but at the cost of some speed. It all depends on what your needs are.

Related

Searching for a Binary Value

I am trying to find a way to identify the start of a chunk of data sent via a TCP socket. The data chunk has the value of the integer 1192 written into it as the first four bytes, followed by the content length. How can I search the binary data (the char* received) for this value? I realize I can loop through and advance the pointer by one each time, copy out the first four bytes, and check it, but that isn't the most elegant or possibly efficient solution.
Is there also another way this could be done that I'm not thinking of?
Thanks in advance.
It sounds like linear scanning might be required, but you shouldn't really be losing your message positioning if the sending side of the connection is making its send()/write() calls in a sensible manner, you are reading in your buffers properly, and there isn't an indeterminate amount of "dead" space in the stream between messages.
If the protocol itself is sensible (there is at least a length field!), you should never lose track of message boundaries. Just read the marker/length pair, then read length payload bytes, and the next message should start immediately after this, so a linear scan shouldn't have to go anywhere ideally.
Also, don't bother copying explicitly, just cast:
// call ntohl() to convert from network byte order if need be;
// the cast assumes charptr is suitably aligned for uint32_t
uint32_t x = *reinterpret_cast<const uint32_t *>(charptr);
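If the marker can sit at an arbitrary offset in the receive buffer, the cast may hit an unaligned address. A possible alternative is to decode the bytes individually, and to use std::search if a scan is ever needed to resynchronise. The sketch below assumes the sender writes the value in network (big-endian) byte order, which the question does not state; load_be32 and find_marker are made-up names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reads a 32-bit value stored in network (big-endian) byte order.
// Decoding byte by byte works for any alignment and any host endianness.
inline std::uint32_t load_be32(const char* p) {
    unsigned char b[4];
    std::memcpy(b, p, sizeof b);
    return (std::uint32_t{b[0]} << 24) | (std::uint32_t{b[1]} << 16) |
           (std::uint32_t{b[2]} << 8) | std::uint32_t{b[3]};
}

// Returns the offset of the first occurrence of the big-endian marker in the
// buffer, or `size` if it is not present.
inline std::size_t find_marker(const char* data, std::size_t size,
                               std::uint32_t marker) {
    const char needle[4] = {
        static_cast<char>(marker >> 24), static_cast<char>(marker >> 16),
        static_cast<char>(marker >> 8), static_cast<char>(marker)};
    const char* end = data + size;
    const char* hit = std::search(data, end, needle, needle + sizeof needle);
    return static_cast<std::size_t>(hit - data);
}
```

Note that the payload itself may contain the same four bytes, so a scan can only suggest a candidate position; the length field that follows still has to be checked for plausibility.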

Can TCP data overlap in the buffer

If I keep sending data to a receiver is it possible for the data sent to overlap such that they accumulate in the buffer and so the next read to the buffer reads also the data of another sent data?
I'm using Qt and readAll() to receive data and parse it. This data has some structure in it, so I can tell whether the data is already complete or whether it is valid data at all, but I'm worried that pieces of data will overlap with each other when I call readAll() and so invalidate this supposed-to-be valid data.
If it can happen, how do I prevent/control it? Or is that something the OS/API worries about instead? I'm worried partly because of how the method is called. lol
TCP is a stream-based connection, not a packet-based connection, so you may not assume that what is sent in one call will also be received in one call. You still need some kind of protocol to packetize your stream.
For sending strings, you could use the NUL character as a separator, or you could begin with a header which contains a magic number and a length.
According to http://qt-project.org/doc/qt-4.8/qiodevice.html#readAll this function snarfs all the data and returns it as an array. I don't see how the API raises concerns about overlapping data. The array is returned by value, and given that it represents the entire stream, what would it even overlap with? Are you worried that the returned object actually has reference semantics (i.e. that it just holds pointers to storage that is re-used in other calls to the same function)?
If send and receive buffers overlap in any system, that's a bug, unless special care is taken that the use is completely serialized. (I.e. a buffer is somehow used only for sending and only for receiving, without any mixup.)
Why don't you use a fixed-length header followed by a variable-length packet, with the header holding the length of the packet?
This way you can avoid worrying about packet boundaries. For example, instead of just sending the string, send the length of the string followed by the string. On the receiving end, always read the length first and then read the string based on that length.
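The "read the length, then read the string" receiver could look roughly like this. It is a sketch only: the 4-byte big-endian length and the 1 MiB default limit are assumptions, and recv_some is a placeholder for whatever receive call is in use (recv(), read(), a Qt socket read), passed in as a callable so the loop itself stays independent of the socket API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Calls recv_some(dst, max) until exactly n bytes have arrived. recv_some
// returns the number of bytes delivered, or 0 (or a negative value) when the
// connection is closed or fails.
template <typename RecvSome>
bool read_exact(RecvSome&& recv_some, char* dst, std::size_t n) {
    assert(dst != nullptr || n == 0);
    std::size_t got = 0;
    while (got < n) {
        const long r = recv_some(dst + got, n - got);
        if (r <= 0) return false;
        got += static_cast<std::size_t>(r);
    }
    return true;
}

// Reads one message: a 4-byte big-endian length, then that many bytes.
template <typename RecvSome>
bool read_message(RecvSome&& recv_some, std::string& out,
                  std::uint32_t max_len = 1u << 20) {
    unsigned char h[4];
    if (!read_exact(recv_some, reinterpret_cast<char*>(h), sizeof h)) return false;
    const std::uint32_t len = (std::uint32_t{h[0]} << 24) |
                              (std::uint32_t{h[1]} << 16) |
                              (std::uint32_t{h[2]} << 8) | std::uint32_t{h[3]};
    if (len > max_len) return false;  // refuse implausible lengths
    out.resize(len);
    return len == 0 || read_exact(recv_some, &out[0], len);
}
```

Because read_exact loops, it does not matter how the stream happens to be split across individual receive calls.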

Advantages of the client knowing how big the packet sent by the server is

I'm really new at network-programming, so I hope this isn't a complete Newbie-question.
I read a tutorial at the Qt-Homepage how to build a little server, and I found this:
QByteArray block;
QDataStream out(&block, QIODevice::WriteOnly);
out << (quint16)0;
out << "..."; // just some text
out.device()->seek(0);
out << (quint16)(block.size() - sizeof(quint16));
At the start of our QByteArray, we reserve space for a 16 bit integer that will contain the total size of the data block we are sending. [We continue by streaming in a random fortune.] Then we seek back to the beginning of the QByteArray, and overwrite the reserved 16 bit integer value with the total size of the array. By doing this, we provide a way for clients to verify how much data they can expect before reading the whole packet.
So I want to know, what are the advantages of this procedure? What can happen if you don't do that? Maybe you also could add a little example.
It is standard stuff.
To the receiving program everything coming over the network is just a stream of bytes. The stream has no meaning beyond what the application imposes upon it, exactly the same way a file has no meaning beyond how its records, lines, etc., are defined by the application(s).
The only way the client and server can make sense of the stream is to establish a convention, or protocol, that they agree upon.
So some common ways to accomplish this are:
- have a delimiter that designates the end of a message (e.g. a carriage return)
- pass a length field, as in your example, which tells the receiver how much data comprises the next message.
- just establish a fixed convention (e.g. every message will be 20 bytes or type 'A' records will be one defined format, type 'B' records another...)
- just treat it like a stream by having no convention at all (e.g. take whatever comes over the network and put it in a file w/o paying any attention to what it is)
One advantage of the length byte method is that the receiver knows exactly how much data to expect. With some added sanity checks this can help eliminate things like buffer overflows and such in your application.
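To illustrate the length-field variant without Qt, here is a minimal plain C++ sketch of the same reserve-then-overwrite idea as in the question, plus the kind of sanity check a receiver might add. The big-endian byte order and the names make_block and announced_size are assumptions for the example; this is not a byte-for-byte reproduction of what QDataStream writes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Builds [16-bit big-endian length][payload], reserving the length field
// first and filling it in once the payload size is known.
inline std::vector<std::uint8_t> make_block(const std::string& text) {
    std::vector<std::uint8_t> block(2, 0);            // placeholder length
    block.insert(block.end(), text.begin(), text.end());
    const std::size_t body = block.size() - 2;
    assert(body <= 0xFFFF);                           // must fit in 16 bits
    block[0] = static_cast<std::uint8_t>(body >> 8);  // overwrite placeholder
    block[1] = static_cast<std::uint8_t>(body & 0xFF);
    return block;
}

// Receiver side: returns the announced body size, or -1 if fewer than two
// bytes have arrived or the size exceeds the receiver's own limit.
inline long announced_size(const std::vector<std::uint8_t>& received,
                           std::size_t limit) {
    if (received.size() < 2) return -1;
    const std::size_t body = (std::size_t{received[0]} << 8) | received[1];
    return body <= limit ? static_cast<long>(body) : -1;
}
```

The limit check is what keeps a corrupt or hostile length value from driving an oversized allocation or a buffer overflow.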
Knowing the packet size before receiving it has a performance advantage.
You can then allocate exactly the needed number of bytes from the heap, or from whatever buffer management you use, and receive everything with a few calls (ideally one) to the 'network receive function'.
If you don't know the size in advance, you have to call the 'network receive function' for very small portions of the message.
Since the 'network receive function' (which may be recv() or whatever Qt offers you) is a system call which also does TCP buffer handling and so on, it should be assumed to be slow, with a large per-call overhead. So you should call it as rarely as possible.

TCP/IP Message Framing Examples

I'm trying to find concrete examples of how to manage breaking an incoming stream of data on a TCP/IP socket and aggregating this data in a buffer of some sort so that I could find the messages in it (variable length with header + delimiters) and extract them to reconstruct the messages for the receiving application.
Any good pointers/links/examples on an efficient way of doing this would be appreciated as I couldn't find good examples online and I'm sure this problem has been addressed by others in an efficient way in the past.
Efficient memory allocation of aggregation buffer
Quickly finding the message boundaries of a message to extract it from the buffer
Thanks
David
I've found that the simple method works pretty well.
Allocate a buffer of fixed size double the size of your biggest message. One buffer. Keep a pointer to the end of the data in the buffer.
Allocation happens once. The next part is the message loop:
- If not using blocking sockets, then poll or select here.
- Read data into the buffer at the end-data pointer. Only read what will fit into the buffer.
- Scan the new data for your delimiters with strchr. If you found a message:
  - memcpy the message into its own buffer. (Note: I do this because I was using threading and you probably should too.)
  - memmove the remaining buffer data to the beginning of the buffer and update the end of data pointer.
  - Call the processing function for the message. (Send it to the thread pool.)
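The loop above might be sketched like this. It is a hedged illustration, not the answerer's actual code: LineBuffer is a made-up name, it uses memchr rather than strchr because a receive buffer is not necessarily NUL-terminated, and it moves the leftover data once per read instead of once per message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

class LineBuffer {
public:
    explicit LineBuffer(std::size_t capacity) : buf_(capacity), end_(0) {
        assert(capacity > 0);
    }

    // Where the next socket read should store data, and how much room is left.
    char* write_ptr() { return buf_.data() + end_; }
    std::size_t space() const { return buf_.size() - end_; }

    // Call after a read stored n new bytes at write_ptr(). Appends every
    // complete message (without its delimiter) to out.
    void commit(std::size_t n, char delim, std::vector<std::string>& out) {
        assert(n <= space());
        end_ += n;
        std::size_t start = 0;
        while (start < end_) {
            const char* from = buf_.data() + start;
            const void* hit = std::memchr(from, delim, end_ - start);
            if (hit == nullptr) break;  // partial message, wait for more data
            const std::size_t len =
                static_cast<std::size_t>(static_cast<const char*>(hit) - from);
            out.emplace_back(from, len);  // copy the message into its own buffer
            start += len + 1;             // skip the delimiter
        }
        // Move the leftover partial message to the front of the buffer.
        std::memmove(buf_.data(), buf_.data() + start, end_ - start);
        end_ -= start;
    }

private:
    std::vector<char> buf_;
    std::size_t end_;  // end of valid data
};
```

A caller would read at most space() bytes into write_ptr(), call commit(), and hand the extracted messages to the processing function. If space() reaches zero without a delimiter, the message exceeds the buffer and the connection should probably be dropped.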
There are more complicated methods. I haven't found them worth the bother in the end but you might depending on circumstances.
You could use a circular buffer with beginning and end of data pointers. Lots of hassle keeping track and computing remaining space, etc.
You could allocate a new buffer after finding each message. You wouldn't have to copy so much data around. You do still have to move the excess data into a new message buffer after finding the delimiter.
Do not think that dumb tricks like reading one byte at a time out of the socket will improve performance. Every system call round-trip makes an 8 kB memmove look cheap.

C++ byte stream

For a networked application, the way we have been transmitting dynamic data is by memcpying a struct into a (void*). This poses some problems, for example when it is done to a std::string. Strings can have dynamic length, so how will the other side know when the string ends? An idea I had was to use something similar to Java's DataOutputStream, where I could just pass whatever variables to it and they could then be put into a (void*). If this can't be done, then it's cool. I just don't really like memcpying a struct. Something about it doesn't seem quite right.
Thanks,
Robbie
Nothing wrong with memcpy on a struct, as long as the struct is filled with fixed-size buffers. Put a dynamic variable in there and you have to serialise it differently.
If you have a struct with std::strings in there, create a stream operator and use it to format a buffer. You can then memcpy that buffer to the data transport. If you have Boost, use Boost.Serialization, which does all this for you (that link also has links to alternative serialization libs).
Notes: the usual way to pass a variable-size buffer is to begin by sending the length, then that many bytes of data. Occasionally you see data transferred until a delimiter is received (and fields within that data are delimited themselves by another character, eg a comma).
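A hand-written version of the stream-operator idea could look something like the following sketch. Player, Bytes, put_u32, get_u32 and decode are hypothetical names, and the 4-byte big-endian integers are an assumption. Fixed-size fields are written as integers, and the string is written as its length followed by its bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Player {  // hypothetical example type
    std::uint32_t id;
    std::string name;
};

using Bytes = std::vector<std::uint8_t>;

inline void put_u32(Bytes& out, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<std::uint8_t>(v >> shift));
}

inline bool get_u32(const Bytes& in, std::size_t& pos, std::uint32_t& v) {
    assert(pos <= in.size());
    if (in.size() - pos < 4) return false;
    v = 0;
    for (int i = 0; i < 4; ++i) v = (v << 8) | in[pos++];
    return true;
}

// Stream-style operator: the string is sent as length + bytes, never as a
// raw memcpy of the std::string object.
inline Bytes& operator<<(Bytes& out, const Player& p) {
    put_u32(out, p.id);
    put_u32(out, static_cast<std::uint32_t>(p.name.size()));
    out.insert(out.end(), p.name.begin(), p.name.end());
    return out;
}

inline bool decode(const Bytes& in, std::size_t& pos, Player& p) {
    std::uint32_t len = 0;
    if (!get_u32(in, pos, p.id) || !get_u32(in, pos, len)) return false;
    if (in.size() - pos < len) return false;  // length exceeds available data
    const auto first = in.begin() + static_cast<std::ptrdiff_t>(pos);
    p.name.assign(first, first + static_cast<std::ptrdiff_t>(len));
    pos += len;
    return true;
}
```

The resulting byte vector is what gets handed to the transport; the receiver runs decode on the bytes it has collected.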
I see two parts of this question:
- serialization of data over a network
- how to pass structures into a network stack
To serialize data over a network, you'll need a protocol. Doesn't have to be difficult; for ASCII even a cr/lf as packet end may do. If you use a framework (like MFC), it may provide serialization functions for you; in that case you need to worry about how to send this in packets. A packetization which often works well for me is :
<length><data_type>[data....][checksum]
In this case the checksum is optional, and zero-length data is also possible, for instance if the signal is carried in the data_type (e.g. Ack for acknowledgement).
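One possible concrete reading of that layout, purely as a sketch: the field widths (one byte each for length, type and checksum), the additive checksum and the names MsgType, make_packet and packet_is_valid are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

enum class MsgType : std::uint8_t { Ack = 1, Text = 2 };  // hypothetical types

// Layout assumed here: [1-byte data length][1-byte type][data][1-byte sum].
// The checksum is the sum of all preceding bytes modulo 256.
inline std::vector<std::uint8_t> make_packet(MsgType type,
                                             const std::string& data) {
    assert(data.size() <= 255);
    std::vector<std::uint8_t> pkt;
    pkt.push_back(static_cast<std::uint8_t>(data.size()));
    pkt.push_back(static_cast<std::uint8_t>(type));
    pkt.insert(pkt.end(), data.begin(), data.end());
    const unsigned sum = std::accumulate(pkt.begin(), pkt.end(), 0u);
    pkt.push_back(static_cast<std::uint8_t>(sum & 0xFF));
    return pkt;
}

// Checks that the length field matches the packet size and that the
// checksum byte matches the rest of the packet.
inline bool packet_is_valid(const std::vector<std::uint8_t>& pkt) {
    if (pkt.size() < 3) return false;
    if (pkt.size() != std::size_t{pkt[0]} + 3) return false;
    const unsigned sum = std::accumulate(pkt.begin(), pkt.end() - 1, 0u);
    return pkt.back() == static_cast<std::uint8_t>(sum & 0xFF);
}
```

An Ack carries no data at all, so its packet is just the length (zero), the type and the checksum. A simple sum catches only some corruption; a CRC would be the stronger choice if the checksum matters.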
If you're working with memcpy on structures, you'll need to consider that memcpy only makes a shallow copy. A pointer is worthless once transmitted over a network; instead you should transmit the data behind that pointer (i.e. the contents of the string in your example).
For sending dynamic data across the network you have the following options.
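First option, with the size in the same packet:
void SendData()
{
    int size;            // number of payload bytes actually used
    char payload[256];
    Send(messageType);   // Send() and messageType are placeholders
    Send(size);
    Send(payload);
}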
Second option, without a size:
void SendData()
{
    char payload[256];   // null-terminated string
    Send(messageType);   // Send() and messageType are placeholders
    Send(payload);
}
In either case you are faced with a design choice. In the first example you send the message type, then the payload size, and then the payload.
In the second you send the message type and then a string that is delimited by a null terminator.
Neither option fully covers the problem you're facing, I think. Firstly, if you're building a game, you need to determine what type of protocol you will be using: UDP? TCP? The second problem you will face is the maximum packet size. On top of that, you need a framework in place to calculate the optimum packet size that will not be fragmented and lost on the internet. After that you have bandwidth control, i.e. how much data you can transmit and receive between the client and server.
For example the way that most games approach this situation is each packet is identified with the following.
MessageType
MessageSize
CRCCheckSum
MessageID
void buffer[payload]
In situations where you need to send dynamic data, you would send a series of packets, not just one. For example, if you were to send a file across the network, the best option would be to use TCP/IP, because it is a streaming protocol and it guarantees that the complete stream arrives safely at the other end. UDP, on the other hand, is a packet-based protocol and does not check that all packets arrived in order, or at all, at the other end.
So in conclusion.
- For dynamic data, send multiple packets, with a special flag to say that more data is still to arrive to complete the message.
- Keep it simple, and if you're working with C++, don't assume the packet or data will contain a null terminator. If you decide to use a null terminator, check the size against the payload.
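The last point can be sketched as follows: a bounded search for the terminator that never reads past the received size. The name payload_to_string is made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Extracts a string from a payload of `size` bytes. Stops at the first NUL
// if there is one, and never reads past `size` if there is none.
inline std::string payload_to_string(const char* payload, std::size_t size) {
    assert(payload != nullptr);
    const void* nul = std::memchr(payload, '\0', size);
    const std::size_t len =
        nul ? static_cast<std::size_t>(static_cast<const char*>(nul) - payload)
            : size;
    return std::string(payload, len);
}
```

Calling strlen or constructing a std::string directly from the raw pointer would instead keep reading until it happens to find a zero byte, possibly beyond the received data.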