Stream while serializing with Cap'n Proto - C++

Consider a Cap'n Proto schema like this:
struct Document {
  header @0 :Header;
  records @1 :List(Record);  # usually a large number of records
  footer @2 :Footer;
}
struct Header {
  numberOfRecords @0 :UInt32;
  # some more fields
}
struct Footer {
  # some fields
}
struct Record {
  type @0 :UInt32;
  desc @1 :Text;
  # some more fields, relatively large in total
}
Now I want to serialize (i.e. build) a document instance and stream it to a remote destination.
Since the document is usually very large, I don't want to build it completely in memory before sending it. Instead I am looking for a builder that sends the document struct by struct over the wire, so that the additional memory needed stays constant, i.e. O(max(sizeof(Header), sizeof(Record), sizeof(Footer))).
Looking at the tutorial material I don't find such a builder. The MallocMessageBuilder seems to create everything in memory first (then you call writeMessageToFd on it).
Does the Cap'n Proto API support such a use case?
Or is Cap'n Proto more meant to be used for messages that fit into memory before sending?
In this example, the Document struct could be omitted, and then one could just send a sequence of one Header message, n Record messages, and one Footer message. Since a Cap'n Proto message is self-delimiting, this should work. But you lose your document root - perhaps sometimes this is not really an option.

The solution you outlined -- sending the parts of the document as separate messages -- is probably best for your use case. Fundamentally, Cap'n Proto is not designed for streaming chunks of a single message, since that would not fit well with its random-access properties (e.g. what happens when you try to follow a pointer that points to a chunk you haven't received yet?). Instead, when you want streaming, you should split a large message into a series of smaller messages.
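For illustration, here is a minimal sketch of that approach, using the schema above with Document omitted (streamDocument, the generated header name, and the record values are made up; MallocMessageBuilder and writeMessageToFd are standard Cap'n Proto API):
#include <capnp/message.h>
#include <capnp/serialize.h>
#include "document.capnp.h"  // hypothetical generated header for the schema above

void streamDocument(int fd, uint32_t numberOfRecords) {
  {
    // One small message for the header.
    capnp::MallocMessageBuilder message;
    auto header = message.initRoot<Header>();
    header.setNumberOfRecords(numberOfRecords);
    capnp::writeMessageToFd(fd, message);
  }
  for (uint32_t i = 0; i < numberOfRecords; i++) {
    // One message per record; only one record is in memory at a time.
    capnp::MallocMessageBuilder message;
    auto record = message.initRoot<Record>();
    record.setType(i);           // hypothetical field values
    record.setDesc("example");
    capnp::writeMessageToFd(fd, message);
  }
  {
    // And one message for the footer.
    capnp::MallocMessageBuilder message;
    message.initRoot<Footer>();
    capnp::writeMessageToFd(fd, message);
  }
}
The receiving side reads messages in the same order, using the numberOfRecords field from the header to know how many Record messages follow.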
That said, unlike other similar systems (e.g. Protobuf), Cap'n Proto does not strictly require messages to fit into memory. Specifically, you can do some tricks using mmap(2). If your document data is coming from a file on disk, you can mmap() the file into memory and then incorporate it into your message. With mmap(), the operating system does not actually read the data from disk until you attempt to access the memory, and the OS can also purge the pages from memory after they are accessed since it knows it still has a copy on disk. This often lets you write much simpler code, since you no longer need to think about memory management.
In order to incorporate an mmap()ed chunk into a Cap'n Proto message, you'll want to use capnp::Orphanage::referenceExternalData(). For example, given:
struct MyDocument {
  body @0 :Data;
  # (other fields)
}
You might write:
// Map the file into memory.
void* ptr = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
if (ptr == MAP_FAILED) {
  KJ_FAIL_SYSCALL("mmap", errno);
}
auto data = capnp::Data::Reader((const kj::byte*)ptr, size);

// Incorporate it into a message.
capnp::MallocMessageBuilder message;
auto root = message.getRoot<MyDocument>();
root.adoptBody(message.getOrphanage().referenceExternalData(data));
Because Cap'n Proto is zero-copy, it will end up writing the mmap()ed memory directly out to the socket without ever accessing it. It's then up to the OS to read the content from disk and out to the socket as appropriate.
Of course, you still have a problem on the receiving end. You'll find it a lot more difficult to design the receiving end to read into mmap()ed memory. One strategy might be to dump the entire stream directly to a file first (without involving the Cap'n Proto library), then mmap() that file and use capnp::FlatArrayMessageReader to read the mmap()ed data in-place.
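As a sketch of that strategy, assuming the whole stream was already dumped to a file of size bytes (a multiple of 8, since Cap'n Proto messages are measured in 8-byte words), the reading side might look like this (readDocument is a hypothetical name):
#include <capnp/serialize.h>
#include <kj/debug.h>
#include <sys/mman.h>
#include <cerrno>

void readDocument(int fd, size_t size) {
  void* ptr = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (ptr == MAP_FAILED) {
    KJ_FAIL_SYSCALL("mmap", errno);
  }
  // Interpret the mapped bytes in place; nothing is copied.
  capnp::FlatArrayMessageReader reader(kj::arrayPtr(
      reinterpret_cast<const capnp::word*>(ptr),
      size / sizeof(capnp::word)));
  auto doc = reader.getRoot<MyDocument>();
  auto body = doc.getBody();  // Data::Reader pointing into the mapping
  // ... process body ...
}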
I describe all this because it's a neat thing that is possible with Cap'n Proto but not most other serialization frameworks (e.g. you couldn't do this with Protobuf). Playing tricks with mmap() is sometimes really useful -- I've used this successfully in several places in Sandstorm, Cap'n Proto's parent project. However, I suspect that for your use case, splitting the document into a series of messages probably makes more sense.

Related

Parser for TCP buffers

I want to implement a protocol to share the data between server and client.
I don't know the correct one. With performance as the main criterion, can anyone suggest the best protocol for parsing the data?
I have one in mind; I don't know its actual name, but it is like this:
[Header][Message][Header][Message]
The header contains the length of the message, and the header size is fixed.
I have tried implementing this with a lot of concatenation and substring operations, which are costly. Can anyone suggest the best implementation for this?
The question is very broad.
On the topic of avoiding buffer/string concatenation, look at Buffer Sequences, described in Boost Asio's "Scatter-Gather" documentation.
For parsing there are two common solutions:
small messages
Receive the data into a buffer, e.g. 64k. Then use pointers into that buffer to parse the header and message. Since the messages are small, there can be many messages in the buffer, and you would call the parser again as long as there is data left in the buffer. Note that the last message in the buffer might be truncated, in which case you have to keep the partial message and read more data into the buffer. If the partial message is near the end of the buffer, copying it to the front might be necessary.
large messages
With large messages it makes sense to first only read the header. Then parse the header to get the message size, allocate an appropriate buffer for the message and then read the whole message into it before parsing it.
Note: In both cases you might want to handle overly large messages by either skipping them or terminating the connection with an error. For the first case, a message cannot be larger than the buffer and should be a lot smaller. For the second case, you don't want to allocate e.g. a gigabyte to buffer a message if messages are supposed to be around 1 MB only.
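A minimal sketch of the small-message strategy, assuming a fixed 4-byte native-endian length header (parseMessages, handleMessage, and the 64k limit are hypothetical):
#include <cstdint>
#include <cstring>
#include <stdexcept>

void handleMessage(const uint8_t* data, size_t len);  // hypothetical message handler

// Parse as many complete [4-byte length][payload] messages as the buffer
// holds; return the number of bytes consumed so the caller can move the
// trailing partial message to the front and read more data after it.
size_t parseMessages(const uint8_t* buf, size_t len) {
  size_t pos = 0;
  while (len - pos >= 4) {
    uint32_t msgLen;
    std::memcpy(&msgLen, buf + pos, 4);          // header: payload length
    if (msgLen > 64 * 1024) {
      throw std::runtime_error("message too large");  // see the note above
    }
    if (len - pos - 4 < msgLen) break;           // truncated, wait for more data
    handleMessage(buf + pos + 4, msgLen);
    pos += 4 + msgLen;
  }
  return pos;
}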
For sending messages it's best to first collect all the output. A std::vector can be sufficient. Or a rope of strings. Avoid copying the message into a larger buffer over and over. At most copy it once at the end when you have all the pieces. Using writev() to write a list of buffers instead of copying them all into one buffer can also be a solution.
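For example, a gather write with writev() might look like this sketch (sendMessage is a hypothetical name; error handling and short writes are ignored):
#include <cstdint>
#include <string>
#include <sys/uio.h>

ssize_t sendMessage(int fd, uint32_t length, const std::string& payload) {
  struct iovec iov[2];
  iov[0].iov_base = &length;                  // fixed-size header
  iov[0].iov_len = sizeof(length);
  iov[1].iov_base = (void*)payload.data();    // body; writev does not modify it
  iov[1].iov_len = payload.size();
  return writev(fd, iov, 2);                  // one syscall, no concatenation
}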
As for the best protocol... What's best? Simply sending the data in binary format is fastest, but will break when you have different architectures or versions. Something like Google Protocol Buffers can solve that, but at the cost of some speed. It all depends on what your needs are.

Sending vector over internet using C++

I want to send a vector from one computer to another through the internet. I'm looking for a peer-to-peer solution in C++. I have made a Winsock2 solution, but it can only send char* with the send and recv functions, which I don't think will work for my project.
Is there a way of using JSON with a P2P-solution in C++? So make a JSON-object of the vector and send it through internet? Or do you know a better solution?
The vector I want to send through internet to another computer looks like this:
std::vector<AVPacket> data;
AVPacket is a struct from ffmpeg, consisting of 14 data members. https://ffmpeg.org/doxygen/trunk/structAVPacket.html. You don't want to convert this to a char*.
You can actually send anything using the send and recv functions. You just have to make sure you pass a pointer to the data, and then typecast that pointer as a char * and it will work.
However, you can't send a std::vector as is. Instead you should first send its size (otherwise the receiving end will not know how much data it should receive), then send the actual data in the vector, i.e. someVector.data() or &someVector[0].
Though in your case it will be even more complicated, as the structures you want to send contains embedded pointers. You can't send pointers over the Internet, it's barely possible to transfer pointers between two processes on the same system. You need to read about serialization and maybe about the related subject marshalling.
In short: You can send any kind of data, it doesn't have to be characters, but for the kind of structures you want to send, you have to convert them to a transferable format through serialization.
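As a sketch of the size-then-data convention for a vector of a pointer-free type (sendVector and sendAll are hypothetical names; sendAll would loop over send() until all bytes are written, and byte order and struct layout must match on both ends):
#include <cstdint>
#include <type_traits>
#include <vector>

bool sendAll(int sock, const char* data, size_t len);  // hypothetical helper

template <typename T>
bool sendVector(int sock, const std::vector<T>& v) {
  static_assert(std::is_trivially_copyable<T>::value,
                "only valid for types without embedded pointers");
  uint64_t count = v.size();
  // Send the element count first, then the contiguous element data.
  if (!sendAll(sock, (const char*)&count, sizeof(count))) return false;
  return sendAll(sock, (const char*)v.data(), count * sizeof(T));
}
The receiver reads the count, then reads count * sizeof(T) bytes into its own vector. For a struct like AVPacket this is still not enough: the embedded pointers must be serialized separately, as described above.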
You can not simply send a vector. You can not be sure how the std allocator reserved the memory. So it is very likely that the whole memory of the vector is not just one big linear chunk.
In addition to this, as pointed out above, there are pointers in this struct. They point to data in your local memory. These addresses aren't valid on the recipients side, thus you would access invalid memory trying to read this.
I guess, that in order to achieve what you want to do, you have to try a completely different approach. Do not get lost by trying to send the data of the pointers or similar, rather try having parallel data on both machine.
E.g. both load the same video e.g. from a file which both share. Then use a unique identifier for that video to reference the video on both sides.

How to validate length of received byte array, which is not null terminated?

I have C/C++ code that receives a structure over the network, of this form:
struct DataStruct
{
    int DataLen;
    BYTE* Data;
};
The code I have loops over Data DataLen times and processes the data.
...The problem:
After the code came to security experts for penetration tests, they prepared a fake application which sends this struct with DataLen bigger than the real length of Data. This causes, of course, an access violation exception.
So, the question is - how can I validate the real length of the received Data? Is it possible without changing the structure?
Thanks in advance.
Nice security experts! I wish my company had a department like that.
Whenever data is received from the network, the network IO reports the number of bytes actually written to the buffer, whether you use read(2), recv(2), boost::asio::async_read, or anything else I've seen. The typical approach when there's a "number of bytes to follow" field in the header of your data structure is to repeatedly call read/recv/etc. until that many bytes have been received (or until an error occurs), and only then construct and return your DataStruct (or report the error).
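A sketch of that receive loop (recvFull is a hypothetical name):
#include <sys/socket.h>
#include <sys/types.h>

// Keep calling recv() until exactly len bytes have arrived or an error occurs.
bool recvFull(int sock, char* buf, size_t len) {
  size_t got = 0;
  while (got < len) {
    ssize_t n = recv(sock, buf + got, len - got, 0);
    if (n <= 0) return false;  // error or connection closed
    got += n;
  }
  return true;
}
The receiving code would first recvFull the fixed-size DataLen field, reject values above a sane maximum, and only then allocate and recvFull exactly DataLen bytes into Data.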
You know how many bytes you have received, so just compare that with DataLen.
It is impossible without changing the structure. Data received from a TCP/IP socket is a plain stream; logically, it is not divided into packets. A physical packet may contain one or more DataStruct instances, and one DataStruct instance can be split across two or more physical packets. The current structure can be used only if there are no communication errors or invalid packets.
Corruption is easy if you don't have any intrinsic limitation.
Some protection mechanisms would be:
Try to realloc() the buffer, within some acceptable size (if Data is dynamic)
Exceptions are friends: use SIGSEGV with signal(2), signal(7) and setjmp(2) to build a useful try/catch-like code structure. See Combining setjmp()/longjmp() and Signal Handling for a quick introduction on this topic. Use sigaction(2) to go deeper (by getting the faulty address).

How to interpret binary data in C++?

I am sending and receiving binary data to/from a device in packets (64 byte). The data has a specific format, parts of which vary with different request / response.
Now I am designing an interpreter for the received data. Simply reading the data by positions is OK, but doesn't look that cool when I have a dozen different response formats. I am currently thinking about creating a few structs for that purpose, but I don't know how will it go with padding.
Maybe there's a better way?
Related:
Safe, efficient way to access unaligned data in a network packet from C
You need to use structs and/or unions. You'll need to make sure your data is properly packed on both sides of the connection, and you may want to translate to and from network byte order on each end if there is any chance that either side of the connection could be running with a different endianness.
As an example:
#pragma pack(push) /* push current alignment to stack */
#pragma pack(1)    /* set alignment to 1-byte boundary */
typedef struct {
    unsigned int packetID;     // identifies packet in one direction
    unsigned int data_length;  // number of bytes in data[]
    char receipt_flag;         // whether to ack this packet or keep resending until acked
    char data[];               // typically ASCII string data with \n-terminated fields, but could also be binary
} tPacketBuffer;
#pragma pack(pop)  /* restore original alignment from stack */
and then when assigning:
packetBuffer.packetID = htonl(123456);
and then when receiving:
packetBuffer.packetID = ntohl(packetBuffer.packetID);
Here are some discussions of Endianness and Alignment and Structure Packing
If you don't pack the structure, its members will end up aligned to word boundaries, and the internal layout of the structure and its size will be incorrect.
I've done this innumerable times before: it's a very common scenario. There's a number of things which I virtually always do.
Don't worry too much about making it the most efficient thing available.
If we do wind up spending a lot of time packing and unpacking packets, then we can always change it to be more efficient. Whilst I've not encountered a case where I've had to as yet, I've not been implementing network routers!
Whilst using structs/unions is the most efficient approach in terms of runtime, it comes with a number of complications: convincing your compiler to pack the structs/unions to match the octet structure of the packets you need, working to avoid alignment and endianness issues, and a lack of safety since there is little or no opportunity to do sanity checks on debug builds.
I often wind up with an architecture including the following kinds of things:
A packet base class. Any common data fields are accessible (but not modifiable). If the data isn't stored in a packed format, then there's a virtual function which will produce a packed packet.
A number of presentation classes for specific packet types, derived from common packet type. If we're using a packing function, then each presentation class must implement it.
Anything which can be inferred from the specific type of the presentation class (i.e. a packet type id from a common data field), is dealt with as part of initialisation and is otherwise unmodifiable.
Each presentation class can be constructed from an unpacked packet, or will gracefully fail if the packet data is invalid for that type. This can then be wrapped up in a factory for convenience.
If we don't have RTTI available, we can get "poor-man's RTTI" using the packet id to determine which specific presentation class an object really is.
In all of this, it's possible (even if just for debug builds) to verify that each modifiable field is being set to a sane value. Whilst it might seem like a lot of work, it makes it very difficult to produce an invalidly formatted packet, and a pre-packed packet's contents can easily be checked by eye using a debugger (since it's all in normal platform-native variables).
If we do have to implement a more efficient storage scheme, that too can be wrapped in this abstraction with little additional performance cost.
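A minimal sketch of that shape (all class, method, and constant names here are invented for illustration):
#include <cstdint>
#include <vector>

// Base class: common fields are readable but not modifiable; packing is
// behind a virtual function.
class Packet {
public:
  explicit Packet(uint8_t typeId) : typeId_(typeId) {}
  virtual ~Packet() = default;
  uint8_t typeId() const { return typeId_; }       // fixed at construction
  virtual std::vector<uint8_t> pack() const = 0;   // produce wire-format octets
private:
  const uint8_t typeId_;
};

// One presentation class per packet type.
class LoginPacket : public Packet {
public:
  static const uint8_t kTypeId = 1;                // doubles as "poor man's RTTI"
  LoginPacket() : Packet(kTypeId) {}
  void setUserId(uint32_t id) { userId_ = id; }    // debug builds could sanity-check here
  std::vector<uint8_t> pack() const override {
    std::vector<uint8_t> out;
    out.push_back(kTypeId);
    for (int i = 0; i < 4; i++) {                  // fixed big-endian encoding
      out.push_back(uint8_t(userId_ >> (24 - 8 * i)));
    }
    return out;
  }
private:
  uint32_t userId_ = 0;
};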
It's hard to say what the best solution is without knowing the exact format(s) of the data. Have you considered using unions?
I agree with Wuggy. You can also use code generation to do this. Use a simple data-definition file to define all your packet types, then run a Python script over it to generate prototype structures and serialization/deserialization functions for each one.
This is an "out-of-the-box" solution, but I'd suggest to take a look at the Python construct library.
Construct is a python library for parsing and building of data structures (binary or textual). It is based on the concept of defining data structures in a declarative manner, rather than procedural code: more complex constructs are composed of a hierarchy of simpler ones. It's the first library that makes parsing fun, instead of the usual headache it is today.
construct is very robust and powerful, and just reading the tutorial will help you understand the problem better. The author also has plans for auto-generating C code from definitions, so it's definitely worth the effort to read about.

C++ byte stream

For a networked application, the way we have been transmitting dynamic data is by memcpying a struct into a (void*). This poses some problems, like when this is done to an std::string. Strings can be of dynamic length, so how will the other side know when the string ends? An idea I had was to use something similar to Java's DataOutputStream, where I could just pass whatever variables to it and it could then be put into a (void*). If this can't be done, then that's cool. I just don't really like memcpying a struct. Something about it doesn't seem quite right.
Thanks,
Robbie
There's nothing wrong with memcpy on a struct, as long as the struct is filled with fixed-size buffers. Put a dynamic variable in there and you have to serialise it differently.
If you have a struct with std::strings in there, create a stream operator and use it to format a buffer. You can then memcpy that buffer to the data transport. If you have Boost, use Boost.Serialization, which does all this for you (that link also has links to alternative serialization libs).
Notes: the usual way to pass a variable-size buffer is to begin by sending the length, then that many bytes of data. Occasionally you see data transferred until a delimiter is received (with fields within that data themselves delimited by another character, e.g. a comma).
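A sketch of that length-prefix convention for a std::string (writeString/readString are hypothetical names; byte order is ignored):
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Append a length prefix, then the bytes.
void writeString(std::vector<char>& out, const std::string& s) {
  uint32_t len = (uint32_t)s.size();
  out.insert(out.end(), (const char*)&len, (const char*)&len + sizeof(len));
  out.insert(out.end(), s.begin(), s.end());
}

// Read it back; returns false if the buffer does not yet hold a full string.
bool readString(const char* buf, size_t bufLen, size_t& pos, std::string& s) {
  if (bufLen < pos + sizeof(uint32_t)) return false;
  uint32_t len;
  std::memcpy(&len, buf + pos, sizeof(len));
  if (bufLen - pos - sizeof(len) < len) return false;
  s.assign(buf + pos + sizeof(len), len);
  pos += sizeof(len) + len;
  return true;
}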
I see two parts of this question:
- serialization of data over a network
- how to pass structures into a network stack
To serialize data over a network, you'll need a protocol. It doesn't have to be difficult; for ASCII, even a CR/LF as packet end may do. If you use a framework (like MFC), it may provide serialization functions for you; in that case you need to worry about how to send this in packets. A packetization which often works well for me is:
<length><data_type>[data....][checksum]
In this case the checksum is optional, and zero-length data is also possible, for instance if the signal is carried in the data_type (i.e. Ack for acknowledgement).
If you're working with memcpy on structures, you'll need to consider that memcpy only makes a shallow copy. A pointer is worthless once transmitted over a network; instead you should transmit the data it points to (i.e. the contents of your string example).
For sending dynamic data across the network you have the following options.
First option: send the size in the same packet:
void SendData()
{
    int size;
    char payload[256];
    Send(messageType);
    Send(size);
    Send(payload);
}
Second option: rely on a delimiter instead:
void SendData()
{
    char payload[256];
    Send(messageType);
    Send(payload);  // payload ends with a null terminator
}
In either situation, you are faced with a design choice. In the first example you send the message type, then the payload size, and then the payload.
With the second option you send the message type and then a string delimited by a null terminator.
Neither option fully covers the problem you're facing, though. Firstly, if you're building a game, you need to determine what type of protocol you will be using: UDP? TCP? The second problem you will face is the maximum packet size. On top of that, you need a framework in place to calculate the optimum packet size that will not be fragmented and lost on the network. After that you have bandwidth control, in regards to how much data you can transmit and receive between the client and server.
For example the way that most games approach this situation is each packet is identified with the following.
MessageType
MessageSize
CRCCheckSum
MessageID
void buffer[payload]
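As a sketch, that header might look like the following packed struct (the field widths here are assumptions):
#include <cstdint>

#pragma pack(push, 1)
struct PacketHeader {
    uint16_t messageType;   // what kind of payload follows
    uint16_t messageSize;   // number of payload bytes after this header
    uint32_t crcChecksum;   // integrity check over the payload
    uint32_t messageId;     // sequence / identification number
};
#pragma pack(pop)
// The payload bytes follow the header on the wire.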
In situations where you need to send dynamic data, you would send a series of packets, not just one. For example, if you were to send a file across the network, the best option would be to use TCP/IP, because it is a streaming protocol and it guarantees that the complete stream arrives safely at the other end. On the other hand, UDP is a packet-based protocol and does not do any checking that all packets arrived in order, or at all, at the other end.
So in conclusion.
For dynamic data, send multiple packets, but with a special flag
to say more data is to arrive to complete this message.
Keep it simple: if you're working with C++, don't assume the packet or data
will contain a null terminator, and check the size against the
payload if you decide to use a null terminator.