Protocol Buffers - Reading header (nested message) common across all messages - c++

I am currently evaluating Protocol Buffers for use in a project (no code written as of yet). One of the things I'm unclear on is how you would read part of an encoded message, for example say I have a common header:
message Header {
required uint16 msg_type = 1;
required uint16 length = 2;
}
And say I deliver multiple different messages to a queue. How would the consumer work out how much data to read per message and what message type is should be constructed as?

There should be no need for a Header message here; the most common approach is to follow the "streaming" advice from here. Within that, you could either treat it as a sequence of identical union type messages, or (my preference) when writing, instead of just writing a length-prefix before each, include a varint that indicates the message type then the length (as a varint). The number that indicates the message type is some arbitrary map you invent, so 1 = Foo, 2 = Bar, 3 = Blap, etc). If you left-shift the message-type by 3 bits then "or" 2, then it will also be a well-formed protobuf stream itself, 100% identical to a repeated YourUnionType.
Basically, this is exactly the same as this answer, but instead of being field 1 each time, the number varies per message-type. Most implementations have a reader/writer API that make it possible to read and write raw varints, and to length-restrict the reader API. Some implementations have helper mechanisms to support streams of heterogeneous messages directly (basically, doing all the above for you).

In a recent project, I used Protocol Buffers like this:
We had one 'container' message that included all the actual messages as optional members:
message ContainerMessage {
optional Message1 message_1 = 1;
optional Message2 message_2 = 2;
//...
optional MessageN message_N = N;
}
Inside an application, you could just use ContainerMessage as a discriminated union of the real Messages.
Between applications, we serialized/deserialized the ContainerMessage and sent the serialized content, prefixed with a simple header containing the length of the serialized content.

That will depend on the protocol you are using.
Note that e.g. a lot of protocols go via serial interfaces, where you might have extra lines telling when a message starts and stops.
Often, messages will have there length at a fixed offset after the message start.
In other cases, you might need to parse the message element by element to find out how much of the message is left. So a string embedded in the message may be of fixed length, or have the length at the beginning, or might have \0 as end marker.
Mostly, when you store messages in a queue for further processing, you will want to add some more information to make your life easier - like when you just have an extra signal telling you when the message stops, you might store the message internally with its length.

Related

How to 'read' from a (binary==true) boost::beast::websocket::stream<tcp::socket> into a buffer (boost::beast::flat_buffer?) so it is not escaped?

I am using boost::beast to read data from a websocket into a std::string. I am closely following the example websocket_sync_client.cpp in boost 1.71.0, with one change--the I/O is sent in binary, there is no text handler at the server end, only a binary stream. Hence, in the example, I added one line of code:
// Make the stream binary?? https://github.com/boostorg/beast/issues/1045
ws.binary(true);
Everything works as expected, I 'send' a message, then 'read' the response to my sent message into a std::string using boost::beast::buffers_to_string:
// =============================================================
// This buffer will hold the incoming message
beast::flat_buffer wbuffer;
// Read a message into our buffer
ws.read(wbuffer);
// =============================================================
// ==flat_buffer to std::string=================================
string rcvdS = beast::buffers_to_string(wbuffer.data());
std::cout << "<string_rcvdS>" << rcvdS << "</string_rcvdS>" << std::endl;
// ==flat_buffer to std::string=================================
This just about works as I expected, except there is some kind of escaping happening on the data of the (binary) stream.
There is no doubt some layer of boost logic (perhaps character traits?) that has enabled/caused all non-printable characters to be '\u????' escaped, human-readable text.
The binary data that is read contains many (intentional) non-printable ASCII control characters to delimit/organize chunks of data in the message:
I would rather not have the stream escaping these non-printable characters, since I will have to "undo" that effort anyway, if I cannot coerce the 'read' buffer into leaving the data as-is, raw. If I have to find another boost API to undo the escaping, that is just wasted processing that no doubt is detrimental to performance.
My question has to have a simple solution. How can I cause the resulting flat_buffer that is ws.read into 'rcvdS' to contain truely raw, unescaped bytes of data? Is it possible, or is it necessary for me to simply choose a different buffer template/class, so that the escaping does not happen?
Here is a visual aid - showing expected vs. actual data:
Beast does not alter the contents of the message in any way. The only thing that binary() and text() do is set a flag in the message which the other end receives. Text messages are validated against the legal character set, while binary messages are not. Message data is never changed by Beast. buffers_to_string just transfers the bytes in the buffer to a std::string, it does not escape anything. So if the buffer contains a null, or lets say a ctrl+A, you will get a 0x00 and a 0x01 in the std::string respectively.
If your message is being encoded or translated, it isn't Beast that is doing it. Perhaps it is a consequence of writing the raw bytes to the std::cout? Or it could be whatever you are using to display those messages in the image you posted. I note that the code you provided does not match the image.
If anyone else lands here, rest assured, it is your server end, not the client end that is escaping your data.

modify raw protobuf stream

Let's say I have compiled an application (Receiver) with the following proto file:
syntax = "proto3";
message Control {
bytes version = 1;
uint32 id = 2;
bytes color = 3;
}
and I have another application (Transmitter) which initially has the same proto file but after an update a new field is added like:
syntax = "proto3";
message Control {
bytes name = 1;
uint32 id = 2;
bytes color = 3;
uint32 color_id = 4;
}
I have seen that if the Receiver app tries to parse the proto, change some data and then serialize it back the added fields coming from the Transmitter app are removed.
I need a way to change the id field directly accessing to the raw bytes without having to parse/serialize the proto. Is it possible ?
This is needed because I have some "header" fields in the Control message that I know that will never be changed but others that can be added/changed in the same proto of trasmitter app due to app update.
I have seen: https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.io.coded_stream
but I was not able to modify an existing bytestream and the ReadString is not able to understand the string length.
Thanks in advance
I don't think there is an official way to do it. You could do this by hand following the encoding guidelines by protobuf (https://developers.google.com/protocol-buffers/docs/encoding#structure).
Basically you should do this:
start decoding with the very first bit
decode until you reach the field number of the id
identify the bits representing the id and replace them with your new (encoded!) id
This is bad for several reasons. Most importantly, your code has to know details about the message structure and content (field number and data type of your id), and this is exactly what you want to avoid when using protocol buffers (you always need some info from the .proto files).
In proto2 syntax, protobuf C++ library used to preserve unknown fields so that when you re-encoded the message, they would remain. Unfortunately this feature (like many others) have been removed in the proto3 syntax.
One workaround could be to do it this way:
Set only the new id value in the Receiver message and encode it.
Append this data after the original binary data.
This relies on the protobuf feature that appended messages replace original values of fields in protobuf messages.
Hmm, actually reading the issue report linked above, it seems that you can turn on unknown field preservation in protobuf version 3.5 and newer.
Just deserialize the entire message and map it on the new message. It is the cleanest way. You do not have a lot of data and probably no real time requirements. Create a mapper and do not overthink the problem.

Sending numeric Array via UDP using winsock vc++

RESOLVED: Problem was primarily with the simulink blockset that was reading in the UDP packet rather than data transmission.
I am trying to send a 20 byte numerical array out of a c++ program by using winsock. The issue I am running into is data packaging, in the end I want each number in the array to go out as its own byte so that my simulink model that is receiving these values does not need an additional processing script.
My array contains 14 boolean values (0|1) and then 6 values that range from -100 to 100. These are reporting the status of a controller input. for example the array would look like
int array msgint[20] = [1,0,1,0,0,1,1,0,0,0,0,0,0,0,50,80,-90,40,90,-20];
I have tried using typecasting and sending multiple strings but all just appear to rearrange the gibberish I am getting or cause a socket error. Currently my sendto function looks like
sendto(sd,message,80,0,(struct sockadd *) &server,server_length)
I know this line works as the packet makes it through it just does not appear as I would like it to. In the send to, message is the formatted string I am trying to create to properly send all of contents of the array. Currently is is arbitrary and has little significance I have it in for debugging purposes essentially.
you are starting at the wrong point. Network communications should start with the design of the wire protocol.
How will you represent something on the wire. Binary or text. Most 'modern' protocols use text (json or xml). A few years ago binary was hot (asn1/ber/der). I suggest json
Then how will you wrap up the payload. Do you need to say 'here is a set of xxxs. now here is a set of yyyys'. I dont know what you are doing so its hard to say what you need
If you want to send 20 bytes, and you know that every integer value in your array will be in the [-100, +100] range, you should not use the int type, which usually contains either 32-bit or 64-bit values on modern platforms.
You might instead want to use the char type, which usually represents a 8-bit value.
For even more certainty, if you can use C++11 features, you should use the <cstdint> header, which defines the int8_t type, guaranteed to be a signed 8-bit type. See http://en.cppreference.com/w/cpp/types/integer.
Your array should look like:
#include <cstdint>
std::int8_t msgint[20] = {1,0,1,0,0,1,1,0,0,0,0,0,0,0,50,80,-90,40,90,-20};
or
char msgint[20] = {1,0,1,0,0,1,1,0,0,0,0,0,0,0,50,80,-90,40,90,-20};
and your sendto will be:
sendto(sd,msgint,20,0,(struct sockadd *) &server,server_length);
"in the end I want each number in the array to go out as its own byte"
Then change your array to
char msgint[20] = ...

Protobuf ParseDelimitedFrom implementation in C++

C# Publisher is publishing continuos marketdata messages in custom protobuff format over the socket using "writeDelimitedTo" API. I have to read all messages in C++ and desearialize it. Below is my code. Since C++ don't have "parseDelimitedFrom", so have coded something like below after going through multiple suggestions in this forum.
Now my question is - Refering to the code below, If the first message size is less than 1024 then in the first iteration, i will have full stream of the 1st message and part of the stream from the 2nd message. After deserializing first message, How can i read remaining streams of the second message from socket and merge it with the stream which i read in the previous iteration ?
EDIT: Support for "delimited" format is now part of the official protobuf library. The post below predates it being added.
I've written optimally-efficient versions of parseDelimitedFrom and writeDelimitedTo in C++ here (the read and write methods of Uncompressed):
https://github.com/capnproto/capnproto/blob/06a7136708955d91f8ddc1fa3d54e620eacba13e/c%2B%2B/src/benchmark/protobuf-common.h#L101
Feel free to copy.
These implementations read from / write to a ZeroCopyInputStream / ZeroCopyOutputStream.(Hmm, for some reason my write is declared to use FileOutputStream, but you should be able to just change that to ZeroCopyOutputStream.)
So, you'll need to create a ZeroCopyInputStream which reads from your StreamSocket, then pass it to my read().
It looks like StreamSocket is a classic copying-read interface. You should therefore use CopyingInputStreamAdaptor as your ZeroCopyInputStream, wrapping an implementation of CopyingInputStream which reads from your StreamSocket.
https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.io.zero_copy_stream_impl_lite#CopyingInputStreamAdaptor

Can protobuf read partially?

I want to save my terrain data to a file and load only some parts of it, because it's just too big to store it in memory as a whole. Actually I don't even know whether the protobuf is good for this purposes.
For example I would have a structure like (might be invalid gramatically, I know only simple basics):
message Quad {
required int32 x = 1;
required int32 z = 2;
repeated int32 y = 3;
}
The x and z values are available in my program and by using them I would like to find the correct Quad object with the same x and z (in the file) to obtain y values. However, I can't just parse the file with the ParseFromIstream(), because (I think so) it loads whole file into memory, but in my case the file is just too big.
So, is the protobuf able to load one object, send me for checking it and if the object is wrong give me the second one?
Actually... I could just ask: does the ParseFromIstream() loads whole file into memory?
While some libraries to allow you to read files partially, the technique recommended by Google is to simply have the file consist of multiple messages:
https://developers.google.com/protocol-buffers/docs/techniques
Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if
you are dealing in messages larger than a megabyte each, it may be time to consider an
alternate strategy.
That said, Protocol Buffers are great for handling individual messages within a large data
set. Usually, large data sets are really just a collection of small pieces, where each small
piece may be a structured piece of data.
So you could just write a long sequence of Quad messages to the file, delimited by the lengths of the messages. If you need to seek randomly to specific Quads, you may want to add some kind of an index.
This depends on which implementation you are using. Some have "read as a sequence" APIs. For example, assuming you stored it as a "repeated Quad", then with protobuf-net that would be:
int x = ..., y = ...;
var found = Serializer.DeserializeItems<Quad>(source)
.Where(q => q.x ==x && q.y == y);
The point being: it yields a spooling (not loaded all at once) and short-circuiting sequence.
I don't know the c++ api specifically, but I would hope it has something similar - but worst case you could parse the varint headers and prepare a length-capped stream.