Omit fields when printing a Protobuf message - C++

Is it possible to choose which fields, or at least which field types, are included when calling message.DebugString() in Google Protobuf?
I have the following message description:
message Message
{
optional string name = 1;
optional int32 blockSize = 2;
optional bytes block = 3;
}
I only want to print name and blockSize and omit the block field, which happens to be large (e.g. 64 KB) and whose content is insignificant.
I built a method that adds only the fields of interest to a std::stringstream, but it seems I have to modify that method for every change in the message description.

Your best bet is to make a copy of the message, clear block from the copy, then print it.
Message copy = original;
copy.clear_block();
cout << copy.DebugString() << endl;
Note that there's no performance concern here because DebugString() itself is already much slower than making a copy of the message.
If you want to make this more general, you could write some code based on protobuf reflection that walks over the copied message and clears all bytes fields with long values.
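A minimal sketch of that reflection approach (the function name and the 1 KB threshold are illustrative, not part of any API):

#include <google/protobuf/descriptor.h>
#include <google/protobuf/message.h>

// Clear every singular bytes field whose value exceeds maxLen (sketch).
void StripLargeBytesFields(google::protobuf::Message* msg, size_t maxLen)
{
    const google::protobuf::Descriptor* desc = msg->GetDescriptor();
    const google::protobuf::Reflection* refl = msg->GetReflection();
    for (int i = 0; i < desc->field_count(); ++i) {
        const google::protobuf::FieldDescriptor* field = desc->field(i);
        if (field->type() == google::protobuf::FieldDescriptor::TYPE_BYTES &&
            !field->is_repeated() &&
            refl->HasField(*msg, field) &&
            refl->GetString(*msg, field).size() > maxLen) {
            refl->ClearField(msg, field);
        }
    }
}

Then the printing code above becomes: make the copy, call StripLargeBytesFields(&copy, 1024), and print copy.DebugString(). This works for any message type, so the method no longer needs updating when the message description changes.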

Related

modify raw protobuf stream

Let's say I have compiled an application (Receiver) with the following proto file:
syntax = "proto3";
message Control {
bytes version = 1;
uint32 id = 2;
bytes color = 3;
}
and I have another application (Transmitter) which initially has the same proto file but after an update a new field is added like:
syntax = "proto3";
message Control {
bytes name = 1;
uint32 id = 2;
bytes color = 3;
uint32 color_id = 4;
}
I have seen that if the Receiver app parses the proto, changes some data, and then serializes it back, the added fields coming from the Transmitter app are removed.
I need a way to change the id field by directly accessing the raw bytes, without having to parse/serialize the proto. Is it possible?
This is needed because the Control message has some "header" fields that I know will never change, but others can be added/changed in the Transmitter app's proto due to an app update.
I have seen: https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.io.coded_stream
but I was not able to modify an existing byte stream, and ReadString cannot determine the string length on its own.
Thanks in advance
I don't think there is an official way to do it. You could do it by hand, following the protobuf encoding guidelines (https://developers.google.com/protocol-buffers/docs/encoding#structure).
Basically you should do this:
start decoding at the very first byte
decode until you reach the field number of the id
identify the bytes representing the id and replace them with your new (encoded!) id
This is bad for several reasons. Most importantly, your code has to know details about the message structure and content (field number and data type of your id), and this is exactly what you want to avoid when using protocol buffers (you always need some info from the .proto files).
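For illustration only, here is a hedged sketch of that hand-patching approach for a top-level varint field such as id (field 2, wire type 0). It assumes a well-formed buffer; real code needs bounds checking throughout:

#include <cstdint>
#include <stdexcept>
#include <string>

// Read a varint at pos, advancing pos (sketch: assumes a well-formed buffer).
static uint64_t ReadVarint(const std::string& buf, size_t& pos)
{
    uint64_t result = 0;
    int shift = 0;
    while (true) {
        uint8_t b = static_cast<uint8_t>(buf[pos++]);
        result |= static_cast<uint64_t>(b & 0x7F) << shift;
        if (!(b & 0x80)) return result;
        shift += 7;
    }
}

// Encode a value as a varint.
static std::string EncodeVarint(uint64_t v)
{
    std::string out;
    while (v >= 0x80) {
        out.push_back(static_cast<char>(v | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<char>(v));
    return out;
}

// Splice a new value into the first top-level varint field numbered fieldNo.
std::string ReplaceVarintField(const std::string& buf, int fieldNo, uint64_t newValue)
{
    size_t pos = 0;
    while (pos < buf.size()) {
        uint64_t tag = ReadVarint(buf, pos);
        int number = static_cast<int>(tag >> 3);
        int wireType = static_cast<int>(tag & 0x7);
        if (number == fieldNo && wireType == 0) {
            size_t valStart = pos;
            ReadVarint(buf, pos); // skip the old value
            return buf.substr(0, valStart) + EncodeVarint(newValue) + buf.substr(pos);
        }
        switch (wireType) { // skip this field's payload
            case 0: ReadVarint(buf, pos); break;        // varint
            case 1: pos += 8; break;                    // 64-bit
            case 2: pos += ReadVarint(buf, pos); break; // length-delimited
            case 5: pos += 4; break;                    // 32-bit
            default: throw std::runtime_error("unsupported wire type");
        }
    }
    return buf; // field not found
}

Note that this only works cleanly because id is a top-level field; if it were nested inside a length-delimited submessage, every enclosing length prefix would also have to be re-encoded, which is where this approach gets painful.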
In proto2 syntax, the protobuf C++ library used to preserve unknown fields, so that when you re-encoded the message they would remain. Unfortunately this feature (like many others) was removed in the proto3 syntax.
One workaround could be to do it this way:
Set only the new id value in the Receiver message and encode it.
Append this data after the original binary data.
This relies on the protobuf rule that when serialized messages are concatenated, later values of scalar fields replace earlier ones.
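As a sketch (assuming the Receiver's generated Control class), the workaround looks like this:

Control patch;
patch.set_id(newId); // only the field to change
std::string patched = originalBytes + patch.SerializeAsString();
// Any parser merging 'patched' sees the new id, while the unknown
// fields in the original bytes are left untouched.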
Hmm, actually, reading the related issue report, it seems that you can turn on unknown-field preservation in protobuf version 3.5 and newer.
Just deserialize the entire message and map it onto the new message. It is the cleanest way. You do not have a lot of data and probably no real-time requirements. Create a mapper and do not overthink the problem.

Can protobuf read partially?

I want to save my terrain data to a file and load only some parts of it, because it's too big to store in memory as a whole. Actually, I don't even know whether protobuf is good for this purpose.
For example, I would have a structure like this (it might be grammatically invalid; I only know the simple basics):
message Quad {
required int32 x = 1;
required int32 z = 2;
repeated int32 y = 3;
}
The x and z values are available in my program, and by using them I would like to find the Quad object in the file with the same x and z to obtain its y values. However, I can't just parse the file with ParseFromIstream(), because (I think) it loads the whole file into memory, and in my case the file is just too big.
So, can protobuf load one object at a time, hand it to me for checking, and if it is the wrong one, give me the next?
Actually, I could just ask: does ParseFromIstream() load the whole file into memory?
While some libraries allow you to read files partially, the technique recommended by Google is to simply have the file consist of multiple messages:
https://developers.google.com/protocol-buffers/docs/techniques
Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if
you are dealing in messages larger than a megabyte each, it may be time to consider an
alternate strategy.
That said, Protocol Buffers are great for handling individual messages within a large data
set. Usually, large data sets are really just a collection of small pieces, where each small
piece may be a structured piece of data.
So you could just write a long sequence of Quad messages to the file, each prefixed with its length. If you need to seek randomly to specific Quads, you may want to add some kind of index.
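With the official C++ library, a hedged sketch of reading such a file one message at a time might look like this (FindQuad is illustrative, and it assumes each Quad was written with a varint length prefix):

#include <cstdint>
#include <fstream>
#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>

// Scan a file of length-prefixed Quad messages for a given (x, z),
// keeping only one Quad in memory at a time.
bool FindQuad(const std::string& path, int x, int z, Quad* out)
{
    std::ifstream file(path, std::ios::binary);
    google::protobuf::io::IstreamInputStream raw(&file);
    google::protobuf::io::CodedInputStream coded(&raw);
    uint32_t size;
    while (coded.ReadVarint32(&size)) {
        auto limit = coded.PushLimit(size); // restrict parsing to one message
        Quad quad;
        if (!quad.ParseFromCodedStream(&coded)) return false;
        coded.PopLimit(limit);
        if (quad.x() == x && quad.z() == z) { *out = quad; return true; }
    }
    return false;
}

This never holds more than one Quad in memory; an index mapping (x, z) to file offsets would let you skip the linear scan.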
This depends on which implementation you are using. Some have "read as a sequence" APIs. For example, assuming you stored it as a "repeated Quad", then with protobuf-net that would be:
int x = ..., z = ...;
var found = Serializer.DeserializeItems<Quad>(source)
    .Where(q => q.x == x && q.z == z);
The point being: it yields a spooling (not loaded all at once) and short-circuiting sequence.
I don't know the C++ API specifically, but I would hope it has something similar; worst case, you could parse the varint headers yourself and prepare a length-capped stream.

Protocol Buffers - Reading header (nested message) common across all messages

I am currently evaluating Protocol Buffers for use in a project (no code written as of yet). One of the things I'm unclear on is how you would read part of an encoded message, for example say I have a common header:
message Header {
required uint32 msg_type = 1;
required uint32 length = 2;
}
And say I deliver multiple different messages to a queue. How would the consumer work out how much data to read per message, and what message type it should be constructed as?
There should be no need for a Header message here; the most common approach is to follow the "streaming" advice from the protobuf techniques page (https://developers.google.com/protocol-buffers/docs/techniques). Within that, you could either treat it as a sequence of identical union-type messages, or (my preference) when writing, instead of just a length prefix before each message, include a varint that indicates the message type, then the length (as a varint). The number that indicates the message type comes from some arbitrary map you invent (say 1 = Foo, 2 = Bar, 3 = Blap, etc.). If you left-shift the message type by 3 bits and then "or" 2, the result will also be a well-formed protobuf stream itself, 100% identical to a repeated YourUnionType.
Basically, this is exactly the same idea, but instead of the field number being 1 each time, it varies per message type. Most implementations have a reader/writer API that makes it possible to read and write raw varints and to length-restrict the reader. Some implementations have helper mechanisms to support streams of heterogeneous messages directly (basically, doing all of the above for you).
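A minimal sketch of the writing side of that framing (WriteTypedMessage and the type numbering are illustrative, not a library API):

#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/message.h>

// Write tag (typeNo << 3 | 2), then the length, then the payload, so the
// stream is also a valid protobuf message with one length-delimited field
// per record.
void WriteTypedMessage(google::protobuf::io::CodedOutputStream* out,
                       uint32_t typeNo, const google::protobuf::Message& msg)
{
    out->WriteVarint32((typeNo << 3) | 2); // wire type 2 = length-delimited
    out->WriteVarint32(static_cast<uint32_t>(msg.ByteSizeLong()));
    msg.SerializeToCodedStream(out);
}

The reader does the mirror image: read the tag varint, extract the type number with tag >> 3, read the length, and length-restrict the parse of the matching message type.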
In a recent project, I used Protocol Buffers like this:
We had one 'container' message that included all the actual messages as optional members:
message ContainerMessage {
optional Message1 message_1 = 1;
optional Message2 message_2 = 2;
//...
optional MessageN message_N = N;
}
Inside an application, you could just use ContainerMessage as a discriminated union of the real Messages.
Between applications, we serialized/deserialized the ContainerMessage and sent the serialized content, prefixed with a simple header containing the length of the serialized content.
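On the receiving side, dispatch might look like this (a sketch; Handle and the message names are placeholders):

ContainerMessage container;
container.ParseFromString(payload);
if (container.has_message_1()) {
    Handle(container.message_1());
} else if (container.has_message_2()) {
    Handle(container.message_2());
}
// In proto3 you would typically express the same thing with a oneof.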
That will depend on the protocol you are using.
Note that a lot of protocols go via serial interfaces, for example, where you might have extra lines telling you when a message starts and stops.
Often, messages will have their length at a fixed offset after the message start.
In other cases, you might need to parse the message element by element to find out how much of the message is left. For instance, a string embedded in the message may be of fixed length, have its length at the beginning, or use \0 as an end marker.
Mostly, when you store messages in a queue for further processing, you will want to add some more information to make your life easier; for example, when you only have an extra signal telling you where a message stops, you might store the message internally together with its length.

Curlpp, incomplete data from request

I am using curlpp to send requests to various web services to send and receive data.
So far this has worked fine, since I have only used it for sending/receiving JSON data.
Now I have a situation where a web service returns a ZIP file in binary form, and I encountered a problem where the received data is not complete.
I first had curl write any data to an ostringstream by using the option WriteStream, but this proved not to be the correct approach, since the data contains null characters and thus stopped at the first null char.
After that, instead of using WriteStream, I used WriteFunction with a callback function.
The problem in this case is that the function is only ever called 2 or 3 times, regardless of the amount of data.
This results in always having a few chunks of data that don't seem to be the first part of the file, although the data always contains PK as the first 2 characters, indicating a ZIP file.
I used several tools to verify that the data is entirely being sent to my application, so this is not a problem with the web service.
Here is the code. Note that options like hostname, port, headers, and post fields are set elsewhere.
string requestData;
size_t WriteStringCallback(char* ptr, size_t size, size_t nmemb)
{
requestData += ptr;
int totalSize = size * nmemb;
return totalSize;
}
const string CurlRequest::Perform()
{
curlpp::options::WriteFunction wf(WriteStringCallback);
this->request.setOpt( wf );
this->request.perform();
return requestData;
}
I hope someone can help me out with this issue, because I've run dry of leads on how to fix it, also because curlpp is poorly documented (and even more so since the curlpp website disappeared).
The problem with the code is that the data is put into a std::string even though it is binary (ZIP) data. I'd recommend putting the data into a stream (or a binary array) instead.
You can also register a callback via curlpp::options::HeaderFunction to retrieve the response headers, and act in the write callback according to the Content-Type.
std::string is not the problem; the concatenation is:
requestData += ptr;
A C string (ptr) is zero-terminated, so if the input contains any zero bytes, everything after the first one is lost. You should wrap the data in a string that knows the length of its contents:
requestData += std::string(ptr, size*nmemb);
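Putting it together, a corrected version of the callback might look like this (a sketch; the length-aware append is the key change):

std::string requestData;

size_t WriteStringCallback(char* ptr, size_t size, size_t nmemb)
{
    size_t totalSize = size * nmemb;
    requestData.append(ptr, totalSize); // binary-safe: copies exactly totalSize bytes
    return totalSize;
}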

String issue with assert on erase

I am developing a program in C++ that uses the string container, as in std::string, to store network data from a socket (this part is peachy). I receive the data in frames of at most 1452 bytes at a time. The protocol uses a fixed 20-byte header that contains the length of the data portion of the packet. My problem is that a string operation is triggering an unknown debug assertion, as in, it asserts, but I get NO message about the string. Since I can receive more than a single packet in a frame at any time, I place all received data into the string, reinterpret_cast it to my data struct, calculate the total length of the packet, and copy the data portion of the packet into a string for regex processing. At this point I do a string erase, as in mybuff.Erase(totalPackLen); <~ THIS is what's triggering the assert, but totalPackLen is less than the string's size.
Is there some convention I am missing here? Or is std::string really an inappropriate choice here? Ty.
Fixed it on my own. Rolled my own VERY simple buffer with a few C calls :)
int ret = recv(socket, m_buff, sizeof(m_buff), 0);
if(ret > 0)
{
BigBuff.append(m_buff,ret);
while(BigBuff.size() > 16){
Header *hdr = reinterpret_cast<Header*>(&BigBuff[0]);
if(ntohs(hdr->PackLen) <= BigBuff.size() - 20){
hdr->PackLen = ntohs(hdr->PackLen);
string lData;
lData.append(BigBuff.begin() + 20,BigBuff.begin() + 20 + hdr->PackLen);
Parse(lData); //regex parsing helper function
BigBuff.erase(hdr->PackLen + 20); // asserts here when PackLen is 235 and string length is 1458
}
}
}
From the code snippet you provided, it appears that your packet comprises a fixed-length binary header followed by a variable-length ASCII string as a payload. Your first mistake is here:
BigBuff.append(m_buff,ret);
There are at least two problems here:
1. Why the append? You have presumably dispatched any previous messages. You should be starting with a clean slate.
2. Mixing binary and string data can work, but more often than not it doesn't. It is usually better to keep the binary and ASCII data separate. Don't use std::string for non-string data.
Append adds data to the end of the string. The very next statement after the append is a test against a length of 16, which says to me that you should have started fresh. In the same vein, you do that reinterpret_cast from BigBuff[0]:
Header *hdr = reinterpret_cast<Header*>(&BigBuff[0]);
Because of your use of append, you are perpetually dealing with the header from the first packet received rather than the current packet. Finally, there's that erase:
BigBuff.erase(hdr->PackLen + 20);
Many problems here:
- If the packet length and the return value from recv are consistent the very first call will do nothing (the erase is at but not past the end of the string).
- There is something very wrong if the packet length and the return value from recv are not consistent. It might mean, for example, that multiple physical frames are needed to form a single logical frame, and that in turn means you need to go back to square one.
- Suppose the physical and logical frames are one and the same, you're still going about this all wrong. As noted, the first time around you are erasing exactly nothing. That append at the start of the loop is exactly what you don't want to do.
Serialization oftentimes is a low-level concept and is best treated as such.
Your comment doesn't make sense:
BigBuff.erase(hdr->PackLen + 20); //assert here when len is packlen is 235 and string len is 1458;
BigBuff.erase(hdr->PackLen + 20) will erase from index hdr->PackLen + 20 onwards to the end of the string. From the description of the code, it seems to me that you're erasing beyond the end of the content data. Here's the reference for std::string::erase().
Needless to say, std::string is entirely inappropriate here; it should be std::vector.
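For reference, a corrected framing loop might look like this (a sketch, assuming the asker's 20-byte Header struct with a 16-bit PackLen; OnReceive and Parse are placeholders). It consumes exactly one complete frame from the front of the buffer per iteration and stops when no complete frame remains:

#include <cstring>
#include <string>
#include <vector>
#include <arpa/inet.h> // ntohs

std::vector<char> bigBuff;

void OnReceive(const char* data, size_t len)
{
    bigBuff.insert(bigBuff.end(), data, data + len);
    while (bigBuff.size() >= sizeof(Header)) {
        Header hdr;
        std::memcpy(&hdr, bigBuff.data(), sizeof(Header)); // avoids reinterpret_cast aliasing
        size_t packLen = ntohs(hdr.PackLen);
        if (bigBuff.size() < sizeof(Header) + packLen)
            break; // wait for the rest of the frame
        std::string payload(bigBuff.data() + sizeof(Header), packLen);
        Parse(payload); // regex parsing helper
        bigBuff.erase(bigBuff.begin(),
                      bigBuff.begin() + sizeof(Header) + packLen);
    }
}

The key differences from the original: the buffer is erased from the front (consuming the frame just handled), and an incomplete frame breaks out of the loop instead of spinning forever.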