Include pre-encoded protocol buffer message within outer message - C++

Is there a way to create a protocol buffer message in C++ that contains a pre-encoded inner message, without parsing and then re-serializing the inner message?
To clarify, consider the following message definitions:
message Inner {
    required int32 i = 1;
    // ... more fields ...
}
message Outer {
    repeated Inner inners = 1;
    // ... more fields ...
}
Suppose you have a collection of 10 byte arrays, each of which contains an encoded version of an Inner. You'd like to create an Outer that contains the 10 Inners. You don't want to hand-encode because Outer has other fields and may itself be included in other messages. Is there a way to get protocol buffers to directly copy the pre-encoded Inner?

There is no clean way, but there are a few hacky ways. One is to define a second message like this:
message RawOuter {
    repeated bytes inners = 1;
    // ... same fields as Outer ...
}
RawOuter is identical to Outer except that the inners repeated field has been changed from type Inner to type bytes. If you populate inners with the encoded instances of Inner and then serialize the RawOuter, you get exactly the same result as if you had built an Outer with the parsed versions. That is to say, the wire format for a nested message is identical to the wire format for a bytes field containing the serialization of that nested message. This is one of those funny exploitable quirks of the protobuf encoding.
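A minimal sketch of the trick, assuming the encoded messages live in a std::vector<std::string> named encoded_inners (a hypothetical name):
std::vector<std::string> encoded_inners = ...;  // pre-serialized Inners
RawOuter raw_outer;
for (const std::string& encoded : encoded_inners) {
    raw_outer.add_inners(encoded);  // bytes field: copied as-is, never parsed
}
std::string wire;
raw_outer.SerializeToString(&wire);
// `wire` is byte-for-byte what serializing the equivalent Outer would produce.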
This hack has some problems, though. In particular, it doesn't work well if you're trying to build an Outer instance that is embedded in some other proto, since you probably don't want to maintain two copies of every containing message, one using Outer and one using RawOuter.
Another, even hackier option is to inject the encoded messages into the Outer instance's UnknownFieldSet.
Outer outer;
for (auto& inner : inners) {
    outer.mutable_unknown_fields()
        ->AddLengthDelimited(1, inner);
}
The UnknownFieldSet is intended to store fields seen while parsing that do not match any known field number defined in the .proto file. The idea is that this allows you to write a proxy server that simply receives messages and forwards them to another server without having to re-compile the proxy every time you add a new field to the protocol. Here, we're abusing it by sticking a value into it that actually corresponds to a known field, but the implementation will not notice, and so it will write out these fields just fine.
The main problem with this approach is that if anyone else inspects your Outer instance in the meantime, it will appear to them as if the inners list is empty, since the values are actually hidden somewhere else. This is a pretty ugly hack that will probably come back to haunt you later. I would only recommend it if you have measured the performance difference and found it to be large.
Also note that the serialization code always writes unknown fields last, whereas known fields are written in order by field number. Parsers are supposed to accept any order, but occasionally you'll find someone who is using the unparsed data as a hash map key or something and that totally breaks if the fields are re-ordered.
By the way, you can improve performance of both of these approaches by swapping the strings into place rather than copying, i.e.
raw_outer->add_inners()->swap(inner);
or
outer->mutable_unknown_fields()->AddLengthDelimited(1)->swap(inner);

Related

register ErrorCollector or intercept parse errors for wire format?

It is possible to define a custom ErrorCollector class for handling google::protobuf parsing errors:
struct ErrorCollector : ::google::protobuf::io::ErrorCollector
{
    void AddError(int line, int column, const std::string& message) override
    {
        // log error
    }

    void AddWarning(int line, int column, const std::string& message) override
    {
        // log warning
    }
};
When parsing from a text file, you can use the protobuf TextFormat class and register your custom ErrorCollector:
::google::protobuf::io::IstreamInputStream input_stream(&file);
::google::protobuf::TextFormat::Parser parser;
ErrorCollector error_collector;
parser.RecordErrorsTo(&error_collector);

if (parser.Parse(&input_stream, &msg))
{
    // handle msg
}
For parsing wire format, I currently use Message::ParseFromArray
if (msg.ParseFromArray(data, data_len))
{
    // handle msg
}
This doesn't allow me to specify a custom ErrorCollector though.
I've searched through the source code, but as of yet have been unable to find if this is possible.
Is it possible to use an ErrorCollector when parsing wire format?
Is there another way to intercept parse errors and make them available to client code?
There are essentially two ways that parsing the wire format could fail:
The bytes are not a valid protobuf (e.g. they are corrupted, or in a totally different format).
A required field is missing.
For case 1, protobuf does not give you any more information than "it's invalid". This is partly for code simplicity (and speed), but it is also partly because any attempt to provide more information usually turns out more misleading than helpful. Detailed error reporting is useful for text format because text is often written by humans, but machines make very different kinds of errors. In some languages, protobuf actually reports specific errors like "end-group tag does not match start-group tag". In the vast majority of cases, this error really just means "the bytes are corrupted", but inevitably people think the error is trying to tell them something deeper which they do not understand. They then post questions to Stack Overflow like "How do I make sure my start-group and end-group tags match?" when they really should be comparing bytes between their source and destination to narrow down where they got corrupted. Even reporting the byte position where the parse error occurred is not very useful: protobuf is a dense encoding, which means that many random corrupt byte sequences will parse successfully, so the parser may only notice a problem somewhere later down the line rather than at the point where things actually went wrong.
The one case that clearly is useful to distinguish is case 2 (missing required fields) -- at least, if you use required fields (I personally recommend avoiding them). There are a couple options here:
Normally, required field checks write errors to the console (on stderr). You can intercept these and record them your own way using SetLogHandler, but this doesn't give you structured information, only text messages.
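A minimal sketch of that first option; MyLog is a hypothetical sink on your side, and the handler signature matches the LogHandler typedef in the protobuf headers:
void MyLogHandler(google::protobuf::LogLevel level, const char* filename,
                  int line, const std::string& message) {
    MyLog(level, filename, line, message);  // hypothetical logging function
}

// At initialization time, before any parsing happens:
google::protobuf::SetLogHandler(&MyLogHandler);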
To check required fields more programmatically, you can separate required field checking from parsing. Use MessageLite::ParsePartialFromArray() or one of the other Partial parsing methods to parse a message while ignoring the absence of required fields. You can then use MessageLite::IsInitialized() to check whether all required fields are set. If it returns false, use Message::FindInitializationErrors() to get a list of paths of all required fields that are missing.
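Putting the two cases together, a sketch (MyMessage stands in for your generated type):
MyMessage msg;
if (!msg.ParsePartialFromArray(data, data_len)) {
    // case 1: the bytes are not a valid protobuf at all
    return;
}
if (!msg.IsInitialized()) {
    // case 2: structurally valid, but required fields are missing
    std::vector<std::string> errors;
    msg.FindInitializationErrors(&errors);
    for (const std::string& path : errors) {
        std::cerr << "missing required field: " << path << std::endl;
    }
}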

ProtocolBuffer, abort() on SerializeToArray()

I made a ProtocolBuffer object from the proto class I usually use and I need to Serialize it. Now, I take the object and call SerializeToArray() on it like this:
int size = messageObject.ByteSize();
void* buffer = malloc(size);
messageObject.SerializeToArray(buffer, size);
As far as I know there is no problem with this since the object has data in it (I checked it by breaking right before the Serialize line).
When the method is called, however, it triggers an abort() which I don't know anything about.
I have no idea what it could be. The only data that is included in this object is a "type" enumerator (which I can set to the type of data that is being used in this object, since it can include different sorts of messages), and it holds one message object of the repeated type.
message MessageID
{
    enum Type { LOGINDATA = 1; PLAYERDATA = 2; WORLDDATA = 3; }

    // Identifies which field is filled in.
    required Type type = 1;

    // One of the following will be filled in.
    repeated PlayerData playerData = 2;
    optional WorldData worldData = 3;
    optional LoginData loginData = 10;
}
This is the base message. So, Type is 2 in this case which stands for PLAYERDATA. Also, playerData is being set with a single object of the type PlayerData.
Any help is appreciated.
Any time that the protobuf library aborts (which, again, should only be in debug mode or in severe circumstances), it will print information about the problem to the console. If your app doesn't have a console, you can use google::protobuf::SetLogHandler to direct the information somewhere else:
https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.common#SetLogHandler.details
typedef void LogHandler(LogLevel level, const char* filename,
                        int line, const std::string& message);

LogHandler* SetLogHandler(LogHandler* new_func);
The protobuf library sometimes writes warning and error messages to stderr.
These messages are primarily useful for developers, but may also help end users figure out a problem. If you would prefer that these messages be sent somewhere other than stderr, call SetLogHandler() to set your own handler. This returns the old handler. Set the handler to NULL to ignore log messages (but see also LogSilencer, below).
Obviously, SetLogHandler is not thread-safe. You should only call it at initialization time, and probably not from library code. If you simply want to suppress log messages temporarily (e.g. because you have some code that tends to trigger them frequently and you know the warnings are not important to you), use the LogSilencer class below.
The only reason for an abort that I know of (which only applies in debug builds) is if some required field isn't set. You say that the type field is set, so there must be a required field in PlayerData which is not set.
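A sketch of how to surface that as a recoverable error instead of an abort, by checking initialization before serializing:
if (!messageObject.IsInitialized()) {
    std::vector<std::string> errors;
    messageObject.FindInitializationErrors(&errors);
    for (const std::string& path : errors) {
        std::cerr << "required field not set: " << path << std::endl;
    }
} else {
    int size = messageObject.ByteSize();
    std::vector<char> buffer(size);
    messageObject.SerializeToArray(buffer.data(), size);
}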

protobuf required field and default value

I am new to protobuf and I have started considering the following trivial example
message Entry {
    required int32 id = 1;
}
used by the C++ code:
#include <iostream>
#include "example.pb.h"

int main() {
    std::string mySerialized;
    {
        Entry myEntry;
        std::cout << "Serialization successful "
                  << myEntry.SerializeToString(&mySerialized) << std::endl;
        std::cout << mySerialized.size() << std::endl;
    }
    Entry myEntry;
    std::cout << "Deserialization successful "
              << myEntry.ParseFromString(mySerialized) << std::endl;
}
Even though the "id" field is required, since it has not been set, the size of the serialization buffer is 0 (??).
When I deserialize the message an error occurs:
[libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type "Entry" because it is missing required fields: id
Is it a normal behavior?
Francesco
ps- If I initialize "id" with the value 0, the behavior is different
pps- The SerializeToString function returns true, the ParseFromString returns false
I don't think I exactly understand your question, but I'll have a go at the answer anyway. Hope this helps you in some way or the other :)
Yes, this is normal behavior. You should mark a field required only if it is essential to the message; it makes sense semantically (why would you skip a required field?). To enforce this, protobuf will not parse a message whose required fields are missing.
It sees that the field marked with number 1 is required and the has_id() method returns false, so it won't parse the message at all.
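That also explains the "ps" observation: explicitly setting the field flips has_id() to true, and proto2 serializes an explicitly set field even when its value equals the default. A sketch:
Entry myEntry;
myEntry.set_id(0);                               // has_id() is now true
std::string mySerialized;
myEntry.SerializeToString(&mySerialized);        // 2 bytes on the wire: tag + varint 0
Entry parsed;
bool ok = parsed.ParseFromString(mySerialized);  // true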
In the developer guide it is advised not to use required fields.
Required Is Forever: You should be very careful about marking fields as required. If at some point you wish to stop writing or sending a required field, it will be problematic to change the field to an optional field – old readers will consider messages without this field to be incomplete and may reject or drop them unintentionally. You should consider writing application-specific custom validation routines for your buffers instead. Some engineers at Google have come to the conclusion that using required does more harm than good; they prefer to use only optional and repeated. However, this view is not universal.
Also
Any new fields that you add should be optional or repeated. This means that any messages serialized by code using your "old" message format can be parsed by your new generated code, as they won't be missing any required elements. You should set up sensible default values for these elements so that new code can properly interact with messages generated by old code. Similarly, messages created by your new code can be parsed by your old code: old binaries simply ignore the new field when parsing. However, the unknown fields are not discarded, and if the message is later serialized, the unknown fields are serialized along with it – so if the message is passed on to new code, the new fields are still available. Note that preservation of unknown fields is currently not available for Python.

How to dynamically build a new protobuf from a set of already defined descriptors?

At my server, we receive Self Described Messages (as defined here... which, by the way, wasn't all that easy, as there aren't any 'good' examples of this in C++).
At this point I am having no issue creating messages from these self-described ones. I can take the FileDescriptorSet, go through each FileDescriptorProto, adding each to a DescriptorPool (using BuildFile, which also gives me every defined FileDescriptor).
From here I can create any of the messages which were defined in the FileDescriptorSet with a DynamicMessageFactory instanced with the DP and calling GetPrototype (which is very easy to do as our SelfDescribedMessage required the messages full_name() and thus we can call the FindMessageTypeByName method of the DP, giving us the properly encoded Message Prototype).
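In code, that pipeline looks roughly like this (a sketch; fd_set, full_name, and payload stand for the pieces carried by the self-described message):
google::protobuf::DescriptorPool pool;
for (int i = 0; i < fd_set.file_size(); ++i) {
    pool.BuildFile(fd_set.file(i));  // cross-links and returns each FileDescriptor
}
const google::protobuf::Descriptor* desc = pool.FindMessageTypeByName(full_name);
google::protobuf::DynamicMessageFactory factory(&pool);
std::unique_ptr<google::protobuf::Message> msg(factory.GetPrototype(desc)->New());
msg->ParseFromString(payload);  // the embedded message bytes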
The question is how can I take each already defined Descriptor or message and dynamically BUILD a 'master' message that contains all of the defined messages as nested messages. This would primarily be used for saving the current state of the messages. Currently we're handling this by just instancing a type of each message in the server (to keep a central state across different programs). But when we want to 'save off' the current state, we're forced to stream them to disk as defined here. They're streamed one message at a time (with a size prefix). We'd like to have ONE message (one to rule them all) instead of the steady stream of separate messages. This could be used for other things once it is worked out (network-based shared state with optimized and easy serialization).
Since we already have the cross-linked and defined Descriptors, one would think there would be an easy way to build 'new' messages from those already defined ones. So far the solution has eluded us. We've tried creating our own DescriptorProto and adding new fields of the type from our already defined Descriptors but got lost (haven't deep-dived into this one yet). We've also looked at possibly adding them as extensions (unknown at this time how to do so). Do we need to create our own DescriptorDatabase (also unknown at this time how to do so)?
Any insights?
Linked example source on BitBucket.
Hopefully this explanation will help.
I am attempting to dynamically build a Message from a set of already defined Messages. The set of already defined messages are created by using the "self-described" method explained (briefly) in the official C++ protobuf tutorial (i.e. these messages are not available in compiled form). This newly defined message will need to be created at runtime.
I have tried using the straight Descriptors for each message and attempted to build a FileDescriptorProto. I have also tried looking at the DescriptorDatabase methods. Both with no luck. I am currently attempting to add these defined messages as an extension to another message (even though at compile time those defined messages and their 'descriptor set' were not classified as extending anything), which is where the example code starts.
You need a protobuf::DynamicMessageFactory:
{
    using namespace google;

    protobuf::DynamicMessageFactory dmf;
    protobuf::Message* actual_msg = dmf.GetPrototype(some_desc)->New();

    const protobuf::Reflection* refl = actual_msg->GetReflection();
    const protobuf::FieldDescriptor* fd = some_desc->FindFieldByName("someField");
    refl->SetString(actual_msg, fd, "whee");
    // ...
    cout << actual_msg->DebugString() << endl;
}
I was able to solve this problem by dynamically creating a .proto file and loading it with an Importer.
The only requirement is for each client to send across its proto file (only needed at init... not during full execution). The server then saves each proto file to a temp directory. An alternative, if possible, is to just point the server to a central location that holds all of the needed proto files.
This was done by first using a DiskSourceTree to map actual path locations to in-program virtual ones, then building the .proto file to import every proto file that was sent across AND define an optional field in a 'master message'.
After the master.proto has been saved to disk, I import it with the Importer. Now, using the Importer's DescriptorPool and a DynamicMessageFactory, I'm able to reliably generate the whole message under one message. I will be putting an example of what I am describing up later on tonight or tomorrow.
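A sketch of that Importer setup; the directory, the error collector, and the message name are placeholders:
google::protobuf::compiler::DiskSourceTree source_tree;
source_tree.MapPath("", "/tmp/protos");  // virtual root -> directory holding the saved .proto files

MyErrorCollector errors;  // hypothetical MultiFileErrorCollector subclass
google::protobuf::compiler::Importer importer(&source_tree, &errors);
const google::protobuf::FileDescriptor* file = importer.Import("master.proto");

const google::protobuf::Descriptor* master_desc =
    importer.pool()->FindMessageTypeByName("MasterMessage");  // hypothetical name
google::protobuf::DynamicMessageFactory factory;
std::unique_ptr<google::protobuf::Message> master(
    factory.GetPrototype(master_desc)->New());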
If anyone has any suggestions on how to make this process better or how to do it different, please say so.
I will be leaving this question unanswered up until the bounty is about to expire just in case someone else has a better solution.
What about serializing all the messages into strings, and making the master message a sequence of (byte) strings, a la
message MessageSet
{
    required FileDescriptorSet proto_files = 1;
    repeated bytes serialized_sub_message = 2;
}
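Populating it is then a loop over the live instances (a sketch; tracked and fd_set are hypothetical names for state you already have):
MessageSet set;
*set.mutable_proto_files() = fd_set;  // the FileDescriptorSet you already hold
for (const google::protobuf::Message* m : tracked) {
    set.add_serialized_sub_message(m->SerializeAsString());
}
std::string blob = set.SerializeAsString();  // one message holding everything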

How to parse a message into the DynamicMessage class and then do the iteration through the fields?

Here's what I am trying to figure out; their docs don't explain this well enough, at least to me.
Scenario:
I have 5 proto files that I generate with protoc for C++. My application needs to receive a message and then be able to iterate through all the fields while accessing their values and names.
What I would like to do is parse a message into the DynamicMessage class and then do the iteration through the fields. This way I don't have to know exactly what message it is and I can handle them all in a single generic way.
I know it's possible to handle the messages by parsing them to their specific type then treating them as their Message base class but for my application that is not desirable.
It looks like what I want to do should be possible via the "--descriptor_set_out" and dynamic message class.
What I've Tried (And Failed With):
I moved the descriptor.proto into the folder with my protos and included it alongside my others in the compilation step. I also set the --descriptor_set_out flag to print to a file "my_descriptors.pb.ds".
I have no idea where to proceed from there.
Here's what I've referenced, although there isn't much...
Sorry for the long post, and somewhat vague topic naming schema.
Also, in case it wasn't clear, I assume the messages aren't "Unknown." I assume there will still be the requirement of including the respective headers for each proto so my code knows about the 'unknown' message it's handling.
The most common way is to use message composition. Something like:
message Foo {...}
message Bar {...}

message GenericMessage {
    enum Type { FOO = 1; BAR = 2; }
    optional Foo foo = 1;
    optional Bar bar = 2;
}
If you make sure that exactly one of either Foo or Bar is present in each GenericMessage, you get the desired behaviour. You read one GenericMessage and then process it as one of several specific messages.
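Dispatching then comes down to a couple of has_ checks; a sketch, with ProcessFoo/ProcessBar as hypothetical handlers:
GenericMessage generic;
if (generic.ParseFromArray(data, data_len)) {
    if (generic.has_foo()) {
        ProcessFoo(generic.foo());
    } else if (generic.has_bar()) {
        ProcessBar(generic.bar());
    }
}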
Think about refactoring the protocol. If all you need to do is iterate over the fields, maybe you'd be better off with something like a simple key-value map:
message ValueMessage {
    required string key = 1;
    optional int32 IntValue = 2;
    optional string StringValue = 3;
    optional bool BoolValue = 4;
    // ...
}

message GenericMessage {
    repeated ValueMessage values = 1;
}
Or maybe you can refactor your protocol some other way.
Warning: my answer is not completely correct; I am having some compilation errors regarding conflicts, and I will edit when I fix them :). But this is a starting point.
It might have been a long time since this question was posted, but I faced something similar recently while working with Protocol Buffers.
First of all, the reference is wrong: the option that must be added to the protoc command is:
--descriptor_set_out=<file>
where <file> is the path to which protoc writes the FileDescriptorSet (the compiled descriptors, as defined in descriptor.proto) for your .proto files.
After this, you will have to import descriptor.proto in your self-describing .proto file:
message MyMessage
{
    required google.protobuf.FileDescriptorSet proto_files = 1;
    // ...
}
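As for the iteration part of the original question: once you have any google::protobuf::Message, whether generated or created through a DynamicMessageFactory, the reflection API can walk the fields that are present. A minimal sketch:
const google::protobuf::Reflection* refl = msg.GetReflection();
std::vector<const google::protobuf::FieldDescriptor*> fields;
refl->ListFields(msg, &fields);  // only fields actually set in `msg`
for (const google::protobuf::FieldDescriptor* fd : fields) {
    std::cout << fd->name() << " (" << fd->type_name() << ")" << std::endl;
}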