register ErrorCollector or intercept parse errors for wire format? - c++

It is possible to define a custom ErrorCollector class for handling google::protobuf parsing errors:
struct ErrorCollector : ::google::protobuf::io::ErrorCollector
{
    void AddError(int line, int column, const std::string& message) override
    {
        // log error
    }
    void AddWarning(int line, int column, const std::string& message) override
    {
        // log warning
    }
};
When parsing from a text file, you can use the protobuf TextFormat class and register your custom ErrorCollector
::google::protobuf::io::IstreamInputStream input_stream(&file);
::google::protobuf::TextFormat::Parser parser;
ErrorCollector error_collector;
parser.RecordErrorsTo(&error_collector);
if (parser.Parse(&input_stream, &msg))
{
    // handle msg
}
For parsing wire format, I currently use Message::ParseFromArray
if (msg.ParseFromArray(data, data_len))
{
    // handle msg
}
This doesn't allow me to specify a custom ErrorCollector though.
I've searched through the source code, but as of yet have been unable to find if this is possible.
Is it possible to use an ErrorCollector when parsing wire format?
Is there another way to intercept parse errors and make them available to client code?

There are essentially two ways that parsing the wire format could fail:
1. The bytes are not a valid protobuf (e.g. they are corrupted, or in a totally different format).
2. A required field is missing.
For case 1, protobuf does not give you any more information than "it's invalid". This is partly for code simplicity (and speed), but it is also partly because any attempt to provide more information usually turns out more misleading than helpful. Detailed error reporting is useful for text format because text is often written by humans, but machines make very different kinds of errors. In some languages, protobuf actually reports specific errors like "end-group tag does not match start-group tag". In the vast majority of cases, this error really just means "the bytes are corrupted", but inevitably people think the error is trying to tell them something deeper which they do not understand. They then post questions to Stack Overflow like "How do I make sure my start-group and end-group tags match?" when they really should be comparing bytes between their source and destination to narrow down where they got corrupted. Even reporting the byte position where the parse error occurred is not very useful: protobuf is a dense encoding, so many random corrupt byte sequences will parse successfully, and the parser may only notice a problem somewhere later down the line rather than at the point where things actually went wrong.
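To make the "dense encoding" point concrete, here is a rough sketch of a framing check for the wire format. It is not protobuf's actual parser: IsWellFormedWire and DemoCorruption are hypothetical names, groups are rejected for brevity, and only the top-level tag/length structure is inspected. It only illustrates that a damaged byte often leaves the buffer structurally valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified framing check for protobuf wire format (hypothetical helper,
// not part of the protobuf library). Groups (wire types 3 and 4) are
// rejected for brevity.
inline bool IsWellFormedWire(const std::vector<std::uint8_t>& buf) {
    std::size_t pos = 0;
    // Reads one base-128 varint at pos; false if truncated or too long.
    auto read_varint = [&](std::uint64_t* value) {
        *value = 0;
        for (int shift = 0; shift < 64 && pos < buf.size(); shift += 7) {
            const std::uint8_t byte = buf[pos++];
            *value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
            if ((byte & 0x80) == 0) return true;
        }
        return false;
    };
    while (pos < buf.size()) {
        std::uint64_t tag = 0, payload = 0;
        if (!read_varint(&tag) || (tag >> 3) == 0) return false;  // field 0 is illegal
        const std::size_t left = buf.size() - pos;
        switch (tag & 7) {
            case 0:  // varint value
                if (!read_varint(&payload)) return false;
                break;
            case 1:  // fixed 64-bit value
                if (left < 8) return false;
                pos += 8;
                break;
            case 2:  // length-delimited value
                if (!read_varint(&payload)) return false;
                if (buf.size() - pos < payload) return false;
                pos += static_cast<std::size_t>(payload);
                break;
            case 5:  // fixed 32-bit value
                if (left < 4) return false;
                pos += 4;
                break;
            default:
                return false;
        }
    }
    return true;
}

// Usage sketch: "field 1 = 150", then the same bytes with one byte damaged.
inline void DemoCorruption() {
    assert(IsWellFormedWire({0x08, 0x96, 0x01}));   // original message
    assert(IsWellFormedWire({0x10, 0x96, 0x01}));   // tag damaged: now "field 2 = 150"
    assert(IsWellFormedWire({0x08, 0x96, 0x02}));   // value damaged: now "field 1 = 278"
    assert(!IsWellFormedWire({0x08, 0x96}));        // truncation is detected
}
```

Both damaged buffers still pass, which is why an "error at byte N" report would rarely point at the byte that was actually corrupted.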
The one case that clearly is useful to distinguish is case 2 (missing required fields) -- at least, if you use required fields (I personally recommend avoiding them). There are a couple options here:
Normally, required field checks write errors to the console (on stderr). You can intercept these and record them your own way using SetLogHandler, but this doesn't give you structured information, only text messages.
To check required fields more programmatically, you can separate required field checking from parsing. Use MessageLite::ParsePartialFromArray() or one of the other Partial parsing methods to parse a message while ignoring the absence of required fields. You can then use MessageLite::IsInitialized() to check whether all required fields are set. If it returns false, use Message::FindInitializationErrors() to get a list of paths of all required fields that are missing.
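The second option could look roughly like the sketch below. ParseReport, ParseAndReport and StandInMessage are names I made up for illustration. The helper is a template so that the snippet compiles without the protobuf headers; in real code you would pass a generated class derived from google::protobuf::Message, which provides the three calls used here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Result of a two-step parse: wire-format validity and required-field
// completeness are reported separately.
struct ParseReport {
    bool wire_ok = false;                     // bytes decoded as a message
    std::vector<std::string> missing_fields;  // paths of unset required fields
    bool ok() const { return wire_ok && missing_fields.empty(); }
};

// Works with any type offering ParsePartialFromArray(), IsInitialized()
// and FindInitializationErrors().
template <typename Msg>
ParseReport ParseAndReport(Msg& msg, const void* data, int size) {
    assert(size >= 0);
    ParseReport report;
    // Step 1: decode while ignoring absent required fields.
    report.wire_ok = msg.ParsePartialFromArray(data, size);
    if (!report.wire_ok) return report;  // case 1: no further detail exists
    // Step 2: check required fields on their own.
    if (!msg.IsInitialized()) {
        msg.FindInitializationErrors(&report.missing_fields);
    }
    return report;
}

// Hypothetical stand-in so the sketch builds without protobuf. It treats a
// leading 0xFF byte as corrupt and an empty buffer as "id was never set".
struct StandInMessage {
    bool has_id = false;
    bool ParsePartialFromArray(const void* data, int size) {
        const unsigned char* bytes = static_cast<const unsigned char*>(data);
        if (size > 0 && bytes[0] == 0xFF) return false;
        has_id = size > 0;
        return true;
    }
    bool IsInitialized() const { return has_id; }
    void FindInitializationErrors(std::vector<std::string>* errors) const {
        if (!has_id) errors->push_back("id");
    }
};
```

Client code can then log or return report.missing_fields as structured data instead of scraping the text that goes to stderr.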

Related

xerces_3_1 is able to create invalid xml at comments & processing instructions

I've encountered a problem using the xerces-dom library:
When you add comments to the XML tree like this:
DOMDocument* doc = impl->createDocument(0, L"root", 0);
DOMElement* root = doc->getDocumentElement();
DOMComment* com1 = doc->createComment(L"SetA -- DataA");
DOMComment* com2 = doc->createComment(L"SetB -- DataB");
doc->insertBefore(com1, root);
doc->insertBefore(com2, root);
That will create the following xml-tree:
<?xml version="1.0" encoding="UTF-8" standalone="false"?>
<!--SetA -- DataA-->
<!--SetB -- DataB-->
<root/>
which is indeed invalid xml.
The same can be done with processing instructions by using ?> as data:
DOMProcessingInstruction* procInstr = doc->createProcessingInstruction(L"target", L"?>");
My question:
Is there a way I can configure Xerces not to create these kinds of comments, or do I have to check for these things myself?
And my other question: why isn't it possible to just always escape characters like <>&'", even in comments and processing instructions, in order to avoid these kinds of problems?
A DOMDocument is not an XML document. It is supposed to represent one, but it is conceivable that a valid DOM may not be serializable into a valid XML document (the converse should be less likely). Indeed this appears to be the case here:
Neither the Level 1 nor the Level 2 spec says anything about this, but the Level 3 DOM specification added this sentence about the DOMComment interface:
No lexical check is done on the content of a comment and it is therefore possible to have the character sequence "--" (double-hyphen) in the content, which is illegal in a comment per section 2.5 of [XML 1.0]. The presence of this character sequence must generate a fatal error during serialization.
So Xerces is operating within the DOM Level 3 specification even if it accepts a comment with '--' in it, as long as it bombs if you go to serialize it.
Not a great situation, but it makes sense because DOM was originally intended to represent XML Documents that have been read in, not to create new ones. So it is liberal in what it can represent. Fine for reading - a DOMComment can represent anything (and more) the XML document can, but a bit annoying that it doesn't catch the invalid string when you createComment().
Checking DOMDocumentImpl.cpp we see:
DOMComment *DOMDocumentImpl::createComment(const XMLCh *data)
{
    return new (this, DOMMemoryManager::COMMENT_OBJECT) DOMCommentImpl(this, data);
}
And in DOMCommentImpl.cpp we have just:
DOMCommentImpl::DOMCommentImpl(DOMDocument *ownerDoc, const XMLCh *dat)
    : fNode(ownerDoc), fCharacterData(ownerDoc, dat)
{
    fNode.setIsLeafNode(true);
}
Finally we see in DOMCharacterDataImpl.cpp that there is no chance of validation up front - it just saves the user-provided string without checking it:
DOMCharacterDataImpl::DOMCharacterDataImpl(DOMDocument *doc, const XMLCh *dat)
{
    fDoc = (DOMDocumentImpl*)doc;
    XMLSize_t len=XMLString::stringLen(dat);
    fDataBuf = fDoc->popBuffer(len+1);
    if (!fDataBuf)
        fDataBuf = new (fDoc) DOMBuffer(fDoc, len+15);
    fDataBuf->set(dat, len);
}
Sadly, no: Xerces does not have an option or even a nice hook to check this for you. And because the Level 3 spec seems to demand that "No lexical check is done", it probably isn't even legal to add one.
Your second question is simpler to answer: because that's the way they chose to define it. See the XML 1.1 spec for example:
Comments
[15] Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
It is similar for PIs.
The grammar simply does not allow for escapes. Seems about right: baroque and broke.
Maybe there is a way to catch the error on serialization or normalization, but I wasn't able to confirm whether Xerces 3.1 can. To be safe I think the best way is to wrap createComment() and check for it before creating the node, or walk the tree and check it yourself.
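A minimal sketch of such a pre-check, assuming you call it yourself before createComment() or createProcessingInstruction(). The function names are hypothetical and nothing here is Xerces API. The checks are templated on the character type so they work with whatever string type your XMLCh data lives in. They only cover the "--", trailing "-" and "?>" rules, not the full Char production.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if `data` can be written as XML comment content. Production [15]
// requires every '-' to be followed by a non-'-' character, which rules
// out "--" anywhere and a '-' at the very end.
template <typename CharT>
bool IsSerializableComment(const std::basic_string<CharT>& data) {
    const CharT dash = static_cast<CharT>('-');
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] != dash) continue;
        if (i + 1 == data.size() || data[i + 1] == dash) return false;
    }
    return true;
}

// True if `data` can be written as processing-instruction content:
// it must not contain the terminator "?>".
template <typename CharT>
bool IsSerializablePIData(const std::basic_string<CharT>& data) {
    const CharT terminator[] = {static_cast<CharT>('?'),
                                static_cast<CharT>('>'), CharT()};
    return data.find(terminator) == std::basic_string<CharT>::npos;
}

// Usage sketch: both strings from the question are rejected.
inline void DemoChecks() {
    assert(!IsSerializableComment(std::wstring(L"SetA -- DataA")));
    assert(!IsSerializablePIData(std::wstring(L"?>")));
    assert(IsSerializableComment(std::wstring(L"SetA - DataA")));
}
```

What to do on rejection (throw, replace "--" with something else, or drop the comment) is an application decision, since the XML grammar offers no escape mechanism here.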

protobuf required field and default value

I am new to protobuf and I have started considering the following trivial example
message Entry {
required int32 id = 1;
}
used by the c++ code
#include <iostream>
#include <string>
#include "example.pb.h"
int main() {
    std::string mySerialized;
    {
        Entry myEntry;  // id is deliberately left unset
        std::cout << "Serialization successful "
                  << myEntry.SerializeToString(&mySerialized) << std::endl;
        std::cout << mySerialized.size() << std::endl;
    }
    Entry myEntry;
    std::cout << "Deserialization successful "
              << myEntry.ParseFromString(mySerialized) << std::endl;
}
Even if the "id" field is required, since it has not been set, the size of the serialization buffer is 0 (??).
When I deserialize the message an error occurs:
[libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type "Entry" because it is missing required fields: id
Is it a normal behavior?
Francesco
ps- If I initialize "id" with the value 0, the behavior is different
pps- The SerializeToString function returns true, the ParseFromString returns false
I don't think I exactly understand your question, but I'll have a go at the answer anyway. Hope this helps you in some way or the other :)
Yes, this is normal behavior. You should mark a field as required only if it is essential to the message, which makes sense semantically (why would you skip a required field?). To enforce this, protobuf refuses to parse a message that lacks one.
It sees that the field with number 1 is required and that has_id() returns false, so it won't parse the message at all.
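As for the zero-length buffer and the different behavior once id is set to 0: an unset field writes nothing to the wire, while an explicitly set field writes a tag plus its value, even when the value is 0. The sketch below hand-rolls that encoding for Entry. EncodeEntry and AppendVarint are hypothetical helpers written for this illustration, not generated protobuf code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends `value` as a base-128 varint: 7 payload bits per byte, least
// significant group first, high bit set on every byte except the last.
inline void AppendVarint(std::uint64_t value, std::string* out) {
    while (value >= 0x80) {
        out->push_back(static_cast<char>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out->push_back(static_cast<char>(value));
}

// Hand-rolled stand-in for the bytes that serializing `Entry` yields.
// `has_id` mirrors the generated has_id() flag.
inline std::string EncodeEntry(bool has_id, std::int32_t id) {
    std::string out;
    if (!has_id) return out;  // unset field: nothing is written
    const std::uint32_t field_number = 1;
    const std::uint32_t wire_type_varint = 0;
    AppendVarint((field_number << 3) | wire_type_varint, &out);  // tag 0x08
    // int32 values are sign-extended to 64 bits before varint encoding.
    AppendVarint(static_cast<std::uint64_t>(static_cast<std::int64_t>(id)), &out);
    return out;
}

// Usage sketch matching the two situations in the question.
inline void DemoEntry() {
    assert(EncodeEntry(false, 0).empty());     // id never set: zero bytes
    assert(EncodeEntry(true, 0).size() == 2);  // id set to 0: 0x08 0x00
}
```

So the empty buffer is a faithful encoding of "no fields set", and the parser rejects it only because the required-field check runs after decoding.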
In the developer guide it is advised not to use required fields.
Required Is Forever: You should be very careful about marking fields as required. If at some point you wish to stop writing or sending a required field, it will be problematic to change the field to an optional field – old readers will consider messages without this field to be incomplete and may reject or drop them unintentionally. You should consider writing application-specific custom validation routines for your buffers instead. Some engineers at Google have come to the conclusion that using required does more harm than good; they prefer to use only optional and repeated. However, this view is not universal.
Also
Any new fields that you add should be optional or repeated. This means that any messages serialized by code using your "old" message format can be parsed by your new generated code, as they won't be missing any required elements. You should set up sensible default values for these elements so that new code can properly interact with messages generated by old code. Similarly, messages created by your new code can be parsed by your old code: old binaries simply ignore the new field when parsing. However, the unknown fields are not discarded, and if the message is later serialized, the unknown fields are serialized along with it – so if the message is passed on to new code, the new fields are still available. Note that preservation of unknown fields is currently not available for Python

Xerces: How to check the validity of an XML file using ErrorHandler

I am trying to determine if a given XML file is valid (has proper syntax and structure), and I am using Xerces. I have been able to successfully read proper files, but when I give it files with incorrect syntax, no errors are thrown.
I have been fishing around and found out that I might have to use an error handler, set via setErrorHandler, to catch the errors instead of the traditional try-throw-catch exception handling.
The problem that I am having though is that I am very confused how to declare the proper handler, set it to my parser and then read the errors if there are any that show up.
Is there any chance somebody could shed some light on my situation?
// #input_parameter from function: const string & xmlConfigArg
xercesc::DOMDocument* doc = NULL;
string xmlConfig(xmlConfigArg);
Handler handler; // I'm not sure what type of handler to use
_parser->setErrorHandler(&handler);
try {
    _parser->parse(xmlConfigArg.c_str());
    doc = _parser->getDocument();
} catch (...) {
    // Nothing is ever caught here
}
You need to derive a class from ErrorHandler (<xercesc/sax/ErrorHandler.hpp>) and override all of its virtual methods (warning, error, fatalError and resetErrors).
After doing so, you can read the error information from the class you created. Parse errors are then reported to your handler instead of being thrown as exceptions, so you can drop the try/catch block (or keep it for a different use).

How to dynamically build a new protobuf from a set of already defined descriptors?

At my server, we receive Self Described Messages (as defined here... which btw wasn't all that easy as there aren't any 'good' examples of this in c++).
At this point I am having no issue creating messages from these self-described ones. I can take the FileDescriptorSet, go through each FileDescriptorProto, adding each to a DescriptorPool (using BuildFile, which also gives me every defined FileDescriptor).
From here I can create any of the messages which were defined in the FileDescriptorSet with a DynamicMessageFactory instantiated with the DescriptorPool (DP) and calling GetPrototype (which is very easy to do, as our SelfDescribedMessage requires the message's full_name() and thus we can call the FindMessageTypeByName method of the DP, giving us the properly encoded Message prototype).
The question is how can I take each already defined Descriptor or message and dynamically BUILD a 'master' message that contains all of the defined messages as nested messages. This would primarily be used for saving the current state of the messages. Currently we're handling this by just instancing a type of each message in the server(to keep a central state across different programs). But when we want to 'save off' the current state, we're forced to stream them to disk as defined here. They're streamed one message at a time (with a size prefix). We'd like to have ONE message (one to rule them all) instead of the steady stream of separate messages. This can be used for other things once it is worked out (network based shared state with optimized and easy serialization)
Since we already have the cross-linked and defined Descriptors, one would think there would be an easy way to build 'new' messages from those already defined ones. So far the solution has eluded us. We've tried creating our own DescriptorProto and adding new fields of the type from our already defined Descriptors but got lost (we haven't taken a deep dive into this one yet). We've also looked at possibly adding them as extensions (unknown at this time how to do so). Do we need to create our own DescriptorDatabase (also unknown at this time how to do so)?
Any insights?
Linked example source on BitBucket.
Hopefully this explanation will help.
I am attempting to dynamically build a Message from a set of already defined Messages. The set of already defined messages is created by using the "self-described" method explained (briefly) in the official C++ protobuf tutorial (i.e. these messages are not available in compiled form). This newly defined message will need to be created at runtime.
I have tried using the straight Descriptors for each message and attempted to build a FileDescriptorProto. I have also tried looking at the DescriptorDatabase methods. Both with no luck. I am currently attempting to add these defined messages as an extension to another message (even though at compile time those defined messages, and their 'descriptor-set', were not classified as extending anything), which is where the example code starts.
You need a protobuf::DynamicMessageFactory:
{
    using namespace google;
    protobuf::DynamicMessageFactory dmf;
    // some_desc is the protobuf::Descriptor* of the message type to instantiate
    protobuf::Message* actual_msg = dmf.GetPrototype(some_desc)->New();
    const protobuf::Reflection* refl = actual_msg->GetReflection();
    const protobuf::FieldDescriptor* fd = some_desc->FindFieldByName("someField");
    refl->SetString(actual_msg, fd, "whee");
    ...
    std::cout << actual_msg->DebugString() << std::endl;
}
I was able to solve this problem by dynamically creating a .proto file and loading it with an Importer.
The only requirement is that each client sends across its proto file (only needed at init, not during full execution). The server then saves each proto file to a temp directory. An alternative, if possible, is to just point the server to a central location that holds all of the needed proto files.
This was done by first using a DiskSourceTree to map actual path locations to in-program virtual ones, then building the .proto file so that it imports every proto file that was sent across AND defines an optional field in a 'master message'.
After the master.proto has been saved to disk, I import it with the Importer. Now, using the Importer's DescriptorPool and a DynamicMessageFactory, I'm able to reliably generate the whole message under one message. I will be putting an example of what I am describing up later on tonight or tomorrow.
If anyone has any suggestions on how to make this process better or how to do it different, please say so.
I will be leaving this question unanswered up until the bounty is about to expire just in case someone else has a better solution.
What about serializing all the messages into strings, and making the master message a sequence of (byte) strings, a la
message MessageSet
{
    required FileDescriptorSet proto_files = 1;
    repeated bytes serialized_sub_message = 2;
}

How to parse a message into the DynamicMessage class and then do the iteration through the fields?

Here's what I am trying to figure out, their docs don't explain this well enough, at least to me..
Scenario:
I have 5 proto files that I generate with protoc for C++. My application needs to receive a message and then be able to iterate through all the fields while accessing their values and names.
What I would like to do is parse a message into the DynamicMessage class and then do the iteration through the fields. This way I don't have to know exactly what message it is and I can handle them all in a single generic way.
I know it's possible to handle the messages by parsing them to their specific type then treating them as their Message base class but for my application that is not desirable.
It looks like what I want to do should be possible via the "--descriptor_set_out" and dynamic message class.
What I've Tried (And Failed With):
I moved the descriptor.proto into the folder with my protos and included it alongside my others in the compilation step. I also set the --descriptor_set_out flag to print to a file "my_descriptors.pb.ds".
I have no idea where to proceed from there.
Here's what I've referenced, although there isn't much...
Sorry for the long post, and somewhat vague topic naming schema.
Also, in case it wasn't clear, I assume the messages aren't "Unknown." I assume there will still be the requirement of including the respective headers for each proto so my code knows about the 'unknown' message it's handling.
The most common way is to use message composition. Something like:
message Foo {...}
message Bar {...}
message GenericMessage {
    enum Type { FOO = 1; BAR = 2; }
    optional Foo foo = 1;
    optional Bar bar = 2;
}
If you make sure that exactly one of either Foo or Bar is present in each GenericMessage, you get the desired behaviour. You read one GenericMessage and then process it as one of several specific messages.
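The "exactly one is present" rule is not enforced by protobuf itself, so the reader has to check it. A small sketch of that check follows. WhichPayload, Payload and StandInGeneric are hypothetical names. The template only relies on has_foo() and has_bar(), which protoc generates for the two optional fields above, so it is shown here against a stand-in type to keep the snippet self-contained.

```cpp
#include <cassert>
#include <stdexcept>

enum class Payload { kFoo, kBar };

// Reports which sub-message a GenericMessage carries and rejects messages
// that carry none or both.
template <typename Generic>
Payload WhichPayload(const Generic& msg) {
    const bool foo = msg.has_foo();
    const bool bar = msg.has_bar();
    if (foo == bar) {
        throw std::invalid_argument(
            "GenericMessage must carry exactly one of foo or bar");
    }
    return foo ? Payload::kFoo : Payload::kBar;
}

// Hypothetical stand-in for the generated GenericMessage class.
struct StandInGeneric {
    bool foo_set = false;
    bool bar_set = false;
    bool has_foo() const { return foo_set; }
    bool has_bar() const { return bar_set; }
};

// Usage sketch: a message with only `bar` set dispatches to the Bar branch.
inline void DemoDispatch() {
    StandInGeneric g;
    g.bar_set = true;
    assert(WhichPayload(g) == Payload::kBar);
}
```

A switch on the returned value is then the single place where the generic message fans out into type-specific handling.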
Think about refactoring the protocol. If all you need to do is iterate over the fields, maybe you'd be better off with something like a simple key-value map:
message ValueMessage {
    required string key = 1;
    optional int32 IntValue = 2;
    optional string StringValue = 3;
    optional bool BoolValue = 4;
    ...
}
message GenericMessage {
    repeated ValueMessage values = 1;
}
Or maybe you can refactor your protocol some other way.
Warning: my answer is not completely correct. I am getting some compilation errors regarding conflicts; I will edit when I fix it :). But this is a starting point.
It might have been a long time since this question was posted, but I faced something similar these days while working with Protocol Buffers.
First of all, the reference is wrong. The option that must be added to the command is:
--descriptor_set_out=<File>
where File is the output file to which protoc writes the FileDescriptorSet (the message type defined in descriptor.proto) describing your .proto files.
After this you will have to add the reference to the descriptor.proto file in your self-describing .proto file.
message MyMessage
{
    required google.protobuf.FileDescriptorSet proto_files = 1;
    ...
}