OBJ Parser with Boost Spirit - Ignoring comments - c++

I'm trying to write a basic OBJ file loader using the Boost Spirit library. Although I got it working using the standard std::ifstreams, I'm wondering if it's possible to do a phrase_parse on the entire file using a memory mapped file, since it seems to provide the best performance as posted here.
I have the following code, which seems to work well, but it breaks when there is a comment in the file. So, my question is: how do you ignore a comment that starts with a '#' in the OBJ file using Spirit?
struct vertex {
    double x, y, z;
};

BOOST_FUSION_ADAPT_STRUCT(
    vertex,
    (double, x)
    (double, y)
    (double, z)
)

std::vector<vertex> b_vertices;

boost::iostreams::mapped_file mmap(
    path,
    boost::iostreams::mapped_file::readonly);
const char* f = mmap.const_data();
const char* l = f + mmap.size();

using namespace boost::spirit::qi;
bool ok = phrase_parse(f, l,
    (("v" >> double_ >> double_ >> double_) |
     ("vn" >> double_ >> double_ >> double_)) % eol,
    blank, b_vertices);
The above code works well when there are no comments or any other data except vertices/normals. But when there is a different type of data the parser fails (as it should), and I'm wondering if there is a way to make it work without going back to parsing line by line, as that is slower (almost 2.5x in my tests). Thank you!

The simplest way that comes to mind is to simply make comments skippable:
bool ok = qi::phrase_parse(
    f, l,
    (
        ("v" >> qi::double_ >> qi::double_ >> qi::double_) |
        ("vn" >> qi::double_ >> qi::double_ >> qi::double_)
    )
    % qi::eol,
    ('#' >> *(qi::char_ - qi::eol) >> qi::eol | qi::blank), b_vertices);
Note that this also 'recognizes' comments if # appears somewhere inside the line. This is probably just fine (as it would make the parsing fail, unless it was a comment trailing on an otherwise valid input line).
See it Live on Coliru
Alternatively, use some phoenix magic to handle "comment lines" just as you handle a "vn" or "v" line.

I realize that my post doesn't directly address the code in question, but I'm for not reinventing the wheel when possible, and I would have wanted to know about this library. I was working on a handwritten OBJ/Wavefront loader, but in my research I found Tiny Obj Loader. This library is written in C++ with no dependencies except the C++ STL. It handles the edge cases of the Wavefront spec fairly well and is very fast. The only thing the user has to do is convert the Tiny OBJ objects into their own types. TinyObjLoader has been adopted by quite a number of projects as well. I apologize for not directly answering the question; my goal is to get knowledge of this great library out.

Related

Efficiently reading two comma-separated floats in brackets from a string without being affected by the global locale

I am a developer of a library, and our old code uses sscanf() and sprintf() to read/write a variety of internal types from/to strings. We have had issues with users of our library whose locale differed from the one we based our XML files on (the "C" locale). In our case this resulted in incorrect values being parsed from those XML files and from strings submitted at run-time. The locale may be changed by a user directly, but it can also be changed without the user's knowledge. This can happen if the locale change occurs inside another library, such as GTK, which was the "perpetrator" in one bug report. Therefore, we obviously want to remove any dependency on the locale to permanently free ourselves from these issues.
I have already read other questions and answers in the context of float/double/int/... especially if they are separated by a character or located inside brackets, but so far the proposed solutions I found were not satisfying to us. Our requirements are:
No dependencies on libraries other than the standard library. Using anything from boost is therefore, for example, not an option.
Must be thread-safe. This specifically concerns the locale, which can be changed globally. This is really awful for us, as a thread of our library can be affected by another thread in the user's program, which may also be running code of a completely different library. Anything affected by setlocale() directly is therefore not an option. Also, setting the locale before starting to read/write and setting it back to the original value afterwards is not a solution, due to race conditions between threads.
While efficiency is not the topmost priority (#1 & #2 are), it is still definitely of our concern, as strings may be read and written in run-time quite frequently, depending on the user's program. The faster, the better.
Edit: As an additional note: boost::lexical_cast is not guaranteed to be unaffected by the locale (source: Locale invariant guarantee of boost::lexical_cast<>). So that would not be a solution even without requirement #1.
I gathered the following information so far:
First of all, what I saw being suggested a lot is using boost's lexical_cast but unfortunately this is not an option for us as at all, as we can't require all users to also link to boost (and because of the lacking locale-safety, see above). I looked at the code to see if we can extract anything from it but I found it difficult to understand and too large in length, and most likely the big performance-gainers are using locale-dependent functions anyways.
Many functions introduced in C++11, such as std::to_string, std::stod, std::stof, etc. depend on the global locale just the way sscanf and sprintf do, which is extremely unfortunate and to me not understandable, considering that std::thread has been added.
std::stringstream seems to be a solution in general, since it is thread-safe in the context of the locale, and in general if guarded right. However, if it is constructed freshly every time, it can be slow (good comparison: http://www.boost.org/doc/libs/1_55_0/doc/html/boost_lexical_cast/performance.html). I assume this can be solved by having one such stream configured and available per thread, clearing it each time after usage. However, a problem is that it doesn't handle formats as easily as sscanf() does, for example: " { %g , %g } ".
sscanf() patterns that we, for example, need to be able to read are:
" { %g , %g }"
" { { %g , %g } , { %g , %g } }"
" { top: { %g , %g } , left: { %g , %g } , bottom: { %g , %g } , right: { %g , %g }"
Writing these with stringstreams seems no big deal, but reading them seems problematic, especially considering the whitespaces.
Should we use std::regex in this context or is this overkill? Are stringstreams a good solution for this task or is there any better way to do this given the mentioned requirements? Also, are there any other problems in the context of thread-safety and locales that I have not considered in my question - especially regarding the usage of std::stringstream?
In your case the stringstream seems to be the best approach, as you can control its locale independently of the global locale. But it's true that formatted reading is not as easy as with sscanf().
From the point of view of performance, stream input combined with regex is overkill for this kind of simple comma-separated reading: in an informal benchmark it was more than 10 times slower than sscanf().
You can easily write a little auxiliary class to facilitate reading the formats you have enumerated (here is the general idea in another SO answer). Use can be as easy as:
sst >> mandatory_input(" { ") >> x >> mandatory_input(" , ") >> y >> mandatory_input(" } ");
If you're interested, I wrote one some time ago. Here is the full article with examples, explanation, and source code. The class is 70 lines of code, most of which provide error-processing functions in case they are needed. It has acceptable performance, but is still slower than sscanf().
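As a minimal illustration of the per-stream locale control mentioned above (the function name here is our own, and this sketch assumes only standard-library facilities):

```cpp
#include <locale>
#include <sstream>
#include <string>

// Pinning a stream to the classic "C" locale makes its numeric parsing
// independent of whatever the global locale happens to be; imbue()
// affects only this stream object.
double parse_double_c_locale(const std::string& s) {
    std::istringstream in(s);
    in.imbue(std::locale::classic());
    double d = 0.0;
    in >> d;
    return d;
}
```

Because imbue() acts on the stream object alone, other threads and setlocale() calls are unaffected.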
Based on the suggestions by Christophe and some other Stack Overflow answers I found, I created a set of two methods and one class that achieve all the stream-parsing functionality we require. The following methods are sufficient to parse the formats proposed in the question.
The first method strips preceding whitespace and then skips an optional character:
template<char matchingCharacter>
std::istream& optionalChar(std::istream& inputStream)
{
    if (inputStream.fail())
        return inputStream;
    inputStream >> std::ws;
    if (inputStream.peek() == matchingCharacter)
        inputStream.ignore();
    else
        // If peek is executed but no further characters remain,
        // the failbit will be set; we want to undo this
        inputStream.clear(inputStream.rdstate() & ~std::ios::failbit);
    return inputStream;
}
The second method strips preceding whitespace and then checks for a mandatory character. If it doesn't match, the failbit will be set:
template<char matchingCharacter>
std::istream& mandatoryChar(std::istream& inputStream)
{
    if (inputStream.fail())
        return inputStream;
    inputStream >> std::ws;
    if (inputStream.peek() == matchingCharacter)
        inputStream.ignore();
    else
        inputStream.setstate(std::ios_base::failbit);
    return inputStream;
}
It makes sense to reuse a global stringstream (call strStream.str(std::string()) and clear() before each usage) to increase performance, as hinted at in my question. With the optional character checks I could make the parsing more lenient towards other styles. Here is an example usage:
// Format is: " { { %g , %g } , { %g , %g } } ", but we are lenient regarding the format,
// so this is also allowed: " { %g %g } { %g %g } "
std::stringstream sstream(inputString);
sstream.clear();
sstream >> optionalChar<'{'> >> mandatoryChar<'{'> >> val1 >>
           optionalChar<','> >> val2 >>
           mandatoryChar<'}'> >> optionalChar<','> >> mandatoryChar<'{'> >> val3 >>
           optionalChar<','> >> val4;
if (sstream.fail())
    logError(inputString);
Addition - Checking for mandatory strings:
Last but not least, I created a class from scratch for checking for mandatory strings in streams, based on the idea by Christophe. Header file:
class MandatoryString
{
public:
    MandatoryString(char const* mandatoryString);

    friend std::istream& operator>> (std::istream& inputStream, const MandatoryString& mandatoryString);

private:
    char const* m_chars;
};
Cpp file:
MandatoryString::MandatoryString(char const* mandatoryString)
    : m_chars(mandatoryString)
{}

std::istream& operator>> (std::istream& inputStream, const MandatoryString& mandatoryString)
{
    if (inputStream.fail())
        return inputStream;
    char const* currentMandatoryChar = mandatoryString.m_chars;
    while (*currentMandatoryChar != '\0')
    {
        static const std::locale spaceLocale("C");
        if (std::isspace(*currentMandatoryChar, spaceLocale))
        {
            inputStream >> std::ws;
        }
        else
        {
            int peekedChar = inputStream.get();
            if (peekedChar != *currentMandatoryChar)
            {
                inputStream.setstate(std::ios::failbit);
                break;
            }
        }
        ++currentMandatoryChar;
    }
    return inputStream;
}
The MandatoryString class is used similarly to the above methods, e.g.:
sstream >> MandatoryString(" left");
Conclusion:
While this solution might be more verbose than sscanf, it gives us all the flexibility we need while letting us use stringstreams, which makes the solution generally thread-safe and independent of the global locale. It is also easy to check for errors: once a failbit is set, parsing halts inside the suggested methods. For very long sequences of values in a string, this can actually become more readable than sscanf: for example, it allows splitting the parsing across multiple lines, with each preceding mandatory string on the same line as its corresponding variable. After overloading the stream operators << and >> for our internal types, everything looks very clean and is easily maintainable. Parsing multiple hexadecimals also works fine; we just reset the previously set std::hex flag to std::dec after the operation is done.
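To illustrate that last point, here is a sketch of what such an operator>> overload might look like for a hypothetical Point type (not from the library in question); expect is a simplified stand-in for the MandatoryString class above, using the same space-means-skip-whitespace convention:

```cpp
#include <istream>
#include <sstream>

// Hypothetical type for illustration; not part of the library in question.
struct Point { double x = 0, y = 0; };

// A space in the pattern means "skip any whitespace"; any other
// character must appear verbatim, or the failbit is set.
std::istream& expect(std::istream& in, const char* lit) {
    for (; *lit != '\0' && !in.fail(); ++lit) {
        if (*lit == ' ')
            in >> std::ws;
        else if (in.get() != *lit)
            in.setstate(std::ios::failbit);
    }
    return in;
}

// Reads the " { %g , %g } " format from the question.
std::istream& operator>>(std::istream& in, Point& p) {
    expect(in, " { ") >> p.x;
    expect(in, " , ") >> p.y;
    return expect(in, " } ");
}
```

On failure the failbit is set, so a single fail() check after the extraction chain suffices.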

undefined behaviour somewhere in boost::spirit::qi::phrase_parse

I am learning to use boost::spirit library. I took this example http://www.boost.org/doc/libs/1_56_0/libs/spirit/example/qi/num_list1.cpp and compiled it on my computer - it works fine.
However if I modify it a little - if I initialize the parser itself
auto parser = qi::double_ >> *(',' >> qi::double_);
somewhere as global variable and pass it to phrase_parse, everything goes crazy. Here is the complete modified code (only 1 line is modified and 1 added) - http://pastebin.com/5rWS3pMt
If I run the original code and pass "3.14, 3.15" to stdin, it says "Parsing succeeded", but with my modified version it fails. I tried a lot of modifications of the same type - assigning the parser to a global variable - and in some variants, on some compilers, it segfaults.
I don't understand why and how this happens.
Here is another, simpler version, which prints true and then segfaults on clang++, and just segfaults on g++:
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;

const auto doubles_parser_global = qi::double_ >> *(',' >> qi::double_);

int main() {
    const auto doubles_parser_local = qi::double_ >> *(',' >> qi::double_);
    const std::string nums {"3.14, 3.15, 3.1415926"};
    std::cout << std::boolalpha;
    std::cout
        << qi::phrase_parse(
               nums.cbegin(), nums.cend(), doubles_parser_local, ascii::space
           )
        << std::endl; // works fine
    std::cout
        << qi::phrase_parse(
               nums.cbegin(), nums.cend(), doubles_parser_global, ascii::space
           ) // this segfaults
        << std::endl;
}
You cannot use auto to store parser expressions¹
Either you need to evaluate the temporary expression directly, or you need to assign it to a rule/grammar:
const qi::rule<std::string::const_iterator, qi::space_type> doubles_parser_local = qi::double_ >> *(',' >> qi::double_);
You can have your cake and eat it too: on most recent Boost versions (possibly the dev branch) there should be a BOOST_SPIRIT_AUTO macro.
This is becoming a bit of a FAQ item:
Assigning parsers to auto variables
boost spirit V2 qi bug associated with optimization level
¹ I believe this is actually a limitation of the underlying Proto library. There's a Proto-0x lib version on GitHub (by Eric Niebler) that promises to solve these issues by being completely redesigned to be aware of references. I think this requires some C++11 features that Boost Proto currently cannot use.

`y = atoi(x)` vs `x >> y` performance

I need to parse a string containing a comma separated list of objects, like a="1,2,3,4,5".
I also want the object type to be interchangeable, it is defined by typedef type val_type.
My approach is
vector<val_type> v;
istringstream iss(a); string x; val_type y;
while (getline(iss, x, ',') && istringstream(x) >> y) v.push_back(y);
Which works, but apparently runs much slower (about 5x) compared to using atoi() assuming the object type is an integer.
while (getline(iss, x, ',')) v.push_back(atoi(x.c_str()));
Is that huge difference in performance to be expected? Any clever way to get around this issue?
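One possible middle ground (a sketch, assuming val_type is double; note that strtod, like atoi, reads the global locale's decimal separator) is to skip the per-token istringstream entirely and walk the buffer with strtod:

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Walks the buffer once; strtod parses a number and reports via 'end'
// where it stopped, so no per-token stream or substring is needed.
std::vector<double> parse_list(const std::string& a) {
    std::vector<double> v;
    const char* p = a.c_str();
    char* end = nullptr;
    for (;;) {
        double y = std::strtod(p, &end);
        if (end == p) break;   // nothing parsed: stop
        v.push_back(y);
        p = end;
        if (*p == ',') ++p;    // step over the separator
        else break;
    }
    return v;
}
```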

boost spirit and related parts

I need to create a rule via Boost Spirit that should match situations like
return foo;
and
return (foo);
I tried something like this:
start %= "return" >> -boost::spirit::qi::char_('(') >> identifier >> -boost::spirit::qi::char_(')') >> ';';
but this will succeeded even in cases like
return (foo;
and
return foo);
How can I solve it?
Your example only looks pathological because you are using an overly specific example.
In practice, you don't write "return" >> identifier;. Usually, the thing that's returned is just an expression. So you'd say
expr = literal | variable | function_call;
Now the general way to cater for parenthesized expressions in one fell swoop is simply:
expr = literal | variable | function_call
| ('(' >> expr >> ')')
;
Bam. Done. It handles the balancing. It handles nested parentheses. It handles (((foo))) even. Not a whistle was given that day.
I don't think there is /anything/ wrong here at all. I've posted probably over 20 different recursive expression grammars in answers on this site. They should provide motivating examples (showing operator precedence and overruling it with these parentheses).

How can you parse simple C++ typedef instructions?

I'd like to parse simple C++ typedef instructions such as
typedef Class NewNameForClass;
typedef Class::InsideTypedef NewNameForTypedef;
typedef TemplateClass<Arg1,Arg2> AliasForObject;
I have written the corresponding grammar that I'd like to see used for parsing.
Name <- ('_'|letter)('_'|letter|digit)*
Type <- Name
Type <- Type::Name
Type <- Name Templates
Templates <- '<' Type (',' Type)* '>'
Instruction <- "typedef" Type Name ';'
Once this is parsed, all I'll want to do is generate XML with the same information (but laid out differently).
What is the most effective language for writing such a program ?
How can you achieve this ?
EDIT: What I have come up with using Boost Spirit (it's not perfect, but it's good enough for me, at least for now):
rule<> sep_p = space_p;
rule<> name_p = (ch_p('_')|alpha_p) >> *(ch_p('_')|alpha_p|digit_p);
rule<> type_p = name_p
    >> !(*sep_p >> str_p("::") >> *sep_p >> name_p)
    >> *(*sep_p >> ch_p('*'))
    >> !(*sep_p >> str_p("const"))
    >> !(*sep_p >> ch_p('&'));
rule<> templated_type_p = name_p >> *sep_p
    >> ch_p('<') >> *sep_p
    >> (*sep_p >> type_p >> *sep_p) % ch_p(',')
    >> ch_p('>') >> *sep_p;
rule<> typedef_p = *sep_p
    >> str_p("typedef")
    >> +sep_p >> (type_p|templated_type_p)
    >> +sep_p >> name_p
    >> *sep_p >> ch_p(';') >> *sep_p;
rule<> typedef_list_p = *typedef_p;
I would alter the grammar slightly
ShortName <- ('_'|letter)('_'|letter|digit)*
Name <- ShortName
Name <- Name::ShortName
Type <- Name
Type <- Name Templates
Templates <- '<' Type (',' Type)* '>'
Instruction <- "typedef" Type Name ';'
Also, your grammar leaves out the following cases:
Multiple typedef targets.
Pointer targets
Function pointers (this is by far the most difficult)
Parsing a grammar (I love the irony) is a fairly straightforward operation. If you want to actually use the grammar in a functional way, I would say the best bet is a lex/yacc combination.
But from your question it appears that you want to spit it out in another format. There really isn't a language designed for this, so I would say: use whatever language you're most comfortable with.
Edit
The OP asked about multiple typedef targets. It's perfectly legal for a typedef declaration to have more than one target. For example:
typedef _SomeStruct SomeStruct, *PSomeStruct;
This creates 2 typedef names.
SomeStruct which is equivalent to "struct _SomeStruct"
PSomeStruct which is equivalent to "struct _SomeStruct*"
Well, since you're apparently already working with/on C++, have you considered using Boost.Spirit? This allows you to hard-code the grammar inline in C++ as a domain-specific language and program against it in normal C++ code.