Error avalanche in Boost.Spirit.Qi usage - c++

I'm not being able to figure out what's wrong with my code. Boost's templates are making me go crazy! I can't make heads or tails out of all this, so I just had to ask.
What's wrong with this?
#include <iostream>
#include <boost/lambda/lambda.hpp>
#include <boost/spirit/include/qi.hpp>
void parsePathTest(const std::string &path)
{
namespace lambda = boost::lambda;
using namespace boost::spirit;
const std::string permitted = "._\\-##a-zA-Z0-9";
const std::string physicalPermitted = permitted + "/\\\\";
const std::string archivedPermitted = permitted + ":{}";
std::string physical,archived;
// avoids non-const reference to rvalue
std::string::const_iterator begin = path.begin(),end = path.end();
// splits a string like "some/nice-path/while_checking:permitted#symbols.bin"
// as physical = "some/nice-path/while_checking"
// and archived = "permitted#symbols.bin" (if this portion exists)
// I could barely find out the type for this expression
auto expr
= ( +char_(physicalPermitted) ) [lambda::var(physical) = lambda::_1]
>> -(
':'
>> (
+char_(archivedPermitted) [lambda::var(archived) = lambda::_1]
)
)
;
// the error occurs in a template instantiated from here
qi::parse(begin,end,expr);
std::cout << physical << '\n' << archived << '\n';
}
The number of errors is immense; I would suggest people who want to help compiling this on their on (trust me, pasting here is unpractical). I am using the latest TDM-GCC version (GCC 4.4.1) and Boost version 1.39.00.
As a bonus, I would like to ask another two things: whether C++0x's new static_assert functionality will help Boost in this sense, and whether the implementation I've quoted above is a good idea, or if I should use Boost's String Algorithms library. Would the latter likely give a much better performance?
Thanks.
-- edit
The following very minimal sample fails at first with the exact same error as the code above.
#include <iostream>
#include <boost/spirit/include/qi.hpp>
int main()
{
using namespace boost::spirit;
std::string str = "sample";
std::string::const_iterator begin(str.begin()), end(str.end());
auto expr
= ( +char_("a-zA-Z") )
;
// the error occurs in a template instantiated from here
if (qi::parse(begin,end,expr))
{
std::cout << "[+] Parsed!\n";
}
else
{
std::cout << "[-] Parsing failed.\n";
}
return 0;
}
-- edit 2
I still don't know why it didn't work in my old version of Boost (1.39), but upgrading to Boost 1.42 solved the problem. The following code compiles and runs perfectly with Boost 1.42:
#include <iostream>
#include <boost/spirit/include/qi.hpp>
int main()
{
using namespace boost::spirit;
std::string str = "sample";
std::string::const_iterator begin(str.begin()), end(str.end());
auto expr
= ( +qi::char_("a-zA-Z") ) // notice this line; char_ is not part of
// boost::spirit anymore (or maybe I didn't
// include the right headers, but, regardless,
// khaiser said I should use qi::char_, so here
// it goes)
;
// the error occurs in a template instantiated from here
if (qi::parse(begin,end,expr))
{
std::cout << "[+] Parsed!\n";
}
else
{
std::cout << "[-] Parsing failed.\n";
}
return 0;
}
Thanks for the tips, hkaiser.

Several remarks: a) don't use the Spirit V2 beta version distributed with Boost V1.39 and V1.40. Use at least Spirit V2.1 (as released with Boost V1.41) instead, as it contains a lot of bug fixes and performance enhancements (both, compile time and runtime performance). If you can't switch Boost versions, read here for how to proceed. b) Try to avoid using boost::lambda or boost::bind with Spirit V2.x. Yes, I know, the docs say it works, but you have to know what you're doing. Use boost::phoenix expressions instead. Spirit 'knows' about Phoenix, which makes writing semantic actions easier. If you use Phoenix, your code will look like:
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace qi = boost::spirit::qi;
namespace phoenix = boost::phoenix;
std::string physical, archived;
auto expr
= ( +char_(physicalPermitted) ) [phoenix::ref(physical) = qi::_1]
>> -(
':'
>> ( +char_(archivedPermitted) )[phoenix::ref(archived) = qi::_1]
)
;
But your overall parser will get even simpler if you utilize Spirit's built-in attribute propagation rules:
std::string physical;
boost::optional<std::string> archived;
qi::parse(begin, end,
+qi::char_(physicalPermitted) >> -(':' >> +qi::char_(archivedPermitted)),
physical, archived);
i.e. no need to have semantic actions at all. If you need more information about the attribute handling, see the article series about the Magic of Attributes on Spirit's web site.
Edit:
Regarding your static_assert question: yes static_assert, can improve error messages as it can be used to trigger compiler errors as early as possible. In fact, Spirit uses this technique extensively already. But it is not possible to protect the user from getting those huge error messages in all cases, but only for those user errors the programmer did expect. Only concepts (which unfortunately didn't make it into the new C++ Standard) could have been used to generally reduce teh size of the error messages.
Regarding your Boost's String Algorithms question: certainly it's possible to utilize this library for simple tasks as yours. You might even be better off using Boost.Tokenizer (if all you need is to split the input string at the ':'). The performance of Spirit should be comparable to the corresponding performance of the string algorithms, but this certainly depends on the code you will write. If you assume that the utilized string algorithm will require one pass over the input string data, then Spirit won't be faster (as it's doing one pass as well).
Neither the Boost String algorithms nor the Boost Tokenizer can give you the verification of the characters matched. Your Spirit grammar matches only the characters you specified in the character classes. So if you need this matching/verification, you should use either Spirit or Boost Regex.

Related

boost::spirit::x3 attribute compatibility rules, intuition or code?

Is there a document somewhere which describes how various spirit::x3 rule definition operations affect attribute compatibility?
I was surprised when:
x3::lexeme[ x3::alpha > *(x3::alnum | x3::char_('_')) ]
could not be moved into a fusion-adapted struct:
struct Name {
std::string value;
};
For the time being, I got rid of the first mandatory alphabetical character, but I would still like to express a rule which defines that the name string must begin with a letter. Is this one of those situations where I need to try adding eps around until it works, or is there a stated reason why the above couldn't work?
I apologize if this has been written down somewhere, I couldn't find it.
If you're not on the develop branch you don't have the fix for that single-element sequence adaptiation bug, so yeah it's probably that.
Due to the genericity of attribute transformation/propagation, there's a lot of wiggle room, but of course it's just documented and ultimately in the code. In other words: there's no magic.
In the Qi days I'd have "fixed" this by just spelling out the desired transform with qi::as<> or qi::attr_cast<>. X3 doesn't have it (yet), but you can use a rule to mimick it very easily:
Live On Coliru
#include <iostream>
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/home/x3.hpp>
namespace x3 = boost::spirit::x3;
struct Name {
std::string value;
};
BOOST_FUSION_ADAPT_STRUCT(Name, value)
int main() {
std::string const input = "Halleo123_1";
Name out;
bool ok = x3::parse(input.begin(), input.end(),
x3::rule<struct _, std::string>{} =
x3::alpha >> *(x3::alnum | x3::char_('_')),
out);
if (ok)
std::cout << "Parsed: " << out.value << "\n";
else
std::cout << "Parse failed\n";
}
Prints:
Parsed: Halleo123_1
Automate it
Because X3 works so nicely with c++14 core language features, it's not hard to reduce typing:
Understanding the List Operator (%) in Boost.Spirit

Varying case-insensitive string comparisons performance

So, my phd project relies on a piece of software I've been building for nearly 3 years. It runs, its stable (It doesn't crash or throw exceptions) and I'm playing with the release version of it. And I've come to realise that there is a huge performance hit, because I'm relying too much on boost::iequals.
I know, there's a lot of on SO about this, this is not a question on how to do it, but rather why is this happening.
Consider the following:
#include <string.h>
#include <string>
#include <boost/algorithm/string.hpp>
void posix_str ( )
{
std::string s1 = "Alexander";
std::string s2 = "Pericles";
std::cout << "POSIX strcasecmp: " << strcasecmp( s1.c_str(), s2.c_str() ) << std::endl;
}
void boost_str ( )
{
std::string s1 = "Alexander";
std::string s2 = "Pericles";
std::cout << "boost::iequals: " << boost::iequals( s1, s2 ) << std::endl;
}
int main ( )
{
posix_str();
boost_str();
return 0;
}
I put this through valgrind and cachegrind, and to my suprise, boost is 4 times slower than the native posix or the std (which appears to be using the same posix) methods. Four times, now that is a lot, even considering that C++ offers a nice safety net. Why is that? I would really like other people to run this, and explain to me, what makes such a performance hit. Is it all the allocations (seems to be from the caller map).
I'm not dissing on boost, I love it and use it everywhere and anywhere.
EDIT: This graph shows what I mean
Boost::iequals is locale-aware. As you can see from its definition here it takes an optional third parameter that is defaulted to a default-constructed std::locale, that represents the current global C++ locale, as set by std::locale::global.
This more or less means that the compiler has no way to know in advance which locale is going to be used, and that means that there will be an indirect call to a certain function to convert each character to lower-case in the current locale.
On the other hand, the documentation for strcasecmp states that:
In the POSIX locale, strcasecmp() and strncasecmp() shall behave as if the strings had been converted to lowercase and then a byte comparison performed. The results are unspecified in other locales.
That means that the locale is fixed, hence you can expect it to be heavily optimized.

undefined behaviour somewhere in boost::spirit::qi::phrase_parse

I am learning to use boost::spirit library. I took this example http://www.boost.org/doc/libs/1_56_0/libs/spirit/example/qi/num_list1.cpp and compiled it on my computer - it works fine.
However if I modify it a little - if I initialize the parser itself
auto parser = qi::double_ >> *(',' >> qi::double_);
somewhere as global variable and pass it to phrase_parse, everything goes crazy. Here is the complete modified code (only 1 line is modified and 1 added) - http://pastebin.com/5rWS3pMt
If I run the original code and pass "3.14, 3.15" to stdin, it says Parsing succeeded, but with my modified version it fails. I tried a lot of modifications of the same type - assigning the parser to global variable - in some variants on some compilers it segfaults.
I don't understand why and how it is so.
Here is another, simpler version which prints true and then segfaults on clang++ and just segfaults on g++
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;
const auto doubles_parser_global = qi::double_ >> *(',' >> qi::double_);
int main() {
const auto doubles_parser_local = qi::double_ >> *(',' >> qi::double_);
const std::string nums {"3.14, 3.15, 3.1415926"};
std::cout << std::boolalpha;
std::cout
<< qi::phrase_parse(
nums.cbegin(), nums.cend(), doubles_parser_local, ascii::space
)
<< std::endl; // works fine
std::cout
<< qi::phrase_parse(
nums.cbegin(), nums.cend(), doubles_parser_global, ascii::space
) // this segfaults
<< std::endl;
}
You cannot use auto to store parser expressions¹
Either you need to evaluate from the temporary expression directly, or you need to assign to a rule/grammar:
const qi::rule<std::string::const_iterator, qi::space_type> doubles_parser_local = qi::double_ >> *(',' >> qi::double_);
You can have your cake and eat it too on most recent BOost versions (possibly the dev branch) there should be a BOOST_SPIRIT_AUTO macro
This is becoming a bit of a FAQ item:
Assigning parsers to auto variables
boost spirit V2 qi bug associated with optimization level
¹ I believe this is actually a limitation of the underlying Proto library. There's a Proto-0x lib version on github (by Eric Niebler) that promises to solve these issues by being completely redesigned to be aware of references. I think this required some c++11 features that Boost Proto currently cannot use.

Tuple templates in GCC

I first started C++ with Microsoft VC++ in VS2010. I recently found some work, but I've been using RHEL 5 with GCC. My code is mostly native C++, but I've noticed one thing...
GCC doesn't appear to recognize the <tuple> header file, or the tuple template. At first I thought maybe it's just a typo, until I looked at cplusplus.com and found that the header is indeed not part of the standard library.
The problem is that I like to write my code in Visual Studio because the environment is way superior and aesthetically pleasing than eclipse or netbeans, and debugging is a breeze. The thing is, I've already written a good chunk of code to use tuples and I really like my code. How am I supposed to deal with this issue?
Here's my code:
using std::cout;
using std::make_tuple;
using std::remove;
using std::string;
using std::stringstream;
using std::tolower;
using std::tuple;
using std::vector;
// Define three conditions to code
enum {DONE, OK, EMPTY_LINE};
// Tuple containing a condition and a string vector
typedef tuple<int,vector<string>> Code;
// Passed an alias to a string
// Parses the line passed to it
Code ReadAndParse(string& line)
{
/***********************************************/
/****************REMOVE COMMENTS****************/
/***********************************************/
// Sentinel to flag down position of first
// semicolon and the index position itself
bool found = false;
size_t semicolonIndex = -1;
// Convert the line to lowercase
for(int i = 0; i < line.length(); i++)
{
line[i] = tolower(line[i]);
// Find first semicolon
if(line[i] == ';' && !found)
{
semicolonIndex = i;
// Throw the flag
found = true;
}
}
// Erase anything to and from semicolon to ignore comments
if(found != false)
line.erase(semicolonIndex);
/***********************************************/
/*****TEST AND SEE IF THERE'S ANYTHING LEFT*****/
/***********************************************/
// To snatch and store words
Code code;
string token;
stringstream ss(line);
vector<string> words;
// A flag do indicate if we have anything
bool emptyLine = true;
// While the string stream is passing anything
while(ss >> token)
{
// If we hit this point, we did find a word
emptyLine = false;
// Push it onto the words vector
words.push_back(token);
}
// If all we got was nothing, it's an empty line
if(emptyLine)
{
code = make_tuple(EMPTY_LINE, words);
return code;
}
// At this point it should be fine
code = make_tuple(OK, words);
return code;
}
Is there anyway to save my code from compiler incompatibility?
As long as it's just a pair you can use
typedef pair<int,vector<string>> Code;
But I don't think tuple is standard C++ (turns out it is included in TR1 and consequently also standard C++0x). As usual Boost has you covered though. So including:
#include "boost/tuple/tuple.hpp"
will solve your problem across compilers.
Compilers that ship the TR1 library should have it here
#include <tr1/tuple.hpp>
//...
std::tr1::tuple<int, int> mytuple;
Of course for portability you can use the boost library version in the meantime

Tokenizer for full-text

This should be an ideal case of not re-inventing the wheel, but so far my search has been in vain.
Instead of writing one myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full text searching. Performance is very important, I will parse many gigabytes of text.
Edit: Please note that the tokens are to be used in a search index. Creating such tokens is not an exact science (afaik) and requires some heuristics. This has been done a thousand time before, and probably in a thousand different ways, but I can't even find one of them :)
Any good pointers?
Thanks!
The C++ String Toolkit Library (StrTk) has the following solution to your problem:
#include <iostream>
#include <string>
#include <deque>
#include "strtk.hpp"
int main()
{
std::deque<std::string> word_list;
strtk::for_each_line("data.txt",
[&word_list](const std::string& line)
{
const std::string delimiters = "\t\r\n ,,.;:'\""
"!##$%^&*_-=+`~/\\"
"()[]{}<>";
strtk::parse(line,delimiters,word_list);
});
std::cout << strtk::join(" ",word_list) << std::endl;
return 0;
}
More examples can be found Here
If performance is a main issue you should probably stick to good old strtok which is sure to be fast:
/* strtok example */
#include <stdio.h>
#include <string.h>
int main ()
{
char str[] ="- This, a sample string.";
char * pch;
printf ("Splitting string \"%s\" into tokens:\n",str);
pch = strtok (str," ,.-");
while (pch != NULL)
{
printf ("%s\n",pch);
pch = strtok (NULL, " ,.-");
}
return 0;
}
A regular expression library might work well if your tokens aren't too difficult to parse.
I wrote my own tokenizer as part of the open-source
SWISH++ indexing and search engine.
There's also the the ICU tokenizer
that handles Unicode.
I might look into std::stringstream from <sstream>. C-style strtok has a number of usability problems, and C-style strings are just troublesome.
Here's an ultra-trivial example of it tokenizing a sentence into words:
#include <sstream>
#include <iostream>
#include <string>
int main(void)
{
std::stringstream sentence("This is a sentence with a bunch of words");
while (sentence)
{
std::string word;
sentence >> word;
std::cout << "Got token: " << word << std::endl;
}
}
janks#phoenix:/tmp$ g++ tokenize.cc && ./a.out
Got token: This
Got token: is
Got token: a
Got token: sentence
Got token: with
Got token: a
Got token: bunch
Got token: of
Got token: words
Got token:
The std::stringstream class is "bi-directional", in that it supports input and output. You'd probably want to do just one or the other, so you'd use std::istringstream or std::ostringstream.
The beauty of them is that they are also std::istream and std::ostreams respectively, so you can use them as you'd use std::cin or std::cout, which are hopefully familiar to you.
Some might argue these classes are expensive to use; std::strstream from <strstream> is mostly the same thing, but built on top of cheaper C-style 0-terminated strings. It might be faster for you. But anyway, I wouldn't worry about performance right away. Get something working, and then benchmark it. Chances are you can get enough speed by simply writing well-written C++ that minimizes unnecessary object creation and destruction. If it's still not fast enough, then you can look elsewhere. These classes are probably fast enough, though. Your CPU can waste thousands of cycles in the amount of time it takes to read a block of data from a hard disk or network.
You can use the Ragel State Machine Compiler to create a tokenizer (or a lexical analyzer).
The generated code has no external dependencies.
I suggest you look at the clang.rl sample for a relevant example of the syntax and usage.
Well, I would begin by searching Boost and... hop: Boost.Tokenizer
The nice thing ? By default it breaks on white spaces and punctuation because it's meant for text, so you won't forget a symbol.
From the introduction:
#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>
int main(){
using namespace std;
using namespace boost;
string s = "This is, a test";
tokenizer<> tok(s);
for(tokenizer<>::iterator beg=tok.begin(); beg!=tok.end();++beg){
cout << *beg << "\n";
}
}
// prints
This
is
a
test
// notes how the ',' and ' ' were nicely removed
And there are additional features:
it can escape characters
it is compatible with Iterators so you can use it with an istream directly... and thus with an ifstream
and a few options (like keeping empty tokens etc...)
Check it out!