Boost.Spirit bug when mixing "alternates" with "optionals"? - c++

I've only been working with Boost.Spirit (from Boost 1.44) for three days, trying to parse raw e-mail messages by way of the exact grammar in RFC2822. I thought I was starting to understand it and get somewhere, but then I ran into a problem:
#include <iostream>
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
using qi::omit;
using qi::repeat;
using std::cout;
using std::endl;
typedef qi::rule<std::string::const_iterator, std::string()> strrule_t;
void test(const std::string input, strrule_t rule) {
std::string target;
std::string::const_iterator i = input.begin(), ie = input.end();
if (qi::parse(i, ie, rule, target)) {
cout << "Success: '" << target << "'" << endl;
} else {
cout << "Failed to match." << endl;
}
}
int main() {
strrule_t obsolete_year = omit[-qi::char_(" \t")] >> repeat(2)[qi::digit] >>
omit[-qi::char_(" \t")];
strrule_t correct_year = repeat(4)[qi::digit];
test("1776", correct_year | repeat(2)[qi::digit]); // 1: Works, reports 1776.
test("76", obsolete_year); // 2: Works, reports 76.
test("76", obsolete_year | correct_year); // 3: Works, reports 76.
test(" 76", correct_year | obsolete_year); // 4: Works, reports 76.
test("76", correct_year | obsolete_year); // 5: Fails.
test("76", correct_year | repeat(2)[qi::digit]); // 6: Also fails.
}
If test #3 works, then why does test #5 -- the exact same test with the two alternatives reversed -- fail?
By the same token, if you'll pardon the expression: if test #4 works, and the space at the beginning is marked as optional, then why does test #5 (the exact same test with the exact same input, save that there's no leading space in the input) fail?
And finally, if this is a bug in Boost.Spirit (as I suspect it must be), how can I work around it?

That is because you hit a bug in Spirit's directive repeat[]. Thanks for the report, I fixed this problem in SVN (rev. [66167]) and it will be available in Boost V1.45. At the same time I would like to add your small test as a regression test to Spirit's test suite. I hope you don't mind me doing so.

Related

eps parser that can fail for a "post-initialization" that could fail

I'm reading the Boost X3 Quick Start tutorial and noticed the line
eps is a special spirit parser that consumes no input but is always successful. We use it to initialize the rule's synthesized attribute, to zero before anything else. [...] Using eps this way is good for doing pre and post initializations.
Now I can't help but wonder if an eps_that_might_fail would be useful to do some sort of semantic/post analysis on a part of the parsed input, which could fail, to have some sort of locality of the check inside the grammar.
Is there a might-fail eps, and is it a good idea to do extra input verification using this construct?
A terrible example of what I'm trying to convey:
int_ >> eps_might_fail[is_prime]
This will only parse prime numbers, if I'm not mistaken, and allow for the full parser to fail at the point where it expects a prime number.
Semantic actions are intended for this.
Spirit Qi
The most natural example would be
qi::int_ [ qi::_pass = is_prime(qi::_1) ]
Be sure to use %= rule assignment in the presence of semantic actions, because without it, semantic actions disable automatic attribute propagation.
You could, obviously, also be more verbose, and write
qi::int_ >> qi::eps(is_prime(qi::_val))
As you can see, that quoted documentation is slightly incomplete: eps can already take a parameter, in this case the lazy actor is_prime(qi::_val), that determines whether it succeeds of fails.
Spirit X3
In Spirit X3 the same mechanism applies, except that X3 doesn't integrate with Phoenix. This means two things:
on the up-side, we can just use core language features (lambdas) for semantic actions, making the learning curve less steep
on the downside, there's no 1-argument version of x3::eps that takes a lazy actor
Here's a demo program with X3:
Live On Coliru
#include <boost/spirit/home/x3.hpp>
namespace parser {
using namespace boost::spirit::x3;
auto is_ltua = [](auto& ctx) {
_pass(ctx) = 0 == (_attr(ctx) % 42);
};
auto start = int_ [ is_ltua ];
}
#include <iostream>
int main() {
for (std::string const txt : { "43", "42", "84", "85" }) {
int data;
if (parse(txt.begin(), txt.end(), parser::start, data))
std::cout << "Parsed " << data << "\n";
else
std::cout << "Parse failed (" << txt << ")\n";
}
}
Prints
Parse failed (43)
Parsed 42
Parsed 84
Parse failed (85)

Spirit Grammar For Path Verificiation

I am trying to write a simple grammar using boost spirit to validate that a string is a valid directory. I am using these tutorials since this is the first grammar I have attempted:
http://www.boost.org/doc/libs/1_36_0/libs/spirit/doc/html/spirit/qi_and_karma.html
http://www.boost.org/doc/libs/1_48_0/libs/spirit/doc/html/spirit/qi/reference/directive/lexeme.html
http://www.boost.org/doc/libs/1_44_0/libs/spirit/doc/html/spirit/qi/tutorials/employee___parsing_into_structs.html
Currently, what I have come up with is:
// I want these to be valid matches
std::string valid1 = "./";
// This string could be any number of sub dirs i.e. /home/user/test/ is valid
std::string valid2 = "/home/user/";
using namespace boost::spirit::qi;
bool match = phrase_parse(valid1.begin(), valid1.end(), lexeme[
((char_('.') | char_('/')) >> +char_ >> char_('/')],
ascii::space);
if (match)
{
std::cout << "Match!" << std::endl;
}
However, this matches nothing. I had a few ideas as to why; however, after doing some research I haven't found the answers. For example I assume the +char_ will probably consume all chars? So how can I find out if some sequence of characters all end with /?
Essentially my thoughts behind writing the above code was I want directories starting with . and / to be valid and then the last character has to be a /. Could someone help me with my grammar or point me to something more similar example to what I want to do? This is purely an excise to learn how to use spirit.
Edit
So I have got the parser to match using:
bool match = phrase_parse(valid1.begin(), valid1.end(), lexeme[
((char_('.') | char_('/')) >> *(+char_ >> char_('/'))],
ascii::space);
if (match)
{
std::cout << "Match!" << std::endl;
}
Not sure if that is proper or not? Or if it is matching for other reasons... Also should the ascii::space be used here? I read in a tutorial that it was to make spaces agnostic i.e. a b is equivalent to ab. Which I wouldn't want in a path name? If it isn't the correct thing to use what would be?
SSCCE:
#include <string>
#include <iostream>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/qi_char.hpp>
#include <boost/spirit/include/qi_eoi.hpp>
int main()
{
namespace qi = boost::spirit::qi;
std::string valid1 = "./";
std::string valid2 = "/home/blah/";
bool match = qi::parse(valid2.begin(), valid2.end(), &((qi::lit("./")|'/') >> (+~qi::char_('/') % '/') >> qi::eoi));
if (match)
{
std::cout << "Match" << std::endl;
}
}
If you don't want to ignore space differences (which you shouldn't), use parse instead of phrase_parse. The use of lexeme inhibits the skipper again (so you were just stripping leading/trailing space). See also stackoverflow.com/questions/17072987/boost-spirit-skipper-issues/17073965#17073965
Use char_("ab") instead of char_('a')|char_('b').
*char_ matches everything. You may have meant *~char_('/').
I'd suggest something like
bool ok = qi::parse(b, f, &(lit("./")|'/') >> (*~char_('/') % '/'));
This won't expose the matched input. Add raw[] around it to achieve that .
Add > qi::eoi to assert all of the input was consumed.

Spirit Qi Parsing multilines from Console

I am using the Console as an input source.
I looked for a way qi would parse a line and then it'd wait for the next line and continue parsing from that point.
for example take the following grammar
start = MultiLine | Line;
Multiline = "{" *(Line) "}";
Line = *(char_("A-Za-z0-9"));
As an input
{ AAAA
Bbbb
lllll
lalala
}
Taking the whole thing from a file is easy.
But what should i do if i need to have the input to come from the console instead?
That i want to have it process what its given already and wait for the rest of the lines.
The most "highlevel" idiomatic way to do this in Qi is qi::match:
Live On Coliru
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/qi_match.hpp>
using Line = std::string;
using MultiLine = std::vector<Line>;
int main() {
std::cin.unsetf(std::ios::skipws);
MultiLine block;
using namespace boost::spirit::qi;
while (std::cin >> phrase_match('{' >> lexeme[*~char_("\r\n}")] % eol >> '}', blank, block))
{
std::cout << "Processing a block of " << block.size() << " lines\n";
block.clear();
}
}
Prints:
Processing a block of 4 lines
Processing a block of 7 lines
Where the second line appears after one second delay (due to the sleep 1 used)
As Chris Beck was hinting, this uses boost::spirit::istream_iterator under the hood. You can also use that explicitly, e.g. if you want to use recursively nested rules.
You can just use an istream iterator which you construct from std::cin for this. If your grammar is templated against typename iterator like Hartmut does in all the examples, then it should work fine.
It would be easy to illustrate if you provide an SSCCE. The istream iterator is only a few lines of code to create.

Why does std::regex_iterator cause a stack overflow with this data?

I've been using std::regex_iterator to parse log files. My program has been working quite nicely for some weeks and has parsed millions of log lines, until today, when today I ran it against a log file and got a stack overflow. It turned out that just one log line in the log file were causing the problem. Does anyone know know why my regex is causing such massive recursion? Here's a small self contained program which shows the issue (my compiler is VC2012):
#include <string>
#include <regex>
#include <iostream>
using namespace std;
std::wstring test = L"L3 T15356 79726859 [CreateRegistryAction] Creating REGISTRY Action:\n"
L" Identity: 272A4FE2-A7EE-49B7-ABAF-7C57BEA0E081\n"
L" Description: Set Registry Value: \"SortOrder\" in Key HKEY_CURRENT_USER\\Software\\Hummingbird\\PowerDOCS\\Core\\Plugins\\Fusion\\Settings\\DetailColumns\\LONEDOCS1\\Search Unsaved\\$AUTHOR.FULL_NAME;DOCSADM.PEOPLE.SYSTEM_ID\n"
L" Operation: 3\n"
L" Hive: HKEY_CURRENT_USER\n"
L" Key: Software\\Hummingbird\\PowerDOCS\\Core\\Plugins\\Fusion\\Settings\\DetailColumns\\LONEDOCS1\\Search Unsaved\\$AUTHOR.FULL_NAME;DOCSADM.PEOPLE.SYSTEM_ID\n"
L" ValueName: SortOrder\n"
L" ValueType: REG_DWORD\n"
L" ValueData: 0\n"
L"L4 T15356 79726859 [CEMRegistryValueAction::ClearRevertData] [ENTER]\n";
int wmain(int argc, wchar_t* argv[])
{
static wregex rgx_log_lines(
L"^L(\\d+)\\s+" // Level
L"T(\\d+)\\s+" // TID
L"(\\d+)\\s+" // Timestamp
L"\\[((?:\\w|\\:)+)\\]" // Function name
L"((?:" // Complex pattern
L"(?!" // Stop matching when...
L"^L\\d" // New log statement at the beginning of a line
L")"
L"[^]" // Matching all until then
L")*)" //
);
try
{
for (std::wsregex_iterator it(test.begin(), test.end(), rgx_log_lines), end; it != end; ++it)
{
wcout << (*it)[1] << endl;
wcout << (*it)[2] << endl;
wcout << (*it)[3] << endl;
wcout << (*it)[4] << endl;
wcout << (*it)[5] << endl;
}
}
catch (std::exception& e)
{
cout << e.what() << endl;
}
return 0;
}
Negative lookahead patterns which are tested on every character just seem like a bad idea to me, and what you're trying to do is not complicated. You want to match (1) the rest of the line and then (2) any number of following (3) lines which start with something other than L\d (small bug; see below): (another edit: these are regexes; if you want to write them as string literals, you need to change \ to \\.)
.*\n(?:(?:[^L]|L\D).*\n)*
| | |
+-1 | +---------------3
+---------------------2
In Ecmascript mode, . should not match \n, but you could always replace the two .s in that expression with [^\n]
Edited to add: I realize that this may not work if there is a blank line just before the end of the log entry, but this should cover that case; I changed . to [^\n] for extra precision:
[^\n]*\n(?:(?:(?:[^L\n]|L\D)[^\n]*)?\n)*
The regex appears to be OK; at least there is nothing in it that could cause catastrophic backtracking.
I see a small possibility to optimize the regex, cutting down on stack use:
static wregex rgx_log_lines(
L"^L(\\d+)\\s+" // Level
L"T(\\d+)\\s+" // TID
L"(\\d+)\\s+" // Timestamp
L"\\[([\\w:]+)\\]" // Function name
L"((?:" // Complex pattern
L"(?!" // Stop matching when...
L"^L\\d" // New log statement at the beginning of a line
L")"
L"[^]" // Matching all until then
L")*)" //
);
Did you set the ECMAScript option? Otherwise, I suspect the regex library defaults to POSIX regexes, and those don't support lookahead assertions.

How would I perform this text pattern matching

Migrated from [Spirit-general] list
Good morning,
I'm trying to parse a relatively simple pattern across 4 std::strings,
extracting whatever the part which matches the pattern into a separate
std::string.
In an abstracted sense, here is what I want:
s1=<string1><consecutive number>, s2=<consecutive number><string2>,
s3=<string1><consecutive number>, s4=<consecutive number><string2>
Less abstracted:
s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese"
Actual contents:
s1="lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu
j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx", s2="xzljlkxvc
jlkjxzvl jxcvljzx lvjlkj wre 2 cheese", s3="apple 3", s4="kxclvj
xcvjlxk jcvljxlck jxcvl 4 cheese"
How would I perform this pattern matching?
Thanks for all suggestions,
Alec Taylor
Update 2
Here is a really simple explanation I just figured out to explain the
problem I am trying to solve:
std::string s1=garbagetext1+number1+name1+garbagetext4;
std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;
Edit for context:
Feel free to add it to stackoverflow (I've been having some trouble
posting there)
I can't give you what I've done so far, because I wasn't sure if it
was within the capabilities of the boost::spirit libraries to do what
I'm trying to do
Edit: Re Update2
Here is a really simple explanation I just figured out to explain the
problem I am trying to solve:
std::string s1=garbagetext1+number1+name1+garbagetext4;
std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;
It starts looking like a job for:
Tokenizing the 'garbage text/names' - you could make a symbol table of sorts on the fly and use it to match patterns (spirit Lex and Qi's symbol table (qi::symbol) could facilitate it, but I feel you could write that in any number of ways)
conversely, use regular expressions, as suggested before (below, and at least twice in mailing list).
Here's a simple idea:
(\d+) ([a-z]+).*?(\d+) \2
\d+ match a sequence of digits in a "(subexpression)" (NUM1)
([a-z]+) match a name (just picked a simple definition of 'name')
.*? skip any length of garbage, but as little as possible before starting subsequent match
\d+ match another number (sequence of digits) (NUM2)
\2 followed by the same name (backreference)
You can see how you'd already be narrowing the list of matches to inspect down to 'potential' hits. You'd only have to /post-validate/ to see that NUM2 == NUM1+2
Two notes:
Add (...)+ around the tail part to allow repeated matching of patterns
(\d+) ([a-z]+)(.*?(\d+) \2)+
You may wish to make the garbage skip (.*?) aware of separators (by doing negative zerowidth assertions) to avoid more than 2 skipping delimiters (e.g. s\d+=" as a delimiting pattern). I leave it out of scope for clarity now, here's the gist:
((?!s\d+=").)*? -- beware of potential performance degradation
Alec, The following is a show-case of how to do a wide range of things in Boost Spirit, in the context of answering your question.
I had to make assumptions about what is required input structure; I assumed
whitespace was strict (spaces as shown, no newlines)
the sequence numbers should be in increasing order
the sequence numbers should recur exactly in the text values
the keywords 'apple' and 'cheese' are in strict alternation
whether the keyword comes before or after the the sequence number in the text value, is also in strict alternation
Note There are about a dozen places in the implementation below, where significantly less complex choices could possibly have been made. For example, I could have hardcoded the whole pattern (as a de facto regex?), assuming that 4 items are always expected in the input. However I wanted to
make no more assumptions than necessary
learn from the experience. Especially the topic of qi::locals<> and inherited attributes have been on my agenda for a while.
However, the solution allows a great deal of flexibility:
the keywords aren't hardcoded, and you could e.g. easily make the parser accept both keywords at any sequence number
a comment shows how to generate a custom parsing exception when the sequence number is out of sync (not the expected number)
different spellings of the sequence numbers are currently accepted (i.e. s01="apple 001" is ok. Look at Unsigned Integer Parsers for info on how to tune that behaviour)
the output structure is either a vector<std::pair<int, std::string> > or a vector of struct:
struct Entry
{
int sequence;
std::string text;
};
both versions can be switched with the single #if 1/0 line
The sample uses Boost Spirit Qi for parsing.
Conversely, Boost Spirit Karma is used to display the result of parsing:
format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed)
The output for the actual contents given in the post is:
parsed: s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese"
On to the code.
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
namespace qi = boost::spirit::qi;
namespace karma = boost::spirit::karma;
namespace phx = boost::phoenix;
#if 1 // using fusion adapted struct
#include <boost/fusion/adapted/struct.hpp>
struct Entry
{
int sequence;
std::string text;
};
BOOST_FUSION_ADAPT_STRUCT(Entry, (int, sequence)(std::string, text));
#else // using boring std::pair
#include <boost/fusion/adapted/std_pair.hpp> // for karma output generation
typedef std::pair<int, std::string> Entry;
#endif
int main()
{
std::string input =
"s1=\"lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu"
"j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx\", s2=\"xzljlkxvc"
"jlkjxzvl jxcvljzx lvjlkj wre 2 cheese\", s3=\"apple 3\", s4=\"kxclvj"
"xcvjlxk jcvljxlck jxcvl 4 cheese\"";
using namespace qi;
typedef std::string::const_iterator It;
It f(input.begin()), l(input.end());
int next = 1;
qi::rule<It, std::string(int)> label;
qi::rule<It, std::string(int)> value;
qi::rule<It, int()> number;
qi::rule<It, Entry(), qi::locals<int> > assign;
label %= qi::raw [
( eps(qi::_r1 % 2) >> qi::string("apple ") > qi::uint_(qi::_r1) )
| qi::uint_(qi::_r1) > qi::string(" cheese")
];
value %= '"'
>> qi::omit[ *(~qi::char_('"') - label(qi::_r1)) ]
>> label(qi::_r1)
>> qi::omit[ *(~qi::char_('"')) ]
>> '"';
number %= qi::uint_(phx::ref(next)++) /*| eps [ phx::throw_(std::runtime_error("Sequence number out of sync")) ] */;
assign %= 's' > number[ qi::_a = _1 ] > '=' > value(qi::_a);
std::vector<Entry> parsed;
bool ok = false;
try
{
ok = parse(f, l, assign % ", ", parsed);
if (ok)
{
using namespace karma;
std::cout << "parsed:\t" << format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed) << std::endl;
}
} catch(qi::expectation_failure<It>& e)
{
std::cerr << "Expectation failed: " << e.what() << " '" << std::string(e.first, e.last) << "'" << std::endl;
} catch(const std::exception& e)
{
std::cerr << e.what() << std::endl;
}
if (!ok || (f!=l))
std::cerr << "problem at: '" << std::string(f,l) << "'" << std::endl;
}
Provided you can use c++11 compiler, parsing these patterns is pretty simple using AXE†:
#include <axe.h>
#include <string>
template<class I>
void num_value(I i1, I i2)
{
unsigned n;
unsigned next = 1;
// rule to match unsigned decimal number and compare it with another number
auto num = axe::r_udecimal(n) & axe::r_bool([&](...){ return n == next; });
// rule to match a single word
auto word = axe::r_alphastr();
// rule to match space characters
auto space = axe::r_any(" \t\n");
// semantic action - print to cout and increment next
auto e_cout = axe::e_ref([&](I i1, I i2)
{
std::cout << std::string(i1, i2) << '\n';
++next;
});
// there are only two patterns in this example
auto pattern1 = (word & +space & num) >> e_cout;
auto pattern2 = (num & +space & word) >> e_cout;
auto s1 = axe::r_find(pattern1);
auto s2 = axe::r_find(pattern2);
auto text = s1 & s2 & s1 & s2 & axe::r_end();
text(i1, i2);
}
To parse the text simply call num_value(text.begin(), text.end()); No changes required to parse unicode strings.
† I didn't test it.
Look into Boost.Regex. I've seen an almost-identical poosting in boost-users and the solution is to use regexes for some of the match work.