Boost Spirit - Trimming spaces between last character and separator - c++

Boost Spirit newcomer here.
I have a string in the form of "Key:Value\r\nKey2:Value2\r\n" that I'm trying to parse. In that specific form, it's trivial to parse with Boost Spirit. However, in order to be more robust, I also need to handle cases such as this one:
" My Key : Value \r\n My2ndKey : Long<4 spaces>Value \r\n"
In this case, I need to trim leading and trailing spaces before and after the key/value separators so that I get the following map:
"My Key", "Value"
"My2ndKey", "Long<4 spaces>Value"
I played with qi::hold to achieve this but I get compile errors because of unsupported boost::multi_pass iterator with the embedded parser I was trying to use. There has to be a simple way to achieve this.
I read the following articles (and many others on the subject):
http://boost-spirit.com/home/articles/qi-example/parsing-a-list-of-key-value-pairs-using-spirit-qi/
http://boost-spirit.com/home/2010/02/24/parsing-skippers-and-skipping-parsers/
Boost spirit parsing string with leading and trailing whitespace
I am looking for a solution to my problem, which doesn't seem to be entirely covered by those articles. I would also like to better understand how this is achieved. As a small bonus question, I keep seeing the '%=' operator, is this useful in my case? MyRule %= MyRule ... is used for recursive parsing?
The code below parses my strings properly except that it doesn't remove the spaces between the last non-space character and the separator. :( The skipper used is qi::blank_type (space without EOL).
Thanks!
template <typename Iterator, typename Skipper>
struct KeyValueParser : qi::grammar<Iterator, std::map<std::string, std::string>(), Skipper> {
KeyValueParser() : KeyValueParser::base_type(ItemRule) {
ItemRule = PairRule >> *(qi::lit(END_OF_CMD) >> PairRule);
PairRule = KeyRule >> PAIR_SEP >> ValueRule;
KeyRule = +(qi::char_ - qi::lit(PAIR_SEP));
ValueRule = +(qi::char_ - qi::lit(END_OF_CMD));
}
qi::rule<Iterator, std::map<std::string, std::string>(), Skipper> ItemRule;
qi::rule<Iterator, std::pair<std::string, std::string>(), Skipper> PairRule;
qi::rule<Iterator, std::string()> KeyRule;
qi::rule<Iterator, std::string()> ValueRule;
};

You need to use KeyRule = qi::raw[ +(qi::char_ - qi::lit(PAIR_SEP)) ];
In order to see why, let's try to study several ways to parse the string "a b :".
First let's keep in mind how the following parsers/directives work:
lexeme[subject]: This directive matches subject while disabling the skipper.
raw[subject]: Discards subject's attribute and returns an iterator pair that points to the matched characters in the input stream.
+subject: The plus parser tries to match 1 or more times its subject.
a-b: The difference parser first tries to parse b and if b succeeds, a-b fails. When b fails, it matches a.
char_: matches any char. It's a PrimitiveParser.
lit(':'): matches ':' but ignores its attribute. It's a PrimitiveParser.
1. lexeme[ +(char_ - lit(':')) ]: by removing the skipper from your rule you have an implicit lexeme. Since there is no skipper it goes like this:
'a' -> ':' fails, char_ matches 'a', the current synthesized attribute is "a"
' ' -> ':' fails, char_ matches ' ', the current synthesized attribute is "a "
'b' -> ':' fails, char_ matches 'b', the current synthesized attribute is "a b"
' ' -> ':' fails, char_ matches ' ', the current synthesized attribute is "a b "
':' -> ':' succeeds, the final synthesized attribute is "a b "
2. +(char_ - lit(':')): Since it has a skipper, every PrimitiveParser will pre-skip before being tried:
'a' -> ':' fails, char_ matches 'a', the current synthesized attribute is "a"
' ' -> this is skipped before ':' is tried
'b' -> ':' fails, char_ matches 'b', the current synthesized attribute is "ab"
' ' -> this is skipped before ':' is tried
':' -> ':' succeeds, the final synthesized attribute is "ab"
3. raw[ +(char_ - lit(':')) ]: The subject is exactly the same as in case 2. The raw directive ignores "ab" and returns an iterator pair that goes from 'a' to 'b'. Since the attribute of your rule is std::string, a string is constructed from that iterator pair, resulting in "a b", which is what you want. Note that this relies on KeyRule being declared with the skipper; without one it behaves like case 1.

Related

How to combine skipping and non-skipping (lexeme) rules?

my parser is nearly working :)
(still amazed by Spirit feature set (and compiletimes) and the very welcoming community here on stack overflow)
small sample for online try:
http://coliru.stacked-crooked.com/a/1c1bf88909dce7e3
so I've learned to use more lexeme rules and to avoid no_skip -
my rules are smaller and easier to read as a result, but now I'm stuck with
combining lexeme rules and skipping rules, which seems not to be possible (compile-time error with a message about a type not being castable to Skipper)
my problem is the comma-separated list in subscriptions,
which does not skip spaces around expressions
parses:
"a.b[a,b]"
fails:
"a.b[ a , b ]"
these are my rules:
qi::rule<std::string::const_iterator, std::string()> identifier_chain;
qi::rule<std::string::const_iterator, std::string()>
expression_list = identifier_chain >> *(qi::char_(',') >> identifier_chain);
qi::rule < std::string::const_iterator, std::string() >
subscription = qi::char_('[') >> expression_list >> qi::char_(']');
qi::rule<std::string::const_iterator, std::string()>
identifier = qi::ascii::alpha >> *(qi::ascii::alnum | '_');
identifier_chain = identifier >> *(('.' >> identifier) | subscription);
as you can see, all rules are "lexeme", and I think the subscription rule should use an ascii::space_type skipper, but that does not compile
should i add space eaters in the front and back of identifier_chains in the expression_list?
feels like writing a regex :(
expression_list = *qi::blank >> identifier_chain >> *(*qi::blank >> qi::char_(',') >> *qi::blank >> identifier_chain >> *qi::blank);
it works, but I've read that this will lead to a much bigger parser in the end (handling all the space skipping myself)
thx for any advice
btw: any idea why it doesn't compile if I replace the '.' in the identifier_chain with qi::char_('.')?
identifier_chain = identifier >> *(('.' >> identifier) | subscription);
UPDATE:
i've updated my expression list as suggested by sehe
qi::rule<std::string::const_iterator, spirit::ascii::blank_type, std::string()>
expression_list = identifier_chain >> *(qi::char_(',') >> identifier_chain);
qi::rule < std::string::const_iterator, std::string() >
subscription = qi::char_('[') >> qi::skip(qi::blank)[expression_list] >> qi::char_(']');
but I still get a compile error due to a non-castable Skipper: http://coliru.stacked-crooked.com/a/adcf665742b055dd
I also tried changing the identifier_chain to
identifier_chain = identifier >> *(('.' >> identifier) | qi::skip(qi::blank)[subscription]);
but I still can't compile the example
The answer I linked to earlier describes all the combinations (if I remember correctly): Boost spirit skipper issues
In short:
any rule that declares a skipper (so rule<It, Skipper[, Attr()]> or rule<It, Attr(), Skipper>) MUST be invoked with a compatible skipper (an expression that can be assigned to the type of Skipper).
any rule that does NOT declare a skipper (so of the form rule<It[, Attr()]>) will implicitly behave like a lexeme, meaning no input characters are skipped.
That's it. The slightly subtler ramifications are that given two rules:
rule<It, blank_type> a;
rule<It> b; // b is implicitly lexeme
You can invoke b from a:
a = "test" >> b;
But when you wish to invoke a from b you will find that you have to provide the skipper:
b = "oops" >> a; // DOES NOT COMPILE
b = "okay" >> qi::skip(qi::blank) [ a ];
That's almost all there is to it. There are a few more directives around skippers and lexemes in Qi, see again the answer linked above.
Side Question:
should i add space eaters in the front and back of identifier_chains in the expression_list?
If you look closely at the answer example here Parse a '.' chained identifier list, with qi::lexeme and prevent space skipping, you can see that it already does pre- and post skipping correctly, because I used phrase_parse:
" a.b " OK: ( "a" "b" )
----
"a . b" Failed
Remaining unparsed: "a . b"
----
You COULD also wrap the whole thing in an "outer" rule:
rule<std::string::const_iterator> main_rule =
qi::skip(qi::blank) [ identifier_chain ];
That's just the same but allows users to call parse without specifying the skipper.

Starting with Spirit X3

I've just started using Spirit X3 and I have a little question related with my first test. Do you know why this function is returning "false"?
bool parse()
{
std::string rc = "a 6 literal 8";
auto iter_begin = rc.begin();
auto iter_end = rc.end();
bool bOK= phrase_parse( iter_begin, iter_end,
// ----- start parser -----
alpha >> *alnum >> "literal" >> *alnum
// ----- end parser -----
, space);
return bOK && iter_begin == iter_end;
}
I've seen the problem is related with how I write the grammar. If I replace it with this one, it returns "true"
alpha >> -alnum >> "literal" >> *alnum
I'm using the Spirit version included in Boost 1.61.0.
Thanks in advance,
Sen
Your problem is a combination of the greediness of operator * and the use of a skipper. Keep in mind that alnum is a PrimitiveParser, which means that Spirit pre-skips every time before this parser is tried, so the behaviour of your parser is:
alpha parses a.
The kleene operator starts.
alnum skips the space and then parses 6.
alnum skips the space and then parses l.
alnum parses i.
...
alnum parses l.
alnum skips the space and then parses 8.
alnum tries and fails to parse more. This completes the kleene operator with a parsed attribute of 6literal8.
"literal" tries and fails to parse.
The sequence operator fails and the invocation of phrase_parse returns false.
You can easily avoid this problem using the lexeme directive (barebones x3 docs, qi docs). Something like this should work:
alpha >> lexeme[*alnum] >> "literal" >> lexeme[*alnum];

limit qi::hex parser to 2 chars

I'm parsing a string with escaped characters, and I want '\xYY' to be parsed as the character with code YY. As far as I understand, this is what qi::hex is for. But I need only the two following chars to be parsed, not more. So "\x30kl" is parsed correctly, but "\x30fl" is not, because qi::hex parses '30f', not just '30'. How can I limit the number of digits that qi::hex consumes?
This is my grammar:
template <typename Iterator>
struct gram : qi::grammar<Iterator, std::string(), ascii::space_type> {
gram() : gram::base_type(start) {
start %= "'" >> *(string_char) >> "'";
string_char = ("\\" >> qi::char_('\'')) |
("\\x" >> qi::hex) |
(qi::print - "'");
}
qi::rule<Iterator, std::string(), ascii::space_type> string_char, start;
};
And this is link to Coliru: http://coliru.stacked-crooked.com/a/ba96c7410c772c87
Thanks!
Use:
qi::int_parser<unsigned char, 16, 1, 2> hex2_;
Or if you require exactly 2, make it
qi::int_parser<unsigned char, 16, 2, 2> octet_;
Note that unsigned char is now the exposed attribute. You can use char if you prefer (or int...)

Parsing a string (with spaces) but ignoring the spaces at the end of the (Spirit)

I have an input string I'm trying to parse. It might look like any of the following:
sys(error1, 2.3%)
sys(error2 , 2.4%)
sys(this error , 3%)
Note the space that sometimes appears before the comma. In my grammar (Boost Spirit library) I'd like to capture "error1", "error2", and "this error" respectively.
Here is the original grammar I had to capture this - which absorbed the space at the end of the name:
name_string %= lexeme[+(char_ - ',' - '"')];
name_string.name("Systematic Error Name");
start = (lit("sys")|lit("usys")) > '('
> name_string[boost::phoenix::bind(&ErrorValue::SetName, _val, _1)] > ','
> errParser[boost::phoenix::bind(&ErrorValue::CopyErrorAndRelative, _val, _1)]
> ')';
My attempt to fix this was first:
name_string %= lexeme[*(char_ - ',' - '"') > (char_ - ',' - '"' - ' ')];
however, that completely failed. It looks like it fails to parse anything with a space in the middle.
I'm fairly new to Spirit, so perhaps I'm missing something simple. It looks like lexeme turns off skipping on the leading edge; I need something that does it on both the leading and the trailing edge.
Thanks in advance for any help!
Thanks to psur below, I was able to put together an answer. It isn't perfect (see below), but I thought I would update the post for everyone to see it in context and nicely formatted:
qi::rule<Iterator, std::string(), ascii::space_type> name_word;
qi::rule<Iterator, std::string(), ascii::space_type> name_string;
ErrorValueParser<Iterator> errParser;
name_word %= +(qi::char_("_a-zA-Z0-9+"));
//name_string %= lexeme[name_word >> *(qi::hold[+(qi::char_(' ')) >> name_word])];
name_string %= lexeme[+(qi::char_("-_a-zA-Z0-9+")) >> *(qi::hold[+(qi::char_(' ')) >> +(qi::char_("-_a-zA-Z0-9+"))])];
start = (
lit("sys")[bind(&ErrorValue::MakeCorrelated, _val)]
|lit("usys")[bind(&ErrorValue::MakeUncorrelated, _val)]
)
>> '('
>> name_string[bind(&ErrorValue::SetName, _val, _1)] >> *qi::lit(' ')
>> ','
>> errParser[bind(&ErrorValue::CopyErrorAndRelative, _val, _1)]
>> ')';
This works! The key to this is name_string, and in it qi::hold, a directive I was not familiar with before. It is almost like a sub-rule: everything inside qi::hold[...] must parse successfully for any of it to be kept. So, above, it will only accept spaces after a word if another word follows. The result is that if a sequence of words ends in one or more spaces, those trailing spaces will not be consumed! They can be absorbed by the *qi::lit(' ') that follows (see the start rule).
There are two things I'd like to figure out how to improve here:
It would be nice to put the actual string parsing into name_word. The problem is the declaration of name_word - it fails when it is put in the appropriate spot in the definition of name_string.
It would be even better if name_string could include the parsing of the trailing spaces, though its return value did not. I think I know how to do that...
When/if I figure these out I will update this post. Thanks for the help!
Below rules should work for you:
name_word %= +(qi::char_("_a-zA-Z0-9"));
start %= qi::lit("sys(")
>> qi::lexeme[ name_word >> *(qi::hold[ +(qi::char_(' ')) >> name_word ]) ]
>> *qi::lit(' ')
>> qi::lit(',')
// ...
name_word parses only a single word of the name; I assumed that it contains only letters, digits and underscores.
In the start rule, qi::hold is important. It consumes the spaces only if a name_word follows. Otherwise the parser rolls back and moves on to *qi::lit(' ') and then to the comma.

Parsing string, with Boost Spirit 2, to fill data in user defined struct

I'm using Boost.Spirit which was distributed with Boost-1.42.0 with VS2005. My problem is like this.
I have this string, which is delimited with commas. The first 3 fields are strings and the rest are numbers, like this:
String1,String2,String3,12.0,12.1,13.0,13.1,12.4
My rule is like this
qi::rule<string::iterator, qi::skip_type> stringrule = *(char_ - ',');
qi::rule<string::iterator, qi::skip_type> myrule= repeat(3)[*(char_ - ',') >> ','] >> (double_ % ',') ;
I'm trying to store the data in a structure like this.
struct MyStruct
{
vector<string> stringVector ;
vector<double> doubleVector ;
} ;
MyStruct var ;
I've wrapped it in BOOST_FUSION_ADAPT_STRUCT to use it with Spirit.
BOOST_FUSION_ADAPT_STRUCT (MyStruct, (vector<string>, stringVector) (vector<double>, doubleVector))
My parse function parses the line and returns true and after
qi::phrase_parse (iterBegin, iterEnd, myrule, boost::spirit::ascii::space, var) ;
I'm expecting var.stringVector and var.doubleVector to be properly filled, but that is not the case.
What is going wrong ?
The code sample is located here
Thanks in advance,
Surya
qi::skip_type is not something you can use as a skipper. qi::skip_type is the type of the placeholder qi::skip, which is applicable to the skip[] directive only (to enable skipping inside a lexeme[] or to change the skipper in use) and which is not a parser component matching any input on its own. You need to specify your actual skipper type instead (in your case that's boost::spirit::ascii::space_type).
Moreover, in order for your rules to return the parsed attribute, you need to specify the type of the expected attribute while defining your rule. That leaves you with:
qi::rule<string::iterator, std::string(), ascii::space_type>
stringrule = *(char_ - ',');
qi::rule<string::iterator, MyStruct(), ascii::space_type>
myrule = repeat(3)[*(char_ - ',') >> ','] >> (double_ % ',');
which should do exactly what you expect.