Spirit Qi Parsing multilines from Console - c++

I am using the Console as an input source.
I looked for a way qi would parse a line and then it'd wait for the next line and continue parsing from that point.
for example take the following grammar
start = MultiLine | Line;
Multiline = "{" *(Line) "}";
Line = *(char_("A-Za-z0-9"));
As an input
{ AAAA
Bbbb
lllll
lalala
}
Taking the whole thing from a file is easy.
But what should i do if i need to have the input to come from the console instead?
That i want to have it process what its given already and wait for the rest of the lines.

The most "highlevel" idiomatic way to do this in Qi is qi::match:
Live On Coliru
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/qi_match.hpp>
using Line = std::string;
using MultiLine = std::vector<Line>;
int main() {
std::cin.unsetf(std::ios::skipws);
MultiLine block;
using namespace boost::spirit::qi;
while (std::cin >> phrase_match('{' >> lexeme[*~char_("\r\n}")] % eol >> '}', blank, block))
{
std::cout << "Processing a block of " << block.size() << " lines\n";
block.clear();
}
}
Prints:
Processing a block of 4 lines
Processing a block of 7 lines
Where the second line appears after one second delay (due to the sleep 1 used)
As Chris Beck was hinting, this uses boost::spirit::istream_iterator under the hood. You can also use that explicitly, e.g. if you want to use recursively nested rules.

You can just use an istream iterator which you construct from std::cin for this. If your grammar is templated against typename iterator like Hartmut does in all the examples, then it should work fine.
It would be easy to illustrate if you provide an SSCCE. The istream iterator is only a few lines of code to create.

Related

Spirit Grammar For Path Verificiation

I am trying to write a simple grammar using boost spirit to validate that a string is a valid directory. I am using these tutorials since this is the first grammar I have attempted:
http://www.boost.org/doc/libs/1_36_0/libs/spirit/doc/html/spirit/qi_and_karma.html
http://www.boost.org/doc/libs/1_48_0/libs/spirit/doc/html/spirit/qi/reference/directive/lexeme.html
http://www.boost.org/doc/libs/1_44_0/libs/spirit/doc/html/spirit/qi/tutorials/employee___parsing_into_structs.html
Currently, what I have come up with is:
// I want these to be valid matches
std::string valid1 = "./";
// This string could be any number of sub dirs i.e. /home/user/test/ is valid
std::string valid2 = "/home/user/";
using namespace boost::spirit::qi;
bool match = phrase_parse(valid1.begin(), valid1.end(), lexeme[
((char_('.') | char_('/')) >> +char_ >> char_('/')],
ascii::space);
if (match)
{
std::cout << "Match!" << std::endl;
}
However, this matches nothing. I had a few ideas as to why; however, after doing some research I haven't found the answers. For example I assume the +char_ will probably consume all chars? So how can I find out if some sequence of characters all end with /?
Essentially my thoughts behind writing the above code was I want directories starting with . and / to be valid and then the last character has to be a /. Could someone help me with my grammar or point me to something more similar example to what I want to do? This is purely an excise to learn how to use spirit.
Edit
So I have got the parser to match using:
bool match = phrase_parse(valid1.begin(), valid1.end(), lexeme[
((char_('.') | char_('/')) >> *(+char_ >> char_('/'))],
ascii::space);
if (match)
{
std::cout << "Match!" << std::endl;
}
Not sure if that is proper or not? Or if it is matching for other reasons... Also should the ascii::space be used here? I read in a tutorial that it was to make spaces agnostic i.e. a b is equivalent to ab. Which I wouldn't want in a path name? If it isn't the correct thing to use what would be?
SSCCE:
#include <string>
#include <iostream>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/qi_char.hpp>
#include <boost/spirit/include/qi_eoi.hpp>
int main()
{
namespace qi = boost::spirit::qi;
std::string valid1 = "./";
std::string valid2 = "/home/blah/";
bool match = qi::parse(valid2.begin(), valid2.end(), &((qi::lit("./")|'/') >> (+~qi::char_('/') % '/') >> qi::eoi));
if (match)
{
std::cout << "Match" << std::endl;
}
}
If you don't want to ignore space differences (which you shouldn't), use parse instead of phrase_parse. The use of lexeme inhibits the skipper again (so you were just stripping leading/trailing space). See also stackoverflow.com/questions/17072987/boost-spirit-skipper-issues/17073965#17073965
Use char_("ab") instead of char_('a')|char_('b').
*char_ matches everything. You may have meant *~char_('/').
I'd suggest something like
bool ok = qi::parse(b, f, &(lit("./")|'/') >> (*~char_('/') % '/'));
This won't expose the matched input. Add raw[] around it to achieve that .
Add > qi::eoi to assert all of the input was consumed.

iterate over ini file on c++, probably using boost::property_tree::ptree?

My task is trivial - i just need to parse such file:
Apple = 1
Orange = 2
XYZ = 3950
But i do not know the set of available keys. I was parsing this file relatively easy using C#, let me demonstrate source code:
public static Dictionary<string, string> ReadParametersFromFile(string path)
{
string[] linesDirty = File.ReadAllLines(path);
string[] lines = linesDirty.Where(
str => !String.IsNullOrWhiteSpace(str) && !str.StartsWith("//")).ToArray();
var dict = lines.Select(s => s.Split(new char[] { '=' }))
.ToDictionary(s => s[0].Trim(), s => s[1].Trim());
return dict;
}
Now I just need to do the same thing using c++. I was thinking to use boost::property_tree::ptree however it seems I just can not iterate over ini file. It's easy to read ini file:
boost::property_tree::ptree pt;
boost::property_tree::ini_parser::read_ini(path, pt);
But it is not possible to iterate over it, refer to this question Boost program options - get all entries in section
The question is - what is the easiest way to write analog of C# code above on C++ ?
To answer your question directly: of course iterating a property tree is possible. In fact it's trivial:
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/ini_parser.hpp>
int main()
{
using boost::property_tree::ptree;
ptree pt;
read_ini("input.txt", pt);
for (auto& section : pt)
{
std::cout << '[' << section.first << "]\n";
for (auto& key : section.second)
std::cout << key.first << "=" << key.second.get_value<std::string>() << "\n";
}
}
This results in output like:
[Cat1]
name1=100 #skipped
name2=200 \#not \\skipped
name3=dhfj dhjgfd
[Cat_2]
UsagePage=9
Usage=19
Offset=0x1204
[Cat_3]
UsagePage=12
Usage=39
Offset=0x12304
I've written a very full-featured Inifile parser using boost-spirit before:
Cross-platform way to get line number of an INI file where given option was found
It supports comments (single line and block), quotes, escapes etc.
(as a bonus, it optionally records the exact source locations of all the parsed elements, which was the subject of that question).
For your purpose, though, I think I'd recomment Boost Property Tree.
For the moment, I've simplified the problem a bit, leaving out the logic for comments (which looks broken to me anyway).
#include <map>
#include <fstream>
#include <iostream>
#include <string>
typedef std::pair<std::string, std::string> entry;
// This isn't officially allowed (it's an overload, not a specialization) but is
// fine with every compiler of which I'm aware.
namespace std {
std::istream &operator>>(std::istream &is, entry &d) {
std::getline(is, d.first, '=');
std::getline(is, d.second);
return is;
}
}
int main() {
// open an input file.
std::ifstream in("myfile.ini");
// read the file into our map:
std::map<std::string, std::string> dict((std::istream_iterator<entry>(in)),
std::istream_iterator<entry>());
// Show what we read:
for (entry const &e : dict)
std::cout << "Key: " << e.first << "\tvalue: " << e.second << "\n";
}
Personally, I think I'd write the comment skipping as a filtering stream buffer, but for those unfamiliar with the C++ standard library, it's open to argument that would be a somewhat roundabout solution. Another possibility would be a comment_iterator that skips the remainder of a line, starting from a designated comment delimiter. I don't like that as well, but it's probably simpler in some ways.
Note that the only code we really write here is to read one, single entry from the file into a pair. The istream_iterator handles pretty much everything from there. As such, there's little real point in writing a direct analog of your function -- we just initialize the map from the iterators, and we're done.

How would I perform this text pattern matching

Migrated from [Spirit-general] list
Good morning,
I'm trying to parse a relatively simple pattern across 4 std::strings,
extracting whatever the part which matches the pattern into a separate
std::string.
In an abstracted sense, here is what I want:
s1=<string1><consecutive number>, s2=<consecutive number><string2>,
s3=<string1><consecutive number>, s4=<consecutive number><string2>
Less abstracted:
s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese"
Actual contents:
s1="lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu
j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx", s2="xzljlkxvc
jlkjxzvl jxcvljzx lvjlkj wre 2 cheese", s3="apple 3", s4="kxclvj
xcvjlxk jcvljxlck jxcvl 4 cheese"
How would I perform this pattern matching?
Thanks for all suggestions,
Alec Taylor
Update 2
Here is a really simple explanation I just figured out to explain the
problem I am trying to solve:
std::string s1=garbagetext1+number1+name1+garbagetext4;
std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;
Edit for context:
Feel free to add it to stackoverflow (I've been having some trouble
posting there)
I can't give you what I've done so far, because I wasn't sure if it
was within the capabilities of the boost::spirit libraries to do what
I'm trying to do
Edit: Re Update2
Here is a really simple explanation I just figured out to explain the
problem I am trying to solve:
std::string s1=garbagetext1+number1+name1+garbagetext4;
std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;
It starts looking like a job for:
Tokenizing the 'garbage text/names' - you could make a symbol table of sorts on the fly and use it to match patterns (spirit Lex and Qi's symbol table (qi::symbol) could facilitate it, but I feel you could write that in any number of ways)
conversely, use regular expressions, as suggested before (below, and at least twice in mailing list).
Here's a simple idea:
(\d+) ([a-z]+).*?(\d+) \2
\d+ match a sequence of digits in a "(subexpression)" (NUM1)
([a-z]+) match a name (just picked a simple definition of 'name')
.*? skip any length of garbage, but as little as possible before starting subsequent match
\d+ match another number (sequence of digits) (NUM2)
\2 followed by the same name (backreference)
You can see how you'd already be narrowing the list of matches to inspect down to 'potential' hits. You'd only have to /post-validate/ to see that NUM2 == NUM1+2
Two notes:
Add (...)+ around the tail part to allow repeated matching of patterns
(\d+) ([a-z]+)(.*?(\d+) \2)+
You may wish to make the garbage skip (.*?) aware of separators (by doing negative zerowidth assertions) to avoid more than 2 skipping delimiters (e.g. s\d+=" as a delimiting pattern). I leave it out of scope for clarity now, here's the gist:
((?!s\d+=").)*? -- beware of potential performance degradation
Alec, The following is a show-case of how to do a wide range of things in Boost Spirit, in the context of answering your question.
I had to make assumptions about what is required input structure; I assumed
whitespace was strict (spaces as shown, no newlines)
the sequence numbers should be in increasing order
the sequence numbers should recur exactly in the text values
the keywords 'apple' and 'cheese' are in strict alternation
whether the keyword comes before or after the the sequence number in the text value, is also in strict alternation
Note There are about a dozen places in the implementation below, where significantly less complex choices could possibly have been made. For example, I could have hardcoded the whole pattern (as a de facto regex?), assuming that 4 items are always expected in the input. However I wanted to
make no more assumptions than necessary
learn from the experience. Especially the topic of qi::locals<> and inherited attributes have been on my agenda for a while.
However, the solution allows a great deal of flexibility:
the keywords aren't hardcoded, and you could e.g. easily make the parser accept both keywords at any sequence number
a comment shows how to generate a custom parsing exception when the sequence number is out of sync (not the expected number)
different spellings of the sequence numbers are currently accepted (i.e. s01="apple 001" is ok. Look at Unsigned Integer Parsers for info on how to tune that behaviour)
the output structure is either a vector<std::pair<int, std::string> > or a vector of struct:
struct Entry
{
int sequence;
std::string text;
};
both versions can be switched with the single #if 1/0 line
The sample uses Boost Spirit Qi for parsing.
Conversely, Boost Spirit Karma is used to display the result of parsing:
format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed)
The output for the actual contents given in the post is:
parsed: s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese"
On to the code.
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
namespace qi = boost::spirit::qi;
namespace karma = boost::spirit::karma;
namespace phx = boost::phoenix;
#if 1 // using fusion adapted struct
#include <boost/fusion/adapted/struct.hpp>
struct Entry
{
int sequence;
std::string text;
};
BOOST_FUSION_ADAPT_STRUCT(Entry, (int, sequence)(std::string, text));
#else // using boring std::pair
#include <boost/fusion/adapted/std_pair.hpp> // for karma output generation
typedef std::pair<int, std::string> Entry;
#endif
int main()
{
std::string input =
"s1=\"lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu"
"j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx\", s2=\"xzljlkxvc"
"jlkjxzvl jxcvljzx lvjlkj wre 2 cheese\", s3=\"apple 3\", s4=\"kxclvj"
"xcvjlxk jcvljxlck jxcvl 4 cheese\"";
using namespace qi;
typedef std::string::const_iterator It;
It f(input.begin()), l(input.end());
int next = 1;
qi::rule<It, std::string(int)> label;
qi::rule<It, std::string(int)> value;
qi::rule<It, int()> number;
qi::rule<It, Entry(), qi::locals<int> > assign;
label %= qi::raw [
( eps(qi::_r1 % 2) >> qi::string("apple ") > qi::uint_(qi::_r1) )
| qi::uint_(qi::_r1) > qi::string(" cheese")
];
value %= '"'
>> qi::omit[ *(~qi::char_('"') - label(qi::_r1)) ]
>> label(qi::_r1)
>> qi::omit[ *(~qi::char_('"')) ]
>> '"';
number %= qi::uint_(phx::ref(next)++) /*| eps [ phx::throw_(std::runtime_error("Sequence number out of sync")) ] */;
assign %= 's' > number[ qi::_a = _1 ] > '=' > value(qi::_a);
std::vector<Entry> parsed;
bool ok = false;
try
{
ok = parse(f, l, assign % ", ", parsed);
if (ok)
{
using namespace karma;
std::cout << "parsed:\t" << format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed) << std::endl;
}
} catch(qi::expectation_failure<It>& e)
{
std::cerr << "Expectation failed: " << e.what() << " '" << std::string(e.first, e.last) << "'" << std::endl;
} catch(const std::exception& e)
{
std::cerr << e.what() << std::endl;
}
if (!ok || (f!=l))
std::cerr << "problem at: '" << std::string(f,l) << "'" << std::endl;
}
Provided you can use c++11 compiler, parsing these patterns is pretty simple using AXE†:
#include <axe.h>
#include <string>
template<class I>
void num_value(I i1, I i2)
{
unsigned n;
unsigned next = 1;
// rule to match unsigned decimal number and compare it with another number
auto num = axe::r_udecimal(n) & axe::r_bool([&](...){ return n == next; });
// rule to match a single word
auto word = axe::r_alphastr();
// rule to match space characters
auto space = axe::r_any(" \t\n");
// semantic action - print to cout and increment next
auto e_cout = axe::e_ref([&](I i1, I i2)
{
std::cout << std::string(i1, i2) << '\n';
++next;
});
// there are only two patterns in this example
auto pattern1 = (word & +space & num) >> e_cout;
auto pattern2 = (num & +space & word) >> e_cout;
auto s1 = axe::r_find(pattern1);
auto s2 = axe::r_find(pattern2);
auto text = s1 & s2 & s1 & s2 & axe::r_end();
text(i1, i2);
}
To parse the text simply call num_value(text.begin(), text.end()); No changes required to parse unicode strings.
† I didn't test it.
Look into Boost.Regex. I've seen an almost-identical poosting in boost-users and the solution is to use regexes for some of the match work.

Boost.Spirit bug when mixing "alternates" with "optionals"?

I've only been working with Boost.Spirit (from Boost 1.44) for three days, trying to parse raw e-mail messages by way of the exact grammar in RFC2822. I thought I was starting to understand it and get somewhere, but then I ran into a problem:
#include <iostream>
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
using qi::omit;
using qi::repeat;
using std::cout;
using std::endl;
typedef qi::rule<std::string::const_iterator, std::string()> strrule_t;
void test(const std::string input, strrule_t rule) {
std::string target;
std::string::const_iterator i = input.begin(), ie = input.end();
if (qi::parse(i, ie, rule, target)) {
cout << "Success: '" << target << "'" << endl;
} else {
cout << "Failed to match." << endl;
}
}
int main() {
strrule_t obsolete_year = omit[-qi::char_(" \t")] >> repeat(2)[qi::digit] >>
omit[-qi::char_(" \t")];
strrule_t correct_year = repeat(4)[qi::digit];
test("1776", correct_year | repeat(2)[qi::digit]); // 1: Works, reports 1776.
test("76", obsolete_year); // 2: Works, reports 76.
test("76", obsolete_year | correct_year); // 3: Works, reports 76.
test(" 76", correct_year | obsolete_year); // 4: Works, reports 76.
test("76", correct_year | obsolete_year); // 5: Fails.
test("76", correct_year | repeat(2)[qi::digit]); // 6: Also fails.
}
If test #3 works, then why does test #5 -- the exact same test with the two alternatives reversed -- fail?
By the same token, if you'll pardon the expression: if test #4 works, and the space at the beginning is marked as optional, then why does test #5 (the exact same test with the exact same input, save that there's no leading space in the input) fail?
And finally, if this is a bug in Boost.Spirit (as I suspect it must be), how can I work around it?
That is because you hit a bug in Spirit's directive repeat[]. Thanks for the report, I fixed this problem in SVN (rev. [66167]) and it will be available in Boost V1.45. At the same time I would like to add your small test as a regression test to Spirit's test suite. I hope you don't mind me doing so.

Reading a file of mixed data into a C++ string

I need to use C++ to read in text with spaces, followed by a numeric value.
For example, data that looks like:
text1
1.0
text two
2.1
text2 again
3.1
can't be read in with 2 "infile >>" statements. I'm not having any luck with getline
either. I ultimately want to populate a struct with these 2 data elements. Any ideas?
The standard IO library isn't going to do this for you alone, you need some sort of simple parsing of the data to determine where the text ends and the numeric value begins. If you can make some simplifying assumptions (like saying there is exactly one text/number pair per line, and minimal error recovery) it wouldn't be too bad to getline() the whole thing into a string and then scan it by hand. Otherwise, you're probably better off using a regular expression or parsing library to handle this, rather than reinventing the wheel.
Why? You can use getline providing a space as line separator. Then stitch extracted parts if next is a number.
If you can be sure that your input is well-formed, you can try something like this sample:
#include <iostream>
#include <sstream>
int main()
{
std::istringstream iss("text1 1.0 text two 2.1 text2 again 3.1");
for ( ;; )
{
double x;
if ( iss >> x )
{
std::cout << x << std::endl;
}
else
{
iss.clear();
std::string junk;
if ( !(iss >> junk) )
break;
}
}
}
If you do have to validate input (instead of just trying to parse anything looking like a double from it), you'll have to write some kind of parser, which is not hard but boring.
Pseudocode.
This should work. It assumes you have text/numbers in pairs, however. You'll have to do some jimmying to get all the typing happy, also.
while( ! eof)
getline(textbuffer)
getline(numberbuffer)
stringlist = tokenize(textbuffer)
number = atof(numberbuffer)