Parsing nested data in boost-spirit - c++

I need to parse a text tree:
std::string data = "<delimiter>field1a fieald1b fieald1c<delimiter1>subfield11<delimiter1>subfieald12<delimiter1>subfieald13 ... <delimiter>field2a fieald2b fieald2c<delimiter1>subfield21<delimiter1>subfieald22<delimiter1>subfieald23 ..."
where <delimiter> and <delimiter1> are multi-character substrings of the std::string, not single chars.
Is it possible to tokenize this string with boost::spirit?

The list parser is your friend:
namespace qi = boost::spirit::qi;
// tokenize on '<delimiter1>' and return the vector;
// '<delimiter>' is excluded too, so a field stops where the next record starts
qi::rule<std::string::iterator, qi::space_type, std::vector<std::string>()> fields =
    *(qi::char_ - "<delimiter1>" - "<delimiter>") % "<delimiter1>";
std::string data("<delimiter>field1a fieald1b ...");
std::vector<std::vector<std::string> > fields_data;
// tokenize on '<delimiter>' and return a vector of vectors
qi::phrase_parse(data.begin(), data.end(),
    fields % "<delimiter>", qi::space, fields_data);
You might need a recent version of Spirit for this to work (Boost V1.47 or SVN trunk).

Yes, you could use Spirit to parse this format, but it seems to me to be much more than you need.
I would just code the tokenizer myself using std::string functions. Alternatively, boost::regex should do this very easily for you.
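A minimal sketch of the hand-written route, assuming the two delimiters are fixed strings and that one is not a substring of the other. The helper names split_on and split_nested are made up for this illustration; only std::string::find and substr are used.

```cpp
#include <string>
#include <vector>

// Split `text` on every occurrence of the multi-character delimiter `delim`.
// Empty pieces (for example the one before a leading delimiter) are kept.
std::vector<std::string> split_on(const std::string& text, const std::string& delim) {
    std::vector<std::string> parts;
    if (delim.empty()) {  // nothing to split on
        parts.push_back(text);
        return parts;
    }
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = text.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(text.substr(start));  // last piece
            break;
        }
        parts.push_back(text.substr(start, pos - start));
        start = pos + delim.size();
    }
    return parts;
}

// Two-level split: records on `outer`, then each record on `inner`.
std::vector<std::vector<std::string>> split_nested(const std::string& text,
                                                   const std::string& outer,
                                                   const std::string& inner) {
    std::vector<std::vector<std::string>> result;
    for (const std::string& record : split_on(text, outer)) {
        if (record.empty()) continue;  // skip the empty piece before a leading delimiter
        result.push_back(split_on(record, inner));
    }
    return result;
}
```

Calling split_nested(data, "<delimiter>", "<delimiter1>") on a string shaped like the one in the question gives one inner vector per record, with the field text first and the subfields after it. Unlike the Spirit version with a space skipper, whitespace inside the pieces is preserved.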

Related

Boost spirit: Parse char_ with changing local variable value

I want to implement a grammar that requires parsing instance names and paths, where a path is a list of instance names separated by a divider. The divider can be either . (period) or / (slash) given in the input file before the paths are listed, e.g.:
DIVIDER .
a.b.c
x.y.z
Once set, the divider never changes for the whole file (i.e. if set to ., encountering a path like a/b/c should not parse correctly). Since I don't know what the divider is in advance, I'm thinking about storing it in a variable of my grammar and use that value in corresponding char_ parsers (of course, the actual grammar is much more complex, but this is the part where I'm having trouble).
This is somewhat similar to this question: Boost spirit using local variables, but not quite what I want, since the Nabialek trick would still allow "invalid" paths to parse after the divider is set.
I'm not asking for a complete solution here, but my question is essentially this: Can I parse values into members of my grammar and then use these values for further parsing of remaining input?
I'd use an inherited attribute:
qi::rule<It, std::string(char)> element = *~qi::char_(qi::_r1);
qi::rule<It, std::vector<std::string>(char)> path = element(qi::_r1) % qi::char_(qi::_r1);
// use it like:
std::vector<std::string> data;
bool ok = qi::parse(f, l, path('/'), data);
Alternatively you /can/ indeed bind to a local variable:
char delim = '/';
qi::rule<It, std::string()> element = *~qi::char_(delim);
qi::rule<It, std::vector<std::string>()> path = element % qi::char_(delim);
// use it like:
std::vector<std::string> data;
bool ok = qi::parse(f, l, path, data);
If you need it to be dynamic, use boost::phoenix::ref:
char delim = '/';
qi::rule<It, std::string()> element = *~qi::char_(boost::phoenix::ref(delim));
qi::rule<It, std::vector<std::string>()> path = element % qi::char_(boost::phoenix::ref(delim));
// use it like:
std::vector<std::string> data;
bool ok = qi::parse(f, l, path, data);

Using boost::spirit to match words

I want to create a parser that will match exactly two alphanumeric words from a string, such as:
message1 message2
and then save that into two variables of type std::string.
I've read this previous answer which seems to work for an endless amount of repetitions, which uses the following parser:
+qi::alnum % +qi::space
However when I try to do this:
bool const result = qi::phrase_parse(
input.begin(), input.end(),
+qi::alnum >> +qi::alnum,
+qi::space,
words
);
the words vector contains every single letter in a different string:
't'
'h'
'i'
's'
'i'
's'
This is extremely counter-intuitive, and I'm not sure as to why it's happening. Could someone please explain that?
Also, can I have two predefined strings to be populated instead of a std::vector?
Final note: I would like to avoid the using statement, as I would like to have every namespace clearly defined to help me understand how Spirit works.
Yes, but the skipper ignores the whitespace before you can act on it.
Use lexeme to control the skipper:
bool const result = qi::phrase_parse(
input.begin(), input.end(),
qi::lexeme [+qi::alnum] >> qi::lexeme [+qi::alnum],
qi::space,
words
);
Note the skipper should be qi::space instead of +qi::space.
See also Boost spirit skipper issues

Boost Qi Composing rules using Functions

I'm trying to define some Boost::spirit::qi parsers for multiple subsets of a language with minimal code duplication. To do this, I created a few basic rule building functions. The original parser works fine, but once I started to use the composing functions, my parsers no longer seem to work.
The general language is of the form:
A B: C
There are subsets of the language where A, B, or C must be specific types, such as A is an int while B and C are floats. Here is the parser I used for that sub language:
using entry = boost::tuple<int, float, float>;
template <typename Iterator>
struct sublang : grammar<Iterator, entry(), ascii::space_type>
{
sublang() : sublang::base_type(start)
{
start = int_ >> float_ >> ':' >> float_;
}
rule<Iterator, entry(), ascii::space_type> start;
};
But since there are many subsets, I tried to create a function to build my parser rules:
template<typename AttrName, typename Value>
auto attribute(AttrName attrName, Value value)
{
return attrName >> ':' >> value;
}
So that I could build parsers for each subset more easily without duplicate information:
// in sublang
start = int_ >> attribute(float_, float_);
This fails however and I'm not sure why. In my clang testing, parsing just fails. In g++, it seems the program crashes.
Here's the full example code: http://coliru.stacked-crooked.com/a/8636f19b2e9bff8d
What is wrong with the current code and what would be the correct approach for this problem? I would like to avoid specifying the grammar of attributes and other elements in each sublanguage parser.
Quite simply: using auto with Spirit (or any EDSL based on Boost Proto and Boost Phoenix) is most likely Undefined Behaviour¹
Now, you can usually fix this using
BOOST_SPIRIT_AUTO
boost::proto::deep_copy
the new facility that's coming in the most recent version of Boost (TODO add link)
In this case,
template<typename AttrName, typename Value>
auto attribute(AttrName attrName, Value value) {
return boost::proto::deep_copy(attrName >> ':' >> value);
}
fixes it: Live On Coliru
Alternatively
you could use qi::lazy[] with inherited attributes.
I do very similar things in the prop_key rule in Reading JSON file with C++ and BOOST.
you could have a look at the Keyword List Operator from the Spirit Repository. It's designed to allow easier construction of grammars like:
no_constraint_person_rule %=
kwd("name")['=' > parse_string ]
/ kwd("age") ['=' > int_]
/ kwd("size") ['=' > double_ > 'm']
;
This you could potentially combine with the Nabialek Trick. I'd search the answers on SO for examples. (One is Grammar balancing issue)
¹ Except for entirely stateless actors (Eric Niebler on this) and expression placeholders. See e.g.
Assigning parsers to auto variables
undefined behaviour somewhere in boost::spirit::qi::phrase_parse
C++ Boost qi recursive rule construction
boost spirit V2 qi bug associated with optimization level
Some examples
Define parsers parameterized with sub-parsers in Boost Spirit
Generating Spirit parser expressions from a variadic list of alternative parser expressions

Vector push_back on duplicate strings with the help of delimiter

I am trying to read the PATH environment variable and remove any duplicates in it using vector functionality such as sort, erase and unique. But as far as I can see, vector delimits each element by newline by default. When I get the path as C:\Program Files(x86)\..., it breaks at C:\Program. This is my code so far:
char *path = getenv("PATH");
char str[10012] = "";
strcpy(str,path);
string strr(str);
vector<string> vec;
stringstream ss(strr);
string s;
while(ss >> s)
{
vec.push_back(s);
}
sort(vec.begin(),vec.end());
vec.erase(unique(vec.begin(),vec.end()),vec.end());
for(unsigned i=0;i<vec.size();i++)
{
cout<<vec[i]<<endl;
}
Is it a delimiter problem? I need to push_back at every ; and then search for duplicates. Can anyone help me with this?
I would use a stringstream to chop it up, and then use a set to ensure there are no duplicates.
std::string p { std::getenv("PATH") };
std::set<std::string> set;
std::stringstream ss { p };
std::string s;
while(std::getline(ss, s, ':')) //this might need to be ';' for windows
{
set.insert(s);
}
for(const auto& elem : set)
std::cout << elem << std::endl;
Should you need to use a vector for some reason, you'd want to sort it with std::sort then remove duplicates with std::unique then erase the slack with erase.
std::sort(begin(vec), end(vec));
auto it=std::unique(begin(vec), end(vec));
vec.erase(it, end(vec));
EDIT: link to docs
http://en.cppreference.com/w/cpp/container/set
http://en.cppreference.com/w/cpp/algorithm/unique
http://en.cppreference.com/w/cpp/algorithm/sort
For this task it is better to use std::set<std::string> which will eliminate duplicates automatically. To read in PATH, use strtok to split it into substrings.
You need to use a different delimiter (':' or ';' to split the directories from the PATH, depending on the system). For instance, you can have a look at the std::getline() function to replace your current while () / push_back loop. This function allows you to specify a custom delimiter and would be a drop-in replacement in your code.
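As a rough sketch of that drop-in replacement, keeping the vector plus sort/unique/erase from the question: the function name unique_path_entries is invented here, and the delimiter is passed in so the caller can pick ';' or ':' for the platform.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Split `path` on `delim` with std::getline, then sort and drop duplicates.
std::vector<std::string> unique_path_entries(const std::string& path, char delim) {
    std::vector<std::string> vec;
    std::stringstream ss(path);
    std::string s;
    while (std::getline(ss, s, delim)) {  // reads up to the next delimiter, spaces included
        if (!s.empty()) {                 // ignore empty entries such as "a;;b"
            vec.push_back(s);
        }
    }
    std::sort(vec.begin(), vec.end());
    vec.erase(std::unique(vec.begin(), vec.end()), vec.end());
    return vec;
}
```

When feeding it the real environment variable, check the result of std::getenv("PATH") for a null pointer before constructing a std::string from it.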
It isn't so much that std::vector<T> is delimiting anything but that the formatted input operator (operator>>()) for strings uses whitespace as the delimiter. Others have already posted about using std::getline() and the like. There are two other approaches:
Change what is considered to be whitespace for the stream! The std::string input operator uses the stream's std::locale object to obtain a std::ctype<char> facet which can be replaced. The std::ctype<char> facet has functions to do character classification and it can be used to consider, e.g., the character ';' as a space. It is a bit involved but a more solid approach than the next one.
I don't think path components can include newlines, i.e., a simple approach could be to replace all semicolons by newlines before reading the components:
std::string path(std::getenv("PATH"));
std::replace(path.begin(), path.end(), ';', '\n');
std::istringstream pin(path);
std::istream_iterator<std::string> pbegin(pin), pend;
std::vector<std::string> vec(pbegin, pend);
This approach may have the problem that the PATH may contain components which contain spaces: these would be split into individual objects. You might want to replace spaces with another character (e.g., the now unused ';') and restore them as spaces at an appropriate point.
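For the first approach (changing what the stream considers whitespace), a sketch could look like the following. The class name path_separator_ctype and the function split_path are made up for this example. It assumes ';' is the separator and additionally clears the space classification of ' ' so that components such as "Program Files" stay in one piece.

```cpp
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// A ctype facet whose table classifies ';' as whitespace and ' ' as
// non-whitespace, so operator>> on strings breaks at ';' but not at spaces.
class path_separator_ctype : public std::ctype<char> {
public:
    path_separator_ctype() : std::ctype<char>(make_table()) {}

private:
    static const mask* make_table() {
        // Copy the classic "C" locale table once and adjust two entries.
        static const std::vector<mask> table = [] {
            std::vector<mask> t(classic_table(), classic_table() + table_size);
            t[static_cast<unsigned char>(';')] |= space;
            t[static_cast<unsigned char>(' ')] &= ~space;
            return t;
        }();
        return table.data();
    }
};

std::vector<std::string> split_path(const std::string& path) {
    std::istringstream in(path);
    // The locale takes ownership of the facet and deletes it when unused.
    in.imbue(std::locale(in.getloc(), new path_separator_ctype));
    std::istream_iterator<std::string> first(in), last;
    return std::vector<std::string>(first, last);
}
```

Tabs and newlines are still treated as whitespace here, which should be harmless for PATH values. On systems that use ':' as the separator, the table entry to change would be ':' instead.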

C++: storing CSV in container

I have a std::string that contains comma-separated values. I need to store those values in some suitable container, e.g. an array, a vector or some other container. Is there any built-in function through which I could do this, or do I need to write custom code?
If you're willing and able to use the Boost libraries, Boost Tokenizer would work really well for this task.
That would look like:
std::string str = "some,comma,separated,words";
typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
boost::char_separator<char> sep(",");
tokenizer tokens(str, sep);
std::vector<std::string> vec(tokens.begin(), tokens.end());
You basically need to tokenize the string using , as the delimiter. This earlier Stack Overflow thread should help you with it.
Here is another relevant post.
I don't think there is anything for this in the standard library. I would approach it like this:
Tokenize the string on the , delimiter using strtok.
Convert it to integer using atoi function.
push_back the value to the vector.
If you are comfortable with boost library, check this thread.
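The three steps above can be sketched as follows, assuming the values are integers (which is what atoi implies). The function name parse_csv_ints is only for illustration.

```cpp
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Tokenize a comma-separated string with strtok and convert each token with atoi.
std::vector<int> parse_csv_ints(const std::string& csv) {
    // strtok modifies its input, so work on a writable, null-terminated copy.
    std::vector<char> buffer(csv.begin(), csv.end());
    buffer.push_back('\0');

    std::vector<int> values;
    for (char* token = std::strtok(buffer.data(), ",");
         token != nullptr;
         token = std::strtok(nullptr, ",")) {
        values.push_back(std::atoi(token));  // atoi returns 0 for non-numeric text
    }
    return values;
}
```

Two caveats with this route: strtok silently skips empty fields (so "1,,2" yields two values, not three), and it keeps internal state, so it is not safe to use from several threads at once.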
Using AXE parser generator you can easily parse your csv string, e.g.
std::string input = "aaa,bbb,ccc,ddd";
std::vector<std::string> v; // your strings get here
auto value = *(r_any() - ',') >> r_push_back(v); // rule for single value
auto csv = *(value & ',') & value & r_end(); // rule for csv string
csv(input.begin(), input.end());
Disclaimer: I didn't test the code above, it might have some superficial errors.