I use boost::spirit to parse (part of) a monomial such as x, y, xy, x^2, or x^3yz. I want to save the variables of the monomial into a map, which also stores the corresponding exponents. The grammar should therefore also store the implicit exponent of 1 (so x is stored as if it had been written x^1).
start = +(potVar);
potVar=(varName>>'^'>>exponent)|(varName>> qi::attr(1));// First try: This doubles the variable name
//potVar = varName >> (('^' >> exponent) | qi::attr(1));// Second try: This works as intended
exponent = qi::int_;
varName = qi::char_("a-z");
When using the default attribute as in the line "First try", Spirit doubles the variable name.
Everything works as intended when using the default attribute as in the line "Second try".
'First try' reads a variable x and stores the pair [xx, 1].
'Second try' reads a variable x and stores the pair [x, 1].
I think I solved the original problem myself: the second try works. However, I don't see how the first try doubles the variable name. I am just getting familiar with boost::spirit, which is full of challenges for me, and there are probably more to come, so I would like to understand this behavior.
Here is the complete code to reproduce the problem. The skeleton of the grammar is copied from a KIT presentation, https://panthema.net/2018/0912-Boost-Spirit-Tutorial/ , and Stack Overflow already helped me find the header that lets me use std::pair.
#include <iostream>
#include <iomanip>
#include <stdexcept>
#include <cmath>
#include <map>
#include <utility>//for std::pair
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted/std_pair.hpp> //https://stackoverflow.com/questions/53953642/parsing-map-of-variants-with-boost-spirit-x3
namespace qi = boost::spirit::qi;
template <typename Parser, typename Skipper, typename ... Args>
void PhraseParseOrDie(
const std::string& input, const Parser& p, const Skipper& s,
Args&& ... args)
{
std::string::const_iterator begin = input.begin(), end = input.end();
boost::spirit::qi::phrase_parse(
begin, end, p, s, std::forward<Args>(args) ...);
if (begin != end) {
std::cout << "Unparseable: "
<< std::quoted(std::string(begin, end)) << std::endl;
throw std::runtime_error("Parse error");
}
}
class ArithmeticGrammarMonomial : public qi::grammar<
std::string::const_iterator,
std::map<std::string, int>(), qi::space_type>
{
public:
using Iterator = std::string::const_iterator;
ArithmeticGrammarMonomial() : ArithmeticGrammarMonomial::base_type(start)
{
start = +(potVar);
potVar=(varName>>'^'>>exponent)|(varName>> qi::attr(1));
//potVar = varName >> (('^' >> exponent) | qi::attr(1));
exponent = qi::int_;
varName = qi::char_("a-z");
}
qi::rule<Iterator, std::map<std::string, int>(), qi::space_type> start;
qi::rule<Iterator, std::pair<std::string, int>(), qi::space_type> potVar;
qi::rule<Iterator, int()> exponent;
qi::rule<Iterator, std::string()> varName;
};
void test2(std::string input)
{
std::map<std::string, int> out_map;
PhraseParseOrDie(input, ArithmeticGrammarMonomial(), qi::space, out_map);
std::cout << "test2() parse result: "<<std::endl;
for(auto &it: out_map)
std::cout<< it.first<<it.second << std::endl;
}
/******************************************************************************/
int main(int argc, char* argv[])
{
std::cout << "Parse Monomial 1" << std::endl;
test2(argc >= 2 ? argv[1] : "x^3y^1");
test2(argc >= 2 ? argv[1] : "xy");
return 0;
}
I think I solved the original problem myself. The second try works.
Indeed. It's how I'd do this (always match the AST with your parser expressions).
However, I don't see how I doubled the variable name.
It's due to backtracking with container attributes: they don't get rolled back. The first branch parses varName into the string of the pair and then fails at '^'. The parser backtracks into the second branch, which parses varName again and appends to the same string, so x ends up as xx.
boost::spirit::qi duplicate parsing on the output
Understanding Boost.spirit's string parser
Parsing with Boost::Spirit (V2.4) into container
Boost Spirit optional parser and backtracking
boost::spirit alternative parsers return duplicates
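The doubling described above can be imitated without Spirit. The sketch below is a hand-written stand-in, not Spirit's implementation: parseVarName, parseExponent, firstTry and secondTry are hypothetical helpers that append to a caller-supplied attribute, which is how Qi parsers treat container attributes such as std::string.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

using Attr = std::pair<std::string, int>;

// Hypothetical stand-in for `varName`: appends one lowercase letter to `out`.
bool parseVarName(const std::string& in, std::size_t& pos, std::string& out) {
    if (pos < in.size() && std::islower(static_cast<unsigned char>(in[pos]))) {
        out += in[pos++];
        return true;
    }
    return false;
}

// Hypothetical stand-in for `'^' >> exponent`; touches pos/out only on success.
bool parseExponent(const std::string& in, std::size_t& pos, int& out) {
    std::size_t p = pos;
    if (p >= in.size() || in[p] != '^') return false;
    ++p;
    const std::size_t start = p;
    int value = 0;
    while (p < in.size() && std::isdigit(static_cast<unsigned char>(in[p]))) {
        value = value * 10 + (in[p] - '0');
        ++p;
    }
    if (p == start) return false;
    pos = p;
    out = value;
    return true;
}

// Mimics (varName >> '^' >> exponent) | (varName >> attr(1)).
bool firstTry(const std::string& in, std::size_t& pos, Attr& attr) {
    const std::size_t save = pos;
    if (parseVarName(in, pos, attr.first) && parseExponent(in, pos, attr.second))
        return true;
    pos = save;  // the input position is rolled back, attr.first is not
    if (parseVarName(in, pos, attr.first)) {
        attr.second = 1;
        return true;
    }
    pos = save;
    return false;
}

// Mimics varName >> (('^' >> exponent) | attr(1)).
bool secondTry(const std::string& in, std::size_t& pos, Attr& attr) {
    if (!parseVarName(in, pos, attr.first)) return false;
    if (!parseExponent(in, pos, attr.second)) attr.second = 1;
    return true;
}
```

Called on the input x with an empty attribute, firstTry leaves the pair as ("xx", 1), while secondTry leaves it as ("x", 1), because the name is parsed only once before the alternative.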
It can also crop up with semantic actions:
Boost Semantic Actions causing parsing issues
Boost Spirit optional parser and backtracking
In short:
match your AST structure in your rule expression, or use qi::hold to force the issue (at performance cost)
avoid semantic actions (Boost Spirit: "Semantic actions are evil"?)
For inspiration, here's a simplified take using Spirit X3
#include <boost/fusion/adapted.hpp>
#include <boost/spirit/home/x3.hpp>
#include <fmt/ranges.h>
#include <map>
namespace Parsing {
namespace x3 = boost::spirit::x3;
auto exponent = '^' >> x3::int_ | x3::attr(1);
auto varName = x3::repeat(1)[x3::char_("a-z")];
auto potVar
= x3::rule<struct P, std::pair<std::string, int>>{}
= varName >> exponent;
auto start = x3::skip(x3::space)[+potVar >> x3::eoi];
template <typename T = x3::unused_type>
void StrictParse(std::string_view input, T&& into = {})
{
auto f = input.begin(), l = input.end();
if (!x3::parse(f, l, start, into)) {
fmt::print(stderr, "Error at: '{}'\n", std::string(f, l));
throw std::runtime_error("Parse error");
}
}
} // namespace Parsing
void test2(std::string input) {
std::map<std::string, int> out_map;
Parsing::StrictParse(input, out_map);
fmt::print("{} -> {}\n", input, out_map);
}
int main() {
for (auto s : {"x^3y^1", "xy"})
test2(s);
}
Prints
x^3y^1 -> [("x", 3), ("y", 1)]
xy -> [("x", 1), ("y", 1)]
Bonus Notes
It looks to me like you should be more careful. Even if you assume that all variables are one letter and that no terms can occur (only factors), you still need to handle x^5y^2x correctly as x^6y^2, right?
Here's a Qi version that uses semantic actions to correctly accumulate like factors:
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/include/qi.hpp>
#include <iomanip>
#include <iostream>
#include <map>
namespace qi = boost::spirit::qi;
using Iterator = std::string::const_iterator;
using Monomial = std::map<char, int>;
struct ArithmeticGrammarMonomial : qi::grammar<Iterator, Monomial()> {
ArithmeticGrammarMonomial() : ArithmeticGrammarMonomial::base_type(start) {
using namespace qi;
exp_ = '^' >> int_ | attr(1);
start = skip(space)[ //
+(char_("a-z") >> exp_)[_val[_1] += _2] //
];
}
private:
qi::rule<Iterator, Monomial()> start;
qi::rule<Iterator, int(), qi::space_type> exp_;
};
void do_test(std::string_view input) {
Monomial output;
static const ArithmeticGrammarMonomial p;
Iterator f(begin(input)), l(end(input));
qi::parse(f, l, qi::eps > p, output);
std::cout << std::quoted(input) << " -> " << std::endl;
for (auto& [var,exp] : output)
std::cout << " - " << var << '^' << exp << std::endl;
}
int main() {
for (auto s : {"x^3y^1", "xy", "x^5y^2x"})
do_test(s);
}
Prints
"x^3y^1" ->
- x^3
- y^1
"xy" ->
- x^1
- y^1
"x^5y^2x" ->
- x^6
- y^2
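For comparison, the accumulation rule itself is small enough to write by hand. This is only a sketch under the same assumptions as above (single lowercase-letter variables, an optional ^ followed by a non-negative integer, whitespace ignored, no overflow check). parseMonomial is a hypothetical helper, not part of Spirit.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <optional>
#include <string_view>

using Monomial = std::map<char, int>;

// Returns the accumulated exponents, or std::nullopt on malformed input.
std::optional<Monomial> parseMonomial(std::string_view in) {
    Monomial result;
    std::size_t i = 0;
    auto skipSpace = [&] {
        while (i < in.size() && std::isspace(static_cast<unsigned char>(in[i]))) ++i;
    };

    skipSpace();
    if (i == in.size()) return std::nullopt;  // at least one factor is required

    while (i < in.size()) {
        const char var = in[i];
        if (!std::islower(static_cast<unsigned char>(var))) return std::nullopt;
        ++i;
        skipSpace();

        int exp = 1;  // implicit exponent when no '^' follows
        if (i < in.size() && in[i] == '^') {
            ++i;
            skipSpace();
            const std::size_t start = i;
            exp = 0;
            while (i < in.size() && std::isdigit(static_cast<unsigned char>(in[i]))) {
                exp = exp * 10 + (in[i] - '0');
                ++i;
            }
            if (i == start) return std::nullopt;  // '^' without digits
            skipSpace();
        }
        result[var] += exp;  // like factors add their exponents
    }
    return result;
}
```

With this sketch, parseMonomial("x^5y^2x") yields x with exponent 6 and y with exponent 2, and an input such as "x^" is rejected.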
I'd like to parse a string like "{{0, 1}, {2, 3}}" into a std::map. I can write a small function for parsing a string using <regex> library, but I have no idea how to check whether a given string is in a valid format. How can I validate the format of a string?
#include <list>
#include <map>
#include <regex>
#include <iostream>
void f(const std::string& s) {
std::map<int, int> m;
std::regex p {"[\\[\\{\\(](\\d+),\\s*(\\d+)[\\)\\}\\]]"};
auto begin = std::sregex_iterator(s.begin(), s.end(), p);
auto end = std::sregex_iterator();
for (auto x = begin; x != end; ++x) {
std::cout << x->str() << '\n';
m[std::stoi(x->str(1))] = std::stoi(x->str(2));
}
std::cout << m.size() << '\n';
}
int main() {
std::list<std::string> l {
"{{0, 1}, (2, 3)}",
"{{4, 5, {6, 7}}" // Ill-formed, so need to throw an exception.
};
for (auto x : l) {
f(x);
}
}
NOTE: I don't feel obliged to use regex to solve this problem. Any kind of solution, including ways of validating and inserting at once by extracting substrings, will be appreciated.
In my opinion, a Spirit-based parser is always much more robust and readable. It is also much more fun to parse with Spirit :-). So, in addition to @Aleph0's answer, I'd like to provide a compact solution based on Spirit X3:
#include <string>
#include <map>
#include <iostream>
#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/home/x3.hpp>
int main() {
std::string input ="{{0, 1}, {2, 3}}";
using namespace boost::spirit::x3;
const auto pair = '{' > int_ > ',' > int_ > '}';
const auto pairs = '{' > (pair % ',') > '}';
std::map<int, int> output;
// ignore spaces, tabs, newlines
phrase_parse(input.begin(), input.end(), pairs, space, output);
for (const auto [key, value] : output) {
std::cout << key << ":" << value << std::endl;
}
}
Note that I used operator >, which means "expect". So, if the input does not match the expectation, Spirit throws an exception. If you prefer a silent failure, use operator >> instead.
It might be a little too much, but if you have Boost at hand you can use Boost.Spirit to do the job for you. One advantage is that the solution is easily extensible to parse other kinds of maps, such as std::map<std::string, int>.
Another advantage that shouldn't be underestimated is that Boost.Spirit gives you sane exceptions when the string doesn't satisfy your grammar. This is quite hard to achieve with a hand-written solution.
Boost.Spirit also reports the place where the error occurs, so that you can trace the problem back to it.
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/spirit/include/phoenix_stl.hpp>
#include <boost/fusion/adapted/std_pair.hpp>
template <typename Iterator, typename Skipper>
struct mapLiteral : boost::spirit::qi::grammar<Iterator, std::map<int,int>(), Skipper>
{
mapLiteral() : mapLiteral::base_type(map)
{
namespace qi = boost::spirit::qi;
using qi::lit;
map = (lit("{") >> pair >> *(lit(",") >> pair) >> lit("}"))|(lit("{") >> lit("}"));
pair = (lit("{") >> boost::spirit::int_ >> lit(",") >> boost::spirit::int_ >> lit("}"));
}
boost::spirit::qi::rule<Iterator, std::map<int, int>(), Skipper> map;
boost::spirit::qi::rule<Iterator, std::pair<int, int>(), Skipper> pair;
};
std::map<int,int> parse(const std::string& expression, bool& ok)
{
std::map<int, int> result;
try {
boost::spirit::qi::space_type space;
mapLiteral<std::string::const_iterator, decltype(space)> parser;
// Use const iterators so that they match the iterator type of the grammar
std::string::const_iterator b = expression.begin();
std::string::const_iterator e = expression.end();
ok = boost::spirit::qi::phrase_parse(b, e, parser, space, result);
if (b != e) {
ok = false;
return std::map<int, int>();
}
return result;
}
catch (const boost::spirit::qi::expectation_failure<std::string::const_iterator>&) {
ok = false;
return result;
}
}
int main(int argc, char** args)
{
std::vector<std::pair<std::map<int, int>,std::string>> tests = {
{{ },"{ \t\n}"},
{{{5,2},{2,1}},"{ {5,2},{2,1} }"},
{{},"{{2, 6}{}}"} // Bad food
};
for (auto iter :tests)
{
bool ok;
auto result = parse(iter.second, ok);
if (result == iter.first)
{
std::cout << "Equal:" << std::endl;
}
}
}
Since Han mentioned in his comments that he would like to wait for further ideas, I will show an additional solution.
And as everybody before, I think it is the most appropriate solution :-)
Additionally, I will unpack the "big hammer" and talk about "languages" and "grammars" and, uh oh, the Chomsky hierarchy.
First a very simple answer: pure regular expressions cannot count. So they cannot check matching braces, such as 3 opening braces and 3 closing braces.
They are mostly implemented as a DFA (deterministic finite automaton), also known as an FSA (finite state automaton). One of the relevant properties here is that they know only their current state. They cannot "remember" previous states. They have no memory.
The languages that they can produce are so-called "regular languages". In the Chomsky hierarchy, the grammar that produces such a regular language is of Type-3. And “regular expressions” can be used to produce such languages.
However, there are extensions to regular expressions that can also be used to match balanced braces. See here: Regular expression to match balanced parentheses
But these are not regular expressions as per the original definition.
What we really need is a Chomsky Type-2 grammar, a so-called context-free grammar. This is usually implemented with a pushdown automaton. A stack is used to store additional state. This is the “memory” that regular expressions do not have.
So, if we want to check the syntax of a given expression, as in your case the input for a std::map, we can define an ultra-simple grammar and parse the input string using the standard classical approach: a shift/reduce parser.
There are several steps necessary. First, the input stream is split into lexemes or tokens. This is usually done by a so-called lexer or scanner; you will always find a function like getNextToken or similar. Then the tokens are shifted onto the stack. The top of the stack is matched against the productions in the grammar. If it matches the right side of a production, the matching elements on the stack are replaced by the non-terminal on the left side of that production. This procedure is repeated until the start symbol of the grammar is reached (meaning everything was OK) or a syntax error is found.
Regarding your question:
How to parse a string into std::map and validate its format?
I would split it into 2 tasks.
Parse the string to validate the format
If the string is valid, put the data into a map
Task 2 is simple and typically a one-liner using a std::istream_iterator.
Task 1 unfortunately needs a shift-reduce-parser. This is a little bit complex.
In the code below, I show one possible solution. Please note: this can of course be optimized by using tokens with attributes. The attributes would be an integer number and the type of the brace. The tokens with attributes would be stored on the parse stack. With that we could eliminate the need for productions for all kinds of braces, and we could fill the map in the parser (in the reduction operation of one of “{Token::Pair, { Token::B1open, Token::Integer, Token::Comma, Token::Integer, Token::B1close} }”).
Please see the code below:
#include <iostream>
#include <iterator>
#include <sstream>
#include <map>
#include <vector>
#include <algorithm>
#include <cctype> // std::isspace, std::isdigit
#include <string>
// Tokens: Terminals and Non-Terminals
enum class Token { Pair, PairList, End, OK, Integer, Comma, B1open, B1close, B2open, B2close, B3open, B3close };
// Production type for Grammar
struct Production { Token nonTerminal; std::vector<Token> rightSide; };
// The Context Free Grammar CFG
std::vector<Production> grammar
{
{Token::OK, { Token::B1open, Token::PairList, Token::B1close } },
{Token::OK, { Token::B2open, Token::PairList, Token::B2close } },
{Token::OK, { Token::B3open, Token::PairList, Token::B3close } },
{Token::PairList, { Token::PairList, Token::Comma, Token::Pair} },
{Token::PairList, { Token::Pair } },
{Token::Pair, { Token::B1open, Token::Integer, Token::Comma, Token::Integer, Token::B1close} },
{Token::Pair, { Token::B2open, Token::Integer, Token::Comma, Token::Integer, Token::B2close} },
{Token::Pair, { Token::B3open, Token::Integer, Token::Comma, Token::Integer, Token::B3close} }
};
// Helper for translating brace characters to Tokens
std::map<const char, Token> braceToToken{
{'(',Token::B1open},{'[',Token::B2open},{'{',Token::B3open},{')',Token::B1close},{']',Token::B2close},{'}',Token::B3close},
};
// A classical SHIFT - REDUCE Parser
class Parser
{
public:
Parser() : parseString(), parseStringPos(parseString.begin()) {}
bool parse(const std::string& inputString);
protected:
// String to be parsed
std::string parseString{}; std::string::iterator parseStringPos{}; // Iterator for input string
// The parse stack for the Shift Reduce Parser
std::vector<Token> parseStack{};
// Parser Step 1: LEXER (lexical analysis / scanner)
Token getNextToken();
// Parser Step 2: SHIFT
void shift(Token token) { parseStack.push_back(token); }
// Parser Step 3: MATCH / REDUCE
bool matchAndReduce();
};
bool Parser::parse(const std::string& inputString)
{
parseString = inputString; parseStringPos = parseString.begin(); parseStack.clear();
Token token{ Token::End };
do // Read tokens until end of string
{
token = getNextToken(); // Parser Step 1: LEXER (lexical analysis / scanner)
shift(token); // Parser Step 2: SHIFT
while (matchAndReduce()) // Parser Step 3: MATCH / REDUCE
; // Empty body
} while (token != Token::End); // Do until end of string reached
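// Accept only if the stack is exactly [OK, End]; checking parseStack[0] alone would also accept tokens shifted after a complete map
return (parseStack.size() == 2 && parseStack[0] == Token::OK && parseStack[1] == Token::End);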
}
Token Parser::getNextToken()
{
Token token{ Token::End };
// Eat all white spaces
while ((parseStringPos != parseString.end()) && std::isspace(static_cast<int>(*parseStringPos))) {
++parseStringPos;
}
// Check for end of string
if (parseStringPos == parseString.end()) {
token = Token::End;
}
// Handle digits
else if (std::isdigit(static_cast<int>(*parseStringPos))) {
while ((((parseStringPos + 1) != parseString.end()) && std::isdigit(static_cast<int>(*(parseStringPos + 1))))) ++parseStringPos;
token = Token::Integer;
}
// Detect a comma
else if (*parseStringPos == ',') {
token = Token::Comma;
}
// Else search for all kinds of braces
else {
std::map<const char, Token>::iterator foundBrace = braceToToken.find(*parseStringPos);
if (foundBrace != braceToToken.end()) token = foundBrace->second;
}
// In next function invocation the next string element will be checked
if (parseStringPos != parseString.end())
++parseStringPos;
return token;
}
bool Parser::matchAndReduce()
{
bool result{ false };
// Iterate over all productions in the grammar
for (const Production& production : grammar) {
if (production.rightSide.size() <= parseStack.size()) {
// If enough elements on the stack, match the top of the stack with a production
if (std::equal(production.rightSide.begin(), production.rightSide.end(), parseStack.end() - production.rightSide.size())) {
// Found production: Reduce
parseStack.resize(parseStack.size() - production.rightSide.size());
// Replace right side of production with left side
parseStack.push_back(production.nonTerminal);
result = true;
break;
}
}
}
return result;
}
using IntMap = std::map<int, int>;
using IntPair = std::pair<int, int>;
namespace std {
istream& operator >> (istream& is, IntPair& intPair) {
return is >> intPair.first >> intPair.second;
}
ostream& operator << (ostream& os, const pair<const int, int>& intPair) {
return os << intPair.first << " --> " << intPair.second;
}
}
int main()
{ // Test Data. Test Vector with different strings to test
std::vector <std::string> testVector{
"({10, 1 1}, (2, 3) , [5 ,6])",
"({10, 1}, (2, 3) , [5 ,6])",
"({10, 1})",
"{10,1}"
};
// Define the Parser
Parser parser{};
for (std::string& test : testVector)
{ // Give some nice info to the user
std::cout << "\nChecking '" << test << "'\n";
// Parse the test string and test, if it is valid
bool inputStringIsValid = parser.parse(test);
if (inputStringIsValid) { // String is valid. Delete everything but digits
std::replace_if(test.begin(), test.end(), [](const char c) {return !std::isdigit(static_cast<int>(c)); }, ' ');
std::istringstream iss(test); // Copy the string with digits into an istringstream, so that we can read with istream_iterator
IntMap intMap{ std::istream_iterator<IntPair>(iss),std::istream_iterator<IntPair>() };
// Present the resulting data in the map to the user
std::copy(intMap.begin(), intMap.end(), std::ostream_iterator<IntPair>(std::cout, "\n"));
} else {
std::cerr << "***** Invalid input data\n";
}
}
return 0;
}
I hope this is not too complex. But it is the "mathematically" correct solution. Have fun . . .
You can validate your strings by checking just the parentheses, like so. This is not extremely efficient, since it always iterates over each whole string, but it can be optimized.
#include <list>
#include <iostream>
#include <string>
bool validate(std::string s)
{
std::list<char> parens;
for (auto c : s) {
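if (c == '(' || c == '[' || c == '{') {
parens.push_back(c);
} else if (c == ')' || c == ']' || c == '}') {
// A closing bracket with nothing open can never match; calling back() on an empty list would be undefined behavior.
if (parens.empty()) {
return false;
}
if ((c == ')' && parens.back() == '(') ||
(c == ']' && parens.back() == '[') ||
(c == '}' && parens.back() == '{')) {
parens.pop_back();
}
}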
}
return parens.size() == 0;
}
int main()
{
std::list<std::string> l {
"{{0, 1}, (2, 3)}",
"{{4, 5, {6, 7}}" // Ill-formed, so need to throw an exception.
};
for (auto s : l) {
std::cout << "'" << s << "' is " << (validate(s) ? "" : "not ") << "valid" << std::endl;
}
return 0;
}
The output of the above code is this:
'{{0, 1}, (2, 3)}' is valid
'{{4, 5, {6, 7}}' is not valid
EDIT:
This version should be more efficient since it returns right after it notices a string is not valid.
bool validate(std::string s)
{
std::list<char> parens;
for (auto c : s) {
if (c == '(' || c == '[' || c == '{') {
parens.push_back(c);
}
if (c == ')') {
// Check empty() first: back() on an empty list is undefined behavior
if (parens.empty() || parens.back() != '(') {
return false;
}
parens.pop_back();
} else if (c == ']') {
if (parens.empty() || parens.back() != '[') {
return false;
}
parens.pop_back();
} else if (c == '}') {
if (parens.empty() || parens.back() != '{') {
return false;
}
parens.pop_back();
}
}
return parens.size() == 0;
}
Your regex parses a single map element perfectly. I suggest validating the string before creating the map and filling it with parsed elements.
Let's use a slightly improved version of your regex:
[\[\{\(](([\[\{\(](\d+),(\s*)(\d+)[\)\}\]])(,?)(\s*))*[\)\}\]]
It matches the whole string if it is valid: the string begins with [\[\{\(], ends with [\)\}\]], and contains zero or more map-element patterns inside, each optionally followed by , and zero or more spaces.
Here is the code:
#include <list>
#include <map>
#include <regex>
#include <sstream>
#include <stdexcept>
#include <string>
#include <iostream>
void f(const std::string& s) {
// part 1: validate string
std::regex valid_pattern {"[\\[\\{\\(](([\\[\\{\\(](\\d+),(\\s*)(\\d+)[\\)\\}\\]])(,?)(\\s*))*[\\)\\}\\]]"};
auto valid_begin = std::sregex_iterator(s.begin(), s.end(), valid_pattern);
auto valid_end = std::sregex_iterator();
if (valid_begin == valid_end || valid_begin->str().size() != s.size ()) {
std::stringstream res;
res << "String \"" << s << "\" doesn't satisfy pattern!";
throw std::invalid_argument (res.str ());
} else {
std::cout << "String \"" << s << "\" satisfies pattern!" << std::endl;
}
// part 2: parse map elements
std::map<int, int> m;
std::regex pattern {"[\\[\\{\\(](\\d+),\\s*(\\d+)[\\)\\}\\]]"};
auto parsed_begin = std::sregex_iterator(s.begin(), s.end(), pattern);
auto parsed_end = std::sregex_iterator();
for (auto x = parsed_begin; x != parsed_end; ++x) {
m[std::stoi(x->str(1))] = std::stoi(x->str(2));
}
std::cout << "Number of parsed elements: " << m.size() << '\n';
}
int main() {
std::list<std::string> l {
"{}",
"[]",
"{{0, 153}, (2, 3)}",
"{{0, 153}, (2, 3)}",
"{[0, 153], (2, 3), [154, 33] }",
"{[0, 153], (2, 3), [154, 33] ", // Ill-formed, so need to throw an exception.
"{{4, 5, {6, 7}}", // Ill-formed, so need to throw an exception.
"{{4, 5, {x, 7}}" // Ill-formed, so need to throw an exception.
};
for (const auto &x : l) {
try {
f(x);
}
catch (std::invalid_argument &ex) {
std::cout << ex.what () << std::endl;
}
std::cout << std::endl;
}
}
Here is the output:
String "{}" satisfies pattern!
Number of parsed elements: 0
String "[]" satisfies pattern!
Number of parsed elements: 0
String "{{0, 153}, (2, 3)}" satisfies pattern!
Number of parsed elements: 2
String "{{0, 153}, (2, 3)}" satisfies pattern!
Number of parsed elements: 2
String "{[0, 153], (2, 3), [154, 33] }" satisfies pattern!
Number of parsed elements: 3
String "{[0, 153], (2, 3), [154, 33] " doesn't satisfy pattern!
String "{{4, 5, {6, 7}}" doesn't satisfy pattern!
String "{{4, 5, {x, 7}}" doesn't satisfy pattern!
PS: It has only one defect. It doesn't check that each closing bracket corresponds to its opening one. So it matches this: {], {(1,2]) etc. If that is not okay for you, the easiest fix is to add some extra validation code before putting a parsed pair in the map.
PPS: If you are able to avoid regexes, your problem can be solved much more efficiently with a single scan of each string. @SilvanoCerza proposed an implementation for this case.
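The extra validation mentioned in the PS can be folded into the element loop by capturing the bracket characters as well. A minimal sketch, assuming the outer brackets are checked separately: closes and parsePairsChecked are hypothetical names, and the pattern is the element regex from above with two added capture groups.

```cpp
#include <map>
#include <regex>
#include <stdexcept>
#include <string>

// True when `close` is the counterpart of `open`.
bool closes(char open, char close) {
    return (open == '(' && close == ')') ||
           (open == '[' && close == ']') ||
           (open == '{' && close == '}');
}

// Extracts the pairs and rejects elements such as "(1, 2]".
std::map<int, int> parsePairsChecked(const std::string& s) {
    // Groups: 1 = opening bracket, 2 = key, 3 = value, 4 = closing bracket.
    static const std::regex element{
        R"(([\[\{\(])(\d+),\s*(\d+)([\)\}\]]))"};

    std::map<int, int> m;
    for (auto it = std::sregex_iterator(s.begin(), s.end(), element);
         it != std::sregex_iterator(); ++it) {
        const char open = it->str(1)[0];
        const char close = it->str(4)[0];
        if (!closes(open, close))
            throw std::invalid_argument("mismatched brackets in " + it->str());
        m[std::stoi(it->str(2))] = std::stoi(it->str(3));
    }
    return m;
}
```

With this check, "{{0, 153}, (2, 3)}" still yields two elements, while "{(1, 2]}" throws std::invalid_argument instead of being inserted silently.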
I am working on learning to write spirit grammars and I am trying to create a basic base 16 to base 64 converter that takes in a string representing hex, for example:
49276d206b696c
parse out 6 or fewer characters at a time (fewer if the string's length isn't a multiple of 6) and generate a base 64 encoded string from the input. One grammar I figured would probably work is something like this:
// 6 characters
(qi::char_("0-9a-fA-F") >> qi::char_("0-9a-fA-F") >>
qi::char_("0-9a-fA-F") >> qi::char_("0-9a-fA-F") >>
qi::char_("0-9a-fA-F") >> qi::char_("0-9a-fA-F")[/*action*/]) |
// or 5 characters
(qi::char_("0-9a-fA-F") >> qi::char_("0-9a-fA-F") >>
qi::char_("0-9a-fA-F") >> qi::char_("0-9a-fA-F") >>
qi::char_("0-9a-fA-F")[/*action*/]) | ...
and so on, all the way down to one character, or a different rule defined for each number of characters. But I think there must be a better way to specify the grammar. I read about Spirit's repeat and was thinking maybe I could do something like
+(boost::spirit::repeat(1, 6)[qi::char_("0-9a-fA-F")][/*action on characters*/])
However, the compiler throws an error on this because of the semantic action portion of the grammar. Is there a simpler way to specify a grammar that operates on at most 6 characters at a time?
Edit
Here is what I have done so far...
base16convertergrammar.hpp
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <string>
#include <iostream>
namespace grammar {
namespace qi = boost::spirit::qi;
void toBase64(const std::string& p_input, std::string& p_output)
{
if (p_input.length() < 6)
{
// pad length
}
// use back inserter and generator to append to end of p_output.
}
template <typename Iterator>
struct Base16Grammar : qi::grammar<Iterator, std::string()>
{
Base16Grammar() : Base16Grammar::base_type(start, "base16grammar"),
m_base64String()
{
// get six characters at a time and send them off to be encoded
// if there is less than six characters just parse what we have
start = +(boost::spirit::repeat(1, 6)[qi::char_("0-9a-fA-F")][boost::phoenix::bind(toBase64, qi::_1,
boost::phoenix::ref(m_base64String))]);
}
qi::rule<Iterator, std::string()> start;
std::string m_base64String;
};
}
And here is the usage...
base16converter.cpp
#include "base16convertergrammar.hpp"
const std::string& convertHexToBase64(const std::string& p_hexString)
{
grammar::Base16Grammar<std::string::const_iterator> g;
bool r = boost::spirit::qi::parse(p_hexString.begin(), p_hexString.end(), g);
}
int main(int argc, char** argv)
{
std::string test("49276d206b696c6c");
convertHexToBase64(test);
}
First of all, repeat()[] exposes a vector, so vector<char>, not a string.
void toBase64(const std::vector<char>& p_input, std::string& p_output)
Secondly, please don't do all that work. You don't tell us what the input means, but as long as you want to group it in sixes, I'm assuming you want them interpreted as /something/. You could e.g. use the int_parser:
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <cassert>
#include <cstdint>
#include <string>
#include <iostream>
namespace grammar {
namespace qi = boost::spirit::qi;
namespace px = boost::phoenix;
template <typename Iterator>
struct Base16Grammar : qi::grammar<Iterator, std::string()>
{
Base16Grammar() : Base16Grammar::base_type(start, "base16grammar")
{
start = +qi::int_parser<uint64_t, 16, 1, 6>() [ qi::_val += to_string(qi::_1) + "; " ];
}
private:
struct to_string_f { template <typename T> std::string operator()(T const& v) const { return std::to_string(v); } };
px::function<to_string_f> to_string;
qi::rule<Iterator, std::string()> start;
};
}
std::string convertHexToBase64(const std::string& p_hexString)
{
grammar::Base16Grammar<std::string::const_iterator> g;
std::string result;
bool r = boost::spirit::qi::parse(p_hexString.begin(), p_hexString.end(), g, result);
assert(r);
return result;
}
int main()
{
for (std::string test : {"49276d206b696c6c"})
std::cout << test << " -> " << convertHexToBase64(test) << "\n";
}
Prints
49276d206b696c6c -> 4794221; 2124649; 27756;
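The grouping that int_parser<uint64_t, 16, 1, 6> performs here can be reproduced with the standard library alone: take up to six hex digits at a time and convert each group to an integer. This is a sketch only. hexGroups is a hypothetical helper, and unlike the Spirit parser (which simply stops at the first non-hex character) it throws on invalid input.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Splits `hex` into groups of at most six digits and converts each group.
std::vector<std::uint64_t> hexGroups(const std::string& hex) {
    std::vector<std::uint64_t> groups;
    for (std::size_t pos = 0; pos < hex.size(); pos += 6) {
        const std::string chunk = hex.substr(pos, 6);  // the last chunk may be shorter
        const bool allHex = std::all_of(
            chunk.begin(), chunk.end(),
            [](unsigned char c) { return std::isxdigit(c) != 0; });
        if (!allHex) throw std::invalid_argument("not a hex digit in: " + chunk);
        groups.push_back(std::stoull(chunk, nullptr, 16));
    }
    return groups;
}
```

For "49276d206b696c6c" this gives the same three values as the output above: 4794221, 2124649 and 27756.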
Going out on a limb, you just want to transcode hex-encoded binary into base64.
Since you're already using Boost:
#include <boost/archive/iterators/base64_from_binary.hpp>
#include <boost/archive/iterators/insert_linebreaks.hpp>
#include <boost/archive/iterators/transform_width.hpp>
// for hex decoding
#include <boost/iterator/function_input_iterator.hpp>
#include <cctype>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <iostream>
#include <functional>
std::string convertHexToBase64(const std::string &hex) {
struct get_byte_f {
using result_type = uint8_t;
std::string::const_iterator hex_it;
result_type operator()() {
auto nibble = [](uint8_t ch) {
if (!std::isxdigit(ch)) throw std::runtime_error("invalid hex input");
return std::isdigit(ch) ? ch - '0' : std::tolower(ch) - 'a' + 10;
};
auto hi = nibble(*hex_it++);
auto lo = nibble(*hex_it++);
return hi << 4 | lo;
}
} get_byte{ hex.begin() };
using namespace boost::archive::iterators;
using It = boost::iterators::function_input_iterator<get_byte_f, size_t>;
typedef insert_linebreaks< // insert line breaks every 72 characters
base64_from_binary< // convert binary values to base64 characters
transform_width< // retrieve 6 bit integers from a sequence of 8 bit bytes
It, 6, 8> >,
72> B64; // compose all the above operations in to a new iterator
return { B64(It{get_byte, 0}), B64(It{get_byte, hex.size()/2}) };
}
int main() {
for (std::string test : {
"49276d206b696c6c",
"736f6d65206c656e67746879207465787420746f2073686f77207768617420776f756c642068617070656e206174206c696e6520777261700a"
})
{
std::cout << " === hex: " << test << "\n" << convertHexToBase64(test) << "\n";
}
}
Prints
=== hex: 49276d206b696c6c
SSdtIGtpbGw
=== hex: 736f6d65206c656e67746879207465787420746f2073686f77207768617420776f756c642068617070656e206174206c696e6520777261700a
c29tZSBsZW5ndGh5IHRleHQgdG8gc2hvdyB3aGF0IHdvdWxkIGhhcHBlbiBhdCBsaW5lIHdy
YXAK
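If Boost is not an option, the same transcoding can be sketched with the standard library only, working on six hex digits (three bytes, four base64 characters) per step as the question intended. nibble and hexToBase64 are hypothetical names. Unlike the output shown above, this sketch appends '=' padding and inserts no line breaks.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Value of one hex digit; throws on anything else.
unsigned nibble(char ch) {
    if (ch >= '0' && ch <= '9') return static_cast<unsigned>(ch - '0');
    if (ch >= 'a' && ch <= 'f') return static_cast<unsigned>(ch - 'a' + 10);
    if (ch >= 'A' && ch <= 'F') return static_cast<unsigned>(ch - 'A' + 10);
    throw std::runtime_error("invalid hex input");
}

// Transcodes hex text to base64, six hex digits (three bytes) per step.
std::string hexToBase64(const std::string& hex) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    if (hex.size() % 2 != 0) throw std::runtime_error("odd number of hex digits");

    std::string out;
    for (std::size_t pos = 0; pos < hex.size(); pos += 6) {
        const std::size_t digits = std::min<std::size_t>(6, hex.size() - pos);
        std::uint32_t group = 0;
        for (std::size_t i = 0; i < digits; ++i)
            group = (group << 4) | nibble(hex[pos + i]);
        group <<= 4 * (6 - digits);            // left-align a short final group
        const std::size_t bytes = digits / 2;  // 1, 2 or 3 bytes in this group
        for (std::size_t i = 0; i < 4; ++i) {
            if (i <= bytes)
                out += alphabet[(group >> (18 - 6 * i)) & 0x3F];
            else
                out += '=';                    // pad to a multiple of four characters
        }
    }
    return out;
}
```

For "49276d206b696c6c" this returns "SSdtIGtpbGw=", which is the Boost result above plus the padding character.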
I am just getting started with boost::spirit::*. I am trying to parse a 128-bit hex string into a simple C array of the corresponding size. I created a short test which does the job:
boost::spirit::qi::int_parser< boost::uint8_t, 16, 2, 2 > uint8_hex;
std::string src( "00112233445566778899aabbccddeeff" );
boost::uint8_t dst[ 16 ];
bool r;
for( std::size_t i = 0; i < 16; ++i )
{
r = boost::spirit::qi::parse( src.begin( ) + 2 * i, src.begin( ) + 2 * i + 2, uint8_hex, dst[ i ] );
}
I have the feeling that this is not the smartest way to do it :) Any ideas how to define a rule so I can avoid the loop ?
Update:
In the meantime I figured out the following code which does the job very well:
using namespace boost::spirit;
using namespace boost::phoenix;
qi::int_parser< boost::uint8_t, 16, 2, 2 > uint8_hex;
std::string src( "00112233445566778899aabbccddeeff" );
boost::uint8_t dst[ 16 ];
std::size_t i = 0;
bool r = qi::parse( src.begin( ),
src.end( ),
qi::repeat( 16 )[ uint8_hex[ ref( dst )[ ref( i )++ ] = qi::_1 ] ] );
Not literally staying with the question, if you really wanted just to parse the hexadecimal representation of a 128 bit integer, you can do so portably by using uint128_t defined in Boost Multiprecision:
qi::int_parser<uint128_t, 16, 32, 32> uint128_hex; // 128 bits = 32 hex digits
uint128_t parsed;
bool r = qi::parse(f, l, uint128_hex, parsed);
This is likely to be the quickest way, especially on platforms where 128-bit types are supported in the instruction set.
#include <boost/multiprecision/cpp_int.hpp>
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
namespace qi = boost::spirit::qi;
int main() {
using boost::multiprecision::uint128_t;
using It = std::string::const_iterator;
qi::int_parser<uint128_t, 16, 32, 32> uint128_hex; // 128 bits = 32 hex digits
std::string const src("00112233445566778899aabbccddeeff");
auto f(src.begin()), l(src.end());
uint128_t parsed;
bool r = qi::parse(f, l, uint128_hex, parsed);
if (r) std::cout << "Parse succeeded: " << std::hex << std::showbase << parsed << "\n";
else std::cout << "Parse failed at '" << std::string(f,l) << "'\n";
}
There's a sad combination of factors that makes this a painful edge case:
Boost Fusion can adapt (boost::)array<>, but it requires the parser to result in a tuple of elements, not a container.
Boost Fusion can adapt these sequences, but it needs to be configured to allow 16 elements:
#define FUSION_MAX_VECTOR_SIZE 16
Even when you do, the qi::repeat(n)[] parser directive expects the attribute to be a container type.
You might work around all this in an ugly way, but that makes everything hard to work with down the road.
I'd prefer a tiny semantic action here that assigns each result produced inside qi::repeat(n)[]:
using data_t = boost::array<uint8_t, 16>;
data_t dst {};
qi::rule<It, data_t(), qi::locals<data_t::iterator> > rule =
qi::eps [ qi::_a = phx::begin(qi::_val) ]
>> qi::repeat(16) [
uint8_hex [ *qi::_a++ = qi::_1 ]
];
This works without too much noise. The idea is to take the start iterator and write to the next element on each iteration.
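#include <boost/array.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <cstdint>  // uint8_t
#include <iostream> // std::cout
#include <string>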
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;
int main() {
using It = std::string::const_iterator;
qi::int_parser<uint8_t, 16, 2, 2> uint8_hex;
std::string const src("00112233445566778899aabbccddeeff");
auto f(src.begin()), l(src.end());
using data_t = boost::array<uint8_t, 16>;
data_t dst {};
qi::rule<It, data_t(), qi::locals<data_t::iterator> > rule =
qi::eps [ qi::_a = phx::begin(qi::_val) ]
>> qi::repeat(16) [
uint8_hex [ *qi::_a++ = qi::_1 ]
];
bool r = qi::parse(f, l, rule, dst);
if (r) {
std::cout << "Parse succeeded\n";
for(unsigned i : dst) std::cout << std::hex << std::showbase << i << " ";
std::cout << "\n";
} else {
std::cout << "Parse failed at '" << std::string(f,l) << "'\n";
}
}
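For comparison, the same "take the start iterator and write to the next element" idea can be sketched in plain C++ without Spirit. This is only an illustration under stated assumptions: hex_nibble and parse_hex128 are hypothetical helper names, not part of Boost, and the sketch accepts exactly 32 hex digits with no sign or whitespace handling.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a Boost API): value of one hex digit, or -1 if invalid.
inline int hex_nibble(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: decode exactly 32 hex digits into 16 bytes.
// Returns false on wrong length or on any non-hex character.
inline bool parse_hex128(const std::string& src,
                         std::array<std::uint8_t, 16>& dst) {
    if (src.size() != 2 * dst.size()) return false;
    auto out = dst.begin();  // plays the role of the qi::_a local
    for (std::size_t i = 0; i < src.size(); i += 2) {
        const int hi = hex_nibble(src[i]);
        const int lo = hex_nibble(src[i + 1]);
        if (hi < 0 || lo < 0) return false;
        // Same shape as the semantic action: *qi::_a++ = qi::_1
        *out++ = static_cast<std::uint8_t>(hi * 16 + lo);
    }
    return true;
}
```

Calling parse_hex128("00112233445566778899aabbccddeeff", dst) fills dst with the sixteen bytes in input order. The Spirit rule does the same thing, with the iterator held in a rule local instead of a loop variable.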
How can I track the source position of an attribute parsed by Spirit?
A simple example
template <typename Iterator>
bool trace_numbers(Iterator first, Iterator last)
{
using boost::spirit::qi::double_;
using boost::spirit::qi::phrase_parse;
using boost::spirit::ascii::space;
bool r = phrase_parse(first, last,
// Begin grammar
(
double_ % ','
)
,
// End grammar
space);
if (first != last) // fail if we did not get a full match
return false;
return r;
}
I want to track the position (line and column) of each "double_". I found line_pos_iterator but have no idea how to use it. I also found multi_pass, but I don't know whether it can be used to track positions (and if it can, how?).
After some research, I found that using spirit::lex, either alone or combined with spirit::qi, is a solution.
#include <boost/config/warning_disable.hpp>
//[wcp_includes
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/spirit/include/phoenix_statement.hpp>
#include <boost/spirit/include/phoenix_container.hpp>
//]
#include <iostream>
#include <string>
#include <vector>
namespace spiritParser
{
//[wcp_namespaces
using namespace boost::spirit;
using namespace boost::spirit::ascii;
//[wcp_token_ids
enum tokenids
{
IDANY = lex::min_token_id + 10
};
//]
//[wcp_token_definition
template <typename Lexer>
struct number_position_track_tokens : lex::lexer<Lexer>
{
number_position_track_tokens()
{
// define patterns (lexer macros) to be used during token definition
// below
this->self.add_pattern
("NUM", "[0-9]+")
;
number = "{NUM}"; // reference the pattern 'NUM' as defined above
this->self.add
(number) // no token id is needed here
(".", IDANY) // characters are usable as tokens as well
;
}
lex::token_def<std::string> number;
};
//]
template<typename Iterator>
struct numberGrammar : qi::grammar<Iterator>
{
template <typename TokenDef>
numberGrammar(TokenDef const &tok)
: numberGrammar::base_type(start)
, num(0), position(0)
{
using boost::phoenix::ref;
using boost::phoenix::push_back;
using boost::phoenix::size;
//"34, 44, 55, 66, 77, 88"
start = *( tok.number [++ref(num),
boost::phoenix::push_back(boost::phoenix::ref(numPosition), boost::phoenix::ref(position)),
ref(position) += size(_1)
]
| qi::token(IDANY) [++ref(position)]
)
;
}
std::size_t num, position;
std::vector<size_t> numPosition;
qi::rule<Iterator> start;
};
void lex_word_count_1()
{
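using token_type = lex::lexertl::token<char const*, boost::mpl::vector<std::string> >;
using lexer_type = lex::lexertl::lexer<token_type>;
using iterator_type = number_position_track_tokens<lexer_type>::iterator_type;
number_position_track_tokens<lexer_type> word_count; // Our lexer
numberGrammar<iterator_type> g (word_count); // Our parser
// the input to tokenize and parse
std::string str ("34, 44, 55, 66, 77, 88");
char const* first = str.c_str();
char const* last = &first[str.size()];
// tokenize the input and feed the resulting tokens to the grammar
bool r = lex::tokenize_and_parse(first, last, word_count, g);
if (r) {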
std::cout << "nums: " << g.num << ", size: " << g.position <<std::endl;
for(auto data : g.numPosition){
std::cout<<"position : "<<data<<std::endl;
}
}
else {
std::string rest(first, last);
std::cerr << "Parsing failed\n" << "stopped at: \""
<< rest << "\"\n";
}
}
}
This is the example from the documentation, Quickstart 3 - Counting Words Using a Parser, with some alterations. In my humble opinion, this is far from easy for such a small task. Unless the patterns are difficult to describe with std::regex, or you need more speed, or both, choosing spirit::lex to track the locations of simple patterns (like the one in my example) is overkill.
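To make the std::regex alternative concrete, here is a minimal sketch that finds the same digit runs as the NUM pattern above and reports the offset, line and column of each one. The names NumberLocation and locate_numbers are made up for this illustration, and the pattern [0-9]+ is taken from the lexer example, so it matches digit runs rather than full double_ syntax.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical record type for this sketch.
struct NumberLocation {
    std::string text;    // the matched digits
    std::size_t offset;  // 0-based offset into the input
    std::size_t line;    // 1-based line number
    std::size_t column;  // 1-based column within that line
};

// Hypothetical helper: locate every run of digits in `input`.
inline std::vector<NumberLocation> locate_numbers(const std::string& input) {
    static const std::regex num("[0-9]+");  // same pattern as NUM in the lexer
    std::vector<NumberLocation> result;
    std::size_t line = 1, line_start = 0, scanned = 0;
    for (std::sregex_iterator it(input.begin(), input.end(), num), end;
         it != end; ++it) {
        const std::size_t offset = static_cast<std::size_t>(it->position());
        // Advance the newline bookkeeping up to the start of this match.
        for (; scanned < offset; ++scanned) {
            if (input[scanned] == '\n') {
                ++line;
                line_start = scanned + 1;
            }
        }
        result.push_back({it->str(), offset, line, offset - line_start + 1});
    }
    return result;
}
```

For the input "34, 44, 55, 66, 77, 88" this yields six entries at offsets 0, 4, 8, 12, 16 and 20, all on line 1. Because matches arrive in increasing order, the newline scan never revisits a character, so the whole pass stays linear in the input length.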