Decode char UTF8 escapes in Boost Spirit

Decode char UTF8 escapes in Boost Spirit - c++

question asked: Spirit-general list
Hello all,
I'm not sure if my subject is correct, but the testcode will probably show
what I want to achieve.
I'm trying to parse things like:
'%40' to '#'
'%3C' to '<'
I have a minimal testcase below. I don't understand why
this doesn't work. It's probably me making a mistake but I don't see it.
Using:
Compiler: gcc 4.6
Boost: current trunk
I use the following compile line:
g++ -o main -L/usr/src/boost-trunk/stage/lib -I/usr/src/boost-trunk -g -Werror -Wall -std=c++0x -DBOOST_SPIRIT_USE_PHOENIX_V3 main.cpp
#include <iostream>
#include <string>
#define BOOST_SPIRIT_UNICODE
#include <boost/cstdint.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/phoenix/phoenix.hpp>
typedef boost::uint32_t uchar; // Unicode codepoint
namespace qi = boost::spirit::qi;
int main(int argc, char **argv) {
// Input
std::string input = "%3C";
std::string::const_iterator begin = input.begin();
std::string::const_iterator end = input.end();
using qi::xdigit;
using qi::_1;
using qi::_2;
using qi::_val;
qi::rule<std::string::const_iterator, uchar()> pchar =
('%' > xdigit > xdigit) [_val = (_1 << 4) + _2];
std::string result;
bool r = qi::parse(begin, end, pchar, result);
if (r && begin == end) {
std::cout << "Output: " << result << std::endl;
std::cout << "Expected: < (LESS-THAN SIGN)" << std::endl;
} else {
std::cerr << "Error" << std::endl;
return 1;
}
return 0;
}
Regards,
Matthijs Möhlmann

qi::xdigit does not do what you think it does: it returns the raw character (i.e. '0', not 0x00).
You could leverage qi::uint_parser to your advantage, making your parse much simpler as a bonus:
typedef qi::uint_parser<uchar, 16, 2, 2> xuchar;
no need to rely on phoenix (making it work on older versions of Boost)
get both characters in one go (otherwise, you might have needed to add copious casting to prevent integer sign extensions)
Here is a fixed up sample:
#include <iostream>
#include <string>
#define BOOST_SPIRIT_UNICODE
#include <boost/cstdint.hpp>
#include <boost/spirit/include/qi.hpp>
typedef boost::uint32_t uchar; // Unicode codepoint
namespace qi = boost::spirit::qi;
typedef qi::uint_parser<uchar, 16, 2, 2> xuchar;
const static xuchar xuchar_ = xuchar();
int main(int argc, char **argv) {
// Input
std::string input = "%3C";
std::string::const_iterator begin = input.begin();
std::string::const_iterator end = input.end();
qi::rule<std::string::const_iterator, uchar()> pchar = '%' > xuchar_;
uchar result;
bool r = qi::parse(begin, end, pchar, result);
if (r && begin == end) {
std::cout << "Output: " << result << std::endl;
std::cout << "Expected: < (LESS-THAN SIGN)" << std::endl;
} else {
std::cerr << "Error" << std::endl;
return 1;
}
return 0;
}
Output:
Output: 60
Expected: < (LESS-THAN SIGN)
'<' is indeed ASCII 60

Related

Spirit X3, How to fail parse on non-ascii input?

So the objective is to not tolerate characters from 80h through FFh in the input string. I was under the impression that
using ascii::char_;
would take care of this. But as you can see in the example code it will happily print Parsing succeeded.
In the following Spirit mailing list post, Joel suggested to let parse to fail on these non-ascii characters. But I'm not sure whether he proceeded in doing so.
[Spirit-general] ascii encoding assert on invalid input ...
Here my example code:
#include <iostream>
#include <boost/spirit/home/x3.hpp>
namespace client::parser
{
namespace x3 = boost::spirit::x3;
namespace ascii = boost::spirit::x3::ascii;
using ascii::char_;
using ascii::space;
using x3::lexeme;
using x3::skip;
const auto quoted_string = lexeme[char_('"') >> *(char_ - '"') >> char_('"')];
const auto entry_point = skip(space) [ quoted_string ];
}
int main()
{
for(std::string const input : { "\"naughty \x80" "bla bla bla\"" }) {
std::string output;
if (parse(input.begin(), input.end(), client::parser::entry_point, output)) {
std::cout << "Parsing succeeded\n";
std::cout << "input: " << input << "\n";
std::cout << "output: " << output << "\n";
} else {
std::cout << "Parsing failed\n";
}
}
}
How can I change the example to have Spirit to fail on this invalid input?
Furthermore, but very related, I would like to know how I should use the character parser that defines a char_set encoding. You know char_(charset) from X3 docs: Character Parsers develop branch.
The documentation is lacking so strongly to describe the basic functionality. Why can't the boost top level people force library authors to come with documentation at least on the level of cppreference.com?

Nothing bad about the docs here. It's just a library bug.
Where the code for any_char says:
template <typename Char, typename Context>
bool test(Char ch_, Context const&) const
{
return ((sizeof(Char) <= sizeof(char_type)) || encoding::ischar(ch_));
}
It should have said
template <typename Char, typename Context>
bool test(Char ch_, Context const&) const
{
return ((sizeof(Char) <= sizeof(char_type)) && encoding::ischar(ch_));
}
That makes your program behave as expected and required. That behaviour also matches the Qi behaviour:
Live On Coliru
#include <boost/spirit/include/qi.hpp>
int main() {
namespace qi = boost::spirit::qi;
char const* input = "\x80";
assert(!qi::parse(input, input+1, qi::ascii::char_));
}
Filed a bug here: https://github.com/boostorg/spirit/issues/520

You can achieve that by using print parser:
#include <iostream>
#include <boost/spirit/home/x3.hpp>
namespace client::parser
{
namespace x3 = boost::spirit::x3;
namespace ascii = boost::spirit::x3::ascii;
using ascii::char_;
using ascii::print;
using ascii::space;
using x3::lexeme;
using x3::skip;
const auto quoted_string = lexeme[char_('"') >> *(print - '"') >> char_('"')];
const auto entry_point = skip(space) [ quoted_string ];
}
int main()
{
for(std::string const input : { "\"naughty \x80\"", "\"bla bla bla\"" }) {
std::string output;
std::cout << "input: " << input << "\n";
if (parse(input.begin(), input.end(), client::parser::entry_point, output)) {
std::cout << "output: " << output << "\n";
std::cout << "Parsing succeeded\n";
} else {
std::cout << "Parsing failed\n";
}
}
}
Output:
input: "naughty �"
Parsing failed
input: "bla bla bla"
output: "bla bla bla"
Parsing succeeded
https://wandbox.org/permlink/HSoB8uqMC3WME5yI
It is a surprising fact that for some reason the check for char_ is done only when the sizeof(iterator char type) > sizeof(char):
#include <boost/spirit/home/x3.hpp>
#include <iostream>
#include <string>
#include <boost/core/demangle.hpp>
#include <typeinfo>
namespace x3 = boost::spirit::x3;
template <typename Char>
void test(Char const* str)
{
std::basic_string<Char> s = str;
std::cout << boost::core::demangle(typeid(Char).name()) << ":\t";
Char c;
auto it = s.begin();
if (x3::parse(it, s.end(), x3::ascii::char_, c) && it == s.end())
std::cout << "OK: " << int(c) << "\n";
else
std::cout << "Failed\n";
}
int main()
{
test("\x80");
test(L"\x80");
test(u8"\x80");
test(u"\x80");
test(U"\x80");
}
Output:
char: OK: -128
wchar_t: Failed
char8_t: OK: 128
char16_t: Failed
char32_t: Failed
https://wandbox.org/permlink/j9PQeRVnGZQeELFA

Boost Spirit synthesised attribute confusion

I'm trying to parse input which has either a plus or minus character, followed by an X or Y character, followed by an unsigned integer.
(char_('+') | char_('-')) >> char_("xyXY") >> uint_
According to my reading of the docs, the synthesised attribute for this would be tuple<vector<char>,unsigned int> because the alternative parser (char | char) would be of type char, the char >> char("xyXY") would be vector<char>, and the vector<char> >> uint_ would be a tuple of the types, so tuple<vector<char>,unsigned int>. This fails to compile
qi\detail\assign_to.hpp(152) : error C2440: 'static_cast' : cannot convert from 'const char' to 'boost::tuples::tuple<T0,T1>'
Code:
#include <iostream>
#include <string>
#include <vector>
#include <boost/fusion/include/tuple.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/tuple/tuple.hpp>
using namespace boost::spirit::qi;
int main()
{
std::string input("-Y 512");
typedef std::string::const_iterator Iterator;
Iterator first = input.begin();
Iterator last = input.end();
boost::tuple<std::vector<char>,unsigned int> output;
bool result = phrase_parse(first,last,(char_('+') | char_('-')) >> char_("xyXY") >> uint_,ascii::space,output);
if(result && first == last)
std::cout << "sign=" << boost::get<0>(output)[0] << ", xy=" << boost::get<0>(output)[1] << ", size=" << boost::get<1>(output) << '\n';
else
std::cerr << "Parse error\n";
}
I then tried tuple<char,char,unsigned int> as the attribute type:
#include <iostream>
#include <string>
#include <vector>
#include <boost/fusion/include/tuple.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/tuple/tuple.hpp>
using namespace boost::spirit::qi;
int main()
{
std::string input("-Y 512");
typedef std::string::const_iterator Iterator;
Iterator first = input.begin();
Iterator last = input.end();
boost::tuple<char,char,unsigned int> output;
bool result = phrase_parse(first,last,(char_('+') | char_('-')) >> char_("xyXY") >> uint_,ascii::space,output);
if(result && first == last)
std::cout << "sign=" << boost::get<0>(output) << ", xy=" << boost::get<1>(output) << ", size=" << boost::get<2>(output) << '\n';
else
std::cerr << "Parse error\n";
}
This compiles but the output is incorrect. The first token of the input is parsed correctly, but the subsequent tokens aren't:
sign=-, xy= , size=0
I also tried as_string[]:
#include <iostream>
#include <string>
#include <vector>
#include <boost/fusion/include/tuple.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/tuple/tuple.hpp>
using namespace boost::spirit::qi;
int main()
{
std::string input("-Y 512");
typedef std::string::const_iterator Iterator;
Iterator first = input.begin();
Iterator last = input.end();
boost::tuple<std::string,unsigned int> output;
bool result = phrase_parse(first,last,as_string[(char_('+') | char_('-')) >> char_("xyXY")] >> uint_,ascii::space,output);
if(result && first == last)
std::cout << "sign=" << boost::get<0>(output)[0] << ", xy=" << boost::get<0>(output)[1] << ", size=" << boost::get<1>(output) << '\n';
else
std::cerr << "Parse error\n";
}
This improved things as the x/y token got parsed, but not the third integer token:
sign=-, xy=Y, size=0
Please show me where I'm going wrong.
I'm using Spirit version 2.5.2 (from Boost 1.58.0) and Microsoft Visual Studio 2008.

Spirit library docs recommend to use Fusion tuple. I think I saw somewhere (can't find it now) that Boost tuple may not be fully compatible with Spirit library.
Here is your fixed example:
#include <iostream>
#include <string>
#include <vector>
#include <boost/fusion/include/tuple.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/sequence.hpp>
namespace qi = boost::spirit::qi;
int main()
{
std::string input("-Y 512");
typedef std::string::const_iterator Iterator;
Iterator first = input.begin();
Iterator last = input.end();
boost::fusion::tuple<char, char, unsigned int> output;
bool result = qi::phrase_parse(first, last, (qi::char_('+') | qi::char_('-')) >> qi::char_("xyXY") >> qi::uint_, qi::ascii::space, output);
if (result && first == last)
std::cout << "sign=" << boost::fusion::get<0>(output) << ", xy=" << boost::fusion::get<1>(output) << ", size=" << boost::fusion::get<2>(output) << '\n';
else
std::cerr << "Parse error\n";
return 0;
}
Output:
sign=-, xy=Y, size=512
Update: Actually I found here that it's possible to use boost::tuple but different header needs to be included: #include <boost/fusion/include/boost_tuple.hpp>.

boost spirit qi parser failed in release and pass in debug

#include <boost/spirit/include/qi.hpp>
#include <string>
#include <vector>
#include <iterator>
#include <algorithm>
#include <iostream>
using namespace boost::spirit;
int main()
{
std::string s;
std::getline(std::cin, s);
auto specialtxt = *(qi::char_('-', '.', '_'));
auto txt = no_skip[*(qi::char_("a-zA-Z0-9_.\\:$\'-"))];
auto anytxt = *(qi::char_("a-zA-Z0-9_.\\:${}[]+/()-"));
qi::rule <std::string::iterator, void(),ascii::space_type> rule2 = txt ('=') >> ('[') >> (']');
auto begin = s.begin();
auto end = s.end();
if (qi::phrase_parse(begin, end, rule2, ascii::space))
{
std::cout << "MATCH" << std::endl;
}
else
{
std::cout << "NO MATCH" << std::endl;
}
}
this code works fine in debug mode
parser fails in release mode
rule is to just parse text=[]; any thing else than this should fail it works fine in debug mode but not in release mode it shows result no match for any string.
if i enter string like
abc=[];
this passes in debug as expected but fails in release

You can't use auto with Spirit v2:
Assigning parsers to auto variables
You have Undefined Behaviour
DEMO
I tried to make (more) sense of the rest of the code. There were various instances that would never work:
txt('=') is an invalid Qi expression. I assumed you wanted txt >> ('=') instead
qi::char_("a-zA-Z0-9_.\\:$\\-{}[]+/()") doesn't do what you think because $-{ is actually the character "range" \x24-\x7b... Escape the - (or put it at the very end/start of the set like in the other char_ call).
qi::char_('-','.','_') can't work. Did you mean qi::char_("-._")?
specialtxt and anytxt were unused...
prefer const_iterator
prefer namespace aliases above using namespace to prevent hard-to-detect errors
Live On Coliru
#include <boost/spirit/include/qi.hpp>
#include <iostream>
namespace qi = boost::spirit::qi;
int main() {
std::string const s = "abc=[];";
auto specialtxt = qi::copy(*(qi::char_("-._")));
auto anytxt = qi::copy(*(qi::char_("a-zA-Z0-9_.\\:$\\-{}[]+/()")));
(void) specialtxt;
(void) anytxt;
auto txt = qi::copy(qi::no_skip[*(qi::char_("a-zA-Z0-9_.\\:$\'-"))]);
qi::rule<std::string::const_iterator, qi::space_type> rule2 = txt >> '=' >> '[' >> ']';
auto begin = s.begin();
auto end = s.end();
if (qi::phrase_parse(begin, end, rule2, qi::space)) {
std::cout << "MATCH" << std::endl;
} else {
std::cout << "NO MATCH" << std::endl;
}
if (begin != end) {
std::cout << "Trailing unparsed: '" << std::string(begin, end) << "'\n";
}
}
Printing
MATCH
Trailing unparsed: ';'

Access violation in boost::spirit::lex

I've reduced my code to the absolute minimum needed to reproduce the error (sadly that is still 60 lines, not quite Minimal, but its VCE at least).
I'm using Boost 1.56 in Visual Studio 2013 (Platform Toolset v120).
The code below gives me an Access Violation unless I uncomment the marked lines. By doing some tests, it seems boost::spirit doesn't like it if the enum starts at 0 (in my full code I have more values in the enum and I just set IntLiteral = 1 and it also got rid of the access violation error, although the names were wrong because ToString was off by one when indexing into the array).
Is this a bug in boost::spirit or did I do something wrong?
#include <iostream>
#include <string>
#include <vector>
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
namespace lex = boost::spirit::lex;
typedef lex::lexertl::token<char const*> LexToken;
typedef lex::lexertl::actor_lexer<LexToken> LexerType;
typedef boost::iterator_range<char const*> IteratorRange;
enum TokenType
{
//Unused, // <-- Uncommenting this fixes the error (1/2)
IntLiteral,
};
std::string tokenTypeNames[] = {
//"unused", // <-- Uncommenting this line fixes the error (2/2)
"int literal",
};
std::string ToString(TokenType t)
{
return tokenTypeNames[t];
}
template <typename T>
struct Lexer : boost::spirit::lex::lexer < T >
{
Lexer()
{
self.add
// Literals
("[1-9][0-9]*", TokenType::IntLiteral);
}
};
int main(int argc, char* argv[])
{
std::cout << "Boost version: " << BOOST_LIB_VERSION << std::endl;
std::string input = "33";
char const* inputIt = input.c_str();
char const* inputEnd = &input[input.size()];
Lexer<LexerType> tokens;
LexerType::iterator_type token = tokens.begin(inputIt, inputEnd);
LexerType::iterator_type end = tokens.end();
for (; token->is_valid() && token != end; ++token)
{
auto range = boost::get<IteratorRange>(token->value());
std::cout << ToString(static_cast<TokenType>(token->id())) << " (" << std::string(range.begin(), range.end()) << ')' << std::endl;
}
std::cin.get();
return 0;
}
If I uncomment the lines I get:
Boost version: 1_56
int literal (33)

The fact that it "works" if you uncomment those lines, is pure accident.
From the docs spirit/lex/tutorials/lexer_quickstart2.html:
To ensure every token gets assigned a id the Spirit.Lex library internally assigns unique numbers to the token definitions, starting with the constant defined by boost::spirit::lex::min_token_id
See also this older answer:
Spirit Lex: Which token definition generated this token?
So you can just fix it using the offset, but I guess it will keep on being a brittle solution as it is very easy to let the enum go out of synch with actual token definitions in the lexer tables.
I'd suggest using the nameof() approach as given in the linked answer, which leverages named token_def<> objects.
Live On Coliru
#include <iostream>
#include <string>
#include <vector>
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
namespace lex = boost::spirit::lex;
typedef lex::lexertl::token<char const*> LexToken;
typedef lex::lexertl::actor_lexer<LexToken> LexerType;
typedef boost::iterator_range<char const*> IteratorRange;
enum TokenType {
IntLiteral = boost::spirit::lex::min_token_id
};
std::string const& ToString(TokenType t) {
static const std::string tokenTypeNames[] = {
"int literal",
};
return tokenTypeNames[t - boost::spirit::lex::min_token_id];
}
template <typename T>
struct Lexer : boost::spirit::lex::lexer<T> {
Lexer() {
this->self.add
// Literals
("[1-9][0-9]*", TokenType::IntLiteral);
}
};
int main() {
std::cout << "Boost version: " << BOOST_LIB_VERSION << std::endl;
std::string input = "33";
char const* inputIt = input.c_str();
char const* inputEnd = &input[input.size()];
Lexer<LexerType> tokens;
LexerType::iterator_type token = tokens.begin(inputIt, inputEnd);
LexerType::iterator_type end = tokens.end();
for (; token->is_valid() && token != end; ++token)
{
auto range = boost::get<IteratorRange>(token->value());
std::cout << ToString(static_cast<TokenType>(token->id())) << " (" << std::string(range.begin(), range.end()) << ')' << std::endl;
}
}

Generating default value when none is found

I have an input vector that can have any size between empty and 3 elements. I want the generated string to always be 3 floats separated by spaces, where a default value is used if there aren't enough elements in the vector. So far I've managed to output only the contents of the vector:
#include <iostream>
#include <iterator>
#include <vector>
#include "boost/spirit/include/karma.hpp"
namespace karma = boost::spirit::karma;
namespace phx = boost::phoenix;
typedef std::back_insert_iterator<std::string> BackInsertIt;
int main( int argc, char* argv[] )
{
std::vector<float> input;
input.push_back(1.0f);
input.push_back(2.0f);
struct TestGram
: karma::grammar<BackInsertIt, std::vector<float>(), karma::space_type>
{
TestGram() : TestGram::base_type(output)
{
using namespace karma;
floatRule = double_;
output = repeat(3)[ floatRule ];
}
karma::rule<BackInsertIt, std::vector<float>(), karma::space_type> output;
karma::rule<BackInsertIt, float(), karma::space_type> floatRule;
} testGram;
std::string output;
BackInsertIt sink(output);
karma::generate_delimited( sink, testGram, karma::space, input );
std::cout << "Generated: " << output << std::endl;
std::cout << "Press enter to exit" << std::endl;
std::cin.get();
return 0;
}
I've tried modifying the float rule to something like this: floatRule = double_ | lit(0.0f), but that only gave me compilation errors. The same for a lot of other similar stuff I tried.
I really have no idea how to get this working. Some help would be great :)
EDIT: Just to make it clear. If I have a vector containing 2 elements: 1.0 and 2.0, I want to generate a string that looks like this: "1.0 2.0 0.0" (the last value should be the default value).

Not pretty, but working:
#include <iostream>
#include <iterator>
#include <vector>
#define BOOST_SPIRIT_USE_PHOENIX_V3
#include "boost/spirit/include/karma.hpp"
#include <boost/spirit/include/phoenix.hpp>
namespace karma = boost::spirit::karma;
namespace phx = boost::phoenix;
typedef std::back_insert_iterator<std::string> BackInsertIt;
int main(int argc, char* argv[]) {
std::vector<float> input;
input.push_back(1.0f);
input.push_back(2.0f);
struct TestGram: karma::grammar<BackInsertIt, std::vector<float>(),
karma::space_type> {
TestGram()
: TestGram::base_type(output) {
using namespace karma;
floatRule = double_;
output = repeat(phx::bind(&std::vector<float>::size, (karma::_val)))[floatRule]
<< repeat(3 - phx::bind(&std::vector<float>::size, (karma::_val)))[karma::lit("0.0")];
}
karma::rule<BackInsertIt, std::vector<float>(), karma::space_type> output;
karma::rule<BackInsertIt, float(), karma::space_type> floatRule;
} testGram;
std::string output;
BackInsertIt sink(output);
karma::generate_delimited(sink, testGram, karma::space, input);
std::cout << "Generated: " << output << std::endl;
return 0;
}

Big warning:
The code shown is flawed, either due to a bug, or due to abuse of karma attribute propagation (see the comment).
It invokes Undefined Behaviour (presumably) dereferencing the end() iterator on the input vector.
This should work
floatRule = double_ | "0.0";
output = -floatRule << -floatRule << -floatRule;
Note, floatRule should accept an optional<float> instead. See it Live on Coliru
Minimal example:
#include "boost/spirit/include/karma.hpp"
namespace karma = boost::spirit::karma;
using It = boost::spirit::ostream_iterator;
int main( int argc, char* argv[] )
{
const std::vector<float> input { 1.0f, 2.0f };
using namespace karma;
rule<It, boost::optional<float>()> floatRule = double_ | "0.0";
rule<It, std::vector<float>(), space_type> output = -floatRule << -floatRule << -floatRule;
std::cout << format_delimited(output, space, input);
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Decode char UTF8 escapes in Boost Spirit - c++

Related

Spirit X3, How to fail parse on non-ascii input?

Boost Spirit synthesised attribute confusion

boost spirit qi parser failed in release and pass in debug

Access violation in boost::spirit::lex

Generating default value when none is found

Categories

Resources