I'm writing a parser with error handling. I would like to output to the user the exact location of the parts of the input that couldn't be parsed.
However, the location of the error token always starts at 0, even when it is preceded by parts of the input that were parsed successfully.
Here's a heavily simplified example of what I did.
(The problematic part is probably in the parser.yy.)
Location.hh:
#pragma once
#include <string>
// The full version tracks position in bytes, line number and offset in the current line.
// Here however, I've shortened it to line number only.
struct Location
{
int beginning, ending;
operator std::string() const { return std::to_string(beginning) + '-' + std::to_string(ending); }
};
LexerClass.hh:
#pragma once
#include <istream>
#include <string>
#if ! defined(yyFlexLexerOnce)
#include <FlexLexer.h>
#endif
#include "Location.hh"
class LexerClass : public yyFlexLexer
{
int currentPosition = 0;
protected:
std::string *yylval = nullptr;
Location *yylloc = nullptr;
public:
LexerClass(std::istream &in) : yyFlexLexer(&in) {}
[[nodiscard]] int yylex(std::string *const lval, Location *const lloc);
void onNewLine() { yylloc->beginning = yylloc->ending = ++currentPosition; }
};
lexer.ll:
%{
#include "./parser.hh"
#include "./LexerClass.hh"
#undef YY_DECL
#define YY_DECL int LexerClass::yylex(std::string *const lval, Location *const lloc)
%}
%option c++ noyywrap
%option yyclass="LexerClass"
%%
%{
yylval = lval;
yylloc = lloc;
%}
[[:blank:]] ;
\n { onNewLine(); }
[0-9] { return yy::Parser::token::DIGIT; }
. { return yytext[0]; }
parser.yy:
%language "c++"
%code requires {
#include "LexerClass.hh"
#include "Location.hh"
}
%define api.parser.class {Parser}
%define api.value.type {std::string}
%define api.location.type {Location}
%parse-param {LexerClass &lexer}
%defines
%code {
template<typename RHS>
void calcLocation(Location &current, const RHS &rhs, const int n);
#define YYLLOC_DEFAULT(Cur, Rhs, N) calcLocation(Cur, Rhs, N)
#define yylex lexer.yylex
}
%token DIGIT
%%
numbers:
%empty
| numbers number ';' { std::cout << std::string(@number) << "\tnumber" << std::endl; }
| error ';' { yyerrok; std::cerr << std::string(@error) << "\terror context" << std::endl; }
;
number:
DIGIT {}
| number DIGIT {}
;
%%
#include <iostream>
template<typename RHS>
inline void calcLocation(Location &current, const RHS &rhs, const int n)
{
current = (n <= 1)
? YYRHSLOC(rhs, n)
: Location{YYRHSLOC(rhs, 1).beginning, YYRHSLOC(rhs, n).ending};
}
void yy::Parser::error(const Location &location, const std::string &message)
{
std::cout << std::string(location) << "\terror: " << message << std::endl;
}
int main()
{
LexerClass lexer(std::cin);
yy::Parser parser(lexer);
return parser();
}
For the input:
123
456
789;
123;
089
xxx
123;
765
432;
expected output:
0-2 number
3-3 number
5-5 error: syntax error
4-6 error context
7-8 number
actual output:
0-2 number
3-3 number
5-5 error: syntax error
0-6 error context
7-8 number
Here's your numbers rule, for reference (without actions, since they're not really relevant):
numbers:
%empty
| numbers number ';'
| error ';'
numbers is also your start symbol. It should be reasonably clear that there is nothing before a numbers non-terminal in any derivation. There is a top-level numbers non-terminal, which encompasses the entire input, and it starts with a numbers non-terminal which contains everything except the last number ;, and so on. All of these numbers start at the beginning.
Similarly, the error pseudotoken is at the start of some numbers derivation. So it, too, must start at the beginning of the input.
In other words, your statement that "the location of the error token always starts at 0, even when it is preceded by parts of the input that were parsed successfully" is untestable: the location of the error token always starts at 0 because there cannot be anything before it, and the output you're receiving is "expected". Or, at least, predictable; I understand that you didn't expect it, and it's an easy confusion to fall into. I didn't really see it until I ran the parser with tracing enabled, which is highly recommended; note that to do so, it's helpful to add an overload of operator<<(std::ostream&, Location const&).
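For reference, a minimal sketch of what that looks like with the C++ skeleton used here (assuming the Location and Parser from the question): enable tracing with %define parse.trace, switch it on at runtime with set_debug_level, and give Location a stream inserter that the generated traces can use.
// In parser.yy, next to the other %define lines:
//     %define parse.trace
// Somewhere the generated parser can see it (e.g. Location.hh):
#include <ostream>
inline std::ostream &operator<<(std::ostream &out, Location const &location)
{
    return out << location.beginning << '-' << location.ending;
}
// In main(), before calling the parser:
//     parser.set_debug_level(1);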
I'm building upon rici's answer, so read that one first.
Let's consider the rule:
numbers:
%empty
| numbers number ';'
| error ';' { yyerrok; }
;
This means the nonterminal numbers can be one of these three things:
It may be empty.
It may be a number preceded by any valid numbers.
It may be an error.
Do you see the problem yet?
In the third alternative, the whole numbers has to be the error, right from the beginning; no rule says that anything else is allowed before it.
Of course Bison obediently complies with your wishes and makes the error start at the very beginning of the nonterminal numbers.
It can do that because error is a jack of all trades: there are no constraints on what it may contain. To fulfill your rule, Bison extends the error over all the previously parsed numbers.
When you understand the problem, fixing it is rather easy. You just need to tell Bison that numbers are allowed before the error:
numbers:
%empty
| numbers number ';'
| numbers error ';' { yyerrok; }
;
This is IMO the best solution. There is another approach, though.
You can move the error token to the number:
numbers:
%empty
| numbers number ';' { yyerrok; }
;
number:
DIGIT
| number DIGIT
| error
;
Notice that yyerrok needs to stay in numbers because the parser would enter an infinite loop if you place it next to a rule that ends with token error.
A disadvantage of this approach is that if you place an action next to this error, it will be triggered multiple times (roughly once per illegal token).
Maybe in some situations this is preferable but generally I suggest using the first way of solving the issue.
This is a yacc program to recognize all strings of n a's followed by a b (the language aⁿb), where the value of n is read as input.
%{
#include "y.tab.h"
%}
%%
a {return A;}
b {return B;}
\n {return 0;}
. {return yytext[0];}
%%
The yacc part:
%{
#include <stdio.h>
int aCount=0,n;
%}
%token A
%token B
%%
s : X B { if (aCount<n || aCount>n)
{
YYFAIL();
}
}
X : X T | T
T : A { aCount++;}
;
%%
int main()
{ printf("Enter the value of n \n");
scanf("%d",&n);
printf("Enter the string\n");
yyparse();
printf("Valid string\n");
}
int YYFAIL()
{
printf("Invalid count of 'a'\n");
exit(0);
}
int yyerror()
{
printf("Invalid string\n");
exit(0);
}
output
invalid string
It displays "Invalid string" even for a valid string like aab with n = 2.
For every string I enter, yyerror() is called.
Please help me resolve this!
TIA
scanf("%d",&n);
reads a number from standard input.
It does not read a number and the following newline. It just reads a number. Whatever follows the number will be returned from the next operation which reads from stdin.
So when you attempt to parse, the character read by the lexer is the newline character which you typed after the number. That newline character causes the lexer to return 0 to the parser, which the parser interprets as the end of input. But the grammar doesn't allow empty inputs, so the parser reports a syntax error.
On my system, the parser reports a syntax error before it gives me the opportunity to type any input. The fact that it allows you to type an input line is a bit puzzling to me, but it might have something to do with whatever IDE you are using to run your program.
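One way to make the example behave (a sketch of a fix based on the explanation above, not the only option) is to discard the rest of that input line after reading n, so that the lexer never sees the leftover newline:
printf("Enter the value of n\n");
scanf("%d",&n);
/* consume the rest of the line (including the newline) left behind by scanf,
   so the first character the lexer reads is the start of the actual string */
int c;
while ((c = getchar()) != '\n' && c != EOF)
    ;
printf("Enter the string\n");
yyparse();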
for example :
cout << " hello\n400";
will print:
hello
400
another example:
cout << " hello\r400";
will print:
400ello
Is there an option to define my own special character?
I would like to make something like:
cout << " hello\d400";
would give:
hello
400
(\d is my special character. I already have a function, cursorDown(), that moves the stdout cursor one line down, but I just don't know how to define a special character that, each time it is written, will call my cursorDown() function.)
As others have said, there is no way to make cout understand user-defined characters. However, here is what you could do.
std::cout is an object of type std::ostream, which overloads operator<<. You could create a struct of your own which parses the string for your special characters before printing it to a file or the console through an ostream, similar to any log stream.
Example
or
Instead of calling cout << "something\dsomething",
you can call a function special_cout(std::string) which parses the string for your user-defined characters and executes the corresponding calls, roughly as sketched below.
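A rough sketch (special_cout is of course a made-up name; it assumes the cursorDown() function you already have, and since \d is not a real escape sequence the marker has to be written as the two characters "\\d" in a C++ literal):
#include <iostream>
#include <string>

void cursorDown();   // assumed: the function you already have that moves the cursor one line down

// Print the text, but treat the two-character sequence "\d" as "call cursorDown()".
void special_cout(const std::string &text)
{
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (text[i] == '\\' && i + 1 < text.size() && text[i + 1] == 'd')
        {
            std::cout.flush();   // make sure everything printed so far is on screen
            cursorDown();
            ++i;                 // skip the 'd'
        }
        else
        {
            std::cout << text[i];
        }
    }
}

// usage: special_cout(" hello\\d400");
The same loop could just as well live inside a small struct that wraps an std::ostream, which is the first suggestion above.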
There is no way to define "new" special characters.
But you can make the stream interpret specific characters to have new meanings (that you can define). You can do this using locales.
Some things to note:
The characters in the string "xyza" are just a simple way of encoding a string. Escape sequences are the C++ way of representing characters that are not visible but are well defined. Have a look at an ASCII table and you will see that all characters in the range 0 -> 31 (decimal) have special meanings (often referred to as control characters).
See Here: http://www.asciitable.com/
You can place any character into a string by using an escape sequence to specify its exact value; e.g. \x0A used in a string puts the "New Line" character in the string.
The more commonly used control characters have shorthand versions defined by the C++ language: '\n' => '\x0A'. But you cannot add new shorthand characters; they are just a convenience supplied by the language (a convention that most languages support). For example:
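A tiny illustration of the two points above (assuming an ASCII-compatible character set):
#include <iostream>
int main()
{
    // '\n' is just the shorthand for the value written explicitly as '\x0A'.
    std::cout << ('\n' == '\x0A') << '\n';      // prints 1
    // Any character can be written by value; \x63 is 'c' in ASCII.
    // The literal is split in two because 'd' is also a hex digit and
    // would otherwise be swallowed by the \x escape.
    std::cout << "ab\x63" "d" << '\n';          // prints abcd
}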
But given a character, can you give it a special meaning in an IO stream? Yes: you define a facet for a locale, then apply that locale to the stream.
Note: there is a catch with applying locales to std::cin/std::cout. If the stream has already been used (in any way), imbuing a locale may fail, and the OS may do stuff with the stream before you reach main(); thus applying a locale to std::cin/std::cout may fail (but you can do it to file and string streams easily).
So how do we do it?
Let's use "Vertical Tab" as the character whose meaning we change. I pick this because there is a shortcut for it, \v (so it's shorter to type than \x0B), and it usually has no meaning for terminals.
Let's define its meaning as: new line and indent 3 spaces.
#include <locale>
#include <algorithm>
#include <iostream>
#include <fstream>
class IndentFacet: public std::codecvt<char,char,std::mbstate_t>
{
public:
explicit IndentFacet(size_t ref = 0): std::codecvt<char,char,std::mbstate_t>(ref) {}
typedef std::codecvt_base::result result;
typedef std::codecvt<char,char,std::mbstate_t> parent;
typedef parent::intern_type intern_type;
typedef parent::extern_type extern_type;
typedef parent::state_type state_type;
protected:
virtual result do_out(state_type& tabNeeded,
const intern_type* rStart, const intern_type* rEnd, const intern_type*& rNewStart,
extern_type* wStart, extern_type* wEnd, extern_type*& wNewStart) const
{
result res = std::codecvt_base::ok;
for(;(rStart < rEnd) && (wStart < wEnd);++rStart,++wStart)
{
if (*rStart == '\v')
{
if (wEnd - wStart < 4)
{
// We do not have enough space to convert the '\v`
// So stop converting and a subsequent call should do it.
res = std::codecvt_base::partial;
break;
}
// if we find the special character add a new line and three spaces
wStart[0] = '\n';
wStart[1] = ' ';
wStart[2] = ' ';
wStart[3] = ' ';
// Note we do +1 in the for() loop
wStart += 3;
}
else
{
// Otherwise just copy the character.
*wStart = *rStart;
}
}
// Update the read and write points.
rNewStart = rStart;
wNewStart = wStart;
// return the appropriate result.
return res;
}
// Override so the do_out() virtual function is called.
virtual bool do_always_noconv() const throw()
{
return false; // Sometime we add extra tabs
}
};
Some code that uses the locale.
int main()
{
std::ios::sync_with_stdio(false);
/* Imbue std::cout before it is used */
std::cout.imbue(std::locale(std::locale::classic(), new IndentFacet()));
// Notice the use of '\v' after the first line
std::cout << "Line 1\vLine 2\nLine 3\n";
/* You must imbue a file stream before it is opened. */
std::ofstream data;
data.imbue(std::locale(std::locale::classic(), new IndentFacet()));
data.open("PLOP");
// Notice the use of '\v' after the first line
data << "Loki\vUses Locale\nTo do something silly\n";
}
The output:
> ./a.out
Line 1
Line 2
Line 3
> cat PLOP
Loki
Uses Locale
To do something silly
BUT
Now, writing all this is not really worth it. If you want a fixed indent like that, use a named variable that contains those specific characters. It makes your code slightly more verbose but does the trick.
#include <string>
#include <iostream>
std::string const newLineWithIndent = "\n ";
int main()
{
std::cout << " hello" << newLineWithIndent << "400";
}
OK, I'm doing a few experiments with Lex/Bison (Yacc), and given that my C skills are rather rusty (I once created compilers and stuff with all these tools, and now I'm lost in the first few lines... :-S), I need your help.
This is what my Parser looks like :
%{
#include <stdio.h>
#include <string.h>
void yyerror(const char *str)
{
fprintf(stderr,"error: %s\n",str);
}
int yywrap()
{
return 1;
}
main()
{
yyparse();
}
%}
%union
{
char* str;
}
%token <str> WHAT IS FIND GET SHOW WITH POSS OF NUMBER WORD
%type <str> statement
%start statements
%%
statement
: GET { printf("get\n"); }
| SHOW { printf("%s\n",$1); }
| OF { printf("of\n"); }
;
statements
: statement
| statements statement
;
The Issue :
So, basically, whenever the parser comes across a "get", it prints "get". And so on.
However, when trying to print "show" (using the $1 specifier) it gives out a segmentation fault error.
What am I doing wrong?
Lex returns a number representing the token; you need to access yytext to get the text of what was matched.
something like
statement : GET { printf("get\n"); }
| SHOW { printf("%s\n",yytext); }
| OF { printf("of\n"); }
;
To propagate the text of terminals, I associate a nonterminal with each terminal, pass back the char*, and start building the parse tree, for example. Note that I've left out the type declarations and the implementation of create_sww_ASTNode(char*,char*,char*); a possible sketch of the declarations follows after the grammar fragment below. Importantly, not all nonterminals return the same type: number returns an integer, word returns a char*, and sww returns an AST node (or whatever generic abstract syntax tree structure you come up with). Usually, beyond the nonterminals that wrap terminals, it's all AST stuff.
sww : show word word
{
$$ = create_sww_ASTNode($1,$2,$3);
}
;
word : WORD
{
$$ = malloc(strlen(yytext) + 1);
strcpy($$,yytext);
}
;
show : SHOW
{
$$ = malloc(strlen(yytext) + 1);
strcpy($$,yytext);
}
;
number : NUMBER
{
$$ = atoi(yytext);
}
;
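For completeness, the declarations left out above might look something like this; ASTNode and the create_sww_ASTNode prototype are hypothetical, shown only so the types line up:
%{
#include <string.h>                 /* strlen, strcpy */
typedef struct ASTNode ASTNode;     /* hypothetical AST node type */
ASTNode *create_sww_ASTNode(char *show, char *word1, char *word2);
%}
%union
{
    char    *str;   /* text of terminals such as SHOW and WORD */
    int      num;   /* value built from NUMBER */
    ASTNode *node;  /* AST fragments such as sww */
}
%token SHOW WORD NUMBER
%type <str>  show word
%type <num>  number
%type <node> sww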
You don't show your lexer code, but the problem is probably that you never set yylval to anything, so when you access $1 in the parser, it contains garbage and you get a crash. Your lexer actions need to set yylval.str to something so it will be valid:
"show" { yylval.str = "SHOW"; return SHOW; }
[a-z]+ { yylval.str = strdup(yytext); return WORD; }
OK, so here's the answer. (Can somebody tell me why it is that I always come up with the solution once I've already published a question here on SO? lol!)
The problem was not with the parser itself, but actually with the Lexer.
The thing is: when we write { printf("%s\n",$1); }, we actually tell it to print yylval (which is by default an int, not a string).
So, the trick is to convert the appropriate tokens into strings.
Here's my (updated) Lexer file :
%{
#include <stdio.h>
#include <string.h> /* for strdup */
#include "parser.tab.h"
void toStr();
%}
DIGIT [0-9]
LETTER [a-zA-Z]
LETTER_OR_SPACE [a-zA-Z ]
%%
find { toStr(); return FIND; }
get { toStr(); return GET; }
show { toStr(); return SHOW; }
{DIGIT}+(\.{DIGIT}+)? { toStr(); return NUMBER; }
{LETTER}+ { toStr(); return WORD; }
\n /* ignore end of line */;
[ \t]+ /* ignore whitespace */;
%%
void toStr()
{
yylval.str=strdup(yytext);
}
I am learning flex & bison and I am stuck here and cannot figure out how such a simple grammar rule does not work as I expected, below is the lexer code:
%{
#include <stdio.h>
#include "zparser.tab.h"
%}
%%
[\t\n ]+ //ignore white space
FROM|from { return FROM; }
select|SELECT { return SELECT; }
update|UPDATE { return UPDATE; }
insert|INSERT { return INSERT; }
delete|DELETE { return DELETE; }
[a-zA-Z].* { return IDENTIFIER; }
\* { return STAR; }
%%
And below is the parser code:
%{
#include<stdio.h>
#include<iostream>
#include<vector>
#include<string>
using namespace std;
extern int yyerror(const char* str);
extern int yylex();
%}
%%
%token SELECT UPDATE INSERT DELETE STAR IDENTIFIER FROM;
ZQL : SELECT STAR FROM IDENTIFIER { cout<<"Done"<<endl; return 0;}
;
%%
Can any one tell me why it shows error if I try to put "select * from something"
[a-zA-Z].* will match an alphabetic character followed by any number of arbitrary characters except newline. In other words, it will match from an alphabetic character to the end of the line.
Since flex always accepts the longest match, the line select * from ... will appear to have only one token, IDENTIFIER, and that is a syntax error.
[a-zA-Z].* { return IDENTIFIER; }
The problem is here. It allows any junk to follow an initial alpha character and be returned as IDENTIFIER, including in this case the entire rest of the line after the initial 's' of select.
It should be:
[a-zA-Z]+ { return IDENTIFIER; }
or possibly
[a-zA-Z][a-zA-Z0-9]* { return IDENTIFIER; }
or whatever else you want to allow to follow an initial alpha character in your identifiers.
Migrated from [Spirit-general] list
Good morning,
I'm trying to parse a relatively simple pattern across 4 std::strings,
extracting whatever part matches the pattern into a separate std::string.
In an abstracted sense, here is what I want:
s1=<string1><consecutive number>, s2=<consecutive number><string2>,
s3=<string1><consecutive number>, s4=<consecutive number><string2>
Less abstracted:
s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese"
Actual contents:
s1="lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu
j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx", s2="xzljlkxvc
jlkjxzvl jxcvljzx lvjlkj wre 2 cheese", s3="apple 3", s4="kxclvj
xcvjlxk jcvljxlck jxcvl 4 cheese"
How would I perform this pattern matching?
Thanks for all suggestions,
Alec Taylor
Update 2
Here is a really simple explanation I just figured out to explain the
problem I am trying to solve:
std::string s1=garbagetext1+number1+name1+garbagetext4;
std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;
Edit for context:
Feel free to add it to stackoverflow (I've been having some trouble
posting there)
I can't give you what I've done so far, because I wasn't sure if it
was within the capabilities of the boost::spirit libraries to do what
I'm trying to do
Edit: Re Update2
Here is a really simple explanation I just figured out to explain the
problem I am trying to solve:
std::string s1=garbagetext1+number1+name1+garbagetext4;
std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;
It starts looking like a job for:
Tokenizing the 'garbage text/names' - you could make a symbol table of sorts on the fly and use it to match patterns (Spirit Lex and Qi's symbol table (qi::symbols) could facilitate it, but I feel you could write that in any number of ways)
alternatively, use regular expressions, as suggested before (below, and at least twice on the mailing list).
Here's a simple idea:
(\d+) ([a-z]+).*?(\d+) \2
(\d+) match a sequence of digits in a capturing "(subexpression)" (NUM1)
([a-z]+) match a name (I just picked a simple definition of 'name')
.*? skip any length of garbage, but as little as possible before starting the subsequent match
(\d+) match another number (sequence of digits) (NUM2)
\2 followed by the same name (a backreference)
You can see how you'd already be narrowing the list of matches to inspect down to 'potential' hits. You'd only have to /post-validate/ to see that NUM2 == NUM1+2 (see the small sketch after the notes below).
Two notes:
Add (...)+ around the tail part to allow repeated matching of patterns
(\d+) ([a-z]+)(.*?(\d+) \2)+
You may wish to make the garbage skip (.*?) aware of separators (by doing negative zero-width assertions) to avoid skipping across more than 2 delimiters (e.g. s\d+=" as a delimiting pattern). I leave it out of scope for clarity now; here's the gist:
((?!s\d+=").)*? -- beware of potential performance degradation
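As a concrete, if simplified, illustration of the basic pattern above, here is a small self-contained sketch using std::regex (Boost.Regex would look nearly identical). The input is made up, shaped like the "Update 2" description: garbage, a number, a name, more garbage, then the same name with the number increased by 2.
#include <iostream>
#include <regex>
#include <string>

int main()
{
    // Made-up input in the shape of "Update 2": garbage + number + name + garbage, twice.
    std::string input = "lkjsdf qwe 1 apple zzxcv ppoiu 3 apple mnbvc";

    // The pattern discussed above: number, name, lazy skip, number, same name again.
    std::regex pattern(R"((\d+) ([a-z]+).*?(\d+) \2)");

    std::smatch m;
    if (std::regex_search(input, m, pattern))
    {
        std::cout << "name: " << m[2] << ", numbers: " << m[1] << " and " << m[3] << '\n';
        // Post-validate, as suggested above: the second number should be the first + 2.
        if (std::stoi(m[3]) == std::stoi(m[1]) + 2)
            std::cout << "sequence numbers check out\n";
    }
}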
Alec, the following is a showcase of how to do a wide range of things in Boost Spirit, in the context of answering your question.
I had to make assumptions about the required input structure; I assumed:
whitespace was strict (spaces as shown, no newlines)
the sequence numbers should be in increasing order
the sequence numbers should recur exactly in the text values
the keywords 'apple' and 'cheese' are in strict alternation
whether the keyword comes before or after the sequence number in the text value is also in strict alternation
Note There are about a dozen places in the implementation below, where significantly less complex choices could possibly have been made. For example, I could have hardcoded the whole pattern (as a de facto regex?), assuming that 4 items are always expected in the input. However I wanted to
make no more assumptions than necessary
learn from the experience. Especially the topic of qi::locals<> and inherited attributes have been on my agenda for a while.
However, the solution allows a great deal of flexibility:
the keywords aren't hardcoded, and you could e.g. easily make the parser accept both keywords at any sequence number
a comment shows how to generate a custom parsing exception when the sequence number is out of sync (not the expected number)
different spellings of the sequence numbers are currently accepted (i.e. s01="apple 001" is ok. Look at Unsigned Integer Parsers for info on how to tune that behaviour)
the output structure is either a vector<std::pair<int, std::string> > or a vector of struct:
struct Entry
{
int sequence;
std::string text;
};
both versions can be switched with the single #if 1/0 line
The sample uses Boost Spirit Qi for parsing.
Conversely, Boost Spirit Karma is used to display the result of parsing:
format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed)
The output for the actual contents given in the post is:
parsed: s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese"
On to the code.
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
namespace qi = boost::spirit::qi;
namespace karma = boost::spirit::karma;
namespace phx = boost::phoenix;
#if 1 // using fusion adapted struct
#include <boost/fusion/adapted/struct.hpp>
struct Entry
{
int sequence;
std::string text;
};
BOOST_FUSION_ADAPT_STRUCT(Entry, (int, sequence)(std::string, text));
#else // using boring std::pair
#include <boost/fusion/adapted/std_pair.hpp> // for karma output generation
typedef std::pair<int, std::string> Entry;
#endif
int main()
{
std::string input =
"s1=\"lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu"
"j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx\", s2=\"xzljlkxvc"
"jlkjxzvl jxcvljzx lvjlkj wre 2 cheese\", s3=\"apple 3\", s4=\"kxclvj"
"xcvjlxk jcvljxlck jxcvl 4 cheese\"";
using namespace qi;
typedef std::string::const_iterator It;
It f(input.begin()), l(input.end());
int next = 1;
qi::rule<It, std::string(int)> label;
qi::rule<It, std::string(int)> value;
qi::rule<It, int()> number;
qi::rule<It, Entry(), qi::locals<int> > assign;
label %= qi::raw [
( eps(qi::_r1 % 2) >> qi::string("apple ") > qi::uint_(qi::_r1) )
| qi::uint_(qi::_r1) > qi::string(" cheese")
];
value %= '"'
>> qi::omit[ *(~qi::char_('"') - label(qi::_r1)) ]
>> label(qi::_r1)
>> qi::omit[ *(~qi::char_('"')) ]
>> '"';
number %= qi::uint_(phx::ref(next)++) /*| eps [ phx::throw_(std::runtime_error("Sequence number out of sync")) ] */;
assign %= 's' > number[ qi::_a = _1 ] > '=' > value(qi::_a);
std::vector<Entry> parsed;
bool ok = false;
try
{
ok = parse(f, l, assign % ", ", parsed);
if (ok)
{
using namespace karma;
std::cout << "parsed:\t" << format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed) << std::endl;
}
} catch(qi::expectation_failure<It>& e)
{
std::cerr << "Expectation failed: " << e.what() << " '" << std::string(e.first, e.last) << "'" << std::endl;
} catch(const std::exception& e)
{
std::cerr << e.what() << std::endl;
}
if (!ok || (f!=l))
std::cerr << "problem at: '" << std::string(f,l) << "'" << std::endl;
}
Provided you can use a C++11 compiler, parsing these patterns is pretty simple using AXE†:
#include <axe.h>
#include <string>
template<class I>
void num_value(I i1, I i2)
{
unsigned n;
unsigned next = 1;
// rule to match unsigned decimal number and compare it with another number
auto num = axe::r_udecimal(n) & axe::r_bool([&](...){ return n == next; });
// rule to match a single word
auto word = axe::r_alphastr();
// rule to match space characters
auto space = axe::r_any(" \t\n");
// semantic action - print to cout and increment next
auto e_cout = axe::e_ref([&](I i1, I i2)
{
std::cout << std::string(i1, i2) << '\n';
++next;
});
// there are only two patterns in this example
auto pattern1 = (word & +space & num) >> e_cout;
auto pattern2 = (num & +space & word) >> e_cout;
auto s1 = axe::r_find(pattern1);
auto s2 = axe::r_find(pattern2);
auto text = s1 & s2 & s1 & s2 & axe::r_end();
text(i1, i2);
}
To parse the text, simply call num_value(text.begin(), text.end()). No changes are required to parse Unicode strings.
† I didn't test it.
Look into Boost.Regex. I've seen an almost-identical posting on the boost-users list, and the solution is to use regexes for some of the matching work.