Spirit Grammar For Path Verificiation - c++

I am trying to write a simple grammar using boost spirit to validate that a string is a valid directory. I am using these tutorials since this is the first grammar I have attempted:
http://www.boost.org/doc/libs/1_36_0/libs/spirit/doc/html/spirit/qi_and_karma.html
http://www.boost.org/doc/libs/1_48_0/libs/spirit/doc/html/spirit/qi/reference/directive/lexeme.html
http://www.boost.org/doc/libs/1_44_0/libs/spirit/doc/html/spirit/qi/tutorials/employee___parsing_into_structs.html
Currently, what I have come up with is:
// I want these to be valid matches
std::string valid1 = "./";
// This string could be any number of sub dirs i.e. /home/user/test/ is valid
std::string valid2 = "/home/user/";
using namespace boost::spirit::qi;
bool match = phrase_parse(valid1.begin(), valid1.end(), lexeme[
((char_('.') | char_('/')) >> +char_ >> char_('/')],
ascii::space);
if (match)
{
std::cout << "Match!" << std::endl;
}
However, this matches nothing. I had a few ideas as to why; however, after doing some research I haven't found the answers. For example I assume the +char_ will probably consume all chars? So how can I find out if some sequence of characters all end with /?
Essentially my thoughts behind writing the above code was I want directories starting with . and / to be valid and then the last character has to be a /. Could someone help me with my grammar or point me to something more similar example to what I want to do? This is purely an excise to learn how to use spirit.
Edit
So I have got the parser to match using:
bool match = phrase_parse(valid1.begin(), valid1.end(), lexeme[
((char_('.') | char_('/')) >> *(+char_ >> char_('/'))],
ascii::space);
if (match)
{
std::cout << "Match!" << std::endl;
}
Not sure if that is proper or not? Or if it is matching for other reasons... Also should the ascii::space be used here? I read in a tutorial that it was to make spaces agnostic i.e. a b is equivalent to ab. Which I wouldn't want in a path name? If it isn't the correct thing to use what would be?
SSCCE:
#include <string>
#include <iostream>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/qi_char.hpp>
#include <boost/spirit/include/qi_eoi.hpp>
int main()
{
namespace qi = boost::spirit::qi;
std::string valid1 = "./";
std::string valid2 = "/home/blah/";
bool match = qi::parse(valid2.begin(), valid2.end(), &((qi::lit("./")|'/') >> (+~qi::char_('/') % '/') >> qi::eoi));
if (match)
{
std::cout << "Match" << std::endl;
}
}

If you don't want to ignore space differences (which you shouldn't), use parse instead of phrase_parse. The use of lexeme inhibits the skipper again (so you were just stripping leading/trailing space). See also stackoverflow.com/questions/17072987/boost-spirit-skipper-issues/17073965#17073965
Use char_("ab") instead of char_('a')|char_('b').
*char_ matches everything. You may have meant *~char_('/').
I'd suggest something like
bool ok = qi::parse(b, f, &(lit("./")|'/') >> (*~char_('/') % '/'));
This won't expose the matched input. Add raw[] around it to achieve that .
Add > qi::eoi to assert all of the input was consumed.

Related

In C++ STL, how do I remove non-numeric characters from std::string with regex_replace?

Using the C++ Standard Template Library function regex_replace(), how do I remove non-numeric characters from a std::string and return a std::string?
This question is not a duplicate of question
747735
because that question requests how to use TR1/regex, and I'm
requesting how to use standard STL regex, and because the answer given
is merely some very complex documentation links. The C++ regex
documentation is extremely hard to understand and poorly documented,
in my opinion, so even if a question pointed out the standard C++
regex_replace
documentation,
it still wouldn't be very useful to new coders.
// assume #include <regex> and <string>
std::string sInput = R"(AA #-0233 338982-FFB /ADR1 2)";
std::string sOutput = std::regex_replace(sInput, std::regex(R"([\D])"), "");
// sOutput now contains only numbers
Note that the R"..." part means raw string literal and does not evaluate escape codes like a C or C++ string would. This is very important when doing regular expressions and makes your life easier.
Here's a handy list of single-character regular expression raw literal strings for your std::regex() to use for replacement scenarios:
R"([^A-Za-z0-9])" or R"([^A-Za-z\d])" = select non-alphabetic and non-numeric
R"([A-Za-z0-9])" or R"([A-Za-z\d])" = select alphanumeric
R"([0-9])" or R"([\d])" = select numeric
R"([^0-9])" or R"([^\d])" or R"([\D])" = select non-numeric
Regular expressions are overkill here.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
inline bool not_digit(char ch) {
return '0' <= ch && ch <= '9';
}
std::string remove_non_digits(const std::string& input) {
std::string result;
std::copy_if(input.begin(), input.end(),
std::back_inserter(result),
not_digit);
return result;
}
int main() {
std::string input = "1a2b3c";
std::string result = remove_non_digits(input);
std::cout << "Original: " << input << '\n';
std::cout << "Filtered: " << result << '\n';
return 0;
}
The accepted answer if fine for the specifics of the given sample.
But it will fail for a number such as "-12.34" (it would result in "1234").
(note how the sample could be negative numbers)
Then the regex should be:
(-|\+)?(\d)+(.(\d)+)*
explanation: (optional ( "-" or "+" )) with (a number, repeated 1 to n times) with (optionally end's with: ( a "." followed by (a number, repeated 1 to n times) )
A bit over-reaching, but I was looking for this and the page showed up first in my search, so I'm adding my answer for future searches.

Spirit Qi Parsing multilines from Console

I am using the Console as an input source.
I looked for a way qi would parse a line and then it'd wait for the next line and continue parsing from that point.
for example take the following grammar
start = MultiLine | Line;
Multiline = "{" *(Line) "}";
Line = *(char_("A-Za-z0-9"));
As an input
{ AAAA
Bbbb
lllll
lalala
}
Taking the whole thing from a file is easy.
But what should i do if i need to have the input to come from the console instead?
That i want to have it process what its given already and wait for the rest of the lines.
The most "highlevel" idiomatic way to do this in Qi is qi::match:
Live On Coliru
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/qi_match.hpp>
using Line = std::string;
using MultiLine = std::vector<Line>;
int main() {
std::cin.unsetf(std::ios::skipws);
MultiLine block;
using namespace boost::spirit::qi;
while (std::cin >> phrase_match('{' >> lexeme[*~char_("\r\n}")] % eol >> '}', blank, block))
{
std::cout << "Processing a block of " << block.size() << " lines\n";
block.clear();
}
}
Prints:
Processing a block of 4 lines
Processing a block of 7 lines
Where the second line appears after one second delay (due to the sleep 1 used)
As Chris Beck was hinting, this uses boost::spirit::istream_iterator under the hood. You can also use that explicitly, e.g. if you want to use recursively nested rules.
You can just use an istream iterator which you construct from std::cin for this. If your grammar is templated against typename iterator like Hartmut does in all the examples, then it should work fine.
It would be easy to illustrate if you provide an SSCCE. The istream iterator is only a few lines of code to create.

Using Boost.Spirit to extract certain tags/attributes from HTML

So I've been learning a bit about Boost.Spirit to replace the use of regular expressions in a lot of my code. The main reason is pure speed. I've found Boost.Spirit to be up to 50 times faster than PCRE for some relatively simple tasks.
One thing that is a big bottleneck in one of my apps is taking some HTML, finding all "img" tags, and extracting the "src" attribute.
This is my current regex:
(?i:<img\s[^\>]*src\s*=\s*[""']([^<][^""']+)[^\>]*\s*/*>)
I've been playing around with it trying to get something to work in Spirit, but so far I've come up empty. Any tips on how to create a set of Spirit rules that will accomplish the same thing as this regex would be awesome.
And of course, the Boost Spirit variant couldn't be missed:
sehe#natty:/tmp$ time ./spirit < bench > /dev/null
real 0m3.895s
user 0m3.820s
sys 0m0.070s
To be honest the Spirit code is slightly more versatile than the other variations:
it actually parses attributes a bit smarter, so it would be easy to handle a variety of attributes at the same time, perhaps depending on the containing element
the Spirit parser would be easier to adapt to cross-line matching. This could be most easily achieved
using spirit::istream_iterator<> (which is unfortunately notoriously slow)
using a memory-mapped file with raw const char* as iterators; The latter approach works equally well for the other techniques
The code is as follows: (full code at https://gist.github.com/c16725584493b021ba5b)
//#define BOOST_SPIRIT_DEBUG
#include <string>
#include <iostream>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;
void handle_attr(
const std::string& elem,
const std::string& attr,
const std::string& value)
{
if (elem == "img" && attr == "src")
std::cout << "value : " << value << std::endl;
}
typedef std::string::const_iterator It;
typedef qi::space_type Skipper;
struct grammar : qi::grammar<It, Skipper>
{
grammar() : grammar::base_type(html)
{
using namespace boost::spirit::qi;
using phx::bind;
attr = as_string [ +~char_("= \t\r\n/>") ] [ _a = _1 ]
>> '=' >> (
as_string [ '"' >> lexeme [ *~char_('"') ] >> '"' ]
| as_string [ "'" >> lexeme [ *~char_("'") ] >> "'" ]
) [ bind(handle_attr, _r1, _a, _1) ]
;
elem = lit('<')
>> as_string [ lexeme [ ~char_("-/>") >> *(char_ - space - char_("/>")) ] ] [ _a = _1 ]
>> *attr(_a);
html = (-elem) % +("</" | (char_ - '<'));
BOOST_SPIRIT_DEBUG_NODE(html);
BOOST_SPIRIT_DEBUG_NODE(elem);
BOOST_SPIRIT_DEBUG_NODE(attr);
}
qi::rule<It, Skipper> html;
qi::rule<It, Skipper, qi::locals<std::string> > elem;
qi::rule<It, qi::unused_type(std::string), Skipper, qi::locals<std::string> > attr;
};
int main(int argc, const char *argv[])
{
std::string s;
const static grammar html_;
while (std::getline(std::cin, s))
{
It f = s.begin(),
l = s.end();
if (!phrase_parse(f, l, html_, qi::space) || (f!=l))
std::cerr << "unparsed: " << std::string(f,l) << std::endl;
}
return 0;
}
Update
I did benchmarks.
Full disclosure is here: https://gist.github.com/c16725584493b021ba5b
It includes the full code used, the compilation flags and the body of test data (file bench) used.
In short
Regular expressions are indeed faster and way simpler here
Do not underestimate the time I spent debugging the Spirit grammar to get it correct!
Care has been taken to eliminate 'accidental' differences (by e.g.
keeping handle_attribute unchanged across the implementations, even though it makes sense mostly only for the Spirit implementation).
using the same line-wise input style and string iterators for both
Right now, all three implementations result in the exact same output
Everything built/timed on g++ 4.6.1 (c++03 mode), -O3
Edit in reply to the knee-jerk (and correct) response that you shouldn't be parsing HTML using Regexes:
You shouldn't be using regexen to parse non-trivial inputs (mainly, anything with a grammar. Of course Perl 5.10+ 'regex grammars' are an exception, because they are not isolated regexes anymore
HTML basically cannot be parsed, it is non-standard tag soup. Strict (X)HTML, are a different matter
According to Xaade, if you haven't got enough time to produce a perfect implementation using a standards compliant HTML reader, you should
"ask client if they want shit or not. If they want shit, you charge them more. Shit costs you more than them." -- Xaade
That said there are scenarios in which I'd do precisely what I suggest here: use a regex. Mainly, if it is to do a one-off quick search or to get daily, rough statistics of known data etc. YMMV and you should make your own call.
For timings and summaries, see:
Boost Regex answer below
Boost Xpressive answer here
Spirit answer here
I heartily suggest using a regex here:
typedef std::string::const_iterator It;
int main(int argc, const char *argv[])
{
const boost::regex re("<img\\s+[^\\>]*?src\\s*=\\s*([\"'])(.*?)\\1");
std::string s;
boost::smatch what;
while (std::getline(std::cin, s))
{
It f = s.begin(), l = s.end();
do
{
if (!boost::regex_search(f, l, what, re))
break;
handle_attr("img", "src", what[2]);
f = what[0].second;
} while (f!=s.end());
}
return 0;
}
Use it like:
./test < index.htm
I cannot see any reason why the spirit based approach should/could be any faster?
Edit PS. Iff you claim that static optimization would be the key, why not just convert it into a Boost Expressive, static, regular expression?
Out of curiosity I redid my regex sample based on Boost Xpressive, using statically compiled regexes:
sehe#natty:/tmp$ time ./expressive < bench > /dev/null
real 0m2.146s
user 0m2.110s
sys 0m0.030s
Interestingly, there is no discernable speed difference when using the dynamic regular expression; however, on the whole the Xpressive version performs better than the Boost Regex version (by roughly 10%)
What is really nice, IMO, is that it was really almost matter of including the xpressive.hpp and changing a few namespaces around to change from Boost Regex to Xpressive. The API interface (as far as it was being used) is exactly the same.
The relevant code is as follows: (full code at https://gist.github.com/c16725584493b021ba5b)
typedef std::string::const_iterator It;
int main(int argc, const char *argv[])
{
using namespace boost::xpressive;
#if DYNAMIC
const sregex re = sregex::compile
("<img\\s+[^\\>]*?src\\s*=\\s*([\"'])(.*?)\\1");
#else
const sregex re = "<img" >> +_s >> -*(~(set = '\\','>')) >>
"src" >> *_s >> '=' >> *_s
>> (s1 = as_xpr('"') | '\'') >> (s2 = -*_) >> s1;
#endif
std::string s;
smatch what;
while (std::getline(std::cin, s))
{
It f = s.begin(), l = s.end();
do
{
if (!regex_search(f, l, what, re))
break;
handle_attr("img", "src", what[2]);
f = what[0].second;
} while (f!=s.end());
}
return 0;
}

How would I perform this text pattern matching

Migrated from [Spirit-general] list
Good morning,
I'm trying to parse a relatively simple pattern across 4 std::strings,
extracting whatever the part which matches the pattern into a separate
std::string.
In an abstracted sense, here is what I want:
s1=<string1><consecutive number>, s2=<consecutive number><string2>,
s3=<string1><consecutive number>, s4=<consecutive number><string2>
Less abstracted:
s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese"
Actual contents:
s1="lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu
j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx", s2="xzljlkxvc
jlkjxzvl jxcvljzx lvjlkj wre 2 cheese", s3="apple 3", s4="kxclvj
xcvjlxk jcvljxlck jxcvl 4 cheese"
How would I perform this pattern matching?
Thanks for all suggestions,
Alec Taylor
Update 2
Here is a really simple explanation I just figured out to explain the
problem I am trying to solve:
std::string s1=garbagetext1+number1+name1+garbagetext4;
std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;
Edit for context:
Feel free to add it to stackoverflow (I've been having some trouble
posting there)
I can't give you what I've done so far, because I wasn't sure if it
was within the capabilities of the boost::spirit libraries to do what
I'm trying to do
Edit: Re Update2
Here is a really simple explanation I just figured out to explain the
problem I am trying to solve:
std::string s1=garbagetext1+number1+name1+garbagetext4;
std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;
It starts looking like a job for:
Tokenizing the 'garbage text/names' - you could make a symbol table of sorts on the fly and use it to match patterns (spirit Lex and Qi's symbol table (qi::symbol) could facilitate it, but I feel you could write that in any number of ways)
conversely, use regular expressions, as suggested before (below, and at least twice in mailing list).
Here's a simple idea:
(\d+) ([a-z]+).*?(\d+) \2
\d+ match a sequence of digits in a "(subexpression)" (NUM1)
([a-z]+) match a name (just picked a simple definition of 'name')
.*? skip any length of garbage, but as little as possible before starting subsequent match
\d+ match another number (sequence of digits) (NUM2)
\2 followed by the same name (backreference)
You can see how you'd already be narrowing the list of matches to inspect down to 'potential' hits. You'd only have to /post-validate/ to see that NUM2 == NUM1+2
Two notes:
Add (...)+ around the tail part to allow repeated matching of patterns
(\d+) ([a-z]+)(.*?(\d+) \2)+
You may wish to make the garbage skip (.*?) aware of separators (by doing negative zerowidth assertions) to avoid more than 2 skipping delimiters (e.g. s\d+=" as a delimiting pattern). I leave it out of scope for clarity now, here's the gist:
((?!s\d+=").)*? -- beware of potential performance degradation
Alec, The following is a show-case of how to do a wide range of things in Boost Spirit, in the context of answering your question.
I had to make assumptions about what is required input structure; I assumed
whitespace was strict (spaces as shown, no newlines)
the sequence numbers should be in increasing order
the sequence numbers should recur exactly in the text values
the keywords 'apple' and 'cheese' are in strict alternation
whether the keyword comes before or after the the sequence number in the text value, is also in strict alternation
Note There are about a dozen places in the implementation below, where significantly less complex choices could possibly have been made. For example, I could have hardcoded the whole pattern (as a de facto regex?), assuming that 4 items are always expected in the input. However I wanted to
make no more assumptions than necessary
learn from the experience. Especially the topic of qi::locals<> and inherited attributes have been on my agenda for a while.
However, the solution allows a great deal of flexibility:
the keywords aren't hardcoded, and you could e.g. easily make the parser accept both keywords at any sequence number
a comment shows how to generate a custom parsing exception when the sequence number is out of sync (not the expected number)
different spellings of the sequence numbers are currently accepted (i.e. s01="apple 001" is ok. Look at Unsigned Integer Parsers for info on how to tune that behaviour)
the output structure is either a vector<std::pair<int, std::string> > or a vector of struct:
struct Entry
{
int sequence;
std::string text;
};
both versions can be switched with the single #if 1/0 line
The sample uses Boost Spirit Qi for parsing.
Conversely, Boost Spirit Karma is used to display the result of parsing:
format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed)
The output for the actual contents given in the post is:
parsed: s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese"
On to the code.
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
namespace qi = boost::spirit::qi;
namespace karma = boost::spirit::karma;
namespace phx = boost::phoenix;
#if 1 // using fusion adapted struct
#include <boost/fusion/adapted/struct.hpp>
struct Entry
{
int sequence;
std::string text;
};
BOOST_FUSION_ADAPT_STRUCT(Entry, (int, sequence)(std::string, text));
#else // using boring std::pair
#include <boost/fusion/adapted/std_pair.hpp> // for karma output generation
typedef std::pair<int, std::string> Entry;
#endif
int main()
{
std::string input =
"s1=\"lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu"
"j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx\", s2=\"xzljlkxvc"
"jlkjxzvl jxcvljzx lvjlkj wre 2 cheese\", s3=\"apple 3\", s4=\"kxclvj"
"xcvjlxk jcvljxlck jxcvl 4 cheese\"";
using namespace qi;
typedef std::string::const_iterator It;
It f(input.begin()), l(input.end());
int next = 1;
qi::rule<It, std::string(int)> label;
qi::rule<It, std::string(int)> value;
qi::rule<It, int()> number;
qi::rule<It, Entry(), qi::locals<int> > assign;
label %= qi::raw [
( eps(qi::_r1 % 2) >> qi::string("apple ") > qi::uint_(qi::_r1) )
| qi::uint_(qi::_r1) > qi::string(" cheese")
];
value %= '"'
>> qi::omit[ *(~qi::char_('"') - label(qi::_r1)) ]
>> label(qi::_r1)
>> qi::omit[ *(~qi::char_('"')) ]
>> '"';
number %= qi::uint_(phx::ref(next)++) /*| eps [ phx::throw_(std::runtime_error("Sequence number out of sync")) ] */;
assign %= 's' > number[ qi::_a = _1 ] > '=' > value(qi::_a);
std::vector<Entry> parsed;
bool ok = false;
try
{
ok = parse(f, l, assign % ", ", parsed);
if (ok)
{
using namespace karma;
std::cout << "parsed:\t" << format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed) << std::endl;
}
} catch(qi::expectation_failure<It>& e)
{
std::cerr << "Expectation failed: " << e.what() << " '" << std::string(e.first, e.last) << "'" << std::endl;
} catch(const std::exception& e)
{
std::cerr << e.what() << std::endl;
}
if (!ok || (f!=l))
std::cerr << "problem at: '" << std::string(f,l) << "'" << std::endl;
}
Provided you can use c++11 compiler, parsing these patterns is pretty simple using AXE†:
#include <axe.h>
#include <string>
template<class I>
void num_value(I i1, I i2)
{
unsigned n;
unsigned next = 1;
// rule to match unsigned decimal number and compare it with another number
auto num = axe::r_udecimal(n) & axe::r_bool([&](...){ return n == next; });
// rule to match a single word
auto word = axe::r_alphastr();
// rule to match space characters
auto space = axe::r_any(" \t\n");
// semantic action - print to cout and increment next
auto e_cout = axe::e_ref([&](I i1, I i2)
{
std::cout << std::string(i1, i2) << '\n';
++next;
});
// there are only two patterns in this example
auto pattern1 = (word & +space & num) >> e_cout;
auto pattern2 = (num & +space & word) >> e_cout;
auto s1 = axe::r_find(pattern1);
auto s2 = axe::r_find(pattern2);
auto text = s1 & s2 & s1 & s2 & axe::r_end();
text(i1, i2);
}
To parse the text simply call num_value(text.begin(), text.end()); No changes required to parse unicode strings.
† I didn't test it.
Look into Boost.Regex. I've seen an almost-identical poosting in boost-users and the solution is to use regexes for some of the match work.

Getting sub-match_results with boost::regex

Hey, let's say I have this regex: (test[0-9])+
And that I match it against: test1test2test3test0
const bool ret = boost::regex_search(input, what, r);
for (size_t i = 0; i < what.size(); ++i)
cout << i << ':' << string(what[i]) << "\n";
Now, what[1] will be test0 (the last occurrence). Let's say that I need to get test1, 2 and 3 as well: what should I do?
Note: the real regex is extremely more complex and has to remain one overall match, so changing the example regex to (test[0-9]) won't work.
I think Dot Net has the ability to make single capture group Collections so that (grp)+ will create a collection object on group1. The boost engine's regex_search() is going to be just like any ordinary match function. You sit in a while() loop matching the pattern where the last match left off. The form you used does not use a bid-itterator, so the function won't start the next match where the last match left off.
You can use the itterator form:
(Edit - you can also use the token iterator, defining what groups to iterate over. Added in the code below).
#include <boost/regex.hpp>
#include <string>
#include <iostream>
using namespace std;
using namespace boost;
int main()
{
string input = "test1 ,, test2,, test3,, test0,,";
boost::regex r("(test[0-9])(?:$|[ ,]+)");
boost::smatch what;
std::string::const_iterator start = input.begin();
std::string::const_iterator end = input.end();
while (boost::regex_search(start, end, what, r))
{
string stest(what[1].first, what[1].second);
cout << stest << endl;
// Update the beginning of the range to the character
// following the whole match
start = what[0].second;
}
// Alternate method using token iterator
const int subs[] = {1}; // we just want to see group 1
boost::sregex_token_iterator i(input.begin(), input.end(), r, subs);
boost::sregex_token_iterator j;
while(i != j)
{
cout << *i++ << endl;
}
return 0;
}
Output:
test1
test2
test3
test0
Boost.Regex offers experimental support for exactly this feature (called repeated captures); however, since it's huge performance hit, this feature is disabled by default.
To enable repeated captures, you need to rebuild Boost.Regex and define macro BOOST_REGEX_MATCH_EXTRA in all translation units; the best way to do this is to uncomment this define in boost/regex/user.hpp (see the reference, it's at the very bottom of the page).
Once compiled with this define, you can use this feature by calling/using regex_search, regex_match and regex_iterator with match_extra flag.
Check reference to Boost.Regex for more info.
Seems to me like you need to create a regex_iterator, using the (test[0-9]) regex as input. Then you can use the resulting regex_iterator to enumerate the matching substrings of your original target.
If you still need "one overall match" then perhaps that work has to be decoupled from the task of finding matching substrings. Can you clarify that part of your requirement?