Boost regex to capture all repeated patterns - c++

I have a simple text file with the following contents:
VALUE "foo"
ANOTHERVALUE "bar"
YETANOTHERVALUE "barbar"
The values in column 1 are known.
I want to capture everything in column 1 and in column 2.
My solution involves manually writing all the possible values of column 1 (which is known), into the regex string but obviously this is not ideal practice since I am basically repeating code and this does not allow the ordering to be flexible:
const char* re =
"^[[:space:]]*"
"(VALUE)[[:space:]]*\"(.*)\"[[:space:]]*"
"(ANOTHERVALUE)[[:space:]]*\"(.*)\"[[:space:]]*"
"(YETANOTHERVALUE)[[:space:]]*\"(.*)\"[[:space:]]*";

I'm citing commenter Igor Tandetnik here, because he almost gave the complete answer in his comment:
Regular expressions capture exactly as many substrings as there are
left parentheses in the expression [...]
The right way to solve this problem is to write a regex that matches a
single pair, [...]
\s*([a-zA-Z]+)\s*"(.*?)"
Notes:
\s is equivalent to [[:space:]]
.*? is used to stop searching after the 2nd " instead of the last " in the string
and apply it repeatedly, e.g. via std::regex_iterator
The boost equivalent is boost::regex_iterator.
#include <iostream>
#include <string>
#include <algorithm>
#include <boost/regex.hpp>
const boost::regex expr{ R"__(\s*([a-zA-Z]+)\s*"(.*?)")__" };
const std::string s =
R"(VALUE "foo"
ANOTHERVALUE "bar"
YETANOTHERVALUE "barbar"
)";
int main() {
boost::sregex_iterator it{ begin(s), end(s), expr }, itEnd;
std::for_each( it, itEnd, []( const boost::smatch& m ){
std::cout << m[1] << '\n' << m[2] << std::endl;
});
}
Live demo.
Notes:
I'm using raw string literals to make the code cleaner.

I would use a little Spirit Parser here:
Reading Into A Map
Live On Coliru
#include <boost/fusion/adapted/std_pair.hpp> // reading maps
#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <iostream>
#include <fstream>
#include <map>
auto read_config_map(std::istream& stream) {
std::map<std::string, std::string> settings;
boost::spirit::istream_iterator f(stream >> std::noskipws), l;
using namespace boost::spirit::x3;
auto key_ = lexeme [ +upper ];
auto value_ = lexeme [ '"' >> *~char_('"') >> '"' ];
if (!phrase_parse(f, l, -(key_ >> value_) % eol >> eoi, blank, settings))
throw std::invalid_argument("cannot parse config map");
return settings;
}
auto read_config_map(std::string const& fname) {
std::ifstream stream(fname);
return read_config_map(stream);
}
int main() {
for (auto&& entry : read_config_map(std::cin))
std::cout << "Key:'" << entry.first << "' Value:'" << entry.second << "'\n";
}
Prints:
Key:'ANOTHERVALUE' Value:'bar'
Key:'VALUE' Value:'foo'
Key:'YETANOTHERVALUE' Value:'barbar'

Related

Boost Spirit X3: skip parser that would do nothing

I'm getting myself familiarized with boost spirit v3. The question I want to ask is how to state the fact that you don't want to use skip parser in any way.
Consider a simple example of parsing comma-separated sequence of integers:
#include <iostream>
#include <string>
#include <vector>
#include <boost/spirit/home/x3.hpp>
int main()
{
using namespace boost::spirit::x3;
const std::string input{"2,4,5"};
const auto parser = int_ % ',';
std::vector<int> numbers;
auto start = input.cbegin();
auto r = phrase_parse(start, input.end(), parser, space, numbers);
if(r && start == input.cend())
{
// success
for(const auto &item: numbers)
std::cout << item << std::endl;
return 0;
}
std::cerr << "Input was not parsed successfully" << std::endl;
return 1;
}
This works totally fine. However, I would like to forbid having spaces in between (i.e. "2, 4,5" should not be parsed well).
I tried using eps as a skip parser in phrase_parse, but as you can guess, the program ended up in the infinite loop because eps matches to an empty string.
Solution I found is to use no_skip directive (https://www.boost.org/doc/libs/1_75_0/libs/spirit/doc/html/spirit/qi/reference/directive/no_skip.html). So the parser now becomes:
const auto parser = no_skip[int_ % ','];
This works fine, but I don't find it to be an elegant solution (especially providing "space" parser in phrase_parse when I want no whitespace skips). Are there no skip parsers that would simply do nothing? Am I missing something?
Thanks for Your time. Looking forward to any replies.
You can use either no_skip[] or lexeme[]. They're almost identical, except for pre-skip (Boost Spirit lexeme vs no_skip).
Are there no skip parsers that would simply do nothing? Am I missing something?
A wild guess, but you might be missing the parse API that doesn't accept a skipper in the first place
Live On Coliru
#include <iostream>
#include <iomanip>
#include <boost/spirit/home/x3.hpp>
namespace x3 = boost::spirit::x3;
int main() {
std::string const input{ "2,4,5" };
auto f = begin(input), l = end(input);
const auto parser = x3::int_ % ',';
std::vector<int> numbers;
auto r = parse(f, l, parser, numbers);
if (r) {
// success
for (const auto& item : numbers)
std::cout << item << std::endl;
} else {
std::cerr << "Input was not parsed successfully" << std::endl;
return 1;
}
if (f!=l) {
std::cout << "Remaining input " << std::quoted(std::string(f,l)) << "\n";
return 2;
}
}
Prints
2
4
5

Using too many alternative operator in boost::qi caused segmentation fault

I generated a simple code using qi::spirit:
#include <boost/spirit/include/qi.hpp>
#include <string>
using namespace std;
using namespace boost::spirit;
int main() {
string str = "string";
auto begin = str.begin();
auto symbols = (qi::lit(";") | qi::lit("(") | qi::lit(")") | qi::lit("+") |
qi::lit("/") | qi::lit("-") | qi::lit("*"));
qi::parse(begin, str.end(), *(qi::char_ - symbols));
}
And then this program was terminated by SEGV.Then,
my rewritten code with less alternative operators in rhs of symbols,
#include <boost/spirit/include/qi.hpp>
#include <string>
using namespace std;
using namespace boost::spirit;
int main()
{
string str = "string";
auto begin = str.begin();
auto symbols = (qi::lit(";") | qi::lit("+") | qi::lit("/") | qi::lit("-") |
qi::lit("*"));
qi::parse(begin, str.end(), *(qi::char_ - symbols));
}
now works well. What's the difference between 2 cases?
Your problem is a classic mistake: using auto to store Qi parser expressions: Assigning parsers to auto variables
That leads to UB.
Use a rule, or qi::copy (which is proto::deep_copy under the hooed).
auto symbols = qi::copy(qi::lit(";") | qi::lit("(") | qi::lit(")") | qi::lit("+") |
qi::lit("/") | qi::lit("-") | qi::lit("*"));
Even better, use a character set to match all the characters at once,
auto symbols = qi::copy(qi::omit(qi::char_(";()+/*-")));
The omit[] counteracts the fact that char_ exposes it's attribute (where lit doesn't). But since all you ever you it for is to SUBTRACT from another character-set:
qi::char_ - symbols
You could just as well just write
qi::char_ - qi::char_(";()+/*-")
Now. You might not know, but you can use ~charset to negate it, so it would just become
~qi::char_(";()+/*-")
NOTE - can have special meaning in charsets, which is why I very subtly move it to the end. See docs
Live Demo
Extending a little and showing some subtler patterns:
Live On Coliru
#include <boost/spirit/include/qi.hpp>
#include <iomanip>
#include <string>
using namespace std;
using namespace boost::spirit;
int main() {
string const str = "string;some(thing) else + http://me#host:*/path-element.php";
auto cs = ";()+/*-";
using qi::char_;
{
std::vector<std::string> tokens;
qi::parse(str.begin(), str.end(), +~char_(cs) % +char_(cs), tokens);
std::cout << "Condensing: ";
for (auto& tok : tokens) {
std::cout << " " << std::quoted(tok);
}
std::cout << std::endl;
}
{
std::vector<std::string> tokens;
qi::parse(str.begin(), str.end(), *~char_(cs) % char_(cs), tokens);
std::cout << "Not condensing: ";
for (auto& tok : tokens) {
std::cout << " " << std::quoted(tok);
}
std::cout << std::endl;
}
}
Prints
Condensing: "string" "some" "thing" " else " " http:" "me#host:" "path" "element.php"
Not condensing: "string" "some" "thing" " else " " http:" "" "me#host:" "" "path" "element.php"
X3
If you have c++14, you can use Spirit X3, which doesn't have the "auto problem" (because it doesn't have Proto Expression trees that can get dangling references).
Your original code would have been fine in X3, and it will compile a lot faster.
Here's my example using X3:
Live On Coliru
#include <boost/spirit/home/x3.hpp>
#include <iostream>
#include <iomanip>
#include <string>
namespace x3 = boost::spirit::x3;
int main() {
std::string const str = "string;some(thing) else + http://me#host:*/path-element.php";
auto const cs = x3::char_(";()+/*-");
std::vector<std::string> tokens;
x3::parse(str.begin(), str.end(), +~cs % +cs, tokens);
//x3::parse(str.begin(), str.end(), *~cs % cs, tokens);
for (auto& tok : tokens) {
std::cout << " " << std::quoted(tok);
}
}
Printing
"string" "some" "thing" " else " " http:" "me#host:" "path" "element.php"
The thing is that using ( and closing it in double quotes is not really safe, it might get interpreted as something else entirely, you might want to try using / to escape it instead.

boost spirit qi parser failed in release and pass in debug

#include <boost/spirit/include/qi.hpp>
#include <string>
#include <vector>
#include <iterator>
#include <algorithm>
#include <iostream>
using namespace boost::spirit;
int main()
{
std::string s;
std::getline(std::cin, s);
auto specialtxt = *(qi::char_('-', '.', '_'));
auto txt = no_skip[*(qi::char_("a-zA-Z0-9_.\\:$\'-"))];
auto anytxt = *(qi::char_("a-zA-Z0-9_.\\:${}[]+/()-"));
qi::rule <std::string::iterator, void(),ascii::space_type> rule2 = txt ('=') >> ('[') >> (']');
auto begin = s.begin();
auto end = s.end();
if (qi::phrase_parse(begin, end, rule2, ascii::space))
{
std::cout << "MATCH" << std::endl;
}
else
{
std::cout << "NO MATCH" << std::endl;
}
}
this code works fine in debug mode
parser fails in release mode
rule is to just parse text=[]; any thing else than this should fail it works fine in debug mode but not in release mode it shows result no match for any string.
if i enter string like
abc=[];
this passes in debug as expected but fails in release
You can't use auto with Spirit v2:
Assigning parsers to auto variables
You have Undefined Behaviour
DEMO
I tried to make (more) sense of the rest of the code. There were various instances that would never work:
txt('=') is an invalid Qi expression. I assumed you wanted txt >> ('=') instead
qi::char_("a-zA-Z0-9_.\\:$\\-{}[]+/()") doesn't do what you think because $-{ is actually the character "range" \x24-\x7b... Escape the - (or put it at the very end/start of the set like in the other char_ call).
qi::char_('-','.','_') can't work. Did you mean qi::char_("-._")?
specialtxt and anytxt were unused...
prefer const_iterator
prefer namespace aliases above using namespace to prevent hard-to-detect errors
Live On Coliru
#include <boost/spirit/include/qi.hpp>
#include <iostream>
namespace qi = boost::spirit::qi;
int main() {
std::string const s = "abc=[];";
auto specialtxt = qi::copy(*(qi::char_("-._")));
auto anytxt = qi::copy(*(qi::char_("a-zA-Z0-9_.\\:$\\-{}[]+/()")));
(void) specialtxt;
(void) anytxt;
auto txt = qi::copy(qi::no_skip[*(qi::char_("a-zA-Z0-9_.\\:$\'-"))]);
qi::rule<std::string::const_iterator, qi::space_type> rule2 = txt >> '=' >> '[' >> ']';
auto begin = s.begin();
auto end = s.end();
if (qi::phrase_parse(begin, end, rule2, qi::space)) {
std::cout << "MATCH" << std::endl;
} else {
std::cout << "NO MATCH" << std::endl;
}
if (begin != end) {
std::cout << "Trailing unparsed: '" << std::string(begin, end) << "'\n";
}
}
Printing
MATCH
Trailing unparsed: ';'

How to split_regex only once?

The function template boost::algorithm::split_regex splits a single string into strings on the substring of the original string that matches the regex pattern we passed to split_regex. The question is: how can I split it only once on the first substring that matches? That is, is it possible to make split_regex stop after its first splitting? Please see the following codes.
#include <boost/algorithm/string/regex.hpp>
#include <boost/format.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <locale>
int main(int argc, char *argv[])
{
using namespace std;
using boost::regex;
locale::global(locale(""));
// Create a standard string for experiment.
string strRequestLine("Host: 192.168.0.1:12345");
regex pat(R"(:\s*)", regex::perl | boost::regex_constants::match_stop);
// Try to split the request line.
vector<string> coll;
boost::algorithm::split_regex(coll, strRequestLine, pat);
// Output what we got.
for (const auto& elt : coll)
cout << boost::format("{%s}\n") % elt;
// Exit the program.
return 0;
}
Where shall the codes be modified to have the output like
{Host}
{192.168.0.1:12345}
instead of the current output
{Host}
{192.168.0.1}
{12345}
Any suggestion/hint? Thanks.
Please note that I'm not asking how to do it with other functions or patterns. I'm asking if it's possible for split_regex to split only once and then stop. Because regex object seems to have the ability to stop at the first matched, I wonder that if offering it some proper flags it maybe stop at the first matched.
For your specific input it seems the simple fix is to change the pattern to become R"(:\s+)". Of course, this assumes that there is, at least, one space after Host: and no space between the IP address and the port.
Another alternative would be not to use split_regex() but rather std::regex_match():
#include <iostream>
#include <regex>
#include <string>
int main()
{
std::string strRequestLine("Host: 192.168.0.1:12345");
std::smatch results;
if (std::regex_match(strRequestLine, results, std::regex(R"(([^:]*):\s*(.*))"))) {
for (auto it(++results.begin()), end(results.end()); it != end; ++it) {
std::cout << "{" << *it << "}\n";
}
}
}
Expanding from my comment:
You might be interested in the first sample I listed here: small HTTP response headers parsing function. Summary: use phrase_parse(f, e, token >> ':' >> lexeme[*(char_ - eol)], space, key, value)
Here's a simple sample:
Live On Coliru
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
namespace {
typedef std::string::const_iterator It;
// 2.2 Basic Rules (rfc1945)
static const qi::rule<It, std::string()> rfc1945_token = +~qi::char_( " \t><#,;:\\\"/][?=}{:"); // FIXME? should filter CTLs
}
#include <iostream>
int main()
{
std::string const strRequestLine("Host: 192.168.0.1:12345");
std::string::const_iterator f(strRequestLine.begin()), l(strRequestLine.end());
std::string key, value;
if (qi::phrase_parse(f, l, rfc1945_token >> ':' >> qi::lexeme[*(qi::char_ - qi::eol)], qi::space, key, value))
std::cout << "'" << key << "' -> '" << value << "'\n";
}
Prints
'Host' -> '192.168.0.1:12345'

Boost Spirit Qi track line and parse unicode

I want to trace input position and input line for unicode strings.
For the position I store an iterator to begin and use std::distance at the desired position. That works well as long as the input is not unicode. With unicode symbols the position gets shifted, i.e. ä takes two spaces in the input stream and position is off by 1. So, I switched to boost::u8_to_u32_iterator and this works fine.
For the line I use boost::spirit::line_pos_iterator which also works well.
My problem is in combining both concepts to use the line iterator and the unicode iterator. Another solution allowing pos and line on unicode strings is of course also welcome.
Here is a small example for the unicode parser; as said I would like to wrap the iterator additionally with boost::spirit::line_pos_iterator but that doesn't even compile.
#define BOOST_SPIRIT_USE_PHOENIX_V3
#define BOOST_SPIRIT_UNICODE
#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace phx = boost::phoenix;
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
#include <boost/spirit/repository/include/qi_iter_pos.hpp>
#include <boost/spirit/include/support_line_pos_iterator.hpp>
#include <iostream>
#include <string>
//==============================================================================
std::string to_utf8(const std::u32string& input) {
return std::string(
boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.begin()),
boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.end()));
}
//==============================================================================
int main() {
std::string input(u8"Hallo äöüß");
typedef boost::u8_to_u32_iterator<std::string::const_iterator> iterator_type;
iterator_type first(input.begin()), last(input.end());
qi::rule<iterator_type, std::u32string()> string_u32 = *(qi::char_ - qi::eoi);
qi::rule<iterator_type, std::string()> string =
string_u32[qi::_val = phx::bind(&to_utf8, qi::_1)];
qi::rule<iterator_type, std::string()> rule = string;
std::string ast;
bool result = qi::parse(first, last, rule, ast);
if (result) {
result = first == last;
}
if (result) {
std::cout << "Parsed: " << ast << std::endl;
} else {
std::cout << "Failure" << std::endl;
}
}
Update Demo added Live on Coliru
I see the same problem whe you try to wrap iterator_type in a line_pos_iterator.
After some thinking, I don't quite know what causes it (it might be possible to get around this by wrapping the u8_to_u32 converting iterator adapter inside a boost::spirit::multi_pass<> iterator adapter, but... that sounded so unwieldy I haven't even tried).
Instead, I think that the nature of line-breaking is that it is (mostly?) charset agnostic. So you could wrap the source iterator with line_pos_iterator first, before the encoding conversion.
This does compile. Of course, then you'll get position information in terms of the source iterators, not 'logical characters'[1].
Let me show a demonstration below. It parses space separated words into a vector of strings. The simplest way to show the position information was to use a vector of iterator_ranges instead of just strings. I used qi::raw[] to expose the iterators[2].
So after a successful parse I loop through the matched ranges and print their location information. First, I print the actual positions reported from line_pos_iterators. Remember these are 'raw' byte offsets, since the source iterator is byte-oriented.
Next, I do a little dance with get_current_line and the u8_to_u32 conversion to translate the offset within the line to a (more) logical count. You'll see that the range for e.g.
Note I currently assumed that ranges would not cross line boundaries (that is true for this grammar). Otherwise one would need to extract and convert 2 lines. The way I'm doing that now is rather expensive. Consider optimizing by e.g. using Boost String Algorithm's find_all facilities. You can build a list of line-ends and use std::lower_bound to locate the current line slightly more efficiently.
Note There might be issues with the implementations of get_line_start and get_current_line; if you notice anything like this, there's a 10-line patch over at the [spirit-general] user list that you could try
Without further ado, the code and the output:
#define BOOST_SPIRIT_USE_PHOENIX_V3
#define BOOST_SPIRIT_UNICODE
#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/phoenix/function/adapt_function.hpp>
namespace phx = boost::phoenix;
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
namespace encoding = boost::spirit::unicode;
#include <boost/spirit/repository/include/qi_iter_pos.hpp>
#include <boost/spirit/include/support_line_pos_iterator.hpp>
#include <iostream>
#include <string>
//==============================================================================
std::string to_utf8(const std::u32string& input) {
return std::string(
boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.begin()),
boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.end()));
}
BOOST_PHOENIX_ADAPT_FUNCTION(std::string, to_utf8_, to_utf8, 1)
//==============================================================================
int main() {
std::string input(u8"Hallo äöüß\n¡Bye! ✿➂➿♫");
typedef boost::spirit::line_pos_iterator<std::string::const_iterator> source_iterator;
typedef boost::u8_to_u32_iterator<source_iterator> iterator_type;
source_iterator soi(input.begin()),
eoi(input.end());
iterator_type first(soi),
last(eoi);
qi::rule<iterator_type, std::u32string()> string_u32 = +encoding::graph;
qi::rule<iterator_type, std::string()> string = string_u32 [qi::_val = to_utf8_(qi::_1)];
std::vector<boost::iterator_range<iterator_type> > ast;
// note the trick with `raw` to expose the iterators
bool result = qi::phrase_parse(first, last, *qi::raw[ string ], encoding::space, ast);
if (result) {
for (auto const& range : ast)
{
source_iterator
base_b(range.begin().base()),
base_e(range.end().base());
auto lbound = get_line_start(soi, base_b);
// RAW access to the base iterators:
std::cout << "Fragment: '" << std::string(base_b, base_e) << "'\t"
<< "raw: L" << get_line(base_b) << ":" << get_column(lbound, base_b, /*tabs:*/4)
<< "-L" << get_line(base_e) << ":" << get_column(lbound, base_e, /*tabs:*/4);
// "cooked" access:
auto line = get_current_line(lbound, base_b, eoi);
// std::cout << "Line: '" << line << "'\n";
// iterator_type is an alias for u8_to_u32_iterator<...>
size_t cur_pos = 0, start_pos = 0, end_pos = 0;
for(iterator_type it = line.begin(), _eol = line.end(); ; ++it, ++cur_pos)
{
if (it.base() == base_b) start_pos = cur_pos;
if (it.base() == base_e) end_pos = cur_pos;
if (it == _eol)
break;
}
std::cout << "\t// in u32 code _units_: positions " << start_pos << "-" << end_pos << "\n";
}
std::cout << "\n";
} else {
std::cout << "Failure" << std::endl;
}
if (first!=last)
{
std::cout << "Remaining: '" << std::string(first, last) << "'\n";
}
}
The output:
clang++ -std=c++11 -Os main.cpp && ./a.out
Fragment: 'Hallo' raw: L1:1-L1:6 // in u32 code _units_: positions 0-5
Fragment: 'äöüß' raw: L1:7-L1:15 // in u32 code _units_: positions 6-10
Fragment: '¡Bye!' raw: L2:2-L2:8 // in u32 code _units_: positions 1-6
Fragment: '✿➂➿♫' raw: L2:9-L2:21 // in u32 code _units_: positions 7-11
[1] I think there's not a useful definition of what a character is in this context. There's bytes, code units, code points, grapheme clusters, possibly more. Suffice it to say that the source iterator (std::string::const_iterator) deals with bytes (since it is charset/encoding unaware). In u32string you can /almost/ assume that a single position is roughly a code-point (although I think (?) that for >L2 UNICODE support you still would have to support code points combined from multiple code units).
[2] This means that current the attribute conversion and the semantic action are redundant, but you'll get that :)