How to split_regex only once?

The function template boost::algorithm::split_regex splits a string into substrings at every match of the regex pattern passed to it. The question is: how can I split only once, at the first match? That is, is it possible to make split_regex stop after its first split? See the following code.
#include <boost/algorithm/string/regex.hpp>
#include <boost/format.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <locale>
#include <string>
#include <vector>
int main(int argc, char *argv[])
{
using namespace std;
using boost::regex;
locale::global(locale(""));
// Create a standard string for experiment.
string strRequestLine("Host: 192.168.0.1:12345");
regex pat(R"(:\s*)", regex::perl | boost::regex_constants::match_stop);
// Try to split the request line.
vector<string> coll;
boost::algorithm::split_regex(coll, strRequestLine, pat);
// Output what we got.
for (const auto& elt : coll)
cout << boost::format("{%s}\n") % elt;
// Exit the program.
return 0;
}
How should the code be modified to produce the output
{Host}
{192.168.0.1:12345}
instead of the current output
{Host}
{192.168.0.1}
{12345}
Any suggestions or hints? Thanks.
Please note that I'm not asking how to do it with other functions or patterns. I'm asking whether split_regex can split only once and then stop. Since a regex object seems able to stop at the first match, I wonder whether passing it the proper flags would make split_regex stop at the first match too.

For your specific input, the simple fix seems to be changing the pattern to R"(:\s+)". Of course, this assumes that there is at least one space after Host: and no whitespace after the colon that separates the IP address from the port.
Another alternative would be not to use split_regex() but rather std::regex_match():
#include <iostream>
#include <regex>
#include <string>
int main()
{
std::string strRequestLine("Host: 192.168.0.1:12345");
std::smatch results;
if (std::regex_match(strRequestLine, results, std::regex(R"(([^:]*):\s*(.*))"))) {
for (auto it(++results.begin()), end(results.end()); it != end; ++it) {
std::cout << "{" << *it << "}\n";
}
}
}

Expanding from my comment:
You might be interested in the first sample I listed here: small HTTP response headers parsing function. Summary: use phrase_parse(f, e, token >> ':' >> lexeme[*(char_ - eol)], space, key, value)
Here's a simple sample:
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
namespace {
typedef std::string::const_iterator It;
// 2.2 Basic Rules (rfc1945)
static const qi::rule<It, std::string()> rfc1945_token = +~qi::char_( " \t><#,;:\\\"/][?=}{:"); // FIXME? should filter CTLs
}
#include <iostream>
int main()
{
std::string const strRequestLine("Host: 192.168.0.1:12345");
std::string::const_iterator f(strRequestLine.begin()), l(strRequestLine.end());
std::string key, value;
if (qi::phrase_parse(f, l, rfc1945_token >> ':' >> qi::lexeme[*(qi::char_ - qi::eol)], qi::space, key, value))
std::cout << "'" << key << "' -> '" << value << "'\n";
}
Prints
'Host' -> '192.168.0.1:12345'

Related

Boost regex to capture all repeated patterns

I have a simple text file with the following contents:
VALUE "foo"
ANOTHERVALUE "bar"
YETANOTHERVALUE "barbar"
The values in column 1 are known.
I want to capture everything in column 1 and in column 2.
My solution involves manually writing all the possible values of column 1 (which is known), into the regex string but obviously this is not ideal practice since I am basically repeating code and this does not allow the ordering to be flexible:
const char* re =
"^[[:space:]]*"
"(VALUE)[[:space:]]*\"(.*)\"[[:space:]]*"
"(ANOTHERVALUE)[[:space:]]*\"(.*)\"[[:space:]]*"
"(YETANOTHERVALUE)[[:space:]]*\"(.*)\"[[:space:]]*";
I'm citing commenter Igor Tandetnik here, because he almost gave the complete answer in his comment:
Regular expressions capture exactly as many substrings as there are
left parentheses in the expression [...]
The right way to solve this problem is to write a regex that matches a
single pair, [...]
\s*([a-zA-Z]+)\s*"(.*?)"
Notes:
\s is equivalent to [[:space:]]
.*? is used to stop searching after the 2nd " instead of the last " in the string
and apply it repeatedly, e.g. via std::regex_iterator
The boost equivalent is boost::regex_iterator.
#include <iostream>
#include <string>
#include <algorithm>
#include <boost/regex.hpp>
const boost::regex expr{ R"__(\s*([a-zA-Z]+)\s*"(.*?)")__" };
const std::string s =
R"(VALUE "foo"
ANOTHERVALUE "bar"
YETANOTHERVALUE "barbar"
)";
int main() {
boost::sregex_iterator it{ begin(s), end(s), expr }, itEnd;
std::for_each( it, itEnd, []( const boost::smatch& m ){
std::cout << m[1] << '\n' << m[2] << std::endl;
});
}
Notes:
I'm using raw string literals to make the code cleaner.
I would use a little Spirit Parser here:
Reading Into A Map
#include <boost/fusion/adapted/std_pair.hpp> // reading maps
#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <iostream>
#include <fstream>
#include <map>
#include <stdexcept>
#include <string>
auto read_config_map(std::istream& stream) {
std::map<std::string, std::string> settings;
boost::spirit::istream_iterator f(stream >> std::noskipws), l;
using namespace boost::spirit::x3;
auto key_ = lexeme [ +upper ];
auto value_ = lexeme [ '"' >> *~char_('"') >> '"' ];
if (!phrase_parse(f, l, -(key_ >> value_) % eol >> eoi, blank, settings))
throw std::invalid_argument("cannot parse config map");
return settings;
}
auto read_config_map(std::string const& fname) {
std::ifstream stream(fname);
return read_config_map(stream);
}
int main() {
for (auto&& entry : read_config_map(std::cin))
std::cout << "Key:'" << entry.first << "' Value:'" << entry.second << "'\n";
}
Prints:
Key:'ANOTHERVALUE' Value:'bar'
Key:'VALUE' Value:'foo'
Key:'YETANOTHERVALUE' Value:'barbar'

How to split string using CRLF delimiter in cpp?

I have a string:
testing testing
test2test2
These lines are divided by CRLF; in a hex dump I see the bytes 0d0a0d0a between them.
How can I split the string using this information?
I wanted to use str.find(CRLF-DELIMITER) but can't seem to figure out how.
Edit:
I already used str.find("textDelimiter"), but now I need it to search for those hex byte values, not for the literal string "0d0a0d0a".
Use boost::split to do that. Please also take a look at Boost.Tokenizer.
Here is another way of doing it using regex:
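#include <boost/algorithm/string/regex.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <string>
#include <vector>
using std::string;
using std::vector;
using boost::algorithm::split_regex;
int main()
{
vector<string> res;
string input = "test1\r\ntest2\r\ntest3";
// "(\r\n)+" treats a run of consecutive CRLFs as a single delimiter.
split_regex(res, input, boost::regex("(\r\n)+"));
for (auto& tok : res)
{
std::cout << "Token: " << tok << std::endl;
}
return 0;
}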
Here is the way of doing it without Boost:
#include <string>
#include <sstream>
#include <istream>
#include <vector>
#include <iostream>
int main()
{
std::string strlist("line1\r\nLine2\r\nLine3\r\n");
std::istringstream MyStream(strlist);
std::vector<std::string> v;
std::string s;
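while (std::getline(MyStream, s))
{
// std::getline splits on '\n' only, so drop the '\r' left over from each CRLF.
if (!s.empty() && s.back() == '\r')
s.pop_back();
v.push_back(s);
std::cout << s << std::endl;
}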
return 0;
}

Boost Spirit Qi track line and parse unicode

I want to track the input position and input line for Unicode strings.
For the position I store an iterator to begin and use std::distance at the desired position. That works well as long as the input is not Unicode. With Unicode symbols the position gets shifted, e.g. ä takes two bytes in the input stream, so the position is off by one. So I switched to boost::u8_to_u32_iterator, and this works fine.
For the line I use boost::spirit::line_pos_iterator, which also works well.
My problem is combining both concepts, i.e. using the line iterator and the Unicode iterator together. Another solution that provides position and line on Unicode strings is of course also welcome.
Here is a small example for the Unicode parser; as said, I would like to additionally wrap the iterator with boost::spirit::line_pos_iterator, but that doesn't even compile.
#define BOOST_SPIRIT_USE_PHOENIX_V3
#define BOOST_SPIRIT_UNICODE
#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace phx = boost::phoenix;
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
#include <boost/spirit/repository/include/qi_iter_pos.hpp>
#include <boost/spirit/include/support_line_pos_iterator.hpp>
#include <iostream>
#include <string>
//==============================================================================
std::string to_utf8(const std::u32string& input) {
return std::string(
boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.begin()),
boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.end()));
}
//==============================================================================
int main() {
std::string input(u8"Hallo äöüß");
typedef boost::u8_to_u32_iterator<std::string::const_iterator> iterator_type;
iterator_type first(input.begin()), last(input.end());
qi::rule<iterator_type, std::u32string()> string_u32 = *(qi::char_ - qi::eoi);
qi::rule<iterator_type, std::string()> string =
string_u32[qi::_val = phx::bind(&to_utf8, qi::_1)];
qi::rule<iterator_type, std::string()> rule = string;
std::string ast;
bool result = qi::parse(first, last, rule, ast);
if (result) {
result = first == last;
}
if (result) {
std::cout << "Parsed: " << ast << std::endl;
} else {
std::cout << "Failure" << std::endl;
}
}
I see the same problem when you try to wrap iterator_type in a line_pos_iterator.
After some thinking, I don't quite know what causes it (it might be possible to get around this by wrapping the u8_to_u32 converting iterator adapter inside a boost::spirit::multi_pass<> iterator adapter, but... that sounded so unwieldy I haven't even tried).
Instead, I think that the nature of line-breaking is that it is (mostly?) charset agnostic. So you could wrap the source iterator with line_pos_iterator first, before the encoding conversion.
This does compile. Of course, then you'll get position information in terms of the source iterators, not 'logical characters'[1].
Let me show a demonstration below. It parses space separated words into a vector of strings. The simplest way to show the position information was to use a vector of iterator_ranges instead of just strings. I used qi::raw[] to expose the iterators[2].
So after a successful parse I loop through the matched ranges and print their location information. First, I print the actual positions reported from line_pos_iterators. Remember these are 'raw' byte offsets, since the source iterator is byte-oriented.
Next, I do a little dance with get_current_line and the u8_to_u32 conversion to translate the offset within the line to a (more) logical count. You'll see that the range for e.g. 'äöüß' is reported as raw columns 7-15 but as logical positions 6-10.
Note: I currently assume that ranges do not cross line boundaries (that is true for this grammar). Otherwise one would need to extract and convert 2 lines. The way I'm doing that now is rather expensive. Consider optimizing by e.g. using Boost String Algorithm's find_all facilities. You can build a list of line-ends and use std::lower_bound to locate the current line slightly more efficiently.
Note: There might be issues with the implementations of get_line_start and get_current_line; if you notice anything like this, there's a 10-line patch over at the [spirit-general] user list that you could try.
Without further ado, the code and the output:
#define BOOST_SPIRIT_USE_PHOENIX_V3
#define BOOST_SPIRIT_UNICODE
#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/phoenix/function/adapt_function.hpp>
namespace phx = boost::phoenix;
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
namespace encoding = boost::spirit::unicode;
#include <boost/spirit/repository/include/qi_iter_pos.hpp>
#include <boost/spirit/include/support_line_pos_iterator.hpp>
#include <iostream>
#include <string>
#include <vector>
//==============================================================================
std::string to_utf8(const std::u32string& input) {
return std::string(
boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.begin()),
boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.end()));
}
BOOST_PHOENIX_ADAPT_FUNCTION(std::string, to_utf8_, to_utf8, 1)
//==============================================================================
int main() {
std::string input(u8"Hallo äöüß\n¡Bye! ✿➂➿♫");
typedef boost::spirit::line_pos_iterator<std::string::const_iterator> source_iterator;
typedef boost::u8_to_u32_iterator<source_iterator> iterator_type;
source_iterator soi(input.begin()),
eoi(input.end());
iterator_type first(soi),
last(eoi);
qi::rule<iterator_type, std::u32string()> string_u32 = +encoding::graph;
qi::rule<iterator_type, std::string()> string = string_u32 [qi::_val = to_utf8_(qi::_1)];
std::vector<boost::iterator_range<iterator_type> > ast;
// note the trick with `raw` to expose the iterators
bool result = qi::phrase_parse(first, last, *qi::raw[ string ], encoding::space, ast);
if (result) {
for (auto const& range : ast)
{
source_iterator
base_b(range.begin().base()),
base_e(range.end().base());
auto lbound = get_line_start(soi, base_b);
// RAW access to the base iterators:
std::cout << "Fragment: '" << std::string(base_b, base_e) << "'\t"
<< "raw: L" << get_line(base_b) << ":" << get_column(lbound, base_b, /*tabs:*/4)
<< "-L" << get_line(base_e) << ":" << get_column(lbound, base_e, /*tabs:*/4);
// "cooked" access:
auto line = get_current_line(lbound, base_b, eoi);
// std::cout << "Line: '" << line << "'\n";
// iterator_type is an alias for u8_to_u32_iterator<...>
size_t cur_pos = 0, start_pos = 0, end_pos = 0;
for(iterator_type it = line.begin(), _eol = line.end(); ; ++it, ++cur_pos)
{
if (it.base() == base_b) start_pos = cur_pos;
if (it.base() == base_e) end_pos = cur_pos;
if (it == _eol)
break;
}
std::cout << "\t// in u32 code _units_: positions " << start_pos << "-" << end_pos << "\n";
}
std::cout << "\n";
} else {
std::cout << "Failure" << std::endl;
}
if (first!=last)
{
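// Print the remaining UTF-8 bytes via the base iterators; building a std::string
// directly from the u32 iterators would truncate code points to char.
std::cout << "Remaining: '" << std::string(first.base(), last.base()) << "'\n";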
}
}
The output:
clang++ -std=c++11 -Os main.cpp && ./a.out
Fragment: 'Hallo' raw: L1:1-L1:6 // in u32 code _units_: positions 0-5
Fragment: 'äöüß' raw: L1:7-L1:15 // in u32 code _units_: positions 6-10
Fragment: '¡Bye!' raw: L2:2-L2:8 // in u32 code _units_: positions 1-6
Fragment: '✿➂➿♫' raw: L2:9-L2:21 // in u32 code _units_: positions 7-11
[1] I think there's not a useful definition of what a character is in this context. There's bytes, code units, code points, grapheme clusters, possibly more. Suffice it to say that the source iterator (std::string::const_iterator) deals with bytes (since it is charset/encoding unaware). In u32string you can /almost/ assume that a single position is roughly a code-point (although I think (?) that for >L2 UNICODE support you still would have to support code points combined from multiple code units).
[2] This means that currently the attribute conversion and the semantic action are redundant, but you get the idea :)

Regex search & replace group in C++?

The best I can come up with is:
#include <boost/algorithm/string/replace.hpp>
#include <boost/regex.hpp>
#include <iostream>
using namespace std;
int main() {
string dog = "scooby-doo";
boost::regex pattern("(\\w+)-doo");
boost::smatch groups;
if (boost::regex_match(dog, groups, pattern))
boost::replace_all(dog, string(groups[1]), "scrappy");
cout << dog << endl;
}
with output:
scrappy-doo
.. is there a simpler way of doing this, that doesn't involve doing two distinct searches? Maybe with the new C++11 stuff (although I'm not sure that it's compatible with gcc atm?)
std::regex_replace should do the trick. The provided example is pretty close to your problem, even to the point of showing how to shove the answer straight into cout if you want. Pasted here for posterity:
#include <iostream>
#include <iterator>
#include <regex>
#include <string>
int main()
{
std::string text = "Quick brown fox";
std::regex vowel_re("a|e|i|o|u");
// write the results to an output iterator
std::regex_replace(std::ostreambuf_iterator<char>(std::cout),
text.begin(), text.end(), vowel_re, "*");
// construct a string holding the results
std::cout << '\n' << std::regex_replace(text, vowel_re, "[$&]") << '\n';
}

If-Then-Else Conditionals in Regular Expressions and using capturing group

I have some difficulties in understanding if-then-else conditionals in regular expressions.
After reading If-Then-Else Conditionals in Regular Expressions I decided to write a simple test. I use C++, Boost 1.38 Regex and MS VC 8.0.
I have written this program:
#include <iostream>
#include <string>
#include <boost/regex.hpp>
int main()
{
std::string str_to_modify = "123";
//std::string str_to_modify = "ttt";
boost::regex regex_to_search ("(\\d\\d\\d)");
std::string regex_format ("(?($1)$1|000)");
std::string modified_str =
boost::regex_replace(
str_to_modify,
regex_to_search,
regex_format,
boost::match_default | boost::format_all | boost::format_no_copy );
std::cout << modified_str << std::endl;
return 0;
}
I expected to get "123" if str_to_modify is "123" and "000" if str_to_modify is "ttt". However, I get ?123123|000 in the first case and nothing in the second one.
Could you tell me, please, what is wrong with my test?
The second example that still doesn't work :
#include <iostream>
#include <string>
#include <boost/regex.hpp>
int main()
{
//std::string str_to_modify = "123";
std::string str_to_modify = "ttt";
boost::regex regex_to_search ("(\\d\\d\\d)");
std::string regex_format ("(?1foo:bar");
std::string modified_str =
boost::regex_replace(str_to_modify, regex_to_search, regex_format,
boost::match_default | boost::format_all | boost::format_no_copy );
std::cout << modified_str << std::endl;
return 0;
}
I think the format string should be (?1$1:000) as described in the Boost.Regex docs.
Edit: I don't think regex_replace can do what you want. Why don't you try the following instead? regex_match will tell you whether the match succeeded (or you can use match[i].matched to check whether the i-th tagged sub-expression matched). You can format the match using the match.format member function.
#include <iostream>
#include <string>
#include <boost/regex.hpp>
int main()
{
boost::regex regex_to_search ("(\\d\\d\\d)");
std::string str_to_modify;
while (std::getline(std::cin, str_to_modify))
{
boost::smatch match;
if (boost::regex_match(str_to_modify, match, regex_to_search))
std::cout << match.format("foo:$1") << std::endl;
else
std::cout << "error" << std::endl;
}
}