How can I split a string with Boost with a regex AND have the delimiter included in the result list?
For example, if I have the string "1d2" and my regex is "[a-z]", I want the results in a vector with (1, d, 2).
I have:
std::string expression = "1d2";
boost::regex re("[a-z]");
boost::sregex_token_iterator i (expression.begin (),
expression.end (),
re);
boost::sregex_token_iterator j;
std::vector <std::string> splitResults;
std::copy (i, j, std::back_inserter (splitResults));
Thanks
I think you cannot directly extract the delimiters using boost::regex. You can, however, extract the position where the regex is found in your string:
std::string expression = "1a234bc";
boost::regex re("[a-z]");
boost::sregex_iterator i(
expression.begin (),
expression.end (),
re);
boost::sregex_iterator j;
for(; i!=j; ++i) {
std::cout << (*i).position() << " : " << (*i) << std::endl;
}
This example would show:
1 : a
5 : b
6 : c
Using this information, you can extract the delimiters, and the text between them, from your original string:
std::string expression = "1a234bc43";
boost::regex re("[a-z]");
boost::sregex_iterator i(
expression.begin (),
expression.end (),
re);
boost::sregex_iterator j;
size_t pos=0;
for(; i!=j;++i) {
std::string pre_delimiter = expression.substr(pos, (*i).position()-pos);
std::cout << pre_delimiter << std::endl;
std::cout << (*i) << std::endl;
pos = (*i).position() + (*i).size();
}
std::string last_delimiter = expression.substr(pos);
std::cout << last_delimiter << std::endl;
This example would show:
1
a
234
b

c
43
There is an empty string between b and c because no text separates these two delimiters.
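As an alternative sketch, the token iterator can return both the text between matches and the matches themselves when it is given the submatch list {-1, 0}. The version below uses std::regex so that it is self-contained; boost::sregex_token_iterator has a similar interface, but treat the Boost equivalence as an assumption to verify. The helper name split_keep_delimiters is made up for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` on `delimiter` and keeps the delimiters.
// Submatch -1 selects the text between matches, 0 selects the match itself.
// An empty token is produced when two delimiters are adjacent.
std::vector<std::string> split_keep_delimiters(const std::string& text,
                                               const std::regex& delimiter)
{
    std::sregex_token_iterator first(text.begin(), text.end(), delimiter, {-1, 0});
    std::sregex_token_iterator last;  // default-constructed = end of sequence
    return std::vector<std::string>(first, last);
}
```

Called with "1d2" and the regex "[a-z]", this should yield the three tokens 1, d and 2.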
I am not familiar with Boost. Could anybody please tell me what exactly this function is doing?
int
Function(const string& tempStr)
{
boost::regex expression ("result = ");
std::string::const_iterator start, end;
start = tempStr.begin();
end = tempStr.end();
boost::match_results<std::string::const_iterator> what;
boost::regex_constants::_match_flags flags = boost::match_default;
int count = 0;
while(regex_search(start, end, what, expression, flags)){
start = what[0].second;
count++;
}
cout << "Count :"<< count << endl;
return count;
}
match_results is a collection of sub_match objects. The first sub_match object (index 0) represents the full match in the target sequence; the following ones correspond to the subexpression matches. Your code searches for matches of "result = " and restarts the search each time from the end of the previous match (what[0].second).
int
Function(const string& tempStr)
{
boost::regex expression ("result = ");
std::string::const_iterator start, end;
start = tempStr.begin();
end = tempStr.end();
boost::match_results<std::string::const_iterator> what;
boost::regex_constants::_match_flags flags = boost::match_default;
int count = 0;
while(regex_search(start, end, what, expression, flags)){
start = what[0].second;
count++;
}
cout << "Count :"<< count << endl;
return count;
}
int main()
{
Function("result = 22, result = 33"); // Outputs 'Count :2'
}
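To make the difference between index 0 and the subexpression indexes concrete, here is a minimal sketch that adds one capture group to the pattern and collects what[1] on every iteration. It uses std::regex rather than Boost so that it is self-contained; the function name collect_first_group and the pattern in the usage note are assumptions made for illustration, not part of the original code.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the text captured by group 1 for every match.
// Assumes `pattern` never matches an empty string; otherwise the loop would not advance.
std::vector<std::string> collect_first_group(const std::string& text,
                                             const std::regex& pattern)
{
    std::vector<std::string> values;
    std::smatch what;
    std::string::const_iterator start = text.begin();
    const std::string::const_iterator end = text.end();
    while (std::regex_search(start, end, what, pattern)) {
        values.push_back(what[1].str());  // what[0] would be the whole match
        start = what[0].second;           // resume right after the current match
    }
    return values;
}
```

With the text "result = 22, result = 33" and the pattern "result = (\\d+)", the returned vector should hold 22 and 33.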
At its core, the function searches for regular expression matches in tempStr.
Look at the regex_search documentation and note what match_results contains after the call finishes (it is the third parameter, what in your code sample). From there, understanding the while loop should be straightforward.
This function is a complicated way to count the number of occurrences of the string "result = ". A simpler way would be:
boost::regex search_string("result = ");
auto begin = boost::make_regex_iterator(tempStr, search_string);
int count = std::distance(begin, {});
This can be collapsed into a one-liner, at some cost to readability.
This is a match counter function.
It is more verbose than it needs to be. Here is equivalent code using std::regex (the Boost version is analogous):
unsigned int count_match( std::string user_string, const std::string& user_pattern ){
const std::regex rx( user_pattern );
std::regex_token_iterator< std::string::const_iterator > first( user_string.begin(), user_string.end(), rx ), last;
return std::distance( first, last );
}
With std::regex_search it can be written as follows (again, the same applies to Boost):
unsigned int match_count( std::string user_string, const std::string& user_pattern ){
unsigned int counter = 0;
std::match_results< std::string::const_iterator > match_result;
std::regex regex( user_pattern );
while( std::regex_search( user_string, match_result, regex ) ){
user_string = match_result.suffix().str();
++counter;
}
return counter;
}
NOTE:
There is no need for this part, because regex_search can also take the string itself:
std::string::const_iterator start, end;
start = tempStr.begin();
end = tempStr.end();
Also,
boost::match_results<std::string::const_iterator> what;
can be written as
boost::smatch what; // a typedef of match_results<std::string::const_iterator>
There is also no need for:
boost::regex_constants::_match_flags flags = boost::match_default;
because regex_search uses this flag by default.
This line:
start = what[0].second;
moves the search position past the current match. When searching a string instead of an iterator range, the equivalent is to continue with:
match_result.suffix().str();
If you want to see what happens in the while loop, use this code:
std::cout << "prefix: '" << what.prefix().str() << "'\n";
std::cout << "match : '" << what.str() << "'\n";
std::cout << "suffix: '" << what.suffix().str() << "'\n";
std::cout << "------------------------------\n";
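The following sketch wraps that kind of trace in a function, so the state of each iteration can be inspected as data. It is written against std::regex for self-containment, and the name trace_search is hypothetical. Note that with an iterator range the prefix is measured from the current start position, not from the beginning of the original string.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: records prefix, match and suffix for every loop iteration.
// Assumes `pattern` never matches an empty string.
std::vector<std::string> trace_search(const std::string& text, const std::regex& pattern)
{
    std::vector<std::string> steps;
    std::smatch what;
    std::string::const_iterator start = text.begin();
    while (std::regex_search(start, text.end(), what, pattern)) {
        steps.push_back("prefix='" + what.prefix().str() +
                        "' match='" + what.str() +
                        "' suffix='" + what.suffix().str() + "'");
        start = what[0].second;  // continue after the current match
    }
    return steps;
}
```

For the text "a1b2" and the pattern "[0-9]", the first step should report prefix a, match 1 and suffix b2, and the second step prefix b, match 2 and an empty suffix.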
How do I count the number of matches using C++11's std::regex?
std::regex re("[^\\s]+");
std::cout << re.matches("Harry Botter - The robot who lived.").count() << std::endl;
Expected output:
7
You can use regex_iterator to generate all of the matches, then use distance to count them:
std::regex const expression("[^\\s]+");
std::string const text("Harry Botter - The robot who lived.");
std::ptrdiff_t const match_count(std::distance(
std::sregex_iterator(text.begin(), text.end(), expression),
std::sregex_iterator()));
std::cout << match_count << std::endl;
You can use this:
int countMatchInRegex(std::string s, std::string re)
{
std::regex words_regex(re);
auto words_begin = std::sregex_iterator(
s.begin(), s.end(), words_regex);
auto words_end = std::sregex_iterator();
return std::distance(words_begin, words_end);
}
Example usage:
std::cout << countMatchInRegex("Harry Botter - The robot who lived.", "[^\\s]+");
Output:
7
I would like to split a string like this one
“this1245is#g$0,therhsuidthing345”
using a list of words like the one below
{“this”, “is”, “the”, “thing”}
into this list
{“this”, “1245”, “is”, “#g$0,”, “the”, “rhsuid”, “thing”, “345”}
// ^--------------^---------------^------------------^-- these were the delimiters
The delimiters may appear more than once in the string to split, and the split can be done using regular expressions.
The precedence is the order in which the delimiters appear in the array.
The platform I'm developing for has no support for the Boost library.
Update
This is what I have for the moment
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("this1245is#g$0,therhsuidthing345");
std::string delimiters[] = {"this", "is", "the", "thing"};
for (int i=0; i<4; i++) {
std::string delimiter = "(" + delimiters[i] + ")(.*)";
std::regex e (delimiter); // matches words beginning by the i-th delimiter
// default constructor = end-of-sequence:
std::sregex_token_iterator rend;
std::cout << "1st and 2nd submatches:";
int submatches[] = { 1, 2 };
std::sregex_token_iterator c ( s.begin(), s.end(), e, submatches );
while (c!=rend) std::cout << " [" << *c++ << "]";
std::cout << std::endl;
}
return 0;
}
output:
1st and 2nd submatches: [this] [1245is#g$0,therhsuidthing345]
1st and 2nd submatches: [is] [1245is#g$0,therhsuidthing345]
1st and 2nd submatches: [the] [rhsuidthing345]
1st and 2nd submatches: [thing] [345]
I think I need to make some recursive thing to call on each iteration
Build the expression you want for matches only (re), then pass in {-1, 0} to your std::sregex_token_iterator to return all non-matches (-1) and matches (0).
#include <iostream>
#include <regex>
int main() {
std::string s("this1245is#g$0,therhsuidthing345");
std::regex re("(this|is|the|thing)");
std::sregex_token_iterator iter(s.begin(), s.end(), re, { -1, 0 });
std::sregex_token_iterator end;
while (iter != end) {
//Works in vc13, clang requires you increment separately,
//haven't gone into implementation to see if/how ssub_match is affected.
//Workaround: increment separately.
//std::cout << "[" << *iter++ << "] ";
std::cout << "[" << *iter << "] ";
++iter;
}
}
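A sketch of how this could be turned into a reusable function that builds the alternation from the word list and returns a vector. Two points are assumptions of this illustration rather than part of the answer above: the words are taken to contain no regex metacharacters (no escaping is done), and empty tokens are dropped, because submatch -1 yields an empty string whenever the text starts with a delimiter or two delimiters are adjacent. With the default ECMAScript grammar the alternatives are tried from left to right, which gives the requested precedence by array order. The names make_delimiter_regex and split_by_words are hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Joins the words into one alternation such as "this|is|the|thing".
// Assumption: the list is non-empty and the words need no regex escaping.
std::regex make_delimiter_regex(const std::vector<std::string>& words)
{
    std::string pattern;
    for (const std::string& word : words) {
        if (!pattern.empty()) pattern += '|';
        pattern += word;
    }
    return std::regex(pattern);
}

// Splits `text` on the given words, keeping the words and skipping empty tokens.
std::vector<std::string> split_by_words(const std::string& text,
                                        const std::vector<std::string>& words)
{
    const std::regex re = make_delimiter_regex(words);
    std::vector<std::string> parts;
    std::sregex_token_iterator it(text.begin(), text.end(), re, {-1, 0});
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        if (it->length() > 0) parts.push_back(it->str());
    }
    return parts;
}
```

For the string and word list from the question, this should produce this, 1245, is, #g$0,, the, rhsuid, thing, 345.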
I don't know how to perform the precedence requirement. This seems to work on the given input:
std::vector<std::string> parse (std::string s)
{
std::vector<std::string> out;
std::regex re("(this|is|the|thing).*");
std::string word;
auto i = s.begin();
while (i != s.end()) {
std::match_results<std::string::iterator> m;
if (std::regex_match(i, s.end(), m, re)) {
if (!word.empty()) {
out.push_back(word);
word.clear();
}
out.push_back(std::string(m[1].first, m[1].second));
i += out.back().size();
} else {
word += *i++;
}
}
if (!word.empty()) {
out.push_back(word);
}
return out;
}
vector<string> strs;
boost::split(strs,line,boost::is_space());
For example, if I have a string like "first second third forth", I want to match every single word in one operation and output them one by one.
I thought that "(\\b\\S*\\b){0,}" would work, but it did not.
What should I do?
Here's my code:
#include<iostream>
#include<regex>
#include<string>
using namespace std;
int main()
{
regex exp("(\\b\\S*\\b)");
smatch res;
string str = "first second third forth";
regex_search(str, res, exp);
cout << res[0] <<" "<<res[1]<<" "<<res[2]<<" "<<res[3]<< endl;
}
Simply iterate over your string while regex_searching, like this:
{
regex exp("(\\b\\S*\\b)");
smatch res;
string str = "first second third forth";
string::const_iterator searchStart( str.cbegin() );
while ( regex_search( searchStart, str.cend(), res, exp ) )
{
cout << ( searchStart == str.cbegin() ? "" : " " ) << res[0];
searchStart = res.suffix().first;
}
cout << endl;
}
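The same iteration can be sketched with std::sregex_iterator, which advances from match to match on its own. This is an illustration under assumptions, not the answer's code: the function name find_all is hypothetical, and the usage note relies on the pattern "\\S+" instead of "(\\b\\S*\\b)" so that the pattern cannot match an empty string.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the text of every non-overlapping match.
std::vector<std::string> find_all(const std::string& text, const std::regex& pattern)
{
    std::vector<std::string> matches;
    // A default-constructed sregex_iterator marks the end of the sequence.
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
        matches.push_back(it->str());  // it->str() is the whole match
    }
    return matches;
}
```

Calling find_all("first second third forth", std::regex("\\S+")) should return the four words in order.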
This can be done with the C++11 <regex> library.
Two methods:
You can use () in the regex to define captures (sub-expressions).
Like this:
string var = "first second third forth";
const regex r("(.*) (.*) (.*) (.*)");
smatch sm;
if (regex_search(var, sm, r)) {
for (size_t i=1; i<sm.size(); i++) {
cout << sm[i] << endl;
}
}
See it live: http://coliru.stacked-crooked.com/a/e1447c4cff9ea3e7
You can use sregex_token_iterator():
string var = "first second third forth";
regex wsaq_re("\\s+");
copy( sregex_token_iterator(var.begin(), var.end(), wsaq_re, -1),
sregex_token_iterator(),
ostream_iterator<string>(cout, "\n"));
See it live: http://coliru.stacked-crooked.com/a/677aa6f0bb0612f0
sregex_token_iterator appears to be the ideal, efficient solution, but the example given in the selected answer leaves much to be desired. Instead, I found some great examples here:
http://www.cplusplus.com/reference/regex/regex_token_iterator/regex_token_iterator/
For your convenience, I've copy-pasted the sample code shown by that page. I claim no credit for the code.
// regex_token_iterator example
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("this subject has a submarine as a subsequence");
std::regex e ("\\b(sub)([^ ]*)"); // matches words beginning by "sub"
// default constructor = end-of-sequence:
std::regex_token_iterator<std::string::iterator> rend;
std::cout << "entire matches:";
std::regex_token_iterator<std::string::iterator> a ( s.begin(), s.end(), e );
while (a!=rend) std::cout << " [" << *a++ << "]";
std::cout << std::endl;
std::cout << "2nd submatches:";
std::regex_token_iterator<std::string::iterator> b ( s.begin(), s.end(), e, 2 );
while (b!=rend) std::cout << " [" << *b++ << "]";
std::cout << std::endl;
std::cout << "1st and 2nd submatches:";
int submatches[] = { 1, 2 };
std::regex_token_iterator<std::string::iterator> c ( s.begin(), s.end(), e, submatches );
while (c!=rend) std::cout << " [" << *c++ << "]";
std::cout << std::endl;
std::cout << "matches as splitters:";
std::regex_token_iterator<std::string::iterator> d ( s.begin(), s.end(), e, -1 );
while (d!=rend) std::cout << " [" << *d++ << "]";
std::cout << std::endl;
return 0;
}
Output:
entire matches: [subject] [submarine] [subsequence]
2nd submatches: [ject] [marine] [sequence]
1st and 2nd submatches: [sub] [ject] [sub] [marine] [sub] [sequence]
matches as splitters: [this ] [ has a ] [ as a ]
You could use the suffix() function, and search again until you don't find a match:
int main()
{
regex exp("(\\b\\S*\\b)");
smatch res;
string str = "first second third forth";
while (regex_search(str, res, exp)) {
cout << res[0] << endl;
str = res.suffix();
}
}
My code will capture all groups in all matches:
vector<vector<string>> U::String::findEx(const string& s, const string& reg_ex, bool case_sensitive)
{
// icase is added only when the search should ignore case
regex rx(reg_ex, case_sensitive ? regex_constants::ECMAScript : regex_constants::ECMAScript | regex_constants::icase);
vector<vector<string>> captured_groups;
vector<string> captured_subgroups;
const std::sregex_token_iterator end_i;
for (std::sregex_token_iterator i(s.cbegin(), s.cend(), rx);
i != end_i;
++i)
{
captured_subgroups.clear();
string group = *i;
smatch res;
// search the matched text again to get access to its capture groups
if(regex_search(group, res, rx))
{
for(unsigned k=0; k<res.size() ; k++)
captured_subgroups.push_back(res[k]);
if(captured_subgroups.size() > 0)
captured_groups.push_back(captured_subgroups);
}
}
return captured_groups;
}
My reading of the documentation is that regex_search only finds the first match. To scan for all matches with std::regex, iterate with std::sregex_iterator or std::sregex_token_iterator. The Boost library supports the same approach, as described in "C++ tokenize a string using a regular expression".
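For reference, a minimal sketch of such a scan with std::sregex_iterator, which yields one match_results object per match and therefore gives access to every capture group without searching a second time. The function name find_all_groups and the pattern in the usage note are assumptions made for illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: for every match, stores the whole match (index 0)
// followed by the text of each capture group.
std::vector<std::vector<std::string>> find_all_groups(const std::string& text,
                                                      const std::regex& pattern)
{
    std::vector<std::vector<std::string>> result;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
        std::vector<std::string> groups;
        for (std::size_t k = 0; k < it->size(); ++k) {
            groups.push_back((*it)[k].str());  // an unmatched group gives an empty string
        }
        result.push_back(groups);
    }
    return result;
}
```

For the text "a=1 b=22" and the pattern "(\\w+)=(\\d+)", the result should contain two entries: (a=1, a, 1) and (b=22, b, 22).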