C++ Boost usage over a string


I'm not familiar with Boost. Could anybody please tell me what exactly this function is doing?
int
Function(const string& tempStr)
{
boost::regex expression ("result = ");
std::string::const_iterator start, end;
start = tempStr.begin();
end = tempStr.end();
boost::match_results<std::string::const_iterator> what;
boost::regex_constants::_match_flags flags = boost::match_default;
int count = 0;
while(regex_search(start, end, what, expression, flags)){
start = what[0].second;
count++;
}
cout << "Count :"<< count << endl;
return count;
}

match_results is a collection of sub_match objects. The first sub_match object (index 0) represents the full match in the target sequence (subsequent entries correspond to the matches of the parenthesized subexpressions). Your code searches for occurrences of "result = " and restarts the search each time from the end of the previous match (what[0].second).
int
Function(const string& tempStr)
{
boost::regex expression ("result = ");
std::string::const_iterator start, end;
start = tempStr.begin();
end = tempStr.end();
boost::match_results<std::string::const_iterator> what;
boost::regex_constants::_match_flags flags = boost::match_default;
int count = 0;
while(regex_search(start, end, what, expression, flags)){
start = what[0].second;
count++;
}
cout << "Count :"<< count << endl;
return count;
}
int main()
{
Function("result = 22, result = 33"); // Outputs 'Count :2'
}

The base of the functionality is searching for a regular expression match on tempStr.
Look at the regex_search documentation and notice what the match_result contains after it finishes (that's the 3rd parameter, or what in your code sample). From there understanding the while loop should be straightforward.
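To make the contents of the match object concrete, here is a minimal sketch that records where each match begins and ends. It uses std::regex rather than Boost, on the assumption that both expose the same sub_match interface for this purpose; match_spans is a hypothetical helper name, not something from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the [begin, end) offsets of every match of
// `pattern` in `text`, using the same loop shape as the function in the question.
std::vector<std::pair<std::size_t, std::size_t>>
match_spans(const std::string& text, const std::string& pattern) {
    const std::regex expression(pattern);
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    std::smatch what;
    auto start = text.cbegin();
    while (std::regex_search(start, text.cend(), what, expression)) {
        // what[0] is the whole match; first/second are iterators into `text`.
        spans.emplace_back(static_cast<std::size_t>(what[0].first - text.cbegin()),
                           static_cast<std::size_t>(what[0].second - text.cbegin()));
        if (what[0].first == what[0].second) break;  // guard against empty matches
        start = what[0].second;  // resume right after the previous match
    }
    return spans;
}
```

For the input "result = 22, result = 33" and the pattern "result = ", this should yield two spans, one per occurrence, which is exactly what the counter in the question counts.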

This function is a complicated way to count the number of occurrences of the string "result = ". A simpler way would be:
boost::regex search_string("result = ");
auto begin = boost::make_regex_iterator(tempStr, search_string);
int count = std::distance(begin, {});
Which can be collapsed to a one-liner, with possible loss of readability.
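If Boost is not available, the same idea can be sketched with the standard library. The helper name count_matches is an assumption made for illustration; std::sregex_iterator and std::distance are the standard counterparts of the Boost calls above.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts non-overlapping matches of `pattern` in `text`.
std::size_t count_matches(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    const std::sregex_iterator first(text.begin(), text.end(), re);
    const std::sregex_iterator last;  // default-constructed = end of sequence
    return static_cast<std::size_t>(std::distance(first, last));
}
```

Calling count_matches("result = 22, result = 33", "result = ") should give 2, the same count the original loop produces.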

This is a match-counting function.
Much of the code is unnecessary; here is an equivalent using the standard library (the same works with Boost):
unsigned int count_match( std::string user_string, const std::string& user_pattern ){
const std::regex rx( user_pattern );
std::regex_token_iterator< std::string::const_iterator > first( user_string.begin(), user_string.end(), rx ), last;
return std::distance( first, last );
}
With std::regex_search (or the Boost equivalent) it can be written as:
unsigned int match_count( std::string user_string, const std::string& user_pattern ){
unsigned int counter = 0;
std::match_results< std::string::const_iterator > match_result;
std::regex regex( user_pattern );
while( std::regex_search( user_string, match_result, regex ) ){
user_string = match_result.suffix().str();
++counter;
}
return counter;
}
Notes:
There is no need for this part, because regex_search can take the string directly:
std::string::const_iterator start, end;
start = tempStr.begin();
end = tempStr.end();
Also,
boost::match_results<std::string::const_iterator> what;
can be written as
boost::smatch what; // a typedef of match_results<std::string::const_iterator>
There is no need for
boost::regex_constants::_match_flags flags = boost::match_default;
because regex_search uses this flag by default.
The line
start = what[0].second;
advances the search to the end of the previous match; with the string overload the same can be done with:
user_string = match_result.suffix().str();
If you want to see what happens in the while loop, add this code to its body:
std::cout << "prefix: '" << what.prefix().str() << "'\n";
std::cout << "match : '" << what.str() << "'\n";
std::cout << "suffix: '" << what.suffix().str() << "'\n";
std::cout << "------------------------------\n";

Related

How can I trim empty/whitespace lines?

I have to process badly mismanaged text with creative indentation. I want to remove the empty (or whitespace) lines at the beginning and end of my text without touching anything else; meaning that if the first or last actual lines respectively begin or end with whitespace, these will stay.
For example, this:
<lines, empty or with whitespaces ...>
<text, maybe preceded by whitespace>
<lines with or without text...>
<text, maybe followed by whitespace>
<lines, empty or with whitespaces ...>
turns to
<text, maybe preceded by whitespace>
<lines with or without text...>
<text, maybe followed by whitespace>
preserving the spaces at the beginning and the end of the actual text lines (the text might also be entirely whitespace)
A regex replacing (\A\s*(\r\n|\Z)|\r\n\s*\Z) by emptiness does exactly what I want, but regex is kind of overkill, and I fear it might cost me some time when processing texts with a lot of lines but not much to trim.
On the other hand, an explicit algorithm is easy to make (just read until a non-whitespace/the end while remembering the last line feed, then truncate, and do the same backwards) but it feels like I'm missing something obvious.
How can I do this?

As you can see from this discussion, trimming whitespace requires a lot of work in C++. This should definitely be included in the standard library.
Anyway, I've checked how to do it as simply as possible, but nothing comes near the compactness of RegEx. For speed, it's a different story.
Below are three versions of a program that performs the required task: one with regex, one with std functions, and one with just a couple of indexes. The last one can also be made faster because you can avoid copying altogether, but I left the copy in for a fair comparison:
#include <string>
#include <sstream>
#include <chrono>
#include <iostream>
#include <regex>
#include <exception>
struct perf {
std::chrono::steady_clock::time_point start_;
perf() : start_(std::chrono::steady_clock::now()) {}
double elapsed() const {
auto stop = std::chrono::steady_clock::now();
std::chrono::duration<double> elapsed_seconds = stop - start_;
return elapsed_seconds.count();
}
};
std::string Generate(size_t line_len, size_t empty, size_t nonempty) {
std::string es(line_len, ' ');
es += '\n';
for (size_t i = 0; i < empty; ++i) {
es += es;
}
std::string nes(line_len - 1, ' ');
es += "a\n";
for (size_t i = 0; i < nonempty; ++i) {
nes += nes;
}
return es + nes + es;
}
int main()
{
std::string test;
//test = " \n\t\n \n \tTEST\n\tTEST\n\t\t\n TEST\t\n \t\n \n ";
std::cout << "Generating...";
std::cout.flush();
test = Generate(1000, 8, 10);
std::cout << " done." << std::endl;
std::cout << "Test 1...";
std::cout.flush();
perf p1;
std::string out1;
std::regex re(R"(^\s*\n|\n\s*$)");
try {
out1 = std::regex_replace(test, re, "");
}
catch (std::exception& e) {
std::cout << e.what() << std::endl;
}
std::cout << " done. Elapsed time: " << p1.elapsed() << "s" << std::endl;
std::cout << "Test 2...";
std::cout.flush();
perf p2;
std::stringstream is(test);
std::string line;
while (std::getline(is, line) && line.find_first_not_of(" \t\n\v\f\r") == std::string::npos);
std::string out2 = line;
size_t end = out2.size();
while (std::getline(is, line)) {
out2 += '\n';
out2 += line;
if (line.find_first_not_of(" \t\n\v\f\r") != std::string::npos) {
end = out2.size();
}
}
out2.resize(end);
std::cout << " done. Elapsed time: " << p2.elapsed() << "s" << std::endl;
if (out1 == out2) {
std::cout << "out1 == out2\n";
}
else {
std::cout << "out1 != out2\n";
}
std::cout << "Test 3...";
std::cout.flush();
perf p3;
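// Lookup table indexed by character code: 0 marks whitespace (codes 9-13 and 32), 1 marks everything else
static bool whitespace_table[] = {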
1,1,1,1,1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
};
size_t sfl = 0; // Start of first line
for (size_t i = 0, end = test.size(); i < end; ++i) {
if (test[i] == '\n') {
sfl = i + 1;
}
else if (whitespace_table[(unsigned char)test[i]]) {
break;
}
}
size_t ell = test.size(); // End of last line
for (size_t i = test.size(); i-- > 0;) {
if (test[i] == '\n') {
ell = i;
}
else if (whitespace_table[(unsigned char)test[i]]) {
break;
}
}
std::string out3 = test.substr(sfl, ell - sfl);
std::cout << " done. Elapsed time: " << p3.elapsed() << "s" << std::endl;
if (out1 == out3) {
std::cout << "out1 == out3\n";
}
else {
std::cout << "out1 != out3\n";
}
return 0;
}
Running it on C++ Shell you get these timings:
Generating... done.
Test 1... done. Elapsed time: 4.2288s
Test 2... done. Elapsed time: 0.0077323s
out1 == out2
Test 3... done. Elapsed time: 0.000695783s
out1 == out3
If performance matters, it's best to measure with your real files.
As a side note, this regex doesn't work on MSVC: I couldn't find a way to stop ^ and $ from matching at the start and end of every line, that is, to disable multiline mode. If you run it there, it throws an exception saying regex_error(error_complexity): The complexity of an attempted match against a regular expression exceeded a pre-set level.
I think I'll ask how to cope with this!
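The remark about avoiding the copy altogether can be sketched as follows. This is only an illustration under my own assumptions: the names Range and content_range are hypothetical, whitespace is decided by std::isspace instead of the lookup table, and a text that is entirely whitespace is reported as an empty range.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Half-open range [begin, end) of offsets into the original text.
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical helper: finds the part of `text` that remains after dropping
// leading and trailing whitespace-only lines, without copying any characters.
Range content_range(const std::string& text) {
    auto is_space = [](char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    };

    std::size_t first = 0;  // start of the first line that has content
    std::size_t i = 0;
    for (; i < text.size(); ++i) {
        if (text[i] == '\n') first = i + 1;
        else if (!is_space(text[i])) break;
    }
    if (i == text.size()) return {0, 0};  // nothing but whitespace

    std::size_t last = text.size();  // end of the last line that has content
    for (std::size_t j = text.size(); j-- > 0;) {
        if (text[j] == '\n') last = j;
        else if (!is_space(text[j])) break;
    }
    return {first, last};
}
```

The caller can then work on the offsets directly, or call text.substr(r.begin, r.end - r.begin) only when a separate string is really needed.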

If whitespace in front of the first line or after the last non-whitespace-only line can be removed then this answer https://stackoverflow.com/a/217605/14258355 will suffice.
However, due to this constraint and if you do not want to use regex, I would propose to convert the string into lines and then build the string back up again from the first to the last non-whitespace-only line.
Here is a working example: https://godbolt.org/z/rozxj6saj
Convert the string to lines:
std::vector<std::string> StringToLines(const std::string &s) {
// Create vector with lines (not using input stream to keep line break
// characters)
std::vector<std::string> result;
std::string line;
for (auto c : s) {
line.push_back(c);
// Check for line break
if (c == '\n' || c == '\r') {
result.push_back(line);
line.clear();
}
}
// add last bit
result.push_back(line);
return result;
}
Build the string from the first to the last non-whitespace-only line:
bool IsNonWhiteSpaceString(const std::string &s) {
return s.end() != std::find_if(s.begin(), s.end(), [](unsigned char uc) {
return !std::isspace(uc);
});
}
std::string TrimVectorEmptyEndsIntoString(const std::vector<std::string> &v) {
std::string result;
// Find first non-whitespace line
auto it_begin = std::find_if(v.begin(), v.end(), [](const std::string &s) {
return IsNonWhiteSpaceString(s);
});
// Find last non-whitespace line
auto it_end = std::find_if(v.rbegin(), v.rend(), [](const std::string &s) {
return IsNonWhiteSpaceString(s);
});
// Build the string (there is nothing to append if every line is whitespace-only)
if (it_begin == v.end()) {
return result;
}
for (auto it = it_begin; it != it_end.base(); ++it) {
result.append(*it);
}
return result;
}
Usage example:
// Create a test string
std::string test_string(
" \n\t\n \n TEST\n\tTEST\n\t\tTEST\n TEST\t\n \t");
// Output result
std::cout << TrimVectorEmptyEndsIntoString(StringToLines(test_string));

C++ split string using a list of words as separators

I would like to split a string like this one
"this1245is#g$0,therhsuidthing345"
using a list of words like the one below
{"this", "is", "the", "thing"}
into this list
{"this", "1245", "is", "#g$0,", "the", "rhsuid", "thing", "345"}
// ^--------------^---------------^------------------^-- these were the delimiters
The delimiters are allowed to appear more than once in the string to split, and it can be done using regular expressions
The precedence is in the order in which the delimiters appear in the array
The platform I'm developing for has no support for the Boost library
Update
This is what I have for the moment
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("this1245is#g$0,therhsuidthing345");
std::string delimiters[] = {"this", "is", "the", "thing"};
for (int i=0; i<4; i++) {
std::string delimiter = "(" + delimiters[i] + ")(.*)";
std::regex e (delimiter); // matches words beginning by the i-th delimiter
// default constructor = end-of-sequence:
std::sregex_token_iterator rend;
std::cout << "1st and 2nd submatches:";
int submatches[] = { 1, 2 };
std::sregex_token_iterator c ( s.begin(), s.end(), e, submatches );
while (c!=rend) std::cout << " [" << *c++ << "]";
std::cout << std::endl;
}
return 0;
}
output:
1st and 2nd submatches:[this][x1245fisA#g$0,therhsuidthing345]
1st and 2nd submatches:[is][x1245fisA#g$0,therhsuidthing345]
1st and 2nd submatches:[the][rhsuidthing345]
1st and 2nd submatches:[thing][345]
I think I need to make some recursive thing to call on each iteration

Build the expression you want for matches only (re), then pass in {-1, 0} to your std::sregex_token_iterator to return all non-matches (-1) and matches (0).
#include <iostream>
#include <regex>
int main() {
std::string s("this1245is#g$0,therhsuidthing345");
std::regex re("(this|is|the|thing)");
std::sregex_token_iterator iter(s.begin(), s.end(), re, { -1, 0 });
std::sregex_token_iterator end;
while (iter != end) {
//Works in vc13, clang requires you increment separately,
//haven't gone into implementation to see if/how ssub_match is affected.
//Workaround: increment separately.
//std::cout << "[" << *iter++ << "] ";
std::cout << "[" << *iter << "] ";
++iter;
}
}

I don't know how to perform the precedence requirement. This seems to work on the given input:
std::vector<std::string> parse (std::string s)
{
std::vector<std::string> out;
std::regex re("(this|is|the|thing).*");
std::string word;
auto i = s.begin();
while (i != s.end()) {
std::match_results<std::string::iterator> m;
if (std::regex_match(i, s.end(), m, re)) {
if (!word.empty()) {
out.push_back(word);
word.clear();
}
out.push_back(std::string(m[1].first, m[1].second));
i += out.back().size();
} else {
word += *i++;
}
}
if (!word.empty()) {
out.push_back(word);
}
return out;
}
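On the precedence requirement, one possible approach is to build the alternation from the word list in array order. ECMAScript alternation tries alternatives left to right, so earlier words should win when several could match at the same position; that is my reading of the requirement, not something the question confirms. The sketch below uses only the standard library, and the names escape_regex and split_by_words are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Escapes characters that have a special meaning in ECMAScript regexes.
std::string escape_regex(const std::string& word) {
    static const std::string special = R"(\^$.|?*+()[]{})";
    std::string out;
    for (char c : word) {
        if (special.find(c) != std::string::npos) out += '\\';
        out += c;
    }
    return out;
}

// Hypothetical helper: splits `s` at every occurrence of one of `delimiters`
// and keeps the delimiters themselves in the result.
std::vector<std::string> split_by_words(const std::string& s,
                                        const std::vector<std::string>& delimiters) {
    std::vector<std::string> tokens;
    if (delimiters.empty()) {
        if (!s.empty()) tokens.push_back(s);
        return tokens;
    }
    // Join the words with '|' in the given order.
    std::string alternation;
    for (const std::string& d : delimiters) {
        if (!alternation.empty()) alternation += '|';
        alternation += escape_regex(d);
    }
    const std::regex re(alternation);
    // -1 selects the text between matches, 0 the matches themselves.
    std::sregex_token_iterator it(s.begin(), s.end(), re, {-1, 0});
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        if (it->length() > 0) tokens.push_back(it->str());  // skip empty pieces
    }
    return tokens;
}
```

With the string and word list from the question, this should produce the eight tokens listed there. Empty delimiter words are not handled and would need to be filtered out first.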

vector<string> strs;
boost::split(strs,line,boost::is_space());

How to match multiple results using std::regex

For example, if I have a string like "first second third forth", I want to match every single word in one operation and output them one by one.
I thought that "(\\b\\S*\\b){0,}" would work, but it did not.
What should I do?
Here's my code:
#include<iostream>
#include<regex>
#include<string>
using namespace std;
int main()
{
regex exp("(\\b\\S*\\b)");
smatch res;
string str = "first second third forth";
regex_search(str, res, exp);
cout << res[0] <<" "<<res[1]<<" "<<res[2]<<" "<<res[3]<< endl;
}

Simply iterate over your string while regex_searching, like this:
{
regex exp("(\\b\\S*\\b)");
smatch res;
string str = "first second third forth";
string::const_iterator searchStart( str.cbegin() );
while ( regex_search( searchStart, str.cend(), res, exp ) )
{
cout << ( searchStart == str.cbegin() ? "" : " " ) << res[0];
searchStart = res.suffix().first;
}
cout << endl;
}

This can be done with the C++11 regex library, in two ways.
First, you can use () in the regex to define captures (subexpressions), like this:
string var = "first second third forth";
const regex r("(.*) (.*) (.*) (.*)");
smatch sm;
if (regex_search(var, sm, r)) {
for (int i=1; i<sm.size(); i++) {
cout << sm[i] << endl;
}
}
See it live: http://coliru.stacked-crooked.com/a/e1447c4cff9ea3e7
Second, you can use sregex_token_iterator:
string var = "first second third forth";
regex wsaq_re("\\s+");
copy( sregex_token_iterator(var.begin(), var.end(), wsaq_re, -1),
sregex_token_iterator(),
ostream_iterator<string>(cout, "\n"));
See it live: http://coliru.stacked-crooked.com/a/677aa6f0bb0612f0

sregex_token_iterator appears to be the ideal, efficient solution, but the example given in the selected answer leaves much to be desired. Instead, I found some great examples here:
http://www.cplusplus.com/reference/regex/regex_token_iterator/regex_token_iterator/
For your convenience, I've copy-pasted the sample code shown by that page. I claim no credit for the code.
// regex_token_iterator example
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("this subject has a submarine as a subsequence");
std::regex e ("\\b(sub)([^ ]*)"); // matches words beginning by "sub"
// default constructor = end-of-sequence:
std::regex_token_iterator<std::string::iterator> rend;
std::cout << "entire matches:";
std::regex_token_iterator<std::string::iterator> a ( s.begin(), s.end(), e );
while (a!=rend) std::cout << " [" << *a++ << "]";
std::cout << std::endl;
std::cout << "2nd submatches:";
std::regex_token_iterator<std::string::iterator> b ( s.begin(), s.end(), e, 2 );
while (b!=rend) std::cout << " [" << *b++ << "]";
std::cout << std::endl;
std::cout << "1st and 2nd submatches:";
int submatches[] = { 1, 2 };
std::regex_token_iterator<std::string::iterator> c ( s.begin(), s.end(), e, submatches );
while (c!=rend) std::cout << " [" << *c++ << "]";
std::cout << std::endl;
std::cout << "matches as splitters:";
std::regex_token_iterator<std::string::iterator> d ( s.begin(), s.end(), e, -1 );
while (d!=rend) std::cout << " [" << *d++ << "]";
std::cout << std::endl;
return 0;
}
Output:
entire matches: [subject] [submarine] [subsequence]
2nd submatches: [ject] [marine] [sequence]
1st and 2nd submatches: [sub] [ject] [sub] [marine] [sub] [sequence]
matches as splitters: [this ] [ has a ] [ as a ]

You could use the suffix() function, and search again until you don't find a match:
int main()
{
regex exp("(\\b\\S*\\b)");
smatch res;
string str = "first second third forth";
while (regex_search(str, res, exp)) {
cout << res[0] << endl;
str = res.suffix();
}
}

My code will capture all groups in all matches:
vector<vector<string>> U::String::findEx(const string& s, const string& reg_ex, bool case_sensitive)
{
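// icase must be set when the search is NOT case sensitive
regex rx(reg_ex, case_sensitive ? regex_constants::ECMAScript : regex_constants::ECMAScript | regex_constants::icase);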
vector<vector<string>> captured_groups;
vector<string> captured_subgroups;
const std::sregex_token_iterator end_i;
for (std::sregex_token_iterator i(s.cbegin(), s.cend(), rx);
i != end_i;
++i)
{
captured_subgroups.clear();
string group = *i;
smatch res;
if(regex_search(group, res, rx))
{
for(unsigned i=0; i<res.size() ; i++)
captured_subgroups.push_back(res[i]);
if(captured_subgroups.size() > 0)
captured_groups.push_back(captured_subgroups);
}
}
return captured_groups;
}
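A possible simplification, offered as a sketch rather than a drop-in replacement: std::sregex_iterator yields a full match_results object for every match, so the groups can be read directly without searching each match a second time. The function name find_all_groups and its default argument are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: for every match of `pattern` in `s`, returns the whole
// match (index 0) followed by its capture groups.
std::vector<std::vector<std::string>>
find_all_groups(const std::string& s, const std::string& pattern,
                bool case_sensitive = true) {
    std::regex::flag_type flags = std::regex::ECMAScript;
    if (!case_sensitive) flags = flags | std::regex::icase;
    const std::regex rx(pattern, flags);

    std::vector<std::vector<std::string>> result;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(s.begin(), s.end(), rx); it != end; ++it) {
        std::vector<std::string> groups;
        for (std::size_t g = 0; g < it->size(); ++g) {
            groups.push_back((*it)[g].str());
        }
        result.push_back(std::move(groups));
    }
    return result;
}
```

For example, find_all_groups("a=1, b=22", "(\\w+)=(\\d+)") should return two entries, each holding the whole match followed by its two groups.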

My reading of the documentation is that regex_search searches for the first match only, and that none of the functions in std::regex do a "scan" as you are looking for. However, the Boost library seems to support this, as described in "C++ tokenize a string using a regular expression".

How to use sregex_token_iterator

I'm trying to use regular expressions to parse an SQL statement, but I'm confused by the behavior of sregex_token_iterator.
My functions f() and g() look similar, yet f() prints two matches while g() prints only one:
Here is f():
void f()
{
cout << "in f()" << endl;
string str = " where a <= 2 and b = 2";
smatch result;
regex pattern("(\\w+\\s*(<|=|>|<>|<=|>=)\\s*\\w+)");
const sregex_token_iterator end;
for (sregex_token_iterator it(str.begin(), str.end(), pattern); it != end; it ++)
{
cout << *it << endl;
}
}
Here is g():
void g()
{
cout << "in g()" << endl;
string str = " where a <= 2 and b = 2";
smatch result;
regex pattern("(\\w+\\s*(<|=|>|<>|<=|>=)\\s*\\w+)");
const sregex_token_iterator end;
for (sregex_token_iterator it(str.begin(), str.end(), pattern); it != end; it ++)
{
cout << *it << endl;
string cur = *it;
pattern = "(\\w+)\\s*<>\\s*(\\w+)";
if ( regex_match(cur, result, pattern) )
{
// cout <<"<>" << endl;
}
pattern = "(\\w+)\\s*=\\s*(\\w+)";
if ( regex_match(cur, result, pattern) ){}
pattern = "(\\w+)\\s*<\\s*(\\w+)";
if ( regex_match(cur, result, pattern) ){}
pattern = "(\\w+)\\s*>\\s*(\\w+)";
if ( regex_match(cur, result, pattern) ){}
pattern = "(\\w+)\\s*<=\\s*(\\w+)";
if ( regex_match(cur, result, pattern) ){}
pattern = "(\\w+)\\s*>=\\s*(\\w+)";
if ( regex_match(cur, result, pattern) ){}
}
}
My guess is that the variable end ("const sregex_token_iterator end;") is changed in g(), or that the loop condition in the for statement fails after it++.
If so, how does that happen?
And what should I do to fix it?

sregex_token_iterator stores a pointer to pattern, not a copy. You are changing the regular expression right from under the iterator.
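One way to avoid the problem, sketched under my own assumptions, is to keep the iterator's regex in a const object that is never reassigned and to use a second regex for taking each token apart. The names Condition and find_conditions are hypothetical, and the six per-operator patterns from g() are folded into a single alternation for brevity.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// One parsed comparison such as "a <= 2".
struct Condition {
    std::string lhs, op, rhs;
};

// Hypothetical helper: extracts every "name op value" comparison from `sql`.
std::vector<Condition> find_conditions(const std::string& sql) {
    // Used by the token iterator; it must stay unchanged while the loop runs.
    const std::regex condition(R"(\w+\s*(?:<>|<=|>=|<|=|>)\s*\w+)");
    // A separate regex for splitting each token into its parts.
    const std::regex parts(R"((\w+)\s*(<>|<=|>=|<|=|>)\s*(\w+))");

    std::vector<Condition> out;
    const std::sregex_token_iterator end;
    for (std::sregex_token_iterator it(sql.begin(), sql.end(), condition);
         it != end; ++it) {
        const std::string cur = it->str();
        std::smatch m;
        if (std::regex_match(cur, m, parts)) {
            out.push_back({m[1].str(), m[2].str(), m[3].str()});
        }
    }
    return out;
}
```

Because condition is const, any later attempt to assign a new pattern to it inside the loop fails to compile, which catches the mistake from g() early.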

C++ Boost: Split String

How can I split a string with Boost with a regex AND have the delimiter included in the result list?
for example, if I have the string "1d2" and my regex is "[a-z]" I want the results in a vector with (1, d, 2)
I have:
std::string expression = "1d2";
boost::regex re("[a-z]");
boost::sregex_token_iterator i (expression.begin (),
expression.end (),
re);
boost::sregex_token_iterator j;
std::vector <std::string> splitResults;
std::copy (i, j, std::back_inserter (splitResults));
Thanks

I think you cannot directly extract the delimiters using boost::regex. You can, however, extract the position where the regex is found in your string:
std::string expression = "1a234bc";
boost::regex re("[a-z]");
boost::sregex_iterator i(
expression.begin (),
expression.end (),
re);
boost::sregex_iterator j;
for(; i!=j; ++i) {
std::cout << (*i).position() << " : " << (*i) << std::endl;
}
This example would show:
1 : a
5 : b
6 : c
Using this information, you can extract the delimiters from your original string:
std::string expression = "1a234bc43";
boost::regex re("[a-z]");
boost::sregex_iterator i(
expression.begin (),
expression.end (),
re);
boost::sregex_iterator j;
size_t pos=0;
for(; i!=j;++i) {
std::string pre_delimiter = expression.substr(pos, (*i).position()-pos);
std::cout << pre_delimiter << std::endl;
std::cout << (*i) << std::endl;
pos = (*i).position() + (*i).length(); // length() is the length of the matched text; size() is the number of sub-matches
}
std::string last_delimiter = expression.substr(pos);
std::cout << last_delimiter << std::endl;
This example would show:
1
a
234
b

c
43
There is an empty string between b and c because no text separates the two delimiters.
