How to match multiple results using std::regex - c++

For example, if I have a string like "first second third forth", I want to match every single word in one operation and output them one by one.
I thought that "(\\b\\S*\\b){0,}" would work, but it did not.
What should I do?
Here's my code:
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
regex exp("(\\b\\S*\\b)");
smatch res;
string str = "first second third forth";
regex_search(str, res, exp);
cout << res[0] <<" "<<res[1]<<" "<<res[2]<<" "<<res[3]<< endl;
}

Simply iterate over your string while regex_searching, like this:
int main()
{
regex exp("(\\b\\S*\\b)");
smatch res;
string str = "first second third forth";
string::const_iterator searchStart( str.cbegin() );
while ( regex_search( searchStart, str.cend(), res, exp ) )
{
cout << ( searchStart == str.cbegin() ? "" : " " ) << res[0];
searchStart = res.suffix().first;
}
cout << endl;
}

This can be done with the regex library in C++11.
Two methods:
You can use () in the regex to define your captures (sub-expressions). Note that this requires knowing the number of words in advance.
Like this:
string var = "first second third forth";
const regex r("(.*) (.*) (.*) (.*)");
smatch sm;
if (regex_search(var, sm, r)) {
for (size_t i = 1; i < sm.size(); i++) {
cout << sm[i] << endl;
}
}
See it live: http://coliru.stacked-crooked.com/a/e1447c4cff9ea3e7
You can use sregex_token_iterator():
string var = "first second third forth";
regex wsaq_re("\\s+");
copy( sregex_token_iterator(var.begin(), var.end(), wsaq_re, -1),
sregex_token_iterator(),
ostream_iterator<string>(cout, "\n"));
See it live: http://coliru.stacked-crooked.com/a/677aa6f0bb0612f0

sregex_token_iterator appears to be the ideal, efficient solution, but the example given in the selected answer leaves much to be desired. Instead, I found some great examples here:
http://www.cplusplus.com/reference/regex/regex_token_iterator/regex_token_iterator/
For your convenience, I've copy-pasted the sample code shown by that page. I claim no credit for the code.
// regex_token_iterator example
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("this subject has a submarine as a subsequence");
std::regex e ("\\b(sub)([^ ]*)"); // matches words beginning by "sub"
// default constructor = end-of-sequence:
std::regex_token_iterator<std::string::iterator> rend;
std::cout << "entire matches:";
std::regex_token_iterator<std::string::iterator> a ( s.begin(), s.end(), e );
while (a!=rend) std::cout << " [" << *a++ << "]";
std::cout << std::endl;
std::cout << "2nd submatches:";
std::regex_token_iterator<std::string::iterator> b ( s.begin(), s.end(), e, 2 );
while (b!=rend) std::cout << " [" << *b++ << "]";
std::cout << std::endl;
std::cout << "1st and 2nd submatches:";
int submatches[] = { 1, 2 };
std::regex_token_iterator<std::string::iterator> c ( s.begin(), s.end(), e, submatches );
while (c!=rend) std::cout << " [" << *c++ << "]";
std::cout << std::endl;
std::cout << "matches as splitters:";
std::regex_token_iterator<std::string::iterator> d ( s.begin(), s.end(), e, -1 );
while (d!=rend) std::cout << " [" << *d++ << "]";
std::cout << std::endl;
return 0;
}
Output:
entire matches: [subject] [submarine] [subsequence]
2nd submatches: [ject] [marine] [sequence]
1st and 2nd submatches: [sub] [ject] [sub] [marine] [sub] [sequence]
matches as splitters: [this ] [ has a ] [ as a ]

You could use the suffix() function, and search again until you don't find a match:
int main()
{
regex exp("(\\b\\S*\\b)");
smatch res;
string str = "first second third forth";
while (regex_search(str, res, exp)) {
cout << res[0] << endl;
str = res.suffix();
}
}

My code will capture all groups in all matches:
vector<vector<string>> U::String::findEx(const string& s, const string& reg_ex, bool case_sensitive)
{
// icase must be set when the search is NOT case sensitive
regex rx(reg_ex, case_sensitive ? regex_constants::ECMAScript
: regex_constants::ECMAScript | regex_constants::icase);
vector<vector<string>> captured_groups;
vector<string> captured_subgroups;
const std::sregex_token_iterator end_i;
for (std::sregex_token_iterator i(s.cbegin(), s.cend(), rx);
i != end_i;
++i)
{
captured_subgroups.clear();
string group = *i;
smatch res;
// re-run the regex on the matched text to recover its sub-groups
if(regex_search(group, res, rx))
{
for(size_t j=0; j<res.size(); j++)
captured_subgroups.push_back(res[j]);
if(!captured_subgroups.empty())
captured_groups.push_back(captured_subgroups);
}
}
return captured_groups;
}

My reading of the documentation is that regex_search searches for the first match and that none of the functions in std::regex do a "scan" like the one you are looking for. However, the Boost library seems to support this, as described in "C++ tokenize a string using a regular expression".
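For what it's worth, the standard library does provide std::sregex_iterator, which walks over successive non-overlapping matches. A minimal sketch, assuming C++11; the helper name find_all and the pattern \S+ are made up for illustration:

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every non-overlapping match of `pattern` in `text`, in order.
// `find_all` is a hypothetical helper name, not part of the standard library.
std::vector<std::string> find_all(const std::string& text, const std::regex& pattern)
{
    std::vector<std::string> matches;
    // sregex_iterator repeatedly applies regex_search, resuming after each match.
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it)
        matches.push_back(it->str());   // it->str() is the whole match (group 0)
    return matches;
}
```

Calling find_all("first second third forth", std::regex("\\S+")) should yield the four words in order.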

Related

C++ regex library

I have this sample code
// regex_search example
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("eritueriotu3498 \"pi656\" sdfs3646df");
std::smatch m;
std::string reg("\\(?<=pi\\)\\(\\d+\\)\\(?=\"\\)");
std::regex e (reg);
std::cout << "Target sequence: " << s << std::endl;
std::cout << "The following matches and submatches were found:" << std::endl;
while (std::regex_search (s,m,e)) {
for (auto x:m) std::cout << x << " ";
std::cout << std::endl;
s = m.suffix().str();
}
return 0;
}
I need to get the number between pi and " -> (piMYNUMBER")
In an online regex service my regex (?<=pi)(\d+)(?=") works fine, but the C++ regex doesn't match anything.
Does anyone know what is wrong with my expression?
Best regards
That is correct, C++ std::regex flavors do not support lookbehinds. You need to capture the digits between pi and ":
#include <iostream>
#include <regex>
#include <string>
#include <vector>
int main() {
std::string s ("eritueriotu3498 \"pi656\" sdfs3646df");
std::smatch m;
std::string reg("pi(\\d+)\""); // Or, with a raw string literal:
// std::string reg(R"(pi(\d+)\")");
std::regex e (reg);
std::vector<std::string> results(std::sregex_token_iterator(s.begin(), s.end(), e, 1),
std::sregex_token_iterator());
// Demo printing the results:
std::cout << "Number of matches: " << results.size() << std::endl;
for( auto & p : results ) std::cout << p << std::endl;
return 0;
}
See the C++ demo. Output:
Number of matches: 1
656
Here, pi(\d+)" pattern matches
pi - a literal substring
(\d+) - captures 1+ digits into Group 1
" - consumes a double quote.
Note the fourth argument to std::sregex_token_iterator: it is 1 because you need to collect only the Group 1 values.

Find an exact substr in a string

I have a text file which contains the following text
License = "123456"
GeneralLicense = "56475655"
I want to search for License as well as for GeneralLicense.
while (getline(FileStream, CurrentReadLine))
{
if (CurrentReadLine.find("License") != std::string::npos)
{
std::cout << "License Line: " << CurrentReadLine;
}
if (CurrentReadLine.find("GeneralLicense") != std::string::npos)
{
std::cout << "General License Line: " << CurrentReadLine;
}
}
Since the word License is also present in the word GeneralLicense, the condition in the line if (CurrentReadLine.find("License") != std::string::npos) is true for both lines.
How can I specify that I want to search for the exact sub-string?
UPDATE: I can reverse the order as mentioned by some answers, or check whether License is at index zero. But isn't there anything ROBUST (a flag or something) that we can specify to look for the exact match (something like what most editors have, e.g. MS Word)?
while (getline(FileStream, CurrentReadLine))
{
if (CurrentReadLine.find("GeneralLicense") != std::string::npos)
{
std::cout << "General License Line: " << CurrentReadLine;
}
else if (CurrentReadLine.find("License") != std::string::npos)
{
std::cout << "License Line: " << CurrentReadLine;
}
}
The more ROBUST search is called a regex:
#include <regex>
while (getline(FileStream, CurrentReadLine))
{
if(std::regex_match(CurrentReadLine,
std::regex(".*\\bLicense\\b.*=.*")))
{
std::cout << "License Line: " << CurrentReadLine << std::endl;
}
if(std::regex_match(CurrentReadLine,
std::regex(".*\\bGeneralLicense\\b.*=.*")))
{
std::cout << "General License Line: " << CurrentReadLine << std::endl;
}
}
The \b escape sequences denote word boundaries.
.* means "any sequence of characters, including zero characters"
EDIT: You could also use regex_search instead of regex_match to search for substrings that match instead of using .* to cover the parts that don't match:
#include <regex>
while (getline(FileStream, CurrentReadLine))
{
if(std::regex_search(CurrentReadLine, std::regex("\\bLicense\\b")))
{
std::cout << "License Line: " << CurrentReadLine << std::endl;
}
if(std::regex_search(CurrentReadLine, std::regex("\\bGeneralLicense\\b")))
{
std::cout << "General License Line: " << CurrentReadLine << std::endl;
}
}
This more closely matches your code, but note that it will get tripped up if the keywords are also found after the equals sign. If you want maximum robustness, use regex_match and specify exactly what the whole line should match.
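As a rough sketch of the "specify exactly what the whole line should match" idea, the pattern below describes the complete key = "value" line and captures the value. The helper name parse_setting and the exact line shape are assumptions based on the sample file in the question:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true and fills `value` only if the whole line
// has the shape  key = "value"  with exactly the given key.
// Assumes `key` contains no regex metacharacters.
bool parse_setting(const std::string& line, const std::string& key, std::string& value)
{
    const std::regex re("\\s*" + key + "\\s*=\\s*\"([^\"]*)\"\\s*");
    std::smatch m;
    if (!std::regex_match(line, m, re))   // regex_match must consume the entire line
        return false;
    value = m[1].str();                   // group 1 is the text between the quotes
    return true;
}
```

Because regex_match has to cover the whole line, a GeneralLicense line is rejected when the key is License, and a keyword appearing after the equals sign cannot cause a false positive.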
You can check if the position at which the substring appears is at index zero, or that the character preceding the initial position is a space:
bool findAtWordBoundary(const std::string& line, const std::string& search) {
size_t pos = line.find(search);
return (pos != std::string::npos) && (pos == 0 || isspace(static_cast<unsigned char>(line[pos-1])));
}
Isn't there anything ROBUST (flag or something) which we can specify to look for the exact match?
In a way, find already looks for an exact match. However, it treats a string as a sequence of meaningless numbers that represent individual characters. That is why the std::string class lacks the concept of a "full word", which is present in other parts of the library, such as regular expressions.
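If you want a "whole word" check without regular expressions, one possible sketch is to test the characters on both sides of every occurrence. The name contains_whole_word and the definition of a word character (letter, digit or underscore) are assumptions for illustration:

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: true if `word` occurs in `line` with no letter, digit
// or underscore directly before or after it. Checks every occurrence.
bool contains_whole_word(const std::string& line, const std::string& word)
{
    if (word.empty()) return false;
    auto is_word_char = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };
    for (std::size_t pos = line.find(word); pos != std::string::npos;
         pos = line.find(word, pos + 1)) {
        const std::size_t after = pos + word.size();
        const bool left_ok  = pos == 0 || !is_word_char(line[pos - 1]);
        const bool right_ok = after == line.size() || !is_word_char(line[after]);
        if (left_ok && right_ok) return true;
    }
    return false;
}
```

With the sample lines from the question, License is found as a whole word only in the first line, while GeneralLicense is found only in the second.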
You could write a function that tests for the largest match first and then returns whatever information you want about the match.
Something a bit like:
// find the largest matching element from the set and return it
std::string find_one_of(std::set<std::string, std::greater<std::string>> const& tests, std::string const& s)
{
for(auto const& test: tests)
if(s.find(test) != std::string::npos)
return test;
return {};
}
int main()
{
std::string text = "abcdef";
auto found = find_one_of({"a", "abc", "ab"}, text);
std::cout << "found: " << found << '\n'; // prints "abc"
}
If all matches start at position 0 and none is a prefix of another, then the following might work:
if (CurrentReadLine.substr( 0, 7 ) == "License")
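The same prefix test can be written without the hard-coded length 7 and without creating a temporary substring. A small sketch; the free function starts_with is a hypothetical helper, not a standard library call:

```cpp
#include <string>

// Hypothetical helper: true if `line` begins with `key`.
// compare() looks at the first key.size() characters of `line` without copying them.
bool starts_with(const std::string& line, const std::string& key)
{
    return line.compare(0, key.size(), key) == 0;
}
```

If the line is shorter than the key, compare() sees a shorter substring and the result is simply false.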
You can tokenize your string and do a full comparison with your search key and the tokens
Example:
#include <string>
#include <sstream>
#include <vector>
#include <iostream>
auto tokenizer(const std::string& line)
{
std::vector<std::string> results;
std::istringstream ss(line);
std::string s;
while(std::getline(ss, s, ' '))
results.push_back(s);
return results;
}
auto compare(const std::vector<std::string>& tokens, const std::string& key)
{
for (auto&& i : tokens)
if ( i == key )
return true;
return false;
}
int main()
{
std::string x = "License = \"12345\"";
auto token = tokenizer(x);
std::cout << compare(token, "License") << std::endl;
std::cout << compare(token, "GeneralLicense") << std::endl;
}

MSVC regular expression match

I am trying to match a literal number, e.g. 1600442 using a set of regular expressions in Microsoft Visual Studio 2010. My regular expressions are simply:
1600442|7654321
7895432
The problem is that both of the above match the string.
Implementing this in Python gives the expected result:
import re
serial = "1600442"
re1 = "1600442|7654321"
re2 = "7895432"
m = re.match(re1, serial)
if m:
print "found for re1"
print m.groups()
m = re.match(re2, serial)
if m:
print "found for re2"
print m.groups()
Gives output
found for re1
()
Which is what I expected. Using this code in C++ however:
#include <string>
#include <iostream>
#include <regex>
int main(){
std::string serial = "1600442";
std::tr1::regex re1("1600442|7654321");
std::tr1::regex re2("7895432");
std::tr1::smatch match;
std::cout << "re1:" << std::endl;
std::tr1::regex_search(serial, match, re1);
for (auto i = 0;i <match.length(); ++i)
std::cout << match[i].str().c_str() << " ";
std::cout << std::endl << "re2:" << std::endl;
std::tr1::regex_search(serial, match, re2);
for (auto i = 0;i <match.length(); ++i)
std::cout << match[i].str().c_str() << " ";
std::cout << std::endl;
std::string s;
std::getline (std::cin,s);
}
gives me:
re1:
1600442
re2:
1600442
which is not what I expected. Why do I get match here?
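When the second call to regex_search fails, the contents of the smatch are unspecified; here they appear to be left intact and still hold the first results. The loop also uses match.length(), which is the length of the whole match, where match.size(), the number of sub-matches, is intended. Check the return value of regex_search before reading the match.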
You can move the regex searching code to a separate method:
void FindMeText(const std::regex& re, const std::string& serial)
{
std::smatch match;
// only read the match object when the search succeeded
if (std::regex_search(serial, match, re))
for (std::size_t i = 0; i < match.size(); ++i)
std::cout << match[i].str() << " ";
std::cout << std::endl;
}
int main(){
std::string serial = "1600442";
std::regex re1("^(?:1600442|7654321)");
std::regex re2("^7895432");
std::cout << "re1:" << std::endl;
FindMeText(re1, serial);
std::cout << "re2:" << std::endl;
FindMeText(re2, serial);
std::cout << std::endl;
std::string s;
std::getline (std::cin,s);
}
Note that Python's re.match only looks for a match at the start of the string, so I suggest using ^ (start of string) at the beginning of each pattern.
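An alternative to anchoring each pattern with ^ is the match_continuous flag, which tells regex_search to accept only a match that begins at the first character. A minimal sketch, assuming C++11 std::regex rather than std::tr1; the helper name matches_at_start is made up for illustration:

```cpp
#include <regex>
#include <string>

// Hypothetical helper mimicking Python's re.match: succeeds only if a match
// starts at the first character of `text` (it need not reach the end).
bool matches_at_start(const std::string& text, const std::regex& re)
{
    std::smatch m;
    return std::regex_search(text, m, re, std::regex_constants::match_continuous);
}
```

This keeps the patterns themselves unchanged, which can be handy when they come from a configuration file.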

C++ split string using a list of words as separators

I would like to split a string like this one
“this1245is#g$0,therhsuidthing345”
using a list of words like the one below
{“this”, “is”, “the”, “thing”}
into this list
{“this”, “1245”, “is”, “#g$0,”, “the”, “rhsuid”, “thing”, “345”}
// ^--------------^---------------^------------------^-- these were the delimiters
The delimiters are allowed to appear more than once in the string to split, and it can be done using regular expressions
The precedence is in the order in which the delimiters appear in the array
The platform I'm developing for has no support for the Boost library
Update
This is what I have for the moment
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("this1245is#g$0,therhsuidthing345");
std::string delimiters[] = {"this", "is", "the", "thing"};
for (int i=0; i<4; i++) {
std::string delimiter = "(" + delimiters[i] + ")(.*)";
std::regex e (delimiter); // matches words beginning by the i-th delimiter
// default constructor = end-of-sequence:
std::sregex_token_iterator rend;
std::cout << "1st and 2nd submatches:";
int submatches[] = { 1, 2 };
std::sregex_token_iterator c ( s.begin(), s.end(), e, submatches );
while (c!=rend) std::cout << " [" << *c++ << "]";
std::cout << std::endl;
}
return 0;
}
output:
1st and 2nd submatches:[this][x1245fisA#g$0,therhsuidthing345]
1st and 2nd submatches:[is][x1245fisA#g$0,therhsuidthing345]
1st and 2nd submatches:[the][rhsuidthing345]
1st and 2nd submatches:[thing][345]
I think I need to make some recursive thing to call on each iteration
Build the expression you want for matches only (re), then pass in {-1, 0} to your std::sregex_token_iterator to return all non-matches (-1) and matches (0).
#include <iostream>
#include <regex>
int main() {
std::string s("this1245is#g$0,therhsuidthing345");
std::regex re("(this|is|the|thing)");
std::sregex_token_iterator iter(s.begin(), s.end(), re, { -1, 0 });
std::sregex_token_iterator end;
while (iter != end) {
//Works in vc13, clang requires you increment separately,
//haven't gone into implementation to see if/how ssub_match is affected.
//Workaround: increment separately.
//std::cout << "[" << *iter++ << "] ";
std::cout << "[" << *iter << "] ";
++iter;
}
}
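To avoid hard-coding the alternation, the pattern can be assembled from the delimiter list. The sketch below is one way to do it; split_keep is a hypothetical name, and it assumes the words contain no regex metacharacters. With the default ECMAScript grammar, alternatives should be tried left to right at each position, so listing the words in array order gives earlier words precedence when several could match at the same place:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `text` on any of `words`, keeping the words
// themselves in the output. Assumes the words contain no regex metacharacters.
std::vector<std::string> split_keep(const std::string& text,
                                    const std::vector<std::string>& words)
{
    if (words.empty()) return {text};
    // Join the words into "this|is|the|thing"; earlier words are tried first.
    std::string alternation;
    for (const auto& w : words) {
        if (!alternation.empty()) alternation += '|';
        alternation += w;
    }
    const std::regex re(alternation);

    std::vector<std::string> parts;
    // -1 selects the text between matches, 0 selects the matches themselves.
    std::sregex_token_iterator it(text.begin(), text.end(), re, {-1, 0}), end;
    for (; it != end; ++it)
        if (it->length() > 0)   // skip empty pieces, e.g. before a leading delimiter
            parts.push_back(it->str());
    return parts;
}
```

Empty pieces are dropped because the token iterator reports an empty prefix when the string starts with a delimiter or when two delimiters are adjacent.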
I don't know how to perform the precedence requirement. This seems to work on the given input:
std::vector<std::string> parse (std::string s)
{
std::vector<std::string> out;
std::regex re("(this|is|the|thing).*");
std::string word;
auto i = s.begin();
while (i != s.end()) {
std::match_results<std::string::iterator> m;
if (std::regex_match(i, s.end(), m, re)) {
if (!word.empty()) {
out.push_back(word);
word.clear();
}
out.push_back(std::string(m[1].first, m[1].second));
i += out.back().size();
} else {
word += *i++;
}
}
if (!word.empty()) {
out.push_back(word);
}
return out;
}
vector<string> strs;
boost::split(strs,line,boost::is_space());

C++ Boost: Split String

How can I split a string with Boost with a regex AND have the delimiter included in the result list?
For example, if I have the string "1d2" and my regex is "[a-z]", I want the results in a vector with (1, d, 2).
I have:
std::string expression = "1d2";
boost::regex re("[a-z]");
boost::sregex_token_iterator i (expression.begin (),
expression.end (),
re);
boost::sregex_token_iterator j;
std::vector <std::string> splitResults;
std::copy (i, j, std::back_inserter (splitResults));
Thanks
I think you cannot directly extract the delimiters using boost::regex. You can, however, extract the position where the regex is found in your string:
std::string expression = "1a234bc";
boost::regex re("[a-z]");
boost::sregex_iterator i(
expression.begin (),
expression.end (),
re);
boost::sregex_iterator j;
for(; i!=j; ++i) {
std::cout << (*i).position() << " : " << (*i) << std::endl;
}
This example would show:
1 : a
5 : b
6 : c
Using this information, you can extract the delimiters from your original string:
std::string expression = "1a234bc43";
boost::regex re("[a-z]");
boost::sregex_iterator i(
expression.begin (),
expression.end (),
re);
boost::sregex_iterator j;
size_t pos=0;
for(; i!=j;++i) {
std::string pre_delimiter = expression.substr(pos, (*i).position()-pos);
std::cout << pre_delimiter << std::endl;
std::cout << (*i) << std::endl;
pos = (*i).position() + (*i).length(); // length() is the matched text's length; size() would be the number of sub-matches
}
std::string last_delimiter = expression.substr(pos);
std::cout << last_delimiter << std::endl;
This example would show:
1
a
234
b
c
43
An empty string is also printed between b and c, because those two delimiters are adjacent with no text between them.