boost regex iterator returning empty string - c++

I am a beginner with regex in C++, and I was wondering why this code:
#include <iostream>
#include <string>
#include <boost/regex.hpp>
int main() {
std::string s = "? 8==2 : true ! false";
boost::regex re("\\?\\s+(.*)\\s*:\\s*(.*)\\s*\\!\\s*(.*)");
boost::sregex_token_iterator p(s.begin(), s.end(), re, -1); // iterate over the sequence with that regex
boost::sregex_token_iterator end; // default-constructed end-of-sequence marker
while (p != end)
std::cout << *p++ << '\n';
}
prints an empty string. I put the regex into regexTester and it matches the string correctly, but when I try to iterate over the matches here it returns nothing.

I think the tokenizer is actually meant to split text by some delimiter, and the delimiter is not included. Compare with std::regex_token_iterator:
std::regex_token_iterator is a read-only LegacyForwardIterator that accesses the individual sub-matches of every match of a regular expression within the underlying character sequence. It can also be used to access the parts of the sequence that were not matched by the given regular expression (e.g. as a tokenizer).
Indeed you invoke exactly this mode as per the docs:
if submatch is -1, then enumerates all the text sequences that did not match the expression re (that is to performs field splitting).
(emphasis mine).
So, just fix that:
for (boost::sregex_token_iterator p(s.begin(), s.end(), re), e; p != e;
++p)
{
boost::sub_match<std::string::const_iterator> const& current = *p;
if (current.matched) {
std::cout << std::quoted(current.str()) << '\n';
} else {
std::cout << "non matching" << '\n';
}
}
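To make the difference between the submatch arguments concrete, here is a minimal sketch. It assumes std::regex as a stand-in for Boost.Regex (the token iterator is documented the same way for this argument), and collect_tokens is a hypothetical helper name, not part of either library. The pattern is a slightly tightened variant of the one in the question.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: gather whatever a std::sregex_token_iterator yields
// for the requested submatch indices.
//   -1        -> the text BETWEEN matches (field splitting)
//    0        -> each whole match
//    1, 2, .. -> the corresponding capture group of each match
std::vector<std::string> collect_tokens(const std::string& s, const std::regex& re,
                                        const std::vector<int>& submatches) {
    std::vector<std::string> out;
    for (std::sregex_token_iterator it(s.begin(), s.end(), re, submatches), end; it != end; ++it)
        out.push_back(it->str());
    return out;
}
```

With the question's input, whose single match covers the whole string, asking for {-1} leaves only the empty prefix in front of that match, which is the empty string the question observed. Asking for {1, 2, 3} returns the three captured pieces instead.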
Other Observations
All the greedy Kleene-stars are a recipe for trouble. You won't ever find a second match, because the first one's .* at the end will by definition gobble up all remaining input.
Instead, make them non-greedy (.*?) and/or much more precise (like isolating some character set, or mandating non-space characters?).
boost::regex re(R"(\?\s+(.*?)\s*:\s*(.*?)\s*\!\s*(.*?))");
// Or, if you don't want raw string literals:
boost::regex re("\\?\\s+(.*?)\\s*:\\s*(.*?)\\s*\\!\\s*(.*?)");
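The effect of greedy versus non-greedy quantifiers can be checked by counting matches. This is a small sketch that assumes std::regex instead of Boost.Regex; count_matches is a hypothetical helper introduced only for illustration.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: count the non-overlapping matches of `pattern` in `text`.
std::size_t count_matches(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    return static_cast<std::size_t>(
        std::distance(std::sregex_iterator(text.begin(), text.end(), re),
                      std::sregex_iterator()));
}
```

For an input holding two ternaries back to back, the greedy pattern should report a single match that swallows both, while the non-greedy pattern reports two.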
Live Demo
#include <boost/regex.hpp>
#include <iomanip>
#include <iostream>
#include <string>
int main() {
using It = std::string::const_iterator;
std::string const s =
"? 8==2 : true ! false;"
"? 9==3 : 'book' ! 'library';";
boost::regex re(R"(\?\s+(.*?)\s*:\s*(.*?)\s*\!\s*(.*?))");
{
std::cout << "=== regex_search:\n";
boost::smatch results;
for (It b = s.begin(); boost::regex_search(b, s.end(), results, re); b = results[0].end()) {
std::cout << results.str() << "\n";
std::cout << "remain: " << std::quoted(std::string(results[0].second, s.end())) << "\n";
}
}
std::cout << "=== token iteration:\n";
for (boost::sregex_token_iterator p(s.begin(), s.end(), re), e; p != e;
++p)
{
boost::sub_match<It> const& current = *p;
if (current.matched) {
std::cout << std::quoted(current.str()) << '\n';
} else {
std::cout << "non matching" << '\n';
}
}
}
Prints
=== regex_search:
? 8==2 : true !
remain: "false;? 9==3 : 'book' ! 'library';"
? 9==3 : 'book' !
remain: "'library';"
=== token iteration:
"? 8==2 : true ! "
"? 9==3 : 'book' ! "
BONUS: Parser Expressions
Instead of abusing regexen to do parsing, you could generate a parser, e.g. using Boost Spirit:
#include <boost/spirit/home/x3.hpp>
#include <boost/fusion/adapted.hpp>
#include <iomanip>
#include <iostream>
#include <string>
#include <tuple>
#include <vector>
namespace x3 = boost::spirit::x3;
int main() {
std::string const s =
"? 8==2 : true ! false;"
"? 9==3 : 'book' ! 'library';";
using expression = std::string;
using ternary = std::tuple<expression, expression, expression>;
std::vector<ternary> parsed;
auto expr_ = x3::lexeme [+(x3::graph - ';')];
auto ternary_ = "?" >> expr_ >> ":" >> expr_ >> "!" >> expr_;
std::cout << "=== parser approach:\n";
if (x3::phrase_parse(begin(s), end(s), *x3::seek[ ternary_ ], x3::space, parsed)) {
for (auto [cond, e1, e2] : parsed) {
std::cout
<< " condition " << std::quoted(cond) << "\n"
<< " true expression " << std::quoted(e1) << "\n"
<< " else expression " << std::quoted(e2) << "\n"
<< "\n";
}
} else {
std::cout << "non matching" << '\n';
}
}
Prints
=== parser approach:
condition "8==2"
true expression "true"
else expression "false"
condition "9==3"
true expression "'book'"
else expression "'library'"
This is much more extensible, will easily support recursive grammars and will be able to synthesize a typed representation of your syntax tree, instead of just leaving you with scattered bits of string.

Related

C++ regex library

I have this sample code
// regex_search example
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("eritueriotu3498 \"pi656\" sdfs3646df");
std::smatch m;
std::string reg("\\(?<=pi\\)\\(\\d+\\)\\(?=\"\\)");
std::regex e (reg);
std::cout << "Target sequence: " << s << std::endl;
std::cout << "The following matches and submatches were found:" << std::endl;
while (std::regex_search (s,m,e)) {
for (auto x:m) std::cout << x << " ";
std::cout << std::endl;
s = m.suffix().str();
}
return 0;
}
I need to get the number between pi and " -> (piMYNUMBER")
In an online regex service my regex (?<=pi)(\d+)(?=") works fine, but the C++ regex doesn't match anything.
Does anyone know what is wrong with my expression?
Best regards
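That is correct: the regex flavors offered by C++ std::regex do not support lookbehinds (lookaheads are supported in the default ECMAScript grammar). Instead, match pi and the closing " as part of the pattern and capture the digits between them: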
#include <iostream>
#include <string>
#include <vector>
#include <regex>
int main() {
std::string s ("eritueriotu3498 \"pi656\" sdfs3646df");
std::smatch m;
std::string reg("pi(\\d+)\""); // Or, with a raw string literal:
// std::string reg(R"(pi(\d+)\")");
std::regex e (reg);
std::vector<std::string> results(std::sregex_token_iterator(s.begin(), s.end(), e, 1),
std::sregex_token_iterator());
// Demo printing the results:
std::cout << "Number of matches: " << results.size() << std::endl;
for( auto & p : results ) std::cout << p << std::endl;
return 0;
}
Output:
Number of matches: 1
656
Here, pi(\d+)" pattern matches
pi - a literal substring
(\d+) - captures 1+ digits into Group 1
" - consumes a double quote.
Note the fourth argument to std::sregex_token_iterator, it is 1 because you need to collect only Group 1 values.
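If the numeric value is needed rather than the text, the same capture-group idea can be written with std::sregex_iterator. This is only a sketch: numbers_after_pi is a hypothetical name, and it assumes the digits fit into an int (std::stoi throws std::out_of_range otherwise).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return every number found between `pi` and a double quote.
std::vector<int> numbers_after_pi(const std::string& text) {
    // Group 1 holds the digits; `pi` and the quote are matched but not captured.
    static const std::regex re(R"re(pi(\d+)")re");
    std::vector<int> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it)
        out.push_back(std::stoi((*it)[1].str()));
    return out;
}
```

Each iterator position is a full std::smatch, so (*it)[1] plays the role that the fourth constructor argument plays for the token iterator.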

Find an exact substr in a string

I have a text file which contains the following text
License = "123456"
GeneralLicense = "56475655"
I want to search for License as well as for GeneralLicense.
while (getline(FileStream, CurrentReadLine))
{
if (CurrentReadLine.find("License") != std::string::npos)
{
std::cout << "License Line: " << CurrentReadLine;
}
if (CurrentReadLine.find("GeneralLicense") != std::string::npos)
{
std::cout << "General License Line: " << CurrentReadLine;
}
}
Since the word License is also present in the word GeneralLicense, the if-statement in the line if (CurrentReadLine.find("License") != std::string::npos) becomes true for both lines.
How can I specify that I want to search for the exact sub-string?
UPDATE: I can reverse the order as mentioned by some answers, or check whether License is at index zero. But isn't there anything ROBUST (a flag or something) that we can specify to look for the exact match (something like what most editors have, e.g. MS Word)?
while (getline(FileStream, CurrentReadLine))
{
if (CurrentReadLine.find("GeneralLicense") != std::string::npos)
{
std::cout << "General License Line: " << CurrentReadLine;
}
else if (CurrentReadLine.find("License") != std::string::npos)
{
std::cout << "License Line: " << CurrentReadLine;
}
}
The more ROBUST search is called a regex:
#include <regex>
while (getline(FileStream, CurrentReadLine))
{
if(std::regex_match(CurrentReadLine,
std::regex(".*\\bLicense\\b.*=.*")))
{
std::cout << "License Line: " << CurrentReadLine << std::endl;
}
if(std::regex_match(CurrentReadLine,
std::regex(".*\\bGeneralLicense\\b.*=.*")))
{
std::cout << "General License Line: " << CurrentReadLine << std::endl;
}
}
The \b escape sequences denote word boundaries.
.* means "any sequence of characters, including zero characters"
EDIT: You could also use regex_search instead of regex_match to search for substrings that match instead of using .* to cover the parts that don't match:
#include <regex>
while (getline(FileStream, CurrentReadLine))
{
if(std::regex_search(CurrentReadLine, std::regex("\\bLicense\\b")))
{
std::cout << "License Line: " << CurrentReadLine << std::endl;
}
if(std::regex_search(CurrentReadLine, std::regex("\\bGeneralLicense\\b")))
{
std::cout << "General License Line: " << CurrentReadLine << std::endl;
}
}
This more closely matches your code, but note that it will get tripped up if the keywords are also found after the equals sign. If you want maximum robustness, use regex_match and specify exactly what the whole line should match.
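As one possible reading of "specify exactly what the whole line should match", the sketch below parses the line into a key and a value and leaves the exact comparison to the caller. It assumes every line has the form Key = "value"; parse_setting is a hypothetical name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: split a line of the assumed form  Key = "value".
// Returns false if the whole line does not have that shape.
bool parse_setting(const std::string& line, std::string& key, std::string& value) {
    static const std::regex re(R"re(\s*(\w+)\s*=\s*"([^"]*)"\s*)re");
    std::smatch m;
    if (!std::regex_match(line, m, re))  // regex_match requires the entire line to match
        return false;
    key = m[1].str();
    value = m[2].str();
    return true;
}
```

A caller can then test key == "License" or key == "GeneralLicense" directly, and a keyword that happens to appear after the equals sign no longer matters.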
You can check if the position at which the substring appears is at index zero, or that the character preceding the initial position is a space:
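bool findAtWordBoundary(const std::string& line, const std::string& search) {
size_t pos = line.find(search);
// cast to unsigned char: passing a negative char to isspace is undefined behavior
return (pos != std::string::npos) && (pos == 0 || isspace(static_cast<unsigned char>(line[pos-1])));
}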
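The helper above looks only at the first occurrence and only at the character in front of it. A slightly more thorough sketch checks both sides of every occurrence. The name contains_whole_word and the definition of a "word character" (alphanumeric or underscore) are assumptions made for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true if `word` occurs in `line` with no word character
// (letter, digit or underscore) directly before or after it.
bool contains_whole_word(const std::string& line, const std::string& word) {
    if (word.empty()) return false;
    auto is_word_char = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };
    for (std::size_t pos = line.find(word); pos != std::string::npos;
         pos = line.find(word, pos + 1)) {
        const std::size_t after = pos + word.size();
        const bool left_ok = pos == 0 || !is_word_char(line[pos - 1]);
        const bool right_ok = after == line.size() || !is_word_char(line[after]);
        if (left_ok && right_ok) return true;
    }
    return false;
}
```

This behaves much like a \b...\b regex search for plain ASCII keywords, without pulling in the regex library.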
Isn't there anything ROBUST (flag or something) which we can specify to look for the exact match?
In a way, find already looks for exact match. However, it treats a string as a sequence of meaningless numbers that represent individual characters. That is why std::string class lacks the concept of "full word", which is present in other parts of the library, such as regular expressions.
You could write a function that tests for the largest match first and then returns whatever information you want about the match.
Something a bit like:
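#include <functional>
#include <iostream>
#include <set>
#include <string>
// The set is ordered with std::greater, i.e. in descending lexicographic order,
// so a longer key such as "abc" is tested before its prefixes "ab" and "a".
// Returns the first key found in s, or an empty string if none is found.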
std::string find_one_of(std::set<std::string, std::greater<std::string>> const& tests, std::string const& s)
{
for(auto const& test: tests)
if(s.find(test) != std::string::npos)
return test;
return {};
}
int main()
{
std::string text = "abcdef";
auto found = find_one_of({"a", "abc", "ab"}, text);
std::cout << "found: " << found << '\n'; // prints "abc"
}
If all matches start at position 0 and none is a prefix of another, then the following might work:
if (CurrentReadLine.substr( 0, 7 ) == "License")
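The hard-coded length 7 can be avoided with a small prefix test. This is a sketch with a hypothetical helper name, starts_with_key; it has the same limitation as the line above, since it only works when no key is a prefix of another.

```cpp
#include <string>

// Hypothetical helper: true if `line` begins with `key`.
bool starts_with_key(const std::string& line, const std::string& key) {
    // compare(pos, len, str) clips len to the available characters, so a line
    // shorter than the key simply compares unequal.
    return line.compare(0, key.size(), key) == 0;
}
```

Unlike substr, compare does not allocate a temporary string for each line.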
You can tokenize your string and compare each token in full against your search key.
Example:
#include <string>
#include <sstream>
#include <vector>
#include <iostream>
auto tokenizer(const std::string& line)
{
std::vector<std::string> results;
std::istringstream ss(line);
std::string s;
while(std::getline(ss, s, ' '))
results.push_back(s);
return results;
}
auto compare(const std::vector<std::string>& tokens, const std::string& key)
{
for (auto&& i : tokens)
if ( i == key )
return true;
return false;
}
int main()
{
std::string x = "License = \"12345\"";
auto token = tokenizer(x);
std::cout << compare(token, "License") << std::endl;
std::cout << compare(token, "GeneralLicense") << std::endl;
}

Boost regex expression capture

My goal is to capture an integer using boost::regex_search.
#define BOOST_REGEX_MATCH_EXTRA
#include <boost/regex.hpp>
#include <string>
#include <iostream>
int main(int argc, char* argv[])
{
std::string tests[4] = {
"SomeString #222",
"SomeString #1",
"SomeString #42",
"SomeString #-1"
};
boost::regex rgx("#(-?[0-9]+)$");
boost::smatch match;
for(int i=0;i< 4; ++i)
{
std::cout << "Test " << i << std::endl;
boost::regex_search(tests[i], match, rgx, boost::match_extra);
for(int j=0; j< match.size(); ++j)
{
std::string match_string;
match_string.assign(match[j].first, match[j].second);
std::cout << " Match " << j << ": " << match_string << std::endl;
}
}
system("pause");
}
I notice that each regex search results in two matches. The first is the whole matched string, and the second is the capture in parentheses.
Test 0
Match 0: #222
Match 1: 222
Test 1
Match 0: #1
Match 1: 1
Test 2
Match 0: #42
Match 1: 42
Test 3
Match 0: #-1
Match 1: -1
The documentation discourages use of BOOST_REGEX_MATCH_EXTRA unless needed. Is it required to capture a single match within parentheses, or is there another way?
If you want more speed, perhaps Boost Spirit could bring it, or Boost Xpressive.
Both will generate code from expression templates. Meaning, among other things, that if you don't "absorb" any attribute values, no cost will be incurred.
Boost Spirit:
This solution is header-only. It can probably be made more efficient, but here's a start:
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
namespace qi = boost::spirit::qi;
int main()
{
std::string const tests[] = {
"SomeString #222",
"SomeString #1",
"SomeString #42",
"SomeString #-1"
};
for(auto& input : tests)
{
int value;
auto f(input.begin()), l(input.end());
if (qi::phrase_parse(f, l, // input iterators
qi::omit [ *~qi::char_('#') ] >> '#' >> qi::int_, // grammar
qi::space, // skipper
value)) // output attribute
{
std::cout << " Input '" << input << "' -> " << value << "\n";
}
}
}
Boost Xpressive
#include <boost/xpressive/xpressive_static.hpp>
#include <iostream>
#include <string>
namespace xp = boost::xpressive;
int main()
{
std::string const tests[] = {
"SomeString #222",
"SomeString #1",
"SomeString #42",
"SomeString #-1"
};
for(auto& input : tests)
{
static xp::sregex rex = (xp::s1= -*xp::_) >> '#' >> (xp::s2= !xp::as_xpr('-') >> +xp::_d);
xp::smatch what;
if(xp::regex_match(input, what, rex))
{
std::cout << "Input '" << what[0] << "' -> " << what[2] << '\n';
}
}
}
I have a hunch that the Spirit solution is gonna be more performant, and close to what you want (because it parses a general grammar and parses it into your desired data-type directly).
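On the narrower question of the macro: as far as I understand the Boost.Regex documentation, BOOST_REGEX_MATCH_EXTRA together with match_extra is only needed to keep the history of repeated captures (sub_match::captures()). Reading an ordinary parenthesized group through match[1] should not need it. The sketch below shows that plain capture using std::regex as a stand-in for Boost.Regex; trailing_number is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: extract the integer after '#' at the end of `text`.
// Returns false if the text does not end in "#<integer>".
bool trailing_number(const std::string& text, int& value) {
    static const std::regex re("#(-?[0-9]+)$");
    std::smatch m;
    if (!std::regex_search(text, m, re))
        return false;
    // m[0] is the whole match ("#222"), m[1] is the captured group ("222").
    value = std::stoi(m[1].str());
    return true;
}
```

That also explains the two entries per test in the question's output: index 0 is always the whole match, and index 1 is the first capture group.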

How to match multiple results using std::regex

For example, if I have a string like "first second third forth", I want to match every single word in one operation and output them one by one.
I thought that "(\\b\\S*\\b){0,}" would work, but it did not.
What should I do?
Here's my code:
#include<iostream>
#include<regex>
#include<string>
using namespace std;
int main()
{
regex exp("(\\b\\S*\\b)");
smatch res;
string str = "first second third forth";
regex_search(str, res, exp);
cout << res[0] <<" "<<res[1]<<" "<<res[2]<<" "<<res[3]<< endl;
}
Simply iterate over your string with repeated calls to regex_search, like this:
{
regex exp("(\\b\\S*\\b)");
smatch res;
string str = "first second third forth";
string::const_iterator searchStart( str.cbegin() );
while ( regex_search( searchStart, str.cend(), res, exp ) )
{
cout << ( searchStart == str.cbegin() ? "" : " " ) << res[0];
searchStart = res.suffix().first;
}
cout << endl;
}
This can be done with the C++11 regex library.
Two methods:
You can use () in the regex to define your captures (sub-expressions).
Like this:
string var = "first second third forth";
const regex r("(.*) (.*) (.*) (.*)");
smatch sm;
if (regex_search(var, sm, r)) {
for (size_t i=1; i<sm.size(); i++) {
cout << sm[i] << endl;
}
}
See it live: http://coliru.stacked-crooked.com/a/e1447c4cff9ea3e7
You can use sregex_token_iterator():
string var = "first second third forth";
regex wsaq_re("\\s+");
copy( sregex_token_iterator(var.begin(), var.end(), wsaq_re, -1),
sregex_token_iterator(),
ostream_iterator<string>(cout, "\n"));
See it live: http://coliru.stacked-crooked.com/a/677aa6f0bb0612f0
sregex_token_iterator appears to be the ideal, efficient solution, but the example given in the selected answer leaves much to be desired. Instead, I found some great examples here:
http://www.cplusplus.com/reference/regex/regex_token_iterator/regex_token_iterator/
For your convenience, I've copy-pasted the sample code shown by that page. I claim no credit for the code.
// regex_token_iterator example
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("this subject has a submarine as a subsequence");
std::regex e ("\\b(sub)([^ ]*)"); // matches words beginning by "sub"
// default constructor = end-of-sequence:
std::regex_token_iterator<std::string::iterator> rend;
std::cout << "entire matches:";
std::regex_token_iterator<std::string::iterator> a ( s.begin(), s.end(), e );
while (a!=rend) std::cout << " [" << *a++ << "]";
std::cout << std::endl;
std::cout << "2nd submatches:";
std::regex_token_iterator<std::string::iterator> b ( s.begin(), s.end(), e, 2 );
while (b!=rend) std::cout << " [" << *b++ << "]";
std::cout << std::endl;
std::cout << "1st and 2nd submatches:";
int submatches[] = { 1, 2 };
std::regex_token_iterator<std::string::iterator> c ( s.begin(), s.end(), e, submatches );
while (c!=rend) std::cout << " [" << *c++ << "]";
std::cout << std::endl;
std::cout << "matches as splitters:";
std::regex_token_iterator<std::string::iterator> d ( s.begin(), s.end(), e, -1 );
while (d!=rend) std::cout << " [" << *d++ << "]";
std::cout << std::endl;
return 0;
}
Output:
entire matches: [subject] [submarine] [subsequence]
2nd submatches: [ject] [marine] [sequence]
1st and 2nd submatches: [sub] [ject] [sub] [marine] [sub] [sequence]
matches as splitters: [this ] [ has a ] [ as a ]
You could use the suffix() function, and search again until you don't find a match:
int main()
{
regex exp("(\\b\\S*\\b)");
smatch res;
string str = "first second third forth";
while (regex_search(str, res, exp)) {
cout << res[0] << endl;
str = res.suffix();
}
}
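The same loop can be written without copying the suffix back into the string on every step, by letting std::sregex_iterator walk over the matches. This is a minimal sketch: all_words is a hypothetical name, and the pattern \S+ (one or more non-space characters) is a simplification of the one used above.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return every run of non-whitespace characters in `text`.
std::vector<std::string> all_words(const std::string& text) {
    static const std::regex word(R"(\S+)");
    std::vector<std::string> out;
    // A default-constructed sregex_iterator acts as the end-of-sequence marker.
    for (std::sregex_iterator it(text.begin(), text.end(), word), end; it != end; ++it)
        out.push_back(it->str());
    return out;
}
```

Because the iterator keeps its position inside the original string, the input can stay const and nothing is reallocated between matches.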
My code will capture all groups in all matches:
vector<vector<string>> U::String::findEx(const string& s, const string& reg_ex, bool case_sensitive)
{
// icase has to be set when the search is NOT case-sensitive
regex rx(reg_ex, case_sensitive ? regex_constants::ECMAScript
: regex_constants::ECMAScript | regex_constants::icase);
vector<vector<string>> captured_groups;
// sregex_iterator yields a full match_results per match, so every group
// is available without searching the matched text a second time
for (sregex_iterator it(s.cbegin(), s.cend(), rx), end_i; it != end_i; ++it)
{
vector<string> captured_subgroups;
for (size_t g = 0; g < it->size(); ++g)
captured_subgroups.push_back((*it)[g].str());
captured_groups.push_back(captured_subgroups);
}
return captured_groups;
}
My reading of the documentation is that regex_search only finds the first match and does not by itself "scan" for all matches the way you are looking for; that job falls to the iterator classes (std::sregex_iterator and std::sregex_token_iterator). The Boost library supports this as well, as described in C++ tokenize a string using a regular expression

How to extract trimmed text using Boost Spirit?

Using boost spirit, I'd like to extract a string that is followed by some data in parentheses. The relevant string is separated by a space from the opening parenthesis. Unfortunately, the string itself may contain spaces. I'm looking for a concise solution that returns the string without a trailing space.
The following code illustrates the problem:
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <string>
#include <iostream>
namespace qi = boost::spirit::qi;
using std::string;
using std::cout;
using std::endl;
void
test_input(const string &input)
{
string::const_iterator b = input.begin();
string::const_iterator e = input.end();
string parsed;
bool const r = qi::parse(b, e,
*(qi::char_ - qi::char_("(")) >> qi::lit("(Spirit)"),
parsed
);
if(r) {
cout << "PASSED:" << endl;
} else {
cout << "FAILED:" << endl;
}
cout << " Parsed: \"" << parsed << "\"" << endl;
cout << " Rest: \"" << string(b, e) << "\"" << endl;
}
int main()
{
test_input("Fine (Spirit)");
test_input("Hello, World (Spirit)");
return 0;
}
Its output is:
PASSED:
Parsed: "Fine "
Rest: ""
PASSED:
Parsed: "Hello, World "
Rest: ""
With this simple grammar, the extracted string is always followed by a space (that I'd like to eliminate).
The solution should work within Spirit since this is only part of a larger grammar. (Thus, it would probably be clumsy to trim the extracted strings after parsing.)
Thank you in advance.
Like the comment said, in the case of a single space, you can just hard code it. If you need to be more flexible or tolerant:
I'd use a skipper with raw to "cheat" the skipper for your purposes:
bool const r = qi::phrase_parse(b, e,
qi::raw [ *(qi::char_ - qi::char_("(")) ] >> qi::lit("(Spirit)"),
qi::space,
parsed
);
This works, and prints
PASSED:
Parsed: "Fine"
Rest: ""
PASSED:
Parsed: "Hello, World"
Rest: ""
Full program for reference:
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <string>
#include <iostream>
namespace qi = boost::spirit::qi;
using std::string;
using std::cout;
using std::endl;
void
test_input(const string &input)
{
string::const_iterator b = input.begin();
string::const_iterator e = input.end();
string parsed;
bool const r = qi::phrase_parse(b, e,
qi::raw [ *(qi::char_ - qi::char_("(")) ] >> qi::lit("(Spirit)"),
qi::space,
parsed
);
if(r) {
cout << "PASSED:" << endl;
} else {
cout << "FAILED:" << endl;
}
cout << " Parsed: \"" << parsed << "\"" << endl;
cout << " Rest: \"" << string(b, e) << "\"" << endl;
}
int main()
{
test_input("Fine (Spirit)");
test_input("Hello, World (Spirit)");
return 0;
}