C++11 Regex Find Capture Group Identifier - c++

I've looked at a number of sources for C++11's new regex library, but most of them focus more on the syntax, or the more basic usage of things like regex_match, or regex_search. While these articles helped me get started using the regex library, I'm having a difficult time finding more details on capture groups.
What I'm trying to accomplish is to find out which capture group a match belongs to. So far, I've only found a single way to do this.
#include <iostream>
#include <string>
#include <regex>
int main(int argc, char** argv)
{
std::string input = "+12 -12 -13 90 qwerty";
std::regex pattern("([+-]?[[:digit:]]+)|([[:alpha:]]+)");
auto iter_begin = std::sregex_token_iterator(input.begin(), input.end(), pattern, 1);
auto iter_end = std::sregex_token_iterator();
for (auto it = iter_begin; it != iter_end; ++it)
{
std::ssub_match match = *it;
std::cout << "Match: " << match.str() << " [" << match.length() << "]" << std::endl;
}
std::cout << std::endl << "Done matching..." << std::endl;
std::string temp;
std::getline(std::cin, temp);
return 0;
}
In changing the value of the fourth argument of std::sregex_token_iterator, I can control which submatch it will keep, telling it to throw away the rest of them. Therefore, to find out which capture group a match belongs to, I can simply iterate through the capture groups to find out which matches are not thrown away for a particular group.
However, this would be undesirable for me, because unless there's some caching going on in the background I would expect each construction of std::sregex_token_iterator to pass over the input and find the matches again (someone please correct me if this is wrong, but this is the best conclusion I could come to).
Is there any better way of finding the capture group(s) a match belongs to? Or is iterating over the submatches the best course of action?

Use regex_iterator instead. You will have access to match_results for each match, which contains all the sub_matches, so you can check which capturing group the match belongs to.

Related

Regexp matching fails with invalid special open parenthesis

I am trying to use regexps in C++11, but my code always throws a std::regex_error of "Invalid special open parenthesis". A minimal example which tries to find the first duplicate character in a string:
std::string regexp_string("(?P<a>[a-z])(?P=a)"); // Nothing to be escaped here, right?
std::regex regexp_to_match(regexp_string);
std::string target("abbab");
std::smatch matched_regexp;
std::regex_match(target, matched_regexp, regexp_to_match);
for(const auto& m: matched_regexp)
{
std::cout << m << std::endl;
}
Why do I get an error and how do I fix this example?
There are 2 issues here:
std::regex flavors do not support named capturing groups / backreferences; you need to use numbered capturing groups / backreferences.
You should use regex_search rather than regex_match, which requires a full string match.
Use
std::string regexp_string(R"(([a-z])\1)");
std::regex regexp_to_match(regexp_string);
std::string target("abbab");
std::smatch matched_regexp;
if (std::regex_search(target, matched_regexp, regexp_to_match)) {
std::cout << matched_regexp.str() << std::endl;
}
// => bb
See the C++ demo
The R"(([a-z])\1)" raw string literal defines the ([a-z])\1 regex that matches any lowercase ASCII letter and then matches the same letter again.
http://en.cppreference.com/w/cpp/regex/ecmascript says that ECMAScript (the default type for std::regex) requires (?= for positive lookahead.
The reason your regex throws is that named groups are not supported by std::regex. However, you can still use what is available to find the first duplicate char in a string:
#include <iostream>
#include <regex>
int main()
{
std::string s = "abc def cde";
std::smatch m;
std::regex r("(\\w).*?(?=\\1)");
if (std::regex_search(s, m, r))
std::cout << m[1] << std::endl;
return 0;
}
Prints
c

c++ Is there a way to find sentences within strings?

I'm trying to recognise certain phrases within a user defined string but so far have only been able to get a single word.
For example, if I have the sentence:
"What do you think of stack overflow?"
is there a way to search for "What do you" within the string?
I know you can retrieve a single word with the find function but when attempting to get all three it gets stuck and can only search for the first.
Is there a way to search for the whole string in another string?
Use str.find()
size_t find (const string& str, size_t pos = 0)
Its return value is the starting position of the substring, or string::npos if the substring is not found. You can test whether the string you are looking for is contained in the main string by comparing the result against string::npos:
string str = "What do you think of stack overflow?";
if (str.find("What do you") != string::npos) // is contained
The second argument can be used to start the search from a given position in the string.
The question mentions that the search gets stuck when looking for a three-word string. I believe you are misinterpreting the return value: "What" and "What do you" both start at the same position, so str.find() returns the same value for both. To find the positions of individual words, use multiple function calls.
Use regular expressions
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("What do you think of stack overflow?");
std::smatch m;
std::regex e ("\\bWhat do you think\\b");
std::cout << "The following matches and submatches were found:" << std::endl;
while (std::regex_search (s,m,e)) {
for (auto x:m) std::cout << x << " ";
std::cout << std::endl;
s = m.suffix().str();
}
return 0;
}
You can also implement wildcard matching with Boost (the regex support in the standard library was modeled on the boost::regex library, which predates C++11).

Need help constructing Regular expression pattern

I'm failing to create a pattern for the stl regex_match function and need some help understanding why the pattern I created doesn't work and what would fix it.
I think the regex would have a hit for dl.boxcloud.com but it does not.
Update: still looking for input. I updated the program to reflect the suggestions. There are two matches when I think there should be one.
#include <string>
#include <regex>
using namespace std;
wstring GetBody();
int _tmain(int argc, _TCHAR* argv[])
{
wsmatch m;
wstring regex(L"(dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
regex_search(GetBody(), m, wregex(regex));
printf("%d matches.\n", m.size());
return 0;
}
wstring GetBody() {
wstring body(L"ABOUTLinkedIn\r\n\r\nwall of textdl.boxcloud.com/this/file/bitbyte.zip sent you a message.\r\n\r\nDate: 12/04/2012\r\n\r\nSubject: RE: Reference Ask\r\n\r\nOn 12/03/12 2:02 PM, wall of text wrote:\r\n--------------------\r\nRuba,\r\n\r\nI am looking for a n.");
return body;
}
There is no problem with the code itself. You mistake m.size() for the number of matches, when in fact it is the number of sub-matches: the whole match plus one entry per capture group.
The std::match_results::size reference is not helpful with understanding that:
Returns the number of matches and sub-matches in the match_results object.
Here there are 2 entries (the whole match, plus the capturing group you defined around the 2 alternatives) and 1 match all in all.
See this IDEONE demo
#include <regex>
#include <string>
#include <iostream>
#include <time.h>
using namespace std;
int main()
{
string data("ABOUTLinkedIn\r\n\r\nwall of textdl.boxcloud.com/this/file/bitbyte.zip sent you a message.\r\n\r\nDate: 12/04/2012\r\n\r\nSubject: RE: Reference Ask\r\n\r\nOn 12/03/12 2:02 PM, wall of text wrote:\r\n--------------------\r\nRuba,\r\n\r\nI am looking for a n.");
std::regex pattern("(dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
std::smatch result;
while (regex_search(data, result, pattern)) {
std::cout << "Match: " << result[0] << std::endl;
std::cout << "Captured text 1: " << result[1] << std::endl;
std::cout << "Size: " << result.size() << std::endl;
data = result.suffix().str();
}
}
It outputs:
Match: dl.boxcloud.com
Captured text 1: dl.boxcloud.com
Size: 2
See, the captured text equals the whole match.
To "fix" that, you may use a non-capturing group, or remove the grouping altogether:
std::regex pattern("(?:dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
// or
std::regex pattern("dl\\.boxcloud\\.com|api-content\\.dropbox\\.com");
Also, consider using a raw string literal when declaring a regex (to avoid backslash hell):
std::regex pattern(R"(dl\.boxcloud\.com|api-content\.dropbox\.com)");
You need to add another "\" before each ".". I think that should fix it. You need to use escape character to represent "\" so your regex looks like this
wstring regex(L"(dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
Update:
As @user3494744 also said, you have to use std::regex_search instead of std::regex_match.
I tested and it works now.
The problem is that you use regex_match instead of regex_search. To quote from the manual:
Note that regex_match will only successfully match a regular expression to an entire character sequence, whereas std::regex_search will successfully match subsequences
This fix will give a match, but it will also match too much unless you replace \. with \\. as shown in the answer before mine. Otherwise the string "dlXboxcloud.com" will also match.

Determining the location of C++11 regular expression matches

How do I efficiently determine the location of a capture group inside a searched string? Getting the location of the entire match is easy, but I see no obvious ways to get at capture groups beyond the first.
This is a simplified example; let's presume "a*" and "b*" are complicated regexes that are expensive to run.
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
regex matcher("a*(needle)b*");
smatch findings;
string haystack("aaaaaaaaneedlebbbbbbbbbbbbbb");
if( regex_match(haystack, findings, matcher) )
{
// What do I put here to find the offset of "needle" in the
// string haystack?
// This is the position of the entire match, which is always 0
// with regex_match but may be non-zero with regex_search
cout << "smatch::position - " << findings.position() << endl;
// Is this just a string or what? Are there member functions
// That can be called?
cout << "Needle - " << findings[1] << endl;
}
return 0;
}
If it helps I built this question in Coliru: http://coliru.stacked-crooked.com/a/885a6b694d32d9b5
I will not mark this as an answer until 72 hours have passed and no better answers are present.
Before asking this I presumed smatch::position took no arguments I cared about, because when I read the cppreference page the "sub" parameter was not obviously an index into the container of matches. I thought it had something to do with "sub"strings and the offset value of the whole match.
So my answer is:
cout << "Needle Position- " << findings.position(1) << endl;
Any explanation on this design, or other issues my line of thinking may have caused would be appreciated.
According to the documentation, you can access the iterators pointing to the beginning and the end of the captured text via match[n].first and match[n].second. To get the start and end indices, just do iterator arithmetic with haystack.begin().
if (findings[1].matched) {
cout << "[" << findings[1].first - haystack.begin() << "-"
<< findings[1].second - haystack.begin() << "] "
<< findings[1] << endl;
}
Except for the main match (index 0), capturing groups may or may not capture anything. When a group does not participate in the match, first and second will point to the end of the string.
I also demonstrate the matched member of the sub_match object. While it's unnecessary in this case, in general, if you want to print out the indices of the capturing groups, you need to check first whether the capturing group matched anything.

Getting sub-match_results with boost::regex

Hey, let's say I have this regex: (test[0-9])+
And that I match it against: test1test2test3test0
const bool ret = boost::regex_search(input, what, r);
for (size_t i = 0; i < what.size(); ++i)
cout << i << ':' << string(what[i]) << "\n";
Now, what[1] will be test0 (the last occurrence). Let's say that I need to get test1, 2 and 3 as well: what should I do?
Note: the real regex is extremely more complex and has to remain one overall match, so changing the example regex to (test[0-9]) won't work.
I think .NET has the ability to make single capture group collections, so that (grp)+ will create a collection object on group 1. The Boost engine's regex_search() is going to be just like any ordinary match function. You sit in a while() loop, matching the pattern where the last match left off. The form you used does not take bidirectional iterators, so the function won't start the next match where the last match left off.
You can use the iterator form:
(Edit - you can also use the token iterator, defining which groups to iterate over. Added in the code below.)
#include <boost/regex.hpp>
#include <string>
#include <iostream>
using namespace std;
using namespace boost;
int main()
{
string input = "test1 ,, test2,, test3,, test0,,";
boost::regex r("(test[0-9])(?:$|[ ,]+)");
boost::smatch what;
std::string::const_iterator start = input.begin();
std::string::const_iterator end = input.end();
while (boost::regex_search(start, end, what, r))
{
string stest(what[1].first, what[1].second);
cout << stest << endl;
// Update the beginning of the range to the character
// following the whole match
start = what[0].second;
}
// Alternate method using token iterator
const int subs[] = {1}; // we just want to see group 1
boost::sregex_token_iterator i(input.begin(), input.end(), r, subs);
boost::sregex_token_iterator j;
while(i != j)
{
cout << *i++ << endl;
}
return 0;
}
Output:
test1
test2
test3
test0
Boost.Regex offers experimental support for exactly this feature (called repeated captures); however, since it incurs a significant performance hit, this feature is disabled by default.
To enable repeated captures, you need to rebuild Boost.Regex and define the macro BOOST_REGEX_MATCH_EXTRA in all translation units; the best way to do this is to uncomment this define in boost/regex/user.hpp (see the reference, it's at the very bottom of the page).
Once compiled with this define, you can use this feature by calling regex_search, regex_match and regex_iterator with the match_extra flag.
Check reference to Boost.Regex for more info.
Seems to me like you need to create a regex_iterator, using the (test[0-9]) regex as input. Then you can use the resulting regex_iterator to enumerate the matching substrings of your original target.
If you still need "one overall match" then perhaps that work has to be decoupled from the task of finding matching substrings. Can you clarify that part of your requirement?