avoid regex greediness - c++

Basic regex question.
By default, regular expressions seem to be greedy. For example, the code below:
#include <iostream>
#include <regex>
#include <string>
int main() {
    const std::string t = "*1 abc";
    std::smatch match;
    std::regex rgxx("\\*(\\d+?)\\s+(.+?)$");
    bool matched1 = std::regex_search(t.begin(), t.end(), match, rgxx);
    std::cout << "Matched size " << match.size() << std::endl;
    for (std::size_t i = 0; i < match.size(); ++i) {
        std::cout << i << " match " << match[i] << std::endl;
    }
}
This will produce an output of:
Matched size 3
0 match *1 abc
1 match 1
2 match abc
As a general regular expression writer, I would have expected only
1 match 1
2 match abc
to be printed. The first entry appears because of regex greediness, I think. How can it be avoided?

From std::regex_search: match[0] is not the result of greedy evaluation, but is the range of the entire match. The match elements [1, n) are the capture groups.
Here's an illustration of what the match results mean:
regex "hello ([\\w]+)"
string = "Oh, hello John!"
match[0] = "hello John" // matches the whole regex above
match[1] = "John" // the first capture group

You only have one match. That match has 2 "marked subexpressions", because that's what the regex specifies. You don't have multiple matches of that regex.
From std::regex_search
m.size(): number of marked subexpressions plus 1, that is, 1+rgxx.mark_count()
If you are looking for multiple matches, use std::regex_iterator

Related

Regular expression to return only one match

My CPP application is similar to a regular expression testing application in which I can enter the regular expression and the input string to see the output. I am using the cpp API std::regex_search(inputString, match, regex) to execute and get the match for all regular expressions. The problem I am facing here is that the match can have more than 1 item but I should return only one of them.
I have 2 types of input strings. For example:
Name:Jake (string with prefix 'Name:'). I am using the regular expression ^Name:(.*?)$. Here match contains Name:Jake and Jake. I have to ignore match[0] and return match[1] in this case.
1234-r (string with suffix '-r'). Here I am using the regular expression ^.*(?=(\\-r)). In this case match contains 1234 and -r, so I have to ignore match[1] and return match[0].
Is there a way I can modify these regular expressions so that the match will have only one item in that? Jake in the first case and 1234 in the second case.
This is the first time I am dealing with regular expressions.
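#include <iostream>
#include <regex>
#include <string>
int main() {
    std::smatch sm;
    std::string str = "Name:Jake";
    std::regex_match(str, sm, std::regex("^Name:(.*?)$"));
    std::cout << sm.size() << std::endl; // number of sub-matches: the whole match plus the capture groups
    std::cout << sm[1] << std::endl; // the first capture group is the only entry you need here
    for (std::size_t i = 0; i < sm.size(); ++i) {
        std::cout << "[" << sm[i] << "] ";
    }
}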

Avoid extra matches from Regex_search

Very new to the c++ regex libraries.
We are trying to parse a line
*10 abc
We want to parse/split this line into only two tokens:
10
abc
I have tried multiple things such as regex_search, but I get 3 matches: the first is the whole match, and the second and third are the sub-sequence matches. My question is:
How can we get only two matches (10 and abc) from the above string? A snapshot of what I have tried:
#include <regex>
#include <iostream>
int main() {
const std::string t = "*10 abc";
std::regex rgxx("\\*(\\d+)\\s+(.+)");
std::smatch match;
bool matched1 = std::regex_search(t.begin(), t.end(), match, rgxx);
std::cout << "Matched size " << match.size() << std::endl;
for(int i = 0 ; i < match.size(); ++i) {
std::cout << i << " match " << match[i] << std::endl;
}
}
Output:
Matched size 3
0 match *10 abc
1 match 10
2 match abc
0 match is the one which I do not want.
I am open to use boost libraries/regexes as well. Thank you.
There is nothing really wrong with your code per se. The zero match is just the entire portion of the input that matched the regex pattern. If you only want the two captured terms, then just print the first and second capture groups:
const std::string t = "*10 abc";
std::regex rgxx("(\\d+)\\s+(.+)");
std::smatch match;
bool matched1 = std::regex_search(t.begin(), t.end(), match, rgxx);
std::cout << "Matched size " << match.size() << std::endl;
for (int i=1; i < match.size(); ++i) {
std::cout << i << " match " << match[i] << std::endl;
}
Matched size 3
1 match 10
2 match abc
So, the lesson here is that the first entry in the match array (index zero) will always be the entire matched text, and the capture groups start at index 1.

c++11 regexp retrieving all groups with +/* modifiers

I don't understand how to retrieve all groups using regexp in c++
An example:
const std::string s = "1,2,3,5";
std::regex lrx("^(\\d+)(,(\\d+))*$");
std::smatch match;
if (std::regex_search(s, match, lrx))
{
int i = 0;
for (auto m : match)
std::cout << " submatch " << i++ << ": "<< m << std::endl;
}
Gives me the result
submatch 0: 1,2,3,5
submatch 1: 1
submatch 2: ,5
submatch 3: 5
I am missing 2 and 3
You cannot use the current approach, since std::regex does not keep a history of the values captured by a repeated group. Each time the group captures part of the string, its former value is overwritten with the new one, and only the last captured value is available after a match is found and returned. And since you defined 3 capturing groups in the pattern, you have 3+1 groups in the output.
Mind also, that std::regex_search only returns one match, while you will need multiple matches here.
So, what you may do is to perform 2 steps: 1) validate the string using the pattern you have (no capturing is necessary here), 2) extract the digits (or split with a comma, that depends on the requirements).
A C++ demo:
#include <string>
#include <iostream>
#include <regex>
using namespace std;
int main() {
std::regex rx_extract("[0-9]+");
std::regex rx_validate(R"(^\d+(?:,\d+)*$)");
std::string s = "1,2,3,5";
if (regex_match(s, rx_validate)) {
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), rx_extract);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << m.str() << '\n';
}
}
return 0;
}
Output:
1
2
3
5

C++11 Regex IfThenElse - Single, closed brackets matched OR no brackets matched

How can I define a c++11/ECMAScript compatible regex statement that matches strings either:
Containing a single, closed, pair of round brackets containing an alphanumeric string of length greater than 0 - for example the regex statement "\(\w+\)", which correctly matches "(abc_123)" and ignores the incorrect "(abc_123", "abc_123)" and "abc_123". However, the above expression does not ignore input strings containing multiple balanced/unbalanced bracketing - I would like to exclude "((abc_123)", "(abc_123))", and "((abc_123))" from my matched results.
Or a single, alphanumeric word, without any unbalanced brackets - for example something like the regex statement "\w+" correctly matches "abc_123", but unfortunately incorrectly matches with "(abc_123", "abc_123)", "((abc_123)", "(abc_123))", and "((abc_123))"...
For clarity, the required matchings for each the test cases above are:
"abc_123" = Match,
"(abc_123)" = Match,
"(abc_123" = Not matched,
"abc_123)" = Not matched,
"((abc_123)" = Not matched,
"(abc_123))" = Not matched,
"((abc_123))" = Not matched.
I've been playing around with implementing the IfThenElse format suggested by http://www.regular-expressions.info/conditional.html, but haven't gotten very far... Is there some way to limit the number of occurrences of a particular group [e.g. "(\(){0,1}" matches zero or one left hand round bracket], AND pass the number of repetitions of a previous group to a later group [say "num\1" equals the number of times the "(" bracket appears in "(\(){0,1}", then I could pass this to the corresponding closing bracket group, "(\)){num\1}" say...]
This is probably not exactly what you want, and it is not really elegant, but...
With "or" (|) you should obtain a better-than-nothing solution based on "\\(\\w+\\)|\\w+".
A full example follows
#include <regex>
#include <iostream>
bool isMatch (std::string const & str)
{
static std::regex const
rgx { "\\(\\w+\\)|\\w+" };
std::smatch srgx;
return std::regex_match(str, srgx, rgx);
}
int main()
{
std::cout << isMatch("abc_123") << std::endl; // print 1
std::cout << isMatch("(abc_123)") << std::endl; // print 1
std::cout << isMatch("(abc_123") << std::endl; // print 0
std::cout << isMatch("abc_123)") << std::endl; // print 0
std::cout << isMatch("((abc_123)") << std::endl; // print 0
std::cout << isMatch("(abc_123))") << std::endl; // print 0
std::cout << isMatch("((abc_123))") << std::endl; // print 0
}

C++ RegEx matching - Pull the matching numbers

Ok, so I'm working with C++ regex and I'm not quite sure how to go about extracting the numbers that I want from my expression.
I'm building an expression BASED on numbers, but not sure how to pull them back out.
Here's my string:
+10.7% Is My String +5 And Some Extra Stuff Here
I use that string to pull the numbers 10, 7, 5 out and add them to a vector, no big deal.
I then change that string to become a regex expression.
\+([0-9]+)\.([0-9]+)% Is My String \+([0-9]+) And Some Extra Stuff Here
Now how do I go about using that regexp expression to MATCH my starting string and extract the numbers back out?
Something along the lines of using the match table?
You must iterate over the submatches to extract them.
Example:
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::string input = "+10.7% Is My String +5 And Some Extra Stuff Here";
std::regex rx("\\+([0-9]+)\\.([0-9]+)% Is My String \\+([0-9]+) And Some Extra Stuff Here");
std::smatch match;
if (std::regex_match(input, match, rx))
{
for (std::size_t i = 0; i < match.size(); ++i)
{
std::ssub_match sub_match = match[i];
std::string num = sub_match.str();
std::cout << " submatch " << i << ": " << num << std::endl;
}
}
}
Output:
submatch 0: +10.7% Is My String +5 And Some Extra Stuff Here
submatch 1: 10
submatch 2: 7
submatch 3: 5
live example: https://ideone.com/01XJDF