Regular expression to return only one match - c++

My C++ application is similar to a regular expression tester: I enter a regular expression and an input string and see the output. I use std::regex_search(inputString, match, regex) to run every regular expression and get the match. The problem is that the match can contain more than one item, but I should return only one of them.
I have 2 types of input strings. For example:
Name:Jake (string with prefix 'Name:'). I am using the regular expression ^Name:(.*?)$. Here match contains Name:Jake and Jake. I have to ignore match[0] and return match[1] in this case.
1234-r (string with suffix '-r'). Here I am using the regular expression ^.*(?=(\\-r)). In this case match contains 1234 and -r, and I have to ignore match[1] and return match[0].
Is there a way to modify these regular expressions so that the match contains only one item: Jake in the first case and 1234 in the second?
This is the first time I am dealing with regular expressions.

std::smatch sm;
std::string str = "Name:Jake";
std::regex_match(str, sm, std::regex("^Name:(.*?)$"));
std::cout << sm.size() << std::endl; // number of sub-matches: whole match + capture groups
std::cout << sm[1] << std::endl; // sm[1] is the first capture group, the only one you need here
for (std::size_t i = 0; i < sm.size(); ++i) {
    std::cout << "[" << sm[i] << "] ";
}
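If both kinds of pattern have to go through the same code path, one possible approach is to always return the last sub-match. This is only a sketch: extract_single is a hypothetical helper name, and it assumes each pattern has at most one capture group and that the suffix pattern is rewritten as ^.*(?=-r), i.e. a lookahead with no capture group inside it.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original post): returns the last
// sub-match, or an empty string when the pattern does not match.
std::string extract_single(const std::string& input, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(input, m, re)) {
        return std::string();
    }
    // m.size() is 1 + the number of capture groups, so size() - 1 is the
    // last group, or index 0 (the whole match) when there are no groups.
    return m[m.size() - 1].str();
}
```

With this, extract_single("Name:Jake", "^Name:(.*?)$") gives Jake and extract_single("1234-r", "^.*(?=-r)") gives 1234. The default ECMAScript grammar of std::regex has no lookbehind, so the prefix case still needs a capture group.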

Related

avoid regex greediness

Basic regex question.
By default, regular expressions are greedy, it seems. For example, the code below:
#include <regex>
#include <iostream>
int main() {
const std::string t = "*1 abc";
std::smatch match;
std::regex rgxx("\\*(\\d+?)\\s+(.+?)$");
bool matched1 = std::regex_search(t.begin(), t.end(), match, rgxx);
std::cout << "Matched size " << match.size() << std::endl;
for(int i = 0 ; i < match.size(); ++i) {
std::cout << i << " match " << match[i] << std::endl;
}
}
This will produce an output of:
Matched size 3
0 match *1 abc
1 match 1
2 match abc
As a general regular expression writer, I would have expected only
1 match 1
2 match abc
to appear. I think the first match appears because of regex greediness. How can I avoid it?
From std::regex_search: match[0] is not the result of greedy evaluation, but is the range of the entire match. The match elements [1, n) are the capture groups.
Here's an illustration of what the match results mean:
regex "hello ([\\w]+)"
string = "Oh, hello John!"
match[0] = "hello John" // matches the whole regex above
match[1] = "John" // the first capture group
You only have one match. That match has 2 "marked subexpressions", because that's what the regex specifies. You don't have multiple matches of that regex.
From std::regex_search:
m.size(): number of marked subexpressions plus 1, that is, 1+rgxx.mark_count()
If you are looking for multiple matches, use std::regex_iterator.
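If only the capture groups are wanted, the loop can simply start at index 1. A minimal sketch, where capture_groups is a hypothetical helper name rather than anything from the standard library:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns only the capture groups (indices 1..n),
// skipping index 0, which always holds the whole match.
std::vector<std::string> capture_groups(const std::string& text, const std::regex& re) {
    std::vector<std::string> groups;
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        for (std::size_t i = 1; i < m.size(); ++i) {
            groups.push_back(m[i].str());
        }
    }
    return groups;
}
```

For the regex from the question, "\\*(\\d+?)\\s+(.+?)$", and the input "*1 abc", this returns the two strings 1 and abc, which is the output the question expected.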

C++11 Regex IfThenElse - Single, closed brackets matched OR no brackets matched

How can I define a c++11/ECMAScript compatible regex statement that matches strings either:
Containing a single, closed, pair of round brackets containing an alphanumeric string of length greater than 0 - for example the regex statement "\(\w+\)", which correctly matches "(abc_123)" and ignores the incorrect "(abc_123", "abc_123)" and "abc_123". However, the above expression does not ignore input strings containing multiple balanced/unbalanced bracketing - I would like to exclude "((abc_123)", "(abc_123))", and "((abc_123))" from my matched results.
Or a single, alphanumeric word, without any unbalanced brackets - for example something like the regex statement "\w+" correctly matches "abc_123", but unfortunately incorrectly matches with "(abc_123", "abc_123)", "((abc_123)", "(abc_123))", and "((abc_123))"...
For clarity, the required matchings for each the test cases above are:
"abc_123" = Match,
"(abc_123)" = Match,
"(abc_123" = Not matched,
"abc_123)" = Not matched,
"((abc_123)" = Not matched,
"(abc_123))" = Not matched,
"((abc_123))" = Not matched.
I've been playing around with implementing the IfThenElse format suggested by http://www.regular-expressions.info/conditional.html, but haven't gotten very far... Is there some way to limit the number of occurrences of a particular group [e.g. "(\(){0,1}" matches zero or one left hand round bracket], AND pass the number of repetitions of a previous group to a later group [say "num\1" equals the number of times the "(" bracket appears in "(\(){0,1}", then I could pass this to the corresponding closing bracket group, "(\)){num\1}" say...]
Probably not exactly what you want, and not really elegant, but...
With "or" (|) you should obtain a better-than-nothing solution based on "\\(\\w+\\)|\\w+".
A full example follows:
#include <regex>
#include <iostream>
bool isMatch (std::string const & str)
{
static std::regex const
rgx { "\\(\\w+\\)|\\w+" };
std::smatch srgx;
return std::regex_match(str, srgx, rgx);
}
int main()
{
std::cout << isMatch("abc_123") << std::endl; // print 1
std::cout << isMatch("(abc_123)") << std::endl; // print 1
std::cout << isMatch("(abc_123") << std::endl; // print 0
std::cout << isMatch("abc_123)") << std::endl; // print 0
std::cout << isMatch("((abc_123)") << std::endl; // print 0
std::cout << isMatch("(abc_123))") << std::endl; // print 0
std::cout << isMatch("((abc_123))") << std::endl; // print 0
}
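If the word itself is needed and not just a yes/no answer, the same alternation can carry one capture group per branch. This is a sketch under that assumption; inner_word is a hypothetical name, and an empty string is used to signal "no match" (safe here because \w+ never matches an empty word).

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the word with any single pair of
// surrounding brackets removed, or "" if the input is not accepted.
std::string inner_word(const std::string& str) {
    // Group 1 is filled by the bracketed branch, group 2 by the bare branch.
    static const std::regex rgx("\\((\\w+)\\)|(\\w+)");
    std::smatch m;
    if (!std::regex_match(str, m, rgx)) {
        return std::string();
    }
    // Exactly one of the two groups participated in the match.
    return m[1].matched ? m[1].str() : m[2].str();
}
```

Both inner_word("abc_123") and inner_word("(abc_123)") return abc_123, while the unbalanced inputs from the question return an empty string.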

Retrieving the results from the std::tr1::regex_search

I am confused about how to fetch the result after running regex_search from std::tr1::regex.
The following sample code demonstrates my issue.
string source = "abcd 16000 ";
string exp = "abcd ([^\\s]+)";
std::tr1::cmatch res;
std::tr1::regex rx(exp);
while(std::tr1::regex_search(source.c_str(), res, rx, std::tr1::regex_constants::match_continuous))
{
//HOW TO FETCH THE RESULT???????????
std::cout <<" "<< res.str()<<endl;
source = res.suffix().str();
}
The regular expression should ideally strip off the "abcd" from the string and return 16000.
I see that the cmatch res has TWO objects. The second object contains the expected result. (This object has three members, matched, first and second, and the values are {true, "16000", " "}.)
My question is: what does the size of this object denote? Why is it 2 in this specific case (res[0] and res[1]) when I have run regex_search only once? And how do I know which object holds the expected result?
As stated here:
match[0]: represents the entire match
match[1]: represents the first capture group
match[2]: represents the second capture group, and so forth
This means match[0] should - in this case! - hold your full source (abcd 16000) as you match the whole thing, while match[1] contains the content of your capturing group.
If there was, for example, a second capturing group in your regex you'd get a third object in the match-collection and so on.
I'm a guy who understands visualized problems/solutions better, so let's do this:
See the demo at regex101.
See the two colors in the text field containing the test string?
The green color is the background for your capturing group, while the blue color represents everything else matched by the expression but not captured by any group.
In other words: blue+green is the equivalent of match[0] and green of match[1] in your case.
This way you can always know which of the objects in match refers to which capturing group:
Go through the regex from left to right and count the opening parentheses of capturing groups, starting at 1 (non-capturing constructs such as (?: and escaped \( don't count). Closing parentheses don't affect the count. The number you reach at the opening bracket of the capturing group you want to extract is the array index.
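Capture groups are numbered by the position of their opening parenthesis, left to right, nested groups included. A small sketch makes this visible; all_submatches is a hypothetical helper, and the nested pattern is made up for illustration:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every sub-match in index order,
// with index 0 being the whole match.
std::vector<std::string> all_submatches(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    std::vector<std::string> out;
    if (std::regex_search(text, m, re)) {
        for (const auto& sub : m) {
            out.push_back(sub.str());
        }
    }
    return out;
}
```

For the input "abc" and the pattern "((a)(b))(c)" the result is abc, ab, a, b, c: the outer group opens first and gets index 1, the groups inside it get 2 and 3, and the last group gets 4. A non-capturing group does not take an index, so "(?:a)(b)" on "ab" yields only ab and b.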
EDIT
Regarding your comment on checking res[0].first:
The member first of the sub_match class only denotes the position of the start of the match, while second denotes the position of the end of the match (taken from the Boost documentation).
Both are a const char* (VC++10 with cmatch) or an iterator (Boost). Streaming a char* prints everything from that position to the end of the source string, so first prints the source from the start of the match onwards (which may be the full source if the match starts at index zero) and second prints whatever follows the match.
Consider the following program (VC++10):
#include "stdafx.h"
#include <regex>
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
string source = "abcdababcdefg";
string exp = "ab";
tr1::cmatch res;
tr1::regex rx(exp);
tr1::regex_search(source.c_str(), res, rx);
for (size_t n = 0; n < res.size(); ++n)
{
std::cout << "submatch[" << n << "]: matched == " << std::boolalpha
<< res[n].matched <<
" at position " << res.position(n) << std::endl;
std::cout << " " << res.length(n)
<< " chars, value == " << res[n] << std::endl;
}
std::cout << std::endl;
cout << "res[0].first: " << res[0].first << " - res[0].second: " << res[0].second << std::endl;
cout << "res[0]: " << res[0];
cin.get();
return 0;
}
Execute it and look at the output. The first (and only) match is, obviously, the first two chars ab, so this is actually the whole matched string and the reason why res[0] == "ab".
Now, knowing that .first/.second give us substrings from the start of the match and from the end of the match onwards, the output shouldn't be confusing anymore.
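For a portable variant of the same idea, here is a minimal sketch with std::regex and std::smatch instead of tr1 and cmatch (that swap is my assumption, as is the helper name match_offsets). It turns first and second into plain offsets into the source string:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns the {start, end} offsets of the whole
// match inside source, or {-1, -1} when nothing matches.
std::pair<std::ptrdiff_t, std::ptrdiff_t> match_offsets(const std::string& source,
                                                        const std::string& pattern) {
    const std::regex rx(pattern);
    std::smatch res;
    if (!std::regex_search(source, res, rx)) {
        return std::make_pair(std::ptrdiff_t(-1), std::ptrdiff_t(-1));
    }
    // first points at the start of the match, second one past its end.
    return std::make_pair(res[0].first - source.begin(),
                          res[0].second - source.begin());
}
```

For "abcdababcdefg" and the pattern "ab" this gives {0, 2}; the text between those two offsets is exactly what res[0].str() returns.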

Regex in std c++

I want to find all occurrences of something like this '{some text}'.
My code is:
std::wregex e(L"(\\{([a-z]+)\\})");
std::wsmatch m;
std::regex_search(chatMessage, m, e);
std::wcout << "matches for '" << chatMessage << "'\n";
for (size_t i = 0; i < m.size(); ++i) {
std::wssub_match sub_match = m[i];
std::wstring sub_match_str = sub_match.str();
std::wcout << i << ": " << sub_match_str << '\n';
}
but for a string like this: L"Roses {aaa} {bbb} are {ccc} #ff0000" my output is:
0: {aaa}
1: {aaa}
2: aaa
and I don't get the next substrings. I suspect that there is something wrong with my regular expression. Does anyone see what is wrong?
You're searching once and simply looping through the groups. You instead need to search multiple times and return the correct group only. Try:
std::wregex e(L"(\\{([a-z]+)\\})");
std::wsmatch m;
std::wcout << "matches for '" << chatMessage << "'\n";
while (std::regex_search(chatMessage, m, e))
{
std::wssub_match sub_match = m[2];
std::wstring sub_match_str = sub_match.str();
std::wcout << sub_match_str << '\n';
chatMessage = m.suffix().str(); // this advances the position in the string
}
2 here is the second group, i.e. the second thing in brackets, i.e. ([a-z]+).
See this for more on groups.
There is nothing wrong with the regular expression, but you need to search for it repeatedly. And then you don't really need the parentheses anyway.
The std::regex_search finds one occurrence of the pattern. That's the {aaa}. The std::wsmatch is just that. It has 3 submatches: the whole match, the content of the outer parentheses (which is the whole match again) and the content of the inner parentheses. That's what you are seeing.
You have to call regex_search again on the rest of the string to get the next match:
std::wstring::const_iterator begin = chatMessage.begin(), end = chatMessage.end();
while (std::regex_search(begin, end, m, e)) {
    // ...
    begin = m[0].second; // continue right after the current match
}
The index operator on a match_results object (here std::wsmatch) returns the matching substring at that index. When the index is 0 it returns the entire matching string, which is why the first line of output is {aaa}. When the index is 1 it returns the contents of the first capture group, that is, the text matched by the part of the regular expression that is between the first ( and the corresponding ). In this example, those are the outermost parentheses, which once again produces {aaa}. When the index is 2 it returns the contents of the second capture group, i.e., the text between the second ( and its corresponding ), which gives you the aaa.
The easiest way to search again from where you left off is to use an iterator:
std::wsregex_iterator it(chatMessage.begin(), chatMessage.end(), e);
for ( ; it != std::wsregex_iterator(); ++it) {
    std::wcout << it->str() << L'\n'; // whole match, e.g. {aaa}; use (*it)[2] for the inner text
}
(note: this is a sketch, not tested)
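A more complete version of the iterator approach might look like the sketch below. brace_contents is a hypothetical helper name, and the pattern is simplified to a single capture group, on the assumption that only the text inside the braces is wanted.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the text inside every {...} occurrence.
std::vector<std::wstring> brace_contents(const std::wstring& message) {
    // One capture group: the lowercase letters between the braces.
    static const std::wregex e(L"\\{([a-z]+)\\}");
    std::vector<std::wstring> found;
    // A default-constructed iterator marks the end of the sequence.
    for (std::wsregex_iterator it(message.begin(), message.end(), e), end; it != end; ++it) {
        found.push_back((*it)[1].str());
    }
    return found;
}
```

For L"Roses {aaa} {bbb} are {ccc} #ff0000" this returns aaa, bbb and ccc. Unlike the loop that reassigns chatMessage to the suffix, the iterator leaves the input string untouched.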

boost regex sub-string match

I want to return output "match" if the pattern "regular" is a sub-string of variable st. Is this possible?
int main()
{
string st = "some regular expressions are Regxyzr";
boost::regex ex("[Rr]egular");
if (boost::regex_match(st, ex))
{
cout << "match" << endl;
}
else
{
cout << "not match" << endl;
}
}
boost::regex_match only matches the whole string; you probably want boost::regex_search instead.
regex_search does what you want; regex_match is documented as "determines whether a given regular expression matches all of a given character sequence" (the emphasis is in the original URL I'm quoting from).
Your question is answered with an example in the library documentation - boost::regex
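The difference is easy to see side by side. The sketch below uses std::regex rather than Boost so that it compiles without extra libraries (that swap is my assumption; boost::regex_search and boost::regex_match are used the same way here), and both function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the pattern matches anywhere inside st.
bool contains_pattern(const std::string& st, const std::string& pattern) {
    return std::regex_search(st, std::regex(pattern));
}

// Hypothetical helper: true only if the pattern matches the whole of st.
bool is_entirely_pattern(const std::string& st, const std::string& pattern) {
    return std::regex_match(st, std::regex(pattern));
}
```

With st = "some regular expressions are Regxyzr" and the pattern "[Rr]egular", contains_pattern returns true and is_entirely_pattern returns false, which is why the code in the question prints "not match".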
Alternate approach:
You can use boost::sregex_iterator; this is useful for parsing files, etc.
Below, st.begin() and st.end() are the start and end iterators of the string, and ex is the boost::regex from the question.
Ex:
boost::sregex_iterator stIter(st.begin(), st.end(), ex);
boost::sregex_iterator endIter; // default-constructed: end of sequence
for (; stIter != endIter; ++stIter)
{
    cout << " Whole match " << (*stIter)[0] << endl;
    cout << " First sub-group " << (*stIter)[1] << endl; // only meaningful if the pattern has a capture group
}