How can I define a C++11/ECMAScript compatible regex statement that matches strings either:
Containing a single, closed, pair of round brackets containing an alphanumeric string of length greater than 0 - for example the regex statement "\(\w+\)", which correctly matches "(abc_123)" and ignores the incorrect "(abc_123", "abc_123)" and "abc_123". However, the above expression does not ignore input strings containing multiple balanced/unbalanced bracketing - I would like to exclude "((abc_123)", "(abc_123))", and "((abc_123))" from my matched results.
Or a single, alphanumeric word, without any unbalanced brackets - for example something like the regex statement "\w+" correctly matches "abc_123", but unfortunately incorrectly matches with "(abc_123", "abc_123)", "((abc_123)", "(abc_123))", and "((abc_123))"...
For clarity, the required matchings for each of the test cases above are:
"abc_123" = Match,
"(abc_123)" = Match,
"(abc_123" = Not matched,
"abc_123)" = Not matched,
"((abc_123)" = Not matched,
"(abc_123))" = Not matched,
"((abc_123))" = Not matched.
I've been playing around with implementing the IfThenElse format suggested by http://www.regular-expressions.info/conditional.html, but haven't gotten very far... Is there some way to limit the number of occurrences of a particular group [e.g. "(\(){0,1}" matches zero or one left hand round bracket], AND pass the number of repetitions of a previous group to a later group [say "num\1" equals the number of times the "(" bracket appears in "(\(){0,1}", then I could pass this to the corresponding closing bracket group, "(\)){num\1}" say...]
Probably not what you want, I suppose, and not really elegant, but...
With "or" (|) you should obtain a better-than-nothing solution based on "\\(\\w+\\)|\\w+".
A full example follows
#include <iostream>
#include <regex>
#include <string>
bool isMatch (std::string const & str)
{
static std::regex const
rgx { "\\(\\w+\\)|\\w+" };
std::smatch srgx;
return std::regex_match(str, srgx, rgx);
}
int main()
{
std::cout << isMatch("abc_123") << std::endl; // print 1
std::cout << isMatch("(abc_123)") << std::endl; // print 1
std::cout << isMatch("(abc_123") << std::endl; // print 0
std::cout << isMatch("abc_123)") << std::endl; // print 0
std::cout << isMatch("((abc_123)") << std::endl; // print 0
std::cout << isMatch("(abc_123))") << std::endl; // print 0
std::cout << isMatch("((abc_123))") << std::endl; // print 0
}
I'm looking for a regex pattern that returns true if found 7 numbers on given string. There's no order so if a string is set to: "100 my, str1ng y000" it catches that.
A regex that only searches for seven digits won't enforce an exact count for you: it returns true even if there are more than 7 digits in the string, because it only needs to find at least 7 digits.
You can use below code to test exact number (7 in your case) of digits in any string:
var temp = "100 my, str1ng y000 3c43fdgd";
var count = (temp.match(/\d/g) || []).length;
alert(count == 7);
I will show you a C++ example that:
Shows a regex for extracting digit groups
Shows a regex for matching at least 7 digits
Shows, if there is a match for the requested predicate
Shows the number of digits in the string (no regex needed)
Shows the group of digits
#include <algorithm>
#include <cctype>
#include <iostream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>
// Our test data
std::string testData("100 my, str1ng y000");
std::regex re1(R"#((\d+))#"); // For extracting digit groups
std::regex re2(R"#((?:\D*\d){7,}\D*)#"); // For regex_match: at least 7 digits anywhere in the string
int main(void)
{
// Define the variable id as vector of string and use the range constructor to read the test data and tokenize it
std::vector<std::string> id{ std::sregex_token_iterator(testData.begin(), testData.end(), re1, 1), std::sregex_token_iterator() };
// Match the regex. Should have at least 7 digits somewhere
std::smatch base_match;
bool containsAtLeast7Digits = std::regex_match(testData, base_match, re2);
// Show result on screen
std::cout << "\nEvaluating string '" << testData <<
"'\n\nThe predicate 'contains-at-least-7-digits' is " << std::boolalpha << containsAtLeast7Digits <<
"\n\nIt contains overall " <<
std::count_if(
testData.begin(),
testData.end(),
[](const char c) {
return std::isdigit(static_cast<unsigned char>(c)) != 0; // cast avoids undefined behavior for negative char values
}
) << " digits and " << id.size() << " digit groups. These are:\n\n";
// Print complete vector to std::cout
std::copy(id.begin(), id.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
return 0;
}
Please note: use std::count_if (as above) for counting the digits. It is faster and easier than a regex.
Hope this helps . . .
Basic regex question.
By default, regular expressions are greedy, it seems. For example, the code below:
#include <regex>
#include <iostream>
int main() {
const std::string t = "*1 abc";
std::smatch match;
std::regex rgxx("\\*(\\d+?)\\s+(.+?)$");
bool matched1 = std::regex_search(t.begin(), t.end(), match, rgxx);
std::cout << "Matched size " << match.size() << std::endl;
for(int i = 0 ; i < match.size(); ++i) {
std::cout << i << " match " << match[i] << std::endl;
}
}
This will produce an output of:
Matched size 3
0 match *1 abc
1 match 1
2 match abc
As a general regular expression writer, I would have expected only
1 match 1
2 match abc
to come. First match is coming because of regex greediness, I think. How is it avoidable?
From std::regex_search: match[0] is not the result of greedy evaluation, but is the range of the entire match. The match elements [1, n) are the capture groups.
Here's an illustration of what the match results mean:
regex "hello ([\\w]+)"
string = "Oh, hello John!"
match[0] = "hello John" // matches the whole regex above
match[1] = "John" // the first capture group
You only have one match. That match has 2 "marked subexpressions", because that's what the regex specifies. You don't have multiple matches of that regex.
From std::regex_search
m.size(): number of marked subexpressions plus 1, that is, 1+rgxx.mark_count()
If you are looking for multiple matches, use std::regex_iterator
I'm a bit confused about the following C++11 code:
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::string haystack("abcdefabcghiabc");
std::regex needle("abc");
std::smatch matches;
std::regex_search(haystack, matches, needle);
std::cout << matches.size() << std::endl;
}
I'd expect it to print out 3 but instead I get 1. Am I missing something?
You get 1 because regex_search returns only 1 match, and size() will return the number of capture groups + the whole match value.
Your matches is...:
Object of a match_results type (such as cmatch or smatch) that is filled by this function with information about the match results and any submatches found.
If [the regex search is] successful, it is not empty and contains a series of sub_match objects: the first sub_match element corresponds to the entire match, and, if the regex expression contained sub-expressions to be matched (i.e., parentheses-delimited groups), their corresponding sub-matches are stored as successive sub_match elements in the match_results object.
Here is code that will find multiple matches:
#include <string>
#include <iostream>
#include <regex>
using namespace std;
int main() {
string str("abcdefabcghiabc");
int i = 0;
regex rgx1("abc");
smatch smtch;
while (regex_search(str, smtch, rgx1)) {
std::cout << i << ": " << smtch[0] << std::endl;
i += 1;
str = smtch.suffix().str();
}
return 0;
}
See IDEONE demo returning abc 3 times.
As this method destroys the input string, here is another alternative based on the std::sregex_iterator (std::wsregex_iterator should be used when your subject is an std::wstring object):
#include <iostream>
#include <regex>
#include <string>
int main() {
std::regex r("ab(c)");
std::string s = "abcdefabcghiabc";
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << "Match value: " << m.str() << " at Position " << m.position() << '\n';
std::cout << " Capture: " << m[1].str() << " at Position " << m.position(1) << '\n';
}
return 0;
}
See IDEONE demo, returning
Match value: abc at Position 0
Capture: c at Position 2
Match value: abc at Position 6
Capture: c at Position 8
Match value: abc at Position 12
Capture: c at Position 14
What you're missing is that matches is populated with one entry for each capture group (including the entire matched substring as the 0th capture).
If you write
std::regex needle("a(b)c");
then you'll get matches.size()==2, with matches[0]=="abc", and matches[1]=="b".
EDIT: Some people have downvoted this answer. That may be for a variety of reasons, but if it is because it does not apply to the answer I criticized (no one left a comment to explain the decision), they should take note that W. Stribizew changed the code two months after I wrote this, and I was unaware of it until today, 2021-01-18. The rest of the answer is unchanged from when I first wrote it.
@stribizhev's solution has quadratic worst case complexity for sane regular expressions. For insane ones (e.g. "y*"), it doesn't terminate. In some applications, these issues could be DoS attacks waiting to happen. Here's a fixed version:
string str("abcdefabcghiabc");
int i = 0;
regex rgx1("abc");
smatch smtch;
auto beg = str.cbegin();
while (regex_search(beg, str.cend(), smtch, rgx1)) {
    std::cout << i << ": " << smtch[0] << std::endl;
    i += 1;
    beg = smtch[0].second; // resume right after this match, wherever it started
    if ( smtch.length(0) == 0 ) {
        // Empty match: step one character forward, or stop at the end of the string
        if ( beg == str.cend() )
            break;
        ++beg;
    }
}
According to my personal preference, this will find n+1 matches of an empty regex in a string of length n. You might also just exit the loop after an empty match.
If you want to compare the performance for a string with millions of matches, add the following lines after the definition of str (and don't forget to turn on optimizations), once for each version:
for (int j = 0; j < 20; ++j)
str = str + str;
Why does this regex return an extra match of an empty string with std::regex_match?
std::regex trim_comments_spaces("^\\s*(?:(?:(.*?)\\s*[/]{2,}.*)|(?:(.*?)\\s*))$");
It seems to give the right matches, but I have to access the third element of the std::smatch results, which makes me suspicious that I got the alternation/grouping/capturing syntax slightly wrong.
std::string trim_line(std::string current_line) {
std::string trimmed_line = "";
if (current_line != "#include <glsl.h>") {
std::regex trim_comments_spaces("^\\s*(?:(?:(.*?)\\s*[/]{2,}.*)|(?:(.*?)\\s*))$");
std::smatch sub_matches;
if (std::regex_match(current_line, sub_matches, trim_comments_spaces)) {
std::cout << sub_matches.size() << "\n";
std::string sub_string = sub_matches[2].str();
if (sub_string != "") {
std::regex validate_line("^(?:(?:[a-z][a-zA-Z0-9\\s_+*\\-/=><&|^?:{().,[\\]]*[;{})])|[}])$");
if (std::regex_match(sub_string.begin(), sub_string.end(), validate_line)) {
trimmed_line = sub_string;
}
else {
std::cout << "Syntax error(2): " << sub_string << "\n";
}
}
}
else {
std::cout << "Syntax error(1): " << current_line << "\n";
}
}
return trimmed_line;
}
Your regex, once executed against a matching string, will fetch you a smatch object having 3 groups:
1) 0th group - the whole match,
2) 1st group - (.*?) in ^\\s*(?:(?:(.*?)\\s*[/]{2,}.*)|
3) 2nd group - (.*?) in (?:(.*?)\\s*))$
Whether or not a group participated in the match, every (...) you define in the pattern gets an entry in the results: it either holds the captured value or stays empty. The exceptions are identically named groups and branch reset, but std::regex supports neither. With Boost you could use "^\\s*(?|(?:(.*?)\\s*[/]{2,}.*)|(?:(.*?)\\s*))$" (note the (?| construct), and then all the values you need will be in Group 1.
If you use your current code, you can concatenate groups 1 and 2 as one of them will always be empty.
std::string sub_string = sub_matches[1].str() + sub_matches[2].str();
See the C++ demo
I am confused about how to fetch the result after running the function regex_search in std::tr1::regex.
Following is a sample code to demonstrate my issue.
string source = "abcd 16000 ";
string exp = "abcd ([^\\s]+)";
std::tr1::cmatch res;
std::tr1::regex rx(exp);
while(std::tr1::regex_search(source.c_str(), res, rx, std::tr1::regex_constants::match_continuous))
{
//HOW TO FETCH THE RESULT???????????
std::cout <<" "<< res.str()<<endl;
source = res.suffix().str();
}
The regular expression mentioned should ideally strip off the "abcd" from the string and return me 16000.
I see that the cmatch res has TWO objects. The second object contains the expected result: it has three members (matched, first, second), and their values are {true, "16000", " "}.
My question is what does this size of the object denote? Why is it showing 2 in this specific case( res[0] and res[1]) when I have run regex_search only once? And how do I know which object would have the expected result?
Thanks
Sunil
As stated here:
match[0]: represents the entire match
match[1]: represents the first match
match[2]: represents the second match, and so forth
This means match[0] should - in this case! - hold your full source (abcd 16000) as you match the whole thing, while match[1] contains the content of your capturing group.
If there was, for example, a second capturing group in your regex you'd get a third object in the match-collection and so on.
I'm a guy who understands visualized problems/solutions better, so let's do this:
See the demo at regex101.
See the two colors in the text field containing the test string? The green color is the background for your capturing group, while the blue color represents everything else matched by the expression but not captured by any group.
In other words: blue+green is the equivalent for match[0] and green for match[1] in your case.
This way you can always know which of the objects in match refers to which capturing group:
Start a counter in your head at 0. Now go through the regex from left to right and add 1 for each opening bracket ( that starts a capturing group (escaped brackets such as \( and non-capturing groups such as (?: do not count), until you reach the opening bracket of the capturing group you want to extract. Closing brackets never change the count. The number in your head is the array index.
EDIT
Regarding your comment on checking res[0].first:
The member first of the sub_match class is only
denoting the position of the start of the match.
While second denotes the position of the end of the match.
(taken from boost doc)
Both return a char* (VC++10) or an iterator (Boost), thus you get a substring of the sourcestring as the output (which may be the full source in case the match starts at index zero!).
Consider the following program (VC++10):
#include "stdafx.h"
#include <regex>
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
string source = "abcdababcdefg";
string exp = "ab";
tr1::cmatch res;
tr1::regex rx(exp);
tr1::regex_search(source.c_str(), res, rx);
for (size_t n = 0; n < res.size(); ++n)
{
std::cout << "submatch[" << n << "]: matched == " << std::boolalpha
<< res[n].matched <<
" at position " << res.position(n) << std::endl;
std::cout << " " << res.length(n)
<< " chars, value == " << res[n] << std::endl;
}
std::cout << std::endl;
cout << "res[0].first: " << res[0].first << " - res[0].second: " << res[0].second << std::endl;
cout << "res[0]: " << res[0];
cin.get();
return 0;
}
Execute it and look at the output. The first (and only) match is - obviously - the first two chars ab, so this is actually the whole matched string and the reason why res[0] == "ab".
Now, knowing that .first/.second give us substrings from the start of the match and from the end of the match onwards, the output shouldn't be confusing anymore.