C++11 VS12 regex_search - c++

I'm trying to retrieve numbers from a string. The string has a format like _0_1_, and I want to get 0 and 1.
Here is my code:
std::tr1::regex rx("_(\\d+)_");
tstring fileName = Utils::extractFileName(docList[i]->c_str());
std::tr1::smatch res;
std::tr1::regex_search(fileName, res, rx);
but as the result I get (UPDATED: these strange values come from the debugger watch window):
res[0] = 3
res[1] = 1
Where did the 3 come from, and what am I doing wrong?
UPDATED:
I output results to the screen:
for (std::tr1::smatch::iterator it = res.begin(); it < res.end(); ++it){
std::cout << *it << std::endl;
}
And the program output:
_0_
0

A regexp normally returns all non-overlapping matches. If you require _ both before and after each number, you will not get all the numbers, because the underscore after the first number has already been consumed and cannot also serve as the underscore before the second number:
_123_456_
^
This cannot be used twice
Just use (\\d+) as expression to get all numbers (regexp is "greedy" by default so all the available digits will be found anyway).
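To see the consumed underscore in action, here is a minimal sketch that lists every non-overlapping match of a pattern using std::sregex_iterator. The helper name whole_matches is made up for this illustration, and it uses std::regex rather than the std::tr1 names from the question.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the full text of every non-overlapping
// match of `pattern` in `text`, in order of appearance.
std::vector<std::string> whole_matches(const std::string& text,
                                       const std::string& pattern) {
    const std::regex rx(pattern);
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), rx), last;
         it != last; ++it) {
        found.push_back(it->str());  // str() with no argument is the whole match
    }
    return found;
}
```

Called on "_123_456_", the pattern "_(\\d+)_" yields only _123_, because the second underscore is already part of that match, while "\\d+" yields 123 and 456.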

This appears to be the expected output. The first match should be the entire substring which matched, and then the second (and so forth) should be the capture groups.
If you'd like to go through all matches, you'll need to call regex_search multiple times to get each match:
auto it = fileName.cbegin();
while (std::tr1::regex_search(it, fileName.cend(), res, rx)) {
std::cout << "Found matching group:" << std::endl;
for (int mm = 1; mm < res.size(); ++mm) {
std::cout << std::string(res[mm].first, res[mm].second) << std::endl;
}
it = res[0].second; // start 1 past the end
}
If you do really need only the numbers "wrapped" in underscores, you can use a positive assertion (?=_) to ensure this occurs:
// positive assertions are required matches, but are not consumed by the
// matching group.
std::tr1::regex rx("_(\\d+)(?=_)");
Which, when run against "//abc_1_2_3.txt", retrieves 1 and 2, but not 3.
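As a rough check of that claim, the lookahead pattern can be wrapped in a small function. This is only a sketch: wrapped_numbers is a hypothetical name, and it iterates with std::sregex_iterator instead of the manual regex_search loop shown earlier.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the digits of every "_<digits>" that is
// immediately followed by another underscore. The lookahead (?=_) checks
// for that underscore without consuming it, so the next match can start there.
std::vector<std::string> wrapped_numbers(const std::string& text) {
    static const std::regex rx("_(\\d+)(?=_)");
    std::vector<std::string> numbers;
    for (std::sregex_iterator it(text.begin(), text.end(), rx), last;
         it != last; ++it) {
        numbers.push_back((*it)[1].str());  // capture group 1: the digits
    }
    return numbers;
}
```

For the original "_0_1_" format this returns both 0 and 1, since the shared underscore is no longer consumed by the first match.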

Solution:
Thanks to all. I rewrote it with the help of regex_token_iterator and (\\d+). Now it works:
std::regex_token_iterator<tstring::iterator> rend;
tstring fileName = Utils::extractFileName(docList[i]->c_str());
std::tr1::regex_search(fileName, res, rx);
for (std::regex_token_iterator<std::string::iterator> it(fileName.begin(), fileName.end(), rx); it != rend; ++it) {
std::cout << " [" << *it << "]";
}
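For readers without the surrounding project code (tstring, Utils and docList are not shown here), a self-contained sketch of the same token-iterator idea could look like this. The function name extract_numbers is an assumption made for the example, and plain std::string is used in place of tstring.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every run of digits in `text`.
std::vector<std::string> extract_numbers(const std::string& text) {
    static const std::regex rx("\\d+");
    std::vector<std::string> numbers;
    // With no submatch index given, the token iterator yields the whole match
    // (submatch 0) for each hit; a default-constructed iterator marks the end.
    std::sregex_token_iterator it(text.begin(), text.end(), rx);
    const std::sregex_token_iterator last;
    for (; it != last; ++it) {
        numbers.push_back(it->str());
    }
    return numbers;
}
```

The string passed in must outlive the iteration, which is why the function takes it by const reference and finishes the loop before returning.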

Related

Avoid extra matches from Regex_search

Very new to the c++ regex libraries.
We are trying to parse a line
*10 abc
We want to parse/split this line into only two tokens:
10
abc
I have tried multiple things such as regex_search, but I get 3 matches: the first is the whole match, and the second and third are the sub-sequence matches.
How can we get only two matches (10 and abc) from the above string? A snapshot of what I have tried:
#include <regex>
#include <iostream>
int main() {
const std::string t = "*10 abc";
std::regex rgxx("\\*(\\d+)\\s+(.+)");
std::smatch match;
bool matched1 = std::regex_search(t.begin(), t.end(), match, rgxx);
std::cout << "Matched size " << match.size() << std::endl;
for(int i = 0 ; i < match.size(); ++i) {
std::cout << i << " match " << match[i] << std::endl;
}
}
Output:
Matched size 3
0 match *10 abc
1 match 10
2 match abc
0 match is the one which I do not want.
I am open to use boost libraries/regexes as well. Thank you.
There is nothing really wrong with your code per se. Match zero is just the entire substring that matched the regex pattern. If you only want the two captured terms, then just print the first and second capture groups:
const std::string t = "*10 abc";
std::regex rgxx("(\\d+)\\s+(.+)");
std::smatch match;
bool matched1 = std::regex_search(t.begin(), t.end(), match, rgxx);
std::cout << "Matched size " << match.size() << std::endl;
for (int i=1; i < match.size(); ++i) {
std::cout << i << " match " << match[i] << std::endl;
}
Matched size 3
1 match 10
2 match abc
So, the lesson here is that the first entry in the match array (index zero) is always the entire matched substring, and the capture groups start at index 1.
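If skipping index 0 by hand feels awkward, another option is std::sregex_token_iterator, which can be told which submatches to yield. The sketch below is one possible way to do it; capture_groups_only is a hypothetical name, and the pattern is the one from the question (including the leading \\*).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns only capture groups 1 and 2 of the match,
// never the whole match (index 0).
std::vector<std::string> capture_groups_only(const std::string& line) {
    static const std::regex rx("\\*(\\d+)\\s+(.+)");
    std::vector<std::string> tokens;
    // The list {1, 2} selects which submatches the iterator produces.
    std::sregex_token_iterator it(line.begin(), line.end(), rx, {1, 2});
    const std::sregex_token_iterator last;
    for (; it != last; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

For "*10 abc" this gives the two tokens 10 and abc; a line that does not match the pattern gives an empty vector.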

Is it possible to find two strings in one string using regular expressions? [duplicate]

I'm a bit confused about the following C++11 code:
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::string haystack("abcdefabcghiabc");
std::regex needle("abc");
std::smatch matches;
std::regex_search(haystack, matches, needle);
std::cout << matches.size() << std::endl;
}
I'd expect it to print out 3 but instead I get 1. Am I missing something?
You get 1 because regex_search returns only 1 match, and size() will return the number of capture groups + the whole match value.
Your matches is...:
Object of a match_results type (such as cmatch or smatch) that is filled by this function with information about the match results and any submatches found.
If [the regex search is] successful, it is not empty and contains a series of sub_match objects: the first sub_match element corresponds to the entire match, and, if the regex expression contained sub-expressions to be matched (i.e., parentheses-delimited groups), their corresponding sub-matches are stored as successive sub_match elements in the match_results object.
Here is a code that will find multiple matches:
#include <string>
#include <iostream>
#include <regex>
using namespace std;
int main() {
string str("abcdefabcghiabc");
int i = 0;
regex rgx1("abc");
smatch smtch;
while (regex_search(str, smtch, rgx1)) {
std::cout << i << ": " << smtch[0] << std::endl;
i += 1;
str = smtch.suffix().str();
}
return 0;
}
See IDEONE demo returning abc 3 times.
As this method destroys the input string, here is another alternative based on the std::sregex_iterator (std::wsregex_iterator should be used when your subject is an std::wstring object):
int main() {
std::regex r("ab(c)");
std::string s = "abcdefabcghiabc";
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << "Match value: " << m.str() << " at Position " << m.position() << '\n';
std::cout << " Capture: " << m[1].str() << " at Position " << m.position(1) << '\n';
}
return 0;
}
See IDEONE demo, returning
Match value: abc at Position 0
Capture: c at Position 2
Match value: abc at Position 6
Capture: c at Position 8
Match value: abc at Position 12
Capture: c at Position 14
What you're missing is that matches is populated with one entry for each capture group (including the entire matched substring as the 0th capture).
If you write
std::regex needle("a(b)c");
then you'll get matches.size()==2, with matches[0]=="abc", and matches[1]=="b".
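The relationship between capture groups and size() is easy to check with a short sketch. The helper entries_after_one_search is a made-up name used only for this illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: runs a single regex_search and reports how many
// entries the resulting match_results holds.
std::size_t entries_after_one_search(const std::string& haystack,
                                     const std::string& pattern) {
    const std::regex needle(pattern);
    std::smatch matches;
    std::regex_search(haystack, matches, needle);
    // 0 if nothing matched; otherwise 1 (whole match) + number of capture groups.
    return matches.size();
}
```

On "abcdefabcghiabc", the pattern "abc" reports 1 and "a(b)c" reports 2, regardless of how many times the pattern occurs in the string.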
EDIT: Some people have downvoted this answer. That may be for a variety of reasons, but if it is because it does not apply to the answer I criticized (no one left a comment to explain the decision), they should take note that W. Stribizew changed the code two months after I wrote this, and I was unaware of it until today, 2021-01-18. The rest of the answer is unchanged from when I first wrote it.
@stribizhev's solution has quadratic worst case complexity for sane regular expressions. For insane ones (e.g. "y*"), it doesn't terminate. In some applications, these issues could be DoS attacks waiting to happen. Here's a fixed version:
string str("abcdefabcghiabc");
int i = 0;
regex rgx1("abc");
smatch smtch;
auto beg = str.cbegin();
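while (regex_search(beg, str.cend(), smtch, rgx1)) {
std::cout << i << ": " << smtch[0] << std::endl;
i += 1;
// Resume right after this match. Advancing by length(0) alone would ignore
// the text skipped before the match and could report the same match twice.
beg = smtch[0].second;
if ( smtch.length(0) == 0 ) {
// Empty match: step one character so the loop always makes progress.
if ( beg == str.cend() )
break;
++beg;
}
}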
According to my personal preference, this will find n+1 matches of an empty regex in a string of length n. You might also just exit the loop after an empty match.
If you want to compare the performance for a string with millions of matches, add the following lines after the definition of str (and don't forget to turn on optimizations), once for each version:
for (int j = 0; j < 20; ++j)
str = str + str;
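For comparison, std::sregex_iterator also walks the string once from left to right and steps past empty matches on its own, so it can serve as a reference when checking a hand-written loop. The following is a minimal sketch; count_matches is a hypothetical helper, not something from the answers above.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts the non-overlapping matches of `pattern`
// in `text` without copying or modifying the input string.
std::size_t count_matches(const std::string& text, const std::string& pattern) {
    const std::regex rx(pattern);
    const std::sregex_iterator first(text.begin(), text.end(), rx);
    const std::sregex_iterator last;  // default-constructed: end of sequence
    return static_cast<std::size_t>(std::distance(first, last));
}
```

With "abcdefabcghiabc" and "abc" it counts 3 matches, and with a pattern that can match the empty string, such as "y*" on "abc", it counts one empty match per position, including the end of the string.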

c++11 regex : check if a set of characters exist in a string

If for example, I have the string: "asdf{ asdf }",
I want to check if the string contains any character in the set []{}().
How would I go about doing this?
I'm looking for a general solution that checks if the string has the characters in the set, so that I can continue to add lookup characters in the set in the future.
Your question is unclear on whether you only want to detect if any of the characters in the search set are present in the input string, or whether you want to find all matches.
In either case, use std::regex to create the regular expression object. Because all the characters in your search set have special meanings in regular expressions, you'll need to escape all of them.
std::regex r{R"([\[\]\{\}\(\)])"};
char const *str = "asdf{ asdf }";
If you want to only detect whether at least one match was found, use std::regex_search.
std::cmatch results;
if(std::regex_search(str, results, r)) {
std::cout << "match found\n";
}
On the other hand, if you want to find all the matches, use std::regex_iterator.
std::cmatch results;
auto first = std::cregex_iterator(str, str + std::strlen(str), r);
auto last = std::cregex_iterator();
if(first != last) std::cout << "match found\n";
while(first != last) {
std::cout << (*first++).str() << '\n';
}
Live demo
I know you are asking about regex, but this specific problem can be solved without it using std::string::find_first_of(), which finds the position of the first character in the string (s) that is contained in a set (g):
#include <string>
#include <iostream>
int main()
{
std::string s = "asdf{ asdf }";
std::string g = "[]{}()";
// Does the string contain one of the characters?
if(s.find_first_of(g) != std::string::npos)
std::cout << s << " contains one of " << g << '\n';
// find the position of each occurrence of the characters in the string
for(size_t pos = 0; (pos = s.find_first_of(g, pos)) != std::string::npos; ++pos)
std::cout << s << " contains " << s[pos] << " at " << pos << '\n';
}
OUTPUT:
asdf{ asdf } contains one of []{}()
asdf{ asdf } contains { at 4
asdf{ asdf } contains } at 11

Retrieving the results from the std::tr1::regex_search

I have a confusion on how to fetch the result after running the function regex_search in the std::tr1::regex.
Following is a sample code to demonstrate my issue.
string source = "abcd 16000 ";
string exp = "abcd ([^\\s]+)";
std::tr1::cmatch res;
std::tr1::regex rx(exp);
while(std::tr1::regex_search(source.c_str(), res, rx, std::tr1::regex_constants::match_continuous))
{
//HOW TO FETCH THE RESULT???????????
std::cout <<" "<< res.str()<<endl;
source = res.suffix().str();
}
The regular expression mentioned should ideally strip off the "abcd" from the string and return me 16000.
I see that the cmatch res has TWO objects. The second object contains the expected result (this object has three members: matched, first, second; their values are {true, "16000", " "}).
My question is what does this size of the object denote? Why is it showing 2 in this specific case( res[0] and res[1]) when I have run regex_search only once? And how do I know which object would have the expected result?
Thanks
Sunil
As stated here:
match[0]: represents the entire match
match[1]: represents the first sub-match (capture group)
match[2]: represents the second sub-match, and so forth
This means match[0] should - in this case! - hold your full source (abcd 16000) as you match the whole thing, while match[1] contains the content of your capturing group.
If there was, for example, a second capturing group in your regex you'd get a third object in the match-collection and so on.
I'm a guy who understands visualized problems/solutions better, so let's do this:
See the demo at regex101.
See the two colors in the textfield containing the teststring?
The green color is the background for your capturing group while the
blue color represents everything else generally matched by the expression, but not captured by any group.
In other words: blue+green is the equivalent for match[0] and green for match[1] in your case.
This way you can always know which of the objects in match refers to which capturing group:
You initialize a counter in your head, starting at 0. Now go through the regex from left to right and add 1 for each opening ( of a capturing group (skip non-capturing constructs such as (?: and (?=) until you have counted the opening bracket of the capturing group you want to extract. Closing brackets do not change the count. The number in your head is the array index.
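A quick way to confirm which index a group ends up with is to dump every entry of the match. The sketch below does that; all_entries is a hypothetical helper, and it uses std::regex with std::smatch rather than the tr1 names from the question.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every entry of the match_results for the
// first match of `pattern` in `text` (index 0 is the whole match).
std::vector<std::string> all_entries(const std::string& text,
                                     const std::string& pattern) {
    const std::regex rx(pattern);
    std::smatch m;
    std::vector<std::string> entries;
    if (std::regex_search(text, m, rx)) {
        for (std::size_t i = 0; i < m.size(); ++i) {
            entries.push_back(m[i].str());
        }
    }
    return entries;
}
```

With nested groups, for example "((a)(b))(c)" applied to "abc", the entries come out as abc, ab, a, b, c: each group's index follows the position of its opening parenthesis.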
EDIT
Regarding your comment on checking res[0].first:
The member first of the sub_match class is only
denoting the position of the start of the match.
While second denotes the position of the end of the match.
(taken from boost doc)
Both return a char* (VC++10) or an iterator (Boost), thus you get a substring of the sourcestring as the output (which may be the full source in case the match starts at index zero!).
Consider the following program (VC++10):
#include "stdafx.h"
#include <regex>
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
string source = "abcdababcdefg";
string exp = "ab";
tr1::cmatch res;
tr1::regex rx(exp);
tr1::regex_search(source.c_str(), res, rx);
for (size_t n = 0; n < res.size(); ++n)
{
std::cout << "submatch[" << n << "]: matched == " << std::boolalpha
<< res[n].matched <<
" at position " << res.position(n) << std::endl;
std::cout << " " << res.length(n)
<< " chars, value == " << res[n] << std::endl;
}
std::cout << std::endl;
cout << "res[0].first: " << res[0].first << " - res[0].second: " << res[0].second << std::endl;
cout << "res[0]: " << res[0];
cin.get();
return 0;
}
Execute it and look at the output. The first (and only) match is - obviously - the first two chars ab, so this is actually the whole matched string and the reason why res[0] == "ab".
Now, knowing that .first/.second give us substrings from the start of the match and from the end of the match onwards, the output shouldn't be confusing anymore.
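The difference between printing .first or .second directly and using them as a range can be sketched as follows. MatchViews and match_views are names invented for this example, and std::regex/std::cmatch stand in for the tr1 equivalents.

```cpp
#include <regex>
#include <string>

// Hypothetical helper type: three views of the first match in a C string.
struct MatchViews {
    std::string from_first;   // res[0].first read as a C string: match start to end of source
    std::string from_second;  // res[0].second read as a C string: everything after the match
    std::string matched;      // the range [first, second): only the matched text
};

MatchViews match_views(const std::string& source, const std::string& pattern) {
    const std::regex rx(pattern);
    std::cmatch res;
    MatchViews views;
    if (std::regex_search(source.c_str(), res, rx)) {
        views.from_first = res[0].first;
        views.from_second = res[0].second;
        views.matched.assign(res[0].first, res[0].second);
    }
    return views;
}
```

For "abcdababcdefg" and "ab", from_first is the whole source (the match starts at index zero), from_second is "cdababcdefg", and only the range between the two is the actual match "ab".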

Regex in std c++

I want to find all occurrences of something like this '{some text}'.
My code is:
std::wregex e(L"(\\{([a-z]+)\\})");
std::wsmatch m;
std::regex_search(chatMessage, m, e);
std::wcout << "matches for '" << chatMessage << "'\n";
for (size_t i = 0; i < m.size(); ++i) {
std::wssub_match sub_match = m[i];
std::wstring sub_match_str = sub_match.str();
std::wcout << i << ": " << sub_match_str << '\n';
}
but for a string like this: L"Roses {aaa} {bbb} are {ccc} #ff0000" my output is:
0: {aaa}
1: {aaa}
2: aaa
and I don't get the next substrings. I suspect that there is something wrong with my regular expression. Does anyone see what is wrong?
You're searching once and simply looping through the groups. You instead need to search multiple times and return the correct group only. Try:
std::wregex e(L"(\\{([a-z]+)\\})");
std::wsmatch m;
std::wcout << "matches for '" << chatMessage << "'\n";
while (std::regex_search(chatMessage, m, e))
{
std::wssub_match sub_match = m[2];
std::wstring sub_match_str = sub_match.str();
std::wcout << sub_match_str << '\n';
chatMessage = m.suffix().str(); // this advances the position in the string
}
2 here is the second group, i.e. the second thing in brackets, i.e. ([a-z]+).
See this for more on groups.
There is nothing wrong with the regular expression, but you need to search for it repeatedly. And then you don't really need the parentheses anyway.
std::regex_search finds one occurrence of the pattern. That's the {aaa}. The std::wsmatch holds just that one match. It has 3 submatches: the whole match, the content of the outer parentheses (which is the whole match again) and the content of the inner parentheses. That's what you are seeing.
You have to call regex_search again on the rest of the string to get the next match:
std::wstring::const_iterator begin = chatMessage.begin(), end = chatMessage.end();
while (std::regex_search(begin, end, m, e)) {
// ...
begin = m[0].second; // resume just past the end of this match
}
The index operator on a match_results object returns the matching substring at that index. When the index is 0 it returns the entire matching string, which is why the first line of output is {aaa}. When the index is 1 it returns the contents of the first capture group, that is, the text matched by the part of the regular expression that is between the first ( and the corresponding ). In this example, those are the outermost parentheses, which once again produces {aaa}. When the index is 2 it returns the contents of the second capture group, i.e., the text between the second ( and its corresponding ), which gives you the aaa.
The easiest way to search again from where you left off is to use an iterator:
std::wsregex_iterator it(chatMessage.begin(), chatMessage.end(), e);
for ( ; it != std::wsregex_iterator(); ++it) {
std::wcout << it->str() << L'\n';
}
(note: this is a sketch, not tested)
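A complete version of the iterator approach might look like the following. It is a sketch under a few assumptions: braced_words is a hypothetical name, the pattern is reduced to a single capture group, and only lowercase a-z text inside the braces is recognized, as in the question.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the text inside every {...} whose content
// consists of lowercase letters a-z.
std::vector<std::wstring> braced_words(const std::wstring& message) {
    static const std::wregex rx(L"\\{([a-z]+)\\}");
    std::vector<std::wstring> words;
    for (std::wsregex_iterator it(message.begin(), message.end(), rx), last;
         it != last; ++it) {
        words.push_back((*it)[1].str());  // group 1: the text between the braces
    }
    return words;
}
```

Unlike the suffix-based loop, this leaves the input string untouched, so it can still be printed or searched again afterwards.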