C++ regex finds only 1 sub match [duplicate] - c++

// Example program
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::string strr("1.0.0.0029.443");
std::regex rgx("([0-9])");
std::smatch match;
if (std::regex_search(strr, match, rgx)) {
for(int i=0;i<match.size();i++)
std::cout << match[i] << std::endl;
}
}
this program should write
1
0
0
0
0
2
9
4
4
3
but it writes
1
1
checked it here http://cpp.sh/ and on visual studio, both same results.
Why does it find only 2 matches and why are they same?
As I understand from the answers here, regex_search stops at the first match, and the match variable holds the (sub)string needed to continue searching for further matches by calling it repeatedly. Also, since it stops at the first match, the () characters only produce sub-matches within that single result.

Being called once, regex_search returns only the first match in the match variable. The collection in match comprises the match itself and capture groups if there are any.
In order to get all matches call regex_search in a loop:
while(std::regex_search(strr, match, rgx))
{
std::cout << match[0] << std::endl;
strr = match.suffix();
}
Note that in your case the first capture group is the same as the whole match, so there is no need for the group and you may define the regex simply as [0-9] (without parentheses).
Demo: https://ideone.com/pQ6IsO
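An alternative that leaves the input string untouched is std::sregex_iterator, which steps through every match in turn. A minimal sketch of that approach follows; the helper name all_matches is made up for illustration and is not part of the standard library.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every non-overlapping match of `rgx` in `text`.
std::vector<std::string> all_matches(const std::string& text, const std::regex& rgx)
{
    std::vector<std::string> out;
    // A default-constructed sregex_iterator serves as the end-of-sequence marker.
    for (std::sregex_iterator it(text.begin(), text.end(), rgx), end; it != end; ++it)
        out.push_back(it->str());  // str() with no argument is the whole match
    return out;
}
```

Calling all_matches(strr, std::regex("[0-9]")) on the string from the question should give the ten digits in order, without modifying strr.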

Problems:
Using if only gives you one match. You need to use a while loop to find all the matches. You need to search past the previous match in the next iteration of the loop.
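std::smatch::size() returns 1 + the number of capture groups in the regex, not the number of matches found in the string. See its documentation. A std::smatch holds the whole match followed by its sub-matches: match[0] is the entire matched text, and match[1], match[2], ... are the capture groups. That is why your program prints 1 twice.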
Here's an updated version of your program:
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::string strr("1.0.0.0029.443");
std::regex rgx("([0-9])");
std::smatch match;
while (std::regex_search(strr, match, rgx)) {
std::cout << match[0] << std::endl;
strr = match.suffix();
}
}
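To make the size() point concrete, here is a small sketch that returns everything a single successful search puts into the match object. The helper name first_match_parts is an assumption for the example; the number of entries depends only on the number of capture groups in the pattern.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Illustrative helper: returns the whole match followed by each capture group
// for the first place `rgx` matches in `text` (empty vector if nothing matches).
std::vector<std::string> first_match_parts(const std::string& text, const std::regex& rgx)
{
    std::vector<std::string> parts;
    std::smatch match;
    if (std::regex_search(text, match, rgx)) {
        // match.size() is 1 (whole match) + number of capture groups.
        for (std::size_t i = 0; i < match.size(); ++i)
            parts.push_back(match[i].str());
    }
    return parts;
}
```

With the question's pattern ([0-9]) this yields two entries, both "1"; with a two-group pattern such as ([0-9])\.([0-9]) it yields three.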

Related

How to retrieve the captured substrings from a capturing group that may repeat?

I'm sorry I found it difficult to express this question with my poor English. So, let's go directly to a simple example.
Assume we have a subject string "apple:banana:cherry:durian". We want to match the subject and have $1, $2, $3 and $4 become "apple", "banana", "cherry" and "durian", respectively. The pattern I'm using is ^(\w+)(?::(.*?))*$, and $1 will be "apple" as expected. However, $2 will be "durian" instead of "banana".
Because the subject string to match doesn't need to be 4 items, for example, it could be "one:two:three", and $1 and $2 will be "one" and "three" respectively. Again, the middle item is missing.
What is the correct pattern to use in this case? By the way, I'm going to use PCRE2 in C++ codes, so there is no split, a Perl built-in function. Thanks.
If the input contains strictly items of interest separated by :, like item1:item2:item3, as the attempt in the question indicates, then you can use the regex pattern
[^:]+
which matches consecutive characters which are not :, so a substring up to the first :. That may need to capture as well, ([^:]+), depending on the overall approach. How to use this to get all such matches depends on the language.†
In C++ there are different ways to approach this. Using std::regex_iterator
#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <iostream>
int main()
{
std::string str{R"(one:two:three)"};
std::regex r{R"([^:]+)"};
std::vector<std::string> result{};
auto it = std::sregex_iterator(str.begin(), str.end(), r);
auto end = std::sregex_iterator();
for(; it != end; ++it) {
auto match = *it;
result.push_back(match[0].str());
}
std::cout << "Input string: " << str << '\n';
for(auto i : result)
std::cout << i << '\n';
}
Prints as expected.
One can also use std::regex_search, even as it returns at first match -- by iterating over the string to move the search start after every match
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string str{"one:two:three"};
std::regex r{"[^:]+"};
std::smatch res;
std::string::const_iterator search_beg( str.cbegin() );
while ( regex_search( search_beg, str.cend(), res, r ) )
{
std::cout << res[0] << '\n';
search_beg = res.suffix().first;
}
std::cout << '\n';
}
(With this string and regex we don't need the raw string literal so I've removed them here.)
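A third option, not used in the two programs above, is std::sregex_token_iterator with a sub-match index of -1, which yields the text between matches of a separator pattern rather than the matches themselves. A sketch, with split_on as a hypothetical helper name:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` at every match of `separator`.
std::vector<std::string> split_on(const std::string& text, const std::regex& separator)
{
    // Index -1 selects the stretches of text that the separator did NOT match.
    std::sregex_token_iterator first(text.begin(), text.end(), separator, -1), last;
    return std::vector<std::string>(first, last);
}
```

Here the regex describes the delimiter (for example std::regex(":")) instead of the items, which can be easier to reason about when the items themselves are hard to describe.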
† This question was initially tagged with perl (with no c++), also with an explicit mention of it in text (still there), and the original version of this answer referred to Perl with
/([^:]+)/g
The /g "modifier" is for "global," to find all matches. The // are pattern delimiters.
When this expression is bound (=~) to a variable with a target string then the whole expression returns a list of matches when used in a context in which a list is expected, which can thus be directly assigned to an array variable.
my @captures = $string =~ /[^:]+/g;
(when this is used literally as shown then the capturing () aren't needed)
Assigning to an array provides this "list context." If the matching is used in a "scalar context," in which a single value is expected, like in the condition for an if test or being assigned to a scalar variable, then a single true/false is returned (usually 1 or '', empty string).
Repeating a capture group will only capture the value of the last iteration. Instead, you might make use of the \G anchor to get consecutive matches.
If the whole string can only contain word characters separated by colons:
(?:^(?=\w+(?::\w+)+$)|\G(?!^):)\K\w+
The pattern matches:
(?: Non capture group
^ Assert start of string
(?=\w+(?::\w+)+$) Assert from the current position 1+ word characters and 1+ repetitions of : and 1+ word characters till the end of the string
| Or
\G(?!^): Assert the position at the end of the previous match, not at the start and match :
) Close non capture group
\K\w+ Forget what is matched so far, and match 1+ word characters
Regex demo
To allow only words as well from the start of the string, and allow other chars after the word chars:
\G:?\K\w+
Regex demo
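Note that \G and \K are PCRE features; the default ECMAScript grammar of std::regex does not provide them. The "last iteration only" behaviour of a repeated group is the same there, though. The sketch below illustrates it with a simplified variant of the question's pattern (\w+ in place of .*?); the function name is invented for the example.

```cpp
#include <regex>
#include <string>

// Returns what group 2 holds after matching the whole subject against
// ^(\w+)(?::(\w+))*$ ; returns an empty string if the subject does not match.
std::string last_repeated_capture(const std::string& subject)
{
    static const std::regex pattern(R"(^(\w+)(?::(\w+))*$)");
    std::smatch m;
    if (!std::regex_match(subject, m, pattern))
        return std::string();
    // Group 2 sits inside a repeated group, so only its final iteration survives.
    return m[2].str();
}
```

For "apple:banana:cherry:durian" this returns "durian", which is why iterating over separate matches is needed to recover the middle items.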

std::regex: Match string consisting of digits and white space and extract digits. How?

I want to do 2 things at the same time: Match a string against a pattern and extract groups.
The string consists of white spaces and digits. I want to match the string against this pattern. Additionally I want to extract the digits (not numbers, single digits only) using std::smatch.
I tried a lot, but no success.
For the dupe hunters: I checked many many answers on SO, but I could not find a solution.
Then I tried std::sregex_token_iterator, and the result baffled me as well. In
#include <string>
#include <regex>
#include <vector>
#include <iterator>
const std::regex re1{ R"(((?:\s*)|(\d))+)" };
const std::regex re2{ R"(\s*(\d)\s*)" };
int main() {
std::string test(" 123 45 6 ");
std::smatch sm;
bool valid1 = std::regex_match(test, sm, re1);
std::vector<std::string> v(std::sregex_token_iterator(test.begin(), test.end(), re2), {});
return 0;
}
The vector contains not only the digits, but also spaces. I would like to have digits only.
The smatch does not contain any digits.
I know, that I can first remove all whitespaces from the string, but there should be a better, one step solution.
What is the proper regex to 1. match the string against my described pattern and 2. extract all single digits into the smatch?
The pattern you need to validate is
\s*(?:\d\s*)*
See the regex demo (note I added ^ and $ to make the pattern match the whole string at the regex testing site, since you use equivalent regex_match in the code, it requires a full string match).
Next, once your string is validated with the first regex, you just need to extract any single digit:
const std::regex re2{ R"(\d)" };
// ...
std::vector<std::string> v(std::sregex_token_iterator(test.begin(), test.end(), re2), {});
Full working snippet:
#include <string>
#include <regex>
#include <vector>
#include <iterator>
#include <iostream>
const std::regex re1{ R"(\s*(?:\d\s*)*)" };
const std::regex re2{ R"(\d)" };
int main() {
std::string test(" 123 45 6 ");
std::smatch sm;
bool valid1 = std::regex_match(test, sm, re1);
std::vector<std::string> v(std::sregex_token_iterator(test.begin(), test.end(), re2), {});
for (auto i: v)
std::cout << i << std::endl;
return 0;
}
Output:
1
2
3
4
5
6
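If you would rather keep the original re2 with its capture group, std::sregex_token_iterator accepts a fourth argument that selects which sub-match to return; passing 1 yields only the captured digit, without the surrounding whitespace. The sketch below combines validation and extraction in one function. The name digits_of is made up, and returning an empty vector for invalid input is just one possible convention.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the single digits of `text`, or an empty
// vector if `text` contains anything other than digits and whitespace.
std::vector<std::string> digits_of(const std::string& text)
{
    static const std::regex valid(R"(\s*(?:\d\s*)*)");
    static const std::regex digit(R"(\s*(\d)\s*)");
    if (!std::regex_match(text, valid))
        return {};
    // The trailing 1 asks for capture group 1 of each match, not the whole match.
    std::sregex_token_iterator first(text.begin(), text.end(), digit, 1), last;
    return std::vector<std::string>(first, last);
}
```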
Alternative solution using Boost
You may use a regex that will match all digits separately only if the whole string consists of whitespaces and digits using
\G\s*(\d)(?=[\s\d]*$)
See the regex demo.
Details
\G - start of string or end of the preceding successful match
\s* - 0+ whitespaces
(\d) - a digit captured in Group 1 (we'll return only this value when passing 1 as the last argument in boost::sregex_token_iterator iter(test.begin(), test.end(), re2, 1))
(?=[\s\d]*$) - there must be any 0 or more whitespaces or digits and then the end of string immediately to the right of the current location.
See the whole C++ snippet (compiled with the -lboost_regex option):
#include <iostream>
#include <vector>
#include <boost/regex.hpp>
int main()
{
std::string test(" 123 45 6 ");
boost::regex re2(R"(\G\s*(\d)(?=[\s\d]*$))");
boost::sregex_token_iterator iter(test.begin(), test.end(), re2, 1);
boost::sregex_token_iterator end;
std::vector<std::string> v(iter, end);
for (auto i: v)
std::cout << i << std::endl;
return 0;
}

C++: Matching regex, what is in smatch? [duplicate]

I'm using a modified regex example from Stroustrup C++ 4th Ed. Page 127 & 128. I'm trying to understand what is in the vector smatch matches.
$ ./a.out
AB00000-0000
AB00000-0000.-0000.
$ ./a.out
AB00000
AB00000..
It seems like the matches in parentheses () appear in match[1], match[2], ... while the total match appears in match[0].
Appreciate any insight into this.
#include <iostream>
#include <regex>
using namespace std;
int main(int argc, char *argv[])
{
// ZIP code pattern: XXddddd-dddd and variants
regex pat (R"(\w{2}\s*\d{5}(-\d{4})?)");
for (string line; getline(cin,line);) {
smatch matches; // matched strings go here
if (regex_search(line, matches, pat)) { // search for pat in line
for (auto p : matches) {
cout << p << ".";
}
}
cout << endl;
}
return 0;
}
The type of matches is a std::match_results, not a vector, but it does have an operator[].
From the reference:
If n == 0, returns a reference to the std::sub_match representing the part of the target sequence matched by the entire matched regular expression.
If n > 0 and n < size(), returns a reference to the std::sub_match representing the part of the target sequence that was matched by the nth captured marked subexpression.
where n is the argument to operator[]. So matches[0] contains the entire matched expression, and matches[1], matches[2], ... contain consecutive capture group expressions.
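A small sketch of this indexing, using the pattern from the question. An optional group that did not take part in the match still occupies its slot: its matched member is false and it converts to an empty string, which accounts for the trailing ".." in the second run above. The Group struct and the function name groups_of are illustrative, not standard names.

```cpp
#include <regex>
#include <string>
#include <vector>

struct Group {
    std::string text;  // what the group matched (empty if it did not take part)
    bool matched;      // copy of std::sub_match::matched
};

// Returns entry 0 (whole match) and entry 1 (the optional "-dddd" group)
// for the first ZIP-like code found in `line`; empty vector if none is found.
std::vector<Group> groups_of(const std::string& line)
{
    static const std::regex pat(R"(\w{2}\s*\d{5}(-\d{4})?)");
    std::vector<Group> out;
    std::smatch matches;
    if (std::regex_search(line, matches, pat)) {
        for (const auto& sub : matches)  // each element is a std::ssub_match
            out.push_back(Group{sub.str(), sub.matched});
    }
    return out;
}
```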

C++11 regex matching capturing group multiple times

Could someone please help me to extract the text between the : and the ^ symbols using a JavaScript (ECMAScript) regular expression in C++11. I do not need to capture the hw-descriptor itself - but it does have to be present in the line in order for the rest of the line to be considered for a match. Also the :p....^, :m....^ and :u....^ can arrive in any order and there has to be at least 1 present.
I tried using the following regular expression:
static const std::regex gRegex("(?:hw-descriptor)(:[pmu](.*?)\\^)+", std::regex::icase);
against the following text line:
"hw-descriptor:pTEXT1^:mTEXT2^:uTEXT3^"
Here is the code, which I posted on a live Coliru. It shows how I attempted to solve this problem; however, I am only getting 1 match. I need to see how to extract each of the potential 3 matches corresponding to the p, m or u characters described earlier.
#include <algorithm> // std::for_each
#include <iostream>
#include <string>
#include <vector>
#include <regex>
int main()
{
static const std::regex gRegex("(?:hw-descriptor)(:[pmu](.*?)\\^)+", std::regex::icase);
std::string foo = "hw-descriptor:pTEXT1^:mTEXT2^:uTEXT3^";
// I seem to only get 1 match here, I was expecting
// to loop through each of the matches, looks like I need something like
// a pcre global option but I don't know how.
std::for_each(std::sregex_iterator(foo.cbegin(), foo.cend(), gRegex), std::sregex_iterator(),
[&](const auto& rMatch) {
for (int i=0; i< static_cast<int>(rMatch.size()); ++i) {
std::cout << rMatch[i] << std::endl;
}
});
}
The above program gives the following output:
g++ -std=c++14 -O2 -Wall -pedantic -pthread main.cpp && ./a.out
hw-descriptor:pTEXT1^:mTEXT2^:uTEXT3^
:uTEXT3^
TEXT3
With std::regex, you cannot keep multiple repeated captures when matching a string with consecutive repeated patterns.
What you may do is to match the overall texts containing the prefix and the repeated chunks, capture the latter into a separate group, and then use a second smaller regex to grab all the occurrences of the substrings you want separately.
The first regex here may be
hw-descriptor((?::[pmu][^^]*\^)+)
See the online demo. It matches hw-descriptor, and ((?::[pmu][^^]*\^)+) captures into Group 1 one or more repetitions of the :[pmu][^^]*\^ pattern: a :, then p, m or u, then 0 or more chars other than ^, and then ^. (In an ordinary C++ string literal the backslash is doubled, \\^, as in the code below.) Upon finding a match, use the :[pmu][^^]*\^ regex to return all the real "matches".
C++ demo:
static const std::regex gRegex("hw-descriptor((?::[pmu][^^]*\\^)+)", std::regex::icase);
static const std::regex lRegex(":[pmu][^^]*\\^", std::regex::icase);
std::string foo = "hw-descriptor:pTEXT1^:mTEXT2^:uTEXT3^ hw-descriptor:pTEXT8^:mTEXT8^:uTEXT83^";
std::smatch smtch;
for(std::sregex_iterator i = std::sregex_iterator(foo.begin(), foo.end(), gRegex);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << "Match value: " << m.str() << std::endl;
std::string x = m.str(1);
for(std::sregex_iterator j = std::sregex_iterator(x.begin(), x.end(), lRegex);
j != std::sregex_iterator();
++j)
{
std::cout << "Element value: " << (*j).str() << std::endl;
}
}
Output:
Match value: hw-descriptor:pTEXT1^:mTEXT2^:uTEXT3^
Element value: :pTEXT1^
Element value: :mTEXT2^
Element value: :uTEXT3^
Match value: hw-descriptor:pTEXT8^:mTEXT8^:uTEXT83^
Element value: :pTEXT8^
Element value: :mTEXT8^
Element value: :uTEXT83^
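If only the text between the marker letter and ^ is wanted, the second regex can carry its own capture groups. The sketch below wraps the two-pass idea in a function that handles the first hw-descriptor in a line; the name parse_descriptor and the pair-based return type are illustrative choices, not part of the answer above.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (marker letter, text) pairs for the first
// "hw-descriptor" found in `line`, or an empty vector if there is none.
std::vector<std::pair<char, std::string>> parse_descriptor(const std::string& line)
{
    static const std::regex outer(R"(hw-descriptor((?::[pmu][^^]*\^)+))", std::regex::icase);
    static const std::regex inner(R"(:([pmu])([^^]*)\^)", std::regex::icase);

    std::vector<std::pair<char, std::string>> fields;
    std::smatch m;
    if (!std::regex_search(line, m, outer))
        return fields;

    // Pass 2: walk over the repeated chunks captured by group 1 of the outer regex.
    const std::string chunks = m.str(1);
    for (std::sregex_iterator it(chunks.begin(), chunks.end(), inner), end; it != end; ++it)
        fields.emplace_back(it->str(1)[0], it->str(2));
    return fields;
}
```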

Get String Between 2 Strings

How can I get a string that is between two other declared strings, for example:
String 1 = "[STRING1]"
String 2 = "[STRING2]"
Source:
"832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
How can I get the "I need this text here"?
Since this is homework, only clues:
Find index1 of occurrence of String1
Find index2 of occurrence of String2
Substring from index1+lengthOf(String1) (inclusive) to index2 (exclusive) is what you need
Copy this to a result buffer if necessary (don't forget to null-terminate)
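For reference, the steps above could be sketched with std::string as follows, in which case no manual null termination is needed. The helper name text_between is hypothetical, and searching for the second delimiter only after the end of the first is a small addition to the clues.

```cpp
#include <string>

// Hypothetical helper following the steps above with std::string.
// Returns true and fills `out` when both delimiters are found in order.
bool text_between(const std::string& source, const std::string& first,
                  const std::string& second, std::string& out)
{
    const std::string::size_type index1 = source.find(first);
    if (index1 == std::string::npos)
        return false;
    const std::string::size_type begin = index1 + first.size();
    // Look for the second delimiter only after the first one ends.
    const std::string::size_type index2 = source.find(second, begin);
    if (index2 == std::string::npos)
        return false;
    out = source.substr(begin, index2 - begin);  // [begin, index2)
    return true;
}
```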
Might be a good case for std::regex, which is part of C++11.
#include <iostream>
#include <string>
#include <regex>
int main()
{
using namespace std::string_literals;
auto start = "\\[STRING1\\]"s;
auto end = "\\[STRING2\\]"s;
std::regex base_regex(start + "(.*)" + end);
auto example = "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"s;
std::smatch base_match;
std::string matched;
if (std::regex_search(example, base_match, base_regex)) {
// The first sub_match is the whole string; the next
// sub_match is the first parenthesized expression.
if (base_match.size() == 2) {
matched = base_match[1].str();
}
}
std::cout << "example: \""<<example << "\"\n";
std::cout << "matched: \""<<matched << "\"\n";
}
Prints:
example: "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
matched: "I need this text here"
What I did was create a program with two strings, start and end, that serve as my start and end delimiters. I then build a regular expression that looks for those and captures anything in between (including nothing). Then I use regex_search to find the matching part of the input, and set matched to the captured string.
For more info, see http://en.cppreference.com/w/cpp/regex and http://en.cppreference.com/w/cpp/regex/regex_search
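One caveat worth checking against your data: (.*) is greedy, so if the end marker occurs more than once, the capture runs up to its last occurrence. A sketch with the lazy quantifier (.*?), which stops at the first end marker instead, is shown here; the function name is invented for the example and the markers are hard-coded for brevity.

```cpp
#include <regex>
#include <string>

// Returns the text between the first [STRING1] and the nearest following
// [STRING2], or an empty string if no such pair exists.
std::string first_between_markers(const std::string& text)
{
    // (.*?) is lazy: it expands only as far as needed to reach [STRING2].
    static const std::regex lazy(R"(\[STRING1\](.*?)\[STRING2\])");
    std::smatch m;
    if (std::regex_search(text, m, lazy))
        return m[1].str();
    return std::string();
}
```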
Use strstr (http://www.cplusplus.com/reference/clibrary/cstring/strstr/). Calling it once for each delimiter gives you two pointers. Check that neither is null and that pointer1 < pointer2; if so, skip past the first delimiter and read all the chars up to pointer2.