C++11 Regex submatches - c++

I have the following code to extract the left & right part from a string of type
[3->1],[2->2],[5->3]
My code looks like the following
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
regex expr("([[:d:]]+)->([[:d:]]+)");
string input = "[3->1],[2->2],[5->3]";
const std::sregex_token_iterator end;
int submatches[] = { 1, 2 };
string left, right;
for (std::sregex_token_iterator itr(input.begin(), input.end(), expr, submatches); itr != end;)
{
left = ((*itr).str()); ++itr;
right = ((*itr).str()); ++itr;
cout << left << " " << right << endl;
}
}
Output will be
3 1
2 2
5 3
Now I am trying to extend it so that the first part is a string instead of digits. For example, the input will be
[(3),(5),(0,1)->2],[(32,2)->6],[(27),(61,11)->1]
And I need to split it as
(3),(5),(0,1) 2
(32,2) 6
(27),(61,11) 1
The basic expression that I tried, "(\\(.*+)->([[:d:]]+)", just splits the entire string in two, as follows
(3),(5),(0,1)->2],[(32,2)->6],[(27),(61,11) 1
Can somebody give me some suggestions on how to achieve this? Appreciate all the help.

You need to match everything after the opening '[' up to, but not including, "->". It is the same problem as writing a regex for the multiline comment /* ... */, where "*/" has to be excluded from the body; otherwise the match is greedy and runs on to the last one, which is what happens in your case with "->". You can't really use the dot for any char, because .* matches as much as it can.
This works for me:
\\[([^-\\]]+)->([0-9]+)\\]
The '^' at the start of [...] negates the class, so it accepts every character except '-' (which keeps the match from running past "->") and ']'.

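What you need is to make the pattern a bit more specific:
\[([^\]]*)->([^\]]*)\]
This stops each group at the next ], so it does not capture too much. (The ] inside the class is escaped because std::regex's default ECMAScript grammar can otherwise read [^] as a complete class.) See live demo.
You could have used the .*? pattern instead of [^\]]*, but it would have been less efficient.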

Related

How to retrieve the captured substrings from a capturing group that may repeat?

I'm sorry I found it difficult to express this question with my poor English. So, let's go directly to a simple example.
Assume we have a subject string "apple:banana:cherry:durian". We want to match the subject and have $1, $2, $3 and $4 become "apple", "banana", "cherry" and "durian", respectively. The pattern I'm using is ^(\w+)(?::(.*?))*$, and $1 will be "apple" as expected. However, $2 will be "durian" instead of "banana".
Because the subject string to match doesn't need to be 4 items, for example, it could be "one:two:three", and $1 and $2 will be "one" and "three" respectively. Again, the middle item is missing.
What is the correct pattern to use in this case? By the way, I'm going to use PCRE2 in C++ codes, so there is no split, a Perl built-in function. Thanks.
If the input contains strictly items of interest separated by :, like item1:item2:item3, as the attempt in the question indicates, then you can use the regex pattern
[^:]+
which matches consecutive characters which are not :, so a substring up to the first :. That may need to capture as well, ([^:]+), depending on the overall approach. How to use this to get all such matches depends on the language.†
In C++ there are different ways to approach this. Using std::regex_iterator
#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <iostream>
int main()
{
std::string str{R"(one:two:three)"};
std::regex r{R"([^:]+)"};
std::vector<std::string> result{};
auto it = std::sregex_iterator(str.begin(), str.end(), r);
auto end = std::sregex_iterator();
for(; it != end; ++it) {
auto match = *it;
result.push_back(match[0].str());
}
std::cout << "Input string: " << str << '\n';
for(auto i : result)
std::cout << i << '\n';
}
Prints as expected.
One can also use std::regex_search, even though it returns at the first match, by iterating over the string and moving the search start past every match
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string str{"one:two:three"};
std::regex r{"[^:]+"};
std::smatch res;
std::string::const_iterator search_beg( str.cbegin() );
while ( regex_search( search_beg, str.cend(), res, r ) )
{
std::cout << res[0] << '\n';
search_beg = res.suffix().first;
}
std::cout << '\n';
}
(With this string and regex we don't need the raw string literal so I've removed them here.)
† This question was initially tagged with perl (with no c++), also with an explicit mention of it in text (still there), and the original version of this answer referred to Perl with
/([^:]+)/g
The /g "modifier" is for "global," to find all matches. The // are pattern delimiters.
When this expression is bound (=~) to a variable with a target string then the whole expression returns a list of matches when used in a context in which a list is expected, which can thus be directly assigned to an array variable.
my @captures = $string =~ /[^:]+/g;
(when this is used literally as shown then the capturing () aren't needed)
Assigning to an array provides this "list context." If the matching is used in a "scalar context," in which a single value is expected, like in the condition for an if test or being assigned to a scalar variable, then a single true/false is returned (usually 1 or '', empty string).
Repeating a capture group will only capture the value of the last iteration. Instead, you might make use of the \G anchor to get consecutive matches.
If the whole string can only contain word characters separated by colons:
(?:^(?=\w+(?::\w+)+$)|\G(?!^):)\K\w+
The pattern matches:
(?: Non capture group
^ Assert start of string
(?=\w+(?::\w+)+$) Assert from the current position 1+ word characters and 1+ repetitions of : and 1+ word characters till the end of the string
| Or
\G(?!^): Assert the position at the end of the previous match, not at the start and match :
) Close non capture group
\K\w+ Forget what is matched so far, and match 1+ word characters
Regex demo
To allow only words as well from the start of the string, and allow other chars after the word chars:
\G:?\K\w+
Regex demo

Getting a list of curly brace blocks using regex

I'm building a simple data encoder/decoder for a project I'm doing in c++, the data is written to a file in this format (dummy data):
{X143Y453CGRjGeBK}{X243Y6789CaRyGwBk}{X5743Y12CvRYGQBs}
The number of blocks is indefinite and the size of the blocks is variable.
To decode the image I need to iterate through each curly brace block and process the data within, the ideal output would look like this:
"X143Y453CGRjGeBK" "X243Y6789CaRyGwBk" "X5743Y12CvRYGQBs"
The closest I've got is:
"\\{(.*)\\}"
But this gives me the whole sequence rather than each block.
Sorry if this is a simple problem but regex hasn't really clicked with me yet, is this possible with regex or should I use a different method?
You can use [^{}]+:
[^{}]: Match a single character not present in the list below (in this case '{' & '}')
+: match the preceding class one or more times, as many times as possible.
Testing: https://regex101.com/r/bNOK5U/1/
To extract multiple occurrences of substrings inside curly braces, that have no braces inside (that is, substrings inside innermost braces), you may use
#include <iostream>
#include <string>
#include <vector>
#include <regex>
int main() {
std::regex rx(R"(\{([^{}]*)})");
std::string s = "Text here {X143Y453CGRjGeBK} and here {X243Y6789CaRyGwBk}{X5743Y12CvRYGQBs} and more here.";
std::vector<std::string> results(std::sregex_token_iterator(s.begin(), s.end(), rx, 1),
std::sregex_token_iterator());
for( auto & p : results ) std::cout << p << std::endl;
return 0;
}
See a C++ demo.
The std::regex rx(R"(\{([^{}]*)})") regex string is \{([^{}]*)}, and it matches
\{ - a { char
([^{}]*) - Capturing group 1: zero or more chars other than { and }
} - a } char.
The 1 argument passed to the std::sregex_token_iterator extracts just those values that are captured into Group 1.

How to select the complete word within the brackets even if it have that brackets within word

Please suggest a solution for the following example.
Scenario-1:
My String : Password={my_pswd}}123}
I want to select the value enclosed within the {} brackets (for example, I want to select the complete password value {my_pswd}}123}, not {my_pswd}).
If I use the regex \{(.*?)\}, it selects {my_pswd}, not {my_pswd}}123}. How can I get the complete word even if it has } in between? Please give me some suggestions, using regex or any other way.
Scenario-2:
I am using the regex ^\{|\}$. If my string has both a { bracket and a } bracket, like {{my_password}}, then it should select only the first and last bracket. If my string is like {{my_password, it should not select the starting bracket. It is like an AND condition in regex. I referred to many posts that did this with lookarounds, but I could not get a clear idea. Please give me some suggestions.
Thanks.
It seems that the {...} substrings you want to match must be followed with ; or end of string.
This will not work for cases when a } inside the values can also be followed with ;.
You may solve the first issue by adding a (?![^;]) negative lookahead:
\{(.*?)\}(?![^;])
See the regex demo.
Details
\{ - a { char
(.*?) - Group 1: any 0+ chars as few as possible
\} - a } char
(?![^;]) - no char other than ; is allowed right after the current position
See the C++ demo:
#include <iostream>
#include <vector>
#include <regex>
int main() {
const std::regex reg("\\{(.*?)\\}(?![^;])");
std::smatch match;
std::string s = "Username={My_{}user};Password={my_pswd}}123}}}kk};Password={my_pswd}}123}";
std::vector<std::string> results(
std::sregex_token_iterator(s.begin(), s.end(), reg, 1), // See 1, it extracts Group 1 value
std::sregex_token_iterator());
for (auto result : results)
{
std::cout << result << std::endl;
}
return 0;
}
Output:
My_{}user
my_pswd}}123}}}kk
my_pswd}}123
As for the second scenario, you may use
std::regex reg("^\\{([^]*)\\}$");
std::string s = "{My_{}user}";
std::cout << regex_replace(s, reg, "$1") << std::endl; // => My_{}user
See another C++ demo.
The ^\{([^]*)\}$ pattern matches the { at the start (^) of the string, then matches and captures into Group 1 (later referenced with the help of $1 in the replacement pattern) any 0+ chars, as many as possible, and then matches a } at the end of the string ($).

Regular expression validation fails while egrep validates just fine

I'm trying to use regular expressions to validate strings, so before I go any further let me first explain what the strings look like: optional number of digits followed by an 'X' and an optional ('^' followed by one or more digits).
Here are some examples: "2X", "X", "23X^6" fit the pattern while strings like "X^", "4", "foobar", "4X^", "4X44" don't.
Now where was I: using 'egrep' and the "^[0-9]{0,}\X(\^[0-9]{1,})$" regex I can validate those strings just fine; however, when trying this in C++ using the C++11 regex library, it fails.
Here's the code I'm using to validate those strings:
#include <iostream>
#include <regex>
#include <string>
#include <vector>
int main()
{
std::regex r("^[0-9]{0,}\\X(\\^[0-9]{1,})$",
std::regex_constants::egrep);
std::vector<std::string> challanges_ok {"2X", "X", "23X^66", "23X^6",
"3123X", "2313131X^213213123"};
std::vector<std::string> challanges_bad {"X^", "4", "asdsad", " X",
"4X44", "4X^"};
std::cout << "challanges_ok: ";
for (auto &str : challanges_ok) {
std::cout << std::regex_match(str, r) << " ";
}
std::cout << "\nchallanges_bad: ";
for (auto &str : challanges_bad) {
std::cout << std::regex_match(str, r) << " ";
}
std::cout << "\n";
return 0;
}
Am I doing something wrong or am I missing something? I'm compiling under GCC 4.7.
Your regex fails to make the '^' followed by one or more digits optional; change it to:
"^[0-9]*X(\\^[0-9]+)?$".
Also note that this page says that GCC's support of <regex> is only partial, so std::regex may not work at all for you ('partial' in this context apparently means 'broken'); have you tried Boost.Xpressive or Boost.Regex as a sanity check?
optional number of digits followed by an 'X' and an optional ('^' followed by one or more digits).
OK, the regular expression in your code doesn't match that description, for two reasons: you have an extra backslash on the X, and the '^digits' part is not optional. The regex you want is this:
^[0-9]{0,}X(\^[0-9]{1,}){0,1}$
which means your grep command should look like this (note single quotes):
egrep '^[0-9]{0,}X(\^[0-9]{1,}){0,1}$' filename
And the string you have to pass in your C++ code is this:
"^[0-9]{0,}X(\\^[0-9]{1,}){0,1}$"
If you then replace all the explicit quantifiers with their more traditional abbreviations, you get @ildjarn's answer: {0,} is *, {1,} is +, and {0,1} is ?.

Get String Between 2 Strings

How can I get a string that is between two other declared strings, for example:
String 1 = "[STRING1]"
String 2 = "[STRING2]"
Source:
"832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
How can I get the "I need this text here"?
Since this is homework, only clues:
Find index1 of occurrence of String1
Find index2 of occurrence of String2
Substring from index1+lengthOf(String1) (inclusive) to index2 (exclusive) is what you need
Copy this to a result buffer if necessary (don't forget to null-terminate)
Might be a good case for std::regex, which is part of C++11 (the s string literals used below additionally need C++14).
#include <iostream>
#include <string>
#include <regex>
int main()
{
using namespace std::string_literals;
auto start = "\\[STRING1\\]"s;
auto end = "\\[STRING2\\]"s;
std::regex base_regex(start + "(.*)" + end);
auto example = "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"s;
std::smatch base_match;
std::string matched;
if (std::regex_search(example, base_match, base_regex)) {
// The first sub_match is the whole string; the next
// sub_match is the first parenthesized expression.
if (base_match.size() == 2) {
matched = base_match[1].str();
}
}
std::cout << "example: \""<<example << "\"\n";
std::cout << "matched: \""<<matched << "\"\n";
}
Prints:
example: "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
matched: "I need this text here"
What I did was write a program that creates two strings, start and end, which serve as my start and end delimiters. I then build a regular expression that looks for those and captures anything in between (including nothing). Then I use regex_search to find the matching part of the input, and set matched to the captured string.
For more info, see http://en.cppreference.com/w/cpp/regex and http://en.cppreference.com/w/cpp/regex/regex_search
Use strstr (http://www.cplusplus.com/reference/clibrary/cstring/strstr/). Calling it once for each delimiter gives you two pointers; compare them, and if pointer1 < pointer2, read all the chars between them, starting after the end of the first delimiter.