Regex - How to capture all iterations of a repeating pattern? [duplicate] - c++

I'm using the C++ tr1::regex with the ECMA regex grammar. What I'm trying to do is parse a header and return values associated with each item in the header.
Header:
-Testing some text
-Numbers 1 2 5
-MoreStuff some more text
-Numbers 1 10
What I would like to do is find all of the "-Numbers" lines and put each number into its own result with a single regex. As you can see, the "-Numbers" lines can have an arbitrary number of values on the line. Currently, I'm just searching for "-Numbers([\s0-9]+)" and then tokenizing that result. I was just wondering if there was any way to both find and tokenize the results in a single regex.

No, there is not.

I was about to ask this exact same question, and I kind of found a solution.
Let's say you have an arbitrary number of words you want to capture.
"there are four lights"
and
"captain picard is the bomb"
You might think that the solution is:
/((\w+)\s?)+/
But this will only match the whole input string and the last captured group.
What you can do is use the "g" switch.
So, an example in Perl:
use strict;
use warnings;
my $str1 = "there are four lights";
my $str2 = "captain picard is the bomb";
foreach ( $str1, $str2 ) {
my #a = ( $_ =~ /(\w+)\s?/g );
print "captured groups are: " . join( "|", #a ) . "\n";
}
Output is:
captured groups are: there|are|four|lights
captured groups are: captain|picard|is|the|bomb
So, there is a solution if your language of choice supports an equivalent of "g" (and I guess most do...).
Hope this helps someone who was in the same position as me!
S

Problem is that desired solution insists on use of capture groups. C++ provides tool regex_token_iterator to handle this in better way (C++11 example):
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main() {
std::regex e (R"((?:^-Numbers)?\s*(\d+))");
string input;
while (getline(cin, input)) {
std::regex_token_iterator<std::string::iterator> a{
input.begin(), input.end(),
e, 1,
regex_constants::match_continuous
};
std::regex_token_iterator<std::string::iterator> end;
while (a != end) {
cout << *a << " - ";
++a;
}
cout << '\n';
}
return 0;
}
https://wandbox.org/permlink/TzVEqykXP1eYdo1c

Related

How to retrieve the captured substrings from a capturing group that may repeat?

I'm sorry I found it difficult to express this question with my poor English. So, let's go directly to a simple example.
Assume we have a subject string "apple:banana:cherry:durian". We want to match the subject and have $1, $2, $3 and $4 become "apple", "banana", "cherry" and "durian", respectively. The pattern I'm using is ^(\w+)(?::(.*?))*$, and $1 will be "apple" as expected. However, $2 will be "durian" instead of "banana".
Because the subject string to match doesn't need to be 4 items, for example, it could be "one:two:three", and $1 and $2 will be "one" and "three" respectively. Again, the middle item is missing.
What is the correct pattern to use in this case? By the way, I'm going to use PCRE2 in C++ codes, so there is no split, a Perl built-in function. Thanks.
If the input contains strictly items of interest separated by :, like item1:item2:item3, as the attempt in the question indicates, then you can use the regex pattern
[^:]+
which matches consecutive characters which are not :, so a substring up to the first :. That may need to capture as well, ([^:]+), depending on the overall approach. How to use this to get all such matches depends on the language.†
In C++ there are different ways to approach this. Using std::regex_iterator
#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <iostream>
int main()
{
std::string str{R"(one:two:three)"};
std::regex r{R"([^:]+)"};
std::vector<std::string> result{};
auto it = std::sregex_iterator(str.begin(), str.end(), r);
auto end = std::sregex_iterator();
for(; it != end; ++it) {
auto match = *it;
result.push_back(match[0].str());
}
std::cout << "Input string: " << str << '\n';
for(auto i : result)
std::cout << i << '\n';
}
Prints as expected.
One can also use std::regex_search, even as it returns at first match -- by iterating over the string to move the search start after every match
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string str{"one:two:three"};
std::regex r{"[^:]+"};
std::smatch res;
std::string::const_iterator search_beg( str.cbegin() );
while ( regex_search( search_beg, str.cend(), res, r ) )
{
std::cout << res[0] << '\n';
search_beg = res.suffix().first;
}
std::cout << '\n';
}
(With this string and regex we don't need the raw string literal so I've removed them here.)
† This question was initially tagged with perl (with no c++), also with an explicit mention of it in text (still there), and the original version of this answer referred to Perl with
/([^:]+)/g
The /g "modifier" is for "global," to find all matches. The // are pattern delimiters.
When this expression is bound (=~) to a variable with a target string then the whole expression returns a list of matches when used in a context in which a list is expected, which can thus be directly assigned to an array variable.
my #captures = $string =~ /[^:]+/g;
(when this is used literally as shown then the capturing () aren't needed)
Assigning to an array provides this "list context." If the matching is used in a "scalar context," in which a single value is expected, like in the condition for an if test or being assigned to a scalar variable, then a single true/false is returned (usually 1 or '', empty string).
Repeating a capture group will only capture the value of the last iteration. Instead, you might make use of the \G anchor to get consecutive matches.
If the whole string can only contain word characters separated by colons:
(?:^(?=\w+(?::\w+)+$)|\G(?!^):)\K\w+
The pattern matches:
(?: Non capture group
^ Assert start of string
(?=\w+(?::\w+)+$) Assert from the current position 1+ word characters and 1+ repetitions of : and 1+ word characters till the end of the string
| Or
\G(?!^): Assert the position at the end of the previous match, not at the start and match :
) Close non capture group
\K\w+ Forget what is matched so far, and match 1+ word characters
Regex demo
To allow only words as well from the start of the string, and allow other chars after the word chars:
\G:?\K\w+
Regex demo

C++11 Regex submatches

I have the following code to extract the left & right part from a string of type
[3->1],[2->2],[5->3]
My code looks like the following
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
regex expr("([[:d:]]+)->([[:d:]]+)");
string input = "[3->1],[2->2],[5->3]";
const std::sregex_token_iterator end;
int submatches[] = { 1, 2 };
string left, right;
for (std::sregex_token_iterator itr(input.begin(), input.end(), expr, submatches); itr != end;)
{
left = ((*itr).str()); ++itr;
right = ((*itr).str()); ++itr;
cout << left << " " << right << endl;
}
}
Output will be
3 1
2 2
5 3
Now I am trying to extend it so that first part will be a string instead of digit. For example, the input will be
[(3),(5),(0,1)->2],[(32,2)->6],[(27),(61,11)->1]
And I need to split it as
(3),(5),(0,1) 2
(32,2) 6
(27),(61,11) 1
Basic expressions that I tried ("(\\(.*+)->([[:d:]]+)") just splits the entire string to two as following
(3),(5),(0,1)->2],[(32,2)->6],[(27),(61,11) 1
Can somebody give me some suggestions on how to achieve this? Appreciate all the help.
You need to get everything after the first '[', except "->", kind of like if
you were doing a regex for the multiline comment /* ... */, where " */ " has to be excluded, or else the regex gets greedy and eats everything until the last one, like is happening in your case for "->". You can't really use the dot for any char, because it gets very greedy.
This works for me:
\\[([^-\\]]+)->([0-9]+)\\]
'^' at the start of [...] makes it so all chars, except '-', so you can avoid "->", and ']', are accepted
What you need is to make it a bit more specific:
\[([^]]*)->([^]]*)\]
In order to avoid capturing too many data. See live demo.
You could have use the .*? pattern instead of [^]]* but it would have been less efficient.

C++ RegExp and placeholders

I'm on C++11 MSVC2013, I need to extract a number from a file name, for example:
string filename = "s 027.wav";
If I were writing code in Perl, Java or Basic, I would use a regular expression and something like this would do the trick in Perl5:
filename ~= /(\d+)/g;
and I would have the number "027" in placeholder variable $1.
Can I do this in C++ as well? Or can you suggest a different method to extract the number 027 from that string? Also, I should convert the resulting numerical string into an integral scalar, I think atoi() is what I need, right?
You can do this in C++, as of C++11 with the collection of classes found in regex. It's pretty similar to other regular expressions you've used in other languages. Here's a no-frills example of how you might search for the number in the filename you posted:
const std::string filename = "s 027.wav";
std::regex re = std::regex("[0-9]+");
std::smatch matches;
if (std::regex_search(filename, matches, re)) {
std::cout << matches.size() << " matches." << std::endl;
for (auto &match : matches) {
std::cout << match << std::endl;
}
}
As far as converting 027 into a number, you could use atoi (from cstdlib) like you mentioned, but this will store the value 27, not 027. If you want to keep the 0 prefix, I believe you will need to keep this as a string. match above is a sub_match so, extract a string and convert to a const char* for atoi:
int value = atoi(match.str().c_str());
Ok, I solved using std::regex which for some reason I couldn't get to work properly when trying to modify the examples I found around the web. It was simpler than I thought. This is the code I wrote:
#include <regex>
#include <string>
string FileName = "s 027.wav";
// The search object
smatch m;
// The regexp /\d+/ works in Perl and Java but for some reason didn't work here.
// With this other variation I look for exactly a string of 1 to 3 characters
// containing only numbers from 0 to 9
regex re("[0-9]{1,3}");
// Do the search
regex_search (FileName, m, re);
// 'm' is actually an array where every index contains a match
// (equally to $1, $2, $2, etc. in Perl)
string sMidiNoteNum = m[0];
// This casts the string to an integer number
int MidiNote = atoi(sMidiNoteNum.c_str());
Here is an example using Boost, substitute the proper namespace and it should work.
typedef std::string::const_iterator SITR;
SITR start = str.begin();
SITR end = str.end();
boost::regex NumRx("\\d+");
boost::smatch m;
while ( boost::regex_search ( start, end, m, NumRx ) )
{
int val = atoi( m[0].str().c_str() )
start = m[0].second;
}

Get String Between 2 Strings

How can I get a string that is between two other declared strings, for example:
String 1 = "[STRING1]"
String 2 = "[STRING2]"
Source:
"832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
How can I get the "I need this text here"?
Since this is homework, only clues:
Find index1 of occurrence of String1
Find index2 of occurrence of String2
Substring from index1+lengthOf(String1) (inclusive) to index2 (exclusive) is what you need
Copy this to a result buffer if necessary (don't forget to null-terminate)
Might be a good case for std::regex, which is part of C++11.
#include <iostream>
#include <string>
#include <regex>
int main()
{
using namespace std::string_literals;
auto start = "\\[STRING1\\]"s;
auto end = "\\[STRING2\\]"s;
std::regex base_regex(start + "(.*)" + end);
auto example = "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"s;
std::smatch base_match;
std::string matched;
if (std::regex_search(example, base_match, base_regex)) {
// The first sub_match is the whole string; the next
// sub_match is the first parenthesized expression.
if (base_match.size() == 2) {
matched = base_match[1].str();
}
}
std::cout << "example: \""<<example << "\"\n";
std::cout << "matched: \""<<matched << "\"\n";
}
Prints:
example: "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
matched: "I need this text here"
What I did was create a program that creates two strings, start and end that serve as my start and end matches. I then use a regular expression string that will look for those, and match against anything in-between (including nothing). Then I use regex_match to find the matching part of the expression, and set matched as the matched string.
For more info, see http://en.cppreference.com/w/cpp/regex and http://en.cppreference.com/w/cpp/regex/regex_search
Use strstr http://www.cplusplus.com/reference/clibrary/cstring/strstr/ , with that function you will get 2 pointers, now you should compare them (if pointer1 < pointer2) if so, read all chars between them.

Is there a way to have a capture repeat an arbitrary number of times in a regex?

I'm using the C++ tr1::regex with the ECMA regex grammar. What I'm trying to do is parse a header and return values associated with each item in the header.
Header:
-Testing some text
-Numbers 1 2 5
-MoreStuff some more text
-Numbers 1 10
What I would like to do is find all of the "-Numbers" lines and put each number into its own result with a single regex. As you can see, the "-Numbers" lines can have an arbitrary number of values on the line. Currently, I'm just searching for "-Numbers([\s0-9]+)" and then tokenizing that result. I was just wondering if there was any way to both find and tokenize the results in a single regex.
No, there is not.
I was about to ask this exact same question, and I kind of found a solution.
Let's say you have an arbitrary number of words you want to capture.
"there are four lights"
and
"captain picard is the bomb"
You might think that the solution is:
/((\w+)\s?)+/
But this will only match the whole input string and the last captured group.
What you can do is use the "g" switch.
So, an example in Perl:
use strict;
use warnings;
my $str1 = "there are four lights";
my $str2 = "captain picard is the bomb";
foreach ( $str1, $str2 ) {
my #a = ( $_ =~ /(\w+)\s?/g );
print "captured groups are: " . join( "|", #a ) . "\n";
}
Output is:
captured groups are: there|are|four|lights
captured groups are: captain|picard|is|the|bomb
So, there is a solution if your language of choice supports an equivalent of "g" (and I guess most do...).
Hope this helps someone who was in the same position as me!
S
Problem is that desired solution insists on use of capture groups. C++ provides tool regex_token_iterator to handle this in better way (C++11 example):
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main() {
std::regex e (R"((?:^-Numbers)?\s*(\d+))");
string input;
while (getline(cin, input)) {
std::regex_token_iterator<std::string::iterator> a{
input.begin(), input.end(),
e, 1,
regex_constants::match_continuous
};
std::regex_token_iterator<std::string::iterator> end;
while (a != end) {
cout << *a << " - ";
++a;
}
cout << '\n';
}
return 0;
}
https://wandbox.org/permlink/TzVEqykXP1eYdo1c