Get String Between 2 Strings - c++

How can I get a string that is between two other declared strings, for example:
String 1 = "[STRING1]"
String 2 = "[STRING2]"
Source:
"832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
How can I get the "I need this text here"?

Since this is homework, only clues:
Find index1 of occurrence of String1
Find index2 of occurrence of String2
Substring from index1+lengthOf(String1) (inclusive) to index2 (exclusive) is what you need
Copy this to a result buffer if necessary (don't forget to null-terminate)

Might be a good case for std::regex, which is part of C++11 (the s string-literal suffix used below requires C++14).
#include <iostream>
#include <string>
#include <regex>
int main()
{
using namespace std::string_literals;
auto start = "\\[STRING1\\]"s;
auto end = "\\[STRING2\\]"s;
std::regex base_regex(start + "(.*)" + end);
auto example = "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"s;
std::smatch base_match;
std::string matched;
if (std::regex_search(example, base_match, base_regex)) {
// The first sub_match is the whole match; the next
// sub_match is the first parenthesized expression.
if (base_match.size() == 2) {
matched = base_match[1].str();
}
}
std::cout << "example: \""<<example << "\"\n";
std::cout << "matched: \""<<matched << "\"\n";
}
Prints:
example: "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
matched: "I need this text here"
What I did was create a program with two strings, start and end, that serve as my start and end delimiters. I then build a regular expression that looks for those and captures anything in between (including nothing). Then I use regex_search to find the matching part of the input, and set matched to the captured string.
For more info, see http://en.cppreference.com/w/cpp/regex and http://en.cppreference.com/w/cpp/regex/regex_search

Use strstr (http://www.cplusplus.com/reference/clibrary/cstring/strstr/). Calling it once for each delimiter gives you two pointers. Check that both are non-null and that pointer1 < pointer2; if so, read the characters between them, starting just past the end of the first delimiter.

Related

How to retrieve the captured substrings from a capturing group that may repeat?

I'm sorry I found it difficult to express this question with my poor English. So, let's go directly to a simple example.
Assume we have a subject string "apple:banana:cherry:durian". We want to match the subject and have $1, $2, $3 and $4 become "apple", "banana", "cherry" and "durian", respectively. The pattern I'm using is ^(\w+)(?::(.*?))*$, and $1 will be "apple" as expected. However, $2 will be "durian" instead of "banana".
The subject string does not have to contain exactly 4 items; for example, it could be "one:two:three", in which case $1 and $2 will be "one" and "three" respectively. Again, the middle item is missing.
What is the correct pattern to use in this case? By the way, I'm going to use PCRE2 in C++ code, so Perl's built-in split function is not available. Thanks.
If the input contains strictly items of interest separated by :, like item1:item2:item3, as the attempt in the question indicates, then you can use the regex pattern
[^:]+
which matches consecutive characters which are not :, so a substring up to the first :. That may need to capture as well, ([^:]+), depending on the overall approach. How to use this to get all such matches depends on the language.†
In C++ there are different ways to approach this. Using std::regex_iterator
#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <iostream>
int main()
{
std::string str{R"(one:two:three)"};
std::regex r{R"([^:]+)"};
std::vector<std::string> result{};
auto it = std::sregex_iterator(str.begin(), str.end(), r);
auto end = std::sregex_iterator();
for(; it != end; ++it) {
auto match = *it;
result.push_back(match[0].str());
}
std::cout << "Input string: " << str << '\n';
for(auto i : result)
std::cout << i << '\n';
}
Prints as expected.
One can also use std::regex_search, even as it returns at first match -- by iterating over the string to move the search start after every match
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string str{"one:two:three"};
std::regex r{"[^:]+"};
std::smatch res;
std::string::const_iterator search_beg( str.cbegin() );
while ( regex_search( search_beg, str.cend(), res, r ) )
{
std::cout << res[0] << '\n';
search_beg = res.suffix().first;
}
std::cout << '\n';
}
(With this string and regex we don't need the raw string literal so I've removed them here.)
† This question was initially tagged with perl (with no c++), also with an explicit mention of it in text (still there), and the original version of this answer referred to Perl with
/([^:]+)/g
The /g "modifier" is for "global," to find all matches. The // are pattern delimiters.
When this expression is bound (=~) to a variable with a target string then the whole expression returns a list of matches when used in a context in which a list is expected, which can thus be directly assigned to an array variable.
my @captures = $string =~ /[^:]+/g;
(when this is used literally as shown then the capturing () aren't needed)
Assigning to an array provides this "list context." If the matching is used in a "scalar context," in which a single value is expected, like in the condition for an if test or being assigned to a scalar variable, then a single true/false is returned (usually 1 or '', empty string).
Repeating a capture group will only capture the value of the last iteration. Instead, you might make use of the \G anchor to get consecutive matches.
If the whole string can only contain word characters separated by colons:
(?:^(?=\w+(?::\w+)+$)|\G(?!^):)\K\w+
The pattern matches:
(?: Non capture group
^ Assert start of string
(?=\w+(?::\w+)+$) Assert from the current position 1+ word characters and 1+ repetitions of : and 1+ word characters till the end of the string
| Or
\G(?!^): Assert the position at the end of the previous match, not at the start and match :
) Close non capture group
\K\w+ Forget what is matched so far, and match 1+ word characters
To match only consecutive words from the start of the string, while allowing other characters after them:
\G:?\K\w+

need support defining the right regex

I would like to parse a file using boost::sregex_token_iterator.
Unfortunately I'm not able to find the right regex to extract strings in the form FOO:BAR out of it.
The code example below works only if one such occurrence per line is found, but I would like to support multiple such entries per line, and ideally also a comment after a '#'.
So entries like this
AA:BB CC:DD EE:FF #this is a comment
should result in 3 identified tokens (AA:BB, CC:DD, EE:FF)
boost::regex re("((\\W+:\\W+)\\S*)+");
boost::sregex_token_iterator i(line.begin(), line.end(), re, -1), end;
for(; i != end; i++){
std::stringstream ss(*i);
...
}
Any support is very welcome.
I suggest you use splitting to get the values you need.
I would begin by first splitting using #. This separates the comment from the rest of the line. Then split using white space, which separates the pairs out. After this, individual pairs can be split using :.
If, for whatever reason, you must use regex, you can iterate over the matches. In this case I would use the following regex:
(?:#(?:.*))*(\w+:\w+)\s*
This regex will match every pair until it finds a comment. If there is a comment, it will skip to the next new line.
You want to match sequences of 1 or more word chars followed with : and then having again 1 or more word chars.
Thus, you need to replace -1 with 1 in the call to boost::sregex_token_iterator to get Group 1 text chunks and replace the regex you use with \w+:\w+ pattern:
boost::regex re(R"(#.*|(\w+:\w+))");
boost::sregex_token_iterator i(line.begin(), line.end(), re, 1), end;
Note that R"(#.*|(\w+:\w+))" is a raw string literal that actually represents #.*|(\w+:\w+) pattern that matches # and then the rest of the line or matches and captures the pattern you need into Group 1.
See a std::regex C++ example (you may easily adjust the code for Boost):
#include <string>
#include <iostream>
#include <regex>
using namespace std;
int main() {
std::regex r(R"(#.*|(\w+:\w+))");
std::string s = "AA:BB CC:DD EE:FF #this is a comment XX:YY";
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << m[1].str() << '\n';
}
return 0;
}

C++11 Regex submatches

I have the following code to extract the left & right part from a string of type
[3->1],[2->2],[5->3]
My code looks like the following
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
regex expr("([[:d:]]+)->([[:d:]]+)");
string input = "[3->1],[2->2],[5->3]";
const std::sregex_token_iterator end;
int submatches[] = { 1, 2 };
string left, right;
for (std::sregex_token_iterator itr(input.begin(), input.end(), expr, submatches); itr != end;)
{
left = ((*itr).str()); ++itr;
right = ((*itr).str()); ++itr;
cout << left << " " << right << endl;
}
}
Output will be
3 1
2 2
5 3
Now I am trying to extend it so that first part will be a string instead of digit. For example, the input will be
[(3),(5),(0,1)->2],[(32,2)->6],[(27),(61,11)->1]
And I need to split it as
(3),(5),(0,1) 2
(32,2) 6
(27),(61,11) 1
Basic expressions that I tried ("(\\(.*+)->([[:d:]]+)") just splits the entire string to two as following
(3),(5),(0,1)->2],[(32,2)->6],[(27),(61,11) 1
Can somebody give me some suggestions on how to achieve this? Appreciate all the help.
You need to match everything after the first '[' up to, but not including, "->". It is similar to writing a regex for the multiline comment /* ... */, where "*/" has to be excluded from the body; otherwise the regex is greedy and consumes everything up to the last occurrence, which is what happens with "->" in your case. You can't simply use the dot for any character, because .* is greedy.
This works for me:
\\[([^-\\]]+)->([0-9]+)\\]
The '^' at the start of [...] negates the class, so every character except '-' (which keeps the match from running past "->") and ']' is accepted.
What you need is to make it a bit more specific:
\[([^]]*)->([^]]*)\]
This avoids capturing too much data.
You could have used the .*? pattern instead of [^]]*, but it would have been less efficient.

Looking for C++ Regex optionally including whitespaces

I'm having a string like
"<firstname>Anna</firstname>"
or
"<firstname>Anna Lena</firstname>"
and I want to use Regex to get the name out of it (so only "Anna" or "Anna Lena"). Currently I'm using:
std::regex reg1 ("(<firstname>)([a-zA-Z0-9]*)(</firstname>)");
and
std::regex_replace (std::back_inserter(result), input.begin(), input.end(), reg1, "$2");
which works well with only one name, but apparently it misses anything after that because it doesn't consider whitespaces. Now I've tried adding \s like ((([a-zA-Z0-9]*)|\s)*) but my IDE (Qt) tells me, that that \s is an unknown escape sequence.
Right now, "<firstname>Anna Lena</firstname>" results in "<firstname>Anna".
How do I solve this in an elegant way?
Use a reluctant (lazy) quantifier for the dot:
std::regex reg1 ("<firstname>(.*?)</firstname>");
Alternatively, you can use "not a left angle bracket":
std::regex reg1 ("<firstname>([^<]*)</firstname>");
Note that I removed the unnecessary groups around the tag literals, so the target is now group 1 (your regex captured it in group 2); the replacement string therefore becomes "$1".
It seems to me you have an issue with the back_inserter in regex_replace, which automatically appends new elements at the end of the container.
I suggest adding \s to the character class and matching the strings instead of reassigning the vector strings.
Here is a demo of my approach:
#include <iostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;
int main() {
std::vector<std::string> strings;
strings.push_back("<firstname>Anna</firstname>");
strings.push_back("<firstname>Anna Lena</firstname>");
std::regex reg("(<firstname>)([a-zA-Z0-9\\s]*)(</firstname>)");
for (size_t k = 0; k < strings.size(); k++)
{
smatch s;
if (std::regex_match(strings[k], s, reg)) {
strings[k] = s[2];
std::cout << strings[k] << std::endl;
}
}
return 0;
}
Output:
Anna
Anna Lena

C++ RegExp and placeholders

I'm on C++11 MSVC2013, I need to extract a number from a file name, for example:
string filename = "s 027.wav";
If I were writing code in Perl, Java or Basic, I would use a regular expression and something like this would do the trick in Perl5:
$filename =~ /(\d+)/g;
and I would have the number "027" in placeholder variable $1.
Can I do this in C++ as well? Or can you suggest a different method to extract the number 027 from that string? Also, I should convert the resulting numerical string into an integral scalar, I think atoi() is what I need, right?
You can do this in C++, as of C++11, with the collection of classes found in <regex>. It's pretty similar to the regular expressions you've used in other languages. Here's a no-frills example of how you might search for the number in the filename you posted:
const std::string filename = "s 027.wav";
std::regex re = std::regex("[0-9]+");
std::smatch matches;
if (std::regex_search(filename, matches, re)) {
std::cout << matches.size() << " matches." << std::endl;
for (auto &match : matches) {
std::cout << match << std::endl;
}
}
As for converting 027 into a number, you could use atoi (from <cstdlib>) as you mentioned, but this will store the value 27, not 027. If you want to keep the leading 0, I believe you will need to keep it as a string. match above is a sub_match, so extract a string and convert it to a const char* for atoi:
int value = atoi(match.str().c_str());
Ok, I solved using std::regex which for some reason I couldn't get to work properly when trying to modify the examples I found around the web. It was simpler than I thought. This is the code I wrote:
#include <cstdlib>
#include <regex>
#include <string>
using namespace std;
int main()
{
string FileName = "s 027.wav";
// Holds the results of the search
smatch m;
// The regexp /\d+/ works in Perl and Java; in a C++ string literal the
// backslash must be doubled ("\\d+"), which is likely why it failed here.
// This variation looks for a run of 1 to 3 digits instead.
regex re("[0-9]{1,3}");
// Do the search and check that something was found
if (regex_search(FileName, m, re)) {
// m[0] is the whole match; m[1], m[2], ... would hold capture groups
// (like $1, $2, $3, etc. in Perl)
string sMidiNoteNum = m[0];
// Convert the digit string to an integer
int MidiNote = atoi(sMidiNoteNum.c_str());
}
}
Here is an example using Boost; substitute the proper namespace and it should work.
typedef std::string::const_iterator SITR;
SITR start = str.begin();
SITR end = str.end();
boost::regex NumRx("\\d+");
boost::smatch m;
while ( boost::regex_search ( start, end, m, NumRx ) )
{
int val = atoi( m[0].str().c_str() );
start = m[0].second;
}