Split string with specific constraint on delimiter - c++

Suppose we have a string: "((0.2,0), (1.5,0)) A1 ABC p". I want to split it into logical units like this:
((0.2,0), (1.5,0))
A1
ABC
p
I.e. split the string on whitespace, with the requirement that the previous character isn't a comma.
Is it possible to solve this with a regex?
Update: I've tried in this way:
#include <iostream>
#include <string>
#include <regex>
int main()
{
    std::string s = "((0.2,0), (1.5,0)) A1 ABC p";
    std::regex re("[^, ]*\\(, *[^, ]*\\)*"); // as suggested in the updated answers
    std::sregex_token_iterator
        p(s.begin(), s.end(), re, -1);
    std::sregex_token_iterator end;
    while (p != end)
        std::cout << *p++ << std::endl;
}
The result was: ((0.2,0), (1.5,0)) A1 ABC p
Solution:
#include <iostream>
#include <string>
#include <regex>
int main() {
    std::string s = "((0.2,0), (1.5,0)) A1 ABC p";
    std::regex re("[^, ]*(, *[^, ]*)*");
    std::regex_token_iterator<std::string::iterator> p(s.begin(), s.end(), re);
    std::regex_token_iterator<std::string::iterator> end;
    while (p != end)
        std::cout << *p++ << std::endl;
}
Output:
((0.2,0), (1.5,0))
A1
ABC
p

You can do it like this:
[^, ]*(, *[^, ]*)*
What does this do?
First, let's go over the basics of regular expressions:
[] defines a set of characters that you want to match; for example, [ab] will match an 'a' or a 'b'.
The [^] syntax describes all the characters you do NOT want to match, so [^ab] will match anything that is NOT an 'a' or a 'b'.
The * symbol tells the regular expression that the previous item can appear zero or more times, so a* will match the empty string '' or 'a' or 'aaa' or 'aaaaaaaaaaaaa'.
Putting () around part of an expression creates a group that you can then do interesting things with. In our case we used it to define a part of the pattern that we wanted to be optional, by putting * next to it so that it can appear zero or more times.
OK, putting it all together:
The first part, [^, ]*, says: match zero or more characters that are NOT ' ' or ','. This will match strings like 'A1' or '((0.2'.
The second part, in ()*, is used to continue matching strings that have ',' and spaces in them but that you do not want to split. This part is optional, so that the pattern still correctly matches 'A1' or 'ABC' or 'p'.
So (, *[^, ]*)* will match zero or more pieces that start with ',' and any number of ' ', followed by a run of characters that contains neither ',' nor ' '. In your example it would match ",0)", which is the continuation of "((0.2", and also match ", (1.5" and then ",0))", which all get added together to make "((0.2,0), (1.5,0))".
NOTE: You may need to escape some characters in your expression depending on the regular expression library you are using. The solution will work as written in this online tester: http://www.regexpal.com/
Some libraries and tools (for example ones using POSIX basic syntax, such as grep or sed without -E) need you to escape the grouping parentheses,
so the expression would look like:
[^, ]*\(, *[^, ]*\)*
With std::regex and its default ECMAScript grammar, leave the parentheses unescaped; there \( matches a literal '('.
Also, I removed the ( |$) part, as it is only required if you want the ending space to be part of the match.
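One thing to watch for: because every part of the pattern above is optional, it can also match the empty string, so iterating over all matches may yield empty matches at the separating spaces. A minimal sketch of one way around that, assuming it is acceptable to require at least one character at the start of each unit ([^, ]+ instead of [^, ]*). The helper name split_units is made up for illustration and is not part of the original answer:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): collects every match
// of the pattern as one logical unit.
std::vector<std::string> split_units(const std::string& s) {
    // '+' instead of '*' on the first class, so a match is never empty.
    static const std::regex re("[^, ]+(, *[^, ]*)*");
    std::vector<std::string> units;
    for (std::sregex_iterator it(s.begin(), s.end(), re), end; it != end; ++it) {
        units.push_back(it->str());  // whole match, e.g. "((0.2,0), (1.5,0))"
    }
    return units;
}
```

Calling split_units on the string from the question should give the four units listed there; looping over the returned vector and printing each element reproduces the expected output.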

Related

How to retrieve the captured substrings from a capturing group that may repeat?

I'm sorry I found it difficult to express this question with my poor English. So, let's go directly to a simple example.
Assume we have a subject string "apple:banana:cherry:durian". We want to match the subject and have $1, $2, $3 and $4 become "apple", "banana", "cherry" and "durian", respectively. The pattern I'm using is ^(\w+)(?::(.*?))*$, and $1 will be "apple" as expected. However, $2 will be "durian" instead of "banana".
Because the subject string to match doesn't need to be 4 items, for example, it could be "one:two:three", and $1 and $2 will be "one" and "three" respectively. Again, the middle item is missing.
What is the correct pattern to use in this case? By the way, I'm going to use PCRE2 in C++ code, so there is no split (a Perl built-in function). Thanks.
If the input contains strictly items of interest separated by :, like item1:item2:item3, as the attempt in the question indicates, then you can use the regex pattern
[^:]+
which matches consecutive characters which are not :, so a substring up to the first :. That may need to capture as well, ([^:]+), depending on the overall approach. How to use this to get all such matches depends on the language.†
In C++ there are different ways to approach this. Using std::regex_iterator
#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <iostream>
int main()
{
    std::string str{R"(one:two:three)"};
    std::regex r{R"([^:]+)"};
    std::vector<std::string> result{};
    auto it = std::sregex_iterator(str.begin(), str.end(), r);
    auto end = std::sregex_iterator();
    for(; it != end; ++it) {
        auto match = *it;
        result.push_back(match[0].str());
    }
    std::cout << "Input string: " << str << '\n';
    for(auto i : result)
        std::cout << i << '\n';
}
Prints as expected.
One can also use std::regex_search, even as it returns at first match -- by iterating over the string to move the search start after every match
#include <string>
#include <regex>
#include <iostream>
int main()
{
    std::string str{"one:two:three"};
    std::regex r{"[^:]+"};
    std::smatch res;
    std::string::const_iterator search_beg( str.cbegin() );
    while ( regex_search( search_beg, str.cend(), res, r ) )
    {
        std::cout << res[0] << '\n';
        search_beg = res.suffix().first;
    }
    std::cout << '\n';
}
(With this string and regex we don't need the raw string literal so I've removed them here.)
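Since the question notes that there is no split in C++, another option is to match the delimiter rather than the items. A minimal sketch using std::sregex_token_iterator with submatch index -1, which yields the text between delimiter matches. The helper name split_on_colon is made up for illustration, and the single ':' delimiter is an assumption taken from the question's examples:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `s` on a single ':' delimiter.
std::vector<std::string> split_on_colon(const std::string& s) {
    static const std::regex delim(":");
    // Submatch index -1 selects the pieces between delimiter matches
    // instead of the matches themselves.
    std::sregex_token_iterator first(s.begin(), s.end(), delim, -1), last;
    return std::vector<std::string>(first, last);
}
```

Unlike the [^:]+ approach, this keeps empty items that sit between two adjacent delimiters, which may or may not be what you want.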
† This question was initially tagged with perl (with no c++), also with an explicit mention of it in text (still there), and the original version of this answer referred to Perl with
/([^:]+)/g
The /g "modifier" is for "global," to find all matches. The // are pattern delimiters.
When this expression is bound (=~) to a variable with a target string then the whole expression returns a list of matches when used in a context in which a list is expected, which can thus be directly assigned to an array variable.
my @captures = $string =~ /[^:]+/g;
(when this is used literally as shown then the capturing () aren't needed)
Assigning to an array provides this "list context." If the matching is used in a "scalar context," in which a single value is expected, like in the condition for an if test or being assigned to a scalar variable, then a single true/false is returned (usually 1 or '', empty string).
Repeating a capture group will only capture the value of the last iteration. Instead, you might make use of the \G anchor to get consecutive matches.
If the whole string can only contain word characters separated by colons:
(?:^(?=\w+(?::\w+)+$)|\G(?!^):)\K\w+
The pattern matches:
(?: Non capture group
^ Assert start of string
(?=\w+(?::\w+)+$) Assert from the current position 1+ word characters and 1+ repetitions of : and 1+ word characters till the end of the string
| Or
\G(?!^): Assert the position at the end of the previous match, not at the start and match :
) Close non capture group
\K\w+ Forget what is matched so far, and match 1+ word characters
To also allow a single word at the start of the string, and to allow other characters after the word characters:
\G:?\K\w+
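As far as I know, \G and \K are PCRE features that std::regex's default ECMAScript grammar does not provide, so the patterns above are meant for PCRE2. The first point, that a repeated capture group keeps only the value of its last iteration, can be seen with std::regex too. A minimal sketch (the helper name last_repeated_capture is made up for illustration, and it uses \w+ for the items rather than the question's .*?):

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text held by group 2 after a full match,
// or an empty string if the subject does not match at all.
std::string last_repeated_capture(const std::string& subject) {
    // Group 2 sits inside a repeated non-capturing group.
    static const std::regex re(R"(^(\w+)(?::(\w+))*$)");
    std::smatch m;
    if (!std::regex_match(subject, m, re)) {
        return std::string();
    }
    // Only the final iteration of the repeated group is retained.
    return m[2].str();
}
```

For "apple:banana:cherry:durian" this returns "durian", which matches what the question observed for $2.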

How to find the exact substring with regex in c++11?

I am trying to find substrings that are not surrounded by other a-zA-Z0-9 symbols.
For example: I want to find substring hello, so it won't match hello1 or hellow but will match Hello and heLLo!##$%.
And I have such sample below.
std::string s = "1mySymbol1, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]*" + sub + "[^a-zA-Z0-9]*", std::regex::icase);
std::smatch match;
while (std::regex_search(s, match, rgx)) {
    std::cout << match.size() << "match: " << match[0] << '\n';
    s = match.suffix();
}
The result is:
1match: mySymbol
1match: , /_mySymbol_
1match: mysymbol
But I don't understand why the first occurrence, 1mySymbol1, also matches my regex.
How can I create a proper regex that will ignore such strings?
UPD
If I do it like this
std::string s = "mySymbol, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]+" + sub + "[^a-zA-Z0-9]+", std::regex::icase);
then I find only the substring in the middle
1match: , /_mySymbol_
and don't find the substrings at the beginning and at the end.
The regex [^a-zA-Z0-9]* will match 0 or more characters, so it's perfectly valid for [^a-zA-Z0-9]*mysymbol[^a-zA-Z0-9]* to match mysymbol in 1mySymbol1 (allowing for case insensitivity). As you saw, this is fixed when you use [^a-zA-Z0-9]+ (matching 1 or more characters) instead.
With your update, you see that this doesn't match strings at the beginning or end. That's because [^a-zA-Z0-9]+ has to match 1 or more characters (which don't exist at the beginning or end of the string).
You have a few options:
Use beginning/end anchors: (?:[^a-zA-Z0-9]+|^)mysymbol(?:[^a-zA-Z0-9]+|$) (non-alphanumeric OR beginning of string, followed by mysymbol, followed by non-alphanumeric OR end of string).
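Use negative lookahead and negative lookbehind: (?<![a-zA-Z0-9])mysymbol(?![a-zA-Z0-9]) (match mysymbol which doesn't have an alphanumeric character before or after it). Note that with this approach the match won't include the characters before/after mysymbol. Also note that std::regex's ECMAScript grammar does not support lookbehind, so in C++11 this form needs a different engine (for example Boost.Regex or PCRE).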
I recommend using https://regex101.com/ to play around with regular expressions. It lists all the different constructs you can use.
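A minimal sketch of how the options above could be combined so that they work with std::regex: an anchor-or-character alternative in front (no lookbehind needed), and a negative lookahead behind, which ECMAScript does support. The helper name find_standalone is made up for illustration, and the code assumes the search word contains no regex metacharacters:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper. Assumes `word` contains no regex metacharacters.
std::vector<std::string> find_standalone(const std::string& text,
                                         const std::string& word) {
    // Before the word: start of string or one non-alphanumeric character.
    // After the word: a negative lookahead, which consumes nothing, so two
    // occurrences separated by a single character are both found.
    const std::regex re("(?:^|[^a-zA-Z0-9])(" + word + ")(?![a-zA-Z0-9])",
                        std::regex::icase);
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        found.push_back((*it)[1].str());  // group 1 is the word itself
    }
    return found;
}
```

With the first sample string from the question, this skips 1mySymbol1 and reports the other two occurrences; with the second sample string it reports all three.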

How to select the complete word within the brackets even if it have that brackets within word

Please suggest a solution for the following example.
Scenario-1:
My string: Password={my_pswd}}123}
I want to select the value enclosed within the outer {} brackets (for example, I want to select the complete password value {my_pswd}}123}, not {my_pswd}).
If I use the regex \{(.*?)\}, it selects {my_pswd}, not {my_pswd}}123}. So how can I get the complete value even if it has } in between? Suggestions using regex or any other way are welcome.
Scenario-2:
I am using the regex ^\{|\}$. If my string has both an opening { and a closing } bracket, like {{my_password}}, then I want to select only the first and last bracket. If my string is like {{my_password, I don't want to select the starting bracket. It's like an AND condition in regex. I looked at many posts that do this with lookarounds, but I couldn't get a clear idea. Please give me some suggestions.
Thanks.
It seems that the {...} substrings you want to match must be followed by ; or the end of the string.
Note that this approach will not work when a } inside a value can itself be followed by ;.
You can solve the first scenario by adding a (?![^;]) negative lookahead:
\{(.*?)\}(?![^;])
Details
\{ - a { char
(.*?) - Group 1: any 0+ chars as few as possible
\} - a } char
(?![^;]) - no char other than ; is allowed right after the current position
See the C++ demo:
#include <iostream>
#include <regex>
#include <string>
#include <vector>
int main() {
    const std::regex reg("\\{(.*?)\\}(?![^;])");
    std::string s = "Username={My_{}user};Password={my_pswd}}123}}}kk};Password={my_pswd}}123}";
    std::vector<std::string> results(
        std::sregex_token_iterator(s.begin(), s.end(), reg, 1), // See 1, it extracts Group 1 value
        std::sregex_token_iterator());
    for (const auto& result : results)
    {
        std::cout << result << std::endl;
    }
    return 0;
}
Output:
My_{}user
my_pswd}}123}}}kk
my_pswd}}123
As for the second scenario, you may use
std::regex reg("^\\{([^]*)\\}$");
std::string s = "{My_{}user}";
std::cout << regex_replace(s, reg, "$1") << std::endl; // => My_{}user
The ^\{([^]*)\}$ pattern matches the { at the start (^) of the string, then matches and captures into Group 1 (later referenced with the help of $1 in the replacement pattern) any 0+ chars, as many as possible, and then matches a } at the end of the string ($).
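A minimal sketch of the second scenario wrapped in a function, to show the "AND" behaviour: the braces are removed only when both the leading { and the trailing } are present. The helper name strip_outer_braces is made up for illustration, and the sketch swaps in [\s\S]* for [^]* as a more widely recognised way to write "any character":

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes one pair of outer braces, but only if the
// string both starts with '{' and ends with '}'.
std::string strip_outer_braces(const std::string& s) {
    // [\s\S]* stands in for "any characters, including newlines".
    static const std::regex re(R"(^\{([\s\S]*)\}$)");
    // If the pattern does not match, regex_replace returns the input as is.
    return std::regex_replace(s, re, "$1");
}
```

So {{my_password}} loses only its first and last bracket, while {{my_password is returned unchanged because the closing bracket is missing.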

Regex to replace all occurrences between two matches

I am using std::regex and need to do a search and replace.
The string I have is:
begin foo even spaces and maybe new line(
some text only replace foo foo bar foo, keep the rest
)
some more text not replace foo here
Only the stuff between begin .... ( and ) should be touched.
I managed to replace the first foo by using this search and replace:
(begin[\s\S]*?\([\s\S]*?)foo([\s\S]*?\)[\s\S]*)
$1abc$2
Online regex demo
Online C++ demo
However, how do I replace all three foo in one pass? I tried lookarounds, but failed because of the quantifiers.
The end result should look like this:
begin foo even spaces and maybe new line(
some text only replace abc abc bar abc, keep the rest
)
some more text not replace foo here
Question update:
I am looking for a pure regex solution. That is, the question should be solved by only changing the search and replace strings in the online C++ demo.
I have come up with this code (based on Benjamin Lindley's answer):
#include <algorithm> // std::for_each
#include <iostream>
#include <regex>
#include <string>
int main()
{
    std::string input_text = "my text\nbegin foo even 14 spaces and maybe \nnew line(\nsome text only replace foo foo bar foo, keep the rest\n)\nsome more text not replace foo here";
    std::regex re(R"((begin[^(]*)(\([^)]*\)))");
    std::regex rxReplace(R"(\bfoo\b)");
    std::string output_text;
    // Called for every token: unmatched text is copied, matched blocks get foo replaced in Group 2 only
    auto callback = [&](std::string const& m){
        std::smatch smtch;
        if (regex_search(m, smtch, re)) {
            output_text += smtch[1].str();
            output_text += std::regex_replace(smtch[2].str().c_str(), rxReplace, "abc");
        } else {
            output_text += m;
        }
    };
    // {-1,0} alternates between the text before each match and the match itself
    std::sregex_token_iterator
        begin(input_text.begin(), input_text.end(), re, {-1,0}),
        end;
    std::for_each(begin,end,callback);
    std::cout << output_text;
    return 0;
}
I am using one regex to find all matches of begin...(....) and pass them into the callback function where only Group 2 is processed further (a \bfoo\b regex is used to replace foos with abcs).
I suggest using (begin[^(]*)(\([^)]*\)) regex:
(begin[^(]*) - Group 1 matching a character sequence begin followed with zero or more characters other than (
(\([^)]*\)) - Group 2 matching a literal ( followed with zero or more characters other than ) (with [^)]*) and a literal ).

Get String Between 2 Strings

How can I get a string that is between two other declared strings, for example:
String 1 = "[STRING1]"
String 2 = "[STRING2]"
Source:
"832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
How can I get the "I need this text here"?
Since this is homework, only clues:
Find index1 of occurrence of String1
Find index2 of occurrence of String2
Substring from index1+lengthOf(String1) (inclusive) to index2 (exclusive) is what you need
Copy this to a result buffer if necessary (don't forget to null-terminate)
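For readers who are past the homework stage, the clues above can be sketched with std::string::find and substr. The function name extract_between and its out-parameter signature are made up for illustration. The sketch also searches for the second marker only after the end of the first one, which is a small refinement the clues do not spell out:

```cpp
#include <string>

// Hypothetical helper: on success stores the text between `first` and
// `second` in `out` and returns true; returns false if either is missing.
bool extract_between(const std::string& source, const std::string& first,
                     const std::string& second, std::string& out) {
    const std::string::size_type index1 = source.find(first);
    if (index1 == std::string::npos) return false;

    // Start of the wanted text: just past the first marker (inclusive).
    const std::string::size_type from = index1 + first.size();

    // Look for the second marker only after the first one.
    const std::string::size_type index2 = source.find(second, from);
    if (index2 == std::string::npos) return false;

    // substr takes a start and a length, so index2 itself is excluded.
    out = source.substr(from, index2 - from);
    return true;
}
```

With std::string there is no need to null-terminate the result by hand; that clue applies to plain char buffers.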
Might be a good case for std::regex, which is part of C++11.
#include <iostream>
#include <string>
#include <regex>
int main()
{
    using namespace std::string_literals; // the s suffix below requires C++14
    auto start = "\\[STRING1\\]"s;
    auto end = "\\[STRING2\\]"s;
    std::regex base_regex(start + "(.*)" + end);
    auto example = "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"s;
    std::smatch base_match;
    std::string matched;
    if (std::regex_search(example, base_match, base_regex)) {
        // The first sub_match is the whole string; the next
        // sub_match is the first parenthesized expression.
        if (base_match.size() == 2) {
            matched = base_match[1].str();
        }
    }
    std::cout << "example: \""<<example << "\"\n";
    std::cout << "matched: \""<<matched << "\"\n";
}
Prints:
example: "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
matched: "I need this text here"
What I did was write a program that creates two strings, start and end, that serve as my start and end delimiters. I then build a regular expression that looks for those and captures anything in between (including nothing). Then I use regex_search to find the matching part of the input, and set matched to the captured string.
For more info, see http://en.cppreference.com/w/cpp/regex and http://en.cppreference.com/w/cpp/regex/regex_search
Use strstr (http://www.cplusplus.com/reference/clibrary/cstring/strstr/). With that function you will get two pointers; compare them (pointer1 < pointer2) and, if so, read all the chars between them.
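A minimal sketch of the strstr idea, with one variation: instead of comparing the two pointers afterwards, it starts the second search just past the first marker, so the second pointer cannot come before the first. The function name between_cstr is made up for illustration, and returning an empty string for "not found" is an assumption made to keep the example short:

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: returns the text between `first` and `second`,
// or an empty string if either marker is not found.
std::string between_cstr(const char* source, const char* first,
                         const char* second) {
    const char* p1 = std::strstr(source, first);
    if (p1 == nullptr) return std::string();
    p1 += std::strlen(first);  // skip past the first marker

    // Search for the second marker only in the remainder.
    const char* p2 = std::strstr(p1, second);
    if (p2 == nullptr) return std::string();

    // Copy the characters in [p1, p2).
    return std::string(p1, p2);
}
```

Note that with this return convention an empty result is ambiguous (nothing between the markers versus a missing marker), so real code may want a separate success flag.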