regex as tokenizer - string beginning with delimiter - c++

sregex_token_iterator works almost perfectly as a tokenizer when the index of the submatch is specified to be -1. But unfortunately it doesn't work well with strings that begin with delimiters e.g:
#include <string>
#include <regex>
#include <iostream>
using namespace std;
int main()
{
string s("--aa---b-c--d--");
regex r("-+");
for (sregex_token_iterator it = sregex_token_iterator(s.begin(), s.end(), r, -1); it != sregex_token_iterator(); ++it)
{
cout << (string) *it << endl;
}
return 0;
}
prints out:
aa
b
c
d
(Note the leading empty line).
So note that it actually handles trailing delimiters well (as it doesn't print an extra empty line).
Reading the standard, it seems there is a clause specifically for making trailing delimiters work well, i.e.:
[re.tokiter] no 4.
If the end of sequence is reached (position is equal to the end of sequence iterator), the iterator becomes equal to the end-of-sequence iterator value, unless the sub-expression being enumerated has index -1, in which case the iterator enumerates one last sub-expression that contains all the characters from the end of the last regular expression match to the end of the input sequence being enumerated, provided that this would not be an empty sub-expression.
Does anyone know what's the reason for this seemingly asymmetric behaviour being specified?
And lastly, is there an elegant solution to make this work? (such that we don't have empty entries at all).

Apparently the iterator returns an empty string for the (empty) text in front of the leading - delimiters. A simple (not necessarily elegant) solution is to discard all strings of length zero:
...
string aux = (string) *it;
if(aux.size() > 0){
cout << aux << endl;
}
...
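For completeness, here is a minimal sketch of that filter wrapped in a function. The name split_nonempty is made up for illustration (it is not a standard function), and it assumes that dropping every zero-length token is acceptable:

```cpp
#include <regex>
#include <string>
#include <vector>

// Split `s` on every match of `delim`, skipping zero-length tokens.
// `split_nonempty` is a hypothetical helper name, not a standard function.
std::vector<std::string> split_nonempty(const std::string& s, const std::regex& delim)
{
    std::vector<std::string> tokens;
    // Index -1 selects the text between matches rather than the matches themselves.
    for (std::sregex_token_iterator it(s.begin(), s.end(), delim, -1), end; it != end; ++it)
    {
        if (it->length() > 0)   // drop the empty token in front of a leading delimiter
            tokens.push_back(it->str());
    }
    return tokens;
}
```

Called as split_nonempty(s, r) with the s and r from the question, it should return only the four non-empty tokens.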

It seems that when you pass -1 as the fourth argument (the submatch index) you're effectively doing a split, and that's the expected behavior for a split. The first token is whatever precedes the first delimiter, and the last token is whatever follows the last delimiter. In this case, both happen to be the empty string, and it's traditional for split() to drop any empty tokens at the end, but to keep the ones at the beginning.
Just out of curiosity, why don't you match the tokens themselves? If "-+" is the correct regex for the delimiters, this should match the tokens:
regex r("[^-]+");
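A rough sketch of matching the tokens instead of the delimiters, assuming the delimiter is a run of '-' characters as in the question. The function name tokens_by_match is hypothetical:

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every maximal run of characters other than '-'.
// `tokens_by_match` is a hypothetical helper name used for illustration.
std::vector<std::string> tokens_by_match(const std::string& s)
{
    static const std::regex token("[^-]+");
    std::vector<std::string> tokens;
    // Each match is one token, so leading and trailing delimiters never produce entries.
    for (std::sregex_iterator it(s.begin(), s.end(), token), end; it != end; ++it)
        tokens.push_back(it->str());
    return tokens;
}
```

Because the pattern requires at least one character, there is no empty token to filter out afterwards.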

Related

How to retrieve the captured substrings from a capturing group that may repeat?

I'm sorry I found it difficult to express this question with my poor English. So, let's go directly to a simple example.
Assume we have a subject string "apple:banana:cherry:durian". We want to match the subject and have $1, $2, $3 and $4 become "apple", "banana", "cherry" and "durian", respectively. The pattern I'm using is ^(\w+)(?::(.*?))*$, and $1 will be "apple" as expected. However, $2 will be "durian" instead of "banana".
Because the subject string to match doesn't need to be 4 items, for example, it could be "one:two:three", and $1 and $2 will be "one" and "three" respectively. Again, the middle item is missing.
What is the correct pattern to use in this case? By the way, I'm going to use PCRE2 in C++ code, so Perl's built-in split function is not available. Thanks.
If the input contains strictly items of interest separated by :, like item1:item2:item3, as the attempt in the question indicates, then you can use the regex pattern
[^:]+
which matches consecutive characters which are not :, so a substring up to the first :. That may need to capture as well, ([^:]+), depending on the overall approach. How to use this to get all such matches depends on the language.†
In C++ there are different ways to approach this. One is to use std::regex_iterator:
#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <iostream>
int main()
{
std::string str{R"(one:two:three)"};
std::regex r{R"([^:]+)"};
std::vector<std::string> result{};
auto it = std::sregex_iterator(str.begin(), str.end(), r);
auto end = std::sregex_iterator();
for(; it != end; ++it) {
auto match = *it;
result.push_back(match[0].str());
}
std::cout << "Input string: " << str << '\n';
for(auto i : result)
std::cout << i << '\n';
}
Prints as expected.
One can also use std::regex_search, even as it returns at first match -- by iterating over the string to move the search start after every match
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string str{"one:two:three"};
std::regex r{"[^:]+"};
std::smatch res;
std::string::const_iterator search_beg( str.cbegin() );
while ( regex_search( search_beg, str.cend(), res, r ) )
{
std::cout << res[0] << '\n';
search_beg = res.suffix().first;
}
std::cout << '\n';
}
(With this string and regex we don't need the raw string literals, so I've removed them here.)
† This question was initially tagged with perl (with no c++), also with an explicit mention of it in text (still there), and the original version of this answer referred to Perl with
/([^:]+)/g
The /g "modifier" is for "global," to find all matches. The // are pattern delimiters.
When this expression is bound (=~) to a variable with a target string then the whole expression returns a list of matches when used in a context in which a list is expected, which can thus be directly assigned to an array variable.
my @captures = $string =~ /[^:]+/g;
(when this is used literally as shown then the capturing () aren't needed)
Assigning to an array provides this "list context." If the matching is used in a "scalar context," in which a single value is expected, like in the condition for an if test or being assigned to a scalar variable, then a single true/false is returned (usually 1 or '', empty string).
Repeating a capture group will only capture the value of the last iteration. Instead, you might make use of the \G anchor to get consecutive matches.
If the whole string can only contain word characters separated by colons:
(?:^(?=\w+(?::\w+)+$)|\G(?!^):)\K\w+
The pattern matches:
(?: Non capture group
^ Assert start of string
(?=\w+(?::\w+)+$) Assert from the current position 1+ word characters and 1+ repetitions of : and 1+ word characters till the end of the string
| Or
\G(?!^): Assert the position at the end of the previous match, not at the start and match :
) Close non capture group
\K\w+ Forget what is matched so far, and match 1+ word characters
To allow only words as well from the start of the string, and allow other chars after the word chars:
\G:?\K\w+

need support defining the right regex

I would like to parse a file using boost::sregex_token_iterator.
Unfortunately I'm not able to find the right regex to extract strings in the form FOO:BAR out of it.
The code example below works only if there is one such occurrence per line, but I would like to support multiple such entries per line, and ideally also a comment after a '#'.
So entries like this
AA:BB CC:DD EE:FF #this is a comment
should result in 3 identified tokens (AA:BB, CC:DD, EE:FF)
boost::regex re("((\\W+:\\W+)\\S*)+");
boost::sregex_token_iterator i(line.begin(), line.end(), re, -1), end;
for(; i != end; i++){
std::stringstream ss(*i);
...
}
Any support is very welcome.
I suggest you use splitting to get the values you need.
I would begin by first splitting using #. This separates the comment from the rest of the line. Then split using white space, which separates the pairs out. After this, individual pairs can be split using :.
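A minimal sketch of those three splits using only the standard library, assuming fields are separated by whitespace and that only the first ':' in a field matters. The name parse_pairs is made up for this example:

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Parse "KEY:VALUE" pairs from one line, ignoring anything after '#'.
// `parse_pairs` is a hypothetical helper name.
std::vector<std::pair<std::string, std::string>> parse_pairs(const std::string& line)
{
    std::vector<std::pair<std::string, std::string>> pairs;
    // 1. Cut off the comment: keep only the text before the first '#'.
    std::istringstream fields(line.substr(0, line.find('#')));
    // 2. Split on whitespace: operator>> skips runs of spaces and tabs.
    std::string field;
    while (fields >> field)
    {
        // 3. Split each field at its first ':'; skip fields without one.
        const std::string::size_type colon = field.find(':');
        if (colon == std::string::npos)
            continue;
        pairs.emplace_back(field.substr(0, colon), field.substr(colon + 1));
    }
    return pairs;
}
```

For the sample line in the question this should yield three pairs; whether fields without a ':' should be skipped or reported as errors is a choice left to the caller.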
If, for whatever reason, you must use regex, you can iterate over the matches. In this case I would use the following regex:
(?:#(?:.*))*(\w+:\w+)\s*
This regex will match every pair until it finds a comment. If there is a comment, it will skip to the next new line.
You want to match sequences of 1 or more word chars followed with : and then having again 1 or more word chars.
Thus, you need to replace -1 with 1 in the call to boost::sregex_token_iterator to get Group 1 text chunks, and replace the regex you use with a pattern built around \w+:\w+ that also consumes comments:
boost::regex re(R"(#.*|(\w+:\w+))");
boost::sregex_token_iterator i(line.begin(), line.end(), re, 1), end;
Note that R"(#.*|(\w+:\w+))" is a raw string literal that actually represents #.*|(\w+:\w+) pattern that matches # and then the rest of the line or matches and captures the pattern you need into Group 1.
See a std::regex C++ example (you may easily adjust the code for Boost):
#include <string>
#include <iostream>
#include <regex>
using namespace std;
int main() {
std::regex r(R"(#.*|(\w+:\w+))");
std::string s = "AA:BB CC:DD EE:FF #this is a comment XX:YY";
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
if (m[1].matched) std::cout << m[1].str() << '\n'; // skip the comment match, which leaves Group 1 unset
}
return 0;
}

why finding an empty string in non-empty string returns 0

I just noticed this strange behavior of string::find. I have a non-empty string b and an empty string a. When I call b.find(a), I expect it to return npos, but it returns 0.
#include <iostream>
#include <string>
using namespace std;
int main() {
// your code goes here
string a , b("ABC");
if ( string::npos == b.find(a) ) std::cout << std::endl << "TRUE" << std::endl;
return 0;
}
The above code doesn't print TRUE. Can someone please explain what this means? Since a is empty and b is non-empty, finding an empty string in a non-empty one doesn't make sense to me and should be an error, so find should return npos.
Thanks
The empty string is a substring of every string. The first position where an empty substring occurs is index 0, and find returns the index of the first occurrence of the substring.
If the definition of empty substring confuses you, consider the algorithm that checks if string is a substring. The algorithm checks each character in the potential substring and compares it to the corresponding character in the other string. If any character does not match, then it is not a substring. If the end of the searched string is reached, then it is a match. In the case of an empty string, no character can differ because there are no characters. The end is reached immediately and the conclusion is that the empty string is a substring.
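The algorithm described above can be sketched as follows. This naive_find is a hypothetical function written only to illustrate the argument; it is not how std::string::find is actually implemented:

```cpp
#include <cstddef>
#include <string>

// Naive substring search, written only to illustrate the argument.
// `naive_find` is a hypothetical name.
std::size_t naive_find(const std::string& haystack, const std::string& needle)
{
    if (needle.size() > haystack.size())
        return std::string::npos;
    for (std::size_t start = 0; start + needle.size() <= haystack.size(); ++start)
    {
        std::size_t i = 0;
        // Compare character by character; stop at the first mismatch.
        while (i < needle.size() && haystack[start + i] == needle[i])
            ++i;
        if (i == needle.size())   // reached the end of the needle: a match
            return start;
    }
    return std::string::npos;
}
```

With an empty needle the inner loop never runs, so the very first candidate position (index 0) is reported as a match, which agrees with what b.find(a) returns.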

Find Group of Characters From String

I wrote a program to remove a group of characters from a string. The code is given below.
void removeCharFromString(string &str,const string &rStr)
{
std::size_t found = str.find_first_of(rStr);
while (found!=std::string::npos)
{
str[found]=' ';
found=str.find_first_of(rStr,found+1);
}
str=trim(str);
}
std::string str ("scott<=tiger");
removeCharFromString(str,"<=");
For this input, my program gives the correct output. However, if str is "scott=tiger", the sequence "<=" does not occur in str, yet my program still removes the '=' character from "scott=tiger". I don't want to remove the characters individually; I want to remove them only if the whole group of characters "<=" is found. How can I do this?
The method find_first_of looks for any character in the input, in your case, any of '<' or '='. In your case, you want to use find.
std::size_t found = str.find(rStr);
This answer works on the assumption that you only want to find the set of characters in the exact sequence e.g. If you want to remove <= but not remove =<:
find_first_of will locate any of the characters in the given string, where you want to find the whole string.
You need something to the effect of:
std::size_t found = str.find(rStr);
while (found!=std::string::npos)
{
str.replace(found, rStr.length(), " ");
found=str.find(rStr,found+1);
}
The problem with str[found]=' '; is that it'll simply replace the first character of the string you are searching for, so if you used that, your result would be
scott =tiger
whereas with the changes I've given you, you'll get
scott tiger
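Put together as a self-contained function, the loop might look like the sketch below. The name replaceSequence is hypothetical, the string is returned by value instead of modified in place, and the trim call from the question is left out because that helper isn't shown:

```cpp
#include <cstddef>
#include <string>

// Replace every occurrence of the whole sequence `rStr` in `str` with one space.
// `replaceSequence` is a hypothetical name; trimming is left out.
std::string replaceSequence(std::string str, const std::string& rStr)
{
    if (rStr.empty())
        return str;                          // an empty pattern would match at every position
    std::size_t found = str.find(rStr);
    while (found != std::string::npos)
    {
        str.replace(found, rStr.length(), " ");
        found = str.find(rStr, found + 1);   // resume the search after the inserted space
    }
    return str;
}
```

With this version "scott=tiger" is left untouched when searching for "<=", because find only matches the complete sequence.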

Get String Between 2 Strings

How can I get a string that is between two other declared strings, for example:
String 1 = "[STRING1]"
String 2 = "[STRING2]"
Source:
"832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
How can I get the "I need this text here"?
Since this is homework, only clues:
Find index1 of occurrence of String1
Find index2 of occurrence of String2
Substring from index1+lengthOf(String1) (inclusive) to index2 (exclusive) is what you need
Copy this to a result buffer if necessary (don't forget to null-terminate)
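For reference, the steps above can be sketched with std::string, which removes the need for a manually null-terminated buffer. The name between is hypothetical, and this version makes two assumptions: String2 is searched for only after String1, and an empty string is returned when either marker is missing:

```cpp
#include <string>

// Return the text between the first `open` and the next `close` after it,
// or an empty string if either marker is missing.
// `between` is a hypothetical helper name.
std::string between(const std::string& source, const std::string& open, const std::string& close)
{
    const std::string::size_type index1 = source.find(open);
    if (index1 == std::string::npos)
        return "";
    const std::string::size_type from = index1 + open.length();   // first character after String1
    const std::string::size_type index2 = source.find(close, from);
    if (index2 == std::string::npos)
        return "";
    return source.substr(from, index2 - from);                    // [from, index2)
}
```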
Might be a good case for std::regex, which is part of C++11.
#include <iostream>
#include <string>
#include <regex>
int main()
{
using namespace std::string_literals;
auto start = "\\[STRING1\\]"s;
auto end = "\\[STRING2\\]"s;
std::regex base_regex(start + "(.*)" + end);
auto example = "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"s;
std::smatch base_match;
std::string matched;
if (std::regex_search(example, base_match, base_regex)) {
// The first sub_match is the whole string; the next
// sub_match is the first parenthesized expression.
if (base_match.size() == 2) {
matched = base_match[1].str();
}
}
std::cout << "example: \""<<example << "\"\n";
std::cout << "matched: \""<<matched << "\"\n";
}
Prints:
example: "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
matched: "I need this text here"
What I did was create a program with two strings, start and end, that serve as my start and end matches. I then build a regular expression that looks for those and captures anything in between (including nothing). Then I use regex_search to find the matching part of the input, and set matched to the captured string.
For more info, see http://en.cppreference.com/w/cpp/regex and http://en.cppreference.com/w/cpp/regex/regex_search
Use strstr http://www.cplusplus.com/reference/clibrary/cstring/strstr/ . Calling that function once for each marker gives you 2 pointers; compare them (pointer1 < pointer2) and, if that holds, read all the chars between them.
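A small sketch of the strstr idea, assuming null-terminated inputs. The name between_cstr is made up, and instead of comparing the two pointers afterwards this version starts the second search right after the first marker, so the second pointer can never precede the first:

```cpp
#include <cstring>
#include <string>

// Return the characters between `open` and the following `close`,
// or an empty string if either marker is missing.
// `between_cstr` is a hypothetical helper name.
std::string between_cstr(const char* source, const char* open, const char* close)
{
    const char* p1 = std::strstr(source, open);
    if (p1 == nullptr)
        return "";
    p1 += std::strlen(open);                 // step past the opening marker
    const char* p2 = std::strstr(p1, close); // search only after the opening marker
    if (p2 == nullptr)
        return "";
    return std::string(p1, p2);              // copy the range [p1, p2)
}
```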