Tokenize a C++ string with regex having special characters

I am trying to find the tokens in a string, which has words, numbers, and special chars. I tried the following code:
#include <iostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;
int main() {
string str("The ,quick brown. fox \"99\" named quick_joe!");
regex reg("[\\s,.!\"]+");
sregex_token_iterator iter(str.begin(), str.end(), reg, -1), end;
vector<string> vec(iter, end);
for (auto a : vec) {
cout << a << ":";
}
cout << endl;
}
And got the following output:
The:quick:brown:fox:99:named:quick_joe:
But I wanted the output:
The:,:quick:brown:.:fox:":99:":named:quick_joe:!:
What regex should I use for that? I would like to stick to standard C++ if possible, i.e. I would not like a solution with Boost.
(See 43594465 for a Java version of this question, but now I am looking for a C++ solution. So essentially, the question is how to map Java's Matcher and Pattern to C++.)

You're asking to interleave non-matched substrings (submatch -1) with the whole matched substrings (submatch 0), which is slightly different:
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,0}), end;
This yields:
The: ,:quick: :brown:. :fox: ":99:" :named: :quick_joe:!:
Since you're looking to just drop whitespace, change the regex to consume surrounding whitespace, and add a capture group for the non-whitespace chars. Then, just specify submatch 1 in the iterator, instead of submatch 0:
regex reg("\\s*([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;
Yields:
The:,:quick brown:.:fox:":99:":named quick_joe:!:
Splitting the spaces between adjoining words requires splitting on 'just spaces' too:
regex reg("\\s*\\s|([,.!\"]+)\\s*");
However, you'll end up with empty submatches:
The:::,:quick::brown:.:fox:::":99:":named::quick_joe:!:
Easy enough to drop those:
regex reg("\\s*\\s|([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;
vector<string> vec;
copy_if(iter, end, back_inserter(vec), [](const string& x) { return x.size(); });
Finally:
The:,:quick:brown:.:fox:":99:":named:quick_joe:!:
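Putting the pieces above together, a self-contained version might look roughly like the sketch below. The function name split_keep_punct is made up for illustration; the regex and the {-1, 1} submatch list are the ones from this answer.

```cpp
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (the name is not from the original answer).
// Splits on whitespace and punctuation, keeping punctuation as tokens.
std::vector<std::string> split_keep_punct(const std::string& text) {
    static const std::regex reg("\\s*\\s|([,.!\"]+)\\s*");
    // -1 yields the text between matches, 1 yields the captured punctuation.
    std::sregex_token_iterator iter(text.begin(), text.end(), reg, {-1, 1}), end;
    std::vector<std::string> tokens;
    // Drop the empty submatches produced by the whitespace-only alternative.
    std::copy_if(iter, end, std::back_inserter(tokens),
                 [](const std::string& t) { return !t.empty(); });
    return tokens;
}
```

Calling split_keep_punct on the string from the question should give the twelve tokens shown in the final output line above.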

If you want to use the approach used in the Java related question, just use a matching approach here, too.
regex reg(R"(\d+|[^\W\d]+|[^\w\s])");
sregex_token_iterator iter(str.begin(), str.end(), reg), end;
vector<string> vec(iter, end);
Result: The:,:quick:brown:.:fox:":99:":named:quick_joe:!:. Note that this won't match Unicode letters, because \w (like \d and \s) is not Unicode-aware in std::regex.
Pattern details:
\d+ - 1 or more digits
| - or
[^\W\d]+ - 1 or more ASCII letters or _
| - or
[^\w\s] - 1 char other than an ASCII letter, digit, _ or whitespace.
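To see which alternative produced each token, one option is to wrap every alternative in its own capture group and check which group participated in the match. This is only a sketch: the function name classify_tokens and the labels "number", "word" and "other" are made up for illustration.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: pairs a label with each token, based on which
// of the three alternatives matched.
std::vector<std::pair<std::string, std::string>> classify_tokens(const std::string& text) {
    // Same three alternatives as in the pattern details, one group each.
    static const std::regex reg(R"((\d+)|([^\W\d]+)|([^\w\s]))");
    std::vector<std::pair<std::string, std::string>> out;
    for (std::sregex_iterator it(text.begin(), text.end(), reg), end; it != end; ++it) {
        const std::smatch& m = *it;
        const char* kind = m[1].matched ? "number" : m[2].matched ? "word" : "other";
        out.emplace_back(kind, m.str());
    }
    return out;
}
```

Whitespace matches none of the alternatives, so it is skipped without needing any filtering step.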

Related

How to retrieve the captured substrings from a capturing group that may repeat?

I'm sorry I found it difficult to express this question with my poor English. So, let's go directly to a simple example.
Assume we have a subject string "apple:banana:cherry:durian". We want to match the subject and have $1, $2, $3 and $4 become "apple", "banana", "cherry" and "durian", respectively. The pattern I'm using is ^(\w+)(?::(.*?))*$, and $1 will be "apple" as expected. However, $2 will be "durian" instead of "banana".
The subject string doesn't have to contain exactly 4 items. For example, it could be "one:two:three", in which case $1 and $2 will be "one" and "three" respectively. Again, the middle item is missing.
What is the correct pattern to use in this case? By the way, I'm going to use PCRE2 in C++ code, so Perl's built-in split function is not available. Thanks.
If the input contains strictly items of interest separated by :, like item1:item2:item3, as the attempt in the question indicates, then you can use the regex pattern
[^:]+
which matches a run of consecutive characters that are not :, i.e. a substring up to the next :. That may need to capture as well, ([^:]+), depending on the overall approach. How to use this to get all such matches depends on the language.†
In C++ there are different ways to approach this. Using std::regex_iterator
#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <iostream>
int main()
{
std::string str{R"(one:two:three)"};
std::regex r{R"([^:]+)"};
std::vector<std::string> result{};
auto it = std::sregex_iterator(str.begin(), str.end(), r);
auto end = std::sregex_iterator();
for(; it != end; ++it) {
auto match = *it;
result.push_back(match[0].str());
}
std::cout << "Input string: " << str << '\n';
for(auto i : result)
std::cout << i << '\n';
}
Prints as expected.
One can also use std::regex_search, even though it returns at the first match, by iterating over the string and moving the search start past every match:
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string str{"one:two:three"};
std::regex r{"[^:]+"};
std::smatch res;
std::string::const_iterator search_beg( str.cbegin() );
while ( regex_search( search_beg, str.cend(), res, r ) )
{
std::cout << res[0] << '\n';
search_beg = res.suffix().first;
}
std::cout << '\n';
}
(With this string and regex we don't need raw string literals, so I've removed them here.)
† This question was initially tagged with perl (with no c++), also with an explicit mention of it in text (still there), and the original version of this answer referred to Perl with
/([^:]+)/g
The /g "modifier" is for "global," to find all matches. The // are pattern delimiters.
When this expression is bound (=~) to a variable with a target string then the whole expression returns a list of matches when used in a context in which a list is expected, which can thus be directly assigned to an array variable.
my @captures = $string =~ /[^:]+/g;
(when this is used literally as shown then the capturing () aren't needed)
Assigning to an array provides this "list context." If the matching is used in a "scalar context," in which a single value is expected, like in the condition for an if test or being assigned to a scalar variable, then a single true/false is returned (usually 1 or '', empty string).
Repeating a capture group will only capture the value of the last iteration. Instead, you might make use of the \G anchor to get consecutive matches.
If the whole string can only contain word characters separated by colons:
(?:^(?=\w+(?::\w+)+$)|\G(?!^):)\K\w+
The pattern matches:
(?: Non capture group
^ Assert start of string
(?=\w+(?::\w+)+$) Assert from the current position 1+ word characters and 1+ repetitions of : and 1+ word characters till the end of the string
| Or
\G(?!^): Assert the position at the end of the previous match, not at the start and match :
) Close non capture group
\K\w+ Forget what is matched so far, and match 1+ word characters
To match only word characters from the start of the string, while allowing other characters after them:
\G:?\K\w+
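Note that \G and \K are PCRE features; std::regex does not support them. If the standard library is an option, a rough equivalent is to validate the overall shape once and then iterate over the items. The sketch below also demonstrates that a repeated group keeps only its last iteration. Both function names are made up for illustration, and unlike the first pattern above, the shape check here also accepts a single item with no colon.

```cpp
#include <regex>
#include <string>
#include <vector>

// Shows the pitfall: group 2 sits inside a repeated group, so after a
// successful match it holds only the value from the last iteration.
std::string last_repeated_capture(const std::string& subject) {
    static const std::regex re(R"(^(\w+)(?::(\w+))*$)");
    std::smatch m;
    return std::regex_match(subject, m, re) ? m[2].str() : std::string();
}

// Validate the whole string first, then collect every item separately.
std::vector<std::string> colon_items(const std::string& subject) {
    static const std::regex shape(R"(\w+(?::\w+)*)");
    static const std::regex item(R"(\w+)");
    std::vector<std::string> items;
    if (!std::regex_match(subject, shape)) return items;  // empty on malformed input
    for (std::sregex_iterator it(subject.begin(), subject.end(), item), end; it != end; ++it)
        items.push_back(it->str());
    return items;
}
```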

need support defining the right regex

I would like to parse a file using boost::sregex_token_iterator.
Unfortunately I'm not able to find the right regex to extract strings in the form FOO:BAR out of it.
The code example below works only if there is one such occurrence per line, but I would like to support multiple such entries per line, and ideally also a comment after a '#'.
So entries like this
AA:BB CC:DD EE:FF #this is a comment
should result in 3 identified tokens (AA:BB, CC:DD, EE:FF)
boost::regex re("((\\W+:\\W+)\\S*)+");
boost::sregex_token_iterator i(line.begin(), line.end(), re, -1), end;
for(; i != end; i++){
std::stringstream ss(*i);
...
}
Any support is very welcome.
I suggest you use splitting to get the values you need.
I would begin by first splitting using #. This separates the comment from the rest of the line. Then split using white space, which separates the pairs out. After this, individual pairs can be split using :.
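The three splitting steps just described could be sketched without any regex as follows. The function name parse_pairs is made up for illustration, and skipping words that contain no ':' is an assumption about the desired behaviour.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the (left, right) pairs found on one line.
std::vector<std::pair<std::string, std::string>> parse_pairs(const std::string& line) {
    // 1. Cut off the comment, if any (find returns npos when there is no '#',
    //    in which case substr keeps the whole line).
    const std::string data = line.substr(0, line.find('#'));
    // 2. Split the remainder on whitespace.
    std::istringstream words(data);
    std::vector<std::pair<std::string, std::string>> pairs;
    std::string word;
    while (words >> word) {
        // 3. Split each word at its first ':'; words without one are skipped.
        const std::string::size_type colon = word.find(':');
        if (colon == std::string::npos) continue;
        pairs.emplace_back(word.substr(0, colon), word.substr(colon + 1));
    }
    return pairs;
}
```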
If, for whatever reason, you must use regex, you can iterate over the matches. In this case I would use the following regex:
(?:#(?:.*))*(\w+:\w+)\s*
This regex will match every pair until it finds a comment. If there is a comment, it will skip to the next new line.
You want to match sequences of 1 or more word chars followed with : and then having again 1 or more word chars.
Thus, you need to replace -1 with 1 in the call to boost::sregex_token_iterator to get Group 1 text chunks and replace the regex you use with \w+:\w+ pattern:
boost::regex re(R"(#.*|(\w+:\w+))");
boost::sregex_token_iterator i(line.begin(), line.end(), re, 1), end;
Note that R"(#.*|(\w+:\w+))" is a raw string literal that represents the pattern #.*|(\w+:\w+), which either matches # and the rest of the line, or matches the pattern you need and captures it into Group 1.
See an std::regex C++ example (you may easily adjust the code for Boost):
#include <string>
#include <iostream>
#include <regex>
using namespace std;
int main() {
std::regex r(R"(#.*|(\w+:\w+))");
std::string s = "AA:BB CC:DD EE:FF #this is a comment XX:YY";
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
if (m[1].matched) // skip the comment match, where Group 1 did not participate
std::cout << m[1].str() << '\n';
}
return 0;
}

Looking for C++ Regex optionally including whitespaces

I'm having a string like
"<firstname>Anna</firstname>"
or
"<firstname>Anna Lena</firstname>"
and I want to use Regex to get the name out of it (so only "Anna" or "Anna Lena"). Currently I'm using:
std::regex reg1 ("(<firstname>)([a-zA-Z0-9]*)(</firstname>)");
and
std::regex_replace (std::back_inserter(result), input.begin(), input.end(), reg1, "$2");
which works well with only one name, but apparently it misses anything after that because it doesn't account for whitespace. Now I've tried adding \s like ((([a-zA-Z0-9]*)|\s)*) but my IDE (Qt) tells me that \s is an unknown escape sequence.
Right now, "<firstname>Anna Lena</firstname>" results in "<firstname>Anna".
How do I solve this in an elegant way?
Use a reluctant quantifier for dot:
std::regex reg1 ("<firstname>(.*?)</firstname>");
Alternatively, you can use "anything but a left angle bracket":
std::regex reg1 ("<firstname>([^<]*)</firstname>");
Note that I removed the unnecessary groups around the tag literals, so the target is now group 1 (your regex captured it in group 2).
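To pull out only the captured name rather than rewriting the input, std::regex_search with a match object is one option. A minimal sketch, using a capturing variant of the negated-class pattern; the function name extract_firstname is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text between the tags, or an empty
// string if the tags are not found.
std::string extract_firstname(const std::string& input) {
    static const std::regex reg("<firstname>([^<]*)</firstname>");
    std::smatch m;
    if (std::regex_search(input, m, reg))
        return m[1].str();  // group 1 is the name, spaces included
    return std::string();
}
```

Because [^<] accepts any character except '<', names containing spaces such as "Anna Lena" come through unchanged.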
It seems to me you have an issue with the back_inserter in the regex_replace call, which automatically inserts new elements at the end of the container.
I suggest adding \s to the character class and matching the strings instead of reassigning the vector strings.
Here is a demo of my approach:
#include <iostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;
int main() {
std::vector<std::string> strings;
strings.push_back("<firstname>Anna</firstname>");
strings.push_back("<firstname>Anna Lena</firstname>");
std::regex reg("(<firstname>)([a-zA-Z0-9\\s]*)(</firstname>)");
for (size_t k = 0; k < strings.size(); k++)
{
smatch s;
if (std::regex_match(strings[k], s, reg)) {
strings[k] = s[2];
std::cout << strings[k] << std::endl;
}
}
return 0;
}
Output:
Anna
Anna Lena

Using regex to split special char

string MyName = " 'hi, load1', 'hi, load2', varthatnotstring ";
I want to use regex to split the above string at every comma (,), while preserving strings that are inside quotes.
As such, splitting MyName should yield:
1: 'hi, load1'
2: 'hi, load2'
3: varthatnotstring
I currently use regex MyR("(.),(.),(.*)");, but that gives me:
1: 'hi
2: load1'
3: 'hi
4: load2'
What regular-expression should I use?
Depending on how you want to handle certain corner cases, you can use the following:
std::regex reg(R"--((('.*?')|[^,])+)--");
Step by step:
R"--(...)--" is the syntax for raw string literals, so we don't have to worry about escaping. We don't need it here, but I'm using them by default for regex strings.
('.*?') all characters between (and including) two apostrophes (non greedy)
[^,] anything that is not a comma
(('.*?')|[^,])+ arbitrary sequence of non-,-characters or '...'-sequences.
(Note: the ('.*?') part has to come first)
So this will also match e.g. tkasd 'rtzrze,123' as a single match. It will also NOT remove any whitespace.
Usage:
std::regex reg(R"--((('.*?')|[^,])+)--");
std::string s = ",,t '123,4565',k ,'rt',t,z";
for (std::sregex_iterator rit(s.begin(), s.end(), reg), end{}; rit != end; ++rit) {
std::cout << rit->str() << std::endl;
}
Output:
t '123,4565'
k
'rt'
t
z
Edit:
I rarely use regular expressions, so any comments about possible improvements or gotchas are welcome. Maybe there is also an even better solution using regex_token_iterator.
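If the surrounding whitespace should go as well, one possibility is to trim each match after the fact. The sketch below is a variation, not the exact pattern above: it uses '[^']*' instead of '.*?' for the quoted part, and it drops fields that contain only whitespace, which is an assumption about the desired behaviour. The function name split_outside_quotes is made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits on commas outside '...' and trims each field.
std::vector<std::string> split_outside_quotes(const std::string& s) {
    // A quoted run, or any single non-comma character, repeated.
    static const std::regex field(R"((?:'[^']*'|[^,])+)");
    const char* const ws = " \t\r\n";
    std::vector<std::string> out;
    for (std::sregex_iterator it(s.begin(), s.end(), field), end; it != end; ++it) {
        const std::string raw = it->str();
        const std::string::size_type first = raw.find_first_not_of(ws);
        if (first == std::string::npos) continue;  // whitespace-only field
        const std::string::size_type last = raw.find_last_not_of(ws);
        out.push_back(raw.substr(first, last - first + 1));
    }
    return out;
}
```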

C++ regex escaping punctuation characters like "."

Matching a "." in a string with the std::tr1::regex class makes me use a weird workaround.
Why do I need to check for "\\\\." instead of "\\."?
regex(".") // Matches everything (but "\n") as expected.
regex("\\.") // Matches everything (but "\n").
regex("\\\\.") // Matches only ".".
Can someone explain why? It's really bothering me since I had my code written using boost::regex classes, which didn't need this syntax.
Edit: Sorry, regex("\\\\.") seems to match nothing.
Edit2: Some code
void parser::lex(regex& token)
{
// Skipping whitespaces
{
regex ws("\\s*");
sregex_token_iterator wit(source.begin() + pos, source.end(), ws, regex_constants::match_default), wend;
if(wit != wend)
pos += (*wit).length();
}
sregex_token_iterator it(source.begin() + pos, source.end(), token, regex_constants::match_default), end;
if (it != end)
temp = *it;
else
temp = "";
}
This is because, inside a C++ string literal, \. is treated as an escape sequence that the compiler itself tries to interpret as a single character. What you want is for your regex to contain the actual two-character string \., which is written "\\." in source code because \\ is the escape sequence for the backslash character (\).
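The two levels of escaping can be checked with a small sketch like the one below, written against std::regex rather than std::tr1::regex. The function names are made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only if `s` is exactly one literal dot.
bool is_single_dot(const std::string& s) {
    // The C++ literal "\\." holds two characters, a backslash and a dot,
    // which the regex engine reads as "a literal dot".
    // The raw string literal below spells the same pattern.
    static const std::regex escaped("\\.");
    static const std::regex raw(R"(\.)");
    return std::regex_match(s, escaped) && std::regex_match(s, raw);
}

// For contrast: an unescaped dot matches any single character except a newline.
bool is_any_single_char(const std::string& s) {
    static const std::regex any(".");
    return std::regex_match(s, any);
}
```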
As it turns out, the actual problem was due to the way sregex_token_iterator was used. Using match_default meant it was always finding the next match in the string, if any, even if there is a non-match in-between. That is,
string source = "AAA.BBB";
regex dot("\\.");
sregex_token_iterator wit(source.begin(), source.end(), dot, regex_constants::match_default);
would give a match at the dot, rather than reporting that there was no match.
The solution is to use match_continuous instead.
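With std::regex_search, the same flag can be passed directly. A minimal sketch, assuming the lexer only wants a token that starts exactly at the current position; the helper name matches_at is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: does `token` match starting exactly at offset `pos`?
// Assumes pos <= source.size().
bool matches_at(const std::string& source, std::string::size_type pos,
                const std::regex& token) {
    std::smatch m;
    // match_continuous only accepts a match that begins at the first
    // iterator, instead of scanning ahead for a later one.
    return std::regex_search(source.begin() + pos, source.end(), m, token,
                             std::regex_constants::match_continuous);
}
```

For "AAA.BBB" and the dot regex, this reports no match at offset 0 and a match at offset 3.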
Try to escape the dot by its ASCII code:
regex("\\x2E")