Tokenizing boost::regex matches - c++

I have created a regex to match the lines of a file which have the following structure: string int int
int main()
{
std::string line;
boost::regex pat("\\w\\s\\d\\s\\d");
while (std::cin)
{
std::getline(std::cin, line);
boost::smatch matches;
if (boost::regex_match(line, matches, pat))
std::cout << matches[2] << std::endl;
}
}
I would like to save those numbers into a pair<string,pair<int,int>>. How can I tokenize match of the boost:regex to achieve this?

First of all your regular expression accepts "one word character then one space then one digit then one space then one digit", I bet this is not what you want. Most probably you want your expression look like:
\w+\s+\d+\s+\d+
where \w+ now means "one or more word characters". If you are sure that there is only one space between tokens you can omit + after \s but this way it is safer.
Then you need to select parts of your expression. That is called sub-expression in RE:
(\w+)\s+(\d+)\s+(\d+)
this way what matches by (\w+) (one or more word characters) will be in matches[1], first (\d+) in matches[2] and second (\d+) in matches[3]. Of course you need to put double \ when you put it in C++ string.

Related

How to retrieve the captured substrings from a capturing group that may repeat?

I'm sorry I found it difficult to express this question with my poor English. So, let's go directly to a simple example.
Assume we have a subject string "apple:banana:cherry:durian". We want to match the subject and have $1, $2, $3 and $4 become "apple", "banana", "cherry" and "durian", respectively. The pattern I'm using is ^(\w+)(?::(.*?))*$, and $1 will be "apple" as expected. However, $2 will be "durian" instead of "banana".
Because the subject string to match doesn't need to be 4 items, for example, it could be "one:two:three", and $1 and $2 will be "one" and "three" respectively. Again, the middle item is missing.
What is the correct pattern to use in this case? By the way, I'm going to use PCRE2 in C++ codes, so there is no split, a Perl built-in function. Thanks.
If the input contains strictly items of interest separated by :, like item1:item2:item3, as the attempt in the question indicates, then you can use the regex pattern
[^:]+
which matches consecutive characters which are not :, so a substring up to the first :. That may need to capture as well, ([^:]+), depending on the overall approach. How to use this to get all such matches depends on the language.†
In C++ there are different ways to approach this. Using std::regex_iterator
#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <iostream>
int main()
{
std::string str{R"(one:two:three)"};
std::regex r{R"([^:]+)"};
std::vector<std::string> result{};
auto it = std::sregex_iterator(str.begin(), str.end(), r);
auto end = std::sregex_iterator();
for(; it != end; ++it) {
auto match = *it;
result.push_back(match[0].str());
}
std::cout << "Input string: " << str << '\n';
for(auto i : result)
std::cout << i << '\n';
}
Prints as expected.
One can also use std::regex_search, even as it returns at first match -- by iterating over the string to move the search start after every match
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string str{"one:two:three"};
std::regex r{"[^:]+"};
std::smatch res;
std::string::const_iterator search_beg( str.cbegin() );
while ( regex_search( search_beg, str.cend(), res, r ) )
{
std::cout << res[0] << '\n';
search_beg = res.suffix().first;
}
std::cout << '\n';
}
(With this string and regex we don't need the raw string literal so I've removed them here.)
† This question was initially tagged with perl (with no c++), also with an explicit mention of it in text (still there), and the original version of this answer referred to Perl with
/([^:]+)/g
The /g "modifier" is for "global," to find all matches. The // are pattern delimiters.
When this expression is bound (=~) to a variable with a target string then the whole expression returns a list of matches when used in a context in which a list is expected, which can thus be directly assigned to an array variable.
my #captures = $string =~ /[^:]+/g;
(when this is used literally as shown then the capturing () aren't needed)
Assigning to an array provides this "list context." If the matching is used in a "scalar context," in which a single value is expected, like in the condition for an if test or being assigned to a scalar variable, then a single true/false is returned (usually 1 or '', empty string).
Repeating a capture group will only capture the value of the last iteration. Instead, you might make use of the \G anchor to get consecutive matches.
If the whole string can only contain word characters separated by colons:
(?:^(?=\w+(?::\w+)+$)|\G(?!^):)\K\w+
The pattern matches:
(?: Non capture group
^ Assert start of string
(?=\w+(?::\w+)+$) Assert from the current position 1+ word characters and 1+ repetitions of : and 1+ word characters till the end of the string
| Or
\G(?!^): Assert the position at the end of the previous match, not at the start and match :
) Close non capture group
\K\w+ Forget what is matched so far, and match 1+ word characters
Regex demo
To allow only words as well from the start of the string, and allow other chars after the word chars:
\G:?\K\w+
Regex demo

Why the regex to mark beginning of line doesn't work?

Why the commented regex doesn't behave the same as uncommented regex? I thought '^' also marks beginning of line. isn't it?
#include <iostream>
#include <regex>
int main()
{
std::string str ("this subject has a submarine as a subsequence");
std::regex re ("\\b(sub)([^ ]*)");
// std::regex re ("^(sub)([^ ]*)");
// std::regex re ("(^sub)([^ ]*)");
std::cout << "entire matches:";
std::regex_token_iterator<std::string::iterator> rend;
std::regex_token_iterator<std::string::iterator> a ( str.begin(), str.end(), re );
while (a!=rend) std::cout << " [" << *a++ << "]";
std::cout << std::endl;
return 0;
}
It's working just fine. It can't match anything because your line doesn't start with the letter "sub," it starts with the letters "The"
To explain these regexes:
\\b(sub)([^ ]*)
Start of new word, begins with sub, followed by some number of non-space characters. Two capture groups, one for "sub" and one for the other characters.
^(sub)([^ ]*)
Start of new line, begins with sub, followed by some number of non-space characters. Two capture groups, one for "sub" and one for the other characters.
(^sub)([^ ]*)
[EDIT: Fixed, thanks John] Same as above, but with the anchor being part of the capture group.
These two tries to match a line starting with sub followed by 0 or more non-spaces, like
sub in the string sub bub:
std::regex re ("^(sub)([^ ]*)");
std::regex re ("(^sub)([^ ]*)");
The ^ anchor in that context is what makes it only match the beginning of a line.
^ in brackets means not and in all other places means the start of line (except when you escape the ^)
your uncommented pattern: \b(sub)([^ ]*)
explanation:
\b it stands for the boundary that means the space between one char from \w and one char from \W
Note: \w means all digits between 0-9 and all characters between a-z and A-Z and underline character, also \W means everything except \w so space belongs to \W
(sub) it match "sub" string
([^ ]*) it means everything except space (more general when use ^ in brackets it means not so it means everything except space because in front of ^ you used one space)

RE2 Nested Regex Group Match

I have a RE2 regex as following
const re2::RE2 numRegex("(([0-9]+),)+([0-9])+");
std::string inputStr;
inputStr="apple with make,up things $312,412,3.00");
RE2::Replace(&inputStr, numRegex, "$1$3");
cout << inputStr;
Expected
apple with make,up,things $3124123.00
I was trying to remove the , in the recognized number, $1 would only match 312 but not 412 part. Wondering how to extract the recursive pattern in the group.
Note that RE2 doesn't support lookahead (see Using positive-lookahead (?=regex) with re2) and the solutions I found all use lookaheads.
RE2 based solution
As RE2 does not support lookarounds, there is no pure single-pass regex solution.
You can have a workaround (as usual, when no solution is available): replace the string twice with (\d),(\d) regex and $1$2 substitution:
const re2::RE2 numRegex(R"((\d),(\d))");
std::string inputStr("apple with make,up things $312,412,3.00");
RE2::Replace(&inputStr, numRegex, "$1$2");
RE2::Replace(&inputStr, numRegex, "$1$2"); // <- Second pass to remove commas in 1,2,3,4 like strings
std::cout << inputStr;
C++ std::regex based solution:
You can remove the commas between digits using
std::string inputStr("apple with make,up things $312,412,3.00");
std::regex numRegex(R"((\d),(?=\d))");
std::cout << regex_replace(inputStr, numRegex, "$1") << "\n";
// => apple with make,up things $3124123.00
See the C++ demo. Also, see the regex demo here.
Details:
(\d) - Capturing group 1 ($1): a digit
, - a comma
(?=\d) - a positive lookahead that requires a digit immediately to the right of the current location.
In the pattern that you tried, you are repeating the outer group (([0-9]+),)+ which will then contain the value of the last iteration where it can match a 1+ digits and a comma.
The last iteration will capture 412, and 312, will only be matched.
You are using regex, but as an alternative if you have boost available, you could make use of the \G anchor which can get iterative matches asserting the position at the end of the previous match and replace with an empty string.
(?:\$|\G(?!^))\d+\K,(?=\d)
The pattern matches:
(?: Non capture group
\$ match $
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start
) Close non capture group
\d+\K Match 1+ digits and forget what is matched so far
,(?=\d) Match a comma and assert a digit directly to the right
Regex demo
#include<iostream>
#include <string>
#include <boost/regex.hpp>
using namespace std;
int main()
{
std::string inputStr = "apple with make,up things $312,412,3.00";
boost::regex numRegex("(?:\\$|\\G(?!^))\\d+\\K,(?=\\d)");
std::string result = boost::regex_replace(inputStr, numRegex, "");
std::cout << result << std::endl;
}
Output
apple with make,up things $3124123.00

How to find the exact substring with regex in c++11?

I am trying to find substrings that are not surrounded by other a-zA-Z0-9 symbols.
For example: I want to find substring hello, so it won't match hello1 or hellow but will match Hello and heLLo!##$%.
And I have such sample below.
std::string s = "1mySymbol1, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]*" + sub + "[^a-zA-Z0-9]*", std::regex::icase);
std::smatch match;
while (std::regex_search(s, match, rgx)) {
std::cout << match.size() << "match: " << match[0] << '\n';
s = match.suffix();
}
The result is:
1match: mySymbol
1match: , /_mySymbol_
1match: mysymbol
But I don't understand why first occurance 1mySymbol1 also matches my regex?
How to create a proper regex that will ignore such strings?
UDP
If I do like this
std::string s = "mySymbol, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]+" + sub + "[^a-zA-Z0-9]+", std::regex::icase);
then I find only substring in the middle
1match: , /_mySymbol_
And don't find substrings at the beggining and at the end.
The regex [^a-zA-Z0-9]* will match 0 or more characters, so it's perfectly valid for [^a-zA-Z0-9]*mysymbol[^a-zA-Z0-9]* to match mysymbol in 1mySymbol1 (allowing for case insensitivity). As you saw, this is fixed when you use [^a-zA-Z0-9]+ (matching 1 or more characters) instead.
With your update, you see that this doesn't match strings at the beginning or end. That's because [^a-zA-Z0-9]+ has to match 1 or more characters (which don't exist at the beginning or end of the string).
You have a few options:
Use beginning/end anchors: (?:[^a-zA-Z0-9]+|^)mysymbol(?:[^a-zA-Z0-9]+|$) (non-alphanumeric OR beginning of string, followed by mysymbol, followed by non-alphanumeric OR end of string).
Use negative lookahead and negative lookbehind: (?<![a-zA-Z0-9])mysymbol(?![a-zA-Z0-9]) (match mysymbol which doesn't have an alphanumeric character before or after it). Note that using this the match won't include the characters before/after mysymbol.
I recommend using https://regex101.com/ to play around with regular expressions. It lists all the different constructs you can use.

c++11 (MSVS2012) regex looking for file names in multiple line std::string

I have been trying to search for a clear answer on this one, but not been able to find it.
So lets say I have the string (where \n could be \r\n - I want to handle both - not sure if that is relevant or not)
"4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54"
Then I want to get matches:
a_file_123.xml
a_file_j34.xml
Here is my test code:
const str::string s = "4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54";
std::smatch matches;
if (std::regex_search(s, matches, std::regex("a_file_(.*)\\.xml")))
{
std::cout << "total: " << matches.size() << std::endl;
for (unsigned int i = 0; i < matches.size(); i++)
{
std::cout << "match: " << matches[i] << std::endl;
}
}
Output is:
total: 2
match: a_file_123.xml
match: 123
I don't quite understand why match 2 is just "123"...
You only have one match, not two, as the regex_search method returns a single match. What you printed is two group values, Group 0 (the whole match, a_file_123.xml here) and Group 1 (the capturing group value, here, 123 that is a substring captured with a capturing group you defined as (.*) in the pattern).
If you want to match multiple strings, you need to use the regex iterator, not just a regex_search that only returns the first match.
Besides, .* is too greedy and will return weird results if you have more than 1 match on the same line. It seems you want to match letter or digits, so .* can be replaced with \w+. Well, if there can really be anything, just use .*?.
Use
const std::string s = "4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54";
const std::regex rx("a_file_\\w+\\.xml");
std::vector<std::string> results(std::sregex_token_iterator(s.begin(), s.end(), rx),
std::sregex_token_iterator());
std::cout << "Number of matches: " << results.size() << std::endl;
for (auto result : results)
{
std::cout << result << std::endl;
}
See the C++ demo yielding
Number of matches: 2
a_file_123.xml
a_file_j34.xml
Notes on regex
a_file_ - a literal substring
\\w+ - 1+ word chars (letters, digits, _) (note you may use [^.]*? here instead of \\w+ if you want to match any char, 0 or more repetitions, as few as possible, up to the first .xml)
\\. - a dot (if you do not escape it, it will match any char except line break chars)
xml - a literal substring.
See the regex demo